This guide examines three PHP HTML parsing techniques and compares their strengths and differences:
- Parsing HTML with PHP
- Why Parse HTML in PHP?
- Prerequisites
- HTML Retrieval in PHP
- HTML Parsing in PHP: 3 Approaches
- Parsing HTML in PHP: Comparison Table
- Conclusion
HTML Parsing in PHP involves converting HTML content into its DOM (Document Object Model) structure. Once in the DOM format, you can easily navigate and manipulate the HTML content.
In particular, the top reasons to parse HTML in PHP are:
- Data extraction: Retrieve specific content from web pages, including text or attributes from HTML elements.
- Automation: Streamline tasks such as content scraping, reporting, and data aggregation from HTML.
- Server-side HTML handling: Parse and manipulate HTML to clean, format, or modify web content before rendering it in your application.
Before you start coding, make sure you have PHP 8.4+ installed on your machine. You can verify this by running the following command:
php -v
The output should look something like this:
PHP 8.4.3 (cli) (built: Jan 19 2025 14:20:58) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.4.3, Copyright (c) Zend Technologies
with Zend OPcache v8.4.3, Copyright (c), by Zend Technologies
Next, initialize a Composer project to make dependency management easier. If Composer is not installed on your system, download it and follow the installation instructions.
First, create a new folder for your PHP HTML project:
mkdir php-html-parser
Navigate to the folder in your terminal and initialize a Composer project inside it using the composer init command:
composer init
During this process, you'll be asked a few questions. The default answers are sufficient, but you can provide more specific details to tailor the setup for your PHP HTML parsing project if needed.
Next, open the project folder in your favorite IDE. Visual Studio Code with the PHP extension or JetBrains PhpStorm are good choices for PHP development.
Now, add an empty index.php file to the project folder. Your project structure should now look like this:
php-html-parser/
├── vendor/
├── composer.json
└── index.php
Open index.php and add the following code to initialize your project:
<?php
require_once __DIR__ . "/vendor/autoload.php";
// scraping logic...
Run your script with this command:
php index.php
Before parsing HTML in PHP, you need some HTML to parse. In this section, we will see two different approaches to accessing HTML content in PHP. We also suggest reading our guide on web scraping with PHP.
PHP natively supports cURL, a popular HTTP client used to perform HTTP requests. Enable the cURL extension or install it on Ubuntu Linux with:
sudo apt-get install php8.4-curl
You can use cURL to send an HTTP GET request to an online server and retrieve the HTML document returned by the server. This example script makes a simple GET request and retrieves HTML content:
// initialize cURL session
$ch = curl_init();
// set the URL you want to make a GET request to
curl_setopt($ch, CURLOPT_URL, "https://www.scrapethissite.com/pages/forms/?per_page=100");
// return the response instead of outputting it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// execute the cURL request and store the result in $html
$html = curl_exec($ch);
// close the cURL session
curl_close($ch);
// output the HTML response
echo $html;
Add the above code snippet to index.php and launch it. It will produce the following HTML code:
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping</title>
<link rel="icon" type="image/png" href="/static/images/scraper-icon.png" />
<!-- Omitted for brevity... -->
</html>
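The request above does not check whether the transfer succeeded. As a minimal sketch, assuming a hypothetical `fetch_html()` helper (the function name, the timeout value, and the status check are illustrative choices, not part of the original script), error handling could look like this:

```php
<?php
// Hypothetical helper (not part of the original guide): fetch a page with cURL
// and throw an exception instead of silently returning false.
function fetch_html(string $url, int $timeout_seconds = 10): string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // return the response instead of outputting it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // follow HTTP redirects
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // give up after the given number of seconds (assumed default: 10)
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout_seconds);

    $html = curl_exec($ch);
    if ($html === false) {
        // curl_exec() returns false on transport-level failures
        throw new RuntimeException("cURL error: " . curl_error($ch));
    }

    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($status >= 400) {
        throw new RuntimeException("Unexpected HTTP status: " . $status);
    }

    // the handle is released automatically when $ch goes out of scope
    return $html;
}
```

You could then call `fetch_html("https://www.scrapethissite.com/pages/forms/?per_page=100")` inside a `try`/`catch` block and only move on to parsing when no exception is thrown.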
Let's assume you have a file named index.html that contains the HTML of the “Hockey Teams” page from Scrape This Site, which was previously retrieved using cURL. You can read its content into a string with PHP's built-in file_get_contents() function.
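A minimal sketch of loading that file, assuming a hypothetical `load_html_file()` helper (the name and the error handling are illustrative and not taken from the original guide):

```php
<?php
// Hypothetical helper: read a local HTML file into a string,
// failing loudly when the file is missing or unreadable.
function load_html_file(string $path): string
{
    if (!is_readable($path)) {
        throw new RuntimeException("Cannot read HTML file: " . $path);
    }

    // file_get_contents() returns false on failure
    $html = file_get_contents($path);
    if ($html === false) {
        throw new RuntimeException("Failed to load HTML file: " . $path);
    }

    return $html;
}
```

With the file from this section in place, `$html = load_html_file("./index.html");` gives you the same string you would get from the cURL request.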
This section explains how to use three different approaches to parse HTML in PHP:
- Using Dom\HTMLDocument for vanilla PHP
- Using the Simple HTML DOM Parser library
- Using Symfony’s DomCrawler component
In all three cases, you parse the HTML from the local index.html file to select all hockey team entries on the page and extract data from them.
The final result will be a list of scraped hockey team entries containing the following details:
- Team Name
- Year
- Wins
- Losses
- Win %
- Goals For (GF)
- Goals Against (GA)
- Goal Difference
You can extract them from the rows of the HTML table on the page. Each column in a table row has a specific class, allowing you to extract data by selecting elements with their class as a CSS selector and retrieving their text content.
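As a rough illustration of that structure, here is a small fixture reconstructed from the class names used by the selectors later in this guide. The header labels and the `ot-losses` class are assumptions, and the live page may carry extra attributes or classes:

```php
<?php
// Illustrative fixture, reconstructed from the selectors used in this guide.
// Assumptions: the header labels and the "ot-losses" class name.
$sample_html = <<<HTML
<table>
  <tr>
    <th>Team Name</th><th>Year</th><th>Wins</th><th>Losses</th><th>OT Losses</th>
    <th>Win %</th><th>Goals For (GF)</th><th>Goals Against (GA)</th><th>Goal Difference</th>
  </tr>
  <tr class="team">
    <td class="name">Boston Bruins</td>
    <td class="year">1990</td>
    <td class="wins">44</td>
    <td class="losses">24</td>
    <td class="ot-losses"></td>
    <td class="pct">0.55</td>
    <td class="gf">299</td>
    <td class="ga">264</td>
    <td class="diff">35</td>
  </tr>
</table>
HTML;

// print the fixture so you can compare it with the real page source
echo $sample_html . "\n";
```

A fixture like this is handy for trying out the parsing snippets without hitting the network.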
PHP 8.4+ comes with a built-in Dom\HTMLDocument class. This represents an HTML document and allows you to parse HTML content and navigate the DOM tree.
Dom\HTMLDocument is part of PHP's DOM extension rather than an external library. Still, you need to enable the DOM extension or install it with this Linux command to use it:
sudo apt-get install php-dom
You can parse an HTML string as below:
$dom = \Dom\HTMLDocument::createFromString($html);
You can parse the index.html file with:
$dom = \Dom\HTMLDocument::createFromFile("./index.html");
$dom is a Dom\HTMLDocument object that exposes the methods you need for data parsing.
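Real-world HTML is often incomplete, and the HTML5 parser may emit warnings for it (for example, when the doctype is missing). A minimal sketch, using an inline fragment invented for illustration, of passing the `LIBXML_NOERROR` option to suppress those warnings:

```php
<?php
// Illustrative fragment (not from the target page): no doctype, no <html> or <body>.
$fragment = '<ul><li class="team">Boston Bruins</li><li class="team">Detroit Red Wings</li></ul>';

// LIBXML_NOERROR asks the parser not to report markup errors as PHP warnings.
$dom = \Dom\HTMLDocument::createFromString($fragment, LIBXML_NOERROR);

// The parser still builds a full DOM tree around the fragment.
$items = $dom->getElementsByTagName("li");

echo $items->length . "\n";
echo $items->item(0)->textContent . "\n";
```

The same option can be passed as the second argument to `createFromFile()`.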
You can select all hockey team entries using Dom\HTMLDocument with the following approach:
// select each row on the page
$table = $dom->getElementsByTagName("table")->item(0);
$rows = $table->getElementsByTagName("tr");
// iterate through each row and extract data
foreach ($rows as $row) {
    $cells = $row->getElementsByTagName("td");
    // skip rows that lack the expected <td> cells (for example, a header row made of <th> cells)
    if ($cells->length < 9) {
        continue;
    }
    // extract the data from each column (the cell at index 4 is not used)
    $team = trim($cells->item(0)->textContent);
    $year = trim($cells->item(1)->textContent);
    $wins = trim($cells->item(2)->textContent);
    $losses = trim($cells->item(3)->textContent);
    $win_pct = trim($cells->item(5)->textContent);
    $goals_for = trim($cells->item(6)->textContent);
    $goals_against = trim($cells->item(7)->textContent);
    $goal_diff = trim($cells->item(8)->textContent);
    // create an array for the scraped team data
    $team_data = [
        "team" => $team,
        "year" => $year,
        "wins" => $wins,
        "losses" => $losses,
        "win_pct" => $win_pct,
        "goals_for" => $goals_for,
        "goals_against" => $goals_against,
        "goal_diff" => $goal_diff
    ];
    // print the scraped team data
    print_r($team_data);
    print("\n");
}
The example above relies on getElementsByTagName() and manual iteration, selecting cells by their position in each row. Note that, in PHP 8.4, Dom\HTMLDocument also supports CSS selectors through querySelector() and querySelectorAll().
Here is a breakdown of the methods used:
- getElementsByTagName(): Retrieve all elements of a given tag (like <table>, <tr>, or <td>) within the document.
- item(): Return an individual element from a list of elements returned by getElementsByTagName().
- textContent: This property gives the raw text content of an element, allowing you to extract the visible data (like the team name, year, etc.).
We also used trim() to remove extra whitespace before and after the text content for cleaner data.
When added to index.php, the above snippet will produce this result:
Array
(
[team] => Boston Bruins
[year] => 1990
[wins] => 44
[losses] => 24
[win_pct] => 0.55
[goals_for] => 299
[goals_against] => 264
[goal_diff] => 35
)
// omitted for brevity...
Array
(
[team] => Detroit Red Wings
[year] => 1994
[wins] => 33
[losses] => 11
[win_pct] => 0.688
[goals_for] => 180
[goals_against] => 117
[goal_diff] => 63
)
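As an alternative to positional access, the `querySelectorAll()` and `querySelector()` methods available on PHP 8.4's `Dom\HTMLDocument` and its elements let you target cells by class. The following is a minimal sketch on an inline fixture (the fixture is a shortened, illustrative version of the table and only covers three of the fields):

```php
<?php
// Shortened, illustrative fixture: only three of the columns are included.
$html = <<<HTML
<!doctype html>
<html><body>
<table>
  <tr><th>Team Name</th><th>Year</th><th>Wins</th></tr>
  <tr class="team">
    <td class="name"> Boston Bruins </td>
    <td class="year">1990</td>
    <td class="wins">44</td>
  </tr>
</table>
</body></html>
HTML;

$dom = \Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);

$teams = [];
// "tr.team" does not match the header row, so no manual skipping is needed
foreach ($dom->querySelectorAll("table tr.team") as $row) {
    $teams[] = [
        // the null-safe operator guards against a missing cell
        "team" => trim($row->querySelector(".name")?->textContent ?? ""),
        "year" => trim($row->querySelector(".year")?->textContent ?? ""),
        "wins" => trim($row->querySelector(".wins")?->textContent ?? ""),
    ];
}

print_r($teams);
```

Selecting by class keeps the extraction logic stable if columns are reordered, at the cost of depending on the page's class names.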
Simple HTML DOM Parser is a lightweight PHP library that makes it easy to parse and manipulate HTML content.
You can install Simple HTML DOM Parser via Composer with this command:
composer require voku/simple_html_dom
Alternatively, you can manually download and include the simple_html_dom.php file in your project.
Then, import it in index.php with this line of code:
use voku\helper\HtmlDomParser;
To parse an HTML string, use the str_get_html() method:
$dom = HtmlDomParser::str_get_html($html);
To parse index.html, use file_get_html() instead:
$dom = HtmlDomParser::file_get_html("./index.html");
This will load the HTML content into a $dom object, which allows you to navigate the DOM easily.
Extract the hockey team data from the HTML using Simple HTML DOM Parser:
// find all rows in the table
$rows = $dom->findMulti("table tr.team");
// loop through each row to extract the data
foreach ($rows as $row) {
    // extract data using CSS selectors
    $team_element = $row->findOne(".name");
    $team = trim($team_element->plaintext);
    $year_element = $row->findOne(".year");
    $year = trim($year_element->plaintext);
    $wins_element = $row->findOne(".wins");
    $wins = trim($wins_element->plaintext);
    $losses_element = $row->findOne(".losses");
    $losses = trim($losses_element->plaintext);
    $win_pct_element = $row->findOne(".pct");
    $win_pct = trim($win_pct_element->plaintext);
    $goals_for_element = $row->findOne(".gf");
    $goals_for = trim($goals_for_element->plaintext);
    $goals_against_element = $row->findOne(".ga");
    $goals_against = trim($goals_against_element->plaintext);
    $goal_diff_element = $row->findOne(".diff");
    $goal_diff = trim($goal_diff_element->plaintext);
    // create an array with the extracted team data
    $team_data = [
        "team" => $team,
        "year" => $year,
        "wins" => $wins,
        "losses" => $losses,
        "win_pct" => $win_pct,
        "goals_for" => $goals_for,
        "goals_against" => $goals_against,
        "goal_diff" => $goal_diff
    ];
    // print the scraped team data
    print_r($team_data);
    print("\n");
}
The Simple HTML DOM Parser features used above are:
- findMulti(): Select all elements identified by the given CSS selector.
- findOne(): Locate the first element matching the given CSS selector.
- plaintext: A property to get the raw text content inside an HTML element.
This time, each cell is selected by its CSS class rather than by its position in the row, which makes the logic more robust to changes in column order. However, the result remains the same as in the initial PHP HTML parsing approach.
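Because every field follows the same select-then-trim pattern, the extraction can be driven by a map of output keys to CSS selectors. The sketch below uses only the calls already shown in this section (`str_get_html()`, `findMulti()`, `findOne()`, `plaintext`). The `$fields` map and the inline fixture are illustrative assumptions, and the script expects `voku/simple_html_dom` to be installed via Composer:

```php
<?php
require_once __DIR__ . "/vendor/autoload.php";

use voku\helper\HtmlDomParser;

// Shortened, illustrative fixture with three of the columns.
$html = <<<HTML
<table>
  <tr class="team">
    <td class="name"> Boston Bruins </td>
    <td class="year">1990</td>
    <td class="wins">44</td>
  </tr>
</table>
HTML;

// Hypothetical map: output key => CSS selector of the cell inside a row.
$fields = [
    "team" => ".name",
    "year" => ".year",
    "wins" => ".wins",
];

$dom = HtmlDomParser::str_get_html($html);

$teams = [];
foreach ($dom->findMulti("table tr.team") as $row) {
    $team_data = [];
    foreach ($fields as $key => $selector) {
        // same select-then-trim step as in the longer example
        $team_data[$key] = trim($row->findOne($selector)->plaintext);
    }
    $teams[] = $team_data;
}

print_r($teams);
```

Adding a column then only requires a new entry in `$fields`.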
Symfony’s DomCrawler component provides an easy way to parse HTML documents and extract data from them.
Note: The component is part of the Symfony framework but can also be used standalone, as we will do in this section.
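Install Symfony’s DomCrawler component with the Composer command below. The filter() method used later in this section takes CSS selectors, which also requires the symfony/css-selector package:
composer require symfony/dom-crawler symfony/css-selector
Then, import it in the index.php file: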
use Symfony\Component\DomCrawler\Crawler;
To parse an HTML string, pass it to the Crawler constructor:
$crawler = new Crawler($html);
For parsing a file, use file_get_contents() and create the Crawler instance:
$crawler = new Crawler(file_get_contents("./index.html"));
The above lines will load the HTML content into the $crawler object, which provides easy methods to traverse and extract data.
Extract the hockey team data using the DomCrawler component:
// select all rows within the table
$rows = $crawler->filter("table tr.team");
// loop through each row to extract the data
$rows->each(function ($row, $i) {
    // extract data using CSS selectors
    $team_element = $row->filter(".name");
    $team = trim($team_element->text());
    $year_element = $row->filter(".year");
    $year = trim($year_element->text());
    $wins_element = $row->filter(".wins");
    $wins = trim($wins_element->text());
    $losses_element = $row->filter(".losses");
    $losses = trim($losses_element->text());
    $win_pct_element = $row->filter(".pct");
    $win_pct = trim($win_pct_element->text());
    $goals_for_element = $row->filter(".gf");
    $goals_for = trim($goals_for_element->text());
    $goals_against_element = $row->filter(".ga");
    $goals_against = trim($goals_against_element->text());
    $goal_diff_element = $row->filter(".diff");
    $goal_diff = trim($goal_diff_element->text());
    // create an array with the extracted team data
    $team_data = [
        "team" => $team,
        "year" => $year,
        "wins" => $wins,
        "losses" => $losses,
        "win_pct" => $win_pct,
        "goals_for" => $goals_for,
        "goals_against" => $goals_against,
        "goal_diff" => $goal_diff
    ];
    // print the scraped team data
    print_r($team_data);
    print("\n");
});
The DomCrawler methods used are:
- each(): To iterate over a list of selected elements.
- filter(): Select elements based on CSS selectors.
- text(): Extract the text content of the selected elements.
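Two details of these methods can shorten the extraction: `each()` returns an array with the values returned by the callback, and `text()` accepts a default value that is returned when the selection is empty. A minimal sketch on an illustrative inline fixture, assuming both `symfony/dom-crawler` and `symfony/css-selector` are installed via Composer:

```php
<?php
require_once __DIR__ . "/vendor/autoload.php";

use Symfony\Component\DomCrawler\Crawler;

// Shortened, illustrative fixture: the ".losses" cell is deliberately missing.
$html = <<<HTML
<!doctype html>
<html><body>
<table>
  <tr class="team">
    <td class="name"> Boston Bruins </td>
    <td class="year">1990</td>
    <td class="wins">44</td>
  </tr>
</table>
</body></html>
HTML;

$crawler = new Crawler($html);

// each() collects whatever the callback returns into an array
$teams = $crawler->filter("table tr.team")->each(function (Crawler $row): array {
    return [
        // text("") falls back to an empty string instead of throwing
        // when the selector matches nothing
        "team" => $row->filter(".name")->text(""),
        "year" => $row->filter(".year")->text(""),
        "wins" => $row->filter(".wins")->text(""),
        "losses" => $row->filter(".losses")->text(""),
    ];
});

print_r($teams);
```

Collecting the rows into `$teams` instead of printing them inside the callback makes it easier to export or post-process the data afterwards.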
You can compare the three approaches to parsing HTML in PHP explored here in the summary table below:
Feature | Dom\HTMLDocument | Simple HTML DOM Parser | Symfony’s DomCrawler |
---|---|---|---|
Type | Native PHP component | External Library | Symfony Component |
GitHub Stars | — | 880+ | 4,000+ |
XPath Support | ✔️ (via Dom\XPath) | ✔️ | ✔️ |
CSS Selector Support | ✔️ (querySelector() and querySelectorAll()) | ✔️ | ✔️ |
Learning Curve | Low | Low to Medium | Medium |
Simplicity of Use | Medium | High | High |
API | Basic | Rich | Rich |
While these solutions work, they won’t be effective if the target web pages rely on JavaScript for rendering. In such cases, simple HTML parsing approaches like those above won’t suffice. Instead, you'll need a fully-featured scraping browser with advanced HTML parsing capabilities, such as Scraping Browser.
If you want to bypass HTML parsing and access structured data instantly, explore our ready-to-use datasets, covering hundreds of websites!
Create a Bright Data account today and start testing our data and scraping solutions with a free trial!