Web Scraping in PHP, Part 1

Manvir Singh
May 17, 2020

As a web developer or tester, you have probably heard about web scraping and web crawling. Even your best friend (Google) uses web crawling to bring you search results. However, it is not a straightforward process; a lot of work happens behind the scenes.

Before starting the implementation, one must understand what web scraping is and why we need it.

What is Web Scraping?

It is the process of fetching and extracting data from websites using a script (this script can be thought of as a robot). Fetching is the process of downloading a page's contents (the HTML your browser renders on screen). Once fetching is complete, extraction takes place. By extraction we mean pulling out an email address, a data table, an image, or anything else available on the web page.

Why do we need it?

There are different scenarios in which you may need to scrape a website. Let's consider a simple one:
a news website that does not provide an API to access its recent or trending news. You can crawl this website and extract that data yourself.

In other words, you need data from a source in a format that the source does not directly provide.

Tools Required

  1. A library or function to fetch a web page (e.g. cURL, Axios).
  2. A library to extract specific data from the downloaded page (e.g. regex, Symfony DomCrawler).

We will use PHP Curl Class to send HTTP requests and fetch the web page. For data extraction we are going to use Symfony's DomCrawler.

Let’s start the implementation.

  1. In your project’s directory, run
composer require php-curl-class/php-curl-class

Note: You must have Composer installed on your system. You can get it from getcomposer.org.

2. Next, install DOM Crawler by running

composer require symfony/dom-crawler

[Image: installation output for DomCrawler]

Take a look at the line "symfony/dom-crawler suggests installing symfony/css-selector" in the output.
You also have to install Symfony's CSS Selector, as DomCrawler makes use of CSS selectors to extract data from nodes.

3. Install Symfony CSS Selector.

composer require symfony/css-selector

So, everything is set up and we are done with the required installations. Now let's get our hands dirty. In this tutorial we will fetch GitHub's trending page and extract the trending repositories.
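A minimal sketch of this step is shown below. The trending-page URL and the error handling are assumptions on my part, so adapt them to your target page.

<?php

use Curl\Curl;
use Symfony\Component\DomCrawler\Crawler;

// Load the libraries installed with Composer.
require 'vendor/autoload.php';

// Send a GET request to GitHub's trending page using PHP Curl Class.
$curl = new Curl();
$curl->get('https://github.com/trending');

if ($curl->error) {
    die('Request failed: ' . $curl->errorMessage . PHP_EOL);
}

// $curl->response holds the raw HTML of the page. Wrap it in a
// Crawler instance so we can query it with CSS selectors.
$crawler = new Crawler($curl->response);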

In the above code, we first load the autoload.php file, which makes the installed libraries available in your code. After that, we send a GET request to the page, which returns the source code of the GitHub page. We then pass this response to the Crawler class. The Crawler instance represents a set of DOMElement objects.

Let’s move to data extraction.

Suppose we want to extract the title of the first repository:

$crawler->filter('article.Box-row > h1 > a')->text();

To get the title of the second repository:

$crawler->filter('article.Box-row:nth-child(2) > h1 > a')->text();

You just need to write a CSS selector that reaches the element.

What if we want to extract the title, link, and number of stars for all repositories?

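One way is to filter every repository row and iterate over it with DomCrawler's each() method. The sketch below continues the script above; the selectors, especially the one for the star count, are assumptions based on GitHub's trending-page markup and may need adjusting if the markup changes.

// Iterate over every repository row and collect its title, link and star count.
$repositories = $crawler->filter('article.Box-row')->each(function (Crawler $node) {
    $anchor = $node->filter('h1 > a');

    return [
        'title' => trim($anchor->text()),
        'link'  => 'https://github.com' . $anchor->attr('href'),
        // The star count links to the repository's stargazers page (assumed selector).
        'stars' => trim($node->filter('a[href$="/stargazers"]')->text()),
    ];
});

print_r($repositories);

Here print_r() is only used to show the result; in a real script you would store or return $repositories.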

That's all.
In this tutorial you learned the basic techniques for extracting data using CSS selectors. You can read more about DomCrawler and its features in the Symfony documentation.

You can see the full code in the GitHub repo.

Happy Coding :-)
