@anchetaWern
Created August 4, 2013 13:18
web scraping in php

Have you ever wanted to get specific data from another website, only to find there's no API available for it? That's where web scraping comes in: if the data is not made available through an API, we can just scrape it from the website itself.

But before we dive in let us first define what web scraping is. According to Wikipedia:

{% blockquote %} Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox. {% endblockquote %}

So yes, web scraping lets us extract information from websites. But there are some legal issues around web scraping. Some consider it an act of trespassing on the website that you are scraping the data from. That's why it is wise to read the terms of service of the specific website that you want to scrape, because you might be doing something illegal without knowing it. You can read more about it in this Wikipedia page.

##Web Scraping Techniques

There are many techniques in web scraping as mentioned in the Wikipedia page earlier. But I will only discuss the following:

  • Document Parsing
  • Regular Expressions

###Document Parsing

Document parsing is the process of converting HTML into a DOM (Document Object Model) that we can traverse. Here's an example of how we can scrape data from a public website:

<?php
$html = file_get_contents('http://pokemondb.net/evolution'); //get the html returned from the following url

$pokemon_doc = new DOMDocument();

libxml_use_internal_errors(TRUE); //disable libxml errors

if(!empty($html)){ //if any html is actually returned

	$pokemon_doc->loadHTML($html);
	libxml_clear_errors(); //remove errors for yucky html
	
	$pokemon_xpath = new DOMXPath($pokemon_doc);

	//get all the h2's with an id
	$pokemon_row = $pokemon_xpath->query('//h2[@id]');

	if($pokemon_row->length > 0){
		foreach($pokemon_row as $row){
			echo $row->nodeValue . "<br/>";
		}
	}
}
?>

What we did with the code above was to get the html returned from the url of the website that we want to scrape. In this case the website is pokemondb.net.

<?php
$html = file_get_contents('http://pokemondb.net/evolution'); 
?>

Then we declare a new DOMDocument. This is used for converting the html string returned from file_get_contents into an actual Document Object Model which we can traverse:

<?php
$pokemon_doc = new DOMDocument();
?>

Then we tell libxml to handle errors internally so that they won't be output on the screen. Instead they will be buffered and stored:

<?php
libxml_use_internal_errors(TRUE); //disable libxml errors
?>

Next we check if any html has actually been returned:

<?php
if(!empty($html)){ //if any html is actually returned
}
?>
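One thing to keep in mind is that file_get_contents returns false when the request fails, and empty() treats that the same way as an empty string. Here's a minimal sketch of wrapping the fetch in a small helper. The function name fetch_html, the 10 second timeout and the user agent string are all assumptions made for illustration; they are not part of the original script:

```php
<?php
/**
 * fetch_html() is a hypothetical helper, not a built-in PHP function.
 * It returns the response body as a string, or null if nothing usable came back.
 */
function fetch_html(string $url, int $timeout_seconds = 10): ?string
{
    // The 'http' options only apply to http(s) URLs; the values are placeholders.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeout_seconds,
            'user_agent' => 'example-scraper/0.1',
        ],
    ]);

    // @ suppresses the warning that file_get_contents raises on failure,
    // because we check the return value ourselves.
    $html = @file_get_contents($url, false, $context);

    if ($html === false || trim($html) === '') {
        return null;
    }

    return $html;
}
```

With a helper like this, the check can be written as `$html = fetch_html('http://pokemondb.net/evolution');` followed by `if($html !== null){ ... }`, which makes the failure case explicit.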

Next we use the loadHTML() method of the DOMDocument instance that we created earlier to load the html that was returned. Simply pass the html string as the argument:

<?php
$pokemon_doc->loadHTML($html);
?>
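A side note on character encoding: as far as I know, loadHTML() does not assume UTF-8 when the markup doesn't declare a charset, so non-ASCII text can come out garbled. Here's a small sketch of one common workaround, which is to prepend a meta tag that declares the charset. The sample markup and the variable names are made up for illustration:

```php
<?php
// Hypothetical UTF-8 snippet; "\xC3\xA9" is the UTF-8 byte sequence for an e with an acute accent.
$utf8_html = "<p>Pok\xC3\xA9mon</p>";

// Assumption: the page itself has no charset declaration, so we add one in front.
$charset_hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

libxml_use_internal_errors(true); // buffer parser errors instead of printing them

$doc = new DOMDocument();
$doc->loadHTML($charset_hint . $utf8_html);
libxml_clear_errors();

// DOM strings are handed back as UTF-8
$text = $doc->getElementsByTagName('p')->item(0)->nodeValue;
echo $text . "\n";
```

If the page you are scraping already declares its charset in the head, this step is most likely unnecessary.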

Then we clear the errors if any. Most of the time yucky html causes these errors. Examples of yucky html are unclosed or mismatched tags, unescaped ampersands, and elements or attributes that the parser doesn't recognize. Depending on the libxml version, that can include newer HTML5 elements, since its HTML parser was written against older HTML specifications.

<?php
libxml_clear_errors(); //remove errors for yucky html
?>
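If you're curious about what was buffered, you can look at the errors before clearing them. This is just a sketch using a made-up snippet of broken html; which messages you get (if any) depends on the libxml version installed:

```php
<?php
// Hypothetical malformed snippet, used only for illustration.
$broken_html = '<div><p>Unclosed paragraph<section>An HTML5 element</section></div>';

libxml_use_internal_errors(true); // buffer errors instead of printing them

$doc = new DOMDocument();
$doc->loadHTML($broken_html);

// libxml_get_errors() returns an array of LibXMLError objects
$errors = libxml_get_errors();

foreach ($errors as $error) {
    echo trim($error->message) . ' (line ' . $error->line . ")\n";
}

libxml_clear_errors(); // empty the buffer once we're done with it
```

The document is still loaded even when errors are reported, which is why the scraper can simply clear them and move on.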

Next we declare a new instance of DOMXPath. This allows us to run queries against the DOM Document that we created. It requires an instance of the DOM Document as its argument.

<?php
$pokemon_xpath = new DOMXPath($pokemon_doc);
?>

Finally, we simply write the query for the specific elements that we want to get. If you have used jQuery before then this process is similar to what you do when you select elements from the DOM. What we're selecting here is all the h2 tags which have an id. We make the location of the h2 unspecific by using double slashes // right before the element that we want to select. The value of the id doesn't matter either: as long as there's an id, the element gets selected. The nodeValue property contains the text inside the h2 that was selected.

<?php
//get all the h2's with an id
$pokemon_row = $pokemon_xpath->query('//h2[@id]');

if($pokemon_row->length > 0){
	foreach($pokemon_row as $row){
		echo $row->nodeValue . "<br/>";
	}
}
?>
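To see how the [@id] part of the query behaves without hitting a real website, here's a small sketch that runs the same query against an inline html string. The sample markup and the variable names are made up for illustration:

```php
<?php
// Hypothetical markup: two h2's with an id and one without.
$sample_html = '<html><body>
<h2 id="gen1">Generation 1</h2>
<h2>No id here</h2>
<div><h2 id="gen2">Generation 2</h2></div>
</body></html>';

libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($sample_html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$headings = array();

// //h2[@id] matches h2's at any depth, but only those that have an id attribute
foreach ($xpath->query('//h2[@id]') as $h2) {
    // getAttribute() reads the value of the id, nodeValue reads the text
    $headings[$h2->getAttribute('id')] = trim($h2->nodeValue);
}

print_r($headings);
```

The h2 without an id is skipped, and the one nested inside the div is still picked up because of the double slashes.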

This results in the following text printed out on the screen:

Generation 1 - Red, Blue, Yellow
Generation 2 - Gold, Silver, Crystal
Generation 3 - Ruby, Sapphire, Emerald
Generation 4 - Diamond, Pearl, Platinum
Generation 5 - Black, White, Black 2, White 2

Let's do one more example with document parsing before we move on to regular expressions. This time we're going to get a list of all the pokemon along with their specific types (e.g. Fire, Grass, Water).

First let's examine what we have on pokemondb.net/evolution so that we know what particular element to query.

[Screenshot: checking]

As you can see from the screenshot, the information that we want to get is contained within a span element with a class of "infocard-tall ". Yes, the trailing space there is included. An XPath test such as @class="..." compares the whole attribute value exactly, so any spaces that are present in the markup must also be in the query, otherwise nothing will match.

Converting what we know into actual query, we come up with this:

//span[@class="infocard-tall "]

This selects all the span elements which have a class of "infocard-tall ". It doesn't matter where in the document the span is because we used the double forward slash before the element name.
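Matching the exact class string is a bit fragile, since it breaks as soon as the site adds another class or drops the trailing space. A common XPath idiom for matching a single class name is to pad the normalized class value with spaces and use contains(). Here's a sketch that compares the two approaches on made-up markup:

```php
<?php
// Hypothetical markup: exact class with trailing space, an extra class, and a lookalike class.
$sample_html = '<div>
<span class="infocard-tall ">A</span>
<span class="infocard-tall featured">B</span>
<span class="infocard-taller">C</span>
</div>';

libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($sample_html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Exact string comparison: only the span whose class is exactly "infocard-tall "
$exact_matches = $xpath->query('//span[@class="infocard-tall "]');

// Class-name match: any span that has infocard-tall as one of its classes
$token_matches = $xpath->query(
    '//span[contains(concat(" ", normalize-space(@class), " "), " infocard-tall ")]'
);

echo $exact_matches->length . " exact, " . $token_matches->length . " by class name\n";
```

The padding with spaces is what keeps infocard-taller from being matched by accident. This is also one way to select an element that has multiple classes.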

Once we're inside the span we have to get to the elements which directly contain the data that we want: the name and the types of the pokemon. As you can see from the screenshot below, the name of the pokemon is directly contained within an anchor element with a class of ent-name, and the types are stored within a small element with a class of aside.

[Screenshot: info card]

We can then use that knowledge to come up with the following code:

<?php
$pokemon_list = array();

$pokemon_and_type = $pokemon_xpath->query('//span[@class="infocard-tall "]');

if($pokemon_and_type->length > 0){	
	
	//loop through all the pokemons
	foreach($pokemon_and_type as $pat){
		
		//get the name of the pokemon
		$name = $pokemon_xpath->query('a[@class="ent-name"]', $pat)->item(0)->nodeValue;
		
		$pkmn_types = array(); //reset $pkmn_types for each pokemon
		$types = $pokemon_xpath->query('small[@class="aside"]/a', $pat);

		//loop through all the types and store them in the $pkmn_types array
		foreach($types as $type){
			$pkmn_types[] = $type->nodeValue; //the pokemon type
		}

		//store the data in the $pokemon_list array
		$pokemon_list[] = array('name' => $name, 'types' => $pkmn_types);
		
	}
}

//output what we have
echo "<pre>";
print_r($pokemon_list);
echo "</pre>";
?>

There's nothing new in the code above except for using query inside the foreach loop. We use the line below to get the name of the pokemon. You might notice that we specified a second argument when we called the query method. The second argument is the current row, which serves as the context node of the query. This means that we're limiting a relative query to the contents of the current row.

<?php
$name = $pokemon_xpath->query('a[@class="ent-name"]', $pat)->item(0)->nodeValue;
?>
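Two things are worth knowing about the context node. First, it only limits relative queries: a query that starts with // still searches the whole document, so use .// if you want descendants of the current row. Second, item(0) returns null when nothing matches, so it's safer to check before reading nodeValue. Here's a sketch on made-up markup; the class name card is hypothetical:

```php
<?php
// Hypothetical markup: the second card has no name link.
$sample_html = '<div>
<span class="card"><a class="ent-name">Bulbasaur</a><small class="aside"><a>Grass</a><a>Poison</a></small></span>
<span class="card"><small class="aside"><a>Fire</a></small></span>
</div>';

libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($sample_html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$cards = $xpath->query('//span[@class="card"]');

$results = array();

foreach ($cards as $card) {
    // relative query: only direct child anchors of the current card
    $name_node = $xpath->query('a[@class="ent-name"]', $card)->item(0);
    $name = ($name_node !== null) ? $name_node->nodeValue : null;

    $types = array();
    foreach ($xpath->query('small[@class="aside"]/a', $card) as $type) {
        $types[] = $type->nodeValue;
    }

    $results[] = array('name' => $name, 'types' => $types);
}

$first_card = $cards->item(0);

// //a ignores the context node, .//a stays inside it
$all_anchors  = $xpath->query('//a', $first_card)->length;
$card_anchors = $xpath->query('.//a', $first_card)->length;

print_r($results);
```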

The results would be something like this:

Array
(
    [0] => Array
        (
            [name] => Bulbasaur
            [types] => Array
                (
                    [0] => Grass
                    [1] => Poison
                )
        )
    [1] => Array
        (
            [name] => Ivysaur
            [types] => Array
                (
                    [0] => Grass
                    [1] => Poison
                )
        )
    [2] => Array
        (
            [name] => Venusaur
            [types] => Array
                (
                    [0] => Grass
                    [1] => Poison
                )
        )

###Regular Expressions

##Web Scraping Tools

###Simple HTML Dom

To make web scraping easier you can use libraries such as Simple HTML DOM. Here's an example of getting the names of the pokemon using Simple HTML DOM:

<?php
$html = file_get_html('http://pokemondb.net/evolution');

foreach($html->find('a[class=ent-name]') as $element){
	echo $element->innertext . '<br>'; //outputs bulbasaur, ivysaur, etc...
} 
?>

The syntax is simpler, so you have less code to write, plus there are also some convenience functions and attributes which you can use. An example is the plaintext attribute, which extracts all the text from a web page:

<?php
echo file_get_html('http://pokemondb.net/evolution')->plaintext; 
?>
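For comparison, the built-in DOM extension can do something roughly similar without an extra library. In this sketch, textContent gives the text only, while passing a node to saveHTML() gives the markup of that node with its tags intact. The sample markup is made up, and this is only an approximation of what the Simple HTML DOM attributes do:

```php
<?php
// Hypothetical snippet with some inline markup.
$sample_html = '<p id="intro">Hello <b>bold</b> world</p>';

libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($sample_html);
libxml_clear_errors();

$paragraph = $doc->getElementsByTagName('p')->item(0);

// text only, tags are dropped
$text = $paragraph->textContent;

// markup of this node, tags are kept
$markup = $doc->saveHTML($paragraph);

echo $text . "\n";
echo $markup . "\n";
```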

###Ganon

##Scraping non-public parts of website

###Scraping Amazon

##Resources

@AlexCarlson

Nice work .. But what if i want to extract the data from two or more web pages ? .... at an instant of time....

@asadpk

asadpk commented Feb 13, 2015

Hi,
I need these detail in xls sheet .can you modify this script? in a way that i can get data in xls format in same way bellow.

Item name ASIN By 1st price 2nd price 3rd price Category

@prasadmunna

thanks for such a nice post..

@knaveenchand

Very well written.

@TheKetan2

suppose we have structure like : Main Link one->Sub Link->Sub Sub Link and we have to get info from all those links and come back to Main Link page and do the same thing with Main Link Two->Sub Link->Sub Sub Link ..how should we so that.

@saurabh-vijayvargiya

awesome dude, nice explanation along with the code.

@bhawnam193

very informative article but for the first example when i use different URL than the one listed it does not show anything and that url has h2 tags, tried changing url not for one but for umpteen number of URLs. Any idea why?

@verma-ashish

verma-ashish commented Jan 2, 2017

At present, scraping the data is coming as plain text not with existing html tags. So, is it possible to scrap the data with all html tags as well e.g. <a>, <b>, <i>?

@sebastian2609

sebastian2609 commented Feb 17, 2017

I need to get content from table how can i do this...? please help me as soon as posible..

WINNING NUMBERS
2017-02-15 (Wed) 3712/17
1ST 2ND 3RD

9411

3367

9162

@EdwinChua

EdwinChua commented Feb 23, 2017

Thanks! I managed to write my first web scraper for a local news website thanks to this article. :)

@verma-ashish have you tried query('//a'); ?

@sebastian2609 try query('//tr //td'); or something similar

@CCHFWBAN

CCHFWBAN commented Mar 3, 2017

Instead of echo -ing the values how can you update them as different rows in a MySQL table?

@CCHFWBAN

CCHFWBAN commented Mar 3, 2017

Or, is there a way to select one specific H2 from the array instead of just outputting all the H2's?

@Girish0406

that's a very nice explanation . but i want to know if i want to store data into mysql than how to do i m not getting can anyone help me out as soon as possible

@vishnu1991

vishnu1991 commented Apr 20, 2017

@CCHFWBAN
for outputting just the specific h2 value u can use the array index; like say ,if u want the third h2 then u can use

$scrap_row = $scrap_xpath->query('//h2');
echo "Get third H2 value: " . $scrap_row->item(2)->nodeValue . "<br>";

@Whip

Whip commented Jun 9, 2017

Any ideas about how to get contents of the page which requires login? Assuming you do have the login access for the website.

@manualvarado22

Thank you so much! This is amazing.

@anamahmed2012

That's what I was looking for.

@imran300

What if I have a div class="description" and it contains a ul with 5 li tags?
Now I want to extract the data from these li, but this ul doesn't have a class or id and there are hundreds of li on a single web page, so how are we going to extract this information?

@norcaljohnny

VeeK727 I would assume just like any site you can use the l/p in the url itself to gain access.
Example.. http://username:[email protected]/

@norcaljohnny

@imran300 you can try using the simple_html_dom.php and then included it in the php file.
As such.

find('li') as $element) echo $element ; ?>

Yes, it is literally that easy and will scrape the full details for each 'li'

@norcaljohnny

norcaljohnny commented Aug 25, 2017

looks like it got cropped. Testing in full once more. (removing opening and closing tags to post)

include_once 'simple_html_dom.php';
// Create DOM from URL
$html = file_get_html('https://www.example.com/');

// Find all list items
foreach($html->find('li') as $element)
echo $element ;

@kasabesiddhi

Awesome explanation

@hamzamumtaz007

I have an element with multiple classes how can i detect it using html document parsing?

@peterpilip

Great work bro..

@stephanoapiolaza

Nice Article

@EhabElzeny

very nice & i think there more ways for this thank you

@fahadhowlader

fahadhowlader commented Sep 12, 2018

Please have a look my HTML likes
<li>১, ২, ৩, ৪। <br>1, 2, 3, 4 </br><span>It's test.</span></li>
But i need to scraping only ১, ২, ৩, ৪ and
not need <br>1, 2, 3, 4 </br><span>It's test.</span>
how can it possible ?

@nootype

nootype commented Oct 27, 2020

wow cool written thank you very much. Can you tell me on what scrape data from website on this service https://finddatalab.com/how-to-scrape-data-from-a-website?? What language are they using?
