anchetaWern/php-webscraping.md

Created August 4, 2013 13:18

web scraping in php

Have you ever wanted to get specific data from another website, but there's no API available for it? That's where web scraping comes in: if the website doesn't make the data available through an API, we can scrape it from the pages themselves.

But before we dive in, let us first define what web scraping is. According to Wikipedia:

{% blockquote %} Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox. {% endblockquote %}

So yes, web scraping lets us extract information from websites. But there are some legal issues regarding web scraping. Some consider it an act of trespassing on the website that you are scraping the data from. That's why it is wise to read the terms of service of the specific website that you want to scrape, because you might be doing something illegal without knowing it. You can read more about it on this Wikipedia page.

##Web Scraping Techniques

There are many web scraping techniques, as mentioned on the Wikipedia page earlier, but I will only discuss the following:

- Document Parsing
- Regular Expressions

###Document Parsing

Document parsing is the process of converting HTML into a DOM (Document Object Model) that we can traverse. Here's an example of how we can scrape data from a public website:

<?php
$html = file_get_contents('http://pokemondb.net/evolution'); //get the html returned from the following url

$pokemon_doc = new DOMDocument();

libxml_use_internal_errors(TRUE); //disable libxml errors

if(!empty($html)){ //if any html is actually returned

	$pokemon_doc->loadHTML($html);
	libxml_clear_errors(); //remove errors for yucky html
	
	$pokemon_xpath = new DOMXPath($pokemon_doc);

	//get all the h2's with an id
	$pokemon_row = $pokemon_xpath->query('//h2[@id]');

	if($pokemon_row->length > 0){
		foreach($pokemon_row as $row){
			echo $row->nodeValue . "<br/>";
		}
	}
}
?>

The first thing the code above does is get the HTML returned from the URL of the website that we want to scrape. In this case the website is pokemondb.net.

<?php
$html = file_get_contents('http://pokemondb.net/evolution'); 
?>
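Note that file_get_contents() returns false when the request fails, so it can be worth wrapping the fetch. The following is only a minimal sketch of one way to do that: the helper name fetch_html, the timeout and the user agent string are assumptions made for illustration and are not part of the original example.

```php
<?php
// Hypothetical helper: returns the HTML as a string, or null if nothing came back.
function fetch_html(string $url, int $timeout = 10): ?string
{
    // Options for the http/https stream wrapper; other wrappers ignore them.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeout,              // seconds to wait before giving up
            'user_agent' => 'example-scraper/0.1', // assumed value, change as needed
        ],
    ]);

    // @ hides the PHP warning; failure is signalled by the false return value.
    $html = @file_get_contents($url, false, $context);

    return ($html === false || $html === '') ? null : $html;
}
```

With this in place, the fetch could be written as `$html = fetch_html('http://pokemondb.net/evolution');` followed by a check for null before parsing.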

Then we declare a new DOMDocument. This is used for converting the HTML string returned from file_get_contents into an actual Document Object Model which we can traverse:

<?php
$pokemon_doc = new DOMDocument();
?>

Then we tell libxml to handle errors internally so that they won't be output on the screen; instead they are buffered and stored:

<?php
libxml_use_internal_errors(TRUE); //disable libxml errors
?>
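Because the errors are buffered rather than discarded, they can still be inspected with libxml_get_errors() if you are curious about what the parser complained about. Here is a small sketch; the broken markup is made up for this example.

```php
<?php
// Deliberately broken markup (made up): a stray closing </b> tag.
$html = '<html><body><p>Bulbasaur</b></body></html>';

libxml_use_internal_errors(true); // buffer parser errors instead of printing warnings

$doc = new DOMDocument();
$doc->loadHTML($html);

// Each buffered entry is a LibXMLError object with line and message properties.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    echo 'Line ' . $error->line . ': ' . trim($error->message) . "\n";
}

libxml_clear_errors(); // empty the buffer so old errors do not pile up
```

The exact messages depend on the libxml version installed, so treat the output as diagnostic information rather than something to rely on.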

Next we check whether any HTML has actually been returned:

<?php
if(!empty($html)){ //if any html is actually returned
}
?>

Next we use the loadHTML() method of the DOMDocument instance that we created earlier to load the HTML that was returned. Simply pass the HTML string as the argument:

<?php
$pokemon_doc->loadHTML($html);
?>

Then we clear the errors, if any. Most of the time it is malformed ("yucky") HTML that causes these errors. Examples are unclosed or badly nested tags, invalid attributes and invalid elements. Elements and attributes are considered invalid if they are not part of the HTML specification that the parser understands. libxml's HTML parser predates HTML5, so it may also flag newer elements such as section or nav.

<?php
libxml_clear_errors(); //remove errors for yucky html
?>

Next we declare a new instance of DOMXPath. This allows us to run queries against the DOMDocument that we created. It requires the DOMDocument instance as its argument.

<?php
$pokemon_xpath = new DOMXPath($pokemon_doc);
?>

Finally, we write the query for the specific elements that we want to get. If you have used jQuery before, this process is similar to what you do when you select elements from the DOM. What we're selecting here is all the h2 tags which have an id. We make the location of the h2 unspecific by using double slashes (//) right before the element that we want to select. The value of the id doesn't matter; as long as there's an id, the element gets selected. The nodeValue property contains the text inside the h2 that was selected.

<?php
//get all the h2's with an id
$pokemon_row = $pokemon_xpath->query('//h2[@id]');

if($pokemon_row->length > 0){
	foreach($pokemon_row as $row){
		echo $row->nodeValue . "<br/>";
	}
}
?>
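To try the query without hitting the network, the same steps can be run against a small inline document. The markup below is a stand-in made up for this sketch, not the real pokemondb.net page. It also shows getAttribute() for reading the id, and how a single h2 can be picked out by its id value.

```php
<?php
// Stand-in document (made up for this example).
$html = '<html><body>'
      . '<h2 id="gen1">Generation 1</h2>'
      . '<h2>No id here</h2>'
      . '<h2 id="gen2">Generation 2</h2>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$titles = array();
foreach ($xpath->query('//h2[@id]') as $h2) {
    // getAttribute() reads the id, nodeValue the text inside the element
    $titles[$h2->getAttribute('id')] = $h2->nodeValue;
}

print_r($titles); // the h2 without an id is skipped

// A specific h2 can be selected by the value of its id:
$second = $xpath->query('//h2[@id="gen2"]')->item(0);
echo $second->nodeValue . "\n"; // Generation 2
```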

This results in the following text being printed on the screen:

Generation 1 - Red, Blue, Yellow
Generation 2 - Gold, Silver, Crystal
Generation 3 - Ruby, Sapphire, Emerald
Generation 4 - Diamond, Pearl, Platinum
Generation 5 - Black, White, Black 2, White 2

Let's do one more example with document parsing before we move on to regular expressions. This time we're going to get a list of all the pokemon along with their types (e.g. Fire, Grass, Water).

First let's examine what we have on pokemondb.net/evolution so that we know what particular element to query.

As you can see from the screenshot, the information that we want to get is contained within a span element with a class of "infocard-tall " (yes, the trailing space is included). An XPath comparison such as @class="..." is an exact string match, so any spaces present in the attribute value have to be included in the query, otherwise nothing is matched.

Converting what we know into an actual query, we come up with this:

//span[@class="infocard-tall "]

This selects all the span elements which have a class of "infocard-tall ". It doesn't matter where in the document the span is, because we used the double forward slash before the element name.
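Matching the class attribute as an exact string is fragile: the query stops working if the site drops the trailing space or adds a second class. A common XPath 1.0 idiom for matching a single class name regardless of spacing or extra classes is sketched below. The markup is made up for this example and the idiom is an alternative, not what the original code uses.

```php
<?php
// Made-up markup: the second span has an extra class, the third only a similar name.
$html = '<html><body>'
      . '<span class="infocard-tall ">A</span>'
      . '<span class="infocard-tall small">B</span>'
      . '<span class="infocard-taller">C</span>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Exact string comparison: only class="infocard-tall " (with the space) matches.
$exact = $xpath->query('//span[@class="infocard-tall "]');

// Token match: pad the normalized class list with spaces, then look for " infocard-tall ".
$token = $xpath->query(
    '//span[contains(concat(" ", normalize-space(@class), " "), " infocard-tall ")]'
);

echo $exact->length . "\n"; // 1 (span A)
echo $token->length . "\n"; // 2 (spans A and B)
```

The padding with spaces is what keeps infocard-taller from being matched by accident, which a plain contains(@class, "infocard-tall") would do.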

Once we're inside the span, we have to get to the elements which directly contain the data that we want: the name and the types of the pokemon. As you can see from the screenshot below, the name of the pokemon is directly contained within an anchor element with a class of ent-name, and the types are stored within a small element with a class of aside.

We can then use that knowledge to come up with the following code:

<?php
$pokemon_list = array();

$pokemon_and_type = $pokemon_xpath->query('//span[@class="infocard-tall "]');

if($pokemon_and_type->length > 0){	
	
	//loop through all the pokemons
	foreach($pokemon_and_type as $pat){
		
		//get the name of the pokemon
		$name = $pokemon_xpath->query('a[@class="ent-name"]', $pat)->item(0)->nodeValue;
		
		$pkmn_types = array(); //reset $pkmn_types for each pokemon
		$types = $pokemon_xpath->query('small[@class="aside"]/a', $pat);

		//loop through all the types and store them in the $pkmn_types array
		foreach($types as $type){
			$pkmn_types[] = $type->nodeValue; //the pokemon type
		}

		//store the data in the $pokemon_list array
		$pokemon_list[] = array('name' => $name, 'types' => $pkmn_types);
		
	}
}

//output what we have
echo "<pre>";
print_r($pokemon_list);
echo "</pre>";
?>

There's nothing new in the code above except for using query inside the foreach loop. We use the line below to get the name of the pokemon. You might notice that we specified a second argument when we called the query method. The second argument is the current row (the context node), which limits the scope of the query to that row, because the path a[@class="ent-name"] is relative to it.

<?php
$name = $pokemon_xpath->query('a[@class="ent-name"]', $pat)->item(0)->nodeValue;
?>
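One thing to watch out for with the second argument: it only scopes the query when the path is relative. A path that starts with // is evaluated from the document root even if a context node is passed. The sketch below shows the difference on made-up markup (the class name card is an assumption for the example).

```php
<?php
// Made-up markup with two cards, each holding one name link.
$html = '<html><body>'
      . '<span class="card"><a class="ent-name">Bulbasaur</a></span>'
      . '<span class="card"><a class="ent-name">Charmander</a></span>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$cards = $xpath->query('//span[@class="card"]');
$second_card = $cards->item(1);

// Relative path: evaluated from the context node, so only this card is searched.
$scoped = $xpath->query('a[@class="ent-name"]', $second_card);

// Leading // starts from the document root even when a context node is passed.
$global = $xpath->query('//a[@class="ent-name"]', $second_card);

echo $scoped->item(0)->nodeValue . "\n"; // Charmander
echo $global->length . "\n";             // 2
```

If the element you want is nested deeper inside the row, a path starting with .// searches all descendants of the context node while still staying inside it.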

The results would be something like this:

Array
(
    [0] => Array
        (
            [name] => Bulbasaur
            [types] => Array
                (
                    [0] => Grass
                    [1] => Poison
                )
        )
    [1] => Array
        (
            [name] => Ivysaur
            [types] => Array
                (
                    [0] => Grass
                    [1] => Poison
                )
        )
    [2] => Array
        (
            [name] => Venusaur
            [types] => Array
                (
                    [0] => Grass
                    [1] => Poison
                )
        )

###Regular Expressions

##Web Scraping Tools

###Simple HTML Dom

To make web scraping easier you can use libraries such as Simple HTML DOM. Here's an example of getting the names of the pokemon using Simple HTML DOM:

<?php
include_once 'simple_html_dom.php'; //load the library file, which has to be downloaded separately

$html = file_get_html('http://pokemondb.net/evolution');

foreach($html->find('a[class=ent-name]') as $element){
	echo $element->innertext . '<br>'; //outputs Bulbasaur, Ivysaur, etc...
}
?>

The syntax is simpler, so you have less code to write, and there are also some convenience functions and properties which you can use. An example is the plaintext property, which extracts all the text from a web page:

<?php
echo file_get_html('http://pokemondb.net/evolution')->plaintext; 
?>

###Ganon

##Scraping non-public parts of website

###Scraping Amazon

##Resources

AlexCarlson commented Jan 15, 2015

Nice work .. But what if i want to extract the data from two or more web pages ? .... at an instant of time....

asadpk commented Feb 13, 2015

Hi,
I need these detail in xls sheet .can you modify this script? in a way that i can get data in xls format in same way bellow.

Item name ASIN By 1st price 2nd price 3rd price Category

prasadmunna commented Feb 4, 2016

thanks for such a nice post..

knaveenchand commented Apr 14, 2016

Very well written.

TheKetan2 commented May 31, 2016

suppose we have structure like : Main Link one->Sub Link->Sub Sub Link and we have to get info from all those links and come back to Main Link page and do the same thing with Main Link Two->Sub Link->Sub Sub Link ..how should we so that.

saurabh-vijayvargiya commented Aug 2, 2016

awesome dude, nice explanation along with the code.

bhawnam193 commented Sep 5, 2016

very informative article but for the first example when i use different URL than the one listed it does not show anything and that url has h2 tags, tried changing url not for one but for umpteen number of URLs. Any idea why?

verma-ashish commented Jan 2, 2017

At present, scraping the data is coming as plain text not with existing html tags. So, is it possible to scrap the data with all html tags as well e.g. <a>, , ?

sebastian2609 commented Feb 17, 2017

I need to get content from table how can i do this...? please help me as soon as posible..

WINNING NUMBERS 2017-02-15 (Wed) 3712/17
1ST	2ND	3RD
9411	3367	9162

EdwinChua commented Feb 23, 2017

Thanks! I managed to write my first web scraper for a local news website thanks to this article. :)

@verma-ashish have you tried query('//a'); ?

@sebastian2609 try query('//tr //td'); or something similar

CCHFWBAN commented Mar 3, 2017

Instead of echo -ing the values how can you update them as different rows in a MySQL table?

CCHFWBAN commented Mar 3, 2017

Or, is there a way to select one specific H2 from the array instead of just outputting all the H2's?

Girish0406 commented Mar 21, 2017

that's a very nice explanation . but i want to know if i want to store data into mysql than how to do i m not getting can anyone help me out as soon as possible

vishnu1991 commented Apr 20, 2017

@CCHFWBAN
for outputting just the specific h2 value u can use the array index; like say ,if u want the third h2 then u can use

$scrap_row = $scrap_xpath->query('//h2');
echo "Get third H2 Value:".$scrap_row->item(2)->nodeValue."<br>";

Whip commented Jun 9, 2017

Any ideas about how to get contents of the page which requires login? Assuming you do have the login access for the website.

manualvarado22 commented Aug 2, 2017

Thank you so much! This is amazing.

anamahmed2012 commented Aug 4, 2017

That's what I was looking for.

imran300 commented Aug 15, 2017

What if i have i div class="description" and it contains a ul with 5 li tags
now i want to extract these li data but this ul doesn't have a class or id ans their are t=hundreds of li on a single web page so how are we gonna extract this information

norcaljohnny commented Aug 25, 2017

VeeK727 I would assume just like any site you can use the l/p in the url itself to gain access.
Example.. http://username:password@www.example.com/

norcaljohnny commented Aug 25, 2017

@imran300 you can try using the simple_html_dom.php and then included it in the php file.
As such.

find('li') as $element) echo $element ; ?>

Yes, it is literally that easy and will scrape the full details for each 'li'

norcaljohnny commented Aug 25, 2017

looks like it got cropped. Testing in full once more. (removing opening and closing tags to post)

include_once 'simple_html_dom.php';
// Create DOM from URL
$html = file_get_html('https://www.example.com/');

// Find all links
foreach($html->find('li') as $element)
echo $element ;

kasabesiddhi commented Sep 16, 2017

Awesome explanation

hamzamumtaz007 commented Nov 22, 2017

I have an element with multiple classes how can i detect it using html document parsing?

peterpilip commented Jan 12, 2018

Great work bro..

stephanoapiolaza commented Jan 20, 2018

Nice Article

EhabElzeny commented Apr 17, 2018

very nice & i think there more ways for this thank you

fahadhowlader commented Sep 12, 2018

Please have a look my HTML likes
<li>১, ২, ৩, ৪। 1, 2, 3, 4 It's test.</li>
But i need to scraping only ১, ২, ৩, ৪ and
not need 1, 2, 3, 4 It's test.
how can it possible ?

nootype commented Oct 27, 2020

wow cool written thank you very much. Can you tell me on what scrape data from website on this service https://finddatalab.com/how-to-scrape-data-from-a-website?? What language are they using?
