How to Scrape Website Contents using PHP – Data Miner

In this tutorial, you will quickly learn how to scrape and parse the contents of a website using PHP. You can retrieve images, links, headings (h1-h6), and paragraphs, or even scrape the contents of a div based on its class or id. Pretty cool, isn’t it? This can be used to build crawlers for search engines, for data mining, to scrape a website’s data, or to extract data from a website and show it on your own site (check the copyright and legal notices before you do that). I first used this when I was building a search engine as my college mini project. Later it proved helpful in many of my other projects.


Scrape Contents of a Website using PHP

We are using the htmlsql library, available on GitHub. There are many other PHP scrapers available, but this one is particularly nice because its code is similar to SQL queries. If you are familiar with PHP and MySQL, it will feel very easy: you scrape contents with a ‘SELECT…’ style query, just as in SQL. Let’s get into the code.


To scrape contents based on class or id, you can use this code. You can change the query from links to images, text, headings, or anything else you wish.
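The listing this paragraph refers to is not reproduced here, so the following is only a minimal sketch of the same idea. Note that it does not use htmlsql: it swaps in PHP's built-in DOMDocument and DOMXPath classes so that it runs without any extra library. The sample markup, the id `main`, and the class `post` are made up for illustration.

```php
<?php
// Hypothetical markup standing in for a downloaded page.
$html = <<<HTML
<html><body>
<div id="main"><p>Main content</p></div>
<div class="post featured"><a href="/one">One</a></div>
<div class="post"><a href="/two">Two</a></div>
</body></html>
HTML;

$doc = new DOMDocument();
// Real pages are rarely valid HTML, so collect parser warnings quietly.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select by id: take the first element whose id attribute is "main".
$main = $xpath->query('//*[@id="main"]')->item(0);
$mainText = $main !== null ? trim($main->textContent) : '';

// Select by class: match "post" as a whole word in the class list,
// so class="post featured" matches but class="postscript" would not.
$classQuery = '//*[contains(concat(" ", normalize-space(@class), " "), " post ")]//a';
$postLinks = [];
foreach ($xpath->query($classQuery) as $a) {
    $postLinks[] = $a->getAttribute('href');
}

print_r($mainText);
echo PHP_EOL;
print_r($postLinks);
```

To collect images or headings instead of links, change the trailing `//a` in the query to `//img` or `//h2` and read the attribute or text you need.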


Tips, Features, Troubleshooting and more

With the above code, you can retrieve the following from a website:

  • title
  • links
  • images
  • headings (h1-h6)
  • paragraph texts
  • parse based on class
  • parse based on id
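As a rough illustration of the first few items in this list, here is a small sketch that pulls the title, links, images, headings, and paragraphs out of a page. It again uses PHP's built-in DOMDocument and DOMXPath rather than htmlsql, and the sample markup is hypothetical.

```php
<?php
// Hypothetical markup standing in for a downloaded page.
$html = '<html><head><title>Sample page</title></head><body>'
      . '<h1>Welcome</h1><h2>News</h2>'
      . '<p>First paragraph.</p>'
      . '<a href="/about">About</a><a href="#top">Top</a>'
      . '<img src="/logo.png" alt="Logo">'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate imperfect HTML
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Title: string() returns the text of the first matching node.
$title = $xpath->evaluate('string(//title)');

// Links and images: read the relevant attribute of each element.
$links = [];
foreach ($xpath->query('//a[@href]') as $a) {
    $links[] = $a->getAttribute('href');
}
$images = [];
foreach ($xpath->query('//img[@src]') as $img) {
    $images[] = $img->getAttribute('src');
}

// Headings h1-h6 and paragraphs: read the text content.
$headings = [];
foreach ($xpath->query('//h1|//h2|//h3|//h4|//h5|//h6') as $h) {
    $headings[] = trim($h->textContent);
}
$paragraphs = [];
foreach ($xpath->query('//p') as $p) {
    $paragraphs[] = trim($p->textContent);
}

print_r(compact('title', 'links', 'images', 'headings', 'paragraphs'));
```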

If you have checked the demo, you may have noticed that it sometimes displays images, lists, or headings where you expected links. Don’t think that this is an error. It happens because the scraper returns the HTML code inside each matched element. To get plain text, remove the HTML tags with PHP’s strip_tags() function.
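A small sketch of that clean-up step, using a made-up fragment of the kind a link query might return:

```php
<?php
// Hypothetical scraped value: a link that wraps an image and bold text.
$fragment = '<li><a href="/about"><img src="/logo.png" alt="Logo"> About <b>us</b></a></li>';

// strip_tags() removes every tag and keeps only the text between them.
$text = trim(strip_tags($fragment));

// The optional second argument lists tags that should be kept.
$textWithBold = trim(strip_tags($fragment, '<b>'));

echo $text, PHP_EOL;         // About us
echo $textWithBold, PHP_EOL; // About <b>us</b>
```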

Are links coming back as # values? If you are a web developer, you probably already know why: many pages use href="#" as a placeholder or to point at an anchor on the same page. Check whether the first character of the link is #, and if so, just ‘continue’ the loop.
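That check can be sketched as follows. The `$hrefs` array is a hypothetical stand-in for whatever link values your scraper returned.

```php
<?php
// Hypothetical href values collected from a page.
$hrefs = ['#', '/about', '#comments', 'https://example.com/page#section', ''];

$pageLinks = [];
foreach ($hrefs as $href) {
    // Skip empty values and in-page anchors such as "#" or "#comments".
    if ($href === '' || $href[0] === '#') {
        continue;
    }
    $pageLinks[] = $href;
}

print_r($pageLinks);
```

Only links that start with # are skipped, so a full URL that merely ends in a fragment, like the fourth entry, is kept.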