Scrape All Links and Images in a Website Using PHP

This tutorial shows how to scrape all links and images from a given website’s URL. Data mining, or scraping, is used in many PHP projects. I first used this extractor to parse and extract content from websites while building a search engine (crawler) in PHP as my college mini project. Later I used the same script for many other purposes. In some cases it even let me extract data from websites and display it on my site without buying their API 😉

The most commonly mined data in website scraping are links and images (to a coder, a href’s and img src’s). So I decided to write a tutorial on it. More articles on scraping titles, meta tags, headings, paragraphs and so on are coming soon.


The library we are using here is htmlsql. You may be wondering why a library is necessary when this can be done with regex or a DOM parser. Well, this library has many additional features that are impossible (or tough) to do with regex alone. I will explain the additional features of htmlsql in future articles. Moreover, htmlsql will be very easy for people who are familiar with MySQL or SQLite, because its syntax is similar to SQL queries. Let’s get into the code.

Parse and Extract Links and Images in a Website


Snoopy.class.php and htmlsql.class.php are the libraries I described earlier. First, the script sends an HTTP request to the provided URL and retrieves the HTML code. The HTML is then parsed using queries like “SELECT href FROM a” and “SELECT src FROM img”. For those who are familiar with PHP-MySQL coding, this won’t be a big deal.
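
The downloadable script itself is not reproduced here, so the following is only a dependency-free sketch of the same two extractions. It uses PHP’s built-in DOMDocument instead of htmlsql and Snoopy, and it parses a hard-coded sample string instead of fetching a live URL. The helper name extract_attribute and the sample markup are made up for illustration and are not part of htmlsql.

```php
<?php
// Hypothetical helper (not part of htmlsql): collect the values of one
// attribute from every element with the given tag name.
function extract_attribute(string $html, string $tag, string $attribute): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so buffer parser warnings
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $values = [];
    foreach ($doc->getElementsByTagName($tag) as $element) {
        // Skip elements that lack the attribute, e.g. <a name="top">.
        if ($element->hasAttribute($attribute)) {
            $values[] = $element->getAttribute($attribute);
        }
    }
    return $values;
}

// Sample markup standing in for the HTML of a fetched page (assumption).
$html = '<html><body>'
      . '<a href="https://example.com/about">About</a>'
      . '<a name="top">No href here</a>'
      . '<img src="/images/logo.png" alt="Logo">'
      . '<a href="/contact">Contact</a>'
      . '</body></html>';

$links  = extract_attribute($html, 'a', 'href');   // roughly: SELECT href FROM a
$images = extract_attribute($html, 'img', 'src');  // roughly: SELECT src FROM img

print_r($links);
print_r($images);
```

To try this against a real page, the $html string could come from file_get_contents($url) or cURL instead of the sample. The values come back exactly as written in the markup, so relative paths such as /contact still need to be resolved against the page URL before they can be followed.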

Comment below if you have any doubts, suggestions or feedback.