Check for Broken Links with PHP Part 1

January 13, 2014 — Tags: cURL, DOMDocument, PHP, productivity | 2 Comments

One aspect of managing a website that I want to streamline is checking for broken links. Clicking on all of those links manually can be tedious, especially if you have a page dedicated to posting external links. I know there are link-checking services available, but I've been looking for an excuse to experiment with cURL, which is available through PHP. For those interested, I wanted to share what I have so far.

Getting the Links

The first thing PHP needs is the URL of the page to check for broken links. Let's pass it using a GET variable so we can easily test different pages without modifying the script each time. We'll also create some other variables which will be used later.

<?php
//INITIALIZE VARIABLES
$pageToCheck = $_GET['link'];
$badLinks = array();
$goodLinks = array();
$badStatusCodes = array('308', '404');
?>
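
Because the address comes straight from the query string, it may be worth checking it before the script tries to load anything. The following is only a sketch of one way to do that and is not part of the original script: the helper name getPageToCheck() is made up for illustration, and it assumes that only http and https pages should be accepted.

```php
<?php
// Hypothetical helper (not in the original script): returns the URL to check,
// or null when the "link" parameter is missing or is not an http(s) address.
function getPageToCheck(array $query)
{
    if (!isset($query['link']) || !is_string($query['link'])) {
        return null;
    }

    $url = trim($query['link']);

    // FILTER_VALIDATE_URL rejects strings that are not well-formed URLs
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return null;
    }

    // Assumption: only web pages should be checked, so limit the scheme
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    if ($scheme !== 'http' && $scheme !== 'https') {
        return null;
    }

    return $url;
}
```

With something like this in place, the first line of the script could become $pageToCheck = getPageToCheck($_GET); and the script could stop early when the result is null.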

To locate all the links on our page, we'll use PHP's DOMDocument class. The class needs to be initialized and then given the page's HTML code.

<?php
//INITIALIZE VARIABLES
$pageToCheck = $_GET['link'];
$badLinks = array();
$goodLinks = array();
$badStatusCodes = array('308', '404');
//INITIALIZE DOMDOCUMENT
$domDoc = new DOMDocument;
$domDoc->preserveWhiteSpace = false;
//IF THE PAGE BEING CHECKED LOADS
//note that errors are suppressed so DOMDocument doesn't complain about XHTML
if(@$domDoc->loadHTMLFile($pageToCheck)) {
    //process HTML here
}
?>

Next, we can use DOMDocument's getElementsByTagName() method to find all the anchor tags within the page, then loop through those tags looking for the "href" attribute, which contains the link to check.

<?php
//...
//IF THE PAGE BEING CHECKED LOADS
//note that errors are suppressed so DOMDocument doesn't complain about XHTML
if(@$domDoc->loadHTMLFile($pageToCheck)) {
    //LOOP THROUGH ANCHOR TAGS ON THE PAGE
    $pageLinks = $domDoc->getElementsByTagName('a');
    foreach($pageLinks as $currLink) {
        //LOOP THROUGH ATTRIBUTES FOR CURRENT LINK
        foreach($currLink->attributes as $attributeName=>$attributeValue) {
            //IF CURRENT ATTRIBUTE CONTAINS THE WEBSITE ADDRESS
            if($attributeName == 'href') {
                //test current link here
            }
        }
    }
}
?>
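
As a possible shortcut, DOMElement also provides hasAttribute() and getAttribute(), so the inner attribute loop isn't strictly required. The sketch below shows that alternative. It uses loadHTML() on a made-up HTML string instead of loadHTMLFile() so that it runs by itself, and the function name extractHrefs() is hypothetical.

```php
<?php
// Hypothetical helper: collects the href value of every anchor tag in an HTML string.
function extractHrefs($html)
{
    $domDoc = new DOMDocument;

    // Errors are suppressed, as in the main script, so imperfect markup doesn't produce warnings
    if (!@$domDoc->loadHTML($html)) {
        return array();
    }

    $hrefs = array();
    foreach ($domDoc->getElementsByTagName('a') as $currLink) {
        // Anchors without an href (named anchors, for example) are skipped
        if ($currLink->hasAttribute('href')) {
            $hrefs[] = $currLink->getAttribute('href');
        }
    }

    return $hrefs;
}

// Made-up sample markup for illustration
$sampleHtml = '<p><a href="http://www.example.com/">Example</a> <a name="top">Top</a></p>';
print_r(extractHrefs($sampleHtml));
```

The attribute loop in the main script does the same job. This version just reads the one attribute we care about directly.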

Now that we have an address to check, let's initialize a cURL session and attempt to visit the website.

<?php
//...
//IF CURRENT ATTRIBUTE CONTAINS THE WEBSITE ADDRESS
if($attributeName == 'href') {
    //INITIALIZE CURL AND TEST THE LINK
    $ch = curl_init($attributeValue->value);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_exec($ch);
    $returnCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
}
//...
?>
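
One thing to keep in mind is that an href isn't always a full web address. Values such as "#top", "mailto:" addresses, or relative paths like "/about.php" can't be requested by cURL as they stand. The sketch below shows one way such values might be filtered out before the cURL call. The function name isCheckableLink() is made up for illustration, and it assumes only absolute http and https addresses should be tested.

```php
<?php
// Hypothetical helper: true only for absolute http or https addresses,
// which are the values cURL can request without further processing.
function isCheckableLink($href)
{
    $scheme = parse_url(trim($href), PHP_URL_SCHEME);

    // parse_url() gives null when there is no scheme and false when the URL is seriously malformed
    if (!is_string($scheme)) {
        return false;
    }

    return in_array(strtolower($scheme), array('http', 'https'), true);
}

// Made-up sample values for illustration
foreach (array('http://www.example.com/', '/about.php', '#top', 'mailto:someone@example.com') as $href) {
    print $href . ' => ' . (isCheckableLink($href) ? 'check' : 'skip') . "\n";
}
```

Relative links could instead be resolved against the address of the page being checked, but that is beyond what this sketch attempts.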

The cURL request above gives us an HTTP status code, retrieved with curl_getinfo(), which tells us the status of the link. If that code appears in $badStatusCodes, the link is recorded as bad. Otherwise, it is recorded as good.

<?php
//...
//IF CURRENT ATTRIBUTE CONTAINS THE WEBSITE ADDRESS
if($attributeName == 'href') {
    //INITIALIZE CURL AND TEST THE LINK
    $ch = curl_init($attributeValue->value);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_exec($ch);
    $returnCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    //TRACK THE RESPONSE
    if(in_array($returnCode, $badStatusCodes)) {
        $badLinks[] = array('name'=>$currLink->nodeValue, 'link'=>$attributeValue->value);
    } else {
        $goodLinks[] = array('name'=>$currLink->nodeValue, 'link'=>$attributeValue->value);
    }
}
//...
?>
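
Note that when cURL gets no HTTP response at all (the domain no longer exists, for example), CURLINFO_HTTP_CODE comes back as 0, and with the check above such a link would end up in the good list. The sketch below is one possible way to account for that. The helper name isBadStatus() is hypothetical, and treating 0 as bad is an assumption rather than something the original script does.

```php
<?php
// Hypothetical helper: decides whether a status code should count as a bad link.
function isBadStatus($returnCode, array $badStatusCodes)
{
    $returnCode = (int) $returnCode;

    // Assumption: a code of 0 means no HTTP response was received at all
    if ($returnCode === 0) {
        return true;
    }

    // Compare as strings because the list in the script stores codes as strings
    return in_array((string) $returnCode, $badStatusCodes, true);
}

// Same list of bad codes as the main script
$badStatusCodes = array('308', '404');
var_dump(isBadStatus(0, $badStatusCodes));
```

In the main script, the in_array() test could then be swapped for a call like isBadStatus($returnCode, $badStatusCodes).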

Once all the links have been processed, we just need to display the results.

<?php
//...
//DISPLAY RESULTS
print '<h2>Bad Links</h2>';
print '<pre>' . print_r($badLinks, true) . '</pre>';
print '<h2>Good Links</h2>';
print '<pre>' . print_r($goodLinks, true) . '</pre>';
//...
?>

Final Code

To give you a better sense of how the pieces fit together, here is the entire script:

<?php
//INITIALIZE VARIABLES
$pageToCheck = $_GET['link'];
$badLinks = array();
$goodLinks = array();
$badStatusCodes = array('308', '404');
//INITIALIZE DOMDOCUMENT
$domDoc = new DOMDocument;
$domDoc->preserveWhiteSpace = false;
//IF THE PAGE BEING CHECKED LOADS
//note that errors are suppressed so DOMDocument doesn't complain about XHTML
if(@$domDoc->loadHTMLFile($pageToCheck)) {
    //LOOP THROUGH ANCHOR TAGS ON THE PAGE
    $pageLinks = $domDoc->getElementsByTagName('a');
    foreach($pageLinks as $currLink) {
        //LOOP THROUGH ATTRIBUTES FOR CURRENT LINK
        foreach($currLink->attributes as $attributeName=>$attributeValue) {
            //IF CURRENT ATTRIBUTE CONTAINS THE WEBSITE ADDRESS
            if($attributeName == 'href') {
                //INITIALIZE CURL AND TEST THE LINK
                $ch = curl_init($attributeValue->value);
                curl_setopt($ch, CURLOPT_NOBODY, true);
                curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
                curl_exec($ch);
                $returnCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
                curl_close($ch);
                //TRACK THE RESPONSE
                if(in_array($returnCode, $badStatusCodes)) {
                    $badLinks[] = array('name'=>$currLink->nodeValue, 'link'=>$attributeValue->value);
                } else {
                    $goodLinks[] = array('name'=>$currLink->nodeValue, 'link'=>$attributeValue->value);
                }
            }
        }
    }
    //DISPLAY RESULTS
    print '<h2>Bad Links</h2>';
    print '<pre>' . print_r($badLinks, true) . '</pre>';
    print '<h2>Good Links</h2>';
    print '<pre>' . print_r($goodLinks, true) . '</pre>';
}
?>

Conclusion

The code could be saved to a file called "linkChecker.php" and uploaded to your website's root folder. To check a page for broken links, you would visit the page by typing something like

http://www.yourwebsite.com/linkChecker.php?link=http://www.yourwebsite.com/pageToCheck.php

Note that it may take a few minutes for the script to process the entire page, depending on how many links are on it. There are some things that can be done to speed up the process, but that will have to wait until another time. In the meantime, let me know if you have any questions or suggestions in the comments section below.

Next Post: Check for Broken Links with PHP Part 2: Capture Redirected Links

2 Comments

#2 Patrick Nichols on 10.17.15 at 10:01 am
@Matthew – Are you referring to how the variables are named in this post? If so, sorry about that. That's just the naming convention which stuck with me since my early days of programming in C++.
#1 Matthew Fisher on 10.17.15 at 8:41 am
I hate it when people name variables in PHP as if they were writing Javascript.