Check Your Error 404 Database for Broken Links with cURL
Some websites that I maintain have a database for tracking which pages have moved. The problem is that the links pointing to a page's new location can themselves break. A visitor then lands on a 404 error saying the page has moved, follows the link to the new page, and is greeted with another 404 error. So let's write a script that looks through the database for broken links.
Background
For the sake of this example, we'll use the same database table as the previous post (Help Visitors Find Moved Pages with a Simple Error 404 Database).
id | date | oldAddress | message |
---|---|---|---|
1 | 2014-01-24 | /about/oldFile.pdf | has been removed |
2 | 2014-04-02 | /about/bio/johnsmith.php | has moved; <a href="/about/viewbio.php?id=1">view John Smith's bio</a> |
3 | 2014-04-02 | /about/bio/jakebible.php | has moved; <a href="/about/viewbio.php?id=2">view Jake Bible's bio</a> |
4 | 2014-05-01 | /resources/old_page.php | has been removed |
We're also going to leverage the code from the 3-part post titled "Check for Broken Links with PHP." Of course, the code will be modified to work with a database.
Checking for Broken Links
The goal of this program is to check for broken links in the message field from the database. To do that, we'll need to connect with the database. Let's also create a few variables to be used later and establish a DOMDocument object.
<?php
//CONNECT WITH DATABASE
require "{$_SERVER['DOCUMENT_ROOT']}/../database_connection.php";
$connect = new connect(true);
$mysqli = $connect->databaseObject;
//INITIALIZE VARIABLES
$badLinks = array();
$changedLinks = array();
$goodLinks = array();
$badStatusCodes = array('308', '404');
//INITIALIZE DOMDOCUMENT
$domDoc = new DOMDocument;
$domDoc->preserveWhiteSpace = false;
?>
Note that you can find more information about the above database connection script in the post titled "End PHP Scripts Gracefully After a Failed Database Connection." Next, we'll need the 404 error messages to loop through.
<?php
//...
//INITIALIZE DOMDOCUMENT
$domDoc = new DOMDocument;
$domDoc->preserveWhiteSpace = false;
//GET ERROR 404 ENTRIES
$sql = "SELECT id, message FROM error404";
$result = $mysqli->query($sql);
while($row = $result->fetch_assoc()) {
}
?>
Since the messages are strings, DOMDocument's loadHTML() method is used to load the HTML.
<?php
//...
$result = $mysqli->query($sql);
while($row = $result->fetch_assoc()) {
    //IF ERROR 404 MESSAGE LOADS
    if(@$domDoc->loadHTML($row['message'])) {
    //ELSE...UNABLE TO LOAD MESSAGE FOR CHECKING
    } else {
        print '<div>DOMDocument failed</div>';
    }
}
?>
We can then loop through any anchor tag(s) embedded within the message looking for ones that have an "href" attribute.
<?php
//...
while($row = $result->fetch_assoc()) {
    //IF ERROR 404 MESSAGE LOADS
    if(@$domDoc->loadHTML($row['message'])) {
        //LOOP THROUGH ANCHOR TAGS IN THE ERROR 404 MESSAGE
        $messageLinks = $domDoc->getElementsByTagName('a');
        foreach($messageLinks as $currLink) {
            //LOOP THROUGH ATTRIBUTES FOR CURRENT ANCHOR TAG
            foreach($currLink->attributes as $attributeName=>$attributeValue) {
                //IF CURRENT ATTRIBUTE CONTAINS A WEBSITE LINK
                if($attributeName == 'href') {
                }
            }
        }
    //ELSE...UNABLE TO LOAD MESSAGE FOR CHECKING
    } else {
        print '<div>DOMDocument failed</div>';
    }
}
?>
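As a side note, the inner attribute loop can be avoided because DOMElement offers hasAttribute() and getAttribute(). Here is a minimal sketch of that variation. The extractLinks() function is a hypothetical helper that is not part of the original script, and it uses libxml_use_internal_errors() rather than the @ operator to keep parse warnings quiet.

```php
<?php
// Hypothetical helper (not part of the original script): returns the href and
// link text of every anchor found in an HTML fragment.
function extractLinks(string $html): array
{
    $links = array();
    if (trim($html) === '') {
        return $links; // nothing to parse; loadHTML() rejects empty input
    }

    // Collect libxml parse warnings quietly instead of suppressing them with @
    $previous = libxml_use_internal_errors(true);
    $domDoc = new DOMDocument;
    $loaded = $domDoc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return $links;
    }

    foreach ($domDoc->getElementsByTagName('a') as $anchor) {
        // hasAttribute()/getAttribute() take the place of the attribute loop
        if ($anchor->hasAttribute('href')) {
            $links[] = array(
                'href' => $anchor->getAttribute('href'),
                'name' => $anchor->nodeValue,
            );
        }
    }
    return $links;
}

// Usage with one of the sample messages from the table above
$message = 'has moved; <a href="/about/viewbio.php?id=1">view John Smith\'s bio</a>';
$found = extractLinks($message);
```

Each entry in $found holds the raw href and the link text, which is the same information the nested loops collect through $attributeValue->value and $currLink->nodeValue.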
Since the database contains root-relative links, we'll need to convert them into absolute links.
<?php
//...
//IF CURRENT ATTRIBUTE CONTAINS A WEBSITE LINK
if($attributeName == 'href') {
    //IF LINK IS ROOT-RELATIVE, MAKE IT ABSOLUTE
    if(substr($attributeValue->value, 0, 1) == '/') {
        $attributeValue->value = 'http://www.yourwebsite.com' . $attributeValue->value;
    }
}
//...
?>
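One thing to watch for: a protocol-relative link such as //cdn.example.com/file.pdf also starts with a slash, so the check above would prepend the site address to it. If your messages might contain links like that, the conversion could be pulled into a small function. This is only a sketch; makeAbsolute() and its $baseUrl parameter are hypothetical names, and the example.com addresses are placeholders.

```php
<?php
// Hypothetical helper; $baseUrl is a placeholder for your own site address.
function makeAbsolute(string $href, string $baseUrl = 'http://www.yourwebsite.com'): string
{
    // Protocol-relative link (//host/path): borrow the scheme from $baseUrl
    if (substr($href, 0, 2) === '//') {
        $scheme = parse_url($baseUrl, PHP_URL_SCHEME);
        return ($scheme ?: 'http') . ':' . $href;
    }
    // Root-relative link (/path): prepend the site address
    if (substr($href, 0, 1) === '/') {
        return rtrim($baseUrl, '/') . $href;
    }
    // Anything else (absolute URLs, mailto:, #fragments) is returned unchanged
    return $href;
}

// Usage
$absolute = makeAbsolute('/about/viewbio.php?id=1');
```

Links that come back unchanged and are not HTTP addresses (mailto: links, for example) would still need to be skipped before the cURL request.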
We're now ready to execute a cURL request to check if the link is still valid. The result will be stored in the variables created earlier.
<?php
//...
//IF CURRENT ATTRIBUTE CONTAINS A WEBSITE LINK
if($attributeName == 'href') {
    //IF LINK IS ROOT-RELATIVE, MAKE IT ABSOLUTE
    if(substr($attributeValue->value, 0, 1) == '/') {
        $attributeValue->value = 'http://www.yourwebsite.com' . $attributeValue->value;
    }
    //RUN cURL TO CHECK THE LINK
    $ch = curl_init($attributeValue->value);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_exec($ch);
    $returnCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $finalURL = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);
    //PROCESS THE RETURN CODE
    if(in_array($returnCode, $badStatusCodes)) {
        $badLinks[] = array('id'=>$row['id'], 'name'=>$currLink->nodeValue, 'link'=>$attributeValue->value);
    } elseif($finalURL != $attributeValue->value) {
        $changedLinks[] = array('id'=>$row['id'], 'name'=>$currLink->nodeValue, 'link'=>$attributeValue->value, 'newLink'=>$finalURL);
    } else {
        $goodLinks[] = array('id'=>$row['id'], 'name'=>$currLink->nodeValue, 'link'=>$attributeValue->value);
    }
}
//...
?>
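One case the three-way split above doesn't cover is a request that never gets an HTTP response (a DNS failure or a timeout, for example). In that situation curl_getinfo() reports a status code of 0, which is not in $badStatusCodes, so the link can end up in the good list. Below is a rough sketch of how the decision could be isolated and extended. The classifyLink() function is a hypothetical helper, and treating 0 as bad is my assumption about the behavior you would want.

```php
<?php
// Hypothetical helper: sorts one cURL result into 'bad', 'changed', or 'good'.
function classifyLink(int $returnCode, string $requestedUrl, string $finalUrl, array $badStatusCodes): string
{
    // cURL reports 0 when no HTTP response was received (DNS failure, timeout, etc.)
    if ($returnCode === 0 || in_array($returnCode, $badStatusCodes)) {
        return 'bad';
    }
    // A different effective URL means at least one redirect was followed
    if ($finalUrl !== $requestedUrl) {
        return 'changed';
    }
    return 'good';
}

// Usage with the status codes defined earlier in the post
$badStatusCodes = array('308', '404');
$status = classifyLink(0, 'http://www.yourwebsite.com/about/viewbio.php?id=1', 'http://www.yourwebsite.com/about/viewbio.php?id=1', $badStatusCodes);
```

It may also be worth setting CURLOPT_CONNECTTIMEOUT and CURLOPT_TIMEOUT on the handle so that one unresponsive server doesn't stall the whole run.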
All that's left to do is display the results.
<?php
//...
    //ELSE...UNABLE TO LOAD MESSAGE FOR CHECKING
    } else {
        print '<div>DOMDocument failed</div>';
    }
}
//DISPLAY RESULTS
print '<h2>Bad Links</h2>';
print '<pre>' . print_r($badLinks, true) . '</pre>';
print '<h2>Changed Links</h2>';
print '<pre>' . print_r($changedLinks, true) . '</pre>';
print '<h2>Good Links</h2>';
print '<pre>' . print_r($goodLinks, true) . '</pre>';
?>
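The print_r() output is fine for a quick look, but link text and addresses can contain characters like & that should be escaped before they are sent to the browser. If you want something a little tidier, here is one possible sketch. The renderLinkList() function and its markup are hypothetical; it only assumes the array keys (id, name, link, newLink) used by the script.

```php
<?php
// Hypothetical helper: builds an HTML list for one group of results.
function renderLinkList(string $title, array $links): string
{
    $html = '<h2>' . htmlspecialchars($title) . '</h2>';
    if (empty($links)) {
        return $html . '<p>None found.</p>';
    }
    $html .= '<ul>';
    foreach ($links as $link) {
        // Escape everything that originated in the database or a remote server
        $html .= '<li>Entry ' . (int) $link['id'] . ': '
            . htmlspecialchars($link['name']) . ' ('
            . htmlspecialchars($link['link']) . ')';
        if (isset($link['newLink'])) {
            $html .= ' now resolves to ' . htmlspecialchars($link['newLink']);
        }
        $html .= '</li>';
    }
    return $html . '</ul>';
}

// Usage with a made-up result entry
$sample = array(array('id' => 2, 'name' => 'Smith & Sons', 'link' => 'http://www.yourwebsite.com/about/viewbio.php?id=1'));
$output = renderLinkList('Bad Links', $sample);
```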
Final Code
To give you a better sense of how the pieces fit together, here is the entire script:
<?php
//CONNECT WITH DATABASE
require "{$_SERVER['DOCUMENT_ROOT']}/../database_connection.php";
$connect = new connect(true);
$mysqli = $connect->databaseObject;

//INITIALIZE VARIABLES
$badLinks = array();
$changedLinks = array();
$goodLinks = array();
$badStatusCodes = array('308', '404');

//INITIALIZE DOMDOCUMENT
$domDoc = new DOMDocument;
$domDoc->preserveWhiteSpace = false;

//GET ERROR 404 ENTRIES
$sql = "SELECT id, message FROM error404";
$result = $mysqli->query($sql);
while($row = $result->fetch_assoc()) {
    //IF ERROR 404 MESSAGE LOADS
    if(@$domDoc->loadHTML($row['message'])) {
        //LOOP THROUGH ANCHOR TAGS IN THE ERROR 404 MESSAGE
        $messageLinks = $domDoc->getElementsByTagName('a');
        foreach($messageLinks as $currLink) {
            //LOOP THROUGH ATTRIBUTES FOR CURRENT ANCHOR TAG
            foreach($currLink->attributes as $attributeName=>$attributeValue) {
                //IF CURRENT ATTRIBUTE CONTAINS A WEBSITE LINK
                if($attributeName == 'href') {
                    //IF LINK IS ROOT-RELATIVE, MAKE IT ABSOLUTE
                    if(substr($attributeValue->value, 0, 1) == '/') {
                        $attributeValue->value = 'http://www.yourwebsite.com' . $attributeValue->value;
                    }

                    //RUN cURL TO CHECK THE LINK
                    $ch = curl_init($attributeValue->value);
                    curl_setopt($ch, CURLOPT_NOBODY, true);
                    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
                    curl_exec($ch);
                    $returnCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
                    $finalURL = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
                    curl_close($ch);

                    //PROCESS THE RETURN CODE
                    if(in_array($returnCode, $badStatusCodes)) {
                        $badLinks[] = array('id'=>$row['id'], 'name'=>$currLink->nodeValue, 'link'=>$attributeValue->value);
                    } elseif($finalURL != $attributeValue->value) {
                        $changedLinks[] = array('id'=>$row['id'], 'name'=>$currLink->nodeValue, 'link'=>$attributeValue->value, 'newLink'=>$finalURL);
                    } else {
                        $goodLinks[] = array('id'=>$row['id'], 'name'=>$currLink->nodeValue, 'link'=>$attributeValue->value);
                    }
                }
            }
        }
    //ELSE...UNABLE TO LOAD MESSAGE FOR CHECKING
    } else {
        print '<div>DOMDocument failed</div>';
    }
}

//DISPLAY RESULTS
print '<h2>Bad Links</h2>';
print '<pre>' . print_r($badLinks, true) . '</pre>';
print '<h2>Changed Links</h2>';
print '<pre>' . print_r($changedLinks, true) . '</pre>';
print '<h2>Good Links</h2>';
print '<pre>' . print_r($goodLinks, true) . '</pre>';
?>
Conclusion
Keep in mind that the script can be a little slow. After all, it needs to request each website address referenced in the database, one at a time, to see what happens. The more links you have, the longer it takes.
The script could be sped up by maintaining a list of links already checked. You would just need to check the website address being processed against the already checked addresses before running the cURL request.
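Here is a rough sketch of that idea. The checkWithCache() function, the $checked array, and the stand-in checker are all hypothetical names used for illustration; in the real script the checker would wrap the cURL request and return whatever you want to remember about the address, such as the status code.

```php
<?php
// Hypothetical wrapper: remembers results by URL so each address is requested once.
function checkWithCache(string $url, array &$checked, callable $checker)
{
    if (!array_key_exists($url, $checked)) {
        $checked[$url] = $checker($url); // first time: run the real check
    }
    return $checked[$url]; // repeat visits reuse the stored result
}

// Demo with a stand-in checker that counts its calls instead of running cURL
$calls = 0;
$fakeChecker = function ($url) use (&$calls) {
    $calls++;
    return 200;
};

$checked = array();
checkWithCache('http://www.yourwebsite.com/about/viewbio.php?id=1', $checked, $fakeChecker);
checkWithCache('http://www.yourwebsite.com/about/viewbio.php?id=1', $checked, $fakeChecker);
// $calls is now 1: the second lookup was served from $checked
```

Keying the array by the absolute address means the lookup should happen after root-relative links have been converted.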