nichrome

Extracting URLs from the Wayback Machine

March 2, 2020

This post covers two methods for extracting URLs: querying the Wayback Machine's CDX API and running JavaScript in the browser's console. The extracted URLs can then be processed and saved to a text file, which a bash script can reference to coordinate a scrape of the Wayback Machine for specific file types.


Extract URLs from web.archive.org

Start by navigating to one of the URLs below, replacing the example.com root domain with your target domain.

JSON format:

http://web.archive.org/cdx/search/cdx?url=example.com*&output=json

Output: adding output=json returns the results as a JSON array. The JSON output currently also includes a first row that lists the CDX field names.

Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=3
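
As a rough sketch of how that first row can be used, the helper below turns the JSON array into one object per capture, keyed by the field names in the header row. The function name cdxRowsToObjects and the sample rows are made up for illustration; only the header-row convention comes from the output described above.

```javascript
// Hypothetical helper: convert CDX JSON output (header row + data rows)
// into an array of objects keyed by the header's field names.
function cdxRowsToObjects(rows) {
  if (!Array.isArray(rows) || rows.length === 0) {
    return [];
  }
  const [header, ...captures] = rows;
  return captures.map((fields) => {
    const record = {};
    header.forEach((name, i) => {
      record[name] = fields[i];
    });
    return record;
  });
}

// Illustrative sample shaped like a CDX response; the values are invented.
const sampleRows = [
  ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
  ["com,example)/", "20100101000000", "http://example.com/", "text/html", "200", "AAAA", "1234"],
  ["com,example)/song.mp3", "20110101000000", "http://example.com/song.mp3", "audio/mpeg", "200", "BBBB", "5678"],
];

const sampleCaptures = cdxRowsToObjects(sampleRows);
console.log(sampleCaptures.map((capture) => capture.original));
```

Reading the field names from the header row, rather than hard-coding column positions, keeps the helper working if the set of returned fields changes.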

TXT format:

http://web.archive.org/cdx/search/cdx?url=example.com*&output=txt
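
The text output can be reduced to a plain list of URLs before saving it to a file. The sketch below assumes the default field order (urlkey, timestamp, original, mimetype, statuscode, digest, length, separated by spaces) and keeps the third field of each line; the function name and the sample text are hypothetical.

```javascript
// Hypothetical helper: pull the unique "original" URLs out of CDX text output.
// Assumes the default space-separated field order:
// urlkey timestamp original mimetype statuscode digest length
function originalUrlsFromCdxText(text) {
  const seen = new Set();
  for (const line of text.split("\n")) {
    const fields = line.trim().split(/\s+/);
    if (fields.length >= 3) {
      seen.add(fields[2]); // third field is the originally captured URL
    }
  }
  return Array.from(seen);
}

// Invented sample lines; the same URL appears in two captures.
const sampleText = [
  "com,example)/ 20100101000000 http://example.com/ text/html 200 AAAA 1234",
  "com,example)/ 20120101000000 http://example.com/ text/html 200 CCCC 1300",
  "com,example)/song.mp3 20110101000000 http://example.com/song.mp3 audio/mpeg 200 BBBB 5678",
  "",
].join("\n");

const uniqueUrls = originalUrlsFromCdxText(sampleText);
console.log(uniqueUrls.join("\n"));
```

A Set is used so that a URL captured many times appears only once in the final list.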

Limit captures by using a range

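To limit the time frame of the captures, add the from and to parameters to the end of the URL. Both take timestamps in yyyyMMddhhmmss format, and a shorter prefix such as a four-digit year also works, so &from=2010&to=2018 narrows the results to captures from 2010 through 2018.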

http://web.archive.org/cdx/search/cdx?url=example.com*&output=txt&from=2010&to=2018

You can also raise or lower the maximum number of results returned by adding the limit parameter, for example &output=txt&limit=999999

http://web.archive.org/cdx/search/cdx?url=example.com*&output=txt&limit=999999
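
Rather than editing these query strings by hand, the parameters can be assembled in code. The following is a minimal sketch using the standard URL and URLSearchParams objects; buildCdxUrl and its options object are hypothetical names, and only the url, output, from, to, and limit parameters shown above are used.

```javascript
// Hypothetical helper: build a CDX query URL from the parameters covered above.
function buildCdxUrl(options) {
  const { domain, output = "txt", from, to, limit } = options;
  const endpoint = new URL("http://web.archive.org/cdx/search/cdx");

  // The trailing * asks for every captured URL under the domain.
  endpoint.searchParams.set("url", domain + "*");
  endpoint.searchParams.set("output", output);

  // Optional parameters are only added when a value was supplied.
  if (from !== undefined) endpoint.searchParams.set("from", String(from));
  if (to !== undefined) endpoint.searchParams.set("to", String(to));
  if (limit !== undefined) endpoint.searchParams.set("limit", String(limit));

  return endpoint.toString();
}

const rangedUrl = buildCdxUrl({ domain: "example.com", from: 2010, to: 2018 });
const limitedUrl = buildCdxUrl({ domain: "example.com", output: "json", limit: 3 });
console.log(rangedUrl);
console.log(limitedUrl);
```

The resulting string can be pasted into the address bar or passed to a download tool of your choice.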

You can find a full rundown of the available Wayback CDX Server API filtering options on GitHub.

Extract URLs using the browser’s console

A browser’s console allows developers and designers to instantly try out their code. To extract links from any website, copy the code below, paste it into your browser console, and hit enter. Hyperlinks will be extracted from the webpage and displayed in the console. For a single page, this is usually the quickest way to extract URLs because nothing needs to be installed. Several variations of the JavaScript code are given below; they rely on $$(), the console's shorthand for document.querySelectorAll().

Extract URLs & corresponding anchor text

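The following code works across browsers and extracts URLs along with their anchor text.

var urls = $$('a'); // $$(selector) is the console shorthand for document.querySelectorAll
for (var i = 0; i < urls.length; i++) {
  // Print the link's index, its anchor markup, and its resolved URL
  console.log("#" + i + " > " + urls[i].innerHTML + " >> " + urls[i].href);
}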

If you are using Chrome or Firefox, use the following code for a styled version of the same output.

var urls = $$('a');
for (var i = 0; i < urls.length; i++) {
  // Each %c applies the next CSS string to the text that follows it
  console.log("%c#" + i + " > %c" + urls[i].innerHTML + " >> %c" + urls[i].href, "color:red;", "color:green;", "color:blue;");
}

Extract URLs only 

Use the following code to extract just the links without the anchor text.

var urls = $$('a');
for (var i = 0; i < urls.length; i++) {
  console.log(urls[i].href);
}

Extract external URLs only

External links are the ones that point outside the current domain. If you want to extract only the external URLs, use the following code.

var links = $$('a');
// Walk the list from last to first; i >= 0 so the first link is not skipped
for (var i = links.length - 1; i >= 0; i--) {
    if (links[i].host !== location.host) {
       console.log(links[i].href);
    }
}
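
The same host comparison can be written as a standalone function that works on a plain list of URLs, for example one already saved to a text file. This is a sketch only: externalUrls is a hypothetical name, and unlike the console snippet it also drops non-HTTP links such as mailto:, which is a design choice rather than something the original code does.

```javascript
// Hypothetical helper: keep only the URLs whose host differs from the page's host.
function externalUrls(hrefs, pageUrl) {
  const currentHost = new URL(pageUrl).host;
  return hrefs.filter((href) => {
    try {
      // Relative links are resolved against the page URL first.
      const resolved = new URL(href, pageUrl);
      const isWebLink = resolved.protocol === "http:" || resolved.protocol === "https:";
      return isWebLink && resolved.host !== currentHost;
    } catch (err) {
      return false; // skip values that cannot be parsed as URLs
    }
  });
}

// Invented sample input for illustration.
const sampleHrefs = [
  "https://example.com/about",
  "/contact",
  "https://other.org/file.pdf",
  "mailto:someone@example.com",
];
const onlyExternal = externalUrls(sampleHrefs, "https://example.com/page");
console.log(onlyExternal);

// In a browser console, the input could be gathered with:
//   Array.from(document.querySelectorAll('a'), function (a) { return a.href; })
```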

Extract URLs with a specific extension

If you would like to extract links with a particular extension, paste the following code into the console and pass the extension, wrapped in quotes, to the getLinksWithExtension() function. Please note that the code extracts links from anchor tags (<a></a>) only, not from other tags such as script or image tags.

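function getLinksWithExtension(extension) {
    // Select anchors whose href attribute ends with the given extension
    var links = document.querySelectorAll('a[href$="' + extension + '"]'),
        i;

    for (i = 0; i < links.length; i++) {
        console.log(links[i].href); // log the URL rather than the element
    }
}
getLinksWithExtension('mp3') // change mp3 to any extension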
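
One limitation of the attribute selector is that it only matches when the href string itself ends with the extension, so a link such as song.mp3?dl=1 is missed. A possible workaround, sketched below, checks the path portion of each URL instead. The hasExtension function is a hypothetical helper, not part of any browser API.

```javascript
// Hypothetical helper: test whether a URL's path ends with the given extension,
// ignoring any query string or fragment and ignoring letter case.
function hasExtension(href, extension, baseUrl) {
  let pathname;
  try {
    pathname = new URL(href, baseUrl).pathname;
  } catch (err) {
    return false; // not a parseable URL
  }
  // Accept both "mp3" and ".mp3" as the extension argument.
  const suffix = "." + extension.replace(/^\./, "").toLowerCase();
  return pathname.toLowerCase().endsWith(suffix);
}

// Invented sample URLs for illustration.
const candidateUrls = [
  "http://example.com/audio/song.mp3?dl=1",
  "http://example.com/SONG.MP3#t=10",
  "http://example.com/mp3",
  "http://example.com/song.mp4",
];
const mp3Urls = candidateUrls.filter(function (href) {
  return hasExtension(href, "mp3");
});
console.log(mp3Urls);
```

Requiring the leading dot also avoids matching paths that merely end in the same letters, such as /mp3.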