Long list of Apple domains

October 21, 2020

This post documents an effort to find the various filetypes that might exist across a list of 597 Apple subdomains that was recently found on a random website. Unfortunately, the original URL hosting the list was not documented.

Regardless, a dry run was executed against all 597 domains using wayback_machine_downloader with the -l option switched on, which prevents any downloads from occurring. While the script ran, its stdout was logged to a text file to document the process. This resulted in a 1.05GB log file. Although the script was only set to curate pkg, as, hqx, cpt, bin, sea, sit, sitx, dd, and pit files, it’s clear that many other filetypes were documented in the log file.

Block of domains

This block of Apple subdomains is the source against which wayback_machine_downloader will be applied to curate (download) files. Some of the subdomains may never have been captured by the Wayback Machine, for a variety of reasons.

Shell script used to list urls

Before curation, wayback_machine_downloader was employed to sample the block of subdomains and collect a list of every directory available for each subdomain. This process was logged, generating a 2.52GB text log file. The following shell script was used to generate that log file. The -l option prevented downloads.

# Dry run: list matching file URLs for each subdomain without downloading.
while read -r line; do
    wayback_machine_downloader -l -d "${line}_IA" -c6 --only '/\.(pkg|as|hqx|cpt|bin|sea|sit|sitx|dd|pit)$/i' "$line"
done < /path/to/domains.txt | tee -a log.txt
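
A domain list copied from a web page often carries blank lines, stray whitespace, or duplicate entries, and each of those would cost a wasted lookup in the loop above. What follows is a minimal sketch of a cleanup pass, assuming one domain per line. The function name clean_domains is hypothetical and is not part of wayback_machine_downloader. Note that it sorts the list, so the original order is not preserved.

```shell
# clean_domains: hypothetical helper, not part of wayback_machine_downloader.
# Reads a domain list on stdin, strips carriage returns and surrounding
# whitespace, drops blank lines and "#" comment lines, lowercases the
# names, and removes duplicates (the output is sorted).
clean_domains() {
    tr -d '\r' \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | grep -v -e '^$' -e '^#' \
        | tr '[:upper:]' '[:lower:]' \
        | sort -u
}
```

It could be run once before the loop, for example as clean_domains < /path/to/domains.txt > domains.clean.txt, with the loop then reading the cleaned file.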

Descriptions of the options applied

Short form   Long form              Description
-l           --list                 Only list file URLs in a JSON format with the archived timestamps; won’t download anything
-s           --all-timestamps       Download all snapshots/timestamps for a given website
-d           --directory PATH       Directory to save the downloaded files into
-c           --concurrency NUMBER   Number of multiple files to download at a time

-s was omitted because it forces every timestamp to download into its own directory, which requires a substantial amount of time to post-process. See this GitHub issue for more.

The Finder window on the left shows an ongoing download with the -s flag omitted. On the other hand, applying the -s flag results in a very complex directory tree that packages each snapshot into its own timestamp folder. For the purpose of this project, packaging into timestamp folders is not favorable.

Filetypes found in the log file

Scrolling through the dry-run log.txt file, some familiar filetypes were found, but it quickly became obvious that going through a 2.5GB file manually would be both impractical and very lengthy.

The following four shell commands were considered in an attempt to isolate the file extensions scattered throughout the log.txt file, but due to the size of the file, isolation was not without its challenges. Each of these commands produced different results. The first command was rather effective, trimming most of the noise out of the file, while the fourth was much more conservative, requiring more manual post-editing than the first.

sed -E 's/^.*(\.[^\.]+)$/\1/' logfile.txt | sort | uniq -c
grep 'http://.*[^/]$' logfile.txt | awk -F/ '{print $NF}' | sed -E 's/^.*(\.[^\.]+)$/\1/' | sort | uniq -c
grep 'http://.*[^/]$' logfile.txt | awk -F/ '{print $NF}' | cut -d. -f2 | sort | uniq -c
grep 'http://.*[^/]$' logfile.txt | grep -v '/?' | awk -F/ '{print $NF}' | sed -E 's/^.*(\.[^\.]+)$/\1/' | sort | uniq -c
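
Much of the noise those commands leave behind comes from query strings, mixed-case extensions, and URLs that end in a directory. As a variant rather than the method used for this post, here is a rough sketch that normalizes those cases before counting. It assumes the log lines contain plain http or https URLs; the function name ext_tally is hypothetical.

```shell
# ext_tally: hypothetical helper. Reads lines on stdin, pulls out http(s)
# URLs, and prints "count extension" pairs, most frequent first.
ext_tally() {
    # 1. extract each URL, stopping at whitespace or a double quote
    # 2. strip any query string or fragment
    # 3. keep only the last path segment (skip bare host names)
    # 4. keep names that end in a dot followed by letters or digits
    # 5. reduce each name to its extension and lowercase it
    grep -Eo 'https?://[^"[:space:]]+' \
        | sed -E 's/[?#].*$//' \
        | awk -F/ 'NF > 3 { print $NF }' \
        | grep -E '\.[A-Za-z0-9]+$' \
        | sed -E 's/^.*\.([^.]+)$/\1/' \
        | tr '[:upper:]' '[:lower:]' \
        | sort | uniq -c | sort -k1,1nr -k2,2
}
```

Run as ext_tally < log.txt, this counts One.SIT and two.sit?x=1 under the same sit entry, which the commands above would report separately.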

Number of files curated

The script considered only certain filetypes for curation. As mentioned earlier, other filetypes could be curated, notably PDF and text files, but the latter could result in many robots.txt files travelling inbound.

At first glance a robots.txt file may appear to be of little interest, but these files sometimes reveal interesting subdomains that are ignored by spiders, miscellaneous notes, and other tidbits of data. The Internet Archive’s Wayback Machine dutifully ignores the contents of robots.txt files.
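
Once a batch of robots.txt files has been collected, the paths they hide from spiders can be pulled out in one pass. This is only a sketch, assuming the files sit somewhere under a single download directory and use the usual "Disallow: /path" line format; the function name robots_paths is hypothetical.

```shell
# robots_paths: hypothetical helper. Searches DIR (default: current
# directory) for robots.txt files and prints each unique Disallow path.
robots_paths() {
    find "${1:-.}" -type f -name 'robots.txt' -exec cat {} + \
        | tr -d '\r' \
        | grep -i '^[[:space:]]*disallow:' \
        | sed -E 's/^[^:]*:[[:space:]]*//; s/[[:space:]]*(#.*)?$//' \
        | grep -v '^$' \
        | sort -u
}
```

Empty Disallow lines, which permit everything, are dropped, and trailing comments are stripped before the paths are de-duplicated.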

How filetype tallies were calculated
Counts the number of files in a directory

This command walks (traverses) the working directory tree and provides the total number of files while ignoring .DS_Store files.

find . ! -name '.DS_Store' ! -type d | wc -l

Counts the number of filetypes, over a directory structure

This command traverses the working directory tree, finds all filetypes with the exception of .DS_Store files, and displays the number of files found for each filetype.

find . ! -name '.DS_Store' ! -type d|sed -E 's/^.*(\.[^\.]+)$/\1/'|sort|uniq -c

Provides a file extension count per folder inside the main folder

This command works like the one above, but reports the number of files of each filetype separately for every top-level directory. It is responsible for the data in the table below.

for d in *; do echo "$d"; find "$d" ! -name .DS_Store ! -type d|sed -E 's/^.*(\.[^\.]+)$/\1/'|sort|uniq -c; done
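
One caveat with the sed expression used in these commands: it matches against the whole path, so a file with no extension inside a download directory whose name contains dots gets reported under a bogus "extension" that starts at the last dot of the directory name. The sketch below is one way around that, offered as an alternative and not as the command behind the table. It looks only at each file's basename; the function name ext_count is hypothetical.

```shell
# ext_count: hypothetical helper. Tallies extensions under DIR (default ".")
# using only each file's basename, so dots in directory names are ignored.
# Extensions are lowercased; files without one are counted as "(none)".
ext_count() {
    find "${1:-.}" ! -name '.DS_Store' ! -type d \
        | awk -F/ '{ print $NF }' \
        | awk -F. 'NF > 1 && $NF != "" { print "." tolower($NF); next }
                   { print "(none)" }' \
        | sort | uniq -c
}
```

It drops into the per-folder loop in place of the find and sed pair: for d in *; do echo "$d"; ext_count "$d"; done. Dotfiles such as .htaccess are still counted as an extension of their own.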

Filetypes curated on the first run

Some of the domains in the original block are not listed in the table below because, when the Wayback Machine probed the domain, the object no longer existed on the server, the link followed was outdated or inaccurate, or a robots.txt file set a prohibition. In some cases client or server errors (40x and 50x) or redirections (30x) may have caused the unavailability.

The first run was restricted to the following common Macintosh filetypes

Domain - Macintosh filetypes .as .bin .hqx .pdf .pkg .sit .sea .sitx .cpt 2 2 37 8 5 11 2 2 8 11 25 17 401 23 260 9115 21 11 16 152 40 1 1 4 1 4 28 61 2 138 11 3 5 722 31974 20 2 1 5 51 448 63 23 6 7 1 50 4 3 5 5 18 2 1 30 2 206 182 4 3 3 46 5 192 2 3 20 19 450 1 75 7 3 4 4 3 10 15 33 8 138 275 238 6218 2 539 12 64 3 1 11 26 84 783 1 1 82 12 1 14 3 9 893 1 1 82 1 82

Filetypes curated on the second run

The second run was restricted to the following disk image filetypes

Domain - Macintosh images .as .bin .hqx .pdf .pkg .sit .sea .sitx .cpt 2

Filetypes curated on the third and final run

The third run was restricted to txt filetypes; robots.txt files were isolated from other .txt files.
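
The isolation step can be approximated with two find passes over the download directory. This is a sketch under the assumption that everything from the third run lives under one directory; the function name split_txt and the two list filenames are hypothetical.

```shell
# split_txt: hypothetical helper. For DIR (default "."), writes robots.txt
# paths to OUT/robots.list and all other .txt paths to OUT/other-txt.list.
# The body runs in a subshell so its variables do not leak to the caller.
split_txt() (
    dir=${1:-.}
    out=${2:-.}
    find "$dir" -type f -iname 'robots.txt' | sort > "$out/robots.list"
    find "$dir" -type f -iname '*.txt' ! -iname 'robots.txt' | sort > "$out/other-txt.list"
)
```

Counting lines in each list with wc -l then gives the two tallies separately. The -iname test is available in both the BSD find shipped with macOS and GNU find, but it is not part of POSIX.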

Down the digital rabbit hole

As files travel inbound, some 734 sit, hqx, cpt.bin, bin, sea and other archive filetypes already populate their destination folder, and these are from the now defunct domain alone. It wouldn’t be too far-fetched to assume that some of these files represent developer works, in which case documentation should also be collected as accompanying material, or the obscure files might lack context. Some files from the domain will be familiar to some: RasterOps_3.2.1.cpt.bin, RadiusWare3.4.1.cpt.bin, Halo_1.05.3_Updater.sit, Halo_1.03_Updater.sit, Glider9.sit, and the list goes on.

Stage one downloads

The following file formats have been retrieved as a first step. Normally filetypes outside those listed below would be considered, but given that these are Apple subdomains, this selection could yield some interesting results.