February 26, 2020
Update: Links are now pointing in the right direction. Whitefiles outlined some file types not originally considered, so a follow-up scrape has been performed to secure any missed files. That scrape has concluded and all posts are now online.
The repo can be directly accessed from the Garden’s CDN.
For the time being, only hqx|sit|sitx|dd|pkg|bin|sea|cpt files will be downloaded during this first phase. URLs will also be collected along the way. It was discovered that some .hqx and .sit files are partials and don't decompress, but pace remains the priority. That said, files are generally not being decompressed, although seeing 3KB-6KB files has raised suspicions.
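A minimal sketch of how this first-phase extension filter might look in a bash script. The helper function name `is_phase_one` is hypothetical, not from the actual scraper; it only shows the idea of gating downloads on the extension list above.

```shell
# Hypothetical helper: accept a filename only if its extension is in the
# phase-one list (hqx|sit|sitx|dd|pkg|bin|sea|cpt). Returns 0 on a match.
is_phase_one() {
  case "${1##*.}" in
    hqx|sit|sitx|dd|pkg|bin|sea|cpt) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: skip anything that isn't a phase-one archive format.
for f in "StuffIt Game.sit" notes.txt installer.hqx; do
  is_phase_one "$f" && printf 'would download: %s\n' "$f"
done
```

With a tool like wget, the same filter could instead be expressed with its `--accept hqx,sit,sitx,dd,pkg,bin,sea,cpt` option, but the scraper's actual mechanism isn't described in the post.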
Once the URLs begin to deplete, the entire list of collected URLs will be scraped again, this time for pdf|txt|html files. PDFs generally represent manuals or other literature that points back to the software downloads. TXT files may include release notes, readme files, and other supporting texts that can shed light on titles, especially the more obscure ones. HTML files sometimes surface interesting content, not necessarily full pages, but since HTML usually references images, those images may need to come down as well.
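The image-chasing step for HTML files could be sketched as below. The filename `page.html` is an assumption for illustration; the real scrape would loop over whatever HTML it has saved, and a production version would want a proper HTML parser rather than grep.

```shell
# Hypothetical sketch: extract the src targets of <img> tags from a saved
# HTML page so the images it depends on can be queued for download too.
grep -oE '<img[^>]+src="[^"]+"' page.html | sed -E 's/.*src="([^"]+)".*/\1/'
```

Each emitted path would then be resolved against the page's URL and fetched alongside it.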
Up to now, the bash script was limiting downloads to roughly the year 2002. Going forward, no year limit will be set for software.