If you have a home delivery subscription to the New York Times (even only the Sunday Times), check out the TimesMachine — a collection of full-page image scans of the newspaper from 1851-1922. That’s every issue and every page and article, advertisements and all, viewable in their original format.
The Times developers describe how this was done:
“Using Amazon Web Services, Hadoop and our own code, we ingested 405,000 very large TIFF images, 3.3 million articles in SGML and 405,000 xml files mapping articles to rectangular regions in the TIFF’s. This data was converted to a more web-friendly 810,000 PNG images (thumbnails and full images) and 405,000 JavaScript files – all of it ready to be assembled into a TimesMachine. . . .”
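To make that conversion step a little more concrete, here is a rough Python sketch of what turning one page's XML mapping into a JavaScript file might look like. The XML layout, the attribute names, the `pageData` variable and both function names are invented for illustration; the quote does not describe the Times's actual schema or code. The sketch also checks the arithmetic in the quote: two PNGs (a thumbnail and a full image) for each of 405,000 pages gives 810,000.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical mapping file for one scanned page. The real TimesMachine
# schema is not given in the post, so these names are assumptions.
SAMPLE_XML = """
<page id="1851-09-18-p1" tiff="1851-09-18-p1.tif">
  <article id="a1" x="0" y="120" width="800" height="950"/>
  <article id="a2" x="820" y="120" width="780" height="400"/>
</page>
"""


def page_xml_to_js(xml_text: str) -> str:
    """Convert one page's article-to-rectangle mapping into a small JS file."""
    page = ET.fromstring(xml_text.strip())
    regions = [
        {
            "article": art.get("id"),
            # Rectangle stored as [x, y, width, height] in TIFF pixel units.
            "rect": [int(art.get(k)) for k in ("x", "y", "width", "height")],
        }
        for art in page.findall("article")
    ]
    payload = {"page": page.get("id"), "regions": regions}
    # Emit a plain assignment so a browser could load it with a <script> tag.
    return "var pageData = " + json.dumps(payload, sort_keys=True) + ";"


def png_count(pages: int, variants_per_page: int = 2) -> int:
    """Number of PNGs if every page gets a thumbnail and a full image."""
    return pages * variants_per_page


if __name__ == "__main__":
    print(page_xml_to_js(SAMPLE_XML))
    print(png_count(405_000))
```

Since each page is handled independently of every other page, a job like this splits up naturally across many machines, which is presumably why Hadoop was a good fit.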