The papers in this post describe research on archiving and preserving digital information. Most papers have multiple authors, although I've credited by name only the author(s) who presented the paper. Entire papers are available in the conference proceedings; let me know if you'd like to see one.
Using Timed-Release Cryptography to Mitigate the Preservation Risk of Embargo Periods
Michael L. Nelson, et al.
The speaker humorously described his team's research as one more in a series of preservation hacks. This research focuses on the problem of preserving open-access journals that have a subscription-only embargo period. If the journal suddenly stops publishing, open-access readers lose the issues that were still under embargo. As an interesting side note, Nelson pointed out that the typical embargoed journal model is the inverse of the typical newspaper model, since newspapers generally offer free current content and paid access to archives.
Nelson refers to the time between a journal issue's current release and its open-access release (usually 6-12 months) as the Preservation Risk Interval. LOCKSS, CLOCKSS, and Portico solve the problem of this risk interval for paid subscribers, but there is no current model for preserving open access to journal issues that will at a known future point become public. Nelson and his team are working within the concept of "lazy preservation" (McCown 2007, Smith 2008), which attempts to reconstruct lost public Web content from search engine caches. The team attempts to extend this concept to encrypted embargoed journal content, but must then address the further problem of preventing hackers from breaking a repository's encryption before the embargo period ends.
To solve this problem, enter timed-release cryptography. Where "regular" public/private key encryption can be attacked by brute force in parallel (multiple machines working together to solve a puzzle), timed-release cryptography creates a puzzle that must be solved with serial computing: each step depends on the result of the step before it. This means that there's no advantage to throwing more machines at the problem, because each computer attempting to break the encryption must work the entire puzzle in order. A faster processor still helps, so a timed-release encryption might be broken in 35 years by a 1999 computer, but in only 1 year by a 2033 computer - the number of sequential steps remains the same regardless of the processor's power, and only the time per step changes. The team's hypothesis is that timed-release cryptography can solve the problem of reliably encrypting embargoed journal content in a repository until its public release date.
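To make the "serial puzzle" idea concrete, here is a minimal sketch of one well-known timed-release construction, the repeated-squaring time-lock puzzle of Rivest, Shamir, and Wagner. This post does not say which scheme the team used, so treat the choice as an assumption. The function names, the toy primes, and the step count are all made up for illustration; a real puzzle would use large secret primes, and the locked value would normally be a symmetric key rather than the content itself.

```python
def make_puzzle(p, q, a, t, secret):
    """Creator side: knowing the primes p and q gives a shortcut."""
    n = p * q
    phi = (p - 1) * (q - 1)
    # Reduce the huge exponent 2**t modulo phi(n); valid when gcd(a, n) == 1.
    e = pow(2, t, phi)
    b = pow(a, e, n)              # equals a**(2**t) mod n, computed cheaply
    locked = (secret + b) % n     # mask the secret (requires secret < n)
    return n, a, t, locked


def solve_puzzle(n, a, t, locked):
    """Solver side: without p and q, the t squarings must be done in order."""
    b = a % n
    for _ in range(t):
        b = (b * b) % n           # each step needs the previous result
    return (locked - b) % n


# Toy parameters (assumptions): two Mersenne primes and a small step count.
p, q = 2**31 - 1, 2**61 - 1
secret = 123456789
n, a, t, locked = make_puzzle(p, q, a=2, t=10_000, secret=secret)
recovered = solve_puzzle(n, a, t, locked)
print(recovered == secret)
```

The loop in `solve_puzzle` cannot be split across machines because every squaring consumes the output of the previous one, which is the property described above. Raising `t` lengthens the lock; a faster processor shortens the wall-clock time but not the number of steps.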
The research team created a repository for embargoed journal articles to test this hypothesis. The repository's contents were locked at various time-lock strengths, so that items open as their embargoes expire over time. The repository was implemented with mod_oai, an Apache module that exposes a web server's content for harvesting through OAI-PMH, and CRATE, a packaging model. The team was able to show that this repository model would work, and pointed out that it's a complement to other preservation methods, including LOCKSS, CLOCKSS, and Portico.
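One way to picture "time-lock strengths": if a puzzle's strength is a count of sequential steps, the count for each embargo bucket could be derived from how long the embargo has left and how many steps per second a solver is assumed to manage. The sketch below is only an illustration of that arithmetic; the function name, the rate of one million steps per second, and the bucket lengths are all hypothetical and do not come from the paper.

```python
SECONDS_PER_DAY = 86_400


def steps_for_embargo(days_remaining, steps_per_second):
    """Sequential steps that should take roughly the remaining embargo to run."""
    return days_remaining * SECONDS_PER_DAY * steps_per_second


# Hypothetical solver speed and embargo buckets (in days).
ASSUMED_STEPS_PER_SECOND = 1_000_000
buckets = {
    days: steps_for_embargo(days, ASSUMED_STEPS_PER_SECOND)
    for days in (30, 180, 365)
}

for days, steps in buckets.items():
    print(f"{days:>3} days -> {steps:,} steps")
```

The assumed rate is the weak point of any such estimate: if real hardware turns out faster than assumed, the lock opens early, which is the trade-off described in the 1999-versus-2033 example.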
What Happens When Facebook is Gone?
Frank McCown, et al.
A fascinating paper that asks the question, "What happens to your data when your Facebook account disappears?" Many people spend a lot of time and effort on their Facebook pages, and could be devastated by the loss of all that information - photos, messages from friends and family, etc. The researchers have been working on ways of archiving Web 2.0 content. This is a difficult problem where Facebook is concerned, because Facebook's terms of service specifically forbid any automated crawling or data extraction, even of a user's own content for personal use. The researchers have created an archiving framework for personal or third-party preservation of this kind of information, but haven't yet completed a viable software application. Their goal is to produce an automated archiver that installs as a Firefox add-on. While they admit that such a tool would totally violate Facebook's terms of service, they hope to use its existence to push Facebook into changing its policies.
Robust Registration of Manuscript Images
W. Brent Seales, et al.
The researchers describe a process of 3D scanning that can account for warped surfaces and transform them into flat, readable surfaces. This project worked on a 10th-century manuscript of Homer's Iliad (the earliest known complete copy), which is very interesting to medieval and classical scholars not only for its text, but also for the marginalia scribbled in by centuries of readers and scribes. Over time, the vellum that the text was written on has warped, so traditional flat scanning or photographic methods are not able to capture completely readable images of the pages. The manuscript has also been rebound many times (trimmed a little more each time), so many marginal comments are now too close to the center of the book to be reliably imaged at all. The speaker described some of the mathematics behind the 3D scanning, but most of the details are in the paper. Suffice it to say that the example images before and after flattening are pretty amazing. The project also captured images of the book taken under different lights - UV, infrared, natural, etc. - and worked to overlay all those images into a final, flattened product. The results reveal text that's not readable (or sometimes even visible) in the faded ink on the original vellum. Fascinating stuff.