A field trip to the Internet Archive

SAN FRANCISCO -- Many people think of the Internet Archive only as the home of the Wayback Machine, the site that lets you see what pages looked like years ago.

archive_building.jpg

But the archive also exists in the real world: It is a 501(c)(3) nonprofit organization that makes its home in a former church in the Richmond neighborhood here. Archive founder Brewster Kahle took an hour to show me around the place and talk about its work -- an increasing amount of which has little to do with old Web pages.

The archive moved into this building, an old Christian Science church, last November, and as a result its lobby still features a large collection of boxes. (Kahle noted that the building dates to 1923, "the last year of the public domain"; most works created since then remain under copyright.) The main hall still looks like a church, down to the pews, but Kahle aims to eventually rebuild it into a library of sorts. Kahle and other staffers have their offices on the lower level.

archive_scanners.jpg

Next door, the old Christian Science reading room has been turned into a scanning center, as part of the archive's mission to preserve print as well as pixels. On each side, staffers were working specialized scanners -- operated by pedals, like old sewing machines -- that photograph two pages of a book at a time. In the center, other employees were running computer-driven microfilm scanners. "That looks like the 1900 census," Kahle said as he peered over one staffer's shoulder at a screenful of handwritten documents.

Poring over page after page in a room made hot by that accumulation of computing machinery seemed like it could get a tad repetitive. I asked Kahle if there was a risk of burnout. Yes, he said, pointing to himself as an example of the wrong sort of person for that work: "I would get fired!" But some employees, he said, have been there three years.

One of the archive's newer projects is a site called Open Library, which both catalogues books and provides access to electronic copies of them. Anyone can download public-domain works, while visually impaired users can access text-to-speech versions of works through a program set up by the Library of Congress. The archive is also working to set up a system for direct downloads of e-book loans.

archive_doormat.jpg

Much of the archive's work with books might seem to duplicate what Google is already doing with its Google Books site. But Kahle (who doesn't own an e-book reader) objected to the way some libraries have begun relying on Google's collections and "de-accessioning" their paper copies -- that is, trashing them, which seems the sort of thing that happens only in science-fiction novels. He'd rather see libraries keep their original source material while also using the Internet to make that content available to more people. "Let's not lose it all," he said.

Funding for the archive, director of administration Jacques Cressaty said, comes from foundation grants and donations (plus a subsidy from the city of San Francisco to underwrite some employees' salaries) and from fees earned by providing indexing and scanning services to other libraries. Last year, he said, about 40 percent of its income was contributed and 60 percent came from services.

I wrapped up our interview by asking Kahle for his preferred file formats for long-term storage, since I get that kind of question fairly often from readers. He said the archive uses FLAC (Free Lossless Audio Codec) for music, adopted H.264 for video storage after trying five other formats, uses JPEG for photos and employs a related format, JPEG 2000, for text-heavy images. But he also said that for personal storage, PDF or nearly universally supported commercial formats -- even Microsoft Office -- would be fine, too.

Anything else you'd like to know about the archive or Kahle? Post your questions in the comments, and I'll try to get them answered.

By Rob Pegoraro  |  May 18, 2010; 7:16 PM ET
Categories:  Digital culture  

Comments

Interesting that they use a lossless format (FLAC) for audio, but lossy formats for images. Any chance they mentioned why?

Posted by: sniz15 | May 18, 2010 8:12 PM

I am curious about the mention of PDF. This may be a bit of technical minutiae, but PDF/A (http://en.wikipedia.org/wiki/PDF/A) is considered to be a long-term preservation format. From the definition at Wikipedia: "PDF/A is in fact a subset of PDF, obtained by leaving out PDF features not suited to long-term archiving." I wonder if Brewster Kahle thinks this tightened format definition is better for long-term preservation, or if the more general PDF format is good enough.

Posted by: dltj | May 19, 2010 9:06 AM

sniz15 asks: lossy formats for images. Any chance they mentioned why?

We used to store uncompressed TIFF (50MB/image) and then RAW (17MB/image), but when you are scanning 1000 books/day it gets big, and also, we (folks from Harvard, Library of Congress, UC and Internet Archive) did studies to find out what kind of degradation we get if we use jpeg-2000 at about 1MB per image and we found very very little. So for mass scanning we are using jpeg-2000. A big problem is that it is not supported in browsers.

Try zooming in on one of our books-- I hope you will find it pretty good: http://www.archive.org/stream/lifeofabrahamli2463tarb#page/n7/mode/2up

Posted by: brewster2 | May 19, 2010 10:34 AM
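
To put the per-image sizes from the comment above in perspective, here is a rough back-of-the-envelope sketch. The 50MB, 17MB and 1MB figures and the 1,000 books a day come from that comment; the number of images per book (300) is purely a hypothetical assumption for illustration, as are the function and variable names.

```python
# Per-image sizes quoted in the comment above, in megabytes.
MB_PER_IMAGE = {"TIFF (uncompressed)": 50, "RAW": 17, "JPEG 2000": 1}

BOOKS_PER_DAY = 1000   # figure from the comment
IMAGES_PER_BOOK = 300  # ASSUMPTION: hypothetical average, not from the source


def daily_terabytes(mb_per_image,
                    books_per_day=BOOKS_PER_DAY,
                    images_per_book=IMAGES_PER_BOOK):
    """Return storage consumed per day in decimal terabytes (1 TB = 1,000,000 MB)."""
    return mb_per_image * books_per_day * images_per_book / 1_000_000


if __name__ == "__main__":
    for name, size in MB_PER_IMAGE.items():
        print(f"{name}: {daily_terabytes(size):.1f} TB/day")
```

Whatever the real page count, the ratios are what matter: at these sizes, JPEG 2000 needs about one-fiftieth the space of uncompressed TIFF and one-seventeenth that of RAW.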

dltj asks: what about the PDF/A?

Yes, PDF/A is better for long-term access. For our book scanning we were disappointed that it did not support jpeg-2000 as an image layer (or at least not originally), and we found that to be a dramatic enough improvement in quality per megabyte over a jpeg layer for books that we chose normal PDF. We also don't think of this as the preservation format for these books.

What most end-users are doing is scanning documents with their scanner or taking office documents and writing them to disk. For these purposes, PDF, we have found, works quite well. PDF in this case is a container format that keeps metadata, images, and text together, and sometimes even has page numbers, chapter starts, etc.

When users upload these to the Internet Archive for long-term preservation, we use open source tools to process these files, and Adobe has not gone after those developers, so we are happy with the format.

Posted by: brewster2 | May 19, 2010 10:44 AM

What an interesting project! Thanks, Mr. Kahle, and Rob for the information & technical specs.

As a librarian, I share Mr. Kahle's concerns about thoughtless de-accessioning, and sadly it is becoming farther from sci-fi every day.

Posted by: MiuBot116 | May 19, 2010 10:53 AM

Will older music sets in SHN be transferred to FLAC?

Posted by: Hemisphire | May 19, 2010 11:04 AM

Hemisphire asks: Will older music sets in SHN be transferred to FLAC?

We are not migrating user uploads from SHN to FLAC yet. They are pretty big, and SHN is still commonly supported. If people are finding this a problem, please let us know on the archive.org forums.

Posted by: brewster2 | May 19, 2010 1:44 PM

Folks: If you were curious, "brewster2" really is Brewster Kahle (he confirmed that to me in an e-mail). So...

@brewster2: Thanks! If only the principals at Apple or Microsoft were as quick to answer reader queries here :)

- RP

Posted by: robpegoraro | May 19, 2010 2:44 PM

when will you begin doing the most
basic of corrections on your o.c.r.,
so your e-books can start to be free
of flaws that make them unreadable?

-bowerbird

Posted by: bowerbird1 | May 19, 2010 6:30 PM

I really, really want to be able to search the thing using means other than a URL or a date. I want to be able to search on terms I enter that are oriented toward content, not initially toward time.

If I could beg for one thing it would be different search dynamics.

Posted by: Nymous | May 19, 2010 7:11 PM

I can understand that it does not make sense to store the RAW files for text pages, but what about pages that have illustrations? Also, do you plan on making high-resolution scans of images?

Posted by: yetanotherpassword | May 21, 2010 3:53 PM

The comments to this entry are closed.

 
 

© 2010 The Washington Post Company