Sunday, January 6, 2013

Library of Congress Has Now Archived 170 Billion Tweets

The dream of a library of Twitter is getting closer to reality.

The Library of Congress announced Friday that it is just weeks away from completing its archive of all public tweets from Twitter's launch in 2006 through 2010, but technological challenges must be solved before the archive becomes usable.

So far, the national library has compiled a massive collection of about 170 billion tweets from that time period, and it expects to finish this stage of the archive by the end of January. The number of tweets added to the archive each day has more than tripled, from 140 million in early 2011 to 500 million as of October 2012.

"The Library's first objectives were to acquire and preserve the 2006-2010 archive; to establish a secure, sustainable process for receiving and preserving a daily, ongoing stream of tweets through the present day; and to create a structure for organizing the entire archive by date," Gayle Osterberg, director of communications for the Library of Congress, wrote in a blog post. "This month, all those objectives will be completed."

Twitter and the Library of Congress announced the initiative in April 2010, with the goal of preserving all public posts from the social network for cultural research purposes. Under the Library's agreement with Twitter, only public tweets are included in the archive, and tweets can be made available to researchers only six months after they are posted.

In the nearly three years since then, the Library has received hundreds of inquiries from researchers around the world interested in accessing that data, but it must first contend with the technological challenges of building and managing such a massive digital archive. For example, the Library noted in a separate white paper released Friday that a single search for a term in the Twitter archive can take 24 hours, which isn't workable.

"It is clear that technology to allow for scholarship access to large data sets is not nearly as advanced as the technology for creating and distributing that data," the Library said in the research paper. "Even the private sector has not yet implemented cost-effective commercial solutions because of the complexity and resource requirements of such a task."

For now, the Library is working to "develop a basic level of access" that will suffice for research purposes until better archival technologies become available.

Even when all of that is done, don't expect to have easy access to the archive. The Library has agreed not to make most of the archive easily downloadable on its website, and researchers must agree not to use the data in the archive for commercial purposes in order to gain access to it.
