Google famously started out as little more than a more efficient algorithm for ranking web pages. But the company also built its success on crawling the web: using software that visits every page in order to build up a vast index of online content.
A nonprofit called Common Crawl is now using its own web crawler to make a giant copy of the web, which it makes accessible to anyone. The organization offers up more than 5 billion web pages, available for free, so that researchers and entrepreneurs can try things otherwise possible only for those with access to resources on the scale of Google's.
"The web represents, as far as I know, the largest accumulation of knowledge, and thereâs so much you can build on top,â says entrepreneur Gilad Elbaz, who founded Common Crawl. âBut simply doing the huge amount of work thatâs necessary to get at all that information is a large blocker; few organizations ⦠have had the resources to do that."
New search engines are just one of the things that can be built using an index of the web, says Elbaz, who points out that Google's translation software was trained using online text available in multiple languages. "The only way they could do that was by starting with a massive crawl. That's put them on the way to build the Star Trek translator," he says. "Having an open, shared corpus of human knowledge is simply a way of democratizing access to information that's fundamental to innovation."
Elbaz says he noticed around five years ago that researchers with new ideas about how to use web data felt compelled to take jobs at Google because it was the only place they could test those ideas. He says Common Crawl's data will make it easier for novel ideas to gain traction, both in the world of startups and in academic research.
Elbaz is the founder and CEO of the big-data company Factual, and before that he founded a company that Google bought to form the basis of its ad business for web pages. Common Crawl also has Google's director of research, Peter Norvig, and MIT Media Lab director Joi Ito on its advisory board.
Common Crawl has so far indexed more than 5 billion pages, adding up to 81 terabytes of data, made available through Amazon's cloud computing service. For about $25, a programmer could set up an account with Amazon and get to work crunching Common Crawl data, says Lisa Green, Common Crawl's director. The Internet Archive, another nonprofit, also compiles a copy of the web and offers a service called the "Wayback Machine" that can show old versions of a particular page. However, it doesn't allow anyone to analyze all of its data at once in that way.
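Getting started is largely a matter of pointing standard cloud tooling at the public dataset. As a rough illustration (not Common Crawl's official instructions), the Python sketch below streams one crawl archive directly from Amazon S3; the "commoncrawl" bucket is the project's public bucket, but the object key shown and the use of the boto3 and warcio libraries are assumptions made for this example.

```python
# Minimal sketch: stream one Common Crawl archive file from Amazon S3 and
# print the URL and size of each captured page. The object key below is
# illustrative; real keys are listed in each crawl's path index files.
import boto3
from botocore import UNSIGNED
from botocore.config import Config
from warcio.archiveiterator import ArchiveIterator

# The bucket is public, so unsigned (anonymous) requests are enough.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

key = "crawl-data/CC-MAIN-2013-20/segments/0001/warc/example.warc.gz"  # hypothetical key
obj = s3.get_object(Bucket="commoncrawl", Key=key)

# Iterate over the WARC records in the (gzipped) stream.
for record in ArchiveIterator(obj["Body"]):
    if record.rec_type == "response":
        url = record.rec_headers.get_header("WARC-Target-URI")
        body = record.content_stream().read()
        print(url, len(body))
```

In practice the same loop would be run in parallel across many archive files on Amazon's compute service, which is what keeps the cost of a full pass over the corpus in the tens of dollars rather than the cost of running a crawler.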
Common Crawl has already inspired or helped out some new web startups. TinEye, a "reverse" search engine that finds images similar to one provided by the user, made use of early Common Crawl data to get started. One programmer's personal project using Common Crawl data to measure how many of the web's pages connect to Facebook (some 22 percent, he concluded) led to his securing funding for a startup, Lucky Oyster, based on helping people find useful information in their social data.
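For a sense of what such a measurement involves, here is a minimal sketch (not Lucky Oyster's actual method) that counts how many pages in a single, already-downloaded crawl archive mention facebook.com; the file name is hypothetical and the substring check is deliberately crude.

```python
# Rough sketch: estimate the share of pages in one local WARC file that
# reference facebook.com. Real measurements would scan many archives and
# parse links properly rather than matching a raw substring.
from warcio.archiveiterator import ArchiveIterator

total = with_facebook = 0
with open("example.warc.gz", "rb") as stream:  # hypothetical local file
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        total += 1
        html = record.content_stream().read().lower()
        if b"facebook.com" in html:  # crude check for links, widgets, etc.
            with_facebook += 1

if total:
    print(f"{with_facebook}/{total} pages reference facebook.com")
```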
Other ideas enabled by the project emerged from a contest run last year that awarded prizes for the best use cases. One of the winners used Wikipedia links in crawl data to build a service capable of defining the meanings of words; another tried to determine public attitudes toward congressional legislation by analyzing the content of online discussions about new laws.
Rich Skrenta, cofounder and CEO of search engine startup Blekko, says Common Crawl's data fulfills a definite need in the startup community. He says Blekko has been approached by startups whose technology needs access to large collections of online data. "That kind of data is now easily available from Common Crawl," says Skrenta, whose company contributed some of its own data to the project in December 2012. Blekko shared information from its system that categorizes web pages by content, for example labeling whether they contain pornography or spam.
Ben Zhao, an assistant professor at the University of California, Santa Barbara, who uses large collections of web data for research into activity on social sites, says Common Crawl's data is likely unique.
"Fresh, large-scale crawls are quite rare, and I am not personally aware of places to get large crawl data on the web," he says.
However, Zhao notes that some of the most interesting and valuable parts of the web won't be well represented in Common Crawl's data: "Social sites are quite sensitive about their content these days, and many implement anti-crawling mechanisms to limit the speed anyone can access their content."
To access that data, researchers must strike up relationships with the companies and rely on whatever they are willing to release, a route less available to startups, which may be seen as competition.
Image courtesy of Flickr, Creativity103
This article was originally published at MIT Technology Review.