Google famously started out as little more than a more efficient algorithm for ranking web pages. But the company also built its success on crawling the web: using software that visits every page in order to build up a vast index of online content.
A nonprofit called Common Crawl is now using its own web crawler to make a giant copy of the web, which it makes accessible to anyone. The organization offers up more than 5 billion web pages, available for free, so that researchers and entrepreneurs can try things otherwise possible only for those with access to resources on the scale of Google's.
"The web represents, as far as I know, the largest accumulation of knowledge, and thereâs so much you can build on top,â says entrepreneur Gilad Elbaz, who founded Common Crawl. âBut simply doing the huge amount of work thatâs necessary to get at all that information is a large blocker; few organizations ⦠have had the resources to do that."
New search engines are just one of the things that can be built using an index of the web, says Elbaz, who points out that Google's translation software was trained using online text available in multiple languages. "The only way they could do that was by starting with a massive crawl. That's put them on the way to build the Star Trek translator," he says. "Having an open, shared corpus of human knowledge is simply a way of democratizing access to information that's fundamental to innovation."
Elbaz says he noticed around five years ago that researchers with new ideas about how to use web data felt compelled to take jobs at Google because it was the only place they could test those ideas. He says Common Crawl's data will make it easier for novel ideas to gain traction, both in the world of startups and in academic research.
Elbaz is the founder and CEO of the big-data company Factual, and before that he founded a company that Google bought to form the basis of its ad business for web pages. Common Crawl also has Google's director of research, Peter Norvig, and MIT Media Lab director Joi Ito on its advisory board.
Common Crawl has so far indexed more than 5 billion pages, adding up to 81 terabytes of data, made available through Amazon's cloud computing service. For about $25, a programmer could set up an account with Amazon and get to work crunching Common Crawl data, says Lisa Green, Common Crawl's director. The Internet Archive, another nonprofit, also compiles a copy of the web and offers a service called the "Wayback Machine" that can show old versions of a particular page. However, it doesn't allow anyone to analyze all of its data at once in that way.
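Getting started is largely a matter of pointing standard cloud tooling at the public dataset. As a rough illustration (not Common Crawl's official instructions), the Python sketch below streams one crawl archive directly from Amazon S3; the "commoncrawl" bucket is the project's public bucket, but the object key shown and the use of the boto3 and warcio libraries are assumptions made for this example.

```python
# Minimal sketch: stream one Common Crawl archive file from Amazon S3 and
# print the URL and size of each captured page. The object key below is
# illustrative; real keys are listed in each crawl's path index files.
import boto3
from botocore import UNSIGNED
from botocore.config import Config
from warcio.archiveiterator import ArchiveIterator

# The bucket is public, so unsigned (anonymous) requests are enough.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

key = "crawl-data/CC-MAIN-2013-20/segments/0001/warc/example.warc.gz"  # hypothetical key
obj = s3.get_object(Bucket="commoncrawl", Key=key)

# Iterate over the WARC records in the (gzipped) stream.
for record in ArchiveIterator(obj["Body"]):
    if record.rec_type == "response":
        url = record.rec_headers.get_header("WARC-Target-URI")
        body = record.content_stream().read()
        print(url, len(body))
```

In practice the same loop would be run in parallel across many archive files on Amazon's compute service, which is what keeps the cost of a full pass over the corpus in the tens of dollars rather than the cost of running a crawler.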
Common Crawl has already inspired or helped out some new web startups. TinEye, a "reverse" search engine that finds images similar to one provided by the user, made use of early Common Crawl data to get started. One programmer's personal project using Common Crawl data to measure how many of the web's pages connect to Facebook (some 22 percent, he concluded) led to his securing funding for a startup, Lucky Oyster, based on helping people find useful information in their social data.
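For a sense of what such a measurement involves, here is a minimal sketch (not Lucky Oyster's actual method) that counts how many pages in a single, already-downloaded crawl archive mention facebook.com; the file name is hypothetical and the substring check is deliberately crude.

```python
# Rough sketch: estimate the share of pages in one local WARC file that
# reference facebook.com. Real measurements would scan many archives and
# parse links properly rather than matching a raw substring.
from warcio.archiveiterator import ArchiveIterator

total = with_facebook = 0
with open("example.warc.gz", "rb") as stream:  # hypothetical local file
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        total += 1
        html = record.content_stream().read().lower()
        if b"facebook.com" in html:  # crude check for links, widgets, etc.
            with_facebook += 1

if total:
    print(f"{with_facebook}/{total} pages reference facebook.com")
```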
Other ideas enabled by the project emerged from a contest run last year that awarded prizes for the best use cases. One of the winners used Wikipedia links in crawl data to build a service capable of defining the meanings of words; another tried to determine public attitudes toward congressional legislation by analyzing the content of online discussions about new laws.
Rich Skrenta, cofounder and CEO of search engine startup Blekko, says Common Crawl's data fulfills a definite need in the startup community. He says Blekko has been approached by startups whose technology needs access to large collections of online data. "That kind of data is now easily available from Common Crawl," says Skrenta, whose company contributed some of its own data to the project in December 2012. Blekko shared information from its system that categorizes web pages by content, for example labeling whether they contain pornography or spam.
Ben Zhao, an assistant professor at the University of California, Santa Barbara, who uses large collections of web data for research into activity on social sites, says Common Crawl's data is likely unique.
"Fresh, large-scale crawls are quite rare, and I am not personally aware of places to get large crawl data on the web," he says.
However, Zhao notes that some of the most interesting and valuable parts of the web won't be well represented in Common Crawl's data: "Social sites are quite sensitive about their content these days, and many implement anti-crawling mechanisms to limit the speed anyone can access their content."
To access that data, researchers must strike up relationships with the companies and rely on whatever they are willing to release, a route less available to startups, which may be seen as competition.
Image courtesy of Flickr, Creativity103
This article was originally published at MIT Technology Review.