Monday, August 15, 2005

Calculating the size of the Internet... or at least for foreign languages

Yahoo! recently announced that their index now exceeds 20 billion items, over twice the size of the Google index. A short paper by Cheney, Perry, and Burton from the NCSA claims that it is unlikely that Yahoo!'s index is over twice as large as Google's. In their quick study, they noted that Google typically returns twice as many results as Yahoo! for long-tail queries.

I'm not sure there's a definitive answer to this question. The size of the web was determined to be at least 11.5 billion documents back in January. Given how infrequently the major search engines publicly announce the size of their indexes, and the overall growth rate of the web, 20 billion may not be an unrealistic number.

What I find interesting in this whole story is how Cheney, Perry, and Burton conducted their study. In particular, they ran random queries using words culled from publicly available International Ispell dictionaries. Their general approach, and the availability of these dictionaries, seem to reveal another means of studying search engine bias. One could take a sample of Cornish words, for example, and randomly probe search engines. Take two words at random, hit Google, and record the number of results. Maintain a moving average. Terminate when that average has converged. Do several runs and take the average. That gives an estimate of how many results are returned for an "average" search in the Cornish language. Do the same for the English language. The ratio of the two averages should give some indication of how much of the index each language occupies. Here's how it could work:

1. Draw two (or more) sample words from a list of words
2. Conduct search
3. Record number of results returned
4. Determine the moving average: newAverage = oldAverage * ((numIterations - 1) / numIterations) + newValue * (1 / numIterations)
5. Compare averages: IF |newAverage - oldAverage| > threshold, ITERATE. ELSE, record oldAverage and numIterations.
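
The steps above could be sketched in Python roughly as follows. This is only a sketch under some assumptions of mine: `result_count` is a hypothetical function you would have to supply yourself (it takes a query string and returns the number of results a search engine reports), and `min_iterations` and `max_iterations` are guards I added so the loop neither stops on an early fluke nor runs forever. The function name and the default values are illustrative, not part of the original study.

```python
import random
from typing import Callable, Optional, Sequence, Tuple


def estimate_average_results(
    words: Sequence[str],
    result_count: Callable[[str], int],  # hypothetical: query -> reported hit count
    threshold: float = 0.01,
    words_per_query: int = 2,
    min_iterations: int = 30,      # assumption: guard against stopping too early
    max_iterations: int = 10_000,  # assumption: guard against never converging
    rng: Optional[random.Random] = None,
) -> Tuple[float, int]:
    """Return (average result count, iterations used) for random queries."""
    rng = rng or random.Random()
    average = 0.0
    for n in range(1, max_iterations + 1):
        # Step 1: draw sample words (without replacement) from the word list.
        query = " ".join(rng.sample(words, words_per_query))
        # Steps 2-3: run the search and record the number of results.
        new_value = result_count(query)
        # Step 4: update the average using the formula from the list.
        new_average = average * (n - 1) / n + new_value / n
        # Step 5: stop once the average changes by no more than the threshold.
        converged = n >= min_iterations and abs(new_average - average) <= threshold
        average = new_average
        if converged:
            # Returns the latest average, which is within the threshold
            # of the previous one.
            return average, n
    return average, max_iterations
```

To compare two languages, one would call this once per word list (several times each, averaging the runs) and look at the ratio of the resulting averages. Since raw result counts can run into the millions, a threshold relative to the current average may be more practical than the absolute one used here.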

Here are some hypotheses: H1. The relative sizes of the search results are reflective of the number of speakers. H2. The number of iterations required for the search to converge is related to the size of the corpus.