Monday, August 07, 2006

A LOT of Data Available for R&D

Google have decided to share some data with the wider IT community. In their words:
"That's why we decided to share this enormous dataset with everyone. We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times."
The dataset will fill 6 DVDs. Presumably there will be a small charge for them, but that's fine.
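To give a feel for what "counts for five-word sequences that appear at least 40 times" means, here is a minimal Python sketch. It is only an illustration on a toy sentence, not Google's actual pipeline: the function names `ngram_counts` and `vocabulary` are made up for this example, and the thresholds are lowered so that the tiny input produces something. How rare words were handled inside the published sequences is not stated above, so the sketch leaves sequences untouched.

```python
from collections import Counter


def ngram_counts(tokens, n=5, min_count=40):
    """Count every run of n consecutive words; keep those seen at least min_count times."""
    # zip over n shifted views of the token list yields each n-word window as a tuple
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(grams)
    return {gram: c for gram, c in counts.items() if c >= min_count}


def vocabulary(tokens, min_count=200):
    """Return the set of words that appear at least min_count times."""
    counts = Counter(tokens)
    return {word for word, c in counts.items() if c >= min_count}


# Toy input: one phrase repeated three times, followed by a short tail.
tokens = ("the quick brown fox jumps " * 3 + "over the lazy dog").split()

# Thresholds are lowered to 3 here (hypothetical values, just for the toy data).
frequent = ngram_counts(tokens, n=5, min_count=3)
vocab = vocabulary(tokens, min_count=3)

print(frequent)
print(sorted(vocab))
```

On a trillion words this would obviously not fit in one machine's memory; the point is only to show the shape of the published data: a sequence of words paired with a count, with rare items cut off by a threshold.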
