A LOT of Data Available for R&D
Google have decided to share some data with the wider IT community, in the form of:
"That's why we decided to share this enormous dataset with everyone. We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times."

It will fill 6 DVDs. Presumably there will be a small cost, but that's fine.
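
To get a feel for what counts like these mean, here is a minimal Python sketch of the same idea at toy scale: count every n-word sequence in a token list and keep only those above a frequency threshold. This is not Google's pipeline. The function name `ngram_counts`, its parameters, and the `<UNK>` placeholder are my own assumptions; in particular, the announcement only says rare words are discarded, and mapping them to a placeholder is just one way to do that.

```python
from collections import Counter


def ngram_counts(tokens, n=5, min_word_count=200, min_ngram_count=40, unk="<UNK>"):
    """Count n-word sequences, keeping only the frequent ones.

    Hypothetical helper for illustration. The default thresholds mirror the
    figures in the announcement (200 for words, 40 for sequences).
    """
    # Count individual words and keep those seen often enough.
    word_counts = Counter(tokens)
    vocab = {w for w, c in word_counts.items() if c >= min_word_count}

    # Assumption: rare words are replaced by a placeholder rather than dropped.
    mapped = [t if t in vocab else unk for t in tokens]

    # Slide a window of length n over the text and count each sequence.
    grams = Counter(tuple(mapped[i:i + n]) for i in range(len(mapped) - n + 1))

    # Keep only sequences that appear at least min_ngram_count times.
    return {g: c for g, c in grams.items() if c >= min_ngram_count}


if __name__ == "__main__":
    text = "a b c a b c a b c".split()
    # Tiny thresholds so the toy input produces something.
    print(ngram_counts(text, n=2, min_word_count=1, min_ngram_count=3))
```

At the scale of a trillion words this obviously would not fit in one process's memory, so treat it as a description of the output, not of how it was produced.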


