For the benefit of the community, Google has released various datasets gathered over its years of collecting, scaling, and training on corpora of public web pages.
Some of them are:
- Co-occurrence of words for word n-gram model training
- Job queue traces from Google clusters
- 800M documents annotated with Freebase entities
- 40M disambiguated mentions in 10M web pages linked to Wikipedia entities
- Human-judged corpus of binary relations about Wikipedia public figures
- Wikipedia Infobox edit history (39M updates of attributes of 1.8M entities)
- Triples of (phrase, URL of a Wikipedia entity, number of times the phrase appears on the page at that URL)
These datasets are especially valuable for anyone working with n-grams in computational linguistics and probability, and a real delight for anyone getting started with data and visualization. All the data you could want!
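As a toy illustration of what word co-occurrence counts like these enable, here is a minimal Python sketch of a maximum-likelihood bigram model. The counts below are made up for illustration and are not drawn from the released data:

```python
from collections import defaultdict

# Toy bigram counts, standing in for the kind of word co-occurrence
# data in the released corpora (the real datasets are far larger).
bigram_counts = {
    ("the", "cat"): 12,
    ("the", "dog"): 8,
    ("cat", "sat"): 5,
    ("dog", "sat"): 3,
    ("sat", "down"): 7,
}

# Total count of each leading word, used for normalization.
unigram_totals = defaultdict(int)
for (w1, _), count in bigram_counts.items():
    unigram_totals[w1] += count

def bigram_probability(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1) from co-occurrence counts."""
    total = unigram_totals.get(w1, 0)
    if total == 0:
        return 0.0
    return bigram_counts.get((w1, w2), 0) / total

print(bigram_probability("the", "cat"))  # 12 / 20 = 0.6
print(bigram_probability("the", "dog"))  # 8 / 20 = 0.4
```

The same estimate scales to the full released counts; with real data you would also want smoothing for unseen pairs, which the sketch omits.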
You’re invited to collaborate on projects related to large-scale data processing, data-driven approaches, or visualization. Come up with cool ideas, work with information designers, and tinker around this weekend at Devthon.
[via Daily Learnings]
Originally Published on September 30, 2013