For the benefit of the community, Google has released various datasets built up over years of collecting, scaling, and training on corpora of public web pages.

Some of them are:

  • Co-occurrence of words for word n-gram model training
  • Job queue traces from Google clusters
  • 800M documents annotated with Freebase entities
  • 40M disambiguated mentions in 10M web pages linked to Wikipedia entities
  • Human-judged corpus of binary relations about Wikipedia public figures
  • Wikipedia Infobox edit history (39M updates of attributes of 1.8M entities)
  • Triples of (phrase, URL of a Wikipedia entity, number of times phrase appears in the page at the URL)

These datasets are interesting, especially for anyone working with n-grams in computational linguistics and probability, and they are a great starting point for experimenting with data and visualizations. All the data you can get!
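As a flavor of what working with such data looks like, here is a minimal sketch of counting word n-grams from raw text, assuming simple whitespace tokenization (the released datasets use their own tokenization, so this is only illustrative):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-token sequence in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the quick brown fox jumps over the lazy dog".split()
bigrams = ngram_counts(tokens, 2)
print(bigrams[("quick", "brown")])  # prints 1
```

Counts like these are the raw material for estimating n-gram language-model probabilities, e.g. P(brown | quick) = count(quick, brown) / count(quick).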

You’re invited to tinker and collaborate on projects related to large-scale data processing, data-driven approaches, or visualizations. Come up with cool ideas and work with information designers this weekend at Devthon.

The aggregated links to the datasets can be found here. More links can be found in the Hacker News discussion.

[via Daily Learnings]

Originally Published on September 30, 2013
