Tuesday 1:40 p.m.–2:25 p.m.
Putting 1 million new words into the dictionary
Manuel Ebert
- Audience level: Intermediate
- Category: Science
Description
Abstract
This talk is the result of a collaboration with the NPO Wordnik. They recently launched a campaign on Kickstarter with the ambitious goal of putting a million new words into the dictionary. Why? Because people look up perfectly "cromulent" words like duplecture, manosphere, or misogynoir and can't find them in the dictionary, even though many blogs and news outlets use these words.
To solve this, Wordnik hired summer.ai to build an engine that finds so-called free-range definitions (FRDs) for these words. An FRD is a sentence like "She was quick to dismiss him as a misogynoir, that is, someone who's particularly hateful towards black women, and switched off the TV".
We built a pipeline that searches for these kinds of sentences on blogs and in vocabulary-forward magazines such as The New Yorker or The Atlantic, determines whether a sentence sufficiently defines the word in question using various classification methods based on syntactic analysis and stochastic processes, predicts how useful the definition will be to a human, and puts the definition into Wordnik's dictionary.
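To make the first stage of such a pipeline concrete, here is a minimal sketch of candidate detection. It is not the engine built for Wordnik: it replaces the syntactic and stochastic classifiers described above with a plain cue-phrase heuristic, and the names `CUES`, `split_sentences`, and `find_candidates` are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical cue phrases that often introduce an inline definition.
# This list is an assumption for illustration, not taken from the talk.
CUES = ("that is,", "which means", "defined as", "also known as")


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace.

    A real pipeline would use a proper sentence tokenizer.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_candidates(text, word):
    """Return sentences in which `word` is followed by a cue phrase.

    Matching is a case-insensitive substring test, so inflected forms
    such as plurals match as well.
    """
    candidates = []
    needle = word.lower()
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        pos = lowered.find(needle)
        if pos == -1:
            continue  # the word does not occur in this sentence
        tail = lowered[pos + len(needle):]
        if any(cue in tail for cue in CUES):
            candidates.append(sentence)
    return candidates


sample = (
    "She was quick to dismiss him as a misogynoir, that is, someone who's "
    "particularly hateful towards black women, and switched off the TV. "
    "Many blogs use the word misogynoir. The weather was nice."
)

for candidate in find_candidates(sample, "misogynoir"):
    print(candidate)
```

A heuristic like this only narrows the search. Deciding whether a candidate actually defines the word, and how useful it is to a human reader, is the job of the classification stages the abstract describes.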
Because of the sheer size of the project (if done in serial, an average computer would take more than 8 months to accomplish this task), we had to take a few surprising shortcuts. We didn't use any supercomputer clusters, though, only technology and platforms that are available and affordable to all developers.
This talk walks through all the different stages of the pipeline. It won't go into the details of machine learning algorithms, but rather aims to demonstrate what is possible with a few hundred lines of Python code.
All of our work is released as open source and can be used to build dictionaries for any other language, too.