Tuesday 1:40 p.m.–2:25 p.m.
Putting 1 million new words into the dictionary
Manuel Ebert
- Audience level: Intermediate
- Category: Science
Description
2015 was the year of spocking, amabots, dadbuds, and smol. Like half of all English words used every day, these words are not in the dictionary. Until we put them there. In this talk, I’ll describe how we found definitions for 1 million words that were missing from dictionaries, what it takes to do Natural Language Processing at that scale, and how to be the least popular Scrabble winner.
Abstract
This talk is the result of a [collaboration](http://www.nytimes.com/2015/10/04/technology/scouring-the-web-to-make-new-words-lookupable.html) with the NPO [Wordnik](wordnik.com). They recently launched a [campaign on Kickstarter](https://www.kickstarter.com/projects/1574790974/lets-add-a-million-missing-words-to-the-dictionary) with the ambitious goal of putting a million new words into the dictionary. Why? Because people look up perfectly "cromulent" words like duplecture, manosphere, or misogynoir, and can't find them in the dictionary, even though many blogs and news outlets use these words.
To solve this, Wordnik hired [summer.ai](http://summer.ai) to build an engine that finds so-called free-range definitions (FRDs) for these words. An FRD is a sentence that defines a word in passing, like *"She was quick to dismiss him as a misogynoir, that is, someone who's particularly hateful towards black women, and switched off the TV"*.
We built a pipeline that searches for these kinds of sentences on blogs and in vocabulary-forward magazines such as The New Yorker or The Atlantic, determines whether a sentence sufficiently defines the word in question using various classification methods based on syntactic analysis and stochastic processes, predicts how useful the definition will be to a human, and puts the definition into Wordnik's dictionary.
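To make the stages concrete, here is a minimal sketch of the "find a candidate sentence, extract the definition, score it" flow. It is not the classifier we built: it swaps the syntactic and stochastic methods for a handful of regular-expression cue phrases, and the names `CUE_PATTERNS`, `find_definition`, `usefulness`, and `rank_definitions`, as well as the word-count score, are assumptions made up for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical cue phrases that often introduce an in-line definition.
# The definition is captured up to the next comma, period, or semicolon.
CUE_PATTERNS = [
    r"\b{word}\b,? that is,? (?P<definition>[^.;,]+)",
    r"\b{word}\b,? (?:which|who) (?:is|are|means) (?P<definition>[^.;,]+)",
    r"\b{word}\b (?:is|are|means|refers to) (?P<definition>[^.;,]+)",
]


@dataclass
class Candidate:
    word: str
    definition: str
    score: float


def find_definition(word: str, sentence: str) -> Optional[str]:
    """Return the defining phrase if the sentence matches a cue pattern."""
    for template in CUE_PATTERNS:
        pattern = template.format(word=re.escape(word))
        match = re.search(pattern, sentence, flags=re.IGNORECASE)
        if match:
            return match.group("definition").strip()
    return None


def usefulness(definition: str) -> float:
    """Toy stand-in for a usefulness model: longer phrases score higher,
    capped at 10 words (an arbitrary assumption)."""
    return min(len(definition.split()), 10) / 10


def rank_definitions(word: str, sentences: List[str]) -> List[Candidate]:
    """Run every sentence through the stages and sort by score, best first."""
    candidates = []
    for sentence in sentences:
        definition = find_definition(word, sentence)
        if definition is not None:
            candidates.append(Candidate(word, definition, usefulness(definition)))
    return sorted(candidates, key=lambda c: c.score, reverse=True)


sentences = [
    "She was quick to dismiss him as a misogynoir, that is, someone who's "
    "particularly hateful towards black women, and switched off the TV.",
    "Nobody in the room had heard the word misogynoir before.",
]
ranked = rank_definitions("misogynoir", sentences)
```

With the two sentences above, only the first yields a candidate. Cue phrases alone miss most real definitions and accept many false ones, which is why a real system needs the classification step described here.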
Because of the sheer size of the project (if done in serial, an average computer would take more than 8 months to accomplish this task), we had to take a few surprising shortcuts. We didn't use any supercomputer clusters, only technology and platforms that are available and affordable to all developers.
This talk walks through all the stages of the pipeline. It won't go into the details of machine learning algorithms, but rather aims to demonstrate what is possible with a few hundred lines of Python code.
All of our work is released as [open source](https://github.com/summerai/wordnik) and can be used to build dictionaries for any other language, too.