Introduction to NLTK

Jacob Perkins

Type: Tutorial
Audience level: Intermediate
Category: Useful libraries

March 8th 1:20 p.m. – 4:40 p.m.

Description

Learn the basics of natural language processing with NLTK, the Natural Language Toolkit. First we'll cover tokenization, stemming and WordNet. Next we'll get into part-of-speech tagging, chunking & named entity recognition. Then we'll close with text classification and sentiment analysis. You'll walk out with new super-powers and an appreciation of the difficulties of analyzing human language.

Abstract

This tutorial will be a hands-on approach to learning natural language processing using NLTK, the Natural Language Toolkit. We will cover everything from tokenizing sentences to phrase extraction, from splitting words to training your own text classifiers for sentiment analysis. Please come prepared with NLTK already installed so we can dive into the code & data immediately.

Hour 1: Tokenization, Stemming & Corpora

You need to understand tokenization and be familiar with corpus readers and models before you can get into the more interesting aspects of NLTK. This first hour will include:

an overview of modules & data
loading pickled models
sentence & word tokenization
stemming & lemmatization
an overview of WordNet and other included corpora
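
As a rough preview of the tokenization and stemming topics above, here is a minimal sketch that uses only NLTK components needing no downloaded data. The sample sentence is made up for illustration. The model-backed helpers such as nltk.sent_tokenize, nltk.word_tokenize and the WordNet lemmatizer are left out because they assume the corresponding NLTK data packages are installed.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer

# Illustrative sentence, not taken from the tutorial materials.
text = "The cats are running quickly."

# Split on runs of word characters and runs of punctuation (no model required).
tokens = WordPunctTokenizer().tokenize(text)

# Reduce each token to a crude stem by stripping common suffixes.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

for token, stem in zip(tokens, stems):
    print(token, "->", stem)
```

Stems are not always dictionary words, which is one reason lemmatization is covered alongside stemming.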

Hour 2: Part-of-Speech Tagging & Chunking/NER

Using tokenization and a working knowledge of corpus readers & pickled models, we'll dive into part-of-speech tagging and chunking/NER, including:

using a part-of-speech tagger
an overview of tags and tagged corpora
training a custom tagger with nltk-trainer
using a chunker for phrase extraction and named entity recognition
an overview of chunked corpora
training a custom chunker with nltk-trainer
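
The tagging and chunking steps above can be sketched on toy data. This example trains a unigram tagger on a single hand-tagged sentence and then extracts noun phrases with a regular-expression chunk grammar. The training sentence, the grammar and the variable names are assumptions made for illustration, and the sketch calls NLTK classes directly rather than going through nltk-trainer, which the session itself uses.

```python
from nltk.chunk import RegexpParser
from nltk.tag import DefaultTagger, UnigramTagger

# One hand-tagged training sentence (Penn Treebank style tags); a real tagger
# would be trained on a tagged corpus instead.
train_sents = [[
    ("the", "DT"), ("little", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]]

# Unknown words fall back to the default tag "NN".
tagger = UnigramTagger(train_sents, backoff=DefaultTagger("NN"))
tagged = tagger.tag(["the", "cat", "barked", "at", "the", "little", "dog"])

# A noun phrase here is an optional determiner, any adjectives, then a noun.
chunker = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = chunker.parse(tagged)

noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
print(noun_phrases)
```

A regular-expression chunker like this is only as good as its grammar and the tags it receives, which is part of the motivation for training chunkers on chunked corpora.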

Hour 3: Text Classification & Sentiment Analysis

After using classifiers for training part-of-speech taggers and chunkers, this final hour will explain text classification in greater detail with:

an overview of classified corpora
text feature extraction
an overview of classification algorithms & when to use them
training a sentiment analysis classifier on movie reviews with nltk-trainer
using a classifier for sentiment analysis
hierarchical classification for sentiment analysis
binary vs multi-label classification
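
To give a feel for feature extraction and classifier training, here is a minimal sketch with a Naive Bayes classifier on four made-up training examples. The bag_of_words helper and the tiny data set are hypothetical. A real sentiment classifier would be trained on a classified corpus such as the movie reviews mentioned above, and the session uses nltk-trainer for that step.

```python
from nltk.classify import NaiveBayesClassifier


def bag_of_words(words):
    """Hypothetical feature extractor: mark each word as present."""
    return {word: True for word in words}


# Toy labeled examples, invented for illustration only.
train_set = [
    (bag_of_words(["great", "fun", "movie"]), "pos"),
    (bag_of_words(["wonderful", "great", "acting"]), "pos"),
    (bag_of_words(["boring", "terrible", "plot"]), "neg"),
    (bag_of_words(["terrible", "awful", "acting"]), "neg"),
]

classifier = NaiveBayesClassifier.train(train_set)

positive_guess = classifier.classify(bag_of_words(["great", "movie"]))
negative_guess = classifier.classify(bag_of_words(["terrible", "plot"]))
print(positive_guess, negative_guess)
```

With this little data the result says nothing about real accuracy. The point is only the shape of the workflow: words become feature dictionaries, labeled feature dictionaries train a classifier, and the classifier labels new feature dictionaries.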

Wrapping Up

Now that you know how to use NLTK to process some of the included English corpora, we'll wrap up by covering:

non-English corpora included with NLTK
other Python libraries for NLP
custom corpus creation
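
As one possible illustration of custom corpus creation, the sketch below writes a plain-text file to a temporary directory and reads it back with NLTK's plain-text corpus reader. The file name and its contents are invented for the example. Only word access is shown, since sentence access assumes a downloaded sentence tokenizer model.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader

with tempfile.TemporaryDirectory() as root:
    # Hypothetical corpus file, created just for this example.
    with open(os.path.join(root, "notes.txt"), "w", encoding="utf-8") as handle:
        handle.write("Hello world. This is a custom corpus.")

    # Treat every .txt file under the directory as part of the corpus.
    reader = PlaintextCorpusReader(root, r".*\.txt")
    fileids = reader.fileids()
    words = list(reader.words())

print(fileids)
print(words)
```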