Sunday 1:20 p.m.–4:40 p.m.
Making an Impact with Python Natural Language Processing Tools
Hobson Lane, Dan Fellin, Jeremy Robin
- Audience level: Intermediate
- Category: Science
Description
Abstract
Prerequisites
Students who have experience writing Python scripts or modules and are familiar with Python's built-in string manipulation and formatting capabilities will have the skills needed to complete this tutorial.
In addition, students who are familiar with linear algebra and basic statistics concepts (such as probability and variance) will be able to grasp the mathematics behind the tools assembled during the tutorial, but this is not required. Likewise, familiarity with scikit-learn and pandas will enable participants to incorporate more advanced features into their NLP pipeline.
Also, students who are familiar with git and GitHub will be able to follow the logistics of the workshop sessions more quickly and spend more time developing their NLP pipeline.
Python Development Environment
Students will need IPython, pandas, NLTK, SciPy, scikit-learn, and gensim installed on their laptops in order to run the examples in this tutorial and build the tweet impact predictor tool. Students can install these requirements in one of three ways:
- For those with a Linux environment, the dependencies can be installed either natively or within a virtualenv with:
  pip install -r https://raw.githubusercontent.com/totalgood/pycon-2016-nlp-tutorial/master/requirements.txt
- Alternative install recipes using Anaconda will be provided.
- A Vagrant VirtualBox customized for NLP has been packaged for those who want the power of Linux and Python within their nonfree, closed-source OS.
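Whichever install route is used, it can help to confirm that the packages are importable before the tutorial starts. The following is a minimal sketch, not part of the official course material: the function name missing_packages is hypothetical, and the list uses each package's import name, which can differ from its pip name (scikit-learn is imported as sklearn).

```python
from importlib.util import find_spec

# Import names for the packages the tutorial lists as requirements.
# Assumption: these top-level import names match the installed versions.
REQUIRED = ["IPython", "pandas", "nltk", "scipy", "sklearn", "gensim"]


def missing_packages(names=REQUIRED):
    """Return the names from `names` that Python cannot locate for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All tutorial dependencies are importable.")
```

The check only locates each package; it does not import it or verify its version, so it runs quickly even in a fresh environment.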
In addition, students have the option of installing a Python Twitter API client rather than using the preprocessed collection of Twitter feeds provided with the course material.
Overview
Participants will develop a tweet natural language processing pipeline in three modules. The first section of the pipeline will be a natural language feature extractor and normalizer based on the Python standard-library modules collections, string, and re, combined with the powerful pandas DataFrame data structure. The second section will use scikit-learn and numpy to reduce the feature set to a manageable number of features. It will find combinations of a reduced number of features that provide the most information about the subject matter of the tweets being processed. The final section of the pipeline will compute additional features not contained in the tweet text, including time of day, day of week, number of favorites, and number of retweets. Students will use these features to compute an "impact" score and train a machine-learning model to predict the impact of proposed (not yet sent) tweets. In the fourth and final workshop, participants will assess the performance of their machine-learning pipeline, ask questions or get clarification about its performance, and optionally incorporate more advanced NLP techniques.
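To give a feel for the first and last stages described above, here is a minimal sketch that uses only the standard library. It is an illustration under stated assumptions, not the tutorial's actual code: the token pattern, the function names, the sample tweet, and the timestamp are all made up for this example.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed token pattern: words, optionally prefixed by '#' or '@'.
TOKEN_PATTERN = re.compile(r"[#@]?\w+")


def tokenize(text):
    """Lowercase a tweet and split it into word, hashtag, and mention tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def text_features(text):
    """Bag-of-words token counts for one tweet."""
    return Counter(tokenize(text))


def time_features(created_at):
    """Hour of day (0-23) and day of week (0 = Monday) for a datetime."""
    return {"hour": created_at.hour, "weekday": created_at.weekday()}


# Hypothetical tweet text and timestamp, used only for illustration.
tweet = "Learning #NLP with Python is fun, fun, fun @pycon"
counts = text_features(tweet)
when = time_features(datetime(2016, 5, 29, 13, 20))

print(counts.most_common(3))
print(when)
```

Because Counter is a dict subclass, a list of these per-tweet counts could be passed to pandas.DataFrame to get one row per tweet and one column per token, which is the kind of table the dimension-reduction stage would then work on.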
Student Handout
No handouts have been provided yet for this tutorial.