pycon logo

PyCon 2011 Atlanta

March 9th–17th

Log in or Sign Up

Statistical machine learning for text classification with scikit-learn

log in to bookmark this presentaton

Experienced / Talk
March 11th 4:15 p.m. – 4:45 p.m.
The goal of this talk is to give a state-of-the-art overview of machine learning algorithms applied to text classification tasks ranging from language and topic detection in tweets and web pages to sentiment analysis in consumer products reviews.

Abstract

Unstructured or semi-structured text data is ubiquitous thanks to the read-write nature of the web. However human authors are often lazy and don't fill-in structured metadata forms in web applications. It is however possible to automate some structured knowledge extraction with simple and scalable statistical learning tools implemented in python. For instance:

  • guessing the language and topic of tweets and web pages
  • analyze the sentiment (positive or negative) in consumer products reviews in blogs or customer emails

This talk will introduce the main operational steps of supervised learning:

  • extracting the relevant features from text documents
  • selecting the right machine learning algorithm to train a model for the task at hand
  • using the trained model on previously unseen documents
  • evaluating the predictive accuracy of the trained model

We will also demonstrate the results obtained for above tasks using the scikit-learn package and compare it to other implementations such as nltk and the Google Prediction API