Web scraping: Reliably and efficiently pull data from pages that don't expect it

Type: Tutorial
Audience level: Intermediate
Category: Best Practices/Patterns
March 7th, 1:20 p.m. – 4:40 p.m.

Description

Exciting information is trapped in web pages and behind HTML forms. In this tutorial, you'll learn how to parse those pages and when to apply advanced techniques that make scraping faster and more stable. We'll cover parallel downloading with Twisted, gevent, and others; analyzing sites behind SSL; driving JavaScript-y sites with Selenium; and evading common anti-scraping techniques.

Abstract

Basics of parsing

  • The website is the API
  • HTML is a mess, but we can parse it anyway
  • Why regular expressions are a bad idea
  • Extracting information using XPath, CSS selectors, and the BeautifulSoup API (see the sketch after this list)
  • Expect exceptions: How to handle errors
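
A minimal sketch of those extraction ideas, with a placeholder URL and made-up selectors; real pages need real selectors and their own error handling:

    import requests
    from bs4 import BeautifulSoup
    from lxml import html

    try:
        response = requests.get("http://example.com/listings", timeout=10)
        response.raise_for_status()
    except requests.RequestException as exc:
        # Expect exceptions: the network and the server will fail eventually.
        print("fetch failed:", exc)
    else:
        # XPath and CSS selectors on the same lxml tree
        # (.cssselect needs the cssselect package).
        tree = html.fromstring(response.content)
        titles = tree.xpath("//h2[@class='title']/text()")
        prices = tree.cssselect("span.price")
        print(titles)
        print([p.text for p in prices])

        # The same page through the BeautifulSoup API.
        soup = BeautifulSoup(response.content, "html.parser")
        for listing in soup.find_all("div", class_="listing"):
            print(listing.get_text(strip=True))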

Basics of crawling

  • A quick review of HTTP
  • Why cookies are necessary for maintaining a session
  • How servers can track you
  • How to submit forms with mechanize (sketched below)
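
A minimal sketch of a mechanize session; the URL, form name, and field values are invented for illustration, and the cookies live on the Browser object between requests:

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)          # be deliberate about this choice
    br.addheaders = [("User-Agent", "my-scraper/0.1")]

    br.open("http://example.com/login")  # response cookies are kept on br
    br.select_form(name="login")         # pick the form by its name attribute
    br["username"] = "alice"
    br["password"] = "s3cret"
    response = br.submit()               # POSTs the form and follows redirects

    print(response.geturl())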

Debugging the web

  • Comparing Firebug and Chrome's DOM inspector
  • The "Net" tab
  • Using a logging HTTP proxy to record traffic (example below)
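
One way to record every request and response is to point the client at a local logging proxy. This sketch assumes some intercepting proxy (mitmproxy, Charles, or a homegrown one) is already listening on port 8080; verify=False is only tolerable here because such proxies re-sign SSL traffic with their own certificate:

    import requests

    proxies = {
        "http": "http://127.0.0.1:8080",
        "https": "http://127.0.0.1:8080",
    }

    response = requests.get("https://example.com/",
                            proxies=proxies, verify=False)
    print(response.status_code, len(response.content))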

Counter-measures, and how to circumvent them

  • JavaScript
  • Hidden form fields (e.g., Django CSRF; see the sketch below)
  • CAPTCHAs
  • IP address limitations
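
A hedged sketch of coping with one such counter-measure, a Django-style hidden CSRF field: fetch the form, echo the token back, and keep the cookies in one session. The URLs and field values are illustrative:

    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()              # keeps the csrftoken cookie
    page = session.get("http://example.com/comment/")
    soup = BeautifulSoup(page.content, "html.parser")

    # Django renders the token as a hidden input; send it back in the POST.
    token = soup.find("input", {"name": "csrfmiddlewaretoken"})["value"]

    response = session.post(
        "http://example.com/comment/",
        data={"csrfmiddlewaretoken": token, "body": "Nice article!"},
        headers={"Referer": "http://example.com/comment/"},
    )
    print(response.status_code)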

How to cover your scraping code with tests

  • Why you should store snapshotted pages
  • Using mock objects to avoid network I/O
  • Using a fake getPage for Twisted (sketched below)
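
A minimal sketch of the testing idea: read a snapshotted page from disk and hand it back through a fake getPage, so the test never touches the network. The snapshot path is an assumption:

    from twisted.internet import defer
    from twisted.trial import unittest

    def fake_getPage(url):
        # Ignore the URL and return the stored snapshot as a fired Deferred,
        # the same shape twisted.web.client.getPage produces on success.
        with open("snapshots/listing.html", "rb") as f:
            return defer.succeed(f.read())

    class SnapshotTests(unittest.TestCase):
        @defer.inlineCallbacks
        def test_snapshot_looks_like_html(self):
            body = yield fake_getPage("http://example.com/listings")
            # Real tests would run your parsing code on `body` here.
            self.assertIn(b"<html", body.lower())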

Parallelism

  • A quick tour of different models (gevent sketched below):
      • Twisted
      • gevent
      • celery
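
As one example of those models, a gevent sketch: monkey-patch the standard library, then fetch a batch of placeholder URLs from a small pool of greenlets:

    from gevent import monkey
    monkey.patch_all()

    from gevent.pool import Pool
    import requests

    URLS = ["http://example.com/page/%d" % i for i in range(1, 21)]

    def fetch(url):
        try:
            return url, requests.get(url, timeout=10).status_code
        except requests.RequestException as exc:
            return url, exc

    pool = Pool(5)                       # at most five requests in flight
    for url, result in pool.imap_unordered(fetch, URLS):
        print(url, result)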

Handling JavaScript

  • Automating a full web browser with Selenium RC (see the sketch below)
  • Running JavaScript within Python using python-spidermonkey
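
A rough sketch of the Selenium RC approach; it assumes a Selenium server is already running on port 4444 with Firefox available (newer code would use the WebDriver API instead):

    from selenium import selenium

    browser = selenium("localhost", 4444, "*firefox", "http://example.com/")
    browser.start()
    browser.open("/app")                     # the browser executes the JavaScript
    browser.wait_for_page_to_load("30000")
    html = browser.get_html_source()         # scrape the rendered DOM
    browser.stop()

    print(len(html))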

Conclusion

  • Use your power for good, not evil.
  • Q&A