Web scraping: Reliably and efficiently pull data from pages that don't expect it

Type: Tutorial
Audience level: Intermediate
Category: Best Practices/Patterns
March 7th, 1:20 p.m. – 4:40 p.m.

Description

Exciting information is trapped in web pages and behind HTML forms. In this tutorial, you'll learn how to parse those pages and when to apply advanced techniques that make scraping faster and more stable. We'll cover parallel downloading with Twisted, gevent, and others; analyzing sites behind SSL; driving JavaScript-y sites with Selenium; and evading common anti-scraping techniques.

Abstract

Basics of parsing

  • The website is the API
  • HTML is a mess, but we can parse it anyway
  • Why regular expressions are a bad idea
  • Extracting information using XPath, CSS selectors, and the BeautifulSoup API (see the sketch after this list)
  • Expect exceptions: How to handle errors
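
A minimal sketch of those extraction ideas, with a placeholder URL and made-up selectors; real pages need real selectors and their own error handling:

    import requests
    from bs4 import BeautifulSoup
    from lxml import html

    try:
        response = requests.get("http://example.com/listings", timeout=10)
        response.raise_for_status()
    except requests.RequestException as exc:
        # Expect exceptions: the network and the server will fail eventually.
        print("fetch failed:", exc)
    else:
        # XPath and CSS selectors on the same lxml tree
        # (.cssselect needs the cssselect package).
        tree = html.fromstring(response.content)
        titles = tree.xpath("//h2[@class='title']/text()")
        prices = tree.cssselect("span.price")
        print(titles)
        print([p.text for p in prices])

        # The same page through the BeautifulSoup API.
        soup = BeautifulSoup(response.content, "html.parser")
        for listing in soup.find_all("div", class_="listing"):
            print(listing.get_text(strip=True))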

Basics of crawling

  • A quick review of HTTP
  • Why cookies are necessary for maintaining a session
  • How servers can track you
  • How to submit forms with mechanize (sketched below)
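
A minimal sketch of a mechanize session; the URL, form name, and field values are invented for illustration, and the cookies live on the Browser object between requests:

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)          # be deliberate about this choice
    br.addheaders = [("User-Agent", "my-scraper/0.1")]

    br.open("http://example.com/login")  # response cookies are kept on br
    br.select_form(name="login")         # pick the form by its name attribute
    br["username"] = "alice"
    br["password"] = "s3cret"
    response = br.submit()               # POSTs the form and follows redirects

    print(response.geturl())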

Debugging the web

  • Comparing Firebug and Chrome's DOM inspector
  • The "Net" tab
  • Using a logging HTTP proxy to record traffic (example below)
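
One way to record every request and response is to point the client at a local logging proxy. This sketch assumes some intercepting proxy (mitmproxy, Charles, or a homegrown one) is already listening on port 8080; verify=False is only tolerable here because such proxies re-sign SSL traffic with their own certificate:

    import requests

    proxies = {
        "http": "http://127.0.0.1:8080",
        "https": "http://127.0.0.1:8080",
    }

    response = requests.get("https://example.com/",
                            proxies=proxies, verify=False)
    print(response.status_code, len(response.content))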

Counter-measures, and how to circumvent them

  • JavaScript
  • Hidden form fields (e.g., Django CSRF; see the sketch below)
  • CAPTCHAs
  • IP address limitations
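
A hedged sketch of coping with one such counter-measure, a Django-style hidden CSRF field: fetch the form, echo the token back, and keep the cookies in one session. The URLs and field values are illustrative:

    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()              # keeps the csrftoken cookie
    page = session.get("http://example.com/comment/")
    soup = BeautifulSoup(page.content, "html.parser")

    # Django renders the token as a hidden input; send it back in the POST.
    token = soup.find("input", {"name": "csrfmiddlewaretoken"})["value"]

    response = session.post(
        "http://example.com/comment/",
        data={"csrfmiddlewaretoken": token, "body": "Nice article!"},
        headers={"Referer": "http://example.com/comment/"},
    )
    print(response.status_code)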

How to cover your scraping code with tests

  • Why you should store snapshotted pages
  • Using mock objects to avoid network I/O
  • Using a fake getPage for Twisted (sketched below)
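
A minimal sketch of the testing idea: read a snapshotted page from disk and hand it back through a fake getPage, so the test never touches the network. The snapshot path is an assumption:

    from twisted.internet import defer
    from twisted.trial import unittest

    def fake_getPage(url):
        # Ignore the URL and return the stored snapshot as a fired Deferred,
        # the same shape twisted.web.client.getPage produces on success.
        with open("snapshots/listing.html", "rb") as f:
            return defer.succeed(f.read())

    class SnapshotTests(unittest.TestCase):
        @defer.inlineCallbacks
        def test_snapshot_looks_like_html(self):
            body = yield fake_getPage("http://example.com/listings")
            # Real tests would run your parsing code on `body` here.
            self.assertIn(b"<html", body.lower())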

Parallelism

  • A quick tour of different models (gevent sketched below):
      • Twisted
      • gevent
      • celery
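
As one example of those models, a gevent sketch: monkey-patch the standard library, then fetch a batch of placeholder URLs from a small pool of greenlets:

    from gevent import monkey
    monkey.patch_all()

    from gevent.pool import Pool
    import requests

    URLS = ["http://example.com/page/%d" % i for i in range(1, 21)]

    def fetch(url):
        try:
            return url, requests.get(url, timeout=10).status_code
        except requests.RequestException as exc:
            return url, exc

    pool = Pool(5)                       # at most five requests in flight
    for url, result in pool.imap_unordered(fetch, URLS):
        print(url, result)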

Handling JavaScript

  • Automating a full web browser with Selenium RC (see the sketch below)
  • Running JavaScript within Python using python-spidermonkey
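
A rough sketch of the Selenium RC approach; it assumes a Selenium server is already running on port 4444 with Firefox available (newer code would use the WebDriver API instead):

    from selenium import selenium

    browser = selenium("localhost", 4444, "*firefox", "http://example.com/")
    browser.start()
    browser.open("/app")                     # the browser executes the JavaScript
    browser.wait_for_page_to_load("30000")
    html = browser.get_html_source()         # scrape the rendered DOM
    browser.stop()

    print(len(html))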

Conclusion

  • Use your power for good, not evil.
  • Q&A