Web scraping: Reliably and efficiently pull data from pages that don't expect it
- Type: Tutorial
- Audience level: Intermediate
- Category: Best Practices/Patterns
March 7th 1:20 p.m. – 4:40 p.m.
Description
Exciting information is trapped in web pages and behind HTML forms. In this tutorial, you'll learn how to parse those pages and when to apply advanced techniques that make scraping faster and more stable.
We'll cover parallel downloading with Twisted, gevent, and others; analyzing sites behind SSL; driving JavaScript-heavy sites with Selenium; and evading common anti-scraping techniques.
Abstract
Basics of parsing
- The website is the API
- HTML is a mess, but we can parse it anyway
- Why regular expressions are a bad idea
- Extracting information, using XPath, CSS selectors, and the BeautifulSoup API
- Expect exceptions: How to handle errors
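As a taste of the extraction and error-handling steps, here's a minimal sketch, assuming a hypothetical listings page and BeautifulSoup 4; the URL, CSS selector, and table layout are placeholders, and the point is that a failed download is an expected outcome rather than a crash:

```python
import urllib2
from bs4 import BeautifulSoup  # BeautifulSoup 4

URL = "http://example.com/listings"  # hypothetical page

try:
    html = urllib2.urlopen(URL, timeout=30).read()
except urllib2.URLError as exc:
    # Expect exceptions: the network is the flakiest part of any scraper.
    print("download failed: %s" % exc)
else:
    soup = BeautifulSoup(html)
    # CSS-selector-style extraction; soup.select() is a BeautifulSoup 4 feature.
    for row in soup.select("table.listings tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:  # header rows contain only <th>, so they come back empty
            print(cells)
```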
Basics of crawling
- A quick review of HTTP
- Why cookies are necessary for maintaining a session
- How servers can track you
- How to submit forms with mechanize
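To show the shape of form submission with mechanize, here's a minimal sketch against a hypothetical login form; the URLs and field names are placeholders. The Browser object carries cookies between requests, which is exactly what keeps the session alive:

```python
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)          # mechanize obeys robots.txt by default; disable deliberately
br.addheaders = [("User-agent", "my-scraper/0.1")]

br.open("http://example.com/login")  # hypothetical site
br.select_form(nr=0)                 # pick the first form on the page
br["username"] = "alice"             # hypothetical field names
br["password"] = "s3cret"
br.submit()

# The session cookie set at login is sent automatically on later requests.
page = br.open("http://example.com/private").read()
```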
Debugging the web
- Comparing Firebug and Chrome's DOM Inspector
- The "Net" tab
- Using a logging HTTP proxy to record traffic
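Browser tools show what the browser sends; to record what your scraper sends, you can point it at a logging proxy instead. A minimal sketch, assuming a proxy such as mitmproxy is listening on localhost:8080:

```python
import urllib2

# Assumes a logging proxy (e.g. mitmproxy) is running on localhost:8080.
proxy = urllib2.ProxyHandler({"http": "http://localhost:8080"})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

# Every urllib2 request now shows up in the proxy's traffic log.
html = urllib2.urlopen("http://example.com/").read()
```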
Counter-measures, and how to circumvent them
- JavaScript
- Hidden form fields (e.g., Django CSRF; see the sketch after this list)
- CAPTCHAs
- IP address limitations
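As one concrete example, Django's CSRF protection only accepts a POST that echoes back a hidden token and the matching cookie. A minimal sketch, assuming a hypothetical Django-backed comment form; the URL and field names are placeholders:

```python
import cookielib
import urllib
import urllib2
from bs4 import BeautifulSoup

FORM_URL = "http://example.com/comment/"   # hypothetical Django-backed form

# Keep cookies, so the csrftoken cookie from the GET is sent back with the POST.
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# Step 1: fetch the form and pull the hidden token out of it.
soup = BeautifulSoup(opener.open(FORM_URL).read())
token = soup.find("input", {"name": "csrfmiddlewaretoken"})["value"]

# Step 2: POST the form, echoing the token back as a hidden field.
data = urllib.urlencode({
    "csrfmiddlewaretoken": token,
    "body": "Hello from a scraper",        # hypothetical form field
})
response = opener.open(FORM_URL, data)
```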
How to cover your scraping code with tests
- Why you should store snapshotted pages
- Using mock objects to avoid network I/O
- Using a fake getPage for Twisted
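To show what such a test looks like, here's a minimal sketch, assuming a hypothetical `myscraper.extract_titles()` built on Twisted's `getPage`; the fake keeps the same contract (a Deferred that fires with the page body) but answers from a stored snapshot, so the test never touches the network:

```python
from twisted.internet import defer
from twisted.trial import unittest

import myscraper  # hypothetical module under test; it calls myscraper.getPage(url)

SNAPSHOT = """<html><body>
  <h2 class="title">First post</h2>
  <h2 class="title">Second post</h2>
</body></html>"""


def fake_getPage(url, *args, **kwargs):
    # Same contract as twisted.web.client.getPage: a Deferred that fires
    # with the page body, except the body comes from a saved snapshot.
    return defer.succeed(SNAPSHOT)


class ExtractTitlesTest(unittest.TestCase):
    def test_titles_are_extracted(self):
        # Swap the real downloader for the fake; no network I/O happens.
        self.patch(myscraper, "getPage", fake_getPage)
        d = myscraper.extract_titles("http://example.com/blog")
        d.addCallback(self.assertEqual, ["First post", "Second post"])
        return d
```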
Parallelism
- A quick tour of different models:
- Twisted
- gevent (sketched after this list)
- celery
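To illustrate one of these models, here's a minimal gevent sketch: monkey-patch the standard library, then spawn one greenlet per (hypothetical) URL so all the downloads wait on the network at the same time:

```python
from gevent import monkey
monkey.patch_all()   # make urllib2's blocking socket calls cooperative

import urllib2
import gevent

URLS = [
    "http://example.com/page/1",   # hypothetical pages
    "http://example.com/page/2",
    "http://example.com/page/3",
]


def fetch(url):
    return url, urllib2.urlopen(url, timeout=30).read()


# One greenlet per URL; they all block on the network concurrently.
jobs = [gevent.spawn(fetch, url) for url in URLS]
gevent.joinall(jobs, timeout=60)

for job in jobs:
    if job.successful():
        url, body = job.value
        print("%s: %d bytes" % (url, len(body)))
```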
Handling JavaScript
- Automating a full web browser with Selenium RC
- Running JavaScript within Python using python-spidermonkey
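For the Selenium RC piece, a minimal sketch, assuming the Selenium server jar is already running on localhost:4444 and a hypothetical JavaScript-driven search page; a real Firefox executes the site's JavaScript, and we read back the rendered DOM:

```python
from selenium import selenium  # the Selenium RC client

# Assumes `java -jar selenium-server.jar` is listening on port 4444.
sel = selenium("localhost", 4444, "*firefox", "http://example.com/")
sel.start()
try:
    sel.open("/search")                    # hypothetical JavaScript-heavy page
    sel.type("q", "web scraping")          # fill the search box...
    sel.click("submit")                    # ...and let the page's JS handle it
    sel.wait_for_page_to_load("30000")
    rendered_html = sel.get_html_source()  # the DOM after JavaScript has run
finally:
    sel.stop()
```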
Conclusion
- Use your power for good, not evil.
- Q&A