Friday
11:30 a.m.–noon
Scrapy: it GETs the web
- Audience level: Intermediate
- Category: Best Practices/Patterns
Description
Scrapy lets you straightforwardly pull data out of the web. It helps you retry if the site is down, extract content from pages using CSS selectors (or XPath), and cover your code with tests. It downloads asynchronously with high performance. You program to a simple model, and it's good for web APIs, too.
If you use requests, mechanize, or celery for HTTP, you should probably switch to Scrapy.
Abstract
Extracting data from the web is often error-prone, hard to test, and slow. Scrapy changes all of that.
In this talk, we take two different types of web data retrieval -- one that scrapes data out of HTML, and another that uses a RESTful API -- and show how both can be improved by Scrapy.
Part I: Scraping without Scrapy
- Web pages render into DOM nodes
- Demonstrate a basic way to scrape a page: urllib2.urlopen() + lxml.html
- Send the data somewhere with a synchronous call (a minimal sketch of these steps follows)
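To make Part I concrete, here is a minimal sketch of that baseline, assuming a hypothetical talks page; the URL and CSS selector are illustrative, not from the talk:

    import urllib2
    import lxml.html

    # Fetch the page with a blocking call: the whole script waits here.
    html = urllib2.urlopen("http://example.com/talks").read()

    # Render the markup into DOM nodes and pick out what we want.
    doc = lxml.html.fromstring(html)
    titles = [h2.text_content() for h2 in doc.cssselect("div.talk h2")]

    # "Send the data somewhere" synchronously, one record at a time --
    # no retries, no rate limiting, and nothing here is easy to test.
    for title in titles:
        print title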
Part II: Importing Scrapy components for programmer sanity
- Using scrapy.item.Item to define what you are scraping out
- Using scrapy.spider.BaseSpider to clarify the code (a sketch of both follows this list)
- Running spiders: You just got async for free
- Discussion: What does async buy you? Quick benchmarks of 200 simultaneous connections with Scrapy and without.
- Sending data onward through the item pipeline
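A hedged sketch of those two classes working together, using the talk-era module paths (newer Scrapy renames BaseSpider to scrapy.Spider); the site, field names, and XPath expressions are illustrative:

    from scrapy.item import Item, Field
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector

    class TalkItem(Item):
        # Declare up front exactly which fields you scrape.
        title = Field()
        speaker = Field()

    class TalkSpider(BaseSpider):
        name = "talks"
        start_urls = ["http://example.com/talks"]

        def parse(self, response):
            # Scrapy already downloaded the page asynchronously;
            # this method only parses.
            hxs = HtmlXPathSelector(response)
            for talk in hxs.select("//div[@class='talk']"):
                item = TalkItem()
                item["title"] = talk.select("h2/text()").extract()
                item["speaker"] = talk.select("span/text()").extract()
                yield item  # each item flows into the item pipeline

Run it with the scrapy crawl command and the asynchronous downloading comes for free; parse() never touches the network itself.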
Part III: Everyone "loves" JavaScript
- spidermonkey with Scrapy
- Automating an entire Firefox with Selenium RC (sketch below)
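A hedged sketch of the Selenium RC route, assuming a Selenium RC server already running on localhost:4444 and an illustrative URL; the rendered HTML is handed back to a Scrapy selector for extraction:

    from selenium import selenium
    from scrapy.selector import HtmlXPathSelector

    # Drive a real Firefox so the page's JavaScript actually runs.
    browser = selenium("localhost", 4444, "*firefox", "http://example.com/")
    browser.start()
    browser.open("/talks")
    rendered = browser.get_html_source()  # HTML after scripts have run
    browser.stop()

    # Extract from the rendered source with an ordinary Scrapy selector.
    titles = HtmlXPathSelector(text=rendered).select("//h2/text()").extract()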
Part IV: Automated testing when using Scrapy
- Why testing is hard with synchronous scrapers
- How to run a scrapy.spider.BaseSpider from Python's unittest
- How to test offline (by keeping a copy of the needed pages)
- No synchronous calls, so tests run fast (see the sketch after this list)
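A hedged sketch of the offline pattern, reusing the illustrative TalkSpider from the Part II sketch; the fixture path and project module are hypothetical:

    import unittest
    from scrapy.http import HtmlResponse
    from myproject.spiders import TalkSpider  # hypothetical project layout

    class TalkSpiderTest(unittest.TestCase):
        def test_parse_extracts_items(self):
            # A saved copy of the page: no network, nothing to block on.
            body = open("tests/fixtures/talks.html").read()
            response = HtmlResponse(url="http://example.com/talks", body=body)
            # parse() is an ordinary generator, so just call it directly.
            items = list(TalkSpider().parse(response))
            self.assertTrue(items)

    if __name__ == "__main__":
        unittest.main()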
Part V: Improving a Wikipedia API client with Scrapy
- Start with a synchronous API client
- When the web service is down, watch it crash
- Make it a "scrapy.spider", and get automatic retry on failure
- Configure the request scheduler to not hammer Wikipedia
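A hedged settings.py sketch of those knobs; the setting names are real Scrapy settings, the values are illustrative:

    # settings.py: retry failed requests instead of crashing ...
    RETRY_ENABLED = True
    RETRY_TIMES = 5                       # re-request up to 5 times

    # ... and schedule requests gently so Wikipedia isn't hammered.
    DOWNLOAD_DELAY = 1.0                  # seconds between requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 2
    USER_AGENT = "my-wiki-client (you@example.com)"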
Part VI: Scrapy with Django
- Like ModelForms, but for scraping: Scrapy's DjangoItem (sketch after this list)
- Imperfect integration, but discipline gives results
- Think in terms of scrapy.item.Item
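A hedged DjangoItem sketch, using the talk-era scrapy.contrib import path; the Talk model and its app are hypothetical:

    from scrapy.contrib.djangoitem import DjangoItem
    from myapp.models import Talk  # hypothetical Django model

    class TalkItem(DjangoItem):
        # Like a ModelForm: the item's fields come from the model.
        django_model = Talk

    # In an item pipeline, TalkItem(...).save() persists via the ORM.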
Conclusion:
Asheesh's rules for sane scraping
- Separate downloading from parsing.
- Maintain high test coverage.
- Be explicit about what data you pass from the wild, wild web into your application code.
- Coding with Scrapy gives you all of these, unlike other scraping libraries.
- When Scrapy isn't appropriate:
  - For short scripts, the verbose API can feel like a serious burden.
  - If you really want exceptions raised on failure.
- Even if you use something else, you will love Scrapy's documentation on scraping in general.