Friday 12:10 p.m.–12:40 p.m.

Python Scraping Showdown: A performance and accuracy review of top scraping libraries

Katharine Jarmul

Audience level:
Intermediate
Category:
Python Libraries

Description

Ever wondered how Python web-scraping libraries compare in speed and accuracy? I’ll review lxml, html5lib, BeautifulSoup and Scrapy against a series of sites, evaluating how quickly they parse pages and how accurately they find data, particularly data that renders after the DOM loads and other pesky bits like hidden form data, internationalized data and mobile-compliant sites.
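As a rough illustration of the kind of comparison described above, here is a minimal sketch of a timing-and-accuracy harness. It is not the benchmark used in the talk: the sample page, the `benchmark` helper, the extractor functions and the `repeats` value are all assumptions made up for this example. It checks only one of the cases mentioned (hidden form data), uses the standard library's `html.parser` as a baseline, and adds lxml and BeautifulSoup only if they happen to be installed.

```python
import time
from html.parser import HTMLParser

# Hypothetical sample page: one hidden form field and some non-ASCII text.
SAMPLE_HTML = """<html><body>
<form>
  <input type="hidden" name="csrf" value="abc123">
  <input type="text" name="q">
</form>
<p>caf\u00e9</p>
</body></html>"""

EXPECTED_HIDDEN = {"csrf": "abc123"}


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden":
            self.hidden[attrs.get("name")] = attrs.get("value")


def extract_hidden_stdlib(html):
    parser = HiddenInputParser()
    parser.feed(html)
    parser.close()
    return parser.hidden


def benchmark(extractors, html, expected, repeats=50):
    """Time each extractor and record whether it found the expected data."""
    results = {}
    for name, extract in extractors.items():
        start = time.perf_counter()
        for _ in range(repeats):
            found = extract(html)
        elapsed = time.perf_counter() - start
        results[name] = {
            "seconds_per_parse": elapsed / repeats,
            "accurate": found == expected,
        }
    return results


extractors = {"html.parser": extract_hidden_stdlib}

# Optional third-party parsers; skipped when not installed.
try:
    import lxml.html

    def extract_hidden_lxml(html):
        doc = lxml.html.fromstring(html)
        nodes = doc.xpath('//input[@type="hidden"]')
        return {node.get("name"): node.get("value") for node in nodes}

    extractors["lxml"] = extract_hidden_lxml
except ImportError:
    pass

try:
    from bs4 import BeautifulSoup

    def extract_hidden_bs4(html):
        soup = BeautifulSoup(html, "html.parser")
        nodes = soup.find_all("input", attrs={"type": "hidden"})
        return {node.get("name"): node.get("value") for node in nodes}

    extractors["BeautifulSoup"] = extract_hidden_bs4
except ImportError:
    pass

results = benchmark(extractors, SAMPLE_HTML, EXPECTED_HIDDEN)
for name, stats in results.items():
    print(name, stats)
```

Timings from a single tiny page like this say little on their own; a fairer comparison would run real pages of varying size and markup quality. The sketch also does not attempt data that renders after the DOM loads.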

Abstract

This talk doesn't aim to be the final word on "which library is best"; instead, it offers a more in-depth analysis of the scraping libraries that are available. It aims to give attendees the tools they need to evaluate how Python scraping libraries can serve their own goals, and to show which pitfalls they can avoid by matching their architecture and library choice to the goals of their project.