PyCon 2016 in Portland, Or

Saturday 1:20 p.m.–4:40 p.m.

Python for Social Scientists: Cleaning and Prepping Data

Renee Chu

Audience level:
Novice
Category:
Best Practices & Patterns

Description

If you're learning to code, working with data is a great way to practice your new skills. However, before you can do analysis or visualization, you need a cleaned, prepped data set. This tutorial uses Python basics to unify data sets from disparate sources. It also shows you how to write your programs as modules so you can re-use them in future projects.

Abstract

  • Intro:

    • Who you are, why you are here, what we will learn
    • Introduce the project: we are interested in sovereign debt, i.e. money borrowed by the governments of countries. What factors relate to a country's debt and credit rating? We want to analyze them side by side.
    • Time: 10 min, 10 min total
  • Download data

    • Debt as % of GDP in 2014, reported by the World Bank (http://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS).
    • Import it into Python using the csv.DictReader class.
    • Briefly discuss Python data structures, lists vs dicts, representing CSV rows as dictionaries.
    • Time: 20 min, 30 min total
  • Clean data

    • Write a function that gets rid of extraneous fields and returns tabular data only.
    • Discuss functions, modularity
    • Time: 20 min, 50 min total
  • Merge data

    • Download a data set of GDP % growth Y/Y for 2014 and clean it as well (http://data.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG).
    • Download a data set of inflation % Y/Y for 2014, clean it up also (http://data.worldbank.org/indicator/NY.GDP.DEFL.KD.ZG)
    • Write a function that takes all cleaned csv objects and creates a master table of all indicators for each country
    • Time: 20 min, 1 hr 10 min total
  • Review what we did so far: 5 min, 1:15 total

  • Break: 15 min, 1:30 total

  • Importing data that's in a different format

    • Download Moody's credit ratings for each country (http://www.theguardian.com/news/datablog/2010/apr/30/credit-ratings-country-fitch-moodys-standard)
    • Import and clean Moody's data as learned before the break.
    • Discuss issues with adding Moody's data to master table (standardization of country names)
    • Time: 20 min, 1:50 total
  • Write a class that resolves country names (all valid variations) to ISO codes

    • Get ISO codes from Wikipedia (https://en.wikipedia.org/wiki/ISO_3166-1#Current_codes)
    • Discuss Python classes, modularity, re-use
    • Time: 30 min, 2:20 total
  • Create unified table, with the help of your name standardizer

    • Modify the existing World Bank importer to use the name standardizer.
    • Add Moody's data to master table, also using name standardizer
    • Time: 20 min, 2:40 total
  • Review everything we did: 10 min, 2:50 total

  • Questions: 10 min, 3:00 total
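
As a rough illustration of the download, clean, and merge steps in the first half of the outline, here is a minimal sketch using only the standard library. The function names (load_rows, clean_rows, merge_indicators), the column names, and the sample rows are all assumptions made for this example; the countries and figures are invented placeholders, not World Bank data, and the tutorial's actual code may be organized differently.

```python
import csv
import io

# Invented sample data. The column layout is assumed to resemble a
# World Bank CSV export; the countries and numbers are placeholders.
DEBT_CSV = """Country Name,Country Code,Indicator Name,2013,2014
Examplia,EXA,Debt (% of GDP),40.0,42.5
Samplestan,SMP,Debt (% of GDP),,
"""

GROWTH_CSV = """Country Name,Country Code,Indicator Name,2013,2014
Examplia,EXA,GDP growth (annual %),2.0,1.5
Samplestan,SMP,GDP growth (annual %),3.0,3.1
"""


def load_rows(fileobj):
    """Read a CSV file object into a list of dicts, one dict per row."""
    return list(csv.DictReader(fileobj))


def clean_rows(rows, year, code_field="Country Code"):
    """Drop extraneous fields, keeping one year's value per country.

    Returns a dict mapping country code to a float, or to None when
    the cell for that year is blank.
    """
    cleaned = {}
    for row in rows:
        raw = (row.get(year) or "").strip()
        cleaned[row[code_field]] = float(raw) if raw else None
    return cleaned


def merge_indicators(tables):
    """Combine {indicator: {country code: value}} into a master table.

    The result maps each country code to a dict of all its indicators.
    """
    master = {}
    for indicator, values in tables.items():
        for code, value in values.items():
            master.setdefault(code, {})[indicator] = value
    return master


debt = clean_rows(load_rows(io.StringIO(DEBT_CSV)), "2014")
growth = clean_rows(load_rows(io.StringIO(GROWTH_CSV)), "2014")
master = merge_indicators({"debt_pct_gdp": debt, "gdp_growth": growth})
print(master["EXA"])  # {'debt_pct_gdp': 42.5, 'gdp_growth': 1.5}
```

Keeping each step in its own function is one way to get the modularity the outline mentions: the same clean_rows call works for any indicator file that shares the assumed column layout.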

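The name standardizer from the second half of the outline could be sketched as a small class like the one below. The class name CountryNameStandardizer, its methods, and the alias table are hypothetical and chosen only for illustration; the three ISO 3166-1 alpha-3 codes are a tiny hand-typed subset, whereas the tutorial builds its list from the Wikipedia table.

```python
class CountryNameStandardizer:
    """Resolve country-name variations to ISO 3166-1 alpha-3 codes."""

    def __init__(self, codes, aliases=None):
        # codes:   {official name: ISO code}
        # aliases: {alternate name: official name}
        self._lookup = {}
        for name, code in codes.items():
            self._lookup[self._normalize(name)] = code
        for alias, official in (aliases or {}).items():
            self._lookup[self._normalize(alias)] = codes[official]

    @staticmethod
    def _normalize(name):
        # Lowercase, drop punctuation, and collapse runs of whitespace
        # so that "U.S." and " us " compare equal.
        kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
        return " ".join(kept.split())

    def resolve(self, name):
        """Return the ISO code for a name, or None if it is unknown."""
        return self._lookup.get(self._normalize(name))


# Small hand-typed subset for illustration only.
ISO_CODES = {
    "United States": "USA",
    "United Kingdom": "GBR",
    "Korea, Republic of": "KOR",
}

# Hypothetical alias table; a real one would grow as mismatches turn up.
ALIASES = {
    "U.S.": "United States",
    "UK": "United Kingdom",
    "South Korea": "Korea, Republic of",
}

standardizer = CountryNameStandardizer(ISO_CODES, ALIASES)
print(standardizer.resolve("South Korea"))  # KOR
print(standardizer.resolve("Atlantis"))     # None
```

Returning None for unknown names, rather than raising, makes it easy to collect every unmatched country in one pass and extend the alias table before merging the Moody's data into the master table.
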
Student Handout

No handouts have been provided yet for this tutorial