Saturday 1:20 p.m.–4:40 p.m.
Python for Social Scientists: Cleaning and Prepping Data
Renee Chu
- Audience level: Novice
- Category: Best Practices & Patterns
Description
If you're learning to code, working with data is a great way to practice your new skills. However, before you can do analysis or visualization, you need a cleaned, prepped data set. This tutorial uses Python basics to unify data sets from disparate sources. It also shows you how to write your programs as modules so you can reuse them in future projects.
Abstract
Intro:
- Who you are, why you are here, what we will learn
- Introduce the project: we are interested in sovereign debt, i.e. money borrowed by national governments. We want to analyze the factors behind debt and credit ratings side by side.
- Time: 10 min, 10 min total
Download data
- Debt as % of GDP in 2014, reported by the World Bank (http://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS).
- Import it into Python using the csv module's DictReader class.
- Briefly discuss Python data structures, lists vs dicts, representing CSV rows as dictionaries.
- Time: 20 min, 30 min total
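As a rough illustration of the import step, here is a minimal sketch of reading CSV rows as dictionaries with csv.DictReader. The sample text, the country names and the read_rows helper are all hypothetical stand-ins for the downloaded file, and the figures are made up.

```python
import csv
import io

# Hypothetical sample standing in for the downloaded file. The countries and
# values are invented for illustration and are not real World Bank figures.
SAMPLE_CSV = """Country Name,Country Code,2014
Atlantis,ATL,45.2
Freedonia,FRD,
"""


def read_rows(csv_file):
    """Return every CSV row as a dict keyed by the header row."""
    return list(csv.DictReader(csv_file))


rows = read_rows(io.StringIO(SAMPLE_CSV))
for row in rows:
    # Each row is a dictionary, so fields are looked up by column name.
    print(row["Country Name"], row["2014"])
```

With a real download, the same function should work on a file opened with open(path, newline=""). Note that every value arrives as a string, including the empty string for a missing figure.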
Clean data
- Write a function that gets rid of extraneous fields and returns tabular data only.
- Discuss functions, modularity
- Time: 20 min, 50 min total
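One way the cleaning function might look, as a sketch only: it keeps the country name, the country code and a single year's value, converts the value to a number, and skips rows with no value. The function name clean_rows, the output keys and the sample rows are assumptions made for this example.

```python
def clean_rows(rows, value_field="2014", label="debt_pct_gdp"):
    """Keep only country, code and one numeric value per row."""
    cleaned = []
    for row in rows:
        raw = (row.get(value_field) or "").strip()
        if not raw:
            continue  # skip countries with no reported value
        cleaned.append({
            "country": row["Country Name"].strip(),
            "code": row["Country Code"].strip(),
            label: float(raw),
        })
    return cleaned


# Hypothetical rows shaped like DictReader output; the values are made up.
raw_rows = [
    {"Country Name": "Atlantis", "Country Code": "ATL",
     "Indicator Name": "Debt", "2014": "45.2"},
    {"Country Name": "Freedonia", "Country Code": "FRD",
     "Indicator Name": "Debt", "2014": ""},
]

debt = clean_rows(raw_rows)
print(debt)
```

Passing the field name and label as parameters lets the same function clean the other indicator files later.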
Merge data
- Download a data set of GDP % growth Y/Y for 2014 and clean it as well (http://data.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG).
- Download a data set of inflation % Y/Y for 2014, clean it up also (http://data.worldbank.org/indicator/NY.GDP.DEFL.KD.ZG)
- Write a function that takes all the cleaned CSV objects and creates a master table of all indicators for each country
- Time: 20 min, 1:10 total
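The merge step could be sketched as follows, assuming each cleaned table is a list of dictionaries with "country" and "code" keys plus one indicator. The function name merge_tables and all sample values are hypothetical.

```python
def merge_tables(*tables):
    """Combine cleaned tables into one dict keyed by country code."""
    master = {}
    for table in tables:
        for row in table:
            # Create the country's entry the first time its code is seen.
            entry = master.setdefault(row["code"], {"country": row["country"]})
            for key, value in row.items():
                if key not in ("code", "country"):
                    entry[key] = value
    return master


# Hypothetical cleaned tables; the figures are made up.
debt = [{"country": "Atlantis", "code": "ATL", "debt_pct_gdp": 45.2}]
growth = [
    {"country": "Atlantis", "code": "ATL", "gdp_growth_pct": 2.1},
    {"country": "Freedonia", "code": "FRD", "gdp_growth_pct": 0.4},
]
inflation = [{"country": "Atlantis", "code": "ATL", "inflation_pct": 1.5}]

master = merge_tables(debt, growth, inflation)
for code, indicators in master.items():
    print(code, indicators)
```

A country missing from one source simply lacks that indicator in the master table, so later analysis code has to handle absent keys.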
Review what we did so far: 5 min, 1:15 total
Break: 15 min, 1:30 total
Import data in a different format
- Download Moody's credit ratings for each country (http://www.theguardian.com/news/datablog/2010/apr/30/credit-ratings-country-fitch-moodys-standard)
- Import and clean Moody's data as learned before the break.
- Discuss issues with adding Moody's data to master table (standardization of country names)
- Time: 20 min, 1:50 total
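To make the naming problem concrete, here is a small sketch that compares the country names used by two sources. The World Bank does use names such as "Korea, Rep." and "Russian Federation". The spellings in the second set are assumed for illustration and may not match the actual ratings table.

```python
# Names as the World Bank writes them.
world_bank_names = {"Korea, Rep.", "Russian Federation", "Brazil"}

# Hypothetical spellings from a second source, assumed for this example.
ratings_names = {"South Korea", "Russia", "Brazil"}


def unmatched(names_a, names_b):
    """Return the names found only in the first set and only in the second."""
    return sorted(names_a - names_b), sorted(names_b - names_a)


only_world_bank, only_ratings = unmatched(world_bank_names, ratings_names)
print("Only in World Bank data:", only_world_bank)
print("Only in ratings data:", only_ratings)
```

Any name that appears in only one list would be silently dropped or duplicated by a merge keyed on names, which is why a shared identifier is needed.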
Write a class that resolves country names (all valid variations) to ISO codes
- Get ISO codes from Wikipedia (https://en.wikipedia.org/wiki/ISO_3166-1#Current_codes)
- Discuss Python classes, modularity, re-use
- Time: 30 min, 2:20 total
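A name-resolving class might look something like the sketch below. The class name CountryNameResolver, its methods and the normalization rules are assumptions for illustration. The three-letter codes shown are the ISO 3166-1 alpha-3 codes for those countries, while the alias lists are deliberately incomplete.

```python
class CountryNameResolver:
    """Map known spellings of a country name to one ISO 3166-1 alpha-3 code."""

    def __init__(self):
        self._codes = {}

    @staticmethod
    def _normalize(name):
        # Lowercase, drop commas and periods, and collapse repeated spaces.
        return " ".join(name.lower().replace(",", " ").replace(".", " ").split())

    def add(self, code, *names):
        """Register one or more spellings for a code."""
        for name in names:
            self._codes[self._normalize(name)] = code

    def resolve(self, name):
        """Return the code for a name, or raise KeyError if it is unknown."""
        key = self._normalize(name)
        if key not in self._codes:
            raise KeyError(f"Unknown country name: {name!r}")
        return self._codes[key]


resolver = CountryNameResolver()
resolver.add("KOR", "Korea, Rep.", "South Korea", "Republic of Korea")
resolver.add("RUS", "Russian Federation", "Russia")
resolver.add("BRA", "Brazil")

print(resolver.resolve("South Korea"))
print(resolver.resolve("  korea, rep. "))
```

Raising an error for unknown names, rather than returning a default, makes gaps in the alias list visible as soon as a new data source is imported.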
Create unified table, with the help of your name standardizer
- Modify the existing World Bank importer to use the name standardizer.
- Add Moody's data to master table, also using name standardizer
- Time: 20 min, 2:40 total
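The final step can be sketched as keying every source by ISO code before merging. To keep this example self-contained, a plain dictionary lookup (to_code) stands in for the name standardizer. The add_source helper, the field names and every value are hypothetical placeholders, not real debt figures or ratings.

```python
# Stand-in for the name standardizer: a small alias lookup, assumed for
# this example only.
ALIASES = {"korea, rep.": "KOR", "south korea": "KOR", "brazil": "BRA"}


def to_code(name):
    """Resolve a country name to its ISO code via the alias table."""
    return ALIASES[name.strip().lower()]


def add_source(master, rows, name_field, value_field, label):
    """Add one source's values to the master table, keyed by ISO code."""
    for row in rows:
        code = to_code(row[name_field])
        master.setdefault(code, {})[label] = row[value_field]
    return master


# Placeholder rows; none of these values are real data.
world_bank_rows = [
    {"Country Name": "Korea, Rep.", "2014": 10.0},
    {"Country Name": "Brazil", "2014": 20.0},
]
ratings_rows = [
    {"Country": "South Korea", "Rating": "placeholder-1"},
    {"Country": "Brazil", "Rating": "placeholder-2"},
]

master = {}
add_source(master, world_bank_rows, "Country Name", "2014", "debt_pct_gdp")
add_source(master, ratings_rows, "Country", "Rating", "rating")
print(master)
```

Because both sources pass through the same name lookup, "Korea, Rep." and "South Korea" land in the same row of the unified table.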
Review everything we did: 10 min, 2:50 total
Questions: 10 min, 3:00 total
Student Handout
No handouts have been provided yet for this tutorial