Saturday 1:20 p.m.–4:40 p.m.
Python for Social Scientists: Cleaning and Prepping Data
Renee Chu
- Audience level: Novice
- Category: Best Practices & Patterns
Description
If you're learning to code, working with data is a great way to practice your new skills. However, before you can do any analysis or visualization, you need a cleaned, prepped data set. This tutorial uses Python basics to unify data sets from disparate sources. It also shows you how to write your programs as modules so you can re-use them in future projects.
Abstract
* Intro:
* Who you are, why you are here, what we will learn
* Introduce the project: we are interested in sovereign debt, i.e. money borrowed by national governments. We want to analyze the factors behind debt and credit ratings side by side.
* Time: 10 min, 10 min total
* Download data
* Debt as % of GDP in 2014, reported by the World Bank (http://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS).
* Import it into Python using the csv module's DictReader class.
* Briefly discuss Python data structures, lists vs dicts, representing CSV rows as dictionaries.
* Time: 20 min, 30 min total
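The import step can be sketched roughly as follows. This is a minimal illustration, not the tutorial's actual code: the sample text, column names, country names, and numbers are all invented stand-ins for the downloaded file, and `read_rows` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical sample standing in for the downloaded CSV file.
# The countries and numbers are invented for illustration only.
SAMPLE_CSV = """Country Name,Country Code,2014
Atlantis,ATL,45.2
Borduria,BRD,
Freedonia,FRD,101.7
"""


def read_rows(csv_file):
    """Return every CSV row as a dict keyed by the header names."""
    return list(csv.DictReader(csv_file))


# With a real download you would pass an open file instead of StringIO.
rows = read_rows(io.StringIO(SAMPLE_CSV))

print(rows[0]["Country Name"])  # Atlantis
print(rows[1]["2014"] == "")    # True: a missing value arrives as an empty string
```

The result is a list (ordered, one entry per row) of dicts (one key per column), which is the lists-versus-dicts distinction the section discusses.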
* Clean data
* Write a function that gets rid of extraneous fields and returns tabular data only.
* Discuss functions, modularity
* Time: 20 min, 50 min total
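One possible shape for such a cleaning function is sketched below. It assumes rows are dicts as produced by `csv.DictReader`; the function name `clean_rows`, the field names, and the sample rows are hypothetical.

```python
def clean_rows(rows, name_field="Country Name", value_field="2014"):
    """Keep only the country name and one numeric value per row.

    Rows whose value is empty or not a number are dropped.
    """
    cleaned = []
    for row in rows:
        raw = (row.get(value_field) or "").strip()
        try:
            value = float(raw)
        except ValueError:
            # Empty or non-numeric cell: skip this row.
            continue
        cleaned.append({"country": row[name_field].strip(), "value": value})
    return cleaned


# Invented rows with extra fields we do not need downstream.
raw_rows = [
    {"Country Name": "Atlantis", "Country Code": "ATL",
     "Indicator Name": "Debt", "2014": "45.2"},
    {"Country Name": "Borduria", "Country Code": "BRD",
     "Indicator Name": "Debt", "2014": ""},
]

print(clean_rows(raw_rows))  # [{'country': 'Atlantis', 'value': 45.2}]
```

Because the field names are parameters, the same function can be re-used for other indicator files that share this layout.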
* Merge data
* Download a data set of GDP growth (% Y/Y) for 2014 and clean it as well (http://data.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG).
* Download a data set of inflation (% Y/Y) for 2014 and clean it too (http://data.worldbank.org/indicator/NY.GDP.DEFL.KD.ZG).
* Write a function that takes all cleaned csv objects and creates a master table of all indicators for each country
* Time: 20 min, 1 hr 10 min total
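A master table of this kind could be built as a dict of dicts, as in the sketch below. It assumes each cleaned table is a list of `{"country": ..., "value": ...}` rows; that row shape, the function name `build_master_table`, and the sample figures are assumptions made for illustration.

```python
def build_master_table(tables):
    """Combine cleaned tables into {country: {indicator: value}}.

    `tables` maps an indicator label to a list of
    {"country": ..., "value": ...} rows.
    """
    master = {}
    for indicator, rows in tables.items():
        for row in rows:
            # Create the country's entry on first sight, then add the indicator.
            master.setdefault(row["country"], {})[indicator] = row["value"]
    return master


# Invented sample tables.
tables = {
    "debt": [{"country": "Atlantis", "value": 45.2}],
    "growth": [
        {"country": "Atlantis", "value": 2.1},
        {"country": "Freedonia", "value": -0.4},
    ],
}

master = build_master_table(tables)
print(master["Atlantis"])   # {'debt': 45.2, 'growth': 2.1}
print(master["Freedonia"])  # {'growth': -0.4}
```

Countries missing from one source simply lack that indicator key, which makes gaps in the data easy to spot later.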
* Review what we did so far: 5 min, 1:15 total
* Break: 15 min, 1:30 total
* Import data in a different format
* Download Moody's credit ratings for each country (http://www.theguardian.com/news/datablog/2010/apr/30/credit-ratings-country-fitch-moodys-standard)
* Import and clean Moody's data as learned before the break.
* Discuss issues with adding Moody's data to master table (standardization of country names)
* Time: 20 min, 1:50 total
* Write a class that resolves country names (all valid variations) to ISO codes
* Get ISO codes from Wikipedia (https://en.wikipedia.org/wiki/ISO_3166-1#Current_codes)
* Discuss Python classes, modularity, re-use
* Time: 30 min, 2:20 total
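A name resolver along these lines might look like the following sketch. The class name `CountryNameResolver` and the hand-written alias lists are hypothetical; in the tutorial the codes would come from the Wikipedia table rather than being typed in.

```python
class CountryNameResolver:
    """Map known spellings of a country name to one ISO 3166-1 alpha-3 code."""

    def __init__(self, aliases_by_code):
        # Build a reverse lookup: normalized spelling -> code.
        self._code_by_name = {}
        for code, names in aliases_by_code.items():
            for name in names:
                self._code_by_name[self._normalize(name)] = code

    @staticmethod
    def _normalize(name):
        # Ignore letter case and surplus whitespace.
        return " ".join(name.lower().split())

    def resolve(self, name):
        """Return the code for `name`, or None if the spelling is unknown."""
        return self._code_by_name.get(self._normalize(name))


# A tiny hand-written alias list, for illustration only.
resolver = CountryNameResolver({
    "USA": ["United States", "United States of America", "US"],
    "KOR": ["Korea, Rep.", "Korea, Republic of", "South Korea"],
})

print(resolver.resolve("  south KOREA "))  # KOR
print(resolver.resolve("Korea, Rep."))     # KOR
print(resolver.resolve("Atlantis"))        # None
```

Keeping the lookup inside a class means every importer can share one resolver instance, and new spellings only need to be added in one place.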
* Create unified table, with the help of your name standardizer
* Modify the existing World Bank importer to use the name standardizer.
* Add Moody's data to master table, also using name standardizer
* Time: 20 min, 2:40 total
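The unification step could be sketched as below: each source is re-keyed by ISO code before merging, and names that fail to resolve are collected for review. Everything here is illustrative. `rekey_by_code` is a hypothetical helper, a plain dict lookup stands in for the name standardizer, and the debt figures and rating strings are invented placeholders.

```python
def rekey_by_code(rows, resolve):
    """Return ({code: value}, [names that did not resolve])."""
    by_code, unresolved = {}, []
    for row in rows:
        code = resolve(row["country"])
        if code is None:
            unresolved.append(row["country"])
        else:
            by_code[code] = row["value"]
    return by_code, unresolved


# Stand-in for the name standardizer: a plain dict lookup.
KNOWN_NAMES = {"Korea, Rep.": "KOR", "South Korea": "KOR", "United States": "USA"}

# Invented placeholder data; the two sources spell Korea differently.
debt_rows = [
    {"country": "Korea, Rep.", "value": 1.0},
    {"country": "United States", "value": 2.0},
]
rating_rows = [
    {"country": "South Korea", "value": "rating-A"},
    {"country": "Atlantis", "value": "rating-B"},
]

debt, _ = rekey_by_code(debt_rows, KNOWN_NAMES.get)
ratings, missing = rekey_by_code(rating_rows, KNOWN_NAMES.get)

# One row per code, with None where a source has no entry.
master = {
    code: {"debt": debt.get(code), "rating": ratings.get(code)}
    for code in sorted(set(debt) | set(ratings))
}

print(master["KOR"])  # {'debt': 1.0, 'rating': 'rating-A'}
print(master["USA"])  # {'debt': 2.0, 'rating': None}
print(missing)        # ['Atlantis']
```

Reporting unresolved names instead of silently dropping them shows which spellings still need to be added to the standardizer.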
* Review everything we did: 10 min, 2:50 total
* Questions: 10 min, 3:00 total
Student Handout
No handouts have been provided yet for this tutorial