PyCon 2019 in Cleveland, Ohio

Wednesday 9 a.m.–12:20 p.m. in Room 19

Pandas is for Everyone

Daniel Chen

Description

Data science and machine learning have become closely associated with languages like Python, and libraries like NumPy and Pandas have become the de facto standard for working with data. The DataFrame object provided by Pandas lets us work with the heterogeneous tabular data that is common in "real world" datasets. New learners are often drawn to Python and Pandas by the variety of models and insights the ecosystem can provide, but are overwhelmed by the initial learning curve.

This tutorial aims to guide the learner from using spreadsheets to using the Pandas DataFrame. Moving to a programming language gives the user a more reproducible workflow, and as datasets get larger, some cannot even be opened in a spreadsheet program. The goal is to make an absolute beginner proficient enough with Pandas to start working with data in Python.

We will cover how to load and view our data and introduce what Dr. Hadley Wickham calls "tidy data". Tidy data is an important concept because the process of tidying data fixes a host of data problems that must be resolved before analysis. We then cover functions and how to apply them to our data, with a focus on data cleaning, and show how the concept of split-apply-combine (groupby) can summarize or reduce our data. Finally, we cover the role of Pandas in analysis packages such as scikit-learn. The tutorial will end with a fitted model. The broader goal is to get people familiar with Python and Pandas so they can learn and explore many other parts of the Python ecosystem (e.g., scikit-learn, Dask, seaborn).
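As a rough illustration of the load, tidy, and groupby steps described above, here is a minimal sketch. It is not taken from the tutorial materials: the dataset, the column names (site, year, visits), and the variable names are all hypothetical, and the data is read from an in-memory string so the example runs without any files.

```python
import io

import pandas as pd

# Hypothetical "wide" data of the kind often kept in a spreadsheet:
# one row per site, one column per year. Not from the tutorial itself.
raw_csv = io.StringIO(
    "site,2017,2018,2019\n"
    "A,10,12,15\n"
    "B,20,18,25\n"
)

# Load and view the data.
wide = pd.read_csv(raw_csv)
print(wide.head())

# Tidy the data: each variable becomes a column and each observation a row.
tidy = wide.melt(id_vars="site", var_name="year", value_name="visits")
tidy["year"] = tidy["year"].astype(int)
print(tidy)

# Split-apply-combine: split by site, take the mean, combine the results.
mean_by_site = tidy.groupby("site")["visits"].mean()
print(mean_by_site)
```

The melt step is what turns the spreadsheet-style layout into tidy form; once the year is a column rather than a header, grouping and summarizing become one-line operations.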

Student Handout

No handouts have been provided yet for this tutorial.