Social network data is not just Twitter and Facebook: networks permeate our world, yet we often don't know what to do with them. In this tutorial, we will introduce both the theory and the practice of Social Network Analysis (SNA): gathering, analyzing and visualizing data using Python, NetworkX and PiCloud. We will walk attendees through an entire project, from gathering data to presenting results.
SNA techniques are derived from sociological and social-psychological theories and take into account the whole network (or, in the case of very large networks such as Twitter, a large segment of the network). As a result, we may arrive at conclusions that seem counter-intuitive: for example, that Justin Bieber (7.5 mil. followers) and Lady Gaga (7.2 mil. followers) have relatively little actual influence despite their celebrity status, while a middle-of-the-road blogger with 30K followers is able to generate tweets that "go viral" and result in millions of impressions.
In this tutorial, we will conduct a social network analysis of a real dataset, from gathering data from online sources (Twitter!) and cleaning it, through to analysis and visualization of the results. We will use Python and a set of open-source libraries, including NetworkX, NumPy and Matplotlib.
Outline:
Introduction. Why should we do this? What is the data like? Why is this different from other techniques? What can we learn?
Gathering data: How to extract useful data from the Twitter API? How to data-mine the US Congress?
How to use PiCloud for massively parallel data gathering efforts?
Basic analysis: Degree, closeness, betweenness, PageRank, Klout Score
Beyond Klout Score: Finding communities of interest, finding clusters in networks
Analysis of content using network methods
Big Network Data analysis in the cloud using PiCloud
Information diffusion in networks -- how do things go viral?
All contents of the tutorial will be supplemented with code available on GitHub -- we hope that attendees will follow along!
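As a rough illustration of the "Basic analysis" item in the outline, the sketch below computes degree and closeness centrality on a tiny made-up network. It is not the tutorial's own code and does not use real Twitter data: the node names and edges are hypothetical, and the helper functions are written in plain Python for clarity. NetworkX offers `degree_centrality` and `closeness_centrality` for the same measures. The toy network is built so that a "blogger" with few connections sits between two clusters, which hints at why a raw follower count can be a poor proxy for influence.

```python
from collections import deque

# Hypothetical toy network (not real data): two clusters of accounts
# that are connected to each other only through "blogger".
EDGES = [
    ("celebrity", "fan1"), ("celebrity", "fan2"),
    ("celebrity", "fan3"), ("celebrity", "fan4"),
    ("celebrity", "blogger"),
    ("blogger", "hub2"),
    ("hub2", "g1"), ("hub2", "g2"), ("hub2", "g3"), ("hub2", "g4"),
]


def build_graph(edges):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly tied to."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def bfs_distances(graph, source):
    """Shortest-path length (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def closeness_centrality(graph):
    """(n - 1) / sum of distances to all other nodes.

    Assumes the graph is connected, as the toy network above is.
    """
    n = len(graph)
    return {
        node: (n - 1) / sum(bfs_distances(graph, node).values())
        for node in graph
    }


graph = build_graph(EDGES)
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)

for node in ("celebrity", "blogger"):
    print(f"{node:10s} degree={degree[node]:.3f} closeness={closeness[node]:.3f}")
```

In this toy network the celebrity has more direct ties (higher degree), while the blogger is, on average, fewer hops away from everyone else (higher closeness). Real analyses would also look at betweenness and PageRank, as listed in the outline.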