PyCon 2019 in Cleveland, Ohio

Thursday 11 a.m.–12:30 p.m. in Room 25C

Using Machine Learning to Create Proxy Labels for Transaction Data

Tobi Bosede


In banking, we often want to predict an outcome or variable for which no labels exist to indicate ground truth. Does this mean we cannot apply supervised machine learning techniques? No! In this talk, you will learn how we created proxy labels for recurring transactions using Python and Apache Spark. Examples of recurring transactions include Netflix and Spotify subscriptions. The best part of our approach is that it did not require paying humans to create labels manually, which saved both time and money and increased accuracy by bypassing human error. Because the proxy labels were robust, we were able to improve upon a previously rule-based process for identifying recurring transactions. This ensures that customers do not experience service interruptions when their card is replaced and that they are notified when subscription costs increase significantly, among many other benefits.

### Abstract

#### Audience

The target audience is machine learning engineers, data scientists, and finance/banking folks. However, anyone with an interest in machine learning or banking will benefit from the talk.

#### Objectives

The audience will come away with an understanding of one approach to creating labels for unlabeled data and will see a situation in which Python and Apache Spark provide business value.

#### Notes

The project has evolved greatly since this article was written, but the article still provides useful context for understanding the value proposition of predicting recurring transactions.