Data science and machine learning rely on high quality datasets for visualization, statistical inference, and modeling. However, the barriers to testing data processing, analysis, or model-training code are high, even with the extensive tooling that the python ecosystem offers, such as
pandas
, pytest
, and hypothesis
.
To address this problem, in this talk I define statistical typing as a general concept describing a runtime typing system, which extends primitive data types like bool
, str
, and float
into the class of statistical data types. By providing additional semantics about the properties held by a collection of data points, statistical typing enables us to naturally express types as multivariate schemas. It also enables us to implement schemas as generative data contracts, which serve to both validate data at runtime and generate valid samples for testing purposes.
I'll use pandera, a pandas data testing library, to illustrate how statistical typing makes data testing easier by enabling you to validate real-world data with reusable schemas and isolate units of processing, analysis, and model-training code.