Researchers have been modeling text difficulty for over 50 years. A variety of models have been developed, but few have focused on books for emerging readers (Grades K-2). We used Python for nearly every aspect of the project, including collecting data from reading educators, analyzing text features, and creating a predictive model. Tools used include scipy, scikit-learn, PiCloud, and others.
Researchers have been modeling the difficulty of text for over 50 years using a variety of approaches.
There are features of text in beginning reading books that are not well modeled by existing approaches.
To predict the difficulty of text, we must first establish empirical measures of difficulty. We use the Rasch model to place reading materials on a scale of difficulty on which students can also be placed using reading assessments. This is called a 'conjoint measurement model.'
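As a minimal illustration of the shared scale, the dichotomous Rasch model gives the probability of a successful response as a logistic function of the difference between a student's ability and a text's difficulty. The sketch below uses only the standard library; the ability and difficulty values are hypothetical and are not taken from the study.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of success under the dichotomous Rasch model.

    Both arguments are on the same logit scale, which is what lets
    students and texts be compared directly.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical values on a shared logit scale (illustration only).
student_ability = 1.0
easy_text = -0.5
hard_text = 2.0

p_easy = rasch_probability(student_ability, easy_text)
p_hard = rasch_probability(student_ability, hard_text)
print(f"easy text: {p_easy:.2f}, hard text: {p_hard:.2f}")
```

When ability equals difficulty the probability is exactly 0.5, so a text's position on the scale can be read as the ability level at which a student has even odds of success.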
In consultation with experts in the field, we compiled a representative sample of early reading materials.
Empirical measures of difficulty were established on the texts in our dataset. The first measure of difficulty was established through a paired-comparisons task.
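The source does not say how the paired-comparison judgments were scaled. One common option, shown here purely as an assumption, is the Bradley-Terry model, which has the same logistic form as the Rasch model. The sketch fits it with a simple iterative update; the function name and the judgment counts are hypothetical.

```python
import math


def bradley_terry(wins, iterations=200):
    """Estimate log-scale difficulties from paired comparisons.

    wins[i][j] is the number of times text i was judged harder than
    text j. Assumes every text was judged harder at least once and
    easier at least once; otherwise the estimates diverge.
    """
    n = len(wins)
    strength = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        # Fix the scale: center the log strengths at zero.
        log_mean = sum(math.log(s) for s in updated) / n
        strength = [s / math.exp(log_mean) for s in updated]
    return [math.log(s) for s in strength]


# Hypothetical judgments for three texts, ten comparisons per pair.
wins = [
    [0, 2, 1],
    [8, 0, 3],
    [9, 7, 0],
]
difficulties = bradley_terry(wins)
print([round(d, 2) for d in difficulties])
```

The returned values are only defined up to an additive constant, which is why the sketch centers them at zero.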
For a smaller set of texts, empirical difficulties were established using an assessment task completed by a set of 1,200 first- and second-grade students.
Based on previous research in the field and consultation with reading experts, we developed a set of 166 unique quantifiable text features.
Features were developed to address these unique aspects of beginning reading books:
Using an iterative process, we reduced the set of variables to 12.
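The iterative process itself is not described in the source. As one possible illustration, the sketch below uses greedy forward stagewise selection: at each step it picks the feature most correlated with the current residual, removes that feature's linear contribution, and repeats. This is a stand-in technique, and the feature names (f0 to f5) and the synthetic data are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def forward_stagewise(columns, target, n_keep):
    """Greedily select n_keep feature names from columns (name -> values)."""
    residual = list(target)
    selected = []
    while len(selected) < n_keep:
        candidates = [name for name in columns if name not in selected]
        best = max(candidates, key=lambda name: abs(pearson(columns[name], residual)))
        selected.append(best)
        # Regress the residual on the chosen feature and subtract the fit.
        xs = columns[best]
        mx = sum(xs) / len(xs)
        mr = sum(residual) / len(residual)
        sxx = sum((x - mx) ** 2 for x in xs)
        slope = sum((x - mx) * (r - mr) for x, r in zip(xs, residual)) / sxx if sxx else 0.0
        residual = [r - mr - slope * (x - mx) for x, r in zip(xs, residual)]
    return selected


# Synthetic data: only f0 and f3 actually drive the target.
rng = random.Random(0)
n_texts = 200
columns = {f"f{k}": [rng.gauss(0, 1) for _ in range(n_texts)] for k in range(6)}
target = [3 * a - 2 * b + rng.gauss(0, 0.1) for a, b in zip(columns["f0"], columns["f3"])]

selected = forward_stagewise(columns, target, n_keep=2)
print(selected)
```

In practice a library routine such as scikit-learn's recursive feature elimination, combined with cross-validation, would be a more robust choice than this hand-rolled loop.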
Because of the large number of models evaluated, we employed PiCloud to speed up the process.
Using this reduced variable set, we achieved high correlations between predicted and empirically derived measures of text difficulty.