The Ordinator: Comparing Apples and Oranges at Scale

Jeff Elmore

Audience level:


MetaMetrics has developed a new technology called the Ordinator that facilitates the collection and analysis of paired-comparisons data to create equal-interval scales for arbitrary items along arbitrary constructs. Examples include ranking foods by preference, ordering books by difficulty, and ordering tasks by complexity and priority. We plan to launch a private beta at PyCon.


Introduction
------------

One of the keys to an effective distributed workforce is breaking large problems down into small units of work that require little to no training. Training individuals to rate items reliably against a rubric can be time-consuming and error-prone. When the rubric is an ordinal or Likert scale, a distributed paired-comparisons approach can overcome these liabilities.

Implementation
--------------

Using mathematics from the field of psychometrics, we developed a system for collecting and analyzing a wide variety of paired-comparisons data. Data is collected using a web application we developed called the Ordinator. Users are presented with a series of paired comparisons: two texts, images, or videos, and are asked to rank one item over the other against some set of criteria. Items can be ranked using any question, such as "Which of these texts is more difficult?" or "Which of these fruits do you prefer?" The web application can be accessed from computers and from mobile devices such as tablets and smartphones. Large sets of items can be broken into smaller chunks, with each user ordering only a subset of the items.

The Ordinator has been developed as an XML-RPC API, so the technology can be integrated into existing products. We have also developed several front-ends for different applications.

Optimal pairs are selected using the popular timsort algorithm. Ratings from many users can be analyzed simultaneously with a Bradley-Terry-Luce (BTL) model using the Choppin algorithm. The result is an equal-interval logit scale score for each item in the study. Scale scores can be interpreted in terms of the likelihood that an individual would rank one item higher than another. For example, if item 1 has a logit score of 1.0 and item 2 has a logit score of 3.0, then there is an 88% chance that a user would rank item 2 higher than item 1. If two items have the same scale score, there is a 50% chance of ranking either item over the other.
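As a sketch of how logit scale scores translate into probabilities, the standard Bradley-Terry logistic form is enough to reproduce the numbers above (the function name here is illustrative, not part of the Ordinator's API):

```python
import math

def p_ranked_higher(logit_a: float, logit_b: float) -> float:
    """Probability that a rater ranks item A above item B,
    given their Bradley-Terry-Luce (BTL) logit scale scores."""
    return 1.0 / (1.0 + math.exp(-(logit_a - logit_b)))

# Items two logits apart: the higher-scoring item wins ~88% of the time.
print(round(p_ranked_higher(3.0, 1.0), 2))  # 0.88
# Equal scale scores: a coin flip.
print(round(p_ranked_higher(2.0, 2.0), 2))  # 0.5
```

Because only the difference of logits matters, the scale is equal-interval: a two-logit gap implies the same win probability anywhere on the scale.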
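The pair-selection idea can also be sketched with Python's built-in sort, which is itself timsort: each comparison the sort requests corresponds to one paired-comparison question put to a rater. The `judge` callback below is a stand-in assumption for illustration, not the Ordinator's actual interface:

```python
from functools import cmp_to_key

def order_items(items, judge):
    """Order items using only pairwise judgments.

    `judge(a, b)` returns a negative number if a rater ranks `a`
    below `b`, and a positive number if above.  Python's sort
    (timsort) decides which pairs to ask about, so far fewer than
    all n*(n-1)/2 comparisons are needed.
    """
    return sorted(items, key=cmp_to_key(judge))

# Stand-in judge: compare word lengths as a crude proxy for "difficulty".
texts = ["cat", "photosynthesis", "bicycle"]
ordered = order_items(
    texts, lambda a, b: (len(a) > len(b)) - (len(a) < len(b))
)
print(ordered)  # ['cat', 'bicycle', 'photosynthesis']
```

In practice the judgments come from many raters and are noisy, which is why the pairwise results are then fitted with the BTL model rather than taken as a single definitive ordering.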
In addition to producing an equal-interval logit scale, the system can analyze the degree to which participants agree on the relative ordering of the items and produce an uncertainty estimate for each item on the scale. A narrower range of scale values indicates greater disagreement between raters; a wider range indicates a generally agreed-upon, unidimensional ordering of the items.

Example Applications
--------------------

Example studies include early-reading educators ordering children's books by increasing complexity, and Mechanical Turk workers creating scales of the familiarity of terms in biology, chemistry, and American history. In addition to ordering on complexity and familiarity, we have also run experiments on user preferences. Preference tasks include rating preferences for foods, rating candidate names for new products, and ordering Super Bowl commercials from worst to best. For the book-ordering task, a validation experiment shows that the crowdsourced ratings correspond closely with empirical complexity measures from an assessment task using the same books. Similar validation experiments have been performed on the American history, biology, and chemistry scales.

Conference
----------

In addition to a presentation or poster, we are prepared to demonstrate the technology to conference-goers. Participants could order a small set of items and see how their ratings agree with the consensus ordering. We will also accept sign-ups for a private beta of the technology.

References
----------

* Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs, I: The method of paired comparisons. Biometrika, 39, 324-345.
* Choppin, B. (1985). A fully conditional estimation procedure for Rasch model parameters. Evaluation in Education, 9, 29-42.
* Garner, M., & Engelhard, G. (2000). The method of paired comparisons, Rasch measurement theory, and graph theory. In M. Wilson, G. Engelhard, & M. Stone (Eds.), Objective measurement: Theory into practice (Vol. 5).
* Kendall, M. (1938). A new measure of rank correlation. Biometrika, 30(1-2), 81-89.
* Peters, T. (2002). [Python-Dev] Sorting. Python Development mailing list.
* Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. Danish Yearbook of Philosophy, 14, 58-94.