OSF SciNet: Crowd-Sourcing the Scientific Citation Network

Harry Rybacki



It is difficult to obtain metadata for scholarly articles. Tools like Google Scholar make use of this data, but others wishing to create tools or analyze the data are constrained by publisher terms of service, which stifles innovation in literature discovery and metascience. Our goal is to collect, aggregate, clean, and distribute this dataset without restriction.


The primary purpose of OSF SciNet is to provide the general public with a free, open, and comprehensive dataset containing metadata for academic citations and their corresponding references. This dataset will give the public a vital resource through which they can access, analyze, and distribute public academic citation metadata without restriction.

While many resources for gathering citation information already exist, such as CiteSeerX, Google Scholar, and PubMed, each has its own limitations. Common issues include incomplete or inaccurate data, exclusion of linked citations, and the inability to easily access or obtain data in bulk for analysis. Our goal is to overcome each of these limitations while demonstrating the necessity and advantages of the open source model.

We plan to address the problem of inaccurate and incomplete data by:

- Utilizing crowd-sourcing methods for data collection and cleaning
- Negotiating with publishers directly
- Obtaining article metadata from a wide range of other accessible sources

Additionally, by analyzing and combining the metadata obtained, we will ensure that each citation retains a complete set of linked citations. Finally, the completed dataset, as well as any analysis conducted on it, will be free and easy for the general public to access. As an initial step, we expect to show that the relationships between articles within disciplines demonstrate small-world network distributional properties.
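To make the small-world claim concrete: such networks are usually characterized by a high average clustering coefficient combined with a short average path length between nodes. The following is a minimal sketch of how those two quantities could be computed for a citation network using only the Python standard library. It is an illustration, not the project's implementation: the function names, the choice to treat citations as undirected links, and the four-article toy dataset are all assumptions made for this example.

```python
from collections import deque
from itertools import combinations


def build_graph(citations):
    """Build an undirected adjacency map from (citing, cited) pairs.

    Assumption for this sketch: citation direction is ignored and
    self-citations are skipped.
    """
    graph = {}
    for citing, cited in citations:
        if citing == cited:
            continue
        graph.setdefault(citing, set()).add(cited)
        graph.setdefault(cited, set()).add(citing)
    return graph


def average_clustering(graph):
    """Mean local clustering coefficient over all nodes."""
    if not graph:
        return 0.0
    total = 0.0
    for neighbors in graph.values():
        k = len(neighbors)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        # Count pairs of neighbors that are themselves linked.
        links = sum(1 for a, b in combinations(neighbors, 2) if b in graph[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)


def average_path_length(graph):
    """Mean shortest-path length over all reachable ordered node pairs."""
    total = 0
    pairs = 0
    for source in graph:
        # Breadth-first search gives shortest hop counts from source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical toy data: each pair is (citing article, cited article).
toy_citations = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
toy_graph = build_graph(toy_citations)
print(average_clustering(toy_graph), average_path_length(toy_graph))
```

In practice, these values would be compared against those of a random graph with the same number of nodes and links; a network is a candidate small world when its clustering is much higher than the random baseline while its average path length remains similar.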