Date of Completion


Document Type

Honors College Thesis


Data Science and Complex Systems

Thesis Type

Honors College

First Advisor

James P. Bagrow

Second Advisor

Laurent Hébert-Dufresne


Open Science, Traceability, Open Access, Software Engineering, Bipartite Network, Co-occurrence, GitHub, Citation Network, Code Artifacts, Reproducibility


Reproducibility, the means by which results are validated or refuted, is the foundation of published science and a key principle of open science. Because the current open science paradigm is relatively new, its reproducibility and its citation and attribution practices warrant inspection. We extract over 60,000 links to GitHub repository code artifacts from the texts of papers in the Semantic Scholar Open Research Corpus. Examining these artifacts, we infer that a majority involve a repository created directly by an author of the paper in which the link was found. We describe several properties of this set of links, including the degree distribution of linked papers, the frequency of links over time, and the bidirectionality of the link from repository to paper. We examine the co-occurrence of citations to papers and their associated repositories through the underlying network structure. Finally, we attempt to shed light on the presence of missing or deleted traces to code artifacts.
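
The link-extraction and degree-counting steps described in the abstract can be illustrated with a minimal sketch. This is not the thesis's actual pipeline: the regular expression, the function names `extract_repos` and `repo_degrees`, and the sample paper texts are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical pattern for github.com/<owner>/<repo>; an assumption,
# not the extraction rule used in the thesis.
GITHUB_REPO = re.compile(
    r"github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)", re.IGNORECASE
)


def extract_repos(text):
    """Return the set of 'owner/repo' strings linked in one paper's text."""
    repos = set()
    for owner, repo in GITHUB_REPO.findall(text):
        repo = repo.rstrip(".")      # drop a sentence-ending period
        if repo.endswith(".git"):    # drop a clone-URL suffix
            repo = repo[:-4]
        if repo:
            repos.add(f"{owner}/{repo}".lower())
    return repos


def repo_degrees(papers):
    """Count how many papers link to each repository.

    In the bipartite paper-repository network, this is the degree of
    each repository node.
    """
    degrees = Counter()
    for text in papers.values():
        degrees.update(extract_repos(text))
    return degrees


# Made-up sample texts, for illustration only.
papers = {
    "paper_a": "Code is at https://github.com/example-lab/toolkit.",
    "paper_b": "We build on github.com/example-lab/toolkit "
               "and https://github.com/other/Repo.git",
}
degrees = repo_degrees(papers)
```

Repository names are lowercased here so that differently cased links to the same repository are counted together; a real analysis would also need to handle renamed, redirected, or deleted repositories.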

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.

Available for download on Sunday, May 19, 2024