Informatics Professor Cristina Lopes received a $600,000 Defense Advanced Research Projects Agency (DARPA) grant under the Mining and Understanding Software Enclaves (MUSE) program. The program was launched to review billions of lines of open-source code and discover new relationships within this “big code,” thereby helping to build more robust software. As part of this effort, Lopes is researching software analytics for big code.
Lopes is working with fellow UCI Assistant Project Scientist Pedro Martins (Institute for Software Research); UCI informatics graduate students Vaibhav Saini and Di Yang; and researchers from the Czech Technical University, Northeastern University and Microsoft Research. The team analyzed a corpus of 4.5 million non-fork projects (that is, those not stemming from the source code of another software package) on GitHub and found a “staggering” amount of file-level duplication. Of the 428 million files written in Java, C++, Python and JavaScript that the group analyzed, only 85 million (about 20 percent) were unique.
Such code duplication has considerable implications, given that research is increasingly conducted using large collections of open-source projects available on GitHub. Lopes and her team argue that the duplication can skew research conclusions when a study assumes that the projects in its dataset are diverse.
To address this issue, the team created DéjàVu, a publicly available index of file-level code duplication on GitHub. Lopes hopes that DéjàVu will help researchers and developers better understand code cloning on GitHub so they can avoid it if needed.
— Shani Murray