Mining Software Engineering Data from GitHub

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

34 Citations (Scopus)

Abstract

GitHub is the largest collaborative source code hosting site built on top of the Git version control system. The availability of a comprehensive API has made GitHub a target for many software engineering and online collaboration research efforts. In our work, we have discovered that a) obtaining data from GitHub is not trivial, b) the data may not be suitable for all types of research, and c) improper use can lead to biased results. In this tutorial, we analyze how data from GitHub can be used for large-scale, quantitative research, while avoiding common pitfalls. We use the GHTorrent dataset, a queryable offline mirror of the GitHub API data, to draw examples from and present pitfall avoidance strategies.
Original language: English
Title of host publication: Proceedings of the 39th International Conference on Software Engineering Companion
Place of Publication: Piscataway, NJ, USA
Publisher: IEEE
Pages: 501-502
Number of pages: 2
ISBN (Print): 978-1-5386-1589-8
Publication status: Published - 2017

Publication series

Name: ICSE-C '17
Publisher: IEEE Press

Keywords

  • GHTorrent, GitHub, empirical software engineering, git
