Mining Software Engineering Data from GitHub

Georgios Gousios; Diomidis Spinellis

doi:10.1109/ICSE-C.2017.164

Software Engineering

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

34 Citations (Scopus)

Abstract

GitHub is the largest collaborative source code hosting site built on top of the Git version control system. The availability of a comprehensive API has made GitHub a target for many software engineering and online collaboration research efforts. In our work, we have discovered that a) obtaining data from GitHub is not trivial, b) the data may not be suitable for all types of research, and c) improper use can lead to biased results. In this tutorial, we analyze how data from GitHub can be used for large-scale, quantitative research, while avoiding common pitfalls. We use the GHTorrent dataset, a queryable offline mirror of the GitHub API data, to draw examples from and present pitfall avoidance strategies.

Original language	English
Title of host publication	Proceedings of the 39th International Conference on Software Engineering Companion
Place of Publication	Piscataway, NJ, USA
Publisher	IEEE
Pages	501-502
Number of pages	2
ISBN (Print)	978-1-5386-1589-8
DOIs	https://doi.org/10.1109/ICSE-C.2017.164
Publication status	Published - 2017

Publication series

Name	ICSE-C '17
Publisher	IEEE Press

Keywords

GHTorrent, GitHub, empirical software engineering, git

Access to Document

10.1109/ICSE-C.2017.164

Cite this

@inproceedings{31ba2cd602ea4afcaf6e14bbffaa4460,
title = "Mining Software Engineering Data from GitHub",
abstract = "GitHub is the largest collaborative source code hosting site built on top of the Git version control system. The availability of a comprehensive API has made GitHub a target for many software engineering and online collaboration research efforts. In our work, we have discovered that a) obtaining data from GitHub is not trivial, b) the data may not be suitable for all types of research, and c) improper use can lead to biased results. In this tutorial, we analyze how data from GitHub can be used for large-scale, quantitative research, while avoiding common pitfalls. We use the GHTorrent dataset, a queryable offline mirror of the GitHub API data, to draw examples from and present pitfall avoidance strategies.",
keywords = "GHTorrent, GitHub, empirical software engineering, git",
author = "Georgios Gousios and Diomidis Spinellis",
year = "2017",
doi = "10.1109/ICSE-C.2017.164",
language = "English",
isbn = "978-1-5386-1589-8",
series = "ICSE-C '17",
publisher = "IEEE",
pages = "501--502",
booktitle = "Proceedings of the 39th International Conference on Software Engineering Companion",
address = "United States",
}

TY - GEN
T1 - Mining Software Engineering Data from GitHub
AU - Gousios, Georgios
AU - Spinellis, Diomidis
PY - 2017
Y1 - 2017
N2 - GitHub is the largest collaborative source code hosting site built on top of the Git version control system. The availability of a comprehensive API has made GitHub a target for many software engineering and online collaboration research efforts. In our work, we have discovered that a) obtaining data from GitHub is not trivial, b) the data may not be suitable for all types of research, and c) improper use can lead to biased results. In this tutorial, we analyze how data from GitHub can be used for large-scale, quantitative research, while avoiding common pitfalls. We use the GHTorrent dataset, a queryable offline mirror of the GitHub API data, to draw examples from and present pitfall avoidance strategies.
AB - GitHub is the largest collaborative source code hosting site built on top of the Git version control system. The availability of a comprehensive API has made GitHub a target for many software engineering and online collaboration research efforts. In our work, we have discovered that a) obtaining data from GitHub is not trivial, b) the data may not be suitable for all types of research, and c) improper use can lead to biased results. In this tutorial, we analyze how data from GitHub can be used for large-scale, quantitative research, while avoiding common pitfalls. We use the GHTorrent dataset, a queryable offline mirror of the GitHub API data, to draw examples from and present pitfall avoidance strategies.
KW - GHTorrent, GitHub, empirical software engineering, git
U2 - 10.1109/ICSE-C.2017.164
DO - 10.1109/ICSE-C.2017.164
M3 - Conference contribution
SN - 978-1-5386-1589-8
T3 - ICSE-C '17
SP - 501
EP - 502
BT - Proceedings of the 39th International Conference on Software Engineering Companion
PB - IEEE
CY - Piscataway, NJ, USA
ER -
