A joint project of the Graduate School, Peabody College, and the Jean & Alexander Heard Library

Title page for ETD etd-03232015-135032

Type of Document Master's Thesis
Author Hei, Songtao
URN etd-03232015-135032
Title A Decision Tree Based Approach to Filter Candidates for Software Engineering Jobs Using GitHub Data
Degree Master of Science
Department Computer Science
Advisory Committee
  Advisor Name              Title
  Christopher Jules White   Committee Member
  Douglas C. Schmidt        Committee Member
Keywords
  • GitHub
  • Algorithmic recruiting
  • machine learning
  • decision tree
Date of Defense 2015-04-14
Availability unrestricted
Abstract
A challenge for companies hiring software engineers is the large number of candidate profiles on LinkedIn, Monster.com, and other job websites, and the difficulty of filtering top candidates from these lists. In this paper, we propose a novel approach that uses the social network structure of GitHub and a decision tree algorithm to filter candidate software engineers efficiently. The approach is based on the idea that the centrality value of a node (i.e., a candidate engineer) in the graph of GitHub users is an approximate indicator of the value of the programmer. To reduce the number of candidates considered in a job selection process, a threshold centrality value can be used to filter job candidates based on their importance in the GitHub user graph. A challenge with this approach is that, since GitHub has millions of users, calculating the centrality of every node in the GitHub user graph is an expensive operation. To overcome this challenge, we train a decision tree to predict a user's centrality based on a limited subset of their attributes. To generate training data for the decision tree from the unlabeled GitHub user graph, a threshold centrality value is chosen and a part of the user graph is labeled Accepted or Rejected according to whether the corresponding user meets the threshold centrality. We also collect the total number of each kind of public GitHub event that each user has generated, and we use these event counts as the training attributes for each user in the training dataset. Once decision trees are built from this training dataset, recruiters can use them to process large numbers of software engineering job candidates and to improve the judgment of HR departments. Based on empirical results from experiments that we conducted with GitHub user data, our approach can reach a precision of 96%.
Moreover, this method avoids repeating the expensive network centrality computation as the GitHub social graph changes over time.
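
The pipeline described in the abstract (label part of the user graph by a centrality threshold, then train a decision tree on per-user event counts) can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation. The abstract does not say which centrality measure, threshold, or tree algorithm was used, so degree centrality, the 0.5 threshold, and the small Gini-based tree below are all assumptions, and the user names and event counts are made-up toy data.

```python
from collections import Counter

def degree_centrality(edges):
    """Normalized degree centrality from an undirected edge list.
    Assumption: the abstract does not name its centrality measure."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    return {user: len(adj) / (n - 1) for user, adj in neighbors.items()}

def label_users(centrality, threshold):
    """Label each user Accepted or Rejected against the threshold."""
    return {user: "Accepted" if value >= threshold else "Rejected"
            for user, value in centrality.items()}

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((k / total) ** 2 for k in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature, cut) pair that lowers weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for feature in range(len(rows[0])):
        for cut in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= cut]
            right = [y for row, y in zip(rows, labels) if row[feature] > cut]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (feature, cut), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Build a tree: a leaf is a label string, a node is a 4-tuple."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    feature, cut = split
    low = [(r, y) for r, y in zip(rows, labels) if r[feature] <= cut]
    high = [(r, y) for r, y in zip(rows, labels) if r[feature] > cut]
    return (feature, cut,
            build_tree([r for r, _ in low], [y for _, y in low], depth + 1, max_depth),
            build_tree([r for r, _ in high], [y for _, y in high], depth + 1, max_depth))

def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        feature, cut, low, high = tree
        tree = low if row[feature] <= cut else high
    return tree

# Hypothetical toy graph: an edge links two users (e.g. a follow relation).
edges = [("ann", "bob"), ("ann", "cy"), ("ann", "dee"), ("ann", "eve"),
         ("bob", "cy"), ("bob", "dee"), ("eve", "fay")]
# Hypothetical event counts per user: [push events, pull request events].
events = {"ann": [120, 30], "bob": [90, 22], "cy": [15, 2],
          "dee": [20, 1], "eve": [8, 0], "fay": [3, 0]}

labels_by_user = label_users(degree_centrality(edges), threshold=0.5)
users = sorted(events)
tree = build_tree([events[u] for u in users], [labels_by_user[u] for u in users])

# Classify an unseen candidate from event counts alone, with no graph computation.
print(predict(tree, [50, 5]))
```

The point of the design is visible in the last line: once the tree is trained, a new candidate is classified from event counts alone, so the graph-wide centrality computation does not need to be rerun for each new user.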
  Filename   Size        Approximate Download Time (Hours:Minutes:Seconds)
                         28.8 Modem   56K Modem   ISDN (64 Kb)   ISDN (128 Kb)   Higher-speed Access
  SHei.pdf   889.63 Kb   00:04:07     00:02:07    00:01:51       00:00:55        00:00:04
