Overview
There is a growing need for tools that aid intellectual property litigation. We aim to develop tools that help lawyers evaluate alleged intellectual property violations. To date, we have explored several areas of source code authorship identification and have begun building tools to support it. Our approach uses authenticated samples of source code from a population of known developers to build a fingerprint for each developer. A fingerprint describes characteristics inherent in a developer's style, based on a set of code-based metrics. We began by limiting our metrics to those based on simple style (e.g., average variable name length, average line length, tokens per line, indentation patterns) and on consecutive character patterns of a fixed length (character n-grams). Using these metrics, we build a fingerprint for each developer that discriminates her from the rest of the population while still capturing the stylistic characteristics she repeatedly exhibits. We can then compute the same set of metrics for an unidentified piece of source code and determine, using machine-learned classification algorithms, which known developer most likely authored it. On test sets drawn from students, advanced developers, and open source developers, we classify the unidentified author correctly nearly 80% of the time.
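The fingerprinting idea described above can be illustrated with a minimal Python sketch. It is not the project's implementation: all function names, the regular expressions used as rough tokenizers, the choice of 3-character patterns, and the nearest-fingerprint cosine comparison (a stand-in for the unspecified machine-learned classifier) are assumptions made for illustration only.

```python
import re
from collections import Counter
from math import sqrt

# Rough proxies (assumptions): identifiers include language keywords, and a
# token is a word, a number, or any single non-space symbol.
IDENT = re.compile(r"[A-Za-z_]\w*")
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def features(code, n=3):
    """Map source text to a sparse feature dict: style metrics plus n-gram frequencies."""
    lines = [ln for ln in code.splitlines() if ln.strip()]
    idents = IDENT.findall(code)
    num_lines = max(len(lines), 1)
    feats = {
        "avg_line_len": sum(map(len, lines)) / num_lines,
        "tokens_per_line": sum(len(TOKEN.findall(ln)) for ln in lines) / num_lines,
        "avg_ident_len": sum(map(len, idents)) / max(len(idents), 1),
        "tab_indent_frac": sum(ln.startswith("\t") for ln in lines) / num_lines,
    }
    # Consecutive character patterns of fixed length n, as relative frequencies.
    grams = Counter(code[i:i + n] for i in range(len(code) - n + 1))
    total = sum(grams.values())
    for gram, count in grams.items():
        feats["ngram:" + gram] = count / total
    return feats


def fingerprint(samples, n=3):
    """Average the feature dicts of one developer's authenticated samples."""
    summed = Counter()
    for code in samples:
        summed.update(features(code, n))
    return {key: value / len(samples) for key, value in summed.items()}


def cosine(a, b):
    """Cosine similarity between two sparse feature dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(code, fingerprints, n=3):
    """Return the known developer whose fingerprint is most similar to the code."""
    feats = features(code, n)
    return max(fingerprints, key=lambda author: cosine(feats, fingerprints[author]))


# Hypothetical authenticated samples for two made-up developers.
known = {
    "alice": ["int totalCount = 0;\nfor (int index = 0; index < limit; index++) {\n\ttotalCount += index;\n}\n"],
    "bob": ["int s=0;\nint i;\nfor(i=0;i<n;i++)\n    s+=i;\n"],
}
fingerprints = {author: fingerprint(samples) for author, samples in known.items()}

unknown = "int runningTotal = 0;\nfor (int counter = 0; counter < limit; counter++) {\n\trunningTotal += counter;\n}\n"
print(classify(unknown, fingerprints))
```

The features are left unscaled for brevity, so the raw style metrics carry more weight than the n-gram frequencies. Standardizing the features and training a proper classifier on many samples per developer would be the natural next step.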
Members
PhD Students:
Ed Stehle