University of Maryland, College Park
Doctoral Student in Computer Science
Email: stuckman (at symbol) umd (dot symbol) edu
My main research interest is in learning and evaluating methods to improve software security by gaining insights from past security vulnerabilities. Much related work has focused on analyzing the characteristics of vulnerabilities and modelling their relationships to features found in source code.
Below are several current and past research projects which support these interests:
In conjunction with our paper Predicting Vulnerable Components: Software Metrics vs Text Mining, presented at ISSRE 2014, we have released a dataset of security vulnerabilities found in three open-source PHP web applications: PHPMyAdmin, Drupal, and Moodle. The dataset contains information on the revisions where each vulnerability was introduced and fixed, along with the file which contained the vulnerable code at each revision of the software. This fine-grained information on the evolution of each vulnerability allows the time dimension to be considered when building predictive models for vulnerabilities, a dimension that has traditionally been difficult to work with using readily available vulnerability data.
The dataset is available here.
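As a rough sketch of how this time dimension might be used, the Python fragment below computes a per-file average vulnerability lifetime. It assumes the dataset has been exported to a CSV file named php_vulnerabilities.csv with hypothetical columns file, introduced_rev, and fixed_rev; the released dataset's actual file layout and column names may differ.

    import csv
    from collections import defaultdict

    # Hypothetical file name and column names; the released dataset may use a
    # different layout.
    lifetimes = defaultdict(list)
    with open("php_vulnerabilities.csv", newline="") as f:
        for row in csv.DictReader(f):
            # Lifetime measured as the number of revisions between the revision
            # that introduced the vulnerability and the revision that fixed it.
            lifetimes[row["file"]].append(int(row["fixed_rev"]) - int(row["introduced_rev"]))

    # Average vulnerability lifetime per file, e.g. as a candidate model feature.
    for path, values in sorted(lifetimes.items()):
        print(path, sum(values) / len(values))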
Related work on vulnerability prediction models has identified ways to compute the relative likelihood that regions of source code are associated with security vulnerabilities. We seek to recast these vulnerability prediction models into forms that can support decision-making in additional scenarios, such as choosing products that are more likely to be secure in the context of a user's particular environment.
To facilitate better empirical vulnerability research, we are developing two public software security vulnerability datasets. The first, BugBox, is a corpus of PHP web application vulnerabilities which allows the behavior of exploits to be measured in a simulated runtime environment. The second, a dataset linking PHP web application vulnerabilities to source code artifacts, is slated for release later in 2014.
To complement the datasets discussed above, we are currently developing tools supporting easier replication of prediction studies for defects and vulnerabilities.
Defect and vulnerability prediction often involves constructing models that estimate the likelihood that a defect exists in a particular source code artifact. Sometimes, these models can also be used in a generative capacity to produce synthetic data (such as counts of simulated defects). We are examining whether characteristics of this synthetic data can be compared with those of real defect data to assess how consistent the model is with the data that was actually observed. In addition, we are studying ways to improve cross-project prediction performance by increasing the generality of predictive models, in the same way that avoiding overfitting can improve performance in within-project prediction.
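As a toy illustration of this generative use (not the actual models or data from our studies), the sketch below draws synthetic defect labels from per-file probabilities produced by some previously fitted model and compares one simple characteristic of the synthetic data, the total defect count, against its observed value. The probabilities and labels shown are made-up placeholders.

    import numpy as np

    rng = np.random.default_rng(0)

    # Placeholder values: per-file defect probabilities from some fitted model,
    # and the defect labels actually observed for the same files.
    predicted_prob = np.array([0.05, 0.40, 0.10, 0.75, 0.20])
    observed = np.array([0, 1, 0, 1, 0])

    # Generative use of the model: each simulation draws one synthetic defect
    # label per file from the predicted probabilities.
    simulations = rng.binomial(n=1, p=predicted_prob, size=(10000, len(predicted_prob)))
    simulated_totals = simulations.sum(axis=1)

    # Compare a characteristic of the synthetic data (the total defect count)
    # against the value observed in the real data.
    observed_total = observed.sum()
    print("observed total defects:", observed_total)
    print("mean simulated total:", simulated_totals.mean())
    print("fraction of simulations >= observed:",
          (simulated_totals >= observed_total).mean())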
This section describes the companion datasets to our paper Measuring the Wikisphere and its related work.
Due to the inherent difficulty of obtaining experimental data from wikis, past quantitative wiki research has largely focused on Wikipedia, limiting the degree to which its findings generalize. We developed WikiCrawler, a tool that automatically downloads and analyzes wikis, and used it to study 151 popular wikis running MediaWiki (none of them Wikipedias).
Available for download is a dataset describing the articles of each analyzed wiki, the users of each wiki, and the wiki's revisions (which indicate that a certain user edited a certain wiki article on a certain date and time). Wikis and wiki articles are identified only by ID numbers, and due to copyright issues, data that would readily link a specific wiki in our dataset to a specific wiki in real life is not publicly released. This means that URLs, the text of articles, the titles of articles, and revision comments are not available (although the link graph is preserved).
The file containing information on each wiki's pages and users (wikidata) is a CSV file containing the following columns (different types of rows can be distinguished by the number of columns therein):
Row type   | Col 1   | Col 2           | Col 3              | Col 4      | Col 5
User count | Wiki ID | Number of users |                    |            |
Page info  | Wiki ID | URL ID          | Article ID         | Word count | Word count of clickable links
Link graph | Wiki ID | Origin URL ID   | Destination URL ID |            |
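A minimal sketch of one way to read this file in Python, assuming the column order shown above and that the row types appear only implicitly through their differing numbers of columns:

    import csv

    user_counts = {}   # wiki ID -> number of users
    pages = []         # (wiki ID, URL ID, article ID, word count, linked-word count)
    links = []         # (wiki ID, origin URL ID, destination URL ID)

    with open("wikidata", newline="") as f:
        for row in csv.reader(f):
            if len(row) == 2:       # user count row
                user_counts[row[0]] = int(row[1])
            elif len(row) == 5:     # page info row
                pages.append(tuple(row))
            elif len(row) == 3:     # link graph row
                links.append(tuple(row))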
The file containing information on each wiki page's edit history (wikihistorydata) is a CSV file containing the following columns:
Row type      | Col 1   | Col 2  | Col 3   | Col 4                   | Col 5       | Col 6        | Col 7      | Col 8
History entry | Wiki ID | URL ID | User ID | Minutes into day edited | Year edited | Month edited | Day edited | Reserved
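Assuming the column order shown above, the timestamp of each edit can be reconstructed from the year, month, day, and minutes-into-day columns; a small sketch:

    import csv
    from datetime import datetime, timedelta

    edits = []   # (wiki ID, URL ID, user ID, timestamp of edit)
    with open("wikihistorydata", newline="") as f:
        for row in csv.reader(f):
            wiki_id, url_id, user_id, minutes, year, month, day, _reserved = row
            when = datetime(int(year), int(month), int(day)) + timedelta(minutes=int(minutes))
            edits.append((wiki_id, url_id, user_id, when))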
If your research requires the use of data that has not been publicly released, please contact the authors. Possible resolutions include the release of a more detailed scrubbed dataset, or a private release of the full dataset.
Software packages which we developed to support the above projects are listed here. Due to the realities of writing research code, this software is largely undocumented and, hence, is not directly available on this site. However, this software will be made available to anyone who e-mails and requests it.
Much research on defect prediction with software metrics has studied languages such as Java and C; however, there has been comparatively little research on metrics for scripting languages such as PHP. Currently available tools that work on PHP source code can compute only a relatively small number of metrics. We have developed a metrics computation tool for PHP that computes a wider variety of size, complexity, and coupling metrics.
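As a deliberately simplified illustration of the kind of measurement involved (not the tool itself), the sketch below approximates two metrics for a PHP file by counting non-blank lines and decision points; the file name example.php is a placeholder.

    import re

    # Rough approximations only: the actual tool computes a much wider variety
    # of size, complexity, and coupling metrics.
    DECISION_POINTS = re.compile(r"\b(?:if|elseif|for|foreach|while|case|catch)\b|\?|&&|\|\|")

    def rough_php_metrics(path):
        with open(path, encoding="utf-8", errors="replace") as f:
            lines = [line for line in f if line.strip()]
        source = "".join(lines)
        return {
            "non_blank_lines": len(lines),                                  # crude size metric
            "approx_cyclomatic": 1 + len(DECISION_POINTS.findall(source)),  # crude complexity metric
        }

    print(rough_php_metrics("example.php"))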
WikiCrawler is the software package that was used to generate the wiki corpus available on this site. It is designed to quickly download wiki data from MediaWiki instances and extract the relevant features without downloading unnecessary data.
WikiCrawler is written in Java with an Apache Derby backend. R functions to import the data are also available.