Jeff Stuckman

University of Maryland, College Park
Doctoral Student in Computer Science
Email: stuckman (at symbol) umd (dot symbol) edu

Research Interests

My main research interest is in developing and evaluating methods that improve software security by drawing insights from past security vulnerabilities. Much related work has focused on analyzing characteristics of vulnerabilities and modeling their relationships to features found in source code. My current work focuses on:

  1. How studies of security vulnerabilities can be refactored or redesigned to provide actionable insights to multiple stakeholder communities, such as developers, system administrators, and users.
  2. Specific activities that stakeholders can perform, such as system reconfiguration, which improve security without neutralizing the beneficial functionality of a potentially vulnerable application.
  3. Ways to collect security vulnerability datasets in an automated or semi-automated manner, supporting the above efforts.

Research projects

Below are several current and past research projects which support these interests:

Vulnerability research dataset

In conjunction with our paper Predicting Vulnerable Components: Software Metrics vs Text Mining, presented at ISSRE 2014, we have released a dataset of security vulnerabilities found in three open-source PHP web applications: phpMyAdmin, Drupal, and Moodle. The dataset records the revisions where each vulnerability was introduced and fixed, along with the file containing the vulnerable code at each revision of the software. This fine-grained information on the evolution of each vulnerability allows the time dimension to be considered when building predictive models for vulnerabilities -- a dimension that has traditionally been difficult to work with using readily available vulnerability data.
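As an illustration of how per-revision data supports this kind of time-aware analysis, the following sketch computes how long each vulnerability survived, measured in revisions. The field names and sample rows are hypothetical and do not reflect the released dataset's actual schema.

```python
# Hypothetical record layout: one entry per vulnerability with the revisions
# where it was introduced and fixed. These sample rows are fabricated.
rows = [
    {"vuln_id": "V1", "file": "login.php", "introduced": 100, "fixed": 250},
    {"vuln_id": "V2", "file": "db.php", "introduced": 180, "fixed": 200},
]

def lifetime_in_revisions(row):
    """Number of revisions between a vulnerability's introduction and fix."""
    return row["fixed"] - row["introduced"]

for r in rows:
    print(r["vuln_id"], lifetime_in_revisions(r))
```

Lifetimes like these, aggregated over a project's history, are one of the quantities the time dimension makes accessible.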

The dataset is available here

Actionable vulnerability prediction models

Related work in vulnerability prediction models has identified ways to compute the relative likelihood that regions of source code are associated with security vulnerabilities. We seek to recast these vulnerability prediction models into forms which can support decision-making in additional scenarios, such as choosing products which are more likely to be secure in the context of a user's particular environment.

Security vulnerability datasets

To facilitate better empirical vulnerability research, we are developing two public software security vulnerability datasets. The first, BugBox, is a corpus of PHP web application vulnerabilities that allows the behavior of exploits to be measured in a simulated runtime environment. The second, a dataset linking PHP web application vulnerabilities to source code artifacts, is slated for release later in 2014.

Reproducible defect and vulnerability prediction studies

To complement the datasets discussed above, we are currently developing tools supporting easier replication of prediction studies for defects and vulnerabilities.

Software defect and vulnerability modeling

Defect and vulnerability prediction often involves constructing models that estimate the likelihood that a defect exists in a particular source code artifact. Sometimes these models can also be used in a generative capacity to produce synthetic data, such as counts of simulated defects. We are examining whether characteristics of this synthetic data can be compared to those of real defect data in order to check the model's consistency with the data that was actually observed. In addition, we are studying ways to improve cross-project prediction performance by increasing the generality of predictive models, much as avoiding overfitting can improve performance in within-project prediction.
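As a sketch of the generative idea, the following compares simple summary statistics of real and synthetic defect counts under an assumed Poisson model. The observed counts and the choice of model are illustrative only, not taken from our studies.

```python
import math
import random
import statistics

# Illustrative only: observed defect counts per file (fabricated numbers).
observed = [0, 1, 0, 3, 2, 0, 1, 5, 0, 2]

# A toy generative model: defects per file follow a Poisson distribution
# with rate equal to the observed mean.
rate = statistics.mean(observed)

def sample_poisson(lam, rng):
    # Knuth's multiplication algorithm for Poisson sampling (fine for small rates).
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(0)
synthetic = [sample_poisson(rate, rng) for _ in observed]

# Consistency check: a Poisson model implies variance close to the mean, so a
# large gap between observed variance and mean suggests the model misfits.
print("observed mean/var:", rate, statistics.pvariance(observed))
print("synthetic mean/var:", statistics.mean(synthetic), statistics.pvariance(synthetic))
```

Comparing such statistics between real and synthetic samples is one simple way to ask whether a fitted model is consistent with the data it was fitted to.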

Wiki research dataset

This section describes the companion datasets to Measuring the Wikisphere and its related work.

Due to the inherent difficulty of obtaining experimental data from wikis, past quantitative wiki research has focused largely on Wikipedia, limiting the degree to which its findings generalize. We developed WikiCrawler, a tool that automatically downloads and analyzes wikis, and used it to study 151 popular wikis running MediaWiki (none of them Wikipedias).

Available for download is a dataset describing the articles of each analyzed wiki, the users of each wiki, and the wiki's revisions (each of which indicates that a certain user edited a certain wiki article at a certain date and time). Wikis and wiki articles are identified only by ID numbers, and due to copyright issues, data that would readily link a specific wiki in our dataset to a specific wiki in real life is not publicly released. This means that URLs, the text of articles, the titles of articles, and revision comments are not available (although the link graph is preserved).

The file containing information on each wiki's pages and users (wikidata) is a CSV file whose rows come in three types, distinguished by their number of columns:

  User count rows (2 columns): Wiki ID, Number of users
  Page info rows (5 columns): Wiki ID, URL ID, Article ID, Word count, Word count of clickable links
  Link graph rows (3 columns): Wiki ID, Origin URL ID, Destination URL ID
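A minimal sketch of reading this layout, dispatching rows by their column count as described above. The sample rows are fabricated for illustration.

```python
import csv
import io

# Fabricated sample: one user-count row, one page-info row, one link-graph row.
sample = "7,1500\n7,42,9,310,28\n7,42,43\n"

def parse_wikidata(fileobj):
    users, pages, links = [], [], []
    for row in csv.reader(fileobj):
        if len(row) == 2:        # user count: Wiki ID, Number of users
            users.append((int(row[0]), int(row[1])))
        elif len(row) == 5:      # page info: Wiki, URL, Article, word counts
            pages.append(tuple(int(x) for x in row))
        elif len(row) == 3:      # link graph: Wiki, Origin URL, Destination URL
            links.append(tuple(int(x) for x in row))
    return users, pages, links

users, pages, links = parse_wikidata(io.StringIO(sample))
print(users, pages, links)
```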

The file containing information on each wiki page's edit history (wikihistorydata) is a CSV file in which each row is a history entry with the following 8 columns:

  Wiki ID, URL ID, User ID, Minutes into day edited, Year edited, Month edited, Day edited, Reserved
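A sketch of reconstructing an edit timestamp from one history entry, following the column layout above. The sample row is fabricated for illustration.

```python
import csv
import io
from datetime import datetime, timedelta

# Fabricated sample row: wiki 7, URL 42, user 13, 615 minutes into
# 2009-06-18, reserved column 0.
sample = "7,42,13,615,2009,6,18,0\n"

def row_to_edit(row):
    """Turn one history-entry row into (wiki, url, user, timestamp)."""
    wiki, url, user, minutes, year, month, day, _reserved = (int(x) for x in row)
    when = datetime(year, month, day) + timedelta(minutes=minutes)
    return wiki, url, user, when

for row in csv.reader(io.StringIO(sample)):
    print(row_to_edit(row))
```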

Download dataset

If your research requires the use of data that has not been publicly released, please contact the authors. Possible resolutions include the release of a more detailed scrubbed dataset, or a private release of the full dataset.

Other software projects

Software packages which we developed to support the above projects are listed here. Due to the realities of writing research code, this software is largely undocumented and hence is not directly available on this site. However, it will be made available to anyone who e-mails us to request it.

PHP software metrics computation framework

Much research on defect prediction with software metrics has studied languages such as Java and C; however, there has been comparatively little research on metrics in conjunction with scripting languages such as PHP. Currently available tools which work on PHP source code are only capable of computing a relatively small number of metrics. We have developed a metrics computation tool for PHP that computes a wider variety of size, complexity, and coupling metrics.
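The tool itself is available on request; as a rough illustration of the kind of metric involved, the following sketch approximates cyclomatic complexity by counting branch points in PHP source. This is not the actual tool's implementation, just a minimal stand-in for one of the many metrics it computes.

```python
import re

# Crude branch-point detector: PHP control-flow keywords plus the
# short-circuit operators. A real tool would parse the source instead.
BRANCH = re.compile(r"\b(?:if|elseif|for|foreach|while|case|catch)\b|&&|\|\|")

php_src = """<?php
function check($user) {
    if ($user == null || $user->banned) {
        return false;
    }
    return true;
}
"""

def approx_cyclomatic(src):
    # 1 + number of decision points, per the usual definition.
    return 1 + len(BRANCH.findall(src))

print(approx_cyclomatic(php_src))
```

A regex-based count like this misses ternaries and miscounts keywords appearing in strings or comments, which is exactly why a dedicated PHP-aware metrics tool is useful.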


WikiCrawler

WikiCrawler is the software package that was used to generate the wiki corpus available on this site. It is designed to quickly download data from MediaWiki instances and extract the relevant features without downloading unnecessary data.

WikiCrawler is written in Java with an Apache Derby backend. R functions to import the data are also available.