PhD Proposal: Leveraging Light-Weight Analyses to Aid Software Maintenance

Software maintenance can account for up to 90% of a system's life cycle cost. As a result, many automated techniques have been developed to reduce the overall effort necessary to sustain software over time. While many of these tools work well under certain circumstances, we believe that they could be improved by taking advantage of large untapped sources of unstructured information that result from natural software development. As the size and complexity of systems increase, taking advantage of such previously overlooked information is of paramount importance. The proposed research will design lightweight analyses to extract latent information encoded by humans in software development artifacts and thereby reduce the costs of software maintenance. We will design analyses that apply throughout the maintenance process, focusing on three areas: (1) triaging large collections of exposed defects; (2) guiding the search for automatic defect repairs; and (3) ensuring the continued consistency of system documentation over time. For each area of proposed research, we aim to improve upon existing approaches to reduce the total cost of software maintenance while requiring minimal additional developer effort. We will evaluate the proposed work using these criteria on a diverse set of real-world programs totaling millions of lines of code to concretely demonstrate reduced maintenance costs and improvements over existing processes.

The document associated with my PhD proposal can be found here and the slides here.

Leveraging Program Equivalence for Adaptive Program Repair: Models and First Results

Software bugs remain a compelling problem. Automated program repair is a promising approach for reducing cost, and many methods have recently demonstrated positive results. However, success on any particular bug is variable, as is the cost to find a repair. This paper focuses on generate-and-validate repair methods that enumerate candidate repairs and use test cases to define correct behavior. We formalize repair cost in terms of test executions, which dominate the cost of most test-based repair algorithms. Insights from this model lead to a novel deterministic repair algorithm that computes a patch quotient space with respect to an approximate semantic equivalence relation. This allows syntactic and dataflow analysis techniques to dramatically reduce the repair search space. Generate-and-validate program repair is shown to be a dual of mutation testing, directly suggesting possible cross-fertilization. Evaluating on 105 real-world bugs in programs totaling 5MLOC and involving 10,000 tests, our new algorithm requires an order of magnitude fewer test evaluations than the previous state of the art and is over five times more efficient in terms of monetary cost.
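
To illustrate the core idea, the following Python sketch shows one way candidate patches can be partitioned into approximate semantic-equivalence classes so that the test suite runs on only one representative per class. The names (normalize, run_tests, repair) and the purely syntactic canonicalization are illustrative placeholders, not the algorithm from the paper.

# Minimal sketch: reduce a generate-and-validate repair search by grouping
# candidate patches into approximate equivalence classes, then validating
# only one representative per class against the test suite.
# All names here (normalize, run_tests, repair) are illustrative, not the paper's API.

from collections import defaultdict

def normalize(patch_source: str) -> str:
    """Cheap syntactic canonicalization standing in for the paper's
    syntactic/dataflow equivalence checks: strip comments, blank lines,
    and whitespace so trivially identical edits collapse together."""
    lines = []
    for line in patch_source.splitlines():
        line = line.split("//")[0].strip()      # drop trailing C-style comments
        if line:                                # drop blank lines
            lines.append(" ".join(line.split()))
    return "\n".join(lines)

def repair(candidates, run_tests):
    """candidates: iterable of candidate patch sources.
    run_tests: callable returning True if the patched program passes all tests.
    Returns (validated patch or None, number of test-suite executions)."""
    classes = defaultdict(list)
    for patch in candidates:                    # build the quotient space
        classes[normalize(patch)].append(patch)

    test_runs = 0
    for members in classes.values():
        test_runs += 1                          # one validation per equivalence class
        if run_tests(members[0]):
            return members[0], test_runs
    return None, test_runs

In the paper itself the equivalence relation is built from syntactic and dataflow analyses; the point of the sketch is only that grouping candidate patches before validation is what reduces the number of test executions.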

The homepage for our program repair projects can be found here. Background and experimental data can be found at this link.

Clustering Static Analysis Defect Reports to Reduce Maintenance Costs

Static analysis tools facilitate software maintenance by automatically identifying bugs in source code. However, for large systems, these tools often produce an overwhelming number of defect reports. Many of these defect reports are conceptually similar, but addressing each report separately costs developer effort and increases the maintenance burden. We propose to automatically cluster machine-generated defect reports so that similar bugs can be triaged and potentially fixed in aggregate. Our approach leverages both syntactic and structural information available in static bug reports to accurately cluster related reports, thus expediting the maintenance process.

We evaluate our technique using 8,948 defect reports produced by the Coverity Static Analysis and FindBugs tools on both C and Java programs totaling over 14 million lines of code. We find that humans overwhelmingly agree that clusters of defect reports produced by our tool could be handled in aggregate, thus reducing developer maintenance effort. Additionally, we show that our tool is not only capable of producing perfectly accurate clusters, but can also significantly reduce the number of defect reports that developers must examine manually. For instance, at a level of 90% accuracy, our technique can reduce the number of individually inspected defect reports by 21.33%, while other multi-language tools fail to obtain more than a 2% reduction.
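
As a rough illustration of the kind of clustering involved, the sketch below groups reports by checker type and token overlap of the flagged code. The DefectReport fields, the Jaccard similarity, and the 0.5 threshold are assumptions made for this example, not the actual features our tool uses.

# Illustrative sketch: cluster static-analysis defect reports by combining
# the report's checker type with token overlap of the flagged source lines.

from dataclasses import dataclass

@dataclass
class DefectReport:
    checker: str     # e.g., "NULL_RETURNS" (field names are hypothetical)
    snippet: str     # source lines flagged by the analyzer

def tokens(text):
    return set(text.replace("(", " ").replace(")", " ").split())

def similarity(a: DefectReport, b: DefectReport) -> float:
    if a.checker != b.checker:
        return 0.0
    ta, tb = tokens(a.snippet), tokens(b.snippet)
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def cluster(reports, threshold=0.5):
    """Greedy single-pass clustering: each report joins the first cluster
    whose representative is sufficiently similar, else starts a new cluster."""
    clusters = []
    for r in reports:
        for c in clusters:
            if similarity(c[0], r) >= threshold:
                c.append(r)
                break
        else:
            clusters.append([r])
    return clusters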

Software Mutational Robustness

Neutral landscapes and mutational robustness are believed to be important enablers of evolvability in biology. We apply these concepts to software, defining mutational robustness to be the fraction of random mutations to program code that leave a program’s behavior unchanged. Test cases are used to measure program behavior and mutation operators are taken from earlier work on genetic programming. Although software is often viewed as brittle, with small changes leading to catastrophic changes in behavior, our results show surprising robustness in the face of random software mutations. The paper describes empirical studies of the mutational robustness of 22 programs, including 14 production software projects, the Siemens benchmarks, and four specially constructed programs. We find that over 30% of random mutations are neutral with respect to their test suite. The results hold across all classes of programs, for mutations at both the source code and assembly instruction levels, across various programming languages, and bear only a limited relation to test suite coverage. We conclude that mutational robustness is an inherent property of software, and that neutral variants (i.e., those that pass the test suite) often fulfill the program’s original purpose or specification. Based on these results, we conjecture that neutral mutations can be leveraged as a mechanism for generating software diversity. We demonstrate this idea by generating a population of neutral program variants and showing that the variants automatically repair latent bugs. Neutral landscapes also provide a partial explanation for recent results that use evolutionary computation to automatically repair software bugs.
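
The measurement itself is straightforward to sketch: apply first-order mutations (copy, delete, and swap of statements, as in the genetic-programming operators mentioned above) and count the fraction of mutants that still pass the test suite. In the sketch below, the statement-level program representation and the run_tests hook are placeholders for a real mutation harness.

# Sketch of measuring mutational robustness: generate random first-order
# mutants of a program and report the fraction whose behavior is unchanged
# under its test suite.

import random

def mutate(statements):
    """Return a mutated copy of a non-empty statement list using one of the
    GP-style operators: delete, copy, or swap."""
    s = list(statements)
    op = random.choice(["delete", "copy", "swap"])
    i, j = random.randrange(len(s)), random.randrange(len(s))
    if op == "delete":
        del s[i]
    elif op == "copy":
        s.insert(i, s[j])
    else:
        s[i], s[j] = s[j], s[i]
    return s

def mutational_robustness(statements, run_tests, trials=1000):
    """Fraction of random mutants that still pass the test suite
    (run_tests is a callable returning True on a passing variant)."""
    neutral = sum(1 for _ in range(trials) if run_tests(mutate(statements)))
    return neutral / trials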

A Human Study of Patch Maintainability

Identifying and fixing defects is a crucial and expensive part of the software lifecycle. Measuring the quality of bug-fixing patches is a difficult task that affects both functional correctness and the future maintainability of the code base. Recent research interest in automatic patch generation makes a systematic understanding of patch maintainability and understandability even more critical.

We present a human study involving over 150 participants, 32 real-world defects, and 40 distinct patches. In the study, humans perform tasks that demonstrate their understanding of the control flow, state, and maintainability aspects of code patches. As a baseline, we ground our results using both human-written patches that were later reverted and patches that have stood the test of time. To address any potential lack of readability in machine-generated patches, we propose a system wherein such patches are augmented with synthesized, human-readable documentation that summarizes their effects and context. Our results show that machine-generated patches are slightly less maintainable than human-written ones, but that this trend reverses when machine patches are augmented with our synthesized documentation. Finally, we examine the relationship between code features (such as the ratio of variable uses to assignments) and participants' abilities to complete the study tasks, and thus explain a portion of the broad concept of patch quality.
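
For readers curious about the code features mentioned above, here is a small, illustrative computation of the uses-to-assignments ratio. It uses Python's ast module purely for convenience; the study's subject programs were not Python, and this is not the feature extractor used in the study.

# Illustrative feature: ratio of variable uses (loads) to assignments (stores).

import ast

def use_assign_ratio(source: str) -> float:
    tree = ast.parse(source)
    uses = assigns = 0
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                assigns += 1
            elif isinstance(node.ctx, ast.Load):
                uses += 1
    return uses / assigns if assigns else float("inf")

print(use_assign_ratio("x = 1\ny = x + x"))   # 2 uses, 2 assignments -> 1.0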

Feel free to try out the actual experiment at: CodeQuality

The data and subject code from our experiments can be found here.

A Human Study of Fault Localization Accuracy

Localizing and repairing defects are critical software engineering activities. Not all programs and not all bugs are equally easy to debug, however. We present formal models, backed by a human study involving 65 participants (from both academia and industry) and 1830 total judgments, relating various software- and defect-related features to human accuracy at locating errors. Our study involves example code from Java textbooks, helping us to control for both readability and complexity. We find that certain types of defects are much harder for humans to locate accurately. For example, humans are over five times more accurate at locating "extra statements" than "missing statements" based on experimental observation. We also find that, independent of the type of defect involved, certain code contexts are harder to debug than others. For example, humans are over three times more accurate at finding defects in code that provides an array abstraction than in code that provides a tree abstraction. We identify and analyze code features that are predictive of human fault localization accuracy. Finally, we present a formal model of debugging accuracy based on those source code features that have a statistically significant correlation with human performance (for the full paper and slides, please see the publications page).
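
As a rough illustration of what such a formal model can look like, the sketch below uses a logistic model mapping defect and context features to the probability of correct localization. The feature names and coefficient magnitudes are invented for illustration only (their signs follow the directional findings summarized above, e.g., missing-statement defects and tree abstractions being harder); the actual fitted model appears in the full paper.

# Hypothetical logistic model of fault localization accuracy:
# P(correct) = 1 / (1 + exp(-(b0 + sum_i b_i * x_i)))

import math

HYPOTHETICAL_COEFFICIENTS = {
    "intercept": 0.4,
    "missing_statement_defect": -1.2,   # harder than extra-statement defects
    "tree_abstraction_context": -0.9,   # harder than array abstractions
    "lines_of_code": -0.01,
}

def p_correct_localization(features: dict) -> float:
    """Probability that a participant localizes the fault correctly,
    given a dict of feature values keyed by the names above."""
    z = HYPOTHETICAL_COEFFICIENTS["intercept"]
    for name, value in features.items():
        z += HYPOTHETICAL_COEFFICIENTS.get(name, 0.0) * value
    return 1.0 / (1.0 + math.exp(-z))

print(p_correct_localization({"missing_statement_defect": 1,
                              "tree_abstraction_context": 0,
                              "lines_of_code": 30}))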

Feel free to try out the actual experiment at: BugFinder

The data and subject code from our experiments can be found here.