Research Projects

Information on doing a Senior Thesis with Prof. Sherriff
Research at the University of Virginia

CS Education
Research in CS education is a broad category, but there are numerous areas within it that interest me. What is the best language for introducing students to programming, and how do we even measure that? How do we team students together so that they succeed? How can we teach teamwork skills in engineering disciplines? How do we measure the efficacy of distance education techniques? I'm interested in any and all of these topics.

Transfer of Pair Programming to Other Disciplines
Research has shown that the use of pair programming in industry and in introductory CS courses reduces the number of faults introduced into a system. Further research suggests that the main benefit of pair programming comes from a better understanding of requirements and design choices. I am investigating (along with colleagues in other fields) how the pair programming concepts we teach in CS courses might aid students with group work in other disciplines. Do ideas such as the driver/navigator roles translate to other activities? If so, do they affect the quality of the work produced?

Software Engineering Courseware for Large Courses
Solutions that help software engineering instructors with everything from source control to team management can be hard to find and configure for a given institution. What resources should a comprehensive software engineering course focus on to best prepare students for industrial development? A customized OS distribution with a given set of tools could aid instructors in deploying such a course.

Research at NC State University

Analyzing Software Artifacts through Singular Value Decomposition to Guide Development Decisions

During development, programming teams will produce numerous different types of software development artifacts. A software development artifact is an intermediate or final product that is the result or by-product of software development. Some software development artifacts are created directly and intentionally by the development team, such as source code and design documents. However, other development artifacts are generated by the process itself, such as change records in a source control system, defect records, and test case logs.

Software artifacts created during development are often used for support purposes. For instance, change records might be referenced to track developer progress or to learn the current version of a file in the system. However, change records can also show how components interact with one another or how a system is evolving during development by examining what areas of the system change together and where these sets of changes are emerging. Testers could use information about change trends to direct their testing efforts whenever a new set of changes is created.

Using data mining techniques, software development artifacts can be used to identify association clusters in a software system that would not normally be apparent. An association cluster is a set of code units (e.g., files, packages, or blocks of code) that exhibit a specific relationship with the other code units in the cluster through a particular development artifact. For example, association clusters built from file change records generated during defect removal efforts will consist of files that, over a set of defects, tended to be modified together to repair a given defect. Similarly, association clusters derived from test case records would provide information about what areas of the code are being tested together. These association clusters could correspond to functional requirements, execution paths, or particular testing methodologies, thus exhibiting underlying aspects of the software system that might not be readily apparent. Earlier research has shown that software development artifacts can be used in software change analysis to discover underlying structures within a code base.
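As a minimal sketch of the first step, the co-change relationships described above can be tallied directly from defect-fix change records. The file names and records below are purely illustrative, not data from any real project:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical change records: each defect fix lists the files that were
# modified together to repair that defect.
change_records = [
    {"parser.c", "lexer.c"},
    {"parser.c", "lexer.c", "ast.c"},
    {"ui.c", "help.txt"},
    {"parser.c", "ast.c"},
]

# Count how often each pair of files changes together across defect fixes;
# strongly co-changing pairs are candidates for the same association cluster.
co_change = defaultdict(int)
for record in change_records:
    for a, b in combinations(sorted(record), 2):
        co_change[(a, b)] += 1

for pair, count in sorted(co_change.items(), key=lambda kv: -kv[1]):
    print(pair, count)
```

Pairs such as the parser and lexer files, which repeatedly change together, surface here even though nothing in the code's static structure ties them to one another.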

The goal of this research is to build and investigate a methodology that uses software development artifacts to illuminate underlying relationships within a system that are not necessarily structurally dependent and cannot easily be detected through other analysis techniques, and then to use those relationships to guide software development decisions. These underlying relationships appear as association clusters (i.e., bundles of files, directories, or blocks of code) that exhibit an affinity with one another based on the software development artifact used to generate the clusters. These association clusters in turn provide a kind of "topographical overview" of the system that can be used to find other relationships between various areas of the project. For instance, areas of code that tend to change together should also be tested together and will likely relate to specific requirements. If a set of field failures generates association clusters that have no relation to any clusters generated from testing data, there may be a gap in test case coverage. Moreover, these sets of changes might also include configuration files, help files, or other files that are not normally examined in an execution-based system analysis, such as a traditional impact analysis.

The methodology proposed in this research is called Software Development Artifact Analysis (SDAA). SDAA provides a framework for selecting and mining software development artifacts, generating association clusters, and then leveraging those clusters within the development process. Artifacts can be gathered from tools within the development process, such as change management systems, source control systems, and defect tracking systems. The data from these artifacts is compiled into a matrix that relates each code unit to the other code units with respect to that artifact. A singular value decomposition (SVD) is performed on the matrix to generate the association clusters. The results of the SVD can then be applied in various areas of development, such as impact analysis, test case prioritization, and program comprehension.
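The SVD step can be sketched as follows. This is an illustrative toy example, not the SDAA implementation: the file names and the file-by-defect matrix are invented, and the cluster assignment rule (dominant singular dimension) is one simple way to read clusters out of the decomposition:

```python
import numpy as np

# Hypothetical file-by-defect matrix: rows are files, columns are defect
# fixes; entry (i, j) is 1 if file i changed while repairing defect j.
files = ["parser.c", "lexer.c", "ast.c", "ui.c", "help.txt"]
A = np.array([
    [1, 1, 1, 0],  # parser.c
    [1, 1, 0, 0],  # lexer.c
    [0, 1, 1, 0],  # ast.c
    [0, 0, 0, 1],  # ui.c
    [0, 0, 0, 1],  # help.txt
], dtype=float)

# The left singular vectors group files that co-occur across defects.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Assign each file to its dominant dimension among the top two singular
# vectors; files sharing a dimension form one association cluster.
clusters = np.argmax(np.abs(U[:, :2]), axis=1)
for f, c in zip(files, clusters):
    print(f, "-> cluster", c)
```

On this matrix the decomposition separates the three compiler-front-end files from the UI file and its help file, even though the help file would never appear in an execution-based analysis.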

To examine the feasibility of SDAA, research was performed with an industrial project at IBM. Software development artifacts regarding changes initiated from defect removal efforts were gathered on three consecutive "fix pack" releases of an IBM product. We performed a short formative study on test case prioritization using the results of SDAA. Our results indicated that SDAA's prioritization information highlighted test cases that covered both the files that had changed and the files that were candidates for subsequent changes.

Utilizing Verification and Validation Certificates to Estimate Software Defect Density

During software development, teams will use several different methods to make a system more reliable. However, the verification and validation (V&V) practices used to make a system reliable might not always be documented effectively, or this documentation may not be maintained properly. This lack of documentation can hinder other developers from knowing what V&V practices have been performed on a given section of code. If developers do not know where V&V has been used, extra time could be spent re-verifying an already thoroughly verified section of code, or worse, a section of code could go unverified. Further, this information could be used post hoc to see which V&V techniques were used on sections of code with failures reported by customers. Using this failure information could help developers refine their V&V efforts for future projects.

A development team could benefit from a system that provided a means of V&V evidence management. In a software quality context, evidence management is a means of gathering the artifacts and other forms of evidence that a V&V technique was performed, in order to improve V&V documentation efforts. This evidence can take the form of log files, written documentation, information in team management software, or anything else that records V&V effort. A software certificate management system (SCMS) can support this evidence management by providing an interface and infrastructure to create, maintain, and analyze software certificates. A certificate is a record of a V&V practice employed by developers and can be used to support traceability between code and the evidence of the V&V technique used. Our objective is to provide an automated method that allows developers to track and maintain a certificate-based persistent record of the V&V practices used during development and testing, and then to leverage that V&V information to estimate defect density. These V&V records could also be used to improve the development process by monitoring V&V system coverage and providing a V&V reference for software maintenance and future projects. To accomplish this objective, we have developed Defect Estimation with V&V Certificates on Programming (DevCOP). DevCOP is an SCMS that can be used to create a persistent record of V&V practices as certificates. The DevCOP SCMS is implemented as a plug-in for the Eclipse integrated development environment (IDE).
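A certificate in this sense is just a record linking a code unit to a V&V practice and its evidence. The sketch below illustrates the idea and one use of it (flagging unverified code); the field names and data are invented for illustration and are not DevCOP's actual schema:

```python
from dataclasses import dataclass

# Hypothetical, minimal shape of a V&V certificate record; illustrative only.
@dataclass(frozen=True)
class Certificate:
    code_unit: str   # file or function the certificate covers
    technique: str   # V&V practice applied (e.g. "unit test", "inspection")
    evidence: str    # pointer to the evidence (log file, review record, ...)

certificates = [
    Certificate("parser.c", "unit test", "logs/parser_tests.log"),
    Certificate("lexer.c", "code inspection", "reviews/lexer-review.txt"),
]

# Monitoring V&V coverage: flag code units with no certificate at all.
all_units = {"parser.c", "lexer.c", "ast.c"}
covered = {c.code_unit for c in certificates}
unverified = sorted(all_units - covered)
print("Unverified units:", unverified)
```

Even this trivial query shows the benefit of persistent certificates: a developer can see at a glance that one unit has received no V&V attention, instead of re-verifying code that already has evidence on record.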

Software Testing and Reliability Early Warning (STREW) Method for Haskell

Prior research by our research group has shown that in-process metrics can serve as an early indication of an external measure of system defect density. This is done by using a regression analysis to associate the metrics with the actual defect density of previous releases of a system. This research was initially performed using Java and was called STREW-Java, or STREW-J. The next research step was to apply these principles to a functional programming language to determine whether the technique would translate between programming paradigms. I built upon STREW-J, modifying the metrics that were collected to form STREW-Haskell, or STREW-H.

The first significant evaluation of the STREW-H method involved analyzing, in-process, an industrial project by Galois Connections, Inc. to see whether the metrics were related to defect density. Defect density data was analyzed as a relative measure of reliability. We worked with Galois Connections during the seven-month development of an ASN.1 compiler system. The project consisted of developing a proof-of-concept ASN.1 compiler to show that high-assurance, high-reliability software could be created using a functional language.

To ascertain whether these metrics were indeed related to defect density, a multiple regression analysis was run on the STREW-H metrics against the number of in-process defects that were corrected and logged in the versioning system used by Galois. We performed the regression on 12 randomly selected versions of the system and used the resulting model to predict the defect densities of the remaining six versions. The analysis indicated that future defect densities in the system could be predicted from the project's historical data with respect to the STREW-H metric suite. The results of the regression analysis showed that the STREW-H metrics are associated with the number of defects discovered and corrected during the development process.
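The fit-then-predict procedure described above can be sketched in a few lines. The metric values and defect densities below are synthetic stand-ins (the real study used the STREW-H metrics and Galois's logged defects), and the 12/6 train/predict split mirrors the study's setup:

```python
import numpy as np

# Synthetic stand-ins for per-version metric vectors (three metrics per
# version) and the defect densities observed for 12 "historical" versions.
rng = np.random.default_rng(0)
X = rng.uniform(0.2, 1.0, size=(12, 3))
true_coef = np.array([2.0, -1.0, 0.5])            # planted relationship
y = X @ true_coef + 0.3 + rng.normal(0, 0.01, 12)  # observed defect densities

# Fit the multiple regression (with an intercept term) via least squares.
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Use the fitted model to predict defect density for six held-out versions.
X_new = rng.uniform(0.2, 1.0, size=(6, 3))
predicted = np.column_stack([np.ones(6), X_new]) @ coef
print(predicted)
```

With real data the fit would of course be noisier; the point is only that once the regression coefficients are estimated from historical versions, new versions' metric vectors yield defect density estimates in-process, before field failures are observed.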

This type of information could prove extremely valuable for developers trying to create a system that must be as reliable as possible in early versions. If information exists on similar projects performed by the same development team, metrics such as these can be gathered in-process to help guide corrective action and ensure that the new system is highly reliable. Using this method could also reduce the cost of producing and maintaining a system, since the cost of correcting defects increases exponentially throughout the development and lifetime of a system.