Empirical Evaluation of Pareto Efficient Multi-Objective Regression Test Case Prioritisation

Supporting material for our submission to the ISSTA 2015 conference.

Overview of the manuscript

Test case prioritisation seeks a test case ordering that maximises the likelihood of early fault revelation. Previous prioritisation techniques have tended to be single-objective, for which the additional greedy algorithm is the current state of the art. Unlike test suite minimisation, multi-objective test case prioritisation has not been thoroughly evaluated. This paper presents an extensive empirical study of the effectiveness of multi-objective test case prioritisation, evaluating it on multiple versions of five widely used benchmark programs and on a much larger real-world system of over one million lines of code. The paper also presents a lossless coverage compaction algorithm that dramatically improves the performance of all studied algorithms, by between two and four orders of magnitude, making prioritisation practical even for very demanding problems.
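The compaction algorithm itself is described in the manuscript; as a rough illustration of the underlying idea only, the Python sketch below collapses code elements (columns of a binary test-by-element coverage matrix) that are covered by exactly the same tests into single weighted representatives. The function name, data layout, and use of NumPy are assumptions made for this illustration and do not reflect the actual implementation.

    import numpy as np

    def compact_coverage(matrix):
        # matrix: binary NumPy array, rows = test cases, columns = code elements.
        # Columns covered by exactly the same set of tests are interchangeable
        # for coverage-based fitness, so keeping one representative per group,
        # together with its multiplicity, loses no information.
        groups = {}
        for col in range(matrix.shape[1]):
            key = matrix[:, col].tobytes()
            groups.setdefault(key, []).append(col)
        representatives = [cols[0] for cols in groups.values()]
        weights = np.array([len(cols) for cols in groups.values()])
        return matrix[:, representatives], weights

Any coverage-based fitness computed on the compacted matrix, with each surviving column weighted by its multiplicity, equals the fitness computed on the original matrix, which is what makes this style of compaction lossless.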


Additional results

This website was created to accompany our ISSTA 2015 submission, which is currently under review. In its present form, it provides only those results from our analysis that could not fit within the page limit of the paper.

Optimization Quality

The following tables show comprehensive statistical significance tests between all implemented algorithms in terms of optimization quality. For each subject, we statistically compare each pair of algorithms on each quality indicator (HV, EPSILON, IGD) and report the p-value of the Wilcoxon test (p-value), its Bonferroni-corrected counterpart (p Bonf), and the Â12 effect size.

To indicate the best-performing cases, we highlight statistically significant differences (at the α = 0.05 significance level) in boldface. In addition, for the Â12 effect size, we underline medium effect sizes and show large effect sizes in boldface.
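For readers who wish to reproduce the comparison procedure, the Python sketch below shows one way to compute the pairwise Wilcoxon p-values, their Bonferroni-corrected counterparts, and the Â12 effect size. It assumes the rank-sum (Mann-Whitney U) variant of the Wilcoxon test over independent runs; the function names and data layout are illustrative and not taken from our analysis scripts.

    from itertools import combinations
    from scipy.stats import mannwhitneyu

    def a12(xs, ys):
        # Vargha-Delaney A12: probability that a value drawn from xs is
        # larger than a value drawn from ys, counting ties as one half.
        greater = sum(1 for x in xs for y in ys if x > y)
        ties = sum(1 for x in xs for y in ys if x == y)
        return (greater + 0.5 * ties) / (len(xs) * len(ys))

    def pairwise_comparison(results):
        # results: dict mapping algorithm name -> list of indicator values
        # (e.g. HV) over independent runs. Returns one row per algorithm pair
        # with the raw p-value, the Bonferroni-corrected p-value, and A12.
        pairs = list(combinations(results, 2))
        rows = []
        for a, b in pairs:
            _, p = mannwhitneyu(results[a], results[b], alternative="two-sided")
            p_bonf = min(1.0, p * len(pairs))
            rows.append((a, b, p, p_bonf, a12(results[a], results[b])))
        return rows

An Â12 value of 0.5 indicates no difference between the two algorithms; the further the value lies from 0.5, the stronger the tendency of one algorithm to outperform the other.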

Testing Effectiveness

The following tables show comprehensive statistical significance tests between all implemented algorithms in terms of testing effectiveness, measured by the APFDc metric. For each subject, we statistically compare each pair of algorithms on APFDc and report the p-value of the Wilcoxon test (p-value), its Bonferroni-corrected counterpart (p Bonf), and the Â12 effect size.

To indicate the best-performing cases, we highlight statistically significant differences (at the α = 0.05 significance level) in boldface. In addition, for the Â12 effect size, we underline medium effect sizes and show large effect sizes in boldface.
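APFDc is the standard cost-cognisant variant of APFD: each fault's contribution is weighted by its severity and by the test execution cost remaining after the first test that reveals it. A minimal sketch of that computation is given below, assuming every fault is revealed by at least one test in the ordering; the data structures are hypothetical and chosen only for clarity.

    def apfdc(order, costs, severities, detects):
        # order      : list of test ids in prioritised order
        # costs      : dict mapping test id -> execution cost
        # severities : dict mapping fault id -> severity
        # detects    : dict mapping fault id -> set of test ids revealing it
        total_cost = sum(costs[t] for t in order)
        total_severity = sum(severities.values())
        numerator = 0.0
        for fault, severity in severities.items():
            # Position of the first test in the ordering that reveals the
            # fault (assumes such a test exists).
            tf = next(i for i, t in enumerate(order) if t in detects[fault])
            remaining_cost = sum(costs[order[j]] for j in range(tf, len(order)))
            numerator += severity * (remaining_cost - 0.5 * costs[order[tf]])
        return numerator / (total_cost * total_severity)

Higher APFDc values indicate orderings that reveal more severe faults earlier relative to the execution cost already spent; with uniform costs and severities, APFDc reduces to the classic APFD metric.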


Contact us

For any further inquiries about the manuscript, please contact Michael G. Epitropakis or Shin Yoo.


Acknowledgment

This project was supported by EPSRC grant EP/J017515/1 (DAASE: Dynamic Adaptive Automated Software Engineering).