Manual debugging is driven by experiments: test runs that narrow down failure causes by systematically confirming or excluding individual factors. The BUGEX approach leverages test case generation to systematically isolate such causes from a single failing test run, causes such as properties of execution states or branches taken that correlate with the failure. Identifying these causes allows deriving conclusions such as: “The failure occurs whenever the daylight savings time starts at midnight local time.” In our evaluation, a prototype of BUGEX precisely pinpointed important failure-explaining facts for six out of seven real-life bugs.
Isolating Failure Causes through Test Case Generation
1. Isolating Failure Causes through Test Case Generation
Jeremias Rößler • Gordon Fraser • Andreas Zeller • Alessandro Orso
Saarland University / Georgia Institute of Technology
2. Joda Time Brazilian Date
public void testBrazil() {
LocalDate date = new LocalDate(2009, 10, 18);
DateTimeZone dtz =
DateTimeZone.forID("America/Sao_Paulo");
Interval interval = date.toInterval(dtz);
}
73. Quantitative Results

Bug                       Branches (time / reported)   Predicates (time / reported)
Brazilian Date bug        2,380 s / 1                  13,55 s / 25
Base64 Lookup bug         38 s / 1                     737 s / 2
Sparse Iterator bug       216 s / 8                    10,267 s / 9
Base64 Decoder bug        214 s / 1                    1,339 s / 23
Western Hemisphere bug    8,422 s / 7                  30,937 s / 9
Vending Machine bug       19 s / 1                     56 s / 1
Parse French Date bug     1,577 s / 15                 n/a
80. Qualitative Results

• Correct results in six out of seven cases
• No additional effort
• Small number of branches reported
• Directly leads to failure cause in six out of seven cases
97. Comparing Branches

Bug                       BugEx (time / branches)   Statistical Debugging (time / branches)
Brazilian Date bug        2,380 s / 1               25,223 s / 24
Base64 Lookup bug         38 s / 1                  512 s / 17
Sparse Iterator bug       216 s / 8                 1,901 s / 14
Base64 Decoder bug        214 s / 1                 496 s / 51
Western Hemisphere bug    8,422 s / 7               25,341 s / 28
Vending Machine bug       19 s / 1                  n/a
Parse French Date bug     1,577 s / 15              25,542 s / 26
Hello and welcome to my talk on how test case generation isolates failure causes. My name is Jeremias Rößler, and this work was conducted together with Gordon Fraser, Andreas Zeller and Alessandro Orso. This is about automated debugging, so let's just start off with an example.
Joda Time is a date and time library and a drop-in replacement for the default date and time API that ships with Java. It contained a bug that we named the Brazilian Date bug, since it only manifests on a particular date in the Brazilian time zone. What you see here is the test that reproduces the failure: you create the specific date and time zone and then transform the date into a time interval that represents the 24 hours of that date in the Brazilian time zone. And if you execute this code: boom, your program crashes. Now, as a developer, what do you do?
Well, you know there is this field called statistical debugging that is concerned with providing the developer with a defect location. The first ones in the field were Jones, Harrold and Stasko with their Tarantula tool.

Liblit and colleagues improved on this by focusing on predicate values.

But much has been done since then, and so our implementation is based on the work of Rui Abreu and colleagues.
And this is how statistical debugging works in a nutshell. You take a number of passing and failing executions and correlate them with, for instance, statements in the code to find the faulty ones. So what you do is count how many times a statement was executed in a run that resulted in a success and how many times in a run that resulted in a failure. The intuition is that the more often a statement was executed during failing executions, and the less often during passing ones, the more probable it is that the statement contains the defect that causes the failure. This has been shown to work well on a benchmark with thousands of test cases.
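To make the counting concrete, here is a minimal sketch of this kind of spectrum-based ranking using the Ochiai metric from Abreu et al.; the statement names and execution counts below are invented purely for illustration:

```java
import java.util.*;

public class Suspiciousness {
    // ef: times the statement was executed in failing runs
    // ep: times it was executed in passing runs
    // totalFailing: total number of failing runs
    static double ochiai(int ef, int ep, int totalFailing) {
        if (ef == 0) return 0.0;
        return ef / Math.sqrt((double) totalFailing * (ef + ep));
    }

    public static void main(String[] args) {
        int totalFailing = 4;
        // hypothetical spectrum: statement -> {executed-in-failing, executed-in-passing}
        Map<String, int[]> spectrum = new LinkedHashMap<>();
        spectrum.put("stmt A", new int[]{4, 0}); // only in failing runs
        spectrum.put("stmt B", new int[]{4, 6}); // in both kinds of runs
        spectrum.put("stmt C", new int[]{1, 9}); // mostly in passing runs
        for (Map.Entry<String, int[]> e : spectrum.entrySet()) {
            int[] c = e.getValue();
            System.out.printf(Locale.ROOT, "%s: %.2f%n",
                    e.getKey(), ochiai(c[0], c[1], totalFailing));
        }
    }
}
```

A statement covered exclusively by failing runs gets the maximal score 1.0, while statements shared with many passing runs sink toward the long tail described next.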
So you run statistical debugging and apply it to the problem. And this is what you get: a long list of statements with their corresponding probability of containing the defect. In practice, this means that a developer would manually have to sift through that list to try to spot the defect.
So let's actually do the developer's work and inspect some of the results. This is some static initialization code. To me this doesn't look very suspicious, but then there is also no explanation of what would make that code suspicious. So, hmm, I don't know. Maybe let's look at the second result.

Now this looks like a place where the fields are assembled. Probably also some kind of initialization. Again, this doesn't look very suspicious. But again, I don't really know. Maybe let's have a look at the next result.

This creates an instance of an UnsupportedOperationField. This sounds as if it could have something to do with the failure. But when I look at the code, again this is the initialization of a cache. Hmm, without an explanation, this also doesn't look very suspicious, but who knows?

And in this fashion you would continue ...

... to go through the code ...

... until you lose interest.

And this is what the probability distribution looks like on a pie chart. As you can see, there is no clearly probable location for the defect, and the results quickly vanish into the long tail.

This is an instance of a problem that has already been described by Parnin and Orso.
So in summary, I think that statistical debugging is actually a very good idea. But in its current flavor it comes with two issues: it produces a long list of results with weak correlations, and it provides only the location, without context or explanation of why that location would be correlated with the failure.

And this is the contribution of our work: we remedy both issues. We strengthen the correlations, reducing the number of correlated locations in the process. And we provide not only a location but an explanation of the defect to the developer. Now let's focus on the first point for a moment: strengthening the correlations.
How do we do this? Well, the underlying problem simply is that the problematic code is not executed often enough to create a strong correlation. So what we do is simply add more executions. And this indeed makes the correlation stronger.
But we can do even better than this: we implement a feedback loop that greatly increases effectiveness AND efficiency. So how does this feedback loop work?
Consider the test case I showed you earlier ...
That gives you this probability distribution. Now let's see what we can do about it. Consider some of the branches in the program. Taking one of these branches, some of its executions failed and some passed. But there is no clear picture, which is why we don't get a consistent result.
So now let's focus on one of these branches. We try to come up with an additional execution of this branch, and we see that it also fails, so this ...

... increases the correlation with the failure. And getting yet another execution that fails ...

... further increases that correlation. Now let's have a look at another branch.

If we generate an additional execution for this one, it passes.

This lessens the correlation with the failure and further increases the correlation of the other branch. And getting yet another passing execution ...

... further decreases its probability and increases it for the other branch.

So we go on in this fashion until ...

... eventually we end up with a probability distribution like this. As you can see, we now get a single branch that is very highly correlated with the defect. And by this I mean, you know, VERY highly correlated. And this is the result that our tool returns to the developer.
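The loop just described can be sketched roughly as follows. The branch names and their pass/fail behavior are invented, and real test case generation is replaced by a fixed oracle, so this only illustrates how repeatedly generating executions for candidate branches drives their correlations apart:

```java
import java.util.*;

public class FeedbackLoop {
    // fraction of failing executions among all executions covering the branch
    static double correlation(int fail, int pass) {
        return (fail + pass) == 0 ? 0.5 : (double) fail / (fail + pass);
    }

    public static void main(String[] args) {
        // Invented branches: in this simulation the DST-boundary check fails
        // whenever it is taken, while the initialization branch is also taken
        // by passing runs. A real implementation would invoke a test
        // generator here instead of this fixed oracle.
        Map<String, Boolean> failsWhenTaken = new LinkedHashMap<>();
        failsWhenTaken.put("offsetLocal != offsetKey (DST check)", true);
        failsWhenTaken.put("field == null (initialization)", false);

        Map<String, int[]> counts = new LinkedHashMap<>(); // {fail, pass}
        for (String b : failsWhenTaken.keySet())
            counts.put(b, new int[]{1, 1}); // counts from the initial runs

        for (int round = 0; round < 10; round++)
            for (String branch : failsWhenTaken.keySet()) {
                // "generate" a fresh execution covering this branch, run it
                boolean failed = failsWhenTaken.get(branch);
                counts.get(branch)[failed ? 0 : 1]++;
            }

        counts.forEach((b, c) -> System.out.printf(Locale.ROOT,
                "%s -> %.2f%n", b, correlation(c[0], c[1])));
    }
}
```

After a few rounds the branch that only ever fails approaches a correlation of 1.0 while the other one drops toward 0, which mirrors the single highly correlated branch the tool reports.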
So now let's have a look at the result that we get: the single branch that is returned. Going to that part of the code, we find that it compares two offsets. And the comment above the code explains what is going on: because the offsets differ, we must be near a DST boundary.

And this already explains the failure: the specific date that we create is a daylight saving cutover day, and the cutover time is midnight. So indeed we are near a DST boundary. The underlying problem is that, since we are near a DST boundary, the time interval that starts at midnight starts at a nonexistent time, because the day does not start at midnight, which in turn triggers an internal consistency check.
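The calendar fact behind this explanation can be checked with the JDK's own java.time API, no Joda Time required: on 2009-10-18, clocks in São Paulo spring forward at midnight, so the first valid time of that day is 01:00, not 00:00.

```java
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class DstGap {
    public static void main(String[] args) {
        ZoneId saoPaulo = ZoneId.of("America/Sao_Paulo");
        // atStartOfDay resolves the midnight DST gap to the first valid time
        ZonedDateTime start =
                LocalDate.of(2009, 10, 18).atStartOfDay(saoPaulo);
        System.out.println(start);
        System.out.println("hour = " + start.getHour()); // 1, not 0
    }
}
```

java.time silently shifts to the first valid instant; the Joda Time code path in the bug instead ran into its internal consistency check and crashed.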
So I said we would remedy both issues. But the explanation that we got was just coincidental, because we were lucky to find some comments in the code, right?

So how else can we provide an explanation?

Well, as it turns out, statistical debugging is a very general approach that works not only with statements or branches, but with ANY runtime features.
So if we change the focus from trying to locate the defect to trying to explain it, we could for instance correlate state predicates. Or if we have a concurrent program and the bug is concurrency driven, then we could correlate the thread schedule. Or if the failure is data driven, we could use definition-usage pairs. In fact, you could use ANY runtime feature that you think will help you understand what is going on. Now, for an example, let's have a look at state predicates, because this is what we happened to also implement.
So first: what is a state predicate? Well, a state predicate encodes the features of an object. For every object that we find in the state, we create a binary predicate pairing every attribute and inspector of this object with every other attribute, inspector and constant. So, for instance, what you might get is something like: the shape's area is greater than or equal to zero. Or the width of the square equals the height of the square.
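A minimal sketch of such predicate extraction over a toy object might look like this; the attribute names and the restriction to numeric `>=`/`==` comparisons against other attributes and the constant 0 are simplifying assumptions:

```java
import java.util.*;

public class StatePredicates {
    // enumerate binary predicates over the object's numeric attributes,
    // pairing each attribute with every other one and with the constant 0
    static List<String> predicates(Map<String, Double> attrs) {
        List<String> result = new ArrayList<>();
        Map<String, Double> values = new LinkedHashMap<>(attrs);
        values.put("0", 0.0); // compare against the constant 0 as well
        for (String a : values.keySet())
            for (String b : values.keySet())
                if (!a.equals(b)) {
                    if (values.get(a) >= values.get(b))
                        result.add(a + " >= " + b);
                    if (values.get(a).equals(values.get(b)))
                        result.add(a + " == " + b);
                }
        return result;
    }

    public static void main(String[] args) {
        // toy "square" object state
        Map<String, Double> square = new LinkedHashMap<>();
        square.put("width", 4.0);
        square.put("height", 4.0);
        square.put("area", 16.0);
        predicates(square).forEach(System.out::println);
    }
}
```

For the square this emits predicates such as `width == height` and `area >= 0`; which of them explain a failure is then decided by correlating their truth values across passing and failing runs, as before.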
And to show you a practical result, let's consider a bug that appeared in the Apache Commons Codec library. These two lines of code trigger a failure: you create a small byte array and simply ask whether this array is valid Base64. And executing these, the program crashes.
Now what does BUGEX return as a result for predicates? Well, first it says that the input array always has length 3. This is actually an artifact of a limitation in the underlying test case generation technique; in the meantime this should be fixed, but anyway. The second result we get is that the program crashes whenever octet, which is the name of a method parameter, is less than or equal to zero.
And if we look at the code, we find that the reason the program fails is that the input is used unchecked to look up its Base64 equivalent from an array. And of course, when that input is negative, the program is bound to fail.
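The defect pattern is easy to re-create. The following is not the actual Commons Codec source, just a minimal illustration of a signed byte used unchecked as an index into a lookup table:

```java
public class LookupBug {
    // toy lookup table indexed directly by the byte value
    private static final boolean[] IS_BASE64_CHAR = new boolean[128];
    static {
        for (char c = 'A'; c <= 'Z'; c++) IS_BASE64_CHAR[c] = true;
        for (char c = 'a'; c <= 'z'; c++) IS_BASE64_CHAR[c] = true;
        for (char c = '0'; c <= '9'; c++) IS_BASE64_CHAR[c] = true;
        IS_BASE64_CHAR['+'] = IS_BASE64_CHAR['/'] = true;
    }

    static boolean isBase64(byte octet) {
        // BUG: a Java byte is signed, so e.g. (byte) 0xC3 is -61 and the
        // unchecked lookup throws ArrayIndexOutOfBoundsException
        return IS_BASE64_CHAR[octet];
    }

    public static void main(String[] args) {
        System.out.println(isBase64((byte) 'A'));
        try {
            isBase64((byte) -61); // any negative byte crashes the lookup
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("crash on negative byte");
        }
    }
}
```

This matches the reported predicate: the crash occurs exactly when the parameter is negative, i.e. outside the valid index range of the table.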
We actually tried our tool on 7 defects. There are the Brazilian Date bug and the Base64 Lookup bug that I already showed. Then there's the Sparse Iterator bug of the Apache Commons Math library, which is due to an inconsistency in the way that the state of "no more elements" is represented internally. All 8 branches returned relate to this inconsistency, and the two predicates that are returned pinpoint that problem. Then there is another bug from Apache Commons Codec, where a buffer is handled incorrectly. And again, the single branch returned pinpoints this problem. Then we have another bug from Joda Time, where one of the 7 branches returned is the place where the bug was actually fixed, and the 9 predicates contain the very reason for the failure as given in the bug report. Then we have a small artificial example where the problem was pinpointed by the branch returned. And then there's the last example, also from the Joda Time library, where our approach fails. The reason it fails is the oracle problem: in the test, a valid French date is created that is wrongly classified as invalid by the API. Our test generation technique happily generated lots and lots of inputs that failed with the same exception. But of course, few of those were valid French dates, so when we made the program crash we actually did not reproduce the problem, which is why it didn't work.

So if you look at the time needed, you might first think that this is way too much time and impractical in reality, right? So first I want to mention that in ALL cases, we were faster than plain statistical debugging.
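To make the oracle problem concrete, here is a minimal sketch (with invented names and an artificially simplified "parser", not the real Joda Time code): a genuinely malformed date and the buggy rejection of a valid date raise the same exception, so a generated test that fails "the same way" need not reproduce the actual defect.

```java
import java.util.Set;

// Hypothetical sketch of the oracle problem behind the Parse French Date bug.
public class OracleProblemSketch {

    // Simulated buggy parser: it only accepts dates it happens to know, so
    // the perfectly valid "1 novembre 2009" is wrongly rejected -- with the
    // SAME exception as real garbage input.
    static void parseFrenchDate(String date) {
        Set<String> accepted = Set.of("18 octobre 2009", "25 décembre 2009");
        if (!accepted.contains(date)) {
            throw new IllegalArgumentException("invalid date: " + date);
        }
    }

    // The only oracle available to test generation: does the run fail with
    // the same exception type as the original failing test?
    static boolean failsLikeOriginalTest(String input) {
        try {
            parseFrenchDate(input);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A generated input like "xyz" fails identically to the original
        // test, but for the right reason (it really is invalid), so it
        // tells us nothing about the defect.
        System.out.println(failsLikeOriginalTest("xyz"));
        System.out.println(failsLikeOriginalTest("1 novembre 2009"));
    }
}
```

Both calls report a failure, yet only the second run actually exercises the bug; an exception-based oracle cannot tell the two apart.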
However, it turns out that it is just our implementation that keeps running, to make really sure we didn't concentrate on the wrong runtime features, and therefore keeps trying to generate other executions. In fact, for six out of seven cases, the final results for branches were already available after 20 seconds. So with a live preview of the results, the developer can practically start investigating the interesting branches right away. I also want to stress that this technique comes without any additional effort whatsoever: all you need is a failing test, and the approach can start. You could, for instance, set it up to run on a build server right after a test in the continuous build fails. In any case, as long as computation time is cheaper than human time, this approach makes sense to use, right?

A second interesting aspect is that the results are easy to review, because only a small number of facts are highly correlated. So even when not all predicates were meaningful, reviewing only a handful of predicates is fast and easy, and far different from receiving a list of all predicates ordered by probability. For four defects, it even reports a single branch.

And I want to stress that six out of seven times, these results pinpointed the problem and give you the cause of the defect with no additional effort on the side of the developer.
So these are the current results. Now, what next? For future work, we plan to implement the approach for more runtime features, to help the developer understand other kinds of defects.

Also, the main limitations of the approach stem from the underlying test case generation. There, we want to both improve the underlying technique and perhaps incorporate other such techniques.

Also, as you saw, we applied the approach to only seven defects. These results are encouraging, but honestly, this is only a start. To get a much deeper insight, we need to do a much larger evaluation with many more examples, and we are currently working on that.

And last but not least, since the goal is to help a developer understand the bug, ultimately we will perform some kind of user study, like the ones you saw in yesterday's "empirical studies" session.
To sum up my talk: I introduced statistical debugging and showed you what it looks like when applied to a real-life example, which reveals its limitations. But I also showed that statistical debugging is not inherently flawed; rather, it makes some assumptions that are not met in reality. I then showed how our approach performs in comparison, and gave an intuition of how it works.

Thank you for listening, and now I will gladly try to answer your questions.
\n
Depending on the example, we generate some tens of thousands of test cases. Now, this might sound like a lot. However, as this graph shows, in most cases the final result is ready after only 12 minutes. Also, in some cases we observed something surprising: BUGEX was faster than executing the code instrumented for regular statistical analysis, or even than running the bare complete test suite.