Talk presented on March 4, 2011 at the 15th European Conference on Software Maintenance and Reengineering in Oldenburg, Germany.
Abstract: A single development task, such as fixing a bug or implementing a new feature, often involves changing a number of entities, known together as a change set. Change sets can be approximated from the version control system, and architects and developers then use them to make important decisions, so the approximation needs to be done carefully. It is common to assume that two entities checked in within a small time interval of each other, and having the same meta-data associated with them, belong to the same transaction. Transactions may be good approximations of change sets if developers commit change sets in one go and if the required meta-data is available. This is, however, not the case in the industrial environment (Philips Healthcare) we study. Our paper presents a case study in which we investigated how change sets can be approximated in an environment with a complex workflow and limited meta-data in the version repositories. We found that, depending on the commit practices used, a suitable time interval between check-in timestamps of files has to be determined and leveraged to reliably approximate change sets.
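The time-interval idea from the abstract can be sketched as follows. This is a minimal illustration, not the procedure used in the study: it assumes each check-in record carries a developer name, a timestamp, and a file path, and it starts a new approximated change set whenever the gap between two consecutive check-ins by the same developer exceeds δ. The function name `approximate_change_sets` and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def approximate_change_sets(checkins, delta):
    """Group (developer, timestamp, path) check-ins into approximated change sets.

    Assumption: consecutive check-ins by the same developer belong together
    when their timestamps are at most `delta` apart.
    """
    by_developer = defaultdict(list)
    for developer, timestamp, path in checkins:
        by_developer[developer].append((timestamp, path))

    change_sets = []
    for developer, entries in by_developer.items():
        entries.sort()  # order each developer's check-ins by time
        current = [entries[0][1]]
        for (prev_time, _), (time, path) in zip(entries, entries[1:]):
            if time - prev_time <= delta:
                current.append(path)
            else:
                # Gap larger than delta: close the set and start a new one.
                change_sets.append((developer, current))
                current = [path]
        change_sets.append((developer, current))
    return change_sets


# Hypothetical check-in records, for illustration only.
checkins = [
    ("anna", datetime(2006, 1, 3, 15, 0, 0), "a.h"),
    ("anna", datetime(2006, 1, 3, 15, 1, 30), "a.cpp"),
    ("anna", datetime(2006, 1, 3, 17, 0, 0), "b.cpp"),
    ("james", datetime(2006, 1, 3, 15, 0, 10), "c.h"),
]
sets_200s = approximate_change_sets(checkins, timedelta(seconds=200))
```

With δ = 200 seconds, Anna's first two files form one set and her third file a separate one; with δ = 1 day, all three of her files fall into a single set.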
6. Philips MRI Systems
• Eight million lines of code across 34,000
files.
• C, C++, and C# used.
• Hundreds of developers across 3 sites.
• An old version of IBM ClearCase used for
version control.
• Nine years of version control data available.
10. “Re-engineering in the large...”
Jens Borchers
Reengineering from a Practitioner’s View –
A Personal Lesson’s Learned Assessment
Invited talk at CSMR 2011
13. We need data!
• Which developers change
which files?
• Which functionality is
implemented in a file?
• Which sub-systems are
often changed together?
• ...
Change Sets
14. IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 31, NO. 6, JUNE 2005 429
Mining Version Histories to
Guide Software Changes
Thomas Zimmermann, Student Member, IEEE, Peter Weißgerber,
Stephan Diehl, and Andreas Zeller, Member, IEEE Computer Society
Abstract—We apply data mining to version histories in order to guide programmers along related changes: “Programmers who
changed these functions also changed....” Given a set of existing changes, the mined association rules 1) suggest and predict likely
further changes, 2) show up item coupling that is undetectable by program analysis, and 3) can prevent errors due to incomplete
changes. After an initial change, our ROSE prototype can correctly predict further locations to be changed; the best predictive power is
obtained for changes to existing software. In our evaluation based on the history of eight popular open source projects, ROSE’s
topmost three suggestions contained a correct location with a likelihood of more than 70 percent.
Index Terms—Programming environments/construction tools, distribution, maintenance, enhancement, configuration management,
clustering, classification, association rules, data mining.
22. Environment at Philips
• Developers often commit files associated
with more than one task.
• A buildable system is not required for a commit.
• Multiple developers may work together on
complex tasks.
40. Developer survey
• Ten most active developers invited to
participate. Eight responded.
• Participants briefed on the purpose of the survey,
how change sets were approximated, and
the option to discontinue the survey.
• Randomly drawn change sets presented.
• Developers evaluated 75 change sets.
41. Precision from survey
Table II: Precision estimated with help of developers
                     Precision (in %)      #ACS*
Time interval (δ)    Max.   Min.   Avg.    Analyzed   Skipped
200 seconds          100    50     91      19         3
1 hour               100    33     91      15         4
1 day                100    40     78      21         4
1 week               100    6      66      14         7
1 month              100    2      36      6          8
* ACS stands for Approximated Change Sets
45. Example Postlist
Unique ID: <POSTLIST_NAME>nly95872_RFAmpSim_20060103T150114
Stream information: <DEVSTREAM>FEMAIN
Developer: <USER>Anna
Date: <DATE_DD_MMM_YYYY>3 JAN 2006
Is it a problem report? <PR_SECTION>Y
Problem report number: <SOLVED_PR>MR00035599
Rationale behind change: <REASON_TEXT>Improve simulation of the RF-Amplifier
<CODING_STANDARD>N
<PNS_SAR>N
Developers playing a role in the review process:
<BBLOCK_OWNER>James
<REVIEWER>Robert
<TEAMLEADER>David
<DOC_SECTION>N
Changed files submitted for review:
<POSTLIST_FILE>path.to.file.oneBDRFAmplifierNorf.h@@mainfemain5
<POSTLIST_FILE>path.to.file.twobdtransmittercsinterfaces.hcf@@main3
<POSTLIST_FILE>path.to.file.threeBDRFAmplifierNorf.cpp@@mainfemain5
<TEST_SECTION>Y
<TEST_DONE>@OTM6
Last versions of files used to build system:
<PREVIOUSLY_CONSOLIDATED_FILE>path.to.file.oneBDRFAmplifierNorf.h@@mainfemain2
<PREVIOUSLY_CONSOLIDATED_FILE>path.to.file.twobdtransmittercsinterfaces.hcf@@main1
<PREVIOUSLY_CONSOLIDATED_FILE>path.to.file.threeBDRFAmplifierNorf.cpp@@mainfemain4
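A postlist like the one above can serve as a reference change set once its fields are extracted. The sketch below is an assumption about how such a file might be read, not the tooling used in the study: it treats every `<TAG>value` occurrence as one field, collects repeated tags into lists, and takes the part of a `POSTLIST_FILE` entry before `@@` as the element path (in ClearCase, `@@` separates an element from its version). The helper names `parse_postlist` and `changed_files` are hypothetical.

```python
import re
from collections import defaultdict

# Matches "<TAG>value" anywhere on a line, so an annotation prefix is ignored.
TAG_LINE = re.compile(r"<([A-Z_]+)>(.*)")


def parse_postlist(text):
    """Return a dict mapping each tag to the list of values found for it."""
    fields = defaultdict(list)
    for line in text.splitlines():
        match = TAG_LINE.search(line)
        if match:
            fields[match.group(1)].append(match.group(2).strip())
    return dict(fields)


def changed_files(fields):
    """Element paths of the files submitted for review, without version suffix."""
    return {entry.split("@@", 1)[0] for entry in fields.get("POSTLIST_FILE", [])}


# Excerpt of the example postlist, copied as it appears on the slide.
sample = """\
Unique ID <POSTLIST_NAME>nly95872_RFAmpSim_20060103T150114
Developer <USER>Anna
<POSTLIST_FILE>path.to.file.oneBDRFAmplifierNorf.h@@mainfemain5
<POSTLIST_FILE>path.to.file.twobdtransmittercsinterfaces.hcf@@main3
"""

fields = parse_postlist(sample)
files = changed_files(fields)
```

Here `fields["USER"]` is `["Anna"]` and `files` holds the two element paths, which can then be compared against an approximated change set.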
50. Results from postlists
Table III: Precision estimated using postlists
                     Precision (in %)
Time interval (δ)    Max.   Min.   Avg.
200 seconds          100    50     93
1 hour               100    20     89
1 day                100    1      69
1 week               100    <1     31
1 month              95     <1     8
Table IV: Recall estimated using postlists
                     Recall (in %)
Time interval (δ)    Max.   Min.   Avg.
200 seconds          100    20     74
1 hour               100    20     84
1 day                100    22     92
1 week               100    24     94
1 month              100    25     94
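For readers unfamiliar with the two measures, the sketch below uses the standard set-based definitions: precision is the share of files in an approximated change set that also appear in the reference (here, a postlist), and recall is the share of reference files that the approximated change set recovers. The slides do not spell out how approximated change sets were matched to postlists, so the pairing in this example is an assumption, and the file names are invented for illustration.

```python
def precision_recall(approximated, reference):
    """Precision and recall of an approximated change set against a reference set."""
    approximated, reference = set(approximated), set(reference)
    overlap = len(approximated & reference)
    precision = overlap / len(approximated) if approximated else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical example: the approximation picked up two unrelated files
# and missed one file that the postlist lists.
acs = {"a.h", "a.cpp", "x.cpp", "y.cpp"}
postlist_files = {"a.h", "a.cpp", "b.cpp"}
precision, recall = precision_recall(acs, postlist_files)
```

In this example precision is 0.5 (two of four approximated files are in the postlist) and recall is 2/3 (two of three postlist files were found). A larger δ tends to pull in more files, which is consistent with the tables: average recall rises while average precision falls as the interval grows.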
52. So, what works?
• The one hour time interval works best for
our environment.
• Optimal time interval may differ from one
environment to another.
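One way to see why the one-hour interval comes out ahead is to combine the average precision and recall from Tables III and IV into a single score. The slides do not state how the two measures were weighed, so the harmonic mean (F1) used here is an assumption chosen for illustration, and combining table averages this way is only a rough indicator.

```python
# Average precision and recall (in %) per time interval, from Tables III and IV.
AVERAGES = {
    "200 seconds": (93, 74),
    "1 hour": (89, 84),
    "1 day": (69, 92),
    "1 week": (31, 94),
    "1 month": (8, 94),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall (an assumed way to combine them)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


scores = {interval: f1(p, r) for interval, (p, r) in AVERAGES.items()}
best = max(scores, key=scores.get)
```

Under this assumption `best` is `"1 hour"`, with a score of about 86.4 against about 82.4 for 200 seconds and about 78.9 for 1 day. Another environment with different commit practices would need its own measurements.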
53. Threats to validity
• Assumed developers have a good recall of
their own change sets.
• Postlists were carefully selected, but in rare
cases may relate to more than one change
set.
• Change sets with multiple developers
involved were not captured completely.