Performance Analysis of Leading Application Lifecycle
Management Systems for Large Customer Data Environments
Paul Nelson
Director, Enterprise Systems Management, AppliedTrust, Inc.
paul@appliedtrust.com
Dr. Evi Nemeth
Associate Professor Attendant Rank Emeritus, University of Colorado at Boulder
Distinguished Engineer, AppliedTrust, Inc.
evi@appliedtrust.com
Tyler Bell
Engineer, AppliedTrust, Inc.
tyler@appliedtrust.com
AppliedTrust, Inc.
1033 Walnut St, Boulder, CO 80302
(303) 245-4545
Abstract
The performance of three leading application lifecycle management (ALM) systems (Rally by
Rally Software, VersionOne by VersionOne, and JIRA+GreenHopper by Atlassian) was
assessed to draw comparative performance observations when customer data exceeds a 500,000-
artifact threshold. The focus of this performance testing was how each system handles a
simulated “large” customer (i.e., a customer with half a million artifacts). A near-identical
representative data set of 512,000 objects was constructed and populated in each system in order
to simulate identical use cases as closely as possible. Timed browser testing was performed to
gauge the performance of common usage scenarios, and comparisons were then made. Nine tests
were performed based on measurable, single-operation events.
Rally emerged as the strongest performer based on the test results, leading outright in six
of the nine that were compared. In one of these six tests, Rally tied with VersionOne from a
scoring perspective in terms of relative performance (using the scoring system developed for
comparisons), though it led from a raw measured-speed perspective. In one test not included in
the six, Rally tied with JIRA+GreenHopper from a numeric perspective and within the bounds of
the scoring model that was established. VersionOne was the strongest performer in two of the
nine tests, and exhibited very similar performance characteristics (generally within a 1 – 12
second margin) in many of the tests that Rally led. JIRA+GreenHopper did not lead any tests, but
as noted, tied with Rally for one. JIRA+GreenHopper was almost an order of magnitude slower
than peers when performing any test that involved its agile software development plug-in. All
applications were able to complete the tests being performed (i.e., no tests failed outright). Based
on the results, Rally and VersionOne, but not JIRA+GreenHopper, appear to be viable solutions
for clients with a large number of artifacts.
1. Introduction

As the adoption of agile project management has accelerated over the last decade, so too has the use of tools supporting this methodology. This growth has resulted in the accumulation of artifacts (user stories, defects, tasks, and test cases) by customers in their ALM system of choice. The trend is for data stored in these systems to be retained indefinitely, as there is no compelling reason to remove it, and often, product generations are developed and improved over significant periods of time. In other cases, the size of specific customers and ongoing projects may result in very rapid accumulation of artifacts in relatively short periods of time. Anecdotal reports suggest that an artifact threshold exists around the 500,000-artifact point, and this paper seeks to test that observation.

This artifact scaling presents a challenge for ALM solution providers, as customers expect performance consistency in their ALM platform regardless of the volume of the underlying data. While it is certainly possible to architect ALM systems to address such challenges, there are anecdotal reports that some major platforms do not currently handle large projects in a sufficient manner from a performance perspective.

This paper presents the results of testing performed in August and September 2012, recording the performance of Rally Software, VersionOne, and JIRA+GreenHopper, and then drawing comparative conclusions between the three products. Atlassian's ALM offering utilizes its JIRA product and extends it to support agile project management using the GreenHopper functionality extension (referred to in this paper as JIRA+GreenHopper). Rally Build 7396, VersionOne 12.2.2.3601, and JIRA+GreenHopper JIRA 5.1 with GreenHopper 6 were the versions that were tested.

The tests measure the performance of single-user, single-operation events when an underlying customer data set made up of 500,000 objects is present. These tests are not intended to be used to draw conclusions regarding other possible scenarios of interest, such as load, concurrent users, or other tests not explicitly described.

The fundamental objective of the testing is to provide some level of quantitative comparison for user-based interaction with the three products, as opposed to system- or service-based interaction.

2. Data Set Construction

The use of ALM software and the variety of artifacts, custom fields, etc., will vary significantly between customers. As a result, there is not necessarily a "right way" to structure data for test purposes. More important is that fields contain content that is similarly structured to real data (e.g., text in freeform text fields, dates in date fields), and that each platform is populated with the same data. In some cases, product variations prevented this. Rally, for example, does not use the concept of an epic, but rather a hierarchical user story relationship, whereas VersionOne supports epics.

Actually creating data with unique content for all artifacts would be infeasible for testing purposes. To model real data, a structure was chosen for a customer instance based on 10 unique projects. Within each project, 40 epics or parent user stories were populated, and 80 user stories were created within each of those. Associated with each user story were 16 artifacts: 10 tasks, four defects, and two test
cases. In terms of core artifact types, the product of these counts is 16*80*40*10, or 512,000.

All platforms suffered from difficulties related to data population. This manifested in a variety of ways, including imports "freezing," data being truncated, or data being mismapped to incorrect fields. Every effort was made to ensure as much data consistency between data uploads as possible, but there were slight deviations from the expected norm. This was estimated to be no more than 5%, and where there was missing data, supplementary uploads were performed to move the total artifact count closer to the 512,000 target. In addition, tests were only performed on objects that met consistency checks (i.e., the same field data).

These symmetrical project data structures are not likely to be seen in real customer environments. The numbers of parent objects and child objects will also vary considerably. That being said, a standard form is required to allow population in three products and to enable attempts at some level of data consistency. Given that the structure is mirrored as closely as possible across each product, the performance variance should be indicative of observed behaviors in other customer environments regardless of the exact artifact distributions.

Custom fields are offered by all products, and so a number of fields were added and populated to simulate their use. Five custom fields were added to each story, task, defect, and test case; one was a Boolean true/false, two were numerical values, and two were short text fields.

The data populated followed the schema specified by each vendor's documentation. We populated fields for ID, name, description, priority, and estimated cost and time to complete. The data consisted of dates and times, values from fixed lists (e.g., the priority field with each possible value used in turn), references to other objects (parent ID), and text generated by a lorem ipsum generator. This generator produces text containing real sentence and paragraph structures, but random strings as words. A number of paragraph size and content blocks were created, and their use was repeated in multiple objects. The description field of a story contained one or two paragraphs of this generated text. Tasks, defects, and tests used one or two sentences. If one story got two paragraphs, then the next story would get one paragraph, and so on in rotation. This data model was used for each system.

It is possible that one or more of the products may be able to optimize content retrieval with an effective indexing strategy, but this advantage is implementable in each product. Only JIRA+GreenHopper prompted the user to initiate indexing operations, and based on prompted instruction, indexing was performed after data uploads were complete.

3. Data Population

Data was populated primarily by using the CSV import functionality offered by each system. This process varied in the operation sequence and chunking mechanism for uploads, but fundamentally was based on tailoring input files to match the input specifications and uploading a sequence of files. Out of necessity, files were uploaded in various-sized pieces related to input limits for each system. API calls and scripts were used to establish relationships between artifacts when the CSV input method did not support or retain these relationships. We encountered issues with each vendor's product in importing such a large data set, which suggests that customers considering switching from one product to another should look carefully at the feasibility of loading their existing data. Some of our difficulty in loading data involved the fact that we wanted to measure comparable operations, and the underlying data structures made this sometimes easy, sometimes nearly impossible.
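The data model and chunked-upload approach described above can be sketched as follows. The chunk size shown is Rally's documented 1,000-line CSV limit; treating every system's input as simple row lists is an illustrative simplification, not the exact pipeline used.

```python
# Sketch of the test data model: 10 projects x 40 epics x 80 stories,
# each story carrying 10 tasks, 4 defects, and 2 test cases.
# Row/chunk handling is an illustrative simplification.

PROJECTS, EPICS_PER_PROJECT, STORIES_PER_EPIC = 10, 40, 80
CHILDREN_PER_STORY = {"task": 10, "defect": 4, "test_case": 2}

def artifact_counts():
    """Return (user story count, per-type child artifact counts)."""
    stories = PROJECTS * EPICS_PER_PROJECT * STORIES_PER_EPIC
    children = {kind: n * stories for kind, n in CHILDREN_PER_STORY.items()}
    return stories, children

def chunk_rows(rows, max_lines):
    """Split CSV rows into upload-sized files (e.g., Rally's 1,000-line limit)."""
    return [rows[i:i + max_lines] for i in range(0, len(rows), max_lines)]

stories, children = artifact_counts()
total = sum(children.values())
print(total)                                      # 16 * 80 * 40 * 10 = 512000
print(len(chunk_rows(list(range(total)), 1000)))  # 512 upload files of 1,000 rows
```

This also makes the paper's arithmetic explicit: the 512,000 figure counts the per-story child artifacts, on top of which sit 32,000 user stories and their parent epics.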
4. JIRA+GreenHopper Data Population Issues

We had to create a 'Test Case' issue type in the JIRA+GreenHopper product and use what is known in the JIRA+GreenHopper community as a bug to keep track of the parent-child hierarchy of data objects. Once this was done, the data loaded quite smoothly using CSV files and its import facility until we reached the halfway point, when the import process slowed down considerably. Ultimately, the data import took two to three full days to complete.

5. Rally Data Population Issues

Rally limits the size of CSV files to 1,000 lines and 2.097 MB. It also destroys the UserStory/SubStory hierarchy on import (though it presents it on export). These limitations led to a lengthy and tedious data population operation. Tasks could not be imported using the CSV technique. Instead, scripting was used to import tasks via Rally's REST API interface. The script was made using Pyral, a library released by Rally for quick, easy access to its API using the Python scripting language. The total data import process took about a week to complete.

6. VersionOne Data Population Issues

VersionOne did not limit the CSV file size, but warned that importing more than 500 objects at a time could cause performance issues. This warning was absolutely true. During import, our VersionOne test system was totally unresponsive to user operations. CSV files of 5,000 lines would lock it up for hours, making data population take over a week of 24-hour days.

7. Testing Methodology

A single test system was used to collect test data in order to limit bias introduced by different computers and browser instances. The test platform was a Dell Studio XPS 8100 running Microsoft Windows 7 Professional SP1 64-bit, and the browser used to perform testing was Mozilla Firefox v15.0.1. The Firebug add-on, v1.10.3, was used to collect test metrics. Timing data was recorded in a data collection spreadsheet constructed for this project. While results are expected to vary if using other software and version combinations, using a standardized collection model ensured a consistent, unbiased approach to gathering test data for this paper, and will allow legitimate comparisons to be made. It is expected that while the actual timing averages may differ, the comparisons will not.

At the time measurements were being taken, the measurement machine was the only user of our instance of the software products. All tests were performed using the same network and Internet connection, with no software updates or changes between tests. To ensure there were no large disparities between response times, an http-ping utility was used to measure roundtrip response times to the service URLs provided by each system. Averaged response times over 10 http-ping samples were all under 350 milliseconds and within 150 milliseconds of each other, suggesting connectivity and response are comparable for all systems. JIRA+GreenHopper had an average response time of 194 milliseconds, Rally 266, and VersionOne 343. All tests were performed during US MDT business hours (8 a.m. – 5:30 p.m.).

It is noted that running tests in a linear manner does introduce the possibility of performance variation due to connectivity differences between endpoints, though these variations would be expected under any end-user usage scenario and are
difficult, if not impossible, to predict and measure.

Tests and data constructs were implemented in a manner to allow apples-to-apples comparison with as little bias and potential benefit to any product as possible. However, it should be noted that these are three different platforms, each with unique features. In the case where a feature exists on only one or two of the platforms, that element was not tested. The focus was on the collection of core tests described in the test definition table in the next section.

The time elapsed from the start of the first request until the end of the last request/response was used as the core time metric associated with a requested page load when possible. This data is captured with Firebug, and an example is illustrated below for a VersionOne test.
[Figure: Example of timing data collection for a VersionOne test.]
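The core timing metric (first-request start to last-response end) reduces to a small computation over a page's network requests. The (start, duration) tuple format below is a simplified stand-in for what Firebug's network panel records, not its actual export schema.

```python
# Page-load time as defined in this paper: elapsed time from the start of
# the first request to the end of the last request/response.
# Each entry is (start_time_s, duration_s); this tuple layout is an
# illustrative assumption, not Firebug's real export format.

def page_load_time(entries):
    starts = [start for start, _ in entries]
    ends = [start + duration for start, duration in entries]
    return max(ends) - min(starts)

# Three overlapping requests: the page load spans 0.00 s to 3.14 s.
requests = [(0.00, 1.20), (0.35, 2.79), (1.10, 1.50)]
print(round(page_load_time(requests), 2))  # 3.14
```

Note that this metric deliberately spans overlapping parallel requests, which is why it cannot be computed by summing individual request durations.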
We encountered challenges timing pages that perform operations using asynchronous techniques to update or render data. Since we are interested in when the results of operations are visible to the user, timing only the asynchronous call that initiates the request provides little value from a testing perspective. In cases where no single time event could be used, timing was performed manually. This increased the error associated with the measurement, and this error is estimated to be roughly one second or less. Cases where manual measurements were made are indicated in the result analysis. A stopwatch with 0.1-second granularity was used for all manually timed tests, as were two people: one running the test with start/stop instruction and the other timing from those verbal cues.

It is acknowledged that regardless of the constraints imposed here to standardize data and tests for comparison purposes, there may be deviations from performance norms due to the use of simulated data, either efficiencies or inefficiencies. Bias may also be introduced in one or more products based on the testing methodology employed. While every effort was made to make tests fair and representative of legitimate use cases, it is recognized that results might vary if a different data set were used. Further, the testing has no control over localized performance issues affecting the hosted environments from which the services are provided. If testing results in minor variance between products, then arguably some of this variance could be due to factors outside of the actual application.

The enterprise trial versions were used to test each system. We have no data regarding how each service handles trial instances; it is possible that the trial instances differ from paid subscription instances, but based on our review and the trial process, there was no indication the trial version was in any way different. We assume that providers would not intentionally offer a performance-restricted instance for trial customers, given that their
end goal would be to convert those trial customers to paying subscribers.

Based on a per-instance calibration routine, the decision was made to repeat each test 10 times per platform. A comparison between a 10-test and a 50-test sample was performed for one test case (user story edit) per platform to ensure the standard deviation between respective tests was similar enough to warrant the use of a 10-test sample. In no case was the calibration standard deviation greater than one second. If the performance differences between applications are found to be of a similar order of magnitude (i.e., seconds), then the use of a 10-test sample per application should clearly be questioned. However, if the overriding observation is that each application performs within the same small performance range as the others, the nuances of sample size calculation are rendered insignificant. A more in-depth sample sizing exercise could also be performed, and could realistically be performed per test. However, it is already recognized that there are numerous factors beyond the control of the tests, to the extent that further increasing sample size would offer little value given the relatively consistent performance observed during calibration.

To help reduce as many bandwidth and geographic distance factors as possible, the client browser cache was not cleared between tests. This also better reflects real user interaction with the systems. A single pretest run for every test was performed to allow object caching client-side; so in fact, each test was executed 11 times, but only results 2-11 were analyzed. Based on the belief that the total artifact count is the root cause of scalability issues, allowing caching should eliminate some of the variation due to factors that cannot be controlled by the test.

The use of attachments was not tested. This was identified as more of a bandwidth and load test, as opposed to a test of system performance in a scalability scenario.

8. Test Descriptions

Tests were constructed based on common uses of ALM systems. Timing data was separated into discrete operations when sequences of events were tested. These timings were compared individually, as opposed to in aggregate, in order to account for interface and workflow differences between products.

There may be tests and scenarios that could be of interest but were not captured, either because they were not reproducible in all products or were not identified as common operations. Also, it would be desirable in future tests to review the performance of logical relationships (complex links between iterations/sprints and other artifacts, for example). The core objective when selecting these tests was to enable comparison for similar operations between systems.
1. Refresh the backlog for a single project.
   The backlog page is important to both developers and managers; it is the heart of the systems. Based on variance in accessing the backlog, the most reliable mechanism to test was identified as a refresh of the backlog page. Views were configured to display 50 entries per page.

2. Switch backlog views between two projects.
   A developer working on two or more projects might frequently swap projects. Views were configured to display 50 entries per page.

3. Page through backlog lists.
   With our large data sets, navigation of large tables can become a performance issue. Views were configured to display 50 entries per page.

4. Select and view a story from the backlog.
   Basic access to a story.

5. Select and view a task.
   Basic access to a task.

6. Select and view a defect/bug.
   Basic access to a defect or bug. (Note: JIRA+GreenHopper uses the term bug, while Rally and VersionOne use defect.)

7. Select and view a test.
   Basic access to a test case.

8. Create an iteration/sprint.
   A common management chore. (Note: This had to be manually timed for JIRA+GreenHopper, as measured time was about 0.3 seconds while elapsed time was 17 seconds.)

9. Move a story to an iteration/sprint.
   A common developer or manager chore. (Note: JIRA+GreenHopper and VersionOne use the term sprint, while Rally uses iteration.)

10. Convert a story to a defect/bug.
    A common developer chore. (Note: This operation is not applicable to Rally because of the inherent hierarchy between a story and its defects.)
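The run protocol described in the methodology (one cache-warming pretest followed by ten analyzed runs) reduces to a small summarization routine. The paper does not state whether a sample or population standard deviation was computed; the sample form is assumed here.

```python
import statistics

# Each test is executed 11 times; run 1 is a cache-warming pretest and is
# discarded, so only runs 2-11 contribute to the mean and SD.
# Use of the sample (n-1) standard deviation is an assumption; the paper
# does not specify which formula was used.

def summarize(timings):
    """Reduce 1 pretest + 10 measured runs to (mean, sample SD) in seconds."""
    if len(timings) != 11:
        raise ValueError("expected 1 pretest + 10 measured runs")
    measured = timings[1:]  # drop the uncached warm-up run
    return statistics.mean(measured), statistics.stdev(measured)

# The slow first run (cold cache) is excluded from the statistics.
runs = [9.9, 3.1, 3.0, 3.3, 3.2, 3.1, 3.0, 3.4, 3.1, 3.2, 3.0]
mean, sd = summarize(runs)
print(round(mean, 2))  # 3.14
```

Discarding the first run this way is what lets the later tables attribute variance to the artifact volume rather than to one-time asset downloads.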
9. Test Results

Each test was performed 1+10 times in sequence for each software system, and the mean and standard deviation were computed. The point estimates were then compared to find the fastest-performing application. A +n (seconds) indicator was used to indicate the relative performance lag of the other applications behind the fastest-performing application for that test.

The test result summary table illustrates the relative performance for each test to allow observable comparisons per product and per test. In order to provide a measurement-based comparison, a scale was created to allow numerical comparison between products. There were no cases where the leader in a test performed badly (subjectively). As such, the leader in a test is given the "Very Good" rating, which corresponds to five points. The leading time is then used as a base for comparative scoring of competitors for that test, with each test score based on how many multiples it was of the fastest performer. The point legend table is illustrated below.

Time Multiple          Points
1.0x ≤ time < 1.5x     5
1.5x ≤ time < 2.5x     4
2.5x ≤ time < 3.5x     3
3.5x ≤ time < 4.5x     2
4.5x ≤ time            1
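The point legend maps how many multiples a system's mean time is of the fastest mean time onto a 1-5 score; a direct translation of the table:

```python
# Score a system's mean time against the fastest mean time for a test,
# per the point legend: <1.5x -> 5, <2.5x -> 4, <3.5x -> 3, <4.5x -> 2, else 1.

def score(time_s, fastest_s):
    multiple = time_s / fastest_s
    for bound, points in ((1.5, 5), (2.5, 4), (3.5, 3), (4.5, 2)):
        if multiple < bound:
            return points
    return 1

# Test 1 means: VersionOne 3.14 s (fastest), Rally 5.53 s, JIRA+GreenHopper 15.27 s.
print(score(3.14, 3.14), score(5.53, 3.14), score(15.27, 3.14))  # 5 4 1
```

The leader always scores 5 (its multiple is exactly 1.0x), so an overall rating of 45 corresponds to leading all nine compared tests.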
Test Result Summary Table (Relative Performance Analysis)

Legend: Very Good (5), Good (4), Acceptable (3), Poor (2), Very Poor (1)
Tests: 1 Backlog Refresh, 2 Switch Backlog, 3 Backlog Paging, 4 View Story, 5 View Task, 6 View Defect, 7 View Test, 8 Create Sprint, 9 Story → Sprint

System              Overall Rating (out of 45)
Rally               43
VersionOne          32
JIRA+GreenHopper    18
It must be noted that the resulting means are point-estimate averages. For several reasons, we don't suggest or use confidence intervals or test for significance. Based on the challenges associated with structuring common tests with different interfaces, different data structures, and no guarantee of connection quality, it is extraordinarily difficult to do so. In addition, because each test may have a different weight or relevance to each customer depending on their ALM process, the relevance of a test leader should be weighted according to the preference of the reader. That being said, these tests are intended to reflect the user experience.

To address some of the concerns associated with point estimates, an analysis of high and low bounds based on one and two standard deviations was performed. If the high bound for the fastest test overlaps with the low bound for either of the slower-performing application tests, the significance of the performance gain between those comparisons is questionable. The overlap suggests there will be cases where the slower (overlapping) application may perform faster than the application with the fastest response time. Statistical theory and the three-sigma rule suggest that when data is normally distributed (symmetrically distributed), roughly 68% of observations should lie within one standard deviation of the mean response and 95% should lie within two standard deviations. We graphically tested for normality using our calibration data and observed our data to be normally distributed. When there is no overlap between timings at two standard deviations, it will be fairly rare for one of the typically slower-performing applications to exceed the performance of the faster application (for that particular test).

If there is no overlap at one or two standard deviations between the lower and upper bounds, the result is marked as "Significant." If there is overlap in one or both cases, that result is flagged as "Insignificant." Significance is assessed between the fastest-performing application for the test and each of the other two applications. Therefore, the significance analysis is only populated for the application with the fastest point estimate. The advantage is classed as insignificant if the closest-performing peer implies the result is insignificant. All data values are in seconds.

Results from each test are analyzed separately below. The results of each test are shown both in table form with values and in bar graph form, and are also interpreted in the text below the corresponding table. Note that long bars in the comparison graphs are long response times, and therefore bad.
Test 1: Refresh Backlog Page for a Single Project

System            Mean (s)  SD (s)  Point Estimate  1 SD Range (s)  1 SD Overlap  2 SD Range (s)  2 SD Overlap
JIRA+GreenHopper  15.27     1.38    +12.13          13.89 – 16.64   -             12.52 – 18.02   -
Rally             5.53      0.29    +2.39           5.24 – 5.81     -             4.95 – 6.10     -
VersionOne        3.14      0.25    Fastest         2.88 – 3.39     Significant   2.63 – 3.64     Significant
Interpretation: The data indicates that for this particular task, even when accounting for variance in performance, VersionOne performs fastest. Note that the advantage is relatively small when compared to Rally, though the Rally point estimate does lag by almost 2.4 seconds. Both VersionOne and Rally perform significantly better than JIRA+GreenHopper when executing this operation.

Best Performer: VersionOne
Test 2: Switch Backlog Views Between Two Projects

System             Mean (s)  SD (s)  Point Estimate  1 SD Range (s)  1 SD Overlap  2 SD Range (s)  2 SD Overlap
JIRA+GreenHopper*  13.84     0.83    +11.39          13.01 – 14.66   -             12.19 – 15.49   -
Rally              2.45      0.16    Fastest         2.29 – 2.60     Significant   2.13 – 2.76     Significant
VersionOne         2.94      0.07    +0.49           2.87 – 3.01     -             2.79 – 3.08     -
*To perform this operation on JIRA+GreenHopper, the user must navigate between two scrumboards and then load the data. Therefore, the timing numbers for JIRA+GreenHopper are the sum of two measurements. This introduces request overhead not present in the other two tests, yet the disparity suggests more than simple transaction overhead is the cause of the delay. Furthermore, the resulting page was rendered frozen and was not usable for an additional 10 – 15 seconds. Users would likely fold that additional delay before the page could be accessed into their impression of the user experience, but it was not included here.
Interpretation: The data indicates that Rally and VersionOne are significantly faster than JIRA+GreenHopper, even when considering the sum of two operations. Rally is faster than VersionOne, though marginally so. In terms of user interaction, the experience would be similar for the two products.

Best Performer: Rally
Test 3: Paging Through Backlog Lists

System            Mean (s)  SD (s)  Point Estimate  1 SD Range (s)  1 SD Overlap   2 SD Range (s)  2 SD Overlap
JIRA+GreenHopper  1.53      0.66    Fastest         0.87 – 2.19     Insignificant  0.21 – 2.85     Insignificant
Rally             1.93      0.11    +0.40           1.81 – 2.04     -              1.70 – 2.15     -
VersionOne        3.45      0.29    +1.92           3.16 – 3.74     -              2.87 – 4.04     -
Interpretation: JIRA+GreenHopper had the fastest point-estimate mean, but the analysis suggests its improvement over Rally, the second-fastest, is minimal (not significant). The standard deviations suggest a wider performance variance for JIRA+GreenHopper, and so while its point estimate is better, the overall performance is likely to be comparable. The data indicates that VersionOne is significantly slower than the other two systems, and for very large data sets like those used in the tests, this makes scrolling through the data quite tedious.

Best Performer: JIRA+GreenHopper and Rally
Test 4: Selecting and Viewing a User Story From the Backlog

System            Mean (s)  SD (s)  Point Estimate  1 SD Range (s)  1 SD Overlap  2 SD Range (s)  2 SD Overlap
JIRA+GreenHopper  3.49      0.99    +2.95           2.49 – 4.48     -             1.50 – 5.47     -
Rally             0.53      0.07    Fastest         0.46 – 0.60     Significant   0.40 – 0.67     Significant
VersionOne        1.90      0.30    +1.36           1.59 – 2.20     -             1.29 – 2.50     -
Interpretation: The data indicates that Rally is significantly faster than either JIRA+GreenHopper or VersionOne. While the result is significant, the roughly one-second difference between Rally and VersionOne is not likely to have a significant impact on the user experience. Rally's performance is also more consistent than that of the other two products (i.e., it has a much lower response standard deviation).

Best Performer: Rally
Test 5: Selecting and Viewing a Task

System            Mean (s)  SD (s)  Point Estimate  1 SD Range (s)  1 SD Overlap  2 SD Range (s)  2 SD Overlap
JIRA+GreenHopper  1.36      0.17    +0.92           1.20 – 1.53     -             1.03 – 1.69     -
Rally             0.44      0.03    Fastest         0.42 – 0.47     Significant   0.39 – 0.50     Significant
VersionOne        1.46      0.16    +1.01           1.29 – 1.62     -             1.13 – 1.78     -
Interpretation: The data indicates that Rally is significantly (in the probabilistic sense) faster than either JIRA+GreenHopper or VersionOne by about one second, and also has a more consistent response time (with the lowest standard deviation). JIRA+GreenHopper and VersionOne showed similar performance. Overall, the result for all applications was qualitatively good.

Best Performer: Rally
Test 6: Selecting and Viewing a Test Case

System            Mean (s)  SD (s)  Point Estimate  1 SD Range (s)  1 SD Overlap  2 SD Range (s)  2 SD Overlap
JIRA+GreenHopper  1.91      0.86    +1.37           1.05 – 2.77     -             0.19 – 3.64     -
Rally             0.54      0.13    Fastest         0.41 – 0.67     Significant   0.28 – 0.80     Insignificant
VersionOne        1.45      0.18    +0.91           1.27 – 1.62     -             1.09 – 1.80     -
Interpretation: The data indicates that, again, Rally is fastest in this task, though the speed differences are significant only at the one standard deviation level, where there is no overlap in the respective timing ranges, and not at two standard deviations. Rally performed with the lowest point estimate and the lowest variance, suggesting a consistently better experience. VersionOne was second in terms of performance, followed by JIRA+GreenHopper.

Best Performer: Rally
Test 7: Selecting and Viewing a Defect/Bug

System            Mean (s)  SD (s)  Point Estimate  1 SD Range (s)  1 SD Overlap  2 SD Range (s)  2 SD Overlap
JIRA+GreenHopper  1.70      0.81    +1.02           0.88 – 2.51     -             0.07 – 3.32     -
Rally             0.68      0.05    Fastest         0.63 – 0.72     Significant   0.58 – 0.77     Insignificant
VersionOne        1.74      0.17    +1.06           1.56 – 1.91     -             1.39 – 2.08     -
Interpretation: The data indicates that Rally is faster by roughly one second based on the point-estimate mean when compared to the other two products, with the difference being significant at the one standard deviation level but not at two standard deviations. Variance in the results of the other products suggests they will perform similarly to Rally on some occasions, but not all. Rally's performance was relatively consistent, as indicated by its very low standard deviation. Though the point estimates of the other two products are very close, the performance of VersionOne is preferred based on its lower standard deviation. That being said, given that the point estimates are all below two seconds, there would be little to no perceptible difference between VersionOne and JIRA+GreenHopper from a user perspective.

Best Performer: Rally
Test 8: Add an Iteration/Sprint

System           | Mean Request Time (s) | Std. Dev. (s) | Point Estimate Comparison | 1 SD Range (s) | 1 SD Overlap Analysis | 2 SD Range (s) | 2 SD Overlap Analysis
JIRA+GreenHopper | 17.76 | 0.60 | +17.72  | 17.16 – 18.36 | -           | 16.56 – 18.96 | -
Rally            | 0.04  | 0.00 | Fastest | 0.04 – 0.05   | Significant | 0.03 – 0.05   | Significant
VersionOne       | 1.36  | 0.10 | +1.32   | 1.25 – 1.46   | -           | 1.15 – 1.57   | -
*Due to the disparity between Rally and JIRA+GreenHopper here, the graph appears to show no data for Rally. The graph resolution is simply insufficient to render the data clearly, given the large values generated by the JIRA+GreenHopper tests.
**The JIRA+GreenHopper data was manually measured due to inconsistencies between the reported timings and the actual content rendering. Asynchronous page timings appeared to complete as soon as requests were submitted, while the eventual content updates and rendering were disconnected from the original request being tracked. While this increases the measurement error, it certainly would not account for a roughly 17-second disparity.
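One way to avoid the asynchronous-timing pitfall described in the footnote is to measure from request submission until the page content actually appears, rather than until the HTTP request completes. A minimal sketch of that approach, with the request trigger and the rendered-content check stubbed out (the `trigger`/`is_rendered` callables are hypothetical; in practice they would submit the form and inspect the DOM):

```python
import time

def time_until_rendered(trigger, is_rendered, poll=0.05, timeout=30.0):
    """Measure wall-clock time from submitting a request until the resulting
    content is visible. `trigger` submits the request; `is_rendered` is a
    predicate that returns True once the new content has appeared."""
    start = time.perf_counter()
    trigger()
    while not is_rendered():
        if time.perf_counter() - start > timeout:
            raise TimeoutError("content never rendered")
        time.sleep(poll)
    return time.perf_counter() - start

# Toy stand-in: "rendering" finishes ~0.2 s after the request is submitted.
state = {"done_at": None}
def trigger():
    state["done_at"] = time.perf_counter() + 0.2
def is_rendered():
    return time.perf_counter() >= state["done_at"]

elapsed = time_until_rendered(trigger, is_rendered)
print(elapsed >= 0.2)  # True: measured time includes the full render delay
```

Polling until a readiness condition holds is coarser than network-panel timings, but it captures the delay users actually perceive when rendering is decoupled from the request.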
Interpretation: Rally is the fastest performer in this test, with the results being significant at both the one and two standard deviation levels. JIRA+GreenHopper is many times slower than both Rally and VersionOne.

Best Performer: Rally
Test 9: Move a Story to an Iteration/Sprint

System           | Mean Request Time (s) | Std. Dev. (s) | Point Estimate Comparison | 1 SD Range (s) | 1 SD Overlap Analysis | 2 SD Range (s) | 2 SD Overlap Analysis
JIRA+GreenHopper | 9.80 | 6.88 | +8.42   | 2.91 – 16.68 | -           | 0.00* – 23.56 | -
Rally            | 3.37 | 0.22 | +1.99   | 3.15 – 3.59  | -           | 2.94 – 3.80   | -
VersionOne       | 1.38 | 0.36 | Fastest | 1.02 – 1.74  | Significant | 0.66 – 2.09   | Insignificant
*The lower end of the two standard deviation range computed to a negative value, which is, of course, impossible for a request time. Therefore, 0.00 is provided.
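The clamping described in the footnote can be expressed directly: because a negative request time is physically impossible, the lower bound of the range is floored at zero. A small sketch using the JIRA+GreenHopper figures from the table above:

```python
def sd_lower_bound(mean, sd, k):
    """Lower end of the k-standard-deviation range, floored at zero
    because a negative request time is physically impossible."""
    return max(0.0, mean - k * sd)

# JIRA+GreenHopper: mean 9.80 s, SD 6.88 s
print(round(sd_lower_bound(9.80, 6.88, 1), 2))  # 2.92 (table reports 2.91 from unrounded data)
print(round(sd_lower_bound(9.80, 6.88, 2), 2))  # 0.0: 9.80 - 2*6.88 is negative, so clamped
```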
Interpretation: The data indicates that VersionOne is fastest for this operation. The insignificant overlap at two standard deviations is a result of the enormous standard deviation of the JIRA+GreenHopper tests.

Best Performer: VersionOne
Test 10: Convert a Story to a Defect/Bug

System           | Mean Request Time (s) | Std. Dev. (s) | Point Estimate Comparison | 1 SD Range (s) | 1 SD Overlap Analysis | 2 SD Range (s) | 2 SD Overlap Analysis
JIRA+GreenHopper | 26.56 | 2.94 | +24.87  | 23.62 – 29.50 | -           | 20.68 – 32.44 | -
Rally            | 1.69  | 0.25 | Fastest | 1.44 – 1.94   | Significant | 1.19 – 2.19   | Significant
VersionOne       | 6.06  | 0.28 | +4.36   | 5.77 – 6.34   | -           | 5.49 – 6.62   | -
*JIRA+GreenHopper required manual timing. See the interpretation below for explanation.
Interpretation: This operation is an example of one in which the procedure in each system is completely different and perhaps not comparable in any reasonable way. In JIRA+GreenHopper, three operations are involved (accessing the story, invoking the editor, and, after changing the type of issue, saving the changes and updating the database), and these had to be manually timed. In addition, the JIRA+GreenHopper page froze after the update for about 10 seconds while it updated the icon to the left of the new defect from a green story icon to a red defect icon. This extra 10 seconds was not included in the timing results, although perhaps it should have been. In Rally, defects are hierarchically below stories as one of a story's attributes, so a story cannot be converted to a defect, though defects can be promoted to stories; that is what we measured in Rally's case. Finally, VersionOne has a menu option to perform this task. The results, reported here just for interest and not defensible statistically, indicate that Rally is fastest at this class of operation, followed by VersionOne at plus-four seconds and JIRA+GreenHopper at plus-24 seconds.

Best Performer: N/A – Informational observations only.

10. Conclusions

Our testing was by no means exhaustive, but thorough enough to build a reasonably sized result set to enable comparison between applications. It fundamentally aimed to assess the performance of testable elements that are consistent between applications. We tried to choose simple, small tests that mapped well between the three systems and could be measured programmatically as opposed to manually (and succeeded in most cases, though some manual timing was required).

Rally was the strongest performer based on the test results, leading outright in six of the nine tests that were compared. In one of these six tests, Rally tied with VersionOne from a scoring perspective in terms of relative performance (using the scoring system developed for comparisons), though it led from a raw measured-speed perspective. In one test not included in the six, Rally tied with JIRA+GreenHopper both numerically and within the bounds of the scoring model that was established. VersionOne was the strongest performer in two of the nine tests, and exhibited very similar performance characteristics (generally within a 1 – 12 second margin) in many of the tests that Rally led. JIRA+GreenHopper did not lead any tests but, as noted, tied with Rally for one.

With the exception of backlog paging, JIRA+GreenHopper trailed in tests that leveraged agile development tools such as the scrumboard, which JIRA+GreenHopper implements with the GreenHopper plug-in. The GreenHopper overlay/add-on seemed unable to handle the large data sets effectively. When we tried to include a test of viewing the backlog for all projects, we were able to do so for Rally and VersionOne, but the JIRA+GreenHopper instance queried for over 12 hours without rendering the scrumboard and merged project backlog. Some object view operations resulted in second-best performance for JIRA+GreenHopper, but with the exception of viewing tasks, the variance associated with requests was extraordinarily high compared to Rally and VersionOne. This large variance will manifest to users as an inconsistent experience (in terms of response time) when performing the same operation.

Anecdotally, the performance of VersionOne compared to Rally was significantly degraded while import activity was taking place, to the extent that VersionOne became effectively unusable during import operations. Further testing could be performed to identify whether this is limited to CSV imports or extends to programmatic API access as well. Given how many platforms utilize API access regularly, it would be interesting to explore this result further.

Both Rally and VersionOne appear to provide a reasonable user experience that should satisfy customers in most cases when the applications are utilizing large data sets with over 500,000 artifacts. JIRA+GreenHopper is significantly disadvantaged from a performance perspective, and seems less suitable for customers with large artifact counts or aggressive growth expectations. Factors such as user concurrency, variations in sprint structure, and numerous others have the potential to skew results in either direction, and it is difficult to predict how specific use cases may affect performance. These tests do, however, provide a reasonable comparative
20. baseline, suggesting Rally has a slight
performance advantage in general, followed
closely by VersionOne.
References

A variety of references were used to help build and execute a performance testing methodology that would allow a reasonable, statistically supported comparison of the performance of the three ALM systems. In addition to documentation available at the websites for each product, the following resources were used:

"Agile software development." Wikipedia. Accessed Sept. 28, 2012 from http://en.wikipedia.org/wiki/Agile_software_development.

Beedle, Mike, et al. "Manifesto for Agile Software Development." Accessed Sept. 28, 2012 from http://agilemanifesto.org.

Hewitt, Joe, et al. Firebug: Add-ons for Firefox. Mozilla. Accessed Sept. 28, 2012 from http://addons.mozilla.org/en-us/firefox/addon/firebug.

Honza. "Firebug Net Panel Timings." Software is Hard. Accessed Sept. 28, 2012 from http://www.softwareishard.com/blog/firebug/firebug-net-panel-timings.

Peter. "Top Agile and Scrum Tools – Which One Is Best?" Agile Scout. Accessed Sept. 28, 2012 from http://agilescout.com/best-agile-scrum-tools.