This document summarizes a Salesforce webinar about loading data in parallel using the Bulk API. It discusses how parallel processing can significantly increase data load throughput compared to serial loads. However, locks and other inhibitors can prevent optimal parallelism. The webinar demonstrates approaches to identify and manage locks, such as modifying the data schema, ordering the load file, and using a controlled parallel approach. Managing locks is key to achieving high parallelism and throughput during large data loads.
2. Safe Harbor
Safe harbor statement under the Private Securities Litigation Reform Act of 1995:
This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of
the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking
statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service
availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future
operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use
of our services.
The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our
service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth,
interruptions or delays in our Web hosting, breach of our security measures, the outcome of intellectual property and other litigation, risks associated with
possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and
motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial
results of salesforce.com, inc. is included in our quarterly report on Form 10-Q for the most recent fiscal quarter ended July 31, 2012. This document and
others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site.
Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be
delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available.
Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.
#forcewebinar
4. Follow Developer Force for the Latest News
@forcedotcom / #forcewebinar
15. Locks, exceptions, triggers, relationships, …
[Diagram: load time for a serial load of 20M records vs. a parallel load of four 5M-record sets; horizontal axis: Time. Callout: throughput inhibitors.]
16. Data load case studies
§ Get hands on with the Salesforce Bulk API
§ Contrast serial data loads vs. parallel data loads
§ Measure degrees of parallelism and throughput
§ Identify and avoid throughput inhibitors
§ Achieve maximum throughput
18. Salesforce Bulk API
§ Asynchronous data loading
§ Optimized for large data sets
§ REST API
§ Powers many tools
§ Use it to build custom tools in any programming language (Java, etc.)
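As a rough illustration of building a custom tool on the Bulk API, the sketch below prepares (but does not send) the HTTP request that opens a load job in either serial or parallel concurrency mode. The endpoint path, the `X-SFDC-Session` header, and the `jobInfo` XML layout are written as recalled from the Bulk API developer guide and should be checked against the current documentation. The instance URL, session ID, API version, and the function name `build_create_job_request` are placeholders chosen for this example.

```python
import urllib.request
from xml.sax.saxutils import escape

# XML namespace used by Bulk API job descriptions (as recalled from the docs).
ASYNC_NS = "http://www.force.com/2009/06/asyncapi/dataload"


def build_create_job_request(instance_url, session_id, sobject,
                             operation="insert",
                             concurrency_mode="Parallel",
                             api_version="26.0"):
    """Build, without sending, the POST request that opens a Bulk API job.

    All argument values are supplied by the caller; api_version is only an
    assumed default for illustration.
    """
    if concurrency_mode not in ("Parallel", "Serial"):
        raise ValueError("concurrency_mode must be 'Parallel' or 'Serial'")
    body = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<jobInfo xmlns="{ASYNC_NS}">'
        f"<operation>{escape(operation)}</operation>"
        f"<object>{escape(sobject)}</object>"
        f"<concurrencyMode>{concurrency_mode}</concurrencyMode>"
        "<contentType>CSV</contentType>"
        "</jobInfo>"
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{instance_url}/services/async/{api_version}/job",
        data=body,
        headers={
            "X-SFDC-Session": session_id,
            "Content-Type": "application/xml; charset=UTF-8",
        },
        method="POST",
    )


# Placeholder host and session ID; a real tool would obtain both at login.
req = build_create_job_request(
    "https://instance.example.com", "SESSION_ID_PLACEHOLDER",
    "Account", concurrency_mode="Serial",
)
print(req.get_method(), req.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would submit it. Switching `concurrency_mode` between `"Serial"` and `"Parallel"` is the only change needed to contrast the two kinds of load discussed in the case studies.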
28. Serial load summary
Concurrency Mode: Serial
Records Loaded: 1 million
Records Failed: 0
Run Time: 52 minutes
Work Completed: 48 minutes
Throughput: 19,500 records per minute
Degree of Parallelism: 0.94
Key Problem: Degree of parallelism explicitly limited to ~1.
Solution: Explore parallel load for increased throughput.
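The two metrics in this summary can be reproduced with simple arithmetic. The sketch below assumes that throughput is records loaded divided by run time, and that degree of parallelism is work completed divided by run time. These definitions are inferred from the figures on the slides rather than stated in them, and the function name `load_metrics` is made up for this example.

```python
def load_metrics(records_loaded, run_time_min, work_completed_min):
    """Return (throughput in records/min, degree of parallelism).

    Assumed definitions, inferred from the summary figures:
      throughput            = records loaded / run time
      degree of parallelism = work completed / run time
    """
    throughput = records_loaded / run_time_min
    degree_of_parallelism = work_completed_min / run_time_min
    return throughput, degree_of_parallelism


# Serial load figures as shown in the summary (rounded to whole minutes).
throughput, dop = load_metrics(1_000_000, 52, 48)
print(f"{throughput:,.0f} records/min, degree of parallelism {dop:.2f}")
```

With the rounded timings this gives roughly 19,200 records per minute and a degree of parallelism of about 0.92. That is close to the reported 19,500 and 0.94, which were presumably computed from unrounded timings.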
29. Parallelism vs. throughput of a single job
[Chart: throughput (records/min, 0 to 350,000) against degree of parallelism (1 to 20). Plotted run: Serial.]
Serial Run
• Low degree of parallelism
33. Things to watch for
§ Locks can significantly affect parallel loads
– Wasted processing capacity
– Reduced throughput
– Failures
§ Retry logic is not all it’s cracked up to be
35. Parallel load 1 summary
Concurrency Mode: Parallel
Records Loaded: 125,000
Records Failed: 875,000
Run Time: 10 minutes
Work Completed: 2 hours and 30 minutes
Throughput: 20,000 records per minute
Degree of Parallelism: 15.79
Key Problem: Lock exceptions. Server worked significantly harder but no increase in throughput.
Solution: Run the load in serial mode or manage locks.
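One way to confirm that failures like these come from locks is to tally the error codes in the batch results. The sketch below assumes the results arrive as CSV with `Id`, `Success`, `Created`, and `Error` columns, and that lock failures start with the status code `UNABLE_TO_LOCK_ROW`. Both are written as recalled from the Bulk API documentation and should be verified. The sample rows and the function name `tally_results` are invented for illustration.

```python
import csv
import io
from collections import Counter


def tally_results(results_csv_text):
    """Count successes and failures, grouping failures by leading status code."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(results_csv_text)):
        if row["Success"].strip().lower() == "true":
            counts["SUCCESS"] += 1
        else:
            # The Error column is assumed to look like "STATUS_CODE:message".
            code = row["Error"].split(":", 1)[0].strip()
            counts[code or "UNKNOWN"] += 1
    return counts


# Invented sample rows, not output from the webinar's loads.
sample = (
    '"Id","Success","Created","Error"\n'
    '"001xx0000000001","true","true",""\n'
    '"","false","false","UNABLE_TO_LOCK_ROW:unable to obtain exclusive access to this record"\n'
    '"","false","false","REQUIRED_FIELD_MISSING:Required fields are missing: [Name]"\n'
)
counts = tally_results(sample)
for code, count in counts.most_common():
    print(code, count)
```

A results set dominated by the lock status code would point to lock contention rather than bad data, which is the situation this summary describes.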
36. Parallelism vs. throughput of a single job
[Chart: throughput (records/min, 0 to 350,000) against degree of parallelism (1 to 20). Plotted runs: Serial, Parallel 1.]
Parallel Run 1
• High degree of parallelism
• Low throughput due to locks
37. Time to optimize
§ Let’s make your data load RIP
§ Realize
– Locks inhibit parallelism and throughput
§ Investigate
– What is causing the locks
§ Plan
– Manage the locks
39. Parallel load: Sample results
Concurrency Mode: Parallel
Records Loaded: 1 million
Records Failed: 0
Run Time: 3 minutes and 30 seconds
Work Completed: 1 hour
Throughput: 320,000 records per minute
Degree of Parallelism: 19
Key Problem: None
Solution: n/a
40. Parallelism vs. throughput of a single job
[Chart: throughput (records/min, 0 to 350,000) against degree of parallelism (1 to 20). Plotted runs: Serial, Parallel 1, Parallel 2.]
Parallel Run 2
• High degree of parallelism
• High throughput
41. Locks can be managed by
§ Elimination
§ Ordering load file
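Ordering the load file can be sketched as follows. The example assumes that the locks come from child records sharing a parent, so that sorting by the parent key keeps each parent's children in one batch and stops parallel batches from competing for the same parent row. The field name `AccountId`, the batch size, and all function names are hypothetical choices for this illustration.

```python
from collections import defaultdict


def split_into_batches(rows, batch_size):
    """Cut the rows into consecutive batches of at most batch_size."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def order_by_parent(rows, parent_key):
    """Sort rows so that records sharing a parent sit next to each other."""
    return sorted(rows, key=lambda row: row[parent_key])


def parents_spanning_batches(batches, parent_key):
    """Return the parents whose children appear in more than one batch."""
    seen = defaultdict(set)
    for index, batch in enumerate(batches):
        for row in batch:
            seen[row[parent_key]].add(index)
    return {parent for parent, indexes in seen.items() if len(indexes) > 1}


# Twelve hypothetical child rows spread round-robin over three parents.
rows = [{"Name": f"Contact {i}", "AccountId": f"A{i % 3}"} for i in range(12)]

unordered = split_into_batches(rows, 4)
ordered = split_into_batches(order_by_parent(rows, "AccountId"), 4)

print("contended parents, unordered:",
      sorted(parents_spanning_batches(unordered, "AccountId")))
print("contended parents, ordered:",
      sorted(parents_spanning_batches(ordered, "AccountId")))
```

In the unordered file every parent appears in several batches, so batches running in parallel could contend for the same parent. After ordering, each parent's children fall into a single batch. With real data the batch boundaries will rarely align this neatly, so a few parents may still straddle two adjacent batches.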
43. Managing locks … a discussion while we load
§ Master-detail relationships
§ Lookup relationships
§ Roll-up summary fields
§ Triggers
§ Workflow rules
§ Group membership locks*
44. Parallel load: Sample results
Concurrency Mode: Parallel
Records Loaded: 1 million
Records Failed: 0
Run Time: 4 minutes
Work Completed: 1 hour
Throughput: 250,000 records per minute
Degree of Parallelism: 16.5
Key Problem: Minimal overhead due to locks
Solution: Remove all unnecessary locks
45. Parallelism vs. throughput of a single job
[Chart: throughput (records/min, 0 to 350,000) against degree of parallelism (1 to 20). Plotted runs: Serial, Parallel 1, Parallel 2, Parallel 3.]
Parallel Run 3
• High degree of parallelism
• High throughput
50. Recap
§ Make your parallel data loads RIP
§ Realize
– Locks inhibit parallelism and throughput
§ Investigate
– What is causing the locks
§ Plan
– Manage the locks