2. Content
01 Introduction
• Telefónica PDI. Who?
• Personalisation Server. Why? What?
02 The SQL version
• Data model and architecture
• Integrations, problems and improvements
03 The NoSQL version
• Data model and architecture
• Performance boost
• The bad
04 Conclusions
• Conclusions
• Personal thoughts
4. 01
Telefónica PDI. Who?
• Telefónica
§ Fifth largest telecommunications company in the world
§ Operations in Europe (7 countries), the United States and Latin America
(15 countries)
• Telefónica Digital
§ Web and mobile digital contents and services division
• Product Development and Innovation unit
§ Formerly Telefónica R&D
§ Product & service development, platforms development, research,
technology strategy, user experience and deployment & operation
§ Around 70 different ongoing projects at any time.
Telefónica PDI
4
6. 01
Opt-in and profile module. Why?
• User data (profile and permissions) was scattered across different
storages
§ IPTV service: gender, film and music preferences
§ Mobile service: permission to contact by SMS, gender
§ Music tickets service: address, music preferences
§ Location based offers: address, permission to contact by SMS
• The user's reaction: “So you want to know my address… AGAIN?!”
8. 01
Opt-in and profile module. Why?
• Provide a module that becomes the master storage for customer data
§ Stored once: gender, film and music preferences, permission to
contact by SMS, address
§ Shared by: IPTV service, mobile service, music tickets service,
location based offers
9. 01
Opt-in and profile module. What?
• Features:
§ Flexible profile definition, classified in services
§ Profile sharing options between different services
§ Real time API
§ Supplementary offline batch interface
§ Authorization system
§ High availability
§ Inexpensive solution & hardware
11. 02
Data model
Services, users and their profile
• Services defined a set of attributes (their profile), with default
value and data type
• Users were registered in services
• Users defined values for some of the services attributes
• Each attribute value had an update date to avoid overwriting newer
changes through batch loads
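A minimal sketch of the update-date rule above, in Python. The names `AttributeValue` and `apply_batch_value` are hypothetical and a plain dict stands in for the real table; this is an illustration under those assumptions, not the original implementation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Tuple


@dataclass
class AttributeValue:
    value: str
    updated_at: datetime  # update date stored next to every value


# (user_id, attribute_id) -> stored value; a dict stands in for the DB table
Store = Dict[Tuple[str, str], AttributeValue]


def apply_batch_value(store: Store, user_id: str, attribute_id: str,
                      incoming: AttributeValue) -> bool:
    """Write the incoming value only if it is newer than the stored one."""
    key = (user_id, attribute_id)
    current = store.get(key)
    if current is not None and current.updated_at >= incoming.updated_at:
        return False  # stored value is as new or newer: keep it
    store[key] = incoming
    return True
```

A batch line carrying an older update date than the stored value is simply skipped, so a late batch file cannot undo a change made through the real time API.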
12. 02
Data model
Services profile sharing matrix
• Services could access attributes declared inside other services
• There were sharing rights for read or read and write
• The user had to be registered in both services
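The three rules above can be sketched as a single check. The matrix layout, the `can_access` name and the rule that a service always sees its own attributes are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

READ, READ_WRITE = "r", "rw"

# (owner service, consumer service, attribute) -> granted right
SharingMatrix = Dict[Tuple[str, str, str], str]


def can_access(matrix: SharingMatrix, registrations: Dict[str, Set[str]],
               user_id: str, consumer: str, owner: str,
               attribute: str, write: bool = False) -> bool:
    """Return True if `consumer` may read (or write) `attribute` owned by `owner`."""
    services = registrations.get(user_id, set())
    # The user has to be registered in both services
    if consumer not in services or owner not in services:
        return False
    if consumer == owner:
        return True  # assumption: a service always sees its own attributes
    right = matrix.get((owner, consumer, attribute))
    if right is None:
        return False  # attribute not shared with this consumer
    return right == READ_WRITE if write else True
```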
13. 02
Data model
Authorization system
• Everything that could be accessed in the PS was a resource
• Roles defined access rights (read or read and write) of resources
• Auth users had roles
• Roles could include other roles
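Because roles could include other roles, resolving what an auth user may do means walking the inclusion graph. The sketch below assumes hypothetical structures (`grants`, `includes`) and assumes that the stronger right wins when a resource is granted twice.

```python
from typing import Dict, Set

RIGHT_ORDER = {"r": 1, "rw": 2}  # "rw" is stronger than "r"


def effective_rights(role: str,
                     grants: Dict[str, Dict[str, str]],
                     includes: Dict[str, Set[str]]) -> Dict[str, str]:
    """Collect the resource rights of a role and of every role it includes."""
    result: Dict[str, str] = {}
    seen: Set[str] = set()
    stack = [role]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cyclic includes
        seen.add(current)
        for resource, right in grants.get(current, {}).items():
            best = result.get(resource)
            if best is None or RIGHT_ORDER[right] > RIGHT_ORDER[best]:
                result[resource] = right
        stack.extend(includes.get(current, ()))
    return result
```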
14. 02
Data model
Bonus features!
• Multiple IDs:
§ Users profile could be accessed with different equivalent IDs depending
on the service
§ Each user ID was defined by an ID type (phone number, email, portal ID,
hash…) and the ID value
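One way to picture equivalent IDs is an index from every (ID type, ID value) pair to a single internal user key. The class and method names below are hypothetical, and an in-memory dict replaces the real storage.

```python
from typing import Dict, Optional, Tuple


class IdIndex:
    """Maps every (id_type, id_value) pair to one internal user key."""

    def __init__(self) -> None:
        self._by_id: Dict[Tuple[str, str], int] = {}
        self._next_key = 1

    def register(self, id_type: str, id_value: str) -> int:
        """Create a new user reachable through its first ID."""
        key = self._next_key
        self._next_key += 1
        self._by_id[(id_type, id_value)] = key
        return key

    def add_equivalent(self, user_key: int, id_type: str, id_value: str) -> None:
        """Attach another ID (phone number, email, portal ID...) to the same user."""
        self._by_id[(id_type, id_value)] = user_key

    def resolve(self, id_type: str, id_value: str) -> Optional[int]:
        """Return the internal user key, or None if the ID is unknown."""
        return self._by_id.get((id_type, id_value))
```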
15. 02
High level logical architecture
§ Everything running on Red Hat EL 5.4 64 bits
17. 02
Integration
Planned integration
• PS replaces all customers profile and
permissions DBs
• All systems access this data through
PS real time API
• In special cases, some PS-consumers
could use the batch interface.
• In the same way, new services could be
added quite easily
18. 02
Integration
Problems arise
• Budget restrictions: adapting all services
to use the API was too expensive
• Keep independent systems DBs and
synchronize PS through batch
• Use DBs built-in massive extraction
feature to generate daily batch files
• However… in most cases those DBs
were not able to generate Delta
(only changes) extractions
§ Provide full daily snapshots!
19. 02
First version performance
Ireland
• 1.8M customers, 180 profile attributes, 6 services
• Sizes
§ Tables + indexes size: 65Gb
§ 30% of the size were indexes
• Batch
§ Full DWH customer’s profile import: > 24 hours
§ Delta extractions: 4 - 6 hours
§ Loads and extractions performance proportional to data size
• API:
§ Response time with average traffic: 110ms
21. 03
Second version
High level logical architecture
• New approach: batch processes access the DB directly
22. 03
Second version
Batch processes
• Batch processes had to
§ Validate authentication and authorization
§ Verify user, service and attribute existence
§ Check equivalent IDs
§ Validate sharing matrix rights
§ Validate values data type
§ Check the update date of the existing values
23. 03
Second version
DB Batch processing
(Image: our DBAs)
24. 03
Second version
New DB-based batch loading process
• Preprocess incoming batch file in BE servers
§ Validate format, services and attributes existence and values data types
§ Generate intermediate file with structure like target DB table
• Load intermediate file (Oracle’s SQL*Loader) into a temporary table
• Switch DB to “deferred writing”, storing all incoming modifications
• Merge temporary table and final table, checking values update date
• Replace old users attributes values table with merge result
• Apply deferred writing operations
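The load sequence above (defer writes, merge, swap, replay) can be sketched in Python as follows. This is an illustration only: the original ran inside Oracle, and the class `ValuesTable`, its methods and the integer timestamps are all hypothetical.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (user_id, attribute_id)
Row = Tuple[str, int]  # (value, update timestamp)


def upsert_if_newer(target: Dict[Key, Row], key: Key, row: Row) -> None:
    """Store the row unless the target already holds a newer or equal one."""
    current = target.get(key)
    if current is None or row[1] > current[1]:
        target[key] = row


class ValuesTable:
    def __init__(self) -> None:
        self.rows: Dict[Key, Row] = {}
        self.deferred: List[Tuple[Key, Row]] = []
        self.deferring = False

    def write(self, key: Key, row: Row) -> None:
        """API write: applied directly, or queued while a batch load runs."""
        if self.deferring:
            self.deferred.append((key, row))
        else:
            upsert_if_newer(self.rows, key, row)

    def start_load(self) -> None:
        self.deferring = True  # switch to "deferred writing"

    def merge_and_swap(self, temp_rows: Dict[Key, Row]) -> None:
        merged = dict(self.rows)
        for key, row in temp_rows.items():
            upsert_if_newer(merged, key, row)  # check values update date
        self.rows = merged  # replace the old table with the merge result

    def finish_load(self) -> None:
        self.deferring = False
        for key, row in self.deferred:  # apply deferred writing operations
            upsert_if_newer(self.rows, key, row)
        self.deferred.clear()
```

Writes that arrive between `start_load()` and `finish_load()` are not visible until the load ends, which is one plausible reading of why API behaviour during loads was hard to predict in this version.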
25. 03
Second version
New batch extraction process
• Generate a temporary DB table with a format similar to the final batch file.
Two loops over the users attributes values table required:
§ Select format of the table; number and order of columns / attributes
§ Fill the new table
• Loop the whole temporary table for final formatting (empty fields…)
• From the batch side, loop across the whole table (SELECT * FROM …)
• Write each retrieved row as a line in the resulting file
26. 03
Second version performance
Ireland performance requirements
• Batch time window: 3:30 hours
§ Full DWH load
§ Two Delta loads
§ Three Delta extractions
• API:
§ Ireland requirement: < 500ms
27. 03
Second version performance
Ireland
• 1.8M customers, 180 profile attributes, 6 services
• Sizes
§ Tables + indexes size: 65Gb
§ 30% of the size were indexes
§ Temporary tables size increased almost exponentially: 15Gb and above
§ Intermediate file size: from 700Mb to 7Gb
• Batch
§ Full DWH customer’s profile import: 2:30 hours
§ Delta extractions: 1:00 hour
§ Loads performance worsened quickly (almost exponentially): 6:00 hours
§ Extractions performance proportional to data size
§ Concurrent batch processes may halt the DB
• API:
§ Response time with average traffic: 80ms
§ Response time while loading was unpredictable: >300ms
29. 04
Third version
Speed up DB Batch processes
(Image: our DBAs, again)
30. 04
Third version
New (second) DB-based batch loading process
• Minor preprocessing of incoming batch file in BE servers
§ Just validate format
§ No intermediate file needed!
• Load validated file (Oracle’s SQL*Loader) into a temporary table
• Loop the temporary table merging the values into final table, checking
values update date and data types
§ Use several concurrent writing jobs
• Store results on real table, no need to replace!
• No “deferred writing”!
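A rough Python sketch of merging in place with several concurrent writing jobs. The original jobs ran inside Oracle; here threads and a dict only illustrate the partitioning idea (each job owns a disjoint set of users, so no two jobs touch the same row). All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (user_id, attribute_id)
Row = Tuple[str, int]  # (value, update timestamp)


def merge_chunk(final: Dict[Key, Row], chunk: List[Tuple[Key, Row]]) -> int:
    """Merge one slice of the temporary table in place; return rows written."""
    written = 0
    for key, row in chunk:
        current = final.get(key)
        if current is None or row[1] > current[1]:  # check values update date
            final[key] = row
            written += 1
    return written


def concurrent_merge(final: Dict[Key, Row],
                     temp: List[Tuple[Key, Row]], jobs: int = 4) -> int:
    """Split the temporary rows by user and merge the slices concurrently."""
    chunks: List[List[Tuple[Key, Row]]] = [[] for _ in range(jobs)]
    for key, row in temp:
        # Partition by user so that two jobs never write the same key
        chunks[hash(key[0]) % jobs].append((key, row))
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return sum(pool.map(lambda chunk: merge_chunk(final, chunk), chunks))
```

Since results go straight into the real table, there is nothing to swap afterwards and API writes do not need to be deferred.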
31. 04
Third version
Enhancements to extraction process
• Optimized loops to generate the temporary output table.
§ Use several concurrent writing jobs
§ We achieved a speed-up of between 1.5x and 2x
• Loop the whole temporary table for final formatting (empty fields…)
• Download and write lines directly inside Oracle’s sqlplus
• No SELECT * FROM … query from Batch side!
32. 04
Third version performance
Ireland
• 1.8M customers, 180 profile attributes, 6 services
• Sizes
§ Tables + indexes size: 65Gb
§ 30% of the size were indexes
§ Temporary tables: 15Gb
• Batch
§ Full DWH customer’s profile import: 1:10 hours (vs. 2:30 hours)
§ Three Delta extractions: 2:15 hours (vs. 3:00 hours)
§ Loads and extractions performance proportional to data size
§ Concurrent batch processes not so harmful
• API:
§ Response time with average traffic: 110ms
§ Response time while loading: 400ms
(Image: our DBAs, “F**K YEAH”)
33. 04
Third version performance
United Kingdom
• 25M customers, 150 profile attributes, 15 services
• Sizes
§ Tables + indexes size: 700Gb
§ 40% of the size were indexes
• Batch
§ Two Delta imports: < 2:00 hours
§ Two Delta extractions: < 2:00 hours
§ Loads and extractions performance proportional to data size
• API:
§ Response time with average traffic: 90ms
(Image: our DBAs, “F**K YEAH”)
34. 04
Third version performance
Ireland
                    | 3rd version        | 2nd version
DB size             | 65Gb + 15Gb (temp) | 65Gb + > 15Gb
Full DWH load       | 1:10 hours         | 2:30 hours
Three Delta exports | 2:15 hours         | 3:00 hours
Batch stability     | Stable, linear     | Unstable, exponential
API response time   | 110ms              | 110ms
API while loading   | 400ms              | Unpredictable
United Kingdom
                    | 3rd version
DB size             | 700Gb
Two Delta loads     | < 2:00 hours
Three Delta exports | < 2:00 hours
API response time   | 90ms
(Image: our DBAs, “F**K YEAH”)
35. 04
Third version performance
DB stats
• 20 database tables
• API: several queries with up to 35 joins and even some unions
• Authorization: 5 joins to validate auth users access
• Batch:
§ Load: 1700 lines of PL/SQL
§ Extraction: 1200 lines of PL/SQL
37. 04
Third version performance
Mexico
• 20M customers, 200 profile attributes, 10 services
• Mexico time window: 4:00 hours
§ Full DWH load!
§ Additional Delta feeds loads
§ At least two Delta extractions
(Image: our DBAs)
42. 05
MongoDB Data Model
DB stats
• Only 5 collections
• API: typically 2 accesses (services and users collections)
• Authorization: access only 1 collection to grant access
• Batch: all processing done outside DB
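To illustrate why an API call typically needs only two accesses, here is a sketch with hypothetical document shapes; the real schema is not given in these slides. Plain dicts stand in for the two collections, and with pymongo each lookup would roughly correspond to one `find_one` call.

```python
from typing import Any, Optional

# Hypothetical document shapes; the real schema is not given in the slides.
services = {
    "iptv": {"_id": "iptv",
             "attributes": {"gender": {"type": "str", "default": None}}},
}
users = {
    "u1": {"_id": "u1", "services": ["iptv"],
           "values": {"iptv": {"gender": {"v": "F", "ud": 1330000000}}}},
}


def get_attribute(service_id: str, user_id: str, attribute: str) -> Optional[Any]:
    """Two lookups: the service definition, then the user document."""
    service = services.get(service_id)  # access 1: services collection
    if service is None or attribute not in service["attributes"]:
        raise KeyError(attribute)
    user = users.get(user_id)  # access 2: users collection
    if user is None or service_id not in user["services"]:
        return None  # unknown user, or user not registered in the service
    stored = user["values"].get(service_id, {}).get(attribute)
    if stored is None:
        return service["attributes"][attribute]["default"]
    return stored["v"]
```

All of a user's values live inside one document, so no joins are needed once the user has been fetched.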
43. 05
NoSQL version
High level logical architecture
§ Everything running on Red Hat EL 6.2 64 bits
44. 05
NoSQL version performance
Ireland (at PDI lab)
• 1.8M customers, 180 profile attributes, 6 services
• Sizes
§ Collections + indexes size: 20Gb (vs. 65Gb)
§ < 5% of the size are indexes (vs. 30%)
• Batch
§ Full DWH customer’s profile import: 0:12 hours (vs. 1:10 hours)
§ Three Delta extractions: 0:40 hours (vs. 2:15 hours)
§ Loads and extractions performance proportional to data size
§ Concurrent batch processes did not affect performance
• API:
§ Response time with average traffic: < 10ms (vs. 110ms)
§ Response time while loading: the same
§ High load (600 TPS) response time while loading: 300ms
45. 05
NoSQL version performance
United Kingdom (at PDI lab)
• 25M customers, 150 profile attributes, 15 services
• Sizes
§ Collections + indexes size: 210Gb (vs. 700Gb)
§ < 5% of the size were indexes
• Batch
§ Two Delta imports: < 0:40 hours (vs. 2:00 hours)
§ Loads and extractions performance proportional to data size
46. 05
NoSQL version performance
Mexico
• 20M customers, 200 profile attributes, 15 services
• Sizes
§ Collections + indexes size: 320Gb
§ Indexes size: 1.2Gb
• Batch
§ Initial Full import (20M, 40 attributes): 2:00 hours
§ Small Full import (20M, 6 attributes): 0:40 hours
• API:
§ Response time with average traffic: < 10ms (vs. 90ms)
§ Response time while loading: the same
§ High load (500 TPS) response time while loading: 270ms
47. 04
NoSQL version performance
Ireland
                            | NoSQL version | SQL version
DB size                     | 20Gb          | 80Gb
Full DWH load               | 0:12 hours    | 1:10 hours
Three Delta exports         | 0:40 hours    | 2:15 hours
API while loading           | < 10ms        | 400ms
API 600TPS + loading        | 300ms         | Timeout / failure
United Kingdom
                            | NoSQL version | SQL version
DB size                     | 210Gb         | 700Gb
Two Delta loads             | < 0:40 hours  | < 2:00 hours
Mexico
                            | NoSQL version
DB size                     | 320Gb
Initial Full load (40 attr) | 2:00 hours
Daily Full load (6 attr)    | 0:40 hours
API while loading           | < 10ms
API 500TPS + loading        | 270ms
(Image: our DBAs)
49. 05
The bad
• Batch load process was too fast
§ To keep secondary nodes synched we needed oplog of 16 or 24Gb
§ We had to disable journaling for the first migrations
• Labels of documents fields take up disc space
§ Reduced them to just 2 chars: “attribute_id” -> “ai”
• Respect the unwritten law of at least 70% of size in RAM
• Take care with compound indexes, order matters
§ You can save one index… or you can have problems
§ Put the most important key (never nullable) first
• DBAs whining and complaining about NoSQL
§ “If we had enough RAM for all data, Oracle would outperform MongoDB”
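Field names are stored in every document, so shortening them saves space on each one. A minimal sketch of a translation layer follows; only the "attribute_id" -> "ai" pair comes from the slide, the other entries and the function names are made up for illustration.

```python
from typing import Any, Dict

# Only "attribute_id" -> "ai" comes from the slide; other entries are made up.
SHORT_NAMES = {"attribute_id": "ai", "value": "v", "update_date": "ud"}
LONG_NAMES = {short: long for long, short in SHORT_NAMES.items()}


def _rename(doc: Any, names: Dict[str, str]) -> Any:
    """Recursively rename dict keys, leaving unknown keys untouched."""
    if isinstance(doc, dict):
        return {names.get(k, k): _rename(v, names) for k, v in doc.items()}
    if isinstance(doc, list):
        return [_rename(item, names) for item in doc]
    return doc


def to_stored(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Readable field names -> short names written to the DB."""
    return _rename(doc, SHORT_NAMES)


def from_stored(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Short names read from the DB -> readable field names."""
    return _rename(doc, LONG_NAMES)
```

Keeping the mapping in one place lets application code keep readable names while the stored documents stay small.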
50. 05
The ugly
• Second migration once the PS is already running
§ Full import adding 30 new attributes values: 10:00 hours
§ Full import adding 150 new attributes values: 40:00 hours
• Considerably increasing the documents size (i.e. adding lots of new
values to the users) makes MongoDB rearrange the documents, performing
around 5 times slower
§ That’s a problem when you are updating 10k documents per second
• Solutions?
§ Avoid this situation at all cost. Run away!
§ Normalize users values; move to a new individual collection
§ Preallocate the size with a faux field
• You could waste space!
§ Load in new collection, merge and swap, like we did in Oracle
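The faux-field option can be sketched as follows: pad each new document up to the size it is expected to reach, then drop the padding when the real values arrive. The field name `_pad` and the JSON-based size estimate (a rough stand-in for the BSON size) are assumptions.

```python
import json
from typing import Any, Dict

PAD_FIELD = "_pad"  # hypothetical name for the faux field


def with_padding(doc: Dict[str, Any], target_bytes: int) -> Dict[str, Any]:
    """Return a copy of `doc` padded with filler up to roughly target_bytes.

    Size is estimated from the JSON encoding, a rough stand-in for BSON size.
    """
    current = len(json.dumps(doc).encode("utf-8"))
    missing = target_bytes - current
    padded = dict(doc)
    if missing > 0:
        padded[PAD_FIELD] = "x" * missing
    return padded


def without_padding(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Drop the faux field, for example just before writing the real values."""
    return {k: v for k, v in doc.items() if k != PAD_FIELD}
```

The trade-off is the one noted above: every user who never receives the extra values keeps the wasted space.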
52. 06
Conclusions & personal thoughts
• Awesome performance boost
§ But not all use cases fit in a MongoDB / NoSQL solution!
• New technology, different limitations
• Fear of the unknown
§ SSDs performance?
§ Long term performance and stability?
• Python + MongoDB + pymongo = fast development
§ I mean, really fast
• MongoDB Monitoring Service (MMS)
• 10gen people were very helpful
55. 0X
SQL Physical architecture
§ Scale horizontally adding more BE or DB servers or disks in the SAN
§ Virtualized or physical servers depending on the deployment
56. 0X
MongoDB Physical architecture
§ MongoDB arbiters running on BE servers
§ Scale horizontally adding more BE servers or disks in the SAN
§ Sharding may already be configured to scale adding more replica sets