2. We power the Discovery, Delivery and Display of Digital Entertainment
4 © 2012 Rovi Corporation. Company confidential.
3. Global Reach
• 137M+ viewers use our guide technologies through service provider offerings
• 47M+ storefronts with entertainment services powered by Rovi Entertainment Store
• 266M+ consumer electronic (CE) devices have our CE guide technologies
• 40M+ households reached globally by Rovi Advertising Network
• 600M+ devices certified for high-quality DivX video playback
• Data coverage:
– 4.5M+ TV shows, movies, sports and celebrities
– 3.3M+ album releases and 32M music tracks
– 500K+ movie titles
8. ETL/Cache Loading Data Takes Too Long
[Architecture diagram: the DSG database feeds an Extract Database on the WSP ETL Server; a Transform step and CI Cache/CI Table Loading processes populate a CI Database, two MemcacheDB clusters (Node 1 and Node 2 DB servers, each with backup & restore), and MemcacheD scratch server(s).]
14. Challenges
• Transition existing Windows/.NET team to Linux/Java
– Environment setup, technology framework choices
– Coding differences
– Cultural differences
– Platform differences
– Easier than expected to transition team from .NET to Java – No religious battles
• Backwards compatibility of CXF web services to Microsoft .NET web services
• Managing new releases of Hadoop
• BCP (bulk copy) extracts took too long
– Converted to base tables and used Pig to join the data
• Writes to Mongo are very fast; updates are slower and saturated the disks
– Implemented a diff process (MD5 hash comparison) so Hadoop does the comparison work and writes to Mongo are minimized
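The diff idea above can be sketched in plain Java (names are hypothetical; a real job would keep the previous load's hashes in HDFS alongside the data rather than in memory):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Sketch of the diff process: hash each record and only emit records whose
// hash changed since the last load, so MongoDB sees far fewer writes.
public class DiffFilter {
    private final Map<String, String> previousHashes = new HashMap<>();

    static String md5(String record) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(record.getBytes(StandardCharsets.UTF_8));
        return String.format("%032x", new BigInteger(1, digest));
    }

    /** Returns true if the record is new or changed and should be written. */
    public boolean shouldWrite(String key, String record) throws Exception {
        String hash = md5(record);
        String old = previousHashes.put(key, hash); // remember for next call
        return !hash.equals(old);                   // old == null for new keys
    }

    public static void main(String[] args) throws Exception {
        DiffFilter filter = new DiffFilter();
        System.out.println(filter.shouldWrite("movie:1", "{title: 'Alien'}"));  // new record
        System.out.println(filter.shouldWrite("movie:1", "{title: 'Alien'}"));  // unchanged, skip
        System.out.println(filter.shouldWrite("movie:1", "{title: 'Aliens'}")); // changed, write
    }
}
```

Doing the comparison on the Hadoop side means Mongo only ever receives the minority of documents that actually changed.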
16. Lessons Learned
• General
– Current versions of Hadoop CDH4 and MongoDB 2.0 are actually very stable products
• We purchased enterprise support agreements from both Cloudera and 10gen
– Create a developers VM image
– Deploy early and often even if not ready for real customers
– Use the same setup in test and production environments
• Sharding, in particular, caused differences when the setups diverged
• SQL
– Get raw tables without any transformation or joins
• Let Hadoop do the processing for you
• Hadoop
– Do as much work as you can in Hadoop
– Take the time to create small datasets to iterate fast
– Take the time to learn and use Pig
• It is very fast and provides tons of functionality that you don’t need to code in Java
– Don’t create Runners - Use Oozie workflows
– Measure, benchmark and track performance – Use Hadoop counters
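The "raw tables + Pig" advice above can be sketched roughly like this (paths and field names are hypothetical); the point is to extract untransformed base tables and let the cluster do the join:

```pig
-- Sketch: join two raw base tables in Pig instead of joining in SQL
-- during extraction. Table paths and schemas are illustrative only.
titles  = LOAD '/raw/titles'  USING PigStorage('\t') AS (title_id:long, name:chararray);
credits = LOAD '/raw/credits' USING PigStorage('\t') AS (title_id:long, person:chararray);
joined  = JOIN titles BY title_id, credits BY title_id;
STORE joined INTO '/staging/title_credits' USING PigStorage('\t');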
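Replacing hand-rolled Runner classes with Oozie means each step becomes an action in a workflow definition. A minimal sketch (workflow, action, and script names are hypothetical):

```xml
<!-- Minimal Oozie workflow sketch: one Pig action with explicit
     success/failure transitions instead of a custom Java runner. -->
<workflow-app name="title-load" xmlns="uri:oozie:workflow:0.2">
  <start to="join-titles"/>
  <action name="join-titles">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>join_titles.pig</script>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Join failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Oozie then owns scheduling, retries, and chaining, which the Runner code would otherwise have to reimplement.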
17. Lessons Learned - 2
• MongoDB
– RAM, RAM, RAM!!!
– Many writes from Hadoop can easily overwhelm MongoDB
• Single database lock
• Drive bandwidth saturation – Can be expanded through sharding
• Do as much as possible to minimize writes
• Measure where your application is blocking and optimize
– Don’t shard unless you have to – if you do shard, preconfigure your shard key
• You need a good shard key
– Use replica sets. They are easy to set up and work well.
• Make sure the oplog is large enough.
– Use MongoDB Monitoring Service (MMS) – It’s free
– Mongo queries are fast!
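The replica-set and shard-key points above might look roughly like this in the mongo shell (hostnames, database, and key are hypothetical; this is a sketch, not the deck's actual configuration):

```js
// Sketch: a three-member replica set, then pre-configuring the shard key
// before the collection grows. All names here are illustrative.
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo1:27017" },
    { _id: 1, host: "mongo2:27017" },
    { _id: 2, host: "mongo3:27017" }
  ]
});
sh.enableSharding("entertainment");
sh.shardCollection("entertainment.titles", { titleId: 1 });
```

Choosing the key up front matters because it determines how evenly the bulk writes from Hadoop spread across shards.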
18. Mongo Query – returns 90 rows from a database of 9 million documents in 44 ms
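A query like that is only fast because it is indexed: the server walks the matching B-tree keys instead of scanning 9 million documents. A hedged sketch using the 2.0-era shell syntax (collection and field names are hypothetical):

```js
// Sketch: index the queried field, then confirm the query uses it.
db.titles.ensureIndex({ sourceId: 1 });
db.titles.find({ sourceId: 12345 }).explain(); // expect a BtreeCursor, not BasicCursor
```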
20. Follow-up Information
• Email: robert.vandehey@rovicorp.com
• LinkedIn: http://www.linkedin.com/in/bvandehey
• Twitter: @bvandehey
• Rovi Cloud Services: http://developer.rovicorp.com/
Editor's notes
• This is the new Data Load Process. It makes it look easy… the reality is that it is quite complex. This is just one of our workflows. The orange/tan boxes are Java map/reduce processes, the pink boxes are Pig processes, the white boxes are BCP processes, and the green boxes are MongoDB collections.
• Here is our sharding scheme. We actually have six more servers than shown because we decided to have multiple replicas at each remote site.