Moving from C#/.NET to Hadoop/MongoDB
Robert Vandehey
December 4, 2012
We power the Discovery, Delivery and Display of Digital Entertainment
4   © 2012 Rovi Corporation. Company confidential.
Global Reach
• 137M+ viewers use our guide technologies through service provider offerings
• 47M+ storefronts with entertainment services powered by Rovi Entertainment Store
• 266M+ consumer electronics (CE) devices have our CE guide technologies
• 40M+ households reached globally by Rovi Advertising Network
• 600M+ devices certified for high-quality DivX video playback
• Data coverage:
      – 4.5M+ TV shows, movies, sports and celebrities
      – 3.3M+ album releases and 32M music tracks
      – 500K+ movie titles

The Problem
ETL/Cache Loading Data Takes Too Long

[Diagram: data moves from the DSG database on the DSG DB server(s) through Extract and Transform steps on the WSP ETL server into a CI database. The CI database is copied by backup and restore to the Node 1 and Node 2 DB servers. A table loading process feeds the MemcacheD scratch server(s), and a cache loading process fills the MemcacheD cluster and the MemcacheDB cluster.]

The Solution
Hadoop/MongoDB




Network Diagram




Mongo Sharding




Challenges
• Transition existing Windows/.NET team to Linux/Java
      – Environment setup. Technology framework choices
      – Coding differences
      – Cultural differences
      – Platform differences
      – Easier than expected to transition team from .NET to Java – No religious battles

• Backwards compatibility of CXF web services to Microsoft .NET web services
• Managing new releases of Hadoop
• BCP took too long
      – Converted to base tables. Used Pig to join the data

• Writes to Mongo are very fast. Updates are slower and saturated the disks
      – Implemented Diff process (MD5 calc) to allow Hadoop to do the work and minimize writes to Mongo
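
The diff idea above can be sketched roughly as follows. This is a minimal illustration, not the process from the talk: it assumes each record is a dictionary keyed by an ID and that the MD5 digests from the previous load are available as a mapping. The names `record_digest` and `diff_records` and the record shape are hypothetical. In the deck this comparison ran in Hadoop; plain Python is used here only to show the logic.

```python
import hashlib
import json


def record_digest(record):
    """Return an MD5 hex digest of a record, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def diff_records(previous_digests, current_records):
    """Split the current load into inserts, updates and deletes.

    previous_digests: {record_id: md5 hex digest} from the last load.
    current_records:  {record_id: record dict} for this load.
    Unchanged records are skipped, so they cause no write at all.
    """
    inserts, updates = {}, {}
    for record_id, record in current_records.items():
        digest = record_digest(record)
        old = previous_digests.get(record_id)
        if old is None:
            inserts[record_id] = record      # never seen before
        elif old != digest:
            updates[record_id] = record      # content changed
    deletes = sorted(set(previous_digests) - set(current_records))
    return inserts, updates, deletes


if __name__ == "__main__":
    previous = {
        "m1": record_digest({"title": "Alien", "year": 1979}),
        "m2": record_digest({"title": "Brazil", "year": 1984}),
        "m3": record_digest({"title": "Clue", "year": 1985}),
    }
    current = {
        "m1": {"year": 1979, "title": "Alien"},   # unchanged, key order differs
        "m2": {"title": "Brazil", "year": 1985},  # changed
        "m4": {"title": "Dune", "year": 1984},    # new
    }
    inserts, updates, deletes = diff_records(previous, current)
    print(sorted(inserts), sorted(updates), deletes)
```

Only the records returned in `inserts`, `updates` and `deletes` would need to reach the database. The unchanged majority of a reload is filtered out before any write is issued.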




Lessons Learned
• General
      – Current versions of Hadoop CDH4 and MongoDB 2.0 are actually very stable products
               • We purchased enterprise support agreements from both Cloudera and 10gen
      – Create a developer VM image
      – Deploy early and often even if not ready for real customers
      – Use the same setup in test and production environments
               • Sharding caused differences

• SQL
      – Get raw tables without any transformation or joins
               • Let Hadoop do the processing for you

• Hadoop
      – Do as much work as you can in Hadoop
      – Take the time to create small datasets to iterate fast
      – Take the time to learn and use Pig
               • It is very fast and provides tons of functionality that you don’t need to code in Java
      – Don’t create Runners - Use Oozie workflows
      – Measure, benchmark and track performance – Use Hadoop counters
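
As one way to picture the counters advice, here is a small sketch of a mapper that reports counters. It assumes Hadoop Streaming rather than the Java map/reduce jobs described in the talk. In Streaming, a task updates a counter by writing a `reporter:counter:<group>,<counter>,<amount>` line to standard error. The group name `DataLoad`, the counter names and the tab-separated input format are assumptions made for illustration.

```python
import io
import sys


def counter_line(group, name, amount=1):
    """Format a Hadoop Streaming counter update (to be written to stderr)."""
    return "reporter:counter:{},{},{}\n".format(group, name, amount)


def run_mapper(lines, out=sys.stdout, err=sys.stderr):
    """Emit id<TAB>payload pairs and count emitted and malformed rows."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2 or not fields[0]:
            # Hypothetical counter: rows that do not match the expected shape.
            err.write(counter_line("DataLoad", "MALFORMED_ROWS"))
            continue
        out.write("{}\t{}\n".format(fields[0], fields[1]))
        err.write(counter_line("DataLoad", "ROWS_EMITTED"))


if __name__ == "__main__":
    # Local dry run on in-memory data; no cluster is needed.
    out, err = io.StringIO(), io.StringIO()
    run_mapper(["m1\tAlien\n", "broken row\n", "m2\tBrazil\n"], out, err)
    print(out.getvalue(), end="")
    print(err.getvalue(), end="")
```

In an actual Streaming job the script would call `run_mapper(sys.stdin)` instead of the dry run, and the counter totals would then appear alongside the job's built-in counters, which makes it possible to track row counts from one run to the next.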



Lessons Learned - 2
• MongoDB
      – RAM, RAM, RAM!!!
      – Many writes from Hadoop can easily overwhelm MongoDB
               • Single database lock
               • Drive bandwidth saturation – Can be expanded through sharding
               • Do as much as possible to minimize writes
               • Measure where your application is blocking and optimize
      – Don’t shard unless you have to – if you do shard, preconfigure your shard key
               • You need a good shard key

      – Use replica sets. They are easy to set up and work well.
               • Make sure the oplog (replication log) is large enough.

      – Use MongoDB Monitoring Service (MMS) – It’s free
      – Mongo queries are fast!
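
To make the shard key point concrete, the following sketch simulates range-based chunks with evenly spaced, pre-created split points and compares two candidate keys. It is an illustration only: it does not call MongoDB, the 32-bit key space and all function names are assumptions, and hashing the ID in the application is just one possible way to spread keys, not necessarily what was done in the talk.

```python
import bisect
import hashlib
from collections import Counter


def presplit_points(num_chunks, key_space=2 ** 32):
    """Evenly spaced split points dividing [0, key_space) into chunks."""
    step = key_space // num_chunks
    return [step * i for i in range(1, num_chunks)]


def chunk_for(key, split_points):
    """Index of the range-based chunk that owns an integer key."""
    return bisect.bisect_right(split_points, key)


def hashed_key(record_id):
    """Map an ID to a 32-bit integer using the first 4 bytes of its MD5."""
    digest = hashlib.md5(str(record_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def distribution(keys, split_points):
    """Count how many keys land in each chunk."""
    return Counter(chunk_for(k, split_points) for k in keys)


if __name__ == "__main__":
    points = presplit_points(4)
    ids = range(1, 10001)
    # Sequential IDs all fall below the first split point.
    print("sequential:", sorted(distribution(ids, points).items()))
    # Hashed IDs spread across all four chunks.
    print("hashed:    ", sorted(distribution(map(hashed_key, ids), points).items()))
```

With sequential IDs every document lands in the first chunk, so a bulk load from Hadoop would hit a single shard. With the hashed key the same documents are spread over all four pre-created chunks, so the write load is shared from the start.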




Mongo Query – returns 90 rows from a database of 9 million in 44 ms




Q&A

Follow-up Information

• Email: robert.vandehey@rovicorp.com
• LinkedIn: http://www.linkedin.com/in/bvandehey
• Twitter: @bvandehey
• Rovi Cloud Services: http://developer.rovicorp.com/




Thank You


Editor's notes

  1. This is the new Data Load Process. It makes it look easy…
  2. …The reality is that it is quite complex. This is just one of our workflows. The orange/tan-ish boxes are Java map/reduce processes. The pink boxes are Pig processes. The white boxes are BCP processes. The green boxes are MongoDB collections.
  3. Here is our sharding scheme. We actually have 6 more servers than are shown because we decided to have multiple replicas at each remote site.