
Innovation in the Enterprise Rent-A-Car Data Warehouse


Big Data adoption is a journey. Depending on the business, the process can take weeks, months, or even years. With any transformative technology, the challenges have less to do with the technology and more to do with how a company adapts itself to a new way of thinking about data. Building a Center of Excellence is one way for IT to help drive success.

This talk will explore Enterprise Holdings Inc. (which operates the Enterprise Rent-A-Car, National Car Rental, and Alamo Rent A Car brands) and its experience with Big Data. EHI’s journey started in 2013 with Hadoop as a POC, and today the company is working to create the next-generation data warehouse in Microsoft’s Azure cloud using a lambda architecture.

We’ll discuss the Center of Excellence and the roles in the new world, share the things that worked well, and rant about those that didn’t.

No deep Hadoop knowledge is necessary; the talk is aimed at the architect or executive level.


Innovation in the Enterprise Rent-A-Car Data Warehouse

  1. Innovation in the Data Warehouse. Kit Menke, Software Architect, and Scott Shaw, Sr. Solutions Engineer. DataWorks / Hadoop Summit, July 2017.
  2. Agenda
     1. Moving to Lambda
     2. Forming a Center of Excellence
     3. Pain Points
  3. Enterprise Holdings, Inc.
     Our Business
     • 9,600 locations
     • 90 countries
     • 97 thousand employees
     • 1.9 million vehicles
     Our Data Warehouse
     • Streaming and batch data feeds from over 50 internal systems and external sources
     • 100+ databases and 22+ thousand tables
     • Around 1 billion queries executed per month
     • 5+ million report executions every month
     • Statistical modeling and advanced analytics: 40+ projects implemented for predictive and diagnostic analytics
  4. Data Warehouse - Present
  5. Data Warehouse Growth (chart: space usage in terabytes versus max usable and total disk space)
  6. A Next Generation Architecture
     Our considerations
     • Design for streaming
     • Scalability & isolation
     • Right tool for the right job!
     Decision points
     1. Use cases and data gravity
     2. Workload
     3. Cloud vs on premises
  7. Use cases and data gravity
     • Where are your data sources?
     • Network
     • Consumers
  8. Workload
  9. Cloud vs on premises
  10. Other considerations
      Cloud
      • Different design considerations
      • Disaster recovery
      • Security
  11. Lambda
  12. Serving Layers
  13. Serving Layers
  14. Future Architecture
  15. Implementing an Architecture
      • Who?
      • “Tip of the spear”
      • Learn quickly and adapt
      • Out of the day-to-day support of current systems
  16. Journey to a Data-Driven Org via Center of Excellence (diagram: four maturity stages, from Awareness & Interest (1–2 months) and Evaluation (2–6 months) through Point Deployment/Production (9–15 months) to Enterprise Deployment/Production (18–36 months), with business value growing from potential to operational to strategic to data-driven; Hortonworks services support tapers from a COE deployed with HWX Services to a self-sustaining COE with on-demand HWX services)
  17. Hadoop Resources. 1. People: training and knowledge transfer for existing staff (internal or vendor provided)
      • Data Architect + Architect Training → Hadoop Architect
      • Business Analyst + Hive Training → Hadoop Analyst
      • Developer (Java/scripting, Python, Ruby, etc.) + Developer Training → Hadoop Developer
      • Linux/Windows Administrator + Admin Training → Cluster Ops & Admin
  18. Learning & Certification Paths (diagram: Hortonworks courses and certifications mapped to roles such as Data Scientist, Data Analyst, Java Developer, Hadoop Application Developer, Administrator, HDF Developer, and BI end-users/managers/executives; for example, Hadoop Admin I/II (4d each) plus Security (3d) toward the Hortonworks Certified Administrator, Enterprise Spark I (4d) plus Spark Data Science (4d) toward the Hortonworks Certified Spark Developer, Apache Pig & Hive (4 days) toward the Hortonworks Certified Professional Developer, and an online reference library for self-service)
  19. Three P’s Are Central to Your Center of Excellence
      1. People: empower teams and individuals through successes and failures
      • Identify and build individuals with potential
      • Design and implement career paths and other solutions to drive individual growth
      • Establish a strong internal and external community presence
      • Develop training curriculum focused on forming leaders
      2. Process: leverage processes and Configuration Items to enable excellence
      • Establish foundational mission and directional artifacts
      • Design and implement project methodologies and core competencies aligned with business strategy
      • Establish knowledge sharing and collaboration repositories
      3. Platform: leverage the right technology to address your needs
      • Establish architecture and design principles focused on leveraging technology to address business needs
      • Deliver technical solutions capable of supporting innovation in a governed environment
      • Develop enterprise integration solutions focused on scalability
      • Deploy systems and controls focused on improving quality
  20. Our Experience
      1. Skill sets
      2. Lessons Learned
      3. Governance
      4. Agile
  21. Skill sets
      Hiring
      • Problem: high demand, complicated ecosystem, low overall experience
      • Workaround: hire those who learn quickly and have skills that transfer easily
      Culture
      • Change is hard
      • Software Engineer vs Java Developer
  22. Lessons Learned
      Workload isolation
      • Multi-tenancy is possible but difficult
      Debugging / development is hard
      • Lots of moving pieces
      • Logs spread out across many machines
      • Development environments require a lot of software
      • Distributed systems just work differently
      Technologies
      • Like: Hive, HBase, Spark
      • Dislike: Oozie, Sqoop
  23. Governance
      Open source
      • Huge number of projects
      • Licenses
      Schema management
      • Multiple different databases
      • Evolution
      Data
      • Different data types, files, non-relational data
      • Certification process
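The schema-evolution challenge called out above can be sketched in a few lines. This is illustrative only: the record fields and defaults are hypothetical, and a real pipeline would typically lean on a format such as Avro, which carries explicit reader/writer schemas and default values to reconcile old data with new structure.

```python
# Illustrative sketch: tolerate schema evolution by filling defaults for
# fields added after older records were written (the approach Avro-style
# reader/writer schema resolution takes). Field names are hypothetical.

OLD_RECORD = {"rental_id": 1, "branch": "STL"}  # written under schema v1

# Schema v2 added vehicle_class; the default lets v1 records still be read.
NEW_SCHEMA_DEFAULTS = {
    "rental_id": None,
    "branch": None,
    "vehicle_class": "unknown",
}

def read_with_schema(record, defaults):
    """Project a record onto the newest schema, substituting defaults
    for any fields the original writer did not know about."""
    return {field: record.get(field, default)
            for field, default in defaults.items()}

print(read_with_schema(OLD_RECORD, NEW_SCHEMA_DEFAULTS))
# {'rental_id': 1, 'branch': 'STL', 'vehicle_class': 'unknown'}
```

The design point is that readers, not writers, own compatibility: old data never has to be rewritten when the schema grows, which matters when records are spread across 100+ databases.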
  24. Agile / Project Management
      • Planning
      • Dependencies
      • Deliverables
      • Aggressive timelines
      • A lot to learn, so fail fast!
  25. Conclusion
      1. Moving to Lambda
      2. Forming a Center of Excellence
      3. Our Experiences
      4. Enjoy the journey!

Editor's notes

  • Title:
    Innovation in the Enterprise Rent-A-Car Data Warehouse with a Center of Excellence
    40 min for presentation

    Thursday, June 15th at 2:10 PM
    https://dataworkssummit.com/san-jose-2017/sessions/innovation-in-the-data-warehouse/

  • Welcome everyone
    During intro, start thinking about how this architecture can benefit you and how you could use it

    Agenda with Timing:
    Background (5)
    Moving to Lambda + Decision Points for moving to cloud (10)
    Forming a Center of Excellence + Roles / Personas (10)
    Pain Points (10)
    Questions (5)
    Total = 40 minutes
  • Over 50 internal and external sources feed our current data warehouse including our operational applications.
    ETL is performed in the warehouse and makes up over 100 different databases.
    In turn, the warehouse is the source for nearly all of our reporting plus apps, feeds, and analytics.
    Use cases that are not a good fit for the EDW:
    Unstructured data
    Source structures changing frequently
    Data for exploration, discovery, & analytics
    Staging, transient, & history data
    Real-time
  • Scaling the current warehouse.
    We’ve run into CPU and disk space constraints a few times now.
    System Capacity - Space & CPU Constraints
  • Prepare audience for the next decision point slides

    Design for streaming first, support batch
    Scale components individually
    Isolation - protect critical processes
    Ingest structured and unstructured data
    Right tool for the right job
    Sustainable
    Automated and supportable



    We’ll walk through cloud vs on prem, physical vs virtual, and cluster workload.
    Our use cases center around data warehousing, reporting, and analytics.
  • Use cases should drive your architecture. We have a ton of batch processes today but are trying to get results faster which means more streaming.
    Data Gravity is the idea that data has weight: the more of it there is, the harder it is to move around.

    Data Gravity / integration points
    Ideally you want to be as close as possible to your data sources
    For internet data sources Cloud makes a lot of sense
    Consider network Bandwidth to/from cloud implementation

    Don’t forget your Consumers!
    Operational
    Highly optimized
    Super scalable
    Canned reports or API for integration
    Analytics
    Adhoc reporting
    Playground
    Low number of users, high expectations
  • Streaming
    Running 24/7
    Need dedicated resources
    Batch
    Scheduled
    Periods of high utilization (scalability)
    Multi-Tenancy
    Blended workloads
    YARN (queues, node labels)
    Think about Isolating nodes for real-time
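The YARN queue approach mentioned above can be sketched as a Capacity Scheduler configuration. The `batch`/`streaming` queue names and the 70/30 split are hypothetical; the property names are standard Capacity Scheduler settings, and capping `maximum-capacity` on the batch queue is what keeps batch jobs from borrowing the headroom the streaming workload depends on.

```xml
<!-- capacity-scheduler.xml: split the cluster into batch and streaming
     queues so blended workloads stay isolated (illustrative values). -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>batch,streaming</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.batch.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.streaming.capacity</name>
  <value>30</value>
</property>
<property>
  <!-- cap batch so it cannot consume the streaming queue's share -->
  <name>yarn.scheduler.capacity.root.batch.maximum-capacity</name>
  <value>70</value>
</property>
```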

  • Scalability
    Much easier to scale a Cloud solution UP and OUT
    Physical hardware requires an infrastructure team to manage
    In cloud, most of the components can scale individually.

    Cloud offerings
    Hadoop: Azure HDInsight, Amazon EMR, Google Cloud
    Integration with other PaaS services


  • Performance 
    Physical hardware will perform better, Hadoop is designed with physical hardware in mind 
    Separation of storage and compute 
    Maintenance 
    No hardware to maintain for virtual servers 
    Time to market 
    Virtual machines much faster to provision, can be fully automated 
    Physical hardware procurement is often a roadblock; in that case an appliance is a good option instead of commodity hardware 
    Development and test environments make more sense to virtualize 

    Disaster recovery
    Data is locally redundant
    Backups not usually required unless you need geo-redundancy
    Security - Many different things to secure!
    PaaS services vs IaaS vs SaaS
    Kerberos for user, service, and host authentication
    Authorization: Apache Ranger (Hortonworks) or Apache Sentry (Cloudera) or MapR Control System
    Network isolation for Hadoop services
    Data at rest (HDFS encryption, BLOB storage)
    Hadoop Distribution - Race to include the most Apache projects
    Top 3: Hortonworks, Cloudera, MapR
    Big companies with Hadoop offering

  • Lambda is a natural progression for us. It is not a new architecture; Enterprise is conservative, and a proven architecture won’t put us at risk.
    Attempts to combine batch and streaming to get benefits from both
    Batch layer is comprehensive and accurate
    Streaming layer is fast but might only be able to keep recent data
    Potentially have to maintain two codebases – avoid this by using Spark
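The batch/speed split described above can be sketched in a few lines. This is a minimal illustration of the lambda serving idea, not EHI's implementation; the branch codes and counts are made up. The batch layer publishes a complete but stale aggregate, the speed layer tracks only events that arrived after the last batch run, and queries merge the two views.

```python
# Minimal sketch of lambda-architecture serving (hypothetical data):
# batch_view is rebuilt periodically from all history; speed_view holds
# only events seen since that rebuild; a query merges both.

batch_view = {"STL": 120, "ORD": 95}   # rentals per branch, as of last batch run
speed_view = {"STL": 3, "DEN": 1}      # rentals streamed in since that run

def query(branch):
    """Serve a read by merging the batch and speed layers."""
    return batch_view.get(branch, 0) + speed_view.get(branch, 0)

assert query("STL") == 123  # batch total plus recent arrivals
assert query("DEN") == 1    # branch only the speed layer has seen so far
```

When the next batch run completes, its view absorbs those recent events and the speed view is reset, which is why the speed layer can stay small and fast.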
  • The same data sources will start flowing into our new lambda architecture in the cloud.
    The batch processing leg of lambda for ingesting files
  • Now that we’ve decided on an architecture need to implement it
  • Change is hard!
  • Hiring
    Problem: High demand, complicated ecosystem, low overall experience
    Workaround: hire those who learn quickly and have skills that transfer easily (ex: Java and Linux)

    Culture AKA “change is hard”
    One of our biggest challenges is that people are tied to a particular technology or way of doing things. We want to hire Software Engineers, not Java Developers. DBAs instead of Oracle/SQL Server/Teradata DBA (relational, nosql). The same concepts apply to multiple different technologies.
    If you’re in IT, you should expect to reinvent yourself
    If you’re like Richard Hendricks who only uses tabs don’t limit yourself. Be open to other possibilities … like maybe using spaces.

  • Workload isolation is hard
    Multi-tenancy is possible
    Takes work to make sure batch jobs don’t impact the real-time streaming processes. Lots of monitoring.
    Debugging / development is hard
    Lots of moving pieces
    Logs spread out across many machines
    Development environments require a lot of software
    Distributed systems just work differently
    Things we like: Hive (because our department heavily uses SQL), HBase (because it is bulletproof and can handle almost any key/value data), and Spark.
    Things we don't like: Oozie and Sqoop, because we end up spending too much time on setup, which slows us down.
  • Roadblocks – Find out that PaaS service you want to use (Azure Analysis Services) doesn’t support reading from HDInsight… and that Microsoft is “weeks” away
    Unable to plan more than a sprint in advance
    Planning with shifting priorities
    What are you delivering and when?
    Aggressive timelines
    Roadblocks – Fail fast
    A lot to learn
