Analytics Web Day | From Theory to Practice: Big Data Stories from the Field

Listen to this session for insights from two recent implementations of cloud-based Big Data clusters. The first solution targets DWH offloading and machine learning in the telecommunications industry. The session covers how we established the data transfer between on-premises servers and cloud services. We also talk about Spark jobs on an EMR cluster, Hive with the Glue catalog to query data stored in S3, quick analytics with Athena, hosting and testing Exasol on EC2 instances, and provisioning the cloud infrastructure with CloudFormation. Looking at an earlier phase in the AWS adoption lifecycle, we also talk about an insurance company finding its way into the AWS cloud. Its goal is to complement the existing enterprise DWH with more agile, data-science-oriented tools from the cloud, aiming at machine learning and artificial intelligence to support the claims workflow. This part covers security setup in IAM and connectivity configuration in EC2 and EMR, with S3 for storage.

Speakers: Roland Wammers, Matthias Diekstall, Manuel Marowski, Opitz Consulting Deutschland GmbH

Published in: Technology

  1. Big Data Stories from the Field: From Theory to Practice. Matthias Diekstall, Roland Wammers, Manuel Marowski. © OPITZ CONSULTING 2018. Information classification: Public. Überraschend mehr Möglichkeiten
  2. Agenda
       1. DWH Modernization with AWS BigData
       2. Advanced Analytics & Complex Event Processing at congstar
       3. Stream Analytics & Machine Learning with AWS OC Quickstarter
  3. Part 1: DWH Modernization with AWS BigData at an Insurance Company
       • Once upon a Time …
       • Defined Targets
       • Challenges
       • Our Proposal
       • Technical Implementation
       • … and they lived happily ever after
  4. Once upon a Time …
       • Mid-sized insurance company
       • 6000 Employees
       • 4 M Clients
       • 14 M Contracts
       • 3.2 B EUR in Revenues
       • Enterprise DWH established
       • Standard Reporting in place
       • Data Mining in a few departments
       • Using MS Excel mostly
       • Partially R desktop usage
  5. Defined Targets
       • Get a feeling for new technologies (Hadoop Ecosystem)
       • Learn their approach to data processing
       • Low investment
       • "Big Data Test Drive"
       • Increase flexibility for data sources
       • Enable self service for departments on a larger scale
  6. Challenges
       • No tangible use case initially
       • No decision regarding products/license model
       • No good grasp of the fundamental concepts of Big Data technologies
       • Few resources for driving this project
       • No hardware available (short-term)
       • Direct connectivity to source systems questionable
  7. Our Proposal
       • Quick start with a cloud-based solution
       • Start small and allow for growth
       • Allow a wide variety of technologies without having to dedicate resources to administration and operation
       • To be more precise:
       • Prepare environment for easy startup
       • Train/coach employees in essential aspects
       • Use AWS technologies
  8. Technical Implementation
       • AWS IAM for user management
       • AWS S3 for data storage
       • AWS EMR as the basis for data processing
       • Hive
       • Pig
       • Spark
       • Python
       • Zeppelin as graphical frontend
       • Augmented with R Studio
       • Mini Tutorials for users
  10. AWS Mini tutorials for users
  11. … and they lived happily ever after
       • Results
           • Targets achieved at minimal cost (< $500 in ~ 3 months)
           • Competency development
           • Better understanding of "how it works"
       • Lessons learned
           • Focus on as few tools as possible
           • Create simple step-by-step tutorials
           • Even a hypothetical use case is better than none
  12. Part 2: Advanced Analytics & Complex Event Processing at congstar
       • First Thoughts
       • Creating the Base
       • Working with the Data
       • First Steps to Advanced Analytics
  13. congstar GmbH
       • Subsidiary of Telekom Deutschland GmbH
       • Founded in July 2007
       • Sells mobile contracts and DSL
       • Over 4,500,000 customers
  14. Motivation
       • Better understanding of the user
       • Improve the user experience
       • Enhance existing systems
       • Being prepared for future requirements
       • Create new content in reasonable time
  15. Challenges
       • Building a big data system for advanced analytics and complex event processing in AWS
       • Finding the right technologies in Hadoop
       • Finding suitable AWS services
       • Keeping the costs low
       • Provisioning
       • Replacing old systems with new technology
       • Securing data transfer between on-prem and AWS
       • Living agile
  16. Infrastructure as code
       • Testing resources and services via AWS management console
       • Creating CloudFormation templates
       • Infrastructure as code
       • Create stacks for development, test and production system
       • Working with stacks
       • Adjustments made in the code
       • Diff of old and new code
       • Rollback function in case of error
       • Establishing a secure VPN connection
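
The "infrastructure as code" workflow on this slide (one template, one stack per environment, a diff between old and new code) can be sketched in plain Python. This is only an illustration under stated assumptions: the stack naming scheme `bigdata-<env>`, the logical resource name `DataLakeBucket`, and the single S3 bucket are hypothetical and not taken from the talk. The diff here is a local text diff of two template versions, not a CloudFormation change set.

```python
import difflib
import json

# Assumed environment names, following "development, test and production" on the slide.
ENVIRONMENTS = ("dev", "test", "prod")


def stack_name(environment):
    """Return a hypothetical stack name for one environment."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    return f"bigdata-{environment}"


def build_template(environment):
    """Return a minimal CloudFormation template (as a dict) with one S3 bucket."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Data lake storage ({environment})",
        "Resources": {
            # "DataLakeBucket" is an illustrative logical name.
            "DataLakeBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "Tags": [{"Key": "environment", "Value": environment}],
                },
            }
        },
    }


def template_diff(old, new):
    """Return a unified diff (list of lines) between two template versions."""
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    return list(difflib.unified_diff(old_lines, new_lines, "old", "new", lineterm=""))


if __name__ == "__main__":
    old = build_template("dev")
    new = build_template("dev")
    # An adjustment made in the code: enable bucket versioning.
    new["Resources"]["DataLakeBucket"]["Properties"]["VersioningConfiguration"] = {
        "Status": "Enabled"
    }
    print(stack_name("dev"))
    print("\n".join(template_diff(old, new)))
```

Keeping the template in code means every change is reviewable as a diff before a stack is updated, which is the point the slide makes about adjustments and rollback.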
  17. Overview of the basic Infrastructure
  18. Collecting and loading data into S3
       • Data transfer
       • Initial connection only established from the on-prem network
       • Need an on-prem solution to transfer data into S3
       • NiFi
       • Web UI
       • Schedule flows
       • No programming skills needed
       • Limited to the available processors
       • Format: CSV, Avro
  19. Process data
       • Using Spark (Scala)
       • Fast data processing
       • Needs implementation
       • Format: Parquet or Avro (saves space, time and money)
       • Organize the data
       • Layer
       • Partitions
       • Purpose
       • Source
       • …
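
One way to "organize the data" by layer, source and partitions is a fixed key layout in S3. The sketch below is an assumption for illustration: the order `layer/source/dataset`, the daily `year=/month=/day=` partitioning (the common Hive-style `key=value` convention), and all example names are hypothetical and not the layout used in the project.

```python
from datetime import date


def s3_key_prefix(layer, source, dataset, day):
    """Build a Hive-style partitioned key prefix for one day of one dataset.

    Layout (assumed): <layer>/<source>/<dataset>/year=YYYY/month=MM/day=DD/
    """
    for part in (layer, source, dataset):
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return (
        f"{layer}/{source}/{dataset}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


if __name__ == "__main__":
    # Hypothetical layer, source and dataset names.
    print(s3_key_prefix("raw", "crm", "contracts", date(2018, 5, 3)))
```

With `key=value` folder names, engines such as Hive and Spark can treat the folders as partition columns and skip data outside the requested date range.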
  20. Using spot instances
       • Data-backup capabilities
       • Set a max. bidding price you are willing to pay
       • Saves time and money
       • Cons:
       • You lose the instances when the spot price exceeds your max. price
       • 2 minutes to save your data
       • Hybrid model for Hadoop
       • Master and 1/3 of the workers on on-demand instances
       • Rest on spot instances
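
The hybrid model on this slide comes down to simple arithmetic: the master plus one third of the workers run on on-demand instances, the rest on spot instances. A minimal sketch follows; rounding the on-demand share up when the worker count is not divisible by three is an assumption, since the slide does not say how to round.

```python
import math


def split_workers(total_workers):
    """Return (on_demand, spot) worker counts for the hybrid model.

    One third of the workers (rounded up, by assumption) stay on on-demand
    instances; the remainder run on spot instances. The master node is not
    counted here and is always on-demand.
    """
    if total_workers < 0:
        raise ValueError("total_workers must not be negative")
    on_demand = math.ceil(total_workers / 3)
    return on_demand, total_workers - on_demand


if __name__ == "__main__":
    for workers in (3, 9, 10):
        on_demand, spot = split_workers(workers)
        print(f"{workers} workers: {on_demand} on-demand, {spot} spot")
```

Rounding up errs on the side of keeping more capacity alive if all spot instances are reclaimed at once.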
  21. Make data available with SQL
       • Create Glue catalog with a Glue crawler
       • Scans all sub folders of an S3 path
       • Tries to recognize the right format
       • Classifies according to the file type
       • Glue catalog
       • Used as Hive metastore on an EMR cluster
       • Used in Athena for ad hoc analytics
       • Not all classifiers are perfect
       • Manual adjustments of the crawler are required
       • Manual adjustments of the table definitions are required
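
Where a crawler misclassifies files, one manual adjustment is to declare the table by hand. The sketch below only builds the text of a Hive-style `CREATE EXTERNAL TABLE` statement of the kind Athena and Hive accept; it does not call any AWS API. The database, table, column names and the bucket `example-bucket` are hypothetical, and Parquet is assumed as the storage format.

```python
def external_table_ddl(database, table, columns, partitions, location):
    """Build a CREATE EXTERNAL TABLE statement for Parquet data in S3.

    columns and partitions are lists of (name, type) pairs.
    """
    if not columns:
        raise ValueError("at least one column is required")
    cols = ",\n  ".join(f"`{name}` {typ}" for name, typ in columns)
    lines = [
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (",
        f"  {cols}",
        ")",
    ]
    if partitions:
        parts = ", ".join(f"`{name}` {typ}" for name, typ in partitions)
        lines.append(f"PARTITIONED BY ({parts})")
    lines.append("STORED AS PARQUET")
    lines.append(f"LOCATION '{location}'")
    return "\n".join(lines)


if __name__ == "__main__":
    # All identifiers and the S3 path below are illustrative.
    print(
        external_table_ddl(
            "refined",
            "contracts",
            [("contract_id", "string"), ("amount", "double")],
            [("year", "int"), ("month", "int")],
            "s3://example-bucket/refined/crm/contracts/",
        )
    )
```

For a partitioned table, the partitions still have to be registered in the catalog before queries return rows.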
  22. Testing Exasol on AWS Marketplace
       • Starting Exasol on an EC2 instance
       • Using an EBS-backed instance
       • Testing various instances
       • Duplicating the instance for more freedom in testing
       • Testing different server types/sizes
       • Testing licensed software (AWS Marketplace) before buying an expensive license
  23. Amazon SageMaker
       • JupyterHub
       • Python-based API
       • Focusing on development, learning, testing and distributing ML models
       • Easy switching between several algorithms
  24. Outlook
       • Combine Exasol with ML models created by SageMaker
  25. Stream Analytics & Machine Learning with AWS OC Quickstarter
  26. Part 3: Stream Analytics & Machine Learning with AWS OC Quickstarter
       • Use case
       • DWH offloading
       • Architectural overview
       • The data flow
       • Industrial use case
  27. Use case: Twitter Stream Analytics (diagram: Twitter, Streaming Data, Machine Learning, sentiment analysis)
  28. DWH Offloading (diagram: Source, DWH with Integration Layer, Enterprise Layer, User View Layer)
  29. DWH offloading (diagram: Data, Integration Layer, Enterprise Layer, User View Layer, Offload, Refined, Data Lake, ETL)
  30. Advantages of DWH Offloading
       • Cost savings through outsourcing to low-cost storage space
       • Combining structured data with unstructured data
  31. Used technologies
       • Scala
       • Hive, Oozie, Kafka, Spark, Sqoop: stream processing, DWH offloading, scheduling
       • Spark.ML: sentiment analysis
       • AWS: infrastructure / Hadoop / HDFS / S3 / data lake
       • ELK Stack (Elasticsearch, Logstash, Kibana): visualization / indexed data access
  34. Industrial use cases
       • Predictive Maintenance
       • Real-time error detection in production processes
       • Dynamic evaluation of component quality
  35. Contact us!
       • Matthias Diekstall, Developer, +49 201 892994-1753, Matthias.Diekstall@opitz-consulting.com
       • Roland Wammers, Solution Architect, +49 201 892994-1757, Roland.Wammers@opitz-consulting.com
       • Manuel Marowski, Developer, +49 201 892994-1748, Manuel.Marowski@opitz-consulting.com
       • @OC_WIRE, OPITZCONSULTING, opitzconsulting, WWW.OPITZ-CONSULTING.COM
