Analytics Web Day | From Theory to Practice: Big Data Stories from the Field

© OPITZ CONSULTING 2018
Information classification: Public
 Surprisingly more possibilities
From Theory to Practice: Big Data Stories from the Field
Matthias Diekstall, Roland Wammers, Manuel Marowski

Agenda
1 DWH Modernization with AWS BigData
2 Advanced Analytics & Complex Event Processing at congstar
3 Stream Analytics & Machine Learning with AWS OC Quickstarter

1 DWH Modernization with AWS BigData at an Insurance Company
 Once upon a Time …
 Defined Targets
 Challenges
 Our Proposal
 Technical Implementation
 … and they lived happily ever after

Once upon a Time …
 Mid-sized insurance company
 6000 Employees
 4 M Clients
 14 M Contracts
 3.2 B EUR in Revenues
 Enterprise DWH established
 Standard Reporting in place
 Data Mining in a few departments
 Using MS Excel mostly
 Partially R desktop usage

Defined Targets
 Get a feeling for new technologies (Hadoop Ecosystem)
 Learn their approach to data processing
 Low investment
 "Big Data Test Drive"
 Increase flexibility for data sources
 Enable self service for departments on a larger scale

Challenges
 No tangible use case initially
 No decision regarding products/license model
 No good grasp of the fundamental concepts of Big Data technologies
 Few resources for driving this project
 No hardware available (short-term)
 Direct connectivity to source systems questionable

Our Proposal
 Quick start with a cloud-based solution
 Start small and allow for growth
 Allow a wide variety of technologies without having to dedicate resources to administration and operation
 To be more precise:
 Prepare environment for easy startup
 Train/coach employees in essential aspects
 Use AWS technologies

Technical Implementation
 AWS IAM for user management
 AWS S3 for data storage
 AWS EMR as the basis for data processing
 Hive
 Pig
 Spark
 Python
 Zeppelin as graphical frontend
 Augmented with RStudio
 Mini Tutorials for users
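
One way to picture how IAM user management and S3 storage fit together: each department gets read access to its own prefix in a shared bucket. The following is only a sketch under assumptions. The bucket name `bigdata-testdrive`, the department name `claims` and the prefix-per-department layout are hypothetical and not taken from the talk; only the policy document format and the `s3:ListBucket` / `s3:GetObject` actions are standard IAM.

```python
import json


def department_read_policy(bucket: str, department: str) -> dict:
    """Build an IAM policy document granting read-only access to one S3 prefix.

    Bucket name and prefix layout are hypothetical, for illustration only.
    """
    prefix = f"{department}/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow listing only the keys under the department's prefix.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}*"}},
            },
            {
                # Allow reading the objects stored under that prefix.
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


policy = department_read_policy("bigdata-testdrive", "claims")
print(json.dumps(policy, indent=2))
```

The printed JSON could be attached to an IAM group per department, so that adding a user to the group is the only administrative step.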

AWS Mini tutorials for users

… and they lived happily ever after
 Results
 Targets achieved at minimal cost (< $500 in ~ 3 months)
 Competency development
 Better understanding of "how it works"
 Lessons learned
 Focus on as few tools as possible
 Create simple step-by-step tutorials
 Even a hypothetical use case is better than none

2 Advanced Analytics & Complex Event Processing at congstar
 First Thoughts
 Creating the Base
 Working with the Data
 First Steps to Advanced Analytics

congstar GmbH
 Subsidiary of Telekom Deutschland GmbH
 Founded in July 2007
 Sells mobile contracts and DSL
 Over 4,500,000 customers

Motivation
 Better understanding of the user
 Improve the user experience
 Enhance existing systems
 Being prepared for future requirements
 Create new content in reasonable time

Challenges
 Building a big data system for advanced analytics and complex event processing in AWS
 Finding the right technologies in the Hadoop ecosystem
 Finding suitable AWS services
 Keeping the costs low
 Provisioning
 Replacing old systems with new technology
 Secure data transfer between on-prem and AWS
 Living agile

Infrastructure as code
 Testing resources and services via AWS management console
 Creating CloudFormation templates
 Infrastructure as code
 Create stacks for development, test and production system
 Working with stacks
 Adjustments made in the code
 Diff of old and new code
 Rollback function in case of error
 Establishing a secure VPN connection
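
The idea of one template for several stacks, plus a diff between the old and the new code, can be sketched as follows. This is an illustration under assumptions, not the templates used at congstar: the logical ID `DataBucket` and the bucket naming scheme are hypothetical, and the diff is computed locally with Python's `difflib` rather than with any AWS tooling.

```python
import difflib
import json


def build_template(environment: str, versioning: bool) -> dict:
    """Return a minimal CloudFormation template containing one S3 bucket.

    The logical ID 'DataBucket' and the bucket name are hypothetical.
    """
    properties = {"BucketName": f"example-datalake-{environment}"}
    if versioning:
        properties["VersioningConfiguration"] = {"Status": "Enabled"}
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Data lake bucket for the {environment} stack",
        "Resources": {
            "DataBucket": {"Type": "AWS::S3::Bucket", "Properties": properties}
        },
    }


def template_diff(old: dict, new: dict) -> list:
    """Return a unified diff between two templates rendered as sorted JSON."""
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    return list(
        difflib.unified_diff(old_lines, new_lines, "old", "new", lineterm="")
    )


# One template definition, three stacks: development, test and production.
templates = {
    env: build_template(env, versioning=False) for env in ("dev", "test", "prod")
}

# An adjustment made in the code: enable versioning on the dev bucket.
changed = build_template("dev", versioning=True)
for line in template_diff(templates["dev"], changed):
    print(line)
```

Reviewing such a diff before updating a stack makes the change explicit; if the update fails, CloudFormation rolls the stack back to its previous state.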

Overview of the basic Infrastructure

Collecting and loading data into S3
 Data transfer
 Initial connection only established from the on-prem network
 Need an on-prem solution to transfer data into S3
 Apache NiFi
 Web UI
 Schedule flows
 No programming skills needed
 Limited to the available processors
 Format: CSV, Avro

Process data
 Using Spark (Scala)
 Fast data processing
 Needs implementation
 Format: Parquet or Avro – saves space, time and money
 Organize the data
 Layer
 Partitions
 Purpose
 Source
 …

Using spot instances
 Data-backup capabilities
 Set a max. bidding price you are willing to pay
 Saves time and money
 Cons:
 You lose the instances when the spot price exceeds your max. price
 2 minutes to save your data
 Hybrid model for Hadoop
 Master and 1/3 of the workers on on-demand instances
 Rest on spot instances
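
The hybrid model boils down to a small piece of arithmetic. A minimal sketch, assuming the on-demand share is rounded up (the talk does not say how fractions are handled; the function name is hypothetical):

```python
def split_workers(total_workers: int) -> tuple:
    """Split worker nodes into (on_demand, spot) counts.

    One third of the workers run on demand. Rounding up is an assumption,
    chosen so that a small cluster keeps at least one on-demand worker.
    """
    if total_workers < 0:
        raise ValueError("total_workers must not be negative")
    on_demand = (total_workers + 2) // 3  # integer ceiling of total / 3
    return on_demand, total_workers - on_demand


for n in (3, 10, 12):
    print(n, split_workers(n))
# prints:
# 3 (1, 2)
# 10 (4, 6)
# 12 (4, 8)
```

With this split, losing all spot instances slows a job down, but the master and the on-demand workers keep the cluster alive.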

Get data available with SQL
 Create Glue catalog with a Glue crawler
 Scans all subfolders of an S3 path
 Tries to recognize the right format
 Classifies according to the file type
 Glue catalog
 Used as Hive metastore on an EMR cluster
 Used in Athena for ad hoc analytics
 Not all classifiers are perfect
 Manual adjustments of the crawler are required
 Manual adjustments of the table definitions are required
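
Why automatic classification sometimes needs manual correction can be shown with a toy type-inference routine. This is explicitly not how Glue's classifiers work internally; it is a simplified stand-in written for this example, with made-up column names and sample rows, that shows a typical failure: a postal code column with leading zeros is guessed as a number.

```python
import csv
import io


def infer_type(values: list) -> str:
    """Guess a column type from sample values: bigint, double or string."""

    def all_parse(cast) -> bool:
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "bigint"
    if all_parse(float):
        return "double"
    return "string"


def infer_schema(csv_text: str) -> dict:
    """Infer a column -> type mapping from CSV text with a header row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {column: infer_type([row[column] for row in rows]) for column in rows[0]}


# Made-up sample data, for illustration only.
sample = "customer_id,zip_code,monthly_fee\n17,01067,9.99\n42,45127,14.99\n"
schema = infer_schema(sample)
print(schema)
# prints: {'customer_id': 'bigint', 'zip_code': 'bigint', 'monthly_fee': 'double'}

# Manual adjustment of the table definition: a zip code is an identifier,
# and storing it as a number would drop the leading zero.
schema["zip_code"] = "string"
```

The same kind of correction applies to a crawled table definition in the Glue catalog when the guessed column types do not match the meaning of the data.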

Testing Exasol on AWS Marketplace
 Starting Exasol on an EC2 instance
 Using an EBS-backed instance
 Testing various instances
 Duplicating the instance for more freedom in testing
 Testing different server types/sizes
 Testing licensed software (AWS Marketplace) before buying an expensive license

Amazon SageMaker
 JupyterHub
 Python-based API
 Focusing on developing, training, testing and distributing ML models
 Easy switching between several algorithms

Outlook
 Combine Exasol with ML models created by SageMaker

3 Stream Analytics & Machine Learning with AWS OC Quickstarter
 Use case
 DWH offloading
 Architectural overview
 The data flow
 Industrial use case

Use case: Twitter Stream Analytics
[Diagram labels: Twitter, Streaming Data, Machine Learning sentiment analysis]

DWH Offloading
[Diagram labels: Source, DWH with Integration Layer, Enterprise Layer and User View Layer]

DWH offloading
[Diagram labels: Data, ETL, Integration Layer, Enterprise Layer, User View Layer, Offload, Refined Data Lake]

Advantages of DWH Offloading
 Cost savings by moving data to low-cost storage
 Combining structured data with unstructured data

Used technologies
 Scala
 Hive, Oozie, Kafka, Spark, Sqoop
➢ Stream Processing
➢ DWH Offloading
➢ Scheduling
 Spark.ML
➢ Sentiment analysis
 AWS
➢ Infrastructure / Hadoop / HDFS / S3 / Data lake
 ELK Stack (Elasticsearch, Logstash, Kibana)
➢ Visualization / Indexed data access
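
To make the stream-plus-sentiment idea concrete, here is a deliberately simple stand-in. The talk names Kafka, Spark and Spark.ML for this job; the sketch below swaps in a plain Python generator and a word-list scorer instead, so it runs without a cluster. The word lists and the sample tweets are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator

# Hypothetical mini lexicon; a trained model would replace this.
POSITIVE = {"good", "great", "love", "fast"}
NEGATIVE = {"bad", "slow", "hate", "broken"}


def sentiment(text: str) -> str:
    """Label a text as positive, negative or neutral by counting lexicon hits."""
    words = [word.strip(".,!?#@").lower() for word in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def classify_stream(tweets: Iterable[str]) -> Iterator[tuple]:
    """Label tweets one at a time, as they would arrive from a stream."""
    for tweet in tweets:
        yield tweet, sentiment(tweet)


# Made-up sample tweets standing in for the Twitter stream.
tweets = [
    "Love the new tariff, great value!",
    "Network is slow and the app is broken",
    "Just switched providers today",
]
counts = Counter(label for _, label in classify_stream(tweets))
print(dict(counts))
# prints: {'positive': 1, 'negative': 1, 'neutral': 1}
```

In the architecture from the talk, the per-tweet labels would be indexed for visualization rather than printed.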

Industrial use cases
 Predictive Maintenance
 Real-time error detection in production processes
 Dynamic evaluation of component quality

@OC_WIRE | OPITZCONSULTING | opitzconsulting | WWW.OPITZ-CONSULTING.COM
Contact us!
Matthias Diekstall
Developer
+49 201 892994-1753
Matthias.Diekstall@opitz-consulting.com
Roland Wammers
Solution Architect
+49 201 892994-1757
Roland.Wammers@opitz-consulting.com
Manuel Marowski
Developer
+49 201 892994-1748
Manuel.Marowski@opitz-consulting.com
