Alexander Aldev - Co-founder and CTO of MammothDB, currently focused on the architecture of the distributed database engine. Notable past achievements include managing the launch of the first triple-play cable service in Bulgaria and designing DHL Global Forwarding's data warehouse architecture and its interfaces to legacy systems. Has lectured on Hadoop at AUBG and MTel.
"The future of Big Data tooling" will briefly review the architectural concepts of current Big Data tools like Hadoop and Spark. It will make the argument, from the perspective of both technology and economics, that the future of Big Data tools is in optimizing local storage and compute efficiency.
2. Alexander Aldev
About me
Translator | between business and IT
CTO and co-founder | MammothDB
17 years | various shades of analytics, DWH, BI
Nerd | making scaled data infrastructure practical
3. Spoiler Alert!
This talk
Can we predict the future?
How do Big Data tools work today?
How did they evolve?
… and some examples
What is their environment?
Yes, we can! It already happened.
7. Working Definition
Just what’s Big Data?
Datasets so large and/or complex that traditional data processing techniques are inadequate to handle them
Examples
Indexing 100PB of crawled web content
Providing online interactive analytics to 10 million clients
9. Today
The Big Data toolset?
… for analytics, this is mostly synonymous with Hadoop
10. Hadoop architecture
Cluster of Commodity Servers
Distributed File Store (HDFS)
Resource Management (YARN)
Distributed Compute (MapReduce)
Higher-level Apps: NoSQL Data Store (HBase), Data Flow (Pig), Query (Hive), Machine Learning (Mahout)
11. the DFS
Files are split into blocks, and each block is replicated across several data nodes (Data Node 1 through Data Node 10 in the diagram)
High throughput
Linear scalability
Fault tolerance through block replication
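A minimal Python sketch of the block-placement idea, not real HDFS code: the block size and replication factor are the stock HDFS defaults, and the round-robin placement is a simplification (real HDFS placement is rack-aware):

    import itertools

    BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size (128 MB)
    REPLICATION = 3                  # HDFS default replication factor

    def place_blocks(file_size, data_nodes):
        """Split a file into fixed-size blocks and give each block
        REPLICATION distinct homes (round-robin here; real HDFS
        placement is rack-aware)."""
        n_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
        nodes = itertools.cycle(data_nodes)
        return {b: [next(nodes) for _ in range(REPLICATION)]
                for b in range(n_blocks)}

    # A 1 GB file over a 10-node cluster: losing any single node
    # still leaves two copies of every block.
    print(place_blocks(1024**3, ["datanode-%d" % i for i in range(1, 11)]))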
12. classical workflow
One MapReduce job:
Input file on DFS → Split → Extract → Structure → Shuffle → Aggregate → Output file on DFS
(input is read from DFS, intermediate shuffle data is stored on the local FS, output is stored on DFS)
An analytical query:
a chain of M/R jobs, each writing its intermediate result back to DFS before the next job reads it:
Input on DFS → M/R Job → Intermediate on DFS → M/R Job → Intermediate on DFS → M/R Job → Output on DFS
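As a concrete example of a single M/R job, here is the classic word count as a pair of Hadoop Streaming scripts in Python, a minimal sketch with illustrative names: the mapper is the extract/structure step, the framework's shuffle groups by key, and the reducer aggregates.

    #!/usr/bin/env python
    # mapper.py - the extract/structure step: emit (word, 1) pairs
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py - the aggregate step: sum the counts per word
    # (the shuffle delivers mapper output sorted and grouped by key)
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

Run with the Hadoop Streaming jar, e.g. hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input in -output out (jar and paths are illustrative). Note that the output of every such job lands back on DFS, which is exactly the overhead Spark attacks next.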
15. Spark architecture
Cluster of Commodity Servers
Distributed File Store (HDFS)
Resource Management (YARN)
Distributed Compute (Spark)
Higher-level Apps: NoSQL Data Store (HBase), Data Flow (Scala), Query (Spark SQL), Machine Learning (MLlib)
16. Optimized Execution
what’s different?
Pipelines for batches of jobs
Memory caching of intermediate results
Programmability
Rich set of high-level data flow operations
Support for popular languages: Scala, Java, Python
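Both ideas, pipelining and caching, in a minimal PySpark sketch (the path and field layout are illustrative):

    from pyspark import SparkContext

    sc = SparkContext(appName="pipeline-demo")

    # Transformations are lazy and pipelined: nothing executes until an action.
    logs = sc.textFile("hdfs:///logs/access.log")      # illustrative path
    errors = logs.filter(lambda line: "ERROR" in line)

    # Cache the intermediate result in memory instead of writing it to DFS,
    # so the two jobs below scan the input file only once.
    errors.cache()

    print(errors.count())                              # first action runs the pipeline
    by_host = errors.map(lambda line: (line.split()[0], 1)) \
                    .reduceByKey(lambda a, b: a + b)   # second job reuses the cache
    print(by_host.take(10))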
17. Workflow
what’s the same?
Scan the file
Interpret data structure in user code
Perform analysis
Philosophy
Ingest and collect all data now
Analyze later
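This is what is often called schema-on-read: the structure lives in user code and is applied at query time. A tiny sketch, reusing the SparkContext sc from the previous example, with an assumed tab-separated layout:

    # Schema-on-read: the raw file carries no declared schema; each analysis
    # re-interprets the bytes. Here we assume tab-separated (user_id, amount).
    def parse(line):
        fields = line.split("\t")
        return fields[0], float(fields[1])

    rows = sc.textFile("hdfs:///ingest/raw_events.tsv").map(parse)
    print(rows.take(5))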
18. Hadoop Storage
other improvements
Columnar data formats
Compression
SQL-on-Hadoop
Friendlier interface for analysts and tools
Optimized implementation (Impala, PrestoDB)
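For instance, rewriting raw text as compressed columnar Parquet, a sketch with illustrative paths (SparkSession is the modern Spark entry point):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("storage-demo").getOrCreate()

    df = spark.read.csv("hdfs:///ingest/raw_events.csv",
                        header=True, inferSchema=True)

    # Columnar layout + compression: a query touching two columns out of
    # fifty reads only those columns' bytes, and Snappy keeps CPU cost low.
    df.write.parquet("hdfs:///warehouse/events", compression="snappy")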
19. Data Sources
now, an enterprise
A variety of systems covering departmental functions
Mostly structured and transactional
Loose alignment of business terms
Typical Challenges
Data quality
Data integration
Interactive analytics
Business audiences
Client self-service analytics
Significant volumes (10-100 TB range)
Leveraging investment in IT and training
BUDGET!!!
20. Scalable Storage and Computation
Big Data tools offer
Reliable and scalable storage for files
Reliable and scalable batch-mode computation
Not efficient at small scale
Unified Data Integration
The data is there
Its quality is up to the user
Its integration is up to the user, and is difficult and slow
“The user” is a small group of highly qualified data scientists
New programming interfaces
Mounting costs to acquire, extend and run
21. Top Uses in 2015 (Gartner)
Hadoop adoption
File storage
Basic analytics
Proof of concept
Next year: Advanced Analytics, DWH
Cluster Size
Average cluster size: 20 nodes
Median cluster size: 32 nodes
50% report under 10TB of storage
Top Reasons for Slow Adoption
Lack of adequate skills
No business case
22. Especially good at …
so Hadoop is …
Batch-processing of web-scale unstructured data on large, expensive infrastructures
But not that good at …
data integration and unification
concurrent use
interactive querying
accessibility to business users
23. Yeah, the mainframes of the old days…
sounds familiar…?
Batch-processed
Centralized
Users queuing for system access
CODASYL-style programming
What’s the future?
Scale out
Let the data management system manage the data
Optimized structured storage
Declarative syntax for business users (see the sketch after this list)
Interfacing data management and presentation tools
Data integration methodologies
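The point in a sketch: the same kind of aggregation written declaratively, so the engine rather than the user plans the execution (table and column names are assumed; spark is the SparkSession from the earlier sketch):

    # Declarative: the user states WHAT, the engine decides HOW
    # (scan strategy, partitioning, distributed aggregation).
    # Assumes a sales(region, revenue) table was registered, e.g.:
    # spark.read.parquet("hdfs:///warehouse/sales").createOrReplaceTempView("sales")
    result = spark.sql("""
        SELECT region, SUM(revenue) AS total_revenue
        FROM sales
        GROUP BY region
        ORDER BY total_revenue DESC
    """)
    result.show()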
24. scaled-out DBMS
Cluster of Commodity Servers
Distributed File Store
Resource Management
Distributed Execution & Aggregation
Distributed Database Engine: Declarative Query Language, Partitioned Storage and Querying (sketched below)
Higher-level Apps: Data Integration, Self-service BI, Advanced Analytics, Machine Learning
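A minimal sketch of the scatter-gather pattern behind partitioned storage and querying, with in-memory SQLite standing in for the per-node engines (all data and names are made up):

    import sqlite3

    # Each "node" holds one horizontal partition of a sales table.
    partitions = []
    for shard in ([("east", 100), ("west", 250)],
                  [("east", 50)],
                  [("west", 75), ("east", 25)]):
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE sales (region TEXT, revenue INT)")
        db.executemany("INSERT INTO sales VALUES (?, ?)", shard)
        partitions.append(db)

    # Scatter: push the partial aggregate down to every partition.
    partials = [db.execute("SELECT region, SUM(revenue) "
                           "FROM sales GROUP BY region").fetchall()
                for db in partitions]

    # Gather: merge the partial aggregates into the final answer.
    totals = {}
    for rows in partials:
        for region, subtotal in rows:
            totals[region] = totals.get(region, 0) + subtotal
    print(totals)  # {'east': 175, 'west': 325}

This works because SUM decomposes into partial sums; COUNT, MIN and MAX decompose the same way, which is what makes partitioned querying with a final aggregation step viable.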
25. MammothDB architecture
Cluster of Commodity Servers
Resource Management
Interactive Map/Reduce
SQL
Columnar RDBMS (per Node)
Partitioned Storage and Querying
Higher-level Apps: Data Integration, Self-service BI, Advanced Analytics, Machine Learning
26. Business Challenge
use case: logistics
Predict the cost of moving cargo between pairs of cities
Integrate into the ERP
Validate at country level, globally
Track historical accuracy
Outputs: 3 levels of service, 15,000 trade lanes, 4 charges
Solution
E-LT from the client DWH into MammothDB; a prediction model; an MS SSAS ROLAP cube; the MammothDB web portal; a rate calculator; an SAP extract generator
27. Business Challenge
use case: media planning
Track campaigns across different media
Integrate online feeds
Store extended historical data
Load into downstream systems
Provide ad-hoc reporting
Sources: Google, Facebook, Gemius, …
Solution
E-LT to pull and consolidate the feeds into MammothDB; an MS SSAS ROLAP cube; the MammothDB web portal; an extract generator feeding QlikView downstream
Let’s take a quiz: how many letter Zs are in this soup?
Let’s do it like a Big Data tool: split the soup into chunks, count each chunk in parallel, and add up the partial counts.
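A toy version of that in Python (the soup string is made up):

    from multiprocessing import Pool

    soup = "AZBQZZMKZPWZXLZOZTZRZ" * 100000   # made-up alphabet soup

    def count_z(chunk):
        # "map": each worker counts its own spoonful
        return chunk.count("Z")

    if __name__ == "__main__":
        n = 4                                  # split the soup among 4 workers
        chunks = [soup[i::n] for i in range(n)]
        with Pool(n) as pool:
            partial = pool.map(count_z, chunks)
        print(sum(partial))                    # "reduce": add up the partial counts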
Although Big Data is typically associated with web scale, hundreds of petabytes, and a mix of structured and unstructured data, even a single terabyte is a challenge: copying a 1 TB external HDD takes about two hours.