Alexander Aldev - Co-founder and CTO of MammothDB, currently focused on the architecture of the distributed database engine. Notable past achievements include managing the launch of the first triple-play cable service in Bulgaria and designing DHL Global Forwarding's data warehouse architecture and its interfaces to legacy systems. Has lectured on Hadoop at AUBG and MTel.
"The future of Big Data tooling" will briefly review the architectural concepts of current Big Data tools like Hadoop and Spark. It will make the argument, from the perspective of both technology and economics, that the future of Big Data tools is in optimizing local storage and compute efficiency.
2. Alexander Aldev
About me
Translator | between business and IT
CTO and co-founder | MammothDB
17 years | various shades of analytics, DWH, BI
Nerd | making scaled data infrastructure practical
3. Spoiler Alert!
This talk
Can we predict the future?
How do Big Data tools work today?
How did they evolve?
… and some examples
What is their environment?
Yes, we can! It already happened.
7. Working Definition
Just what’s Big Data?
Datasets so large and/or complex that traditional data processing techniques are inadequate to handle them
Examples
Indexing 100PB of crawled web content
Providing online interactive analytics to 10 million clients
9. Today
The Big Data toolset?
… for analytics, this is mostly synonymous with Hadoop
10. Hadoop architecture
Cluster of Commodity Servers
Distributed File Store (HDFS)
Resource Management (YARN)
Distributed Compute (MapReduce)
Higher-level Apps: NoSQL Data Store (HBase), Data Flow (Pig), Query (Hive), Machine Learning (Mahout)
11. the DFS
Files are split into blocks, and each block is replicated across several data nodes (Data Node 1 through Data Node 10 in the diagram)
High throughput
Linear scalability
Fault tolerance through block replication
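A minimal Python sketch of the block-placement idea, not real HDFS code: the block size and replication factor are the stock HDFS defaults, and the round-robin placement is a simplification (real HDFS placement is rack-aware):

    import itertools

    BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size (128 MB)
    REPLICATION = 3                  # HDFS default replication factor

    def place_blocks(file_size, data_nodes):
        """Split a file into fixed-size blocks and give each block
        REPLICATION distinct homes (round-robin here; real HDFS
        placement is rack-aware)."""
        n_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
        nodes = itertools.cycle(data_nodes)
        return {b: [next(nodes) for _ in range(REPLICATION)]
                for b in range(n_blocks)}

    # A 1 GB file over a 10-node cluster: losing any single node
    # still leaves two copies of every block.
    print(place_blocks(1024**3, ["datanode-%d" % i for i in range(1, 11)]))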
12. classical workflow
One MapReduce job:
Input file on DFS → Split → Extract → Structure → Shuffle → Aggregate → Output file on DFS
(input is read from DFS, intermediate shuffle data is stored on the local FS, output is stored on DFS)
An analytical query:
a chain of M/R jobs, each writing its intermediate result back to DFS before the next job reads it:
Input on DFS → M/R Job → Intermediate on DFS → M/R Job → Intermediate on DFS → M/R Job → Output on DFS
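As a concrete example of a single M/R job, here is the classic word count as a pair of Hadoop Streaming scripts in Python, a minimal sketch with illustrative names: the mapper is the extract/structure step, the framework's shuffle groups by key, and the reducer aggregates.

    #!/usr/bin/env python
    # mapper.py - the extract/structure step: emit (word, 1) pairs
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py - the aggregate step: sum the counts per word
    # (the shuffle delivers mapper output sorted and grouped by key)
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

Run with the Hadoop Streaming jar, e.g. hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input in -output out (jar and paths are illustrative). Note that the output of every such job lands back on DFS, which is exactly the overhead Spark attacks next.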
15. Spark architecture
Cluster of Commodity Servers
Distributed File Store (HDFS)
Resource Management (YARN)
Distributed Compute (Spark)
Higher-level Apps: NoSQL Data Store (HBase), Data Flow (Scala), Query (Spark SQL), Machine Learning (MLlib)
16. Optimized Execution
what’s different?
Pipelines for batches of jobs
Memory caching of intermediate results
Programmability
Rich set of high-level data flow operations
Support for popular languages: Scala, Java, Python
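Both ideas, pipelining and caching, in a minimal PySpark sketch (the path and field layout are illustrative):

    from pyspark import SparkContext

    sc = SparkContext(appName="pipeline-demo")

    # Transformations are lazy and pipelined: nothing executes until an action.
    logs = sc.textFile("hdfs:///logs/access.log")      # illustrative path
    errors = logs.filter(lambda line: "ERROR" in line)

    # Cache the intermediate result in memory instead of writing it to DFS,
    # so the two jobs below scan the input file only once.
    errors.cache()

    print(errors.count())                              # first action runs the pipeline
    by_host = errors.map(lambda line: (line.split()[0], 1)) \
                    .reduceByKey(lambda a, b: a + b)   # second job reuses the cache
    print(by_host.take(10))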
17. Workflow
what’s the same?
Scan the file
Interpret data structure in user code
Perform analysis
Philosophy
Ingest and collect all data now
Analyze later
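This is what is often called schema-on-read: the structure lives in user code and is applied at query time. A tiny sketch, reusing the SparkContext sc from the previous example, with an assumed tab-separated layout:

    # Schema-on-read: the raw file carries no declared schema; each analysis
    # re-interprets the bytes. Here we assume tab-separated (user_id, amount).
    def parse(line):
        fields = line.split("\t")
        return fields[0], float(fields[1])

    rows = sc.textFile("hdfs:///ingest/raw_events.tsv").map(parse)
    print(rows.take(5))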
18. Hadoop Storage
other improvements
Columnar data formats
Compression
SQL-on-Hadoop
Friendlier interface for analysts and tools
Optimized implementation (Impala, PrestoDB)
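For instance, rewriting raw text as compressed columnar Parquet, a sketch with illustrative paths (SparkSession is the modern Spark entry point):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("storage-demo").getOrCreate()

    df = spark.read.csv("hdfs:///ingest/raw_events.csv",
                        header=True, inferSchema=True)

    # Columnar layout + compression: a query touching two columns out of
    # fifty reads only those columns' bytes, and Snappy keeps CPU cost low.
    df.write.parquet("hdfs:///warehouse/events", compression="snappy")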
19. Data Sources
now, an enterprise
A variety of systems covering departmental functions
Mostly structured and transactional
Loose alignment of business terms
Typical Challenges
Data quality
Data integration
Interactive analytics
Business audiences
Client self-service analytics
Significant volumes (10-100 TB range)
Leveraging investment in IT and training
BUDGET!!!
20. Scalable Storage and Computation
Big Data tools offer
Reliable and scalable storage for files
Reliable and scalable batch-mode computation
Not efficient at small scale
Unified Data Integration
The data is there
Its quality is up to the user
Its integration is up to the user, and is difficult and slow
“The user” is a small group of highly qualified data scientists
New programming interfaces
Mounting costs to acquire, extend and run
21. Top Uses in 2015 (Gartner)
Hadoop adoption
File storage
Basic analytics
Proof of concept
Next year: Advanced Analytics, DWH
Cluster Size
Average cluster size: 20 nodes
Median cluster size: 32 nodes
50% report under 10TB of storage
Top Reasons for Slow Adoption
Lack of adequate skills
No business case
22. Especially good at …
so Hadoop is …
Batch-processing of web-scale unstructured data on large, expensive infrastructures
But not that good at …
data integration and unification
concurrent use
interactive querying
accessibility to business users
23. Yeah, the mainframes of the old days…
sounds familiar…?
Batch-processed
Centralized
Users queuing for system access
CODASYL-style programming
What’s the future?
Scale out
Let the data management system manage the data
Optimized structured storage
Declarative syntax for business users (see the sketch after this list)
Interfacing data management and presentation tools
Data integration methodologies
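The point in a sketch: the same kind of aggregation written declaratively, so the engine rather than the user plans the execution (table and column names are assumed; spark is the SparkSession from the earlier sketch):

    # Declarative: the user states WHAT, the engine decides HOW
    # (scan strategy, partitioning, distributed aggregation).
    # Assumes a sales(region, revenue) table was registered, e.g.:
    # spark.read.parquet("hdfs:///warehouse/sales").createOrReplaceTempView("sales")
    result = spark.sql("""
        SELECT region, SUM(revenue) AS total_revenue
        FROM sales
        GROUP BY region
        ORDER BY total_revenue DESC
    """)
    result.show()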
24. scaled-out DBMS
Cluster of Commodity Servers
Distributed File Store
Resource Management
Distributed Execution & Aggregation
Distributed Database Engine: Declarative Query Language, Partitioned Storage and Querying (sketched below)
Higher-level Apps: Data Integration, Self-service BI, Advanced Analytics, Machine Learning
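A minimal sketch of the scatter-gather pattern behind partitioned storage and querying, with in-memory SQLite standing in for the per-node engines (all data and names are made up):

    import sqlite3

    # Each "node" holds one horizontal partition of a sales table.
    partitions = []
    for shard in ([("east", 100), ("west", 250)],
                  [("east", 50)],
                  [("west", 75), ("east", 25)]):
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE sales (region TEXT, revenue INT)")
        db.executemany("INSERT INTO sales VALUES (?, ?)", shard)
        partitions.append(db)

    # Scatter: push the partial aggregate down to every partition.
    partials = [db.execute("SELECT region, SUM(revenue) "
                           "FROM sales GROUP BY region").fetchall()
                for db in partitions]

    # Gather: merge the partial aggregates into the final answer.
    totals = {}
    for rows in partials:
        for region, subtotal in rows:
            totals[region] = totals.get(region, 0) + subtotal
    print(totals)  # {'east': 175, 'west': 325}

This works because SUM decomposes into partial sums; COUNT, MIN and MAX decompose the same way, which is what makes partitioned querying with a final aggregation step viable.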
25. MammothDB architecture
Cluster of Commodity Servers
Resource Management
Interactive Map/Reduce
SQL
Columnar RDBMS (per Node)
Partitioned Storage and Querying
Higher-level Apps: Data Integration, Self-service BI, Advanced Analytics, Machine Learning
26. Business Challenge
use case: logistics
Predict the cost of moving cargo between pairs of cities
Integrate into the ERP
Validate at country level, globally
Track historical accuracy
Outputs: 3 levels of service, 15,000 trade lanes, 4 charges
Solution
E-LT from the client DWH into MammothDB; a prediction model; an MS SSAS ROLAP cube; the MammothDB web portal; a rate calculator; an SAP extract generator
27. Business Challenge
use case: media planning
Track campaigns across different media
Integrate online feeds
Store extended historical data
Load into downstream systems
Provide ad-hoc reporting
Sources: Google, Facebook, Gemius, …
Solution
E-LT to pull and consolidate the feeds into MammothDB; an MS SSAS ROLAP cube; the MammothDB web portal; an extract generator feeding QlikView downstream
Let’s take a quiz: how many letter Zs are in this soup?
Let’s do it like a Big Data tool: split the soup into chunks, count each chunk in parallel, and add up the partial counts.
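A toy version of that in Python (the soup string is made up):

    from multiprocessing import Pool

    soup = "AZBQZZMKZPWZXLZOZTZRZ" * 100000   # made-up alphabet soup

    def count_z(chunk):
        # "map": each worker counts its own spoonful
        return chunk.count("Z")

    if __name__ == "__main__":
        n = 4                                  # split the soup among 4 workers
        chunks = [soup[i::n] for i in range(n)]
        with Pool(n) as pool:
            partial = pool.map(count_z, chunks)
        print(sum(partial))                    # "reduce": add up the partial counts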
Although Big Data is typically associated with web scale, hundreds of petabytes, and a mix of structured and unstructured data, even a single terabyte is a challenge: copying a 1 TB external HDD takes about two hours.