Big Data Analysis
in Java World
by Serhiy Masyutin
Agenda
- The Big Data Problem
- Map-Reduce
- MPP Analytical Database
- In-Memory Data Fabric
- Real-Life Project
- Q&A
The Big Data Problem
http://www.datameer.com/images/product/big_data_hadoop/img_bigdata.png
- Doug Laney
The Big Data Problem
Question | Map-Reduce | MPP AD | IMDF
When do I need it? | In an hour | In a minute | Now
What do I need to do with it? | Exploratory analytics | Structured analytics | Singular event processing (some analytics), Transactions
How will I query and search? | Unstructured | Ad hoc SQL | Structured
How do I need to store it? | I do, but not required to | I must and I am required to | Temporarily
Where is it coming from? | File/ETL | File/ETL | Event/Stream/File/ETL
http://blog.pivotal.io/pivotal/products/exploring-big-data-solutions-when-to-use-hadoop-vs-in-memory-vs-mpp
The Big Data Problem
Map-Reduce | MPP AD | IMDF
Data types, from least to most pre-processing required:
Transactions, Customer records, Geo-spatial, Sensors, Social Media, XML/JSON, Raw Logs, Text, Image, Video
http://blog.pivotal.io/big-data-pivotal/products/exploratory-data-science-when-to-use-an-mpp-database-sql-on-hadoop-or-map-reduce
The Big Data Problem
Data is not Information
- Clifford Stoll
Map-Reduce
http://jeremykun.files.wordpress.com/2014/10/mapreduceimage.gif?w=1800
Map-Reduce
https://anonymousbi.files.wordpress.com/2012/11/hadoopdiagram.png
Map-Reduce
http://hadoop.apache.org/docs/r1.2.1/images/hdfsarchitecture.gif
Map-Reduce
https://anonymousbi.files.wordpress.com/2012/11/hadoopdiagram.png
Map-Reduce
Volume: Medium-Large
Variety: Unstructured data
Velocity: Batch processing
MPP Analytical Database
http://www.ndm.net/datawarehouse/images/stories/greenplum/gp-dia-3-0.png
MPP Analytical Database
http://my.vertica.com/docs/7.1.x/HTML/Content/Resources/Images/K-SafetyServerDiagram.png
MPP Analytical Database
http://my.vertica.com/docs/7.1.x/HTML/Content/Resources/Images/K-SafetyServerDiagramOneNodeDown.png
MPP Analytical Database
http://my.vertica.com/docs/7.1.x/HTML/Content/Resources/Images/K-SafetyServerDiagramTwoNodesDown.png
MPP Analytical Database
http://my.vertica.com/docs/7.1.x/HTML/Content/Resources/Images/DataK-Safety-K2Nodes2And3Failed.png
MPP Analytical Database
JDBC
http://www.ndm.net/datawarehouse/images/stories/greenplum/gp-dia-3-0.png
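MPP analytical databases sit behind a standard JDBC endpoint, so from the Java side a distributed query looks like ordinary JDBC; the parallelism stays inside the cluster. Below is a minimal sketch, assuming a Vertica cluster reachable at vertica-node1:5433 and a hypothetical sensor_readings table; the URL, credentials and schema are illustrative, not part of the original deck.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MppQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; Vertica, Greenplum and similar MPP
        // databases all expose a JDBC endpoint, so only the driver and URL
        // prefix differ between them.
        String url = "jdbc:vertica://vertica-node1:5433/analytics";
        try (Connection conn = DriverManager.getConnection(url, "dbadmin", "secret");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT device_id, AVG(value) AS avg_value "
                   + "FROM sensor_readings "
                   + "WHERE recorded_at >= ? GROUP BY device_id")) {
            stmt.setTimestamp(1, java.sql.Timestamp.valueOf("2015-01-01 00:00:00"));
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    // Only the aggregated result travels back to the client.
                    System.out.printf("%s -> %.2f%n", rs.getString(1), rs.getDouble(2));
                }
            }
        }
    }
}
```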
MPP Analytical Database
Volume: Small-Medium-Large
Variety: Structured data
Velocity: Interactive
Aster Database, Matrix
In-Memory Data Fabric
https://ignite.incubator.apache.org/images/in_memory_data.png
In-Memory Data Fabric
https://ignite.incubator.apache.org/images/in_memory_data.png
In-Memory Data Fabric
https://ignite.incubator.apache.org/images/in_memory_compute.png
In-Memory Data Fabric
http://hazelcast.com/wp-content/uploads/2013/12/IMDGEmbeddedMode_w1000px.png
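The Hazelcast diagram above shows embedded mode, where every application JVM is also a data-grid member holding part of the partitioned data. A minimal sketch of that idea, assuming Hazelcast 3.x on the classpath; the map name and sensor value are made up for illustration:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class EmbeddedDataGridExample {
    public static void main(String[] args) {
        // Each application node that starts an embedded instance joins the
        // cluster and holds a share of the partitioned data in its own heap.
        HazelcastInstance node = Hazelcast.newHazelcastInstance();

        // A distributed map: entries are partitioned (and backed up) across
        // cluster members, so reads and writes stay in memory.
        IMap<String, Double> latestReadings = node.getMap("latest-readings");
        latestReadings.put("device-42", 21.7);               // hypothetical sensor value
        System.out.println(latestReadings.get("device-42")); // served from memory

        node.shutdown();
    }
}
```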
In-Memory Data Fabric
Volume: Small-Medium
Variety: Structured data
Velocity: (Near) Real-Time
Real-Life Project
- Sensor data
- The number of devices currently doubles every year
- Data flow: ~200 GB/month
- Target data flow: ~500 GB/month
Real-Life Project
Requirements
When do I need it? In a minute
What do I need to do with it? Structured analytics
How will I query and search? Ad hoc SQL
How do I need to store it? I must and I am required to
Where is it coming from? XML
Real-Life Project
- Time-series data
- RESTful API
- Extendable analytics
- Scalability
- Speed to market
Real-Life Project
Real-Life Project
[Architecture diagram spanning Availability Zones A, B and C. Components: Devices, Clients, 3rd Party Services, Collector, Client API, Processor, Raw message store, Recent data store, Permanent data store, Analytic Engine, Analytic Executor Pool, Analytics API, UI.]
Real-Life Project
[The same architecture diagram, repeated.]
Real-Life Project
Real-Life Project
- Vertica stores time-series data only
- Append-only data store
- Store organizational data separately
- Use Vertica's ExternalFilter for data load (see the JDBC sketch below)
- R analytics as UDFs on Vertica
- Scale the Vertica cluster accordingly
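A rough idea of what the load-and-query path looks like from Java, assuming a Vertica JDBC connection and a hypothetical append-only readings table; in the real project the load is routed through the ExternalFilter UDFilter to unpack the incoming XML, which is omitted here:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TimeSeriesLoadExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:vertica://vertica-node1:5433/analytics"; // hypothetical cluster
        try (Connection conn = DriverManager.getConnection(url, "dbadmin", "secret");
             Statement stmt = conn.createStatement()) {

            // Bulk-append a batch of time-series rows to the append-only table.
            // A plain COPY FROM LOCAL keeps the sketch simple; the project adds
            // an ExternalFilter step in front of it.
            stmt.execute("COPY readings (device_id, recorded_at, value) "
                       + "FROM LOCAL '/tmp/batch-2015-05.csv' DELIMITER ','");

            // Analytics (including R UDxs registered on the cluster) run where
            // the data lives; only aggregated results come back over JDBC.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT device_id, COUNT(*) FROM readings GROUP BY device_id")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + ": " + rs.getLong(2));
                }
            }
        }
    }
}
```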
Real-Life Project
- Choose the right tool for the job; late changes are expensive
- You can do everything yourself. Should you?
Q&A
Editor's Notes
  1. Introduction Hello guys and girls, my name is Serhiy Masyutin. I have more than 14 years of professional experience in different branches of IT and have learned tons of languages and tools. Every day I simply enjoy my job, and yes, I like when things are done nicely: robust, useful and on time. I started with desktop applications in C++ and Delphi (you remember it?), then moved to C++ and Java telecommunication projects, did an eCommerce project in PHP and cross-platform mobile applications in JavaScript. Currently my project is in Java and it falls into the Big Data category. So today I am here to give you an overview of what Big Data is and how one can approach the problems it brings.
  2. TBD: how each approach fits the CAP theorem (to be added to the slides).
  3. What is big data? Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include capture, transfer, storage, sharing, search, analysis, visualization, and information privacy. The term often refers simply to the use of predictive analytics or other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can mean greater operational efficiency, cost reductions and reduced risk.

In 2001 Doug Laney, an analyst at Gartner, defined data growth challenges and opportunities as being three-dimensional: increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner, and now much of the industry, continue to use this "3Vs" model for describing big data. Additionally, some organizations add a fourth V, "Veracity", to describe it.

While Gartner's definition (the 3Vs) is still widely used, the growing maturity of the concept fosters a sounder distinction between big data and Business Intelligence, regarding data and their use: Business Intelligence uses descriptive statistics on data with high information density to measure things, detect trends, etc.; Big data uses inductive statistics and concepts from nonlinear system identification to infer laws from large sets of data with low information density, in order to reveal relationships and dependencies and to perform predictions of outcomes and behaviors. A more recent, consensual definition states that "Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value".

Big data can be described by the following characteristics:
Volume – The quantity of data that is generated is very important in this context. It is the size of the data which determines its value and potential, and whether it can actually be considered Big Data or not. The name 'Big Data' itself contains a term related to size, hence the characteristic.
Variety – The next aspect of Big Data is its variety: the category to which the data belongs is an essential fact that data analysts need to know. This helps the people who closely analyze the data to use it effectively to their advantage, thus upholding its importance.
Velocity – The term 'velocity' in this context refers to the speed at which data is generated and processed to meet the demands and challenges that lie ahead on the path of growth and development.
Veracity – The quality of the data being captured can vary greatly. Accuracy of analysis depends on the veracity of the source data.
Complexity – Data management can become a very complex process, especially when large volumes of data come from multiple sources. These data need to be linked, connected and correlated in order to grasp the information they are supposed to convey. This situation is therefore termed the 'complexity' of Big Data.
  4. Use the Nepal earthquake as an example of data whose value as a single event decreases over time but becomes useful as historical data. An IMDG can be combined with MR and MPP: event processing is done in the IMDG while historical data is batch-stored into MR or an MPP AD. (TBD: text cleanup.)

Approaches to analysis
How do you decide what belongs in the real-time tier versus the interactive tier? To understand the best use of each of these, there are some questions you can start asking to help you determine which is the best fit for your use case. It's worth noting that any decision will also be subject to other architectural considerations unique to each business.

When do I need it?
Over time, the value derived from actions on a singular piece of data becomes lower, and the data becomes more useful in the aggregate. The decay of immediate relevance for a piece of data is something like inverse exponential. My applications need to use the data now: it's helpful to think of real time as being your "Now Data" – the in-memory data which is relevant at this moment, which you need to have fast access to. Real time brings together your now data with aggregates of your historic data, which is where customers will find immediate value. Unique to each enterprise are the interactions between what your business is doing and events external to your company. Some parts of your business may operate and respond in real time, while others may not. Keeping data in-memory can help to alleviate problems such as large batch jobs in back-end systems taking too long. Think about areas of your business where real-time data analysis would give you an advantage:
- Online retailers need to respond quickly to queries. This is even more critical when the retailer is a target of aggregators like Google Shopping.
- Financial institutions reacting to market events and news.
- Airlines trying to optimize scheduling of services while aircraft are on the ground in the most efficient and cost-effective way.
- Retailers need to keep taking orders during surges in demand, even if the back-end systems can't scale to accommodate.
- Financial institutions calculating the risk of a particular transaction in real time to rapidly make the best decision possible.
In such use cases, the answer is real-time products such as an IMDG. Once the IMDG receives the data, the user can act on it immediately through the event framework. It can take part in application-led XA transactions (global transactions across multiple data stores), so anything that needs to be transactional and consistent should go there.

What if you need it both now and later?
If this is the case, you do not want to cram all of your computation and data into a single tier and have it address both cases; neither will be solved well. The key is to use the right solution at the right moment. It is recommended to constrain your real-time tier to only respond to business events happening now. The work done on this tier should be focused on singular events, which includes anything that should be updated as a result of a single piece of data coming in. This could be a transaction from an internal system, or a feed from an external system. Since the real-time tier must be as responsive as possible, you don't want to do long-running, exhaustive work on it; you will want to do deep exploratory analysis somewhere else. With a singular piece of data, you might decide to update a running aggregate in-memory, send it to another system, persist it, index it, or take another action. The key is that the action being taken is based on the singular event, or on a small set of singular events that are being correlated. For longer-running queries and analytics, such as year-end reporting or data exploration to detect new patterns in your business, the interactive and batch tiers are more appropriate.

What are my storage requirements?
You may have multiple answers to this question depending on the type of data. If you do not need to store the data long term, an IMDG can manage it in-memory with strong consistency and availability. If you want to store it long term, but may not be working with it in the short term, then Map-Reduce is a highly scalable storage solution. If you are required to store the data because of regulations and reporting requirements, and it is well structured, then an MPP AD is a fantastic answer.

Where is my data coming from?
Is the data coming from a stream of events from internal or external systems? Message-driven architectures? Files? Extract, transform, and load (ETL) database events? An IMDG is great at handling large and varying streams of data from any type of system. An IMDG can handle accelerating data streams by adding more nodes to the system, meaning your inbound pipe isn't throttled. Meanwhile, MPP AD and Map-Reduce solutions are both better at taking batch updates (either file or ETL). This works out well, because the IMDG can write to both of these systems in batch; it can be configured to write to any backend store. In such a case, you would use the IMDG to do a large data ingest, write it out to a Map-Reduce store in batch, and then analyze it there.

What are my latency requirements?
What latency does your business require in a given scenario and data set? For machine-time latency (microseconds to seconds), an IMDG is the solution. If the latency is longer, or at the speed of human interaction, SQL over Map-Reduce or an MPP AD might be most appropriate. Usually, these break down pretty cleanly into customer/partner latency (real time) versus internal latency (interactive and batch); however, if you are in a real-time business, like stock trading, everything may be time-critical.
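To illustrate the "IMDG writes to the backend store in batch" point above, here is a hedged sketch of a Hazelcast MapStore set up for write-behind. The class, map contents and backend call are hypothetical; a real implementation would issue a bulk COPY/INSERT against the MPP AD, or append a file to HDFS, inside storeAll().

```java
import com.hazelcast.core.MapStore;

import java.util.Collection;
import java.util.Collections;
import java.util.Map;

// Hypothetical write-behind store. With a map store enabled and
// write-delay-seconds > 0 in the map configuration, Hazelcast collects dirty
// entries and hands them to storeAll() in batches, which is where a real
// implementation would bulk-append to the historical tier (MPP AD or HDFS).
public class ReadingMapStore implements MapStore<String, Double> {

    @Override
    public void store(String key, Double value) {
        storeAll(Collections.singletonMap(key, value)); // single entry: reuse the batch path
    }

    @Override
    public void storeAll(Map<String, Double> entries) {
        // e.g. build one bulk INSERT/COPY against the analytical database here
        System.out.println("Flushing " + entries.size() + " entries to the batch tier");
    }

    @Override
    public void delete(String key) { /* history is append-only: nothing to do */ }

    @Override
    public void deleteAll(Collection<String> keys) { /* append-only: nothing to do */ }

    @Override
    public Double load(String key) { return null; }   // no read-through needed

    @Override
    public Map<String, Double> loadAll(Collection<String> keys) { return Collections.emptyMap(); }

    @Override
    public Iterable<String> loadAllKeys() { return null; }  // skip eager pre-load
}
```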
  5. Wiki: Transaction data are data describing an event (the change as a result of a transaction) and are usually described with verbs. Transaction data always have a time dimension and a numerical value, and refer to one or more objects (i.e. the reference data).

Pre-processing requirements
In many ways, your choice of platform is determined by the data you want to analyze. Instead of thinking in terms of structured, semi-structured, or unstructured data, consider the amount of pre-processing needed to develop an effective predictive model. Predictive models and machine learning algorithms need specific and well-formatted inputs. The steps to transform the raw data into something usable for modeling depend on the source and type of data. The chart compares the analysis approaches for different data pre-processing requirements:
- Transactional data and traditional customer information records are best suited for an MPP AD, as they require little to no pre-processing.
- Geospatial data often requires relatively complex geometric calculations.
- Raw log files, XML or JSON files and typical social media data are semi-structured; this is a situation where SQL over Map-Reduce is ideal. Users write SQL to interact with files in HDFS, enabling quick insights without writing Pig or MapReduce jobs. Depending on the log files, an MPP AD can also parse semi-structured logs efficiently.
- For text, it is easy to include open-source Natural Language Processing toolkits in your processing pipeline within the Map-Reduce stack, using procedural languages such as PL/Python.
- Video or image data demands extensive pre-processing. Map-Reduce is the best choice, especially when combined with an in-memory processing engine.

Summary: whatever level of scalability you need, the first step is to make your data ready for exploration and analysis.
  6. http://todayinsci.com/S/Stoll_Clifford/StollClifford-Quotations.htm
  7. http://jeremykun.files.wordpress.com/2014/10/mapreduceimage.gif?w=1800 Wiki: MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. The "MapReduce Framework" orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance. The model is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms. The key contributions of the MapReduce framework are not the actual map and reduce functions, but the scalability and fault tolerance achieved for a variety of applications by optimizing the execution engine once. As such, a single-threaded implementation of MapReduce (such as MongoDB) will usually not be faster than a traditional (non-MapReduce) implementation; any gains are usually only seen with multi-threaded implementations. The use of this model is beneficial only when the optimized distributed shuffle/combine operation (which reduces network communication cost) and the fault tolerance features of the MapReduce framework come into play. Optimizing the communication cost is essential to a good MapReduce algorithm. The name MapReduce originally referred to the proprietary Google technology, but has since been genericized.
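As a small aside on the functional-programming roots mentioned above, the same map-then-reduce shape can be written with Java 8 streams in a single process. This is only an analogy to show the data flow, not how Hadoop executes; the input lines are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class MapReduceAnalogy {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "to do or not to do");

        // "Map": emit a word for every token; "Reduce": count occurrences per word.
        // Same shape as a word-count MapReduce job, just single-process.
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        System.out.println(counts); // {to=4, be=2, do=2, or=2, not=2}, order not guaranteed
    }
}
```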
  8. https://anonymousbi.files.wordpress.com/2012/11/hadoopdiagram.png MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel – though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled – assuming the input data is still available.

Another way to look at MapReduce is as a 5-step parallel and distributed computation:
- Prepare the Map() input – the "MapReduce system" designates Map processors, assigns the input key value K1 that each processor will work on, and provides that processor with all the input data associated with that key value.
- Run the user-provided Map() code – Map() is run exactly once for each K1 key value, generating output organized by key values K2.
- "Combine" the Map output to the Reduce processors – the MapReduce system designates Reduce processors, assigns the K2 key value each processor should work on, and provides that processor with all the Map-generated data associated with that key value.
- Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key value produced by the Map step.
- Produce the final output – the MapReduce system collects all the Reduce output and sorts it by K2 to produce the final outcome.
These five steps can be logically thought of as running in sequence – each step starts only after the previous step is completed – although in practice they can be interleaved as long as the final result is not affected. In many situations, the input data might already be distributed ("sharded") among many different servers, in which case step 1 could sometimes be greatly simplified by assigning Map servers that would process the locally present input data. Similarly, step 3 could sometimes be sped up by assigning Reduce processors that are as close as possible to the Map-generated data they need to process.

Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks. Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System run on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster. The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.
  9. HDFS The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.  Assumptions and Goals Hardware Failure Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. Streaming Data Access Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. Large Data Sets Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance. Simple Coherency Model HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed.  “Moving Computation is Cheaper than Moving Data” A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located. HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications. NameNode and DataNodes HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. 
The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case. The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.
Robustness
The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Emphasis on the single point of failure of HDFS.
HDFS and CAP
http://blog.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/ If you have a network that may drop messages, then you cannot have both availability and consistency; you must choose one. This kind of nuance is not captured by the CAP theorem: consistency is often much more expensive in terms of throughput or latency to maintain than availability. Systems such as ZooKeeper are explicitly sequentially consistent because there are few enough nodes in a cluster that the cost of writing to quorum is relatively small. The Hadoop Distributed File System (HDFS) also chooses consistency – three failed datanodes can render a file’s blocks unavailable if you are unlucky. Both systems are designed to work in real networks, however, where partitions and failures will occur*, and when they do both systems will become unavailable, having made their choice between consistency and availability. That choice remains the unavoidable reality for distributed data stores.
*For more on the inevitability of failure modes in large distributed systems, the interested reader is referred to James Hamilton’s LISA ’07 paper On Designing and Deploying Internet-Scale Services. https://www.usenix.org/legacy/event/lisa07/tech/full_papers/hamilton/hamilton_html/index.html
http://stackoverflow.com/questions/19923196/cap-with-distributed-system The NameNode is a Single Point of Failure for the HDFS Cluster. HDFS is not currently a High Availability system. When the NameNode goes down, the file system goes offline. There is an optional SecondaryNameNode that can be hosted on a separate machine. It only creates checkpoints of the namespace by merging the edits file into the fsimage file and does not provide any real redundancy.
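The slides stop at the architecture, but for orientation here is a rough sketch of the write-once/read-many model through the HDFS Java client API (org.apache.hadoop.fs.FileSystem); the NameNode address and file path are assumptions, not values from the project.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/data/raw/events.log"); // hypothetical path

            // Write-once: create the file; blocks are streamed to DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("sensor-1,2015-03-01T10:00:00,42.0\n"
                        .getBytes(StandardCharsets.UTF_8));
            }

            // Read-many: stream the block data back through the DataNodes.
            try (FSDataInputStream in = fs.open(path)) {
                byte[] buffer = new byte[4096];
                int read = in.read(buffer);
                System.out.println(new String(buffer, 0, read, StandardCharsets.UTF_8));
            }
        }
    }
}
```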
  10. Java integration
Minimally, applications specify the input/output file locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client. In short: the user creates the configuration, implements the Mapper, Combiner and Reducer, defines input and output data types and file paths, and submits the job through the job client, which does all the work with the JobTracker.
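A minimal driver for the word-count classes sketched earlier, again assuming Hadoop's org.apache.hadoop.mapreduce API; the input and output paths come from the command line, and the reducer doubles as a combiner for local pre-aggregation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        // Ship the jar containing the map/reduce classes to the cluster.
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit and wait; failed tasks are re-executed by the framework.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```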
  11. The Apache Hive ™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
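From a Java application, HiveQL is usually issued over JDBC against HiveServer2. A hedged sketch follows; the host, port, credentials and raw_logs table are illustrative, and the Hive JDBC driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port and database are assumptions.
        String url = "jdbc:hive2://hive-server:10000/default";

        // HiveQL looks like SQL but compiles into MapReduce (or Tez/Spark) jobs.
        String sql = "SELECT level, COUNT(*) AS cnt "
                   + "FROM raw_logs "          // hypothetical table over HDFS files
                   + "GROUP BY level";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("level") + " -> " + rs.getLong("cnt"));
            }
        }
    }
}
```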
  12. http://blog.pivotal.io/big-data-pivotal/products/why-mpp-based-analytical-databases-are-still-key-for-enterprises http://architects.dzone.com/articles/sql-and-mpp-next-phase-big http://www.ndm.net/datawarehouse/Greenplum/greenplum-database-overview
Most DW appliances use massively parallel processing (MPP) architectures to provide high query performance and platform scalability. MPP architectures consist of independent processors or servers executing in parallel. Most MPP architectures implement a "shared-nothing architecture" where each server operates self-sufficiently and controls its own memory and disk. DW appliances distribute data onto dedicated disk storage units connected to each server in the appliance. This distribution allows DW appliances to resolve a relational query by scanning data on each server in parallel. The divide-and-conquer approach delivers high performance and scales linearly as new servers are added into the architecture. An MPP Analytical Database's shared-nothing architecture provides every segment with an independent high-bandwidth connection to dedicated storage. The segment servers are able to process every query in a fully parallel manner, use all disk connections simultaneously, and efficiently flow data between segments as query plans dictate. The degree of parallelism and overall scalability that this allows far exceeds general-purpose database systems. By transparently distributing data and work across multiple 'segment' servers, an MPP Analytical Database executes mathematically intensive analytical queries “close to the data” with performance that scales linearly with the number of segment servers. MPP-based analytic databases have been designed with security, authentication, disaster recovery, high availability and backup/restore in mind.
Main features to achieve performance goals:
- Column-oriented storage
- Data compression
Other outstanding features:
- ANSI SQL support
- Built-in analytical functions for time series analysis, statistical analysis, and event series analysis (i.e. pattern matching)
- UDFs in R/C++/Java
http://www.vertica.com/2012/12/21/a-deeper-dive-on-vertica-r/ http://en.wikipedia.org/wiki/Shared_nothing_architecture
SNA. A shared-nothing architecture (SN) is a distributed computing architecture in which each node is independent and self-sufficient, and there is no single point of contention across the system. More specifically, none of the nodes share memory or disk storage. People typically contrast SN with systems that keep a large amount of centrally stored state information, whether in a database, an application server, or any other similar single point of contention. The advantages of SN architecture versus a central entity that controls the network (a controller-based architecture) include eliminating single points of failure, allowing self-healing capabilities and offering non-disruptive upgrades. Emphasis on the contrast to the HDFS master/slave architecture with its single point of failure.
  13. K-Safety K-safety is a measure of fault tolerance in the database cluster. The value K represents the number of replicas of the data that exist in the database cluster. These replicas allow other nodes to take over for failed nodes, allowing the database to continue running while still ensuring data integrity. If more than K nodes in the database fail, some of the data in the database may become unavailable. It is possible for a database to have more than K nodes fail and still continue running safely, because the database continues to run as long as every data segment is available on at least one functioning cluster node. Potentially, up to half the nodes in a database with a K-safety level of 1 could fail without causing the database to shut down. As long as the data on each failed node is available from another active node, the database continues to run. If half or more of the nodes in the database cluster fail, the database will automatically shut down even if all of the data in the database is technically available from replicas. This behavior prevents issues due to network partitioning. In HP Vertica, the value of K can be zero (0), one (1), or two (2). The physical schema design must meet certain requirements. To create designs that are K-safe, HP recommends using the Database Designer. The diagram above shows a 5-node cluster that has a K-safety level of 1. Each of the nodes contains buddy projections for the data stored in the next higher node (node 1 has buddy projections for node 2, node 2 has buddy projections for node 3, etc.). Any of the nodes in the cluster could fail, and the database would still be able to continue running (although with lower performance, since one of the nodes has to handle its own workload and the workload of the failed node).
  14. If node 2 fails, node 1 handles requests on its behalf using its replica of node 2's data, in addition to performing its own role in processing requests. The fault tolerance of the database falls from 1 to 0, since the failure of one more node could make the database unsafe. In this example, if either node 1 or node 3 fails, the database becomes unsafe because not all of its data would be available. If node 1 fails, then node 2's data will no longer be available. If node 3 fails, its data will no longer be available, because node 2 is also down and could not fill in for it. In this case, nodes 1 and 3 are considered critical nodes. In a database with a K-safety level of 1, the node that contains the buddy projections of a failed node and the node whose buddy projections were on the failed node always become critical nodes.
  15. With node 2 down, either node 4 or node 5 in the cluster could fail and the database would still have all of its data available. For example, if node 4 fails, node 3 is able to use its buddy projections to fill in for it. In this situation, any further loss of nodes would result in a database shutdown, since all of the nodes in the cluster are now critical nodes. (In addition, if one more node were to fail, half or more of the nodes would be down, requiring HP Vertica to shut down automatically, regardless of whether all of the data were still available.)
  16. In a database with a K-safety level of 2, any node in the cluster could fail after node 2 and the database would be able to continue running. For example, if in the 5-node cluster each node contained buddy projections for both its neighbors (for example, node 1 contained buddy projections for both node 5 and node 2), then nodes 2 and 3 could fail and the database could continue running. Node 1 could fill in for node 2, and node 4 could fill in for node 3. Due to the requirement that half or more of the nodes in the cluster be available in order for the database to continue running, the cluster could not continue running if node 5 were to fail as well, even though nodes 1 and 4 both have buddy projections for its data.
K-Safety Requirements. Your database must have a minimum of 2K+1 nodes to be able to have a K-safety level of K. For example, K=1 requires at least 3 nodes and K=2 requires at least 5.
The CAP theorem states that in a distributed database, if you have a network that may drop messages, then you cannot have both complete availability and perfect consistency in the event of a partition; instead you must choose one. The CAP theorem is useful from a system engineering perspective because distributed systems must pick two of the three properties to implement and give up the third. A system that "gives up" on a particular property makes a best effort but cannot provide solid guarantees. Different systems choose to give up on different properties, resulting in different behavior when failures occur. However, there is a fair amount of confusion about what the C, A, and P actually mean for a system.
- Partition tolerance – A network partition results in some node A being unable to exchange messages with another node B; more generally, the inability of the nodes to communicate. Systems that give up on P assume that all messages are reliably delivered without fail and that nodes never go down. In pretty much any context in which the CAP theorem is invoked, the system in question supports P.
- Consistency – For these types of distributed systems, consistency means that all operations submitted to the system are executed as if in some sequential order on a single node. For example, if a write is executed, a subsequent read will observe the new data. Systems that give up on C can return inconsistent answers when nodes fail (or are partitioned); for example, two clients can read and each receive different values.
- Availability – A system is unavailable when a client does not receive an answer to a request. Systems that give up on A will return no answer rather than a potentially incorrect (or inconsistent) answer. For example, unless a quorum of nodes is up, a write will fail to succeed.
Vertica is a stateful distributed system and thus worthy of consideration under the CAP theorem:
- Partition tolerance – Vertica supports partitions. That is, nodes can fail or messages can fail to be delivered and Vertica can continue functioning.
- Consistency – Vertica is consistent. All operations on Vertica are strongly ordered – i.e., there is a singular truth about what data is in the system and it can be observed by querying the database.
- Availability – Vertica is willing to sacrifice availability in pursuit of consistency when failures occur. Without a quorum of nodes (over half), Vertica will shut down since no modification may safely be made to the system state.
The choice to give up availability for consistency is a very deliberate one and represents cultural expectations for a relational database as well as a belief that a database component should make the overall system design simpler. Developers can more easily reason about the database component being up or down than about it giving inconsistent (dare I say … “wrong”) answers. One reason for this belief is that a lack of availability is much more obvious than a lack of consistency. The more obvious and simplistic a failure mode is, the easier integration testing will be with other components, resulting in a higher quality overall system.
  17. Java integration is via JDBC, so in most cases a Java application uses regular SQL, just as it would with MySQL.
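For example, a minimal sketch of querying an MPP database (here Vertica) from Java over JDBC; the connection URL, credentials and sensor_readings table are assumptions, and the vendor JDBC driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class VerticaQuerySketch {
    public static void main(String[] args) throws Exception {
        // Typical Vertica JDBC URL; host, port, database and credentials are assumptions.
        String url = "jdbc:vertica://vertica-node1:5433/analytics";

        try (Connection conn = DriverManager.getConnection(url, "dbadmin", "secret");
             PreparedStatement stmt = conn.prepareStatement(
                 "SELECT device_id, AVG(value) AS avg_value " +
                 "FROM sensor_readings " +          // hypothetical fact table
                 "WHERE reading_time >= ? " +
                 "GROUP BY device_id")) {

            stmt.setTimestamp(1, java.sql.Timestamp.valueOf("2015-01-01 00:00:00"));
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
                }
            }
        }
    }
}
```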
  18. https://dataddict.wordpress.com/2013/05/14/choosing-a-mpp-database-is-incredibly-hard/
HP Vertica Analytics Platform 6.1 “Bulldozer”
I’ve talked a lot about Vertica because I love this product, and the latest release of the platform kept the same feeling in my head: this is a truly advanced database. But, believe me, I’m not the only one; companies like Twitter, Zynga, and Convertro are big users of the platform. If you want to know more about the Bulldozer release, see the webinar dedicated to this topic or download its datasheet.
Greenplum Database 4.0
Vertica’s team is not the only one innovating in this space. The engineering team at Greenplum has added some outstanding features to its new release:
- High-performance gNet™ for Hadoop
- Out-of-the-box support for big data analytics
- Multi-level partitioning with dynamic partition elimination
- Polymorphic data storage with multi-storage/SSD support
- Fast query processing with a new loading technology called Scatter/Gather Streaming, allowing automatic parallelization of data loading and queries
- Analytics and language support, providing methods for advanced analytic functions like t-statistics, p-values, and Naïve Bayes inside the database, plus great integration with R
- Dynamic query prioritization and many more features
Teradata Aster Data Database 5.0
This is another great team doing very well in this field, combining highly technical research in a single product. Some of its features:
- A patent-pending SQL-MapReduce framework that combines the power of MapReduce with SQL
- Hybrid row/column storage, depending on your needs
- "Always-On" and "Always-Parallel" capabilities that apply parallelism to data and analytics processing and provide world-class fault tolerance
- A group of ready-to-use analytic functions for rapid analytic platform development, called the Aster MapReduce Analytics Portfolio
- Rich monitoring and easy management of data and analytic processing with the intuitive Aster Management Console
- Integration with several languages such as Java, C, C#, Python, C++, and R
- Dynamic mixed workload management that ensures scalable performance even with large numbers of concurrent users and workloads
ParAccel Analytic Platform 4.0
This is another team doing a terrific job building an outstanding analytic platform. Some of its features:
- On-demand integration with Hadoop
- Columnar storage
- A powerful extensibility framework
- Advanced query optimization, allowing complex operations like JOINs, sorting, and query planning to perform better
- An advanced communication protocol for interconnection across the cluster, improving data loading, backup and recovery, and parallel query execution
- Advanced I/O optimization that improves scan performance by using high-performance algorithms to predict which data blocks will be needed for future operations
- Adaptive compression encoding depending on the data type involved
- A large number of ready-to-use analytic functions for techniques like pattern matching, time series analysis, advertising attribution and optimization, sophisticated fraud detection and event analysis, and many statistical methods (univariate, multivariate, data mining, mathematical, corporate finance, options/derivatives, portfolio management, fixed income, and more)
Amazon Redshift
Amazon Redshift is based on a version of ParAccel.
  19. The data model is distributed across many servers in a single location or across multiple locations. This distribution is known as a data fabric. This distributed model is known as a 'shared nothing' architecture.
- All servers can be active in each site.
- All data is stored in the memory of the servers.
- Servers can be added or removed non-disruptively, to increase the amount of CPU and RAM available.
- The data model is non-relational and is object-based. Distributed applications written on the .NET and Java application platforms are supported.
- The data fabric is resilient, allowing non-disruptive automated detection and recovery of a single server or multiple servers.
http://www.gridgain.com/in-memory-database-vs-in-memory-data-grid-revisited/ Before moving forward, let's clarify what we mean by "in-memory". Although some vendors refer to SSDs, Flash-on-PCI, Memory Channel Storage, and DRAM as "in-memory", in reality most vendors support a tiered storage model where part of the data is stored in DRAM, which then overflows to a variety of flash or disk devices. Therefore it is rarely a DRAM-only, flash-only or disk-only product. However, it's important to note that most products in both categories are often biased towards mostly DRAM or mostly flash/disk storage in their architecture. The main point to take away is that "in-memory" products are not confined to one fixed definition, but in the end they all have a significant "in-memory" component.
In-Memory Data Grids typically lack full ANSI SQL support but instead provide MPP-based (massively parallel processing) capabilities where data is spread across a large cluster of commodity servers and processed in an explicitly parallel fashion. The main access patterns are key/value access, MapReduce, various forms of HPC-like processing, and limited distributed SQL querying and indexing capabilities. An In-Memory Data Grid always works with an existing database, providing a layer of massively distributed in-memory storage and processing between the database and the application. Applications then rely on this layer for super-fast data access and processing. Most In-Memory Data Grids can seamlessly read-through and write-through from and to databases when necessary, and generally are highly integrated with existing databases. Emphasis on the similarities with an MPP AD; it just needs a DB for persistence.
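To illustrate the read-through/write-through integration mentioned above, here is a minimal Hazelcast MapStore sketch (Hazelcast 3.x API assumed); a ConcurrentHashMap stands in for the real backing database, which in practice would be JDBC or another persistence layer. The store would then be attached to a named map via the map's MapStore configuration.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import com.hazelcast.core.MapStore;

// Read-through/write-through adapter between a Hazelcast IMap and an external store.
// The ConcurrentHashMap below is a stand-in for the real database.
public class SensorMapStore implements MapStore<String, Double> {

    private final ConcurrentHashMap<String, Double> database = new ConcurrentHashMap<>();

    @Override
    public Double load(String key) {                  // cache miss -> read-through
        return database.get(key);
    }

    @Override
    public Map<String, Double> loadAll(Collection<String> keys) {
        Map<String, Double> result = new HashMap<>();
        for (String key : keys) {
            Double value = database.get(key);
            if (value != null) {
                result.put(key, value);
            }
        }
        return result;
    }

    @Override
    public Iterable<String> loadAllKeys() {           // keys eligible for eager pre-load
        return database.keySet();
    }

    @Override
    public void store(String key, Double value) {     // map.put -> write-through
        database.put(key, value);
    }

    @Override
    public void storeAll(Map<String, Double> entries) {
        database.putAll(entries);
    }

    @Override
    public void delete(String key) {
        database.remove(key);
    }

    @Override
    public void deleteAll(Collection<String> keys) {
        keys.forEach(database::remove);
    }
}
```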
  20. Hazelcast and CAP
The oft-discussed CAP theorem in distributed database theory states that if you have a network that may drop messages, then you cannot have both complete availability and perfect consistency in the event of a partition; instead you must choose one. Multiple copies of data (up to 3, via the backup-count property) are stored on multiple machines for automatic data recovery in case of single or multiple server failures. Hazelcast’s approach to the CAP theorem notes that network partitions caused by total loss of messages are very rare on modern LANs; instead, Hazelcast applies the CAP theorem over wide-area (and possibly geographically diverse) networks. In the event of a network partition where nodes remain up and connected to different groups of clients (i.e. a split-brain scenario), Hazelcast will give up consistency (“C”) and remain available (“A”) whilst partitioned (“P”). Emphasis on the contrast with HP Vertica. However, unlike some NoSQL implementations, C is not given up unless a network partition occurs. The effect for the user in the event of a partition would be that clients connected to one partition would see locally consistent results; however, clients connected to different partitions would not necessarily see the same result. For example, an AtomicInteger could now potentially have different values in different partitions. Fortunately, such partitions are very rare in most networks. Since Hazelcast clients are always made aware of a list of nodes they could connect to, in the much more likely event of the loss of a datacenter (or region), clients would simply reconnect to unaffected nodes.
  21. Add Computations to Data
http://hazelcast.com/use-cases/imdg/in-memory-computing/ What is it? In-Memory Computing
Microprocessors double in performance and speed roughly every two years. Software developers have created analytics that let researchers crunch millions of variables from disparate sources of information. However, the time it takes a server or a smartphone to retrieve data from a storage system for a cloud company or hosting provider hasn’t decreased much, since it still involves searching a spinning, mechanical hard drive. Such a transaction might only take milliseconds, but millions of transactions per day add up to a lot of time. In-memory computing (IMC) reduces the need to fetch data from disks by moving more data out of the drives and into faster memory, such as flash (or, in the case of Hazelcast, RAM). Memory based on flash can be more than 53 times faster than memory based around disks. IMC that processes distributed data in parallel across multiple nodes is a technical necessity because the data is stored that way (in memory across multiple nodes) due to its sheer size. Moving data from drives to memory results in ultra-fast access to data and allows developers to cut many lines of code from their applications. This helps on many fronts:
- Fewer product delays and operations headaches
- A better customer usability experience
- Higher customer satisfaction
- Business customers such as retailers, banks and utilities can quickly detect patterns, analyze massive volumes of data on the fly, and perform operations in real time
In-Memory Computing Advantages
In-memory computing has advantages relating to the fact that reading and writing data that is purely in memory is faster than data stored on a drive. For example:
- Cache large amounts of data, and get fast response times for searches
- Store session data, allowing for customization of live sessions and optimized website performance
- Improve complex event processing
Benefits of Using Hazelcast
Being a true IMC solution, Hazelcast provides the capability to apply and execute business logic and complex algorithms on terabytes and petabytes of data in memory. Some of the immediate benefits of using Hazelcast include:
- The ability to store vast amounts of data in memory, ensuring extremely fast response times for extracting data
- Bypassing the need for complex memory tuning that would otherwise be required to keep huge data sets in memory
- Saving hardware and operations costs thanks to the capacity to store large volumes of data within a single machine (based on RAM)
- The ability to compute over large volumes of in-memory data in parallel across a handful of nodes
- Ultra-low latency and high throughput infrastructure because of in-memory storage and computing
- In-memory MapReduce and Aggregators for super-fast computation/aggregation of large amounts of data
Hazelcast is Flexible
Hazelcast is highly flexible. It provides multiple locking semantics for users to attain the desired level of consistency while performing complex in-memory computing on data. Some of the Hazelcast features that empower a user with in-memory computing are (a sketch follows below):
- Distributed Executor Service
- EntryProcessors
- MapReduce
- Distributed Aggregator
The diagram above depicts a typical IMC architecture with Hazelcast.
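A minimal sketch of the distributed executor idea: the computation is sent to a cluster member that holds the data rather than pulling data to the client (Hazelcast 3.x API assumed; the map name, task and threshold are illustrative).

```java
import java.io.Serializable;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.HazelcastInstanceAware;
import com.hazelcast.core.IExecutorService;
import com.hazelcast.core.IMap;

public class DistributedTaskSketch {

    // The task is serialized and runs inside a member JVM, next to the data.
    static class CountHotSensors
            implements Callable<Long>, Serializable, HazelcastInstanceAware {
        private transient HazelcastInstance hz;

        @Override
        public void setHazelcastInstance(HazelcastInstance hz) {
            this.hz = hz; // injected by Hazelcast on the executing member
        }

        @Override
        public Long call() {
            IMap<String, Double> readings = hz.getMap("readings");
            long count = 0;
            for (Double value : readings.values()) {
                if (value > 100.0) {
                    count++;
                }
            }
            return count;
        }
    }

    public static void main(String[] args) throws Exception {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, Double> readings = hz.getMap("readings");
        readings.put("sensor-1", 120.5);
        readings.put("sensor-2", 80.0);

        IExecutorService executor = hz.getExecutorService("analytics");
        Future<Long> result = executor.submit(new CountHotSensors());
        System.out.println("Hot sensors: " + result.get()); // prints 1

        hz.shutdown();
    }
}
```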
  22. Topology with every node being an app server as well. Topology with a dedicated Hazelcast cluster.
Java integration
Hazelcast provides a drop-in library that any Java developer can include in minutes to build elegantly simple mission-critical, transactional, and terascale in-memory applications.
Data structures: Map, Queue, Lock, Cache. Computing: Aggregations and Map-Reduce.
Hazelcast provides a convenient and familiar interface for developers to work with distributed data structures and other aspects of in-memory computing. For example, in its simplest configuration, Hazelcast can be treated as an implementation of the familiar ConcurrentHashMap that can be accessed from multiple JVMs, including JVMs that are spread out across the network. However, it is not necessary to deal with the overall sophistication present in the architecture in order to work effectively with Hazelcast, and many users are happy integrating purely at the level of the java.util.concurrent or javax.cache APIs. http://hazelcast.org/
Core Java:
- java.util.concurrent.ConcurrentMap
- com.hazelcast.core.MultiMap >> Collection<String> values = multiMap.get("key")
- javax.cache.Cache
- java.util.concurrent.BlockingQueue
- java.util.concurrent.locks.Lock
Hazelcast-specific:
- com.hazelcast.core.IMap // EntryProcessor >> map.executeOnKey (example below)
- com.hazelcast.core.ITopic
- com.hazelcast.core.IExecutorService
- com.hazelcast.core.IAtomicLong
In-memory data computing:
- com.hazelcast.mapreduce.aggregation.Supplier
- com.hazelcast.mapreduce.aggregation.Aggregations
- com.hazelcast.mapreduce.Job
- com.hazelcast.mapreduce.JobTracker
- com.hazelcast.mapreduce.KeyValueSource
- com.hazelcast.mapreduce.Mapper
- com.hazelcast.mapreduce.Reducer
- …
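Tying a few of these APIs together, a sketch of using an IMap as a distributed ConcurrentMap and mutating an entry in place with an EntryProcessor, so the update runs on the member that owns the key (Hazelcast 3.x API assumed; map and key names are illustrative).

```java
import java.util.Map;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.map.AbstractEntryProcessor;

public class EntryProcessorSketch {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // IMap behaves like a ConcurrentMap that is partitioned across the cluster.
        IMap<String, Long> counters = hz.getMap("message-counters");
        counters.put("device-42", 0L);

        // The processor executes on the member that owns the key,
        // so the value is updated where it lives instead of being shipped around.
        counters.executeOnKey("device-42", new AbstractEntryProcessor<String, Long>() {
            @Override
            public Object process(Map.Entry<String, Long> entry) {
                entry.setValue(entry.getValue() + 1);
                return null;
            }
        });

        System.out.println("device-42 -> " + counters.get("device-42")); // prints 1
        hz.shutdown();
    }
}
```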
  23. How it fits the big data problem Apache Ignite In-Memory Data Fabric is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash technologies. Oracle Coherence is the industry leading in-memory data grid solution that enables organizations to predictably scale mission-critical applications by providing fast access to frequently used data.  Pivotal GemFire is a distributed data management platform. Pivotal GemFire is designed for many diverse data management situations, but is especially useful for high-volume, latency-sensitive, mission-critical, transactional systems. Gigaspaces XAP is an in-memory computing software platform that processes all your data & apps in real time.
  24. Project domain. Talk about Gen3 – Oracle & .NET – it works but does not scale.
  25. Project goals http://en.wikipedia.org/wiki/Business_analytics#Types_of_analytics http://en.wikipedia.org/wiki/Data_analysis Note: all system changes, i.e. parameters for analytical computations, are time-series data.
  26. Project technology Akka is an open-source toolkit and runtime simplifying the construction of concurrent and distributed applications on the JVM. Akka supports multiple programming models for concurrency, but it emphasizes actor-based concurrency, with inspiration drawn from Erlang. Language bindings exist for both Java and Scala. R is a free software environment for statistical computing and graphics.
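As a flavour of the actor model in Java, a minimal Akka sketch (the AbstractActor API from Akka 2.5+ is assumed; the actor, message type and names are illustrative, not taken from the project).

```java
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

public class AkkaSketch {

    // A message type; actor messages should be immutable.
    static final class Reading {
        final String deviceId;
        final double value;
        Reading(String deviceId, double value) {
            this.deviceId = deviceId;
            this.value = value;
        }
    }

    // Actor that processes sensor readings one message at a time.
    static class ReadingProcessor extends AbstractActor {
        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .match(Reading.class,
                           r -> System.out.println(r.deviceId + " -> " + r.value))
                    .build();
        }
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("analytics");
        ActorRef processor = system.actorOf(Props.create(ReadingProcessor.class), "processor");

        // Messages are delivered asynchronously; the actor handles them sequentially.
        processor.tell(new Reading("device-42", 21.5), ActorRef.noSender());

        // A real application would keep the system running; here we just shut it down.
        system.terminate();
    }
}
```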
  27. Talk on:
- Vertica built-in time-series analysis capabilities (a query sketch follows below)
- Redis as the in-memory database for recent data
- Akka as the framework for building distributed applications
- R as the statistical computing language
- Java-R integration via a RESTful API
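A hedged sketch of what a Vertica time-series query could look like from Java: the TIMESERIES clause buckets irregular readings into fixed slices per device, and TS_FIRST_VALUE interpolates a value for each slice. The connection details and sensor_readings table are assumptions, not the project's actual schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VerticaTimeSeriesSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:vertica://vertica-node1:5433/analytics"; // assumed endpoint

        // Gap-fill raw readings into 1-minute slices per device.
        String sql =
            "SELECT device_id, slice_time, TS_FIRST_VALUE(value) AS value " +
            "FROM sensor_readings " +                              // hypothetical table
            "TIMESERIES slice_time AS '1 minute' " +
            "OVER (PARTITION BY device_id ORDER BY reading_time)";

        try (Connection conn = DriverManager.getConnection(url, "dbadmin", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " " + rs.getTimestamp(2)
                        + " " + rs.getDouble(3));
            }
        }
    }
}
```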
  30. What could be done differently http://www.vertica.com/2012/11/14/how-to-parse-anything-into-vertica-using-externalfilter/ http://www.vertica.com/2012/10/02/how-to-implement-r-in-vertica/
  31. Lessons learned