Small intro to Big Data
Michał Matłoka @mmatloka
Outline
1. What is Big Data?
2. Storing
3. Batch & stream processing
4. Resource Managers
5. Machine Learning
6. Analysis & Visualization
7. Other
What is Big Data?
● Volume
● Velocity
● Variety
How big is Big?
ThoughtWorks: Big Data envy
Storing
CAP theorem
(Brewer’s theorem)
In a distributed system you can have
at most two of three guarantees:
● Consistency
● Availability
● Partition Tolerance
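As a rough illustration of the trade-off, here is a toy model with two in-memory replicas and a simulated network partition. The class `CapSketch` and its `Mode` enum are made-up names for this sketch, not part of any real database. A CP-style store refuses a write it cannot replicate; an AP-style store accepts it and lets the replicas diverge.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of two replicas; every name here is hypothetical.
final class CapSketch {
    enum Mode { CP, AP }

    final Map<String, String> replicaA = new HashMap<>();
    final Map<String, String> replicaB = new HashMap<>();
    final Mode mode;
    boolean partitioned = false; // true = the replicas cannot reach each other

    CapSketch(Mode mode) {
        this.mode = mode;
    }

    // A write arrives at replica A. Returns true if it was accepted.
    boolean write(String key, String value) {
        if (!partitioned) {
            replicaA.put(key, value);
            replicaB.put(key, value); // replicate synchronously
            return true;
        }
        if (mode == Mode.CP) {
            return false; // keep consistency, give up availability
        }
        replicaA.put(key, value); // keep availability, replicas now disagree
        return true;
    }

    boolean consistent() {
        return replicaA.equals(replicaB);
    }

    public static void main(String[] args) {
        for (Mode m : Mode.values()) {
            CapSketch store = new CapSketch(m);
            store.write("k", "v1");
            store.partitioned = true;
            boolean accepted = store.write("k", "v2");
            System.out.println(m + ": accepted=" + accepted
                    + ", consistent=" + store.consistent());
        }
    }
}
```

Real systems sit between these two extremes (quorums, tunable consistency), which this sketch does not try to model.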
Relational scaling
(horizontal)
Example limitations:
● Max 48 nodes
● Read-only nodes
● Cross-shard joins…
● Auto-increments
● Distributed transactions: possible, but…
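Two of the limitations above can be made concrete with a small sketch. It assumes hash-based routing and an offset/step id scheme; both are illustrative choices rather than anything prescribed by the slides, and `ShardingSketch` is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: hash-based shard routing and collision-free ids.
final class ShardingSketch {
    // Route a key to one of shardCount shards; floorMod avoids negative results.
    static int shardFor(String key, int shardCount) {
        return Math.floorMod(key.hashCode(), shardCount);
    }

    // A plain auto-increment would hand out the same id on every shard.
    // Giving shard i the sequence i+1, i+1+n, i+1+2n, ... keeps ids globally unique.
    static long nthId(int shardIndex, int shardCount, long n) {
        return (shardIndex + 1) + n * shardCount;
    }

    public static void main(String[] args) {
        int shards = 3;
        System.out.println("user-42 -> shard " + shardFor("user-42", shards));
        for (int s = 0; s < shards; s++) {
            List<Long> ids = new ArrayList<>();
            for (long n = 0; n < 4; n++) {
                ids.add(nthId(s, shards, n));
            }
            System.out.println("shard " + s + " ids: " + ids);
        }
    }
}
```

Rows that land on different shards can no longer be joined by a single node, which is where the cross-shard join limitation comes from.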
It can work!
You don’t always need ACID
BASE might be enough
NoSQL
(Not only SQL)
● Key-value (Redis, Dynamo, ...)
● Column (Cassandra, HBase, ...)
● Document (MongoDB, … )
● Graph (Neo4j, …)
● Multi-model (OrientDB, …)
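To give a feel for how the first four models differ, the sketch below uses plain Java collections as stand-ins. It shows only the shape of the data; it is not the API of Redis, Cassandra, MongoDB, or Neo4j, and all keys and values are invented.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-ins for four NoSQL data models; not any product's real API.
final class NoSqlModelsSketch {
    // Key-value: an opaque value looked up by key.
    static Map<String, String> keyValue() {
        Map<String, String> kv = new HashMap<>();
        kv.put("session:42", "alice");
        return kv;
    }

    // Column (wide-column): row key -> sorted column name -> value.
    static Map<String, TreeMap<String, String>> wideColumn() {
        Map<String, TreeMap<String, String>> rows = new HashMap<>();
        TreeMap<String, String> cols = new TreeMap<>();
        cols.put("email", "alice@example.com");
        cols.put("name", "Alice");
        rows.put("user:42", cols);
        return rows;
    }

    // Document: a nested, schema-free structure stored as one unit.
    static Map<String, Object> document() {
        Map<String, Object> address = new HashMap<>();
        address.put("city", "Warsaw");
        Map<String, Object> doc = new HashMap<>();
        doc.put("name", "Alice");
        doc.put("address", address);
        doc.put("tags", Arrays.asList("admin", "dev"));
        return doc;
    }

    // Graph: vertices with outgoing edges (adjacency list).
    static Map<String, List<String>> graph() {
        Map<String, List<String>> edges = new HashMap<>();
        edges.put("alice", Arrays.asList("bob", "carol"));
        edges.put("bob", Arrays.asList("carol"));
        return edges;
    }

    public static void main(String[] args) {
        System.out.println(keyValue().get("session:42"));
        System.out.println(wideColumn().get("user:42").firstKey());
        System.out.println(((Map<?, ?>) document().get("address")).get("city"));
        System.out.println(graph().get("alice"));
    }
}
```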
Apple - 115k Cassandra nodes with over 10PB of data!
Source: http://blog.nahurst.com/visual-guide-to-nosql-systems
Batch Processing
● Processes data from one or more sources, covering a longer period of time (e.g. a day, a month)
● Sources: databases, Apache Parquet files, ...
● Not real-time
● Can take hours or more
Apache Hadoop
● Based on Google's papers (GFS, MapReduce)
● First release in 2006!
● Map -> (shuffle) -> Reduce
● Was the beginning of many projects
MapReduce
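// Mapper (nested in the WordCount class): emits (word, 1) for every token in a line.
// Needs java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.{IntWritable, Text} and org.apache.hadoop.mapreduce.Mapper.
public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}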
Hadoop Wordcount - part I
Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
// Reducer (nested in the WordCount class): sums all counts emitted for one word.
// Needs org.apache.hadoop.mapreduce.Reducer in addition to the mapper's imports.
public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
Hadoop Wordcount - part II
Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
// Driver (WordCount.main): wires mapper, combiner and reducer into one job.
// args[0] = input path, args[1] = output path (must not exist yet).
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Hadoop Wordcount - part III
Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
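The Map -> (shuffle) -> Reduce flow can also be tried without a cluster. The sketch below imitates the three phases in a single process with plain Java collections. `LocalWordCount` is a hypothetical class that uses no Hadoop APIs, so it only shows the data flow, not distribution or fault tolerance.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Single-process imitation of map -> shuffle -> reduce; no Hadoop classes involved.
final class LocalWordCount {
    // Map: emit a (word, 1) pair for every token of every line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                pairs.add(new SimpleEntry<>(itr.nextToken(), 1));
            }
        }
        return pairs;
    }

    // Shuffle: group all values that share a key, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: sum the grouped values per key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("big data is big");
        lines.add("data is data");
        System.out.println(reduce(shuffle(map(lines)))); // {big=2, data=3, is=2}
    }
}
```

In Hadoop the map and reduce calls run on different machines, and the shuffle moves the grouped pairs across the network between them.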
Apache Spark
RDD (Resilient Distributed Dataset)
DAG (Directed acyclic graph)
● RDD - map, filter, count etc
● Spark SQL
● MLlib
● GraphX
● Spark Streaming
● API: Scala, Java, Python, R*
val textFile = sc.textFile("hdfs://...")
val counts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Spark Wordcount
Source: http://spark.apache.org/examples.html
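For readers who know Java better than Scala, the same chain of operations can be approximated locally with `java.util.stream`. This is only an analogue of the pipeline's shape: `StreamWordCount` is a hypothetical class, it does not use Spark, and it runs on one machine. The empty-token filter is an addition that the Spark example does not have.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Local analogue of the Spark pipeline using java.util.stream; it does not use Spark.
final class StreamWordCount {
    static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" "))) // like flatMap
                .filter(word -> !word.isEmpty())                 // added: skip empty tokens
                .collect(Collectors.toMap(
                        word -> word,                            // like map(word => (word, 1))
                        word -> 1,
                        Integer::sum,                            // like reduceByKey(_ + _)
                        TreeMap::new));
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("big data is big", "data is data")));
    }
}
```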
Apache Flink
DAGs with iterations
● Batch & native streaming
● FlinkML
● Table API & SQL (Beta)
● Gelly - graph analytics
● FlinkCEP - detect patterns in data streams
● Compatible with Apache Hadoop and Apache Storm APIs
● API: Scala, Java, Python*
Stream processing
● Near real-time/real-time
● Processing (usually) does not end
● Source: files, Apache Kafka, Socket, Akka Actors, Twitter, RabbitMQ etc.
● Event time vs processing time
● Windows - fixed, sliding, session
● Watermarks
● State
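The window and watermark bullets can be sketched as plain arithmetic on event timestamps. `WindowSketch` below is a hypothetical helper, not the API of Flink, Spark, or any other engine. The "maximum event time seen minus a fixed delay" watermark is one simple strategy chosen for illustration, and lateness is checked against fixed windows only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Window assignment by event time; a sketch, not the API of any streaming engine.
final class WindowSketch {
    // Fixed (tumbling) window: each timestamp belongs to exactly one window.
    static long fixedWindowStart(long eventTimeMs, long sizeMs) {
        return eventTimeMs - Math.floorMod(eventTimeMs, sizeMs);
    }

    // Sliding window: a timestamp belongs to every window [start, start + size)
    // whose start is a multiple of slideMs.
    static List<Long> slidingWindowStarts(long eventTimeMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        long lastStart = eventTimeMs - Math.floorMod(eventTimeMs, slideMs);
        for (long start = lastStart; start > eventTimeMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    // Session windows: consecutive events less than gapMs apart share a session.
    static List<List<Long>> sessions(List<Long> sortedEventTimesMs, long gapMs) {
        List<List<Long>> out = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long t : sortedEventTimesMs) {
            if (!current.isEmpty() && t - current.get(current.size() - 1) >= gapMs) {
                out.add(current);
                current = new ArrayList<>();
            }
            current.add(t);
        }
        if (!current.isEmpty()) {
            out.add(current);
        }
        return out;
    }

    // Watermark: "no events older than this are expected any more".
    // Here it trails the highest event time seen by a fixed allowed delay.
    static long watermark(long maxEventTimeSeenMs, long allowedDelayMs) {
        return maxEventTimeSeenMs - allowedDelayMs;
    }

    // An event is late if its fixed window already closed according to the watermark.
    static boolean isLate(long eventTimeMs, long sizeMs, long watermarkMs) {
        long windowEnd = fixedWindowStart(eventTimeMs, sizeMs) + sizeMs;
        return windowEnd <= watermarkMs;
    }

    public static void main(String[] args) {
        System.out.println(fixedWindowStart(7000L, 10000L));
        System.out.println(slidingWindowStarts(7000L, 10000L, 5000L));
        System.out.println(sessions(Arrays.asList(1000L, 2000L, 10000L), 5000L));
        System.out.println(isLate(7000L, 10000L, watermark(13000L, 2000L)));
    }
}
```

Event time is the timestamp carried by the event itself; processing time is the clock of the machine handling it. Everything above works on event time, which is why out-of-order arrival and watermarks matter.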
Stream processing - differences
● Native vs. micro-batch
● Latency
● Throughput
● Delivery guarantees
● Resource managers
● API - compositional vs. declarative
● Maturity
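Delivery guarantees are easiest to see with a redelivered event. The sketch below assumes at-least-once delivery, where the same event may arrive twice, and compares naive processing with idempotent processing that tracks processed ids. `DeliverySketch` and its `Event` type are hypothetical, and real engines typically achieve exactly-once results with checkpoints or transactions rather than an unbounded id set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical consumer showing how redelivery affects a running total.
final class DeliverySketch {
    static final class Event {
        final String id;
        final int amount;

        Event(String id, int amount) {
            this.id = id;
            this.amount = amount;
        }
    }

    // Naive processing: under at-least-once delivery, duplicates are counted twice.
    static int sumNaive(List<Event> delivered) {
        int total = 0;
        for (Event e : delivered) {
            total += e.amount;
        }
        return total;
    }

    // Idempotent processing: remember processed ids (state) and skip repeats.
    static int sumDeduplicated(List<Event> delivered) {
        Set<String> seen = new HashSet<>();
        int total = 0;
        for (Event e : delivered) {
            if (seen.add(e.id)) {
                total += e.amount;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Event "a" is delivered twice, e.g. after a retry.
        List<Event> delivered = Arrays.asList(
                new Event("a", 10), new Event("b", 5), new Event("a", 10));
        System.out.println("naive: " + sumNaive(delivered));
        System.out.println("deduplicated: " + sumDeduplicated(delivered));
    }
}
```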
Stream processing
● Apache Storm
● Apache Storm Trident
● Alibaba JStorm
● Twitter Heron
● Apache Spark Streaming
● Apache Flink
● Apache Beam
● Apache Kafka Streams
● Apache Samza
● Apache Gearpump
● Apache Apex
● Apache Ignite Streaming
● Apache S4
● ...
Resource management
● Apache YARN
● Apache Mesos (1.0.1!)
● Apache Slider - deploy existing apps on YARN
● Apache Myriad - YARN on Mesos
● DC/OS
Source: https://docs.mesosphere.com/wp-content/uploads/2016/04/dashboard-ee-600x395@2x.gif
Analysis
SQL Engines & Querying
● Apache Hive
● Apache Pig
● Apache HAWQ
● Apache Impala
● Apache Phoenix
● Apache Spark SQL
● Apache Drill
● Facebook Presto
● ...
Machine Learning
● Apache Mahout
● Apache Samoa
● Spark MLlib
● FlinkML
● H2O
● TensorFlow
Notebooks
● IPython
● Jupyter
● Apache Zeppelin
Source: https://zeppelin.apache.org/assets/themes/zeppelin/img/notebook.png
Hadoop-related
● Apache Sqoop
● Apache Flume
● Apache Oozie
● Hue
● Apache HDFS
● Apache Ambari
● Apache Knox
● Apache ZooKeeper
Awesome Big Data
https://github.com/onurakpolat/awesome-bigdata
Conclusions
● There is a lot of it!
● https://pixelastic.github.io/pokemonorbigdata/
● If you want to learn, start with the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka)
Articles & references
● https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-technologies/
● http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-frameworks-part-1
● https://dcos.io/
● https://www.oreilly.com/ideas/a-tale-of-two-clusters-mesos-and-yarn
● http://spark.apache.org/
● https://flink.apache.org/
● http://www.51zero.com/blog/2015/12/13/why-apache-flink-is-the-4th-generation-of-big-data-analytics-frameworks
● http://www.slideshare.net/AndyPiper1/reactconf-2014-event-stream-processing
● https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Thank you, Q&A?
@mmatloka
http://www.slideshare.net/softwaremill
https://softwaremill.com/blog/

More Related Content

What's hot

presto-at-netflix-hadoop-summit-15
presto-at-netflix-hadoop-summit-15presto-at-netflix-hadoop-summit-15
presto-at-netflix-hadoop-summit-15Zhenxiao Luo
 
Exploring MongoDB & Elasticsearch: Better Together
Exploring MongoDB & Elasticsearch: Better TogetherExploring MongoDB & Elasticsearch: Better Together
Exploring MongoDB & Elasticsearch: Better TogetherObjectRocket
 
Elastic Stack Introduction
Elastic Stack IntroductionElastic Stack Introduction
Elastic Stack IntroductionVikram Shinde
 
Programmatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidProgrammatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidCharles Allen
 
Engineering fast indexes
Engineering fast indexesEngineering fast indexes
Engineering fast indexesDaniel Lemire
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Omid Vahdaty
 
Apache Spark part of Eindhoven Java Meetup
Apache Spark part of Eindhoven Java MeetupApache Spark part of Eindhoven Java Meetup
Apache Spark part of Eindhoven Java MeetupPatrick Deenen
 
Cascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User GroupCascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User Groupnathanmarz
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Icebergkbajda
 
Argus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceArgus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceHBaseCon
 
Unifying Events and Logs into the Cloud
Unifying Events and Logs into the CloudUnifying Events and Logs into the Cloud
Unifying Events and Logs into the CloudEduardo Silva Pereira
 
Real-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and DruidReal-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and DruidJan Graßegger
 
Chronix Time Series Database - The New Time Series Kid on the Block
Chronix Time Series Database - The New Time Series Kid on the BlockChronix Time Series Database - The New Time Series Kid on the Block
Chronix Time Series Database - The New Time Series Kid on the BlockQAware GmbH
 
Why Your MongoDB Needs Redis
Why Your MongoDB Needs RedisWhy Your MongoDB Needs Redis
Why Your MongoDB Needs RedisItamar Haber
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataRostislav Pashuto
 
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's ScalePinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's ScaleSeunghyun Lee
 

What's hot (20)

Cascalog
CascalogCascalog
Cascalog
 
Bigdata : Big picture
Bigdata : Big pictureBigdata : Big picture
Bigdata : Big picture
 
Open source data ingestion
Open source data ingestionOpen source data ingestion
Open source data ingestion
 
presto-at-netflix-hadoop-summit-15
presto-at-netflix-hadoop-summit-15presto-at-netflix-hadoop-summit-15
presto-at-netflix-hadoop-summit-15
 
Exploring MongoDB & Elasticsearch: Better Together
Exploring MongoDB & Elasticsearch: Better TogetherExploring MongoDB & Elasticsearch: Better Together
Exploring MongoDB & Elasticsearch: Better Together
 
Elastic Stack Introduction
Elastic Stack IntroductionElastic Stack Introduction
Elastic Stack Introduction
 
Programmatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidProgrammatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & Druid
 
Engineering fast indexes
Engineering fast indexesEngineering fast indexes
Engineering fast indexes
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...
 
NATE-Central-Log
NATE-Central-LogNATE-Central-Log
NATE-Central-Log
 
Apache Spark part of Eindhoven Java Meetup
Apache Spark part of Eindhoven Java MeetupApache Spark part of Eindhoven Java Meetup
Apache Spark part of Eindhoven Java Meetup
 
Cascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User GroupCascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User Group
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
Argus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceArgus Production Monitoring at Salesforce
Argus Production Monitoring at Salesforce
 
Unifying Events and Logs into the Cloud
Unifying Events and Logs into the CloudUnifying Events and Logs into the Cloud
Unifying Events and Logs into the Cloud
 
Real-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and DruidReal-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and Druid
 
Chronix Time Series Database - The New Time Series Kid on the Block
Chronix Time Series Database - The New Time Series Kid on the BlockChronix Time Series Database - The New Time Series Kid on the Block
Chronix Time Series Database - The New Time Series Kid on the Block
 
Why Your MongoDB Needs Redis
Why Your MongoDB Needs RedisWhy Your MongoDB Needs Redis
Why Your MongoDB Needs Redis
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of data
 
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's ScalePinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
 

Viewers also liked

Machine learning by example
Machine learning by exampleMachine learning by example
Machine learning by exampleSoftwareMill
 
3 kroki do sukcesu płaskiej i zdalnej firmy | SoftwareMill
3 kroki do sukcesu płaskiej i zdalnej firmy | SoftwareMill3 kroki do sukcesu płaskiej i zdalnej firmy | SoftwareMill
3 kroki do sukcesu płaskiej i zdalnej firmy | SoftwareMillSoftwareMill
 
Jednorożce to kobiety a nie firmy. O √kobiecym w STEM
Jednorożce to kobiety a nie firmy. O √kobiecym w STEMJednorożce to kobiety a nie firmy. O √kobiecym w STEM
Jednorożce to kobiety a nie firmy. O √kobiecym w STEMSoftwareMill
 
Scalatra - Scalar Mini
Scalatra  - Scalar MiniScalatra  - Scalar Mini
Scalatra - Scalar MiniSoftwareMill
 
An Introduction to Akka
An Introduction to AkkaAn Introduction to Akka
An Introduction to AkkaSoftwareMill
 
IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...
IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...
IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...In-Memory Computing Summit
 
Spark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System AdministratorsSpark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System AdministratorsAlpine Data
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesCarol McDonald
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine LearningCarol McDonald
 
Apache Gearpump - Lightweight Real-time Streaming Engine
Apache Gearpump - Lightweight Real-time Streaming EngineApache Gearpump - Lightweight Real-time Streaming Engine
Apache Gearpump - Lightweight Real-time Streaming EngineTianlun Zhang
 
Open Source Security Tools for Big Data
Open Source Security Tools for Big DataOpen Source Security Tools for Big Data
Open Source Security Tools for Big DataRommel Garcia
 
From spaghetti with no `src/test` to green CI and well-sleeping developers
From spaghetti with no `src/test` to green CI and well-sleeping developersFrom spaghetti with no `src/test` to green CI and well-sleeping developers
From spaghetti with no `src/test` to green CI and well-sleeping developersSoftwareMill
 
How secure your web framework is?
How secure your web framework is?How secure your web framework is?
How secure your web framework is?SoftwareMill
 
Code reviews - Human Factor
Code reviews - Human FactorCode reviews - Human Factor
Code reviews - Human FactorSoftwareMill
 
Emancypacja pracowników. Dlaczego spaliliśmy karty zakładowe?
Emancypacja pracowników. Dlaczego spaliliśmy karty zakładowe?Emancypacja pracowników. Dlaczego spaliliśmy karty zakładowe?
Emancypacja pracowników. Dlaczego spaliliśmy karty zakładowe?SoftwareMill
 
Proste REST API z użyciem play i slick
Proste REST API z użyciem play i slickProste REST API z użyciem play i slick
Proste REST API z użyciem play i slickSoftwareMill
 
What is most important 
in cooperation with external software developers? Par...
What is most important 
in cooperation with external software developers? Par...What is most important 
in cooperation with external software developers? Par...
What is most important 
in cooperation with external software developers? Par...SoftwareMill
 
Machine learning with Spark
Machine learning with SparkMachine learning with Spark
Machine learning with SparkKhalid Salama
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Libraryjeykottalam
 

Viewers also liked (20)

Origins of free
Origins of freeOrigins of free
Origins of free
 
Machine learning by example
Machine learning by exampleMachine learning by example
Machine learning by example
 
3 kroki do sukcesu płaskiej i zdalnej firmy | SoftwareMill
3 kroki do sukcesu płaskiej i zdalnej firmy | SoftwareMill3 kroki do sukcesu płaskiej i zdalnej firmy | SoftwareMill
3 kroki do sukcesu płaskiej i zdalnej firmy | SoftwareMill
 
Jednorożce to kobiety a nie firmy. O √kobiecym w STEM
Jednorożce to kobiety a nie firmy. O √kobiecym w STEMJednorożce to kobiety a nie firmy. O √kobiecym w STEM
Jednorożce to kobiety a nie firmy. O √kobiecym w STEM
 
Scalatra - Scalar Mini
Scalatra  - Scalar MiniScalatra  - Scalar Mini
Scalatra - Scalar Mini
 
An Introduction to Akka
An Introduction to AkkaAn Introduction to Akka
An Introduction to Akka
 
IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...
IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...
IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...
 
Spark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System AdministratorsSpark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System Administrators
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision Trees
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine Learning
 
Apache Gearpump - Lightweight Real-time Streaming Engine
Apache Gearpump - Lightweight Real-time Streaming EngineApache Gearpump - Lightweight Real-time Streaming Engine
Apache Gearpump - Lightweight Real-time Streaming Engine
 
Open Source Security Tools for Big Data
Open Source Security Tools for Big DataOpen Source Security Tools for Big Data
Open Source Security Tools for Big Data
 
From spaghetti with no `src/test` to green CI and well-sleeping developers
From spaghetti with no `src/test` to green CI and well-sleeping developersFrom spaghetti with no `src/test` to green CI and well-sleeping developers
From spaghetti with no `src/test` to green CI and well-sleeping developers
 
How secure your web framework is?
How secure your web framework is?How secure your web framework is?
How secure your web framework is?
 
Code reviews - Human Factor
Code reviews - Human FactorCode reviews - Human Factor
Code reviews - Human Factor
 
Emancypacja pracowników. Dlaczego spaliliśmy karty zakładowe?
Emancypacja pracowników. Dlaczego spaliliśmy karty zakładowe?Emancypacja pracowników. Dlaczego spaliliśmy karty zakładowe?
Emancypacja pracowników. Dlaczego spaliliśmy karty zakładowe?
 
Proste REST API z użyciem play i slick
Proste REST API z użyciem play i slickProste REST API z użyciem play i slick
Proste REST API z użyciem play i slick
 
What is most important 
in cooperation with external software developers? Par...
What is most important 
in cooperation with external software developers? Par...What is most important 
in cooperation with external software developers? Par...
What is most important 
in cooperation with external software developers? Par...
 
Machine learning with Spark
Machine learning with SparkMachine learning with Spark
Machine learning with Spark
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Library
 

Similar to Small intro to Big Data - Old version

Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...spinningmatt
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...javier ramirez
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"Giivee The
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irdatastack
 
Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916Ian Pointer
 
Spark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaSpark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaJose Mº Muñoz
 
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...Alexey Zinoviev
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingAnimesh Chaturvedi
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014cdmaxime
 
Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache SparkIndicThreads
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchEdward Capriolo
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型wang xing
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Mark Rittman
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAsLuis Marques
 
Big Data in the Real World
Big Data in the Real WorldBig Data in the Real World
Big Data in the Real WorldMark Kromer
 

Similar to Small intro to Big Data - Old version (20)

Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Apache Spark & MLlib
Apache Spark & MLlibApache Spark & MLlib
Apache Spark & MLlib
 
OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916
 
Spark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaSpark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest Córdoba
 
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
 
Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache Spark
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batch
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAs
 
Big Data in the Real World
Big Data in the Real WorldBig Data in the Real World
Big Data in the Real World
 

More from SoftwareMill

Growing Oxen: channel operators and retries
Growing Oxen: channel operators and retriesGrowing Oxen: channel operators and retries
Growing Oxen: channel operators and retriesSoftwareMill
 
How To Survive a Live-Coding Session
How To Survive a Live-Coding SessionHow To Survive a Live-Coding Session
How To Survive a Live-Coding SessionSoftwareMill
 
Goryle i ser szwajcarski. Czego medycyna ratunkowa może Cię nauczyć o tworzen...
Goryle i ser szwajcarski. Czego medycyna ratunkowa może Cię nauczyć o tworzen...Goryle i ser szwajcarski. Czego medycyna ratunkowa może Cię nauczyć o tworzen...
Goryle i ser szwajcarski. Czego medycyna ratunkowa może Cię nauczyć o tworzen...SoftwareMill
 
Have you ever wondered about code review?
Have you ever wondered about code review?Have you ever wondered about code review?
Have you ever wondered about code review?SoftwareMill
 
Reactive Integration with Akka Streams and Alpakka
Reactive Integration with Akka Streams and AlpakkaReactive Integration with Akka Streams and Alpakka
Reactive Integration with Akka Streams and AlpakkaSoftwareMill
 
W świecie botów czyli po co nam SI
W świecie botów czyli po co nam SIW świecie botów czyli po co nam SI
W świecie botów czyli po co nam SISoftwareMill
 
Small intro to Big Data
Small intro to Big DataSmall intro to Big Data
Small intro to Big DataSoftwareMill
 
Out-of-the-box Reactive Streams with Java 9
Out-of-the-box Reactive Streams with Java 9Out-of-the-box Reactive Streams with Java 9
Out-of-the-box Reactive Streams with Java 9SoftwareMill
 
Hiring, Bots and Beer. (Hiring in the IT industry)
Hiring, Bots and Beer. (Hiring in the IT industry) Hiring, Bots and Beer. (Hiring in the IT industry)
Hiring, Bots and Beer. (Hiring in the IT industry) SoftwareMill
 
Teal Is The New Black
Teal Is The New BlackTeal Is The New Black
Teal Is The New BlackSoftwareMill
 
Windowing data in big data streams
Windowing data in big data streamsWindowing data in big data streams
Windowing data in big data streamsSoftwareMill
 
Kafka as a message queue
Kafka as a message queueKafka as a message queue
Kafka as a message queueSoftwareMill
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to CassandraSoftwareMill
 
Cassandra - how to fail?
Cassandra - how to fail?Cassandra - how to fail?
Cassandra - how to fail?SoftwareMill
 
How to manage in a flat organized, remote and transparent company
How to manage in a flat organized, remote and transparent companyHow to manage in a flat organized, remote and transparent company
How to manage in a flat organized, remote and transparent companySoftwareMill
 
Performance tests with gatling
Performance tests with gatlingPerformance tests with gatling
Performance tests with gatlingSoftwareMill
 
Projekt z punktu widzenia UX designera
Projekt z punktu widzenia UX designeraProjekt z punktu widzenia UX designera
Projekt z punktu widzenia UX designeraSoftwareMill
 

Small intro to Big Data - Old version

  • 1. Small intro to Big Data Michał Matłoka @mmatloka
  • 2. Outline 1. What is Big Data? 2. Storing 3. Batch & Streams processing 4. Resource Managers 5. Machine Learning 6. Analysis & Visualization 7. Other
  • 3. What is Big Data? ● Volume ● Velocity ● Variety
  • 4. How big is Big?
  • 7. CAP theorem (Brewer’s theorem) In a distributed system you can only have two of the three guarantees: ● Consistency ● Availability ● Partition Tolerance
  • 8. Relational scaling (horizontal) Example limitations: ● Max 48 nodes ● Read-only nodes ● Cross-shard joins… ● Auto-increments ● Distributed transactions, possible, but… It can work!
  • 9. You don’t always need ACID
  • 10. BASE might be enough
  • 11. NoSQL (Not only SQL) ● Key-value (Redis, Dynamo, ...) ● Column (Cassandra, HBase, ...) ● Document (MongoDB, …) ● Graph (Neo4j, …) ● Multi-model (OrientDB, …) Apple - 115k Cassandra nodes with over 10PB of data!
  • 13. Batch Processing ● Processes data from one or more sources covering a longer period of time (e.g. day, month) ● Source: db, Apache Parquet, ... ● Not real-time ● Can take hours or more
  • 14. Apache Hadoop ● Based on a Google paper ● First release in 2006! ● Map -> (shuffle) -> Reduce ● Was the beginning of many projects MapReduce
  • 15. public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{ private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } } Hadoop Wordcount - part I Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
  • 16. public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } } Hadoop Wordcount - part II Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
  • 17. public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = Job.getInstance(conf, "word count"); job.setJarByClass(WordCount.class); job.setMapperClass(TokenizerMapper.class); job.setCombinerClass(IntSumReducer.class); job.setReducerClass(IntSumReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } Hadoop Wordcount - part III Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
  • 18. Apache Spark RDD (Resilient Distributed Dataset) DAG (Directed acyclic graph) ● RDD - map, filter, count etc ● Spark SQL ● MLlib ● GraphX ● Spark Streaming ● API: Scala, Java, Python, R*
  • 19. val textFile = sc.textFile("hdfs://...") val counts = textFile .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...") Spark Wordcount Source: http://spark.apache.org/examples.html
  • 20. Apache Flink DAGs with iterations ● Batch & native streaming ● FlinkML ● Table API & SQL (Beta) ● Gelly - graph analytics ● FlinkCEP - detect patterns in data streams ● Compatible with Apache Hadoop and Apache Storm APIs ● API: Scala, Java, Python*
  • 21. Stream processing ● Near real-time/real time ● Processing (usually) does not end ● Source: files, Apache Kafka, Socket, Akka Actors, Twitter, RabbitMQ etc ● Event time vs processing time ● Windows - fixed, sliding, session ● Watermarks ● State
  • 22. Stream processing - Differences ● Native/micro-batch ● Latency ● Throughput ● Delivery guarantees ● Resource managers ● API - compositional/declarative ● Maturity
  • 23. Stream processing ● Apache Storm ● Apache Storm Trident ● Alibaba JStorm ● Twitter Heron ● Apache Spark Streaming ● Apache Flink ● Apache Beam ● Apache Kafka Streams ● Apache Samza ● Apache Gearpump ● Apache Apex ● Apache Ignite Streaming ● Apache S4 ● ...
  • 24. Resource management ● Apache YARN ● Apache Mesos (1.0.1!) ● Apache Slider - deploy existing apps on YARN ● Apache Myriad - YARN on Mesos ● DC/OS
  • 26. Analysis SQL Engines & Querying ● Apache Hive ● Apache Pig ● Apache HAWQ ● Apache Impala ● Apache Phoenix ● Apache Spark SQL ● Apache Drill ● Facebook Presto ● ...
  • 27. Machine Learning ● Apache Mahout ● Apache Samoa ● Spark MLlib ● FlinkML ● H2O ● TensorFlow
  • 30. Hadoop-related ● Apache Sqoop ● Apache Flume ● Apache Oozie ● Hue ● Apache HDFS ● Apache Ambari ● Apache Knox ● Apache ZooKeeper
  • 32. Conclusions ● There is a lot of it! ● https://pixelastic.github.io/pokemonorbigdata/ ● If you want to learn, start with SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka)
  • 33. Articles & references ● https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-technologies/ ● http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-frameworks-part-1 ● https://dcos.io/ ● https://www.oreilly.com/ideas/a-tale-of-two-clusters-mesos-and-yarn ● http://spark.apache.org/ ● https://flink.apache.org/ ● http://www.51zero.com/blog/2015/12/13/why-apache-flink-is-the-4th-generation-of-big-data-analytics-frameworks ● http://www.slideshare.net/AndyPiper1/reactconf-2014-event-stream-processing ● https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed ● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 ● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
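
The fixed windows and event time mentioned on the stream processing slide (slide 21) can be sketched in plain Java. This is only a minimal illustration under stated assumptions: the class name, method names and sample timestamps are hypothetical and do not come from any of the frameworks listed in the deck. It assigns each event to a window by the time the event occurred, not by the order in which it arrived.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of any streaming framework.
class FixedWindows {

    // Start (in ms) of the fixed window that contains the given event time.
    static long windowStart(long eventTimeMs, long windowSizeMs) {
        return Math.floorDiv(eventTimeMs, windowSizeMs) * windowSizeMs;
    }

    // Counts events per fixed window; the map key is the window start.
    static Map<Long, Integer> countPerWindow(long[] eventTimesMs, long windowSizeMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long t : eventTimesMs) {
            counts.merge(windowStart(t, windowSizeMs), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The last event arrives out of order but still lands in the first
        // window, because grouping uses event time rather than arrival order.
        long[] eventTimesMs = {1_000, 4_500, 12_000, 3_000};
        System.out.println(countPerWindow(eventTimesMs, 5_000));
    }
}
```

A real engine also has to decide when a window is complete and how long to keep its state, which is where the watermarks and state bullets on the same slide come in; this sketch simply processes a finished array.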