Hadoop Summit
San Jose, California
June 28th 2016
Analysis of Major Trends in
Big Data Analytics
Slim Baltagi
Director, Enterprise Architecture
Capital One Financial Corporation
Welcome!
About me:
• I’m currently director of Enterprise Architecture at Capital One: a
top 10 US financial corporation based in McLean, VA.
• I have over 20 years of IT experience.
• I have over 7 years of Big Data experience: Engineer, Architect,
Evangelist, Blogger, Thought Leader, Speaker, Organizer of Apache
Flink meetups in many countries, Creator and maintainer of the Big
Data Knowledge Base: http://SparkBigData.com with over 7,000
categorized web resources about Hadoop, Spark, Flink, …
Thanks: This talk won the community vote of the ‘Future
of Apache Hadoop’ track. Thanks to all of you who voted
for this talk, are attending it now, or are reading these slides.
Disclaimer: This is a vendor-independent talk that
expresses my own opinions. I am neither endorsing nor
promoting any product or vendor mentioned in this talk.
Agenda
1. Portability between Big Data Execution
Engines
2. Emergence of stream analytics
3. In-Memory analytics
4. Rapid Application Development of Big Data
applications
5. Open sourcing Machine Learning systems by
tech giants
6. Hybrid Cloud Computing
What is a typical Big Data Analytics Stack:
Hadoop, Spark, Flink, …?
1. Portability between Big Data Execution Engines
If you have an existing Big Data application based on
MapReduce and you want to benefit from a different
execution engine such as Tez, Spark or Flink, you might
need to:
• Reuse some of your existing code, such as mapper and
reducer functions.
• Leverage a ‘compatibility layer’ to run your existing
Big Data application on the new engine. Example:
Hadoop Compatibility Layer from Flink
• Switch to a different engine if the tool you used
supports it. Example: Hive/Pig on Tez, Hive/Pig on
Spark, Sqoop on Spark, Cascading on Flink.
• Rewrite your Big Data application!
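The first option, reusing mapper and reducer functions, can be sketched as follows. This is a minimal illustration that assumes the business logic is kept in plain functions with no engine-specific types; the names `mapper`, `reducer` and `run_locally` are hypothetical, and `run_locally` only stands in for a real execution engine.

```python
from collections import defaultdict
from itertools import chain


# Hypothetical engine-independent business logic (illustrative names).
def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reducer(word, counts):
    """Combine all counts emitted for one word."""
    return word, sum(counts)


def run_locally(lines):
    """Stand-in for an execution engine: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(line) for line in lines):
        groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


word_counts = run_locally(["big data", "Big Data analytics"])
print(word_counts)
```

Because `mapper` and `reducer` know nothing about the driver that calls them, the same two functions could in principle be wrapped by another engine's map and reduce operators instead of being rewritten.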
1. Portability between Big Data Execution Engines
Apache Beam (unified Batch and Stream processing) is
a new Apache incubator project based on years of
experience developing Big Data infrastructure
(MapReduce, FlumeJava, MillWheel) within Google
http://beam.incubator.apache.org/
Apache Beam provides a unified API for Batch and
Stream processing and also multiple runners.
Beam programs become portable across multiple
runtime environments, both proprietary (e.g., Google
Cloud Dataflow) and open-source (e.g., Flink, Spark).
Apache Beam web resources:
http://sparkbigdata.com/component/tags/tag/67
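The idea of one program definition executed by multiple runners can be sketched in a few lines. This is a toy model and not the Apache Beam SDK: the classes `Pipeline`, `BatchRunner` and `StreamingRunner` are hypothetical names chosen only to illustrate the separation between describing the processing and executing it.

```python
# Hypothetical mini-API for illustration only; this is NOT the Apache Beam SDK.
class Pipeline:
    """Engine-neutral description of element-wise processing steps."""

    def __init__(self):
        self.steps = []

    def map(self, fn):
        self.steps.append(("map", fn))
        return self

    def filter(self, fn):
        self.steps.append(("filter", fn))
        return self

    def apply(self, element):
        """Run one element through every step; return [] if it is filtered out."""
        for kind, fn in self.steps:
            if kind == "map":
                element = fn(element)
            elif not fn(element):
                return []
        return [element]


class BatchRunner:
    """Processes a bounded collection and returns all results at once."""

    def run(self, pipeline, bounded_input):
        return [out for element in bounded_input for out in pipeline.apply(element)]


class StreamingRunner:
    """Processes elements one by one and yields each result as soon as it is ready."""

    def run(self, pipeline, unbounded_input):
        for element in unbounded_input:
            yield from pipeline.apply(element)


# The pipeline is defined once and handed to either runner unchanged.
pipeline = Pipeline().filter(lambda amount: amount > 0).map(lambda amount: amount * 2)
batch_result = BatchRunner().run(pipeline, [10, -5, 20])
stream_result = list(StreamingRunner().run(pipeline, iter([10, -5, 20])))
print(batch_result, stream_result)
```

The design point is that the pipeline object carries no knowledge of how it will be executed, which is what makes a program portable across runtime environments.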
2. Emergence of stream analytics
Stonebraker et al. predicted in 2005 that stream
processing would become increasingly important
and attributed this to the ‘sensorization of the real
world’: everything of material significance on the
planet gets ‘sensor-tagged’ and reports its state or
location in real time. http://cs.brown.edu/~ugur/8rulesSigRec.pdf
I think stream processing is becoming important not
only because of this sensorization of the real world but
also because of the following factors:
1. Data streams
2. Technology
3. Business
4. Consumers
2. Emergence of stream analytics
[Figure: four factors driving the emergence of stream analytics:
1 Data Streams, 2 Technology, 3 Business, 4 Consumers]
2. Emergence of stream analytics
1 Data Streams
• Real-world data is available as a series of events that
are continuously produced by a variety of
applications and disparate systems inside and
outside the enterprise.
• Examples:
  • Sensor networks data
  • Web logs
  • Database transactions
  • System logs
  • Tweets and social media data
  • Click streams
  • Mobile apps data
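To make the notion of a continuously produced series of events concrete, here is a minimal sketch that counts events per source in fixed time windows. It assumes events arrive as `(timestamp_seconds, source)` pairs in timestamp order; the function name `tumbling_window_counts` and the sample events are illustrative, not taken from any particular framework.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per source in fixed, non-overlapping time windows.

    `events` is any iterable of (timestamp_seconds, source) pairs in
    timestamp order. A result is yielded as soon as its window closes,
    so the input may be an unbounded generator.
    """
    current_window, counts = None, Counter()
    for timestamp, source in events:
        window = timestamp // window_seconds
        if current_window is not None and window != current_window:
            yield current_window * window_seconds, dict(counts)
            counts = Counter()
        current_window = window
        counts[source] += 1
    if counts:
        # Flush the last, still-open window when a bounded input ends.
        yield current_window * window_seconds, dict(counts)


events = [(0, "web_log"), (3, "sensor"), (7, "sensor"),
          (12, "click"), (14, "sensor"), (31, "tweet")]
windows = list(tumbling_window_counts(events, window_seconds=10))
print(windows)
```

Windows that receive no events (here the one starting at second 20) produce no output in this sketch; a production stream processor would also have to deal with late and out-of-order events.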
2. Emergence of stream analytics
2 Technology
Simplified data architecture with Apache Kafka as a
major innovation and backbone of stream
architectures.
Rapidly maturing open source stream analytics tools:
Apache Flink, Apache Apex, Spark Streaming, Kafka Streams,
Apache Samza, Apache Storm, Apache Gearpump, Heron, …
Cloud services for stream processing: Google Cloud
Dataflow, Microsoft’s Azure Stream Analytics, Amazon Kinesis
Streams, IBM InfoSphere Streams, …
Vendors innovating in this space: Confluent, Data
Artisans, Databricks, MapR, Hortonworks, StreamSets, …
More mobile devices than human beings!
2. Emergence of stream analytics
3 Business
Challenges:
Lag between data creation and actionable insights.
Infrastructure is idle most of the time
Web and mobile application growth, new types/sources
of data.
Organizations need to shift from a reactive approach
to a more proactive approach to interactions with
customers, suppliers and employees.
2. Emergence of stream analytics
3 Business
Opportunities:
Embracing stream analytics helps organizations with
faster time to insight, competitive advantages and
operational efficiency in a wide range of verticals.
With stream analytics, new startups are challenging, or
will challenge, established companies. Example:
Pay-As-You-Go insurance or Usage-Based Auto Insurance.
Speed is said to have become the new currency of
business.
2. Emergence of stream analytics
4 Consumers
Consumers expect everything to be online and
immediately accessible through mobile
applications.
Mobile, always-on consumers increasingly demand
instant responses from enterprise applications, as
they are used to in mobile applications from social
networks such as Twitter, Facebook, LinkedIn …
The younger generation, who grew up with video gaming
and are accustomed to real-time interaction, are now
themselves a growing class of consumers.
2. Emergence of stream analytics
• Financial services
• Telecommunications
• Online gaming systems
• Security & Intelligence
• Advertisement serving
• Sensor Networks
• Social Media
• Healthcare
• Oil & Gas
• Retail & eCommerce
• Transportation and logistics
2. Emergence of stream analytics
End-to-end stream analytics solution architecture
[Figure: architecture in three stages: Sourcing & Integration,
Analytics & Processing, Serving & Consuming. Components shown:
Apps, Sensors, Devices, Other Sources; Event Collector & Broker;
Stream Processor; Data Lake; Advanced Analytics & Machine Learning;
Real-Time Notifications; Real-Time Decisions; Dashboards;
Business Applications (e.g. Enterprise Command Center);
Personal Mobile Applications; Business System Backend]
3. In-Memory Analytics
While In-Memory Analytics are not new, they are the
focus of renewed attention thanks to:
• the availability of new memory that could easily fit
most active data sets
• the maturing or newly available in-memory open source
tools in many categories such as:
  • Memory-centric distributed file system
  • Columnar data format
  • Key-value data stores
  • IMDG: In-Memory Data Grids
  • Distributed cache
  • Very large hashmaps
In the next couple of slides, I will share a few examples.
3. In-Memory Analytics
Alluxio http://alluxio.org (formerly known as Tachyon) is
an open source, memory-speed virtual distributed
storage system. Examples of its usage patterns:
• Accelerating Big Data Analytics workloads by
prefetching views and creating caches on demand.
• Sharing data between applications by writing to
Alluxio’s in-memory data store and reading it back at
far greater speed.
RocksDB https://github.com/facebook/rocksdb/ is an open
source library from Facebook that provides an
embeddable, persistent key-value store. It is suited for
fast storage of data in RAM and on flash drives. It is used
as a state backend by Samza, Flink, Kafka Streams, …
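The role of an embeddable, persistent key-value store as a state backend can be sketched with Python's standard `shelve` module. This is only an analogy: `shelve` is not RocksDB and makes no comparable performance claims, and the function `count_by_key` and the key names are hypothetical.

```python
import os
import shelve
import tempfile


def count_by_key(store_path, keys):
    """Update per-key counters held in an embedded, on-disk key-value store."""
    with shelve.open(store_path) as state:
        for key in keys:
            state[key] = state.get(key, 0) + 1
        return dict(state)


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "operator_state")
    first_run = count_by_key(path, ["user_a", "user_b", "user_a"])
    # Reopening the store simulates an operator restart: the counts survive.
    second_run = count_by_key(path, ["user_a"])

print(first_run, second_run)
```

The store lives inside the process that uses it (embeddable) and its contents outlive that process (persistent), which are the two properties a stream processor relies on when it keeps operator state locally.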
3. In-Memory Analytics
Apache Arrow (http://arrow.apache.org/) for columnar in-
memory analytics.
• Apache Arrow enables execution engines to take
advantage of the latest SIMD (Single Instruction, Multiple
Data) operations included in modern processors, for
native vectorized optimization of analytical data
processing.
• Columnar layout of data also allows for a better use of
CPU caches by placing all data relevant to a column
operation in as compact a format as possible.
• An advantage of Apache Arrow is that systems utilizing it
as a common memory format have no overhead for
cross-system data communication and can also share
functionality.
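The difference between a row layout and a columnar layout can be illustrated with the standard `array` module. This sketch does not use Apache Arrow and does not itself trigger SIMD execution; it only shows how a column operation touches one compact, typed buffer instead of every field of every record. The trade records are made-up sample data.

```python
from array import array

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"trade_id": 1, "price": 10.5, "quantity": 100},
    {"trade_id": 2, "price": 11.0, "quantity": 50},
    {"trade_id": 3, "price": 9.75, "quantity": 200},
]

# Column-oriented layout: one contiguous, typed buffer per field.
columns = {
    "trade_id": array("q", (row["trade_id"] for row in rows)),
    "price": array("d", (row["price"] for row in rows)),
    "quantity": array("q", (row["quantity"] for row in rows)),
}

# Row layout: the aggregation walks every record and picks out one field.
row_total = sum(row["quantity"] for row in rows)

# Columnar layout: the aggregation scans a single compact buffer.
column_total = sum(columns["quantity"])

print(row_total, column_total, columns["quantity"].itemsize)
```

Both totals are identical; what changes is the memory the aggregation has to touch, which is the property that columnar formats exploit for cache efficiency and vectorized execution.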
4. Rapid Application Development of Big
Data applications
[Figure: four enablers of Rapid Application Development of
Big Data Analytics: 1 APIs, 2 Notebooks/Shells, 3 GUIs, 4 Microservices]
4. Rapid Application Development of Big
Data applications
1 APIs
• Apache Spark and Apache Flink provide high-level,
easy-to-use APIs compared to Hadoop MapReduce.
• Apache Beam is a new open source project from
Google that attempts to unify data processing
frameworks with a core API, allowing easy portability
between execution engines.
• Use the Apache Beam unified API for batch and streaming
and then run on a local runner, Apache Spark, Apache
Flink, …
• The biggest advantage is in developer productivity and
ease of migration between processing engines.
4. Rapid Application Development of Big
Data applications
2 Shells or Notebooks
• REPL (Read Evaluate Print Loop) interpreter
• Interactive queries
• Explore data quickly
• Sketch out your ideas in the shell to make sure you’ve
got your code right before deploying it to a cluster.
• Web-based interactive computation environment
• Collaborative data analytics and visualization tool
• Combines rich text, execution code, plots and rich
media
• Exploratory data science
• Saving and replaying of written code
4. Rapid Application Development of Big
Data applications
2 Shells or Notebooks Apache Zeppelin
4. Rapid Application Development of Big
Data applications
3 GUIs
• Apache NiFi
4. Rapid Application Development of Big
Data applications
4 Microservices:
• Microservices are an important trend in building larger
systems by:
  • decomposing their functions into relatively simple,
  single-purpose services
  • that communicate asynchronously via Apache
  Kafka as a message-passing technology that avoids
  unwanted dependencies between these services.
• This streaming architectural style provides agility,
as microservices can be built and maintained by
small, cross-functional teams.
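The decoupling described above can be sketched with two small services that only share a message channel. In this illustration an in-process `queue.Queue` stands in for a Kafka topic, which is a strong simplification (no persistence, partitions or consumer groups); the service names and events are hypothetical.

```python
import queue
import threading

orders_topic = queue.Queue()   # hypothetical topic carrying order events
invoices = []                  # output of the billing service
STOP = object()                # sentinel telling the consumer to shut down


def order_service(topic):
    """Producer: publishes events and never calls the billing service directly."""
    for order_id, amount in [(1, 20), (2, 35)]:
        topic.put({"order_id": order_id, "amount": amount})
    topic.put(STOP)


def billing_service(topic, sink):
    """Consumer: reacts to events at its own pace."""
    while True:
        event = topic.get()
        if event is STOP:
            break
        sink.append(f"invoice for order {event['order_id']}: {event['amount']}")


consumer = threading.Thread(target=billing_service, args=(orders_topic, invoices))
consumer.start()
order_service(orders_topic)
consumer.join()
print(invoices)
```

Neither service imports or calls the other; each depends only on the event format, so either one can be replaced or scaled without touching its counterpart.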
5. Open sourcing Machine Learning systems
by tech giants
[Figure: machine learning systems open sourced by tech giants:
1 Facebook Torch, 2 IBM SystemML, 3 Google TensorFlow,
4 Microsoft DMTK, 5 Yahoo CaffeOnSpark, 6 Amazon DSSTNE]
5. Open sourcing Machine Learning systems
by tech giants
1 Torch http://torch.ch/ is an open source
Machine Learning library which provides a
wide range of deep learning algorithms.
Facebook donated its optimized deep learning modules to
the Torch project on January 16, 2015.
2 Apache SystemML http://systemml.apache.org/
is a distributed and declarative machine learning platform.
It was created in 2010 by IBM and donated as an open
source Apache project on November 2nd, 2015.
3 TensorFlow is an open source machine learning library
created by Google. https://www.tensorflow.org It was released
under the Apache 2.0 open source license on November 9th,
2015.
5. Open sourcing Machine Learning
systems by tech giants
4 DMTK (Distributed Machine Learning Toolkit) allows
models to be trained on multiple nodes at once.
http://www.dmtk.io/ DMTK was open sourced
by Microsoft on November 12, 2015.
5 CaffeOnSpark https://github.com/yahoo/CaffeOnSpark is an
open source machine learning library created by Yahoo. It
was open sourced on February 24th, 2016.
6 DSSTNE (Deep Scalable Sparse Tensor Network
Engine), pronounced “Destiny”, is an Amazon-developed library
for building Deep Learning (DL) Machine Learning (ML)
models. It was open sourced on May 11th, 2016.
https://github.com/amznlabs/amazon-dsstne
5. Open sourcing Machine Learning
systems by tech giants
Wider adoption of Machine Learning tools is expected at
companies beyond these tech giants, in a similar way
that MapReduce and Hadoop helped make “Big Data”
a part of just about every company’s strategy!
These tech giants are not keeping their machine
learning systems for internal use only; they are
racing to open source them, attract users and
committers and advance the entire industry.
This, combined with deployment on commodity clusters,
will accelerate such adoption. As a result, we will see
new machine learning use cases, especially ones building
on deep learning, that will transform multiple industries.
6. Hybrid Cloud Computing
Cloud is becoming mainstream and the software stack is
adapting.
Big Data applications will eventually all move to the
cloud to benefit from agility, elasticity and on-demand
computing!
Meanwhile, companies need to advance their strategy
for hybrid integration between cloud and on-premise
deployments.
Deployment of Big Data applications in a hybrid
model: on-premise and on the cloud
6. Hybrid Cloud Computing
The following are a few patterns for such hybrid
integration:
1. Replicating data from SaaS apps to existing on-
premise databases to be used by other on-premise
applications such as analytics ones.
2. Integrating SaaS applications themselves with on-
premise applications.
3. Hybrid Data Warehousing with the Cloud: move data
from on-premise data warehouse to the cloud.
4. Real-Time analytics on streaming data: depending on
your use case, you might keep your stream analytics
infrastructure directly accessible on-premise for low
latency.
Key Takeaways
1. Adopt Apache Beam for easier development and
portability between Big Data Execution Engines
2. Adopt stream analytics for faster time to insight,
competitive advantages and operational efficiency
3. Accelerate your Big Data applications with In-Memory
open source tools
4. Adopt Rapid Application Development of Big Data
applications: APIs, Notebooks, GUIs, Microservices…
5. Make Machine Learning part of your strategy or
passively watch your industry be completely
transformed!
6. Advance your strategy for hybrid integration
between cloud and on-premise deployments.
Thanks!
To all of you for attending!
Any questions?
Let’s keep in touch!
• sbaltagi@gmail.com
• @SlimBaltagi
• https://www.linkedin.com/in/slimbaltagi
37

Contenu connexe

Tendances

Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsDataWorks Summit
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Cécile Poyet
 
Integrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query EnginesIntegrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query EnginesDataWorks Summit
 
Provisioning Big Data Platform using Cloudbreak & Ambari
Provisioning Big Data Platform using Cloudbreak & AmbariProvisioning Big Data Platform using Cloudbreak & Ambari
Provisioning Big Data Platform using Cloudbreak & AmbariDataWorks Summit/Hadoop Summit
 
Enabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government dataEnabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government dataDataWorks Summit
 
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...DataWorks Summit
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streamsJoey Echeverria
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionDataWorks Summit/Hadoop Summit
 
Visualizing Big Data in Realtime
Visualizing Big Data in RealtimeVisualizing Big Data in Realtime
Visualizing Big Data in RealtimeDataWorks Summit
 
Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseDataWorks Summit
 
The Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral ProcessingThe Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral ProcessingDataWorks Summit
 
Druid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDruid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDataWorks Summit
 

Tendances (20)

Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streams
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It!
 
Integrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query EnginesIntegrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query Engines
 
Provisioning Big Data Platform using Cloudbreak & Ambari
Provisioning Big Data Platform using Cloudbreak & AmbariProvisioning Big Data Platform using Cloudbreak & Ambari
Provisioning Big Data Platform using Cloudbreak & Ambari
 
Active Learning for Fraud Prevention
Active Learning for Fraud PreventionActive Learning for Fraud Prevention
Active Learning for Fraud Prevention
 
Enabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government dataEnabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government data
 
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
 
Automated Analytics at Scale
Automated Analytics at ScaleAutomated Analytics at Scale
Automated Analytics at Scale
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streams
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
Visualizing Big Data in Realtime
Visualizing Big Data in RealtimeVisualizing Big Data in Realtime
Visualizing Big Data in Realtime
 
Big Data Ready Enterprise
Big Data Ready Enterprise Big Data Ready Enterprise
Big Data Ready Enterprise
 
Accelerating Data Warehouse Modernization
Accelerating Data Warehouse ModernizationAccelerating Data Warehouse Modernization
Accelerating Data Warehouse Modernization
 
Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data Warehouse
 
Cloudbreak - Technical Deep Dive
Cloudbreak - Technical Deep DiveCloudbreak - Technical Deep Dive
Cloudbreak - Technical Deep Dive
 
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
 
The Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral ProcessingThe Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral Processing
 
Druid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDruid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best Practices
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
From Device to Data Center to Insights
From Device to Data Center to InsightsFrom Device to Data Center to Insights
From Device to Data Center to Insights
 

En vedette

The Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture ViewThe Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture ViewDataWorks Summit/Hadoop Summit
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseDataWorks Summit/Hadoop Summit
 
Big Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyondBig Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyondDataWorks Summit/Hadoop Summit
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...DataWorks Summit/Hadoop Summit
 
Machine Learning for Any Size of Data, Any Type of Data
Machine Learning for Any Size of Data, Any Type of DataMachine Learning for Any Size of Data, Any Type of Data
Machine Learning for Any Size of Data, Any Type of DataDataWorks Summit/Hadoop Summit
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveDataWorks Summit/Hadoop Summit
 
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJIntro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJDaniel Madrigal
 
Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success DataWorks Summit/Hadoop Summit
 
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...DataWorks Summit/Hadoop Summit
 

En vedette (20)

The Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture ViewThe Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture View
 
Producing Spark on YARN for ETL
Producing Spark on YARN for ETLProducing Spark on YARN for ETL
Producing Spark on YARN for ETL
 
How to build a successful Data Lake
How to build a successful Data LakeHow to build a successful Data Lake
How to build a successful Data Lake
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouse
 
Toward Better Multi-Tenancy Support from HDFS
Toward Better Multi-Tenancy Support from HDFSToward Better Multi-Tenancy Support from HDFS
Toward Better Multi-Tenancy Support from HDFS
 
Big Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyondBig Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyond
 
SQL and Search with Spark in your browser
SQL and Search with Spark in your browserSQL and Search with Spark in your browser
SQL and Search with Spark in your browser
 
HDFS Analysis for Small Files
HDFS Analysis for Small FilesHDFS Analysis for Small Files
HDFS Analysis for Small Files
 
Apache Hive ACID Project
Apache Hive ACID ProjectApache Hive ACID Project
Apache Hive ACID Project
 
From Zero to Data Flow in Hours with Apache NiFi
From Zero to Data Flow in Hours with Apache NiFiFrom Zero to Data Flow in Hours with Apache NiFi
From Zero to Data Flow in Hours with Apache NiFi
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
 
HDFS Tiered Storage
HDFS Tiered StorageHDFS Tiered Storage
HDFS Tiered Storage
 
Machine Learning for Any Size of Data, Any Type of Data
Machine Learning for Any Size of Data, Any Type of DataMachine Learning for Any Size of Data, Any Type of Data
Machine Learning for Any Size of Data, Any Type of Data
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
 
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJIntro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
 
YARN Federation
YARN Federation YARN Federation
YARN Federation
 
Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success
 
Apache Ranger Hive Metastore Security
Apache Ranger Hive Metastore Security Apache Ranger Hive Metastore Security
Apache Ranger Hive Metastore Security
 
Workload Automation + Hadoop?
Workload Automation + Hadoop?Workload Automation + Hadoop?
Workload Automation + Hadoop?
 
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
 

Similaire à Analysis of Major Trends in Big Data Analytics

Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksDataWorks Summit/Hadoop Summit
 
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksOverview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksSlim Baltagi
 
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksOverview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksSlim Baltagi
 
ALT-F1.BE : The Accelerator (Google Cloud Platform)
ALT-F1.BE : The Accelerator (Google Cloud Platform)ALT-F1.BE : The Accelerator (Google Cloud Platform)
Analysis of Major Trends in Big Data Analytics

  • 1. Hadoop Summit San Jose, California June 28th 2016 Analysis of Major Trends in Big Data Analytics Slim Baltagi Director, Enterprise Architecture Capital One Financial Corporation
  • 2. Welcome! About me: • I’m currently director of Enterprise Architecture at Capital One: a top 10 US financial corporation based in McLean, VA. • I have over 20 years of IT experience. • I have over 7 years of Big Data experience: Engineer, Architect, Evangelist, Blogger, Thought Leader, Speaker, Organizer of Apache Flink meetups in many countries, Creator and maintainer of the Big Data Knowledge Base: http://SparkBigData.com with over 7,000 categorized web resources about Hadoop, Spark, Flink, … Thanks: This talk won the community vote of the ‘Future of Apache Hadoop’ track. Thanks to all of you who: voted for this talk, attending this talk now, reading these slides. Disclaimer: This is a vendor-independent talk that expresses my own opinions. I am not endorsing nor promoting any product or vendor mentioned in this talk.2
  • 3. Agenda 1. Portability between Big Data Execution Engines 2. Emergence of stream analytics 3. In-Memory analytics 4. Rapid Application Development of Big Data applications 5. Open sourcing Machine Learning systems by tech giants 6. Hybrid Cloud Computing
  • 4. What is a typical Big Data Analytics Stack: Hadoop, Spark, Flink, …?
  • 5. 1. Portability between Big Data Execution Engines If you have an existing Big Data application based on MapReduce and you want to benefit from a different execution engine such as Tez, Spark or Flink, you might need to: • Reuse some of your existing code such as mapper and reducer functions. Example: • Leverage a ‘compatibility layer’ to run your existing Big Data application on the new engine. Example: Hadoop Compatibility Layer from Flink. • Switch to a different engine if the tool you use supports it. Example: Hive/Pig on Tez, Hive/Pig on Spark, Sqoop on Spark, Cascading on Flink. • Rewrite your Big Data application!
  • 6. 1. Portability between Big Data Execution Engines • Apache Beam (unified Batch and Stream processing) is a new Apache incubator project based on years of experience developing Big Data infrastructure (MapReduce, FlumeJava, MillWheel) within Google: http://beam.incubator.apache.org/ • Apache Beam provides a unified API for Batch and Stream processing and also multiple runners. • Beam programs become portable across multiple runtime environments, both proprietary (e.g., Google Cloud Dataflow) and open-source (e.g., Flink, Spark). • Apache Beam web resources: http://sparkbigdata.com/component/tags/tag/67
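
The portability idea on this slide can be sketched in plain Python. This is not the Apache Beam API: `Pipeline`, `DirectRunner` and `ChunkedRunner` are hypothetical stand-ins, used only to illustrate how a job described once could be handed to different runners and give the same result.

```python
from collections import Counter
from typing import Callable, Iterable, List


class Pipeline:
    """Hypothetical engine-neutral job description: an ordered list of steps."""

    def __init__(self) -> None:
        self.steps: List[Callable[[Iterable], Iterable]] = []

    def flat_map(self, fn: Callable) -> "Pipeline":
        # Each input item may produce zero or more output items.
        self.steps.append(lambda items: (out for item in items for out in fn(item)))
        return self

    def filter(self, pred: Callable) -> "Pipeline":
        self.steps.append(lambda items: (item for item in items if pred(item)))
        return self


class DirectRunner:
    """Hypothetical runner: applies every step to the whole input in one pass."""

    def run(self, pipeline: Pipeline, data: Iterable) -> Counter:
        items = iter(data)
        for step in pipeline.steps:
            items = step(items)
        return Counter(items)


class ChunkedRunner:
    """Hypothetical runner: processes the input chunk by chunk and merges the
    partial counts, loosely imitating a distributed engine."""

    def __init__(self, chunk_size: int = 2) -> None:
        self.chunk_size = chunk_size

    def run(self, pipeline: Pipeline, data: Iterable) -> Counter:
        data = list(data)
        total: Counter = Counter()
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            total += DirectRunner().run(pipeline, chunk)
        return total


def build_word_count() -> Pipeline:
    # The job is written once, with no reference to any runner.
    return Pipeline().flat_map(str.split).filter(lambda word: len(word) > 2)


LINES = ["big data on flink", "big data on spark", "beam runs on flink and spark"]
direct = DirectRunner().run(build_word_count(), LINES)
chunked = ChunkedRunner(chunk_size=2).run(build_word_count(), LINES)
```

Both runners receive the same pipeline object; only the execution strategy differs, which is the property the slide attributes to Beam runners.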
  • 8. 2. Emergence of stream analytics Stonebraker et al. predicted in 2005 that stream processing is going to become increasingly important and attributed this to the ‘sensorization of the real world: everything of material significance on the planet get ‘sensor-tagged’ and report its state or location in real time’ (http://cs.brown.edu/~ugur/8rulesSigRec.pdf). I think stream processing is becoming important not only because of this sensorization of the real world but also because of the following factors: 1. Data streams 2. Technology 3. Business 4. Consumers
  • 9. 2. Emergence of stream analytics [Diagram: four factors behind the Emergence of Stream Analytics: 1 Data Streams, 2 Technology, 3 Business, 4 Consumers]
  • 10. 2. Emergence of stream analytics 1 Data Streams: Real-world data is available as series of events that are continuously produced by a variety of applications and disparate systems inside and outside the enterprise. Examples: • Sensor networks data • Web logs • Database transactions • System logs • Tweets and social media data • Click streams • Mobile apps data
  • 11. 2. Emergence of stream analytics 2 Technology: • Simplified data architecture with Apache Kafka as a major innovation and backbone of stream architectures. • Rapidly maturing open source stream analytics tools: Apache Flink, Apache Apex, Spark Streaming, Kafka Streams, Apache Samza, Apache Storm, Apache Gearpump, Heron, … • Cloud services for stream processing: Google Cloud Dataflow, Microsoft’s Azure Stream Analytics, Amazon Kinesis Streams, IBM InfoSphere Streams, … • Vendors innovating in this space: Confluent, Data Artisans, Databricks, MapR, Hortonworks, StreamSets, … • More mobile devices than human beings!
  • 12. 2. Emergence of stream analytics 3 Business Challenges: • Lag between data creation and actionable insights. • Infrastructure is idle most of the time. • Web and mobile application growth, new types/sources of data. • Organizations need to shift from a reactive approach to a more proactive approach in their interactions with customers, suppliers and employees.
  • 13. 2. Emergence of stream analytics 3 Business Opportunities: • Embracing stream analytics helps organizations with faster time to insight, competitive advantages and operational efficiency in a wide range of verticals. • With stream analytics, new startups are challenging, or will challenge, established companies. Example: Pay-As-You-Go insurance or Usage-Based Auto Insurance. • Speed is said to have become the new currency of business.
  • 14. 2. Emergence of stream analytics 4 Consumers: • Consumers expect everything to be online and immediately accessible through mobile applications. • Mobile, always-on consumers increasingly demand instant responses from enterprise applications, as they are used to in mobile applications from social networks such as Twitter, Facebook, LinkedIn, … • The younger generation, who grew up with video gaming and are accustomed to real-time interaction, are now themselves a growing class of consumers.
  • 15. 2. Emergence of stream analytics • Financial services • Telecommunications • Online gaming systems • Security & Intelligence • Advertisement serving • Sensor Networks • Social Media • Healthcare • Oil & Gas • Retail & eCommerce • Transportation and logistics
  • 16. 2. Emergence of stream analytics [Diagram: End-to-end stream analytics solution architecture. Layers: Sourcing & Integration, Analytics & Processing, Serving & Consuming. Components shown: Apps, Sensors, Devices, Other Sources, Event Collector & Broker, Stream Processor, Advanced Analytics & Machine Learning, Data Lake, Real-Time Notifications, Real-Time Decisions, Dashboards, Business Applications (e.g. Enterprise Command Center), Personal Mobile Applications, Business System Backend]
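
As a rough illustration of what a stream processor does with the continuous event series described in this section, the sketch below counts events per source in fixed (tumbling) time windows. It is plain Python rather than any of the engines named above; the `Event` type, its fields and the sample data are assumptions made for the example, and events are assumed to arrive in timestamp order.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, NamedTuple, Optional, Tuple


class Event(NamedTuple):
    timestamp: int  # seconds, hypothetical clock
    source: str     # e.g. "clicks" or "sensor" (hypothetical labels)


def tumbling_window_counts(
    events: Iterable[Event], size: int
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Yield (window_start, counts per source) each time a window closes.

    Results are emitted incrementally, so the input may be unbounded.
    Windows that contain no events are not emitted.
    """
    current_start: Optional[int] = None
    counts: Counter = Counter()
    for event in events:
        start = event.timestamp - event.timestamp % size
        if current_start is None:
            current_start = start
        if start != current_start:
            # The event belongs to a later window: close the current one.
            yield current_start, dict(counts)
            current_start, counts = start, Counter()
        counts[event.source] += 1
    if current_start is not None:
        # Flush the last open window when a finite input ends.
        yield current_start, dict(counts)


sample = [
    Event(1, "clicks"), Event(3, "sensor"), Event(4, "clicks"),
    Event(11, "clicks"), Event(12, "clicks"), Event(25, "sensor"),
]
windows = list(tumbling_window_counts(sample, size=10))
```

Real engines add what this sketch leaves out, such as out-of-order events, fault-tolerant state and parallel execution.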
  • 18. 3. In-Memory Analytics While In-Memory Analytics are not new, the trend is that they are the focus of renewed attention thanks to: • the availability of new memory that could easily fit most active data sets • the maturing or newly available in-memory open source tools in many categories such as:  Memory-centric distributed File System  Columnar data format  Key Value data stores  IMDG: In-Memory Data Grids  Distributed Cache  Very Large Hashmaps In the next couple slides, I will share a few examples 18
  • 19. 3. In-Memory Analytics Alluxio http://alluxio.org (formerly known as Tachyon) is an open source memory speed virtual distributed storage system. Example of its usage patterns: • Accelerate Big Data Analytics workloads by prefetching views and creating caches on demand. • Sharing data between applications by writing to Alluxio’s in-memory data store and read it back at far greater speed.  Rocks DB https://github.com/facebook/rocksdb/ An open source library from Facebook that provides an embeddable, persistent key-value store. It is suited for fast storage of data on RAM and flash drives. It is used as state backend by Samza, Flink, Kafka Streams, … 19
  • 20. 3. In-Memory Analytics Apache Arrow (http://arrow.apache.org/) for columnar in- memory analytics. • Apache Arrow enables execution engines to take advantage of the latest SIMD (Single Input Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing. • Columnar layout of data also allows for a better use of CPU caches by placing all data relevant to a column operation in as compact of a format as possible. • Apache Arrow advantages is that systems utilizing it as a common memory format have no overhead for cross-system data communication and also can share functionality. 20
  • 22. 4. Rapid Application Development of Big Data applications [Diagram: Rapid Application Development of Big Data Analytics: 1 APIs, 2 Notebooks/Shells, 3 GUIs, 4 Microservices]
  • 23. 4. Rapid Application Development of Big Data applications 1 APIs: • Apache Spark and Apache Flink provide high-level, easy-to-use APIs compared to Hadoop MapReduce. • Apache Beam is a new open source project from Google that attempts to unify data processing frameworks with a core API, allowing easy portability between execution engines. • Use the Apache Beam unified API for batch and streaming and then run on a local runner, Apache Spark, Apache Flink, … • The biggest advantage is in developer productivity and ease of migration between processing engines.
  • 24. 4. Rapid Application Development of Big Data applications 2 Shells or Notebooks • REPL (Read-Eval-Print Loop) interpreter • Interactive queries • Explore data quickly • Sketch out your ideas in the shell to make sure you’ve got your code right before deploying it to a cluster. • Web-based interactive computation environment • Collaborative data analytics and visualization tool • Combines rich text, execution code, plots and rich media • Exploratory data science • Saving and replaying of written code
  • 25. 4. Rapid Application Development of Big Data applications 2 Shells or Notebooks: Apache Zeppelin
  • 26. 4. Rapid Application Development of Big Data applications 3 GUIs: Apache NiFi
  • 27. 4. Rapid Application Development of Big Data applications 4 Microservices: • Microservices are an important trend in building larger systems by decomposing their functions into relatively simple, single-purpose services that communicate asynchronously via Apache Kafka, a message-passing technology that avoids unwanted dependencies between these services. • This streaming architectural style provides agility, as microservices can be built and maintained by small, cross-functional teams.
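
The decoupling described on this slide can be sketched with the Python standard library. The `InMemoryBroker` class below is a hypothetical stand-in for a broker such as Apache Kafka and does not use any Kafka client API; the topic names, services and order fields are also assumptions. The point is that each service only knows topic names, never the other service.

```python
import queue
import threading
from typing import Dict, Iterator, List, Tuple


class InMemoryBroker:
    """Hypothetical stand-in for a message broker: one FIFO queue per topic."""

    def __init__(self) -> None:
        self._topics: Dict[str, queue.Queue] = {}
        self._lock = threading.Lock()

    def _topic(self, name: str) -> queue.Queue:
        with self._lock:
            return self._topics.setdefault(name, queue.Queue())

    def publish(self, topic: str, message) -> None:
        self._topic(topic).put(message)

    def consume(self, topic: str) -> Iterator[dict]:
        """Yield messages until a None sentinel marks the end of the stream."""
        topic_queue = self._topic(topic)
        while True:
            message = topic_queue.get()
            if message is None:
                return
            yield message


def enrich_service(broker: InMemoryBroker) -> None:
    # Single purpose: compute the order total and republish.
    for order in broker.consume("orders"):
        broker.publish("enriched", {**order, "total": order["qty"] * order["price"]})
    broker.publish("enriched", None)  # propagate end of stream downstream


def billing_service(broker: InMemoryBroker, invoices: List[Tuple[int, float]]) -> None:
    # Single purpose: turn enriched orders into invoice records.
    for order in broker.consume("enriched"):
        invoices.append((order["id"], order["total"]))


broker = InMemoryBroker()
invoices: List[Tuple[int, float]] = []
workers = [
    threading.Thread(target=enrich_service, args=(broker,)),
    threading.Thread(target=billing_service, args=(broker, invoices)),
]
for worker in workers:
    worker.start()

for order in [{"id": 1, "qty": 2, "price": 5.0}, {"id": 2, "qty": 1, "price": 7.5}]:
    broker.publish("orders", order)
broker.publish("orders", None)  # no more orders

for worker in workers:
    worker.join()
```

Either service could be replaced or scaled independently as long as the message shape on its topics stays the same. A real broker would also provide persistence, partitioning and replay, which this in-memory sketch omits.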
  • 29. 5. Open sourcing Machine Learning systems by tech giants [Diagram: Open sourcing machine learning systems by tech giants: 1 Facebook Torch, 2 IBM SystemML, 3 Google TensorFlow, 4 Microsoft DMTK, 5 Yahoo CaffeOnSpark, 6 Amazon DSSTNE]
  • 30. 5. Open sourcing Machine Learning systems by tech giants 1 Torch (http://torch.ch/) is an open source Machine Learning library which provides a wide range of deep learning algorithms. Facebook donated its optimized deep learning modules to the Torch project on January 16, 2015. 2 Apache SystemML (http://systemml.apache.org/) is a distributed and declarative machine learning platform. It was created in 2010 by IBM and donated as an open source Apache project on November 2nd, 2015. 3 TensorFlow (https://www.tensorflow.org) is an open source machine learning library created by Google. It was released under the Apache 2.0 open source license on November 9th, 2015.
  • 31. 5. Open sourcing Machine Learning systems by tech giants 4 DMTK (Distributed Machine Learning Toolkit) http://www.dmtk.io/ allows models to be trained on multiple nodes at once. DMTK was open sourced by Microsoft on November 12, 2015. 5 CaffeOnSpark https://github.com/yahoo/CaffeOnSpark is an open source machine learning library created by Yahoo. It was open sourced on February 24th, 2016. 6 DSSTNE (Deep Scalable Sparse Tensor Network Engine, pronounced “Destiny”) https://github.com/amznlabs/amazon-dsstne is an Amazon-developed library for building Deep Learning (DL) Machine Learning (ML) models. It was open sourced on May 11th, 2016.
  • 32. 5. Open sourcing Machine Learning systems by tech giants We can expect wider adoption of Machine Learning tools by companies beyond these tech giants, in the same way that MapReduce and Hadoop helped make “Big Data” a part of just about every company’s strategy! These tech giants are not keeping their machine learning systems for internal use only; they are racing to open source them, attract users and committers, and advance the entire industry. This, combined with deployment on commodity clusters, will accelerate such adoption. As a result, we will see new machine learning use cases, especially ones building on deep learning, that will transform multiple industries.
  • 34. 6. Hybrid Cloud Computing The cloud is becoming mainstream and the software stack is adapting. Big Data applications will eventually all move to the cloud to benefit from agility, elasticity and on-demand computing! Meanwhile, companies need to advance their strategy for hybrid integration between cloud and on-premise deployments. Figure: Deployment of Big Data applications in a hybrid model: on-premise and on the cloud
  • 35. 6. Hybrid Cloud Computing The following are a few patterns for such hybrid integration: 1. Replicating data from SaaS apps to existing on-premise databases to be used by other on-premise applications such as analytics ones. 2. Integrating SaaS applications themselves with on-premise applications. 3. Hybrid Data Warehousing with the Cloud: move data from an on-premise data warehouse to the cloud. 4. Real-time analytics on streaming data: depending on your use case, you might keep your stream analytics infrastructure directly accessible on-premise for low latency.
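Pattern 1 above (replicating SaaS data into an on-premise database) can be sketched as an incremental, watermark-based copy. This is only an illustration under assumptions: the record list stands in for a hypothetical SaaS export (a real integration would call the vendor's API), an in-memory SQLite database stands in for the on-premise database, and the `customers` table and its `updated_at` column are invented for the example.

```python
import sqlite3


def fetch_changes(saas_records, since):
    """Hypothetical SaaS export: records modified after the `since` watermark."""
    return [r for r in saas_records if r["updated_at"] > since]


def replicate(conn, saas_records):
    """Copy new or changed SaaS records into the on-premise table.

    Returns the number of records written in this run.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers ("
        "id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
    )
    # The newest timestamp already replicated acts as the watermark.
    watermark = conn.execute(
        "SELECT COALESCE(MAX(updated_at), 0) FROM customers"
    ).fetchone()[0]
    changes = fetch_changes(saas_records, watermark)
    # INSERT OR REPLACE makes the copy idempotent per primary key.
    conn.executemany(
        "INSERT OR REPLACE INTO customers (id, name, updated_at) "
        "VALUES (:id, :name, :updated_at)",
        changes,
    )
    conn.commit()
    return len(changes)


conn = sqlite3.connect(":memory:")  # stand-in for the on-premise database
records = [
    {"id": 1, "name": "Acme", "updated_at": 10},
    {"id": 2, "name": "Globex", "updated_at": 20},
]
first_run = replicate(conn, records)   # initial load copies both records

records[0] = {"id": 1, "name": "Acme Corp", "updated_at": 30}
second_run = replicate(conn, records)  # only the changed record is copied

rows = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
```

On-premise analytics applications would then query the local `customers` table rather than the SaaS application itself. A production setup would also need to handle deletions and records that share the same timestamp, which this sketch ignores.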
  • 36. Key Takeaways 1. Adopt Apache Beam for easier development and portability between Big Data Execution Engines 2. Adopt stream analytics for faster time to insight, competitive advantages and operational efficiency 3. Accelerate your Big Data applications with In-Memory open source tools 4. Adopt Rapid Application Development of Big Data applications: APIs, Notebooks, GUIs, Microservices… 5. Make Machine Learning part of your strategy, or passively watch your industry be completely transformed! 6. How will you advance your strategy for hybrid integration between cloud and on-premise deployments?
  • 37. Thanks! To all of you for attending! Any questions? Let’s keep in touch! • sbaltagi@gmail.com • @SlimBaltagi • https://www.linkedin.com/in/slimbaltagi