Innovative Solutions For Mission Critical Systems
HSV Big Data
The Big Data Puzzle -
Where Does the Eclipse Piece Fit?
Chief Technical Officer at CohesionForce and co-founder of the Huntsville
Eclipse Framework Developers Group.
Developing applications and frameworks based on Eclipse technology since
2009.
Specialize in “Horizontal Integration” - combining existing technology to
provide new capabilities.
Find me online:
Email – jlangley@cohesionforce.com
Twitter – @jperiodlangley
Google+ JLangley
About me:
CohesionForce Inc. (CFI) is a Technology and Services Veteran Owned
Small Business headquartered in Huntsville, Alabama. CFI has
progressed from our founding in 1998 with measured growth focused on
providing innovative solutions and collaborative teaming for complete
support to our Prime and Government customers.
About CohesionForce:
Core Values:
● Disciplined engineering should be easy
● Systems should be fulfilling to use
● Better tools are needed to solve the problems of tomorrow
The tools used for Big Data analytics seem to be converging on the
Apache Software Foundation as a home. As an Eclipse and Apache
integrator, CohesionForce has found a useful fit for Eclipse projects
when used as tooling for an underlying Apache project runtime.
Using file formats such as Apache Avro and Parquet along with a
compute engine such as Apache Hive or Spark allows us to query the
data with a SQL-like language.
Summary:
Big Data is well... a big topic. In this talk, we will focus on the following:
1. Sample data used to approximate the problem space
2. Apache Projects that CohesionForce has used for data analysis
3. Example Configurations tested
4. Eclipse Projects that CohesionForce has used/developed for tooling
5. Thoughts on future work with Eclipse for Big Data & Data Science
Scope:
A sizable list of data available for use can be found here:
https://aws.amazon.com/public-data-sets/
These data sets are a mix of textual, spatial, image, and video. They
closely approximate the size of the data that we needed, but did not
have the combination of factors that we were looking for.
We needed data for events that have an ID, type, location, and a time.
We were able to create a suitable data set by transforming taxi data
available here: http://www.andresmh.com/nyctaxitrips/
After converting the data to a common format (DIS EntityStatePDU),
the result was just over 174M events.
Sample Data:
Distributed Interactive Simulation (DIS)
DIS is an IEEE standard (IEEE 1278.1) developed by the Simulation
Interoperability Standards Organization (SISO) and approved by the IEEE.
The EntityStatePDU contains fields such as:
● EntityID – Site, Application, Entity
● EntityType – 7-tuple enumeration
● EntityLocation – x/y/z geocentric WGS-84
● EntityOrientation – psi/theta/phi Euler angles
● EntityVelocity – x/y/z along the orientation axes
The EntityStatePDU is used as an example throughout this
presentation.
Distributed Interactive Simulation:
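The fields listed above can be sketched as a plain data structure. A minimal Python sketch with illustrative names, types, and values only – this is not the IEEE 1278.1 wire format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EntityID:
    site: int
    application: int
    entity: int

# Hypothetical in-memory shape of an EntityStatePDU; field names follow
# the slide's list, units and types are assumptions for illustration.
@dataclass(frozen=True)
class EntityStatePDU:
    entity_id: EntityID
    entity_type: tuple      # 7-tuple enumeration
    location: tuple         # (x, y, z) geocentric WGS-84
    orientation: tuple      # (psi, theta, phi) Euler angles
    velocity: tuple         # (x, y, z) along the orientation axes

pdu = EntityStatePDU(
    entity_id=EntityID(site=1, application=1, entity=42),
    entity_type=(1, 2, 225, 1, 1, 0, 0),
    location=(1115032.0, -4844588.0, 3982846.0),
    orientation=(0.0, 0.0, 0.0),
    velocity=(5.0, 0.0, 0.0),
)
```

The frozen (immutable) record mirrors how logged PDUs are treated downstream: events are appended, never modified.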
Apache Big Data Projects:
Apache Ambari
Apache Avro
Apache Bigtop
Apache BookKeeper
Apache CouchDB
Apache Crunch
Apache DataFu
Apache DirectMemory
Apache Drill
Apache Falcon
Apache Flink
Apache Flume
Apache Giraph
Apache Hama
Apache Helix
Apache Ignite
Apache Kafka
Apache Knox
Apache MetaModel
Apache ORC
Apache Oozie
Apache Parquet
Apache Phoenix
Apache REEF
Apache Samza
Apache Spark
Apache Sqoop
Apache Storm
Apache Tajo
Apache Tez
Apache VXQuery
Interesting Note: Apache lists the Hadoop, Hive, Pig, and Zookeeper projects under the database category.
CohesionForce experience with:
Apache Hadoop
Apache Hive
Apache Avro
Apache Spark
Apache Parquet
Data Formats
Apache Avro • Apache Parquet
Apache Avro Data Format:
• Schema based with multiple language bindings (C, Python, Java, etc)
• Schema defined using JSON or IDL
• Useful for nested data structures
• File metadata can be added
• Schema stored with each serialized file
• Allows for dynamic typing or “type discovery”
• Multiple options for compression – currently using deflate (zlib) and snappy
Example Schema for EntityStatePDU
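The example schema appears as an image in the original deck; the following is a hedged, abbreviated sketch of what an Avro schema for an EntityStatePDU might look like in JSON (record and field names are illustrative, not the schema used in the talk):

```json
{
  "namespace": "com.example.dis",
  "type": "record",
  "name": "EntityStatePDU",
  "fields": [
    {"name": "entityID", "type": {
      "type": "record", "name": "EntityID", "fields": [
        {"name": "site", "type": "int"},
        {"name": "application", "type": "int"},
        {"name": "entity", "type": "int"}
      ]}},
    {"name": "location", "type": {
      "type": "record", "name": "Vector3Double", "fields": [
        {"name": "x", "type": "double"},
        {"name": "y", "type": "double"},
        {"name": "z", "type": "double"}
      ]}},
    {"name": "timestamp", "type": "long"}
  ]
}
```

Nested records like EntityID map naturally onto Avro's record type, which is what makes the format a good fit for structured PDU data.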
Apache Parquet Data Format:
• Columnar storage format
• Also schema based
• Type specific compression
• Initially developed by Twitter and open sourced
• Promoted to an Apache Top-Level project on April 27, 2015
• Twitter is converting their data from Avro to Parquet
Traditional (row-oriented) layout – each record stored contiguously:
protoVersion, exID, type, family, time, length, pad
protoVersion, exID, type, family, time, length, pad
protoVersion, exID, type, family, time, length, pad
...
Parquet (columnar) layout – each field stored contiguously:
protoVersion:1 – protoVersion:N
exID:1 – exID:N
type:1 – type:N
family:1 – family:N
time:1 – time:N
length:1 – length:N
pad:1 – pad:N
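The two layouts can be demonstrated in a few lines of plain Python; the header field names come from the slide, the values are made up:

```python
# Row-oriented records, as in the "Traditional" layout above.
rows = [
    {"protoVersion": 6, "exID": 1, "type": 1, "family": 1,
     "time": t, "length": 144, "pad": 0}
    for t in range(5)
]

# Columnar layout: every value of one field stored together,
# as Parquet does on disk.
columns = {field: [row[field] for row in rows] for field in rows[0]}

print(columns["time"])  # [0, 1, 2, 3, 4]
```

A query that touches only one field can then skip the other columns entirely, which is what makes columnar formats efficient for analytics.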
Apache Hadoop – Data Storage + Map/Reduce + Scheduling
● Replicated storage using HDFS and commodity hardware
● YARN – job scheduler application for managing Hadoop jobs
● Map/Reduce – programming paradigm for writing jobs that process large amounts of data in parallel
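The Map/Reduce paradigm can be sketched without Hadoop at all. A toy pure-Python example over made-up event records; this illustrates the paradigm, not the Hadoop API:

```python
from collections import defaultdict

# Made-up event records for illustration.
events = [
    {"entity": 1, "type": "taxi"},
    {"entity": 2, "type": "taxi"},
    {"entity": 1, "type": "taxi"},
]

# Map: each input record emits a (key, value) pair.
mapped = [(e["entity"], 1) for e in events]

# Shuffle: group pairs by key (Hadoop does this between map and reduce).
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: combine the values for each key.
counts = {key: sum(values) for key, values in grouped.items()}

print(counts)  # {1: 2, 2: 1}
```

Because each map call and each reduce call is independent, Hadoop can run them in parallel across the cluster.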
Execution Engines
Apache Hive
● Converts Hive SQL statements into commands that run on another execution engine (Map/Reduce, Spark, Tez)
● Provides an ODBC connection
Apache Spark
● Takes advantage of multiple cores and large amounts of memory to run massively multi-threaded jobs
● Has incorporated most of Hive's capability into the SqlContext
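As an illustration of the SQL-like access Hive provides, a hypothetical HiveQL query against an already-registered table of events; the table and column names are assumptions, not from the talk:

```sql
-- Count events per site, assuming a table "entity_state" with a
-- nested entityID column as sketched in the Avro schema earlier.
SELECT entityID.site, COUNT(*) AS events
FROM entity_state
GROUP BY entityID.site;
```

Hive translates this statement into jobs for the configured execution engine, so the same query can run on Map/Reduce, Spark, or Tez.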
Visualization Engines
Eclipse BIRT
Apache Zeppelin:
● Web-based “notebook” that provides interactive sessions for data analysis
● Currently in the incubation phase (we are running a 0.6.0 build from source at CFI)
Eclipse BIRT:
● Mature reporting capability for web and application use
● Exports to most common office formats
● Used more for pre-determined reports than for interactive analysis
Example
Configurations
Initial Implementation:
Eclipse BIRT → Apache Hive → Apache YARN → Hadoop cluster (NameNode + 3 DataNodes)
Second Implementation:
Eclipse BIRT → Apache Hive → Apache YARN → Spark cluster (Master + 3 Workers) → Hadoop cluster (NameNode + 3 DataNodes)
Third Implementation:
Eclipse BIRT → Apache Hive → Spark cluster (Master + 3 Workers) → Hadoop cluster (NameNode + 3 DataNodes)
Fourth Implementation:
Spark Shell → Spark cluster (Master + 3 Workers) → Hadoop cluster (NameNode + 3 DataNodes)
Fifth Implementation (presented @EclipseCon):
Browser → Apache Zeppelin → Spark cluster (Master + 3 Workers) → Hadoop cluster (NameNode + 3 DataNodes)
Current Implementation:
Browser → Apache Zeppelin → Spark cluster (Master + 3 Workers) → Hadoop cluster (NameNode + 3 DataNodes)
Application Implementation:
Application → ODBC connection → Thrift Server + BIRT → Spark cluster (Master + 3 Workers) → Hadoop cluster (NameNode + 3 DataNodes)
More Spark Info
Spark Resilient Distributed Datasets (RDD):
● The core concept in the Spark framework.
● RDDs are immutable. This allows multiple processes to evaluate the same piece of data – you can't change the data in an RDD, but you can create new RDDs.
● RDDs support two types of operations:
  ● Transformations – map, groupByKey, etc. These return a new RDD based on the operation run.
  ● Actions – these return a value – count, first, countByKey, etc.
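The transformation/action split can be sketched with a toy class; this is a plain-Python analogy, not the Spark API:

```python
# Toy "RDD": data is immutable, transformations return a new object,
# actions return a plain value.
class ToyRDD:
    def __init__(self, data):
        self._data = tuple(data)        # immutable, like an RDD

    def map(self, fn):                  # transformation -> new ToyRDD
        return ToyRDD(fn(x) for x in self._data)

    def filter(self, pred):             # transformation -> new ToyRDD
        return ToyRDD(x for x in self._data if pred(x))

    def count(self):                    # action -> value
        return len(self._data)

    def first(self):                    # action -> value
        return self._data[0]

rdd = ToyRDD([1, 2, 3, 4])
doubled = rdd.map(lambda x: x * 2)      # rdd itself is unchanged
print(doubled.filter(lambda x: x > 4).count())  # 2
print(rdd.first())                               # 1
```

In real Spark the transformations are also lazy (nothing executes until an action runs), which this sketch does not attempt to model.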
Spark Core Libraries:
● Spark Streaming – batches RDDs into a pipeline. Existing operations on RDDs can be reused in a stream configuration.
● Spark SQL – exposes Spark datasets to allow SQL-like queries.
● Spark MLlib – machine learning library. Classification, regression, clustering, collaborative filtering, etc.
● Spark GraphX – API for graphs and graph-parallel computation.
Spark Shell:
● Provides an interactive session for evaluating and debugging transforms.
● Can attach to a separate cluster (based on the master URL) or include its own master in the JVM.
● Available in both Scala and Python.
Eclipse + Apache Workflow:
Data Model → Generate Logger → Generate Queries
Eclipse-Avro: The purpose of this repository is to provide the capability
to store EMF data files in the Apache Avro format. The Acceleo project
is used to generate an Avro schema from a given EMF schema,
along with an AvroResourceImpl that can be used in place of the
XMIResourceImpl to load and save data using the common EMF
methodology. https://github.com/LangleyStudios/eclipse-avro
DIS Toolkit: The DIS Toolkit provides an EMF model based on the DIS
schema provided by the OpenDIS codebase, along with a generated
data logger that stores files in compressed binary using the Apache
Avro format. https://github.com/CohesionForce/dis-toolkit
AvroToParquet: a simple command line converter for Apache Avro to
Apache Parquet file formats.
https://github.com/CohesionForce/avroToParquet
Eclipse Tooling:
Eclipse BIRT - Visualizing Big Data with Hadoop and BIRT
http://www.eclipse.org/community/eclipse_newsletter/2013/april/article1.php
Talend Open Studio – Start working with Hadoop and NoSQL
databases today using simple, graphical tools and wizards to generate
native code that leverages the full power of Hadoop.
https://www.talend.com/download/talend-open-studio
DataStax DevCenter - a free, Eclipse-based tool, which is designed to
be a lightweight visual interface that provides developers and others
with the ability to easily run CQL queries against Cassandra, view
query results, and perform other related tasks.
http://www.datastax.com/what-we-offer/products-services/devcenter
Other Eclipse Based Projects:
Architect - provides an Eclipse-based workbench in which data
scientists can get their job done, in other words, an integrated
development environment (IDE) targeted specifically at data scientists.
http://www.openanalytics.eu/products
StatET - an Eclipse-based IDE (integrated development environment)
for R. It offers a set of mature tools for R coding and package building,
including a fully integrated R Console, Object Browser, and R Help
System; multiple local and remote installations of R are supported.
http://www.walware.de/goto/statet
https://github.com/walware/statet
More Eclipse Based Projects:
Eclipse Sirius could be used to create graphical editors or visualizations
of component deployment, schemas, queries, etc.
Xtext could be used to provide editors for Hive Query Language, while
also providing binding to types retrieved from data schemas. This
would allow a user to write queries with syntax highlighting, code
completion, and validation before execution.
Xtend could be used to generate loggers, transforms, and other tasks
based on a data model. The config files needed for Hadoop & Spark
could be generated based on a modeled laydown.
Other possibilities include provisioning bundles and starting containers.
Thoughts on Future Direction:
We would like to compile a working list of ideas on this subject.
We would also like to identify potential users of these types of tools to
be sure that we implement the proper feature set.
If you need help with integrating any of the tools/concepts that have
been covered, please:
Contact us:
CohesionForce
www.cohesionforce.com
Email: jlangley@CohesionForce.com
Twitter: @jperiodlangley
Interested?

Contenu connexe

Tendances

Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
DataWorks Summit
 
De-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-CloudDe-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-Cloud
DataWorks Summit
 

Tendances (19)

Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
 
50 Shades of SQL
50 Shades of SQL50 Shades of SQL
50 Shades of SQL
 
Hortonworks Data Platform and IBM Systems - A Complete Solution for Cognitive...
Hortonworks Data Platform and IBM Systems - A Complete Solution for Cognitive...Hortonworks Data Platform and IBM Systems - A Complete Solution for Cognitive...
Hortonworks Data Platform and IBM Systems - A Complete Solution for Cognitive...
 
De-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-CloudDe-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-Cloud
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Dynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the flyDynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the fly
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Functional programming
 for optimization problems 
in Big Data
Functional programming
  for optimization problems 
in Big DataFunctional programming
  for optimization problems 
in Big Data
Functional programming
 for optimization problems 
in Big Data
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 Keynote
 
HDInsight Hadoop on Windows Azure
HDInsight Hadoop on Windows AzureHDInsight Hadoop on Windows Azure
HDInsight Hadoop on Windows Azure
 
Protecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against DisastersProtecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against Disasters
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Spark mhug2
Spark mhug2Spark mhug2
Spark mhug2
 
Big Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonBig Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive Comparison
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
Real World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in ProductionReal World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in Production
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and Deployment
 

Similaire à The Big Data Puzzle, Where Does the Eclipse Piece Fit?

Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Cloudera, Inc.
 

Similaire à The Big Data Puzzle, Where Does the Eclipse Piece Fit? (20)

Shiv shakti resume
Shiv shakti resumeShiv shakti resume
Shiv shakti resume
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014
 
Data infrastructure at Facebook
Data infrastructure at Facebook Data infrastructure at Facebook
Data infrastructure at Facebook
 
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
 
Mukul-Resume
Mukul-ResumeMukul-Resume
Mukul-Resume
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014
 
hadoop_module
hadoop_modulehadoop_module
hadoop_module
 
Hadoop
HadoopHadoop
Hadoop
 
Using Machine Learning with HDInsight
Using Machine Learning with HDInsightUsing Machine Learning with HDInsight
Using Machine Learning with HDInsight
 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
 
Resume_VipinKP
Resume_VipinKPResume_VipinKP
Resume_VipinKP
 
Big Data , Big Problem?
Big Data , Big Problem?Big Data , Big Problem?
Big Data , Big Problem?
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Open source stak of big data techs open suse asia
Open source stak of big data techs   open suse asiaOpen source stak of big data techs   open suse asia
Open source stak of big data techs open suse asia
 
Oracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by ExampleOracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by Example
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Harnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache HiveHarnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache Hive
 
Trafodion overview
Trafodion overviewTrafodion overview
Trafodion overview
 
Shiv Shakti
Shiv ShaktiShiv Shakti
Shiv Shakti
 

Dernier

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Dernier (20)

Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 

The Big Data Puzzle, Where Does the Eclipse Piece Fit?

  • 1. Innovative Solutions For Mission Critical Systems HSV Big Data The Big Data Puzzle - Where Does the Eclipse Piece Fit?
  • 2. Innovative Solutions For Mission Critical Systems HSV Big Data Chief Technical Officer at CohesionForce and co-founder of the Huntsville Eclipse Framework Developers Group. Developing applications and frameworks based on Eclipse technology since 2009. Specialize in “Horizontal Integration” - combining existing technology to provide new capabilities. Find me online: Email – jlangley@cohesionforce.com Twitter – @jperiodlangley Google+ JLangley About me:
  • 3. Innovative Solutions For Mission Critical Systems HSV Big Data CohesionForce Inc. (CFI) is a Technology and Services Veteran Owned Small Business headquartered in Huntsville, Alabama. CFI has progressed from our founding in 1998 with measured growth focused on providing innovative solutions and collaborative teaming for complete support to our Prime and Government customers. About CohesionForce: Core Values:  Disciplined engineering should be easy  Systems should be fulfilling to use  Better tools are needed to solve the problems of tomorrow
  • 4. Innovative Solutions For Mission Critical Systems HSV Big Data The tools used for Big Data analytics seem to be converging on the Apache Software Foundation as a home. As an Eclipse and Apache integrator, CohesionForce has found a useful fit for Eclipse projects when used as tooling for an underlying Apache project runtime. Using file formats such as Apache Avro and Parquet along with a compute system such as Apache Hive or Spark allows us to query the data using a SQL “like” language with Apache Hive and Spark. Summary:
  • 5. Innovative Solutions For Mission Critical Systems HSV Big Data Big Data is well... a big topic. In this talk, we will focus on the following: 1.Sample data used to approximate the problem space 2.Apache Projects that CohesionForce has used for data analysis 3.Example Configurations tested 4.Eclipse Projects that CohesionForce has used/developed for tooling 5.Thoughts on future work with Eclipse for Big Data & Data Science Scope:
  • 6. Innovative Solutions For Mission Critical Systems HSV Big Data A sizable list of data available for use can be found here: https://aws.amazon.com/public-data-sets/ These data sets are a mix of textual, spatial, image, and video. They closely approximate the size of the data that we needed, but did not have the combination of factors that we were looking for. We needed data for events that have an ID, type, location, and a time. We were able to create a suitable data set by transforming taxi data available here: http://www.andresmh.com/nyctaxitrips/ After converting the data to a common format (DIS EntityStatePDU), the result was just over 174M events. Sample Data:
  • 7. Innovative Solutions For Mission Critical Systems HSV Big Data Distributed Interactive Simulation (DIS) DIS is an IEEE standard (IEEE-1278.1) developed by the Simulation Interoperability Standards (SISO) and approved by IEEE. The EntityStatePDU contains fields such as ● EntityID – Site,Application,Entity  EntityType – 7-tuple enumeration  EntityLocation – x/y/z Geocentric WGS-84  EntityOrientation – psi/theta/phi Euler angles  EntityVelocity – x/y/z along the orientation axis The EntityStatePDU is used as an example throughout this presentation. Distributed Interactive Simulation:
  • 8. Innovative Solutions For Mission Critical Systems HSV Big Data Apache Ambari Apache Avro Apache Bigtop Apache BookKeeper Apache CouchDB Apache Crunch Apache DataFu Apache DirectMemory Apache Drill Apache Falcon Apache Flink Apache Flume Apache Giraph Apache Hama Apache Big Data Projects: Apache Helix Apache Ignite Apache Kafka Apache Knox Apache MetaModel Apache ORC Apache Oozie Apache Parquet Apache Phoenix Apache REEF Apache Samza Apache Spark Apache Sqoop Apache Storm Apache Tajo Apache Tez Apache VXQuery Interesting Note: Apache lists projects Hadoop, Hive, Pig, and Zookeeper under the database category.
  • 9. Innovative Solutions For Mission Critical Systems HSV Big Data CohesionForce experience with: Apache Hadoop Apache Hive Apache Avro Apache Spark Apache Parquet
  • 10. Innovative Solutions For Mission Critical Systems HSV Big Data Data Formats Apache Avro Apache Parquet
  • 11. Innovative Solutions For Mission Critical Systems HSV Big Data Apache Avro Data Format: • Schema based with multiple language bindings (C, Python, Java, etc.) • Schema defined using JSON or IDL • Useful for nested data structures • File metadata can be added • Schema is stored with each serialized file • Allows for dynamic typing or “type discovery” • Multiple compression options – currently using deflate (zlib) and snappy Example Schema for EntityStatePDU
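Avro schemas are plain JSON documents, so they can be built and inspected with nothing but the standard library. The sketch below is a hedged, abbreviated guess at what a schema for a simplified EntityStatePDU might look like; it is not the full PDU field list.

```python
import json

# A hedged sketch of an Avro record schema for a (simplified) EntityStatePDU.
# The field list is abbreviated and illustrative, not the full PDU.
schema = {
    "type": "record",
    "name": "EntityStatePdu",
    "namespace": "dis",
    "fields": [
        {"name": "protocolVersion", "type": "int"},
        {"name": "exerciseID", "type": "int"},
        {"name": "entityID", "type": {
            "type": "record", "name": "EntityID",
            "fields": [
                {"name": "site", "type": "int"},
                {"name": "application", "type": "int"},
                {"name": "entity", "type": "int"},
            ]}},
        {"name": "timestamp", "type": "long"},
    ],
}

# Avro schemas round-trip through the json module like any other JSON.
text = json.dumps(schema, indent=2)
print(json.loads(text)["name"])  # EntityStatePdu
```

Because the schema is stored with each serialized file, a reader can discover the record layout at runtime without compile-time generated classes.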
  • 13. Innovative Solutions For Mission Critical Systems HSV Big Data Apache Parquet Data Format: • Columnar storage format • Also schema based • Type-specific compression • Initially developed by Twitter and open sourced • Promoted to an Apache Top-Level Project on April 27, 2015 • Twitter is converting their data from Avro to Parquet [Diagram: traditional row-oriented storage writes each record (protoVersion, exID, type, family, time, length, pad) contiguously, record after record; Parquet stores each column contiguously – protoVersion:1–N, exID:1–N, type:1–N, family:1–N, time:1–N, length:1–N, pad:1–N]
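The row-versus-columnar layouts in the diagram can be illustrated with a toy transposition in plain Python. This is only an analogy for the storage layout, not Parquet's actual on-disk encoding.

```python
# Toy illustration of the row-vs-columnar layouts sketched above.
# Each record is (protoVersion, exID, type, family, time, length, pad).
fields = ["protoVersion", "exID", "type", "family", "time", "length", "pad"]
rows = [
    (6, 1, 1, 1, 1000, 144, 0),
    (6, 1, 1, 1, 1001, 144, 0),
    (6, 1, 1, 1, 1002, 144, 0),
]

# Row-oriented ("traditional"): records are stored one after another.
row_store = [value for record in rows for value in record]

# Columnar (Parquet-style): all values of one field are stored contiguously,
# which is what enables type-specific encodings such as run-length encoding.
col_store = {name: [record[i] for record in rows]
             for i, name in enumerate(fields)}

print(col_store["protoVersion"])  # [6, 6, 6] -- highly compressible
```

A query that only reads `time` can skip the other six columns entirely, which is where columnar formats win for analytics.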
  • 14. Innovative Solutions For Mission Critical Systems HSV Big Data Apache Hadoop Data Storage + Map/Reduce + Scheduling ● Replicated storage using HDFS and commodity hardware ● YARN – Job scheduler application for managing Hadoop jobs ● Map/Reduce – programming paradigm for writing jobs that process large amounts of data in parallel
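The Map/Reduce paradigm mentioned above can be sketched in a single process. Hadoop runs the same two phases, but distributed across the cluster with a shuffle between them; this word-count-style example over toy records is only an analogy.

```python
from collections import defaultdict

def map_phase(record):
    # Emit (key, value) pairs -- here, one (word, 1) per word.
    for word in record.split():
        yield (word, 1)

def reduce_phase(key, values):
    # Combine all values emitted for one key.
    return (key, sum(values))

records = ["taxi pickup", "taxi dropoff", "taxi pickup"]

# Shuffle: group mapper output by key (Hadoop does this across the network).
grouped = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        grouped[key].append(value)

counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts["taxi"])  # 3
```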
  • 15. Innovative Solutions For Mission Critical Systems HSV Big Data Execution Engines Apache Spark Apache Hive Apache Hive converts HiveQL statements into jobs that run on an underlying execution engine (MapReduce, Spark, or Tez) and provides an ODBC connection. Apache Spark takes advantage of multiple cores and large amounts of memory to run massively multi-threaded jobs, and has incorporated most of Hive's capability into its SQLContext.
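As an illustration, a query over the taxi-derived event data might look like this in HiveQL. The table and column names here are hypothetical, not the schema we actually used.

```sql
-- Hypothetical table/column names; Hive compiles this into
-- MapReduce, Spark, or Tez jobs depending on the execution engine.
SELECT entity_id, COUNT(*) AS event_count
FROM entity_state_events
WHERE event_time BETWEEN '2013-01-01' AND '2013-01-31'
GROUP BY entity_id
ORDER BY event_count DESC
LIMIT 10;
```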
  • 16. Innovative Solutions For Mission Critical Systems HSV Big Data Visualization Engines Apache Zeppelin Eclipse BIRT
  • 17. Innovative Solutions For Mission Critical Systems HSV Big Data Apache Zeppelin: ● Web-based “notebook” that provides interactive sessions for data analysis ● Currently in the incubation phase (we are running a 0.6.0 build from source at CFI)
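A Zeppelin notebook is made of paragraphs, each beginning with an interpreter directive such as `%sql` or `%spark`. A SQL paragraph over a hypothetical events table might look like the following (the table and column names are illustrative):

```
%sql
SELECT type, COUNT(*) AS n
FROM entity_state_events
GROUP BY type
ORDER BY n DESC
```

Zeppelin renders the result set as a table or chart directly in the browser, which is what makes it suited to interactive analysis rather than pre-determined reports.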
  • 18. Innovative Solutions For Mission Critical Systems HSV Big Data Eclipse BIRT: ● Mature reporting capability for web and application use ● Exports to most supported office formats ● Used more for pre-determined reports than interactive analysis
  • 19. Innovative Solutions For Mission Critical Systems HSV Big Data Example Configurations
  • 20. Innovative Solutions For Mission Critical Systems HSV Big Data Initial Implementation: Hadoop DN Hadoop DNHadoop DN Hadoop NN Apache Yarn Apache Hive Eclipse BIRT
  • 21. Innovative Solutions For Mission Critical Systems HSV Big Data Second Implementation: Hadoop DN Hadoop DNHadoop DN Hadoop NN Apache Yarn Apache Hive Eclipse BIRT Spark Master Spark Worker Spark Worker Spark Worker
  • 22. Innovative Solutions For Mission Critical Systems HSV Big Data Third Implementation: Hadoop DN Hadoop DNHadoop DN Hadoop NN Apache Hive Eclipse BIRT Spark Master Spark Worker Spark Worker Spark Worker
  • 23. Innovative Solutions For Mission Critical Systems HSV Big Data Fourth Implementation: Hadoop DN Hadoop DNHadoop DN Hadoop NN Spark Shell Spark Master Spark Worker Spark Worker Spark Worker
  • 24. Innovative Solutions For Mission Critical Systems HSV Big Data Fifth Implementation: (Presented @EclipseCon) Hadoop DN Hadoop DNHadoop DN Hadoop NN Apache Zeppelin Browser Spark Master Spark Worker Spark Worker Spark Worker
  • 25. Innovative Solutions For Mission Critical Systems HSV Big Data Current Implementation: Hadoop DN Hadoop DN Hadoop DN Hadoop NN Apache Zeppelin Browser Spark Master Spark Worker Spark Worker Spark Worker
  • 27. Innovative Solutions For Mission Critical Systems HSV Big Data Application Implementation: Hadoop DN Hadoop DN Hadoop DN Hadoop NN Thrift Server BIRT Spark Master Spark Worker Spark Worker Spark Worker Application ODBC Connection
  • 28. Innovative Solutions For Mission Critical Systems HSV Big Data More Spark Info
  • 29. Innovative Solutions For Mission Critical Systems HSV Big Data ● The core concept in the Spark framework. ● RDDs are immutable. This allows multiple processes to evaluate the same piece of data – you can't change the data in an RDD, but you can create new RDDs. ● RDDs support two types of operations: ● Transformations – map, groupByKey, etc. These return a new RDD based on the operation run. ● Actions – these return a value – count, first, countByKey, etc. Spark Resilient Distributed Datasets (RDD):
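The key consequence of the transformation/action split is laziness: transformations only describe work, and nothing executes until an action asks for a value. This pure-Python analogy (no Spark required) mimics that behaviour with a lazy `map`; it is an illustration of the concept, not Spark's API.

```python
# Spark transformations are lazy; actions force evaluation. `mapped` below
# describes work to do, like an RDD lineage, and nothing runs until we ask
# for a result.
data = [1, 2, 3, 4]

evaluated = []
def trace(x):
    evaluated.append(x)
    return x * 2

# "Transformation": builds a lazy description, nothing is computed yet.
mapped = map(trace, data)
print(len(evaluated))  # 0 -- still lazy

# "Action": forces evaluation, like RDD.count() or RDD.collect().
result = list(mapped)
print(result)          # [2, 4, 6, 8]
print(len(evaluated))  # 4 -- now the work has run
```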
  • 31. Innovative Solutions For Mission Critical Systems HSV Big Data ● Spark Streaming – batches incoming data into a pipeline of RDDs. Existing operations on RDDs can be reused in a streaming configuration. ● Spark SQL – exposes Spark datasets to allow SQL-like queries. ● Spark MLlib – machine learning library: classification, regression, clustering, collaborative filtering, etc. ● Spark GraphX – API for graphs and graph-parallel computation. Spark Core Libraries:
  • 32. Innovative Solutions For Mission Critical Systems HSV Big Data ● Provides an interactive session for evaluating and debugging transforms. ● Can attach to a separate cluster (based on the master URL) or include its own master in the JVM. ● Available in both Scala (spark-shell) and Python (pyspark) Spark Shell:
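For reference, the two launch modes mentioned above look roughly like this from the command line (the cluster URL is illustrative):

```shell
# Attach the Scala shell to a standalone cluster via its master URL:
spark-shell --master spark://master-host:7077

# Or run the Python shell with an in-JVM local master, using all cores:
pyspark --master "local[*]"
```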
  • 33. Innovative Solutions For Mission Critical Systems HSV Big Data Eclipse + Apache Workflow: Data Model Generate Logger Generate Queries
  • 35. Innovative Solutions For Mission Critical Systems HSV Big Data Eclipse-Avro: The purpose of this repository is to provide the capability to store EMF data files in the Apache Avro format. The Acceleo project is used to generate an Avro schema from a given EMF schema, along with an AvroResourceImpl that can be used in place of the XMIResourceImpl to load and save data using the common EMF methodology. https://github.com/LangleyStudios/eclipse-avro DIS Toolkit: The DIS Toolkit provides an EMF model based on the DIS schema provided by the OpenDIS codebase, along with a generated data logger that stores files in compressed binary using the Apache Avro format. https://github.com/CohesionForce/dis-toolkit AvroToParquet: a simple command-line converter from the Apache Avro to the Apache Parquet file format. https://github.com/CohesionForce/avroToParquet Eclipse Tooling:
  • 36. Innovative Solutions For Mission Critical Systems HSV Big Data Eclipse BIRT – Visualizing Big Data with Hadoop and BIRT http://www.eclipse.org/community/eclipse_newsletter/2013/april/article1.php Talend Open Data Studio – start working with Hadoop and NoSQL databases today using simple, graphical tools and wizards to generate native code that leverages the full power of Hadoop. https://www.talend.com/download/talend-open-studio DataStax DevCenter – a free, Eclipse-based tool designed as a lightweight visual interface that lets developers and others easily run CQL queries against Cassandra, view query results, and perform other related tasks. http://www.datastax.com/what-we-offer/products-services/devcenter Other Eclipse Based Projects:
  • 37. Innovative Solutions For Mission Critical Systems HSV Big Data Architect – provides an Eclipse-based workbench in which data scientists can get their job done; in other words, an integrated development environment (IDE) targeted specifically at data scientists. http://www.openanalytics.eu/products StatET – an Eclipse-based IDE (integrated development environment) for R. It offers a set of mature tools for R coding and package building, including a fully integrated R console, object browser, and R help system, with support for multiple local and remote installations of R. http://www.walware.de/goto/statet https://github.com/walware/statet More Eclipse Based Projects:
  • 38. Innovative Solutions For Mission Critical Systems HSV Big Data Eclipse Sirius could be used to create graphical editors or visualizations of component deployment, schemas, queries, etc. Xtext could be used to provide editors for the Hive Query Language while also binding to types retrieved from data schemas; this would allow a user to write queries with syntax highlighting, code completion, and validation before execution. Xtend could be used to generate loggers, transforms, and other artifacts from a data model. The configuration files needed for Hadoop & Spark could be generated from a modeled laydown, along with provisioning bundles and starting containers. Thoughts on Future Direction:
  • 39. Innovative Solutions For Mission Critical Systems HSV Big Data We would like to compile a working list of ideas on this subject. We would also like to identify potential users of these types of tools to be sure that we implement the proper feature set. If you need help with integrating any of the tools/concepts that have been covered, please: Contact us: CohesionForce www.cohesionforce.com Email: jlangley@CohesionForce.com Twitter: @jperiodlangley Interested?