Partners in Crime
Cassandra Analytics and ETL with Hadoop




Cassandra Summit 2010

Date: August 10th, 2010
What is Hadoop?

• Distributed processing framework (MapReduce)
  – Moves processing to the data
• Distributed filesystem
  – Allows data to move when processing can't
Why use Hadoop with Cassandra?

 Perfect partners for big data laundering

• Cassandra optimized for access
• Hadoop optimized for processing
  – Many analytics frameworks
  – Existing integrations
      • RDBMS → Hadoop → Cassandra
Cluster Layouts

• Existing Hadoop cluster?
  – Start Hadoop tasktrackers on Cassandra cluster
  – Processing performed on local nodes
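
To illustrate what "processing performed on local nodes" means, here is a minimal sketch of locality-aware task assignment. It is not Hadoop's actual scheduler; `assign_splits`, the split names, and the node names are hypothetical, and the least-loaded tie-breaking rule is an assumption made for the example.

```python
def assign_splits(split_replicas, tasktrackers):
    """Assign each input split to a tasktracker that already holds its data.

    split_replicas: dict mapping split name -> list of nodes storing that split.
    tasktrackers:   list of nodes that can run tasks.
    Falls back to any tasktracker when no replica node runs one.
    """
    load = {node: 0 for node in tasktrackers}
    assignment = {}
    for split, replicas in split_replicas.items():
        # Prefer nodes that store the split locally.
        local = [node for node in replicas if node in load]
        candidates = local or list(tasktrackers)
        # Pick the least-loaded candidate; break ties by node name.
        chosen = min(candidates, key=lambda node: (load[node], node))
        assignment[split] = chosen
        load[chosen] += 1
    return assignment
```

With tasktrackers started on every Cassandra node, each split would normally have at least one local candidate, so the fallback branch would rarely be taken.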
Cluster Layouts

• No Hadoop cluster?
  – Start all Hadoop daemons on 2-3 nodes
      • MapReduce depends lightly on HDFS
  – Start Hadoop tasktrackers on Cassandra cluster
Hadoop Integration Points

• JVM MapReduce
  – Keys/values iterated in process
• Hadoop Streaming
  – Performs IPC on stdin/stdout to arbitrary processes
• Apache Pig
  – High level relational language (SQL alternative)
• Apache Hive
  – Forthcoming support for Cassandra storage
Demo

• Code
  – github.com/stuhood/cassandra-summit-demo
• Flow
  – Load with Hadoop Streaming
  – Analyze with Apache Pig
  – Load/Process with JVM MapReduce
Hadoop Streaming Summary

• Mapper/Reducer scripts
  – Any language
• Script is moved to the data


 cat $input | mapper | sort | reducer > $output
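
The pipeline above can be imitated locally in a few lines. This is a sketch of the streaming contract only (tab-separated key/value lines, sorted by key before the reducer sees them), using a word count as a stand-in job; the function names are hypothetical and nothing here talks to Hadoop.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would on stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_records):
    """Sum the counts for each key; relies on records arriving sorted by key."""
    parsed = (record.split("\t", 1) for record in sorted_records)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"

def run_pipeline(lines):
    """Local stand-in for: cat $input | mapper | sort | reducer > $output"""
    return list(reducer(sorted(mapper(lines))))
```

Because the reducer only assumes sorted input, the same two functions could be wrapped in scripts that read stdin and write stdout.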
ETL with Streaming

• ETL to Cassandra in ~50 lines
 Load!
ETL with Streaming

1) Files in HDFS
2) Hadoop Streaming
3) bin/load-mapper.py (the code you write)
4) Cassandra's Streaming Shim
5) Cassandra
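
The slides do not show bin/load-mapper.py, so the following is only a sketch of the kind of mapper step 3 refers to. The CSV input layout, the `COLUMNS` list, and the tab-separated output format are all assumptions; the record format that Cassandra's streaming shim actually expects should be taken from the demo repository.

```python
import csv
import io

# Hypothetical column layout for the input files; not taken from the slides.
COLUMNS = ["name", "email"]

def to_mutations(line):
    """Turn one CSV line 'row_key,name,email' into (row_key, column, value) tuples."""
    fields = next(csv.reader(io.StringIO(line)))
    row_key, values = fields[0], fields[1:]
    # Skip empty values so no empty columns are written.
    return [(row_key, col, val) for col, val in zip(COLUMNS, values) if val != ""]

def map_stream(stdin, stdout):
    """Read raw lines from stdin and write one tab-separated mutation per output line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        for row_key, col, val in to_mutations(line):
            stdout.write(f"{row_key}\t{col}\t{val}\n")
```

In a real streaming job the script would call `map_stream(sys.stdin, sys.stdout)` as its entry point.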
Apache Pig Summary

• Declarative relational language
Analytics with Pig

• Analytics from Cassandra in ~20 lines
 Analyze!
Analytics with Pig

1) Data stored in Cassandra
2) Cassandra's Pig LoadFunc
3) bin/analyze.pig (the code you write)
4) Files in HDFS
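
The slides do not include bin/analyze.pig. As a rough illustration of the group-and-count style of analysis that a short declarative script typically expresses, the same dataflow can be written in plain Python. The row shape (a dict of column name to value) and the function name are assumptions for this example, not the demo's schema.

```python
from collections import Counter

def count_by_column(rows, column):
    """Group rows by the value of `column` and count each group, largest group first."""
    counts = Counter(row[column] for row in rows if column in row)
    # Sort by descending count, then by value for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Rows that lack the column are skipped, which loosely mirrors sparse rows where not every column is present.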
JVM MapReduce Summary

• Extend Mapper/Reducer base classes
• Hadoop:
  – Transports the Jar to nodes near the data
  – Efficiently streams data through
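
The real API is Java; the sketch below is only a Python analogy of the pattern described above, where user code extends two base classes and the framework iterates keys and values in process. The class names, the `run_job` driver, and the in-memory shuffle are hypothetical simplifications.

```python
from collections import defaultdict

class Mapper:
    def map(self, key, value):
        """Yield zero or more (key, value) pairs for one input record."""
        raise NotImplementedError

class Reducer:
    def reduce(self, key, values):
        """Yield zero or more (key, value) pairs for one key and all its values."""
        raise NotImplementedError

def run_job(records, mapper, reducer):
    """Run map, group by key, then reduce, all in a single process."""
    shuffled = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper.map(key, value):
            shuffled[out_key].append(out_value)
    results = []
    for key in sorted(shuffled):
        results.extend(reducer.reduce(key, shuffled[key]))
    return results

class LengthMapper(Mapper):
    """Example job: emit (length of the value, 1) for each record."""
    def map(self, key, value):
        yield len(value), 1

class SumReducer(Reducer):
    """Example job: sum the counts for each key."""
    def reduce(self, key, values):
        yield key, sum(values)
```

A job is then just `run_job(records, LengthMapper(), SumReducer())`; in Hadoop the equivalent classes are packaged in a jar that is shipped to the nodes.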
Load/Process with MapReduce

• Efficient bulk loading in ~80 lines
 Summarize!
Load/Process with MapReduce

1) Files in HDFS
2) MapReduce
3) Mapper/Reducer (the code you write)
4) Cassandra's ColumnFamilyOutputFormat
5) Cassandra
Future Work

• Pig Output
• Hive
• Hadoop Streaming Input
• Optimizations
Questions?
References

• Code available at
  – github.com/stuhood/cassandra-summit-demo
• Open issues
  – CASSANDRA-1315
  – CASSANDRA-1322
  – CASSANDRA-1368
• “Hadoop + Cassandra” - Jeremy Hanna
  – slideshare.net/jeromatron/cassandrahadoop-4399672
