Cassandra/Hadoop Integration
OLTP + OLAP = Cassandra

Cassandra (basic overview)
  • BigTable + Dynamo
  • Semi-structured data model
  • Decentralized: no special roles, no SPOF
  • Horizontally scalable
  • Ridiculously fast writes, fast reads
  • Tunably consistent
  • Cross-DC capable

Querying with Cassandra
  • Design your data model based on your query model
  • Real-time ad-hoc queries aren’t viable
  • Secondary indexes help
  • What about analytics?

Enter Hadoop
  • Hadoop brings analytics
  • MapReduce
  • Pig/Hive and other tools built above MapReduce
  • Configurable data sources/destinations
  • Many already familiar with it
  • Active community

Cluster Configuration
  • Basic Recipe
      • Overlay Hadoop on top of Cassandra
      • Separate server for name node and job tracker
      • Co-locate task trackers with Cassandra nodes
      • Data nodes for distributed cache
  • Voilà
      • Data locality
      • Analytics engine scales with data

Cluster Tuning
  • Always tune Cassandra to taste
  • For Hadoop workloads you might
      • Have a separate analytics virtual datacenter, using the NetworkTopologyStrategy
      • Tune the rpc_timeout_in_ms in cassandra.yaml (higher)
      • Tune the cassandra.range.batch.size (see org.apache.cassandra.hadoop.ConfigHelper)
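The two tuning knobs on this slide pull against each other, which a little arithmetic makes visible. The sketch below assumes that cassandra.range.batch.size sets how many rows a task asks for in each range request; the function name and the row counts are illustrative only and are not taken from the deck.

```python
import math


def range_requests_per_split(split_rows: int, batch_size: int) -> int:
    """Return how many range requests are needed to read one input split.

    Assumption: each request returns at most `batch_size` rows.
    """
    if split_rows < 0 or batch_size <= 0:
        raise ValueError("split_rows must be >= 0 and batch_size must be > 0")
    return math.ceil(split_rows / batch_size)


# Illustrative numbers: a split of 65,536 rows read with three batch sizes.
for batch_size in (1024, 4096, 16384):
    requests = range_requests_per_split(65536, batch_size)
    print(f"batch size {batch_size}: {requests} requests")
```

A larger batch means fewer round trips, but each request does more work on the Cassandra side, which is one reason a higher rpc_timeout_in_ms tends to go along with a larger batch size.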

All-in-one Configuration
  • JobTracker and NameNode
  • Each node has Cassandra, a TaskTracker, and a DataNode (for distributed cache)

Separate Analytics Configuration
  • Separated nodes for analytics
  • Nodes for real-time random access
  • A single Cassandra cluster with different virtual data centers

MapReduce - InputFormat
  • Cassandra-specific InputFormat: ColumnFamilyInputFormat
  • Configuration: ConfigHelper, Hadoop variables
  • InputSplits over the data (tunable)
  • Example usage in contrib/word_count
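To make the idea of tunable input splits concrete, here is a minimal sketch that cuts one node's token range into contiguous sub-ranges sized for a target number of rows. It is a simplification under stated assumptions: tokens are plain integers, rows are spread evenly over the range, and the function and parameter names are hypothetical. It does not reproduce how ColumnFamilyInputFormat actually computes its splits.

```python
from typing import List, Tuple


def split_token_range(start: int, end: int,
                      estimated_rows: int,
                      rows_per_split: int) -> List[Tuple[int, int]]:
    """Cut the half-open token range [start, end) into contiguous splits.

    The number of splits is ceil(estimated_rows / rows_per_split), with at
    least one split and never more splits than there are tokens.
    """
    if end <= start or rows_per_split <= 0:
        raise ValueError("need end > start and rows_per_split > 0")
    wanted = -(-estimated_rows // rows_per_split)  # ceiling division
    n_splits = min(max(1, wanted), end - start)
    step, extra = divmod(end - start, n_splits)
    splits = []
    low = start
    for i in range(n_splits):
        # The first `extra` splits get one additional token each.
        high = low + step + (1 if i < extra else 0)
        splits.append((low, high))
        low = high
    return splits


print(split_token_range(0, 100, estimated_rows=1000, rows_per_split=250))
```

Lowering rows_per_split yields more, smaller splits and therefore more map tasks; raising it does the opposite.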

MapReduce - OutputFormat
  • OutputFormat: ColumnFamilyOutputFormat
  • Configuration: ConfigHelper, Hadoop variables
  • Batches output (tunable)
  • Don’t have to use the Cassandra API
  • Some optimizations (e.g. ConsistencyLevel.ONE)
  • Uses Avro for output serialization (enables streaming)
  • Example usage in contrib/word_count
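The "batches output" point can be pictured as a writer that buffers mutations and sends them in groups. The sketch below shows only that general pattern; the class name, the mutation shape, and the `send_batch` callback are hypothetical and are not the ColumnFamilyOutputFormat API.

```python
from typing import Callable, List, Tuple

# Hypothetical mutation shape: (row key, column name, value).
Mutation = Tuple[str, str, str]


class BatchingWriter:
    """Buffer mutations and hand them to `send_batch` in fixed-size groups."""

    def __init__(self, send_batch: Callable[[List[Mutation]], None],
                 batch_size: int = 32) -> None:
        if batch_size <= 0:
            raise ValueError("batch_size must be > 0")
        self._send_batch = send_batch
        self._batch_size = batch_size
        self._buffer: List[Mutation] = []

    def write(self, mutation: Mutation) -> None:
        self._buffer.append(mutation)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        # Send a copy so the receiver is unaffected when the buffer is cleared.
        if self._buffer:
            self._send_batch(list(self._buffer))
            self._buffer.clear()

    def close(self) -> None:
        # Flush whatever is left when the task finishes.
        self.flush()


sent: List[List[Mutation]] = []
writer = BatchingWriter(sent.append, batch_size=2)
for i in range(5):
    writer.write((f"row{i}", "count", str(i)))
writer.close()
print([len(batch) for batch in sent])
```

The batch size is the tunable part: larger batches mean fewer calls to the cluster at the cost of more buffered data per task.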

Visualizing
  • Take vertical slices of columns over the whole column family
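One way to picture a vertical slice is as the same chosen columns read from every row. The sketch below models a column family as an in-memory dictionary of rows; the function name and the sample data are made up for illustration.

```python
from typing import Dict, Iterable


def vertical_slice(column_family: Dict[str, Dict[str, str]],
                   column_names: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Keep only the named columns from every row of the column family."""
    wanted = set(column_names)
    return {
        key: {name: value for name, value in columns.items() if name in wanted}
        for key, columns in column_family.items()
    }


# Made-up sample rows; rows need not all share the same columns.
users = {
    "alice": {"email": "alice@example.com", "city": "Austin", "plan": "pro"},
    "bob": {"email": "bob@example.com", "plan": "free"},
}
print(vertical_slice(users, ["city", "plan"]))
```

Because the data model is semi-structured, a row that lacks one of the requested columns simply contributes fewer values to the slice.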

Hadoop Streaming
  • What about languages outside of Java?
  • Build on what Hadoop uses: Streaming
  • Output streaming as of 0.7.0
  • Example in contrib/hadoop_streaming_output
  • Input streaming in progress, hoping for 0.7.2
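For readers new to Hadoop Streaming, the sketch below shows its general contract in Python: the mapper and the reducer each turn text lines into tab-separated key/value lines, and the reducer sees its input sorted by key. This is only the generic text form; it does not produce the Avro-serialized records that the slides mention for Cassandra output, and the function names are illustrative.

```python
from itertools import groupby
from typing import Iterable, Iterator


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Mapper: emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Reducer: sum the counts of consecutive lines that share a key.

    The input must already be sorted by key, as Hadoop guarantees.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local dry run: sorted() stands in for Hadoop's shuffle-and-sort step.
mapped = list(map_lines(["to be or not to be"]))
print(list(reduce_lines(sorted(mapped))))
```

In a real streaming job each function would sit in its own script, reading from sys.stdin and printing each line it yields.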

Pig
  • Developed at Yahoo!
  • PigLatin/Grunt shell
  • Powerful scripting language for analytics
  • Configuration: Hadoop/Env variables
  • Uses pig 0.7+
  • Example usage in contrib/pig
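
-- Load every row of Keyspace1/Standard1 as (key, bag of (name, value) columns)
rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage()
    as (key:chararray, cols:bag{col:tuple(name:bytearray, value:bytearray)});
-- Flatten to one record per column, then one record per word in each value
cols = FOREACH rows GENERATE flatten(cols) as (name, value);
words = FOREACH cols GENERATE flatten(TOKENIZE((chararray) value)) as word;
-- Count the occurrences of each word and print the ten most frequent
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words) as count;
ordered = ORDER counts BY count DESC;
topten = LIMIT ordered 10;
dump topten;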
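As a rough guide to what the Pig script above computes, here is the same top-ten word count written against in-memory rows. The row shape mirrors the script's schema (a key plus a list of name/value columns). The sample data is made up, and `str.split()` only approximates Pig's TOKENIZE, which also breaks on a few punctuation characters.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Column = Tuple[str, str]        # (name, value)
Row = Tuple[str, List[Column]]  # (key, columns)


def top_words(rows: Iterable[Row], limit: int = 10) -> List[Tuple[str, int]]:
    """Count the words in every column value and return the most frequent."""
    counts: Counter = Counter()
    for _key, columns in rows:
        for _name, value in columns:
            # Whitespace split: a stand-in for TOKENIZE in the Pig script.
            counts.update(value.split())
    return counts.most_common(limit)


# Made-up sample rows.
sample_rows = [
    ("k1", [("c1", "hello world"), ("c2", "hello")]),
    ("k2", [("c1", "world hello")]),
]
print(top_words(sample_rows))
```

The difference in practice is where the work runs: Pig compiles the script into MapReduce jobs that execute next to the data, while this sketch runs in a single process.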

Summary of Integration
  • ColumnFamilyInputFormat
  • ColumnFamilyOutputFormat
  • Hadoop Streaming Output
  • Pig support: Cassandra LoadFunc

Users of Cassandra + Hadoop
  • Raptr.com
      • Home grown solution -> Cassandra + Hadoop
      • Query time: hours -> minutes
      • Pig obviated their need for multi-lingual MR
      • Speed and ease are enabling
  • Imagini/Visual DNA
  • The Dachis Group
  • US Government (Digital Reasoning)
      • See http://github.com/digitalreasoning/PyStratus

Future
  • Hive support in progress (HIVE-1434)
  • Hadoop Input Streaming (hoping for 0.7.2 - 1497)
  • Pig Storage Func (CASSANDRA-1828)
  • Row predicates (pending CASSANDRA-1600)
  • MapReduce et al over secondary indexes (1600)
  • Performance improvements (though already good)

Conclusion
  • Performant OLTP + powerful OLAP
  • Less need to shuttle data between storage systems
  • Data locality for processing
  • Scales with the cluster
  • Can separate analytics load into virtual DC

Learn More
  • About Cassandra
      • http://www.datastax.com/docs
      • http://wiki.apache.org/cassandra
      • Search and subscribe to the user mailing list (very active)
      • #Cassandra on freenode (IRC): ~150-200+ users from around the world
      • Cassandra: The Definitive Guide
  • About Hadoop Support in Cassandra
      • Check out various <source>/contrib modules: README/code
      • http://wiki.apache.org/cassandra/HadoopSupport

Questions
  • About me: jeremy.hanna@dachisgroup.com
  • @jeromatron on Twitter
  • jeromatron on IRC in #cassandra


Cassandra/Hadoop Integration

  • 2. Cassandra (basic overview): BigTable + Dynamo; semi-structured data model; decentralized, with no special roles and no SPOF; horizontally scalable; ridiculously fast writes, fast reads; tunably consistent; cross-DC capable.
  • 3. Querying with Cassandra: design your data model based on your query model; real-time ad-hoc queries aren't viable; secondary indexes help; what about analytics?
  • 4. Enter Hadoop: Hadoop brings analytics; MapReduce; Pig/Hive and other tools built above MapReduce; configurable data sources/destinations; many already familiar with it; active community.
  • 5. Cluster Configuration: Basic recipe: overlay Hadoop on top of Cassandra; separate server for name node and job tracker; co-locate task trackers with Cassandra nodes; data nodes for distributed cache. Voilà: data locality; analytics engine scales with data.
  • 6. Cluster Tuning: always tune Cassandra to taste. For Hadoop workloads you might: have a separate analytics virtual datacenter, using the NetworkTopologyStrategy; tune rpc_timeout_in_ms in cassandra.yaml (higher); tune cassandra.range.batch.size (see org.apache.cassandra.hadoop.ConfigHelper).
  • 7. All-in-one Configuration: JobTracker and NameNode; each node has Cassandra, a TaskTracker, and a DataNode (for distributed cache).
  • 8. Separate Analytics Configuration: separated nodes for analytics; nodes for real-time random access; a single Cassandra cluster with different virtual data centers.
  • 9. MapReduce - InputFormat: Cassandra-specific InputFormat, ColumnFamilyInputFormat; configuration via ConfigHelper and Hadoop variables; InputSplits over the data (tunable); example usage in contrib/word_count.
  • 10. MapReduce - OutputFormat: ColumnFamilyOutputFormat; configuration via ConfigHelper and Hadoop variables; batches output (tunable); don't have to use the Cassandra API; some optimizations (e.g. ConsistencyLevel.ONE); uses Avro for output serialization (enables streaming); example usage in contrib/word_count.
  • 11. Visualizing: take vertical slices of columns over the whole column family.
  • 12. Hadoop Streaming: what about languages outside of Java? Build on what Hadoop uses: Streaming; output streaming as of 0.7.0; example in contrib/hadoop_streaming_output; input streaming in progress, hoping for 0.7.2.
  • 13. Pig: developed at Yahoo!; Pig Latin/Grunt shell; powerful scripting language for analytics; configuration via Hadoop/env variables; uses Pig 0.7+; example usage in contrib/pig.
  • 14. rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage()
            as (key:chararray, cols:bag{col:tuple(name:bytearray, value:bytearray)});
        cols = FOREACH rows GENERATE flatten(cols) as (name, value);
        words = FOREACH cols GENERATE flatten(TOKENIZE((chararray) value)) as word;
        grouped = GROUP words BY word;
        counts = FOREACH grouped GENERATE group, COUNT(words) as count;
        ordered = ORDER counts BY count DESC;
        topten = LIMIT ordered 10;
        dump topten;
  • 15. Summary of Integration: ColumnFamilyInputFormat; ColumnFamilyOutputFormat; Hadoop Streaming output; Pig support via a Cassandra LoadFunc.
  • 16. Users of Cassandra + Hadoop: Raptr.com (home-grown solution -> Cassandra + Hadoop; query time: hours -> minutes; Pig obviated their need for multi-lingual MR; speed and ease are enabling); Imagini/Visual DNA; The Dachis Group; US Government (Digital Reasoning), see http://github.com/digitalreasoning/PyStratus.
  • 17. Future: Hive support in progress (HIVE-1434); Hadoop input streaming (hoping for 0.7.2, CASSANDRA-1497); Pig StoreFunc (CASSANDRA-1828); row predicates (pending CASSANDRA-1600); MapReduce et al. over secondary indexes (CASSANDRA-1600); performance improvements (though already good).
  • 18. Conclusion: performant OLTP + powerful OLAP; less need to shuttle data between storage systems; data locality for processing; scales with the cluster; can separate analytics load into a virtual DC.
  • 19. Learn More: About Cassandra: http://www.datastax.com/docs; http://wiki.apache.org/cassandra; search and subscribe to the user mailing list (very active); #Cassandra on freenode (IRC), ~150-200+ users from around the world; Cassandra: The Definitive Guide. About Hadoop support in Cassandra: check out the various <source>/contrib modules (README/code); http://wiki.apache.org/cassandra/HadoopSupport.
  • 20. Questions: About me: jeremy.hanna@dachisgroup.com; @jeromatron on Twitter; jeromatron on IRC in #cassandra.
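
The Pig word-count script on slide 14 can be read as a small dataflow: flatten columns, tokenize values, group, count, order, limit. The following is a minimal Python sketch of that same dataflow over an in-memory dictionary. It is not how Pig or CassandraStorage executes the script; the sample `rows` data, the `top_words` function name, and the use of whitespace splitting in place of Pig's TOKENIZE are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical stand-in for a column family: row key -> {column name: value}.
rows = {
    "row1": {"text": "hadoop loves cassandra"},
    "row2": {"text": "cassandra loves pig", "note": "pig loves hadoop"},
}


def top_words(rows, limit=10):
    """Mirror the steps of the Pig script on plain Python data."""
    # FOREACH rows GENERATE flatten(cols): one (name, value) pair per column.
    cols = [(name, value)
            for columns in rows.values()
            for name, value in columns.items()]
    # TOKENIZE each value; whitespace splitting is an assumed simplification.
    words = [word for _, value in cols for word in value.split()]
    # GROUP words BY word, then COUNT each group.
    counts = Counter(words)
    # ORDER BY count DESC; ties are broken alphabetically here so the
    # result is deterministic (the Pig script itself does not specify this).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # LIMIT ordered <limit>.
    return ordered[:limit]


if __name__ == "__main__":
    for word, count in top_words(rows):
        print(word, count)
```

In the real integration the grouping and counting run as MapReduce jobs across the cluster, which is what lets the same few lines of Pig scale with the data.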

Editor's notes

  1. Floating above the clouds
  2. Mention how InputSplit works and how it can choose among replicas – array of locations returned.
  3. Highlight how this is the same extension point that is used with HDFS, HBase and any other data source/destination for MapReduce.
  4. Mention Jeff Hodges, Johan, Stu, and Todd Lipcon.
  5. IOW, are people using this stuff in the real world? In production? Put some notes in here about Raptr's and Imagini's use cases.
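
Note 2 says an InputSplit returns an array of locations so the scheduler can choose among replicas. A rough Python sketch of that idea follows. It is an illustration under stated assumptions, not the ColumnFamilyInputFormat implementation: the `Split` class, `make_splits`, and `pick_host` are hypothetical names, replicas are placed on the next nodes around the ring (a simple placement assumed here), and each token range becomes exactly one split, whereas the slides describe split size as tunable.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Split:
    start_token: int      # exclusive start of the token range
    end_token: int        # inclusive end of the token range
    locations: List[str]  # hosts assumed to hold a replica of this range


def make_splits(ring: List[Tuple[int, str]], replication_factor: int) -> List[Split]:
    """Build one split per token range from a ring sorted by token.

    Assumption: the range ending at a node's token is stored on that node
    and on the following replication_factor - 1 nodes around the ring.
    """
    n = len(ring)
    replicas = min(replication_factor, n)
    splits = []
    for i, (token, _host) in enumerate(ring):
        prev_token = ring[i - 1][0]  # i - 1 == -1 wraps to the last node
        locations = [ring[(i + k) % n][1] for k in range(replicas)]
        splits.append(Split(prev_token, token, locations))
    return splits


def pick_host(split: Split, free_task_trackers: Set[str]) -> Optional[str]:
    """Return the first replica host with a free task tracker, if any."""
    for host in split.locations:
        if host in free_task_trackers:
            return host
    return None  # no local option; a scheduler would fall back to a remote read


if __name__ == "__main__":
    ring = [(0, "a"), (100, "b"), (200, "c")]
    for split in make_splits(ring, replication_factor=2):
        print(split, "->", pick_host(split, {"a"}))
```

Because task trackers are co-located with Cassandra nodes in the cluster layouts described in the slides, any host in `locations` can run the map task against local data.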