Hadoop + Cassandra
Cassandra

   Distributed and decentralized data store
   Very efficient for fast writes and reads (we ourselves run a website that
   reads/writes to Cassandra in real time)
   But what about analytics?
Hadoop over Cassandra

   Useful for -
      Built-in support for Hadoop since 0.6
      Can use any language without having to understand the Thrift API
      Distributed analysis - massively reduces analysis time
      Possible to use Pig/Hive
   What is supported -
      Reading from Cassandra since 0.6
      Writing to Cassandra since 0.7
      Hadoop Streaming since 0.7 (only output streaming supported as of now)
Cluster Configuration

   Ideal configuration -
      Overlay a Hadoop cluster over the Cassandra nodes
      Separate server for the namenode/jobtracker
      Tasktracker on each Cassandra node
      At least one node needs to be a datanode, for house-keeping purposes
   What this achieves -
      Data locality
      Analytics engine scales with the data
Ideal is not always ideal enough

   A certain level of tuning is always required
      Tune cassandra.range.batch.size - you will usually want to reduce it
      (see the sketch after this list).
      Tune rpc_timeout_in_ms in cassandra.yaml (storage-conf.xml on 0.6) to
      avoid time-outs.
      Use NetworkTopologyStrategy and custom snitches to separate the
      analytics nodes into a virtual data-center.
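
A minimal sketch of the job-side part of this tuning, assuming the ConfigHelper
API used in the example later in this deck; the batch-size value is purely
illustrative, and rpc_timeout_in_ms is a server-side setting that has to be
changed on the Cassandra nodes, not in the job.

// Sketch only - job-side tuning (ConfigHelper is the same helper used in the example below)
Configuration conf = job.getConfiguration();

// Smaller batches mean more Thrift round-trips but fewer range-scan time-outs
// on wide rows. 1024 is illustrative, not a recommendation.
ConfigHelper.setRangeBatchSize(conf, 1024);
// Equivalent property form:
conf.set("cassandra.range.batch.size", "1024");

// rpc_timeout_in_ms cannot be set here - raise it in cassandra.yaml
// (storage-conf.xml on 0.6) on the Cassandra nodes themselves.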
Sample cluster topology

   [Diagram: two example topologies - an all-in-one topology, and real-time
   random access with separate analytics nodes]
Classes that make all this possible

    ColumnFamilyRecordReader and ColumnFamilyRecordWriter
       Read/write rows from/to Cassandra
    ColumnFamilySplit
       Creates splits over the Cassandra data
    ConfigHelper
       Helper to configure Cassandra-specific information
    ColumnFamilyInputFormat and ColumnFamilyOutputFormat
       Inherit Hadoop classes so that Hadoop jobs can interact with the data
       (read/write)
    AvroOutputReader
       Streams output to Cassandra
Example

// Imports omitted for brevity; host, keyspace, columnFamily, batchSize and
// OUTPUT_PATH_LOOKALIKE are fields defined elsewhere in the class.
public class Lookalike extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new Lookalike(), args);
        System.exit(0);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        Job job = new Job(getConf(), "Lookalike Report");
        job.setJarByClass(Lookalike.class);
        job.setMapperClass(LookalikeMapper.class);
        job.setReducerClass(LookalikeReducer.class);
        job.setOutputKeyClass(TextPair.class);                              // [1]
        job.setOutputValueClass(TextPair.class);
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH_LOOKALIKE));
        job.setPartitionerClass(KeyPartitioner.class);                      // [2]
        job.setGroupingComparatorClass(TextPair.GroupingComparator.class);  // [2]

        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        Configuration conf = job.getConfiguration();
        ConfigHelper.setThriftContact(conf, host, 9160);
        ConfigHelper.setColumnFamily(conf, keyspace, columnFamily);
        ConfigHelper.setRangeBatchSize(conf, batchSize);
        List<byte[]> columnNames = Arrays.asList("properties".getBytes(),
                                                 "personality".getBytes());
        SlicePredicate predicate = new SlicePredicate().setColumn_names(columnNames);
        ConfigHelper.setSlicePredicate(conf, predicate);

        job.waitForCompletion(true);
        return 0;
    }

1 - See this for more on TextPair - http://bit.ly/fCtaZA
2 - See this for more on Secondary Sort - http://bit.ly/eNWbN8
public static class LookalikeMapper
        extends Mapper<String, SortedMap<byte[], IColumn>, TextPair, TextPair>
{
    private HashMap<String, String> targetUserMap;

    @Override
    protected void setup(Context context)
    {
        targetUserMap = loadTargetUserMap();
    }

    @Override
    public void map(String key, SortedMap<byte[], IColumn> columns, Context context)
            throws IOException, InterruptedException
    {
        // Read the properties and personality columns
        IColumn propertiesColumn = columns.get("properties".getBytes());
        if (propertiesColumn == null)
            return;
        String propertiesValue = new String(propertiesColumn.value()); // JSONObject

        IColumn personalityColumn = columns.get("personality".getBytes());
        if (personalityColumn == null)
            return;
        String personalityValue = new String(personalityColumn.value()); // JSONObject

        for (Map.Entry<String, String> targetUser : targetUserMap.entrySet())
        {
            double score = scoreLookAlike(targetUser.getValue(), personalityValue);
            if (score >= FILTER_SCORE)
            {
                context.write(new TextPair(propertiesValue, String.valueOf(score)),
                              new TextPair(targetUser.getKey(), String.valueOf(score)));
            }
        }
    }
}
public static class LookalikeReducer extends Reducer<TextPair, TextPair, Text, Text>
{
    @Override
    public void reduce(TextPair key, Iterable<TextPair> values, Context context)
                throws IOException, InterruptedException
    {
        int counter = 1;
        for (TextPair value : values)
        {
            if (counter >= USER_COUNT) // USER_COUNT = 100
            {
                break;
            }
            context.write(key.getFirst(),
                          new Text(value.getFirst() + "\t" + value.getSecond()));
            counter++;
        }
    }
}


//Sample Output
//TargetUser                            Lookalike User                        Score
//7f55fdd8-76dc-102e-b2e6-001ec9d506ae  de6fbeac-7205-ff9c-d74d-2ec57841fd0b  0.2602739

//It is also possible to write this output to Cassandra (we don't do this currently).
//It is quite straightforward - see the word_count example in the Cassandra contrib
//folder (a sketch follows below).
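
A minimal sketch of what writing back to Cassandra would look like, modeled on
the word_count contrib example and the 0.7-era Hadoop/Thrift classes
(ColumnFamilyOutputFormat, ConfigHelper.setOutputColumnFamily, Mutation, Column,
ColumnOrSuperColumn). The reducer below is illustrative, not the one we run;
check the word_count example shipped with your Cassandra version for the exact
Column/Mutation constructors.

// Sketch only - writing reducer output back to Cassandra, word_count style.
// Job setup (in run()):
//   job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
//   job.setOutputKeyClass(ByteBuffer.class);
//   job.setOutputValueClass(List.class);
//   ConfigHelper.setOutputColumnFamily(job.getConfiguration(), keyspace, outputColumnFamily);

public static class ReducerToCassandra
        extends Reducer<TextPair, TextPair, ByteBuffer, List<Mutation>>
{
    @Override
    public void reduce(TextPair key, Iterable<TextPair> values, Context context)
                throws IOException, InterruptedException
    {
        // Row key = target user; one column per lookalike user, value = score
        ByteBuffer rowKey = ByteBuffer.wrap(key.getFirst().toString().getBytes());
        for (TextPair value : values)
        {
            Column column = new Column();
            column.setName(ByteBuffer.wrap(value.getFirst().toString().getBytes()));
            column.setValue(ByteBuffer.wrap(value.getSecond().toString().getBytes()));
            column.setTimestamp(System.currentTimeMillis());

            Mutation mutation = new Mutation();
            mutation.setColumn_or_supercolumn(new ColumnOrSuperColumn());
            mutation.column_or_supercolumn.setColumn(column);

            context.write(rowKey, Collections.singletonList(mutation));
        }
    }
}

ColumnFamilyOutputFormat batches these mutations and applies them to the output
column family over Thrift, so no extra client code is needed in the job.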
Some stats

   Cassandra cluster of 16 nodes
   Hadoop cluster of 5 nodes
   Over 120 million rows
   Over 600 GB of data
   Over 20 trillion computations
   Hadoop - just over 4 hours
   Serial PHP script - crossed 48 hours and was still chugging along
Links


Cassandra : The Definitive Guide

Hadoop MapReduce in Cassandra cluster (DataStax)

Cassandra and Hadoop MapReduce (DataStax)

Cassandra Wiki - Hadoop Support

Cassandra/Hadoop Integration (Jeremy Hanna)

Hadoop : The Definitive Guide
Questions
