Hadoop + Cassandra
Cassandra

   Distributed and decentralized data store
   Very efficient for fast writes and reads (we ourselves run a website that
   reads/writes to Cassandra in real time)
   But what about analytics?
Hadoop over Cassandra

   Useful for -
      Built-in support for Hadoop since 0.6
      Can use any language without having to understand the Thrift API
      Distributed analysis - massively reduces analysis time
      Possible to use Pig/Hive
   What is supported -
      Reading from Cassandra since 0.6
      Writing to Cassandra since 0.7
      Hadoop Streaming since 0.7 (only output streaming supported as of now)
Cluster Configuration

   Ideal configuration -
      Overlay a Hadoop cluster over the Cassandra nodes
      Separate server for the namenode/jobtracker
      Tasktracker on each Cassandra node
      At least one node needs to be a datanode, for house-keeping purposes
   What this achieves -
      Data locality
      Analytics engine scales with the data
Ideal is not always ideal enough

   A certain level of tuning is always required
      Tune cassandra.range.batch.size - you will usually want to reduce it
      (see the sketch after this list).
      Tune rpc_timeout_in_ms in cassandra.yaml (storage-conf.xml on 0.6) to
      avoid time-outs.
      Use NetworkTopologyStrategy and custom snitches to separate the
      analytics nodes into a virtual data-center.
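
A minimal sketch of the job-side part of this tuning, assuming the ConfigHelper
API used in the example later in this deck; the batch-size value is purely
illustrative, and rpc_timeout_in_ms is a server-side setting that has to be
changed on the Cassandra nodes, not in the job.

// Sketch only - job-side tuning (ConfigHelper is the same helper used in the example below)
Configuration conf = job.getConfiguration();

// Smaller batches mean more Thrift round-trips but fewer range-scan time-outs
// on wide rows. 1024 is illustrative, not a recommendation.
ConfigHelper.setRangeBatchSize(conf, 1024);
// Equivalent property form:
conf.set("cassandra.range.batch.size", "1024");

// rpc_timeout_in_ms cannot be set here - raise it in cassandra.yaml
// (storage-conf.xml on 0.6) on the Cassandra nodes themselves.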
Sample cluster topology

   [Diagram: two example topologies - an all-in-one topology, and real-time
   random access with separate analytics nodes]
Classes that make all this possible

    ColumnFamilyRecordReader and ColumnFamilyRecordWriter
       Read/write rows from/to Cassandra
    ColumnFamilySplit
       Creates splits over the Cassandra data
    ConfigHelper
       Helper to configure Cassandra-specific information
    ColumnFamilyInputFormat and ColumnFamilyOutputFormat
       Inherit Hadoop classes so that Hadoop jobs can interact with the data
       (read/write)
    AvroOutputReader
       Streams output to Cassandra
Example

// Imports omitted for brevity; host, keyspace, columnFamily, batchSize and
// OUTPUT_PATH_LOOKALIKE are fields defined elsewhere in the class.
public class Lookalike extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new Lookalike(), args);
        System.exit(0);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        Job job = new Job(getConf(), "Lookalike Report");
        job.setJarByClass(Lookalike.class);
        job.setMapperClass(LookalikeMapper.class);
        job.setReducerClass(LookalikeReducer.class);
        job.setOutputKeyClass(TextPair.class);                              // [1]
        job.setOutputValueClass(TextPair.class);
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH_LOOKALIKE));
        job.setPartitionerClass(KeyPartitioner.class);                      // [2]
        job.setGroupingComparatorClass(TextPair.GroupingComparator.class);  // [2]

        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        Configuration conf = job.getConfiguration();
        ConfigHelper.setThriftContact(conf, host, 9160);
        ConfigHelper.setColumnFamily(conf, keyspace, columnFamily);
        ConfigHelper.setRangeBatchSize(conf, batchSize);
        List<byte[]> columnNames = Arrays.asList("properties".getBytes(),
                                                 "personality".getBytes());
        SlicePredicate predicate = new SlicePredicate().setColumn_names(columnNames);
        ConfigHelper.setSlicePredicate(conf, predicate);

        job.waitForCompletion(true);
        return 0;
    }

1 - See this for more on TextPair - http://bit.ly/fCtaZA
2 - See this for more on Secondary Sort - http://bit.ly/eNWbN8
public static class LookalikeMapper
        extends Mapper<String, SortedMap<byte[], IColumn>, TextPair, TextPair>
{
    private HashMap<String, String> targetUserMap;

    @Override
    protected void setup(Context context)
    {
        targetUserMap = loadTargetUserMap();
    }

    @Override
    public void map(String key, SortedMap<byte[], IColumn> columns, Context context)
            throws IOException, InterruptedException
    {
        // Read the properties and personality columns
        IColumn propertiesColumn = columns.get("properties".getBytes());
        if (propertiesColumn == null)
            return;
        String propertiesValue = new String(propertiesColumn.value()); // JSONObject

        IColumn personalityColumn = columns.get("personality".getBytes());
        if (personalityColumn == null)
            return;
        String personalityValue = new String(personalityColumn.value()); // JSONObject

        for (Map.Entry<String, String> targetUser : targetUserMap.entrySet())
        {
            double score = scoreLookAlike(targetUser.getValue(), personalityValue);
            if (score >= FILTER_SCORE)
            {
                context.write(new TextPair(propertiesValue, String.valueOf(score)),
                              new TextPair(targetUser.getKey(), String.valueOf(score)));
            }
        }
    }
}
public static class LookalikeReducer extends Reducer<TextPair, TextPair, Text, Text>
{
    @Override
    public void reduce(TextPair key, Iterable<TextPair> values, Context context)
                throws IOException, InterruptedException
    {
        int counter = 1;
        for (TextPair value : values)
        {
            if (counter >= USER_COUNT) // USER_COUNT = 100
            {
                break;
            }
            context.write(key.getFirst(),
                          new Text(value.getFirst() + "\t" + value.getSecond()));
            counter++;
        }
    }
}


//Sample Output
//TargetUser                            Lookalike User                        Score
//7f55fdd8-76dc-102e-b2e6-001ec9d506ae  de6fbeac-7205-ff9c-d74d-2ec57841fd0b  0.2602739

//It is also possible to write this output to Cassandra (we don't do this currently).
//It is quite straightforward - see the word_count example in the Cassandra contrib
//folder (a sketch follows below).
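
A minimal sketch of what writing back to Cassandra would look like, modeled on
the word_count contrib example and the 0.7-era Hadoop/Thrift classes
(ColumnFamilyOutputFormat, ConfigHelper.setOutputColumnFamily, Mutation, Column,
ColumnOrSuperColumn). The reducer below is illustrative, not the one we run;
check the word_count example shipped with your Cassandra version for the exact
Column/Mutation constructors.

// Sketch only - writing reducer output back to Cassandra, word_count style.
// Job setup (in run()):
//   job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
//   job.setOutputKeyClass(ByteBuffer.class);
//   job.setOutputValueClass(List.class);
//   ConfigHelper.setOutputColumnFamily(job.getConfiguration(), keyspace, outputColumnFamily);

public static class ReducerToCassandra
        extends Reducer<TextPair, TextPair, ByteBuffer, List<Mutation>>
{
    @Override
    public void reduce(TextPair key, Iterable<TextPair> values, Context context)
                throws IOException, InterruptedException
    {
        // Row key = target user; one column per lookalike user, value = score
        ByteBuffer rowKey = ByteBuffer.wrap(key.getFirst().toString().getBytes());
        for (TextPair value : values)
        {
            Column column = new Column();
            column.setName(ByteBuffer.wrap(value.getFirst().toString().getBytes()));
            column.setValue(ByteBuffer.wrap(value.getSecond().toString().getBytes()));
            column.setTimestamp(System.currentTimeMillis());

            Mutation mutation = new Mutation();
            mutation.setColumn_or_supercolumn(new ColumnOrSuperColumn());
            mutation.column_or_supercolumn.setColumn(column);

            context.write(rowKey, Collections.singletonList(mutation));
        }
    }
}

ColumnFamilyOutputFormat batches these mutations and applies them to the output
column family over Thrift, so no extra client code is needed in the job.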
Some stats

   Cassandra cluster of 16 nodes
   Hadoop cluster of 5 nodes
   Over 120 million rows
   Over 600 GB of data
   Over 20 trillion computations
   Hadoop - just over 4 hours
   Serial PHP script - crossed 48 hours and was still chugging along
Links


Cassandra : The Definitive Guide

Hadoop MapReduce in Cassandra cluster (DataStax)

Cassandra and Hadoop MapReduce (DataStax)

Cassandra Wiki - Hadoop Support

Cassandra/Hadoop Integration (Jeremy Hanna)

Hadoop : The Definitive Guide
Questions
