Store and Process Big Data with Hadoop and Cassandra
1. Store and Process Big Data with Hadoop and Cassandra
Apache BarCamp
By Deependra Ariyadewa
WSO2, Inc.
2. Store Data with Apache Cassandra
● Project site: http://cassandra.apache.org
● The latest release is 1.0.7.
● Cassandra is in use at Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco, OpenX, Digg, CloudKick, and Ooyala.
● Cassandra users: http://www.datastax.com/cassandrausers
● The largest known Cassandra cluster holds over 300 TB of data on over 400 machines.
● Commercial support: http://wiki.apache.org/cassandra/ThirdPartySupport
7. Cassandra DevOps
$CASSANDRA_HOME/bin$ ./cassandra-cli --host localhost
[default@unknown] show keyspaces;
Keyspace: system:
Replication Strategy: org.apache.cassandra.locator.LocalStrategy
Durable Writes: true
Options: [replication_factor:1]
Column Families:
ColumnFamily: HintsColumnFamily (Super)
"hinted handoff data"
Key Validation Class: org.apache.cassandra.db.marshal.BytesType
Default column value validator: org.apache.cassandra.db.marshal.BytesType
Columns sorted by: org.apache.cassandra.db.marshal.BytesType/org.apache.cassandra.db.marshal.BytesType
Row cache size / save period in seconds / keys to save : 0.0/0/all
Row Cache Provider: org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
Key cache size / save period in seconds: 0.01/0
GC grace seconds: 0
Compaction min/max thresholds: 4/32
Read repair chance: 0.0
Replicate on write: true
Bloom Filter FP chance: default
Built indexes: []
Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
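Day-to-day operations also lean on nodetool, which ships in the same bin directory. A minimal health check of ring membership and per-node load (assuming a locally running node on the default JMX port) might look like:
$CASSANDRA_HOME/bin$ ./nodetool -h localhost ring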
8. Cassandra CLI
[default@apache] create column family Location with comparator=UTF8Type and
default_validation_class=UTF8Type and key_validation_class=UTF8Type;
f04561a0-60ed-11e1-0000-242d50cf1fbf
Waiting for schema agreement...
... schemas agree across the cluster
[default@apache] set Location[00001][City]='Colombo';
Value inserted.
Elapsed time: 140 msec(s).
[default@apache] list Location;
Using default limit of 100
-------------------
RowKey: 00001
=> (column=City, value=Colombo, timestamp=1330311097464000)
1 Row Returned.
Elapsed time: 122 msec(s).
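A single column can also be read back to verify the write; a minimal sketch with the CLI's get command, which should echo the column inserted above:
[default@apache] get Location[00001][City];
=> (column=City, value=Colombo, timestamp=1330311097464000)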
9. Store Data with Hector
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.factory.HFactory;

import java.util.HashMap;
import java.util.Map;

public class ExampleHelper {
    public static final String CLUSTER_NAME = "ClusterOne";
    public static final String USERNAME_KEY = "username";
    public static final String PASSWORD_KEY = "password";
    public static final String RPC_PORT = "9160";
    public static final String CSS_NODE0 = "localhost";
    public static final String CSS_NODE1 = "css1.stratoslive.wso2.com";
    public static final String CSS_NODE2 = "css2.stratoslive.wso2.com";

    // Build a Hector Cluster handle over the three Cassandra nodes,
    // authenticating with the given credentials over Thrift (port 9160).
    public static Cluster createCluster(String username, String password) {
        Map<String, String> credentials = new HashMap<String, String>();
        credentials.put(USERNAME_KEY, username);
        credentials.put(PASSWORD_KEY, password);
        String hostList = CSS_NODE0 + ":" + RPC_PORT + "," +
                          CSS_NODE1 + ":" + RPC_PORT + "," +
                          CSS_NODE2 + ":" + RPC_PORT;
        return HFactory.createCluster(CLUSTER_NAME,
                new CassandraHostConfigurator(hostList), credentials);
    }
}
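A call site for this helper might look like the following sketch; the credentials are placeholders, and the "apache" keyspace name mirrors the CLI examples above:
// Hypothetical usage: build the cluster handle, then bind a keyspace.
Cluster cluster = ExampleHelper.createCluster("admin", "secret");
Keyspace keyspace = HFactory.createKeyspace("apache", cluster);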
10. Store Data with Hector
Create Keyspace:
KeyspaceDefinition definition = new ThriftKsDef(keyspaceName);
cluster.addKeyspace(definition);
Add column family:
ColumnFamilyDefinition familyDefinition = new ThriftCfDef(keyspaceName, columnFamily);
cluster.addColumnFamily(familyDefinition);
Write Data:
Mutator<String> mutator = HFactory.createMutator(keyspace, new StringSerializer());
String columnValue = UUID.randomUUID().toString();
mutator.insert(rowKey, columnFamily, HFactory.createStringColumn(columnName, columnValue));
Read Data:
ColumnQuery<String, String, String> columnQuery = HFactory.createStringColumnQuery(keyspace);
columnQuery.setColumnFamily(columnFamily).setKey(key).setName(columnName);
QueryResult<HColumn<String, String>> result = columnQuery.execute();
HColumn<String, String> hColumn = result.get();
System.out.println("Column: " + hColumn.getName() + " Value : " + hColumn.getValue() + "\n");
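Deletes follow the same mutator pattern; a minimal sketch reusing the names above:
mutator.delete(rowKey, columnFamily, columnName, StringSerializer.get());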
11. Variable Consistency
● ANY: Wait until some replica has responded.
● ONE: Wait until one replica has responded.
● TWO: Wait until two replicas have responded.
● THREE: Wait until three replicas have responded.
● LOCAL_QUORUM: Wait for a quorum of replicas in the datacenter where the connection was established.
● EACH_QUORUM: Wait for a quorum of replicas in each datacenter.
● QUORUM: Wait for a quorum of replicas, regardless of datacenter.
● ALL: Block until all replicas have responded before returning to the client.
12. Variable Consistency
Create a customized Consistency Level:
ConfigurableConsistencyLevel configurableConsistencyLevel = new ConfigurableConsistencyLevel();
Map<String, HConsistencyLevel> clmap = new HashMap<String, HConsistencyLevel>();
clmap.put("MyColumnFamily", HConsistencyLevel.ONE);
configurableConsistencyLevel.setReadCfConsistencyLevels(clmap);
configurableConsistencyLevel.setWriteCfConsistencyLevels(clmap);
HFactory.createKeyspace("MyKeyspace", myCluster, configurableConsistencyLevel);
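Column families without an explicit entry in the map fall back to the policy's defaults, which can be set as well; a sketch:
configurableConsistencyLevel.setDefaultReadConsistencyLevel(HConsistencyLevel.QUORUM);
configurableConsistencyLevel.setDefaultWriteConsistencyLevel(HConsistencyLevel.QUORUM);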
13. CQL
Insert data with CQL:
cqlsh> INSERT INTO Location (KEY, City) VALUES ('00001', 'Colombo');
Retrieve data with CQL:
cqlsh> SELECT * FROM Location WHERE KEY='00001';
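Both statements assume the Location column family already exists; in the CQL of this era it could be created with a statement mirroring the CLI definition from earlier (a sketch):
cqlsh> CREATE COLUMNFAMILY Location (KEY text PRIMARY KEY, City text);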
14. Apache Hadoop
● Project site: http://hadoop.apache.org
● The latest version is 1.0.1.
● Hadoop is in use at Amazon, Yahoo, Adobe, eBay, and Facebook.
● Commercial support: http://hortonworks.com and http://www.cloudera.com
16. How to install Hadoop
● Download the artifact from http://hadoop.apache.org/common/releases.html
● Extract: tar -xzvf hadoop-1.0.1.tar.gz
● Copy and extract the installation on each data node:
scp hadoop-1.0.1.tar.gz user@datanode01:/home/hadoop
● Point the configuration at the name node and job tracker (see the sketch below), then start Hadoop: $HADOOP_HOME/bin/start-all.sh
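Before start-all.sh does anything useful, conf/core-site.xml and conf/mapred-site.xml must name the name node and job tracker. A minimal single-node sketch; the localhost addresses and ports are conventional defaults, not taken from the slides:
<!-- conf/core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>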
19. Simple MapReduce Job
Mapper:
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        // Tokenize the input line and emit (word, 1) for every token.
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
20. Simple MapReduce Job
Reducer:
public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        // Sum all the partial counts emitted for this word.
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
21. Simple MapReduce Job
Job Runner:
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
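Packaged into a jar, the job is submitted through the hadoop launcher. A minimal usage sketch; the jar name, driver class, and HDFS paths are placeholders:
$HADOOP_HOME/bin/hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output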