Store and Process Big Data with Hadoop and Cassandra
1. Store and Process Big Data with Hadoop and Cassandra
Apache BarCamp
By
Deependra Ariyadewa
WSO2, Inc.
2. Store Data with Apache Cassandra
● Project site : http://cassandra.apache.org
● The latest release version is 1.0.7
● Cassandra is in use at Netflix, Twitter, Urban Airship, Constant
Contact, Reddit, Cisco, OpenX, Digg, CloudKick and Ooyala
● Cassandra Users : http://www.datastax.com/cassandrausers
● The largest known Cassandra cluster holds over 300 TB of data across more than 400 machines.
● Commercial support : http://wiki.apache.org/cassandra/ThirdPartySupport
7. Cassandra DevOps
$CASSANDRA_HOME/bin$ ./cassandra-cli --host localhost
[default@unknown] show keyspaces;
Keyspace: system:
Replication Strategy: org.apache.cassandra.locator.LocalStrategy
Durable Writes: true
Options: [replication_factor:1]
Column Families:
ColumnFamily: HintsColumnFamily (Super)
"hinted handoff data"
Key Validation Class: org.apache.cassandra.db.marshal.BytesType
Default column value validator: org.apache.cassandra.db.marshal.BytesType
Columns sorted by: org.apache.cassandra.db.marshal.BytesType/org.apache.cassandra.db.marshal.BytesType
Row cache size / save period in seconds / keys to save : 0.0/0/all
Row Cache Provider: org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
Key cache size / save period in seconds: 0.01/0
GC grace seconds: 0
Compaction min/max thresholds: 4/32
Read repair chance: 0.0
Replicate on write: true
Bloom Filter FP chance: default
Built indexes: []
Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
8. Cassandra CLI
[default@apache] create column family Location with comparator=UTF8Type and
default_validation_class=UTF8Type and key_validation_class=UTF8Type;
f04561a0-60ed-11e1-0000-242d50cf1fbf
Waiting for schema agreement...
... schemas agree across the cluster
[default@apache] set Location[00001][City]='Colombo';
Value inserted.
Elapsed time: 140 msec(s).
[default@apache] list Location;
Using default limit of 100
-------------------
RowKey: 00001
=> (column=City, value=Colombo, timestamp=1330311097464000)
1 Row Returned.
Elapsed time: 122 msec(s).
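The stored column can also be read back individually. A quick sketch of the corresponding read in the same CLI session (the echoed timestamp is the one shown above):
[default@apache] get Location[00001][City];
=> (column=City, value=Colombo, timestamp=1330311097464000)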
9. Store Data with Hector
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.factory.HFactory;
import java.util.HashMap;
import java.util.Map;

public class ExampleHelper {
    public static final String CLUSTER_NAME = "ClusterOne";
    public static final String USERNAME_KEY = "username";
    public static final String PASSWORD_KEY = "password";
    public static final String RPC_PORT = "9160";
    public static final String CSS_NODE0 = "localhost";
    public static final String CSS_NODE1 = "css1.stratoslive.wso2.com";
    public static final String CSS_NODE2 = "css2.stratoslive.wso2.com";

    // Build a Hector Cluster handle for a three-node ring, passing the
    // given credentials along with the host list.
    public static Cluster createCluster(String username, String password) {
        Map<String, String> credentials = new HashMap<String, String>();
        credentials.put(USERNAME_KEY, username);
        credentials.put(PASSWORD_KEY, password);
        String hostList = CSS_NODE0 + ":" + RPC_PORT + ","
                + CSS_NODE1 + ":" + RPC_PORT + ","
                + CSS_NODE2 + ":" + RPC_PORT;
        return HFactory.createCluster(CLUSTER_NAME,
                new CassandraHostConfigurator(hostList), credentials);
    }
}
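Using the helper is then a single call; a minimal usage sketch (the credentials below are placeholders):
// Hypothetical credentials for illustration only.
Cluster cluster = ExampleHelper.createCluster("myUser", "myPassword");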
10. Store Data with Hector
Create Keyspace:
KeyspaceDefinition definition = new ThriftKsDef(keyspaceName);
cluster.addKeyspace(definition);
Add column family:
ColumnFamilyDefinition familyDefinition = new ThriftCfDef(keyspaceName, columnFamily);
cluster.addColumnFamily(familyDefinition);
Write Data:
Mutator<String> mutator = HFactory.createMutator(keyspace, new StringSerializer());
String columnValue = UUID.randomUUID().toString();
mutator.insert(rowKey, columnFamily, HFactory.createStringColumn(columnName, columnValue));
Read Data:
ColumnQuery<String, String, String> columnQuery = HFactory.createStringColumnQuery(keyspace);
columnQuery.setColumnFamily(columnFamily).setKey(key).setName(columnName);
QueryResult<HColumn<String, String>> result = columnQuery.execute();
HColumn<String, String> hColumn = result.get();
System.out.println("Column: " + hColumn.getName() + " Value : " + hColumn.getValue() + "n");
11. Variable Consistency
● ANY: Wait until some replica has responded.
● ONE: Wait until one replica has responded.
● TWO: Wait until two replicas have responded.
● THREE: Wait until three replicas have responded.
● LOCAL_QUORUM: Wait for a quorum of replicas in the datacenter where the connection was established.
● EACH_QUORUM: Wait for a quorum of replicas in each datacenter.
● QUORUM: Wait for a quorum of replicas, regardless of datacenter.
● ALL: Block until all replicas have responded before returning to the client.
12. Variable Consistency
Create a customized Consistency Level:
ConfigurableConsistencyLevel configurableConsistencyLevel = new ConfigurableConsistencyLevel();
Map<String, HConsistencyLevel> clmap = new HashMap<String, HConsistencyLevel>();
clmap.put("MyColumnFamily", HConsistencyLevel.ONE);
configurableConsistencyLevel.setReadCfConsistencyLevels(clmap);
configurableConsistencyLevel.setWriteCfConsistencyLevels(clmap);
HFactory.createKeyspace("MyKeyspace", myCluster, configurableConsistencyLevel);
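The per-column-family map can be paired with cluster-wide defaults; a short sketch, assuming Hector's default-level setters on ConfigurableConsistencyLevel:
// Assumed API: fall back to QUORUM for column families not in the map above.
configurableConsistencyLevel.setDefaultReadConsistencyLevel(HConsistencyLevel.QUORUM);
configurableConsistencyLevel.setDefaultWriteConsistencyLevel(HConsistencyLevel.QUORUM);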
13. CQL
Insert data with CQL:
cqlsh> INSERT INTO Location (KEY, City) VALUES ('00001', 'Colombo');
Retrieve data with CQL:
cqlsh> SELECT * FROM Location WHERE KEY='00001';
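The Location column family itself can also be created from cqlsh; a sketch in the CQL 2 syntax that shipped with Cassandra 1.0 (column types assumed from the earlier CLI definition):
cqlsh> CREATE COLUMNFAMILY Location (KEY text PRIMARY KEY, City text);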
14. Apache Hadoop
● Project Site: http://hadoop.apache.org
● Latest version : 1.0.1
● Hadoop is in use at Amazon, Yahoo, Adobe, eBay, and Facebook
● Commercial support : http://hortonworks.com, http://www.cloudera.com
16. How to install Hadoop
● Download the artifact from: http://hadoop.apache.org/common/releases.html
● Extract : tar -xzvf hadoop-1.0.1.tar.gz
● Copy the installation to each data node and extract it there:
scp hadoop-1.0.1.tar.gz user@datanode01:/home/hadoop
● Start Hadoop : $HADOOP_HOME/bin/start-all.sh
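One step worth calling out before start-all.sh: on a fresh install, HDFS must be formatted once. A minimal sketch (this erases any existing HDFS metadata, so run it only on first setup):
$HADOOP_HOME/bin/hadoop namenode -format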
19. Simple Mapreduce Job
Mapper
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            // Emit <token, 1> for every word in the line.
            output.collect(word, one);
        }
    }
}
20. Simple Mapreduce Job
Reducer:
public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        // Emit <word, total count> once per key.
        output.collect(key, new IntWritable(sum));
    }
}
21. Simple Mapreduce Job
Job Runner:
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

conf.setMapperClass(Map.class);
// Reduce doubles as the combiner: counting is associative, so partial
// sums can be computed on the map side to cut shuffle traffic.
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));

JobClient.runJob(conf);
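Once the classes are packaged into a jar, the job is submitted through the hadoop launcher; a minimal sketch (jar name and HDFS paths are placeholders):
$HADOOP_HOME/bin/hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output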