14.05.2012 Opening the tool box: Development, testing and deployment in the Hadoop ecosystem (Jean-Pierre König, MeMo News AG)
1. Jean-Pierre König, MeMo News AG
OPENING THE TOOL BOX
DEVELOPMENT, TESTING AND DEPLOYMENT IN THE HADOOP
ECOSYSTEM
14.05.12
http://www.flickr.com/photos/theaucitron/5810163712/sizes/l/in/photostream/
3. Development
The application is a ...
• Distributed newsagent
• GUI-less Java Application
• Spring-based 2-layer architecture
• Services and data access objects
• Client of Hadoop
• Dependencies on ZooKeeper and HBase
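The two layers listed above can be sketched in plain Java, without Spring. This is only an illustration under assumptions: ArticleDao, InMemoryArticleDao and ArticleService are hypothetical names, not classes from the application, and the in-memory list stands in for a real HBase-backed data access object.

```java
import java.util.ArrayList;
import java.util.List;

// Data access layer: hides the storage behind an interface (hypothetical name).
interface ArticleDao {
    void save(String url);

    List<String> findAll();
}

// Hypothetical in-memory implementation; a real one would talk to HBase.
class InMemoryArticleDao implements ArticleDao {
    private final List<String> urls = new ArrayList<>();

    @Override
    public void save(String url) {
        urls.add(url);
    }

    @Override
    public List<String> findAll() {
        // Return a copy so callers cannot modify the internal list.
        return new ArrayList<>(urls);
    }
}

// Service layer: business logic only, storage injected via the constructor.
class ArticleService {
    private final ArticleDao dao;

    ArticleService(ArticleDao dao) {
        this.dao = dao;
    }

    // Stores the URL unless it is blank or already known; returns true if stored.
    boolean register(String url) {
        if (url == null || url.trim().isEmpty() || dao.findAll().contains(url)) {
            return false;
        }
        dao.save(url);
        return true;
    }
}
```

Because the service only sees the interface, a test can hand it the in-memory implementation, while a container such as Spring would wire in the real one.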
4. Development (2)
We use Maven 3 for
• Project structure: corporate POM and modules
• Dependency management
• Building the artifact
[Diagram labels: Corporate POM; global, newsagent, tools, mapred; Loader (Client), Infrastructure, Model, Utils, Services, Data Access Objects]
6. MapReduce
• Java MR jobs for business processes
• Input and output paths either HDFS or HBase
• MR job chaining by Azkaban
• Pig and Hive for ad hoc queries
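Job chaining boils down to running each job only after the jobs it depends on have finished. The sketch below shows that ordering step in plain Java. It is not Azkaban's API or configuration format, and the class name JobChain and the job names in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: computes an execution order from declared job dependencies.
class JobChain {
    // Job name -> names of the jobs that must finish first.
    private final Map<String, List<String>> dependencies = new LinkedHashMap<>();

    JobChain add(String job, String... dependsOn) {
        dependencies.put(job, Arrays.asList(dependsOn));
        return this;
    }

    // Depth-first topological sort; throws if the dependencies contain a cycle.
    List<String> executionOrder() {
        Set<String> done = new LinkedHashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String job : dependencies.keySet()) {
            visit(job, done, inProgress);
        }
        return new ArrayList<>(done);
    }

    private void visit(String job, Set<String> done, Set<String> inProgress) {
        if (done.contains(job)) {
            return;
        }
        if (!inProgress.add(job)) {
            throw new IllegalStateException("Cyclic dependency at " + job);
        }
        for (String dep : dependencies.getOrDefault(job, Collections.<String>emptyList())) {
            visit(dep, done, inProgress);
        }
        inProgress.remove(job);
        done.add(job);
    }
}
```

For example, `new JobChain().add("export", "aggregate").add("aggregate", "extract").add("extract").executionOrder()` yields extract, aggregate, export, so each job's input already exists when it starts.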
10. HBase
• We use the Apache HBaseTestingUtility
• It runs a complete Hadoop instance in-process, with DFS, ZooKeeper and HBase
• It's very slow: treat these tests as long-running integration tests
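import static org.junit.Assert.fail;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HBaseTestingUtility;

public class ConfigurableHBaseClient {
    protected static HBaseTestingUtility TEST_UTIL;

    static {
        // Load the HBase defaults, then the test overrides from the classpath.
        final Configuration conf = HBaseConfiguration.create();
        conf.addResource("hbase-default-test.xml");
        try {
            // HBaseTestingUtilityFactory is not shown on the slide; presumably a project helper.
            TEST_UTIL = HBaseTestingUtilityFactory.getMiniCluster(1, conf);
        } catch (final Exception e) {
            fail("Could not start hadoop mini cluster.");
        }
    }
}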
11. MapReduce
• Since business logic is involved, we use MRUnit (hadoop-mrunit) for testing MapReduce jobs
• It's in-memory testing
• The Mapper/Reducer under test is run by a parameterized driver
@Test
public void reduceShouldWriteExactlyOneLinePerMap() throws IOException {
    final List<DoubleWritable> values = new ArrayList<DoubleWritable>();
    values.add(new DoubleWritable(399287729));
    this.driver.withInput(new Text("de.t-online/nachrichten"), values);
    this.driver.run();
    assertEquals(1, this.driver.getCounters().findCounter(
            MeMoCounters.SIGNALS_WRITTEN).getValue());
}
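The driver idea behind the test above can be sketched without Hadoop or MRUnit on the classpath: a small driver feeds one key and its values to the reducer and captures output and counters in memory. Everything here is a stand-in under assumptions. InMemoryContext, SumReducer and ReduceDriverSketch are hypothetical names, the summing logic is invented for illustration, and the real MRUnit driver API is not used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the framework context: collects output lines and counters in memory.
class InMemoryContext {
    final List<String> output = new ArrayList<>();
    final Map<String, Long> counters = new HashMap<>();

    void write(String key, double value) {
        output.add(key + "\t" + value);
    }

    void increment(String counter) {
        counters.merge(counter, 1L, Long::sum);
    }
}

// Hypothetical reducer: sums all values of a key and writes exactly one line.
class SumReducer {
    void reduce(String key, List<Double> values, InMemoryContext context) {
        double sum = 0;
        for (double value : values) {
            sum += value;
        }
        context.write(key, sum);
        context.increment("SIGNALS_WRITTEN");
    }
}

// Minimal driver: runs the reducer on one key and returns what it produced.
class ReduceDriverSketch {
    private final SumReducer reducer;

    ReduceDriverSketch(SumReducer reducer) {
        this.reducer = reducer;
    }

    InMemoryContext run(String key, Double... values) {
        InMemoryContext context = new InMemoryContext();
        reducer.reduce(key, Arrays.asList(values), context);
        return context;
    }
}
```

A test then asserts on the returned context, for instance that the SIGNALS_WRITTEN counter is 1 after a single call, which mirrors the counter assertion in the MRUnit test.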
12. Zookeeper
• We use the Apache ZooKeeper ClientBase
• It's not in-memory but runs against the staging cluster
• Test paths are prefixed, e.g. /test/memo/subscribers
@Test
public void getNumberOfSubscribersShouldSetWatchFlag()
        throws KeeperException, InterruptedException {
    final SubscriberDaoImpl subscriberDao =
            new SubscriberDaoImpl(zookeeperDao, DIR, null);
    subscriberDao.getNumberOfSubscribers(listener);
    verify(this.zookeeper, times(1)).getChildren(eq(DIR), eq(subscriberDao));
}
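Since these tests hit a shared staging cluster, keeping every test znode under one prefix avoids touching real data. A minimal sketch of such a prefix helper follows. PrefixedPaths is a hypothetical class, not part of ZooKeeper or of the application, and it only builds path strings without talking to a cluster.

```java
// Hypothetical helper that places every path under a fixed prefix such as /test.
class PrefixedPaths {
    private final String prefix;

    PrefixedPaths(String prefix) {
        // Normalise the prefix: leading slash, no trailing slash (except for "/").
        String normalised = prefix.startsWith("/") ? prefix : "/" + prefix;
        while (normalised.length() > 1 && normalised.endsWith("/")) {
            normalised = normalised.substring(0, normalised.length() - 1);
        }
        this.prefix = normalised;
    }

    // Returns the given path placed under the prefix.
    String resolve(String path) {
        String relative = path.startsWith("/") ? path : "/" + path;
        return "/".equals(prefix) ? relative : prefix + relative;
    }
}
```

With a prefix of /test, resolving /memo/subscribers gives /test/memo/subscribers, the path shown on the slide.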
14. The Application
• Automated build and restart via Capistrano
• The build runs on every machine
• Every machine has its own .m2 repository
set :deploy_to, "/usr/share/memo-newsagent"
set :keep_releases, 1

after "deploy:setup" do
  run "mkdir -p /var/run/memo #{shared_path}/logs /var/log/memo/"
  ...
end

after "deploy:update_code" do
  run "cd #{current_release} && mvn install -Pfast > #{shared_path}/logs/build.log"
end

after "deploy", "rowlog:stop", "newsagent:restart", "rowlog:start"
16. Map Reduce Jobs
• We use a Maven Hadoop plugin
  hadoop:pack, analogous to mvn package
  hadoop:deploy, to HDFS and the target folder
• All dependencies are packed in
  Careful: huge JARs without dependency management
see github.com/memonews/maven-hadoop
17. DevOps
OTHER TOOLS IN USE
http://www.flickr.com/photos/damongman/4979871047/sizes/l/in/photostream/
18. Other Tools
• In-house staging environment, a 1:1 copy of production (virtualized)
• Azkaban for MR job scheduling
• Jenkins for (integration) tests and metrics
• Git
• Icinga for Monitoring & Alerting
• Ganglia / Graphite for Hadoop Metrics
• Fliwi for automated cluster provisioning