Polyglot Persistence & Big Data in the Cloud

  1. Polyglot Persistence Big Data in the Cloud Andrei Savu / andrei.savu@cloudsoftcorp.com
  2. Overview • Introduction • Databases • Search • Processing • Deployment
  3. Polyglot Persistence
     “Polyglot Persistence, like polyglot programming, is all about choosing the right persistence option for the task at hand”
     http://www.nearinfinity.com/blogs/scott_leberknight/polyglot_persistence.html
     http://martinfowler.com/bliki/PolyglotPersistence.html
  4. It all started from ... a set of papers released by Google & Amazon
  5. • Google Filesystem (2003) http://research.google.com/archive/gfs.html
     • Google MapReduce (2004) http://research.google.com/archive/mapreduce.html
     • Google BigTable (2006) http://research.google.com/archive/bigtable.html
     • Amazon Dynamo (2007) http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
  6. Databases
  7. Apache HBase
     • Java
     • designed to store massive amounts of data
     • designed for real-time workloads
     • based on Google BigTable
     • persistence through HDFS (Hadoop)
     • Map/Reduce with Hadoop
     • speaks HTTP/REST, Thrift, Avro
     • https://hbase.apache.org/
  8. Apache Cassandra
     • Java
     • inspired by Google BigTable and Amazon Dynamo
     • tunable trade-offs
     • query by column and range of keys
     • really fast writes
     • excellent for a large number of high-speed counters
     • Map/Reduce possible with Hadoop
     • http://cassandra.apache.org/
  9. MongoDB
     • C++
     • document database (BSON) with rich indexing
     • master/slave replication
     • built-in sharding
     • auto failover with replica sets
     • map/reduce with JavaScript
     • server-side JavaScript
     • journaling
     • fast in-place updates
     • http://www.mongodb.org/
  10. Apache CouchDB
     • Erlang
     • document database (JSON)
     • bi-directional replication
     • advanced conflict resolution
     • MVCC: writes do not block reads
     • exposes a stream of real-time updates
     • needs compacting
     • indexing via views (JS)
     • attachment handling
     • https://couchdb.apache.org/
  11. Riak (Basho)
     • Erlang, C, JavaScript
     • key/value store
     • focus on fault tolerance and cross-datacenter replication
     • speaks HTTP/REST or custom binary
     • tunable trade-offs (N, R, W)
     • MapReduce in JS or Erlang
     • full-text indexing with Riak Search
     • http://wiki.basho.com/
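The tunable N/R/W trade-off mentioned for Riak can be sketched in a few lines of Python. This is purely illustrative, not Riak's implementation: a write is acknowledged once W of the N replicas have it, a read consults R replicas and keeps the newest version, and choosing R + W > N forces the read set to overlap the write set, so at least one contacted replica holds the latest value.

```python
# Toy sketch of Dynamo-style N/R/W quorums (not Riak's actual code).

def write(replicas, w, version, value):
    # Acknowledge after updating only the first w replicas.
    for i in range(w):
        replicas[i] = (version, value)

def read(replicas, r):
    # Return the newest (version, value) among r replicas
    # (worst case: the last r, which overlap the write set least).
    return max(replicas[-r:])

replicas = [(0, None)] * 3                 # N = 3
write(replicas, w=2, version=1, value="x")
assert read(replicas, r=2) == (1, "x")     # R + W = 4 > N: sees the write
assert read(replicas, r=1) == (0, None)    # R + W = 3 = N: may read stale data
```

Lowering R and W buys latency and availability at the cost of possibly stale reads, which is exactly the knob the slide calls a "tunable trade-off".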
  12. Neo4j
     • Java
     • graph database
     • speaks HTTP/REST
     • standalone or embeddable in Java apps
     • full ACID
     • web admin interface
     • nodes & relationships can have metadata
     • indexing
     • http://neo4j.org/
  13. Redis
     • C/C++
     • disk-backed data structure server
     • master-slave replication
     • supports: strings, lists, sets, hashes, sorted sets
     • batch operations
     • values can be expired
     • Pub/Sub for messaging
     • ideal for rapidly changing data that fits in memory
     • http://redis.io/
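The "values can be expired" point is the feature behind Redis commands like SET with an EX option. A minimal Python sketch (not Redis itself, just the idea) of a key-value store with lazy expiration:

```python
# Illustrative sketch of TTL-based expiration, similar in spirit to
# Redis's SET key value EX seconds; names here are made up.
import time

class ExpiringStore:
    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ex=None):
        expires_at = time.monotonic() + ex if ex is not None else None
        self._data[key] = (value, expires_at)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # lazily drop the expired key on access
            return None
        return value

store = ExpiringStore()
store.set("session", "abc123", ex=0.05)   # expire after 50 ms
assert store.get("session") == "abc123"
time.sleep(0.06)
assert store.get("session") is None
```

This is why Redis fits session caches and other rapidly changing data: stale entries disappear without any application-side cleanup.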
  14. Search
  15. elasticsearch
     • Java
     • based on Apache Lucene
     • distributed by design
     • cloud aware (Amazon)
     • understands JSON objects
     • no schema required
     • simple multi-tenancy
     • real-time search
     • scales to 100s of machines
     • http://www.elasticsearch.org/
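The combination of "understands JSON objects" and "no schema required" rests on Lucene's inverted index: every field of an incoming document is tokenized and the terms point back to the documents containing them. A toy Python sketch of that idea (nothing like the real elasticsearch internals):

```python
# Toy inverted index over schema-free JSON-like documents.
from collections import defaultdict

index = defaultdict(set)   # term -> set of document ids
docs = {}

def index_doc(doc_id, doc):
    # Any dict works: no schema is declared up front.
    docs[doc_id] = doc
    for field, value in doc.items():
        for token in str(value).lower().split():
            index[token].add(doc_id)

def search(query):
    # Documents containing every query term (AND semantics).
    terms = query.lower().split()
    hits = set.intersection(*(index[t] for t in terms)) if terms else set()
    return sorted(hits)

index_doc(1, {"title": "Polyglot Persistence", "tags": "nosql cloud"})
index_doc(2, {"title": "Big Data in the Cloud"})
assert search("cloud") == [1, 2]
assert search("polyglot cloud") == [1]
```

Because indexing happens at ingest time, a document is searchable as soon as it is added, which is the "real-time search" property on the slide.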
  16. Apache SolrCloud
     • Java
     • based on Apache Lucene
     • adds distributed capabilities to Solr
     • based on ZooKeeper for coordination & config
     • automatic management of multiple shards (share the same repo)
     • automatic fail-over
     • durable writes
     • https://wiki.apache.org/solr/SolrCloud
  17. Processing
  18. Apache Hadoop
     • Java, C/C++
     • set of distributed systems (HDFS, MR, etc.)
     • framework for distributed data processing
     • simple programming model (map/reduce)
     • can scale to 1000s of machines
     • designed to be highly available at the application level
     • https://hadoop.apache.org/
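The "simple programming model" can be shown without a cluster: the user writes a mapper emitting (key, value) pairs and a reducer folding each group; the framework handles grouping and distribution. An in-process Python sketch of the classic word count (illustrative only, not Hadoop's API):

```python
# In-process sketch of the map/reduce programming model.
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Fold all values emitted for one key.
    return word, sum(counts)

def run_mapreduce(lines):
    groups = defaultdict(list)          # "shuffle": group values by key
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

result = run_mapreduce(["big data", "big cloud"])
assert result == {"big": 2, "data": 1, "cloud": 1}
```

On Hadoop the same mapper and reducer logic runs in parallel across machines, with HDFS supplying the input splits and the framework performing the shuffle over the network.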
  19. Hadoop Ecosystem
     • HDFS (storage)
     • MapReduce (processing)
     • Hive, Pig (high-level languages)
     • HBase (database)
     • ZooKeeper (coordination)
     • Oozie (workflow)
     • Mahout (machine learning)
     • Flume (log streaming)
     • Sqoop (data import)
     • Whirr (deployment)
  20. Deployment on Cloud Infrastructure (using jclouds)
  21. Apache Whirr https://whirr.apache.org/ * disclaimer: I am a member of the PMC
  22. First Steps
     • Download
       $ curl -O http://www.apache.org/dist/whirr/whirr-0.7.1/whirr-0.7.1.tar.gz
       $ tar zxf whirr-0.7.1.tar.gz; cd whirr-0.7.1
     • Use
       # export credentials
       $ bin/whirr launch-cluster --config ...
       $ bin/whirr destroy-cluster --config ...
     • https://whirr.apache.org/docs/latest/whirr-in-5-minutes.html
  23. Deploy Hadoop
     whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,10 hadoop-datanode+hadoop-tasktracker
     https://whirr.apache.org/docs/0.7.1/quick-start-guide.html
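The instance-templates property above is only one line of a Whirr configuration file. A sketch of what a complete properties file might look like, assuming an AWS EC2 deployment; consult the linked quick-start guide for the authoritative property list, since the cluster name and key paths here are illustrative:

```properties
# hadoop.properties -- illustrative sketch, not a verbatim example
# from the Whirr docs; values below are placeholders.
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,10 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
```

The file is then passed to the commands from the First Steps slide, e.g. `bin/whirr launch-cluster --config hadoop.properties`.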
  24. With Mahout
     whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker+mahout-client,10 hadoop-datanode+hadoop-tasktracker
  25. Or with HBase
     whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker+hbase-master+zookeeper,10 hadoop-datanode+hadoop-tasktracker+hbase-regionserver
  26. Or Cassandra
     whirr.instance-templates=10 cassandra
  27. And elasticsearch
     whirr.instance-templates=10 elasticsearch
  28. Thanks! andrei.savu@cloudsoftcorp.com
