Avro Data | Washington DC HUG

Published in: Technology

  1. Avro Data
     Doug Cutting, Cloudera & Apache
     Avro, Nutch, Hadoop, Pig, Hive, HBase, Zookeeper, Whirr, Cassandra and Mahout are trademarks of the Apache Software Foundation
  2. How did we get here?
  3. 2002-2003: Nutch, SequenceFile, Writable
  4. 2004-2005: Nutch, MapReduce, NDFS, SequenceFile, Writable
  5. 2006: Hadoop, MapReduce, HDFS, SequenceFile, Writable
  6. 2007: HBase, Pig, Zookeeper, Hadoop, MapReduce, HDFS, SequenceFile, Writable
  7. 2008: HBase, Pig, Hive, Mahout, Zookeeper, Hadoop, MapReduce, Cassandra, HDFS, SequenceFile, Writable
  8. 2009-2010: Your Application Here, Whirr, Oozie, Hue, ..., HBase, Pig, Hive, Mahout, Flume, Zookeeper, Hadoop, MapReduce, Cassandra, HDFS, SequenceFile, Writable
  9. Today
     ● face an exploding combination of
       ● tools
       ● data formats
       ● programming languages
     ● may require new adapter for each combination
     ● more tools and languages are good
       ● but more formats might not be
       ● Google claims benefits of common format
  10. Data Format Properties
      ● expressive
        ● supports complex, nested data structures
      ● efficient
        ● fast and small
      ● dynamic
        ● programs can process & define new datatypes
      ● file format
        ● standalone
        ● splittable, compressed, sortable
  11. Data Format Comparison

                            CSV   XML/JSON   SequenceFile   Thrift & PB   Avro
      language independent  yes   yes        no             yes           yes
      expressive            no    yes        yes            yes           yes
      efficient             no    no         yes            yes           yes
      dynamic               yes   yes        no             no            yes
      standalone            ?     yes        no             no            yes
      splittable            ?     ?          yes            ?             yes
      sortable              yes   ?          yes            no            yes
  12. Avro
      ● specification-based design
        ● permits independent implementations
        ● schema in JSON to simplify impls
      ● dynamic implementations the norm
        ● static, codegen-based implementations too
      ● file format specified
        ● standalone, splittable, compressed
      ● efficient binary encoding
        ● factors schema out of instances
      ● sortable
  13. IDL Schemas for authoring static datatypes

      // a simple three-element record
      record Block {
        string id;
        int length;
        array<string> hosts;
      }

      // a linked list of ints
      record IntList {
        int value;
        union { null, IntList } next;
      }
  14. JSON Schemas for interchange

      // a simple three-element record
      {"name": "Block", "type": "record",
       "fields": [
         {"name": "id", "type": "string"},
         {"name": "length", "type": "int"},
         {"name": "hosts", "type": {"type": "array", "items": "string"}}
       ]
      }

      // a linked list of ints
      {"name": "IntList", "type": "record",
       "fields": [
         {"name": "value", "type": "int"},
         {"name": "next", "type": ["null", "IntList"]}
       ]
      }
  15. Dynamic Schemas, e.g., in Java

      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.List;
      import org.apache.avro.Schema;
      import org.apache.avro.Schema.Field;
      import org.apache.avro.Schema.Type;

      // the three-element Block record
      Schema block = Schema.createRecord("Block", "a block", null, false);
      List<Field> fields = new ArrayList<Field>();
      fields.add(new Field("id", Schema.create(Type.STRING), null, null));
      fields.add(new Field("length", Schema.create(Type.INT), null, null));
      fields.add(new Field("hosts",
          Schema.createArray(Schema.create(Type.STRING)), null, null));
      block.setFields(fields);

      // a linked list; the record refers to itself through the union
      Schema list = Schema.createRecord("MyList", "a list", null, false);
      List<Field> listFields = new ArrayList<Field>();  // renamed: "fields" is already declared above
      listFields.add(new Field("value", Schema.create(Type.INT), null, null));
      listFields.add(new Field("next",
          Schema.createUnion(Arrays.asList(Schema.create(Type.NULL), list)),
          null, null));
      list.setFields(listFields);
  16. Avro Schema Evolution
      ● writer's schema always provided to reader
      ● so reader can compare:
        ● the schema used to write with
        ● the schema expected by application
      ● fields that match (name & type) are read
      ● fields written that don't match are skipped
      ● expected fields not written can be identified
      ● same features as provided by numeric field ids
  17. Avro MapReduce API
      ● Single-valued inputs and outputs
        ● key/value pairs only required for intermediate
      ● map(IN, Collector<OUT>)
        ● map-only jobs never need to create k/v pairs
      ● map(IN, Collector<Pair<K,V>>)
      ● reduce(K, Iterable<V>, Collector<OUT>)
        ● if IN and OUT are pairs, default is sort
  18. Avro Java MapReduce Example

      public void map(String text, AvroCollector<Pair<String,Long>> c,
                      Reporter r) throws IOException {
        StringTokenizer i = new StringTokenizer(text.toString());
        while (i.hasMoreTokens())
          c.collect(new Pair<String,Long>(i.nextToken(), 1L));
      }

      public void reduce(String word, Iterable<Long> counts,
                         AvroCollector<Pair<String,Long>> c,
                         Reporter r) throws IOException {
        long sum = 0;
        for (long count : counts)
          sum += count;
        c.collect(new Pair<String,Long>(word, sum));
      }
  19. Avro Status
      ● Current
        ● APIs: C, C++, C#, Java, Python, PHP, Ruby
          – interoperable data & RPC
        ● Integration: Pig, Hive, Flume, Crunch, etc.
        ● Conversion: SequenceFile, Thrift, Protobuf
        ● Java MapReduce API
      ● Upcoming
        ● MapReduce APIs for more languages
          – efficient, rich data
  20. Summary
      ● Ecosystem needs a common data format
        ● that's expressive, efficient, dynamic, etc.
      ● Avro meets this need
        ● but switching data formats is a slow process
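
The "efficient binary encoding" point on the Avro slide, where the schema is factored out of each instance, can be made concrete with a small sketch. It hand-encodes the Block record from the schema slides, assuming the rules of the Avro specification: ints and longs are zig-zag variable-length values, strings are a length followed by UTF-8 bytes, arrays are counted blocks ended by a zero count, and a record is just its fields in schema order. The class and method names are hypothetical, and the sketch deliberately does not use the Avro library.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical hand-rolled encoder for illustration only; not the Avro library.
final class BlockEncodingSketch {

  // Zig-zag maps values of small magnitude to small unsigned values, which
  // are then written 7 bits at a time, low-order group first.
  static void writeLong(ByteArrayOutputStream out, long n) {
    long z = (n << 1) ^ (n >> 63);
    while ((z & ~0x7FL) != 0) {
      out.write((int) ((z & 0x7F) | 0x80)); // high bit set: more bytes follow
      z >>>= 7;
    }
    out.write((int) z);
  }

  // A string is its UTF-8 byte length followed by the bytes themselves.
  static void writeString(ByteArrayOutputStream out, String s) {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    writeLong(out, utf8.length);
    out.write(utf8, 0, utf8.length);
  }

  // Encodes Block { string id; int length; array<string> hosts; }.
  // No field names or type tags are written: the schema supplies them.
  static byte[] encodeBlock(String id, int length, List<String> hosts) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeString(out, id);
    writeLong(out, length);            // int and long share one encoding
    if (!hosts.isEmpty()) {
      writeLong(out, hosts.size());    // a single block holding every item
      for (String host : hosts) {
        writeString(out, host);
      }
    }
    writeLong(out, 0);                 // a zero count ends the array
    return out.toByteArray();
  }
}
```

Under these assumptions, `encodeBlock("b1", 64, Arrays.asList("h1"))` yields 10 bytes: three for the id, two for the length, and five for the one-element array. A reader needs the Block schema to make sense of them, which is why the file format stores the schema once alongside the data rather than in every record.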
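
The field-matching rules on the schema evolution slide can be sketched with plain maps. This is a simplified illustration, not Avro's resolver: a schema is modeled as a map from field name to a type name, a written record as a map from field name to value, and all class, field, and method names are hypothetical. One assumption is worth flagging: a field whose name matches but whose type differs is reported here as missing, whereas the real library allows some type promotions and otherwise raises an error.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of reader/writer schema comparison; not the Avro API.
final class SchemaResolutionSketch {

  // Fields read because name and type match in both schemas.
  final Map<String, Object> read = new LinkedHashMap<>();

  // Fields the application expects that the writer did not provide.
  final List<String> missing = new ArrayList<>();

  static SchemaResolutionSketch resolve(Map<String, String> writerSchema,
                                        Map<String, String> readerSchema,
                                        Map<String, Object> written) {
    SchemaResolutionSketch result = new SchemaResolutionSketch();
    for (Map.Entry<String, String> expected : readerSchema.entrySet()) {
      String name = expected.getKey();
      if (expected.getValue().equals(writerSchema.get(name))) {
        result.read.put(name, written.get(name));  // match on name and type
      } else {
        result.missing.add(name);                  // expected but not written
      }
    }
    // Written fields absent from the reader's schema are never visited,
    // which is the "skipped" case.
    return result;
  }
}
```

For example, if the writer's schema has `id`, `length` and `hosts` while the reader expects `id`, `length` and a new `owner` field, the result reads `id` and `length`, skips `hosts`, and lists `owner` as missing so the application can supply a default. Matching by name is what gives the same flexibility that numeric field ids provide in other formats.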