Yahoo! Hadoop User Group - May 2010 Meetup - What's new with Pig? Alan Gates, Yahoo!

Published in: Technology, Business

  1. Pig 0.6 and 0.7
     Alan Gates
     What’s New With Pig
  2. Accumulator
         -- schema assumed here so that user and timestamp resolve
         A = load 'clicks' as (user, timestamp);
         B = group A by user;
         C = foreach B {
             C1 = order A by timestamp;
             -- after a group, the key (user) is exposed as the field named group
             generate group, sessionize(C1);
         }
         …
     Many aggregate operations cannot use the combiner, yet do not need all records for a single key at once.
     New in 0.6: the Accumulator interface, which UDFs can implement.
     Pig calls accumulate multiple times, each time with a partial list of tuples; when the key changes, it calls getValue.
  3. Also in 0.6
     UDFContext: allows UDFs to pass information from the frontend to the backend and to access the JobConf.
     A lot of work on the memory manager to reduce the number of GC overhead and out-of-heap errors.
  4. New Load and Store Interfaces
     0.6 and before:
       Want to write a LoadFunc that works on files and uses standard splits? Easy.
       Want to write a LoadFunc that works on something other than files or uses non-standard splits? Hard; you have to write a Slicer (which mostly duplicates Hadoop’s InputFormat).
       Want to write a StoreFunc that works on something other than files? Sorry.
     0.7:
       LoadFunc now sits atop InputFormat, so if you have an InputFormat for your data, writing a LoadFunc is easy.
       StoreFunc now sits atop OutputFormat, …
     Not backward compatible; custom LoadFuncs and StoreFuncs will have to be rewritten.
  5. Also in 0.7
     Moved local mode to Hadoop’s LocalJobRunner; this means the debugging environment is much closer to the runtime environment.
     More aggressive use of the Hadoop distributed cache for features such as replicated join and order by.
  6. What We Are Working On Now
     Runtime statistics: track what features your script used, how many records it processed, etc. Results are stored in Pig logs and job history files.
     Adding UDFs in scripting languages (Python initially) - PIG-928
     Allow users to set a custom partitioner in some cases - PIG-282
     Make Pig available in Maven repositories - PIG-1334
     Label interfaces for audience and stability - PIG-1311
       Part of Hadoop’s compatibility plan; see the following blog post: http://bit.ly/9yRDlH
  7. Questions
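
The accumulator contract described on slide 2 (repeated accumulate calls with partial lists of tuples, then getValue when the key changes) can be modeled in a few lines. The following is a minimal sketch in plain Python, not Pig's actual Java interface: the class `SessionCounter`, the driver `run_grouped`, the 30-minute session gap, and the sample `clicks` data are all hypothetical names and values chosen for illustration. The slides do not say what `sessionize` computes, so counting sessions per user is an assumption.

```python
from itertools import groupby, islice
from operator import itemgetter


class SessionCounter:
    """Hypothetical accumulator-style UDF: counts sessions for one key.

    A new session starts whenever the gap between consecutive
    timestamps exceeds `gap` seconds (assumed value: 30 minutes).
    """

    def __init__(self, gap=1800):
        self.gap = gap
        self.clean_up()

    def accumulate(self, batch):
        # `batch` is a partial, timestamp-ordered list of (user, timestamp)
        # tuples for the current key; state carries over between calls.
        for _, ts in batch:
            if self.last_ts is None or ts - self.last_ts > self.gap:
                self.sessions += 1
            self.last_ts = ts

    def get_value(self):
        # Called once per key, after the last batch for that key.
        return self.sessions

    def clean_up(self):
        # Reset state before the next key is processed.
        self.sessions = 0
        self.last_ts = None


def run_grouped(records, udf, batch_size=2):
    """Hypothetical driver standing in for the engine: feeds each key's
    records to `udf` in small batches instead of all at once."""
    results = {}
    ordered = sorted(records, key=itemgetter(0, 1))
    for user, group in groupby(ordered, key=itemgetter(0)):
        rows = iter(group)
        while True:
            batch = list(islice(rows, batch_size))
            if not batch:
                break
            udf.accumulate(batch)
        results[user] = udf.get_value()
        udf.clean_up()
    return results


# Sample data (made up): (user, timestamp in seconds)
clicks = [
    ("alice", 0), ("alice", 600), ("alice", 5000),
    ("bob", 100), ("bob", 200),
]
session_counts = run_grouped(clicks, SessionCounter())
print(session_counts)
```

The point of the design is that the UDF never holds more than `batch_size` records for a key at a time, which is what lets an operation that cannot use the combiner still run in bounded memory.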