A Simple Join

People
Id  Last        First
1   Washington  George
2   Lincoln     Abraham

Entry Log
Location  Id  Time
Dunster   1   11:00am
Dunster   2   11:02am
Kirkland  2   11:08am

Key: Id

You want to track individuals throughout the day. How would you do this in M/R, if you had to?
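One common answer to the question above is a reduce-side join: each mapper tags its rows with the table they came from and emits them under the join key (Id), and the reducer pairs up the tagged rows that share a key. The sketch below simulates that in plain Python, with no Hadoop involved. The function names (`map_people`, `map_entry`, `shuffle`, `reduce_join`, `run_join`) and the "P"/"E" tags are illustrative assumptions, not part of the slides.

```python
from collections import defaultdict

# Sample rows copied from the slide.
PEOPLE = [(1, "Washington", "George"), (2, "Lincoln", "Abraham")]
ENTRY_LOG = [("Dunster", 1, "11:00am"), ("Dunster", 2, "11:02am"),
             ("Kirkland", 2, "11:08am")]


def map_people(row):
    """Emit a People row keyed by Id, tagged "P" so the reducer knows its source."""
    pid, last, first = row
    yield pid, ("P", (last, first))


def map_entry(row):
    """Emit an Entry Log row keyed by Id, tagged "E"."""
    location, pid, time = row
    yield pid, ("E", (location, time))


def shuffle(pairs):
    """Group mapper output by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(pid, values):
    """Pair every People row with every Entry Log row that shares this Id."""
    people = [v for tag, v in values if tag == "P"]
    entries = [v for tag, v in values if tag == "E"]
    for last, first in people:
        for location, time in entries:
            yield (pid, last, first, location, time)


def run_join(people, entry_log):
    """Run map, shuffle, and reduce in sequence and return the joined rows."""
    mapped = []
    for row in people:
        mapped.extend(map_people(row))
    for row in entry_log:
        mapped.extend(map_entry(row))
    joined = []
    for pid, values in sorted(shuffle(mapped).items()):
        joined.extend(reduce_join(pid, values))
    return joined


if __name__ == "__main__":
    for row in run_join(PEOPLE, ENTRY_LOG):
        print(row)
```

The reducer buffers all rows for one key in memory, which is fine for a toy example but is exactly the kind of detail a real job would have to handle for skewed keys.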
What is DW?
a.k.a. BI: "Business Intelligence"
Provides data to support decisions
Not the operational/transactional database
e.g., answers "what has our inventory been over time?", not "what is our inventory now?"
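The inventory example can be made concrete with a small sketch. Here the operational store overwrites its state and can only answer "what is our inventory now?", while the warehouse appends periodic snapshots and can answer "what has our inventory been over time?". The class names, the item name, and the day labels are hypothetical, chosen only for illustration.

```python
class OperationalStore:
    """Transactional view: keeps only the latest quantity per item."""

    def __init__(self):
        self.stock = {}

    def set_stock(self, item, qty):
        # Overwrites the previous value; history is lost here.
        self.stock[item] = qty

    def current(self, item):
        return self.stock[item]


class Warehouse:
    """Analytical view: an append-only list of (day, item, qty) snapshots."""

    def __init__(self):
        self.snapshots = []

    def load_snapshot(self, day, store):
        # Copy the store's current state, labeled with the snapshot day.
        for item, qty in store.stock.items():
            self.snapshots.append((day, item, qty))

    def history(self, item):
        # All recorded quantities for one item, in load order.
        return [(day, qty) for day, it, qty in self.snapshots if it == item]


if __name__ == "__main__":
    store, dw = OperationalStore(), Warehouse()
    for day, qty in [("day-1", 40), ("day-2", 25), ("day-3", 10)]:
        store.set_stock("trail mix", qty)
        dw.load_snapshot(day, store)
    print(store.current("trail mix"))  # the operational question
    print(dw.history("trail mix"))     # the warehouse question
```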
Why DW?
Learn from data
Reporting
Ad-hoc analysis
e.g.: which trail mix should TJ's discontinue? (and other important business questions)
"MAD Skills"
Magnetic
Agile
Deep

MAD Skills: New Analysis Practices for Big Data
Jeffrey Cohen (Greenplum), Brian Dolan (Fox Interactive Media), Mark Dunlap (Evergreen Technologies), Joseph M. Hellerstein (U.C. Berkeley), Caleb Welton (Greenplum)

ABSTRACT
As massive data acquisition and storage becomes increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques and experience providing MAD analytics for one of the world's largest advertising networks at Fox Interactive Media, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.

1. INTRODUCTION
If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what's getting ubiquitous and cheap? Data. And what is complementary to data? Analysis.
- Prof. Hal Varian, UC Berkeley, Chief Economist at Google

mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills.
- UrbanDictionary.com

Standard business practices for large-scale data analysis center on the notion of an "Enterprise Data Warehouse" (EDW) that is queried by "Business Intelligence" (BI) software. BI tools produce reports and interactive interfaces that summa[...] into groups. This was the topic of significant academic research and industrial development throughout the 1990's.

Traditionally, a carefully designed EDW is considered to have a central role in good IT practice. The design and evolution of a comprehensive EDW schema serves as the rallying point for disciplined data integration within a large enterprise, rationalizing the outputs and representations of all business processes. The resulting database serves as the repository of record for critical business functions. In addition, the database server storing the EDW has traditionally been a major computational asset, serving as the central, scalable engine for key enterprise analytics. The conceptual and computational centrality of the EDW makes it a mission-critical, expensive resource, used for serving data-intensive reports targeted at executive decision-makers. It is traditionally controlled by a dedicated IT staff that not only maintains the system, but jealously controls access to ensure that executives can rely on a high quality of service.

While this orthodox EDW approach continues today in many settings, a number of factors are pushing towards a very different philosophy for large-scale data management in the enterprise. First, storage is now so cheap that small subgroups within an enterprise can develop an isolated database of astonishing scale within their discretionary budget. The world's largest data warehouse from just over a decade ago can be stored on less than 20 commodity disks priced at under $100 today. A department can pay for 1-2 orders of magnitude more storage than that without coordinating with management. Meanwhile, the number of massive-scale data sources in an enterprise has grown remarkably: massive databases arise today even from single sources like clickstreams, software logs, email and discussion forum archives, etc. Finally, the value of data analysis has entered common culture, with numerous companies showing how sophisticated data analysis leads to cost savings and even direct revenue. The end result of these opportunities is a grassroots move to collect and leverage data in multiple organizational
MADness is Enabling
Stack, top to bottom:
BI / Reporting
RDBMS (Aggregates)
ETL (Extraction, Transform, Load)
Storage (Raw Data)
Collection
Instrumentation
Alongside the stack: Ad-hoc Queries? Data Mining?
} Traditional DW
Stack, top to bottom:
BI / Reporting
RDBMS (Aggregates)
ETL (Extraction, Transform, Load)
Storage (Raw Data)
Collection
Instrumentation
Alongside the stack: Ad-hoc Queries, Data Mining
} Traditional DW
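To make the layers of this stack concrete, here is a minimal sketch of the path from raw data through ETL into an RDBMS aggregate, using Python's built-in `sqlite3` module as a stand-in for the warehouse database. The raw line format, the `sales` table, the item names, and the dates are all hypothetical; the slides do not specify any schema.

```python
import sqlite3

# Hypothetical raw log lines ("Storage (Raw Data)" layer): day,item,quantity.
RAW_LINES = [
    "2009-03-01,trail_mix,3",
    "2009-03-01,cherry_cider,1",
    "bad line",
    "2009-03-02,trail_mix,2",
]


def extract(lines):
    """Extract: split raw lines into fields, dropping malformed ones."""
    for line in lines:
        parts = line.split(",")
        if len(parts) == 3:
            yield parts


def transform(rows):
    """Transform: convert the quantity field from text to an integer."""
    for day, item, qty in rows:
        yield day, item, int(qty)


def load(conn, rows):
    """Load: create the table and insert the cleaned rows."""
    conn.execute("CREATE TABLE sales (day TEXT, item TEXT, qty INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


def units_per_item(conn):
    """Aggregate query of the kind a BI report would run."""
    return conn.execute(
        "SELECT item, SUM(qty) FROM sales GROUP BY item ORDER BY item"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(RAW_LINES)))
    print(units_per_item(conn))
```

Note that the malformed line is silently discarded during extraction. In this sketch that is a deliberate simplification; it also illustrates why analysis on the raw data can see things the aggregates no longer contain.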
Implementation
BOOM Project (Berkeley)
Overlog (Berkeley)

APPENDIX
A. NARADA IN OverLog
Here we provide an executable OverLog implementation of Narada's mesh maintenance algorithms. Current limitations of the P2 parser and planner require slightly wordier syntax for some of our constructs. Specifically, handling of negation is still incomplete, requiring that we rewrite some rules to eliminate negation. Furthermore, our planner currently handles rules with collocated terms only. The OverLog specification below is directly parsed and executed by our current codebase.

/** Base tables */
materialize(member, infinity, infinity, keys(2)).
materialize(sequence, infinity, 1, keys(2)).
materialize(neighbor, infinity, infinity, keys(2)).

/* Environment table containing configuration values */
materialize(env, infinity, infinity, keys(2,3)).

/* Setup of configuration values */
E0 neighbor@X(X,Y) :- periodic@X(X,E,0,1), env@X(X, H, Y), H == "neighbor".

/** Start with sequence number 0 */
S0 sequence@X(X, Sequence) :- periodic@X(X, E, 0, 1), Sequence := 0.

/** Periodically start a refresh */
R1 refreshEvent@X(X) :- periodic@X(X, E, 3).

/** Increment my own sequence number */
R2 refreshSequence@X(X, NewSequence) :- refreshEvent@X(X), sequence@X(X, Sequence), NewSequence := Sequence + 1.

/** Save my incremented sequence */
R3 sequence@X(X, NewSequence) :- refreshSequence@X(X, NewSequence).

/** Send a refresh to all neighbors with my current membership */
R4 refresh@Y(Y, X, NewSequence, Address, ASequence, ALive) :- refreshSequence@X(X, NewSequence), member@X(X, Address, ASequence, Time, ALive), neighbor@X(X, Y).

/** How many member entries that match the member in a refresh message (but not myself) do I have? */
R5 membersFound@X(X, Address, ASeq, ALive, count<*>) :- refresh@X(X, Y, YSeq, Address, ASeq, ALive), member@X(X, Address, MySeq, MyTime, MyLive), X != Address.

/** If I have none, just store what I got */
R6 member@X(X, Address, ASequence, T, ALive) :- membersFound@X(X, Address, ASequence, ALive, C), C == 0, T := f_now().

/** If I have some, just update with the information I received if it has a higher sequence number. */
R7 member@X(X, Address, ASequence, T, ALive) :- membersFound@X(X, Address, ASequence, ALive, C), C > 0, T := f_now(), member@X(X, Address, MySequence, MyT, MyLive), MySequence < ASequence.

/** Update my neighbor's member entry */
R8 member@X(X, Y, YSeq, T, YLive) :- refresh@X(X, Y, YSeq, A, AS, AL), T := f_now(), YLive := 1.

/** Add anyone from whom I receive a refresh message to my neighbors */
N1 neighbor@X(X, Y) :- refresh@X(X, Y, YS, A, AS, L).

/** Probing of neighbor liveness */
L1 neighborProbe@X(X) :- periodic@X(X, E, 1).
L2 deadNeighbor@X(X, Y) :- neighborProbe@X(X), T := f_now(), neighbor@X(X, Y), member@X(X, Y, YS, YT, L), T - YT > 20.
L3 delete neighbor@X(X, Y) :- deadNeighbor@X(X, Y).
L4 member@X(X, Neighbor, DeadSequence, T, Live) :- deadNeighbor@X(X, Neighbor), member@X(X, Neighbor, S, T1, L), Live := 0, DeadSequence := S + 1, T := f_now().

B. CHORD IN OverLog
Here we provide the full OverLog specification for Chord. This specification deals with lookups, ring maintenance with a fixed number of successors, finger-table maintenance and opportunistic finger table population, joins, stabilization, and node failure detection.

/* The base tuples */
materialize(node, infinity, 1, keys(1)).
materialize(finger, 180, 160, keys(2)).
materialize(bestSucc, infinity, 1, keys(1)).
materialize(succDist, 10, 100, keys(2)).
materialize(succ, 10, 100, keys(2)).
materialize(pred, infinity, 100, keys(1)).
materialize(succCount, infinity, 1, keys(1)).
materialize(join, 10, 5, keys(1)).
materialize(landmark, infinity, 1, keys(1)).
materialize(fFix, infinity, 160, keys(2)).
materialize(nextFingerFix, infinity, 1, keys(1)).
materialize(pingNode, 10, infinity, keys(2)).
materialize(pendingPing, 10, infinity, keys(2)).

/** Lookups */
L1 lookupResults@R(R,K,S,SI,E) :- node@NI(NI,N), lookup@NI(NI,K,R,E), bestSucc@NI(NI,S,SI), K in
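For readers unfamiliar with OverLog, the Narada membership rules R5 to R7 in the listing above can be paraphrased imperatively. The sketch below is a plain-Python reading of those three rules only, not a translation of the P2 runtime: the `members` dictionary (address to a `(sequence, time, live)` tuple), the function name `apply_refresh`, and the explicit `now` argument standing in for `f_now()` are all assumptions made for illustration.

```python
def apply_refresh(members, self_addr, address, a_seq, a_live, now):
    """Apply one member entry carried by a refresh message.

    members maps address -> (sequence, time, live), mirroring the member
    table keyed on its address field. Returns True if the table changed.
    """
    # R5 excludes entries that describe the receiving node itself
    # (the X != Address condition).
    if address == self_addr:
        return False

    current = members.get(address)

    # R6: no matching entry yet (count == 0), so store what was received.
    if current is None:
        members[address] = (a_seq, now, a_live)
        return True

    # R7: an entry exists (count > 0); update only if the received
    # sequence number is strictly higher than the stored one.
    stored_seq = current[0]
    if stored_seq < a_seq:
        members[address] = (a_seq, now, a_live)
        return True

    return False


if __name__ == "__main__":
    table = {}
    print(apply_refresh(table, "n1", "n2", 5, 1, now=100))  # new entry
    print(apply_refresh(table, "n1", "n2", 4, 1, now=101))  # stale, ignored
    print(apply_refresh(table, "n1", "n2", 6, 0, now=102))  # newer, applied
    print(table)
```

The comparison makes the design point of the declarative version visible: three short rules replace the branching that an imperative implementation has to spell out.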
Debugging and Visualization
Mochi (CMU)
Parallax (UW)

[Figure 5: Summarized Swimlanes plot for RandomWriter (top) and Sort (bottom). Panels: "Task durations (RandomWriter: 100GB written: 4 hosts): All nodes" and "Task durations (Sort: 20GB input: 4 hosts): All nodes". Axes: Per-task vs. Time/s. Series: JT_Map, JT_Reduce.]

[Figure 6: Matrix-vector Multiplication before optimization (above), and after optimization (below). Panels: "Task durations (Matrix-Vec Multiply, Inefficient # Reducers): Per-node" and "Task durations (Matrix-Vec Multiply, Efficient # Reducers): Per-node". Axes: Per-task vs. Time/s. Series: JT_Map, JT_Reduce.]

4 Examples of Mochi's Value
We demonstrate the use of Mochi's visualizations (using mainly Swimlanes due to space constraints). All of the data is derived from log traces from the Yahoo! M45 production cluster. The examples in § 4.1, § 4.2 involve 5-node clusters (4-slave, 1-master), and the example in § 4.3 is from a 25-node cluster. Mochi's analysis and visualizations have run on real-world data from 300-node Hadoop production clusters, but we omit these results for lack of space; furthermore, at that scale, Mochi's interactive visualization (zooming in/out and targeted inspection) is of more benefit, rather than a static one.

4.1 Understanding Hadoop Job Structure
Figure 5 shows the Swimlanes plots from the Sort and RandomWriter benchmark workloads (part of the
Optimizations
For a single query...
For a single workflow...
Across workflows...
Bring out last century's DB research! (joins)
And file system research too! (RAID)
HadoopDB (Yale)
Data Formats (yes, in '09)
The Wheel
"Look, there are lots of different types of wheels!" - Todd Lipcon

Don't Re-invent
Focus on your data/problem
What about... Reliability, Durability, Stability, Tooling

Re-invent!
Lots of new possibilities!
New Models!
New implementations!
Better optimizations!