Microsoft Dryad
1. Tools and Services for Data Intensive Research: An Elephant Through the Eye of a Needle. Roger Barga, Architect, eXtreme Computing Group, Microsoft Research
5. Why Commercial Clouds are Important* Research: have good idea; write proposal; wait 6 months; if successful, wait 3 months; install computers; start work. Science start-ups: have good idea; write business plan; ask VCs to fund; if successful.. install computers; start work. Cloud computing model: have good idea; grab nodes from cloud provider; start work; pay for what you used. Also: scalability, cost, sustainability. * Slide used with permission of Paul Watson, University of Newcastle (UK)
6. The Pull of Economics (follow the money). Moore’s “Law” favored consumer commodities: economics drove enormous improvements; specialized processors and mainframes faltered; the commodity software industry was born. Today’s economics: unprecedented economies of scale; enterprise moving to PaaS, SaaS, cloud computing; opportunities for Analysis as a Service, multi-disciplinary data sets,… This will drive changes in research computing and cloud infrastructure, just as did “killer micros” and inexpensive clusters. [Figure: chip block diagrams with LPIA x86 and OoO x86 cores, 1 MB caches, GPUs, DRAM and PCIe controllers, and a NoC]
8. Enrich each element with significantly more metadata, e.g. geolocation. Assume the Twitter user base is currently on the order of 10-50M users; let’s crank this up to the 500M range. The average Twitter user generates a relatively low incoming message rate right now. Assume that a user’s devices (phone, car, PC) are enhanced to auto-generate periodic Twitter messages on their behalf, e.g. location ‘pings’ and messages that solve other problems twitterbots are emerging to address. So let’s say the input rate grows again to 10x-100x what it was in the previous step.
9. Drinking from the Twitter Fire Hose. On the “input” end. On the “output” end, three different usage modalities: (1) Each user has one or more ‘agents’ running on their behalf, monitoring this input stream. This might just be a client that displays the incoming stream from the @friends, #topics, or #interesting&@queries (the user’s standing queries). (2) A user can run more general queries from a search page. Such a query may have more unstructured search terms than the above, and it is expected to run not just against the incoming stream but against the much larger corpus of messages from the entire input stream that has been persisted for days, weeks, months, years… (3) Finally, analytical tools or bots do trend analysis on the knowledge popping out of the stream, in real time. Whether seeded with an interest (“let me know when a problem pops up with <product> that will damage my company’s reputation”) or just discovering a topic from the noise (“let me know when a new hot news item emerges”), both must be possible.
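The first modality above (an agent applying a user's standing query to the incoming stream) can be sketched in a few lines of Python. This is only an illustration: the `Message` and `StandingQuery` types, their fields, and the matching rule (author in a friends set, or a hashtag in a topics set) are assumptions made for the example, not anything defined on the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    author: str
    text: str


@dataclass
class StandingQuery:
    """Hypothetical standing query: followed authors plus hashtags of interest."""
    friends: set = field(default_factory=set)
    topics: set = field(default_factory=set)

    def matches(self, msg: Message) -> bool:
        # Tokenize on whitespace; a real system would need proper parsing.
        words = set(msg.text.lower().split())
        return msg.author in self.friends or bool(self.topics & words)


def run_agent(query, stream):
    """Yield only the messages from the incoming stream that match the query."""
    for msg in stream:
        if query.matches(msg):
            yield msg


stream = [
    Message("alice", "Reading about #dryad today"),
    Message("bob", "Lunch time"),
    Message("carol", "New #cloud pricing announced"),
]
query = StandingQuery(friends={"bob"}, topics={"#dryad"})
matched = [m.author for m in run_agent(query, stream)]
print(matched)
```

The agent is a generator, so it processes one message at a time and never needs the whole stream in memory, which is the property that matters at fire-hose rates.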
10. Pause for a Moment… Defining representative challenges or quests to focus group attention is an excellent way to proceed as a community. Publishing a whitepaper articulating these challenges is a great way to allow others to contribute to a shared research agenda. Make simulated and reference data sets available to ground such a distributed research effort.
11. Drinking from the Twitter Fire Hose. On the “input” end. On the “output” end: three different usage modalities. A combination of live data, including streaming, and historical data. Lots of necessary technology, but no single technology is sufficient. If this is going to be successful it must be accessible to the masses: simple to use and highly scalable, which is extremely difficult because in actuality it is not simple…
13. Microsoft’s Dryad. Continuously deployed since 2006. Running on >> 10^4 machines. Sifting through > 10 PB of data daily. Runs on clusters of > 3000 machines. Handles jobs with > 10^5 processes each. Used by >> 100 developers. Rich platform for data analysis. Microsoft Research, Silicon Valley: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly
14. Pause for a Moment… Data-Intensive Computing Symposium, 2007. Dryad is now freely available: http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx Thanks to Geoffrey Fox (Indiana) and Magda Balazinska (UW) as early adopters. Commitment by External Research (MSR) to support research community use.
19. Cluster of 240 AMD64 (quad) machines, 920 disks
20. Code: 17 lines of LINQ
    DryadDataContext ddc = new DryadDataContext(fileDir);
    DryadTable<TeraRecord> records = ddc.GetPartitionedTable<TeraRecord>(file);
    var q = records.OrderBy(x => x);
    q.ToDryadPartitionedTable(output);
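The slide shows only the declarative query, not how a sort is spread over a cluster. One common way to sort a partitioned table is sample-based range partitioning: sample the records, pick split points, route each record to the partition covering its range, sort every partition independently, and concatenate. The Python sketch below illustrates that general technique on a single machine; it is an assumption for illustration, not a description of the plan DryadLINQ actually generates, and `range_partition_sort` and its parameters are hypothetical names.

```python
import bisect
import random


def range_partition_sort(records, num_partitions, sample_size=100, seed=0):
    """Sort by sampling split points, partitioning by range, and sorting each part."""
    if not records:
        return []
    rng = random.Random(seed)
    sample = sorted(rng.sample(records, min(sample_size, len(records))))

    # Pick num_partitions - 1 split points, evenly spaced through the sample.
    step = len(sample) / num_partitions
    boundaries = [sample[int(step * i)] for i in range(1, num_partitions)]

    # Route every record to the partition whose key range contains it.
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[bisect.bisect_right(boundaries, record)].append(record)

    # Each partition could be sorted on a different machine; here it is a loop.
    sorted_parts = [sorted(part) for part in partitions]

    # Partitions cover disjoint, increasing ranges, so concatenation is sorted.
    return [record for part in sorted_parts for record in part]


data_rng = random.Random(42)
data = [data_rng.randrange(1_000_000) for _ in range(10_000)]
result = range_partition_sort(data, num_partitions=8)
print(result == sorted(data))
```

Sampling keeps the partitions roughly balanced without a full pass to compute exact quantiles; a skewed sample only affects balance, never correctness.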
21. LINQ: Microsoft’s Language INtegrated Query. Available in Visual Studio 2008. A set of operators to manipulate datasets in .NET. Supports traditional relational operators: Select, Join, GroupBy, Aggregate, etc. Data model: data elements are strongly typed .NET objects, much more expressive than SQL tables. Extremely extensible: add new custom operators; add new execution providers.
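For readers who do not use .NET, the four relational operators named on this slide can be approximated over typed objects in plain Python. This is an analogy only, not LINQ itself; the `Customer` and `Order` types and the sample data are invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Customer:
    customer_id: int
    name: str


@dataclass(frozen=True)
class Order:
    customer_id: int
    amount: float


customers = [Customer(1, "Ada"), Customer(2, "Grace")]
orders = [Order(1, 10.0), Order(2, 5.0), Order(1, 2.5)]

# Select: project each element to a new value.
amounts = [o.amount for o in orders]

# Join: match orders to customers on customer_id (a simple hash join).
names = {c.customer_id: c.name for c in customers}
joined = [(names[o.customer_id], o.amount) for o in orders]

# GroupBy: collect the joined rows by customer name.
groups = defaultdict(list)
for name, amount in joined:
    groups[name].append(amount)

# Aggregate: fold each group down to a single total.
totals = {
    name: reduce(lambda acc, x: acc + x, values, 0.0)
    for name, values in groups.items()
}
print(totals)
```

The elements stay ordinary typed objects throughout, which is the point the slide makes about the data model being richer than flat SQL rows.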
36. Combining with LINQ-to-IMDB [Diagram labels: Query, DryadLINQ, Subquery (x4), Historical Reference Data, LINQ-to-IMDB]
37. Combining with LINQ-to-CEP [Diagram labels: Query, DryadLINQ, Subquery (x5), ‘Live’ Streaming Data, LINQ-to-IMDB, LINQ-to-CEP]
38. Cost of storing data: a few cents/month/MB. Cost of acquiring data: negligible. Extracting insight while acquiring data: priceless. Mining historical data for ways to extract insight: precious. CEDR CEP: the engine that makes it possible. Reference: “Consistent Streaming Through Time: A Vision for Event Stream Processing”, Roger S. Barga, Jonathan Goldstein, Mohamed H. Ali, Mingsheng Hong. In the proceedings of CIDR 2007.
39. Complex Event Processing Complex Event Processing (CEP) is the continuous and incremental processing of event (data) streams from multiple sources based on declarative query and pattern specifications with near-zero latency.
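A minimal Python sketch of what "continuous and incremental" means in this definition: a standing sliding-window count whose result is updated on every arriving event instead of being recomputed from stored data. The `SlidingWindowCount` class is hypothetical and assumes events arrive in timestamp order; it does not represent the CEDR engine or its query language.

```python
from collections import deque


class SlidingWindowCount:
    """Count events per key over the last `window_seconds`, updated per event."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()   # (timestamp, key) pairs still inside the window
        self.counts = {}        # key -> number of events inside the window

    def on_event(self, timestamp, key):
        # Incremental step 1: account for the new event.
        self.events.append((timestamp, key))
        self.counts[key] = self.counts.get(key, 0) + 1

        # Incremental step 2: retire events that have left the window.
        cutoff = timestamp - self.window
        while self.events and self.events[0][0] <= cutoff:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

        # The current query answer is available immediately after each event.
        return dict(self.counts)


window = SlidingWindowCount(window_seconds=10)
results = [
    window.on_event(0, "a"),
    window.on_event(5, "b"),
    window.on_event(12, "a"),
    window.on_event(20, "a"),
]
print(results[-1])
```

Each event costs work proportional to the number of events it expires, not to the size of the window, which is what keeps latency low as the stream rate grows.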
43. Many interesting repercussions. Reference: “Consistent Streaming Through Time: A Vision for Event Stream Processing”, Roger S. Barga, Jonathan Goldstein, Mohamed H. Ali, Mingsheng Hong. In the proceedings of CIDR 2007.
44. CEDR (Orinoco) Overview. Currently processing over 400M events per day for an internal application (5000 events/sec).
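The two figures on this slide are consistent with each other, as a quick check in Python shows. The only inputs are the numbers quoted on the slide; treating the load as evenly spread over a day is a simplifying assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 400_000_000    # "over 400M events per day" from the slide
quoted_rate = 5000              # events/sec from the slide

# Average rate implied by 400M events spread evenly over one day.
average_rate = events_per_day / SECONDS_PER_DAY

# Daily volume implied by sustaining the quoted rate for a full day.
daily_volume_at_quoted_rate = quoted_rate * SECONDS_PER_DAY

print(round(average_rate), daily_volume_at_quoted_rate)
```

A sustained 5000 events/sec gives 432M events per day, which matches "over 400M".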
Language Integrated Query (LINQ) is an extension of .NET that allows one to write declarative computations on collections.
Dryad is a generalization of the Unix piping mechanism: instead of one-dimensional (chain) pipelines, it provides two-dimensional pipelines. The unit is still a process connected to others by point-to-point channels, but the processes at each stage are replicated.
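A small Python sketch of the "two-dimensional pipeline" idea in the note above: stages run left to right as in a Unix pipe, and each stage is replicated across partitions, with records routed between replicas over point-to-point channels. The functions `run_stage` and `repartition` are hypothetical helpers written for this example; they run in one process and are not the Dryad API.

```python
def run_stage(func, partitions):
    """Run one pipeline stage as replicas, one replica per input partition."""
    return [func(partition) for partition in partitions]


def repartition(partitions, num_outputs, route):
    """Model point-to-point channels: each upstream replica sends every record
    to exactly one downstream replica, chosen by `route(record)`."""
    outputs = [[] for _ in range(num_outputs)]
    for partition in partitions:
        for record in partition:
            outputs[route(record) % num_outputs].append(record)
    return outputs


# Second dimension: three replicas of the first stage, one per input partition.
inputs = [[1, 2, 3], [4, 5], [6]]

# Stage 1 (replicated x3): square every record.
squared = run_stage(lambda part: [x * x for x in part], inputs)

# Channels: route even values to replica 0 and odd values to replica 1.
shuffled = repartition(squared, num_outputs=2, route=lambda x: x)

# Stage 2 (replicated x2): sum the records each replica received.
totals = run_stage(sum, shuffled)
print(totals)
```

With a single partition and a single output this collapses to an ordinary chain pipeline, which is the sense in which the design generalizes Unix pipes.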
This is the basic Dryad terminology.
The brain of a Dryad job is a centralized Job Manager (JM), which maintains the complete state of the job. The JM controls the processes running on a cluster, but never exchanges data with them. (The data plane is completely separated from the control plane.)
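The split between control plane and data plane can be illustrated with a toy scheduler in Python. The `JobManager` below tracks only vertex states and dependencies and tells vertices when to run; the data itself moves through a separate `channels` store that the manager never reads or writes. All names here are hypothetical, and the sketch omits everything a real job manager needs (failures, retries, placement, concurrency).

```python
class JobManager:
    """Control plane only: tracks vertex state and decides what runs next."""

    def __init__(self, graph):
        # graph maps each vertex name to the list of vertices it reads from.
        self.graph = graph
        self.state = {vertex: "pending" for vertex in graph}

    def ready(self):
        """Pending vertices whose upstream vertices have all finished."""
        return [
            v for v, deps in self.graph.items()
            if self.state[v] == "pending"
            and all(self.state[d] == "done" for d in deps)
        ]

    def run(self, launch):
        order = []
        while any(s != "done" for s in self.state.values()):
            runnable = self.ready()
            if not runnable:
                raise RuntimeError("no runnable vertex: cyclic graph?")
            for vertex in runnable:
                self.state[vertex] = "running"
                launch(vertex)          # control message; no data passes here
                self.state[vertex] = "done"
                order.append(vertex)
        return order


# Data plane: vertices exchange data through channels, bypassing the manager.
channels = {}
graph = {"read": [], "sort": ["read"], "head": ["sort"]}
programs = {
    "read": lambda inputs: [3, 1, 2],
    "sort": lambda inputs: sorted(inputs[0]),
    "head": lambda inputs: inputs[0][:1],
}


def launch(vertex):
    inputs = [channels[upstream] for upstream in graph[vertex]]
    channels[vertex] = programs[vertex](inputs)


order = JobManager(graph).run(launch)
print(order, channels["head"])
```

Because the manager handles only small control messages, its load depends on the number of vertices rather than on the volume of data flowing between them.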