Tools and Services for Data Intensive Research: An Elephant Through the Eye of a Needle
Roger Barga, Architect, eXtreme Computing Group, Microsoft Research
Select eXtreme Computing Group (XCG) Initiatives
Cloud Computing Futures: ab initio R&D on cloud hardware/software infrastructure
Multicore academic engagement: Universal Parallel Computing Research Centers (UPCRCs)
Software incubations: multicore applications, power management, scheduling
Quantum computing: topological quantum computing investigations
Security and cryptography: theoretical explorations and software tools
Worldwide government and academic research partnerships
Inform next generation cloud computing infrastructure
Why Commercial Clouds are Important*
Research: have good idea; write proposal; wait 6 months; if successful, wait 3 months; install computers; start work
Science start-ups: have good idea; write business plan; ask VCs to fund; if successful.. install computers; start work
Cloud computing model: have good idea; grab nodes from cloud provider; start work; pay for what you used
Also: scalability, cost, sustainability
* Slide used with permission of Paul Watson, University of Newcastle (UK)
The Pull of Economics (follow the money)
Moore’s “Law” favored consumer commodities: economics drove enormous improvements; specialized processors and mainframes faltered; the commodity software industry was born
Today’s economics: unprecedented economies of scale; enterprise moving to PaaS, SaaS, cloud computing; opportunities for Analysis as a Service, multi-disciplinary data sets,…
This will drive changes in research computing and cloud infrastructure, just as did “killer micros” and inexpensive clusters
[Figure: multicore chip diagram with LPIA x86 cores, OoO x86 cores, 1 MB caches, GPUs, DRAM and PCIe controllers, and a NoC]
Drinking from the Twitter Fire Hose
On the “input” end:
Enrich each element with significantly more metadata, e.g. geolocation.
Assume the Twitter user base is on the order of 10-50M users; let’s crank this up to the 500M range.
The average Twitter user generates a relatively low incoming message rate right now. Assume that a user’s devices (phone, car, PC) are enhanced to auto-generate periodic Twitter messages on their behalf, e.g. location ‘pings’ and messages that solve other problems twitterbots are emerging to address. So let’s say the input rate grows again to 10x-100x what it was in the previous step.
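The scaling argument above is back-of-envelope arithmetic, sketched here in Python. The user counts (50M, 500M) and the 10x-100x device multiplier come from the slide; the baseline of 2 messages per user per day is a purely hypothetical placeholder, since the slides give no baseline rate.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical baseline, NOT from the slides: messages one user sends per day.
BASELINE_MSGS_PER_USER_PER_DAY = 2


def messages_per_second(users, msgs_per_user_per_day, device_multiplier=1):
    """Average input rate in messages per second for the whole user base."""
    return users * msgs_per_user_per_day * device_multiplier / SECONDS_PER_DAY


# Step 1: today's user base (upper end of the slide's 10-50M range).
current = messages_per_second(50_000_000, BASELINE_MSGS_PER_USER_PER_DAY)

# Step 2: crank the user base up to 500M.
grown_users = messages_per_second(500_000_000, BASELINE_MSGS_PER_USER_PER_DAY)

# Step 3: devices auto-generate messages, 10x to 100x the previous step.
with_devices_low = messages_per_second(
    500_000_000, BASELINE_MSGS_PER_USER_PER_DAY, device_multiplier=10)
with_devices_high = messages_per_second(
    500_000_000, BASELINE_MSGS_PER_USER_PER_DAY, device_multiplier=100)

for label, rate in [("current", current),
                    ("500M users", grown_users),
                    ("500M users, 10x devices", with_devices_low),
                    ("500M users, 100x devices", with_devices_high)]:
    print(f"{label:>26}: {rate:,.0f} msgs/s")
```

Whatever the true baseline is, the two growth steps compound: 10x more users times 10x-100x more messages per user is a 100x-1000x increase in input rate.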
Drinking from the Twitter Fire Hose
On the “input” end
On the “output” end: three different usage modalities
Each user has one or more ‘agents’ they run on their behalf, monitoring this input stream. This might just be a client that displays a stream that is incoming from the @friends or #topics or the #interesting&@queries (user standing queries).
A user can run more general queries from a search page. Such a query may have more unstructured search terms than the above, and it is expected to run not just against the incoming stream but against a much larger corpus of messages from the entire input stream that has been persisted for days, weeks, months, years…
Finally, analytical tools or bots whose purpose is to do trend analysis, in real time, on the knowledge popping out of the stream. Whether seeded with an interest (“let me know when a problem pops up with <product> that will damage my company’s reputation”) or just discovering a topic from the noise (“let me know when a new hot news item emerges”), both must be possible.
Pause for a Moment…
Defining representative challenges or quests to focus group attention is an excellent way to proceed as a community
Publishing a whitepaper articulating these challenges is a great way to allow others to contribute to a shared research agenda
Make simulated and reference data sets available to ground such a distributed research effort
Drinking from the Twitter Fire Hose
On the “input” end
On the “output” end: three different usage modalities
A combination of live data, including streaming, and historical data
Lots of necessary technology, but no single technology is sufficient
If this is going to be successful it must be accessible to the masses
Simple to use and highly scalable, which is extremely difficult because in actuality it is not simple…
This Talk is About
Effort to build & port tools for data intensive research in the cloud
Able to handle torrential streams of live and historical data
Intersection of four fundamental strategies:
Distribute data and perform parallel processing
Parallel operations to take advantage of multiple cores
Reduce the size of the data accessed: data compression; data structures that limit the amount of data required for queries
Stream data processing to extract information before storage
Microsoft’s Dryad
Continuously deployed since 2006
Running on >> 10^4 machines
Sifting through > 10 PB of data daily
Runs on clusters of > 3000 machines
Handles jobs with > 10^5 processes each
Used by >> 100 developers
Rich platform for data analysis
Microsoft Research, Silicon Valley: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly
Pause for a Moment…
Data-Intensive Computing Symposium, 2007
Dryad is now freely available: http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx
Thanks to Geoffrey Fox (Indiana) and Magda Balazinska (UW) as early adopters
Commitment by External Research (MSR) to support research community use
Simple Programming Model
Terasort, a well-known benchmark: time to sort 1 TB of data [J. Gray 1985]
DryadLINQ provides a simple but powerful programming model
Only a few lines of code are needed to implement Terasort (benchmark run May 2008)
DryadLINQ result: 349 seconds (5.8 min)
Cluster of 240 AMD64 (quad) machines, 920 disks
Code: 17 lines of LINQ
    DryadDataContext ddc = new DryadDataContext(fileDir);
    DryadTable<TeraRecord> records = ddc.GetPartitionedTable<TeraRecord>(file);
    var q = records.OrderBy(x => x);
    q.ToDryadPartitionedTable(output);
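The single OrderBy call above hides a distributed sort. The slides do not show the execution plan, so what follows is only a plausible single-process Python analogue of one common approach (sample the keys, range-partition, sort each partition independently, concatenate). All function names are hypothetical and none of this is the DryadLINQ API.

```python
import bisect
import random


def choose_splitters(records, num_partitions, sample_size=100, seed=0):
    """Pick num_partitions - 1 boundary keys from a random sample."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    step = len(sample) / num_partitions
    return [sample[int(i * step)] for i in range(1, num_partitions)]


def range_partition(records, splitters):
    """Send each record to the partition whose key range contains it."""
    partitions = [[] for _ in range(len(splitters) + 1)]
    for record in records:
        partitions[bisect.bisect_right(splitters, record)].append(record)
    return partitions


def distributed_sort(records, num_partitions=4):
    """Sort by range-partitioning, then sorting each partition on its own."""
    if not records:
        return []
    splitters = choose_splitters(records, num_partitions)
    partitions = range_partition(records, splitters)
    # On a cluster each partition would be sorted on a different machine;
    # here the per-partition sorts simply run one after another.
    sorted_parts = [sorted(part) for part in partitions]
    # Partitions cover disjoint, increasing key ranges, so plain
    # concatenation yields a globally sorted result.
    return [record for part in sorted_parts for record in part]


if __name__ == "__main__":
    data = [random.randrange(1_000_000) for _ in range(10_000)]
    assert distributed_sort(data, num_partitions=8) == sorted(data)
```

The point of the sketch is that no partition needs to see another partition's data after the range-partition step, which is what lets the sort scale out across machines and disks.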
LINQ Microsoft’s Language INtegrated Query Available in Visual Studio 2008 A set of operators to manipulate datasets in .NET Support traditional relational operators Select, Join, GroupBy, Aggregate, etc. Data model Data elements are strongly typed .NET objects Much more expressive than SQL tables Extremely extensible Add new custom operators Add new execution providers
Dryad Generalizes Unix Pipes
Unix Pipes: 1-D
    grep | sed | sort | awk | perl
Dryad: 2-D, multi-machine, virtualized
    grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
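The "2-D" idea is that each pipeline stage runs as many copies, each over its own slice of the data. A minimal single-process Python sketch of that shape follows. The stage functions and the round-robin repartitioning are illustrative assumptions; real Dryad vertices are separate processes on different machines.

```python
def repartition(items, width):
    """Deal items round-robin into `width` partitions (one per vertex copy)."""
    parts = [[] for _ in range(width)]
    for index, item in enumerate(items):
        parts[index % width].append(item)
    return parts


def run_stage(vertex_fn, partitions):
    """Run one copy of vertex_fn per partition; sequential stand-in for parallel."""
    return [list(vertex_fn(part)) for part in partitions]


def run_pipeline(lines, stages):
    """stages is a list of (vertex_fn, width) pairs, like grep^3 | upper^2 | sort^1."""
    data = list(lines)
    for vertex_fn, width in stages:
        outputs = run_stage(vertex_fn, repartition(data, width))
        data = [item for out in outputs for item in out]
    return data


# Hypothetical stage functions standing in for grep, sed and sort.
def grep_error(lines):
    return (line for line in lines if "error" in line)


def to_upper(lines):
    return (line.upper() for line in lines)


def sort_all(lines):
    return sorted(lines)


if __name__ == "__main__":
    log = ["error b", "ok", "error a"]
    print(run_pipeline(log, [(grep_error, 3), (to_upper, 2), (sort_all, 1)]))
```

Element-wise stages such as the filter can use any width, while a stage that needs a global view (the final sort here) runs at width 1 or needs a smarter partitioning scheme.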
Dryad Job Structure
[Figure: job graph from input files to output files; stages of grep, sed, sort, awk and perl vertices (processes) connected by channels]
A channel is a finite stream of items:
  TCP pipes (inter-machine)
  Memory FIFOs (intra-machine)
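To make "a finite stream of items" concrete, here is a small Python sketch of an in-memory FIFO channel between a producer and a consumer. It is an analogue of the memory-FIFO case only, using the standard library's queue and threading modules; the end-of-stream sentinel and the function names are assumptions, not Dryad's interface.

```python
import queue
import threading

_END_OF_STREAM = object()  # sentinel marking that the finite stream is done


def producer(channel, items):
    """Upstream vertex: write every item, then signal end of stream."""
    for item in items:
        channel.put(item)  # blocks while the bounded FIFO is full
    channel.put(_END_OF_STREAM)


def consumer(channel):
    """Downstream vertex: read items until the end-of-stream sentinel."""
    received = []
    while True:
        item = channel.get()
        if item is _END_OF_STREAM:
            break
        received.append(item)
    return received


def run_through_fifo(items, capacity=2):
    """Connect a producer thread and a consumer with a bounded FIFO."""
    channel = queue.Queue(maxsize=capacity)
    worker = threading.Thread(target=producer, args=(channel, items))
    worker.start()
    received = consumer(channel)
    worker.join()
    return received


if __name__ == "__main__":
    print(run_through_fifo(["a", "b", "c", "d"]))
```

Because the vertices only see "read items until the stream ends", the same vertex code could in principle sit behind a TCP pipe instead of a memory FIFO without changing.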
Dryad Job Staging
1. Build
2. Send .exe
3. Start JM (job manager)
4. Query cluster resources
5. Generate graph
6. Initialize vertices
7. Serialize vertices
8. Monitor vertex execution
[Figure labels: vertex code, JM code, cluster services]
Dryad Scheduler is a State Machine
Static optimizer builds the execution graph; a vertex can run anywhere once all its inputs are ready.
Dynamic optimizer mutates the running graph: distributes code and routes data; schedules processes on machines near data; adjusts available compute resources at each stage; automatically recovers computation and adjusts for overload.
If vertex A’s inputs are gone, run upstream vertices again (recursively);
If A is slow, run a copy elsewhere and use the output from whichever finishes first.
Masks failures in cluster and network.
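The recursive re-execution rule can be sketched in a few lines of Python. This is an illustrative model only: the Vertex class, the dictionary standing in for stored channel outputs, and the function names are all hypothetical, and the "run a copy elsewhere" rule for slow vertices is not modeled.

```python
class Vertex:
    """A node in the job graph: a function plus the vertices feeding it."""

    def __init__(self, name, fn, inputs=()):
        self.name = name
        self.fn = fn
        self.inputs = list(inputs)


def materialize(vertex, store, log):
    """Return the vertex's output, re-running whatever upstream work is missing.

    `store` maps vertex names to surviving outputs; `log` records which
    vertices actually had to run.
    """
    if vertex.name in store:
        return store[vertex.name]
    # Inputs are gone or never produced: recover them first, recursively.
    args = [materialize(upstream, store, log) for upstream in vertex.inputs]
    log.append(vertex.name)
    store[vertex.name] = vertex.fn(*args)
    return store[vertex.name]


# A three-stage job: read -> double -> total.
read = Vertex("read", lambda: [1, 2, 3])
double = Vertex("double", lambda xs: [2 * x for x in xs], [read])
total = Vertex("total", lambda xs: sum(xs), [double])

if __name__ == "__main__":
    store, log = {}, []
    print(materialize(total, store, log), log)

    # Simulate losing the machine that held the last two outputs.
    del store["total"], store["double"]
    log.clear()
    print(materialize(total, store, log), log)
```

Only the vertices whose outputs were lost run again; anything still available upstream is reused, which keeps recovery cost proportional to the damage rather than to the size of the job.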
Combining Query Providers
[Figure: a .NET program (C#, VB, F#, etc.) sends a query through the LINQ provider interface and gets objects back; execution engines are arranged by scalability: single-core (LINQ-to-IMDB, LINQ-to-CEP) and multi-core (PLINQ) on the local machine, and cluster (DryadLINQ)]
LINQ == Tree of Operators
A query is a tree of operators. As with a program AST, these trees can be analyzed and rewritten. This is why PLINQ can safely introduce parallelism.
q = from x in A where p(x) select x3;
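A toy version of "analyze and rewrite the operator tree" can be written in Python. This sketch is an assumption-laden illustration, not how PLINQ or DryadLINQ is implemented: the node classes, the parallelize rewrite, and the squaring projection are all made up for the example. It relies on the filter and projection being element-wise and side-effect free.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Source:
    items: List[Any]


@dataclass
class Where:
    child: Any
    predicate: Callable[[Any], bool]


@dataclass
class Select:
    child: Any
    projection: Callable[[Any], Any]


@dataclass
class Concat:
    children: List[Any]


def evaluate(node):
    """Interpret an operator tree sequentially."""
    if isinstance(node, Source):
        return list(node.items)
    if isinstance(node, Where):
        return [x for x in evaluate(node.child) if node.predicate(x)]
    if isinstance(node, Select):
        return [node.projection(x) for x in evaluate(node.child)]
    if isinstance(node, Concat):
        return [x for child in node.children for x in evaluate(child)]
    raise TypeError(f"unknown operator: {node!r}")


def parallelize(node, num_partitions):
    """Rewrite the tree so element-wise operators run once per partition."""
    if isinstance(node, Source):
        size = -(-len(node.items) // num_partitions) or 1  # ceiling division
        chunks = [node.items[i:i + size]
                  for i in range(0, len(node.items), size)]
        return Concat([Source(chunk) for chunk in chunks])
    if isinstance(node, Where):
        inner = parallelize(node.child, num_partitions)
        return Concat([Where(c, node.predicate) for c in inner.children])
    if isinstance(node, Select):
        inner = parallelize(node.child, num_partitions)
        return Concat([Select(c, node.projection) for c in inner.children])
    raise TypeError(f"cannot parallelize: {node!r}")


if __name__ == "__main__":
    A = list(range(10))
    # Analogue of: from x in A where p(x) select f(x), with hypothetical p and f.
    q = Select(Where(Source(A), lambda x: x % 2 == 0), lambda x: x * x)
    assert evaluate(parallelize(q, 3)) == evaluate(q)
```

Each child of the rewritten Concat is an independent sub-query over one chunk, so the children could be evaluated on separate cores or machines; contiguous chunking keeps the result order identical to the sequential plan.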

The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

Microsoft Dryad

  • 1. Tools and Services for Data Intensive Research: An Elephant Through the Eye of a Needle. Roger Barga, Architect, eXtreme Computing Group, Microsoft Research
  • 2.
  • 3. Worldwide government and academic research partnerships
  • 4.
  • 5. Why Commercial Clouds are Important*. Research: have good idea; write proposal; wait 6 months; if successful, wait 3 months; install computers; start work. Science start-ups: have good idea; write business plan; ask VCs to fund; if successful, install computers; start work. Cloud computing model: have good idea; grab nodes from cloud provider; start work; pay for what you used. Also: scalability, cost, sustainability. * Slide used with permission of Paul Watson, University of Newcastle (UK)
  • 6. The Pull of Economics (follow the money). Moore’s “Law” favored consumer commodities: economics drove enormous improvements; specialized processors and mainframes faltered; the commodity software industry was born. Today’s economics: unprecedented economies of scale; enterprise moving to PaaS, SaaS, cloud computing; opportunities for Analysis as a Service, multi-disciplinary data sets,… This will drive changes in research computing and cloud infrastructure, just as did “killer micros” and inexpensive clusters. [Figure: multicore chip layout with LPIA x86 and OoO x86 cores, 1 MB caches, GPUs, DRAM and PCIe controllers, and a NoC]
  • 7.
  • 8. Enrich each element with significantly more metadata, e.g. geolocation. Assume the Twitter user base is currently on the order of 10-50 million; let’s crank this up to the 500 million range. The average Twitter user generates a relatively low incoming message rate right now. Assume that a user’s devices (phone, car, PC) are enhanced to auto-generate periodic Twitter messages on their behalf, e.g. with location ‘pings’, solving other problems that twitterbots are emerging to address. So let’s say the input rate grows again to 10x-100x what it was in the previous step.
  • 9. Drinking from the Twitter Fire Hose. On the “input” end. On the “output” end: three different usage modalities. (1) Each user has one or more ‘agents’ that run on their behalf, monitoring this input stream. This might just be a client that displays an incoming stream from the @friends or #topics or the #interesting&@queries (the user’s standing queries). (2) A user can do more general queries from a search page. Such a query may have more unstructured search terms than the above, and it is expected to run not just against the incoming stream but against the much larger corpus of messages from the entire input stream that has been persisted for days, weeks, months, years… (3) Finally, analytical tools or bots do trend analysis on the knowledge popping out of the stream, in real time. Whether seeded with an interest (“let me know when a problem pops up with <product> that will damage my company’s reputation”) or just discovering a topic from the noise (“let me know when a new hot news item emerges”), both must be possible.
  • 10. Pause for Moment… Defining representative challenges or quests to focus group attention is an excellent way to proceed as a community Publishing a whitepaper articulating these challenges is a great way to allow others to contribute to a shared research agenda Make simulated and reference data sets available to ground such a distributed research effort
  • 11. Drinking from the Twitter Fire Hose On the “input” end On the “output” end: three different usage modalities A combination of live data, including streaming, and historical data Lots of necessary technology, but no single technology is sufficient If this is going to be successful it must be accessible to the masses  Simple to use and highly scalable, which is extremely difficult because in actuality it is not simple…
  • 12.
  • 13. Microsoft’s Dryad. Continuously deployed since 2006. Running on >> 10^4 machines. Sifting through > 10 PB of data daily. Runs on clusters of > 3000 machines. Handles jobs with > 10^5 processes each. Used by >> 100 developers. Rich platform for data analysis. Microsoft Research, Silicon Valley: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly
  • 14. Pause for Moment… Data-Intensive Computing Symposium, 2007 Dryad is now freely available http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx Thanks to Geoffrey Fox (Indiana) and Magda Balazinska (UW) as early adopters Commitment by External Research (MSR) to support research community use
  • 15.
  • 16. DryadLINQ provides simple but powerful programming model
  • 17. Only a few lines of code are needed to implement Terasort (benchmark, May 2008)
  • 18. DryadLINQ result: 349 seconds (5.8 min)
  • 19. Cluster of 240 AMD64 (quad) machines, 920 disks
  • 20. Code: 17 lines of LINQ. DryadDataContext ddc = new DryadDataContext(fileDir); DryadTable<TeraRecord> records = ddc.GetPartitionedTable<TeraRecord>(file); var q = records.OrderBy(x => x); q.ToDryadPartitionedTable(output);
  • 21. LINQ Microsoft’s Language INtegrated Query Available in Visual Studio 2008 A set of operators to manipulate datasets in .NET Support traditional relational operators Select, Join, GroupBy, Aggregate, etc. Data model Data elements are strongly typed .NET objects Much more expressive than SQL tables Extremely extensible Add new custom operators Add new execution providers
  • 22. Dryad Generalizes Unix Pipes. Unix pipes: 1-D: grep | sed | sort | awk | perl. Dryad: 2-D, multi-machine, virtualized: grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
  • 23.
  • 24. TCP pipes (inter-machine)
  • 25.
  • 26. Dryad Job Staging 1. Build 7. Serialize vertices Vertex Code 2. Send .exe 5. Generate graph JM code Cluster services 6. Initialize vertices 3. Start JM 8. Monitor vertex execution 4. Query cluster resources
  • 27.
  • 28. If A’s inputs are gone, run upstream vertices again (recursively);
  • 29. If A is slow, run a copy elsewhere and use the output from whichever copy finishes first. Masks failures in cluster and network.
  • 30. Combining Query Providers Local Machine Execution Engines Scalability .NET program (C#, VB, F#, etc.) DryadLINQ Cluster Query PLINQ LINQ provider interface Multi-core LINQ-to-IMDB Objects LINQ-to-CEP Single-core
  • 31.
  • 34. Nesting queries inside of others is common. PLINQ can fuse partitions: var q1 = from x in A select x*2; var q2 = q1.Sum();
  • 35. Combining with PLINQ Query DryadLINQ subquery PLINQ
  • 36. Combining with LINQ-to-IMDB Query DryadLINQ Subquery Subquery Subquery Subquery Historical Reference Data LINQ-to-IMDB
  • 37. Combining with LINQ-to-CEP Query DryadLINQ Subquery Subquery Subquery Subquery Subquery ‘Live’ Streaming Data LINQ-to-IMDB LINQ-to-CEP
  • 38. Cost of storing data – few cents/month/MB Cost of acquiring data – negligible Extracting insight while acquiring data - priceless Mining historical data for ways to extract insight – precious CEDR CEP – the engine that makes it possible Consistent Streaming Through Time: A Vision for Event Stream Processing Roger S. Barga, Jonathan Goldstein, Mohamed H. Ali, Mingsheng Hong In the proceedings of CIDR 2007
  • 39. Complex Event Processing Complex Event Processing (CEP) is the continuous and incremental processing of event (data) streams from multiple sources based on declarative query and pattern specifications with near-zero latency.
  • 40.
  • 41.
  • 42. Separately specify desired disorder handling strategy
  • 43. Many interesting repercussions. Consistent Streaming Through Time: A Vision for Event Stream Processing. Roger S. Barga, Jonathan Goldstein, Mohamed H. Ali, Mingsheng Hong. In the proceedings of CIDR 2007
  • 44. CEDR (Orinoco) Overview Currently processing over 400M events per day for internal application (5000 events/sec)
  • 45.
  • 46.
  • 47.
  • 48.
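
The fault-tolerance rule from the Dryad slides (if a vertex's inputs are gone, run the upstream vertices again, recursively) can be illustrated with a small Python sketch. This is a toy model, not Dryad's API: the `Vertex` class, the `materialize` function, and the use of `None` to mean "output lost" are all assumptions made for illustration, and the speculative-duplicate rule for slow vertices is not modeled.

```python
# Toy model of recursive re-execution in a dataflow graph.
# All names here are hypothetical; this is not Dryad's API.

class Vertex:
    def __init__(self, name, fn, inputs=()):
        self.name = name
        self.fn = fn                # computation run by this vertex
        self.inputs = list(inputs)  # upstream vertices
        self.output = None          # None models "not produced yet" or "lost"
        self.runs = 0               # how many times this vertex has executed


def materialize(vertex):
    """Return the vertex's output, re-running upstream vertices if needed."""
    if vertex.output is None:
        # Inputs may themselves be missing, so recover them recursively.
        args = [materialize(up) for up in vertex.inputs]
        vertex.output = vertex.fn(*args)
        vertex.runs += 1
    return vertex.output


source = Vertex("source", lambda: [3, 1, 2])
doubled = Vertex("doubled", lambda xs: [2 * x for x in xs], [source])
total = Vertex("total", lambda xs: sum(xs), [doubled])

first = materialize(total)

# Simulate a machine failure that loses the outputs of two vertices.
doubled.output = None
total.output = None

second = materialize(total)
```

Only the vertices whose outputs were lost run again; `source` still holds its output, so it is not re-executed.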

Editor's notes

  1. Language Integrated Query is an extension of .NET that allows one to write declarative computations on collections.
  2. Dryad is a generalization of the Unix piping mechanism: instead of uni-dimensional (chain) pipelines, it provides two-dimensional pipelines. The unit is still a process connected by a point-to-point channel, but the processes are replicated.
  3. This is the basic Dryad terminology.
  4. The brain of a Dryad job is a centralized Job Manager (JM), which maintains the complete state of the job. The JM controls the processes running on a cluster, but never exchanges data with them. (The data plane is completely separated from the control plane.)
  5. Computation Staging
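
The idea in note 2, that Dryad turns a one-dimensional pipeline into a two-dimensional one by replicating each stage, can be sketched in Python. This is an illustrative analogy only: `repartition`, `run_stage`, the round-robin partitioning, and the two stage functions are assumptions, and the replicas run sequentially in one process rather than as separate processes on a cluster.

```python
# Toy "2-D pipeline": each stage is replicated `width` times, and every
# replica processes its own partition of the records.
# All names are hypothetical; this is not Dryad's API.

def repartition(records, width):
    """Distribute records round-robin into `width` partitions."""
    parts = [[] for _ in range(width)]
    for i, rec in enumerate(records):
        parts[i % width].append(rec)
    return parts


def run_stage(stage_fn, records, width):
    """Run `width` replicas of one stage and collect their outputs."""
    outputs = []
    for part in repartition(records, width):
        # On a real cluster each replica would be a separate process.
        outputs.extend(stage_fn(part))
    return outputs


def grep_stage(lines):
    return [ln for ln in lines if "error" in ln]


def upper_stage(lines):
    return [ln.upper() for ln in lines]


lines = ["ok 1", "error a", "ok 2", "error b", "error c"]

# 1-D pipeline: grep | upper
one_d = upper_stage(grep_stage(lines))

# 2-D pipeline: grep^3 | upper^2 (three grep replicas, two upper replicas)
stage1 = run_stage(grep_stage, lines, width=3)
stage2 = run_stage(upper_stage, stage1, width=2)

# Replication changes record order, so compare sorted results.
result = sorted(stage2)
```

Because both stages operate on each record independently, the replicated pipeline yields the same records as the single chain; only their order differs.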