LARGE, DISTRIBUTED COMPUTING
INFRASTRUCTURES
–
OPPORTUNITIES & CHALLENGES

Dominique A. Heger Ph.D.
DHTechnologies, Data Nubes
Austin, TX, USA
 Performance & Capacity Studies
 Availability & Reliability Studies
 Systems Modeling
 Scalability & Speedup Studies
 Linux & UNIX Internals
 Design, Architecture & Feasibility Studies
 Cloud Computing
 Systems Stress Testing & Benchmarking
 Research, Education & Training
 Operations Research
 Machine Learning
 BI, Data Analytics & Data Mining, Predictive Analytics
 Hadoop Ecosystem & MapReduce

www.dhtusa.com
www.datanubes.com
WORLD IS DEALING WITH MASSIVE DATA SETS
World-Wide Digital Data Volume (Source: IDC 2012)
 2000 -> ~800 Terabytes
 2006 -> ~160 Exabytes
 2012 -> ~2.7 Zettabytes
 2020 -> ~35 Zettabytes
 40% to 50% growth rate per year
Name          Abbr.   Number of Bytes (Decimal)
1 megabyte    MB      10^6  = 1,000,000
1 gigabyte    GB      10^9  = 1,000,000,000
1 terabyte    TB      10^12 = 1,000,000,000,000
1 petabyte    PB      10^15 = 1,000,000,000,000,000
1 exabyte     EB      10^18 = 1,000,000,000,000,000,000
1 zettabyte   ZB      10^21 = 1,000,000,000,000,000,000,000
1 yottabyte   YB      10^24 = 1,000,000,000,000,000,000,000,000

Storing and managing 1 PB of data may cost a company between $500K and $1M per year
Source: IDC 2012
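The quoted 40% to 50% annual growth rate can be sanity-checked against the 2012 and 2020 figures above. A small sketch (decimal units assumed):

```python
# Sanity-check the quoted annual growth rate against the
# 2012 -> 2020 forecast figures (2.7 ZB -> 35 ZB over 8 years).
def cagr(start, end, years):
    """Compound annual growth rate."""
    return (end / start) ** (1.0 / years) - 1.0

growth = cagr(2.7, 35.0, 2020 - 2012)
print(f"Implied annual growth: {growth:.1%}")  # roughly 38% per year
```

The implied ~38% per year sits just under the 40-50% range the deck quotes, which is consistent given that all four volume figures are rounded estimates.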
STRUCTURED VERSUS UNSTRUCTURED DATA
 All systems-generated data has structure!
 70% to 80% of the digital data volume is labeled as unstructured
 Currently, most companies make all their business decisions solely based on their structured data pool …
 56% of companies are overwhelmed by their data management requirements
 60% of companies state that timely capturing & analysis of the data is not optimal
 ~2,700 EB of new information in 2012, with the Internet as the primary driver

Chart: complex/unstructured versus relational data volumes
Source: Gartner & IDC (2012)
DATA AS AN ASSET TODAY
Just as the Oil Industry Circa 1900 ….

"After the refining process, one barrel of crude oil yielded more than 40% gasoline and only 3% kerosene, creating large quantities of waste gasoline for disposal."
(Source: "The American Gas Station")

There are many Fortune 1000+ companies today with massive write-once & read-none data sets ….
BIG DATA – BIG CHALLENGES
 Big Data implies that the size of the data sets themselves becomes part of the problem
 Traditional techniques and tools to process the data sets are running out of steam
 A company does not have to be big to have Big Data problems
 Big Data Analytics & Predictive Analytics
 Data management moves from batch to real-time processing (Intel 2012)
 The Cloud IT delivery model supports Big Data projects
HOW TO APPROACH A BIG DATA PROJECT
1. First, treat a Big Data project as a business mandate and NOT as an IT challenge!
2. Define the top 3 most critical business questions that provide insight that will change the company’s dynamic
3. Quantify the current time to answer (TTA) as well as the quality of the answer for these questions
4. Now the Big Data project goals and objectives can be defined as “reduce the time to answer the following business questions from X number of hours down to Y number of minutes”
5. Discuss the technology, people, tools, and project management opportunities required to realize these goals & objectives. Always do a proof of concept (POC)!
PROBLEM DEFINITION
 Given the Big Data goals and a budget, provide a solution (supported by algorithms and an analysis framework) that guarantees that the quality of the answers meets the time and business objectives while data is accumulating over time.
 This can only be achieved by implementing a scalable system infrastructure that fuses human intelligence with statistical and computational design principles (science and engineering)
 This requires the 3 dimensions (systems, tools/algorithms, people) working together to improve the data analysis framework while meeting the goals and objectives
1. Systems -> Design scalability into the IT solutions (Cloud)
2. Algorithms -> Assess/improve scalability, efficiency, and quality of the algorithms
3. People -> Train & leverage human activity and intelligence (Data Scientist, CDO)
STATUS QUO
 Today's solutions reflect fixed points in the solution space

TARGET SOLUTION
 What is required are techniques to dynamically choose the best-possible operating points in the solution space
 Find answers at scale by tightly integrating algorithms, systems, and people

Diagram: Algorithms/Tools, Systems, and People as three overlapping dimensions (Data Nubes)
Source: AMPLab, UCB
ALGORITHMS & TOOLS
 G1 -> The traditional toolsets for machine learning and statistical analysis, such as SAS, SPSS, or the R language. They allow for a deep analysis of smaller data sets (what is considered small is obviously debatable)
 G2 -> 2nd-generation ML toolsets such as Mahout or RapidMiner that provide better scalability compared to G1, but may not support as vast a range of ML algorithms as the G1 tools
 G3 -> 3rd-generation toolsets such as Twister, Spark, HaLoop, Hama, R over Hadoop, or GraphLab that provide deeper analysis cycles over big data sets
 Most current ML algorithms do not scale well to large data sets
 It is sometimes unreasonable to process all data points and expect an answer within the specified time-frame (project goal)
BIG DATA ANALYSIS - SUGGESTED APPROACH
 Given a question to be answered, a time-frame, and a budget, design and implement the system to obtain immediate answers while perpetually improving the quality of the results
 Calibrate the answers and provide error statistics
 Stop the process when the error < given threshold
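The stopping rule above can be sketched with a simple sampling loop: estimate a statistic on progressively larger samples, report the error alongside the answer, and stop once the error falls below the threshold. This is a minimal illustration, not any specific product's algorithm; the threshold, batch size, and synthetic data are assumptions.

```python
# Minimal sketch of "answer now, refine over time": sample in batches,
# calibrate the answer with an error statistic (standard error of the
# mean), and stop when the error drops below the given threshold.
import random
import statistics

def approximate_mean(data, threshold, batch=1_000, seed=42):
    rng = random.Random(seed)
    sample = []
    while True:
        sample.extend(rng.choice(data) for _ in range(batch))
        est = statistics.fmean(sample)
        stderr = statistics.stdev(sample) / len(sample) ** 0.5
        if stderr < threshold:   # stop when error < given threshold
            return est, stderr

# Synthetic 100K-point data set standing in for a large data pool
rng = random.Random(0)
data = [rng.gauss(100.0, 15.0) for _ in range(100_000)]
est, err = approximate_mean(data, threshold=0.5)
```

Each extra batch tightens the error bar, so the caller trades answer quality against time and budget exactly as the slide describes.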
FLEXIBILITY FOR A DYNAMIC SYSTEM
 Given a question to be answered, a time-frame, and a budget, automatically choose the best possible algorithm
 Example: Nearest Neighbor versus Learning Vector Quantization Classifier
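The trade-off in the example can be sketched as follows. A 1-nearest-neighbor classifier scans all training points per query (accurate but O(n)), while a prototype-based classifier scans only one prototype per class (cheap but coarser); nearest-centroid is used here as a simple stand-in for an LVQ-style prototype method. The data, budget rule, and labels are illustrative assumptions.

```python
# Sketch: choose between 1-NN (O(n) per query) and a nearest-centroid
# classifier (an LVQ-style prototype method, O(classes) per query)
# based on whether the time budget allows a full training-set scan.
from collections import defaultdict
import math

def nearest_neighbor(train, x):
    # Scan every labeled point; return the label of the closest one
    return min(train, key=lambda p: math.dist(p[0], x))[1]

def centroids(train):
    # One prototype (mean point) per class
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x1, x2), label in train:
        s = sums[label]
        s[0] += x1; s[1] += x2; s[2] += 1
    return {lbl: (s[0] / s[2], s[1] / s[2]) for lbl, s in sums.items()}

def nearest_centroid(cents, x):
    return min(cents, key=lambda lbl: math.dist(cents[lbl], x))

train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b")]
cents = centroids(train)
# Illustrative budget rule: full scan only for small training sets
budget_allows_full_scan = len(train) < 10_000
label = (nearest_neighbor(train, (5.5, 5.0)) if budget_allows_full_scan
         else nearest_centroid(cents, (5.5, 5.0)))
```

In a real system the switch would be driven by measured per-query latency against the project's TTA goal rather than by training-set size alone.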
SYSTEMS – HADOOP
 Hadoop – a Java-based distributed computing framework designed to support applications that are implemented via the MapReduce programming model
 Hadoop design strategy – move the actual computation to the data
 Old strategy – move the data to the computation (SAN)
 The traditional Hadoop performance focus is on aggregate data set (batch read) performance and NOT on any individual latency scenarios. The current focus, though, is more and more on real-time processing!
 How to extract value from Big Data? ML!
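The MapReduce model the slide refers to can be sketched in a few lines of plain Python: map() emits (key, value) pairs, the framework sorts and groups them by key (the shuffle), and reduce() folds each group. In a real Hadoop deployment these phases run distributed across the cluster; the word-count input below is just an illustration.

```python
# Word count, the canonical MapReduce example, simulated locally.
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    for word in line.lower().split():
        yield word, 1              # emit (key, value) pairs

def reduce_phase(word, counts):
    return word, sum(counts)       # fold all values for one key

lines = ["big data big challenges", "big data analytics"]
pairs = sorted(kv for line in lines for kv in map_phase(line))  # shuffle/sort
result = dict(reduce_phase(w, [c for _, c in grp])
              for w, grp in groupby(pairs, key=itemgetter(0)))
# result == {"analytics": 1, "big": 3, "challenges": 1, "data": 2}
```

Because each mapper reads only its local block of the input, this is exactly the "move the computation to the data" strategy: only the small intermediate (word, count) pairs cross the network.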
HADOOP ECOSYSTEM (PARTIAL VIEW)
Diagram: real-time processing (Twitter), configuration management, data handlers, a data serialization system, tools, the KAFKA distributed messaging system, schedulers, and RDBMS/NoSQL data stores.
SYSTEMS – IN-MEMORY COMPUTING (IMC)
 IMC represents a set of technology components that allow storing data in system memory (DRAM) and/or non-volatile NAND flash memory rather than on traditional hard disks
 Core-based systems and memory prices are coming down. The latency delta between NAND flash memory (µs) and hard disks (ms) is significant while scaling the workload
 IMDG and IMCG products are available now and are solid
 Case study: 177M Tweets/day, 512 bytes each, data set retained for 2 weeks
 Cluster (Intel Quad, 64GB RAM) with 1TB RAM -> ~$30,000 (20 parallel quad nodes)
 In-Memory Hadoop is available now (GridGain)
 Non-volatile Phase-Change RAM (PCRAM) or Resistive RAM (RRAM) technologies may supersede NAND flash soon
 Establish an In-Memory Computing roadmap (due-diligence & feasibility study)

Source: Gartner, 2012
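The case-study numbers check out with back-of-the-envelope arithmetic (decimal terabytes assumed):

```python
# Verify the Twitter case study: 177M tweets/day at 512 bytes each,
# retained for two weeks.
tweets_per_day = 177_000_000
bytes_per_tweet = 512
days = 14

total_bytes = tweets_per_day * bytes_per_tweet * days
total_tb = total_bytes / 10**12        # decimal terabytes
print(f"{total_tb:.2f} TB")            # ~1.27 TB
```

At ~1.27 TB, the working set is in the ballpark of the quoted cluster: 20 nodes at 64 GB each provide 1.28 TB of aggregate RAM.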
BIG DATA SYSTEMS FOCUS
 Convert the data center into a (Hadoop) processing unit
 Commodity HW, Intel Core, interconnect, local disks, no SAN
 Support existing cluster computing applications (via Cassandra, Hive, Pig, or HBase)
 Support interactive and iterative data analysis (ML)
 Support predictive, insightful query languages (Hive, Pig)
 Support efficient and effective data movement among RDBMS and column-oriented data stores (Sqoop)
 Support distributed maintenance and monitoring of the entire IT infrastructure (Ganglia, Nagios, Chukwa, Ambari, White Elephant)
 Scalability, robustness, performance, diversity, analytics, data visualization, and security aspects have to be designed into the solution
 Make it all happen in a Cloud environment
BIG DATA & CLOUD COMPUTING
 Pay by use instead of provisioning for peak
 Risk of over-provisioning: underutilization
 Heavy penalty for under-provisioning (lost revenue, lost users)
 Big Data -> Analytics as a Service (AaaS), which may be based on IaaS, PaaS, or SaaS

Diagram: resources versus time – a traditional data center provisions fixed capacity above peak demand (unused resources), while a cloud-based data center scales capacity with demand.
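The over-provisioning penalty is easy to quantify. With illustrative (made-up) numbers: a diurnal load that peaks at 100 servers but averages 40 means a fixed data center pays for 60% idle capacity that a pay-per-use model avoids.

```python
# Illustrative cost comparison: fixed peak provisioning vs. pay-per-use.
# All figures below (server counts, hourly rate) are assumptions.
peak_servers = 100
avg_servers = 40
hourly_cost = 0.50            # assumed $/server-hour
hours = 24 * 365

fixed_cost = peak_servers * hourly_cost * hours   # provision for peak
cloud_cost = avg_servers * hourly_cost * hours    # pay for actual use
waste = 1 - cloud_cost / fixed_cost               # idle spend avoided
```

The under-provisioning case is the mirror image and is usually worse: capacity below demand costs revenue and users rather than just dollars.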
PEOPLE – BIG DATA
 Assure that people are an integral part of the solution system
 Leverage human activity
 Leverage human intelligence
 Leverage crowdsourcing (online community)
 Curate and clean dirty data (Data Cleaner, Data Wrangler)
 Address imprecise questions
 Design, validate, and improve algorithms
 After the business objectives are set, address any data-at-scale project by tightly integrating algorithms, systems, and people
PEOPLE – MASSIVE DEMAND & SMALL TALENT POOL
 The US alone is facing an estimated shortage of approximately 190,000 scientists with deep analytical skills by 2018 (Source: McKinsey, 2011)
 By 2018, the US alone is facing an estimated shortage of approximately 1.5 million managers and analysts who have the know-how to leverage the results of big data studies to make effective business decisions (Source: McKinsey, 2011)
 The Hadoop ecosystem & Cloud computing in general are powered by Linux. 91.4% of the top 500 supercomputers are Linux-based (Source: TOP500)
 A 2013 job report compiled by Dice showed that 93% of the contacted US companies (850 firms) are hiring Linux professionals this year.
 The same study revealed that 90% of the firms stated that it is very difficult at the moment (2013) to even find Linux talent in the US. This number is up from 80% in the 2012 study.
 According to Dice, the average salary increase for a Linux professional in the US is approximately 9% this year, while the average IT salary increase in the US is approximately 5%.
BIG DATA 2020
 Approach Big Data problems first as a business case (not an IT project) and strive for results that provide the right-quality answers at the right time.
 Big Data projects require the fusion of algorithms/tools, systems, and people.
 In-Memory Computing (IMC), Complex Event Processing (CEP), as well as Quantum Computing reflect powerful options for Big Data projects
 Massive research opportunities across many domains exist, but the main objectives are:
   Create a new generation of Big Data scientists (cross-disciplinary talent)
   Machine Learning has to become an engineering discipline
   Develop competency centers for the Big Data ecosystem
   Develop centers of excellence for Linux & SW engineering
   Leverage Cloud computing for Big Data; evaluate IMC/CEP now
   Plan for IMC, CEP, Cloud, and the Big Data SW/HW infrastructure at the top company level, not within the IT department
   Leverage and be active in the Open Source community
THANKS MUCH!
SQL, NoSQL & NewSQL Framework
 NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance as NoSQL systems for online transaction processing (read-write) workloads while still maintaining the ACID (Atomicity, Consistency, Isolation, Durability) guarantees of a traditional database system

Source: Infochimps (2012)
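The atomicity guarantee NewSQL preserves can be demonstrated with Python's built-in sqlite3, used here as a stand-in for any traditional ACID database (not a NewSQL product): a transfer that fails mid-transaction rolls back completely, leaving both balances unchanged.

```python
# Atomicity demo: a failed two-step transfer leaves no partial update.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO accounts VALUES (?, ?)",
                [("alice", 100), ("bob", 50)])
con.commit()

try:
    with con:  # one atomic transaction; rolled back on exception
        con.execute(
            "UPDATE accounts SET balance = balance - 80 WHERE id='alice'")
        raise RuntimeError("simulated crash before crediting bob")
except RuntimeError:
    pass

balances = dict(con.execute("SELECT id, balance FROM accounts"))
# balances == {"alice": 100, "bob": 50} -- the debit was rolled back
```

Classic NoSQL stores typically relax exactly this guarantee in exchange for horizontal scalability; NewSQL systems aim to keep it at scale.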
Column versus Row Data Store – Data Operations
Column versus Row Data Store – Memory Storage
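The trade-off those two slides illustrate can be sketched in a few lines: an aggregate over a single attribute must touch every whole record in a row store, but only one contiguous array in a column store. The tiny table below is an illustrative assumption.

```python
# Row layout vs. column layout for the same three-record table.
rows = [  # row store: one record per entity
    {"id": 1, "region": "US", "revenue": 120},
    {"id": 2, "region": "EU", "revenue": 80},
    {"id": 3, "region": "US", "revenue": 200},
]

columns = {  # column store: one array per attribute
    "id": [1, 2, 3],
    "region": ["US", "EU", "US"],
    "revenue": [120, 80, 200],
}

row_total = sum(r["revenue"] for r in rows)  # scans whole records
col_total = sum(columns["revenue"])          # scans one packed array
```

Both totals are equal, but the columnar scan reads only the bytes of one attribute, which is why column stores dominate analytical (read-mostly, aggregate-heavy) workloads while row stores suit record-at-a-time OLTP.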

2013 International Conference on Knowledge, Innovation and Enterprise Presentation

  • 1. LARGE, DISTRIBUTED COMPUTING INFRASTRUCTURES – OPPORTUNITIES & CHALLENGES Dominique A. Heger Ph.D. DHTechnologies, Data Nubes Austin, TX, USA
  • 2. Performance & Capacity Studies Availability & Reliability Studies Systems Modeling Scalability & Speedup Studies Linux & UNIX Internals Design, Architecture & Feasibility Studies Cloud Computing Systems StressTesting & Benchmarking Research, Education & Training Operations Research Machine Learning BI, Data Analytics & Data Mining, Predictive Analytics Hadoop Ecosystem & MapReduce www.dhtusa.com www.datanubes.com
  • 3. WORLD IS DEALING WITH MASSIVE DATA SETS World-Wide Digital Data Volume (Source IDC 2012)  2000 -> ~800 Terabytes  2006 -> ~160 Exabytes  2012 -> ~2.7 Zettabytes  2020 -> ~35 Zettabytes  40% to 50% growth-rate per year Name Usage (Decimal) Number of Bytes (Decimal) 1 megabyte MB 106 1,000,000 1 gigabyte GB 109 1,000,000,000 1 terabyte TB 1012 1,000,000,000,000 1 petabyte PB 1015 1,000,000,000,000,000 1 exabyte EB 1018 1,000,000,000,000,000,000 1 zettabyte ZB 1021 1,000,000,000,000,000,000,000 1 yottabyte  Abbr. YB 1024 1,000,000,000,000,000,000,000,000 Storing and managing 1PB of data may cost a company between $500K - $1M/year Source: IDC 2012
  • 4. STRUCTURED VERSUS UNSTRUCTURED DATA
    - All system-generated data has structure!
    - 70% to 80% of the digital data volume is labeled as unstructured
    - Currently, most companies base all their business decisions solely on their structured data pool
    - 56% of companies are overwhelmed by their data management requirements
    - 60% of companies state that timely capturing & analysis of the data is not optimal
    - ~2,700 EB of new information in 2012, with the Internet as the primary driver
    (Figure: relational vs. complex/unstructured data. Source: Gartner & IDC, 2012)
  • 5. DATA AS AN ASSET TODAY
    Just as the oil industry circa 1900: after the refining process, one barrel of crude oil yielded more than 40% gasoline and only 3% kerosene, creating large quantities of waste gasoline for disposal (Book: "The American Gas Station").
    There are many Fortune 1000+ companies today with massive write-once, read-never data sets.
  • 6. BIG DATA – BIG CHALLENGES
    - Big Data implies that the size of the data sets themselves becomes part of the problem
    - Traditional techniques and tools to process the data sets are running out of steam
    - A company does not have to be big to have Big Data problems
    - Big Data Analytics & Predictive Analytics
    - Data management moves from batch to real-time processing (Intel 2012)
    - The Cloud IT delivery model supports Big Data projects
  • 7. HOW TO APPROACH A BIG DATA PROJECT
    1. Treat the Big Data project as a business mandate and NOT as an IT challenge!
    2. Define the top 3 most critical business questions that provide insight that will change the company's dynamic
    3. Quantify the current time to answer (TTA) as well as the quality of the answer for these questions
    4. Now the Big Data project goals and objectives can be defined as "reduce the time to answer the following business questions from X hours down to Y minutes"
    5. Discuss the technology, people, tools, and project management opportunities required to realize these goals & objectives. Always do a POC!
  • 8. PROBLEM DEFINITION
    Given the Big Data goals and a budget, provide a solution (supported by algorithms and an analysis framework) that guarantees that the quality of the answers meets the time and business objectives while data is accumulating over time.
    - This can only be achieved by implementing a scalable system infrastructure that fuses human intelligence with statistical and computational design principles (science and engineering)
    - Requires the 3 dimensions (systems, tools/algorithms, people) working together to improve the data analysis framework while meeting the goals and objectives
      1. Systems -> Design scalability into the IT solutions (Cloud)
      2. Algorithms -> Assess/improve scalability, efficiency, and quality of the algorithms
      3. People -> Train & leverage human activity and intelligence (Data Scientist, CDO)
  • 9. STATUS QUO
    Today's solutions reflect fixed points in the solution space
  • 10. TARGET SOLUTION
    - What is required are techniques to dynamically choose the best possible operating points in the solution space
    - Find answers at scale by tightly integrating algorithms, systems, and people
    (Figure: intersection of Algorithms/Tools, Systems, and People, labeled Data Nubes. Source: AMPLab, UCB)
  • 11. ALGORITHMS & TOOLS
    - G1 -> Traditional toolsets for machine learning and statistical analysis, such as SAS, SPSS, or the R language. They allow for a deep analysis of smaller data sets (what is considered small is obviously debatable)
    - G2 -> 2nd-generation ML toolsets such as Mahout or RapidMiner that provide better scalability than G1, but may not support as broad a range of ML algorithms as the G1 tools
    - G3 -> 3rd-generation toolsets such as Twister, Spark, HaLoop, Hama, R over Hadoop, or GraphLab that provide deeper analysis cycles over big data sets
    - Most current ML algorithms do not scale well to large data sets
    - It is sometimes unreasonable to process all data points and still expect an answer within the specified time frame (project goal)
  • 12. BIG DATA ANALYSIS - SUGGESTED APPROACH
    - Given a question to be answered, a time frame, and a budget, design and implement the system to obtain immediate answers while perpetually improving the quality of the results
    - Calibrate the answers and provide error statistics
    - Stop the process when the error < given threshold
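The "immediate answer, perpetually improving" loop above can be sketched as progressive sampling with error statistics. Everything in this sketch (the function name, batch size, stopping rule, and the synthetic Gaussian data) is an illustrative assumption, not part of the deck:

```python
import random
import statistics

def progressive_mean(data, batch=1_000, threshold=0.05, seed=1):
    """Yield (estimate, standard_error) from growing random samples,
    stopping once the standard error drops below the given threshold."""
    rng = random.Random(seed)
    sample = []
    while True:
        sample.extend(rng.choice(data) for _ in range(batch))
        mean = statistics.fmean(sample)
        stderr = statistics.stdev(sample) / len(sample) ** 0.5
        yield mean, stderr              # calibrated answer + error statistic
        if stderr < threshold:          # stop when error < given threshold
            return

rng = random.Random(42)
data = [rng.gauss(10, 2) for _ in range(100_000)]   # stand-in "big" data set
for mean, err in progressive_mean(data):
    print(f"estimate {mean:.3f} (stderr {err:.4f})")
```

Each iteration yields a usable answer with an error bar, so the consumer can act early and refine later, which is the trade-off the slide describes.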
  • 13. FLEXIBILITY FOR A DYNAMIC SYSTEM
    - Given a question to be answered, a time frame, and a budget, automatically choose the best possible algorithm
    - Example: Nearest Neighbor versus a Learning Vector Quantization classifier
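To make one half of the slide's example concrete, here is a minimal 1-nearest-neighbor classifier in plain Python (the LVQ alternative is omitted; the training points and labels are invented for illustration):

```python
def nn_predict(train, labels, point):
    """Return the label of the training point closest to `point`
    (squared Euclidean distance; ties are not handled)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(range(len(train)), key=lambda i: dist(train[i], point))
    return labels[nearest]

# Two toy 2-D clusters
train = [(0, 0), (0, 1), (5, 5), (6, 5)]
labels = ["blue", "blue", "red", "red"]
print(nn_predict(train, labels, (1, 1)))   # a point near the "blue" cluster
```

A dynamic system in the slide's sense would score candidate classifiers like this one against the time frame and budget per question, then pick the winner automatically.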
  • 14. SYSTEMS – HADOOP
    - Hadoop: a Java-based distributed computing framework designed to support applications implemented via the MapReduce programming model
    - Hadoop design strategy: move the computation to the data
    - Old strategy: move the data to the computation (SAN)
    - The traditional Hadoop performance focus is on aggregate data set (batch read) performance and NOT on individual latency scenarios. The current focus, though, is shifting more and more toward real-time processing!
    - How to extract value from Big Data? ML!
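The MapReduce model the slide refers to can be illustrated with the canonical word-count job, simulated here in-process (on a real Hadoop cluster the map and reduce phases run distributed over the data nodes; this sketch only mimics map, shuffle/sort, and reduce locally):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # map: emit a (word, 1) pair for every word
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # shuffle/sort by key, then reduce: sum the counts per word
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

lines = ["Big Data big challenges", "big data"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts)
```

Because each map call sees only its own input split and each reduce call only one key's values, the same program parallelizes across a cluster, which is what lets Hadoop move computation to the data.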
  • 15. HADOOP ECOSYSTEM (PARTIAL VIEW)
    (Figure: ecosystem diagram with categories: Real-Time Processing (Twitter), Configuration Management, Data Handlers, Data Serialization, System Tools, Distributed Messaging System (Kafka), Schedulers, RDBMS, Data Store & NoSQL)
  • 16. SYSTEMS – IN-MEMORY COMPUTING (IMC)
    - IMC represents a set of technology components that allow storing data in system memory (DRAM) and/or non-volatile NAND flash memory rather than on traditional hard disks
    - Core-based systems and memory prices are coming down. The latency delta between NAND flash memory (ns) and HDDs (ms) is significant while scaling the workload
    - IMDG and IMCG products are available now and are solid
    - Case study: 177M tweets/day, 512 bytes each, a 2-week data set. A cluster (Intel quad-core, 64 GB RAM per node) with 1 TB of total RAM -> ~$30,000 (20 parallel quad nodes)
    - In-memory Hadoop is available now (GridGain)
    - Non-volatile Phase-Change RAM (PCRAM) or Resistive RAM (RRAM) technologies may supersede NAND flash soon
    - Establish an In-Memory Computing roadmap (due-diligence & feasibility study)
    (Source: Gartner, 2012)
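The case-study numbers above can be checked with simple arithmetic; this only reproduces the slide's back-of-envelope sizing (the decimal-TB conversion is an assumption):

```python
# Two weeks of tweets at 512 bytes each vs. the quoted cluster RAM
tweets_per_day = 177_000_000
bytes_per_tweet = 512
days = 14

data_set_tb = tweets_per_day * bytes_per_tweet * days / 1e12   # decimal TB
cluster_ram_tb = 20 * 64 / 1000                                # 20 nodes x 64 GB

print(f"data set ~{data_set_tb:.2f} TB, cluster RAM {cluster_ram_tb:.2f} TB")
```

The two-week tweet set (~1.27 TB) just about fits the quoted ~1.28 TB of aggregate cluster RAM, which is the point of the case study.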
  • 17. BIG DATA SYSTEMS FOCUS
    Convert the data center into a (Hadoop) processing unit:
    - Commodity HW, Intel Core, interconnect, local disks, no SAN
    - Support existing cluster computing applications (via Cassandra, Hive, Pig, or HBase)
    - Support interactive and iterative data analysis (ML)
    - Support predictive, insightful query languages (Hive, Pig)
    - Support efficient and effective data movement between RDBMS and column-oriented data stores (Sqoop)
    - Support distributed maintenance and monitoring of the entire IT infrastructure (Ganglia, Nagios, Chukwa, Ambari, White Elephant)
    - Scalability, robustness, performance, diversity, analytics, data visualization, and security aspects have to be designed into the solution
    - Make it all happen in a Cloud environment
  • 18. BIG DATA & CLOUD COMPUTING
    - Pay by use instead of provisioning for peak
    - Risk of over-provisioning: underutilization
    - Heavy penalty for under-provisioning (lost revenue, lost users)
    - Big Data -> Analytics as a Service (AaaS), may be based on IaaS, PaaS, SaaS
    (Figure: capacity vs. demand over time. In a traditional data center, fixed capacity leaves unused resources whenever demand is below the provisioned peak; in a cloud-based data center, capacity tracks demand)
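The provisioning trade-off in the figure can be quantified with a toy demand curve; every number here (demand samples, unit price) is invented for illustration:

```python
# Hourly demand samples (capacity units) and an assumed unit-hour price
demand = [20, 35, 50, 90, 60, 30]
price = 0.10

peak_cost = max(demand) * price * len(demand)   # provision for peak, always on
cloud_cost = sum(demand) * price                # pay only for what is used
utilization = sum(demand) / (max(demand) * len(demand))

print(f"peak-provisioned: ${peak_cost:.2f}, pay-per-use: ${cloud_cost:.2f}, "
      f"utilization when peak-provisioned: {utilization:.0%}")
```

With this made-up curve, peak provisioning costs nearly twice as much as pay-per-use while the fixed capacity sits barely half utilized, which is exactly the over-provisioning penalty the slide names.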
  • 19. PEOPLE – BIG DATA
    Assure that people are an integral part of the solution system:
    - Leverage human activity
    - Leverage human intelligence
    - Leverage crowdsourcing (online community)
    - Curate and clean dirty data (Data Cleaner, Data Wrangler)
    - Address imprecise questions
    - Design, validate, and improve algorithms
    After the business objectives are set, address any data-at-scale project by tightly integrating algorithms, systems, and people
  • 20. PEOPLE – MASSIVE DEMAND & SMALL TALENT POOL
    - By 2018, the US alone is facing an estimated shortage of approximately 190,000 scientists with deep analytical skills (Source: McKinsey, 2011)
    - By 2018, the US alone is facing an estimated shortage of approximately 1.5 million managers and analysts with the know-how to leverage the results of Big Data studies to make effective business decisions (Source: McKinsey, 2011)
    - The Hadoop ecosystem and Cloud computing in general are powered by Linux; 91.4% of the top 500 supercomputers are Linux-based (Source: TOP500)
    - A 2013 job report compiled by Dice showed that 93% of the contacted US companies (850 firms) are hiring Linux professionals this year. The same study revealed that 90% of the firms stated that it is very difficult at the moment (2013) to even find Linux talent in the US, up from 80% in the 2012 study
    - According to Dice, the average salary increase for a Linux professional in the US is approximately 9% this year, versus an average IT salary increase of approximately 5%
  • 21. BIG DATA 2020
    - Approach Big Data problems first as a business case (not an IT project) and strive for results that provide the right quality at the right time
    - Big Data projects require the fusion of algorithms/tools, systems, and people
    - In-Memory Computing (IMC), Complex Event Processing (CEP), as well as Quantum Computing reflect powerful options for Big Data projects
    - Massive research opportunities exist across many domains, but the main objectives are:
      - Create a new generation of Big Data scientists (cross-disciplinary talent)
      - Machine Learning has to become an engineering discipline
      - Develop competency centers for the Big Data ecosystem
      - Develop centers of excellence for Linux & SW engineering
      - Leverage Cloud computing for Big Data; evaluate IMC/CEP now
      - Plan for IMC, CEP, Cloud, and the Big Data SW/HW infrastructure at the top company level, not the IT department
      - Leverage and be active in the Open Source community
  • 23. SQL, NoSQL & NewSQL Framework
    NewSQL is a class of modern relational database management systems that seek to provide the scalable performance of NoSQL systems for online transaction processing (read-write) workloads while still maintaining the ACID (Atomicity, Consistency, Isolation, Durability) guarantees of a traditional database system (Source: Infochimps, 2012)
  • 24. Column versus Row Data Store – Data Operations
  • 25. Column versus Row Data Store – Memory Storage
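A toy sketch of the layout difference these two slides compare (not modeled on any particular engine; the sample records are invented):

```python
# The same three records in a row-oriented and a column-oriented layout
rows = [
    {"id": 1, "name": "ada",   "age": 36},
    {"id": 2, "name": "alan",  "age": 41},
    {"id": 3, "name": "grace", "age": 45},
]

# Row store: each record is contiguous; good for whole-record reads and writes
row_store = rows

# Column store: one contiguous array per attribute; an aggregate over a single
# column only has to touch that column's data
col_store = {key: [r[key] for r in rows] for key in rows[0]}

avg_age = sum(col_store["age"]) / len(col_store["age"])
print(col_store["age"], avg_age)
```

This is why column stores tend to win for analytical scans and aggregates, while row stores favor transactional access to whole records.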