SlideShare une entreprise Scribd logo
1  sur  27
Télécharger pour lire hors ligne
©2015
Slide 1
Introduction to Big Data
- Satish Gopalani, Ashish Tadose, Deepak Dixit
(Ampool)
©2015
Slide 2
What is big-data?
Definition of big data and 4 V’s
Some Stats
Who is using Big data?
Applications
Intro to Hadoop and Map Reduce
Coins counting analogy
Typical workflow of Hadoop
Big data for Statisticians.
Problems with Big Data
Conclusion
OUTLINE
©2015
Slide 3
Big data refers to a collection of data sets so large and complex that it becomes difficult to process
using on-hand database management tools or traditional data processing applications. That is, beyond
current comfort levels. “Big” is relative, depending on context, amount of data and complexity of the
problem.
Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-
effective, innovative forms of information processing that enable enhanced insight, decision making,
and process automation. (Gartner)
Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and
unstructured data that is so large it is difficult to process using traditional database and software
techniques (Webopedia)
• Multiple terabytes or petabytes
• Today’s big may be tomorrow’s normal
• It is relative to its context
Ref: https://www.stat.wisc.edu/bigdata
What is Big Data
©2015
Slide 4
©2015
Slide 5
OBAMA ADMINISTRATION UNVEILS “BIG
DATA” INITIATIVE: ANNOUNCES $200
MILLION IN NEW R&D INVESTMENTS
https://www.whitehouse.gov/sites/default/files/mi
crosites/ostp/big_data_press_release_final_2.pd
f
Stats
©2015
Slide 6
©2015
Slide 7
Ref: http://bit.ly/1RPwZnb
©2015
Slide 8Ref: http://www.slideshare.net/VipinBatra/introduction-to-big-data-45010980
©2015
Slide 9
Introduction to Hadoop and Map Reduce
©2015
Slide 10
• Apache Hadoop is an open-source software framework for
distributed storage and distributed processing for very
large data sets.
• Works on computer clusters built from commodity
hardware.
• Popularized by Google Map Reduce paper in 2004.
• Written in Java.
• Has two main components: MapReduce and HDFS
Hadoop
©2015
Slide 11
Counting coins Analogy
Img src: http://thelogicalindian.com/wp-content/uploads/2015/09/Untitled-138-750x500.jpg
©2015
Slide 12
Anatomy of MapReduce
d a c
a b c
a 3
b 1
c 2
a 1
b 1
c 1
a 1
c 1
a 1
a 1 1 1
b 1
c 1 1
HDFS mappers reducers HDFS
©2015
Slide 13
Evolution of Analytics Process
©2015
Slide 14
©2015
Slide 15
©2015
Slide 16
©2015
Slide 17
Ref: http://bit.ly/S1ma4Z
Seven (7) Tips for Statisticians using Big
Data
©2015
Slide 18
One temptation in applied statistics is
to take a tool you know well
(regression) and use it to hit all the
nails.
There is a similar temptation in big
data to get fixated on a tool (hadoop,
pig, hive, nosql databases, distributed
computing, gpgpu, etc.) and ignore the
problem of can we infer x relates to y
or that x predicts y.
Problem first not solution backward
©2015
Slide 19
Even in small data example, there can be a bug in the code used to
analyze them. With big data and complex models this is even more
important. Mozilla Science is doing interesting work on code review
for data analysis in science. But in general if you just get a friend to
look over your code it will catch a huge fraction of the problems you
might have.
Make your code and data available and have
smart people check it
©2015
Slide 20
Unless you ran a randomized trial, potential
confounders should keep you up at night
Any time you discover a cool new result, your first thought should be, "what are
the potential confounders?"
©2015
Slide 21
It can be easy to be tricked by the size of a data set. Imagine you have an
image of a simple black circle on a white background stored as pixels. As the
resolution increases the size of the data increases, but the amount of
information may not. In general the bigger the sample size the better and
sample size and data size aren't always tightly correlated.
Know what your real sample size is.
©2015
Slide 22
Before you analyze your data with computers, be sure to
plot it
A common mistake made by amateur
analysts is to immediately jump to fitting
models to big data sets with the fanciest
computational tool. But you can miss
pretty obvious things like this if you don't
plot your data.
©2015
Slide 23
If you want to understand a data set you have to be able to play around with it
and explore it. You need to make tables, make plots, identify quirks, outliers,
missing data patterns and problems with the data. To do this you need to
interact with the data quickly. One way to do this is to analyze the whole data
set at once using tools like Hive, Hadoop, or Pig. But an often easier, better,
and more cost effective approach is to use random sampling . As Robert
Gentleman put it "make big data as small as possible as quick as possible".
Interactive analysis is the best way to really figure out
what is going on in a data set
©2015
Slide 24
If the goal is prediction accuracy, average many prediction
models together
In general, the prediction algorithms that most frequently win Kaggle
competitions or the Netflix prize blend multiple models together.
The idea is that by averaging (or majority voting) multiple good
prediction algorithms you can reduce variability without giving up bias.
©2015
Slide 25
The parable of Google Flu: traps in big data
analysis
Google Flu Trends: the limits of big data
Eight (No, Nine!) Problems with Big Data
Big Data Problems
Ref: http://bit.ly/1fUzZO1
©2015
Slide 26
Conclusion
©2015
Slide 27
Questions?

Contenu connexe

Tendances

Lesson 1 introduction to_big_data_and_hadoop.pptx
Lesson 1 introduction to_big_data_and_hadoop.pptxLesson 1 introduction to_big_data_and_hadoop.pptx
Lesson 1 introduction to_big_data_and_hadoop.pptxPankajkumar496281
 
Big Data - The 5 Vs Everyone Must Know
Big Data - The 5 Vs Everyone Must KnowBig Data - The 5 Vs Everyone Must Know
Big Data - The 5 Vs Everyone Must KnowBernard Marr
 
Analysis of big data in pandemic case
Analysis of big data in pandemic case Analysis of big data in pandemic case
Analysis of big data in pandemic case Muh Saleh
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data AnalyticsS P Sajjan
 
Big data analytics with Apache Hadoop
Big data analytics with Apache  HadoopBig data analytics with Apache  Hadoop
Big data analytics with Apache HadoopSuman Saurabh
 
The Future Of Big Data
The Future Of Big DataThe Future Of Big Data
The Future Of Big DataMatthew Dennis
 
Big Data & Data Science
Big Data & Data ScienceBig Data & Data Science
Big Data & Data ScienceBrijeshGoyani
 
Addressing Big Data Challenges - The Hadoop Way
Addressing Big Data Challenges - The Hadoop WayAddressing Big Data Challenges - The Hadoop Way
Addressing Big Data Challenges - The Hadoop WayXoriant Corporation
 
Big Data Analytics MIS presentation
Big Data Analytics MIS presentationBig Data Analytics MIS presentation
Big Data Analytics MIS presentationAASTHA PANDEY
 
Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013boorad
 
Intro to big data and applications - day 2
Intro to big data and applications - day 2Intro to big data and applications - day 2
Intro to big data and applications - day 2Parviz Vakili
 
big data overview ppt
big data overview pptbig data overview ppt
big data overview pptVIKAS KATARE
 

Tendances (20)

Bigdata " new level"
Bigdata " new level"Bigdata " new level"
Bigdata " new level"
 
View on big data technologies
View on big data technologiesView on big data technologies
View on big data technologies
 
Lesson 1 introduction to_big_data_and_hadoop.pptx
Lesson 1 introduction to_big_data_and_hadoop.pptxLesson 1 introduction to_big_data_and_hadoop.pptx
Lesson 1 introduction to_big_data_and_hadoop.pptx
 
Big Data - The 5 Vs Everyone Must Know
Big Data - The 5 Vs Everyone Must KnowBig Data - The 5 Vs Everyone Must Know
Big Data - The 5 Vs Everyone Must Know
 
Analysis of big data in pandemic case
Analysis of big data in pandemic case Analysis of big data in pandemic case
Analysis of big data in pandemic case
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data Analytics
 
Big data analytics with Apache Hadoop
Big data analytics with Apache  HadoopBig data analytics with Apache  Hadoop
Big data analytics with Apache Hadoop
 
The Future Of Big Data
The Future Of Big DataThe Future Of Big Data
The Future Of Big Data
 
Big Data & Data Science
Big Data & Data ScienceBig Data & Data Science
Big Data & Data Science
 
Big Data Tech Stack
Big Data Tech StackBig Data Tech Stack
Big Data Tech Stack
 
Addressing Big Data Challenges - The Hadoop Way
Addressing Big Data Challenges - The Hadoop WayAddressing Big Data Challenges - The Hadoop Way
Addressing Big Data Challenges - The Hadoop Way
 
Big Data Analytics MIS presentation
Big Data Analytics MIS presentationBig Data Analytics MIS presentation
Big Data Analytics MIS presentation
 
Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013
 
Are you ready for BIG DATA?
Are you ready for BIG DATA?Are you ready for BIG DATA?
Are you ready for BIG DATA?
 
Big data tools
Big data toolsBig data tools
Big data tools
 
Bigdata
BigdataBigdata
Bigdata
 
Intro to big data and applications - day 2
Intro to big data and applications - day 2Intro to big data and applications - day 2
Intro to big data and applications - day 2
 
BDaas- BigData as a service
BDaas- BigData as a service  BDaas- BigData as a service
BDaas- BigData as a service
 
big data overview ppt
big data overview pptbig data overview ppt
big data overview ppt
 
Big Data
Big DataBig Data
Big Data
 

Similaire à Introduction to Big Data

The book of elephant tattoo
The book of elephant tattooThe book of elephant tattoo
The book of elephant tattooMohamed Magdy
 
How to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceHow to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceJuuso Parkkinen
 
Building a Data Platform Strata SF 2019
Building a Data Platform Strata SF 2019Building a Data Platform Strata SF 2019
Building a Data Platform Strata SF 2019mark madsen
 
Big Data – Is it a hype or for real?
 Big Data – Is it a hype or for real?  Big Data – Is it a hype or for real?
Big Data – Is it a hype or for real? Dirk Ortloff
 
International Conference on Smart Computing and Electronic Ent.docx
International Conference on Smart Computing and Electronic Ent.docxInternational Conference on Smart Computing and Electronic Ent.docx
International Conference on Smart Computing and Electronic Ent.docxvrickens
 
International Conference on Smart Computing and Electronic Ent.docx
International Conference on Smart Computing and Electronic Ent.docxInternational Conference on Smart Computing and Electronic Ent.docx
International Conference on Smart Computing and Electronic Ent.docxdoylymaura
 
Map Reduce in Big fata
Map Reduce in Big fataMap Reduce in Big fata
Map Reduce in Big fataSuraj Sawant
 
Big Data Fundamentals
Big Data FundamentalsBig Data Fundamentals
Big Data FundamentalsSmarak Das
 
Semantech Inc. - Mastering Enterprise Big Data - Intro
Semantech Inc. - Mastering Enterprise Big Data - IntroSemantech Inc. - Mastering Enterprise Big Data - Intro
Semantech Inc. - Mastering Enterprise Big Data - IntroStephen Lahanas
 
Introduction to Data Science.pptx
Introduction to Data Science.pptxIntroduction to Data Science.pptx
Introduction to Data Science.pptxDr.Shweta
 
BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015Fiona Lew
 
Introduction to Data Science.pptx
Introduction to Data Science.pptxIntroduction to Data Science.pptx
Introduction to Data Science.pptxVrishit Saraswat
 
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)mark madsen
 
Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.
Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.
Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.Jennifer Walker
 
Ab cs of big data
Ab cs of big dataAb cs of big data
Ab cs of big dataDigimark
 
Drinking from the Fire Hose: Practical Approaches to Big Data Preparation and...
Drinking from the Fire Hose: Practical Approaches to Big Data Preparation and...Drinking from the Fire Hose: Practical Approaches to Big Data Preparation and...
Drinking from the Fire Hose: Practical Approaches to Big Data Preparation and...Inside Analysis
 
Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Betacowork
 

Similaire à Introduction to Big Data (20)

Big Data
Big DataBig Data
Big Data
 
The book of elephant tattoo
The book of elephant tattooThe book of elephant tattoo
The book of elephant tattoo
 
How to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceHow to Prepare for a Career in Data Science
How to Prepare for a Career in Data Science
 
Building a Data Platform Strata SF 2019
Building a Data Platform Strata SF 2019Building a Data Platform Strata SF 2019
Building a Data Platform Strata SF 2019
 
Big Data – Is it a hype or for real?
 Big Data – Is it a hype or for real?  Big Data – Is it a hype or for real?
Big Data – Is it a hype or for real?
 
International Conference on Smart Computing and Electronic Ent.docx
International Conference on Smart Computing and Electronic Ent.docxInternational Conference on Smart Computing and Electronic Ent.docx
International Conference on Smart Computing and Electronic Ent.docx
 
International Conference on Smart Computing and Electronic Ent.docx
International Conference on Smart Computing and Electronic Ent.docxInternational Conference on Smart Computing and Electronic Ent.docx
International Conference on Smart Computing and Electronic Ent.docx
 
Map Reduce in Big fata
Map Reduce in Big fataMap Reduce in Big fata
Map Reduce in Big fata
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big Data Fundamentals
Big Data FundamentalsBig Data Fundamentals
Big Data Fundamentals
 
Semantech Inc. - Mastering Enterprise Big Data - Intro
Semantech Inc. - Mastering Enterprise Big Data - IntroSemantech Inc. - Mastering Enterprise Big Data - Intro
Semantech Inc. - Mastering Enterprise Big Data - Intro
 
Introduction to Data Science.pptx
Introduction to Data Science.pptxIntroduction to Data Science.pptx
Introduction to Data Science.pptx
 
BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015
 
Introduction to Data Science.pptx
Introduction to Data Science.pptxIntroduction to Data Science.pptx
Introduction to Data Science.pptx
 
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
 
Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.
Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.
Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.
 
Ab cs of big data
Ab cs of big dataAb cs of big data
Ab cs of big data
 
Big Data and Analytics - 2016 CFO
Big Data and Analytics - 2016 CFOBig Data and Analytics - 2016 CFO
Big Data and Analytics - 2016 CFO
 
Drinking from the Fire Hose: Practical Approaches to Big Data Preparation and...
Drinking from the Fire Hose: Practical Approaches to Big Data Preparation and...Drinking from the Fire Hose: Practical Approaches to Big Data Preparation and...
Drinking from the Fire Hose: Practical Approaches to Big Data Preparation and...
 
Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez
 

Dernier

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 

Dernier (20)

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 

Introduction to Big Data

  • 1. ©2015 Slide 1 Introduction to Big Data - Satish Gopalani, Ashish Tadose, Deepak Dixit (Ampool)
  • 2. ©2015 Slide 2 What is big-data? Definition of big data and 4 V’s Some Stats Who is using Big data? Applications Intro to Hadoop and Map Reduce Coins counting analogy Typical workflow of Hadoop Big data for Statisticians. Problems with Big Data Conclusion OUTLINE
  • 3. ©2015 Slide 3 Big data refers to a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. That is, beyond current comfort levels. “Big” is relative, depending on context, amount of data and complexity of the problem. Big data is high-volume, high-velocity and/or high-variety information assets that demand cost- effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. (Gartner) Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques (Webopedia) • Multiple terabytes or petabytes • Today’s big may be tomorrow’s normal • It is relative to its context Ref: https://www.stat.wisc.edu/bigdata What is Big Data
  • 5. ©2015 Slide 5 OBAMA ADMINISTRATION UNVEILS “BIG DATA” INITIATIVE: ANNOUNCES $200 MILLION IN NEW R&D INVESTMENTS https://www.whitehouse.gov/sites/default/files/mi crosites/ostp/big_data_press_release_final_2.pd f Stats
  • 9. ©2015 Slide 9 Introduction to Hadoop and Map Reduce
  • 10. ©2015 Slide 10 • Apache Hadoop is an open-source software framework for distributed storage and distributed processing for very large data sets. • Works on computer clusters built from commodity hardware. • Popularized by Google Map Reduce paper in 2004. • Written in Java. • Has two main components: MapReduce and HDFS Hadoop
  • 11. ©2015 Slide 11 Counting coins Analogy Img src: http://thelogicalindian.com/wp-content/uploads/2015/09/Untitled-138-750x500.jpg
  • 12. ©2015 Slide 12 Anatomy of MapReduce d a c a b c a 3 b 1 c 2 a 1 b 1 c 1 a 1 c 1 a 1 a 1 1 1 b 1 c 1 1 HDFS mappers reducers HDFS
  • 13. ©2015 Slide 13 Evolution of Analytics Process
  • 17. ©2015 Slide 17 Ref: http://bit.ly/S1ma4Z Seven (7) Tips for Statisticians using Big Data
  • 18. ©2015 Slide 18 One temptation in applied statistics is to take a tool you know well (regression) and use it to hit all the nails. There is a similar temptation in big data to get fixated on a tool (hadoop, pig, hive, nosql databases, distributed computing, gpgpu, etc.) and ignore the problem of can we infer x relates to y or that x predicts y. Problem first not solution backward
  • 19. ©2015 Slide 19 Even in small data example, there can be a bug in the code used to analyze them. With big data and complex models this is even more important. Mozilla Science is doing interesting work on code review for data analysis in science. But in general if you just get a friend to look over your code it will catch a huge fraction of the problems you might have. Make your code and data available and have smart people check it
  • 20. ©2015 Slide 20 Unless you ran a randomized trial, potential confounders should keep you up at night Any time you discover a cool new result, your first thought should be, "what are the potential confounders?"
  • 21. ©2015 Slide 21 It can be easy to be tricked by the size of a data set. Imagine you have an image of a simple black circle on a white background stored as pixels. As the resolution increases the size of the data increases, but the amount of information may not. In general the bigger the sample size the better and sample size and data size aren't always tightly correlated. Know what your real sample size is.
  • 22. ©2015 Slide 22 Before you analyze your data with computers, be sure to plot it A common mistake made by amateur analysts is to immediately jump to fitting models to big data sets with the fanciest computational tool. But you can miss pretty obvious things like this if you don't plot your data.
  • 23. ©2015 Slide 23 If you want to understand a data set you have to be able to play around with it and explore it. You need to make tables, make plots, identify quirks, outliers, missing data patterns and problems with the data. To do this you need to interact with the data quickly. One way to do this is to analyze the whole data set at once using tools like Hive, Hadoop, or Pig. But an often easier, better, and more cost effective approach is to use random sampling . As Robert Gentleman put it "make big data as small as possible as quick as possible". Interactive analysis is the best way to really figure out what is going on in a data set
  • 24. ©2015 Slide 24 If the goal is prediction accuracy, average many prediction models together In general, the prediction algorithms that most frequently win Kaggle competitions or the Netflix prize blend multiple models together. The idea is that by averaging (or majority voting) multiple good prediction algorithms you can reduce variability without giving up bias.
  • 25. ©2015 Slide 25 The parable of Google Flu: traps in big data analysis Google Flu Trends: the limits of big data Eight (No, Nine!) Problems with Big Data Big Data Problems Ref: http://bit.ly/1fUzZO1