SlideShare une entreprise Scribd logo
1  sur  62
Marko Grobelnik
Jozef Stefan Institute, Slovenia
Sao Carlos, July 15th 2013
 Big-Data in numbers
 Big-Data Definitions
 Motivation
 State of Market
 Techniques
 Tools
 Data Science
 Concluding remarks
http://www.go-gulf.com/blog/online-time/
http://www.go-gulf.com/blog/online-time/
http://www.go-gulf.com/blog/online-time/
http://www.go-gulf.com/blog/online-time/
http://www.go-gulf.com/blog/online-time/
http://www.go-gulf.com/blog/online-time/
http://www.go-gulf.com/blog/online-time/
http://www.go-gulf.com/blog/online-time/
 „Big-data‟ is similar to „Small-data‟, but bigger
 …but having data bigger it requires different
approaches:
◦ techniques, tools, architectures
 …with an aim to solve new problems
◦ …or old problems in a better way.
 Volume –
challenging to load
and process (how to
index, retrieve)
 Variety – different
data types and
degree of structure
(how to query semi-
structured data)
 Velocity – real-time
processing
influenced by rate of
data arrival
From “Understanding Big Data” by IBM
 1. Volume (lots of data = “Tonnabytes”)
 2. Variety (complexity, curse of
dimensionality)
 3. Velocity (rate of data and information flow)
 4. Veracity (verifying inference-based models
from comprehensive data collections)
 5. Variability
 6. Venue (location)
 7. Vocabulary (semantics)
Comparing volume of “big data” and “data mining” queries
…adding “web 2.0” to “big data” and “data mining” queries volume
Big-Data
 Key enablers for the appearance and growth
of “Big Data” are:
◦ Increase of storage capacities
◦ Increase of processing power
◦ Availability of data
Source: WikiBon report on “Big Data Vendor Revenue and Market Forecast 2012-2017”, 2013
 …when the operations on data are complex:
◦ …e.g. simple counting is not a complex problem
◦ Modeling and reasoning with data of different kinds
can get extremely complex
 Good news about big-data:
◦ Often, because of vast amount of data, modeling
techniques can get simpler (e.g. smart counting can
replace complex model-based analytics)…
◦ …as long as we deal with the scale
 Research areas (such
as IR, KDD, ML, NLP,
SemWeb, …) are sub-
cubes within the data
cube
Scalability
Streaming
Context
Quality
Usage
 A risk with “Big-Data mining” is that an
analyst can “discover” patterns that are
meaningless
 Statisticians call it Bonferroni‟s principle:
◦ Roughly, if you look in more places for interesting
patterns than your amount of data will support, you
are bound to find crap
Example:
 We want to find (unrelated) people who at least twice
have stayed at the same hotel on the same day
◦ 109 people being tracked.
◦ 1000 days.
◦ Each person stays in a hotel 1% of the time (1 day out of 100)
◦ Hotels hold 100 people (so 105 hotels).
◦ If everyone behaves randomly (i.e., no terrorists) will the data
mining detect anything suspicious?
 Expected number of “suspicious” pairs of people:
◦ 250,000
◦ … too many combinations to check – we need to have some
additional evidence to find “suspicious” pairs of people in
some more efficient way
Example taken from: Rajaraman, Ullman: Mining of Massive Datasets
 Smart sampling of data
◦ …reducing the original data while not losing the
statistical properties of data
 Finding similar items
◦ …efficient multidimensional indexing
 Incremental updating of the models
◦ (vs. building models from scratch)
◦ …crucial for streaming data
 Distributed linear algebra
◦ …dealing with large sparse matrices
 On the top of the previous ops we perform
usual data mining/machine learning/statistics
operators:
◦ Supervised learning (classification, regression, …)
◦ Non-supervised learning (clustering, different types
of decompositions, …)
◦ …
 …we are just more careful which algorithms
we choose (typically linear or sub-linear
versions)
 An excellent overview of the algorithms
covering the above issues is the book
“Rajaraman, Leskovec, Ullman: Mining of
Massive Datasets”
 Downloadable from:
http://infolab.stanford.edu/~ullman/mmds.html
 Where processing is hosted?
◦ Distributed Servers / Cloud (e.g. Amazon EC2)
 Where data is stored?
◦ Distributed Storage (e.g. Amazon S3)
 What is the programming model?
◦ Distributed Processing (e.g. MapReduce)
 How data is stored & indexed?
◦ High-performance schema-free databases (e.g.
MongoDB)
 What operations are performed on data?
◦ Analytic / Semantic Processing (e.g. R, OWLIM)
 Computing and storage are typically hosted
transparently on cloud infrastructures
◦ …providing scale, flexibility and high fail-safety
 Distributed Servers
◦ Amazon-EC2, Google App Engine, Elastic,
Beanstalk, Heroku
 Distributed Storage
◦ Amazon-S3, Hadoop Distributed File System
 Distributed processing of Big-Data requires non-
standard programming models
◦ …beyond single machines or traditional parallel
programming models (like MPI)
◦ …the aim is to simplify complex programming tasks
 The most popular programming model is
MapReduce approach
 Implementations of MapReduce
◦ Hadoop (http://hadoop.apache.org/), Hive, Pig,
Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu,
Flume, Kafka, Azkaban, Oozie, Greenplum
 The key idea of the MapReduce approach:
◦ A target problem needs to be parallelizable
◦ First, the problem gets split into a set of smaller problems (Map step)
◦ Next, smaller problems are solved in a parallel way
◦ Finally, a set of solutions to the smaller problems get synthesized
into a solution of the original problem (Reduce step)
 NoSQL class of databases have in common:
◦ To support large amounts of data
◦ Have mostly non-SQL interface
◦ Operate on distributed infrastructures (e.g. Hadoop)
◦ Are based on key-value pairs (no predefined schema)
◦ …are flexible and fast
 Implementations
◦ MongoDB, CouchDB, Cassandra, Redis, BigTable, Hbase,
Hypertable, Voldemort, Riak, ZooKeeper…
 Interdisciplinary field using
techniques and theories from many
fields, including math, statistics, data
engineering, pattern recognition and
learning, advanced computing,
visualization, uncertainty modeling,
data warehousing, and high
performance computing with the goal
of extracting meaning from data and
creating data products.
 Data science is a novel term that is
often used interchangeably with
competitive intelligence or business
analytics, although it is becoming
more common.
 Data science seeks to use all available
and relevant data to effectively tell a
story that can be easily understood by
non-practitioners.
http://en.wikipedia.org/wiki/Data_science
http://blog.revolutionanalytics.com/data-science/
http://blog.revolutionanalytics.com/data-science/
Analyzing the Analyzers
An Introspective Survey of Data Scientists and Their Work
By Harlan Harris, Sean Murphy, Marck Vaisman
Publisher: O'Reilly Media
Released: June 2013
An Introduction to Data
Jeffrey Stanton, Syracuse University School of Information Studies
Downloadable from http://jsresearch.net/wiki/projects/teachdatascience
Released: February 2013
 Big-Data is everywhere, we are just not used to
deal with it
 The “Big-Data” hype is very recent
◦ …growth seems to be going up
◦ …evident lack of experts to build Big-Data apps
 Can we do “Big-Data” without big investment?
◦ …yes – many open source tools, computing machinery is
cheap (to buy or to rent)
◦ …the key is knowledge on how to deal with data
◦ …data is either free (e.g. Wikipedia) or to buy (e.g.
twitter)
 Analysis of big social network
 Recommender service on a big scale
 How to profile people from web data?
 Complex data visualization
 How to track information across languages
 Putting deep semantics into the picture

Contenu connexe

Tendances

Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big datahktripathy
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data StackZubair Nabi
 
Introduction of big data unit 1
Introduction of big data unit 1Introduction of big data unit 1
Introduction of big data unit 1RojaT4
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data ScienceKenny Daniel
 
Lecture6 introduction to data streams
Lecture6 introduction to data streamsLecture6 introduction to data streams
Lecture6 introduction to data streamshktripathy
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data HadoopApache Apex
 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTAmrit Chhetri
 
Big Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopBig Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopRojaT4
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataVipin Batra
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big AnalyticsAjay Ohri
 
Big data technology unit 3
Big data technology unit 3Big data technology unit 3
Big data technology unit 3RojaT4
 
Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)SahilRaina21
 
Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big DataIndu Khemchandani
 
Data Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesData Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesKathirvel Ayyaswamy
 
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...BigMine
 
Mongo Internal Training session by Soner Altin
Mongo Internal Training session by Soner AltinMongo Internal Training session by Soner Altin
Mongo Internal Training session by Soner Altinmustafa sarac
 
Database revolution opening webcast 01 18-12
Database revolution opening webcast 01 18-12Database revolution opening webcast 01 18-12
Database revolution opening webcast 01 18-12mark madsen
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data AnalyticsS P Sajjan
 

Tendances (20)

Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data Stack
 
Introduction of big data unit 1
Introduction of big data unit 1Introduction of big data unit 1
Introduction of big data unit 1
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data Science
 
Lecture6 introduction to data streams
Lecture6 introduction to data streamsLecture6 introduction to data streams
Lecture6 introduction to data streams
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data Hadoop
 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRT
 
Big Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopBig Data Unit 4 - Hadoop
Big Data Unit 4 - Hadoop
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big Analytics
 
Big data technology unit 3
Big data technology unit 3Big data technology unit 3
Big data technology unit 3
 
Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)
 
Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big Data
 
Data Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesData Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research Opportunities
 
Big Data: an introduction
Big Data: an introductionBig Data: an introduction
Big Data: an introduction
 
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
 
Mongo Internal Training session by Soner Altin
Mongo Internal Training session by Soner AltinMongo Internal Training session by Soner Altin
Mongo Internal Training session by Soner Altin
 
Database revolution opening webcast 01 18-12
Database revolution opening webcast 01 18-12Database revolution opening webcast 01 18-12
Database revolution opening webcast 01 18-12
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data Analytics
 

En vedette

Big Data, Big Opportunity: A Primer for Understanding The Big Data Frontier
Big Data, Big Opportunity: A Primer for Understanding The Big Data FrontierBig Data, Big Opportunity: A Primer for Understanding The Big Data Frontier
Big Data, Big Opportunity: A Primer for Understanding The Big Data FrontierCA Technologies
 
Systems of Intelligence - Wikibon/theCUBE
Systems of Intelligence - Wikibon/theCUBESystems of Intelligence - Wikibon/theCUBE
Systems of Intelligence - Wikibon/theCUBEGeorge Gilbert
 
Jeff Kelly, Wikibon Slides; Big Data Summit 2015
Jeff Kelly, Wikibon Slides; Big Data Summit 2015Jeff Kelly, Wikibon Slides; Big Data Summit 2015
Jeff Kelly, Wikibon Slides; Big Data Summit 2015MassTLC
 
Innomantra - Intellectual Property Consulting & Services
Innomantra - Intellectual Property Consulting & ServicesInnomantra - Intellectual Property Consulting & Services
Innomantra - Intellectual Property Consulting & ServicesInnomantra
 
The Nature of the Future: The Socialstructed World (Marina Gorbis Keynote)
The Nature of the Future: The Socialstructed World (Marina Gorbis Keynote)The Nature of the Future: The Socialstructed World (Marina Gorbis Keynote)
The Nature of the Future: The Socialstructed World (Marina Gorbis Keynote)Institute for the Future
 
Opening an Account for Innovation in Banking Industry
Opening an Account for Innovation in Banking IndustryOpening an Account for Innovation in Banking Industry
Opening an Account for Innovation in Banking IndustryInnomantra
 
Half Day Signal Sharing Workshop Process and Templates
Half Day Signal Sharing Workshop Process and TemplatesHalf Day Signal Sharing Workshop Process and Templates
Half Day Signal Sharing Workshop Process and TemplatesInstitute for the Future
 
EdVard munch presentazione
EdVard munch presentazioneEdVard munch presentazione
EdVard munch presentazioneserepisa
 
Seo thời tam quốc
Seo thời tam quốcSeo thời tam quốc
Seo thời tam quốcLong Hacki
 
ALICT opening presentation 9.10.12, African Union Headquarters
ALICT opening presentation 9.10.12, African Union HeadquartersALICT opening presentation 9.10.12, African Union Headquarters
ALICT opening presentation 9.10.12, African Union HeadquartersInstitute for the Future
 
All Of The Above
All Of The AboveAll Of The Above
All Of The Abovekmurray230
 

En vedette (20)

Big Data, Big Opportunity: A Primer for Understanding The Big Data Frontier
Big Data, Big Opportunity: A Primer for Understanding The Big Data FrontierBig Data, Big Opportunity: A Primer for Understanding The Big Data Frontier
Big Data, Big Opportunity: A Primer for Understanding The Big Data Frontier
 
Systems of Intelligence - Wikibon/theCUBE
Systems of Intelligence - Wikibon/theCUBESystems of Intelligence - Wikibon/theCUBE
Systems of Intelligence - Wikibon/theCUBE
 
Jeff Kelly, Wikibon Slides; Big Data Summit 2015
Jeff Kelly, Wikibon Slides; Big Data Summit 2015Jeff Kelly, Wikibon Slides; Big Data Summit 2015
Jeff Kelly, Wikibon Slides; Big Data Summit 2015
 
Big Data
Big DataBig Data
Big Data
 
Innomantra - Intellectual Property Consulting & Services
Innomantra - Intellectual Property Consulting & ServicesInnomantra - Intellectual Property Consulting & Services
Innomantra - Intellectual Property Consulting & Services
 
Issues in the conservation of photographs
Issues in the conservation of photographsIssues in the conservation of photographs
Issues in the conservation of photographs
 
The Nature of the Future: The Socialstructed World (Marina Gorbis Keynote)
The Nature of the Future: The Socialstructed World (Marina Gorbis Keynote)The Nature of the Future: The Socialstructed World (Marina Gorbis Keynote)
The Nature of the Future: The Socialstructed World (Marina Gorbis Keynote)
 
A&m assessment help
A&m assessment helpA&m assessment help
A&m assessment help
 
Final Assignment
Final AssignmentFinal Assignment
Final Assignment
 
EzMate 401 Arise Biotech
EzMate 401 Arise BiotechEzMate 401 Arise Biotech
EzMate 401 Arise Biotech
 
Cuidando al Cuidador
Cuidando al CuidadorCuidando al Cuidador
Cuidando al Cuidador
 
Cámbiate Transvulcania
Cámbiate TransvulcaniaCámbiate Transvulcania
Cámbiate Transvulcania
 
Opening an Account for Innovation in Banking Industry
Opening an Account for Innovation in Banking IndustryOpening an Account for Innovation in Banking Industry
Opening an Account for Innovation in Banking Industry
 
Half Day Signal Sharing Workshop Process and Templates
Half Day Signal Sharing Workshop Process and TemplatesHalf Day Signal Sharing Workshop Process and Templates
Half Day Signal Sharing Workshop Process and Templates
 
3d Views Portfolio
3d Views Portfolio3d Views Portfolio
3d Views Portfolio
 
EdVard munch presentazione
EdVard munch presentazioneEdVard munch presentazione
EdVard munch presentazione
 
Seo thời tam quốc
Seo thời tam quốcSeo thời tam quốc
Seo thời tam quốc
 
ALICT opening presentation 9.10.12, African Union Headquarters
ALICT opening presentation 9.10.12, African Union HeadquartersALICT opening presentation 9.10.12, African Union Headquarters
ALICT opening presentation 9.10.12, African Union Headquarters
 
¿Querer es poder?
¿Querer es poder?¿Querer es poder?
¿Querer es poder?
 
All Of The Above
All Of The AboveAll Of The Above
All Of The Above
 

Similaire à Big Data Analytics V2

ESWC SS 2013 - Wednesday Tutorial Marko Grobelnik: Introduction to Big Data A...
ESWC SS 2013 - Wednesday Tutorial Marko Grobelnik: Introduction to Big Data A...ESWC SS 2013 - Wednesday Tutorial Marko Grobelnik: Introduction to Big Data A...
ESWC SS 2013 - Wednesday Tutorial Marko Grobelnik: Introduction to Big Data A...eswcsummerschool
 
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data TutorialESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorialeswcsummerschool
 
Big data analytics 1
Big data analytics 1Big data analytics 1
Big data analytics 1gauravsc36
 
Week-1-Introduction to Data Mining.pptx
Week-1-Introduction to Data Mining.pptxWeek-1-Introduction to Data Mining.pptx
Week-1-Introduction to Data Mining.pptxTake1As
 
Introduction to-data-mining chapter 1
Introduction to-data-mining  chapter 1Introduction to-data-mining  chapter 1
Introduction to-data-mining chapter 1Mahmoud Alfarra
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera, Inc.
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesKrishna Sankar
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactDr. Sunil Kr. Pandey
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptxElsonPaul2
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkKrishna Sankar
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera, Inc.
 
Delivering Security Insights with Data Analytics and Visualization
Delivering Security Insights with Data Analytics and VisualizationDelivering Security Insights with Data Analytics and Visualization
Delivering Security Insights with Data Analytics and VisualizationRaffael Marty
 
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...Thomas Rones
 
Data Science in the Real World: Making a Difference
Data Science in the Real World: Making a Difference Data Science in the Real World: Making a Difference
Data Science in the Real World: Making a Difference Srinath Perera
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.
 

Similaire à Big Data Analytics V2 (20)

ESWC SS 2013 - Wednesday Tutorial Marko Grobelnik: Introduction to Big Data A...
ESWC SS 2013 - Wednesday Tutorial Marko Grobelnik: Introduction to Big Data A...ESWC SS 2013 - Wednesday Tutorial Marko Grobelnik: Introduction to Big Data A...
ESWC SS 2013 - Wednesday Tutorial Marko Grobelnik: Introduction to Big Data A...
 
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data TutorialESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
 
Big data analytics 1
Big data analytics 1Big data analytics 1
Big data analytics 1
 
Week-1-Introduction to Data Mining.pptx
Week-1-Introduction to Data Mining.pptxWeek-1-Introduction to Data Mining.pptx
Week-1-Introduction to Data Mining.pptx
 
Introduction to-data-mining chapter 1
Introduction to-data-mining  chapter 1Introduction to-data-mining  chapter 1
Introduction to-data-mining chapter 1
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
 
Zarneger "Supporting AI: Best Practices for Content Delivery Platforms"
Zarneger "Supporting AI: Best Practices for Content Delivery Platforms"Zarneger "Supporting AI: Best Practices for Content Delivery Platforms"
Zarneger "Supporting AI: Best Practices for Content Delivery Platforms"
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
 
Delivering Security Insights with Data Analytics and Visualization
Delivering Security Insights with Data Analytics and VisualizationDelivering Security Insights with Data Analytics and Visualization
Delivering Security Insights with Data Analytics and Visualization
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
 
Data Science in the Real World: Making a Difference
Data Science in the Real World: Making a Difference Data Science in the Real World: Making a Difference
Data Science in the Real World: Making a Difference
 
Spark
SparkSpark
Spark
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 

Dernier

Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 

Dernier (20)

Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 

Big Data Analytics V2

  • 1. Marko Grobelnik Jozef Stefan Institute, Slovenia Sao Carlos, July 15th 2013
  • 2.  Big-Data in numbers  Big-Data Definitions  Motivation  State of Market  Techniques  Tools  Data Science  Concluding remarks
  • 3.
  • 4.
  • 5.
  • 14.
  • 15.  „Big-data‟ is similar to „Small-data‟, but bigger  …but having data bigger it requires different approaches: ◦ techniques, tools, architectures  …with an aim to solve new problems ◦ …or old problems in a better way.
  • 16.  Volume – challenging to load and process (how to index, retrieve)  Variety – different data types and degree of structure (how to query semi- structured data)  Velocity – real-time processing influenced by rate of data arrival From “Understanding Big Data” by IBM
  • 17.  1. Volume (lots of data = “Tonnabytes”)  2. Variety (complexity, curse of dimensionality)  3. Velocity (rate of data and information flow)  4. Veracity (verifying inference-based models from comprehensive data collections)  5. Variability  6. Venue (location)  7. Vocabulary (semantics)
  • 18. Comparing volume of “big data” and “data mining” queries
  • 19. …adding “web 2.0” to “big data” and “data mining” queries volume
  • 20.
  • 22.
  • 23.  Key enablers for the appearance and growth of “Big Data” are: ◦ Increase of storage capacities ◦ Increase of processing power ◦ Availability of data
  • 24.
  • 25.
  • 26.
  • 27.
  • 28.
  • 29.
  • 30.
  • 31.
  • 32.
  • 33.
  • 34. Source: WikiBon report on “Big Data Vendor Revenue and Market Forecast 2012-2017”, 2013
  • 35.
  • 36.
  • 37.
  • 38.  …when the operations on data are complex: ◦ …e.g. simple counting is not a complex problem ◦ Modeling and reasoning with data of different kinds can get extremely complex  Good news about big-data: ◦ Often, because of vast amount of data, modeling techniques can get simpler (e.g. smart counting can replace complex model-based analytics)… ◦ …as long as we deal with the scale
  • 39.  Research areas (such as IR, KDD, ML, NLP, SemWeb, …) are sub- cubes within the data cube Scalability Streaming Context Quality Usage
  • 40.  A risk with “Big-Data mining” is that an analyst can “discover” patterns that are meaningless  Statisticians call it Bonferroni‟s principle: ◦ Roughly, if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap
  • 41. Example:  We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day ◦ 109 people being tracked. ◦ 1000 days. ◦ Each person stays in a hotel 1% of the time (1 day out of 100) ◦ Hotels hold 100 people (so 105 hotels). ◦ If everyone behaves randomly (i.e., no terrorists) will the data mining detect anything suspicious?  Expected number of “suspicious” pairs of people: ◦ 250,000 ◦ … too many combinations to check – we need to have some additional evidence to find “suspicious” pairs of people in some more efficient way Example taken from: Rajaraman, Ullman: Mining of Massive Datasets
  • 42.  Smart sampling of data ◦ …reducing the original data while not losing the statistical properties of data  Finding similar items ◦ …efficient multidimensional indexing  Incremental updating of the models ◦ (vs. building models from scratch) ◦ …crucial for streaming data  Distributed linear algebra ◦ …dealing with large sparse matrices
  • 43.  On the top of the previous ops we perform usual data mining/machine learning/statistics operators: ◦ Supervised learning (classification, regression, …) ◦ Non-supervised learning (clustering, different types of decompositions, …) ◦ …  …we are just more careful which algorithms we choose (typically linear or sub-linear versions)
  • 44.  An excellent overview of the algorithms covering the above issues is the book “Rajaraman, Leskovec, Ullman: Mining of Massive Datasets”  Downloadable from: http://infolab.stanford.edu/~ullman/mmds.html
  • 45.
  • 46.  Where processing is hosted? ◦ Distributed Servers / Cloud (e.g. Amazon EC2)  Where data is stored? ◦ Distributed Storage (e.g. Amazon S3)  What is the programming model? ◦ Distributed Processing (e.g. MapReduce)  How data is stored & indexed? ◦ High-performance schema-free databases (e.g. MongoDB)  What operations are performed on data? ◦ Analytic / Semantic Processing (e.g. R, OWLIM)
  • 47.
  • 48.  Computing and storage are typically hosted transparently on cloud infrastructures ◦ …providing scale, flexibility and high fail-safety  Distributed Servers ◦ Amazon-EC2, Google App Engine, Elastic, Beanstalk, Heroku  Distributed Storage ◦ Amazon-S3, Hadoop Distributed File System
  • 49.  Distributed processing of Big-Data requires non- standard programming models ◦ …beyond single machines or traditional parallel programming models (like MPI) ◦ …the aim is to simplify complex programming tasks  The most popular programming model is MapReduce approach  Implementations of MapReduce ◦ Hadoop (http://hadoop.apache.org/), Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
  • 50.  The key idea of the MapReduce approach: ◦ A target problem needs to be parallelizable ◦ First, the problem gets split into a set of smaller problems (Map step) ◦ Next, smaller problems are solved in a parallel way ◦ Finally, a set of solutions to the smaller problems get synthesized into a solution of the original problem (Reduce step)
  • 51.  NoSQL class of databases have in common: ◦ To support large amounts of data ◦ Have mostly non-SQL interface ◦ Operate on distributed infrastructures (e.g. Hadoop) ◦ Are based on key-value pairs (no predefined schema) ◦ …are flexible and fast  Implementations ◦ MongoDB, CouchDB, Cassandra, Redis, BigTable, Hbase, Hypertable, Voldemort, Riak, ZooKeeper…
  • 52.
  • 53.  Interdisciplinary field using techniques and theories from many fields, including math, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products.  Data science is a novel term that is often used interchangeably with competitive intelligence or business analytics, although it is becoming more common.  Data science seeks to use all available and relevant data to effectively tell a story that can be easily understood by non-practitioners. http://en.wikipedia.org/wiki/Data_science
  • 56. Analyzing the Analyzers An Introspective Survey of Data Scientists and Their Work By Harlan Harris, Sean Murphy, Marck Vaisman Publisher: O'Reilly Media Released: June 2013 An Introduction to Data Jeffrey Stanton, Syracuse University School of Information Studies Downloadable from http://jsresearch.net/wiki/projects/teachdatascience Released: February 2013
  • 57.
  • 58.
  • 59.
  • 60.
  • 61.  Big-Data is everywhere, we are just not used to deal with it  The “Big-Data” hype is very recent ◦ …growth seems to be going up ◦ …evident lack of experts to build Big-Data apps  Can we do “Big-Data” without big investment? ◦ …yes – many open source tools, computing machinery is cheap (to buy or to rent) ◦ …the key is knowledge on how to deal with data ◦ …data is either free (e.g. Wikipedia) or to buy (e.g. twitter)
  • 62.  Analysis of big social network  Recommender service on a big scale  How to profile people from web data?  Complex data visualization  How to track information across languages  Putting deep semantics into the picture