We talk a lot these days about data science: how it will pave our path with beautiful insights and reveal unexpected relations and connections within our datasets, and even across datasets.
But how do we maintain the "Science" part of "Data Science"? After some time working in this field, I appreciate more and more the critical thinking that has characterized the progress of science.
Hypothesis, facts, proving and/or disproving the thesis: this is how science has progressed over the past centuries. Popper formalized this method, categorizing as non-science all disciplines whose statements cannot be falsified. In other words, if a statement cannot be disproved, we cannot speak of science, since there is no mechanism left to verify the claim or to prove it wrong.
When that happens, the argument can still be accepted, but not scientifically accepted. Ways of accepting or refuting a non-falsifiable statement are based, for instance, on aesthetic, authority, pragmatic, or philosophical considerations: all valid, but not scientific. This applies, for instance, to statements in the disciplines of politics, theology, ethics, etc.
Science has definitely progressed since then. For instance, Bayesian networks and statistical induction are now part of the (data) scientist's arsenal. But no matter how the baseline is set, critical thinking and a rigorous method are definitely helpful in understanding the results that science produces, in particular when the work is based on large amounts of data and is computational in nature, rather than formula- or model-driven.
Data Science currently has many different connotations. On one side it praises "artistry": the genius of laying out connections between disciplines and concepts. This is a truly great trait in a scientist, and creativity is definitely very welcome in all data science profiles.
Alongside the fun of creating new insights and new data golden eggs, a data scientist has to put up with those annoying criteria of reproducibility, falsifiability, and peer review. Sometimes these elements are postponed or left behind in the name of artistry. Granted, it's just hard to find metrics and baselines for comparing models and data science solutions. But the scientific method has proven solid over the centuries: it allows factual scientific discussion between scientists, and it allows selection between models based on objective, agreed-upon criteria.
7. ● (almost) everything is a number
● A few guys came up with some good ideas:
Aristotle, Galileo, Popper,
Fisher, Pearson, Bayes
What has changed in 2500 years?
14. What about it?
The shocking truth:
1) we use these concepts every day
2) we have a pre-scientific intuition of these ideas
15. Why do we bother?
New problems are related to understanding human behavior:
understanding needs, desires, dreams, ambitions, cravings, and hopes.
Models have a great side effect:
they help us predict the future.
Three weapons:
Processing power: models become faster and can be unrolled over everybody's profile
Sources: extract more data features, use different data
Context: explore information in order to understand the person
17. How to deal with it?
Well, it’s quite simple, in a nutshell:
This is what (data) science is about:
data -> hypothesis -> validation
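As a hedged illustration of that loop, here is a minimal Python sketch; the A/B conversion data is synthetic, and the variable names and the 5% significance threshold are assumptions of the example, not something from the deck:

```python
# A minimal sketch of the data -> hypothesis -> validation loop.
# Hypothesis: variant B yields higher conversion than variant A.
# The data here is synthetic; in practice it would come from logs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# data: observed conversions (1) / non-conversions (0) per variant
conversions_a = rng.binomial(1, 0.10, size=1000)  # baseline
conversions_b = rng.binomial(1, 0.12, size=1000)  # candidate

# hypothesis -> validation: two-sample t-test on conversion rates
t_stat, p_value = stats.ttest_ind(conversions_b, conversions_a)

# Falsifiability in action: reject or fail to reject, based on data
alpha = 0.05  # significance threshold (a convention, not a law)
if p_value < alpha and t_stat > 0:
    print(f"p={p_value:.3f}: evidence that B beats A")
else:
    print(f"p={p_value:.3f}: hypothesis not supported; revise it")
```

The point is not the particular test, but that the hypothesis is stated before looking at the outcome and can be proven wrong by the data.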
18. … but what we (mostly) really do is:
Use very little data
-> apply it to pre-formulated beliefs
-> come up with some “gut feeling”
Validate it:
It didn’t work? “Well, I am still right.”
20. What’s the problem with it?
● Context
○ we could use some more data
○ insufficient feature engineering
● Add more hypotheses
○ we could explore more scenarios, “pivoting”
○ look at the problem from other angles
○ need data “artistry”
22. Big data to the rescue?
Big Data is the domain that transforms:
numbers into insights
services into experiences
23. Big data to the rescue?
by aggregating data sources
across users
across applications
across domains
24. Big data to the rescue?
in order to
provide personalized and relevant results
to the consumer of the given service
anywhere,
anytime.
25. Some small headaches
users != consumers
N = all doesn’t mean you don’t need to clean it (a cleaning sketch follows this list)
Not all data is born equal
you don’t know what you don’t know
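A minimal pandas sketch of those cleaning headaches; the column names (user_id, is_bot, revenue) and the bot filter are hypothetical:

```python
# Sketch: even "all the data" needs cleaning before analysis.
# Column names and values are hypothetical placeholders.
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 4],
    "is_bot":  [False, False, True, False, False],
    "revenue": [10.0, 10.0, 0.0, None, 5.0],
})

df = df.drop_duplicates()            # duplicate log lines happen
df = df[~df["is_bot"]]               # users != consumers: drop bots
df = df.dropna(subset=["revenue"])   # you don't know what's missing

print(df)
```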
27. Some small headaches
Tough to inspect big data.
Tough to reason about big data.
representativity/bias, support, and segmentation
signal-to-noise ratio:
look at GFT (Google Flu Trends), for instance
29. Diminishing returns
Most models were already pretty good after a few weeks;
the winner added just about 5% more
after one year, with a 300-model ensemble.
Moral: move on, get a new angle.
30. How to compare?
You know the answer (supervised methods; sketched below):
confusion matrix
ROC (Receiver Operating Characteristic)
Mean Squared Error (MSE)
You don’t know the answer (unsupervised methods):
objective function
access to ground truth
A/B testing
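A brief scikit-learn sketch of the supervised yardsticks just listed; the labels and scores are toy values invented for illustration:

```python
# Supervised comparison metrics on toy data (labels/scores invented).
from sklearn.metrics import confusion_matrix, roc_auc_score, mean_squared_error

y_true   = [0, 0, 1, 1, 1, 0, 1, 0]                  # known answers
y_pred   = [0, 1, 1, 1, 0, 0, 1, 0]                  # hard predictions
y_scores = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1, 0.7, 0.3]  # model scores

print(confusion_matrix(y_true, y_pred))      # TN/FP/FN/TP table
print(roc_auc_score(y_true, y_scores))       # area under the ROC curve
print(mean_squared_error(y_true, y_scores))  # MSE of the scores
```

Whichever metric you pick, the point is that two models can be put on the same bench and compared on agreed numbers rather than taste.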
32. Beware the modeling risks
Overfitting the training data (sketched below)
Not enough “support” in the population
Not enough features available/discovered
A poorly defined objective function
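A small sketch of the first risk, assuming scikit-learn and synthetic data: an unconstrained decision tree memorizes the training set while the held-out score lags behind.

```python
# Overfitting demo: near-perfect train score, weaker held-out score.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(max_depth=None)  # unconstrained tree
model.fit(X_tr, y_tr)

print("train:", model.score(X_tr, y_tr))  # ~1.0: memorized the data
print("test: ", model.score(X_te, y_te))  # noticeably lower
```

When the train score is near-perfect while the test score lags, regularization or cross-validation (see slide 39) is in order.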
34. Objective functions
Many want a slice of the cake when it comes to objective functions (a toy sketch follows this list):
● what the user wants
● what the community wants
● what marketing wants
● what business wants
● what finance/monetization wants
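One hedged way to picture that cake-slicing is a weighted composite objective; the stakeholder metrics and weights below are purely hypothetical placeholders:

```python
# Sketch: a composite objective balancing stakeholder interests.
# All metric names, values, and weights are hypothetical.
def objective(metrics: dict, weights: dict) -> float:
    """Weighted sum of per-stakeholder metrics (each in [0, 1])."""
    return sum(weights[k] * metrics[k] for k in weights)

metrics = {"user_satisfaction": 0.80,   # what the user wants
           "community_health":  0.70,   # what the community wants
           "campaign_reach":    0.55,   # what marketing wants
           "revenue_lift":      0.40}   # what business/finance wants

weights = {"user_satisfaction": 0.4, "community_health": 0.2,
           "campaign_reach": 0.2, "revenue_lift": 0.2}

print(f"composite objective: {objective(metrics, weights):.2f}")
```

The weights are where the negotiation happens; making them explicit at least turns a turf war into a documented, revisable decision.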
35. Data scientists
Data artists, data analysts, data scientists, data engineers:
confirmatory analysis: domain knowledge, statisticians, and data analysts
exploratory analysis: data artists/scientists
operational analysis: data engineers, data technologists
38. What do we look for in the haystack?
outliers: indicators and/or noise
groups: similarity metrics, PCA, SVD (sketched below)
Big data as a pragmatic approach to:
cheap storage
distributed computing
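A minimal sketch of the grouping and outlier hunt, assuming scikit-learn; the synthetic clusters and the 3-sigma threshold are illustrative assumptions:

```python
# Sketch: finding groups with PCA and flagging outliers by distance.
# Synthetic data; the threshold is an illustrative assumption.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (100, 5)),   # one "group"
               rng.normal(5, 1, (100, 5)),   # another group
               rng.normal(0, 8, (5, 5))])    # a few stray points

Z = PCA(n_components=2).fit_transform(X)     # project via PCA (SVD inside)

# Flag points far from the projected cloud's center as outliers;
# whether they are indicators or noise is a judgment call.
dist = np.linalg.norm(Z - Z.mean(axis=0), axis=1)
outliers = dist > dist.mean() + 3 * dist.std()
print(f"{outliers.sum()} candidate outliers out of {len(X)} points")
```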
39. How to enjoy and compare data science?
enjoy the artistry
appreciate the genius
cross-validate
avoid falling into the trap of over-fitted models
define a baseline
avoid qualitative methods
define a metric, put the models to the bench, compare results (see the sketch below)
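A compact sketch of "define a baseline, define a metric, put the models to the bench", assuming scikit-learn; the dataset is synthetic and both models are stand-ins:

```python
# Bench comparison: a dummy baseline vs. a real model, same metric,
# same cross-validation folds. Data is synthetic for illustration.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

for name, model in [("baseline", DummyClassifier(strategy="most_frequent")),
                    ("logreg",   LogisticRegression(max_iter=1000))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean ROC AUC = {scores.mean():.3f}")
```

A model that cannot beat the dummy baseline on an agreed metric has no scientific claim to deployment, however elegant its artistry.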
40. Parallelism, Mathematics, Programming Languages, Machine Learning, Statistics, Big Data, Algorithms, Cloud Computing
Natalino Busa
@natalinobusa
www.natalinobusa.com
Thanks!
Any questions?