Joel Grus
Seattle DAML Meetup
June 23, 2015
Data Science from Scratch
About me
Old-school DAML-er
Wrote a book ---------->
SWE at Google
Formerly data science at
VoloMetrix, Decide,
Farecast
The Road to
Data Science
The Road to
Data Science
My
Grad School
Fareology
Data Science Is A Broad Field
Some Stuff
More
Stuff
Even
More
Stuff
Data
Science
People who think they're
data scientists, but they're
not really data scientists
People who are a danger
to everyone around them
People who say
"machine learnings"
a data scientist should be able to
JOEL GRUS
a data scientist should be able to
run a regression,
JOEL GRUS
a data scientist should be able to
run a regression, write a sql query,
JOEL GRUS
a data scientist should be able to
run a regression, write a sql query, scrape a web
site,
JOEL GRUS
a data scientist should be able to
run a regression, write a sql query, scrape a web
site, design an experiment,
JOEL GRUS
a data scientist should be able to
run a regression, write a sql query, scrape a web
site, design an experiment, factor matrices,
JOEL GRUS
a data scientist should be able to
run a regression, write a sql query, scrape a web
site, design an experiment, factor matrices, use a
data frame,
JOEL GRUS
a data scientist should be able to
run a regression, write a sql query, scrape a web
site, design an experiment, factor matrices, use a
data frame, pretend to understand deep learning,
JOEL GRUS
a data scientist should be able to
run a regression, write a sql query, scrape a web
site, design an experiment, factor matrices, use a
data frame, pretend to understand deep learning,
steal from the d3 gallery,
JOEL GRUS
a data scientist should be able to
run a regression, write a sql query, scrape a web
site, design an experiment, factor matrices, use a
data frame, pretend to understand deep learning,
steal from the d3 gallery, argue r versus python,
JOEL GRUS
a data scientist should be able to
run a regression, write a sql query, scrape a web
site, design an experiment, factor matrices, use a
data frame, pretend to understand deep learning,
steal from the d3 gallery, argue r versus python,
think in mapreduce,
JOEL GRUS
a data scientist should be able to
run a regression, write a sql query, scrape a web
site, design an experiment, factor matrices, use a
data frame, pretend to understand deep learning,
steal from the d3 gallery, argue r versus python,
think in mapreduce, update a prior,
JOEL GRUS
a data scientist should be able to
run a regression, write a sql query, scrape a web
site, design an experiment, factor matrices, use a
data frame, pretend to understand deep learning,
steal from the d3 gallery, argue r versus python,
think in mapreduce, update a prior, build a
dashboard,
JOEL GRUS
a data scientist should be able to
run a regression, write a sql query, scrape a web
site, design an experiment, factor matrices, use a
data frame, pretend to understand deep learning,
steal from the d3 gallery, argue r versus python,
think in mapreduce, update a prior, build a
dashboard, clean up messy data,
JOEL GRUS
a data scientist should be able to
run a regression, write a sql query, scrape a web
site, design an experiment, factor matrices, use a
data frame, pretend to understand deep learning,
steal from the d3 gallery, argue r versus python,
think in mapreduce, update a prior, build a
dashboard, clean up messy data, test a hypothesis,
JOEL GRUS
a data scientist should be able to
run a regression, write a sql query, scrape a web
site, design an experiment, factor matrices, use a
data frame, pretend to understand deep learning,
steal from the d3 gallery, argue r versus python,
think in mapreduce, update a prior, build a
dashboard, clean up messy data, test a hypothesis,
talk to a businessperson,
JOEL GRUS
a data scientist should be able to
run a regression, write a sql query, scrape a web
site, design an experiment, factor matrices, use a
data frame, pretend to understand deep learning,
steal from the d3 gallery, argue r versus python,
think in mapreduce, update a prior, build a
dashboard, clean up messy data, test a hypothesis,
talk to a businessperson, script a shell,
JOEL GRUS
a data scientist should be able to
run a regression, write a sql query, scrape a web
site, design an experiment, factor matrices, use a
data frame, pretend to understand deep learning,
steal from the d3 gallery, argue r versus python,
think in mapreduce, update a prior, build a
dashboard, clean up messy data, test a hypothesis,
talk to a businessperson, script a shell, code on a
whiteboard,
JOEL GRUS
a data scientist should be able to
run a regression, write a sql query, scrape a web
site, design an experiment, factor matrices, use a
data frame, pretend to understand deep learning,
steal from the d3 gallery, argue r versus python,
think in mapreduce, update a prior, build a
dashboard, clean up messy data, test a hypothesis,
talk to a businessperson, script a shell, code on a
whiteboard, hack a p-value,
JOEL GRUS
a data scientist should be able to
run a regression, write a sql query, scrape a web
site, design an experiment, factor matrices, use a
data frame, pretend to understand deep learning,
steal from the d3 gallery, argue r versus python,
think in mapreduce, update a prior, build a
dashboard, clean up messy data, test a hypothesis,
talk to a businessperson, script a shell, code on a
whiteboard, hack a p-value, machine-learn a model.
JOEL GRUS
a data scientist should be able to
run a regression, write a sql query, scrape a web
site, design an experiment, factor matrices, use a
data frame, pretend to understand deep learning,
steal from the d3 gallery, argue r versus python,
think in mapreduce, update a prior, build a
dashboard, clean up messy data, test a hypothesis,
talk to a businessperson, script a shell, code on a
whiteboard, hack a p-value, machine-learn a model.
specialization is for engineers.
JOEL GRUS
A lot of stuff!
What Are Hiring Managers Looking For?
What Are Hiring Managers Looking For?
Let's check LinkedIn
a data scientist should be able to
run a regression, write a sql query, scrape a web
site, design an experiment, factor matrices, use a
data frame, pretend to understand deep learning,
steal from the d3 gallery, argue r versus python,
think in mapreduce, update a prior, build a
dashboard, clean up messy data, test a hypothesis,
talk to a businessperson, script a shell, code on a
whiteboard, hack a p-value, machine-learn a model.
specialization is for engineers.
JOEL GRUS
grad students!
Learning Data Science
I want to be a
data scientist. Great!
The Math Way
I like to start with
matrix
decompositions.
How's your
measure theory?
The Math Way
The Good:
Solid foundation
Math is the noblest
known pursuit
The Math Way
The Good:
Solid foundation
Math is the noblest
known pursuit
The Bad:
Some weirdos don't
think math is fun
Can be pretty
forbidding
Can miss practical
skills
So, did you
count the
words in that
document?
No, but I have an
elegant proof
that the number
of words is finite!
OK, Let's Try Again
I want to be a
data scientist. Great!
The Tools Way
Here's a list of
the 25 libraries
you really ought
to know. How's
your R
programming?
The Tools Way
The Good:
Don't have to
understand the
math
Practical
Can get started doing
fun stuff right away
The Tools Way
The Good:
Don't have to
understand the
math
Practical
Can get started doing
fun stuff right away
The Bad:
Don't have to
understand the
math
Can get started doing
bad science right
away
So, did you
build that
model?
Yes, and it fits the
training data
almost perfectly!
OK, Maybe Not That Either
So Then What?
Example: k-means clustering
Unsupervised machine learning technique
Given a set of points, group them into k clusters
in a way that minimizes the within-cluster sum-
of-squares
i.e. in a way such that the clusters are as "small"
as possible (for a particular conception of
"small")
The Math Way
The Math Way
The Tools Way
# a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 2))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)
The Tools Way
>>> from sklearn import cluster, datasets
>>> iris = datasets.load_iris()
>>> X_iris = iris.data
>>> y_iris = iris.target
>>> k_means = cluster.KMeans(n_clusters=3)
>>> k_means.fit(X_iris)
KMeans(copy_x=True, init='k-means++', ...
>>> print(k_means.labels_[::10])
[1 1 1 1 1 0 0 0 0 0 2 2 2 2 2]
>>> print(y_iris[::10])
[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
So What To Do?
Bootcamps?
Data Science from Scratch
This is to certify that Joel Grus
has honorably completed the course of study outlined in
the book Data Science from Scratch: First Principles with
Python, and is entitled to all the Rights, Privileges, and
Honors thereunto appertaining.
Joel Grus, June 23, 2015
Certificate Programs?
Hey! Data scientists!
Learning By Building
You don't really understand something until you
build it
For example, I understand garbage disposals
much better now that I've had to replace one that
was leaking water all over my kitchen
More relevantly, I thought I understood
hypothesis testing, until I tried to write a book
chapter + code about it.
Learning By Building
Functional Programming
Break Things Down Into Small Functions
So you
don't end
up with
something
like this
Don't Mutate
Example: k-means clustering
Given a set of points, group them into k clusters
in a way that minimizes the within-cluster sum-
of-squares
Global optimization is hard, so use a greedy
iterative approach
Fun Motivation: Image Posterization
Image consists of pixels
Each pixel is a triplet (R,G,B)
Imagine pixels as points in space
Find k clusters of pixels
Recolor each pixel to its cluster mean
I think it's fun, anyway
8 colors
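As a rough sketch of what that looks like in code (my own, not from the talk), assuming Pillow is installed and that k_means and closest_index are defined as later in this deck, with mean taking a component-wise average:

from PIL import Image

def posterize(path, k=8, out_path="posterized.png"):
    # load the image and treat each pixel as an (R, G, B) point
    img = Image.open(path).convert("RGB")
    pixels = list(img.getdata())
    # find k cluster "means" in color space (pure Python, so small images only)
    means = k_means(pixels, k)
    # recolor each pixel with (a rounded copy of) its cluster's mean color
    recolored = [tuple(int(round(c)) for c in means[closest_index(p, means)])
                 for p in pixels]
    out = Image.new("RGB", img.size)
    out.putdata(recolored)
    out.save(out_path)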
Example: k-means clustering
given some points, find k clusters by
choose k "means"
repeat:
assign each point to cluster of closest "mean"
recompute mean of each cluster
sounds simple! let's code!
def k_means(points, k, num_iters=10):
means = list(random.sample(points, k))
assignments = [None for _ in points]
for _ in range(num_iters):
# assign each point to closest mean
for i, point_i in enumerate(points):
d_min = float('inf')
for j, mean_j in enumerate(means):
d = sum((x - y)**2
for x, y in zip(point_i, mean_j))
if d < d_min:
d_min = d
assignments[i] = j
# recompute means
for j in range(k):
cluster = [point for i, point in enumerate(points) if assignments[i] ==
j]
means[j] = mean(cluster)
return means
def k_means(points, k, num_iters=10):
means = list(random.sample(points, k))
assignments = [None for _ in points]
for _ in range(num_iters):
# assign each point to closest mean
for i, point_i in enumerate(points):
d_min = float('inf')
for j, mean_j in enumerate(means):
d = sum((x - y)**2
for x, y in zip(point_i, mean_j))
if d < d_min:
d_min = d
assignments[i] = j
# recompute means
for j in range(k):
cluster = [point for i, point in enumerate(points) if assignments[i] ==
j]
means[j] = mean(cluster)
return means
start with k randomly chosen points
def k_means(points, k, num_iters=10):
means = list(random.sample(points, k))
assignments = [None for _ in points]
for _ in range(num_iters):
# assign each point to closest mean
for i, point_i in enumerate(points):
d_min = float('inf')
for j, mean_j in enumerate(means):
d = sum((x - y)**2
for x, y in zip(point_i, mean_j))
if d < d_min:
d_min = d
assignments[i] = j
# recompute means
for j in range(k):
cluster = [point for i, point in enumerate(points) if assignments[i] ==
j]
means[j] = mean(cluster)
return means
start with k randomly chosen points
start with no cluster assignments
def k_means(points, k, num_iters=10):
means = list(random.sample(points, k))
assignments = [None for _ in points]
for _ in range(num_iters):
# assign each point to closest mean
for i, point_i in enumerate(points):
d_min = float('inf')
for j, mean_j in enumerate(means):
d = sum((x - y)**2
for x, y in zip(point_i, mean_j))
if d < d_min:
d_min = d
assignments[i] = j
# recompute means
for j in range(k):
cluster = [point for i, point in enumerate(points) if assignments[i] ==
j]
means[j] = mean(cluster)
return means
start with k randomly chosen points
start with no cluster assignments
for each iteration
def k_means(points, k, num_iters=10):
means = list(random.sample(points, k))
assignments = [None for _ in points]
for _ in range(num_iters):
# assign each point to closest mean
for i, point_i in enumerate(points):
d_min = float('inf')
for j, mean_j in enumerate(means):
d = sum((x - y)**2
for x, y in zip(point_i, mean_j))
if d < d_min:
d_min = d
assignments[i] = j
# recompute means
for j in range(k):
cluster = [point for i, point in enumerate(points) if assignments[i] ==
j]
means[j] = mean(cluster)
return means
start with k randomly chosen points
start with no cluster assignments
for each iteration
for each point
def k_means(points, k, num_iters=10):
means = list(random.sample(points, k))
assignments = [None for _ in points]
for _ in range(num_iters):
# assign each point to closest mean
for i, point_i in enumerate(points):
d_min = float('inf')
for j, mean_j in enumerate(means):
d = sum((x - y)**2
for x, y in zip(point_i, mean_j))
if d < d_min:
d_min = d
assignments[i] = j
# recompute means
for j in range(k):
cluster = [point for i, point in enumerate(points) if assignments[i] ==
j]
means[j] = mean(cluster)
return means
start with k randomly chosen points
start with no cluster assignments
for each iteration
for each point
for each mean
def k_means(points, k, num_iters=10):
means = list(random.sample(points, k))
assignments = [None for _ in points]
for _ in range(num_iters):
# assign each point to closest mean
for i, point_i in enumerate(points):
d_min = float('inf')
for j, mean_j in enumerate(means):
d = sum((x - y)**2
for x, y in zip(point_i, mean_j))
if d < d_min:
d_min = d
assignments[i] = j
# recompute means
for j in range(k):
cluster = [point for i, point in enumerate(points) if assignments[i] ==
j]
means[j] = mean(cluster)
return means
start with k randomly chosen points
start with no cluster assignments
for each iteration
for each point
for each mean
compute the distance
def k_means(points, k, num_iters=10):
means = list(random.sample(points, k))
assignments = [None for _ in points]
for _ in range(num_iters):
# assign each point to closest mean
for i, point_i in enumerate(points):
d_min = float('inf')
for j, mean_j in enumerate(means):
d = sum((x - y)**2
for x, y in zip(point_i, mean_j))
if d < d_min:
d_min = d
assignments[i] = j
# recompute means
for j in range(k):
cluster = [point for i, point in enumerate(points) if assignments[i] ==
j]
means[j] = mean(cluster)
return means
start with k randomly chosen points
start with no cluster assignments
for each iteration
for each point
for each mean
compute the distance
assign the point to the cluster of the mean with
the smallest distance
def k_means(points, k, num_iters=10):
means = list(random.sample(points, k))
assignments = [None for _ in points]
for _ in range(num_iters):
# assign each point to closest mean
for i, point_i in enumerate(points):
d_min = float('inf')
for j, mean_j in enumerate(means):
d = sum((x - y)**2
for x, y in zip(point_i, mean_j))
if d < d_min:
d_min = d
assignments[i] = j
# recompute means
for j in range(k):
cluster = [point for i, point in enumerate(points) if assignments[i] ==
j]
means[j] = mean(cluster)
return means
start with k randomly chosen points
start with no cluster assignments
for each iteration
for each point
for each mean
compute the distance
assign the point to the cluster of the mean with
the smallest distance
find the points in each cluster
def k_means(points, k, num_iters=10):
means = list(random.sample(points, k))
assignments = [None for _ in points]
for _ in range(num_iters):
# assign each point to closest mean
for i, point_i in enumerate(points):
d_min = float('inf')
for j, mean_j in enumerate(means):
d = sum((x - y)**2
for x, y in zip(point_i, mean_j))
if d < d_min:
d_min = d
assignments[i] = j
# recompute means
for j in range(k):
cluster = [point for i, point in enumerate(points) if assignments[i] ==
j]
means[j] = mean(cluster)
return means
start with k randomly chosen points
start with no cluster assignments
for each iteration
for each point
for each mean
compute the distance
assign the point to the cluster of the mean with
the smallest distance
find the points in each cluster
and compute the new means
def k_means(points, k, num_iters=10):
means = list(random.sample(points, k))
assignments = [None for _ in points]
for _ in range(num_iters):
# assign each point to closest mean
for i, point_i in enumerate(points):
d_min = float('inf')
for j, mean_j in enumerate(means):
d = sum((x - y)**2
for x, y in zip(point_i, mean_j))
if d < d_min:
d_min = d
assignments[i] = j
# recompute means
for j in range(k):
cluster = [point for i, point in enumerate(points) if assignments[i] ==
j]
means[j] = mean(cluster)
return means
Not impenetrable, but
a lot less helpful than
it could be
def k_means(points, k, num_iters=10):
means = list(random.sample(points, k))
assignments = [None for _ in points]
for _ in range(num_iters):
# assign each point to closest mean
for i, point_i in enumerate(points):
d_min = float('inf')
for j, mean_j in enumerate(means):
d = sum((x - y)**2
for x, y in zip(point_i, mean_j))
if d < d_min:
d_min = d
assignments[i] = j
# recompute means
for j in range(k):
cluster = [point for i, point in enumerate(points) if assignments[i] ==
j]
means[j] = mean(cluster)
return means
Not impenetrable, but
a lot less helpful than
it could be
Can we make it
simpler?
Break Things Down Into Small Functions
def k_means(points, k, num_iters=10):
    # start with k of the points as "means"
    means = random.sample(points, k)
    # and iterate finding new means
    for _ in range(num_iters):
        means = new_means(points, means)
    return means
def new_means(points, means):
    # assign points to clusters
    # each cluster is just a list of points
    clusters = assign_clusters(points, means)
    # return the cluster means
    return [mean(cluster)
            for cluster in clusters]
def assign_clusters(points, means):
    # one cluster for each mean
    # each cluster starts empty
    clusters = [[] for _ in means]
    # assign each point to cluster
    # corresponding to closest mean
    for point in points:
        index = closest_index(point, means)
        clusters[index].append(point)
    return clusters
def closest_index(point, means):
    # return index of closest mean
    return argmin(distance(point, mean)
                  for mean in means)
def argmin(xs):
    # return index of smallest element
    return min(enumerate(xs),
               key=lambda pair: pair[1])[0]
To Recap
The original version:
k_means(points, k, num_iters=10)
mean(points)
The refactored version:
k_means(points, k, num_iters=10)
new_means(points, means)
assign_clusters(points, means)
closest_index(point, means)
argmin(xs)
distance(point1, point2)
mean(points)
add(point1, point2)
scalar_multiply(c, point)
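The last few helpers in that list aren't spelled out on the slides; here is a minimal sketch of what they might look like (my fill-in, assuming points are plain tuples or lists of numbers):

import random  # used by k_means above

def add(point1, point2):
    # component-wise sum of two points
    return [x + y for x, y in zip(point1, point2)]

def scalar_multiply(c, point):
    # multiply every component of a point by c
    return [c * x for x in point]

def mean(points):
    # component-wise mean of a non-empty list of points
    total = points[0]
    for point in points[1:]:
        total = add(total, point)
    return scalar_multiply(1.0 / len(points), total)

def distance(point1, point2):
    # squared distance is enough for picking a closest mean;
    # a "true" Euclidean distance would take the square root
    return sum((x - y) ** 2 for x, y in zip(point1, point2))

With these in place, k_means([(0, 0), (0, 1), (10, 10), (10, 11)], 2) should land on means near (0, 0.5) and (10, 10.5).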
As a Pedagogical Tool
Can be used "top down" (as we did here)
Implement high-level logic
Then implement the details
Nice for exposition
Can also be used "bottom up"
Implement small pieces
Build up to high-level logic
Good for workshops
Example: Decision Trees
Want to predict whether
a given Meetup is worth
attending (True) or not
(False)
Inputs are dictionaries
describing each Meetup
{ "group" : "DAML",
"date" : "2015-06-23",
"beer" : "free",
"food" : "dim sum",
"speaker" : "@joelgrus",
"location" : "Google",
"topic" : "shameless self-promotion" }
{ "group" : "Seattle Atheists",
"date" : "2015-06-23",
"location" : "Round the Table",
"beer" : "none",
"food" : "none",
"topic" : "Godless Game Night" }
Example: Decision Trees
{ "group" : "DAML",
"date" : "2015-06-23",
"beer" : "free",
"food" : "dim sum",
"speaker" : "@joelgrus",
"location" : "Google",
"topic" : "shameless self-promotion" }
{ "group" : "Seattle Atheists",
"date" : "2015-06-23",
"location" : "Round the Table",
"beer" : "none",
"food" : "none",
"topic" : "Godless Game Night" }
beer?
True False
speaker?
True False
free none
paid
@jakevdp @joelgrus
Example: Decision Trees
class LeafNode:
    def __init__(self, prediction):
        self.prediction = prediction

    def predict(self, input_dict):
        return self.prediction

class DecisionNode:
    def __init__(self, attribute, subtree_dict):
        self.attribute = attribute
        self.subtree_dict = subtree_dict

    def predict(self, input_dict):
        value = input_dict.get(self.attribute)
        subtree = self.subtree_dict[value]
        return subtree.predict(input_dict)
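A quick usage sketch (mine, not from the slides), wiring up the earlier "beer?" example; the outcomes under the speaker branch are guesses, since the diagram doesn't say which is which:

tree = DecisionNode("beer", {
    "free": LeafNode(True),
    "none": LeafNode(False),
    "paid": DecisionNode("speaker", {   # hypothetical outcomes
        "@joelgrus": LeafNode(True),
        "@jakevdp": LeafNode(True),
    }),
})

daml = {"group": "DAML", "beer": "free", "speaker": "@joelgrus"}
print(tree.predict(daml))   # True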
Example: Decision Trees
Again inspiration from functional programming:
type Input = Map.Map String String
data Tree = Predict Bool
| Subtrees String (Map.Map String Tree)
look at the "beer" entry
a map from each possible
"beer" value to a subtree
always predict a specific value
Example: Decision Trees
type Input = Map.Map String String
data Tree = Predict Bool
| Subtrees String (Map.Map String Tree)
predict :: Tree -> Input -> Bool
predict (Predict b) _ = b
predict (Subtrees a subtrees) input =
  predict subtree input
  where subtree = subtrees Map.! (input Map.! a)
Example: Decision Trees
type Input = Map.Map String String
data Tree = Predict Bool
| Subtrees String (Map.Map String Tree)
We can do the same,
we'll say a decision tree is either
True
False
(attribute, subtree_dict)
("beer",
{ "free" : True,
"none" : False,
"paid" : ("speaker",
{...})})
predict :: Tree -> Input -> Bool
predict (Predict b) _ = b
predict (Subtrees a subtrees) input =
predict subtree input
where subtree = subtrees Map.! (input Map.! a)
Example: Decision Trees
def predict(tree, input_dict):
    # leaf node predicts itself
    if tree in (True, False):
        return tree
    else:
        # destructure tree
        attribute, subtree_dict = tree
        # find appropriate subtree
        value = input_dict[attribute]
        subtree = subtree_dict[value]
        # classify using subtree
        return predict(subtree, input_dict)
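And the same tree in the tuple representation, again just a sketch of mine; the speaker sub-dictionary fills in the "{...}" from the earlier slide with made-up values:

tree = ("beer",
        {"free": True,
         "none": False,
         "paid": ("speaker",   # hypothetical outcomes
                  {"@joelgrus": True,
                   "@jakevdp": True})})

print(predict(tree, {"beer": "free"}))                          # True
print(predict(tree, {"beer": "paid", "speaker": "@joelgrus"}))  # True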
Not Just For Data Science
In Conclusion
Teaching data science is fun, if you're smart
about it
Learning data science is fun, if you're smart
about it
Writing a book is not that much fun
Having written a book is pretty fun
Making slides is actually kind of fun
Functional programming is a lot of fun
Thanks!
@joelgrus
joelgrus@gmail.com
joelgrus.com
Contenu connexe

Tendances

Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineeringThang Bui (Bob)
 
Large Language Models Are Reasoning Teachers
Large Language Models Are Reasoning TeachersLarge Language Models Are Reasoning Teachers
Large Language Models Are Reasoning TeachersNamgyu Ho
 
Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data scienceTanujaSomvanshi1
 
Databricks Overview for MLOps
Databricks Overview for MLOpsDatabricks Overview for MLOps
Databricks Overview for MLOpsDatabricks
 
ChatGPT, Foundation Models and Web3.pptx
ChatGPT, Foundation Models and Web3.pptxChatGPT, Foundation Models and Web3.pptx
ChatGPT, Foundation Models and Web3.pptxJesus Rodriguez
 
Generative AI, WiDS 2023.pptx
Generative AI, WiDS 2023.pptxGenerative AI, WiDS 2023.pptx
Generative AI, WiDS 2023.pptxColleen Farrelly
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language ProcessingPranav Gupta
 
generative-ai-fundamentals and Large language models
generative-ai-fundamentals and Large language modelsgenerative-ai-fundamentals and Large language models
generative-ai-fundamentals and Large language modelsAdventureWorld5
 
Introduction to Python for Data Science
Introduction to Python for Data ScienceIntroduction to Python for Data Science
Introduction to Python for Data ScienceArc & Codementor
 
Neo4j Generative AI workshop at GraphSummit London 14 Nov 2023.pdf
Neo4j Generative AI workshop at GraphSummit London 14 Nov 2023.pdfNeo4j Generative AI workshop at GraphSummit London 14 Nov 2023.pdf
Neo4j Generative AI workshop at GraphSummit London 14 Nov 2023.pdfNeo4j
 
Python PPT
Python PPTPython PPT
Python PPTEdureka!
 
An introduction to Jupyter notebooks and the Noteable service
An introduction to Jupyter notebooks and the Noteable serviceAn introduction to Jupyter notebooks and the Noteable service
An introduction to Jupyter notebooks and the Noteable serviceJisc
 
Summer Report on Mathematics for Machine learning: Imperial College of London
Summer Report on Mathematics for Machine learning: Imperial College of LondonSummer Report on Mathematics for Machine learning: Imperial College of London
Summer Report on Mathematics for Machine learning: Imperial College of LondonYash Khanna
 
MLOps - The Assembly Line of ML
MLOps - The Assembly Line of MLMLOps - The Assembly Line of ML
MLOps - The Assembly Line of MLJordan Birdsell
 
Large Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdfLarge Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdfDavid Rostcheck
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationKnoldus Inc.
 
Machine learning with scikitlearn
Machine learning with scikitlearnMachine learning with scikitlearn
Machine learning with scikitlearnPratap Dangeti
 
Vertex AI: Pipelines for your MLOps workflows
Vertex AI: Pipelines for your MLOps workflowsVertex AI: Pipelines for your MLOps workflows
Vertex AI: Pipelines for your MLOps workflowsMárton Kodok
 

Tendances (20)

Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
 
Large Language Models Are Reasoning Teachers
Large Language Models Are Reasoning TeachersLarge Language Models Are Reasoning Teachers
Large Language Models Are Reasoning Teachers
 
Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data science
 
Intro to LLMs
Intro to LLMsIntro to LLMs
Intro to LLMs
 
Databricks Overview for MLOps
Databricks Overview for MLOpsDatabricks Overview for MLOps
Databricks Overview for MLOps
 
ChatGPT, Foundation Models and Web3.pptx
ChatGPT, Foundation Models and Web3.pptxChatGPT, Foundation Models and Web3.pptx
ChatGPT, Foundation Models and Web3.pptx
 
Generative AI, WiDS 2023.pptx
Generative AI, WiDS 2023.pptxGenerative AI, WiDS 2023.pptx
Generative AI, WiDS 2023.pptx
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
generative-ai-fundamentals and Large language models
generative-ai-fundamentals and Large language modelsgenerative-ai-fundamentals and Large language models
generative-ai-fundamentals and Large language models
 
Introduction to Python for Data Science
Introduction to Python for Data ScienceIntroduction to Python for Data Science
Introduction to Python for Data Science
 
Neo4j Generative AI workshop at GraphSummit London 14 Nov 2023.pdf
Neo4j Generative AI workshop at GraphSummit London 14 Nov 2023.pdfNeo4j Generative AI workshop at GraphSummit London 14 Nov 2023.pdf
Neo4j Generative AI workshop at GraphSummit London 14 Nov 2023.pdf
 
Python PPT
Python PPTPython PPT
Python PPT
 
An introduction to Jupyter notebooks and the Noteable service
An introduction to Jupyter notebooks and the Noteable serviceAn introduction to Jupyter notebooks and the Noteable service
An introduction to Jupyter notebooks and the Noteable service
 
Summer Report on Mathematics for Machine learning: Imperial College of London
Summer Report on Mathematics for Machine learning: Imperial College of LondonSummer Report on Mathematics for Machine learning: Imperial College of London
Summer Report on Mathematics for Machine learning: Imperial College of London
 
Python
PythonPython
Python
 
MLOps - The Assembly Line of ML
MLOps - The Assembly Line of MLMLOps - The Assembly Line of ML
MLOps - The Assembly Line of ML
 
Large Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdfLarge Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdf
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its application
 
Machine learning with scikitlearn
Machine learning with scikitlearnMachine learning with scikitlearn
Machine learning with scikitlearn
 
Vertex AI: Pipelines for your MLOps workflows
Vertex AI: Pipelines for your MLOps workflowsVertex AI: Pipelines for your MLOps workflows
Vertex AI: Pipelines for your MLOps workflows
 

En vedette

F# for startups v2
F# for startups v2F# for startups v2
F# for startups v2joelgrus
 
T shirts, feminism, parenting, and data science
T shirts, feminism, parenting, and data scienceT shirts, feminism, parenting, and data science
T shirts, feminism, parenting, and data sciencejoelgrus
 
Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016
Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016
Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016Seattle DAML meetup
 
F# for startups
F# for startupsF# for startups
F# for startupsjoelgrus
 
Karin Strauss - DNA Storage, July 2016
Karin Strauss - DNA Storage, July 2016Karin Strauss - DNA Storage, July 2016
Karin Strauss - DNA Storage, July 2016Seattle DAML meetup
 
Numbers game
Numbers gameNumbers game
Numbers gamejoelgrus
 
Secrets of Fire Truck Society - Slides for Ignite Strata 2013
Secrets of Fire Truck Society - Slides for Ignite Strata 2013Secrets of Fire Truck Society - Slides for Ignite Strata 2013
Secrets of Fire Truck Society - Slides for Ignite Strata 2013joelgrus
 

En vedette (7)

F# for startups v2
F# for startups v2F# for startups v2
F# for startups v2
 
T shirts, feminism, parenting, and data science
T shirts, feminism, parenting, and data scienceT shirts, feminism, parenting, and data science
T shirts, feminism, parenting, and data science
 
Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016
Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016
Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016
 
F# for startups
F# for startupsF# for startups
F# for startups
 
Karin Strauss - DNA Storage, July 2016
Karin Strauss - DNA Storage, July 2016Karin Strauss - DNA Storage, July 2016
Karin Strauss - DNA Storage, July 2016
 
Numbers game
Numbers gameNumbers game
Numbers game
 
Secrets of Fire Truck Society - Slides for Ignite Strata 2013
Secrets of Fire Truck Society - Slides for Ignite Strata 2013Secrets of Fire Truck Society - Slides for Ignite Strata 2013
Secrets of Fire Truck Society - Slides for Ignite Strata 2013
 

Similaire à Joel Grus Seattle DAML Meetup Data Science Presentation

Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
Data Science, what even?!
Data Science, what even?!Data Science, what even?!
Data Science, what even?!David Coallier
 
How to Build a Semantic Search System
How to Build a Semantic Search SystemHow to Build a Semantic Search System
How to Build a Semantic Search SystemTrey Grainger
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Data Science London
 
Get connected with python
Get connected with pythonGet connected with python
Get connected with pythonJan Kroon
 
Intro to Python for Data Science
Intro to Python for Data ScienceIntro to Python for Data Science
Intro to Python for Data ScienceTJ Stalcup
 
Intro to Python for Data Science
Intro to Python for Data ScienceIntro to Python for Data Science
Intro to Python for Data ScienceTJ Stalcup
 
Data Science, what even...
Data Science, what even...Data Science, what even...
Data Science, what even...David Coallier
 
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify Dataconomy Media
 
The Data Janitor Returns | Daniel Molnar | DN18
The Data Janitor Returns | Daniel Molnar | DN18The Data Janitor Returns | Daniel Molnar | DN18
The Data Janitor Returns | Daniel Molnar | DN18DataconomyGmbH
 
Sztuka czytania między wierszami - R i Data mining
Sztuka czytania między wierszami - R i Data miningSztuka czytania między wierszami - R i Data mining
Sztuka czytania między wierszami - R i Data miningKatarzyna Mrowca
 
AI Is Changing The Way We Look At Data Science
AI Is Changing The Way We Look At Data ScienceAI Is Changing The Way We Look At Data Science
AI Is Changing The Way We Look At Data ScienceAbe
 
Sql saturday el salvador 2016 - Me, A Data Scientist?
Sql saturday el salvador 2016 - Me, A Data Scientist?Sql saturday el salvador 2016 - Me, A Data Scientist?
Sql saturday el salvador 2016 - Me, A Data Scientist?Fabricio Quintanilla
 
From Rocket Science to Data Science
From Rocket Science to Data ScienceFrom Rocket Science to Data Science
From Rocket Science to Data ScienceSanghamitra Deb
 

Similaire à Joel Grus Seattle DAML Meetup Data Science Presentation (20)

Data science presentation
Data science presentationData science presentation
Data science presentation
 
Data Science, what even?!
Data Science, what even?!Data Science, what even?!
Data Science, what even?!
 
How to Build a Semantic Search System
How to Build a Semantic Search SystemHow to Build a Semantic Search System
How to Build a Semantic Search System
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
20151020 Metis
20151020 Metis20151020 Metis
20151020 Metis
 
Introduction to python
Introduction to pythonIntroduction to python
Introduction to python
 
Get connected with python
Get connected with pythonGet connected with python
Get connected with python
 
R & Data mining in action
R & Data mining in actionR & Data mining in action
R & Data mining in action
 
Intro to Python for Data Science
Intro to Python for Data ScienceIntro to Python for Data Science
Intro to Python for Data Science
 
Intro to Python for Data Science
Intro to Python for Data ScienceIntro to Python for Data Science
Intro to Python for Data Science
 
Data Science, what even...
Data Science, what even...Data Science, what even...
Data Science, what even...
 
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
 
The Data Janitor Returns | Daniel Molnar | DN18
The Data Janitor Returns | Daniel Molnar | DN18The Data Janitor Returns | Daniel Molnar | DN18
The Data Janitor Returns | Daniel Molnar | DN18
 
Sztuka czytania między wierszami - R i Data mining
Sztuka czytania między wierszami - R i Data miningSztuka czytania między wierszami - R i Data mining
Sztuka czytania między wierszami - R i Data mining
 
AI Is Changing The Way We Look At Data Science
AI Is Changing The Way We Look At Data ScienceAI Is Changing The Way We Look At Data Science
AI Is Changing The Way We Look At Data Science
 
Big Data made easy with a Spark
Big Data made easy with a SparkBig Data made easy with a Spark
Big Data made easy with a Spark
 
Machine learning with Google machine learning APIs - Puppy or Muffin?
Machine learning with Google machine learning APIs - Puppy or Muffin?Machine learning with Google machine learning APIs - Puppy or Muffin?
Machine learning with Google machine learning APIs - Puppy or Muffin?
 
Sql saturday el salvador 2016 - Me, A Data Scientist?
Sql saturday el salvador 2016 - Me, A Data Scientist?Sql saturday el salvador 2016 - Me, A Data Scientist?
Sql saturday el salvador 2016 - Me, A Data Scientist?
 
From Rocket Science to Data Science
From Rocket Science to Data ScienceFrom Rocket Science to Data Science
From Rocket Science to Data Science
 
Data science
Data scienceData science
Data science
 

Plus de Seattle DAML meetup

Understanding disparities using the American Community Survey - Sean Green, M...
Understanding disparities using the American Community Survey - Sean Green, M...Understanding disparities using the American Community Survey - Sean Green, M...
Understanding disparities using the American Community Survey - Sean Green, M...Seattle DAML meetup
 
Towards Automatic Moderation of Online Hate Speech - Emily Spahn, March 2016
Towards Automatic Moderation of Online Hate Speech - Emily Spahn, March 2016Towards Automatic Moderation of Online Hate Speech - Emily Spahn, March 2016
Towards Automatic Moderation of Online Hate Speech - Emily Spahn, March 2016Seattle DAML meetup
 
Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Frequent Pattern Mining - Krishna Sridhar, Feb 2016Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Frequent Pattern Mining - Krishna Sridhar, Feb 2016Seattle DAML meetup
 
Streaming Hypothesis Reasoning - William Smith, Jan 2016
Streaming Hypothesis Reasoning - William Smith, Jan 2016Streaming Hypothesis Reasoning - William Smith, Jan 2016
Streaming Hypothesis Reasoning - William Smith, Jan 2016Seattle DAML meetup
 
Been Kim - Interpretable machine learning, Nov 2015
Been Kim - Interpretable machine learning, Nov 2015Been Kim - Interpretable machine learning, Nov 2015
Been Kim - Interpretable machine learning, Nov 2015Seattle DAML meetup
 
Hunting criminals with hybrid analytics -- October 2015
Hunting criminals with hybrid analytics -- October 2015Hunting criminals with hybrid analytics -- October 2015
Hunting criminals with hybrid analytics -- October 2015Seattle DAML meetup
 
Machine Learning in Biology and Why It Doesn't Make Sense - Theo Knijnenburg,...
Machine Learning in Biology and Why It Doesn't Make Sense - Theo Knijnenburg,...Machine Learning in Biology and Why It Doesn't Make Sense - Theo Knijnenburg,...
Machine Learning in Biology and Why It Doesn't Make Sense - Theo Knijnenburg,...Seattle DAML meetup
 
Adventures in Data Visualization - Jeff Heer, May 2015
Adventures in Data Visualization - Jeff Heer, May 2015Adventures in Data Visualization - Jeff Heer, May 2015
Adventures in Data Visualization - Jeff Heer, May 2015Seattle DAML meetup
 
Scaling decision trees - George Murray, July 2015
Scaling decision trees - George Murray, July 2015Scaling decision trees - George Murray, July 2015
Scaling decision trees - George Murray, July 2015Seattle DAML meetup
 

Plus de Seattle DAML meetup (9)

Understanding disparities using the American Community Survey - Sean Green, M...
Understanding disparities using the American Community Survey - Sean Green, M...Understanding disparities using the American Community Survey - Sean Green, M...
Understanding disparities using the American Community Survey - Sean Green, M...
 
Towards Automatic Moderation of Online Hate Speech - Emily Spahn, March 2016
Towards Automatic Moderation of Online Hate Speech - Emily Spahn, March 2016Towards Automatic Moderation of Online Hate Speech - Emily Spahn, March 2016
Towards Automatic Moderation of Online Hate Speech - Emily Spahn, March 2016
 
Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Frequent Pattern Mining - Krishna Sridhar, Feb 2016Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Frequent Pattern Mining - Krishna Sridhar, Feb 2016
 
Streaming Hypothesis Reasoning - William Smith, Jan 2016
Streaming Hypothesis Reasoning - William Smith, Jan 2016Streaming Hypothesis Reasoning - William Smith, Jan 2016
Streaming Hypothesis Reasoning - William Smith, Jan 2016
 
Been Kim - Interpretable machine learning, Nov 2015
Been Kim - Interpretable machine learning, Nov 2015Been Kim - Interpretable machine learning, Nov 2015
Been Kim - Interpretable machine learning, Nov 2015
 
Hunting criminals with hybrid analytics -- October 2015
Hunting criminals with hybrid analytics -- October 2015Hunting criminals with hybrid analytics -- October 2015
Hunting criminals with hybrid analytics -- October 2015
 
Machine Learning in Biology and Why It Doesn't Make Sense - Theo Knijnenburg,...
Machine Learning in Biology and Why It Doesn't Make Sense - Theo Knijnenburg,...Machine Learning in Biology and Why It Doesn't Make Sense - Theo Knijnenburg,...
Machine Learning in Biology and Why It Doesn't Make Sense - Theo Knijnenburg,...
 
Adventures in Data Visualization - Jeff Heer, May 2015
Adventures in Data Visualization - Jeff Heer, May 2015Adventures in Data Visualization - Jeff Heer, May 2015
Adventures in Data Visualization - Jeff Heer, May 2015
 
Scaling decision trees - George Murray, July 2015
Scaling decision trees - George Murray, July 2015Scaling decision trees - George Murray, July 2015
Scaling decision trees - George Murray, July 2015
 

Dernier

Immutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfImmutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfDrew Moseley
 
Artificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewArtificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewsandhya757531
 
Novel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsNovel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsResearcher Researcher
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating SystemRashmi Bhat
 
STATE TRANSITION DIAGRAM in psoc subject
STATE TRANSITION DIAGRAM in psoc subjectSTATE TRANSITION DIAGRAM in psoc subject
STATE TRANSITION DIAGRAM in psoc subjectGayathriM270621
 
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.elesangwon
 
The Satellite applications in telecommunication
The Satellite applications in telecommunicationThe Satellite applications in telecommunication
The Satellite applications in telecommunicationnovrain7111
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
Forming section troubleshooting checklist for improving wire life (1).ppt
Forming section troubleshooting checklist for improving wire life (1).pptForming section troubleshooting checklist for improving wire life (1).ppt
Forming section troubleshooting checklist for improving wire life (1).pptNoman khan
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Coursebim.edu.pl
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...Erbil Polytechnic University
 
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfComprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfalene1
 
Curve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptxCurve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptxRomil Mishra
 
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...arifengg7
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSneha Padhiar
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating SystemRashmi Bhat
 
Secure Key Crypto - Tech Paper JET Tech Labs
Secure Key Crypto - Tech Paper JET Tech LabsSecure Key Crypto - Tech Paper JET Tech Labs
Secure Key Crypto - Tech Paper JET Tech Labsamber724300
 
AntColonyOptimizationManetNetworkAODV.pptx
AntColonyOptimizationManetNetworkAODV.pptxAntColonyOptimizationManetNetworkAODV.pptx
AntColonyOptimizationManetNetworkAODV.pptxLina Kadam
 

Dernier (20)

Immutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfImmutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdf
 
Artificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewArtificial Intelligence in Power System overview
Artificial Intelligence in Power System overview
 
Novel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsNovel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending Actuators
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
 
STATE TRANSITION DIAGRAM in psoc subject
STATE TRANSITION DIAGRAM in psoc subjectSTATE TRANSITION DIAGRAM in psoc subject
STATE TRANSITION DIAGRAM in psoc subject
 
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
 
The Satellite applications in telecommunication
The Satellite applications in telecommunicationThe Satellite applications in telecommunication
The Satellite applications in telecommunication
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
Forming section troubleshooting checklist for improving wire life (1).ppt
Forming section troubleshooting checklist for improving wire life (1).pptForming section troubleshooting checklist for improving wire life (1).ppt
Forming section troubleshooting checklist for improving wire life (1).ppt
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Course
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...
 
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfComprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
 
ASME-B31.4-2019-estandar para diseño de ductos
ASME-B31.4-2019-estandar para diseño de ductosASME-B31.4-2019-estandar para diseño de ductos
ASME-B31.4-2019-estandar para diseño de ductos
 
Curve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptxCurve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptx
 
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 
Secure Key Crypto - Tech Paper JET Tech Labs
Secure Key Crypto - Tech Paper JET Tech LabsSecure Key Crypto - Tech Paper JET Tech Labs
Secure Key Crypto - Tech Paper JET Tech Labs
 
AntColonyOptimizationManetNetworkAODV.pptx
AntColonyOptimizationManetNetworkAODV.pptxAntColonyOptimizationManetNetworkAODV.pptx
AntColonyOptimizationManetNetworkAODV.pptx
 

Joel Grus Seattle DAML Meetup Data Science Presentation

  • 1. Joel Grus Seattle DAML Meetup June 23, 2015 Data Science from Scratch
  • 2. About me Old-school DAML-er Wrote a book ----------> SWE at Google Formerly data science at VoloMetrix, Decide, Farecast
  • 4. The Road to Data Science My
  • 5.
  • 7.
  • 8.
  • 10. Data Science Is A Broad Field Some Stuff More Stuff Even More Stuff Data Science People who think they're data scientists, but they're not really data scientists People who are a danger to everyone around them People who say "machine learnings"
  • 11.
  • 12. a data scientist should be able to JOEL GRUS
  • 13. a data scientist should be able to run a regression, JOEL GRUS
  • 14. a data scientist should be able to run a regression, write a sql query, JOEL GRUS
  • 15. a data scientist should be able to run a regression, write a sql query, scrape a web site, JOEL GRUS
  • 16. a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, JOEL GRUS
  • 17. a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, JOEL GRUS
  • 18. a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, JOEL GRUS
  • 19. a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, JOEL GRUS
  • 20. a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, JOEL GRUS
  • 21. a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, JOEL GRUS
  • 22. a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, JOEL GRUS
  • 23. a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, JOEL GRUS
  • 24. a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, JOEL GRUS
  • 25. a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, JOEL GRUS
  • 26. a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, test a hypothesis, JOEL GRUS
  • 27. a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, test a hypothesis, talk to a businessperson, JOEL GRUS
  • 28. a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, test a hypothesis, talk to a businessperson, script a shell, JOEL GRUS
  • 29. a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, test a hypothesis, talk to a businessperson, script a shell, code on a whiteboard, JOEL GRUS
  • 30. a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, test a hypothesis, talk to a businessperson, script a shell, code on a whiteboard, hack a p-value, JOEL GRUS
  • 31. a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, test a hypothesis, talk to a businessperson, script a shell, code on a whiteboard, hack a p-value, machine-learn a model. JOEL GRUS
  • 32. a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, test a hypothesis, talk to a businessperson, script a shell, code on a whiteboard, hack a p-value, machine-learn a model. specialization is for engineers. JOEL GRUS
  • 33. A lot of stuff!
  • 34. What Are Hiring Managers Looking For?
  • 35. What Are Hiring Managers Looking For? Let's check LinkedIn
  • 36.
  • 37. a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, test a hypothesis, talk to a businessperson, script a shell, code on a whiteboard, hack a p-value, machine-learn a model. specialization is for engineers. JOEL GRUS grad students!
  • 39. I want to be a data scientist. Great!
  • 40. The Math Way I like to start with matrix decompositions. How's your measure theory?
  • 41. The Math Way The Good: Solid foundation Math is the noblest known pursuit
  • 42. The Math Way The Good: Solid foundation Math is the noblest known pursuit The Bad: Some weirdos don't think math is fun Can be pretty forbidding Can miss practical skills
  • 43. So, did you count the words in that document? No, but I have an elegant proof that the number of words is finite!
  • 44. OK, Let's Try Again
  • 45. I want to be a data scientist. Great!
  • 46. The Tools Way Here's a list of the 25 libraries you really ought to know. How's your R programming?
  • 47. The Tools Way The Good: Don't have to understand the math Practical Can get started doing fun stuff right away
  • 48. The Tools Way The Good: Don't have to understand the math Practical Can get started doing fun stuff right away The Bad: Don't have to understand the math Can get started doing bad science right away
  • 49. So, did you build that model? Yes, and it fits the training data almost perfectly!
  • 50. OK, Maybe Not That Either
  • 52. Example: k-means clustering Unsupervised machine learning technique Given a set of points, group them into k clusters in a way that minimizes the within-cluster sum-of-squares, i.e. in a way such that the clusters are as "small" as possible (for a particular conception of "small")
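To make that objective concrete, here is a minimal sketch (not from the slides; the helper names squared_distance and within_cluster_ss are illustrative) of the quantity k-means tries to minimize:

def squared_distance(p, q):
    # squared Euclidean distance between two points
    return sum((x - y) ** 2 for x, y in zip(p, q))

def within_cluster_ss(points, assignments, means):
    # total squared distance from each point to the mean of its assigned cluster;
    # k-means looks for assignments and means that make this small
    return sum(squared_distance(point, means[cluster])
               for point, cluster in zip(points, assignments))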
  • 56. The Tools Way # a 2-dimensional example x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2), matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)) colnames(x) <- c("x", "y") (cl <- kmeans(x, 2)) plot(x, col = cl$cluster) points(cl$centers, col = 1:2, pch = 8, cex = 2)
  • 57. The Tools Way >>> from sklearn import cluster, datasets >>> iris = datasets.load_iris() >>> X_iris = iris.data >>> y_iris = iris.target >>> k_means = cluster.KMeans(n_clusters=3) >>> k_means.fit(X_iris) KMeans(copy_x=True, init='k-means++', ... >>> print(k_means.labels_[::10]) [1 1 1 1 1 0 0 0 0 0 2 2 2 2 2] >>> print(y_iris[::10]) [0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
  • 58. So What To Do?
  • 60. Data Science from Scratch This is to certify that Joel Grus has honorably completed the course of study outlined in the book Data Science from Scratch: First Principles with Python, and is entitled to all the Rights, Privileges, and Honors thereunto appertaining. Joel Grus, June 23, 2015 Certificate Programs?
  • 62. Learning By Building You don't really understand something until you build it For example, I understand garbage disposals much better now that I had to replace one that was leaking water all over my kitchen More relevantly, I thought I understood hypothesis testing, until I tried to write a book chapter + code about it.
  • 64. Break Things Down Into Small Functions
  • 65. So you don't end up with something like this
  • 67. Example: k-means clustering Given a set of points, group them into k clusters in a way that minimizes the within-cluster sum- of-squares Global optimization is hard, so use a greedy iterative approach
  • 68. Fun Motivation: Image Posterization Image consists of pixels Each pixel is a triplet (R,G,B) Imagine pixels as points in space Find k clusters of pixels Recolor each pixel to its cluster mean I think it's fun, anyway 8 colors
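A hedged sketch of that recipe, assuming Pillow for image I/O and the k_means and closest_index helpers that appear later in this deck; the filename and k = 8 are placeholder choices:

from PIL import Image

img = Image.open("photo.jpg").convert("RGB")      # "photo.jpg" is a placeholder
pixels = list(img.getdata())                      # each pixel is an (R, G, B) triple

means = k_means(pixels, 8)                        # find 8 representative colors
recolored = [tuple(int(round(c)) for c in means[closest_index(p, means)])
             for p in pixels]                     # recolor each pixel to its cluster mean

img.putdata(recolored)
img.save("posterized.jpg")

On a real photo the pure-Python loops will be slow; running k_means on a random sample of the pixels first is a reasonable shortcut.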
  • 69. Example: k-means clustering given some points, find k clusters by: choose k "means" repeat: assign each point to cluster of closest "mean" recompute mean of each cluster sounds simple! let's code!
  • 70. def k_means(points, k, num_iters=10): means = list(random.sample(points, k)) assignments = [None for _ in points] for _ in range(num_iters): # assign each point to closest mean for i, point_i in enumerate(points): d_min = float('inf') for j, mean_j in enumerate(means): d = sum((x - y)**2 for x, y in zip(point_i, mean_j)) if d < d_min: d_min = d assignments[i] = j # recompute means for j in range(k): cluster = [point for i, point in enumerate(points) if assignments[i] == j] means[j] = mean(cluster) return means
  • 71. def k_means(points, k, num_iters=10): means = list(random.sample(points, k)) assignments = [None for _ in points] for _ in range(num_iters): # assign each point to closest mean for i, point_i in enumerate(points): d_min = float('inf') for j, mean_j in enumerate(means): d = sum((x - y)**2 for x, y in zip(point_i, mean_j)) if d < d_min: d_min = d assignments[i] = j # recompute means for j in range(k): cluster = [point for i, point in enumerate(points) if assignments[i] == j] means[j] = mean(cluster) return means start with k randomly chosen points
  • 72. def k_means(points, k, num_iters=10): means = list(random.sample(points, k)) assignments = [None for _ in points] for _ in range(num_iters): # assign each point to closest mean for i, point_i in enumerate(points): d_min = float('inf') for j, mean_j in enumerate(means): d = sum((x - y)**2 for x, y in zip(point_i, mean_j)) if d < d_min: d_min = d assignments[i] = j # recompute means for j in range(k): cluster = [point for i, point in enumerate(points) if assignments[i] == j] means[j] = mean(cluster) return means start with k randomly chosen points start with no cluster assignments
  • 73. def k_means(points, k, num_iters=10): means = list(random.sample(points, k)) assignments = [None for _ in points] for _ in range(num_iters): # assign each point to closest mean for i, point_i in enumerate(points): d_min = float('inf') for j, mean_j in enumerate(means): d = sum((x - y)**2 for x, y in zip(point_i, mean_j)) if d < d_min: d_min = d assignments[i] = j # recompute means for j in range(k): cluster = [point for i, point in enumerate(points) if assignments[i] == j] means[j] = mean(cluster) return means start with k randomly chosen points start with no cluster assignments for each iteration
  • 74. def k_means(points, k, num_iters=10): means = list(random.sample(points, k)) assignments = [None for _ in points] for _ in range(num_iters): # assign each point to closest mean for i, point_i in enumerate(points): d_min = float('inf') for j, mean_j in enumerate(means): d = sum((x - y)**2 for x, y in zip(point_i, mean_j)) if d < d_min: d_min = d assignments[i] = j # recompute means for j in range(k): cluster = [point for i, point in enumerate(points) if assignments[i] == j] means[j] = mean(cluster) return means start with k randomly chosen points start with no cluster assignments for each iteration for each point
  • 75. def k_means(points, k, num_iters=10): means = list(random.sample(points, k)) assignments = [None for _ in points] for _ in range(num_iters): # assign each point to closest mean for i, point_i in enumerate(points): d_min = float('inf') for j, mean_j in enumerate(means): d = sum((x - y)**2 for x, y in zip(point_i, mean_j)) if d < d_min: d_min = d assignments[i] = j # recompute means for j in range(k): cluster = [point for i, point in enumerate(points) if assignments[i] == j] means[j] = mean(cluster) return means start with k randomly chosen points start with no cluster assignments for each iteration for each point for each mean
  • 76. def k_means(points, k, num_iters=10): means = list(random.sample(points, k)) assignments = [None for _ in points] for _ in range(num_iters): # assign each point to closest mean for i, point_i in enumerate(points): d_min = float('inf') for j, mean_j in enumerate(means): d = sum((x - y)**2 for x, y in zip(point_i, mean_j)) if d < d_min: d_min = d assignments[i] = j # recompute means for j in range(k): cluster = [point for i, point in enumerate(points) if assignments[i] == j] means[j] = mean(cluster) return means start with k randomly chosen points start with no cluster assignments for each iteration for each point for each mean compute the distance
  • 77. def k_means(points, k, num_iters=10): means = list(random.sample(points, k)) assignments = [None for _ in points] for _ in range(num_iters): # assign each point to closest mean for i, point_i in enumerate(points): d_min = float('inf') for j, mean_j in enumerate(means): d = sum((x - y)**2 for x, y in zip(point_i, mean_j)) if d < d_min: d_min = d assignments[i] = j # recompute means for j in range(k): cluster = [point for i, point in enumerate(points) if assignments[i] == j] means[j] = mean(cluster) return means start with k randomly chosen points start with no cluster assignments for each iteration for each point for each mean compute the distance assign the point to the cluster of the mean with the smallest distance
  • 78. def k_means(points, k, num_iters=10): means = list(random.sample(points, k)) assignments = [None for _ in points] for _ in range(num_iters): # assign each point to closest mean for i, point_i in enumerate(points): d_min = float('inf') for j, mean_j in enumerate(means): d = sum((x - y)**2 for x, y in zip(point_i, mean_j)) if d < d_min: d_min = d assignments[i] = j # recompute means for j in range(k): cluster = [point for i, point in enumerate(points) if assignments[i] == j] means[j] = mean(cluster) return means start with k randomly chosen points start with no cluster assignments for each iteration for each point for each mean compute the distance assign the point to the cluster of the mean with the smallest distance find the points in each cluster
  • 79. def k_means(points, k, num_iters=10): means = list(random.sample(points, k)) assignments = [None for _ in points] for _ in range(num_iters): # assign each point to closest mean for i, point_i in enumerate(points): d_min = float('inf') for j, mean_j in enumerate(means): d = sum((x - y)**2 for x, y in zip(point_i, mean_j)) if d < d_min: d_min = d assignments[i] = j # recompute means for j in range(k): cluster = [point for i, point in enumerate(points) if assignments[i] == j] means[j] = mean(cluster) return means start with k randomly chosen points start with no cluster assignments for each iteration for each point for each mean compute the distance assign the point to the cluster of the mean with the smallest distance find the points in each cluster and compute the new means
  • 80. def k_means(points, k, num_iters=10): means = list(random.sample(points, k)) assignments = [None for _ in points] for _ in range(num_iters): # assign each point to closest mean for i, point_i in enumerate(points): d_min = float('inf') for j, mean_j in enumerate(means): d = sum((x - y)**2 for x, y in zip(point_i, mean_j)) if d < d_min: d_min = d assignments[i] = j # recompute means for j in range(k): cluster = [point for i, point in enumerate(points) if assignments[i] == j] means[j] = mean(cluster) return means Not impenetrable, but a lot less helpful than it could be
  • 81. def k_means(points, k, num_iters=10): means = list(random.sample(points, k)) assignments = [None for _ in points] for _ in range(num_iters): # assign each point to closest mean for i, point_i in enumerate(points): d_min = float('inf') for j, mean_j in enumerate(means): d = sum((x - y)**2 for x, y in zip(point_i, mean_j)) if d < d_min: d_min = d assignments[i] = j # recompute means for j in range(k): cluster = [point for i, point in enumerate(points) if assignments[i] == j] means[j] = mean(cluster) return means Not impenetrable, but a lot less helpful than it could be Can we make it simpler?
  • 82. Break Things Down Into Small Functions
  • 83. def k_means(points, k, num_iters=10): # start with k of the points as "means" means = random.sample(points, k) # and iterate finding new means for _ in range(num_iters): means = new_means(points, means) return means
  • 84. def new_means(points, means): # assign points to clusters # each cluster is just a list of points clusters = assign_clusters(points, means) # return the cluster means return [mean(cluster) for cluster in clusters]
  • 85. def assign_clusters(points, means): # one cluster for each mean # each cluster starts empty clusters = [[] for _ in means] # assign each point to cluster # corresponding to closest mean for point in points: index = closest_index(point, means) clusters[index].append(point) return clusters
  • 86. def closest_index(point, means): # return index of closest mean return argmin(distance(point, mean) for mean in means) def argmin(xs): # return index of smallest element return min(enumerate(xs), key=lambda pair: pair[1])[0]
  • 87. To Recap k_means(points, k, num_iters=10) mean(points) k_means(points, k, num_iters=10) new_means(points, means) assign_clusters(points, means) closest_index(point, means) argmin(xs) distance(point1, point2) mean(points) add(point1, point2) scalar_multiply(c, point)
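The recap also leans on a few small vector helpers (distance, mean, add, scalar_multiply) that these slides never show; here is one plausible sketch of them (the book's versions may differ in detail):

import math

def add(point1, point2):
    # component-wise sum of two points
    return [x + y for x, y in zip(point1, point2)]

def scalar_multiply(c, point):
    # multiply every component of a point by c
    return [c * x for x in point]

def mean(points):
    # component-wise mean of a non-empty list of points
    n = len(points)
    total = points[0]
    for point in points[1:]:
        total = add(total, point)
    return scalar_multiply(1 / n, total)

def distance(point1, point2):
    # Euclidean distance between two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(point1, point2)))

Note that mean as written assumes every cluster is non-empty; handling an empty cluster (for example by re-seeding its mean) is a real-world wrinkle the slides skip.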
  • 88. As a Pedagogical Tool Can be used "top down" (as we did here) Implement high-level logic Then implement the details Nice for exposition Can also be used "bottom up" Implement small pieces Build up to high-level logic Good for workshops
  • 89. Example: Decision Trees Want to predict whether a given Meetup is worth attending (True) or not (False) Inputs are dictionaries describing each Meetup { "group" : "DAML", "date" : "2015-06-23", "beer" : "free", "food" : "dim sum", "speaker" : "@joelgrus", "location" : "Google", "topic" : "shameless self-promotion" } { "group" : "Seattle Atheists", "date" : "2015-06-23", "location" : "Round the Table", "beer" : "none", "food" : "none", "topic" : "Godless Game Night" }
  • 90. Example: Decision Trees { "group" : "DAML", "date" : "2015-06-23", "beer" : "free", "food" : "dim sum", "speaker" : "@joelgrus", "location" : "Google", "topic" : "shameless self-promotion" } { "group" : "Seattle Atheists", "date" : "2015-06-23", "location" : "Round the Table", "beer" : "none", "food" : "none", "topic" : "Godless Game Night" } beer? True False speaker? True False free none paid @jakevdp @joelgrus
  • 91. Example: Decision Trees class LeafNode: def __init__(self, prediction): self.prediction = prediction def predict(self, input_dict): return self.prediction class DecisionNode: def __init__(self, attribute, subtree_dict): self.attribute = attribute self.subtree_dict = subtree_dict def predict(self, input_dict): value = input_dict.get(self.attribute) subtree = self.subtree_dict[value] return subtree.predict(input_dict)
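A hedged usage sketch, building one plausible reading of the tree from the previous slide out of these classes and classifying a Meetup with it (the True/False mapping on the speaker branch is a guess, since the slide leaves it ambiguous):

tree = DecisionNode("beer", {
    "free": LeafNode(True),
    "none": LeafNode(False),
    "paid": DecisionNode("speaker", {
        "@joelgrus": LeafNode(True),   # which speaker gets True is a guess
        "@jakevdp": LeafNode(False),
    }),
})

meetup = {"group": "DAML", "beer": "free", "food": "dim sum"}
print(tree.predict(meetup))   # True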
  • 92. Example: Decision Trees Again inspiration from functional programming: type Input = Map.Map String String data Tree = Predict Bool | Subtrees String (Map.Map String Tree) look at the "beer" entry a map from each possible "beer" value to a subtree always predict a specific value
  • 93. Example: Decision Trees type Input = Map.Map String String data Tree = Predict Bool | Subtrees String (Map.Map String Tree) predict :: Tree -> Input -> Bool predict (Predict b) _ = b predict (Subtrees a subtrees) input = predict subtree input where subtree = subtrees Map.! (input Map.! a)
  • 94. Example: Decision Trees type Input = Map.Map String String data Tree = Predict Bool | Subtrees String (Map.Map String Tree) We can do the same: we'll say a decision tree is either True, False, or a pair (attribute, subtree_dict), e.g. ("beer", { "free" : True, "none" : False, "paid" : ("speaker", {...})})
  • 95. predict :: Tree -> Input -> Bool predict (Predict b) _ = b predict (Subtrees a subtrees) input = predict subtree input where subtree = subtrees Map.! (input Map.! a) Example: Decision Trees def predict(tree, input_dict): # leaf node predicts itself if tree in (True, False): return tree else: # destructure tree attribute, subtree_dict = tree # find appropriate subtree value = input_dict[attribute] subtree = subtree_dict[value] # classify using subtree return predict(subtree, input_dict)
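And a hedged sketch of the same tree in the tuple/boolean representation, run through the predict function above (again, the speaker mapping is a guess):

tree = ("beer", {
    "free": True,
    "none": False,
    "paid": ("speaker", {"@joelgrus": True, "@jakevdp": False}),
})

print(predict(tree, {"beer": "paid", "speaker": "@joelgrus"}))   # True
print(predict(tree, {"beer": "none"}))                           # False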
  • 96. Not Just For Data Science
  • 97. In Conclusion Teaching data science is fun, if you're smart about it Learning data science is fun, if you're smart about it Writing a book is not that much fun Having written a book is pretty fun Making slides is actually kind of fun Functional programming is a lot of fun

Editor's Notes

  1. hedge fund jerks
  2. sql jockeys
  3. I can do some of these
  4. I can do some of these
  5. I can do some of these
  6. I can do some of these
  7. I can do some of these
  8. I can do some of these
  9. I can do some of these
  10. I can do some of these
  11. I can do some of these
  12. I can do some of these
  13. I can do some of these
  14. I can do some of these
  15. I can do some of these
  16. I can do some of these
  17. I can do some of these
  18. I can do some of these
  19. I can do some of these
  20. I can do some of these
  21. I can do some of these
  22. I can do some of these
  23. I can do some of these
  24. typed in "data science" into LinkedIn Jobs
  25. I can do some of these
  26. for those of us without PhDs
  27. https://www.flickr.com/photos/arlophoto/5616233274
  28. https://www.flickr.com/photos/arlophoto/5616233274
  29. Norvig