Divvy bikes have changed the way we can get around Chicago. This talk will demonstrate the impact of Divvy with an interactive visualization. Rather than focus on the tools and languages used to build it, the talk will emphasize design and content aspects of the visualization (at divvy.datasco.pe) as well as some recent work to quantify the similarity of bike stations. The talk will feature a live demo of the visualization and the opportunity for attendees to share their own thoughts and hypotheses about bike trip patterns.
9. Northwestern Data Visualization| @gabegaster | 2015 may
what is data science?
who is a data scientist?
“a scientist who can code”
• lower barrier to attack new problems
• repeatable analysis
• freedom to think about problems in new ways
14. Northwestern Data Visualization| @gabegaster | 2015 may
what is data science?
using emerging technologies to approach
problems scientifically
which were difficult to answer before
24. Northwestern Data Visualization| @gabegaster | 2015 may
computing has progressed
cost of new analysis: 1950, years v. today, hours
same person thinking about the problem can conduct experiments to answer it
27. Northwestern Data Visualization| @gabegaster | 2015 may
computing has progressed
open-source code
standing on shoulders of giants
reinventing the wheel
32. Northwestern Data Visualization| @gabegaster | 2015 may
what is data science?
using emerging technologies to approach
problems scientifically
which were difficult to answer before
knowing
what is possible
doing
something useful
HOW WHY
37. Northwestern Data Visualization| @gabegaster | 2015 may
what is data science?
using emerging technologies to approach
problems scientifically
which were difficult to answer before
knowing
what is possible
doing
something useful
using new, good, the right tools
asking why
WHY
61. Northwestern Data Visualization| @gabegaster | 2015 may
goal: save money
task: find needle in the haystack (without poking yourself)
documents: about patent v. not about patent
choice: turn over to plaintiff v. don’t turn over to plaintiff
about patent, don’t turn over: adverse inference
not about patent, turn over: give away trade secrets
93. Northwestern Data Visualization| @gabegaster | 2015 may
AUC
what is AUC? Area Under Curve
what curve? Receiver Operating Characteristic
balances: True Positive Rate, False Positive Rate
why? …
upshot: in practice, choice of metric matters a LOT
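To make the AUC definition concrete, here is a minimal sketch in plain Python. It sweeps the decision threshold to get (False Positive Rate, True Positive Rate) points, then integrates with the trapezoid rule. The function names and the toy labels and scores are illustrative assumptions, not taken from the talk.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points from sweeping the threshold from high to low.

    labels: 1 for positive, 0 for negative; scores: higher means "more positive".
    Assumes at least one positive and one negative label.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve given as sorted (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data (made up): one negative is ranked above one positive.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = auc(roc_points(labels, scores))  # 8 of 9 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to a random guess and 1.0 to a perfect ranking, so the metric rewards ordering positives above negatives rather than any single threshold.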
106. Northwestern Data Visualization| @gabegaster | 2015 may
timeline of contest
goal: depends on why
plot labels: Accuracy of Classification, AUC, random guess, basic SVM
111. Northwestern Data Visualization| @gabegaster | 2015 may
timeline of contest
plot labels: Accuracy of Classification, AUC, me
turned out to place 9th, because of overfitting
very common problem
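The slides do not say how the overfitting arose. One common mechanism in contests is picking whichever submission scores best on a small public test set. The toy simulation below is a sketch under that assumption: every "model" is a coin flip, yet the best of many looks skilled on the public set and falls back to chance on a larger private set. All names and sizes are hypothetical.

```python
import random


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def best_of_many_guessers(n_submissions=200, n_public=50, n_private=1000, seed=0):
    """Pick the submission with the best public score; report both of its scores."""
    rng = random.Random(seed)
    public_labels = [rng.randint(0, 1) for _ in range(n_public)]
    private_labels = [rng.randint(0, 1) for _ in range(n_private)]
    best_public, private_of_best = -1.0, None
    for _ in range(n_submissions):
        # Each submission is pure noise: it has no real predictive skill.
        public_guess = [rng.randint(0, 1) for _ in range(n_public)]
        private_guess = [rng.randint(0, 1) for _ in range(n_private)]
        public_score = accuracy(public_guess, public_labels)
        if public_score > best_public:
            best_public = public_score
            private_of_best = accuracy(private_guess, private_labels)
    return best_public, private_of_best


public_score, private_score = best_of_many_guessers()
```

The gap between the two scores comes only from selecting on the public set, which is why a held-out set that is never used for model selection gives a more honest estimate.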
123. Northwestern Data Visualization| @gabegaster | 2015 may
an example from the industrial internet
client: We need to reduce the costs of Service Requests. They are too expensive.
client: Thousands of engineers around the world, 24-7, read through emails and hardware log files to determine the cause of failure of a server. This is an expensive process. We've tried to automate it. We can now automatically resolve 7% of new Service Requests. But we want more. That's why we bought a few supercomputers with TBs of memory.
127. Northwestern Data Visualization| @gabegaster | 2015 may
Why? Why do you need to set up a Hadoop architecture to do clustering? What will this help you achieve?
client: We need to reduce the costs of Service Requests. They are too expensive.
How do you handle Service Requests?
client: Thousands of engineers around the world, 24-7, read through emails and hardware log files to determine the cause of failure of a server. This is an expensive process. We've tried to automate it. We can now automatically resolve 1% of new Service Requests. But we want more. That's why we bought a few supercomputers with TBs of memory.
129. Northwestern Data Visualization| @gabegaster | 2015 may
client
tools are not everything
but it is important to know
the right tool for the job
don’t start with Hadoop unless you have to.
probably you don’t have to.
134. Northwestern Data Visualization| @gabegaster | 2015 may
How did you automate resolving Service Requests?
client: A group of senior engineers thought about different use cases and came up with a list of conditions that, if any are met, lead to predetermined solutions.
client: Took a year to create.
client: We’ve been keeping track of every solved request for several years now.
from sklearn import naive_bayes as nb
# fit on past requests (as numeric features) and the decisions engineers made for them
nb.GaussianNB().fit(historical_requests, historical_decisions)
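For contrast with the learned model, the hand-built approach the client describes (a list of conditions where any match leads to a predetermined solution) might look like the sketch below. The request fields, the two rules, and the solution strings are all hypothetical; the talk does not show the client's actual rules.

```python
# Each rule pairs a condition on a request with a predetermined solution.
# All field names, conditions, and solutions here are hypothetical examples.
RULES = [
    (lambda request: "disk failure" in request["log"].lower(), "replace disk"),
    (lambda request: request["temperature_c"] > 90, "check cooling"),
]


def resolve(request, rules=RULES):
    """Return the solution of the first rule whose condition matches, else None."""
    for condition, solution in rules:
        if condition(request):
            return solution
    return None  # no rule fired: the request goes to a human engineer


example = resolve({"log": "ERROR: Disk failure on sda", "temperature_c": 40})
```

A rule list like this covers only the cases its authors anticipated, which fits the low automatic-resolution rate the client reports.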
140. Northwestern Data Visualization| @gabegaster | 2015 may
client: This works really well! But we can’t use it.
Oh. Why is that?
client: Engineers don’t trust the predictions.
158. Northwestern Data Visualization| @gabegaster | 2015 may
lines between pts? (the lines superimpose)
emphasizes traffic
beautiful map, but not suited for this goal
@flowingdata
166. Northwestern Data Visualization| @gabegaster | 2015 may
can use gradient to show gradual differences between stations
London transit map
@mySociety
171. Northwestern Data Visualization| @gabegaster | 2015 may
what regions?
each point is related to the closest station
-> Voronoi
huh?
Find the closest station: that’s my region!
http://alexbeutel.com/webgl/voronoi.html
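The "find the closest station" rule can be sketched directly: label each point with its nearest station, and the points sharing a label form that station's Voronoi region. The station names and coordinates below are hypothetical. Plain planar distance is also an assumption; real station data would be in latitude and longitude.

```python
import math

# Hypothetical stations on a flat plane: name -> (x, y).
STATIONS = {"A": (0.0, 0.0), "B": (10.0, 0.0), "C": (0.0, 10.0)}


def closest_station(point, stations=STATIONS):
    """Return the name of the station nearest to `point` (Euclidean distance)."""
    return min(stations, key=lambda name: math.dist(point, stations[name]))


def voronoi_regions(points, stations=STATIONS):
    """Group sample points by nearest station, approximating each Voronoi region."""
    regions = {name: [] for name in stations}
    for point in points:
        regions[closest_station(point, stations)].append(point)
    return regions


regions = voronoi_regions([(1, 1), (9, 2), (2, 8), (8, 1)])
```

Sampling a dense grid of points this way approximates the regions; an exact diagram would come from a computational-geometry routine instead.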
191. @gabegaster | http://bit.ly/1pdP2Tb
how to use color?
• two colors not many
• binned not gradient
• transparent empty bin
• iterate
[figure panels: binned v gradient, colors v colors, binned]
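As a rough illustration of "binned not gradient" and "transparent empty bin", the sketch below maps a count to one of a few discrete colors and leaves zero counts undrawn. The palette values, the equal-width bins, and the function name are assumptions for illustration, not the visualization's actual color scheme.

```python
TRANSPARENT = (0, 0, 0, 0)  # RGBA with zero alpha: the empty bin is not drawn

# Hypothetical four-step palette, light to dark, as RGBA tuples.
PALETTE = [
    (198, 219, 239, 255),
    (107, 174, 214, 255),
    (33, 113, 181, 255),
    (8, 48, 107, 255),
]


def bin_color(count, max_count, palette=PALETTE):
    """Map a count to a discrete color using equal-width bins up to max_count.

    Counts of zero (or less) get the transparent color; max_count must be > 0.
    """
    if count <= 0:
        return TRANSPARENT
    index = int(count / max_count * len(palette))
    # A count equal to max_count would index one past the end, so clamp it.
    return palette[min(index, len(palette) - 1)]
```

For example, with `max_count=100` a count of 10 falls in the lightest bin and a count of 100 in the darkest, so neighbouring values in the same bin read as the same color.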
203. Northwestern Data Visualization| @gabegaster | 2015 may
How are stations different?
when is the station used
how it is used
who uses it
use the time signature of a station
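One way to read "time signature" is the share of a station's trips that start in each hour of the day. The sketch below builds that 24-value profile and compares two stations with cosine similarity. Both the hourly profile and the choice of cosine similarity are assumptions made here; the talk does not specify the measure used. The trip hours are made up.

```python
import math
from collections import Counter


def time_signature(trip_hours):
    """Fraction of trips starting in each hour 0..23 (expects a non-empty list)."""
    counts = Counter(trip_hours)
    total = len(trip_hours)
    return [counts.get(hour, 0) / total for hour in range(24)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up start hours: two commuter-like stations and one midday station.
commuter_a = time_signature([8, 8, 9, 17, 17, 18])
commuter_b = time_signature([8, 9, 9, 17, 18, 18])
midday = time_signature([12, 13, 14, 14, 15, 16])

similar = cosine_similarity(commuter_a, commuter_b)
different = cosine_similarity(commuter_a, midday)
```

Here the two commuter-like stations score high and the midday station scores zero against them, because their active hours do not overlap at all.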