5. Big Data Promises
● No need for scientific method
● Predict disease outbreaks before the CDC
● Cure cancer
● Innovating healthcare
● Solve world hunger
● Bring about world peace
6.
7. Big Data Criticism
● Garbage in, Garbage out
● Ignores the role of the scientific method
● Lots of questions don’t require large
amounts of data to get good stats
● Privacy issues
8. Big Data is just another way to think about data
9. Mental Models
“A mental model is simply a representation of
an external reality inside your head. Mental
models are concerned with understanding
knowledge about the world.”
- Farnam Street Blog
12. Relational Resistance
Resistance to big data concepts, technologies,
and techniques because of belief that the
relational model is the only way to think about
data.
See also: Theory induced blindness
13.
14. Data Mental Models
● Relational
● Linked
● Object Oriented
● Geospatial
● Temporal
● Semantic
● Event Based
● Data as Code
● Bayesian
● Unstructured
16. “Big data is high volume, high velocity, and/or
high variety information assets that require new
forms of processing to enable enhanced
decision making, insight discovery and process
optimization.”
According to Gartner
18. Cathedral and Bazaar
Traditional Data
● Clean
● Top down
● Carefully collected
● Scales vertically
● One true way
Big Data
● Disorderly
● Bottom up
● Randomly collected
● Scales horizontally
● More than one way
19. Big Data Differences
Relational
● Normalization
● ACID
● SQL/Query
● Structured/Schema
Big Data
● Denormalization
● BASE
● MapReduce/Other
● Loosely Structured
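To make the SQL/Query versus MapReduce contrast concrete, here is a minimal single-process sketch of the MapReduce pattern applied to a word count. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample documents are assumptions made for illustration, and a real framework would run the map and reduce steps in parallel across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical sample input, chosen only for illustration.
docs = ["big data is big", "data is just data"]
counts = word_count(docs)
print(counts)
```

Each document is mapped independently and each key is reduced independently, which is the property that lets this style of processing scale horizontally.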
23. Information as an Asset
● Target specific customers' needs rather than
broad segments
● Just-in-time inventory management
● Evaluate demand for products
● Predict and track traffic patterns
24. Big Data and You
● What information do you have, that no one
else has?
● Can you easily integrate your data or is it
locked in silos?
● What data don’t you collect?
● What data don’t you archive?
26. Big Data Platforms
Cloud
● AWS
● Google
● Microsoft
Hadoop
● Cloudera
● MapR
● Hortonworks
This isn’t an all-inclusive list, but a sample of
the big players in the space.
27. Big Data Stack
● Batch Processing
● Data Collection
● SQL/Query
● Search
● Machine Learning
● Serialization
● Security
● Stream Processing
● File Storage
● Resource Management
● Online NoSQL
● Data Pipeline
30. What IS Data Science?
● Data science is statistics on a Mac
● A data scientist is a statistician who lives in
San Francisco
● Person who is better at statistics than any
software engineer and better at software
engineering than any statistician.
31.
32. The need for Data Science
● There is a LOT of data
● Too much data for people to look at it all
● Probabilistic models help extract signal from
the noise
● Need to automate the analysis and
exploitation of data
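As one example of a probabilistic model that automates this kind of analysis, here is a small naive Bayes text classifier with add-one smoothing. It is a sketch for illustration only: the NaiveBayes class, its train and classify methods, and the training messages are all assumptions, and it is not the implementation used by any tool named in this deck.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes classifier over whitespace tokens."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of examples
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.label_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            # Start from the log prior, then add one log likelihood per word.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing the score.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data, invented for this example.
model = NaiveBayes()
model.train("server disk failure alert", "incident")
model.train("disk latency alert on server", "incident")
model.train("weekly status newsletter", "routine")
model.train("team lunch newsletter", "routine")

prediction = model.classify("disk alert")
print(prediction)
```

Once trained, the model can label new messages without a person reading each one, which is the point of automating the analysis.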
34. Black Swans and Big Data
● There are fundamental limits to prediction
● Hard to predict rare events where no prior
data exists (i.e. Black Swans)
● Complex systems often have feedback loops
(e.g. stock market)
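The rare-event problem can be shown with a few lines of arithmetic. In the sketch below, an event that never appears in the history gets an empirical rate of exactly zero, even though the data only supports an upper bound on its probability. The function names and the 1,000-observation history are assumptions for illustration, and the bound assumes independent, identically distributed trials, which is exactly what feedback loops tend to violate.

```python
def empirical_rate(observations):
    """Fraction of observations in which the event occurred (1) vs. not (0)."""
    return sum(observations) / len(observations)


def upper_bound_zero_events(n, confidence=0.95):
    """Upper bound on the event probability after 0 events in n independent trials.

    Solves (1 - p) ** n = 1 - confidence for p. For 95% confidence this is
    close to the "rule of three" approximation, 3 / n.
    """
    return 1 - (1 - confidence) ** (1 / n)


# Hypothetical history: 1,000 observations and the event never happened.
history = [0] * 1000

rate = empirical_rate(history)
bound = upper_bound_zero_events(len(history))

print(f"empirical rate: {rate}")
print(f"95% upper bound: {bound:.5f}")
print(f"rule of three:   {3 / len(history):.5f}")
```

A model trained only on this history would treat the event as impossible, while the bound says it could still happen roughly three times in every thousand trials.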
36. Getting Started
Business
● Identify some unresolved questions
● Figure out what data could answer those questions
● Pick the easiest and test out your hypothesis
Technology
● Pick a technology you know or want to learn
● Pick a platform
● Pick a data set and identify some basic problems to solve
37. My Info
Twitter: @shawnhermans
Github: github.com/shawnhermans
Blog: http://shawnhermans.github.io/ (In Progress)
Slideshare: www.slideshare.net/shawnhermans/
Quora: http://www.quora.com/Shawn-Hermans
41. Soothsayer
● Simple HTTP/JSON
API for
training/classifying
data
● Lots of built in
classifier statistics
https://github.com/shawnhermans/soothsayer
Quote by http://en.wikiquote.org/wiki/George_E._P._Box
See http://www.bloomberg.com/news/2011-10-25/bias-blindness-and-how-we-truly-think-part-2-daniel-kahneman.html
Inspired by Eric Raymond’s Cathedral and the Bazaar - http://www.catb.org/esr/writings/cathedral-bazaar/introduction/
BASE (Basically Available, Soft state, Eventual consistency)
See CAP theorem for more details http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
Big data might not save the world, but it could entertain us
http://www.fastcodesign.com/1671893/the-secret-sauce-behind-netflixs-hit-house-of-cards-big-data
“Big Data and You” sounds like a good children’s book title.
This is the admin screen for Amazon Web Services. Not all of these services are Big Data, but it gives you a good idea of an integrated Big Data platform.
https://twitter.com/cdixon/status/428914681911070720
https://twitter.com/BigDataBorat/status/372350993255518208
https://twitter.com/josh_wills/status/198093512149958656
Although use of the term data science has exploded in business environments, many academics and journalists see no distinction between data science and statistics. Writing in Forbes, Gil Press argues that data science is a buzzword without a clear definition and has simply replaced “business analytics” in contexts such as graduate degree programs.[13] In the question-and-answer section of his keynote address at the Joint Statistical Meetings of the American Statistical Association, noted applied statistician Nate Silver said, “I think data-scientist is a sexed up term for a statistician....Statistics is a branch of science. Data scientist is slightly redundant in some way and people shouldn’t berate the term statistician.”[14]
From Drew Conway http://en.wikipedia.org/wiki/Data_science#mediaviewer/File:Data_Science_Venn_Diagram.png
See Nassim Taleb’s excellent essay The Fourth Quadrant - http://edge.org/conversation/the-fourth-quadrant-a-map-of-the-limits-of-statistics
See http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public for datasets