2. Taewook Eom
Data Programmer
Plaster(Planet Master)
of Big Data Infra
Pre-Assessor of Hiring Programmers
Mentor of 101 Startup Korea
Twitter: @taewooke
LinkedIn: http://kr.linkedin.com/in/taewookeom
http://www.flickr.com/photos/oreillyconf/10616622085/
3. Santa Clara
: Technical
New York
with Cloudera
: Financial, Business
Europe
: Privacy, Government
Boston
: Medical
http://strataconf.com/
by O’Reilly
Web 2.0
: Open, Sharing, Participation
Big Data
: Making Data Work
Change the World with Data.
4. Data
When hardware became commoditized,
software was valuable.
Now that software is being commoditized,
data is valuable.
– Tim O’Reilly, 2011
Data is like the blood of the enterprise.
– Amr Awadallah, CTO at Cloudera, 2013
5. What is Big Data?
All data that is not a fit for a traditional RDBMS,
whether used for OLTP or Analytics purposes
Big Data Architectural Patterns
http://strataconf.com/stratany2013/public/schedule/detail/30397
6. Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data
- Gartner, 2011
http://blog.vitria.com/Portals/47881/images/3values-resized-600.png
17. Big Data Space
No one tool is the right fit for all Big Data problems
Do not be afraid to recommend the right solution
for the problem over the popular solution
To do this, you must be aware of the entire ecosystem
Big Data Architectural Patterns
http://strataconf.com/stratany2013/public/schedule/detail/30397
18. Practical Performance Analysis and Tuning for Cloudera Impala
http://strataconf.com/stratany2013/public/schedule/detail/30551
19. Big Data Architectural Patterns
http://strataconf.com/stratany2013/public/schedule/detail/30397
20. Hadoop and the Relational Data Warehouse – When to Use Which?
http://strataconf.com/stratany2013/public/schedule/detail/30964
21. Defining your Big Data Arsenal: NoSQL, Hadoop, and RDBMS
http://strataconf.com/stratany2013/public/schedule/detail/29968
22. Each speaker is allocated five minutes of presentation time
and is accompanied by 20 presentation slides.
During presentations, each slide is displayed for 15 seconds
and then automatically advanced.
- http://en.wikipedia.org/wiki/Ignite_(event)
http://oreilly.com/pub/pr/2242
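The timing rule above works out to exactly five minutes: 20 slides at 15 seconds each is 300 seconds. A minimal sketch of that arithmetic follows; the function names are hypothetical helpers introduced only for illustration, not part of any Ignite tooling.

```python
# Ignite format as described on the slide: 20 slides, 15 seconds per slide.
SLIDES = 20
SECONDS_PER_SLIDE = 15


def total_talk_seconds(slides: int = SLIDES,
                       seconds_per_slide: int = SECONDS_PER_SLIDE) -> int:
    """Total talk length in seconds when every slide auto-advances."""
    return slides * seconds_per_slide


def slide_on_screen(elapsed_seconds: float) -> int:
    """1-based number of the slide showing after `elapsed_seconds` (hypothetical helper)."""
    if not 0 <= elapsed_seconds < total_talk_seconds():
        raise ValueError("elapsed time is outside the talk")
    return int(elapsed_seconds // SECONDS_PER_SLIDE) + 1


if __name__ == "__main__":
    print(total_talk_seconds() / 60, "minutes")
    print("slide at 2 min:", slide_on_screen(120))
```

The recorded talks listed later run a few seconds over five minutes, which is consistent with this budget plus introductions.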
23. Ignite Talks
Hilary: The Most Poisoned Name In US History - Hilary Parker
sudo make me a visualization! - Jeroen Janssens
Design as a Fulcrum for Societal Change: the influence of Jimmyjane on female sexuality - Lisa Green
Spaces in Between: The Transdisciplinary Niche to Type 1 Diabetes Living - Jorge Luna
Why are women better data scientists than men? - Carolyn Martin
Memoirs of a Prolific Moonlighter: A Chronic Writing Disorder…or Insanity? - Matthew Russell
The Data Behind H1B Visas - Melissa Smolensky
Signal Detection Theory: Man vs Machine - Kyle Redinger
Algorithms of Pain - Heather Fenby
Hadoop Playlist - Adam Kawa
Why a Data Community is like a Music Scene - Harlan Harris
A Tale of Two Kinds of Startups - Jen van der Meer
http://strataconf.com/stratany2013/public/schedule/detail/32182
24. Ignite
Signal Detection Theory: Man vs Machine
Co-Founder @VividCortex
Kyle Redinger
http://www.youtube.com/watch?v=Fg6mN-jevds
(5 minutes 6 seconds)
http://www.slideshare.net/realkyleredinger/man-vs-machine-signal-detection-theory-and-big-data
25. Signal Detection Theory: Man vs Machine
Remove the obvious and look at what is important
Remember: Less is more.
26. Ignite
A Tale of Two Kinds of Startups
CSO at Luminary Labs
Jen van der Meer
http://www.youtube.com/watch?v=0ooIs4cy5uM
(5 minutes 2 seconds)
http://www.slideshare.net/bettybluegreen/twokindsofstartups
29. Keynote
Towards Strata 2014
Director of market research at O’Reilly Media
Roger Magoulas
http://www.youtube.com/watch?v=Ytd5VkEgQf8
(5 minutes 26 seconds)
http://strataconf.com/stratany2013/public/schedule/detail/31935
http://www.oreilly.com/data/free/files/stratasurvey.pdf
34. Science is fundamentally about data,
but data is not fundamentally about science
Beyond R and Ph.D.s: The Mythology of Data Science Debunked
Douglas Merrill (ZestFinance)
http://www.youtube.com/watch?v=J2sgObXbIWY (8 minutes 9 seconds)
35. People
A data scientist is a data analyst who lives in California.
– George Roumeliotis (Intuit)
37. Data Businessperson: Business person, Leader, Entrepreneur
Data Creative: Artist, Jack-of-All-Trades, Hacker
Data Researcher: Scientist, Researcher, Statistician
Data Engineer: Engineer, Developer
http://datacommunitydc.org/blog/2012/08/data-scientists-survey-results-teaser/
http://cdn.oreillystatic.com/oreilly/radarreport/0636920029014/Analyzing_the_Analyzers.pdf
38. Scientists think they can code,
software engineers think they are scientists.
Team them up so they collaborate.
– Scott Sorenson (Ancestry.com)
Ancestry.com: Managing Big Data Reaching Back to the 11th Century with Hadoop
39. How Nordstrom Utilizes Human Intelligence to Blend Brick-and-Mortar with Online Commerce
http://strataconf.com/stratany2013/public/schedule/detail/30707
40. Data scientists spend their lives as data janitors
instead of leveraging their skills
– Wes McKinney (DataPad)
Building More Productive Data Science and Analytics Workflows
41. Keynote
Is Bigger Really Better?
Predictive Analytics
with Fine-grained Behavior Data
Professor at the NYU Stern School of Business
Foster Provost
http://www.youtube.com/watch?v=1jzMiAfLH2c
(10 minutes 16 seconds)
http://strataconf.com/stratany2013/public/schedule/detail/31685
42. Is Bigger Really Better?
Predictive Analytics with Fine-grained Behavior Data
44. Is Bigger Really Better?
Predictive Analytics with Fine-grained Behavior Data
Predictive does not mean actionable.
– Scott Sorenson (Ancestry.com)
Ancestry.com: Managing Big Data Reaching Back to the 11th Century with Hadoop
45. Is Bigger Really Better?
Predictive Analytics with Fine-grained Behavior Data
More data gives you more precision, not more prediction.
Using multiple datasets to reduce errors when measuring values.
- Ravi Iyer (Ranker.com)
Using Graphs of Data to Understand your Customers, Users, and Employees
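The "more precision" half of Ravi Iyer's point can be illustrated with a small simulation: averaging more noisy measurements of the same value shrinks the standard error of the estimate, but it says nothing about future values. This is a sketch under stated assumptions, not material from the talk; the true value, noise level, sample sizes, and function names are all hypothetical.

```python
import random
import statistics


def simulate_measurements(true_value, noise_sd, n, seed):
    """Draw n noisy measurements of one fixed quantity (all parameters are assumptions)."""
    rng = random.Random(seed)
    return [rng.gauss(true_value, noise_sd) for _ in range(n)]


def standard_error(samples):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(samples) / len(samples) ** 0.5


# Same quantity, same noise; only the amount of data differs.
small = simulate_measurements(true_value=10.0, noise_sd=2.0, n=100, seed=0)
large = simulate_measurements(true_value=10.0, noise_sd=2.0, n=10_000, seed=0)

if __name__ == "__main__":
    for name, data in (("n=100", small), ("n=10000", large)):
        print(name, "mean:", round(statistics.mean(data), 3),
              "standard error:", round(standard_error(data), 4))
```

With 100 times more measurements the standard error drops by roughly a factor of 10, so the value is measured more precisely. The model being estimated is unchanged, which is the sense in which more data does not by itself add predictive power.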
46. Is Bigger Really Better?
Predictive Analytics with Fine-grained Behavior Data
48. Keynote
Big Impact from Big Data
Head of Analytics at Facebook
Ken Rudin
http://www.youtube.com/watch?v=RJFwsZwTBgg
(11 minutes 57 seconds)
http://strataconf.com/stratany2013/public/schedule/detail/31903
50. Hadoop is a hammer,
but you need other tools along with it.
Designing Your Data-Centric Organization
Josh Klahr (Pivotal)
http://www.youtube.com/watch?v=D86udfrVzrI (12 minutes)
51. Big Impact from Big Data
The way you organize information
depends on the question
you intend to ask of it.
- Richard Saul Wurman
Building a Data Platform
52. HaDump
: Loading data into Hadoop
for no reason.
Data Science Without a Scientist
http://strataconf.com/stratany2013/public/schedule/detail/31801
53. Big Impact from Big Data
Technical people still don't understand the business needs of business people!
Business people don't know what a table is.
- Anurag Tandon (MicroStrategy)
Inject Big Data into your Corporate DNA: Enable Every Employee to Make Data Driven Decisions
54. Ask the Right Questions
Organizations already have people who know their own data
better than mystical data scientists.
Learning Hadoop is easier than learning the company’s business.
- Gartner, 2012
Defining your Big Data Arsenal: NoSQL, Hadoop, and RDBMS
http://strataconf.com/stratany2013/public/schedule/detail/29968
55. Non-linear Storytelling: Towards New Methods and Aesthetics for Data Narrative
http://strataconf.com/stratany2013/public/schedule/detail/30207
56. Every Soldier is a Sensor: Countering Corruption in Afghanistan
http://strataconf.com/stratany2013/public/schedule/detail/30828
60. Value of Data
Usable < Useful < Actionable
with Impact
If you can't answer "so what?",
you only have facts, not insight
- Baron Schwartz (VividCortex Inc)
Making Big Data Small
Descriptive (Easy): What happened?
Predictive (Medium): What will happen?
Prescriptive (Hard): What should we do about it?
Hadoop & Data Science for the Enterprise
61. The Future of Hadoop
: What Happened
& What's Possible?
Co-Founder of Hadoop
Doug Cutting
http://www.youtube.com/watch?v=_WwuZI6AhN8
(14 minutes 41 seconds)
http://strataconf.com/stratany2013/public/schedule/detail/31591
Big Data is the first industry that was created
by open source.
- Jack Norris (MapR Technologies)
Separating Hadoop Myths from Reality
Hadoop is the kernel of the OS for data.
62. Hadoop's Impact on the Future of Data Management
Mike Olson (Cloudera)
http://www.youtube.com/watch?v=puHS2JNKgRM
http://strataconf.com/stratany2013/public/schedule/detail/31380
63. Single
: S/W & H/W system
: security model
: management model
: metadata model
: audit model
: resource management model
Common
: storage & schema
http://www.slideshare.net/cloudera/enterprise-data-hub-the-next-big-thing-in-big-data
64. Unifying Your Data Management Platform with Hadoop: Batch and Real-time Machine Data Ingest, Alerts, and Analytics
http://strataconf.com/stratany2013/public/schedule/detail/30282
65. Last generation of data management is not sufficient
More copies, representations, transformations increase risk
Index once and reuse across workloads, lifecycle
NoSQL: indexing and updates for interactive apps
Hadoop: staging, persistence, and analytics
Data Governance for Regulated Industries Using Hadoop
http://strataconf.com/stratany2013/public/schedule/detail/30738
66. Data Intelligence
Rethink How You See Data
Sharmila Shahani-Mulligan (ClearStory Data)
http://www.youtube.com/watch?v=07hGulTOZGk (9 minutes 6 seconds)
http://strataconf.com/stratany2013/public/schedule/detail/31742
67. The Data Availability Problem
Information Supply Chain (figure)
Stages shown: Question, Access, Sampling, Modeling, Loading, Analysis & Discovery, Presentation, Insight
Data Prep – too slow!
Introducing a New Way to Interact with Insight
http://strataconf.com/stratany2013/public/schedule/detail/31743
68. Running Non-MapReduce Big Data applications on Apache Hadoop
http://strataconf.com/stratany2013/public/schedule/detail/30755
69. Apache HBase for Architects
http://strataconf.com/stratany2013/public/schedule/detail/30619
What’s Next for Apache HBase: Multi-tenancy, Predictability, and Extensions.
http://strataconf.com/stratany2013/public/schedule/detail/30857
70. Securing the Apache Hadoop Ecosystem
http://strataconf.com/stratany2013/public/schedule/detail/30302
71. An Introduction to the Berkeley Data Analytics Stack With Spark, Spark Streaming, Shark, Tachyon, and BlinkDB
http://strataconf.com/stratany2013/public/schedule/detail/30959
72. Schema
Information does not exist until a schema is defined
and data is stored in a relational database
- anonymous
Building a Data Platform
http://strataconf.com/stratany2013/public/schedule/detail/31400
73. Lessons Learned From A Decade’s Worth of Big Data At The U.S. National Security Agency (NSA)
http://strataconf.com/stratany2013/public/schedule/detail/30913
74. Managing a Rapidly Evolving Analytics Pipeline
http://strataconf.com/stratany2013/public/schedule/detail/30635
76. SQL on/in Hadoop/HBase Solutions
Stinger/Tez
Shark
Perception is Key: Telescopes, Microscopes and Data
http://strataconf.com/strataeu2013/public/schedule/detail/32351
77. All SQL on Hadoop Solutions are
Missing the Point of Hadoop
Every Solution makes you define a schema
- SQL(Structured Query Language) is expressed over an assumed schema
Major reasons why Hadoop has taken off include:
- Ability to load data without defining a schema
- Process data using schema-on-read instead of first defining a schema
Hadoop contains a lot of:
- Raw, granular data sets with potentially inconsistent schemas
- Data sets in JSON, key-value, and other self-describing (non-relational) models
designed for schema-on-read processing
SQL on Hadoop solutions that make you first define a schema are missing
a major part of Hadoop’s usage patterns
Flexible Schema and the End of ETL
http://strataconf.com/stratany2013/public/schedule/detail/31868
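The schema-on-read idea described above can be sketched in a few lines: raw JSON records are stored as-is, with inconsistent fields, and a schema is applied only when a particular question is asked. This is an illustrative sketch, not code from the talk; the sample records, field names, and the read_with_schema helper are hypothetical.

```python
import json

# Raw, self-describing records stored without any declared schema (hypothetical data).
# Note the inconsistencies: a missing field, a renamed key, an extra field.
RAW_LINES = [
    '{"user": "a", "event": "play", "track_id": 1}',
    '{"user": "b", "event": "skip"}',
    '{"user_id": "c", "event": "play", "track_id": 2, "device": "mobile"}',
]

# A schema chosen at read time: output field -> (candidate source keys, default value).
PLAY_SCHEMA = {
    "user": (("user", "user_id"), None),
    "event": (("event",), None),
    "track_id": (("track_id",), None),
}


def read_with_schema(lines, schema):
    """Parse each raw line and project it onto `schema`, tolerating missing or renamed keys."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (candidates, default) in schema.items():
            # Take the first candidate key present in this record, else the default.
            row[field] = next((record[k] for k in candidates if k in record), default)
        yield row


rows = list(read_with_schema(RAW_LINES, PLAY_SCHEMA))

if __name__ == "__main__":
    for row in rows:
        print(row)
```

A different question could reuse the same raw lines with a different schema mapping, and no reload or up-front table definition is needed. That is the usage pattern the slide says schema-first SQL layers miss.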
79. Hadoop Adventures At Spotify
http://strataconf.com/stratany2013/public/schedule/detail/30570
81. Quick prototyping is the fastest way to internal advocacy. Ship It!
Cloud == Speed
We don’t always need a complicated solution. KISS
Play to your differentiating strengths. Experience >> Data
Bias towards impact.
It Takes a Village
EASE!! (Emulate, Analyze, Scale, Evaluate)
How Nordstrom Utilizes Human Intelligence to Blend Brick-and-Mortar with Online Commerce
http://strataconf.com/stratany2013/public/schedule/detail/30707
Prototyping is key to overcoming resistance to change
Technical architecture is heavily influenced by people organization
Developing a team of experienced Hadoop users can often be done
using internal employees
A culture of experimentation and innovation yields the best result
Ancestry.com: Managing Big Data Reaching Back to the 11th Century with Hadoop
http://strataconf.com/stratany2013/public/schedule/detail/30499
84. References
Strata Conference + Hadoop World 2013 Keynotes & Interviews
http://www.youtube.com/playlist?list=PL055Epbe6d5ZtziVAooUC04i1hL_Z9Xvk
Slides & Video
http://strataconf.com/stratany2013/public/schedule/proceedings
Tweets
https://twitter.com/search?q=%23strataconf #strataconf
85. How Nordstrom Utilizes Human Intelligence to Blend Brick-and-Mortar with Online Commerce
http://strataconf.com/stratany2013/public/schedule/detail/30707
http://nordstrom.github.io/stratanyc/
87. Building a production machine learning infrastructure
http://www.slideshare.net/joshwills/production-machine-learninginfrastructure
88. Text Analytics at Scale: Listening to 45 Million Customers
http://strataconf.com/stratany2013/public/schedule/detail/30757
89. Words + Numbers = Insights
Text Analytics at Scale: Listening to 45 Million Customers
http://strataconf.com/stratany2013/public/schedule/detail/30757