©latitude51north.com
Strata+Hadoop World
London 2015
May 2015
{“about” : “me”}
Harvinder Atwal
MoneySuperMarket.com
• current : Head of Customer Insight and Marketing Optimisation
dunnhumby
• previous : Insight Director, Tesco Clubcard
Lloyds Banking Group
• previous : Senior Manager, Customer Strategy and Insight
Web
@harvindersatwal
latitude51north
harvindersatwal
www.latitude51north.com
harvinder.s.atwal@gmail.com
What is Strata+Hadoop World?
Presented by O’Reilly and Cloudera, Strata + Hadoop World is where big data, cutting-
edge data science, and new business fundamentals intersect—and merge.
• The first day consisted of software tutorials and deep dives, mostly on software in
the Hadoop ecosystem, with many given by the software's authors or contributors. This
provided an excellent opportunity to take a closer look at a particular technology and
ask in-depth questions of the people in the know.
• On the second day the conference proper started. Although there are now four
Strata + Hadoop World conferences a year, it offered a packed schedule of speakers from
many of the industry’s leading organisations. Speakers this year included people from
Barclays, Google, CERN, Accenture, Pivotal, Databricks, Dato, MapR,
comparethemarket.com and many more.
The quality of presentations was mixed. Here
are my favourite keynotes.
1. Shazam - http://www.youtube.com/embed/mcTPvxo8SXY?autoplay=1
2. Ideas that Matter - We're always talking about "innovation", but - says Tim Harford - there
are really two very different kinds of innovation. Using stories from sports, science, music, and
military history, Tim will make you think differently about where good ideas come from and how
they should be encouraged. http://www.youtube.com/embed/ohCavVVxX0M?autoplay=1
3. Is Privacy Becoming a Luxury Good? Julia Angwin (ProPublica) We are being watched –
by companies, by the government, by our neighbors. Technology has made powerful
surveillance tools available to everyone. And now some of us are investing in counter-
surveillance techniques and tactics.
http://www.youtube.com/embed/fsWAZIfqPuU?autoplay=1
4. Overview of BT's internal multi-tenant Hadoop platform - Phil Radley (Chief Data
Architect at British Telecom) gave an overview of BT's internal multi-tenant Hadoop platform.
He explained their first production use case (master data management of BT UK Business
Customer data) and gave a flavour of their use case pipeline. https://youtu.be/YMoVShk5D
5. Julie Meyer (Ariadne Capital) - http://www.youtube.com/embed/a8u-bOoqYA4?autoplay=1
The most talked about technology at Strata +
Hadoop World was…
• Naturally, ‘Hadoop’ itself comes out on top with Spark
coming in a close second.
• Both Hadoop and Spark are frameworks to process very
large datasets on commodity computer clusters.
• Hadoop, Spark and many of the other most talked about
technologies (Hive, HDFS, Kylin, HBase, etc.) are Apache
Software Foundation open source projects.
• The Foundation is now responsible for most of the
developments in Big Data tech.
• Scala is a relatively new programming language gaining
rapid traction, especially for productionised machine
learning applications. It runs on the Java Virtual Machine
and is interoperable with Java libraries, but it is far
less verbose than Java and easier to code in. It also fully
supports functional programming, which is very
fashionable.
A plot produced by running a quick scoring algorithm on
data from the Strata session outlines. It clearly shows
the most talked about technologies at the
conference. (NB: SAS ranks high only because they were
conference sponsors and presenters, not because SAS was
a much talked about technology.)
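The scoring algorithm behind the plot is not described in the slides. As a rough illustration only, a minimal sketch could count how many session outlines mention each technology. The outlines and the technology list below are made up for the example, and the counting rule (one point per session that mentions the name) is an assumption, not the author's method.

```python
import re
from collections import Counter

# Hypothetical session outlines; the real Strata outlines are not reproduced here.
SESSION_OUTLINES = [
    "Spark and Hadoop in production: lessons from running Spark on HDFS.",
    "Querying HBase with Hive and moving from Hadoop MapReduce to Spark.",
    "Real-time pipelines with Kafka feeding a Hadoop cluster.",
]

# Technologies to look for (an illustrative list, not the author's).
TECHNOLOGIES = ["Hadoop", "Spark", "Hive", "HDFS", "HBase", "Kafka"]


def score_mentions(outlines, technologies):
    """Count the sessions whose outline mentions each technology at least once."""
    scores = Counter()
    for outline in outlines:
        # Tokenise on letters and digits so punctuation does not hide a name.
        words = set(re.findall(r"[a-z0-9]+", outline.lower()))
        for tech in technologies:
            if tech.lower() in words:
                scores[tech] += 1
    return scores


scores = score_mentions(SESSION_OUTLINES, TECHNOLOGIES)
for tech, count in scores.most_common():
    print(f"{tech}: {count}")
```

Counting sessions rather than raw word frequency stops one long outline from dominating the ranking, although sponsor names would still score highly, as the SAS caveat notes.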
My Key take-aways
• Apache Foundation Open-Source Software has become the industry standard
for Big Data processing, storage, and increasingly querying and analysis.
• Some examples you may have heard of: Hadoop, Spark, Cassandra, HBase, Kafka.
• Spark is likely to supplant Hadoop as the Big-Data processing platform of choice.
• Data Lakes and how to deal with large quantities of streaming data are two
hot topics in architecture.
• A data lake is a storage repository that holds a vast amount of raw data in its native format until
it is needed. Data lakes enable greater agility and a wider range of applications because the raw data is always
available.
• Lambda architecture is the common solution for processing large quantities of streaming data.
• There were several Tools and Techniques we should explore further. Ivory
looks very useful:
• Ivory is an open source package that can speed up model building and analysis by turning raw
Event data (e.g. customer enquiries) into a summarisation at a point in time by entity (enquiries
by Customer in previous 12 months at 31 May 2015).
Key theme 1 – Apache
Foundation tech
The Apache ecosystem for Big Data is growing rapidly
and it’s getting confusing!
So first a history lesson…
• In the early 2000s Google was finding it challenging to store and process the exploding
volume of content on the Internet.
• Sanjay Ghemawat and Jeffrey Dean, senior researchers at Google, wrote a series of seminal
papers that defined the way Google and everyone else since cracked the problem.
• In order to cope, Google invented a novel style of data processing known as MapReduce, a
new way of saving data called the Google File System (GFS), and an original way to store data,
BigTable, a Distributed Database.
“Google is living a few years in the future and sends the rest of us
messages,” Doug Cutting, Hadoop founder
Wordcount is the canonical example for
MapReduce
By processing in parallel you overcome the limits of a single machine and can
scale by simply adding more nodes to the cluster
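The word count diagram from the slide is not in this transcript, so here is a minimal single-machine sketch of the same idea in plain Python. The function names and sample documents are illustrative; on a real cluster the map and reduce phases would run in parallel on different nodes and the framework would handle the shuffle.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]


def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


# Each document could be mapped on a different node; here they run in sequence.
documents = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = chain.from_iterable(map_phase(doc) for doc in documents)
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

Because each call to map_phase only needs its own document, and each reduce only needs the values for its own key, the work splits cleanly across machines.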
Google’s papers led to the development of
Open Source implementations
• Inspired by Google’s MapReduce paper, Doug Cutting developed Hadoop at Yahoo.
• For the first time it enabled companies to process huge quantities of data on cheap
commodity hardware.
• GFS inspired HDFS, Hadoop’s Distributed File System.
• BigTable inspired HBase, a non-relational database able to host millions of columns
and billions of rows.
• Many other applications have since been developed to build the Hadoop
Ecosystem:
• Pig – a language for querying using Hadoop
• Hive – a Hadoop layer for querying using a SQL-like language
• Mahout – a machine learning library
• However, despite these abstractions, programming in Hadoop is still not straightforward.
The Enterprise Hadoop market is now big
business
Three vendors dominate the market.
All have similar offerings.
MapR’s solution is the best if you want to try out Hadoop/Spark for yourself:
http://doc.mapr.com/display/MapR/Home
Spark is a complete game changer
• Spark is an engine for large-scale data processing that seems to be in the process of replacing
the MapReduce paradigm. What seems to be driving Spark’s adoption at the moment is its raw
speed. It claims to run up to 100 times faster than MapReduce in memory and 10 times
faster on disk.
• Like MapReduce, it works with the filesystem to distribute your data across the cluster and
process that data in parallel. However, Spark tries to keep things in RAM (fast),
whereas MapReduce keeps shuffling things in and out of disk (slow).
• Spark is also much more powerful and expressive in terms of how you give it instructions to
crunch data, abstracting away a lot of complexity and allowing more interactive analysis of
data.
• Spark 1.4 will feature first-class support for R. With version 1.4, the SparkR
project will be officially integrated, which means R will join Java, Scala and Python as a fully
supported language.
• http://cdn.oreillystatic.com/en/assets/1/event/126/Apache%20Spark_%20What_s%20new_%20what_s%20coming%20Presentation.pdf
Spark has its own ecosystem extending
usability
• Spark is capable of working with data stored inside a
Hadoop cluster, can use data stored in Amazon’s S3
and can work with data stored locally, which means
it’s really easy to experiment with.
• You can perform interactive analysis of large datasets
without sampling and have the same architecture for
insight and production.
• Spark can be used for analysing live streaming data
(web logs, sensors, social media, etc.) using the same
API as batch data.
• The Spark ecosystem features MLlib, a scalable
machine learning library.
• GraphX is a component for social network analysis,
fraud detection, recommendations and other graph
analysis.
• Spark is also useful for less glamorous jobs like ETL.
I’ve been trying out Spark via the Python
API and can confirm it’s fast
ANALYSING LOG DATA TO COUNT THE NUMBER OF UNIQUE HOSTS PER DAY
MapR’s Sandbox is the best way to get started with Spark https://www.mapr.com/blog/getting-started-spark-mapr-sandbox#.VZOxjPlViPw
Spark can also be used with Vertica http://www.vertica.com/wp-content/uploads/2013/12/Vertica_MapR_Solution-Brief_Feb-2014.pdf
Although Spark abstracts away
some of the complexity of Hadoop,
the limited number of Actions and
Transformations available still
requires a shift in mindset and more
steps compared to SQL and SAS.
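The code shown on the slide did not survive in this transcript. As a stand-in, the sketch below walks through the same kind of job (unique hosts per day from web logs) in plain Python so it runs without a cluster. The log lines are invented, and the comments only indicate roughly which PySpark RDD operations each step would correspond to; it is not the original code.

```python
from collections import defaultdict

# Hypothetical log lines in Common Log Format; the data used on the slide is not available.
LOG_LINES = [
    '10.0.0.1 - - [01/Aug/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 1839',
    '10.0.0.2 - - [01/Aug/1995:00:05:12 -0400] "GET /about.html HTTP/1.0" 200 912',
    '10.0.0.1 - - [01/Aug/1995:09:30:45 -0400] "GET /index.html HTTP/1.0" 304 0',
    '10.0.0.3 - - [02/Aug/1995:11:02:33 -0400] "GET /index.html HTTP/1.0" 200 1839',
]


def parse_line(line):
    """Return (day, host) for one log line."""
    host = line.split(" ", 1)[0]
    day = line.split("[", 1)[1].split(":", 1)[0]
    return day, host


# Roughly rdd.map(parse_line).distinct() in PySpark: unique (day, host) pairs.
day_host_pairs = set(parse_line(line) for line in LOG_LINES)

# Roughly a map to (day, 1) followed by reduceByKey: one count per unique pair.
hosts_per_day = defaultdict(int)
for day, _host in day_host_pairs:
    hosts_per_day[day] += 1

# Roughly sortByKey() then collect(); a string sort is enough for this small sample.
for day, n_hosts in sorted(hosts_per_day.items()):
    print(day, n_hosts)
```

In SQL this would be a single COUNT(DISTINCT host) ... GROUP BY day, which illustrates the extra steps the transformation style requires.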
HBase and Cassandra were the most talked
about databases for big data
• HBase is an open source, non-relational, distributed database
modelled after Google's BigTable. Apache Impala is commonly used
to query HBase data.
• Apache Cassandra was initially developed at Facebook by Avinash
Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik
to power the Inbox Search feature. Cassandra is essentially a
hybrid between a key-value and a column-oriented (or tabular)
database.
• Apache Accumulo is a key/value store based on Google BigTable.
• Hypertable is another open source database inspired by you-know-who.
Because Hypertable keeps data physically sorted by a primary
key, it lends itself to applications that require fast access to ranges of
data (e.g. analytics, sorted URL lists, messaging applications, etc.).
• The big news at Strata, though, was that Google has made BigTable
itself publicly available through Google Cloud. Because they started it
all, they claim it’s faster and better than everything else.
Apache Drill looks exciting; use SQL to query
multiple NoSQL data sources
• Apache Drill is a Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
• Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS,
MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query
can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with
a directory of event logs in Hadoop.
Good example here of Drill usage: https://www.mapr.com/blog/apache-drill-its-drilliant-query-json-files-tableau-desktop
Query multiple data types and sources using SQL. Drill even connects to Tableau for visualisation of output.
Key theme 2 – Enterprise Data
Hubs, Data Lakes (or Data
Swamps).
There’s a trend away from fixed to flexible
data structures
• Traditionally, data and code have been tightly
coupled to a schema to support specific
applications.
• This causes serious problems when you realise
you have a new application for your data but your
original schema won’t support it.
• The ADM and MDM in MSM are very good
examples of this problem. Both are designed for
specific use cases (Insight and Campaigns).
• If the use cases require new data,
expensive development is needed.
• A lot of campaign data useful for insight is not
available in the ADM, and vice versa.
Schema-on-write is the traditional approach to
processing and storing data.
Data goes through an ETL process to make it
uniform and fit the predefined schema, or it’s
dropped.
Flexible data structures try to overcome
traditional limitations
• The schema-on-read approach takes the same raw
data but lands it (relatively unprocessed) all in the
same place.
• Then, instead of building a series of applications on
top of custom schemas, you make the data
dynamically available for various services through
code.
• This is a very different way to use data but
provides much more agility.
• As data users we can get our heads round this, but
it will appear completely upside down (possibly
insane) to traditional data architects.
Schema-on-read keeps the data in
raw format.
A schema is only applied as you
decide how to use the data through
code.
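To make the contrast concrete, here is a minimal sketch in plain Python. The event records, field names and the two helper functions are invented for illustration; a real data lake would hold files in HDFS or similar rather than a list of strings.

```python
import json

# Hypothetical raw events landed as JSON lines, untouched.
RAW_EVENTS = [
    '{"customer": "c1", "channel": "motor", "device": "mobile"}',
    '{"customer": "c2", "channel": "home"}',
    '{"customer": "c1", "channel": "motor", "campaign": "spring"}',
]

# The fixed schema agreed up front for the schema-on-write table.
WRITE_SCHEMA = ("customer", "channel")


def schema_on_write(raw_lines, schema):
    """Keep only the predefined fields at load time; everything else is dropped."""
    table = []
    for line in raw_lines:
        record = json.loads(line)
        table.append({field: record.get(field) for field in schema})
    return table


def schema_on_read(raw_lines, fields):
    """Apply a schema only when the data is read, so each use picks its own fields."""
    for line in raw_lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}


warehouse = schema_on_write(RAW_EVENTS, WRITE_SCHEMA)
campaign_view = list(schema_on_read(RAW_EVENTS, ("customer", "campaign")))
print(warehouse[2])       # the campaign field was lost at load time
print(campaign_view[2])   # the campaign field is still recoverable from the raw data
```

The cost shows up in the second function: every consumer has to know the raw format and handle missing fields itself.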
Schema-on-Read is one of the key differences between
a Data Warehouse and Data Lake
• A data lake is a storage repository that holds a vast
amount of raw data in its native format until it is
needed.
• There are advantages and disadvantages to Data Lakes
• Pros:
• Agility – Data is always available for any use case.
Moves us away from a data-centric to a use-centric
view.
• Value – Improves data discovery and advanced
analytics capability.
• Cons:
• Onus on user – Puts pressure on user to
understand raw data and write more code.
• Skills – Exploiting raw data requires an advanced
skillset.
Cloudera’s compromise is to hold data in different layers – an approach
adopted by CTM, Goldman Sachs and others
• Raw data always available and
readable by Spark, Drill etc.
• An enriched view available to
Analysts and Data Scientists using
Spark, Impala, Drill etc.
• Shared layer available to business
with multiple sources joined
together.
• Optimised Layer used to
operationalise the use cases and
organised by data consumer not
source. Optimised for
performance.
• A speed layer following lambda
architecture principles for real-
time analysis.
• There can be more or fewer layers.
Diagram labels: Data Sources, Raw Layer, Discovery Layer, Shared Layer, Optimised Layer, Speed Layer, Data Consumers.
Further reading
• Information architecture for Apache Hadoop - Mark Samson (Cloudera)
• http://cdn.oreillystatic.com/en/assets/1/event/126/Information%20architecture%20for%20Apache%20Hadoop%20Presentation.pptx
• It ain’t what you do to data, it’s what you do with it (Silicon Valley Data
Science)
• http://cdn.oreillystatic.com/en/assets/1/event/126/It%20ain%E2%80%99t%20what%20you%20do%20to%20data,%20it%E2%80%99s%20what%20you%20do%20with%20it%20Presentation.pptx
• http://svds.com/art-abstraction
• Systems that enable agility
• https://speakerdeck.com/ept/systems-that-enable-data-agility
Lambda architecture is the standard for
processing streaming Big Data
• Lambda architecture is a data-processing
architecture designed to handle massive
quantities of data by taking advantage of
both batch- and stream-processing methods.
• Nathan Marz designed this generic
architecture addressing common
requirements for big data based on his
experience working on distributed data
processing systems at Twitter.
• The Batch Layer manages the master dataset
and recomputes batch views infrequently.
• The Speed Layer serves real-time queries; its
data is dropped as soon as the batch layer
has processed it.
• The Serving Layer brings together the Batch
and Speed layer views so they can be queried.
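The three layers can be sketched as a toy, single-process page-view counter. The class and method names are invented for illustration, and everything runs in memory; in practice the batch layer might be a Hadoop or Spark job and the speed layer a stream processor.

```python
from collections import Counter


class LambdaCounter:
    """Toy page-view counter with batch, speed and serving layers."""

    def __init__(self):
        self.master_dataset = []     # batch layer: append-only record of every event
        self.batch_view = Counter()  # recomputed infrequently from the master dataset
        self.speed_view = Counter()  # incremental counts for events newer than the batch view

    def ingest(self, page):
        """New events go to both the master dataset and the speed layer."""
        self.master_dataset.append(page)
        self.speed_view[page] += 1

    def run_batch(self):
        """Recompute the batch view from scratch, then drop the speed data it now covers."""
        self.batch_view = Counter(self.master_dataset)
        self.speed_view.clear()

    def query(self, page):
        """Serving layer: merge the batch and speed views."""
        return self.batch_view[page] + self.speed_view[page]


counter = LambdaCounter()
for page in ["home", "quote", "home"]:
    counter.ingest(page)
before_batch = counter.query("home")  # answered from the speed view only
counter.run_batch()
counter.ingest("home")
after_batch = counter.query("home")   # batch view plus one event seen since
print(before_batch, after_batch)
```

Recomputing the batch view from the full master dataset means a bug in the speed layer is corrected at the next batch run, which is a large part of the architecture's appeal.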
Useful tools
Ivory looks like an incredibly useful package for speeding
up the modelling pipeline
• Ivory is a package from Ambiata dubbed a
datastore for features.
• Very commonly in analysis or the modelling
process we need to know the states and/or
data summarisations for entities at historic
points in time, e.g. number of enquiries in the
month before PSD, whether the enquirer had ever
previously bought Motor, number of visits in the
previous month, three months, 12 months,
etc.
• Ivory allows you to easily create
these features from a table of Events.
https://speakerdeck.com/ambiata/improving-feature-engineering-in-the-lab-and-production-with-ivory
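To show what a point-in-time feature looks like, here is a minimal sketch in plain Python. It does not use Ivory or its API; the event table, the function name and the window lengths are all made up for illustration.

```python
from datetime import date, timedelta

# Hypothetical event table: (customer, event type, event date).
EVENTS = [
    ("c1", "enquiry", date(2014, 7, 1)),
    ("c1", "enquiry", date(2015, 5, 20)),
    ("c1", "enquiry", date(2015, 6, 2)),   # after the snapshot date, so ignored
    ("c2", "enquiry", date(2013, 1, 15)),  # more than 12 months before the snapshot
    ("c2", "enquiry", date(2015, 3, 9)),
]


def count_events(events, event_type, snapshot, window_days):
    """Count events per entity in the window (snapshot - window_days, snapshot]."""
    window_start = snapshot - timedelta(days=window_days)
    counts = {}
    for entity, kind, when in events:
        if kind == event_type and window_start < when <= snapshot:
            counts[entity] = counts.get(entity, 0) + 1
    return counts


snapshot_date = date(2015, 5, 31)
enquiries_12m = count_events(EVENTS, "enquiry", snapshot_date, window_days=365)
enquiries_1m = count_events(EVENTS, "enquiry", snapshot_date, window_days=31)
print(enquiries_12m, enquiries_1m)
```

Excluding events after the snapshot date is the important part: it keeps information from the future out of features used to train a model.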
Machine Learning
The future of machine intelligence and why it matters -
Shivon Zilis (Bloomberg Beta)
• How machine learning can make your life easier.
• http://cdn.oreillystatic.com/en/assets/1/event/126/The%20future%20of%20machine%20intelligence%20and%20why%20it%20matters%20Presentation%201.pdf
• Some very good resources/tools were mentioned in this presentation:
• Meeting Preparation: http://quid.com/
• Scheduling: https://claralabs.com/ https://x.ai/
• Competitive Analysis: http://mattermark.com/ https://app.datafox.co/
• Conference Calls: http://www.gridspace.com/
• Talent: https://www.textio.com/
• Emails: http://www.inboxvudu.com/
Forecasting space-time events – Predictive policing,
Minority Report style
• This session uses the speaker’s experience in building a crime forecasting package to outline some tools and techniques useful in modelling
space-time event data. https://www.hunchlab.com/
• Concepts
• While many data scientists work with data that includes geographic information, this data is often used in rather rudimentary ways or
limited to vector data sets such as the point locations of stores or users. The session will introduce the strengths and weaknesses
behind raster-based geographic analysis. Some challenges faced when modelling data at a fine geographic and temporal resolution
will be discussed. For example, how can uncertainty around the time of occurrence for events be represented?
• GeoTrellis - http://geotrellis.io/
• The case study leverages the open source GeoTrellis framework to conduct geographic processing. GeoTrellis is currently an
incubating project within the Eclipse Foundation’s LocationTech working group. The project provides fast and scalable geographic
processing with an emphasis on raster-based analysis and routing through transportation networks. Already written in Scala, GeoTrellis
is currently being extended to integrate with Apache Spark.
• Modelling
• The modelling pipeline within the case study consists of several loosely coupled components. In addition to GeoTrellis, the project
uses R for machine learning and the Amazon Simple Workflow service for pipeline orchestration. The presentation will outline the
basic structure of the modelling process including details of the statistical techniques utilized within the process.
• Several statistical techniques were examined throughout the development of the project. The final approach included a stacked model
incorporating a gradient boosting machine (GBM) to model the presence of events, and a generalized additive model (GAM) to
transform these predictions into expected counts. The session will conclude by outlining some approaches to evaluating predictive
accuracy for these types of data sets.
Data Visualisation and
prototyping
PDF is the second biggest religion in the UK
and other amusing insights
• Visualizing the world's largest democratic exercise
• The election results page for the 2014 Indian general elections was hosted on
CNN-IBN and bing.com. The focus was on real-time analysis of results for users
and TV anchors. With over 540 million voters and 100 million viewers, the
volume and complexity of data both provide a design challenge. This talk
focuses on the techniques behind this design.
• https://gramener.com/shows/slides/strata#/
• Situational awareness: This is not the data you're looking for
• http://cdn.oreillystatic.com/en/assets/1/event/126/Situational%20awareness_%20This%20is%20not%20the%20data%20you_re%20looking%20for%20Presentation.pdf
Accenture gave a great talk on the case for
building an in-house data insights lab
• Accenture talked about the challenges with traditional approaches
for getting buy-in for data science solutions within companies.
• Their solution is to make concepts practical for clients by using open
source technology and open data, creating visualisations, and mocking
up prototypes in a few weeks.
• This requires a multi-disciplinary team of data scientists, engineers
and designers who can use cutting edge technologies to bring
concepts to life – ‘The Technology Lab’.
• Their case study was a US bank client. Accenture had difficulty
convincing the senior execs of the value of a risk dashboard. The
Technology Lab decided to build a prototype using open source
software and open mortgage loan/default data. Once the execs got
their hands on the prototype they soon gave the go-ahead for a
version using internal data.
• McKinsey also have a similar Digital Labs concept:
http://www.mckinsey.com/client_service/mckinsey_digital/expertise/digital_labs
Other presentations
Advanced Machine Learning
• Deploying machine learning in production: what could possibly go wrong? - Alice Zheng (Dato)
• Building and deploying predictive applications require knowing how to evaluate, test, and track the performance
of machine learning models over time. Using available off-the-shelf tools, this talk engages potential application
builders on topics such as common evaluation metrics, A/B testing set up, tracking model performance, tracking
usage via real-time feedback, and updating models.
• http://cdn.oreillystatic.com/en/assets/1/event/126/Deploying%20machine%20learning%20in%20production%20Presentation%201.pdf
• Scalable machine learning
• While the data management side of Big Data has seen tremendous progress in the past few years, bringing
technologies like Hadoop or Spark together with advanced machine learning and data analysis methods is still a
major challenge. In this talk, I will discuss recent advances, approaches, and patterns which are used to build
truly scalable machine learning solutions.
• http://cdn.oreillystatic.com/en/assets/1/event/126/Scalable%20machine%20learning%20Presentation.pdf
• Deep learning made doubly easy with reusable deep features
• http://blog.dato.com/deep-learning-blog-post
Data Science
• The curiosity advantage: the most important skill for data science
• Curiosity is one of the most valued skills for people working in Data Science. But
how can we train it? Einstein said that "Curiosity is an important trait of a
genius". Let’s explore how we can develop our curiosity with three exercises in
the session: how to find pleasure in uncertainty; question the question we’re
asking; and find a beginner's mind. With direct application to data science.
• http://cdn.oreillystatic.com/en/assets/1/event/126/The%20curiosity%20advantage_%20the%20most%20important%20skill%20for%20data%20science%20Presentation.pdf
• Measuring the benefit effect for customers with Bayesian predictive modelling
• http://cdn.oreillystatic.com/en/assets/1/event/126/Measuring%20the%20benefit%20effect%20for%20customers%20with%20Bayesian%20predictive%20modeling%20Presentation.pdf
Other
• Using Data for evil
• http://www.slideshare.net/DuncanRoss1/using-data-for-evil-2
• Apache Flink
• http://www.slideshare.net/stephanewen1/apache-flink-overview-and-use-cases-at-prehadoop-summit-meetups?next_slideshow=1
• Moves the Needle brings Lean Startup principles, tools, tactics and strategy to
the enterprise
• http://www.movestheneedle.com/
Resources
• Speakers and slides
• http://strataconf.com/big-data-conference-uk-2015/public/schedule/speakers
• O’Reilly data blog
• https://beta.oreilly.com/topics/data
• Spark Training slides
• http://training.databricks.com/workshop/sparkcamp.pdf
 

En vedette (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Strata+Hadoop World London 2015

  • 2. {“about” : “me”} Harvinder Atwal
      • MoneySuperMarket.com, current: Head of Customer Insight and Marketing Optimisation
      • dunnhumby, previous: Insight Director, Tesco Clubcard
      • Lloyds Banking Group, previous: Senior Manager, Customer Strategy and Insight
      • Contact: @harvindersatwal, latitude51north, harvindersatwal, www.latitude51north.com, harvinder.s.atwal@gmail.com
  • 3. What is Strata+Hadoop World?
      Presented by O’Reilly and Cloudera, Strata + Hadoop World is where big data, cutting-edge data science, and new business fundamentals intersect and merge.
      • The first day involved software tutorials and deep dives, relating mostly to software in the Hadoop ecosystem, with many given by the software authors or contributors. This provided an excellent opportunity to take a closer look at a particular technology and ask in-depth questions of the people in the know.
      • On the second day the conference proper started and, despite there now being four Strata + Hadoop World conferences a year, it offered a packed schedule of speakers from many of the industry’s leading organisations. Speakers this year included people from Barclays Bank, Google, CERN, Accenture, Pivotal, Databricks, Dato, MapR, comparethemarket.com and a great many more.
  • 4. The quality of presentations was mixed. Here are my favourite keynotes.
      1. Shazam - http://www.youtube.com/embed/mcTPvxo8SXY?autoplay=1
      2. Ideas that Matter - We're always talking about "innovation", but, says Tim Harford, there are really two very different kinds of innovation. Using stories from sports, science, music, and military history, Tim will make you think differently about where good ideas come from and how they should be encouraged. http://www.youtube.com/embed/ohCavVVxX0M?autoplay=1
      3. Is Privacy Becoming a Luxury Good? - Julia Angwin (ProPublica). We are being watched: by companies, by the government, by our neighbors. Technology has made powerful surveillance tools available to everyone. And now some of us are investing in counter-surveillance techniques and tactics. http://www.youtube.com/embed/fsWAZIfqPuU?autoplay=1
      4. Overview of BT's internal multi-tenant Hadoop platform - Phil Radley (Chief Data Architect at British Telecom) explained their first production use case (master data management of BT UK Business Customer data) and gave a flavour of their use case pipeline. https://youtu.be/YMoVShk5D
      5. Julie Meyer (Ariadne Capital) - http://www.youtube.com/embed/a8u-bOoqYA4?autoplay=1
  • 5. The most talked about technology at Strata + Hadoop World was…
      • Naturally, ‘Hadoop’ itself comes out on top, with Spark a close second.
      • Both Hadoop and Spark are frameworks for processing very large datasets on clusters of commodity computers.
      • Hadoop, Spark and many of the other most talked about technologies (Hive, HDFS, Kylin, HBase, etc.) are Apache Software Foundation projects.
      • The Foundation is now responsible for most of the developments in Big Data tech.
      • Scala is a relatively new programming language gaining rapid traction, especially for productionised machine learning applications. It runs on the Java Virtual Machine and is interoperable with Java libraries, yet it is far less verbose than Java and easier to code in. It also fully supports functional programming, which is very fashionable.
      Figure: a plot produced using a quick scoring algorithm run on data from the Strata session outlines, showing the most talked about technologies at the conference. (NB. SAS ranks high only because they were conference sponsors and presenters, not because they were a much talked about technology!)
  • 6. My Key take-aways
      • Apache Foundation Open-Source Software has become the industry standard for Big Data processing, storage, and increasingly querying and analysis.
          • Some examples you may have heard of: Hadoop, Spark, Cassandra, HBase, Kafka.
          • Spark is likely to supplant Hadoop as the Big-Data processing platform of choice.
      • Data Lakes and how to deal with large quantities of streaming data are two hot topics in architecture.
          • A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. Because the raw data is always available, it enables greater agility and a wider range of applications.
          • Lambda architecture is the common solution to processing large quantities of streaming data.
      • There were several Tools and Techniques we should explore further. Ivory looks very useful:
          • Ivory is an open source package that can speed up model building and analysis by turning raw Event data (e.g. customer enquiries) into a summarisation by entity at a point in time (e.g. enquiries by Customer in the 12 months to 31 May 2015).
  • 7. Key theme 1 – Apache Foundation tech
  • 8. The Apache ecosystem for Big Data is growing rapidly and it’s getting confusing!
  • 9. So first a history lesson…
      • In the early 2000s Google was finding it challenging to store and process the exploding volume of content on the Internet.
      • Sanjay Ghemawat and Jeffrey Dean, senior researchers at Google, wrote a series of seminal papers that defined the way Google, and everyone else since, cracked the problem.
      • In order to cope, Google invented a novel style of data processing known as MapReduce, a new way of saving files called the Google File System (GFS), and an original distributed database, BigTable.
      “Google is living a few years in the future and sends the rest of us messages,” Doug Cutting, Hadoop founder
  • 10. Wordcount is the canonical example for MapReduce
      By processing in parallel you overcome the limits of a single machine and can scale by simply adding more nodes to the cluster.
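The word count on slide 10 can be sketched in plain Python as three phases: map, shuffle and reduce. This is a single-process illustration only, not Hadoop code; the function names are made up for the example. On a real cluster each phase would run in parallel across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted counts by their key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())
```

Calling `word_count(["the quick brown fox", "the lazy dog", "the fox"])` counts "the" three times and "fox" twice. Because each line is mapped independently and each word is reduced independently, both phases can be spread over as many machines as are available.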
  • 11. Google’s papers led to the development of Open Source implementations
      • Inspired by Google’s MapReduce paper, Doug Cutting developed Hadoop at Yahoo.
      • For the first time it enabled companies to process huge quantities of data on cheap commodity hardware.
      • GFS inspired HDFS, Hadoop’s Distributed File System.
      • BigTable inspired HBase, a non-relational database able to host millions of columns and billions of rows.
      • Many other applications have since been developed to build the Hadoop Ecosystem:
          • Pig – a language for querying using Hadoop
          • Hive – a Hadoop layer for querying using an SQL-like language
          • Mahout – a machine learning library
      • However, despite these abstractions, programming in Hadoop is still not straightforward.
  • 12. The Enterprise Hadoop market is now big business
      Three vendors dominate the market. All have similar offerings. MapR’s solution is the best if you want to try out Hadoop/Spark for yourself: http://doc.mapr.com/display/MapR/Home
  • 13. Spark is a complete game changer
      • Spark is an engine for large-scale data processing that seems to be in the process of replacing the MapReduce paradigm. What seems to be driving Spark’s adoption at the moment is its raw speed: it claims to run programs up to 100 times faster than MapReduce in memory and 10 times faster on disk.
      • Like MapReduce, it works with the filesystem to distribute your data across the cluster and process that data in parallel. However, Spark tries to keep things in RAM (fast), whereas MapReduce keeps shuffling things in and out of disk (slow).
      • Spark is also much more powerful and expressive in terms of how you give it instructions to crunch data, abstracting away a lot of complexity and allowing more interactive analysis of data.
      • Spark 1.4 will feature first-class support for, and integration with, R. With version 1.4 the SparkR project will be officially integrated, which means R will join Java, Scala and Python as a fully supported language.
      • http://cdn.oreillystatic.com/en/assets/1/event/126/Apache%20Spark_%20What_s%20new_%20what_s%20coming%20Presentation.pdf
  • 14. Spark has its own ecosystem extending usability
      • Spark can work with data stored inside a Hadoop cluster, in Amazon’s S3 or locally, which means it’s really easy to experiment with.
      • You can perform interactive analysis of large datasets without sampling and have the same architecture for insight and production.
      • Spark can be used for analysing live Streaming data (web logs, sensors, social media, etc.) using the same API as batch data.
      • The Spark ecosystem features MLlib, a scalable machine learning library.
      • GraphX is a component for social network analysis, fraud detection, recommendations and other graph analysis.
      • Spark is also useful for less glamorous jobs like ETL.
  • 15. I’ve been trying out Spark via the Python API and can confirm it’s fast
      Screenshot: analysing log data to count the number of unique hosts per day.
      • MapR’s Sandbox is the best way to get started with Spark: https://www.mapr.com/blog/getting-started-spark-mapr-sandbox#.VZOxjPlViPw
      • Spark can also be used with Vertica: http://www.vertica.com/wp-content/uploads/2013/12/Vertica_MapR_Solution-Brief_Feb-2014.pdf
      • Although Spark abstracts away some of the complexity of Hadoop, the limited number of Actions and Transformations available still requires a shift in mindset and more steps compared to SQL and SAS.
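The code from the slide 15 screenshot is not recoverable, so the following is a plain-Python sketch of the same calculation rather than the author's Spark job. It assumes Apache Common Log Format input; `LOG_PATTERN`, `parse_line`, `unique_hosts_per_day` and the sample lines are all illustrative. In PySpark the equivalent steps would typically be a `map` to (day, host) pairs, a `distinct`, and a count by key.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed layout: host, two ignored fields, then [dd/Mon/yyyy:hh:mm:ss zone].
LOG_PATTERN = re.compile(r"^(\S+) \S+ \S+ \[(\d{2}/\w{3}/\d{4}):")

# Made-up sample lines for illustration.
SAMPLE_LOG = [
    'host1.example.com - - [01/Aug/1995:00:00:01 -0400] "GET /a HTTP/1.0" 200 100',
    'host2.example.com - - [01/Aug/1995:00:00:09 -0400] "GET /b HTTP/1.0" 200 200',
    'host1.example.com - - [01/Aug/1995:00:01:00 -0400] "GET /c HTTP/1.0" 404 0',
    'host1.example.com - - [02/Aug/1995:10:00:00 -0400] "GET /a HTTP/1.0" 200 100',
    'not a log line',
]


def parse_line(line):
    """Return (ISO day, host) for a log line, or None if it does not parse."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    host, day_text = match.groups()
    day = datetime.strptime(day_text, "%d/%b/%Y").date().isoformat()
    return day, host


def unique_hosts_per_day(lines):
    """Count distinct hosts seen on each day, skipping unparseable lines."""
    hosts_by_day = defaultdict(set)
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            day, host = parsed
            hosts_by_day[day].add(host)
    return {day: len(hosts) for day, hosts in sorted(hosts_by_day.items())}
```

The set per day plays the role of the distinct step: repeat requests from the same host on the same day collapse to one entry before counting.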
  • 16. HBase and Cassandra were the most talked about databases for big data
      • HBase is an open source, non-relational, distributed database modelled after Google's BigTable. Apache Impala is commonly used to query HBase data.
      • Apache Cassandra was initially developed at Facebook by Avinash Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik to power its Inbox Search feature. Cassandra is essentially a hybrid between a key-value and a column-oriented (or tabular) database.
      • Apache Accumulo is a key/value store based on Google BigTable.
      • Hypertable is another open source database inspired by you-know-who. Because Hypertable keeps data physically sorted by a primary key, it lends itself to applications that require fast access to ranges of data (e.g. analytics, sorted URL lists, messaging applications, etc.).
      • The big news at Strata though was that Google has made BigTable itself publicly available through Google Cloud. Because they started it all, they claim it’s faster and better than everything else.
  • 17. Apache Drill looks exciting; use SQL to query multiple NoSQL data sources
      • Apache Drill is a Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.
      • Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.
      • Good example of Drill usage here: https://www.mapr.com/blog/apache-drill-its-drilliant-query-json-files-tableau-desktop
      Figure: query multiple data types and sources using SQL. Drill even connects to Tableau for visualisation of output.
  • 18. Key theme 2 – Enterprise Data Hubs, Data Lakes (or Data Swamps).
  • 19. There’s a trend away from fixed to flexible data structures
      • Traditionally data and code have been tightly coupled to a schema to support specific applications.
      • This causes serious problems when you realise you have a new application for your data but your original schema won’t support it.
      • The ADM and MDM in MSM are very good examples of this problem. Both are designed for specific use cases (Insight and Campaigns).
      • If the use cases require new data, expensive development is needed.
      • A lot of campaign data useful for insight is not available in the ADM, and vice versa.
      Schema on write is the traditional approach to processing and storing data. Data goes through an ETL process to make it uniform and fit the predefined schema, or it’s dropped.
  • 20. Flexible data structures try to overcome traditional limitations
      • The schema on read approach takes the same raw data but lands it (relatively unprocessed) all in the same place.
      • Then, instead of building a series of applications on top of custom schemas, you make the data dynamically available for various services through code.
      • This is a very different way to use data but provides much more agility.
      • As data users we can get our heads round this, but it will appear completely upside down (possibly insane) to traditional data architects.
      Schema-on-read keeps the data in raw format. A schema is only applied as you decide how to use the data through code.
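A minimal sketch of schema-on-read, assuming the raw data lands as JSON lines. The records, field names and the two schemas below are invented for illustration. The raw lines are stored untouched, and each consumer applies its own schema only at the moment it reads.

```python
import json

# Raw events landed as-is; note the second record has no "campaign" field.
RAW_EVENTS = [
    '{"customer_id": 1, "channel": "motor", "premium": "312.50", "campaign": "spring"}',
    '{"customer_id": 2, "channel": "home", "premium": "120.00"}',
    '{"customer_id": 1, "channel": "home", "premium": "99.99", "campaign": "spring"}',
]

# Each schema maps a field name to (converter, default if the field is missing).
INSIGHT_SCHEMA = {"customer_id": (int, None), "premium": (float, 0.0)}
CAMPAIGN_SCHEMA = {"customer_id": (int, None), "campaign": (str, "none")}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and apply the given schema while reading."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else default
            for field, (convert, default) in schema.items()
        }


insight_rows = list(read_with_schema(RAW_EVENTS, INSIGHT_SCHEMA))
campaign_rows = list(read_with_schema(RAW_EVENTS, CAMPAIGN_SCHEMA))
```

Here the insight view and the campaign view are built from the same raw lines. Adding a third use case means writing another schema, not reloading the data, which is the agility the slide describes. The cost is that every reader has to cope with missing or oddly typed fields.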
  • 21. Schema-on-Read is one of the key differences between a Data Warehouse and Data Lake
      • A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.
      • There are advantages and disadvantages to Data Lakes.
      • Pros:
          • Agility – Data is always available for any use case. Moves us away from a data-centric to a use-centric view.
          • Value – Improves data discovery and advanced analytics capability.
      • Cons:
          • Onus on user – Puts pressure on the user to understand raw data and write more code.
          • Skills – Exploiting raw data requires an advanced skillset.
  • 22. Cloudera’s compromise is to hold data in different layers – an approach adopted by CTM, Goldman Sachs and others
      • Raw data always available and readable by Spark, Drill etc.
      • An enriched view available to Analysts and Data Scientists using Spark, Impala, Drill etc.
      • Shared layer available to the business, with multiple sources joined together.
      • Optimised Layer used to operationalise the use cases and organised by data consumer, not source. Optimised for performance.
      • A speed layer following lambda architecture principles for real-time analysis.
      • There can be more or fewer layers.
      (Diagram labels: Data Sources, Raw Layer, Discovery Layer, Shared Layer, Optimised Layer, Speed Layer, Data Consumers.)
  • 23. Further reading
      • Information architecture for Apache Hadoop - Mark Samson (Cloudera)
          • http://cdn.oreillystatic.com/en/assets/1/event/126/Information%20architecture%20for%20Apache%20Hadoop%20Presentation.pptx
      • It ain’t what you do to data, it’s what you do with it (Silicon Valley Data Science)
          • http://cdn.oreillystatic.com/en/assets/1/event/126/It%20ain%E2%80%99t%20what%20you%20do%20to%20data,%20it%E2%80%99s%20what%20you%20do%20with%20it%20Presentation.pptx
          • http://svds.com/art-abstraction
      • Systems that enable agility
          • https://speakerdeck.com/ept/systems-that-enable-data-agility
  • 24. Lambda architecture is the standard for processing streaming Big Data
      • Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.
      • Nathan Marz designed this generic architecture, addressing common requirements for big data, based on his experience working on distributed data processing systems at Twitter.
      • The Batch Layer manages the master dataset and computes views infrequently.
      • The Speed Layer serves real-time queries; its data is dropped as soon as the batch layer has processed it.
      • The Serving Layer brings together the Batch and Speed layers so they can be queried.
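The three layers on slide 24 can be illustrated with a toy in-memory event counter. This is a sketch under strong simplifying assumptions: a single process, and no events arriving while a batch run is in progress. `LambdaCounter` and its method names are hypothetical and do not come from any framework.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda architecture that counts events per key."""

    def __init__(self):
        self.master = []             # append-only master dataset
        self.batch_view = Counter()  # batch layer output, rebuilt infrequently
        self.speed_view = Counter()  # speed layer: events since the last batch run

    def ingest(self, key):
        """New events go to the master dataset and to the speed layer."""
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Batch layer: recompute the view from the full master dataset,
        then drop the speed layer data the batch view now covers."""
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, key):
        """Serving layer: merge the batch view with the speed view."""
        return self.batch_view[key] + self.speed_view[key]


store = LambdaCounter()
for event in ["a", "a", "b"]:
    store.ingest(event)
store.run_batch()
store.ingest("a")  # arrives after the batch run, so only the speed layer has it
```

After this sequence `store.query("a")` combines two events from the batch view with one from the speed view. Queries stay up to date between batch runs, while the batch recomputation from the master dataset corrects any errors accumulated in the speed layer.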
  • 26. Ivory looks like an incredibly useful package for speeding up the modelling pipeline
      • Ivory is a package from Ambiata dubbed a datastore for features.
      • Very commonly in analysis or the modelling process we need to know the states and/or data summarisations for entities at historic points in time, e.g. number of enquiries in the month before PSD, had the enquirer previously bought Motor-ever, number of visits in the previous month, three months, 12 months, etc.
      • Ivory allows you to easily create these features from a table of Events.
      • https://speakerdeck.com/ambiata/improving-feature-engineering-in-the-lab-and-production-with-ivory
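To make the idea of a point-in-time feature concrete, here is a plain-Python sketch. It does not use Ivory or imitate its API; the event table, `count_events_as_at` and its parameters are invented for illustration. It counts each customer's enquiries in the 365 days up to a snapshot date, ignoring anything that happened afterwards.

```python
from datetime import date, timedelta

# Hypothetical event table: (entity, event date, event type).
EVENTS = [
    ("cust_1", date(2014, 7, 10), "enquiry"),
    ("cust_1", date(2015, 3, 2), "enquiry"),
    ("cust_1", date(2015, 6, 15), "enquiry"),   # after the snapshot
    ("cust_2", date(2013, 12, 1), "enquiry"),   # before the window
    ("cust_2", date(2015, 5, 31), "enquiry"),
    ("cust_2", date(2015, 1, 20), "purchase"),  # different event type
]


def count_events_as_at(events, snapshot, window_days=365, event_type="enquiry"):
    """Count events of one type per entity in the window ending at snapshot.

    The window is (snapshot - window_days, snapshot], so events on the
    snapshot date are included and later events are never seen.
    """
    window_start = snapshot - timedelta(days=window_days)
    counts = {}
    for entity, event_date, kind in events:
        if kind == event_type and window_start < event_date <= snapshot:
            counts[entity] = counts.get(entity, 0) + 1
    return counts


features = count_events_as_at(EVENTS, snapshot=date(2015, 5, 31))
```

Excluding events after the snapshot is the important part: a model trained on features that include later events would be using information it could not have had at the time.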
  • 28. The future of machine intelligence and why it matters - Shivon Zilis (Bloomberg Beta)
      • How machine learning can make your life easier.
      • http://cdn.oreillystatic.com/en/assets/1/event/126/The%20future%20of%20machine%20intelligence%20and%20why%20it%20matters%20Presentation%201.pdf
      • Some very good resources/tools were mentioned in this presentation:
          • Meeting Preparation: http://quid.com/
          • Scheduling: https://claralabs.com/ https://x.ai/
          • Competitive Analysis: http://mattermark.com/ https://app.datafox.co/
          • Conference Calls: http://www.gridspace.com/
          • Talent: https://www.textio.com/
          • Emails: http://www.inboxvudu.com/
  • 29. Forecasting space-time events – Predictive policing, Minority Report style
      • This session uses the speaker’s experience in building a crime forecasting package to outline some tools and techniques useful in modelling space-time event data. https://www.hunchlab.com/
      • Concepts
          • While many data scientists work with data that includes geographic information, this data is often used in rather rudimentary ways or limited to vector data sets such as the point locations of stores or users. The session will introduce the strengths and weaknesses behind raster-based geographic analysis. Some challenges faced when modelling data at a fine geographic and temporal resolution will be discussed. For example, how can uncertainty around the time of occurrence for events be represented?
      • GeoTrellis - http://geotrellis.io/
          • The case study leverages the open source GeoTrellis framework to conduct geographic processing. GeoTrellis is currently an incubating project within the Eclipse Foundation’s LocationTech working group. The project provides fast and scalable geographic processing with an emphasis on raster-based analysis and routing through transportation networks. Already written in Scala, GeoTrellis is currently being extended to integrate with Apache Spark.
      • Modelling
          • The modelling pipeline within the case study consists of several loosely coupled components. In addition to GeoTrellis, the project uses R for machine learning and the Amazon Simple Workflow service for pipeline orchestration. The presentation will outline the basic structure of the modelling process including details of the statistical techniques utilized within the process.
          • Several statistical techniques were examined throughout the development of the project. The final approach included a stacked model incorporating a gradient boosting machine (GBM) to model the presence of events, and a generalized additive model (GAM) to transform these predictions into expected counts.
          • The session will conclude by outlining some approaches to evaluating predictive accuracy for these types of data sets.
  • 31. PDF is the second biggest religion in the UK and other amusing insights..
      • Visualizing the world's largest democratic exercise
          • The election results page for the 2014 Indian general elections was hosted on CNN-IBN and bing.com. The focus was on real-time analysis of results for users and TV anchors. With over 540 million voters and 100 million viewers, the volume and complexity of data both provide a design challenge. This talk focuses on the techniques behind this design.
          • https://gramener.com/shows/slides/strata#/
      • Situational awareness: This is not the data you're looking for
          • http://cdn.oreillystatic.com/en/assets/1/event/126/Situational%20awareness_%20This%20is%20not%20the%20data%20you_re%20looking%20for%20Presentation.pdf
  • 32. Accenture gave a great talk on the case for building an in-house data insights lab
      • Accenture talked about the challenges with traditional approaches for getting buy-in for data science solutions within companies.
      • Their solution is to make concepts practical for clients by using open source technology and open data, creating visualisations, and mocking up prototypes in a few weeks.
      • This requires a multi-disciplinary team of data scientists, engineers and designers who can use cutting edge technologies to bring concepts to life – ‘The Technology Lab’.
      • Their case study was a US Bank client. Accenture had difficulty convincing the Senior Execs of the value of a risk dashboard. The Technology Lab decided to build a prototype using open source software and open mortgage loan/default data. Once the execs got their hands on the prototype they soon gave the go-ahead for a version using internal data.
      • McKinsey also have a similar Digital Labs concept: http://www.mckinsey.com/client_service/mckinsey_digital/expertise/digital_labs
  • 34. Advanced Machine Learning
      • Deploying machine learning in production: What could possibly go wrong? - Alice Zheng (Dato)
          • Building and deploying predictive applications require knowing how to evaluate, test, and track the performance of machine learning models over time. Using available off-the-shelf tools, this talk engages potential application builders on topics such as common evaluation metrics, A/B testing set up, tracking model performance, tracking usage via real-time feedback, and updating models.
          • http://cdn.oreillystatic.com/en/assets/1/event/126/Deploying%20machine%20learning%20in%20production%20Presentation%201.pdf
      • Scalable machine learning
          • While the data management side of Big Data has seen tremendous progress in the past few years, bringing technologies like Hadoop or Spark together with advanced machine learning and data analysis methods is still a major challenge. In this talk, I will discuss recent advances, approaches, and patterns which are used to build truly scalable machine learning solutions.
          • http://cdn.oreillystatic.com/en/assets/1/event/126/Scalable%20machine%20learning%20Presentation.pdf
      • Deep learning made doubly easy with reusable deep features
          • http://blog.dato.com/deep-learning-blog-post
  • 35. Data Science
      • The curiosity advantage: the most important skill for data science
          • Curiosity is one of the most valued skills for people working in Data Science. But how can we train it? Einstein said that "Curiosity is an important trait of a genius". Let’s explore how we can develop our curiosity with three exercises in the session, each with direct application to data science: how to find pleasure in uncertainty; question the question we’re asking; and find a beginner's mind.
          • http://cdn.oreillystatic.com/en/assets/1/event/126/The%20curiosity%20advantage_%20the%20most%20important%20skill%20for%20data%20science%20Presentation.pdf
      • Measuring the benefit effect for customers with Bayesian predictive modelling
          • http://cdn.oreillystatic.com/en/assets/1/event/126/Measuring%20the%20benefit%20effect%20for%20customers%20with%20Bayesian%20predictive%20modeling%20Presentation.pdf
  • 36. Other
      • Using Data for evil
          • http://www.slideshare.net/DuncanRoss1/using-data-for-evil-2
      • Apache Flink
          • http://www.slideshare.net/stephanewen1/apache-flink-overview-and-use-cases-at-prehadoop-summit-meetups?next_slideshow=1
      • Moves the Needle brings Lean Startup principles, tools, tactics and strategy to the enterprise
          • http://www.movestheneedle.com/
  • 37. Resources
      • Speakers and slides
          • http://strataconf.com/big-data-conference-uk-2015/public/schedule/speakers
      • O’Reilly data blog
          • https://beta.oreilly.com/topics/data
      • Spark Training slides
          • http://training.databricks.com/workshop/sparkcamp.pdf