Strata+Hadoop World London 2015 ©latitude51north.com
{“about” : “me”}
Harvinder Atwal
• current : Head of Customer Insight and Marketing Optimisation, MoneySuperMarket.com
• previous : Senior Manager, Customer Strategy and Insight, Lloyds Banking Group
• previous : Insight Director, Tesco Clubcard, dunnhumby
• Web : www.latitude51north.com
• Twitter : @harvindersatwal
• Email : harvinder.s.atwal@gmail.com
What is Strata+Hadoop World?
Presented by O’Reilly and Cloudera, Strata + Hadoop World is where big data, cutting-
edge data science, and new business fundamentals intersect—and merge.
• The first day involved software tutorials and deep dives, relating mostly to software in
the Hadoop ecosystem, with many given by the software authors or contributors. This
provides an excellent opportunity to take a closer look at a particular technology and
ask in-depth questions of the people in the know.
• The conference proper starts on the second day and, despite there now being four
Strata + Hadoop World conferences a year, offers a packed schedule of speakers from
many of the industry’s leading organisations. Speakers this year included people from
Barclays, Google, CERN, Accenture, Pivotal, Databricks, Dato, MapR,
comparethemarket.com and a great many more.
The quality of presentations was mixed. Here
are my favourite keynotes.
1. Shazam - http://www.youtube.com/embed/mcTPvxo8SXY?autoplay=1
2. Ideas that Matter - We're always talking about "innovation", but - says Tim Harford - there
are really two very different kinds of innovation. Using stories from sports, science, music, and
military history, Tim will make you think differently about where good ideas come from and how
they should be encouraged. http://www.youtube.com/embed/ohCavVVxX0M?autoplay=1
3. Is Privacy Becoming a Luxury Good? Julia Angwin (ProPublica) We are being watched –
by companies, by the government, by our neighbors. Technology has made powerful
surveillance tools available to everyone. And now some of us are investing in counter-
surveillance techniques and tactics.
http://www.youtube.com/embed/fsWAZIfqPuU?autoplay=1
4. Overview of BT's internal multi-tenant Hadoop platform - Phil Radley (Chief Data
Architect at British Telecom) explains their first production use case (master data management
of BT UK Business Customer data) and gives a flavour of their use case pipeline.
https://youtu.be/YMoVShk5D
5. Julie Meyer (Ariadne Capital) - http://www.youtube.com/embed/a8u-
bOoqYA4?autoplay=1
The most talked about technology at Strata +
Hadoop World was…
• Naturally, ‘Hadoop’ itself comes out on top with Spark
coming in a close second.
• Both Hadoop and Spark are frameworks to process very
large datasets on commodity computer clusters.
• Hadoop, Spark and many of the other most talked about
technologies (Hive, HDFS, Kylin, HBase, etc.) are Apache
Software Foundation open-source projects.
• The Foundation is now responsible for most of the
developments in Big Data tech.
• Scala is a relatively new programming language gaining
rapid traction especially for productionised machine
learning applications. It runs on the Java Virtual Machine
and is interoperable with Java libraries. However, it is far
less verbose and easier to code than Java. It also fully
supports functional programming which is very
fashionable.
A plot produced by running a quick scoring algorithm on
data from the Strata session outlines. It clearly shows
the most talked about technologies from the
conference. (NB. SAS ranks high only because they were
conference sponsors and presenters, not because their
tech was much talked about!)
My Key take-aways
• Apache Foundation Open-Source Software has become the industry standard
for Big Data processing, storage, and increasingly querying and analysis.
• Some examples you may have heard of: Hadoop, Spark, Cassandra, HBase, Kafka.
• Spark is likely to supplant Hadoop as the Big-Data processing platform of choice.
• Data Lakes and how to deal with large quantities of streaming data are two
hot topics in architecture
• A data lake is a storage repository that holds a vast amount of raw data in its native format until
it is needed. They enable greater agility and range of applications as the raw data is always
available.
• Lambda architecture is the common solution to processing large quantities of streaming data
• There were several tools and techniques we should explore further. Ivory
looks very useful:
• Ivory is an open source package that can speed up model building and analysis by turning raw
Event data (e.g. customer enquiries) into a summarisation at a point in time by entity (enquiries
by Customer in previous 12 months at 31 May 2015).
So first a history lesson…
• In the early 2000s Google was finding it challenging to store and process the exploding
volume of content on the Internet.
• Sanjay Ghemawat and Jeffrey Dean, senior researchers at Google, wrote a series of seminal
papers that defined the way Google and everyone else since cracked the problem.
• In order to cope, Google invented a novel style of data processing known as MapReduce, a
new distributed way of storing files called the Google File System (GFS), and a distributed
database for structured data, BigTable.
“Google is living a few years in the future and sends the rest of us
messages,” Doug Cutting, creator of Hadoop
Wordcount is the canonical example for
MapReduce
By processing in parallel you overcome the limits of a single machine and can
scale by simply adding more nodes to the cluster
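The wordcount pattern can be sketched in plain Python. This is a hedged illustration of the three phases (map, shuffle, reduce), not Hadoop's actual API — in a real cluster each phase runs in parallel across many nodes:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key (the framework does this for you)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

Because each mapper only sees its own lines and each reducer only sees one key's values, the same code can be spread over as many machines as you like.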
Google’s papers led to the development of
Open Source implementations
• Inspired by Google’s MapReduce paper Doug Cutting developed Hadoop at Yahoo.
• For the first time it enabled companies to process huge quantities of data on cheap
commodity hardware.
• GFS inspired HDFS, Hadoop’s Distributed File System.
• BigTable inspired HBase, a non-relational database able to host millions of columns
and billions of rows.
• Many other applications have since been developed to build the Hadoop
Ecosystem:
• Pig – A language for querying using Hadoop
• Hive – a Hadoop layer for querying using SQL-like language
• Mahout – A machine learning library
• However, despite abstractions programming in Hadoop is still not straightforward.
The Enterprise Hadoop market is now big
business
Three vendors dominate the market.
All have similar offerings.
MapR’s solution is the best if you want to try out Hadoop/Spark for yourself:
http://doc.mapr.com/display/MapR/Home
Spark is a complete game changer
• Spark is an engine for large-scale data processing that seems to be in the process of replacing
the MapReduce paradigm. What seems to be driving Spark’s adoption at the moment is its raw
speed: it claims to be up to 100 times faster than MapReduce when working in memory, and 10
times faster on disk.
• Like MapReduce, it works with the filesystem to distribute your data across the cluster and
process that data in parallel. However, Spark tries to keep things in RAM (fast),
whereas MapReduce keeps shuffling things in and out of disk (slow).
• Spark is also much more powerful and expressive in terms of how you give it instructions to
crunch data, abstracting away a lot of complexity and allowing more interactive analysis of
data.
• Spark 1.4 will feature first class support and integration with R. With version 1.4, the SparkR
project will be officially integrated, which means R will join Java, Scala and Python as a fully
supported language.
• http://cdn.oreillystatic.com/en/assets/1/event/126/Apache%20Spark_%20What_s%20new_%20
what_s%20coming%20Presentation.pdf
Spark has its own ecosystem extending
usability
• Spark is capable of working with data stored inside a
Hadoop cluster, can use data stored in Amazon’s S3
and can work with data stored locally, which means
it’s really easy to experiment with.
• You can perform interactive analysis of large datasets
without sampling and have the same architecture for
insight and production.
• Spark can be used for analysing live Streaming data
(web log, sensors, social media, etc.) using the same
API as batch data.
• The Spark ecosystem features MLlib, a scalable
machine learning library.
• GraphX is a component for social network analysis,
fraud detection, recommendations and other graph
analysis.
• Spark is also useful for less glamorous jobs like ETL.
I’ve been trying out Spark via the Python
API and can confirm it’s fast
15
ANALYSING LOG DATA TO COUNT THE NUMBER OF UNIQUE HOSTS PER DAY
MapR’s Sandbox is the best way to get started with Spark: https://www.mapr.com/blog/getting-started-spark-mapr-sandbox#.VZOxjPlViPw
Spark can also be used with Vertica: http://www.vertica.com/wp-content/uploads/2013/12/Vertica_MapR_Solution-Brief_Feb-2014.pdf
Although Spark abstracts away some of the complexity of Hadoop, the limited number of
Actions and Transformations available still requires a shift in mindset and more steps
compared to SQL and SAS.
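The Spark code itself doesn't survive in this text version, so here is a plain-Python sketch of the same chain of transformations the job would use (map each line to a (day, host) pair, take distinct pairs, then count by key). The log-line layout is an assumed Apache-style format, not taken from the slide:

```python
# Unique hosts per day — plain-Python sketch of the Spark job's logic.
# Assumes Apache common log format: host is the first field, the date
# sits inside [day/Mon/year:...]. Both are assumptions for illustration.
logs = [
    '10.0.0.1 - - [01/Jul/1995:00:00:01 -0400] "GET /a" 200 100',
    '10.0.0.2 - - [01/Jul/1995:00:03:12 -0400] "GET /b" 200 150',
    '10.0.0.1 - - [01/Jul/1995:09:14:55 -0400] "GET /c" 200 120',
    '10.0.0.3 - - [02/Jul/1995:01:22:07 -0400] "GET /a" 200 100',
]

# map: line -> (day, host); mirrors rdd.map(...)
pairs = [(line.split('[')[1].split(':')[0], line.split()[0]) for line in logs]

# distinct: mirrors rdd.distinct()
unique = set(pairs)

# countByKey: number of distinct hosts per day
hosts_per_day = {}
for day, _host in unique:
    hosts_per_day[day] = hosts_per_day.get(day, 0) + 1

print(sorted(hosts_per_day.items()))  # [('01/Jul/1995', 2), ('02/Jul/1995', 1)]
```

In Spark the same three steps run across the whole cluster, which is why you can skip sampling and analyse the full log.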
HBase and Cassandra were the most talked
about databases for big data
• HBase is an open source, non-relational, distributed database
modelled after Google's BigTable. Apache Impala is commonly used
to query HBase data.
• Apache Cassandra was initially developed at Facebook to power their
Inbox Search feature by Avinash Lakshman (one of the authors of
Amazon's Dynamo) and Prashant Malik. Cassandra is essentially a
hybrid between a key-value and a column-oriented (or tabular)
database.
• Apache Accumulo is a key/value store based on Google BigTable.
• Hypertable is another open source database inspired by you-know-
who. Because Hypertable keeps data physically sorted by a primary
key it lends itself to applications that require fast access to ranges of
data (e.g. analytics, sorted URL lists, messaging applications, etc.)
• The big news at Strata though was that Google has made BigTable
itself publicly available through Google Cloud. Because they started it
all they claim it’s faster and better than everything else.
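The benefit of keeping rows physically sorted by primary key, as Hypertable does, can be illustrated with a toy sketch (this is not Hypertable's API — just the underlying idea): a range scan over sorted keys is two binary searches plus a slice, rather than a full table scan.

```python
import bisect

# Hypothetical sorted row store: (key, value) pairs kept in key order.
rows = sorted([
    ("url/aardvark", 1), ("url/apple", 2), ("url/banana", 3),
    ("url/cherry", 4), ("url/date", 5), ("url/fig", 6),
])
keys = [k for k, _ in rows]

def range_scan(start, end):
    """Return all rows with start <= key < end in O(log n + results)."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, end)
    return rows[lo:hi]

print(range_scan("url/b", "url/d"))  # [('url/banana', 3), ('url/cherry', 4)]
```

Because matching rows are contiguous on disk as well as in the index, this is why sorted stores suit analytics, sorted URL lists, and messaging workloads.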
Apache Drill looks exciting; use SQL to query
multiple NoSQL data sources
• Apache Drill is a Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
• Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS,
MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query
can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with
a directory of event logs in Hadoop.
Good example here of Drill usage: https://www.mapr.com/blog/apache-drill-its-drilliant-query-json-files-tableau-desktop
[Diagram: query multiple data types and sources using SQL; Drill even connects to Tableau for visualisation of output.]
There’s a trend away from fixed to flexible
data structures
• Traditionally, data and code have been tightly
coupled to a schema to support specific
applications.
• This causes serious problems when you realise
you have a new application for your data but your
original schema won’t support it.
• The ADM and MDM in MSM are very good
examples of this problem. Both are designed for
specific use cases (Insight and Campaigns).
• If new data is required by the use cases then
expensive development is required.
• A lot of campaign data useful for insight is not
available in the ADM and vice-versa for insight.
Schema on write is the traditional approach to
processing and storing data.
Data goes through an ETL process to make it
uniform and fit the predefined schema or it’s
dropped.
Flexible data structures try to overcome
traditional limitations
• The schema on read approach takes the same raw
data but lands it (relatively unprocessed) all in the
same place.
• Then instead of building a series of applications on
top of custom schemas you make the data
dynamically available for various services through
code.
• This is a very different way to use data but
provides much more agility.
• As data users we can get our head round this but
it will appear completely upside down (possibly
insane) to traditional data architects.
Schema-on-read keeps the data in
raw format.
A schema is only applied as you
decide how to use the data through
code.
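A minimal sketch of the contrast, assuming simple JSON events (all field names here are hypothetical): the raw records land untouched, and each consumer applies its own schema in code at read time.

```python
import json

# Raw events land untouched in the "lake" — no upfront schema, nothing dropped.
raw = [
    '{"user": "a1", "event": "click", "page": "/home"}',
    '{"user": "b2", "event": "purchase", "amount": 9.99}',
    '{"user": "a1", "event": "click", "page": "/deals", "referrer": "email"}',
]

# Schema-on-read: each use case applies its own view in code at query time.
def clicks_view(records):
    """One consumer's schema: just (user, page) for click events."""
    for r in map(json.loads, records):
        if r.get("event") == "click":
            yield (r["user"], r["page"])

def revenue_view(records):
    """Another consumer's schema over the very same raw data."""
    return sum(r.get("amount", 0) for r in map(json.loads, records))

print(list(clicks_view(raw)))  # [('a1', '/home'), ('a1', '/deals')]
print(revenue_view(raw))       # 9.99
```

Note that the `referrer` field, which a schema-on-write ETL process might have dropped, is still sitting in the raw data for any future use case that wants it.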
Schema-on-Read is one of the key differences between
a Data Warehouse and Data Lake
• A data lake is a storage repository that holds a vast
amount of raw data in its native format until it is
needed.
• There are advantages and disadvantages to Data Lakes
• Pros:
• Agility – Data is always available for any use case.
Moves us from a data-centric to a use-centric
view.
• Value – Improves data discovery and advanced
analytics capability.
• Cons:
• Onus on user – Puts pressure on user to
understand raw data and write more code.
• Skills – Exploiting raw data requires an advanced
skillset.
Cloudera’s compromise is to hold data in different layers – an approach
adopted by CTM, Goldman Sachs and others
• Raw data always available and
readable by Spark, Drill etc.
• An enriched view available to
Analysts and Data Scientists using
Spark, Impala, Drill etc.
• Shared layer available to business
with multiple sources joined
together.
• Optimised Layer used to
operationalise the use cases and
organised by data consumer not
source. Optimised for
performance.
• A speed layer following lambda architecture
principles for real-time analysis.
• Can be more, or fewer layers.
[Diagram: Data Sources → Raw Layer → Discovery Layer → Shared Layer → Optimised Layer → Data Consumers, with a parallel Speed Layer.]
Further reading
• Information architecture for Apache Hadoop - Mark Samson (Cloudera)
• http://cdn.oreillystatic.com/en/assets/1/event/126/Information%20architecture%20f
or%20Apache%20Hadoop%20Presentation.pptx
• It ain’t what you do to data, it’s what you do with it (Silicon Valley Data
Science)
• http://cdn.oreillystatic.com/en/assets/1/event/126/It%20ain%E2%80%99t%20what%
20you%20do%20to%20data,%20it%E2%80%99s%20what%20you%20do%20with%20i
t%20Presentation.pptx
• http://svds.com/art-abstraction
• Systems that enable agility
• https://speakerdeck.com/ept/systems-that-enable-data-agility
Lambda architecture is the standard for
processing streaming Big Data
• Lambda architecture is a data-processing
architecture designed to handle massive
quantities of data by taking advantage of
both batch- and stream-processing methods.
• Nathan Marz designed this generic
architecture addressing common
requirements for big data based on his
experience working on distributed data
processing systems at Twitter.
• The Batch Layer manages the master dataset
and computes views infrequently.
• The Speed Layer is for real-time querying; its
data is dropped as soon as it’s processed by
the batch layer.
• The Serving Layer brings together the Batch
and Speed layers so they can be queried.
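The interplay of the three layers can be sketched with a toy page-view counter (hypothetical data, not Marz's implementation): the batch layer periodically recomputes an authoritative view from the master dataset, the speed layer counts only the events that arrived since, and the serving layer merges the two at query time.

```python
# Batch layer: recompute an authoritative view from the full master dataset.
master = [("/home", 1)] * 1000 + [("/deals", 1)] * 400

def build_view(events):
    """Shared counting logic: page -> total views."""
    view = {}
    for page, n in events:
        view[page] = view.get(page, 0) + n
    return view

# Speed layer: incrementally count only events since the last batch run.
recent = [("/home", 1)] * 7 + [("/deals", 1)] * 2

# Serving layer: a query merges the batch view with the real-time delta.
def query(page, batch, speed):
    return batch.get(page, 0) + speed.get(page, 0)

batch = build_view(master)
speed = build_view(recent)  # same logic, tiny dataset, so it's fast
print(query("/home", batch, speed))  # 1007
```

Once the next batch run absorbs the recent events into the master view, the speed layer's copy of them is simply discarded, as the bullet above describes.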
Ivory looks like an incredibly useful package for speeding
up the modelling pipeline
• Ivory is a package from Ambiata dubbed a
datastore for features.
• Very commonly in analysis or the modelling
process we need to know the states and/or
data summarisations for entities at historic
points in time e.g. number of enquiries in
month before PSD, had enquirer previously
bought Motor-ever, number of visits in
previous month, three months, 12 months,
etc.
• Ivory allows you to easily create
these features from a table of Events.
https://speakerdeck.com/ambiata/improving-
feature-engineering-in-the-lab-and-production-
with-ivory
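A sketch of the kind of computation Ivory automates, using a hypothetical event table (this is not Ivory's actual API — just the shape of the point-in-time feature it produces):

```python
from datetime import date

# Hypothetical event table: (customer, enquiry_date) rows.
events = [
    ("cust1", date(2013, 8, 3)),  ("cust1", date(2015, 2, 10)),
    ("cust1", date(2015, 5, 20)), ("cust2", date(2013, 1, 5)),
    ("cust2", date(2015, 5, 30)),
]

def enquiries_in_window(events, customer, as_at, days=365):
    """Feature: enquiries by `customer` in the `days` before `as_at`."""
    return sum(1 for c, d in events
               if c == customer and 0 <= (as_at - d).days < days)

as_at = date(2015, 5, 31)
print(enquiries_in_window(events, "cust1", as_at))  # 2 (2013 enquiry is out of window)
print(enquiries_in_window(events, "cust2", as_at))  # 1
```

Doing this by hand for every feature, entity, and as-at date is exactly the repetitive work in the modelling pipeline that a feature store like Ivory is meant to take away.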
The future of machine intelligence and why it matters -
Shivon Zilis (Bloomberg Beta)
• How machine learning can make your life easier.
• http://cdn.oreillystatic.com/en/assets/1/event/126/The%20future%20of%20machine
%20intelligence%20and%20why%20it%20matters%20Presentation%201.pdf
• Some very good resources/tools were mentioned in this presentation:
• Meeting Preparation: http://quid.com/
• Scheduling: https://claralabs.com/ https://x.ai/
• Competitive Analysis: http://mattermark.com/ https://app.datafox.co/
• Conference Calls: http://www.gridspace.com/
• Talent: https://www.textio.com/
• Emails: http://www.inboxvudu.com/
Forecasting space-time events – Predictive policing,
Minority Report style
• This session uses the speaker’s experience in building a crime forecasting package to outline some tools and techniques useful in modelling
space-time event data. https://www.hunchlab.com/
• Concepts
• While many data scientists work with data that includes geographic information, this data is often used in rather rudimentary ways or
limited to vector data sets such as the point locations of stores or users. The session will introduce the strengths and weaknesses
behind raster-based geographic analysis. Some challenges faced when modelling data at a fine geographic and temporal resolution
will be discussed. For example, how can uncertainty around the time of occurrence for events be represented?
• GeoTrellis - http://geotrellis.io/
• The case study leverages the open source GeoTrellis framework to conduct geographic processing. GeoTrellis is currently an
incubating project within the Eclipse Foundation’s LocationTech working group. The project provides fast and scalable geographic
processing with an emphasis on raster-based analysis and routing through transportation networks. Already written in Scala, GeoTrellis
is currently being extended to integrate with Apache Spark.
• Modelling
• The modelling pipeline within the case study consists of several loosely coupled components. In addition to GeoTrellis, the project
uses R for machine learning and the Amazon Simple Workflow service for pipeline orchestration. The presentation will outline the
basic structure of the modelling process including details of the statistical techniques utilized within the process.
• Several statistical techniques were examined throughout the development of the project. The final approach included a stacked model
incorporating a gradient boosting machine (GBM) to model the presence of events, and a generalized additive model (GAM) to
transform these predictions into expected counts. The session will conclude by outlining some approaches to evaluating predictive
accuracy for these types of data sets.
PDF is the second biggest religion in the UK
and other amusing insights…
• Visualizing the world's largest democratic exercise
• The election results page for the 2014 Indian general elections was hosted on
CNN-IBN and bing.com. The focus was on real-time analysis of results for users
and TV anchors. With over 540 million voters and 100 million viewers, the
volume and complexity of data both provide a design challenge. This talk
focuses on the techniques behind this design.
• https://gramener.com/shows/slides/strata#/
• Situational awareness: This is not the data you're looking for
• http://cdn.oreillystatic.com/en/assets/1/event/126/Situational%20awareness_%2
0This%20is%20not%20the%20data%20you_re%20looking%20for%20Presentatio
n.pdf
Accenture gave a great talk on the case for
building an in-house data insights lab
• Accenture talked about the challenges with traditional approaches
for getting buy-in for data science solutions within companies.
• Their solution is to make concepts practical for clients using open
source technology and open data, creating visualisations, and mocking
up prototypes in a few weeks.
• This requires a multi-disciplinary team of data scientists, engineers
and designers who can use cutting edge technologies to bring
concepts to life – ‘The Technology Lab’.
• Their case study was a US Bank client. Accenture had difficulty
convincing the Senior Execs of the value of a risk dashboard. The
Technology Lab decided to build a prototype using open source
software and open mortgage loan/default data. Once the execs got
their hands on the prototype they soon gave the go-ahead for a
version using internal data.
• McKinsey also have a similar Digital Labs concept:
http://www.mckinsey.com/client_service/mckinsey_digital/expertise/
digital_labs
Advanced Machine Learning
• Deploying machine learning in production: what could possibly go wrong? - Alice Zheng (Dato)
• Building and deploying predictive applications require knowing how to evaluate, test, and track the performance
of machine learning models over time. Using available off-the-shelf tools, this talk engages potential application
builders on topics such as common evaluation metrics, A/B testing set up, tracking model performance, tracking
usage via real-time feedback, and updating models.
• http://cdn.oreillystatic.com/en/assets/1/event/126/Deploying%20machine%20learning%20in%20production%20
Presentation%201.pdf
• Scalable machine learning
• While the data management side of Big Data has seen tremendous progress in the past few years, bringing
technologies like Hadoop or Spark together with advanced machine learning and data analysis methods is still a
major challenge. In this talk, I will discuss recent advances, approaches, and patterns which are used to build
truly scalable machine learning solutions.
• http://cdn.oreillystatic.com/en/assets/1/event/126/Scalable%20machine%20learning%20Presentation.pdf
• Deep learning made doubly easy with reusable deep features
• http://blog.dato.com/deep-learning-blog-post
Data Science
• The curiosity advantage: the most important skill for data science
• Curiosity is one of the most valued skills for people working in Data Science. But
how can we train it? Einstein said that "Curiosity is an important trait of a
genius". Let’s explore how we can develop our curiosity with three exercises in
the session: how to find pleasure in uncertainty; question the question we’re
asking; and find a beginner's mind. With direct application to data science.
• http://cdn.oreillystatic.com/en/assets/1/event/126/The%20curiosity%20advantag
e_%20the%20most%20important%20skill%20for%20data%20science%20Present
ation.pdf
• Measuring the benefit effect for customers with Bayesian predictive modelling
• http://cdn.oreillystatic.com/en/assets/1/event/126/Measuring%20the%20benefit
%20effect%20for%20customers%20with%20Bayesian%20predictive%20modelin
g%20Presentation.pdf
Other
• Using Data for evil
• http://www.slideshare.net/DuncanRoss1/using-data-for-evil-2
• Apache Flink
• http://www.slideshare.net/stephanewen1/apache-flink-overview-and-use-cases-
at-prehadoop-summit-meetups?next_slideshow=1
• Moves the Needle brings Lean Startup principles, tools, tactics and strategy to
the enterprise
• http://www.movestheneedle.com/
Resources
• Speakers and slides
• http://strataconf.com/big-data-conference-uk-2015/public/schedule/speakers
• O’Reilly data blog
• https://beta.oreilly.com/topics/data
• Spark Training slides
• http://training.databricks.com/workshop/sparkcamp.pdf