In-Hadoop, In-Database and In-Memory Processing for Predictive Analytics

Predict to Act.

Stop Looking at the Rear View Mirror
From BI…
Business Intelligence can only show you what has already happened. This is like driving a car by looking only into the rear-view mirror. Do you really want to drive your business like that?

…to…
Data Discovery and Real-time Analytics offer a view through the windscreen. You can see what is happening right now, but you still cannot identify upcoming opportunities or threats.

Predictions.
Predictive Analytics delivers future outcomes and provides a look-ahead view. This is like projecting what lies behind the next curve onto your windscreen. Predictive Insights will show you that there is an accident ahead. Prediction-based Actions will automatically slow the car down.
Warning: Accident Ahead!

Value is Higher for Prediction-based Actions
Inform (Predictive Insights).
Aggregating micro-predictions shows you what can be expected
Allows for better decision making
Useful for supporting strategic decisions
Will not disrupt business processes
Limited and unspecified total value
Operationalize (Predactions*).
Millions of micro-predictions
Each Prediction-based Action is embedded into your business process
You will know how often you will be right and what your total gain will be
Brings your business processes to a new, proactive level
Huge total value
*Predactions are Prediction-based Actions: you predict what is going to happen, and then you "predact" on it.
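The claim that you will know how often you are right and what your total gain will be can be made concrete with a small calculation. The sketch below is only an illustration: the function name, the hit rate, and the gain and cost figures are hypothetical and do not come from the slides.

```python
def expected_total_gain(n_actions, hit_rate, gain_if_right, cost_if_wrong):
    """Expected total gain over n_actions automated, prediction-based actions.

    hit_rate is the fraction of predictions that turn out to be right
    (for example, measured on held-out data). All inputs are assumptions.
    """
    per_action = hit_rate * gain_if_right - (1.0 - hit_rate) * cost_if_wrong
    return n_actions * per_action


# Hypothetical numbers: one million micro-predictions, 80% of them right,
# each correct action gains 2.0 units, each wrong one costs 1.0 unit.
total = expected_total_gain(1_000_000, 0.8, 2.0, 1.0)
print(f"expected total gain: {total:,.0f} units")
```

A small per-action gain becomes a large total once it is multiplied by millions of embedded decisions, which is the point of the "Operationalize" column.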

Science?
 Predictive analytics is complex
 Hadoop is complex
 Proposed solution: Let’s create more Data Scientists!
 But there are flaws with this approach:
– Scientists are supposed to create new things. Data scientists spend
95% of their time on integrating and transforming data.
– A shortage of data scientists is predicted (McKinsey report)
– Being a hardcore programmer, having a PhD in Statistics, and being
able to understand business problems is a rare skill mix…

Radoop: RapidMiner on Hadoop
 We do this with RapidMiner + Hadoop = Radoop
– Hadoop is primarily used for batch analytics workloads (ad-hoc
reporting, machine learning, etc.)
– Hadoop only provides programming APIs and command line tools
– Radoop is a RapidMiner partner that brought the simplicity of RapidMiner for advanced analytics to Hadoop clusters
– Radoop has been in development since 2010
We need to empower collaborative teams with
different backgrounds to analyze data in Hadoop –
one team member might be the data scientist.

RapidMiner for Prediction-based Actions
Empower business users: Easy-to-use GUI for the design of processes. Predictive insights shown to improve decision making.
Business analysts in the driver’s seat: Let your analysts transform business problems into Prediction-based Actions. Create millions of micro-predictions and automate everyday decision making.
Facilitates collaboration among business users, business analysts, data scientists, and IT professionals.

Radoop: RapidMiner on Hadoop
 RapidMiner Data Flow Interface:
Simple design, execution and maintenance of analytics processes
– Focus: ad-hoc reporting and machine learning
– Also supports data import/export, data transformations, ETL workloads, visualization
 Combines distributed and in-memory analytics

Supported Hadoop Distributions

Client- or Server-based Architecture
Client-based Architecture Server-based Architecture

Segment Users based on Service Usage (ex.)
 Task: Define K user segments and assign users to segments
 Solution with Hadoop + Mahout:
– CREATE TABLE: define a schema for the service usage log file by
manually listing columns, types, defining separator character, etc.
– Write HiveQL queries (or Pig scripts or…) to aggregate service logs for
each user and calculate user attributes describing them
– Implement and execute a custom MapReduce job to convert data to
Mahout’s input format
– Run the Mahout K-Means algorithm with proper parameters
– Implement and execute a custom MapReduce job to convert the result
back into a delimited format
– Export the result from HDFS and import it into an RDBMS (or whatever
system makes use of the “predactions”…)
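The aggregation and clustering steps above can be sketched on a single machine in plain Python. This is not Mahout or Radoop code: it is a minimal in-memory illustration of the same logic (aggregate the log per user, then run K-Means), and the log rows, attribute choices, and function names are all hypothetical.

```python
from collections import defaultdict
from math import dist  # Python 3.8+


def aggregate_usage(log_rows):
    """Aggregate (user, minutes) log rows into per-user attributes:
    (number of sessions, average minutes per session)."""
    minutes = defaultdict(list)
    for user, mins in log_rows:
        minutes[user].append(mins)
    return {u: (float(len(m)), sum(m) / len(m)) for u, m in minutes.items()}


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm with a naive initialization (first k points)."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


# Hypothetical service usage log: (user id, session length in minutes).
log = [
    ("u1", 5.0), ("u1", 7.0), ("u2", 6.0),
    ("u3", 30.0), ("u3", 34.0), ("u3", 32.0), ("u3", 28.0),
    ("u4", 29.0), ("u4", 31.0), ("u4", 33.0),
]
attributes = aggregate_usage(log)
users = list(attributes)
centroids, labels = kmeans([attributes[u] for u in users], k=2)
segments = dict(zip(users, labels))  # user id -> segment index
print(segments)
```

On a cluster, each of these steps becomes a separate job with its own format conversions, which is what the list above spells out.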

Segment Users based on Service Usage (ex.)
 Task: Define K user segments and assign users to segments
 Solution with Radoop:

Radoop: Supported Functions
 Import/Export data to/from Hadoop
– Read CSV
– Read Database
– Write CSV
– Write Database
– Retrieve/Store/Append to Hive
 Data Transformations
– Select Attributes
– Filter Examples
– Generate Attributes
– Generate ID
– Aggregate
– Join
– Sort
– Normalize
– Replace
– Replace/Declare Missing Values
– Hive/Pig Script
 Machine learning & Statistical modeling
– Clustering: K-Means, Fuzzy K-Means,
Dirichlet, Canopy
– Model learning: Naive Bayes
– Model scoring: Naive Bayes, Decision
Tree, Logistic Regression, Linear
Regression
– Evaluation: Performance
– …and more…

Engine Comparison
 In-Memory:
– In-memory analytics is always the fastest way to build analytical models
– Data set size is restricted by hardware (memory)
– Data set size: On decent hardware, up to ca. 100 million data points
 In-Database:
– Not applicable for all analysis tasks
– Runtime depends on the power of the database server
– Data set size: Unlimited (limit is the external storage capacity)
 In-Hadoop:
– Not applicable for all analysis tasks
– Runtime depends on the power of the Hadoop cluster
– Due to massive overhead introduced by Hadoop, the usage of Hadoop is not
recommended for smaller data set sizes
– Data set size: Unlimited (limit is the external storage capacity)

Runtime Comparison for Naïve Bayes (20 nodes)

Runtime Comparison for Number of Nodes

Conclusion
 Predictive Analytics on Hadoop for Everyone:
– RapidMiner + Radoop is an easy-to-use & efficient alternative supporting the
collaboration process between different team members
– Not only Predictive Intelligence but also Prediction-based Actions can be created on top
of Hadoop clusters by everyone
 Runtimes:
– Looking at the runtimes for analytical algorithms, it is easy to see that limitations in terms of data set sizes have vanished today, but at the price of longer runtimes
– Running predictive analytics on Hadoop clusters is prohibitively slow for small data sets and in many cases also for interactive real-time reports
– Depending on the data itself, the number of nodes, and the selected predictive analytics algorithm, Hadoop clusters can beat the other engines already at ca. 10M to 25M data points
– In general, we recommend staying in-memory for up to 100M data points and investing in hardware before switching to in-database (up to 500M data points), and then to Hadoop clusters for data sets beyond this size
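The general recommendation can be written down as a simple rule of thumb. The sketch below only encodes the two size thresholds named above (100M and 500M data points). The function and constant names are hypothetical, and the rule ignores the other factors the slides mention (the data itself, the number of nodes, the algorithm), which can make Hadoop competitive at 10M to 25M data points.

```python
IN_MEMORY_LIMIT = 100_000_000    # up to ca. 100M data points: stay in-memory
IN_DATABASE_LIMIT = 500_000_000  # up to ca. 500M data points: in-database


def recommend_engine(n_data_points):
    """Return the engine suggested by the size-only rule of thumb."""
    if n_data_points <= IN_MEMORY_LIMIT:
        return "in-memory"
    if n_data_points <= IN_DATABASE_LIMIT:
        return "in-database"
    return "in-hadoop"


for size in (5_000_000, 250_000_000, 2_000_000_000):
    print(f"{size:>13,} data points -> {recommend_engine(size)}")
```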

CONTACT US
RapidMiner USA
RapidMiner, Inc. (Headquarters)
10 Fawcett St
Cambridge, MA 02138
United States
E-mail contact-us@rapidminer.com
Phone +1 - 617 - 401 - 7708
Fax +1 - 617 - 401 - 7709
RapidMiner Germany
RapidMiner GmbH
Stockumer Str. 475
44227 Dortmund
Germany
E-mail contact-de@rapidminer.com
Phone +49 - 231 - 425 786 9-0
Fax +49 - 231 - 425 786 9-9
RapidMiner UK
RapidMiner Ltd.
Quatro House, Frimley Road
Camberley GU16 7ER
United Kingdom
E-mail contact-uk@rapidminer.com
Phone +44 1276 804 426
www.rapidminer.com
