The document discusses using RapidMiner and Hadoop together (Radoop) for predictive analytics. It notes that while Hadoop is useful for large-scale batch analytics, it is slow for smaller data sets and for interactive real-time reports. Radoop allows users of different backgrounds to design predictive analytics processes on Hadoop clusters through RapidMiner's visual interface. This facilitates collaboration and empowers business users and analysts to automate decision making based on predictions. Runtime comparisons show that in-memory analytics is fastest for smaller data sets and that, depending on the data, the algorithm, and the number of nodes, Hadoop clusters can beat the other engines from about 10M to 25M data points.
2. Stop Looking at the Rear View Mirror
From BI…
Business Intelligence can only show you what has already happened.
This is like driving a car by looking only into the rear view mirror.
Do you really want to drive your business like that?
3. Stop Looking at the Rear View Mirror
CONTINUED
…to…
Data Discovery and Real-time Analytics offer a view through the windscreen.
You can see what is happening right now, but you still cannot identify upcoming chances or threats.
4. Stop Looking at the Rear View Mirror
CONTINUED
Predictions.
Predictive Analytics delivers future outcomes and provides a look-ahead view.
This is like projecting what lies behind the next curve onto your windscreen.
Predictive Insights will show you that there is an accident.
Prediction-based actions will then automatically slow the car down.
Warning: Accident Ahead!
5. Value is Higher for Prediction-based Actions
Inform.
– Aggregation of micro-predictions will show you what can be expected
– Allows for better decision making
– Useful for supporting strategic decisions
– Will not disrupt business processes
– Limited & unspecified total value
Operationalize.
– Millions of micro-predictions
– Each Predictive Action is embedded into your business process
– You will know how often you will be right and what your total gain will be
– Brings your business processes to a new, pro-active level
– Huge total value
Predictive Insights, Predactions*
*Predactions are Prediction-based Actions. You will predict what is going to happen. And then you will predact on this.
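The claim that you will know how often you will be right and what your total gain will be can be illustrated with a small expected-value calculation. This is only a sketch: the function name, the accuracy, and the gain and cost figures are hypothetical and do not come from the slides.

```python
def expected_total_gain(n_predictions, accuracy, gain_if_right, cost_if_wrong):
    """Expected total gain from acting on n micro-predictions.

    accuracy      -- fraction of predictions expected to be right (0..1)
    gain_if_right -- value of one correct prediction-based action
    cost_if_wrong -- loss caused by one wrong prediction-based action
    All figures are illustrative assumptions, not numbers from the slides.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    # Expected value of a single action, then scaled to all actions.
    per_action = accuracy * gain_if_right - (1.0 - accuracy) * cost_if_wrong
    return n_predictions * per_action


# One million automated decisions, right 75% of the time.
print(expected_total_gain(1_000_000, 0.75, 2.0, 4.0))  # 500000.0
```

Even a modest gain per action adds up once each prediction is embedded in a business process and executed millions of times.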
6. Science?
Predictive analytics is complex
Hadoop is complex
Proposed solution: Let’s create more Data Scientists!
But there are flaws with this approach:
– Scientists are supposed to create new things. Data scientists spend 95% of their time on integrating and transforming data.
– A shortage of data scientists is predicted (McKinsey report)
– Being a hardcore programmer, having a PhD in Statistics, and being able to understand business problems is a rare skill mix…
8. Radoop: RapidMiner on Hadoop
We do this with RapidMiner + Hadoop = Radoop
– Hadoop is primarily used for batch analytics workloads (ad-hoc reporting, machine learning, etc.)
– Hadoop only provides programming APIs and command line tools
– Radoop is a partner of RapidMiner that brought the simplicity of RapidMiner for advanced analytics to Hadoop clusters
– Radoop has been in development since 2010
We need to empower collaborative teams with different backgrounds to analyze data in Hadoop; one team member might be the data scientist.
9. RapidMiner for Prediction-based Actions
Empower business users: Easy-to-use GUI for the design of processes. Predictive insights shown to improve decision making.
Business analysts in the driver’s seat: Let your analysts transform business problems into Prediction-based Actions. Create millions of micro-predictions and automate everyday decision making.
Facilitates collaboration among business users, business analysts, data scientists, and IT professionals.
10. Radoop: RapidMiner on Hadoop
RapidMiner Data Flow Interface: Simple design, execution and maintenance of analytics processes
– Focus: ad-hoc reporting and machine learning
– Also supports data import/export, data transformations, ETL workloads, visualization
Combines distributed and in-memory analytics
13. Segment Users based on Service Usage (ex.)
Task: Define K user segments and assign users to segments
Solution with Hadoop + Mahout:
– CREATE TABLE: define a schema for the service usage log file by manually listing columns, types, defining separator character, etc.
– Write HiveQL queries (or Pig scripts or…) to aggregate service logs for each user and calculate user attributes describing them
– Implement and execute a custom MapReduce job to convert data to Mahout’s input format
– Run the Mahout K-Means algorithm with proper parameters
– Implement and execute a custom MapReduce job to convert the result back into a delimited format
– Export the result from HDFS and import it into an RDBMS (or whatever system makes use of the “predactions”…)
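To make the logic of these steps concrete, here is a minimal single-machine sketch of the same task: aggregate usage per user, then run K-Means to assign each user to a segment. It is an in-memory stand-in written in plain Python, not Hive, MapReduce, Mahout, or Radoop code. The `(user, service, amount)` log layout and all function names are assumptions made for illustration.

```python
import math
from collections import defaultdict


def aggregate_usage(log_rows):
    """Sum usage per user and service; return one feature vector per user."""
    services = sorted({service for _, service, _ in log_rows})
    totals = defaultdict(lambda: defaultdict(float))
    for user, service, amount in log_rows:
        totals[user][service] += amount
    # Services a user never used count as 0.0.
    return {user: [per_service[s] for s in services]
            for user, per_service in totals.items()}


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def segment_users(log_rows, k):
    """Return a {user: segment index} mapping for k segments."""
    features = aggregate_usage(log_rows)
    users = sorted(features)  # fixed order keeps the result reproducible
    assignment, _ = kmeans([features[u] for u in users], k)
    return dict(zip(users, assignment))


# Hypothetical usage log: (user, service, amount used).
log = [
    ("alice", "video", 120), ("alice", "mail", 5),
    ("bob", "video", 110), ("bob", "mail", 8),
    ("carol", "mail", 90), ("carol", "video", 3),
    ("dave", "mail", 95),
]
segments = segment_users(log, k=2)
print(segments)
```

On this toy log the heavy video users end up in one segment and the heavy mail users in the other. On a cluster, each function corresponds to one of the separately implemented steps listed above, plus the format conversions between them.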
14. Segment Users based on Service Usage (ex.)
Task: Define K user segments and assign users to segments
Solution with Radoop:
19. Engine Comparison
In-Memory:
– In-memory analytics is always the fastest way to build analytical models
– Data set size is restricted by hardware (memory)
– Data set size: On decent hardware, up to ca. 100 million data points
In-Database:
– Not applicable for all analysis tasks
– Runtime depends on the power of the database server
– Data set size: Unlimited (limit is the external storage capacity)
In-Hadoop:
– Not applicable for all analysis tasks
– Runtime depends on the power of the Hadoop cluster
– Due to massive overhead introduced by Hadoop, the usage of Hadoop is not recommended for smaller data set sizes
– Data set size: Unlimited (limit is the external storage capacity)
22. Conclusion
Predictive Analytics on Hadoop for Everyone:
– RapidMiner + Radoop is an easy-to-use & efficient alternative supporting the collaboration process between different team members
– Not only Predictive Intelligence but also Prediction-based Actions can be created on top of Hadoop clusters by everyone
Runtimes:
– Looking at the runtimes for analytical algorithms, it can easily be seen that limitations in terms of data set sizes have vanished today, but at the price of longer runtimes
– Running predictive analytics on Hadoop clusters is prohibitively slow for small data sets and in many cases also for interactive real-time reports
– Depending on the data itself, the number of nodes, and the selected predictive analytics algorithm, Hadoop clusters can beat the other engines already at ca. 10M to 25M data points
– In general we recommend staying in-memory for up to 100M data points and investing in hardware before switching to in-database (up to 500M data points) and then to Hadoop clusters for data sets beyond this size
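The general recommendation can be written down as a simple rule of thumb. The sketch below uses the 100M and 500M thresholds stated above; the function and constant names are hypothetical, and the rule deliberately ignores the caveat that Hadoop can already win at ca. 10M to 25M data points for some data, algorithms, and cluster sizes.

```python
# Thresholds taken from the recommendation above (number of data points).
IN_MEMORY_LIMIT = 100_000_000
IN_DATABASE_LIMIT = 500_000_000


def recommend_engine(data_points):
    """Return the engine suggested by the general rule of thumb."""
    if data_points < 0:
        raise ValueError("data_points must not be negative")
    if data_points <= IN_MEMORY_LIMIT:
        return "in-memory"
    if data_points <= IN_DATABASE_LIMIT:
        return "in-database"
    return "in-hadoop"


for size in (5_000_000, 250_000_000, 2_000_000_000):
    print(size, recommend_engine(size))
```

In practice the cut-over points depend on the available memory, the database server, and the cluster, so the constants should be treated as starting values to be checked against measured runtimes.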
23. RapidMiner USA
RapidMiner, Inc. (Headquarters)
10 Fawcett St
Cambridge, MA 02138
United States
E-mail contact-us@rapidminer.com
Phone +1 - 617 - 401 - 7708
Fax +1 - 617 - 401 - 7709
CONTACT US
RapidMiner Germany
RapidMiner GmbH
Stockumer Str. 475
44227 Dortmund
Germany
E-mail contact-de@rapidminer.com
Phone +49 - 231 - 425 786 9-0
Fax +49 - 231 - 425 786 9-9
RapidMiner UK
RapidMiner Ltd.
Quatro House, Frimley Road
Camberley GU16 7ER
United Kingdom
E-mail contact-uk@rapidminer.com
Phone +44 1276 804 426
www.rapidminer.com