SlideShare une entreprise Scribd logo
1  sur  39
0Mu Sigma Confidential
Chicago, IL
Bangalore, India
www.mu-sigma.com
Proprietary Information
"This document and its attachments are confidential. Any unauthorized copying, disclosure or distribution of the material is strictly forbidden"
Chicago, IL
Bangalore, India
www.mu-sigma.com
Proprietary Information
"This document and its attachments are confidential. Any unauthorized copying, disclosure or distribution of the material is strictly forbidden"
Do The Math
Next Generation Reference Architecture: Analytics Domain
June 2013
Hadoop Summit June 2013 – San Jose
zubin.dowlaty@mu-sigma.com
Twitter: @crunchdata
11Mu Sigma Confidential
 Analytics Trends & Big Data
 Reference Designs
 How to Scale Analytics and Intelligent Systems Reference Design
Agenda
2Mu Sigma Confidential
The information explosion is leading to an unprecedented ability to
store, manage and analyze data; we need to run ahead..
Information Ecosystem
► Text
Expanding ApplicationsData Source Explosion
Technology EvolutionAdvanced Analytical Techniques
Digital Data in 2011:
~ 1750 Exabytes
Facebook: 100+
million users per day
Mobile Phone
Subscription: 4.6
billion
15 Million Bing
recommendation records
RFID Tags: 40 billion
I
N
F
O
R
M
A
T
I
O
N
C
L
O
U
D
Click-stream
data
Purchase intent
data
Social Media,
Blogs
Email
Archives
RFID & Geospatial Information
Events
Opinion Mining
Social Graph
Targeting
Influencer
Marketing
Offer
Optimization
Clinical Data
Analysis
Fraud
Detection
SKU
Rationalization
Massively large
databases
Search
Technologies
Complex Event
Processing
Cloud
Computing
Software as a
service
Interactive
Visualization
In Memory
Analytics
CART
Adaptive
Regression
Splines
Machine
Learning
Radial
Basis Functions
Support Vector
Machines
Neural
Networks Geospatial Predictive
Modeling
3Mu Sigma Confidential
Three Key Macro Analytics Trends Targeting Consumption
Macro Trends: Agenda in All Three
Analytics TrendsComputational Enablement
– Scalable Analytics, Big Data &
NoSQL technologies, GP-GPU, In –
Memory & High Performance
Computing
– Master Data Management, Cloud,
and Federated Data – the end of
the central EDW?
– Analytics as a Service and the
Cloud
– Open Source Maturity
– Rise of Agile Self Service
– Data Model Disruption: ETL vs. ELT
Usability and Visualization
– Dashboards will Evolve into Cockpits
– Improved Interactive Data Visualization & Aesthetics
– Augmented / Virtual Reality / Natural User Interfaces
– Graph Analytics sparked by Social Data
– Discovery with Computational Geometry
– Mobile BI and the Integration of Consumer & Enterprise Devices
– Gamification: Do you want to play a game?
Intelligent Systems
(“Anticipation Denotes Intelligence”)
– Operational Smart Systems are
Back propagating – Pervasive
Analytics
– Real Time Analytics and Event
Streams
– Artificial Intelligence (AI) is not
what we expected
– Don’t forget the 4th tier it’s Juicy
– Metaheuristics and the Ants
– Operational Analytics, You are the
Exception
– Experiments in the Enterprise and
the Quasi
– Just Say No to OLS
– Energy Informatics if Radiating
– Intelligent Search
4Mu Sigma Confidential
Why should you care, is this phenomenon all marketing hype or
business critical?
 Data is an asset and growing by leaps and bounds
– Structured and unstructured (DARK MATTER)
– Within and Beyond the firewall
 A competitive differentiator is the use of data to improve decision making thru analytics
– “Executives are Professional Gamblers”
– Need an Edge..
» We are in the information age…
 Consumption of analytics differentiates
» Embedded analytics vs. ppt culture
The idea is linked to the concept of a
digital age or digital revolution, and
carries the ramifications of a shift from
traditional industry that the industrial
revolution brought through
industrialization, to an economy based
on the manipulation of information, i.e.,
an information society (computation)
It’s all about analytics done at “SCALE”, enabling scale cost effectively.
Who Cares?
5Mu Sigma Confidential
Big Data’s recent enterprise popularity can be attributed to two key
factors
Big Data Technology Evolution
2003 2004 2005 2006 2007 2008 2009 2010 2011
Apache: Hadoop project
Yahoo: 10K core
cluster
IBM
Acquires
Early Research Open source dev
momentum
Initial success
stories
Commercialization &
Adoption
Cloudera named most promising
startup funded in 2009
elastic map-reduce
Teradata buys Aster
(map-reduce DB)
Google Trigger: Releases “Map Reduce” & “GFS” paper
Google: Bigtable
paper
Lucene subproject
HP Acquires Vertica
EMC Acquires
Greenplum
2012
Oracle
MSOFT
SAS
 Technology Trigger
– Robust Open Source Technology for Parallel Computation on Commodity Hardware (Hadoop /
NoSQL)
 Frustration Trigger..
– Tyranny of the Data Model, Cost & Complexity for data
integration, Agility, Impact, Bureaucracy, Difficulty & Costly to scale of existing EDW technology
Faced with storing massive data sets from indexing the
web, Google searches for & finds a way to store & retrieve
the data in commodity hardware
6Mu Sigma Confidential
A Mindset for Enabling Decision Sciences At
Scale
Dataset -> Toolset -> Skillset -> Mindset
Big Data – What is it?
 Skillset
– Map-Reduce, HDFS, HIVE, R, PIG, NoSQL, Unstructured Data
– Parallel computation & Stream Processing
– Data Scientist and Data Explorer
– “…big data job postings on Dice more than tripled year-over-
year – Jan 2013”
 Mindset
– Automation and Scale
– Higher utilization of machines for decisions
– Agile
– Learning vs. Knowing
– Consumption
– Algorithmic and Heuristic (Man & Machine)
77Mu Sigma Confidential
 Analytics Trends & Big Data
 Reference Designs
 How to Scale Analytics and Intelligent Systems Reference Design
Agenda
8Mu Sigma Confidential
the Big Data eco-system is a challenge to keep up with..
Mindset: Knowing vs. Learning..
The Big Data Ecosystem
”Formulating a right question is always hard, but with big data, it is an order of
magnitude harder, because you are blazing the trail (not grazing on the green field).”
source: Gartner
Source: The Big Data Group llc
9Mu Sigma Confidential
Mindset: Qualitative
 Next Generation Analytics Platform (evoked set)
– Enterprise Architecture (Modern Best of Breed componentized Services, SOA)
– Scale (analytics done at scale)
– Value
– Commodity (scale horizontally using commodity hardware)
– Promote Reuse (published API, PMML, BPMN, SQL, CAMEL, CRISP-DM)
– Build Discipline (model versioning, GIT, unit testing, and coding standards)
– Batch and Real Time Capable
– Agile (integration and rapid deployment – analyst)
– Process Automation
– Knowledge Management
– Model Management
– Open Standards (HTML5, PMML)
– Quality Assurance (Unit Testing)
– Distributed & Parallel Capable (Hadoop, MPI, GPGPU)
– Inject intelligence into the ecosystem (consumption)
– Deploy Intelligent Systems & Intelligent Machines
– Champion / Challenger (test and learn)
– Robust and Agile ETL (HIVE, HBASE, SQL ETL tools)
– Beyond the tyranny of the data model..
10Mu Sigma Confidential
The “decision” stack will lead to bundling and tighter integration of
technology components
Privileged and confidential Copyright 2012, Wal-Mart stores, Inc.
DATA ENGINEERING LAYER
DATA SCIENCE LAYER
DECISION SUPPORT STACK
DECISION SCIENCE LAYER
• Data Acquisition (Data at Rest, Data at Motion)
• Data Storage (Databases, Big Data, In-memory)
• Data Processing (ETL, ELT, Event Processing)
• Workflow Management
• Statistical Analysis
• Reporting
• Data mining
• Visualization
• Text Mining/Unstructured Data Analysis
• Business Contextualization
• Intelligent Analytics Workflows
• Insight Generation tools
• Learning frameworks
• Operationalize & Scale
Math + Biz + Tech
Math + Tech
Technology
11Mu Sigma Confidential
Reference Design
Decision Support Stack for Operationalizing Analytics at Scale
Math + Business
+ Technology
Math + Technology
Technology
Dataset
++ Design
Weak Point for Enterprises Today
In Scaling Decision Sciences
12Mu Sigma Confidential
A typical Big Data Architecture design pattern leverages NoSQL as
a data store “reservoir”
RDBMS
NoSQL
Traditional
EDW
FTP server
IMPORT
SQOOP/ 3rd
party
connectors
Edge Node
Monitoring Workflow engine
(Oozie)
Streams
Cloud
NoSQL DBs HDFS/ MAP REDUCE/(MPI)
Hadoop cluster hardware
ETL/ ELT
EDW for storing the data after analytical processing
(STAR Schema, Snow flake, etc.)
BI Tools
REAL TIME
 In Memory: Spark, Storm, Shark,
Impala, Drill, Stinger
 NoSQL: Cassandra, MongoDB,
HBase
BATCH
 Hive
 Pig
 Native map/reduce
 Hadoop streaming
Business users/ Applications
Traditional
EDW EXPORT
SQOOP/ 3rd
party
connectors
Flume/ Avro/ Storm
Data Governance
 Authentication
 Authorization
 Kerberos
 LDAP/ AD
 Meta data mgmt.
Data sourcing
Mu Sigma’s POV: Big Data Arch. design pattern
CLOUD
13Mu Sigma Confidential
Reference Design
Decision Support Stack for Operationalizing Analytics at Scale
Math + Business
+ Technology
Math + Technology
Technology
Dataset
++ Design
Weak Point for Enterprises Today
In Scaling Decision Sciences
14Mu Sigma Confidential
Trend: Dashboards will Evolve to be More Actionable and Forward
Looking
Dashboredom?
http://www.information-
management.com/issues/20_7/dashboards_data_management_quality_b
usiness_intelligence-10019101-1.html
“As we move from the information age into the intelligence age”..
Usability and Visualization: Evolution
15Mu Sigma Confidential
Trend: Interactive Visualization Improvements and the Importance
of Aesthetics & Design
Advanced Visualization will continue to increase in depth and
relevance to broader audiences. Visionary vendors
like Tableau, QlikTech, SAS JMP/Visual Analytics, Panopticon,
and Spotfire (now Tibco) made their mark by providing
significantly differentiated visualization capabilities compared
with the trite bar and pie charts of most BI players' reporting
tools.
The latest advances in state-of-the-art UI technologies such as
Microsoft’s SilverLight, Adobe Flex, R, and Processing,
Javascript/ HTML5 via frameworks like Ext JS, D3
http://d3js.org/ augur the era of a revolution in state-of-the art
visualization capabilities, and Multi Touch Devices …
http://bardoli.blogspot.com/2009/12/top-10-trends-for-2010-in-analytics.html movie: Avengers & Iron Man
http://selection.datavisualization.ch/Usability and Visualization: Evolution
16Mu Sigma Confidential
Tapping the power of visual perception
 Vision is by far our most powerful sense. Seeing and thinking are intimately connected.
 I see what you mean. It isn't accidental that when we begin to understand something we say,
"I see." Not "I hear" or "I smell," but "I see."
 Approximately 70% of the sense receptors in our bodies are dedicated to vision
[Few, 2006]
Consumption Enablement
17Mu Sigma Confidential
Visually Encoding Data for Rapid Perception
 Pre-attentive processing, the early stage of visual perception that rapidly occurs below the
level of consciousness, is tuned to detect a specific set of visual attributes.
 How many 3’s?
1281736875613897654698450698560498286762
9809858453822450985645894509845098096585
9091030209905959595772564675050678904567
8845789809821677654872664908560932949686
1281736875613897654698450698560498286762
9809858453822450985645894509845098096585
9091030209905959595772564675050678904567
8845789809821677654872664908560932949686
Frameworks
Source: Tibco
18Mu Sigma Confidential
Key Elements for Visual Analysis
Data:
• integrate multiple data sources
• enrich your data
Add dimensions to your data:
• color
• ShapE
• Size
• “Labels”
•
• Graph Theory
Interactive:
• Ad-hoc ask & answer (self-service)
• Query filters (Drill down - hierarchy)
• Linked Visualizations (Brushing)
• Bookmarks & Tabs
• Movement (animation) Time Emphasis
Frameworks
Pre-Attentive Attributes of Visual
Perception most applicable to Data
Presentation: Form, Color, Spatial,
and Motion
“Pre-attentive processing is the unconscious
accumulation of information from the environment.
All available information is pre-attentively
processed. Then, the brain filters and processes
what is important. Information that has the highest
salience (a stimulus that stands out the most) or
relevance to what a person is thinking about is
selected for further and more complete analysis by
conscious (attentive) processing”
19Mu Sigma Confidential
Eloquence through Simplicity
 Edward R. Tufte introduced a concept in his 1983 classic The Visual Display of Quantitative
Information that he calls the "data-ink ratio." Tufte defines the data-ink ratio in the following
way: “A large share of ink on a graphic should present data-information, the ink changing as
the data change. Data-ink is the non-erasable core of a graphic, the non-redundant ink
arranged in response to variation in the numbers represented”
Reduce the non-data pixels
 Eliminate all unnecessary non-data pixels
 De-emphasize and regularize the non-data
pixels that remain
Enhance the data pixels
 Eliminate all unnecessary data pixels
 Highlight the most important data pixels
that remain
Frameworks
20Mu Sigma Confidential
General Principles of Design
 What’s taught in Art Class?
Frameworks
http://www.johnlovett.com/test.htm
21Mu Sigma Confidential
Example: Good, Bad, and the Ugly
22Mu Sigma Confidential
Example: Good, Bad, and the Ugly
23Mu Sigma Confidential
Reference Design
Decision Support Stack for Operationalizing Analytics at Scale
Math + Business
+ Technology
Math + Technology
Technology
Dataset
++ Design
Weak Point for Enterprises Today
In Scaling Decision Sciences
24Mu Sigma Confidential
Welcome to DSpace … Decision Sciences Home Page
Each Community
has its own Home
Page.
Recent submissions to this
entire Community are
featured on the Home Page.
Other
Links
Example: Knowledge Management
25Mu Sigma Confidential
Decision Science Wiki and the Power of Reproducible Analytics
Example: Knowledge Management
26Mu Sigma Confidential
Decision Sciences Wiki and the Power of Reproducible Analytics
Example: Knowledge Management
27Mu Sigma Confidential
Decision Sciences Wiki and the Power of Reproducible Analytics
Example: Knowledge Management
28Mu Sigma Confidential
Reference Design
Decision Support Stack for Operationalizing Analytics at Scale
Math + Business
+ Technology
Math + Technology
Technology
Dataset
++ Design
Weak Point for Enterprises Today
In Scaling Decision Sciences
29Mu Sigma Confidential
Typical Analytics Workflow
Teradata
15%oftheprocess70%oftheprocess15%
Time Period Selection
Sister Store Selection
Processing of codes
Basic data pull
Email
Request Client – Analytics
Report
Run the optimization
process
Set inputs in the code
SAS
Processing of codes
Optmodel optimization
VBA Excel
Output as excels
Client Report Generation
Excel
CPLEX
optimization
SAS
1
2
4
5
6
7
8
9
10
11
3
Example of Analytics Batch Process Mgmt
30Mu Sigma Confidential
Utilize “Analytics BPM”: Workflow and Business Process
Management Platform to Manage Man & Machine
Batch Analytics Automation
BPM is a rapid analytics automation solution that boosts the
productivity of analytical workgroups. Analytics BPM allows one
to build human and machine workflows and embed them
within your enterprise.
•Nexus BPM v1.5 (open source LGPL; based on Jboss JBPM)
•muFLOW v2.1 (based on Activiti.org)
Performance Gains Across Client Engagements
Industry Engagement Business Impact
Retail
Technology
Healthcare
Automates reporting
framework
Optimizes search engine
marketing spends
Helps create marketing
mix solution
Improvement in analyst productivity
Increased collaboration and utilization of
better process management strategies
Improved model granularity
Flexible bidding process
Increased click through rates by 8%
Easier tool deployment
Reduced development time by 15%
Support creation of additional reports rapidly
Key Features
 A modeling environment that “glues”
together reusable components from
multiple technologies and enables
automation
 A robust and flexible data integration
capability
 Ability to create rich analytics vehicles
that can be customized by client
 Provision to leverage past knowledge
 Facilitates efficient process
management such as modeling,
automation, monitoring, and analysis
31Mu Sigma Confidential
Professional Shops Require Model Management: “Algo Wars”
Source: SAS Model Manager
Key Features
Model Lifecycle Management
• “Software” Parallels
• Dev, Test, Stage,
Production
• Versioning
• PMML; vendor agnostic
Continuous model performance
management
• Champion/Challenger
• Live Update and
Optimization
Proactive notification and live
switch overs
Model Management
3232Mu Sigma Confidential
 Analytics Trends & Big Data
 Reference Designs
 How to Scale Analytics & Intelligent Systems Reference Design
Agenda
Analytics Trends
33Mu Sigma Confidential
Yippee ! 5 % discount ..
Yeah.. I think I need these
headphones too
The current business scenario demands automating and
injecting intelligence into various enterprise touch points
Business Situation
Webpage
Add to
Cart
Checkout
Yearly Bonus 
Let me buy a new laptop
Taxonomy,
Associations,
Nearest Neighbor
Popup
Optimization,
Need State
Personalization
Collaborative
Filtering
Recommender
Systems
Offer Optimization
Behind the
Scenes
Email
Marketing
Mix
CRM
Revenue
Management
Hey ! I have some good
deals being provided here
Wow ! Great offers ..
Since I have some more
money, let me have a look
Customer Segmentation
Propensity Score
Attribution Methods to
Optimize Media Spend
Pricing
Elasticity
Add Intelligent Analytics Everywhere You Can to Unlock The Value In Information
Search Buy laptop
34Mu Sigma Confidential
We are in the „Information Revolution‟ and its high time that
data and analytics capabilities are integrated to scale with
machine intelligence
Intelligent Systems
Intelligent Systems: Next Wave of Innovation and Change
The Next Big Opportunity !
Optimal Utilization of Available Information and Migration to Intelligent Systems
 Convergence of the global industrial
system with the power of advanced
computing, analytics, low-cost sensing
and new levels of connectivity
permitted by the Internet
* Source: GE (Industrial Internet – Pushing the Boundaries of Minds and Machines)
Intelligent
Machines
Advanced
Analytics
People
At Work
35Mu Sigma Confidential
Current News on Intelligent Systems
CEP Vendors Acquired June 2013: Streambase (Tibco) and Progress
Apama (Software AG)
USA Today, September 16, 2012
– http://www.usatoday.com/money/business/story/2012/09/16/review-automate-this-
probes-math-in-global-trading/57782600/1
 Two Second Advantage: Published 2011
– Excellent portrayal of Intelligent Systems, patterns from the human brains – mental
models
– Wayne Gretzky: Edge: He predicted a little better than anyone else, where the hockey
puck would be.
– Enterprise 2.0: Every transaction became a bit of digital data
– Enterprise 3.0: Every EVENT can become a bit of digital data
– There will always be value in making projections, days, months, years, predictive
analytics, data mining, and analytics systems on historical data – batch systems
– But in today’s world, enterprises need something
more, instantaneous, proactive, predictive capability – real time systems
 GE Mind an Machines: November 2012
» Industrial Internet, Intelligent Systems, and Intelligent Machines
» http://www.ge.com/docs/chapters/Industrial_Internet.pdf
» http://www.youtube.com/watch?v=SvI3Pmv-DhE
36Mu Sigma Confidential
High-level design for a Robust Intelligent System
Reference Architecture for Intelligent Systems
37Mu Sigma Confidential
Real Time Computation for Analytics: The Agent Metaphor
Real Time Automation
Java Agent Development Environment (JADE) is a java based middleware to aid in
agent oriented programming (AOP) across distributed machines. The Jade platform and
framework provides the functionality to easily develop software agents and control the
lifecycle of these agents.
Agent-based computing is an approach to design and implementation that facilitates
the design and development of sophisticated systems by viewing them as a society of
independent communicating agents working together to meet the goals of the system.
JADE Features:
 Agent
Communication
 Distributed Platform
 Agent Mobility
 Monitoring tools like
Remote Monitoring
Agent (RMA)
Agent Attributes:
 Autonomous
 Goal-directed
 Sociable
RJADE
Ragentsembeddedinside
Javawithrsession
associatedwiththem
38Mu Sigma Confidential
Thank You
Chicago, IL
Bangalore, India
2013
www.mu-sigma.com
Proprietary Information
"This document and its attachments are confidential. Any unauthorized copying, disclosure or distribution of the material is strictly prohibited"
zubin.dowlaty@mu-sigma.com
Twitter: @crunchdata

Contenu connexe

Plus de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...DataWorks Summit
 

Plus de DataWorks Summit (20)

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
 

Dernier

OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdfJamie (Taka) Wang
 
GenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncGenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncObject Automation
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?SANGHEE SHIN
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceMartin Humpolec
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxYounusS2
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum ComputingGDSC PJATK
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.francesco barbera
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 

Dernier (20)

OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
 
GenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncGenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation Inc
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your Salesforce
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptx
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum Computing
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 

Next Generation Analytics: A Reference Architecture

  • 1. 0Mu Sigma Confidential Chicago, IL Bangalore, India www.mu-sigma.com Proprietary Information "This document and its attachments are confidential. Any unauthorized copying, disclosure or distribution of the material is strictly forbidden" Chicago, IL Bangalore, India www.mu-sigma.com Proprietary Information "This document and its attachments are confidential. Any unauthorized copying, disclosure or distribution of the material is strictly forbidden" Do The Math Next Generation Reference Architecture: Analytics Domain June 2013 Hadoop Summit June 2013 – San Jose zubin.dowlaty@mu-sigma.com Twitter: @crunchdata
  • 2. 11Mu Sigma Confidential  Analytics Trends & Big Data  Reference Designs  How to Scale Analytics and Intelligent Systems Reference Design Agenda
  • 3. 2Mu Sigma Confidential The information explosion is leading to an unprecedented ability to store, manage and analyze data; we need to run ahead.. Information Ecosystem ► Text Expanding ApplicationsData Source Explosion Technology EvolutionAdvanced Analytical Techniques Digital Data in 2011: ~ 1750 Exabytes Facebook: 100+ million users per day Mobile Phone Subscription: 4.6 billion 15 Million Bing recommendation records RFID Tags: 40 billion I N F O R M A T I O N C L O U D Click-stream data Purchase intent data Social Media, Blogs Email Archives RFID & Geospatial Information Events Opinion Mining Social Graph Targeting Influencer Marketing Offer Optimization Clinical Data Analysis Fraud Detection SKU Rationalization Massively large databases Search Technologies Complex Event Processing Cloud Computing Software as a service Interactive Visualization In Memory Analytics CART Adaptive Regression Splines Machine Learning Radial Basis Functions Support Vector Machines Neural Networks Geospatial Predictive Modeling
  • 4. 3Mu Sigma Confidential Three Key Macro Analytics Trends Targeting Consumption Macro Trends: Agenda in All Three Analytics TrendsComputational Enablement – Scalable Analytics, Big Data & NoSQL technologies, GP-GPU, In – Memory & High Performance Computing – Master Data Management, Cloud, and Federated Data – the end of the central EDW? – Analytics as a Service and the Cloud – Open Source Maturity – Rise of Agile Self Service – Data Model Disruption: ETL vs. ELT Usability and Visualization – Dashboards will Evolve into Cockpits – Improved Interactive Data Visualization & Aesthetics – Augmented / Virtual Reality / Natural User Interfaces – Graph Analytics sparked by Social Data – Discovery with Computational Geometry – Mobile BI and the Integration of Consumer & Enterprise Devices – Gamification: Do you want to play a game? Intelligent Systems (“Anticipation Denotes Intelligence”) – Operational Smart Systems are Back propagating – Pervasive Analytics – Real Time Analytics and Event Streams – Artificial Intelligence (AI) is not what we expected – Don’t forget the 4th tier it’s Juicy – Metaheuristics and the Ants – Operational Analytics, You are the Exception – Experiments in the Enterprise and the Quasi – Just Say No to OLS – Energy Informatics if Radiating – Intelligent Search
  • 5. 4Mu Sigma Confidential Why should you care, is this phenomenon all marketing hype or business critical?  Data is an asset and growing by leaps and bounds – Structured and unstructured (DARK MATTER) – Within and Beyond the firewall  A competitive differentiator is the use of data to improve decision making thru analytics – “Executives are Professional Gamblers” – Need an Edge.. » We are in the information age…  Consumption of analytics differentiates » Embedded analytics vs. ppt culture The idea is linked to the concept of a digital age or digital revolution, and carries the ramifications of a shift from traditional industry that the industrial revolution brought through industrialization, to an economy based on the manipulation of information, i.e., an information society (computation) It’s all about analytics done at “SCALE”, enabling scale cost effectively. Who Cares?
  • 6. 5Mu Sigma Confidential Big Data’s recent enterprise popularity can be attributed to two key factors Big Data Technology Evolution 2003 2004 2005 2006 2007 2008 2009 2010 2011 Apache: Hadoop project Yahoo: 10K core cluster IBM Acquires Early Research Open source dev momentum Initial success stories Commercialization & Adoption Cloudera named most promising startup funded in 2009 elastic map-reduce Teradata buys Aster (map-reduce DB) Google Trigger: Releases “Map Reduce” & “GFS” paper Google: Bigtable paper Lucene subproject HP Acquires Vertica EMC Acquires Greenplum 2012 Oracle MSOFT SAS  Technology Trigger – Robust Open Source Technology for Parallel Computation on Commodity Hardware (Hadoop / NoSQL)  Frustration Trigger.. – Tyranny of the Data Model, Cost & Complexity for data integration, Agility, Impact, Bureaucracy, Difficulty & Costly to scale of existing EDW technology Faced with storing massive data sets from indexing the web, Google searches for & finds a way to store & retrieve the data in commodity hardware
  • 7. 6Mu Sigma Confidential A Mindset for Enabling Decision Sciences At Scale Dataset -> Toolset -> Skillset -> Mindset Big Data – What is it?  Skillset – Map-Reduce, HDFS, HIVE, R, PIG, NoSQL, Unstructured Data – Parallel computation & Stream Processing – Data Scientist and Data Explorer – “…big data job postings on Dice more than tripled year-over- year – Jan 2013”  Mindset – Automation and Scale – Higher utilization of machines for decisions – Agile – Learning vs. Knowing – Consumption – Algorithmic and Heuristic (Man & Machine)
  • 8. 77Mu Sigma Confidential  Analytics Trends & Big Data  Reference Designs  How to Scale Analytics and Intelligent Systems Reference Design Agenda
  • 9. 8Mu Sigma Confidential the Big Data eco-system is a challenge to keep up with.. Mindset: Knowing vs. Learning.. The Big Data Ecosystem ”Formulating a right question is always hard, but with big data, it is an order of magnitude harder, because you are blazing the trail (not grazing on the green field).” source: Gartner Source: The Big Data Group llc
  • 10. 9Mu Sigma Confidential Mindset: Qualitative  Next Generation Analytics Platform (evoked set) – Enterprise Architecture (Modern Best of Breed componentized Services, SOA) – Scale (analytics done at scale) – Value – Commodity (scale horizontally using commodity hardware) – Promote Reuse (published API, PMML, BPMN, SQL, CAMEL, CRISP-DM) – Build Discipline (model versioning, GIT, unit testing, and coding standards) – Batch and Real Time Capable – Agile (integration and rapid deployment – analyst) – Process Automation – Knowledge Management – Model Management – Open Standards (HTML5, PMML) – Quality Assurance (Unit Testing) – Distributed & Parallel Capable (Hadoop, MPI, GPGPU) – Inject intelligence into the ecosystem (consumption) – Deploy Intelligent Systems & Intelligent Machines – Champion / Challenger (test and learn) – Robust and Agile ETL (HIVE, HBASE, SQL ETL tools) – Beyond the tyranny of the data model..
  • 11. 10Mu Sigma Confidential The “decision” stack will lead to bundling and tighter integration of technology components Privileged and confidential Copyright 2012, Wal-Mart stores, Inc. DATA ENGINEERING LAYER DATA SCIENCE LAYER DECISION SUPPORT STACK DECISION SCIENCE LAYER • Data Acquisition (Data at Rest, Data at Motion) • Data Storage (Databases, Big Data, In-memory) • Data Processing (ETL, ELT, Event Processing) • Workflow Management • Statistical Analysis • Reporting • Data mining • Visualization • Text Mining/Unstructured Data Analysis • Business Contextualization • Intelligent Analytics Workflows • Insight Generation tools • Learning frameworks • Operationalize & Scale Math + Biz + Tech Math + Tech Technology
  • 12. 11Mu Sigma Confidential Reference Design Decision Support Stack for Operationalizing Analytics at Scale Math + Business + Technology Math + Technology Technology Dataset ++ Design Weak Point for Enterprises Today In Scaling Decision Sciences
  • 13. 12Mu Sigma Confidential A typical Big Data Architecture design pattern leverages NoSQL as a data store “reservoir” RDBMS NoSQL Traditional EDW FTP server IMPORT SQOOP/ 3rd party connectors Edge Node Monitoring Workflow engine (Oozie) Streams Cloud NoSQL DBs HDFS/ MAP REDUCE/(MPI) Hadoop cluster hardware ETL/ ELT EDW for storing the data after analytical processing (STAR Schema, Snow flake, etc.) BI Tools REAL TIME  In Memory: Spark, Storm, Shark, Impala, Drill, Stinger  NoSQL: Cassandra, MongoDB, HBase BATCH  Hive  Pig  Native map/reduce  Hadoop streaming Business users/ Applications Traditional EDW EXPORT SQOOP/ 3rd party connectors Flume/ Avro/ Storm Data Governance  Authentication  Authorization  Kerberos  LDAP/ AD  Meta data mgmt. Data sourcing Mu Sigma’s POV: Big Data Arch. design pattern CLOUD
  • 14. 13Mu Sigma Confidential Reference Design Decision Support Stack for Operationalizing Analytics at Scale Math + Business + Technology Math + Technology Technology Dataset ++ Design Weak Point for Enterprises Today In Scaling Decision Sciences
  • 15. 14Mu Sigma Confidential Trend: Dashboards will Evolve to be More Actionable and Forward Looking Dashboredom? http://www.information- management.com/issues/20_7/dashboards_data_management_quality_b usiness_intelligence-10019101-1.html “As we move from the information age into the intelligence age”.. Usability and Visualization: Evolution
  • 16. 15Mu Sigma Confidential Trend: Interactive Visualization Improvements and the Importance of Aesthetics & Design Advanced Visualization will continue to increase in depth and relevance to broader audiences. Visionary vendors like Tableau, QlikTech, SAS JMP/Visual Analytics, Panopticon, and Spotfire (now Tibco) made their mark by providing significantly differentiated visualization capabilities compared with the trite bar and pie charts of most BI players' reporting tools. The latest advances in state-of-the-art UI technologies such as Microsoft’s SilverLight, Adobe Flex, R, and Processing, Javascript/ HTML5 via frameworks like Ext JS, D3 http://d3js.org/ augur the era of a revolution in state-of-the art visualization capabilities, and Multi Touch Devices … http://bardoli.blogspot.com/2009/12/top-10-trends-for-2010-in-analytics.html movie: Avengers & Iron Man http://selection.datavisualization.ch/Usability and Visualization: Evolution
  • 17. 16Mu Sigma Confidential Tapping the power of visual perception  Vision is by far our most powerful sense. Seeing and thinking are intimately connected.  I see what you mean. It isn't accidental that when we begin to understand something we say, "I see." Not "I hear" or "I smell," but "I see."  Approximately 70% of the sense receptors in our bodies are dedicated to vision [Few, 2006] Consumption Enablement
  • 18. 17Mu Sigma Confidential Visually Encoding Data for Rapid Perception  Pre-attentive processing, the early stage of visual perception that rapidly occurs below the level of consciousness, is tuned to detect a specific set of visual attributes.  How many 3’s? 1281736875613897654698450698560498286762 9809858453822450985645894509845098096585 9091030209905959595772564675050678904567 8845789809821677654872664908560932949686 1281736875613897654698450698560498286762 9809858453822450985645894509845098096585 9091030209905959595772564675050678904567 8845789809821677654872664908560932949686 Frameworks Source: Tibco
  • 19. 18Mu Sigma Confidential Key Elements for Visual Analysis Data: • integrate multiple data sources • enrich your data Add dimensions to your data: • color • ShapE • Size • “Labels” • • Graph Theory Interactive: • Ad-hoc ask & answer (self-service) • Query filters (Drill down - hierarchy) • Linked Visualizations (Brushing) • Bookmarks & Tabs • Movement (animation) Time Emphasis Frameworks Pre-Attentive Attributes of Visual Perception most applicable to Data Presentation: Form, Color, Spatial, and Motion “Pre-attentive processing is the unconscious accumulation of information from the environment. All available information is pre-attentively processed. Then, the brain filters and processes what is important. Information that has the highest salience (a stimulus that stands out the most) or relevance to what a person is thinking about is selected for further and more complete analysis by conscious (attentive) processing”
  • 20. 19Mu Sigma Confidential Eloquence through Simplicity  Edward R. Tufte introduced a concept in his 1983 classic The Visual Display of Quantitative Information that he calls the "data-ink ratio." Tufte defines the data-ink ratio in the following way: “A large share of ink on a graphic should present data-information, the ink changing as the data change. Data-ink is the non-erasable core of a graphic, the non-redundant ink arranged in response to variation in the numbers represented” Reduce the non-data pixels  Eliminate all unnecessary non-data pixels  De-emphasize and regularize the non-data pixels that remain Enhance the data pixels  Eliminate all unnecessary data pixels  Highlight the most important data pixels that remain Frameworks
  • 21. 20Mu Sigma Confidential General Principles of Design  What’s taught in Art Class? Frameworks http://www.johnlovett.com/test.htm
  • 22. 21Mu Sigma Confidential Example: Good, Bad, and the Ugly
  • 23. 22Mu Sigma Confidential Example: Good, Bad, and the Ugly
  • 24. 23Mu Sigma Confidential Reference Design Decision Support Stack for Operationalizing Analytics at Scale Math + Business + Technology Math + Technology Technology Dataset ++ Design Weak Point for Enterprises Today In Scaling Decision Sciences
  • 25. 24Mu Sigma Confidential Welcome to DSpace … Decision Sciences Home Page Each Community has its own Home Page. Recent submissions to this entire Community are featured on the Home Page. Other Links Example: Knowledge Management
  • 26. 25Mu Sigma Confidential Decision Science Wiki and the Power of Reproducible Analytics Example: Knowledge Management
  • 27. 26Mu Sigma Confidential Decision Sciences Wiki and the Power of Reproducible Analytics Example: Knowledge Management
  • 28. 27Mu Sigma Confidential Decision Sciences Wiki and the Power of Reproducible Analytics Example: Knowledge Management
  • 29. 28Mu Sigma Confidential Reference Design Decision Support Stack for Operationalizing Analytics at Scale Math + Business + Technology Math + Technology Technology Dataset ++ Design Weak Point for Enterprises Today In Scaling Decision Sciences
  • 30. 29Mu Sigma Confidential Typical Analytics Workflow Teradata 15%oftheprocess70%oftheprocess15% Time Period Selection Sister Store Selection Processing of codes Basic data pull Email Request Client – Analytics Report Run the optimization process Set inputs in the code SAS Processing of codes Optmodel optimization VBA Excel Output as excels Client Report Generation Excel CPLEX optimization SAS 1 2 4 5 6 7 8 9 10 11 3 Example of Analytics Batch Process Mgmt
  • 31. 30Mu Sigma Confidential Utilize “Analytics BPM”: Workflow and Business Process Management Platform to Manage Man & Machine Batch Analytics Automation BPM is a rapid analytics automation solution that boosts the productivity of analytical workgroups. Analytics BPM allows one to build human and machine workflows and embed them within your enterprise. •Nexus BPM v1.5 (open source LGPL; based on Jboss JBPM) •muFLOW v2.1 (based on Activiti.org) Performance Gains Across Client Engagements Industry Engagement Business Impact Retail Technology Healthcare Automates reporting framework Optimizes search engine marketing spends Helps create marketing mix solution Improvement in analyst productivity Increased collaboration and utilization of better process management strategies Improved model granularity Flexible bidding process Increased click through rates by 8% Easier tool deployment Reduced development time by 15% Support creation of additional reports rapidly Key Features  A modeling environment that “glues” together reusable components from multiple technologies and enables automation  A robust and flexible data integration capability  Ability to create rich analytics vehicles that can be customized by client  Provision to leverage past knowledge  Facilitates efficient process management such as modeling, automation, monitoring, and analysis
  • 32. 31Mu Sigma Confidential Professional Shops Require Model Management: “Algo Wars” Source: SAS Model Manager Key Features Model Lifecycle Management • “Software” Parallels • Dev, Test, Stage, Production • Versioning • PMML; vendor agnostic Continuous model performance management • Champion/Challenger • Live Update and Optimization Proactive notification and live switch overs Model Management
  • 33. 3232Mu Sigma Confidential  Analytics Trends & Big Data  Reference Designs  How to Scale Analytics & Intelligent Systems Reference Design Agenda Analytics Trends
  • 34. 33Mu Sigma Confidential Yippee ! 5 % discount .. Yeah.. I think I need these headphones too The current business scenario demands automating and injecting intelligence into various enterprise touch points Business Situation Webpage Add to Cart Checkout Yearly Bonus  Let me buy a new laptop Taxonomy, Associations, Nearest Neighbor Popup Optimization, Need State Personalization Collaborative Filtering Recommender Systems Offer Optimization Behind the Scenes Email Marketing Mix CRM Revenue Management Hey ! I have some good deals being provided here Wow ! Great offers .. Since I have some more money, let me have a look Customer Segmentation Propensity Score Attribution Methods to Optimize Media Spend Pricing Elasticity Add Intelligent Analytics Everywhere You Can to Unlock The Value In Information Search Buy laptop
  • 35. 34Mu Sigma Confidential We are in the „Information Revolution‟ and its high time that data and analytics capabilities are integrated to scale with machine intelligence Intelligent Systems Intelligent Systems: Next Wave of Innovation and Change The Next Big Opportunity ! Optimal Utilization of Available Information and Migration to Intelligent Systems  Convergence of the global industrial system with the power of advanced computing, analytics, low-cost sensing and new levels of connectivity permitted by the Internet * Source: GE (Industrial Internet – Pushing the Boundaries of Minds and Machines) Intelligent Machines Advanced Analytics People At Work
  • 36. 35Mu Sigma Confidential Current News on Intelligent Systems CEP Vendors Acquired June 2013: Streambase (Tibco) and Progress Apama (Software AG) USA Today, September 16, 2012 – http://www.usatoday.com/money/business/story/2012/09/16/review-automate-this- probes-math-in-global-trading/57782600/1  Two Second Advantage: Published 2011 – Excellent portrayal of Intelligent Systems, patterns from the human brains – mental models – Wayne Gretzky: Edge: He predicted a little better than anyone else, where the hockey puck would be. – Enterprise 2.0: Every transaction became a bit of digital data – Enterprise 3.0: Every EVENT can become a bit of digital data – There will always be value in making projections, days, months, years, predictive analytics, data mining, and analytics systems on historical data – batch systems – But in today’s world, enterprises need something more, instantaneous, proactive, predictive capability – real time systems  GE Mind an Machines: November 2012 » Industrial Internet, Intelligent Systems, and Intelligent Machines » http://www.ge.com/docs/chapters/Industrial_Internet.pdf » http://www.youtube.com/watch?v=SvI3Pmv-DhE
  • 37. 36Mu Sigma Confidential High-level design for a Robust Intelligent System Reference Architecture for Intelligent Systems
  • 38. 37Mu Sigma Confidential Real Time Computation for Analytics: The Agent Metaphor Real Time Automation Java Agent Development Environment (JADE) is a java based middleware to aid in agent oriented programming (AOP) across distributed machines. The Jade platform and framework provides the functionality to easily develop software agents and control the lifecycle of these agents. Agent-based computing is an approach to design and implementation that facilitates the design and development of sophisticated systems by viewing them as a society of independent communicating agents working together to meet the goals of the system. JADE Features:  Agent Communication  Distributed Platform  Agent Mobility  Monitoring tools like Remote Monitoring Agent (RMA) Agent Attributes:  Autonomous  Goal-directed  Sociable RJADE Ragentsembeddedinside Javawithrsession associatedwiththem
  • 39. 38Mu Sigma Confidential Thank You Chicago, IL Bangalore, India 2013 www.mu-sigma.com Proprietary Information "This document and its attachments are confidential. Any unauthorized copying, disclosure or distribution of the material is strictly prohibited" zubin.dowlaty@mu-sigma.com Twitter: @crunchdata

Notes de l'éditeur

  1. How many did you find? The correct answer is five. Whether you got the answer right or not, the process in the first case took you a while because it involved attentive processing. The list of numbers did not exhibit any pre-attentive attributes that you could use to distinguish the threesfrom the other numbers.Much easier the second time, wasn't it? In this figure the threes could easily be distinguished from the other numbers, due to their differing color intensity (one of the pre-attentive attributes discussed in the next slide): the threes are black while all the other numbers are gray, which causes them to stand out in clear contrast. Why couldn't we easily distinguish the threes in the first set of numbers based purely on their unique shape? Because the complex shapes of the numbers are not attributes that we perceive pre-attentively. Simple shapes such as circles and squares are pre-attentively perceived, but the shapes of numbers are too elaborate.
  2. The key idea behind Tufte’s principles is that a major portion of the graphic should directly correlate to the data that is to be presented. The change in the ink should happen only if the data is changing since the major portion of the ink is data-ink. The fluff (non-data ink) should be minimal. Examples of non-data ink are grid lines, axis, etc.In the context of computer systems, we can think in terms of pixels instead of ink (with the background pixels not being considered).