2. RIGHTFOCUSANDONTARGET
Agenda
Analyze & Define
• Progression of Analytics
• The new phenomenon – Big Data
• Big Data Defined
Technology Discussion
• Big Data Technology – Hadoop
• Big Data – Big Savings – Hadoop
Use Cases
• What can we solve with Big Data – an example
• What is next? Where are the opportunities?
3. Progression of Analytics
So far…
• Structured, known data
• Traditional tooling – ETL, data marts, data warehouses, RDBMS
• Normal, incremental growth with regular archival
• Less cross-functional integration
• More tactical than strategic
• Data sizes in the GB-to-TB range
• Clear demarcation between data architects and functional analysts
5. The New Phenomenon – Big Data
1. No to "fit-for-all" but yes to "fit-for-purpose"
2. Proliferation of data sources – variety of data
3. Proliferation of data volume
4. The demand for speed (velocity) of data
5. The demand for high value & accuracy (veracity) of information
6. Massive parallel processing
7. Commodity servers vs. specialized servers

DATA-DRIVEN BUSINESS is THE SMART BUSINESS
6. Big Data Definition – the Five V's
• Volume – high volume of data, growing by more than 50% every year
• Velocity – high-speed streaming and machine-generated data
• Variety – many data sources, both inside the enterprise and in the external data around it
• Value – meaningful information extracted from data collections so large (typically 100 TB or more) that an RDBMS becomes inefficient
• Veracity – accuracy and trustworthiness of the information
7. Big Data Definition
Big Data is the new art and science, using Massively Parallel Processing (MPP) technology, of collecting, storing, processing, distributing, and analyzing data with any of the attributes high volume, high velocity, and high variety, in order to extract high value and greater accuracy (veracity).

IBM says Big Data means:
1. Volume (terabytes -> zettabytes)
2. Variety (structured -> semi-structured -> unstructured)
3. Velocity (batch -> streaming data)
8. Big Data Technologies – Typical Stack
The layers of the Big Data stack:
1. Big Data Infrastructure
2. Data Manipulation & Management
3. Data Analysis & Mining
4. Predictive & Prescriptive Analysis
5. Process Automation & Decision Support Systems
9. Big Data Technologies – the SMAQ Stack
Query – user-friendly analytics:
1. Pig (a simple query language), 2. Hive (similar to SQL),
3. Cascading (workflow), 4. Mahout (machine learning),
5. ZooKeeper (coordination service)
MapReduce – data distribution & management across nodes in batch mode:
1. Hadoop MapReduce
2. Alternatives – BashReduce, Disco Project, Spark, GraphLab (C&M), Storm, HPCC (LexisNexis)
Storage – distributed, non-relational:
1. HBase (columnar DB)
2. HDFS (Hadoop Distributed File System)
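The MapReduce model named in the SMAQ stack can be illustrated with the canonical word-count example. This is a minimal, single-process Python sketch of the map and shuffle-and-reduce phases only; real Hadoop distributes both phases across many nodes.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group the pairs by key and sum the counts per word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big savings", "big data on commodity servers"]
word_counts = reduce_phase(map_phase(docs))
# word_counts["big"] == 3, word_counts["data"] == 2
```

Hive and Pig, listed above, compile SQL-like and scripting queries down to exactly this kind of map and reduce job.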
10. Big Data – Big Savings – Economics
ROI of the Big Data approach (with Hadoop): total cost of ownership per 1 TB of storage
• Traditional RDBMS: $37,000
• Hadoop: only $2,000
Source: American Institute for Analytics
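The savings claim is simple arithmetic on the two per-terabyte figures quoted on the slide. The 100 TB workload size below is only an illustrative choice (it matches the threshold mentioned in the definition slide):

```python
# Per-terabyte TCO figures quoted on the slide (source as cited there).
RDBMS_COST_PER_TB = 37_000   # traditional RDBMS, $/TB
HADOOP_COST_PER_TB = 2_000   # Hadoop on commodity servers, $/TB

def storage_tco(terabytes, cost_per_tb):
    """Total cost of ownership for a given storage footprint."""
    return terabytes * cost_per_tb

tb = 100  # illustrative workload size
rdbms_total = storage_tco(tb, RDBMS_COST_PER_TB)     # $3,700,000
hadoop_total = storage_tco(tb, HADOOP_COST_PER_TB)   # $200,000
ratio = RDBMS_COST_PER_TB / HADOOP_COST_PER_TB       # 18.5x cheaper per TB
```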
11. Where Is the Market on Big Data?
• Infrastructure / framework / analytics software
• Horizontal solutions such as EDW
• Vertical solutions across industries: healthcare, retail, government / public sector, education & human capital, health sciences / genomics, telecommunications / services, energy & utilities, e-commerce / marketing, media & entertainment
[Chart: Big Data market in $B, 2010–2015, scale roughly $0–20B. Source: IDC 2011]
Current state
13. Pure Big Data Implementation – Architecture
Data sources (web logs, images & videos, social media, documents, structured data) feed into Big Data / Hadoop etc., which connects through connectors / adapters to the analytics layer: reporting, OLAP, modeling, predictive, and prescriptive analysis.
Barriers:
• Disruption to existing analytics?!
• Roadmap / methodology
• Certainty of costs
Hadoop / BigTable can replace traditional EDWs!
17. Big Data Opportunities
Some gaps & opportunities:
• Real-time analysis (perhaps using SAP HANA, etc.)
• User interface (UI) frameworks
• App development for Big Data on the cloud (multi-tenancy)
• Security & data governance
• Cross-application integration
• Industry standards
19. Our Big Data Strategy at a Glance
Business Focus
• Identify data needs
• Identify business issues
• Lay out data dependencies between functions
• Resolve competing priorities
• Clearly lay out the levels of data and cross-functional requirements
Stakeholder Focus
• Identify the stakeholders
• Align best practices with the project
• Plan out the objectives, scope, and timelines
• Identify the KPIs, reports, dashboards, and predictive & prescriptive analyses to be delivered
Technology Focus
• Identify synergies in current technology
• Take stock of existing "technology assets" for Big Data
• Assess your current capabilities and architecture
• Identify the resources and minimize "specialties" to exploit synergies with the existing resource pool
• Lay out a development methodology to streamline delivery
Process Focus
• Establish clear data flows
• Identify the Data Governance execution process – people, processes, mechanisms
• Design the process to be more business-focused than IT-focused
• Clearly establish measures to achieve accuracy, repeatability, agility, and accountability (reconcilability)
20. Our Execution Approach – Agile Methodology
An agile approach reduces risk:
• Close coordination between the customer and the developer
• Small incremental steps make testing easier and more manageable, and avoid surprises
• Early recovery from expectation mismatches
• Clarity on design understanding, with regular communication with users
• Early warning about risks through regular status reports
• Full knowledge transfer
Speaker notes – timings: Progression of Analytics – 3 minutes; The new phenomenon – Big Data – 4 minutes; Big Data Defined – 3 minutes; (…) – 2 minutes; Where is the (…) Technology – 5 minutes; What can we solve with Big Data – example case studies – 5 minutes; What is next? Where are the opportunities? – 10 minutes.
Notes (Progression of Analytics, so far):
• Internal information: known questions and answers, known structures, structured data types, known volumes, mostly transactional data; master data is very well defined.
• Storage: typical data warehouses and data marts using batch processing, traditional ETL, and relational databases.
• Data growth is incremental, with regular archival.
• Just reporting and a little mining – mostly descriptive; predictive analysis is very light.
• Cross-functional integration of data is very limited, structured around customers, services & products, logistics, etc.
• Functional and technical responsibilities are clearly demarcated: mostly data engineers and architects at the back end supporting business analysts and users.
• Most reports are just a measurement of tactics – they support the strategy more than they induce one.
• Data sizes are in the gigabyte-to-terabyte range, which becomes inefficient and costly past a certain size limit.
Notes (The new phenomenon – Big Data):
• Narrow & focused business missions – not "fit-for-all" but "fit-for-purpose". The need to discover more: facts, relationships, indicators, patterns, trends, and pointers that probably could not be discovered before, by cross-integrating data from various sources. The need to capture & store data, not just collect it.
• Proliferation of data sources – variety of data: multi-dimensional data, streaming data, geospatial data, social networking data, internal data (RDBMS), video & image data, text data (logs etc.), time-series data, genomics.
• Proliferation of data volume (crossed into petabytes and above): internet/intranet, social networks (Facebook & Twitter), mobile devices, smart home devices, smart systems (utilities etc.), media & entertainment.
• The demand for speed (velocity) of data collected, understood, processed, and distributed: accessibility (where, when, who, and how), time value (real time or not), increased speeds of consumption and of data generation.
• Demand for high value & accuracy (veracity) of information.
• Advent of technology with massive parallel processing – availability of Hadoop/MapReduce-style open-source and packaged technologies.
• Affordability of infrastructure – commodity servers vs. specialized servers.
Hadoop enables a computing solution that is:
• Scalable – new nodes can be added as needed, without changing data formats, how data is loaded, how jobs are written, or the applications on top.
• Cost-effective – Hadoop brings massively parallel computing to commodity servers; the result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
• Flexible – Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources; data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.
• Fault-tolerant – when you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.
Notes (the smart business): The word of the hour is "SMART"! Smart business means a targeted value proposition. Businesses are under pressure to maximize their investments – a focused approach, not a one-size-fits-all methodology. Targeted value propositions include targeted advertisement, tailored menus, focused initiatives, individualized attention, non-impersonal messaging, efficient governance, and greater accuracy.
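Hadoop's fault tolerance, noted above, rests on block replication: each block of a file lives on several nodes, so a reader simply falls back to another replica when a node dies. A toy sketch of the idea follows; HDFS's default replication factor really is 3, but the node names and read logic here are purely illustrative.

```python
import random

REPLICATION_FACTOR = 3  # HDFS default

def place_replicas(blocks, nodes, k=REPLICATION_FACTOR):
    """Assign each block to k distinct nodes."""
    return {block: random.sample(nodes, k) for block in blocks}

def read_block(block, placement, live_nodes):
    """Read from the first replica whose node is still alive."""
    for node in placement[block]:
        if node in live_nodes:
            return f"{block}@{node}"
    raise IOError(f"all replicas of {block} lost")

placement = place_replicas(["b0"], ["n1", "n2", "n3"])
# Even with node n1 down, the block is still readable from a surviving replica.
result = read_block("b0", placement, live_nodes={"n2", "n3"})
```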
Notes: Businesses want to gain competitive advantage by being able to act on timely, relevant, complete, and accurate information rather than one-size-fits-all solutions. Within the immense volume, variety, and velocity of data produced today is new information (facts, relationships, indicators, and pointers) that either could not practically be discovered in the past or simply did not exist before.
Notes (market): The market has just started picking up. There is a large gap in vertical solutions, and the biggest gap is in Big Data services; hardware and software components seem to be available already.
Notes (gaps & opportunities):
• Adapting to real-time analysis (perhaps using HANA)
• Development of industry standards
• Development of a universal schema for metadata and cataloging
• Tools to support security & data governance
• Support for cloud-ification (multi-tenancy)
• Support for data lineage
• A framework for cross-application integration
• Support for testing
• An automated & configurable monitoring and management console
• User interface (UI) frameworks
Notes (strategy at a glance):
Business focus – identify data needs for strategic business functions; identify business issues that need to be solved by Big Data; lay out data dependencies between functions; resolve competing priorities; clearly lay out the levels of data and cross-functional requirements.
Technology focus – identify the right technology to align with the current landscape for synergies; take stock of existing "technology assets" for Big Data; assess your current capabilities and architecture to support your goals, and select the deployment strategy that best fits your Big Data questions; identify the resources and minimize "specialties" to exploit synergies with the existing resource pool; lay out a development methodology to streamline delivery.
Stakeholder focus – clearly identify the stakeholders at all levels of data consumption; present best practices and align them with the project; plan out the objectives, scope, and timelines; identify the KPIs, reports, dashboards, and predictive & prescriptive analyses to be delivered.
Process focus – establish clear data flows from collection to consumption; identify the Data Governance execution process (people, processes, mechanisms); design the process to be more business-focused than IT-focused; clearly establish measures to achieve accuracy, repeatability, agility, and accountability (reconcilability).