- 1-
Data Warehouse design
Design of Enterprise Systems
University of Pavia
10/12/2013 2h for the first; 2h for hadoop
- 2-
Table of Contents
Big Data Overview
Big Data, DW & BI
Big Data Market
Hadoop & Mahout
- 3-
BIG DATA OVERVIEW
Data Warehouse design
- 4-
Big Data Overview: Table of Contents
Big Data Overview
– Data Growth
– Definition
– Big Data vs. Relational Data
– Its Value
– Big Data Benefits
– Big Data Usage
– Challenges
- 5-
Big Data Overview: Data Growth
• Storage capacity increases by 23% per year on average
• The ability to store all the available information has come to an end
[Figure: Data Storage Growth (y-axis: Exabytes, x-axis: Years)]
• Exponential growth over the decade starting in 2010
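As a rough worked example of the 23% figure, the sketch below projects capacity under constant compound growth and computes the doubling time. Constant compounding is an assumption made here for illustration; the function names are hypothetical and not part of the slides.

```python
import math


def projected_capacity(initial, annual_rate, years):
    """Capacity after `years` of compound growth at `annual_rate` (0.23 = 23%)."""
    return initial * (1 + annual_rate) ** years


def doubling_time(annual_rate):
    """Years needed for capacity to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_rate)


if __name__ == "__main__":
    # Starting from 1 unit of capacity (the unit is arbitrary).
    print(f"Growth factor after 10 years: {projected_capacity(1.0, 0.23, 10):.2f}")
    print(f"Doubling time: {doubling_time(0.23):.2f} years")
```

Under this assumption, capacity grows roughly eightfold in a decade and doubles about every three and a third years.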
- 6-
Big Data Overview: Definition
Gartner definition (2012): "Big data is high volume,
high velocity, and/or high variety information assets
that require new forms of processing to enable
enhanced decision making, insight discovery and
process optimization."
- 7-
Big Data Overview: Big Data vs. Relational Data
Application | Relation-Based Data | Big Data
Data processing | Single-computer platform that scales with better CPUs, centralized processing. | Cluster platforms that scale to thousands of nodes, distributed processing.
Data management | Relational database (SQL), centralized storage. | Non-relational databases that manage varied data types and formats (NoSQL), distributed storage.
Analytics | Batched, descriptive, centralized. | Real-time, predictive and prescriptive, distributed analytics.
- 8-
Big Data Overview: Its Value 1/3
Several classes of company head the revenue chart ($11.59 billion):
• broad-portfolio tech giants (IBM, HP, Oracle, EMC)
• leading software houses (Teradata, SAP, Microsoft)
• professional services companies (PwC, Accenture)
Source: Wikibon, Big Data Vendor Revenue and Market Forecast 2012-2017
Source: http://www.zdnet.com/big-data-an-overview_p2-7000020785/
- 9-
Big Data Overview: Its Value 2/3
• Pure play: vendors who derive 100 percent of their revenue from this market
Source: Wikibon, Big Data Vendor Revenue and Market Forecast 2012-2017
Source: http://www.zdnet.com/big-data-an-overview_p2-7000020785/
- 10-
Big Data Overview: Its Value 3/3
Source: Worldwide Big Data Technologies and Services: 2012-2015 Forecast (IDC, 2012)
• IDC: Big data will become a $17 billion business by 2015 ($23.8 billion by 2016)
• Big data storage will account for 6.8% of the entire worldwide storage market by 2015
Source: http://www.zdnet.com/big-data-an-overview_p2-7000020785/
- 11-
Big Data Overview: Big Data Benefits
Business benefits received by implementing an effective Big Data
methodology. The survey is based on 1153 responses from 325 respondents
- 12-
Big Data Overview: Big Data Usage 1/2
• E-Commerce and Market Intelligence
– Recommender systems
– Social media monitoring and analysis
– Crowd-sourcing systems
– Social and virtual games
• E-Government and Politics 2.0
– Ubiquitous government services
– Equal access and public services
– Citizen engagement
• Science & Technology
– S&T innovation
– Hypothesis testing
– Knowledge discovery
• Smart Health and Wellbeing
– Human and plant genomics
– Healthcare decision support
– Patient community analysis
• Security and Public Safety
– Crime analysis
– Computational criminology
– Terrorism informatics
– Open-source intelligence
– Cyber security
- 13-
Big Data Overview: Big Data Usage 2/2
Survey of European companies from Steria's Business Intelligence Maturity Audit (biMA)
- 14-
Big Data Overview: Challenges 1/2
Main challenges companies face with Big Data. The survey is based on
1153 responses from 325 respondents
- 15-
Big Data Overview: Challenges 2/2
A survey of European companies from Steria's Business Intelligence Maturity Audit (biMA)
• Technical
– 38% have data quality problems
– A lack of data governance; no master data management system (38%)
• Organizational
– 72% have no BI strategy; 70% have no BI governance
– Only 7% rate big data as very relevant
Source: http://www.steria.com/uk/media-centre/press-releases/press-releases/article/survey-suggests-only-7-of-european-companies-rate-big-data-as-very-relevant-to-their-business/
- 16-
BIG DATA, DW & BI
Data Warehouse design
- 17-
Big Data, DW & BI: Table of Contents
Big Data, DW & BI
– Evolution
– Techniques
– Cost
– Best Practices
- 18-
BI Evolution
BI and Analytics: evolution and characteristics
BI&A 1.0
Key Characteristics:
-DBMS-based, structured content.
-RDBMS & data warehousing.
-ETL & OLAP.
-Dashboards & scorecards.
-Data mining & statistical analysis.
Gartner BI Platforms Core Capabilities:
-Ad hoc query & search-based BI
-Reporting, dashboards & scorecards
-OLAP
-Interactive visualization
-Predictive modeling & data mining.
Gartner Hype Cycle:
-Column-based DBMS
-In-memory DBMS
-Real-time decision
-Data mining workbenches
BI&A 2.0
Key Characteristics:
Web-based, unstructured content
-Information retrieval and extraction
-Opinion mining
-Question answering
-Web analytics and web intelligence
-Social media analytics
-Social network analysis
-Spatial-temporal analysis
Gartner Hype Cycle:
-Information semantic services
-Natural language question answering
-Content & text analytics
BI&A 3.0
Key Characteristics:
Mobile and sensor-based content
-Location-aware analysis
-Person-centered analysis
-Context-relevant analysis
-Mobile visualization & HCI
Gartner Hype Cycle:
-Mobile BI
- 19-
Big Data Overview: Techniques 1/2
McKinsey Global Institute in 2011 provided a list of the top 10 common techniques applicable across a range of industries, particularly in response to the need to analyze new amounts of data and their combination.
List of the top 10 techniques which require Big Data (1/2)
A/B Testing: A technique in which a control group is compared with a variety of test groups in order to determine what treatments will improve a given objective. An example application is determining what copy text, layouts, images, or colors will improve conversion rates on an e-commerce Web site. Big Data enables huge numbers of tests to be executed and analyzed.
Cluster Analysis: A statistical method that aims to classify a huge data set and, in particular, to identify common behavior.
Classification: A set of techniques to identify the categories to which new data points belong, based on a training set containing data points that have already been categorized.
Data Mining: A set of techniques and technologies for extracting patterns from large datasets by combining statistical methods and algorithms. These techniques include association rule learning, cluster analysis, classification, and regression.
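To make the A/B testing entry concrete, here is a minimal sketch that compares the conversion rates of a control group and one test group with a two-proportion z-test. The choice of this particular test, the function name and the sample numbers are assumptions for illustration; the slides do not prescribe a specific statistical test.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and visitors in the control group (A).
    conv_b / n_b: conversions and visitors in the test group (B).
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis "A and B convert equally".
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variation in the data; the test is undefined")
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical numbers: 2.0% vs 2.6% conversion on 10,000 visitors each.
    z, p = two_proportion_z_test(200, 10_000, 260, 10_000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between the two page variants is unlikely to be due to chance alone; with many simultaneous tests, a correction for multiple comparisons would also be needed.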
- 20-
Big Data Overview: Techniques 2/2
McKinsey Global Institute in 2011 provided a list of the top 10 common techniques applicable across a range of industries, particularly in response to the need to analyze new amounts of data and their combination.
List of the top 10 techniques which require Big Data (2/2)
Network analysis: A set of techniques used to characterize relationships among discrete nodes in a graph or a network. In social network analysis, connections between individuals in a community or organization are analyzed.
Predictive modeling: A set of techniques in which a mathematical model is created or chosen to best predict the probability of an outcome.
Sentiment analysis: Application of natural language processing and other analytic techniques to identify and extract subjective information from source text material.
Statistics: The science of the collection, organization, and interpretation of data, including the design of surveys and experiments. Statistical techniques are often used to understand the relationships between all the variables.
Visualization: Techniques used to create images, diagrams or animations, usually integrated in more complex dashboards.
- 21-
Big Data: Cost 1/2
• ESG (Enterprise Strategy Group) provides an analysis of the costs of Big Data, in particular a comparison between a “build” and a “buy” solution.
Build Versus Buy Elements (Using Build Pricing)
Item | Cost | Notes
Servers | $400,000 | @$22k each; enterprise class with dual power supplies, 36TB of serial attached SCSI (SAS) storage, 48-64 gigabytes memory, 1 rack
Server support | $60,000 | @15% of server cost
Switches | $15,000 | 3 @ $5k for InfiniBand; older network switches will run at least 3x the cost of InfiniBand
Distribution/systems management software | $90,000 | Cloudera: 18 nodes @ $5k each
Integration | $100,000 | Licenses and dedicated hardware
Information Management Tools | $20,000 | 320 hours @ $100/hour human cost
Node Configuration and Implementation | $16,000 | 8 hours/node, 20 nodes = 160 hours, $100/hour
Build Project Costs | $733,000 | Those project items where a "buy" option exists
- 22-
Big Data: Cost 2/2
• ESG (Enterprise Strategy Group) provides an analysis of the costs of Big Data, in particular a comparison between a “build” and a “buy” solution.
Build Versus Buy Elements (Using Buy Pricing)
Item | Cost | Notes
Build Total | $733,000 |
Buy (Oracle Big Data Appliance) | $450,000 | List cost of Oracle Big Data Appliance for the same infrastructure and tasks
Buy (Oracle Big Data Appliance) Savings | $283,000 | Not lifecycle costs, just for initial project
ESG Estimated Savings | ~39% | Oracle Big Data Appliance lowers costs versus do-it-yourself
Big Data: Best Practices 1/3
First of all, however, we need to consider when it is suitable to use Big Data technologies:
• Analyzing a huge quantity of data, not only structured but also semi-structured and unstructured, from a wide variety of sources;
• All of the data gathered must be analyzed rather than a sample, or sampling is not as effective as an analysis of the full data set;
• Iterative and exploratory analysis when business measures on the data are not determined a priori;
• Solving information and business challenges that are not properly addressed by a traditional relational database approach.
- 24-
Big Data: Best Practices 2/3
The best practices described here cover management, organizational and technological aspects.
• Muting the HiPPOs: the most important decisions on how to retrieve and analyze data depend on the highest-paid person's opinion (HiPPO). Today these people rely too much on intuition and experience rather than on the pure rationality of data, so this behavior needs to change;
• Start with initiatives that lead to customer-centric outcomes. It is very important for customer-oriented organizations to begin with customer analytics that enable better services through a deep understanding of customers' needs and future behaviors;
• Develop an enterprise schema that includes the vision, the strategies and the requirements for Big Data and helps align business users' needs with the IT implementation roadmap;
• To achieve near-term results it is crucial to adopt a pragmatic approach, starting from the most logical and cost-effective place to look for insight, which is within the enterprise;
- 25-
Big Data: Best Practices 3/3
• The effectiveness of Big Data analytics strictly depends on analytical skills and analytics tools, so enterprises should invest in acquiring both;
• The Big Data strategy and the business analytics should encompass an evaluation of the decision-making processes of the organization as well as an evaluation of the groups and types of decision makers;
• Try to uncover new metrics, key performance indicators and analytics techniques to look at new and existing data in a different way in order to find new opportunities. This could require setting up a separate Big Data team whose purpose is to experiment and innovate;
• The final goal of a Big Data project is not to collect as much data as possible but to support concrete business needs and provide new, reliable information to decision makers;
• No single technology can meet all the Big Data requirements. Different workloads, data types, and user types should each be served by the most suitable technology. For example, Hadoop could be the best choice for large-scale Web log analysis but is not suitable for real-time streaming at all. Multiple Big Data technologies must coexist and address the use cases for which they are optimized.
- 26-
BIG DATA MARKET
Data Warehouse design
- 27-
Big Data Market Definition
IDC (2012) defines the big data market as an aggregation of storage, server, networking, software, and services market segments, each with several sub-segments.
Big Data Technology Stack
- 28-
Big Data Market Segments
• Services
– Business consulting, business process outsourcing, IT project-based services, IT outsourcing, IT support, and training services related to Big Data implementations
• Infrastructure
– External storage systems
– Servers (including internal storage, memory, network cards) and supporting system software, as well as spending for self-built servers by large cloud service providers
– Datacenter networking infrastructure used in support of Big Data server and storage infrastructure
• Software
– Data organization and management software, including parallel and distributed file systems and others
– Analytics and discovery software, including search engines used for Big Data applications, data mining, text mining, rich media analysis, data visualization, and others
- 29-
Big Data Market Analysis
Marketsandmarkets
– Big Data Market By Types (Hardware; Software;
Services; BDaaS - HaaS; Analytics; Visualization as
Service); By Software (Hadoop, Big Data Analytics
and Databases, System Software (IMDB, IMC):
Worldwide Forecasts & Analysis (2013 – 2018)
- 30-
HADOOP & MAHOUT
Data Warehouse design
- 31-
Hadoop & Mahout: Table of Contents
Hadoop
– Overview
– HDFS: Structure, File Write, File Read
– Map Reduce: Structure, Job Submission, Job Execution
– Hadoop Ecosystem: HBase, Pig, Hive
Mahout
– Overview
– Algorithms
- 32-
Hadoop: Overview
[Figure: Hadoop Overview. A Master Node on top of Slave Node 1 ... Slave Node K ... Slave Node N; every slave node has a Storage part (HDFS layer) and a Computing part (Map-Reduce layer)]
• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
– Open source
– Scalable
– Distributed
• Master Node controls everything!
- 34-
Hadoop: HDFS Structure
[Figure: HDFS Structure. A Name Node holding the metadata; a File split into numbered chunks that are stored on Data Node 1 ... Data Node K ... Data Node N]
• The Name Node controls almost everything about storage
• Large files are partitioned into chunks and stored across multiple nodes
• File chunks are replicated to mitigate node failures
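The chunking and replication ideas above can be sketched in a few lines. This is a toy model, not HDFS code: the round-robin placement, the function names and the node names are assumptions made for illustration and do not reproduce the placement policy HDFS actually uses.

```python
def split_into_chunks(file_size, chunk_size):
    """Return (offset, length) pairs covering a file of `file_size` bytes."""
    return [(offset, min(chunk_size, file_size - offset))
            for offset in range(0, file_size, chunk_size)]


def place_replicas(num_chunks, nodes, replication=3):
    """Assign each chunk to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {chunk_id: [nodes[(chunk_id + r) % len(nodes)]
                       for r in range(replication)]
            for chunk_id in range(num_chunks)}


def file_still_readable(placement, failed_nodes):
    """A file is readable if every chunk has at least one surviving replica."""
    failed = set(failed_nodes)
    return all(any(node not in failed for node in replicas)
               for replicas in placement.values())


if __name__ == "__main__":
    chunks = split_into_chunks(file_size=300, chunk_size=128)
    placement = place_replicas(len(chunks), ["n0", "n1", "n2", "n3"])
    print(chunks)
    print(placement)
    print(file_still_readable(placement, failed_nodes=["n0", "n1"]))
```

With three replicas per chunk, the file in this toy setup survives the loss of any two nodes, which is the point of replication on the slide.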
- 35-
Hadoop: HDFS Write
• Sequence of operations when writing a file
- 36-
Hadoop: HDFS Read
• Sequence of operations when reading a file
- 38-
Hadoop: Map-Reduce Structure
• The Job Tracker controls almost everything about computing
• Key concepts of Map-Reduce
– Computation goes to the data
[Figure: Map-Reduce Structure. A Job Tracker on top of TaskTracker 1 ... TaskTracker K ... TaskTracker N; every TaskTracker runs a Mapper and a Reducer]
- 39-
Hadoop: Job submission
• Job initialization takes some time
• Job execution is monitored by the Job Tracker through heartbeats
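Heartbeat monitoring can be illustrated with a small sketch: each task tracker reports periodically, and a tracker that stays silent longer than a timeout is considered lost. The class name, the timeout value and the explicit `now` timestamps are assumptions for illustration, not Hadoop's actual implementation.

```python
class HeartbeatMonitor:
    """Toy job-tracker side bookkeeping of task-tracker heartbeats."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # tracker id -> time of the last heartbeat

    def heartbeat(self, tracker_id, now):
        """Record that `tracker_id` reported in at time `now` (seconds)."""
        self.last_seen[tracker_id] = now

    def lost_trackers(self, now):
        """Trackers whose last heartbeat is older than the timeout."""
        return sorted(tracker for tracker, seen in self.last_seen.items()
                      if now - seen > self.timeout_seconds)


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=10)
    monitor.heartbeat("tasktracker-1", now=0)
    monitor.heartbeat("tasktracker-2", now=0)
    monitor.heartbeat("tasktracker-1", now=8)  # tracker 2 stays silent
    print(monitor.lost_trackers(now=12))
```

In a real cluster the tasks of a lost tracker would then be rescheduled elsewhere; that step is left out of the sketch.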
- 40-
Hadoop: Map-Reduce Execution
• The copy process requires network bandwidth
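To make the map, copy (shuffle) and reduce phases concrete, here is a minimal single-process sketch of a word count. It is not Hadoop code: the names `mapper`, `shuffle`, `reducer` and `run_job` are illustrative assumptions. On a real cluster the shuffle step is where intermediate pairs are copied between nodes, which is why it consumes bandwidth.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Copy/shuffle phase: bring all values for the same key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: combine all values of one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    mapped = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data", "big hadoop"]))
```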
- 42-
Hadoop Ecosystem: HBase
• HDFS
– Structured/semi-structured/unstructured data
– Write once, read many
• HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable
• Column-based database. It supports
– Insert
– Delete
– Update
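The idea of a versioned, column-oriented store that supports insert, update and delete can be sketched with a toy in-memory class. This is not the HBase API: the class name, the method names and the logical clock are assumptions for illustration. In this sketch an update is simply another `put` on the same cell, which adds a newer version.

```python
class ToyColumnStore:
    """Toy versioned cell store keyed by (row, column)."""

    def __init__(self):
        self._cells = {}  # (row, column) -> list of (timestamp, value)
        self._clock = 0   # logical clock standing in for real timestamps

    def put(self, row, column, value):
        """Insert or update a cell by appending a new version."""
        self._clock += 1
        self._cells.setdefault((row, column), []).append((self._clock, value))
        return self._clock

    def get(self, row, column, versions=1):
        """Return up to `versions` values of a cell, newest first."""
        history = self._cells.get((row, column), [])
        return [value for _, value in reversed(history)][:versions]

    def delete(self, row, column):
        """Remove a cell together with all of its versions."""
        self._cells.pop((row, column), None)


if __name__ == "__main__":
    store = ToyColumnStore()
    store.put("user1", "info:city", "Pavia")   # insert
    store.put("user1", "info:city", "Milan")   # update = newer version
    print(store.get("user1", "info:city", versions=2))
    store.delete("user1", "info:city")
    print(store.get("user1", "info:city"))
```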
- 43-
Hadoop Ecosystem: HBase Storage model 1/3
• HBase is a column-oriented database
- 44-
Hadoop Ecosystem: HBase Storage model 2/3
• HBase storage system
- 45-
Hadoop Ecosystem: HBase Storage model 3/3
• HBase storage system
- 46-
Hadoop Ecosystem: Pig
• Plain Hadoop
– Analysis requires a lot of Java code
– No scripting
• Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
• Pig generates and compiles Map/Reduce program(s) on the fly.
- 47-
Hadoop Ecosystem: Pig Sample Scripts
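-- Load the raw records with a custom loader (schema described in wide.xml)
RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');
-- Keep only the fields needed for the report
projected = FOREACH RawInput GENERATE ContextCategoryId AS Category, DefLevelId, TagId, URL, Impressions;
-- Keep only the rows whose DefLevelId is 8 or 12
defFilter = FILTER projected BY (DefLevelId == 8) OR (DefLevelId == 12);
-- Group by category, tag and URL, then total the impressions of each group
GroupedInput = GROUP defFilter BY (Category, TagId, URL);
result = FOREACH GroupedInput GENERATE group, SUM(defFilter.Impressions) AS Impressions;
-- Write the totals with a custom storage function
STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();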
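For readers unfamiliar with Pig Latin, the same filter, group and sum logic can be sketched in plain Python on in-memory records. The field names are taken from the script; the function name and the sample rows are hypothetical, and the custom loader and store are replaced by ordinary dictionaries.

```python
from collections import defaultdict


def impressions_by_group(rows, levels=(8, 12)):
    """Total Impressions per (Category, TagId, URL) for the selected levels."""
    totals = defaultdict(int)
    for row in rows:
        # FILTER: keep only rows whose DefLevelId is one of the given levels.
        if row["DefLevelId"] in levels:
            # GROUP BY (Category, TagId, URL), then SUM(Impressions).
            key = (row["ContextCategoryId"], row["TagId"], row["URL"])
            totals[key] += row["Impressions"]
    return dict(totals)


if __name__ == "__main__":
    sample = [
        {"ContextCategoryId": 1, "DefLevelId": 8, "TagId": 10,
         "URL": "example.com/a", "Impressions": 5},
        {"ContextCategoryId": 1, "DefLevelId": 12, "TagId": 10,
         "URL": "example.com/a", "Impressions": 7},
        {"ContextCategoryId": 1, "DefLevelId": 3, "TagId": 10,
         "URL": "example.com/a", "Impressions": 100},
    ]
    print(impressions_by_group(sample))
```

The Pig version expresses the same steps declaratively and lets Pig turn them into Map/Reduce jobs, which is what makes it practical on data that does not fit on one machine.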
- 48-
Hadoop Ecosystem: Hive
• Hive is a data warehouse infrastructure built on top of Hadoop
• Supports analysis of large datasets stored in Hadoop-compatible file systems such as HDFS and the Amazon S3 file system
• Provides a SQL-like query language called HiveQL
• Provides indexes to accelerate queries
- 49-
Hadoop Ecosystem: HiveQL
• DML
– SELECT
• DDL
– SHOW TABLES
– CREATE TABLE
– ALTER TABLE
– DROP TABLE
- 51-
Mahout: Overview
• A scalable machine learning library built on Hadoop, written in Java
• Driven by Ng et al.’s paper “MapReduce for Machine Learning on Multicore”
- 52-
Mahout: Algorithms
• Classification
– Logistic Regression
– Bayesian
– SVM
– NN
– Hidden Markov Models
• Clustering
– K-means
– Mean Shift Clustering
– Spectral Clustering
– Top Down Clustering
• Pattern Mining
– Parallel FP Growth Algorithm
• Regression
– Locally Weighted Linear Regression
• Dimension reduction
– SVD
– PCA
– GDA
• Collaborative filtering
– Non-distributed recommenders
– Distributed Item-Based Collaborative Filtering
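As an illustration of one clustering algorithm from the list, a minimal in-memory sketch of K-means (Lloyd's iteration) follows. It is plain Python, not Mahout's API or its distributed implementation; the function names and the choice to pass the initial centroids explicitly are assumptions made for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, initial_centroids, max_iter=100):
    """Return (centroids, labels) after iterating until assignments settle."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            new_centroids.append(mean_point(members) if members else old)
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(kmeans(data, initial_centroids=[(0, 0), (10, 10)]))
```

A distributed version follows the same two steps: the assignment step fits a map phase and the centroid update fits a reduce phase.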
- 53-
EXERCISE
Data Warehouse design
- 54-
Mobility Analyzer: A Show Case
[Figure: Mobility Analyzer data flow. Data: HANA DB, CSV Files, Sequence Files, Mahout, Clusterdump, Cluster Info., HANA DB. Modules: CSVConverter, ImportClusterInfo, ExportTweetsInfoLocal, Run.sh. Sites: Hadoop, Local]

Contenu connexe

Tendances

Capturing big value in big data
Capturing big value in big data Capturing big value in big data
Capturing big value in big data BSP Media Group
 
Importance of Data Analytics
 Importance of Data Analytics Importance of Data Analytics
Importance of Data AnalyticsProduct School
 
Requirements document for big data use cases
Requirements document for big data use casesRequirements document for big data use cases
Requirements document for big data use casesAllied Consultants
 
Use of big data technologies in capital markets
Use of big data technologies in capital marketsUse of big data technologies in capital markets
Use of big data technologies in capital marketsInfosys
 
Business case for Big Data Analytics
Business case for Big Data AnalyticsBusiness case for Big Data Analytics
Business case for Big Data AnalyticsVijay Rao
 
Big Data Impact on Purchasing and SCM - PASIA World Conference Discussion
Big Data Impact on Purchasing and SCM - PASIA World Conference DiscussionBig Data Impact on Purchasing and SCM - PASIA World Conference Discussion
Big Data Impact on Purchasing and SCM - PASIA World Conference DiscussionBill Kohnen
 
Big Data Analytics | What Is Big Data Analytics? | Big Data Analytics For Beg...
Big Data Analytics | What Is Big Data Analytics? | Big Data Analytics For Beg...Big Data Analytics | What Is Big Data Analytics? | Big Data Analytics For Beg...
Big Data Analytics | What Is Big Data Analytics? | Big Data Analytics For Beg...Simplilearn
 
Big data document (basic concepts,3vs,Bigdata vs Smalldata,importance,storage...
Big data document (basic concepts,3vs,Bigdata vs Smalldata,importance,storage...Big data document (basic concepts,3vs,Bigdata vs Smalldata,importance,storage...
Big data document (basic concepts,3vs,Bigdata vs Smalldata,importance,storage...Taniya Fansupkar
 
Quick view Big Data, brought by Oomph!, courtesy of our partner Sonovate
Quick view Big Data, brought by Oomph!, courtesy of our partner Sonovate Quick view Big Data, brought by Oomph!, courtesy of our partner Sonovate
Quick view Big Data, brought by Oomph!, courtesy of our partner Sonovate Oomph! Recruitment
 
Big data analytics, research report
Big data analytics, research reportBig data analytics, research report
Big data analytics, research reportJULIO GONZALEZ SANZ
 
Bigdata and Social Media Analytics
Bigdata and Social Media Analytics Bigdata and Social Media Analytics
Bigdata and Social Media Analytics Dillip kumar
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Casesaziksa
 
Bigdata analysis in supply chain managment
Bigdata analysis in supply chain managmentBigdata analysis in supply chain managment
Bigdata analysis in supply chain managmentKushal Shah
 
Societal Impact of Applied Data Science on the Big Data Stack
Societal Impact of Applied Data Science on the Big Data StackSocietal Impact of Applied Data Science on the Big Data Stack
Societal Impact of Applied Data Science on the Big Data StackStealth Project
 

Tendances (19)

Big data
Big dataBig data
Big data
 
Capturing big value in big data
Capturing big value in big data Capturing big value in big data
Capturing big value in big data
 
Importance of Data Analytics
 Importance of Data Analytics Importance of Data Analytics
Importance of Data Analytics
 
Hadoop Overview
Hadoop OverviewHadoop Overview
Hadoop Overview
 
Requirements document for big data use cases
Requirements document for big data use casesRequirements document for big data use cases
Requirements document for big data use cases
 
Use of big data technologies in capital markets
Use of big data technologies in capital marketsUse of big data technologies in capital markets
Use of big data technologies in capital markets
 
Business case for Big Data Analytics
Business case for Big Data AnalyticsBusiness case for Big Data Analytics
Business case for Big Data Analytics
 
Big Data Impact on Purchasing and SCM - PASIA World Conference Discussion
Big Data Impact on Purchasing and SCM - PASIA World Conference DiscussionBig Data Impact on Purchasing and SCM - PASIA World Conference Discussion
Big Data Impact on Purchasing and SCM - PASIA World Conference Discussion
 
Big Data Analytics | What Is Big Data Analytics? | Big Data Analytics For Beg...
Big Data Analytics | What Is Big Data Analytics? | Big Data Analytics For Beg...Big Data Analytics | What Is Big Data Analytics? | Big Data Analytics For Beg...
Big Data Analytics | What Is Big Data Analytics? | Big Data Analytics For Beg...
 
The dawn of Big Data
The dawn of Big DataThe dawn of Big Data
The dawn of Big Data
 
Big data document (basic concepts,3vs,Bigdata vs Smalldata,importance,storage...
Big data document (basic concepts,3vs,Bigdata vs Smalldata,importance,storage...Big data document (basic concepts,3vs,Bigdata vs Smalldata,importance,storage...
Big data document (basic concepts,3vs,Bigdata vs Smalldata,importance,storage...
 
Quick view Big Data, brought by Oomph!, courtesy of our partner Sonovate
Quick view Big Data, brought by Oomph!, courtesy of our partner Sonovate Quick view Big Data, brought by Oomph!, courtesy of our partner Sonovate
Quick view Big Data, brought by Oomph!, courtesy of our partner Sonovate
 
Big data analytics, research report
Big data analytics, research reportBig data analytics, research report
Big data analytics, research report
 
Rulex big data and analytics
Rulex big data and analyticsRulex big data and analytics
Rulex big data and analytics
 
Bigdata and Social Media Analytics
Bigdata and Social Media Analytics Bigdata and Social Media Analytics
Bigdata and Social Media Analytics
 
Bigdata
BigdataBigdata
Bigdata
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Cases
 
Bigdata analysis in supply chain managment
Bigdata analysis in supply chain managmentBigdata analysis in supply chain managment
Bigdata analysis in supply chain managment
 
Societal Impact of Applied Data Science on the Big Data Stack
Societal Impact of Applied Data Science on the Big Data StackSocietal Impact of Applied Data Science on the Big Data Stack
Societal Impact of Applied Data Science on the Big Data Stack
 

En vedette (18)

Agent technology for e commerce-recommendation systems
Agent technology for e commerce-recommendation systemsAgent technology for e commerce-recommendation systems
Agent technology for e commerce-recommendation systems
 
Android+ax+app+wcf
Android+ax+app+wcfAndroid+ax+app+wcf
Android+ax+app+wcf
 
Vortrag ralph behrens_ibm-data
Vortrag ralph behrens_ibm-dataVortrag ralph behrens_ibm-data
Vortrag ralph behrens_ibm-data
 
Sqlite tutorial
Sqlite tutorialSqlite tutorial
Sqlite tutorial
 
Net framework
Net frameworkNet framework
Net framework
 
B14200
B14200B14200
B14200
 
0321146182
03211461820321146182
0321146182
 
Andrei shakirin rest_cxf
Andrei shakirin rest_cxfAndrei shakirin rest_cxf
Andrei shakirin rest_cxf
 
Json generation
Json generationJson generation
Json generation
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
 
Soap toolkits
Soap toolkitsSoap toolkits
Soap toolkits
 
Soap pt1
Soap pt1Soap pt1
Soap pt1
 
Introduction to visual studio and c sharp
Introduction to visual studio and c sharpIntroduction to visual studio and c sharp
Introduction to visual studio and c sharp
 
Recommender systems session b
Recommender systems session bRecommender systems session b
Recommender systems session b
 
Httpclient tutorial
Httpclient tutorialHttpclient tutorial
Httpclient tutorial
 
Show loader to open url in web view
Show loader to open url in web viewShow loader to open url in web view
Show loader to open url in web view
 
Culbert recommender systems
Culbert recommender systemsCulbert recommender systems
Culbert recommender systems
 
Chapter 02 collaborative recommendation
Chapter 02   collaborative recommendationChapter 02   collaborative recommendation
Chapter 02 collaborative recommendation
 

Similaire à 13 pv-do es-18-bigdata-v3

Certified Big Data Science Analyst (CBDSA)
Certified Big Data Science Analyst (CBDSA)Certified Big Data Science Analyst (CBDSA)
Certified Big Data Science Analyst (CBDSA)GICTTraining
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoopRemas Ittahir
 
Impact of big data on DCMI market
Impact of big data on DCMI marketImpact of big data on DCMI market
Impact of big data on DCMI marketMohsin Baig
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviationranjit banshpal
 
Big Data Analytics Research Report
Big Data Analytics Research ReportBig Data Analytics Research Report
Big Data Analytics Research ReportIla Group
 
How to make your data scientists happy
How to make your data scientists happy How to make your data scientists happy
How to make your data scientists happy Hussain Sultan
 
SC7 Workshop 1: Big Data in Secure Societies
SC7 Workshop 1: Big Data in Secure Societies SC7 Workshop 1: Big Data in Secure Societies
SC7 Workshop 1: Big Data in Secure Societies BigData_Europe
 
exploit_big_data_v1
exploit_big_data_v1exploit_big_data_v1
exploit_big_data_v1Attila Barta
 
A technical Introduction to Big Data Analytics
A technical Introduction to Big Data AnalyticsA technical Introduction to Big Data Analytics
A technical Introduction to Big Data AnalyticsPethuru Raj PhD
 
Overview - IBM Big Data Platform
Overview - IBM Big Data PlatformOverview - IBM Big Data Platform
Overview - IBM Big Data PlatformVikas Manoria
 
QuickView #3 - Big Data
QuickView #3 - Big DataQuickView #3 - Big Data
QuickView #3 - Big DataSonovate
 
Smart Data Module 6 d drive the future
Smart Data Module 6 d drive the futureSmart Data Module 6 d drive the future
Smart Data Module 6 d drive the futurecaniceconsulting
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Big data lecture notes
Big data lecture notesBig data lecture notes
Big data lecture notesMohit Saini
 

Similaire à 13 pv-do es-18-bigdata-v3 (20)

Complete-SRS.doc
Complete-SRS.docComplete-SRS.doc
Complete-SRS.doc
 
Certified Big Data Science Analyst (CBDSA)
Certified Big Data Science Analyst (CBDSA)Certified Big Data Science Analyst (CBDSA)
Certified Big Data Science Analyst (CBDSA)
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoop
 
Impact of big data on DCMI market
Impact of big data on DCMI marketImpact of big data on DCMI market
Impact of big data on DCMI market
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviation
 

13 pv-do es-18-bigdata-v3

  • 1. - 1- Data Warehouse design Design of Enterprise Systems University of Pavia 10/12/2013 2h for the first; 2h for Hadoop
  • 2. - 2- Table of Contents Big Data Overview Big Data DW & BI Big Data Market Hadoop & Mahout
  • 3. - 3- BIG DATA OVERVIEW Data Warehouse design
  • 4. - 4- Big Data Overview: Table of Contents Big Data Overview Data Growth Definition Big Data vs. Relational Data Its Value Big Data Benefit Big Data Usage Challenges
  • 5. - 5- Big Data Overview: Data Growth  Storage capacity increases 23% per year on average  The end of the ability to store all the available information [Chart: Data Storage Growth, Exabytes by Year]  Exponential growth over the decade starting in 2010
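As a rough illustration of the 23% figure above, compound growth can be worked out in a few lines. The starting capacity of 1.0 (arbitrary units) and the 10-year horizon are assumptions chosen only for the arithmetic; they are not taken from the slide.

```python
import math

ANNUAL_GROWTH = 0.23  # average yearly increase in storage capacity, from the slide


def capacity_after(years: int, start: float = 1.0) -> float:
    """Capacity after `years` of compound growth, relative to `start`."""
    return start * (1 + ANNUAL_GROWTH) ** years


def doubling_time() -> float:
    """Years needed for capacity to double at ANNUAL_GROWTH."""
    return math.log(2) / math.log(1 + ANNUAL_GROWTH)


if __name__ == "__main__":
    print(f"Growth factor over 10 years: {capacity_after(10):.2f}x")
    print(f"Doubling time: {doubling_time():.2f} years")
```

At this rate capacity grows a little under eightfold in ten years, so data volumes that grow faster than that cannot all be kept.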
  • 6. - 6- Big Data Overview: Definition Gartner definition (2012): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
  • 7. - 7- Big Data Overview: Big Data vs. Relational Data
      Application | Relation-Based Data | Big Data
      Data processing | Single-computer platform that scales with better CPUs, centralized processing. | Cluster platforms that scale to thousands of nodes, distributed processing.
      Data management | Relational database (SQL), centralized storage. | Non-relational databases that manage varied data types and formats (NoSQL), distributed storage.
      Analytics | Batched, descriptive, centralized. | Real-time, predictive and prescriptive, distributed analytics.
  • 8. - 8- Big Data Overview: Its Value 1/3 Several classes of company head the revenue chart ($11.59 billion)  broad-portfolio tech giants (IBM, HP, Oracle, EMC)  leading software houses (Teradata, SAP, Microsoft)  professional services companies (PwC, Accenture) Source: Wikibon, Big Data Vendor Revenue and Market Forecast 2012-2017 Source: http://www.zdnet.com/big-data-an-overview_p2-7000020785/
  • 9. - 9- Big Data Overview: Its Value 2/3  Pure play: vendors who derive 100 percent of their revenue from this market Source: Wikibon, Big Data Vendor Revenue and Market Forecast 2012-2017 Source: http://www.zdnet.com/big-data-an-overview_p2-7000020785/
  • 10. - 10- Big Data Overview: Its Value 3/3 Source: Worldwide Big Data Technologies and Services: 2012-2015 Forecast (IDC, 2012)  IDC: Big data will become a $17 billion business by 2015 ($23.8 billion by 2016)  Big data storage will account for 6.8% of the entire worldwide storage market by 2015 Source: http://www.zdnet.com/big-data-an-overview_p2-7000020785/
  • 11. - 11- Big Data Overview: Big Data Benefits Business benefits received by implementing an effective Big Data methodology. The survey is based on 1153 responses from 325 respondents
  • 12. - 12- Big Data Overview: Big Data Usage 1/2  E-Commerce and Market Intelligence – Recommender system – Social media monitoring and analysis – Crowd-sourcing systems – Social and virtual games  E-Government and Politics 2.0 – Ubiquitous government services – Equal access and public services – Citizen engagement  Science & Technology – S&T innovation – Hypothesis testing – Knowledge discovery  Smart Health and Wellbeing – Human and plant genomics – Healthcare decision support – Patient community analysis  Security and Public Safety – Crime analysis – Computational criminology – Terrorism informatics – Open-source intelligence – Cyber security
  • 13. - 13- Big Data Overview: Big Data Usage 2/2 Survey of European companies from Steria's Business Intelligence Maturity Audit (biMA)
  • 14. - 14- Big Data Overview: Challenges 1/2 Main challenges between Big Data and companies. The survey is based on 1153 responses from 325 respondents
  • 15. - 15- Big Data Overview: Challenges 2/2 A survey of European companies from Steria's Business Intelligence Maturity Audit (biMA)  Technical – 38% have data quality problems – A lack of data governance; no master data management system (38%)  Organizational – 72% have no BI strategy; 70% have no BI governance – Only 7% rate big data as very relevant Source: http://www.steria.com/uk/media-centre/press-releases/press-releases/article/survey-suggests-only-7-of-european-companies-rate-big-data-as-very-relevant-to-their-business/
  • 16. - 16- BIG DATA, DW & BI Data Warehouse design
  • 17. - 17- Big Data, DW & BI: Table of Contents Big Data, DW & BI Evolution Techniques Cost Best Practices
  • 18. - 18- BI Evolution. BI and Analytics: evolution and characteristics
      BI&A 1.0
        Key Characteristics: DBMS-based, structured content; RDBMS & data warehousing; ETL & OLAP; Dashboards & scorecards; Data mining & statistical analysis
        Gartner BI Platforms Core Capabilities: Ad hoc query & search-based BI; Reporting, dashboards & scorecards; OLAP; Interactive visualization; Predictive modeling & data mining
        Gartner Hype Cycle: Column-based DBMS; In-memory DBMS; Real-time decision; Data mining workbenches
      BI&A 2.0
        Key Characteristics: Web-based, unstructured content; Information retrieval and extraction; Opinion mining; Question answering; Web analytics and web intelligence; Social media analytics; Social network analysis; Spatial-temporal analysis
        Gartner Hype Cycle: Information semantic services; Natural language question answering; Content & text analytics
      BI&A 3.0
        Key Characteristics: Mobile and sensor-based content; Location-aware analysis; Person-centered analysis; Context-relevant analysis; Mobile visualization & HCI
        Gartner Hype Cycle: Mobile BI
  • 19. - 19- Big Data Overview: Techniques 1/2 McKinsey Global Institute in 2011 provided a list of the top 10 common techniques applicable across a range of industries, particularly in response to the need to analyze new amounts of data and their combination. List of the top 10 techniques which require Big Data (1/2):
      A/B Testing: A technique in which a control group is compared with a variety of test groups in order to determine what treatments will improve a given objective. An example application is determining what copy text, layouts, images, or colors will improve conversion rates on an e-commerce Web site. Big Data enables huge numbers of tests to be executed and analyzed.
      Cluster Analysis: A statistical method that aims to classify a huge data set and, in particular, to identify common behavior.
      Classification: A set of techniques to identify the categories to which new data points belong, based on a training set containing data points that have already been categorized.
      Data Mining: A set of techniques and technologies for extracting patterns from large datasets by combining statistical methods and algorithms. These techniques include association rule learning, cluster analysis, classification, and regression.
  • 20. - 20- Big Data Overview: Techniques 2/2 McKinsey Global Institute in 2011 provided a list of the top 10 common techniques applicable across a range of industries, particularly in response to the need to analyze new amounts of data and their combination. List of the top 10 techniques which require Big Data (2/2):
      Network analysis: A set of techniques used to characterize relationships among discrete nodes in a graph or a network. In social network analysis, connections between individuals in a community or organization are analyzed.
      Predictive modeling: A set of techniques in which a mathematical model is created or chosen to best predict the probability of an outcome.
      Sentiment analysis: Application of natural language processing and other analytic techniques to identify and extract subjective information from source text material.
      Statistics: The science of the collection, organization, and interpretation of data, including the design of surveys and experiments. Statistical techniques are often used to understand the relationships between all the variables.
      Visualization: Techniques used to create images, diagrams or animations, usually integrated in more complex dashboards.
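The A/B testing technique in the list above can be illustrated with a minimal sketch. It assumes a two-proportion z-test as the comparison method, which is one common choice and is not named in the slides; the visitor and conversion counts are made up for the example.

```python
import math


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z statistic for the difference in conversion rate between B and A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / std_err


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Hypothetical counts: control (A) vs. a new page layout (B).
    z = two_proportion_z(conv_a=200, n_a=10_000, conv_b=250, n_b=10_000)
    print(f"z = {z:.2f}, p = {two_sided_p_value(z):.3f}")
```

A small p-value suggests the difference between the control and the test group is unlikely to be due to chance alone; with many simultaneous tests, as Big Data permits, a correction for multiple comparisons would also be needed.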
  • 21. - 21- Big Data: Cost 1/2  ESG (Enterprise Strategy Group) provides an analysis of the costs of Big Data, in particular a comparison between a “build” and a “buy” solution. Build Versus Buy Elements (Using Build Pricing)
      Item | Cost | Notes
      Servers | $400,000 | @ $22k each; enterprise class with dual power supplies, 36TB of serial attached SCSI (SAS) storage, 48-64 gigabytes memory, 1 rack
      Server support | $60,000 | @ 15% of server cost
      Switches | $15,000 | 3 @ $5k for InfiniBand; older network switches will run at least 3x the cost of InfiniBand
      Distribution/systems management software | $90,000 | Cloudera: 18 nodes @ $5k each
      Integration | $100,000 | Licenses and dedicated hardware
      Information Management Tools | $20,000 | 320 hours @ $100/hour human cost
      Node Configuration and Implementation | $16,000 | 8 hours/node, 20 nodes = 160 hours, $100/hour
      Build Project Costs | $733,000 | Those project items where a "buy" option exists
  • 22. - 22- Big Data: Cost 2/2  ESG (Enterprise Strategy Group) provides an analysis of the costs of Big Data, in particular a comparison between a “build” and a “buy” solution. Build Versus Buy Elements (Using Buy Pricing)
      Item | Cost | Notes
      Build Total | $733,000 |
      Buy (Oracle Big Data Appliance) | $450,000 | List cost of Oracle Big Data Appliance for the same infrastructure and tasks
      Buy (Oracle Big Data Appliance) Savings | $283,000 | Not lifecycle costs, just for initial project
      ESG Estimated Savings | ~39% | Oracle Big Data Appliance lowers costs versus do-it-yourself
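The savings figures in the ESG comparison follow directly from the two totals. A short sketch of the arithmetic, using only the build and buy totals quoted above (the function and constant names are illustrative):

```python
BUILD_TOTAL = 733_000  # build project cost, from the ESG table
BUY_TOTAL = 450_000    # list price of the appliance, from the ESG table


def savings(build: int, buy: int) -> tuple[int, float]:
    """Return the absolute saving and the saving as a fraction of the build cost."""
    amount = build - buy
    return amount, amount / build


if __name__ == "__main__":
    amount, fraction = savings(BUILD_TOTAL, BUY_TOTAL)
    print(f"Savings: ${amount:,} ({fraction:.1%} of the build cost)")
```

The fraction is taken relative to the build cost, which is how the roughly 39% estimate is obtained.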
  • 23. - 23- Big Data: Best Practices 1/3 First, however, we need to consider when it is suitable to use Big Data technologies:  Analyzing a huge quantity of data, not only structured but also semi-structured and unstructured, from a wide variety of sources;  All of the gathered data must be analyzed rather than a sample, or sampling is not as effective as analysis of a large amount of data;  Iterative and explorative analysis, when business measures on the data are not determined a priori;  Solving information and business challenges that are not properly addressed by a traditional relational database approach.
  • 24. - 24- Big Data: Best Practices 2/3 The best practices described here cover management, organizational and technological aspects.  Muting the HiPPOs: the most important decisions on how to retrieve and analyze data depend on the highest-paid person's opinion (HiPPO). Today these people rely too much on intuition and experience rather than on the evidence in the data, so this behavior needs to change;  Start with initiatives that lead to customer-centric outcomes. For customer-oriented organizations it is very important to begin with customer analytics, which enable better services through a deep understanding of customers' needs and future behaviors;  Develop an enterprise schema that includes the vision, the strategies and the requirements for Big Data and that helps align business users' needs with the IT implementation roadmap;  To achieve near-term results, it is crucial to adopt a pragmatic approach, starting from the most logical and cost-effective place to look for insight, which is within the enterprise;
  • 25. - 25- Big Data: Best Practices 3/3  The effectiveness of Big Data analytics depends strictly on analytical skills and analytics tools, so enterprises should invest in acquiring both;  The Big Data strategy and the business analytics should encompass an evaluation of the organization's decision-making processes as well as of the groups and types of decision makers;  Try to uncover new metrics, key performance indicators and analytics techniques to look at new and existing data in a different way in order to find new opportunities. This could require setting up a separate Big Data team whose purpose is to experiment and innovate;  The final goal of a Big Data project is not to collect as much data as possible but to support concrete business needs and provide new, reliable information to decision makers;  No single technology can meet all Big Data requirements. Different workloads, data types and user types should each be served by the most suitable technology. For example, Hadoop could be the best choice for large-scale Web log analysis but is not suitable for real-time streaming at all. Multiple Big Data technologies must coexist, each addressing the use cases for which it is optimized.
  • 26. - 26- BIG DATA MARKET Data Warehouse design
  • 27. - 27- Big Data Market Definition IDC (2012) defines the big data market as an aggregation of storage, server, networking, software, and services market segments, each with several sub-segments. Big Data Technology Stack
  • 28. - 28- Big Data Market Segments  Services – business consulting, business process outsourcing, plus IT project-based services, IT outsourcing, and IT support, and training services related to Big Data implementations  Infrastructure – External storage systems – Servers (including internal storage, memory, network cards) and supporting system software as well as spending for self-built servers by large cloud service providers – Datacenter networking infrastructure used in support of Big Data server and storage infrastructure  Software – Data organization and management software, including parallel and distributed file systems and others – Analytics and discovery software, including search engines used for Big Data applications, data mining, text mining, rich media analysis, data visualization, and others
  • 29. - 29- Big Data Market Analysis Marketsandmarkets – Big Data Market By Types (Hardware; Software; Services; BDaaS - HaaS; Analytics; Visualization as Service); By Software (Hadoop, Big Data Analytics and Databases, System Software (IMDB, IMC)): Worldwide Forecasts & Analysis (2013 – 2018)
  • 30. - 30- HADOOP & MAHOUT Data Warehouse design
  • 31. - 31- Hadoop & Mahout: Table of Contents Hadoop Overview HDFS Structure File Write File Read Map Reduce Structure Job Submission Job Execution Hadoop Ecosystem HBase Pig Hive Mahout Overview Algorithms
  • 32. - 32- Hadoop: Overview [Diagram: Hadoop Overview. A Master Node and Slave Nodes 1 ... K ... N; each slave node provides Storage (HDFS) and Computing (Map-Reduce)]  The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models – Open source – Scalable – Distributed  Master Node controls everything!
  • 33. - 33- Hadoop & Mahout: Table of Contents Hadoop Overview HDFS Structure File Write File Read Map Reduce Structure Job Submission Job Execution Hadoop Ecosystem HBase Pig Hive Mahout Overview Algorithms
  • 34. - 34- Hadoop: HDFS Structure [Diagram: HDFS Structure. A Name Node holding metadata; a File split into numbered chunks that are distributed over Data Nodes 1 ... K ... N]  The name node controls almost everything about storage  Large files are partitioned into chunks and stored across multiple nodes  File chunks are replicated to mitigate node failures
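The chunking and replication described above can be sketched in plain Python. This is a toy model, not the HDFS implementation: the round-robin placement is an assumption made for simplicity (real HDFS placement is decided by the name node and takes rack topology into account), and the block size, replication factor and node names in the example are illustrative values.

```python
from typing import Dict, List


def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Return the sizes of the blocks a file of `file_size` is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign every block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of data nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(file_size=300, block_size=128)  # sizes in MB
    placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
    print(blocks)
    for block, holders in placement.items():
        print(f"block {block}: {holders}")
```

Because every block lives on several nodes, the loss of any single data node in this model still leaves at least two copies of each block.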
  • 35. - 35- Hadoop: HDFS Write  Sequence of operations when writing a file
  • 36. - 36- Hadoop: HDFS Read  Sequence of operations when reading a file
  • 37. - 37- Hadoop & Mahout: Table of Contents Hadoop Overview HDFS Structure File Write File Read Map Reduce Structure Job Submission Job Execution Hadoop Ecosystem HBase Pig Hive Mahout Overview Algorithms
  • 38. - 38- Hadoop: Map-Reduce Structure  The job tracker controls almost everything about computing  Key concepts of Map-Reduce – Computation goes with the data [Diagram: Map-Reduce Structure. A Job Tracker and TaskTrackers 1 ... K ... N, each running a Mapper and a Reducer]
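The mapper and reducer roles named above can be illustrated with the classic word-count example. The sketch below runs in a single Python process and only mimics the programming model; it does not use the Hadoop API, and the function names are chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as happens between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

In a real cluster the mappers run on the nodes that hold the input blocks, and only the grouped intermediate pairs travel over the network to the reducers.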
  • 39. - 39- Hadoop: Job submission  The initialization takes some time  Job execution is monitored by Job tracker through heartbeat
  • 40. - 40- Hadoop: Map-Reduce Execution  The copy process requires network bandwidth
  • 41. - 41- Hadoop & Mahout: Table of Contents Hadoop Overview HDFS Structure File Write File Read Map Reduce Structure Job Submission Job Execution Hadoop Ecosystem HBase Pig Hive Mahout Overview Algorithms
  • 42. - 42- Hadoop Ecosystem: HBase  HDFS – Structured/semi-structured/unstructured data – Write only once, read many  HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable  Column-based database. It supports – Insert – Delete – Update
  • 43. - 43- Hadoop Ecosystem: HBase Storage model 1/3  HBase is a column-oriented database
  • 44. - 44- Hadoop Ecosystem: HBase Storage model 2/3  HBase storage system
  • 45. - 45- Hadoop Ecosystem: HBase Storage model 3/3  HBase storage system
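To make the "versioned, column-oriented" idea more concrete, here is a toy in-memory model in Python. It is only a sketch under simplifying assumptions: the class `ToyColumnStore`, its methods and the integer version counter are invented for this example and are not the HBase client API, and deletion here simply drops the cell instead of writing a marker as a real store would.

```python
from typing import Dict, List, Optional, Tuple


class ToyColumnStore:
    """Rows map to columns; each column keeps a list of (version, value) cells."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, List[Tuple[int, str]]]] = {}
        self._clock = 0  # stand-in for a timestamp

    def put(self, row: str, column: str, value: str) -> int:
        """Insert or update: append a new version instead of overwriting."""
        self._clock += 1
        self._rows.setdefault(row, {}).setdefault(column, []).append((self._clock, value))
        return self._clock

    def get(self, row: str, column: str, version: Optional[int] = None) -> Optional[str]:
        """Return the newest value, or the newest one not later than `version`."""
        cells = self._rows.get(row, {}).get(column, [])
        if version is not None:
            cells = [cell for cell in cells if cell[0] <= version]
        return cells[-1][1] if cells else None

    def delete(self, row: str, column: str) -> None:
        """Remove a column and all its versions from a row."""
        self._rows.get(row, {}).pop(column, None)


if __name__ == "__main__":
    store = ToyColumnStore()
    first = store.put("user1", "info:city", "Pavia")
    store.put("user1", "info:city", "Milan")
    print(store.get("user1", "info:city"))         # newest version
    print(store.get("user1", "info:city", first))  # value as of the first write
```

Keeping older versions next to the newest one is what lets an update-capable store sit on top of a write-once file system such as HDFS.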
  • 46. - 46- Hadoop Ecosystem: Pig  Hadoop – Analysis requires a lot of Java code – No scripting  Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs  Pig generates and compiles Map/Reduce program(s) on the fly.
  • 47. - 47- Hadoop Ecosystem: Pig Sample Scripts
      -- Load the raw records with a custom loader and keep only the needed fields
      RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');
      Projected = FOREACH RawInput GENERATE ContextCategoryId AS Category, DefLevelId, TagId, URL, Impressions;
      -- Keep only the two DefLevelId values of interest
      defFilter = FILTER Projected BY (DefLevelId == 8) OR (DefLevelId == 12);
      GroupedInput = GROUP defFilter BY (Category, TagId, URL);
      -- After GROUP, the bag of grouped tuples is named after the grouped relation (defFilter)
      result = FOREACH GroupedInput GENERATE group, SUM(defFilter.Impressions) AS Impressions;
      STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();
  • 48. - 48- Hadoop Ecosystem: Hive  Hive is a data warehouse infrastructure built on top of Hadoop  Supports analysis of large datasets stored in Hadoop-compatible file systems such as HDFS and the Amazon S3 file system  Provides a SQL-like query language called HiveQL  Provides indexes to accelerate queries
  • 49. - 49- Hadoop Ecosystem: HiveQL  DML – SELECT  DDL – SHOW TABLES – CREATE TABLE – ALTER TABLE – DROP TABLE
  • 50. - 50- Mahout Hadoop Overview HDFS Structure File Write File Read Map Reduce Structure Job Submission Job Execution Hadoop Ecosystem HBase Pig Hive Mahout Overview Algorithms
  • 51. - 51- Mahout: Overview  A scalable machine learning library built on Hadoop, written in Java  Driven by Ng et al.’s paper “MapReduce for Machine Learning on Multicore”
  • 52. - 52- Mahout: Algorithms  Classification – Logistic Regression – Bayesian – SVM – NN – Hidden Markov Models  Clustering – K-means – Mean Shift Clustering – Spectral Clustering – Top Down Clustering  Pattern Mining – Parallel FP Growth Algorithm  Regression – Locally Weighted Linear Regression  Dimension reduction – SVD – PCA – GDA  Collaborative filtering – Non-distributed recommenders – Distributed Item-Based Collaborative Filtering
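Of the clustering algorithms listed above, k-means is the simplest to sketch. The version below is a single-machine illustration in plain Python, not Mahout's implementation (Mahout runs the algorithm as distributed jobs on Hadoop). Using the first k points as initial centroids is an assumption made to keep the example deterministic.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def mean(points: Sequence[Point]) -> Point:
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def kmeans(points: Sequence[Point], k: int, iterations: int = 10) -> List[Point]:
    """Return k centroids after at most `iterations` assignment/update rounds."""
    centroids = list(points[:k])  # assumption: first k points seed the clusters
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    data = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
    print(kmeans(data, k=2))
```

The assignment step is independent for every point and the update step is a per-cluster aggregation, which is why the algorithm maps naturally onto the mapper and reducer roles.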
  • 54. - 54- Mobility Analyzer: A Show Case [Diagram: Data Flow and Modules. Data flow elements: HANA DB, CSV Files, Sequence Files, Mahout, Clusterdump, Cluster Info., HANA DB, Site. Modules: CSVConverter, ImportClusterInfo, ExportTweetsInfoLocal, Hadoop Local, Run.sh]