SlideShare une entreprise Scribd logo
1  sur  65
Télécharger pour lire hors ligne
© 2014 IBM Corporation1
Big Data
Xavier Constant
xavier.constant@es.ibm.com
Lecture at EADA
International Master in Marketing (2014)
© 2014 IBM Corporation2
 Big Data Concepts
 Big Data Technology
 Data Scientists
© 2014 IBM Corporation3
Traditional DW
BI
Server
ERP
CRM
Data
Marts
Reports /
Dashboards
Operational
System
ETL ETL
BENEFITS:
 Mature Technology
 SQL Language (declarative, non technical)
 Skills & resources availablity (programmers, DBAs,…)
LIMITATIONS:
 Big operational data volumes
 Queries take too long or don’t even finish
 Admin complexity (partitions, archiving,…)
 New data types
 Free text, images, video, audio,…
 Data in real time (sensors, logs, geospatial data, etc…)
 New analysis types
 Exploratory
 Predictive
Flat files,
Spread
sheets
Data
Warehouse(s)
© 2014 IBM Corporation4
1 in 2
business leaders
don’t have access to
data they need
83%
of CIO’s cited BI and
analytics as part of their
visionary plan
5.4X
more likely that top
performers use
business analytics
80%
of the world’s
data today is
unstructured
90%
of the world’s
data was created
in the last two
years
20%
of available data can
be processed by
traditional systems
Source: GigaOM, Software Group, IBM Institute for Business Value"
Intrinsic Property of Data … it grows
© 2014 IBM Corporation5
Characteristics of Big Data
Velocity is the game changer: It’s NOT just how
fast data is produced or changed, BUT the
speed at which it must be analyzed
received, understood, and processed.
© 2014 IBM Corporation6
Paradigm shifts enabled by big data I
Leverage more of the data being captured
© 2014 IBM Corporation7
Paradigm shifts enabled by big data I
Leverage more of the data being captured
Bank X
© 2014 IBM Corporation8
Paradigm shifts enabled by big data II
Reduce effort required to leverage data
© 2014 IBM Corporation9
Paradigm shifts enabled by big data II
Reduce effort required to leverage data
© 2014 IBM Corporation10
Paradigm shifts enabled by big data III
Data leads the way – and sometimes correlations are good enough
© 2014 IBM Corporation11
Paradigm shifts enabled by big data III
Data leads the way – and sometimes correlations are good enough
Hypothesis based correlation Weird correlation
© 2014 IBM Corporation12
Paradigm shifts enabled by big data III
Data leads the way – and sometimes correlations are good enough
© 2014 IBM Corporation13
Paradigm shifts enabled by big data IV
Leverage data as it is captured
© 2014 IBM Corporation14
Paradigm shifts enabled by big data IV
Leverage data as it is captured
© 2014 IBM Corporation15
Complementary Analytics
Traditional Approach
Structured, analytical, logical
New Approach
Creative, holistic thought, intuition
Multimedia
Data
Warehouse
Web Logs
Social Data
Sensor data:
images
RFID
Internal App
Data
Transaction
Data
Mainframe
Data
OLTP System
Data
Traditional
databases
ERP
Data
Structured
Repeatable
Linear
Unstructured
Exploratory
Dynamic
Text Data:
emails
Hadoop and
Streams
New
Sources
© 2014 IBM Corporation16
Types of Analytic Tools
© 2014 IBM Corporation17
Organisations are prioritising internal data sources
Untapped stores of internal data
 Size and scope of some internal data, such as
detailed transactions and operational log data,
have become too large and varied to manage
within traditional systems
 New infrastructure components make them
accessible for analysis
 Some data has been collected, but not
analyzed, for years
Focus on customer insights
 Customers – influenced by digital experiences
– often expect information provided to an
organization will then be “known” during future
interactions
 Combining disparate internal sources with
advanced analytics creates insights into
customer behavior and preferences
(Transactions, Emails, Call center interaction records)
Big data sources
Respondents were
asked which data
sources are currently
being collected and
analyzed as part of
active big data efforts
within their
organization.
© 2014 IBM Corporation18
Stages of Big Data adoption
18
Big data adoption
When segmented into four groups based on current levels of big data activity, respondents showed significant consistency
in organizational behaviors Total respondents n = 1061
Totals do not equal 100% due to rounding
© 2014 IBM Corporation19
Hadoop workloads
92%
92%
83%
58%
42%
25%
58%
92%
92%
92%
67%
67%
67%
83%
Staging area
Online archive
Transformation Engine
Ad hoc queries
Scheduled reports
Visual exploration
Data mining
Today In 18 Months
Based on respondents that have implemented Hadoop. BI
Leadership Forum, April, 2012
© 2014 IBM Corporation20
Big Data Exploration
Find, visualize, understand all
big data to improve decision
making
Enhanced 360o View
of the Customer
Extend existing customer views
(MDM, CRM, etc) by
incorporating additional
internal and external
information sources
Operations Analysis
Analyze a variety of machine
data for improved business results
Data Warehouse Modernization
Integrate big data and data warehouse
capabilities to increase operational efficiency
Security/Intelligence
Extension
Lower risk, detect fraud and
monitor cyber security in
real-time
Key Big Data Use Cases
© 2014 IBM Corporation21
 Big Data Concepts
 Big Data Technology
 Data Scientists
© 2014 IBM Corporation22
Solution for Big Data
Rest Data:
– Data to analyze are already stored (structured and unstructured)
– Examples: logs, facebook, twitter, etc.
– Solution: Hadoop (open source)
Data in motion:
– Data are analyzed in real time, just in the moment they are generated.
They are analyzed with any previous storage
– Examples: Sensors, RFID, etc.
– Solution: Streams / CEP solutions
© 2014 IBM Corporation23
Hardware improvements through the years...
 CPU Speeds:
– 1990 - 44 MIPS at 40 MHz
– 2000 - 3,561 MIPS at 1.2 GHz
– 2010 - 147,600 MIPS at 3.3 GHz
 RAM Memory
– 1990 – 640K conventional memory (256K extended memory recommended)
– 2000 – 64MB memory
– 2010 - 8-32GB (and more)
 Disk Capacity
– 1990 – 20MB
– 2000 - 1GB
– 2010 – 1TB
 Disk Latency (speed of reads and writes) – not much improvement in last 7-10 years,
currently around 70 – 80MB / sec
© 2014 IBM Corporation24
How long it will take to read 1TB of data?
 1TB (at 80Mb / sec):
– 1 disk - 3.4 hours
– 10 disks - 20 min
– 100 disks - 2 min
– 1000 disks - 12 sec
© 2014 IBM Corporation25
Parallel Data Processing is the answer!
 It was with us for a while:
– GRID computing - spreads processing load
– Distributed workload - hard to manage applications, overhead on
developer
– Parallel databases – DB2 DPF, Teradata, Netezza, etc (distribute the
data)
© 2014 IBM Corporation26
What is Apache Hadoop?
 Apache Open source software framework.
 Flexible, enterprise-class support for processing large volumes of
data
– Inspired by Google technologies (MapReduce, GFS, BigTable, …)
– Initiated at Yahoo
• Originally built to address scalability problems of Nutch, an open source Web search
technology
– Well-suited to batch-oriented, read-intensive applications
– Supports wide variety of data
 Enables applications to work with thousands of nodes and petabytes
of data in a highly parallel, cost effective manner
– CPU + local disks = “node”
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
• Data formats
• How data is loaded
• How jobs are written
© 2014 IBM Corporation27
Design principles of Hadoop
 New way of storing and processing the data:
– Let system handle most of the issues automatically:
• Failures
• Scalability
• Reduce communications
• Distribute data and processing power to where the data is
• Make parallelism part of operating system
• Meant for heterogeneous commodity hardware
 Bring processing to Data!
 Hadoop = HDFS + MapReduce infrastructure
 Optimized to handle
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using inexpensive commodity hardware
 Reliability provided through replication
© 2014 IBM Corporation28
What is the Hadoop Distributed File System?
 Driving principals
– Data is stored across the entire cluster (multiple nodes)
– Programs are brought to the data, not the data to the program
– Follows the Divide and Conquer paradigm.
 Data is stored across the entire cluster (the DFS)
– The entire cluster participates in the file system
– Blocks of a single file are distributed across the cluster
– A given block is typically replicated as well for resiliency
1011010
0101001
0011100
1111110
0101001
1101001
0100101
1001001
0101001
1000101
0010111
0101110
1011110
1101101
0101101
0010101
0010101
0101011
1001001
1010111
0100
Logical File
1
2
3
4
Blocks
1
Cluster
1
1
2
2
2
3
3
34
4
4
© 2011 IBM Corporation29
Introduction to MapReduce
 Scalable to thousands of nodes and petabytes of data
MapReduce Application
1. Map Phase
(break job into small parts)
2. Shuffle
(transfer interim output
for final processing)
3. Reduce Phase
(boil all output down to
a single result set)
Return a single result setResult Set
Shuffle
public static class TokenizerMapper
extends Mapper<Object,Text,Text,IntWritable> {
private final static IntWritable
one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text val, Context
StringTokenizer itr =
new StringTokenizer(val.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWrita
private IntWritable result = new IntWritable();
public void reduce(Text key,
Iterable<IntWritable> val, Context context){
int sum = 0;
for (IntWritable v : val) {
sum += v.get();
. . .
Distribute map
tasks to cluster
Hadoop Data Nodes
© 2011 IBM Corporation30
MapReduce Example
Hello World Bye World
Hello IBM
Reduce (final output):
< Bye, 1>
< IBM, 1>
< Hello, 2>
< World, 2>
Map 1:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
 Count number of word's occurrences
Map 2:
< Hello, 1>
< IBM, 1>
Entry Data
Map
Process
Reduce
Process
Shuffle
Process
© 2014 IBM Corporation31
How to Analyze Large Data Sets in Hadoop
 It's not just runtime. Development phase has to be taken into
account.
 Although the Hadoop framework is implemented in Java,
MapReduce applications do not need to be written in Java
 To abstract complexities of Hadoop programming model, a few
application development languages have emerged that build on top
of Hadoop:
– Pig
– Hive
– Jaql
– ... Jaql
© 2014 IBM Corporation32
Pig, Hive, Jaql – Similarities
 Reduced program size over Java
 Applications are translated to map
and reduce jobs behind scenes
 Extension points for extending
existing functionality
 Interoperability with other
languages
 Not designed for random
reads/writes or low-latency queries
Pig, Hive, Jaql – Differences
Characteristic Pig Hive Jaql
Developed by Yahoo! Facebook IBM
Language Pig Latin HiveQL Jaql
Type of
language Data flow Declarative (SQL
dialect) Data flow
Data structures
supported Complex Better suited for
structured data
JSON, semi
structured
Schema Optional Not optional Optional
Hadoop Distributions
34
© 2014 IBM Corporation35
Example of Hadoop Ecosystem
Visualization & Discovery
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data Store
HBase
Text Processing Engine &
Extractor Library)
BigSheets JDBC
Applications & Development
Text Analytics MapReduce
Pig & Jaql Hive
Administration
Index
Splittable Text
Compression
Enhanced
Security
Flexible
Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive
MapReduce
Hive
Integrated
Installer
Admin Console
Sqoop
Adaptive Algorithms
Dashboard &
Visualization
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit & History
Lineage
R
Guardium
Platform
Computing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode
High Avail
Avro
© 2014 IBM Corporation36
Open Source frameworks I
 Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are
contained within a file, and is validated as the data is written to the file using the Avro APIs. Users can include primary data
types and complex type definitions within a schema.
 Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data in a Hadoop
cluster
 HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data
sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications
are written in Java™, much like a typical MapReduce application. HBase allows many attributes to be grouped into column
families so that the elements of a column family are all stored together. This approach is different from a row-oriented
relational database, where all columns of a row are stored together
 HCatalog: A table and storage management service for Hadoop data that presents a table abstraction so that you do
not need to know where or how your data is stored. You can change how you write data, while still supporting existing data in
older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service
that includes functions for both MapReduce and Pig
 Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to
analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements,
which are broken down by the Hive service into MapReduce jobs, and then run across a Hadoop cluster. InfoSphere
BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos
Business Intelligence software.
© 2014 IBM Corporation37
Open Source frameworks II
 Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a
collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key
component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene
libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build,
scan, and query Lucene indexes
 Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides
users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the
required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of
specific data in the file system.
 R: A Project for Statistical Computing
 Scoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop
systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and
export it to relational databases and enterprise data warehouses.
 Zookeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper
maintains common objects that are needed in large cluster environments, such as configuration information, distributed
synchronization, and group services.
© 2014 IBM Corporation38
Example of Hadoop Ecosystem
Dashboard &
Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data Store
HBase
Text Processing Engine &
Extractor Library)
JDBC
Applications & Development
Text Analytics MapReduce
Pig & Jaql Hive
Administration
Index
Splittable Text
Compression
Enhanced
Security
Flexible
Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive
MapReduce
Hive
Integrated
Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit & History
Lineage
R
Guardium
Platform
Computing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode
High Avail
Avro
Visualization & Discovery
BigSheets
© 2014 IBM Corporation39
BigSheets
 Browser based Analytical Tool that generates Map/Reduce Jobs
working over Hadoop Big Data.
 Helps non-programmers to work with Hadoop cluster.
 User models their big data as familiar spreadsheet-like tabular data
structures (collections). Once data is represented in a collection,
business analysts can filter and enrich its content using built-in
functions and macros. Furthermore, analysts can combine data
residing in different collections as well as generate charts and new
“sheets” (collections) to visualize their data. They can even export
data into a variety of common formats with a click of a button.
 Much of the technology included in Sheets was derived from the
BigSheets project of IBM’s Emerging Technologies team.
© 2014 IBM Corporation40
BigSheets: Collection Sample
 Spreadsheet-like structures defined by user
 Based on data accessible through BigInsights Web console – e.g., file
system data, output from Web crawl, etc.
© 2014 IBM Corporation41
Big Sheets: Collection Operations
 Work with built-in “sheets” editor
 Add / delete columns
 Filter data
 Specify formulas to compute new
values using spreadsheet-style
syntax
 Apply built-in or custom macro
functions
 …………..
© 2014 IBM Corporation42
BigSheets: Collection Graphic Visualization
 Built-in charting facility aids analysis
 Pie charts, bar charts, tag clouds, maps, etc.
 Hover over sections to reveal details
© 2014 IBM Corporation43
Example of Hadoop Ecosystem
Dashboard &
Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data Store
HBase
Text Processing Engine &
Extractor Library)
JDBC
Applications & Development
Text Analytics MapReduce
Pig & Jaql Hive
Administration
Index
Splittable Text
Compression
Enhanced
Security
Flexible
Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive
MapReduce
Hive
Integrated
Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit & History
Lineage
R
Guardium
Platform
Computing
Cognos
IBMOpen Source
GPFS-FPO
NameNode
High Avail
Avro
Visualization & Discovery
BigSheets
Big SQL
© 2014 IBM Corporation44
What is Big SQL?
 Big SQL brings robust SQL support to the Hadoop ecosystem
– Scalable server architecture
– Comprehensive SQL'92 ansi support
– Standards compliant client drivers (JDBC & ODBC)
– Efficient handling of "point queries"
– Wide variety of data sources and file formats
– Extensive HBase focus
– Open source interoperability
 Our driving design goals
– Existing queries should run with no or few modifications
– Existing JDBC and ODBC compliant tools should continue to function
– Queries should be executed as efficiently as the chosen storage
mechanisms allow
© 2014 IBM Corporation45
Architecture
Big SQL shares catalogs with
Hive via the Hive metastore
– Each can query the others tables
 SQL engine analyzes incoming
queries
– Separates portion(s) to execute at
the server vs. portion(s) to execute
on the cluster
– Re-writes query if necessary for
improved performance
– Determines appropriate storage
handler for data
– Produces execution plan
– Executes and coordinates query
 Server layout and relative sizes
for illustrative purposes only!
Application
SQL Language
JDBC / ODBC Driver
BigInsights Cluster
Head Node
Big SQL Server
Head Node
Name Node
Head Node
Job TrackerNetwork Protocol
SQL Engine
Storage Handlers
Del
Files
SEQ
Files
HBase RDBMS •••
•••
Compute Node
Task
Tracker
Data
Node
Region
Server
Compute Node
Task
Tracker
Data
Node
Region
Server
•••
Compute Node
Task
Tracker
Data
Node
Region
Server
Head Node
Hive Metastore
© 2014 IBM Corporation46
Example of Hadoop Ecosystem
Dashboard &
Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data Store
HBase
Text Processing Engine &
Extractor Library)
JDBC
Applications & Development
MapReduce
Pig & Jaql Hive
Administration
Index
Splittable Text
Compression
Enhanced
Security
Flexible
Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive
MapReduce
Hive
Integrated
Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit & History
Lineage
R
Guardium
Platform
Computing
Cognos
IBMOpen Source
GPFS-FPO
NameNode
High Avail
Avro
Visualization & Discovery
BigSheets
Big SQL
Text Analytics
© 2014 IBM Corporation47
What is Text Analytics?
 High Performance and Scalable rule based Information Extraction Engine.
 Distill structured information from unstructured data
- Rich annotator library supports multiple languages
 Provides sophisticated tooling to help build, test, and refine rules.
– Developer tools, an easy to use text analytics language, and a set of
extractors for fast adoption.
– Multilingual support, including support for DBCS languages.
 Developed at IBM Research since 2004: System T
 BigInsights is the first time IBM opens up the Text Analytics Engine
technology for customization and development
© 2014 IBM Corporation48
Annotator Query Language (AQL)
 Language to create rules for Text Analytics.
 SQL Like Language.
 Fully declarative text analytics language.
 Once compiled produced an AOG plan to work in the data.
 No “black boxes” or modules that can’t be customized.
 Tooling for easy customization because you are abstracted from the
programmatic details.
 Competing solutions make use of locked up black-box modules that cannot be
customized, which restricts flexibility and are difficult to optimize for performance
create view AmountWithUnit as
extract pattern <N.match> <U.match>
as match
from Number N, Unit U;
© 2014 IBM Corporation49
Text Analytic: Simple Example
NetherlandsStrikerArjen Robben
Keeper SpainIker Casillas
WingerAndres Iniesta Spain
World Cup 2010 Highlights
Football World Cup 2010, one team distinguished well
from the rest winning the final. Early in the second
half, Netherlands’ striker, Arjen Robben, had a chance
to score, but the awesome keeper for Spain, Iker
Casillas made the save. Winner superiority was
reflected when Winger Andres Iniesta scored for Spain
for the win.
© 2014 IBM Corporation50
50
Text Analytic: Real Example
© 2014 IBM Corporation51
51
One step beyond: Watson
© 2014 IBM Corporation52
Example of Hadoop Ecosystem
Dashboard &
Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data Store
HBase
Text Processing Engine &
Extractor Library)
JDBC
Applications & Development
MapReduce
Pig & Jaql Hive
Administration
Index
Splittable Text
Compression
Enhanced
Security
Flexible
Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive
MapReduce
Hive
Integrated
Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit & History
Lineage
Guardium
Platform
Computing
Cognos
IBMOpen Source
GPFS-FPO
NameNode
High Avail
Avro
Visualization & Discovery
BigSheets
Big SQL
Text Analytics
R
© 2014 IBM Corporation53
Big R
• Explore, visualize, transform,
and model big data using
familiar R syntax and
paradigm
• Scale out R with MR
programming
– Partitioning of large data
– Parallel cluster execution of R
code
• Distributed Machine
Learning
– A scalable statistics engine that
provides canned algorithms, and
an ability to author new ones, all
via R
R Clients
Scalable
Machine
Learning
Data Sources
Embedded R
Execution
IBM R Packages
IBM R Packages
Pull data
(summaries) to
R client
Or, push R
functions
right on the
data
1
2
3
© 2014 IBM Corporation54
Where Does BigData Fit?
Analytical
database
(DW)
Source
Systems
Analytical
tools
5. Explore data
6. Parse, aggregate
“Capture in case
it’s needed”
1. Extract, transform, load
“Capture only what’s
needed”
9. Report and mine data
© 2014 IBM Corporation55
 Big Data Concepts
 Big Data Technology
 Data Scientists
© 2014 IBM Corporation56
Data scientist – The new cool guy in town
Article in Fortune “The unemployment rate in
the U.S. continues to be abysmal (9.1% in
July), but the tech world has spawned a
new kind of highly skilled, nerdy-cool job
that companies are scrambling to fill: data
scientist”
McKinsey Global Institute “Big data Report”
By 2018, the United States alone could
face a shortage of 140,000 to 190,000
people with deep analytical skills as well as
1.5 million managers and analysts with the
know-how to use the analysis of big data to
make effective decisions
© 2014 IBM Corporation57
Data Science is Multidisciplinary
© 2014 IBM Corporation58
Successful Data Scientist Characteristics
© 2014 IBM Corporation59
Data Scientist Qualities
© 2014 IBM Corporation60
How Long Does It Take For a Beginner to Become
a Good Data Scientist?
© 2014 IBM Corporation61
www.kaggle.com
© 2014 IBM Corporation62
Kaggle ranking
© 2014 IBM Corporation63 © 2013 IBM Corporation63
Learn Big Data
 Reading Materials - Online
– Understanding Big Data – Free PDF Book
• http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
– Developing, publishing, and deploying your first big data application with InfoSphere
BigInsights
• www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
– Implementing IBM InfoSphere BigInsights on System x - Redbook
• http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
 Resources
– Big Data Information Center
• www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
– InfoSphere BigInsights
• www-01.ibm.com/software/data/infosphere/biginsights/
– Stream Computing
• www-01.ibm.com/software/data/infosphere/stream-computing/
– DeveloperWorks, forums, demos ...
• http://www.ibm.com/developerworks/wiki/biginsights/
© 2014 IBM Corporation64
Learn Big Data Technologies
BigDataUniversity.com
Flexible on-line delivery
allows learning @your place
and @your pace
 Free courses, free study
materials.
 Cloud-based sandbox for
exercises – zero setup
 Robust Course
Management System and
Content Distribution
infrastructure
© 2014 IBM Corporation65
65

Contenu connexe

Tendances

Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar ReportAtul Kushwaha
 
Performance Issues on Hadoop Clusters
Performance Issues on Hadoop ClustersPerformance Issues on Hadoop Clusters
Performance Issues on Hadoop ClustersXiao Qin
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceHortonworks
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Milind Bhandarkar
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Senthil Kumar
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce FrameworkEdureka!
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopVictoria López
 
Hw09 Welcome To Hadoop World
Hw09   Welcome To Hadoop WorldHw09   Welcome To Hadoop World
Hw09 Welcome To Hadoop WorldCloudera, Inc.
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringBADR
 

Tendances (20)

Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Performance Issues on Hadoop Clusters
Performance Issues on Hadoop ClustersPerformance Issues on Hadoop Clusters
Performance Issues on Hadoop Clusters
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Hadoop
HadoopHadoop
Hadoop
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduce
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?
 
002 Introduction to hadoop v3
002   Introduction to hadoop v3002   Introduction to hadoop v3
002 Introduction to hadoop v3
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 
Big Data Processing
Big Data ProcessingBig Data Processing
Big Data Processing
 
An Introduction to the World of Hadoop
An Introduction to the World of HadoopAn Introduction to the World of Hadoop
An Introduction to the World of Hadoop
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
Hw09 Welcome To Hadoop World
Hw09   Welcome To Hadoop WorldHw09   Welcome To Hadoop World
Hw09 Welcome To Hadoop World
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data Engineering
 

En vedette

Big data presentation for University of Reykjavik, Iceland, March 22
Big data presentation for University of Reykjavik, Iceland, March 22 Big data presentation for University of Reykjavik, Iceland, March 22
Big data presentation for University of Reykjavik, Iceland, March 22 Thorhildur Jetzek, Ph.D.
 
Content analytics for healthcare
Content analytics for healthcareContent analytics for healthcare
Content analytics for healthcareXavier Constant
 
Language Technologies for Big Data – A Strategic Agenda for the Multilingual ...
Language Technologies for Big Data – A Strategic Agenda for the Multilingual ...Language Technologies for Big Data – A Strategic Agenda for the Multilingual ...
Language Technologies for Big Data – A Strategic Agenda for the Multilingual ...Georg Rehm
 
Big Data: Technical Introduction to BigSheets for InfoSphere BigInsights
Big Data:  Technical Introduction to BigSheets for InfoSphere BigInsightsBig Data:  Technical Introduction to BigSheets for InfoSphere BigInsights
Big Data: Technical Introduction to BigSheets for InfoSphere BigInsightsCynthia Saracco
 
Big Data: Getting started with Big SQL self-study guide
Big Data:  Getting started with Big SQL self-study guideBig Data:  Getting started with Big SQL self-study guide
Big Data: Getting started with Big SQL self-study guideCynthia Saracco
 
Présentation IBM InfoSphere MDM 11.3
Présentation IBM InfoSphere MDM 11.3Présentation IBM InfoSphere MDM 11.3
Présentation IBM InfoSphere MDM 11.3IBMInfoSphereUGFR
 
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Cynthia Saracco
 
محاضرات تحليل احصائي Spss
محاضرات تحليل احصائي Spssمحاضرات تحليل احصائي Spss
محاضرات تحليل احصائي Spsschamkki999
 

En vedette (9)

Big data presentation for University of Reykjavik, Iceland, March 22
Big data presentation for University of Reykjavik, Iceland, March 22 Big data presentation for University of Reykjavik, Iceland, March 22
Big data presentation for University of Reykjavik, Iceland, March 22
 
Content analytics for healthcare
Content analytics for healthcareContent analytics for healthcare
Content analytics for healthcare
 
Presenting Data In Excel
Presenting Data In ExcelPresenting Data In Excel
Presenting Data In Excel
 
Language Technologies for Big Data – A Strategic Agenda for the Multilingual ...
Language Technologies for Big Data – A Strategic Agenda for the Multilingual ...Language Technologies for Big Data – A Strategic Agenda for the Multilingual ...
Language Technologies for Big Data – A Strategic Agenda for the Multilingual ...
 
Big Data: Technical Introduction to BigSheets for InfoSphere BigInsights
Big Data:  Technical Introduction to BigSheets for InfoSphere BigInsightsBig Data:  Technical Introduction to BigSheets for InfoSphere BigInsights
Big Data: Technical Introduction to BigSheets for InfoSphere BigInsights
 
Big Data: Getting started with Big SQL self-study guide
Big Data:  Getting started with Big SQL self-study guideBig Data:  Getting started with Big SQL self-study guide
Big Data: Getting started with Big SQL self-study guide
 
Présentation IBM InfoSphere MDM 11.3
Présentation IBM InfoSphere MDM 11.3Présentation IBM InfoSphere MDM 11.3
Présentation IBM InfoSphere MDM 11.3
 
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
 
محاضرات تحليل احصائي Spss
محاضرات تحليل احصائي Spssمحاضرات تحليل احصائي Spss
محاضرات تحليل احصائي Spss
 

Similaire à Big data presentation (2014)

Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14John Sing
 
Bridging the Big Data Gap in the Software-Driven World
Bridging the Big Data Gap in the Software-Driven WorldBridging the Big Data Gap in the Software-Driven World
Bridging the Big Data Gap in the Software-Driven WorldCA Technologies
 
The Transformation of your Data in modern IT (Presented by DellEMC)
The Transformation of your Data in modern IT (Presented by DellEMC)The Transformation of your Data in modern IT (Presented by DellEMC)
The Transformation of your Data in modern IT (Presented by DellEMC)Cloudera, Inc.
 
Deutsche Telekom on Big Data
Deutsche Telekom on Big DataDeutsche Telekom on Big Data
Deutsche Telekom on Big DataDataWorks Summit
 
Big data and you
Big data and you Big data and you
Big data and you IBM
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringIRJET Journal
 
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization Denodo
 
Exploring the Wider World of Big Data- Vasalis Kapsalis
Exploring the Wider World of Big Data- Vasalis KapsalisExploring the Wider World of Big Data- Vasalis Kapsalis
Exploring the Wider World of Big Data- Vasalis KapsalisNetAppUK
 
Getting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixGetting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixNicolas Morales
 
big data and hadoop
 big data and hadoop big data and hadoop
big data and hadoopahmed alshikh
 
Stratebi Big Data
Stratebi Big DataStratebi Big Data
Stratebi Big DataStratebi
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptalmaraniabwmalk
 
Sycamore Quantum Computer 2019 developed.pptx
Sycamore Quantum Computer 2019 developed.pptxSycamore Quantum Computer 2019 developed.pptx
Sycamore Quantum Computer 2019 developed.pptxshujee381
 
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive AdvantageFueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive AdvantagePrecisely
 
Cloud Computing & Big Data
Cloud Computing & Big DataCloud Computing & Big Data
Cloud Computing & Big DataMrinal Kumar
 
Introduction To Big Data & Hadoop
Introduction To Big Data & HadoopIntroduction To Big Data & Hadoop
Introduction To Big Data & HadoopBlackvard
 

Similaire à Big data presentation (2014) (20)

Big Data
Big DataBig Data
Big Data
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14
 
Bridging the Big Data Gap in the Software-Driven World
Bridging the Big Data Gap in the Software-Driven WorldBridging the Big Data Gap in the Software-Driven World
Bridging the Big Data Gap in the Software-Driven World
 
The Transformation of your Data in modern IT (Presented by DellEMC)
The Transformation of your Data in modern IT (Presented by DellEMC)The Transformation of your Data in modern IT (Presented by DellEMC)
The Transformation of your Data in modern IT (Presented by DellEMC)
 
Big Data
Big DataBig Data
Big Data
 
Deutsche Telekom on Big Data
Deutsche Telekom on Big DataDeutsche Telekom on Big Data
Deutsche Telekom on Big Data
 
Big Data przt.pptx
Big Data przt.pptxBig Data przt.pptx
Big Data przt.pptx
 
Big data and you
Big data and you Big data and you
Big data and you
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and Storing
 
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
 
Exploring the Wider World of Big Data- Vasalis Kapsalis
Exploring the Wider World of Big Data- Vasalis KapsalisExploring the Wider World of Big Data- Vasalis Kapsalis
Exploring the Wider World of Big Data- Vasalis Kapsalis
 
Getting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixGetting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with Bluemix
 
big data and hadoop
 big data and hadoop big data and hadoop
big data and hadoop
 
Stratebi Big Data
Stratebi Big DataStratebi Big Data
Stratebi Big Data
 
Big Data
Big DataBig Data
Big Data
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.ppt
 
Sycamore Quantum Computer 2019 developed.pptx
Sycamore Quantum Computer 2019 developed.pptxSycamore Quantum Computer 2019 developed.pptx
Sycamore Quantum Computer 2019 developed.pptx
 
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive AdvantageFueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage
 
Cloud Computing & Big Data
Cloud Computing & Big DataCloud Computing & Big Data
Cloud Computing & Big Data
 
Introduction To Big Data & Hadoop
Introduction To Big Data & HadoopIntroduction To Big Data & Hadoop
Introduction To Big Data & Hadoop
 

Dernier

April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Onlineanilsa9823
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 

Dernier (20)

April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 

Big data presentation (2014)

  • 1. © 2014 IBM Corporation1 Big Data Xavier Constant xavier.constant@es.ibm.com Lecture at EADA International Master in Marketing (2014)
  • 2. © 2014 IBM Corporation2  Big Data Concepts  Big Data Technology  Data Scientists
  • 3. © 2014 IBM Corporation3 Traditional DW BI Server ERP CRM Data Marts Reports / Dashboards Operational System ETL ETL BENEFITS:  Mature Technology  SQL Language (declarative, non technical)  Skills & resources availablity (programmers, DBAs,…) LIMITATIONS:  Big operational data volumes  Queries take too long or don’t even finish  Admin complexity (partitions, archiving,…)  New data types  Free text, images, video, audio,…  Data in real time (sensors, logs, geospatial data, etc…)  New analysis types  Exploratory  Predictive Flat files, Spread sheets Data Warehouse(s)
  • 4. © 2014 IBM Corporation4 1 in 2 business leaders don’t have access to data they need 83% of CIO’s cited BI and analytics as part of their visionary plan 5.4X more likely that top performers use business analytics 80% of the world’s data today is unstructured 90% of the world’s data was created in the last two years 20% of available data can be processed by traditional systems Source: GigaOM, Software Group, IBM Institute for Business Value" Intrinsic Property of Data … it grows
  • 5. © 2014 IBM Corporation5 Characteristics of Big Data Velocity is the game changer: It’s NOT just how fast data is produced or changed, BUT the speed at which it must be analyzed received, understood, and processed.
  • 6. © 2014 IBM Corporation6 Paradigm shifts enabled by big data I Leverage more of the data being captured
  • 7. © 2014 IBM Corporation7 Paradigm shifts enabled by big data I Leverage more of the data being captured Bank X
  • 8. © 2014 IBM Corporation8 Paradigm shifts enabled by big data II Reduce effort required to leverage data
  • 9. © 2014 IBM Corporation9 Paradigm shifts enabled by big data II Reduce effort required to leverage data
  • 10. © 2014 IBM Corporation10 Paradigm shifts enabled by big data III Data leads the way – and sometimes correlations are good enough
  • 11. © 2014 IBM Corporation11 Paradigm shifts enabled by big data III Data leads the way – and sometimes correlations are good enough Hypothesis based correlation Weird correlation
  • 12. © 2014 IBM Corporation12 Paradigm shifts enabled by big data III Data leads the way – and sometimes correlations are good enough
  • 13. © 2014 IBM Corporation13 Paradigm shifts enabled by big data IV Leverage data as it is captured
  • 14. © 2014 IBM Corporation14 Paradigm shifts enabled by big data IV Leverage data as it is captured
  • 15. © 2014 IBM Corporation15 Complementary Analytics Traditional Approach Structured, analytical, logical New Approach Creative, holistic thought, intuition Multimedia Data Warehouse Web Logs Social Data Sensor data: images RFID Internal App Data Transaction Data Mainframe Data OLTP System Data Traditional databases ERP Data Structured Repeatable Linear Unstructured Exploratory Dynamic Text Data: emails Hadoop and Streams New Sources
  • 16. © 2014 IBM Corporation16 Types of Analytic Tools
  • 17. © 2014 IBM Corporation17 Organisations are prioritising internal data sources Untapped stores of internal data  Size and scope of some internal data, such as detailed transactions and operational log data, have become too large and varied to manage within traditional systems  New infrastructure components make them accessible for analysis  Some data has been collected, but not analyzed, for years Focus on customer insights  Customers – influenced by digital experiences – often expect information provided to an organization will then be “known” during future interactions  Combining disparate internal sources with advanced analytics creates insights into customer behavior and preferences (Transactions, Emails, Call center interaction records) Big data sources Respondents were asked which data sources are currently being collected and analyzed as part of active big data efforts within their organization.
  • 18. © 2014 IBM Corporation18 Stages of Big Data adoption 18 Big data adoption When segmented into four groups based on current levels of big data activity, respondents showed significant consistency in organizational behaviors Total respondents n = 1061 Totals do not equal 100% due to rounding
  • 19. © 2014 IBM Corporation19 Hadoop workloads 92% 92% 83% 58% 42% 25% 58% 92% 92% 92% 67% 67% 67% 83% Staging area Online archive Transformation Engine Ad hoc queries Scheduled reports Visual exploration Data mining Today In 18 Months Based on respondents that have implemented Hadoop. BI Leadership Forum, April, 2012
  • 20. © 2014 IBM Corporation20 Big Data Exploration Find, visualize, understand all big data to improve decision making Enhanced 360o View of the Customer Extend existing customer views (MDM, CRM, etc) by incorporating additional internal and external information sources Operations Analysis Analyze a variety of machine data for improved business results Data Warehouse Modernization Integrate big data and data warehouse capabilities to increase operational efficiency Security/Intelligence Extension Lower risk, detect fraud and monitor cyber security in real-time Key Big Data Use Cases
  • 21. © 2014 IBM Corporation21  Big Data Concepts  Big Data Technology  Data Scientists
  • 22. © 2014 IBM Corporation22 Solution for Big Data Rest Data: – Data to analyze are already stored (structured and unstructured) – Examples: logs, facebook, twitter, etc. – Solution: Hadoop (open source) Data in motion: – Data are analyzed in real time, just in the moment they are generated. They are analyzed with any previous storage – Examples: Sensors, RFID, etc. – Solution: Streams / CEP solutions
  • 23. © 2014 IBM Corporation23 Hardware improvements through the years...  CPU Speeds: – 1990 - 44 MIPS at 40 MHz – 2000 - 3,561 MIPS at 1.2 GHz – 2010 - 147,600 MIPS at 3.3 GHz  RAM Memory – 1990 – 640K conventional memory (256K extended memory recommended) – 2000 – 64MB memory – 2010 - 8-32GB (and more)  Disk Capacity – 1990 – 20MB – 2000 - 1GB – 2010 – 1TB  Disk Latency (speed of reads and writes) – not much improvement in last 7-10 years, currently around 70 – 80MB / sec
  • 24. © 2014 IBM Corporation24 How long it will take to read 1TB of data?  1TB (at 80Mb / sec): – 1 disk - 3.4 hours – 10 disks - 20 min – 100 disks - 2 min – 1000 disks - 12 sec
  • 25. © 2014 IBM Corporation25 Parallel Data Processing is the answer!  It was with us for a while: – GRID computing - spreads processing load – Distributed workload - hard to manage applications, overhead on developer – Parallel databases – DB2 DPF, Teradata, Netezza, etc (distribute the data)
  • 26. © 2014 IBM Corporation26 What is Apache Hadoop?  Apache Open source software framework.  Flexible, enterprise-class support for processing large volumes of data – Inspired by Google technologies (MapReduce, GFS, BigTable, …) – Initiated at Yahoo • Originally built to address scalability problems of Nutch, an open source Web search technology – Well-suited to batch-oriented, read-intensive applications – Supports wide variety of data  Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner – CPU + local disks = “node” – Nodes can be combined into clusters – New nodes can be added as needed without changing • Data formats • How data is loaded • How jobs are written
  • 27. © 2014 IBM Corporation27 Design principles of Hadoop  New way of storing and processing the data: – Let system handle most of the issues automatically: • Failures • Scalability • Reduce communications • Distribute data and processing power to where the data is • Make parallelism part of operating system • Meant for heterogeneous commodity hardware  Bring processing to Data!  Hadoop = HDFS + MapReduce infrastructure  Optimized to handle – Massive amounts of data through parallelism – A variety of data (structured, unstructured, semi-structured) – Using inexpensive commodity hardware  Reliability provided through replication
  • 28. © 2014 IBM Corporation28 What is the Hadoop Distributed File System?  Driving principals – Data is stored across the entire cluster (multiple nodes) – Programs are brought to the data, not the data to the program – Follows the Divide and Conquer paradigm.  Data is stored across the entire cluster (the DFS) – The entire cluster participates in the file system – Blocks of a single file are distributed across the cluster – A given block is typically replicated as well for resiliency 1011010 0101001 0011100 1111110 0101001 1101001 0100101 1001001 0101001 1000101 0010111 0101110 1011110 1101101 0101101 0010101 0010101 0101011 1001001 1010111 0100 Logical File 1 2 3 4 Blocks 1 Cluster 1 1 2 2 2 3 3 34 4 4
  • 29. © 2011 IBM Corporation29 Introduction to MapReduce  Scalable to thousands of nodes and petabytes of data MapReduce Application 1. Map Phase (break job into small parts) 2. Shuffle (transfer interim output for final processing) 3. Reduce Phase (boil all output down to a single result set) Return a single result setResult Set Shuffle public static class TokenizerMapper extends Mapper<Object,Text,Text,IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text val, Context StringTokenizer itr = new StringTokenizer(val.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } } public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWrita private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> val, Context context){ int sum = 0; for (IntWritable v : val) { sum += v.get(); . . . Distribute map tasks to cluster Hadoop Data Nodes
  • 30. © 2011 IBM Corporation30 MapReduce Example Hello World Bye World Hello IBM Reduce (final output): < Bye, 1> < IBM, 1> < Hello, 2> < World, 2> Map 1: < Hello, 1> < World, 1> < Bye, 1> < World, 1>  Count number of word's occurrences Map 2: < Hello, 1> < IBM, 1> Entry Data Map Process Reduce Process Shuffle Process
  • 31. © 2014 IBM Corporation31 How to Analyze Large Data Sets in Hadoop  It's not just runtime. Development phase has to be taken into account.  Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java  To abstract complexities of Hadoop programming model, a few application development languages have emerged that build on top of Hadoop: – Pig – Hive – Jaql – ... Jaql
  • 32. © 2014 IBM Corporation32 Pig, Hive, Jaql – Similarities  Reduced program size over Java  Applications are translated to map and reduce jobs behind scenes  Extension points for extending existing functionality  Interoperability with other languages  Not designed for random reads/writes or low-latency queries
  • 33. Pig, Hive, Jaql – Differences Characteristic Pig Hive Jaql Developed by Yahoo! Facebook IBM Language Pig Latin HiveQL Jaql Type of language Data flow Declarative (SQL dialect) Data flow Data structures supported Complex Better suited for structured data JSON, semi structured Schema Optional Not optional Optional
  • 35. © 2014 IBM Corporation35 Example of Hadoop Ecosystem Visualization & Discovery Integration Workload Optimization Streams Netezza Flume DB2 DataStage IBM InfoSphere BigInsights Runtime Advanced Analytic Engines File System MapReduce HDFS Data Store HBase Text Processing Engine & Extractor Library) BigSheets JDBC Applications & Development Text Analytics MapReduce Pig & Jaql Hive Administration Index Splittable Text Compression Enhanced Security Flexible Scheduler Jaql Pig ZooKeeper Lucene Oozie Adaptive MapReduce Hive Integrated Installer Admin Console Sqoop Adaptive Algorithms Dashboard & Visualization Apps Workflow Monitoring Management HCatalog Security Audit & History Lineage R Guardium Platform Computing Cognos IBMOpen Source GPFS-FPO Big SQL NameNode High Avail Avro
  • 36. © 2014 IBM Corporation36 Open Source frameworks I  Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file, and is validated as the data is written to the file using the Avro APIs. Users can include primary data types and complex type definitions within a schema.  Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data in a Hadoop cluster  HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java™, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families so that the elements of a column family are all stored together. This approach is different from a row-oriented relational database, where all columns of a row are stored together  HCatalog: A table and storage management service for Hadoop data that presents a table abstraction so that you do not need to know where or how your data is stored. You can change how you write data, while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig  Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements, which are broken down by the Hive service into MapReduce jobs, and then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.
  • 37. © 2014 IBM Corporation37 Open Source frameworks II  Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan, and query Lucene indexes  Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.  R: A Project for Statistical Computing  Scoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.  Zookeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization, and group services.
  • 38. © 2014 IBM Corporation38 Example of Hadoop Ecosystem Dashboard & Visualization Integration Workload Optimization Streams Netezza Flume DB2 DataStage IBM InfoSphere BigInsights Runtime Advanced Analytic Engines File System MapReduce HDFS Data Store HBase Text Processing Engine & Extractor Library) JDBC Applications & Development Text Analytics MapReduce Pig & Jaql Hive Administration Index Splittable Text Compression Enhanced Security Flexible Scheduler Jaql Pig ZooKeeper Lucene Oozie Adaptive MapReduce Hive Integrated Installer Admin Console Sqoop Adaptive Algorithms Apps Workflow Monitoring Management HCatalog Security Audit & History Lineage R Guardium Platform Computing Cognos IBMOpen Source GPFS-FPO Big SQL NameNode High Avail Avro Visualization & Discovery BigSheets
  • 39. © 2014 IBM Corporation39 BigSheets  Browser based Analytical Tool that generates Map/Reduce Jobs working over Hadoop Big Data.  Helps non-programmers to work with Hadoop cluster.  User models their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections as well as generate charts and new “sheets” (collections) to visualize their data. They can even export data into a variety of common formats with a click of a button.  Much of the technology included in Sheets was derived from the BigSheets project of IBM’s Emerging Technologies team.
  • 40. © 2014 IBM Corporation40 BigSheets: Collection Sample  Spreadsheet-like structures defined by user  Based on data accessible through BigInsights Web console – e.g., file system data, output from Web crawl, etc.
  • 41. © 2014 IBM Corporation41 Big Sheets: Collection Operations  Work with built-in “sheets” editor  Add / delete columns  Filter data  Specify formulas to compute new values using spreadsheet-style syntax  Apply built-in or custom macro functions  …………..
  • 42. © 2014 IBM Corporation42 BigSheets: Collection Graphic Visualization  Built-in charting facility aids analysis  Pie charts, bar charts, tag clouds, maps, etc.  Hover over sections to reveal details
  • 43. © 2014 IBM Corporation43 Example of Hadoop Ecosystem Dashboard & Visualization Integration Workload Optimization Streams Netezza Flume DB2 DataStage IBM InfoSphere BigInsights Runtime Advanced Analytic Engines File System MapReduce HDFS Data Store HBase Text Processing Engine & Extractor Library) JDBC Applications & Development Text Analytics MapReduce Pig & Jaql Hive Administration Index Splittable Text Compression Enhanced Security Flexible Scheduler Jaql Pig ZooKeeper Lucene Oozie Adaptive MapReduce Hive Integrated Installer Admin Console Sqoop Adaptive Algorithms Apps Workflow Monitoring Management HCatalog Security Audit & History Lineage R Guardium Platform Computing Cognos IBMOpen Source GPFS-FPO NameNode High Avail Avro Visualization & Discovery BigSheets Big SQL
  • 44. © 2014 IBM Corporation44 What is Big SQL?  Big SQL brings robust SQL support to the Hadoop ecosystem – Scalable server architecture – Comprehensive SQL'92 ansi support – Standards compliant client drivers (JDBC & ODBC) – Efficient handling of "point queries" – Wide variety of data sources and file formats – Extensive HBase focus – Open source interoperability  Our driving design goals – Existing queries should run with no or few modifications – Existing JDBC and ODBC compliant tools should continue to function – Queries should be executed as efficiently as the chosen storage mechanisms allow
  • 45. © 2014 IBM Corporation45 Architecture Big SQL shares catalogs with Hive via the Hive metastore – Each can query the others tables  SQL engine analyzes incoming queries – Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster – Re-writes query if necessary for improved performance – Determines appropriate storage handler for data – Produces execution plan – Executes and coordinates query  Server layout and relative sizes for illustrative purposes only! Application SQL Language JDBC / ODBC Driver BigInsights Cluster Head Node Big SQL Server Head Node Name Node Head Node Job TrackerNetwork Protocol SQL Engine Storage Handlers Del Files SEQ Files HBase RDBMS ••• ••• Compute Node Task Tracker Data Node Region Server Compute Node Task Tracker Data Node Region Server ••• Compute Node Task Tracker Data Node Region Server Head Node Hive Metastore
  • 46. © 2014 IBM Corporation46 Example of Hadoop Ecosystem Dashboard & Visualization Integration Workload Optimization Streams Netezza Flume DB2 DataStage IBM InfoSphere BigInsights Runtime Advanced Analytic Engines File System MapReduce HDFS Data Store HBase Text Processing Engine & Extractor Library) JDBC Applications & Development MapReduce Pig & Jaql Hive Administration Index Splittable Text Compression Enhanced Security Flexible Scheduler Jaql Pig ZooKeeper Lucene Oozie Adaptive MapReduce Hive Integrated Installer Admin Console Sqoop Adaptive Algorithms Apps Workflow Monitoring Management HCatalog Security Audit & History Lineage R Guardium Platform Computing Cognos IBMOpen Source GPFS-FPO NameNode High Avail Avro Visualization & Discovery BigSheets Big SQL Text Analytics
  • 47. © 2014 IBM Corporation47 What is Text Analytics?  High Performance and Scalable rule based Information Extraction Engine.  Distill structured information from unstructured data - Rich annotator library supports multiple languages  Provides sophisticated tooling to help build, test, and refine rules. – Developer tools, an easy to use text analytics language, and a set of extractors for fast adoption. – Multilingual support, including support for DBCS languages.  Developed at IBM Research since 2004: System T  BigInsights is the first time IBM opens up the Text Analytics Engine technology for customization and development
  • 48. © 2014 IBM Corporation48 Annotator Query Language (AQL)  Language to create rules for Text Analytics.  SQL Like Language.  Fully declarative text analytics language.  Once compiled produced an AOG plan to work in the data.  No “black boxes” or modules that can’t be customized.  Tooling for easy customization because you are abstracted from the programmatic details.  Competing solutions make use of locked up black-box modules that cannot be customized, which restricts flexibility and are difficult to optimize for performance create view AmountWithUnit as extract pattern <N.match> <U.match> as match from Number N, Unit U;
  • 49. © 2014 IBM Corporation49 Text Analytic: Simple Example NetherlandsStrikerArjen Robben Keeper SpainIker Casillas WingerAndres Iniesta Spain World Cup 2010 Highlights Football World Cup 2010, one team distinguished well from the rest winning the final. Early in the second half, Netherlands’ striker, Arjen Robben, had a chance to score, but the awesome keeper for Spain, Iker Casillas made the save. Winner superiority was reflected when Winger Andres Iniesta scored for Spain for the win.
  • 50. © 2014 IBM Corporation50 50 Text Analytic: Real Example
  • 51. © 2014 IBM Corporation51 51 One step beyond: Watson
  • 52. © 2014 IBM Corporation52 Example of Hadoop Ecosystem Dashboard & Visualization Integration Workload Optimization Streams Netezza Flume DB2 DataStage IBM InfoSphere BigInsights Runtime Advanced Analytic Engines File System MapReduce HDFS Data Store HBase Text Processing Engine & Extractor Library) JDBC Applications & Development MapReduce Pig & Jaql Hive Administration Index Splittable Text Compression Enhanced Security Flexible Scheduler Jaql Pig ZooKeeper Lucene Oozie Adaptive MapReduce Hive Integrated Installer Admin Console Sqoop Adaptive Algorithms Apps Workflow Monitoring Management HCatalog Security Audit & History Lineage Guardium Platform Computing Cognos IBMOpen Source GPFS-FPO NameNode High Avail Avro Visualization & Discovery BigSheets Big SQL Text Analytics R
  • 53. © 2014 IBM Corporation53 Big R • Explore, visualize, transform, and model big data using familiar R syntax and paradigm • Scale out R with MR programming – Partitioning of large data – Parallel cluster execution of R code • Distributed Machine Learning – A scalable statistics engine that provides canned algorithms, and an ability to author new ones, all via R R Clients Scalable Machine Learning Data Sources Embedded R Execution IBM R Packages IBM R Packages Pull data (summaries) to R client Or, push R functions right on the data 1 2 3
  • 54. © 2014 IBM Corporation54 Where Does BigData Fit? Analytical database (DW) Source Systems Analytical tools 5. Explore data 6. Parse, aggregate “Capture in case it’s needed” 1. Extract, transform, load “Capture only what’s needed” 9. Report and mine data
  • 55. © 2014 IBM Corporation55  Big Data Concepts  Big Data Technology  Data Scientists
  • 56. © 2014 IBM Corporation56 Data scientist – The new cool guy in town Article in Fortune “The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist” McKinsey Global Institute “Big data Report” By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions
  • 57. © 2014 IBM Corporation57 Data Science is Multidisciplinary
  • 58. © 2014 IBM Corporation58 Successful Data Scientist Characteristics
  • 59. © 2014 IBM Corporation59 Data Scientist Qualities
  • 60. © 2014 IBM Corporation60 How Long Does It Take For a Beginner to Become a Good Data Scientist?
  • 61. © 2014 IBM Corporation61 www.kaggle.com
  • 62. © 2014 IBM Corporation62 Kaggle ranking
  • 63. © 2014 IBM Corporation63 © 2013 IBM Corporation63 Learn Big Data  Reading Materials - Online – Understanding Big Data – Free PDF Book • http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF – Developing, publishing, and deploying your first big data application with InfoSphere BigInsights • www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html – Implementing IBM InfoSphere BigInsights on System x - Redbook • http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html  Resources – Big Data Information Center • www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html – InfoSphere BigInsights • www-01.ibm.com/software/data/infosphere/biginsights/ – Stream Computing • www-01.ibm.com/software/data/infosphere/stream-computing/ – DeveloperWorks, forums, demos ... • http://www.ibm.com/developerworks/wiki/biginsights/
  • 64. © 2014 IBM Corporation64 Learn Big Data Technologies BigDataUniversity.com Flexible on-line delivery allows learning @your place and @your pace  Free courses, free study materials.  Cloud-based sandbox for exercises – zero setup  Robust Course Management System and Content Distribution infrastructure
  • 65. © 2014 IBM Corporation65 65

Notes de l'éditeur

  1. Sistemas transaccionales maduros, con años de historia, los datos van creciendo. De Gigabytes a Terabytes
  2. Obstancles- Latency, user concurrency, single threaded How do this without having companies hire specialists who know how to query Hadoop using Java or overcome latency. Latency: via Hcatalog – Query: Better interfaces Won’t fix things like user concurrency – THIS IS ASPIRATION BUT LOTS OF OBSTACLES PREVENTING – - Latency via batch, user concurrency cause no workload mgmt or prioritization or query optimizer Know coding
  3. Skills, or more accurately a shortage of skills, is widely recognized as the leading inhibitor to the broader acceptance of Big Data solutions in the enterprise. To address the skills shortage IBM has sponsored community-driven effort to deliver Big Data education regardless of physical location or budget. We call this @your place, @your pace education and it has turned out to be a huge success with well over 8000 registered students and over 1800 students enrolled in the Hadoop Fundamentals class alone. We have seen people gain sufficient skills in a matter of a week to enter and complete the Hadoop Programming Challenge by submitting very innovative solutions. IBM’s sponsorship provides BigDataUniversity.com participants with a comprehensive set of education materials, access to free products for hands on labs and a cloud-based course management system to make the learning process easy and fun.