Extract business value by analyzing large volumes of multi-structured data from various sources such as databases, websites, blogs, social media, smart sensors...
2. WHO AM I
• Big Data / Analytics / BI & Cloud Solutions Specialist
• http://www.linkedin.com/in/JulioPhilippe
• Skills
Architecture
Business Intelligence
IT Transformation
Cloud Computing
IT Solutions
Management
Mentoring
Big Data
Analytics
Business Development
Hadoop
Datacenter
Optimization
Data Warehousing
Big Data with Not Only SQL
3. BIG DATA MANAGEMENT INSIGHT
« Data are not born relevant;
they become so! »
4. DATA-DRIVEN ON-LINE WEBSITES
• To run the apps : messages, posts, blog
entries, video clips, maps, web graph...
• To give the data context : friends
networks, social networks, collaborative
filtering...
• To keep the applications running : web
logs, system logs, system metrics, database
query logs...
5. BIG DATA – NOT ONLY DATA VOLUME
• Improve analytics and statistics
models
• Extract business value by
analyzing large volumes of multi-structured data from various
sources such as
databases, websites, blogs, social
media, smart sensors...
• Have efficient
architectures, massively
parallel, highly scalable and
available to handle very large
data volumes up to several
petabytes
Thematics
• Web Technologies
• Database Scale-out
• Relational Data Analytics
• Distributed Data Analytics
• Distributed File Systems
• Real Time Analytics
6. BIG DATA APPLICATIONS DOMAINS
• Digital marketing optimization (e.g., web analytics, attribution, golden
path analysis)
• Data exploration and discovery (e.g., identifying new data-driven
products, new markets)
• Fraud detection and prevention (e.g., revenue protection, site integrity
& uptime)
• Social network and relationship analysis (e.g., influencer marketing,
outsourcing, attrition prediction)
• Machine-generated data analytics (e.g., remote device insight, remote
sensing, location-based intelligence)
• Data retention (e.g., long-term conservation, data archiving)
7. SOME BIG DATA USE CASES BY INDUSTRY
• Energy
– Smart meter analytics
– Distribution load forecasting & scheduling
– Condition-based maintenance
– Customer relationship management
• Telecommunications
– Network performance
– New products & services creation
– Call Detail Records (CDRs) analysis
– Customer relationship management
• Retail
– Dynamic price optimization
– Localized assortment
– Supply-chain management
– Customer relationship management
• Manufacturing
– Supply chain management
– Customer Care Call Centers
– Preventive Maintenance and Repairs
– Customer relationship management
• Banking
– Fraud detection
– Trade surveillance
– Compliance and regulatory
– Customer relationship management
• Insurance
– Catastrophe modeling
– Claims fraud
– Reputation management
– Customer relationship management
• Public
– Fraud detection
– Fighting criminality
– Threats detection
– Cyber security
• Media
– Large-scale clickstream analytics
– Abuse and click-fraud prevention
– Social graph analysis and profile segmentation
– Campaign management and loyalty programs
• Healthcare
– Clinical trials data analysis
– Patient care quality and program analysis
– Supply chain management
– Drug discovery and development analysis
8. TOP 10 BIG DATA SOURCES
1. Social network profiles
2. Social influencers
3. Activity-generated data
4. SaaS & Cloud Apps
5. Public web information
6. MapReduce results
7. Data warehouse appliances
8. Columnar/NoSQL databases
9. Network and in-stream monitoring technologies
10. Legacy documents
9. NEW DATA AND MANAGEMENT ECONOMICS
[Diagram labels]
Compute Trends, Storage Trends
New Analytics (Massively Parallel Processing, Algorithms…)
New Data Structure (Distributed File Systems, NoSQL Database, NewSQL…)
Data warehouses: Logical Data Warehouse, Enterprise data warehouse, General purpose data warehouse, Proprietary and dedicated data warehouse, OLTP is the data warehouse
Storage: Objects storage, Multi-Structured Data, Distributed File Systems
Topologies: Master/Slave, Master/Master, Federated/Sharded
Master Data Management, Data Quality, Data Integration
10. MOVING COMPUTATION TO STORAGE
General Purpose Storage Servers
• Combine server with disks & networking to reduce latency
• Specialized software enables general-purpose system designs to provide high-performance data services
Moving Data processing to Storage
[Diagram: three stacks labeled Legacy, Emerging and Next Gen., each made of Application, Data Processing, Metadata Mgmt and Storage layers; other labels: Network, Storage Array (SAN, NAS), Servers]
11. BIG DATA ARCHITECTURE
BI & DWH Architecture - Conventional
• SQL based
• High availability
• Enterprise database
• Right design for structured data
• Current storage hardware (SAN, NAS, DAS)
Analytics Architecture – Next Generation
• Not only SQL based
• High scalability, availability and flexibility
• Compute and storage in the same box for
reducing the network latency
• Right design for semi-structured and
unstructured data
[Diagram labels: App Servers, Edge Nodes, Network Switches, Database Servers, SAN Switch, Storage Array, Data Nodes]
12. DATA WAREHOUSE
• Data Warehouse appliances
– EMC Greenplum
– Microsoft Parallel Data Warehouse
– IBM Netezza
– Oracle Exadata
– SAP HANA
– ParAccel Analytic Database
– Teradata
– HP Vertica
• SQL Database
• Massively Parallel Processing
• Hadoop Connectivity
• Column-Oriented database
• In-Memory database
13. MAPREDUCE ALGORITHMS
MapReduce
• MapReduce is the programming paradigm popularized by Google researchers
• Hadoop is an open-source implementation of MapReduce, developed largely at Yahoo
• Open source software framework for distributed computation
• Parallel computation (Map) on each block (Split) of data in a DFS file, outputting a stream of (Key, Value) pairs to the local file system
• JobTracker schedules and manages jobs
• TaskTracker executes individual map() and reduce() tasks on each cluster node
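The Map, Copy/Sort and Reduce steps can be illustrated with a minimal single-process sketch. This is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the "splits" are plain strings standing in for blocks of a DFS file.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Map: emit a (key, value) pair for every word in one split of the input."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Copy/Sort: group all values that share the same key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: fold the values of one key into a single result."""
    return key, sum(values)


def word_count(splits):
    """Run the three phases in sequence over a list of input splits."""
    mapped = chain.from_iterable(map_phase(s) for s in splits)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    splits = ["big data with not only sql", "not only data volume"]
    print(word_count(splits))
```

On a real cluster each call to `map_phase` would run on the node that holds the split, which is the point of moving computation to storage.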
Algorithms
• Association Rule Learning Algorithms
• Genetic Algorithms
• Neural Network Algorithms
• Statistical Algorithms (Pandas)
• Machine Learning Algorithms (Mahout, Weka, scikit-learn)
• Natural Language Processing Algorithms
• Trading Algorithms
• Clinical design Algorithms
• Searching Algorithms (Lucene, Solr, Katta, Elasticsearch, OpenSearchServer…)
Languages
• PHP
• Erlang
• Python
• Ruby
• R
• Java
14. DISTRIBUTED FILE SYSTEMS
• Systems that store data permanently
• Divided into logical units (files, shards, chunks, blocks…)
• A file path joins file and directory names into a relative or absolute address to identify a file
• Support access to files and remote servers
• Support concurrency
• Support distribution
• Support replication
• NFS, GPFS, Hadoop HDFS, GlusterFS, MogileFS, MooseFS…
[Diagram: an App connected to one Master and three Slaves]
15. NOSQL DATABASES CATEGORIES
NoSQL = Not only SQL
• Popular name for a subset of structured storage software designed to deliver optimized, high-performance operations on large datasets
• Basically available, scalable, eventually consistent
• Easy to use
• Tolerant of scale by way of horizontal distribution
Column
BigTable (Google), HBase, Cassandra (DataStax), Hypertable…
Key-Value
Redis, Riak (Basho), CouchBase, Voldemort (LinkedIn), MemcacheDB…
Graph
Neo4j (Neo Technology), Jena, InfiniteGraph (Objectivity), FlockDB (Twitter)…
Document
MongoDB (10Gen), CouchDB, Terrastore, SimpleDB (AWS)…
16. NOSQL DATABASES CATEGORIES
Key-Value
• Store items under an alphanumeric identifier (Key)
• Associate values in simple standalone tables
• Values must be (string, list, set)
• Data search based on key
• Fast and highly scalable to retrieve a value
• Domains: managing user profiles, retrieving product name…
Column
• BigTable-style database
• Column-oriented data structure that accommodates multiple attributes per key
• Petabyte scale
• Domains: Distributed data storage, Versioning with timestamp, Sorting, Parsing, Data exploration
Document
• Documents (objects) map nicely to programming language data types
• Value = Collection > Document > Field
• Embedded documents and arrays reduce need for joins
• Dynamically-typed for easy schema evolution
• No joins and no multi-document transactions for high performance and easy scalability
Graph
• Structured relational graphs of interconnected key-value pairings
• Object-oriented network of nodes (Node), node relationships (Edge), properties (node attributes expressed as key-value pairs)
• Relation between data
• Domains: social networks, recommendations, investigations, relationships…
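Example data recovered from the slide (some cells are not recoverable):
Key-Value (Key, Value): User001, Peter; User002, Paul; User003, …
Column (Key, Timestamp, Type, Size): rows with (12, Zebra, Medium), (11, Lion, Big), (13, Bird, Small); row keys such as E2
Document (Collection: Document, Name, Age): Doc001, Paul, 30; Doc002, Jacques, 35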
NoSQL Data Modeling Techniques
Geo hashing, Index table, Composite keys aggregation, Materialized paths…
http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/
Graph example (partially recoverable): nodes X and Y with Name and Age properties (John, 30; Rick; Bob, 50 appear on the slide); edges a (X → Y) and b (Y → X)
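The four categories can be compared side by side with plain Python structures. This is only a sketch of the data shapes, not of any product's storage engine; the values are loosely borrowed from the slide, while the row keys, the collection name and the node properties are hypothetical.

```python
# Key-Value: an opaque value looked up by its key.
key_value = {"User001": "Peter", "User002": "Paul"}

# Column: each row key maps to timestamped versions of named columns.
column = {
    "row-a": {12: {"Type": "Zebra", "Size": "Medium"}},
    "row-b": {13: {"Type": "Bird", "Size": "Small"}},
}

# Document: Collection > Document > Field, with nested fields allowed.
document = {
    "people": {
        "Doc001": {"Name": "Paul", "Age": 30},
        "Doc002": {"Name": "Jacques", "Age": 35},
    }
}

# Graph: nodes carry properties, edges are named relationships between nodes.
graph = {
    "nodes": {"X": {"Name": "John", "Age": 30}, "Y": {"Name": "Bob", "Age": 50}},
    "edges": [("X", "Y", "a"), ("Y", "X", "b")],
}


def neighbors(g, node):
    """Follow outgoing edges from one node, the basic graph query."""
    return [dst for src, dst, _label in g["edges"] if src == node]


if __name__ == "__main__":
    print(key_value["User001"])
    print(column["row-a"][12]["Type"])
    print(document["people"]["Doc002"]["Age"])
    print(neighbors(graph, "X"))
```

The lookups show the trade-off: the key-value store answers only "what is under this key", while the document and graph shapes let a query reach into fields or follow relationships.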
17. NEW SQL
• Relational database with horizontal scalability
• MySQL Ecosystem
• Distributed database with MySQL compliance: Cubrid
• Analytic database: InfiniDB
• In-Memory database: VoltDB
18. BIG DATA ARCHITECTURE OVERVIEW
[Architecture diagram labels]
Users: Administrator, Engineers, Analysts, Business Users, Data Scientists, Mobile Clients
Activities: Development, Data Management, Data Modeling, BI / Analytics, Activity Reporting, Data Quality, Master Data Management, Mobile Apps
Data Analysis & Visualization
NoSQL: Unstructured and structured Data Warehouse, MPP, NoSQL Engine, Distributed File Systems, Share-Nothing Architecture, Algorithms
SQL: Structured Data Warehouse and OLAP Cubes, MPP, In-Memory, Columns Database, SQL Engine, Share-Nothing Architecture
Data Transfer, Data Integration
Data sources: Files, Web Data, RDBMS
19. HDFS & MAPREDUCE
• Hadoop Distributed File System
- A scalable, fault-tolerant, high-performance distributed file system
- Asynchronous replication
- Write-once and read-many (WORM)
- Hadoop cluster with 3 DataNodes minimum
- Data divided into blocks, each block replicated 3 times (default)
- No RAID required for DataNode
- Interfaces: Java, Thrift, C Library, FUSE, WebDAV, HTTP, FTP
- NameNode holds filesystem metadata
- Files are broken up and spread over the DataNodes
• Hadoop MapReduce
- Software framework for distributed computation
- Input | Map() | Copy/Sort | Reduce() | Output
- JobTracker schedules and manages jobs
- TaskTracker executes individual map() and reduce() tasks on each cluster node
[Diagram: Clients, Master Node, Worker Nodes]
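How a file becomes blocks and how each block gets three copies can be sketched as follows. This is an illustration under stated assumptions, not HDFS code: the function names are hypothetical, the block size is a parameter chosen by the caller, and replicas are placed round-robin, whereas a real HDFS cluster applies its own rack-aware placement policy.

```python
import math


def split_into_blocks(file_size, block_size):
    """Return the size of each block needed to hold file_size bytes."""
    count = math.ceil(file_size / block_size)
    return [min(block_size, file_size - i * block_size) for i in range(count)]


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    if len(datanodes) < replication:
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(file_size=300, block_size=128)
    nodes = ["dn1", "dn2", "dn3", "dn4"]
    for block, holders in place_replicas(len(blocks), nodes).items():
        print(block, blocks[block], holders)
```

The `ValueError` mirrors the slide's "3 DataNodes minimum": with fewer nodes than replicas, copies of a block would have to share a machine.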
20. HBASE
• Clone of Bigtable (Google)
• Implemented in Java (Clients: Java, C++, Ruby...)
• Data is stored "column-oriented"
• Distributed over many servers
• Tolerant of machine failure
• Layered over HDFS
• Strong consistency
• It's not a relational database (No joins)
• Sparse data – nulls are stored for free
• Semi-structured or unstructured data
• Data changes through time
• Versioned data
• Scalable – Goal of billions of rows x millions of columns
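Example table (diagram labels: Table, Row Key, Timestamp, Region, Family, Column, Cell; row assignment partly inferred)
Row Key | Timestamp | Animal:Type | Animal:Size | Repair:Cost
Enclosure1 | 12 | Zebra | Medium | 1000€
Enclosure1 | 11 | Lion | Big |
Enclosure2 | 13 | Monkey | Small | 1500€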
(Table, Row_Key, Family, Column, Timestamp) = Cell (Value)
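The cell-addressing rule above can be mimicked with a dictionary keyed by coordinates. This is a toy model, not the HBase client API: `SparseTable`, `put` and `get` are hypothetical names, and one instance stands for one table.

```python
class SparseTable:
    """One table: (row, family, column, timestamp) -> value."""

    def __init__(self):
        self._cells = {}

    def put(self, row, family, column, timestamp, value):
        # Only cells that were written exist, so missing values cost nothing.
        self._cells[(row, family, column, timestamp)] = value

    def get(self, row, family, column, timestamp=None):
        """Return one version of a cell, the newest one by default."""
        versions = {
            ts: value
            for (r, f, c, ts), value in self._cells.items()
            if (r, f, c) == (row, family, column)
        }
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)


if __name__ == "__main__":
    zoo = SparseTable()
    zoo.put("Enclosure1", "Animal", "Type", 11, "Lion")
    zoo.put("Enclosure1", "Animal", "Type", 12, "Zebra")
    print(zoo.get("Enclosure1", "Animal", "Type"))        # newest version
    print(zoo.get("Enclosure1", "Animal", "Type", 11))    # older version
    print(zoo.get("Enclosure1", "Repair", "Cost"))        # never written
```

Keeping the timestamp in the key is what makes the data versioned: a new write adds a cell instead of replacing the old one.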
21. HBASE
• Table
- Regions for scalability, defined by row [start-key, end-key)
- Store for efficiency, 1 per Family
- 1..n StoreFiles (HFile format on HDFS)
• Everything is bytes
• Rows are ordered sequentially by key
• Special tables -ROOT-, .META.
- Tell clients where to find user data
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
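Because rows are ordered by key and each region covers a half-open range [start-key, end-key), finding the region for a row is a binary search over start keys. The sketch below assumes a hypothetical `RegionMap` class and made-up server names; it plays the role the special tables play for clients, without reproducing their format.

```python
from bisect import bisect_right


class RegionMap:
    """Maps a row key (bytes) to the server holding its region."""

    def __init__(self, regions):
        # regions: iterable of (start_key, server); the end key of a region
        # is implicitly the start key of the next one.
        self._regions = sorted(regions)
        self._starts = [start for start, _ in self._regions]
        if not self._starts or self._starts[0] != b"":
            raise ValueError("the first region must start at the empty key")

    def locate(self, row_key):
        # Last region whose start key is <= row_key (start is inclusive).
        index = bisect_right(self._starts, row_key) - 1
        return self._regions[index][1]


if __name__ == "__main__":
    regions = RegionMap([(b"", "rs1"), (b"g", "rs2"), (b"p", "rs3")])
    for key in (b"ant", b"g", b"lion", b"zebra"):
        print(key, regions.locate(key))
```

Keys are compared as raw bytes, which matches the "everything is bytes" point: the ordering is lexicographic, not numeric.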
22. HADOOP INFRASTRUCTURE
Network Switches
2 x Apps Server
• 2 CPU 6 core
• 96 GB RAM
• 6 x HDD 600GB 15K Raid1
2 x NameNode/BackupNode/Admin
• 2 CPU 6 core
• 96 GB RAM
• 6 x HDD 600GB 15K Raid1
3 to n x DataNode
• 2 CPU 6 core
• 48 GB RAM
• 12 x HDD
23. MOGILEFS OVERVIEW
• A scalable, fault-tolerant, high-performance distributed file system
• Asynchronous Replication
• No Single Point of Failure
• Automatic file replication (3 replications recommended)
• Better than RAID
• Flat NameSpace
• Share-Nothing
• No RAID required
• Local filesystem agnostic
• Tracker (mogilefsd): client transfer, Replication, Deletion, Query, Reaper, Monitor
• DBNode: MySQL stores the MogileFS metadata (the namespace, and which files are where)
• Storage Node (mogstored): HTTP and WebDAV server; files are broken up and spread over the Storage Nodes
• Client Library: Ruby, Perl, Java, Python, PHP…
[Diagram: Clients; two Trackers, one DBNode and three Storage Nodes across Host1 to Host6]
25. MOGILEFS INFRASTRUCTURE
Network Switches
2 x Apps Server
• 2 CPU 6 core
• 48 GB RAM
• 6 x HDD 600GB 15K Raid1
2 x DB Node + 2 to n x Tracker
• 2 CPU 6 core
• 32 GB RAM
• 6 x HDD 600GB 15K Raid1
3 to n x Storage Node
• 2 CPU 6 core
• 32 GB RAM
• 12 x HDD
26. GLUSTERFS OVERVIEW
• A scalable, fault-tolerant, high-performance distributed and replicated file system
• No Single Point of Failure
• Synchronous replication of volumes across storage servers
• Asynchronous replication across geographically distributed clusters
• Easily accessible usage quotas
• No Meta-Data Server (fully distributed architecture - Elastic Hash)
• Distributed / Distributed Replicated / Distributed Striped
• POSIX compliant
• FUSE (Standard)
• GlusterFS native, NFS, CIFS, HTTP, FTP, WebDAV, ZFS, EXT4…
• No proprietary format to store files on disk
• NameSpace: the unified global namespace aggregates disk and memory resources into a single pool, virtualizing the underlying hardware
• Data Store: data is stored in logical volumes that are abstracted from the hardware and logically partitioned from each other
• Development: API, Command Line Interface, Python, Ruby, PHP languages
[Diagram: Clients connected to six GlusterFS Servers (Host1 to Host6)]
28. GLUSTERFS INFRASTRUCTURE
Network Switches
2 x Apps Server
• 2 CPU 6 core
• 48 GB RAM
• 6 x HDD 600GB 15K Raid1
2 x Backup Node / Admin
• 2 CPU 6 core
• 32 GB RAM
• 6 x HDD 600GB 15K Raid1
3 to n x GlusterFS Server
• 2 CPU 6 core
• 32 GB RAM
• 12 x HDD
29. MOOSEFS OVERVIEW
• A scalable, fault-tolerant, high-performance distributed and replicated file system
• Spread data over several physical servers, which are visible to the user as one resource
• No Single Point of Failure
• Distribution of data across data servers via chunks
• Maximum chunk size = 64MB
• File duplication (1 to 3 and more if necessary)
• POSIX compliant
• FUSE Interface
• No proprietary format to store files on disk
• Master Server: a single machine managing the whole filesystem, storing metadata for every file (information on size, attributes and file location(s)), including all information about non-regular files, i.e. directories, sockets, pipes and devices. Metadata is stored in memory
• Metalogger Server: any number of servers, all of which store metadata changelogs and periodically download the main metadata file, so that one of them can be promoted to the role of managing server when the primary master stops working
• Data Server: any number of commodity servers storing file data and synchronizing it among themselves
[Diagram: Clients; Master Server (Host1); Data Servers (Host2, Host3, Host5, Host6); Metalogger Server (Host4)]
30. MOOSEFS READ PROCESS
Read Process
1. Where is the data
2. The data is on x chunk servers
3. Send me the data
4. The Data
http://www.moosefs.org/
31. MOOSEFS WRITE PROCESS
Write Process
1. Where to write the data
2. Create new chunk on x chunk server
3. Success
4. Write the data
5. Synchronize the data
6. Success
7. Success
8. Send write session end signal
http://www.moosefs.org/
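The read and write exchanges listed on these two slides can be traced with a small in-memory simulation. It is a sketch of the message order only, not of the MooseFS protocol: `Master`, `ChunkServer`, `write_file` and `read_file` are hypothetical names, each file is assumed to fit in a single chunk, and the master picks the least-loaded chunk servers as a stand-in for its real placement logic.

```python
class ChunkServer:
    def __init__(self, name):
        self.name = name
        self.chunks = {}  # chunk id -> bytes

    def write(self, chunk_id, data, replicas=()):
        self.chunks[chunk_id] = data          # write step 4: store the data
        for peer in replicas:                 # write step 5: synchronize copies
            peer.chunks[chunk_id] = data
        return True                           # write steps 6-7: success

    def read(self, chunk_id):
        return self.chunks[chunk_id]          # read step 4: the data


class Master:
    """Holds metadata only: which chunk servers store each file."""

    def __init__(self, servers, copies=2):
        self.servers = list(servers)
        self.copies = copies
        self.locations = {}                   # path -> (chunk id, servers)
        self.open_sessions = set()

    def where_to_write(self, path):
        # Write steps 1-3: pick servers and register a new chunk on them.
        targets = sorted(self.servers, key=lambda s: len(s.chunks))[: self.copies]
        chunk_id = f"chunk-{len(self.locations)}"
        self.locations[path] = (chunk_id, targets)
        self.open_sessions.add(path)
        return chunk_id, targets

    def end_write(self, path):
        self.open_sessions.discard(path)      # write step 8: end-of-session signal

    def where_is(self, path):
        return self.locations[path]           # read steps 1-2


def write_file(master, path, data):
    chunk_id, targets = master.where_to_write(path)
    ok = targets[0].write(chunk_id, data, replicas=targets[1:])
    master.end_write(path)
    return ok


def read_file(master, path):
    chunk_id, holders = master.where_is(path)
    return holders[0].read(chunk_id)          # read step 3: ask a chunk server


if __name__ == "__main__":
    servers = [ChunkServer(f"cs{i}") for i in range(1, 4)]
    master = Master(servers)
    write_file(master, "/logs/a.txt", b"hello")
    print(read_file(master, "/logs/a.txt"))
```

Note that file contents never pass through the master: it answers "where" questions, and clients move data directly to and from the chunk servers.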
32. MOOSEFS INFRASTRUCTURE
Network Switches
2 x Apps Server
• 2 CPU 6 core
• 48 GB RAM
• 6 x HDD 600GB 15K Raid1
2 x Master/ Metalogger/ Admin Server
• 2 CPU 6 core
• 96 GB RAM
• 6 x HDD 600GB 15K Raid1
3 to n x Data Server
• 2 CPU 6 core
• 32 GB RAM
• 12 x HDD
33. CASSANDRA OVERVIEW
• Every node plays the same role
• Highly Available
• Really fast reads, really fast writes
• Flexible schemas
• Distributed, Replicated
• No Master, no Slaves
• No Single Point of Failure
• Client can talk to any node
• Written in Java
[Diagram labels: Cassandra API, Tools, Storage Layer, Partitioner, Replicator, Failure Detector, Cluster Membership, Messaging Layer]
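With no master, every node has to be able to compute where a key lives. A common way to do that is a hash ring, sketched below as an illustration of the partitioner and replicator roles. The `Ring` class and `token` function are hypothetical, the MD5-based token is an arbitrary choice for the example, and real Cassandra partitioners and replication strategies differ in detail.

```python
import hashlib
from bisect import bisect_right


def token(value):
    """Hash a string to a position on the ring (illustrative choice of hash)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, replication=3):
        self.replication = min(replication, len(nodes))
        # Each node owns the arc of the ring that ends at its token.
        self._ring = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key):
        """Partitioner: find the owning node. Replicator: walk clockwise for copies."""
        start = bisect_right(self._tokens, token(key)) % len(self._ring)
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.replication)
        ]


if __name__ == "__main__":
    ring = Ring(["node1", "node2", "node3", "node4"])
    print(ring.replicas("user:42"))
```

Since the mapping is a pure function of the key and the node list, a client can contact any node and that node can forward the request to the right replicas.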
36. MONGODB OVERVIEW
• Document-oriented database: high performance, scalability and availability
• Supports MapReduce
• Shard: holds a portion of the total data. Reads and writes are automatically routed to the appropriate shard(s). Each shard is backed by a replica set, which holds only the data for that shard
• Replica set: one or more servers, each holding copies of the same data. At any given time one is primary and the rest are secondaries. If the primary goes down, one of the secondaries takes over automatically as primary. All writes and consistent reads go to the primary, and all eventually consistent reads are distributed amongst the secondaries. Replication within a replica set is asynchronous
• Config: multiple config servers, each holding a copy of the metadata indicating which data lives on which shard
• Router: one or more routers, each acting as a server for one or more clients. Clients issue queries/updates to a router, and the router routes them to the appropriate shard while consulting the config servers
• Client: one or more clients, each of which is (part of) the user's application and issues commands to a router via the mongo client library (driver) for its language
[Diagram: Clients; Router servers (mongos); Config servers (mongod); Shard servers (mongod)]
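The division of labour between router, config servers and shards can be sketched in a few lines. This is a toy model, not the MongoDB driver or the mongos implementation: the classes `ConfigServer`, `Shard` and `Router` are hypothetical, and range-based partitioning on a string shard key is assumed for the example.

```python
from bisect import bisect_right


class ConfigServer:
    """Metadata only: which shard owns which range of shard-key values."""

    def __init__(self, ranges):
        # ranges: (inclusive lower bound, shard name); a range ends where the next starts.
        self._ranges = sorted(ranges)
        self._bounds = [bound for bound, _ in self._ranges]

    def shard_for(self, key):
        index = bisect_right(self._bounds, key) - 1
        if index < 0:
            raise KeyError(f"no shard covers key {key!r}")
        return self._ranges[index][1]


class Shard:
    def __init__(self, name):
        self.name = name
        self.docs = {}


class Router:
    """Clients talk to the router; it consults the config metadata per request."""

    def __init__(self, config, shards):
        self.config = config
        self.shards = {shard.name: shard for shard in shards}

    def insert(self, key, doc):
        shard = self.shards[self.config.shard_for(key)]
        shard.docs[key] = doc
        return shard.name

    def find(self, key):
        return self.shards[self.config.shard_for(key)].docs.get(key)


if __name__ == "__main__":
    config = ConfigServer([("", "shardA"), ("m", "shardB")])
    router = Router(config, [Shard("shardA"), Shard("shardB")])
    print(router.insert("alice", {"age": 30}))
    print(router.insert("zoe", {"age": 35}))
    print(router.find("zoe"))
```

The client never needs to know how many shards exist; adding a shard changes only the config metadata.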
38. MONGODB INFRASTRUCTURE
Network Switches
1 to n Router server
2 CPU 6 core
96 GB RAM
6 x HDD 600GB 15K Raid10
1 to n Config servers
2 CPU 6 core
96 GB RAM
6 x HDD 600GB 15K Raid10
1 to n Shard servers
2 CPU 6 core
48 GB RAM
12 x HDD 1TB 7.2K
39. COUCHDB OVERVIEW
• Open Source Distributed Database
• RESTful API
• Schema-less document store (documents in JSON format)
• Multi-Version Concurrency Control (MVCC) model
• User-defined queries structured as map/reduce
• Incremental Index Update mechanism
• Multi-Master Replication model
• Written in Erlang
• Supports MapReduce
• Easy to use data storage
• Easy to integrate with web applications: JavaScript, JSON
• Scalability for large web applications: incremental replication, bi-directional conflict detection and management
• Query-able and index-able
• Offline by default
• Master → Slave replication
• Master ↔ Master replication
• Filtered Replication
• Incremental and bi-directional replication
• Conflict management
[Diagram: Clients; two groups of CouchDB Servers, each with one Master and two Slaves]
40. COUCHDB FUNCTIONALITIES
• Document storage
– CouchDB server hosts named databases, which store documents
• ACID Properties
– CouchDB never overwrites committed data or associated structures, ensuring the database file is always in a consistent
state
• Compaction
– On schedule, or when the database file exceeds a certain amount of wasted space, the compaction process clones all the
active data to a new file and then discards the old file
• Views (Model, Function, Index)
– View model is the method of aggregating and reporting on the documents in a database; views are built on demand to
aggregate, join and report on database documents
– View function takes a CouchDB document as an argument and then does whatever computation it needs to do to
determine the data that is to be made available through the view, if any. It can add multiple rows to the view based on a
single document, or it can add no rows at all
– View index is a dynamic representation of the actual document contents of a database, and CouchDB makes it easy to
create useful views of data. However, generating a view of a database with hundreds of thousands or millions of documents is
time- and resource-consuming, so it is not something the system should do from scratch each time
• Security
– To control who can read and update documents, CouchDB has a simple reader access and update validation model that can
be extended to implement custom security models
• Distributed update and replication
– CouchDB is a peer-based distributed database system. It allows users and servers to access and update the same shared
data while disconnected and then bi-directionally replicate those changes later
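The view function described under Views, which may add several rows or none for a single document, can be imitated in Python. Real CouchDB views are JavaScript functions that call emit(); the sketch below only mirrors that behaviour, and `by_tag`, `build_view`, `reduce_count` and the sample documents are hypothetical.

```python
def by_tag(doc):
    """View (map) function: yield zero or more (key, value) rows for one document."""
    for tag in doc.get("tags", []):
        yield tag, 1


def build_view(docs, map_fn):
    """Apply the view function to every document and sort the rows by key."""
    rows = []
    for doc_id, doc in docs.items():
        for key, value in map_fn(doc):
            rows.append((key, doc_id, value))
    return sorted(rows)


def reduce_count(rows):
    """Reduce step: total the emitted values per key."""
    totals = {}
    for key, _doc_id, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


if __name__ == "__main__":
    docs = {
        "d1": {"tags": ["nosql", "couchdb"]},
        "d2": {"tags": ["nosql"]},
        "d3": {"title": "no tags here"},
    }
    rows = build_view(docs, by_tag)
    print(rows)
    print(reduce_count(rows))
```

This version rebuilds every row on each call, which is exactly the from-scratch cost the view index is meant to avoid by updating incrementally.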
41. COUCHDB INFRASTRUCTURE
Network Switches
1 to n Router server
2 CPU 6 core
96 GB RAM
6 x HDD 600GB 15K Raid10
1 to n Master servers
2 CPU 6 core
96 GB RAM
6 x HDD 600GB 15K Raid10
1 to n Slave servers
2 CPU 6 core
48 GB RAM
12 x HDD 1TB 7.2K