Big Data with Not Only SQL

WHO AM I
• Big Data / Analytics / BI & Cloud Solutions Specialist

• http://www.linkedin.com/in/JulioPhilippe
• Skills: Architecture, Business Intelligence, IT Transformation, Cloud Computing, IT Solutions, Management, Mentoring, Big Data, Analytics, Business Development, Hadoop, Datacenter, Optimization, Data Warehousing

BIG DATA MANAGEMENT INSIGHT

« Data are not born relevant;
they become so! »



DATA-DRIVEN ON-LINE WEBSITES
• To run the apps : messages, posts, blog
entries, video clips, maps, web graph...

• To give the data context : friends
networks, social networks, collaborative
filtering...
• To keep the applications running : web
logs, system logs, system metrics, database
query logs...



BIG DATA – NOT ONLY DATA VOLUME
• Improve analytics and statistical models
• Extract business value by analyzing large volumes of multi-structured data from various sources such as databases, websites, blogs, social media, smart sensors...
• Build efficient architectures that are massively parallel, highly scalable, and highly available, able to handle very large data volumes of up to several petabytes

Thematics
• Web Technologies
• Database Scale-out
• Relational Data Analytics
• Distributed Data Analytics
• Distributed File Systems
• Real Time Analytics


BIG DATA APPLICATIONS DOMAINS
• Digital marketing optimization (e.g., web analytics, attribution, golden
path analysis)
• Data exploration and discovery (e.g., identifying new data-driven
products, new markets)
• Fraud detection and prevention (e.g., revenue protection, site integrity
& uptime)
• Social network and relationship analysis (e.g., influencer marketing,
outsourcing, attrition prediction)
• Machine-generated data analytics (e.g., remote device insight, remote
sensing, location-based intelligence)
• Data retention (e.g., long-term conservation, data archiving)



SOME BIG DATA USE CASES BY INDUSTRY
Energy
• Smart meter analytics
• Distribution load forecasting & scheduling
• Condition-based maintenance

Telecommunications
• Network performance
• New products & services creation
• Call Detail Records (CDRs) analysis
• Customer relationship management

Retail
• Dynamic price optimization
• Localized assortment
• Supply-chain management
• Customer relationship management

Manufacturing
• Supply chain management
• Customer Care Call Centers
• Preventive Maintenance and Repairs

Banking
• Fraud detection
• Trade surveillance
• Compliance and regulatory

Insurance
• Catastrophe modeling
• Claims fraud
• Reputation management

Public
• Fraud detection
• Fighting criminality
• Threats detection
• Cyber security

Media
• Large-scale clickstream analytics
• Abuse and click-fraud prevention
• Social graph analysis and profile segmentation
• Campaign management and loyalty programs

Healthcare
• Clinical trials data analysis
• Patient care quality and program analysis
• Supply chain management
• Drug discovery and development analysis


TOP 10 BIG DATA SOURCES
1. Social network profiles
2. Social influencers
3. Activity-generated data
4. SaaS & Cloud Apps
5. Public web information
6. MapReduce results
7. Data warehouse appliances
8. Columnar/NoSQL databases
9. Network and in-stream monitoring technologies

10. Legacy documents



NEW DATA AND MANAGEMENT ECONOMICS
• Compute Trends: New Analytics (Massively Parallel Processing, Algorithms…)
• Storage Trends: New Data Structure (Distributed File Systems, NoSQL Database, NewSQL…)

[Figure: evolution of data warehouse and storage models.
Data warehouse labels: OLTP is the data warehouse; Proprietary and dedicated data warehouse; General purpose data warehouse; Enterprise data warehouse; Logical Data Warehouse.
Storage labels: Master/Slave; Master/Master; Federated/Sharded; Distributed File Systems; Objects storage; Multi-Structured Data.
Common base: Master Data Management, Data Quality, Data Integration]

MOVING COMPUTATION TO STORAGE
General Purpose Storage Servers
• Combine server with disks & networking to reduce latency
• Specialized software enables general-purpose system designs to provide high-performance data services

Moving Data Processing to Storage
[Figure: three stacks labeled Legacy, Emerging and Next Gen. Each stack shows the layers Application, Data Processing, Metadata Mgmt and Storage; other labels in the figure: Network, Storage Array (SAN, NAS), Servers]

BIG DATA ARCHITECTURE
BI & DWH Architecture - Conventional
• SQL based
• High availability
• Enterprise database
• Right design for structured data
• Current storage hardware (SAN, NAS, DAS)

Analytics Architecture – Next Generation
• Not only SQL based
• High scalability, availability and flexibility
• Compute and storage in the same box to
reduce network latency
• Right design for semi-structured and
unstructured data

[Figure: conventional stack (App Servers, Network Switches, Database Servers, SAN Switch, Storage Array) next to the next-generation stack (Edge Nodes, Network Switches, Data Nodes)]


DATA WAREHOUSE

• Data Warehouse appliances
– EMC Greenplum
– Microsoft Parallel Data
Warehouse
– IBM Netezza
– Oracle Exadata
– SAP HANA
– ParAccel Analytic Database
– Teradata
– HP Vertica


• SQL Database

• Massively Parallel Processing
• Hadoop Connectivity
• Column-Oriented database
• In-Memory database


MAPREDUCE ALGORITHMS
MapReduce
• MapReduce is the programming paradigm popularized by Google researchers
• Hadoop is an open-source implementation of MapReduce, developed largely at Yahoo
• Open-source software framework for distributed computation
• Parallel computation (Map) on each block (Split) of data in a DFS file, outputting a stream of (Key, Value) pairs to the local file system
• JobTracker schedules and manages jobs
• TaskTracker executes individual map() and reduce() tasks on each cluster node
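
The Map, Copy/Sort and Reduce steps listed above can be sketched in a single process. This is a minimal illustration in plain Python, not Hadoop code: the function names and the two sample splits are assumptions made for the example, and a real cluster would run each map call as a separate task.

```python
from collections import defaultdict
from itertools import chain

def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]

def shuffle(pairs):
    """Copy/Sort: group all values by key, sorted by key, between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into a single result."""
    return key, sum(values)

def word_count(splits):
    # One map call per split; on a cluster these would run in parallel.
    mapped = chain.from_iterable(map_phase(s) for s in splits)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

# Hypothetical input: two splits of one file.
splits = ["big data big", "data with not only sql"]
counts = word_count(splits)
```

Because every value for a key reaches the same reduce call, the reduce step needs no coordination with other keys, which is what makes the model easy to distribute.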


Algorithms
• Association Rule Learning
Algorithms
• Genetic Algorithms
• Neural Network Algorithms
• Statistical Algorithms (pandas)
• Machine Learning Algorithms
(Mahout, Weka, scikit-learn)
• Natural Language Processing
Algorithms
• Trading Algorithms
• Clinical design Algorithms
• Searching Algorithms (Lucene,
Solr, Katta, Elasticsearch,
OpenSearchServer…)


Languages
• PHP
• Erlang
• Python
• Ruby
• R
• Java

DISTRIBUTED FILE SYSTEMS
• Systems that store data persistently
• Data is divided into logical units (files, shards, chunks, blocks…)
• A file path joins file and directory names into a relative or absolute address that identifies a file
• Support access to files on remote servers
• Support concurrency
• Support distribution
• Support replication
• Examples: NFS, GPFS, Hadoop HDFS, GlusterFS, MogileFS, MooseFS…

[Figure: an App, one Master and three Slaves]

NOSQL DATABASES CATEGORIES
NoSQL = Not only SQL
• Popular name for a subset of structured storage software that is designed with the intention of delivering increased optimization for high-performance operations on large datasets
• Basically available, scalable, eventually consistent
• Easy to use
• Tolerant of scale by way of horizontal distribution

Column
BigTable (Google), HBase, Cassandra (DataStax), Hypertable…

Key-Value
Redis, Riak (Basho), CouchBase, Voldemort (LinkedIn), MemcacheDB…

Graph
Neo4j (Neo Technology), Jena, InfiniteGraph (Objectivity), FlockDB (Twitter)…

Document
MongoDB (10Gen), CouchDB, Terrastore, SimpleDB (AWS)…

NOSQL DATABASES CATEGORIES
Key-Value
• Store items as alphanumeric identifier (Key)
• Associate values in simple standalone tables
• Values must be (string, list, set)
• Data search based on key
• Fast and highly scalable to retrieve a value
• Domains: managing user profiles, retrieving product name…

Example (Collection):
Key | Value
User001 | Peter
User002 | Paul
User003 | Rick

Column
• BigTable-style database
• Column-oriented data structure that accommodates multiple attributes per key
• Petabyte scale
• Domains: Distributed data storage, Versioning with timestamp, Sorting, Parsing
• Data exploration

Example (columns stored per Key):
Timestamp | Type | Size
12 | Zebra | Medium
11 | Lion | Big
13 | Bird | Small

Document
• Documents (objects) map nicely to programming language data types
• Value = Collection > Document > Field
• Embedded documents and arrays reduce need for joins
• Dynamically typed for easy schema evolution
• No joins and no multi-document transactions for high performance and easy scalability

Example:
Document | Name | Age
Doc001 | Paul | 30
Doc002 | Jacques | 35

Graph
• Structured relational graphs of interconnected key-value pairings
• Object-oriented network of nodes (Node), node relationships (Edge), properties (node attributes expressed as key-value pairs)
• Relation between data
• Domains: social networks, recommendations, investigations, relationships…

Example:
Node | Name | Age
X | John | 30
Y | Bob | 50

Edge | a | b
E1 | X | Y
E2 | Y | X

NoSQL Data Modeling Techniques
Geo hashing, Index table, Composite keys aggregation, Materialized paths…
http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/
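
The four categories can be compared by writing small records in each shape. The sketch below uses plain Python containers as stand-ins for the stores; the row key "row1", the sample values and the helper `neighbours` are hypothetical and only illustrate the shape of the data, not any product's API.

```python
# Key-value: an opaque value looked up by its key.
key_value = {"User001": "Peter", "User002": "Paul"}

# Column (BigTable style): row key -> {column name: value}, several attributes per key.
column = {"row1": {"Type": "Zebra", "Size": "Medium"}}

# Document: collection -> document -> fields; embedded arrays avoid joins.
documents = {"Doc001": {"Name": "Paul", "Age": 30, "Phones": ["home", "work"]}}

# Graph: nodes carry properties as key-value pairs, edges connect node ids.
nodes = {"X": {"Name": "John", "Age": 30}, "Y": {"Name": "Bob", "Age": 50}}
edges = [("E1", "X", "Y"), ("E2", "Y", "X")]

def neighbours(node_id):
    """Follow outgoing edges from one node: the basic graph traversal step."""
    return [target for _, source, target in edges if source == node_id]

# Each model answers a different kind of question cheaply.
name_by_key = key_value["User001"]            # lookup by key only
size_of_row = column["row1"]["Size"]          # one attribute of one row
first_phone = documents["Doc001"]["Phones"][0]  # nested field, no join
friends_of_x = neighbours("X")                # relationship traversal
```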

NEW SQL
• Relational database with horizontal scalability
• MySQL Ecosystem

• Distributed database with MySQL compliance: Cubrid
• Analytic database: InfiniDB
• In-Memory database with MySQL compliance: VoltDB



BIG DATA ARCHITECTURE OVERVIEW
[Figure: layered architecture, top to bottom.
Users: Administrator, Engineers, Analysts, Business Users, Data Scientists, Mobile Clients
Activities: Development, Data Management, Data Modeling, BI / Analytics, Activity Reporting, Data Quality, Master Data Management, Mobile Apps
Data Analysis & Visualization
NoSQL: Unstructured and structured Data Warehouse, MPP, NoSQL Engine, Distributed File Systems, Shared-Nothing Architecture, Algorithms
SQL: Structured Data Warehouse and OLAP Cubes, MPP, In-Memory, Column Database, SQL Engine, Shared-Nothing Architecture
Data Transfer, Data Integration
Data sources: Files, Web Data, RDBMS]

HDFS & MAPREDUCE
• Hadoop Distributed File System
- A scalable, fault-tolerant, high-performance distributed file system
- Asynchronous replication
- Write-once and read-many (WORM)
- Hadoop cluster with 3 DataNodes minimum
- Data divided into blocks, each block replicated 3 times (default)
- No RAID required for DataNode
- Interfaces: Java, Thrift, C Library, FUSE, WebDAV, HTTP, FTP
- NameNode holds filesystem metadata
- Files are broken up and spread over the DataNodes

• Hadoop MapReduce
- Software framework for distributed computation
- Input | Map() | Copy/Sort | Reduce() | Output
- JobTracker schedules and manages jobs
- TaskTracker executes individual map() and reduce() tasks on each cluster node

[Figure: Clients, Master Node, Worker Nodes]
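
The block and replication bullets above explain why a cluster needs at least three DataNodes: each block must land on three different machines. The sketch below illustrates this with a naive round-robin placement. The 64 MB block size and the node names are assumptions for the example, and the real HDFS placement policy is more elaborate than round-robin.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size; configurable in a real cluster
REPLICATION = 3                # default replication factor quoted on the slide

def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [datanode, ...]} using naive round-robin placement."""
    if len(datanodes) < replication:
        raise ValueError("need at least as many DataNodes as replicas")
    n_blocks = max(1, math.ceil(file_size / block_size))
    placement = {}
    for block in range(n_blocks):
        # Consecutive nodes, so the replicas of one block never share a DataNode.
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement

# Hypothetical 200 MB file on a four-node cluster.
layout = place_blocks(200 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
```

The mapping returned here plays the role of the metadata the NameNode holds; the DataNodes themselves only store block contents.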

HBASE
• Clone of BigTable (Google)
• Implemented in Java (clients: Java, C++, Ruby...)
• Data is stored “column-oriented”
• Distributed over many servers
• Tolerant of machine failure
• Layered over HDFS
• Strong consistency
• It's not a relational database (no joins)
• Sparse data: nulls are stored for free
• Semi-structured or unstructured data
• Data changes through time
• Versioned data
• Scalable: goal of billions of rows x millions of columns

Example table (callouts in the figure: Table, Row, Region, Family, Column, Cell)
Row keys: Enclosure1, Enclosure2
Families: Animal (columns Type, Size), Repair (column Cost)
Timestamp | Animal:Type | Animal:Size | Repair:Cost
12 | Zebra | Medium | 1000€
11 | Lion | Big |
13 | Monkey | Small | 1500€

(Table, Row_Key, Family, Column, Timestamp) = Cell (Value)
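
The cell address above can be read as a nested lookup with versions. The class below is a toy in-memory sketch of that addressing scheme, not the HBase client API; `TinyTable` and the sample values are hypothetical.

```python
from collections import defaultdict

class TinyTable:
    """Toy versioned cell store: (row_key, family, column, timestamp) -> value."""

    def __init__(self):
        # (row_key, family, column) -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, row_key, family, column, timestamp, value):
        self._cells[(row_key, family, column)][timestamp] = value

    def get(self, row_key, family, column, timestamp=None):
        """Return the newest version at or before timestamp (latest if omitted)."""
        versions = self._cells.get((row_key, family, column), {})
        if timestamp is not None:
            versions = {t: v for t, v in versions.items() if t <= timestamp}
        if not versions:
            return None  # a missing cell is simply absent: sparse data costs nothing
        return versions[max(versions)]

# Illustrative values only.
zoo = TinyTable()
zoo.put("Enclosure1", "Animal", "Type", 11, "Lion")
zoo.put("Enclosure1", "Animal", "Type", 12, "Zebra")
```

Reading without a timestamp returns the newest version, while passing an older timestamp shows how versioned data lets a reader see the cell as it was at that time.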



HBASE
• Table
- Regions for scalability, defined by row [start-key, end-key)
- Store for efficiency, 1 per Family
- 1..n StoreFiles (HFile format on HDFS)
• Everything is bytes
• Rows are ordered sequentially by key
• Special tables -ROOT- , .META.
- Tell clients where to find user data

http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
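
Because rows are ordered by key and each region covers a half-open range [start-key, end-key), finding the region for a row is a binary search over the region start keys. The sketch below assumes four hypothetical regions and server names; it stands in for the lookup that the special tables make possible and is not HBase code.

```python
from bisect import bisect_right

# Hypothetical region boundaries: region i covers [REGION_STARTS[i], REGION_STARTS[i + 1]).
REGION_STARTS = [b"", b"g", b"n", b"t"]        # b"" is the start of the first region
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs4"]  # hypothetical server names

def locate_region(row_key: bytes) -> str:
    """Return the server whose half-open key range contains row_key."""
    index = bisect_right(REGION_STARTS, row_key) - 1
    return REGION_SERVERS[index]

# Keys are compared as raw bytes, matching the "everything is bytes" rule.
server_for_apple = locate_region(b"apple")
server_for_g = locate_region(b"g")   # a start key belongs to the region it opens
server_for_zebra = locate_region(b"zebra")
```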



HADOOP INFRASTRUCTURE

Network Switches

2 x Apps Server
• 2 CPU 6 core
• 96 GB RAM
• 6 x HDD 600GB 15K Raid1

2 x NameNode/BackupNode/Admin
• 2 CPU 6 core
• 96 GB RAM

3 to n x DataNode
• 2 CPU 6 core
• 48 GB RAM
• 12 x HDD

MOGILEFS OVERVIEW
• A scalable, fault-tolerant, high-performance distributed file system
• Asynchronous replication
• No single point of failure
• Automatic file replication (3 replications recommended)
• Better than RAID
• Flat namespace
• Shared-nothing
• No RAID required
• Local filesystem agnostic
• Tracker (mogilefsd): client transfer, replication, deletion, query, reaper, monitor
• DBNode: MySQL stores the MogileFS metadata (the namespace, and which files are where)
• Storage Node (mogstored): HTTP and WebDAV server; files are broken up and spread over the storage nodes
• Client library: Ruby, Perl, Java, Python, PHP…

[Figure: Clients connected to six hosts (Host1 to Host6) running two Trackers, three Storage Nodes and one DBNode]

MOGILEFS ARCHITECTURE
[Figure: Client Library, two Trackers, Database, two Storage Nodes]


MOGILEFS INFRASTRUCTURE

Network Switches

2 x Apps Server
• 2 CPU 6 core
• 48 GB RAM

2 x DB Node + 2 to n x Tracker
• 2 CPU 6 core
• 32 GB RAM

3 to n x Storage Node
• 2 CPU 6 core
• 32 GB RAM
• 12 x HDD

GLUSTERFS OVERVIEW
• A scalable, fault-tolerant, high-performance distributed and replicated file system
• Synchronous replication of volumes across storage servers
• Asynchronous replication across geographically distributed clusters
• Easily accessible usage quotas
• No metadata server (fully distributed architecture, Elastic Hash)
• Distributed / Distributed Replicated / Distributed Striped volumes
• POSIX compliant
• FUSE (standard)
• GlusterFS native, NFS, CIFS, HTTP, FTP, WebDAV, ZFS, EXT4…
• No proprietary format to store files on disk
• Namespace: the unified global namespace aggregates disk and memory resources into a single pool, virtualizing the underlying hardware
• Data store: data is stored in logical volumes that are abstracted from the hardware and logically partitioned from each other
• Development: API, command-line interface, Python, Ruby, PHP languages

[Figure: Clients connected to six GlusterFS Servers (Host1 to Host6)]
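
The "no metadata server" point means a client can work out where a file lives from its path alone. The sketch below shows the general idea with a plain hash-and-modulo rule. It is a deliberate simplification and not the GlusterFS elastic hash: the brick names are hypothetical, and a modulo scheme would move most files whenever a brick is added or removed, which a production algorithm has to avoid.

```python
import hashlib

# Hypothetical brick names, one per storage server.
BRICKS = ["host1:/brick", "host2:/brick", "host3:/brick"]

def brick_for(path: str, bricks=BRICKS) -> str:
    """Pick a brick from the file path alone, so no central lookup is needed."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return bricks[int.from_bytes(digest[:4], "big") % len(bricks)]

# Every client computes the same answer independently.
first_lookup = brick_for("/videos/clip.mp4")
second_lookup = brick_for("/videos/clip.mp4")
```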

GLUSTERFS ARCHITECTURE



GLUSTERFS INFRASTRUCTURE

Network Switches

2 x Apps Server
• 2 CPU 6 core
• 48 GB RAM

2 x Backup Node / Admin
• 2 CPU 6 core
• 32 GB RAM

3 to n x GlusterFS Server
• 2 CPU 6 core
• 32 GB RAM
• 12 x HDD

MOOSEFS OVERVIEW
• A scalable, fault-tolerant, high-performance distributed and replicated file system
• Spreads data over several physical servers, which are visible to the user as one resource
• Distribution of data across data servers via chunks
• Maximum chunk size = 64 MB
• File duplication (1 to 3 copies, and more if necessary)
• POSIX compliant
• FUSE interface
• No proprietary format to store files on disk
• Master Server: a single machine managing the whole filesystem, storing metadata for every file (information on size, attributes and file location(s), including all information about non-regular files, i.e. directories, sockets, pipes and devices). Metadata is stored in memory
• Metalogger Server: any number of servers, all of which store metadata changelogs and periodically download the main metadata file, so that one of them can be promoted to the role of managing server when the primary master stops working
• Data Server: any number of commodity servers storing file data and synchronizing it among themselves

[Figure: Clients connected to six hosts (Host1 to Host6) running one Master Server, one Metalogger Server and four Data Servers]

MOOSEFS READ PROCESS

Read Process
1. Where is the data?
2. The data is on x chunk servers
3. Send me the data
4. The data

http://www.moosefs.org/


MOOSEFS WRITE PROCESS

Write Process
1. Where to write the data?
2. Create new chunk on x chunk servers
3. Success
4. Write the data
5. Synchronize the data
6. Success
7. Success
8. Send write session end signal

http://www.moosefs.org/
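
The read and write exchanges can be sketched as a metadata lookup on the master followed by data transfer with the chunk servers. This is a single-process toy, not MooseFS code: the class names, the two-copy default and the round-robin choice of servers are assumptions, and here the client loop copies each chunk itself, whereas in the steps above the chunk servers synchronize among themselves.

```python
CHUNK_SIZE = 64 * 2**20  # maximum chunk size quoted in the overview

class ChunkServer:
    def __init__(self, name):
        self.name = name
        self.chunks = {}  # chunk_id -> bytes

class Master:
    """Holds only metadata: which chunks make up a file and which servers hold them."""

    def __init__(self, servers, copies=2):
        self.servers = servers
        self.copies = copies
        self.files = {}  # path -> [(chunk_id, [ChunkServer, ...]), ...]

    def allocate(self, path, n_chunks):
        """Write steps 1 to 3: decide where each new chunk will be created."""
        layout = []
        for i in range(n_chunks):
            holders = [self.servers[(i + c) % len(self.servers)] for c in range(self.copies)]
            layout.append((f"{path}#{i}", holders))
        self.files[path] = layout
        return layout

    def locate(self, path):
        """Read steps 1 and 2: tell the client which servers hold the data."""
        return self.files[path]

def write(master, path, data, chunk_size=CHUNK_SIZE):
    n_chunks = max(1, -(-len(data) // chunk_size))  # ceiling division
    for i, (chunk_id, holders) in enumerate(master.allocate(path, n_chunks)):
        piece = data[i * chunk_size:(i + 1) * chunk_size]
        for server in holders:  # write steps 4 and 5, simplified
            server.chunks[chunk_id] = piece

def read(master, path):
    """Read steps 3 and 4: fetch every chunk from one of its holders."""
    return b"".join(holders[0].chunks[cid] for cid, holders in master.locate(path))

servers = [ChunkServer(f"cs{i}") for i in range(3)]
master = Master(servers)
write(master, "/logs/a", b"abcdefghij", chunk_size=4)  # tiny chunk size for the demo
```

Note that file contents never pass through the master, which is why it can keep all metadata in memory.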


MOOSEFS INFRASTRUCTURE

Network Switches

2 x Apps Server
• 2 CPU 6 core
• 48 GB RAM

2 x Master / Metalogger / Admin Server
• 2 CPU 6 core
• 96 GB RAM

3 to n x Data Server
• 2 CPU 6 core
• 32 GB RAM
• 12 x HDD

CASSANDRA OVERVIEW
• Every node plays the same role
• Highly available
• Really fast reads, really fast writes
• Flexible schemas
• Distributed, replicated
• No master, no slaves
• No single point of failure
• Client can talk to any node
• Written in Java

[Figure: Cassandra architecture blocks: Cassandra API, Tools, Storage Layer, Partitioner, Replicator, Failure Detector, Cluster Membership, Messaging Layer]


CASSANDRA – COLUMN-ORIENTED
Column Family
• Think of it as a DB table
Column
• Key-value pair (not just a value, like a DB column)
• Timestamp
SuperColumn
• Columns inside a column
• The values are columns
• No timestamp
Keyspace – like a namespace, generally 1 per app
Indexes
Queries

[Figure: a Key pointing to a SuperColumn that contains several Columns; each Column has +Name, +Value, +Timestamp]
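
The nesting described above (keyspace, column family, row key, column) can be sketched with a dataclass and dictionaries. This is an illustration of the data model only, not the Cassandra API; the `Users` family, the row key and the e-mail values are hypothetical. The timestamp is what lets a store like this decide which of two writes to the same column is newer.

```python
from dataclasses import dataclass

@dataclass
class Column:
    name: str
    value: str
    timestamp: int  # supplied by the writer

# keyspace -> column family -> row key -> column name -> Column
# (a SuperColumn would add one more level: super column name -> columns)
keyspace = {"Users": {}}

def insert(family, row_key, column):
    """Write one column; for the same name, the newest timestamp wins."""
    row = keyspace[family].setdefault(row_key, {})
    current = row.get(column.name)
    if current is None or column.timestamp >= current.timestamp:
        row[column.name] = column

insert("Users", "jdoe", Column("email", "old@example.com", timestamp=1))
insert("Users", "jdoe", Column("email", "new@example.com", timestamp=2))
insert("Users", "jdoe", Column("email", "stale@example.com", timestamp=1))  # ignored
insert("Users", "jdoe", Column("city", "Paris", timestamp=3))  # rows need no fixed schema
```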


CASSANDRA INFRASTRUCTURE

Network Switches

Cassandra Nodes
• 2 CPU 6 core
• 32 GB RAM
• 12 x HDD Raid0


MONGODB OVERVIEW
• Document-oriented database: high performance, scalability and availability
• Supports MapReduce
• Shard: holds a portion of the total data. Reads and writes are automatically routed to the appropriate shard(s). Each shard is backed by a replica set, which holds just the data for that shard
• Replica set: one or more servers, each holding copies of the same data. At any given time one is primary and the rest are secondaries. If the primary goes down, one of the secondaries takes over automatically as primary. All writes and consistent reads go to the primary, and all eventually consistent reads are distributed amongst the secondaries. A replica set is an asynchronous cluster replication technology
• Config: multiple config servers, each holding a copy of the metadata indicating which data lives on which shard
• Router: one or more routers, each acting as a server for one or more clients. Clients issue queries/updates to a router, and the router routes them to the appropriate shard while consulting the config servers
• Client: one or more clients, each being (part of) the user's application and issuing commands to a router via the mongo client library (driver) for its language

[Figure: Clients connected to Router servers (mongos), Config servers (mongod) and Shard servers (mongod)]
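
The roles above (router, config metadata, shard, replica set) can be sketched in a few lines. This is a toy model, not MongoDB or its driver: the class names, the range boundaries and the member names are hypothetical, and promotion of the next member in a list stands in for the real election between secondaries.

```python
class ReplicaSet:
    """One shard: several members holding the same data, the first being primary."""

    def __init__(self, members):
        self.members = list(members)
        self.data = {}  # shared here for brevity; real members each hold a copy

    @property
    def primary(self):
        return self.members[0]

    def fail_primary(self):
        """Simulate a primary failure; the next secondary takes over."""
        self.members.pop(0)

class Router:
    """Plays the role of mongos: picks the shard for each key."""

    def __init__(self, chunk_map):
        # Stand-in for config-server metadata: sorted (exclusive upper bound, shard);
        # an upper bound of None means "everything else".
        self.chunk_map = chunk_map

    def shard_for(self, key):
        for upper, shard in self.chunk_map:
            if upper is None or key < upper:
                return shard
        raise KeyError(key)

    def write(self, key, doc):
        self.shard_for(key).data[key] = doc

    def read(self, key):
        return self.shard_for(key).data.get(key)

shard_a = ReplicaSet(["a1", "a2", "a3"])
shard_b = ReplicaSet(["b1", "b2", "b3"])
router = Router([("n", shard_a), (None, shard_b)])
router.write("alice", {"age": 30})
router.write("zoe", {"age": 41})
shard_a.fail_primary()  # a2 becomes primary; the data is still reachable
```

The client only ever talks to the router, so neither the shard layout nor a failover changes application code.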

MONGODB DEPLOYMENT
[Figure: the App connects to Routers (mongos), which consult the Config servers (mongod) and route to four Shards; each Shard is a Replica set of three mongod processes, one Primary and the others Secondary]

MONGODB INFRASTRUCTURE

Network Switches

1 to n Router server
2 CPU 6 core
96 GB RAM


1 to n Config servers
2 CPU 6 core
96 GB RAM


1 to n Shard servers
2 CPU 6 core
48 GB RAM
12 x HDD 1TB 7.2K

COUCHDB OVERVIEW

• Open Source Distributed Database
• RESTful API
• Schema-less document store (documents in JSON format)
• Multi-Version Concurrency Control model
• User-defined queries structured as map/reduce
• Incremental index update mechanism
• Multi-master replication model
• Written in Erlang
• Supports MapReduce
• Easy-to-use data storage
• Easy to integrate with web applications: JavaScript, JSON
• Scalability for large web applications: incremental replication, bi-directional conflict detection and management
• Query-able and index-able
• Offline by default

Replication
• Master → Slave replication
• Master ↔ Master replication
• Filtered replication
• Incremental and bi-directional replication
• Conflict management

[Figure: Clients connected to two groups of CouchDB Servers, each with one Master and two Slaves]

COUCHDB FUNCTIONALITIES
• Document storage
– A CouchDB server hosts named databases, which store documents

• ACID Properties
– CouchDB never overwrites committed data or associated structures, ensuring the database file is always in a consistent state

• Compaction
– On schedule, or when the database file exceeds a certain amount of wasted space, the compaction process clones all the active data to a new file and then discards the old file

• Views (Model, Function, Index)
– The view model is the method of aggregating and reporting on the documents in a database. Views are built on demand to aggregate, join and report on database documents
– A view function takes a CouchDB document as an argument and then does whatever computation it needs to do to determine the data that is to be made available through the view, if any. It can add multiple rows to the view based on a single document, or it can add no rows at all
– A view index is a dynamic representation of the actual document contents of a database, and CouchDB makes it easy to create useful views of data. Generating a view of a database with hundreds of thousands or millions of documents is time and resource consuming, however, so it is not something the system should do from scratch each time

• Security
– To control who can read and update documents, CouchDB has a simple reader access and update validation model that can be extended to implement custom security models

• Distributed update and replication
– CouchDB is a peer-based distributed database system. It allows users and servers to access and update the same shared data while disconnected and then bi-directionally replicate those changes later
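
The view function described above can be sketched as a generator that receives one document and yields zero or more rows. CouchDB views are normally written in JavaScript and call `emit`; the Python below is only a stand-in for the idea, and the sample documents, `map_posts` and `build_view` are hypothetical.

```python
from collections import defaultdict

# Hypothetical documents in one database.
docs = [
    {"_id": "a", "type": "post", "author": "ann", "words": 120},
    {"_id": "b", "type": "post", "author": "bob", "words": 80},
    {"_id": "c", "type": "comment", "author": "ann"},
    {"_id": "d", "type": "post", "author": "ann", "words": 40},
]

def map_posts(doc):
    """View function: yield (key, value) rows for one document, or nothing at all."""
    if doc.get("type") == "post":
        yield doc["author"], doc["words"]

def build_view(documents, map_fn, reduce_fn):
    """Run the map function over every document, then reduce the rows per key."""
    rows = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            rows[key].append(value)
    return {key: reduce_fn(values) for key, values in sorted(rows.items())}

view = build_view(docs, map_posts, sum)  # total words per author
```

This sketch rebuilds the whole view on every call; the incremental index update mentioned earlier exists precisely so that only changed documents are re-mapped.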



COUCHDB INFRASTRUCTURE

Network Switches

1 to n Router server
2 CPU 6 core
96 GB RAM


1 to n Master servers
2 CPU 6 core
96 GB RAM


1 to n Slave servers
2 CPU 6 core
48 GB RAM
12 x HDD 1TB 7.2K
