SQL, NoSQL, BigData in Data Architecture

Venu Anuganti
Nov 2012
http://scalein.com/
http://venublog.com/

Who am I
• Data Architect, Technology Advisor & Seed Investor

• Design, Implement & Support SQL, NoSQL and BigData
Solutions

– Industry: Databases, Games, Social, Video, SaaS, Analytics,
Warehouse, Web, Financial, Telco, Mobile, Advertising & SEM
Marketing

– Consulted for more than 22 Fortune 500 companies

• http://scalein.com/

Agenda
• Current trends in SQL, NoSQL and BigData

• Why “data architecture” is key for every company

• Key factors in getting the right solution

• Typical Big “Data Architecture”

• Overview of popular data sources, quick comparison

• How to build “data analytics” for “data science”

Current Trends
• A lot of dynamics in the market, too much data

• SQL, NoSQL, BigData & Analytics – buzz everywhere, including
among investors

• NoSQL and BigData are becoming hot topics for every engineer,
team, company & management

• Nothing less than the current tablet war between Apple,
Microsoft, Google, Amazon and Samsung

• A very good sign, as technology is evolving. But a lot of people
are getting confused: which solution should I start with?

– Confusion makes for a slow start at a lot of startups, and slows
even industry leaders in making a shift

Current Trends - SQL
• SQL is slowing down.. not really
– OLTP can’t be replaced easily

• Key factors - Pros
– Transactional, Concurrency, Consistency & Durability
– Proven, SQL, JDBC/ODBC, native protocols
– Widely adopted, fits for all & interoperability
– Legacy, risk free, easy adoption & expert community
– Low latency response times, close to 0 secs
– Very good for small data-sets, takes advantage of bleeding-edge hardware
(SSD, Flash cards, high memory, latest CPUs, cloud enabled)
– Easy read scaling; scaling writes needs application logic

• Key factors – Cons
– Transactional, Concurrency, Consistency & Durability
– Scalability, Clustering, Distributed
– Fixed schema, online management
– Built-in clustering is hard due to the nature of ACID
– Bound by hardware, Scale-UP

Current Trends - NoSQL
• NoSQL is racing

• Key factors - Pros
– Overcomes known SQL limitations
– Eventually consistent
– Clustering, Scalability, Distributed (not all)
– Schema free
– Each solves its own specific problem
– Easy to adopt

• Key factors – Cons
– Consistency (varies), Durability
– Maturity, major solutions are not yet “production” grade
– Does not fit all, an individual solution for each problem
– Response time depends on each solution
– 95% relies on application logic to explore the stored data

Current Trends - BigData
• BigData is the latest industry buzz, trend or …

• Gartner – $28B spend in 2012 & $34B in 2013
– 2013 top-10 technology trends – 6th place

• Solves large data problems that existed for years
– Social, User, Mobile growth demanded such a solution (FB
crossed 1B users, classic example)
– Google “BigTable” is the key, and new papers like Dremel drive
it further
– Amazon “Dynamo” follows
– Hadoop & its ecosystem are becoming synonymous with BigData

• Combines vast structured/un-structured data
– Overcomes the legacy warehouse model
– Brings data analytics & data science
– Real-time, mining, insights, discovery & complex reporting

Current Trends - BigData

• Key factors - Pros
– Can handle any size
– Commodity hardware
– Scalable, Distributed, Highly Available
– Ecosystem & growing community

• Key factors – Cons
– Latency
– Redundancy, Durability, Maturity
– Tradeoff on consistency
– Hardware evolution, even though designed for commodity

Data Architecture
• No standard solution that fits all

• Business and data define the right solution

• It’s all about solving “business” problems

• You need to find the right tool that does the job

– If company X uses MySQL to scale to their 500M users, it does not
mean you can use MySQL to scale to your 100M users

– If company Y uses MongoDB for storing 100M user profiles,
it does not mean you can take it for granted too

Key Factors
• Resources are the key
– A good engineer can make a bad product work
– A bad engineer can make a good product suck

• Understand the business
– Data sources & data growth
– Data consumption
• end user vs. API vs. data science vs. reporting vs. internal
– SLA, response time, turnaround time, recovery time
– Cost; Evolve as business grows, don’t over-architect from day-1
– Capacity planning, leave enough room for failure & growth

Tradeoff – Data Architecture
• Performance vs. Scale vs. Stability

• OLTP vs. OLAP

• Internal vs. External

• Application stack

• Cloud vs. Data center

• Hardware vs. features vs. product vs. cost

Typical “Data” Architecture

Choosing The Right Solution
• Store:

– SQL, key-value, in-memory, document, graph, bigdata,
node.js (server end service), s3, azure, file system, …

• Log:

– Log processing tools for structured/un-structured data
(scripts, Splunk, Flume, Scribe, Chukwa, Loggly, Kibana, …)

• Caching:

– File System, Use replicas, Write Through Cache (WTC),
Read From Cache (RFC)
– CDN/S3/Azure frequent processing, local cache
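
The Write Through Cache (WTC) and Read From Cache (RFC) pair above can be sketched in a few lines. This is a minimal illustration, assuming a plain dict as a stand-in for the backing data store; the class and method names are hypothetical.

```python
class WriteThroughCache:
    """Keep the cache and the backing store in step on every write."""

    def __init__(self, store):
        self.store = store  # backing store; a plain dict stands in here
        self.cache = {}

    def put(self, key, value):
        # Write through: update the backing store first, then the cache.
        self.store[key] = value
        self.cache[key] = value

    def get(self, key):
        # Read from cache; on a miss, load from the store and keep a copy.
        if key not in self.cache:
            self.cache[key] = self.store[key]
        return self.cache[key]
```

Because every write goes to both places, a cached read never returns a value older than what this process last wrote; expiry and invalidation across processes are left out of the sketch.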

Choosing The Right Solution ..
• Platform:

– php, ruby, java, scala, python, c/c++, client/server, rest,
soap, http, api, etc.

• (Dev)-Operation:

– OS, file system, automation using puppet/chef,
security, performance metrics, monitoring, in-depth
exposure to every layer (Nagios, Ganglia, Zabbix,
New Relic, tsdb, etc.)

• Search:

– built-in, Solr, Elasticsearch, full-text

Evaluate – Data Store
• Key Evaluation Requirements
– Transactional, Durability & Consistency
– Response time
– Functionality
– Data characteristics
– Scalability, Clustering
– Failover
– Maintenance, Online changes, Node Management
– Maturity
– Community, Support
– Hosted or Managed
– Cost, open source
– Big “NO” to Appliance models, premium cost solutions

Decide what you need
• SQL
– Relational, transactional processing

• NoSQL
– Non relational, distributed, high performance and highly
scalable

• Analytics, Warehouse, BigData
– Data Warehousing, Analytics, Data science, and reporting

• Combination of all 3
– Begin with SQL and NoSQL; eventually you will need a
BigData/Analytics platform

SQL Stores
• Disk based storage, Fixed schema

• Data is stored as tables (row by row, with columns – row
store), durable and transactional

• Mainly B-tree as the indexing mechanism

• Dynamic locking / lock-free for concurrency control

• Write-ahead log (WAL) / transactional log for crash
recovery

• Takes advantage of bleeding-edge hardware (SSD, flash cards,
CPUs, memory, cloud enabled, …)

• Concurrent read/write/update/delete on the same row

SQL – Good
• Simple or complex aggregation

• Statistics, reports at data store level

• Need access to more than one tuple of information

• Results based on multiple search conditions
– SELECT foo FROM bar WHERE X=1 AND Y=2

• Fetching ordered data or arrays of data

• Compatible with many tools
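
A runnable version of the query on this slide, using Python's built-in sqlite3 module. The table contents are made-up sample data for illustration; the table and column names follow the slide's `bar`, `foo`, `X`, `Y`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bar (foo TEXT, x INTEGER, y INTEGER)")
conn.executemany(
    "INSERT INTO bar VALUES (?, ?, ?)",
    [("a", 1, 2), ("b", 1, 3), ("c", 1, 2), ("d", 2, 2)],
)

# Results based on multiple search conditions, returned in order.
rows = conn.execute(
    "SELECT foo FROM bar WHERE x = ? AND y = ? ORDER BY foo", (1, 2)
).fetchall()

# Aggregation done at the data store level, not in the application.
counts = conn.execute(
    "SELECT x, COUNT(*) FROM bar GROUP BY x ORDER BY x"
).fetchall()
```

Here `rows` holds the two matching tuples and `counts` holds one row per distinct `x`, both computed by the engine.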

SQL – Bad

• SQL complexity, parsing cost, client/server
overhead

• Learning and relational model design

• Performance and Scalability

– Strictly single node write
– Sharding causes more trouble operationally
– Operational maintenance, fire fighting

• Puts a brake on rapid development cycles

NoSQL Stores …
• Non relational, schema free

• Highly Distributed

• Simple CLI, REST, SOAP or API driven

• Eventually consistent, varies from store to store

• Ability to dynamically define new attributes

• Concurrency & Consistency – @application

NoSQL Stores …
• Multiple Types based on storage architecture

• Key Value, KV
– Very popular for simple key-value lookups; disk/memory

• Document
– Popular for document type of storage

• Graph
– Connected graph with entity relationship

• Column Family
– Key value with fixed column families, allows dynamic columns
within column family

NoSQL Stores
• Key-Value Stores (Dynamo clones)
– Membase
– Riak
– Redis
– Tokyo Cabinet
– Voldemort

• Column Family (BigTable clones)
– Cassandra
– HBase
– HyperTable

• Document Stores
– MongoDB
– CouchDB
– SimpleDB

• Graph Databases
– Neo4J
– InfoGrid
– AllegroGraph
– FlockDB

NoSQL - Good
• Fits very well for volatile data

• High read or write throughput

• Automatic horizontal scalability (Consistent hashing)

• Simple to implement, no investment for schema design

• Application logic defines object model

• Support of MVCC in some form

• Compaction and un-compaction happen at app tier

• In-memory or disk based or combination @performance penalty
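
The consistent hashing mentioned above can be sketched as a hash ring with virtual nodes. This is a minimal illustration under stated assumptions: MD5 as the hash, 100 virtual nodes per server, and hypothetical node names. Real stores differ in hash function, replication and rebalancing.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node owns many small arcs of the ring (virtual nodes).
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        # First ring position at or after the key's hash, wrapping around.
        idx = bisect.bisect(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

When a node is added, only the keys that fall on its new arcs move to it; every other key stays where it was, which is what makes horizontal scaling automatic.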

NoSQL - Good
• Rapid development cycles, programmer friendly

• Reduces the footprint at data store level

• NoSQL is in general faster than SQL

• Supports INSERT, DELETE, SELECT

• Data is distributed by KEY over nodes (depends on solution)

• Lists, sets, queues, pub-sub are also supported by some NoSQL –
Redis, Riak

NoSQL - Bad
• Packing and Un-packing of each key

• Lack of relation from one key to another

• Must fetch the whole value for a key even when you need 1 byte

• Concurrency control for the latest copy is up to you (the application)

• Data store is merely a storage layer, can’t be used for:

– Analytics
– Reporting
– Aggregation
– Ordered values
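
A small sketch of what packing, un-packing and application-side aggregation look like against a plain key-value store. A dict of bytes stands in for the store, JSON for the packing format, and the user records are made-up sample data; all of these are assumptions for illustration.

```python
import json

kv = {}  # stand-in for a key-value store: one opaque blob per key


def put(key, obj):
    kv[key] = json.dumps(obj).encode("utf-8")  # pack


def get(key):
    return json.loads(kv[key].decode("utf-8"))  # unpack the whole value


put("user:1", {"name": "ann", "country": "US", "spend": 30})
put("user:2", {"name": "bob", "country": "IN", "spend": 20})
put("user:3", {"name": "cat", "country": "US", "spend": 5})

# The store has no GROUP BY: scan every key, unpack, aggregate in the app.
spend_by_country = {}
for key in kv:
    user = get(key)
    country = user["country"]
    spend_by_country[country] = spend_by_country.get(country, 0) + user["spend"]
```

Every record is fetched and decoded in full just to read two fields, which is the cost the slide describes.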

SQL/NoSQL – Good and Bad
• Performance mainly depends on amount of memory

• When disk bound, both take a hit

– SQL has the advantage due to sequential I/O and read-ahead

• Optimization towards frequently accessed data

– SQL engines maintain LRU, buffer pool
– Read from slave nodes, may not be up to date

• SQL Engines are proven and widely in use

• People use a write-through cache (WTC) with both NoSQL & SQL
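
The LRU idea behind a buffer pool can be sketched with `collections.OrderedDict`. This is a simplified illustration with hypothetical names; production engines use refined variants of LRU rather than this exact policy.

```python
from collections import OrderedDict


class LRUCache:
    """Keep the most recently used entries, evict the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._pages = OrderedDict()

    def get(self, key):
        if key not in self._pages:
            return None  # miss: the caller would read from disk
        self._pages.move_to_end(key)  # mark as most recently used
        return self._pages[key]

    def put(self, key, value):
        self._pages[key] = value
        self._pages.move_to_end(key)
        if len(self._pages) > self.capacity:
            self._pages.popitem(last=False)  # evict least recently used
```

Frequently accessed data stays in memory, so performance tracks how much of the working set fits within `capacity`.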

Analytic Stores
• Data warehousing, mainly for processing large data
sets

• Data marts, Dimensional, Fact and Aggregate
tables

• ETL, BI, Reporting, Analytics

• Columnar storage, distribution and compression are the key
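
Why columnar layout and compression go together can be shown on a tiny example: storing each column separately puts similar values next to each other, so even simple run-length encoding shrinks them. The rows are made-up sample data, and run-length encoding is just one assumed compression scheme for illustration.

```python
from itertools import groupby

rows = [
    ("2012-11-01", "US", 10),
    ("2012-11-01", "US", 12),
    ("2012-11-01", "IN", 7),
    ("2012-11-02", "IN", 9),
]

# Row store -> column store: one list per column.
columns = [list(col) for col in zip(*rows)]


def rle(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


date_col = rle(columns[0])
country_col = rle(columns[1])
total = sum(columns[2])  # an aggregate reads a single column only
```

The four dates collapse to two runs, and the sum touches one column instead of every full row.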

Data Analytics
• Data Analytics is critical for every business

– Combine heterogeneous data sources
• Weblogs, user activity, transactional data, purchase history,
user profile, crm, marketing, campaign performance, …
– Complex Reporting
– Understand user behavior, geo, interest levels
– Recommendation
– User (re)targeting
– Product usage, features most (not) liked
– Increase ROI, user satisfaction

• It helps business in every aspect to inspect,
understand, implement, apply – Waterfall model

Data Science
• Large data sets help build good models, since more samples
give higher statistical confidence
– Statistics
– Predictions
– Data Analysis
– Build test models, continuously
• A/B test
• Apply slowly to selected users or clients
• Fine tune it
• Adopt globally
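
The "apply slowly to selected users, then adopt globally" step can be sketched as deterministic bucketing: hash each user into one of 100 buckets and enable the change for the first N. The function name, the SHA-256 choice and the experiment label are assumptions for illustration.

```python
import hashlib


def in_rollout(user_id, experiment, percent):
    """Return True if this user falls in the first `percent` of 100 buckets."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent
```

A user's bucket never changes, so raising `percent` from 10 to 50 to 100 only adds users; nobody who already has the change loses it mid-test.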

Analytic Stores
• Columnar data warehouse solutions

– GreenPlum (EMC, DCA appliance)
– Vertica (HP, appliance coming)
– ParAccel
– InfoBright (MySQL based)
– InfiniDB (open source, Calpont appliance)
– Netezza (IBM, appliance)
– XtremeData dbX (appliance)
– TeraData

Analytic Stores - BigData
• Hadoop is leading the BigData platform

• Rapidly Growing - Analytics Platform

– HDFS, Map Reduce direct processing
– HIVE
– HBASE
– IMPALA – announced by Cloudera last week, based on
Google’s Dremel
– DRILL – Apache open source version, in the works
– Google BigQuery
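
The Map Reduce processing model listed above can be sketched in-process as three steps: map, shuffle, reduce. This is a single-machine illustration of the model only, not Hadoop's API; the function names and input lines are assumptions.

```python
from collections import defaultdict


def map_phase(line):
    # Emit a (word, 1) pair for every word in the line.
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    return key, sum(values)


lines = ["sql nosql bigdata", "nosql bigdata", "bigdata"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

On a cluster, the map and reduce calls run in parallel on different nodes and the shuffle moves data between them over the network.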

Document Store
• Document Stores

– Supports more complex data models than KV
– Good at handling content management, session,
profile data
– Multi index support
– Dynamic schemas, Nested schemas
– Auto distributed, eventual consistency
– MVCC (CouchDB) or app logic (MongoDB)

• MongoDB, SimpleDB: widely adopted in this space

• Use Case: Search by complex patterns & CRUD apps

Column Family Store

• HBase (Apache), Cassandra (Facebook) and HyperTable (Baidu)

– HBase – CP (consistency and partition tolerance)
– Cassandra – AP (availability and partition tolerance)

• Model consists of rows and columns

• Scalability: Splitting of both rows and columns

– Rows are split across nodes using primary key, range
– Columns are distributed using groups
– Horizontal and vertical partitioning can be used simultaneously

• Extension of document store
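
The row and column splitting described above can be sketched as two lookups: a range lookup on the row key picks the node, and a group lookup on the column name picks the column family. The split points, node numbering and group names below are hypothetical values chosen for illustration.

```python
import bisect

# Hypothetical split points: keys < "h" -> node 0, keys < "p" -> node 1, else node 2.
SPLIT_POINTS = ["h", "p"]

# Hypothetical column groups (column families).
COLUMN_GROUPS = {
    "profile": {"name", "email"},
    "activity": {"last_login", "clicks"},
}


def node_for_row(row_key):
    # Horizontal partitioning: range lookup on the row key.
    return bisect.bisect_right(SPLIT_POINTS, row_key)


def group_for_column(column):
    # Vertical partitioning: each column belongs to one group.
    for group, columns in COLUMN_GROUPS.items():
        if column in columns:
            return group
    raise KeyError(column)


def locate(row_key, column):
    return node_for_row(row_key), group_for_column(column)
```

For example, `locate("alice", "email")` resolves to node 0 and the `profile` group, so a read of one cell touches one node and one column group.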

Graph Store
• Social Graph

• Relationship between entities

• Data modeling on social networks

• Common Use Cases

–List of friends, Shared with common property
–Recommendation system
–Following
–Followers
–Common Connections
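
The use cases above map directly onto set operations over an adjacency structure. This is a minimal in-memory sketch with made-up users; a graph database would store and traverse the same relationships natively.

```python
# Who each user follows (directed edges).
follows = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"cat", "dan"},
    "cat": {"ann"},
    "dan": set(),
}


def following(user):
    return follows.get(user, set())


def followers(user):
    return {u for u, targets in follows.items() if user in targets}


def common_connections(a, b):
    return following(a) & following(b)


def recommend(user):
    # Friends of friends the user does not follow yet.
    candidates = set()
    for friend in following(user):
        candidates |= following(friend)
    return candidates - following(user) - {user}
```

With this data, `common_connections("ann", "bob")` yields the two users both follow, and `recommend("bob")` suggests `ann` via `cat`.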

Cloud Data Stores
• “Database As Service” Models:

– Amazon RDS, DynamoDB, SimpleDB, PostgreSQL
– Xeround (MySQL)
– Microsoft SQL Azure Database (SQL Server)
– Google App Engine (NoSQL)
– SalesForce Database.com (Oracle)
– ClearDB (MySQL)
– Cloudant(CouchDB)

Finally …

• SQL
Works great, can’t easily scale

• NoSQL
Works great, doesn’t fit all

• Analytics, BigData
Every business needs it

Questions ?

•http://scalein.com/
•http://venublog.com/
•venu@venublog.com
•Twitter: @vanuganti
