BIG DATA TITLE

The Elephant in the Room
A DBA’s Guide to Hadoop & Big Data

Purpose
Rosetta Stone presentation
High level overview of Hadoop & Big Data
NOT a deep dive
NOT a demo session
Mostly theory & vocabulary
Where to learn more

About Me
Manage DBAs for a financial services company
Former Data Architect, DBA, developer
Linchpin People TeamMate
AtlantaMDF Chapter Leader
Infrequent blogger: http://codegumbo.com

About You
Assume that
● mostly developers
● SQL experience
● exposure to database admin & architecture
● little to no experience with Big Data

Big Data is like teenage sex...
Everyone talks about it,
Nobody really knows how to do it,
Everyone thinks everyone else is doing it,
So everyone claims they are doing it…
-Dan Ariely

The Four V’s of Big Data
Volume - data is too big to scale up
Velocity - decision window is small
Variety - multiple formats challenge integration
Variability - same data, different interpretations
http://goo.gl/6icouZ

RDBMS versus Big Data
RDBMS
Primarily Scale-Up
Strong Typing
Normalization
Default Mutable
Mature
Big Data
Primarily Scale-Out
Schemaless
Default Immutable
Evolving

Big Data Use Cases
Massive Size
PB of info
Data Warehouse
Large clusters
High Cost
Complex Analytics
Schemaless
Investigational
Single-node
Low Cost

Foundations
“Gentlemen, this is a football…”
- Vince Lombardi

Hadoop Ecosystem (Hortonworks)
Hortonworks

Hadoop
Scalable, distributed processing framework
open-source
Hortonworks*
Cloudera
proprietary components
Facebook
Yahoo

HDFS
Hadoop Distributed File System
Inspired by the Google File System (2002-2003)
Cluster storage of large files across servers
Yahoo - 10,000 core Hadoop cluster(s)
Facebook - 100 PB+ (June, 2012)
http://goo.gl/SpSN

HDFS
File permissions and authentication.
Rack aware
fsck: find missing files or blocks.
Scheduled Rebalancing
Redundancy & Replication
Built around MapReduce
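
A rough sketch of what redundancy and replication mean for capacity planning. The 128 MB block size and replication factor of 3 are common HDFS defaults, not figures from these slides, and `hdfs_footprint` is a hypothetical helper, not part of any Hadoop API.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Estimate how one file is spread across a cluster.

    Assumed defaults: 128 MB blocks, 3 copies of each block.
    Returns (logical blocks, block replicas stored, raw MB consumed).
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The last block is not padded, so raw usage is file size times copies.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb

footprint = hdfs_footprint(1000)
print(footprint)
```

Under these assumptions a 1,000 MB file is split into 8 blocks, stored as 24 block replicas, and consumes about 3,000 MB of raw disk across the cluster.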

MapReduce
“Developed” by Google; paper published in 2004 (patent granted in 2010)
Map - filtering and sorting
Reduce - summarization
Inherently distributed
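
To make the two phases concrete, here is a single-process sketch in plain Python. It is not Hadoop code: the trade records are made up, and `map_trade`, `shuffle`, `reduce_volumes` and `run_job` are hypothetical names. On a real cluster the framework does the shuffle and runs many map and reduce tasks in parallel on different nodes.

```python
from collections import defaultdict

# Made-up input records: (symbol, exchange, volume)
TRADES = [
    ("IBM", "NYSE", 100),
    ("MSFT", "NASDAQ", 250),
    ("IBM", "NYSE", 300),
    ("GE", "NYSE", 50),
]

def map_trade(trade):
    """Map: filter to NYSE trades and emit (key, value) pairs."""
    symbol, exchange, volume = trade
    if exchange == "NYSE":
        yield symbol, volume

def shuffle(pairs):
    """Shuffle: group values by key and sort the keys between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_volumes(symbol, volumes):
    """Reduce: summarize all values for one key."""
    return symbol, sum(volumes)

def run_job(trades):
    mapped = (pair for trade in trades for pair in map_trade(trade))
    return [reduce_volumes(key, values) for key, values in shuffle(mapped)]

totals = run_job(TRADES)
print(totals)
```

Each call to `map_trade` looks at one record only, and each call to `reduce_volumes` looks at one key only, which is why the work can be spread across many machines.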

Hive
HiveQL - SQL like syntax
DDL scripts define tables
Query transformed into MapReduce jobs
Performance improves as the cluster scales out
Stinger initiative - Microsoft & Hortonworks

Hive
create external table price_data (
    stock_exchange string,
    symbol string,
    trade_date string,
    open float,
    high float,
    low float,
    close float,
    volume int,
    adj_close float
)
row format delimited fields terminated by ','
stored as textfile
location '/user/hue/nyse/nyse_prices';

select * from price_data where symbol = 'IBM';
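
The external table above does not load or convert anything; the column definitions are applied to the delimited files when a query reads them. A minimal Python sketch of that idea follows. The two sample rows are made up and stand in for the files under /user/hue/nyse/nyse_prices, and `read_rows` is a hypothetical helper, not Hive's implementation.

```python
import csv
import io

# Column names and types copied from the price_data DDL.
SCHEMA = [
    ("stock_exchange", str), ("symbol", str), ("trade_date", str),
    ("open", float), ("high", float), ("low", float),
    ("close", float), ("volume", int), ("adj_close", float),
]

# Made-up sample data in the same comma-delimited layout.
RAW = """NYSE,IBM,2010-02-08,123.15,123.22,121.74,121.88,5718500,121.88
NYSE,GE,2010-02-08,15.80,15.93,15.60,15.62,60000000,15.62
"""

def read_rows(text):
    """Apply the schema to each delimited line at read time."""
    for fields in csv.reader(io.StringIO(text)):
        yield {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}

# Equivalent of: select * from price_data where symbol = 'IBM';
ibm = [row for row in read_rows(RAW) if row["symbol"] == "IBM"]
print(ibm)
```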

HCatalog
Tight integration with Hive, but supports all
Hadoop data access protocols
Define relational view into data (DDL)
“Tables” can be reused by Hive, Pig, Storm...
Tutorial

Pig
Data abstraction language; Yahoo (2006)
Based on Java; supports Python & Ruby
Procedural (SQL is declarative)
Allows for ETL
Lazy evaluation
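
Lazy evaluation means Pig only builds a plan as statements are entered; nothing runs until a STORE or DUMP asks for output. Python generators behave in a similar way, so the sketch below uses them as an analogy. The `load` and `scrub` functions and the `executed` log are hypothetical names for illustration, not Pig APIs.

```python
executed = []  # records each unit of work that actually runs

def load(lines):
    for line in lines:
        executed.append("load")
        yield line

def scrub(rows):
    for row in rows:
        executed.append("scrub")
        yield row.strip().lower()

# Defining the pipeline does no work, like entering Pig statements.
pipeline = scrub(load(["  IBM ", " GE "]))
steps_before_store = len(executed)

# Asking for the output plays the role of STORE or DUMP.
result = list(pipeline)
print(steps_before_store, result, len(executed))
```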

Pig
ETL service; useful as “duct tape”
Typical scenario:
Load data into HDFS
Use Pig to scrub data, and
Pump to another “db” (e.g., MongoDB)
Web service reads from destination

Hadoop                  SQL Server
HDFS                    Windows Cluster, Database
MapReduce               Query Optimizer
Master Web Interface    SQL Server Management Studio
Hive                    SQL
HCatalog                Views
Pig                     PowerShell, SSIS

Big Data Administration
The possession of facts is knowledge, the use of them is wisdom.
- Thomas Jefferson

[Chart: performance vs. application growth (RDBMS)]

[Chart: performance vs. application growth (Big Data)]

[Chart: performance vs. application growth]

Scale-Up Costs (SQL Server)
Single Server
Maximum RAM
SAN
Licenses
Windows
SQL Server
Microsoft Support
Personnel
Developers
DBA
SAN Admin
Network Admin
Facilities
Minimum Footprint

Scale-Out Costs (Hortonworks HDP)
Multiple Servers
Commodity
Licenses
Windows ($$$)
Linux ($)
HDP Support
Personnel
Developer
HDP Admin
Network Admin
Facilities
Power
Space
Air

Performance Tuning
[Diagram: system vs. code tuning, RDBMS and Hadoop]
Performance Tuning Tips

Performance Architecture
Nathan Marz - Twitter, Storm
Lambda Architecture
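
As generally described, the Lambda Architecture answers a query by combining a batch view, precomputed over the full immutable dataset, with a real-time view over data the batch layer has not absorbed yet. A toy sketch of that merge follows; the event lists and the `batch_view`, `speed_view` and `query` names are made up for illustration.

```python
from collections import Counter

def batch_view(master_dataset):
    """Batch layer: recomputed from scratch over all historical events."""
    return Counter(master_dataset)

def speed_view(recent_events):
    """Speed layer: covers only events newer than the last batch run."""
    return Counter(recent_events)

def query(key, batch, speed):
    """Serving layer: merge both views to answer a query."""
    return batch[key] + speed[key]

master = ["IBM", "GE", "IBM"]   # already processed by the batch layer
recent = ["IBM"]                # arrived since the last batch run

batch = batch_view(master)
speed = speed_view(recent)
ibm_total = query("IBM", batch, speed)
print(ibm_total)
```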

Getting Started (Massive Size)
1. Lab Environment (Virtualized)
2. Set up OS (Windows or Linux)
3. Download (MSI or RPM)
4. Deploy Prereqs (Python, Java, C++)
5. Set up Master Node(s)
6. Set up Data Node(s)

Word Count
Problem: count the number of times a word appears in a specific record.
e.g. “Lorem ipsum dolor sit amet, consectetur adipiscing elit.”...

Word Count
SQL Server: create a UDF to parse strings
Hadoop: Pig script to parse strings

Word Count - SQL Server
CREATE FUNCTION WordRepeatedNumTimes
    (@SourceString varchar(max), @TargetWord varchar(8000))
RETURNS int
AS
BEGIN
    -- Counts non-overlapping occurrences of @TargetWord in @SourceString.
    -- CHARINDEX matches substrings, so 'cat' is also counted inside 'catalog'.
    DECLARE @NumTimesRepeated int
        ,@LengthOfString int
        ,@PatternStartsAtPosition int
        ,@LengthOfTargetWord int
        ,@NewSourceString varchar(max)

    SET @LengthOfTargetWord = len(@TargetWord)
    SET @LengthOfString = len(@SourceString)
    SET @NumTimesRepeated = 0
    SET @PatternStartsAtPosition = 0
    SET @NewSourceString = @SourceString

    -- An empty target would never end the loop below, so return early.
    IF @LengthOfTargetWord = 0 RETURN 0

    WHILE len(@NewSourceString) >= @LengthOfTargetWord
    BEGIN
        SET @PatternStartsAtPosition = CHARINDEX(@TargetWord, @NewSourceString)
        IF @PatternStartsAtPosition <> 0
        BEGIN
            SET @NumTimesRepeated = @NumTimesRepeated + 1
            -- Keep only the text after the match just found.
            SET @NewSourceString = substring(@NewSourceString,
                @PatternStartsAtPosition + @LengthOfTargetWord, @LengthOfString)
        END
        ELSE
        BEGIN
            -- No more matches: empty the string to end the loop.
            SET @NewSourceString = ''
        END
    END
    RETURN @NumTimesRepeated
END

Word Count (Hadoop)
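-- Load each line of the file as a single field ($0)
a = load '/user/hue/word_count_text.txt';
-- Split each line into words, one word per row
b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
-- Collect identical words together
c = group b by word;
-- Count the rows in each group
d = foreach c generate COUNT(b), group;
store d into '/user/hue/pig_wordcount';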
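
For readers who do not know Pig Latin, here is the same flow step by step in plain Python on one machine. The input lines are made up, and the delimiter set in the regular expression is an approximation of TOKENIZE's default delimiters (whitespace, double quote, comma, parentheses, asterisk), so treat it as an assumption.

```python
import re
from collections import Counter

# a: made-up input lines standing in for word_count_text.txt
lines = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
    "dolor sit amet",
]

# b: tokenize and flatten, giving one word per entry
words = [
    word
    for line in lines
    for word in re.split(r'[\s",()*]+', line)
    if word
]

# c and d: group identical words and count each group
counts = Counter(words)
print(counts)
```

Matching is case-sensitive and punctuation other than the delimiters stays attached, so "elit." is counted with its trailing period.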

Getting Started (Complex Analysis)
1. Lab Environment (Virtualized)
2. Install Hortonworks Sandbox
1. Set up Azure account
2. HDInsight

In theory this can scale to petabytes, but the cost at that size is unknown.
Note that the interface highlights Hive (with Stinger); Pig commands are run through PowerShell.

In Conclusion
Lots of vocabulary
HDFS, Pig, Hive, MapReduce
Map to SQL Server (RDBMS) vocabulary
Different Use Cases
Massive Data
Complex Analysis

Contact Me
Stuart R. Ainsworth
Twitter: @codegumbo
Email: stuart@codegumbo.com
SpeakerRate: http://spkr8.com/t/33521

Big Data - Dangerous
http://www.thefacehawk.com/
