4. Purpose
Rosetta Stone presentation
High level overview of Hadoop & Big Data
NOT a deep dive
NOT a demo session
Mostly theory & vocabulary
Where to learn more
5. About Me
Manage DBAs for a financial services company
Former Data Architect, DBA, developer
Linchpin People TeamMate
AtlantaMDF Chapter Leader
Infrequent blogger: http://codegumbo.com
6. About You
Assume that
● mostly developers
● SQL experience
● exposure to database admin & architecture
● little to no experience with Big Data
8. Big Data is like teenage sex...
Everyone talks about it,
Nobody really knows how to do it,
Everyone thinks everyone else is doing it,
So everyone claims they are doing it…
-Dan Ariely
9. The Four V’s of Big Data
Volume - data is too big to scale up
Velocity - decision window is small
Variety - multiple formats challenge integration
Variability - same data, different interpretations
http://goo.gl/6icouZ
10. RDBMS versus Big Data
RDBMS
● Primarily Scale-Up
● Strong Typing
● Normalization
● Default Mutable
● Mature
Big Data
● Primarily Scale-Out
● Schemaless
● Default Immutable
● Evolving
11. Big Data Use Cases
Massive Size
● PB of info
● Data Warehouse
● Large clusters
● High Cost
Complex Analytics
● Schemaless
● Investigational
● Single-node
● Low Cost
22. Hive
create external table price_data (
  stock_exchange string, symbol string, trade_date string,
  open float, high float, low float, close float,
  volume int, adj_close float)
row format delimited fields terminated by ','
stored as textfile
location '/user/hue/nyse/nyse_prices';

select * from price_data where symbol = 'IBM';
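The external table above only describes files that already sit in HDFS; the schema is applied when the data is read. A minimal Python sketch of that idea follows, assuming comma-delimited rows in the column order of the DDL. The function names and the sample rows are hypothetical (made-up values, not real NYSE prices), and this is an illustration of schema-on-read, not how Hive is implemented.

```python
import csv
import io

# Column names and types copied from the CREATE EXTERNAL TABLE statement.
SCHEMA = [
    ("stock_exchange", str), ("symbol", str), ("trade_date", str),
    ("open", float), ("high", float), ("low", float), ("close", float),
    ("volume", int), ("adj_close", float),
]

def read_price_data(lines):
    """Apply the schema to each comma-delimited row as it is read."""
    for fields in csv.reader(lines):
        yield {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}

def select_symbol(rows, symbol):
    """Rough equivalent of: select * from price_data where symbol = '...'."""
    return [row for row in rows if row["symbol"] == symbol]

# Hypothetical rows with made-up values, standing in for the files under
# /user/hue/nyse/nyse_prices.
SAMPLE = io.StringIO(
    "NYSE,IBM,2000-01-03,10.0,12.0,9.0,11.0,1000,11.0\n"
    "NYSE,GE,2000-01-03,20.0,22.0,19.0,21.0,2000,21.0\n"
)

ibm_rows = select_symbol(read_price_data(SAMPLE), "IBM")
print(ibm_rows)
```

The raw text is never rewritten; dropping the table definition would leave the files untouched, which is the point of an external table.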
24. HCatalog
Tight integration with Hive, but supports all
Hadoop data access protocols
Define relational view into data (DDL)
“Tables” can be reused by Hive, Pig, Storm...
Tutorial
25. Pig
Data abstraction language; developed at Yahoo (2006)
Based on Java; supports user-defined functions in Python & Ruby
Procedural (SQL is declarative)
Allows for ETL
Lazy evaluation
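Lazy evaluation means a Pig script only builds up a plan; nothing runs until a result is requested (for example by a store or dump). Python generators behave in a similar way, so the sketch below uses them as an analogy. The function names and the sample records are hypothetical, and this is not Pig's actual execution engine.

```python
executed = []  # records each unit of work as it actually happens

def load(records):
    """Yield records one at a time, noting when each is really read."""
    for record in records:
        executed.append("load")
        yield record

def keep_symbol(records, symbol):
    """Filter step: pass through only the records that match."""
    for record in records:
        executed.append("filter")
        if record == symbol:
            yield record

# Defining the pipeline does no work yet, like a Pig script before STORE.
plan = keep_symbol(load(["IBM", "GE", "IBM"]), "IBM")
work_before_store = len(executed)

# Consuming the pipeline plays the role of STORE: now every step runs.
result = list(plan)
work_after_store = len(executed)

print(work_before_store, work_after_store, result)
```

Because the whole plan is known before anything runs, the engine is free to reorder or combine steps before it touches the data.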
28. Pig
ETL service; useful as “duct tape”
Typical scenario:
Load data into HDFS
Use Pig to scrub data, and
Pump to another “db” (e.g., MongoDB)
Web service reads from destination
31. Hadoop vs. SQL Server
Hadoop | SQL Server
HDFS | Windows Cluster, Database
MapReduce | Query Optimizer
Master Web Interface | SQL Server Management Studio
Hive | SQL
HCatalog | Views
Pig | PowerShell, SSIS
37. Scale-Up Costs (SQL Server)
Single Server
● Maximum RAM
● SAN
Licenses
● Windows
● SQL Server
● Microsoft Support
Personnel
● Developers
● DBA
● SAN Admin
● Network Admin
Facilities
● Minimum Footprint
38. Scale-Out Costs (Hortonworks HDP)
Multiple Servers
● Commodity
Licenses
● Windows ($$$)
● Linux ($)
● HDP Support
Personnel
● Developer
● HDP Admin
● Network Admin
Facilities
● Power
● Space
● Air
46. Word Count
Problem: count the number of times a word
appears in a specific record.
e.g. “Lorem ipsum dolor sit amet, consectetur
adipiscing elit.”...
48. Word Count - SQL Server
CREATE FUNCTION WordRepeatedNumTimes
    (@SourceString varchar(max), @TargetWord varchar(8000))
RETURNS int
AS
BEGIN
    DECLARE @NumTimesRepeated int
           ,@CurrentStringPosition int
           ,@LengthOfString int
           ,@PatternStartsAtPosition int
           ,@LengthOfTargetWord int
           ,@NewSourceString varchar(max)

    SET @LengthOfTargetWord = len(@TargetWord)
    SET @LengthOfString = len(@SourceString)
    SET @NumTimesRepeated = 0
    SET @CurrentStringPosition = 0
    SET @PatternStartsAtPosition = 0
    SET @NewSourceString = @SourceString

    -- Scan left to right, trimming the source past each match.
    -- The length check on the target guards against an endless loop
    -- when @TargetWord is empty.
    WHILE @LengthOfTargetWord > 0
      AND len(@NewSourceString) >= @LengthOfTargetWord
    BEGIN
        SET @PatternStartsAtPosition = CHARINDEX(@TargetWord, @NewSourceString)
        IF @PatternStartsAtPosition <> 0
        BEGIN
            SET @NumTimesRepeated = @NumTimesRepeated + 1
            SET @CurrentStringPosition = @CurrentStringPosition +
                @PatternStartsAtPosition + @LengthOfTargetWord
            SET @NewSourceString = substring(@NewSourceString,
                @PatternStartsAtPosition + @LengthOfTargetWord, @LengthOfString)
        END
        ELSE
        BEGIN
            SET @NewSourceString = ''  -- no further matches; ends the loop
        END
    END
    RETURN @NumTimesRepeated
END
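For readers less familiar with T-SQL, the same scan can be sketched in a few lines of Python. This is a hypothetical mirror of the function above, not a translation anyone has tested against SQL Server: Python's search is case-sensitive, while CHARINDEX follows the database collation, and T-SQL's len() ignores trailing spaces. The sketch also makes one property of the approach visible: it counts substring matches, not whole words.

```python
def word_repeated_num_times(source: str, target: str) -> int:
    """Count non-overlapping occurrences of target in source, left to right."""
    if not target:
        return 0  # guard: an empty target would never advance the scan
    count = 0
    position = source.find(target)
    while position != -1:
        count += 1
        # Resume the search just past the end of the previous match.
        position = source.find(target, position + len(target))
    return count

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."

dolor_count = word_repeated_num_times(text, "dolor")
# "it" is not a word in the sample, but it occurs inside "sit" and "elit".
it_count = word_repeated_num_times(text, "it")
print(dolor_count, it_count)
```

Counting whole words only would need tokenization first, which is the step the Pig version handles with TOKENIZE.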
51. Word Count (Hadoop)
a = load '/user/hue/word_count_text.txt';
b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
c = group b by word;
d = foreach c generate COUNT(b), group;
store d into '/user/hue/pig_wordcount';
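The Pig script follows a tokenize, group, count pattern. A rough single-machine Python sketch of the same steps is shown here, assuming that TOKENIZE splits on spaces, double quotes, commas, parentheses and asterisks by default (an approximation worth checking against the Pig documentation). The function names are hypothetical, and the second input line is made up to show a count above one.

```python
import re
from collections import defaultdict

# Assumed approximation of Pig's default TOKENIZE delimiters.
DELIMITERS = re.compile(r'[ ",()*]+')

def tokenize(line):
    """Step b: split a line into words, dropping empty tokens."""
    return [word for word in DELIMITERS.split(line) if word]

def word_count(lines):
    """Steps c and d: group identical words, then count each group."""
    groups = defaultdict(list)
    for line in lines:
        for word in tokenize(line):
            groups[word].append(word)   # group b by word
    return {word: len(bag) for word, bag in groups.items()}  # COUNT(b)

counts = word_count([
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
    "dolor sit amet",   # made-up second record
])
print(counts)
```

Note that punctuation outside the delimiter set stays attached, so "elit." is counted separately from "elit". On a cluster the grouping step is what gets distributed across nodes; the logic per word stays this simple.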
53. Theoretically, this can scale to petabytes (PB), but
it is unclear what that would cost.
Note that the interface highlights
Hive (with Stinger); Pig commands
are run through PowerShell
55. In Conclusion
Lots of vocabulary
● HDFS, Pig, Hive, MapReduce
● Map to SQL Server (RDBMS) vocabulary
Different Use Cases
● Massive Data
● Complex Analysis