Here I talk about examples and use cases for Big Data & Big Data Analytics, and how we accomplished massive-scale sentiment, campaign and marketing analytics for Razorfish using a collection of database, Big Data and analytics technologies.
1. Big Data in the Real World
Orlando PASS
October 2013
http://www.pssug.org
Mark Kromer
http://www.kromerbigdata.com
@kromerbigdata
@mssqldude
2. What we’ll (try to) cover today
‣ What is Big Data?
‣ The Big Data and Apache Hadoop environment
‣ Big Data Analytics
‣ SQL Server in the Big Data world
‣ Microsoft + Hortonworks (Yahoo!) = HDInsight
3. Big Data 101
‣ 3 V’s
‣ Volume – Terabyte records, transactions, tables, files
‣ Velocity – Batch, near-time, real-time (analytics), streams.
‣ Variety – Structured, unstructured, semi-structured, and all of the above in a mix
‣ Text Processing
‣ Techniques for processing and analyzing unstructured (and structured) LARGE files
‣ Analytics & Insights
‣ Distributed File System & Programming
4. Mark’s Big Data Myths
‣ Big Data ≠ NoSQL
‣ NoSQL has similar Internet-scale Web origins to the Hadoop stack (Yahoo!,
Google, Facebook, et al.) but is not the same thing
‣ Facebook, for example, uses HBase from the Hadoop stack
‣ NoSQL does not have to be Big Data
‣ Big Data ≠ Real Time
‣ Big Data is primarily about batch processing huge files in a distributed manner
and analyzing data that was otherwise too complex to provide value
‣ Use in-memory analytics for real-time insights
‣ Big Data ≠ Data Warehouse
‣ I still refer to large multi-TB DWs as “VLDB”
‣ Big Data is about crunching stats in text files for discovery of new patterns and
insights
‣ Use the DW to aggregate and store the summaries of those calculations for
reporting
10. MapReduce Framework (Map)
using Microsoft.Hadoop.MapReduce;
using System.Text.RegularExpressions;

public class TotalHitsForPageMap : MapperBase
{
    // Illustrative constants (assumed values; set them to match your W3C log layout):
    // a complete record has 'expected' whitespace-delimited fields, the page/URL
    // sits at index 'pagePos', and each record counts as one hit
    private const int expected = 14;
    private const int pagePos = 4;
    private const string hit = "1";

    public override void Map(string inputLine, MapperContext context)
    {
        context.Log(inputLine);
        var parts = Regex.Split(inputLine, @"\s+");
        if (parts.Length != expected) // only take records with all values
        {
            return;
        }
        context.EmitKeyValue(parts[pagePos], hit);
    }
}
11. MapReduce Framework (Reduce & Job)
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

public class TotalHitsForPageReducerCombiner : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        // Sum the per-page hit counts emitted by the mapper
        context.EmitKeyValue(key, values.Sum(e => long.Parse(e)).ToString());
    }
}

public class TotalHitsJob : HadoopJob<TotalHitsForPageMap, TotalHitsForPageReducerCombiner>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        // HDFS input/output paths are read from environment variables
        var retVal = new HadoopJobConfiguration();
        retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT");
        retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT");
        retVal.DeleteOutputFolder = true;
        return retVal;
    }
}
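To actually launch the job you need a driver. A minimal sketch, assuming the Hadoop.Connect() / ExecuteJob<TJob>() entry point from the hadoopsdk.codeplex.com samples (the SDK's MRRunner utility is the command-line alternative); the HDFS paths are illustrative:

using System;
using Microsoft.Hadoop.MapReduce;

public class TotalHitsDriver
{
    public static void Main(string[] args)
    {
        // Tell TotalHitsJob.Configure where to read and write in HDFS
        Environment.SetEnvironmentVariable("W3C_INPUT", "/import/w3c/logs");
        Environment.SetEnvironmentVariable("W3C_OUTPUT", "/output/w3c/hits");

        // Submit the job and block until it completes (assumed SDK entry point)
        Hadoop.Connect().MapReduceJob.ExecuteJob<TotalHitsJob>();
    }
}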
12. Get Data into Hadoop
‣ Linux shell commands to access data in HDFS
‣ Put file in HDFS: hadoop fs -put sales.csv /import/sales.csv
‣ List files in HDFS:
c:\Hadoop>hadoop fs -ls /import
Found 1 items
-rw-r--r-- 1 makromer supergroup 114 2013-05-07 12:11 /import/sales.csv
‣ View file in HDFS:
c:\Hadoop>hadoop fs -cat /import/sales.csv
Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99
James,11,12,1
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11
‣ Now, we can work on the data with MapReduce, Hive, Pig, etc.
13. Use Hive for Data Schema and Analysis
create external table ext_sales
(
lastname string,
productid int,
quantity int,
sales_amount float
)
row format delimited fields terminated by ',' stored as
textfile location '/user/makromer/hiveext/input';
LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE
INTO TABLE ext_sales;
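With the external table in place, any HiveQL query can run over the raw CSV. A minimal sketch of driving such a query from .NET by shelling out to the Hive CLI (assumes the hive client is on the PATH; -e executes a quoted query string):

using System;
using System.Diagnostics;

public class HiveSalesSummary
{
    public static void Main()
    {
        // Total sales per product from the ext_sales table defined above
        const string hql = "SELECT productid, SUM(sales_amount) FROM ext_sales GROUP BY productid;";

        var psi = new ProcessStartInfo("hive", "-e \"" + hql + "\"")
        {
            UseShellExecute = false,
            RedirectStandardOutput = true
        };
        using (var hive = Process.Start(psi))
        {
            Console.WriteLine(hive.StandardOutput.ReadToEnd());
            hive.WaitForExit();
        }
    }
}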
14. Sqoop
Data transfer to & from Hadoop & SQL Server
‣ sqoop import --connect jdbc:sqlserver://localhost --username sqoop --password
password --table customers -m 1
‣ > hadoop fs -cat /user/mark/customers/part-m-00000
‣ > 5,Bob Smith
‣ sqoop export --connect jdbc:sqlserver://localhost --username sqoop --password
password -m 1 --table customers --export-dir /user/mark/data/employees3
‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in
32.6364 seconds (6.1588 bytes/sec)
‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
15. SQL Server Big Data – Data Loading
[Diagram: data loading from SQL Server into an Amazon S3 bucket, feeding Amazon HDFS & EMR]
16. Role of NoSQL in a Big Data Analytics Solution
‣ Use NoSQL to store data quickly without the overhead of RDBMS
‣ HBase, Plain Old HDFS, Cassandra, MongoDB, Dynamo, just to name a few
‣ Why NoSQL?
‣ In the world of “Big Data”
‣ “Schema later”
‣ Ignore ACID properties
‣ Drop data into key-value store quick & dirty
‣ Worry about query & read later
‣ Why NOT NoSQL?
‣ In the world of Big Data Analytics, you will need support from analytical tools with a
SQL, SAS, MR interface
‣ SQL Server and NoSQL
‣ Not a natural fit
‣ Use HDFS or your favorite NoSQL database
‣ Consider turning off SQL Server locking mechanisms
‣ Focus on writes, not reads (read uncommitted)
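On that last point, a minimal ADO.NET sketch of an analytical read at the read-uncommitted isolation level, so reporting queries don't take shared locks that block the write-heavy load path (connection string and table name are hypothetical):

using System;
using System.Data;
using System.Data.SqlClient;

public class DirtyReadQuery
{
    public static void Main()
    {
        using (var conn = new SqlConnection("Server=localhost;Database=BigDataDW;Integrated Security=true"))
        {
            conn.Open();
            // READ UNCOMMITTED skips shared locks, trading read consistency for write throughput
            using (var tx = conn.BeginTransaction(IsolationLevel.ReadUncommitted))
            using (var cmd = new SqlCommand("SELECT COUNT(*) FROM dbo.FactSales", conn, tx))
            {
                Console.WriteLine(cmd.ExecuteScalar());
            }
        }
    }
}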
17. SQL Server Big Data Environment
‣ SQL Server Database
‣ SQL 2012 Enterprise Edition
‣ Page Compression
‣ 2012 Columnar Compression on Fact Tables
‣ Clustered Index on all tables
‣ Auto-update Stats Async
‣ Partition Fact Tables by month and archive data with sliding window technique
‣ Drop all indexes before nightly ETL load jobs
‣ Rebuild all indexes when ETL completes (see the sketch after this slide)
‣ SQL Server Analysis Services
‣ SSAS 2012 Enterprise Edition
‣ 2008 R2 OLAP cubes partition-aligned with the DW
‣ 2012 in-memory tabular cubes
‣ All access through MSMDPUMP or SharePoint
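A minimal sketch of the nightly index cycle called out above, issued from C# (connection string, table, and index names are hypothetical; only nonclustered indexes are disabled, since disabling the clustered index would take the table offline):

using System.Data.SqlClient;

public class NightlyIndexCycle
{
    static void Exec(SqlConnection conn, string sql)
    {
        using (var cmd = new SqlCommand(sql, conn)) { cmd.ExecuteNonQuery(); }
    }

    public static void Main()
    {
        using (var conn = new SqlConnection("Server=localhost;Database=BigDataDW;Integrated Security=true"))
        {
            conn.Open();
            // Disable the nonclustered index before the bulk ETL load...
            Exec(conn, "ALTER INDEX IX_FactSales_ProductId ON dbo.FactSales DISABLE");
            // ...(run the ETL load here)...
            // Rebuild once the load completes; REBUILD re-enables the index
            Exec(conn, "ALTER INDEX IX_FactSales_ProductId ON dbo.FactSales REBUILD");
        }
    }
}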
18. SQL Server Big Data Analytics Features
‣ Columnstore (see the sketch after this list)
‣ Sqoop adapter
‣ PolyBase
‣ Hive
‣ In-memory analytics
‣ Scale-out MPP
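For the columnstore item above, a minimal sketch of creating a SQL Server 2012 nonclustered columnstore index from C# (index, table, and column names are hypothetical; note that 2012 columnstore indexes are read-only, which fits the drop-before-load pattern from the previous slide):

using System.Data.SqlClient;

public class ColumnstoreSetup
{
    public static void Main()
    {
        const string sql =
            "CREATE NONCLUSTERED COLUMNSTORE INDEX IX_FactSales_CStore " +
            "ON dbo.FactSales (ProductId, Quantity, SalesAmount)";

        using (var conn = new SqlConnection("Server=localhost;Database=BigDataDW;Integrated Security=true"))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();  // fact-table scans can now use columnstore batch processing
        }
    }
}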
19. Microsoft’s Data Solution – Big Data & PDW
[Diagram: familiar end-user tools (Excel with PowerPivot, predictive analytics, embedded BI, data marketplace) sit on top of the BI platform (SSRS, SSAS); Parallel Data Warehouse holds hundreds of TB of structured data while Hadoop on Windows Azure and Hadoop on Windows Server hold petabytes of unstructured data; connectors feed unstructured and structured data in from sensors, devices, bots, crawlers, the Data Market, and ERP/CRM/LOB apps]
20. MICROSOFT BIG DATA
[Diagram: three themes. Immersive data experiences: PowerPivot, Power View, self-service, collaboration, corporate apps, devices. Connecting with the world’s data: combine, discover, refine. Any data, any size, anywhere: Microsoft HDInsight Server, HDInsight Service, StreamInsight, and Parallel Data Warehouse spanning relational, non-relational, analytical, and streaming data]
22. Microsoft .NET Hadoop APIs
‣ WebHDFS (see the sketch after this list)
‣ LINQ to Hive
‣ MapReduce
‣ C#
‣ Java
‣ Hive
‣ Pig
‣ http://hadoopsdk.codeplex.com/
‣ SQL on Hadoop
‣ Cloudera Impala
‣ Teradata SQL-H
‣ Microsoft PolyBase
‣ Hadapt
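For the WebHDFS item above, a minimal sketch of reading an HDFS file over the WebHDFS REST API (host name is hypothetical; 50070 is the default name node HTTP port, and op=OPEN is the standard read operation):

using System;
using System.Net.Http;

public class WebHdfsRead
{
    public static void Main()
    {
        // op=OPEN redirects to a data node and streams the file contents back;
        // HttpClient follows the redirect automatically
        var uri = "http://namenode:50070/webhdfs/v1/import/sales.csv?op=OPEN";

        using (var client = new HttpClient())
        {
            Console.WriteLine(client.GetStringAsync(uri).Result);
        }
    }
}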
23. Data Movement to the Cloud
‣ Use Windows Azure Blob Storage
• Already stored in 3 copies
• Hadoop can read from Azure blob storage
• Allows you to upload while using no Hadoop network or CPU resources
‣ Compress files
• Hadoop can read Gzip
• Uses less network resources than uncompressed
• Costs less for direct storage
• Compress directories where source files are created as well (see the sketch below)
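Since Hadoop reads gzip natively, a minimal sketch of compressing a source file before uploading it to blob storage (paths are hypothetical; GZipStream is in the .NET base class library):

using System.IO;
using System.IO.Compression;

public class CompressBeforeUpload
{
    public static void Main()
    {
        // Gzip the raw export; Hadoop reads the .gz directly, so the upload
        // uses less bandwidth and the blob costs less to store
        using (var source = File.OpenRead(@"C:\exports\sales.csv"))
        using (var target = File.Create(@"C:\exports\sales.csv.gz"))
        using (var gzip = new GZipStream(target, CompressionMode.Compress))
        {
            source.CopyTo(gzip);
        }
    }
}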
24. Wrap-up
‣ What is a Big Data approach to Analytics?
‣ Massive scale
‣ Data discovery & research
‣ Self-service
‣ Reporting & BI
‣ Why do we take this Big Data Analytics approach?
‣ TBs of change data in each subject area
‣ The source data is variable and unstructured
‣ SSIS ETL alone couldn’t keep up or handle the complexity
‣ SQL Server 2012 columnstore and tabular SSAS 2012 are key to using SQL
Server for Big Data
‣ With the configs mentioned previously, SQL Server works great
‣ Analytics on Big Data also requires Big Data Analytics tools
‣ Aster, Tableau, PowerPivot, SAS, Parallel Data Warehouse