Big Data and NoSQL for Database and BI Pros
1. Big Data-BI Fusion:
Microsoft HDInsight & MS BI
Level: Intermediate
March 28, 2013
Andrew Brust
CEO and Founder
Blue Badge Insights
2. Meet Andrew
• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair VSLive! and 18 years as a speaker
• Founder, MS BI and Big Data User Group of NYC
– http://www.msbigdatanyc.com
• Co-moderator, NYC .NET Developers Group
– http://www.nycdotnetdev.com
• “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News
• brustblog.com, Twitter: @andrewbrust
5. What is Big Data?
• 100s of TB into PB and higher
• Involving data from financial systems, sensors, web logs, social media, etc.
• Parallel processing often involved
– Hadoop is emblematic, but other technologies are Big Data too
• Processing of data sets too large for transactional databases
– Analyzing interactions, rather than transactions
– The three V’s: Volume, Velocity, Variety
• Big Data tech is sometimes imposed on small-data problems
6. The Hadoop Stack
• MapReduce, HDFS
• Database
• RDBMS Import/Export
• Query: HiveQL and Pig Latin
• Machine Learning/Data Mining
• Log file integration
7. What’s MapReduce?
• Divide-and-conquer approach to “Big” data processing
• Partition the data and send it to mappers (nodes in the cluster)
• Mappers pre-process into key-value pairs; all output for a given key then goes to a single reducer
• Reducer performs aggregations; one output per key, with a value
• Map and Reduce code is natively written as Java functions
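The map/shuffle/reduce flow above can be sketched in miniature in plain Python (a single-process word count; real Hadoop jobs run the same two functions in parallel across cluster nodes, and the framework does the shuffle):

```python
from collections import defaultdict

def mapper(line):
    # Map: pre-process raw input into key-value pairs, here (word, 1)
    for word in line.split():
        yield word.lower(), 1

def reducer(key, values):
    # Reduce: aggregate all values for one key into a single output
    return key, sum(values)

def run_job(lines):
    # Shuffle: route every pair for a given key to the same reducer
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

counts = run_job(["big data big deal", "data rules"])
# counts maps each word to its total occurrence count
```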
9. HDFS
• File system whose data gets distributed over commodity disks on commodity servers
• Data is replicated
• If one box goes down, no data is lost
– “Shared Nothing”
– Except the name node
• BUT: Immutable
– Files can only be written to once
– So updates require drop + re-write (slow)
– You can append, though
– Like a DVD/CD-ROM
10. HBase
• A Wide-Column Store, NoSQL database
• Modeled after Google BigTable
• HBase tables are HDFS files
– Therefore, Hadoop-compatible
• Hadoop often used with HBase
– But you can use either without the other
• HDInsight (more on the next slide) does not (yet) include HBase
11. Microsoft HDInsight
• Developed with Hortonworks; incorporates the Hortonworks Data Platform (HDP) for Windows
• Windows Azure HDInsight and Microsoft HDInsight Server
– Single-node preview runs on Windows client
• Includes an ODBC Driver for Hive
– And an Excel add-in that uses it
• JavaScript MapReduce framework
• Microsoft contributes it all back to the open source Apache project
12. Azure HDInsight Provisioning (New!)
• HDInsight preview is now public, so…
• Go to the Windows Azure portal
• Sign up for the public preview
• Select HDInsight from the left navbar
• Click the “+ NEW” button at lower-left
• Specify cluster name, number of nodes, admin password, storage account
– Credentials are used for browser login, RDP and ODBC
– During the preview, you are billed 50% of Azure compute rates for the nodes in the cluster; this will be 100% at GA
• Click “CREATE HDINSIGHT CLUSTER”
• Wait for provisioning to complete
• Navigate to http://clustername.azurehdinsight.net
14. Submitting, Running and Monitoring Jobs
• Upload a JAR
• Use Streaming
– Use languages other than Java to write MapReduce code
– Python is a popular option
– Any executable works, even C# console apps
– On HDInsight, JavaScript works too
– Still uses a JAR file: streaming.jar
• Run at the command line (passing the JAR name and params) or use the GUI
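Under streaming, the mapper and reducer are just programs that read lines from stdin and write tab-separated key-value pairs to stdout. A minimal Python pair might look like this (a word count; combining both roles in one file behind a flag is an illustrative layout, not a streaming requirement):

```python
import sys

def map_stream(lines):
    # Mapper: emit "word<TAB>1" for every word in the input
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_stream(lines):
    # Reducer: streaming delivers input sorted by key (the framework's
    # shuffle/sort), so all counts for one word arrive contiguously
    current, count = None, 0
    for line in lines:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield f"{current}\t{count}"

if __name__ == "__main__":
    # Illustrative: a command-line flag selects the role; in a real
    # streaming job the mapper and reducer are separate executables
    stage = reduce_stream if "reduce" in sys.argv[1:] else map_stream
    for out in stage(sys.stdin):
        print(out)
```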
17. The “Data-Refinery” Idea
• Use Hadoop to “on-board” unstructured data, then extract manageable subsets
• Load the subsets into conventional DW/BI servers and use familiar analytics tools to examine them
• This is the current rationalization of Hadoop and BI tools’ coexistence
• Will it stay this way?
18. Hive
• Used by most BI products that connect to Hadoop
• Provides a SQL-like abstraction over Hadoop
– Officially HiveQL, or HQL
• Works on its own tables, but also on HBase
• A query generates a MapReduce job, whose output becomes the result set
• Microsoft has a Hive ODBC driver
– Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)
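To give a feel for the abstraction, a HiveQL query reads like ordinary SQL even though Hive compiles it into a MapReduce job behind the scenes (the table and column names here are hypothetical, for illustration only):

```sql
-- Illustrative: top 10 URLs by hit count from a hypothetical web-log table
SELECT url, COUNT(*) AS hits
FROM web_logs
WHERE log_date = '2013-03-28'
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```

A BI tool connected through the Hive ODBC driver would receive the job's output as an ordinary result set.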
20. HDInsight Data Sources
• Files in HDFS
• Azure Blob Storage (Azure HDInsight only)
– Use asv:// URLs (“Azure Storage Vault”)
• Hive tables
• HBase?
21. Just-in-time Schema
• When looking at unstructured data, schema is imposed at query time
• Schema is context-specific
– If scanning a book, are the values words, lines, or pages?
– Are notes a single field, or is each word a value?
– Are date and time two fields or one?
– Are street, city, state, zip separate or one value?
– Pig and Hive let you determine this at query time
– So does the Map function in MapReduce code
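In code terms, just-in-time schema means the same raw line can be parsed under different schemas at read time, depending on the question being asked. A small Python sketch (the field layout is illustrative):

```python
# One raw record; no schema is fixed until a query reads it
raw = "2013-03-28 14:05:00,123 Main St,Springfield,IL,62704"

def as_event_time(line):
    # Schema A: treat date and time as a single timestamp field
    return {"timestamp": line.split(",")[0]}

def as_address(line):
    # Schema B: treat the trailing fields as separate address parts
    _, street, city, state, zip_code = line.split(",")
    return {"street": street, "city": city, "state": state, "zip": zip_code}
```

Each function plays the role a Pig/Hive schema declaration or a Map function would play: it decides, at query time, what the values mean.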
22. How Does MS BI Fit In?
• Excel, PowerPivot: can query via the Hive ODBC driver
• Analysis Services (SSAS) Tabular Mode
– Also compatible with the Hive ODBC Driver; Multidimensional mode is not
• Power View
– Works against PowerPivot and SSAS Tabular
• RDBMS + Parallel Data Warehouse (PDW)
– Sqoop connectors
– Columnstore Indexes (Enterprise Edition and PDW only)
• PDW: PolyBase
23. Excel, PowerPivot
• Excel and PowerPivot use the BI Semantic Model (BISM), which can query Hadoop via Hive and its ODBC driver
• Excel also features “Data Explorer” (currently in Beta), which can query HDFS directly and insert the results into a BISM repository
• Excel BISM accommodates millions of rows through compression; not petabyte scale, but sufficient to store and analyze the output of Hadoop queries
24. PowerPivot, SSAS Tabular
• SQL Server Analysis Services Tabular mode is the enterprise server implementation of BISM
• Features partitioning and role-based security
• Can store billions of rows, so it is even better for analyzing Hadoop output
• Excel-based BISM repositories can be upsized to SSAS Tabular
26. Sqoop
• Acronym for “SQL to Hadoop”
• Essentially a technology for moving data between data warehouses and Hadoop
• Command-line utility; allows specification of the source/target HDFS file and the relational server, database and table
• Sqoop connectors available for SQL Server and PDW
• Sqoop generates a MapReduce job to extract data from, or insert data into, HDFS
27. PDW, PolyBase
• SQL Server Parallel Data Warehouse (PDW) is a Massively Parallel Processing (MPP) data warehouse appliance version of SQL Server
• MPP manages a grid of relational database servers for divide-and-conquer processing of large data sets
• PDW v2 includes “PolyBase,” a component that allows PDW to query data in Hadoop directly
– Bypasses MapReduce; addresses data nodes directly and orchestrates parallelism itself
28. PolyBase Versus Hive, Sqoop
• Hive and Sqoop generate MapReduce jobs and work in batch mode
• PolyBase addresses HDFS data itself
• This is true SQL over Hadoop
• Competitors:
– Cloudera Impala
– Teradata Aster SQL-H
– EMC/Greenplum Pivotal HD
– Hadapt
29. Usability Impact
• PowerPivot makes analysis much easier and self-service
• Power View is great for discovery and visualization; also self-service
• Combine these with the Hive ODBC driver and suddenly Hadoop is accessible to business users
• Caveats
– Someone has to write the HiveQL
– You can query Big Data, but the result set must be smaller
30. Resources
• Big On Data blog
– http://www.zdnet.com/blog/big-data
• Apache Hadoop home page
– http://hadoop.apache.org/
• Hive & Pig home pages
– http://hive.apache.org/
– http://pig.apache.org/
• Hadoop on Azure home page
– https://www.hadooponazure.com/
• SQL Server 2012 Big Data
– http://bit.ly/sql2012bigdata