Whether or not you’ve heard of Google’s MapReduce, its impact on Big Data applications, data warehousing, ETL,
business intelligence, and data mining is reshaping the market for business analytics and data processing.
Attend this session to hear from Curt Monash on the basics of the MapReduce framework, how it is used, and what implementations like SQL-MapReduce enable.
In this session you will learn:
* The basics of MapReduce, key use cases, and what SQL-MapReduce adds
* Which industries and applications are heavily using MapReduce
* Recommendations for integrating MapReduce into your own BI and data warehousing environment
Mastering MapReduce: MapReduce for Big Data Management and Analysis
1. Mastering MapReduce Series, Session I: MapReduce for Big Data Management and Analysis. Curt Monash, Monash Research; Steve Wooledge, Aster Data; Peter Pawlowski, Aster Data; Eric Friedman, Aster Data. October 15th, 2009
2. Topics: Aster Data Overview; SQL-MapReduce; Example SQL-MapReduce applications; SQL-MapReduce Syntax/example; Q&A
3. Aster Data: Creating the Next-Generation Data Management System. Founded in 2005 to revolutionize data processing & management of very large data volumes. Founding team innovated on the ‘big data’ problem at Stanford University and were joined by big data experts from Google, Oracle, and Microsoft. Aster’s first commercial product, nCluster, has been in market since 2007. Customers include MySpace, LinkedIn, Coremetrics, Akamai, others. Since 2008, innovated on Google’s well-known MapReduce framework to transform data processing. Created patent-pending SQL-MapReduce (In-Database MapReduce).
18. Aster’s Solution - A Massively Parallel Data Warehouse With the Unique Ability to Embed Applications. Deeper, Faster Analytics on Big Data. [Architecture diagram: Aster nCluster System] Key classes of applications: leading BI tools; custom Java applications; custom .NET applications; packaged analytic apps; other applications (C, C++, Perl, Python…). Interfaces: Aster’s SQL-MapReduce or standard interfaces. Numbered layers: (1) Commodity Hardware; (2) Massively-Parallel Data Store; (3) MPP Data Warehouse with Incremental Scaling (scale by function), and Embedded Parallelized Apps that execute within the DB; (4) Dynamic Workload Manager (WLM); (5) High Volume, Fast Querying (industry-leading WLM: 300+ concurrent workloads); (6) Unified Interface (SQL, SQL-MapReduce).
19. Aster SQL-MapReduce (SQL-MR): Bring your applications to the data. “Data-Applications” Development Platform. Rich portfolio of supported languages: Java, .NET, Python, Ruby, Perl, C++, R and more. Use SQL to develop rich data apps. Expressive flexibility. Reusability across applications and reports.
26. Aster’s Patent-Pending SQL-MapReduce: Enables faster, easier, and more powerful analytics. SQL-MapReduce framework (for developers to create and extend): Flexible (MapReduce expressiveness, languages, polymorphism); Performance (massive parallelization, computational push-down); Availability (fault isolation, resource management). Powerful SQL-MR functions (for analysts to consume): Deep insights (unlimited analytical power at your disposal); Ease of use (simply plug in to the SQL you know and love). The Power of Aster’s SQL-MapReduce Framework: (1) Write: write a SQL-MR function in Java, C, etc. (2) Install: install inside Aster nCluster. (3) Use and Reuse: invoke SQL-MR function from SQL.
35. Expensive HW & maintenance. Best of both worlds! Traditional Database
36. MapReduce Applications. Behavioral Analytics (CRM): sequential pattern analysis (e.g., up-sell/cross-sell); spam/BOT analysis; sessionization analysis. Risk & Fraud analysis: consumer credit scoring/default risk, market risk/VaR, operational risk, etc.; fraud detection. Graph analysis: social network “connectedness” (e.g., SSSP, APSP, etc.). Text analysis: tokenization (e.g., word count classification); natural language processing. Statistical analysis (machine learning): linear regression; K-means clustering; R Project algorithms.
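As a minimal sketch of the MapReduce pattern behind the word-count style of text analysis listed above, the following Python runs the map, shuffle, and reduce phases in a single process. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, not part of any Aster or Google API, and a real framework would distribute each phase across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every whitespace-separated token."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over an iterable of text documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big analytics", "data warehouse"])
print(counts)
```

The map and reduce functions never share state, which is what lets a framework run them in parallel on separate partitions of the data.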
37. Aster’s SQL-MapReduce Library: Pre-packaged (SDK), SQL-MR APIs, and documentation. Pre-packaged SQL-MR sample functions: nPath – complex sequential analysis for time-series and behavioral pattern analysis; SSSP – single source shortest path, a graph algorithm useful for fraud and segmentation analysis; Sessionize – session categorization based on a sequence of clicks within a specified timeout; Approximate percentiles – ultra-fast percentile (or N-tile) statistical distribution analysis; Linear regression – statistical technique used to predict values based on a set of related variables; Tokenize – text analysis that splits strings into words, categorizes them, and does a word count.
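To make the idea of sessionization concrete, here is a hedged Python sketch: clicks are grouped by user, ordered by time, and a new session starts whenever the gap between consecutive clicks exceeds a timeout. This is not Aster's Sessionize function; the `sessionize` name, the `(user_id, timestamp)` input shape, and the numeric timestamps are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def sessionize(clicks, timeout):
    """Assign a per-user session number to each (user_id, timestamp) click.

    A click starts a new session when it arrives more than `timeout`
    time units after the same user's previous click.
    """
    labeled = []
    ordered = sorted(clicks, key=itemgetter(0, 1))  # by user, then time
    for user_id, rows in groupby(ordered, key=itemgetter(0)):
        session_id = 0
        previous = None
        for _, timestamp in rows:
            if previous is not None and timestamp - previous > timeout:
                session_id += 1
            previous = timestamp
            labeled.append((user_id, timestamp, session_id))
    return labeled


sessions = sessionize(
    [("u1", 100), ("u1", 0), ("u2", 5), ("u1", 20)], timeout=30
)
print(sessions)
```

Because each user's clicks are processed independently, the per-user loop is the part a parallel system could run on separate partitions.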
39. Requires dozens of SQL queries every N minutes (dozens of times per day)
52. What is Aster nPath? nPath is a SQL-MR function included with nCluster. nPath enables analysis of ordered data: clickstream data, financial transaction data, user interaction data, anything of a time series nature. Leverages the power of the SQL-MR framework to transcend SQL’s limitations with respect to ordered data.
53. Example: Analyzing a Clickstream. Business question: How many distinct users (1) start at the home page, (2) click on an auction, (3) view the seller’s profile, and (4) bid on the item? Available data: a database table clicks, populated with web log data, that has columns user_id, timestamp, and page_type.
54. The nPath query: SELECT count(distinct user_id) FROM nPath( ON clicks PARTITION BY user_id ORDER BY timestamp MODE(OVERLAPPING) PATTERN('H.A.P.B') SYMBOLS( page_type = 'home' AS H, page_type = 'auction' AS A, page_type = 'profile' AS P, page_type = 'bid' AS B) RESULT(first(user_id of H) as user_id) ); (1) Partition: Form groups by user_id. (2) Order: Sort each group by timestamp.
55. The nPath query: SELECT count(distinct user_id) FROM nPath( ON clicks PARTITION BY user_id ORDER BY timestamp MODE(OVERLAPPING) PATTERN('H.A.P.B') SYMBOLS( page_type = 'home' AS H, page_type = 'auction' AS A, page_type = 'profile' AS P, page_type = 'bid' AS B) RESULT(first(user_id of H) as user_id) ); (3a) Match: Define a set of symbols. (3b) Match: Define the subsequences of interest via regex.
56. The nPath query: SELECT count(distinct user_id) FROM nPath( ON clicks PARTITION BY user_id ORDER BY timestamp MODE(OVERLAPPING) PATTERN('H.A.P.B') SYMBOLS( page_type = 'home' AS H, page_type = 'auction' AS A, page_type = 'profile' AS P, page_type = 'bid' AS B) RESULT(first(user_id of H) as user_id) ); (4) Compute Aggregates over matched subsequences.
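The four steps annotated on the nPath slides (partition, order, match, aggregate) can be sketched in plain Python. This is only an in-memory illustration, not how nPath is implemented: it assumes that `H.A.P.B` means the four page types occur on consecutive rows for a user, and that rows with any other page type map to a placeholder symbol that breaks a match. The `count_matching_users` name and the tuple layout are hypothetical.

```python
from collections import defaultdict
from operator import itemgetter

# Step 3a: symbols, mirroring the SYMBOLS clause of the query.
SYMBOLS = {"home": "H", "auction": "A", "profile": "P", "bid": "B"}
# Step 3b: the pattern H.A.P.B, read here as four consecutive rows (assumption).
PATTERN = "HAPB"


def count_matching_users(clicks):
    """Count distinct users whose ordered clicks contain the pattern.

    `clicks` is an iterable of (user_id, timestamp, page_type) tuples.
    """
    # Step 1: partition by user_id.
    by_user = defaultdict(list)
    for user_id, timestamp, page_type in clicks:
        by_user[user_id].append((timestamp, page_type))

    matched = set()
    for user_id, rows in by_user.items():
        # Step 2: order each partition by timestamp.
        rows.sort(key=itemgetter(0))
        # Step 3: encode rows as symbols and search for the pattern.
        encoded = "".join(SYMBOLS.get(page, "?") for _, page in rows)
        if PATTERN in encoded:
            matched.add(user_id)

    # Step 4: aggregate over matches (count of distinct users).
    return len(matched)


clicks = [
    (1, 10, "home"), (1, 11, "auction"), (1, 12, "profile"), (1, 13, "bid"),
    (2, 10, "home"), (2, 11, "auction"), (2, 12, "bid"),
    (3, 14, "bid"), (3, 11, "home"), (3, 10, "search"),
    (3, 12, "auction"), (3, 13, "profile"),
]
print(count_matching_users(clicks))
```

User 2 skips the profile page and so does not match; user 3 matches only after the rows are sorted by timestamp, which shows why the ordering step matters.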
57. Market Basket Analysis Example. Question: Detect customers that purchase the same category of items in three market baskets in a row, with total value > $150.
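A single-pass reading of the market basket question can be sketched in Python as a sliding window over each customer's baskets in time order. The slide does not say how the data is laid out or whether the $150 applies to the three baskets combined, so this sketch assumes one row per basket with a single category, and a combined total; the `find_customers` name and the tuple layout are hypothetical.

```python
from collections import defaultdict
from operator import itemgetter


def find_customers(baskets, run_length=3, min_total=150):
    """Return customers with `run_length` consecutive same-category baskets
    whose combined value exceeds `min_total`.

    `baskets` is an iterable of (customer_id, timestamp, category, value).
    """
    by_customer = defaultdict(list)
    for customer_id, timestamp, category, value in baskets:
        by_customer[customer_id].append((timestamp, category, value))

    hits = set()
    for customer_id, rows in by_customer.items():
        rows.sort(key=itemgetter(0))  # baskets in purchase order
        for start in range(len(rows) - run_length + 1):
            window = rows[start:start + run_length]
            same_category = len({category for _, category, _ in window}) == 1
            total = sum(value for _, _, value in window)
            if same_category and total > min_total:
                hits.add(customer_id)
                break
    return hits


baskets = [
    ("c1", 1, "shoes", 60), ("c1", 2, "shoes", 50), ("c1", 3, "shoes", 45),
    ("c2", 1, "shoes", 60), ("c2", 2, "books", 100), ("c2", 3, "shoes", 60),
    ("c3", 1, "shoes", 40), ("c3", 2, "shoes", 40), ("c3", 3, "shoes", 40),
]
print(find_customers(baskets))
```

Each customer's baskets are scanned once after sorting, in contrast to expressing the same condition through several nested SQL sub-selects.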
58. Two Methods – Same Answer: Multi-pass Nested Sub-selects vs. Single Pass SQL-MR nPath Query. [Figure: result values for the two methods]
61. Upcoming Webcast: Mastering MapReduce Part II. Save the date: December 3rd. MapReduce Resources: http://www.asterdata.com/mapreduce/index.php (recorded application use-cases; code samples and tutorials). DBMS2 on MapReduce: http://www.dbms2.com/category/parallelization/mapreduce/ Aster’s SQL-MapReduce: http://www.asterdata.com/product/mapreduce.php http://www.asterdata.com/blog/index.php/category/mapreduce/ TDWI Technical whitepaper. Contact us: hello@asterdata.com, Steve.wooledge@asterdata.com. Thank You!