2. What we'll answer in 50 minutes
• Who is this guy?
• How do I enable ad hoc, self-service reporting on NoSQL?
• How do I improve the performance of dashboards on top of NoSQL?
• How do I integrate NoSQL data with my other data that is not inside NoSQL?
• How do I make simple reports easy to build while preserving the ability to run rich NoSQL queries?
3. Nicholas Goodman
• Open Source BI thought leader
– 50+ Open Source BI customer projects
– Blogger, whitepapers, etc
• Entrepreneur
– DynamoBI Corporation
– Bayon Technologies, Inc.
• Data Geek, hacker, tinkerer, committer
GOAL: Share perspectives,
research, opinions.
DISCLAIMER: Your Mileage ...
5. Promise of “Big Data”
• NoSQL/Hadoop/MapReduce Systems
– Keep more of it
– Cost effective analysis
– “Massive scale” data, now accessible to everyone (elastic)
– Not just SQL queries, more complex analysis
ACCOMPLISHED: WEB SCALE, MASSIVE
NEVER BEFORE SEEN SCALE OF DATA
STORAGE AND PROCESSING
6. Reality Check!
• Petabytes? Y
• Cheap Storage? Y
• Raw Processing? Y
• Rich Query Languages? Y
• Flexible data structures? Y
• Reliable, Fault Tolerant? Y
• Fast Queries? N
• Ad Hoc access? N
• Accessibility to commodity BI tools? N
• Easy report authoring? N
• Levels of Aggregation? N
• Integrated Data? N
Big Data has solved the INFRASTRUCTURE of
raw/core data storage but has provided less value
to what BUSINESS users want for analytics.
8. Levels of Aggregation
SAME DATA AT VARIOUS
LEVELS OF AGGREGATION
HUGELY IMPORTANT IN REAL
LIFE IMPLEMENTATIONS!
[Figure: aggregation levels, labeled 1 row, 10K, 1 million, 100 million, and 1 billion to 100 billion rows]
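To make "same data at various levels of aggregation" concrete, here is a minimal sketch that rolls a few made-up raw events up to hourly and daily counts. The event fields (`ts`, `page`) and the `rollup` helper are hypothetical, chosen only for illustration; a real system would compute these rollups inside the data store rather than in memory.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw events; in practice this level could hold billions of rows.
events = [
    {"ts": "2011-05-01T10:15:00", "page": "/home"},
    {"ts": "2011-05-01T10:45:00", "page": "/home"},
    {"ts": "2011-05-01T11:05:00", "page": "/pricing"},
    {"ts": "2011-05-02T09:30:00", "page": "/home"},
]


def rollup(rows, fmt):
    """Count events per time bucket, where fmt is a strftime pattern."""
    counts = Counter()
    for row in rows:
        bucket = datetime.fromisoformat(row["ts"]).strftime(fmt)
        counts[bucket] += 1
    return dict(counts)


hourly = rollup(events, "%Y-%m-%d %H:00")  # finer grain, more rows
daily = rollup(events, "%Y-%m-%d")         # coarser grain, fewer rows
total = len(events)                        # the single-row summary
```

A dashboard would normally read from `daily` or `total`, and only drill down to `hourly` or the raw events when a user asks for detail.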
9. Architectures
• NoSQL reports
• NoSQL thru and thru
• NoSQL + MySQL
• NoSQL as ETL Source
• NoSQL programs in BI Tools
• NoSQL via BI Database (SQL)
10. NoSQL reports
• Pay Developer to build applications for reports
[Diagram labels: Apps]
Advantages:
• 100% Richness of NoSQL
• Up to date, current
• Excellent performance on large datasets
• Custom built, beautiful reports/dashboards
• Single system to manage
Disadvantages:
• $$, developer driven process
• No commodity BI tools
• Managing rollups/summaries
• Schema-less = Harder!
• Hard to integrate other reporting information
11. NoSQL thru and thru
• Pay Developer to build FLEXIBLE applications for reports
[Diagram labels: Indices, Aggs, Advanced Apps]
Advantages:
• All of NoSQL report advantages
• Managed aggregations, rollups
• “Guided Adhoc” available inside application
• Higher performance for dashboards/summaries
Disadvantages:
• $$, developer driven process
• $$, app required for aggs
• No commodity BI tools
• Hard to integrate other reporting information
• Limited AdHoc (only developer built combinations)
12. NoSQL + MySQL
• Pay Developer to build FLEXIBLE applications for reports
[Diagram labels: ETL App, MySQL]
Advantages:
• Less IT $$ since developers aren't “building reports”
• Rich, NoSQL analysis left in place (ETL + NoSQL)
• Easy, Ad Hoc reporting via commodity BI tools
• Easier to understand data for self service reports
Disadvantages:
• Data freshness (24 hrs old)
• Once into MySQL no rich NoSQL application use (M/R)
• BI Tool can connect ONLY to data in MySQL, not NoSQL
• Aggregations still self managed in MySQL
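The NoSQL + MySQL pattern can be sketched roughly as follows: summarize on the NoSQL side, then push the small summarized result into a relational table that a BI tool can query. This is an illustration under stated assumptions, not anyone's production pipeline. The documents, the `summarize` and `load` helpers, and the `country_summary` table are all hypothetical, and SQLite (from the Python standard library) stands in for MySQL so the sketch runs anywhere.

```python
import sqlite3
from collections import defaultdict

# Hypothetical documents as they might sit in the NoSQL store.
docs = [
    {"user": "a", "country": "US", "purchases": [{"amount": 10.0}, {"amount": 5.0}]},
    {"user": "b", "country": "US", "purchases": [{"amount": 7.5}]},
    {"user": "c", "country": "DE", "purchases": []},
]


def summarize(documents):
    """Stand-in for the heavy lifting done inside the NoSQL system (e.g. an M/R job)."""
    totals = defaultdict(lambda: [0, 0.0])  # country -> [user count, revenue]
    for doc in documents:
        entry = totals[doc["country"]]
        entry[0] += 1
        entry[1] += sum(p["amount"] for p in doc["purchases"])
    return [(c, users, revenue) for c, (users, revenue) in sorted(totals.items())]


def load(conn, rows):
    """Replace the summary table contents with the latest nightly snapshot."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS country_summary "
        "(country TEXT PRIMARY KEY, users INTEGER, revenue REAL)"
    )
    conn.execute("DELETE FROM country_summary")
    conn.executemany("INSERT INTO country_summary VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # assumption: SQLite standing in for MySQL
load(conn, summarize(docs))

# A commodity BI tool would now issue plain SQL against the summary table.
report = conn.execute(
    "SELECT country, users, revenue FROM country_summary ORDER BY revenue DESC"
).fetchall()
```

The sketch also shows the trade-off listed on the slide: once the data is in the summary table, only what `summarize` chose to keep is available to the BI tool, and it is only as fresh as the last load.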
13. NoSQL as ETL Data Source
• NoSQL treated like any other data source
[Diagram labels: Informatica, Teradata]
Advantages:
• Allows use of consolidated BI tool for AdHoc
• Enables integrated (combined) datasets for reporting
• Aggregations often “managed”
• Best of Breed tools
Disadvantages:
• ETL Development Expense
• Data Latency
• Loss of NoSQL language richness
• Traditional DW tools are $$
• Scaling issues with DW Database
14. NoSQL programs in BI Tools
• Write a program in BI tool that flattens data, output into report
Advantages:
• Rich use of NoSQL native language
• Direct, up to date access
• Access to 100% of dataset
• Leverage “guided” report parameter pages
• Less expensive than apps
Disadvantages:
• Developer required to write program ($$)
• Slow-er (aggs, summaries)
• Lacks integration with other datasets
• Still (usually) no AdHoc access
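The "program that flattens data" in this approach might look something like the sketch below, which turns nested, schema-less documents into rows with a shared column list that a report can consume. The `flatten` and `to_rows` helpers and the sample documents are hypothetical; an actual BI tool would supply its own scripting hooks and data-access calls.

```python
def flatten(doc, parent_key="", sep="."):
    """Flatten nested dicts into one level of dotted column names."""
    flat = {}
    for key, value in doc.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


def to_rows(documents):
    """Turn schema-less documents into rows sharing one sorted column list."""
    flat_docs = [flatten(d) for d in documents]
    columns = sorted({c for d in flat_docs for c in d})
    rows = [[d.get(c) for c in columns] for d in flat_docs]  # missing fields become None
    return columns, rows


# Hypothetical documents with differing shapes.
docs = [
    {"id": 1, "user": {"name": "ann", "geo": {"country": "US"}}},
    {"id": 2, "user": {"name": "bob"}, "referrer": "search"},
]
columns, rows = to_rows(docs)
```

Because every report needs a program like this, the approach keeps the developer in the loop, which is the main cost noted on the slide.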
15. NoSQL via BI Database (SQL)
• Enable NoSQL data access via SQL (gasp!)
[Diagram labels: Live Query; Cached, 24hr data]
Advantages:
• Easy reports, easy (SQL)
• Integration with other data
• ETL is simple INSERT/MERGEs
• Live, up to date access
• High performance, cached data
• AdHoc access to Live + Cached
• Aggregations/Summaries
Disadvantages:
• Another system in between
• Still needs to be refreshed, nightly
• Not all capabilities for NoSQL richness available via SQL
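As a rough illustration of the cached side of this architecture, the sketch below refreshes a SQL cache of NoSQL-derived summaries with a simple upsert and then joins it to other relational data. It covers only the cached path, not live query pass-through. Everything here is an assumption made for the example: the table names (`event_cache`, `accounts`), the `refresh` helper, and the use of SQLite, whose `INSERT OR REPLACE` stands in for a `MERGE` statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite standing in for the BI database
conn.executescript("""
    CREATE TABLE event_cache (day TEXT, account TEXT, events INTEGER,
                              PRIMARY KEY (day, account));
    CREATE TABLE accounts (account TEXT PRIMARY KEY, region TEXT);
    INSERT INTO accounts VALUES ('acme', 'west'), ('globex', 'east');
""")


def refresh(conn, summaries):
    """Nightly refresh: upsert summary rows keyed by (day, account)."""
    conn.executemany("INSERT OR REPLACE INTO event_cache VALUES (?, ?, ?)", summaries)
    conn.commit()


refresh(conn, [("2011-05-01", "acme", 120), ("2011-05-01", "globex", 8)])
refresh(conn, [("2011-05-01", "acme", 125)])  # a later run corrects a count

# Ad hoc SQL that integrates the cached NoSQL data with other data.
result = conn.execute("""
    SELECT a.region, SUM(c.events)
    FROM event_cache c JOIN accounts a ON a.account = c.account
    GROUP BY a.region ORDER BY a.region
""").fetchall()
```

The second `refresh` call overwrites the earlier row rather than duplicating it, which is what keeps the ETL step down to plain INSERT/MERGE-style statements.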
16. Mozilla: NoSQL thru and thru (DB)
• Socorro Project: Crash reports, optionally sent to Mozilla
• https://crash-stats.mozilla.com
17. X: NoSQL via SQL
• Using “Splunk” (i.e., a commercial NoSQL-eee data aggregator, etc.)
• Desire to use Tableau for advanced analytics/visualization
18. Meteor Solutions: NoSQL thru and thru
• Using Cloudant BigCouch solution (SaaS)
• High performance set of multipurpose indices on predefined aggregations
• Up to date aggregation/reports
• Better fit for Social Media graph structures over relational DB
• Custom built BI applications (dashboards/reports) providing a
flexible guided view through data
[Diagram labels: Advanced Apps]
19. A,B,C: NoSQL + MySQL
• Many, many companies (3 we've worked with)
• All “web related” companies (some semi-structured data, but mostly volume)
• Heavy lifting, storage, and “ETL/data preparation” inside Hadoop
• Push summarized, aggregated data into MySQL for analysis with easy dashboarding/BI tools
[Diagram labels: ETL App, MySQL]