This document discusses how Tableau and MongoDB can work together for visual analytics of big data. It describes how MongoDB is a NoSQL database that can handle unstructured and semi-structured data like JSON, and how Tableau allows users to connect to MongoDB through an ODBC driver and visualize the data without needing to write code. The document outlines scenarios where big data comes from human, machine, and process sources and how the combination of Tableau and MongoDB's schema-on-read approach reduces the need for ETL. It also previews demos of connecting Tableau to MongoDB using both the ODBC driver and a PostgreSQL interface.
Tableau & MongoDB: Visual Analytics at the Speed of Thought
1. Tableau & MongoDB – Visual Analytics at the Speed of Thought
MongoDB World 2015
Tableau Software
June 2nd, 2015
2. Introduction – Our Viz-dentials
Jeff Feng | Product Manager – Big Data | @jtfeng | jfeng@tableau.com
Clara Siegel | Product Manager | @clara_siegel | csiegel@tableau.com
14. Visual analytics
• Ad-hoc calculations
• New Calculation Editor
• Auto-complete for calculations
• Level of detail expressions
• Drag-and-drop analytics
• Instant Analytics
• Lasso and radial selection
• Geographic search
• New pan-and-zoom experience
• Demographic data layers
Tableau Server
• Vizportal – new & improved Server interface
• Infinite scrolling
• Universal search
• Improved Permissions management
• High Availability Improvements
• REST APIs for provisioning and content management
• Tabcmd Improvements
• New Admin Views
Performance
• Parallel queries
• Data Engine Vectorization
• Parallel aggregation
• Temp table support on Data Server
• Saved Query Caching
• Query Fusion
• Query Batch Ordering
• Shadow Extracts
User Experience
• Redesigned Start and Connect Experience
• Enhanced Story Points Formatting
• Responsive Marks
• Fast Tooltips
• Thumbnail Previews in Desktop
• Reset button in continuous Quick Filters
Data Preparation
• Excel Clean-up
• Pivot
• Data Split
• REGEX
• Metadata grid
• Data Extract API for Mac OS
• Publish and append in the TDE API
• Access files from SPSS, SAS and R
• Improved Salesforce connector
Mobile
• Redesigned App Experience
• Offline snapshots of Favorites
• Create and Edit calculations in Mobile Authoring
19. Scenarios: 3 main sources of Big Data
Human-generated data
+ Social media
+ Emails, text messages
+ YouTube videos
Machine-generated data
+ Sensors
+ Internet of Things
Process-generated data
+ Business systems
+ Web logs
21. RDBMS example
Name      Gender  Age
Michael   M       6
Jennifer  F       3

JSON example
{
  "name": { "first": "Michael", "last": "Smith" },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": { "first": "Jennifer", "last": "Gates" },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}

JSON is both schema-less and complex
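To make that concrete, here is a minimal sketch that flattens the two documents above into one table with pandas (json_normalize is a standard pandas function; the record contents are taken straight from the slide):

import pandas as pd

# The two schema-less documents from the slide above.
records = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]

# Flatten nested sub-documents into dotted column names.
flat = pd.json_normalize(records)

# Columns include hobbies, district, preschool, name.first, name.last.
# Fields that appear in only one document (district, preschool) come back
# as NaN in the other row -- a sparse, ragged table.
print(flat)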
23. Today’s use cases are driving the need for a new generation of databases

Hierarchical Databases
• HW- & application-specific data

Relational Databases
• Application independent
• Scale-up architecture
• Structured data only
• Schema-on-write
• Limited data processing
• High cost

NoSQL Databases
• ALL DATA – structured & unstructured
• Massive scale – scale-out
• Schema-on-read
• Storage with compute
• Low cost
24. Tableau plays a fundamental role in Big Data analysis
• Broad access to Big Data platforms
• Visual analytics without coding
• Hybrid data architecture
• Data blending across data sources
• Platform query performance
• Consistent interface for visualizing data
30. Tableau users can connect to MongoDB through an ODBC interface
[Diagram: Application → ODBC Driver (ODBC interface + driver implementation, translating SQL-92 to the MongoDB native API) → Data]
31. Tableau users can connect to MongoDB through an ODBC interface
[Diagram: Application → ODBC Driver (ODBC interface + driver implementation, translating SQL-92 to the MongoDB native API) → Data]

Simba MongoDB ODBC Driver
• Translates SQL into native MongoDB API calls
• Allows users to infer or define schemas on schema-less JSON data
• Converts JSON data to relational data
• Based on the ODBC 3.80 standard
• Full 64-bit and 32-bit support
• Full SQL support
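Outside of Tableau, the same driver can be exercised from any ODBC client. A minimal sketch with pyodbc, assuming a DSN named "MongoDB" configured for the Simba driver and a "users" table inferred from a collection (both names are hypothetical):

import pyodbc

# "MongoDB" is a hypothetical DSN name -- use whatever you configured
# for the Simba MongoDB ODBC driver in odbc.ini / the ODBC administrator.
conn = pyodbc.connect("DSN=MongoDB")
cursor = conn.cursor()

# The driver exposes an inferred relational schema, so plain SQL-92 works;
# "users" is a hypothetical table backed by a MongoDB collection.
cursor.execute("SELECT district, COUNT(*) AS n FROM users GROUP BY district")
for row in cursor.fetchall():
    print(row.district, row.n)
conn.close()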
32. Tableau + MongoDB Demo: Simba Driver
34. Tableau + MongoDB Demo: PostgreSQL Interface
37. Jeff Feng | Product Manager | @jtfeng | jfeng@tableau.com
Clara Siegel | Product Manager | @clara_siegel | csiegel@tableau.com
Ron Avnur | VP, Product Management | ron.avnur@mongodb.com
Asya Kamsky | Principal Solutions Architect | asya@mongodb.com
Abstract
Tableau enables people to ask questions of their data by bringing analysis and visualization together with revolutionary technology. In this session, you’ll learn how to leverage Tableau and MongoDB for visual analytics of rich JSON data at the speed of thought, dramatically reducing the time-to-insight for users. The talk will include interactive demos and best practices to drive smart and fast business insights.
We believe in the triumph of facts. We believe in unleashing human ingenuity. We believe in empowered workplaces. We enable people to contribute and make achievements that they consider to be the highest use of their skills, intellect, and capabilities. When this happens, they improve their lives, their organizations, and the world.
We do this by making software that helps people, regardless of technical skill, see and understand their data. This liberates people’s natural curiosity and creative energy. It enables them to have a conversation with their data that was never before possible – leading to valuable discoveries that challenge the status quo.
Traditional BI implementations are famous for falling flat on their face, leaving customers weary and shy of taking a chance on BI projects. History has left a sour taste among many companies, leaving them distrustful of change.
We have 4 main products
We are SWIMMING in data.
Companies are literally DROWNING in a TIDAL WAVE of Big Data
And not only is it NOT going away… IT’S GETTING WORSE!
Eric Schmidt said that 5 EXABYTES of information, // that’s a 1 followed by 18 ZEROS // was created from the DAWN of civilization until 2003.
<CLICK>
NOW that amount of information is created every 2 days
Just remember the 4 V’s of Big Data: VOLUME, VELOCITY, VARIETY & VALUE
THIS is the ACADEMIC view of Big Data
There are 3 main signs or sources of Big Data
<CLICK>
Number ONE look for Process Generated Data. This includes data from business systems and web logs.
For business systems, think operational data from Salesforce, NetSuite, TFS, Alpo, Egencia & Concur.
Or perhaps operational web log data from their website // THAT IS BIG DATA
<CLICK>
Number TWO look for Machine Generated Data // that includes sensors and the Internet of Things.
Smartphones generate TONS of data // As do cars, elevators and factory machinery
THAT IS BIG DATA
<CLICK>
Lastly, there’s all the human-generated data.
E-mail data, social data, document data // THAT IS BIG DATA
Something else you should know about Big Data
Only 20% of the world’s data is structured data
<CLICK>
Which means that MOST of the world’s data is UNSTRUCTURED // and NOT immediately accessible.
<CLICK>
We are moving from a WORLD of FLAT FILES to a WORLD of JSON
All of this wreaks havoc for visual analysis
We’re at the tip of the iceberg – LITERALLY <PAUSE>
<CLICK>
MongoDB is one of the best solutions for addressing the opportunity beneath the surface of the iceberg
Relational databases CANNOT solve the data challenges that we face today
Someday VERY SOON, we are going to look back at relational databases and think THIS
<CLICK>
Take this in for a second
Relational Databases are great when…
… you know the relationships in advance
… when the schema doesn’t change
… when your data is flat and not nested
… when the data fits on one machine
Today’s use cases are driving the NEED // for a new generation of databases
In the not so distant past, // we used Hierarchical Databases // which contained HW and Application Specific Data.
This meant that data was NOT easily reusable across applications
Then came the era of the Relational Database.
Relational databases have a lot of limitations relative to Hadoop & NoSQL
They have a scale-up architecture. // They can only handle structured data. // They are schema-on-write. // They have limited data processing capabilities. // And they are expensive.
That said, relational databases are still very relevant as a transaction source, // and they will continue to be
It’s just that relational databases are DECLINING as a Big Data DESTINATION. // Companies are RIGHT-SIZING their Relational DBs
In reaction, the Relational Database Vendors are creating reference architectures with Hadoop // to SLOW the bleeding
NoSQL Databases on the other hand // (of which Hadoop is one type) // invite you to store ALL OF YOUR DATA – both structured and unstructured.
They are designed to be distributed with a scale-out architecture, // meaning when you scale // you just add another BOX // instead of getting a BIGGER BOX
They allow you to be more AGILE // They allow for SCHEMA-ON-READ – this means you don’t have to actually DEFINE your schema // UNTIL you are ready to analyze it
They combine STORAGE together with COMPUTE
It’s CHEAPER!!! Think $300/TB for HADOOP, $1500/TB for TERADATA
Tableau plays a FUNDAMENTAL role // in the analysis of Big Data.
If you have ever walked the floors at any of the Big Data conferences, // you’ll see Tableau EVERYWHERE // and it all goes back to our core value proposition for ALL Data
So Why Does Tableau WIN for Big Data?
First, we provide broad access to Big Data platforms // – We have a number of direct connectors in our product for Hadoop, NoSQL and Cloud-based data sources
For Hadoop, our top partners include Cloudera, Hortonworks & MapR
For NoSQL, we have connectors to Datastax Cassandra & MarkLogic & hopefully soon MongoDB
And for Cloud, we have Google BigQuery, Amazon Redshift & Amazon EMR
We help to unlock huge data stores by enabling visual analytics without coding // – Data that is stored in Hadoop is NOT easily accessible, // ESPECIALLY to business users. // With Tableau, you don’t need to write code // – this extends the accessibility and usefulness of Big Data // to ALL users!
We have a hybrid data architecture - // Tableau can connect LIVE to data sources or bring it IN-MEMORY. // LIVE connectivity works great for the data exploration use case // OR when connecting to FAST, INTERACTIVE query engines such as Impala & Spark against large datasets. // IN ADDITION, we can also ACCELERATE slower data sources by using our in-memory Data Engine.
This enables TWO distinct use cases in Big Data: Data Exploration and Data Reporting
In the data exploration use case, // Tableau users connect directly to their data // to understand the shape of their data, // identify initial trends and outliers, // and decide what view of the data they want to expose to their end users
In the data reporting use case, // Tableau users access prepared views of the data // to create purpose-built dashboards and storypoints
We enable mashup with other data via data blending – As a Tableau user, // you are not FORCED to move any of your data, // NOR does it need to be in ONE place. It’s not just about BIG DATA, it’s about DISTRIBUTED DATA
We invest in our platform query performance – // ALL of those GREAT performance enhancements in V9 will really shine here. // Of the improvements, parallel queries are ESPECIALLY relevant on distributed architectures such as Hadoop
We provide a consistent interface to visualizing data - // If a user is accustomed to using Tableau for small data, // it’s the same familiar interface for the analysis of Big Data as well
Analyzing JSON data in a BI or data visualization tool traditionally requires an ETL process.
There are three reasons you would want to use an ETL process:
1. Data standardization or data cleansing – making the data more queryable
2. Representational transformation – changing the source data from nested & complex to flat & relational
3. Data movement – moving the data from the staging area to a target system
Performing traditional ETL requires additional operational overhead as well as longer time-to-insight due to the additional steps.
It also often requires IT involvement to help transform the data due to the limited availability and capability of “business user friendly” ETL tools
Interfaces that enable “schema-on-read”, such as the Simba ODBC driver for MongoDB, can eliminate aspects 2 & 3 – representational transformation and data movement.
Representing nested data in a relational model is a big challenge.
The easiest way of representing nested data in a relational model is by simple flattening.
The drawback of this approach is that nested elements such as arrays can cause the flattened relational model to become very sparse.
A preferred method of representing nested data in a relational model is to create a separate virtual sub-table for each nested element.
In the main fact table, key-value pairs become the column names and values, while embedded sub-documents are flattened into additional columns.
However, a nested array would be represented as a virtual sub-table that is linked together with a foreign key to the main table to maintain the relationship between the records.
The foreign keys are not part of the original dataset, but they are generated during schema inference to establish the relationship.
An example of this is shown with users being the main table and cars being the virtual sub-table from the nested array.
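A minimal sketch of that flattening step, assuming the users/cars shapes from the example above (a real driver’s schema inference is considerably more involved than this):

# Flatten one document into a main row plus a virtual sub-table for its
# nested array, with a generated foreign key linking them.
def flatten(docs, array_field):
    main_rows, sub_rows = [], []
    for pk, doc in enumerate(docs):
        row = {"_generated_id": pk}  # key generated during inference
        for key, value in doc.items():
            if key == array_field:
                for item in value:
                    # Each array element becomes a sub-table row that
                    # points back at its parent via the generated key.
                    sub_rows.append({"users_id": pk, **item})
            elif isinstance(value, dict):
                # Embedded sub-documents are flattened into the main row.
                for k, v in value.items():
                    row[f"{key}.{k}"] = v
            else:
                row[key] = value
        main_rows.append(row)
    return main_rows, sub_rows

users = [{"name": {"first": "Michael", "last": "Smith"},
          "cars": [{"make": "Honda"}, {"make": "Tesla"}]}]
main, cars = flatten(users, "cars")
print(main)  # [{'_generated_id': 0, 'name.first': 'Michael', 'name.last': 'Smith'}]
print(cars)  # [{'users_id': 0, 'make': 'Honda'}, {'users_id': 0, 'make': 'Tesla'}]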
Now if your customer asks you how we connect to Big Data // it’s primarily through an ODBC interface
The ODBC driver // translates SQL-92 queries into SQL-LIKE languages such as HiveQL
To achieve the BEST performance possible, // we custom tune the SQL we generate.
We also push down aggregations, filters and other SQL operations // TO the big data platforms // to take ADVANTAGE of their capability to handle LARGE amounts of data
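To illustrate what such a pushdown can look like against MongoDB: the SQL-to-pipeline mapping below is an assumption about what a driver might emit, but the aggregation operators themselves ($match, $group) are standard MongoDB, shown here via pymongo:

from pymongo import MongoClient

# A filtered GROUP BY, e.g.:
#   SELECT district, COUNT(*) AS n FROM users
#   WHERE age > 5 GROUP BY district
# can be pushed down as a native aggregation pipeline, so MongoDB does
# the filtering and aggregation instead of the client.
client = MongoClient("mongodb://localhost:27017")
pipeline = [
    {"$match": {"age": {"$gt": 5}}},                     # WHERE age > 5
    {"$group": {"_id": "$district", "n": {"$sum": 1}}},  # GROUP BY + COUNT(*)
]
for row in client.test.users.aggregate(pipeline):
    print(row["_id"], row["n"])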
Simba MongoDB ODBC Driver
Translates SQL into Native MongoDB API calls
Performs schema inference to capture relational metadata for JSON – this helps map schema-less data to a fixed schema
Converts JSON data to relational data
Based on ODBC 3.80 Standard
Full 64-bit and 32-bit support
Full SQL support