The document discusses Oracle Big Data Discovery and how it can be used to analyze and gain insights from data stored in a Hadoop data reservoir. It provides an example scenario where Big Data Discovery is used to analyze website logs, tweets, and website posts and comments to understand popular content and influencers for a company. The data is ingested into the Big Data Discovery tool, which automatically enriches the data. Users can then explore the data, apply additional transformations, and visualize relationships to gain insights.
2. info@rittmanmead.com www.rittmanmead.com @rittmanmead 2
•Mark Rittman, Co-Founder of Rittman Mead
‣Oracle ACE Director, specialising in Oracle BI&DW
‣14 Years Experience with Oracle Technology
‣Regular columnist for Oracle Magazine
•Author of two Oracle Press Oracle BI books
‣Oracle Business Intelligence Developers Guide
‣Oracle Exalytics Revealed
‣Writer for Rittman Mead Blog :
http://www.rittmanmead.com/blog
•Email : mark.rittman@rittmanmead.com
•Twitter : @markrittman
About the Speaker
3. info@rittmanmead.com www.rittmanmead.com @rittmanmead 3
•Started back in 1997 on a bank Oracle DW project
•Our tools were Oracle 7.3.4, SQL*Plus, PL/SQL
and shell scripts
•Went on to use Oracle Developer/2000 and Designer/2000
•Our initial users queried the DW using SQL*Plus
•And later on, we rolled-out Discoverer/2000 to everyone else
•And life was fun…
15+ Years in Oracle BI and Data Warehousing
4. info@rittmanmead.com www.rittmanmead.com @rittmanmead 4
•Over time, this data warehouse architecture developed
•Added Oracle Warehouse Builder to
automate and model the DW build
•Oracle 9i Application Server (yay!)
to deliver reports and web portals
•Data Mining and OLAP in the database
•Oracle 9i for in-database ETL (and RAC)
•Data was typically loaded from Oracle RDBMS and EBS
•It was Oracle (rather than turtles) all the way down…
The Oracle-Centric DW Architecture
5. info@rittmanmead.com www.rittmanmead.com @rittmanmead 5
•Many customers and organisations are now running initiatives around “big data”
•Some are IT-led and are looking for cost-savings around data warehouse storage + ETL
•Others are “skunkworks” projects in the marketing department that are now scaling-up
•Projects now emerging from pilot exercises
•And design patterns starting to emerge
Many Organisations are Running Big Data Initiatives
6. info@rittmanmead.com www.rittmanmead.com @rittmanmead 6
•Typical implementation of Hadoop and big data in an analytic context is the “data lake”
•Additional data storage platform with cheap storage, flexible schema support + compute
•Data lands in the data lake or reservoir in raw form, then minimally processed
•Data then accessed directly by “data scientists”, or processed further into DW
Common Big Data Design Pattern : “Data Reservoir”
10. T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or
+61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India)
E : info@rittmanmead.com
W : www.rittmanmead.com
An Interesting Question.
11.
Meanwhile, back in the real world…
15.
Customer 360-Degree Insight
17. info@rittmanmead.com www.rittmanmead.com @rittmanmead 17
Data from Real-Time, Social & Internet Sources is Strange
[Diagram: Single Customer View; Enriched Customer Profile; Correlating; Modeling; Machine Learning; Scoring]
•Typically comes in non-tabular form
•JSON, log files, key/value pairs
•Users often want it speculatively
‣Haven’t thought through final purpose
•Schema can change over time
‣Or maybe there isn’t even one
•But the end-users want it now
‣Not when your ETL team are next free
18. info@rittmanmead.com www.rittmanmead.com @rittmanmead 18
•Hadoop & NoSQL better suited to exploratory analysis of newly arrived, data-reservoir-type data
‣Flexible schema - applied by user rather than ETL
‣Cheap expandable storage for detail-level data
‣Better native support for machine-learning and data discovery tools and processes
‣Potentially a great fit for our new and emerging customer 360 datasets, and great platform for analysis
Introducing Hadoop - Cheap, Flexible Storage + Compute
20.
•Start with pilot for area of the business that needs a single view of customers
•Then, over time, iterate and build out the Customer 360-degree view
Delivering a Successful Customer 360-Degree View
Roadmap diagram:
•Start with a business area that needs a single customer view
•Obtain clear understanding of customer online & offline behaviour
•Build out Predictive Models and Decision Engines to deliver value now
•Build out Hadoop Data Reservoir, Feeds and link to DW + CRM
•Iterate and Build-out, add new integrations, incrementally building capability
•Develop and Implement Strategy, Deliver Business Value
•Build DevOps Capability
•Pilot & Quick Win
•Create Full Production Infrastructure
•Pilot (Virtualised / Commodity) Hadoop Infrastructure
21. info@rittmanmead.com www.rittmanmead.com @rittmanmead 21
But … These Data Sources are Strange
25. info@rittmanmead.com www.rittmanmead.com @rittmanmead 25
•Data loaded into the reservoir needs preparation and curation before presenting to users
•Specialist skills typically needed to ingest and understand data - and those staff are scarce
•How do we staff and scale projects as our use of big data matures?
But … Working with Unstructured Textual Data Is Hard
29. info@rittmanmead.com www.rittmanmead.com @rittmanmead 29
•Part of the acquisition of Endeca back in 2012 by
Oracle Corporation
•Based on search technology and concept of
“faceted search”
•Data stored in flexible NoSQL-style in-memory
database called “Endeca Server”
•Added aggregation, text analytics and text
enrichment features for “data discovery”
‣Explore data in raw form, loose connections,
navigate via search rather than hierarchies
‣Useful to find out what is relevant and valuable in
a dataset before formal modeling
What Was Oracle Endeca Information Discovery?
30. info@rittmanmead.com www.rittmanmead.com @rittmanmead 30
•Proprietary database engine focused on search and analytics
•Data organized as records, made up of attributes stored as key/value pairs
•No over-arching schema,
no tables, self-describing attributes
•Endeca Server hallmarks:
‣Minimal upfront design
‣Support for “jagged” data
‣Administered via web service calls
‣“No data left behind”
‣“Load and Go”
•But … limited in scale (>1m records)
‣… what if it could be rebuilt on Hadoop?
Endeca Server Technology Combined Search + Analytics
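The record model described on this slide (records made of attribute/value pairs, no over-arching schema, "jagged" data) can be illustrated with a minimal Python sketch. This is not Endeca Server code or its API; the records and attribute names below are invented purely for illustration.

```python
from collections import Counter

# Hypothetical records: each is a free-form set of attribute/value pairs,
# and different records carry different attributes ("jagged" data).
records = [
    {"type": "tweet", "author": "markrittman", "text": "New OBIEE post"},
    {"type": "page_view", "url": "/blog/obiee-internals", "country": "UK"},
    {"type": "comment", "author": "rmoff", "url": "/blog/obiee-internals"},
]

def attribute_coverage(recs):
    """Count how many records carry each attribute; no schema is declared up front."""
    return Counter(key for rec in recs for key in rec)

coverage = attribute_coverage(records)
for attribute, count in coverage.most_common():
    print(f"{attribute}: present in {count} of {len(records)} records")
```

Because nothing is declared up front, a new attribute simply appears in the coverage counts the first time a record carries it, which is the "load and go" idea in miniature.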
40. info@rittmanmead.com www.rittmanmead.com @rittmanmead 40
•A visual front-end to the Hadoop data reservoir, providing end-user access to datasets
•Catalog, profile, analyse and combine schema-on-read datasets across the Hadoop cluster
•Visualize and search datasets to gain insights, potentially load in summary form into DW
Oracle Big Data Discovery
41. info@rittmanmead.com www.rittmanmead.com @rittmanmead 41
What Does Big Data Discovery Do?
•Provide a visual catalog and search function across data in the data reservoir
•Profile and understand data, relationships, data quality issues
•Apply simple changes, enrichment to incoming data
•Visualize datasets including combinations (joins)
43.
Delivering a Successful Customer 360-Degree View
•Build out Predictive Models and Decision Engines to deliver value now
•Build out Hadoop Data Reservoir, Feeds and link to DW + CRM
•Build DevOps Capability
45. info@rittmanmead.com www.rittmanmead.com @rittmanmead 45
•Rittman Mead want to understand drivers and audience for their website
‣What is our most popular content? Who are the most in-demand blog authors?
‣Who are the influencers? What do they read?
•Three data sources in scope: RM website logs; Twitter stream; website posts, comments etc
Example Scenario : Social Media Analysis
46. info@rittmanmead.com www.rittmanmead.com @rittmanmead 46
•Datasets in Hive have to be ingested into DGraph engine before analysis, transformation
•Can either define an automatic Hive table detector process, or manually upload
•Typically ingests 1m row random sample
‣1m row sample provides > 99% confidence that answer is within 2% of value shown, no matter how big the full dataset (1m, 1b, 1q+)
‣Makes interactivity cheap - representative dataset
Ingesting & Sampling Datasets for the DGraph Engine
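Why a fixed-size sample can be representative regardless of the full dataset's size can be checked with the standard margin-of-error formula for a proportion. The sketch below assumes simple random sampling and the worst-case proportion of 0.5; it is not Big Data Discovery's own calculation, and the slide's "within 2%" figure may be defined differently (for example, as a bound on aggregated values).

```python
import math

# z-score for a two-sided 99% confidence interval (standard normal quantile).
Z_99 = 2.576

def margin_of_error(sample_size: int, proportion: float = 0.5, z: float = Z_99) -> float:
    """Half-width of the confidence interval for a proportion estimated from
    a simple random sample. proportion=0.5 is the worst case (largest variance)."""
    return z * math.sqrt(proportion * (1.0 - proportion) / sample_size)

moe = margin_of_error(1_000_000)
print(f"99% margin of error for a 1m-row sample: {moe:.4%}")
```

The population size does not appear in the formula, so the bound depends only on the sample size. Under these assumptions a 1m-row sample comes in well inside the 2% quoted on the slide.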
47. info@rittmanmead.com www.rittmanmead.com @rittmanmead 47
•Ingested datasets are now visible in Big Data Discovery Studio
•Create new project from first dataset, then add second
View Ingested Datasets, Create New Project
48. info@rittmanmead.com www.rittmanmead.com @rittmanmead 48
•Ingestion process has automatically geo-coded host IP addresses
•Other automatic enrichments run after initial discovery step, based on datatypes, content
Automatic Enrichment of Ingested Datasets
49. info@rittmanmead.com www.rittmanmead.com @rittmanmead 49
•For the ACCESS_PER_POST_CAT_AUTHORS dataset, 18 attributes now available
•Combination of original attributes, and derived attributes added by enrichment process
Initial Data Exploration On Uploaded Dataset Attributes
50. info@rittmanmead.com www.rittmanmead.com @rittmanmead 50
•Data ingest process automatically applies some enrichments - geocoding etc
•Can apply others from Transformation page - simple transformations & Groovy expressions
Data Transformation & Enrichment
51. info@rittmanmead.com www.rittmanmead.com @rittmanmead 51
•Uses Salience text engine under the covers
•Extract terms, sentiment, noun groups, positive / negative words etc
Transformations using Text Enrichment / Parsing
52. info@rittmanmead.com www.rittmanmead.com @rittmanmead 52
•Choose option to Create New Attribute, to add derived attribute to dataset
•Preview changes, then save to transformation script
Create New Attribute using Derived (Transformed) Values
53. info@rittmanmead.com www.rittmanmead.com @rittmanmead 53
•Users can upload their own datasets into BDD, from MS Excel or CSV file
•Uploaded data is first loaded into Hive table, then sampled/ingested as normal
Upload Additional Datasets
54. info@rittmanmead.com www.rittmanmead.com @rittmanmead 54
•Used to create a dataset based on the intersection (typically) of two datasets
•Not required to just view two or more datasets together - think of this as a JOIN and SELECT
Join Datasets On Common Attributes
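The "JOIN and SELECT" idea can be sketched in plain Python as an inner join of two record lists on a shared attribute. This is only an illustration of the concept, not how BDD performs the join; the dataset contents and the `post_id` attribute are hypothetical.

```python
def inner_join(left, right, key):
    """Return merged records for every pair of rows sharing the same key value."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        # Rows with no match on the other side are dropped (intersection).
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined

# Hypothetical datasets sharing a common attribute, post_id.
page_views = [
    {"post_id": 1, "views": 120},
    {"post_id": 2, "views": 45},
    {"post_id": 3, "views": 10},
]
posts = [
    {"post_id": 1, "author": "markrittman"},
    {"post_id": 2, "author": "rmoff"},
]

joined = inner_join(page_views, posts, "post_id")
print(joined)
```

Post 3 has no matching row in `posts`, so it is absent from the result, which is the intersection behaviour the slide describes as typical.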
57. info@rittmanmead.com www.rittmanmead.com @rittmanmead 57
•BDD Studio dashboards support faceted search across all attributes, refinements
•Auto-filter dashboard contents on selected attribute values - for data discovery
•Fast analysis and summarisation through Endeca Server technology
Faceted Search Across Entire Data Reservoir
Screenshot callouts: 3 - Further refinement on “OBIEE” in post keywords; 4 - Results now filtered on two refinements
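Faceted refinement, where each selected attribute value narrows the result set and the remaining facet counts update, can be sketched as follows. This is a plain-Python illustration under invented data, not the DGraph or Endeca query interface; the attribute names and post titles are hypothetical.

```python
from collections import Counter

def refine(records, refinements):
    """Keep records whose attributes match every selected refinement value."""
    return [r for r in records
            if all(value in r.get(attr, []) for attr, value in refinements.items())]

def facet_counts(records, attr):
    """Count remaining values of one attribute, as a facet panel would show."""
    return Counter(v for r in records for v in r.get(attr, []))

# Hypothetical multi-valued records.
posts = [
    {"title": ["Post A"], "author": ["markrittman"], "keywords": ["OBIEE", "Hadoop"]},
    {"title": ["Post B"], "author": ["rmoff"], "keywords": ["OBIEE", "Linux"]},
    {"title": ["Post C"], "author": ["markrittman"], "keywords": ["Hadoop"]},
]

# First refinement on author, then a second on a post keyword.
step1 = refine(posts, {"author": "markrittman"})
step2 = refine(posts, {"author": "markrittman", "keywords": "OBIEE"})
print(facet_counts(step1, "keywords"))
print([p["title"][0] for p in step2])
```

Each extra refinement is just another condition that every surviving record must meet, so refinements can be added or removed in any order.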
58. info@rittmanmead.com www.rittmanmead.com @rittmanmead 58
•Visual Analyzer also provides a form of “data discovery” for BI users
‣Similar to Tableau, Qlikview etc
‣Inspired by BI elements of OEID
•Uses OBIEE RPD as the primary datasource,
so data needs to be curated + structured
•Probably a better option for users who aren’t concerned that it’s “big data”
•But can still connect to Hadoop via
Hive, Impala and Oracle Big Data SQL
Comparing BDD to Oracle Visual Analyzer
59. info@rittmanmead.com www.rittmanmead.com @rittmanmead 59
•Data in the data reservoir typically is raw, hasn’t been organised into facts, dimensions yet
•In this initial phase, you don’t want it to be - too much up-front work with unknown data
•Later on though, users will benefit from structure and hierarchies being added to data
•But this takes work, and you need to understand cost/benefit of doing it now vs. later
Managed vs. Free-Form Data Discovery
60. info@rittmanmead.com www.rittmanmead.com @rittmanmead 60
•Transformations within BDD can then be used to create curated fact + dim Hive tables
•Can then be used as a more suitable dataset for use with OBIEE RPD + Visual Analyzer
•Or then exported into Exadata or Exalytics to combine with main DW datasets
Export Prepared Datasets Back to Hive, for OBIEE + VA
61. info@rittmanmead.com www.rittmanmead.com @rittmanmead 61
•Users in Visual Analyzer then have
a more structured dataset to use
•Data organised into dimensions,
facts, hierarchies and attributes
•Can still access Hadoop directly
through Impala or Big Data SQL
•Big Data Discovery though was
key to initial understanding of data
Further Analyse in Visual Analyzer for Managed Dataset
62. info@rittmanmead.com www.rittmanmead.com @rittmanmead 62
•Oracle Big Data Discovery used to go back to the raw event data and add more meaning
•Enrich data, extract nouns + terms, add reference data from file, RDBMS etc
•Understand sentiment + meaning of tweets, link disparate + loosely coupled events
•Faceted search dashboards
Oracle BDD for Data Wrangling + Data Enrichment
63. info@rittmanmead.com www.rittmanmead.com @rittmanmead 63
•Previous counts assumed that all tweet references are equally important
•But some Twitter users are far more influential than others
‣Sit at the centre of a community, have 1000’s of followers
‣A reference by them has massive impact on page views
‣Positive or negative comments from them drive perception
•Can we identify them?
‣Potentially “reach out” with analyst program
‣Study what website posts go “viral”
‣Understand our audience, and the conversation, better
But Who Are The Influencers In Our Community?
64. info@rittmanmead.com www.rittmanmead.com @rittmanmead 64
•Rittman Mead website features many types of content
‣Blogs on BI, data integration, big data, data warehousing
‣Op-Eds (“OBIEE12c - Three Months In, What’s the Verdict?”)
‣Articles on a theme, e.g. performance tuning
‣Details of new courses, new promotions
•Different communities likely to form around these content types
•Different influencers and patterns of recommendation, discovery
•Can we identify some of the communities, segment our audience?
What Communities and Networks Are Our Audience?
65. info@rittmanmead.com www.rittmanmead.com @rittmanmead 65
Graph Example : RM Blog Post Referenced on Twitter
[Diagram: a tweet, “Lifting the Lid on OBIEE Internals with Linux Diagnostics Tools http://t.co/gFcUPOm5pI”, spreads along “Follows” connections between Twitter users, with a page-view count shown at each step]
66. info@rittmanmead.com www.rittmanmead.com @rittmanmead 66
Network Effect Magnified by Extent of Social Graph
[Diagram: the tweet “Lifting the Lid on OBIEE Internals with Linux Diagnostics Tools http://t.co/gFcUPOm5pI” shown twice across a wider social graph, each copy with its own page-view count]
67. info@rittmanmead.com www.rittmanmead.com @rittmanmead 67
Retweets by Influential Twitter Users Drive Visits
[Diagram: the original tweet, “Lifting the Lid on OBIEE Internals with Linux Diagnostics Tools http://t.co/gFcUPOm5pI”, and a retweet of it (“RT: Lifting the Lid on OBIEE Internals with Linux Diagnostics Tools http://t.co/gFcUPOm5pI”), joined by a “Retweet” connection, each with its own page-view count]
69. info@rittmanmead.com www.rittmanmead.com @rittmanmead 69
Property Graph Terminology
[Diagram: nodes, or “vertices”, joined by directed connections, or “edges”, labelled “Mentions” and “Retweets”; the example tweet is “Lifting the Lid on OBIEE Internals with Linux Diagnostics Tools http://t.co/gFcUPOm5pI”]
70. info@rittmanmead.com www.rittmanmead.com @rittmanmead 70
•Different types of Twitter interaction could imply more or less “influence”
‣Retweet of another user’s Tweet
implies that person is worth quoting
or you endorse their opinion
‣Reply to another user’s tweet
could be a weaker recognition of
that person’s opinion or view
‣Mention of a user in a tweet is a
weaker recognition that they are
part of a community / debate
Determining Influencers - Factors to Consider
71. info@rittmanmead.com www.rittmanmead.com @rittmanmead 71
Relative Importance of Edge Types Added via Weights
[Diagram: the example tweet “Lifting the Lid on OBIEE Internals with Linux Diagnostics Tools http://t.co/gFcUPOm5pI” with each edge carrying a weight as an edge property: Mentions, Weight = 30; Retweet, Weight = 100]
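Edge weights as properties can be sketched with a small Python edge list. The two weights (retweet = 100, mention = 30) come from the slide; the user names and interactions are invented, and summing incoming weights is only a crude stand-in for a proper influence measure, not the method used in the talk.

```python
from collections import defaultdict

# Weights for two edge types are taken from the slide; the data is invented.
EDGE_WEIGHTS = {"retweet": 100, "mention": 30}

# (source user, edge type, target user): the source retweets or mentions the target.
interactions = [
    ("alice", "retweet", "bob"),
    ("carol", "mention", "bob"),
    ("bob", "mention", "alice"),
]

def weighted_in_degree(edges, weights):
    """Sum the weights of incoming edges per user as a crude influence score."""
    score = defaultdict(int)
    for _source, edge_type, target in edges:
        score[target] += weights[edge_type]
    return dict(score)

scores = weighted_in_degree(interactions, EDGE_WEIGHTS)
print(scores)
```

Keeping the weight table separate from the edge list makes it easy to try different relative weightings without touching the interaction data.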
72. info@rittmanmead.com www.rittmanmead.com @rittmanmead 72
•Graph, spatial and raster data processing for big data
‣Runs on-prem, or in Oracle Big Data Cloud Service
‣Installable on commodity cluster using CDH
•Data stored in Apache HBase or Oracle NoSQL DB
‣Complements Spatial & Graph in Oracle Database
‣Designed for trillions of nodes, edges etc
•Out-of-the-box spatial enrichment services
•Over 35 of most popular graph analysis functions
‣Graph traversal, recommendations
‣Finding communities and influencers,
‣Pattern matching
Oracle Big Data Spatial & Graph
73. info@rittmanmead.com www.rittmanmead.com @rittmanmead 73
Calculating Top 10 Users using Page Rank Algorithm
Top 10 influencers:
markrittman
rmoff
rittmanmead
mRainey
JeromeFr
Nephentur
borkur
BIExperte
i_m_dave
dw_pete
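The slide names the Page Rank algorithm but does not show the computation, which in the talk is done with Oracle Big Data Spatial & Graph. As a rough illustration only, here is a plain-Python weighted PageRank by power iteration; it is not that product's API. The users, edges, damping factor of 0.85 and iteration count are all assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration.
    edges: list of (source, target, weight); rank flows from source to target."""
    nodes = sorted({n for s, t, _ in edges for n in (s, t)})
    out_weight = {n: 0.0 for n in nodes}
    for s, _t, w in edges:
        out_weight[s] += w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        new_rank = {n: (1 - damping) / len(nodes) + damping * dangling / len(nodes)
                    for n in nodes}
        for s, t, w in edges:
            # Each source shares its rank in proportion to edge weight.
            new_rank[t] += damping * rank[s] * w / out_weight[s]
        rank = new_rank
    return rank

# Hypothetical interactions: retweets weighted 100, mentions weighted 30.
edges = [
    ("alice", "bob", 100),
    ("carol", "bob", 100),
    ("dave", "bob", 30),
    ("bob", "alice", 30),
    ("dave", "carol", 30),
]

ranks = pagerank(edges)
top = sorted(ranks, key=ranks.get, reverse=True)
print(top)
```

Unlike a simple count of references, a user's score here depends on the scores of the users who retweet or mention them, which is why a reference from an influential account counts for more.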
78. info@rittmanmead.com www.rittmanmead.com @rittmanmead 78
Determining Communities via Twitter Interactions
• Clusters based on actual interaction patterns, not hashtags
• Detects real communities, not ones that exist just in theory
79. info@rittmanmead.com www.rittmanmead.com @rittmanmead 79
•Extend your organisation’s reach into your data with Oracle Big Data Discovery, Cloudera
Hadoop and the Rittman Mead Big Data Rapid Start.
•The Big Data Rapid Start is a fixed price, two week engagement delivered by Rittman
Mead’s team of Oracle, Big Data and Data Discovery consultants, designed to quickly
provide everything required to begin discovering the hidden value of your data.
•Move forward with confidence in the technology, process and application of Big Data
Discovery with the support of the world’s leaders.
Big Data Rapid Start from Rittman Mead
80. info@rittmanmead.com www.rittmanmead.com @rittmanmead 80
•Articles on the Rittman Mead Blog
‣http://www.rittmanmead.com/category/oracle-big-data-appliance/
‣http://www.rittmanmead.com/category/big-data/
‣http://www.rittmanmead.com/category/oracle-big-data-discovery/
•Rittman Mead offer consulting, training and managed services for Oracle Big Data
‣Oracle & Cloudera partners
‣http://www.rittmanmead.com/bigdata
Additional Resources