2.
• Scale-out NoSQL + Scale-out DW
• Data Warehouse = context
• JSON in the Data Warehouse
• Integration: Data Sharing
• Use Cases
3. What is a Teradata Data Warehouse?
• Analytic database
> In-memory, in-database
• Scale-out MPP
> 30+ petabyte sites
> 35PB, 4,096 cores
• Self-service BI
> Dashboards, reports, OLAP
> Predictive analytics
• Complex SQL
> 20-50 way joins
> 350 pages of SQL
• Real-time access/load
• Mixed workloads
[Diagram: data scientists, power users, and sales/partner users querying a 1,024-node MPP system; each node has Intel CPUs and 512GB of memory]
4. What is a Data Warehouse? Context
[Diagram: subject areas integrated in the warehouse: Price History, Inventory, Supplier, Contracts, Product/Services, Channels, E-Commerce, Labor, Associate, Customer, Sales Transactions, Point of Sale, Shipment, Carrier, Campaigns, Promotion]
5. A Day at the Ticket Agency
• 185 applications
> Travel agents & corporate travel managers
> Mobile: airline executives
> Hoteliers
• Teradata 5650, V13.10
> 25TB of data
> 1,000+ users
• Mini-batch loads every 15 min
• GoldenGate replication
• Tactical queries in 0.2 seconds
• 14M queries/day
[Chart: availability by year: 99.7% (2008), 99.78% (2009), 99.98% (2010), 99.94% (2011)]
9. Late Binding in SQL
[Diagram: with early binding, a schema is applied to source data at load time and ETL populates data warehouse tables; with late binding, JSON is stored as-is and SQL + JSONPath applies the schema at runtime for BI tools]
10. JSONPath inside SQL
SELECT
    box.MFG_Line.Product.Color AS "Color",
    box.MFG_Line.Product.Size AS "Size",
    box.MFG_Line.Product.Prod_ID AS "Prod_ID",
    box.MFG_Line.Product.Create_Time AS "Create_Time"
FROM mfgTable
WHERE CAST(box.MFG_Line.Product.Create_Time
           AS TIMESTAMP) >= TIMESTAMP '2013-06-16 00:00:00'
  AND box.MFG_Line.Product.Prod_ID = 96;

Color Size  Prod_ID Create_Time
----- ----- ------- -------------------
Blue  Small 96      2013-06-17 20:07:27
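For reference, here is a minimal sketch of the table this query assumes: the whole document sits in a single Teradata JSON column named box. The DDL and sample row are illustrative, not taken from the deck:

CREATE TABLE mfgTable (
    box_id INTEGER NOT NULL,
    box    JSON(32000)    -- the entire manufacturing document, stored as-is
) PRIMARY INDEX (box_id);

INSERT INTO mfgTable (box_id, box) VALUES (1,
  '{"MFG_Line":{"Product":{"Color":"Blue","Size":"Small","Prod_ID":96,"Create_Time":"2013-06-17 20:07:27"}}}');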
11. Flexible: Schema-on-Read
• JSON object schema column
> Treated like any other column
> Use any BI tool
• Apply the "schema" at runtime
• Why not shred JSON into columns?
> Urgency, agility
> Bypass extensive change controls
> Complex data
– Bill of materials, etc.
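Because the JSON column is treated like any other column, one way to make the fields friendly to BI tools is an ordinary view. A sketch against the mfgTable example above (the view name is invented for illustration):

REPLACE VIEW mfg_product_v AS
SELECT box_id,
       box.MFG_Line.Product.Color   AS Color,
       box.MFG_Line.Product.Size    AS Size,
       box.MFG_Line.Product.Prod_ID AS Prod_ID
FROM mfgTable;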
13. Teradata Unified Data Architecture: System Conceptual View
[Diagram: sources (ERP, SCM, CRM, images, audio and video, machine logs, text, web and social) flow into three engines: the data platform (Hortonworks), the integrated data warehouse (Teradata Database), and the integrated discovery platform (Teradata Aster Database); data is accessed, managed, and moved among them; analytic tools and apps (math and stats, data mining, business intelligence, applications, languages, marketing) serve users including marketing executives, operational systems, frontline workers, customers, partners, engineers, data scientists, and business analysts]
25. Internet of Things: Making Sense of Sensors
• Condition-based maintenance
• R&D testing
• Yield management
• Warranty management
[Diagram: eight MongoDB shards feeding the data warehouse]
26. Conclusions
• Two scale-out architectures
> OLTP scale-out
> Analytics scale-out
• JSON in the data warehouse
• Context from the DW
> Enriching MongoDB applications
• Integration
> Import/export
> Teradata QueryGrid
Key message: a good business model is enormously bigger than a star schema or snowflake schema. It's thousands of tables.
A logical data model is concerned with providing a representation of data throughout the entire corporation. The LDM focuses on entities (named things), attributes (details about the named thing), and relationships (mutual keys and table layouts).
Each subject area in the LDM contains 20-200 tables once it is translated into a physical data model. Each subject area provides the plan for integrating numerous data sources into a consistent representation of the business itself. Since subject areas are so grand in scope, not all the tables may be implemented and populated initially. Instead, the Professional Services experts focus on loading data that will produce results quickly and help solve customer pains.
A logical data model:
• Helps establish a common understanding of business data elements and requirements
• Provides the foundation for designing a database
• Facilitates data re-use and sharing
Logical models are about relationships:
• Independent of function
• Independent of technical limits
Physical models are about technology functions:
• Performance
• Data management
Late binding's advantage is that different bindings can be applied as the discovery process uncovers the best way to analyze the data. Say a street address is stored in a CLOB: you can look at the data from a ZIP-code perspective and then from a city/street perspective.
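A minimal sketch of that idea, assuming a hypothetical cust table whose addr column holds the address as a JSON document; the same stored data gets two different runtime bindings:

SELECT addr.zip AS zip_code
FROM cust;                        -- ZIP-code perspective

SELECT addr.city   AS city,
       addr.street AS street
FROM cust;                        -- city/street perspective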
The short answer: early binding maps record layouts at compile time, while late binding maps record layouts at runtime. Late binding is any time the program must locate the data within a record at runtime, as opposed to knowing the record layout when the program is compiled and the data is first stored in a repository.
Late binding is the process of figuring out the data format and physical location in memory at runtime. That is, the program does not know how data is stored on disk until the data is read into memory. At that point the program uses a function (a subroutine) to extract the data from the record or file. This is a common technique in Java programming, implemented either as a SerDe (serializer/deserializer) or as a method; two different mechanisms for the same job.
Teradata and Hadoop take similar approaches to the late-binding task. If a block of data arrives in memory without a known structure, both application environments call a subroutine to parse the relevant fields out of the block. Hadoop uses SerDes or Java "methods"; Teradata uses JSONPath operators. They all do the same thing.
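In Teradata, the two spellings of the same late binding look like this (sketched against the mfgTable example; both the dot notation and the JSONPath-style JSONExtractValue method bind the layout at query time):

SELECT box.MFG_Line.Product.Color                       AS dot_form,
       box.JSONExtractValue('$.MFG_Line.Product.Color') AS path_form
FROM mfgTable;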
One difference between Teradata and Hadoop is that Hadoop only really does late binding on flat files. When Hive is used, the data is "data typed," meaning it is expected to be stored in rows and columns, just like a relational database.
JSON and XML tools extract the data at runtime, parsing out the meaningful data while the query is running and placing the result in virtual columns created only for the duration of the query. This is in contrast to parsing out data and placing it in columns during the normal ETL processing cycle. It is late binding in the sense that the data is inside an object (a CLOB or VARCHAR) and is parsed out at runtime into a format usable by the query. There is no schema for the CLOB or VARCHAR, no fixed format that can be relied on without the parsing.
Table operators have the ability to define an input schema and an output schema at query runtime. Generally the input data is existing relational tables, but it can include foreign data structures such as Oracle tables or flat files. When foreign tables or files are inputs, a dynamic internal schema is generated for the purpose of reading the data. This is a form of late binding. The output table format is always dynamically defined, which is also a form of late binding since it does not involve a predefined schema. This is not what most people consider late binding, but technically it is identical.
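A sketch of what that looks like with Teradata's JSON_Table table operator: the row and column expressions are supplied at query time rather than by a predefined schema. Exact syntax may vary by release; the example reuses mfgTable:

SELECT *
FROM JSON_Table(
       ON (SELECT box_id, box FROM mfgTable)
       USING ROWEXPR('$.MFG_Line.Product')
             COLEXPR('[ {"jsonpath" : "$.Color",   "type" : "VARCHAR(10)"},
                        {"jsonpath" : "$.Prod_ID", "type" : "INTEGER"} ]')
     ) AS jt(box_id, Color, Prod_ID);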
What items do we need to recall based on the quality issue on 6/16 with product #96?
CAST looks at the JSON data type and formats it as a timestamp.
Schema-on-read is a Java and C++ term meaning the data is not organized and validated at the time it is first written to disk. With schema-on-read, the data is read from disk and, at runtime, a schema layout is applied to most of the data while some data must be located via parsing. This is popular in the process-oriented languages because it allows for flexibility in interpreting data, especially data that has not been explored before.
Schema-on-read allows us to:
• Query data without having to fully understand it before using it; this is discovery and exploration.
• Load quickly: schema-on-read makes for a very fast initial load, since the data does not have to be transformed or validated.
• Stay flexible: consider having two schemas for the same underlying data, depending on the analysis being performed.
• Handle data that cannot be defined in a schema, since the name-value pairs can be in different locations in the JSON object.
Why not just materialize all possible JSON columns immediately? There are a few reasons:
• In some cases the result would be unwieldy and sparse; you could end up with 50M nulls or worse.
• You don't always know weeks and months in advance what new data will arrive.
• We want to preserve the agile nature of JSON, with new data flowing into the system without extensive modeling and governance.
The UDA architecture lets us identify the major subsystems and, in this case, the actual hardware platforms performing the processing.
Adding MongoDB to the QueryGrid is the vision we are working towards.
The unique characteristics of each specialized engine are brought to bear on the IDW work.
MongoDB builds its scale-out architecture using Shards. These are similar to the concept of AMPs in Teradata or Vworkers in Aster. Data is hashed across the MongoDB cluster and stored in a primary shard. It is also replicated to a secondary shard on another node to enable recovery should the primary shard be unavailable.
Connectivity to shards is actually done through the query routers, which send requests to the correct cluster node based on hashed keys. It's drawn this way for simplicity.
The table operator walkthrough, step by step:
1. A table operator request is submitted to the PE (parsing engine).
2. The PE launches the contract function via the EAH.
3. The EAH opens a JDBC connection to the query router.
4. The EAH requests table metadata for the specified table (the metadata also includes ??? information).
5. The PE and dispatcher distribute the output row format to all AMPs.
6. Each AMP is mapped to a series of shards and connects to its corresponding shard via the EAH.
7. Each AMP reads rows of data from a shard and spools the reformatted rows into Teradata spool.
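Once a MongoDB foreign server exists (again, this is the vision, not shipping product), the query side would look like any other QueryGrid request. A hypothetical sketch, with mongo_prod as an invented server name and orders as an invented remote collection:

SELECT o.cust_id, o.order_total
FROM orders@mongo_prod AS o    -- remote MongoDB collection exposed as a table
WHERE o.order_date >= DATE '2014-01-01';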
This is an existing Teradata customer who has evolved into using MongoDB for their eCommerce website. Formerly a mail-order company, they have become a full eTailer. On a nightly basis, they extract data from MongoDB and load it into the data warehouse. They use deep-dive predictive analytics, buyer preferences, promotional objectives, and other data to provide context and next-best offers to the MongoDB application. Once calculated, the new information is exported to files and loaded into the MongoDB shards to make the website visitor experience more relevant and, ideally, to drive more sales.
The major source of rich customer information is the data warehouse. For years, DWs have collected customer purchases, payment history, buyer preferences, and claims, plus next-best offers and upsell opportunities. Much of this data is historical, going back 3-5 years, and some of it is the result of predictive analytics coupled with campaign management tools.
Real-time tactical access to the data warehouse is the same as accessing any relational database. We call this Active Data Warehousing. Hundreds of Teradata customers access data in near real time with their Active Data Warehouse.
Combining these rich subject areas with MongoDB JSON data helps provide a faster time to resolution, next best offers, and the correct customer treatments based on their status with the corporation.
One of the key IoT concepts is the development of intelligent, connected "edge" devices. One example of such an IoT device is the Bosch Rexroth Nexo, an industrial nutrunner wrench equipped with an on-board computer and wireless connectivity. The on-board computer supports many aspects of the tightening process, from configuration (e.g., which torque to use) to creating a protocol of the work completed (e.g., which torque was actually measured).
In addition, the Nexo features a laser scanner for component identification. By integrating such an intelligent edge device into the IoT, very powerful services can be developed to help with supply chain optimization and modularizing the production line. For example, these intelligent tightening tools can now be managed by a central asset management application, which provides different services:
• Basic services could include features like helping to actually locate the equipment in a large production facility
• Geo-fencing concepts can be applied to help ensure only an approved tool with the right specification and configuration can be used on a specific product in a production cell.
MongoDB collects sensor data like this and makes it available to Java applications for tracking. Passing the data to Teradata allows for deep trend analysis, maintenance planning, and other IoT analytics.
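As an illustration of the deep trend analysis mentioned above, a sketch of a query over a hypothetical torque_readings table loaded from the Nexo sensor data (the table and column names are invented for the example):

SELECT tool_id,
       CAST(reading_ts AS DATE) AS reading_day,
       AVG(measured_torque)     AS avg_torque,
       MAX(measured_torque)     AS max_torque
FROM torque_readings
GROUP BY tool_id, CAST(reading_ts AS DATE)
ORDER BY tool_id, reading_day;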