Best Practices for Developing Apps for Big Data. Exadata, Exalytics, Big Data Appliance. Hadoop, HDFS, Using R with Oracle Database and Hadoop. Fast Data for Gathering Information.
2. Developing a Successful Big Data Strategy
Best Practices for Development
Raul Goycoolea S.
Solution Architect Manager
Oracle Latin America
Architecture Team
Mexico Developer Day, Apr 2014
12. Big Data Analysis Characteristics
• Integrate
– Traditional and New data
• Explore
– More data, More sources
• Discover
– Plan, Visualize, Model, Act
13. Big Data Analysis In Retail: The Problem
• Fashion retailer sees flat and declining sales
• No apparent differences by geography or standard demographics
• New marketing program didn’t help
14. Step 1: New Segmentation
• Analyze weblog files
– Response rates
– Frequency and duration of visits
– Shopping cart activity
– Devices used to access
• Cross reference with demographics
– Affinity program
– Online profiles
• New insight: younger, affluent women are not buying
15. Step 2: Sentiment Analysis
• Analyze all comments
– Social media, forums
• Cross reference with customer information
– Affinity programs
– Online activity
– Sales records
• New insight: the new segment frequently mentions “out of stock”
16. Step 3: Inventory Analysis
• Analyze promoted products
– No stocking problems
• Cross-reference with all shopper activities
– Online shopping cart activity
– Affinity program
– Shopper location information
– “Out of stock” comments
• Key insight: matching accessories are out of stock
17. Big Data Analysis In Retail: The Answer
Young women with higher disposable income (and smart phones) did not buy a designer sweater when the matching sleeveless top was out of stock.
19. Oracle Exadata Database Machine
• Fastest Data Warehouse & OLTP
• Best Cost/Performance Data Warehouse & OLTP
• Optimized Hardware (per rack)
• Processor: up to 128 Intel Cores and 2 TB DRAM
• Network: 880 Gb/Sec Throughput
• Storage: 5 TB Flash and up to 336 TB Disk
• Software Breakthroughs
• Exadata Smart Storage Grid
• Smart Flash Cache
• Hybrid Columnar Compression
• Parallel Scale-Out Database and Storage
• Scales from ¼ Rack to 8 Full Racks
Data Warehousing, Transaction Processing, Consolidation
20. Oracle In-Database Analytics Platform
• Data Layer: Relational, XML, OLAP, Spatial, RDF, Media
• Parallel Processing Engine
• Analytics: Oracle R Enterprise, Oracle Data Mining, Text and Search, Spatial Analytics, SQL Analytics, Oracle MapReduce
22. Oracle Exalytics In-Memory Machine
• First engineered system for analytics
• Visual Analysis without limits
• Smarter analytic applications
23. End-user Experience with Exalytics
• Speed of Thought Interactive Analysis
• Free Exploration
• Dense Visualizations
• Fully Mobile
24. Over 80 Analytic Applications Run on Exalytics
• No application changes required
• Financials, HR
• Sales, marketing
• Planning, forecasting
• Many industries
25. Analyzing Big Data
• Comprehensive
• Enterprise ready
• Engineered to work together
• Optimized for extreme analytics
Start by introducing you to the platform. We’ll talk about use cases and then zero in on the use case that you will be working with as part of your hands-on labs (HOLs). Frankly, across these use cases you’ll find similar data processing flows. We’ll review the Oracle MoviePlex design pattern/architecture.
In the rest of the presentation we’ll walk through the lifecycle of big data. Big data is all about making better business decisions to grow revenue and lower costs. The lifecycle of big data is: acquire, organize, analyze, decide.
The platform consists of: Big Data Appliance to source unstructured/semi-structured data; Exadata to combine that data, once structured, with traditional schema-based data and run in-database analytics on it; and Exalytics for in-memory extreme analytics. All connected by InfiniBand.
Added standalone software components. So, to summarize: I think we have the industry’s most complete and integrated solution for acquiring, organizing, and analyzing big data. If someone comes up to you and needs you to deploy big data in a few weeks, we can help you do this: fastest time to value. We have the software: NoSQL Database, Enterprise Manager Cloud Control, Hadoop, Oracle Data Integrator for Hadoop, Oracle Loader for Hadoop, R, and OBIEE. Plus we have the Big Data Appliance, Exadata, and Exalytics to provide engineered solutions for running the software. In closing, I hope this session has been informative and you can now all go back to your organizations and tell them what big data is (high volume, low value per record), how it can be acquired, organized, loaded into your existing data warehouse, and analyzed to bring new value to your business.
So you have BIG data. You’re running MapReduce on that data. You want to load or access some of that data in Oracle Database for further analytics. This is what the Oracle Big Data Connectors are for. Note that the data is transformed into a structured form before it is loaded or accessed by the connectors.
You have seen a similar slide in other big data presentations from Oracle, outlining the different stages in a big data application. The potential treasure trove of less structured data such as weblogs, social media, email, sensors, and location data can provide a wealth of useful information for business applications. Hadoop provides a massively parallel architecture to distill desired information from huge volumes of unstructured and semi-structured content. Frequently, this data needs to be analyzed together with existing data in relational databases, the platform for most commercial applications. The two sets of data need to be combined so that users can derive greater insights from the less structured data that is processed and stored on Hadoop clusters, using the data in relational databases. A set of technologies and utilities referred to as “connectors” makes the data on Hadoop available to the database for analysis with the data in the database. Oracle Loader for Hadoop and Oracle SQL Connector for HDFS are two high-performance connectors to load and access very large volumes of data on Hadoop.

We see here the different stages in a big data solution. Oracle has engineered solutions for each of these stages: Oracle Big Data Appliance, Oracle Exadata (an engineered system for running Oracle Database), and Oracle Exalytics (an engineered system for BI applications), all connected by InfiniBand, the super highway that integrates Oracle’s engineered systems. Note that the Big Data Connectors work both with the engineered systems and with generic Hadoop and database installations (I will be discussing specific versions later in the presentation).

Set-up for today’s conversation: you know a lot about Exadata and Exalytics (Oracle BI). We’ve been hard at work developing a key component of the big data platform, the BDA, and I’m excited to speak to you about that today. We are leveraging Oracle’s appliance expertise and, importantly, the advice and technology of industry experts to create an open platform. Although it’s new, it offers a solid foundation, using technology that is well tested by the biggest players in the market. We then took this open system and optimized it for Oracle, delivering unique capabilities that simplify connections to the rest of your Oracle ecosystem and deliver outstanding performance. We’ll introduce the system and then step through a use case that illustrates the flow of information across it, highlighting the optimizations along the way that are unique to Oracle.

The platform consists of: Big Data Appliance to source unstructured/semi-structured data; Exadata to combine that data, once structured, with traditional schema-based data and run in-database analytics on it; and Exalytics for in-memory extreme analytics. All connected by InfiniBand, a key enabler and an example of Oracle’s superior technology. Without InfiniBand (without the super highway that integrates Oracle’s engineered solutions), customers will try to squeeze all these capabilities into one box for either a performance or a price advantage, and they will fail at both. With InfiniBand, customers have the right tool optimized for the right job: the value of integrated Oracle solutions is greater than the sum of the parts.
The connectors work with Oracle’s engineered systems and also with other Hadoop distributions and Oracle databases (as long as the versions are supported).
Parallelism: PQ (parallel query) slaves in the database read the data in parallel. If you have 64 PQ slaves, 64 files will be read in parallel. The number of PQ slaves that can be used is limited by the number of location files.
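As a rough sketch of what this looks like from the database side (the external table name and degree of parallelism below are placeholders, not taken from this deck), the degree of parallelism is requested the same way as for any Oracle table:

    -- Enable parallel query in the session, then request a DOP on the external table.
    -- With 64 PQ slaves, up to 64 location files are read concurrently.
    ALTER SESSION ENABLE PARALLEL QUERY;
    SELECT /*+ PARALLEL(movie_fact_ext, 64) */ COUNT(*)
    FROM movie_fact_ext;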
When OSCH is invoked with the -createTable option, the external table definition is generated, the external table is created, and the location files are populated. You can examine the location files if you like. Their contents were also displayed on screen, along with the external table definition.
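For reference, a representative -createTable invocation looks roughly like the following; the paths and the configuration file name are placeholders for whatever your environment uses:

    # Generate the external table definition, create the table, and populate its location files
    hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
        oracle.hadoop.exttab.ExternalTable \
        -conf /home/oracle/osch_moviefact_conf.xml \
        -createTable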
Interesting properties: tableName (name of the external table), sourceType, hive.tableName, hive.databaseName
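A minimal configuration sketch for a Hive source might look like this; the connection details, directory object, and database/table names are illustrative, not taken from the deck:

    <!-- Sketch only: values are placeholders -->
    <configuration>
      <property>
        <name>oracle.hadoop.exttab.tableName</name>
        <value>MOVIE_FACT_EXT</value>
      </property>
      <property>
        <name>oracle.hadoop.exttab.sourceType</name>
        <value>hive</value>
      </property>
      <property>
        <name>oracle.hadoop.exttab.hive.databaseName</name>
        <value>moviedemo</value>
      </property>
      <property>
        <name>oracle.hadoop.exttab.hive.tableName</name>
        <value>movie_fact</value>
      </property>
      <property>
        <name>oracle.hadoop.exttab.defaultDirectory</name>
        <value>MOVIE_DIR</value>
      </property>
      <property>
        <name>oracle.hadoop.connection.url</name>
        <value>jdbc:oracle:thin:@//dbhost:1521/orcl</value>
      </property>
    </configuration>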
Let us try some queries on this external table
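For example (table and column names here are illustrative, in the spirit of the MoviePlex schema rather than copied from it):

    -- Count the rows that live in HDFS, read through the external table
    SELECT COUNT(*) FROM movie_fact_ext;

    -- Join HDFS data with a dimension table already in the database
    SELECT c.segment, SUM(f.sales) AS total_sales
    FROM   movie_fact_ext f
    JOIN   customer c ON c.cust_id = f.cust_id
    GROUP  BY c.segment;

    -- Or materialize the data into a regular table for repeated analysis
    INSERT /*+ APPEND */ INTO movie_fact
    SELECT * FROM movie_fact_ext;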
You will see that the external table has two location files, because of the value we specified in the locationFileCount property. You can see that the URIs of the smaller data files have been grouped into one location file. OSCH does this to load balance the reading of data as much as possible. URIs in the location files are read in parallel. You can examine the location files if you like.
Interesting properties: tableName (the external table that will be created), sourceType, dataPaths, locationFileCount
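A configuration sketch for delimited text files in HDFS, again with placeholder paths and names, could look like this (locationFileCount is set to 2 to match the walkthrough above):

    <!-- Sketch only: values are placeholders -->
    <configuration>
      <property>
        <name>oracle.hadoop.exttab.tableName</name>
        <value>MOVIE_LOG_EXT</value>
      </property>
      <property>
        <name>oracle.hadoop.exttab.sourceType</name>
        <value>text</value>
      </property>
      <property>
        <name>oracle.hadoop.exttab.dataPaths</name>
        <value>/user/oracle/moviework/data/part-*</value>
      </property>
      <property>
        <name>oracle.hadoop.exttab.locationFileCount</name>
        <value>2</value>
      </property>
      <property>
        <name>oracle.hadoop.exttab.defaultDirectory</name>
        <value>MOVIE_DIR</value>
      </property>
      <property>
        <name>oracle.hadoop.connection.url</name>
        <value>jdbc:oracle:thin:@//dbhost:1521/orcl</value>
      </property>
    </configuration>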
How does this perform? The alternative to OSCH is to use Fuse-dfs. We are 5 times faster than Fuse-dfs, while using 75% less CPU. The test was performed on a BDA (18 Sun X4270 M2 servers, 216 cores, 48 GB memory per server, 864 GB total) and an Exadata X2-8 single instance (8 Intel Xeon X7560 servers, 64 cores, 1 TB memory). The data size used in the CPU usage graph is 0.25 TB.
OLH is a MapReduce job that runs on the Hadoop cluster. The job is submitted to the cluster like any MapReduce job. Data is read through input formats, and database table partitions are loaded in parallel by reducer tasks. There are online and offline modes. Online: pre-process and load in the same job. Offline: write out data files on HDFS (text or Oracle Data Pump) for loading later. The data pre-processing performs partitioning, sorting, and data conversion on Hadoop.
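A representative job submission, with placeholder paths and configuration file names, looks roughly like this:

    # Submit the loader job to the Hadoop cluster like any other MapReduce job
    hadoop jar $OLH_HOME/jlib/oraloader.jar \
        oracle.hadoop.loader.OraLoader \
        -conf /home/oracle/olh_moviefact_conf.xml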
Now let us look at Oracle Loader for Hadoop. In addition to the file containing the configuration parameters, we have a loader map file that describes the columns in the target table we are loading into. If all columns in the table are loaded and the data columns have the default date format, this file is not needed. Here the date format in the data is different from the default and is specified in the loader map file.
We first create the target table in the database that we want to load data into.
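For instance, a hypothetical MoviePlex-style target table (the column names and partitioning scheme are illustrative only) might be created as:

    -- Hash partitioning lets OLH reducer tasks load partitions in parallel
    CREATE TABLE movie_fact (
      cust_id     NUMBER,
      movie_id    NUMBER,
      genre_id    NUMBER,
      time_id     TIMESTAMP,
      activity_id NUMBER,
      rating      NUMBER,
      sales       NUMBER
    )
    PARTITION BY HASH (cust_id) PARTITIONS 8;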
The mapreduce.outputformat.class property specifies OCIOutputFormat. This specifies that the online load option with direct path load will be used. mapred.input.dir specifies the data path for the data files. mapreduce.inputformat.class specifies that the data is in delimited text format (DelimitedTextInputFormat).
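Put together, the relevant part of the job configuration might look like the sketch below; the class names follow the OLH conventions described above, while the paths, loader map location, and connection details are placeholders:

    <!-- Sketch only: values are placeholders -->
    <configuration>
      <property>
        <name>mapreduce.inputformat.class</name>
        <value>oracle.hadoop.loader.lib.input.DelimitedTextInputFormat</value>
      </property>
      <property>
        <name>mapred.input.dir</name>
        <value>/user/oracle/moviework/data</value>
      </property>
      <property>
        <name>mapreduce.outputformat.class</name>
        <value>oracle.hadoop.loader.lib.output.OCIOutputFormat</value>
      </property>
      <property>
        <name>oracle.hadoop.loader.loaderMapFile</name>
        <value>file:///home/oracle/loaderMap_moviefact.xml</value>
      </property>
      <property>
        <name>oracle.hadoop.loader.connection.url</name>
        <value>jdbc:oracle:thin:@//dbhost:1521/orcl</value>
      </property>
    </configuration>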
Loader Map file. Note the specification of the date format.
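A sketch of such a loader map file is shown below, using the older element-based syntax with placeholder field and column names (check the OLH documentation for the exact syntax of your release); note the explicit date format on the TIME_ID column:

    <!-- Sketch only: schema, table, fields, and format are placeholders -->
    <LOADER_MAP>
      <SCHEMA>MOVIEDEMO</SCHEMA>
      <TABLE>MOVIE_FACT</TABLE>
      <COLUMN field="F0">CUST_ID</COLUMN>
      <COLUMN field="F1">MOVIE_ID</COLUMN>
      <COLUMN field="F2">GENRE_ID</COLUMN>
      <COLUMN field="F3" format="yyyy-MM-dd:HH:mm:ss">TIME_ID</COLUMN>
    </LOADER_MAP>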
We use 85% less CPU, and are more than ten times as fast. The data size used in the CPU usage graph is 0.25 TB.
This is a big deal. We spend significant time and effort keeping up with the versions. This saves you the time of making a connector work with the Hadoop distribution you are working with.
The connectors can be used together. Oracle Data Pump files can be created by Oracle Loader for Hadoop and then accessed or loaded into Oracle Database using Oracle SQL Connector for HDFS. So if the data is not in delimited text files, Oracle Loader for Hadoop can first be used to transform it into Data Pump files (or delimited text files), which are then loaded or accessed by Oracle SQL Connector for HDFS. This is also a good time to highlight the offline load option of OLH.
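In configuration terms (again a sketch with placeholder values), the hand-off comes down to two settings: OLH's offline mode writes Data Pump files with its Data Pump output format, and OSCH then reads them by declaring a datapump source type:

    <!-- OLH job configuration: write Oracle Data Pump files to HDFS (offline mode) -->
    <property>
      <name>mapreduce.outputformat.class</name>
      <value>oracle.hadoop.loader.lib.output.DataPumpOutputFormat</value>
    </property>

    <!-- OSCH configuration: point the external table at those Data Pump files -->
    <property>
      <name>oracle.hadoop.exttab.sourceType</name>
      <value>datapump</value>
    </property>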