Kurt Lueck has over 20 years of experience in the Business Intelligence and Analytics field. During his consulting career he has worked with more than 40 organizations across multiple industries on a variety of technologies. In his current role, Mr. Lueck manages the BI & Analytics practice for Pactera.
Good afternoon, and good morning on the West Coast. I appreciate everyone's attendance and sincerely hope that you gather some very valuable information and insights from our presentation today. This is a very exciting topic. As promised, we have our 10 steps, 5 critical mistakes, and 2 success stories, but I also wanted to start with some quick definitions, drivers, and key predictions. This presentation was built as a primer, and we will be having several follow-up presentations in the coming weeks and months that dive deeper into industries (Financial Services and Retail, for example) and particular vendor solutions (for example: What is Oracle's Big Data solution?). Let's get started.
OK, I feel obligated to start with the 4 V's. The definition of Big Data has been a work in progress over the past few years, but the established definition at this point always has the 3 V's somewhere in the mix. Most recently I have seen another V mentioned, but first the traditional 3 V's.
Volume – This is probably the most mentioned. The sheer volume of data has been the biggest driver.
Velocity – As the saying goes, speed kills. Social media put the bullet in most traditional attempts for retail organizations, and other industries, such as our Energy client, are getting overwhelmed by Smart Grid initiatives. Each industry has its own issues from some new technology.
Variety – If it were just traditional data, there probably would not be a necessity for any of this discussion. However, the fact is we have many different types of data that are simply not handled well in traditional Oracle/DB2/SQL databases. Sure, they can store them, but they cannot do anything with them very efficiently.
What is the fourth V? Value.
There is general consensus that there are three drivers of Big Data. The first is something called Dark Data – the data that we stored because we had to, or wanted to, but never used. The thought was that we had better store it because at some point we might get some value out of it. We never did, and this data volume has increased and increased.
I am always doing research, and these predictions seem very relevant to our presentation today. These are straight from Gartner. I won't read all of these predictions, but the bottom line is that Big Data IS in a hype cycle… but it IS here to stay. I was recently at the TDWI conference on Big Data, and the group was reminded that there have been a number of terms that in the beginning were used in front of every product – WEB-ENABLED, for example. Today that is simply the assumption. Big Data is here to stay for a number of reasons. Enterprise clients MUST engage Big Data as a competitive advantage today and later as an equalizer. The last point that I want to drive home is the number of jobs that will go unfilled in the Big Data arena. If you have any college-age kids, this is where you should push them. However, I believe it takes a very science-oriented mind to really engage this profession.
Most IT departments are simply feeling overwhelmed by the amount of data and the amount of pressure from the business to combine data to provide business insight. This can be an incredibly exciting opportunity for IT and the business to work together. If you can gain an understanding of what a Big Data solution looks like, then and only then will you be able to determine how Big Data can actually help. The chart on this page shows, in general, which areas are most positively impacted by Big Data. Financial Services, as usual, is right up front on overall volume and velocity of data; Media Services, however, has a high variety of data. As an interesting side note, Pactera has worked with Microsoft to develop solutions that will read in videos and decipher them into textual, hence searchable, output. This is just one of many examples where EVERYTHING is becoming searchable: pictures, videos, blogs, and traditional data. Action Item: Look around your enterprise and identify scenarios where combining and analyzing diverse datasets will generate substantial business value.
The main thing that worries me for companies is this role called the data scientist. I believe most organizations simply do not have any, or enough. What are the key roles of a data scientist? To make a big data project, or any analytics project, succeed, you actually need a lot of skills. I think of it as a combination of functional skills and technical skills. Most people, when they think of data scientists, think of the technical side. Their minds immediately go to analytics, which is important, but it's not the whole story. To me there are two sides: analytics and design. On the analytics side, it's the things around statistics, operations research, and computer science; machine learning in particular is important for data science. But then there's technology in the sense of being able to understand systems, particularly large systems, because you need to store data all over the place in distributed form, plus the ability to program – to write code that acts as the glue to put all these pieces together. The second functional area is design: being able to create an interface to the data so people will find it usable. And there's the data side, which is data manipulation, data modeling, and data cleansing. So if I got the numbers right, there should be two functional skill sets and four technical skill sets, and all of those need to be combined to make a good data science project work. This is a LOT to ask of ONE person. I believe this set of skills comes from teams of individuals who work on projects together and use each other's strengths.
Stage 1 – Initial: At this stage, organizations have sporadic, inconsistent, and uncoordinated information management activities. The organization makes decisions based on inaccurate and incomplete information aggregated by various departments/LOBs through inconsistent processes. Information is fragmented and inconsistent across many different applications and data stores under different LOBs. Business and IT organizations view information as a byproduct of applications, usually handled on a project-by-project and department-by-department basis. There is no concept of information ownership or stewardship regarding governance, security, or accountability of key information assets.
Stage 2 – Manage: At this stage, organizations perceive enterprise information management as necessary to be more effective and efficient across multiple business processes and LOBs. They are taking actions to improve information management, but mostly focused on immediate needs, reactively and inconsistently.
Stage 3 – Advance: At this stage, organizations identify information-driven activities as critical for business growth and cost reduction. Organizations formally establish enterprise information management with support from executive management and actively build these capabilities.
Stage 4 – Optimize: At this stage, organizations complete significant portions of the Information Architecture domain components. Enterprise information becomes pervasive and part of the foundation of business processes that drive profitability and organizational effectiveness.
Stage 5 – Innovate: At this stage, organizations extend the boundary of the entire information ecosystem to external sources and channels to provide innovation in organizational growth and to drive the market. Information Architecture becomes part of the culture of the organization.
Oracle offers a broad portfolio of products to help enterprises acquire, manage, and integrate big data with existing information, with the goal of achieving a complete view of the business in the fastest, most reliable, and most cost-effective way. The Oracle Big Data Appliance is an engineered system of hardware and software designed to help enterprises derive maximum value from their big data strategies. It combines optimized hardware with a comprehensive software stack featuring specialized solutions developed by Oracle to deliver a complete, easy-to-deploy offering for acquiring, organizing, and analyzing big data, with enterprise-class performance, availability, supportability, and security. The Oracle Big Data Appliance incorporates Cloudera's Distribution including Apache Hadoop with Cloudera Manager, plus an open source distribution of R, all running on Oracle Linux. It comes in a full rack configuration of 18 Oracle Sun servers and scales by connecting multiple racks together via an InfiniBand network, enabling it to acquire, organize, and analyze extreme data volumes. The Oracle Big Data Appliance offers the following benefits:
- Rapid provisioning of a highly available and scalable system for managing massive amounts of data
- A high-performance platform for acquiring, organizing, and analyzing big data in Hadoop and using R on raw data sources
- Control of IT costs by pre-integrating all hardware and software components into a single big data solution that complements enterprise data warehouses
Oracle Big Data Connectors is an optimized software suite to help enterprises integrate data stored in Hadoop or Oracle NoSQL Database with Oracle Database 11g. It enables very fast data movement between these two environments using Oracle Loader for Hadoop and Oracle Direct Connector for Hadoop Distributed File System (HDFS), while Oracle Data Integrator Application Adapter for Hadoop and Oracle R Connector for Hadoop provide non-Hadoop experts with easier access to HDFS data and MapReduce functionality.
Oracle Big Data Appliance includes a combination of open source software and specialized software developed by Oracle to address enterprise big data requirements. The Oracle Big Data Appliance integrated software includes:
- Full distribution of Cloudera's Distribution including Apache Hadoop (CDH)
- Cloudera Manager to administer all aspects of Cloudera CDH
- Open source distribution of the statistical package R for analysis of unfiltered data on Oracle Big Data Appliance
- Oracle NoSQL Database Community Edition
- Oracle Enterprise Linux operating system and Oracle Java VM
If you are looking for an Oracle version of Big Data literally in a box, then this is it!
While Hadoop offers many advantages for organizations, it is not a wholesale replacement for traditional relational systems or other storage and analysis solutions. Rather, Hadoop is a strong complement to many existing systems. The combination of these technologies offers enterprises tremendous opportunities to maximize IT investments and expand business capabilities by aligning IT workloads to the strengths of each system.
Oracle Exalytics In-Memory Machine is purpose-built to deliver the fastest performance for business intelligence (BI) and planning applications. It is designed to provide real-time, speed-of-thought visual analysis and enable new types of analytic applications, so organizations can make decisions faster in the context of rapidly shifting business conditions, while broadening user adoption of BI through the introduction of interactive visualization capabilities. Organizations can extend BI initiatives beyond reporting and dashboards to modeling, planning, forecasting, and predictive analytics. The Oracle Exalytics In-Memory Machine is the industry's first engineered in-memory analytics machine delivering extreme performance for Business Intelligence and Enterprise Performance Management applications. The hardware is a single server optimally configured for in-memory analytics on business intelligence workloads, and includes powerful compute capacity, abundant memory, and fast networking options. It features an optimized Oracle BI Foundation Suite and the Oracle TimesTen In-Memory Database for Exalytics. BI Foundation takes advantage of the large memory, processors, concurrency, storage, networking, operating system, kernel, and system configuration of the Oracle Exalytics hardware. This optimization results in better query responsiveness, higher user scalability, and markedly lower TCO compared to standalone software. The TimesTen In-Memory Database for Exalytics is an optimized in-memory analytic database, with features exclusively available on the Oracle Exalytics platform. How do Exalytics and Exadata go together? InfiniBand: two quad data rate (QDR) 40 Gb/s InfiniBand ports are available with each machine expressly for Oracle Exadata. When connected to Oracle Exadata, Oracle Exalytics becomes an integral part of the Oracle Exadata private InfiniBand network and has high-speed, low-latency access to the database servers. When multiple Oracle Exalytics machines are clustered together, the InfiniBand fabric also serves as the high-speed cluster interconnect.
All right, this slide puts it all together. If you are an Oracle shop, you have a lot of choices in designing and implementing your big data architecture. Please hear me: your Oracle Big Data architecture should be aligned with your BI strategy and ultimately your business. Big Data is not a standalone concept; it should fit within your existing BI strategy. At the ground level you can start with…
The point of this slide is to begin to show the nuances and decisions that will have to be made when you design and purchase your Oracle Big Data strategy. At one extreme you can go completely open source; at the other end, you can go completely Oracle Big Data. Here is a brief outline of Big Data capabilities and their primary technologies:

Storage and Management Capability
- Hadoop Distributed File System (HDFS): An Apache open source distributed file system (http://hadoop.apache.org). Expected to run on high-performance commodity hardware. Known for highly scalable storage; automatic data replication across three nodes provides fault tolerance and eliminates the need for backup. Write once, read many times.
- Cloudera Manager: An end-to-end management application for Cloudera's Distribution of Apache Hadoop (http://www.cloudera.com). Cloudera Manager gives a cluster-wide, real-time view of the nodes and services running; provides a single, central place to enact configuration changes across the cluster; and incorporates a full range of reporting and diagnostic tools to help optimize cluster performance and utilization.

Database Capability
- Oracle NoSQL: Dynamic and flexible schema design. High-performance key-value pair database; a key-value pair is an alternative to a pre-defined schema, used for non-predictive and dynamic data, and able to efficiently process data without a row-and-column structure. The major + minor key paradigm allows multiple record reads in a single API call. Highly scalable: multi-node, multiple data centers, fault tolerant, ACID operations. Simple programming model with random index reads and writes. Not Only SQL: simple pattern queries and custom-developed solutions (such as Java APIs) to access data.
- Apache HBase: Allows random, real-time read/write access. Strictly consistent reads and writes. Automatic and configurable sharding of tables. Automatic failover support between Region Servers.
- Apache Cassandra: Data model offers column indexes with the performance of log-structured updates, materialized views, and built-in caching. Fault tolerance is designed into every node, replicating across multiple data centers. Can choose between synchronous and asynchronous replication for each update.
- Apache Hive: Tools to enable easy data extract/transform/load (ETL) from files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase. Uses a simple SQL-like query language called HiveQL. Query execution via MapReduce.

Processing Capability
- MapReduce: Defined by Google in 2004. Breaks a problem up into smaller sub-problems and distributes data workloads across thousands of nodes. Can be exposed via SQL and in SQL-based BI tools.
- Apache Hadoop: The leading MapReduce implementation. Highly scalable parallel batch processing on highly customizable infrastructure. Writes multiple copies across the cluster for fault tolerance.

Data Integration Capability
- Oracle Big Data Connectors, Oracle Loader for Hadoop, Oracle Data Integrator: Exports MapReduce results to RDBMS, Hadoop, and other targets. Connects Hadoop to relational databases for SQL processing. Includes a graphical integration designer that generates Hive scripts to move and transform MapReduce results. Optimized processing with parallel data import/export. Can be installed on Oracle Big Data Appliance or on a generic Hadoop cluster.

Statistical Analysis Capability
- Open Source Project R and Oracle R Enterprise: R is a programming language for statistical analysis. Oracle R Enterprise introduces it into Oracle Database as a SQL extension to perform high-performance in-database statistical analysis, and allows reuse of pre-existing R scripts with no modification.
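To make the MapReduce idea above concrete, here is a minimal word-count sketch in the Hadoop Streaming style, written in Python. It is an illustrative example only, not code from any of the products listed: Hadoop Streaming runs any executable as the mapper or reducer over stdin/stdout, and the framework sorts mapper output by key before the reducer runs.

```python
#!/usr/bin/env python3
# Minimal sketch of the MapReduce pattern (Hadoop Streaming style).
# The mapper emits (word, 1) pairs; the reducer sums counts per word.
# Hadoop sorts mapper output by key, so the reducer sees all lines for
# a given word consecutively and can aggregate with a running total.
import sys

def mapper(lines):
    for line in lines:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(lines):
    current, count = None, 0
    for line in lines:
        word, value = line.strip().split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    # Invoke as "wordcount.py map" or "wordcount.py reduce"; a Hadoop
    # Streaming job would pass these via its -mapper and -reducer flags.
    mapper(sys.stdin) if sys.argv[1] == "map" else reducer(sys.stdin)
```

The same split-then-aggregate shape underlies everything from log analysis to the sensor processing discussed later in the use cases.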
This slide discusses the five most common mistakes that we are seeing in the marketplace. In no particular order:
1) Lack of expertise – I am actually not referring to the Hadoop or Java expertise that is required; if that were the case, most projects would never even get started. I am referring more to data scientist types of resources. Various "critical thinking" organizations such as Gartner project a significant shortfall of data scientists. The truth is you may have to develop this internally, and I would suggest you do that now. I also suggest looking at your universities and hiring graduates from analytics programs.
2) Big Data projects without a problem – We are certainly in a hype cycle around Big Data. This is natural with any technology that can be a game changer. More than likely your company does have business problems that can be assisted by Big Data solutions; the alignment between savvy business users and technology-enabled IT departments is still in the works.
3) Lack of technology alignment – By this I mean that it is very easy to begin purchasing point Big Data solutions for one specific problem. Watch out: this same problem has been happening for years with our hype cycles. Let's get a bit smarter on this cycle.
4) This flows directly into my next caution – develop a longer-term roadmap. If you are going to start a Big Data project, that means you will be purchasing software and may be hiring resources. Before you start, it may be time for a short Big Data strategy: understand what happens after the first project. I am absolutely in favor of starting with a POC and starting small. However, before large investments, think through the two-year plan. Pactera's Big Data Strategy is a quick engagement that reviews each major business group in an organization and looks for detailed problems that may be solved by Big Data solutions. It's a great engagement whose outcome is a one-to-two-year plan for implementing Big Data.
5) Lack of critical evaluation – I feel this has been missing in most IT projects. At the end of the project, did we achieve the expected business goals? If the answer is no, then let's figure out why and make improvements.
I now want to present two business cases from real-life projects. The first project is for one of the largest online travel organizations in the world; let's call them Acme OnLine Travel (AOL). Pactera has had a relationship with AOL for over six years. We built the data warehouse, we understand the business very well, and frankly we understand the weaknesses of the BI solution. The volumes of data were very high and the cost to maintain was growing. The data sources for this client were everything from traditional ERP systems and clickstream data to social media such as Facebook. It's not hard to see why the volumes were high; petabytes is the norm. A main driver for this project was to reduce the cost per TB, which was running at ten thousand USD. So a few years ago we suggested to AOL that a Big Data solution was most likely necessary if they wanted to remain competitive in this industry. It started with some POCs and then, at first, moved into building onto the current BI system. We are now beginning to see the natural death of some portions of the traditional BI system; I say natural death because the business users are simply not using some of the old methods. The most interesting and hard-hitting piece is the predictive analytic functions that are being built on top of the base Hadoop file system. One of the most recent changes is our move to near real-time with a newer Big Data product called Impala. Our team has been working with Impala for the past year or so, even before it was officially released. This addresses one of the CRITICAL issues with Big Data: the lack of real-time capabilities.
Let's talk about Impala for a moment. This graph shows our own testing at this client with petabytes of data. As you can see, the performance difference between Hive and Impala is quite stark. What I also wanted to draw out is that, despite our success with Big Data at this client, a large number of people use Hadoop only to get data so that they can process it in a traditional RDBMS. A lot of this is simply because people are more comfortable, and end-user tools are more user-friendly, on relational/traditional databases. Please keep in mind that when it says FASTER on that third line, it is referring to the much smaller sets of data that we place into the RDBMS. In conclusion, the solution provided faster, more intelligent insights, and the cost is down to less than two thousand USD per TB in Hadoop, from ten thousand.
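To give a feel for why Impala changes the interactivity story, here is a hedged sketch of running an ad-hoc query against Impala from Python using the open-source impyla client. The host, port, table, and column names are hypothetical, not from this project; the point is that Impala executes the SQL directly against data in HDFS instead of compiling it into batch MapReduce jobs as Hive does, which is where the latency gap in the graph comes from.

```python
# Hedged sketch: an ad-hoc SQL query against Impala via the open-source
# impyla client (pip install impyla). Host, port, table, and column
# names are hypothetical; 21050 is Impala's usual HiveServer2-protocol
# port, but confirm against your own cluster configuration.
from impala.dbapi import connect

conn = connect(host="impala-daemon.example.com", port=21050)
cur = conn.cursor()

# Impala runs this interactively against data already sitting in HDFS;
# the same statement in Hive would be planned as batch MapReduce jobs.
cur.execute("""
    SELECT page, COUNT(*) AS views
    FROM clickstream
    WHERE event_date = '2013-06-01'
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")
for page, views in cur.fetchall():
    print(page, views)

cur.close()
conn.close()
```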
The final case study that I want to present is around retail. The picture that you are seeing is the goal of most major retailers: to drive marketing, and eventually a sale, so personalized that it feels like they know the customer on a one-on-one basis. Oh, and by the way, without crossing the "creep factor" line – the line where the customer feels violated. This was the case with our client. Our client had a mix of the following types of data:
- Store POS
- Web clickstream
- Social media
- Financial
- A BUNCH of spreadsheets
- Customer satisfaction data
- Call center data
Just as in the last case study, the volume of data was growing and the cost to manage it was growing even faster. The project started with a POC and has now reached into several departments. Examples of business problems/projects include customer buying behavior, price optimization (as in changing prices on the web based on behavior), and space planning. All of these projects were accomplished with a theory, a model, and a lot of testing (a minimal illustration of that loop follows below). Eventually, when good models were built and tested significantly, the models were embedded into the client's operational systems. What I have walked away with from these projects and research is how much psychology is required to be successful. This particular client is actually using Big Data solutions combined with SAS and several other traditional BI tools.
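As a purely illustrative sketch of that theory/model/testing loop, here is what one iteration might look like in Python with scikit-learn. The client actually used SAS and other traditional tools, and the features and data below are hypothetical stand-ins for real customer attributes.

```python
# Purely illustrative sketch of the theory -> model -> test loop in
# Python/scikit-learn (the client used SAS and other BI tools).
# Features and data are hypothetical stand-ins for customer attributes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical per-customer features: visits per week, average basket
# size, pages viewed per session. Label: bought the promoted item.
X = rng.random((1000, 3))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.2, 1000) > 0.9).astype(int)

# Hold out a test set -- the "a lot of testing" step -- before any
# model is trusted enough to embed in an operational system.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", round(model.score(X_test, y_test), 3))
```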
The key components in this architecture:
- Oracle Big Data Appliance (or other Hadoop solutions): powered by the full distribution of Cloudera's Distribution including Apache Hadoop (CDH) to store logs, reviews, and other related big data.
- Oracle Big Data Connectors: create optimized data sets for efficient loading and analysis in Oracle Database 11g and Oracle R Enterprise.
- Oracle Database 11g: External Table, a feature in Oracle Database that presents data stored in a file system in table format so it can be used in SQL queries transparently.
- Traditional SQL tools: Oracle SQL Developer, a development tool with a graphical user interface that allows users to access data stored in a relational database using SQL.
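To show what the external-table piece looks like in practice, here is a hedged sketch of querying one from Python with the cx_Oracle driver. All object names, credentials, and the sample DDL are hypothetical; Oracle Direct Connector for HDFS builds on this same external-table mechanism to expose HDFS files to SQL.

```python
# Hedged sketch: querying an Oracle external table from Python via the
# cx_Oracle driver. Credentials and all object names are hypothetical.
# The external table would be defined once in SQL, roughly:
#
#   CREATE TABLE reviews_ext (review_id NUMBER, review_text VARCHAR2(4000))
#   ORGANIZATION EXTERNAL (
#     TYPE ORACLE_LOADER DEFAULT DIRECTORY reviews_dir
#     ACCESS PARAMETERS (FIELDS TERMINATED BY ',')
#     LOCATION ('reviews.csv'));
#
# after which it reads like any other table, transparently to SQL.
import cx_Oracle

conn = cx_Oracle.connect(user="bi_user", password="secret",
                         dsn="dbhost.example.com/orcl")
cur = conn.cursor()
cur.execute("SELECT review_id, review_text FROM reviews_ext WHERE ROWNUM <= 5")
for review_id, review_text in cur:
    print(review_id, review_text)
conn.close()
```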
The third use case continues our discussion of the insurance company mentioned in the earlier section of this paper. In a nutshell, the insurance giant needs to capture the large amounts of sensor data that track its customers' driving habits, store them in a cost-effective manner, process this data to determine trends and identify patterns, and integrate the end results with the existing transactional, master, and reference data it is already capturing. The large amount of sensor data needs to be transferred to and stored in a centralized environment that provides a flexible data structure, fast processing, and scalability and parallelism. MapReduce functions are needed to process the low-density data to identify patterns and trending insights, and the end results need to be integrated into the database management system alongside the structured data.
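Here is a hedged sketch of what that pattern-detection step could look like, again in the Hadoop Streaming style used earlier. The input record format ("driver_id,timestamp,speed_mph") and the average-speed metric are hypothetical stand-ins for the real trend analysis the insurer would run.

```python
# Hedged sketch of the sensor pattern-detection step as a streaming
# MapReduce pair. The record format and the average-speed metric are
# hypothetical stand-ins for the insurer's real trend analysis.
import sys

def mapper():
    # Emit (driver_id, speed) for every raw sensor reading.
    for line in sys.stdin:
        driver_id, _timestamp, speed = line.strip().split(",")
        print(f"{driver_id}\t{speed}")

def reducer():
    # Keys arrive sorted, so one driver's readings are contiguous.
    current, total, count = None, 0.0, 0
    for line in sys.stdin:
        driver_id, speed = line.strip().split("\t")
        if driver_id != current:
            if current is not None:
                print(f"{current}\t{total / count:.1f}")
            current, total, count = driver_id, 0.0, 0
        total += float(speed)
        count += 1
    if current is not None:
        print(f"{current}\t{total / count:.1f}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

The small, per-driver result set this produces is exactly the kind of output that would then be loaded back into the relational database alongside the existing transactional and master data.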
Big Data is not the solution. The solution is some use of technology that enables business answers. The four bullets here represent the four focus areas of our BI & Analytics practice in 2013. I believe Big Data is the foundation for many of these other solutions.
I love this story because it is so hard-hitting, especially if you have daughters like I do. Most of you have heard the story, so I won't go into all of the details. The basic gist goes something like this: Target started a predictive analytics project that was so successful and accurate that it actually predicted a daughter's pregnancy before her father knew. Google the story for the full details if you have not heard it. I wanted to end on this because we all have a corporate responsibility to use our technology without crossing the privacy line with our customers.