Life used to be simple and very transactional in nature. Early 90s, ERP: transactions count your sales by customer, by location. Late 90s: the age of segmentation and targeted offers; customer operations merge with marketing. Now, life is more complex, connected, and interactional in nature! Digital marketing enables measurement of interactions across channels. Social networks, mobile commerce, and user-generated content increase the TYPES and VOLUMES of data generated, both by system-to-system communication and by the data exhaust of customer behavior such as clickstreams. And big data is just beginning: we haven't even listed the sensors, telematics, and other machine-generated data that is predicted to eclipse even what social networks generate.
Facts that bolster this vision include:
80% - 90% of the world’s data is unstructured or semi-structured (Forrester, IDC, and Gartner all agree).
Data volumes have increased exponentially over the past decade and continue to do so (IDC, McKinsey reports).
Hadoop is uniquely designed to store and process this type of data, at scale, across commodity systems.
The major server and storage platform vendors are all creating Hadoop-focused strategies.
Apache Hadoop Leadership
Sanjay Radia: HDFS Core Lead Architect, 4+ years on Hadoop. Major projects include Append v2, the Capacity Scheduler, Federation, and HA.
Owen O’Malley: Leading committer of code to Hadoop, 5+ years on Hadoop. Original Hadoop architect at Yahoo!. Drove the implementation of security throughout the project.
Arun Murthy: Original MapReduce lead, 5+ years on Hadoop. Currently lead architect and release manager of Apache Hadoop .23.
Matt Foley: Release manager of Apache Hadoop .20.205. Former Director of Engineering for Yahoo! Mail, now running Hortonworks’ quality and release efforts.
Devaraj Das: Built the original MapReduce development team at Yahoo!, 5+ years on Hadoop. Now leading the Apache Ambari (Hadoop management) project.
Alan Gates: Lead of Pig and HCatalog, 3+ years on Hadoop.
Infrastructure Platform (Servers, Storage, Network, Operating System, Virtualization, Cloud)
Systems Management (Installation, Configuration, Administration, Monitoring, Performance, Security Mgmt, Capacity Mgmt, Quality of Service)
Data Management Systems (SQL, NoSQL, NewSQL, EDW, Datamarts, MPP DBs, Search, Indexing, MDM, etc.)
Data Movement & Integration (ETL, Data Quality, Integration Middleware, Event Processing)
Tools & Languages (IDEs, Programming Languages, other tools)
Business Intelligence & Analytics (Analytics, Reporting, Visualization, and Dashboards)
Applications & Solutions (SaaS offerings, bundled solutions, etc.)
In the graphic above, Apache Hadoop acts as the Big Data Refinery. It’s great at storing, aggregating, and transforming multi-structured data into more useful and valuable formats.

Apache Hive is a Hadoop-related component that fits within the Business Intelligence & Analytics category, since it is commonly used for querying and analyzing data within Hadoop in a SQL-like manner. Apache Hadoop can also be integrated with EDW, MPP, and NewSQL components such as Teradata, Aster Data, HP Vertica, IBM Netezza, EMC Greenplum, SAP HANA, Microsoft SQL Server PDW, and many others.

Apache HBase is a Hadoop-related NoSQL key/value store that is commonly used for building highly responsive next-generation applications. Apache Hadoop can also be integrated with other SQL, NoSQL, and NewSQL technologies such as Oracle, MySQL, PostgreSQL, Microsoft SQL Server, IBM DB2, MongoDB, DynamoDB, MarkLogic, Riak, Redis, Neo4j, Terracotta, GemFire, SQLFire, VoltDB, and many others.

Finally, data movement and integration technologies help ensure data flows seamlessly between the systems in the diagram; the lines in the graphic are powered by technologies such as WebHDFS, Apache HCatalog, Apache Sqoop, Talend Open Studio for Big Data, Informatica, Pentaho, SnapLogic, Splunk, Attunity, and many others.
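To make "querying Hadoop in a SQL-like manner" concrete, here is a minimal local sketch. It uses Python's built-in sqlite3 as a stand-in engine, not Hive itself; the table and column names (page_views, user_id) are hypothetical, but a HiveQL GROUP BY over clickstream data would read almost identically.

```python
# Illustrative sketch only: sqlite3 stands in for Hive so the example runs
# locally. The query shape mirrors a typical HiveQL aggregation.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE page_views (user_id TEXT, url TEXT)")
cur.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("u1", "/home"), ("u1", "/cart"), ("u2", "/home"), ("u1", "/checkout")],
)

# Count page views per user -- the kind of question Hive answers over
# much larger data sets stored in Hadoop.
cur.execute(
    "SELECT user_id, COUNT(*) AS views FROM page_views "
    "GROUP BY user_id ORDER BY views DESC"
)
print(cur.fetchall())  # [('u1', 3), ('u2', 1)]
```

In Hive, the same statement would run as a distributed job over files in HDFS rather than against an in-memory database.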
At the highest level, I describe three broad areas of data processing and outline how these areas interconnect. The three areas are:
1. Business Transactions & Interactions
2. Business Intelligence & Analytics
3. Big Data Refinery

The graphic illustrates a vision for how these three types of systems can interconnect in ways aimed at deriving maximum value from all forms of data.

Enterprise IT has been connecting systems via classic ETL processing, as illustrated in Step 1 above, for many years in order to deliver structured and repeatable analysis. In this step, the business determines the questions to ask and IT collects and structures the data needed to answer them.

The “Big Data Refinery”, highlighted in Step 2, is a new system capable of storing, aggregating, and transforming a wide range of multi-structured raw data sources into usable formats that help fuel new insights for the business. The Big Data Refinery provides a cost-effective platform for unlocking the potential value within data and discovering the business questions worth answering with it. A popular example of big data refining is processing web logs, clickstreams, social interactions, social feeds, and other user-generated data sources into more accurate assessments of customer churn or more effective personalized offers.

More interestingly, some businesses derive value from processing large video, audio, and image files. Retail stores, for example, are leveraging in-store video feeds to better understand how customers navigate the aisles as they find and purchase products. Retailers that provide optimized shopping paths and intelligent product placement within their stores are able to drive more revenue for the business.
In this case, while the video files may be big in size, the refined output of the analysis is typically small in size but potentially big in value.

The Big Data Refinery platform provides fertile ground for new types of tools and data processing workloads to emerge in support of rich multi-level data refinement solutions.

With that as backdrop, Step 3 takes the model further by showing how the Big Data Refinery interacts with the systems powering Business Transactions & Interactions and Business Intelligence & Analytics. Interacting in this way gives businesses a richer and more informed 360° view of customers, for example.

By directly integrating the Big Data Refinery with existing Business Intelligence & Analytics solutions that contain much of the transactional information for the business, companies can more accurately understand the customer behaviors that lead to those transactions.

Moreover, systems focused on Business Transactions & Interactions can also benefit from connecting with the Big Data Refinery. Complex analytics and calculations of key parameters can be performed in the refinery and flow downstream to fuel runtime models powering business applications, with the goal of targeting customers with the best and most relevant offers, for example.

Since the Big Data Refinery is great at retaining large volumes of data for long periods of time, the model is completed with the feedback loops illustrated in Steps 4 and 5. Retaining the past 10 years of historical “Black Friday” retail data, for example, can benefit the business, especially if it’s blended with other data sources such as 10 years of weather data from a third-party data provider. The point here is that the opportunities for creating value from multi-structured data sources, inside and outside the enterprise, are virtually endless if you have a platform that can do it cost-effectively and at scale.
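The "big raw input, small refined output" idea above can be sketched in a few lines. This is a toy stand-in, not a real refinery job: the log format (timestamp, customer id, action) is hypothetical, and in practice this aggregation would run as a distributed job in Hadoop over far larger inputs.

```python
# Toy sketch of big data refining: raw clickstream lines in, a small
# per-customer summary out. The field layout below is an assumption made
# for illustration; real clickstream formats vary.
from collections import Counter

raw_clickstream = [
    "2012-03-01T09:00:01 cust-17 view:/product/42",
    "2012-03-01T09:00:09 cust-17 add-to-cart:/product/42",
    "2012-03-01T09:01:30 cust-88 view:/home",
    "2012-03-01T09:02:11 cust-17 checkout",
]

def refine(lines):
    """Reduce raw events to events-per-customer: small output, big value."""
    events = Counter()
    for line in lines:
        _, customer, _ = line.split(" ", 2)  # timestamp, customer, action
        events[customer] += 1
    return dict(events)

print(refine(raw_clickstream))  # {'cust-17': 3, 'cust-88': 1}
```

The refined summary (a handful of counts) is what would flow downstream into churn models or offer targeting, while the bulky raw events stay in the refinery.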
“Node" means a Server or Virtual Machine capable of running the Software. “Server” means a single hardware system capable of running the Software. A hardware partition or blade is considered a separate hardware system.“Virtual Machine" means a software container that can run its own operating system and execute applications like a physical machine.“Cluster” means two or more Nodes that are interconnected for the purposes of executing application programs and sharing data.“Storage” means the total available storage space, also known as raw capacity, within the cluster
I want to be careful with how we present services… they do want people to come onsite for extended engagements.