Before we dive into Hadoop and its role within the modern data architecture, let’s set the context for why Hadoop has become important.
Existing approaches for data management have become both technically and commercially impractical.
Technically – these systems were never designed to store or process vast quantities of data.
Commercially – the licensing structures of the traditional approach are no longer feasible.
These two challenges, combined with the rate at which data is being produced, precipitated the need for a new approach to data systems. If we fast-forward another 3 to 5 years, more than half of the data under management within the enterprise will come from these new data sources.
Enter Hadoop.
Faced with this challenge, the team at Yahoo! conceived and created Apache Hadoop. Convinced that contributing the platform to an open community would speed innovation, they open sourced the technology under the governance of the Apache Software Foundation (ASF). This introduced two distinct, significant advantages.
Not only could they manage new data types at scale, but they now had a commercially feasible approach.
However, there were still significant challenges. The first generation of Hadoop:
- was designed and optimized for batch-only workloads,
- required dedicated clusters for each application, and
- didn’t integrate easily with many of the existing technologies present in the data center.
Also, like any emerging technology, Hadoop still had to reach the level of readiness the enterprise requires.
After running Hadoop at scale at Yahoo!, the team spun out to form Hortonworks with the intent to address these challenges and make Hadoop enterprise-ready.
The modern data architecture simply does not work unless it integrates with the systems and tools you already deploy. HDP enables your existing data platforms to expand the data you have under management through integration. The goal of HDP is to augment, not replace, these existing systems, as we very clearly understand that you need to reuse existing skills.
Further, through our work within the Hadoop community to deliver YARN, we have opened up Hadoop and unlocked innovation for the community of data center ISVs, who can now extend their applications to run natively IN Hadoop as just another workload operating on a single data lake. They can function as first-class citizens alongside any other workload in Hadoop.
In 2011, Hortonworks was founded by the 24 original Hadoop architects and engineers from Yahoo!
This original team had been working on a technology called YARN (Yet Another Resource Negotiator) that enables multiple applications to access all your enterprise data through an efficient, centralized platform. It is the data operating system for Hadoop, providing the versatility to handle any application and dataset, no matter the size or type.
Moreover, YARN provided the centralized architecture around which the critical enterprise services of Security, Operations, and Governance could be centrally addressed and integrated with existing enterprise policies.
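To make YARN’s role as a centralized platform concrete, here is a minimal sketch (not from the original deck) of listing every application sharing one cluster through the standard YarnClient API. The ResourceManager address is a hypothetical placeholder; in practice it would come from yarn-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

import java.util.List;

// Minimal sketch: list every application sharing a single YARN cluster.
// Assumes the Hadoop client jars on the classpath and a reachable ResourceManager.
public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();
        // Hypothetical ResourceManager address; normally read from yarn-site.xml.
        conf.set("yarn.resourcemanager.address", "rm.example.com:8032");

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();
        try {
            // Every workload registers with the same ResourceManager,
            // because YARN is the single resource negotiator for the cluster.
            List<ApplicationReport> apps = yarnClient.getApplications();
            for (ApplicationReport app : apps) {
                System.out.printf("%s | %s | %s%n",
                        app.getApplicationId(),
                        app.getApplicationType(),
                        app.getYarnApplicationState());
            }
        } finally {
            yarnClient.stop();
        }
    }
}
```

Because every workload, whether batch, interactive, or streaming, registers with the same ResourceManager, operations teams get one place to observe and govern the shared cluster.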
This work allowed a new approach to data to emerge: the modern data architecture. At the heart of this approach is Hadoop’s capability to unify data and processing in a single, efficient data platform.
Let’s start with cost optimization… there are three primary drivers.
First, it’s about storage optimization: archive your data off the EDW into Hadoop to drive down costs (a minimal sketch of this follows these three drivers).
Second, optimize data processing: typically, a large portion of EDW usage goes to low-value transformation workloads. Many of these can be transitioned from the EDW into Hadoop, freeing up significant EDW resources.
And finally, Hadoop can be used to capture new types of data that can then be refined and used within the context of your EDW analysis, introducing wholly new analysis and insight.
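As a minimal illustration of the first driver, landing an exported EDW extract in HDFS can be as simple as a copy through Hadoop’s standard FileSystem API. This is a sketch under assumptions, not any customer’s actual pipeline; the namenode URI and paths below are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

// Minimal sketch: archive an exported EDW extract onto cheap HDFS storage.
// The namenode URI and both paths are hypothetical placeholders.
public class ArchiveToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode.example.com:8020"), conf);

        Path localExtract = new Path("file:///staging/edw/orders_2013.csv");
        Path archiveDir = new Path("/archive/edw/orders/");

        fs.mkdirs(archiveDir);                          // idempotent; creates parent dirs
        fs.copyFromLocalFile(localExtract, archiveDir); // extract now lives on commodity storage
        fs.close();
    }
}
```

Once archived this way, the data remains queryable from Hadoop-side tools while the EDW keeps only the hot, high-value working set.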
On the other hand, many start with a new analytic app based on data not previously captured.
These new types of data include clickstream, sentiment, machine and sensor data, geolocation data, server logs, and the tomes of unstructured data often found within the enterprise.
While your applications will vary tremendously based on your vertical, we see commonality across the application patterns intended to find value in these rich data sources.
We generally find there are three patterns of application:
SINGLE VIEW OF ENTITY
The first of three common patterns in analytics applications, a single view of an entity (like a customer, product, or machine) is now possible because platforms like Hadoop can store and organize previously unmanageable volumes and varieties of data.
PREDICTIVE ANALYTICS
As data scientists and analysts reveal patterns and correlations inside massive data sets, new models emerge to explain business performance. Most importantly, these models can reliably predict future events based on previously dissociated data.
DATA DISCOVERY
New, voluminous data types such as machine and sensor data, geolocation data, clickstream data and sentiment data are valuable when correlated with other data sets in a shared enterprise “data lake.” The patterns within the data lake can then fuel machine learning applications.
It is in these patterns that we see organizations unlock value from types of data that were previously out of reach.
Ultimately, most organizations that adopt Hadoop aspire to create a data lake where multiple applications use a shared set of resources for both storage and processing, all with a consistent level of service.
The value in the data lake ultimately results in the delivery of “systems of insight,” where advanced algorithms and applications that access multiple data sets allow organizations to derive brand-new value from data that was once unable to be investigated or simply too complex to combine and analyze. Hadoop doesn’t just create a data lake; it opens the platform for analysts to view multiple data sources in multiple dimensions and reduce time to insight.
This journey from apps to lake is only possible with HDP and its YARN-based architecture.
Let’s talk about TrueCar, as they are a great example of an organization that started small but grew big… and they did this very quickly.
TrueCar focuses on making car buying fair and fun for everyone. They bring together a lot of messy automotive industry data from a wide range of sources in a wide range of formats.
Since they make money when cars get bought, their value is in how well they’re able to drive advanced correlations across the data to deliver an interactive customer experience that accelerates an informed buying decision.
At TrueCar, data is the product they sell, and they made rolling out a Hadoop-based data architecture a precursor to their IPO earlier this year.
With the help of Hortonworks, and in just over a year, they realized their vision of a data lake and transformed their business. At the beginning of their journey, they had very limited knowledge of Hadoop.
We partnered with them to train their team, then worked with them to develop an architecture and helped them through development and implementation; today, we provide mission-critical support for their production environment.
What started as a single app on HDP grew to three in a few months, and within a year they had six business apps running on a single 60-node cluster that holds over 2PB of data.
Their 60-node production cluster was rolled out in Nov 2013 and grew 5X in the year after that.