All data is potentially valuable to your business, but its value density determines just how you manage and access that data. Discover how Hadoop complements your data warehouse by letting you choose the “right tool for the right job” and optimize your data economics and usage.
IN THIS SESSION, WE WILL EXPLORE USING HADOOP TO ADDRESS QUESTIONS AND ISSUES SURROUNDING
* Cost of storage
* Value of accessibility
* Getting maximum return on your IT investments and all of your data
WHAT CONSTITUTES GOOD ROI FOR YOUR DATA? IT MIGHT HELP TO ASK YOURSELF SOME QUESTIONS:
* Are you able to use all the data, or the data you want, in BI and analysis? (current & historical data)
* Are you able to take on new projects without fear of impacting the performance of existing processes & reports?
* Are you concerned about maintaining a satisfactory data retention strategy in the face of uncertain data value?
* If any of these questions make you stop and think, then you might not be using your data to its maximum potential.
SO WHAT IS THE COMMON SOLUTION TO PROVIDE SOME RELIEF TO THESE ISSUES?
* Typically it's to make more space within existing analytical systems
* This means "archiving" data to mediums that are more cost-effective but lack analytical capabilities (tape, filers)
BUT WHAT DOES THIS MEAN FOR YOUR BUSINESS?
* It means that the data is unavailable for analysis & reporting
* Retrieval is troublesome, and in the case of tape, the data has essentially been thrown away
* But what is the opportunity lost with this data?
Today, tomorrow, a year from now?
RETURNS TO CORE ISSUE FACING BUSINESS TODAY:
* Are you knowingly working with constrained or limited information when making business decisions?
THIS ISSUE IS PARTICULARLY ACUTE WHEN DEALING WITH OUTLIER & ANOMALY DETECTION EXERCISES LIKE FRAUD ANALYSIS & ADVERSE DRUG EVENTS
* Sampling is a powerful tool for data scientists
* But if you have to sample just so you can run your analyses w/in your system requirements
* What data points are going missing?
THIS ISSUE ONLY STANDS TO BECOME MORE PREVALENT
* As data points from expanding instrumentation become more valuable (in aggregate or as singular entities)
* Or as "abundance of data" requirements feed this new era of exploration
* Both of these macro trends will exert tremendous influence on IT infrastructures
IN SHORT, IT ORGANIZATIONS MUST KEEP ACCESSIBLE GROWING AMOUNTS OF DATA
* w/ potential, if not questionable, value
* lest they close the door on the discovery of a significant windfall or failure
SO WHY NOT JUST KEEP THE DATA AROUND IN A MORE ACCESSIBLE FORM?
* Typically this means storing it in a data warehouse
DATA WAREHOUSES ARE GOOD AT STORING & RETRIEVING DATA AS LONG AS
* the access patterns are understood & can be effectively modeled
* the data has a high degree of intrinsic value
* but the cost model makes poor economics for data that isn't particularly valuable in & of itself
WE LIKE TO CHARACTERIZE THIS AS “DATA DENSITY”
* the value of each individual piece of data compared to the cost of storing it
* This "return on byte" metric varies depending on organization, stage of the lifecycle, and other factors - not a static equation
AND, IMPORTANTLY, THE VALUE OF LOW-DENSITY DATA CAN INCREASE TREMENDOUSLY WHEN VIEWED IN LARGE, AGGREGATE AMOUNTS
* Distillation can yield extremely high-value signals and metrics
* Ex: ATM transactions
* These types of examples are prolific now with the rise of instrumentation and exploration
THE PROBLEM IS THAT NOT ALL DATA IS CREATED EQUAL OR SHOULD “FLY FIRST CLASS”
* Some is immediately and clearly valuable = a candidate for the data warehouse
* Other data is worth more than being banished to tape
* But where do you put it so you can make sure it's usable when needed?
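The "return on byte" idea above can be sketched as a toy calculation. This is purely illustrative: the dollar values, the threshold, and the tier names are hypothetical, not a formula from the talk.

```python
# Illustrative "return on byte" (value density) calculation.
# The threshold and placement tiers below are hypothetical examples,
# not a prescribed formula.
def return_on_byte(estimated_value_usd: float, size_bytes: int) -> float:
    """Value density: estimated business value per byte stored."""
    return estimated_value_usd / size_bytes

def placement(density: float, warehouse_threshold: float = 1e-6) -> str:
    """Route high-density data to the warehouse, the rest to Hadoop."""
    return "data warehouse" if density >= warehouse_threshold else "hadoop"

# 1 TB of raw clickstream logs valued at $50k in aggregate
d = return_on_byte(50_000, 1 << 40)
print(placement(d))  # low density -> "hadoop"
```

The point of the sketch is that the decision is an economic one: the same byte of data can change tiers as its estimated value, or the cost of storing it, changes over the lifecycle.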
IN THIS CONTEXT
* Where to put data that has value, but isn't valuable enough for the data warehouse
* Hadoop offers a compelling alternative to offline archives and SAN/NAS systems for storing low-density data
IN THIS FORM, HADOOP COMPLEMENTS THE TRADITIONAL DATA WAREHOUSE
* Optimizes data alignment across the 2 systems
* According to value density and the business's analysis needs
AT THIS POINT, I’M GOING TO ASK TED MALASKA TO JOIN ME ON STAGE
* to talk about our experiences w/ customers taking this approach to data management.
A financial regulatory body saves tens of millions with a more efficient disaster recovery solution that also offers 20X faster data processing performance.
The Challenge:
(PRIMARY) Data reliability requirements mandate 7 years’ storage + replication between production & DR
(PRIMARY) 40% annual data volume growth; 850TB collected each year from every Wall Street trade
80% of firm’s costs are IT-related
The Solution:
Cloudera Enterprise replaces Greenplum + SAN for DR, starting with 2 years’ data
4-5PB on CDH by end of 2013
--++--++--
A financial regulatory body saves tens of millions with a more efficient disaster recovery solution that also offers 20X faster data processing performance. Background: A large regulatory body has a data reliability requirement to store 7 years of historical data, and to replicate that data between their production and disaster recovery environments. Meanwhile, the firm’s data volumes are growing 40% every year -- they collect 850TB each year from every Wall Street trade. Challenge: Recognizing that 80% of the firm’s costs were IT-related, they realized the need to investigate other options for data storage, processing, and/or disaster recovery. Solution: The company decided to improve the operational efficiency of their data storage and disaster recovery environment with Cloudera. They’re initially migrating two years’ disaster recovery data from Greenplum onto CDH, and will eventually migrate all 7 years of DR data onto the platform. They’ll have 4-5PB on CDH before the end of 2013. Upon successful completion of the DR migration, the company may consider moving their enterprise data warehouse onto Cloudera. Results: The company is saving tens of millions of dollars by replacing their Greenplum + SAN costs with Cloudera. Meanwhile, they’ve recognized a processing performance boost of 20X.
BlackBerry realized ROI on their Cloudera investment through storage savings alone, while reducing ETL code by 90%.
The Challenge:
(PRIMARY) BlackBerry Services generate .5PB (50-60TB compressed) of data per day
(PRIMARY) RDBMS is expensive – limited to 1% data sampling for analytics
The Solution:
(PRIMARY) Cloudera Enterprise manages a global data set of ~100PB
(PRIMARY) Collecting device content, machine-generated log data, audit details
90% ETL code base reduction
(PRIMARY) No longer have to rely on a 1% data sample for analytics; they can query all of their data -- faster, on a much larger data set, and with greater flexibility than before
(PRIMARY) Predicted the impact that the London Olympics would have on their network so they could take proactive measures and prevent a negative customer experience
--++--++--
BlackBerry realized ROI on their Cloudera investment through storage savings alone, while reducing ETL code by 90%. Background: BlackBerry transformed the mobile devices market in 1999 with their introduction of the BlackBerry smartphone. Since then, other industry innovators have introduced devices that compete against BlackBerry, and the company must leverage all of the data it can collect in order to understand its customers, what they need and want in mobile devices, and how to remain an industry leader. Challenge: BlackBerry Services generate ½ PB of data every single day -- or 50-60TB compressed. They couldn’t afford to store all of this data on their relational database, so their analytics were limited to a 1% data sample, which reduced the accuracy of those analytic insights. And it took a long time to try to access data in the archive.
Their incumbent system couldn’t cope with the multiplying growth of data volumes or constant access requests -- BlackBerry had to pipeline their data flows to prevent the data from hitting disk. Solution: BlackBerry deployed Cloudera Enterprise to provide a queryable data storage environment that would allow them to put all of their data to use. Today, BlackBerry has a global dataset of ~100 PB stored on Cloudera. The platform collects device content, machine-generated log data, audit details and more. BlackBerry has also converted ETL processes to run in Cloudera, and Cloudera feeds data into the data warehouse. Hadoop components in use include Flume, Hive, Hue, MapReduce, Pig and ZooKeeper. Results: BlackBerry’s investment in Cloudera was justified through data storage cost savings alone. And by moving data processing over to Hadoop, their ETL code base has been reduced by 90%. They no longer have to rely on a 1% data sample for analytics; they can query all of their data -- faster, on a much larger data set, and with greater flexibility than before. One ad hoc query that used to take 4 days to run now finishes in 53 minutes on Cloudera. BlackBerry’s new environment allowed them to do things like predict the impact that the London Olympics would have on their network so they could take proactive measures and prevent a negative customer experience.
Shortcomings
* Too much data
* Forces big DW
* Expensive: $$ and FTE
* Forced windows or sampling
The 10%
* Iceberg model: the other 90%, below waterline, exists
* Inaccessible, high cost to retrieve/use
* “Break glass in case of emergency”
* Can satisfy some compliance
* Opportunity cost of storage vs. storage + compute
Network storage
* Low-cost storage
* Easier to retrieve
* Not compute
* Data movement: $$ at scale
Overall capacity
* Existing is growing
* Hitting thresholds
* New workloads feasible?
“Forces you as a business to make decisions based on inadequacies, not on opportunities”
--++--++--
All of these customers have faced similar issues. Let’s talk about where the bottlenecks and shortcomings are exposed in current data management infrastructures. The first shortcoming, something we have seen in lots of clients, is defining a small (and getting smaller) window of data for analysis because they have to -- they are forced to by their current infrastructures. Visa, for example, was constrained to 6 months of transactions to look for fraud until they turned to Hadoop, and they had a 100TB data warehouse designed specifically for this task. This stems from two things: first, the cost/TB calculations that we discussed earlier – is it worth it to the business to make this investment in storing this data? – and second, is it even physically possible to do so? A 100TB data warehouse is no small feat, in terms of hardware, systems, skill set, et al., and might be out of reach for the company. So, companies often turn to sampling and high data turnover to get around these two bottlenecks, and this can be less than ideal for making better decisions. The second shortcoming refers back to the cost/TB that we just mentioned, and a common solution is to put that data in a “side pocket.” This is the “iceberg” model for data – you only really see the top 10% of data within the organization, i.e. in the data warehouse and BI systems, because the rest is below the waterline.
And as discussed, anything below the waterline is typically more difficult to access, which drives up the real cost of storage, and thus relegates that data to a “break glass in case of emergency” model – it’s there, but the cost to retrieve and use it is high, so do so judiciously. To be fair, this can satisfy some compliance and retention strategies. However, you need to look at the opportunity cost of not being able to use this data – it could be valuable if you can get at it cheaply and easily. Why not have both compliance and accessibility? The third shortcoming is that the options that do scale, like a SAN or NAS, don’t provide anything but storage. If we want to do something with the data, we need to move it. And that can be costly, both in terms of network and also in terms of where the data will land for processing. If you need to look at 10 years of ATM transactions, what if that volume doesn’t fit in your data warehouse or staging systems? You are back to some form of windowing, sampling, partitioning, etc., which can add a lot of overhead and complexity, and you are also back to the topic of the cost/TB for that processing system. And that gets to the fourth shortcoming, which is the overall capacity of the system. Many of our clients are hitting or approaching maximum capacity with their existing systems, and to take on new workloads, they face escalating costs that prohibit expansion, or inadequate functionality if they do so. And this shortcoming really affects the previous points – it potentially threatens your compliance policies, constrains your reporting latitude, and in short, forces you as a business to make decisions based on inadequacies, not on opportunities.
----
How do these relate to the following business issues:
* Access to relevant and/or all the data in BI and analysis (current and historical)
* New projects affecting existing projects
* Compliance, regulations, and data retention strategies
Compliance and Data Retention
* Scalability
* All data, all types
* Costs 10x less
* Mechanics: DAS, schema-on-read, cluster fault tolerance, compression, replication
* Expand your data capacity at will: “Need more space? It can be as simple as adding another node to the cluster”
* Confidence in data assurance and protection due to distributed storage mechanics
* Accessibility: query frameworks, BI tool integration
* Security
* WORM
New Workloads
* Compute, like storage: “as simple as adding another node to the cluster”
* Built for new workloads: crunching web pages, new sites, internet expansion
* Resource management: multiple computes, multiple groups, play well together
Need to Analyze More Data
* Combination on one node: storage + compute
* “Cost-effectively store all the data that you want to analyze and at the same time”
* Orders of magnitude less: “A typical data warehouse might run $2-$10M incremental spend to add 100TB to the system. With Cloudera, adding 100TB will cost roughly $200k – 1/10th the spend”
--++--++--
How then does Hadoop fit into the infrastructure and business processes to enable you to meet these challenges? Let’s start with compliance and data retention strategies: how does Hadoop work to support these policies? Hadoop provides linearly scalable storage for all data, regardless of type, in its raw, native form, at a cost point far below that of traditional systems like the data warehouse or SAN. It can do this by relying on a couple of fundamental features of the framework: direct-attached storage, schema-on-read, multiple layers of fault tolerance, pluggable compression, and block replication. (The first of the afternoon clinics will go into the details of these features.)
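Of the features just listed, schema-on-read is the one that most distinguishes Hadoop from the warehouse's schema-on-write: the stored bytes carry no enforced structure, and a schema is applied only at read time. A minimal, self-contained sketch of the idea; the records and field names are made up for illustration:

```python
import json

# Illustrative schema-on-read: raw lines are stored untyped, and a schema
# (here just field names and type casts) is applied only when data is read.
raw_storage = [
    '{"ts": "2013-05-01T10:00:00", "user": "a17", "bytes": "1024"}',
    '{"ts": "2013-05-01T10:00:05", "user": "b42", "bytes": "2048"}',
]

def read_with_schema(lines, schema):
    """Parse and type each record at read time; storage stays schema-free."""
    for line in lines:
        rec = json.loads(line)
        yield {field: cast(rec[field]) for field, cast in schema.items()}

# A "table" over the stored bytes, defined at query time rather than load time
traffic = list(read_with_schema(raw_storage, {"user": str, "bytes": int}))
print(sum(r["bytes"] for r in traffic))  # 3072
```

Because nothing about the storage commits to one schema, a second projection with different fields or types can be defined later over the same raw lines, which is what gives Hadoop its flexibility for new, unanticipated workloads.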
Without going deeply into the details, these features give you and your business the ability to expand your data capacity at will – need more space? It can be as simple as adding another node to the cluster – and arm you with the knowledge that the data is intelligently distributed throughout the cluster to afford a high degree of assurance against data loss. But unlike backups and offline archives, these features also allow Hadoop to provide this data immediately, on demand, through many means of access, including query languages like SQL, various purpose-built connectors, and many industry-leading applications that your organization already employs, like your BI tools. Moreover, Hadoop provides industry-standard security mechanisms for controlling access and visibility, so when coupled with Hadoop’s write-once features, you can maintain your compliance policy while still allowing analysis and work. With Hadoop, you get a cost-effective way to store, access, query, and process your data, all of it, regardless of its “data density.” It’s compliance and then some! How about new projects and workloads? What role can Hadoop play in these kinds of situations? Hadoop provides scalability for processing as well, so just like with storage, if you need more computing horsepower, it can be as simple as adding another node to the cluster. It does this by using a computing framework called MapReduce and also takes advantage of the block replication we just mentioned to speed things up further. (Again, MapReduce is a topic covered during the first afternoon clinic.) Hadoop was initially designed for this kind of problem, crunching web pages to build a search index, so adding new workloads, just like new sets of web sites, can be scaled in a straightforward and cost-effective manner.
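The MapReduce model mentioned above can be illustrated with a toy, single-process sketch. Real Hadoop distributes the map and reduce tasks across cluster nodes and handles the shuffle/sort for you; this only shows the shape of the computation, using the classic word-count problem MapReduce was built around:

```python
from itertools import groupby
from operator import itemgetter

# Toy, single-process sketch of the MapReduce model. In Hadoop, map and
# reduce tasks run in parallel across nodes; here they run sequentially.
def map_phase(records, mapper):
    return [kv for rec in records for kv in mapper(rec)]

def reduce_phase(pairs, reducer):
    pairs.sort(key=itemgetter(0))          # stand-in for the shuffle/sort step
    return {k: reducer(k, [v for _, v in grp])
            for k, grp in groupby(pairs, key=itemgetter(0))}

# Word count: emit (word, 1) per word, then sum counts per word
mapper = lambda line: [(w, 1) for w in line.split()]
reducer = lambda word, counts: sum(counts)

lines = ["hadoop stores data", "hadoop computes data"]
print(reduce_phase(map_phase(lines, mapper), reducer))
# {'computes': 1, 'data': 2, 'hadoop': 2, 'stores': 1}
```

Scaling this in Hadoop is a matter of adding nodes: more map tasks run over more blocks in parallel, which is why new workloads can be absorbed without redesigning the job.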
In addition, there are several other compute frameworks that can use the same underlying data, such as Cloudera Impala, and we fully expect to see more and more capabilities work in the same manner: single data set, multiple ways of computing. Hadoop also has features that help keep different units of work or projects from monopolizing an entire cluster. These resource management features are configurable, so you have the opportunity to tune how new workloads operate alongside existing projects. Lastly, when you are feeling the effects of sampling and windowing, when you need to be able to look at more data than your current systems can handle or allow, how can Hadoop work to address these situations? This really is the combination of the previous two situations: you can use Hadoop to cost-effectively store all the data that you want to analyze and, at the same time, since Hadoop is both storage and compute on the same node, offer your business computing power across the entire range of data. This is very effective for low-value, low-density, low Return-on-Byte data like historical records and couples really well with many of the compliance and data retention needs we encounter. It’s also a natural fit for scenarios where you really must have full access to data across a broad dimension, like our examples in fraud or anomaly detection. Given the characteristics of the infrastructure needed to build a Hadoop cluster, your structural cost/TB is typically an order of magnitude less than with a traditional data warehouse. For example, a typical data warehouse might run $2-$10M incremental spend to add 100TB to the system.
With Cloudera, adding 100TB will cost roughly $200k – 1/10th the spend.
----
How do these relate to the following business issues:
* Access to relevant and/or all the data in BI and analysis (current and historical)
* New projects affecting existing projects
* Compliance, regulations, and data retention strategies
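The cost claim above reduces to simple per-TB arithmetic; a quick sanity check using only the figures quoted in the talk:

```python
# Back-of-envelope check of the quoted figures: $2M-$10M per incremental
# 100TB on a traditional warehouse vs roughly $200k on Cloudera.
def cost_per_tb(total_cost_usd: float, capacity_tb: float) -> float:
    return total_cost_usd / capacity_tb

dw_low  = cost_per_tb(2_000_000, 100)    # $20,000 per TB
dw_high = cost_per_tb(10_000_000, 100)   # $100,000 per TB
hadoop  = cost_per_tb(200_000, 100)      # $2,000 per TB

print(dw_low / hadoop, dw_high / hadoop)  # 10.0 50.0
```

So "1/10th the spend" matches the low end of the quoted warehouse range; against the high end, the spread is closer to 50x.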
Features of HDFS
* Files split into blocks; blocks distributed across cluster: disk, node, rack; blocks replicated for protection and accessibility; transparent replication
* Tests for compromises; self-healing via replication
* High bandwidth; DAS for IO; bring compute to closest replicated block; minimize network overhead; read small parts from many places simultaneously
* Clustered storage; need more space? Add a node; that simple
* Data stored in native fidelity; byte streams on disk; SerDe; no schema enforcement; key to flexibility of compute and storage
Features of MapReduce
* Fault-tolerant; distributed blocks == options for recompute on compute failure; easy to persist intermediate results for replay
* Distributed processing; compute brought to blocks, not file; options for best compute time due to block dispersion in cluster; parallel programming details abstracted away
* Schema-on-read; read byte streams at query time; determine schema at runtime; key to flexibility; key to multiple computes; core to now and future workloads
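The block-splitting and replication behavior listed above can be sketched in a few lines. The 128 MB block size and 3x replication mirror common HDFS defaults, but the round-robin placement here is a deliberate simplification (real HDFS placement is rack-aware and load-aware):

```python
import itertools

# Toy sketch of HDFS layout: a file is split into fixed-size blocks and
# each block is copied to several nodes. Simplified placement only.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB blocks (a common HDFS default)
REPLICATION = 3                  # 3 copies of each block (common default)

def place_blocks(file_size: int, nodes: list) -> dict:
    """Assign each block of a file to REPLICATION nodes, round-robin."""
    n_blocks = -(-file_size // BLOCK_SIZE)        # ceiling division
    ring = itertools.cycle(nodes)
    return {b: [next(ring) for _ in range(REPLICATION)]
            for b in range(n_blocks)}

layout = place_blocks(300 * 1024 * 1024, ["node1", "node2", "node3", "node4"])
print(len(layout))   # a 300 MB file -> 3 blocks
print(layout[0])     # each block lives on 3 nodes
```

This dispersion is what enables both the self-healing (re-replicate from a surviving copy when a node fails) and the "bring compute to the closest replicated block" scheduling described above.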
Interactive BI
* Had been lacking
* Some analysis impractical
* Batch design decisions: “you could run analysis, but it could take a while to get results”
* Limited audience for data in cluster
* Lack of interaction via common BI tools/SQL
Results
* Lower ROI due to limited audience
* Push data to high-cost, but approachable, systems
* DW not always ideal landing spot for this analysis
Impala
* Familiar to larger audience
* Makes Hadoop data accessible
* Improved ROI
* “If your business analysts know SQL and BI tools, they can get immediate value from all your data, from the first-class data within your data warehouse to the data within your Hadoop cluster, too.”
--++--++--
While Hadoop does offer tremendous value and analytical capabilities for businesses, the lack of a true, Hadoop-native interactive BI and analytics engine has made some analysis impractical. Yes, collecting all of your data into a primary data hub – some call it a data refinery – might alone be value enough to your business, but the ways in which you could evaluate and explore this source might have discounted that value, because they focused on resilience and fault tolerance, or on complete expressive and programming latitude, at the expense of speed and rapid dialogue. In a nutshell, you could run analysis, but it could take a while to get results. This meant that the data in the cluster, while available and accessible, wasn’t quite accessible enough to larger groups within your business because of that particular focus. The lack of interactivity made it frustrating to use many common BI tools to analyze and use the data, and in turn, this led to two outcomes. First, it limited the ROI of the data within Hadoop, because only 10s of people within your organization – your ETL developers and your patient data scientists – could get at it, rather than the 100s or more people who could if the means were more approachable, more familiar.
The other outcome is that it forced organizations to push data that was otherwise not ideal into the data warehouse, and all of the issues we have discussed about cost/TB, capabilities, and capacities remained at large. This is why the introduction of Cloudera Impala is so important to these scenarios: with this real-time query engine, users familiar with the speed and rapid dialogue common to BI tools and data warehouses now have that quickness and agility with their Hadoop data. More people accessing more data and getting more value from the cluster -- this all adds up to improved ROI for Hadoop. So now, if your business analysts know SQL and BI tools, they can get immediate value from all your data, from the first-class data within your data warehouse to the data within your Hadoop cluster, too.
Right Tool, Right Job
Hadoop
* Low-density storage and extraction
* "What if" and “how about this” questions
* Exploratory analysis
Just one tool, though
* Don’t ditch the DW
* Complex transactions
* Known, highly planned reporting
* Best using schema
* Fast
* High-value data
Real power
* Relationship between the two
* Facebook example: 10TB DW, then 1PB Hadoop cluster, resulting in 40TB DW
* Found much more important data in the cluster
--++--++--
So, why not just forgo the data warehouse altogether if Hadoop can provide cost-effective, scalable storage and an array of methods for analyses, reporting, and getting value out of all your data, not just the selected, high-value, high-density data? The simple answer is that Hadoop gives you and your business the “right tools for the right job.” Hadoop offers you and your business the ability to query big and growing data sets – it lets you get at the value within the low-density data. And it offers you the flexibility to keep that data in its raw form so that you can ask the “what if” and “how about this” questions on that data at any time, with no data duplication, using the tools that your business knows and uses on a daily basis. Hadoop with Impala excels at this kind of exploratory analysis. Hadoop, though, is only one tool. The data warehouse offers a number of capabilities that are either difficult or missing with Hadoop. If you need to ensure that the order is incremented at the same time the store’s inventory is decremented and that the sales person gets credited for the transaction, then you need the power of a data warehouse. That’s a contrived example, but the point is, data warehouses excel at complex transactions, and these transactions are commonplace throughout your business. If you need to provide the sales team with their quarterly pipeline reports, or if the floor manager needs to know how many boxes were shipped to Los Angeles on Thursday, then you should use a data warehouse.
These kinds of questions – known, highly planned, and often repeated with drill-down variations – are best served using the speed and structure offered by the underlying technology powering the data warehouse, the relational schema. This is the high-value data; this is the data that needs to fly first class. Where the real power lies in this relationship is that Hadoop can feed the data warehouse this high-value data, and thus makes the data warehouse even more valuable to its users than previously thought. One example comes from one of our founders and current chief scientist at Cloudera, Jeff Hammerbacher. When Jeff was leading the data science team at Facebook, they had a 10TB data warehouse. He and his team started to use Hadoop to capture the multitude of interaction points that the web property offers, and they quickly established a formidable 10PB cluster. And what happened next? They found so much high-value information in the cluster that they wanted in their traditional BI environments that their data warehouse grew to 40TB.
More with the relationship
* Hadoop as pre-processor
* EDW staging area
* Significant, high-cost data cleansing
As staging
* Need storage
* Need processing
* Need query
* Need low costs
Disaster recovery alternate
* Low-cost storage
* Satisfactory alternative for query during rebuild
--++--++--
So what else can be done with this complementary relationship of data warehouse and Hadoop? Let’s say you need to take some data and run it through some paces in order to decide whether or not to include it in your BI reporting. Or perhaps you have significant data cleansing efforts that might take a considerable amount of time, like days or weeks, before the data is ready for your business consumers to use. As we mentioned in Jeff’s story, Hadoop can be the ideal staging area for your data warehouse. Such a system requires scalable, flexible storage. It needs a high degree of processing capability, and it needs query abilities. All of these fit Hadoop very well, and with a significant cost advantage over your data warehouse, too. Hadoop can also act as a disaster recovery option for your data warehouse. Using standard tools and connectors, the data flowing into your data warehouse can also be sent to the Hadoop cluster and stored in its native format. So when the time comes, you have your data available and accessible, yet at a fraction of the cost of a duplicate data warehouse. In this scenario, when coupled with Impala, you and your business can still enjoy most of the speed and analytical features of your data warehouse while that system is concurrently rebuilt from the very data you are now serving.
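The staging and cleansing role described above often takes the form of a streaming-style mapper: a small script that reads raw records, drops malformed rows, and emits normalized rows for the warehouse load. A sketch of that shape, with an entirely hypothetical three-column transaction layout:

```python
import csv

# Sketch of a Hadoop Streaming-style cleansing step. The three-column
# layout (txn_id, amount, currency) is a made-up example, not from the talk.
def clean(line):
    """Return a normalized CSV row, or None if the record is malformed."""
    row = next(csv.reader([line]), None)
    if not row or len(row) != 3:
        return None                       # malformed: wrong column count
    txn_id, amount, currency = row
    try:
        amount = f"{float(amount):.2f}"   # normalize amount formatting
    except ValueError:
        return None                       # malformed: non-numeric amount
    return ",".join([txn_id.strip(), amount, currency.strip().upper()])

# In a streaming job this would wrap a loop over sys.stdin; here we just
# run it over a few raw rows:
raw = ["t1, 3.5 ,usd", "garbage", "t2,19,EUR"]
print([clean(r) for r in raw if clean(r)])
# ['t1,3.50,USD', 't2,19.00,EUR']
```

Because the raw rows stay on the cluster in their native form, the cleansing rules can be changed and re-run over the full history at any time, which is exactly what makes Hadoop attractive as the staging tier.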
SO, TO WRAP UP, HERE ARE SOME KEY TAKEAWAYS
ALL DATA HAS VALUE
* Why compromise decision making by focusing on only high-density or selected data?
* Use Hadoop to maximize the Return-on-Byte for all your data
EXPLOIT BOTH STORAGE AND COMPUTE
* Hadoop gives you storage and computation at a similar cost to storage-only alternatives
* The computation is flexible - you can bring multiple processing frameworks to bear on a single set of data
* This multi-function approach is expected & natural for exploratory-type workloads
USE THE RIGHT TOOL FOR THE RIGHT JOB
* This approach drives better data & workload alignment
* Focus on what each system does best
* DW for high-density, operational reporting
* Hadoop for low-density, exploratory analytics
Consolidate your Information Lifecycle Management
* Find your valuable off-line archives
* Make them available and accessible
* Connect the archives with your existing reporting systems
* Create a staging area; process and push into your systems