Meet Mission Critical SLAs with Big Data Using Apache Flume
2
Essence of Business

                          Expected

                          Timely

                          Unconstrained



3
Changing Tides




4
Emerging Challenges

    [Diagram: the traditional pipeline. Sources produce full, partial, and incremental exports that move by FTP into ETL systems, where ETL engineers (VB, PL/SQL) process the raw data; the database platform stages, loads, and runs ELT processing (PL/SQL) into an ODS and datamarts that BI analysts query with SQL.]


5
Emerging Challenges

                          Channels

                          Design

                          Reconciliation

                          Awareness


6
Options for Processing

                             MapReduce

                             Streaming

                             SQL

                             Impala


9
Apache Flume Details

                        [Diagram: data generators (sensor, web server, retail POS) feed Flume sources (HTTP source, syslog source, custom source), which deliver events into HDFS and HBase.]
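
A minimal Flume agent configuration along these lines (agent, channel, port, and path names are illustrative assumptions) shows how a source, a channel, and a sink are wired together, here taking syslog events from web servers into HDFS:

  # Illustrative Flume agent: syslog events from web servers into HDFS
  agent1.sources  = src1
  agent1.channels = ch1
  agent1.sinks    = sink1

  # Syslog TCP source listening for web server events
  agent1.sources.src1.type     = syslogtcp
  agent1.sources.src1.host     = 0.0.0.0
  agent1.sources.src1.port     = 5140
  agent1.sources.src1.channels = ch1

  # In-memory channel buffering events between source and sink
  agent1.channels.ch1.type     = memory
  agent1.channels.ch1.capacity = 10000

  # HDFS sink writing events into date-partitioned directories
  agent1.sinks.sink1.type                   = hdfs
  agent1.sinks.sink1.channel                = ch1
  agent1.sinks.sink1.hdfs.path              = /flume/weblogs/%Y-%m-%d
  agent1.sinks.sink1.hdfs.fileType          = DataStream
  agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

The agent would be started with something like flume-ng agent --conf conf --conf-file flume.conf --name agent1; an HBase sink or a custom source for the retail POS feed would be declared in the same property style.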



11
Encouraging Innovation




12
Ask Bigger Questions: How do we prevent fraud?
      A global financial services firm can more quickly & accurately find fraud while saving $30 million in IT costs.




13
Ask Bigger Questions: How can we improve our support team’s productivity?
      NetApp AutoSupport processes 600,000+ “phone home” transactions weekly to offer proactive customer support.




14
Key Takeaways

       Collocation

       Flexibility

       Efficiency



15
Starting Point – Accelerate your ETL
      Select a particularly challenging use case
      Create a parallel environment
      Implement pipeline on Cloudera (reconciliation check sketched below)
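
One way to vet the parallel run, sketched here under hypothetical table names (daily_revenue_hadoop and daily_revenue_legacy are assumptions): reconcile the Cloudera-produced aggregates against the legacy pipeline's output before cutting over, surfacing only the days where the two disagree.

  -- Hypothetical reconciliation query for a parallel pipeline run
  SELECT COALESCE(h.txn_day, l.txn_day) AS txn_day,
         h.revenue AS hadoop_revenue,
         l.revenue AS legacy_revenue
  FROM daily_revenue_hadoop h
  FULL OUTER JOIN daily_revenue_legacy l
    ON h.txn_day = l.txn_day
  WHERE h.revenue <> l.revenue
     OR h.revenue IS NULL
     OR l.revenue IS NULL;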


16
Questions?
Thank You!
cloudera.com/clouderasessions

Editor's Notes

  1. Yp.com (video): https://cloudera.box.com/files/0/f/550630840/1/f_5336182086
  2. Yp.com (video): https://cloudera.box.com/files/0/f/550630840/1/f_5336182086
  3. Information is the lifeblood of business. Its absence is terminal for business; age of the information worker; zero tolerance. "Our business depends upon information, yet the more of it we produce and the more of it we consume, the less we can guarantee its ready availability." Data integration: conduit to business success; conduit to successful application of information; where do you want to be successful?

[Roll YP.com video]

Information is the lifeblood of business. We just heard this from our customer at YP.com, and it has been a truism since the days of bartering and market stalls, as important then as it is now. The absence of information is a very grave situation indeed for business, so it is no surprise that when information is expected yet does not appear, or is delayed, someone is going to hear about it. And in our modern era, celebrated as the age of the information worker, our tolerance for delay is practically non-existent. With the emergence of Big Data, our tolerance diminishes further and our delays become more likely and more magnified. Our business depends upon information, yet the more of it we produce and the more of it we consume, the less we can guarantee its ready availability.

In this session, we will examine how Big Data affects your data processing and how you can employ Hadoop to accelerate, expand, and go further with your data integration objectives. Data integration is your conduit to business success: the successful, meaningful, timely application of information. During this session, you need to ask yourself: where do you want to be successful?
  4. To envision success, focus on stresses to status quoInstrumentationMoving data; More data; More sources; Incorporate more into analysis; Still doing the same reports as 10 years ago; No change in reports; More data, however; Still only 24 hours in a day; Volume and breadth threatening successConsumptionMore awareness of data == More need for data; Democratization of BIExplorationAccelerated pace of innovation; Need to ask new questions more often; Traditional system too rigid to accommodate pace and changesValuePursuit and use of data makes data valuable; What happens if data != value? Dread of business repercussions as motivator; Opportunity if completed within time and under budget as motivator – what else can be done?This is the difficulty for businessDo more with less; Save time and resources; Satisfy agreements and clear thresholds“Given the tremendous changes in the business landscape, and the resulting pressures placed on your existing processes and infrastructure, how can you do more with your data integration systems when maintaining the status quo is so threatened?”--++--++--To envision success, we need to examine the changes in your data management ecosystem that can adversely affect integration.First and foremost, you and your business are moving more and more data than before, and more of it needs to be incorporated into your business. This is a result of the rise in Instrumentation, from the proliferation of system and application log files to your corporate Twitter stream. And while your reporting and analysis might not change – you are still producing the same reports as 5 years ago – the volume and breadth of data has increased, and you still have only 24 hours a day to process. This can be a source of difficulty for your infrastructure and processes.As you and your business become more aware of data and information, the more of it you and your business want, or need, to view and incorporate. This is the macro effect of Consumption, and this translates to more processing to accommodate the growth in reports and analysis. No longer is is BI the domain of your executives looking at weekly reports; market trends such as self-service, SaaS, and mobile BI mean that more of your users need access to critical business information, and their needs are rarely the same. These additional activities are also a source of difficulty for your infrastructure and processing capacity.And often, these new activities, along with the existing processes, need to shift and adjust to accommodate the accelerating pace of change and business innovation, to ask new questions, and this is related to the rise of Exploration within your business, as you and your business seeks to find the competitive edge or cost savings within your data. However, traditional approaches can be very rigid, and adjusting to these requests can also be a source of difficulty.All of these macro effects contribute to the pursuit and use of data as a Value within your organization, and as data becomes more and more valuable to your business, your data integration needs will face mounting challenges stemming from these pervasive changes in the business landscape.And within this setting, you and your business are motivated to improve your data performance because of the dread of missed SLAs and their business repercussions and the opportunities revealed if you can save time and resources. 
This is what is asked of you and your business – to do more with less -- to save time and resources, yet still satisfy your agreements and clear your thresholds. Given the tremendous changes in the business landscape, and the resulting pressures placed on your existing processes and infrastructure, how can you do more with your data integration systems when maintaining the status quo is so threatened?
  5. Examination of ETL SLA issuesLook at building blocks; Move data; Transform dataMovement; Storage to compute; OLTP not optimized, so has to move; Also, the collection of sourcesTransformation; Normalizing, etc.; Put data into proper context, otherwise just GIGO; Merger of disparate data setsSource of issuesScale; 3 systems to scale: source, target, and ETL grid; 3places for scale issues: ingestion, processing, and query; How to handle: traffic spikes, limited storage, resource contention, network bottleneckFlexibility; Storage and raw data, i.e. unstructured, poor fit for conventional systems; Need to know questions first, then load; Typically only SQL for transform; Limited expressionReliability; Any place in the chain subject to compromise; Forfeit the entire chain/process (at least without significant cost, effort); Forced to replay from beginning if error or mistake--++--++--To better understand how we can approach this problem, we need to think of data integration not as your ETL SLA in and of itself, but to examine its building blocks -- look at the basics of integration – and the impacts and affects Big Data has on them. At the root of ETL are two fundamental goals, to move data and to transform data. Data movement is, in simple terms, moving storage – data – to compute. It’s about running weekly inventory or engagement analysis reports, but often your transactional systems, like your ERP or CRM system, your OLTP systems, are not optimized for analytics typically, and require you to move the information to systems designed for reporting and analysis. Data movement also includes the movement, rather the collection, of multiple sources to a single system to provide greater enrichment and insight of reporting.Data transform is about translating, normalizing, aggregating, and changing data in order to understand it within your business context and needs. It’s about turning all dates into Year-Month-Day, confirming a well-formed phone number, or looking up a state based on a zip code. This is the “garbage in, garbage out” effort and is the critical function in combining disparate data sets.As these are the two elements that build all data integration processes, changes to either of these two components can affect performance and capability and, by association, affect your SLAs.From this perspective, let’s examine the challenges facing existing systems.The first area of difficulty is with scale. Stresses to the system can occur in 3 places -- the source system, the target system, and the ETL grid itself – and these loosely translate to the 3 types of SLAs you and your business face – ingestion, processing, and query latency. Source system issues, like handling spikes in order fulfillment or other high volume transactions, are particularly acute if customer-facing: no one likes a slow website, let alone timed out applications, either of which greatly compromise the trust you have with your user – did my data get saved or not? Problems with target systems are common in that the system either faces storage constraints – you can only save 6 days worth of processed transactions -- or resource contention – reporting queries have to share resources with processing tasks, and thus your BI query takes too long and your spreadsheet connection fails. This latter issue can also manifest if the reporting system has an increased number of users or queries, like if a new dashboard in the finance portal hasn’t been properly tested and vetted, and the activity overwhelms the system. 
The last element, the ETL grid, can have issues with scale from many sources, ranging from processing language constraints and intermediate storage to network bottlenecks getting data out of the target system and into the source systems. In short, scale for ETL processing, with the many steps in the the flow, becomes a proverbial “weakest link” problem.The second area of difficulty with the current state of integration is with flexibility. This manifests in two forms. The first is with storage and raw data. This is particularly acute with unstructured data, like written reports or sensor information, and asks the question, if you could easily ask new and different questions of this data, could you speed up your processing or shorten your analysis? And why is is this a problem? Because unstructured data doesn’t fit well into traditional relational databases, the common source for most data, and during import, you need to make decisions regarding the data in order to persist it in the schema of the database. In effect, this means you need to know the questions before you load the raw data, and this also means that you are limited in what questions you can ask unless you reload the data, which most likely means changing schema and other costly activities, let alone the the load itself. So, with most source systems, you have limited expression with unstructured data. The second element of flexibility is with the transformation itself. Most systems, again relational databases, rely on SQL, and while this language is both powerful and accessible to programmers and non-programmers alike, it lacks the expressive power and range of an imperative language like Java or Ruby. For you and your business, it means that you might have difficulty performing the right transformation without resorting to other systems or custom extensions, both of which can be costly in terms of time, money, scale, and skills.The third area that poses difficulty is with reliability. With many traditional ETL processing jobs, if something fails mid-flight, some phase of the processing chain encounters something that ultimately compromises the results of the job, there is very little recourse than to rerun the job from the beginning. This can be extremely costly when processing can take hours, if not days. For these systems, they lack the ability to replay processing from the point of failure, or can do so, but with significant cost of storage and data movement.
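A sketch of the transformations this note describes, in Hive/Impala-style SQL (table and column names are hypothetical, not from the deck): date normalization, a phone-number sanity check, and a zip-to-state lookup folded into one pass over the raw data.

  -- Hypothetical transform: normalize dates, validate phone numbers, look up state by zip
  SELECT t.txn_id,
         from_unixtime(unix_timestamp(t.txn_date, 'MM/dd/yyyy'), 'yyyy-MM-dd') AS txn_date_iso,
         CASE WHEN t.phone RLIKE '^[0-9]{10}$' THEN t.phone ELSE NULL END AS phone_clean,
         z.state AS state
  FROM raw_transactions t
  LEFT JOIN zip_codes z ON t.zip = z.zip;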
  6. [[COULD MERGE INTO PREVIOUS SLIDE]]Escalating costsSystem-related costs: Network movement; ELT contention (DW serving multiple purposes, roles)Resource costs; Multiple skill sets; Maintenance of schemas in source, target; Associated change management overhead/coordination; Needs visibility of entire process; Needs expert skills for distributed programming; Results: small changes == big costs“In short, the many moving parts of the data processing chain, both system and human, present a wide spectrum of operations that you and your business must defend against compromise if you are to maintain and forward your business objectives.”--++--++--The last area of that challenges current data integration systems is cost – costs in several forms. The cost of moving data from the source systems, into the ETL grid, and onto the target system is expensive due to the relative narrow channel of networking compared to its computing and storage counterparts. Moving an entire month of regional store transactions across your network is significantly slower than reading that same information from disk. Resource contention is also a contributor to movement costs, if either the source or target system has other responsibilities it must fulfill, especially where the transformations are executed within the target, i.e. the reporting, system itself – the shift of ETL to ELT. These systems need to scale to handle the load of processing as well as reporting, and this can incur significant costs.There is also the human costs associated with data integration. Scaling conventional databases as the source and target systems requires advanced knowledge of system architecture, and implementation often requires service interruptions and expert resources to execute. Moreover, there is the cost of maintaining separate source and target schemas and enforcing proper change management procedures for each – as your data varies, you and your teams need to adjust to handle these changes at both ends of the processing chain and at scale, this can be a formidable task that can take days, if not weeks for even minor changes to a process, like adding a single dimension to an existing report. The cost of scaling the ETL transformation itself can also be difficult, as writing and maintaining distributed computing programs requires sophisticated ETL and development skills. And these resources need to maintain a high degree of awareness of both the source and target schemas as well as the intermediate ETL step schemas in order to implement changes in a timely and effective manner. Change management issues aside, it is no small wonder that adding a single new column in a complex ETL workflow can take days and weeks.So, you can see that many factors can contribute to slow downs and failures that impact your SLAs and integration efforts, regardless of where in the chain: ingestion, processing, and query. In short, the many moving parts of the data processing chain, both system and human, present a wide spectrum of operations that you and your business must defend against compromise if you are to maintain and forward your business objectives.
  7. HadoopNothing cost effective beforePlatformNew approachCompute to storageCompute is smallCollocationDistributionParallelism abstractedCode for 10TB == Code for 10PBLinear scaleLinear economicsFlexibilityHallmark for ETLAsk questions laterSchema-on-readMultiple interpretations, single loadReduce overheadUse existing BI toolsExtend beyond SQLAccessible to usersCornerstone of innovationETL enhancementsAny kind of dataIntermediate data ReplayReliability--++--++--It is here that Hadoop can offer new tools and approaches that can address these challenges in your data integration infrastructure and procedures. Until recently, there really hasn’t been a cost effective solutions to these problems, but today, many are using Hadoop to complement their existing data warehouses and ETL systems to improve performance, reduce costs, and enable new insights. Hadoop brings you and your business a flexible, scalable platform for storing, processing, and analyzing any data, from your ERP and structured data sources to your reams of lab test notes and other unstructured sources.Hadoop is a new approach to data management and analysis that reverses the traditional flow of bringing your storage – your data – to your compute and processing. Instead, with Hadoop, your compute is brought to your data. Your processing instructions are considerably smaller than the data upon which they act, so this shift greatly diminishes the penalty of data movement. Now, instead of moving that month of store transactions to your reports, you move your comparatively tiny report queries to the transactions. In fact, Hadoop’s architecture has coupled both storage and compute onto the same node within the cluster, so the scale enabled by distributed, parallel processing is available to both storage and compute simultaneously to all of your data and processing. So, once your data is loaded into Hadoop, an activity which benefits from the storage distribution and the high-throughput architecture underpinning the cluster, your processing can be brought directly to that data. Hadoop also transparently handles the data distribution, collection, and processing, so the implementation details of parallel programming are abstracted away for your batch and interactive query and analysis. The programming models of MapReduce and Impala are primary examples of how Hadoop can simplify your programming efforts, as writing code to for a 10-node, 50TB cluster is identical to the code for a 1000-node, 10PB cluster. What would you have to do if your 1TB warehouse grew by 10x? What skills and resources would you need? With Hadoop, you add 10x the nodes to the cluster, practically brute force, with little forethought or advanced skills. The simple act of collocating your data with your computing drives demonstrably 10x better scalability, flexibility, and costs, while enabling linear economics through simply adding a new node to the cluster. So, if your data integration needs more capacity or more horsepower, the answer with Hadoop is straightforward and clear.While the power of data locality is a strength of Hadoop, the hallmark for data integration is the flexibility offered by Hadoop. If you and your business didn’t have to know the questions you might ask before loading the data into your ETL systems, how might that change how you built your transformations and developed your reports? If changes to reports became relatively inconsequential and isolated to just the report itself? This is what Hadoop brings to data integration. 
The schemas involved with data processing – the source, target, and intermediate schemas – all become transient and subsumed by the query, not the storage, layer of the ETL process. This is what is called “schema-on-read” -- the schema and format of the bytes on disk are determined on-the-fly when read for processing -- and it is arguably the most powerful feature that Hadoop brings to ETL and all computing functions. Schema-on-read enables the same data to be read multiple times, each with a separate interpretation, without affecting or changing the source data itself. This means that you and your business can load your raw data once into Hadoop, and then process and reprocess repeatedly by simply reissuing queries – by simply asking new questions. You greatly reduce the overhead of maintenance and change management, as the queries, not the schemas, need to be adjusted and tracked. And since this benefit is powered by queries, not structures built into a system, all your existing BI and ETL tools can take full advantage of the schema-on-read advantage with minimal transition. This also extends beyond “just queries,” as that assumes a SQL-only mindset. Hadoop allows a full range of computing frameworks, like SQL, Java, Python to name a number of familiar tools, and this is the reason why you can continue to use your existing tools now and in the future. It also means that all your business users, no matter their programming sophistication, can use Hadoop for their processing needs, from the developer constructing MapReduce jobs by hand to the business analyst employing an industry-standard BI suite to build visual processing workflows using real-time SQL queries with Impala. This is a cornerstone for innovation on the platform, and your data integration processes can directly benefit from computing capabilities available now and those in the future without needing to change the underlying data and infrastructure.Moreover, without the constraints of relational schemas, you and your business can more easily incorporate any kind of data, regardless of structure, because you are storing the raw data to disk, not the structure. This flexibility empowers bulk, incremental, and streaming data loads, since Hadoop isolates the bytes from the interpretation. The relative ease of storage with Hadoop, powered by schema-on-read and data locality, make persisting intermediate results of the ETL process accessible and efficient. This can improve your ETL processes, since the intermediate data is available, and yet you still have flexibility to interpret this data in subsequent stages. Thus if an intermediate stage encounters an error, you can easily adjust the processing at that phase and replay from this point with the existing intermediate data. Reliability is also bolstered by Hadoop’s inherit fault-tolerance as a distributed storage and processing system, a characteristic that your long-running transformations can enjoy without additional programming or planning.
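A minimal sketch of schema-on-read in Hive-style SQL (paths, table, and column names are assumptions): the same raw files are declared once as opaque lines and again as delimited columns, with neither definition reloading or altering the bytes on disk.

  -- Hypothetical: two read-time interpretations of the same raw files
  CREATE EXTERNAL TABLE raw_transactions (line STRING)
  LOCATION '/data/raw/transactions';

  CREATE EXTERNAL TABLE transactions_csv (
    txn_id   STRING,
    store_id STRING,
    txn_ts   STRING,
    amount   DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/raw/transactions';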
  8. Cost economicsOpen sourceIndustry-standard to engineered systems1-2 orders magnitudeDW optimization for ELTResource economicsSource and target schemasParallel programmingAlignment of resourcesShared skillsToolsEfficient combinations“In short, Hadoop offers to you and your business the ability to do more with less: more computing and storage on less infrastructure costs, more flexibility and expression on less structure and constraints, more focused work with less resource specialization. Hadoop empowers you to find the best resource optimization and workload alignment for data integration.”--++--++--Another benefit that Hadoop offers you and your business is aggressive cost economics. As mentioned earlier, the costs of ETL come in several forms. From the hardware and infrastructure perspective, Hadoop can drastically reduce overall costs due to its open source licensing and ability to operate on a range of form factors, from engineered systems to heterogeneous industry-standard hardware. All told, these features combine to offer you and your business 1-2 orders of magnitude more cost-effective processing, a significant gain in an era of less. Moreover, by shifting ELT processing to a cost-effective Hadoop cluster, you can avoid the expensive transformations within your data warehouse, and simultaneously diminishing resource contention. In effect, ETL with Hadoop lets your optimize your other data integration systems to focus on their more critical jobs.You and your business can also benefit from more subtle cost savings from consolidated operations and its direct impact on your resource allocation and mix. With Hadoop as your data processing backbone, the potential to reduce the overhead associated with maintaining source and target schemas and the relative ease of developing efficient parallel ETL processes is significant and can give you and your business the opportunities to align your resources more directly with immediate business objectives. Likewise, the skills needed to establish, execute, and change many of the stages and steps throughout the data integration chain are shared, and thus your resources can more readily contributed to the entire ETL processing flow, from your business analysts to your ETL developers, using common knowledge, skills, programming models, and tools.In short, Hadoop offers to you and your business the ability to do more with less: more computing and storage on less infrastructure costs, more flexibility and expression on less structure and constraints, more focused work with less resource specialization. Hadoop empowers you to find the best resource optimization and workload alignment for data integration. All of this gives you and your business the tools to find the most efficient combination to meet your SLAs and take on what’s next for your business.
  9. Hadoop offers compelling alternativeNothingas cost effective before; Also, broadly, as Hadoop == PlatformWhy Hadoop? New approach; Compute to storage; Data is big, yet compute is small – which do you want to move at scale? Idea of collocation of compute and dataDistribution mechanics; Parallelism abstracted, easier to code; Code for 10TB == Code for 10PB; Linear scale, economicsFlexibility; Hallmark for ETL; Ask questions later; Schema-on-read; Multiple interpretations, single load, reduce overhead; Use existing BI tools and extend beyond SQL – more accessible to more types of users; Cornerstone of innovationETL enhancements; Any kind of data; Handle intermediate data simply; Simple replay mechanics; Improved reliability[[INSERT OPTIONS FOR PROCESSING – MIGHT NEED AN ADDITIONAL SLIDE HERE]]Cost economics; Open source; Industry-standard to engineered systems; 1-2 orders magnitude; DW optimization for ELTResource economics; Reduced management of source and target schemas (just a query now); Parallel programming simplified; Offers better alignment of resources to tasks and goals more meaningful to the business, not operations; Use of shared skills, tools; Efficient combinations“In short, Hadoop offers to you and your business the ability to do more with less: more computing and storage on less infrastructure costs, more flexibility and expression on less structure and constraints, more focused work with less resource specialization. Hadoop empowers you to find the best resource optimization and workload alignment for data integration.”--++--++--It is here that Hadoop can offer new tools and approaches that can address these challenges in your data integration infrastructure and procedures. Until recently, there really hasn’t been a cost effective solutions to these problems, but today, many are using Hadoop to complement their existing data warehouses and ETL systems to improve performance, reduce costs, and enable new insights. Hadoop brings you and your business a flexible, scalable platform for storing, processing, and analyzing any data, from your ERP and structured data sources to your reams of lab test notes and other unstructured sources.Hadoop is a new approach to data management and analysis that reverses the traditional flow of bringing your storage – your data – to your compute and processing. Instead, with Hadoop, your compute is brought to your data. Your processing instructions are considerably smaller than the data upon which they act, so this shift greatly diminishes the penalty of data movement. Now, instead of moving that month of store transactions to your reports, you move your comparatively tiny report queries to the transactions. In fact, Hadoop’s architecture has coupled both storage and compute onto the same node within the cluster, so the scale enabled by distributed, parallel processing is available to both storage and compute simultaneously to all of your data and processing. So, once your data is loaded into Hadoop, an activity which benefits from the storage distribution and the high-throughput architecture underpinning the cluster, your processing can be brought directly to that data. Hadoop also transparently handles the data distribution, collection, and processing, so the implementation details of parallel programming are abstracted away for your batch and interactive query and analysis. 
The programming models of MapReduce and Impala are primary examples of how Hadoop can simplify your programming efforts, as writing code to for a 10-node, 50TB cluster is identical to the code for a 1000-node, 10PB cluster. What would you have to do if your 1TB warehouse grew by 10x? What skills and resources would you need? With Hadoop, you add 10x the nodes to the cluster, practically brute force, with little forethought or advanced skills. The simple act of collocating your data with your computing drives demonstrably 10x better scalability, flexibility, and costs, while enabling linear economics through simply adding a new node to the cluster. So, if your data integration needs more capacity or more horsepower, the answer with Hadoop is straightforward and clear.While the power of data locality is a strength of Hadoop, the hallmark for data integration is the flexibility offered by Hadoop. If you and your business didn’t have to know the questions you might ask before loading the data into your ETL systems, how might that change how you built your transformations and developed your reports? If changes to reports became relatively inconsequential and isolated to just the report itself? This is what Hadoop brings to data integration. The schemas involved with data processing – the source, target, and intermediate schemas – all become transient and subsumed by the query, not the storage, layer of the ETL process. This is what is called “schema-on-read” -- the schema and format of the bytes on disk are determined on-the-fly when read for processing -- and it is arguably the most powerful feature that Hadoop brings to ETL and all computing functions. Schema-on-read enables the same data to be read multiple times, each with a separate interpretation, without affecting or changing the source data itself. This means that you and your business can load your raw data once into Hadoop, and then process and reprocess repeatedly by simply reissuing queries – by simply asking new questions. You greatly reduce the overhead of maintenance and change management, as the queries, not the schemas, need to be adjusted and tracked. And since this benefit is powered by queries, not structures built into a system, all your existing BI and ETL tools can take full advantage of the schema-on-read advantage with minimal transition. This also extends beyond “just queries,” as that assumes a SQL-only mindset. Hadoop allows a full range of computing frameworks, like SQL, Java, Python to name a number of familiar tools, and this is the reason why you can continue to use your existing tools now and in the future. It also means that all your business users, no matter their programming sophistication, can use Hadoop for their processing needs, from the developer constructing MapReduce jobs by hand to the business analyst employing an industry-standard BI suite to build visual processing workflows using real-time SQL queries with Impala. This is a cornerstone for innovation on the platform, and your data integration processes can directly benefit from computing capabilities available now and those in the future without needing to change the underlying data and infrastructure.Moreover, without the constraints of relational schemas, you and your business can more easily incorporate any kind of data, regardless of structure, because you are storing the raw data to disk, not the structure. This flexibility empowers bulk, incremental, and streaming data loads, since Hadoop isolates the bytes from the interpretation. 
The relative ease of storage with Hadoop, powered by schema-on-read and data locality, makes persisting intermediate results of the ETL process accessible and efficient. This can improve your ETL processes, since the intermediate data remains available, yet you still have the flexibility to interpret that data in subsequent stages. Thus, if an intermediate stage encounters an error, you can adjust the processing at that phase and replay from that point with the existing intermediate data. Reliability is also bolstered by Hadoop's inherent fault tolerance as a distributed storage and processing system, a characteristic that your long-running transformations enjoy without additional programming or planning.
Another benefit that Hadoop offers you and your business is aggressive cost economics. As mentioned earlier, the costs of ETL come in several forms. From the hardware and infrastructure perspective, Hadoop can drastically reduce overall costs thanks to its open source licensing and its ability to operate on a range of form factors, from engineered systems to heterogeneous industry-standard hardware. All told, these features combine to offer you and your business processing that is 1-2 orders of magnitude more cost effective, a significant gain in an era of less. Moreover, by shifting ELT processing to a cost-effective Hadoop cluster, you can avoid the expensive transformations within your data warehouse while simultaneously diminishing resource contention. In effect, ETL with Hadoop lets you optimize your other data integration systems to focus on their more critical jobs.
You and your business can also realize more subtle cost savings from consolidated operations and their direct impact on your resource allocation and mix. With Hadoop as your data processing backbone, the reduced overhead of maintaining source and target schemas and the relative ease of developing efficient parallel ETL processes are significant, and they give you and your business the opportunity to align your resources more directly with immediate business objectives. Likewise, the skills needed to establish, execute, and change many of the stages and steps throughout the data integration chain are shared, so your resources, from your business analysts to your ETL developers, can more readily contribute to the entire ETL processing flow using common knowledge, skills, programming models, and tools.
In short, Hadoop offers to you and your business the ability to do more with less: more computing and storage with less infrastructure cost, more flexibility and expression with less structure and constraint, more focused work with less resource specialization. Hadoop empowers you to find the best resource optimization and workload alignment for data integration. All of this gives you and your business the tools to find the most efficient combination to meet your SLAs and take on what's next for your business.
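The replay idea can be sketched as follows; the directory layout, stage names, per-stage commands, and reliance on the `hdfs dfs` CLI are assumptions made only for illustration. Because each stage's intermediate output persists cheaply in HDFS, a rerun can resume at the stage that failed instead of starting again from the raw extract.

```python
# Sketch: resume a staged ETL pipeline from the last completed stage.
# Each stage writes its output under a well-known HDFS directory; on rerun
# we skip stages whose output already exists. Paths, stage names, and the
# per-stage commands are hypothetical.
import subprocess

STAGES = [
    # (HDFS output directory, command that produces it)
    ("/staged/pos/cleaned",    ["hadoop", "jar", "etl.jar", "CleanRawPos"]),
    ("/staged/pos/conformed",  ["hadoop", "jar", "etl.jar", "ConformDims"]),
    ("/staged/pos/aggregated", ["hadoop", "jar", "etl.jar", "RollupDaily"]),
]

def hdfs_path_exists(path):
    # `hdfs dfs -test -e <path>` exits 0 when the path exists.
    return subprocess.call(["hdfs", "dfs", "-test", "-e", path]) == 0

def run_pipeline():
    for output_dir, command in STAGES:
        if hdfs_path_exists(output_dir):
            print("skipping stage, output already present:", output_dir)
            continue
        print("running stage for:", output_dir)
        # raises on failure, leaving earlier intermediate data in place
        # so the next replay starts here rather than at the beginning
        subprocess.check_call(command)

if __name__ == "__main__":
    run_pipeline()
```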
  10. [[NEED BULLET POINTS]]
11. What is next for business with Hadoop?
Platform; not a point solution; rich opportunities in improving the mechanics of movement and transformation; e.g., precompiled reports/static content for a web site to mimic dynamic content.
Foundation for future answers; new dimensions? new data sources? new SLAs? new sets of users, producers, consumers?
Foundation for existing tools; rich partner ecosystem; built on the platform and its abstractions; therefore can focus on the transition; business focus for you and your business; focus on getting the job done.
--++--++--
What is next for your business? As mentioned in the keynote, Hadoop is not a point solution, it is a platform, and the tools available to empower your ETL processes build a foundation for future applications. From IT log processing to mainframe offloading, Hadoop can offer you and your business rich opportunities to extend its processing benefits to any application rooted in the core elements of movement and transformation. With these foundational elements, you can cost-efficiently preprocess results in bulk and spare your systems the more expensive queries. For example, a customer-facing site can have its individual reports precompiled by a Hadoop-powered processing flow so that customer information is served more like static content than dynamic content; fewer moving parts at the layer with the most to lose is good content management practice (see the sketch following this note).
With these foundations, you can act upon and realize questions like:
How can you add new fields and dimensions to your reports in hours, not days or weeks?
What future processing or new data sources might your business need next month, or next year?
How can you maintain your SLAs in the face of accelerating data growth and more aggressive requests?
Who might be using your data next, and how might they use it?
And you don't need to start from scratch. Hadoop's rich ecosystem includes the partners that help you right now with your ETL needs, and they can help you with your transition to using Hadoop for data integration. They all benefit from the foundation and abstractions that Hadoop exposes for storage and computing, but most importantly, they let you focus on getting your job done, on time and on target.
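Here is a minimal sketch of the precompiled-report idea referenced above; the input format, field names, output path, and the choice of per-customer JSON documents are all assumptions rather than anything prescribed by this deck. A batch step reads the pipeline's rolled-up output and materializes one small document per customer that the web tier can serve as plain static content.

```python
# Sketch: turn the pipeline's rolled-up output into static per-customer
# report documents so the site never queries the warehouse at request time.
# Input format, paths, and field names are hypothetical; the rollup itself
# would come from the Hadoop-side ETL flow (e.g. piped in via `hdfs dfs -cat`).
import json
import os
import sys

OUTPUT_DIR = "/var/www/reports"  # hypothetical directory served as static files

def build_reports(lines):
    # Expect tab-delimited, pre-aggregated rows:
    #   customer_id <TAB> metric <TAB> value
    reports = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue
        customer_id, metric, value = fields
        reports.setdefault(customer_id, {})[metric] = value
    return reports

def write_reports(reports):
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    for customer_id, metrics in reports.items():
        path = os.path.join(OUTPUT_DIR, "%s.json" % customer_id)
        with open(path, "w") as handle:
            json.dump(metrics, handle)

if __name__ == "__main__":
    # Typical use:
    #   hdfs dfs -cat /staged/pos/aggregated/part-* | python build_reports.py
    write_reports(build_reports(sys.stdin))
```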
12. Storage and compute: WORM; proximity; parallelism.
Flexibility: schema interpretation; existing BI and ETL tools; compute frameworks; larger user base.
Cost efficiency: lower hardware, software, and skill costs; workload allocation; IT resource and hardware ROI.
"In short, Hadoop gives you and your business the tools to not just meet your SLAs, but also to tackle your growing integration demands and be ready for what's coming next for your business."
--++--++--
What have we learned about how you can use Hadoop to accelerate and streamline your data integration?
First, Hadoop brings your computing functions to your data, and not the other way around. This greatly accelerates the overall ETL pipeline, as data is effectively loaded once and then processed in place multiple times. One derivative of this feature is that you and your business can replay stalled or failed ETL jobs without starting over from the beginning, which can have a profound effect on meeting your SLAs. Hadoop also brings storage and compute onto the same physical node, allowing your programs to exploit the distributed system's inherent data proximity to reduce data movement, one of the core elements of the ETL process. By the same token, Hadoop's computing operations, as well as its data load operations, take advantage of the easy parallelism that comes with its distributed operations and collocated services. In effect, the collocation of storage and compute on the same node gives you and your business linear scalability on two axes simultaneously.
Second, you and your business gain the latitude and freedom to change how you interpret your data without the overhead and effort of modifying rigid relational schemas. This allows you and your business to ask more questions, and new questions, of your data more frequently, with little fear of operational repercussions. In addition, all computing within Hadoop, regardless of type, benefits from schema-on-read, and this includes your existing BI and ETL tools that are familiar and approachable; whether you execute your transformations in hand-coded Python scripts or in a visual data flow tool, you benefit from this control. Hadoop also gives all these computing frameworks and tools an abstraction layer that removes many of the taxing details of efficient parallel programming, so processing in Hadoop becomes immediately usable and cost effective for a larger population within your business, not just your expert developers.
Lastly, Hadoop offers compelling infrastructure cost savings, at least 1-2 orders of magnitude less than conventional relational systems, and enjoys the economics of a proven open source model, too. You and your business can also realize cost savings through better workload alignment across your data integration infrastructure. As Hadoop takes on more of your integration tasks from your data warehouse, your data warehouse can better focus on the critical tasks best suited to that system. Furthermore, Hadoop, as a platform for both storage and processing, consolidates operational and execution skills and knowledge, helping you and your business focus resources on what matters most to forwarding your business goals.
In short, Hadoop gives you and your business the tools to not just meet your SLAs, but also to tackle your growing integration demands and be ready for what's coming next for your business.
13. Select a particularly challenging use case: misses SLAs frequently; processes a large quantity of data.
Implement the full pipeline on Cloudera: create a parallel environment; implement the pipeline in Cloudera.