Over the past decade, big data implementations have grown more sophisticated, particularly for organizations operationalizing machine learning, analytics, and data engineering. The pressures of data-driven cultures and of multi-workload applications such as customer care, fraud management, and cross-platform marketing are changing the game. Mixing machine learning with business processes and operationalizing analytics with data engineering practices places burdens on IT teams. Making these advanced data environments work together is an ongoing challenge.
While you can still “swipe and go” to implement data management environments in the cloud, that easy path is often littered with additional costs and higher maintenance and synchronization overhead. Data-savvy organizations are taking a more measured, coordinated approach to their machine learning, analytics, and data engineering infrastructures. These proactive approaches speed adoption among business stakeholders and reduce administration and governance burdens for technologists.
Join John L. Myers, managing research director at leading IT analyst firm Enterprise Management Associates (EMA), and Nik Rouda, director of product marketing at Cloudera, to discover how the world of cloud implementations has changed for the better, and what an enterprise-grade cloud environment built on the right resources could look like for your organization.
Attend this webinar to learn about:
Drivers for implementing machine learning, analytics and data engineering with a proactive approach
Pitfalls associated with “immediate gratification” implementations
How business stakeholders benefit from proactive approaches
How proactive implementations improve the workloads of technologists
Examples of real world customer implementations
IT & DATA MANAGEMENT RESEARCH, INDUSTRY ANALYSIS & CONSULTING
Looking Before You Leap Into the Cloud:
Taking a proactive approach to machine learning, analytics, and data
engineering in the cloud
John L. Myers
Managing Research Director
EMA
Nik Rouda
Director of Product Marketing
Cloudera
Topic #1:
Drivers for implementing machine learning, analytics,
and data engineering with a proactive approach
65.5% of next-generation data management implementations, such as Cloudera CDH, use some form of cloud deployment.
There are some pros and cons to cloud environments in the context of analytics workloads and data pipelines.
The benefits on the left are pretty well-known; cloud service providers have been pushing these for some time now.
The disadvantages may be lessons you learn the hard way; we’d like to save you some pain. Cloud is easy for an individual to get into, but very hard to optimize for an enterprise. These are very real problems, and they are actually exacerbated by the multitude of distinct services available in the cloud. In a nutshell, most organizations accidentally end up recreating the data silos they had on-premises, along with all the extra effort and risk that comes with silos.
[ASK: how important is it for you to solve the problems on the right?]
All of this makes the choices tougher because traditional applications use just one kind of data and a single analytic approach. Delivering catalog, security, and governance for even that single system is a challenge in bare-metal environments, but it becomes particularly tough in the cloud, where metadata and policies don’t persist when an elastic workload is dropped.
[ASK: do fragmented silos make it hard for you to manage and guarantee security/compliance/etc.? Do you end up often recreating the context, definitions, and permissions of the same data?]
Top 5 Advanced Analytics objectives
Graph analytics (e.g., influencer analysis)
Regression algorithms to predict information based on independent variables
Decision tree (recursive partitioning) algorithms
Feature selection algorithms (e.g., PCA, PLS)
Time Series Forecasting and Smoothing
When these objectives are linked with daily or weekly change frequencies, data engineering departments quickly fall behind in their implementations.
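To make the first item above concrete, here is a minimal sketch of a regression that predicts a dependent value from a single independent variable, written in pure Python. The data and variable names are invented for illustration; production work would use a library such as scikit-learn.

```python
# Illustrative only: ordinary least squares on one independent variable.
# The ad-spend/sales numbers below are made up for the example.

def fit_line(xs, ys):
    """Fit y = a*x + b by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Predict weekly sales (dependent) from ad spend (independent variable).
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.1, 5.0, 6.9, 9.0]
slope, intercept = fit_line(spend, sales)
predicted = slope * 5.0 + intercept  # forecast at spend = 5.0
```

The same shape of problem, extended to many variables and much more data, is what makes these workloads cluster-sized.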
Cloudera supports four major workloads, and each one addresses different analytics functions. Each stands alone as an industry-leading, open-source approach. Together, they handle your complete data pipeline. We’ve found again and again that the most high-value analytics applications combine these on the same platform with the same data, all managed logically in one place.
[ASK: What tools are you using for these today? Are they well integrated from the same vendor? Or do you handle each one separately? At what cost?]
So now the architecture changes in the cloud. We already talked about why there are separate clusters. Now, let’s talk about how they fit together and how they’re different.
Some clusters are going to be persistent, or running 24x7.
Others are going to be transient, so spin up for a few hours, run a job, and shut down.
Others are going to be clusters with both characteristics, so maybe a persistent cluster that is always up but bursts on occasion and then scales down.
They all have different characteristics.
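A back-of-the-envelope model shows why the persistent/transient distinction matters economically. The hourly rate, node counts, and job durations below are hypothetical, not Cloudera or Azure pricing.

```python
# Hypothetical cost comparison: always-on cluster vs. transient job cluster.

HOURLY_RATE = 0.50  # invented cost per node-hour

def persistent_cost(nodes, hours=24 * 30):
    """A persistent (24x7) cluster is billed around the clock for a month."""
    return nodes * hours * HOURLY_RATE

def transient_cost(nodes, runs_per_month, hours_per_run):
    """A transient cluster is billed only while a job is running."""
    return nodes * runs_per_month * hours_per_run * HOURLY_RATE

# 10-node cluster: always-on vs. a nightly 2-hour ETL job.
always_on = persistent_cost(10)          # 10 nodes * 720 hours * 0.50
nightly_etl = transient_cost(10, 30, 2)  # 10 nodes * 60 hours * 0.50
```

Under these made-up numbers the transient pattern costs roughly a twelfth of the always-on pattern, which is why matching cluster lifecycle to workload shape is worth the design effort.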
Let’s say you have a use case where you are analyzing purchases in real time to help determine when you might be out of stock.
The clusters ingesting the data, running Kafka and Spark Streaming, are probably running 24x7 because you would be getting data at all times throughout the day. You probably want high availability (HA), disaster recovery (DR), and the ability to upgrade the cluster.
After the data is ingested, you’re going to need to process it so that your analysts can use it: spin up a cluster, run an ETL job, and then shut the cluster down. You don’t need HA because if you lose a NameNode, you can just spin up a new cluster. Security matters less since it’s a single-user cluster.
Next, the data is probably going to be analyzed. This might be a BI tool and you’re probably going to keep that up 24x7 since people might connect to it at all hours and you want to maintain the metadata. But it’s going to get heavy usage during work hours, so you probably want to spin up additional nodes to support all those users.
Finally, maybe you have an application that is using a NoSQL backend to keep track and notify folks responsible for supply chain that they need to restock items. Again, that’s going to be a persistent cluster since that’s an application that will always be running.
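The restocking logic at the heart of this use case can be sketched in a few lines. This is plain Python standing in for the Kafka/Spark Streaming pipeline described above; the SKUs and threshold are invented for the example.

```python
# Sketch of the out-of-stock detection only. In the real pipeline, purchase
# events arrive via Kafka and are processed by Spark Streaming; here a plain
# loop stands in, with made-up inventory data.

STOCK = {"sku-1": 5, "sku-2": 2}
REORDER_THRESHOLD = 3  # hypothetical reorder point

def process_purchase(sku, qty, restock_alerts):
    """Decrement stock and flag SKUs that fall below the reorder threshold."""
    STOCK[sku] -= qty
    if STOCK[sku] < REORDER_THRESHOLD:
        restock_alerts.append(sku)

alerts = []
for sku, qty in [("sku-1", 1), ("sku-2", 1), ("sku-1", 2)]:
    process_purchase(sku, qty, alerts)
# sku-2 falls to 1 and then sku-1 falls to 2, so both trigger alerts.
```

The alerts list is what the NoSQL-backed notification application at the end of the pipeline would consume.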
Fundamentally, Cloudera leverages Azure Virtual Machines (from D, G, and L series) to provision nodes in a customer’s Azure environment to provide elastic scale. Azure Storage (Premium and Standard) is also used to independently scale out cluster storage capacity on demand. Azure ExpressRoute is used to accommodate customers who need a fast, private network from an on-premises or colocation facility to transfer data to Cloudera in Azure. Power BI integration provides visual analytics capability for end users.
Cloudera has also recently released the integration to Azure Data Lake Store (ADLS) to enable greater performance and scalability, leveraging the cloud object store technology built for big data in Azure.
Cloudera is also available in the Azure Marketplace (since 2015) to enable fast, one-click deployment of Cloudera Enterprise Data Hub to Azure customers. What used to take weeks or more on-premises can now be accomplished in under an hour.
Underlying everything is our SDX, which has the shared metadata catalog that facilitates consistent data management and operations everywhere and anywhere. SDX also includes comprehensive, granular security to protect against threats and unified governance for the audit and search capabilities that the modern world demands, especially with standards like PCI-DSS and GDPR.
For IT, that means you can set policies once and enforce them everywhere. For analysts, data scientists, and others, SDX enables self-service and increases productivity. For the business, it means understanding customers better, connecting products and services, and protecting the business with confidence.
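“Set policies once, enforce them everywhere” can be illustrated with a tiny sketch: one shared catalog entry drives the same access decision for every workload. The catalog structure and names are invented for this example; SDX’s actual services are far more involved.

```python
# Hypothetical illustration of a shared policy catalog. The dataset name,
# roles, and schema here are invented, not SDX's real data model.

SHARED_CATALOG = {
    "customers": {"pii": True, "allowed_roles": {"analyst", "data_scientist"}},
}

def can_read(dataset, role):
    """Every workload consults the same shared policy, defined once."""
    policy = SHARED_CATALOG[dataset]
    return role in policy["allowed_roles"]

# The identical check applies whether the caller is a BI query,
# a Spark job, or a data science notebook.
decisions = {w: can_read("customers", "analyst") for w in ("bi", "spark", "notebook")}
blocked = can_read("customers", "intern")
```

Because the policy lives in one place, changing it once changes the answer for all three workloads at the same time, which is the point of a shared data experience.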
Cloudera Altus is our platform-as-a-service offering, delivering ETL, machine learning, and data processing on Amazon Web Services and Microsoft Azure. In the not-too-distant future, you’ll see us move beyond data engineering to analytic and data science workloads, delivered via any underlying cloud platform, including Amazon, Microsoft, and Google.
The first Altus experience we’re delivering is data engineering as a service.
Think about ETL for machine learning and analytics. Altus is available on AWS today, and we are planning to release on Azure in the future.
Altus runs on cloud-native infrastructure, so it’s easy to spin up transient clusters that have large-scale compute, process the data, and write your output back to a cloud object store like Amazon S3.
Altus supports our standard CDH distribution, which includes Hive, Spark, and Hive on Spark.
You can see the Altus portal here to the right of the text on the screen. You can access Altus with a simple login, and then work within the portal or through a CLI if you want to submit jobs programmatically.
Jobs are considered first-class objects on Altus. You can submit, clone, troubleshoot, and sort jobs. Many of you are running upward of 100 workloads in a day, so you may want to view a history of those jobs, find and troubleshoot the failed ones, and run them again.
Because Altus is a PaaS, you don’t need to deal with installing software, worrying about cluster configuration, resource management, or patching.
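The jobs-as-first-class-objects workflow (keep a history, find failures, resubmit them) can be sketched as follows. This is an invented stand-in in plain Python, not the actual Altus API or CLI.

```python
# Hypothetical sketch of a job-history workflow: list failed jobs from a
# day's history and clone them as fresh submissions. The job records and
# field names are invented for illustration.

jobs = [
    {"id": "j-001", "name": "nightly-etl", "status": "SUCCEEDED"},
    {"id": "j-002", "name": "sessionize",  "status": "FAILED"},
    {"id": "j-003", "name": "dedupe",      "status": "FAILED"},
]

def failed_jobs(history):
    """Filter the history down to failed runs."""
    return [j for j in history if j["status"] == "FAILED"]

def resubmit(job):
    """Clone a failed job as a fresh submission."""
    return {**job, "id": job["id"] + "-retry", "status": "SUBMITTED"}

retries = [resubmit(j) for j in failed_jobs(jobs)]
```

At 100+ workloads a day, exactly this filter-and-resubmit loop is what a managed job history saves you from scripting by hand.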
The usual issue is data movement. Customers need to have a story figured out for that; otherwise, it becomes a painful conversation. What are some of the patterns we have seen people use successfully? If they already back up data to S3, that works.
Here we see it all together: 4+ analytics workloads, 4 deployment models, and 1 shared data experience. Again, no one else offers this choice and common controls all together.
ADECCO: Adecco uses Cloudera Enterprise on Azure to power its Search and Match solution, connecting qualified candidates to job vacancies, with a reported 30% reduction in time-to-fill and a 20% reduction in job board spend in its first 90 days.
JOY GLOBAL: Cloudera on Azure makes it easy for Joy Global teams in the field to analyze equipment data from their own and third-party PLC-based equipment to get a systems view of machine operation.
WORLDWIDE FINANCIAL INSTITUTION (BLINDED): Better detects fraud (money laundering) and complies with federal regulations and authorities.
---- DETAIL/SPECIFICS ----
Adecco:
Search Technologies Helps Adecco Group Significantly Improve Recruiter Efficiency
http://www.prweb.com/releases/2015/11/prweb13100660.htm
(PRWeb: Search Technologies press release, 12/2/2015.) Additional excerpts:
“Search and Match Application Based on Cloudera and Solr Improves Recruiter Response Times and Fill Rates”
“Adecco was recently short-listed for the prestigious Cloudera Business Impact Award at Hadoop World 2015”
Joy Global:
Joy Global is a world leader in making heavy-duty mining equipment for both surface and underground excavation. The company had a legacy IoT predictive maintenance system built in 2008 and had challenges meeting the scale and performance demands of its business. As the company grew, monitored more and more equipment, and served an increasing user base, it started to feel pressure points on the architecture that made it difficult to scale and support the global user base. Joy Global has a wide variety of data types collected from mining machines: machine pressure, temperature, currents, voltages, and a range of other sensor data, all sampled at high frequencies and increasing at an exponential rate. A single machine could have 800 data points, generating roughly 30,000 to 50,000 unique time-stamped records in a one-minute file.
Cloudera on Azure makes it easy for Joy Global teams in the field to analyze data that they pull in from Joy Global equipment (such as longwall systems, shovels, wheel loaders, continuous miners, and others), and also from third party PLC-based equipment to get a systems view of machine operation.
This expanded capability allowed one of Joy Global’s longwall mining operator customers to acquire data not just from the Joy longwall system, but also from ancillary equipment. Using Impala on HDFS in Azure and an HBase store for time-series data, the team is also able to provide access to this data through self-service visualization reports. The ability to create custom reports and ad hoc analysis from a common set of data enabled regional engineers to answer customers' questions faster. One outcome from this engagement was production optimization and a doubling of weekly cutting hours from the Joy Global longwall system.
Joy Global has realized significant cost savings on its cloud infrastructure by moving to Azure. The company is able to deliver all of the data for its customers with much less compute than in the previous system, while handling far more data and intelligence. Because its reputation for quality demands a 24x7 monitoring operation, Joy Global relies on Cloudera and Microsoft Azure to maintain that quality.
Worldwide Financial Institution:
The Worldwide Financial Institution needed visibility into and access to data in order to better understand what is happening with its products and business at all levels of the organization. In addition to providing insightful information to executives, the solution gives the business insight into critical information so it can revise the way it does business today. The existing data mart sits within the PCI zone, making access and self-service challenging. Information needs to be accessible and accurate, which requires a framework that is integrated, repeatable, and scalable enough to accommodate future reporting needs. The new solution allowed the institution to detect fraud (money laundering) and comply with federal authorities.
We allow you to run anywhere and deploy any way you choose, giving you a simple, unified enterprise experience. We simplify your operations so you can work with familiar tools and focus on your job without worrying about cloud infrastructure management. “Unified” means you can have a similar experience across any workload, whether in a hybrid or multi-cloud environment, and whether in a PaaS or infrastructure-as-a-service deployment. Lastly, everything we do at Cloudera is built to be enterprise-grade, proven at great scale with a trusted security model, and delivered with consistent governance and workload management.