2. What to expect...
- What is YARN?
- Why YARN for Spark?
- Current state and future direction for:
- Resource Management
- Security
3. What is YARN?
Resource management and negotiation
- Manages vcores and memory
- Multiple users
- Multiple apps
- Advanced features
4. Why YARN for Spark?
- Advanced resource management features
- Dynamic allocation for Spark executors
- Security
5. Resource Management
Lots of knobs to turn, lots of trial and error.
Memory layout of an executor container (diagram):
- yarn.nodemanager.resource.memory-mb (total per NodeManager)
  - Container
    - spark.yarn.executor.memoryOverhead (10%)
    - Executor: spark.executor.memory
      - spark.shuffle.memoryFraction (0.4)
      - spark.storage.memoryFraction (0.6)
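As a sketch, the knobs above map onto spark-submit options like the following; the values and the app jar name are illustrative, not recommendations:

```shell
# Illustrative sizing only; these are the pre-Spark-1.6 static memory
# management knobs. The YARN container size is spark.executor.memory plus
# spark.yarn.executor.memoryOverhead, and must fit within
# yarn.nodemanager.resource.memory-mb on each node.
spark-submit \
  --master yarn \
  --executor-memory 8g \
  --conf spark.yarn.executor.memoryOverhead=819 \
  --conf spark.shuffle.memoryFraction=0.4 \
  --conf spark.storage.memoryFraction=0.6 \
  my-app.jar
```

Here 819 MB is roughly 10% of the 8 GB heap, matching the default overhead heuristic.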
6. RM (Cont’d)
Dynamic allocation
- No need to specify number of executors!
- Application grows and shrinks based on outstanding task count
- Still need to specify everything else…
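A minimal sketch of turning dynamic allocation on; the executor bounds and app jar name are illustrative:

```shell
# Sketch: enable dynamic allocation instead of a fixed executor count.
# Requires the external shuffle service running on each NodeManager so
# shuffle files survive executor removal.
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  --conf spark.shuffle.service.enabled=true \
  my-app.jar
```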
10. Cached RDDs
- Short term: avoid discarding cached data
- Keep executors around.
- Long term:
- Cache rebalance
- Container resizing
- Off-heap caching
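The short-term "keep executors around" idea corresponds to giving executors with cached blocks a longer idle timeout. A sketch, assuming the `spark.dynamicAllocation.cachedExecutorIdleTimeout` property that appeared in later releases; timeouts are illustrative:

```shell
# Sketch: with dynamic allocation, let executors holding cached RDD blocks
# idle much longer than ordinary executors before being discarded.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=3600s \
  my-app.jar
```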
11. Memory Sizing
How to make application sizing easier?
- Soft container limits (YARN-3119)
- Overhead memory
- Simplified parameters
12. Overhead Memory
Everything that is not the Java heap. Very opaque, hard to size correctly.
- Better metrics to understand usage
- Better heuristics for automatic sizing
13. Future: Simplified Allocation
Single parameter: “task size”. Spark figures out the rest (vcores, heap, overhead). Container sizes can change to match available resources, data locality, etc.
task-size: s  →  Container request: vcores = x, memory = y
17. Tokens (Cont’d)
Tokens have a TTL, which is a problem for long-lived applications:
- Spark Streaming
- Thrift Server
Solution: re-generate tokens periodically.
18. Tokens (Cont’d)
Generating tokens requires a Kerberos ticket
- Spark driver handles KDC login
- Need access to user credentials (keytab)
Secure if coupled with encryption.
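A sketch of the keytab-based login described above, using the `--principal` / `--keytab` spark-submit options; the principal and path are illustrative:

```shell
# Sketch: give the driver a keytab so it can log in to the KDC and
# re-generate delegation tokens before they expire (long-running apps).
spark-submit \
  --master yarn \
  --principal user@EXAMPLE.COM \
  --keytab /path/to/user.keytab \
  my-streaming-app.jar
```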
20. Encryption (Cont’d)
Where does Spark need encryption?
- Control plane
- File distribution
- Block Manager
- User UI / REST API
- Data-at-rest (shuffle files)
21. Encryption (Cont’d)
Goal: easy to configure encryption.
- SSL for HTTP endpoints
- SASL everywhere else
In 1.4: SASL available for shuffle data, SSL available for Akka / file distribution.
22. Spark UI
Support SSL for HTTP UI / API
- SPARK-2750
- Outstanding issues:
- Certificate distribution (cluster mode)
- Shared certificates
23. Spark RPC
- Control plane encryption
- Current: Akka-over-SSL
- Future: RPC-over-SASL
- SPARK-6028, SPARK-6230
- File distribution
- Replace HTTP channel with RPC