Enabling data management in a
big data world
Craig Soules
Garth Goodson
Tanya Shastri
The problem with data management
• Hadoop is a collection of tools
– Not tightly integrated
– Everyone’s stack looks a little different
– Everything falls back to files
Agenda
• Traditional data management
• Hadoop’s eco-system
• Natero’s approach to data management
What is data management?
• What do you have?
– What data sets exist?
– Where are they stored?
– What properties do they have?
• Are you doing the right thing with it?
– Who can access data?
– Who has accessed data?
– What did they do with it?
– What rules apply to this data?
Traditional data management
[Diagram: external data sources flow through an ETL (extract, transform, load) pipeline into the data warehouse, which integrates storage and data processing; users interact with it only through SQL.]
Key lessons of traditional systems
• Data requires the right abstraction
– Schemas have value
– Tables are easy to reason about
• Referenced by name, not location
• Narrow interface
– SQL defines the data sources and the processing
• But not where and how the data is kept!
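The "referenced by name, not location" point can be illustrated with any SQL engine. A minimal sketch using Python's built-in sqlite3 (the table name `events` and its contents are invented for illustration):

```python
import sqlite3

# The application names a table; the engine decides where and how the
# bytes live. Swapping ":memory:" for a file path changes the storage
# without touching any query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, action TEXT)")
conn.execute("INSERT INTO events VALUES ('alice', 'login')")

# The narrow interface: SQL names the data source (events) and the
# processing (COUNT), but never a storage location.
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # -> 1
```

This is the abstraction the slide credits traditional systems with: every access passes through the engine, so the engine can control, audit, and relocate the data freely.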
Hadoop eco-system
[Diagram: external data sources are ingested via Sqoop and Flume into the HDFS storage layer, with HBase alongside; a processing framework (Map-Reduce) runs above storage, reached by users through Pig, HiveQL, and Mahout; the Hive Metastore (HCatalog), Oozie, and Cloudera Navigator provide metadata, workflow, and audit/access services.]
Key challenges
[Same eco-system diagram; callout: more varied data sources with many more access / retention requirements.]
Key challenges
[Same diagram; callout: data accessed through multiple entry points.]
Key challenges
[Same diagram; callout: lots of new consumers of the data.]
Key challenges
[Same diagram; callout: one access control mechanism: files.]
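The contrast between file-level and table-level control can be sketched in a few lines of Python. The datasets, users, and rules here are invented for illustration; a real deployment would use services like HCatalog or Navigator rather than in-memory dictionaries:

```python
# With file-based control, permission is all-or-nothing: anyone who can
# read the file sees every column in it.
file_acl = {"/data/users.csv": {"alice", "bob"}}

def can_read_file(user, path):
    return user in file_acl.get(path, set())

# With a table abstraction, access can be granted per column, even
# though both columns live in the same underlying file.
column_acl = {("users", "email"): {"alice"},
              ("users", "age"): {"alice", "bob"}}

def can_read_column(user, table, column):
    return user in column_acl.get((table, column), set())

print(can_read_file("bob", "/data/users.csv"))   # True: bob sees everything
print(can_read_column("bob", "users", "email"))  # False: column denied
print(can_read_column("bob", "users", "age"))    # True
```

The point of the slide is that Hadoop's shared storage layer only offers the first model, while the warehouse world long ago moved to the second.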
Steps to data management
1. Provide access at the right level
2. Limit the processing interfaces
3. Schemas and provenance provide control
4. Enforce policy
Case study: Natero
• Cloud-based analytics service
– Enable business users to take advantage of big data
– UI-driven workflow creation and automation
• Single shared Hadoop eco-system
– Need customer-level isolation and user-level access controls
• Goals:
– Provide the appropriate level of abstraction for our users
– Finer granularity of access control
– Enable policy enforcement
– Users shouldn’t have to think about policy
• Source-driven policy management
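Source-driven policy management can be sketched as policies attached only to input sources and propagated through provenance to every derived dataset. The dataset names and the union rule below are illustrative assumptions, not Natero's actual implementation:

```python
# Policies are declared once, on the sources.
source_policy = {"clickstream": {"retain-90-days"},
                 "billing": {"pii", "retain-7-years"}}

# Each derived dataset records its provenance: the datasets it was
# computed from.
provenance = {"sessions": ["clickstream"],
              "revenue_report": ["sessions", "billing"]}

def effective_policy(dataset):
    """A derived dataset inherits the union of its sources' policies."""
    if dataset in source_policy:
        return set(source_policy[dataset])
    merged = set()
    for parent in provenance.get(dataset, []):
        merged |= effective_policy(parent)
    return merged

print(sorted(effective_policy("revenue_report")))
# ['pii', 'retain-7-years', 'retain-90-days']
```

Users never state policy on derived data; declaring it on the sources is enough, which is what "users shouldn't have to think about policy" amounts to in practice.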
Natero application stack
[Diagram: the Hadoop eco-system augmented with an access-aware workflow compiler fronting Pig, HiveQL, and Mahout; schema extraction on ingested sources; a policy and metadata manager; and a provenance-aware scheduler above the Map-Reduce processing framework. Each addition is numbered to match one of the four steps above.]
Natero execution example
[Diagram: the Natero UI submits jobs to the job compiler, which consults the metadata manager; the scheduler then executes the job against its sources.]
• Fine-grain access control
• Auditing
• Enforceable policy
• Easy for users
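The compiler-consults-metadata step can be sketched as follows. This is a hypothetical shape, not Natero's API: the compiler checks the metadata manager and refuses to produce a plan for any job whose user lacks access to one of its sources, so nothing unauthorized ever reaches the scheduler.

```python
# Hypothetical metadata manager: which users may read which sources.
source_access = {"clickstream": {"alice", "bob"},
                 "billing": {"alice"}}

def compile_job(user, sources):
    """Reject the job at compile time if any source is off-limits;
    otherwise hand a plan to the scheduler."""
    denied = [s for s in sources if user not in source_access.get(s, set())]
    if denied:
        raise PermissionError(f"{user} may not read: {denied}")
    return {"user": user, "sources": sources, "status": "scheduled"}

print(compile_job("alice", ["clickstream", "billing"])["status"])  # scheduled
try:
    compile_job("bob", ["billing"])
except PermissionError as e:
    print(e)
```

Doing the check at compile time rather than at file-open time is what makes the access control both fine-grained and auditable.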
The right level of abstraction
• Our abstraction comes with trade-offs
– More control, compliance
– No more raw Map-Reduce
• Possible to integrate with Pig/Hive
• What’s the right level of abstraction for you?
– It depends on the kinds of execution you need to support
Hadoop projects to watch
• HCatalog
– Data discovery / schema management / access
• Falcon
– Lifecycle management / workflow execution
• Knox
– Centralized access control
• Navigator
– Auditing / access management
Lessons learned
• If you want control over your data, you also
need control over data processing
• File-based access control is not enough
• Metadata is crucial
• Users aren’t motivated by policy
– Policy shouldn’t get in the way of use
– But you might get IT to reason about the sources
Speaker notes
  1. Data is abstracted to tables. SQL is a narrow interface that describes the processing. It has taken a while for traditional systems to get here: the structured-data world has evolved into the enterprise and been molded to fit its needs.
  2. (Also the notes for the following "Key challenges" slides.) Pig, HiveQL, Mahout, and Map-Reduce are different processing interfaces. Oozie: workflow dependency management and automation. Cloudera Navigator: centralized access control and auditing. Hive Metastore / HCatalog: centralized data access and schema management.
  5. You don't need to be so restrictive about use, because the system scales out; the data warehouse, by contrast, might get over-provisioned. More people are doing different kinds of things, with a shift toward exploration. People are using Hadoop as an ETL stage into something else, but what policies should be in place for that data when it goes into the data warehouse?
  6. Make this slide about the fact that there is one access control mechanism: files. If you have access to a file, you have full access to it. Contrast that with the data-warehouse world, where having access to a table still does not grant access to the underlying storage: an abstraction forces everything to pass through the processing layer. These tools live within an open eco-system, which prevents them from being complete solutions (at least today), and the shared access control mechanism means that control beyond file-read access is difficult or impossible. Workflows make the data very stepped.
  7. Step 1: Integrate processing and storage. This prevents direct access to storage; all access has to go through the data-processing tools. Without this you cannot effectively track data as it moves through the system, or enable fine-grained or source-based access control. Example: Oracle + NetApp with a single Oracle user. Step 2: Limit the interfaces. The many entry points to Hadoop increase the IT complexity of data management, and in some cases completely defeat step 1. Reducing the entry points provides a way to control access and properly track behavior. Step 3: Metadata collection and tracking. Understanding what is in your data enables fine-grained access control and prevents the incorrect declassification of data; provenance is crucial to effective policy enforcement. Examples: table schemas, logging, column-level access control. Step 4: Policy enforcement. We apply policy to the sources and allow those policies to cascade through the provenance to the result data. There is an enormous body of work on this, but currently nothing in Hadoop enables it. A handful of companies (e.g., Cloudera) are working on it, but the work is still under wraps.
  8. Same tool descriptions as note 2. Watch what is said, to make it clear that developers still have control over who sees what and what runs; this is more about helping them enforce it. Put the step numbers onto the diagram.