SlideShare une entreprise Scribd logo
1  sur  14
Data Annotation
Platform for Machine
Learning
NLP,CV, Tabular,NERandLogdata
Julia Li, Steve Liang
VMware
Oct 19, 2021
Agenda
2
Data annotation is critical for ML
Options to annotate your data
Crowdsourcing vs. In-house data labeling
Data annotation challenges
DAML - Data Annotator for Machine Learning
Functionality
Demo
Q&A
3
Data Annotation is critical for Machine Learning
Garbage in garbage out
Source: https://i.pinimg.com/originals/b6/5f/04/b65f044e766a0f2fe8ad531eeee8e6a0.gif
4
 Hand-label the data
 Humans are given guidelines to perform a labeling task over large datasets
 Weak-supervision (Snorkel)
 Write functions based on heuristics to generate probabilistic labels
 Semi-supervised learning
 Lots of data, little labels
 Train a supervised model and predict for unlabeled data to generate pseudo-labels
 Self-supervised learning
 To make use of large amounts of unlabeled data, the model depends on the underlying structure of
data to predict outcomes
 Most common: language models in NLP (BERT)
Spoiler Alert: Hand-labeling is still the most common and effective method
Methods for Annotating Data
5
 Global network of Independent contractors
• Cheap & lower quality: Mturk, hive, etc.
 Managed data labeling services
• Expensive & higher quality: Scale.ai
 Support most common ML labelling tasks
 Pay as you go
 Lack domain specific expertise required
 Difficult to meet data privacy concerns
 Recurring expense
Why annotation services are popular Where annotation services fall short
Acquiring Hand-Labeled Data
Crowdsourcing vs. In-house data labeling
6
Data annotation is the most challenging limitations to ML adoption​
Data Annotation is Challenging for Enterprise
ML Optimization
ML Experimentation
*Acquiring, Processing,
Annotating Data
WHAT ML ENGINEERS ACTUALLY DO
*2019 Cognilytica report
• Lack good data & tools
• Using spreadsheets to collaborate &
label data
• Difficult to manage
• Not a continuous process
The Challenges
7
A collaborative data annotation platform for efficient data labeling at scale
DAML Makes Data Annotation Easy
The best ML solutions have humans in the loop
• Collaborate on data annotation
• Iterate quickly on data labelling and validation
• Reduce labeling effort using active learning
• Ensure data annotation is a continuous part of
improving ML
8
Support text, image, tabular, NER and log data
DAML Key Features
Annotator Network Data
Discovery
ML Assistance
End-to-end Data
Management
Connect upstream feed
and extract to downstream
tasks
Everyone instantly
becomes a potential
annotator
Share across organizations
and build up quality data
catalog
Active learning
identifies the most
uncertain data
9
DAML Supported
Use Cases
Text
NER
Classification
Question Answering
Image
Classification
Object Detection
Tabular Classification
and Regression
Logs
Event
Classification
10
Runs on any cloud or on-prem
DAML Architecture
DAML UI
(Angular + Clarity)
DAML REST
API
(Node.js + Express)
Data Storage
Calculate
accuracy
Measure
uncertainty
Query
annotators
Train
Model
Active Learning
Service
(Django + modAL)
11
Iterate on a simple prescriptive process
DAML Workflow
Data Upload
Text, Tabular, Images,
NER, Logs
Data
Ingestion
Create
Project
Assign
Annotators
Data
Labeling
Label
Preparation
Model
Training
Model
Track
Progress
Resolve
Disagreement
ML
Pipeline
Active Learning
12
Demo
13
 Apache 2.0 license
 Easy to install & runs anywhere
• $docker compose up
 Collaborative and ML powered
• Collaborate across organizations to label large datasets
• Active Learning to help significantly reduce the workload on the human labelers
 Try out DAML
• OSS Project: https://github.com/vmware/data-annotator-for-machine-learning.git
• Blogs: https://medium.com/vmware-data-ml-blog
DAML makes data annotation easy
Summary
Thank You

Contenu connexe

Tendances

System design for Web Application
System design for Web ApplicationSystem design for Web Application
System design for Web ApplicationMichael Choi
 
DevSecCon Singapore 2018 - Maginot Line – 6 Common AppSec Anti-Patterns Preve...
DevSecCon Singapore 2018 - Maginot Line – 6 Common AppSec Anti-Patterns Preve...DevSecCon Singapore 2018 - Maginot Line – 6 Common AppSec Anti-Patterns Preve...
DevSecCon Singapore 2018 - Maginot Line – 6 Common AppSec Anti-Patterns Preve...DevSecCon
 
Getting Started With QA Automation
Getting Started With QA AutomationGetting Started With QA Automation
Getting Started With QA AutomationGiovanni Scerra ☃
 
Revolutionizing Enterprise Software Development through Continuous Delivery &...
Revolutionizing Enterprise Software Development through Continuous Delivery &...Revolutionizing Enterprise Software Development through Continuous Delivery &...
Revolutionizing Enterprise Software Development through Continuous Delivery &...People10 Technosoft Private Limited
 
Talent42 2017: Mei Lu - 1_Pager Software Developer Job Descriptions
Talent42 2017: Mei Lu - 1_Pager  Software Developer Job DescriptionsTalent42 2017: Mei Lu - 1_Pager  Software Developer Job Descriptions
Talent42 2017: Mei Lu - 1_Pager Software Developer Job DescriptionsTalent42
 
DevOps Friendly Doc Publishing for APIs & Microservices
DevOps Friendly Doc Publishing for APIs & MicroservicesDevOps Friendly Doc Publishing for APIs & Microservices
DevOps Friendly Doc Publishing for APIs & MicroservicesSonatype
 
Introduction to Agile Software Development & Python
Introduction to Agile Software Development & PythonIntroduction to Agile Software Development & Python
Introduction to Agile Software Development & PythonTharindu Weerasinghe
 
Agile archiecture iltam 2014
Agile archiecture   iltam 2014Agile archiecture   iltam 2014
Agile archiecture iltam 2014Dani Mannes
 
Matt carroll - "Security patching system packages is fun" said no-one ever
Matt carroll - "Security patching system packages is fun" said no-one everMatt carroll - "Security patching system packages is fun" said no-one ever
Matt carroll - "Security patching system packages is fun" said no-one everDevSecCon
 
Bridging Concepts and Practice in eScience via Simulation-driven Engineering
Bridging Concepts and Practice in eScience via Simulation-driven EngineeringBridging Concepts and Practice in eScience via Simulation-driven Engineering
Bridging Concepts and Practice in eScience via Simulation-driven EngineeringRafael Ferreira da Silva
 
Object Oriented Apologetics
Object Oriented ApologeticsObject Oriented Apologetics
Object Oriented ApologeticsVance Lucas
 
Henrique Dantas - API fuzzing using Swagger
Henrique Dantas - API fuzzing using SwaggerHenrique Dantas - API fuzzing using Swagger
Henrique Dantas - API fuzzing using SwaggerDevSecCon
 
Agile Languages for Rapid Prototyping
Agile Languages for Rapid PrototypingAgile Languages for Rapid Prototyping
Agile Languages for Rapid PrototypingTharindu Weerasinghe
 
Darpangupta resume
Darpangupta resumeDarpangupta resume
Darpangupta resumeDarpan Gupta
 
DevSecCon Singapore 2019: Embracing Security - A changing DevOps landscape
DevSecCon Singapore 2019: Embracing Security - A changing DevOps landscapeDevSecCon Singapore 2019: Embracing Security - A changing DevOps landscape
DevSecCon Singapore 2019: Embracing Security - A changing DevOps landscapeDevSecCon
 

Tendances (20)

System design for Web Application
System design for Web ApplicationSystem design for Web Application
System design for Web Application
 
DevSecCon Singapore 2018 - Maginot Line – 6 Common AppSec Anti-Patterns Preve...
DevSecCon Singapore 2018 - Maginot Line – 6 Common AppSec Anti-Patterns Preve...DevSecCon Singapore 2018 - Maginot Line – 6 Common AppSec Anti-Patterns Preve...
DevSecCon Singapore 2018 - Maginot Line – 6 Common AppSec Anti-Patterns Preve...
 
Hexagonal symfony
Hexagonal symfonyHexagonal symfony
Hexagonal symfony
 
SureshSunkara
SureshSunkaraSureshSunkara
SureshSunkara
 
Getting Started With QA Automation
Getting Started With QA AutomationGetting Started With QA Automation
Getting Started With QA Automation
 
Priyanka Mehta
Priyanka MehtaPriyanka Mehta
Priyanka Mehta
 
Revolutionizing Enterprise Software Development through Continuous Delivery &...
Revolutionizing Enterprise Software Development through Continuous Delivery &...Revolutionizing Enterprise Software Development through Continuous Delivery &...
Revolutionizing Enterprise Software Development through Continuous Delivery &...
 
Thoughtful Software Design
Thoughtful Software DesignThoughtful Software Design
Thoughtful Software Design
 
Talent42 2017: Mei Lu - 1_Pager Software Developer Job Descriptions
Talent42 2017: Mei Lu - 1_Pager  Software Developer Job DescriptionsTalent42 2017: Mei Lu - 1_Pager  Software Developer Job Descriptions
Talent42 2017: Mei Lu - 1_Pager Software Developer Job Descriptions
 
DevOps Friendly Doc Publishing for APIs & Microservices
DevOps Friendly Doc Publishing for APIs & MicroservicesDevOps Friendly Doc Publishing for APIs & Microservices
DevOps Friendly Doc Publishing for APIs & Microservices
 
Introduction to Agile Software Development & Python
Introduction to Agile Software Development & PythonIntroduction to Agile Software Development & Python
Introduction to Agile Software Development & Python
 
Agile archiecture iltam 2014
Agile archiecture   iltam 2014Agile archiecture   iltam 2014
Agile archiecture iltam 2014
 
Matt carroll - "Security patching system packages is fun" said no-one ever
Matt carroll - "Security patching system packages is fun" said no-one everMatt carroll - "Security patching system packages is fun" said no-one ever
Matt carroll - "Security patching system packages is fun" said no-one ever
 
Bridging Concepts and Practice in eScience via Simulation-driven Engineering
Bridging Concepts and Practice in eScience via Simulation-driven EngineeringBridging Concepts and Practice in eScience via Simulation-driven Engineering
Bridging Concepts and Practice in eScience via Simulation-driven Engineering
 
Introcution to EJB
Introcution to EJBIntrocution to EJB
Introcution to EJB
 
Object Oriented Apologetics
Object Oriented ApologeticsObject Oriented Apologetics
Object Oriented Apologetics
 
Henrique Dantas - API fuzzing using Swagger
Henrique Dantas - API fuzzing using SwaggerHenrique Dantas - API fuzzing using Swagger
Henrique Dantas - API fuzzing using Swagger
 
Agile Languages for Rapid Prototyping
Agile Languages for Rapid PrototypingAgile Languages for Rapid Prototyping
Agile Languages for Rapid Prototyping
 
Darpangupta resume
Darpangupta resumeDarpangupta resume
Darpangupta resume
 
DevSecCon Singapore 2019: Embracing Security - A changing DevOps landscape
DevSecCon Singapore 2019: Embracing Security - A changing DevOps landscapeDevSecCon Singapore 2019: Embracing Security - A changing DevOps landscape
DevSecCon Singapore 2019: Embracing Security - A changing DevOps landscape
 

Similaire à Data Annotation Platform for ML

The Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdfThe Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdfData Science Council of America
 
Deep neural networks and tabular data
Deep neural networks and tabular dataDeep neural networks and tabular data
Deep neural networks and tabular dataJimmyLiang20
 
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...Precisely
 
التنقيب في البيانات - Data Mining
التنقيب في البيانات -  Data Miningالتنقيب في البيانات -  Data Mining
التنقيب في البيانات - Data Miningnabil_alsharafi
 
Building Data Ecosystems for Accelerated Discovery
Building Data Ecosystems for Accelerated DiscoveryBuilding Data Ecosystems for Accelerated Discovery
Building Data Ecosystems for Accelerated Discoveryadamkraut
 
DataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
DataOps - Big Data and AI World London - March 2020 - Harvinder AtwalDataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
DataOps - Big Data and AI World London - March 2020 - Harvinder AtwalHarvinder Atwal
 
Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
Qiagram
QiagramQiagram
Qiagramjwppz
 
Data Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use CaseData Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use CaseFormulatedby
 
Data Science Salon Miami Presentation
Data Science Salon Miami PresentationData Science Salon Miami Presentation
Data Science Salon Miami PresentationGreg Werner
 
Karen Lopez 10 Physical Data Modeling Blunders
Karen Lopez 10 Physical Data Modeling BlundersKaren Lopez 10 Physical Data Modeling Blunders
Karen Lopez 10 Physical Data Modeling BlundersKaren Lopez
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo
 
Storage Challenges for Production Machine Learning
Storage Challenges for Production Machine LearningStorage Challenges for Production Machine Learning
Storage Challenges for Production Machine LearningNisha Talagala
 
Data Collaboration Stack
Data Collaboration StackData Collaboration Stack
Data Collaboration StackPierre Brunelle
 
Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning CCG
 
Are You Prepared For The Future Of Data Technologies?
Are You Prepared For The Future Of Data Technologies?Are You Prepared For The Future Of Data Technologies?
Are You Prepared For The Future Of Data Technologies?Dell World
 

Similaire à Data Annotation Platform for ML (20)

The Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdfThe Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
 
Deep neural networks and tabular data
Deep neural networks and tabular dataDeep neural networks and tabular data
Deep neural networks and tabular data
 
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
 
التنقيب في البيانات - Data Mining
التنقيب في البيانات -  Data Miningالتنقيب في البيانات -  Data Mining
التنقيب في البيانات - Data Mining
 
Building Data Ecosystems for Accelerated Discovery
Building Data Ecosystems for Accelerated DiscoveryBuilding Data Ecosystems for Accelerated Discovery
Building Data Ecosystems for Accelerated Discovery
 
Data Annotation in the world of ML- FiveS Digital.pdf
Data Annotation in the world of ML- FiveS Digital.pdfData Annotation in the world of ML- FiveS Digital.pdf
Data Annotation in the world of ML- FiveS Digital.pdf
 
Joe C
Joe CJoe C
Joe C
 
DataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
DataOps - Big Data and AI World London - March 2020 - Harvinder AtwalDataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
DataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Qiagram
QiagramQiagram
Qiagram
 
Data Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use CaseData Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use Case
 
Data Science Salon Miami Presentation
Data Science Salon Miami PresentationData Science Salon Miami Presentation
Data Science Salon Miami Presentation
 
Karen Lopez 10 Physical Data Modeling Blunders
Karen Lopez 10 Physical Data Modeling BlundersKaren Lopez 10 Physical Data Modeling Blunders
Karen Lopez 10 Physical Data Modeling Blunders
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
 
Module 1
Module  1Module  1
Module 1
 
The coding portion of Data Science
The coding portion of Data ScienceThe coding portion of Data Science
The coding portion of Data Science
 
Storage Challenges for Production Machine Learning
Storage Challenges for Production Machine LearningStorage Challenges for Production Machine Learning
Storage Challenges for Production Machine Learning
 
Data Collaboration Stack
Data Collaboration StackData Collaboration Stack
Data Collaboration Stack
 
Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning
 
Are You Prepared For The Future Of Data Technologies?
Are You Prepared For The Future Of Data Technologies?Are You Prepared For The Future Of Data Technologies?
Are You Prepared For The Future Of Data Technologies?
 

Plus de All Things Open

Building Reliability - The Realities of Observability
Building Reliability - The Realities of ObservabilityBuilding Reliability - The Realities of Observability
Building Reliability - The Realities of ObservabilityAll Things Open
 
Modern Database Best Practices
Modern Database Best PracticesModern Database Best Practices
Modern Database Best PracticesAll Things Open
 
Open Source and Public Policy
Open Source and Public PolicyOpen Source and Public Policy
Open Source and Public PolicyAll Things Open
 
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...All Things Open
 
The State of Passwordless Auth on the Web - Phil Nash
The State of Passwordless Auth on the Web - Phil NashThe State of Passwordless Auth on the Web - Phil Nash
The State of Passwordless Auth on the Web - Phil NashAll Things Open
 
Total ReDoS: The dangers of regex in JavaScript
Total ReDoS: The dangers of regex in JavaScriptTotal ReDoS: The dangers of regex in JavaScript
Total ReDoS: The dangers of regex in JavaScriptAll Things Open
 
What Does Real World Mass Adoption of Decentralized Tech Look Like?
What Does Real World Mass Adoption of Decentralized Tech Look Like?What Does Real World Mass Adoption of Decentralized Tech Look Like?
What Does Real World Mass Adoption of Decentralized Tech Look Like?All Things Open
 
How to Write & Deploy a Smart Contract
How to Write & Deploy a Smart ContractHow to Write & Deploy a Smart Contract
How to Write & Deploy a Smart ContractAll Things Open
 
Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
 Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlowAll Things Open
 
DEI Challenges and Success
DEI Challenges and SuccessDEI Challenges and Success
DEI Challenges and SuccessAll Things Open
 
Scaling Web Applications with Background
Scaling Web Applications with BackgroundScaling Web Applications with Background
Scaling Web Applications with BackgroundAll Things Open
 
Supercharging tutorials with WebAssembly
Supercharging tutorials with WebAssemblySupercharging tutorials with WebAssembly
Supercharging tutorials with WebAssemblyAll Things Open
 
Using SQL to Find Needles in Haystacks
Using SQL to Find Needles in HaystacksUsing SQL to Find Needles in Haystacks
Using SQL to Find Needles in HaystacksAll Things Open
 
Configuration Security as a Game of Pursuit Intercept
Configuration Security as a Game of Pursuit InterceptConfiguration Security as a Game of Pursuit Intercept
Configuration Security as a Game of Pursuit InterceptAll Things Open
 
Scaling an Open Source Sponsorship Program
Scaling an Open Source Sponsorship ProgramScaling an Open Source Sponsorship Program
Scaling an Open Source Sponsorship ProgramAll Things Open
 
Build Developer Experience Teams for Open Source
Build Developer Experience Teams for Open SourceBuild Developer Experience Teams for Open Source
Build Developer Experience Teams for Open SourceAll Things Open
 
Deploying Models at Scale with Apache Beam
Deploying Models at Scale with Apache BeamDeploying Models at Scale with Apache Beam
Deploying Models at Scale with Apache BeamAll Things Open
 
Sudo – Giving access while staying in control
Sudo – Giving access while staying in controlSudo – Giving access while staying in control
Sudo – Giving access while staying in controlAll Things Open
 
Fortifying the Future: Tackling Security Challenges in AI/ML Applications
Fortifying the Future: Tackling Security Challenges in AI/ML ApplicationsFortifying the Future: Tackling Security Challenges in AI/ML Applications
Fortifying the Future: Tackling Security Challenges in AI/ML ApplicationsAll Things Open
 
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...All Things Open
 

Plus de All Things Open (20)

Building Reliability - The Realities of Observability
Building Reliability - The Realities of ObservabilityBuilding Reliability - The Realities of Observability
Building Reliability - The Realities of Observability
 
Modern Database Best Practices
Modern Database Best PracticesModern Database Best Practices
Modern Database Best Practices
 
Open Source and Public Policy
Open Source and Public PolicyOpen Source and Public Policy
Open Source and Public Policy
 
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...
 
The State of Passwordless Auth on the Web - Phil Nash
The State of Passwordless Auth on the Web - Phil NashThe State of Passwordless Auth on the Web - Phil Nash
The State of Passwordless Auth on the Web - Phil Nash
 
Total ReDoS: The dangers of regex in JavaScript
Total ReDoS: The dangers of regex in JavaScriptTotal ReDoS: The dangers of regex in JavaScript
Total ReDoS: The dangers of regex in JavaScript
 
What Does Real World Mass Adoption of Decentralized Tech Look Like?
What Does Real World Mass Adoption of Decentralized Tech Look Like?What Does Real World Mass Adoption of Decentralized Tech Look Like?
What Does Real World Mass Adoption of Decentralized Tech Look Like?
 
How to Write & Deploy a Smart Contract
How to Write & Deploy a Smart ContractHow to Write & Deploy a Smart Contract
How to Write & Deploy a Smart Contract
 
Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
 Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
 
DEI Challenges and Success
DEI Challenges and SuccessDEI Challenges and Success
DEI Challenges and Success
 
Scaling Web Applications with Background
Scaling Web Applications with BackgroundScaling Web Applications with Background
Scaling Web Applications with Background
 
Supercharging tutorials with WebAssembly
Supercharging tutorials with WebAssemblySupercharging tutorials with WebAssembly
Supercharging tutorials with WebAssembly
 
Using SQL to Find Needles in Haystacks
Using SQL to Find Needles in HaystacksUsing SQL to Find Needles in Haystacks
Using SQL to Find Needles in Haystacks
 
Configuration Security as a Game of Pursuit Intercept
Configuration Security as a Game of Pursuit InterceptConfiguration Security as a Game of Pursuit Intercept
Configuration Security as a Game of Pursuit Intercept
 
Scaling an Open Source Sponsorship Program
Scaling an Open Source Sponsorship ProgramScaling an Open Source Sponsorship Program
Scaling an Open Source Sponsorship Program
 
Build Developer Experience Teams for Open Source
Build Developer Experience Teams for Open SourceBuild Developer Experience Teams for Open Source
Build Developer Experience Teams for Open Source
 
Deploying Models at Scale with Apache Beam
Deploying Models at Scale with Apache BeamDeploying Models at Scale with Apache Beam
Deploying Models at Scale with Apache Beam
 
Sudo – Giving access while staying in control
Sudo – Giving access while staying in controlSudo – Giving access while staying in control
Sudo – Giving access while staying in control
 
Fortifying the Future: Tackling Security Challenges in AI/ML Applications
Fortifying the Future: Tackling Security Challenges in AI/ML ApplicationsFortifying the Future: Tackling Security Challenges in AI/ML Applications
Fortifying the Future: Tackling Security Challenges in AI/ML Applications
 
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...
 

Dernier

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 

Dernier (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

Data Annotation Platform for ML

  • 1. Data Annotation Platform for Machine Learning NLP,CV, Tabular,NERandLogdata Julia Li, Steve Liang VMware Oct 19, 2021
  • 2. Agenda 2 Data annotation is critical for ML Options to annotate your data Crowdsourcing vs. In-house data labeling Data annotation challenges DAML - Data Annotator for Machine Learning Functionality Demo Q&A
  • 3. 3 Data Annotation is critical for Machine Learning Garbage in garbage out Source: https://i.pinimg.com/originals/b6/5f/04/b65f044e766a0f2fe8ad531eeee8e6a0.gif
  • 4. 4  Hand-label the data  Humans are given guidelines to perform a labeling task over large datasets  Weak-supervision (Snorkel)  Write functions based on heuristics to generate probabilistic labels  Semi-supervised learning  Lots of data, little labels  Train a supervised model and predict for unlabeled data to generate pseudo-labels  Self-supervised learning  To make use of large amounts of unlabeled data, the model depends on the underlying structure of data to predict outcomes  Most common: language models in NLP (BERT) Spoiler Alert: Hand-labeling is still the most common and effective method Methods for Annotating Data
  • 5. 5  Global network of Independent contractors • Cheap & lower quality: Mturk, hive, etc.  Managed data labeling services • Expensive & higher quality: Scale.ai  Support most common ML labelling tasks  Pay as you go  Lack domain specific expertise required  Difficult to meet data privacy concerns  Recurring expense Why annotation services are popular Where annotation services fall short Acquiring Hand-Labeled Data Crowdsourcing vs. In-house data labeling
  • 6. 6 Data annotation is the most challenging limitations to ML adoption​ Data Annotation is Challenging for Enterprise ML Optimization ML Experimentation *Acquiring, Processing, Annotating Data WHAT ML ENGINEERS ACTUALLY DO *2019 Cognilytica report • Lack good data & tools • Using spreadsheets to collaborate & label data • Difficult to manage • Not a continuous process The Challenges
  • 7. 7 A collaborative data annotation platform for efficient data labeling at scale DAML Makes Data Annotation Easy The best ML solutions have humans in the loop • Collaborate on data annotation • Iterate quickly on data labelling and validation • Reduce labeling effort using active learning • Ensure data annotation is a continuous part of improving ML
  • 8. 8 Support text, image, tabular, NER and log data DAML Key Features Annotator Network Data Discovery ML Assistance End-to-end Data Management Connect upstream feed and extract to downstream tasks Everyone instantly becomes a potential annotator Share across organizations and build up quality data catalog Active learning identifies the most uncertain data
  • 9. 9 DAML Supported Use Cases Text NER Classification Question Answering Image Classification Object Detection Tabular Classification and Regression Logs Event Classification
  • 10. 10 Runs on any cloud or on-prem DAML Architecture DAML UI (Angular + Clarity) DAML REST API (Node.js + Express) Data Storage Calculate accuracy Measure uncertainty Query annotators Train Model Active Learning Service (Django + modAL)
  • 11. 11 Iterate on a simple prescriptive process DAML Workflow Data Upload Text, Tabular, Images, NER, Logs Data Ingestion Create Project Assign Annotators Data Labeling Label Preparation Model Training Model Track Progress Resolve Disagreement ML Pipeline Active Learning
  • 13. 13  Apache 2.0 license  Easy to install & runs anywhere • $docker compose up  Collaborative and ML powered • Collaborate across organizations to label large datasets • Active Learning to help significantly reduce the workload on the human labelers  Try out DAML • OSS Project: https://github.com/vmware/data-annotator-for-machine-learning.git • Blogs: https://medium.com/vmware-data-ml-blog DAML makes data annotation easy Summary

Notes de l'éditeur

  1. https://www.cognilytica.com/document/report-data-engineering-preparation-and-labeling-for-ai-2019/