This document provides an overview of data annotation and introduces the Data Annotator for Machine Learning (DAML) platform. Data annotation is critical for machine learning but hard to do well. The methods covered for annotating data are hand-labeling, weak supervision, semi-supervised learning, and self-supervised learning, with hand-labeling presented as still the most common and effective. The document compares annotation services (crowdsourcing and managed labeling) with in-house labeling, describes the challenges enterprises face, and shows how DAML aims to make the process easier by letting collaborators label data at scale, iterate quickly on labels, and reduce effort through active learning. Key features of DAML include support for text, image, tabular, NER, and log data, connections to upstream and downstream data workflows, and the ability to run on any cloud or on-premises.
1. Data Annotation Platform for Machine Learning
NLP, CV, Tabular, NER and Log data
Julia Li, Steve Liang
VMware
Oct 19, 2021
2. Agenda
Data annotation is critical for ML
Options to annotate your data
Crowdsourcing vs. In-house data labeling
Data annotation challenges
DAML - Data Annotator for Machine Learning
Functionality
Demo
Q&A
3. 3
Data Annotation is critical for Machine Learning
Garbage in, garbage out
Source: https://i.pinimg.com/originals/b6/5f/04/b65f044e766a0f2fe8ad531eeee8e6a0.gif
4. 4
Hand-label the data
Humans are given guidelines to perform a labeling task over large datasets
Weak supervision (Snorkel)
Write functions based on heuristics to generate probabilistic labels
Semi-supervised learning
Lots of data, few labels
Train a supervised model and predict for unlabeled data to generate pseudo-labels
Self-supervised learning
To make use of large amounts of unlabeled data, the model relies on the underlying structure of the data to predict outcomes
Most common: language models in NLP (BERT)
Spoiler Alert: Hand-labeling is still the most common and effective method
Methods for Annotating Data
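The weak-supervision idea on this slide (heuristic functions that vote on a label) can be sketched in a few lines of plain Python. This is a minimal illustration, not Snorkel's API: the labeling functions, the label names, and `probabilistic_label` are all hypothetical, and the simple vote share used here stands in for the learned label model that a real weak-supervision framework would fit.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when its heuristic does not apply


# Hypothetical heuristics for a toy "complaint" vs. "praise" task.
def lf_mentions_refund(text):
    return "complaint" if "refund" in text.lower() else ABSTAIN


def lf_says_thanks(text):
    return "praise" if "thank" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    return "complaint" if text.count("!") >= 2 else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_refund, lf_says_thanks, lf_many_exclamations]


def probabilistic_label(text, labeling_functions=LABELING_FUNCTIONS):
    """Turn the non-abstaining votes into a label -> probability mapping.

    Assumption: every labeling function is weighted equally. An empty dict
    means that no heuristic fired and the item stays unlabeled.
    """
    votes = [lf(text) for lf in labeling_functions]
    votes = [vote for vote in votes if vote is not ABSTAIN]
    if not votes:
        return {}
    counts = Counter(votes)
    return {label: count / len(votes) for label, count in counts.items()}


if __name__ == "__main__":
    print(probabilistic_label("I want a refund!! Thank you"))
```

Items where the heuristics disagree or abstain are natural candidates to route to human annotators.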
5. 5
Global network of independent contractors
• Cheap & lower quality: MTurk, Hive, etc.
Managed data labeling services
• Expensive & higher quality: Scale.ai
Why annotation services are popular
• Support most common ML labeling tasks
• Pay as you go
Where annotation services fall short
• Lack the required domain-specific expertise
• Difficult to meet data privacy concerns
• Recurring expense
Acquiring Hand-Labeled Data
Crowdsourcing vs. In-house data labeling
6. 6
Data annotation is the most challenging limitation to ML adoption
Data Annotation is Challenging for Enterprise
What ML engineers actually do (figure): ML optimization, ML experimentation, and acquiring, processing, and annotating data*
*2019 Cognilytica report
• Lack good data & tools
• Using spreadsheets to collaborate & label data
• Difficult to manage
• Not a continuous process
The Challenges
7. 7
A collaborative data annotation platform for efficient data labeling at scale
DAML Makes Data Annotation Easy
The best ML solutions have humans in the loop
• Collaborate on data annotation
• Iterate quickly on data labeling and validation
• Reduce labeling effort using active learning
• Ensure data annotation is a continuous part of improving ML
8. 8
Support text, image, tabular, NER and log data
DAML Key Features
• Annotator Network: Everyone instantly becomes a potential annotator
• Data Discovery: Share across organizations and build up a quality data catalog
• ML Assistance: Active learning identifies the most uncertain data
• End-to-end Data Management: Connect upstream feed and extract to downstream tasks
10. 10
Runs on any cloud or on-prem
DAML Architecture
DAML UI (Angular + Clarity)
DAML REST API (Node.js + Express)
Data Storage
Active Learning Service (Django + modAL)
• Calculate accuracy
• Measure uncertainty
• Query annotators
• Train model
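The active learning steps named on this slide (measure uncertainty, query annotators, calculate accuracy) can be sketched without any framework. The code below is plain Python and does not use Django or modAL; the function names are hypothetical, and least-confidence scoring is an assumption, since the slide does not say which uncertainty measure DAML applies.

```python
def least_confidence(probabilities):
    """Uncertainty of one prediction: 1 minus the highest class probability."""
    return 1.0 - max(probabilities)


def select_for_annotation(predictions, batch_size):
    """Return the ids of the `batch_size` most uncertain items.

    `predictions` maps an item id to the model's list of class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda item_id: least_confidence(predictions[item_id]),
        reverse=True,  # most uncertain first
    )
    return ranked[:batch_size]


def accuracy(predicted, actual):
    """Fraction of predictions that match the human-provided labels."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


if __name__ == "__main__":
    # Toy model outputs for three unlabeled items (made-up numbers).
    predictions = {"a": [0.9, 0.1], "b": [0.55, 0.45], "c": [0.6, 0.4]}
    print(select_for_annotation(predictions, batch_size=2))
```

In a full loop, the selected items would be sent to annotators, the model retrained on the new labels, and accuracy recomputed before the next round.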
11. 11
Iterate on a simple prescriptive process
DAML Workflow
Data Upload (Text, Tabular, Images, NER, Logs)
Data Ingestion
Create Project
Assign Annotators
Data Labeling
Label Preparation
Model Training
Model
Track Progress
Resolve Disagreement
ML Pipeline
Active Learning
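One simple way to picture the "Resolve Disagreement" step is a strict-majority vote over the labels that different annotators gave the same item. This is a hypothetical sketch: the slide does not describe how DAML resolves disagreement, and `resolve_label` and its return shape are assumptions made for illustration.

```python
from collections import Counter


def resolve_label(annotations):
    """Resolve one item's labels by strict majority vote.

    Returns (label, agreement), where agreement is the share of annotators
    who chose the most frequent label. The label is None when no label has
    more than half of the votes, which marks the item for manual review.
    """
    if not annotations:
        raise ValueError("at least one annotation is required")
    top_label, top_count = Counter(annotations).most_common(1)[0]
    agreement = top_count / len(annotations)
    if agreement > 0.5:
        return top_label, agreement
    return None, agreement


if __name__ == "__main__":
    print(resolve_label(["spam", "spam", "ham"]))
    print(resolve_label(["spam", "ham"]))
```

The agreement score can also feed progress tracking, for example by listing the items with the lowest agreement first.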
13. 13
Apache 2.0 license
Easy to install & runs anywhere
• $ docker compose up
Collaborative and ML powered
• Collaborate across organizations to label large datasets
• Active Learning to help significantly reduce the workload on the human labelers
Try out DAML
• OSS Project: https://github.com/vmware/data-annotator-for-machine-learning.git
• Blogs: https://medium.com/vmware-data-ml-blog
DAML makes data annotation easy
Summary