Data transformation often requires users to write many trivial, task-dependent programs to transform thousands of records. Recently, programming-by-example (PBE) approaches have enabled users to transform data without coding. A key challenge for these PBE approaches is delivering correctly transformed results on large datasets, since the transformation programs are generated from examples provided by non-expert users. To address this challenge, existing approaches aim to identify a small set of potentially incorrect records and ask users to examine those records instead of the entire dataset. However, because transformation scenarios are highly task-dependent, existing approaches cannot capture the incorrect records across diverse scenarios. We present an approach that learns from past transformation scenarios to build a meta-classifier that identifies incorrect records. Our approach color-codes the transformed records and presents them for users to examine. Users can either enter an example for a record transformed incorrectly or confirm the correctness of a transformed record. Our approach then learns from these labels to refine the meta-classifier and identify incorrect records more accurately. Simulation results and a user study show that our method identifies incorrectly transformed records and reduces the user effort required to examine the results.
1. Maximizing Correctness with Minimal User Effort to Learn Data Transformations
Bo Wu and Craig Knoblock
University of Southern California
Department of Computer Science
8. Problem
Enable the users of PBE systems to achieve maximal correctness with minimal effort on large datasets.
Help users identify at least one incorrect record in every iteration with minimal effort on large datasets.
9. Approach Overview
Entire dataset (Raw → Transformed):
  10“ H x 8” W → 10
  H: 58 x W:25” → 58
  12”H x 9”W → 12
  11”H x 6” → 11
  …
  30 x 46” → 30 x 46

↓ Random sampling

Sampled records (Raw → Transformed):
  10“ H x 8” W → 10
  11”H x 6” → 11
  …
  30 x 46” → 30 x 46

↓ Verifying records

Raw → Transformed:
  11”H x 6” → 11
  30 x 46” → 30 x 46
  …

↓ Sorting and color-coding

Raw → Transformed:
  30 x 46” → 30 x 46
  11”H x 6” → 11
  …
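The random-sampling step above can be sketched as follows. The function name `sample_records` and the (raw, transformed) record layout are illustrative, not taken from the paper's implementation:

```python
import random

def sample_records(records, k, seed=0):
    """Uniformly sample up to k records for the user to verify.

    `records` is a list of (raw, transformed) pairs; the fixed seed
    makes the sample reproducible across runs.
    """
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))

# Toy dataset mirroring the slide's Raw → Transformed examples.
dataset = [('10" H x 8" W', "10"), ('H: 58 x W:25"', "58"),
           ('12"H x 9"W', "12"), ('11"H x 6"', "11"),
           ('30 x 46"', "30 x 46")]
sampled = sample_records(dataset, 3)
```

Sampling uniformly keeps the verification set small while remaining representative of the whole dataset.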
11. Verifying Records
• First, recommend records causing runtime errors
  – Records that cause the program to exit abnormally
  – Example input: 2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic
• Second, recommend potentially incorrect records
  – Learn a binary meta-classifier

Raw → Transformed:
  11”H x 6” → 11
  30 x 46” → 30 x 46
  …
16. Summary and Future Work
• Summary
– Sample records
– Identify incorrect/questionable records
– Allow user to refine the recommendation
– Color-code the results
• Future work
– Show histograms of the data
– Translate the program into readable natural language
18. Type of Classifiers
• Classifier based on distance
• Classifier based on agreement of programs
• Classifier based on format ambiguity
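Sketches of the three base-classifier signals are shown below. The function names, the length-based distance proxy, and the regex-based format patterns are all illustrative assumptions, not the paper's actual features:

```python
import re
from collections import Counter

def distance_score(value, cluster_values):
    """Distance-based signal: transformed values far from the typical
    output are suspicious. Uses a toy length-difference proxy instead
    of a real distance metric."""
    avg_len = sum(len(v) for v in cluster_values) / len(cluster_values)
    return abs(len(value) - avg_len)

def agreement_score(record, programs):
    """Program-agreement signal: fraction of candidate programs whose
    output on this record disagrees with the majority output."""
    outputs = [p(record) for p in programs]
    _, majority_count = Counter(outputs).most_common(1)[0]
    return 1 - majority_count / len(outputs)

def ambiguity_score(raw, format_patterns):
    """Format-ambiguity signal: a raw value matching several input
    formats can be parsed in more than one way."""
    return sum(1 for pat in format_patterns if re.search(pat, raw))
```

Each signal maps a record to a score, so the three can later be combined into one feature vector for the meta-classifier.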
19. Learning from Various Past Results

Past scenario 1 (Raw → Transformed):
  26" H x 24" W x 12.5 → 26
  Framed at 21.75" H x 24.25” W → 21
  12" H x 9" → 12
  …

Past scenario 2 (Raw → Transformed):
  Ravage 2099#24 (November, 1994) → November, 1994
  Gambit III#1 (September, 1997) → September, 1997
  (comic) Spidey Super Stories#12/2 (September, 1975) → comic
  …

The records in each past result are labeled as examples, incorrect records, or correct records.
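Training from such labeled past results can be sketched as a small logistic-regression meta-classifier over the base-classifier scores. This is a minimal stand-in under assumed feature vectors (distance, disagreement, ambiguity), not the paper's actual learner:

```python
import math

def train_meta_classifier(examples, epochs=200, lr=0.5):
    """Fit a tiny logistic-regression meta-classifier over base-classifier
    scores. `examples` is a list of (features, label) pairs, where
    label is 1 for an incorrectly transformed record."""
    n_features = len(examples[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(incorrect)
            g = p - y                        # gradient of the log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict_incorrect(model, x):
    """Probability that a record with feature vector x is incorrect."""
    w, b = model
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy labeled past results: (distance, disagreement, ambiguity) features.
past = [((0.9, 0.8, 2.0), 1), ((0.8, 0.9, 1.0), 1),
        ((0.1, 0.0, 0.0), 0), ((0.2, 0.1, 0.0), 0)]
model = train_meta_classifier(past)
```

Because the model is trained on past scenarios, it can flag likely-incorrect records in a new scenario before the user has labeled anything; user feedback then refines it.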
20. Sorting Records
• If a record causes runtime errors: rank it by #failed_subprograms
• Otherwise: rank it by the meta-classifier output
• The user then checks the transformed records in that order

Record → #failed_subprograms:
  2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic → 3
  1998 Honda Civic 12k miles s. Auto. - $3800 (Arcadia) → 2
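The ranking logic above can be sketched as follows; the dictionary fields and the `score` callback are illustrative names, not from the paper's implementation:

```python
def rank_records(records, score):
    """Order records for user verification: records that raise runtime
    errors come first, ranked by how many subprograms failed on them;
    the remaining records are ranked by the meta-classifier's
    incorrectness score."""
    errors = [r for r in records if r["failed_subprograms"] > 0]
    others = [r for r in records if r["failed_subprograms"] == 0]
    errors.sort(key=lambda r: r["failed_subprograms"], reverse=True)
    others.sort(key=score, reverse=True)
    return errors + others

records = [
    {"raw": "1998 Honda Civic 12k miles s. Auto. - $3800 (Arcadia)",
     "failed_subprograms": 2, "meta_score": 0.0},
    {"raw": '11"H x 6"', "failed_subprograms": 0, "meta_score": 0.9},
    {"raw": "2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic",
     "failed_subprograms": 3, "meta_score": 0.0},
]
ranked = rank_records(records, score=lambda r: r["meta_score"])
```

Runtime-error records are surfaced first because they are certainly wrong, while the meta-classifier ordering puts the most suspicious of the remaining records at the top.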
Editor's Notes
Ashley wants to buy a painting for the space over her sofa.
She has strict space limits, e.g., the painting should be about 60’’ wide and 40’’ high.
Ashley got a spreadsheet of artworks on sale.
The size information she got is a long list of entries, each combining the height, width, and sometimes depth.
She has to split each entry into three columns and remove extra text such as “H:”, “in.”, etc., so she can filter the artworks on each dimension.
The dataset has so many records that she would need to write a program to solve the problem.
Problem: learning to program has a long learning curve, and that time could be spent decorating her house instead.
Programming by example no longer requires users to write code.
The list can have thousands of records.
It is really hard to notice records in the middle that are transformed incorrectly.
According to previous research, users often believe that they have carefully examined all the records.
They stop checking the results while a large percentage of incorrect records still remains in the dataset.
To identify the incorrect records, the system cannot rely on a single rule or classifier.
Random sampling addresses the problem of having too many records.
Verifying records can capture incorrect records across various scenarios.
Sorting and color-coding address the overconfident-user problem.
The system also learns from the user's interactions in the current iteration, using this feedback to refine the recommendation.
First, describe correctness.
Second, iteration time.
Third, total time; explain why certain scenarios have a longer total time.
Why does beta take twice the iteration time of our approach in scenarios S3 and S5?
Why does the iteration time of beta vary much more than the iteration time of our approach?