Data transformation often requires users to write many trivial, task-dependent programs to transform thousands of records. Recently, programming-by-example (PBE) approaches have enabled users to transform data without coding. A key challenge for these PBE approaches is delivering correctly transformed results on large datasets, since the transformation programs are generated from examples provided by non-expert users. To address this challenge, existing approaches aim to identify a small set of potentially incorrect records and ask users to examine those records instead of the entire dataset. However, because transformation scenarios are highly task-dependent, existing approaches cannot capture the incorrect records across diverse scenarios. We present an approach that learns from past transformation scenarios to build a meta-classifier that identifies incorrect records. Our approach color-codes the transformed records and presents them for users to examine. Users can either enter an example for a record transformed incorrectly or confirm the correctness of a transformed record. Our approach then learns from these labels to refine the meta-classifier and identify incorrect records more accurately. Simulation results and a user study show that our method identifies incorrectly transformed records and reduces the user effort required to examine the results.
1. Maximizing Correctness with Minimal User Effort to Learn Data Transformations
Bo Wu and Craig Knoblock
University of Southern California
Department of Computer Science
8. Problem
Enable the users of PBE systems to achieve maximal correctness with minimal effort on large datasets.
Help users identify at least one incorrect record in every iteration with minimal effort on large datasets.
9. Approach Overview
Entire dataset (Raw → Transformed):
  10“ H x 8” W → 10
  H: 58 x W:25” → 58
  12”H x 9”W → 12
  11”H x 6” → 11
  …
  30 x 46” → 30 x 46

↓ Random sampling

Sampled records (Raw → Transformed):
  10“ H x 8” W → 10
  11”H x 6” → 11
  …
  30 x 46” → 30 x 46

↓ Verifying records

Raw → Transformed:
  11”H x 6” → 11
  30 x 46” → 30 x 46
  …

↓ Sorting and color-coding

Raw → Transformed:
  30 x 46” → 30 x 46
  11”H x 6” → 11
  …
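The random-sampling step above can be sketched as follows. The function name `sample_records` and the (raw, transformed) record layout are illustrative, not taken from the paper's implementation:

```python
import random

def sample_records(records, k, seed=0):
    """Uniformly sample up to k records for the user to verify.

    `records` is a list of (raw, transformed) pairs; the fixed seed
    makes the sample reproducible across runs.
    """
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))

# Toy dataset mirroring the slide's Raw → Transformed examples.
dataset = [('10" H x 8" W', "10"), ('H: 58 x W:25"', "58"),
           ('12"H x 9"W', "12"), ('11"H x 6"', "11"),
           ('30 x 46"', "30 x 46")]
sampled = sample_records(dataset, 3)
```

Sampling uniformly keeps the verification set small while remaining representative of the whole dataset.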
11. Verifying Records
• First, recommend records causing runtime errors
  – Records that cause the program to exit abnormally
  – Example input: 2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic
• Second, recommend potentially incorrect records
  – Learn a binary meta-classifier

Raw → Transformed:
  11”H x 6” → 11
  30 x 46” → 30 x 46
  …
16. Summary and Future Work
• Summary
– Sample records
– Identify incorrect/questionable records
– Allow user to refine the recommendation
– Color-code the results
• Future work
– Show histograms of the data
– Translate the program into readable natural language
18. Type of Classifiers
• Classifier based on distance
• Classifier based on agreement of programs
• Classifier based on format ambiguity
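Sketches of the three base-classifier signals are shown below. The function names, the length-based distance proxy, and the regex-based format patterns are all illustrative assumptions, not the paper's actual features:

```python
import re
from collections import Counter

def distance_score(value, cluster_values):
    """Distance-based signal: transformed values far from the typical
    output are suspicious. Uses a toy length-difference proxy instead
    of a real distance metric."""
    avg_len = sum(len(v) for v in cluster_values) / len(cluster_values)
    return abs(len(value) - avg_len)

def agreement_score(record, programs):
    """Program-agreement signal: fraction of candidate programs whose
    output on this record disagrees with the majority output."""
    outputs = [p(record) for p in programs]
    _, majority_count = Counter(outputs).most_common(1)[0]
    return 1 - majority_count / len(outputs)

def ambiguity_score(raw, format_patterns):
    """Format-ambiguity signal: a raw value matching several input
    formats can be parsed in more than one way."""
    return sum(1 for pat in format_patterns if re.search(pat, raw))
```

Each signal maps a record to a score, so the three can later be combined into one feature vector for the meta-classifier.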
19. Learning from Various Past Results

Past scenario 1 (Raw → Transformed):
  26" H x 24" W x 12.5 → 26
  Framed at 21.75" H x 24.25” W → 21
  12" H x 9" → 12
  …

Past scenario 2 (Raw → Transformed):
  Ravage 2099#24 (November, 1994) → November, 1994
  Gambit III#1 (September, 1997) → September, 1997
  (comic) Spidey Super Stories#12/2 (September, 1975) → comic
  …

The records in each past result are labeled as examples, incorrect records, or correct records.
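Training from such labeled past results can be sketched as a small logistic-regression meta-classifier over the base-classifier scores. This is a minimal stand-in under assumed feature vectors (distance, disagreement, ambiguity), not the paper's actual learner:

```python
import math

def train_meta_classifier(examples, epochs=200, lr=0.5):
    """Fit a tiny logistic-regression meta-classifier over base-classifier
    scores. `examples` is a list of (features, label) pairs, where
    label is 1 for an incorrectly transformed record."""
    n_features = len(examples[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(incorrect)
            g = p - y                        # gradient of the log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict_incorrect(model, x):
    """Probability that a record with feature vector x is incorrect."""
    w, b = model
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy labeled past results: (distance, disagreement, ambiguity) features.
past = [((0.9, 0.8, 2.0), 1), ((0.8, 0.9, 1.0), 1),
        ((0.1, 0.0, 0.0), 0), ((0.2, 0.1, 0.0), 0)]
model = train_meta_classifier(past)
```

Because the model is trained on past scenarios, it can flag likely-incorrect records in a new scenario before the user has labeled anything; user feedback then refines it.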
20. Sorting Records
• If a record causes runtime errors: rank it by #failed_subprograms
• Otherwise: rank it by the meta-classifier output
• The user then checks the transformed records in that order

Record → #failed_subprograms:
  2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic → 3
  1998 Honda Civic 12k miles s. Auto. - $3800 (Arcadia) → 2
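The ranking logic above can be sketched as follows; the dictionary fields and the `score` callback are illustrative names, not from the paper's implementation:

```python
def rank_records(records, score):
    """Order records for user verification: records that raise runtime
    errors come first, ranked by how many subprograms failed on them;
    the remaining records are ranked by the meta-classifier's
    incorrectness score."""
    errors = [r for r in records if r["failed_subprograms"] > 0]
    others = [r for r in records if r["failed_subprograms"] == 0]
    errors.sort(key=lambda r: r["failed_subprograms"], reverse=True)
    others.sort(key=score, reverse=True)
    return errors + others

records = [
    {"raw": "1998 Honda Civic 12k miles s. Auto. - $3800 (Arcadia)",
     "failed_subprograms": 2, "meta_score": 0.0},
    {"raw": '11"H x 6"', "failed_subprograms": 0, "meta_score": 0.9},
    {"raw": "2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic",
     "failed_subprograms": 3, "meta_score": 0.0},
]
ranked = rank_records(records, score=lambda r: r["meta_score"])
```

Runtime-error records are surfaced first because they are certainly wrong, while the meta-classifier ordering puts the most suspicious of the remaining records at the top.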
Editor's Notes
Ashley wants to buy a painting for the space over her sofa.
She has strict space limits, e.g., the painting should be about 60’’ wide and 40’’ high.
Ashley got a spreadsheet of artworks on sale.
The size information she got is a long list of entries, each combining the height, width, and sometimes depth.
She has to split each entry into three columns and remove extra text such as “H:”, “in.”, etc., so she can filter the artworks on each dimension.
The dataset has so many records that she would need to write a program to solve the problem.
Problem: learning to program has a long learning curve, and that time could be spent decorating her house instead.
Programming by example no longer requires users to write code.
The list can have thousands of records.
It is really hard to notice records in the middle that are transformed incorrectly.
According to previous research, users often believe that they have carefully examined all the records.
They stop checking the results while a large percentage of incorrect records still remains in the dataset.
To identify the incorrect records, the system cannot rely on a single rule or classifier.
Random sampling addresses the problem of having too many records.
Verifying records can capture incorrect records across various scenarios.
Sorting and color-coding address the overconfident-user problem.
The system also learns from the user's interactions in the current iteration, using this feedback to refine the recommendation.
First, describe correctness.
Second, iteration time.
Third, total time; explain why certain scenarios have a longer total time.
Why does beta take twice the iteration time of our approach in scenarios S3 and S5?
Why does the iteration time of beta vary much more than the iteration time of our approach?