This is the deck for the Science Advisory Board review of our recent progress in building basic infrastructure: a hybrid system architecture for automatic question answering in Project Halo, Vulcan's long-range strong-AI effort to attack a key problem in AI research.
3. Hybrid System Near Term Goals
• Set up the infrastructure to communicate with existing reasoners
• Reliably dispatch questions and collect answers
• Create related tools and resources
o Question generation/selection, answer evaluation, report analysis, etc.
• Experiment with ways to choose answers from the available reasoners – as a hybrid solver
[Diagram: the Dispatcher connecting to the AURA, CYC, and TEQA reasoners.]
4. Focus Areas of Hybrid Framework (until mid-2013)
Modularity
• Loose coupling, high cohesion, data exchange protocols
Dispatching
• Send requests and handle the responses
Evaluation
• Ability to get ratings on answers and report results
5. Hybrid System Core Components
[Diagram: the Dispatcher routing a filtered set of questions (Campbell Chapter 7) to AURA (DirectQA, Find-A-Value), CYC, TEQA, and possibly IR; new or updated components outlined in yellow.]
Legend:
• SQs: suggested questions
• SQA: QA with suggested questions
• TEQA: Textual Entailment QA
• IR: Information Retrieval
7. Dispatcher Features
• Asynchronous batch mode and single/experiment mode
• Parallel dispatching to reasoners (dispatch/retry sketch below)
o Full-featured UI: live progress indicator, question-file view, logs
o Exception and error handling
• Retries a question when the server is busy
• Batch service continues to completion even if the client dies
o Canceling/stopping the batch process is also available
• Input and output support both XML and CSV/TSV formats
o Pipeline support: accepts Question-Selector input
• Configurable dispatchers: select which reasoners to use
o Collect answers and compute basic statistics
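The retry and parallel-dispatch behavior above can be pictured with a short sketch. This is a minimal illustration, not the deployed service: the reasoner client call (`reasoner.answer`), `ServerBusyError`, and the question record shape are assumed names.

```python
import time
from concurrent.futures import ThreadPoolExecutor

class ServerBusyError(Exception):
    """Raised when a reasoner reports it cannot accept work yet."""

def ask_reasoner(reasoner, question, retries=3, backoff=5.0):
    """Send one question to one reasoner, retrying while the server is busy."""
    for attempt in range(retries):
        try:
            return reasoner.answer(question)  # hypothetical client call
        except ServerBusyError:
            time.sleep(backoff * (attempt + 1))
    return None  # recorded as "no answer" in the batch results

def dispatch_batch(questions, reasoners, max_workers=4):
    """Fan every question out to every configured reasoner in parallel."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(ask_reasoner, r, q): (q["id"], r.name)
            for q in questions for r in reasoners
        }
        for future, key in futures.items():
            results[key] = future.result()  # collect per (question, reasoner)
    return results
```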
8. Question-Answering via Suggested Questions
• Similar features to Live/Direct QA
• Aggregates suggested questions’ answers as a solver (aggregation sketch below)
• Unique features:
o Interactively browse the suggested-questions database
o Filter on certain facets
o Use Q/A concepts, question types, etc. to improve relevance
o Automatic comparison of filtered and non-filtered results by chapter
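A minimal sketch of the filter-then-aggregate idea behind the SQA solver, assuming each suggested-question record is a dict with `concepts`, `qtype`, and `answer` fields; the schema and the overlap-based ranking are illustrative assumptions, not the actual database API.

```python
def filter_suggested(suggested, question_concepts, qtype=None):
    """Keep suggested questions sharing concepts (and optionally a type)."""
    hits = [
        sq for sq in suggested
        if set(sq["concepts"]) & set(question_concepts)
        and (qtype is None or sq["qtype"] == qtype)
    ]
    # Rank by concept overlap so the most relevant suggestion comes first.
    hits.sort(key=lambda sq: len(set(sq["concepts"]) & set(question_concepts)),
              reverse=True)
    return hits

def sqa_answer(suggested, question_concepts, qtype=None):
    """Aggregate: answer with the top filtered suggested question's answer."""
    hits = filter_suggested(suggested, question_concepts, qtype)
    return hits[0]["answer"] if hits else None
```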
9. Question and Answer Handling
• Handle and parse each reasoner’s returned results
o Customized per-reasoner code
• Information on execution: details and summary
• Report generation
o Automatic evaluation
• Question Selector
o Supports multiple facets/filters
o Question banks
o Dynamic UI to pick questions
o Hidden-tag support
10. Automatic Evaluation: Status as of 2013.3
• Automatic result evaluation features
• Web UI/service to use them
• Algorithms to score exact and variable answers
– brevity/clarity
– relevance: correctness + completeness
– overall score
• Generate reports
– Summary & details
– Graph plot
• Improving evaluation result accuracy
• Using basic text-processing tricks (stop words, stemming, trigram similarity, etc.), location of answer, length of answer, bio concepts, counts of concepts, chapters referred to, question types, and answer type (scoring sketch below)
• Experiments and analysis (several rounds, W.I.P.)
[Chart: user overall vs. AutoEval overall scores, y-axis 0–120.]
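A sketch of the style of scoring these features imply: stop-word removal plus character-trigram Jaccard similarity as the relevance term, combined with a simple brevity term. The stop-word list, length cap, and the 0.8/0.2 weighting are assumptions for illustration, not the deployed algorithm.

```python
STOPWORDS = {"the", "a", "an", "of", "is", "are", "to", "in", "and"}

def normalize(text):
    """Lowercase and drop stop words before comparing texts."""
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)

def trigrams(text):
    """Character trigrams of a normalized string."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def trigram_similarity(answer, reference):
    """Jaccard overlap of character trigrams as a relevance proxy."""
    a, b = trigrams(normalize(answer)), trigrams(normalize(reference))
    return len(a & b) / len(a | b) if a | b else 0.0

def overall_score(answer, reference, max_len=400):
    """Combine relevance with a brevity penalty for overlong answers."""
    relevance = trigram_similarity(answer, reference)
    brevity = min(1.0, max_len / max(len(answer), 1))
    return 0.8 * relevance + 0.2 * brevity  # assumed weighting
```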
12. Caveats: Question Generation and Selection
• Generated by a small group of SMEs (senior biology students)
• Written in natural language, without the textbook (from the syllabus only)
13. Question Set Facets
[Two pie charts. Question Types: FIND-A-VALUE 46%, IS-IT-TRUE-THAT 9%, Other 9%, HAVE-RELATIONSHIP 7%, WHY 7%, HOW 5%, PROPERTY 5%, WHERE 5%, HOW-MANY 4%, WHAT-IS-A 3%, WHAT-DOES-X-DO 3%, HAVE-SIMILARITIES 2%, X-OR-Y 2%, FUNCTION-OF-X 1%, HAVE-DIFFERENCES 1%. Chapter Distribution: chapters 0 and 4–12.]
14. Caveat: Evaluation Criteria
• We provided clear guidelines, but ratings were still subjective
o A(4.0) = correct, complete answer, no major weakness
o B(3.0) = correct, complete answer with small cosmetic issues
o C(2.0) = partially correct or complete answers, with some big issues
o D(1.0) = somewhat relevant answer or information, or poor presentation
o F(0.0) = wrong or irrelevant, conflicting or hard-to-locate answers
• Only 3 users rated the answers, under a tight timeline
[Chart: User Preferences – average ratings (0–3) for Aura, Cyc, and Text QA; x-axis values 7, 15, 23.]
18. Performance Numbers
[Two bar charts of reasoner performance (Precision, Recall, F1) for Aura, Cyc, and Text QA: left, over all ratings (0–4), y-axis 0.000–0.600; right, over "good" (rating >= 3.0) answers, y-axis 0.000–0.400.]
19. Answers Over Question Types
[Two bar charts over question types (FIND-A-VALUE, HOW, HOW-MANY, PROPERTY, WHAT-DOES-X-DO, WHAT-IS-A, X-OR-Y, IS-IT-TRUE-THAT, HAVE-DIFFERENCES, HAVE-SIMILARITIES, HAVE-RELATIONSHIP) for Aura, Cyc, and Text QA: count of answered questions (0–20) and answer overall rating (0.00–4.00).]
20. Answer Distribution Over Chapters
[Chart: Answer Quality Over Chapters – average rating (0.00–4.00) for Aura, Cyc, and Text QA across chapters 0 and 4–12.]
Average ratings by chapter (chapters 0, 4–12):
Aura: 3.13, 3.67, 1.83, 2.33, 0.58, 1.83, 1.00, 0.50 (eight chapters answered)
Cyc: 1.75, 2.17, 1.00, 1.67, 3.17, 1.11, 1.83, 2.67 (eight chapters answered)
Text QA: 2.21, 2.27, 1.23, 2.67, 2.89, 1.20, 1.28, 1.97, 2.06, 2.50
21. Answers on Questions with E/V Answer Type
[Two bar charts for questions with Exact (E) vs. Various (V) answer types, per reasoner (Aura, Cyc, Text QA): Exact/Various answer count (0–50) and Exact/Various answer quality (0.00–3.00).]
22. Improve Performance: Hybrid Solver – Combine!
• Random selector (dumbest, baseline)
o Total questions answered correctly should beat the best individual solver
• Priority selector (less dumb)
o Pick the reasoner following a good order (e.g. Aura > Cyc > Text QA) *
o Expected performance: better than the best individual
• Trained selector: feature- and rule-based selector (smarter)
o Decision-tree (CTree…) learning over Q-Type, Chapter, …
o Expected performance: slightly better than the above
• Theoretical best selector: MAX – the upper limit (smartest)
o Suppose we can always pick the best-performing reasoner (selector sketches below)
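A sketch of the random, priority, and MAX selector policies, assuming each question's candidates arrive as a `{reasoner: (answer, rating)}` map; only the oracle-style MAX selector is allowed to peek at the ratings.

```python
import random

PRIORITY = ["Aura", "Cyc", "Text QA"]  # example ordering from the slide

def random_selector(answers):
    """Baseline: pick uniformly among the reasoners that answered at all."""
    answered = [a for a, _ in answers.values() if a is not None]
    return random.choice(answered) if answered else None

def priority_selector(answers):
    """Take the first available answer in a fixed preference order."""
    for reasoner in PRIORITY:
        answer, _ = answers.get(reasoner, (None, None))
        if answer is not None:
            return answer
    return None

def max_selector(answers):
    """Theoretical upper bound: always pick the best-rated available answer."""
    rated = [(rating, a) for a, rating in answers.values() if a is not None]
    return max(rated)[1] if rated else None
```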
23. Performance (F1) with Hybrid Solvers
[Bar chart: F1 on good answers (rating >= 3.0), y-axis 0.000–0.300, for the Aura, Cyc, Text QA, Random, Priority, D-Tree, and Max solvers.]
24. Conclusion
• Each reasoner has its own strengths and weaknesses
o Some aspects are not handled well by AURA & CYC
o Low-hanging fruit: IS-IT-TRUE-THAT for all, WHAT-IS-A for CYC, …
• Aggregated performance easily beats the best individual (Text QA)
o Even the random solver does a good job (mean F1 = 0.609): F1(MAX) – F1(Random) ≈ 2.5%
• Little room for better performance via answer selection
o F1(MAX) – F1(D-Tree) ≈ 0.5%
o Better to focus on MORE and/or BETTER solvers
26. Near Future Plans
• Include SQDB-based answers as a “Solver”
o Helps alleviate reasoners’ question-interpretation problems
• Include Information Retrieval-based answers as a “Solver”
o Helps us understand the extra power reasoners can have over search
• Improve the evaluation mechanism
• Extract more features from questions and answers to enable a better solver, and see how close we can get to the upper limit (MAX)
• Improve the question selector to support multiple sources and automatic update/merge of question metadata
• Find ways to handle question bank evolution
27. Further Technical Directions (2013.6+)
Get More, Better Reasoners
Machine Learning, Evidence Combination
• Extract and use more features to select the best answers
• Evidence collection and weighing
Analytics & Tuning
• Easier exploration of individual results and diagnosis of failures
• Support for tuning and optimizing performance over target question-answer datasets
Inter-Solver Communication
• Support shared data, shared answers
• Subgoaling: allow reasoners to call each other for subgoals
28. Open *Data*
Requirements
• Clear semantics, common format (standard), easy to access, persistent (available)
Data Sources
• Question bank, training sets, knowledge base, protocol for intermediate and final data exchange
Open Data Access Layer
• Design and implement protocols and services for data I/O (format sketch below)
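A minimal sketch of what a common exchange format and access layer could look like, using Python's standard `xml.etree` module; the element names (`question-set`, `question`, `text`, `chapter`) are illustrative, not a settled standard.

```python
import xml.etree.ElementTree as ET

def write_question_set(questions, path):
    """Serialize a question bank to a common XML exchange file."""
    root = ET.Element("question-set")
    for q in questions:
        qe = ET.SubElement(root, "question", id=q["id"], type=q["qtype"])
        ET.SubElement(qe, "text").text = q["text"]
        ET.SubElement(qe, "chapter").text = str(q["chapter"])
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

def read_question_set(path):
    """Parse the exchange file back into plain question records."""
    root = ET.parse(path).getroot()
    return [
        {
            "id": qe.get("id"),
            "qtype": qe.get("type"),
            "text": qe.findtext("text"),
            "chapter": qe.findtext("chapter"),
        }
        for qe in root.iter("question")
    ]
```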
29. Open *Services*
Two Categories
• Pure machine/algorithm based
• Human computation (social, crowdsourcing)
Requirements
• Communicate with open data, generate metadata
• More reliable, scalable, reusable
Goal: Process and Refine Data
• Convert raw, noisy, inaccurate data into refined, structured, useful data
30. Open *Environment*
Definition
• An AI development environment to facilitate collaboration, efficiency, and scalability
Operation
• Like an MMOG, each “player” gets credits: contribution, resource consumption; interest, loans; ratings…
Opportunities
• Self-organized projects, growth potential, encouraged collaboration, a grand prize
39. Work Performed
• Created web-based dispatcher infrastructure
o For both Live Direct QA and Live Suggested Questions
o Batch mode to handle larger volumes
• Built a web UI for UW students to rate answers to questions (HEF)
o Coherent UI, duplicate removal, queued tasks
• Established automatic methods for result evaluation and comparison
• Applied first versions of file and data exchange formats and protocols
• Set up faceted browsing and search (retrieval) UI
o And web services for 3rd-party consumption
• Carried out many rounds of relevance studies and analysis
40. First Evaluation via Halo Evaluation Framework
• We sent individual QA result sets to UW students for evaluation
• First round of hybrid system evaluation:
o Cyc SQA: 9 best (3 ties), 14 good, 15/60 answered
o Aura QA: 1 best, 9 good, 14/60 answered
o Aura SQA: 4 best (3 ties), 7 good, 8/60 answered
o Text QA: 27 best, 29 good; SQA: 3 best, 5 good, 7/60 answered
o Best scenario: 41/60 answered
o Note: Cyc Live was not included
o * SQA = answering via suggested questions
41. Live Direct QA Dispatcher Service
[Screenshots: ask a question (e.g. “What does ribosome make?”), wait for answers, and view the returned answers.]
53. Tuning the Suggested Question Recommendation
Accomplished
• Indexed the suggested-questions database
– Concepts, questions, answers
• Created a web service for uploading new sets of suggested questions
• Extracted chapter information from answer text (TEXT)
• Analyzed question types
– Pattern-based
• Experimented with some basic retrieval criteria
Not Yet Implemented
• Parsing the questions
• More experiments (heuristics) on retrieval/ranking criteria
– Manual
• Getting SMEs to generate training data to evaluate against
– Automatic
• More feature extraction
54. Parsing, Indexing and Ranking
In Place
• New local concept extraction service
• Concepts extracted and in the index
• Both sentences and paragraphs are in the index
• Basic sentence types identified
• Chapter and section information in the index
• Several ways of ranking evaluated
Not Yet Implemented
• More sentence features
– Content type: question, figure, header, regular, review…
– Previous and next concepts
– Count of concepts
– Clauses
– Universal truth
– Relevance or not
• Question parsing
• More refining of ranking
• Learning to Rank??
56. WIP: Ranking Experiments (Ablation Study)
Feature                       | Only (Easy) | Without (Easy) | Only (Hard) | Without (Hard)
Sentence Text                 | 139/201     |                | 31/146      |
Sentence Concept              | 79/201      |                | 13/146      |
Prev/Next Sentence Concept    | –           |                | –           |
Locality info (Chapter, etc.) | –           |                | –           |
Stopword list                 | –           |                | –           |
Stemming comparison           | –           |                | –           |
Other features (type…)        | –           |                | –           |
Weighting (variations)        |             |                |             |
(The ablation loop is sketched below.)
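A sketch of the ablation loop behind the table, assuming a hypothetical `rank_and_check(questions, enabled=...)` retrieval pipeline that returns the number of correctly answered questions; the feature names are placeholders for the rows above.

```python
FEATURES = ["sentence_text", "sentence_concept", "prev_next_concept",
            "locality", "stopwords", "stemming"]

def ablation(question_sets, rank_and_check):
    """Score each feature alone, then with it removed, per question set."""
    rows = []
    for feature in FEATURES:
        row = {"feature": feature}
        for name, questions in question_sets.items():  # e.g. "easy", "hard"
            # "Only": run with just this feature enabled.
            row[f"only_{name}"] = rank_and_check(questions, enabled={feature})
            # "Without": run with everything except this feature.
            row[f"without_{name}"] = rank_and_check(
                questions, enabled=set(FEATURES) - {feature})
        rows.append(row)
    return rows
```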
57. Automatic Evaluation of IR Results
• Inexpensive, consistent results for tuning
o Always using human judgments would be expensive and somewhat inconsistent
• Quick turnaround
• Covers both “easy” and “difficult” question-answer sets
• Validated by UW students to be trustworthy
o 95% accuracy on average with a threshold
58. First UW Students’ Evaluation on AutoEval
• Notation (see the labeling sketch below):
o 0 = right on (a 100% score is right, a 0% score is wrong)
o -1 = false positive: we gave it a high score (>50%), but the retrieved text does NOT contain or imply the answer
o +1 = false negative: we gave it a low score (<50%), but the retrieved text actually DOES contain or imply the answer
• We gave each of 4 students
o 15 questions, i.e. 15*5 = 75 sentences and scores to rank
o 5 of the questions are shared; 10 are unique to each student
o 23/45 questions from the “hard” set, 22/45 from the “easy” set
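The -1/0/+1 labeling reduces to a small comparison against the 50% threshold; a sketch, assuming the student judgment is a boolean “the retrieved text contains or implies the answer”:

```python
def validate(auto_score, contains_answer, threshold=0.5):
    """Label one AutoEval score against a human yes/no judgment."""
    predicted_yes = auto_score > threshold
    if predicted_yes and not contains_answer:
        return -1  # false positive: high score, answer not present
    if not predicted_yes and contains_answer:
        return +1  # false negative: low score, answer actually present
    return 0       # AutoEval agrees with the human judgment
```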
59. Results: Auto-Evaluation Validity Verification
[Bar chart: agreement of AutoEval with four student raters (1–4), at thresholds of 50% and 80%; y-axis 0–1.]
60. The “Easy” QA set *
• Task: automatically evaluate whether retrieved sentences contain the answer
• Scoring: Max score and Mean Average Precision (MAP) – both sketched below
• Results using Max (with the threshold at 80%):
o 193 regular questions and 8 yes/no questions (via concept overlap)
• Sentence text only: 139 (69.2%)
• Peter’s test set: 149 (74.1%)
• Peter’s more refined: 158 (78.6%)
• (Lower) upper bound for IR: 170 (84.2%)
• Jesse’s best: ??
* The evaluation is for the IR portion ONLY, no answer pinpointing
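A sketch of the two scoring modes named above, assuming per-question lists of auto-evaluator scores (for Max) and binary relevance flags in rank order (for MAP):

```python
def max_score(scores, threshold=0.8):
    """1 if any retrieved sentence clears the threshold, else 0."""
    return int(any(s >= threshold for s in scores))

def average_precision(relevant_flags):
    """AP over one question's ranked sentences (flag 1 = contains answer)."""
    hits, total = 0, 0.0
    for rank, flag in enumerate(relevant_flags, start=1):
        if flag:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(all_flags):
    """MAP over all questions' ranked retrieval results."""
    return sum(average_precision(f) for f in all_flags) / len(all_flags)
```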
61. “Easy” QA Set Auto-Evaluation
[Bar chart: auto-evaluation result (0–0.9) for Q Text Only, Vulcan Basic, Vulcan Refined, BaseIR Current, and Upper Bound.]
62. Best Upper Bound for Hard Set as of Today
• Weighting over answer text, answer concepts, question text, and question concepts; matching over sentence text, sentence concepts, concepts from the previous and next sentences, and sentence type…
• Compared with keyword overlap, concept overlap, stopword removal, and smart stemming techniques…
63. Sharing the Data and Knowledge
• Information we want (and each solver may also want):
• Everyone’s results
• Everyone’s confidence in their results
• Everyone’s supporting evidence
o From textbook sentences, reviews, homework sections, figures…
o From related web material, e.g. Wikipedia biology articles
o From common world knowledge: ParaPara, WordNet, …
• Training data – for offline use
64. More Timeline Details for First Integration
We are in control:
• AURA – now
• Text – before 12/7
• Vulcan IR Baseline – before 12/15
• Initial Hybrid System Output – before 12/21
– Without a unified data format
– With limited (possibly outdated) suggested questions
Partners:
• Cyc – ? Hopefully before EOY 2012
• JHU – ?? Hopefully before EOY 2012
• ReVerb – ??? EOM January 2013
65. Rounds of Improvements
Infrastructure (modules & services)
• Integrate solvers
• Data I/O
Tricks (algorithms & data)
• Refine the hybrid strategy
• Heuristics + machine learning
Analysis (evaluation)
• Evaluation with humans
• With each solver + the hybrid system
66. OpenHalo
[Diagram: OpenHalo – the Vulcan Hybrid System connecting AURA QA, CYC QA, SILK, TEQA, and other QA solvers through a shared data service and collaboration layer.]
Editor's Notes
We have been debating whether it is necessary to evaluate a separate Information Retrieval module for comparison purposes: to see how well an Information Retrieval-based module can do as a baseline, and how much better we can do on top of it (our value added).