This is the deck for the Science Advisory Board review of our recent progress in building basic infrastructure: a hybrid system architecture for automatic question answering in Project Halo, Vulcan's long-range strong-AI effort to attack a key problem in AI research.
3. Hybrid System Near Term Goals
• Set up the infrastructure to communicate with existing reasoners
• Reliably dispatch questions and collect answers
• Create related tools and resources
o Question generation/selection, answer evaluation, report analysis, etc.
• Experiment with ways to choose answers from the available reasoners – as a hybrid solver
[Diagram: the Dispatcher connecting to the AURA, CYC, and TEQA reasoners.]
4. Focus Areas of Hybrid Framework (until mid-2013)
Modularity
• Loose coupling, high cohesion, data exchange protocols
Dispatching
• Send requests and handle the responses
Evaluation
• Ability to get ratings on answers and report results
5. Hybrid System Core Components
[Diagram: the Dispatcher routing a filtered set of questions (Campbell Chapter 7) to AURA (DirectQA, Find-A-Value), CYC, TEQA, and possibly IR; new or updated components outlined in yellow.]
Legend:
• SQs: suggested questions
• SQA: QA with suggested questions
• TEQA: Textual Entailment QA
• IR: Information Retrieval
7. Dispatcher Features
• Asynchronous batch mode and single/experiment mode
• Parallel dispatching to reasoners (dispatch/retry sketch below)
o Full-featured UI: live progress indicator, question-file view, logs
o Exception and error handling
• Retries a question when the server is busy
• Batch service continues to completion even if the client dies
o Canceling/stopping the batch process is also available
• Input and output support both XML and CSV/TSV formats
o Pipeline support: accepts Question-Selector input
• Configurable dispatchers: select which reasoners to use
o Collect answers and compute basic statistics
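The retry and parallel-dispatch behavior above can be pictured with a short sketch. This is a minimal illustration, not the deployed service: the reasoner client call (`reasoner.answer`), `ServerBusyError`, and the question record shape are assumed names.

```python
import time
from concurrent.futures import ThreadPoolExecutor

class ServerBusyError(Exception):
    """Raised when a reasoner reports it cannot accept work yet."""

def ask_reasoner(reasoner, question, retries=3, backoff=5.0):
    """Send one question to one reasoner, retrying while the server is busy."""
    for attempt in range(retries):
        try:
            return reasoner.answer(question)  # hypothetical client call
        except ServerBusyError:
            time.sleep(backoff * (attempt + 1))
    return None  # recorded as "no answer" in the batch results

def dispatch_batch(questions, reasoners, max_workers=4):
    """Fan every question out to every configured reasoner in parallel."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(ask_reasoner, r, q): (q["id"], r.name)
            for q in questions for r in reasoners
        }
        for future, key in futures.items():
            results[key] = future.result()  # collect per (question, reasoner)
    return results
```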
8. Question-Answering via Suggested Questions
• Similar features to Live/Direct QA
• Aggregates suggested questions’ answers as a solver (aggregation sketch below)
• Unique features:
o Interactively browse the suggested-questions database
o Filter on certain facets
o Use Q/A concepts, question types, etc. to improve relevance
o Automatic comparison of filtered and non-filtered results by chapter
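A minimal sketch of the filter-then-aggregate idea behind the SQA solver, assuming each suggested-question record is a dict with `concepts`, `qtype`, and `answer` fields; the schema and the overlap-based ranking are illustrative assumptions, not the actual database API.

```python
def filter_suggested(suggested, question_concepts, qtype=None):
    """Keep suggested questions sharing concepts (and optionally a type)."""
    hits = [
        sq for sq in suggested
        if set(sq["concepts"]) & set(question_concepts)
        and (qtype is None or sq["qtype"] == qtype)
    ]
    # Rank by concept overlap so the most relevant suggestion comes first.
    hits.sort(key=lambda sq: len(set(sq["concepts"]) & set(question_concepts)),
              reverse=True)
    return hits

def sqa_answer(suggested, question_concepts, qtype=None):
    """Aggregate: answer with the top filtered suggested question's answer."""
    hits = filter_suggested(suggested, question_concepts, qtype)
    return hits[0]["answer"] if hits else None
```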
9. Question and Answer Handling
• Handle and parse each reasoner’s returned results
o Customized per-reasoner code
• Information on execution: details and summary
• Report generation
o Automatic evaluation
• Question Selector
o Supports multiple facets/filters
o Question banks
o Dynamic UI to pick questions
o Hidden-tag support
10. Automatic Evaluation: Status as of 2013.3
• Automatic result evaluation features
• Web UI/service to use them
• Algorithms to score exact and variable answers
– brevity/clarity
– relevance: correctness + completeness
– overall score
• Generate reports
– Summary & details
– Graph plot
• Improving evaluation result accuracy
• Using basic text-processing tricks (stop words, stemming, trigram similarity, etc.), location of answer, length of answer, bio concepts, counts of concepts, chapters referred to, question types, and answer type (scoring sketch below)
• Experiments and analysis (several rounds, W.I.P.)
[Chart: user overall vs. AutoEval overall scores, y-axis 0–120.]
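A sketch of the style of scoring these features imply: stop-word removal plus character-trigram Jaccard similarity as the relevance term, combined with a simple brevity term. The stop-word list, length cap, and the 0.8/0.2 weighting are assumptions for illustration, not the deployed algorithm.

```python
STOPWORDS = {"the", "a", "an", "of", "is", "are", "to", "in", "and"}

def normalize(text):
    """Lowercase and drop stop words before comparing texts."""
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)

def trigrams(text):
    """Character trigrams of a normalized string."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def trigram_similarity(answer, reference):
    """Jaccard overlap of character trigrams as a relevance proxy."""
    a, b = trigrams(normalize(answer)), trigrams(normalize(reference))
    return len(a & b) / len(a | b) if a | b else 0.0

def overall_score(answer, reference, max_len=400):
    """Combine relevance with a brevity penalty for overlong answers."""
    relevance = trigram_similarity(answer, reference)
    brevity = min(1.0, max_len / max(len(answer), 1))
    return 0.8 * relevance + 0.2 * brevity  # assumed weighting
```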
12. Caveats: Question Generation and Selection
• Generated by a small group of SMEs (senior biology students)
• Written in natural language, without the textbook (from the syllabus only)
13. Question Set Facets
[Two pie charts. Question Types: FIND-A-VALUE 46%, IS-IT-TRUE-THAT 9%, Other 9%, HAVE-RELATIONSHIP 7%, WHY 7%, HOW 5%, PROPERTY 5%, WHERE 5%, HOW-MANY 4%, WHAT-IS-A 3%, WHAT-DOES-X-DO 3%, HAVE-SIMILARITIES 2%, X-OR-Y 2%, FUNCTION-OF-X 1%, HAVE-DIFFERENCES 1%. Chapter Distribution: chapters 0 and 4–12.]
14. Caveat: Evaluation Criteria
• We provided clear guidelines, but ratings were still subjective
o A(4.0) = correct, complete answer, no major weakness
o B(3.0) = correct, complete answer with small cosmetic issues
o C(2.0) = partially correct or complete answers, with some big issues
o D(1.0) = somewhat relevant answer or information, or poor presentation
o F(0.0) = wrong or irrelevant, conflicting or hard-to-locate answers
• Only 3 users rated the answers, under a tight timeline
[Chart: User Preferences – average ratings (0–3) for Aura, Cyc, and Text QA; x-axis values 7, 15, 23.]
18. Performance Numbers
[Two bar charts of reasoner performance (Precision, Recall, F1) for Aura, Cyc, and Text QA: left, over all ratings (0–4), y-axis 0.000–0.600; right, over "good" (rating >= 3.0) answers, y-axis 0.000–0.400.]
19. Answers Over Question Types
[Two bar charts over question types (FIND-A-VALUE, HOW, HOW-MANY, PROPERTY, WHAT-DOES-X-DO, WHAT-IS-A, X-OR-Y, IS-IT-TRUE-THAT, HAVE-DIFFERENCES, HAVE-SIMILARITIES, HAVE-RELATIONSHIP) for Aura, Cyc, and Text QA: count of answered questions (0–20) and answer overall rating (0.00–4.00).]
20. Answer Distribution Over Chapters
[Chart: Answer Quality Over Chapters – average rating (0.00–4.00) for Aura, Cyc, and Text QA across chapters 0 and 4–12.]
Average ratings by chapter (chapters 0, 4–12):
Aura: 3.13, 3.67, 1.83, 2.33, 0.58, 1.83, 1.00, 0.50 (eight chapters answered)
Cyc: 1.75, 2.17, 1.00, 1.67, 3.17, 1.11, 1.83, 2.67 (eight chapters answered)
Text QA: 2.21, 2.27, 1.23, 2.67, 2.89, 1.20, 1.28, 1.97, 2.06, 2.50
21. Answers on Questions with E/V Answer Type
[Two bar charts for questions with Exact (E) vs. Various (V) answer types, per reasoner (Aura, Cyc, Text QA): Exact/Various answer count (0–50) and Exact/Various answer quality (0.00–3.00).]
22. Improve Performance: Hybrid Solver – Combine!
• Random selector (dumbest, baseline)
o Total questions answered correctly should beat the best individual solver
• Priority selector (less dumb)
o Pick the reasoner following a good order (e.g. Aura > Cyc > Text QA) *
o Expected performance: better than the best individual
• Trained selector: feature- and rule-based selector (smarter)
o Decision-tree (CTree…) learning over Q-Type, Chapter, …
o Expected performance: slightly better than the above
• Theoretical best selector: MAX – the upper limit (smartest)
o Suppose we can always pick the best-performing reasoner (selector sketches below)
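A sketch of the random, priority, and MAX selector policies, assuming each question's candidates arrive as a `{reasoner: (answer, rating)}` map; only the oracle-style MAX selector is allowed to peek at the ratings.

```python
import random

PRIORITY = ["Aura", "Cyc", "Text QA"]  # example ordering from the slide

def random_selector(answers):
    """Baseline: pick uniformly among the reasoners that answered at all."""
    answered = [a for a, _ in answers.values() if a is not None]
    return random.choice(answered) if answered else None

def priority_selector(answers):
    """Take the first available answer in a fixed preference order."""
    for reasoner in PRIORITY:
        answer, _ = answers.get(reasoner, (None, None))
        if answer is not None:
            return answer
    return None

def max_selector(answers):
    """Theoretical upper bound: always pick the best-rated available answer."""
    rated = [(rating, a) for a, rating in answers.values() if a is not None]
    return max(rated)[1] if rated else None
```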
23. Performance (F1) with Hybrid Solvers
[Bar chart: F1 on good answers (rating >= 3.0), y-axis 0.000–0.300, for the Aura, Cyc, Text QA, Random, Priority, D-Tree, and Max solvers.]
24. Conclusion
• Each reasoner has its own strengths and weaknesses
o Some aspects are not handled well by AURA & CYC
o Low-hanging fruit: IS-IT-TRUE-THAT for all, WHAT-IS-A for CYC, …
• Aggregated performance easily beats the best individual (Text QA)
o Even the random solver does a good job (mean F1 = 0.609): F1(MAX) – F1(Random) ≈ 2.5%
• Little room for better performance via answer selection
o F1(MAX) – F1(D-Tree) ≈ 0.5%
o Better to focus on MORE and/or BETTER solvers
26. Near Future Plans
• Include SQDB-based answers as a “Solver”
o Helps alleviate reasoners’ question-interpretation problems
• Include Information Retrieval-based answers as a “Solver”
o Helps us understand the extra power reasoners can have over search
• Improve the evaluation mechanism
• Extract more features from questions and answers to enable a better solver, and see how close we can get to the upper limit (MAX)
• Improve the question selector to support multiple sources and automatic update/merge of question metadata
• Find ways to handle question bank evolution
27. Further Technical Directions (2013.6+)
Get More, Better Reasoners
Machine Learning, Evidence Combination
• Extract and use more features to select the best answers
• Evidence collection and weighing
Analytics & Tuning
• Easier exploration of individual results and diagnosis of failures
• Support for tuning and optimizing performance over target question-answer datasets
Inter-Solver Communication
• Support shared data, shared answers
• Subgoaling: allow reasoners to call each other for subgoals
28. Open *Data*
Requirements
• Clear semantics, common format (standard), easy to access, persistent (available)
Data Sources
• Question bank, training sets, knowledge base, protocol for intermediate and final data exchange
Open Data Access Layer
• Design and implement protocols and services for data I/O (format sketch below)
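A minimal sketch of what a common exchange format and access layer could look like, using Python's standard `xml.etree` module; the element names (`question-set`, `question`, `text`, `chapter`) are illustrative, not a settled standard.

```python
import xml.etree.ElementTree as ET

def write_question_set(questions, path):
    """Serialize a question bank to a common XML exchange file."""
    root = ET.Element("question-set")
    for q in questions:
        qe = ET.SubElement(root, "question", id=q["id"], type=q["qtype"])
        ET.SubElement(qe, "text").text = q["text"]
        ET.SubElement(qe, "chapter").text = str(q["chapter"])
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

def read_question_set(path):
    """Parse the exchange file back into plain question records."""
    root = ET.parse(path).getroot()
    return [
        {
            "id": qe.get("id"),
            "qtype": qe.get("type"),
            "text": qe.findtext("text"),
            "chapter": qe.findtext("chapter"),
        }
        for qe in root.iter("question")
    ]
```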
29. Open *Services*
Two Categories
• Pure machine/algorithm based
• Human computation (social, crowdsourcing)
Requirements
• Communicate with open data, generate metadata
• More reliable, scalable, reusable
Goal: Process and Refine Data
• Convert raw, noisy, inaccurate data into refined, structured, useful data
30. Open *Environment*
Definition
• An AI development environment to facilitate collaboration, efficiency, and scalability
Operation
• Like an MMOG, each “player” gets credits: contribution, resource consumption; interest, loans; ratings…
Opportunities
• Self-organized projects, growth potential, encouraged collaboration, a grand prize
39. Work Performed
• Created web-based dispatcher infrastructure
o For both Live Direct QA and Live Suggested Questions
o Batch mode to handle larger volumes
• Built a web UI for UW students to rate answers to questions (HEF)
o Coherent UI, duplicate removal, queued tasks
• Established automatic methods for result evaluation and comparison
• Applied first versions of file and data exchange formats and protocols
• Set up faceted browsing and search (retrieval) UI
o And web services for 3rd-party consumption
• Carried out many rounds of relevance studies and analysis
40. First Evaluation via Halo Evaluation Framework
• We sent individual QA result sets to UW students for evaluation
• First round of hybrid system evaluation:
o Cyc SQA: 9 best (3 ties), 14 good, 15/60 answered
o Aura QA: 1 best, 9 good, 14/60 answered
o Aura SQA: 4 best (3 ties), 7 good, 8/60 answered
o Text QA: 27 best, 29 good; SQA: 3 best, 5 good, 7/60 answered
o Best scenario: 41/60 answered
o Note: Cyc Live was not included
o * SQA = answering via suggested questions
41. Live Direct QA Dispatcher Service
[Screenshots: ask a question (e.g. “What does ribosome make?”), wait for answers, and view the returned answers.]
53. Tuning the Suggested Question Recommendation
Accomplished
• Indexed the suggested-questions database
– Concepts, questions, answers
• Created a web service for uploading new sets of suggested questions
• Extracted chapter information from answer text (TEXT)
• Analyzed question types
– Pattern-based
• Experimented with some basic retrieval criteria
Not Yet Implemented
• Parsing the questions
• More experiments (heuristics) on retrieval/ranking criteria
– Manual
• Getting SMEs to generate training data to evaluate against
– Automatic
• More feature extraction
54. Parsing, Indexing and Ranking
In Place
• New local concept extraction service
• Concepts extracted and in the index
• Both sentences and paragraphs are in the index
• Basic sentence types identified
• Chapter and section information in the index
• Several ways of ranking evaluated
Not Yet Implemented
• More sentence features
– Content type: question, figure, header, regular, review…
– Previous and next concepts
– Count of concepts
– Clauses
– Universal truth
– Relevance or not
• Question parsing
• More refining of ranking
• Learning to Rank??
56. WIP: Ranking Experiments (Ablation Study)
Feature                       | Only (Easy) | Without (Easy) | Only (Hard) | Without (Hard)
Sentence Text                 | 139/201     |                | 31/146      |
Sentence Concept              | 79/201      |                | 13/146      |
Prev/Next Sentence Concept    | –           |                | –           |
Locality info (Chapter, etc.) | –           |                | –           |
Stopword list                 | –           |                | –           |
Stemming comparison           | –           |                | –           |
Other features (type…)        | –           |                | –           |
Weighting (variations)        |             |                |             |
(The ablation loop is sketched below.)
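A sketch of the ablation loop behind the table, assuming a hypothetical `rank_and_check(questions, enabled=...)` retrieval pipeline that returns the number of correctly answered questions; the feature names are placeholders for the rows above.

```python
FEATURES = ["sentence_text", "sentence_concept", "prev_next_concept",
            "locality", "stopwords", "stemming"]

def ablation(question_sets, rank_and_check):
    """Score each feature alone, then with it removed, per question set."""
    rows = []
    for feature in FEATURES:
        row = {"feature": feature}
        for name, questions in question_sets.items():  # e.g. "easy", "hard"
            # "Only": run with just this feature enabled.
            row[f"only_{name}"] = rank_and_check(questions, enabled={feature})
            # "Without": run with everything except this feature.
            row[f"without_{name}"] = rank_and_check(
                questions, enabled=set(FEATURES) - {feature})
        rows.append(row)
    return rows
```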
57. Automatic Evaluation of IR Results
• Inexpensive, consistent results for tuning
o Always using human judgments would be expensive and somewhat inconsistent
• Quick turnaround
• Covers both “easy” and “difficult” question-answer sets
• Validated by UW students to be trustworthy
o 95% accuracy on average with a threshold
58. First UW Students’ Evaluation on AutoEval
• Notation (see the labeling sketch below):
o 0 = right on (a 100% score is right, a 0% score is wrong)
o -1 = false positive: we gave it a high score (>50%), but the retrieved text does NOT contain or imply the answer
o +1 = false negative: we gave it a low score (<50%), but the retrieved text actually DOES contain or imply the answer
• We gave each of 4 students
o 15 questions, i.e. 15*5 = 75 sentences and scores to rank
o 5 of the questions are shared; 10 are unique to each student
o 23/45 questions from the “hard” set, 22/45 from the “easy” set
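The -1/0/+1 labeling reduces to a small comparison against the 50% threshold; a sketch, assuming the student judgment is a boolean “the retrieved text contains or implies the answer”:

```python
def validate(auto_score, contains_answer, threshold=0.5):
    """Label one AutoEval score against a human yes/no judgment."""
    predicted_yes = auto_score > threshold
    if predicted_yes and not contains_answer:
        return -1  # false positive: high score, answer not present
    if not predicted_yes and contains_answer:
        return +1  # false negative: low score, answer actually present
    return 0       # AutoEval agrees with the human judgment
```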
59. Results: Auto-Evaluation Validity Verification
[Bar chart: agreement of AutoEval with four student raters (1–4), at thresholds of 50% and 80%; y-axis 0–1.]
60. The “Easy” QA set *
• Task: automatically evaluate whether retrieved sentences contain the answer
• Scoring: Max score and Mean Average Precision (MAP) – both sketched below
• Results using Max (with the threshold at 80%):
o 193 regular questions and 8 yes/no questions (via concept overlap)
• Sentence text only: 139 (69.2%)
• Peter’s test set: 149 (74.1%)
• Peter’s more refined: 158 (78.6%)
• (Lower) upper bound for IR: 170 (84.2%)
• Jesse’s best: ??
* The evaluation is for the IR portion ONLY, no answer pinpointing
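A sketch of the two scoring modes named above, assuming per-question lists of auto-evaluator scores (for Max) and binary relevance flags in rank order (for MAP):

```python
def max_score(scores, threshold=0.8):
    """1 if any retrieved sentence clears the threshold, else 0."""
    return int(any(s >= threshold for s in scores))

def average_precision(relevant_flags):
    """AP over one question's ranked sentences (flag 1 = contains answer)."""
    hits, total = 0, 0.0
    for rank, flag in enumerate(relevant_flags, start=1):
        if flag:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(all_flags):
    """MAP over all questions' ranked retrieval results."""
    return sum(average_precision(f) for f in all_flags) / len(all_flags)
```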
61. “Easy” QA Set Auto-Evaluation
[Bar chart: auto-evaluation result (0–0.9) for Q Text Only, Vulcan Basic, Vulcan Refined, BaseIR Current, and Upper Bound.]
62. Best Upper Bound for Hard Set as of Today
• Weighting over answer text, answer concepts, question text, and question concepts; matching over sentence text, sentence concepts, concepts from the previous and next sentences, and sentence type…
• Compared with keyword overlap, concept overlap, stopword removal, and smart stemming techniques…
63. Sharing the Data and Knowledge
• Information we want (and each solver may also want):
• Everyone’s results
• Everyone’s confidence in their results
• Everyone’s supporting evidence
o From textbook sentences, reviews, homework sections, figures…
o From related web material, e.g. Wikipedia biology articles
o From common world knowledge: ParaPara, WordNet, …
• Training data – for offline use
64. More Timeline Details for First Integration
We are in control:
• AURA – now
• Text – before 12/7
• Vulcan IR Baseline – before 12/15
• Initial Hybrid System Output – before 12/21
– Without a unified data format
– With limited (possibly outdated) suggested questions
Partners:
• Cyc – ? Hopefully before EOY 2012
• JHU – ?? Hopefully before EOY 2012
• ReVerb – ??? EOM January 2013
65. Rounds of Improvements
Infrastructure (modules & services)
• Integrate solvers
• Data I/O
Tricks (algorithms & data)
• Refine the hybrid strategy
• Heuristics + machine learning
Analysis (evaluation)
• Evaluation with humans
• With each solver + the hybrid system
66. OpenHalo
[Diagram: OpenHalo – the Vulcan Hybrid System connecting AURA QA, CYC QA, SILK, TEQA, and other QA solvers through a shared data service and collaboration layer.]
Editor's Notes
We have been debating whether it is necessary to evaluate a separate Information Retrieval module for comparison purposes: to see how well an Information Retrieval-based module can do as a baseline, and how much better we can do on top of it (our value added).