Rather than running pre-defined queries embedded in dashboards, business users and data scientists want to explore data in more intuitive ways. Natural language interfaces for data exploration have gained considerable traction in industry. Their success is driven by advances in machine learning and by novel big data technologies that enable processing large amounts of data in real time. However, even though these systems show significant progress, they have not yet reached the maturity needed to support real users in data exploration scenarios, due either to a lack of supported functionality or to a narrow application scope; natural language data exploration thus remains one of the 'holy grails' of the data analytics community.
In this talk, we will present a Spark-based architecture of an intelligent data assistant, a system that combines real-time data processing and analytics over large amounts of data with user interaction in natural language, and we will argue why Spark is the right platform for next-gen intelligent data assistants.
Our intelligent data assistant
(a) enables a more natural interaction with the user through natural language;
(b) offers active guidance through explanations and suggestions;
(c) constantly learns and improves its performance.
Building an intelligent data assistant poses several challenges. Unlike with search engines, users tend to express sophisticated query logic and expect exact results. The inherent complexity of natural language complicates things in several ways. The intricacies of the data domain require that the system continually expand its domain knowledge and its ability to interpret new data and user queries, by constantly analyzing both.
Our intelligent data assistant brings together several components, including natural language processing for understanding user queries and generating answers in natural language, automatic knowledge base construction techniques for learning about data sources and how to find the information requested, as well as deep learning methods for query disambiguation and domain understanding.
2. Georgia Koutrika, ATHENA Research Center
A Spark-based Intelligent Assistant
Making Data Exploration in Natural Language Real
#UnifiedDataAnalytics #SparkAISummit
3. Data, Data, Data
Data growth
More data than humans can process and comprehend
Data democratization
From scientists to the public, increasingly more users are consumers of data
8. The “SQL” Age
Programmer
SELECT * FROM CITIES
WHERE 50 <
(SELECT AVG(TEMP_F)
FROM STATS WHERE
CITIES.ID = STATS.ID);
Users
which cities have year-round average temperature above 50 degrees?
DBMS
Limited access (for all but the privileged)
Communication bottleneck (the guru)
Data starvation (for those that really need it)
Limited interaction (query answering)
Period characteristics
- User type: sophisticated user
- Knowledge: precise knowledge of data and schema
- Info need: precise knowledge of their need
- Interaction: user "speaks" fluent SQL; DBMS "responds" with tables
9. The “Baby Talk” Age
Business user
cities with year-round average temperature above 50 degrees?
DBMS
User pretty much knows what to ask
User queries are relatively simple
Query answering paradigm (still)
Period characteristics
- User type: domain expert
- Knowledge: user understands the data domain
- Info need: precise knowledge of their need
- Interaction: user not familiar with SQL; DBMS "responds" with tables and graphs
10. Chatbots
https://medium.com/swlh/chatbots-of-the-future-86b5bf762bb4
A chatbot:
• mimics conversations with people
• uses artificial intelligence techniques
• lives on consumer messaging platforms, as a means
for consumers to interact with brands.
#UnifiedDataAnalytics #SparkAISummit
Drawbacks
• Primarily text interfaces based on rules
• Encourage canned, linear-driven interactions
• Deal with simple, unambiguous questions (“what is the weather forecast today”)
• Cannot answer random or complex queries over data repositories
11. Conversational AI
For example, Google Duplex: demo released in May 2018
• The technology is directed towards completing specific tasks, such as scheduling certain types of
appointments.
• For such tasks, the system makes the conversational experience as natural as possible.
• One of the key research insights was to constrain Duplex to closed domains.
• Duplex can only carry out natural conversations after being deeply trained in such domains.
12. Human-like Data Exploration
Requirements
- User type: from expert to data consumer
- Knowledge: intuition about the data
- Info need: not necessarily sure what to ask
- Interaction: intuitive, natural interaction
13. Human-like Data Exploration
• converses with the user in a more natural bilateral interaction;
• actively guides the user offering explanations and suggestions;
• keeps track of the context and can respond and adapt accordingly;
• constantly improves its behavior by learning and adapting.
An Intelligent Data Assistant
17. Challenges: from the NL Side
Synonymy: multiple words with the same meaning, e.g., "movies" and "films".
Polysemy: a single word with multiple meanings, e.g., "Paris" the city vs. Paris Hilton.
Syntactic ambiguity: multiple readings based on syntax. E.g., "Find all German movie directors" can mean "directors that have directed German movies" or "directors from Germany that have directed a movie".
Semantic ambiguity: multiple meanings for a sentence. E.g., "Are Brad and Angelina married?": to each other, or separately?
Paraphrasing: multiple ways to say the same thing. E.g., "how many people live in ..." could be a mention of the "Population" column.
Context-dependent terms: e.g., in "Return the top product", "top" is a modifier for "product". Does it mean based on popularity? Based on number of sales?
Elliptical queries: sentences from which one or more words are omitted. E.g., "Return the movies of Clooney" (movies directed by Clooney, or starring him?).
Non-exact matching: mentions do not map exactly to values or tables/attributes. E.g., in "Who is the best actress in 2011", "actress" should map to the "actor" column.
18. Challenges: from the Data Side
Complex syntax: SQL is a structured language with a strict grammar and limited expressivity compared to natural language.
E.g., "Return the movie with the best rating" looks like it should be "SELECT name, MAX(rating) FROM Movie;" but the correct query is far more complicated.
Database structure: e.g., for the term "date", a system may need to retrieve three attributes: year, month, day.
Multiple relationships: mentions may connect in multiple ways (join disambiguation). E.g., "Woody Allen movies" may require several tables to be joined.
Ranking: how to rank multiple answers, e.g., for "Return Woody Allen movies".
20. Understanding Syntax
Step 1. Understand the natural language query linguistically.
Generate a dependency parse tree:
• part-of-speech (POS) tags that describe each word's syntactic function, plus
• syntactic relationships between words in the sentence.
Pipeline: NLQ → Syntactic Parser
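To make Step 1 concrete, here is a minimal, purely illustrative representation of a dependency parse in Python; the tags and relations below are assumptions for an example query, not the output of any particular parser:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    text: str            # the word itself
    pos: str             # part-of-speech tag, e.g. NOUN, VERB
    head: Optional[int]  # index of the syntactic parent; None for the root
    dep: str             # dependency relation to the head

# Hypothetical parse of "Return the movies of Clooney" (tags are illustrative):
parse = [
    Token("Return",  "VERB",  None, "ROOT"),
    Token("the",     "DET",   2,    "det"),
    Token("movies",  "NOUN",  0,    "dobj"),
    Token("of",      "ADP",   2,    "prep"),
    Token("Clooney", "PROPN", 3,    "pobj"),
]

def children(parse, i):
    """Indices of the direct dependents of token i in the parse tree."""
    return [j for j, t in enumerate(parse) if t.head == i]

children(parse, 2)  # the dependents of "movies": "the" and "of"
```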
21. Understanding Syntax
Step 2.
1. Map query elements to data elements:
• tables, attributes, values – using indexes
• commands (e.g., order by) – using a dictionary
2. Keep best mappings
Pipeline: NLQ → Syntactic Parser → Node Mapper
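A toy sketch of such a node mapper, assuming a tiny keyword index and command dictionary (all contents here are hypothetical; a real system would back these with full-text indexes over the database):

```python
# Hypothetical index over schema elements and values, and a command dictionary.
schema_index = {
    "movies": ("table", "Movie"),
    "director": ("table", "Director"),
    "revolutionary road": ("value", "Movie.title"),
}
command_dict = {
    "sorted by": "ORDER BY",
    "how many": "COUNT",
}

def map_tokens(phrases):
    """Map each phrase to a schema element or command (None if unmapped)."""
    return {p: schema_index.get(p.lower()) or command_dict.get(p.lower())
            for p in phrases}

map_tokens(["movies", "Revolutionary Road", "how many"])
```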
22. Understanding Syntax
Step 3. Map the parse tree to the database structure and build a query tree
Pipeline: NLQ → Syntactic Parser → Node Mapper → Tree Mapper
24.
Keyword → Schema Element:
movie → Movie
"Revolutionary Road" → Movie.title
movies → Movie
director → Director
Query: What movies have the same director as the movie "Revolutionary Road"?
Pipeline: Syntactic Parser → Node Mapper → Tree Mapper
(The parse tree over ROOT, Return, movies, same, director, movie, "Revolutionary Road" is mapped onto the database structure to form a query tree.)
SQL Generator
Main Query:
SELECT DISTINCT movie.title
FROM movie, block0, block1
WHERE movie.mid = block0.mid
AND block0.pk_director = block1.pk_director
Block0:
SELECT director.did AS pk_director, movie.mid
FROM movie, director, directed_by
WHERE movie.mid = directed_by.msid
AND directed_by.did = director.did
Block1:
SELECT director.did AS pk_director, movie.mid
FROM movie, director, directed_by
WHERE movie.title = 'Revolutionary Road'
AND movie.mid = directed_by.msid
AND directed_by.did = director.did
25. Understanding Syntax
Why is Parsing So Hard For Computers to Get Right?
• Human languages show remarkable levels of ambiguity.
• It is not uncommon for moderate length sentences to have hundreds, thousands, or even tens of
thousands of possible syntactic structures.
• A natural language parser must somehow search through all of these alternatives, and find the most
plausible structure given the context.
26. Step 1: Let the user ask using natural language
Pipeline: NL Query → Syntax-based (NLP) Query Understanding + Data-based Query Understanding → Explanations, Query Recommendations
29. Ambiguity
Several possible query interpretations, from likely to unlikely:
1. "business categorized as restaurant and as Italian": "restaurant" = Category(category), "italian" = Category(category)
2. "business categorized as Italian whose name includes restaurant": "restaurant" = Business(name), "italian" = Category(category)
3. "business categorized as Italian whose address includes restaurant": "restaurant" = Business(address), "italian" = Category(category)
4. "business categorized as restaurant that serves Italian": "restaurant" = Category(category), "italian" = Attribute(value)
5. "business whose address contains restaurant and Italian": "restaurant" = Business(address), "italian" = Business(address)
30. Ambiguity
Too many ways to interpret a query
• which one(s) represent user intent?
• how do we rank them?
32. Analyzing Data
Example features for attribute mappings (several possible mappings exist per keyword):
• Probability: captures how common a keyword is in an attribute, i.e., the keyword's count divided by Attribute_WordCount, the number of all words in the attribute.
• Exclusivity: an adapted version of the Gini index that captures the discriminative power of each mapping.
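A toy sketch of these two features over tiny in-memory column contents; the data, attribute names, and the exact Gini-style formula are illustrative assumptions, not the system's actual implementation:

```python
# Made-up column contents standing in for indexed database attributes.
columns = {
    "Category.category": ["italian restaurant", "mexican restaurant", "bar"],
    "Business.name":     ["mario's italian restaurant", "joe's bar"],
}

def probability(word, attr):
    """Commonality of `word` in `attr`: word count / Attribute_WordCount."""
    words = [w for value in columns[attr] for w in value.split()]
    return words.count(word) / len(words)

def exclusivity(word):
    """Gini-style concentration of `word` across attributes:
    1.0 if the word occurs in only one attribute, lower if it is spread out."""
    probs = [probability(word, a) for a in columns]
    total = sum(probs)
    if total == 0:
        return 0.0
    return sum((p / total) ** 2 for p in probs)
```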
33. Analyzing Data
Possible query interpretations: 1–5 as listed above (slide 29).
Min_Prob: take the minimum of the probabilities inside an attribute combination as a representative score for the mappings.
Example features for attribute combinations
34. Analyzing Data
Example Features for attribute combinations
IR_Score: we compute a relevance score:
- For each initial attribute, compute a single-attribute relevance score.
- Combine the single-attribute scores for an attribute combination into a final score.
Possible query interpretations: 1–5 as listed above (slide 29).
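The slides leave the scoring function unspecified; the sketch below assumes a tf-idf-style single-attribute score and simple summation as the combiner, purely for illustration:

```python
import math

def single_attribute_score(tf, df, n_attrs):
    """Relevance of a keyword in one attribute: term frequency weighted by
    how few of the n_attrs attributes contain the keyword (df)."""
    return tf * math.log((n_attrs + 1) / (df + 1))

def ir_score(single_scores):
    """Combine the single-attribute scores of a combination into one score."""
    return sum(single_scores)
```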
35. Step 2: Let the system respond in natural language
Pipeline: NL Query → Syntax-based (NLP) Query Understanding + Data-based Query Understanding → Explanations, Query Recommendations
36. NL Explanations
Example: "comedies by Woody Allen". Who is Woody Allen here: director? producer? The system must choose among candidate SQL translations, e.g., Woody Allen as director vs. as actor:

select d.name, m.title
from MOVIE m, DIRECTED r, DIRECTOR d, GENRE g
where m.id = r.mid and r.did = d.id
and m.id = g.mid and d.name = 'Woody Allen'
and g.genre = 'comedy'

select a.name, m.title
from MOVIE m, CAST c, ACTOR a, GENRE g
where m.id = c.mid and c.aid = a.id
and m.id = g.mid and a.name = 'Woody Allen'
and g.genre = 'comedy'
37. Generating Explanations
A domain-independent graph traversal efficiently explores query graphs and composes query descriptions as phrases in natural language.
Pipeline: Structured Query → Annotated Query Graph → Template-based Synthesis (using Annotations + Templates) → NL explanation
Annotations:
• Relation ACTOR → "actors"
• Attribute "fname" → "firstname"
• Function MAX → "the greatest"
38. Generating Explanations
Query graph: Actors – Cast – Movies, with projection on name and selection Year = 2010.
Edge templates: l(actors) + 'that play in' + l(movies); l(movies) + 'in' + l(year)
select a.name
from actors a, movies m, cast c
where a.id = c.aid and c.mid = m.mid and year = 2010
NL explanation: "Return the name of the actors for actors that play in movies in 2010"
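The template composition above can be sketched as follows; the label and template dictionaries are illustrative stand-ins for the annotation and template bases, and the exact output phrasing is an assumption of this sketch:

```python
# Node labels (annotations) and edge templates over the query graph.
labels = {
    "actors": "actors",
    "movies": "movies",
    "year=2010": "2010",
}
templates = {
    ("actors", "movies"):    "{0} that play in {1}",  # l(actors) + 'that play in' + l(movies)
    ("movies", "year=2010"): "{0} in {1}",            # l(movies) + 'in' + l(year)
}

def synthesize(select_attr, path):
    """Walk the graph path, filling each edge template with the phrase so far."""
    phrase = labels[path[0]]
    for a, b in zip(path, path[1:]):
        phrase = templates[(a, b)].format(phrase, labels[b])
    return f"Return the {select_attr} of the {phrase}"

synthesize("name", ["actors", "movies", "year=2010"])
# → "Return the name of the actors that play in movies in 2010"
```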
39. NL->SQL logs
What about user NL queries?
We can use our knowledge of translating past NL queries to synthesize NL explanations of new queries.
Pipeline: Structured Query → Annotated Query Graph → Template-based Synthesis (Annotations + Templates) → NL explanation
40. Step 3: Help the user ask the right question
Pipeline: NL Query → Syntax-based (NLP) Query Understanding + Data-based Query Understanding → Explanations, Query Recommendations
41. Guiding the user
The user needs help:
• discovering the data in the first place
• knowing what questions may be asked
• finding what to do next
Hello??
42. Query Recommendations
Two settings
Cold start: the user begins with no previous interaction. Show a set of starter queries that users can run to get some initial answers from the dataset and start understanding the data better.
Warm start: exploring and looking for answers in a new data set takes time and effort, and at each step the user may not know what to do next. The system can leverage the user's interactions (queries) to suggest possible next queries.
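A minimal warm-start sketch, assuming past sessions are available as query sequences and using a simple transition-count graph; the session data and query names are made up:

```python
from collections import Counter

# Hypothetical past sessions: ordered sequences of queries.
sessions = [
    ["q_cities", "q_avg_temp", "q_top_cities"],
    ["q_cities", "q_avg_temp"],
    ["q_cities", "q_population"],
]

# Build the query transition graph: (query_a, query_b) -> frequency.
transitions = Counter()
for s in sessions:
    for a, b in zip(s, s[1:]):
        transitions[(a, b)] += 1

def recommend(current, k=2):
    """Top-k next queries after `current`, ranked by transition frequency."""
    cands = [(b, n) for (a, b), n in transitions.items() if a == current]
    return [b for b, _ in sorted(cands, key=lambda x: -x[1])[:k]]

recommend("q_cities")  # → ["q_avg_temp", "q_population"]
```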
46. Intelligent Data Assistant
NL Query → Syntax-based (NLP) Query Understanding + Data-based Query Understanding → Explanations, Query Recommendations
Supporting stores: Knowledge + Expert Bases; Annotation + Template Bases; Query + Translation Logs; Statistics; Query Similarity Graph; Query Transition Graph
Data processing (on Spark Core): Spark SQL, Spark CoreNLP, Spark MLlib, GraphX, TensorFlow
Storage: HDFS, Parquet, Lucene