Rather than running pre-defined queries embedded in dashboards, business users and data scientists want to explore data in more intuitive ways. Natural language interfaces for data exploration have gained considerable traction in industry. Their success is driven by advances in machine learning and by novel big data technologies that enable processing large amounts of data in real time. However, even though these systems show significant progress, they have not yet reached the maturity needed to support real users in data exploration scenarios, due either to a lack of supported functionality or to a narrow application scope; natural language data exploration thus remains one of the 'holy grails' of the data analytics community.
In this talk, we will present a Spark-based architecture of an intelligent data assistant, a system that combines real-time data processing and analytics over large amounts of data with user interaction in natural language, and we will argue why Spark is the right platform for next-gen intelligent data assistants.
Our intelligent data assistant
(a) enables a more natural interaction with the user through natural language;
(b) offers active guidance through explanations and suggestions;
(c) constantly learns and improves its performance.
Building an intelligent data assistant poses several challenges. Unlike with search engines, users tend to express sophisticated query logic and expect exact results. The inherent complexity of natural language complicates things in several ways. The intricacies of the data domain require that the system continually expand its domain knowledge and its ability to interpret new data and user queries, by constantly analyzing both.
Our intelligent data assistant brings together several components, including natural language processing for understanding user queries and generating answers in natural language, automatic knowledge base construction techniques for learning about data sources and how to find the information requested, as well as deep learning methods for query disambiguation and domain understanding.
2. Georgia Koutrika, ATHENA Research Center
A Spark-based Intelligent Assistant
Making Data Exploration in Natural Language Real
#UnifiedDataAnalytics #SparkAISummit
3. Data, Data, Data
Data growth
More data than humans can process and comprehend
Data democratization
From scientists to the public, increasingly more users are consumers of data
8. The “SQL” Age
Programmer
SELECT * FROM CITIES
WHERE 50 <
(SELECT AVG(TEMP_F)
FROM STATS WHERE
CITIES.ID = STATS.ID);
Users
which cities have year-round average temperature above 50 degrees?
DBMS
Limited access (for all but the privileged)
Communication bottleneck (the guru)
Data starvation (for those that really need it)
Limited interaction (query answering)
Period characteristics
- User type: sophisticated user
- Knowledge: precise knowledge of data and schema
- Info need: precise knowledge of their need
- Interaction: user "speaks" fluent SQL; DBMS "responds" with tables
9. The “Baby Talk” Age
Business user
cities with year-round average temperature above 50 degrees?
DBMS
User pretty much knows what to ask
User queries are relatively simple
Query answering paradigm (still)
Period characteristics
- User type: domain expert
- Knowledge: user understands the data domain
- Info need: precise knowledge of their need
- Interaction: user not familiar with SQL; DBMS "responds" with tables and graphs
10. Chatbots
https://medium.com/swlh/chatbots-of-the-future-86b5bf762bb4
A chatbot:
• mimics conversations with people
• uses artificial intelligence techniques
• lives on consumer messaging platforms, as a means
for consumers to interact with brands.
#UnifiedDataAnalytics #SparkAISummit
Drawbacks
• Primarily text interfaces based on rules
• Encourage canned, linear-driven interactions
• Deal with simple, unambiguous questions (“what is the weather forecast today”)
• Cannot answer random or complex queries over data repositories
11. Conversational AI
For example, Google Duplex: demo released in May 2018
• The technology is directed towards completing specific tasks, such as scheduling certain types of
appointments.
• For such tasks, the system makes the conversational experience as natural as possible.
• One of the key research insights was to constrain Duplex to closed domains.
• Duplex can only carry out natural conversations after being deeply trained in such domains.
12. Human-like Data Exploration
Requirements
- User type: from expert to data consumer
- Knowledge: intuition about the data
- Info need: not necessarily sure what to ask
- Interaction: intuitive, natural interaction
13. Human-like Data Exploration
• converses with the user in a more natural bilateral interaction;
• actively guides the user offering explanations and suggestions;
• keeps track of the context and can respond and adapt accordingly;
• constantly improves its behavior by learning and adapting.
An Intelligent Data Assistant
17. Challenges: from the NL Side
Synonymy: multiple words with the same meaning, e.g., "movies" and "films".
Polysemy: a single word with multiple meanings, e.g., "Paris" the city vs. Paris Hilton.
Syntactic ambiguity: multiple readings based on syntax. E.g., "Find all German movie directors" can mean "directors that have directed German movies" or "directors from Germany that have directed a movie".
Semantic ambiguity: multiple meanings for a sentence. E.g., "Are Brad and Angelina married?": to each other, or separately?
Paraphrasing: multiple ways to say the same thing. E.g., "how many people live in ..." could be a mention of the "Population" column.
Context-dependent terms: e.g., in "Return the top product", "top" is a modifier for "product". Does it mean based on popularity? Based on number of sales?
Elliptical queries: sentences from which one or more words are omitted. E.g., "Return the movies of Clooney" (movies directed by Clooney, or starring him?).
Non-exact matching: mentions do not map exactly to values or tables/attributes. E.g., in "Who is the best actress in 2011", "actress" should map to the "actor" column.
18. Challenges: from the Data Side
Complex syntax: SQL is a structured language with a strict grammar and limited expressivity compared to natural language.
E.g., "Return the movie with the best rating" looks like it should be "SELECT name, MAX(rating) FROM Movie;" but the correct query is far more complicated.
Database structure: e.g., for the term "date", a system may need to retrieve three attributes: year, month, day.
Multiple relationships: mentions may connect in multiple ways (join disambiguation). E.g., "Woody Allen movies" may require several tables to be joined.
Ranking: how to rank multiple answers, e.g., for "Return Woody Allen movies".
20. Understanding Syntax
Step 1. Understand the natural language query linguistically.
Generate a dependency parse tree:
• part-of-speech (POS) tags that describe each word's syntactic function, plus
• syntactic relationships between words in the sentence.
Pipeline: NLQ → Syntactic Parser
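To make Step 1 concrete, here is a minimal, purely illustrative representation of a dependency parse in Python; the tags and relations below are assumptions for an example query, not the output of any particular parser:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    text: str            # the word itself
    pos: str             # part-of-speech tag, e.g. NOUN, VERB
    head: Optional[int]  # index of the syntactic parent; None for the root
    dep: str             # dependency relation to the head

# Hypothetical parse of "Return the movies of Clooney" (tags are illustrative):
parse = [
    Token("Return",  "VERB",  None, "ROOT"),
    Token("the",     "DET",   2,    "det"),
    Token("movies",  "NOUN",  0,    "dobj"),
    Token("of",      "ADP",   2,    "prep"),
    Token("Clooney", "PROPN", 3,    "pobj"),
]

def children(parse, i):
    """Indices of the direct dependents of token i in the parse tree."""
    return [j for j, t in enumerate(parse) if t.head == i]

children(parse, 2)  # the dependents of "movies": "the" and "of"
```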
21. Understanding Syntax
Step 2.
1. Map query elements to data elements:
• tables, attributes, values – using indexes
• commands (e.g., order by) – using a dictionary
2. Keep best mappings
Pipeline: NLQ → Syntactic Parser → Node Mapper
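A toy sketch of such a node mapper, assuming a tiny keyword index and command dictionary (all contents here are hypothetical; a real system would back these with full-text indexes over the database):

```python
# Hypothetical index over schema elements and values, and a command dictionary.
schema_index = {
    "movies": ("table", "Movie"),
    "director": ("table", "Director"),
    "revolutionary road": ("value", "Movie.title"),
}
command_dict = {
    "sorted by": "ORDER BY",
    "how many": "COUNT",
}

def map_tokens(phrases):
    """Map each phrase to a schema element or command (None if unmapped)."""
    return {p: schema_index.get(p.lower()) or command_dict.get(p.lower())
            for p in phrases}

map_tokens(["movies", "Revolutionary Road", "how many"])
```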
22. Understanding Syntax
Step 3. Map the parse tree to the database structure and build a query tree
Pipeline: NLQ → Syntactic Parser → Node Mapper → Tree Mapper
24.
Keyword → Schema Element:
movie → Movie
"Revolutionary Road" → Movie.title
movies → Movie
director → Director
Query: What movies have the same director as the movie "Revolutionary Road"?
Pipeline: Syntactic Parser → Node Mapper → Tree Mapper
(The parse tree over ROOT, Return, movies, same, director, movie, "Revolutionary Road" is mapped onto the database structure to form a query tree.)
SQL Generator
Main Query:
SELECT DISTINCT movie.title
FROM movie, block0, block1
WHERE movie.mid = block0.mid
AND block0.pk_director = block1.pk_director
Block0:
SELECT director.did AS pk_director, movie.mid
FROM movie, director, directed_by
WHERE movie.mid = directed_by.msid
AND directed_by.did = director.did
Block1:
SELECT director.did AS pk_director, movie.mid
FROM movie, director, directed_by
WHERE movie.title = 'Revolutionary Road'
AND movie.mid = directed_by.msid
AND directed_by.did = director.did
25. Understanding Syntax
Why is Parsing So Hard For Computers to Get Right?
• Human languages show remarkable levels of ambiguity.
• It is not uncommon for moderate length sentences to have hundreds, thousands, or even tens of
thousands of possible syntactic structures.
• A natural language parser must somehow search through all of these alternatives, and find the most
plausible structure given the context.
26. Step 1: Let the user ask using natural language
Pipeline: NL Query → Syntax-based (NLP) Query Understanding + Data-based Query Understanding → Explanations, Query Recommendations
29. Ambiguity
Several possible query interpretations, from likely to unlikely:
1. "business categorized as restaurant and as Italian": "restaurant" = Category(category), "italian" = Category(category)
2. "business categorized as Italian whose name includes restaurant": "restaurant" = Business(name), "italian" = Category(category)
3. "business categorized as Italian whose address includes restaurant": "restaurant" = Business(address), "italian" = Category(category)
4. "business categorized as restaurant that serves Italian": "restaurant" = Category(category), "italian" = Attribute(value)
5. "business whose address contains restaurant and Italian": "restaurant" = Business(address), "italian" = Business(address)
30. Ambiguity
Too many ways to interpret a query
• which one(s) represent user intent?
• how do we rank them?
32. Analyzing Data
Example features for attribute mappings (several possible mappings exist per keyword):
• Probability: captures how common a keyword is in an attribute, i.e., the keyword's count divided by Attribute_WordCount, the number of all words in the attribute.
• Exclusivity: an adapted version of the Gini index that captures the discriminative power of each mapping.
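A toy sketch of these two features over tiny in-memory column contents; the data, attribute names, and the exact Gini-style formula are illustrative assumptions, not the system's actual implementation:

```python
# Made-up column contents standing in for indexed database attributes.
columns = {
    "Category.category": ["italian restaurant", "mexican restaurant", "bar"],
    "Business.name":     ["mario's italian restaurant", "joe's bar"],
}

def probability(word, attr):
    """Commonality of `word` in `attr`: word count / Attribute_WordCount."""
    words = [w for value in columns[attr] for w in value.split()]
    return words.count(word) / len(words)

def exclusivity(word):
    """Gini-style concentration of `word` across attributes:
    1.0 if the word occurs in only one attribute, lower if it is spread out."""
    probs = [probability(word, a) for a in columns]
    total = sum(probs)
    if total == 0:
        return 0.0
    return sum((p / total) ** 2 for p in probs)
```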
33. Analyzing Data
Possible query interpretations: 1–5 as listed above (slide 29).
Min_Prob: take the minimum of the probabilities inside an attribute combination as a representative score for the mappings.
Example features for attribute combinations
34. Analyzing Data
Example Features for attribute combinations
IR_Score: we compute a relevance score:
- For each initial attribute, compute a single-attribute relevance score.
- Combine the single-attribute scores for an attribute combination into a final score.
Possible query interpretations: 1–5 as listed above (slide 29).
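The slides leave the scoring function unspecified; the sketch below assumes a tf-idf-style single-attribute score and simple summation as the combiner, purely for illustration:

```python
import math

def single_attribute_score(tf, df, n_attrs):
    """Relevance of a keyword in one attribute: term frequency weighted by
    how few of the n_attrs attributes contain the keyword (df)."""
    return tf * math.log((n_attrs + 1) / (df + 1))

def ir_score(single_scores):
    """Combine the single-attribute scores of a combination into one score."""
    return sum(single_scores)
```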
35. Step 2: Let the system respond in natural language
Pipeline: NL Query → Syntax-based (NLP) Query Understanding + Data-based Query Understanding → Explanations, Query Recommendations
36. NL Explanations
Example: "comedies by Woody Allen". Who is Woody Allen here: director? producer? The system must choose among candidate SQL translations, e.g., Woody Allen as director vs. as actor:

select d.name, m.title
from MOVIE m, DIRECTED r, DIRECTOR d, GENRE g
where m.id = r.mid and r.did = d.id
and m.id = g.mid and d.name = 'Woody Allen'
and g.genre = 'comedy'

select a.name, m.title
from MOVIE m, CAST c, ACTOR a, GENRE g
where m.id = c.mid and c.aid = a.id
and m.id = g.mid and a.name = 'Woody Allen'
and g.genre = 'comedy'
37. Generating Explanations
A domain-independent graph traversal efficiently explores query graphs and composes query descriptions as phrases in natural language.
Pipeline: Structured Query → Annotated Query Graph → Template-based Synthesis (using Annotations + Templates) → NL explanation
Annotations:
• Relation ACTOR → "actors"
• Attribute "fname" → "firstname"
• Function MAX → "the greatest"
38. Generating Explanations
Query graph: Actors – Cast – Movies, with projection on name and selection Year = 2010.
Edge templates: l(actors) + 'that play in' + l(movies); l(movies) + 'in' + l(year)
select a.name
from actors a, movies m, cast c
where a.id = c.aid and c.mid = m.mid and year = 2010
NL explanation: "Return the name of the actors for actors that play in movies in 2010"
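The template composition above can be sketched as follows; the label and template dictionaries are illustrative stand-ins for the annotation and template bases, and the exact output phrasing is an assumption of this sketch:

```python
# Node labels (annotations) and edge templates over the query graph.
labels = {
    "actors": "actors",
    "movies": "movies",
    "year=2010": "2010",
}
templates = {
    ("actors", "movies"):    "{0} that play in {1}",  # l(actors) + 'that play in' + l(movies)
    ("movies", "year=2010"): "{0} in {1}",            # l(movies) + 'in' + l(year)
}

def synthesize(select_attr, path):
    """Walk the graph path, filling each edge template with the phrase so far."""
    phrase = labels[path[0]]
    for a, b in zip(path, path[1:]):
        phrase = templates[(a, b)].format(phrase, labels[b])
    return f"Return the {select_attr} of the {phrase}"

synthesize("name", ["actors", "movies", "year=2010"])
# → "Return the name of the actors that play in movies in 2010"
```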
39. NL->SQL logs
What about user NL queries?
We can use our knowledge of translating past NL queries to synthesize NL explanations of new queries.
Pipeline: Structured Query → Annotated Query Graph → Template-based Synthesis (Annotations + Templates) → NL explanation
40. Step 3: Help the user ask the right question
Pipeline: NL Query → Syntax-based (NLP) Query Understanding + Data-based Query Understanding → Explanations, Query Recommendations
41. Guiding the user
The user needs help:
• discovering the data in the first place
• knowing what questions may be asked
• finding what to do next
Hello??
42. Query Recommendations
Two settings
Cold start: the user begins with no previous interaction. Show a set of starter queries that users can run to get some initial answers from the dataset and start understanding the data better.
Warm start: exploring and looking for answers in a new data set takes time and effort, and at each step the user may not know what to do next. The system can leverage the user's interactions (queries) to suggest possible next queries.
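A minimal warm-start sketch, assuming past sessions are available as query sequences and using a simple transition-count graph; the session data and query names are made up:

```python
from collections import Counter

# Hypothetical past sessions: ordered sequences of queries.
sessions = [
    ["q_cities", "q_avg_temp", "q_top_cities"],
    ["q_cities", "q_avg_temp"],
    ["q_cities", "q_population"],
]

# Build the query transition graph: (query_a, query_b) -> frequency.
transitions = Counter()
for s in sessions:
    for a, b in zip(s, s[1:]):
        transitions[(a, b)] += 1

def recommend(current, k=2):
    """Top-k next queries after `current`, ranked by transition frequency."""
    cands = [(b, n) for (a, b), n in transitions.items() if a == current]
    return [b for b, _ in sorted(cands, key=lambda x: -x[1])[:k]]

recommend("q_cities")  # → ["q_avg_temp", "q_population"]
```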
46. Intelligent Data Assistant
NL Query → Syntax-based (NLP) Query Understanding + Data-based Query Understanding → Explanations, Query Recommendations
Supporting stores: Knowledge + Expert Bases; Annotation + Template Bases; Query + Translation Logs; Statistics; Query Similarity Graph; Query Transition Graph
Data processing (on Spark Core): Spark SQL, Spark CoreNLP, Spark MLlib, GraphX, TensorFlow
Storage: HDFS, Parquet, Lucene