Georgia Koutrika, ATHENA Research Center
A Spark-based Intelligent Assistant
Making Data Exploration in Natural Language Real
#UnifiedDataAnalytics #SparkAISummit
Data, Data, Data
Data growth
More data than humans can process and comprehend
Data democratization
From scientists to the public, increasingly more users are consumers of data
HOW CAN WE EXPLORE AND LEVERAGE OUR DATA?
SELECT p.status
FROM conference_attendees p
WHERE p.conference = 'SPARK+AI2019'
[Diagram: SQL queries are sent to the data store, which returns results]
How are you today?
The phases of data exploration
The “SQL” Age
Users ask: "Which cities have a year-round average temperature above 50 degrees?"
The programmer writes the query for the DBMS:
SELECT * FROM CITIES
WHERE 50 <
(SELECT AVG(TEMP_F)
FROM STATS WHERE
CITIES.ID = STATS.ID);
• Limited access (for most but the privileged)
• Communication bottleneck (the guru)
• Data starvation (for those that really need it)
• Limited interaction (query answering)
Period characteristics
- User type: Sophisticated user
- Knowledge: Precise knowledge of data and schema
- Info need: Precise knowledge of their need
- Interaction: User "speaks" SQL fluently; DBMS "responds" with tables
The “Baby Talk” Age
The business user asks the DBMS: "Cities with year-round average temperature above 50 degrees?"
• User pretty much knows what to ask
• User queries are relatively simple
• Query answering paradigm (still)
Period characteristics
- User type: Domain expert
- Knowledge: User understands the data domain
- Info need: Precise knowledge of their need
- Interaction: User not familiar with SQL; DBMS "responds" with tables and graphs
Chatbots
https://medium.com/swlh/chatbots-of-the-future-86b5bf762bb4
A chatbot:
• mimics conversations with people
• uses artificial intelligence techniques
• lives on consumer messaging platforms, as a means for consumers to interact with brands.
Drawbacks
• Primarily text interfaces based on rules
• Encourage canned, linear-driven interactions
• Deal with simple, unambiguous questions (“what is the weather forecast today”)
• Cannot answer random or complex queries over data repositories
Conversational AI
For example, Google Duplex: demo released in May 2018
• The technology is directed towards completing specific tasks, such as scheduling certain types of appointments.
• For such tasks, the system makes the conversational experience as natural as possible.
• One of the key research insights was to constrain Duplex to closed domains.
• Duplex can only carry out natural conversations after being deeply trained in such domains
Human-like Data Exploration
Requirements
- User type: From expert to data consumer
- Knowledge: Intuition about the data
- Info need: Not necessarily sure what to ask
- Interaction: Intuitive, natural interaction
Human-like Data Exploration
• converses with the user in a more natural bilateral interaction;
• actively guides the user offering explanations and suggestions;
• keeps track of the context and can respond and adapt accordingly;
• constantly improves its behavior by learning and adapting.
An Intelligent Data Assistant
[Architecture diagram with four components: Syntax-based (NLP) Query Understanding, Data-based Query Understanding, NL Query Explanations, Query Recommendations]
Step 1: Let the user ask using natural language
Facts
Unlike with search engines, users tend to express sophisticated query logic to a data assistant and expect perfect results.
Translating a natural language query to a structured query is hard!
Challenges: from the NL Side
Synonymy: multiple words with the same meaning, e.g., "movies" and "films".
Polysemy: a single word has multiple meanings, e.g., Paris as in the city and Paris Hilton.
Syntactic: multiple readings based on syntax, e.g., "Find all German movie directors" could mean "directors that have directed German movies" or "directors from Germany that have directed a movie".
Semantic: multiple meanings for a sentence, e.g., "Are Brad and Angelina married?" Are they married to each other or separately?
Paraphrasing: multiple ways to say the same thing, e.g., "how many people live in ..." could be a mention of the "Population" column.
Context-dependent terms: e.g., in "Return the top product", the term "top" is a modifier for "product". Does it mean based on popularity? Based on number of sales?
Elliptical queries: sentences from which one or more words are omitted, e.g., "Return the movies of Clooney".
Non-exact matching: mentions do not map exactly to values or tables/attributes, e.g., in "Who is the best actress in 2011", "actress" should map to the "actor" column.
Challenges: from the Data Side
Complex syntax: SQL is a structured language with a strict grammar and limited expressivity compared to natural language, e.g., "Return the movie with the best rating". It should look like "SELECT name, MAX(rating) FROM Movie;" but it is WAY more complicated.
Database structure: e.g., for the term "date" a system may need to retrieve three attributes: year, month, day.
Multiple relationships: mentions may connect in multiple ways (join disambiguation), e.g., "Woody Allen movies" may need several tables to be joined.
Ranking: how to rank multiple answers, e.g., "Return Woody Allen movies".
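To make the "best rating" example concrete, here is a minimal sketch using Python's built-in sqlite3 module. The Movie table, its two columns, and the sample rows are made up for illustration. Many SQL engines reject a bare column next to an aggregate without GROUP BY, so one correct formulation first finds the maximum rating in a subquery and then returns every movie that reaches it.

```python
import sqlite3

# Hypothetical toy table; names and ratings are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movie (name TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO Movie VALUES (?, ?)",
    [("Movie A", 7.1), ("Movie B", 8.4), ("Movie C", 8.4), ("Movie D", 6.0)],
)

# The subquery computes the best rating; the outer query returns all ties.
best_rated_sql = """
    SELECT name
    FROM Movie
    WHERE rating = (SELECT MAX(rating) FROM Movie)
    ORDER BY name
"""
best_movies = [row[0] for row in conn.execute(best_rated_sql)]
print(best_movies)
```

Even this small case raises a design question that the natural language query leaves open: whether ties should all be returned or only one "best" movie.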
Ask a query
What movies have the same director as the movie "Revolutionary Road"?
Understanding Syntax
Step 1. Understand the natural language query linguistically.
Generate a dependency parse tree:
• part-of-speech (POS) tags that describe each word's syntactic function
• syntactic relationships between words in the sentence
Pipeline so far: NLQ → Syntactic Parser
Step 2. Map query elements to data elements and keep the best mappings.
1. Map query elements to data elements:
• tables, attributes, values – using indexes
• commands (e.g., order by) – using a dictionary
2. Keep best mappings
Pipeline so far: NLQ → Syntactic Parser → Node Mapper
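The node-mapping step can be sketched roughly as follows. This is an illustration only: the schema index, value index, and command dictionary are hypothetical, and the string similarity from Python's difflib stands in for whatever matching the actual system uses.

```python
from difflib import SequenceMatcher

# Hypothetical schema index: schema element -> searchable label.
SCHEMA_INDEX = {
    "Movie": "movie",
    "Movie.title": "title",
    "Director": "director",
    "Director.name": "name",
}
# Hypothetical value index: stored value -> attribute that holds it.
VALUE_INDEX = {"revolutionary road": "Movie.title"}
# Hypothetical command dictionary: phrase -> SQL construct.
COMMANDS = {"return": "SELECT", "same": "=", "order by": "ORDER BY"}


def similarity(a, b):
    """Similarity in [0, 1] between two strings (a stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def map_node(token, threshold=0.8):
    """Return candidate (target, score) mappings for one parse-tree node."""
    key = token.lower().strip('"')
    if key in COMMANDS:      # commands are looked up in the dictionary
        return [(COMMANDS[key], 1.0)]
    if key in VALUE_INDEX:   # values are looked up in the value index
        return [(VALUE_INDEX[key], 1.0)]
    scored = [(elem, similarity(key, label)) for elem, label in SCHEMA_INDEX.items()]
    # Keep only the best mappings: above the threshold, highest score first.
    return sorted((s for s in scored if s[1] >= threshold), key=lambda s: -s[1])


tokens = ["Return", "movies", "director", '"Revolutionary Road"']
mappings = {token: map_node(token) for token in tokens}
for token, candidates in mappings.items():
    print(token, "->", candidates)
```

The threshold is the knob that implements "keep best mappings": lowering it admits more candidates and pushes more disambiguation work to later stages.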
Step 3. Map the parse tree to the database structure and build a query tree.
Pipeline so far: NLQ → Syntactic Parser → Node Mapper → Tree Mapper
Step 4. Generate the SQL query to execute.
Pipeline: NLQ → Syntactic Parser → Node Mapper → Tree Mapper → SQL Generator → SQL
Example: What movies have the same director as the movie "Revolutionary Road"?

Syntactic Parser (parse tree nodes): ROOT, Return, director, movie, "Revolutionary Road", same, movies

Node Mapper
Keyword                 Schema Element
movie                   Movie
"Revolutionary Road"    Movie.Title
movies                  Movie
director                Director

Tree Mapper (query tree nodes): ROOT, Return, movies, Same, director, movies, director, movie, "Revolutionary Road"

SQL Generator
Main Query
SELECT DISTINCT movie.title
FROM movie, block0, block1
WHERE movie.mid = block0.mid AND
block0.pk_director = block1.pk_director
Block0
SELECT director.did, movie.mid
FROM movie, director, directed_by
WHERE movie.mid = directed_by.msid AND
directed_by.did = director.did
Block1
SELECT director.did, movie.mid
FROM movie, director, directed_by
WHERE movie.title = 'Revolutionary Road' AND
movie.mid = directed_by.msid AND
directed_by.did = director.did
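To see the block-structured query run end to end, here is a small sketch using Python's built-in sqlite3 module. Several things are assumptions for illustration: the toy rows, the reduced three-table schema, expressing Block0 and Block1 as common table expressions, aliasing director.did as pk_director so the main query's column reference resolves, and closing Block1 with the same director join that Block0 uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema and rows, just enough to exercise the query.
conn.executescript("""
    CREATE TABLE movie (mid INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE director (did INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE directed_by (msid INTEGER, did INTEGER);
    INSERT INTO movie VALUES (1, 'Revolutionary Road'), (2, 'Film X'), (3, 'Film Y');
    INSERT INTO director VALUES (10, 'Director A'), (11, 'Director B');
    INSERT INTO directed_by VALUES (1, 10), (2, 10), (3, 11);
""")

query = """
    WITH block0 AS (  -- every (director, movie) pair
        SELECT director.did AS pk_director, movie.mid
        FROM movie, director, directed_by
        WHERE movie.mid = directed_by.msid AND directed_by.did = director.did
    ),
    block1 AS (       -- the director(s) of 'Revolutionary Road'
        SELECT director.did AS pk_director, movie.mid
        FROM movie, director, directed_by
        WHERE movie.title = 'Revolutionary Road'
          AND movie.mid = directed_by.msid AND directed_by.did = director.did
    )
    SELECT DISTINCT movie.title
    FROM movie, block0, block1
    WHERE movie.mid = block0.mid AND block0.pk_director = block1.pk_director
    ORDER BY movie.title
"""
titles = [row[0] for row in conn.execute(query)]
print(titles)
```

Note that the result includes "Revolutionary Road" itself, since nothing in the generated query excludes the reference movie.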
Why is Parsing So Hard For Computers to Get Right?
• Human languages show remarkable levels of ambiguity.
• It is not uncommon for moderate-length sentences to have hundreds, thousands, or even tens of thousands of possible syntactic structures.
• A natural language parser must somehow search through all of these alternatives and find the most plausible structure given the context.
Step 1: Let the user ask using natural language
Ask a query
Show me Italian restaurants
Not much value a parser can add
Ambiguity
Several possible mappings
1. "business categorized as restaurant and as Italian": "restaurant" = Category(category), "italian" = Category(category)
2. "business categorized as Italian whose name includes restaurant": "restaurant" = Business(name), "italian" = Category(category)
3. "business categorized as Italian whose address includes restaurant": "restaurant" = Business(address), "italian" = Category(category)
4. "business categorized as restaurant that serves Italian": "restaurant" = Category(category), "italian" = Attribute(value)
5. "business whose address contains restaurant and Italian": "restaurant" = Business(address), "italian" = Business(address)
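The five mappings above are a subset of a larger space: every keyword can map to several schema elements, and each combination is a candidate interpretation. A minimal sketch of that enumeration follows; the candidate lists are taken from the mappings shown, while the function name and data layout are hypothetical.

```python
from itertools import product

# Candidate schema elements per keyword, collected from the mappings above.
candidates = {
    "restaurant": ["Category(category)", "Business(name)", "Business(address)"],
    "italian": ["Category(category)", "Attribute(value)", "Business(address)"],
}


def enumerate_interpretations(candidates):
    """Return every keyword-to-element assignment as a dict."""
    keywords = list(candidates)
    combos = product(*(candidates[k] for k in keywords))
    return [dict(zip(keywords, combo)) for combo in combos]


interpretations = enumerate_interpretations(candidates)
print(len(interpretations), "candidate interpretations")
for interpretation in interpretations:
    print(interpretation)
```

With only two keywords and three candidates each there are already nine combinations, which is why ranking them matters.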
Several possible query interpretations, from likely to unlikely
Too many ways to interpret a query
• which one(s) represent user intent?
• how do we rank them?
Analyzing Data
[Pipeline diagram: NLQ → Entity Mapper → Interpretation Generator → ML-based Disambiguation → SQL Generator → SQL; expert input and query logs provide the training data]
Example features for attribute mappings
• Probability: captures the commonality of a keyword in an attribute
• Attribute_WordCount: the number of all words in an attribute
• Exclusivity: an adapted version of the Gini index that captures the power of each mapping
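As a rough illustration of the first two features, the sketch below computes Attribute_WordCount and Probability over a toy set of attribute values. The data is invented, and the exact formula is an assumption: Probability is taken here as the keyword's share of all words stored in the attribute. Exclusivity is left out because its formula is not given in the slides.

```python
from collections import Counter

# Hypothetical toy data: attribute -> text values stored in that attribute.
attribute_values = {
    "Category(category)": ["Restaurant", "Italian", "Restaurant", "Bar"],
    "Business(name)": ["Luigi Italian Restaurant", "Corner Bar"],
    "Business(address)": ["12 Main Street", "5 Restaurant Row"],
}


def word_counts(values):
    """Count lower-cased words across all values of one attribute."""
    return Counter(word.lower() for value in values for word in value.split())


def attribute_word_count(attribute):
    """Attribute_WordCount: number of all words in the attribute."""
    return sum(word_counts(attribute_values[attribute]).values())


def probability(keyword, attribute):
    """Assumed definition: keyword occurrences / Attribute_WordCount."""
    counts = word_counts(attribute_values[attribute])
    total = sum(counts.values())
    return counts[keyword.lower()] / total if total else 0.0


for keyword in ("restaurant", "italian"):
    for attribute in attribute_values:
        print(keyword, attribute, round(probability(keyword, attribute), 3))
```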
Example features for attribute combinations
Min_Prob: the minimum of the mapping probabilities inside an attribute combination, used as a single value that represents the whole combination.
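A minimal sketch of Min_Prob, assuming each keyword-to-attribute mapping already has a probability. The probability values below are invented for illustration, and sorting by Min_Prob alone is a simplification: in the talk it is one feature among several fed to the disambiguation model.

```python
# Hypothetical per-mapping probabilities (keyword, attribute) -> probability.
mapping_probability = {
    ("restaurant", "Category(category)"): 0.50,
    ("restaurant", "Business(name)"): 0.20,
    ("restaurant", "Business(address)"): 0.15,
    ("italian", "Category(category)"): 0.25,
    ("italian", "Attribute(value)"): 0.10,
    ("italian", "Business(address)"): 0.01,
}

# The five attribute combinations from the Italian restaurants example.
combinations = [
    [("restaurant", "Category(category)"), ("italian", "Category(category)")],
    [("restaurant", "Business(name)"), ("italian", "Category(category)")],
    [("restaurant", "Business(address)"), ("italian", "Category(category)")],
    [("restaurant", "Category(category)"), ("italian", "Attribute(value)")],
    [("restaurant", "Business(address)"), ("italian", "Business(address)")],
]


def min_prob(combination):
    """Min_Prob: the weakest mapping decides the score of the combination."""
    return min(mapping_probability[mapping] for mapping in combination)


ranked = sorted(combinations, key=min_prob, reverse=True)
for combination in ranked:
    print(min_prob(combination), combination)
```

Taking the minimum means a combination is only as plausible as its least plausible mapping, so one unlikely mapping is enough to push an interpretation down.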
IR_Score: We compute a relevance score:
- For each initial attribute, we compute the single-attribute relevance score
- The single-attribute scores for an attribute combination are combined into a final score
Step 2: Let the system respond in natural language
NL Explanations
User: comedies by Woody Allen
System: director or producer Woody Allen?
User: director

Candidate structured queries:

select d.name, m.title
from MOVIE m, DIRECTED r, DIRECTOR d, GENRE g
where m.id=r.mid and r.did=d.id
and m.id = g.mid and d.name = 'Woody Allen'
and g.genre = 'comedy'

select a.name, m.title
from MOVIE m, CAST c, ACTOR a, GENRE g
where m.id=c.mid and c.aid=a.id
and m.id = g.mid and a.name = 'Woody Allen'
and g.genre = 'comedy'
Generating Explanations
Domain-independent graph traversal for efficiently exploring query graphs and composing query descriptions as phrases in natural language.
[Pipeline diagram: Structured Query → Annotated Query Graph → Template-based Synthesis → NL explanation; annotations and templates feed the synthesis]
Annotations
• Relation ACTOR → "actors"
• Attribute "fname" → "firstname"
• Function MAX → "the greatest"
Generating Explanations
Query graph: Actors (name), Cast, Movies (Year = 2010)
Templates:
l(actors) + 'that play in' + l(movies)
l(movies) + 'in' + l(year)
Structured query:
select a.name
from actors a, movies m, cast c
where a.id=c.aid and c.mid=m.mid
and year=2010
NL explanation: Return the name of the actors for actors that play in movies in 2010
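The template mechanism in this example can be sketched as below. The label function l() and the two templates come from the slide; the LABELS dictionary, the explain() function, and the way the two templates are chained on the shared "movies" node are assumptions made for illustration.

```python
# Hypothetical annotation base: schema element -> natural language label.
LABELS = {"actors": "actors", "movies": "movies", "name": "name"}


def l(element):
    """Look up the natural language label annotated on a schema element."""
    return LABELS[element]


def explain(attribute, relation, join, selection):
    """Compose an explanation from a projection, one join, and one selection.

    join      = (left relation, connecting phrase, right relation)
    selection = (relation, preposition, value)
    """
    left, phrase, right = join
    sel_relation, preposition, value = selection
    head = f"Return the {l(attribute)} of the {l(relation)}"
    # Template: l(actors) + 'that play in' + l(movies)
    path = f"{l(left)} {phrase} {l(right)}"
    # Template: l(movies) + 'in' + l(year), chained on the shared relation
    if sel_relation == right:
        path += f" {preposition} {value}"
    return f"{head} for {path}"


explanation = explain(
    "name",
    "actors",
    join=("actors", "that play in", "movies"),
    selection=("movies", "in", 2010),
)
print(explanation)
```

Because the labels and phrases live in data rather than in code, the same traversal can be reused on another schema by swapping the annotations and templates.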
What about user NL queries?
We can use our knowledge of translating past NL queries to synthesize NL explanations of new queries.
[Pipeline diagram: Structured Query → Annotated Query Graph → Template-based Synthesis → NL explanation; inputs: annotations + templates and NL->SQL logs]
Step 3: Help the user ask the right question
Guiding the user
The user needs help:
• discovering the data in the first place
• knowing what questions may be asked
• finding what to do next
Hello??
Query Recommendations
Two settings
Cold-start:
Starting with no previous interaction.
Show a set of starter queries that users could use to get some initial answers from the dataset and start understanding the data better.
Warm-start:
Exploring and looking for answers in a new data set takes time and effort.
At each step, the user may not know what she should do next.
The system can leverage the user's interactions (queries) to show possible next queries.
Cold-start
[Diagram: data statistics, user logs, and example queries feed Starter Query Generation, which uses metrics to produce starter queries]
Warm-start
[Diagram: two approaches produce next queries from the user query. Generative approach: structural modifications guided by data statistics. Log-based approach: transition probabilities and query similarities derived from the query log]
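The log-based side of warm-start recommendation can be illustrated with transition probabilities estimated from past sessions. This is a sketch under simplifying assumptions: the query log is a hypothetical list of sessions, queries are reduced to opaque identifiers, and only first-order transitions are counted (query similarities are not modeled).

```python
from collections import Counter, defaultdict

# Hypothetical query log: one ordered list of query identifiers per session.
sessions = [
    ["q_italian", "q_italian_open_now", "q_italian_top_rated"],
    ["q_italian", "q_italian_top_rated"],
    ["q_italian", "q_italian_open_now"],
    ["q_pizza", "q_italian"],
]


def transition_probabilities(sessions):
    """Estimate P(next query | current query) from consecutive query pairs."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    return {
        query: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for query, followers in counts.items()
    }


def recommend_next(query, probs, k=2):
    """Return up to k most likely next queries (ties broken alphabetically)."""
    followers = probs.get(query, {})
    return sorted(followers, key=lambda nxt: (-followers[nxt], nxt))[:k]


probs = transition_probabilities(sessions)
print(recommend_next("q_italian", probs))
```

A query that never appears in the log gets no recommendations here, which is where the generative approach or query similarities would have to take over.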
Step 4: Putting everything together
Intelligent Data Assistant
[Architecture diagram]
Components: Syntax-based (NLP) Query Understanding; Data-based Query Understanding; NL Query Explanations; Query Recommendations
Knowledge stores: Knowledge + Expert Bases; Annotation + Template Bases; Query + Translation Logs; Statistics; Query Similarity Graph; Query Transition Graph
Data Processing (on SPARK Core): SPARK SQL, SPARK CoreNLP, SPARK MLlib, GraphX, TensorFlow
Storage: HDFS, Parquet, Lucene
Intelligent Data Assistant
What are you looking for today?
Contenu connexe

Tendances

An AI-Powered Chatbot to Simplify Apache Spark Performance Management
An AI-Powered Chatbot to Simplify Apache Spark Performance ManagementAn AI-Powered Chatbot to Simplify Apache Spark Performance Management
An AI-Powered Chatbot to Simplify Apache Spark Performance Management
Databricks
 

Tendances (20)

Vectorized R Execution in Apache Spark
Vectorized R Execution in Apache SparkVectorized R Execution in Apache Spark
Vectorized R Execution in Apache Spark
 
Stream Processing: Choosing the Right Tool for the Job
Stream Processing: Choosing the Right Tool for the JobStream Processing: Choosing the Right Tool for the Job
Stream Processing: Choosing the Right Tool for the Job
 
Congressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4jCongressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4j
 
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFsAutomating Predictive Modeling at Zynga with PySpark and Pandas UDFs
Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs
 
AI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionAI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat Detection
 
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
 
Spark Meetup @ Netflix, 05/19/2015
Spark Meetup @ Netflix, 05/19/2015Spark Meetup @ Netflix, 05/19/2015
Spark Meetup @ Netflix, 05/19/2015
 
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use CaseApache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
 
A whirlwind tour of graph databases
A whirlwind tour of graph databasesA whirlwind tour of graph databases
A whirlwind tour of graph databases
 
An AI-Powered Chatbot to Simplify Apache Spark Performance Management
An AI-Powered Chatbot to Simplify Apache Spark Performance ManagementAn AI-Powered Chatbot to Simplify Apache Spark Performance Management
An AI-Powered Chatbot to Simplify Apache Spark Performance Management
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflow
 
SparkML: Easy ML Productization for Real-Time Bidding
SparkML: Easy ML Productization for Real-Time BiddingSparkML: Easy ML Productization for Real-Time Bidding
SparkML: Easy ML Productization for Real-Time Bidding
 
Powering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script TransformationPowering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script Transformation
 
AI from your data lake: Using Solr for analytics
AI from your data lake: Using Solr for analyticsAI from your data lake: Using Solr for analytics
AI from your data lake: Using Solr for analytics
 
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
 
What’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics StackWhat’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics Stack
 
Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...
Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...
Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...
 
Intro to graphs for HR analytics
Intro to graphs for HR analyticsIntro to graphs for HR analytics
Intro to graphs for HR analytics
 

Similaire à A Spark-Based Intelligent Assistant: Making Data Exploration in Natural Language Real

Five Ways To Do Data Analytics "The Wrong Way"
Five Ways To Do Data Analytics "The Wrong Way"Five Ways To Do Data Analytics "The Wrong Way"
Five Ways To Do Data Analytics "The Wrong Way"
Discover Pinterest
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
Marcel Kurovski
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
inovex GmbH
 
Introduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataIntroduction to question answering for linked data & big data
Introduction to question answering for linked data & big data
Andre Freitas
 

Similaire à A Spark-Based Intelligent Assistant: Making Data Exploration in Natural Language Real (20)

CarolinaCon Presentation on Streaming Analytics
CarolinaCon Presentation on Streaming AnalyticsCarolinaCon Presentation on Streaming Analytics
CarolinaCon Presentation on Streaming Analytics
 
Five Ways To Do Data Analytics "The Wrong Way"
Five Ways To Do Data Analytics "The Wrong Way"Five Ways To Do Data Analytics "The Wrong Way"
Five Ways To Do Data Analytics "The Wrong Way"
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
 
Introduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataIntroduction to question answering for linked data & big data
Introduction to question answering for linked data & big data
 
ChatGPT-and-Generative-AI-Landscape Working of generative ai search
ChatGPT-and-Generative-AI-Landscape Working of generative ai searchChatGPT-and-Generative-AI-Landscape Working of generative ai search
ChatGPT-and-Generative-AI-Landscape Working of generative ai search
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
 
Data Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data DiscoveryData Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data Discovery
 
Model evaluation in the land of deep learning
Model evaluation in the land of deep learningModel evaluation in the land of deep learning
Model evaluation in the land of deep learning
 
Intro to Data Science
Intro to Data ScienceIntro to Data Science
Intro to Data Science
 
Understanding the New World of Cognitive Computing
Understanding the New World of Cognitive ComputingUnderstanding the New World of Cognitive Computing
Understanding the New World of Cognitive Computing
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean Downes
 
AI, Search, and the Disruption of Knowledge Management
AI, Search, and the Disruption of Knowledge ManagementAI, Search, and the Disruption of Knowledge Management
AI, Search, and the Disruption of Knowledge Management
 
Data Science Demystified
Data Science DemystifiedData Science Demystified
Data Science Demystified
 
The New Database Frontier: Harnessing the Cloud
The New Database Frontier: Harnessing the CloudThe New Database Frontier: Harnessing the Cloud
The New Database Frontier: Harnessing the Cloud
 
ITCamp 2019 - Andy Cross - Machine Learning with ML.NET and Azure Data Lake
ITCamp 2019 - Andy Cross - Machine Learning with ML.NET and Azure Data LakeITCamp 2019 - Andy Cross - Machine Learning with ML.NET and Azure Data Lake
ITCamp 2019 - Andy Cross - Machine Learning with ML.NET and Azure Data Lake
 
Cse443 Project Report - LPU (Modern Big Data Analysis with SQL Specialization)
Cse443 Project Report - LPU (Modern Big Data Analysis with SQL Specialization)Cse443 Project Report - LPU (Modern Big Data Analysis with SQL Specialization)
Cse443 Project Report - LPU (Modern Big Data Analysis with SQL Specialization)
 
Democratizing AI with Apache Spark
Democratizing AI with Apache SparkDemocratizing AI with Apache Spark
Democratizing AI with Apache Spark
 
Data sci sd-11.6.17
Data sci sd-11.6.17Data sci sd-11.6.17
Data sci sd-11.6.17
 
Session 0.0 poster minutes madness
Session 0.0   poster minutes madnessSession 0.0   poster minutes madness
Session 0.0 poster minutes madness
 

Plus de Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 



A Spark-Based Intelligent Assistant: Making Data Exploration in Natural Language Real

  • 1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
  • 2. Georgia Koutrika, ATHENA Research Center A Spark-based Intelligent Assistant Making Data Exploration in Natural Language Real #UnifiedDataAnalytics #SparkAISummit
  • 3. Data, Data, Data. Data growth: more data than humans can process and comprehend. Data democratization: from scientists to the public, more and more users are consumers of data.
  • 4. HOW CAN WE EXPLORE AND LEVERAGE OUR DATA?
  • 5. Diagram: SQL queries are sent to a data store, which returns results. Example query: SELECT p.status FROM conference_attendees p WHERE p.conference = 'SPARK+AI2019'
  • 7. The phases of data exploration
  • 8. The “SQL” Age. Users ask: “which cities have year-round average temperature above 50 degrees?” A programmer translates the question for the DBMS: SELECT * FROM CITIES WHERE 50 < (SELECT AVG(TEMP_F) FROM STATS WHERE CITIES.ID = STATS.ID); Period characteristics: limited access (for all but the privileged); communication bottleneck (the guru); data starvation (for those that really need it); limited interaction (query answering). User type: sophisticated user. Knowledge: precise knowledge of data and schema. Info need: precise knowledge of their need. Interaction: user “speaks” SQL fluently; DBMS “responds” with tables.
  • 9. The “Baby Talk” Age. A business user asks the DBMS: “cities with year-round average temperature above 50 degrees?” Period characteristics: user pretty much knows what to ask; user queries are relatively simple; query answering paradigm (still). User type: domain expert. Knowledge: user understands the data domain. Info need: precise knowledge of their need. Interaction: user not familiar with SQL; DBMS “responds” with tables and graphs.
  • 10. Chatbots (https://medium.com/swlh/chatbots-of-the-future-86b5bf762bb4). A chatbot: mimics conversations with people; uses artificial intelligence techniques; lives on consumer messaging platforms, as a means for consumers to interact with brands. Drawbacks: primarily text interfaces based on rules; encourage canned, linear-driven interactions; deal with simple, unambiguous questions (“what is the weather forecast today”); cannot answer random or complex queries over data repositories.
  • 11. Conversational AI. For example, Google Duplex (demo released in May 2018): the technology is directed towards completing specific tasks, such as scheduling certain types of appointments. For such tasks, the system makes the conversational experience as natural as possible. One of the key research insights was to constrain Duplex to closed domains. Duplex can only carry out natural conversations after being deeply trained in such domains.
  • 12. Human-like Data Exploration. Requirements: User type: from expert to data consumer. Knowledge: intuition about the data. Info need: not necessarily sure what to ask. Interaction: intuitive, natural interaction.
  • 13. Human-like Data Exploration. An Intelligent Data Assistant: converses with the user in a more natural bilateral interaction; actively guides the user, offering explanations and suggestions; keeps track of the context and can respond and adapt accordingly; constantly improves its behavior by learning and adapting.
  • 14. Intelligent Data Assistant. Components: Syntax-based (NLP) Query Understanding; Data-based Query Understanding; NL Query Explanations; Query Recommendations.
  • 15. Step 1: Let the user ask using natural language. Components: Syntax-based (NLP) Query Understanding; Data-based Query Understanding; NL Query Explanations; Query Recommendations.
  • 16. Facts. Unlike with search engines, users tend to express sophisticated query logic to a data assistant and expect perfect results. Translating a natural language query to a structured query is hard!
  • 17. Challenges: from the NL Side. Synonymy: multiple words with the same meaning, e.g., “movies” and “films”. Polysemy: a single word has multiple meanings, e.g., Paris as in the city and Paris Hilton. Syntactic ambiguity: multiple readings based on syntax, e.g., “Find all German movie directors” could mean “directors that have directed German movies” or “directors from Germany that have directed a movie”. Semantic ambiguity: multiple meanings for a sentence, e.g., “Are Brad and Angelina married?”: to each other, or each to someone else? Paraphrasing: multiple ways to say the same thing, e.g., “how many people live in ...” could be a mention of the “Population” column. Context-dependent terms: e.g., in “Return the top product”, the term “top” is a modifier for “product”; does it mean based on popularity or based on number of sales? Elliptical queries: sentences from which one or more words are omitted, e.g., “Return the movies of Clooney”. Non-exact matching: mentions do not map exactly to values or tables/attributes, e.g., in “Who is the best actress in 2011”, “actress” should map to the “actor” column.
  • 18. Challenges: from the Data Side. Complex syntax: SQL is a structured language with a strict grammar and limited expressivity compared to natural language, e.g., “Return the movie with the best rating” looks like it should be “SELECT name, MAX(rating) FROM Movie;” but the correct query is far more complicated. Database structure: e.g., for the term “date” a system may need to retrieve three attributes: year, month, day. Multiple relationships: mentions may connect in multiple ways (join disambiguation), e.g., “Woody Allen movies” may need several tables to be joined. Ranking: how to rank multiple answers, e.g., for “Return Woody Allen movies”.
  • 19. Ask a query: What movies have the same director as the movie “Revolutionary Road”
  • 20. Understanding Syntax. Step 1 (Syntactic Parser): understand the natural language query linguistically. Generate a dependency parse tree: part-of-speech (POS) tags that describe each word's syntactic function, plus syntactic relationships between words in the sentence.
  • 21. Understanding Syntax. Step 2 (Node Mapper): (1) map query elements to data elements: tables, attributes, and values using indexes; commands (e.g., order by) using a dictionary; (2) keep the best mappings.
  • 22. Understanding Syntax. Step 3 (Tree Mapper): map the parse tree to the database structure and build a query tree.
  • 23. Understanding Syntax. Step 4 (SQL Generator): generate the SQL query to execute. Full pipeline: NLQ → Syntactic Parser → Node Mapper → Tree Mapper → SQL Generator → SQL.
  • 24. Worked example: What movies have the same director as the movie “Revolutionary Road”. Syntactic Parser output (parse tree nodes): ROOT, Return, director, movie, “Revolutionary Road”, same, movies. Node Mapper output (keyword → schema element): movie → Movie; “Revolutionary Road” → Movie.Tittle; movies → Movie; director → Director. Tree Mapper output (query tree nodes): ROOT, Return, movies, Same, director, movies, director, movie, “Revolutionary Road”. SQL Generator output: Main Query: SELECT DISTINCT movie.tittle FROM movie, block0, block1 WHERE movie.mid = block0.mid AND block0.pk_director = block1.pk_director. Block0: SELECT director.did, movie.mid FROM movie, director, directed_by WHERE movie.mid = directed_by.msid AND directed_by.did = director.did. Block1: SELECT director.did, movie.mid FROM movie, director, directed_by WHERE movie.tittle = 'Revolutionary Road' AND movie.mid = directed_by.msid AND
  • 25. Understanding Syntax. Why is parsing so hard for computers to get right? Human languages show remarkable levels of ambiguity. It is not uncommon for moderate-length sentences to have hundreds, thousands, or even tens of thousands of possible syntactic structures. A natural language parser must somehow search through all of these alternatives and find the most plausible structure given the context.
  • 26. Step 1: Let the user ask using natural language. Components: Syntax-based (NLP) Query Understanding; Data-based Query Understanding; NL Query Explanations; Query Recommendations.
  • 27. Ask a query: “Show me Italian restaurants”. Not much value a parser can add.
  • 29. Ambiguity: several possible query interpretations, on a scale from likely to unlikely. (1) “business categorized as restaurant and as Italian”: “restaurant” = Category(category), “italian” = Category(category). (2) “business categorized as Italian whose name includes restaurant”: “restaurant” = Business(name), “italian” = Category(category). (3) “business categorized as Italian whose address includes restaurant”: “restaurant” = Business(address), “italian” = Category(category). (4) “business categorized as restaurant that serves Italian”: “restaurant” = Category(category), “italian” = Attribute(value). (5) “business whose address contains restaurant and Italian”: “restaurant” = Business(address), “italian” = Business(address).
  • 30. Ambiguity. Too many ways to interpret a query: which one(s) represent user intent? How do we rank them?
  • 31. Analyzing Data. Pipeline: NLQ → Entity Mapper → Interpretation Generator → ML-based Disambiguation → SQL Generator → SQL. Training data: expert input + query logs.
  • 32. Analyzing Data. Several mappings are possible for each keyword. Example features for attribute mappings: Probability captures the commonality of a keyword in an attribute. Attribute_WordCount is the number of all words in an attribute. Exclusivity is an adapted version of the Gini index that captures the power of each mapping.
  • 33. Analyzing Data. Possible query interpretations: the five interpretations listed on slide 29. Example features for attribute combinations: Min_Prob: the minimum of the mapping probabilities inside an attribute combination, used as a single value that represents the combination's mappings.
  • 34. Analyzing Data. Possible query interpretations: the five interpretations listed on slide 29. Example features for attribute combinations: IR_Score: a relevance score computed in two steps. For each initial attribute, compute the single-attribute relevance score; then combine the single-attribute scores of an attribute combination into a final score.
  • 35. Step 2: Let the system respond in natural language. Components: Syntax-based (NLP) Query Understanding; Data-based Query Understanding; NL Query Explanations; Query Recommendations.
  • 36. NL Explanations. User: “comedies by Woody Allen”. System: “director or producer Woody Allen?” User: “director”. Candidate query 1: select d.name, m.title from MOVIE m, DIRECTED r, DIRECTOR d, GENRE g where m.id=r.mid and r.did=d.id and m.id = g.mid and d.name = 'Woody Allen' and g.genre = 'comedy'. Candidate query 2: select d.name, m.title from MOVIE m, CAST c, ACTOR a, GENRE g where m.id=c.mid and c.did=a.id and m.id = g.mid and a.name = 'Woody Allen' and g.genre = 'comedy'.
  • 37. Generating Explanations. Domain-independent graph traversal for efficiently exploring query graphs and composing query descriptions as phrases in natural language. Pipeline: Structured Query → Annotated Query Graph → Template-based Synthesis → NL explanation, driven by annotations and templates. Example annotations: relation ACTOR → “actors”; attribute “fname” → “firstname”; function MAX → “the greatest”.
  • 38. Generating Explanations. Query: select a.name from actors a, movies m, cast c where a.id=c.aid and c.mid=m.mid and year=2010. Query graph: Actors (name), Cast, Movies (Year = 2010). Templates: l(actors) + ‘that play in’ + l(movies); l(movies) + ‘in’ + l(year). Explanation: “Return the name of the actors for actors that play in movies in 2010”.
  • 39. What about user NL queries? We can use our knowledge of translating past NL queries (NL->SQL logs) to synthesize NL explanations of new queries. Pipeline: Structured Query → Annotated Query Graph → Template-based Synthesis → NL explanation, using annotations, templates, and the NL->SQL logs.
  • 40. Step 3: Help the user ask the right question. Components: Syntax-based (NLP) Query Understanding; Data-based Query Understanding; NL Query Explanations; Query Recommendations.
  • 41. Guiding the user. The user needs help: discovering the data in the first place; knowing what questions may be asked; finding what to do next.
  • 42. Query Recommendations. Two settings. Cold-start: starting with no previous interaction, show a set of starter queries that users can run to get some initial answers from the dataset and start understanding the data better. Warm-start: exploring and looking for answers in a new data set takes time and effort, and at each step the user may not know what she should do next; the system can leverage the user's interactions (queries) to show possible next queries.
  • 43. Query Recommendations, cold-start. Starter Query Generation draws on data statistics, user logs, example queries, and metrics to produce starter queries.
  • 44. Query Recommendations, warm-start. Generative approach: structural modifications of the user query, using data statistics. Log-based approach: query similarities and transition probabilities derived from the query log. Both produce next queries.
  • 45. Step 4: Putting everything together
  • 46. Intelligent Data Assistant (architecture). Components: Syntax-based (NLP) Query Understanding; Data-based Query Understanding; NL Query Explanations; Query Recommendations. Data: Knowledge + Expert Bases; Annotation + Template Bases; Query + Translation Logs; Statistics; Query Similarity Graph; Query Transition Graph. Data processing: Spark SQL, Spark CoreNLP, Spark MLlib, GraphX, TensorFlow, on top of Spark Core. Storage: HDFS, Parquet, Lucene.
  • 47. Intelligent Data Assistant: What are you looking for today?
  • 48. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT
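
Slides 32 and 33 name Probability and Min_Prob as features for ranking query interpretations but give no formulas. The sketch below is one plausible reading, not the talk's implementation: Probability is assumed to be the keyword's count in an attribute divided by Attribute_WordCount, and Min_Prob is the minimum of those probabilities over an interpretation's mappings. The attribute contents are hypothetical toy data, and Exclusivity and IR_Score are left out because the slides do not define them.

```python
from collections import Counter

# Hypothetical toy data: the words stored under each attribute.
ATTRIBUTE_WORDS = {
    "Category(category)": ["restaurant", "italian", "restaurant", "bar"],
    "Business(name)": ["luigi", "restaurant", "pizza", "house",
                       "cafe", "roma", "bar", "grill"],
    "Business(address)": ["main", "street", "italian", "avenue", "park",
                          "road", "oak", "lane", "first", "hill"],
}


def probability(keyword, attribute):
    """Assumed definition: keyword count / Attribute_WordCount."""
    words = ATTRIBUTE_WORDS[attribute]
    return Counter(words)[keyword] / len(words)


def min_prob(interpretation):
    """Min_Prob: the weakest keyword-to-attribute mapping in the combination."""
    return min(probability(k, a) for k, a in interpretation.items())


def rank(interpretations):
    """Order interpretations from most to least likely by Min_Prob alone."""
    return sorted(interpretations, key=min_prob, reverse=True)


# Three of the five interpretations from slide 29.
I1 = {"restaurant": "Category(category)", "italian": "Category(category)"}
I2 = {"restaurant": "Business(name)", "italian": "Category(category)"}
I5 = {"restaurant": "Business(address)", "italian": "Business(address)"}

if __name__ == "__main__":
    for interpretation in rank([I5, I2, I1]):
        print(min_prob(interpretation), interpretation)
```

A single feature is used here only to keep the sketch short; slide 31 describes an ML-based disambiguation step trained on expert input and query logs, which would combine several such features.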
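
Slide 38 composes an explanation from labels and templates. A minimal sketch of that idea follows, assuming a query graph simplified to a join path plus selections; the data structures, function names, and the decision to skip the Cast join table are illustrative assumptions, not the system's actual design. Only the two templates and the final sentence come from the slide.

```python
# Labels l(.) for relations and attributes (annotations, as on slide 37).
LABELS = {"actors": "actors", "movies": "movies", "name": "name"}

# Templates from slide 38; the dictionary keys are a hypothetical encoding.
JOIN_TEMPLATES = {("actors", "movies"): "{src} that play in {dst}"}
SELECTION_TEMPLATES = {("movies", "year"): "{rel} in {value}"}


def describe_relation(relation, selections):
    """Label a relation and attach any selection on it, e.g. 'movies in 2010'."""
    phrase = LABELS[relation]
    for (sel_relation, attribute), value in selections.items():
        if sel_relation == relation:
            template = SELECTION_TEMPLATES[(sel_relation, attribute)]
            phrase = template.format(rel=phrase, value=value)
    return phrase


def explain(projection, join_path, selections):
    """Traverse the join path from its end and compose one NL sentence."""
    proj_relation, proj_attribute = projection
    phrase = describe_relation(join_path[-1], selections)
    for src, dst in reversed(list(zip(join_path, join_path[1:]))):
        template = JOIN_TEMPLATES[(src, dst)]
        phrase = template.format(src=describe_relation(src, selections), dst=phrase)
    return (f"Return the {LABELS[proj_attribute]} of the "
            f"{LABELS[proj_relation]} for {phrase}")


if __name__ == "__main__":
    print(explain(("actors", "name"), ["actors", "movies"],
                  {("movies", "year"): 2010}))
```

With the slide's query (actors joined to movies, year = 2010) this reproduces the slide's explanation text. Keeping labels and templates in data rather than code is what would make the traversal domain-independent, as slide 37 puts it.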
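
Slide 44 mentions transition probabilities derived from the query log as one basis for warm-start recommendations. The sketch below assumes the simplest possible estimate, first-order transition frequencies between consecutive queries in a session; the session data and function names are hypothetical, and the talk may use a different model.

```python
from collections import Counter, defaultdict

# Hypothetical query log: each inner list is one user session, in order.
SESSIONS = [
    ["italian restaurants", "italian restaurants open now",
     "top rated italian restaurants"],
    ["italian restaurants", "top rated italian restaurants"],
    ["italian restaurants", "italian restaurants open now"],
    ["italian restaurants", "italian restaurants open now"],
]


def transition_probabilities(sessions):
    """Estimate P(next query | current query) from consecutive query pairs."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    return {
        query: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for query, nexts in counts.items()
    }


def recommend_next(query, probabilities, k=2):
    """Return up to k follow-up queries, most probable first."""
    candidates = probabilities.get(query, {})
    return sorted(candidates, key=candidates.get, reverse=True)[:k]


if __name__ == "__main__":
    probs = transition_probabilities(SESSIONS)
    print(recommend_next("italian restaurants", probs))
```

A query that never appears in the log gets no recommendations from this estimate, which is where the query-similarity side of the log-based approach and the generative approach on the same slide would have to take over.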