2. Our interests
S Scalability
S Machine Learning
S Natural Language Understanding
S Java, Python
S VLSI and Scripting Languages
4. What is Information Retrieval?
S In the era of Big Data, data comes in multiple forms
(structured and unstructured text, images, videos),
computing is used across more devices and media, and
consumerism is peaking. IR is the study of algorithms,
tools, and techniques that draw on multiple disciplines
of computer science (Data Mining, Machine Learning,
Computer Vision, Visualization, and Natural Language
Processing) to bring the most relevant information to
the user with minimal cognitive effort.
5. What did we do and learn?
S Different Commercial Vertical Engines
S Elasticsearch
S Java plugin for Elasticsearch
S Search Re-Ranking: An NLP Approach
S Expedia Personalized Search Ranking – 2013
S Computer Vision and Visualization Examples
6. An example from Computational Advertising
S "Night-stand" has different meanings
S If search engines don’t understand the meaning properly, customers lose money
S How do they understand the context?
S Different signals
S User history and query understanding
S NLP is crucial
9.
S Narrow down the search by department
S Entity matching using Levenshtein distance / Soundex algorithm
S Smyth vs. Smith
S Bare String Matching
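The entity-matching bullets above can be illustrated with a minimal sketch. It assumes plain-Python implementations of Levenshtein distance and of the standard American Soundex rules; the function names `levenshtein` and `soundex` are chosen here for illustration and are not taken from the original slides.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Consonant groups of the standard American Soundex code.
_SOUNDEX_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Four-character phonetic key: first letter plus up to three digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    last_digit = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit and digit != last_digit:
            code += digit
        if ch not in "hw":  # h and w do not separate equal codes
            last_digit = digit
    return (code + "000")[:4]


print(levenshtein("Smyth", "Smith"))      # 1
print(soundex("Smyth"), soundex("Smith"))  # S530 S530
```

Bare string matching treats "Smyth" and "Smith" as different entities, while both measures above treat them as near or exact matches.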
10. Commercial Search Engines
S Yelp, Foursquare, or eBay
Multiple signals
1. Is the user looking for a hotel or a salon?
2. What are the different options available? If there are multiple, do
sentiment analysis? Click-rate analysis?
3. Location and social network analysis
4. We need VERY good query understanding
11. What we DON’T care about
S Search (grep) algorithms, PageRank vs.
S Search ranking/relevance
13. Bag of Words Model
S Doesn’t preserve semantics
S Rama went to Lanka in Search of Seetha
S Seetha went to Lanka in Search of Rama
S [1 0 1 1 1 0 1]
S [1 0 1 1 1 1 1]
S Dict = {//sort these words , chaitanya}
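A minimal sketch of the bag-of-words idea, assuming a vocabulary built only from the two example sentences (the slide's own dictionary is not spelled out, so its vectors may be laid out differently). The helper name `bag_of_words` is hypothetical.

```python
from collections import Counter


def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs, ignoring word order."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]


s1 = "Rama went to Lanka in Search of Seetha"
s2 = "Seetha went to Lanka in Search of Rama"

# Assumed vocabulary: the sorted set of words from both sentences.
vocabulary = sorted(set(s1.lower().split()) | set(s2.lower().split()))

v1 = bag_of_words(s1, vocabulary)
v2 = bag_of_words(s2, vocabulary)
print(vocabulary)
print(v1)
print(v2)
print(v1 == v2)  # True: who searched for whom is lost
```

Under this assumed vocabulary both sentences map to the same vector, which is the sense in which the model drops word order and therefore meaning.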
14. Can you do sentiment analysis?
Positive, Negative, Neutral
S The shutter lag of this digital camera is annoying sometimes, especially when capturing cute
baby.
S I received the camera as a Christmas present from relatives and enjoyed it a lot.
S Presence or absence of words alone doesn’t help sentiment analysis
15. We need Better Representations
S C vs. Java: object-oriented modeling
S Properties + methods:
class Student {
    Float getcGPA;
    boolean isHeEligibleToTakeGradCourses() { }
}
S Good structures to represent the data, play with it, and get
meaningful results
18. Query: When Lady Gaga sings
S R1: lady gaga sings and kati perry dances
S R2: lady gaga dances and keri parry sings
S An n-gram or TF-IDF approach works here.
S Why?
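One way to look at the n-gram part of this claim is to count the word bigrams each result shares with the query. This is a minimal sketch with whitespace tokenization; the helper names `ngrams` and `ngram_overlap` are hypothetical.

```python
def ngrams(text, n):
    """Set of word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(query, result, n=2):
    """Number of distinct n-grams that the query and the result share."""
    return len(ngrams(query, n) & ngrams(result, n))


query = "When Lady Gaga sings"
r1 = "lady gaga sings and kati perry dances"
r2 = "lady gaga dances and keri parry sings"

print(ngram_overlap(query, r1))  # 2: ("lady", "gaga") and ("gaga", "sings")
print(ngram_overlap(query, r2))  # 1: only ("lady", "gaga")
```

R1 keeps "gaga sings" adjacent, so it shares more bigrams with the query than R2 does.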
19. Query: When Lady Gaga sings
S R1: lady gaga dances and keri parry sings
S R2: lady gaga dances and sings and katy perry dances
S Does the TF-IDF / Bag of Words / Vector
Space Model approach work here?
S Yes / No?
23. We need a plug-and-play solution
S Create parse tree representations T1, T2,
T3, …, T10 for results R1, R2, R3, …, R10 respectively.
S Create a parse tree representation for the query Q.
S Find the similarity score of each result tree T
with that of Q.
S Sort the results by score and present them to the user.
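The steps above can be sketched as a re-ranking function whose parser and similarity measure are passed in as arguments. Everything named here is an assumption for illustration: `rerank` is a hypothetical helper, and `toy_parse` and `toy_similarity` are simple stand-ins (token tuples and bigram overlap), not real parse trees or a real tree-similarity measure.

```python
from typing import Callable, List, Tuple, TypeVar

Tree = TypeVar("Tree")


def rerank(query: str,
           results: List[str],
           parse: Callable[[str], Tree],
           similarity: Callable[[Tree, Tree], float]) -> List[Tuple[float, str]]:
    """Parse the query and every result, score each result against the
    query, and return (score, result) pairs sorted best first."""
    query_tree = parse(query)
    scored = [(similarity(parse(result), query_tree), result)
              for result in results]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_parse(text: str) -> Tuple[str, ...]:
    """Stand-in for a parser: just the lowercased tokens."""
    return tuple(text.lower().split())


def toy_similarity(result_tree: Tuple[str, ...],
                   query_tree: Tuple[str, ...]) -> float:
    """Stand-in for tree similarity: fraction of query bigrams in the result."""
    query_bigrams = set(zip(query_tree, query_tree[1:]))
    result_bigrams = set(zip(result_tree, result_tree[1:]))
    if not query_bigrams:
        return 0.0
    return len(query_bigrams & result_bigrams) / len(query_bigrams)


results = ["lady gaga dances and keri parry sings",
           "lady gaga sings and kati perry dances"]
ranked = rerank("When Lady Gaga sings", results, toy_parse, toy_similarity)
for score, result in ranked:
    print(round(score, 2), result)
```

Because `parse` and `similarity` are parameters, a syntactic parser and a tree-based similarity could be swapped in without changing `rerank`.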
24. Elasticsearch
S Distributed search server based on Lucene
S Is it a database?
S Is it SQL or NoSQL?
S When we already have a lot of databases, why should we care about it?
Let’s look at it in action.
25. How does Elasticsearch work today?
S It uses TF-IDF for search ranking
S It assigns a score to every document
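To make the idea of TF-IDF scoring concrete, here is a minimal sketch of a textbook variant (length-normalized term frequency times a smoothed inverse document frequency). It is not claimed to match the actual Lucene or Elasticsearch scoring formula, and the three documents and the function name `tf_idf_scores` are made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tf_idf_scores(query, documents):
    """Return one TF-IDF score per document for the given query."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term_counts[term]:
                tf = term_counts[term] / len(tokens)
                idf = math.log(n_docs / doc_freq[term]) + 1.0
                score += tf * idf
        scores.append(score)
    return scores


# Hypothetical documents, not from the original slides.
documents = ["wooden night stand with two drawers",
             "night stand lamp",
             "queen bed frame"]
scores = tf_idf_scores("night stand", documents)
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(round(score, 3), doc)
```

Every document receives a score, including a score of zero for documents that contain none of the query terms; shorter documents that match rank higher under this normalization.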
26. Data Mining approach
S Not everything is natural-language text
S We may have a lot of features, and the interdependencies among
them may not be known to us.
S Big Data does not always mean huge data. It could also be
small data with a huge number of features, which might require
statistics and Data Mining