This document discusses emerging changes in search technologies, including the growing use of semantic web techniques, natural language processing, machine learning, and big data. Key points include:
1) Structure is being derived from unstructured data like text through entity extraction, knowledge graphs, and other semantic technologies.
2) Advances in natural language processing are enabling capabilities like natural language search statements and image recognition.
3) Multiple companies are developing these technologies, not just Google, bringing changes across news and other search applications.
2. 2
Brave New Search World
• The nature of “search” is changing
radically.
• Structure is being created from
(relatively) unstructured data.
• The “Semantic Web” is becoming an
actuality.
• Natural Language Processing (NLP)
and other technologies are being
extensively applied to search and
search-related activities.
3. 3
Brave New Search World
• These technologies are making the following
kinds of things happen:
– “Knowledge graphs”
– “Entity” identification in numerous
applications
– Natural language search statements
– Actual searching of images (not just of
image metadata)
• These advances are coming not just from
Google but from numerous services,
especially for “news” search.
4. 4
Some Themes/Perspectives
• What is happening is more evolutionary than
revolutionary. Many, but not all, of the "pieces" of
the technology have been around for a while.
• Structure is being derived out of (not totally) chaos.
We are going from words to meaning.
• Google isn’t the only player here.
• We can take real advantage of the developments.
• Using what you already know about “search” is
important.
5. 5
Unstructuredness of Data
• Part of the “organization of knowledge” problem
• Particularly acute for textual material
• To a computer, a “word” is a string of characters
bounded by spaces or punctuation and has no
“meaning”.
• When we are searching for something, we are
searching for meaningful things, not character
strings.
• Meaning can be derived from context by the use
of NLP.
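As a minimal sketch of the "string of characters" point above (the sample sentences and the `tokenize` helper are illustrative assumptions, not part of the talk), a plain tokenizer can find the same word in two sentences without knowing that it means different things:

```python
import re

def tokenize(text):
    """Split text into lowercase "words": runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Two sentences that use the same character string with different meanings.
sentences = [
    "She bought a jaguar at the car dealership.",
    "The jaguar hunted along the river bank.",
]

for sentence in sentences:
    tokens = tokenize(sentence)
    # A plain string match finds "jaguar" in both; nothing here says car vs. animal.
    print(tokens, "jaguar" in tokens)
```

The match succeeds in both sentences; telling the car from the animal needs the surrounding context, which is what NLP supplies.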
6. 6
Where We Were Recently
• Boolean Logic
– Actually a precursor/example of Artificial
Intelligence (AI) applied to “search”.
– Still a part of search AI
• Boolean is (from our infancy) a central
aspect of how we think, a part of our
“consciousness”
• Old approach: Searching by concepts
7. 7
Where We Were Recently
“Old” (circa 1975 – 2???)
search strategy
(searching by “concepts”)
OR
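The concept-based strategy can be sketched roughly as follows. The concept chart, the `build_query` and `matches` helpers, and the sample documents are hypothetical, chosen only to show alternate terms being ORed within a concept and the concepts being ANDed together:

```python
def build_query(concepts):
    """OR the alternate terms within each concept, then AND the concepts together."""
    groups = ["(" + " OR ".join(terms) + ")" for terms in concepts]
    return " AND ".join(groups)

def matches(document, concepts):
    """True if the document contains at least one term from every concept."""
    words = set(document.lower().split())
    return all(any(term in words for term in terms) for terms in concepts)

# Hypothetical concept chart: each inner list is one concept and its alternate terms.
concepts = [["teenagers", "adolescents", "youth"], ["smoking", "tobacco"]]

print(build_query(concepts))
print(matches("tobacco use among adolescents", concepts))
print(matches("tobacco advertising rules", concepts))
```

The second document is rejected because it has no term for the first concept, which is exactly the behavior a searcher expects from AND.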
8. 8
Where We Were Recently
(cont.)
• Ranking of web search results was/is
based on a wide range (ca. 200) of factors,
or “signals”
• User-controlled field searching (intitle:
etc.)
• Etc.
9. 9
The “Newer” Technologies
• Semantic Web Technologies
• Artificial Intelligence (AI) used at a broad
level and utilizing various AI subfields
• AI - Expert Systems approaches
• AI - Natural Language Processing (NLP)
• AI - NLP - Entity identification (extraction,
disambiguation, classification, etc.)
• AI - Machine Learning
• Big Data processing
10. 10
Technologies:
The Semantic Web
• W3C “informal” definition – "The Semantic
Web is an extension of the current web in
which information is given well-defined
meaning, better enabling computers and
people to work in cooperation.”
(from Tim Berners-Lee et al, The Semantic Web. Scientific
American, May 2001.)
11. 11
Technologies:
The Semantic Web
• Essence:
• “strings to things”
• “words to meaning”
• Technologically accomplished on webpages by
means of a specialized XML markup language, etc.
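To make "strings to things" a little more concrete, here is a small sketch that builds a piece of structured markup using the schema.org vocabulary. Note the assumptions: it uses the JSON-LD serialization rather than the XML-style markup mentioned on the slide, and the person and organization are invented for illustration.

```python
import json

# Hypothetical page data; the property names come from the schema.org vocabulary.
person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Jane Example",
    "jobTitle": "Research Librarian",
    "worksFor": {"@type": "Organization", "name": "Example University"},
}

# The output would be embedded in a page inside a
# <script type="application/ld+json"> element.
print(json.dumps(person, indent=2))
```

With markup like this, a crawler no longer sees only the string "Jane Example"; it is told that the string names a person who works for a particular organization.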
12. 12
Technologies:
The Semantic Web
• Idea born pre-1999
• In practice, also requires other technologies
such as Natural Language Processing, etc.
• 2006 - Berners-Lee and colleagues stated
that: "This simple idea…remains largely
unrealized".
• 2013 - more than four million Web domains
contained Semantic Web markup.
13. 13
Technologies:
AI - Expert Systems
• Search results ranking has long used an
“expert systems” approach, mimicking what an
experienced researcher looks for:
– Words appearing in the title
– Number of times cited (linked-to)
– Proximity of words
– Words in the abstract
– Words in headings
– Etc.
• This will continue, more and more
automatically.
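A toy version of this "expert systems" style of ranking might look like the sketch below. The weights, field names, and documents are all hypothetical, word proximity is left out for brevity, and real engines combine far more signals than this:

```python
# Hypothetical weights; real engines combine far more signals, with tuned values.
WEIGHTS = {"title": 3.0, "heading": 2.0, "abstract": 1.5, "inbound_link": 0.1}

def score(doc, query_terms):
    """Add up weighted evidence that a document is about the query terms."""
    total = 0.0
    for term in query_terms:
        if term in doc["title"].lower().split():
            total += WEIGHTS["title"]
        if any(term in h.lower().split() for h in doc["headings"]):
            total += WEIGHTS["heading"]
        if term in doc["abstract"].lower().split():
            total += WEIGHTS["abstract"]
    # "Number of times cited (linked-to)" counts once per document, not per term.
    total += WEIGHTS["inbound_link"] * doc["inbound_links"]
    return total

docs = [
    {"title": "Semantic search basics", "headings": ["Entities"],
     "abstract": "How search engines use entities.", "inbound_links": 10},
    {"title": "Cooking at home", "headings": ["Search tips"],
     "abstract": "Recipes and more.", "inbound_links": 25},
]

query = ["semantic", "search"]
ranked = sorted(docs, key=lambda d: score(d, query), reverse=True)
for d in ranked:
    print(d["title"], score(d, query))
```

The first document ranks higher despite having fewer inbound links, because the query words appear in its title and abstract.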
14. 14
Technologies:
Natural Language Processing
• A part of artificial intelligence and computational
linguistics
• Deals with helping computers “understand”
written and spoken languages
• Plays a key role in voice input for search,
natural language search statements,
translations, and more.
15. 15
Technologies:
Natural Language Processing
Google's syntactic systems
• predict part-of-speech tags for each word in
a given sentence,
• identify morphological features such as
gender and number,
• label relationships between words, such as
subject, object, modification, etc.
• leverage large amounts of unlabeled data
• incorporate neural net technology.
research.google.com/pubs/NaturalLanguageProcessing.html
16. 16
Technologies:
Natural Language Processing
Google’s semantic systems
• identify entities in free text,
• label them with types (such as person,
location, or organization),
• cluster mentions of those entities within and
across documents (co-reference resolution),
• incorporate multiple sources of knowledge
and information to aid with analysis of text
research.google.com/pubs/NaturalLanguageProcessing.html
17. 17
Technologies:
Entity Extraction
• A.k.a. named-entity recognition, entity identification
• Complementary to other natural language processing
• Identifies things, people, places, etc. within text (and
speech).
• Relates to the idea of concepts referred to earlier.
• Because “text” is based on language, “structure” is there
but the structure is not readily evident to a computer.
18. 18
Technologies:
Entity Extraction
• Context-based connections allow
discernment of different meanings of a word.
• Entity extraction draws inferences based on
the logical content of the data.
• Entity extraction may be the single most
important tool for bringing structure to
unstructured data, specifically text.
• Also used for search query “suggestions”.
• An excellent example is found in Silobreaker.
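The idea of using context to decide which "thing" a word refers to can be sketched with a tiny lookup table. This is only an illustration under stated assumptions: the `GAZETTEER` entries and cue words are invented, and production systems (Silobreaker's included, presumably) rely on statistical models rather than a hand-built list.

```python
# Hypothetical gazetteer: each name maps to candidate entities with context cue words.
GAZETTEER = {
    "jaguar": [
        {"type": "Company", "cues": {"car", "dealership", "engine"}},
        {"type": "Animal", "cues": {"river", "hunted", "prey"}},
    ],
    "paris": [
        {"type": "Place", "cues": {"france", "city", "seine"}},
    ],
}

def extract_entities(text):
    """Return (name, type) pairs, picking the candidate whose cues overlap the text most."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    context = set(words)
    found = []
    for word in words:
        candidates = GAZETTEER.get(word)
        if candidates:
            best = max(candidates, key=lambda c: len(c["cues"] & context))
            found.append((word, best["type"]))
    return found

print(extract_entities("The jaguar hunted near the river."))
print(extract_entities("A Jaguar dealership opened in Paris."))
```

The same string "jaguar" is classified as an animal in the first sentence and as a company in the second, purely from the words around it.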
22. 22
Technologies:
Machine Learning
Computers teaching themselves
Google RankBrain
• Used in processing search results, part of Google’s
Hummingbird search algorithm
• A way of interpreting a search statement in order to
find web pages that may not have the specific words
in the search statement.
• Uses patterns from seemingly unconnected other
“complex” searches to find similarities to the current
search, then applies that information to find the most
likely useful content.
• Google regards this as the third most important
signal.
23. 23
Technologies:
Big Data
• The existence of “big data” collections provides
unprecedented opportunities for computational
approaches for computers to “understand” text.
• In neural networking image entity identification
experiments, the accuracy of machine learning
algorithms improves vastly when used with large
pools of data.
• "...Google’s search engine queries a 100 petabyte
index that incorporates over 200 indicators and
whose algorithms change more than 500 times per
year."
24. 24
Specific Applications of These
(and Other) Technologies
• Continued gradual incorporation of “expert”
techniques
• Natural language search statements
• Search by voice
• Image recognition and search: search of images,
search by image, and facial recognition
• Knowledge Graphs
• Entities in news search
25. 25
Gradual Incorporation of
“Expert” Techniques
• An “ordinary” search isn’t what it used to be.
• Google has now quietly taken over more of the
“old” “professional searcher” techniques and now
automatically adds not just word variants, but
synonyms.
26. 26
Gradual Incorporation of
“Expert” Techniques
• Suggested searches (based on known connections and
not just based on your character string)
A "data-driven" approach (trillions of words) vs. "rules". Not
just word variants.
• The old “synonyms” (~diet) option didn’t just go away. It is
now applied automatically. (Few people use the OR.)
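Automatic synonym handling can be pictured as the engine quietly adding the ORs for you. The sketch below assumes a small hand-made `SYNONYMS` table purely for illustration; an engine such as Google derives these relationships from usage data rather than from a fixed list.

```python
# Hypothetical synonym table; a real engine derives such links from usage data.
SYNONYMS = {
    "diet": ["nutrition", "food"],
    "car": ["automobile", "vehicle"],
}

def expand_query(query):
    """Rewrite each term as an OR group of the term and its known synonyms."""
    parts = []
    for term in query.lower().split():
        alternates = [term] + SYNONYMS.get(term, [])
        parts.append("(" + " OR ".join(alternates) + ")" if len(alternates) > 1 else term)
    return " ".join(parts)

print(expand_query("diet plans"))
```

Terms without known synonyms pass through unchanged, so the searcher's original words are always still part of the query.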
28. 28
Gradual Incorporation of
“Expert” Techniques
• “Fuzzy Logic” – As well as searching for
words that are “close”, Google may
drop some of your “concepts” for some
records
29. 29
Gradual Incorporation of “Expert”
Techniques
– If Google “thinks” you want specific facts and
“sees” a matching answer, you may get that
immediately.
30. 30
Specific Applications:
Natural Language
Search Statements
• Don’t hesitate to use them!
• The above two searches give different (and
relevant) answers
• This is especially important for Google Now
and Siri!
31. 31
Specific Applications:
Voice Search
• Apple (iOS) - Siri
• Google – Google Now
• Bing – Cortana (recently deceased?)
• These “expect” natural language, so
natural language will yield the best
results.
32. 32
Specific Applications:
Image Recognition and Search:
Search of Images
Not much recent obvious change in Bing’s or
Google’s regular image search, but:
• “Categorization” (aspect of entity extraction) is
now shown on image search results pages
• Google, Microsoft (Bing) and Apple are heavy
into research on image identification and
classification.
• What’s happening/coming can be anticipated by
looking at Google Photos.
35. 35
Specific Applications:
Image Recognition and Search:
Search of Images
• In December 2015, Microsoft beat out 5 competitors
(including Google) in the ImageNet contest for
machine recognition of images
• Machines were trained to recognize images using a
“deep neural networking” method.
• Competitors must locate and identify objects from
100,000 photographs found in Flickr and search
engines and then place them in 1,000 object
categories.
• Microsoft, the winner, had an error rate of 3.5 percent
for classification and 9 percent for localization.
• Machine learning using neural networking is also very
successfully used for translations, such as in Skype’s
new translation offering
38. 38
Specific Applications:
Knowledge Graphs
• Knowledge graphs do not originate with
Google (but Google has made the term
widely known.)
• “Knowledge graph theory was initiated by
C. Hoede, a discrete mathematician at the
University of Twente and F.N. Stokman,
a mathematical sociologist at the
University of Groningen, both in the
Netherlands.” (ca 1982)
http://doc.utwente.nl/64931/1/memo1876.pdf
39. 39
Specific Applications:
Google Knowledge Graph
• The Google Knowledge Graph, overall,
is a database about “things” and the
connections between those things.
• Delivers and summarizes key facts
about people, places, things.
• The selection of those facts is based on
connections regarding that entity and
related entities and on what other users
have asked about that entity.
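A database of "things" and their connections is often modeled as subject-relation-object triples. The following is a minimal sketch of that idea only; the relation names and helper functions are assumptions for illustration and say nothing about how Google actually stores its graph.

```python
from collections import defaultdict

# Hypothetical facts stored as (subject, relation, object) triples.
TRIPLES = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "field", "Physics"),
    ("Marie Curie", "spouse", "Pierre Curie"),
    ("Pierre Curie", "field", "Physics"),
    ("Warsaw", "capital_of", "Poland"),
]

def build_graph(triples):
    """Index the triples by subject so each "thing" lists its outgoing connections."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph

def summarize(graph, entity):
    """Return the key facts about an entity, as a results-page panel might."""
    return {relation: obj for relation, obj in graph.get(entity, [])}

graph = build_graph(TRIPLES)
print(summarize(graph, "Marie Curie"))
# Following a connection leads from one entity to a related one.
print(summarize(graph, summarize(graph, "Marie Curie")["born_in"]))
```

Because every object can itself be a subject, a summary of one entity naturally leads to related entities, which is what makes the structure a graph rather than a flat table.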
40. 40
Specific Applications:
Google Knowledge Graph
• Launched May 2012
• At its heart, Google Knowledge Graph is a
database of facts.
• Within about seven months of launch, it
contained 18 billion facts about 570 million objects.
• The kinds of things included vary with the
kind of entity.
• Content comes primarily from Wikipedia,
World Factbook, Freebase/Wikidata, plus
other sources.
43. 43
Specific Applications:
Google Knowledge Graph
• The key power of Google Knowledge
Graph lies in its utilization of
connections between entities as
searched for by other users.
• At present, its main weakness
is its heavy, unvetted reliance on
Wikipedia, which is not always right,
e.g., the Wikipedia article on Knowledge
Graph.
46. 46
Bing’s Knowledge Graph
• Named “Snapshot”, it uses Bing’s Satori
technology
• Launched in June 2012
• Utilizes Wikipedia, Freebase, Qwiki,
LinkedIn, Britannica, etc.
• Builds into results interactive features
such as audio and video
50. 50
Specific Applications:
News Applications
EMM – European Media Monitor
• From the European Commission
• Computerized analysis of news trends
and story content
• Makes extensive use of NLP techniques
for entity extraction and clustering
• “Organizes” a vast quantity of
knowledge very efficiently.
54. 54
So, How do we as researchers take
advantage of this?
• Get in the habit of using what's new (Siri,
Google Now, natural language).
Join the Evolution!
• Actually pay attention to Google Instant
(suggestions).
• Don't forsake the old. There are times when
you need to turn the auto-pilot off and take
charge.
• Ask questions you didn't bother asking
before [because you didn't think the search
engine would do it.]
55. 55
So, how do we as researchers take
best advantage of this?
• Increase awareness of information quality
criteria
• Worry a bit -
– Worrisome - the general public's further reliance
on quick, single, local, twitter-length answers
– Worrisome - localization
– Worrisome - “echo chambers”
– “Machines making decisions on our behalf”
• Enjoy the new.
In the description of this talk, which you have read, I say that the nature of search is changing "radically".
The changes are "radical" largely in terms of what we can do with "search", both in terms of how and when we (both information professionals and the masses) perform searches and the kinds of results we get. One quick example is the immediacy with which, in ordinary language, a question can be asked orally and an answer (not a list of resources) can be received.
As we go along over the next 40 or so minutes, you'll notice several recurring themes, or perhaps perspectives.
The pieces are not new - the idea of augmenting results pages with collections of facts about the topic dates back to AltaVista and Yahoo in the mid-1990s.
People can see structure (linguistic structure) that machines don't easily see
Bing, news sites, and many others are involved in improving search technologies
Especially if we more fully understand the basic ideas of some of the technologies, we can make fuller use of what they are providing us with.
Human searchers can still accomplish things the technologies can't.
In a sense, the organization of data is at the core of what the information profession is all about.
Except for a few cues such as headings, full text is just a collection of word strings to a typical computer.
NLP is the magic potion that changes strings to meaning
Boolean has been the primary "technology" since the beginning of computerized information retrieval. Since the essence of Boolean is an intellectual means of identifying, from a group of items, those that have a specific combination of characteristics, Boolean is likely to be a big thing for a long time.
For decades, and to varying degrees up to the present, this is the general approach that I and others who teach Internet search have used: searching by concepts, that is, identifying the essential concepts and then the alternate terms that might indicate the presence of each concept.
Whether or not you search using a chart like this, this is one way a professional searcher thinks -- concepts and related
Though there are other technologies involved, I think the main ones regarding things currently happening are the ones listed here.
And, perhaps obviously, because of the nature of these technologies, there's considerable overlap between some of these categories
Semantic .... particularly the things that can be done at the webpage level
Entity .. And if I had to pick out one from the list that makes the biggest difference it is this
T B-L - creator of the Web
W3C - World Wide Web Consortium - the main international organization for Web standards.
If you want further details on this, go to schema.org.
In practice… webpage markup can’t, by itself, create a semantic web.
To really accomplish that, all of the words need to have meaning, and for this the markup has to be complemented by other techniques such as natural language processing.
One of the reasons Google became so successful is because of the way it mimics what a researcher looks for when looking at a collection of articles
Again, with NLP, the thrust is to turn words into meaning.
As I said before, the “structure” is actually present in text, but it is a challenge to program in, for a computer, the cues that we as humans can rather easily identify.
Within NLP as it is used by Google, two major systems do a lot of the “heavy work” in understanding language: a syntactic system and a semantic system.
Nouns, verbs, pronouns, adjectives, adverbs, prepositions, etc.
The semantic systems go beyond the more “grammatical” structure and examine the broader contextual situation.
What I’m referring to on this slide goes by several different names: entity extraction, entity identification, named-entity recognition, and perhaps other names.
And it involves various subsets of activities such as entity classification and entity disambiguation.
Silobreaker is a Swedish company that has been around since 2005 and was one of the first news services to extensively use entity extraction.
This slide, showing a search being entered, points out that entity extraction isn’t just used for indexing for the retrieval part of the search, but also for providing organized search terminology, basically a “somewhat-controlled” vocabulary.
This slide shows how Silobreaker uses named entities to visualize connections between entities.
By the way, visualizations similar to this, but showing connections between retrieved domains, have been around for a long time. AltaVista in the mid-1990s was showing visualizations of connections between the first 200 retrieved records in a search.
This screenshot from a Silobreaker search shows named entities by classification: people, companies, groups, places, activities, and so on.
One rather different technology being used is machine learning, programming that allows a computer to teach itself.
Large volumes of data now exist, and, particularly where statistical analysis is a key part of the process, the more data the merrier.
Having taken a look at the technologies involved, let’s take a more specific look at where in the search process they are being applied.
There isn’t time to cover all search-related situations, but the ones listed here are the ones most central to “search”.