Searching the internet - better with Google / Google not always best
1. UB Utrecht HvA-MIC GO Opleidingen
searching the internet
better with Google / Google not always best
Eric Sieverts
@sieverts
CODARTS, 04-03-2013
2. agenda
• searching the web
• smart searching
• google options
• beyond google
• beyond general web search
for all links see: http://sieverts.pbworks.com/codarts
3. agenda
the general web ?=? everything
importance of specific material types?
general web search: how to …
specific material search: how to … , when & why
4. an ever changing google landscape
• unreliable numbers
• irreproducible results
• disappearing functions
• changing interfaces
6. building block approach
systematic searching in structured information systems (like JStor etc.)
start analytically with the so-called building block approach
e.g.: subject "modern american composers"
– it breaks down into 3 facets
– collect keywords for each facet
– combine keywords with OR and AND operators
facet 1: modern OR contemporary OR 20th century OR twentieth century OR …
AND
facet 2: american OR america OR usa OR united states OR …
AND
facet 3: composers OR composer OR songwriters OR …
7. building block approach
this yields the query:
(modern OR contemporary OR "twentieth century" OR "20th century")
AND (america OR american OR usa OR "united states")
AND (composer OR composers OR songwriter OR songwriters)
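The building block approach can be sketched in a few lines of Python: OR within each facet, AND between facets. This is only an illustration; the function name build_query is hypothetical, and the exact Boolean syntax accepted will differ per database.

```python
def build_query(facets):
    """Combine keyword lists: OR within a facet, AND between facets."""
    groups = []
    for keywords in facets:
        # Quote multi-word keywords so they are searched as exact phrases.
        terms = [f'"{k}"' if " " in k else k for k in keywords]
        groups.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(groups)


# The three facets of the subject "modern american composers".
facets = [
    ["modern", "contemporary", "twentieth century", "20th century"],
    ["america", "american", "usa", "united states"],
    ["composer", "composers", "songwriter", "songwriters"],
]

print(build_query(facets))
```

Adding a keyword to a facet broadens the search; adding a whole new facet narrows it.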
8. building block approach
also with Google ?
web search engines are not specifically designed for such structured
queries, but they can handle them
Google and Yahoo make it even easier, since you may omit the parentheses
and the AND operator (AND is the default):
modern OR contemporary OR "twentieth century" OR "20th century"
[implied AND]
america OR american OR usa OR "united states"
[implied AND]
composer OR composers OR songwriter OR songwriters
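A minimal sketch of the same idea for a web search engine: the OR groups are simply separated by spaces, which act as the implied AND. The helper names are hypothetical, and the base URL with its q parameter is an assumption about the usual Google search address, not something stated on the slide.

```python
from urllib.parse import urlencode


def google_style_query(facets):
    """OR within each facet; a plain space between facets is the implied AND."""
    groups = []
    for keywords in facets:
        # Quote multi-word keywords so they are searched as exact phrases.
        terms = [f'"{k}"' if " " in k else k for k in keywords]
        groups.append(" OR ".join(terms))
    return " ".join(groups)


def search_url(query, base="https://www.google.com/search"):
    """Put the query in the 'q' parameter of a search URL (assumed format)."""
    return base + "?" + urlencode({"q": query})


query = google_style_query([
    ["modern", "contemporary"],
    ["usa", "united states"],
    ["composer", "composers"],
])
print(query)
print(search_url(query))
```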
9. relevance ranking (1)
Google (and other web search engines) are primarily
focused on presenting search results in order of relevance
how do they know what is relevant?
– they interpret the importance of words for the subject matter of
the retrieved documents
(your search terms present in title, url, headings, ... ?)
• you can enhance importance of a certain term for your
query by repeating that word a couple of times
– they estimate the importance of the relation between words in
the retrieved documents: whether ..
• your search words occur close together
• your search words occur in same order as you entered them
>> formulate your query the way you expect it to be phrased in a relevant document
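To illustrate why closeness and word order matter, here is a toy scoring function. It is not how Google or any real engine computes relevance; the function name, the weights and the 0.5 bonus are all made up for this sketch.

```python
def proximity_score(query, document):
    """Toy relevance score: higher when the query words occur close together
    and in the same order as entered. Illustration only."""
    terms = query.lower().split()
    words = document.lower().split()
    positions = []
    for term in terms:
        if term not in words:
            return 0.0  # a query word is missing from the document
        positions.append(words.index(term))  # first occurrence only
    span = max(positions) - min(positions) + 1
    score = len(terms) / span  # 1.0 when the words are adjacent
    if positions == sorted(positions):
        score += 0.5  # arbitrary bonus for matching the entered order
    return score


print(proximity_score("modern composers", "modern composers of america"))
print(proximity_score("modern composers", "composers who are modern"))
```

The first document scores higher than the second: same words, but adjacent and in the order entered.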
11. relevance ranking (2)
Google (and other web search engines) are primarily
focused on presenting search results in order of relevance
how do they know what is relevant?
– importance or quality of retrieved web pages is deduced from
the number and the importance of links from other sites
(a PageRank is calculated for each page)
– importance of retrieved web pages for your personal interest is
deduced on basis of your previous search and browse behaviour,
which is monitored whenever you're logged in
since every search engine uses somewhat different algorithms for its
relevance calculations (and their coverage differs as well), there
tends to be little overlap between the top 10 results from different engines
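The link-based part of ranking can be sketched with the textbook PageRank iteration: a page's score depends on how many pages link to it and on how important those pages are. This is a simplified sketch on a made-up four-page web; the damping factor 0.85 and the iteration count are conventional assumptions, and real engines combine this with many other signals.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank over links.
    links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical mini-web: a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page c ranks highest because three pages link to it; page a ranks above b and d because its single incoming link comes from the important page c.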
13. search terms
use of proper search terms is crucial for search success
think of :
– singular / plural , verbs / nouns / adjectives , conjugations , ...
– spelling variations (behavior / behaviour)
– compound terms (writer / songwriter)
– synonyms, acronyms (compact disc / compact disk / cd / digital disc)
how would the answer to my question be formulated in a
relevant document? "think like a document"
– the right terms
– as an "exact phrase" or in most probable word order
– use wildcard for variable words ("modern * * composers")
– use known examples from a list to be found
– consider popular versus scientific terms, etc.
14. refining searches
if results are too broad, too diverse
– add another essential term or set of terms to your query
– see what your search engine suggests
while you enter your query
– exclude an unwanted term with NOT (francis bacon NOT philosopher)
NB: Google does not understand NOT; use the minus sign instead:
francis bacon -philosopher
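A small sketch of the NOT-to-minus conversion, for queries written in database style that you want to reuse in Google. The function name is hypothetical; only upper-case NOT is treated as an operator.

```python
import re


def not_to_minus(query):
    """Rewrite 'NOT term' as '-term' (no space between minus sign and term)."""
    return re.sub(r"\bNOT\s+(?=\S)", "-", query)


print(not_to_minus("francis bacon NOT philosopher"))
print(not_to_minus('bacon NOT "francis bacon"'))
```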
16. is Google outsmarting us ?
Google tries to improve and to broaden your queries
• automatic spelling corrections (veilgheid >> veiligheid)
• automatic search for words with same word stem (singular/plural,
verb, conjugation, inflection, …)
• expands acronyms (jfk >> john f kennedy | wwii >> world war II)
• adds some synonyms (vaccination >> immunization)
• transforms separate words to compound term & vice versa
(veiligheid maatregel >> veiligheidsmaatregel | catfood >> cat food)
• may leave out a term as optional if it is not differentiating enough
• personalisation based on previous search behaviour
never sure what/when or not; more often and more elaborate in English than in Dutch
but what if you don't like all of this ...
>> "verbatim"
19. searched only literally for the exact words you entered
on google.nl: "woord voor woord"
20. some more "how to"
• domain search: site:edu OR site:edu.* [for all edu (sub)domains]
site:shell.com OR site:philips.com
• url search: inurl:novelty
• title search: intitle:catalytic
• filetype search: filetype:pdf
filetype:xls OR filetype:xlsx
filetype:doc OR filetype:docx
filetype:rss
[more types than shown in the advanced search drop-down menu]
• exact search: "greenhouses" [or VERBATIM for all words]
21. advanced search
Google is hiding its advanced search screen :
you must perform a simple search
first, to get the "cog wheel"
22. some more "how to"
some of this can be done from the advanced search screen
but regular search box offers greater flexibility,
once you know the syntax
• domain search: [in combination with real search terms]
site:codarts.nl
site:edu OR site:edu.* [for all edu (sub)domains]
site:last.fm OR site:spotify.com
• url search: inurl:course
• title search: intitle:guitar
23. some more "how to" (2)
• filetype search: filetype:pdf
filetype:xls OR filetype:xlsx
filetype:doc OR filetype:docx
filetype:rss
[more types than shown in the advanced search drop-down menu]
• numeric search: 10..20 [includes all values in between]
$10..$20 [not for other currencies]
• punctuation: &, %, dot, ... [can be searched]
€, /, ", comma, ... [is ignored]
• exact search: "greenhouses" [or VERBATIM for all words]
• synonym search: ~guitar
• time limitations: [after search, hidden in top menu]
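Once you know the syntax, these operators are easy to combine in the regular search box. The sketch below only assembles the query text; the helper names and the search terms "guitar lessons" are hypothetical, and it assumes the operators behave as listed above.

```python
def any_of(operator, values):
    """Build 'operator:a OR operator:b ...' for site:, filetype:, intitle:, etc."""
    return " OR ".join(f"{operator}:{value}" for value in values)


def number_range(low, high, prefix=""):
    """Numeric range such as 10..20, or $10..$20 with prefix='$'."""
    return f"{prefix}{low}..{prefix}{high}"


query = " ".join([
    "guitar lessons",                            # the real search terms
    any_of("site", ["edu", "edu.*"]),            # all edu (sub)domains
    any_of("filetype", ["pdf", "doc", "docx"]),  # document types
    number_range(10, 20, prefix="$"),            # price range in dollars
])
print(query)
```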
27. whoever searches for “Bach” is probably more interested
in data about him than in websites about him, and
most probably in "J.S." rather than one of his relatives
Google's "Knowledge Graph"
knows 500 million objects
with 3.5 billion properties and
even more mutual relations
(but only in English)
30. general search engines besides google
• Bing: microsoft, large
• Yahoo!: content=Bing, large
• Blekko: uses hashtags to search more [domain-] selective;
also many predefined hashtags, e.g. /likes for Facebook
• DuckDuckGo: assures privacy, no personalisation, no filter-bubble,
rather small, !Bang-function offers many extras
• Gigablast: green search engine, rather small, some unique functions
• Exalead: french, many advanced functions, primarily demo system
• Millionshort: leaves out results from most popular sites → the long tail
• WolframAlpha: knowledge engine, facts, calculations
together, these others have 30% market share in US; in NL only 3%
• Yandex: in Russia more popular than Google
• Baidu: in China more popular than Google
• Naver, Daum: in South Korea more popular than Google
• Seznam: in Czechia more popular than Google
31. material type specific search
science: google scholar, microsoft academic, scirus,
oaister, scientific commons, science.gov
reference: wikipedia, quora, wolfram|alpha, answers.com
news: google news, yahoo news, bing news, cnn, bbc
old news: way-back-machine, historische kranten KB
images: google image, yahoo image, bing image, flickr,
tineye (ip-check), panoramio (geo-search)
video: google video, youtube, youtube edu channel,
bing video, blinkx, voxalead-news
tweets: twitter search, topsy, postpost, snapbird
social: socialsearcher, socialmention, whostalkin, kurrently
forums: google groups, omgili, boardtracker
blogs: google blogs, icerocket, [rss] CTRLQ, RSS SearchHub
32. scientific search
books
– Google Books (full text search)
– Hathitrust Digital Library (open book scan project / part of G-books)
– Librarything (catalog of 58,000,000 books from 1,000,000 owners)
– GoodReads (reviews, recommendations, friends, ...)
– Open Textbook Catalog (open access textbooks)
journal articles
– licensed databases (like JStor, ...)
– Google Scholar (articles, dissertations, reports, ...)
– sEURch / UvA-library ("discovery" systems of EUR / UvA)
– Scirus / SciVerse (journal articles -Elsevier- , database content, webpages)
– Magportal (also -English- popular magazines)
– DeepDyve (scientific articles "for rent" - for 24 hours)
33. Google Books
• all pages scanned and full-text searchable
• important for discovering specific subjects/terms that are not the book's primary topic
• often limitations on display and browsability
(no preview / snippet view / limited preview / full preview)
• content from publishers and large libraries
• problems with viewing copyrighted material also from libraries
• build your personal ‘My Library’
• Dutch-language books not only from Ghent University (and soon KB), but also from
US/UK
• also some ‘magazines’
• metadata on about-this-book-page
37. Google Scholar
• > 100 million scientific publications (mostly articles)
• differences between availability (and hence searchability) of
full-text (majority), bibliographic-only, and citation data
• competitor of Web of Science, Scopus, Scirus, ...
• indexing many selected -even licensed- sources (publishers,
abstract-databases, university sites, institutional repositories, ...)
• includes numbers of citations! [and links to them]
• number of citations important factor for relevance ranking
(!! reason why recent publications get low rankings)
• advanced search limited, many mistakes in metadata (authors etc.)
• accessibility of full-text often a problem because of licences
• often many versions of same article (including sometimes free ones)
• coupling with library subscriptions to allow smoother linking
• no info about sources, updates etc.
38. Google Scholar result list (screenshot callouts)
• open access
• if this article is interesting, these 23 more recent ones probably are too
• number of citations
• subscription univ. utrecht
41. facts and reference
encyclopedias
– wikipedia
– internet movie database
– ...
Q&A (human powered)
– Quora
– Yahoo-answers
direct answers, facts and calculations
– Wolfram|Alpha
dictionaries, translations
– answers.com (metasearch)
– Roget thesaurus
– Bartleby
– Google Translate
– Google Translated search
– Synoniemen.net (Dutch)
42. wikipedia
• >250 languages
• “wisdom of the crowds” ?=? “wisdom” for all topics?
• quite good for “factual” topics
• many detailed specific topics (>20 million entries, >1 million in Dutch)
• there are policies & guidelines
& management: stewards, administrators
• to search wikipedia, use Google rather than the internal search
limit to: site:wikipedia.org
this gives more complete results
and searches all language versions together
47. ... and pages selected
from the result list are
translated into English too
50. old stuff : web & news
• web archive
– "way-back machine": old versions of websites, back to 1996
access through the original url, NO search
internal site links will mostly work
– also other archived materials (among others, music)
• historical Dutch newspapers
– historische kranten KB (1618-1995 ; full-text search)
• historical international newspapers
– British newspapers 1800-1900
– historic American newspapers
– international overview
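Since the way-back machine is accessed through the original url rather than by searching, looking up an old version comes down to building an address. The sketch below assumes the commonly used pattern web.archive.org/web/<timestamp>/<original url>, where * is assumed to list all captures; the function name and example.com are hypothetical.

```python
def wayback_url(original_url, timestamp="*"):
    """Build a way-back machine address for an original url.
    timestamp: digits such as '19980201' (YYYYMMDD), or '*' for the overview
    of all captures (assumed behaviour of the archive's url scheme)."""
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


print(wayback_url("http://example.com/"))
print(wayback_url("http://example.com/", "19980201"))
```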
53. … and the very oldest one, from February 1998:
54. twitter & social search
twitter search (often limited to messages from past 1 - 2 weeks only)
– twitter (also advanced search)
– topsy (best one at the moment, also older messages)
– postpost (search your own timeline - everything you're following)
– snapbird (search through all tweets of a particular person;
you need to know the Twitter name)
real time / social search
– socialsearcher (facebook | twitter | g+ : side by side)
– socialmention (also weblogs)
– samepoint, whostalkin, kurrently, … (also weblogs)
forum discussions
– omgili, boardtracker, ...
– Google groups
64. image search
Content based image retrieval (CBIR)
• search on colors
– examples: Tineye, Chromatik, Picitup, Google, ...
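Searching on colors can be illustrated with a classic technique: reduce each image to a coarse color histogram and compare histograms. This is a generic sketch, not a description of how Tineye, Google or the other tools work; images are represented here as plain lists of (r, g, b) tuples, and the function names are hypothetical.

```python
def color_histogram(pixels, bins=4):
    """Share of pixels per coarse color bin (bins levels per channel)."""
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        # Map each 0-255 channel value to a bin number 0..bins-1.
        rb, gb, bb = r * bins // 256, g * bins // 256, b * bins // 256
        hist[rb * bins * bins + gb * bins + bb] += 1
    total = len(pixels) or 1  # avoid division by zero for an empty image
    return [count / total for count in hist]


def histogram_intersection(h1, h2):
    """Similarity between 0.0 (no shared colors) and 1.0 (same distribution)."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Three tiny made-up "images" of four pixels each.
red = [(255, 0, 0)] * 4
dark_red = [(200, 10, 10)] * 4
blue = [(0, 0, 255)] * 4

print(histogram_intersection(color_histogram(red), color_histogram(dark_red)))
print(histogram_intersection(color_histogram(red), color_histogram(blue)))
```

The two reds fall in the same coarse bin and count as similar; red and blue share no bin at all.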
65. image search
Content based image retrieval
• search by example
– draw it yourself
Retrievr, ...
– existing image
Google (visually similar)
Tineye (almost exact copies)
Retrievr, ...
example found on the web or
uploaded from your own computer
68. google looks for most probable
keywords to describe this image
and already combines them
with the image in the search box
... and how about these
"visually similar images" ?
Assignment: refine the search until the first 50 results contain no irrelevant ones, paying attention to these points; use a thesaurus or Word synonyms; truncation