SlideShare a Scribd company logo
1 of 13
Download to read offline
February 22, 2012

Essential Elements of
Excellent Multilingual Search
Boosting Search Quality with the Rosette Linguistics Platform




                      We put the World in the World Wide Web®
ABOUT BASIS TECHNOLOGY
Basis Technology provides software solutions for text analytics, information retrieval,
digital forensics, and identity resolution in over forty languages. Our Rosette® linguistics
platform is a widely used suite of interoperable components that power search, business
intelligence, e-discovery, social media monitoring, financial compliance, and other
enterprise applications. Our linguistics team is at the forefront of applied natural language
processing using a combination of statistical modeling, expert rules, and
corpus-derived data. Our forensics team pioneers better, faster, and cheaper techniques
to extract forensic evidence, keeping government and law enforcement ahead of
exponential growth of data storage volumes.

Software vendors, content providers, financial institutions, and government agencies
worldwide rely on Basis Technology’s solutions for Unicode compliance, language
identification, multilingual search, entity extraction, name indexing, and name
translation. Our products and services are used by over 250 major firms, including Cisco,
EMC, Exalead/Dassault Systems, Hewlett-Packard, Microsoft, Oracle, and Symantec.
Our text analysis products are widely used in the U.S. defense and intelligence industry
by such firms as CACI, Lockheed Martin, Northrop Grumman, SAIC, and SRI. We are
the top provider of multilingual technology to web and e-commerce search engines,
including Amazon.com, Bing, Google, and Yahoo!.

Company headquarters are in Cambridge, Massachusetts, with branch offices
in San Francisco, Washington, London, and Tokyo. For more information, visit
www.basistech.com.




© 2012 Basis Technology Corporation. “Basis Technology”, “Geoscope”, “Odyssey Digital Forensics”, “Rosette”, and “We put the World in the
World Wide Web” are registered trademarks of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the
property of their respective owners. (2012-08-30)
ABSTRACT
If search is vital to the competitive advantage or value proposition of your business, then this
whitepaper is a must read. In the survival of the fittest, companies like Netflix and many
businesses today thrive or die depending on whether they can:
   a) Provide “excellent search” – comprehensive, accurate, and relevant to the user
   b) Serve each user equally well in his or her preferred language

The three essential elements of a business-dependable search solution on which to build
applications or workflows to generate revenue are: (1) search customizability and the speed-
scalability-cost trio, (2) high search quality—with emphasis on “recall” or maximizing the number
of good results found—and (3) reliability and support. This paper will examine what these three
elements mean, and examine Basis Technology’s Rosette linguistic platform as one possible
solution.


THE THREE ELEMENTS OF “EXCELLENT SEARCH”
More and more, revenues and business longevity rely on cost-effective, accurate, and scalable
search in English and other languages. E-commerce, job or real estate hunting sites, e-discovery/
e-disclosure solutions, financial compliance solutions, online information providers, media and
entertainment portal sites all begin with users finding what they are looking for. Choose the right
search and see revenues go up, satisfied customers increase, and a business well-positioned for
the future. Choose the wrong search and costs constrict growth and poor relevancy chases away
customers. Savvy businesses recognize that their customers, partners, suppliers and employees
increasingly create and consume content in their native language.

For most enterprises, improving search quality means improving (1) recall—i.e., maximizing the
number of relevant results found—and to a slightly lesser degree, (2) precision—i.e., maximizing
the ratio of good to bad results returned.

We’ll now examine each of these three elements of search in detail:
   • Customizability of search is critical to fit it to your particular use case or create the value add
     of your business. Speed, Scale, and Cost can quickly strangle the success of your business
     if search is struggling to maintain speed and scale, and choking profits with escalating per-
     query or per-document costs.
   • Maximizing Search Recall and Precision is the 50,000 pound gorilla. Speed is important,
     but if your users aren’t finding the results they want to see, it doesn’t matter how fast it
     gets delivered. For enterprise applications and search engines, maximizing recall (without
     diminishing precision) is usually the greater concern. Good language support is an intrinsic
     ingredient for improved recall
   • Reliability and Technical Support ensures that expert resources are available for solving any
     serious issues in the service.




 Essential Elements of Excellent Multilingual Search                                                3
ELEMENT 1: CUSTOMIZABILITY, SPEED, SCALE, COST
The basic question a company has to ask is: Do we want/like/need/know how to bend technology
to our business processes and business model in order to achieve competitive advantage? And
most importantly: Is search part of that competitive advantage? If the answer is “yes” then search
is not a plug-in-a-solution problem, but a software development problem.

Because search is such a broad-scope problem, there are all kinds of edge cases that are difficult to
support using a proprietary, commercial search solution. The technology is not easily changed
without expensive consultation and modifications to the core software that are dependent on the
vendor’s product release cycle.

For companies where better search means more money earned or saved, the frustration can
increase geometrically because their use case may be unique, but the software isn't extensible
enough to deal with it, and they have to spend more and more for less and less.

Like three legs of a triangle, speed, scalability and cost are inextricably linked. Assuming that the
load on the search engine will only increase over time, search performance must be scalable to
keep the query response times low and to ensure index updating times do not affect searches.
At the same time, pricing models of commercial search engines based on number of queries or
number of documents indexed will eventually constrict growth by sheer cost.


ELEMENT 2: MAXIMIZING SEARCH RECALL AND PRECISION DEPENDS ON LANGUAGE SUPPORT
Search is really all about finding, whether it’s a shopper on an e-commerce site, a banker checking a
wire transfer request against a list of money launderers, or home buyers looking through real estate
listings.

The concepts of precision and recall measure how “good” search is. More precise results with greater
recall mean better quality search, but frequently increasing one means a decrease in the other, so the
trick is how to maximize each metric while minimizing the impact on the other. (See sidebar.)

Let’s now look at the language-specific processing required to increase recall and/or precision for three
categories of notoriously difficult languages to process: Chinese/Japanese/Korean, Germanic and
Scandinavian languages, and Arabic. We’ll first look at one aspect of linguistic processing that improves
recall in most languages, including English and European languages: lemmas and stems.


Improving Recall via Lemmas and Stems
Lemmatization and stemming are two methods of increasing the recall of search results. Stemming
improves recall, but can diminish precision. Lemmatization is the preferred method since it increases
recall while maintaining precision.

Lemmatization, finding the dictionary form of a word, broadens search to add relevant search results,
in ways that stemming—more commonly available—cannot. A user searching for “President Obama
speaking on healthcare” would likely also want results containing “President Obama spoke on
healthcare” and “President Obama was speaking on healthcare.” By lemmatizing“ spoke” and “was
speaking” and storing the lemma “speak” in the search index, the latter two results will (correctly!) get
picked up by a search for “President Obama speaking on healthcare.




             Essential Elements of Excellent Multilingual Search                                            4
Stemming uses some basic rules to look for common stems in words. For example, “decod” is the
stem of “decoding,” “decoder,” and “decodes.” However, no matter how many letters a stemmer
chops off, “spoke” will never become “speak.” Lemmas are found using dictionary data and
recognizing the context of a word. The lemma is very different when “spoke” is used as a noun or a
verb. Stemming, on the other hand, applies a set of rules regardless of a word’s part-of-speech or
context in a sentence.

Since lemmas go back to the intrinsic meaning of a word, irrelevant results are kept to a minimum.
On the other hand, stemming may sometimes produce unpredictable results. Stemming turns
“several” into “sever,” and “arsenic” and “arsenal” share the same stem “arsen.” And, whereas
“apples” to “apple” is reasonable, “Los Angeles” to “Los Angele” is not.

Although stemming is more readily accessible as a set of language-specific rules than the
dictionary-requiring lemmatization, the greater recall and precision from Lemmatization is the
difference between good and excellent search.

Stemming vs. Lemmatization in a Nutshell
Search Query                     Traditional Stemming           Lemmatization using Rosette      Comparison

animals                          anim                           animal                           Two unrelated words may
                                                                                                 share a stem, in this case
animated                         anim                           animate                          “anim”
arsenal                          arsen                          arsenal

arsenic                          arsen                          arsenic

several                          sever                          several                          Stemming may have
                                                                                                 unintended consequences
organization                     organ                          organization

children                         children                       child                            Irregular verbs and nouns
                                                                                                 stump the stemmer
spoke                            spoke                          speak (when used as past tense
                                                                verb)
                                                                spoke (when used as noun)


Improving Precision in Chinese, Japanese, and Korean
Asian languages like Chinese, Japanese, and Korean are fundamentally more difficult to process
than European languages because their words are not consistently separated by spaces. Chinese
and Japanese use no spaces between words, and Korean uses some spaces, but not between
every word, and inconsistently depending on the writer.

For a search engine to build an index, it has to have words. N-gram and Tokenization are methods
used to produce indexable units in these languages, each resulting in different degrees of search
precision.

N-gram vs. Tokenization
Some engines may use the n-gram technique to break up streams of text into overlapping 2-4
character units, but though recall will be high, precision will be low and indexes will be very large,
impacting speed.

Here is a Japanese example: 東京都の観光地 (translation: sightseeing spots in Tokyo)




          Essential Elements of Excellent Multilingual Search                                                                 5
Bigram          English translation

東京              Tokyo

京都              Kyoto

都の              capital’s

の観              not a word

観光              sightseeing

光地              not a word


A seven character phrase produces six items to index via the n-gram technique, but also introduces
a false word “Kyoto.” Since many of the bigrams are not actual words in the original text, many
other unrelated strings might accidentally match them, too, and a search becomes a statistical
process: Do enough of these two character segments appear in a given document to flag it as a
“match,” while not accidentally matching the wrong strings?

This approach also produces many more entries in the index than words, unnecessarily increasing
its size. Processing a 100-character buffer (about 13-17 words), adds 999 entries to the index.

A better choice is true tokenization based on morphological analysis1, which breaks up text into
real words. With tokenization of the example above, there are at most three items to index, and
possibly only two if the possessive marker is treated as a “stop word”.2

Words                 English Translation

東京都                   Tokyo

の                     (possessive marker)

観光地                   Sightseeing spots


Character Normalization to Improve Recall
Chinese, Japanese, and Korean have additional character normalization needs beyond making sure
ASCII characters are all uppercase or lowercase. In digital files, these languages also use a full- width
variant of ASCII letters and punctuation which need to be normalized to the half-width ASCII form.

                                          ASCIIwords%$& → ASCIIwords%$&

In the case of Japanese, a half-width version of Japanese katakana characters also needs to be
normalized to the usual full-width katakana. The half-width version was invented in the early days
of computer processing to reduce data size, but it is still found in documents and webpages.

                                                   カタカナ → カタカナ




1     Morphological analysis in linguistics examines the smallest units of meaning within a word—
      called “morphemes—such as “un-” in “unremarkable.”
2     A “stop word” is one which occurs so frequently within a language as to not help in finding
      search results. Some search engines may choose to ignore them.




    Essential Elements of Excellent Multilingual Search                                             6
Improving Recall with Pan-Chinese Search
Chinese search gains an enormous boost in recall if both the simplified and traditional Chinese
scripts are searched. Mainland China and Singapore use simplified Chinese, whereas Taiwan and
Hong Kong use traditional Chinese. Chinese speakers expect one query in one script to find results
from both scripts.

Pan-Chinese search—searching documents in both Chinese scripts—is accomplished by converting
all the text of one script into the other at query and index

The differences between the two scripts fall into three categories:
Category                                               Simplified Chinese   Traditional Chinese       English Translation

                                                                                                     big
Same character used in both scripts—this is true
where the character was simple to start with           大                   大
Two characters with same pronunciation (but
different base meaning)—collapsed to one                头发                  頭髮                        emit
character in simplified Chinese                                                                       hair
                                                       出发                  出發
Different vocabulary (analogous to American                                                           computer
and British English differences such as “truck” vs.     计算机                 電腦
“lorry”)—occurs in mostly modern words


In the first case, the change is trivial, and amounts to making sure all the text is in the same
encoding; however, the second and third categories require dictionary data to ensure that the
conversion is context-sensitive, so that the correct traditional character is chosen.

Improving Recall in Germanic and Scandinavian Languages
Linguistic processing required by German, Dutch, Scandinavian languages (Danish, Norwegian,
Swedish, and Finnish), and Korean is very similar in that these languages freely use compound
words which, frequently need to be broken up to increase recall.

Take these two German compound words:
German compound                               German decompounded                  English translation

Jugendarbeitslosigkeit                        Jugend + arbeitslosigkeit            Youth unemployment
                                                                                   (Youth + unemployment)

Samstagmorgen                                 Samstag + morgen                     Saturday morning
                                                                                   (Saturday + morning)


It’s reasonable to imagine that a person looking for information about youth unemployment in
German would also welcome search results which included “youth” and “unemployment” as
separate words. Similarly, the concepts of “Saturday” and “morning” are very likely to appear as a
compound word or as separate words in related documents.

Decompounding increases search recall with minimal impact on precision, but again, requires
dictionary data to do properly.

Additionally, to maximize recall, character normalization for plurals “Garten” (garden, singular
form) and its plural form “Gärten” is needed.




                 Essential Elements of Excellent Multilingual Search                                                       7
Improving Recall in Arabic
Arabic is one language which especially suffers from low recall if a search engine does not
perform Arabic-specific linguistic processing, starting from the basic character normalization up to
stemming of proper nouns and Lemmatization of common nouns. Arabic attaches so many affixes
—to the beginning, end and middle of words—that without lemmatizing before searching, many
relevant results are never found. One root could form the basis for up to fifteen different verbs!

Character Normalization to Improve Precision and Recall
In English, search engines perform some basic normalization such as lowercasing all the words. In
Arabic, normalization is much more complex and will improve both precision and recall. Types of
character normalization required include:
   • Words with additional vocalization marks such as: ‫ﺳﻲـ‬          vs. ‫ﻲـﺳﺎﯾـﺳ‬

   ‫ﺳـ ﯾﺎ‬
   • Words containing certain letters with dots added or removed such as: ‫ رـ ـﻗﯾﻪ ـ‬vs. ‫رـ ـﻗﯾﺔ ـ‬
   • Words–including ambiguous cases–containing certain letters with symbols added or
     removed such as:
                     ‫ٱدﺎـﻣﺗﻋ‬    vs.   ‫ادﺎـﻣﺗﻋ‬
                      ‫ادؤود‬     vs.   ‫ادوود‬
                        ‫اﻣـﺛ‬    vs.   (‫ آﻣـﺛ‬or ‫إﻣـﺛ‬    or   ‫)أﻣـﺛ‬

Improving Recall with Lemmatization and Stemming
In Arabic as other languages, Lemmatization increases the recall of search results (cf. “Improving
Recall via Lemmas vs. Stems”), but with a twist. In Arabic, Lemmatization is only applicable to verbs
and common nouns (e.g., apple, book, table) and is not applicable to proper nouns (e.g., names of
people such as, Abul-qassem El-Chabby, Baqah Al-Sharqiyyah).

Proper nouns require stemming to remove prepositions and conjunctions which are often
attached to them. Searches for names of people, places, and organizations are seriously hampered
without stemming. For example, these phrases in English appear as one word in Arabic: “for
Othman,” “with Othman,” “as Othman,” and “and Othman.” Thus a search for just “Othman”
would not find the above variations without stemming.

Additionally, since Arabic names are frequently the same words as common nouns, part-of-speech
tagging becomes critical to differentiating between nouns and proper nouns, particularly when a
word’s part-of-speech varies depending on where it appears in a sentence.

Search Recall and Precision Enhancers for Many Languages
For English and most European languages, better recall and precision means:
   • Adding lemmas (the dictionary form of a word) to the search index to increase recall, while
     minimizing the negative impact on precision (cf. previous section “Improving Recall via
     Lemmas vs. Stems”)
   • Boosting the ranking of relevant documents on the results page through the use of
     document metadata such as entities.
   • Increasing recall through comprehensive name search that goes beyond the “Did you
     mean?” functionality of most search engines.




 Essential Elements of Excellent Multilingual Search                                                8
Boosting Relevant Results
Once the search engine is finding more good results (better recall) and returning fewer bad results
(better precision), the good results need to be at the top of the search results.

Frequently the most critical words in a search have to do with entities—proper nouns such as
names of people, places, and organizations. But when is “Christian” a religious affiliation or the
name of a person “Christian Dior”? A good entity extractor will know. At indexing time, entities can
be added to a document record’s metadata to boost the rank of documents which have entities
matching those in the query.

Comprehensive Name Search to Improve Recall
The “Did you mean?” function of most modern search engines will find actress “Cate Blanche ”
even if the user types “Kate          .” However, for less famous people, and to cover a gamut of
name variations, a true name matching function is needed to handle “Chuck Berry” vs. “Charles
Berry”; “John Kearns” vs. “Jon Cairns” or even “Baqah Al-Sharqiyyah” and “‫.”ﺔﻗﺎﺑ رﺷﻟ اﻗﺔﯾ‬

The Rosette Option for Language Support
Rosette is a software development kit (SDK) designed for a wide range of large-scale applications
that need to identify, classify, analyze, index, and search unstructured text from various sources,
and its linguistic analysis capabilities are widely used by search engines to both increase
precision and recall It uses a combination of statistical models, dictionary data, and sophisticated
computational linguistics to parse digital text in English and over 25 major European, Asian,
and Middle Eastern languages. Years of development and linguistic analysis have gone into the
development of Rosette to satisfy the most demanding customers, both for quality, robustness,
and speed.

Nearly every major web and enterprise search engine since 1999, including Bing, Endeca, goo
(Japanese), Google, Microsoft/FAST, and Yahoo! has used Rosette. New generations of search-
based applications for e-discovery, financial compliance, and other fields also use Rosette.

Rosette is a cross-platform SDK available for Windows and Unix, and offers all its capabilities via a
single API, in C, C++, Java, or .NET.

Rosette Capabilities
   • Language Identification—identification of the primary language of a document—or language
     regions within a multilingual document—and the file’s encoding in 55 languages and 45
     encodings.
   • Unicode Conversion—converts documents in legacy encodings to Unicode
   • Character Normalization—normalizes characters to a single representation (e.g. character+
     diacritic to single character with diacritics, half/full-size variants of ASCII characters in Asian
     languages, and several language-specific normalizations.)
   • Linguistic Analysis—Lemmatization, tokenization, part-of-speech tagging, decompounding,
     and more in over 25 languages.
   • Entity Extraction—extracts entities such as people, places, and organizations in 15 languages
     via statistical models with customizable user-defined entities via regular expressions and
     entity databases
   • Name Matching—returns matching names despite spelling variations, initials, nicknames,
     missing name components, missing spaces, out-of-order name components, the same name
     written in different languages, and more—supported in multiple languages including Arabic,
     Chinese, English, Korean, and Persian.




             Essential Elements of Excellent Multilingual Search                                           9
• Name Translation—translates names from non-Latin script languages to English and standardizes
  already translated names—supported in multiple languages including Arabic, Chinese, English,
  Korean, Russian, and Persian.



     Lemmatization and Stemming
     Needing to support a variety languages can stretch the capabilities of out of the box search
     solutions, requiring additional effort to custom-configure the handling of each language and
     tune its accuracy one by one.


     Chinese, Japanese, and Korean Search
     Normalization of Chinese, Japanese and Korean characters is done by the Rosette Core Library for
     Unicode (RCLU):
        • For Chinese, Japanese, and Korean: Converting full-width ASCII characters (ASDF&%$) to
          the usual half-width characters (ASDF&%$).
        • For Japanese: Converting half-width Japanese katakana characters (アイウエオ; one-byte
          characters invented in the early days of Japanese computing to save on storage) to the usual
          full-width characters (アイウエオ).


     Pan-Chinese Search
       Rosette’s Chinese script converter can convert text into either simplified or traditional
       Chinese.

     Germanic and Scandinavian Languages

       Character normalization of plurals is part of the Lemmatization of German within Rosette.

      Arabic
       Rosette provides speech tagging and Lemmatization for Arabic.




       Essential Elements of Excellent Multilingual Search                                          10
Boosting Relevant Results with Entities
Rosette’s entity extractor can push results higher in the rankings.




Comprehensive Name Search
Rosette’s name matching functionality including name variations due to nicknames, initials, missing
name components, missing spaces between names, out-of-order name components, the same
name represented in different languages, and more. Thus, besides handling “Robert” vs. “Bob,”
and “Abdul Rasheed” vs. “Abd-al-Rasheed” vs. “Abd Ar-Rashid,” Rose e also matches “Mao Tse-
Tung”, “Mao Zedong”, and “毛泽东” (simplified Chinese).




PRECISION VS. RECALL VS. F-SCORE IN A NUTSHELL
Precision and recall are metrics used to evaluate the quality of search results, whether searching
for articles on a topic, or finding name matches.

Supposed you are searching for red balls from a box that contains 7 red balls and 8 green balls.
Blindfolded, you pull out 8 balls, of which 4 are red and 4 are green.

Precision is the number of correct items over the total number of items found. That is, you found
4 red (which you wanted) and pulled out 8 balls in total, thus precision is 4/8 (half of your results
were correct) or 50%.

Recall asks the question: “Of all the correct items, how many did I find?” In this case, there were 7
correct items (because there are 7 red balls) and you found 4, thus your recall is 4/7 or 57%.




F-score is a measure that attempts to balance precision and recall and is often called the
“relevancy” of a system.

F-score = 2*(Precision*Recall) ÷ (Precision+Recall)

Thus our F-score above = 8/15 = .53, or 53%




             Essential Elements of Excellent Multilingual Search                                        11
ELEMENT 3: RELIABILITY AND SUPPORT
Even in a technologically sophisticated company, the core business logic will be the main focus of
all business operations, so technical support as a safety net is essential. Just as firms rely on
calling the vendor when the copier is acting up, a few judicious pieces of advice from the search
or linguistics experts to settle a nettlesome problem lends assurance to both engineers and upper
management.

On the language support side, the wide language coverage of Basis Technology’s Rosette—over 25
languages covering Asia, Europe, and the Middle East—ensure there will be one point of contact
to address questions about any languages, instead of a multitude of single-language vendors with
varying support contracts. Rosette developers are on the front lines, providing high quality support
to customer inquiries.

Additionally as a single platform, Rosette’s benchmarks for speed and accuracy are readily
available, and implementing one or over 25 languages is the same amount of work.




 Essential Elements of Excellent Multilingual Search                                             12
SUMMARY & CONCLUSION: ROSETTE LINGUISTICS PLATFORM
    For companies that derive competitive advantage from a better quality or better customized
    search, search is a development problem, and not a “plug-in-a-solution” problem, and the four
    search engine essentials are:
       1. Customizability—the ability to innovate on top of the chosen search engine to build value-
          added features
       2. Speed—the search engine’s query response rates and indexing and updating times
       3. Scalability—the ability to easily handle increased number of queries or volumes of content
          requiring indexing
       4. Cost—total cost of ownership; return on investment; scaling of cost to search needs

The Rosette platform fills that gap, through solid linguistic analysis technology and comprehensive
dictionary data to perform:
   • Language Identification—The language identifier detects 55 languages with speed and
     accuracy, having been trained on gigabytes of hand-verified data. It covers a broad range of
     Asian, Indo-European, and Middle Eastern languages.
   • Linguistic Analysis—Rosette performs a complete linguistic analysis to improve search recall
     and precision for some of the most linguistically difficult languages.
     ○ Lemmatization—for boosting recall while maintaining precision in many languages
     ○ Tokenization—for Chinese, Japanese, and Korean which do not have spaces between
       words
     ○ Chinese script conversion—for pan-Chinese search

     ○ Character normalization—specialized transforms for Arabic and Asian languages
     ○ Decompounding—for Germanic and Scandinavian languages which form words from
       multiple words
     ○ Part-of-speech tagging—for context-sensitive Lemmatization, and particularly for Arabic
       search to distinguish between common nouns requiring Lemmatization, and proper nouns,
       requiring stemming

   • Entity Extraction—To locate names, places, organizations, and other entities, both to boost
     ranking of results whose entities match the search query or as a base for building faceted
     search. Rosette plugs into the UpdateRequestProcessor at indexing        to populate each
     document record with entities.
   • Name Matching—Going beyond “Did you mean?”, Rosette finds matching names that differ
     due to nicknames, initials, missing name components, missing spaces between names, out-
     of-order name components, the same name represented in different languages, and more.


NEXT STEPS

   • Request a free product evaluation of Rosette at:
     http://www.basistech.com/products/requests/evalua         -request.html




     Essential Elements of Excellent Multilingual Search                                             13

More Related Content

Viewers also liked

Vegetarianism[1]
Vegetarianism[1]Vegetarianism[1]
Vegetarianism[1]
taylorjhen
 
Question bank accounting for management
Question bank   accounting for managementQuestion bank   accounting for management
Question bank accounting for management
Sb Raja
 
14127918 financial-management-scope-objectives-and-types-of-finances
14127918 financial-management-scope-objectives-and-types-of-finances14127918 financial-management-scope-objectives-and-types-of-finances
14127918 financial-management-scope-objectives-and-types-of-finances
Sb Raja
 

Viewers also liked (12)

Power Point
Power PointPower Point
Power Point
 
Memo final
Memo finalMemo final
Memo final
 
Big Data for content marketers from a tech industry perspective
Big Data for content marketers from a tech industry perspectiveBig Data for content marketers from a tech industry perspective
Big Data for content marketers from a tech industry perspective
 
Enabling High Quality Search in European Languages
Enabling High Quality Search in European LanguagesEnabling High Quality Search in European Languages
Enabling High Quality Search in European Languages
 
09342
0934209342
09342
 
Rpp smt 2
Rpp smt 2Rpp smt 2
Rpp smt 2
 
Vegetarianism[1]
Vegetarianism[1]Vegetarianism[1]
Vegetarianism[1]
 
Mimecast Presentation
Mimecast PresentationMimecast Presentation
Mimecast Presentation
 
Question bank accounting for management
Question bank   accounting for managementQuestion bank   accounting for management
Question bank accounting for management
 
14127918 financial-management-scope-objectives-and-types-of-finances
14127918 financial-management-scope-objectives-and-types-of-finances14127918 financial-management-scope-objectives-and-types-of-finances
14127918 financial-management-scope-objectives-and-types-of-finances
 
Barcamp AQUOPS (BarAQUOPS) 2015
Barcamp AQUOPS (BarAQUOPS) 2015Barcamp AQUOPS (BarAQUOPS) 2015
Barcamp AQUOPS (BarAQUOPS) 2015
 
Outdoor Connect
Outdoor  ConnectOutdoor  Connect
Outdoor Connect
 

Similar to Essential Elements of Excellent Multilingual Search

Demystifying analytics in e discovery white paper 06-30-14
Demystifying analytics in e discovery   white paper 06-30-14Demystifying analytics in e discovery   white paper 06-30-14
Demystifying analytics in e discovery white paper 06-30-14
Steven Toole
 
Content Analyst - Conceptualizing LSI Based Text Analytics White Paper
Content Analyst - Conceptualizing LSI Based Text Analytics White PaperContent Analyst - Conceptualizing LSI Based Text Analytics White Paper
Content Analyst - Conceptualizing LSI Based Text Analytics White Paper
John Felahi
 
NICE Speech Analytics - White Paper
NICE Speech Analytics - White PaperNICE Speech Analytics - White Paper
NICE Speech Analytics - White Paper
Clark Hill
 
Bryan Bell Presentation
Bryan Bell PresentationBryan Bell Presentation
Bryan Bell Presentation
Mediabistro
 

Similar to Essential Elements of Excellent Multilingual Search (20)

Semantic Search_ NLP_ ML.pdf
Semantic Search_ NLP_ ML.pdfSemantic Search_ NLP_ ML.pdf
Semantic Search_ NLP_ ML.pdf
 
Veda Semantics - introduction document
Veda Semantics - introduction documentVeda Semantics - introduction document
Veda Semantics - introduction document
 
Demystifying analytics in e discovery white paper 06-30-14
Demystifying analytics in e discovery   white paper 06-30-14Demystifying analytics in e discovery   white paper 06-30-14
Demystifying analytics in e discovery white paper 06-30-14
 
Introduction to Semantic Technology for SharePoint Administrators
Introduction to Semantic Technology for SharePoint AdministratorsIntroduction to Semantic Technology for SharePoint Administrators
Introduction to Semantic Technology for SharePoint Administrators
 
Content Analyst - Conceptualizing LSI Based Text Analytics White Paper
Content Analyst - Conceptualizing LSI Based Text Analytics White PaperContent Analyst - Conceptualizing LSI Based Text Analytics White Paper
Content Analyst - Conceptualizing LSI Based Text Analytics White Paper
 
Language First Protocol from QSi
Language First Protocol from QSiLanguage First Protocol from QSi
Language First Protocol from QSi
 
Key Phrases for Better Search
Key Phrases for Better SearchKey Phrases for Better Search
Key Phrases for Better Search
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
Metaphic or the art of looking another way.
Metaphic or the art of looking another way.Metaphic or the art of looking another way.
Metaphic or the art of looking another way.
 
The Bottom Line: Where’s the Money?
The Bottom Line: Where’s the Money?The Bottom Line: Where’s the Money?
The Bottom Line: Where’s the Money?
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
NICE Speech Analytics - White Paper
NICE Speech Analytics - White PaperNICE Speech Analytics - White Paper
NICE Speech Analytics - White Paper
 
The Effects of Globalization on Technical Communication and Training
The Effects of Globalization on Technical Communication and Training The Effects of Globalization on Technical Communication and Training
The Effects of Globalization on Technical Communication and Training
 
Enterprise Search Share Point2009 Best Practices Final
Enterprise Search Share Point2009 Best Practices FinalEnterprise Search Share Point2009 Best Practices Final
Enterprise Search Share Point2009 Best Practices Final
 
Bryan Bell Presentation
Bryan Bell PresentationBryan Bell Presentation
Bryan Bell Presentation
 
5 Algorithms Every Web Developer Can Use and Understand
5 Algorithms Every Web Developer Can Use and Understand5 Algorithms Every Web Developer Can Use and Understand
5 Algorithms Every Web Developer Can Use and Understand
 
5 lexis nexis legal innovation powered by ai_min chen
5 lexis nexis legal innovation powered by ai_min chen5 lexis nexis legal innovation powered by ai_min chen
5 lexis nexis legal innovation powered by ai_min chen
 
SEO + NLP - Redefining The Computer & Human Relationship.pdf
SEO + NLP - Redefining The Computer & Human Relationship.pdfSEO + NLP - Redefining The Computer & Human Relationship.pdf
SEO + NLP - Redefining The Computer & Human Relationship.pdf
 
Croud Presents: How to Build a Data-driven SEO Strategy Using NLP
Croud Presents: How to Build a Data-driven SEO Strategy Using NLPCroud Presents: How to Build a Data-driven SEO Strategy Using NLP
Croud Presents: How to Build a Data-driven SEO Strategy Using NLP
 
Introduction to site search analytics by SearchBroker
Introduction to site search analytics by SearchBrokerIntroduction to site search analytics by SearchBroker
Introduction to site search analytics by SearchBroker
 

Recently uploaded

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Recently uploaded (20)

A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Essential Elements of Excellent Multilingual Search

  • 1. February 22, 2012 Essential Elements of Excellent Multilingual Search Boosting Search Quality with the Rosette Linguistics Platform We put the World in the World Wide Web®
  • 2. ABOUT BASIS TECHNOLOGY Basis Technology provides software solutions for text analytics, information retrieval, digital forensics, and identity resolution in over forty languages. Our Rosette® linguistics platform is a widely used suite of interoperable components that power search, business intelligence, e-discovery, social media monitoring, financial compliance, and other enterprise applications. Our linguistics team is at the forefront of applied natural language processing using a combination of statistical modeling, expert rules, and corpus-derived data. Our forensics team pioneers better, faster, and cheaper techniques to extract forensic evidence, keeping government and law enforcement ahead of exponential growth of data storage volumes. Software vendors, content providers, financial institutions, and government agencies worldwide rely on Basis Technology’s solutions for Unicode compliance, language identification, multilingual search, entity extraction, name indexing, and name translation. Our products and services are used by over 250 major firms, including Cisco, EMC, Exalead/Dassault Systems, Hewlett-Packard, Microsoft, Oracle, and Symantec. Our text analysis products are widely used in the U.S. defense and intelligence industry by such firms as CACI, Lockheed Martin, Northrop Grumman, SAIC, and SRI. We are the top provider of multilingual technology to web and e-commerce search engines, including Amazon.com, Bing, Google, and Yahoo!. Company headquarters are in Cambridge, Massachusetts, with branch offices in San Francisco, Washington, London, and Tokyo. For more information, visit www.basistech.com. © 2012 Basis Technology Corporation. “Basis Technology”, “Geoscope”, “Odyssey Digital Forensics”, “Rosette”, and “We put the World in the World Wide Web” are registered trademarks of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the property of their respective owners. (2012-08-30)
  • 3. ABSTRACT If search is vital to the competitive advantage or value proposition of your business, then this whitepaper is a must read. In the survival of the fittest, companies like Netflix and many businesses today thrive or die depending on whether they can: a) Provide “excellent search” – comprehensive, accurate, and relevant to the user b) Serve each user equally well in his or her preferred language The three essential elements of a business-dependable search solution on which to build applications or workflows to generate revenue are: (1) search customizability and the speed- scalability-cost trio, (2) high search quality—with emphasis on “recall” or maximizing the number of good results found—and (3) reliability and support. This paper will examine what these three elements mean, and examine Basis Technology’s Rosette linguistic platform as one possible solution. THE THREE ELEMENTS OF “EXCELLENT SEARCH” More and more, revenues and business longevity rely on cost-effective, accurate, and scalable search in English and other languages. E-commerce, job or real estate hunting sites, e-discovery/ e-disclosure solutions, financial compliance solutions, online information providers, media and entertainment portal sites all begin with users finding what they are looking for. Choose the right search and see revenues go up, satisfied customers increase, and a business well-positioned for the future. Choose the wrong search and costs constrict growth and poor relevancy chases away customers. Savvy businesses recognize that their customers, partners, suppliers and employees increasingly create and consume content in their native language. For most enterprises, improving search quality means improving (1) recall—i.e., maximizing the number of relevant results found—and to a slightly lesser degree, (2) precision—i.e., maximizing the ratio of good to bad results returned. We’ll now examine each of these three elements of search in detail: • Customizability of search is critical to fit it to your particular use case or create the value add of your business. Speed, Scale, and Cost can quickly strangle the success of your business if search is struggling to maintain speed and scale, and choking profits with escalating per- query or per-document costs. • Maximizing Search Recall and Precision is the 50,000 pound gorilla. Speed is important, but if your users aren’t finding the results they want to see, it doesn’t matter how fast it gets delivered. For enterprise applications and search engines, maximizing recall (without diminishing precision) is usually the greater concern. Good language support is an intrinsic ingredient for improved recall • Reliability and Technical Support ensures that expert resources are available for solving any serious issues in the service. Essential Elements of Excellent Multilingual Search 3
  • 4. ELEMENT 1: CUSTOMIZABILITY, SPEED, SCALE, COST The basic question a company has to ask is: Do we want/like/need/know how to bend technology to our business processes and business model in order to achieve competitive advantage? And most importantly: Is search part of that competitive advantage? If the answer is “yes” then search is not a plug-in-a-solution problem, but a software development problem. Because search is such a broad-scope problem, there are all kinds of edge cases that are difficult to support using a proprietary, commercial search solution. The technology is not easily changed without expensive consultation and modifications to the core software that are dependent on the vendor’s product release cycle. For companies where better search means more money earned or saved, the frustration can increase geometrically because their use case may be unique, but the software isn't extensible enough to deal with it, and they have to spend more and more for less and less. Like three legs of a triangle, speed, scalability and cost are inextricably linked. Assuming that the load on the search engine will only increase over time, search performance must be scalable to keep the query response times low and to ensure index updating times do not affect searches. At the same time, pricing models of commercial search engines based on number of queries or number of documents indexed will eventually constrict growth by sheer cost. ELEMENT 2: MAXIMIZING SEARCH RECALL AND PRECISION DEPENDS ON LANGUAGE SUPPORT Search is really all about finding, whether it’s a shopper on an e-commerce site, a banker checking a wire transfer request against a list of money launderers, or home buyers looking through real estate listings. The concepts of precision and recall measure how “good” search is. More precise results with greater recall mean better quality search, but frequently increasing one means a decrease in the other, so the trick is how to maximize each metric while minimizing the impact on the other. (See sidebar.) Let’s now look at the language-specific processing required to increase recall and/or precision for three categories of notoriously difficult languages to process: Chinese/Japanese/Korean, Germanic and Scandinavian languages, and Arabic. We’ll first look at one aspect of linguistic processing that improves recall in most languages, including English and European languages: lemmas and stems. Improving Recall via Lemmas and Stems Lemmatization and stemming are two methods of increasing the recall of search results. Stemming improves recall, but can diminish precision. Lemmatization is the preferred method since it increases recall while maintaining precision. Lemmatization, finding the dictionary form of a word, broadens search to add relevant search results, in ways that stemming—more commonly available—cannot. A user searching for “President Obama speaking on healthcare” would likely also want results containing “President Obama spoke on healthcare” and “President Obama was speaking on healthcare.” By lemmatizing“ spoke” and “was speaking” and storing the lemma “speak” in the search index, the latter two results will (correctly!) get picked up by a search for “President Obama speaking on healthcare. Essential Elements of Excellent Multilingual Search 4
  • 5. Stemming uses some basic rules to look for common stems in words. For example, “decod” is the stem of “decoding,” “decoder,” and “decodes.” However, no matter how many letters a stemmer chops off, “spoke” will never become “speak.” Lemmas are found using dictionary data and recognizing the context of a word. The lemma is very different when “spoke” is used as a noun or a verb. Stemming, on the other hand, applies a set of rules regardless of a word’s part-of-speech or context in a sentence. Since lemmas go back to the intrinsic meaning of a word, irrelevant results are kept to a minimum. On the other hand, stemming may sometimes produce unpredictable results. Stemming turns “several” into “sever,” and “arsenic” and “arsenal” share the same stem “arsen.” And, whereas “apples” to “apple” is reasonable, “Los Angeles” to “Los Angele” is not. Although stemming is more readily accessible as a set of language-specific rules than the dictionary-requiring lemmatization, the greater recall and precision from Lemmatization is the difference between good and excellent search. Stemming vs. Lemmatization in a Nutshell Search Query Traditional Stemming Lemmatization using Rosette Comparison animals anim animal Two unrelated words may share a stem, in this case animated anim animate “anim” arsenal arsen arsenal arsenic arsen arsenic several sever several Stemming may have unintended consequences organization organ organization children children child Irregular verbs and nouns stump the stemmer spoke spoke speak (when used as past tense verb) spoke (when used as noun) Improving Precision in Chinese, Japanese, and Korean Asian languages like Chinese, Japanese, and Korean are fundamentally more difficult to process than European languages because their words are not consistently separated by spaces. Chinese and Japanese use no spaces between words, and Korean uses some spaces, but not between every word, and inconsistently depending on the writer. For a search engine to build an index, it has to have words. N-gram and Tokenization are methods used to produce indexable units in these languages, each resulting in different degrees of search precision. N-gram vs. Tokenization Some engines may use the n-gram technique to break up streams of text into overlapping 2-4 character units, but though recall will be high, precision will be low and indexes will be very large, impacting speed. Here is a Japanese example: 東京都の観光地 (translation: sightseeing spots in Tokyo) Essential Elements of Excellent Multilingual Search 5
  • 6. Bigram English translation 東京 Tokyo 京都 Kyoto 都の capital’s の観 not a word 観光 sightseeing 光地 not a word A seven character phrase produces six items to index via the n-gram technique, but also introduces a false word “Kyoto.” Since many of the bigrams are not actual words in the original text, many other unrelated strings might accidentally match them, too, and a search becomes a statistical process: Do enough of these two character segments appear in a given document to flag it as a “match,” while not accidentally matching the wrong strings? This approach also produces many more entries in the index than words, unnecessarily increasing its size. Processing a 100-character buffer (about 13-17 words), adds 999 entries to the index. A better choice is true tokenization based on morphological analysis1, which breaks up text into real words. With tokenization of the example above, there are at most three items to index, and possibly only two if the possessive marker is treated as a “stop word”.2 Words English Translation 東京都 Tokyo の (possessive marker) 観光地 Sightseeing spots Character Normalization to Improve Recall Chinese, Japanese, and Korean have additional character normalization needs beyond making sure ASCII characters are all uppercase or lowercase. In digital files, these languages also use a full- width variant of ASCII letters and punctuation which need to be normalized to the half-width ASCII form. ASCIIwords%$& → ASCIIwords%$& In the case of Japanese, a half-width version of Japanese katakana characters also needs to be normalized to the usual full-width katakana. The half-width version was invented in the early days of computer processing to reduce data size, but it is still found in documents and webpages. カタカナ → カタカナ 1 Morphological analysis in linguistics examines the smallest units of meaning within a word— called “morphemes—such as “un-” in “unremarkable.” 2 A “stop word” is one which occurs so frequently within a language as to not help in finding search results. Some search engines may choose to ignore them. Essential Elements of Excellent Multilingual Search 6
  • 7. Improving Recall with Pan-Chinese Search Chinese search gains an enormous boost in recall if both the simplified and traditional Chinese scripts are searched. Mainland China and Singapore use simplified Chinese, whereas Taiwan and Hong Kong use traditional Chinese. Chinese speakers expect one query in one script to find results from both scripts. Pan-Chinese search—searching documents in both Chinese scripts—is accomplished by converting all the text of one script into the other at query and index The differences between the two scripts fall into three categories: Category Simplified Chinese Traditional Chinese English Translation big Same character used in both scripts—this is true where the character was simple to start with 大 大 Two characters with same pronunciation (but different base meaning)—collapsed to one 头发 頭髮 emit character in simplified Chinese hair 出发 出發 Different vocabulary (analogous to American computer and British English differences such as “truck” vs. 计算机 電腦 “lorry”)—occurs in mostly modern words In the first case, the change is trivial, and amounts to making sure all the text is in the same encoding; however, the second and third categories require dictionary data to ensure that the conversion is context-sensitive, so that the correct traditional character is chosen. Improving Recall in Germanic and Scandinavian Languages Linguistic processing required by German, Dutch, Scandinavian languages (Danish, Norwegian, Swedish, and Finnish), and Korean is very similar in that these languages freely use compound words which, frequently need to be broken up to increase recall. Take these two German compound words: German compound German decompounded English translation Jugendarbeitslosigkeit Jugend + arbeitslosigkeit Youth unemployment (Youth + unemployment) Samstagmorgen Samstag + morgen Saturday morning (Saturday + morning) It’s reasonable to imagine that a person looking for information about youth unemployment in German would also welcome search results which included “youth” and “unemployment” as separate words. Similarly, the concepts of “Saturday” and “morning” are very likely to appear as a compound word or as separate words in related documents. Decompounding increases search recall with minimal impact on precision, but again, requires dictionary data to do properly. Additionally, to maximize recall, character normalization for plurals “Garten” (garden, singular form) and its plural form “Gärten” is needed. Essential Elements of Excellent Multilingual Search 7
  • 8. Improving Recall in Arabic Arabic is one language which especially suffers from low recall if a search engine does not perform Arabic-specific linguistic processing, starting from the basic character normalization up to stemming of proper nouns and Lemmatization of common nouns. Arabic attaches so many affixes —to the beginning, end and middle of words—that without lemmatizing before searching, many relevant results are never found. One root could form the basis for up to fifteen different verbs! Character Normalization to Improve Precision and Recall In English, search engines perform some basic normalization such as lowercasing all the words. In Arabic, normalization is much more complex and will improve both precision and recall. Types of character normalization required include: • Words with additional vocalization marks such as: ‫ﺳﻲـ‬ vs. ‫ﻲـﺳﺎﯾـﺳ‬ ‫ﺳـ ﯾﺎ‬ • Words containing certain letters with dots added or removed such as: ‫ رـ ـﻗﯾﻪ ـ‬vs. ‫رـ ـﻗﯾﺔ ـ‬ • Words–including ambiguous cases–containing certain letters with symbols added or removed such as: ‫ٱدﺎـﻣﺗﻋ‬ vs. ‫ادﺎـﻣﺗﻋ‬ ‫ادؤود‬ vs. ‫ادوود‬ ‫اﻣـﺛ‬ vs. (‫ آﻣـﺛ‬or ‫إﻣـﺛ‬ or ‫)أﻣـﺛ‬ Improving Recall with Lemmatization and Stemming In Arabic as other languages, Lemmatization increases the recall of search results (cf. “Improving Recall via Lemmas vs. Stems”), but with a twist. In Arabic, Lemmatization is only applicable to verbs and common nouns (e.g., apple, book, table) and is not applicable to proper nouns (e.g., names of people such as, Abul-qassem El-Chabby, Baqah Al-Sharqiyyah). Proper nouns require stemming to remove prepositions and conjunctions which are often attached to them. Searches for names of people, places, and organizations are seriously hampered without stemming. For example, these phrases in English appear as one word in Arabic: “for Othman,” “with Othman,” “as Othman,” and “and Othman.” Thus a search for just “Othman” would not find the above variations without stemming. Additionally, since Arabic names are frequently the same words as common nouns, part-of-speech tagging becomes critical to differentiating between nouns and proper nouns, particularly when a word’s part-of-speech varies depending on where it appears in a sentence. Search Recall and Precision Enhancers for Many Languages For English and most European languages, better recall and precision means: • Adding lemmas (the dictionary form of a word) to the search index to increase recall, while minimizing the negative impact on precision (cf. previous section “Improving Recall via Lemmas vs. Stems”) • Boosting the ranking of relevant documents on the results page through the use of document metadata such as entities. • Increasing recall through comprehensive name search that goes beyond the “Did you mean?” functionality of most search engines. Essential Elements of Excellent Multilingual Search 8
  • 9. Boosting Relevant Results Once the search engine is finding more good results (better recall) and returning fewer bad results (better precision), the good results need to be at the top of the search results. Frequently the most critical words in a search have to do with entities—proper nouns such as names of people, places, and organizations. But when is “Christian” a religious affiliation or the name of a person “Christian Dior”? A good entity extractor will know. At indexing time, entities can be added to a document record’s metadata to boost the rank of documents which have entities matching those in the query. Comprehensive Name Search to Improve Recall The “Did you mean?” function of most modern search engines will find actress “Cate Blanche ” even if the user types “Kate .” However, for less famous people, and to cover a gamut of name variations, a true name matching function is needed to handle “Chuck Berry” vs. “Charles Berry”; “John Kearns” vs. “Jon Cairns” or even “Baqah Al-Sharqiyyah” and “‫.”ﺔﻗﺎﺑ رﺷﻟ اﻗﺔﯾ‬ The Rosette Option for Language Support Rosette is a software development kit (SDK) designed for a wide range of large-scale applications that need to identify, classify, analyze, index, and search unstructured text from various sources, and its linguistic analysis capabilities are widely used by search engines to both increase precision and recall It uses a combination of statistical models, dictionary data, and sophisticated computational linguistics to parse digital text in English and over 25 major European, Asian, and Middle Eastern languages. Years of development and linguistic analysis have gone into the development of Rosette to satisfy the most demanding customers, both for quality, robustness, and speed. Nearly every major web and enterprise search engine since 1999, including Bing, Endeca, goo (Japanese), Google, Microsoft/FAST, and Yahoo! has used Rosette. New generations of search- based applications for e-discovery, financial compliance, and other fields also use Rosette. Rosette is a cross-platform SDK available for Windows and Unix, and offers all its capabilities via a single API, in C, C++, Java, or .NET. Rosette Capabilities • Language Identification—identification of the primary language of a document—or language regions within a multilingual document—and the file’s encoding in 55 languages and 45 encodings. • Unicode Conversion—converts documents in legacy encodings to Unicode • Character Normalization—normalizes characters to a single representation (e.g. character+ diacritic to single character with diacritics, half/full-size variants of ASCII characters in Asian languages, and several language-specific normalizations.) • Linguistic Analysis—Lemmatization, tokenization, part-of-speech tagging, decompounding, and more in over 25 languages. • Entity Extraction—extracts entities such as people, places, and organizations in 15 languages via statistical models with customizable user-defined entities via regular expressions and entity databases • Name Matching—returns matching names despite spelling variations, initials, nicknames, missing name components, missing spaces, out-of-order name components, the same name written in different languages, and more—supported in multiple languages including Arabic, Chinese, English, Korean, and Persian. Essential Elements of Excellent Multilingual Search 9
  • 10. • Name Translation—translates names from non-Latin script languages to English and standardizes already translated names—supported in multiple languages including Arabic, Chinese, English, Korean, Russian, and Persian. Lemmatization and Stemming Needing to support a variety languages can stretch the capabilities of out of the box search solutions, requiring additional effort to custom-configure the handling of each language and tune its accuracy one by one. Chinese, Japanese, and Korean Search Normalization of Chinese, Japanese and Korean characters is done by the Rosette Core Library for Unicode (RCLU): • For Chinese, Japanese, and Korean: Converting full-width ASCII characters (ASDF&%$) to the usual half-width characters (ASDF&%$). • For Japanese: Converting half-width Japanese katakana characters (アイウエオ; one-byte characters invented in the early days of Japanese computing to save on storage) to the usual full-width characters (アイウエオ). Pan-Chinese Search Rosette’s Chinese script converter can convert text into either simplified or traditional Chinese. Germanic and Scandinavian Languages Character normalization of plurals is part of the Lemmatization of German within Rosette. Arabic Rosette provides speech tagging and Lemmatization for Arabic. Essential Elements of Excellent Multilingual Search 10
  • 11. Boosting Relevant Results with Entities Rosette’s entity extractor can push results higher in the rankings. Comprehensive Name Search Rosette’s name matching functionality including name variations due to nicknames, initials, missing name components, missing spaces between names, out-of-order name components, the same name represented in different languages, and more. Thus, besides handling “Robert” vs. “Bob,” and “Abdul Rasheed” vs. “Abd-al-Rasheed” vs. “Abd Ar-Rashid,” Rose e also matches “Mao Tse- Tung”, “Mao Zedong”, and “毛泽东” (simplified Chinese). PRECISION VS. RECALL VS. F-SCORE IN A NUTSHELL Precision and recall are metrics used to evaluate the quality of search results, whether searching for articles on a topic, or finding name matches. Supposed you are searching for red balls from a box that contains 7 red balls and 8 green balls. Blindfolded, you pull out 8 balls, of which 4 are red and 4 are green. Precision is the number of correct items over the total number of items found. That is, you found 4 red (which you wanted) and pulled out 8 balls in total, thus precision is 4/8 (half of your results were correct) or 50%. Recall asks the question: “Of all the correct items, how many did I find?” In this case, there were 7 correct items (because there are 7 red balls) and you found 4, thus your recall is 4/7 or 57%. F-score is a measure that attempts to balance precision and recall and is often called the “relevancy” of a system. F-score = 2*(Precision*Recall) ÷ (Precision+Recall) Thus our F-score above = 8/15 = .53, or 53% Essential Elements of Excellent Multilingual Search 11
  • 12. ELEMENT 3: RELIABILITY AND SUPPORT Even in a technologically sophisticated company, the core business logic will be the main focus of all business operations, so technical support as a safety net is essential. Just as firms rely on calling the vendor when the copier is acting up, a few judicious pieces of advice from the search or linguistics experts to settle a nettlesome problem lends assurance to both engineers and upper management. On the language support side, the wide language coverage of Basis Technology’s Rosette—over 25 languages covering Asia, Europe, and the Middle East—ensure there will be one point of contact to address questions about any languages, instead of a multitude of single-language vendors with varying support contracts. Rosette developers are on the front lines, providing high quality support to customer inquiries. Additionally as a single platform, Rosette’s benchmarks for speed and accuracy are readily available, and implementing one or over 25 languages is the same amount of work. Essential Elements of Excellent Multilingual Search 12
  • 13. SUMMARY & CONCLUSION: ROSETTE LINGUISTICS PLATFORM For companies that derive competitive advantage from a better quality or better customized search, search is a development problem, and not a “plug-in-a-solution” problem, and the four search engine essentials are: 1. Customizability—the ability to innovate on top of the chosen search engine to build value- added features 2. Speed—the search engine’s query response rates and indexing and updating times 3. Scalability—the ability to easily handle increased number of queries or volumes of content requiring indexing 4. Cost—total cost of ownership; return on investment; scaling of cost to search needs The Rosette platform fills that gap, through solid linguistic analysis technology and comprehensive dictionary data to perform: • Language Identification—The language identifier detects 55 languages with speed and accuracy, having been trained on gigabytes of hand-verified data. It covers a broad range of Asian, Indo-European, and Middle Eastern languages. • Linguistic Analysis—Rosette performs a complete linguistic analysis to improve search recall and precision for some of the most linguistically difficult languages. ○ Lemmatization—for boosting recall while maintaining precision in many languages ○ Tokenization—for Chinese, Japanese, and Korean which do not have spaces between words ○ Chinese script conversion—for pan-Chinese search ○ Character normalization—specialized transforms for Arabic and Asian languages ○ Decompounding—for Germanic and Scandinavian languages which form words from multiple words ○ Part-of-speech tagging—for context-sensitive Lemmatization, and particularly for Arabic search to distinguish between common nouns requiring Lemmatization, and proper nouns, requiring stemming • Entity Extraction—To locate names, places, organizations, and other entities, both to boost ranking of results whose entities match the search query or as a base for building faceted search. Rosette plugs into the UpdateRequestProcessor at indexing to populate each document record with entities. • Name Matching—Going beyond “Did you mean?”, Rosette finds matching names that differ due to nicknames, initials, missing name components, missing spaces between names, out- of-order name components, the same name represented in different languages, and more. NEXT STEPS • Request a free product evaluation of Rosette at: http://www.basistech.com/products/requests/evalua -request.html Essential Elements of Excellent Multilingual Search 13