2. Text Analysis with SAP HANA
October 2015 | Text Analysis with SAP HANA - SAP Inside Track Munich
1. Motivation
2. Text Analysis with SAP HANA
3. Enhancement Options - Dictionaries and Rules
4. Text Analysis with SAP HANA
Why do we need Text Analysis?
• According to Merrill Lynch, 80-90% of all potentially usable business information may originate in unstructured form
(Structure, Models and Meaning: Is "unstructured" data merely unmodeled?, Intelligent Enterprise, March 1, 2005.)
• The data might originate from:
Social Networks
“Letters” from customers
...
• What is the problem with unstructured data?
• It is unstructured!
Not organized
No pre-defined data model
No metadata or mix of data and metadata
We have a lot of information that is relevant for the business but we cannot access it
5. Text Analysis with SAP HANA
How can we solve that issue?
• Text Analysis: Extracting high quality information from texts
• Typical process of a text analysis:
Parsing of the text
Adding features like linguistic information
Entity recognition: Is it an organization, a person, or a place? This includes domain facts like requests.
Sentiment analysis: What attitudinal information is “hidden” in the text?
Insertion of information to database in structured manner
8. Text Analysis with SAP HANA
Fulltext Index - Basics
• Starting point: database table containing the text (types like TEXT, NVARCHAR, BLOB …)
• Create a fulltext index, including its options (existing indexes are listed in the system view SYS.FULLTEXT_INDEXES)
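A minimal sketch of these two steps in SAP HANA SQL. The schema "TA_DEMO", the table "POSTS", its columns, and the index name "POSTS_FT" are hypothetical names chosen for illustration; the option values are examples, not recommendations.

```sql
-- Hypothetical demo schema and table; all names are assumptions for illustration.
CREATE SCHEMA "TA_DEMO";

CREATE COLUMN TABLE "TA_DEMO"."POSTS" (
  "ID"      INTEGER PRIMARY KEY,
  "CONTENT" NVARCHAR(5000)
);

-- Fulltext index on the text column with two example options.
CREATE FULLTEXT INDEX "POSTS_FT" ON "TA_DEMO"."POSTS" ("CONTENT")
  LANGUAGE DETECTION ('EN', 'DE')
  FUZZY SEARCH INDEX ON
  SYNC;

-- Inspect the index and its options in the system view.
SELECT *
  FROM "SYS"."FULLTEXT_INDEXES"
 WHERE "SCHEMA_NAME" = 'TA_DEMO'
   AND "TABLE_NAME"  = 'POSTS';

-- Search the indexed column.
SELECT "ID", "CONTENT"
  FROM "TA_DEMO"."POSTS"
 WHERE CONTAINS("CONTENT", 'crossfit', FUZZY(0.8));
```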
9. Text Analysis with SAP HANA
Entity Extraction
• To get valuable information out of the data, SAP delivers several configurations
• These configurations focus on entity and fact extraction under specific aspects
• Types of Extraction:
EXTRACTION_CORE
EXTRACTION_CORE_ENTERPRISE
EXTRACTION_CORE_PUBLIC_SECTOR
EXTRACTION_CORE_VOICEOFCUSTOMER
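As a hedged sketch, one of the delivered configurations can be selected when the fulltext index is created. The table "TA_DEMO"."FEEDBACK" (with primary key "ID" and text column "CONTENT") and the index name "FEEDBACK_VOC" are hypothetical and assumed to exist without another fulltext index on that column. The extraction results are written to a generated table named $TA_<index name> in the same schema.

```sql
-- Assumes a hypothetical table "TA_DEMO"."FEEDBACK" with a primary key "ID"
-- (text analysis requires a primary key) and a text column "CONTENT".
CREATE FULLTEXT INDEX "FEEDBACK_VOC" ON "TA_DEMO"."FEEDBACK" ("CONTENT")
  CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'
  LANGUAGE DETECTION ('EN', 'DE')
  TEXT ANALYSIS ON;

-- Extracted entities and facts, one row per extraction.
SELECT "ID", "TA_COUNTER", "TA_TOKEN", "TA_TYPE", "TA_NORMALIZED"
  FROM "TA_DEMO"."$TA_FEEDBACK_VOC"
 ORDER BY "ID", "TA_COUNTER";

-- Overview of which entity and fact types were found, and how often.
SELECT "TA_TYPE", COUNT(*) AS "HITS"
  FROM "TA_DEMO"."$TA_FEEDBACK_VOC"
 GROUP BY "TA_TYPE"
 ORDER BY "HITS" DESC;
```

Grouping by TA_TYPE is a quick way to see what a given configuration actually extracts from your own data before deciding which one to use.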
13. Text Analysis with HANA – Workflow of Enhancement
1. Find the extraction configuration that best fits your needs
2. Copy the configuration into the target folder
3. Create a new custom dictionary
4. Reference the dictionary in your configuration copy
5. Recreate the fulltext index using your custom configuration
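Step 5 could look like the following sketch. It assumes that a fulltext index "FEEDBACK_TA" already exists on the hypothetical table "TA_DEMO"."FEEDBACK", and that the configuration copy has been activated in the repository under the hypothetical package "acme.ta" with the name "CROSSFIT_CONFIG"; custom configurations are referenced as <package>::<name>.

```sql
-- Drop the existing index (assumed to exist; the name is hypothetical).
DROP FULLTEXT INDEX "TA_DEMO"."FEEDBACK_TA";

-- Recreate it with the custom configuration copy.
-- 'acme.ta::CROSSFIT_CONFIG' is a placeholder for your own package and file name.
CREATE FULLTEXT INDEX "FEEDBACK_TA" ON "TA_DEMO"."FEEDBACK" ("CONTENT")
  CONFIGURATION 'acme.ta::CROSSFIT_CONFIG'
  TEXT ANALYSIS ON;

-- The result table is rebuilt along with the index.
SELECT "TA_TOKEN", "TA_TYPE"
  FROM "TA_DEMO"."$TA_FEEDBACK_TA";
```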
15. Text Analysis with HANA – What’s next?
• Assume that we are in an “industry”-specific context or mining for “slang”-like facts and entities
• Sports are a good example of this!
• We use the example of CrossFit® … as there are some funny facts to extract
• Question: How can we extract complex entities from a text?
• Examples:
Did somebody attend a CrossFit training?
Does somebody want to join a CrossFit box?
16. Text Analysis with HANA – Text Analysis Extraction Rules
• Extraction rules (CGUL rules): a pattern-based language that combines character- or token-based regular expressions with linguistic attributes to define custom entity types.
• Goal of the rule sets:
Extract complex facts based on relations between entities and predicates.
Identify entities in domain-specific language and capture facts expressed in new, popular “slang”
17. Text Analysis with HANA – Text Analysis Extraction Rules
[Diagram: Extraction Rule, built from Regular Expressions, Tokens, Dictionaries, and Luck]
19. Text Analysis with HANA – “Lessons Learned”
• Text Analysis on SAP HANA is extremely powerful
• Besides the delivered content, you have a lot of options to adapt the text analysis to extract the entities and facts that you need
• This also means you have a lot of options that you can set the wrong way
• Since SP09, rules are compiled upon activation (no separate compilation necessary)
• The documentation is mostly OK but has room for improvement when it comes to extraction rules
• Creating custom dictionaries and text rules is cumbersome; finding an error (e.g. a typo) is hell
No support in IDE
You can usually activate all objects, create the index … but the index remains empty
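When the index stays empty, a few queries can at least narrow down where things go wrong. This is only a sketch: the schema, table, and index names are hypothetical, and the monitoring view "SYS"."M_FULLTEXT_QUEUES" with its SCHEMA_NAME and TABLE_NAME columns is quoted from memory, so check it against the documentation of your revision.

```sql
-- 1. Did text analysis produce anything at all?
SELECT COUNT(*) AS "EXTRACTIONS"
  FROM "TA_DEMO"."$TA_FEEDBACK_TA";

-- 2. Was the index created with the configuration you expected?
SELECT *
  FROM "SYS"."FULLTEXT_INDEXES"
 WHERE "SCHEMA_NAME" = 'TA_DEMO'
   AND "TABLE_NAME"  = 'FEEDBACK';

-- 3. Are documents still queued, or did indexing fail for some of them?
SELECT *
  FROM "SYS"."M_FULLTEXT_QUEUES"
 WHERE "SCHEMA_NAME" = 'TA_DEMO'
   AND "TABLE_NAME"  = 'FEEDBACK';
```

If the index is fine with a delivered configuration such as EXTRACTION_CORE but empty with the custom one, the custom dictionary or rule file is the likely suspect.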
20. Q&A
21. .consulting .solutions .partnership
Dr. Christian Lechner
Principal IT Consultant
+49 (0) 171 7617190
christian.lechner@msg-systems.com
http://scn.sap.com/people/christian.lechner
@lechnerc77
22. Text Analysis with HANA – Resources
• SAP HANA Search Developer Guide (Fulltext Index Options)
help.sap.com -> Search Developer Guide
• SAP HANA Text Analysis Developer Guide:
help.sap.com -> TA Developer Guide
• SAP HANA Text Analysis Language Reference Guide:
help.sap.com -> TA Language Reference Guide
• SAP HANA Text Analysis Extraction Customization Guide:
help.sap.com -> TA Extraction Customization Guide
• YouTube Playlist of SAP HANA Academy:
Text Analysis and Search
Editor's notes
Text analysis in SAP HANA is a suite of natural-language processing capabilities based on linguistic, statistical and machine-learning algorithms that model and structure the information content of textual sources in multiple languages.
This technology forms the foundation for advanced text processing for a range of applications including search, business intelligence or exploratory data analysis.
LANGUAGE COLUMN <column_name> - Defines the column where the language of a document is specified.
LANGUAGE DETECTION ( <string_literal_list> ) - The set of languages to be considered during language detection.
MIME TYPE COLUMN <column_name> - Defines the column where the mime-type of a document is specified.
FUZZY SEARCH INDEX <on_off> - Specifies whether a fuzzy search index should be used.
PHRASE INDEX RATIO <index_ratio>, where <index_ratio> ::= <exact_numeric_literal> - Specifies the percentage of the phrase index. The value must be between 0.0 and 1.0. The phrase index stores information about the occurrence of words and the proximity of words to one another. If a phrase index is present, phrase searches are sped up (e.g. SELECT * FROM T WHERE CONTAINS(COLUMN1, '"cats and dogs"')). A value of 1.0 means that the internal phrase index can use 100% of the memory size of the fulltext index.
CONFIGURATION <string_literal> - The path to a custom configuration file for text analysis.
SEARCH ONLY <on_off> - Defines if the original document should be stored or only the search results. When set to ON the original document content is not stored.
FAST PREPROCESS <on_off> - If set to ON, fast preprocessing is used, i.e. linguistic searches are not possible.
TEXT ANALYSIS <on_off> - Enables text analysis capabilities on the indexed column. Text analysis can extract entities such as persons, products, or places from documents, which are stored in a new table.
MIME TYPE <string_literal> - The default mime type used for preprocessing. The value must be a valid mime type.
TOKEN SEPARATORS <string_literal> - A set of characters used for token separation. Only ASCII characters are considered.
<change_tracking_elem> ::= SYNC[HRONOUS] | ASYNC[HRONOUS] [FLUSH [QUEUE] <flush_queue_elem>] - The type of index to be created.
SYNC[HRONOUS] - Creates a synchronous fulltext index.
ASYNC[HRONOUS] - Creates an asynchronous fulltext index.
FLUSH [QUEUE] <flush_queue_elem>, where <flush_queue_elem> ::= EVERY <integer_literal> MINUTES | AFTER <integer_literal> DOCUMENTS | EVERY <integer_literal> MINUTES OR AFTER <integer_literal> DOCUMENTS - Specifies when to update the fulltext index if an asynchronous index is used. When DOCUMENTS is specified, the fulltext index will be updated after the specified number of changes to the table, including updates and deletes.
TEXT MINING <on_off> - Enables text mining capabilities on the indexed column. Text mining provides functionality that can compare documents by examining the terms used within them.
TEXT MINING CONFIGURATION <string_literal> - The path to a custom configuration file for text mining. If not specified, DEFAULT.textminingconfig is used.
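Several of the options above can be combined in one statement. The following is a sketch only: the table "TA_DEMO"."ARTICLES", its columns "ID" and "BODY", and the index name are hypothetical, and the values are arbitrary examples.

```sql
-- Asynchronous fulltext index with a phrase index and a flush policy.
CREATE FULLTEXT INDEX "ARTICLES_FT" ON "TA_DEMO"."ARTICLES" ("BODY")
  LANGUAGE DETECTION ('EN', 'DE')
  PHRASE INDEX RATIO 0.2
  FUZZY SEARCH INDEX ON
  ASYNC FLUSH QUEUE EVERY 5 MINUTES OR AFTER 100 DOCUMENTS;

-- Phrase search that can benefit from the phrase index.
SELECT "ID"
  FROM "TA_DEMO"."ARTICLES"
 WHERE CONTAINS("BODY", '"cats and dogs"');
```

With an asynchronous index, newly inserted documents only become searchable after the next flush, so an empty search result right after an insert is expected.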
Entity Extraction is the identification of named entities (persons, organizations etc.), which eliminates the 'noise' in textual data by highlighting salient information. This process transforms unstructured text into structured information.
Fact Extraction is a higher-level semantic processing that links entities as "facts" in domain-specific applications. For example, "Voice of the Customer" classifies sentiments with their corresponding topics.