SlideShare une entreprise Scribd logo
1  sur  21
Scalable Natural Language Processing (SNaP)
IBM Research | Almaden
SystemT:
Declarative Information Extraction
Yunyao Li
Text analytics
is the key for
Unlocking big data value
Customer 360º View
Social Media Analytics, CRM Analytics
Security & Privacy
Email Analytics, Data Redaction, Digital Piracy
Operations Analysis
Machine Data Analytics
Financial Analytics
Systemic risk analysis, Event monitoring
Healthcare Analytics
Patient care, drug discovery
and many more …
Text Analytics
Text Analytics vs Information Extraction
3
Information
Extraction
Information
Extraction
ClusteringClustering
ClassificationClassification
RegressionRegression
Core operations such
as pos tagging, deep
parsing, sentence
detection etc.
Core operations such
as pos tagging, deep
parsing, sentence
detection etc.
Information extraction: A key enabling technology for
text analytics
•Input: Raw text and features from low-level natural
language processing operations
•Output: Structured data suitable for general purpose
analysis software
Information Extraction All Over
… hotel close to Seaworld in Orlando?
Planning to go this coming weekend…
Social Media
Customer 360º
Security & Privacy
Operations Analysis
Financial Analytics
my social security # is 400-05-4356
Email
Machine
Data
Financial
Statements
Medical
Records
News
CRM
Patents
Product:Hotel Location:Orlando
Type:SSN Value:400054356
Storage Module 2 Drive 1 fault
Event:DriveFail Loc:mod2.Slot1
5
Extraction Example: Fact Extraction (Tables)
Singapore 2012 Annual Report
(136 pages PDF)
Identify note breaking down
Operating expenses line item,
and extract opex components
Identify note breaking down
Operating expenses line item,
and extract opex components
Identify line item for Operating
expenses from Income statement
(financial table in pdf document)
Identify line item for Operating
expenses from Income statement
(financial table in pdf document)
6
Extraction Example: Sentiment Analysis
Mcdonalds mcnuggets are fake as shit but they so delicious.Mcdonalds mcnuggets are fake as shit but they so delicious.
We should do something cool like go to Z (kidding).We should do something cool like go to Z (kidding).
Makin chicken fries at home bc everyone sucks!Makin chicken fries at home bc everyone sucks!
You are never too old for Disney movies.You are never too old for Disney movies.
Social Media
Different domain  Different ball game!
Bank X got me ****ed up today!Bank X got me ****ed up today!
Customer Reviews
Not a pleasant client experience. Please fix ASAP.Not a pleasant client experience. Please fix ASAP.
I'm still hearing from clients that Merrill's website is better.I'm still hearing from clients that Merrill's website is better.
X... fixing something that wasn't brokenX... fixing something that wasn't broken
Analyst Research Reports
Intel's 2013 capex is elevated at 23% of sales, above average of 16%Intel's 2013 capex is elevated at 23% of sales, above average of 16%
IBM announced 4Q2012 earnings of $5.13 per share, compared with
4Q2011 earnings of $4.62 per share, an increase of 11 percent
IBM announced 4Q2012 earnings of $5.13 per share, compared with
4Q2011 earnings of $4.62 per share, an increase of 11 percent
We continue to rate shares of MSFT neutral.We continue to rate shares of MSFT neutral.
Sell EUR/CHF at market for a decline to 1.31000…Sell EUR/CHF at market for a decline to 1.31000…
FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury.FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury.
Equity
Fixed Income
Currency
Enterprise Requirements
• Accuracy
– Garbage-in garbage-out: Usefulness of analytics results is tied to the quality of
extraction
• Scalability
– Large data volumes, often orders of magnitude larger than classical NLP corpora
• Social Media:
– Twitter alone has 400M+ messages / day; 1TB+ per day
• Financial Data:
– SEC alone has 20M+ filings, several TBs of data, with documents range from few KBs to few
MBs
• Machine Data:
– One application server under moderate load at medium logging level  1GB of logs per day
• Transparency
– Every customer’s data and problems are unique in some way
– Need to quickly implement new business logic
• Usability
– Building an accurate IE system is labor-intensive
– Professional programmers are expensive
8
Core OperationsCore OperationsExecution EngineExecution Engine
SystemT’s 5 Layered Architecture
Compiler + OptimizerCompiler + Optimizer
Extraction LanguageExtraction Language
Generic LibrariesGeneric Libraries
Tooling + UITooling + UI
20+ languages
currently
supported
NE libraries,
Sentiment,
informal-text
normalizer, …
Scalable
Embeddable
Runtime
AQL
Cost based
optimizer
Eclipse tooling:
editor, workflow
analysis, pattern
discovery, regex
learner, profiler,…
9
package com.ibm.avatar.algebra.util.sentence;
import java.io.BufferedWriter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.regex.Matcher;
public class SentenceChunker
{
private Matcher sentenceEndingMatcher = null;
public static BufferedWriter sentenceBufferedWriter = null;
private HashSet<String> abbreviations = new HashSet<String> ();
public SentenceChunker ()
{
}
/** Constructor that takes in the abbreviations directly. */
public SentenceChunker (String[] abbreviations)
{
// Generate the abbreviations directly.
for (String abbr : abbreviations) {
this.abbreviations.add (abbr);
}
}
/**
* @param doc the document text to be analyzed
* @return true if the document contains at least one sentence boundary
*/
public boolean containsSentenceBoundary (String doc)
{
String origDoc = doc;
/*
* Based on getSentenceOffsetArrayList()
*/
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
String candidate = /*
* Looks at the last character of the String. If this last
* character is part of an abbreviation (as detected by
* REGEX) then the sentenceString is not a fullSentence and
* "false” is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder)
&& isFullSentence (candidate))) {
/* Get the next tentative boundary for the sentenceString */
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
// sentences.addElement(candidate.trim().replaceAll("n", "
// "));
// sentenceArrayList.add(new Integer(currentOffset + boundary
// + 1));
// currentOffset += boundary + 1;
// Found a sentence boundary. If the boundary is the last
// character in the string, we don't consider it to be
// contained within the string.
int baseOffset = currentOffset + boundary + 1;
if (baseOffset < origDoc.length ()) {
// System.err.printf("Sentence ends at %d of %dn",
// baseOffset, origDoc.length());
return true;
}
else {
return false;
}
}
// origDoc.substring(0,currentOffset));
// doc = doc.substring(boundary + 1);
doc = remainder;
}
}
while (boundary != -1);
// If we get here, didn't find any boundaries.
return false;
}
public ArrayList<Integer> getSentenceOffsetArrayList (String doc)
{
ArrayList<Integer> sentenceArrayList = new ArrayList<Integer> ();
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
sentenceArrayList.add (new Integer (0));
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {
String candidate = doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
/*
* Looks at the last character of the String. If this last character
* is part of an abbreviation (as detected by REGEX) then the
* sentenceString is not a fullSentence and "false" is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder) &&
isFullSentence (candidate))) {
/* Get the next tentative boundary for the sentenceString */
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + boundary + 1));
currentOffset += boundary + 1;
}
// origDoc.substring(0,currentOffset));
doc = remainder;
}
}
while (boundary != -1);
if (doc.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + doc.length ()));
}
sentenceArrayList.trimToSize ();
return sentenceArrayList;
}
private void setDocumentForObtainingBoundaries (String doc)
{
sentenceEndingMatcher = SentenceConstants.
sentenceEndingPattern.matcher (doc);
}
private int getNextCandidateBoundary ()
{
if (sentenceEndingMatcher.find ()) {
return sentenceEndingMatcher.start ();
}
else
return -1;
}
private boolean doesNotBeginWithPunctuation (String remainder)
{
Matcher m = SentenceConstants.punctuationPattern.matcher (remainder);
return (!m.find ());
}
private String getLastWord (String cand)
{
Matcher lastWordMatcher = SentenceConstants.lastWordPattern.matcher (cand);
if (lastWordMatcher.find ()) {
return lastWordMatcher.group ();
}
else {
return "";
}
}
/*
* Looks at the last character of the String. If this last character is
* par of an abbreviation (as detected by REGEX)
* then the sentenceString is not a fullSentence and "false" is returned
*/
private boolean isFullSentence (String cand)
{
// cand = cand.replaceAll("n", " "); cand = " " + cand;
Matcher validSentenceBoundaryMatcher =
SentenceConstants.validSentenceBoundaryPattern.matcher (cand);
if (validSentenceBoundaryMatcher.find ()) return true;
Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher (cand);
if (abbrevMatcher.find ()) {
return false; // Means it ends with an abbreviation
}
else {
// Check if the last word of the sentenceString has an entry in the
// abbreviations dictionary (like Mr etc.)
String lastword = getLastWord (cand);
if (abbreviations.contains (lastword)) { return false; }
}
return true;
}
}
Java Implementation of Sentence Boundary Detection
Easy to Develop & Maintain
create dictionary AbbrevDict from file
'abbreviation.dict’;
create view SentenceBoundary as
select R.match as boundary
from ( extract regex /(([.?!]+s)|(ns*n))/
on D.text as match from Document D ) R
where
Not(ContainsDict('AbbrevDict',
CombineSpans(LeftContextTok(R.match, 1),R.match)));
create dictionary AbbrevDict from file
'abbreviation.dict’;
create view SentenceBoundary as
select R.match as boundary
from ( extract regex /(([.?!]+s)|(ns*n))/
on D.text as match from Document D ) R
where
Not(ContainsDict('AbbrevDict',
CombineSpans(LeftContextTok(R.match, 1),R.match)));
Equivalent AQL Implementation
10
State-of-the-art
industrial system
State-of-the-art
open-source system
SystemT
Scalability via Optimization
SystemT needs less than 10% of the
cores to keep up with Twitter’s daily
feed without drop in quality
10x50x
Products utilizing SystemT for pre-determined extraction tasksProducts utilizing SystemT for pre-determined extraction tasksProducts utilizing SystemT for pre-determined extraction tasksProducts utilizing SystemT for pre-determined extraction tasks
Products fully exposingProducts fully exposing
the AQL language and enginethe AQL language and engine
Products fully exposingProducts fully exposing
the AQL language and enginethe AQL language and engine
SystemT in IBM Products
InfoSphere
BigInsights
InfoSphere
Streams
Social
Media
Analytics
eDiscovery
Analyzer
Optim Data
Lifecycle
Management
Solutions
SmartCloud
Analytics:
Log Analysis
Watson
Discovery
Advisor
Installed on 400+ client sites
Case Study: Drug Discovery
0-I II III IV
Time to develop a drug; 12 -15 years
Avg cost to develop a drug: 1.2 billion
Adapted from PhRMA (Pharmaceutical Research and Manufacturers of America) 2013 profile
Laboratory
50,000 +
Compounds
Pre-Clinical
250 Compounds
Clinical
5 Compounds
FDA
approval
1 compound
“..Toxicity and Serious Adverse Events in Late Stage Drug Development
are the Major Causes of Drug Failure”
Intervention should happen at critical transition bottlenecks between
stages (most likely to impact outcome)
Drug Development Pipeline
LaboratoryLaboratory Pre-clinicalPre-clinical ClinicalClinical FDA ApprovalFDA Approval BedsideBedside
Objective: enable data-driven decision making
more efficient clinical trial design, data analytics and drug success / failure
predictions
Case Study: Drug Discovery
Structure
Indication
Indication
Mode of action
/ target
Effect level
Side Effect
360° view on drugs allows for comparison between drugs, and therefore allow to
predict drug behavior
…Structured and
unstructured data
sources
Extract known facts about a drug
Patents also have (Manually Created)
Chemical Complex Work Units (CWU’s)
As text
Chemical names
found in the text of
documents
As bitmap images
Pictures of chemicals
found in the document
Images
Unstructured data contain chemical compound information in many different forms
Chemical
nomenclature can
be daunting
Information Extraction Example
Structure
Renders content searchable and avoid redundant and expensive tests
Building block for automation of the discovery and decision process
Internal reports
Public Medline
articles with limited
structured info
POPULATION
gender FEMALE,MALE
species RAT
strain Sprague Dawley
precondition pregnant
From unstructured to structured content
 Handle variants in entity reference
 Disambiguate based on context, etc…
 Extract relationships
 Assemble complex concepts
Combine rule-based extraction with Machine Learning
 Handle variants in entity reference
 Disambiguate based on context, etc…
 Extract relationships
 Assemble complex concepts
Combine rule-based extraction with Machine Learning
EfficacyEfficacyEfficacyEfficacyResults
StudyStudy
Authorship InterventionInterventionAuthorship InterventionAuthorship Intervention
PopulationPopulationPopulationPopulationPopulation
Study
name
Size Population
description
Time
interval
Intervention Phase Type Authorship
Author Author
affiliation
Funding
org
Compound Target Admin
route
dosage Type (preventive
/therapeutic)
Species Subspecies Gender Age Pre-existing
conditions
Physical
attributes
Geographic
Location
1.Extractmentions2.Assemblementions
intoinformation
In-life
Observations
Hematology Clinical chem Comparative
Macroscopic Results
Comparative
Microscopic Resutls
Preliminary Results
SUGGEST NEW P53 KINASES
2. Cluster proteins by their literature distance: known p53
kinases cluster (green) and suggest others that may also
phosphorylate p53.
GREEN – p53 Kinases
RED/ORANGE/YELLOW – Predicted Targets
1. Quantify the “literature
distances” between
proteins
BMPR2CHEK2
Bag of Words Model
Example Use Case: Drug Development
Summary
• SystemT
– Declarative IE system based on an algebraic framework
– Expressivity and performance advantages over grammar-
based IE systems
– Text-specific optimizations
– Key enabling technology for Big Data analytics applications
• Ongoing Work
– Tooling support for non-programmers
– Improved optimization strategies
– New operators for advanced features (e.g. co-reference
resolution, text normalization)
Thank you!
• For more information…
– Visit our website: https://ibm.biz/BdF4GQ
– Get a free copy:
• IBM BigInsights Quick start Edition http://www-
01.ibm.com/software/data/infosphere/biginsights/quick-start/
• IBM Academic Initiative: http://www-
03.ibm.com/ibm/university/academic/pub/page/academic_initiative.
InfoSphere BigInsights 2.1.1 (and Streams 3.0) are freely available for academic
purposes.
– Take free online classes: BigDataUniversity.com
– Watch YouTube videos:
• IBM InfoSphere BigInsights Text Analytics Channel:
http://www.youtube.com/playlist?list=PL2C66C6E455AD0216
• IBM Big Data Channel: http://www.youtube.com/user/ibmbigdata
• IBM InfoSphere Streams Channel: http://www.youtube.com/playlist?
list=PLCF04A48C22F34B19
– Contact me
• yunyaoli@us.ibm.com
SystemT Team:
Laura Chiticariu, Marina Danilevsky, Howard Ho, Rajasekar Krishnamurthy, Yunyao Li
Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, Huaiyu Zhu
SystemT Team:
Laura Chiticariu, Marina Danilevsky, Howard Ho, Rajasekar Krishnamurthy, Yunyao Li
Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, Huaiyu Zhu

Contenu connexe

Tendances

Declarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTDeclarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTLaura Chiticariu
 
Deep learning for e-commerce: current status and future prospects
Deep learning for e-commerce: current status and future prospectsDeep learning for e-commerce: current status and future prospects
Deep learning for e-commerce: current status and future prospectsRakuten Group, Inc.
 
Large-Scale Machine Learning at Twitter
Large-Scale Machine Learning at TwitterLarge-Scale Machine Learning at Twitter
Large-Scale Machine Learning at Twitternep_test_account
 
Text semantics with azure text analytics cognitive services
Text semantics with azure text analytics cognitive servicesText semantics with azure text analytics cognitive services
Text semantics with azure text analytics cognitive servicesCodeOps Technologies LLP
 
FrugalML: Using ML APIs More Accurately and Cheaply
FrugalML: Using ML APIs More Accurately and CheaplyFrugalML: Using ML APIs More Accurately and Cheaply
FrugalML: Using ML APIs More Accurately and CheaplyDatabricks
 
Natural Language Processing at Scale
Natural Language Processing at ScaleNatural Language Processing at Scale
Natural Language Processing at ScaleAndrei Lopatenko
 
Graph-Powered Machine Learning
Graph-Powered Machine LearningGraph-Powered Machine Learning
Graph-Powered Machine LearningDatabricks
 
Shanish_SQL_PLSQL_Profile
Shanish_SQL_PLSQL_ProfileShanish_SQL_PLSQL_Profile
Shanish_SQL_PLSQL_ProfileShanish Jain
 
BSSML16 L5. Summary Day 1 Sessions
BSSML16 L5. Summary Day 1 SessionsBSSML16 L5. Summary Day 1 Sessions
BSSML16 L5. Summary Day 1 SessionsBigML, Inc
 
Data science | What is Data science
Data science | What is Data scienceData science | What is Data science
Data science | What is Data scienceShilpaKrishna6
 
[RakutenTechConf2013][C-4_3] Our Goals and Activities at Rakuten Institute o...
[RakutenTechConf2013][C-4_3] Our Goals and Activities at Rakuten Institute o...[RakutenTechConf2013][C-4_3] Our Goals and Activities at Rakuten Institute o...
[RakutenTechConf2013][C-4_3] Our Goals and Activities at Rakuten Institute o...Rakuten Group, Inc.
 
II-SDV 2012 Expert System Driven Insights into Patent Quality and Competitive...
II-SDV 2012 Expert System Driven Insights into Patent Quality and Competitive...II-SDV 2012 Expert System Driven Insights into Patent Quality and Competitive...
II-SDV 2012 Expert System Driven Insights into Patent Quality and Competitive...Dr. Haxel Consult
 
Webinar on IT Basics by IIM Rohtak for Admissions-2014
Webinar on IT Basics by IIM Rohtak for Admissions-2014Webinar on IT Basics by IIM Rohtak for Admissions-2014
Webinar on IT Basics by IIM Rohtak for Admissions-2014PR Cell, IIM Rohtak
 
Scaling up business value with real-time operational graph analytics
Scaling up business value with real-time operational graph analyticsScaling up business value with real-time operational graph analytics
Scaling up business value with real-time operational graph analyticsConnected Data World
 
II-SDV 2012 Dealing with Large Data Volumes in Statistical Analysis and Text ...
II-SDV 2012 Dealing with Large Data Volumes in Statistical Analysis and Text ...II-SDV 2012 Dealing with Large Data Volumes in Statistical Analysis and Text ...
II-SDV 2012 Dealing with Large Data Volumes in Statistical Analysis and Text ...Dr. Haxel Consult
 
3 where are my keys - sql explore
3   where are my keys - sql explore3   where are my keys - sql explore
3 where are my keys - sql exploresqlserver.co.il
 
Good Applications of Bad Machine Translation
Good Applications of Bad Machine TranslationGood Applications of Bad Machine Translation
Good Applications of Bad Machine Translationbdonaldson
 
Supporting GDPR Compliance through effectively governing Data Lineage and Dat...
Supporting GDPR Compliance through effectively governing Data Lineage and Dat...Supporting GDPR Compliance through effectively governing Data Lineage and Dat...
Supporting GDPR Compliance through effectively governing Data Lineage and Dat...Connected Data World
 
From keyword-based search to language-agnostic semantic search
From keyword-based search to language-agnostic semantic searchFrom keyword-based search to language-agnostic semantic search
From keyword-based search to language-agnostic semantic searchCareerBuilder.com
 
Improving Search in Workday Products using Natural Language Processing
Improving Search in Workday Products using Natural Language ProcessingImproving Search in Workday Products using Natural Language Processing
Improving Search in Workday Products using Natural Language ProcessingDataWorks Summit
 

Tendances (20)

Declarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTDeclarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemT
 
Deep learning for e-commerce: current status and future prospects
Deep learning for e-commerce: current status and future prospectsDeep learning for e-commerce: current status and future prospects
Deep learning for e-commerce: current status and future prospects
 
Large-Scale Machine Learning at Twitter
Large-Scale Machine Learning at TwitterLarge-Scale Machine Learning at Twitter
Large-Scale Machine Learning at Twitter
 
Text semantics with azure text analytics cognitive services
Text semantics with azure text analytics cognitive servicesText semantics with azure text analytics cognitive services
Text semantics with azure text analytics cognitive services
 
FrugalML: Using ML APIs More Accurately and Cheaply
FrugalML: Using ML APIs More Accurately and CheaplyFrugalML: Using ML APIs More Accurately and Cheaply
FrugalML: Using ML APIs More Accurately and Cheaply
 
Natural Language Processing at Scale
Natural Language Processing at ScaleNatural Language Processing at Scale
Natural Language Processing at Scale
 
Graph-Powered Machine Learning
Graph-Powered Machine LearningGraph-Powered Machine Learning
Graph-Powered Machine Learning
 
Shanish_SQL_PLSQL_Profile
Shanish_SQL_PLSQL_ProfileShanish_SQL_PLSQL_Profile
Shanish_SQL_PLSQL_Profile
 
BSSML16 L5. Summary Day 1 Sessions
BSSML16 L5. Summary Day 1 SessionsBSSML16 L5. Summary Day 1 Sessions
BSSML16 L5. Summary Day 1 Sessions
 
Data science | What is Data science
Data science | What is Data scienceData science | What is Data science
Data science | What is Data science
 
[RakutenTechConf2013][C-4_3] Our Goals and Activities at Rakuten Institute o...
[RakutenTechConf2013][C-4_3] Our Goals and Activities at Rakuten Institute o...[RakutenTechConf2013][C-4_3] Our Goals and Activities at Rakuten Institute o...
[RakutenTechConf2013][C-4_3] Our Goals and Activities at Rakuten Institute o...
 
II-SDV 2012 Expert System Driven Insights into Patent Quality and Competitive...
II-SDV 2012 Expert System Driven Insights into Patent Quality and Competitive...II-SDV 2012 Expert System Driven Insights into Patent Quality and Competitive...
II-SDV 2012 Expert System Driven Insights into Patent Quality and Competitive...
 
Webinar on IT Basics by IIM Rohtak for Admissions-2014
Webinar on IT Basics by IIM Rohtak for Admissions-2014Webinar on IT Basics by IIM Rohtak for Admissions-2014
Webinar on IT Basics by IIM Rohtak for Admissions-2014
 
Scaling up business value with real-time operational graph analytics
Scaling up business value with real-time operational graph analyticsScaling up business value with real-time operational graph analytics
Scaling up business value with real-time operational graph analytics
 
II-SDV 2012 Dealing with Large Data Volumes in Statistical Analysis and Text ...
II-SDV 2012 Dealing with Large Data Volumes in Statistical Analysis and Text ...II-SDV 2012 Dealing with Large Data Volumes in Statistical Analysis and Text ...
II-SDV 2012 Dealing with Large Data Volumes in Statistical Analysis and Text ...
 
3 where are my keys - sql explore
3   where are my keys - sql explore3   where are my keys - sql explore
3 where are my keys - sql explore
 
Good Applications of Bad Machine Translation
Good Applications of Bad Machine TranslationGood Applications of Bad Machine Translation
Good Applications of Bad Machine Translation
 
Supporting GDPR Compliance through effectively governing Data Lineage and Dat...
Supporting GDPR Compliance through effectively governing Data Lineage and Dat...Supporting GDPR Compliance through effectively governing Data Lineage and Dat...
Supporting GDPR Compliance through effectively governing Data Lineage and Dat...
 
From keyword-based search to language-agnostic semantic search
From keyword-based search to language-agnostic semantic searchFrom keyword-based search to language-agnostic semantic search
From keyword-based search to language-agnostic semantic search
 
Improving Search in Workday Products using Natural Language Processing
Improving Search in Workday Products using Natural Language ProcessingImproving Search in Workday Products using Natural Language Processing
Improving Search in Workday Products using Natural Language Processing
 

Similaire à SystemT: Declarative Information Extraction

2015 07-30-sysetm t.short.no.animation
2015 07-30-sysetm t.short.no.animation2015 07-30-sysetm t.short.no.animation
2015 07-30-sysetm t.short.no.animationdiannepatricia
 
Declarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTDeclarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTdiannepatricia
 
Transparent Machine Learning for Information Extraction
Transparent Machine Learning for Information ExtractionTransparent Machine Learning for Information Extraction
Transparent Machine Learning for Information ExtractionLaura Chiticariu
 
Elasticsearch: Introducing the wildcard field
Elasticsearch: Introducing the wildcard fieldElasticsearch: Introducing the wildcard field
Elasticsearch: Introducing the wildcard fieldElasticsearch
 
2017-01-25-SystemT-Overview-Stanford
2017-01-25-SystemT-Overview-Stanford2017-01-25-SystemT-Overview-Stanford
2017-01-25-SystemT-Overview-StanfordLaura Chiticariu
 
Get full visibility and find hidden security issues
Get full visibility and find hidden security issuesGet full visibility and find hidden security issues
Get full visibility and find hidden security issuesElasticsearch
 
TeelTech - Advancing Mobile Device Forensics (online version)
TeelTech - Advancing Mobile Device Forensics (online version)TeelTech - Advancing Mobile Device Forensics (online version)
TeelTech - Advancing Mobile Device Forensics (online version)Mike Felch
 
Deep Learning to Big Data Analytics on Apache Spark Using BigDL with Xianyan ...
Deep Learning to Big Data Analytics on Apache Spark Using BigDL with Xianyan ...Deep Learning to Big Data Analytics on Apache Spark Using BigDL with Xianyan ...
Deep Learning to Big Data Analytics on Apache Spark Using BigDL with Xianyan ...Databricks
 
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)Konstantinos Zagoris
 
Mark Logic StrangeLoop 2010
Mark Logic StrangeLoop 2010Mark Logic StrangeLoop 2010
Mark Logic StrangeLoop 2010Christopher Biow
 
New Features in Apache Pinot
New Features in Apache PinotNew Features in Apache Pinot
New Features in Apache PinotSiddharth Teotia
 
A case for teaching SQL to scientists
A case for teaching SQL to scientistsA case for teaching SQL to scientists
A case for teaching SQL to scientistsdhalperi
 
Xml processing-by-asfak
Xml processing-by-asfakXml processing-by-asfak
Xml processing-by-asfakAsfak Mahamud
 
Lec1cgu13updated.ppt
Lec1cgu13updated.pptLec1cgu13updated.ppt
Lec1cgu13updated.pptRahulTr22
 
Data science programming .ppt
Data science programming .pptData science programming .ppt
Data science programming .pptGanesh E
 

Similaire à SystemT: Declarative Information Extraction (20)

2015 07-30-sysetm t.short.no.animation
2015 07-30-sysetm t.short.no.animation2015 07-30-sysetm t.short.no.animation
2015 07-30-sysetm t.short.no.animation
 
Declarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTDeclarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemT
 
Transparent Machine Learning for Information Extraction
Transparent Machine Learning for Information ExtractionTransparent Machine Learning for Information Extraction
Transparent Machine Learning for Information Extraction
 
Elasticsearch: Introducing the wildcard field
Elasticsearch: Introducing the wildcard fieldElasticsearch: Introducing the wildcard field
Elasticsearch: Introducing the wildcard field
 
2017-01-25-SystemT-Overview-Stanford
2017-01-25-SystemT-Overview-Stanford2017-01-25-SystemT-Overview-Stanford
2017-01-25-SystemT-Overview-Stanford
 
Get full visibility and find hidden security issues
Get full visibility and find hidden security issuesGet full visibility and find hidden security issues
Get full visibility and find hidden security issues
 
TeelTech - Advancing Mobile Device Forensics (online version)
TeelTech - Advancing Mobile Device Forensics (online version)TeelTech - Advancing Mobile Device Forensics (online version)
TeelTech - Advancing Mobile Device Forensics (online version)
 
Deep Learning to Big Data Analytics on Apache Spark Using BigDL with Xianyan ...
Deep Learning to Big Data Analytics on Apache Spark Using BigDL with Xianyan ...Deep Learning to Big Data Analytics on Apache Spark Using BigDL with Xianyan ...
Deep Learning to Big Data Analytics on Apache Spark Using BigDL with Xianyan ...
 
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
 
Mark Logic StrangeLoop 2010
Mark Logic StrangeLoop 2010Mark Logic StrangeLoop 2010
Mark Logic StrangeLoop 2010
 
New Features in Apache Pinot
New Features in Apache PinotNew Features in Apache Pinot
New Features in Apache Pinot
 
A case for teaching SQL to scientists
A case for teaching SQL to scientistsA case for teaching SQL to scientists
A case for teaching SQL to scientists
 
Phpconf2008 Sphinx En
Phpconf2008 Sphinx EnPhpconf2008 Sphinx En
Phpconf2008 Sphinx En
 
Fuzzing - Part 1
Fuzzing - Part 1Fuzzing - Part 1
Fuzzing - Part 1
 
Aspects of NLP Practice
Aspects of NLP PracticeAspects of NLP Practice
Aspects of NLP Practice
 
Xml processing-by-asfak
Xml processing-by-asfakXml processing-by-asfak
Xml processing-by-asfak
 
Shilpa ppt
Shilpa pptShilpa ppt
Shilpa ppt
 
Data Science
Data Science Data Science
Data Science
 
Lec1cgu13updated.ppt
Lec1cgu13updated.pptLec1cgu13updated.ppt
Lec1cgu13updated.ppt
 
Data science programming .ppt
Data science programming .pptData science programming .ppt
Data science programming .ppt
 

Plus de Yunyao Li

The Role of Patterns in the Era of Large Language Models
The Role of Patterns in the Era of Large Language ModelsThe Role of Patterns in the Era of Large Language Models
The Role of Patterns in the Era of Large Language ModelsYunyao Li
 
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-Loop
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-LoopBuilding, Growing and Serving Large Knowledge Graphs with Human-in-the-Loop
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-LoopYunyao Li
 
Meaning Representations for Natural Languages: Design, Models and Applications
Meaning Representations for Natural Languages:  Design, Models and ApplicationsMeaning Representations for Natural Languages:  Design, Models and Applications
Meaning Representations for Natural Languages: Design, Models and ApplicationsYunyao Li
 
Towards Deep Table Understanding
Towards Deep Table UnderstandingTowards Deep Table Understanding
Towards Deep Table UnderstandingYunyao Li
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language ProcessingYunyao Li
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language ProcessingYunyao Li
 
Towards Universal Language Understanding
Towards Universal Language UnderstandingTowards Universal Language Understanding
Towards Universal Language UnderstandingYunyao Li
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language ProcessingYunyao Li
 
Towards Universal Language Understanding (2020 version)
Towards Universal Language Understanding (2020 version)Towards Universal Language Understanding (2020 version)
Towards Universal Language Understanding (2020 version)Yunyao Li
 
Towards Universal Semantic Understanding of Natural Languages
Towards Universal Semantic Understanding of Natural LanguagesTowards Universal Semantic Understanding of Natural Languages
Towards Universal Semantic Understanding of Natural LanguagesYunyao Li
 
An In-depth Analysis of the Effect of Text Normalization in Social Media
An In-depth Analysis of the Effect of Text Normalization in Social MediaAn In-depth Analysis of the Effect of Text Normalization in Social Media
An In-depth Analysis of the Effect of Text Normalization in Social MediaYunyao Li
 
K-SRL: Instance-based Learning for Semantic Role Labeling
K-SRL: Instance-based Learning for Semantic Role LabelingK-SRL: Instance-based Learning for Semantic Role Labeling
K-SRL: Instance-based Learning for Semantic Role LabelingYunyao Li
 
Coling poster
Coling posterColing poster
Coling posterYunyao Li
 
Natural Language Data Management and Interfaces: Recent Development and Open ...
Natural Language Data Management and Interfaces: Recent Development and Open ...Natural Language Data Management and Interfaces: Recent Development and Open ...
Natural Language Data Management and Interfaces: Recent Development and Open ...Yunyao Li
 
Polyglot: Multilingual Semantic Role Labeling with Unified Labels
Polyglot: Multilingual Semantic Role Labeling with Unified LabelsPolyglot: Multilingual Semantic Role Labeling with Unified Labels
Polyglot: Multilingual Semantic Role Labeling with Unified LabelsYunyao Li
 
Automatic Term Ambiguity Detection
Automatic Term Ambiguity DetectionAutomatic Term Ambiguity Detection
Automatic Term Ambiguity DetectionYunyao Li
 
Information Extraction --- An one hour summary
Information Extraction --- An one hour summaryInformation Extraction --- An one hour summary
Information Extraction --- An one hour summaryYunyao Li
 
Adaptive Parser-Centric Text Normalization
Adaptive Parser-Centric Text NormalizationAdaptive Parser-Centric Text Normalization
Adaptive Parser-Centric Text NormalizationYunyao Li
 
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challengesEnterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challengesYunyao Li
 

Plus de Yunyao Li (20)

The Role of Patterns in the Era of Large Language Models
The Role of Patterns in the Era of Large Language ModelsThe Role of Patterns in the Era of Large Language Models
The Role of Patterns in the Era of Large Language Models
 
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-Loop
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-LoopBuilding, Growing and Serving Large Knowledge Graphs with Human-in-the-Loop
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-Loop
 
Meaning Representations for Natural Languages: Design, Models and Applications
Meaning Representations for Natural Languages:  Design, Models and ApplicationsMeaning Representations for Natural Languages:  Design, Models and Applications
Meaning Representations for Natural Languages: Design, Models and Applications
 
Towards Deep Table Understanding
Towards Deep Table UnderstandingTowards Deep Table Understanding
Towards Deep Table Understanding
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language Processing
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language Processing
 
Towards Universal Language Understanding
Towards Universal Language UnderstandingTowards Universal Language Understanding
Towards Universal Language Understanding
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language Processing
 
Towards Universal Language Understanding (2020 version)
Towards Universal Language Understanding (2020 version)Towards Universal Language Understanding (2020 version)
Towards Universal Language Understanding (2020 version)
 
Towards Universal Semantic Understanding of Natural Languages
Towards Universal Semantic Understanding of Natural LanguagesTowards Universal Semantic Understanding of Natural Languages
Towards Universal Semantic Understanding of Natural Languages
 
An In-depth Analysis of the Effect of Text Normalization in Social Media
An In-depth Analysis of the Effect of Text Normalization in Social MediaAn In-depth Analysis of the Effect of Text Normalization in Social Media
An In-depth Analysis of the Effect of Text Normalization in Social Media
 
K-SRL: Instance-based Learning for Semantic Role Labeling
K-SRL: Instance-based Learning for Semantic Role LabelingK-SRL: Instance-based Learning for Semantic Role Labeling
K-SRL: Instance-based Learning for Semantic Role Labeling
 
Coling poster
Coling posterColing poster
Coling poster
 
Coling demo
Coling demoColing demo
Coling demo
 
Natural Language Data Management and Interfaces: Recent Development and Open ...
Natural Language Data Management and Interfaces: Recent Development and Open ...Natural Language Data Management and Interfaces: Recent Development and Open ...
Natural Language Data Management and Interfaces: Recent Development and Open ...
 
Polyglot: Multilingual Semantic Role Labeling with Unified Labels
Polyglot: Multilingual Semantic Role Labeling with Unified LabelsPolyglot: Multilingual Semantic Role Labeling with Unified Labels
Polyglot: Multilingual Semantic Role Labeling with Unified Labels
 
Automatic Term Ambiguity Detection
Automatic Term Ambiguity DetectionAutomatic Term Ambiguity Detection
Automatic Term Ambiguity Detection
 
Information Extraction --- An one hour summary
Information Extraction --- An one hour summaryInformation Extraction --- An one hour summary
Information Extraction --- An one hour summary
 
Adaptive Parser-Centric Text Normalization
Adaptive Parser-Centric Text NormalizationAdaptive Parser-Centric Text Normalization
Adaptive Parser-Centric Text Normalization
 
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challengesEnterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
 

Dernier

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 

Dernier (20)

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 

SystemT: Declarative Information Extraction

  • 1. Scalable Natural Language Processing (SNaP) IBM Research | Almaden SystemT: Declarative Information Extraction Yunyao Li
  • 2. Text analytics is the key for Unlocking big data value Customer 360º View Social Media Analytics, CRM Analytics Security & Privacy Email Analytics, Data Redaction, Digital Piracy Operations Analysis Machine Data Analytics Financial Analytics Systemic risk analysis, Event monitoring Healthcare Analytics Patient care, drug discovery and many more …
  • 3. Text Analytics Text Analytics vs Information Extraction 3 Information Extraction Information Extraction ClusteringClustering ClassificationClassification RegressionRegression Core operations such as pos tagging, deep parsing, sentence detection etc. Core operations such as pos tagging, deep parsing, sentence detection etc. Information extraction: A key enabling technology for text analytics •Input: Raw text and features from low-level natural language processing operations •Output: Structured data suitable for general purpose analysis software
  • 4. Information Extraction All Over … hotel close to Seaworld in Orlando? Planning to go this coming weekend… Social Media Customer 360º Security & Privacy Operations Analysis Financial Analytics my social security # is 400-05-4356 Email Machine Data Financial Statements Medical Records News CRM Patents Product:Hotel Location:Orlando Type:SSN Value:400054356 Storage Module 2 Drive 1 fault Event:DriveFail Loc:mod2.Slot1
  • 5. 5 Extraction Example: Fact Extraction (Tables) Singapore 2012 Annual Report (136 pages PDF) Identify note breaking down Operating expenses line item, and extract opex components Identify note breaking down Operating expenses line item, and extract opex components Identify line item for Operating expenses from Income statement (financial table in pdf document) Identify line item for Operating expenses from Income statement (financial table in pdf document)
  • 6. 6 Extraction Example: Sentiment Analysis Mcdonalds mcnuggets are fake as shit but they so delicious.Mcdonalds mcnuggets are fake as shit but they so delicious. We should do something cool like go to Z (kidding).We should do something cool like go to Z (kidding). Makin chicken fries at home bc everyone sucks!Makin chicken fries at home bc everyone sucks! You are never too old for Disney movies.You are never too old for Disney movies. Social Media Different domain  Different ball game! Bank X got me ****ed up today!Bank X got me ****ed up today! Customer Reviews Not a pleasant client experience. Please fix ASAP.Not a pleasant client experience. Please fix ASAP. I'm still hearing from clients that Merrill's website is better.I'm still hearing from clients that Merrill's website is better. X... fixing something that wasn't brokenX... fixing something that wasn't broken Analyst Research Reports Intel's 2013 capex is elevated at 23% of sales, above average of 16%Intel's 2013 capex is elevated at 23% of sales, above average of 16% IBM announced 4Q2012 earnings of $5.13 per share, compared with 4Q2011 earnings of $4.62 per share, an increase of 11 percent IBM announced 4Q2012 earnings of $5.13 per share, compared with 4Q2011 earnings of $4.62 per share, an increase of 11 percent We continue to rate shares of MSFT neutral.We continue to rate shares of MSFT neutral. Sell EUR/CHF at market for a decline to 1.31000…Sell EUR/CHF at market for a decline to 1.31000… FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury.FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury. Equity Fixed Income Currency
  • 7. Enterprise Requirements • Accuracy – Garbage-in garbage-out: Usefulness of analytics results is tied to the quality of extraction • Scalability – Large data volumes, often orders of magnitude larger than classical NLP corpora • Social Media: – Twitter alone has 400M+ messages / day; 1TB+ per day • Financial Data: – SEC alone has 20M+ filings, several TBs of data, with documents range from few KBs to few MBs • Machine Data: – One application server under moderate load at medium logging level  1GB of logs per day • Transparency – Every customer’s data and problems are unique in some way – Need to quickly implement new business logic • Usability – Building an accurate IE system is labor-intensive – Professional programmers are expensive
  • 8. 8 Core OperationsCore OperationsExecution EngineExecution Engine SystemT’s 5 Layered Architecture Compiler + OptimizerCompiler + Optimizer Extraction LanguageExtraction Language Generic LibrariesGeneric Libraries Tooling + UITooling + UI 20+ languages currently supported NE libraries, Sentiment, informal-text normalizer, … Scalable Embeddable Runtime AQL Cost based optimizer Eclipse tooling: editor, workflow analysis, pattern discovery, regex learner, profiler,…
  • 9. 9 package com.ibm.avatar.algebra.util.sentence; import java.io.BufferedWriter; import java.util.ArrayList; import java.util.HashSet; import java.util.regex.Matcher; public class SentenceChunker { private Matcher sentenceEndingMatcher = null; public static BufferedWriter sentenceBufferedWriter = null; private HashSet<String> abbreviations = new HashSet<String> (); public SentenceChunker () { } /** Constructor that takes in the abbreviations directly. */ public SentenceChunker (String[] abbreviations) { // Generate the abbreviations directly. for (String abbr : abbreviations) { this.abbreviations.add (abbr); } } /** * @param doc the document text to be analyzed * @return true if the document contains at least one sentence boundary */ public boolean containsSentenceBoundary (String doc) { String origDoc = doc; /* * Based on getSentenceOffsetArrayList() */ // String origDoc = doc; // int dotpos, quepos, exclpos, newlinepos; int boundary; int currentOffset = 0; do { /* Get the next tentative boundary for the sentenceString */ setDocumentForObtainingBoundaries (doc); boundary = getNextCandidateBoundary (); if (boundary != -1) {doc.substring (0, boundary + 1); String remainder = doc.substring (boundary + 1); String candidate = /* * Looks at the last character of the String. If this last * character is part of an abbreviation (as detected by * REGEX) then the sentenceString is not a fullSentence and * "false” is returned */ // while (!(isFullSentence(candidate) && // doesNotBeginWithCaps(remainder))) { while (!(doesNotBeginWithPunctuation (remainder) && isFullSentence (candidate))) { /* Get the next tentative boundary for the sentenceString */ int nextBoundary = getNextCandidateBoundary (); if (nextBoundary == -1) { break; } boundary = nextBoundary; candidate = doc.substring (0, boundary + 1); remainder = doc.substring (boundary + 1); } if (candidate.length () > 0) { // sentences.addElement(candidate.trim().replaceAll("n", " // ")); // sentenceArrayList.add(new Integer(currentOffset + boundary // + 1)); // currentOffset += boundary + 1; // Found a sentence boundary. If the boundary is the last // character in the string, we don't consider it to be // contained within the string. int baseOffset = currentOffset + boundary + 1; if (baseOffset < origDoc.length ()) { // System.err.printf("Sentence ends at %d of %dn", // baseOffset, origDoc.length()); return true; } else { return false; } } // origDoc.substring(0,currentOffset)); // doc = doc.substring(boundary + 1); doc = remainder; } } while (boundary != -1); // If we get here, didn't find any boundaries. return false; } public ArrayList<Integer> getSentenceOffsetArrayList (String doc) { ArrayList<Integer> sentenceArrayList = new ArrayList<Integer> (); // String origDoc = doc; // int dotpos, quepos, exclpos, newlinepos; int boundary; int currentOffset = 0; sentenceArrayList.add (new Integer (0)); do { /* Get the next tentative boundary for the sentenceString */ setDocumentForObtainingBoundaries (doc); boundary = getNextCandidateBoundary (); if (boundary != -1) { String candidate = doc.substring (0, boundary + 1); String remainder = doc.substring (boundary + 1); /* * Looks at the last character of the String. If this last character * is part of an abbreviation (as detected by REGEX) then the * sentenceString is not a fullSentence and "false" is returned */ // while (!(isFullSentence(candidate) && // doesNotBeginWithCaps(remainder))) { while (!(doesNotBeginWithPunctuation (remainder) && isFullSentence (candidate))) { /* Get the next tentative boundary for the sentenceString */ int nextBoundary = getNextCandidateBoundary (); if (nextBoundary == -1) { break; } boundary = nextBoundary; candidate = doc.substring (0, boundary + 1); remainder = doc.substring (boundary + 1); } if (candidate.length () > 0) { sentenceArrayList.add (new Integer (currentOffset + boundary + 1)); currentOffset += boundary + 1; } // origDoc.substring(0,currentOffset)); doc = remainder; } } while (boundary != -1); if (doc.length () > 0) { sentenceArrayList.add (new Integer (currentOffset + doc.length ())); } sentenceArrayList.trimToSize (); return sentenceArrayList; } private void setDocumentForObtainingBoundaries (String doc) { sentenceEndingMatcher = SentenceConstants. sentenceEndingPattern.matcher (doc); } private int getNextCandidateBoundary () { if (sentenceEndingMatcher.find ()) { return sentenceEndingMatcher.start (); } else return -1; } private boolean doesNotBeginWithPunctuation (String remainder) { Matcher m = SentenceConstants.punctuationPattern.matcher (remainder); return (!m.find ()); } private String getLastWord (String cand) { Matcher lastWordMatcher = SentenceConstants.lastWordPattern.matcher (cand); if (lastWordMatcher.find ()) { return lastWordMatcher.group (); } else { return ""; } } /* * Looks at the last character of the String. If this last character is * par of an abbreviation (as detected by REGEX) * then the sentenceString is not a fullSentence and "false" is returned */ private boolean isFullSentence (String cand) { // cand = cand.replaceAll("n", " "); cand = " " + cand; Matcher validSentenceBoundaryMatcher = SentenceConstants.validSentenceBoundaryPattern.matcher (cand); if (validSentenceBoundaryMatcher.find ()) return true; Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher (cand); if (abbrevMatcher.find ()) { return false; // Means it ends with an abbreviation } else { // Check if the last word of the sentenceString has an entry in the // abbreviations dictionary (like Mr etc.) String lastword = getLastWord (cand); if (abbreviations.contains (lastword)) { return false; } } return true; } } Java Implementation of Sentence Boundary Detection Easy to Develop & Maintain create dictionary AbbrevDict from file 'abbreviation.dict’; create view SentenceBoundary as select R.match as boundary from ( extract regex /(([.?!]+s)|(ns*n))/ on D.text as match from Document D ) R where Not(ContainsDict('AbbrevDict', CombineSpans(LeftContextTok(R.match, 1),R.match))); create dictionary AbbrevDict from file 'abbreviation.dict’; create view SentenceBoundary as select R.match as boundary from ( extract regex /(([.?!]+s)|(ns*n))/ on D.text as match from Document D ) R where Not(ContainsDict('AbbrevDict', CombineSpans(LeftContextTok(R.match, 1),R.match))); Equivalent AQL Implementation
  • 10. 10 State-of-the-art industrial system State-of-the-art open-source system SystemT Scalability via Optimization SystemT needs less than 10% of the cores to keep up with Twitter’s daily feed without drop in quality 10x50x
  • 11. Products utilizing SystemT for pre-determined extraction tasksProducts utilizing SystemT for pre-determined extraction tasksProducts utilizing SystemT for pre-determined extraction tasksProducts utilizing SystemT for pre-determined extraction tasks Products fully exposingProducts fully exposing the AQL language and enginethe AQL language and engine Products fully exposingProducts fully exposing the AQL language and enginethe AQL language and engine SystemT in IBM Products InfoSphere BigInsights InfoSphere Streams Social Media Analytics eDiscovery Analyzer Optim Data Lifecycle Management Solutions SmartCloud Analytics: Log Analysis Watson Discovery Advisor Installed on 400+ client sites
  • 12. Case Study: Drug Discovery
  • 13. 0-I II III IV Time to develop a drug; 12 -15 years Avg cost to develop a drug: 1.2 billion Adapted from PhRMA (Pharmaceutical Research and Manufacturers of America) 2013 profile Laboratory 50,000 + Compounds Pre-Clinical 250 Compounds Clinical 5 Compounds FDA approval 1 compound “..Toxicity and Serious Adverse Events in Late Stage Drug Development are the Major Causes of Drug Failure” Intervention should happen at critical transition bottlenecks between stages (most likely to impact outcome) Drug Development Pipeline
  • 14. LaboratoryLaboratory Pre-clinicalPre-clinical ClinicalClinical FDA ApprovalFDA Approval BedsideBedside Objective: enable data-driven decision making more efficient clinical trial design, data analytics and drug success / failure predictions Case Study: Drug Discovery
  • 15. Structure Indication Indication Mode of action / target Effect level Side Effect 360° view on drugs allows for comparison between drugs, and therefore allow to predict drug behavior …Structured and unstructured data sources Extract known facts about a drug
  • 16. Patents also have (Manually Created) Chemical Complex Work Units (CWU’s) As text Chemical names found in the text of documents As bitmap images Pictures of chemicals found in the document Images Unstructured data contain chemical compound information in many different forms Chemical nomenclature can be daunting Information Extraction Example
  • 17. Structure Renders content searchable and avoid redundant and expensive tests Building block for automation of the discovery and decision process Internal reports Public Medline articles with limited structured info POPULATION gender FEMALE,MALE species RAT strain Sprague Dawley precondition pregnant From unstructured to structured content  Handle variants in entity reference  Disambiguate based on context, etc…  Extract relationships  Assemble complex concepts Combine rule-based extraction with Machine Learning  Handle variants in entity reference  Disambiguate based on context, etc…  Extract relationships  Assemble complex concepts Combine rule-based extraction with Machine Learning
  • 18. EfficacyEfficacyEfficacyEfficacyResults StudyStudy Authorship InterventionInterventionAuthorship InterventionAuthorship Intervention PopulationPopulationPopulationPopulationPopulation Study name Size Population description Time interval Intervention Phase Type Authorship Author Author affiliation Funding org Compound Target Admin route dosage Type (preventive /therapeutic) Species Subspecies Gender Age Pre-existing conditions Physical attributes Geographic Location 1.Extractmentions2.Assemblementions intoinformation In-life Observations Hematology Clinical chem Comparative Macroscopic Results Comparative Microscopic Resutls
  • 19. Preliminary Results SUGGEST NEW P53 KINASES 2. Cluster proteins by their literature distance: known p53 kinases cluster (green) and suggest others that may also phosphorylate p53. GREEN – p53 Kinases RED/ORANGE/YELLOW – Predicted Targets 1. Quantify the “literature distances” between proteins BMPR2CHEK2 Bag of Words Model Example Use Case: Drug Development
  • 20. Summary • SystemT – Declarative IE system based on an algebraic framework – Expressivity and performance advantages over grammar- based IE systems – Text-specific optimizations – Key enabling technology for Big Data analytics applications • Ongoing Work – Tooling support for non-programmers – Improved optimization strategies – New operators for advanced features (e.g. co-reference resolution, text normalization)
  • 21. Thank you! • For more information… – Visit our website: https://ibm.biz/BdF4GQ – Get a free copy: • IBM BigInsights Quick start Edition http://www- 01.ibm.com/software/data/infosphere/biginsights/quick-start/ • IBM Academic Initiative: http://www- 03.ibm.com/ibm/university/academic/pub/page/academic_initiative. InfoSphere BigInsights 2.1.1 (and Streams 3.0) are freely available for academic purposes. – Take free online classes: BigDataUniversity.com – Watch YouTube videos: • IBM InfoSphere BigInsights Text Analytics Channel: http://www.youtube.com/playlist?list=PL2C66C6E455AD0216 • IBM Big Data Channel: http://www.youtube.com/user/ibmbigdata • IBM InfoSphere Streams Channel: http://www.youtube.com/playlist? list=PLCF04A48C22F34B19 – Contact me • yunyaoli@us.ibm.com SystemT Team: Laura Chiticariu, Marina Danilevsky, Howard Ho, Rajasekar Krishnamurthy, Yunyao Li Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, Huaiyu Zhu SystemT Team: Laura Chiticariu, Marina Danilevsky, Howard Ho, Rajasekar Krishnamurthy, Yunyao Li Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, Huaiyu Zhu

Notes de l'éditeur

  1. Much of the big data is the form of text. Text analytics enables the building of new applications from different domain. It is the key for unlocking the value of big data.
  2. Text analytics is broad concept. In this talk, we will focus on information extraction.
  3. This slide conveys our ability to extract chemical names from not just text but from figures of chemical structures and map to their trade, generic name and ids. And as you can see, the list of chemical names can be daunting.
  4. The methodology here is to create a model of each protein kinase based on the medline abstracts that contain only that kinase and no others. The feature space of this model is the words and phrases contained in those abstracts. The distance matrix is then the cosine similarity between each kinases centroid. This distance matrix can then form a similarity graph which can be visualized and reasoned over to identify suspect p53 kinases. These can then be confirmed through experimentation. This exercise proves our ability to go all the way from text extraction to representation and reasoning to predict likely experimental outcomes. The physical diffusion model represent proteins as nodes, their interactions as edges and their functions as labels. These labels are then propagated throughout a network in a process akin to heatflow from a thermal source. Due to constructive interference dictated by the network structure (topology) some nodes that were not previously labeled become significantly so after diffusion, suggested a new function. Both of these methods predicted by kinases not previously known to target p53 might indeed do so. Experimental validation is under way. Thus far two of four predictions are unambiguously correct, and two more show some p53 kinase activity, but with some ambiguity. Significance / Impact – In 34 years of p53 biology, kinases that target p53 have been discovered at a rate of less than one per year. Doing so in a matter of days is a remarkable gain in efficiency. Using our computational screens we should be able to identify strong candidates and validate them as true p53 kinases experimentally in a matter of months. This represents a dramatic acceleration of discovery potential. This process may be repeated for phosphatases, acetylases, nedddylases, methylases, ubiquitinases, and other such proteins that modify p53 and switch its functions on and off.
  5. -selectivity estimation for dictionaries and regex – we don’t have them in db world