Introducing Priberam Labs: Machine Learning and Natural Language Processing

André Martins (Priberam/IT)

IST, Lisbon, November 22nd, 2012

Collaborators

Mário Figueiredo, Noah Smith, Pedro Aguiar, Eric Xing, Miguel Almeida.


Outline

1 Introduction
What is Priberam?
What are the Priberam Labs?

2 Research at Priberam Labs

3 Master’s Projects

4 Academia Partnerships


What is Priberam?

A spin-off from IST founded in 1989
R&D in the area of language technologies
Microsoft Gold Certified Partner, PME Líder, PME Inovadora COTEC
Some of our clients:


Online Dictionary

(http://www.priberam.pt/dlpo — 1M page-views per day)


Grammar Checker

(http://www.flip.pt)


Legal Search

(http://www.legix.pt)


Newswire Search

(http://www.dn.pt, http://www.jn.pt, http://www.tsf.pt)


Newswire Search (screenshot annotated with the question and the answer)



What are the Priberam Labs?

Every day we deal with challenging and stimulating problems, some of
them unanswered by current scientific knowledge



Our key areas: Natural Language Processing and Machine Learning
Our goals:
advance the state of the art in NLP and ML
incorporate the resulting innovations in new products
promote collaborations with other researchers in academia

Our Research Interests

Natural Language Processing
Machine Learning
Structured Prediction
Graphical Models



Natural Language Processing

Goal: make machines capable of “understanding” human language.

Information Retrieval
Machine Translation
Syntactic Parsing
Semantic Parsing
Speech Recognition
...


The Empirical “Revolution” in NLP

Until the 1980s: rule-based methods were prevalent in AI



Since the mid 1990s: statistical methods, corpus linguistics
Today: emphasis on machine learning and large-scale data processing
“The unreasonable effectiveness of data” (Halevy et al., 2009)





Example: Spam Detector


Machine Learning

Goal: build systems that learn from the data.
Input set $X$ and output set $Y$
Learn a classifier $h : X \to Y$ from a set of labeled examples
$\{(x_i, y_i)\}_{i=1}^{N} \subseteq X \times Y$
Given an unseen example $x \in X$, predict $y = h(x)$
Many approaches: decision trees, neural networks, nearest neighbors,
naive Bayes, logistic regression, support vector machines, ...
Many learning formalisms: supervised, unsupervised, semi-supervised,
weakly-supervised, active, online, reinforcement, ...

Mitchell (1997); Manning and Schütze (1999); Schölkopf and Smola (2002); Bishop (2006)
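
To make the supervised setting concrete, here is a minimal sketch of learning a classifier h from labeled examples and applying it to an unseen input, using the spam-detector scenario. It uses a perceptron over bag-of-words features purely for brevity; the talk does not prescribe this learner, and the training messages, function names, and labels (+1 for spam, -1 for ham) are invented for illustration.

```python
from collections import Counter


def features(text):
    """Bag-of-words representation of a message: word -> count."""
    return Counter(text.lower().split())


def train_perceptron(examples, epochs=10):
    """Learn weights and a bias so that sign(w.x + b) matches the labels (+1 / -1)."""
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for text, label in examples:
            x = features(text)
            score = sum(weights[w] * c for w, c in x.items()) + bias
            if label * score <= 0:  # mistake (or undecided): move towards the label
                for w, c in x.items():
                    weights[w] += label * c
                bias += label
    return weights, bias


def predict(model, text):
    """Apply the learned classifier h to an unseen example."""
    weights, bias = model
    score = sum(weights[w] * c for w, c in features(text).items()) + bias
    return "spam" if score > 0 else "ham"


# Hypothetical toy training set {(x_i, y_i)}: +1 = spam, -1 = ham.
TRAIN = [
    ("win money now", +1),
    ("cheap pills win prize", +1),
    ("meeting at noon", -1),
    ("lunch meeting tomorrow", -1),
]

model = train_perceptron(TRAIN)
print(predict(model, "win a prize now"))
print(predict(model, "lunch meeting at noon"))
```

Words never seen in training contribute nothing to the score, so the prediction for a new message depends only on the learned weights of the words it shares with the training set.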



Structured Prediction



Language is structured, complex, and ambiguous.

Lafferty et al. (2001); Taskar et al. (2003); Altun et al. (2003); Tsochantaridis et al. (2004)



The input set X is typically structured (a string, an acoustic signal, etc.)
Often: the output set Y is also structured (a string, a parse tree, etc.)
Some problems:
How to decode structured outputs?
How to learn models for structured prediction?
How to learn the structure itself?



Example: Part-of-Speech Tagging

Goal: given a sentence, determine the part-of-speech tag of each word.

Word:            Time   flies         like          an    arrow
Candidate tags:  Noun   Noun? Verb?   Prep? Verb?   Det   Noun
Final tags:      Noun   Verb          Prep          Det   Noun

Rule-based systems (Brill, 1993)
Hidden Markov models (Brants, 2000)
Conditional random fields (Lafferty et al., 2001)



Graphical Models

Inspired by statistical mechanics (Ising, 1925; Potts, 1952)
Applications in coding theory, vision, computational biology, ...
(Tanner, 1981; Pearl, 1988; Kschischang et al., 2001; Koller and Friedman, 2009)
MAP inference: obtain the most likely configuration.
Graphs without cycles: dynamic programming (Viterbi, 1967)
In general NP-hard!
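
As a hedged illustration of dynamic programming on a cycle-free (chain) model, the sketch below runs the Viterbi recursion to find the most likely tag sequence for the earlier sentence "Time flies like an arrow". The tag set and every number in START, TRANS, and EMIT are made-up, unnormalised scores chosen for illustration; they do not come from the talk or from any trained tagger.

```python
import math

NEG_INF = float("-inf")


def log(p):
    """Log of a score, with log(0) defined as -infinity."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi(words, tags, start, trans, emit):
    """Return the highest-scoring tag sequence for `words` under a chain model."""
    # best[t][tag]: best log-score of any tag sequence for words[:t+1] ending in `tag`
    best = [{t: log(start.get(t, 0)) + log(emit[t].get(words[0], 0)) for t in tags}]
    back = []  # back[t][tag]: best previous tag, used to recover the path
    for word in words[1:]:
        scores, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p] + log(trans[p].get(tag, 0)))
            scores[tag] = (best[-1][prev] + log(trans[prev].get(tag, 0))
                           + log(emit[tag].get(word, 0)))
            pointers[tag] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores, for illustration only.
TAGS = ["Noun", "Verb", "Prep", "Det"]
START = {"Noun": 0.6, "Verb": 0.1, "Prep": 0.1, "Det": 0.2}
TRANS = {
    "Noun": {"Noun": 0.2, "Verb": 0.5, "Prep": 0.2, "Det": 0.1},
    "Verb": {"Noun": 0.2, "Verb": 0.1, "Prep": 0.3, "Det": 0.4},
    "Prep": {"Noun": 0.3, "Verb": 0.05, "Prep": 0.05, "Det": 0.6},
    "Det": {"Noun": 0.9, "Verb": 0.03, "Prep": 0.03, "Det": 0.04},
}
EMIT = {
    "Noun": {"time": 0.9, "flies": 0.4, "arrow": 1.0},
    "Verb": {"time": 0.1, "flies": 0.6, "like": 0.3},
    "Prep": {"like": 0.7},
    "Det": {"an": 1.0},
}

print(viterbi("time flies like an arrow".split(), TAGS, START, TRANS, EMIT))
# ['Noun', 'Verb', 'Prep', 'Det', 'Noun']
```

The cost is linear in the sentence length and quadratic in the number of tags, which is what makes exact MAP inference tractable on chains; on graphs with cycles no such recursion is available in general.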


AD³ Algorithm (Martins et al., 2010a, 2011a)

“Alternating Directions Dual Decomposition.”




An approximate MAP inference algorithm based on an LP relaxation




Fundamental idea: decompose the graph into parts; at each iteration
t, solve local subproblems and promote a consensus on the overlaps




Convergence rate O(1/t)




Can tackle combinatorial parts and first-order logic constraints
Code available at: http://www.ark.cs.cmu.edu/AD3


Graphs are Everywhere

Facebook graph
WWW graph

Protein folding
Image segmentation


Syntactic Parsing
(Chomsky, 1965; Magerman, 1995; Charniak, 1996; Collins, 1999; Klein and Manning, 2003)


She solved the problem with the statistical method.
Grammar:
S --> NP VP
NP --> Pro
NP --> Det N
NP --> Det Nbar
Nbar --> Adj N
VP --> V NP PP
PP --> P NP
Det --> the
Pro --> She
N --> problem
N --> method
V --> solved
P --> with
Adj --> statistical

Parse tree:
[S [NP [Pro She]]
   [VP [V solved]
       [NP [Det the] [N problem]]
       [PP [P with] [NP [Det the] [Nbar [Adj statistical] [N method]]]]]]


Syntactic Ambiguity
1 She employed the statistical method:
[S [NP She] [VP [V solved] [NP the problem] [PP with the statistical method]]]

2 The statistical method was broken:
[S [NP She] [VP [V solved] [NP [NP the problem] [PP with the statistical method]]]]
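
The two readings can be reproduced mechanically. The sketch below counts the parse trees that a small context-free grammar assigns to a sentence. The rules are those of the toy grammar on the Syntactic Parsing slide plus VP --> V NP and NP --> NP PP, which are read off the second tree above. The function names and the memoised span-splitting strategy are illustrative assumptions, not the method of any parser cited here.

```python
from functools import lru_cache

# Phrase-structure rules: left-hand side -> list of right-hand sides.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("Pro",), ("Det", "N"), ("Det", "Nbar"), ("NP", "PP")],
    "Nbar": [("Adj", "N")],
    "VP": [("V", "NP", "PP"), ("V", "NP")],
    "PP": [("P", "NP")],
}
# Lexical rules: word -> set of categories.
LEXICON = {
    "she": {"Pro"}, "the": {"Det"}, "problem": {"N"}, "method": {"N"},
    "solved": {"V"}, "with": {"P"}, "statistical": {"Adj"},
}


def count_parses(words, root="S"):
    """Count the distinct parse trees rooted in `root` that cover all of `words`."""
    words = tuple(w.lower() for w in words)

    @lru_cache(maxsize=None)
    def derive(symbol, i, j):
        """Number of ways `symbol` can derive words[i:j]."""
        total = 0
        if j - i == 1 and symbol in LEXICON.get(words[i], ()):
            total += 1
        for rhs in GRAMMAR.get(symbol, ()):
            total += derive_seq(rhs, i, j)
        return total

    @lru_cache(maxsize=None)
    def derive_seq(rhs, i, j):
        """Number of ways the symbol sequence `rhs` can derive words[i:j]."""
        if len(rhs) == 1:
            return derive(rhs[0], i, j)
        total = 0
        # The first symbol covers words[i:k]; every remaining symbol needs >= 1 word.
        for k in range(i + 1, j - len(rhs) + 2):
            left = derive(rhs[0], i, k)
            if left:
                total += left * derive_seq(rhs[1:], k, j)
        return total

    return derive(root, 0, len(words))


sentence = "She solved the problem with the statistical method".split()
print(count_parses(sentence))  # 2
```

The count of 2 corresponds to the two attachments of the prepositional phrase: to the verb phrase (reading 1) or to the noun phrase "the problem" (reading 2).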


Dependency Syntax
(Pāṇini, 4th century BCE; Tesnière 1959; Hudson 1984; Mel’čuk 1988; Eisner 1996; McDonald
et al. 2005; Nivre et al. 2006; Koo et al. 2007)

* She solved the problem with the statistical method

Tree obtained by “lexicalizing” the previous phrase-structure tree.
A lightweight syntactic formalism, without phrases
Grammar functions represented as lexical relationships


Turbo Parser (Martins et al., 2009, 2010b, 2011b)

A multi-lingual statistical dependency parser,
which formulates parsing as inference in a
graphical model.

Ignores global effects caused by the cycles of the graph
Same idea that underlies turbo decoders (Berrou et al., 1993)
Uses AD³ for solving the relaxation
State-of-the-art accuracies, extremely fast (1,200 words per second)
Code available at: http://www.ark.cs.cmu.edu/TurboParser


Ongoing Project: Summarization
Given a set of documents about an event, generate a brief summary.


Extractive Summarization
Just extract the most salient sentences.
Reward relevance and coverage, penalize redundancy
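
One simple way to trade these criteria off is a greedy selection, sketched below under stated assumptions. Relevance is approximated by how many sentences contain each word, coverage by rewarding only words not yet in the summary, and redundancy by a penalty on words already covered. The scoring function, the `penalty` parameter, and the example sentences are hypothetical; this is not the summarizer developed in the ongoing project.

```python
from collections import Counter


def tokenize(sentence):
    """Lower-cased set of words with surrounding punctuation stripped."""
    words = (w.strip(".,;:!?\"'()").lower() for w in sentence.split())
    return {w for w in words if w}


def extractive_summary(sentences, budget=2, penalty=1.0):
    """Greedily pick up to `budget` sentences, returned in document order."""
    bags = [tokenize(s) for s in sentences]
    # Relevance proxy: a word is worth the number of sentences it appears in.
    weight = Counter(w for bag in bags for w in bag)
    covered, chosen = set(), []

    def gain(i):
        new = sum(weight[w] for w in bags[i] - covered)  # reward coverage
        repeated = len(bags[i] & covered)                # penalize redundancy
        return new - penalty * repeated

    while len(chosen) < min(budget, len(sentences)):
        candidates = [i for i in range(len(sentences)) if i not in chosen]
        best = max(candidates, key=gain)
        chosen.append(best)
        covered |= bags[best]
    return [sentences[i] for i in sorted(chosen)]


docs = [
    "The storm hit the coast on Monday.",
    "The storm hit the coast on Monday morning.",
    "Thousands of homes lost power.",
]
print(extractive_summary(docs, budget=2))
```

With these inputs the second sentence is chosen first, after which the near-duplicate first sentence has a negative gain and the third sentence is selected instead.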


Compressive Summarization
Jointly extract and compress sentences.
Trade-off between informativeness, length, and grammaticality


Released Software

A multilingual part-of-speech tagger (TurboTagger)
A multilingual dependency parser (TurboParser)
An algorithm for approximate inference in graphical models (AD³)

http://www.ark.cs.cmu.edu/TurboParser
http://www.ark.cs.cmu.edu/AD3



Master’s Projects

Opinion Mining in Newspapers and Blogs
Text-Driven Forecasting
Recommendation Systems
Weakly Supervised Sentiment Analysis



Opinion Mining in Newspapers and Blogs

Build a system that extracts “opinions” from text in natural language.



Examples: opinions of politicians about controversial topics, user
reviews about products, opinions expressed in blogs and Twitter, etc.
Goal: a computer program that extracts opinions and identifies the
opinion holder, the aspect the opinion is about, and the
opinion polarity (positive or negative sentiment)


Example: Google Products (screenshot highlighting opinion snippets and aspects)


Text-Driven Forecasting




Example: a movie by a famous director has
premiered. Can we predict its gross revenue
given opinionated text?

“[...] a masterpiece in sheer
awfulness.” — Rotten Tomatoes





Goal: develop ML algorithms for predicting numeric quantities about
an event given a body of text.
Possible applications: predicting the revenue of movies, opinion
polls from blogs, stock volatility from financial reports, the number of
external links given a news article, etc.


Recommendation Systems



In many applications (e.g. movie rental systems) users assign ratings to
products according to their taste (on a rating scale)

These ratings can be seen as entries in a matrix (of N users by M movies)

(Figure: an N × M ratings matrix in which most entries are unknown, shown as “?”.)

Goal: fill the blanks (matrix completion).
Predict the rating that the ith user will assign to the jth movie based
on similar user/movie profiles: collaborative filtering
Recommend new movies to unseen users


Netflix Prize: $1M for whoever improves on Netflix’s Cinematch® by more than 10%



Winner: BellKor’s Pragmatic Chaos, 21/9/2009



Data: some entries of the user/movie matrix (training and test splits)



Evaluation metric: root mean squared error (RMSE)
Some possible approaches:
k-nearest neighbors (for some similarity metric)
probabilistic models with latent variables
low-rank matrix factorization
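
The last approach can be sketched in a few lines. The example below fits a low-rank factorization to the observed entries of a small ratings matrix by stochastic gradient descent and reports the RMSE. The ratings, the rank, the learning rate `lr`, the regularization strength `reg`, and all function names are assumptions made for illustration; this is not the method of any Netflix Prize entry.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rmse(ratings, P, Q):
    """Root mean squared error over the given (user, movie, rating) triples."""
    squared = sum((r - dot(P[u], Q[m])) ** 2 for u, m, r in ratings)
    return math.sqrt(squared / len(ratings))


def factorize(ratings, n_users, n_movies, rank=2, lr=0.01, reg=0.02,
              epochs=500, seed=0):
    """Fit user factors P and movie factors Q so that P[u] . Q[m] approximates r."""
    rng = random.Random(seed)
    P = [[rng.uniform(0, 0.5) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(0, 0.5) for _ in range(rank)] for _ in range(n_movies)]
    for _ in range(epochs):
        for u, m, r in ratings:
            err = r - dot(P[u], Q[m])
            for k in range(rank):
                pu, qm = P[u][k], Q[m][k]
                # Gradient step on the regularized squared error of this entry.
                P[u][k] += lr * (err * qm - reg * pu)
                Q[m][k] += lr * (err * pu - reg * qm)
    return P, Q


# Hypothetical observed entries (user, movie, rating); all others are unknown.
RATINGS = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 2, 5), (2, 3, 4), (2, 0, 1),
    (3, 2, 4), (3, 3, 5), (3, 1, 1),
]

P0, Q0 = factorize(RATINGS, 4, 4, epochs=0)  # untrained factors, same seed
P, Q = factorize(RATINGS, 4, 4)
print("RMSE before training:", round(rmse(RATINGS, P0, Q0), 3))
print("RMSE after training: ", round(rmse(RATINGS, P, Q), 3))
print("Predicted rating of user 0 for movie 3:", round(dot(P[0], Q[3]), 2))
```

The RMSE printed here is measured on the training entries only; a fair evaluation would hold out some known ratings as a test split, as in the prize data.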


Weakly Supervised Sentiment Analysis




Classify a product review as positive or negative.



“This camera takes poor quality photos. Yes, it’s slim and
lightweight. Yes, the shutter speed is snappy. But the photos are
of such poor quality that it’s a pretty useless camera.”

— Amazon.com




Data: a set of reviews along with product ratings.
Goal: an algorithm which, given as input a new product review, predicts
its polarity (positive or negative)



Consider a scenario with weak supervision: domain adaptation,
semi-supervised learning, language transfer, etc.



Possible tasks:
Classify movie reviews with a system trained on cellphone reviews
Train a system on English data and use it for reviews in Portuguese
What are the relevant features?
Adjectives? (not always helpful...)
Connective words: but, however, although,...


Academia Partnerships

CMU/Portugal
Seminars
Summer School (LxMLS)
Opportunity: Research Internships


CMU/Portugal

Dual PhD Program in Language Technologies
Priberam is an industrial partner
See how to apply at: http://www.cmuportugal.org
Note: deadline soon (December 15th)


Priberam Machine Learning Lunch Seminars

A series of informal meetings every two weeks at IST (Tuesdays 1PM)
Discussion forum involving different research groups interested in
machine learning
Everyone can attend, no registration needed
Delicious free food!


Lisbon Machine Learning School

An annual summer school held since 2011 devoted to ML and NLP



> 100 participants worldwide (mostly MSc and PhD students)



Priberam Labs co-organizes and is one of the sponsors
Google is the main sponsor
Next year’s topic is Big Data
More information and videos of past lectures: http://lxmls.it.pt


Opportunity: Research Internships

We’re offering short-term research internships at Priberam Labs!

Who? MSc/PhD students wanting a short experience in the industry
What? A stimulating research environment, connections to the
international ML and NLP research scene
How? Interns will work with us on a research project of their choice

Interested?
labs@priberam.com


Thank You!

More information about the Labs: http://labs.priberam.com
(You could be here.)


References
Altun, Y., Tsochantaridis, I., and Hofmann, T. (2003). Hidden Markov support vector
machines. In Proc. of International Conference of Machine Learning.
Berrou, C., Glavieux, A., and Thitimajshima, P. (1993). Near Shannon limit error-correcting
coding and decoding. In Proc. of International Conference on Communications, volume 93,
pages 1064–1070.
Bishop, C. (2006). Pattern recognition and machine learning. Springer New York.
Brants, T. (2000). TnT: a statistical part-of-speech tagger. In Proc. of the Sixth Conference on
Applied Natural Language Processing.
Brill, E. (1993). A Corpus-Based Approach to Language Learning. PhD thesis, University of
Pennsylvania.
Charniak, E. (1996). Tree-bank grammars. In Proc. of the National Conference on Artificial
Intelligence, pages 1031–1036.
Chomsky, N. (1965). Aspects of the Theory of Syntax, volume 119. The MIT press.
Collins, M. (1999). Head-driven statistical models for natural language parsing. PhD thesis,
University of Pennsylvania.
Eisner, J. (1996). Three new probabilistic models for dependency parsing: An exploration. In
Proc. of International Conference on Computational Linguistics, pages 340–345.
Halevy, A., Norvig, P., and Pereira, F. (2009). The unreasonable effectiveness of data.
Intelligent Systems, IEEE, 24(2):8–12.
Hudson, R. (1984). Word grammar. Blackwell Oxford.


Ising, E. (1925). Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik A Hadrons
and Nuclei, 31(1):253–258.
Klein, D. and Manning, C. (2003). Accurate unlexicalized parsing. In Proc. of Annual Meeting
on Association for Computational Linguistics, pages 423–430.
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques.
The MIT Press.
Koo, T., Globerson, A., Carreras, X., and Collins, M. (2007). Structured prediction models via
the matrix-tree theorem. In Empirical Methods for Natural Language Processing.
Kschischang, F. R., Frey, B. J., and Loeliger, H. A. (2001). Factor graphs and the sum-product
algorithm. IEEE Transactions on Information Theory, 47.
Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic
models for segmenting and labeling sequence data. In Proc. of International Conference of
Machine Learning.
Magerman, D. (1995). Statistical decision-tree models for parsing. In Proc. of Annual Meeting
on Association for Computational Linguistics, pages 276–283.
Manning, C. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing.
MIT Press, Cambridge, MA.
Martins, A. F. T., Figueiredo, M. A. T., Aguiar, P. M. Q., Smith, N. A., and Xing, E. P.
(2011a). An Augmented Lagrangian Approach to Constrained MAP Inference. In Proc. of
International Conference of Machine Learning.


Martins, A. F. T., Smith, N. A., Aguiar, P. M. Q., and Figueiredo, M. A. T. (2011b). Dual
Decomposition with Many Overlapping Components. In Proc. of Empirical Methods for
Natural Language Processing.
Martins, A. F. T., Smith, N. A., and Xing, E. P. (2009). Concise Integer Linear Programming
Formulations for Dependency Parsing. In Proc. of Annual Meeting of the Association for
Computational Linguistics.
Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T.
(2010a). Augmented Dual Decomposition for MAP Inference. In Neural Information
Processing Systems: Workshop in Optimization for Machine Learning.
Martins, A. F. T., Smith, N. A., Xing, E. P., Figueiredo, M. A. T., and Aguiar, P. M. Q.
(2010b). Turbo Parsers: Dependency Parsing by Approximate Variational Inference. In Proc.
of Empirical Methods for Natural Language Processing.
McDonald, R. T., Pereira, F., Ribarov, K., and Hajic, J. (2005). Non-projective dependency
parsing using spanning tree algorithms. In Proc. of Empirical Methods for Natural Language
Processing.
Mel’čuk, I. (1988). Dependency syntax: theory and practice. State University of New York Press.
Mitchell, T. (1997). Machine learning. McGraw Hill.
Nivre, J., Hall, J., Nilsson, J., Eryiğit, G., and Marinov, S. (2006). Labeled pseudo-projective
dependency parsing with support vector machines. In Procs. of International Conference on
Natural Language Learning.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufmann.


Potts, R. (1952). Some generalized order-disorder transformations. In Proceedings of the
Cambridge Philosophical Society, volume 48, pages 106–109. Cambridge Univ Press.
Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels. The MIT Press, Cambridge, MA.
Tanner, R. (1981). A recursive approach to low complexity codes. IEEE Transactions on
Information Theory, 27(5):533–547.
Taskar, B., Guestrin, C., and Koller, D. (2003). Max-margin Markov networks. In Proc. of
Neural Information Processing Systems.
Tesnière, L. (1959). Éléments de syntaxe structurale. Librairie C. Klincksieck.
Tsochantaridis, I., Hofmann, T., Joachims, T., and Altun, Y. (2004). Support vector machine
learning for interdependent and structured output spaces. In Proc. of International
Conference of Machine Learning.
Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum
decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269.

