Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Estelle Delpech1, Béatrice Daille1, Emmanuel Morin1, Claire Lemaire2,3
1 LINA, Université de Nantes
2 GREMUTS, Université de Grenoble
3 Lingua et Machina

COLING’12

10/12/12

Mumbai, India

Outline

1 Context
2 Translation method
3 Ranking method
4 Results of experiments
5 Future work

Context
Translation method
Ranking method
Future work

Context: comparable corpora for Computer-Aided Translation

Aim: provide domain-specific bilingual lexicons to translators when no parallel data is available
⇒ Comparable corpora:
Sets of texts in languages L1 and L2 that are not translations of each other but deal with the same subject matter, so that translation pairs can still be extracted

Motivations for compositional translation

Usual context-based methods [Fung, 1997]:
51% to 88% precision on top 20 candidates with specialized corpora [Daille and Morin, 2005]
⇒ lexicons difficult to use for translators [Delpech, 2011]

Compositional translation:
81% to 94% precision on Top1 [Robitaille et al., 2006, Cartoni, 2009, Morin and Daille, 2009]
More than 60% of terms in technical and scientific domains are morphologically complex [Namer and Baud, 2007]
Outperforms context-based approaches for the translation of terms with compositional meaning [Morin and Daille, 2009]

Compositional translation

Compositionality
“the meaning of the whole is a function of the meaning of the parts” [Keenan and Faltz, 1985, 24-25]

Input: "ab"
Decompose {a, b}
Translate {α, β}
Reorder {αβ, βα}
Select αβ
Output: "αβ"

Related work

Applied to phrases, decomposed into words [Robitaille et al., 2006, Morin and Daille, 2009]
rate of evaporation → taux d'évaporation

Applied to words, decomposed into morphemes [Cartoni, 2009, Harastani et al., 2012]
cardiology → cardiologie
ricostruire → rebuild

⇒ No approach links bound morphemes to words:
-cyto- → cellule 'cell'
cytotoxic → toxique pour les cellules 'toxic to the cells'

Selection and ranking methods

Select translations that occur in target texts / Web [Morin and Daille, 2009]
Select most frequent translation [Grefenstette, 1999]
Compare contexts [Garera and Yarowsky, 2008]
ML: binary classifier [Baldwin and Tanaka, 2004]
⇒ Combination of criteria
⇒ ML: learning-to-rank algorithms (IR)

Translation process overview

Input: "non-cytotoxic"
Decompose {non, cyto, toxic}
Concatenate {non, cyto, toxic}, {noncyto, toxic}, {non, cytotoxic}, {noncytotoxic}
Translate {non, cellule, toxique}, {non, cyto, toxique}, {non, cellule, toxicité}, {non, cyto, toxicité}
Reorder {non, toxique, cellule}, {non, cellule, toxique}, {cellule, toxique, non}
Concatenate {non, toxique, cellule}, {nontoxique, cellule}, {non, toxiquecellule}, {nontoxiquecellule}
Match {non, toxique, cellule}
Output: "non toxique pour les cellules" 'non toxic to the cells'

Decomposition

non-cytotoxic → {non, cyto, toxic}

Split source term into minimal components with heuristic rules:
split on hyphens
match substrings of the source term with:
  a list of morphemes
  a list of lexical items
respect some length constraints on the substrings
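
The decomposition rules can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the morpheme list, the lexicon, the minimum length of 3 and the greedy longest-match strategy are all assumptions, since the slides only name the rules.

```python
from typing import List, Set

# Hypothetical resources: the slides do not list the actual morphemes or lexical items.
MORPHEMES: Set[str] = {"non", "cyto", "cardio"}
LEXICON: Set[str] = {"toxic", "toxicity"}
MIN_LEN = 3  # assumed length constraint on substrings


def split_part(part: str) -> List[str]:
    """Greedy longest-match split of one hyphen-free part into known units."""
    known = MORPHEMES | LEXICON
    components: List[str] = []
    start = 0
    while start < len(part):
        # Try the longest substring first, never shorter than MIN_LEN.
        for end in range(len(part), start + MIN_LEN - 1, -1):
            if part[start:end] in known:
                components.append(part[start:end])
                start = end
                break
        else:
            # No known unit starts here: keep the whole part unsplit.
            return [part]
    return components


def decompose(term: str) -> List[str]:
    """Split on hyphens, then split each part into components."""
    components: List[str] = []
    for part in term.lower().split("-"):
        if part:
            components.extend(split_part(part))
    return components


print(decompose("non-cytotoxic"))  # ['non', 'cyto', 'toxic']
```

A part that cannot be fully covered by known units is kept whole, so the later concatenation and look-up steps still get a chance to match it.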

Concatenation

Generate all possible concatenations of the minimal components
Increases the chances of matching the components with entries of the dictionaries
{non, cyto, toxic} → {non, cyto, ∅}
{non, cytotoxic} → {non, cytotoxique}
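
One way to enumerate the concatenations, assuming that only adjacent components are merged and that their order is kept (which is what the non-cytotoxic example suggests), is to decide for each gap between two components whether it stays a boundary or not. The function name is illustrative.

```python
from itertools import product
from typing import List


def concatenations(components: List[str]) -> List[List[str]]:
    """Return every way of merging adjacent components, keeping their order."""
    if not components:
        return []
    results = []
    # Each of the n-1 gaps is either kept as a boundary (True) or merged (False).
    for gaps in product([True, False], repeat=len(components) - 1):
        grouped = [components[0]]
        for boundary, comp in zip(gaps, components[1:]):
            if boundary:
                grouped.append(comp)
            else:
                grouped[-1] += comp
        results.append(grouped)
    return results


for grouping in concatenations(["non", "cyto", "toxic"]):
    print(grouping)
# ['non', 'cyto', 'toxic']
# ['non', 'cytotoxic']
# ['noncyto', 'toxic']
# ['noncytotoxic']
```

A term with n minimal components yields 2^(n-1) groupings, so the enumeration stays cheap for typical terms of two to four components.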

Translation with direct dictionary look-up

Bilingual dictionary for lexical items:
toxic → toxique

Morpheme translation table for bound morphemes:
allow bound to free morpheme translation equivalence
-cyto- → -cyto-, cellule
{-cyto-, toxic} → {-cyto-, toxique}, {cellule, toxique}
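
The look-up step amounts to a Cartesian product over the translations of each component. The sketch below uses toy resources built from the slide's examples; the dictionary layout, the hyphen-free morpheme keys and the choice to return nothing when a component has no translation are assumptions.

```python
from itertools import product
from typing import Dict, List

# Toy resources built from the examples on the slide.
BILINGUAL_DICT: Dict[str, List[str]] = {"toxic": ["toxique"]}
MORPHEME_TABLE: Dict[str, List[str]] = {"cyto": ["cyto", "cellule"]}


def translate(components: List[str]) -> List[List[str]]:
    """Translate each component, then combine the options.

    Returns an empty list if any component has no translation.
    """
    options = []
    for comp in components:
        candidates = BILINGUAL_DICT.get(comp, []) + MORPHEME_TABLE.get(comp, [])
        if not candidates:
            return []
        options.append(candidates)
    return [list(combo) for combo in product(*options)]


print(translate(["cyto", "toxic"]))
# [['cyto', 'toxique'], ['cellule', 'toxique']]
```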

Translation with variation

Morphological lexicon
toxic → toxique → toxicité 'toxicity'

Synonyms
toxic → toxique → vénéneux 'poisonous'

{-cyto-, toxic} → {-cyto-, toxicité}, {-cyto-, vénéneux}, {cellule, toxicité}, {cellule, vénéneux}

Reordering

No translation patterns or reordering rules
Permute the translated components:
{cellule, toxique} → {cellule, toxique}, {toxique, cellule}
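
Since no reordering rules are used, the step reduces to enumerating permutations. A minimal sketch with the standard library (the function name is illustrative):

```python
from itertools import permutations
from typing import List


def reorder(components: List[str]) -> List[List[str]]:
    """Return every ordering of the translated components, without duplicates."""
    seen = set()
    orderings = []
    for perm in permutations(components):
        if perm not in seen:
            seen.add(perm)
            orderings.append(list(perm))
    return orderings


print(reorder(["cellule", "toxique"]))
# [['cellule', 'toxique'], ['toxique', 'cellule']]
```

The number of orderings grows as n!, which is manageable here because terms rarely have more than a handful of components.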

Concatenation

Recreate target words by generating all possible concatenations of the components:
{toxique, cellule} → {toxique cellule}, {toxiquecellule}

Selection

Match target words with the words of the target corpus
Allow at most 3 stop words between two words
{toxique cellule} → "toxique pour les cellules" 'toxic to the cells'
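
The matching rule can be sketched as a scan over the target corpus that tolerates up to three stop words between consecutive candidate words. The stop-word list is an assumption, and the sketch compares tokens exactly, so it presumes that candidate and corpus are in the same form (for instance both lemmatized); the slides do not say how "cellule" is matched with "cellules".

```python
from typing import List, Optional

# Assumed stop-word list; the slides do not give the one actually used.
STOP_WORDS = {"pour", "les", "le", "la", "de", "des", "du"}
MAX_GAP = 3  # at most 3 stop words between two candidate words


def find_match(candidate: List[str], corpus: List[str]) -> Optional[List[str]]:
    """Return the first corpus span that realises the candidate, or None."""
    if not candidate:
        return None
    for start, token in enumerate(corpus):
        if token != candidate[0]:
            continue
        pos = start
        matched = True
        for word in candidate[1:]:
            nxt = pos + 1
            gap = 0
            # Skip up to MAX_GAP stop words before the next candidate word.
            while (nxt < len(corpus) and corpus[nxt] != word
                   and corpus[nxt] in STOP_WORDS and gap < MAX_GAP):
                nxt += 1
                gap += 1
            if nxt < len(corpus) and corpus[nxt] == word:
                pos = nxt
            else:
                matched = False
                break
        if matched:
            return corpus[start:pos + 1]
    return None


corpus = "produit non toxique pour les cellules saines".split()
print(find_match(["toxique", "cellules"], corpus))
# ['toxique', 'pour', 'les', 'cellules']
```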

Target term frequency

Number of occurrences of target term divided by the total number of occurrences in the target texts
$$\mathrm{Freq}(t) = \frac{\mathrm{occ}(t)}{N}$$

Context similarity measure

Corresponds to context-based approaches
Collect words co-occurring with source and target term in a window of 5 words
Normalize co-occurrences with log-likelihood ratio
Compare contexts with weighted Jaccard
$$\mathrm{Cont}(s, t) = \frac{\sum_{w \in s \cap t} \min(c(s, w), c(t, w))}{\sum_{w \in s \cup t} \max(c(s, w), c(t, w))}$$
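
A minimal sketch of the weighted Jaccard comparison, assuming each context is a dictionary from context word to a non-negative association weight (for example its log-likelihood ratio) and that both contexts are already expressed over the same vocabulary. How source context words are mapped to the target language is not covered here.

```python
from typing import Dict


def weighted_jaccard(ctx_s: Dict[str, float], ctx_t: Dict[str, float]) -> float:
    """Sum of min weights over shared words / sum of max weights over all words."""
    shared = ctx_s.keys() & ctx_t.keys()
    union = ctx_s.keys() | ctx_t.keys()
    numerator = sum(min(ctx_s[w], ctx_t[w]) for w in shared)
    denominator = sum(max(ctx_s.get(w, 0.0), ctx_t.get(w, 0.0)) for w in union)
    return numerator / denominator if denominator else 0.0


# Made-up weights, for illustration only.
source_ctx = {"a": 2.0, "b": 1.0}
target_ctx = {"a": 1.0, "c": 3.0}
print(weighted_jaccard(source_ctx, target_ctx))  # 1 / 6
```

Identical contexts score 1.0 and contexts with no shared word score 0.0, so the value can be used directly as a ranking feature.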

Part-of-speech translation probability

Probability that source term with part-of-speech A translates to target term with part-of-speech B
Pos(s, t) = P(pos(t) | pos(s)) = P(B | A)
Acquired from POS-tagged parallel corpora [Tiedemann, 2009] with word alignment software AnyMalign [Lardrilleux, 2008]
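
Given word alignments over POS-tagged parallel text, P(B | A) can be estimated by relative frequency. The sketch below assumes the alignments have already been reduced to (source tag, target tag) pairs; the tag names and the estimator are illustrative, and the slides do not state which estimator was used.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def pos_translation_probs(
    aligned_tags: Iterable[Tuple[str, str]],
) -> Dict[Tuple[str, str], float]:
    """Estimate P(target_pos | source_pos) by relative frequency over aligned pairs."""
    pairs = list(aligned_tags)
    pair_counts = Counter(pairs)
    source_counts = Counter(src for src, _ in pairs)
    return {
        (src, tgt): n / source_counts[src]
        for (src, tgt), n in pair_counts.items()
    }


# Made-up aligned tag pairs, for illustration only.
pairs = [("ADJ", "ADJ"), ("ADJ", "ADJ"), ("ADJ", "NOUN"), ("NOUN", "NOUN")]
probs = pos_translation_probs(pairs)
print(probs[("ADJ", "NOUN")])  # one third of aligned adjectives map to a noun
```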

Resources reliability score

Some translation resources might give more reliable translations than others
e.g. bilingual dictionary > synonyms
score = mean of the reliability of the resources used for translating the components
$$\mathrm{Reso}(t = \{c_1, \dots, c_n\}) = \frac{\sum_{i=1}^{n} \mathrm{resource\_reliability}(c_i)}{n}$$
Tuned on training data
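
As a small illustration, the score is the mean reliability of the resources used for the components. The reliability values below are invented placeholders; in the talk they are tuned on training data.

```python
from typing import Dict, List

# Hypothetical reliability values; the tuned values are not given in the slides.
RELIABILITY: Dict[str, float] = {
    "bilingual_dictionary": 1.0,
    "morpheme_table": 0.9,
    "synonyms": 0.5,
}


def reso(resources_used: List[str], reliability: Dict[str, float] = RELIABILITY) -> float:
    """Mean reliability of the resources used, one entry per translated component."""
    if not resources_used:
        return 0.0
    return sum(reliability[r] for r in resources_used) / len(resources_used)


# Candidate whose two components came from the dictionary and the synonym list.
print(reso(["bilingual_dictionary", "synonyms"]))  # 0.75
```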

Combination

Linear combination of the 4 criteria: Frequency, Context, Part-of-speech translation probability and Resources reliability
Combi(t, s) = Freq(t) + Cont(s, t) + Pos(s, t) + Reso(t)
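
Ranking with the linear combination can be sketched as an unweighted sum followed by a sort. The feature values below are made up for illustration, and the dictionary layout is an assumption.

```python
from typing import Dict, List

FEATURES = ("freq", "cont", "pos", "reso")


def combi(scores: Dict[str, float]) -> float:
    """Unweighted sum of the four criteria."""
    return sum(scores[name] for name in FEATURES)


def rank(candidates: Dict[str, Dict[str, float]]) -> List[str]:
    """Order candidate translations by decreasing combined score."""
    return sorted(candidates, key=lambda t: combi(candidates[t]), reverse=True)


# Made-up feature values for two candidate translations of one source term.
candidates = {
    "toxique pour les cellules": {"freq": 0.02, "cont": 0.40, "pos": 0.60, "reso": 0.90},
    "toxicité cellule": {"freq": 0.01, "cont": 0.10, "pos": 0.20, "reso": 0.70},
}
print(rank(candidates)[0])  # toxique pour les cellules
```

Because the sum is unweighted, features with a larger numeric range dominate the score; the learning-to-rank models are one way to learn how much each feature should count.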

Machine learning

Learning-to-rank algorithms used in IR for ranking documents
Tried 3 algorithms implemented in the RankLib software¹:
AdaRank [Li and Xu, 2007]
Coordinate Ascent [Metzler and Croft, 2000]
LambdaMART [Wu et al., 2010]
Features: Freq, Cont, Pos, Reso

¹ http://people.cs.umass.edu/~vdang/ranklib.html
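
RankLib reads training data in the LETOR/SVMlight-style text format, one candidate per line: a relevance label, a query id, then numbered features. A minimal sketch of how the four features might be serialized is shown below; treating each source term as a query and each candidate translation as a document is an assumption about the setup, and the helper name is illustrative.

```python
from typing import Sequence


def letor_line(label: int, qid: int, features: Sequence[float], comment: str = "") -> str:
    """Format one candidate as '<label> qid:<id> 1:<v1> 2:<v2> ... # comment'."""
    feats = " ".join(f"{i}:{v:.4f}" for i, v in enumerate(features, start=1))
    line = f"{label} qid:{qid} {feats}"
    return f"{line} # {comment}" if comment else line


# Features in the order Freq, Cont, Pos, Reso (made-up values).
print(letor_line(1, 7, [0.02, 0.4, 0.6, 0.9], "toxique pour les cellules"))
# 1 qid:7 1:0.0200 2:0.4000 3:0.6000 4:0.9000 # toxique pour les cellules
```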

Corpora

English → French, German
breast cancer
≈ 400k words per language

Lexicons

Morpheme translation table (hand-crafted)
General language dictionary (Xelda)
Synonyms (Xelda)
Domain-specific dictionary: cognates extracted from corpus [Hauer and Kondrak, 2011]
Morphological families [Porter, 1980]

Training and evaluation datasets

EVALUATION ≈ 100 source terms
source terms in UMLS meta-thesaurus with translation(s) in target texts

TRAINING ≈ 600 source terms
source terms for which a translation could be generated and whose translation(s) occur in the target texts
generated translations were scored manually

⇒ evaluation and training datasets are disjoint
⇒ source terms are morphologically complex words with no translation in dictionary

Results for translation generation

                                      EN → FR     EN → DE
# source terms                        126         90
# at least 1 translation              86 (68%)    56 (62%)

                                      EN → FR     EN → DE
# at least 1 translation              86          56
1 trans. in UMLS                      68 (79%)    40 (71%)
1 trans. in UMLS or judged correct    81 (94%)    51 (91%)

Results for translation ranking

                 EN → FR    EN → DE    Average
Random           .83        .80        .815
Freq             .92        .84        .88
Cont             .90        .82        .86
Pos              .88        .91        .895
Reso             .92        .82        .87
Combination      .93        .89        .91
ML AdaRank       .90        .84        .87
ML CoordAsc      .93        .89        .91
ML LambdaMart    .86        .88        .87

Table: Top1 translation in UMLS or judged correct

Silence analysis

Missing translation in resources (≈30%)
Target term is not compositional (≈30%)
breastfeeding → allaitement (FR), stillen (DE)
Lexical divergence (≈20%)
radiosensitivity → Strahlentoleranz, sensitivity = toleranz
Additional elements (≈13%)
postpartum → postpartalperiod

Error analysis

Problems in word reordering
self-examination → untersuchung selbst 'examination self'

Wrong or inappropriate translations
in-patient → pas malade 'not ill'
in → "inside" → inside patient
in → "inverse" → not a patient

Impact of fertile translations

                      EN → FR    EN → DE
exact translations    21%        10%
wrong translations    50%        80%

Table: % of fertile translations

German, a Germanic language: tendency to agglutination
oestrogen-independent → Östrogen-unabhängige
French, a Romance language: creates phrases more easily
oestrogen-independent → indépendant des œstrogènes

Future work

Improve quality of linguistic resources
  morphological derivation rules instead of stemming
  use of a thesaurus
Try translation patterns on top of permutations
Try learning morpheme translation equivalences from
  cognates
  bilingual dictionaries
  out-of-domain parallel data

Thank you for your attention.

estelle.delpech@univ-nantes.fr
beatrice.daille@univ-nantes.fr
emmanuel.morin@univ-nantes.fr
cl@lingua-et-machina.com

Exact translations

Non-fertile:
pathophysiological → physiopathologique
overactive → überaktiv

Fertile:
cardiotoxicity → toxicité cardiaque 'cardiac toxicity'
mastectomy → ablation der brust 'ablation of the breast'

Morphological variants

Non-fertile:
dosimetry → dosimétrique 'dosimetric'
radiosensitivity → strahlenempfindlich 'radiosensitive'

Fertile:
milk-producing → production de lait 'production of milk'
selfexamination → selbst untersuchen 'self examine'

Inexact but semantically related

Non-fertile:
oncogene → oncogénèse 'oncogenesis'
breakthrough → durchbrechen 'break'

Fertile:
chemoradiotherapy → chemotherapie oder strahlen 'chemotherapy or radiation'
treatable → pouvoir le traiter 'can treat it'

Wrong translations

Non-fertile:
immunoscore → immunomarquer 'immunostain'
check-in → unkontrollieren 'uncontrolled'

Fertile:
bloodstream → fliessen mehr blut 'more blood flow'
risk-reducing → risque de réduire 'risk of reducing'

References I
Baldwin, T. and Tanaka, T. (2004).
Translation by machine of complex nominals.
In Proceedings of the ACL 2004 Workshop on Multiword expressions: Integrating Processing, pages 24–31,
Barcelona, Spain.
Bo, L. and Gaussier, E. (2010).
Improving corpus comparability for bilingual lexicon extraction from comparable corpora.
In 23rd International Conference on Computational Linguistics, pages 23–27, Beijing, China.
Cartoni, B. (2009).
Lexical morphology in machine translation: A feasibility study.
In Proceedings of the 12th Conference of the European Chapter of the ACL, pages 130–138, Athens, Greece.
Daille, B. and Morin, E. (2005).
French-English terminology extraction from comparable corpora.
In Proceedings, 2nd International Joint Conference on Natural Language Processing, volume 3651 of
Lecture Notes in Computer Science, pages 707–718, Jeju Island, Korea. Springer.
Delpech, E. (2011).
Evaluation of terminologies acquired from comparable corpora : an application perspective.
In Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011), volume 11
of NEALT Proceedings Series, pages 66–73, Riga, Latvia. Pedersen B.S., Nešpore G., Skadiņa I.
Fung, P. (1997).
Finding terminology translations from non-parallel corpora.
pages 192–202, Hong Kong.
Garera, N. and Yarowsky, D. (2008).
Translating compounds by learning component gloss translation via multiple languages.
In Proceedings of the 3rd International Joint Conference on Natural Language Processing, volume 1, pages
403–410, Hyderabad, India.

References II
Grefenstette, G. (1999).
The world wide web as a resource for example-based machine translation tasks.
ASLIB’99 Translating and the computer, 21.
Harastani, R., Daille, B., and Morin, E. (2012).
Neoclassical compound alignments from comparable corpora.
In Proceedings of the 13th International Conference on Computational Linguistics and Intelligent Text
Processing, volume 2, pages 72–82, New Delhi, India.
Hauer, B. and Kondrak, G. (2011).
Clustering semantically equivalent words into cognate sets in multilingual lists.
In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 865–873,
Chiang Mai, Thailand.
Keenan, E. L. and Faltz, L. M. (1985).
Boolean semantics for natural language.
D. Reidel, Dordrecht, Holland.
Lardrilleux, A. (2008).
A truly multilingual, high coverage, accurate, yet simple, sub-sentential alignment method.
Li, H. and Xu, J. (2007).
AdaRank: A boosting algorithm for information retrieval.
In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in
information retrieval, pages 391–398, Amsterdam, The Netherlands.
Metzler, D. and Croft, W. B. (2000).
Linear feature-based models for information retrieval.
Information Retrieval, 10(3):257–274.

References III
Morin, E. and Daille, B. (2009).
Compositionality and lexical alignment of multi-word terms.
In Language Resources and Evaluation (LRE), volume 44 of Multiword expression: hard going or plain
sailing, pages 79–95. P. Rayson, S. Piao, S. Sharoff, S. Evert, B. Villada Moirón, Springer Netherlands edition.
Morin, E. and Daille, B. (2010).
Compositionality and lexical alignment of multi-word terms.
In Rayson, P., Piao, S., Sharoff, S., Evert, S., and B., V. M., editors, Language Resources and Evaluation
(LRE), volume 44 of Multiword expression: hard going or plain sailing, pages 79–95. Springer Netherlands.
Namer, F. and Baud, R. (2007).
Defining and relating biomedical terms: Towards a cross-language morphosemantics-based system.
International Journal of Medical Informatics, 76(2-3):226–33.
Porter, M. F. (1980).
An algorithm for suffix stripping.
Program, 14(3):130–137.
Robitaille, X., Sasaki, X., Tonoike, M., Sato, S., and Utsuro, S. (2006).
Compiling French-Japanese terminologies from the web.
In Proceedings of the 11th Conference of the European Chapter of the Association for Computational
Linguistics, pages 225–232, Trento, Italy.
Tiedemann, J. (2009).
News from opus - a collection of multilingual parallel corpora with tools and interfaces.
Wu, Q., Burges, J. C., Svore, K., and Gao, J. (2010).
Adapting boosting for information retrieval measures.
Journal of Information Retrieval, 13(3):254–270.
