Statistical Alignment and Machine Translation
Lecture 8: Statistical Alignment & Machine Translation (Chapter 13 of Manning & Schutze) Wen-Hsiang Lu (盧文祥) Department of Computer Science and Information Engineering, National Cheng Kung University 2014/05/12
(Slides from Berlin Chen, National Taiwan Normal University http://140.122.185.120/PastCourses/2004SNaturalLanguageProcessing/NLP_main_2004S.htm)
References:
1. Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., Mercer, R. L. The Mathematics of Statistical Machine Translation, Computational Linguistics, 1993.
2. Knight, K. Automating Knowledge Acquisition for Machine Translation, AI Magazine, 1997.
3. Gale, W. A. and Church, K. W. A Program for Aligning Sentences in Bilingual Corpora, Computational Linguistics, 1993.
Machine Translation (MT) • Definition – Automatic translation of text or speech from one language to another
• Goal – Produce close to error-free output that reads fluently in the target language – Current systems are still far from this goal
• Current Status – Existing systems still fall short in quality • Babel Fish Translation (before 2004) • Google Translation – A mix of probabilistic and non-probabilistic components
Issues • Build high-quality semantic-based MT systems in circumscribed domains • Abandon automatic MT, build software to assist human translators instead – Post-edit the output of a buggy translation
• Develop automatic knowledge acquisition techniques for improving general-purpose MT – Supervised or unsupervised learning
Different Strategies for MT (the translation pyramid, flattened from the diagram):
• Interlingua (knowledge representation) – knowledge-based translation
• English (semantic representation) ←→ French (semantic representation) – semantic transfer
• English (syntactic parse) ←→ French (syntactic parse) – syntactic transfer
• English Text (word string) ←→ French Text (word string) – word-for-word
Word for Word MT • Translate words one-by-one from one language to another – Problems 1. No one-to-one correspondence between words in different languages (lexical ambiguity) – Need to look at context larger than the individual word (→ phrase or clause) 2. Languages have different word orders
• Example: English "suit" maps to different French words depending on the meaning – lawsuit vs. set of garments
Syntactic Transfer MT • Parse the source text, transfer the parse tree of the source text into a syntactic tree in the target language, and then generate the translation from this syntactic tree – Solves the problem of word ordering – Problems • Syntactic ambiguity • The target syntax will likely mirror that of the source text
• Example of a structural mismatch – German: Ich esse gern (I like to eat); literal English: I eat readily/gladly
Text Alignment: Definition • Definition – Align paragraphs, sentences or words in one language to paragraphs, sentences or words in another language • Thus we can learn which words tend to be translated by which other words in the other language
→ bilingual dictionaries, MT, parallel grammars, … – Text alignment is not part of the MT process per se • But it is the obligatory first step for making use of multilingual text corpora
• Applications – Bilingual lexicography – Machine translation – Multilingual information retrieval – …
Text Alignment: Sources and Granularities • Sources of parallel texts or bitexts – Parliamentary proceedings (Hansards) – Newspapers and magazines – Religious and literary works (often with less literal translation)
• Two levels of alignment – Gross (large-scale) alignment • Learn which paragraphs or sentences correspond to which paragraphs or sentences in another language – Word alignment • Learn which words tend to be translated by which words in another language • The necessary step for acquiring a bilingual dictionary (The order of words or sentences might not be preserved.)
Text Alignment: Example 1
2:2 alignment
Text Alignment: Example 2
2:2 alignment
1:1 alignment 1:1 alignment
2:1 alignment
Each such group is a bead of the sentence alignment. Studies show that around 90% of alignments are 1:1 sentence alignments.
Sentence Alignment • Crossing dependencies are not allowed here – Word ordering is preserved!
• Related work
Sentence Alignment • Length-based
• Lexical-guided • Offset-based
Sentence Alignment Length-based method
• Rationale: short sentences tend to be translated as short sentences and long sentences as long sentences – Length is defined as the number of words or the number of characters
• Approach 1 (Gale & Church 1993)
Union Bank of Switzerland (UBS) corpus: English, French, and German
– Assumptions • The paragraph structure was clearly marked in the corpus; confusions were checked by hand • Lengths of sentences are measured in characters • Crossing dependencies are not handled here – The order of sentences is not changed in the translation (This ignores the rich lexical information available in the text.)
Sentence Alignment Length-based method
The source text S = s1 s2 … sI and the target text T = t1 t2 … tJ are partitioned into a sequence of beads B1, B2, …, BK; the possible alignment types for a bead are {1:1, 1:0, 0:1, 2:1, 1:2, 2:2, …}. Assuming probabilistic independence between beads:

arg max_A P(A | S, T) = arg max_A P(A, S, T) ≈ arg max_A ∏_{k=1}^{K} P(B_k)

where A = B1, B2, …, BK.
Sentence Alignment Length-based method
– Dynamic Programming • The cost function (Distance Measure)
cost(align(l1, l2)) = −log P(align | δ(l1, l2, c, s²))
                    = −log [ P(δ(l1, l2, c, s²) | align) · P(align) ]   (Bayes' law, dropping the constant P(δ))

where l1 and l2 are the character lengths of the two portions of text, c is the expected ratio l2/l1 of text lengths in the two languages, and the standardized length difference

δ = (l2 − l1·c) / sqrt(l1·s²)

is modeled as a standard normal variable, so P(δ | align) can be computed from the standard normal distribution as 2(1 − Φ(|δ|)).
• The sentence is the unit of alignment • Character lengths are modeled statistically
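As a concrete sketch of this cost, with illustrative values for c, s², and the bead priors (assumed for the example, not the values fitted by Gale & Church):

```python
import math

# Illustrative parameters (assumed, not fitted to any corpus):
C, S2 = 1.0, 6.8   # expected length ratio c and variance s^2

def delta(l1, l2):
    """Standardized length difference: (l2 - l1*c) / sqrt(l1 * s^2)."""
    return (l2 - l1 * C) / math.sqrt(l1 * S2)

def p_delta_given_align(d):
    """Two-sided tail probability under a standard normal: 2 * (1 - Phi(|d|))."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(d) / math.sqrt(2.0))))

# Hypothetical bead-type priors P(align).
PRIOR = {'1:1': 0.89, '1:0': 0.01, '0:1': 0.01, '2:1': 0.045, '1:2': 0.045}

def cost(l1, l2, bead):
    """cost = -log P(align) - log P(delta | align), as on the slide."""
    return -math.log(PRIOR[bead]) - math.log(p_delta_given_align(delta(l1, l2)) + 1e-300)

matched = cost(100, 103, '1:1')     # similar lengths -> low cost
mismatched = cost(100, 180, '1:1')  # very different lengths -> high cost
```

Aligning a 100-character sentence with a 103-character one is much cheaper than with a 180-character one, which is what drives the dynamic-programming search below.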
Sentence Alignment Length-based method
• The prior probability P(align) of each bead type
• Search with dynamic programming: D(i, j) is the lowest total cost of aligning s1…si with t1…tj:

D(i, j) = min { D(i, j−1) + cost(0:1 align(∅, tj)),
                D(i−1, j) + cost(1:0 align(si, ∅)),
                D(i−1, j−1) + cost(1:1 align(si, tj)),
                D(i−1, j−2) + cost(1:2 align(si, tj−1, tj)),
                D(i−2, j−1) + cost(2:1 align(si−1, si, tj)),
                D(i−2, j−2) + cost(2:2 align(si−1, si, tj−1, tj)) }
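This recurrence can be sketched as a short program; the cost function here is a simplified stand-in with made-up priors, so the structure of the DP, not the exact statistics, is the point:

```python
import math

def bead_cost(src_chars, tgt_chars, prior):
    """Toy cost: -log(prior) plus a squared length-difference penalty
    (a stand-in for the -log P(delta | align) term)."""
    return -math.log(prior) + (src_chars - tgt_chars) ** 2 / max(src_chars + tgt_chars, 1)

# bead type -> (source sentences consumed, target sentences consumed, prior)
BEADS = {'0:1': (0, 1, 0.01), '1:0': (1, 0, 0.01), '1:1': (1, 1, 0.89),
         '1:2': (1, 2, 0.045), '2:1': (2, 1, 0.045), '2:2': (2, 2, 0.01)}

def sentence_align(src, tgt):
    """D(i, j) = min over bead types, as in the recurrence on the slide.
    src and tgt are lists of sentence lengths in characters."""
    I, J = len(src), len(tgt)
    INF = float('inf')
    D = [[INF] * (J + 1) for _ in range(I + 1)]
    back = [[None] * (J + 1) for _ in range(I + 1)]
    D[0][0] = 0.0
    for i in range(I + 1):
        for j in range(J + 1):
            if D[i][j] == INF:
                continue
            # relax every bead that could start at cell (i, j)
            for di, dj, prior in BEADS.values():
                ni, nj = i + di, j + dj
                if ni > I or nj > J:
                    continue
                c = D[i][j] + bead_cost(sum(src[i:ni]), sum(tgt[j:nj]), prior)
                if c < D[ni][nj]:
                    D[ni][nj], back[ni][nj] = c, (i, j)
    # trace back the cheapest bead sequence
    beads, i, j = [], I, J
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((i - pi, j - pj))
        i, j = pi, pj
    return D[I][J], beads[::-1]

total, beads = sentence_align([20, 21, 15, 16], [20, 22, 33])
```

On four source sentences of lengths 20, 21, 15, 16 and three target sentences of lengths 20, 22, 33, the cheapest path is two 1:1 beads followed by a 2:1 bead.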
Sentence Alignment Length-based method
– A simple example: two candidate alignments of sentences s1…s4 with t1…t3
• Alignment 1: cost(align(s1, s2, t1)) + cost(align(s3, t2)) + cost(align(s4, t3))
• Alignment 2: cost(align(s1, t1)) + cost(align(s2, t2)) + cost(align(s3, Ø)) + cost(align(s4, t3))
Sentence Alignment Length-based method
– A 4% error rate was achieved – Problems: • Cannot handle noisy and imperfect input – E.g., OCR output or files containing unknown markup conventions – Finding paragraph or sentence boundaries is difficult – Solution: just align text (position) offsets in the two parallel texts (Church 1993) • Questionable for languages with few cognates or different writing systems – E.g., English ←→ Chinese, Eastern European languages ←→ Asian languages
Sentence Alignment Lexical method
• Approach 1 (Kay and Röscheisen 1993) – First assume the first and last sentences of the texts are aligned, as the initial anchors – Form an envelope of possible alignments • Alignments are excluded when their sentences cross anchors or their respective distances from an anchor differ greatly – Choose word pairs whose distributions are similar in most of the sentences – Find pairs of source and target sentences which contain many possible lexical correspondences • The most reliable pairs are used to induce a set of partial alignments (added to the list of anchors) • Iterate
Word Alignment • The sentence/offset alignment can be extended to a word alignment • Some criteria are then used to select aligned word pairs to include them into the bilingual dictionary – Frequency of word correspondences – Association measures – ….
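One simple association measure is the Dice coefficient over co-occurrence counts; the counts below are made up for illustration:

```python
def dice(cooc, count_s, count_t):
    """Dice association between a source word and a target word:
    2 * #co-occurrences / (sum of the words' individual counts), in [0, 1]."""
    return 2.0 * cooc / (count_s + count_t)

# Hypothetical counts over aligned sentence pairs:
likely_pair = dice(40, 50, 52)   # words that co-occur almost whenever either appears
chance_pair = dice(3, 50, 60)    # words that co-occur only incidentally
```

Word pairs whose association score clears a threshold are the ones worth adding to the bilingual dictionary.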
Statistical Machine Translation • The noisy channel model:

e → [ Translation Model P(f|e) ] → f → [ Decoder ] → ê

• Language model: P(e); translation model: P(f|e); decoder: ê = arg max_e P(e|f)
• e: English, f: French; |e| = l, |f| = m; each French word fj is aligned to an English word e_{a_j}
– Assumptions: • An English word can be aligned with multiple French words while each French word is aligned with at most one English word • Independence of the individual word-to-word translations
Statistical Machine Translation • Three important components involved
– Decoder: ê = arg max_e P(e|f) = arg max_e [ P(e) P(f|e) / P(f) ] = arg max_e P(e) P(f|e)
– Language model • Gives the probability P(e)
– Translation model:

P(f|e) = (1/Z) Σ_{a1=0}^{l} … Σ_{am=0}^{l} ∏_{j=1}^{m} P(fj | e_{a_j})

summing over all possible alignments, where Z is a normalization constant, e_{a_j} is the English word that the French word fj is aligned with, and P(fj | e_{a_j}) is a translation probability.
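For this model the sum over alignments factors per French word, which a brute-force sketch can verify (the translation table t and the sentences are made up; Z is taken to be (l+1)^m, one common choice):

```python
from itertools import product

def p_brute(f, e, t):
    """P(f|e) = (1/Z) * sum over all (l+1)^m alignments of prod_j t(f_j | e_{a_j})."""
    e0 = ['NULL'] + e                            # position 0 is the empty word
    l, m = len(e), len(f)
    total = 0.0
    for a in product(range(l + 1), repeat=m):    # one a_j per French word
        p = 1.0
        for j in range(m):
            p *= t.get((f[j], e0[a[j]]), 0.0)
        total += p
    return total / (l + 1) ** m

def p_factored(f, e, t):
    """The same quantity, factored per word: prod_j (1/(l+1)) * sum_i t(f_j | e_i)."""
    e0 = ['NULL'] + e
    p = 1.0
    for fj in f:
        p *= sum(t.get((fj, ei), 0.0) for ei in e0) / (len(e) + 1)
    return p

# Hypothetical translation table t(f | e):
t = {('la', 'the'): 0.7, ('la', 'NULL'): 0.2,
     ('maison', 'house'): 0.8, ('maison', 'the'): 0.1}
f, e = ['la', 'maison'], ['the', 'house']
```

The factored form turns an exponential sum into an O(l·m) product, which is what makes Model 1 tractable.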
An Introduction to Statistical Machine Translation Dept. of CSIE, NCKU Yao-Sheng Chang Date: 2011.04.12
Introduction • Translation involves many cultural aspects – We consider only the translation of individual sentences, and merely acceptable sentences.
• Every sentence in one language is a possible translation of any sentence in the other – Assign (S,T) a probability, Pr(T|S), to be the probability that a translator will produce T in the target language when presented with S in the source language.
Statistical Machine Translation (SMT)
• Noisy channel problem
Fundamental of SMT • Given a string of French f, the job of our translation system is to find the string e that the speaker had in mind when producing f. By Bayes' theorem: ê = arg max_e Pr(e|f) = arg max_e Pr(e) Pr(f|e) / Pr(f)
• Since the denominator Pr(f) is a constant, the best e is the one with the greatest Pr(e) Pr(f|e).
Practical Challenges • Computation of the translation model Pr(f|e) • Computation of the language model Pr(e) • Decoding (i.e., searching for the e that maximizes Pr(f|e) Pr(e))
Alignment of case 1 (one-to-many)
Alignment of case 2 (many-to-one)
Alignment of case 3 (many-to-many)
Formulation of Alignment (1) • Let e = e1^l = e1 e2 … el and f = f1^m = f1 f2 … fm
• An alignment between a pair of strings e and f is a mapping that connects every word fj to some word ei:
e = e1 e2 … ei … el
f = f1 f2 … fj … fm, with aj = i, so that e_{a_j} = ei is the English word generating fj
• In other words, an alignment a between e and f says that each word fj, 1 ≤ j ≤ m, is generated by the word e_{a_j}, with aj ∈ {0, 1, …, l}
• There are (l+1)^m different alignments between e and f (including the null word, for words with no mapping)
Translation Model
• The alignment a can be represented by a series a1^m = a1 a2 … am of m values, each between 0 and l, such that if the word in position j of the French string is connected to the word in position i of the English string then aj = i, and if it is not connected to any English word then aj = 0 (null).
IBM Model I (1)
IBM Model I (2) • The alignment is determined by specifying the values of aj for j from 1 to m, each of which can take any value from 0 to l.
Constrained Maximization • We wish to adjust the translation probabilities so as to maximize Pr(f|e) subject to the constraint that, for each e, Σ_f t(f|e) = 1
Lagrange Multipliers (1) • Method of Lagrange multipliers, with one constraint – If there is a maximum or minimum subject to the constraint g(x, y) = 0, then it will occur at one of the critical numbers of the function F defined by

F(x, y, λ) = f(x, y) + λ g(x, y)

– f(x, y) is called the objective function – g(x, y) = 0 is called the constraint equation – F(x, y, λ) is called the Lagrange function – λ is called the Lagrange multiplier
Lagrange Multipliers (2) • Following standard practice for constrained maximization, we introduce Lagrange multipliers λe and seek an unconstrained extremum of the auxiliary function

h(t, λ) = Pr(f|e) − Σ_e λe ( Σ_f t(f|e) − 1 )
Derivation (1) • The partial derivative of h with respect to t(f|e) is
• where δ is the Kronecker delta function, equal to one when both of its arguments are the same and equal to zero otherwise
Derivation (2)
• We call the expected number of times that e connects to f in the translation (f|e) the count of f given e for (f|e), and denote it by c(f|e; f, e). By definition,
Derivation (3) • Replacing λe by λe Pr(f|e), Equation (11) can be written very compactly as
• In practice, our training data consists of a set of translations (f⁽¹⁾|e⁽¹⁾), (f⁽²⁾|e⁽²⁾), …, (f⁽ˢ⁾|e⁽ˢ⁾), so this equation becomes
Derivation (4) • Rearranging the sums yields an expression that can be evaluated efficiently.
Derivation (5) • Thus, the number of operations necessary to calculate a count is proportional to l + m, rather than to (l + 1)^m as in Equation (12)
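Putting the pieces together, the EM updates for t(f|e) can be sketched as follows; the E-step uses the efficient O(l + m) form t(f|e) / Σ_i t(f|e_i) of the expected count, and the tiny corpus is made up:

```python
from collections import defaultdict

def train_model1(pairs, iterations=20):
    """EM training of IBM Model 1 translation probabilities t(f|e).
    pairs: list of (French tokens, English tokens)."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f|e; f, e), summed over pairs
        total = defaultdict(float)   # normalizer per English word
        for fs, es in pairs:
            es0 = ['NULL'] + es      # include the empty word
            for f in fs:
                z = sum(t[(f, e)] for e in es0)    # sum_i t(f | e_i): O(l + m)
                for e in es0:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for f, e in count:           # M-step: renormalize per English word
            t[(f, e)] = count[(f, e)] / total[e]
    return t

pairs = [(['la', 'maison'], ['the', 'house']),
         (['la', 'fleur'], ['the', 'flower']),
         (['maison'], ['house'])]
t = train_model1(pairs)
```

Even on this three-sentence corpus, EM concentrates probability on the intuitively correct word pairs.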
Other References about MT
Chinese-English Sentence Alignment • 林語君 and 高照明, "A Hybrid Chinese-English Sentence Alignment Algorithm Combining Statistical and Linguistic Information", ROCLING XVI, 2004
• Problem: sentence boundaries in Chinese are not clear-cut; both the period and the comma can mark a sentence boundary.
• Method – Grouped sentence-alignment module • Two-stage iterative DP
– Sentence-alignment parameters • A broad set of dictionary translation words plus a stop list • Partial matching of Chinese translation words • Similarity of the sequences of important punctuation marks within sentences • Anchors from shared numerals, time expressions, and source-text words
– Group-alignment parameters • Sentence-length dissimilarity • Sentence-final punctuation marks are required to match
Bilingual Collocation Extraction Based on Syntactic and Statistical Analyses (Chien-Cheng Wu & Jason S. Chang, ROCLING’03)
• Preprocessing steps calculate the following information:
– Lists of preferred POS patterns of collocations in both languages
– Collocation candidates matching the preferred POS patterns
– N-gram statistics for both languages, N = 1, 2
– Log-likelihood ratio statistics for two consecutive words in both languages
– Log-likelihood ratio statistics for a pair of bilingual collocation candidates across the two languages
– Content-word alignment based on the Competitive Linking Algorithm (Melamed 1997)
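The log-likelihood-ratio statistic mentioned above (Dunning 1993) can be sketched as follows, with illustrative counts:

```python
import math

def _ll(k, n, x):
    """Binomial log-likelihood log L(k; n, x), clamped to avoid log(0)."""
    x = min(max(x, 1e-12), 1 - 1e-12)
    return k * math.log(x) + (n - k) * math.log(1 - x)

def llr(c12, c1, c2, n):
    """Dunning's log-likelihood ratio for consecutive words w1, w2:
    c12 = count(w1 w2), c1 = count(w1), c2 = count(w2), n = total bigrams."""
    p = c2 / n                  # P(w2) under independence
    p1 = c12 / c1               # P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)  # P(w2 | not w1)
    return 2.0 * (_ll(c12, c1, p1) + _ll(c2 - c12, n - c1, p2)
                  - _ll(c12, c1, p) - _ll(c2 - c12, n - c1, p))
```

A strongly associated bigram scores high; a bigram whose co-occurrence count matches the independence expectation scores zero.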
A Probability Model to Improve Word Alignment Colin Cherry & Dekang Lin, University of Alberta
Outline • Introduction • Probabilistic Word-Alignment Model • Word-Alignment Algorithm – Constraints – Features
Introduction • Word-aligned corpora are an excellent source of translation-related knowledge in statistical machine translation. – E.g., translation lexicons, transfer rules
• Word-alignment problem – Conventional approaches usually use co-occurrence models • E.g., φ² (Gale & Church 1991), log-likelihood ratio (Dunning 1993)
– Indirect association problem: Melamed (2000) proposed competitive linking along with an explicit noise model to solve it:

score_B(u, v) = log [ B(links(u,v) | cooc(u,v), λ⁺) / B(links(u,v) | cooc(u,v), λ⁻) ]

where B(· | ·, p) is a binomial likelihood and λ⁺, λ⁻ are the link probabilities for true translation pairs and for noise, respectively.
(Figure: "CISCO System Inc." ←→ 思科 系統, illustrating noise links from indirect association)
• Goal: to propose a probabilistic word-alignment model which allows easy integration of context-specific features.
Probabilistic Word-Alignment Model • Given E = e1, e2, …, em , F = f1, f2, …, fn – If ei and fj are a translation pair, then the link l(ei, fj) exists – If ei has no corresponding translation, then the null link l(ei, f0) exists – If fj has no corresponding translation, then the null link l(e0, fj) exists – An alignment A is a set of links such that every word in E and F participates in at least one link
• Alignment problem is to find alignment A to maximize P(A|E, F) • IBM’s translation model: maximize P(A, F|E)
Probabilistic Word-Alignment Model (Cont.) • Given A = {l1, l2, …, lt}, where lk = l(e_{ik}, f_{jk}), and writing l_i^j for the consecutive subset {l_i, l_{i+1}, …, l_j}:

P(A | E, F) = P(l_1^t | E, F) = ∏_{k=1}^{t} P(l_k | E, F, l_1^{k−1})

• Let C_k = {E, F, l_1^{k−1}} represent the context of l_k. Then

P(l_k | C_k) = P(l_k, C_k) / P(C_k)
             = P(C_k | l_k) P(l_k) / P(C_k)
             = P(C_k | l_k) P(l_k, e_{ik}, f_{jk}) / P(C_k, e_{ik}, f_{jk})
             = P(C_k | l_k) P(l_k | e_{ik}, f_{jk}) / P(C_k | e_{ik}, f_{jk})

using P(C_k, e_{ik}, f_{jk}) = P(C_k) P(e_{ik}, f_{jk} | C_k) with P(e_{ik}, f_{jk} | C_k) = 1, and P(l_k, e_{ik}, f_{jk}) = P(l_k) P(e_{ik}, f_{jk} | l_k) with P(e_{ik}, f_{jk} | l_k) = 1.
Probabilistic Word-Alignment Model (Cont.) • C_k = {E, F, l_1^{k−1}} is too complex to estimate • FT_k is a set of context-related features such that P(l_k | C_k) can be approximated by P(l_k | e_{ik}, f_{jk}, FT_k) • Let C_k′ = {e_{ik}, f_{jk}} ∪ FT_k. Then

P(l_k | C_k′) = P(l_k | e_{ik}, f_{jk}) · P(C_k′ | l_k) / P(C_k′ | e_{ik}, f_{jk})
             = P(l_k | e_{ik}, f_{jk}) · P(FT_k | l_k) / P(FT_k | e_{ik}, f_{jk})

• Assuming the features in FT_k are conditionally independent:

P(A | E, F) ≈ ∏_{k=1}^{t} [ P(l_k | e_{ik}, f_{jk}) · ∏_{ft ∈ FT_k} P(ft | l_k) / P(ft | e_{ik}, f_{jk}) ]
An Illustrative Example
Word-Alignment Algorithm • Input: E, F, TE – TE is E's dependency tree, which enables us to make use of features and constraints based on linguistic intuitions
• Constraints – One-to-one constraint: every word participates in exactly one link – Cohesion constraint: use TE to induce TF with no crossing dependencies
Word-Alignment Algorithm (Cont.) • Features – Adjacency features fta: for any word pair (ei, fj), if a link l(ei′, fj′) exists where −2 ≤ i′−i ≤ 2 and −2 ≤ j′−j ≤ 2, then fta(i−i′, j−j′, ei′) is active for this context.
– Dependency features ftd: for any word pair (ei, fj), let ei′ be the governor of ei, and let rel be the grammatical relationship between them. If a link l(ei′, fj′) exists, then ftd(j−j′, rel) is active for this context.
• Example: fta(−1, 1, host) and ftd(−1, det) are active for pair(the1, l′); ftd(−3, det) is active for pair(the1, les).
Experimental Results • Test bed: Hansard corpus – Training: 50K aligned pairs of sentences (Och & Ney 2000) – Testing: 500 pairs
Future Work • The alignment algorithm presented here is incapable of creating alignments that are not one-to-one; many-to-one alignments will be pursued. • The proposed model can create many-to-one alignments if null probabilities are added for the extra words on the "many" side.