Statistical Alignment and Machine Translation
Lecture 8: Statistical Alignment & Machine Translation (Chapter 13 of Manning & Schutze) Wen-Hsiang Lu (盧文祥) Department of Computer Science and Information Engineering, National Cheng Kung University 2014/05/12
(Slides from Berlin Chen, National Taiwan Normal University http://140.122.185.120/PastCourses/2004SNaturalLanguageProcessing/NLP_main_2004S.htm)
References:
1. Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., Mercer, R. L. The Mathematics of Statistical Machine Translation, Computational Linguistics, 1993.
2. Knight, K. Automating Knowledge Acquisition for Machine Translation, AI Magazine, 1997.
3. Gale, W. A. and Church, K. W. A Program for Aligning Sentences in Bilingual Corpora, Computational Linguistics, 1993.
Machine Translation (MT) • Definition – Automatic translation of text or speech from one language to another
• Goal – Produce close to error-free output that reads fluently in the target language – Current systems are still far from this goal
• Current Status – Existing systems still fall short in quality • Babel Fish Translation (before 2004) • Google Translation – A mix of probabilistic and non-probabilistic components
Issues • Build high-quality semantic-based MT systems in circumscribed domains • Abandon automatic MT, build software to assist human translators instead – Post-edit the output of a buggy translation
• Develop automatic knowledge acquisition techniques for improving general-purpose MT – Supervised or unsupervised learning
Different Strategies for MT (the translation pyramid, flattened from the diagram):
• Interlingua (knowledge representation) – knowledge-based translation
• English (semantic representation) ←→ French (semantic representation) – semantic transfer
• English (syntactic parse) ←→ French (syntactic parse) – syntactic transfer
• English Text (word string) ←→ French Text (word string) – word-for-word
Word for Word MT • Translate words one-by-one from one language to another – Problems 1. No one-to-one correspondence between words in different languages (lexical ambiguity) – Need to look at context larger than the individual word (→ phrase or clause) 2. Languages have different word orders
• Example: English "suit" maps to different French words depending on the meaning – lawsuit vs. set of garments
Syntactic Transfer MT • Parse the source text, transfer the parse tree of the source text into a syntactic tree in the target language, and then generate the translation from this syntactic tree – Solves the problem of word ordering – Problems • Syntactic ambiguity • The target syntax will likely mirror that of the source text
• Example of a structural mismatch – German: Ich esse gern (I like to eat); literal English: I eat readily/gladly
Text Alignment: Definition • Definition – Align paragraphs, sentences or words in one language to paragraphs, sentences or words in another language • Thus we can learn which words tend to be translated by which other words in the other language
→ bilingual dictionaries, MT, parallel grammars, … – Text alignment is not part of the MT process per se • But it is the obligatory first step for making use of multilingual text corpora
• Applications – Bilingual lexicography – Machine translation – Multilingual information retrieval – …
Text Alignment: Sources and Granularities • Sources of parallel texts or bitexts – Parliamentary proceedings (Hansards) – Newspapers and magazines – Religious and literary works (often with less literal translation)
• Two levels of alignment – Gross (large-scale) alignment • Learn which paragraphs or sentences correspond to which paragraphs or sentences in another language – Word alignment • Learn which words tend to be translated by which words in another language • The necessary step for acquiring a bilingual dictionary (The order of words or sentences might not be preserved.)
Text Alignment: Example 1
2:2 alignment
Text Alignment: Example 2
2:2 alignment
1:1 alignment 1:1 alignment
2:1 alignment
Each such group is a bead of the sentence alignment. Studies show that around 90% of alignments are 1:1 sentence alignments.
Sentence Alignment • Crossing dependencies are not allowed here – Word ordering is preserved!
• Related work
Sentence Alignment • Length-based
• Lexical-guided • Offset-based
Sentence Alignment Length-based method
• Rationale: short sentences tend to be translated as short sentences and long sentences as long sentences – Length is defined as the number of words or the number of characters
• Approach 1 (Gale & Church 1993)
Union Bank of Switzerland (UBS) corpus: English, French, and German
– Assumptions • The paragraph structure was clearly marked in the corpus; confusions were checked by hand • Lengths of sentences are measured in characters • Crossing dependencies are not handled here – The order of sentences is not changed in the translation (This ignores the rich lexical information available in the text.)
Sentence Alignment Length-based method
The source text S = s1 s2 … sI and the target text T = t1 t2 … tJ are partitioned into a sequence of beads B1, B2, …, BK; the possible alignment types for a bead are {1:1, 1:0, 0:1, 2:1, 1:2, 2:2, …}. Assuming probabilistic independence between beads:

arg max_A P(A | S, T) = arg max_A P(A, S, T) ≈ arg max_A ∏_{k=1}^{K} P(B_k)

where A = B1, B2, …, BK.
Sentence Alignment Length-based method
– Dynamic Programming • The cost function (Distance Measure)
cost(align(l1, l2)) = −log P(align | δ(l1, l2, c, s²))
                    = −log [ P(δ(l1, l2, c, s²) | align) · P(align) ]   (Bayes' law, dropping the constant P(δ))

where l1 and l2 are the character lengths of the two portions of text, c is the expected ratio l2/l1 of text lengths in the two languages, and the standardized length difference

δ = (l2 − l1·c) / sqrt(l1·s²)

is modeled as a standard normal variable, so P(δ | align) can be computed from the standard normal distribution as 2(1 − Φ(|δ|)).
• The sentence is the unit of alignment • Character lengths are modeled statistically
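As a concrete sketch of this cost, with illustrative values for c, s², and the bead priors (assumed for the example, not the values fitted by Gale & Church):

```python
import math

# Illustrative parameters (assumed, not fitted to any corpus):
C, S2 = 1.0, 6.8   # expected length ratio c and variance s^2

def delta(l1, l2):
    """Standardized length difference: (l2 - l1*c) / sqrt(l1 * s^2)."""
    return (l2 - l1 * C) / math.sqrt(l1 * S2)

def p_delta_given_align(d):
    """Two-sided tail probability under a standard normal: 2 * (1 - Phi(|d|))."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(d) / math.sqrt(2.0))))

# Hypothetical bead-type priors P(align).
PRIOR = {'1:1': 0.89, '1:0': 0.01, '0:1': 0.01, '2:1': 0.045, '1:2': 0.045}

def cost(l1, l2, bead):
    """cost = -log P(align) - log P(delta | align), as on the slide."""
    return -math.log(PRIOR[bead]) - math.log(p_delta_given_align(delta(l1, l2)) + 1e-300)

matched = cost(100, 103, '1:1')     # similar lengths -> low cost
mismatched = cost(100, 180, '1:1')  # very different lengths -> high cost
```

Aligning a 100-character sentence with a 103-character one is much cheaper than with a 180-character one, which is what drives the dynamic-programming search below.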
Sentence Alignment Length-based method
• The prior probability P(align) of each bead type
• Search with dynamic programming: D(i, j) is the lowest total cost of aligning s1…si with t1…tj:

D(i, j) = min { D(i, j−1) + cost(0:1 align(∅, tj)),
                D(i−1, j) + cost(1:0 align(si, ∅)),
                D(i−1, j−1) + cost(1:1 align(si, tj)),
                D(i−1, j−2) + cost(1:2 align(si, tj−1, tj)),
                D(i−2, j−1) + cost(2:1 align(si−1, si, tj)),
                D(i−2, j−2) + cost(2:2 align(si−1, si, tj−1, tj)) }
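This recurrence can be sketched as a short program; the cost function here is a simplified stand-in with made-up priors, so the structure of the DP, not the exact statistics, is the point:

```python
import math

def bead_cost(src_chars, tgt_chars, prior):
    """Toy cost: -log(prior) plus a squared length-difference penalty
    (a stand-in for the -log P(delta | align) term)."""
    return -math.log(prior) + (src_chars - tgt_chars) ** 2 / max(src_chars + tgt_chars, 1)

# bead type -> (source sentences consumed, target sentences consumed, prior)
BEADS = {'0:1': (0, 1, 0.01), '1:0': (1, 0, 0.01), '1:1': (1, 1, 0.89),
         '1:2': (1, 2, 0.045), '2:1': (2, 1, 0.045), '2:2': (2, 2, 0.01)}

def sentence_align(src, tgt):
    """D(i, j) = min over bead types, as in the recurrence on the slide.
    src and tgt are lists of sentence lengths in characters."""
    I, J = len(src), len(tgt)
    INF = float('inf')
    D = [[INF] * (J + 1) for _ in range(I + 1)]
    back = [[None] * (J + 1) for _ in range(I + 1)]
    D[0][0] = 0.0
    for i in range(I + 1):
        for j in range(J + 1):
            if D[i][j] == INF:
                continue
            # relax every bead that could start at cell (i, j)
            for di, dj, prior in BEADS.values():
                ni, nj = i + di, j + dj
                if ni > I or nj > J:
                    continue
                c = D[i][j] + bead_cost(sum(src[i:ni]), sum(tgt[j:nj]), prior)
                if c < D[ni][nj]:
                    D[ni][nj], back[ni][nj] = c, (i, j)
    # trace back the cheapest bead sequence
    beads, i, j = [], I, J
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((i - pi, j - pj))
        i, j = pi, pj
    return D[I][J], beads[::-1]

total, beads = sentence_align([20, 21, 15, 16], [20, 22, 33])
```

On four source sentences of lengths 20, 21, 15, 16 and three target sentences of lengths 20, 22, 33, the cheapest path is two 1:1 beads followed by a 2:1 bead.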
Sentence Alignment Length-based method
– A simple example: two candidate alignments of sentences s1…s4 with t1…t3
• Alignment 1: cost(align(s1, s2, t1)) + cost(align(s3, t2)) + cost(align(s4, t3))
• Alignment 2: cost(align(s1, t1)) + cost(align(s2, t2)) + cost(align(s3, Ø)) + cost(align(s4, t3))
Sentence Alignment Length-based method
– A 4% error rate was achieved – Problems: • Cannot handle noisy and imperfect input – E.g., OCR output or files containing unknown markup conventions – Finding paragraph or sentence boundaries is difficult – Solution: just align text (position) offsets in the two parallel texts (Church 1993) • Questionable for languages with few cognates or different writing systems – E.g., English ←→ Chinese, Eastern European languages ←→ Asian languages
Sentence Alignment Lexical method
• Approach 1 (Kay and Röscheisen 1993) – First assume the first and last sentences of the texts are aligned, as the initial anchors – Form an envelope of possible alignments • Alignments are excluded when their sentences cross anchors or their respective distances from an anchor differ greatly – Choose word pairs whose distributions are similar in most of the sentences – Find pairs of source and target sentences which contain many possible lexical correspondences • The most reliable pairs are used to induce a set of partial alignments (added to the list of anchors) • Iterate
Word Alignment • The sentence/offset alignment can be extended to a word alignment • Some criteria are then used to select aligned word pairs to include them into the bilingual dictionary – Frequency of word correspondences – Association measures – ….
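One simple association measure is the Dice coefficient over co-occurrence counts; the counts below are made up for illustration:

```python
def dice(cooc, count_s, count_t):
    """Dice association between a source word and a target word:
    2 * #co-occurrences / (sum of the words' individual counts), in [0, 1]."""
    return 2.0 * cooc / (count_s + count_t)

# Hypothetical counts over aligned sentence pairs:
likely_pair = dice(40, 50, 52)   # words that co-occur almost whenever either appears
chance_pair = dice(3, 50, 60)    # words that co-occur only incidentally
```

Word pairs whose association score clears a threshold are the ones worth adding to the bilingual dictionary.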
Statistical Machine Translation • The noisy channel model:

e → [ Translation Model P(f|e) ] → f → [ Decoder ] → ê

• Language model: P(e); translation model: P(f|e); decoder: ê = arg max_e P(e|f)
• e: English, f: French; |e| = l, |f| = m; each French word fj is aligned to an English word e_{a_j}
– Assumptions: • An English word can be aligned with multiple French words while each French word is aligned with at most one English word • Independence of the individual word-to-word translations
Statistical Machine Translation • Three important components involved
– Decoder: ê = arg max_e P(e|f) = arg max_e [ P(e) P(f|e) / P(f) ] = arg max_e P(e) P(f|e)
– Language model • Gives the probability P(e)
– Translation model:

P(f|e) = (1/Z) Σ_{a1=0}^{l} … Σ_{am=0}^{l} ∏_{j=1}^{m} P(fj | e_{a_j})

summing over all possible alignments, where Z is a normalization constant, e_{a_j} is the English word that the French word fj is aligned with, and P(fj | e_{a_j}) is a translation probability.
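For this model the sum over alignments factors per French word, which a brute-force sketch can verify (the translation table t and the sentences are made up; Z is taken to be (l+1)^m, one common choice):

```python
from itertools import product

def p_brute(f, e, t):
    """P(f|e) = (1/Z) * sum over all (l+1)^m alignments of prod_j t(f_j | e_{a_j})."""
    e0 = ['NULL'] + e                            # position 0 is the empty word
    l, m = len(e), len(f)
    total = 0.0
    for a in product(range(l + 1), repeat=m):    # one a_j per French word
        p = 1.0
        for j in range(m):
            p *= t.get((f[j], e0[a[j]]), 0.0)
        total += p
    return total / (l + 1) ** m

def p_factored(f, e, t):
    """The same quantity, factored per word: prod_j (1/(l+1)) * sum_i t(f_j | e_i)."""
    e0 = ['NULL'] + e
    p = 1.0
    for fj in f:
        p *= sum(t.get((fj, ei), 0.0) for ei in e0) / (len(e) + 1)
    return p

# Hypothetical translation table t(f | e):
t = {('la', 'the'): 0.7, ('la', 'NULL'): 0.2,
     ('maison', 'house'): 0.8, ('maison', 'the'): 0.1}
f, e = ['la', 'maison'], ['the', 'house']
```

The factored form turns an exponential sum into an O(l·m) product, which is what makes Model 1 tractable.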
An Introduction to Statistical Machine Translation Dept. of CSIE, NCKU Yao-Sheng Chang Date: 2011.04.12
Introduction • Translation involves many cultural aspects – We consider only the translation of individual sentences, and merely acceptable sentences.
• Every sentence in one language is a possible translation of any sentence in the other – Assign (S,T) a probability, Pr(T|S), to be the probability that a translator will produce T in the target language when presented with S in the source language.
Statistical Machine Translation (SMT)
• Noisy channel problem
Fundamental of SMT • Given a string of French f, the job of our translation system is to find the string e that the speaker had in mind when producing f. By Bayes' theorem: ê = arg max_e Pr(e|f) = arg max_e Pr(e) Pr(f|e) / Pr(f)
• Since the denominator Pr(f) is a constant, the best e is the one with the greatest Pr(e) Pr(f|e).
Practical Challenges • Computation of the translation model Pr(f|e) • Computation of the language model Pr(e) • Decoding (i.e., searching for the e that maximizes Pr(f|e) Pr(e))
Alignment of case 1 (one-to-many)
Alignment of case 2 (many-to-one)
Alignment of case 3 (many-to-many)
Formulation of Alignment (1) • Let e = e1^l = e1 e2 … el and f = f1^m = f1 f2 … fm
• An alignment between a pair of strings e and f is a mapping that connects every word fj to some word ei:
e = e1 e2 … ei … el
f = f1 f2 … fj … fm, with aj = i, so that e_{a_j} = ei is the English word generating fj
• In other words, an alignment a between e and f says that each word fj, 1 ≤ j ≤ m, is generated by the word e_{a_j}, with aj ∈ {0, 1, …, l}
• There are (l+1)^m different alignments between e and f (including the null word, for words with no mapping)
Translation Model
• The alignment a can be represented by a series a1^m = a1 a2 … am of m values, each between 0 and l, such that if the word in position j of the French string is connected to the word in position i of the English string then aj = i, and if it is not connected to any English word then aj = 0 (null).
IBM Model I (1)
IBM Model I (2) • The alignment is determined by specifying the values of aj for j from 1 to m, each of which can take any value from 0 to l.
Constrained Maximization • We wish to adjust the translation probabilities so as to maximize Pr(f|e) subject to the constraint that, for each e, Σ_f t(f|e) = 1
Lagrange Multipliers (1) • Method of Lagrange multipliers, with one constraint – If there is a maximum or minimum subject to the constraint g(x, y) = 0, then it will occur at one of the critical numbers of the function F defined by

F(x, y, λ) = f(x, y) + λ g(x, y)

– f(x, y) is called the objective function – g(x, y) = 0 is called the constraint equation – F(x, y, λ) is called the Lagrange function – λ is called the Lagrange multiplier
Lagrange Multipliers (2) • Following standard practice for constrained maximization, we introduce Lagrange multipliers λe and seek an unconstrained extremum of the auxiliary function

h(t, λ) = Pr(f|e) − Σ_e λe ( Σ_f t(f|e) − 1 )
Derivation (1) • The partial derivative of h with respect to t(f|e) is
• where δ is the Kronecker delta function, equal to one when both of its arguments are the same and equal to zero otherwise
Derivation (2)
• We call the expected number of times that e connects to f in the translation (f|e) the count of f given e for (f|e), and denote it by c(f|e; f, e). By definition,
Derivation (3) • Replacing λe by λe Pr(f|e), Equation (11) can be written very compactly as
• In practice, our training data consists of a set of translations (f⁽¹⁾|e⁽¹⁾), (f⁽²⁾|e⁽²⁾), …, (f⁽ˢ⁾|e⁽ˢ⁾), so this equation becomes
Derivation (4) • Rearranging the sums yields an expression that can be evaluated efficiently.
Derivation (5) • Thus, the number of operations necessary to calculate a count is proportional to l + m, rather than to (l + 1)^m as in Equation (12)
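Putting the pieces together, the EM updates for t(f|e) can be sketched as follows; the E-step uses the efficient O(l + m) form t(f|e) / Σ_i t(f|e_i) of the expected count, and the tiny corpus is made up:

```python
from collections import defaultdict

def train_model1(pairs, iterations=20):
    """EM training of IBM Model 1 translation probabilities t(f|e).
    pairs: list of (French tokens, English tokens)."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f|e; f, e), summed over pairs
        total = defaultdict(float)   # normalizer per English word
        for fs, es in pairs:
            es0 = ['NULL'] + es      # include the empty word
            for f in fs:
                z = sum(t[(f, e)] for e in es0)    # sum_i t(f | e_i): O(l + m)
                for e in es0:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for f, e in count:           # M-step: renormalize per English word
            t[(f, e)] = count[(f, e)] / total[e]
    return t

pairs = [(['la', 'maison'], ['the', 'house']),
         (['la', 'fleur'], ['the', 'flower']),
         (['maison'], ['house'])]
t = train_model1(pairs)
```

Even on this three-sentence corpus, EM concentrates probability on the intuitively correct word pairs.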
Other References about MT
Chinese-English Sentence Alignment • 林語君 and 高照明, "A Hybrid Chinese-English Sentence Alignment Algorithm Combining Statistical and Linguistic Information", ROCLING XVI, 2004
• Problem: sentence boundaries in Chinese are not clear-cut; both the period and the comma can mark a sentence boundary.
• Method – Grouped sentence-alignment module • Two-stage iterative DP
– Sentence-alignment parameters • A broad set of dictionary translation words plus a stop list • Partial matching of Chinese translation words • Similarity of the sequences of important punctuation marks within sentences • Anchors from shared numerals, time expressions, and source-text words
– Group-alignment parameters • Sentence-length dissimilarity • Sentence-final punctuation marks are required to match
Bilingual Collocation Extraction Based on Syntactic and Statistical Analyses (Chien-Cheng Wu & Jason S. Chang, ROCLING’03)
• Preprocessing steps calculate the following information:
– Lists of preferred POS patterns of collocations in both languages
– Collocation candidates matching the preferred POS patterns
– N-gram statistics for both languages, N = 1, 2
– Log-likelihood ratio statistics for two consecutive words in both languages
– Log-likelihood ratio statistics for a pair of bilingual collocation candidates across the two languages
– Content-word alignment based on the Competitive Linking Algorithm (Melamed 1997)
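The log-likelihood-ratio statistic mentioned above (Dunning 1993) can be sketched as follows, with illustrative counts:

```python
import math

def _ll(k, n, x):
    """Binomial log-likelihood log L(k; n, x), clamped to avoid log(0)."""
    x = min(max(x, 1e-12), 1 - 1e-12)
    return k * math.log(x) + (n - k) * math.log(1 - x)

def llr(c12, c1, c2, n):
    """Dunning's log-likelihood ratio for consecutive words w1, w2:
    c12 = count(w1 w2), c1 = count(w1), c2 = count(w2), n = total bigrams."""
    p = c2 / n                  # P(w2) under independence
    p1 = c12 / c1               # P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)  # P(w2 | not w1)
    return 2.0 * (_ll(c12, c1, p1) + _ll(c2 - c12, n - c1, p2)
                  - _ll(c12, c1, p) - _ll(c2 - c12, n - c1, p))
```

A strongly associated bigram scores high; a bigram whose co-occurrence count matches the independence expectation scores zero.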
A Probability Model to Improve Word Alignment Colin Cherry & Dekang Lin, University of Alberta
Outline • Introduction • Probabilistic Word-Alignment Model • Word-Alignment Algorithm – Constraints – Features
Introduction • Word-aligned corpora are an excellent source of translation-related knowledge in statistical machine translation. – E.g., translation lexicons, transfer rules
• Word-alignment problem – Conventional approaches usually use co-occurrence models • E.g., φ² (Gale & Church 1991), log-likelihood ratio (Dunning 1993)
– Indirect association problem: Melamed (2000) proposed competitive linking along with an explicit noise model to solve it:

score_B(u, v) = log [ B(links(u,v) | cooc(u,v), λ⁺) / B(links(u,v) | cooc(u,v), λ⁻) ]

where B(· | ·, p) is a binomial likelihood and λ⁺, λ⁻ are the link probabilities for true translation pairs and for noise, respectively.
(Figure: "CISCO System Inc." ←→ 思科 系統, illustrating noise links from indirect association)
• Goal: to propose a probabilistic word-alignment model which allows easy integration of context-specific features.
Probabilistic Word-Alignment Model • Given E = e1, e2, …, em , F = f1, f2, …, fn – If ei and fj are a translation pair, then the link l(ei, fj) exists – If ei has no corresponding translation, then the null link l(ei, f0) exists – If fj has no corresponding translation, then the null link l(e0, fj) exists – An alignment A is a set of links such that every word in E and F participates in at least one link
• Alignment problem is to find alignment A to maximize P(A|E, F) • IBM’s translation model: maximize P(A, F|E)
Probabilistic Word-Alignment Model (Cont.) • Given A = {l1, l2, …, lt}, where lk = l(e_{ik}, f_{jk}), and writing l_i^j for the consecutive subset {l_i, l_{i+1}, …, l_j}:

P(A | E, F) = P(l_1^t | E, F) = ∏_{k=1}^{t} P(l_k | E, F, l_1^{k−1})

• Let C_k = {E, F, l_1^{k−1}} represent the context of l_k. Then

P(l_k | C_k) = P(l_k, C_k) / P(C_k)
             = P(C_k | l_k) P(l_k) / P(C_k)
             = P(C_k | l_k) P(l_k, e_{ik}, f_{jk}) / P(C_k, e_{ik}, f_{jk})
             = P(C_k | l_k) P(l_k | e_{ik}, f_{jk}) / P(C_k | e_{ik}, f_{jk})

using P(C_k, e_{ik}, f_{jk}) = P(C_k) P(e_{ik}, f_{jk} | C_k) with P(e_{ik}, f_{jk} | C_k) = 1, and P(l_k, e_{ik}, f_{jk}) = P(l_k) P(e_{ik}, f_{jk} | l_k) with P(e_{ik}, f_{jk} | l_k) = 1.
Probabilistic Word-Alignment Model (Cont.) • C_k = {E, F, l_1^{k−1}} is too complex to estimate • FT_k is a set of context-related features such that P(l_k | C_k) can be approximated by P(l_k | e_{ik}, f_{jk}, FT_k) • Let C_k′ = {e_{ik}, f_{jk}} ∪ FT_k. Then

P(l_k | C_k′) = P(l_k | e_{ik}, f_{jk}) · P(C_k′ | l_k) / P(C_k′ | e_{ik}, f_{jk})
             = P(l_k | e_{ik}, f_{jk}) · P(FT_k | l_k) / P(FT_k | e_{ik}, f_{jk})

• Assuming the features in FT_k are conditionally independent:

P(A | E, F) ≈ ∏_{k=1}^{t} [ P(l_k | e_{ik}, f_{jk}) · ∏_{ft ∈ FT_k} P(ft | l_k) / P(ft | e_{ik}, f_{jk}) ]
An Illustrative Example
Word-Alignment Algorithm • Input: E, F, TE – TE is E's dependency tree, which enables us to make use of features and constraints based on linguistic intuitions
• Constraints – One-to-one constraint: every word participates in exactly one link – Cohesion constraint: use TE to induce TF with no crossing dependencies
Word-Alignment Algorithm (Cont.) • Features – Adjacency features fta: for any word pair (ei, fj), if a link l(ei′, fj′) exists where −2 ≤ i′−i ≤ 2 and −2 ≤ j′−j ≤ 2, then fta(i−i′, j−j′, ei′) is active for this context.
– Dependency features ftd: for any word pair (ei, fj), let ei′ be the governor of ei, and let rel be the grammatical relationship between them. If a link l(ei′, fj′) exists, then ftd(j−j′, rel) is active for this context.
• Example: fta(−1, 1, host) and ftd(−1, det) are active for pair(the1, l′); ftd(−3, det) is active for pair(the1, les).
Experimental Results • Test bed: Hansard corpus – Training: 50K aligned pairs of sentences (Och & Ney 2000) – Testing: 500 pairs
Future Work • The alignment algorithm presented here is incapable of creating alignments that are not one-to-one; many-to-one alignments will be pursued. • The proposed model can create many-to-one alignments if null probabilities are added for the extra words on the "many" side.