Neural Network Language Model
Application of RNNs to Language Processing
Andrey Malinin, Shixiang Gu
CUED Division F Speech Group
Overview
• Language Modelling
• Machine Translation
Language Modelling Problem
• Aim is to calculate the probability of a sequence (sentence) P(X)
• Can be decomposed into a product of conditional probabilities of tokens (words):
P(X) = ∏i P(xi | x0, …, xi-1)
• In practice, only a finite context is used (see the sketch below)
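A minimal sketch of this decomposition in Python; `cond_prob` is a stand-in for any model of P(xi | history):

```python
import math

def sentence_log_prob(tokens, cond_prob):
    """Score a sentence as a sum of conditional log-probabilities.

    cond_prob(word, history) -> P(word | history); any language model
    (n-gram, neural, ...) can be plugged in here.
    """
    log_p = 0.0
    for i, word in enumerate(tokens):
        log_p += math.log(cond_prob(word, tokens[:i]))
    return log_p
```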
N-Gram Language Model
• N-grams estimate word conditional probabilities via counting (sketched below):
P(xi | xi-N+1, …, xi-1) ≈ count(xi-N+1, …, xi) / count(xi-N+1, …, xi-1)
• Sparse (alleviated by back-off, but not entirely)
• Doesn't exploit word similarity
• Finite context
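A minimal count-based trigram estimator, with no smoothing or back-off (which the sparsity point above calls for); it plugs directly into `sentence_log_prob` from the previous sketch:

```python
from collections import Counter

def train_trigram(sentences):
    """Estimate P(w_i | w_{i-2}, w_{i-1}) by relative-frequency counting."""
    tri, bi = Counter(), Counter()
    for sent in sentences:
        padded = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(padded)):
            hist = (padded[i - 2], padded[i - 1])
            tri[hist + (padded[i],)] += 1
            bi[hist] += 1

    def cond_prob(word, history):
        hist = tuple((["<s>", "<s>"] + list(history))[-2:])
        # Unseen histories/trigrams get probability 0 here -- the sparsity
        # problem that back-off and smoothing only partly alleviate.
        return tri[hist + (word,)] / bi[hist] if bi[hist] else 0.0

    return cond_prob
```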
Neural Network Language Model Y. Bengio et al., JMLR’03
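The model of Bengio et al. (JMLR'03) maps a fixed window of context words through a shared embedding matrix to a softmax over the vocabulary. A minimal sketch with random weights, omitting bias terms and the paper's direct input-to-output connections:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, n = 10_000, 64, 128, 3                   # vocab, embed dim, hidden dim, order
C = rng.normal(scale=0.1, size=(V, d))            # shared word feature (embedding) matrix
H = rng.normal(scale=0.1, size=(h, (n - 1) * d))  # concatenated context -> hidden
U = rng.normal(scale=0.1, size=(V, h))            # hidden -> output scores

def nnlm_probs(context_ids):
    """P(next word | previous n-1 words), for every word in the vocabulary."""
    x = C[context_ids].reshape(-1)       # look up and concatenate context embeddings
    hidden = np.tanh(H @ x)
    scores = U @ hidden
    e = np.exp(scores - scores.max())    # softmax over the full vocabulary
    return e / e.sum()

print(nnlm_probs([5, 42]).shape)         # (10000,)
```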
Limitation of Neural Network Language Model
• Sparsity – Solved
• Word Similarity – Solved
• Finite Context – Not solved
• Computational Complexity – Softmax over the full vocabulary
Recurrent Neural Network Language Model [X. Liu, et al.]
Wall Street Journal Results – T. Mikolov, 2010
Limitation of RNN Language Model
• Sparsity – Solved!
• Word Similarity -> Sentence Similarity – Solved!
• Finite Context – Solved? Not quite…
• Softmax is still computationally complex (see the sketch below)
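A sketch of a single RNN LM step (random weights, assumed shapes): the recurrent state carries the whole history rather than a fixed window, but every step still pays for a softmax over the full vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 10_000, 64, 128                # vocab, embedding dim, hidden dim
E = rng.normal(scale=0.1, size=(V, d))   # input embeddings
W = rng.normal(scale=0.1, size=(h, h))   # recurrent weights
U = rng.normal(scale=0.1, size=(h, d))   # input -> hidden
O = rng.normal(scale=0.1, size=(V, h))   # hidden -> output scores

def rnnlm_step(state, word_id):
    """One step: the new state summarises the entire history so far."""
    state = np.tanh(W @ state + U @ E[word_id])
    scores = O @ state
    e = np.exp(scores - scores.max())    # softmax over the FULL vocabulary:
    return state, e / e.sum()            # still the computational bottleneck

state = np.zeros(h)
for w in [3, 17, 256]:                   # toy word-id sequence
    state, probs = rnnlm_step(state, w)
```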
Lattice Rescoring with RNNs
• Applying RNNs to lattices expands the search space
• Lattice is expanded to a prefix tree or N-best list
• Impractical to apply to large lattices
• Approximate lattice expansion (sketched below) – expand only if:
• N-gram history is different
• RNN history vector distance exceeds a threshold
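A sketch of that expansion test, assuming a hypothetical node structure of (word_history, rnn_state); paths are merged only when the RNN LM would treat them as equivalent:

```python
import numpy as np

def should_expand(node_a, node_b, threshold=0.1, ngram=4):
    """Expand (keep separate) two lattice paths unless they are equivalent
    for the RNN LM. Each node is assumed to be (word_history, rnn_state)."""
    hist_a, state_a = node_a
    hist_b, state_b = node_b
    if hist_a[-(ngram - 1):] != hist_b[-(ngram - 1):]:
        return True                                       # different n-gram histories
    return np.linalg.norm(state_a - state_b) > threshold  # states too far apart
```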
Overview
• Language Modelling
• Machine Translation
Machine Translation Task
• Translate a source sentence E into a target sentence F
• Can be formulated in the noisy-channel framework: F' = argmaxF[ P(F|E) ] = argmaxF[ P(E|F) · P(F) ]
• P(F) is just a language model – we still need to estimate P(E|F) (illustrated below)
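A toy illustration of the decomposition over an assumed candidate list (real decoders search a vastly larger space); `trans_log_prob` and `lm_log_prob` are hypothetical scoring functions:

```python
def best_translation(source, candidates, trans_log_prob, lm_log_prob):
    """Noisy-channel choice over a candidate list:
    F' = argmax_F  log P(E|F) + log P(F)."""
    return max(candidates,
               key=lambda f: trans_log_prob(source, f) + lm_log_prob(f))
```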
Previous Approaches: Word Alignment
W. Byrne, 4F11
• Use IBM Models 1-5 to create initial word alignments of increasing complexity and accuracy from sentence pairs.
• Make conditional independence assumptions to separate out sentence length, alignment and translation models.
• Bootstrap using simpler models to initialize more complex models.
Previous Approaches: Phrase Based SMT
W. Byrne, 4F11
• Using the IBM word alignments, create phrase alignments and a phrase translation model.
• Parameters estimated by Maximum Likelihood or EM.
• Apply a Synchronous Context-Free Grammar to learn hierarchical rules over phrases.
Problems with Previous Approaches
• Highly memory intensive
• Initial alignment makes conditional independence assumptions
• Word and phrase translation models only count co-occurrences of surface forms – don't take word similarity into account
• Highly non-trivial to decode hierarchical phrase-based translation:
• word alignments + lexical reordering model
• language model
• phrase translations
• parse of a synchronous context-free grammar over the text
• The components are very different from one another
Neural Machine Translation
• The translation problem is expressed as a probability P(F|E)
• Equivalent to P(fn, fn-1, …, f0 | em, em-1, …, e0) -> a sequence conditioned on another sequence
• Create an RNN architecture where the output of one RNN (the decoder) is conditioned on another RNN (the encoder)
• The two can be connected by a joint alignment and translation mechanism
• Results in a single, unified machine translation model which can generate candidate translations (skeleton sketched below)
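A skeleton of that conditioning structure; `encode` and `decode_step` are placeholders fleshed out in the sketches on the following slides, and `eos` is an assumed end-of-sentence word id:

```python
def translate(source_ids, encode, decode_step, max_len=50, eos=1):
    """Encoder-decoder skeleton: each decoder output is conditioned on the
    encoder annotations via the alignment mechanism inside decode_step."""
    enc_states = encode(source_ids)              # annotations h_0 .. h_N
    state, word, out = None, None, []
    while len(out) < max_len and word != eos:
        state, word = decode_step(state, word, enc_states)
        out.append(word)
    return out
```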
Bi-Directional RNNs
Neural Machine Translation: Encoder
[Figure: bidirectional encoder – input embeddings e0 … eN are mapped to annotation vectors h0 … hN]
• Can be pre-trained as a bidirectional RNN language model (see the sketch below)
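A sketch of the bidirectional encoder, assuming `step_f`/`step_b` are forward and backward RNN step functions; each annotation hj concatenates a forward summary of e0..ej and a backward summary of eN..ej, so it sees the whole sentence:

```python
import numpy as np

def bi_encoder(embeds, step_f, step_b, h0):
    """Return annotations h_0..h_N for input embeddings e_0..e_N."""
    fwd, state = [], h0
    for e in embeds:                              # left-to-right pass
        state = step_f(state, e)
        fwd.append(state)
    bwd, state = [], h0
    for e in reversed(embeds):                    # right-to-left pass
        state = step_b(state, e)
        bwd.append(state)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```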
Neural Machine Translation: Decoder
[Figure: decoder RNN – states s0 … sM emit target words f0 … fM]
• ft is produced by sampling from the discrete distribution produced by the softmax output layer (sketched below)
• Can be pre-trained as an RNN language model
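A minimal sampling step over an already-computed softmax output:

```python
import numpy as np

def sample_next_word(probs, rng=None):
    """Draw f_t from the softmax distribution over the target vocabulary
    (greedy argmax or beam search are the usual test-time alternatives)."""
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(probs), p=probs))
```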
Neural Machine Translation: Joint Alignment
[Figure: joint alignment – the previous decoder state st-1 and the encoder annotations h0 … hN produce alignment weights at,1:N and a context vector Ct]
zj = W · tanh(V · st-1 + U · hj)
at,1:N = softmax(z1:N)
Ct = ∑j at,j hj
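A sketch of those three equations with assumed shapes (`s_prev`: previous decoder state, `H`: stacked encoder annotations, `W`/`V`/`U`: alignment-model weights):

```python
import numpy as np

def attention_context(s_prev, H, W, V, U):
    """Joint alignment: s_prev is (h_dec,), H is (N, h_enc),
    W is (k,), V is (k, h_dec), U is (k, h_enc)."""
    z = np.tanh(H @ U.T + V @ s_prev) @ W     # z_j = W . tanh(V s_{t-1} + U h_j)
    a = np.exp(z - z.max())
    a /= a.sum()                              # a_{t,1:N} = softmax(z_{1:N})
    return a @ H                              # C_t = sum_j a_{t,j} h_j

N, h_enc, h_dec, k = 6, 8, 8, 10              # toy dimensions
rng = np.random.default_rng(0)
c = attention_context(rng.normal(size=h_dec), rng.normal(size=(N, h_enc)),
                      rng.normal(size=k), rng.normal(size=(k, h_dec)),
                      rng.normal(size=(k, h_enc)))
print(c.shape)                                # (8,)
```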
Neural Machine Translation: Features
• End-to-end differentiable, trained using SGD with a cross-entropy error function (loss sketched below)
• Encoder and decoder learn to represent source and target sentences in a compact, distributed manner
• Does not make conditional independence assumptions to separate out translation model, alignment model, re-ordering model, etc.
• Does not pre-align words by bootstrapping from simpler models
• Learns translation and joint alignment in a semantic space, not over surface forms
• Conceptually easy to decode – complexity similar to speech processing, not SMT
• Fewer parameters – more memory efficient
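The training criterion, assuming per-step softmax outputs `prob_seqs` aligned with the reference word ids `target_ids`:

```python
import numpy as np

def cross_entropy(prob_seqs, target_ids):
    """Sum of per-token cross-entropies over the target sentence; gradients
    flow end-to-end through decoder, alignment and encoder."""
    return -sum(np.log(p[t]) for p, t in zip(prob_seqs, target_ids))
```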
NMT BLEU results on English to French Translation
D. Bahdanau, K. Cho, Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate.
Conclusion
• RNNs and LSTM RNNs have been widely applied to a large variety of language processing tasks
• State of the art in language modelling
• Competitive performance on new tasks such as machine translation
• A quickly evolving field
Bibliography
• W. Byrne. Engineering Part IIB: Module 4F11 Speech and Language Processing, Lecture 12. http://mi.eng.cam.ac.uk/~pcw/local/4F11/4F11_2014_lect12.pdf
• D. Bahdanau, K. Cho, Y. Bengio. "Neural Machine Translation by Jointly Learning to Align and Translate." 2014.
• Y. Bengio, et al. "A Neural Probabilistic Language Model." Journal of Machine Learning Research, No. 3 (2003).
• X. Liu, et al. "Efficient Lattice Rescoring Using Recurrent Neural Network Language Models." In: Proceedings of IEEE ICASSP 2014.
• T. Mikolov. "Statistical Language Models Based on Neural Networks." PhD thesis, Brno University of Technology, Faculty of Information Technology, Department of Computer Graphics and Multimedia, 2012.