Neural Network Language Model
Application of RNNs to Language Processing
Andrey Malinin, Shixiang Gu
CUED Division F Speech Group
Overview
• Language Modelling
• Machine Translation
Language Modelling Problem
• Aim is to calculate the probability of a sequence (sentence) P(X)
• Can be decomposed into a product of conditional probabilities of tokens (words):
P(X) = ∏i P(xi | x0, …, xi-1)
• In practice, only a finite context is used (see the sketch below)
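A minimal sketch of this decomposition in Python; `cond_prob` is a stand-in for any model of P(xi | history):

```python
import math

def sentence_log_prob(tokens, cond_prob):
    """Score a sentence as a sum of conditional log-probabilities.

    cond_prob(word, history) -> P(word | history); any language model
    (n-gram, neural, ...) can be plugged in here.
    """
    log_p = 0.0
    for i, word in enumerate(tokens):
        log_p += math.log(cond_prob(word, tokens[:i]))
    return log_p
```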
N-Gram Language Model
• N-grams estimate word conditional probabilities via counting (sketched below):
P(xi | xi-N+1, …, xi-1) ≈ count(xi-N+1, …, xi) / count(xi-N+1, …, xi-1)
• Sparse (alleviated by back-off, but not entirely)
• Doesn't exploit word similarity
• Finite context
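A minimal count-based trigram estimator, with no smoothing or back-off (which the sparsity point above calls for); it plugs directly into `sentence_log_prob` from the previous sketch:

```python
from collections import Counter

def train_trigram(sentences):
    """Estimate P(w_i | w_{i-2}, w_{i-1}) by relative-frequency counting."""
    tri, bi = Counter(), Counter()
    for sent in sentences:
        padded = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(padded)):
            hist = (padded[i - 2], padded[i - 1])
            tri[hist + (padded[i],)] += 1
            bi[hist] += 1

    def cond_prob(word, history):
        hist = tuple((["<s>", "<s>"] + list(history))[-2:])
        # Unseen histories/trigrams get probability 0 here -- the sparsity
        # problem that back-off and smoothing only partly alleviate.
        return tri[hist + (word,)] / bi[hist] if bi[hist] else 0.0

    return cond_prob
```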
Neural Network Language Model Y. Bengio et al., JMLR’03
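The model of Bengio et al. (JMLR'03) maps a fixed window of context words through a shared embedding matrix to a softmax over the vocabulary. A minimal sketch with random weights, omitting bias terms and the paper's direct input-to-output connections:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, n = 10_000, 64, 128, 3                   # vocab, embed dim, hidden dim, order
C = rng.normal(scale=0.1, size=(V, d))            # shared word feature (embedding) matrix
H = rng.normal(scale=0.1, size=(h, (n - 1) * d))  # concatenated context -> hidden
U = rng.normal(scale=0.1, size=(V, h))            # hidden -> output scores

def nnlm_probs(context_ids):
    """P(next word | previous n-1 words), for every word in the vocabulary."""
    x = C[context_ids].reshape(-1)       # look up and concatenate context embeddings
    hidden = np.tanh(H @ x)
    scores = U @ hidden
    e = np.exp(scores - scores.max())    # softmax over the full vocabulary
    return e / e.sum()

print(nnlm_probs([5, 42]).shape)         # (10000,)
```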
Limitation of Neural Network Language Model
• Sparsity – Solved
• Word Similarity – Solved
• Finite Context – Not solved
• Computational Complexity – Softmax over the full vocabulary
Recurrent Neural Network Language Model [X. Liu, et al.]
Wall Street Journal Results – T. Mikolov, 2010
Limitation of RNN Language Model
• Sparsity – Solved!
• Word Similarity -> Sentence Similarity – Solved!
• Finite Context – Solved? Not quite…
• Softmax is still computationally complex (see the sketch below)
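A sketch of a single RNN LM step (random weights, assumed shapes): the recurrent state carries the whole history rather than a fixed window, but every step still pays for a softmax over the full vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 10_000, 64, 128                # vocab, embedding dim, hidden dim
E = rng.normal(scale=0.1, size=(V, d))   # input embeddings
W = rng.normal(scale=0.1, size=(h, h))   # recurrent weights
U = rng.normal(scale=0.1, size=(h, d))   # input -> hidden
O = rng.normal(scale=0.1, size=(V, h))   # hidden -> output scores

def rnnlm_step(state, word_id):
    """One step: the new state summarises the entire history so far."""
    state = np.tanh(W @ state + U @ E[word_id])
    scores = O @ state
    e = np.exp(scores - scores.max())    # softmax over the FULL vocabulary:
    return state, e / e.sum()            # still the computational bottleneck

state = np.zeros(h)
for w in [3, 17, 256]:                   # toy word-id sequence
    state, probs = rnnlm_step(state, w)
```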
Lattice Rescoring with RNNs
• Applying RNNs to lattices expands the search space
• Lattice is expanded to a prefix tree or N-best list
• Impractical to apply to large lattices
• Approximate lattice expansion (sketched below) – expand only if:
• N-gram history is different
• RNN history vector distance exceeds a threshold
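A sketch of that expansion test, assuming a hypothetical node structure of (word_history, rnn_state); paths are merged only when the RNN LM would treat them as equivalent:

```python
import numpy as np

def should_expand(node_a, node_b, threshold=0.1, ngram=4):
    """Expand (keep separate) two lattice paths unless they are equivalent
    for the RNN LM. Each node is assumed to be (word_history, rnn_state)."""
    hist_a, state_a = node_a
    hist_b, state_b = node_b
    if hist_a[-(ngram - 1):] != hist_b[-(ngram - 1):]:
        return True                                       # different n-gram histories
    return np.linalg.norm(state_a - state_b) > threshold  # states too far apart
```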
Overview
• Language Modelling
• Machine Translation
Machine Translation Task
• Translate a source sentence E into a target sentence F
• Can be formulated in the noisy-channel framework: F' = argmaxF[ P(F|E) ] = argmaxF[ P(E|F) · P(F) ]
• P(F) is just a language model – we still need to estimate P(E|F) (illustrated below)
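A toy illustration of the decomposition over an assumed candidate list (real decoders search a vastly larger space); `trans_log_prob` and `lm_log_prob` are hypothetical scoring functions:

```python
def best_translation(source, candidates, trans_log_prob, lm_log_prob):
    """Noisy-channel choice over a candidate list:
    F' = argmax_F  log P(E|F) + log P(F)."""
    return max(candidates,
               key=lambda f: trans_log_prob(source, f) + lm_log_prob(f))
```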
Previous Approaches: Word Alignment
W. Byrne, 4F11
• Use IBM Models 1-5 to create initial word alignments of increasing complexity and accuracy from sentence pairs.
• Make conditional independence assumptions to separate out sentence length, alignment and translation models.
• Bootstrap using simpler models to initialize more complex models.
Previous Approaches: Phrase Based SMT
W. Byrne, 4F11
• Using the IBM word alignments, create phrase alignments and a phrase translation model.
• Parameters estimated by Maximum Likelihood or EM.
• Apply a Synchronous Context-Free Grammar to learn hierarchical rules over phrases.
Problems with Previous Approaches
• Highly memory intensive
• Initial alignment makes conditional independence assumptions
• Word and phrase translation models only count co-occurrences of surface forms – don't take word similarity into account
• Highly non-trivial to decode hierarchical phrase-based translation:
• word alignments + lexical reordering model
• language model
• phrase translations
• parse of a synchronous context-free grammar over the text
• The components are very different from one another
Neural Machine Translation
• The translation problem is expressed as a probability P(F|E)
• Equivalent to P(fn, fn-1, …, f0 | em, em-1, …, e0) -> a sequence conditioned on another sequence
• Create an RNN architecture where the output of one RNN (the decoder) is conditioned on another RNN (the encoder)
• The two can be connected by a joint alignment and translation mechanism
• Results in a single, unified machine translation model which can generate candidate translations (skeleton sketched below)
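A skeleton of that conditioning structure; `encode` and `decode_step` are placeholders fleshed out in the sketches on the following slides, and `eos` is an assumed end-of-sentence word id:

```python
def translate(source_ids, encode, decode_step, max_len=50, eos=1):
    """Encoder-decoder skeleton: each decoder output is conditioned on the
    encoder annotations via the alignment mechanism inside decode_step."""
    enc_states = encode(source_ids)              # annotations h_0 .. h_N
    state, word, out = None, None, []
    while len(out) < max_len and word != eos:
        state, word = decode_step(state, word, enc_states)
        out.append(word)
    return out
```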
Bi-Directional RNNs
Neural Machine Translation: Encoder
[Figure: bidirectional encoder – input embeddings e0 … eN are mapped to annotation vectors h0 … hN]
• Can be pre-trained as a bidirectional RNN language model (see the sketch below)
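A sketch of the bidirectional encoder, assuming `step_f`/`step_b` are forward and backward RNN step functions; each annotation hj concatenates a forward summary of e0..ej and a backward summary of eN..ej, so it sees the whole sentence:

```python
import numpy as np

def bi_encoder(embeds, step_f, step_b, h0):
    """Return annotations h_0..h_N for input embeddings e_0..e_N."""
    fwd, state = [], h0
    for e in embeds:                              # left-to-right pass
        state = step_f(state, e)
        fwd.append(state)
    bwd, state = [], h0
    for e in reversed(embeds):                    # right-to-left pass
        state = step_b(state, e)
        bwd.append(state)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```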
Neural Machine Translation: Decoder
[Figure: decoder RNN – states s0 … sM emit target words f0 … fM]
• ft is produced by sampling from the discrete distribution produced by the softmax output layer (sketched below)
• Can be pre-trained as an RNN language model
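A minimal sampling step over an already-computed softmax output:

```python
import numpy as np

def sample_next_word(probs, rng=None):
    """Draw f_t from the softmax distribution over the target vocabulary
    (greedy argmax or beam search are the usual test-time alternatives)."""
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(probs), p=probs))
```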
Neural Machine Translation: Joint Alignment
[Figure: joint alignment – the previous decoder state st-1 and the encoder annotations h0 … hN produce alignment weights at,1:N and a context vector Ct]
zj = W · tanh(V · st-1 + U · hj)
at,1:N = softmax(z1:N)
Ct = ∑j at,j hj
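A sketch of those three equations with assumed shapes (`s_prev`: previous decoder state, `H`: stacked encoder annotations, `W`/`V`/`U`: alignment-model weights):

```python
import numpy as np

def attention_context(s_prev, H, W, V, U):
    """Joint alignment: s_prev is (h_dec,), H is (N, h_enc),
    W is (k,), V is (k, h_dec), U is (k, h_enc)."""
    z = np.tanh(H @ U.T + V @ s_prev) @ W     # z_j = W . tanh(V s_{t-1} + U h_j)
    a = np.exp(z - z.max())
    a /= a.sum()                              # a_{t,1:N} = softmax(z_{1:N})
    return a @ H                              # C_t = sum_j a_{t,j} h_j

N, h_enc, h_dec, k = 6, 8, 8, 10              # toy dimensions
rng = np.random.default_rng(0)
c = attention_context(rng.normal(size=h_dec), rng.normal(size=(N, h_enc)),
                      rng.normal(size=k), rng.normal(size=(k, h_dec)),
                      rng.normal(size=(k, h_enc)))
print(c.shape)                                # (8,)
```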
Neural Machine Translation: Features
• End-to-end differentiable, trained using SGD with a cross-entropy error function (loss sketched below)
• Encoder and decoder learn to represent source and target sentences in a compact, distributed manner
• Does not make conditional independence assumptions to separate out translation model, alignment model, re-ordering model, etc.
• Does not pre-align words by bootstrapping from simpler models
• Learns translation and joint alignment in a semantic space, not over surface forms
• Conceptually easy to decode – complexity similar to speech processing, not SMT
• Fewer parameters – more memory efficient
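The training criterion, assuming per-step softmax outputs `prob_seqs` aligned with the reference word ids `target_ids`:

```python
import numpy as np

def cross_entropy(prob_seqs, target_ids):
    """Sum of per-token cross-entropies over the target sentence; gradients
    flow end-to-end through decoder, alignment and encoder."""
    return -sum(np.log(p[t]) for p, t in zip(prob_seqs, target_ids))
```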
NMT BLEU results on English to French Translation
D. Bahdanau, K. Cho, Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate.
Conclusion
• RNNs and LSTM RNNs have been widely applied to a large variety of language processing tasks
• State of the art in language modelling
• Competitive performance on new tasks such as machine translation
• A quickly evolving field
Bibliography
• W. Byrne. Engineering Part IIB: Module 4F11 Speech and Language Processing, Lecture 12. http://mi.eng.cam.ac.uk/~pcw/local/4F11/4F11_2014_lect12.pdf
• D. Bahdanau, K. Cho, Y. Bengio. "Neural Machine Translation by Jointly Learning to Align and Translate." 2014.
• Y. Bengio, et al. "A Neural Probabilistic Language Model." Journal of Machine Learning Research, No. 3 (2003).
• X. Liu, et al. "Efficient Lattice Rescoring Using Recurrent Neural Network Language Models." In: Proceedings of IEEE ICASSP 2014.
• T. Mikolov. "Statistical Language Models Based on Neural Networks." PhD thesis, Brno University of Technology, Faculty of Information Technology, Department of Computer Graphics and Multimedia, 2012.