Neural Network Language Model

Application of RNNs to Language Processing
Andrey Malinin, Shixiang Gu

CUED Division F Speech Group

Overview

• Language Modelling
• Machine Translation

Language Modelling Problem

• Aim is to calculate the probability of a sequence (sentence) P(X)
• Can be decomposed into a product of conditional probabilities of tokens (words):
  P(X) = P(x_1) · P(x_2 | x_1) · … · P(x_T | x_1, …, x_{T-1})

• In practice, only a finite context is used
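
A minimal Python sketch of this decomposition, using made-up conditional probabilities purely for illustration (the tokens and values are hypothetical, not the output of any real model):

import math

# Hypothetical conditional probabilities P(x_t | history) for the sentence "the cat sat".
cond_probs = {
    ("<s>",): {"the": 0.20},
    ("<s>", "the"): {"cat": 0.05},
    ("<s>", "the", "cat"): {"sat": 0.10},
}

def sentence_log_prob(tokens):
    # Chain rule: sum log P(x_t | x_1 .. x_{t-1}) over the sentence.
    history = ("<s>",)
    log_p = 0.0
    for tok in tokens:
        log_p += math.log(cond_probs[history][tok])
        history = history + (tok,)
    return log_p

print(math.exp(sentence_log_prob(["the", "cat", "sat"])))  # 0.2 * 0.05 * 0.1 = 0.001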

N-Gram Language Model

• N-Grams estimate word conditional probabilities via counting:
  P(x_t | x_{t-N+1}, …, x_{t-1}) ≈ count(x_{t-N+1}, …, x_t) / count(x_{t-N+1}, …, x_{t-1})

• Sparse (alleviated by back-off, but not entirely)
• Doesn't exploit word similarity
• Finite context
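
A minimal sketch of the counting estimate for a 2-gram (bigram) model over a toy corpus; in practice the counts come from very large corpora and are smoothed or backed off:

from collections import Counter

# Toy corpus of two tokenised sentences.
corpus = [
    "<s> the cat sat on the mat </s>".split(),
    "<s> the dog sat on the rug </s>".split(),
]

bigram_counts = Counter()
history_counts = Counter()
for sent in corpus:
    for w1, w2 in zip(sent, sent[1:]):
        bigram_counts[(w1, w2)] += 1
        history_counts[w1] += 1

def p_bigram(word, history):
    # Maximum-likelihood estimate: count(history, word) / count(history).
    if history_counts[history] == 0:
        return 0.0  # unseen history: the sparsity problem; real systems smooth or back off
    return bigram_counts[(history, word)] / history_counts[history]

print(p_bigram("cat", "the"))  # count(the, cat) = 1, count(the) = 4 -> 0.25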

Neural Network Language Model [Y. Bengio et al., JMLR'03]

Limitation of Neural Network Language Model

• Sparsity – Solved
• Word Similarity – Solved
• Finite Context – Not solved
• Computational Complexity – Softmax over the full vocabulary
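
A minimal NumPy sketch of the feed-forward architecture of Bengio et al.: the previous n-1 words are mapped to shared embeddings, concatenated, passed through a tanh hidden layer and a softmax over the whole vocabulary. Sizes and weights below are toy values, not those of the paper:

import numpy as np

rng = np.random.default_rng(0)
V, d, n_context, H = 1000, 32, 3, 64      # toy vocabulary, embedding size, context length, hidden units

C = rng.normal(scale=0.1, size=(V, d))    # shared word-embedding matrix (learned in the real model)
W_h = rng.normal(scale=0.1, size=(H, n_context * d))
W_o = rng.normal(scale=0.1, size=(V, H))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nnlm_next_word_probs(context_word_ids):
    # Embed and concatenate the context, apply a tanh hidden layer,
    # then a softmax over all V words – the expensive step noted above.
    x = np.concatenate([C[i] for i in context_word_ids])
    h = np.tanh(W_h @ x)
    return softmax(W_o @ h)

probs = nnlm_next_word_probs([12, 7, 301])
print(probs.shape, probs.sum())           # (1000,) 1.0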

Recurrent Neural Network Language Model [X. Liu, et al.]

Wall Street Journal Results [T. Mikolov, Google, 2010]

Limitation of RNN Language Model

• Sparsity – Solved!
• Word Similarity -> Sentence Similarity – Solved!
• Finite Context – Solved? Not quite…
• Softmax output layer is still computationally complex
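
A minimal sketch of one step of a simple recurrent LM: the hidden state is updated from the current word and the previous state, so the usable context is in principle unbounded, but every step still needs a softmax over the full vocabulary. Sizes and weights are toy values:

import numpy as np

rng = np.random.default_rng(1)
V, d, H = 1000, 32, 64                    # toy vocabulary, embedding and hidden sizes

E = rng.normal(scale=0.1, size=(V, d))    # word embeddings
W_x = rng.normal(scale=0.1, size=(H, d))
W_h = rng.normal(scale=0.1, size=(H, H))
W_o = rng.normal(scale=0.1, size=(V, H))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnnlm_step(word_id, h_prev):
    # The hidden state carries the whole history seen so far.
    h = np.tanh(W_x @ E[word_id] + W_h @ h_prev)
    p_next = softmax(W_o @ h)             # full softmax over V words: the remaining bottleneck
    return p_next, h

h = np.zeros(H)
for w in [12, 7, 301]:                    # feed a toy word-id sequence
    p_next, h = rnnlm_step(w, h)
print(p_next.argmax())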

Lattice Rescoring with RNNs

• Applying RNNs to lattices expands the search space
• The lattice is expanded to a prefix tree or N-best list
• Impractical to apply to large lattices
• Approximate Lattice Expansion – expand a node only if:
  • the N-gram history is different
  • the RNN history vector distance exceeds a threshold
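
A sketch of the kind of merge test this implies; the data structures and the threshold name gamma are invented for illustration, and this is not the exact recipe of Liu et al.:

import numpy as np

def can_merge(path_a, path_b, ngram_order=4, gamma=0.1):
    # Two partial lattice paths may share an expanded node only if neither expansion
    # condition fires: same truncated N-gram history AND close RNN history vectors.
    hist_a = tuple(path_a["words"][-(ngram_order - 1):])
    hist_b = tuple(path_b["words"][-(ngram_order - 1):])
    if hist_a != hist_b:
        return False                                   # different N-gram history: expand
    dist = np.linalg.norm(path_a["h"] - path_b["h"])   # distance between RNN history vectors
    return dist <= gamma                               # merge only if below the threshold, else expand

a = {"words": ["the", "cat", "sat"], "h": np.array([0.1, 0.2, 0.3])}
b = {"words": ["a", "cat", "sat"],   "h": np.array([0.1, 0.2, 0.4])}
print(can_merge(a, b))   # False: the 4-gram histories differ, so the paths are kept apart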

Overview

• Language Modelling
• Machine Translation

Machine Translation Task

• Translate a source sentence E into a target sentence F
• Can be formulated in the Noisy-Channel Framework:
  F' = argmax_F [P(F | E)] = argmax_F [P(E | F) · P(F)]

• P(F) is just a language model – need to estimate P(E|F).

Previous Approaches: Word Alignment

W. Byrne, 4F11

• Use IBM Models 1-5 to create initial word alignments of increasing complexity and accuracy from sentence pairs.
• Make conditional independence assumptions to separate out sentence length, alignment and translation models.

• Bootstrap using simpler models to initialize more complex models.

Previous Approaches: Phrase Based SMT

W. Byrne, 4F11

• Using IBM word alignments, create phrase alignments and a phrase translation model.
• Parameters estimated by Maximum Likelihood or EM.
• Apply a Synchronous Context Free Grammar to learn hierarchical rules over phrases.

Problems with Previous Approaches

• Highly Memory Intensive
• Initial alignment makes conditional independence assumptions
• Word and Phrase translation models only count co-occurrences of surface forms – don't take word similarity into account
• Highly non-trivial to decode hierarchical phrase based translation:
  • word alignments + lexical reordering model
  • language model
  • phrase translations
  • parsing a synchronous context free grammar over the text
• The components are very different from one another

Neural Machine Translation

• The translation problem is expressed as a probability P(F | E)
• Equivalent to P(f_n, f_{n-1}, …, f_0 | e_m, e_{m-1}, …, e_0) -> a sequence conditioned on another sequence
• Create an RNN architecture where the output of one RNN (the decoder) is conditioned on another RNN (the encoder)
• The two can be connected using a joint alignment and translation mechanism
• Results in a single gestalt Machine Translation model which can generate candidate translations

Bi-Directional RNNs

Neural Machine Translation: Encoder

[Figure: encoder – a bi-directional RNN reads the source words e_0 … e_N and produces one hidden-state annotation per word, h_0 … h_N]

• Can be pre-trained as a bi-directional RNN language model (see the sketch below)
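
A minimal sketch of such an encoder with toy sizes and random weights (a real model would use gated units and be trained jointly with the decoder): a forward and a backward RNN run over the source embeddings and their states are concatenated into the annotations h_j.

import numpy as np

rng = np.random.default_rng(2)
d, H = 16, 24                             # toy embedding and hidden sizes

Wf_x, Wf_h = rng.normal(scale=0.1, size=(H, d)), rng.normal(scale=0.1, size=(H, H))
Wb_x, Wb_h = rng.normal(scale=0.1, size=(H, d)), rng.normal(scale=0.1, size=(H, H))

def encode(source_embeddings):
    # Each annotation h_j concatenates a forward and a backward state,
    # so it summarises the whole source sentence centred on word j.
    N = len(source_embeddings)
    fwd, bwd = [None] * N, [None] * N
    h = np.zeros(H)
    for j in range(N):                    # left-to-right pass
        h = np.tanh(Wf_x @ source_embeddings[j] + Wf_h @ h)
        fwd[j] = h
    h = np.zeros(H)
    for j in reversed(range(N)):          # right-to-left pass
        h = np.tanh(Wb_x @ source_embeddings[j] + Wb_h @ h)
        bwd[j] = h
    return [np.concatenate([fwd[j], bwd[j]]) for j in range(N)]

src = [rng.normal(size=d) for _ in range(5)]   # stand-ins for the embeddings e_0 .. e_4
annotations = encode(src)
print(len(annotations), annotations[0].shape)  # 5 (48,)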

Neural Machine Translation: Decoder



[Figure: decoder – an RNN with states s_0 … s_M emits the target words f_0 … f_M one at a time]

• f_t is produced by sampling from the discrete probability distribution given by the softmax output layer (see the sketch below)
• Can be pre-trained as an RNN language model
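
A minimal sketch of one decoder step under these assumptions, with toy sizes, random weights and a fixed stand-in context vector (real systems use gated units and typically beam search rather than pure sampling):

import numpy as np

rng = np.random.default_rng(3)
V, d, H, C_dim = 800, 16, 24, 48          # toy target vocabulary, embedding, state and context sizes

E = rng.normal(scale=0.1, size=(V, d))
W_f = rng.normal(scale=0.1, size=(H, d))
W_s = rng.normal(scale=0.1, size=(H, H))
W_c = rng.normal(scale=0.1, size=(H, C_dim))
W_out = rng.normal(scale=0.1, size=(V, H))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(prev_word_id, s_prev, context):
    # Update the state from the previous output word, the previous state and the context,
    # then sample f_t from the discrete softmax distribution.
    s = np.tanh(W_f @ E[prev_word_id] + W_s @ s_prev + W_c @ context)
    p = softmax(W_out @ s)
    f_t = rng.choice(V, p=p)
    return f_t, s

s = np.zeros(H)
c = rng.normal(size=C_dim)                # random vector standing in for the encoder context
f, s = decoder_step(0, s, c)              # word id 0 plays the role of the start token
print(f)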

Neural Machine Translation: Joint Alignment

[Figure: joint alignment – scores z_0 … z_N over the encoder annotations h_0 … h_N are computed from the previous decoder state s_{t-1}, normalised into weights a_{t,1:N}, and combined into a context vector C_t for decoder state s_t]

z_j = W · tanh(V · s_{t-1} + U · h_j)
C_t = ∑_j a_{t,j} · h_j
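
A minimal sketch of this alignment computation, with the weights a_{t,1:N} obtained by normalising the scores z_j with a softmax as in Bahdanau et al. (sizes and parameters below are toy values):

import numpy as np

rng = np.random.default_rng(4)
H_enc, H_dec, A = 48, 24, 32              # toy annotation, decoder-state and alignment-layer sizes

W = rng.normal(scale=0.1, size=A)         # scoring vector (written W on the slide)
V = rng.normal(scale=0.1, size=(A, H_dec))
U = rng.normal(scale=0.1, size=(A, H_enc))

def attention(s_prev, annotations):
    # Score each annotation h_j against the previous decoder state, normalise the scores,
    # and return the context vector C_t = sum_j a_{t,j} * h_j together with the weights.
    z = np.array([W @ np.tanh(V @ s_prev + U @ h_j) for h_j in annotations])
    a = np.exp(z - z.max())
    a = a / a.sum()
    c_t = sum(w * h_j for w, h_j in zip(a, annotations))
    return c_t, a

annotations = [rng.normal(size=H_enc) for _ in range(5)]   # stand-ins for h_0 .. h_4
c_t, a = attention(np.zeros(H_dec), annotations)
print(a.round(3), c_t.shape)              # five weights summing to 1, context of size (48,)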

Neural Machine Translation: Features

• End-to-end differentiable, trained using SGD with a cross-entropy error function (see the sketch below)
• Encoder and Decoder learn to represent source and target sentences in a compact, distributed manner
• Does not make conditional independence assumptions to separate out translation model, alignment model, re-ordering model, etc.
• Does not pre-align words by bootstrapping from simpler models
• Learns translation and joint alignment in a semantic space, not over surface forms
• Conceptually easy to decode – complexity similar to speech processing, not SMT
• Fewer parameters – more memory efficient
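
A minimal sketch of the cross-entropy criterion from the first bullet: the loss is the negative log-probability the model assigns to each reference target word, and because every component is differentiable it can be minimised end to end with SGD (the probabilities below are made up):

import numpy as np

def cross_entropy_loss(predicted_probs, target_ids):
    # Negative log-probability of each reference word, summed over the sentence.
    return -sum(np.log(p[t]) for p, t in zip(predicted_probs, target_ids))

# Toy model distributions over a 4-word vocabulary for a 3-word reference translation.
probs = [np.array([0.7, 0.1, 0.1, 0.1]),
         np.array([0.2, 0.6, 0.1, 0.1]),
         np.array([0.1, 0.1, 0.1, 0.7])]
print(cross_entropy_loss(probs, [0, 1, 3]))   # -(log 0.7 + log 0.6 + log 0.7) ≈ 1.22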

NMT BLEU results on English to French Translation

D. Bahdanau, K. Cho, Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate.

Conclusion

• RNNs and LSTM RNNs have been widely applied to a large range of language processing tasks
• State of the art in language modelling
• Competitive performance on new tasks

• Quickly evolving.

Bibliography

• W. Byrne. Engineering Part IIB: Module 4F11 Speech and Language Processing, Lecture 12. http://mi.eng.cam.ac.uk/~pcw/local/4F11/4F11_2014_lect12.pdf
• D. Bahdanau, K. Cho, Y. Bengio. "Neural Machine Translation by Jointly Learning to Align and Translate". 2014.
• Y. Bengio, et al. "A Neural Probabilistic Language Model". Journal of Machine Learning Research, No. 3 (2003).
• X. Liu, et al. "Efficient Lattice Rescoring using Recurrent Neural Network Language Models". In: Proceedings of IEEE ICASSP 2014.
• T. Mikolov. "Statistical Language Models Based on Neural Networks". PhD Thesis, Brno University of Technology, Faculty of Information Technology, Department of Computer Graphics and Multimedia, 2012.
