Linguistic Regularities in Sparse and Explicit Word Representations
Omer Levy and Yoav Goldberg. CoNLL 2014.

A distance measure between words? d(w_i, w_j)
• Document clustering
• Machine translation
• Information retrieval
• Question answering
• …

Vector representation?

Vector Representation
• Explicit
• Continuous ~ Embeddings ~ Neural

Russia is to Moscow as Japan is to Tokyo

Figures from (Mikolov et al., NAACL 2013) and (Turian et al., ACL 2010)

Continuous vector representations
• Interesting properties: directional similarity
• Analogy between words: woman − man ≈ queen − king, i.e. king − man + woman ≈ queen
• Older continuous vector representations: Latent Semantic Analysis, Latent Dirichlet Allocation (Blei, Ng, Jordan, 2003), etc.
• A recent result (Levy and Goldberg, NIPS 2014) mathematically showed that the two representations are "almost" equivalent.

Figure from (Mikolov et al., NAACL 2013)

Goals of the paper
• Analogy: a is to a* as b is to b*
• With simple vector arithmetic: a − a* = b − b*
• Given 3 words: a is to a* as b is to ?
• Can we solve analogy problems with the explicit representation?
• Compare the performance of the
  • Continuous (neural) representation
  • Explicit representation
• Baroni et al. (ACL 2014) showed that embeddings outperform explicit representations.

Explicit representation
• Term-context vector:

                 computer   data   pinch   result   sugar
  apricot            0        0      1        0       1
  pineapple          0        0      1        0       1
  digital            2        1      0        1       0
  information        1        6      0        4       0

• Apricot: (computer: 0, data: 0, pinch: 1, …) ∈ ℝ^|Vocab|, with |Vocab| ≈ 100,000
• Sparse!
• Pointwise Mutual Information:
  PMI(word, context) = log₂ [ P(word, context) / (P(word) · P(context)) ]
• Positive PMI: replace the negative PMI values with zero.

Continuous vector representations
• Example: Italy: (−7.35, 9.42, 0.88, …) ∈ ℝ^100
• Continuous values and dense
• How to create? Example: skip-gram model (Mikolov et al., arXiv:1301.3781):

  L = (1/T) Σ_{t=1}^{T} Σ_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)

  p(w_i | w_j) = exp(v_i · v_j) / Σ_k exp(v_i · v_k)

• Problem: the denominator sums over every word k in the vocabulary.
• Hard to calculate the gradient.
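
A minimal numpy sketch of this softmax, written with the slide's form of the denominator; the vocabulary size, dimensionality, and random vectors are illustrative stand-ins, not values from the paper. The point is that the denominator touches every word in the vocabulary.

```python
import numpy as np

V, d = 100_000, 100                      # illustrative vocabulary size and dimension
rng = np.random.default_rng(0)
vectors = rng.normal(scale=0.1, size=(V, d))

def p_softmax(i, j, vecs):
    """p(w_i | w_j) = exp(v_i . v_j) / sum_k exp(v_i . v_k)."""
    scores = vecs @ vecs[i]              # v_i . v_k for every word k: the expensive O(|Vocab|) part
    scores -= scores.max()               # shift for numerical stability (the ratio is unchanged)
    probs = np.exp(scores)
    probs /= probs.sum()
    return probs[j]

print(p_softmax(42, 7, vectors))
```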

Formulating analogy objective
• Given 3 words: a is to a* as b is to ?
• Cosine similarity: cos(a, b) = (a · b) / (‖a‖ ‖b‖)
• Prediction: arg max_{b*} cos(b*, b − a + a*)
• If using normalized vectors:
  arg max_{b*} cos(b*, b) − cos(b*, a) + cos(b*, a*)
• Alternative? Instead of adding similarities, multiply them (see the sketch below):
  arg max_{b*} [cos(b*, b) · cos(b*, a*)] / cos(b*, a)
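
A sketch of both objectives with numpy; the function name, the word-to-row-index `vocab` dict, and the small eps guard are my own illustrative choices rather than the paper's implementation.

```python
import numpy as np

def solve_analogy(a, a_star, b, vectors, vocab, objective="add", eps=1e-6):
    """Return the word b* completing 'a is to a* as b is to ?'."""
    norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    cos_b  = norm @ norm[vocab[b]]         # cos(b*, b) for every candidate b*
    cos_a  = norm @ norm[vocab[a]]         # cos(b*, a)
    cos_as = norm @ norm[vocab[a_star]]    # cos(b*, a*)
    if objective == "add":
        scores = cos_b - cos_a + cos_as                # additive objective
    else:
        scores = (cos_b * cos_as) / (cos_a + eps)      # multiplicative objective
    for w in (a, a_star, b):               # never return one of the three query words
        scores[vocab[w]] = -np.inf
    best = int(np.argmax(scores))
    return next(w for w, i in vocab.items() if i == best)
```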

Corpora
• MSR: ~8,000 syntactic analogies
• Google: ~19,000 syntactic and semantic analogies

  a        b        a*        b*
  good     better   rough     ?
  better   good     rougher   ?
  good     best     rough     ?
  …        …        …         …

• Learn different representations from the same corpus: very important!!
• Recent controversies over inaccurate evaluations for "GloVe word representations" (Pennington et al., EMNLP 2014)
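
As a small illustration of how accuracy over such analogy questions can be computed, here is a sketch that plugs any solver (for example, solve_analogy from the earlier sketch) into an evaluation loop; the toy questions and the "+er" solver are made up for illustration.

```python
def analogy_accuracy(questions, solve):
    """Fraction of (a, a*, b, b*) questions where the solver returns b*."""
    correct = sum(solve(a, a_star, b) == b_star for a, a_star, b, b_star in questions)
    return correct / len(questions)

toy_questions = [("good", "better", "rough", "rougher"),
                 ("good", "best", "rough", "roughest")]
print(analogy_accuracy(toy_questions, lambda a, a_star, b: b + "er"))  # -> 0.5
```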

Embedding vs Explicit (Round 1)

Accuracy:
              MSR    Google
  Embedding   54%    63%
  Explicit    29%    45%

Many analogies recovered by explicit, but many more by embedding.

Using multiplicative form
• Given 3 words: a is to a* as b is to ?
• Cosine similarity: cos(a, b) = (a · b) / (‖a‖ ‖b‖)
• Additive objective:
  arg max_{b*} cos(b*, b) − cos(b*, a) + cos(b*, a*)
• Multiplicative objective:
  arg max_{b*} [cos(b*, b) · cos(b*, a*)] / cos(b*, a)

A relatively weak justification
• England − London + Baghdad = ?  (expected: Iraq)

  cos(Iraq, England)  − cos(Iraq, London)  + cos(Iraq, Baghdad)  = 0.15 − 0.13 + 0.63 = 0.65
  cos(Mosul, England) − cos(Mosul, London) + cos(Mosul, Baghdad) = 0.13 − 0.14 + 0.75 = 0.74

• The additive objective lets one large similarity (to Baghdad) dominate the sum, so it returns Mosul instead of Iraq.

Embedding vs Explicit (Round 2)

Accuracy, additive vs multiplicative objective:
              Embedding          Explicit
              MSR    Google      MSR    Google
  Add         54%    63%         29%    45%
  Mul         59%    67%         57%    68%

Accuracy, multiplicative objective only:
              MSR    Google
  Embedding   59%    67%
  Explicit    57%    68%

Summary
• On analogies, the continuous (neural) representation is not magical.
• Analogies are possible in the explicit representation.
• On analogies, the explicit representation can be as good as the continuous (neural) representation.
• The objective (function) matters!

Agreement between representations

  Dataset   Both Correct   Both Wrong   Embedding Correct   Explicit Correct
  MSR       43.97%         28.06%       15.12%              12.85%
  Google    57.12%         22.17%       9.59%               11.12%

Comparison of objectives

  Objective        Representation   MSR      Google
  Additive         Embedding        53.98%   62.70%
  Additive         Explicit         29.04%   45.05%
  Multiplicative   Embedding        59.09%   66.72%
  Multiplicative   Explicit         56.83%   68.24%

Recurrent Neural Network
• Recurrent neural networks (Mikolov et al., NAACL 2013)
• Input-output relations:
  y(t) = g(V s(t))
  s(t) = f(W s(t−1) + U w(t))
• y(t) ∈ ℝ^|Vocab|
• w(t) ∈ ℝ^|Vocab|
• s(t) ∈ ℝ^d
• U ∈ ℝ^(|Vocab|×d)
• W ∈ ℝ^(d×d)
• V ∈ ℝ^(d×|Vocab|)
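
A minimal sketch of one step of this recurrence. The slide leaves f and g abstract; here f = tanh and g = softmax are illustrative choices, the sizes and random weights are stand-ins, and the transposes simply keep the slide's matrix shapes while mapping one-hot input to hidden state to vocabulary scores.

```python
import numpy as np

vocab_size, d = 10_000, 100                        # illustrative sizes
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(vocab_size, d))    # U in R^(|Vocab| x d)
W = rng.normal(scale=0.1, size=(d, d))             # W in R^(d x d)
V = rng.normal(scale=0.1, size=(d, vocab_size))    # V in R^(d x |Vocab|)

def rnn_step(s_prev, word_index):
    w = np.zeros(vocab_size)
    w[word_index] = 1.0                            # one-hot input word w(t)
    s = np.tanh(W @ s_prev + U.T @ w)              # s(t) = f(W s(t-1) + U w(t)), f = tanh here
    scores = V.T @ s                               # V s(t)
    y = np.exp(scores - scores.max())
    y /= y.sum()                                   # y(t) = g(V s(t)), g = softmax over the vocabulary
    return s, y

s, y = rnn_step(np.zeros(d), word_index=42)
```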

Continuous vector representations

  L = (1/T) Σ_{t=1}^{T} Σ_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)

  p(w_i | w_j) = exp(v_i · v_j) / Σ_k exp(v_i · v_k)

• Problem: the denominator sums over every word in the vocabulary; hard to calculate the gradient.
• Change the objective:
  p(w_i | w_j) = σ(v_i · v_j) = 1 / (1 + exp(−v_i · v_j))
• Has a trivial solution (push all dot products to +∞)!
• Introduce artificial negative instances (negative sampling):
  L_ij = log σ(v_i · v_j) + Σ_{l=1}^{k} E_{w_l ∼ P_n(w)} [log σ(−v_l · v_j)]
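
A minimal sketch of this negative-sampling term, with the expectation replaced by k words drawn from a noise distribution P_n; the vectors, the noise distribution, and the function name are illustrative, not the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_i, v_j, vectors, noise_probs, k=5, rng=None):
    """log sigma(v_i . v_j) plus k sampled negative terms log sigma(-v_l . v_j)."""
    rng = rng or np.random.default_rng(0)
    loss = np.log(sigmoid(v_i @ v_j))                        # observed (positive) pair
    negatives = rng.choice(len(vectors), size=k, p=noise_probs)
    for l in negatives:                                      # k artificial negative instances
        loss += np.log(sigmoid(-vectors[l] @ v_j))
    return loss
```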

References
Slides from: http://levyomer.wordpress.com/2014/04/25/linguistic-regularities-in-sparse-and-explicit-word-representations/
Mikolov, Tomas, et al. "Distributed Representations of Words and Phrases and their Compositionality." Advances in Neural Information Processing Systems. 2013.
Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." HLT-NAACL. 2013.
