Concepts Identification from Queries and Its Application for Search

January 9, 2018 | Author: Anonymous | Category: Math, Statistics And Probability

Fuchun Peng, Microsoft Bing, 7/23/2010

 

A query is often treated as a bag of words.
But when people formulate queries, they use “concepts” as building blocks.

Example: simmons college’s sports psychology (course)

Q: simmons college sports psychology
A1: “simmons college”, “sports psychology”
A2: “college sports”

• Can we automatically segment the query to recover the concepts?

 

Summary of Segmentation approaches
Use for Improving Search Relevance
◦ Query rewriting
◦ Ranking features
Conclusions



Supervised learning (Bergsma et al, EMNLP-CoNLL07)
◦ Binary decision at each possible segmentation point
◦ Features: POS, web counts, the, and, …

[Figure: words w1 w2 w3 w4 w5 with a binary boundary decision at each of the four gaps, in order: N, Y, N, Y]

• Problem:
– Limited-range context
– Features specifically designed for noun phrases



Manual Data Preparation
◦ Linguistic driven: [San jose international airport]
◦ Relevance driven: [San jose] [international airport]

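MI

MI(w1,w2) = P(w1w2) / (P(w1)P(w2))

[Figure: MI of each adjacent word pair in w1 w2 w3 w4 w5, compared against a threshold. Pairs (3,4), (1,2) and (4,5) lie above the threshold; pair (2,3) lies below it.]

Insert segment boundary: w1w2 | w3w4w5
Iterative update

• Problem:
– only captures short-range correlation (between adjacent words)
– What about “my heart will go on”?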
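The thresholding step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: the function names, the toy probabilities and the threshold value are all assumptions, the ratio form of MI is taken from the slide as written, and the "iterative update" step is not modelled.

```python
def mi(p_pair, p_w1, p_w2):
    """MI as written on the slide: P(w1 w2) / (P(w1) P(w2))."""
    return p_pair / (p_w1 * p_w2)


def segment_by_mi(words, p_word, p_pair, threshold):
    """Insert a boundary between adjacent words whose MI is below threshold."""
    segments = [[words[0]]]
    for w1, w2 in zip(words, words[1:]):
        score = mi(p_pair.get((w1, w2), 0.0), p_word[w1], p_word[w2])
        if score < threshold:
            segments.append([w2])      # weak association: start a new segment
        else:
            segments[-1].append(w2)    # strong association: extend the segment
    return segments


# Toy (hypothetical) probabilities chosen so that only pair (2,3) is weak.
words = ["w1", "w2", "w3", "w4", "w5"]
p_word = {w: 0.01 for w in words}
p_pair = {("w1", "w2"): 0.005, ("w2", "w3"): 0.0001,
          ("w3", "w4"): 0.008, ("w4", "w5"): 0.005}
result = segment_by_mi(words, p_word, p_pair, threshold=10.0)
```

Because each decision looks only at one adjacent pair, a longer concept whose inner pairs are individually weak would be split, which is the problem the slide points out.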



Assume the query is generated by independent sampling from a probability distribution of concepts (unigram model):

Query: simmons college sports psychology

[simmons college] [sports psychology]
P(simmons college)=0.000016
P(sports psychology)=0.000002
P=0.000016×0.000002

>

[simmons] [college sports] [psychology]
P(simmons)=0.000007
P(college sports)=0.000006
P(psychology)=0.000024
P=0.000007×0.000006×0.000024

• Enumerate all possible segmentations; rank by probability of being generated by the unigram model
• How to estimate parameters P(w) for the unigram model?
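The enumerate-and-rank step can be sketched as follows, using the concept probabilities quoted on this slide. The function names are illustrative, and treating any concept missing from the table as probability 0 is an assumption made only to keep the example small.

```python
from itertools import combinations
from math import prod


def segmentations(words):
    """Yield every way to cut a word list into contiguous segments."""
    n = len(words)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [" ".join(words[a:b]) for a, b in zip(bounds, bounds[1:])]


def rank_segmentations(words, p, floor=0.0):
    """Score each segmentation as the product of its concept probabilities."""
    scored = [(prod(p.get(seg, floor) for seg in s), s)
              for s in segmentations(words)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Probabilities quoted on the slide; everything else defaults to `floor`.
p = {
    "simmons college": 0.000016,
    "sports psychology": 0.000002,
    "simmons": 0.000007,
    "college sports": 0.000006,
    "psychology": 0.000024,
}
ranked = rank_segmentations("simmons college sports psychology".split(), p)
best_prob, best_seg = ranked[0]
```

A query of n words has 2^(n-1) segmentations, so exhaustive enumeration is only practical for short queries.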



We have ngram (n=1..5) counts in a web corpus
◦ 464M documents; L = 33B tokens
◦ Approximate counts for longer ngrams are often computable: e.g. #(harry potter and the goblet of fire) is in [5783, 6399]
   #(ABC) = #(AB) + #(BC) - #(AB OR BC) >= #(AB) + #(BC) - #(B)
   Solved by DP
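The lower bound above can be checked on a toy token stream. In this sketch the helper names and the token stream are made up for illustration; the upper bound min(#(AB), #(BC)) is not stated on the slide and is added here as an assumption that follows from every occurrence of ABC containing both an AB and a BC. The DP that chains such bounds for longer ngrams is not shown.

```python
def count(tokens, ngram):
    """Number of positions where `ngram` occurs contiguously in `tokens`."""
    n = len(ngram)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == ngram)


def count_bounds(c_ab, c_bc, c_b):
    """Bounds on #(ABC) from the counts of AB, BC and B."""
    lower = max(0, c_ab + c_bc - c_b)   # bound stated on the slide
    upper = min(c_ab, c_bc)             # added assumption, see lead-in
    return lower, upper


# Hypothetical token stream with A="a", B="b", C="c".
tokens = "x a b c y a b z b c a b c".split()
c_ab = count(tokens, ["a", "b"])
c_bc = count(tokens, ["b", "c"])
c_b = count(tokens, ["b"])
c_abc = count(tokens, ["a", "b", "c"])
lower, upper = count_bounds(c_ab, c_bc, c_b)
```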



Maximum Likelihood Estimate: P_MLE(t) = #(t) / N

Problem:
◦ #(potter and the goblet of) = 6765
◦ P(potter and the goblet of) > P(harry potter and the goblet of fire)? Wrong!
◦ not prob. of seeing t in text, but prob. of seeing t as a self-contained concept in text

Query-relevant web corpus

ngram                                  longest matching count   raw frequency
harry                                  1657108                  2003112
harry potter                           277736                   346004
harry potter and                       10436                    68268
harry potter and the                   51330                    57832
harry potter and the goblet            101                      6502
harry potter and the goblet of         618                      6401
harry potter and the goblet of fire    5783                     5783
...                                    …                        …
fire                                   4200957                  4478774
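One way to read the two columns: along the prefix chain from "harry" to the full title, each raw frequency equals that ngram's longest matching count plus the raw frequency of the next longer prefix. The sketch below only verifies this arithmetic on the numbers listed; it is an observation about this table, not a statement of how the counts were produced, and the elided rows and the "fire" row are left out.

```python
# (ngram, longest matching count, raw frequency) for the prefix chain.
rows = [
    ("harry", 1657108, 2003112),
    ("harry potter", 277736, 346004),
    ("harry potter and", 10436, 68268),
    ("harry potter and the", 51330, 57832),
    ("harry potter and the goblet", 101, 6502),
    ("harry potter and the goblet of", 618, 6401),
    ("harry potter and the goblet of fire", 5783, 5783),
]


def chain_is_consistent(rows):
    """Check raw(t) == longest(t) + raw(next longer prefix) down the chain."""
    for (_, longest, raw), (_, _, next_raw) in zip(rows, rows[1:]):
        if raw != longest + next_raw:
            return False
    # The full query cannot be extended, so both counts must coincide.
    _, last_longest, last_raw = rows[-1]
    return last_raw == last_longest


consistent = chain_is_consistent(rows)
```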

Choose parameters to maximize the posterior probability given the query-relevant corpus (equivalently, minimize the total description length)

t: a query substring
C(t): longest matching count of t
D = {(t, C(t))}: query-relevant corpus
s(t): a segmentation of t
θ: unigram model parameters (ngram probabilities)

θ = argmax P(D|θ) P(θ)                (posterior prob.)
  = argmax log P(D|θ) + log P(θ)      (DL of corpus, DL of parameters)

log P(D|θ) = ∑_t C(t) log P(t|θ)
P(t|θ) = ∑_{s(t)} P(s(t)|θ)
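The two sums in the likelihood, P(t|θ) over all segmentations of t and log P(D|θ) over all substrings weighted by C(t), can be sketched with a short dynamic program. This is an illustration under assumptions: the function names, the toy θ and the toy D are made up, and neither the prior P(θ) nor the parameter optimization is shown.

```python
from math import log


def substring_prob(words, theta):
    """P(t | theta): sum over all segmentations of t of the product of
    segment probabilities, computed left to right by dynamic programming."""
    n = len(words)
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0                     # the empty prefix has probability 1
    for j in range(1, n + 1):
        for i in range(j):
            seg = " ".join(words[i:j])
            alpha[j] += alpha[i] * theta.get(seg, 0.0)
    return alpha[n]


def log_likelihood(corpus, theta):
    """log P(D | theta) = sum over t of C(t) * log P(t | theta)."""
    return sum(c * log(substring_prob(t.split(), theta))
               for t, c in corpus.items())


# Hypothetical parameters and counts, for illustration only.
theta = {"harry": 0.2, "potter": 0.1, "harry potter": 0.05}
corpus = {"harry potter": 3, "harry": 2}
p_hp = substring_prob(["harry", "potter"], theta)   # [harry potter] + [harry][potter]
ll = log_likelihood(corpus, theta)
```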



Three human-segmented datasets
◦ 3 data sets, for training, validation, and testing, 500 queries for each set
   Segmented by three editors A, B, C



Evaluation metric:

◦ Boundary classification accuracy

[Figure: words w1 w2 w3 w4 w5 with a binary boundary decision at each of the four gaps, in order: N, Y, N, Y]

◦ Whole query accuracy: the percentage of queries with perfect boundary classification accuracy
◦ Segment accuracy: the percentage of segments being recovered
   Truth: [abc] [de] [fg]
   Prediction: [abc] [de fg]: precision
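These metrics can be sketched on the truth/prediction pair above, treating each letter as one word. The function names are illustrative, and the precision/recall definitions used here (exactly matching segments over predicted or true segments) are the usual ones, assumed rather than taken from the slide.

```python
def boundaries(segments):
    """Set of gap positions (1..n-1) where a segment boundary falls."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts


def spans(segments):
    """Set of (start, end) word offsets, one per segment."""
    out, pos = set(), 0
    for seg in segments:
        out.add((pos, pos + len(seg)))
        pos += len(seg)
    return out


def boundary_accuracy(truth, pred):
    """Fraction of gaps labelled identically (requires at least two words)."""
    n = sum(len(seg) for seg in truth)
    t, p = boundaries(truth), boundaries(pred)
    return sum((g in t) == (g in p) for g in range(1, n)) / (n - 1)


def segment_precision_recall(truth, pred):
    """Precision and recall over exactly recovered segments."""
    t, p = spans(truth), spans(pred)
    hits = len(t & p)
    return hits / len(p), hits / len(t)


truth = [list("abc"), list("de"), list("fg")]
pred = [list("abc"), list("defg")]
acc = boundary_accuracy(truth, pred)
precision, recall = segment_precision_recall(truth, pred)
whole_query_correct = boundaries(truth) == boundaries(pred)
```

Under these assumed definitions only [abc] is recovered, so the example gives precision 1/2 and recall 1/3.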


 

Use for Improving Search Relevance
◦ Query rewriting
◦ Ranking features

 

Phrase Proximity Boosting
Phrase Level Query Expansion



Classifying a segment into one of three categories
◦ Strong concept: no word reordering, no word insertion/deletion
   Treat the whole segment as a single unit in matching and ranking
◦ Weak concept: allow word reordering or deletion/insertion
   Boost documents matching the weak concepts
◦ Not a concept
   Do nothing



Concept based BM25
◦ Weighted by the confidence of concepts

Concept based min coverage
◦ Weighted by the confidence of concepts
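The slides do not give the formula for concept-based BM25. One plausible reading, sketched below purely as an assumption, is to score each concept as a single phrase term with the standard BM25 term formula and multiply by the concept's confidence. The function names, document-frequency numbers and parameter defaults (k1=1.2, b=0.75) are illustrative choices, not values from the talk.

```python
from math import log


def bm25_term(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """Standard BM25 contribution of one term (here: one concept phrase)."""
    idf = log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))


def phrase_tf(doc_tokens, phrase_tokens):
    """Count exact, in-order occurrences of the phrase in the document."""
    n = len(phrase_tokens)
    return sum(1 for i in range(len(doc_tokens) - n + 1)
               if doc_tokens[i:i + n] == phrase_tokens)


def concept_bm25(doc_tokens, concepts, df, n_docs, avg_len):
    """Sum of confidence-weighted BM25 scores, one per (phrase, confidence)."""
    score = 0.0
    for phrase, confidence in concepts:
        tf = phrase_tf(doc_tokens, phrase.split())
        score += confidence * bm25_term(tf, df[phrase], n_docs,
                                        len(doc_tokens), avg_len)
    return score


# Hypothetical collection statistics.
concepts = [("simmons college", 0.9)]
df = {"simmons college": 10}
doc_a = "sports psychology course at simmons college in boston".split()
doc_b = "simmons is a small college near boston".split()
score_a = concept_bm25(doc_a, concepts, df, n_docs=1000, avg_len=8.0)
score_b = concept_bm25(doc_b, concepts, df, n_docs=1000, avg_len=8.0)
```

Matching the phrase as a unit means a document that contains both words apart, like doc_b, receives no concept score.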



Phrase level replacement
◦ [San Francisco] -> [sf]
◦ [red eye flight] -> [late night flight]
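Phrase-level replacement on an already segmented query can be sketched as a table lookup per segment. The rewrite table below holds only the two examples from this slide; the function name and the choice to generate every combination of original and rewritten segments are assumptions.

```python
# Rewrite table: segment -> alternative phrasings (examples from the slide).
REWRITES = {
    "san francisco": ["sf"],
    "red eye flight": ["late night flight"],
}


def expand(segments, rewrites):
    """Return every query variant obtained by keeping or rewriting each segment."""
    variants = [[]]
    for seg in segments:
        options = [seg] + rewrites.get(seg, [])
        variants = [v + [o] for v in variants for o in options]
    return [" ".join(v) for v in variants]


variants = expand(["san francisco", "red eye flight"], REWRITES)
```

Because the lookup key is a whole segment, a rewrite never fires on words that merely sit next to each other across a segment boundary.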



Significant relevance boosting
◦ Affects 40% of query traffic
◦ Significant DCG gain (1.5% for affected queries)
◦ Significant online CTR gain (0.5% overall)

 

Conclusions





Data is important for query segmentation
Phrases are important for improving relevance



 



Bergsma et al., EMNLP-CoNLL 2007
Risvik et al., WWW 2003
Hagen et al., SIGIR 2010
Tan & Peng, WWW 2008



Solution 1: Segment the web corpus offline, then collect counts for ngrams being segments

... … | Harry Potter and the Goblet of Fire | is | the | fourth | novel | in | the | Harry Potter series | written by | J.K. Rowling | ... ...

harry potter and the goblet of fire += 1
potter and the goblet of += 0

C. G. de Marcken, Unsupervised Language Acquisition, 96
Fuchun Peng, Self-supervised Chinese Word Segmentation, IDA01

• Technical difficulties



Solution 2: Online computation: only consider parts of the web corpus overlapping with the query (longest matches)

Q = harry potter and the goblet of fire

... … Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling ... ...

harry potter and the goblet of fire += 1
the += 2
harry potter += 1



Solution 2: Online computation: only consider parts of the web corpus overlapping with the query (longest matches)

Q = potter and the goblet

... … Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling ... ...

potter and the goblet += 1
the += 2
potter += 1

Directly compute longest matching counts using raw ngram frequency: O(|Q|^2)
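The definition of a longest match can be made concrete by scanning a text directly. This sketch is an assumption-laden illustration: it scans the corpus text itself, whereas the slide computes the same counts from raw ngram frequencies in O(|Q|^2); the function names and the lowercase whitespace tokenization are choices made for this example.

```python
from collections import Counter


def query_ngrams(query_tokens):
    """All contiguous ngrams of the query, as tuples."""
    n = len(query_tokens)
    return {tuple(query_tokens[i:j]) for i in range(n) for j in range(i + 1, n + 1)}


def longest_matching_counts(text_tokens, query_tokens):
    """Count maximal text spans that are also contiguous query ngrams."""
    subs = query_ngrams(query_tokens)
    counts = Counter()
    prev_end = 0
    for i in range(len(text_tokens)):
        end = i
        # Longest query ngram starting at text position i, if any.
        for j in range(min(len(text_tokens), i + len(query_tokens)), i, -1):
            if tuple(text_tokens[i:j]) in subs:
                end = j
                break
        # Count it only if it is not contained in an earlier, longer match.
        if end > i and end > prev_end:
            counts[" ".join(text_tokens[i:end])] += 1
        prev_end = max(prev_end, end)
    return counts


text = ("Harry Potter and the Goblet of Fire is the fourth novel in the "
        "Harry Potter series written by J.K. Rowling").lower().split()
full = longest_matching_counts(text, "harry potter and the goblet of fire".split())
part = longest_matching_counts(text, "potter and the goblet".split())
```

On the sentence from the slides, this yields the same increments as listed for both example queries.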
