January 13, 2018 | Author: Anonymous | Category: Science, Health Science, Pediatrics

Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic

Jin Young Kim*, Kevyn Collins-Thompson, Paul Bennett and Susan Dumais

*Work done during internship at Microsoft Research

Search and recommendation are about matching: queries, documents, websites, users.

Term-space matching is not always a good idea: granularity, sparsity, efficiency.

Can we build representations beyond term vectors? Topic category, reading level, sentiment, style.

What would be their implications for search and recommendations?


In a Nutshell

WHAT WE DID:
- Build profiles of reading level and topic (RLT)
- For queries, websites, users and search sessions
- In order to characterize and compare entities

WHAT WE FOUND:
- Profile matching predicts a user's content preference
- Profiles can indicate when not to personalize
- Profile features can predict expert content

Building Reading Level and Topic Profiles

Predicting Reading Level and Topic for a URL
- Reading Level Classifier: based on a language model and other sources
- Topic Classifier: trained using URLs in each Open Directory Project category

Profile
- Distribution over reading level, topic, or reading level and topic (RLT)
- For a URL d1: P(R|d1), P(T|d1)

Entity Profile Built from Related URLs 

Entities and Related URLs
- Websites: content vs. user-viewed URLs
- Users: URLs visited during search sessions
- Queries: top-10 retrieved URLs

Example: site profile made from URLs visited during search sessions
- P(R|d1), P(T|d1) for each visited URL, combined into the site profile P(R,T|s)
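The aggregation from per-URL profiles to an entity profile can be sketched as below. This is a minimal illustration, not the authors' implementation: the slides do not say how the joint P(R,T|d) is formed or how URLs are weighted, so the sketch assumes R and T are independent given a URL (outer product) and averages URLs uniformly. The names `joint_profile` and `entity_profile` and the toy numbers are hypothetical.

```python
import numpy as np

def joint_profile(p_r, p_t):
    """Joint P(R,T|d) for one URL.

    Assumption: R and T are independent given the URL, so the joint
    is the outer product of the two classifier outputs.
    """
    return np.outer(p_r, p_t)

def entity_profile(url_profiles):
    """Uniformly average per-URL joints into an entity profile P(R,T|e)."""
    joint = np.mean([joint_profile(p_r, p_t) for p_r, p_t in url_profiles], axis=0)
    return joint / joint.sum()  # renormalize to guard against rounding drift

# Toy data: 3 reading levels, 2 topics, 2 URLs visited on one site.
urls = [
    (np.array([0.7, 0.2, 0.1]), np.array([0.9, 0.1])),
    (np.array([0.1, 0.3, 0.6]), np.array([0.2, 0.8])),
]
site = entity_profile(urls)   # rows: reading levels, columns: topics
p_r_site = site.sum(axis=1)   # marginal P(R|s)
```

Summing the joint over topics recovers the average of the per-URL reading level distributions, which is a quick sanity check on the aggregation.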

Entity Profile Built with Related Entities 

Entity and Related Entities
- User: websites visited
- Website: surfacing queries
- Query: issuing users

[Diagram: a User visits a Website; a Query surfaces a Website; a User issues a Query]

Example: site profile made from the profiles of its visitors
- Visitor profiles P(R,T|u) are combined into the site profile P(R,T|s)

Characterizing and Comparing Profiles 

Characterizing an Individual Entity
- Mean: expectation
- Variance: entropy

Characterizing a Group of Entities
- Build a group centroid from its members
- Variance: divergence among members

Comparing Entities and Groups
- Difference in mean
- Divergence in profile (distribution)
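The statistics above can be sketched as follows. This is an illustration under stated assumptions: the slides name expectation, entropy, centroid and divergence but not the exact formulas, so the sketch uses Shannon entropy in bits, an unweighted mean as the centroid, and smoothed KL divergence to the centroid as the group spread. All function names and the `eps` smoothing constant are hypothetical.

```python
import numpy as np

def expected_level(p_r, levels):
    """Mean of a reading level profile: E[R] = sum_r r * P(r)."""
    return float(np.dot(p_r, levels))

def entropy(p):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

def kl(p, q, eps=1e-9):
    """KL(p || q) in bits, with additive smoothing (an assumption) to avoid log(0)."""
    p = np.asarray(p, dtype=float).ravel() + eps
    q = np.asarray(q, dtype=float).ravel() + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log2(p / q)))

def centroid(profiles):
    """Group centroid: unweighted average of member profiles."""
    return np.mean(profiles, axis=0)

def group_divergence(profiles):
    """Spread of a group: mean KL divergence from each member to the centroid."""
    c = centroid(profiles)
    return float(np.mean([kl(p, c) for p in profiles]))

focused = np.array([0.9, 0.05, 0.05])
diverse = np.array([1 / 3, 1 / 3, 1 / 3])
spread = group_divergence([focused, diverse])
```

A focused profile has lower entropy than a diverse one, and a group whose members are identical has zero divergence from its centroid.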

Characterizing Web Content, User Interests, and Search Behavior

Data Set 

Session Log Data
- 2,281,150 URL visits (1,218,433 SERP clicks)
- Collected from 8,841 users

Profiles of Entities
- 4,715 websites with 25+ clicked URLs
- 7,613 users with 25+ URL visits
- 141,325 unique queries

Reading Level Distribution for Top ODP Categories 

Each topic has a different reading level distribution

Category        R1    R2    R3    R4    R5    R6    R7    R8    R9    R10   R11   R12   E[R|T]
Reference       0.00  0.00  0.00  0.02  0.17  0.10  0.15  0.04  0.02  0.03  0.20  0.27  8.80
Health          0.00  0.00  0.00  0.03  0.18  0.08  0.13  0.04  0.04  0.10  0.27  0.11  8.53
Science         0.00  0.00  0.00  0.06  0.23  0.09  0.07  0.02  0.01  0.08  0.27  0.17  8.44
Computers       0.00  0.00  0.00  0.06  0.24  0.19  0.03  0.01  0.01  0.02  0.32  0.12  8.11
Business        0.00  0.00  0.00  0.05  0.22  0.16  0.09  0.03  0.02  0.04  0.26  0.12  8.08
Society         0.00  0.00  0.00  0.02  0.23  0.07  0.35  0.03  0.01  0.01  0.22  0.06  7.62
Adult           0.00  0.00  0.00  0.05  0.28  0.26  0.14  0.05  0.02  0.01  0.13  0.06  6.98
Kids and Teens  0.00  0.00  0.02  0.23  0.26  0.13  0.09  0.02  0.01  0.02  0.15  0.08  6.60
Games           0.00  0.00  0.00  0.19  0.36  0.10  0.11  0.02  0.02  0.03  0.12  0.03  6.39
Recreation      0.00  0.00  0.00  0.11  0.44  0.19  0.08  0.02  0.02  0.02  0.09  0.02  6.18
Arts            0.00  0.00  0.00  0.08  0.40  0.27  0.10  0.05  0.01  0.01  0.06  0.02  6.18
Home            0.00  0.00  0.02  0.19  0.41  0.14  0.04  0.03  0.01  0.03  0.09  0.04  6.08
News            0.00  0.00  0.00  0.04  0.41  0.33  0.14  0.02  0.02  0.01  0.03  0.01  5.99
Shopping        0.00  0.00  0.01  0.22  0.29  0.24  0.09  0.03  0.01  0.02  0.07  0.02  5.98
Sports          0.00  0.00  0.00  0.09  0.56  0.11  0.10  0.03  0.03  0.02  0.06  0.02  5.94
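As a worked check of the E[R|T] column, the expected reading level for a category is the probability-weighted sum of grade levels 1 to 12. The snippet below recomputes it for the Reference row using the two-decimal values printed in the table; the variable names are illustrative only.

```python
# Reading level distribution P(R | T = Reference), copied from the table (R1..R12).
reference = [0.00, 0.00, 0.00, 0.02, 0.17, 0.10, 0.15, 0.04, 0.02, 0.03, 0.20, 0.27]
levels = range(1, 13)

# E[R|T] = sum over levels r of r * P(r|T)
expected = sum(r * p for r, p in zip(levels, reference))
print(round(expected, 2))
```

With the rounded table entries this comes to about 8.82, close to the reported 8.80; the small gap is consistent with the table values having been rounded to two decimals.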

Topic and reading level characterize websites in each category

Profile matching predicts users' preferences over search results

Metric
- % of a user's preferences predicted by profile matching, for each clicked website over the skipped website above it

Results
- By degree of focus in user profile: H(R,T|u)
- By the distance metric between user and website: KL_R(u,s) / KL_T(u,s) / KL_RLT(u,s)

User Group   #Clicks   KL_R(u,s)   KL_T(u,s)   KL_RLT(u,s)
↑ Focused    5,960     59.23%      60.79%      65.27%
             147,195   52.25%      54.20%      54.41%
↓ Diverse    197,733   52.75%      53.36%      53.63%
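The metric can be sketched as follows: for each (clicked, skipped) website pair, the preference counts as predicted when the user profile is closer to the clicked site than to the skipped one. This is a sketch under assumptions, since the slides do not give the direction of the divergence, the smoothing, or tie handling: it uses KL(user || site), additive smoothing with a hypothetical `eps`, and treats ties as misses. The function names and toy profiles are illustrative.

```python
import numpy as np

def kl(p, q, eps=1e-9):
    """KL(p || q) in bits with additive smoothing (an assumption)."""
    p = np.asarray(p, dtype=float).ravel() + eps
    q = np.asarray(q, dtype=float).ravel() + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log2(p / q)))

def preference_accuracy(user, pairs):
    """Fraction of (clicked, skipped) site pairs where the clicked site
    is strictly closer to the user profile than the skipped site."""
    hits = sum(kl(user, clicked) < kl(user, skipped) for clicked, skipped in pairs)
    return hits / len(pairs)

# Toy profiles over 3 reading levels (the same code works for flattened RLT profiles).
user = [0.8, 0.1, 0.1]
site_a = [0.7, 0.2, 0.1]   # similar to the user
site_b = [0.1, 0.1, 0.8]   # dissimilar to the user

acc = preference_accuracy(user, [(site_a, site_b), (site_b, site_a)])
```

Here the first pair is predicted correctly and the second is not, so `acc` is 0.5. Passing reading level, topic, or joint RLT profiles to the same function would correspond to the three distance columns in the table.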

Users’ Deviation from Their Own Profiles 

Stretch reading
- Session-level reading level >> long-term reading level

Casual reading 

Session-level reading level