Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic
Jin Young Kim*, Kevyn Collins-Thompson, Paul Bennett and Susan Dumais
*Work done during internship at Microsoft Research
Search and recommendation are about matching: Queries, Documents, Websites, Users
Term-space matching is not always a good idea: Granularity, Sparsity, Efficiency
Can we build representations beyond term vectors? Topic Category, Reading Level, Sentiment, Style
What would be their implications for search and recommendations?
In a Nutshell
WHAT WE DID:
Build Profiles of Reading Level and Topic (RLT)
For queries, websites, users and search sessions
In order to characterize and compare entities
WHAT WE FOUND:
Profile matching predicts user’s content preference
Profiles can indicate when not to personalize
Profile features can predict expert content
Building Reading Level and Topic Profiles
Predicting Reading Level and Topic for URL
Reading Level Classifier
Based on language model and other sources
Topic Classifier
Trained using URLs in each Open Directory Project category
Profile
Distribution over reading level, topic, or reading level and topic (RLT)
P(R|d1), P(T|d1)
Entity Profile Built from Related URLs
Entities and Related URLs
Websites: content vs. user-viewed URLs
Users: URLs visited during search sessions
Queries: top-10 retrieved URLs
Example:
Site profile made from URLs visited during search sessions
Per-URL distributions P(R|d1) and P(T|d1) are combined into P(R,T|s)
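A minimal sketch of how such a site profile could be assembled. The slides do not say how P(R|d) and P(T|d) are combined, so this assumes reading level and topic are independent within a single URL (each URL contributes the product P(R|d)P(T|d)) and that URLs are averaged with equal weight. The function names and toy numbers are hypothetical.

```python
def joint_profile(p_r, p_t):
    """Joint P(R,T|d) for one URL, ASSUMING R and T are independent given d."""
    return [[r * t for t in p_t] for r in p_r]


def entity_profile(url_profiles):
    """Average per-URL joint profiles into one entity profile, e.g. P(R,T|s)."""
    n = len(url_profiles)
    rows = len(url_profiles[0])
    cols = len(url_profiles[0][0])
    return [
        [sum(p[i][j] for p in url_profiles) / n for j in range(cols)]
        for i in range(rows)
    ]


# Toy example: 3 reading levels, 2 topics, two URLs visited on one site.
d1 = joint_profile([0.7, 0.2, 0.1], [0.9, 0.1])
d2 = joint_profile([0.1, 0.3, 0.6], [0.5, 0.5])
site = entity_profile([d1, d2])
total = sum(sum(row) for row in site)  # a valid profile sums to 1
```

Because each per-URL profile sums to 1, their equal-weight average does too, so the result is itself a distribution over (reading level, topic) pairs.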
Entity Profile Built with Related Entities
Entity and related entities
User – Websites visited
Website – Surfacing queries
Query – Issuing users
[Diagram: User visits Website; Query surfaces Website; User issues Query]
Example:
Site profile made from the profiles of its visitors
Visitor profiles P(R,T|u) are combined into P(R,T|s)
Characterizing and Comparing Profiles
Characterizing an Individual Entity
Mean: expectation
Variance: entropy
Characterizing a Group of Entities
Build a group centroid from its members
Variance: divergence among members
Comparing Entities and Groups
Difference in mean
Divergence in profile (distribution)
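The statistics named above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: profiles are flat lists of probabilities, KL divergence stands in for "divergence" (the slides use KL for user-site distance), the centroid is an unweighted mean, and group variance is taken as the average divergence of members from the centroid. All function names are hypothetical.

```python
import math


def expected_level(p_r):
    """Mean of a reading-level distribution; index 0 corresponds to level 1."""
    return sum((i + 1) * p for i, p in enumerate(p_r))


def entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in bits; q is offset by eps to avoid division by zero."""
    return sum(x * math.log2(x / (y + eps)) for x, y in zip(p, q) if x > 0)


def centroid(profiles):
    """Group centroid: element-wise mean of the member profiles."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]


def group_divergence(profiles):
    """ASSUMED group variance: mean KL of each member from the centroid."""
    c = centroid(profiles)
    return sum(kl_divergence(p, c) for p in profiles) / len(profiles)


focused = [0.0, 1.0, 0.0]        # all mass on one level: zero entropy
diverse = [1 / 3, 1 / 3, 1 / 3]  # uniform: maximum entropy for 3 levels
```

A focused profile has low entropy and a diverse one has high entropy, which is the sense in which entropy plays the role of variance for a single entity.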
Characterizing Web Content, User Interests, and Search Behavior
Data Set
Session Log Data
2,281,150 URL visits (1,218,433 SERP clicks)
Collected from 8,841 users
Profiles of Entities
4,715 websites with 25+ clicked URLs
7,613 users with 25+ URL visits
141,325 unique queries
Reading Level Distribution for Top ODP Categories
Each topic has different reading level distribution
Category        R1    R2    R3    R4    R5    R6    R7    R8    R9    R10   R11   R12   E[R|T]
Reference       0.00  0.00  0.00  0.02  0.17  0.10  0.15  0.04  0.02  0.03  0.20  0.27  8.80
Health          0.00  0.00  0.00  0.03  0.18  0.08  0.13  0.04  0.04  0.10  0.27  0.11  8.53
Science         0.00  0.00  0.00  0.06  0.23  0.09  0.07  0.02  0.01  0.08  0.27  0.17  8.44
Computers       0.00  0.00  0.00  0.06  0.24  0.19  0.03  0.01  0.01  0.02  0.32  0.12  8.11
Business        0.00  0.00  0.00  0.05  0.22  0.16  0.09  0.03  0.02  0.04  0.26  0.12  8.08
Society         0.00  0.00  0.00  0.02  0.23  0.07  0.35  0.03  0.01  0.01  0.22  0.06  7.62
Adult           0.00  0.00  0.00  0.05  0.28  0.26  0.14  0.05  0.02  0.01  0.13  0.06  6.98
Kids and Teens  0.00  0.00  0.02  0.23  0.26  0.13  0.09  0.02  0.01  0.02  0.15  0.08  6.60
Games           0.00  0.00  0.00  0.19  0.36  0.10  0.11  0.02  0.02  0.03  0.12  0.03  6.39
Recreation      0.00  0.00  0.00  0.11  0.44  0.19  0.08  0.02  0.02  0.02  0.09  0.02  6.18
Arts            0.00  0.00  0.00  0.08  0.40  0.27  0.10  0.05  0.01  0.01  0.06  0.02  6.18
Home            0.00  0.00  0.02  0.19  0.41  0.14  0.04  0.03  0.01  0.03  0.09  0.04  6.08
News            0.00  0.00  0.00  0.04  0.41  0.33  0.14  0.02  0.02  0.01  0.03  0.01  5.99
Shopping        0.00  0.00  0.01  0.22  0.29  0.24  0.09  0.03  0.01  0.02  0.07  0.02  5.98
Sports          0.00  0.00  0.00  0.09  0.56  0.11  0.10  0.03  0.03  0.02  0.06  0.02  5.94
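As a worked check of the E[R|T] values, the expected reading level is the probability-weighted sum of the levels R1 to R12. The sketch below recomputes it for the Reference row. The result lands close to the reported 8.80; the small gap presumably comes from the probabilities being rounded to two decimals. The function name is hypothetical.

```python
# P(R|T = Reference) for levels R1..R12, copied from the Reference row.
reference = [0.00, 0.00, 0.00, 0.02, 0.17, 0.10,
             0.15, 0.04, 0.02, 0.03, 0.20, 0.27]


def expected_reading_level(p_r):
    """E[R|T] = sum over levels r of r * P(R = r | T), with levels starting at 1."""
    return sum(level * p for level, p in enumerate(p_r, start=1))


e_reference = expected_reading_level(reference)
```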
Topic and reading level characterize websites in each category
Profile matching predicts user’s preference over search results
Metric
% of user’s preferences correctly predicted by profile matching, where each preference is a clicked website over a skipped website ranked above it
Results
By degree of focus in user profile: H(R,T|u)
By the distance metric between user and website: KLR(u,s) / KLT(u,s) / KLRLT(u,s)

User Group   #Clicks   KLR(u,s)  KLT(u,s)  KLRLT(u,s)
↑ Focused      5,960   59.23%    60.79%    65.27%
             147,195   52.25%    54.20%    54.41%
↓ Diverse    197,733   52.75%    53.36%    53.63%
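The metric can be sketched as a pairwise test: for each (clicked, skipped) pair of websites, the prediction counts as correct when the clicked site's profile is closer to the user's profile than the skipped one. This is an illustration only. It assumes KL(u,s) means KL(u || s) with a small smoothing constant, and the function names and toy profiles are hypothetical.

```python
import math


def kl(p, q, eps=1e-9):
    """KL(p || q) in bits; q is offset by eps to avoid division by zero."""
    return sum(x * math.log2(x / (y + eps)) for x, y in zip(p, q) if x > 0)


def preference_accuracy(user_profile, pairs):
    """Fraction of (clicked, skipped) site-profile pairs in which the
    clicked site has the smaller divergence from the user profile."""
    correct = sum(
        1
        for clicked, skipped in pairs
        if kl(user_profile, clicked) < kl(user_profile, skipped)
    )
    return correct / len(pairs)


user = [0.8, 0.1, 0.1]
pairs = [
    ([0.7, 0.2, 0.1], [0.1, 0.1, 0.8]),  # clicked site resembles the user
    ([0.1, 0.8, 0.1], [0.6, 0.2, 0.2]),  # skipped site resembles the user
]
accuracy = preference_accuracy(user, pairs)
```

With profiles restricted to reading level, topic, or the joint RLT distribution, the same comparison would give the three columns of the table.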
Users’ Deviation from Their Own Profiles
Stretch reading
Session-level reading level >> Long-term reading level
Casual reading
Session-level reading level