TextRank_-_Bringing_Order_into_Texts

January 10, 2018 | Author: Anonymous | Category: Math, Statistics And Probability, Statistics

TextRank: Bringing Order into Texts
Rada Mihalcea and Paul Tarau
Department of Computer Science, University of North Texas
EMNLP 2004 (Conference on Empirical Methods in Natural Language Processing)

May 18, 2011
In-seok An
SNU Internet Database Lab.

Outline
• Introduction
• The TextRank Model
  – Undirected Graphs
  – Weighted Graphs
  – Text as a Graph
• Keyword Extraction
  – TextRank for Keyword Extraction
  – Evaluation
• Sentence Extraction
  – TextRank for Sentence Extraction
  – Evaluation
• Conclusion

2 / 33

Introduction
• Graph-based ranking algorithms
  – Decide on the importance of a vertex
    • Take into account global information
      – Recursively computed from the entire graph
  – Have been successfully used in
    • Citation analysis
    • Social networks
    • Link structure of the Web
  – Can be applied to NLP applications
    • Automated extraction of key phrases
    • Extractive summarization
    • Etc.

3 / 33

Introduction
• TextRank
  – Graph-based ranking model
    • For graphs extracted from texts
  – Two NLP tasks
    • Keyword extraction
    • Sentence extraction
  – TextRank is competitive
    • Compared with other NLP algorithms

Graph, Ranking, Formula, Boring We introduce TextRank

4 / 33

The TextRank Model
• Graph-based ranking algorithms
  – A way of deciding the importance of a vertex within a graph
  – Based on global information
    • Recursively drawn from the entire graph
• Basic idea
  – Voting
  – Recommendation
• The score of a vertex depends on
  – How many votes are cast for it
  – Who casts them (the scores of the voting vertices)

6 / 33

The TextRank Model
• Score of a vertex
  – $S(V_i)$: score of the vertex
  – $V_i$: vertex
  – $In(V_i)$: the set of vertices that point to it (predecessors)
  – $Out(V_i)$: the set of vertices that it points to (successors)
  – $d$: damping factor
    • Integrates into the model the probability of jumping from a given vertex to another random vertex
      – Random surfer model: a link is followed with probability $d$, a random jump is made with probability $1 - d$
      – Usually set to 0.85 (PageRank)

7 / 33
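Using the symbols defined on this slide, the vertex score of the PageRank formulation that TextRank builds on (Brin and Page, 1998) can be written as follows:

```latex
S(V_i) = (1 - d) + d \sum_{V_j \in \mathit{In}(V_i)} \frac{1}{|\mathit{Out}(V_j)|} \, S(V_j)
```

Each predecessor $V_j$ splits its own score evenly over its $|\mathit{Out}(V_j)|$ outgoing links, so a vote from a highly ranked vertex with few outgoing links counts the most.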

The TextRank Model
• The score of the graph
  – Starting from arbitrary values
  – The computation iterates
    • Until convergence below a given threshold is achieved
• The score of a vertex
  – Represents the importance of the vertex
  – The final values are not affected by the initial values
    • Only the number of iterations to convergence may differ

8 / 33
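The iteration on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `rank`, the toy graph, and the choice to measure convergence as the largest per-vertex change are all assumptions made for the example.

```python
def rank(in_links, out_degree, d=0.85, threshold=1e-4, init=1.0, max_iter=1000):
    """Iterate S(Vi) = (1 - d) + d * sum(S(Vj) / |Out(Vj)|) until convergence."""
    scores = {v: init for v in in_links}
    for iteration in range(1, max_iter + 1):
        new_scores = {
            v: (1 - d) + d * sum(scores[u] / out_degree[u] for u in in_links[v])
            for v in in_links
        }
        # Largest change of any vertex score in this pass (assumed error measure).
        error = max(abs(new_scores[v] - scores[v]) for v in in_links)
        scores = new_scores
        if error < threshold:
            break
    return scores, iteration


# Hypothetical toy graph: A -> B, A -> C, B -> C, C -> A
in_links = {"A": ["C"], "B": ["A"], "C": ["A", "B"]}
out_degree = {"A": 2, "B": 1, "C": 1}

scores_from_1, iters_1 = rank(in_links, out_degree, init=1.0)
scores_from_10, iters_10 = rank(in_links, out_degree, init=10.0)
```

Both runs settle on practically the same scores, with C ranked highest, even though one starts every vertex at 1 and the other at 10. Only the iteration counts `iters_1` and `iters_10` may differ.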

The TextRank Model

Undirected Graphs
• Recursive graph-based ranking algorithms
  – Traditionally applied to directed graphs
  – Can also be applied to undirected graphs
    • The out-degree of a vertex is equal to its in-degree
• Convergence curve
  – As the connectivity of the graph increases
    • Fewer iterations are needed
    • The convergence curves for directed and undirected graphs practically overlap

9 / 33

The TextRank Model

Weighted Graphs
• PageRank
  – Assumes an unweighted graph
  – A page rarely includes multiple or partial links to another page
• TextRank
  – May include multiple or partial links between the units
    • The graphs are built from natural language text
  – Incorporates the “strength” of the connection
    • As a weight on the edge

10 / 33

The TextRank Model

Weighted Graphs
• New measure
  – The final scores differ significantly
    • As compared to the original measure
  – The number of iterations is almost identical
    • For weighted and unweighted graphs

11 / 33
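With $w_{ji}$ denoting the weight of the edge from $V_j$ to $V_i$, the weighted score defined in the TextRank paper (Mihalcea and Tarau, 2004) takes the following form:

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in \mathit{In}(V_i)} \frac{w_{ji}}{\sum_{V_k \in \mathit{Out}(V_j)} w_{jk}} \, WS(V_j)
```

When every weight equals 1, the fraction reduces to $1 / |\mathit{Out}(V_j)|$ and the measure coincides with the unweighted score.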

The TextRank Model

Text as a Graph
• Build a graph
  – Represents the text
  – Interconnects words or other text entities with meaningful relations
    • Text units of various sizes
    • Various characteristics
      – Words, entire sentences, collocations, etc.
• The type of relations
  – Lexical semantic relations
  – Contextual overlap
  – Etc.

12 / 33

The TextRank Model

Text as a Graph
• 4 steps of graph-based ranking algorithms
  – Identify text units
    • That best define the task at hand
    • Add them as vertices in the graph
  – Identify relations
    • That connect such text units
    • Use these relations to draw edges
      – Directed
      – Undirected
  – Iterate the graph-based ranking algorithm
    • Until convergence
  – Sort vertices based on their final score

13 / 33

Keyword Extraction
• Keyword extraction
  – Automatically identify a set of terms
    • That best describe the document
  – Uses of these keywords
    • Building an automatic index
    • Classifying a text
    • Concise summary
    • Terminology extraction
    • Construction of domain-specific dictionaries
• TextRank
  – No limitation on the size of the text

15 / 33

Keyword Extraction
• Possible approaches
  – Frequency criterion
  – Supervised learning methods
    • Parametrized heuristic rules (combined with a genetic algorithm)
      – Turney, 1999
      – Precision: 29.0%
        • Five key phrases extracted per document
    • Naïve Bayes learning scheme
      – Frank et al., 1999
      – Precision: 18.3%
        • Fifteen key phrases extracted per document
  – Keyword extraction from abstracts
    • More widely applicable
      – Many documents on the Internet are not available as full texts
  – Accuracy of the system is almost doubled by adding linguistic knowledge to the term representation
    • Part-of-speech information
    • Hulth, 2003

16 / 33

Keyword Extraction

TextRank for Keyword Extraction
• End result of TextRank
  – A set of words or phrases
    • Representative of a given natural language text
    • Sequences of one or more lexical units extracted from the text
• Relation
  – Can be defined between two lexical units
  – Co-occurrence relation
    • Two vertices are connected if their corresponding lexical units co-occur within a window of at most N words
    • N can be set anywhere from 2 to 10 words
• Syntactic filter
  – Consider only
    • All open-class words
    • Nouns and verbs
    • Nouns and adjectives only

17 / 33

Keyword Extraction

TextRank for Keyword Extraction
• TextRank process
  – Text tokenizing
    • Annotated with part-of-speech tags
      – Preprocessing step required to enable the application of syntactic filters
    • Only single words are candidates for addition to the graph
      – To avoid excessive growth of the graph size
      – Multi-word keywords are eventually reconstructed in the post-processing phase
  – Syntactic filtering
    • All lexical units that pass the filter are added to the graph
    • An edge is added between those lexical units
      – That co-occur within a window of N words
    • The initial score of each vertex is set to 1

18 / 33
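The graph construction on this slide can be sketched as below. Several details are assumptions made for illustration: the function name `build_cooccurrence_graph`, the hand-assigned tag labels (`NOUN`, `ADJ`, and so on), reading "a window of N words" as a distance of at most N − 1 positions, and measuring that distance on the unfiltered token sequence.

```python
def build_cooccurrence_graph(tagged_tokens, window=2, allowed_tags=("NOUN", "ADJ")):
    """Build an undirected co-occurrence graph as a dict of neighbour sets."""
    graph = {}
    words = [word.lower() for word, _ in tagged_tokens]
    passes = [tag in allowed_tags for _, tag in tagged_tokens]
    for i, word in enumerate(words):
        if not passes[i]:
            continue  # the syntactic filter drops this token
        graph.setdefault(word, set())
        # Look ahead at the tokens that share a window of `window` words with token i.
        for j in range(i + 1, min(i + window, len(words))):
            if passes[j] and words[j] != word:
                graph[word].add(words[j])
                graph.setdefault(words[j], set()).add(word)
    return graph


# Hand-tagged example phrase; the tags are illustrative, not tagger output.
tagged = [
    ("Matlab", "NOUN"), ("code", "NOUN"), ("for", "ADP"),
    ("plotting", "VERB"), ("ambiguity", "NOUN"), ("functions", "NOUN"),
]
graph = build_cooccurrence_graph(tagged, window=2)

# Every vertex starts with a score of 1, as on the slide.
initial_scores = {word: 1.0 for word in graph}
```

With `window=2` only directly adjacent words are linked, so the example yields the edges matlab–code and ambiguity–functions, while "for" and "plotting" never enter the graph.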

Keyword Extraction

TextRank for Keyword Extraction
  – Ranking algorithm
    • Is run on the graph for several iterations
      – Until it converges (usually 20 to 30 iterations)
      – At a threshold of 0.0001
  – Sorting
    • Vertices are sorted in reverse order of their score
    • The top T vertices are retained for post-processing
      – T may be set to any fixed value (usually ranging from 5 to 20)
      – Alternatively, the number of keywords is decided based on the size of the text
      – Here, T is set to a third of the number of vertices in the graph
  – Post-processing
    • Sequences of adjacent keywords are collapsed into a multi-word keyword
      – E.g., “Matlab code for plotting ambiguity functions”
      – If Matlab and code are both selected as potential keywords
      – They are collapsed into one single keyword, Matlab code

19 / 33
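The sorting and post-processing steps can be sketched as follows. The helper names `top_third` and `collapse_keywords` and the scores in the example are hypothetical, chosen only to reproduce the "Matlab code" example from the slide.

```python
def top_third(scores):
    """Keep the top T vertices, with T set to a third of the vertex count."""
    t = max(1, len(scores) // 3)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:t])


def collapse_keywords(tokens, selected):
    """Merge runs of adjacent selected words into multi-word keywords."""
    phrases, current = [], []
    for token in tokens:
        if token.lower() in selected:
            current.append(token)
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(phrases))


# Made-up scores for six vertices, so T = 6 // 3 = 2.
example_scores = {
    "matlab": 1.4, "code": 1.3, "ambiguity": 0.9,
    "functions": 0.8, "plotting": 0.7, "radar": 0.6,
}
selected = top_third(example_scores)
tokens = "Matlab code for plotting ambiguity functions".split()
keywords = collapse_keywords(tokens, selected)
```

Here `selected` contains "matlab" and "code", which sit next to each other in the text, so `keywords` holds the single entry "Matlab code". A selected word with no selected neighbour would simply remain a one-word keyword.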

Keyword Extraction

TextRank for Keyword Extraction
• Sample graph

20 / 33

Keyword Extraction

Evaluation
• Inspec database
  – Abstracts of journal papers from Computer Science and Information Technology
  – 500 abstracts
  – Each abstract comes with two sets of keywords
    • Controlled keywords
      – Restricted to a given thesaurus
    • Uncontrolled keywords
      – Freely assigned by the indexers
      – We use the uncontrolled set of keywords
• In the previous experiments
  – Hulth used a total of 2000 abstracts
    • 1000 for training
    • 500 for development
    • 500 for test
  – TextRank is completely unsupervised
    • No training/development data
    • Only the test documents are used, for evaluation purposes

21 / 33

Keyword Extraction

Evaluation
• Results for automatic keyword extraction
• TextRank achieves the highest precision and F-measure
• A larger window does not seem to help
  – The relation between words that are further apart is not strong

22 / 33
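For reference, the precision and F-measure named on this slide can be computed as in the sketch below. The keyword sets are made up for illustration and are not taken from the Inspec data, and exact string matching between assigned and correct keywords is an assumption.

```python
def precision_recall_f(assigned, correct):
    """Precision, recall and F-measure of assigned keywords against a gold set."""
    assigned, correct = set(assigned), set(correct)
    hits = len(assigned & correct)
    precision = hits / len(assigned) if assigned else 0.0
    recall = hits / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical system output and hypothetical indexer keywords.
assigned = {"matlab code", "ambiguity functions", "plotting"}
correct = {"matlab code", "ambiguity functions", "radar signals", "waveform design"}
precision, recall, f_measure = precision_recall_f(assigned, correct)
```

Two of the three assigned keywords are correct and two of the four gold keywords are found, so precision is 2/3, recall is 1/2 and the F-measure is 4/7.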

Keyword Extraction

Evaluation
• Syntactic filter
  – Experiments were performed with various syntactic filters
  – The best performance was achieved with “nouns and adjectives only”
  – Linguistic information helps the process of keyword extraction
    • Results obtained with no part-of-speech information were significantly lower
• TextRank system
  – Leads to an F-measure higher than any of the previously proposed systems
  – Is completely unsupervised
  – Relies exclusively on information drawn from the text itself
    • Which makes it easily portable to other text collections, domains, and languages

23 / 33

Sentence Extraction
• Sentence extraction
  – Identifying sequences that are more representative of the given text
  – Deals with entire sentences
  – Regarded as similar to keyword extraction
• The goal
  – Rank entire sentences
  – Extraction for automatic summarization

25 / 33

Sentence Extraction

TextRank for Sentence Extraction
• Build a graph
  – A vertex is added to the graph for each sentence in the text
  – A connection between two sentences is determined
    • If there is a “similarity” relation between them
    • Similarity is measured as a function of their content overlap
    • “Co-occurrence” is not a meaningful relation for sentences
      – Sentences are large contexts
    • A link can be drawn between any two such sentences that share common content

26 / 33

Sentence Extraction

TextRank for Sentence Extraction
• Similarity
  – The number of common tokens between the lexical representations of the two sentences
    • It can be run through syntactic filters
  – We are using a normalization factor
    • To avoid promoting long sentences
• Sentence similarity measure
  – Sentence $S_i = w_1^i, w_2^i, \ldots, w_{N_i}^i$

27 / 33
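With each sentence represented by the words it contains, the overlap measure and its length normalization, as defined in the TextRank paper (Mihalcea and Tarau, 2004), can be written as:

```latex
\mathit{Similarity}(S_i, S_j) =
  \frac{|\{\, w_k \mid w_k \in S_i \ \&\ w_k \in S_j \,\}|}
       {\log(|S_i|) + \log(|S_j|)}
```

The numerator counts the words the two sentences share. The logarithms in the denominator grow with sentence length, which keeps long sentences from being promoted merely because they contain more words.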

Sentence Extraction

TextRank for Sentence Extraction
• The resulting graph
  – Highly connected
  – A weight is associated with each edge
    • The strength of the connections established between various sentence pairs in the text
  – Use the weighted graph-based ranking formula
• Sentences are sorted in reverse order of their score
  – The top-ranked sentences are selected for inclusion in the summary

28 / 33
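The sentence-ranking procedure can be sketched as below, under stated assumptions: whitespace tokenization with lowercasing, no syntactic filter, an undirected graph, and a convergence test on the largest per-sentence change. The function names and the four example sentences are invented for illustration.

```python
import math


def similarity(s1, s2):
    """Word overlap normalized by the log lengths of both token lists."""
    if not s1 or not s2:
        return 0.0
    denom = math.log(len(s1)) + math.log(len(s2))
    if denom == 0:  # both sentences consist of a single token
        return 0.0
    return len(set(s1) & set(s2)) / denom


def rank_sentences(sentences, d=0.85, threshold=1e-4, max_iter=200):
    """Return sentence indices (best first) and their weighted scores."""
    tokens = [s.lower().split() for s in sentences]
    n = len(tokens)
    # Symmetric weight matrix without self-loops.
    w = [[similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(max_iter):
        new = [
            (1 - d) + d * sum(w[j][i] / out_sum[j] * scores[j]
                              for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
        converged = max(abs(a - b) for a, b in zip(new, scores)) < threshold
        scores = new
        if converged:
            break
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return order, scores


sentences = [
    "the storm hit the coast on monday",
    "the storm caused flooding along the coast",
    "officials said flooding closed several roads",
    "a local bakery won a prize",
]
order, scores = rank_sentences(sentences)
```

In this toy text the second sentence shares words with both the first and the third, so it ends up ranked first. The last sentence shares nothing with the others and keeps only the baseline score of 1 − d. Taking a prefix of `order` gives a summary of the desired length.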

Sentence Extraction

TextRank for Sentence Extraction
• Sample graph built for sentence extraction

29 / 33

Sentence Extraction

Evaluation
• Single-document summarization
  – 567 news articles
    • Provided during the Document Understanding Evaluations 2002 (DUC)
  – TextRank generates a 100-word summary
    • The same task undertaken by the other systems participating in this single-document summarization task (fifteen different systems participated)
• ROUGE evaluation toolkit
  – Method based on n-gram statistics
  – Highly correlated with human evaluations (Lin and Hovy, 2003)
  – Two manually produced reference summaries are provided
    • And used in the evaluation process
  – We compare the performance of TextRank with the top five performing systems
    • As well as with the baseline proposed by the DUC evaluators

30 / 33
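To make "based on n-gram statistics" concrete, the sketch below computes a simplified n-gram recall in the spirit of ROUGE. It is not the ROUGE toolkit: it handles a single reference, uses whitespace tokenization, and applies no stemming or stop-word handling. The function name and the example sentences are assumptions.

```python
from collections import Counter


def ngram_recall(candidate, reference, n=1):
    """Share of reference n-grams that also occur in the candidate (clipped counts)."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # An n-gram counts at most as often as it appears in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


# Hypothetical system summary and reference summary.
candidate = "the storm hit the coast"
reference = "the storm flooded the coast"
unigram_recall = ngram_recall(candidate, reference, n=1)
bigram_recall = ngram_recall(candidate, reference, n=2)
```

Four of the five reference unigrams and two of the four reference bigrams appear in the candidate, giving recalls of 0.8 and 0.5.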

Sentence Extraction

Evaluation
• Results for single-document summarization

31 / 33

Sentence Extraction

Evaluation
• Discussion
  – Represents a summarization model closer to what humans are doing
    • Fully unsupervised
    • Relies only on the given text to derive an extractive summary
  – TextRank goes beyond the sentence “connectivity” in a text
    • Sentence 15 would not be identified as “important” based on the number of connections
    • But it is identified as “important” by TextRank
    • Humans also identify the sentence as “important”
  – TextRank gives a ranking over all sentences in a text
    • It can be easily adapted to extracting very short or longer summaries

32 / 33

Conclusion
• We introduce TextRank
  – Graph-based ranking model for text processing
• We show how it can be successfully used for natural language applications
  – Keyword extraction
  – Sentence extraction
  – Accuracy achieved by TextRank is competitive
• TextRank
  – Does not require deep linguistic knowledge, nor domain- or language-specific annotated corpora
  – Highly portable to other domains, genres, or languages

33 / 33
