Chapter 16 - Richard (Rick) Watson

January 9, 2018 | Author: Anonymous | Category: Math, Statistics And Probability, Statistics

Natural language processing (NLP)

"From now on I will consider a language to be a set (finite or infinite) of sentences, each finite in length and constructed out of a finite set of elements. All natural languages in their spoken or written form are languages in this sense."

Noam Chomsky

Levels of processing

Semantics
Focuses on the study of the meaning of words and the interactions between words to form larger units of meaning (such as sentences).

Discourse
Building on the semantic level, discourse analysis aims to determine the relationships between sentences.

Pragmatics
Studies how context, world knowledge, language conventions, and other abstract properties contribute to the meaning of text.

Evolution of translation


NLP

Text is more difficult to process than numbers
Language has many irregularities
Typical speech and written text are not perfect
Don’t expect perfection from text analysis


Sentiment analysis

A popular and simple method of measuring aggregate feeling
Give a score of +1 to each “positive” word and -1 to each “negative” word
Sum the total to get a sentiment score for the unit of analysis (e.g., tweet)
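The scoring rule above can be sketched in a few lines of base R. This is a minimal illustration, not the chapter's own code: the word lists, the example tweets, and the function name `sentiment_score` are all made up for the example, and a real analysis would normally rely on a published sentiment lexicon.

```r
# Hypothetical word lists, for illustration only
positive_words <- c("good", "great", "happy", "love")
negative_words <- c("bad", "awful", "sad", "hate")

# Score one unit of analysis (e.g., a tweet)
sentiment_score <- function(text) {
  # Lowercase, strip punctuation, then split at whitespace
  cleaned <- gsub("[[:punct:]]", "", tolower(text))
  tokens <- strsplit(trimws(cleaned), "[[:space:]]+")[[1]]
  # +1 for each positive word, -1 for each negative word
  sum(tokens %in% positive_words) - sum(tokens %in% negative_words)
}

# Hypothetical tweets
tweets <- c("I love this great phone",
            "Awful service, bad food",
            "The bus was late")
scores <- sapply(tweets, sentiment_score, USE.NAMES = FALSE)
print(scores)
```

Words that appear in neither list contribute nothing, so a text with no listed words scores 0.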


Shortcomings

Irony
The name of Britain’s biggest dog (until it died) was Tiny

Sarcasm
I started out with nothing and still have most of it left

Word analysis
“Not happy” scores +1
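The word-analysis shortcoming is easy to reproduce. The sketch below assumes a tiny hypothetical lexicon and a hypothetical helper, `naive_score`, that applies the +1/-1 rule one word at a time. Because each word is scored in isolation, the negation is ignored.

```r
# Hypothetical word lists, for illustration only
positive_words <- c("happy", "good")
negative_words <- c("sad", "bad")

# Word-by-word scoring: +1 per positive word, -1 per negative word
naive_score <- function(text) {
  tokens <- strsplit(tolower(text), "[[:space:]]+")[[1]]
  sum(tokens %in% positive_words) - sum(tokens %in% negative_words)
}

score_happy <- naive_score("happy")
score_not_happy <- naive_score("not happy")

# "not" is in neither list, so both phrases get the same score
print(c(happy = score_happy, not_happy = score_not_happy))
```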

Tokenization

Breaking a document into chunks
Tokens
Typically words
Break at whitespace
Create a “bag of words”
Many operations are at the word level
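As a rough sketch of these steps, the base R code below breaks a made-up document at whitespace and then tabulates the tokens into a bag of words. Base R's `strsplit` and `table` are used here so that the example has no package dependencies; the sample text is an assumption.

```r
# Hypothetical document
doc <- "the cat sat on the mat"

# Break at runs of whitespace; trimws() avoids an empty leading token
tokens <- strsplit(trimws(doc), "[[:space:]]+")[[1]]

# A bag of words keeps the count of each word and discards word order
bag <- table(tokens)

print(tokens)
print(bag)
```

The bag records that "the" occurs twice but no longer knows where in the sentence it occurred, which is why it suits operations that work at the word level.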


Terminology

N
Corpus size
Number of tokens

V
Vocabulary
Number of distinct tokens in the corpus
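To make the two measures concrete, the sketch below computes N and V for a small made-up corpus of two documents, again breaking at whitespace with base R. The corpus text is an assumption for illustration.

```r
# Hypothetical corpus of two short documents
corpus <- c("the cat sat on the mat", "the dog sat")

# Tokenize every document and pool the tokens
tokens <- unlist(strsplit(corpus, "[[:space:]]+"))

N <- length(tokens)          # corpus size: number of tokens
V <- length(unique(tokens))  # vocabulary: number of distinct tokens

print(c(N = N, V = V))
```

Here N counts every occurrence, repeats included, while V counts each distinct token once, so V can never exceed N.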


Count the number of words

library(stringr)
# split a string into a list of words
y
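The listing above is cut off in this copy, so what follows is only a guess at the general approach rather than the original code. It assumes stringr's `str_split`, which returns a list with one character vector per input string, and uses a sentence from the earlier NLP slide as sample input. The variable names are hypothetical.

```r
library(stringr)

# Sample input (assumption): a sentence borrowed from the NLP slide
sentence <- "Typical speech and written text are not perfect"

# Split at runs of whitespace; the result is a list of length one
words <- str_split(sentence, "[[:space:]]+")

# The first list element holds the words of the first (only) string
word_count <- length(words[[1]])
print(word_count)
```

Note that `length(words)` would report 1, the number of input strings, so the double square brackets are needed to reach the vector of words.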