Chapter 16 - Richard (Rick) Watson

January 8, 2018 | Author: Anonymous | Category: Math, Statistics And Probability, Statistics


Natural language processing (NLP)

From now on I will consider a language to be a set (finite or infinite) of sentences, each finite in length and constructed out of a finite set of elements. All natural languages in their spoken or written form are languages in this sense.

Noam Chomsky

Levels of processing

Semantics
Focuses on the study of the meaning of words and the interactions between words to form larger units of meaning (such as sentences)

Discourse
Building on the semantic level, discourse analysis aims to determine the relationships between sentences

Pragmatics
Studies how context, world knowledge, language conventions and other abstract properties contribute to the meaning of text

Evolution of translation


NLP

Text is more difficult to process than numbers
Language has many irregularities
Typical speech and written text are not perfect
Don’t expect perfection from text analysis


Sentiment analysis

A popular and simple method of measuring aggregate feeling
Give a score of +1 to each “positive” word and -1 to each “negative” word
Sum the total to get a sentiment score for the unit of analysis (e.g., tweet)
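
The scoring rule above can be sketched in a few lines of base R. This is a minimal illustration, not the chapter's own code: the word lists, the example tweets, and the function name sentiment_score are all assumptions made up for the example, and a real analysis would use a published sentiment lexicon.

```r
# Hypothetical word lists, for illustration only
positive_words <- c("good", "great", "happy", "love")
negative_words <- c("bad", "awful", "sad", "hate")

# Score one unit of analysis (e.g., a tweet):
# +1 for each positive word, -1 for each negative word
sentiment_score <- function(text) {
  # lower-case, strip punctuation, then break at whitespace
  cleaned <- gsub("[[:punct:]]", "", tolower(text))
  words <- strsplit(trimws(cleaned), "\\s+")[[1]]
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

# Made-up tweets
tweets <- c("I love this great phone",
            "Awful service, I hate waiting",
            "The parcel arrived today")

# One score per tweet
scores <- sapply(tweets, sentiment_score, USE.NAMES = FALSE)
print(scores)
```

Words that appear in neither list contribute nothing, so a tweet with no listed words scores 0.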


Shortcomings

Irony
The name of Britain’s biggest dog (until it died) was Tiny

Sarcasm
I started out with nothing and still have most of it left

Word analysis
“Not happy” scores +1
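
The word-analysis shortcoming is easy to reproduce. The sketch below assumes a hypothetical one-word lexicon and a made-up function name, naive_score; it only demonstrates that scoring word by word ignores negation.

```r
# Hypothetical minimal lexicon, for illustration only
positive_words <- c("happy")
negative_words <- c("sad")

# Word-by-word scoring: each word is judged in isolation
naive_score <- function(text) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

happy <- naive_score("Happy")
not_happy <- naive_score("Not happy")

# Both phrases receive the same score because "not" is not in either list
print(c(happy = happy, not_happy = not_happy))
```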

Tokenization

Breaking a document into chunks
Tokens
  Typically words
  Break at whitespace
Create a “bag of words”
  Many operations are at the word level
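
As a rough sketch of these two steps, the base R code below breaks a made-up sentence at whitespace and then counts each token to form a bag of words. Lower-casing and stripping punctuation first are assumptions added for the example, so that "The" and "the" count as the same token.

```r
# Made-up document
doc <- "The cat sat on the mat. The mat was red."

# Normalize: lower-case and remove punctuation (an assumption, not required by tokenization itself)
cleaned <- gsub("[[:punct:]]", "", tolower(doc))

# Tokenize: break at whitespace
tokens <- strsplit(cleaned, "\\s+")[[1]]

# Bag of words: word order is discarded, only counts are kept
bag <- table(tokens)
print(bag)
```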


Terminology

N
Corpus size
Number of tokens

V
Vocabulary
Number of distinct tokens in the corpus
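
Both quantities can be computed directly once a corpus is tokenized. The example below is a sketch in base R using a made-up three-document corpus; the variable names simply mirror the N and V of the definitions.

```r
# Made-up corpus of three short documents
corpus <- c("the quick brown fox",
            "the lazy dog",
            "the fox jumps")

# Tokenize every document at whitespace and pool the tokens
tokens <- unlist(strsplit(tolower(corpus), "\\s+"))

N <- length(tokens)          # corpus size: number of tokens
V <- length(unique(tokens))  # vocabulary: number of distinct tokens

print(c(N = N, V = V))
```

Repeated words such as "the" and "fox" add to N each time they occur but count only once toward V.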


Count the number of words

library(stringr)
# split a string into a list of words
y

