How to remove stop words from a document or a bundle of documents

Although there are different ways of removing stop words from a document (or a bundle of documents), an easy way is to do so with NLTK (the Natural Language Toolkit) in Python.
You can use the stopword lists that ship with NLTK together with its built-in tokenization functionality to do the work.
A simple example would be:
>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> import string
>>> sent = "this is a message containing stopwords."
>>> stop = stopwords.words('english') + list(string.punctuation)
>>> [i for i in word_tokenize(sent.lower()) if i not in stop]
['message', 'containing', 'stopwords']
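Note that the stopword lists and tokenizer models are not bundled with the NLTK package itself. If the snippet above raises a LookupError, the required data can be downloaded once, roughly as follows (the exact resource names can differ slightly between NLTK versions, e.g. newer releases use a 'punkt_tab' tokenizer resource):
>>> import nltk
>>> nltk.download('stopwords')  # word lists used by stopwords.words()
>>> nltk.download('punkt')      # tokenizer models used by word_tokenize()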

In case there are specific stopwords that you would like to keep in the text (for example, operators such as 'and' and 'not'), you can always put them in a set and subtract it from the stopword list.
operators = set(('and', 'not'))
stop = set(stopwords.words('english')) - operators

The filtering condition stays the same as above:
if i not in stop:
    # use the word
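Putting the pieces together, a minimal sketch for a bundle of documents could look like the following (the sample sentences and the documents variable are invented purely for illustration):
>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> import string
>>> operators = set(('and', 'not'))
>>> stop = set(stopwords.words('english') + list(string.punctuation)) - operators
>>> documents = ["this is not a spam message.", "this is a spam message and an ad."]
>>> [[w for w in word_tokenize(doc.lower()) if w not in stop] for doc in documents]
[['not', 'spam', 'message'], ['spam', 'message', 'and', 'ad']]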
