How to remove stop words from a document or a bundle of documents

Although there are different ways of removing stop words from a document (or a bundle of documents), an easy way is to use NLTK (the Natural Language Toolkit) for Python.
You can use the stop word lists that ship with NLTK together with its built-in tokenizer to do the work (the stop word corpus and the tokenizer data have to be downloaded once with nltk.download()).
A simple example would be:
>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> import string
>>> sent = "this is a message containing stopwords."
>>> stop = set(stopwords.words('english')) | set(string.punctuation)
>>> [i for i in word_tokenize(sent.lower()) if i not in stop]
['message', 'containing', 'stopwords']

If there are specific stop words that you would like to keep in the text, you can collect them in a set and subtract that set from the stop word list.
operators = set(('and', 'not'))
stop = set(stopwords.words('english')) - operators

The condition stays the same as above:
if i not in stop:
    # use word
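
The same idea can be applied to a bundle of documents. The following is only a minimal sketch: the function name remove_stop_words, the keep parameter, the regex tokenizer and the small hand-written stop list are assumptions made for illustration, so that the sketch runs without NLTK data. In practice, stopwords.words('english') and word_tokenize from the example above could be used instead.

```python
import re
import string


def remove_stop_words(documents, stop_words, keep=()):
    """Return each document as a list of lower-cased tokens without stop words."""
    # Words listed in `keep` stay in the text even if they are on the stop list.
    stop = (set(stop_words) | set(string.punctuation)) - set(keep)
    cleaned = []
    for text in documents:
        # Simplified tokenizer (an assumption): runs of letters, digits, apostrophes.
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        cleaned.append([t for t in tokens if t not in stop])
    return cleaned


# A tiny hand-written stop list, used here only for illustration.
sample_stop = ["this", "is", "a", "and", "not", "the"]
docs = [
    "This is a message containing stopwords.",
    "The cat and the dog are not friends.",
]
result = remove_stop_words(docs, sample_stop, keep=("and", "not"))
print(result)
```

Building the stop set once and reusing it for every document keeps each membership test cheap, which matters more as the bundle grows.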

Information Management and Information Retrieval Modules

I was given the chance to co-author (with Prof. Dr. D. Doherr) an article for the scientific journal “Praxis der Informationsverarbeitung und Kommunikation”. The article describes some innovations of the Humboldt Digital Library project in the fields of Information Retrieval and Information Representation.

The article describes some methods that were used in the Humboldt Digital Library to improve the findability of information within the works of Alexander von Humboldt.

(part of the article: Section 3.1)

With the rise of new search engines such as Wolfram Alpha, the Internet today shows that users are no longer satisfied with Boolean search. Although algorithms such as PageRank and HITS add great value to the results, the information they return is purely term related. Search in those engines is handled by comparing terms for identical or similar words (with little or no Natural Language Processing at all). This type of search can be considered a horizontal search, as it basically searches the surface of the information without digging deeper. There is another, not yet fully explored concept of search, referred to as vertical search. Vertical search is an old concept of mining for knowledge, but it is rarely put into practice in public systems. The number of choices and combinations of factors that influence vertical search is very high (it can be considered infinite, because knowledge also includes a random chance of discovery); therefore no system has yet delivered a state-of-the-art solution for this kind of search. The idea behind vertical search is that visitors search for topics and not for terms. Following this logic, we have enhanced our digital library with a rich Information Retrieval (IR) module.

The IR module is the nucleus of our digital services. Usually, the visitors of a digital library are either rewarded by the richness of the options in the IR module or, in the opposite scenario, limited to what the IR module confines them to do.

Fig 1. General description of the modules of HDL

As shown in Figure 1, the information in the Humboldt Digital Library and network is transformed in four layers.
The top layer handles the communication between the visitors and the system. Besides the Web-CMS Contrexx, which provides just a general informative module about the project (of no interest for knowledge mining in the DL), the modules in this layer handle continuous fetch and push operations. These operations provide the exchange of information between the system and the visitors.
It is a fact that systems serve better when they know their inhabitants. This is valid for the computer domain as well. If the system knows the background and interests of the users, it can filter and provide them with specific information. Based on the interests of the user, our system creates a set of statistical information about the paragraphs that may be of interest to the user by analyzing the experience of other users with a similar profile. By means of Personal Profiling, Personal Notes and Favorite Bookmarks, the system retrieves information about the interests of each user. In the Personal Profile section, users may add information such as disciplines of interest, general interests and regional interests. The composition of these interests provides a cosmos for each user.

While users interact with the system through content browsing, the IR search tools or by writing personal notes related to a paragraph, they provide important feedback to the system. The system is basically learning which paragraphs are of interest to users with a similar profile. The visitors of a digital library jump around its space in search of the correct information. While they jump through links and documents, they leave behind disconnected traces of what they want and how they interact with the system. When these traces are connected to user profiles and user interests, they provide useful mining data that can be applied to other users who share the same preferences. The interactions of the user with the system are handled (stored and analyzed) by a Logger. Based on these interactions, an algorithm provides, for each user profile, suggestions on the information that may better serve the user's needs. The Logger together with the suggestion algorithm forms the Case-Based Reasoning (CBR) Engine. The CBR Engine takes into consideration: clicks (visits), user personal notes, editor public notes, bookmarking of paragraphs, etc.
The CBR Engine stores an authority weight in the database. This authority weight is the influence weight, composed from the union of the specific weights of the above options. The Authority Weight (AW) expresses a value of Interest for each paragraph in the Humboldt Digital Library. Once a certain AW value for a specific Interest (e.g. 5000) has been reached, the CBR system creates a notification that the paragraph should be suggested to the users with this specific Interest. This notification is shown as the user navigates to the specific paragraph.
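
The accumulation of the authority weight could be sketched roughly as follows. This is not the HDL implementation: the class name, the per-interaction weights and the in-memory dictionary (instead of the database) are assumptions made for illustration. Only the interaction types and the example threshold of 5000 come from the text above.

```python
from collections import defaultdict

# Hypothetical weight per interaction type; the article does not give these values.
INTERACTION_WEIGHTS = {
    "click": 1,
    "bookmark": 10,
    "personal_note": 20,
    "editor_note": 50,
}
SUGGESTION_THRESHOLD = 5000  # example value mentioned in the text


class AuthorityWeights:
    """Accumulates an authority weight (AW) per paragraph and interest."""

    def __init__(self):
        # (paragraph_id, interest) -> accumulated AW; stands in for the database
        self.weights = defaultdict(int)

    def log(self, paragraph_id, interest, interaction):
        """Record one interaction by a user whose profile lists `interest`."""
        self.weights[(paragraph_id, interest)] += INTERACTION_WEIGHTS[interaction]

    def suggestions(self, interest):
        """Paragraphs whose AW for this interest has reached the threshold."""
        return sorted(
            paragraph_id
            for (paragraph_id, i), weight in self.weights.items()
            if i == interest and weight >= SUGGESTION_THRESHOLD
        )


aw = AuthorityWeights()
for _ in range(100):
    aw.log("p1", "botany", "editor_note")  # 100 * 50 = 5000
aw.log("p2", "botany", "click")
print(aw.suggestions("botany"))
```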

The level of relevance of each paragraph to the profile of the visitor is presented by a Heat Map. The Heat Map provides three levels of relevance. The levels are marked in different colors, and they are an expression of relevance based on the AW/Interest value. The Heat Map shows only the three top value ranges.
By using the colors of the Heat Map and wrapping the paragraphs in those colors, the system informs the visitors that the paragraphs might be interesting for their profile.
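
Mapping an AW value to one of the three Heat Map levels might look like the sketch below. The lower bounds and the color names are hypothetical, since the article states only that there are three levels.

```python
# Hypothetical lower bounds for the three levels, highest first.
HEAT_LEVELS = [
    (15000, "red"),
    (10000, "orange"),
    (5000, "yellow"),
]


def heat_color(authority_weight):
    """Return the Heat Map color for an AW value, or None below the lowest level."""
    for lower_bound, color in HEAT_LEVELS:
        if authority_weight >= lower_bound:
            return color
    return None  # the paragraph is not highlighted


for value in (20000, 12000, 5000, 100):
    print(value, heat_color(value))
```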
As can be seen in Figure 1, the CBR stands as a bridge between the User Interface, the Services and the Storage System. The CBR together with the Natural Language Processing (NLP) module serves as the transforming engine in the second layer. The task of these two modules is to transform the search terms or search implications into one or multiple topics of search.
The term “Natural Language Processing” is normally used to describe the function of software or hardware components in a computer system which analyze or synthesize spoken or written language. A real translation from human language to machine code is handled by “Natural Language Understanding” (NLU). Implementing a full NLU system is a very challenging task which involves the work of many specialists from different scientific fields. The intent of this project was not to research computational linguistics methods, but to facilitate the search for information in the DL. By introducing Thematic Variables such as Location, Time, Persons, etc., our system can provide a simple NLP which can translate some phrases into correct queries. The set of thematic variables is referred to as the multi-variables or the multi-variable search path. This is just another approach to retrieving information related to each paragraph in the Humboldt Digital Library. The multi-variable search path relies on additional thematic variables which are related to each paragraph. Every document or paragraph can be additionally described by a:

  • Theme
  • Time
  • Area / Location

Although Time and Location are pretty self-explanatory, the theme is a wide subject. In our normal spoken or narrative written language, we hide a lot of information. We may write a description about a country and not mention its name right away, we may speak of a person referring to him only once in the beginning of our speech, etc. The Boolean search will not provide any information in these cases, as the theme is hidden.
This is the case where the thematic variables come in handy. The thematic variables provide additional information related to each paragraph in the HDL, describing what the paragraph covers in the space of individuals, scientific observations, category and an infinite set of other options. Basically this means that for each paragraph we have a commentary on the location where it was written and the locations it mentions, the date or period it describes, the people it mentions and so on. To provide such information, a separate thematic structure has been created for each paragraph in the HDL and filled in with information from our Content Provider partners. More than 20 000 records of data were gathered for the first document of Humboldt alone, the ‘Personal Narrative of Travels to the Equinoctial Regions of the New Continent during the Years 1799-1804’.
The category theme is also important. It may contain a set of keywords which, if hit by the search engine, will suggest additional paragraphs. The other aspect of the theme category is related to the personal interests of the visitor of the website.
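
A search along the thematic variables could be sketched as below. The Paragraph structure, its field names and the sample record are invented for illustration and do not reflect the actual HDL schema or data. The point is only that a paragraph can be found through its location, person or period even when the text itself never mentions them.

```python
from dataclasses import dataclass, field


@dataclass
class Paragraph:
    text: str
    themes: set = field(default_factory=set)      # category keywords
    locations: set = field(default_factory=set)
    persons: set = field(default_factory=set)
    years: tuple = ()                             # (first_year, last_year) or empty


def matches(paragraph, term=None, location=None, person=None, year=None):
    """True if the paragraph satisfies every criterion that was given."""
    if term is not None:
        t = term.lower()
        # A term may hit the text itself or one of the theme keywords.
        if t not in paragraph.text.lower() and t not in paragraph.themes:
            return False
    if location is not None and location.lower() not in paragraph.locations:
        return False
    if person is not None and person.lower() not in paragraph.persons:
        return False
    if year is not None:
        if not paragraph.years:
            return False
        if not paragraph.years[0] <= year <= paragraph.years[1]:
            return False
    return True


# Invented sample record: the text names neither the river nor the companion.
sample = Paragraph(
    text="We reached the river at dawn and followed it for three days.",
    themes={"travel"},
    locations={"orinoco"},
    persons={"bonpland"},
    years=(1800, 1800),
)

print(matches(sample, term="orinoco"))      # plain term search misses it
print(matches(sample, location="Orinoco"))  # thematic variable finds it
```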

This is part of the article: “Information Management beyond Digital Libraries: Alexander von Humboldt in the Web” DOI Reference: 10.1515/piko.2009.0030. For the full article, you may follow the link http://www.reference-global.com/doi/abs/10.1515/piko.2009.0030 .

List of English Stop Words

Stop Words

Stop words are words that carry too little significance to be useful in search queries. These words are usually filtered out of search queries because they return a vast amount of unnecessary information. A better definition is provided below:

“Words that do not appear in the index in a particular database because they are either insignificant (i.e., articles, prepositions) or so common that the results would be higher than the system can handle (as in the case of IUCAT where terms such as United States or Department are stop words in keyword searching.) Stop words vary from system to system. Also, some systems will merely ignore stop words where use of stop words in other systems will result in retrieving zero hits. ”

http://www.iusb.edu/~libg/instruction/helpguide/handouts/2005Boolean.shtml

Since I needed to use them in a project (the Humboldt Digital Library and Network), I am posting here a list of English stop words and, below it, a PHP array containing these words.

Here is a list of English stop words:

a
about
above
across
after
afterwards
again
against
all
almost
alone
along
already
also
although
always
am
among
amongst
amoungst
amount
an
and
another
any
anyhow
anyone
anything
anyway
anywhere
are
around
as
at
back
be
became
because
become
becomes
becoming
been
before
beforehand
behind
being
below
beside
besides
between
beyond
bill
both
bottom
but
by
call
can
cannot
cant
co
computer
con
could
couldnt
cry
de
describe
detail
do
done
down
due
during
each
eg
eight
either
eleven
else
elsewhere
empty
enough
etc
even
ever
every
everyone
everything
everywhere
except
few
fifteen
fifty
fill
find
fire
first
five
for
former
formerly
forty
found
four
from
front
full
further
get
give
go
had
has
hasnt
have
he
hence
her
here
hereafter
hereby
herein
hereupon
hers
herself
him
himself
his
how
however
hundred
i
ie
if
in
inc
indeed
interest
into
is
it
its
itself
keep
last
latter
latterly
least
less
ltd
made
many
may
me
meanwhile
might
mill
mine
more
moreover
most
mostly
move
much
must
my
myself
name
namely
neither
never
nevertheless
next
nine
no
nobody
none
noone
nor
not
nothing
now
nowhere
of
off
often
on
once
one
only
onto
or
other
others
otherwise
our
ours
ourselves
out
over
own
part
per
perhaps
please
put
rather
re
same
see
seem
seemed
seeming
seems
serious
several
she
should
show
side
since
sincere
six
sixty
so
some
somehow
someone
something
sometime
sometimes
somewhere
still
such
system
take
ten
than
that
the
their
them
themselves
then
thence
there
thereafter
thereby
therefore
therein
thereupon
these
they
thick
thin
third
this
those
though
three
through
throughout
thru
thus
to
together
too
top
toward
towards
twelve
twenty
two
un
under
until
up
upon
us
very
via
was
we
well
were
what
whatever
when
whence
whenever
where
whereafter
whereas
whereby
wherein
whereupon
wherever
whether
which
while
whither
who
whoever
whole
whom
whose
why
will
with
within
without
would
yet
you
your
yours
yourself
yourselves

And here is a PHP array with the stop words:
$stopwords = array("a", "about", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "around", "as", "at", "back", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fifty", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", "sixty", "so",
"some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thick", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves");

Updated October 3rd, 2009.

This is the stop word list used by the MySQL full-text search feature:

a’s, able, about, above, according, accordingly, across, actually, after, afterwards, again, against, ain’t, all, allow, allows, almost, alone, along, already, also, although, always, am, among, amongst, an, and, another, any, anybody, anyhow, anyone, anything, anyway, anyways, anywhere, apart, appear, appreciate, appropriate, are, aren’t, around, as, aside, ask, asking, associated, at, available, away, awfully, be, became, because, become, becomes, becoming, been, before, beforehand, behind, being, believe, below, beside, besides, best, better, between, beyond, both, brief, but, by, c’mon, c’s, came, can, can’t, cannot, cant, cause, causes, certain, certainly, changes, clearly, co, com, come, comes, concerning, consequently, consider, considering, contain, containing, contains, corresponding, could, couldn’t, course, currently, definitely, described, despite, did, didn’t, different, do, does, doesn’t, doing, don’t, done, down, downwards, during, each, edu, eg, eight, either, else, elsewhere, enough, entirely, especially, et, etc, even, ever, every, everybody, everyone, everything, everywhere, ex, exactly, example, except, far, few, fifth, first, five, followed, following, follows, for, former, formerly, forth, four, from, further, furthermore, get, gets, getting, given, gives, go, goes, going, gone, got, gotten, greetings, had, hadn’t, happens, hardly, has, hasn’t, have, haven’t, having, he, he’s, hello, help, hence, her, here, here’s, hereafter, hereby, herein, hereupon, hers, herself, hi, him, himself, his, hither, hopefully, how, howbeit, however, i’d, i’ll, i’m, i’ve, ie, if, ignored, immediate, in, inasmuch, inc, indeed, indicate, indicated, indicates, inner, insofar, instead, into, inward, is, isn’t, it, it’d, it’ll, it’s, its, itself, just, keep, keeps, kept, know, knows, known, last, lately, later, latter, latterly, least, less, lest, let, let’s, like, liked, likely, little, look, looking, looks, ltd, mainly, many, may, maybe, me, mean, meanwhile, merely, 
might, more, moreover, most, mostly, much, must, my, myself, name, namely, nd, near, nearly, necessary, need, needs, neither, never, nevertheless, new, next, nine, no, nobody, non, none, noone, nor, normally, not, nothing, novel, now, nowhere, obviously, of, off, often, oh, ok, okay, old, on, once, one, ones, only, onto, or, other, others, otherwise, ought, our, ours, ourselves, out, outside, over, overall, own, particular, particularly, per, perhaps, placed, please, plus, possible, presumably, probably, provides, que, quite, qv, rather, rd, re, really, reasonably, regarding, regardless, regards, relatively, respectively, right, said, same, saw, say, saying, says, second, secondly, see, seeing, seem, seemed, seeming, seems, seen, self, selves, sensible, sent, serious, seriously, seven, several, shall, she, should, shouldn’t, since, six, so, some, somebody, somehow, someone, something, sometime, sometimes, somewhat, somewhere, soon, sorry, specified, specify, specifying, still, sub, such, sup, sure, t’s, take, taken, tell, tends, th, than, thank, thanks, thanx, that, that’s, thats, the, their, theirs, them, themselves, then, thence, there, there’s, thereafter, thereby, therefore, therein, theres, thereupon, these, they, they’d, they’ll, they’re, they’ve, think, third, this, thorough, thoroughly, those, though, three, through, throughout, thru, thus, to, together, too, took, toward, towards, tried, tries, truly, try, trying, twice, two, un, under, unfortunately, unless, unlikely, until, unto, up, upon, us, use, used, useful, uses, using, usually, value, various, very, via, viz, vs, want, wants, was, wasn’t, way, we, we’d, we’ll, we’re, we’ve, welcome, well, went, were, weren’t, what, what’s, whatever, when, whence, whenever, where, where’s, whereafter, whereas, whereby, wherein, whereupon, wherever, whether, which, while, whither, who, who’s, whoever, whole, whom, whose, why, will, willing, wish, with, within, without, won’t, wonder, would, would, wouldn’t, yes, yet, 
you, you’d, you’ll, you’re, you’ve, your, yours, yourself, yourselves, zero

CSV Format

a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your
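
A list in this format can be loaded into a set and used to strip stop words from a search query. The sketch below is only an illustration: the function names are made up, and the string is an abridged excerpt of the CSV line above rather than the full list.

```python
import csv
import io


def load_stop_words(csv_text):
    """Parse comma-separated stop words into a set of lower-cased words."""
    reader = csv.reader(io.StringIO(csv_text))
    return {word.strip().lower() for row in reader for word in row if word.strip()}


def strip_query(query, stop_words):
    """Split a query on whitespace and drop every stop word."""
    return [word for word in query.lower().split() if word not in stop_words]


# Abridged excerpt of the CSV list above, used only for illustration.
abridged = "a,about,all,an,and,in,is,of,the,to,what,which"
stop_words = load_stop_words(abridged)
print(strip_query("What is the climate of the Orinoco", stop_words))
```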

I have also created another article where you can download stop words in csv, txt or as a php file