Tuesday, August 11, 2009

Text mining – Stop Words Are Everything

We often find search engine results that are way off track, and we keep wondering how the search engine could produce them when we were very specific about our search terms.

Let us see why this happens.

Search engines usually crawl through your website, copy its text content and index it after filtering out Stop words. Stop words are considered useless or less relevant to search results, so they are ignored.
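The crawl, filter and index steps can be pictured with a minimal sketch. It is only an illustration under assumptions: the page names, the tiny Stop word set and the helper names tokenize and build_index are made up for this example, and real search engines are far more elaborate.

```python
import re
from collections import defaultdict

# Tiny illustrative subset of a Stop word list (an assumption, not any engine's real list).
STOP_WORDS = {"i", "do", "not", "a", "an", "the", "and", "or", "because", "it", "is"}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_index(pages):
    """Map every non-Stop word to the set of pages that contain it."""
    index = defaultdict(set)
    for page, text in pages.items():
        for word in tokenize(text):
            if word not in STOP_WORDS:
                index[word].add(page)
    return index

# Hypothetical pages standing in for crawled content.
pages = {
    "blog/1": "I do not play basketball.",
    "blog/2": "The team and I play basketball because it is fun.",
}

index = build_index(pages)
print(sorted(index["basketball"]))  # both pages are indexed under this keyword
print("not" in index)               # the Stop word never reaches the index
```

Once a word is filtered out at this stage, no later query can bring it back, because it was never stored.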

What are Stop words?

Words such as and, because, or, not, and a host of other words which basically form the foundation of the English language are Stop words. (A list of English Stop words is given at the end of this article.)

Let us go through an example:

In a blog, I have written: I do not play basketball. (A simple sentence for us to understand, isn’t it?)

Now let us see how the search engine understands it:

The search engine crawls through this page, picks up the entire sentence “I do not play basketball”, removes the Stop words do and not, and then saves the keywords: I play basketball. Maybe it will also remove the “I” and save just play basketball.
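Here is a rough sketch of that filtering in Python. The Stop word set is a small assumed subset and the keywords function is a hypothetical helper written for this example, not what any particular search engine runs.

```python
# Assumed subset of the Stop word list, just enough for this sentence.
STOP_WORDS = {"i", "do", "not"}

def keywords(sentence):
    """Lowercase the sentence, drop end punctuation and remove Stop words."""
    words = sentence.lower().strip(".!?").split()
    return [w for w in words if w not in STOP_WORDS]

negative = keywords("I do not play basketball.")
positive = keywords("I play basketball.")

print(negative)              # ['play', 'basketball']
print(negative == positive)  # True
```

After filtering, the sentence and its opposite are reduced to exactly the same keywords, so the index cannot tell them apart.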

Now you know why, when you search for basketball players, you get just the opposite.

Search engines are more accurate when you search for products and services, since ignoring the Stop words may not affect the meaning of the information provided. But when you go into personal details, blog entries and articles, you will get less accurate results.

If search results are to get better, and if we are looking forward to a semantic web (intelligent web), then Stop words are everything. Unless they are incorporated into search, and clustering is done with a strong emphasis on Stop words, search results aren’t going to get any better.

For example, Stop words like A, AN and ONE should be clustered together when the result has to produce information related to a single item. Words like AND should bring two keywords closer together in the cluster, and NOT should push them further apart, so that the algorithm gives a more accurate result. Stemming related Stop words to one another using multiple algorithms will help clustering.
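One way to picture this is a small sketch that keeps some Stop words as signals instead of discarding them. Everything here is an assumption made for illustration: the word groups, the analyze function, the `<ONE>` marker and the NOT_ prefix are invented for this example, negation is crudely applied to every word that follows it, and the AND case is left out.

```python
# Hypothetical word groups, chosen only for this example.
SINGULAR = {"a", "an", "one"}        # all mean "a single item"
NEGATORS = {"not", "no", "never"}    # flip the meaning of what follows
FILLER = {"i", "do", "the"}          # words this sketch still discards

def analyze(sentence):
    """Turn a sentence into features while keeping negation and quantity."""
    tokens = sentence.lower().strip(".!?").split()
    features = []
    negate = False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True                 # mark every later keyword as negated
        elif tok in SINGULAR:
            features.append("<ONE>")      # A, AN and ONE collapse to one marker
        elif tok in FILLER:
            continue
        else:
            features.append("NOT_" + tok if negate else tok)
    return features

print(analyze("I do not play basketball."))  # ['NOT_play', 'NOT_basketball']
print(analyze("I play basketball."))         # ['play', 'basketball']
print(analyze("I own a ball.") == analyze("I own one ball."))  # True
```

With features like these, the two basketball sentences no longer collapse into the same keywords, while A and ONE still end up in the same group.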

I would rather suggest that Stop words be renamed Constructs, since they form the pillars of any text and are the ultimate differentiators that help get to the core of the underlying information in the text. This will lead us towards Artificial Intelligence.

List of Stop words: a, about, above, across, after, afterwards, again, against, all, almost, alone, along, already, also, although, always, am, among, amongst, amoungst, amount, an, and, another, any, anyhow, anyone, anything, anyway, anywhere, are, around, as, at, back, be, became, because, become, becomes, becoming, been, before, beforehand, behind, being, below, beside, besides, between, beyond, bill, both, bottom, but, by, call, can, cannot, cant, co, con, could, couldn’t, cry, de, describe, detail, do, done, down, due, during, each, eg, eight, either, eleven, else, elsewhere, empty, enough, etc, even, ever, every, everyone, everything, everywhere, except, few, fifteen, fifty, fill, find, fire, first, five, for, former, formerly, forty, found, four, from, front, full, further, get, give, go, had, has, hasn’t, have, he, hence, her, here, hereafter, hereby, herein, hereupon, hers, herself, him, himself, his, how, however, hundred, ie, if, in, inc, indeed, interest, into, is, it, its, itself, keep, last, latter, latterly, least, less, ltd, made, many, may, me, meanwhile, might, mill, mine, more, moreover, most, mostly, move, much, must, my, myself, name, namely, neither, never, nevertheless, next, nine, no, nobody, none, noon, nor, not, nothing, now, nowhere, of, off, often, on, once, one, only, onto, or, other, others, otherwise, our, ours, ourselves, out, over, own, part, per, perhaps, please, put, rather, re, same, see, seem, seemed, seeming, seems, serious, several, she, should, show, side, since, sincere, six, sixty, so, some, somehow, someone, something, sometime, sometimes, somewhere, still, such, system, take, ten, than, that, the, their, them, themselves, then, thence, there, thereafter, thereby, therefore, therein, thereupon, these, they, thick, thin, third, this, those, though, three, through, throughout, thru, thus, to, together, too, top, toward, towards, twelve, twenty, two, un, under, until, up, upon, us, very, via, was, we, well, were, what, whatever, when, whence, whenever, where, whereafter, whereas, whereby, wherein, whereupon, wherever, whether, which, while, whither, who, whoever, whole, whom, whose, why, will, with, within, without, would, yet, you, your, yours, yourself, yourselves
