Elasticsearch is widely used as a search layer or an analytics engine. Also interesting is the set of features Elasticsearch offers for Natural Language Processing (NLP).

While working on different aspects of NLP, you may find yourself running into one of the following tasks quite often:

  • Tokenization
  • Normalization, Stemming and Lemmatization
  • Stopword cleaning
  • Part of Speech tagging

While there are a number of tools that can help you in the process, most of them open source as well, Elasticsearch handles many of these operations out of the box. As an example, here is how Elasticsearch cleans up stop words, relying on its default English stop word list:

PUT /stopword-testing
{
    "settings": {
        "analysis": {
            "filter": {
                "stopwordEn": {
                    "type":       "stop",
                    "stopwords":  "_english_"
                }
            }
        }
    }
}

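With the above, we have defined a token filter named stopwordEn that uses the built-in English stop word list of ES (_english_). Note that this is a filter, not yet an analyzer: it has no effect until an analyzer references it.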
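A stop filter like the one above only does something once a custom analyzer lists it. As a minimal sketch, the Python function below builds an index-creation body that attaches the filter to a custom analyzer. The analyzer name stopwordAnalyzer and the choice of the standard tokenizer plus the lowercase filter are assumptions made for illustration; they are not part of the original setup.

```python
import json


def build_stopword_settings(filter_name="stopwordEn",
                            analyzer_name="stopwordAnalyzer"):
    """Build an index-creation body with a stop filter wired into a custom analyzer.

    analyzer_name is a hypothetical name chosen for this sketch.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Same filter definition as in the request above.
                    filter_name: {"type": "stop", "stopwords": "_english_"}
                },
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Lowercase first so that "About" can match "about".
                        "filter": ["lowercase", filter_name],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the body so it can be pasted into a PUT /<index> request.
    print(json.dumps(build_stopword_settings(), indent=2))
```

The printed JSON can be sent as the body of an index-creation request; the analyzer can then be invoked by name through the _analyze API.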
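In another post on English Stop Words, I provide a list of stop words (also available for download in CSV, PHP and plain text).
You can create an analyzer with your own list of stop words in the following way. Because the stopword-testing index was already created above, delete it first (DELETE /stopword-testing) or pick a different index name; otherwise the request fails because the index already exists: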

PUT /stopword-testing
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stopwordEn": {
          "type": "stop",
          "stopwords": ["a","able","about","across","after","all","almost","..."]
        }
      }
    }
  }
}
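To make it clearer what a stop analyzer does to text, here is a rough local approximation in Python. It is only a sketch of the behaviour (split on non-letters, lowercase, drop stop words), not the Elasticsearch implementation, and the function name stop_analyze is made up for this example. The stop word list is expected to be lowercase already.

```python
import re


def stop_analyze(text, stopwords):
    """Approximate a stop analyzer: split on non-letters, lowercase, drop stop words."""
    stop = set(stopwords)  # assumed to be lowercase already
    # [^\W\d_]+ matches runs of letters only, so "it's" yields "it" and "s".
    tokens = (t.lower() for t in re.findall(r"[^\W\d_]+", text))
    return [t for t in tokens if t not in stop]


if __name__ == "__main__":
    custom_stopwords = ["a", "able", "about", "across", "after", "all", "almost"]
    print(stop_analyze("About one in eight women are diagnosed", custom_stopwords))
    # prints ['one', 'in', 'eight', 'women', 'are', 'diagnosed']
```

Note that "in" and "are" survive here because they are not in the shortened custom list; with a fuller list they would be removed as well.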

Once the analyzer is defined, removing the stop words and retrieving only the remaining terms is simple: invoke the analyzer through the _analyze API:

POST stopword-testing/_analyze
{
  "analyzer": "stopwordEn",
  "text": "About one in eight women are diagnosed with breast cancer during their lifetime. There's a good chance of recovery if it's detected in its early stages. For this reason, it's vital that women check their breasts regularly for any changes and always get any changes examined by their GP.In rare cases, men can also be diagnosed with breast cancer."
}

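The response from ES is a JSON object with a tokens array. Each entry holds one term that survived stop word removal, along with its character offsets, type and position. The terms are lowercased and are not deduplicated: a word that occurs several times in the text appears once per occurrence.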
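If you only need the terms themselves, a few lines of Python can pull them out of the response. This is a minimal sketch: extract_terms is a hypothetical helper, and the sample dictionary is hand-written in the shape of an _analyze response (a tokens array of objects with a token field), not actual output from ES.

```python
def extract_terms(analyze_response, unique=False):
    """Return the token strings from an _analyze response body (a parsed dict)."""
    terms = [entry["token"] for entry in analyze_response.get("tokens", [])]
    if unique:
        # dict.fromkeys drops duplicates while keeping first-seen order.
        terms = list(dict.fromkeys(terms))
    return terms


if __name__ == "__main__":
    # Hand-written illustrative sample; offsets and positions are not real output.
    sample = {
        "tokens": [
            {"token": "one", "start_offset": 6, "end_offset": 9,
             "type": "word", "position": 1},
            {"token": "women", "start_offset": 19, "end_offset": 24,
             "type": "word", "position": 4},
            {"token": "women", "start_offset": 150, "end_offset": 155,
             "type": "word", "position": 30},
        ]
    }
    print(extract_terms(sample))
    print(extract_terms(sample, unique=True))
```

Passing unique=True gives the set of distinct terms while preserving the order in which they first appear.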
P.S. The text in the example is taken from the Zana Health website.

Posted by xpo6

Software developer in the realm of AI, NLP and black magic.
