How to remove stop words from a document or a bundle of documents

Although there are different ways of removing stop words from a document (or a bundle of documents), an easy way is to use NLTK (the Natural Language Toolkit) in Python.
You can use the stop word lists that ship with NLTK together with its built-in tokenizer to do the work. Both the stop word corpus and the tokenizer models have to be fetched once with nltk.download() before first use.
A simple example would be:
>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> import string
>>> sent = "this is a message containing stopwords."
>>> stop = set(stopwords.words('english') + list(string.punctuation))
>>> [i for i in word_tokenize(sent.lower()) if i not in stop]
['message', 'containing', 'stopwords']

In case the stop word list contains specific words that you would like to keep in your text, you can always create a set of them and subtract it from the stop word list.
operators = set(('and', 'not'))
stop = set(stopwords.words('english')) - operators

The condition stays the same as above:
if i not in stop:
    # use word
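
For a bundle of documents the same filter can simply be applied to each document in turn. The following is a minimal sketch, not NLTK code: the tiny STOP_WORDS set, the regex tokenizer and the function names are assumptions made so that the example runs without downloads. In practice you would pass set(stopwords.words('english')) and tokenize with word_tokenize.

```python
import re

# Tiny illustrative stop word set (an assumption for this sketch);
# with NLTK you would use set(stopwords.words('english')) instead.
STOP_WORDS = {"this", "is", "a", "the", "and", "of", "in", "to"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens and drop stop words."""
    # The regex keeps letters, digits and apostrophes, so punctuation is dropped too.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in stop_words]


def clean_documents(documents, stop_words=STOP_WORDS):
    """Apply remove_stop_words to every document in a {name: text} mapping."""
    return {name: remove_stop_words(text, stop_words)
            for name, text in documents.items()}


documents = {
    "doc1.txt": "This is a message containing stopwords.",
    "doc2.txt": "The removal of stop words in a bundle of documents.",
}
cleaned = clean_documents(documents)
print(cleaned["doc1.txt"])
```

Keeping the stop words in a set makes each membership test cheap, which matters once the bundle grows to many documents.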

DATALHUB: a digital data repository and an alternative to CKAN

This user guide covers using DATALHUB’s web interface to organize, publish and find datasets and digital assets.

Datasets and resources

DATALHUB is oriented toward the concepts of a Dataset and a Datum.
A Dataset is a (digital) data container: a structure that describes digital assets and provides metadata about them. For example, the budget of a country is published as one Dataset. The Dataset provides general information (metadata) on the provenance of the information, licensing and authorship. Digital assets (Datum entities) such as the factual budget and the approved budget are related to the Dataset and made available for download. Each digital asset has additional technical metadata.
A Datum entity is a digital asset or, in simple language, a file.
Datum or a Digital Asset in DATALHUB

Using DATALHUB

Please refer to the Installation guide for the creation of the first user in DATALHUB.

Administration Section

Upon successful login, a mini fly-out administration section is available behind a Gear Icon on the right side of the screen.

Datalhub - Administration menu behind the Gear Icon

Clicking the Gear Icon brings up a simple menu that points to the groups of elements that you are able to alter with the permissions you are granted.

Generally these elements are:

  • Dataset
  • Organisation
  • Category
  • License
  • Partners
  • (User) Profile

Adding a new dataset

Clicking on Dataset in the Administration menu leads to a page listing all the existing datasets. Selecting a dataset from the list opens a form for editing it.

The same form is used for adding a new dataset and is self-explanatory. An important feature is the ability to ingest digital assets in parallel: through this form, users of the website can upload a large number of digital assets at once. Technical metadata for these assets is created on the fly.

DATALHUB will ask for the following information about each dataset.

  • Title – A title describing the dataset.
  • Description – A free form section describing the dataset.
  • Organization – Relates a dataset to an organization. Normally this will point to the organisation that issued or contributed to the creation of the digital dataset.
  • Author – Indicates the creator of the dataset in the DATALHUB.
  • Maintainer – Refers to a person or organisation that maintains and takes care of the dataset. This is a free text field.
  • Created & Revision – Date references that place each operation on a timeline.
  • Lineage – A free text field that should be used to point to prior datasets that are extended in the current dataset. This is a classical provenance reference.
  • Category – Pins the current dataset to a specific category, or topic of interest. The category has to be selected from a drop-down which is populated by the categories defined in the Categories Section.
  • Update Frequency – Will probably never be used, or will be filled with "never" in most cases.
  • Tags – Keywords that best describe the dataset being uploaded. Examples could be “health”, “North Albania” etc.
  • License – It is important to include license information so that people know how they can use the data. This field is a drop-down box.

Adding Organisations, Categories, Licenses

A Dataset contains references to lists of predefined entities in the groups of organisations, categories and licenses. The predefined entities can be amended at any time through a similar modification menu. Clicking on the link for each of these entities in the Administration fly-out menu leads to the respective section, where you can create, edit or remove (if not referenced) instances of each of these entities.

Search

Search is based on an AngularJS faceted search. This will behave well up to about a dozen thousand dataset records. A faceted search based on a document index, which will have no problem handling millions of records, is planned, although work has not started.
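
To illustrate what a faceted search does, here is a small in-memory sketch. It is not DATALHUB's AngularJS code: the sample records, the field names and both functions are assumptions made for the example. A facet is simply a count of records per value of a field, and selecting facet values narrows the result list.

```python
from collections import Counter

# Hypothetical dataset records, loosely following the fields listed earlier.
datasets = [
    {"title": "State budget 2014", "category": "Finance", "tags": ["budget", "health"]},
    {"title": "Hospital registry", "category": "Health", "tags": ["health"]},
    {"title": "Approved budget 2015", "category": "Finance", "tags": ["budget"]},
]


def _values(record, field):
    """Return the values of a field as a list (single values are wrapped)."""
    value = record.get(field)
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def facet_counts(records, field):
    """Count how many records carry each value of the given field."""
    counts = Counter()
    for record in records:
        counts.update(_values(record, field))
    return dict(counts)


def filter_by_facets(records, selected):
    """Keep the records that match every selected {field: value} pair."""
    return [record for record in records
            if all(wanted in _values(record, field)
                   for field, wanted in selected.items())]


print(facet_counts(datasets, "category"))
print(filter_by_facets(datasets, {"category": "Finance", "tags": "health"}))
```

Scanning every record like this is fine for thousands of entries; for millions, a document index is the usual answer, as planned above.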

API

This section documents DATALHUB’s API, for developers who want to write code that interacts with DATALHUB sites and their data. The following operations are available through the API.
GET a JSON-formatted list of a site’s datasets:

http://$DOMAIN/api/dataset

GET the JSON/XML-formatted record of a specific dataset:

http://$DOMAIN/api/dataset/$datasetID

The default response is JSON, but you can easily switch to XML by appending .xml to the end of the call:

http://$DOMAIN/api/dataset.xml

GET JSON/XML-formatted metadata for a specific digital asset:

http://$DOMAIN/api/datum/$datumID

The list of digital assets is provided in the retrieval of a dataset. The get datum call returns the metadata for a digital asset. The download path for a specific digital asset is: http://$DOMAIN/download/download?datumId=$datumID&datasetId=$datasetID

API calls for the creation, upload and update of Datasets and Digital Assets are not active at this stage. If you would like to use this application and need these calls, drop me a line.
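
The read-only calls above can be wrapped in a few helper functions. This is a sketch using only the Python standard library. The function names and the example domain are assumptions. Only the URL patterns come from the text above, and the shape of the JSON responses is not documented here, so fetch_json just returns whatever the server sends.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def dataset_list_url(domain, xml=False):
    """URL for the list of datasets; .xml switches the response format."""
    return f"http://{domain}/api/dataset" + (".xml" if xml else "")


def dataset_url(domain, dataset_id):
    """URL for a single dataset (JSON by default)."""
    return f"http://{domain}/api/dataset/{dataset_id}"


def datum_url(domain, datum_id):
    """URL for the metadata of a single digital asset."""
    return f"http://{domain}/api/datum/{datum_id}"


def download_url(domain, datum_id, dataset_id):
    """URL for downloading the file behind a digital asset."""
    query = urlencode({"datumId": datum_id, "datasetId": dataset_id})
    return f"http://{domain}/download/download?{query}"


def fetch_json(url):
    """Perform a GET request and decode the JSON body (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)


# "data.example.org" is a placeholder domain, not a real DATALHUB site.
print(dataset_list_url("data.example.org"))
print(download_url("data.example.org", 7, 3))
```

A call such as fetch_json(dataset_list_url("data.example.org")) would then return the decoded list of datasets from a running installation.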

Drupal is Horrible – Drupal REVIEW and Criticism

These are a few points written in frustration, so please understand why I feel like screaming Drupal SUCKS! Take this criticism and Drupal review with a slight smile.

(This is all based on the experience with Drupal 8)

I am forced to work with Drupal on some systems here at work and my first impression is: DRUPAL IS HORRIBLE. Here is a list of what is wrong:

  1. Drupal is sold as a CMS, but in fact it is a framework with a CMS demo!
    Yes, it provides basic functionality that makes you believe it is a CMS, but soon enough you will find out that you cannot do much without extending it. Extending it means a. reusing plugins and b. relying on the documentation on the website.
  2. Documentation is almost always incomplete, inaccurate, outdated and confusing
    Yes, you believe you found something and then find out that it is outdated and superseded in the current version. Or the developer of the module wrote something even he would not understand a few minutes later. More like a Perl program that makes sense only in the first 5 seconds after you wrote it.
  3. Plugins are always in beta.
    Or sometimes even alpha. And guess what, people use them! And force you to use them. By the time you need a security upgrade, the plugin is still lingering in beta and you end up with nonfunctional websites.
  4. Modularity!? More like a sledgehammer.
    Modules are great! But use a module when there is a logic to using a module. In Drupal modules are abused! Take the Migration(_plus) module for example.
    It is a module that you install once and should use to migrate data. Logically you need to provide a mapping and invoke/trigger a migration process! In this module, illogically, you need to create a new plugin for each migration, and the migration is triggered once you install it. If the installation goes wrong, there is not much information to tell you whether you screwed up the mapping or whether Drupal was doing something strange. And if you change the mappings, you will have to uninstall/reinstall the module again. And the process is a black-magic box that requires a lot of effort….
  5. You want to work directly with the database!?
    Good luck with that! Assuming you might want to complete a migration process by mapping a file to database tables, you are out of luck. The Drupal database is something that makes sense only through the Drupal internal API.
  6. Server abuse!
    I turned debug on in some old Drupal 6. I was shocked to see the number of queries Drupal was sending to the database on every page view. We all agree that computing power is cheap these days, but I have never seen a system that abuses memory and processing power like a Drupal page. There is no sense of optimization in the database interaction logic of Drupal, and most queries are expensive join queries.

Bottom line: I am terrified of people wanting to use Drupal! But still, there is demand and developers need to provide…

 

Will SSL Implementation, HTTPS improve the ranking of my website?!

This is a question I have been asked a lot lately. While Google says that HTTPS (SSL) is a ranking signal, practice has shown this claim to be incomplete.

In fact, many SEO Researchers say that HTTPS has no real impact at the moment.

Considering the low prices of SSL certificates, it does not hurt to have it implemented. It costs about $18 to buy a cheap SSL certificate. You will need an additional dedicated IP (about $12-24 per year depending on your data center), but it is totally worth it, considering you will have secure communication with your webserver/mailserver.
Usually Google does not penalize in the first years after announcing a new ranking signal. The SSL ranking signal might become more advantageous with future changes to Google's algorithm.

Groovy/Grails Recursive Function/Closure

Since I keep wasting time on recursive functions (and forget what I developed a few months back), here is a piece of code for a recursive function in Groovy.

def getAllChildren(entityId) {
    // Container for the results
    def results = []
    // Retrieve your first element from somewhere
    def entity = entityService.getEntity(entityId)
    if (entity) {
        results.add([entity.id, entity.label])
        // Recurse into every child (children are assumed to hold entity ids)
        entity.children?.each { child ->
            results.addAll(getAllChildren(child))
        }
    }
    return results
}
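
For comparison, the same depth-first recursion can be written as a self-contained Python sketch. The ENTITIES dictionary is a stand-in for entityService and is purely hypothetical, as are the ids and labels in it.

```python
# Hypothetical entity store standing in for entityService.getEntity():
# each entity has a label and a list of child ids.
ENTITIES = {
    1: {"label": "root", "children": [2, 3]},
    2: {"label": "left", "children": [4]},
    3: {"label": "right", "children": []},
    4: {"label": "leaf", "children": []},
}


def get_all_children(entity_id, entities=ENTITIES):
    """Return (id, label) for the entity and all of its descendants, depth-first."""
    results = []
    entity = entities.get(entity_id)
    if entity:
        results.append((entity_id, entity["label"]))
        for child_id in entity.get("children", []):
            # The recursive call returns a list, so extend instead of append.
            results.extend(get_all_children(child_id, entities))
    return results


print(get_all_children(1))
```

As in the Groovy version, the base case is implicit: an unknown id or an entity without children adds nothing further and the recursion stops.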

 

Software Developing – At its best

What we do as developers

Found this gem with the caption: “what it feels like to be a software developer“.

It actually describes in full the process of developing. You try to fix one tiny thing and suddenly you find yourself in the middle of a heavy refactoring which influences the whole application…

From Malcolm in the Middle, but most geeks will recognize this as Bryan Cranston, or as Walter White, changing a light bulb.

Grails – Language prefix in URL mappings – Language in URL as subdirectory

Starting a new project in Grails might lead to the need to support different languages. This can be done through the default ?lang=locale parameter supported natively by Grails, but if you would like to provide an SEO-friendly approach, then you might need to tweak your solution.
First of all, Google has provided a set of recommendations on how to support multi-regional and multilingual sites. When it comes to URLs, they mention that best practices might include using a different country-specific domain for each language, using subdomains, or using subdirectories.

Google Recommendation

URL structures
Consider using a URL structure that makes it easy to geotarget parts of your site to different regions. The following table outlines your options:

URL structure: Country-specific. Example: example.al
Pros:
  • Clear geotargeting
  • Server location irrelevant
  • Easy separation of sites
Cons:
  • Expensive (can have limited availability)
  • Requires more infrastructure
  • Strict ccTLD requirements (sometimes)

URL structure: Subdomains with gTLDs. Example: de.example.com
Pros:
  • Easy to set up
  • Can use Webmaster Tools geotargeting
  • Allows different server locations
  • Easy separation of sites
Cons:
  • Users might not recognize geotargeting from the URL alone (is “de” the language or country?)

URL structure: Subdirectories with gTLDs. Example: example.com/de/
Pros:
  • Easy to set up
  • Can use Webmaster Tools geotargeting
  • Low maintenance (same host)
Cons:
  • Users might not recognize geotargeting from the URL alone
  • Single server location
  • Separation of sites harder

URL structure: URL parameters. Example: site.com?loc=de
Pros:
  • Not recommended.
Cons:
  • URL-based segmentation difficult
  • Users might not recognize geotargeting from the URL alone
  • Geotargeting in Webmaster Tools is not possible

URL Mapping in Grails

Grails has a very neat way of mapping resources to URLs through URL Mappings.

While the default URL Mapping is:

"/$controller/$action?/$id?(.$format)?"{
    constraints {
        // apply constraints here
    }
}

Adding support for directories is as simple as adding a $lang parameter

"/$lang/$controller/$action?/$id?(.$format)?"{
    constraints {
        // apply constraints here
    }
}

A request to /de/controllername/action will then automatically be served in the new language.

The challenge is in defining a default language, so that /controllername/action can be mapped without the need for the $lang parameter.
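
Independent of Grails, the fallback logic behind that challenge can be sketched in a few lines of Python. This is only an illustration of the idea, not Grails code: the supported language codes, the default language and the function name are assumptions.

```python
# Assumed configuration for this sketch.
SUPPORTED_LANGUAGES = {"en", "de", "sq"}
DEFAULT_LANGUAGE = "en"


def split_language(path, supported=SUPPORTED_LANGUAGES, default=DEFAULT_LANGUAGE):
    """Split a URL path into (language, remaining path).

    If the first segment is a known language code it is consumed,
    otherwise the default language is used and the path is kept whole.
    """
    segments = [segment for segment in path.split("/") if segment]
    if segments and segments[0] in supported:
        language, rest = segments[0], segments[1:]
    else:
        language, rest = default, segments
    return language, "/" + "/".join(rest)


print(split_language("/de/book/show/1"))
print(split_language("/book/show/1"))
```

The important design point is that the first segment is only treated as a language when it is in a known list; otherwise a controller name would be mistaken for a language code.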

Download Stop Word List

 

In computing, stop words are words which are filtered out before or after processing of natural language data (text). There is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.

Any group of words can be chosen as the stop words for a given purpose. For some search engines, these are some of the most common, short function words, such as the, is, at, which, and on. In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as ‘The Who’, ‘The The’, or ‘Take That’. Other search engines remove some of the most common words—including lexical words, such as “want”—from a query in order to improve performance.

Below is a group of stop word lists available for download. In response to the interest in the previous article on English Stop Words, I have created a bunch of files for download.

Download PHP Array Stop Word List

CSV Download of English Stop Words

Text file of stop words for download

Handling Exceptions in Grails

Improve error handling – Exceptions

The following document is a brief walk-through of exception handling.

"Grails controllers support a simple mechanism for declarative exception handling. If a controller declares a method that accepts a single argument and the argument type is java.lang.Exception or some subclass of java.lang.Exception, that method will be invoked any time an action in that controller throws an exception of that type. See the following example":

class ElloController {
    def index() {
        def message = "Resource was not found"
        throw new NotFoundException(message)
    }

    // Invoked whenever an action in this controller throws a NotFoundException
    def handleNotFoundException(NotFoundException e) {
        response.status = 404
        render("error found")
    }
}

In the previous example, a simple blank page with the message "error found" will be shown on invocation of the controller.
Important:

  • The exception handler method names can be any valid method name. The name is not what makes the method an exception handler, the Exception argument type is the important part.
  • The exception handler methods can do anything that a controller action can do including invoking render, redirect, returning a model, etc.

We need to avoid including redundant methods in every class. Traits in Groovy provide a clever way to include methods in a class. With regard to exceptions, we create a trait in a Groovy file that includes the necessary exception handlers. Example:

package com.xpo6.exception

trait NotFoundExceptionHandler {
    def handleNotFoundException(NotFoundException e) {
        response.status=404
        render ("error found")
    }
}

This trait can be included in any controller through an implements directive. Thus our old controller becomes:

import com.xpo6.exception.NotFoundExceptionHandler
import com.xpo6.exception.NotFoundException

class ElloController implements NotFoundExceptionHandler {
    def index() { 
        throw new NotFoundException("Resource was not found");
    }
}

As can be seen, the handler methods are no longer in the controller but have been moved into the trait and are included through the implements directive.

Exceptions can easily be thrown in services as well. Controllers consuming such services should implement the trait or declare a method to handle the exceptions.

package test
import com.xpo6.exception.NotFoundException
class ElloService {
    def serviceMethod() {
        throw new NotFoundException("Resource was not found - thrown from service");
    }
}
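
To make the mechanism more tangible, here is a rough Python analogue of the pattern. It is not Grails code and does not claim to mirror Grails internals: the Controller base class, its dispatch method and the (status, body) return tuples are assumptions for illustration. A mixin plays the role of the trait, and the handler is picked by the type of its single argument, not by its name.

```python
import inspect


class NotFoundException(Exception):
    pass


class NotFoundExceptionHandler:
    """Mixin standing in for the trait: reusable across controllers."""

    def any_name_works(self, e: NotFoundException):
        return (404, "error found")


class Controller:
    """Hypothetical base class that dispatches actions and exception handlers."""

    def dispatch(self, action_name):
        try:
            return (200, getattr(self, action_name)())
        except Exception as exc:
            handler = self._find_handler(exc)
            if handler is None:
                raise
            return handler(exc)

    def _find_handler(self, exc):
        # A handler is any method with exactly one parameter whose
        # annotation is an Exception subclass matching the thrown exception.
        for _, method in inspect.getmembers(self, inspect.ismethod):
            params = list(inspect.signature(method).parameters.values())
            if len(params) != 1:
                continue
            annotation = params[0].annotation
            if (inspect.isclass(annotation)
                    and issubclass(annotation, Exception)
                    and isinstance(exc, annotation)):
                return method
        return None


class ElloService:
    def service_method(self):
        raise NotFoundException("Resource was not found - thrown from service")


class ElloController(Controller, NotFoundExceptionHandler):
    def index(self):
        # The exception raised in the service bubbles up to the controller.
        return ElloService().service_method()


print(ElloController().dispatch("index"))
```

The controller itself contains no error handling code; it inherits the handler from the mixin, which is the same effect the trait achieves in the Grails example.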

How ArchivPortal Deutschland and Deutsche Digitale Bibliothek were developed (FrontEnd)

Yesterday another big project in support of German culture was launched at the 84th German Archive Conference. This is the Archivportal-D (with D standing for Deutschland, and APD as its abbreviation). The portal is accessible online at www.archivportal-d.de and it enables users to comb Germany’s archives in the course of their research, all of course free of charge.

The project follows another similar project, the Deutsche Digitale Bibliothek (DDB for short) (https://www.deutsche-digitale-bibliothek.de/), which was launched roughly two years ago. Both projects share similar software stack characteristics and everything is based on Open Source. The public repositories for the front-end development can be found online on GitHub at: https://github.com/Deutsche-Digitale-Bibliothek/

Backend

Both projects are backed by a backend solution named Cortex, provided by Fraunhofer IAIS. Cortex is based on Solr and Lucene indexes, and more information can be found in the research papers published on the topic (check https://www.iais.fraunhofer.de/iais-cortex.html for more info).

API

Following an open collaboration perspective, the DDB has decided to open up the API for everyone. The documentation of this service can be found at: https://api.deutsche-digitale-bibliothek.de/ and registration is free via the DDB web page. The same API is used for APD & DDB and apparently it will stay like that for a while.

FrontEnd

The frontend was based on the Grails framework, which is based on the (sexy) Groovy programming language and builds on top of the Spring Framework. Grails was chosen since it provided a comprehensive programming and configuration model while still allowing development in a modern, transitional Java-based world.

 

As a browser framework, jQuery was used, along with additional libraries.

While Cortex was used for the backend, a few services needed to be developed alongside the front-end. LDAP and yet another existing technology, Elasticsearch, were used in these cases (but with no relation to GORM due to performance paranoia; this might change in the future though).

As mentioned before, the team made use of Git for version control.

Development Environment

Most of our frontend development was done in the Groovy/Grails Tool Suite, an Eclipse-based IDE.

 

Jenkins CI was used for the continuous integration process and the build systems.

 

 

 

Atlassian products such as Crucible/FishEye & Jira were used for code reviews and issue tracking.

 

 

Additional Software

In DDB, some additional PHP-based applications were used, such as Omeka for exhibition presentations and Piwik for analytics.

 

If you want to contribute to the source code of these two projects, please check their repos at: https://github.com/Deutsche-Digitale-Bibliothek/