Forskarbloggen: Crocodile, elephant, or something in-between?

My research is in theoretical computer science with applications in language technology, in particular text classification. For me, it is a very appealing subfield of mathematics, applied to an area that is interesting in its own right. One of the major research efforts that I am involved in is the EU project Media in Context (MICO, for short). The aim of MICO is to develop open-source tools for analysing and searching media assets, which will eventually be integrated into the Apache projects Stanbol and Marmotta. To focus our work, we aim towards a set of concrete use cases. One of these was contributed by the Department of Astronomy at Oxford University, and relates to the crowdsourcing platform Zooniverse.

[Image: MICO logo]

Zooniverse was one of the first attempts at scientific crowdsourcing, and is arguably the most successful. As the story goes, a group of astronomers gained access to a high-resolution telescope and used it to take pictures of galaxies, which were then classified manually based on their shape. Whether a galaxy is organised like a disc or a spiral apparently tells us much about the circumstances under which it was formed. Because the telescope was so effective, an hour-long photo shoot could keep an astronomer busy classifying for years. Since the task was too difficult for computers, which are still surprisingly poor at visual pattern recognition, but comparatively easy even for non-expert humans, the public was invited to partake in the project online.

Today, Zooniverse is host to a wide range of scientific projects. One example is Whale FM, in which researchers try to assemble a complete vocabulary for the language of whales. Another is Worm Watch Lab, where the goal is to learn more about how our genes work by observing how genetically modified worms lay eggs. Within the scope of MICO, we are particularly interested in Snapshot Serengeti. Here, a large set of motion-sensitive cameras have been mounted across the Serengeti and are triggered whenever something passes near. In the resulting images, animals can be seen going about their everyday chores, and the public is asked to help map out what species appear, and what they are doing.


The volunteers can also, if they like, discuss the photos and classification tasks in an online forum called ‘Talk’. Our job in MICO is to build text-classification tools for ‘Talk’ that help us understand what the volunteers think of the images. We want to know which images are found to be aesthetically pleasing, whether there are any images that are particularly hard to classify, whether the selection of images is sufficiently diverse to be interesting, and so forth. We also want to be able to direct the Zooniverse research team to relevant parts of the forum. There is frequent confusion about how to distinguish between similar-looking species, and when this matter is brought up in ‘Talk’, the researchers would like to step in and give advice.

For this reason, we wish to improve the technological toolbox for tasks such as question detection, sentiment analysis (that is, judging whether the tone of a text is positive, negative, or neutral), and named entity recognition (finding occurrences of species and activity names). Our general approach is to improve on existing analysis systems that, simply put, divide the input text into sequences of two or three words before processing it. What we want to do is to make the systems ‘grammatically aware’ by first parsing the input text, and then using the resulting syntactic analysis to guide later classification steps. The drawback is that the systems will need to do more computational work, but the upside is that we gain a deeper understanding of the content. For instance, in the case of sentiment analysis, it is not sufficient to know that a sentence contains the words “not” and “good”; we also want to know how these words relate to one another. Was the movie “not very good”, or “not bad at all, quite good actually”?
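To make the contrast concrete, here is a small Python sketch of the two views of a sentence. It is a toy illustration only, not MICO code: the example sentence, the word lists, and the hand-written dependency structure (the 'heads' array) are assumptions made purely for the example. A word-based classifier sees 'not', 'bad', and 'good' in isolation and cannot decide, whereas the parse-aware variant knows that 'not' attaches to 'bad' and flips its polarity.

# Toy sketch (not MICO code): word-sequence features versus a parse-aware check.
from typing import List, Tuple

def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Split a token sequence into overlapping n-grams (the 'two or three words' view)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def word_based_sentiment(tokens: List[str]) -> str:
    """Naive classifier: looks only at which words occur, not how they relate."""
    positive, negative = {"good", "great"}, {"bad", "poor"}
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def parse_aware_sentiment(tokens: List[str], heads: List[int]) -> str:
    """'Grammatically aware' variant: a negation word flips the polarity of the
    word it attaches to in the (here hand-written) dependency structure."""
    positive, negative = {"good", "great"}, {"bad", "poor"}
    score = 0
    for i, tok in enumerate(tokens):
        if tok in positive or tok in negative:
            polarity = 1 if tok in positive else -1
            # Flip the polarity if a 'not' attaches to this word.
            negated = any(tokens[j] == "not" and heads[j] == i for j in range(len(tokens)))
            score += -polarity if negated else polarity
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

if __name__ == "__main__":
    # "the movie was not bad at all, quite good actually"
    tokens = ["the", "movie", "was", "not", "bad", "at", "all", "quite", "good", "actually"]
    # heads[j] = index of the word that token j modifies (hand-written for the example).
    heads = [1, 2, 2, 4, 2, 4, 5, 8, 2, 2]
    print(ngrams(tokens, 2)[:4])               # the bigram view of the sentence
    print(word_based_sentiment(tokens))        # 'neutral': 'not', 'bad', 'good' seen in isolation
    print(parse_aware_sentiment(tokens, heads))  # 'positive': knows that 'not' negates 'bad'

In a real system, the dependency structure would of course be produced by a parser rather than written by hand, but the principle is the same: the syntax tree tells the classifier which words modify which.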

[Figure: a basic syntax tree]

Blog post written by Johanna Björklund, CTO, for Umeå University's "Forskarbloggen" (the Researcher Blog).

Topics

  • Data, Telecom, IT

Categories

  • research & development
  • codemill

Contacts

Johanna Björklund

Press contact, CEO - Smart Video, 070-603 94 59