
Forskarbloggen: A very short note on authorship attribution

A text-classification task that appears to interest many is authorship attribution. The problem consists in deciding who, among a set of candidate authors, has written a given text. The underlying assumption is that there are several ways to express the same idea in words, that each individual has his or her own particular style, and that the differences can be exploited to distinguish between authors. The stylistic choices need not be great to be useful; it can simply be a matter of how the author uses punctuation or function words, that is, all those little words that are needed regardless of the topic, e.g. and, or, no, but, however, for, if, then.
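To make the function-word idea concrete, here is a minimal sketch in Python. It assumes a toy word list (the examples mentioned above), a simple regular-expression tokenizer, and a nearest-profile rule based on Manhattan distance; the names `FUNCTION_WORDS`, `function_word_profile`, `distance` and `attribute` are made up for illustration and do not describe any particular system.

```python
import re
from collections import Counter

# Toy list built from the examples in the text; an assumption, real lists are longer.
FUNCTION_WORDS = ["and", "or", "no", "but", "however", "for", "if", "then"]


def function_word_profile(text):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def distance(p, q):
    """Manhattan distance between two profiles of equal length."""
    return sum(abs(a - b) for a, b in zip(p, q))


def attribute(text, known_texts):
    """Pick the candidate whose known writing has the closest profile.

    known_texts maps an author name to a sample of that author's writing.
    """
    target = function_word_profile(text)
    return min(
        known_texts,
        key=lambda author: distance(target, function_word_profile(known_texts[author])),
    )
```

Given a dictionary that maps each candidate author to a sample of known writing, `attribute(text, samples)` returns the name with the most similar profile. With realistic amounts of text one would likely use a longer word list and a proper classifier, but the principle is the same.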

In an ongoing project, we explore how useful syntactical patterns are for authorship attribution. As some may recall from primary school, a syntactic tree is a graph that represents the syntactic structure of a sentence, according to some grammar. The figure below shows a syntactic tree for the sentence "The teacher gave homework to his students." We see, for instance, that the sentence consists of a noun phrase (NP) followed by a verb phrase (VP), which in turn consists of a verb (V), another noun phrase, and finally a prepositional phrase (PP).

Basic Syntax Tree
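For readers who prefer code to pictures, the tree can also be written down as a nested data structure. The sketch below uses Python tuples of the form (label, child, child, ...). The labels S, NP, VP, V and PP follow the description above, whereas Det, N and P, as well as the internal structure of the PP, are our assumptions for the sake of the example.

```python
# Each node is a tuple: (label, child, child, ...).
# A word-level node has a single string child, namely the word itself.
TREE = (
    "S",
    ("NP", ("Det", "The"), ("N", "teacher")),
    (
        "VP",
        ("V", "gave"),
        ("NP", ("N", "homework")),
        ("PP", ("P", "to"), ("NP", ("Det", "his"), ("N", "students"))),
    ),
)


def leaves(tree):
    """Return the words of the sentence, read from left to right."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


def child_labels(tree):
    """Return the labels of the immediate children of a phrase node."""
    label, *children = tree
    return [child[0] for child in children if not isinstance(child, str)]
```

Reading off `child_labels(TREE)` gives the NP followed by the VP, and applying the same function to the VP node gives the verb, the noun phrase and the PP.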

What we are doing is looking for small subgraphs that recur with a frequency that is fairly stable for each individual author, but that differs between authors. The hypothesis is that every author has a "fingerprint" of syntactical constructions that can be seen over and over in his or her texts, although the wording may change. Preliminary findings suggest that the syntactical patterns are weaker than function words when it comes to authorship attribution, but that the combination of syntax and function words has greater discriminating power than either method on its own.
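As a rough illustration of what counting small subgraphs can look like, the sketch below restricts itself to the simplest possible fragments: a node together with the labels of its immediate children. This restriction is an assumption made to keep the example short, and so are the tuple encoding of trees and the names `fragments`, `fragment_profile` and `EXAMPLE`; the fragments used in the project may well be larger or of a different shape.

```python
from collections import Counter


def fragments(tree):
    """Yield one 'parent -> children' pattern per phrase node in the tree.

    Trees are tuples (label, child, ...); word-level nodes have a single
    string child and contribute no pattern, so the wording is ignored.
    """
    label, *children = tree
    if children and not isinstance(children[0], str):
        yield label + " -> " + " ".join(child[0] for child in children)
        for child in children:
            yield from fragments(child)


def fragment_profile(trees):
    """Return the relative frequency of each pattern over a list of trees."""
    counts = Counter(f for tree in trees for f in fragments(tree))
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {pattern: count / total for pattern, count in counts.items()}


# One hand-written tree; in practice the trees would come from a parser.
EXAMPLE = (
    "S",
    ("NP", ("Det", "The"), ("N", "teacher")),
    (
        "VP",
        ("V", "gave"),
        ("NP", ("N", "homework")),
        ("PP", ("P", "to"), ("NP", ("Det", "his"), ("N", "students"))),
    ),
)

profile = fragment_profile([EXAMPLE])
```

Computed over many sentences by the same author, such a profile is one candidate for the "fingerprint", and it can be placed side by side with function-word frequencies to form a combined feature set.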

Bundesarchiv B 145 Bild-F079064-0006, Bonn, Gymnasium

Blog post written by Johanna Björklund, CTO, for Umeå University's "Forskarbloggen" (the Researcher Blog).
