Clustering with Styles

Author Mirco KOCHER
Director of thesis Prof. Jacques Savoy
Co-director of thesis
Summary of thesis

This work focuses on the design, implementation and evaluation of text clustering algorithms based on styles.

Under this general formulation, one can find different applications.

Given a set of documents (or text excerpts), the targeted system may group together works written by the same author (authorship attribution), belonging to the same genre (news, romance, play, poetry, etc.) or text type (transcribed spoken vs. written language), expressing the same political views (left-right, government-opposition), sharing the same author profile (age, sex, socio-cultural and psychological traits, education, background, etc.), or written during the same time period.

Such a system may provide answers in different domains, for example in literary studies (e.g., what are the stylistic differences between Molière and P. Corneille?) or in linguistics (e.g., what styles do teenagers use on online forums?).

In information retrieval (IR), instead of presenting the retrieved items only according to their similarity with the query, we can regroup them into clusters according to a given criterion (e.g., from surveys to specialized articles).

The internet offers other pertinent applications such as profiling different user types to detect the presence of cyber criminals (e.g., sexual predators, identity thieves).

The targeted style-based clustering system relies mainly on an unsupervised model, in which the features needed to discriminate between the different groups (and sub-groups) are not learned from a set of training instances.
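To make the unsupervised setting concrete, the following minimal sketch groups documents using only a matrix of pairwise distances, with no training labels. Single-linkage agglomerative clustering and the stopping threshold are assumptions chosen for illustration; the summary does not state which clustering algorithm the thesis uses, and the function name `agglomerate` is hypothetical.

```python
def agglomerate(distances, threshold):
    """Single-linkage agglomerative clustering (illustrative assumption).

    distances: symmetric n x n matrix (list of lists) of inter-textual distances.
    threshold: merging stops once the closest pair of clusters is farther
               apart than this value (hypothetical parameter).
    Returns a list of sets of document indices.
    """
    clusters = [{i} for i in range(len(distances))]
    while len(clusters) > 1:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(distances[i][j]
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # remaining clusters are considered stylistically distinct
        _, a, b = best
        clusters[a] |= clusters[b]
        del clusters[b]
    return clusters
```

The number of clusters is not fixed in advance here; it follows from the threshold, which is one possible way to handle an unknown number of authors or styles.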

Within this general context, we are targeting the following four main objectives.

First, we want to design, implement and evaluate different inter-textual metrics able to effectively discriminate between the various styles present in a given corpus.

Such an inter-textual distance (or similarity) must be non-negative, symmetric and respect the triangle inequality.

Moreover, it must be stable and robust (small variations or small errors in the input must result in small distance variations).

The measure should have a clear interpretation for the end user (no black-box system).
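As an illustration of these requirements, here is a minimal sketch of one distance that satisfies them: the Manhattan (L1) distance between relative-frequency profiles. It is non-negative, symmetric and respects the triangle inequality, and each term of the sum shows how much a given word contributes to the total. The choice of this particular distance, the whitespace tokenization and the function names are assumptions made for the example, not the measure proposed in the thesis.

```python
from collections import Counter


def profile(text, vocabulary):
    """Relative frequency of each vocabulary word in the text."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in vocabulary]


def manhattan(p, q):
    """L1 distance between two profiles of equal length."""
    return sum(abs(a - b) for a, b in zip(p, q))


def contributions(p, q, vocabulary):
    """Per-word share of the distance, usable to explain the result."""
    return {w: abs(a - b) for w, a, b in zip(vocabulary, p, q)}
```

Because relative frequencies change only slightly when a few tokens are added or mistyped, a small perturbation of the input produces a small change in the distance, which matches the stability requirement.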

Second, we want to select and evaluate various text representation strategies, focusing mainly on lexical information such as word types and lemmas, or on information produced by more sophisticated tools such as part-of-speech (POS) categories, along with the sequences of these items.
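A minimal sketch of such representations is given below: a text is reduced to the frequencies of its word types and of short sequences (n-grams) of these items. The tokenizer and the function names are assumptions for illustration. Lemmas or POS tags would require an external lemmatizer or tagger, which is not shown; the `ngrams` helper applies unchanged to any such sequence of items.

```python
from collections import Counter

PUNCT = ".,;:!?\"'()"


def tokenize(text):
    """Lowercase word tokens with surrounding punctuation removed (naive)."""
    stripped = (t.strip(PUNCT).lower() for t in text.split())
    return [t for t in stripped if t]


def ngrams(items, n):
    """All contiguous sequences of n items (words, lemmas or POS tags)."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def represent(text, n=2):
    """Return (word-type frequencies, n-gram frequencies) for a text."""
    tokens = tokenize(text)
    return Counter(tokens), Counter(ngrams(tokens, n))
```

Both outputs are plain frequency tables, so the items that separate two texts can be listed directly by name.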

The selected representations must also provide a pertinent key to interpret the inter-textual distance between two texts or clusters.

Third, and strongly correlated to the second objective, we want to work with various languages other than English.

The question is to determine how far conclusions obtained from English can be generalized to other languages, in particular to languages with a more complex morphology (e.g., Finnish, Hungarian) or to certain Asian languages (Japanese, Chinese) showing radically different linguistic characteristics.

Fourth, we want to analyze, and gain a better understanding of, the underlying variability and uncertainty related to the automatic assignment proposed by the system. Such an indication must be easy to understand for the end user.

Status finishing
Administrative delay for the defence
URL http://mirc0crim.github.io/