|Director of thesis||Prof. Dr. Jacques Savoy|
|Co-director of thesis|
|Summary of thesis||
This planned research (two PhD students over a three-year period) aims to design, implement, and evaluate text clustering models based on stylistic information. It targets three main tasks. First, our system will be able to determine the true author of a document (literary excerpt, threatening email, legal testimony, …) given a set of texts of known authorship. Second, our model will be able to group texts automatically according to categories related to writing style. Given a set of texts (or blog excerpts), the targeted system will group together works written by the same author (author clustering), or works that belong to the same text category, were written during the same time period, or share the same author gender (author profiling), etc. Third, we want to propose a system able to determine whether two texts were written by the same person (authorship verification). This grant will also guarantee our continued active participation in the upcoming CLEF evaluation campaigns related to authorship attribution (PAN tasks).
To address these challenges, our research will first analyze the relative effectiveness of various text representations, classifiers, and distance measures across different text genres and topics. We then want to put forward a combined text representation and, building on this richer text surrogate, a more effective intertextual distance measure. Our analysis must also determine or estimate the importance of the various factors influencing written style. From this perspective, the most important ones are the text genre, the author, the topics, the time period, and the text type (oral transcripts, written, web-based). The relative importance of these factors will be estimated not on a single text but on a whole corpus.
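As a rough illustration of what a text representation and an intertextual distance can look like, the following minimal sketch represents each text by the relative frequencies of a few very frequent words and attributes a disputed text to the closest known profile. The choice of features, the Manhattan distance, the function names (`profile`, `manhattan`, `attribute`), and the toy texts are all assumptions made for this example; the project summary does not specify any of them.

```python
from collections import Counter


def profile(text, vocabulary):
    """Represent a text by the relative frequency of each vocabulary word."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty texts
    return [counts[word] / total for word in vocabulary]


def manhattan(p, q):
    """Manhattan (L1) distance between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


def attribute(disputed, known_texts, vocabulary):
    """Return the closest candidate author and all computed distances.

    known_texts maps an author name to a sample of that author's writing.
    """
    target = profile(disputed, vocabulary)
    distances = {
        author: manhattan(target, profile(sample, vocabulary))
        for author, sample in known_texts.items()
    }
    return min(distances, key=distances.get), distances


# Toy data, invented for illustration only.
vocabulary = ["the", "of", "and"]
known_texts = {"A": "the the the of", "B": "and and and of"}
best, distances = attribute("the the of and", known_texts, vocabulary)
print(best, distances)
```

Because the result is a plain distance per candidate author over a visible list of words, a decision made this way can be traced back to the features that produced it.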
In developing our model, we will impose the constraint that each proposed decision must be clearly explained or justified to the user (no black-box system). The proposed system must also be able to carry out its computations automatically using publicly available resources. In doing so, we exclude any component requiring extensive manual work (e.g., creating an ontology or a specialized thesaurus).
Applications can be found in the literary realm (e.g., who is behind Elena Ferrante?), in the political domain (e.g., what were the stylistic differences between H. Clinton and D. Trump during the 2016 US presidential election?), or in linguistics (e.g., what are the stylistic differences between teenagers and young adults on online forums?). The Internet offers other pertinent applications, such as profiling users to detect the presence of cyber criminals (e.g., sexual predators, identity thieves) or of people with depression who may be at risk of suicide.
Keywords: Text clustering, intertextual distance, text representation, stylometry, Natural Language Processing (NLP), digital library.
|Administrative deadline for the defence||2022|