Natural language processing (NLP) techniques hold great promise for the spatial analysis of literary corpora. This emerging research theme focuses in particular on the places that appear in literary works, their cartographic representation and their relationship to characters, across a range of authors, periods and literary movements.
Despite the emergence of operational NLP tools, notably thanks to deep learning, the main task of spatial analysis, namely named entity recognition (NER), remains a thorny issue for the French language. A major limitation on the performance of these tools is the variability of the data, and results on languages other than English remain disappointing. This lack of robustness to variation is particularly glaring in literary corpora, which exhibit diachronic and diatopic variation, among other kinds.
At the intersection of NLP, AI and digital humanities, this project will first evaluate existing approaches and tools for spatial NER and their applicability to literary data. The work will therefore build on existing tools, but will also require the development of purpose-built tools and manually annotated data. Drawing on a corpus of 3,000 novels from the 19th and 20th centuries, we will examine the granularity of spatial named entities (streets, cities, regions), their nature (real, imaginary, vanished) and their disambiguation.
This project will be carried out under interdisciplinary supervision bringing together researchers in digital humanities, NLP and AI. It addresses, on the one hand, the question of evaluation and of the added value the tools bring to their end users and, on the other hand, epistemological questions about the difficulties learning systems encounter in handling variability and the unknown.


PhD student: Caroline Parfait

PhD supervisor: Glenn Roe

Research laboratory: CELLF - Centre d’Etude de la Langue et des Littératures Françaises