Gallica, the digital library of the BnF, gives free online access to nearly 10 million digitized documents and receives 18.5 million visits per year. However, most users do not know that Gallica contains not only printed documents but also photographs, sound recordings, videos, and 3D objects. In satisfaction surveys, only a minority of users consider the search engine's results relevant, and a majority would like better guidance in their searches. A recommendation system should help users find their way through the mass of collections and improve the visibility of the least-known ones. In this project, the BnF is committed to a resolutely ethical approach. The use of user logs must respect users' privacy, and the algorithms must be both relevant and transparent while avoiding the risk of filter bubbles. Interface design is also at the heart of the approach: a trustworthy system relies on a good user experience and on the diversity and relevance of the proposed recommendations. Three lines of thought emerge:
1) How can predictive algorithms be developed from the available data, which include both user logs and collection descriptions?
2) How can diversity be integrated into the recommendation algorithm while letting users set their own serendipity threshold?
3) How can user trust be built into the design and auditing of the algorithms?
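As an illustration of question 2, the sketch below shows one possible way to expose a user-controlled diversity setting: a greedy re-ranking in the style of Maximal Marginal Relevance (MMR). This is not the method chosen by the project. The function names, the `diversity` parameter, and the toy document vectors and relevance scores are all hypothetical and serve only to make the trade-off concrete.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rerank_mmr(relevance, vectors, k, diversity):
    """Greedy MMR-style re-ranking (illustrative sketch, hypothetical API).

    relevance: dict mapping document id -> relevance score
    vectors:   dict mapping document id -> feature vector (list of floats)
    k:         number of recommendations to return
    diversity: user-chosen value in [0, 1]; 0 ranks by relevance only,
               higher values penalize similarity to items already selected
    """
    if not 0.0 <= diversity <= 1.0:
        raise ValueError("diversity must be between 0 and 1")

    selected = []
    candidates = set(relevance)

    def score(doc):
        # Redundancy is the highest similarity to any item already selected.
        redundancy = max(
            (cosine(vectors[doc], vectors[s]) for s in selected), default=0.0
        )
        return (1.0 - diversity) * relevance[doc] - diversity * redundancy

    while candidates and len(selected) < k:
        # Sorting the candidates makes tie-breaking deterministic.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (invented): two similar printed books and one photograph.
toy_relevance = {"book_a": 0.9, "book_b": 0.8, "photo_c": 0.5}
toy_vectors = {"book_a": [1.0, 0.0], "book_b": [1.0, 0.1], "photo_c": [0.0, 1.0]}

relevance_only = rerank_mmr(toy_relevance, toy_vectors, k=2, diversity=0.0)
more_diverse = rerank_mmr(toy_relevance, toy_vectors, k=2, diversity=0.5)
```

With `diversity=0.0` the two most relevant, near-identical books are returned. With `diversity=0.5` the second slot goes to the photograph, because the second book is heavily penalized for its similarity to the first. In an interface, such a parameter could be tied to a slider that lets each user choose how much serendipity they want.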
Main missions
This project focuses on information access in the Gallica library, using machine learning and deep learning techniques. The research axes cover (1) the analysis and indexing of textual documents, (2) the analysis of user traces, and (3) recommendation systems. We are particularly interested in multimodal techniques that contextualize a document or a query based on user interactions.
The successful candidate will be responsible for:
● Implementing models to learn the semantics of textual data for the purpose of indexing them.
● Developing algorithms based on representation learning methodologies to effectively blend text and user traces.
● Reporting and presenting development work clearly and effectively, both for discussion with BnF experts and for writing machine learning publications.
The printed book collection will be the primary focus of the program described above, but an extension to other collections with textual descriptors (in particular iconographic collections) may be considered.