The PhD described in this proposal is closely linked to the ANR PIA3 Equipex project E-Col+, which started in 2022. It focuses on taxon-related traits to be extracted from naturalist descriptive texts using AI techniques. Artificial intelligence methods can now provide valuable assistance in documenting collection objects through text and image analysis.
Text analysis plays a prominent role because
(1) it directly provides traits usable in the navigation and search interface, as well as traits that are not accessible by image analysis (uses, behaviour, physical or chemical properties, etc.),
(2) these traits can also be used to annotate specimen images for training image analysis processes, and
(3) it provides synthetic outputs such as identification keys, and the basis (ontology, semantic graphs, etc.) for organising the information around the collections.
The objective of this project is to use natural language processing techniques to develop methods and approaches for structured information extraction, mainly from morphological descriptions of taxa written in French. The extracted semantic concepts and relationships will be used to build openly accessible knowledge graphs. These knowledge graphs will help extend the knowledge about plants and make it easier to use, through the development of new tools such as an online semantic search engine. Textual descriptions and identification keys also exist in other domains, including medicine (descriptions of symptoms and diseases, with diagnostic trees acting as keys). The methodology developed during this PhD therefore has potential uses in other fields, beyond biological collections.
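As a purely illustrative sketch of what extracting concepts and relationships into a knowledge graph can look like, the toy Python example below turns a short French description into subject-predicate-object triples. Everything in it is an assumption made for illustration: the lexicons `ORGANS` and `COLOURS`, the predicate names `hasPart` and `hasColour`, the function `extract_triples`, the taxon label "Taxon A" and the invented description. It uses simple keyword matching, not the NLP methods the project intends to develop.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicons (French surface form -> concept label).
# These are illustrative assumptions, not resources from the project.
ORGANS = {"feuilles": "leaf", "fleurs": "flower", "tige": "stem"}
COLOURS = {"blanches": "white", "jaunes": "yellow", "verte": "green", "vertes": "green"}


def extract_triples(taxon, description):
    """Return (subject, predicate, object) triples found by keyword matching."""
    triples = []
    # Treat each clause (split on '.' or ';') as describing at most one organ.
    for clause in re.split(r"[.;]", description.lower()):
        words = re.findall(r"\w+", clause)
        organs = [ORGANS[w] for w in words if w in ORGANS]
        if not organs:
            continue
        # Node identifying this organ of this taxon.
        organ_node = f"{taxon}#{organs[0]}"
        triples.append((taxon, "hasPart", organ_node))
        for w in words:
            if w in COLOURS:
                triples.append((organ_node, "hasColour", COLOURS[w]))
    return triples


# Invented description, used only to exercise the sketch.
description = "Feuilles opposées, vertes. Fleurs blanches; tige dressée."
triples = extract_triples("Taxon A", description)

# Store the triples as a minimal adjacency-list graph.
graph = defaultdict(list)
for subject, predicate, obj in triples:
    graph[subject].append((predicate, obj))

for subject, edges in graph.items():
    print(subject, edges)
```

A real pipeline would replace the keyword lookup with trained models for entity and relation extraction and would map the labels to an ontology, but the output shape, a set of typed links between taxa, organs and trait values, is the kind of structure a semantic search engine can query.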
PhD student: Ayoub NAINIA
PhD supervisors: Dr Hajar MOUSANNIF (Director), Dr Jihad ZAHIR (Co-advisor), Pr Régine VIGNES LEBBE (Co-director)
Research laboratory: ISYEB UMR 7205 CNRS/MNHN