The extraction of medical concepts (diseases, signs, symptoms, treatments, drugs, etc.) from clinical reports is an important research topic in natural language processing. These documents, written in natural language by humans and for humans, remain very difficult to analyse and therefore to exploit, both because of the variability of language in general and because of their technical nature: their vocabulary varies strongly from one medical specialty to another.
The objective of this thesis is to explore several approaches for reducing the supervision required for multilingual, general-purpose annotation of clinical records, covering all medical fields and all types of documents:
- Distant supervision
- Active learning
- Transfer approaches
Reducing supervision in this way would make it possible to apply extraction tools to all of the reports in a clinical data warehouse (for example, the 50 million currently held by the AP-HP), and to collect and structure very large amounts of information that have so far remained unexploited.
The scientific contribution lies both in the methodological aspects of machine learning, where semi-supervised approaches still leave considerable room for improvement, and in the medical value, for research as well as for care, of enriching clinical data warehouses.