Weakly supervised learning for accurate annotation of textual clinical documents

Type

Doctoral project

Start date

1 Sep 2019

End date

31 Aug 2022

Location

Paris

Hospital clinical documents, present in very large quantities in health data warehouses, are rich sources of information for applications such as patient recruitment for clinical research, epidemiological surveillance, medical coding, and medical decision-making.

The extraction of medical concepts (diseases, signs, symptoms, treatments, drugs, etc.) from clinical reports is an important research topic in natural language processing. These documents, written in natural language by humans and for humans, remain very difficult to analyse and therefore to exploit, owing both to the variability of language in general and to the technical nature of the documents, whose vocabulary varies strongly from one medical specialty to another.

The objective of this thesis is to explore several approaches that reduce the supervision required for multilingual, general-purpose annotation of clinical records, covering all medical fields and all types of documents:

- Distant supervision
- Active learning
- Transfer approaches

Reducing the need for supervision would make it possible to apply extraction tools to all of the reports in a clinical data warehouse (for example, the 50 million currently held at the AP-HP), and to collect and structure very large amounts of information that have so far remained unexploited.

The scientific contribution lies both in the methodological aspects of machine learning, where semi-supervised approaches still leave considerable room for improvement, and in the medical value, for research as well as for care, of enriching clinical data warehouses.