This project deals with the preparation and transformation of large data graphs for training machine learning models. The task is made harder by the fact that the data to be transformed is voluminous, heterogeneous, and dynamic: nodes represent diverse concepts and carry properties whose semantics are not aligned with any common reference schema, and the links between nodes are equally heterogeneous. Analyzing such very large graphs requires distributed algorithms that make the best use of big data infrastructures in order to scale. Moreover, the graph preparation process must itself be extensible enough to accommodate new machine learning models.

The objective of this thesis is to design a framework that makes both the preparation of training data and the training of a learning model more efficient.
The method consists of defining a language that describes, in a logical and declarative way, the process that transforms the initial data into a unified graph, including the unification and alignment of heterogeneous data.
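As an illustration only (the thesis language is yet to be designed), the sketch below shows one way such declarative transformation rules might look: each rule maps records from one heterogeneous source onto nodes of a unified graph and aligns source property names with a common vocabulary. The Rule and apply_rules names, and the whole rule format, are assumptions for exposition, not part of the project.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    source: str      # name of the heterogeneous input source (hypothetical)
    node_type: str   # unified node type produced by the rule
    key: str         # source field used as the node identifier
    align: dict      # maps source property names to aligned names

def apply_rules(records, rules):
    """Transform raw records into unified graph nodes, rule by rule."""
    nodes = {}
    for rule in rules:
        for rec in records.get(rule.source, []):
            node_id = (rule.node_type, rec[rule.key])
            props = {aligned: rec[src]
                     for src, aligned in rule.align.items() if src in rec}
            # Unification: records from different sources that share an
            # identifier are merged into a single logical node.
            nodes.setdefault(node_id, {}).update(props)
    return nodes

# Two sources describe the same person under different property names;
# the rules align both descriptions onto a common vocabulary.
rules = [
    Rule("authors_csv", "Person", key="id",  align={"full_name": "name"}),
    Rule("users_json",  "Person", key="uid", align={"login": "alias"}),
]
records = {
    "authors_csv": [{"id": "p1", "full_name": "Ada Lovelace"}],
    "users_json":  [{"uid": "p1", "login": "ada"}],
}
print(apply_rules(records, rules))
# {('Person', 'p1'): {'name': 'Ada Lovelace', 'alias': 'ada'}}
```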
New aggregation and indexing solutions will then be studied to provide fast random access to the graph and to update it incrementally.
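As a rough illustration of the idea, the sketch below maintains per-node aggregates incrementally as edges arrive, so that aggregates and neighborhoods can be read quickly without rescanning the graph. The AggregateIndex class and its in-memory, single-machine design are assumptions made for exposition; the solutions studied in the thesis would target distributed big data infrastructures.

```python
from collections import defaultdict

class AggregateIndex:
    """Adjacency index with incrementally maintained per-node aggregates."""

    def __init__(self):
        self.adj = defaultdict(list)          # node -> list of (neighbor, weight)
        self.degree = defaultdict(int)        # node -> out-degree aggregate
        self.weight_sum = defaultdict(float)  # node -> sum of edge weights

    def add_edge(self, src, dst, weight=1.0):
        """Incremental update: O(1) per edge, no global recomputation."""
        self.adj[src].append((dst, weight))
        self.degree[src] += 1
        self.weight_sum[src] += weight

    def neighbors(self, node):
        """Fast random access to the neighborhood of any node."""
        return self.adj.get(node, [])

    def avg_weight(self, node):
        """Aggregate answered from maintained counters, not by scanning."""
        d = self.degree.get(node, 0)
        return self.weight_sum.get(node, 0.0) / d if d else 0.0

idx = AggregateIndex()
idx.add_edge("a", "b", 2.0)
idx.add_edge("a", "c", 4.0)
print(idx.neighbors("a"), idx.avg_weight("a"))
# [('b', 2.0), ('c', 4.0)] 3.0
```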


PhD student: Yuhe BAI

PhD supervisors: Hubert Naacke, Camélia Constantin

Research laboratory: