Material Discovery with Machine Learning Trained from a Small Database

A large database is desirable for machine learning (ML) models to accurately predict the physicochemical properties of materials from their molecular structure. When a large database is not available, as is the case for energetic materials and elastomers, developing a featurization method grounded in the physicochemical nature of the target properties can improve the predictive power of ML models even with a smaller database. For energetic materials, we have developed two new featurization methods: the volume occupation spatial matrix and the heat contribution spatial matrix. Both improve the accuracy of predicting the crystal density and solid-phase enthalpy of formation of energetic materials, using a database of only 451 energetics. We applied our ML models to the CHON-based molecules in the 1.5-million-molecule PubChem database and identified 29 candidates with competitive detonation performance and reasonable chemical structures. Once higher-level many-body indices are incorporated, spatial matrices are promising ML tools that could provide even better predictions in more fields of materials science. For elastomers, the experimental and simulation data that current descriptors rely on may not be available for all candidates of interest, which hinders elastomer discovery through high-throughput screening. To address this challenge, we introduce structure-based multilevel (SM) descriptors of elastomers, derived solely from molecular structure, which is universally available. Our SM descriptors are hierarchically organized to capture both the local soft- and hard-segment structures and the global structure of elastomers. With the SM-Morgan Fingerprint (SM-MF) descriptor, one of our SM descriptors, a machine learning model predicts elastomer toughness with an accuracy of 0.91. Furthermore, a high-throughput screening pipeline is established to swiftly screen elastomers with targeted toughness.
We also demonstrate the generality and applicability of SM descriptors by using them to construct high-throughput screening pipelines for elastomers with targeted critical strain or Young’s modulus. Their ease of use and low computational cost make SM descriptors a promising tool for significantly enhancing high-throughput screening in the search for novel materials.

Speaker: Li Shuzhou