Medico-administrative databases are rich sources of information on health systems. However, their complexity makes them difficult to exploit. I aim to take advantage of unsupervised approaches that have proven very effective in Natural Language Processing by applying the word2vec method to sequences of care. I worked on vector representations that reflect the interactions (co-occurrences) during care pathways between the codes, or events, of four major French medical terminologies. This method has already been applied to mixed data types, such as text, American health insurance databases and articles, by Beam et al. (2018) [1].
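To make the notion of co-occurrence within a care pathway concrete, here is a minimal sketch in plain Python. It only counts how often two distinct codes appear close to each other in a patient's sequence of events; it is not the event2vec implementation, and the code labels, the `count_cooccurrences` helper and the `window` parameter are all hypothetical choices made for illustration.

```python
from collections import Counter


def count_cooccurrences(pathways, window=2):
    """Count unordered pairs of distinct codes that appear at most
    `window` positions apart within the same care pathway."""
    counts = Counter()
    for events in pathways:
        for i, left in enumerate(events):
            # Look only at the next `window` events to avoid double counting.
            for right in events[i + 1 : i + 1 + window]:
                if left != right:
                    counts[tuple(sorted((left, right)))] += 1
    return counts


# Hypothetical toy pathways: one chronologically ordered list of codes per patient.
pathways = [
    ["diag:A", "drug:B", "act:C"],
    ["diag:A", "drug:B", "diag:D"],
]

cooc = count_cooccurrences(pathways, window=1)
for pair, n in sorted(cooc.items()):
    print(pair, n)
```

Counts of this kind are the raw signal that embedding methods such as word2vec compress into dense vectors: codes that share many neighbours end up close to each other in the vector space.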

Python package

I tried to provide simple implementations of these methods in a Python package, event2vec.


SNDS, French National Health Insurance database

I applied these methods to the French National Health Insurance database, the SNDS, which contains billions of claims. For each distinct event (medical concept) present in the data, a vector of dimension 150 (embedding) is obtained.

A poster summarizing this work was presented at the Emois epidemiology congress: Doutreligne et al., 2020 [2].

In addition, I provide a web application exposing the vectors as well as some qualitative evaluations (2D projections and proximity queries): https://straymat.gitlab.io/event2vec/visualizations.html.
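As an illustration of what a proximity query over such vectors can look like, the sketch below ranks codes by cosine similarity to a query code. It is an assumption about the general technique, not the code behind the web application: the `cosine` and `nearest` helpers are hypothetical, and the toy 3-dimensional vectors stand in for the 150-dimensional embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query, embeddings, k=2):
    """Return the k codes most similar to `query`, best match first."""
    scores = [
        (code, cosine(embeddings[query], vec))
        for code, vec in embeddings.items()
        if code != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy embeddings (3 dimensions instead of 150).
embeddings = {
    "diag:A": [1.0, 0.0, 0.2],
    "drug:B": [0.9, 0.1, 0.3],
    "act:C": [0.0, 1.0, 0.0],
}

neighbors = nearest("diag:A", embeddings, k=1)
print(neighbors)
```

Cosine similarity is a common choice here because it compares the direction of two vectors and ignores their length, which tends to vary with how frequent a code is.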

On the APHP data

I also applied these methods to data from the APHP, the largest hospital in Europe. I am currently describing these experiments in a working paper.