Infection with the bacterium campylobacter most often occurs through food and can cause stomach infections with symptoms such as nausea, diarrhea, and fever. The number of campylobacter infections has been rising in recent years; in 2023, more than 5,000 cases were registered in Denmark.
An important tool for the food authorities in tackling the disease-causing bacteria is the so-called source account, which DTU prepares. It estimates the proportion of illness cases that come from different animals and foods, giving the authorities an indication of where to take preventive action and, subsequently, how effective those efforts have been.
More data to keep track of
The source account is based on the food authorities' samples from animals and food and data from Statens Serum Institut's samples from people infected with the disease-causing bacteria. The principle is to divide the identified bacteria into different genetic subtypes and create a model based on the patterns that emerge.
Such a source account has been prepared for many years for salmonella, where you can 'get away with' keeping an eye on relatively few subtypes. Campylobacter is a more complex organism, and to track it accurately, it was necessary to sequence, i.e. map, the bacterium's entire core genome of approximately 1,300 genes.
"Of course, when we go from less than 20 to 1,300 genes, the amount of data becomes much larger and more difficult to keep track of. That's why we came up with the idea of using machine learning a few years ago. At the time, few others had tried," says Professor Tine Hald. She leads a group of researchers at DTU, who – among other things – are responsible for preparing the source accounts.
Master's student gets started
As a master’s student at DTU, Maja Lykke Brinch started developing a machine learning solution. She 'fed' gene sequences for campylobacter from different animals and foods into a supercomputer.
"You typically take 70 per cent of your dataset – in this case campylobacter gene sequences from multiple sources collected in 2015-17 – and train the algorithm on it. Then you give it the last 30 per cent, where you know the source, and see if the machine can get it right. Once you have a sufficiently accurate model, you provide it with data from people with bacterial infections where neither we nor the model know where the disease originates. The model then predicts the probability that a case of infection originates from a specific food source," explains Maja Lykke Brinch, who has been responsible for a significant part of the work.
She is now a PhD student at DTU and first author of an article that compares different calculation methods and concludes that the machine learning algorithm is the most useful method for campylobacter. It finds the correct sources in 98 per cent of cases.
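For readers who want to see the train-and-test procedure Maja Lykke Brinch describes in concrete terms, the following is a minimal sketch in Python. It is not the DTU model: the article does not name the algorithm, so a simple nearest-neighbour vote over allele profiles stands in for it. The sources, the ten-gene toy profiles, and all function names are assumptions made for illustration, and averaging the per-case probabilities into source shares is likewise an assumed way of turning predictions into a source account.

```python
import random
from collections import Counter

# Hypothetical sources mapped to a toy "typical" allele value (illustration only).
SOURCES = {"chicken": 1, "cattle": 2, "pig": 3}
N_GENES = 10  # toy size; the core genome in the article has about 1,300 genes


def make_isolate(allele, rng):
    """Build a toy allele profile: a source-typical profile with up to two variant genes."""
    profile = [allele] * N_GENES
    for pos in rng.sample(range(N_GENES), rng.randint(0, 2)):
        profile[pos] = 0
    return tuple(profile)


def split_70_30(records, seed=0):
    """Shuffle the labelled records and split them 70/30 into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def hamming(a, b):
    """Count the gene positions at which two allele profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def predict_proba(train, profile, k=3):
    """Estimate source probabilities from the k most similar training isolates.

    Stand-in classifier (k nearest neighbours), not the algorithm used at DTU.
    """
    nearest = sorted(train, key=lambda rec: hamming(rec[0], profile))[:k]
    votes = Counter(source for _, source in nearest)
    return {source: n / k for source, n in votes.items()}


def accuracy(train, test, k=3):
    """Share of held-out isolates whose most probable source matches the known source."""
    hits = 0
    for profile, source in test:
        proba = predict_proba(train, profile, k)
        if max(proba, key=proba.get) == source:
            hits += 1
    return hits / len(test)


def source_account(train, human_profiles, k=3):
    """Average the per-case probabilities to estimate each source's share of cases."""
    totals = Counter()
    for profile in human_profiles:
        for source, p in predict_proba(train, profile, k).items():
            totals[source] += p
    return {source: total / len(human_profiles) for source, total in totals.items()}


rng = random.Random(42)
# Labelled isolates from animals and food: the source of each one is known.
labelled = [(make_isolate(a, rng), s) for s, a in SOURCES.items() for _ in range(20)]
train, test = split_70_30(labelled)
held_out_accuracy = accuracy(train, test)
print("held-out accuracy:", held_out_accuracy)

# Isolates from patients: the source is unknown to us and to the model.
human_cases = [make_isolate(a, rng) for a in (1, 1, 1, 2, 3)]
shares = source_account(train, human_cases)
print("estimated source shares:", shares)
```

The toy profiles are deliberately easy to tell apart, so the accuracy printed here says nothing about real campylobacter data. The point is only the order of the steps: train on 70 per cent, check against the remaining 30 per cent where the source is known, and then apply the model to human cases.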