Publications

5.

Cheena, Asif; Dost, Katharina; Sarris, Theo; Straathof, Nina; Wicker, Jörg

Don't Swim in Data: Real-Time Microbial Forecasting for New Zealand Recreational Waters Unpublished Forthcoming

SSRN, Forthcoming.

4.

Hafner, Jasmin; Lorsbach, Tim; Schmidt, Sebastian; Brydon, Liam; Dost, Katharina; Zhang, Kunyang; Fenner, Kathrin; Wicker, Jörg

Advancements in Biotransformation Pathway Prediction: Enhancements, Datasets, and Novel Functionalities in enviPath Journal Article

In: Journal of Cheminformatics, vol. 16, iss. 1, pp. 93, 2024, ISSN: 1758-2946.

3.

Long, Derek; Eade, Liam; Dost, Katharina; Meier-Menches, Samuel M; Goldstone, David C; Sullivan, Matthew P; Hartinger, Christian; Wicker, Jörg; Taskova, Katerina

AdductHunter: Identifying Protein-Metal Complex Adducts in Mass Spectra Journal Article

In: Journal of Cheminformatics, vol. 16, iss. 1, 2024, ISSN: 1758-2946.

2.

Dost, Katharina; Tam, Jason; Lorsbach, Tim; Schmidt, Sebastian; Wicker, Jörg

Defining Applicability Domain in Biodegradation Pathway Prediction Unpublished Forthcoming

Forthcoming.

@unpublished{dost2023defining,

title = {Defining Applicability Domain in Biodegradation Pathway Prediction},

author = {Katharina Dost and Jason Tam and Tim Lorsbach and Sebastian Schmidt and J\"{o}rg Wicker},

doi = {10.21203/rs.3.rs-3587632/v1},

year  = {2023},

date = {2023-11-10},

urldate = {2023-11-10},

abstract = {When developing a new chemical, investigating its long-term influences on the environment is crucial to prevent harm. Unfortunately, these experiments are time-consuming. In silico methods can learn from already obtained data to predict biotransformation pathways, and thereby help focus all development efforts on only the most promising chemicals. Like all data-based models, these predictors will output pathway predictions for all input compounds in a suitable format; however, these predictions will be faulty unless the model has seen similar compounds during the training process. A common approach to prevent this for other types of models is to define an Applicability Domain for the model that makes predictions only for in-domain compounds and rejects out-of-domain ones. Nonetheless, although exploration of the compound space is particularly interesting in the development of new chemicals, no Applicability Domain method has been tailored to the specific data structure of pathway predictions yet. In this paper, we are the first to define an Applicability Domain specialized for biodegradation pathway prediction. Assessing a model’s reliability from different angles, we suggest a three-stage approach that checks for applicability, reliability, and decidability of the model for a queried compound and only allows it to output a prediction if all three stages are passed. Experiments confirm that our proposed technique reliably rejects unsuitable compounds and therefore improves the safety of the biotransformation pathway predictor.},

keywords = {applicability domain, biodegradation, cheminformatics, computational sustainability, enviPath, machine learning, metabolic pathways, reliable machine learning},

pubstate = {forthcoming},

tppubtype = {unpublished}

}


1.

Dost, Katharina; Pullar-Strecker, Zac; Brydon, Liam; Zhang, Kunyang; Hafner, Jasmin; Riddle, Pat; Wicker, Jörg

Combatting over-specialization bias in growing chemical databases Journal Article

In: Journal of Cheminformatics, vol. 15, iss. 1, pp. 53, 2023, ISSN: 1758-2946.

@article{Dost2023Combatting,

title = {Combatting over-specialization bias in growing chemical databases},

author = {Katharina Dost and Zac Pullar-Strecker and Liam Brydon and Kunyang Zhang and Jasmin Hafner and Pat Riddle and J\"{o}rg Wicker},

url = {https://jcheminf.biomedcentral.com/articles/10.1186/s13321-023-00716-w},

doi = {10.1186/s13321-023-00716-w},

issn = {1758-2946},

year  = {2023},

date = {2023-05-19},

urldate = {2023-05-19},

journal = {Journal of Cheminformatics},

volume = {15},

issue = {1},

pages = {53},

abstract = {Background



Predicting in advance the behavior of new chemical compounds can support the design process of new products by directing the research toward the most promising candidates and ruling out others. Such predictive models can be data-driven using Machine Learning or based on researchers’ experience and depend on the collection of past results. In either case: models (or researchers) can only make reliable assumptions about compounds that are similar to what they have seen before. Therefore, consequent usage of these predictive models shapes the dataset and causes a continuous specialization shrinking the applicability domain of all trained models on this dataset in the future, and increasingly harming model-based exploration of the space.

Proposed solution



In this paper, we propose cancels (CounterActiNg Compound spEciaLization biaS), a technique that helps to break the dataset specialization spiral. Aiming for a smooth distribution of the compounds in the dataset, we identify areas in the space that fall short and suggest additional experiments that help bridge the gap. Thereby, we generally improve the dataset quality in an entirely unsupervised manner and create awareness of potential flaws in the data. cancels does not aim to cover the entire compound space and hence retains a desirable degree of specialization to a specified research domain.

Results



An extensive set of experiments on the use case of biodegradation pathway prediction not only reveals that the bias spiral can indeed be observed but also that cancels produces meaningful results. Additionally, we demonstrate that mitigating the observed bias is crucial as it can not only intervene in the continuous specialization process but also significantly improve a predictor’s performance while reducing the number of required experiments. Overall, we believe that cancels can support researchers in their experimentation process to not only better understand their data and potential flaws, but also to grow the dataset in a sustainable way. All code is available at github.com/KatDost/Cancels.},

keywords = {bias, biodegradation, cheminformatics, computational sustainability, data mining, enviPath, machine learning, metabolic pathways, multi-label classification, reliable machine learning},

pubstate = {published},

tppubtype = {article}

}


