Publications

8.

Dost, Katharina; Tam, Jason; Lorsbach, Tim; Schmidt, Sebastian; Wicker, Jörg

Defining Applicability Domain in Biodegradation Pathway Prediction Unpublished Forthcoming

Forthcoming.

@unpublished{dost2023defining,

title = {Defining Applicability Domain in Biodegradation Pathway Prediction},

author = {Katharina Dost and Jason Tam and Tim Lorsbach and Sebastian Schmidt and J\"{o}rg Wicker},

doi = {https://doi.org/10.21203/rs.3.rs-3587632/v1},

year  = {2023},

date = {2023-11-10},

urldate = {2023-11-10},

abstract = {When developing a new chemical, investigating its long-term influences on the environment is crucial to prevent harm. Unfortunately, these experiments are time-consuming. In silico methods can learn from already obtained data to predict biotransformation pathways, and thereby help focus all development efforts on only the most promising chemicals. As all data-based models, these predictors will output pathway predictions for all input compounds in a suitable format, however, these predictions will be faulty unless the model has seen similar compounds during the training process. A common approach to prevent this for other types of models is to define an Applicability Domain for the model that makes predictions only for in-domain compounds and rejects out-of-domain ones. Nonetheless, although exploration of the compound space is particularly interesting in the development of new chemicals, no Applicability Domain method has been tailored to the specific data structure of pathway predictions yet. In this paper, we are the first to define Applicability Domain specialized in biodegradation pathway prediction. Assessing a model’s reliability from different angles, we suggest a three-stage approach that checks for applicability, reliability, and decidability of the model for a queried compound and only allows it to output a prediction if all three stages are passed. Experiments confirm that our proposed technique reliably rejects unsuitable compounds and therefore improves the safety of the biotransformation pathway predictor. },

keywords = {applicability domain, biodegradation, cheminformatics, computational sustainability, enviPath, machine learning, metabolic pathways, reliable machine learning},

pubstate = {forthcoming},

tppubtype = {unpublished}

}

Close

7.

Dost, Katharina; Pullar-Strecker, Zac; Brydon, Liam; Zhang, Kunyang; Hafner, Jasmin; Riddle, Pat; Wicker, Jörg

Combatting over-specialization bias in growing chemical databases Journal Article

In: Journal of Cheminformatics, vol. 15, iss. 1, pp. 53, 2023, ISSN: 1758-2946.

@article{Dost2023Combatting,

title = {Combatting over-specialization bias in growing chemical databases},

author = {Katharina Dost and Zac Pullar-Strecker and Liam Brydon and Kunyang Zhang and Jasmin Hafner and Pat Riddle and J\"{o}rg Wicker},

url = {https://jcheminf.biomedcentral.com/articles/10.1186/s13321-023-00716-w



},

doi = {10.1186/s13321-023-00716-w},

issn = {1758-2946},

year  = {2023},

date = {2023-05-19},

urldate = {2023-05-19},

journal = {Journal of Cheminformatics},

volume = {15},

issue = {1},

pages = {53},

abstract = {Background



Predicting in advance the behavior of new chemical compounds can support the design process of new products by directing the research toward the most promising candidates and ruling out others. Such predictive models can be data-driven using Machine Learning or based on researchers’ experience and depend on the collection of past results. In either case: models (or researchers) can only make reliable assumptions about compounds that are similar to what they have seen before. Therefore, consequent usage of these predictive models shapes the dataset and causes a continuous specialization shrinking the applicability domain of all trained models on this dataset in the future, and increasingly harming model-based exploration of the space.

Proposed solution



In this paper, we propose cancels (CounterActiNg Compound spEciaLization biaS), a technique that helps to break the dataset specialization spiral. Aiming for a smooth distribution of the compounds in the dataset, we identify areas in the space that fall short and suggest additional experiments that help bridge the gap. Thereby, we generally improve the dataset quality in an entirely unsupervised manner and create awareness of potential flaws in the data. cancels does not aim to cover the entire compound space and hence retains a desirable degree of specialization to a specified research domain.

Results



An extensive set of experiments on the use-case of biodegradation pathway prediction not only reveals that the bias spiral can indeed be observed but also that cancels produces meaningful results. Additionally, we demonstrate that mitigating the observed bias is crucial as it cannot only intervene with the continuous specialization process, but also significantly improves a predictor’s performance while reducing the number of required experiments. Overall, we believe that cancels can support researchers in their experimentation process to not only better understand their data and potential flaws, but also to grow the dataset in a sustainable way. All code is available under github.com/KatDost/Cancels.},

keywords = {bias, biodegradation, cheminformatics, computational sustainability, data mining, enviPath, machine learning, metabolic pathways, multi-label classification, reliable machine learning},

pubstate = {published},

tppubtype = {article}

}

Close

Background

Predicting in advance the behavior of new chemical compounds can support the design process of new products by directing the research toward the most promising candidates and ruling out others. Such predictive models can be data-driven using Machine Learning or based on researchers’ experience and depend on the collection of past results. In either case: models (or researchers) can only make reliable assumptions about compounds that are similar to what they have seen before. Therefore, consequent usage of these predictive models shapes the dataset and causes a continuous specialization shrinking the applicability domain of all trained models on this dataset in the future, and increasingly harming model-based exploration of the space.
Proposed solution

In this paper, we propose cancels (CounterActiNg Compound spEciaLization biaS), a technique that helps to break the dataset specialization spiral. Aiming for a smooth distribution of the compounds in the dataset, we identify areas in the space that fall short and suggest additional experiments that help bridge the gap. Thereby, we generally improve the dataset quality in an entirely unsupervised manner and create awareness of potential flaws in the data. cancels does not aim to cover the entire compound space and hence retains a desirable degree of specialization to a specified research domain.
Results

An extensive set of experiments on the use-case of biodegradation pathway prediction not only reveals that the bias spiral can indeed be observed but also that cancels produces meaningful results. Additionally, we demonstrate that mitigating the observed bias is crucial as it cannot only intervene with the continuous specialization process, but also significantly improves a predictor’s performance while reducing the number of required experiments. Overall, we believe that cancels can support researchers in their experimentation process to not only better understand their data and potential flaws, but also to grow the dataset in a sustainable way. All code is available under github.com/KatDost/Cancels.

Close

6.

Tam, Jason; Lorsbach, Tim; Schmidt, Sebastian; Wicker, Jörg

Holistic Evaluation of Biodegradation Pathway Prediction: Assessing Multi-Step Reactions and Intermediate Products Journal Article

In: Journal of Cheminformatics, vol. 13, no. 1, pp. 63, 2021.

@article{tam2021holisticb,

title = {Holistic Evaluation of Biodegradation Pathway Prediction: Assessing Multi-Step Reactions and Intermediate Products},

author = {Jason Tam and Tim Lorsbach and Sebastian Schmidt and J\"{o}rg Wicker},

url = {https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00543-x 

https://chemrxiv.org/articles/preprint/Holistic_Evaluation_of_Biodegradation_Pathway_Prediction_Assessing_Multi-Step_Reactions_and_Intermediate_Products/14315963 

https://dx.doi.org/10.26434/chemrxiv.14315963},

doi = {10.1186/s13321-021-00543-x},

year  = {2021},

date = {2021-09-03},

urldate = {2021-09-03},

journal = {Journal of Cheminformatics},

volume = {13},

number = {1},

pages = {63},

abstract = {The prediction of metabolism and biotransformation pathways of xenobiotics is a highly desired tool in environmental sciences, drug discovery, and (eco)toxicology. Several systems predict single transformation steps or complete pathways as series of parallel and subsequent steps. Their performance is commonly evaluated on the level of a single transformation step. Such an approach cannot account for some specific challenges that are caused by specific properties of biotransformation experiments. That is, missing transformation products in the reference data that occur only in low concentrations, e.g. transient intermediates or higher-generation metabolites. Furthermore, some rule-based prediction systems evaluate the performance only based on the defined set of transformation rules. Therefore, the performance of these models cannot be directly compared. In this paper, we introduce a new evaluation framework that extends the evaluation of biotransformation prediction from single transformations to whole pathways, taking into account multiple generations of metabolites. We introduce a procedure to address transient intermediates and propose a weighted scoring system that acknowledges the uncertainty of higher-generation metabolites. We implemented this framework in enviPath and demonstrate its strict performance metrics on predictions of in vitro biotransformation and degradation of xenobiotics in soil. Our approach is model-agnostic and can be transferred to other prediction systems. It is also capable of revealing knowledge gaps in terms of incompletely defined sets of transformation rules.},

keywords = {biodegradation, cheminformatics, computational sustainability, data mining, enviPath, machine learning, metabolic pathways},

pubstate = {published},

tppubtype = {article}

}

Close

5.

Stepišnik, Tomaž; Škrlj, Blaž; Wicker, Jörg; Kocev, Dragi

A comprehensive comparison of molecular feature representations for use in predictive modeling Journal Article

In: Computers in Biology and Medicine, vol. 130, pp. 104197, 2021, ISSN: 0010-4825.

@article{stepisnik2021comprehensive,

title = {A comprehensive comparison of molecular feature representations for use in predictive modeling},

author = {Toma\v{z} Stepi\v{s}nik and Bla\v{z} \v{S}krlj and J\"{o}rg Wicker and Dragi Kocev},

url = {http://www.sciencedirect.com/science/article/pii/S001048252030528X},

doi = {10.1016/j.compbiomed.2020.104197},

issn = {0010-4825},

year  = {2021},

date = {2021-03-01},

journal = {Computers in Biology and Medicine},

volume = {130},

pages = {104197},

abstract = {Machine learning methods are commonly used for predicting molecular properties to accelerate material and drug design. An important part of this process is deciding how to represent the molecules. Typically, machine learning methods expect examples represented by vectors of values, and many methods for calculating molecular feature representations have been proposed. In this paper, we perform a comprehensive comparison of different molecular features, including traditional methods such as fingerprints and molecular descriptors, and recently proposed learnable representations based on neural networks. Feature representations are evaluated on 11 benchmark datasets, used for predicting properties and measures such as mutagenicity, melting points, activity, solubility, and IC50. Our experiments show that several molecular features work similarly well over all benchmark datasets. The ones that stand out most are Spectrophores, which give significantly worse performance than other features on most datasets. Molecular descriptors from the PaDEL library seem very well suited for predicting physical properties of molecules. Despite their simplicity, MACCS fingerprints performed very well overall. The results show that learnable representations achieve competitive performance compared to expert based representations. However, task-specific representations (graph convolutions and Weave methods) rarely offer any benefits, even though they are computationally more demanding. Lastly, combining different molecular feature representations typically does not give a noticeable improvement in performance compared to individual feature representations.},

keywords = {biodegradation, cheminformatics, computational sustainability, data mining, enviPath, machine learning, metabolic pathways, molecular feature representation, toxicity},

pubstate = {published},

tppubtype = {article}

}

Close

4.

Wicker, Jörg; Fenner, Kathrin; Kramer, Stefan

A Hybrid Machine Learning and Knowledge Based Approach to Limit Combinatorial Explosion in Biodegradation Prediction Book Section

In: Lässig, Jörg; Kersting, Kristian; Morik, Katharina (Ed.): Computational Sustainability, pp. 75-97, Springer International Publishing, Cham, 2016, ISBN: 978-3-319-31858-5.

@incollection{wicker2016ahybrid,

title = {A Hybrid Machine Learning and Knowledge Based Approach to Limit Combinatorial Explosion in Biodegradation Prediction},

author = {J\"{o}rg Wicker and Kathrin Fenner and Stefan Kramer},

editor = {J\"{o}rg L\"{a}ssig and Kristian Kersting and Katharina Morik},

url = {http://dx.doi.org/10.1007/978-3-319-31858-5_5},

doi = {10.1007/978-3-319-31858-5_5},

isbn = {978-3-319-31858-5},

year  = {2016},

date = {2016-04-21},

booktitle = {Computational Sustainability},

pages = {75-97},

publisher = {Springer International Publishing},

address = {Cham},

abstract = {One of the main tasks in chemical industry regarding the sustainability of a product is the prediction of its environmental fate, i.e., its degradation products and pathways. Current methods for the prediction of biodegradation products and pathways of organic environmental pollutants either do not take into account domain knowledge or do not provide probability estimates. In this chapter, we propose a hybrid knowledge-based and machine learning-based approach to overcome these limitations in the context of the University of Minnesota Pathway Prediction System (UM-PPS). The proposed solution performs relative reasoning in a machine learning framework, and obtains one probability estimate for each biotransformation rule of the system. Since the application of a rule then depends on a threshold for the probability estimate, the trade-off between recall (sensitivity) and precision (selectivity) can be addressed and leveraged in practice. Results from leave-one-out cross-validation show that a recall and precision of approximately 0.8 can be achieved for a subset of 13 transformation rules. The set of used rules is further extended using multi-label classification, where dependencies among the transformation rules are exploited to improve the predictions. While the results regarding recall and precision vary, the area under the ROC curve can be improved using multi-label classification. Therefore, it is possible to optimize precision without compromising recall. Recently, we integrated the presented approach into enviPath, a complete redesign and re-implementation of UM-PPS.},

keywords = {biodegradation, cheminformatics, computational sustainability, enviPath, machine learning, metabolic pathways, multi-label classification},

pubstate = {published},

tppubtype = {incollection}

}

Close

3.

Wicker, Jörg; Lorsbach, Tim; Gütlein, Martin; Schmid, Emanuel; Latino, Diogo; Kramer, Stefan; Fenner, Kathrin

enviPath – The Environmental Contaminant Biotransformation Pathway Resource Journal Article

In: Nucleic Acid Research, vol. 44, no. D1, pp. D502-D508, 2016.

2.

Wicker, Jörg; Fenner, Kathrin; Ellis, Lynda; Wackett, Larry; Kramer, Stefan

Predicting biodegradation products and pathways: a hybrid knowledge- and machine learning-based approach Journal Article

In: Bioinformatics, vol. 26, no. 6, pp. 814-821, 2010.

1.

Wicker, Jörg; Fenner, Kathrin; Ellis, Lynda; Wackett, Larry; Kramer, Stefan

Machine Learning and Data Mining Approaches to Biodegradation Pathway Prediction Proceedings Article

In: Bridewell, Will; Calders, Toon; Medeiros, Ana Karla; Kramer, Stefan; Pechenizkiy, Mykola; Todorovski, Ljupco (Ed.): Proceedings of the Second International Workshop on the Induction of Process Models at ECML PKDD 2008, 2008.

Links | BibTeX | Tags: biodegradation, cheminformatics, computational sustainability, enviPath, machine learning, metabolic pathways