enviPath is both a database and a prediction system for the microbial biotransformation of organic environmental contaminants. The database stores and displays experimentally observed biotransformation pathways and supports annotating pathways with experimental and environmental conditions. The pathway prediction system provides different relative reasoning models to predict likely biotransformation pathways and products.
enviPath is a joint project with Eawag and the spin-off company enviPath,
2024
Brydon, Liam; Zhang, Kunyang; Dobbie, Gillian; Taskova, Katerina; Wicker, Jörg
Predictive Modeling of Biodegradation Pathways Using Transformer Architectures Unpublished Forthcoming
ResearchSquare, Forthcoming.
@unpublished{Brydon2024b,
title = {Predictive Modeling of Biodegradation Pathways Using Transformer Architectures},
author = {Liam Brydon and Kunyang Zhang and Gillian Dobbie and Katerina Taskova and J\"{o}rg Wicker},
doi = {10.21203/rs.3.rs-5200860/v1},
year = {2024},
date = {2024-10-24},
urldate = {2024-10-24},
journal = {Journal of Cheminformatics},
abstract = {In recent years, the integration of machine learning techniques into chemical reaction product prediction has opened new avenues for understanding and predicting the behaviour of chemical substances. The necessity for such predictive methods stems from the growing regulatory and social awareness of the environmental consequences associated with the persistence and accumulation of chemical residues. Traditional biodegradation prediction methods rely on expert knowledge to perform predictions. However, creating this expert knowledge is becoming increasingly prohibitive due to the complexity and diversity of newer datasets, leaving existing methods unable to perform predictions on these datasets. We formulate the product prediction problem as a sequence-to-sequence generation task and take inspiration from natural language processing and other reaction prediction tasks. In doing so, we reduce the need for the expensive manual creation of expert-based rules.
Scientific Contribution We contribute the first study of the transformer's ability to predict biodegradation reactions. Our proposed method can more accurately and efficiently predict biodegradation reactions on more compounds than existing methods. We also contribute a framework for evaluating transformer product prediction methods that can better illustrate the method's performance and is more suitable for comparison to other methods.},
howpublished = {ResearchSquare},
keywords = {},
pubstate = {forthcoming},
tppubtype = {unpublished}
}
Hafner, Jasmin; Lorsbach, Tim; Schmidt, Sebastian; Brydon, Liam; Dost, Katharina; Zhang, Kunyang; Fenner, Kathrin; Wicker, Jörg
Advancements in Biotransformation Pathway Prediction: Enhancements, Datasets, and Novel Functionalities in enviPath Journal Article
In: Journal of Cheminformatics, vol. 16, no. 1, pp. 93, 2024, ISSN: 1758-2946.
@article{hafner2023advancements,
title = {Advancements in Biotransformation Pathway Prediction: Enhancements, Datasets, and Novel Functionalities in enviPath},
author = {Jasmin Hafner and Tim Lorsbach and Sebastian Schmidt and Liam Brydon and Katharina Dost and Kunyang Zhang and Kathrin Fenner and J\"{o}rg Wicker},
url = {https://jcheminf.biomedcentral.com/articles/10.1186/s13321-024-00881-6
https://envipath.org},
doi = {10.1186/s13321-024-00881-6},
issn = {1758-2946},
year = {2024},
date = {2024-08-06},
urldate = {2024-08-06},
journal = {Journal of Cheminformatics},
volume = {16},
number = {1},
pages = {93},
abstract = {enviPath is a widely used database and prediction system for microbial biotransformation pathways of primarily xenobiotic compounds. Data and prediction system are freely available both via a web interface and a public REST API. Since its initial release in 2016, we extended the data available in enviPath and improved the performance of the prediction system and usability of the overall system. We now provide three diverse data sets, covering microbial biotransformation in different environments and under different experimental conditions. This also enabled developing a pathway prediction model that is applicable to a more diverse set of chemicals. In the prediction engine, we implemented a new evaluation tailored towards pathway prediction, which returns a more honest and holistic view on the performance. We also implemented a novel applicability domain algorithm, which allows the user to estimate how well the model will perform on their data. Finally, we improved the implementation to speed up the overall system and provide new functionality via a plugin system.
},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Lorsbach, Tim; Wicker, Jörg
enviPath-python: v0.2.3 Miscellaneous
Zenodo, 2024.
@misc{lorsbach2024envipath,
title = {enviPath-python: v0.2.3},
author = {Tim Lorsbach and J\"{o}rg Wicker},
url = {https://github.com/enviPath/enviPath-python/tree/v0.2.3},
doi = {10.5281/zenodo.10929408},
year = {2024},
date = {2024-04-05},
urldate = {2024-04-05},
howpublished = {Zenodo},
keywords = {},
pubstate = {published},
tppubtype = {misc}
}
2023
Dost, Katharina; Tam, Jason; Lorsbach, Tim; Schmidt, Sebastian; Wicker, Jörg
Defining Applicability Domain in Biodegradation Pathway Prediction Unpublished Forthcoming
Forthcoming.
@unpublished{dost2023defining,
title = {Defining Applicability Domain in Biodegradation Pathway Prediction},
author = {Katharina Dost and Jason Tam and Tim Lorsbach and Sebastian Schmidt and J\"{o}rg Wicker},
doi = {10.21203/rs.3.rs-3587632/v1},
year = {2023},
date = {2023-11-10},
urldate = {2023-11-10},
abstract = {When developing a new chemical, investigating its long-term influences on the environment is crucial to prevent harm. Unfortunately, these experiments are time-consuming. In silico methods can learn from already obtained data to predict biotransformation pathways, and thereby help focus all development efforts on only the most promising chemicals. As all data-based models, these predictors will output pathway predictions for all input compounds in a suitable format, however, these predictions will be faulty unless the model has seen similar compounds during the training process. A common approach to prevent this for other types of models is to define an Applicability Domain for the model that makes predictions only for in-domain compounds and rejects out-of-domain ones. Nonetheless, although exploration of the compound space is particularly interesting in the development of new chemicals, no Applicability Domain method has been tailored to the specific data structure of pathway predictions yet. In this paper, we are the first to define Applicability Domain specialized in biodegradation pathway prediction. Assessing a model’s reliability from different angles, we suggest a three-stage approach that checks for applicability, reliability, and decidability of the model for a queried compound and only allows it to output a prediction if all three stages are passed. Experiments confirm that our proposed technique reliably rejects unsuitable compounds and therefore improves the safety of the biotransformation pathway predictor. },
keywords = {},
pubstate = {forthcoming},
tppubtype = {unpublished}
}
Dost, Katharina; Pullar-Strecker, Zac; Brydon, Liam; Zhang, Kunyang; Hafner, Jasmin; Riddle, Pat; Wicker, Jörg
Combatting over-specialization bias in growing chemical databases Journal Article
In: Journal of Cheminformatics, vol. 15, no. 1, pp. 53, 2023, ISSN: 1758-2946.
@article{Dost2023Combatting,
title = {Combatting over-specialization bias in growing chemical databases},
author = {Katharina Dost and Zac Pullar-Strecker and Liam Brydon and Kunyang Zhang and Jasmin Hafner and Pat Riddle and J\"{o}rg Wicker},
url = {https://jcheminf.biomedcentral.com/articles/10.1186/s13321-023-00716-w},
doi = {10.1186/s13321-023-00716-w},
issn = {1758-2946},
year = {2023},
date = {2023-05-19},
urldate = {2023-05-19},
journal = {Journal of Cheminformatics},
volume = {15},
number = {1},
pages = {53},
abstract = {Background
Predicting in advance the behavior of new chemical compounds can support the design process of new products by directing the research toward the most promising candidates and ruling out others. Such predictive models can be data-driven using Machine Learning or based on researchers’ experience and depend on the collection of past results. In either case: models (or researchers) can only make reliable assumptions about compounds that are similar to what they have seen before. Therefore, consequent usage of these predictive models shapes the dataset and causes a continuous specialization shrinking the applicability domain of all trained models on this dataset in the future, and increasingly harming model-based exploration of the space.
Proposed solution
In this paper, we propose cancels (CounterActiNg Compound spEciaLization biaS), a technique that helps to break the dataset specialization spiral. Aiming for a smooth distribution of the compounds in the dataset, we identify areas in the space that fall short and suggest additional experiments that help bridge the gap. Thereby, we generally improve the dataset quality in an entirely unsupervised manner and create awareness of potential flaws in the data. cancels does not aim to cover the entire compound space and hence retains a desirable degree of specialization to a specified research domain.
Results
An extensive set of experiments on the use-case of biodegradation pathway prediction not only reveals that the bias spiral can indeed be observed but also that cancels produces meaningful results. Additionally, we demonstrate that mitigating the observed bias is crucial as it cannot only intervene with the continuous specialization process, but also significantly improves a predictor’s performance while reducing the number of required experiments. Overall, we believe that cancels can support researchers in their experimentation process to not only better understand their data and potential flaws, but also to grow the dataset in a sustainable way. All code is available under github.com/KatDost/Cancels.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
2021
Tam, Jason; Lorsbach, Tim; Schmidt, Sebastian; Wicker, Jörg
Holistic Evaluation of Biodegradation Pathway Prediction: Assessing Multi-Step Reactions and Intermediate Products Journal Article
In: Journal of Cheminformatics, vol. 13, no. 1, pp. 63, 2021.
@article{tam2021holisticb,
title = {Holistic Evaluation of Biodegradation Pathway Prediction: Assessing Multi-Step Reactions and Intermediate Products},
author = {Jason Tam and Tim Lorsbach and Sebastian Schmidt and J\"{o}rg Wicker},
url = {https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00543-x
https://chemrxiv.org/articles/preprint/Holistic_Evaluation_of_Biodegradation_Pathway_Prediction_Assessing_Multi-Step_Reactions_and_Intermediate_Products/14315963
https://dx.doi.org/10.26434/chemrxiv.14315963},
doi = {10.1186/s13321-021-00543-x},
year = {2021},
date = {2021-09-03},
urldate = {2021-09-03},
journal = {Journal of Cheminformatics},
volume = {13},
number = {1},
pages = {63},
abstract = {The prediction of metabolism and biotransformation pathways of xenobiotics is a highly desired tool in environmental sciences, drug discovery, and (eco)toxicology. Several systems predict single transformation steps or complete pathways as series of parallel and subsequent steps. Their performance is commonly evaluated on the level of a single transformation step. Such an approach cannot account for some specific challenges that are caused by specific properties of biotransformation experiments. That is, missing transformation products in the reference data that occur only in low concentrations, e.g. transient intermediates or higher-generation metabolites. Furthermore, some rule-based prediction systems evaluate the performance only based on the defined set of transformation rules. Therefore, the performance of these models cannot be directly compared. In this paper, we introduce a new evaluation framework that extends the evaluation of biotransformation prediction from single transformations to whole pathways, taking into account multiple generations of metabolites. We introduce a procedure to address transient intermediates and propose a weighted scoring system that acknowledges the uncertainty of higher-generation metabolites. We implemented this framework in enviPath and demonstrate its strict performance metrics on predictions of in vitro biotransformation and degradation of xenobiotics in soil. Our approach is model-agnostic and can be transferred to other prediction systems. It is also capable of revealing knowledge gaps in terms of incompletely defined sets of transformation rules.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Stepišnik, Tomaž; Škrlj, Blaž; Wicker, Jörg; Kocev, Dragi
A comprehensive comparison of molecular feature representations for use in predictive modeling Journal Article
In: Computers in Biology and Medicine, vol. 130, pp. 104197, 2021, ISSN: 0010-4825.
@article{stepisnik2021comprehensive,
title = {A comprehensive comparison of molecular feature representations for use in predictive modeling},
author = {Toma\v{z} Stepi\v{s}nik and Bla\v{z} \v{S}krlj and J\"{o}rg Wicker and Dragi Kocev},
url = {http://www.sciencedirect.com/science/article/pii/S001048252030528X},
doi = {10.1016/j.compbiomed.2020.104197},
issn = {0010-4825},
year = {2021},
date = {2021-03-01},
journal = {Computers in Biology and Medicine},
volume = {130},
pages = {104197},
abstract = {Machine learning methods are commonly used for predicting molecular properties to accelerate material and drug design. An important part of this process is deciding how to represent the molecules. Typically, machine learning methods expect examples represented by vectors of values, and many methods for calculating molecular feature representations have been proposed. In this paper, we perform a comprehensive comparison of different molecular features, including traditional methods such as fingerprints and molecular descriptors, and recently proposed learnable representations based on neural networks. Feature representations are evaluated on 11 benchmark datasets, used for predicting properties and measures such as mutagenicity, melting points, activity, solubility, and IC50. Our experiments show that several molecular features work similarly well over all benchmark datasets. The ones that stand out most are Spectrophores, which give significantly worse performance than other features on most datasets. Molecular descriptors from the PaDEL library seem very well suited for predicting physical properties of molecules. Despite their simplicity, MACCS fingerprints performed very well overall. The results show that learnable representations achieve competitive performance compared to expert based representations. However, task-specific representations (graph convolutions and Weave methods) rarely offer any benefits, even though they are computationally more demanding. Lastly, combining different molecular feature representations typically does not give a noticeable improvement in performance compared to individual feature representations.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
2017
Latino, Diogo; Wicker, Jörg; Gütlein, Martin; Schmid, Emanuel; Kramer, Stefan; Fenner, Kathrin
Eawag-Soil in enviPath: a new resource for exploring regulatory pesticide soil biodegradation pathways and half-life data Journal Article
In: Environmental Science: Processes & Impacts, 2017.
@article{latino2017eawag,
title = {Eawag-Soil in enviPath: a new resource for exploring regulatory pesticide soil biodegradation pathways and half-life data},
author = {Diogo Latino and J\"{o}rg Wicker and Martin G\"{u}tlein and Emanuel Schmid and Stefan Kramer and Kathrin Fenner},
doi = {10.1039/C6EM00697C},
year = {2017},
date = {2017-01-01},
journal = {Environmental Science: Processes \& Impacts},
publisher = {The Royal Society of Chemistry},
abstract = {Developing models for the prediction of microbial biotransformation pathways and half-lives of trace organic contaminants in different environments requires as training data easily accessible and sufficiently large collections of respective biotransformation data that are annotated with metadata on study conditions. Here, we present the Eawag-Soil package, a public database that has been developed to contain all freely accessible regulatory data on pesticide degradation in laboratory soil simulation studies
for pesticides registered in the EU (282 degradation pathways, 1535 reactions, 1619 compounds and 4716 biotransformation half-life values with corresponding metadata on study conditions). We provide a thorough description of this novel data resource, and discuss important features of the pesticide soil degradation data that are relevant for model development. Most notably, the variability of half-life values for individual compounds is large and only about one order of magnitude lower than the entire range of median half-life values spanned by all compounds, demonstrating the need to consider study conditions in the development of more accurate models for biotransformation prediction. We further show how the data can be used to find missing rules relevant for predicting soil biotransformation pathways. From this analysis, eight examples of reaction types were presented that should trigger the formulation of new biotransformation rules, e.g., Ar-OH methylation, or the extension of existing rules e.g., hydroxylation in aliphatic rings. The data were also used to exemplarily explore the dependence of half-lives of different amide pesticides on chemical class and experimental parameters. This analysis highlighted the value of considering initial transformation reactions for the development of meaningful quantitative-structure biotransformation relationships (QSBR), which is a novel opportunity offered by the simultaneous encoding of transformation reactions and corresponding half-lives in Eawag-Soil. Overall, Eawag-Soil provides an unprecedentedly rich collection of manually extracted and curated biotransformation data, which should be useful in a great variety of applications.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
2016
Wicker, Jörg; Fenner, Kathrin; Kramer, Stefan
A Hybrid Machine Learning and Knowledge Based Approach to Limit Combinatorial Explosion in Biodegradation Prediction Book Section
In: Lässig, Jörg; Kersting, Kristian; Morik, Katharina (Ed.): Computational Sustainability, pp. 75-97, Springer International Publishing, Cham, 2016, ISBN: 978-3-319-31858-5.
@incollection{wicker2016ahybrid,
title = {A Hybrid Machine Learning and Knowledge Based Approach to Limit Combinatorial Explosion in Biodegradation Prediction},
author = {J\"{o}rg Wicker and Kathrin Fenner and Stefan Kramer},
editor = {J\"{o}rg L\"{a}ssig and Kristian Kersting and Katharina Morik},
url = {http://dx.doi.org/10.1007/978-3-319-31858-5_5},
doi = {10.1007/978-3-319-31858-5_5},
isbn = {978-3-319-31858-5},
year = {2016},
date = {2016-04-21},
booktitle = {Computational Sustainability},
pages = {75-97},
publisher = {Springer International Publishing},
address = {Cham},
abstract = {One of the main tasks in chemical industry regarding the sustainability of a product is the prediction of its environmental fate, i.e., its degradation products and pathways. Current methods for the prediction of biodegradation products and pathways of organic environmental pollutants either do not take into account domain knowledge or do not provide probability estimates. In this chapter, we propose a hybrid knowledge-based and machine learning-based approach to overcome these limitations in the context of the University of Minnesota Pathway Prediction System (UM-PPS). The proposed solution performs relative reasoning in a machine learning framework, and obtains one probability estimate for each biotransformation rule of the system. Since the application of a rule then depends on a threshold for the probability estimate, the trade-off between recall (sensitivity) and precision (selectivity) can be addressed and leveraged in practice. Results from leave-one-out cross-validation show that a recall and precision of approximately 0.8 can be achieved for a subset of 13 transformation rules. The set of used rules is further extended using multi-label classification, where dependencies among the transformation rules are exploited to improve the predictions. While the results regarding recall and precision vary, the area under the ROC curve can be improved using multi-label classification. Therefore, it is possible to optimize precision without compromising recall. Recently, we integrated the presented approach into enviPath, a complete redesign and re-implementation of UM-PPS.},
keywords = {},
pubstate = {published},
tppubtype = {incollection}
}
Wicker, Jörg; Lorsbach, Tim; Gütlein, Martin; Schmid, Emanuel; Latino, Diogo; Kramer, Stefan; Fenner, Kathrin
enviPath – The Environmental Contaminant Biotransformation Pathway Resource Journal Article
In: Nucleic Acids Research, vol. 44, no. D1, pp. D502-D508, 2016.
@article{wicker2016envipath,
title = {enviPath - The Environmental Contaminant Biotransformation Pathway Resource},
author = {J\"{o}rg Wicker and Tim Lorsbach and Martin G\"{u}tlein and Emanuel Schmid and Diogo Latino and Stefan Kramer and Kathrin Fenner},
editor = {Michael Galperin},
url = {http://nar.oxfordjournals.org/content/44/D1/D502.abstract},
doi = {10.1093/nar/gkv1229},
year = {2016},
date = {2016-01-01},
journal = {Nucleic Acids Research},
volume = {44},
number = {D1},
pages = {D502-D508},
abstract = {The University of Minnesota Biocatalysis/Biodegradation Database and Pathway Prediction System (UM-BBD/PPS) has been a unique resource covering microbial biotransformation pathways of primarily xenobiotic chemicals for over 15 years. This paper introduces the successor system, enviPath (The Environmental Contaminant Biotransformation Pathway Resource), which is a complete redesign and reimplementation of UM-BBD/PPS. enviPath uses the database from the UM-BBD/PPS as a basis, extends the use of this database, and allows users to include their own data to support multiple use cases. Relative reasoning is supported for the refinement of predictions and to allow its extensions in terms of previously published, but not implemented machine learning models. User access is simplified by providing a REST API that simplifies the inclusion of enviPath into existing workflows. An RDF database is used to enable simple integration with other databases. enviPath is publicly available at https://envipath.org with free and open access to its core data.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
2013
Wicker, Jörg
Large Classifier Systems in Bio- and Cheminformatics PhD Thesis
Technische Universität München, 2013.
@phdthesis{wicker2013large,
title = {Large Classifier Systems in Bio- and Cheminformatics},
author = {J\"{o}rg Wicker},
url = {http://mediatum.ub.tum.de/node?id=1165858},
year = {2013},
date = {2013-01-01},
school = {Technische Universit\"{a}t M\"{u}nchen},
abstract = {Large classifier systems are machine learning algorithms that use multiple
classifiers to improve the prediction of target values in advanced
classification tasks. Although learning problems in bio- and
cheminformatics commonly provide data in schemes suitable for large
classifier systems, they are rarely used in these domains. This thesis
introduces two new classifiers incorporating systems of classifiers
using Boolean matrix decomposition to handle data in a schema that
often occurs in bio- and cheminformatics.
The first approach, called MLC-BMaD (multi-label classification using
Boolean matrix decomposition), uses Boolean matrix decomposition to
decompose the labels in a multi-label classification task. The
decomposed matrices are a compact representation of the information
in the labels (first matrix) and the dependencies among the labels
(second matrix). The first matrix is used in a further multi-label
classification while the second matrix is used to generate the final
matrix from the predicted values of the first matrix.
MLC-BMaD was evaluated on six standard multi-label data sets, the
experiments showed that MLC-BMaD can perform particularly well on data
sets with a high number of labels and a small number of instances and
can outperform standard multi-label algorithms.
Subsequently, MLC-BMaD is extended to a special case of
multi-relational learning, by considering the labels not as simple
labels, but instances. The algorithm, called ClassFact
(Classification factorization), uses both matrices in a multi-label
classification. Each label represents a mapping between two
instances.
Experiments on three data sets from the domain of bioinformatics show
that ClassFact can outperform the baseline method, which merges the
relations into one, on hard classification tasks.
Furthermore, large classifier systems are used on two cheminformatics
data sets, the first one is used to predict the environmental fate of
chemicals by predicting biodegradation pathways. The second is a data
set from the domain of predictive toxicology. In biodegradation
pathway prediction, I extend a knowledge-based system and incorporate
a machine learning approach to predict a probability for
biotransformation products based on the structure- and knowledge-based
predictions of products, which are based on transformation rules. The
use of multi-label classification improves the performance of the
classifiers and extends the number of transformation rules that can be
covered.
For the prediction of toxic effects of chemicals, I applied large
classifier systems to the ToxCast\texttrademark{} data set, which maps
toxic effects to chemicals. As the given toxic effects are not easy to
predict due to missing information and a skewed class
distribution, I introduce a filtering step in the multi-label
classification, which finds labels that are usable in multi-label
prediction and does not take the others in the
prediction into account. Experiments show
that this approach can improve upon the baseline method using binary
classification, as well as multi-label approaches using no filtering.
The presented results show that large classifier systems can play a
role in future research challenges, especially in bio- and
cheminformatics, where data sets frequently consist of more complex
structures and data can be rather small in terms of the number of
instances compared to other domains.},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
2010
Wicker, Jörg; Fenner, Kathrin; Ellis, Lynda; Wackett, Larry; Kramer, Stefan
Predicting biodegradation products and pathways: a hybrid knowledge- and machine learning-based approach Journal Article
In: Bioinformatics, vol. 26, no. 6, pp. 814-821, 2010.
@article{wicker2010predicting,
title = {Predicting biodegradation products and pathways: a hybrid knowledge- and machine learning-based approach},
author = {J\"{o}rg Wicker and Kathrin Fenner and Lynda Ellis and Larry Wackett and Stefan Kramer},
url = {http://bioinformatics.oxfordjournals.org/content/26/6/814.full},
doi = {10.1093/bioinformatics/btq024},
year = {2010},
date = {2010-01-01},
journal = {Bioinformatics},
volume = {26},
number = {6},
pages = {814-821},
publisher = {Oxford University Press},
abstract = {Motivation: Current methods for the prediction of biodegradation products and pathways of organic environmental pollutants either do not take into account domain knowledge or do not provide probability estimates. In this article, we propose a hybrid knowledge- and machine learning-based approach to overcome these limitations in the context of the University of Minnesota Pathway Prediction System (UM-PPS). The proposed solution performs relative reasoning in a machine learning framework, and obtains one probability estimate for each biotransformation rule of the system. As the application of a rule then depends on a threshold for the probability estimate, the trade-off between recall (sensitivity) and precision (selectivity) can be addressed and leveraged in practice. Results: Results from leave-one-out cross-validation show that a recall and precision of ∼0.8 can be achieved for a subset of 13 transformation rules. Therefore, it is possible to optimize precision without compromising recall. We are currently integrating the results into an experimental version of the UM-PPS server. Availability: The program is freely available on the web at http://wwwkramer.in.tum.de/research/applications/biodegradation/data. Contact: kramer@in.tum.de},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
2008
Wicker, Jörg; Fenner, Kathrin; Ellis, Lynda; Wackett, Larry; Kramer, Stefan
Machine Learning and Data Mining Approaches to Biodegradation Pathway Prediction Proceedings Article
In: Bridewell, Will; Calders, Toon; Medeiros, Ana Karla; Kramer, Stefan; Pechenizkiy, Mykola; Todorovski, Ljupco (Eds.): Proceedings of the Second International Workshop on the Induction of Process Models at ECML PKDD 2008, 2008.
@inproceedings{wicker2008machine,
title = {Machine Learning and Data Mining Approaches to Biodegradation Pathway Prediction},
author = {J\"{o}rg Wicker and Kathrin Fenner and Lynda Ellis and Larry Wackett and Stefan Kramer},
editor = {Will Bridewell and Toon Calders and Ana Karla Medeiros and Stefan Kramer and Mykola Pechenizkiy and Ljupco Todorovski},
url = {http://www.ecmlpkdd2008.org/files/pdf/workshops/ipm/9.pdf},
year = {2008},
date = {2008-01-01},
booktitle = {Proceedings of the Second International Workshop on the Induction of Process Models at ECML PKDD 2008},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}