
Our lab researches machine learning and its application to cheminformatics, bioinformatics, and computational sustainability. We are always open to new research areas in both applied and fundamental machine learning. Currently, we are particularly interested in the reliability of machine learning models, adversarial machine learning, and bias, with applications in chemistry, epidemiology, and environmental research.
To learn more about our lab, check out our publications or read more about our research and projects.
You can join us as a PhD student, Honours student, or other postgraduate student. You can also visit our lab as a visiting researcher or student.
News
-
Artificial Intelligence and Freshwater Modelling
To protect our freshwater for future generations, we develop a framework enabling an understanding of how environmental factors impact our water quality and how mitigation strategies can help. Our project […]
-
Welcome Rui Zhang!
We are very happy to welcome Rui Zhang to our lab! Rui will visit us for a year. He is a PhD candidate in Computer Science at the University of Electronic […]
Social
- New data package: EAWAG-SLUDGE! EAWAG-SLUDGE contains biotransformation data from activated sludge experiments extracted from 27 scientific articles, including our own paper by Trostel et al. (2023). https://envipath.org/package/7932e576-03c7-4106-819d-fe80dc605b8a
- Joerg Simon Wicker wrote the following post (Thu, 16 Nov 2023 08:54:15 +1300): enviPath updates. enviPath was first published in 2016. Since then, we added heaps of functionality and updates, so we decided to summarize it all in an update paper. The preprint is now out, read it here! enviPath now has three reviewed data sets, reaction links […]
Recent Publications
Journal Articles
Miller, Catriona J; Golovina, Evgenija; Wicker, Jörg; Jacobson, Jessie C; O'Sullivan, Justin M
De novo network analysis reveals autism causal genes and developmental links to co-occurring traits Journal Article
In: Life Science Alliance, vol. 6, no. 10, 2023.
@article{Miller2023denovo,
title = {De novo network analysis reveals autism causal genes and developmental links to co-occurring traits},
author = {Catriona J Miller and Evgenija Golovina and J\"{o}rg Wicker and Jessie C Jacobson and Justin M O'Sullivan},
url = {https://www.medrxiv.org/content/10.1101/2023.04.24.23289060v1},
doi = {10.26508/lsa.202302142},
year = {2023},
date = {2023-08-08},
urldate = {2023-08-08},
journal = {Life Science Alliance},
volume = {6},
number = {10},
abstract = {Autism is a complex neurodevelopmental condition that manifests in various ways. Autism is often accompanied by other conditions, such as attention-deficit/hyperactivity disorder and schizophrenia, which can complicate diagnosis and management. Although research has investigated the role of specific genes in autism, their relationship with co-occurring traits is not fully understood. To address this, we conducted a two-sample Mendelian randomisation analysis and identified four genes located at the 17q21.31 locus that are putatively causal for autism in fetal cortical tissue (LINC02210, LRRC37A4P, RP11-259G18.1, and RP11-798G7.6). LINC02210 was also identified as putatively causal for autism in adult cortical tissue. By integrating data from expression quantitative trait loci, genes and protein interactions, we identified that the 17q21.31 locus contributes to the intersection between autism and other neurological traits in fetal cortical tissue. We also identified a distinct cluster of co-occurring traits, including cognition and worry, linked to the genetic loci at 3p21.1. Our findings provide insights into the relationship between autism and co-occurring traits, which could be used to develop predictive models for more accurate diagnosis and better clinical management.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Dost, Katharina; Pullar-Strecker, Zac; Brydon, Liam; Zhang, Kunyang; Hafner, Jasmin; Riddle, Pat; Wicker, Jörg
Combatting over-specialization bias in growing chemical databases Journal Article
In: Journal of Cheminformatics, vol. 15, iss. 1, pp. 53, 2023, ISSN: 1758-2946.
@article{Dost2023Combatting,
title = {Combatting over-specialization bias in growing chemical databases},
author = {Katharina Dost and Zac Pullar-Strecker and Liam Brydon and Kunyang Zhang and Jasmin Hafner and Pat Riddle and J\"{o}rg Wicker},
url = {https://jcheminf.biomedcentral.com/articles/10.1186/s13321-023-00716-w},
doi = {10.1186/s13321-023-00716-w},
issn = {1758-2946},
year = {2023},
date = {2023-05-19},
urldate = {2023-05-19},
journal = {Journal of Cheminformatics},
volume = {15},
issue = {1},
pages = {53},
abstract = {Background
Predicting in advance the behavior of new chemical compounds can support the design process of new products by directing the research toward the most promising candidates and ruling out others. Such predictive models can be data-driven using Machine Learning or based on researchers’ experience and depend on the collection of past results. In either case: models (or researchers) can only make reliable assumptions about compounds that are similar to what they have seen before. Therefore, consequent usage of these predictive models shapes the dataset and causes a continuous specialization shrinking the applicability domain of all trained models on this dataset in the future, and increasingly harming model-based exploration of the space.
Proposed solution
In this paper, we propose cancels (CounterActiNg Compound spEciaLization biaS), a technique that helps to break the dataset specialization spiral. Aiming for a smooth distribution of the compounds in the dataset, we identify areas in the space that fall short and suggest additional experiments that help bridge the gap. Thereby, we generally improve the dataset quality in an entirely unsupervised manner and create awareness of potential flaws in the data. cancels does not aim to cover the entire compound space and hence retains a desirable degree of specialization to a specified research domain.
Results
An extensive set of experiments on the use-case of biodegradation pathway prediction not only reveals that the bias spiral can indeed be observed but also that cancels produces meaningful results. Additionally, we demonstrate that mitigating the observed bias is crucial as it cannot only intervene with the continuous specialization process, but also significantly improves a predictor’s performance while reducing the number of required experiments. Overall, we believe that cancels can support researchers in their experimentation process to not only better understand their data and potential flaws, but also to grow the dataset in a sustainable way. All code is available under github.com/KatDost/Cancels.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Bensemann, Joshua; Cheena, Hasnain; Huang, David Tse Jung; Broadbent, Elizabeth; Williams, Jonathan; Wicker, Jörg
From What You See to What We Smell: Linking Human Emotions to Bio-markers in Breath Journal Article
In: IEEE Transactions on Affective Computing, pp. 1-13, 2023, ISSN: 1949-3045.
@article{bensemann2023from,
title = {From What You See to What We Smell: Linking Human Emotions to Bio-markers in Breath},
author = {Joshua Bensemann and Hasnain Cheena and David Tse Jung Huang and Elizabeth Broadbent and Jonathan Williams and J\"{o}rg Wicker},
url = {https://ieeexplore.ieee.org/document/10123109
https://doi.org/10.17608/k6.auckland.22777364
https://doi.org/10.17608/k6.auckland.22777352 },
doi = {10.1109/TAFFC.2023.3275216},
issn = {1949-3045},
year = {2023},
date = {2023-05-11},
urldate = {2023-05-11},
journal = {IEEE Transactions on Affective Computing},
pages = {1-13},
abstract = {Research has shown that the composition of breath can differ based on the human’s behavioral patterns and mental and physical states immediately before being collected. These breath-collection techniques have also been extended to observe the general processes occurring in groups of humans and can link them to what those groups are collectively experiencing. In this research, we applied machine learning techniques to the breath data collected from cinema audiences. These techniques included XGBOOST Regression, Hierarchical Clustering, and Item Basket analyses created using the Apriori algorithm. They were conducted to find associations between the biomarkers in the crowd’s breath and the movie’s audio-visual stimuli and thematic events. This analysis enabled us to directly link what the group was experiencing and their biological response to that experience. We first extracted visual and auditory features from a movie to achieve this. We compared it to the biomarkers in the crowd’s breath using regression and pattern mining techniques. Our results supported the theory that a crowd’s collective experience directly correlates to the biomarkers in the crowd’s breath. Consequently, these findings suggest that visual and auditory experiences have predictable effects on the human
body that can be monitored without requiring expensive or invasive neuroimaging techniques.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Roeslin, Samuel; Ma, Quincy; Chigullapally, Pavan; Wicker, Jörg; Wotherspoon, Liam
Development of a Seismic Loss Prediction Model for Residential Buildings using Machine Learning – Christchurch, New Zealand Journal Article
In: Natural Hazards and Earth System Sciences, vol. 23, no. 3, pp. 1207-1226, 2023.
@article{Roeslin2023development,
title = {Development of a Seismic Loss Prediction Model for Residential Buildings using Machine Learning \textendash Christchurch, New Zealand},
author = {Samuel Roeslin and Quincy Ma and Pavan Chigullapally and J\"{o}rg Wicker and Liam Wotherspoon},
url = {https://nhess.copernicus.org/articles/23/1207/2023/},
doi = {10.5194/nhess-23-1207-2023},
year = {2023},
date = {2023-03-22},
urldate = {2023-03-22},
journal = {Natural Hazards and Earth System Sciences},
volume = {23},
number = {3},
pages = {1207-1226},
abstract = {This paper presents a new framework for the seismic loss prediction of residential buildings in Christchurch, New Zealand. It employs data science techniques, geospatial tools, and machine learning (ML) trained on insurance claims data from the Earthquake Commission (EQC) collected following the 2010\textendash2011 Canterbury Earthquake Sequence (CES). The seismic loss prediction obtained from the ML model is shown to outperform the output from existing risk analysis tools for New Zealand for each of the main earthquakes of the CES. In addition to the prediction capabilities, the ML model delivered useful insights into the most important features contributing to losses during the CES. ML correctly highlighted that liquefaction significantly influenced buildings losses for the 22 February 2011 earthquake. The results are consistent with observations, engineering knowledge, and previous studies, confirming the potential of data science and ML in the analysis of insurance claims data and the development of seismic loss prediction models using empirical loss data.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Proceedings Articles
Pullar-Strecker, Zac; Chang, Xinglong; Brydon, Liam; Ziogas, Ioannis; Dost, Katharina; Wicker, Jörg
Memento: Facilitating Effortless, Efficient, and Reliable ML Experiments Proceedings Article
In: De Francisci Morales, Gianmarco; Perlich, Claudia; Ruchansky, Natali; Kourtellis, Nicolas; Baralis, Elena; Bonchi, Francesco (Ed.): Machine Learning and Knowledge Discovery in Databases: Applied Data Science and Demo Track, pp. 310-314, Springer Nature Switzerland, Cham, 2023, ISBN: 978-3-031-43430-3.
@inproceedings{Pullar-Strecker2023memento,
title = {Memento: Facilitating Effortless, Efficient, and Reliable ML Experiments},
author = {Zac Pullar-Strecker and Xinglong Chang and Liam Brydon and Ioannis Ziogas and Katharina Dost and J\"{o}rg Wicker},
editor = {{De Francisci Morales}, Gianmarco and Claudia Perlich and Natali Ruchansky and Nicolas Kourtellis and Elena Baralis and Francesco Bonchi},
url = {https://arxiv.org/abs/2304.09175
https://github.com/wickerlab/memento},
doi = {10.1007/978-3-031-43430-3_21},
isbn = {978-3-031-43430-3},
year = {2023},
date = {2023-09-17},
urldate = {2023-09-17},
booktitle = {Machine Learning and Knowledge Discovery in Databases: Applied Data Science and Demo Track},
series = {Lecture Notes in Computer Science},
pages = {310-314},
publisher = {Springer Nature Switzerland},
address = {Cham},
abstract = { Running complex sets of machine learning experiments is challenging and time-consuming due to the lack of a unified framework. This leaves researchers forced to spend time implementing necessary features such as parallelization, caching, and checkpointing themselves instead of focussing on their project. To simplify the process, in our paper, we introduce Memento, a Python package that is designed to aid researchers and data scientists in the efficient management and execution of computationally intensive experiments. Memento has the capacity to streamline any experimental pipeline by providing a straightforward configuration matrix and the ability to concurrently run experiments across multiple threads.
Code related to this paper is available at: https://github.com/wickerlab/memento.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Chang, Luke; Dost, Katharina; Zhai, Kaiqi; Demontis, Ambra; Roli, Fabio; Dobbie, Gillian; Wicker, Jörg
BAARD: Blocking Adversarial Examples by Testing for Applicability, Reliability and Decidability Proceedings Article
In: Kashima, Hisashi; Ide, Tsuyoshi; Peng, Wen-Chih (Ed.): The 27th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp. 3-14, Springer Nature Switzerland, Cham, 2023, ISBN: 978-3-031-33374-3.
@inproceedings{chang2021baard,
title = {BAARD: Blocking Adversarial Examples by Testing for Applicability, Reliability and Decidability},
author = {Luke Chang and Katharina Dost and Kaiqi Zhai and Ambra Demontis and Fabio Roli and Gillian Dobbie and J\"{o}rg Wicker},
editor = {Hisashi Kashima and Tsuyoshi Ide and Wen-Chih Peng},
url = {https://arxiv.org/abs/2105.00495
https://github.com/wickerlab/baard},
doi = {10.1007/978-3-031-33374-3_1},
isbn = {978-3-031-33374-3},
year = {2023},
date = {2023-05-27},
urldate = {2023-05-27},
booktitle = {The 27th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)},
journal = {The 27th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)},
pages = {3-14},
publisher = {Springer Nature Switzerland},
address = {Cham},
abstract = {Adversarial defenses protect machine learning models from adversarial attacks, but are often tailored to one type of model or attack. The lack of information on unknown potential attacks makes detecting adversarial examples challenging. Additionally, attackers do not need to follow the rules made by the defender. To address this problem, we take inspiration from the concept of Applicability Domain in cheminformatics. Cheminformatics models struggle to make accurate predictions because only a limited number of compounds are known and available for training. Applicability Domain defines a domain based on the known compounds and rejects any unknown compound that falls outside the domain. Similarly, adversarial examples start as harmless inputs, but can be manipulated to evade reliable classification by moving outside the domain of the classifier. We are the first to identify the similarity between Applicability Domain and adversarial detection. Instead of focusing on unknown attacks, we focus on what is known, the training data. We propose a simple yet robust triple-stage data-driven framework that checks the input globally and locally, and confirms that they are coherent with the model’s output. This framework can be applied to any classification model and is not limited to specific attacks. We demonstrate these three stages work as one unit, effectively detecting various attacks, even for a white-box scenario.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Chen, Zeyu; Dost, Katharina; Zhu, Xuan; Chang, Xinglong; Dobbie, Gillian; Wicker, Jörg
Targeted Attacks on Time Series Forecasting Proceedings Article
In: Kashima, Hisashi; Ide, Tsuyoshi; Peng, Wen-Chih (Ed.): The 27th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp. 314-327, Springer Nature Switzerland, Cham, 2023, ISBN: 978-3-031-33383-5.
@inproceedings{Chen2023targeted,
title = {Targeted Attacks on Time Series Forecasting},
author = {Zeyu Chen and Katharina Dost and Xuan Zhu and Xinglong Chang and Gillian Dobbie and J\"{o}rg Wicker},
editor = {Hisashi Kashima and Tsuyoshi Ide and Wen-Chih Peng},
url = {https://github.com/wickerlab/nvita},
doi = {10.1007/978-3-031-33383-5_25},
isbn = {978-3-031-33383-5},
year = {2023},
date = {2023-05-26},
urldate = {2023-05-25},
booktitle = {The 27th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)},
pages = {314-327},
publisher = {Springer Nature Switzerland},
address = {Cham},
abstract = {Time Series Forecasting (TSF) is well established in domains dealing with temporal data to predict future events yielding the basis for strategic decision-making. Previous research indicated that forecasting models are vulnerable to adversarial attacks, that is, maliciously crafted perturbations of the original data with the goal of altering the model’s predictions. However, attackers targeting specific outcomes pose a substantially more severe threat as they could manipulate the model and bend it to their needs. Regardless, there is no systematic approach for targeted adversarial learning in the TSF domain yet. In this paper, we introduce targeted attacks on TSF in a systematic manner. We establish a new experimental design standard regarding attack goals and perturbation control for targeted adversarial learning on TSF. For this purpose, we present a novel indirect sparse black-box evasion attack on TSF, nVita. Additionally, we adapt the popular white-box attacks Fast Gradient Sign Method (FGSM) and Basic Iterative Method (BIM). Our experiments confirm not only that all three methods are effective but also that current state-of-the-art TSF models are indeed susceptible to attacks. These results motivate future research in this area to achieve higher reliability of forecasting models.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Miscellaneous
Wicker, Jörg; Krauter, Nicolas; Derstorff, Bettina; Stönner, Christof; Bourtsoukidis, Efstratios; Klüpfel, Thomas; Williams, Jonathan; Kramer, Stefan
Cinema Experiments 2013 Miscellaneous
2023.
@misc{Wicker2023cinema,
title = {Cinema Experiments 2013},
author = { J\"{o}rg Wicker and Nicolas Krauter and Bettina Derstorff and Christof St\"{o}nner and Efstratios Bourtsoukidis and Thomas Kl\"{u}pfel and Jonathan Williams and Stefan Kramer},
url = {https://auckland.figshare.com/articles/dataset/Cinema_Experiments_2013/22777364},
doi = {10.17608/k6.auckland.22777364.v3},
year = {2023},
date = {2023-05-23},
keywords = {},
pubstate = {published},
tppubtype = {misc}
}
Stönner, Christof; Edtbauer, Achim; Derstorff, Bettina; Bourtsoukidis, Efstratios; Klüpfel, Thomas; Wicker, Jörg; Williams, Jonathan
Cinema Experiments 2015 Miscellaneous
2023.
@misc{Stoenner2023cinema,
title = {Cinema Experiments 2015},
author = { Christof St\"{o}nner and Achim Edtbauer and Bettina Derstorff and Efstratios Bourtsoukidis and Thomas Kl\"{u}pfel and J\"{o}rg Wicker and Jonathan Williams},
url = {https://auckland.figshare.com/articles/dataset/Cinema_Experiments_2015/22777352},
doi = {10.17608/k6.auckland.22777352.v2},
year = {2023},
date = {2023-05-23},
keywords = {},
pubstate = {published},
tppubtype = {misc}
}
Unpublished
Hua, Yan Cathy; Denny, Paul; Wicker, Jörg; Taskova, Katerina
A Systematic Review of Aspect-based Sentiment Analysis (ABSA): Domains, Methods, and Trends Unpublished Forthcoming
Forthcoming.
@unpublished{hua2023systematic,
title = {A Systematic Review of Aspect-based Sentiment Analysis (ABSA): Domains, Methods, and Trends},
author = {Yan Cathy Hua and Paul Denny and J\"{o}rg Wicker and Katerina Taskova},
url = {https://arxiv.org/abs/2311.10777},
doi = {10.48550/arXiv.2311.10777},
year = {2023},
date = {2023-11-17},
urldate = {2023-11-17},
abstract = {Aspect-based Sentiment Analysis (ABSA) is a type of fine-grained sentiment analysis (SA) that identifies aspects and the associated opinions from a given text. In the digital era, ABSA gained increasing popularity and applications in mining opinionated text data to obtain insights and support decisions. ABSA research employs linguistic, statistical, and machine-learning approaches and utilises resources such as labelled datasets, aspect and sentiment lexicons and ontology. By its nature, ABSA is domain-dependent and can be sensitive to the impact of misalignment between the resource and application domains. However, to our knowledge, this topic has not been explored by the existing ABSA literature reviews. In this paper, we present a Systematic Literature Review (SLR) of ABSA studies with a focus on the research application domain, dataset domain, and the research methods to examine their relationships and identify trends over time. Our results suggest a number of potential systemic issues in the ABSA research literature, including the predominance of the ``product/service review'' dataset domain among the majority of studies that did not have a specific research application domain, coupled with the prevalence of dataset-reliant methods such as supervised machine learning. This review makes a number of unique contributions to the ABSA research field: 1) To our knowledge, it is the first SLR that links the research domain, dataset domain, and research method through a systematic perspective; 2) it is one of the largest scoped SLR on ABSA, with 519 eligible studies filtered from 4191 search results without time constraint; and 3) our review methodology adopted an innovative automatic filtering process based on PDF-mining, which enhanced screening quality and reliability. Suggestions and our review limitations are also discussed.},
keywords = {},
pubstate = {forthcoming},
tppubtype = {unpublished}
}
Dost, Katharina; Tam, Jason; Lorsbach, Tim; Schmidt, Sebastian; Wicker, Jörg
Defining Applicability Domain in Biodegradation Pathway Prediction Unpublished Forthcoming
Forthcoming.
@unpublished{dost2023defining,
title = {Defining Applicability Domain in Biodegradation Pathway Prediction},
author = {Katharina Dost and Jason Tam and Tim Lorsbach and Sebastian Schmidt and J\"{o}rg Wicker},
doi = {10.21203/rs.3.rs-3587632/v1},
year = {2023},
date = {2023-11-10},
urldate = {2023-11-10},
abstract = {When developing a new chemical, investigating its long-term influences on the environment is crucial to prevent harm. Unfortunately, these experiments are time-consuming. In silico methods can learn from already obtained data to predict biotransformation pathways, and thereby help focus all development efforts on only the most promising chemicals. As all data-based models, these predictors will output pathway predictions for all input compounds in a suitable format, however, these predictions will be faulty unless the model has seen similar compounds during the training process. A common approach to prevent this for other types of models is to define an Applicability Domain for the model that makes predictions only for in-domain compounds and rejects out-of-domain ones. Nonetheless, although exploration of the compound space is particularly interesting in the development of new chemicals, no Applicability Domain method has been tailored to the specific data structure of pathway predictions yet. In this paper, we are the first to define Applicability Domain specialized in biodegradation pathway prediction. Assessing a model’s reliability from different angles, we suggest a three-stage approach that checks for applicability, reliability, and decidability of the model for a queried compound and only allows it to output a prediction if all three stages are passed. Experiments confirm that our proposed technique reliably rejects unsuitable compounds and therefore improves the safety of the biotransformation pathway predictor. },
keywords = {},
pubstate = {forthcoming},
tppubtype = {unpublished}
}
Hafner, Jasmin; Lorsbach, Tim; Schmidt, Sebastian; Brydon, Liam; Dost, Katharina; Zhang, Kunyang; Fenner, Kathrin; Wicker, Jörg
Advancements in Biotransformation Pathway Prediction: Enhancements, Datasets, and Novel Functionalities in enviPath Unpublished Forthcoming
Forthcoming.
@unpublished{Hafner2023advancements,
title = {Advancements in Biotransformation Pathway Prediction: Enhancements, Datasets, and Novel Functionalities in enviPath},
author = {Jasmin Hafner and Tim Lorsbach and Sebastian Schmidt and Liam Brydon and Katharina Dost and Kunyang Zhang and Kathrin Fenner and J\"{o}rg Wicker},
doi = {10.21203/rs.3.rs-3607847/v1},
year = {2023},
date = {2023-11-03},
urldate = {2023-11-03},
abstract = {enviPath is a widely used database and prediction system for microbial biotransformation pathways of primarily xenobiotic compounds. Data and prediction system are freely available both via a web interface and a public REST API. Since its initial release in 2016, we extended the data available in enviPath and improved the performance of the prediction system and usability of the overall system. We now provide three diverse data sets, covering microbial biotransformation in different environments and under different experimental conditions. This also enabled developing a pathway prediction model that is applicable to a more diverse set of chemicals. In the prediction engine, we implemented a new evaluation tailored towards pathway prediction, that returns a more honest and holistic view on the performance. We also implemented a novel applicability domain algorithm, which allows the user to estimate how well the model will perform on their data. Finally, we improved the implementation to speed up the overall system and provide new functionality via a plugin system. Overall, enviPath has developed into a reliable database and prediction system with a unique use case in research in microbial biotransformations. },
keywords = {},
pubstate = {forthcoming},
tppubtype = {unpublished}
}
Chang, Xinglong; Dost, Katharina; Dobbie, Gillian; Wicker, Jörg
Poison is Not Traceless: Fully-Agnostic Detection of Poisoning Attacks Unpublished Forthcoming
Forthcoming.
@unpublished{Chang2023poison,
title = {Poison is Not Traceless: Fully-Agnostic Detection of Poisoning Attacks},
author = {Xinglong Chang and Katharina Dost and Gillian Dobbie and J\"{o}rg Wicker},
url = {http://arxiv.org/abs/2310.16224},
doi = {10.48550/arXiv.2310.16224},
year = {2023},
date = {2023-10-23},
urldate = {2023-10-23},
abstract = {The performance of machine learning models depends on the quality of the underlying data. Malicious actors can attack the model by poisoning the training data. Current detectors are tied to either specific data types, models, or attacks, and therefore have limited applicability in real-world scenarios. This paper presents a novel fully-agnostic framework, Diva (Detecting InVisible Attacks), that detects attacks solely relying on analyzing the potentially poisoned data set. Diva is based on the idea that poisoning attacks can be detected by comparing the classifier’s accuracy on poisoned and clean data and pre-trains a meta-learner using Complexity Measures to estimate the otherwise unknown accuracy on a hypothetical clean dataset. The framework applies to generic poisoning attacks. For evaluation purposes, in this paper, we test Diva on label-flipping attacks.},
keywords = {},
pubstate = {forthcoming},
tppubtype = {unpublished}
}
Chang, Xinglong; Dobbie, Gillian; Wicker, Jörg
Fast Adversarial Label-Flipping Attack on Tabular Data Unpublished Forthcoming
Forthcoming.
@unpublished{Chang2023fast,
title = {Fast Adversarial Label-Flipping Attack on Tabular Data},
author = {Xinglong Chang and Gillian Dobbie and J\"{o}rg Wicker},
url = {https://arxiv.org/abs/2310.10744},
doi = {10.48550/arXiv.2310.10744},
year = {2023},
date = {2023-10-16},
urldate = {2023-10-16},
abstract = {Machine learning models are increasingly used in fields that require high reliability such as cybersecurity. However, these models remain vulnerable to various attacks, among which the adversarial label-flipping attack poses significant threats. In label-flipping attacks, the adversary maliciously flips a portion of training labels to compromise the machine learning model. This paper raises significant concerns as these attacks can camouflage a highly skewed dataset as an easily solvable classification problem, often misleading machine learning practitioners into lower defenses and miscalculations of potential risks. This concern amplifies in tabular data settings, where identifying true labels requires expertise, allowing malicious label-flipping attacks to easily slip under the radar. To demonstrate this risk is inherited in the adversary's objective, we propose FALFA (Fast Adversarial Label-Flipping Attack), a novel efficient attack for crafting adversarial labels. FALFA is based on transforming the adversary's objective and employs linear programming to reduce computational complexity. Using ten real-world tabular datasets, we demonstrate FALFA's superior attack potential, highlighting the need for robust defenses against such threats.},
keywords = {},
pubstate = {forthcoming},
tppubtype = {unpublished}
}
Long, Derek; Eade, Liam; Dost, Katharina; Meier-Menches, Samuel M; Goldstone, David C; Sullivan, Matthew P; Hartinger, Christian; Wicker, Jörg; Taskova, Katerina
AdductHunter: Identifying Protein-Metal Complex Adducts in Mass Spectra Unpublished Forthcoming
Forthcoming.
@unpublished{Long2023adducthunter,
title = {AdductHunter: Identifying Protein-Metal Complex Adducts in Mass Spectra},
author = {Derek Long and Liam Eade and Katharina Dost and Samuel M Meier-Menches and David C Goldstone and Matthew P Sullivan and Christian Hartinger and J\"{o}rg Wicker and Katerina Taskova},
url = {https://adducthunter.wickerlab.org},
doi = {10.21203/rs.3.rs-3322854/v1},
year = {2023},
date = {2023-05-29},
urldate = {2023-05-29},
abstract = {Mass spectrometry (MS) is an analytical technique for molecule identification that can be used for investigating protein-metal complex interactions. Once the MS data is collected, the mass spectra are usually interpreted manually to identify the adducts formed which arise from the interactions between proteins and metal-based species. However, with increasing resolution, dataset size, and
species complexity, the time required to identify adducts and the error-prone nature of manual assignment have become limiting factors in MS analysis. AdductHunter is an analysis tool to automate the peak identification process using constraint integer optimization to find feasible combinations of protein and fragments, and dynamic time warping to calculate the dissimilarity between the theoretical isotope pattern of a species and its experimental isotope peak distribution. Our results show fast and accurate identification of protein adducts to aid mass spectrometry analysis.},
keywords = {},
pubstate = {forthcoming},
tppubtype = {unpublished}
}