Current machine learning evaluation methods, such as held-out test sets, only detect whether a model’s predictions match the data; they cannot exclude the possibility that both the predictions and the data are biased. More targeted efforts to reduce or eliminate training-data bias require manual adjustment of the models, domain knowledge, or the assumption that data or model issues can be identified easily, which is seldom the case. Model auditing generally relies on manual analysis of the models, on existing data, and on knowledgeable auditors. Automatically identifying deficiencies in both data and training, and how they affect the application of models, remains an open problem. Bias mitigation approaches require the ground-truth distribution, concrete information about the bias, or unbiased (or differently biased) data from other sources. In this project, we aim to develop a model-agnostic framework that is not limited by any of these requirements.
2024
Lyu, Jiachen; Dost, Katharina; Koh, Yun Sing; Wicker, Jörg
Regional Bias in Monolingual English Language Models Journal Article
In: Machine Learning, 2024, ISSN: 1573-0565.
@article{lyu2023regional,
title = {Regional Bias in Monolingual English Language Models},
author = {Jiachen Lyu and Katharina Dost and Yun Sing Koh and J\"{o}rg Wicker},
url = {https://link.springer.com/article/10.1007/s10994-024-06555-6
https://dx.doi.org/10.21203/rs.3.rs-3713494/v1},
doi = {10.1007/s10994-024-06555-6},
issn = {1573-0565},
year = {2024},
date = {2024-07-09},
urldate = {2024-07-09},
journal = {Machine Learning},
abstract = {In Natural Language Processing (NLP), large language models (LLMs) are widely employed and refined for various tasks. These models have shown considerable social and geographic biases, creating skewed or even unfair representations of certain groups. Research focuses on biases toward L2 (English as a second language) regions but neglects bias within L1 (first language) regions. In this work, we ask if there is regional bias within L1 regions already inherent in pre-trained LLMs and, if so, what the consequences are in terms of downstream model performance. We contribute an investigation framework specifically tailored for low-resource regions, offering a method to identify bias without imposing strict requirements for labeled datasets. Our research reveals subtle geographic variations in the word embeddings of BERT, even in cultures traditionally perceived as similar. These nuanced features, once captured, have the potential to significantly impact downstream tasks. Generally, models exhibit comparable performance on datasets that share similarities, and conversely, performance may diverge when datasets differ in the nuanced features embedded within the language. It is crucial to note that performance estimates based solely on standard benchmark datasets may not transfer to datasets whose features differ from those benchmarks. Our proposed framework plays a pivotal role in identifying and addressing biases detected in word embeddings, particularly evident in low-resource regions such as New Zealand.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
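The framework's implementation details are in the paper; purely as a hedged sketch of the underlying idea, the snippet below compares mean-pooled BERT embeddings of matched sentences from two regional corpora via cosine similarity. The example sentences, the model checkpoint, and the pooling choice are illustrative assumptions, not the paper's actual pipeline.

# Sketch: probe for regional variation in BERT embeddings by comparing
# mean-pooled contextual embeddings across two corpora. The sentences and
# pooling strategy are placeholders, not the paper's method.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def mean_embedding(sentences):
    # Mean-pool the last hidden layer over all non-padding tokens.
    with torch.no_grad():
        batch = tokenizer(sentences, padding=True, truncation=True,
                          return_tensors="pt")
        hidden = model(**batch).last_hidden_state        # (batch, seq, 768)
        mask = batch["attention_mask"].unsqueeze(-1)     # zero out padding
        pooled = (hidden * mask).sum(1) / mask.sum(1)    # (batch, 768)
    return pooled.mean(0)

# Hypothetical regional corpora; real corpora would hold many sentences.
nz_sentences = ["We went tramping in the bush last weekend."]
us_sentences = ["We went hiking in the woods last weekend."]

similarity = torch.cosine_similarity(mean_embedding(nz_sentences),
                                     mean_embedding(us_sentences), dim=0)
print(f"cross-region embedding similarity: {similarity.item():.3f}")

A real probe would aggregate over many sentences and target words per region; low cross-region similarity would hint at the kind of nuanced geographic variation the abstract describes.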
Chang, Xinglong; Brydon, Liam; Wicker, Jörg
Memento: v1.1.1 Miscellaneous
Zenodo, 2024.
@misc{chang2024memento,
title = {Memento: v1.1.1},
author = {Xinglong Chang and Liam Brydon and J\"{o}rg Wicker},
url = {https://github.com/wickerlab/memento/tree/v1.1.1},
doi = {10.5281/zenodo.10929406},
year = {2024},
date = {2024-04-05},
urldate = {2024-04-05},
howpublished = {Zenodo},
keywords = {},
pubstate = {published},
tppubtype = {misc}
}
2023
Dost, Katharina; Tam, Jason; Lorsbach, Tim; Schmidt, Sebastian; Wicker, Jörg
Defining Applicability Domain in Biodegradation Pathway Prediction Unpublished Forthcoming
Forthcoming.
@unpublished{dost2023defining,
title = {Defining Applicability Domain in Biodegradation Pathway Prediction},
author = {Katharina Dost and Jason Tam and Tim Lorsbach and Sebastian Schmidt and J\"{o}rg Wicker},
doi = {10.21203/rs.3.rs-3587632/v1},
year = {2023},
date = {2023-11-10},
urldate = {2023-11-10},
abstract = {When developing a new chemical, investigating its long-term influences on the environment is crucial to prevent harm. Unfortunately, these experiments are time-consuming. In silico methods can learn from already obtained data to predict biotransformation pathways and thereby help focus all development efforts on only the most promising chemicals. Like all data-based models, these predictors will output pathway predictions for any input compound in a suitable format; however, these predictions will be faulty unless the model has seen similar compounds during the training process. A common approach to prevent this for other types of models is to define an Applicability Domain for the model so that it makes predictions only for in-domain compounds and rejects out-of-domain ones. Nonetheless, although exploration of the compound space is particularly interesting in the development of new chemicals, no Applicability Domain method has yet been tailored to the specific data structure of pathway predictions. In this paper, we are the first to define an Applicability Domain specialized for biodegradation pathway prediction. Assessing a model’s reliability from different angles, we suggest a three-stage approach that checks the applicability, reliability, and decidability of the model for a queried compound and only allows it to output a prediction if all three stages are passed. Experiments confirm that our proposed technique reliably rejects unsuitable compounds and therefore improves the safety of the biotransformation pathway predictor.},
keywords = {},
pubstate = {forthcoming},
tppubtype = {unpublished}
}
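The paper defines the three stages (applicability, reliability, decidability) specifically for pathway prediction; the sketch below only illustrates the gating pattern with stand-in criteria (nearest-neighbor distance, ensemble agreement, and prediction margin), all of which are assumptions rather than the paper's definitions.

# Sketch of a three-stage applicability-domain gate. The three criteria and
# thresholds below are stand-ins for the paper's applicability/reliability/
# decidability checks, not its actual definitions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def ad_gate(x, X_train, ensemble, dist_thresh=1.0,
            agree_thresh=0.8, margin_thresh=0.2):
    # Return a prediction only if all three stages pass, else None.
    # Stage 1: applicability -- is x close enough to the training data?
    nn = NearestNeighbors(n_neighbors=1).fit(X_train)
    dist, _ = nn.kneighbors(x.reshape(1, -1))
    if dist[0, 0] > dist_thresh:
        return None                      # out of domain

    # Stage 2: reliability -- do the ensemble members agree?
    votes = np.array([m.predict(x.reshape(1, -1))[0] for m in ensemble])
    labels, counts = np.unique(votes, return_counts=True)
    if counts.max() / len(ensemble) < agree_thresh:
        return None                      # unreliable

    # Stage 3: decidability -- is the averaged probability decisive?
    proba = np.mean([m.predict_proba(x.reshape(1, -1))[0]
                     for m in ensemble], axis=0)
    top2 = np.sort(proba)[-2:]
    if top2[1] - top2[0] < margin_thresh:
        return None                      # undecidable
    return labels[counts.argmax()]

# Example usage with a hypothetical fitted ensemble:
# ensemble = [RandomForestClassifier(random_state=s).fit(X_train, y_train)
#             for s in range(5)]
# pred = ad_gate(x_query, X_train, ensemble)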
Chang, Xinglong; Dost, Katharina; Dobbie, Gillian; Wicker, Jörg
Poison is Not Traceless: Fully-Agnostic Detection of Poisoning Attacks Unpublished Forthcoming
Forthcoming.
@unpublished{Chang2023poison,
title = {Poison is Not Traceless: Fully-Agnostic Detection of Poisoning Attacks },
author = {Xinglong Chang and Katharina Dost and Gillian Dobbie and J\"{o}rg Wicker},
url = {http://arxiv.org/abs/2310.16224},
doi = {10.48550/arXiv.2310.16224},
year = {2023},
date = {2023-10-23},
urldate = {2023-10-23},
abstract = {The performance of machine learning models depends on the quality of the underlying data. Malicious actors can attack the model by poisoning the training data. Current detectors are tied to either specific data types, models, or attacks and therefore have limited applicability in real-world scenarios. This paper presents a novel fully-agnostic framework, Diva (Detecting InVisible Attacks), that detects attacks relying solely on analysis of the potentially poisoned dataset. Diva is based on the idea that poisoning attacks can be detected by comparing the classifier’s accuracy on poisoned and clean data; it pre-trains a meta-learner using Complexity Measures to estimate the otherwise unknown accuracy on a hypothetical clean dataset. The framework applies to generic poisoning attacks. For evaluation purposes, in this paper, we test Diva on label-flipping attacks.},
keywords = {},
pubstate = {forthcoming},
tppubtype = {unpublished}
}
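As a hedged illustration of Diva's central comparison, the sketch below pre-trains a meta-learner that maps dataset complexity measures to expected clean accuracy and flags a dataset when its observed cross-validated accuracy falls well below that estimate. The two toy meta-features, the victim model, and the 0.1 decision threshold are assumptions; the paper's actual complexity measures and meta-learner differ.

# Sketch of Diva's idea: estimate what a dataset's accuracy *should* be from
# complexity measures, and flag it when observed accuracy is far below that.
# Meta-features, victim model, and threshold are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def complexity_features(X, y):
    # Toy meta-features: per-feature class separation and class balance.
    mu0, mu1 = X[y == 0].mean(0), X[y == 1].mean(0)
    fisher = np.abs(mu0 - mu1) / (X.std(0) + 1e-9)
    return np.array([fisher.mean(), fisher.max(), np.mean(y == 0)])

def cv_accuracy(X, y):
    return cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# Pre-train the meta-learner on datasets assumed to be clean.
clean = [make_classification(n_samples=400, n_features=10, random_state=s)
         for s in range(20)]
meta = RandomForestRegressor(n_estimators=200, random_state=0).fit(
    np.array([complexity_features(X, y) for X, y in clean]),
    np.array([cv_accuracy(X, y) for X, y in clean]))

# Audit a suspect dataset: here, one with 30% of its labels flipped.
X_sus, y_sus = make_classification(n_samples=400, n_features=10, random_state=99)
flip = np.random.default_rng(0).choice(400, size=120, replace=False)
y_sus[flip] = 1 - y_sus[flip]

estimated = meta.predict(complexity_features(X_sus, y_sus).reshape(1, -1))[0]
observed = cv_accuracy(X_sus, y_sus)
print(f"estimated clean accuracy {estimated:.3f} vs. observed {observed:.3f}")
if estimated - observed > 0.1:           # assumed decision threshold
    print("dataset flagged as potentially poisoned")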
Chang, Xinglong; Dobbie, Gillian; Wicker, Jörg
Fast Adversarial Label-Flipping Attack on Tabular Data Unpublished Forthcoming
Forthcoming.
@unpublished{Chang2023fast,
title = {Fast Adversarial Label-Flipping Attack on Tabular Data},
author = {Xinglong Chang and Gillian Dobbie and J\"{o}rg Wicker},
url = {https://arxiv.org/abs/2310.10744},
doi = {10.48550/arXiv.2310.10744},
year = {2023},
date = {2023-10-16},
urldate = {2023-10-16},
abstract = {Machine learning models are increasingly used in fields that require high reliability, such as cybersecurity. However, these models remain vulnerable to various attacks, among which adversarial label-flipping attacks pose a significant threat. In label-flipping attacks, the adversary maliciously flips a portion of the training labels to compromise the machine learning model. This paper raises significant concerns, as these attacks can camouflage a highly skewed dataset as an easily solvable classification problem, often misleading machine learning practitioners into lowering their defenses and miscalculating potential risks. This concern is amplified in tabular data settings, where identifying true labels requires expertise, allowing malicious label-flipping attacks to easily slip under the radar. To demonstrate that this risk is inherent in the adversary's objective, we propose FALFA (Fast Adversarial Label-Flipping Attack), a novel efficient attack for crafting adversarial labels. FALFA is based on transforming the adversary's objective and employs linear programming to reduce computational complexity. Using ten real-world tabular datasets, we demonstrate FALFA's superior attack potential, highlighting the need for robust defenses against such threats.},
keywords = {},
pubstate = {forthcoming},
tppubtype = {unpublished}
}
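FALFA's exact transformation of the adversary's objective is given in the paper; the snippet below is only a toy variant of the same pattern: relax the binary flip indicators to [0, 1], maximize the increase in training loss under a flip budget with a linear program, and round the solution. The loss-gain coefficients and the budget are illustrative assumptions.

# Toy LP-based label-flipping sketch (not FALFA's exact formulation):
# relax binary flip indicators z_i to [0, 1], maximize the total increase
# in training loss under a flip budget, then round to the top-k flips.
import numpy as np
from scipy.optimize import linprog
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Gain in log-loss on point i if its label were flipped.
p = clf.predict_proba(X)[np.arange(len(y)), y]
gain = -np.log(1 - p + 1e-12) + np.log(p + 1e-12)

budget = 20                                    # assumed flip budget
# linprog minimizes, so negate the gains; constraint: sum(z) <= budget.
res = linprog(c=-gain, A_ub=np.ones((1, len(y))), b_ub=[budget],
              bounds=[(0, 1)] * len(y), method="highs")
flip = np.argsort(res.x)[-budget:]             # round relaxation to top-k
y_poisoned = y.copy()
y_poisoned[flip] = 1 - y_poisoned[flip]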
Pullar-Strecker, Zac; Chang, Xinglong; Brydon, Liam; Ziogas, Ioannis; Dost, Katharina; Wicker, Jörg
Memento: Facilitating Effortless, Efficient, and Reliable ML Experiments Proceedings Article
In: Morales, Gianmarco De Francisci; Perlich, Claudia; Ruchansky, Natali; Kourtellis, Nicolas; Baralis, Elena; Bonchi, Francesco (Ed.): Machine Learning and Knowledge Discovery in Databases: Applied Data Science and Demo Track, pp. 310-314, Springer Nature Switzerland, Cham, 2023, ISBN: 978-3-031-43430-3.
@inproceedings{Pullar-Strecker2023memento,
title = {Memento: Facilitating Effortless, Efficient, and Reliable ML Experiments},
author = {Zac Pullar-Strecker and Xinglong Chang and Liam Brydon and Ioannis Ziogas and Katharina Dost and J\"{o}rg Wicker},
editor = {Gianmarco De Francisci Morales and Claudia Perlich and Natali Ruchansky and Nicolas Kourtellis and Elena Baralis and Francesco Bonchi },
url = {https://arxiv.org/abs/2304.09175
https://github.com/wickerlab/memento},
doi = {10.1007/978-3-031-43430-3_21},
isbn = {978-3-031-43430-3},
year = {2023},
date = {2023-09-17},
urldate = {2023-09-17},
booktitle = {Machine Learning and Knowledge Discovery in Databases: Applied Data Science and Demo Track},
journal = {Lecture Notes in Computer Science},
pages = {310-314},
publisher = {Springer Nature Switzerland},
address = {Cham},
abstract = { Running complex sets of machine learning experiments is challenging and time-consuming due to the lack of a unified framework. This leaves researchers forced to spend time implementing necessary features such as parallelization, caching, and checkpointing themselves instead of focussing on their project. To simplify the process, in our paper, we introduce Memento, a Python package that is designed to aid researchers and data scientists in the efficient management and execution of computationally intensive experiments. Memento has the capacity to streamline any experimental pipeline by providing a straightforward configuration matrix and the ability to concurrently run experiments across multiple threads.
Code related to this paper is available at: https://github.com/wickerlab/memento.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
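For Memento's actual API, see the linked repository; the snippet below does not use it, but sketches the configuration-matrix idea the abstract describes in plain Python: enumerate the Cartesian product of settings and run the resulting experiments across multiple threads.

# Plain-Python sketch of the configuration-matrix idea behind Memento.
# For Memento's real API, see https://github.com/wickerlab/memento.
from concurrent.futures import ThreadPoolExecutor
from itertools import product

matrix = {
    "dataset": ["iris", "wine"],
    "model": ["svm", "random_forest"],
    "seed": [0, 1, 2],
}

def run_experiment(config):
    # Placeholder: train and evaluate one configuration here.
    return f"ran {config['model']} on {config['dataset']} (seed {config['seed']})"

# One experiment per combination in the Cartesian product of all settings.
configs = [dict(zip(matrix, values)) for values in product(*matrix.values())]

with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(run_experiment, configs):
        print(result)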
Dost, Katharina; Pullar-Strecker, Zac; Brydon, Liam; Zhang, Kunyang; Hafner, Jasmin; Riddle, Pat; Wicker, Jörg
Combatting over-specialization bias in growing chemical databases Journal Article
In: Journal of Cheminformatics, vol. 15, iss. 1, pp. 53, 2023, ISSN: 1758-2946.
@article{Dost2023Combatting,
title = {Combatting over-specialization bias in growing chemical databases},
author = {Katharina Dost and Zac Pullar-Strecker and Liam Brydon and Kunyang Zhang and Jasmin Hafner and Pat Riddle and J\"{o}rg Wicker},
url = {https://jcheminf.biomedcentral.com/articles/10.1186/s13321-023-00716-w},
doi = {10.1186/s13321-023-00716-w},
issn = {1758-2946},
year = {2023},
date = {2023-05-19},
urldate = {2023-05-19},
journal = {Journal of Cheminformatics},
volume = {15},
issue = {1},
pages = {53},
abstract = {Background
Predicting in advance the behavior of new chemical compounds can support the design process of new products by directing the research toward the most promising candidates and ruling out others. Such predictive models can be data-driven using Machine Learning or based on researchers’ experience, and they depend on the collection of past results. In either case, models (or researchers) can only make reliable assumptions about compounds that are similar to those they have seen before. Therefore, continued usage of these predictive models shapes the dataset and causes a continuous specialization, shrinking the applicability domain of all models trained on this dataset in the future and increasingly harming model-based exploration of the space.
Proposed solution
In this paper, we propose cancels (CounterActiNg Compound spEciaLization biaS), a technique that helps to break the dataset specialization spiral. Aiming for a smooth distribution of the compounds in the dataset, we identify areas in the space that fall short and suggest additional experiments that help bridge the gap. Thereby, we generally improve the dataset quality in an entirely unsupervised manner and create awareness of potential flaws in the data. cancels does not aim to cover the entire compound space and hence retains a desirable degree of specialization to a specified research domain.
Results
An extensive set of experiments on the use case of biodegradation pathway prediction not only reveals that the bias spiral can indeed be observed but also that cancels produces meaningful results. Additionally, we demonstrate that mitigating the observed bias is crucial, as it not only interrupts the continuous specialization process but also significantly improves a predictor’s performance while reducing the number of required experiments. Overall, we believe that cancels can support researchers in their experimentation process to not only better understand their data and potential flaws, but also to grow the dataset in a sustainable way. All code is available under github.com/KatDost/Cancels.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
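cancels itself is available at github.com/KatDost/Cancels; as a hedged sketch of the general idea rather than its algorithm, the snippet below projects a dataset into a low-dimensional space, estimates its density, and suggests candidate compounds from a broader pool that fall into underrepresented regions. The synthetic fingerprints, the PCA projection, and the kernel-density estimate are all stand-ins.

# Sketch of the general idea behind cancels (not its actual algorithm):
# find low-density regions of the dataset and suggest candidate compounds
# from a broader pool that would fill those gaps.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
dataset = rng.normal(0, 1, size=(500, 30))     # stand-in for fingerprints
pool = rng.normal(0, 2, size=(2000, 30))       # stand-in candidate compounds

pca = PCA(n_components=2).fit(dataset)
kde = KernelDensity(bandwidth=0.5).fit(pca.transform(dataset))

# Suggest the pool compounds lying in the least-covered regions.
density = kde.score_samples(pca.transform(pool))   # log-density per candidate
suggested = pool[np.argsort(density)[:50]]
print(f"suggested {len(suggested)} compounds for underrepresented regions")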