Kia Ora! I am Katharina Dost, a PhD student in the School of Computer Science. My research topic is “Identification and Mitigation of Selection Bias”, and I would like to use this post to talk about my research and my experiences, so read on!
Our world runs on data. We gather whatever we can and use it to answer a variety of questions. These can be as simple as “What is the average age of my customers?” but also as critical as “Which treatment is best suited for my disease?”. However, the collected data often does not represent the entire population about which we would like to draw conclusions. This is called Selection Bias, and it typically means that a model learned from the collected data won’t perform well on the entire population. For example, if we only observe birds in South America, a classification model won’t be able to identify a Kiwi (neither the bird, nor the fruit, nor the person).

The good news is: if you are aware of the bias or you know what your entire population looks like, there are options. But if you assume that everything is fine or you simply don’t know your entire population, you will use the biased dataset and train a poorly performing model. Let’s be honest: who knows which customers buy at a shop? We might know the ones with a membership card, but that’s a biased subset!

I believe that some information about the entire population is already hidden in a biased dataset, and my research aims to uncover it and generate data to fix the bias. To that end, I investigate the dataset’s distribution and search for flaws that point out where to generate data. I am aware that my solution will not work in all cases, but if it can reliably warn the user where it doesn’t work, we can use it as a fixed component in every Data Analysis or Machine Learning preprocessing pipeline and achieve cleaner data as well as better models!
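To give a flavour of the intuition, here is a minimal toy sketch (not my actual method; the data, bin counts, and threshold below are made up purely for illustration): fit a simple density model to the data you have, compare what you observe with what the model expects, and treat regions that fall short as candidates for bias.

```python
# Toy illustration only -- not the Imitate/Mimic implementations from the
# publications below. Fit a Gaussian to a 1D sample, compare observed bin
# counts with the counts the Gaussian would predict, and flag bins that
# look underrepresented.
import numpy as np

rng = np.random.default_rng(42)

# Simulate a biased sample: the "true" population is N(0, 1), but values
# above 1 were only rarely collected (selection bias).
population = rng.normal(0.0, 1.0, size=5000)
keep = (population < 1.0) | (rng.random(5000) < 0.1)
biased_sample = population[keep]

# Fit a Gaussian to the biased sample -- the only data we actually have.
mu, sigma = biased_sample.mean(), biased_sample.std()

# Compare observed bin counts with what the fitted Gaussian would expect.
counts, edges = np.histogram(biased_sample, bins=20)
centers = (edges[:-1] + edges[1:]) / 2
width = edges[1] - edges[0]
expected = len(biased_sample) * width * (
    np.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
)

# Bins with far fewer points than expected hint at underrepresented regions,
# i.e. places where one might generate additional data to counter the bias.
suspicious = centers[counts < 0.5 * expected]
print("Possibly underrepresented around:", np.round(suspicious, 2))
```

The methods in the publications listed below are, of course, far more careful about how the density is modelled and about when a deviation actually counts as suspicious.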
In order to achieve this goal, my days are mostly spent in front of the computer. On a good day, I have a great idea and implement and test it. Often enough, though, I don’t know how to move on or how to solve my problems. Then I spend the day searching the literature for solutions and inspiration or discussing my problem with my fellow researchers. I also often try to support my peers with their own problems; it always feels good to help, and I learn about other cool topics and perspectives along the way.
What I enjoy most about my research are the moments when suddenly something works after a long series of failures. From the outside, it might not always look like much, but for me every small step towards achieving my goal is exciting.
Of course, there are not only enjoyable moments when you are trying to push the boundaries of knowledge but also rather frustrating and challenging ones. These include research-related questions like “How can I break that huge problem into bite-sized chunks?”, “Is there a method I can use to solve my problem, or do I have to come up with my own solution?” or “Is my research worth being published?”, efficiency-related questions like “Is there a Python library for that?”, but also existential questions like “Am I at the right place here? Am I smart enough for a PhD?”.
For me personally, it always helps to first narrow a challenge down to the exact problem, and if I have not already found a solution along the way, I talk to my supervisors or peers – someone will drop the right keyword and spark a new idea. It surprised me, but presenting my research to different audiences has also helped me believe in what I am doing and in the real impact it can have if I succeed.
If I could travel back in time and talk to my younger research self, I would tell myself to be less nervous about the decision to start a PhD – I chose the best supervisors I could have asked for (which is the crucial part of doing a successful PhD in my opinion!) and the University of Auckland is a great place to study.
@article{lyu2023regional,
title = {Regional Bias in Monolingual English Language Models},
author = {Jiachen Lyu and Katharina Dost and Yun Sing Koh and J\"{o}rg Wicker},
url = {https://link.springer.com/article/10.1007/s10994-024-06555-6
https://dx.doi.org/10.21203/rs.3.rs-3713494/v1},
doi = {10.1007/s10994-024-06555-6},
issn = {1573-0565},
year = {2024},
date = {2024-07-09},
urldate = {2024-07-09},
journal = {Machine Learning},
abstract = { In Natural Language Processing (NLP), pre-trained language models (LLMs) are widely employed and refined for various tasks. These models have shown considerable social and geographic biases creating skewed or even unfair representations of certain groups. Research focuses on biases toward L2 (English as a second language) regions but neglects bias within L1 (first language) regions. In this work, we ask if there is regional bias within L1 regions already inherent in pre-trained LLMs and, if so, what the consequences are in terms of downstream model performance. We contribute an investigation framework specifically tailored for low-resource regions, offering a method to identify bias without imposing strict requirements for labeled datasets. Our research reveals subtle geographic variations in the word embeddings of BERT, even in cultures traditionally perceived as similar. These nuanced features, once captured, have the potential to significantly impact downstream tasks. Generally, models exhibit comparable performance on datasets that share similarities, and conversely, performance may diverge when datasets differ in their nuanced features embedded within the language. It is crucial to note that estimating model performance solely based on standard benchmark datasets may not necessarily apply to the datasets with distinct features from the benchmark datasets. Our proposed framework plays a pivotal role in identifying and addressing biases detected in word embeddings, particularly evident in low-resource regions such as New Zealand.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
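To give a rough feel for the kind of signal this paper looks at, here is a small sketch (emphatically not the investigation framework from the paper; the model choice and the two regional example sentences are placeholders I made up): embed region-specific phrasings of the same situation with a pre-trained BERT model and compare the resulting vectors.

```python
# Rough illustration only: compare contextual BERT embeddings of regionally
# different phrasings of the same situation. The sentences are invented
# examples, and this is not the framework from the paper above.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = {
    "NZ": "We went tramping in the bush over the long weekend.",
    "US": "We went hiking in the woods over the long weekend.",
}

def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pool the last hidden layer as a simple sentence representation."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

emb = {region: sentence_embedding(text) for region, text in sentences.items()}
similarity = torch.nn.functional.cosine_similarity(emb["NZ"], emb["US"], dim=0)
print(f"Cosine similarity, NZ vs. US phrasing: {similarity.item():.3f}")
```

In the paper itself, such embedding differences are studied systematically and linked to downstream performance, particularly for low-resource regions such as New Zealand.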
@article{Dost2023Combatting,
title = {Combatting over-specialization bias in growing chemical databases},
author = {Katharina Dost and Zac Pullar-Strecker and Liam Brydon and Kunyang Zhang and Jasmin Hafner and Pat Riddle and J\"{o}rg Wicker},
url = {https://jcheminf.biomedcentral.com/articles/10.1186/s13321-023-00716-w},
doi = {10.1186/s13321-023-00716-w},
issn = {1758-2946},
year = {2023},
date = {2023-05-19},
urldate = {2023-05-19},
journal = {Journal of Cheminformatics},
volume = {15},
issue = {1},
pages = {53},
abstract = {Background
Predicting in advance the behavior of new chemical compounds can support the design process of new products by directing the research toward the most promising candidates and ruling out others. Such predictive models can be data-driven using Machine Learning or based on researchers’ experience and depend on the collection of past results. In either case: models (or researchers) can only make reliable assumptions about compounds that are similar to what they have seen before. Therefore, consequent usage of these predictive models shapes the dataset and causes a continuous specialization shrinking the applicability domain of all trained models on this dataset in the future, and increasingly harming model-based exploration of the space.
Proposed solution
In this paper, we propose cancels (CounterActiNg Compound spEciaLization biaS), a technique that helps to break the dataset specialization spiral. Aiming for a smooth distribution of the compounds in the dataset, we identify areas in the space that fall short and suggest additional experiments that help bridge the gap. Thereby, we generally improve the dataset quality in an entirely unsupervised manner and create awareness of potential flaws in the data. cancels does not aim to cover the entire compound space and hence retains a desirable degree of specialization to a specified research domain.
Results
An extensive set of experiments on the use-case of biodegradation pathway prediction not only reveals that the bias spiral can indeed be observed but also that cancels produces meaningful results. Additionally, we demonstrate that mitigating the observed bias is crucial as it cannot only intervene with the continuous specialization process, but also significantly improves a predictor’s performance while reducing the number of required experiments. Overall, we believe that cancels can support researchers in their experimentation process to not only better understand their data and potential flaws, but also to grow the dataset in a sustainable way. All code is available under github.com/KatDost/Cancels.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
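As a loose illustration of the idea (not the released CANCELS implementation, which is linked above; the SMILES strings, projection, and density estimate are placeholders I chose to keep the example runnable): represent compounds as fingerprints, estimate how densely the existing dataset covers the space, and rank candidate compounds by how poorly covered their neighbourhood is.

```python
# Loose sketch of the idea behind CANCELS, not the released implementation:
# find candidate compounds that sit in sparsely covered parts of the dataset's
# fingerprint space, so testing them would fill gaps rather than extend the
# space arbitrarily. SMILES strings and model choices are placeholders.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from scipy.stats import gaussian_kde
from sklearn.decomposition import PCA

def fingerprint(smiles: str) -> np.ndarray:
    """1024-bit Morgan fingerprint as a numpy array."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
    return np.array(list(fp), dtype=float)

dataset_smiles = ["CCO", "CCN", "CCC", "CC(=O)O", "c1ccccc1", "CCOC(=O)C"]
candidate_smiles = ["CCCl", "c1ccccc1O", "CCCCCCCC", "CC(C)O"]

X = np.array([fingerprint(s) for s in dataset_smiles])
C = np.array([fingerprint(s) for s in candidate_smiles])

# Project everything into a low-dimensional space spanned by the dataset.
pca = PCA(n_components=2).fit(X)
X2, C2 = pca.transform(X), pca.transform(C)

# Estimate how densely the dataset covers each candidate's neighbourhood;
# low density suggests a gap worth filling with an additional experiment.
density = gaussian_kde(X2.T)
scores = density(C2.T)
for score, smi in sorted(zip(scores, candidate_smiles)):
    print(f"{smi}: dataset density {score:.4f}")
```

According to the paper, the actual tool additionally makes sure the suggestions stay within the research domain of interest rather than spreading over the entire compound space.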
@inproceedings{dost2022divide,
title = {Divide and Imitate: Multi-Cluster Identification and Mitigation of Selection Bias},
author = {Katharina Dost and Hamish Duncanson and Ioannis Ziogas and Pat Riddle and J\"{o}rg Wicker},
url = {https://link.springer.com/chapter/10.1007/978-3-031-05936-0_12
https://github.com/KatDost/Mimic
https://pypi.org/project/imitatebias},
doi = {10.1007/978-3-031-05936-0_12},
isbn = {978-3-031-05935-3},
year = {2022},
date = {2022-05-16},
urldate = {2022-05-16},
booktitle = {26th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD2022)},
pages = {149-160},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
abstract = {Machine Learning can help overcome human biases in decision making by focusing on purely logical conclusions based on the training data. If the training data is biased, however, that bias will be transferred to the model and remains undetected as the performance is validated on a test set drawn from the same biased distribution. Existing strategies for selection bias identification and mitigation generally rely on some sort of knowledge of the bias or the ground-truth. An exception is the Imitate algorithm that assumes no knowledge but comes with a strong limitation: It can only model datasets with one normally distributed cluster per class. In this paper, we introduce a novel algorithm, Mimic, which uses Imitate as a building block but relaxes this limitation. By allowing mixtures of multivariate Gaussians, our technique is able to model multi-cluster datasets and provide solutions for a substantially wider set of problems. Experiments confirm that Mimic not only identifies potential biases in multi-cluster datasets which can be corrected early on but also improves classifier performance.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
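The key relaxation over Imitate is the density model: one multivariate Gaussian per class becomes a mixture of multivariate Gaussians. A minimal sketch of that modelling step (using scikit-learn rather than the released imitatebias package, and with synthetic data I generated for illustration) could look like this:

```python
# Illustration of the modelling step Mimic relaxes: instead of assuming one
# Gaussian cluster per class, fit a mixture of multivariate Gaussians.
# This is not the released implementation (see the imitatebias package);
# the data and the component range are made up.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two clusters belonging to the same class -- a single Gaussian fits poorly.
cluster_a = rng.multivariate_normal([0, 0], np.eye(2), size=300)
cluster_b = rng.multivariate_normal([6, 6], np.eye(2), size=300)
X = np.vstack([cluster_a, cluster_b])

# Let BIC pick a reasonable number of components.
best = min(
    (GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(1, 5)),
    key=lambda gmm: gmm.bic(X),
)
print("Chosen number of components:", best.n_components)

# Per-sample log-density under the fitted mixture; unusually low values mark
# sparse regions that could hint at underrepresented parts of the population.
log_density = best.score_samples(X)
threshold = np.quantile(log_density, 0.02)
print("Points in suspiciously sparse regions:", int((log_density < threshold).sum()))
```

Per the abstract above, Mimic then uses Imitate as a building block on top of such a mixture model, one component at a time.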
@inproceedings{dost2020your,
title = {Your Best Guess When You Know Nothing: Identification and Mitigation of Selection Bias},
author = {Katharina Dost and Katerina Taskova and Pat Riddle and J\"{o}rg Wicker},
url = {https://ieeexplore.ieee.org/document/9338355
https://github.com/KatDost/Imitate
https://pypi.org/project/imitatebias/},
doi = {10.1109/ICDM50108.2020.00115},
issn = {2374-8486},
year = {2020},
date = {2020-11-17},
urldate = {2020-11-17},
booktitle = {2020 IEEE International Conference on Data Mining (ICDM)},
pages = {996-1001},
publisher = {IEEE},
abstract = {Machine Learning typically assumes that training and test set are independently drawn from the same distribution, but this assumption is often violated in practice which creates a bias. Many attempts to identify and mitigate this bias have been proposed, but they usually rely on ground-truth information. But what if the researcher is not even aware of the bias?
In contrast to prior work, this paper introduces a new method, Imitate, to identify and mitigate Selection Bias in the case that we may not know if (and where) a bias is present, and hence no ground-truth information is available.
Imitate investigates the dataset's probability density, then adds generated points in order to smooth out the density and have it resemble a Gaussian, the most common density occurring in real-world applications. If the artificial points focus on certain areas and are not widespread, this could indicate a Selection Bias where these areas are underrepresented in the sample.
We demonstrate the effectiveness of the proposed method in both, synthetic and real-world datasets. We also point out limitations and future research directions.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
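Extending the toy example from the beginning of this post, here is a sketch of the generate-and-inspect step described in the abstract (again only a reconstruction for illustration; the released implementation is the imitatebias package linked above): generate artificial points wherever the observed density clearly falls short of the fitted Gaussian, then check whether those points concentrate in one region.

```python
# Toy reconstruction of the idea in the abstract, not the released Imitate
# code (see https://pypi.org/project/imitatebias for the actual package).
import numpy as np

rng = np.random.default_rng(7)

# Biased 1D sample: everything above 1.0 was never collected.
sample = rng.normal(0.0, 1.0, size=3000)
sample = sample[sample < 1.0]

# Fit a Gaussian to the biased sample and histogram the data over a range
# wide enough to also cover the tails the sample might be missing.
mu, sigma = sample.mean(), sample.std()
counts, edges = np.histogram(sample, bins=30, range=(mu - 4 * sigma, mu + 4 * sigma))
centers = (edges[:-1] + edges[1:]) / 2
width = edges[1] - edges[0]
expected = len(sample) * width * (
    np.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
)

# Generate artificial points in bins whose shortfall clearly exceeds
# Poisson-level noise.
deficit = expected - counts
deficit = np.where(deficit > 2.0 * np.sqrt(expected), deficit, 0.0)
generated = np.repeat(centers, np.rint(deficit).astype(int))

# If the generated points cluster in a narrow region instead of being spread
# out evenly, that hints at a selection bias in exactly that region.
if generated.size:
    print(f"Generated {generated.size} points around {generated.mean():.2f} "
          f"(spread {generated.std():.2f} vs. data spread {sigma:.2f})")
```

Note that the fitted Gaussian is itself already distorted by the bias it is trying to expose, which is part of what makes the real problem hard; if the generated points were spread evenly instead of concentrating, the deficits would look like ordinary sampling noise rather than a Selection Bias.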