Kia Ora! I am Katharina Dost, a PhD student in the School of Computer Science. My research topic is “Identification and Mitigation of Selection Bias”, and I would like to use this post to talk about my research and my experiences, so read on!
Our world runs on data. We gather whatever we can and use it to answer a variety of questions. These can be as simple as “What is the average age of my customers?” but also as critical as “Which treatment is best suited for my disease?”. However, the collected data often does not represent the entire population about which we would like to draw conclusions. This is called Selection Bias, and it typically means that a model learned from the collected data won’t perform well on the entire population. For example, if we only observe birds in South America, a classification model won’t be able to identify a Kiwi (neither the bird, nor the fruit, nor the person).

The good news is: if you are aware of the bias or you know what your entire population looks like, there are options. But if you assume that everything is fine or you simply don’t know your entire population, you will use the biased dataset and train a poorly performing model. Let’s be honest: who knows which customers buy at a shop? We might know the ones with a membership card, but that’s a biased subset!

I believe that some information about the entire population is already hidden in a biased dataset, and my research aims to uncover it and generate data to fix the bias. To that end, I investigate the dataset’s distribution and search for flaws that point out where to generate data. I am aware that my solution will not work in all cases, but if it can reliably warn the user where it doesn’t work, we can use it as a fixed component in every Data Analysis or Machine Learning preprocessing pipeline and achieve cleaner data as well as better models!
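To give a flavour of the intuition, here is a minimal toy sketch (not my actual method; the data, bin counts, and threshold below are made up purely for illustration): fit a simple density model to the data you have, compare what you observe with what the model expects, and treat regions that fall short as candidates for bias.

```python
# Toy illustration only -- not the Imitate/Mimic implementations from the
# publications below. Fit a Gaussian to a 1D sample, compare observed bin
# counts with the counts the Gaussian would predict, and flag bins that
# look underrepresented.
import numpy as np

rng = np.random.default_rng(42)

# Simulate a biased sample: the "true" population is N(0, 1), but values
# above 1 were only rarely collected (selection bias).
population = rng.normal(0.0, 1.0, size=5000)
keep = (population < 1.0) | (rng.random(5000) < 0.1)
biased_sample = population[keep]

# Fit a Gaussian to the biased sample -- the only data we actually have.
mu, sigma = biased_sample.mean(), biased_sample.std()

# Compare observed bin counts with what the fitted Gaussian would expect.
counts, edges = np.histogram(biased_sample, bins=20)
centers = (edges[:-1] + edges[1:]) / 2
width = edges[1] - edges[0]
expected = len(biased_sample) * width * (
    np.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
)

# Bins with far fewer points than expected hint at underrepresented regions,
# i.e. places where one might generate additional data to counter the bias.
suspicious = centers[counts < 0.5 * expected]
print("Possibly underrepresented around:", np.round(suspicious, 2))
```

The methods in the publications listed below are, of course, far more careful about how the density is modelled and about when a deviation actually counts as suspicious.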
In order to achieve this goal, my days are mostly spent in front of the computer. On a good day, I have a great idea and implement and test it. Often enough, though, I don’t know how to move on or how to solve my problems. Then I spend the day searching the literature for solutions and inspiration or discussing my problem with my fellow researchers. I also often try to support my peers with their own problems; it always feels good to help, and I learn about other cool topics and perspectives along the way.
What I enjoy most about my research are the moments when suddenly something works after a long series of failures. From the outside, it might not always look like much, but for me every small step towards achieving my goal is exciting.
Of course, there are not only enjoyable moments when you are trying to push the boundaries of knowledge but also rather frustrating and challenging ones. These include research-related questions like “How can I break that huge problem into bite-sized chunks?”, “Is there a method I can use to solve my problem, or do I have to come up with my own solution?” or “Is my research worth being published?”, efficiency-related questions like “Is there a Python library for that?”, but also existential questions like “Am I at the right place here? Am I smart enough for a PhD?”.
For me personally, it always helps to first narrow a challenge down to the exact problem, and if I have not already found a solution along the way, I talk to my supervisors or peers – someone will drop the right keyword and spark a new idea. It surprised me, but presenting my research to different audiences has also helped me believe in what I am doing and in the real impact it can have if I succeed.
If I could travel back in time and talk to my younger research self, I would tell myself to be less nervous about the decision to start a PhD – I chose the best supervisors I could have asked for (which is the crucial part of doing a successful PhD in my opinion!) and the University of Auckland is a great place to study.
@article{lyu2023regional,
title = {Regional Bias in Monolingual English Language Models},
author = {Jiachen Lyu and Katharina Dost and Yun Sing Koh and J\"{o}rg Wicker},
url = {https://link.springer.com/article/10.1007/s10994-024-06555-6
https://dx.doi.org/10.21203/rs.3.rs-3713494/v1},
doi = {10.1007/s10994-024-06555-6},
issn = {1573-0565},
year = {2024},
date = {2024-07-09},
urldate = {2024-07-09},
journal = {Machine Learning},
abstract = { In Natural Language Processing (NLP), pre-trained language models (LLMs) are widely employed and refined for various tasks. These models have shown considerable social and geographic biases creating skewed or even unfair representations of certain groups. Research focuses on biases toward L2 (English as a second language) regions but neglects bias within L1 (first language) regions. In this work, we ask if there is regional bias within L1 regions already inherent in pre-trained LLMs and, if so, what the consequences are in terms of downstream model performance. We contribute an investigation framework specifically tailored for low-resource regions, offering a method to identify bias without imposing strict requirements for labeled datasets. Our research reveals subtle geographic variations in the word embeddings of BERT, even in cultures traditionally perceived as similar. These nuanced features, once captured, have the potential to significantly impact downstream tasks. Generally, models exhibit comparable performance on datasets that share similarities, and conversely, performance may diverge when datasets differ in their nuanced features embedded within the language. It is crucial to note that estimating model performance solely based on standard benchmark datasets may not necessarily apply to the datasets with distinct features from the benchmark datasets. Our proposed framework plays a pivotal role in identifying and addressing biases detected in word embeddings, particularly evident in low-resource regions such as New Zealand.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
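To give a rough feel for the kind of signal this paper looks at, here is a small sketch (emphatically not the investigation framework from the paper; the model choice and the two regional example sentences are placeholders I made up): embed region-specific phrasings of the same situation with a pre-trained BERT model and compare the resulting vectors.

```python
# Rough illustration only: compare contextual BERT embeddings of regionally
# different phrasings of the same situation. The sentences are invented
# examples, and this is not the framework from the paper above.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = {
    "NZ": "We went tramping in the bush over the long weekend.",
    "US": "We went hiking in the woods over the long weekend.",
}

def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pool the last hidden layer as a simple sentence representation."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

emb = {region: sentence_embedding(text) for region, text in sentences.items()}
similarity = torch.nn.functional.cosine_similarity(emb["NZ"], emb["US"], dim=0)
print(f"Cosine similarity, NZ vs. US phrasing: {similarity.item():.3f}")
```

In the paper itself, such embedding differences are studied systematically and linked to downstream performance, particularly for low-resource regions such as New Zealand.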
@article{Dost2023Combatting,
title = {Combatting over-specialization bias in growing chemical databases},
author = {Katharina Dost and Zac Pullar-Strecker and Liam Brydon and Kunyang Zhang and Jasmin Hafner and Pat Riddle and J\"{o}rg Wicker},
url = {https://jcheminf.biomedcentral.com/articles/10.1186/s13321-023-00716-w},
doi = {10.1186/s13321-023-00716-w},
issn = {1758-2946},
year = {2023},
date = {2023-05-19},
urldate = {2023-05-19},
journal = {Journal of Cheminformatics},
volume = {15},
issue = {1},
pages = {53},
abstract = {Background
Predicting in advance the behavior of new chemical compounds can support the design process of new products by directing the research toward the most promising candidates and ruling out others. Such predictive models can be data-driven using Machine Learning or based on researchers’ experience and depend on the collection of past results. In either case: models (or researchers) can only make reliable assumptions about compounds that are similar to what they have seen before. Therefore, consequent usage of these predictive models shapes the dataset and causes a continuous specialization shrinking the applicability domain of all trained models on this dataset in the future, and increasingly harming model-based exploration of the space.
Proposed solution
In this paper, we propose cancels (CounterActiNg Compound spEciaLization biaS), a technique that helps to break the dataset specialization spiral. Aiming for a smooth distribution of the compounds in the dataset, we identify areas in the space that fall short and suggest additional experiments that help bridge the gap. Thereby, we generally improve the dataset quality in an entirely unsupervised manner and create awareness of potential flaws in the data. cancels does not aim to cover the entire compound space and hence retains a desirable degree of specialization to a specified research domain.
Results
An extensive set of experiments on the use-case of biodegradation pathway prediction not only reveals that the bias spiral can indeed be observed but also that cancels produces meaningful results. Additionally, we demonstrate that mitigating the observed bias is crucial as it cannot only intervene with the continuous specialization process, but also significantly improves a predictor’s performance while reducing the number of required experiments. Overall, we believe that cancels can support researchers in their experimentation process to not only better understand their data and potential flaws, but also to grow the dataset in a sustainable way. All code is available under github.com/KatDost/Cancels.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
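As a loose illustration of the idea (not the released CANCELS implementation, which is linked above; the SMILES strings, projection, and density estimate are placeholders I chose to keep the example runnable): represent compounds as fingerprints, estimate how densely the existing dataset covers the space, and rank candidate compounds by how poorly covered their neighbourhood is.

```python
# Loose sketch of the idea behind CANCELS, not the released implementation:
# find candidate compounds that sit in sparsely covered parts of the dataset's
# fingerprint space, so testing them would fill gaps rather than extend the
# space arbitrarily. SMILES strings and model choices are placeholders.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from scipy.stats import gaussian_kde
from sklearn.decomposition import PCA

def fingerprint(smiles: str) -> np.ndarray:
    """1024-bit Morgan fingerprint as a numpy array."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
    return np.array(list(fp), dtype=float)

dataset_smiles = ["CCO", "CCN", "CCC", "CC(=O)O", "c1ccccc1", "CCOC(=O)C"]
candidate_smiles = ["CCCl", "c1ccccc1O", "CCCCCCCC", "CC(C)O"]

X = np.array([fingerprint(s) for s in dataset_smiles])
C = np.array([fingerprint(s) for s in candidate_smiles])

# Project everything into a low-dimensional space spanned by the dataset.
pca = PCA(n_components=2).fit(X)
X2, C2 = pca.transform(X), pca.transform(C)

# Estimate how densely the dataset covers each candidate's neighbourhood;
# low density suggests a gap worth filling with an additional experiment.
density = gaussian_kde(X2.T)
scores = density(C2.T)
for score, smi in sorted(zip(scores, candidate_smiles)):
    print(f"{smi}: dataset density {score:.4f}")
```

According to the paper, the actual tool additionally makes sure the suggestions stay within the research domain of interest rather than spreading over the entire compound space.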
@inproceedings{dost2022divide,
title = {Divide and Imitate: Multi-Cluster Identification and Mitigation of Selection Bias},
author = {Katharina Dost and Hamish Duncanson and Ioannis Ziogas and Pat Riddle and J\"{o}rg Wicker},
url = {https://link.springer.com/chapter/10.1007/978-3-031-05936-0_12
https://github.com/KatDost/Mimic
https://pypi.org/project/imitatebias},
doi = {10.1007/978-3-031-05936-0_12},
isbn = {978-3-031-05935-3},
year = {2022},
date = {2022-05-16},
urldate = {2022-05-16},
booktitle = {26th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD2022)},
pages = {149-160},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
abstract = {Machine Learning can help overcome human biases in decision making by focusing on purely logical conclusions based on the training data. If the training data is biased, however, that bias will be transferred to the model and remains undetected as the performance is validated on a test set drawn from the same biased distribution. Existing strategies for selection bias identification and mitigation generally rely on some sort of knowledge of the bias or the ground-truth. An exception is the Imitate algorithm that assumes no knowledge but comes with a strong limitation: It can only model datasets with one normally distributed cluster per class. In this paper, we introduce a novel algorithm, Mimic, which uses Imitate as a building block but relaxes this limitation. By allowing mixtures of multivariate Gaussians, our technique is able to model multi-cluster datasets and provide solutions for a substantially wider set of problems. Experiments confirm that Mimic not only identifies potential biases in multi-cluster datasets which can be corrected early on but also improves classifier performance.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
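The key relaxation over Imitate is the density model: one multivariate Gaussian per class becomes a mixture of multivariate Gaussians. A minimal sketch of that modelling step (using scikit-learn rather than the released imitatebias package, and with synthetic data I generated for illustration) could look like this:

```python
# Illustration of the modelling step Mimic relaxes: instead of assuming one
# Gaussian cluster per class, fit a mixture of multivariate Gaussians.
# This is not the released implementation (see the imitatebias package);
# the data and the component range are made up.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two clusters belonging to the same class -- a single Gaussian fits poorly.
cluster_a = rng.multivariate_normal([0, 0], np.eye(2), size=300)
cluster_b = rng.multivariate_normal([6, 6], np.eye(2), size=300)
X = np.vstack([cluster_a, cluster_b])

# Let BIC pick a reasonable number of components.
best = min(
    (GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(1, 5)),
    key=lambda gmm: gmm.bic(X),
)
print("Chosen number of components:", best.n_components)

# Per-sample log-density under the fitted mixture; unusually low values mark
# sparse regions that could hint at underrepresented parts of the population.
log_density = best.score_samples(X)
threshold = np.quantile(log_density, 0.02)
print("Points in suspiciously sparse regions:", int((log_density < threshold).sum()))
```

Per the abstract above, Mimic then uses Imitate as a building block on top of such a mixture model, one component at a time.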
@inproceedings{dost2020your,
title = {Your Best Guess When You Know Nothing: Identification and Mitigation of Selection Bias},
author = {Katharina Dost and Katerina Taskova and Pat Riddle and J\"{o}rg Wicker},
url = {https://ieeexplore.ieee.org/document/9338355
https://github.com/KatDost/Imitate
https://pypi.org/project/imitatebias/},
doi = {10.1109/ICDM50108.2020.00115},
issn = {2374-8486},
year = {2020},
date = {2020-11-17},
urldate = {2020-11-17},
booktitle = {2020 IEEE International Conference on Data Mining (ICDM)},
pages = {996-1001},
publisher = {IEEE},
abstract = {Machine Learning typically assumes that training and test set are independently drawn from the same distribution, but this assumption is often violated in practice which creates a bias. Many attempts to identify and mitigate this bias have been proposed, but they usually rely on ground-truth information. But what if the researcher is not even aware of the bias?
In contrast to prior work, this paper introduces a new method, Imitate, to identify and mitigate Selection Bias in the case that we may not know if (and where) a bias is present, and hence no ground-truth information is available.
Imitate investigates the dataset's probability density, then adds generated points in order to smooth out the density and have it resemble a Gaussian, the most common density occurring in real-world applications. If the artificial points focus on certain areas and are not widespread, this could indicate a Selection Bias where these areas are underrepresented in the sample.
We demonstrate the effectiveness of the proposed method in both, synthetic and real-world datasets. We also point out limitations and future research directions.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
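Extending the toy example from the beginning of this post, here is a sketch of the generate-and-inspect step described in the abstract (again only a reconstruction for illustration; the released implementation is the imitatebias package linked above): generate artificial points wherever the observed density clearly falls short of the fitted Gaussian, then check whether those points concentrate in one region.

```python
# Toy reconstruction of the idea in the abstract, not the released Imitate
# code (see https://pypi.org/project/imitatebias for the actual package).
import numpy as np

rng = np.random.default_rng(7)

# Biased 1D sample: everything above 1.0 was never collected.
sample = rng.normal(0.0, 1.0, size=3000)
sample = sample[sample < 1.0]

# Fit a Gaussian to the biased sample and histogram the data over a range
# wide enough to also cover the tails the sample might be missing.
mu, sigma = sample.mean(), sample.std()
counts, edges = np.histogram(sample, bins=30, range=(mu - 4 * sigma, mu + 4 * sigma))
centers = (edges[:-1] + edges[1:]) / 2
width = edges[1] - edges[0]
expected = len(sample) * width * (
    np.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
)

# Generate artificial points in bins whose shortfall clearly exceeds
# Poisson-level noise.
deficit = expected - counts
deficit = np.where(deficit > 2.0 * np.sqrt(expected), deficit, 0.0)
generated = np.repeat(centers, np.rint(deficit).astype(int))

# If the generated points cluster in a narrow region instead of being spread
# out evenly, that hints at a selection bias in exactly that region.
if generated.size:
    print(f"Generated {generated.size} points around {generated.mean():.2f} "
          f"(spread {generated.std():.2f} vs. data spread {sigma:.2f})")
```

Note that the fitted Gaussian is itself already distorted by the bias it is trying to expose, which is part of what makes the real problem hard; if the generated points were spread evenly instead of concentrating, the deficits would look like ordinary sampling noise rather than a Selection Bias.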