Publications

6.

Dost, Katharina; Albrecht, Steffen; MacLean, Paul; Wicker, Jörg; Gupta, Sandeep

Understanding Rumen Methanogen Interactions in Sheep Using Machine Learning Proceedings Article

In: Lecture Notes in Computer Science, pp. 253-269, Springer Nature, 2025, ISSN: 0302-9743.

5.

Miller, Catriona J; Golovina, Evgenija; Gokuladhas, Sreemol; Wicker, Jörg; Jacobson, Jessie C; O'Sullivan, Justin M

Unraveling ADHD: genes, co-occurring traits, and developmental dynamics Journal Article

In: Life Science Alliance, vol. 8, no. 5, 2025.

@article{miller2025unraveling,

title = {Unraveling ADHD: genes, co-occurring traits, and developmental dynamics},

author = {Catriona J Miller and Evgenija Golovina and Sreemol Gokuladhas and J\"{o}rg Wicker and Jessie C Jacobson and Justin M O'Sullivan},

doi = {10.26508/lsa.202403029},

year  = {2025},

date = {2025-02-25},

journal = {Life Science Alliance},

volume = {8},

number = {5},

abstract = {Attention-deficit/hyperactivity disorder (ADHD) is a heterogeneous neurodevelopmental condition with a high prevalence of co-occurring conditions, contributing to increased difficulty in long-term management. Genome-wide association studies have identified variants shared between ADHD and co-occurring psychiatric disorders; however, the genetic mechanisms are not fully understood. We integrated gene expression and spatial organization data into a two-sample Mendelian randomization study for putatively causal ADHD genes in fetal and adult cortical tissues. We identified four genes putatively causal for ADHD in cortical tissues (fetal: ST3GAL3, PTPRF, PIDD1; adult: ST3GAL3, TIE1). Protein{\textendash}protein interaction databases seeded with the causal ADHD genes identified biological pathways linking these genes with conditions (e.g., rheumatoid arthritis) and biomarkers (e.g., lymphocyte counts) known to be associated with ADHD, but without previously shown genetic relationships. The analysis was repeated on adult liver tissue, where putatively causal ADHD gene ST3GAL3 was linked to cholesterol traits. This analysis provides insight into the tissue-dependent temporal relationships between ADHD, co-occurring traits, and biomarkers. Importantly, it delivers evidence for the genetic interplay between co-occurring conditions, both previously studied and unstudied, with ADHD. The multimorbid3D pipeline was created and run in Python (version 3.8.8). All visualizations and data analysis were performed in R (version 4.2.0) through RStudio (version 2022.02.2). Table S16 lists the datasets and software that have been used in our analyses. All scripts are available on GitHub (https://github.com/Catriona-Miller/ADHD_Co-occurring_Traits). Table S16. Software and datasets used for this analysis. Ethics statement: Ethics approval was obtained from the University of Auckland Human Participants Ethics Committee (Decoding SNPs in context, UAHPEC19373).},

keywords = {bioinformatics, Biological Sciences, biomarkers, computational sustainability, machine learning},

pubstate = {published},

tppubtype = {article}

}


4.

Hafner, Jasmin; Lorsbach, Tim; Schmidt, Sebastian; Brydon, Liam; Dost, Katharina; Zhang, Kunyang; Fenner, Kathrin; Wicker, Jörg

Advancements in Biotransformation Pathway Prediction: Enhancements, Datasets, and Novel Functionalities in enviPath Journal Article

In: Journal of Cheminformatics, vol. 16, no. 1, pp. 93, 2024, ISSN: 1758-2946.

3.

Miller, Catriona J; Golovina, Evgenija; Wicker, Jörg; Jacobson, Jessie C; O'Sullivan, Justin M

De novo network analysis reveals autism causal genes and developmental links to co-occurring traits Journal Article

In: Life Science Alliance, vol. 6, no. 10, 2023.

2.

Poonawala-Lohani, Nooriyan; Riddle, Pat; Adnan, Mehnaz; Wicker, Jörg

Geographic Ensembles of Observations using Randomised Ensembles of Autoregression Chains: Ensemble methods for spatio-temporal Time Series Forecasting of Influenza-like Illness Proceedings Article

In: pp. 1-7, Association for Computing Machinery, New York, NY, USA, 2022, ISBN: 9781450393867.

@inproceedings{Poonawala-Lohani2022geographic,

title = {Geographic Ensembles of Observations using Randomised Ensembles of Autoregression Chains: Ensemble methods for spatio-temporal Time Series Forecasting of Influenza-like Illness},

author = {Nooriyan Poonawala-Lohani and Pat Riddle and Mehnaz Adnan and J\"{o}rg Wicker},

doi = {10.1145/3535508.3545562},

isbn = {9781450393867},

year  = {2022},

date = {2022-08-07},

pages = {1-7},

publisher = {Association for Computing Machinery},

address = {New York, NY, USA},

abstract = {Influenza is a communicable respiratory illness that can cause serious public health hazards. Flu surveillance in New Zealand tracks case counts from various District Health Boards (DHBs) in the country to monitor the spread of influenza in different geographic locations. Many factors contribute to the spread of influenza across a geographic region, and it can be challenging to forecast cases in one region without taking into account case numbers in another region. This paper proposes a novel ensemble method called Geographic Ensembles of Observations using Randomised Ensembles of Autoregression Chains (GEO-Reach). GEO-Reach is an ensemble technique that uses a two-layer approach to utilise the interdependence of historical case counts between geographic regions in New Zealand. This work extends a previously published method by the authors called Randomized Ensembles of Auto-regression chains (Reach). State-of-the-art forecasting models study the spread of the virus. They focus on accurate forecasting of cases for a location using historical case counts for the same location and other data sources based on human behaviour, such as movement of people across cities/geographic regions. The new approach is evaluated using influenza-like illness (ILI) case counts in 7 major regions in New Zealand from the years 2015-2019, and its performance is compared with other standard methods such as Dante, ARIMA, Autoregression and Random Forests. The results demonstrate that the proposed method performed better than baseline methods when applied to this multivariate time series forecasting problem.},

keywords = {bioinformatics, computational sustainability, dynamic time warping, forecasting, influenza, machine learning, medicine, time series},

pubstate = {published},

tppubtype = {inproceedings}

}


1.

Wicker, Jörg

Large Classifier Systems in Bio- and Cheminformatics PhD Thesis

Technische Universität München, 2013.


@phdthesis{wicker2013large,

title = {Large Classifier Systems in Bio- and Cheminformatics},

author = {J\"{o}rg Wicker},

url = {http://mediatum.ub.tum.de/node?id=1165858},

year  = {2013},

date = {2013-01-01},

school = {Technische Universit\"{a}t M\"{u}nchen},

abstract = {Large classifier systems are machine learning algorithms that use multiple 

classifiers to improve the prediction of target values in advanced 

classification tasks. Although learning problems in bio- and 

cheminformatics commonly provide data in schemes suitable for large 

classifier systems, they are rarely used in these domains. This thesis 

introduces two new classifiers incorporating systems of classifiers 

using Boolean matrix decomposition to handle data in a schema that 

often occurs in bio- and cheminformatics. 

 

The first approach, called MLC-BMaD (multi-label classification using 

Boolean matrix decomposition), uses Boolean matrix decomposition to 

decompose the labels in a multi-label classification task. The 

decomposed matrices are a compact representation of the information 

in the labels (first matrix) and the dependencies among the labels 

(second matrix). The first matrix is used in a further multi-label 

classification while the second matrix is used to generate the final 

matrix from the predicted values of the first matrix. 

MLC-BMaD was evaluated on six standard multi-label data sets; the 

experiments showed that MLC-BMaD can perform particularly well on data 

sets with a high number of labels and a small number of instances and 

can outperform standard multi-label algorithms. 

Subsequently, MLC-BMaD is extended to a special case of 

multi-relational learning, by considering the labels not as simple 

labels, but instances. The algorithm, called ClassFact 

(Classification factorization), uses both matrices in a multi-label 

classification. Each label represents a mapping between two 

instances. 

Experiments on three data sets from the domain of bioinformatics show 

that ClassFact can outperform the baseline method, which merges the 

relations into one, on hard classification tasks. 

 

Furthermore, large classifier systems are used on two cheminformatics 

data sets. The first one is used to predict the environmental fate of 

chemicals by predicting biodegradation pathways. The second is a data 

set from the domain of predictive toxicology. In biodegradation 

pathway prediction, I extend a knowledge-based system and incorporate 

a machine learning approach to predict a probability for 

biotransformation products based on the structure- and knowledge-based 

predictions of products, which are based on transformation rules. The 

use of multi-label classification improves the performance of the 

classifiers and extends the number of transformation rules that can be 

covered. 

For the prediction of toxic effects of chemicals, I applied large 

classifier systems to the ToxCast{\texttrademark} data set, which maps 

toxic effects to chemicals. As the given toxic effects are not easy to 

predict due to missing information and a skewed class 

distribution, I introduce a filtering step in the multi-label 

classification, which finds labels that are usable in multi-label 

prediction and does not take the others in the 

prediction into account. Experiments show 

that this approach can improve upon the baseline method using binary 

classification, as well as multi-label approaches using no filtering. 

 

The presented results show that large classifier systems can play a 

role in future research challenges, especially in bio- and 

cheminformatics, where data sets frequently consist of more complex 

structures and data can be rather small in terms of the number of 

instances compared to other domains.},

keywords = {biodegradation, bioinformatics, cheminformatics, computational sustainability, data mining, enviPath, machine learning, multi-label classification, multi-relational learning, toxicity},

pubstate = {published},

tppubtype = {phdthesis}

}

