Publications

27.

Long, Derek; Eade, Liam; Dost, Katharina; Meier-Menches, Samuel M; Goldstone, David C; Sullivan, Matthew P; Hartinger, Christian; Wicker, Jörg; Taskova, Katerina

AdductHunter: Identifying Protein-Metal Complex Adducts in Mass Spectra Journal Article

In: Journal of Cheminformatics, vol. 16, iss. 1, 2024, ISSN: 1758-2946.

26.

Wicker, Jörg; Krauter, Nicolas; Derstorff, Bettina; Stönner, Christof; Bourtsoukidis, Efstratios; Klüpfel, Thomas; Williams, Jonathan; Kramer, Stefan

Cinema Experiments 2013 Miscellaneous

2023.

Links | BibTeX | Altmetric | PlumX | Tags: atmospheric chemistry, cinema data mining, data mining, machine learning, smell of fear, sof

25.

Stönner, Christof; Edtbauer, Achim; Derstorff, Bettina; Bourtsoukidis, Efstratios; Klüpfel, Thomas; Wicker, Jörg; Williams, Jonathan

Cinema Experiments 2015 Miscellaneous

2023.

Links | BibTeX | Altmetric | PlumX | Tags: cinema data mining, data mining, machine learning, smell of fear, sof

24.

Dost, Katharina; Pullar-Strecker, Zac; Brydon, Liam; Zhang, Kunyang; Hafner, Jasmin; Riddle, Pat; Wicker, Jörg

Combatting over-specialization bias in growing chemical databases Journal Article

In: Journal of Cheminformatics, vol. 15, iss. 1, pp. 53, 2023, ISSN: 1758-2946.

@article{Dost2023Combatting,

title = {Combatting over-specialization bias in growing chemical databases},

author = {Katharina Dost and Zac Pullar-Strecker and Liam Brydon and Kunyang Zhang and Jasmin Hafner and Pat Riddle and J\"{o}rg Wicker},

url = {https://jcheminf.biomedcentral.com/articles/10.1186/s13321-023-00716-w



},

doi = {10.1186/s13321-023-00716-w},

issn = {1758-2946},

year  = {2023},

date = {2023-05-19},

urldate = {2023-05-19},

journal = {Journal of Cheminformatics},

volume = {15},

issue = {1},

pages = {53},

abstract = {Background



Predicting in advance the behavior of new chemical compounds can support the design process of new products by directing the research toward the most promising candidates and ruling out others. Such predictive models can be data-driven using Machine Learning or based on researchers’ experience and depend on the collection of past results. In either case: models (or researchers) can only make reliable assumptions about compounds that are similar to what they have seen before. Therefore, consequent usage of these predictive models shapes the dataset and causes a continuous specialization shrinking the applicability domain of all trained models on this dataset in the future, and increasingly harming model-based exploration of the space.

Proposed solution



In this paper, we propose cancels (CounterActiNg Compound spEciaLization biaS), a technique that helps to break the dataset specialization spiral. Aiming for a smooth distribution of the compounds in the dataset, we identify areas in the space that fall short and suggest additional experiments that help bridge the gap. Thereby, we generally improve the dataset quality in an entirely unsupervised manner and create awareness of potential flaws in the data. cancels does not aim to cover the entire compound space and hence retains a desirable degree of specialization to a specified research domain.

Results



An extensive set of experiments on the use-case of biodegradation pathway prediction not only reveals that the bias spiral can indeed be observed but also that cancels produces meaningful results. Additionally, we demonstrate that mitigating the observed bias is crucial as it cannot only intervene with the continuous specialization process, but also significantly improves a predictor’s performance while reducing the number of required experiments. Overall, we believe that cancels can support researchers in their experimentation process to not only better understand their data and potential flaws, but also to grow the dataset in a sustainable way. All code is available under github.com/KatDost/Cancels.},

keywords = {bias, biodegradation, cheminformatics, computational sustainability, data mining, enviPath, machine learning, metabolic pathways, multi-label classification, reliable machine learning},

pubstate = {published},

tppubtype = {article}

}

Close

Background

Predicting in advance the behavior of new chemical compounds can support the design process of new products by directing the research toward the most promising candidates and ruling out others. Such predictive models can be data-driven using Machine Learning or based on researchers’ experience and depend on the collection of past results. In either case: models (or researchers) can only make reliable assumptions about compounds that are similar to what they have seen before. Therefore, consequent usage of these predictive models shapes the dataset and causes a continuous specialization shrinking the applicability domain of all trained models on this dataset in the future, and increasingly harming model-based exploration of the space.
Proposed solution

In this paper, we propose cancels (CounterActiNg Compound spEciaLization biaS), a technique that helps to break the dataset specialization spiral. Aiming for a smooth distribution of the compounds in the dataset, we identify areas in the space that fall short and suggest additional experiments that help bridge the gap. Thereby, we generally improve the dataset quality in an entirely unsupervised manner and create awareness of potential flaws in the data. cancels does not aim to cover the entire compound space and hence retains a desirable degree of specialization to a specified research domain.
Results

An extensive set of experiments on the use-case of biodegradation pathway prediction not only reveals that the bias spiral can indeed be observed but also that cancels produces meaningful results. Additionally, we demonstrate that mitigating the observed bias is crucial as it cannot only intervene with the continuous specialization process, but also significantly improves a predictor’s performance while reducing the number of required experiments. Overall, we believe that cancels can support researchers in their experimentation process to not only better understand their data and potential flaws, but also to grow the dataset in a sustainable way. All code is available under github.com/KatDost/Cancels.

Close

23.

Tam, Jason; Lorsbach, Tim; Schmidt, Sebastian; Wicker, Jörg

Holistic Evaluation of Biodegradation Pathway Prediction: Assessing Multi-Step Reactions and Intermediate Products Journal Article

In: Journal of Cheminformatics, vol. 13, no. 1, pp. 63, 2021.

@article{tam2021holisticb,

title = {Holistic Evaluation of Biodegradation Pathway Prediction: Assessing Multi-Step Reactions and Intermediate Products},

author = {Jason Tam and Tim Lorsbach and Sebastian Schmidt and J\"{o}rg Wicker},

url = {https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00543-x 

https://chemrxiv.org/articles/preprint/Holistic_Evaluation_of_Biodegradation_Pathway_Prediction_Assessing_Multi-Step_Reactions_and_Intermediate_Products/14315963 

https://dx.doi.org/10.26434/chemrxiv.14315963},

doi = {10.1186/s13321-021-00543-x},

year  = {2021},

date = {2021-09-03},

urldate = {2021-09-03},

journal = {Journal of Cheminformatics},

volume = {13},

number = {1},

pages = {63},

abstract = {The prediction of metabolism and biotransformation pathways of xenobiotics is a highly desired tool in environmental sciences, drug discovery, and (eco)toxicology. Several systems predict single transformation steps or complete pathways as series of parallel and subsequent steps. Their performance is commonly evaluated on the level of a single transformation step. Such an approach cannot account for some specific challenges that are caused by specific properties of biotransformation experiments. That is, missing transformation products in the reference data that occur only in low concentrations, e.g. transient intermediates or higher-generation metabolites. Furthermore, some rule-based prediction systems evaluate the performance only based on the defined set of transformation rules. Therefore, the performance of these models cannot be directly compared. In this paper, we introduce a new evaluation framework that extends the evaluation of biotransformation prediction from single transformations to whole pathways, taking into account multiple generations of metabolites. We introduce a procedure to address transient intermediates and propose a weighted scoring system that acknowledges the uncertainty of higher-generation metabolites. We implemented this framework in enviPath and demonstrate its strict performance metrics on predictions of in vitro biotransformation and degradation of xenobiotics in soil. Our approach is model-agnostic and can be transferred to other prediction systems. It is also capable of revealing knowledge gaps in terms of incompletely defined sets of transformation rules.},

keywords = {biodegradation, cheminformatics, computational sustainability, data mining, enviPath, machine learning, metabolic pathways},

pubstate = {published},

tppubtype = {article}

}

Close

22.

Stepišnik, Tomaž; Škrlj, Blaž; Wicker, Jörg; Kocev, Dragi

A comprehensive comparison of molecular feature representations for use in predictive modeling Journal Article

In: Computers in Biology and Medicine, vol. 130, pp. 104197, 2021, ISSN: 0010-4825.

@article{stepisnik2021comprehensive,

title = {A comprehensive comparison of molecular feature representations for use in predictive modeling},

author = {Toma\v{z} Stepi\v{s}nik and Bla\v{z} \v{S}krlj and J\"{o}rg Wicker and Dragi Kocev},

url = {http://www.sciencedirect.com/science/article/pii/S001048252030528X},

doi = {10.1016/j.compbiomed.2020.104197},

issn = {0010-4825},

year  = {2021},

date = {2021-03-01},

journal = {Computers in Biology and Medicine},

volume = {130},

pages = {104197},

abstract = {Machine learning methods are commonly used for predicting molecular properties to accelerate material and drug design. An important part of this process is deciding how to represent the molecules. Typically, machine learning methods expect examples represented by vectors of values, and many methods for calculating molecular feature representations have been proposed. In this paper, we perform a comprehensive comparison of different molecular features, including traditional methods such as fingerprints and molecular descriptors, and recently proposed learnable representations based on neural networks. Feature representations are evaluated on 11 benchmark datasets, used for predicting properties and measures such as mutagenicity, melting points, activity, solubility, and IC50. Our experiments show that several molecular features work similarly well over all benchmark datasets. The ones that stand out most are Spectrophores, which give significantly worse performance than other features on most datasets. Molecular descriptors from the PaDEL library seem very well suited for predicting physical properties of molecules. Despite their simplicity, MACCS fingerprints performed very well overall. The results show that learnable representations achieve competitive performance compared to expert based representations. However, task-specific representations (graph convolutions and Weave methods) rarely offer any benefits, even though they are computationally more demanding. Lastly, combining different molecular feature representations typically does not give a noticeable improvement in performance compared to individual feature representations.},

keywords = {biodegradation, cheminformatics, computational sustainability, data mining, enviPath, machine learning, metabolic pathways, molecular feature representation, toxicity},

pubstate = {published},

tppubtype = {article}

}

Close

21.

Chester, Andrew; Koh, Yun Sing; Wicker, Jörg; Sun, Quan; Lee, Junjae

Balancing Utility and Fairness against Privacy in Medical Data Proceedings Article

In: IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1226-1233, IEEE, 2020.

@inproceedings{chester2020balancing,

title = {Balancing Utility and Fairness against Privacy in Medical Data},

author = {Andrew Chester and Yun Sing Koh and J\"{o}rg Wicker and Quan Sun and Junjae Lee},

url = {https://ieeexplore.ieee.org/abstract/document/9308226},

doi = {10.1109/SSCI47803.2020.9308226},

year  = {2020},

date = {2020-12-01},

booktitle = {IEEE Symposium Series on Computational Intelligence (SSCI)},

pages = {1226-1233},

publisher = {IEEE},

abstract = {There are numerous challenges when designing algorithms that interact with sensitive data, such as, medical or financial records. One of these challenges is privacy. However, there is a tension between privacy, utility (model accuracy), and fairness. While de-identification techniques, such as generalisation and suppression, have been proposed to enable privacy protection, it comes with a cost, specifically to fairness and utility. Recent work on fairness in algorithm design defines fairness as a guarantee of similar outputs for "similar" input data. This notion is discussed in connection to de-identification. This research investigates the trade-off between privacy, fairness, and utility. In contrast, other work investigates the trade-off between privacy and utility of the data or accuracy of the model overall. In this research, we investigate the effects of two standard de-identification techniques, k-anonymity and differential privacy, on both utility and fairness. We propose two measures to calculate the trade-off between privacy-utility and privacy-fairness. Although other research has provided guarantees for privacy regarding utility, this research focuses on the trade-offs given set de-identification levels and relies on guarantees provided by the privacy preservation methods. We discuss the effects of de-identification on data of different characteristics, class imbalance and outcome imbalance. We evaluated this is on synthetic datasets and standard real-world datasets. As a case study, we analysed the Medical Expenditure Panel Survey dataset.},

keywords = {accuracy, computational sustainability, data mining, fairness, imbalance, machine learning, medicine, privacy},

pubstate = {published},

tppubtype = {inproceedings}

}

Close

20.

Dost, Katharina; Taskova, Katerina; Riddle, Pat; Wicker, Jörg

Your Best Guess When You Know Nothing: Identification and Mitigation of Selection Bias Proceedings Article

In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 996-1001, IEEE, 2020, ISSN: 2374-8486.

19.

Roeslin, Samuel; Ma, Quincy; Chigullapally, Pavan; Wicker, Jörg; Wotherspoon, Liam

Feature Engineering for a Seismic Loss Prediction Model using Machine Learning, Christchurch Experience Proceedings Article

In: 17th World Conference on Earthquake Engineering, 2020.

Abstract | Links | BibTeX | Tags: computational sustainability, data mining, earthquakes, machine learning

@inproceedings{roeslin2020feature,

title = {Feature Engineering for a Seismic Loss Prediction Model using Machine Learning, Christchurch Experience},

author = {Samuel Roeslin and Quincy Ma and Pavan Chigullapally and J\"{o}rg Wicker and Liam Wotherspoon},

url = {https://www.researchgate.net/profile/Samuel_Roeslin/publication/344503593_Feature_Engineering_for_a_Seismic_Loss_Prediction_Model_using_Machine_Learning_Christchurch_Experience/links/5f7d015a92851c14bcb36ed7/Feature-Engineering-for-a-Seismic-Loss-Prediction-Model-using-Machine-Learning-Christchurch-Experience.pdf},

year  = {2020},

date = {2020-09-17},

booktitle = {17th World Conference on Earthquake Engineering},

abstract = {The city of Christchurch, New Zealand experienced four major earthquakes (MW \> 5.9) and multiple aftershocks between 4 September 2010 and 23 December 2011. This series of earthquakes, commonly known as the Canterbury Earthquake Sequence (CES), induced over NZ$40 billion in total economic losses. Liquefaction alone led to building damage in 51,000 of the 140,000 residential buildings, with around 15,000 houses left unpractical to repair. Widespread damage to residential buildings highlighted the need for improved seismic prediction tools and to better understand factors influencing damage. Fortunately, due to New Zealand unique insurance setting, up to 80% of the losses were insured. Over the entire CES, insurers received more than 650,000 claims. This research project employs multi-disciplinary empirical data gathered during and prior to the CES to develop a seismic loss prediction model for residential buildings in Christchurch using machine learning. The intent is to develop a procedure for developing insights from post-earthquake data that is subjected to continuous updating, to enable identification of critical parameters affecting losses, and to apply such a model to establish priority building stock for risk mitigation measures. The following paper describes the complex data preparation process required for the application of machine learning techniques. The paper covers the production of a merged dataset with information from the Earthquake Commission (EQC) claim database, building characteristics from RiskScape, seismic demand interpolated from GeoNet strong motion records, liquefaction occurrence from the New Zealand Geotechnical Database (NZGD) and soil conditions from Land Resource Information Systems (LRIS).},

keywords = {computational sustainability, data mining, earthquakes, machine learning},

pubstate = {published},

tppubtype = {inproceedings}

}

Close

18.

Roeslin, Samuel; Ma, Quincy; Juárez-Garcia, Hugon; Gómez-Bernal, Alonso; Wicker, Jörg; Wotherspoon, Liam

A machine learning damage prediction model for the 2017 Puebla-Morelos, Mexico, earthquake Journal Article

In: Earthquake Spectra, vol. 36, no. 2, pp. 314-339, 2020.

17.

Wicker, Jörg; Hua, Yan Cathy; Rebello, Rayner; Pfahringer, Bernhard

XOR-based Boolean Matrix Decomposition Proceedings Article

In: Wang, Jianyong; Shim, Kyuseok; Wu, Xindong (Ed.): 2019 IEEE International Conference on Data Mining (ICDM), pp. 638-647, IEEE, 2019, ISBN: 978-1-7281-4604-1.

16.

Williams, Jonathan; Stönner, Christof; Edtbauer, Achim; Derstorff, Bettina; Bourtsoukidis, Efstratios; Klüpfel, Thomas; Krauter, Nicolas; Wicker, Jörg; Kramer, Stefan

What can we learn from the air chemistry of crowds? Proceedings Article

In: Hansel, Armin; Dunkl, Jürgen (Ed.): 8th International Conference on Proton Transfer Reaction Mass Spectrometry and its Applications, pp. 121-123, Innsbruck University Press, Innsbruck, 2019.

Abstract | Links | BibTeX | Tags: atmospheric chemistry, breath analysis, cheminformatics, cinema data mining, data mining, emotional response analysis, machine learning, movie analysis, smell of fear, sof, time series

15.

Stönner, Christof; Edtbauer, Achim; Derstorff, Bettina; Bourtsoukidis, Efstratios; Klüpfel, Thomas; Wicker, Jörg; Williams, Jonathan

Proof of concept study: Testing human volatile organic compounds as tools for age classification of films Journal Article

In: PLOS One, vol. 13, no. 10, pp. 1-14, 2018.

@article{Stonner2018,

title = {Proof of concept study: Testing human volatile organic compounds as tools for age classification of films},

author = {Christof St\"{o}nner and Achim Edtbauer and Bettina Derstorff and Efstratios Bourtsoukidis and Thomas Kl\"{u}pfel and J\"{o}rg Wicker and Jonathan Williams},

doi = {10.1371/journal.pone.0203044},

year  = {2018},

date = {2018-10-11},

journal = {PLOS One},

volume = {13},

number = {10},

pages = {1-14},

publisher = {Public Library of Science},

abstract = {Humans emit numerous volatile organic compounds (VOCs) through breath and skin. The nature and rate of these emissions are affected by various factors including emotional state. Previous measurements of VOCs and CO2 in a cinema have shown that certain chemicals are reproducibly emitted by audiences reacting to events in a particular film. Using data from films with various age classifications, we have studied the relationship between the emission of multiple VOCs and CO2 and the age classifier (0, 6, 12, and 16) with a view to developing a new chemically based and objective film classification method. We apply a random forest model built with time independent features extracted from the time series of every measured compound, and test predictive capability on subsets of all data. It was found that most compounds were not able to predict all age classifiers reliably, likely reflecting the fact that current classification is based on perceived sensibilities to many factors (e.g. incidences of violence, sex, antisocial behaviour, drug use, and bad language) rather than the visceral biological responses expressed in the data. However, promising results were found for isoprene which reliably predicted 0, 6 and 12 age classifiers for a variety of film genres and audience age groups. Therefore, isoprene emission per person might in future be a valuable aid to national classification boards, or even offer an alternative, objective, metric for rating films based on the reactions of large groups of people.},

keywords = {atmospheric chemistry, breath analysis, cheminformatics, cinema data mining, data mining, emotional response analysis, machine learning, movie analysis, smell of fear, sof, time series},

pubstate = {published},

tppubtype = {article}

}

Close

14.

Latino, Diogo; Wicker, Jörg; Gütlein, Martin; Schmid, Emanuel; Kramer, Stefan; Fenner, Kathrin

Eawag-Soil in enviPath: a new resource for exploring regulatory pesticide soil biodegradation pathways and half-life data Journal Article

In: Environmental Science: Process & Impact, 2017.

@article{latino2017eawag,

title = {Eawag-Soil in enviPath: a new resource for exploring regulatory pesticide soil biodegradation pathways and half-life data},

author = {Diogo Latino and J\"{o}rg Wicker and Martin G\"{u}tlein and Emanuel Schmid and Stefan Kramer and Kathrin Fenner},

doi = {10.1039/C6EM00697C},

year  = {2017},

date = {2017-01-01},

journal = {Environmental Science: Process \& Impact},

publisher = {The Royal Society of Chemistry},

abstract = {Developing models for the prediction of microbial biotransformation pathways and half-lives of trace organic contaminants in different environments requires as training data easily accessible and sufficiently large collections of respective biotransformation data that are annotated with metadata on study conditions. Here, we present the Eawag-Soil package, a public database that has been developed to contain all freely accessible regulatory data on pesticide degradation in laboratory soil simulation studies 

for pesticides registered in the EU (282 degradation pathways, 1535 reactions, 1619 compounds and 4716 biotransformation half-life values with corresponding metadata on study conditions). We provide a thorough description of this novel data resource, and discuss important features of the pesticide soil degradation data that are relevant for model development. Most notably, the variability of half-life values for individual compounds is large and only about one order of magnitude lower than the entire range of median half-life values spanned by all compounds, demonstrating the need to consider study conditions in the development of more accurate models for biotransformation prediction. We further show how the data can be used to find missing rules relevant for predicting soil biotransformation pathways. From this analysis, eight examples of reaction types were presented that should trigger the formulation of new biotransformation rules, e.g., Ar-OH methylation, or the extension of existing rules e.g., hydroxylation in aliphatic rings. The data were also used to exemplarily explore the dependence of half-lives of different amide pesticides on chemical class and experimental parameters. This analysis highlighted the value of considering initial transformation reactions for the development of meaningful quantitative-structure biotransformation relationships (QSBR), which is a novel opportunity of f ered by the simultaneous encoding of transformation reactions and corresponding half-lives in Eawag-Soil. Overall, Eawag-Soil provides an unprecedentedly rich collection of manually extracted and curated biotransformation data, which should be useful in a great variety of applications.},

keywords = {biodegradation, cheminformatics, computational sustainability, data mining, enviPath, multi-label classification, REST, web services},

pubstate = {published},

tppubtype = {article}

}

Close

Developing models for the prediction of microbial biotransformation pathways and half-lives of trace organic contaminants in different environments requires as training data easily accessible and sufficiently large collections of respective biotransformation data that are annotated with metadata on study conditions. Here, we present the Eawag-Soil package, a public database that has been developed to contain all freely accessible regulatory data on pesticide degradation in laboratory soil simulation studies
for pesticides registered in the EU (282 degradation pathways, 1535 reactions, 1619 compounds and 4716 biotransformation half-life values with corresponding metadata on study conditions). We provide a thorough description of this novel data resource, and discuss important features of the pesticide soil degradation data that are relevant for model development. Most notably, the variability of half-life values for individual compounds is large and only about one order of magnitude lower than the entire range of median half-life values spanned by all compounds, demonstrating the need to consider study conditions in the development of more accurate models for biotransformation prediction. We further show how the data can be used to find missing rules relevant for predicting soil biotransformation pathways. From this analysis, eight examples of reaction types were presented that should trigger the formulation of new biotransformation rules, e.g., Ar-OH methylation, or the extension of existing rules e.g., hydroxylation in aliphatic rings. The data were also used to exemplarily explore the dependence of half-lives of different amide pesticides on chemical class and experimental parameters. This analysis highlighted the value of considering initial transformation reactions for the development of meaningful quantitative-structure biotransformation relationships (QSBR), which is a novel opportunity of f ered by the simultaneous encoding of transformation reactions and corresponding half-lives in Eawag-Soil. Overall, Eawag-Soil provides an unprecedentedly rich collection of manually extracted and curated biotransformation data, which should be useful in a great variety of applications.

Close

13.

Wicker, Jörg; Lorsbach, Tim; Gütlein, Martin; Schmid, Emanuel; Latino, Diogo; Kramer, Stefan; Fenner, Kathrin

enviPath – The Environmental Contaminant Biotransformation Pathway Resource Journal Article

In: Nucleic Acid Research, vol. 44, no. D1, pp. D502-D508, 2016.

12.

Raza, Atif; Wicker, Jörg; Kramer, Stefan

Trading Off Accuracy for Efficiency by Randomized Greedy Warping Proceedings Article

In: Proceedings of the 31st Annual ACM Symposium on Applied Computing, pp. 883-890, ACM, New York, NY, USA, 2016, ISBN: 978-1-4503-3739-7.

11.

Williams, Jonathan; Stönner, Christof; Wicker, Jörg; Krauter, Nicolas; Derstorff, Bettina; Bourtsoukidis, Efstratios; Klüpfel, Thomas; Kramer, Stefan

Cinema audiences reproducibly vary the chemical composition of air during films, by broadcasting scene specific emissions on breath Journal Article

In: Scientific Reports, vol. 6, 2016.

10.

Wicker, Jörg; Krauter, Nicolas; Derstorff, Bettina; Stönner, Christof; Bourtsoukidis, Efstratios; Klüpfel, Thomas; Williams, Jonathan; Kramer, Stefan

Cinema Data Mining: The Smell of Fear Proceedings Article

In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1235-1304, ACM ACM, New York, NY, USA, 2015, ISBN: 978-1-4503-3664-2.

@inproceedings{wicker2015cinema,

title = {Cinema Data Mining: The Smell of Fear},

author = {J\"{o}rg Wicker and Nicolas Krauter and Bettina Derstorff and Christof St\"{o}nner and Efstratios Bourtsoukidis and Thomas Kl\"{u}pfel and Jonathan Williams and Stefan Kramer},

url = {https://wicker.nz/nwp-acm/authorize.php?id=N10031 

http://doi.acm.org/10.1145/2783258.2783404},

doi = {10.1145/2783258.2783404},

isbn = {978-1-4503-3664-2},

year  = {2015},

date = {2015-01-01},

booktitle = {Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},

pages = {1235-1304},

publisher = {ACM},

address = {New York, NY, USA},

organization = {ACM},

series = {KDD '15},

abstract = {While the physiological response of humans to emotional events or stimuli is well-investigated for many modalities (like EEG, skin resistance, ...), surprisingly little is known about the exhalation of so-called Volatile Organic Compounds (VOCs) at quite low concentrations in response to such stimuli. VOCs are molecules of relatively small mass that quickly evaporate or sublimate and can be detected in the air that surrounds us. The paper introduces a new field of application for data mining, where trace gas responses of people reacting on-line to films shown in cinemas (or movie theaters) are related to the semantic content of the films themselves. To do so, we measured the VOCs from a movie theatre over a whole month in intervals of thirty seconds, and annotated the screened films by a controlled vocabulary compiled from multiple sources. To gain a better understanding of the data and to reveal unknown relationships, we have built prediction models for so-called forward prediction (the prediction of future VOCs from the past), backward prediction (the prediction of past scene labels from future VOCs) and for some forms of abductive reasoning and Granger causality. Experimental results show that some VOCs and some labels can be predicted with relatively low error, and that hints for causality with low p-values can be detected in the data.},

keywords = {atmospheric chemistry, breath analysis, causality, cheminformatics, cinema data mining, data mining, emotional response analysis, movie analysis, smell of fear, sof, time series},

pubstate = {published},

tppubtype = {inproceedings}

}

Close

9.

Šilc, Jurij; Taškova, Katerina; Korošec, Peter

Data mining-assisted parameter tuning of a search algorithm Journal Article

In: Informatica, vol. 39, no. 2, 2015.

Abstract | Links | BibTeX | Tags: data mining

8.

Tyukin, Andrey; Kramer, Stefan; Wicker, Jörg

BMaD — A Boolean Matrix Decomposition Framework Proceedings Article

In: Calders, Toon; Esposito, Floriana; Hüllermeier, Eyke; Meo, Rosa (Ed.): Machine Learning and Knowledge Discovery in Databases, pp. 481-484, Springer Berlin Heidelberg, 2014, ISBN: 978-3-662-44844-1.

7.

Wicker, Jörg

Large Classifier Systems in Bio- and Cheminformatics PhD Thesis

Technische Universität München, 2013.

Abstract | Links | BibTeX | Tags: biodegradation, bioinformatics, cheminformatics, computational sustainability, data mining, enviPath, machine learning, multi-label classification, multi-relational learning, toxicity

@phdthesis{wicker2013large,

title = {Large Classifier Systems in Bio- and Cheminformatics},

author = {J\"{o}rg Wicker},

url = {http://mediatum.ub.tum.de/node?id=1165858},

year  = {2013},

date = {2013-01-01},

school = {Technische Universit\"{a}t M\"{u}nchen},

abstract = {Large classifier systems are machine learning algorithms that use multiple 

classifiers to improve the prediction of target values in advanced 

classification tasks. Although learning problems in bio- and 

cheminformatics commonly provide data in schemes suitable for large 

classifier systems, they are rarely used in these domains. This thesis 

introduces two new classifiers incorporating systems of classifiers 

using Boolean matrix decomposition to handle data in a schema that 

often occurs in bio- and cheminformatics. 

 

The first approach, called MLC-BMaD (multi-label classification using 

Boolean matrix decomposition), uses Boolean matrix decomposition to 

decompose the labels in a multi-label classification task. The 

decomposed matrices are a compact representation of the information 

in the labels (first matrix) and the dependencies among the labels 

(second matrix). The first matrix is used in a further multi-label 

classification while the second matrix is used to generate the final 

matrix from the predicted values of the first matrix. 

MLC-BMaD was evaluated on six standard multi-label data sets, the 

experiments showed that MLC-BMaD can perform particularly well on data 

sets with a high number of labels and a small number of instances and 

can outperform standard multi-label algorithms. 

Subsequently, MLC-BMaD is extended to a special case of 

multi-relational learning, by considering the labels not as simple 

labels, but instances. The algorithm, called ClassFact 

(Classification factorization), uses both matrices in a multi-label 

classification. Each label represents a mapping between two 

instances. 

Experiments on three data sets from the domain of bioinformatics show 

that ClassFact can outperform the baseline method, which merges the 

relations into one, on hard classification tasks. 

 

Furthermore, large classifier systems are used on two cheminformatics 

data sets, the first one is used to predict the environmental fate of 

chemicals by predicting biodegradation pathways. The second is a data 

set from the domain of predictive toxicology. In biodegradation 

pathway prediction, I extend a knowledge-based system and incorporate 

a machine learning approach to predict a probability for 

biotransformation products based on the structure- and knowledge-based 

predictions of products, which are based on transformation rules. The 

use of multi-label classification improves the performance of the 

classifiers and extends the number of transformation rules that can be 

covered. 

For the prediction of toxic effects of chemicals, I applied large 

classifier systems to the ToxCasttexttrademark data set, which maps 

toxic effects to chemicals. As the given toxic effects are not easy to 

predict due to missing information and a skewed class 

distribution, I introduce a filtering step in the multi-label 

classification, which finds labels that are usable in multi-label 

prediction and does not take the others in the 

prediction into account. Experiments show 

that this approach can improve upon the baseline method using binary 

classification, as well as multi-label approaches using no filtering. 

 

The presented results show that large classifier systems can play a 

role in future research challenges, especially in bio- and 

cheminformatics, where data sets frequently consist of more complex 

structures and data can be rather small in terms of the number of 

instances compared to other domains.},

keywords = {biodegradation, bioinformatics, cheminformatics, computational sustainability, data mining, enviPath, machine learning, multi-label classification, multi-relational learning, toxicity},

pubstate = {published},

tppubtype = {phdthesis}

}

Close

Large classifier systems are machine learning algorithms that use multiple
classifiers to improve the prediction of target values in advanced
classification tasks. Although learning problems in bio- and
cheminformatics commonly provide data in schemes suitable for large
classifier systems, they are rarely used in these domains. This thesis
introduces two new classifiers incorporating systems of classifiers
using Boolean matrix decomposition to handle data in a schema that
often occurs in bio- and cheminformatics.

The first approach, called MLC-BMaD (multi-label classification using
Boolean matrix decomposition), uses Boolean matrix decomposition to
decompose the labels in a multi-label classification task. The
decomposed matrices are a compact representation of the information
in the labels (first matrix) and the dependencies among the labels
(second matrix). The first matrix is used in a further multi-label
classification while the second matrix is used to generate the final
matrix from the predicted values of the first matrix.
MLC-BMaD was evaluated on six standard multi-label data sets, the
experiments showed that MLC-BMaD can perform particularly well on data
sets with a high number of labels and a small number of instances and
can outperform standard multi-label algorithms.
Subsequently, MLC-BMaD is extended to a special case of
multi-relational learning, by considering the labels not as simple
labels, but instances. The algorithm, called ClassFact
(Classification factorization), uses both matrices in a multi-label
classification. Each label represents a mapping between two
instances.
Experiments on three data sets from the domain of bioinformatics show
that ClassFact can outperform the baseline method, which merges the
relations into one, on hard classification tasks.

Furthermore, large classifier systems are used on two cheminformatics
data sets, the first one is used to predict the environmental fate of
chemicals by predicting biodegradation pathways. The second is a data
set from the domain of predictive toxicology. In biodegradation
pathway prediction, I extend a knowledge-based system and incorporate
a machine learning approach to predict a probability for
biotransformation products based on the structure- and knowledge-based
predictions of products, which are based on transformation rules. The
use of multi-label classification improves the performance of the
classifiers and extends the number of transformation rules that can be
covered.
For the prediction of toxic effects of chemicals, I applied large
classifier systems to the ToxCasttexttrademark data set, which maps
toxic effects to chemicals. As the given toxic effects are not easy to
predict due to missing information and a skewed class
distribution, I introduce a filtering step in the multi-label
classification, which finds labels that are usable in multi-label
prediction and does not take the others in the
prediction into account. Experiments show
that this approach can improve upon the baseline method using binary
classification, as well as multi-label approaches using no filtering.

The presented results show that large classifier systems can play a
role in future research challenges, especially in bio- and
cheminformatics, where data sets frequently consist of more complex
structures and data can be rather small in terms of the number of
instances compared to other domains.

Close

6.

Hardy, Barry; Douglas, Nicki; Helma, Christoph; Rautenberg, Micha; Jeliazkova, Nina; Jeliazkov, Vedrin; Nikolova, Ivelina; Benigni, Romualdo; Tcheremenskaia, Olga; Kramer, Stefan; Girschick, Tobias; Buchwald, Fabian; Wicker, Jörg; Karwath, Andreas; Gütlein, Martin; Maunz, Andreas; Sarimveis, Haralambos; Melagraki, Georgia; Afantitis, Antreas; Sopasakis, Pantelis; Gallagher, David; Poroikov, Vladimir; Filimonov, Dmitry; Zakharov, Alexey; Lagunin, Alexey; Gloriozova, Tatyana; Novikov, Sergey; Skvortsova, Natalia; Druzhilovsky, Dmitry; Chawla, Sunil; Ghosh, Indira; Ray, Surajit; Patel, Hitesh; Escher, Sylvia

Collaborative development of predictive toxicology applications Journal Article

In: Journal of Cheminformatics, vol. 2, no. 1, pp. 7, 2010, ISSN: 1758-2946.

@article{hardy2010collaborative,

title = {Collaborative development of predictive toxicology applications},

author = {Barry Hardy and Nicki Douglas and Christoph Helma and Micha Rautenberg and Nina Jeliazkova and Vedrin Jeliazkov and Ivelina Nikolova and Romualdo Benigni and Olga Tcheremenskaia and Stefan Kramer and Tobias Girschick and Fabian Buchwald and J\"{o}rg Wicker and Andreas Karwath and Martin G\"{u}tlein and Andreas Maunz and Haralambos Sarimveis and Georgia Melagraki and Antreas Afantitis and Pantelis Sopasakis and David Gallagher and Vladimir Poroikov and Dmitry Filimonov and Alexey Zakharov and Alexey Lagunin and Tatyana Gloriozova and Sergey Novikov and Natalia Skvortsova and Dmitry Druzhilovsky and Sunil Chawla and Indira Ghosh and Surajit Ray and Hitesh Patel and Sylvia Escher},

url = {http://www.jcheminf.com/content/2/1/7},

doi = {10.1186/1758-2946-2-7},

issn = {1758-2946},

year  = {2010},

date = {2010-01-01},

journal = {Journal of Cheminformatics},

volume = {2},

number = {1},

pages = {7},

abstract = {OpenTox provides an interoperable, standards-based Framework for the support of predictive toxicology data management, algorithms, modelling, validation and reporting. It is relevant to satisfying the chemical safety assessment requirements of the REACH legislation as it supports access to experimental data, (Quantitative) Structure-Activity Relationship models, and toxicological information through an integrating platform that adheres to regulatory requirements and OECD validation principles. Initial research defined the essential components of the Framework including the approach to data access, schema and management, use of controlled vocabularies and ontologies, architecture, web service and communications protocols, and selection and integration of algorithms for predictive modelling. OpenTox provides end-user oriented tools to non-computational specialists, risk assessors, and toxicological experts in addition to Application Programming Interfaces (APIs) for developers of new applications. OpenTox actively supports public standards for data representation, interfaces, vocabularies and ontologies, Open Source approaches to core platform components, and community-based collaboration approaches, so as to progress system interoperability goals.The OpenTox Framework includes APIs and services for compounds, datasets, features, algorithms, models, ontologies, tasks, validation, and reporting which may be combined into multiple applications satisfying a variety of different user needs. OpenTox applications are based on a set of distributed, interoperable OpenTox API-compliant REST web services. The OpenTox approach to ontology allows for efficient mapping of complementary data coming from different datasets into a unifying structure having a shared terminology and representation.Two initial OpenTox applications are presented as an illustration of the potential impact of OpenTox for high-quality and consistent structure-activity relationship modelling of REACH-relevant endpoints: ToxPredict which predicts and reports on toxicities for endpoints for an input chemical structure, and ToxCreate which builds and validates a predictive toxicity model based on an input toxicology dataset. Because of the extensible nature of the standardised Framework design, barriers of interoperability between applications and content are removed, as the user may combine data, models and validation from multiple sources in a dependable and time-effective way.},

keywords = {cheminformatics, computational sustainability, data mining, machine learning, REST, toxicity},

pubstate = {published},

tppubtype = {article}

}

Close

OpenTox provides an interoperable, standards-based Framework for the support of predictive toxicology data management, algorithms, modelling, validation and reporting. It is relevant to satisfying the chemical safety assessment requirements of the REACH legislation as it supports access to experimental data, (Quantitative) Structure-Activity Relationship models, and toxicological information through an integrating platform that adheres to regulatory requirements and OECD validation principles. Initial research defined the essential components of the Framework including the approach to data access, schema and management, use of controlled vocabularies and ontologies, architecture, web service and communications protocols, and selection and integration of algorithms for predictive modelling. OpenTox provides end-user oriented tools to non-computational specialists, risk assessors, and toxicological experts in addition to Application Programming Interfaces (APIs) for developers of new applications. OpenTox actively supports public standards for data representation, interfaces, vocabularies and ontologies, Open Source approaches to core platform components, and community-based collaboration approaches, so as to progress system interoperability goals.The OpenTox Framework includes APIs and services for compounds, datasets, features, algorithms, models, ontologies, tasks, validation, and reporting which may be combined into multiple applications satisfying a variety of different user needs. OpenTox applications are based on a set of distributed, interoperable OpenTox API-compliant REST web services. The OpenTox approach to ontology allows for efficient mapping of complementary data coming from different datasets into a unifying structure having a shared terminology and representation.Two initial OpenTox applications are presented as an illustration of the potential impact of OpenTox for high-quality and consistent structure-activity relationship modelling of REACH-relevant endpoints: ToxPredict which predicts and reports on toxicities for endpoints for an input chemical structure, and ToxCreate which builds and validates a predictive toxicity model based on an input toxicology dataset. Because of the extensible nature of the standardised Framework design, barriers of interoperability between applications and content are removed, as the user may combine data, models and validation from multiple sources in a dependable and time-effective way.

Close

5.

Wicker, Jörg; Richter, Lothar; Kramer, Stefan

SINDBAD and SiQL: Overview, Applications and Future Developments Book Section

In: Džeroski, Sašo; Goethals, Bart; Panov, Panče (Ed.): Inductive Databases and Constraint-Based Data Mining, pp. 289-309, Springer New York, 2010, ISBN: 978-1-4419-7737-3.

4.

Wicker, Jörg; Richter, Lothar; Kessler, Kristina; Kramer, Stefan

SINDBAD and SiQL: An Inductive Database and Query Language in the Relational Model Proceedings Article

In: Daelemans, Walter; Goethals, Bart; Morik, Katharina (Ed.): Machine Learning and Knowledge Discovery in Databases, pp. 690-694, Springer Berlin Heidelberg, 2008, ISBN: 978-3-540-87480-5.

3.

Richter, Lothar; Wicker, Jörg; Kessler, Kristina; Kramer, Stefan

An Inductive Database and Query Language in the Relational Model Proceedings Article

In: Proceedings of the 11th International Conference on Extending Database Technology: Advances in Database Technology, pp. 740–744, ACM, 2008, ISBN: 978-1-59593-926-5.

2.

Wicker, Jörg; Brosdau, Christoph; Richter, Lothar; Kramer, Stefan

SINDBAD SAILS: A Service Architecture for Inductive Learning Schemes Proceedings Article

In: Proceedings of the First Workshop on Third Generation Data Mining: Towards Service-Oriented Knowledge Discovery, 2008.

Abstract | Links | BibTeX | Tags: data mining, inductive databases, machine learning, query languages

1.

Kramer, Stefan; Aufschild, Volker; Hapfelmeier, Andreas; Jarasch, Alexander; Kessler, Kristina; Reckow, Stefan; Wicker, Jörg; Richter, Lothar

Inductive Databases in the Relational Model: The Data as the Bridge Proceedings Article

In: Bonchi, Francesco; Boulicaut, Jean-François (Ed.): Knowledge Discovery in Inductive Databases, pp. 124-138, Springer Berlin Heidelberg, 2006, ISBN: 978-3-540-33292-3.