Publications

2013

Wicker, Jörg

Large Classifier Systems in Bio- and Cheminformatics (PhD Thesis)

Technische Universität München, 2013.

Tags: biodegradation, bioinformatics, cheminformatics, computational sustainability, data mining, enviPath, machine learning, multi-label classification, multi-relational learning, toxicity

@phdthesis{wicker2013large,

title = {Large Classifier Systems in Bio- and Cheminformatics},

author = {J\"{o}rg Wicker},

url = {http://mediatum.ub.tum.de/node?id=1165858},

year  = {2013},

date = {2013-01-01},

school = {Technische Universit\"{a}t M\"{u}nchen},

abstract = {Large classifier systems are machine learning algorithms that use multiple 

classifiers to improve the prediction of target values in advanced 

classification tasks. Although learning problems in bio- and 

cheminformatics commonly provide data in schemes suitable for large 

classifier systems, they are rarely used in these domains. This thesis 

introduces two new classifiers incorporating systems of classifiers 

using Boolean matrix decomposition to handle data in a schema that 

often occurs in bio- and cheminformatics. 

 

The first approach, called MLC-BMaD (multi-label classification using 

Boolean matrix decomposition), uses Boolean matrix decomposition to 

decompose the labels in a multi-label classification task. The 

decomposed matrices are a compact representation of the information 

in the labels (first matrix) and the dependencies among the labels 

(second matrix). The first matrix is used in a further multi-label 

classification while the second matrix is used to generate the final 

matrix from the predicted values of the first matrix. 

MLC-BMaD was evaluated on six standard multi-label data sets. The

experiments showed that MLC-BMaD can perform particularly well on data 

sets with a high number of labels and a small number of instances and 

can outperform standard multi-label algorithms. 

Subsequently, MLC-BMaD is extended to a special case of 

multi-relational learning, by considering the labels not as simple 

labels, but as instances. The algorithm, called ClassFact

(Classification factorization), uses both matrices in a multi-label 

classification. Each label represents a mapping between two 

instances. 

Experiments on three data sets from the domain of bioinformatics show 

that ClassFact can outperform the baseline method, which merges the 

relations into one, on hard classification tasks. 

 

Furthermore, large classifier systems are used on two cheminformatics 

data sets. The first is used to predict the environmental fate of

chemicals by predicting biodegradation pathways. The second is a data 

set from the domain of predictive toxicology. In biodegradation 

pathway prediction, I extend a knowledge-based system and incorporate 

a machine learning approach to predict a probability for 

biotransformation products based on the structure- and knowledge-based 

predictions of products, which are based on transformation rules. The 

use of multi-label classification improves the performance of the 

classifiers and extends the number of transformation rules that can be 

covered. 

For the prediction of toxic effects of chemicals, I apply large

classifier systems to the ToxCast\texttrademark{} data set, which maps

toxic effects to chemicals. As the given toxic effects are not easy to 

predict due to missing information and a skewed class 

distribution, I introduce a filtering step in the multi-label 

classification, which finds labels that are usable in multi-label 

prediction and excludes the others from the

prediction. Experiments show

that this approach can improve upon the baseline method using binary 

classification, as well as multi-label approaches using no filtering. 

 

The presented results show that large classifier systems can play a 

role in future research challenges, especially in bio- and 

cheminformatics, where data sets frequently consist of more complex 

structures and data can be rather small in terms of the number of 

instances compared to other domains.},

keywords = {biodegradation, bioinformatics, cheminformatics, computational sustainability, data mining, enviPath, machine learning, multi-label classification, multi-relational learning, toxicity},

pubstate = {published},

tppubtype = {phdthesis}

}
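
The reconstruction step described in the abstract (the second matrix generates the final label matrix from the predicted values of the first matrix) can be illustrated with a Boolean matrix product. The following is only a minimal sketch under stated assumptions, not the thesis implementation: the function name `boolean_product`, the variable names, and both small matrices are hypothetical and chosen by hand, and the decomposition itself is not shown.

```python
def boolean_product(a, b):
    """Boolean matrix product: entry (i, j) is 1 if any k has a[i][k] AND b[k][j]."""
    inner = len(b)
    cols = len(b[0])
    return [
        [int(any(row[k] and b[k][j] for k in range(inner))) for j in range(cols)]
        for row in a
    ]


# Hypothetical second matrix: 2 latent labels x 4 original labels.
# Each row lists the original labels implied by one latent label.
basis = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

# Hypothetical predicted first matrix: 3 instances x 2 latent labels,
# standing in for the output of a multi-label classifier.
predicted_latent = [
    [1, 0],
    [0, 1],
    [1, 1],
]

# Final label matrix: 3 instances x 4 original labels.
predicted_labels = boolean_product(predicted_latent, basis)
for row in predicted_labels:
    print(row)
```

In this toy setting the classifier only has to predict two latent labels instead of four original ones, which hints at why such a scheme might help on data sets with many labels and few instances.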
