
Homepage of Sicco Verwer

My profile on Google Scholar Citations.


2013

Eduardo Costa, Sicco Verwer and Hendrik Blockeel
In Intelligent Data Analysis, 2013. To appear.

Download

Bibtex

@inproceedings{costa2013,
author = {Eduardo Costa and Sicco Verwer and Hendrik Blockeel},
title = {Estimating Prediction Certainty in Decision Trees},
booktitle = {Intelligent Data Analysis},
year = {2013},
}

Abstract

Decision trees estimate prediction certainty using the class distribution in the leaf responsible for the prediction. We introduce an alternative method that yields better estimates. For each instance to be predicted, our method inserts the instance to be classified in the training set with one of the possible labels for the target attribute; this procedure is repeated for each one of the labels. Then, by comparing the outcome of the different trees, the method can identify instances that might present some difficulties to be correctly classified, and attribute some uncertainty to their prediction. We perform an extensive evaluation of the proposed method, and show that it is particularly suitable for ranking and reliability estimations. The ideas investigated in this paper may also be applied to other machine learning techniques, as well as combined with other methods for prediction certainty estimation.
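The relabel-and-retrain idea described above is classifier-agnostic and small enough to sketch. The toy below uses a 3-nearest-neighbour learner in place of a decision tree so the sketch stays dependency-free; all names are illustrative, not the paper's implementation.

```python
def knn_predict(train, x, k=3):
    """Predict x's label by majority vote among the k nearest training points."""
    nearest = sorted(train,
                     key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

def certainty_by_relabeling(train, x_new, candidate_labels):
    """Insert x_new once per candidate label, refit, and measure agreement."""
    votes = []
    for label in candidate_labels:
        augmented = train + [(x_new, label)]  # transductive insertion
        votes.append(knn_predict(augmented, x_new))
    majority = max(set(votes), key=votes.count)
    # Agreement of 1.0 means the prediction is unaffected by the injected
    # label, i.e. the instance is considered easy to classify.
    return majority, votes.count(majority) / len(votes)
```

An instance deep inside one class gets agreement 1.0; an instance halfway between two classes flips with the injected label and gets lower agreement.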
Maurice Bruynooghe, Hendrik Blockeel, Bart Bogaerts, Broes de Cat, Stef de Pooter, Joachim Jansen, Anthony Labarre, Jan Ramon, Marc Denecker and Sicco Verwer
In Theory and Practice of Logic Programming, 2013. Submitted.

Download

Bibtex

@article{bruynooghe2013,
author = {Maurice Bruynooghe and Hendrik Blockeel and Bart Bogaerts and Broes de Cat and Stef de Pooter and Joachim Jansen and Anthony Labarre and Jan Ramon and Marc Denecker and Sicco Verwer},
title = {Predicate Logic as a Modeling Language: Modeling and Solving some Machine Learning and Data Mining Problems with IDP3},
journal = {Theory and Practice of Logic Programming},
year = {2013},
}

Abstract

This paper explores the use of predicate logic as a modeling language. Using IDP3, a finite model generator that supports first order logic enriched with types, inductive definitions, aggregates and partial functions, search problems stated in a variant of predicate logic are solved. This variant is introduced and applied to a range of problems stemming from machine learning and data mining. In those areas, a strong interest has recently grown in the use of declarative modeling and constraint solving as a solution for their problems. We illustrate this methodology with three real-world problems from that area. The first problem is in the domain of stemmatology, a domain of philology concerned with the relationship between surviving variant versions of text. The second problem is a somewhat related problem within biology where phylogenetic trees are used to represent the evolution of species. The third and final problem concerns the classical problem of learning a minimal automaton consistent with a given set of strings. For this last problem, we show that the performance of our solution comes very close to that of a state-of-the-art solution. We analyze the use of predicate logic in the three applications and analyze how alternative models affect the performance.
Anthony Labarre and Sicco Verwer
In IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2013. Submitted.

Download

Bibtex

@article{labarre13BIO,
author = {Anthony Labarre and Sicco Verwer},
title = {Merging partially labelled trees: hardness and a declarative programming solution.},
journal = {IEEE/ACM Transactions on Computational Biology and Bioinformatics},
year = {2013},
}

Abstract

Intraspecific studies often make use of haplotype networks instead of gene genealogies to represent the evolution of a set of genes. Cassens et al. proposed one such network reconstruction method, based on the global maximum parsimony principle, which was later recast by the first author of the present work as the problem of finding a minimum common supergraph of a set of t partially labelled trees. Although algorithms were proposed for solving the problem on two graphs, the complexity of the general problem remains unknown. In this paper, we show that the corresponding decision problem is NP-complete for t = 3. We then propose a declarative programming approach to solving the problem to optimality in practice, as well as a heuristic approach, both based on the IDP system, and assess the performance of both methods on randomly generated data.
Arjen Hommersom, Sicco Verwer and Peter Lucas
In KR4HC/ProHealth, at AIME, 2013. To appear.

Download

Bibtex

@inproceedings{hommerson13KR4HC,
author = {Arjen Hommersom and Sicco Verwer and Peter Lucas},
title = {Discovering Probabilistic Structures of Care.},
booktitle = {KR4HC/ProHealth},
year = {2013},
}

Abstract

Medical protocols and guidelines can be looked upon as concurrent programs, where the patients dynamically change over time. Methods based on verification and model-checking developed in the past have been shown to offer insight into their correctness by adopting a logical point of view. However, there is uncertainty involved both in the management of the disease and the way the disease will develop, and, therefore, a probabilistic view on medical protocols seems more appropriate. On the other hand, representations using Bayesian networks usually involve a single patient group and do not capture the dynamic nature of care. In this paper, we propose a new method inspired by automata learning to represent and identify patient groups for obtaining insight into the care that patients have received. We evaluate this approach using data obtained from general practitioners and identify significant differences in patients who were diagnosed with a transient ischemic attack (TIA). Finally, we discuss the implications of such a computational method for the analysis of medical protocols.
Fides Aarts, Harco Kuppens, Jan Tretmans, Frits Vaandrager and Sicco Verwer
In Machine Learning, 2013. To appear.

Download

Bibtex

@article{aarts13MLj,
author = {Fides Aarts and Harco Kuppens and Jan Tretmans and Frits Vaandrager and Sicco Verwer},
title = {Improving active Mealy machine learning for protocol performance testing.},
journal = {Machine Learning},
year = {2013},
}

Abstract

Using a well-known industrial case study from the verification literature, the bounded retransmission protocol, we show how active learning can be used to establish the correctness of protocol implementation I relative to a given reference implementation R. Using active learning, we learn a model MR of reference implementation R, which serves as input for a model-based testing tool that checks conformance of implementation I to MR. In addition, we also explore an alternative approach in which we learn a model MI of implementation I, which is compared to model MR using an equivalence checker. Our work uses a unique combination of software tools for model construction (Uppaal), active learning (LearnLib, Tomte), model-based testing (JTorX, TorXakis) and verification (CADP, MRMC). We show how these tools can be used for learning models of and revealing errors in implementations, present the new notion of a conformance oracle, and demonstrate how conformance oracles can be used to speed up conformance checking.
Sicco Verwer, Remi Eyraud and Colin de la Higuera
In Machine Learning, 2013. To appear.

Download

Bibtex

@article{verwer13MLj,
author = {Sicco Verwer and Remi Eyraud and Colin de la Higuera},
title = {PAutomaC: a PFA/HMM learning competition.},
journal = {Machine Learning},
year = {2013},
}

Abstract

Approximating distributions over strings is a hard learning problem. Typical techniques involve using finite state machines as models and attempting to learn these; these machines can either be hand-built and then have their weights estimated, or built by grammatical inference techniques: the structure and the weights are then learned simultaneously. The PAutomaC competition, run in 2012, was the first challenge to allow comparison between methods and algorithms and built a first state of the art for these techniques. Both artificial data and real data were proposed and contestants were to try to estimate the probabilities of new strings. The purpose of this paper is to describe some of the technical and intrinsic challenges such a competition has to face, to give a broad overview of the state of the art in learning grammars and finite state machines, and to survey the relevant literature. This paper also provides the results of the competition and a brief description and analysis of the approaches the main participants used.
Christophe Costa Florencio and Sicco Verwer
In Theoretical Computer Science, 2013. Submitted.

Download

Bibtex

@article{florencioverwer13TCS,
author = {Christophe Costa Florencio and Sicco Verwer},
title = {Regular Inference as Vertex Coloring},
journal = {Theoretical Computer Science},
year = {2013},
}

Abstract

This paper is concerned with the problem of supervised learning of deterministic finite state automata, in the technical sense of identification in the limit from complete data, by finding a minimal DFA consistent with the data (regular inference). We solve this problem by translating it in its entirety to a vertex coloring problem. Essentially, such a problem consists of two types of constraints that restrict the hypothesis space: inequality and equality constraints. Inequality constraints translate to the vertex coloring problem in a very natural way. Equality constraints however greatly complicate the translation to vertex coloring. In previous coloring-based translations, these were therefore encoded either dynamically by modifying the vertex coloring instance on-the-fly, or by encoding them as satisfiability problems. We provide the first translation that encodes both types of constraints together in a pure vertex coloring instance. We prove the correctness of the construction, and show that regular inference and vertex coloring are in some sense equally hard. The coloring approach offers many opportunities for applying insights from combinatorial optimization and graph theory to regular inference. We immediately obtain new complexity bounds, as well as a family of new learning algorithms which can be used to obtain exact hypotheses as well as fast approximations.
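The inequality-constraint half of the translation described above is straightforward to sketch: prefix-tree states that may never be merged (e.g. an accepting and a rejecting state) become adjacent vertices of a conflict graph, and each color class becomes one DFA state. The equality (merge-consistency) constraints, which the paper shows how to fold into the same coloring instance, are the hard part and are omitted here; names are illustrative.

```python
def greedy_coloring(vertices, edges):
    """Color vertices so that no edge joins two same-colored vertices."""
    adjacent = {v: set() for v in vertices}
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    color = {}
    for v in vertices:
        used = {color[u] for u in adjacent[v] if u in color}
        color[v] = min(c for c in range(len(vertices)) if c not in used)
    return color

# Toy prefix-tree states for the sample {accept: "a", reject: "b"}:
# the state reached by "a" must stay distinct from the state reached by "b".
states = ["root", "a", "b"]
conflicts = [("a", "b")]
coloring = greedy_coloring(states, conflicts)
```

Here greedy coloring is only a stand-in; any graph-coloring solver can be dropped in, which is exactly the flexibility the vertex-coloring view buys.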
Sicco Verwer, Susan van den Braak and Sunil Choenni
In International Conference on Scientific and Statistical Database Management, 2013.

Download

Bibtex

@inproceedings{verwer13SSDBM,
author = {Sicco Verwer and Susan van den Braak and Sunil Choenni},
title = {Sharing confidential data for algorithm development by multiple imputation.},
booktitle = {International Conference on Scientific and Statistical Database Management},
year = {2013},
}

Abstract

The availability of real-life data sets is of crucial importance for research and development in many fields. Often, however, real-life databases may not be released or may be released for a limited time to a limited group of researchers due to the proprietary and confidential nature of the data. This can be problematic when designing algorithms and applications that should operate on such data, since this often requires insight into the specific properties of such databases. We propose to solve this problem using the statistical technique of multiple imputation. Although it was originally designed for imputing missing values in data sets, multiple imputation is also a very powerful method for generating realistic synthetic data sets. We show how this can be achieved using only a few lines of R code and the MICE multiple imputation tool. We apply this method to the problem of generating a realistic synthetic copy of a database used for fraud detection. In addition to generating a synthetic data set of the records in this database, we show how these records can be combined into networked data using clustering techniques.
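The paper does this in R with the MICE package. The core idea — treat every value of a record as missing and let an imputation model fill it back in, repeating the draw to obtain multiple synthetic copies — can be sketched generically. Real MICE fits a conditional model per column (chained equations); the toy stand-in below merely resamples each column's empirical distribution, so it preserves marginals only. All names are illustrative.

```python
import random

def synthesize(records, n_copies, seed=0):
    """Draw synthetic copies of a table by 'imputing' fully masked rows."""
    rng = random.Random(seed)
    columns = list(zip(*records))  # column-wise view of the data
    synthetic = []
    for _ in range(n_copies):
        for _ in records:
            # "Impute" every cell of a fully masked record by drawing
            # from the observed values of its column.
            synthetic.append(tuple(rng.choice(col) for col in columns))
    return synthetic
```

Swapping the per-column draw for a model conditioned on the other columns recovers the chained-equations behavior that makes the synthetic records realistic.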
Susan van den Braak, Sunil Choenni and Sicco Verwer

Download

Bibtex

@incollection{braak13DISC,
author = {Susan van den Braak and Sunil Choenni and Sicco Verwer},
title = {Combining and Analyzing Judicial Databases.},
}

Abstract

To monitor crime and law enforcement, databases of several organizations, covering different parts of the criminal justice system, have to be integrated. Combined data from different organizations may then be analyzed, for instance, to investigate how specific groups of suspects move through the system. Such insight is useful for several reasons, for example, to define an effective and coherent safety policy. To integrate or relate judicial data two approaches are currently employed: a data warehouse and a dataspace approach. The former is useful for applications that require combined data on an individual level. The latter is suitable for data with a higher level of aggregation. However, developing applications that exploit combined judicial data is not without risk. One important issue while handling such data is the protection of the privacy of individuals. Therefore, several precautions have to be taken in the data integration process: use aggregate data, follow the Dutch Personal Data Protection Act, and filter out privacy-sensitive results. Another issue is that judicial data is essentially different from data in exact or technical sciences. Therefore, data mining should be used with caution, in particular to avoid incorrect conclusions and to prevent discrimination and stigmatization of certain groups of individuals.
Sicco Verwer and Toon Calders

Download

Bibtex

@incollection{verwercalders13DISC,
author = {Sicco Verwer and Toon Calders},
title = {Introducing Positive Discrimination in Predictive Models.},
}

Abstract

In this chapter we give three solutions for the discrimination-aware classification problem that are based upon Bayesian classifiers. These classifiers model the complete probability distribution by making strong independence assumptions. First we discuss the necessity of having discrimination-free classification for probabilistic models. Then we will show three ways to adapt a Naive Bayes classifier in order to make it discrimination-free. The first technique is based upon setting different thresholds for the different communities. The second technique will learn two different models for both communities, while the third model describes how we can incorporate our belief of how discrimination was added to the decisions in the training data as a latent variable. By explicitly modeling the discrimination, we can reverse engineer decisions. Since all three models can be seen as ways to introduce positive discrimination, we end the chapter with a reflection on positive discrimination.
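The first technique above — different decision thresholds for different communities — can be sketched independently of the underlying Bayesian model: given each individual's predicted probability of the positive class, pick per-group thresholds so that the positive rate is equal across groups. The target-rate choice and all names are illustrative.

```python
def per_group_thresholds(scores, groups, target_rate):
    """For each group, find the score threshold yielding target_rate positives."""
    thresholds = {}
    for g in set(groups):
        g_scores = sorted((s for s, grp in zip(scores, groups) if grp == g),
                          reverse=True)
        k = max(1, round(target_rate * len(g_scores)))
        thresholds[g] = g_scores[k - 1]  # accept the top-k scores in group g
    return thresholds

def classify(scores, groups, thresholds):
    """Apply each individual's group-specific threshold."""
    return [s >= thresholds[g] for s, g in zip(scores, groups)]
```

With the same target rate in every group, the classifier's positive rate no longer differs between communities, which is the discrimination measure the chapter works with.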

2012

Sicco Verwer, Mathijs de Weerdt and Cees Witteveen
In Machine Learning, vol. 86(3), pp. 295-333, 2012.

Download

Bibtex

@article{verwer12MLj,
author = {Sicco Verwer and Mathijs de Weerdt and Cees Witteveen},
title = {Efficiently identifying deterministic real-time automata from labeled data.},
journal = {Machine Learning},
year = {2012},
volume = {86},
number = {3},
pages = {295--333},
}

Abstract

We develop a novel learning algorithm RTI for identifying a deterministic real-time automaton (DRTA) from labeled time-stamped event sequences. The RTI algorithm is based on the current state of the art in deterministic finite-state automaton (DFA) identification, called evidence-driven state-merging (EDSM). In addition to having a DFA structure, a DRTA contains time constraints between occurrences of consecutive events. Although this seems a small difference, we show that the problem of identifying a DRTA is much more difficult than the problem of identifying a DFA: identifying only the time constraints of a DRTA given its DFA structure is already NP-complete. In spite of this additional complexity, we show that RTI is a correct and complete algorithm that converges efficiently (from polynomial time and data) to the correct DRTA in the limit. To the best of our knowledge, this is the first algorithm that can identify a timed automaton model from time-stamped event sequences. A straightforward alternative to identifying DRTAs is to identify a DFA that models time implicitly, i.e., a DFA that uses different states for different points in time. Such a DFA can be identified by first sampling the timed sequences using a fixed frequency, and subsequently applying EDSM to the resulting non-timed event sequences. We evaluate the performance of both RTI and this sampling approach experimentally on artificially generated data. In these experiments RTI outperforms the sampling approach significantly. Thus, we show that if we obtain data from a real-time system, it is easier to identify a DRTA from this data than to identify an equivalent DFA.
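The sampling baseline described above — turning a timed sequence into an untimed one by emitting the currently active symbol at a fixed rate, after which plain EDSM applies — can be sketched directly. Events are (symbol, timestamp) pairs, assumed sorted and non-empty; names are illustrative.

```python
def sample_fixed_rate(timed_events, period):
    """Resample a timed event sequence at fixed intervals of `period`."""
    untimed = []
    i = 0
    t = timed_events[0][1]
    last = timed_events[-1][1]
    current = timed_events[0][0]
    while t <= last:
        # Advance to the most recent event at or before time t.
        while i < len(timed_events) and timed_events[i][1] <= t:
            current = timed_events[i][0]
            i += 1
        untimed.append(current)
        t += period
    return untimed
```

The cost of this encoding is visible immediately: long delays blow up into long runs of repeated symbols, which is why the DFA learned from the sampled data needs many more states than the DRTA.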
Marijn Heule and Sicco Verwer
In Empirical Software Engineering, in print, 2012. StaMinA competition winner.

Download

Bibtex

@article{heuleverwer12ESE,
author = {Marijn Heule and Sicco Verwer},
title = {Software model synthesis using satisfiability solvers.},
journal = {Empirical Software Engineering},
year = {2012},
note = {In print},
}

Abstract

We introduce a novel approach for synthesis of software models based on identifying deterministic finite state automata. Our approach consists of three important contributions. First, we argue that in order to model software, one should focus mainly on observed executions (positive data), and use the randomly generated failures (negative data) only for testing consistency. We present a new greedy heuristic for this purpose, and show how to integrate it in the state-of-the-art evidence-driven state-merging (EDSM) algorithm. Second, we apply the enhanced EDSM algorithm to iteratively reduce the size of the problem. Yet during each iteration, the evidence is divided over states and hence the effectiveness of this algorithm is decreased. We propose—when EDSM becomes too weak—to tackle the reduced identification problem using satisfiability solvers. Third, in case the amount of positive data is small, we solve the identification problem several times by randomizing the greedy heuristic and combine the solutions using a voting scheme. The interaction between these contributions appeared crucial to solve hard software models synthesis benchmarks. Our implementation, called DFASAT, won the StaMinA competition.
Christophe Costa Florencio and Sicco Verwer
In Algorithmic Learning Theory, 2012.

Download

Bibtex

@inproceedings{florencioverwer12ALT,
author = {Christophe Costa Florencio and Sicco Verwer},
title = {Regular Inference as Vertex Coloring},
booktitle = {Algorithmic Learning Theory},
year = {2012},
volume = {7568},
pages = {81--95},
}

Abstract

This paper is concerned with the problem of supervised learning of deterministic finite state automata, in the technical sense of identification in the limit from complete data, by finding a minimal DFA consistent with the data (regular inference). We solve this problem by translating it in its entirety to a vertex coloring problem. Essentially, such a problem consists of two types of constraints that restrict the hypothesis space: inequality and equality constraints. Inequality constraints translate to the vertex coloring problem in a very natural way. Equality constraints however greatly complicate the translation to vertex coloring. In previous coloring-based translations, these were therefore encoded either dynamically by modifying the vertex coloring instance on-the-fly, or by encoding them as satisfiability problems. We provide the first translation that encodes both types of constraints together in a pure vertex coloring instance. This offers many opportunities for applying insights from combinatorial optimization and graph theory to regular inference. We immediately obtain new complexity bounds, as well as a family of new learning algorithms which can be used to obtain both exact hypotheses, as well as fast approximations.
Yingqian Zhang and Sicco Verwer
In Principles and Practice of Multi-Agent Systems, 2012.

Download

Bibtex

@inproceedings{zhangverwer12PRIMA,
author = {Yingqian Zhang and Sicco Verwer},
title = {Mechanism for Robust Procurements},
booktitle = {Principles and Practice of Multi-Agent Systems},
year = {2012},
volume = {7455},
pages = {77--91},
}

Abstract

We model robust procurement as an optimization problem. We show that its decision version is NP-complete, and propose a backtracking algorithm with cuts that reduce the search-space to find the optimal solution. We then develop a mechanism that motivates agents to truthfully report their private information (i.e., truthful in dominant strategies), maximizes the social welfare (efficient), and ensures non-negative utilities of the participating agents even after execution (post-execution individually rational). In the experiments, we compare our mechanism with an iterated greedy first-price mechanism that represents the current practice in public procurements, in terms of the expected social welfare and the expected payments of the auctioneer. The results show that in terms of social welfare, our mechanism outperforms the greedy approach in all cases except when there exist cheap and reliable agents who can finish the job in time. In terms of payments, our mechanism outperforms the current practice when there are many potential contractors and the optimization constraints are tight.
Hendrik Blockeel, Maurice Bruynooghe, Broes de Cat, Stef de Pooter, Marc Denecker, Anthony Labarre, Jan Ramon and Sicco Verwer
In Technical Communications of the ICLP, 2012.

Download

Bibtex

@inproceedings{blockeel12ICLP,
author = {Hendrik Blockeel and Maurice Bruynooghe and Broes de Cat and Stef de Pooter and Marc Denecker and Anthony Labarre and Jan Ramon and Sicco Verwer},
title = {Modeling machine learning problems with FO(.).},
booktitle = {Technical Communications of the ICLP},
year = {2012},
volume = {17},
pages = {14--25},
}

Abstract

This paper reports on the use of the FO(·) language and the IDP framework for modeling and solving some machine learning and data mining tasks. The core component of a model in the IDP framework is an FO(·) theory consisting of formulas in first order logic and definitions; the latter are basically logic programs where clause bodies can have arbitrary first order formulas. Hence, it is a small step for a well-versed computer scientist to start modeling. We describe some models resulting from the collaboration between IDP experts and domain experts solving machine learning and data mining tasks. A first task is in the domain of stemmatology, a domain of philology concerned with the relationship between surviving variant versions of text. A second task is about a somewhat similar problem within biology where phylogenetic trees are used to represent the evolution of species. A third and final task is about learning a minimal automaton consistent with a given set of strings. For each task, we introduce the problem, present the IDP code and report on some experiments.
Sicco Verwer and Yingqian Zhang
In AAMAS, 2012.

Download

Bibtex

@inproceedings{verwerzhang12AAMAS,
author = {Sicco Verwer and Yingqian Zhang},
title = {Revenue prediction in budget-constrained sequential auctions with complementarities (extended abstract).},
booktitle = {AAMAS},
year = {2012},
}

Abstract

When multiple items are auctioned sequentially, the ordering of auctions plays an important role in the total revenue collected by the auctioneer. This is especially true with budget-constrained bidders and the presence of complementarities among items. It is difficult to develop efficient algorithms for finding an optimal sequence of items. However, when historical data are available, it is possible to learn a model in order to predict the outcome of a given sequence. In this work, we show how to construct such a model, and provide methods that find a good sequence for a new set of items given the learned model. We develop an auction simulator and design several experiment settings to test the performance of the proposed methods.
Sicco Verwer, Remi Eyraud and Colin de la Higuera
In International Conference on Grammatical Inference, 2012.

Download

Bibtex

@inproceedings{verwer12ICGI,
author = {Sicco Verwer and Remi Eyraud and Colin de la Higuera},
title = {Results of the PAutomaC probabilistic automaton learning competition.},
booktitle = {International Conference on Grammatical Inference},
year = {2012},
volume = {21},
pages = {243--248},
}

Abstract

Approximating distributions over strings is a hard learning problem. Typical GI techniques involve using finite state machines as models and attempting to learn both the structure and the weights simultaneously. The PAutomaC competition is the first challenge to allow comparison between methods and algorithms and builds a first state of the art for these techniques. Both artificial data and real data were proposed and contestants were to try to estimate the probabilities of test strings. The purpose of this paper is to provide an overview of the implementation details of PAutomaC and to report the final results of the competition.
Fides Aarts, Harco Kuppens, Jan Tretmans, Frits Vaandrager and Sicco Verwer
In International Conference on Grammatical Inference, 2012.

Download

Bibtex

@inproceedings{aarts12ICGI,
author = {Fides Aarts and Harco Kuppens and Jan Tretmans and Frits Vaandrager and Sicco Verwer},
title = {Learning and testing the bounded retransmission protocol.},
booktitle = {International Conference on Grammatical Inference},
year = {2012},
volume = {21},
pages = {4--18},
}

Abstract

Using a well-known industrial case study from the verification literature, the bounded retransmission protocol, we show how active learning can be used to establish the correctness of protocol implementation I relative to a given reference implementation R. Using active learning, we learn a model MR of reference implementation R, which serves as input for a model-based testing tool that checks conformance of implementation I to MR. In addition, we also explore an alternative approach in which we learn a model MI of implementation I, which is compared to model MR using an equivalence checker. Our work uses a unique combination of software tools for model construction (Uppaal), active learning (LearnLib, Tomte), model-based testing (JTorX, TorXakis) and verification (CADP, MRMC). We show how these tools can be used for learning these models, analyzing the obtained results, and improving the learning performance.
Faisal Kamiran, Asim Karim, Sicco Verwer and Heike Goudriaan
In Discrimination and Privacy-Aware Data Mining, at ICDM, 2012.

Download

Bibtex

@inproceedings{kamiran12DPADM,
author = {Faisal Kamiran and Asim Karim and Sicco Verwer and Heike Goudriaan},
title = {Avoiding discrimination when classifying socially sensitive data.},
booktitle = {Discrimination and Privacy-Aware Data Mining},
year = {2012},
}

Abstract

Social discrimination against certain sensitive groups within society (e.g., females, blacks, minorities) is prohibited by law in many countries. To prevent discrimination arising from the use of discriminatory data, recent research has focused on methods for making classifiers learned over discriminatory data discrimination-aware. Most of these methods have been tested on standard classification datasets that have been tweaked for discrimination analysis rather than over actual discriminatory data. In this paper, we present an analysis of discrimination-aware classification using a real-world dataset of Statistics Netherlands, which is a census body in the Netherlands. Specifically, we consider the use of classifiers for predicting whether an individual is a crime suspect or not to support law enforcement and security agencies’ decision making. Our results show that discrimination does exist in real world datasets and blind use of classifiers learned over such datasets can exacerbate the discrimination problem, which can have serious consequences. We demonstrate that discrimination-aware classification methods are very useful in such situations to mitigate the discriminatory effects and that they lead to rational and legally acceptable decision making.
Eduardo P. Costa, Sicco Verwer and Hendrik Blockeel
In Benelearn, 2012.

Download

Bibtex

@inproceedings{costa12benelearn,
author = {Eduardo P. Costa and Sicco Verwer and Hendrik Blockeel},
title = {Redefining the calculation of prediction certainty for decision trees.},
booktitle = {Benelearn},
year = {2012},
pages = {75--76},
}

Abstract

In classification, we often need classifiers that not only have high accuracy, but can also tell how certain they are about their predictions. In decision trees, a prediction's certainty is usually estimated using the class distribution in the leaf responsible for the prediction. We introduce an alternative method that yields better estimates. For each instance to be predicted, the proposed method performs the following transductive learning strategy. The instance is inserted in the training set with one of the possible labels for the target attribute; this procedure is repeated for each one of the labels. Then, by comparing the outcome of the different trees, the method can identify instances that might present some difficulties to be correctly classified, and attribute some uncertainty to their prediction.
Sicco Verwer and Yingqian Zhang
In BNAIC, 2012.

Download

Bibtex

@inproceedings{zhangverwer12BNAIC,
author = {Sicco Verwer and Yingqian Zhang},
title = {Mechanism for robust procurements.},
booktitle = {BNAIC},
year = {2012},
}

Abstract

The increasing popularity of auctions is largely due to their efficiency in allocating goods. However, in the face of uncertainties about services, the winner determination solution is often not robust enough to ensure a reliable outcome. This paper aims to design a more robust auction by introducing redundancy into the selected solution. More specifically, we construct an algorithm and a mechanism for incentivizing truth-telling in public procurement problems with uncertainties. Our contributions are the development of a framework for studying such procurement problems, a proof that minimizing cost in this framework is NP-complete, a fast algorithm that minimizes this cost, and a novel multi-stage mechanism with desirable properties: it is efficient, truthful in dominant strategies, and post-execution individually rational. We show experimentally that our approach significantly outperforms the current practice in many settings.

2011

Sicco Verwer, Mathijs de Weerdt and Cees Witteveen
In Information and Computation, vol. 209(3), pp. 606-625, 2011.

Download

Bibtex

@article{verwer11InfCom,
author = {Sicco Verwer and Mathijs de Weerdt and Cees Witteveen},
title = {The efficiency of identifying timed automata and the power of clocks.},
journal = {Information and Computation},
year = {2011},
volume = {209},
number = {3},
pages = {606--625},
}

Abstract

We develop theory on the efficiency of identifying (learning) timed automata. In particular, we show that: (i) deterministic timed automata cannot be identified efficiently in the limit from labeled data and (ii) that one-clock deterministic timed automata can be identified efficiently in the limit from labeled data. We prove these results based on the distinguishability of these classes of timed automata. More specifically, we prove that the languages of deterministic timed automata cannot, and that one-clock deterministic timed automata can be distinguished from each other using strings in length bounded by a polynomial. In addition, we provide an algorithm that identifies one-clock deterministic timed automata efficiently from labeled data. Our results have interesting consequences for the power of clocks that are interesting also out of the scope of the identification problem.
Sicco Verwer, Mathijs de Weerdt and Cees Witteveen
In IJCAI, pp. 1529-1534, 2011.

Download

Bibtex

@inproceedings{verwer11IJCAI,
author = {Sicco Verwer and Mathijs de Weerdt and Cees Witteveen},
title = {Learning Driving Behavior by Timed Syntactic Pattern Recognition.},
booktitle = {IJCAI},
year = {2011},
pages = {1529--1534},
}

Abstract

We advocate the use of an explicit time representation in syntactic pattern recognition because it can result in more succinct models and easier learning problems. We apply this approach to the real-world problem of learning models for the driving behavior of truck drivers. We discretize the values of onboard sensors into simple events. Instead of the common syntactic pattern recognition approach of sampling the signal values at a fixed rate, we model the time constraints using timed models. We learn these models using the RTI+ algorithm from grammatical inference, and show how to use computational mechanics and a form of semi-supervised classification to construct a real-time automaton classifier for driving behavior. Promising results are shown using this new approach.
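The discretization step the abstract describes can be illustrated with a small sketch (our illustration, not the paper's code): a sampled sensor signal is turned into a timed string of (symbol, delay) events. The threshold and the "H"/"L" symbols here are purely hypothetical.

```python
# Sketch: map a sampled sensor signal to timed events instead of sampling
# at a fixed rate. An event is emitted only when the signal crosses the
# threshold, together with the time elapsed since the previous event.

def discretize(samples, threshold):
    """Map (time, value) samples to a timed string of (symbol, delay) events.

    "H" = value at or above the threshold, "L" = below it; a new event is
    emitted only when the symbol changes.
    """
    events = []
    prev_symbol = None
    prev_time = 0.0
    for t, value in samples:
        symbol = "H" if value >= threshold else "L"
        if symbol != prev_symbol:          # signal crossed the threshold
            events.append((symbol, t - prev_time))
            prev_symbol, prev_time = symbol, t
    return events

signal = [(0.0, 1.2), (0.5, 1.4), (1.0, 3.9), (2.5, 4.1), (3.0, 0.8)]
print(discretize(signal, threshold=2.0))
# [('L', 0.0), ('H', 1.0), ('L', 2.0)]
```

Compared with fixed-rate sampling, long stable periods collapse into a single timed event, which is what makes the resulting timed models succinct.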
Sicco Verwer and Yingqian Zhang
In BNAIC, 2011.

Download

Bibtex

@inproceedings{verwerzhang11BNAIC,
author = {Sicco Verwer and Yingqian Zhang},
title = {Learning revenue-maximizing orderings in sequential auctions.},
booktitle = {BNAIC},
year = {2011},
pages = {},
}

Abstract

When multiple items are auctioned sequentially, the ordering of auctions plays an important role in the total revenue collected by the auctioneer. This is true especially with budget constrained bidders and the presence of complementarities among items. In such sequential auction settings, it is difficult to develop efficient algorithms for finding an optimal sequence of items. However, when historical data are available it is possible to learn good orderings that increase the revenue of the auctioneer. In this work, we show how such a learning model can be built based on previous auctions using regression trees. We provide a greedy method that finds a good sequence for a new set of items given the learned model. We design several experiment scenarios and test the performance of the proposed learning method. The experimental results are promising: they show that good orderings can be found quickly.
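The greedy sequencing idea can be sketched as follows (our illustration under stated assumptions, not the paper's implementation): `predict_revenue` is a toy stand-in for the regression-tree model learned from historical auctions, and the item names are invented.

```python
# Sketch: greedily order auction items using a learned revenue model.
# `predict_revenue` is a hypothetical stand-in for a regression-tree
# prediction; here it is simply a lookup table.

def predict_revenue(item, position):
    # Hypothetical learned model: expected revenue of `item` when it is
    # auctioned in slot `position` of the sequence.
    table = {
        ("painting", 0): 10, ("painting", 1): 7, ("painting", 2): 5,
        ("vase", 0): 6, ("vase", 1): 6, ("vase", 2): 4,
        ("clock", 0): 5, ("clock", 1): 4, ("clock", 2): 6,
    }
    return table[(item, position)]

def greedy_order(items):
    """Fill each slot with the item the model expects to earn most there."""
    remaining = list(items)
    order = []
    for position in range(len(items)):
        best = max(remaining, key=lambda it: predict_revenue(it, position))
        order.append(best)
        remaining.remove(best)
    return order

print(greedy_order(["painting", "vase", "clock"]))
# ['painting', 'vase', 'clock']
```

The greedy choice avoids searching all n! orderings; it is fast but, as with any greedy method, not guaranteed to be optimal.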

2010

Toon Calders and Sicco Verwer
In Data mining and knowledge discovery, vol. 21(2), pp. 277-292, 2010.

Download

Bibtex

@article{caldersverwer10DMKD,
author = {Toon Calders and Sicco Verwer},
title = {Three naive Bayes approaches for discrimination-free classification.},
journal = {Data mining and knowledge discovery},
year = {2010},
volume = {21},
number = {2},
pages = {277--292},
}

Abstract

In this paper, we investigate how to modify the naive Bayes classifier in order to perform classification that is restricted to be independent with respect to a given sensitive attribute. Such independency restrictions occur naturally when the decision process leading to the labels in the data-set was biased; e.g., due to gender or racial discrimination. This setting is motivated by many cases in which there exist laws that disallow a decision that is partly based on discrimination. Naive application of machine learning techniques would result in huge fines for companies. We present three approaches for making the naive Bayes classifier discrimination-free: (i) modifying the probability of the decision being positive, (ii) training one model for every sensitive attribute value and balancing them, and (iii) adding a latent variable to the Bayesian model that represents the unbiased label and optimizing the model parameters for likelihood using expectation maximization. We present experiments for the three approaches on both artificial and real-life data.
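The flavor of approach (i) can be conveyed with a minimal sketch (our illustration only; the paper adjusts the model's probabilities, whereas here we simply shift the deprived group's scores until both groups receive positive predictions at the same rate). All scores and thresholds below are invented.

```python
# Illustration of the idea behind approach (i): adjust the decision rule
# until both sensitive-attribute groups get positive predictions at the
# same rate. Here we raise group B's scores by a per-group offset.

def equalize_positive_rates(scores_a, scores_b, threshold=0.5, step=0.01):
    """Return an offset for group B such that its positive-prediction rate
    reaches that of group A (assumes such an offset exists)."""
    rate_a = sum(s >= threshold for s in scores_a) / len(scores_a)
    offset = 0.0
    while sum(s + offset >= threshold for s in scores_b) / len(scores_b) < rate_a:
        offset += step
    return offset

scores_a = [0.9, 0.7, 0.6, 0.3]   # favored group: 75% predicted positive
scores_b = [0.45, 0.4, 0.2, 0.1]  # deprived group: 0% predicted positive
off = equalize_positive_rates(scores_a, scores_b)
rate_b = sum(s + off >= 0.5 for s in scores_b) / len(scores_b)
print(off, rate_b)
```

Note that equalizing positive rates trades some accuracy for independence of the sensitive attribute, which is exactly the tension the paper's three approaches explore.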
Sicco Verwer, Mathijs de Weerdt and Cees Witteveen
In International Conference on Grammatical Inference, vol. 6339, pp. 203-216, 2010.

Download

Bibtex

@inproceedings{verwer10ICGI,
author = {Sicco Verwer and Mathijs de Weerdt and Cees Witteveen},
title = {A likelihood-ratio test for identifying probabilistic deterministic real-time automata from positive data.},
booktitle = {International Conference on Grammatical Inference},
year = {2010},
series = {},
volume = {6339},
pages = {203--216},
}

Abstract

We adapt an algorithm (RTI) for identifying (learning) a deterministic real-time automaton (DRTA) to the setting of positive timed strings (or time-stamped event sequences). A DRTA can be seen as a deterministic finite state automaton (DFA) with time constraints. Because DRTAs model time using numbers, they can be exponentially more compact than equivalent DFA models that model time using states. We use a new likelihood-ratio statistical test for checking consistency in the RTI algorithm. The result is the RTI+ algorithm, which stands for real-time identification from positive data. RTI+ is an efficient algorithm for identifying DRTAs from positive data. We show using artificial data that RTI+ is capable of identifying sufficiently large DRTAs in order to identify real-world real-time systems.
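The consistency check behind such a likelihood-ratio test can be sketched in a few lines (our simplified illustration; in the algorithm the test is applied to states of a real-time automaton, and the critical value below is an illustrative chi-square cutoff, not the paper's exact procedure).

```python
# Sketch: two states are "consistent" for merging if the merged symbol
# distribution does not fit the data significantly worse than two
# separate distributions.
import math

def log_likelihood(counts):
    """Log-likelihood of symbol counts under their empirical distribution."""
    total = sum(counts.values())
    return sum(c * math.log(c / total) for c in counts.values() if c > 0)

def lr_merge_test(counts_p, counts_q, critical=5.99):
    """Return True if merging is acceptable.

    The statistic 2*(LL_separate - LL_merged) is asymptotically chi-square
    distributed; `critical` is an illustrative cutoff (df=2, alpha=0.05).
    """
    merged = {s: counts_p.get(s, 0) + counts_q.get(s, 0)
              for s in set(counts_p) | set(counts_q)}
    ll_separate = log_likelihood(counts_p) + log_likelihood(counts_q)
    statistic = 2 * (ll_separate - log_likelihood(merged))
    return statistic <= critical

similar = lr_merge_test({"a": 50, "b": 50}, {"a": 48, "b": 52})      # merge ok
different = lr_merge_test({"a": 90, "b": 10}, {"a": 10, "b": 90})    # reject
print(similar, different)
# True False
```

Merging only when the test accepts is what lets a state-merging learner generalize without collapsing states that clearly generate different behavior.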
Marijn Heule and Sicco Verwer
In International Conference on Grammatical Inference, vol. 6339, pp. 66-79, 2010.

Download

Bibtex

@inproceedings{heuleverwerICGI10,
author = {Marijn Heule and Sicco Verwer},
title = {Exact DFA identification using SAT solvers.},
booktitle = {International Conference on Grammatical Inference},
year = {2010},
series = {},
volume = {6339},
pages = {66--79},
}

Abstract

We present an exact algorithm for identification of deterministic finite automata (DFA) which is based on satisfiability (SAT) solvers. Despite the size of the low level SAT representation, our approach is competitive with alternative techniques. Our contributions are fourfold: First, we propose a compact translation of DFA identification into SAT. Second, we reduce the SAT search space by adding lower bound information using a fast max-clique approximation algorithm. Third, we include many redundant clauses to provide the SAT solver with some additional knowledge about the problem. Fourth, we show how to use the flexibility of our translation in order to apply it to very hard problems. Experiments on a well-known suite of random DFA identification problems show that SAT solvers can efficiently tackle all instances. Moreover, our algorithm outperforms state-of-the-art techniques on several hard problems.
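The core of such a translation can be sketched as a coloring encoding (our simplification, not the paper's full translation, which also needs determinism and transition clauses): variables state which group each prefix-tree state is merged into, and conflict clauses forbid merging an accepting with a rejecting state. For self-containment we check satisfiability by brute force rather than calling a SAT solver.

```python
# Sketch: encode "can the states of a prefix tree be merged into k groups
# without merging an accepting state with a rejecting one?" as CNF.
import itertools

def coloring_cnf(num_states, conflicts, k):
    """Build CNF clauses over variables x[v][c]: state v gets color c."""
    var = lambda v, c: v * k + c + 1            # 1-based DIMACS-style vars
    clauses = []
    for v in range(num_states):
        clauses.append([var(v, c) for c in range(k)])        # >= 1 color
        for c1, c2 in itertools.combinations(range(k), 2):   # <= 1 color
            clauses.append([-var(v, c1), -var(v, c2)])
    for v, w in conflicts:                       # accept/reject conflict
        for c in range(k):
            clauses.append([-var(v, c), -var(w, c)])
    return clauses, num_states * k

def brute_force_sat(clauses, num_vars):
    """Tiny stand-in for a SAT solver: try every assignment."""
    for bits in itertools.product([False, True], repeat=num_vars):
        assign = lambda lit: bits[abs(lit) - 1] == (lit > 0)
        if all(any(assign(l) for l in clause) for clause in clauses):
            return True
    return False

# Three pairwise-conflicting states do not fit in 2 groups, but fit in 3.
conflicts = [(0, 1), (0, 2), (1, 2)]
print(brute_force_sat(*coloring_cnf(3, conflicts, 2)))  # False
print(brute_force_sat(*coloring_cnf(3, conflicts, 3)))  # True
```

The max-clique lower bound mentioned in the abstract corresponds to finding a large set of pairwise-conflicting states: its size is a lower bound on the number of colors (DFA states) any solution needs.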
Sicco Verwer
PhD Thesis, Technical University Delft, 2010

Download

Bibtex

@phdthesis{verwer10thesis,
author = {Sicco Verwer},
title = {Efficient identification of timed automata - theory and practice.},
school = {Technical University Delft},
year = {2010},
}

Abstract

Ekaterina Vasilyeva, Mykola Pechenizkiy, Aleksandra Tesanovic, Evgeny Knutov, Sicco Verwer and Paul de Bra
In EDM, .

Download

Bibtex

@inproceedings{vasilyeva10EDM,
author = {Ekaterina Vasilyeva and Mykola Pechenizkiy and Aleksandra Tesanovic and Evgeny Knutov and Sicco Verwer and Paul de Bra},
title = {Towards EDM Framework for Personalization of Information Services in RPM Systems.},
booktitle = {EDM},
year = {2010},
pages = {331--332},
}

Abstract

Remote Patient Management (RPM) systems, besides monitoring the health conditions of patients, provide them with different information services that are currently predefined and largely follow a one-size-fits-all paradigm. In this work we focus on the problem of knowledge discovery and patient modeling by mining educational data and the motivational and instructional feedback provided to patients within an RPM system.

2009

Sicco Verwer, Mathijs de Weerdt and Cees Witteveen
In Language and Automata Theory and Applications, vol. 5457, pp. 740-751, 2009.

Download

Bibtex

@inproceedings{verwer09LATA,
author = {Sicco Verwer and Mathijs de Weerdt and Cees Witteveen},
title = {One-Clock Deterministic Timed Automata Are Efficiently Identifiable in the Limit.},
booktitle = {Language and Automata Theory and Applications},
year = {2009},
series = {},
volume = {5457},
pages = {740--751},
}

Abstract

We study the complexity of identifying (learning) timed automata in the limit from data. In previous work, we showed that in order for timed automata to be efficiently identifiable in the limit, it is necessary that they are deterministic and that they use at most one clock. In this paper, we show this is also sufficient: we provide an algorithm that identifies one-clock deterministic timed automata efficiently in the limit.
Marijn Heule and Sicco Verwer
In BNAIC, pp. 91-98, 2009.

Download

Bibtex

@inproceedings{heuleverwer09BNAIC,
author = {Marijn Heule and Sicco Verwer},
title = {Using a satisfiability solver to identify deterministic finite state automata.},
booktitle = {BNAIC},
year = {2009},
pages = {91--98},
}

Abstract

We present an exact algorithm for identification of deterministic finite automata (DFA) which is based on satisfiability (SAT) solvers. Despite the size of the low level SAT representation, our approach seems to be competitive with alternative techniques. Our contributions are threefold: First, we propose a compact translation of DFA identification into SAT. Second, we reduce the SAT search space by adding lower bound information using a fast max-clique approximation algorithm. Third, we include many redundant clauses to provide the SAT solver with some additional knowledge about the problem. Experiments on a well-known suite of random DFA identification problems show that SAT solvers can efficiently tackle all instances. Moreover, our exact algorithm outperforms state-of-the-art techniques on several hard problems.

2008

Sicco Verwer, Mathijs de Weerdt and Cees Witteveen
In International Conference on Grammatical Inference, vol. 5278, pp. 238-251, 2008.

Download

Bibtex

@inproceedings{verwer08ICGI,
author = {Sicco Verwer and Mathijs de Weerdt and Cees Witteveen},
title = {Polynomial Distinguishability of Timed Automata.},
booktitle = {International Conference on Grammatical Inference},
year = {2008},
series = {},
volume = {5278},
pages = {238--251},
}

Abstract

We study the complexity of identifying (learning) timed automata in the limit from data. Timed automata are finite state models that model time explicitly, i.e., using numbers. Because timed automata use numbers to represent time, they can be exponentially more compact than models that model time implicitly, i.e., using states. We show three results that are essential in order to exactly determine when timed automata are efficiently identifiable in the limit. First, we show that polynomial distinguishability is a necessary condition for efficient identifiability in the limit. Second, we prove that deterministic timed automata with two or more clocks are not polynomially distinguishable. As a consequence, they are not efficiently identifiable. Last but not least, we prove that deterministic timed automata with one clock are polynomially distinguishable, which makes them very likely to be efficiently identifiable in the limit.
Sicco Verwer, Mathijs de Weerdt and Cees Witteveen
In Induction of Process Models, at ECML, pp. 61-68, 2008.

Download

Bibtex

@inproceedings{verwer08IPM,
author = {Sicco Verwer and Mathijs de Weerdt and Cees Witteveen},
title = {Efficiently learning simple timed automata.},
booktitle = {Induction of Process Models},
year = {2008},
pages = {61--68},
}

Abstract

We describe an efficient algorithm for learning deterministic real-time automata (DRTA) from positive data. This data can be obtained from observations of the process to be modeled. The DRTA model we learn from such data can be used to reason about and gain knowledge of real-time systems such as network protocols, business processes, reactive systems, etc.
Sicco Verwer, Mathijs de Weerdt and Cees Witteveen
In Benelearn, pp. 75-76, 2008.

Download

Bibtex

@inproceedings{verwer08benelearn,
author = {Sicco Verwer and Mathijs de Weerdt and Cees Witteveen},
title = {Efficiently learning timed models from observations.},
booktitle = {Benelearn},
year = {2008},
pages = {75--76},
}

Abstract

2004-2007

Sicco Verwer, Mathijs de Weerdt and Cees Witteveen
In Benelearn, pp. 128-135, 2007.

Download

Bibtex

@inproceedings{verwer07benelearn,
author = {Sicco Verwer and Mathijs de Weerdt and Cees Witteveen},
title = {An algorithm for learning real-time automata.},
booktitle = {Benelearn},
year = {2007},
pages = {128--135},
}

Abstract

Sicco Verwer, Mathijs de Weerdt and Cees Witteveen
In Grammatical Inference Workshop, 2006.

Download

Bibtex

@inproceedings{verwer06giw,
author = {Sicco Verwer and Mathijs de Weerdt and Cees Witteveen},
title = {On the identifiability in the limit of timed automata.},
booktitle = {Grammatical Inference Workshop},
year = {2006},
pages = {},
}

Abstract

Sicco Verwer, Mathijs de Weerdt and Cees Witteveen
In Benelearn, pp. 57-64, 2006.

Download

Bibtex

@inproceedings{verwer06benelearn,
author = {Sicco Verwer and Mathijs de Weerdt and Cees Witteveen},
title = {Identifying an automaton model for timed data.},
booktitle = {Benelearn},
year = {2006},
pages = {57--64},
}

Abstract

Sicco Verwer, Mathijs de Weerdt and Jonne Zutt
In Interactive Events of AIED, pp. 33-35, 2005.

Download

Bibtex

@inproceedings{verwer05AIED,
author = {Sicco Verwer and Mathijs de Weerdt and Jonne Zutt},
title = {A tutoring system to practice theorem proving in Fitch.},
booktitle = {Interactive Events of AIED},
year = {2005},
pages = {33--35},
}

Abstract

In this interactive event we demonstrate a web-based software tool to teach theorem proving in propositional logic, called Bop. This tool is a proof editor for the Fitch proof system that can give hints, proof steps, or even complete proofs to the student.
Sicco Verwer, Mathijs de Weerdt and Cees Witteveen
In BNAIC, pp. 291-296, 2005.

Download

Bibtex

@inproceedings{verwer05BNAIC,
author = {Sicco Verwer and Mathijs de Weerdt and Cees Witteveen},
title = {Timed automata for behavioral pattern recognition.},
booktitle = {BNAIC},
year = {2005},
pages = {291--296},
}

Abstract

Arjen Vollebregt, Rich Ost, Sicco Verwer and Janneke Donker
In Aerospace Conference, vol. 6, pp. 3686-3691, 2004.

Download

Bibtex

@inproceedings{vollebregt04AERO,
author = {Arjen Vollebregt and Rich Ost and Sicco Verwer and Janneke Donker},
title = {Enhancing diagnostics through the visualization of air vehicle data.},
booktitle = {Aerospace Conference},
year = {2004},
volume = {6},
pages = {3686--3691},
}

Abstract

To achieve an affordable system, the JSF is implementing an autonomic logistics capability, which includes a diagnostics function to review flight-collected data and correlate fault events to flight maneuvers. This flight recreation capability provides a maintainer, for example, with the ability to recreate a flight based on the data stored by the air vehicle. The recreation of a flight consists of a 3D flight viewer, a graph viewer to view fused signal data, and a flight data player to navigate the data. This tool set gives the maintainer insight into the state of the air vehicle when an event occurred, supporting more efficient maintenance of air vehicles. The flight recreation module permits an engineer to replay the stored flight data using data visualization, and it can graphically display the behavior of multiple, selectable flight and health parameters in a multi-window, time-synchronized user interface. This paper discusses how the flight recreation module supports the future sustainment of the JSF while delivering an affordable fighter.

Last edit: 11 Jul 2013

dr. ir. Sicco Verwer

Mailbox number 47
P.O. Box 9010
NL-6500 GL Nijmegen
The Netherlands
E-mail: s[dot]verwer [at] cs[dot]ru[dot]nl

Institute for Computing and Information Sciences