Institute for Computing and Information Sciences

      Machine Learning and Data Mining


      [Image: Charles Babbage's Difference Engine]

      Latest News

      • Lecturer: Peter Lucas
      • Lecture room: A1020 (Toernooiveld 1)
      • Lecture: Monday, 10.30; Tutorial: Friday, 10.30
      • Lectures start on Monday, 17 March, 2003
      • Practical assignment [gzipped PS; PDF]
      • 16 May, 2003: assessment


      Rationale

      In an era in which computers are widespread in society, people are collecting all sorts of data, mostly in an attempt to enhance their understanding of, and insight into, various processes. For example, companies and organisations collect data concerning the preferences and behaviour of their customers and clients, and use data-mining techniques to extract useful knowledge from these data. Data Mining has a close relationship to Statistics, Machine Learning, and Artificial Intelligence, and involves research into learning representations from data, the mathematics of learning, the process of data mining and knowledge discovery, and the construction and exploitation of software tools.

      This part of the Information Retrieval 2 course aims to convey the basic ideas underlying modern data mining and knowledge discovery from data, while at the same time introducing you to some of the software tools used for data mining.

      Lectures

      1. Introduction (Slides: [PS, gzipped]; [PDF])
      2. Classification (Slides: [PS, gzipped]; [PDF])
        • performance measures
        • classification rules
        • rule-learning algorithms
        • decision trees
      3. Bayesian models and logistic regression (Slides: [PS, gzipped]; [PDF])
        • structure and meaning of Bayesian networks
        • naive Bayes model, TANs (tree-augmented naive Bayes)
        • logistic regression and classification
        • structure learning
      4. Refinement and evaluation (Slides: [PS, gzipped]; [PDF])
        • cut-off points and ROC curves
        • holdout method
        • cross-validation
        • the bootstrap
        • boosting and bagging
      5. Clustering (Slides: [PS, gzipped]; [PDF])
        • market basket analysis
        • association rules (Apriori)
        • k-means algorithm (a small Python sketch follows this list)
        • hierarchical clustering
      6. Basics of Neural Networks (Slides: [PDF])
        • the brain
        • basic mathematics
        • McCulloch-Pitts neuron and activation function
        • perceptron
        • multilayer feedforward neural networks
        • back-propagation
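
      To give a flavour of the material above, here is a minimal, self-contained Python sketch of the k-means algorithm from lecture 5. It is an illustration only, not the implementation used in the practical; the two-dimensional example points are made up for the purpose of the sketch.

        import random

        def kmeans(points, k, iterations=20):
            """Tiny k-means for a list of (x, y) tuples."""
            centroids = random.sample(points, k)  # initialise with k random points
            for _ in range(iterations):
                # assignment step: attach each point to its nearest centroid
                clusters = [[] for _ in range(k)]
                for p in points:
                    dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
                    clusters[dists.index(min(dists))].append(p)
                # update step: move each centroid to the mean of its cluster
                for i, cluster in enumerate(clusters):
                    if cluster:
                        centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                        sum(p[1] for p in cluster) / len(cluster))
            return centroids, clusters

        # made-up example data: two well-separated groups of points
        data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
        centroids, clusters = kmeans(data, k=2)
        print(centroids)

      The sketch shows only the basic assignment/update loop discussed in the lecture; the practical work itself is done with the WEKA workbench rather than with hand-written code.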

      Practical and Tutorials

      • Assignment [gzipped PS; PDF]
      • Software: WEKA (unpack with jar -xvf weka-3-2.jar)
      • Introduction to the WEKA data-mining workbench
      • Introduction to Probability Theory [1/page-PDF; 2/page-PDF]
      • Summary of probability theory [Slides PS, gzip; Slides PDF] (a short worked example of Bayes' rule follows this list)
      • Exercises (will be distributed at the tutorials):
        • Exercises I: Revision
        • Exercises II: Practical symbolic machine learning
        • Exercises III: Bayesian networks
        • Exercises IV: Evaluation and unsupervised learning
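
      As a small companion to the probability-theory material above, the following Python fragment works through Bayes' rule, the identity underlying the naive Bayes model of lecture 3. The numbers (a condition with 1% prevalence, a test with 90% sensitivity and a 5% false-positive rate) are invented purely for illustration.

        # Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)
        p_h = 0.01              # prior P(H): prevalence of the condition (made-up number)
        p_e_given_h = 0.90      # P(E | H): probability of a positive test if H holds
        p_e_given_not_h = 0.05  # P(E | not H): false-positive rate

        # total probability of the evidence, P(E)
        p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

        # posterior P(H | E): surprisingly small despite the accurate test
        p_h_given_e = p_e_given_h * p_h / p_e
        print(round(p_h_given_e, 3))  # about 0.154

      The same calculation, applied attribute by attribute under a conditional-independence assumption, is what the naive Bayes classifier does.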

      Additional Resources

      • Example logistic regression equation
      • ROC curves explained
      • Data mining for the corporate masses
      • The R Project: excellent software for statistics and data mining with its own programming language
      • XELOPES: data-mining library
      • Data Mining Cup: international student competition in data mining
      • Salford Systems: training in data mining
      • Clementine data-mining suite (sold by SPSS)
      • SAS data-mining software
      • UCI Machine Learning Repository



      Peter Lucas, Computing Science, University of Nijmegen

      Last updated: 16 March, 2003
      peterl@cs.ru.nl
