CS851: Data Mining Algorithms
Nina Mishra


Course Overview

Over the years, many organizations have compiled diverse collections of massive, dynamic datasets.  Data mining has been actively used to discover interesting and surprising patterns in these datasets.  The technology has been applied successfully by organizations that collect web click streams, financial transactions, observational science data, and similar records.  This course covers major algorithmic advances in data mining, with an eye towards both the theoretical underpinnings of these problems and successful practical deployments.  Topics include clustering, association rules, machine learning, web link analysis, data streams, and privacy-preserving algorithms.

This course should be of interest to graduate students in computer science and related disciplines. It is especially suited to those who want to understand the fundamentals of data mining. Familiarity with basic material in algorithms, databases, and probability, at the level of the core undergraduate courses, is useful.



Grading:  Students taking the course for a letter grade must complete a class project and scribe two lectures.  Students taking the pass/fail option must scribe one lecture or complete a class mini-project.

Class Project Deadlines:

  • Feb 15, 2006: Proposal: at most 1 page  (5%)
  • Mar 15, 2006: Progress Report: at most 5 pages (20%)
  • Apr 17, 2006: Project Presentations  (25%)
  • May 8, 2006: Final Project Report  (30%)

Scribe:  Each registered student will sign up as the official scribe for two lectures.  This involves taking detailed notes, reading the background papers, and preparing a set of lecture notes that will be distributed on the web. (20%)

Scribe Schedule

Presentation Schedule

Homework Assignment: Due May 8th, for students who scribed only one lecture.


Time and Location:

Mon/Wed:  2:00-3:15pm
Room: 236D Olsson Hall

Office Hours:

Monday:  4:30-5:30pm, 226B Olsson Hall



Overview of Lectures (Tentative)

 

Date | Topics | Lectures | Scribes
Jan 18, 2006 | Introduction, Overview of the Class, Preliminaries | Lecture 1 | Scribe 1
Jan 23/25, 2006 | Clustering: k-Center, k-Median, k-Median-squared | Lecture 2, Lecture 3 | Scribe 2/3
Jan 30/Feb 1, 2006 | Clustering: Hierarchical, Correlation | Lecture 4, Lecture 5 | Scribe 4
Feb 6/8, 2006 | Correlation Clustering, Association Rules | Lecture 6 | Scribe 5/6
Feb 13/15, 2006 | Frequent Itemsets and CNF/DNF Dualization | Lecture 7/8, Lecture 9 | Scribe 7/8, Scribe 9
Feb 20/22, 2006 | Winnow, PAC-learning, Consistency, Learning Conjunctions, k-DL | Lecture 10, Lecture 11 | Scribe 10/11
Feb 27/Mar 1, 2006 | Boosting | Lecture 12, Lecture 13 | Scribe 13
Mar 6/8, 2006 | Spring Break | |
Mar 13/15, 2006 | Pagerank, Hubs & Authorities, Preferential Attachment Random Graph Model | Lecture 14, Lecture 15 | Scribe 14, Scribe 15
Mar 20/22, 2006 | Viral Marketing/Spreading Epidemics | Lecture 16/17 | Scribe 16, Scribe 17
Mar 27/29, 2006 | Data Streams | Lecture 18/19 | Scribe 18, Scribe 19
Apr 3/5, 2006 | Data Privacy | Lecture 20/21 | Scribe 20/21
Apr 10/12, 2006 | Data Privacy | Lecture 22 | Scribe 22, Scribe 23
Apr 17/19, 2006 | Project Presentations | |
Apr 24/26, 2006 | Project Presentations | |
May 1, 2006 | Project Presentations | |


Reading List

k-Center


k-Median/k-Median-squared/Facility Location


Hierarchical Clustering


Clustering Large Data Sets


Clustering Data Streams

Correlation Clustering

Association Rule Mining and Generalizations

Combinatorics of Association Rules

Frequency Counting

Machine Learning

 

Web Mining

 

Random Graph Models

Viral Marketing/Spreading Epidemics

Privacy: Query Restriction/Auditing

Privacy: Cell Suppression

  • A graph theoretic approach to statistical data security. D. Gusfield. SIAM Journal on Computing, 17(3):552-571, 1988.
  • Data security equals graph connectivity. Ming-Yang Kao. SIAM Journal on Discrete Mathematics, 9(1):87-100, 1996.

Privacy: Input Perturbation

Privacy: Output Perturbation

Privacy: K-Anonymity