CS851: Data Mining Algorithms
Nina Mishra

Many
organizations have compiled a diverse collection of massively large and dynamic
datasets over the years. Data mining is a tool that has been
actively used to discover interesting and surprising patterns in these
datasets. The technology has been successfully utilized by organizations
that collect web click streams, financial transactions, observational science
data, etc. This course will cover major algorithmic advances in data
mining, with an eye toward both the theoretical underpinnings of the problems
and successful practical deployments. Topics covered include
clustering, association rules, machine learning, web link analysis, data streams,
and privacy-preserving algorithms.
This course should be of
interest to graduate students in computer science and many other related
disciplines. This course is especially tuned to those with an interest in
understanding the fundamentals of data mining. Familiarity with basic material
in algorithms, databases and probability at the level of the core undergraduate
courses is useful.
Grading: Students taking the course for a letter grade must complete a
class project and scribe two lectures. Students taking the course pass/fail
must either scribe one lecture or complete a class mini-project.
Class Project: Deadlines:
Scribe: Each registered student will sign up as the official scribe for two lectures. This involves taking detailed notes, reading the background papers, and preparing a set of lecture notes that will be distributed on the web. (20%)
Homework Assignment: Due May 8th, for students who scribed only one lecture.
Time and Location: Mon/Wed, 2:00-3:15pm, Room 236D Olsson Hall
Office Hours: Monday, 4:30-5:30pm, 226B Olsson Hall
Overview of Lectures (Tentative)
| Date | Topics | Lectures | Scribes |
|---|---|---|---|
| | Introduction, Overview of the Class, Preliminaries | | |
| Jan 23/25, 2006 | Clustering: k-Center, k-Median, k-Median-squared | | |
| Jan 30/ | Clustering | | |
| Feb 6/8, 2006 | Correlation Clustering, Association Rules | | |
| Feb 13/15, 2006 | Frequent Itemsets and CNF/DNF Dualization | | |
| Feb 20/22, 2006 | Winnow, PAC-learning, Consistency, Learning Conjunctions, k-DL | | |
| Feb 27/ | Boosting | Lecture 12, Lecture 13 | |
| Mar 6/8, 2006 | Spring Break | | |
| Mar 13/15, 2006 | Pagerank, Hubs & Authorities, Preferential Attachment Random Graph Model | Lecture 14, Lecture 15 | |
| Mar 20/22, 2006 | Viral Marketing/Spreading Epidemics | | |
| Mar 27/29, 2006 | Data Streams | | |
| Apr 3/5, 2006 | Data Privacy | | |
| Apr 10/12, 2006 | Data Privacy | | |
| Apr 17/19, 2006 | Project Presentations | | |
| April 24/26, 2006 | Project Presentations | | |
| | Project Presentations | | |
Reading List
k-Center
k-Median/k-Median-squared/Facility Location
Hierarchical Clustering
Clustering Large Data Sets
Clustering Data Streams
Correlation Clustering
Association Rule Mining and Generalizations
Combinatorics of Association Rules
Frequency Counting
Machine Learning
Viral Marketing/Spreading Epidemics
Privacy: Query Restriction/Auditing
Privacy: Cell Suppression
Privacy: Input Perturbation
Privacy: Output Perturbation
Practical Privacy: The SuLQ Framework. A. Blum, C. Dwork, F. McSherry, K. Nissim. PODS 2005.
Revealing Information while Preserving Privacy. I. Dinur and K. Nissim. PODS 2003.
Privacy Preserving Data Mining on Vertically Partitioned Databases. C. Dwork and K. Nissim. Manuscript, 2004.
Privacy: k-Anonymity
l-diversity: Privacy beyond k-Anonymity. A. Machanavajjhala, J. Gehrke, D. Kifer, M. Venkitasubramaniam. ICDE 2006.
Approximation Algorithms for k-Anonymity. Aggarwal, Feder, Kenthapadi, Motwani, Panigrahy, Thomas, Zhu.
On the Complexity of Optimal k-Anonymity. A. Meyerson, R. Williams. PODS 2004.