# diochnos/teaching

## Spring 2018: CS 6501 - Learning Theory

### Table of Contents

### Course Description

Learning theory is a field that develops models for capturing learning phenomena and ultimately studies the efficiency and the limitations of learning algorithms in these frameworks. The course will start with the notions of concept learning and version spaces and proceed to decision tree learning and artificial neural networks. We will then switch to the probably approximately correct (PAC) model of learning, Occam's razor, VC theory, and ultimately prove the fundamental theorem of learning theory. We will examine the relationship between weak and strong learning (boosting). Subsequently, we will move on to online learning and learning with expert advice in the mistake bound model, where we will see algorithms such as randomized halving, weighted majority, and winnow. We will also cover theoretical aspects of learning that are related to evolution and evolution seen as a learning process drawing results from evolutionary algorithms and Valiant's evolvability framework. (This part of the course related to evolution can also be seen as a special type of local search optimization with PAC guarantees.) Time permitting we will cover additional models of learning such as active learning, or the computational approach to evolution by Adi Livnat and Christos Papadimitriou.

[Course Description] [Table of Contents] [Top]

### Basic Information

#### Syllabus

The syllabus is available here.

#### Time and Location

Mondays and Wednesdays, 3:30pm – 4:45pm, 340 Rice Hall.

#### Contact Information

Please see here.

#### Office Hours

I hold my office hours at the following times and places.

- Mondays
- 2:00pm – 3:00pm, 436 Rice Hall
- Fridays
- 1:00pm – 2:00pm, 414 Rice Hall

##### Exceptions to the Regular Schedule of Office Hours

If you want to meet me outside of my office hours, the most reliable method is to send an email and arrange an appointment.

Friday, January 26, 2018: No office hours this day.

Friday, February 16, 2018: Office hours only between 1:00pm – 1:30pm.

Friday, April 6, 2018:Canceled

Monday, April 9, 2018:Canceled

Monday, April 16, 2018:Canceled

[Basic Information] [Table of Contents] [Top]

### Homework Assignments

Assignment 1: Here. Due on January 31.

Assignment 2: Here. Due on February 7.

Assignment 3: Here. Due on February 21.

Assignment 4: Here. Due on March 21.

Assignment 5: Here. Due on April 25.

[Homework Assignments] [Table of Contents] [Top]

### Final Projects

Graduate students should deliver a final project. This is optional for undergraduates.

Each student is free to work on her/his own or to collaborate with someone else and provide a final project. As mentioned in the syllabus, the route that I expect most people will take, is to give a summary of about two pages long (11pt) for a paper that they selected to read from a candidate set of papers. Two-person projects are allowed (and expected) twice as big summaries. For a list of candidate papers or book chapters to be presented at the end of the course, please have a look below. The candidate material below is split into one- and two-person projects.

I am happy to discuss final projects on other subjects that are related to the course. For example, you may want to use a data set that you acquired online for training a classifier and then predicting its generalization ability by performing cross-validation and ultimately compare how close this empirical estimate is with the prediction that you have from PAC learning (for an appropriate tradeoff between the accuracy ε and the confidence δ). I am also open to other suggestions; for example one can implement Shannon’s mind reading machine. However, in any case, you need to talk to me well in advance and make sure that I agree with the path that you want to follow for what you plan to deliver for your final project.

#### Candidate Material

Please have a look here. (Clicking on the title of a paper takes you to the paper.)

#### Assignments

Chen, Austin and Park, Felix: Studying Recommendation Algorithms by Graph Analysis

Conklin, Henry: Teachability in Computational Learning

Fu, Hao: Challenging the Long Tail Recommendation

Han, Minbiao: Explaining and Harnessing Adversarial Examples

Haworth, Court: Methods for the Analysis of Evolutionary Algorithms on Pseudo-Boolean Functions

Hezbor, Atallah: Is Feature Selection Secure against Training Data Poisoning?

Jha, Nishant and White, Kirsten: On the Complexity of Teaching

Jin, Haiyun: Pairwise Interaction Tensor Factorization for Personalized Tag Recommendation

Ju, Lei: Combining Labeled and Unlabeled Data with Co-Training

Li, Yujian: Incrementally Learning Time-Varying Half Planes

Luo, Zheng: Learning the parts of objects by non-negative matrix factorization

Pierce, Robert: Evolving Neural Networks through Augmenting Topologies

Robertson, Ethan: Ant-Based Computing

Suya: Is Feature Selection Secure against Training Data Poisoning?

[Final Projects] [Table of Contents] [Top]

### Learning Theory Class Resources

[Learning Theory Class Resources] [Table of Contents] [Top]

#### Assigned Readings

- Occam's Razor, by Anselm Blumer, Andrej Ehrenfeucht, David Haussler, and Manfred K. Warmuth
- Learning in the Presence of Malicious Errors, by Michael Kearns and Ming Li

(In particular Theorem 1 is assigned reading.) - On-Line Learning with an Oblivious Environment and the Power of Randomization, by Wolfgang Maass

(In particular, Theorem 2.1 is assigned reading.)

[Assigned Readings] [Table of Contents] [Top]

#### Optional Readings

- A Few Useful Things to Know about Machine Learning, by Pedro Domingos
- A Theory of the Learnable, by Leslie Valiant
- Queries and Concept Learning, by Dana Angluin
- A survey of some aspects of computational learning theory, by György Turán
- Efficient Noise-Tolerant Learning from Statistical Queries, by Michael Kearns
- A Tutorial on Statistical Queries, by Lev Reyzin
- The Perceptron: A Model for Brain Functioning, by H. D. Block
- On Convergence Proofs for Perceptrons, by Albert B. J. Novikoff
- The Weighted Majority Algorithm, by Nick Littlestone and Manfred K. Warmuth
- A Short Introduction to Boosting, by Yoav Freund and Robert E. Shapire
- A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, by Yoav Freund & Robert E. Shapire.

#### Additional Resources

[Additional Resources] [Table of Contents] [Top]

### Class Log

#### Class 1 (Jan 17, 2018)

Introduction. Overview of the course by going through the syllabus.

Discuss the Coupon Collector's problem as an example of the kind of problems that we will be looking at in later parts of the course.

Discussion on basic terminology about machine learning.

Perhaps a good idea is to review Sections 1–2.4 (up to the weak law of large numbers) from the document Essentials on the Analysis of Randomized Algorithms. Also, have a look at Section 3 which discusses the Coupon Collector's problem.

For a general review on probability theory as well as a basic technique for proving bounds in randomized algorithms (which thus applies to learning theory as well), please have a look at the document Basic Tools and Techniques for Algorithms in Learning Theory.

Finally, the document Tools for Bounding Probabilities discusses bounds that we will use more frequently.

Assigned reading: The above three documents.

#### Class 2 (Jan 22, 2018)

Basic terminology on machine learning.

Definition of concept learning and introduction to terminology with the EnjoySport example.

Partial ordering of the hypothesis space (most general to most specific).

Algorithm Find-S; execution of Find-S on our toy example.

#### Class 3 (Jan 24, 2018)

Definition of version spaces.

List-Then-Eliminate algorithm and output on our toy example.

General and specific boundaries of version spaces.

Version space representation theorem.

Candidate Elimination algorithm. (Just stated the idea.)

#### Class 4 (Jan 29, 2018)

Execution of the Candidate Elimination algorithm on our toy example.

Related discussion on convergence (no noise and realizability assumption), queries that learners should request, the Halving algorithm, as well as using partially learned concepts in order to make predictions.

#### Class 5 (Jan 31, 2018)

Discussed the solution for Homework 1.

Decision tree learning.

Entropy, information gain, and the basic algorithm (ID3) for learning decision trees.

Partial execution of the algorithm on a toy example.

Searching the hypothesis space: characteristics and contrasting the search of ID3 with the search performed by the Candidate-Elimination algorithm.

Bias in decision tree learning and Occam's razor.

Definition of overfitting.

#### Class 6 (Feb 5, 2018)

Handling overfitting. Reduced-Error Pruning and Rule Post-Pruning.

Partitioning original examples to: training set, validation set, test set.

Handling continuous-valued attributes. One needs only to examine boundary points for ID3.

Introduction to neural networks and discussion on the geometry of solutions of a perceptron. (We did not even discuss the training rule.)

#### Class 7 (Feb 7, 2018)

Perceptron and the perceptron update rule.

Unthresholded linear unit and the delta rule.

Full gradient descent and stochastic gradient descent.

Neurons using the sigmoid function.

#### Class 8 (Feb 12, 2018)

Backpropagation and derivation of the update rules.

Momentum and discussion on convergence and local minima.

#### Class 9 (Feb 14, 2018)

Final remarks on neural networks.

No Free Lunch Theorems.

#### Class 10 (Feb 19, 2018)

Introduction to PAC learning.

PAC learning axis-aligned rectangles in 2D.

#### Class 11 (Feb 21, 2018)

Discussion on various aspects related to PAC learning. For example, discussing issues that arise for the representation of concepts or hypotheses. Proper vs representation independent learning ('improper' learning). Definition of agnostic PAC learning. Efficient (agnostic or not) PAC learnability.

Discussion on Find-S for properly learning conjunctions. We will prove PAC learnability next time.

#### Class 12 (Feb 26, 2018)

Efficient PAC learnability of conjunctions in the realizable case; in particular, efficient proper PAC learnability of conjunctions (algorithm used: Find-S).

PAC learnability of finite concept classes in the realizable setting. Application of this theorem to proper learning of conjunctions and actually improving the sample size that we obtained by investigating the Find-S (Occam) algorithm directly.

Efficiency considerations and discussion.

#### Class 13 (Feb 28, 2018)

In-class midterm.

#### Mar 5, 2018

No class. Spring break.

#### Mar 7, 2018

No class. Spring break.

#### Class 14 (Mar 12, 2018)

Going over the solutions of the midterm.

- Assigned Reading: The paper Occam's Razor, by Blumer, Ehrenfeucht, Haussler, and Warmuth.
- Optional Reading: The paper A Few Useful Things to Know about Machine Learning, by Domingos.
- Optional Reading: The paper A Theory of the Learnable, by Valiant.
- Optional Reading: The paper A survey of some aspects of computational learning theory, by Turán.

#### Class 15 (Mar 14, 2018)

Slides with an overview of Learning Theory: here. These slides explain the notions of the growth function, the VC dimension, and give an overview of where we are heading to in order to prove the fundamental theorem of learning theory.

Discussion on VC dimension and the growth function Π for various concept classes.

Exact computation of the VC-dimension for thresholds on a line and for axis-aligned rectangles in R^{2}.

Upper bound on the VC-dimension for finite concept classes. Application of the theorem to monotone and not necessarily monotone conjunctions, showing that the VC-dimension is O(n) in both cases.

#### Class 16 (Mar 19, 2018)

Application of the theorem to monotone and not necessarily monotone conjunctions, showing that the VC-dimension is Θ(n) in both cases. In fact, for the monotone case we proved that the VC-dimension is precisely n.

#### Class 17 (Mar 21, 2018)

Exploring bounds around (1 ± 1/n)^{about 1/n}
that involve Euler's constant e ≈ 2.71828...

Proof that Φ_{d}(m) ≤ (em/d)^{d}.

Discussing about the idea on decomposing the functions in Sauer's lemma (to be proved next time).

#### Class 18 (Mar 26, 2018)

Sauer's lemma.

Discussion on the upper bound on the number of examples needed for learning, based on the VC dimension of the concept class. We will prove the theorem next time.

#### Class 19 (Mar 28, 2018)

No class.

#### Class 20 (Apr 2, 2018)

Proved the theorem that we stated last time.

The discussion on Sauer's lemma and the last theorem are also available in Sections 3.4 and 3.5 in the Kearns-Vazirani book.

#### Class 21 (Apr 4, 2018)

Queries and concept learning. (The paper, by Dana Angluin)

#### Class 22 (Apr 9, 2018)

No class.

#### Class 23 (Apr 11, 2018)

Saeed Mahloujifar presenting results connecting PAC learnability and p-tampering attacks (a certain type of poisoning attacks).

#### Class 24 (Apr 16, 2018)

No class.

#### Class 25 (Apr 18, 2018)

Discussed the solutions to homework 3.

Different notions of noise.

Discussed the impossibility result of Kearns and Li, for PAC learning in the malicious noise model. This is Theorem 1 from the paper.

Note that Theorem 1 uses two oracles; one for positive and one for negative examples. This is an equivalent model of PAC learning as with one oracle that returns both positive and negative examples. In any case, in class we proved the theorem using one oracle, which is the model that we have been using all along in this class.

#### Class 26 (Apr 23, 2018)

Continuation of our discussion on noise and PAC learning.

Elements of Statistical Queries.

The Statistical Query (SQ) model and the Correlational Statistical Query (CSQ) model.

In our discussion on the board I used the notes from a similar course that was offered in MIT in 1994. Here are the relevant handouts from that course, where the person who had given those lectures back then was Javed Aslam:

Regarding the example that we discussed on board for SQ learning (monotone) conjunctions, it is useful to re-read Section 1.3 from the Kearns-Vazirani book and the proof that we used there. In fact, back then we used precisely the query that we used on the board today in order to characterize bad literals (which were thus removed from the hypothesis).

Discussed some slides from the Tutorial on Statistical Queries, by Lev Reyzin. I believe that these slides are useful for you to have them as a backup as well as for a quick reference in the future.

Optional Reading: The paper Efficient Noise-Tolerant Learning from Statistical Queries, by Michael Kearns. This is the (extended version of the) paper that introduced the Statistical Query model.

#### Class 27 (Apr 25, 2018)

Online learning and learning with expert advice.

Proved a mistake bound theorem for perceptrons due to Block and Novikoff.

Mistake bound for the halving algorithm. (Discussed in Section 7.5 of Tom Mitchell's book.)

#### Class 28 (Apr 30, 2018)

##### Concluding our Brief Tour on Online Learning

Mistake bound for randomized halving algorithm. The algorithm is from the paper On-Line Learning with an Oblivious Environment and the Power of Randomization, by Wolfgang Maass, and in particular, the theorem that we proved in class is Theorem 2.1.

Connection of the VC dimension and number of mistakes in online learning problems.

The weighted majority algorithm and number of mistakes. The algorithm appeared in the paper The Weighted Majority Algorithm by Nick Littlestone and Manfred Warmuth

These last two subjects (relation of VC-dimension with online learning and the weighted majority algorithm) are also discussed in Section 7.5 of Tom Mitchell's book.

##### Remarks on Boosting Methods

High level discussion on boosting.

###### Boosting the Confidence

Increase our confidence to arbitrarily close to 1 by creating a sufficiently large pool of hypotheses, so that at the end of the day one of them is going to have small error with as high probability as we want.

###### Boosting the Accuracy

The basic idea behind the AdaBoost algorithm of Freund and Shapire. A very good introduction to the subject is A Short Introduction to Boosting.

Note that the above paper also has references with the extensions of AdaBoost: for example dealing with multiple labels for classification instead of doing just binary classification.

Algorithm pseudocode: here

AdaBoost originally appeared in A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, by Freund & Shapire.

##### Closing Remarks

Ask me anything.

#### Class 29 (Tuesday, May 8, 2018; 2:00pm – 5:00pm)

Final exam in Rice 536.

[Class Log] [Table of Contents] [Top]