# diochnos/teaching

## Spring 2019: CS 6501 - Learning Theory

### Course Description

Learning theory is a field that develops models for capturing learning phenomena and ultimately studies the efficiency and the limitations of learning algorithms in these frameworks. We will explore basic aspects of machine learning theory.

We will start with learning algorithms that use membership and equivalence queries, version spaces, and decision trees. We will then switch to the probably approximately correct (PAC) model of learning, Occam's razor, VC theory, and ultimately prove the fundamental theorem of learning theory.

From this point and on, we will cover, not necessarily in this order, the following topics. The relationship between weak and strong learning (boosting). Online learning and learning with expert advice in the mistake bound model, where we will see algorithms such as randomized halving, weighted majority, and winnow. We will also cover aspects of adversarial machine learning (poisoning and evasion attacks) and thus explore connections between machine learning and security.

Time permitting, there are further ideas for related interesting topics. For example, theoretical aspects of learning that are related to evolution and evolution seen as a learning process drawing results from evolutionary algorithms and Valiant's framework of evolvability, or active learning.

### Basic Information

#### Syllabus

The syllabus is available here.

#### Time and Location

Mondays, Wednesdays, and Fridays 1:00pm – 1:50pm, Olsson Hall 009.

Please see here.

#### Teaching Assistants

The graduate student Suya (fs5xz) will be a TA for this class this year.

#### Office Hours

We have office hours at the following times and places.

Tuesdays
11:00am – 1:00pm, 414 Rice Hall (Dimitris)
Wednesdays
2:00pm – 4:00pm, 414 Rice Hall (Dimitris)
Fridays
2:00pm – 6:00pm, 436 Rice Hall (Suya)
##### Exceptions to the Regular Schedule of Office Hours

If you want to meet us outside of our office hours, the most reliable method is to send an email and arrange an appointment.

### Homework Assignments

Assignment 1: Here. Announced on January 16. Due on January 30.

Assignment 2: Here. Announced on January 30. Due on February 8.

Assignment 3: Here. Announced on February 10. Due on February 20.

Assignment 4: Here. Announced on February 27. Due on March 8.

Assignment 5: Here. Announced on March 29. Due on April 15.

### Final Projects

Graduate students should deliver a final project. This is optional for undergraduates.

In general I am open to discussion for various ideas regarding the final project. You are allowed to work alone but the route that I expect most people will take is to form a group of 2-3 people, read a research paper and write a summary that will be about two pages long (11pt) and be distributed to the rest of their students in class. Depending on our pace as well as on the number of different groups of students that are formed for a final project we may or may not have oral presentations of these research papers. For a list of candidate papers or book chapters to be presented at the end of the course, please have a look below. The candidate material below is split into one- and multi-person projects.

I am happy to discuss final projects on other subjects that are related to the course. For example, you may want to use a data set that you acquired online for training a classifier and then predicting its generalization ability by performing cross-validation and ultimately compare how close this empirical estimate is with the prediction that you have from PAC learning (for an appropriate tradeoff between the accuracy parameter ε and the confidence parameter δ). I am also open to other suggestions; for example one can implement Shannon's mind reading machine. However, in any case, you need to talk to me well in advance and make sure that I agree with the path that you want to follow for what you plan to deliver for your final project.

#### Candidate Material

Please have a look here.

Clicking on the title of a paper takes you to the paper. It might be easier if you first download the pdf file locally to your computer.

### Learning Theory Class Resources

#### Assigned Readings

• Queries and Concept Learning, by Dana Angluin.
• Sections 1 to 2.4, as well as Section 6 are mandatory; rest is optional.
• Occam's Razor, by Anselm Blumer, Andrej Ehrenfeucht, David Haussler, and Manfred K. Warmuth
• Four types of noise in data for PAC learning, by Robert Sloan.
• Up to and including Section 2.2 is mandatory; rest is optional and in fact if you want you can present it as a 1-person project.
• Learning From Noisy Examples, by Dana Angluin and Philip Laird.
• Theorems 2 and 4 are mandatory; rest is optional and in fact if you want you can present all the rest (or part of it) as a 1- or 2-person project.
• We also have some related lecture notes for these two theorems by Javed Aslam here.
• Improved Noise-Tolerant Learning and Generalized Statistical Queries, by Javed A. Aslam and Scott E. Decatur.
• Most relevant to our discussion is Theorem 1. However we will discuss everything until the end of Section 3. If you want you can present some of the results from subsequent sections of the paper, as a 1- or 2-person project.
• Again we have some related lecture notes by Javed Aslam here.
• The Statistical Query model was introduced is in the paper Efficient Noise-Tolerant Learning From Statistical Queries, by Michael Kearns. Reading part or all of this paper is optional; I am mentioning it here because clearly it is very very relevant.
• Below, listed as an optional reading, are the slides from Tutorial on Statistical Queries, by Lev Reyzin. I believe that these slides are useful for all of us to have them as a backup as well as for a quick reference in the future.
• On-Line Algorithms in Machine Learning, by Avrim Blum.
• Everything from Sections 1 and 2.
• From Section 3, Theorem 5 and Theorem 8. Of course the discussion in Section 3, before Section 3.1 is very useful for general understanding and you are strongly encouraged to go through that part as well.
• Section 4 has a list of many interesting open problems.

#### Additional Resources

• Printout of the webpage: here
• Entire directory in zip format: here

### Class Log

#### Class 1 (Jan 14, 2019)

Introduction. Overview of the course by going through the syllabus.

Discuss the Coupon Collector's problem as an example of the kind of problems that we will be looking at in later parts of the course.

Discussion on basic terminology about machine learning.

Perhaps a good idea is to review Sections 1–2.4 (up to the weak law of large numbers) from the document Essentials on the Analysis of Randomized Algorithms. Also, have a look at Section 3 which discusses the Coupon Collector's problem.

For a general review on probability theory as well as a basic technique for proving bounds in randomized algorithms (which thus applies to learning theory as well), please have a look at the document Basic Tools and Techniques for Algorithms in Learning Theory.

Finally, the document Tools for Bounding Probabilities discusses bounds that we will use more frequently.

• Assigned reading: The above three documents.

#### Class 2 (Jan 16, 2019)

Homework 1 announced. Due on Wed, Jan 30.

Terminology and basic notions in machine learning: hypothesis class (hypothesis space), concept class, hypothesis, concept (ground truth), instances, examples, symmetric difference.

Boolean functions and the class of functions that will be the workhorse for our course: conjunctions and monotone conjunctions.

Started the discussion of queries and concept learning from Dana Angluin's paper Queries and Concept Learning.

#### Class 3 (Jan 18, 2019)

We discussed the learnability of monotone conjunctions using equivalence and membership queries. We also discussed whether the algorithms extend to general conjunctions.

Here are notes from what was discussed in the class.

#### Monday, January 21, 2019

Martin Luther King day – no classes.

#### Class 4 (Jan 23, 2019)

Discussed a few more things from Dana Angluin's paper. One example was how the learner may be forced to query all but one hypotheses when the hypothesis space is the set of all full conjunctions. Another thing that we discussed was Lemmas 1 and 2 that are also relevant to one of the homework problems.

Started discussion from Tom Mitchell's book; Chapter 2. The more-general-than-or-equal-to relation. Lattice with partial ordering of hypotheses in the hypothesis space. Correspondence between hypotheses in the hypothesis space and sets of examples in the instance space.

#### Class 5 (Jan 25, 2019)

Consistency and version spaces.

Algorithms Find-S and List-Then-Eliminate.

Most specific and most general boundaries.

#### Class 6 (Jan 28, 2019)

Version space representation theorem.

Candidate Elimination Algorithm. Execution on our toy example.

Discussion on the convergence of Candidate Elimination Algorithm (no noise and realizability assumption).

#### Class 7 (Jan 30, 2019)

Using partially learned concepts to make predictions. What queries the learners should request next and the Halving algorithm.

Inductive bias and the futility of bias-free learners.

Introduction to decision trees and inductive bias on decision trees.

#### Class 8 (Feb 1, 2019)

Entropy, information gain and description of ID3 algorithm for decision tree learning.

Partial execution of the algorithm.

Brief discussion on the characteristics of the search space in decision tree learning.

#### Class 9 (Feb 4, 2019)

Searching the hypothesis space: characteristics and contrasting the search of ID3 with the search performed by the Candidate-Elimination algorithm.

Bias in decision tree learning and Occam's razor.

Definition of overfitting. Handling overfitting: Reduced-Error Pruning and Rule Post-Pruning.

Partitioning original examples to: training set, validation set, test set.

#### Class 10 (Feb 6, 2019)

Handling continuous-valued attributes. One needs only to examine boundary points for ID3.

Discussion on missing attributes or attributes that have different costs for testing.

Perceptron and geometry of the solution (hypothesis).

The perceptron update rule for learning from linearly separable data.

Using a perceptron to learn a circle by augmenting the coordinates of the training examples. In general this is not efficient as we introduce exponentially many monomials in the representation; see "stars and bars" problem which can be used to compute precisely the number of monomials for multivariate polynomials of degree d.

#### Class 11 (Feb 8, 2019)

No free lunch theorems.

Proper learning vs representation independent (`improper') learning. Learning in a realizable setting. Learning in an agnostic setting.

Definition of PAC learning and efficient PAC learning. Definition of agnostic PAC learning and efficient agnostic PAC learning.

#### Class 12 (Feb 11, 2019)

Homework 3 announced last night.

Efficient PAC learnability of axis-aligned rectangles in the plane.

#### Class 13 (Feb 13, 2019)

Announced a list of papers that are candidate for presentation; see here.

Decided by plurality vote that the midterm will take place on Monday, February 25.

Using Find-S to properly and efficiently PAC learn general conjunctions.

#### Class 14 (Feb 15, 2019)

PAC learning finite concept classes. Efficiency depends on whether or not we also have a polynomial time algorithm for learning.

As an immediate application we improved the sample complexity of (properly) learning conjunctions merely by giving an upper bound on the size of the concept class.

Next time we will discuss the intractability of learning 3-term DNF formulae. Meanwhile we discussed efficient PAC learnability of 3-term DNFs by using 3-CNFs. In other words, whereas efficiently PAC learning 3-term DNFs is going to be proved to be hard when properly learning 3-term DNFs, nevertheless, by using 3-CNF formulae as our hypothesis class (learning in the realizable case, but not proper learning) we are able to efficiently PAC learn 3-term DNF formulae.

• Assigned Reading: The paper Occam's Razor, by Blumer, Ehrenfeucht, Haussler, and Warmuth.
• Has the theorem for PAC learning finite concept classes.
• Optional Reading: The paper A Few Useful Things to Know about Machine Learning, by Domingos.
• This is a good paper acting as a bridge between what we prove in class and what we can do in practice for machine learning. For example we have seen the "no-free-lunch" theorem as well as the theorem for PAC learning finite concept classes and both of these are discussed among other things in the paper.
• Optional Reading: The paper A Theory of the Learnable, by Valiant.
• This is the pioneering paper by Valiant that started this whole study of Probably Approximately Correct learning within computational learning theory. For example the algorithm for learning conjunctions as well as k-CNF that we discussed in class is mentioned in this paper. There is also more algorithms that touch other concepts (e.g., learning monotone DNF formulae using membership queries).

#### Class 15 (Feb 18, 2019)

Intractability of PAC learning 3-term DNF formulae properly. (Section 1.4 from the Kearns-Vazirani book.)

We still need to discuss a few things next time.

#### Class 16 (Feb 20, 2019)

No class due to snow.

#### Class 17 (Feb 22, 2019)

Finished our discussion on the intractability of 3-term DNF being learned properly.

Discussed some questions from the material that we have already covered.

Midterm.

#### Class 19 (Feb 27, 2019)

Set covering and finding an O(log(m)) optimal approximation with a greedy algorithm.

Application of the algorithm to PAC learning conjunctions with few relevant literals by covering the negative examples (that Find-S would otherwise ignore) with the literals obtained from the solution of Find-S (that initially only takes into account the positive examples).

The above are discussed in the Kearns-Vazirani book, in Sections 2.2 and 2.3 and relevant material is available on Collab under Resources.

Started discussion on VC-dimension, but we did not have time to say much. Our starting point were the slides here and we will continue next time.

Homework 4 announced via email.

#### Class 20 (Mar 1, 2019)

Discussion on the VC dimension; what it means to compute a lower bound and what it means to compute an upper bound.

Basic definitions and toy examples with HALFSPACES as well as with axis-aligned rectangles in 2D.

We then proved the simple theorem that upper bounds the VC dimension of a finite hypothesis class by the logarithm of its size.

• Optional Reading: The paper A survey of some aspects of computational learning theory, by Turán.
The paper gives an overview of several important results that are related to computational learning theory; some of which we have already touched, such as the VC dimension, and others that we have not yet touched, such as mistake bounds in the mistake bound model of learning. You can read this paper in pieces as we are making progress with the class, but now is a good time to start digesting the notions that we have covered already in class.

No class.

#### Class 22 (Mar 6, 2019)

Discussed the solutions to the midterm.

Computed the VC dimension of monotone conjunctions and showed that it is exactly n.

Mentioned that we can learn thresholds on the [0, 1] interval by labeling only O(log(1/ε)) data points from a pool of O(1/ε) unlabeled data points. This is the idea behind active learning; i.e., using O(log(1/ε)) labels and still achieving the PAC criterion.

#### Class 23 (Mar 8, 2019)

Lower bound on the number of examples that are necessary for PAC learning, based on the VC dimension. We used the slides found here. In particular we proved that any algorithm for PAC learning a concept class of VC dimension $d$ must use $m > \frac{d-1}{64\varepsilon}$ training examples in the worst case. (An almost identical result can be found in Theorem 3.5 of Section 3.6 in the Kearns-Vazirani book.)

Discussed briefly poisoning and evasion attacks (adversarial examples). We started using the slides found here and we will certainly continue and discuss a few more things after the Spring break.

#### Monday, March 11, 2019

No class; Spring recess.

#### Wednesday, March 13, 2019

No class; Spring recess.

#### Friday, March 15, 2019

No class; Spring recess.

No class.

#### Class 25 (Mar 20, 2019)

Defined the recursive function $\Phi_d(m) = \Phi_d(m-1) + \Phi_{d-1}(m-1)$, with $\Phi_d(0) = \Phi_0(m) = 1$.

Proved that $\Phi_d(m) = \sum_{i=0}^d\binom{m}{i}$. (This is Lemma 3.2 from the Kearns-Vazirani book.)

Subsequently proved that $\Phi_d(m) = \sum_{i=0}^d\binom{m}{i} \le \left(\frac{em}{d}\right)^d$. (This is proved as part of the discussion after Lemma 3.2 in the Kearns-Vazirani book.)

Discussed only the idea for the proof of Sauer's lemma because we were running out of time.

#### Class 26 (Mar 22, 2019)

Finished the proof on Sauer's lemma. That is $\Pi_{\mathcal{H}}(m) \le \sum_{i=0}^d \binom{m}{i} = \Phi_d(m)$. (Sauer's lemma is proved in the Kearns-Vazirani book in Lemma 3.1.)

Discussed about adversarial machine learning (poisoning and evasion attacks) based on the slides that we have.

#### Class 27 (Mar 25, 2019)

Proof of the Fundamental Theorem of Learning Theory (upper bound on the number of examples needed for PAC learning in the consistency model, using a hypothesis class with finite VC dimension).

In fact we proved that, with high probability (at least $1-\delta$), a consistent hypothesis $h\in\mathcal{H}$ on the training sample $S$ will have error $\varepsilon$ not more than $\frac{2}{m}\cdot\left(\lg\left(\Pi_{\mathcal{H}}(2m)\right) + \lg\left(\frac{2}{\delta}\right)\right)\,.$ (This is proved in Section 3.5 from the Kearns-Vazirani book.)

A homework exercise will involve using Sauer's lemma in order to substitute $\Pi_{\mathcal{H}}(m)$ with the VC dimension $d$ of the hypothesis class $\mathcal{H}$.

#### Class 28 (Mar 27, 2019)

Discussion on different noise models.

• Bob Sloan has a nice description for several cases of noise models in Four types of noise in data for PAC learning. (Up to and including Section 2.2 is mandatory reading; the rest is optional and in fact if you want you can present it as a 1-person project.)
• Apart from the models mentioned in Bob Sloan's paper we also mentioned the nasty noise model as a bounded-budget version of the malicious noise model. We have also made a related discussion here when we were discussing about poisoning attacks (p-tampering being the analogue of malicious noise and strong "clean label" p-budget attacks being the analogue of nasty noise).

Showed that the malicious noise that can be tolerated by PAC learning algorithms is less than $\varepsilon$, where $\varepsilon$ is the usual parameter for the required accuracy in PAC learning. The result that we showed was a simple variant of the result of Kearns and Li from their paper Learning in the Presence of Malicious Errors using the method of induced distributions; in particular Theorem 1 from that paper.

Here is what we discussed in class — this is what you need to know, the rest of the Kearns and Li paper is optional reading and perhaps you can present a few results from their paper as a 1-person or a 2-person project (after first agreeing with me).

The malicious noise model was introduced by Valiant in the paper Learning Disjunctions of Conjunctions (optional reading).

#### Class 29 (Mar 29, 2019)

Homework 5 announced.

Discussion on random classification noise.

Proved a theorem for PAC learning in the presence of random classification noise by minimizing disagreements. This is Theorem 2 from the paper by Angluin and Laird, Learning From Noisy Examples.

Next time we will discuss Theorem 4 from that paper, which is a hardness result related to the problem of minimizing disagreements.

In other words, while this time we can resolve positively the statistical question for PAC learnability based on minimizing disagreements, nevertheless we will have a hardness result for a fairly simple concept class (monotone conjunctions).

Here are some notes by Javed Aslam who was discussing these two theorems in 1994 in MIT.

Closing, one idea for you to present something from this paper of Angluin and Laird, for your in-class presentation, would be the discussion in Section 3 and the positive result that is presented there. Also, as a last remark, Theorem 3 in that paper waives the requirement of knowing an upper bound $\eta_b$ for the noise rate $\eta$, where the learner by requesting a few more examples is able to determine a good enough upper bound $\eta_b$ for the true noise rate $\eta$.

#### Class 30 (Apr 1, 2019)

As promised from last time, we proved Theorem 4 from the Angluin & Laird paper Learning From Noisy Examples. In other words we showed that finding a hypothesis that minimizes disagreements with a noisy sample is an NP-hard problem even for the case of monotone conjunctions.

We discussed a few high-level ideas about statistical queries. We will use the notes by Javed Aslam again, which you can find here. (These notes are based on Sections 1-3 from the paper Improved Noise-Tolerant Learning and Generalized Statistical Queries, by Javed A. Aslam and Scott E. Decatur.)

The Statistical Query model was introduced by Micheal Kearns in the paper Efficient Noise-Tolerant Learning from Statistical Queries; this paper however is optional reading.

#### Class 31 (Apr 3, 2019)

Discussed on board SQ-learning of monotone conjunctions. This is very very close to the first analysis that we used as introduction to PAC learning discrete concept classes in this course after the axis-aligned rectangles. You may want to re-read Section 1.3 from the Kearns-Vazirani book to pinpoint differences and similarities in these two approaches. Note that Section 1.3 from the Kearns-Vazirani book is about general conjunctions whereas we were discussing about monotone conjunctions this time. Furthermore, the ideas from the analysis on the SQ model extend naturally to the case where we have random classification noise (though we have only just started to address this issue).

Then we derived a formula for computing $P_{\chi}$ which is the probability that the statistical query $\chi$ has to be satisfied (and will subsequently be returned within $\tau$ of its true value by the STAT oracle).

$P_{\chi} = Pr_{EX(c, \mathcal{D})}[\chi = 1] = \frac{(1-\eta)Pr_{EX_{CN}^\eta(c,\mathcal{D})}[\chi = 1] - \eta Pr_{EX_{CN}^\eta(\overline{c}, \mathcal{D})}[\chi=1]}{1-2\eta}\,.$

The good thing with this last formula is that we can compute the noise-free estimate that we want based on noisy estimates.

#### Class 32 (Apr 5, 2019)

Finish our discussion on Statistical Query learning.

Though we are far from exhausting the subject on SQ learning, by now we have see several ideas around PAC learning, noise, and learning with Statistical Queries. The slides from A Tutorial on Statistical Queries, by Lev Reyzin can prove to be useful as a backup and provide additional pointers for further reading. The slides also have a discussion on other topics connecting to SQ learning such as evolvability, differential privacy and adaptive data analysis.

Note that some of the results mentioned in the slides are from papers that are suggested for presentation.

#### Class 33 (Apr 8, 2019)

Introduction to online learning, learning with expert advice, and the mistake bound model.

Upper bound on the number of mistakes that a perceptron will make during learning.

Halving algorithm and upper bound on the number of mistakes.

#### Class 34 (Apr 10, 2019)

VC dimension lower bounding the number of mistakes of an optimal online algorithm.

Randomized Halving Algorithm.

Weighted Majority Algorithm.

#### Class 35 (Apr 12, 2019)

Randomized Weighted Majority Algorithm.

Winnow on learning monotone disjunctions with few relevant variables.

Our discussion on variants of the Weighted Majority Algorithm as well as on Winnow (next class we will discuss an application of Winnow to drifting targets) can be found in the book chapter by Avrim Blum, On-Line Algorithms in Machine Learning. The relevant discussion is Section 2 (Weighted Majority) as well as Theorems 5 and 8 (Winnow) from Section 3. This book chapter by Avrim Blum appears in the book "Online Algorithms: The State of the Art".

Tom Mitchell's book, Chapter 7 also has related information. In Section 7.5 the Mistake Bound model of learning is discussed. In particular, Section 7.5.2 is about bounding the number of mistakes of the Halving Algorithm and Section 7.5.3 presents the lower bound on the number of mistakes of deterministic algorithms that is obtained by the VC dimension of the class.

#### Class 36 (Apr 15, 2019)

We did Theorem 8 from Avrim Blum's survey paper in online algorithms covering the case of Winnow being $O(\log(n))$-competitive against drifting target concepts.

I gave a very brief presentation of evolvability. Here are some notes so that you can have an understanding of the field quickly: Notes on Evolution and Learning. Ignore the second algorithm from the notes (the (1+1) EA) and the discussion in Section 8.

Here are some slides that I used in class to discuss briefly evolvability and the swapping algorithm: Evolvability in Learning Theory.

Finally, it should be noted that the slides that we already have by Lev Reyzin on statistical queries, also give a high level description of evolvability between slides 55 and 71; slides.

#### Class 37 (Apr 17, 2019)

Additional discussion on evolvability. Proved the formula for the error region and discussed briefly some remarks from Lev Reyzin's slides connecting statistical queries to evolvability.

Preliminary discussion on weak and strong learning.

Boosting the confidence.

We covered up to Section 4.2 from the Kearns-Vazirani book.

#### Class 38 (Apr 19, 2019)

A modest procedure for boosting the accuracy. Sections 4.3.1 and 4.3.2 from the Kearns-Vazirani book.

The above analysis is from the paper The Strength of Weak Learnability, by Robert E. Shapire which was the first paper presenting a boosting method and thus converting a weak learner to a strong learner.

Perhaps the most well known boosting method in our days is AdaBoost. A short introduction to boosting with AdaBoost can be found in the paper A Short Introduction to Boosting by Yoav Freund and Robert E. Shapire.

Also, a survey paper (book chapter) that has a large overlap with the above paper, is The Boosting Approach to Machine Learning: An Overview by Robert E. Shapire where this time boosting and logistic regression are also discussed.

All the above-mentioned papers are well worth your time; however, they are optional readings for this class.

AdaBoost in Pseudocode

Ask me anything.

#### Friday, May 3, 2019; 9:00am – 12:00pm

Final exam in Olsson Hall 009.

[Class Log] [Table of Contents] [Top]