SEEM 5380: Optimization Methods for High-Dimensional Statistics

2021-22 Second Term


Announcements


General Information

  • Instructor: Anthony Man-Cho So (manchoso at se.cuhk.edu.hk)
  • Office Hours: By appointment, in ERB 604 or online
  • Lecture Time/Location:
    • Tuesdays 9:30am - 11:15am, online via ZOOM (originally scheduled in KKB 101)
    • Fridays 9:30am - 11:15am, online via ZOOM (originally scheduled in ERB 804)
  • Online Q&A Forum: Follow this link.

Course Description

The prevalence of high-dimensional data has motivated active research on efficient methods for tackling optimization problems that arise in statistical analysis. In this course, we will give an introduction to this exciting area of research, with emphasis on the theory of structured regularizers for high-dimensional statistics and the design and analysis of statistically and computationally efficient optimization algorithms. Applications in various areas of science and engineering, such as machine learning, signal processing, and statistics, will also be discussed. Prerequisite: ENGG 5501 or equivalent.

Course Outline (subject to change)

Course Requirements

Homework sets (60%) and a take-home final examination (40%)

Open problems will be introduced throughout the semester. Students who solve any one of the open problems will automatically receive an A for the course.

General References

Schedule and Reading

  • Week 1 - Jan 11: Introduction. Notes
                   Jan 14: Linear regression in the over-determined case: Estimation error and spectrum of random Gaussian matrix. Notes
    Reading:
    • The material for the lecture on Jan 14 is taken from Chapter 4 of Vershynin: High-Dimensional Probability: An Introduction with Applications in Data Science; read Sections 4.1, 4.2, 4.4.1 and 4.6. You can also refer to my lecture summary. The result in Section 4.6 applies to a more general family of random matrices than that covered in class.
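    • Illustration (not part of the assigned reading): a minimal numpy sketch of the Jan 14 result that the singular values of an n x d standard Gaussian matrix concentrate around sqrt(n) when n >> d; the dimensions and random seed below are arbitrary illustrative choices.

      import numpy as np

      rng = np.random.default_rng(0)
      n, d = 2000, 50                      # over-determined regime: n >> d
      A = rng.standard_normal((n, d))      # entries i.i.d. N(0, 1)

      # With high probability, all singular values lie roughly in the
      # interval [sqrt(n) - sqrt(d), sqrt(n) + sqrt(d)].
      s = np.linalg.svd(A, compute_uv=False)
      print("largest  singular value:", s[0])
      print("smallest singular value:", s[-1])
      print("sqrt(n) + sqrt(d)      :", np.sqrt(n) + np.sqrt(d))
      print("sqrt(n) - sqrt(d)      :", np.sqrt(n) - np.sqrt(d))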

  • Week 2 - Jan 18: Linear regression in the over-determined case: Optimization error and convergence analysis of gradient method for smooth strongly convex minimization. Notes
                   Jan 21: Linear regression in the under-determined case: Regularization and notion of decomposability. Notes
    Reading:
    • The material for the lecture on Jan 18 is taken from Chapter 2 of Nesterov: Lectures on Convex Optimization (2nd Edition); read Sections 2.1.3 and 2.1.5. You can also refer to my lecture summary.
    • The material for the lecture on Jan 21 is taken from Chapter 9 of Wainwright: High-Dimensional Statistics: A Non-Asymptotic Viewpoint; read Sections 9.1 and 9.2.1. You can also refer to my lecture summary.
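    • Illustration (not part of the assigned reading): a minimal sketch of the Jan 18 algorithm, gradient descent with constant step size 1/L on the least-squares objective f(theta) = 0.5*||X theta - y||^2, which is mu-strongly convex and L-smooth with mu, L the extreme eigenvalues of X^T X; problem sizes and the noise level are illustrative.

      import numpy as np

      rng = np.random.default_rng(1)
      n, d = 500, 20
      X = rng.standard_normal((n, d))
      theta_true = rng.standard_normal(d)
      y = X @ theta_true + 0.1 * rng.standard_normal(n)

      # Smoothness and strong convexity constants of f(theta) = 0.5*||X theta - y||^2.
      eigs = np.linalg.eigvalsh(X.T @ X)   # eigenvalues in ascending order
      mu, L = eigs[0], eigs[-1]

      theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]   # exact minimizer, for reference

      theta = np.zeros(d)
      for _ in range(200):
          grad = X.T @ (X @ theta - y)
          theta = theta - grad / L                      # constant step size 1/L

      print("condition number L/mu                :", L / mu)
      print("distance to minimizer after 200 steps:", np.linalg.norm(theta - theta_ls))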

  • Week 3 - Jan 25: General models: Localization of error vector and the notion of restricted strong convexity. Notes
                   Jan 28: General models: Bound on the estimation error. Notes
    Reading:
    • The material for the lecture on Jan 25 is taken from Chapter 9 of Wainwright: High-Dimensional Statistics: A Non-Asymptotic Viewpoint; read Sections 9.2 and 9.3.
    • The material for the lecture on Jan 28 is taken from Chapter 9 of Wainwright: High-Dimensional Statistics: A Non-Asymptotic Viewpoint; read Section 9.4.1. We will discuss an application of the estimation error bound to LASSO with hard sparsity in the next lecture. If you want to read ahead, see Sections 7.3.1-7.3.2.
    • Lecture summary.
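    • Illustration (not part of the assigned reading): schematically, and up to absolute constants, the Jan 28 estimation error bound in the exactly decomposable case has the form below; see Wainwright, Chapter 9 (Theorem 9.19) for the precise statement, constants, and the extra terms in the approximately sparse case.

      % Assumptions: the regularizer \mathcal{R} is decomposable with respect to a subspace pair
      % (\mathcal{M}, \bar{\mathcal{M}}^\perp) with \theta^* \in \mathcal{M}, the empirical loss
      % \mathcal{L}_n satisfies restricted strong convexity with curvature \kappa over the relevant
      % cone, and \lambda_n \ge 2\,\mathcal{R}^*\!\big(\nabla \mathcal{L}_n(\theta^*)\big).
      \[
        \|\widehat{\theta} - \theta^*\|_2 \;\lesssim\; \frac{\lambda_n}{\kappa}\,\Psi(\bar{\mathcal{M}}),
        \qquad
        \Psi(\bar{\mathcal{M}}) := \sup_{\theta \in \bar{\mathcal{M}} \setminus \{0\}} \frac{\mathcal{R}(\theta)}{\|\theta\|_2}.
      \]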

  • Week 4 - Feb 8: Estimation error bound for LASSO with hard sparsity. Notes
                   Feb 11: Establishing the restricted strong convexity property for sparse linear regression - Part I. Notes
    Reading:
    • The material for the lecture on Feb 8 is taken from Chapter 7 of Wainwright: High-Dimensional Statistics: A Non-Asymptotic Viewpoint; read Sections 7.3.1-7.3.2. You can also refer to my lecture summary.
    • For other applications of the general estimation error bound, you can read Sections 9.5-9.7 of Wainwright: High-Dimensional Statistics: A Non-Asymptotic Viewpoint.

    • The material for the lecture on Feb 11 is taken from Chapter 7 of Wainwright: High-Dimensional Statistics: A Non-Asymptotic Viewpoint; read Theorem 7.16 and its proof in Section 7.6. You can also refer to my lecture summary.
    • The proof of Gordon's inequality can be found in Section 8.7 of
      Foucart, Rauhut. A Mathematical Introduction to Compressive Sensing. Springer Science+Business Media, New York, 2013.
    • Further techniques for dealing with Gaussian random processes can be found in Chapter 7 of Vershynin: High-Dimensional Probability: An Introduction with Applications in Data Science; read in particular Sections 7.1-7.3.
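    • Illustration (not part of the assigned reading): a small numerical check, using scikit-learn's Lasso, of the Feb 8 bound, which for an s-sparse theta* predicts an estimation error of order sigma*sqrt(s*log(d)/n); the constant in the choice of the regularization parameter below is a heuristic, not the one from the lecture.

      import numpy as np
      from sklearn.linear_model import Lasso

      rng = np.random.default_rng(2)
      n, d, s, sigma = 400, 1000, 10, 1.0

      theta_star = np.zeros(d)
      theta_star[:s] = rng.choice([-1.0, 1.0], size=s)    # hard-sparse ground truth
      X = rng.standard_normal((n, d))
      y = X @ theta_star + sigma * rng.standard_normal(n)

      # sklearn's Lasso minimizes (1/(2n))*||y - X theta||^2 + alpha*||theta||_1,
      # so alpha plays the role of lambda_n.
      alpha = 2.0 * sigma * np.sqrt(np.log(d) / n)
      theta_hat = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000).fit(X, y).coef_

      print("||theta_hat - theta_star||_2:", np.linalg.norm(theta_hat - theta_star))
      print("sigma*sqrt(s*log(d)/n) scale:", sigma * np.sqrt(s * np.log(d) / n))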

  • Week 5 - Feb 15: Establishing the restricted strong convexity property for sparse linear regression - Part II. Notes
                   Feb 18: Algorithmic aspects of regularized loss minimization problems - Part I: The proximal gradient method. Notes
    Reading:
    • The material for the lecture on Feb 15 is taken from Chapter 7 of Wainwright: High-Dimensional Statistics: A Non-Asymptotic Viewpoint; read Theorem 7.16 and its proof in Section 7.6. Lecture summary.
    • The proof of measure concentration of Lipschitz functions of Gaussian random variables can be found in Appendix V, Theorem V.1 of
      Milman, Schechtman. Asymptotic Theory of Finite Dimensional Normed Spaces. Lecture Notes in Mathematics, volume 1200, Springer-Verlag, 2001.
    • It is possible to obtain comparison inequalities for sub-Gaussian processes. The interested reader can refer to Chapter 8 of Vershynin: High-Dimensional Probability: An Introduction with Applications in Data Science (particularly Section 8.6) for the setup and results.
    • Concentration inequalities for Lipschitz functions can be obtained for various probability measures that exhibit the concentration phenomenon. The interested reader can refer to Ledoux: The Concentration of Measure Phenomenon, American Mathematical Society, 2001 (particularly Chapter 1) for details.

    • Here is my lecture summary for the Feb 18 lecture.
    • Some useful properties of the proximal map can be found in
      Combettes, Wajs. Signal Recovery by Proximal Forward-Backward Splitting. Multiscale Modeling and Simulation 4(4): 1168-1200, 2005.
      Rockafellar, Wets. Variational Analysis. Grundlehren der Mathematischen Wissenschaften 317, Springer-Verlag, 2009. (particularly Sections 1.G and 2.D)
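    • Illustration (not part of the assigned reading): a minimal sketch of the proximal gradient method from the Feb 18 lecture, applied to l1-regularized least squares; the proximal map of tau*||.||_1 is entrywise soft-thresholding, and the step size 1/L and regularization level below are illustrative choices.

      import numpy as np

      def soft_threshold(v, tau):
          """Proximal map of tau*||.||_1 (entrywise soft-thresholding)."""
          return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

      def proximal_gradient(X, y, lam, num_iters=1000):
          """Proximal gradient method for min_theta 0.5*||X theta - y||^2 + lam*||theta||_1."""
          L = np.linalg.norm(X, 2) ** 2            # Lipschitz constant of the smooth part's gradient
          theta = np.zeros(X.shape[1])
          for _ in range(num_iters):
              grad = X.T @ (X @ theta - y)                         # gradient step on the smooth part ...
              theta = soft_threshold(theta - grad / L, lam / L)    # ... followed by the proximal step
          return theta

      rng = np.random.default_rng(3)
      n, d, s, sigma = 200, 500, 5, 0.1
      theta_star = np.zeros(d)
      theta_star[:s] = 1.0
      X = rng.standard_normal((n, d))
      y = X @ theta_star + sigma * rng.standard_normal(n)

      lam = sigma * np.sqrt(2 * n * np.log(d))     # heuristic choice of order sigma*sqrt(n*log d)
      theta_hat = proximal_gradient(X, y, lam)
      print("estimation error     :", np.linalg.norm(theta_hat - theta_star))
      print("nonzeros in theta_hat:", int(np.sum(np.abs(theta_hat) > 1e-3)))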

  • Week 6 - Feb 22: Algorithmic aspects of regularized loss minimization problems - Part II: Basic convergence analysis of the proximal gradient method. Notes
                   Feb 25: Algorithmic aspects of regularized loss minimization problems - Part III: Local linear convergence of the proximal gradient method under error bound condition. Notes
    Reading:
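    • Illustration (not part of the assigned reading): the Feb 22 lecture's basic guarantee for the proximal gradient method with constant step size 1/L takes the standard textbook form below (the constants used in class may differ); the Feb 25 lecture refines this to local linear convergence under an error bound condition.

      % f convex with L-Lipschitz gradient, g closed convex, F = f + g, x^* any minimizer of F,
      % and x_k the iterates of the proximal gradient method with step size 1/L:
      \[
        F(x_k) - F(x^*) \;\le\; \frac{L\,\|x_0 - x^*\|_2^2}{2k}, \qquad k = 1, 2, \ldots
      \]
      % Under an error bound condition (see Weeks 7-8), one further obtains
      % F(x_{k+1}) - F(x^*) \le c\,(F(x_k) - F(x^*)) for some c \in (0,1) once the iterates
      % are close enough to the solution set, i.e., local linear convergence.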
  • Week 7 - Mar 1: Error bounds for regularized smooth convex minimization. Notes
                   Mar 4: Error bounds for regularized smooth convex minimization (cont'd). Notes
    Reading:
  • Week 8 - Mar 8: Error bounds for regularized smooth convex minimization (cont'd). Notes
                   Mar 11: Error bounds for regularized smooth convex minimization (cont'd). Notes
    Reading:
    • The material for the lectures on Mar 8 and Mar 11 can be found in my lecture summary.
    • The outer Lipschitz continuity property used in the lecture is discussed in Sections 3C and 3D (particularly Theorem 3D.1) of
      Dontchev, Rockafellar. Implicit Functions and Solution Mappings. Springer Monographs in Mathematics, Springer Science+Business Media, LLC, 2009.

    • The statistical nature of regularized loss minimization problems can be exploited to develop an alternative convergence analysis of first-order methods. This is done, e.g., in
      Agarwal, Negahban, Wainwright. Fast Global Convergence of Gradient Methods for High-Dimensional Statistical Recovery. Annals of Statistics 40(5): 2452-2482, 2012. Supplementary Material

      The above paper shows that for several classes of regularized loss minimization problems, certain first-order methods enjoy global linear convergence up to the statistical precision of the model. Modulo the global vs. local nature of the linear convergence, this is weaker than the convergence result derived from the error bound theory. Nevertheless, the approach developed in the above paper can be used to tackle problems that are not known to possess an error bound.
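    • Illustration (not part of the assigned reading): one common (Luo-Tseng-type) formulation of the error bound condition studied in the Mar 1-11 lectures, stated for min_x F(x) = f(x) + g(x) with f smooth and g closed convex; the exact variant and constants used in class may differ, see the lecture summary.

      % X^* denotes the optimal solution set of F = f + g and
      % R(x) := x - prox_g(x - \nabla f(x)) the proximal-gradient residual, which vanishes exactly on X^*.
      % The error bound asks for constants \kappa, \epsilon, \zeta > 0 such that
      \[
        \operatorname{dist}(x, X^*) \;\le\; \kappa\,\|R(x)\|
        \quad \text{whenever } F(x) \le \zeta \text{ and } \|R(x)\| \le \epsilon .
      \]

      Under such a condition, the proximal gradient method converges linearly once its iterates enter the region where the bound holds, which is the mechanism behind the local linear convergence discussed in Week 6.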

  • Week 9 - Mar 15: Non-convex regularized loss minimization: Motivation and examples. Notes
                   Mar 18: Non-convex regularizers and their basic properties. Notes
    Reading:
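    • Illustration (not part of the assigned reading): a small numerical sketch using the MCP penalty as a concrete example of the non-convex regularizers from the Mar 18 lecture; it checks that rho is not convex but that rho(t) + t^2/(2*gamma) is, i.e., rho is weakly convex. The parameters lam and gamma below are illustrative.

      import numpy as np

      lam, gamma = 1.0, 3.0

      def mcp(t):
          """MCP penalty: lam*|t| - t^2/(2*gamma) for |t| <= gamma*lam, constant gamma*lam^2/2 beyond."""
          a = np.abs(t)
          return np.where(a <= gamma * lam, lam * a - a**2 / (2 * gamma), gamma * lam**2 / 2)

      def is_midpoint_convex(vals):
          """Discrete convexity check on a uniform grid: vals[i] <= (vals[i-1] + vals[i+1]) / 2."""
          return bool(np.all(vals[1:-1] <= (vals[:-2] + vals[2:]) / 2 + 1e-12))

      grid = np.linspace(-6.0, 6.0, 2001)
      rho = mcp(grid)
      rho_plus_quad = rho + grid**2 / (2 * gamma)   # weak-convexity surrogate

      print("rho convex on the grid?                ", is_midpoint_convex(rho))            # expect False
      print("rho + t^2/(2*gamma) convex on the grid?", is_midpoint_convex(rho_plus_quad))  # expect True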
  • Week 10 - Mar 22: Non-convex regularized loss minimization: Bound on estimation error. Notes
                     Mar 25: Weakly convex optimization: Basic setup and stationarity measure. Notes
    Reading:
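    • Illustration (not part of the assigned reading): the stationarity measure for weakly convex minimization introduced on Mar 25 is commonly taken to be the gradient norm of the Moreau envelope; the normalization used in class may differ from the generic form below.

      % For a \rho-weakly convex f and 0 < \lambda < 1/\rho, the Moreau envelope
      \[
        f_\lambda(x) := \min_{y}\; f(y) + \frac{1}{2\lambda}\|y - x\|^2
      \]
      % is differentiable with \nabla f_\lambda(x) = \tfrac{1}{\lambda}(x - \hat{x}),
      % where \hat{x} := \operatorname{prox}_{\lambda f}(x). A small value of \|\nabla f_\lambda(x)\|
      % certifies near-stationarity: \|\hat{x} - x\| = \lambda\,\|\nabla f_\lambda(x)\| and
      % \operatorname{dist}\big(0, \partial f(\hat{x})\big) \le \|\nabla f_\lambda(x)\|.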
  • Week 11 - Mar 29: Weakly convex optimization: Iteration complexity analysis of projected stochastic subgradient method. Notes
                     Apr 1: Synchronization problems: Examples. Phase synchronization: Basic setup. Notes
    Reading:
  • Week 12 - Apr 5: Ching Ming Festival.
                     Apr 8: Phase synchronization: Spectral estimator and its estimation error. Notes
    Reading:
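    • Illustration (not part of the assigned reading): a small numpy sketch of the Apr 8 spectral estimator for phase synchronization under the model C = z z^H + sigma*W with Hermitian Gaussian noise W; the problem size and noise level are illustrative.

      import numpy as np

      rng = np.random.default_rng(4)
      n, sigma = 300, 1.0

      # Ground-truth phases z (|z_i| = 1) and noisy observation matrix C = z z^H + sigma * W.
      z = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=n))
      G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
      W = (G + G.conj().T) / np.sqrt(2)             # Hermitian Gaussian noise
      C = np.outer(z, z.conj()) + sigma * W

      # Spectral estimator: leading eigenvector of C, with each entry normalized to unit modulus.
      u = np.linalg.eigh(C)[1][:, -1]               # eigenvector of the largest eigenvalue
      x = u / np.abs(u)

      # Estimation error up to a global phase: min over phi of ||x - e^{i phi} z||_2.
      phi = np.angle(np.vdot(z, x))                 # optimal global phase alignment
      print("error up to global phase:", np.linalg.norm(x - np.exp(1j * phi) * z))
      print("correlation |z^H x| / n :", np.abs(np.vdot(z, x)) / n)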
  • Week 13 - Apr 12: Phase Synchronization: Entrywise error bound of spectral estimator. Notes
                     Apr 15: Easter holiday.
    Reading:
    • The material for the lecture on Apr 12 is based on
      Zhong, Boumal. Near-Optimal Bounds for Phase Synchronization. SIAM Journal on Optimization 28(2): 989-1016, 2018.

      The version of the Davis-Kahan theorem mentioned in the lecture is taken from the above paper. For further reading on the Davis-Kahan theorem and Weyl's inequality, you can refer to
      Stewart, Sun. Matrix Perturbation Theory. Academic Press, 1990.

  • Week 14 - Apr 19: Phase Synchronization: Entrywise error bound of spectral estimator (cont'd). Notes
                     Apr 22: Analysis of the projected gradient method for phase synchronization. Notes
    Reading:
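    • Illustration (not part of the assigned reading): a minimal sketch of a projected-gradient-type iteration for phase synchronization in the spirit of the Apr 22 lecture: take a gradient ascent step on z^H C z and project each entry back onto the unit circle. The step size, iteration count, and spectral initialization below are illustrative choices, not the exact scheme analyzed in class.

      import numpy as np

      def project_entrywise(v):
          """Project each entry of v onto the complex unit circle (entries assumed nonzero)."""
          return v / np.abs(v)

      def projected_gradient_phase_sync(C, z0, step, num_iters=100):
          """Projected gradient ascent on z^H C z over vectors with unit-modulus entries."""
          z = z0.copy()
          for _ in range(num_iters):
              # The Euclidean gradient of z^H C z is proportional to C z; constants are absorbed into step.
              z = project_entrywise(z + step * (C @ z))
          return z

      rng = np.random.default_rng(5)
      n, sigma = 300, 1.0
      z_true = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=n))
      G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
      C = np.outer(z_true, z_true.conj()) + sigma * (G + G.conj().T) / np.sqrt(2)

      # Initialize from the entrywise-normalized leading eigenvector (the Week 12 spectral estimator).
      u = np.linalg.eigh(C)[1][:, -1]
      z_hat = projected_gradient_phase_sync(C, project_entrywise(u), step=1.0 / n)

      phi = np.angle(np.vdot(z_true, z_hat))
      print("error after refinement (up to global phase):",
            np.linalg.norm(z_hat - np.exp(1j * phi) * z_true))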

Homework Sets

  • Homework 1: Do Problems 1, 2, 3, and 5 in the Exercise Sheet (due March 1, 2022)


Last Updated: April 22, 2022