1 Introduction
In its long, more than half-century history of study (going back to the classical article [113 G. Stampacchia, Formes bilinéaires coercitives sur les ensembles convexes. C. R. Acad. Sci. Paris 258, 4413–4416 (1964) ]), variational inequalities have become one of the most popular and universal optimization formulations. Variational inequalities are used in various areas of applied mathematics. Here we can highlight both classical examples from game theory, economics, operator theory, convex analysis [6 K. J. Arrow, L. Hurwicz and H. Uzawa, Studies in linear and non-linear programming. Stanford Mathematical Studies in the Social Sciences, II, Stanford University Press, Stanford (1958) , 19 F. E. Browder, Existence and approximation of solutions of nonlinear variational inequalities. Proc. Nat. Acad. Sci. U.S.A. 56, 1080–1086 (1966) , 106 R. T. Rockafellar, Convex functions, monotone operators and variational inequalities. In Theory and Applications of Monotone Operators (Proc. NATO Advanced Study Inst., Venice, 1968), Edizioni “Oderisi”, Gubbio, 35–65 (1969) , 110 M. Sibony, Méthodes itératives pour les équations et inéquations aux dérivées partielles non linéaires de type monotone. Calcolo 7, 65–183 (1970) , 113 G. Stampacchia, Formes bilinéaires coercitives sur les ensembles convexes. C. R. Acad. Sci. Paris 258, 4413–4416 (1964) ], as well as newer and even more recent applications in optimization and machine learning: non-smooth optimization [93 Y. Nesterov, Smooth minimization of non-smooth functions. Math. Program. 103, 127–152 (2005) ], unsupervised learning [9 F. Bach, J. Mairal and J. Ponce, Convex sparse matrix factorizations, preprint, arXiv:0812.1869 (2008) , 36 E. Esser, X. Zhang and T. F. Chan, A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science. SIAM J. Imaging Sci. 3, 1015–1046 (2010) , 22 A. Chambolle and T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vision 40, 120–145 (2011) ], robust/adversarial optimization [11 A. Ben-Tal, L. El Ghaoui and A. Nemirovski, Robust optimization. Princeton Ser. Appl. Math., Princeton University Press, Princeton (2009) ], GANs [47 I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio, Generative adversarial networks. Commun. ACM 63, 139–144 (2020) ] and reinforcement learning [100 S. Omidshafiei, J. Pazis, C. Amato, J. P. How and J. Vian, Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In Proceedings of the 34th International Conference on Machine Learning (ICML), Proc. Mach. Learn. Res. 70, 2681–2690 (2017) , 57 Y. Jin and A. Sidford, Efficiently solving MDPs with stochastic mirror descent. In Proceedings of the 37th International Conference on Machine Learning (ICML), Proc. Mach. Learn. Res. 119, 4890–4900 (2020) ]. Modern times present new challenges to the community. The increase in scale of problems and the need to speed up solution processes have sparked a huge interest in stochastic formulations of applied tasks, including variational inequalities. This paper surveys stochastic methods for solving variational inequalities.
Structure of the paper
In Section 2, we give a formal statement of the variational inequality problem, basic examples, and main assumptions. Section 3 deals with deterministic methods, from which stochastic methods have evolved. Section 4 covers a variety of stochastic methods. Section 5 is devoted to the recent advances in (not necessarily stochastic) variational inequalities and saddle point problems.
2 Problem: Setting and assumptions
Notation. We use to denote the standard inner product of vectors , where is the -th component of in the standard basis of . It induces the -norm in by . We denote the -norm by for , and for . The dual norm corresponding to the norm is defined by . The symbol stands for the total mathematical expectation. Finally, we need to introduce the symbols and to enclose numerical constants that do not depend on any parameters of the problem, and the symbols and to enclose numerical constants and logarithmic factors.
We study variational inequalities (VI) of the form
where is an operator and is a convex set.
To emphasize the extensiveness of formulation (1), we give a few examples of variational inequalities arising in applied sciences.
Example 1 (Minimization). Consider the minimization problem
Let . Then, if is convex, one can prove that is a solution of (1) if and only if is a solution of problem (2).
Example 2 (Saddle point problem). Consider the saddle point problem (SPP)
Suppose that and with , . Then, if is convex-concave, one can prove that is a solution of problem (1) if and only if is a solution of problem (3).
The study of saddle point problems is often associated with variational inequalities.
Example 3 (Fixed point problem). Consider the fixed point problem
where is an operator. If we set , then one can prove that is a solution of problem (1) if and only if , i.e., is a solution of problem (4).
For the operator from (1) we assume the following.
Assumption 1 (Lipschitzness). The operator is -Lipschitz continuous, i.e., for all , we have .
In the context of problems (2) and (3), -Lipschitzness of the operator means that the functions and are -smooth.
Assumption 2 (Strong monotonicity). The operator is -strongly monotone, i.e., for all , we have . If , then the operator is monotone.
In the context of problems (2) and (3), strong monotonicity of means strong convexity of and strong convexity-strong concavity of . In this paper we first focus on the strongly monotone and monotone cases. But there are also various assumptions relaxing monotonicity and strong monotonicity (e.g., see [55 Y.-G. Hsieh, F. Iutzeler, J. Malick and P. Mertikopoulos, Explore aggressively, update conservatively: Stochastic extragradient methods with variable stepsize scaling. Adv. Neural Inf. Process. Syst. 33, 16223–16234 (2020) ] and references therein).
We note that Assumptions 1 and 2 are sufficient for the existence of a solution to problem (1) (see, e.g., [37 F. Facchinei and J.-S. Pang, Finite-dimensional variational inequalities and complementarity problems. Springer Series in Operations Research, Springer, New York (2003) ]).
Since we work on the set , it is useful to introduce the Euclidean projection onto ,
To characterize the convergence of the methods for monotone variational inequalities we introduce the gap function,
Such a gap function, regarded as a convergence criterion, is more suitable for the following variational inequality problem:
Such a solution is also called weak or Minty (whereas the solution of (1) is called strong or Stampacchia). However, in view of Assumption 1, is single-valued and continuous on , meaning that actually the two indicated formulations of the variational inequality problem are equivalent [37 F. Facchinei and J.-S. Pang, Finite-dimensional variational inequalities and complementarity problems. Springer Series in Operations Research, Springer, New York (2003) ].
For the minimization problem (2), the functional distance to the solution, i.e., the difference , can be used instead of (5). For saddle point problems (3), a slightly different gap function is used, namely,
For both functions (5) and (6) it is crucial that the feasible set is bounded (in fact it is not necessary to take the whole set , which can be unbounded – it suffices to take a bounded convex subset which contains some solution, see [95 Y. Nesterov, Dual extrapolation and its applications to solving variational inequalities and related problems. Math. Program. 109, 319–344 (2007) ]). Therefore it is necessary to define a distance on the set . Since this survey covers methods not only in the Euclidean setup, let us introduce a more general notion of distance.
Definition 1 (Bregman divergence). Let be a function that is -strongly convex w.r.t. the norm and differentiable on . Then for any two points the Bregman divergence (or Bregman distance) associated with is defined as
We denote the Bregman diameter of the set w.r.t. the divergence as . In the Euclidean case, we simply write instead of . Using the definition of , we introduce the so-called proximal operator as follows:
3 Deterministic foundation: Extragradient and other methods
The first and the simplest method for solving the variational inequality (1) is the iterative scheme (also known as the Gradient method)
where is a step size. Note that using the proximal operator associated with the Euclidean Bregman divergence this method can be rewritten in the form
The basic result asserts the convergence of the method to the unique solution of (1) for strongly monotones and -Lipschitz operators ; it was obtained in the papers [19 F. E. Browder, Existence and approximation of solutions of nonlinear variational inequalities. Proc. Nat. Acad. Sci. U.S.A. 56, 1080–1086 (1966) , 106 R. T. Rockafellar, Convex functions, monotone operators and variational inequalities. In Theory and Applications of Monotone Operators (Proc. NATO Advanced Study Inst., Venice, 1968), Edizioni “Oderisi”, Gubbio, 35–65 (1969) , 110 M. Sibony, Méthodes itératives pour les équations et inéquations aux dérivées partielles non linéaires de type monotone. Calcolo 7, 65–183 (1970) ].
If Assumptions 1 and 2 hold and , then after iterations method (7) converges to with a linear rate:
and denotes (here and everywhere in the sequel) the norm . For , we have , , thus the upper bound on the number of iterations needed to achieve the -solution (i.e., ) is .
Various extensions of this statement (for the case when is not Lipschitz, but with linear growth bounds, or when the values of are corrupted by noise) can be found in [10 A. Bakushinskii and B. Polyak, On the solution of variational inequalities. Sov. Math. Dokl. 15, 1705–1710 (1974) , Theorem 1].
When is a potential operator (see Example 1) method (7) coincides with the gradient projection algorithm. It converges for strongly monotone . Moreover, the bounds for the admissible step size are less restrictive () and the relevant complexity estimates are better () than in Theorem 1; see [104 B. T. Polyak, Introduction to optimization. Translations Series in Mathematics and Engineering, Optimization Software, Inc., Publications Division, New York (1987) , Theorem 2 in Section 1.4.2].
However, in the general monotone, but not strongly monotone case (for instance, for the convex-concave SPP, Example 2) convergence fails. The original statements on the convergence of Uzawa’s method (a version of (7)) for saddle point problems [6 K. J. Arrow, L. Hurwicz and H. Uzawa, Studies in linear and non-linear programming. Stanford Mathematical Studies in the Social Sciences, II, Stanford University Press, Stanford (1958) ] were wrong; there are numerous well-known examples where method (7) for corresponding to a bilinear SPP diverges, see, e.g., [104 B. T. Polyak, Introduction to optimization. Translations Series in Mathematics and Engineering, Optimization Software, Inc., Publications Division, New York (1987) , Figure 39].
There have been many other attempts to recover the convergence of gradient-like methods, not for VIs, but for saddle point problems. One of them is based on the transition to modified Lagrangians when is a Lagrange function, see [45 E. G. Gol’šteĭn, Convergence of the gradient method for finding the saddle points of modified Lagrangian functions. Èkonom. i Mat. Metody 13, 322–329 (1977) , 104 B. T. Polyak, Introduction to optimization. Translations Series in Mathematics and Engineering, Optimization Software, Inc., Publications Division, New York (1987) ]. However, we focus on the general VI case. A possible approach is based on the idea of regularization. Instead of the monotone variational inequality (1) one can deal with a regularized inequality, in which the monotone operator is replaced by strongly monotone one , where is a strongly monotone operator and is a regularization parameter. If we denote by the solution of the regularized VI, then one can prove that converges to as (see [10 A. Bakushinskii and B. Polyak, On the solution of variational inequalities. Sov. Math. Dokl. 15, 1705–1710 (1974) ]). However, usually the solution is not easily available. To address this problem, an iterative regularization technique is proposed in [10 A. Bakushinskii and B. Polyak, On the solution of variational inequalities. Sov. Math. Dokl. 15, 1705–1710 (1974) ], where one step of the basic method (7) is applied for the regularized problem. Step sizes and regularization parameters can be adjusted to guarantee convergence.
Another technique is based on the Proximal Point Method proposed independently by B. Martinet in [84 B. Martinet, Régularisation d’inéquations variationnelles par approximations successives. Rev. Française Informat. Recherche Opérationnelle 4, 154–158 (1970) ] and by T. Rockafellar in [107 R. T. Rockafellar, Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14, 877–898 (1976) ]. At each iteration this methods requires the solution of the VI with the operator , where and is the identity operator. This is an implicit method (similar to the regularization method), however there exist numerous implementable versions of Proximal Point. For instance, some methods discussed below can be considered from this point of view.
The breakthrough in methods for solving (non-strongly) monotone variational inequalities was made by Galina Korpelevich [64 G. M. Korpelevič, An extragradient method for finding saddle points and for other problems. Èkonom. i Mat. Metody 12, 747–756 (1976) ]. She exploited the idea of extrapolation for the gradient method. How this works can be explained for the simplest example of a two-dimensional min-max problem with and . It has the unique saddle point , and in any point the direction is orthogonal to ; thus, the iteration (7) increases the distance to the saddle point. However, if we perform the step (7) and get the extrapolated point , the direction is attracted to the saddle point. Thus, the Extragradient method for solving (1) reads
Let satisfy Assumptions 1 and 2 (with ) and let . Then the sequence of iterates generated by the Extragradient method converges to .
For the particular cases of the zero-sum matrix game or the general bilinear problem with , the method converges linearly, provided that the optimal solution is unique (see [64 G. M. Korpelevič, An extragradient method for finding saddle points and for other problems. Èkonom. i Mat. Metody 12, 747–756 (1976) , Theorem 3]). In this case, the rate of convergence is equal to with . More general upper bounds for the Extragradient method can be found in [119 P. Tseng, On linear convergence of iterative methods for the variational inequality problem. J. Comput. Appl. Math. 60, 237–252 (1995) ] and in the recent paper [87 A. Mokhtari, A. Ozdaglar and S. Pattathil, A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 1497–1507 (2020) ]. In particular, for the strongly monotone case the estimate with holds true (compare with the much worse bound for the Gradient method). An adaptive version of the Extragradient method (no knowledge of is required) is proposed in [61 E. N. Khobotov, Modification of the extra-gradient method for solving variational inequalities and certain optimization problems. U.S.S.R. Comput. Math. Math. Phys. 27, 120–127 (1987) ].
Another version of the Extragradient method for finding saddle points is provided in [65 G. M. Korpelevich, Extrapolation gradient methods and their relation to modified Lagrange functions. Èkonom. i Mat. Metody 19, 694–703 (1983) ]. Considering the setup of Example 2, we can exploit just one extrapolating step for the variables :
with and . This method converges to the solution and if is linear in , then the rate of convergence is linear. If we set in method (8), then and we get the so-called Alternating Gradient Method (alternating descent-ascent). In [123 G. Zhang, Y. Wang, L. Lessard and R. B. Grosse, Near-optimal local convergence of alternating gradient descent-ascent for minimax optimization. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 7659–7679 (2022) ], it was proved that this method has local linear convergence with complexity , where .
L. Popov [105 L. D. Popov, A modification of the Arrow–Hurwicz method for search of saddle points. Math. Notes 28, 845–848 (1981) ] proposed a version of extrapolation scheme (sometimes this type of scheme is referred to as optimistic or single-call):
It requires the single calculation of at each iteration vs two calculations in the Extragradient method. As shown in [105 L. D. Popov, A modification of the Arrow–Hurwicz method for search of saddle points. Math. Notes 28, 845–848 (1981) ], method (9) converges for . Rates of convergence for this method were derived recently in [41 G. Gidel, H. Berard, G. Vignoud, P. Vincent and S. Lacoste-Julien, A variational inequality perspective on generative adversarial networks, preprint, arXiv:1802.10551 (2018) , 87 A. Mokhtari, A. Ozdaglar and S. Pattathil, A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 1497–1507 (2020) ], i.e., with for the strongly monotone case and for the bilinear case. Note that in the general strongly monotone case this estimate is optimal [124 J. Zhang, M. Hong and S. Zhang, On lower iteration complexity bounds for the convex concave saddle point problems. Math. Program. 194, 901–935 (2022) ], but for the bilinear problem the upper bounds available in the literature for both the Extragradient and optimistic methods are not tight [56 A. Ibrahim, W. Azizian, G. Gidel and I. Mitliagkas, Linear lower bounds and conditioning of differentiable games. In International Conference on Machine Learning, Proc. Mach. Learn. Res., 4583–4593 (2020) ]. Meanwhile, optimal estimates with can be achieved using approaches from [7 W. Azizian, D. Scieur, I. Mitliagkas, S. Lacoste-Julien and G. Gidel, Accelerating smooth games by manipulating spectral shapes. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 1705–1715 (2020) , 4 M. S. Alkousa, A. V. Gasnikov, D. M. Dvinskikh, D. A. Kovalev and F. S. Stonyakin, Accelerated methods for saddle-point problem. Comput. Math. Math. Phys. 60, 1787–1809 (2020) ].
An extension of the above schemes to an arbitrary proximal setup was obtained in the work of A. Nemirovsky [92 A. Nemirovski, Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15, 229–251 (2004) ]. He proposed the Mirror-Prox method for VIs, exploiting the Bregman divergence:
This yields the following rate-of-convergence result.
Let satisfy Assumptions 1 and 2 (with ), and let
where are generated by algorithm (10) with . Then, after iterations,
Numerous extensions of these original versions of iterative methods for solving variational inequalities were published later. One can highlight Tseng’s Forward-Backward Splitting [120 P. Tseng, A modified forward-backward splitting method for maximal monotone mappings. SIAM J. Control Optim. 38, 431–446 (2000) ], Nesterov’s Dual Extrapolation [95 Y. Nesterov, Dual extrapolation and its applications to solving variational inequalities and related problems. Math. Program. 109, 319–344 (2007) ], Malitsky and Tam’s Forward-Reflected-Backward [83 Y. Malitsky and M. K. Tam, A forward-backward splitting method for monotone inclusions without cocoercivity. SIAM J. Optim. 30, 1451–1472 (2020) ]. All methods have convergence guarantees (12). It turns out that this rate is optimal [101 Y. Ouyang and Y. Xu, Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. Math. Program. 185, 1–35 (2021) ].
4 Stochastic methods: Different setups and assumptions
In this section, we move from deterministic to stochastic methods, i.e., we consider problem (1) with an operator of the form
where is a random variable, is some (typically unknown) probability distribution and is a stochastic operator. In this setup, calculating the value of the full operator is computationally expensive or even intractable. Therefore, one has to work mainly with stochastic realizations .
4.1 General case
The stochastic formulation (13) for problem (1) was first considered by the authors of [60 A. Juditsky, A. Nemirovski and C. Tauvel, Solving variational inequalities with stochastic mirror-prox algorithm. Stoch. Syst. 1, 17–58 (2011) ]. They proposed a natural stochastic generalization of the Extragradient method (more precisely, of the Mirror-Prox methods):
Here it is important to note that the variables and are independent and is an unbiased estimator of . Moreover, is assumed to satisfy the following condition.
Assumption 3 (Bounded variance). The unbiased operator has uniformly bounded variance, i.e., for all and , we have .
Under this assumption, the following result was established in [60 A. Juditsky, A. Nemirovski and C. Tauvel, Solving variational inequalities with stochastic mirror-prox algorithm. Stoch. Syst. 1, 17–58 (2011) ].
Let satisfy Assumptions 1, 2 (with ) and 3, and let be defined as in (11), where are generated by algorithm (14) with . Then, after iterations, one can guarantee that
In [17 A. Beznosikov, V. Samokhin and A. Gasnikov, Distributed saddle-point problems: Lower bounds, optimal and robust algorithms, preprint arXiv:2010.13112 (2020) ], the authors carried out an analysis of algorithm (14) for strongly monotone VIs in the Euclidean case. In particular, under Assumptions 1, 2 and 3 one can guarantee that after iterations of method (14) one has that (here and below we omit numerical constants in the exponential multiplier)
Also in [17 A. Beznosikov, V. Samokhin and A. Gasnikov, Distributed saddle-point problems: Lower bounds, optimal and robust algorithms, preprint arXiv:2010.13112 (2020) ], the authors obtained lower complexity bounds for solving VIs satisfying Assumptions 1, 2 and 3 with stochastic methods. It turns out that the conclusions of Theorem 4 in the monotone case and estimate (16) are optimal and meet lower bounds up to numerical constants.
Optimistic-like (or single-call) methods were also investigated in the stochastic setting. The work [41 G. Gidel, H. Berard, G. Vignoud, P. Vincent and S. Lacoste-Julien, A variational inequality perspective on generative adversarial networks, preprint, arXiv:1802.10551 (2018) ] applies the following update scheme:
For this method, in the monotone Euclidean case, the authors proved an estimate similar to (15). In the strongly monotone case, method (17) was investigated in the paper [54 Y.-G. Hsieh, F. Iutzeler, J. Malick and P. Mertikopoulos, On the convergence of single-call stochastic extra-gradient methods. Adv. Neural Inf. Process. Syst. 32, 6938–6948 (2019) ], but the estimates obtained there do not meet the lower bounds. The optimal estimates for this scheme were obtained later in [14 A. Beznosikov, A. Gasnikov, K. Zainulina, A. Maslovskiy and D. Pasechnyuk, A unified analysis of variational inequality methods: Variance reduction, sampling, quantization and coordinate descent, preprint, arXiv:2201.12206 (2022) ].
The work [66 G. Kotsalis, G. Lan and T. Li, Simple and optimal methods for stochastic variational inequalities. I: Operator extrapolation. SIAM J. Optim. 32, 2041–2073 (2022) ] deals with a slightly different single-call approach in the non-Euclidean case:
This update rule is a modification of the Forward-Reflected-Backward approach, namely, here is a parameter, while in [83 Y. Malitsky and M. K. Tam, A forward-backward splitting method for monotone inclusions without cocoercivity. SIAM J. Optim. 30, 1451–1472 (2020) ], . The analysis of method (18) gives optimal estimates in both the strongly monotone and monotone cases.
The theoretical results and guarantees discussed above rely in significant manner on the bounded variance assumption (Assumption 3). This assumption is quite restrictive (especially when the domain is unbounded) and does not hold for many popular machine learning problems. Moreover, one can even design a strongly monotone variational inequality on an unbounded domain such that method (14) diverges exponentially fast [26 T. Chavdarova, G. Gidel, F. Fleuret and S. Lacoste-Julien, Reducing noise in GAN training with variance reduced extragradient. Adv. Neural Inf. Process. Syst. 32, 393–403 (2019) ]. The authors of [55 Y.-G. Hsieh, F. Iutzeler, J. Malick and P. Mertikopoulos, Explore aggressively, update conservatively: Stochastic extragradient methods with variable stepsize scaling. Adv. Neural Inf. Process. Syst. 33, 16223–16234 (2020) , 48 E. Gorbunov, H. Berard, G. Gidel and N. Loizou, Stochastic extragradient: General analysis and improved rates. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 7865–7901 (2022) ] consider a relaxed form of the bounded variance condition and assume that with in the Euclidean case. Under this condition and Assumptions 1 and 2, it is proved in [48 E. Gorbunov, H. Berard, G. Gidel and N. Loizou, Stochastic extragradient: General analysis and improved rates. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 7865–7901 (2022) ] that after iterations of algorithm (14) (when ) it holds that
where . The same assumption on stochastic realizations was considered in [67 G. Kotsalis, G. Lan and T. Li, Simple and optimal methods for stochastic variational inequalities. II: Markovian noise and policy evaluation in reinforcement learning. SIAM J. Optim. 32, 1120–1155 (2022) ], where method (18) was used, yielding the estimate
Estimates (19) and (20) are competitive: the former is superior in terms of the stochastic term (second term), while the latter is superior in terms of the deterministic term (first term). However, none of these results deals completely with the issue of bounded noise, because the condition considered above is not general. The key to avoiding the bounded variance assumption on lies in the way how stochasticity is generated in method (14). Method (14) is sometimes called Independent Samples Stochastic Extragradient (I-SEG). To address the bounded variance issue, K. Mishchenko et al. [86 K. Mishchenko, D. Kovalev, E. Shulgin, P. Richtárik and Y. Malitsky, Revisiting stochastic extragradient. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 4573–4582 (2020) ] proposed another stochastic modification of the Extragradient algorithm, called Same Sample Stochastic Extragradient (S-SEG):
For simplicity, we present the above method for the case when (), and refer the reader to [86 K. Mishchenko, D. Kovalev, E. Shulgin, P. Richtárik and Y. Malitsky, Revisiting stochastic extragradient. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 4573–4582 (2020) ] for a more general case of regularized VIs. In contrast to I-SEG, S-SEG uses the same sample for both steps at iteration . Although such a strategy cannot be implemented in some scenarios (streaming oracle), it can be applied to finite-sum problems, which have been gaining an increasing attention in the recent years. Moreover, S-SEG relies in significant manner on the following assumption.
Assumption 4. The operator is -Lipschitz and -strongly monotone almost surely in , i.e., and for all , almost surely in .
The evident difference between the I-SEG and S-SEG setups can be explained through the connection between the Extragradient and the Proximal Point (PP) methods [84 B. Martinet, Régularisation d’inéquations variationnelles par approximations successives. Rev. Française Informat. Recherche Opérationnelle 4, 154–158 (1970) , 107 R. T. Rockafellar, Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14, 877–898 (1976) ]. In the rest of this subs-section we assume that (). In this setup, PP has the update rule
The method converges for any monotone operator and any . However, the update rule of PP is implicit and in many situations it cannot be computed efficiently. The Extragradient method can be seen as a natural approximation of PP that substitutes in the right-hand side by one gradient step from :
In addition, when is -Lipschitz, one can estimate how good the approximation is. Consider (the Extragradient step) and (the PP step). Then can be estimated as follows [86 K. Mishchenko, D. Kovalev, E. Shulgin, P. Richtárik and Y. Malitsky, Revisiting stochastic extragradient. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 4573–4582 (2020) ]:
That is, the difference between the Extragradient and PP steps is of the order rather than . Since the latter corresponds to the difference between PP and the simple gradient step (7), the Extragradient method approximates PP better than gradient steps, which are known to be non-convergent for general monotone Lipschitz variational inequalities. This approximation feature of the Extragradient method is crucial for its convergence and, as the above derivation implies, the approximation argument significantly relies on the Lipschitzness of the operator .
Let us go back to the differences between I-SEG and S-SEG. In S-SEG, the -th iteration can be regarded as a single Extragradient step for the operator . Therefore, Lipschitzness and monotonicity of (Assumption 4) are important for the analysis of S-SEG. In contrast, I-SEG uses different operators for the extrapolation and update steps. In this case, there is no effect from the Lipschitzness/monotonicity of individual s. Therefore, the analysis of I-SEG naturally relies on the Lipschitzness and monotonicity of as well as on the closeness (on average) of and (Assumption 3).
The convergence of I-SEG was discussed earlier in this section. Regarding S-SEG, one has the following result [86 K. Mishchenko, D. Kovalev, E. Shulgin, P. Richtárik and Y. Malitsky, Revisiting stochastic extragradient. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 4573–4582 (2020) ].
Let Assumption 4 hold. Then there exists a choice of step size (see [48 E. Gorbunov, H. Berard, G. Gidel and N. Loizou, Stochastic extragradient: General analysis and improved rates. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 7865–7901 (2022) ]) such that the output of S-SEG after iterations satisfies
where .
This rate is similar to the one known for I-SEG, with the following differences. First, instead of the uniform bound on the variance , the rate depends on , which is the variance of measured at the solution. In many cases, , while is finite. From this perspective, S-SEG enjoys a better rate of convergence than I-SEG. However, this comes at a price: while the rate of I-SEG depends on the Lipschitz and strong-monotonicity constants of , the rate of S-SEG depends on the worst constants of , which can be much worse than those for . In particular, consider the finite-sum setup with uniform sampling of : , where is -Lipschitz and -strongly monotone, and . Then Assumption 4 holds with and and these constants appear in the rate from Theorem 3. The authors of [48 E. Gorbunov, H. Berard, G. Gidel and N. Loizou, Stochastic extragradient: General analysis and improved rates. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 7865–7901 (2022) ] tighten this rate. In particular, they prove that for S-SEG with different step sizes for the extrapolation and update steps one has that
where and . Since is (sometimes considerably) larger than , the improvement is noticeable. Moreover, when the constants are known, one can consider the so-called importance sampling [52 R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin and P. Richtárik, SGD: General analysis and improved rates. In Proceedings of the 36th International Conference on Machine Learning, Proc. Mach. Learn. Res. 97, 5200–5209 (2019) ]: , where . As the authors of [48