
This article is published open access.

Smooth monotone stochastic variational inequalities and saddle point problems: A survey

  • Aleksandr Beznosikov

    Moscow Institute of Physics and Technology, Russia
  • Boris Polyak

    Moscow University of Physics and Engineering, Russia
  • Eduard Gorbunov

    Moscow Institute of Physics and Technology, Russia
  • Dmitry Kovalev

    King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  • Alexander Gasnikov

    Moscow Institute of Physics and Technology, Russia
This paper is a survey of methods for solving smooth, (strongly) monotone stochastic variational inequalities. To begin with, we present the deterministic foundation from which the stochastic methods eventually evolved. Then we review methods for the general stochastic formulation, and look at the finite-sum setup. The last parts of the paper are devoted to various recent (not necessarily stochastic) advances in algorithms for variational inequalities.

1 Introduction

In its long, more than half-century history of study (going back to the classical article [113]), variational inequalities have become one of the most popular and universal optimization formulations, used across many areas of applied mathematics. Here we can highlight both classical examples from game theory, economics, operator theory and convex analysis [6, 19, 106, 110, 113], and newer, even more recent applications in optimization and machine learning: non-smooth optimization [93], unsupervised learning [9, 22, 36], robust/adversarial optimization [11], GANs [47] and reinforcement learning [57, 100]. Modern times present new challenges to the community: the increasing scale of problems and the need to speed up solution processes have sparked huge interest in stochastic formulations of applied tasks, including variational inequalities. This paper surveys stochastic methods for solving variational inequalities.

Structure of the paper

In Section 2, we give a formal statement of the variational inequality problem, basic examples, and main assumptions. Section 3 deals with deterministic methods, from which stochastic methods have evolved. Section 4 covers a variety of stochastic methods. Section 5 is devoted to the recent advances in (not necessarily stochastic) variational inequalities and saddle point problems.

2 Problem: Setting and assumptions

Notation. We use $\langle x,y\rangle\coloneqq\sum_{i=1}^{d}x_{i}y_{i}$ to denote the standard inner product of vectors $x,y\in\mathbb{R}^{d}$, where $x_{i}$ is the $i$-th component of $x$ in the standard basis of $\mathbb{R}^{d}$. It induces the $\ell_{2}$-norm in $\mathbb{R}^{d}$ by $\lVert x\rVert_{2}\coloneqq\sqrt{\langle x,x\rangle}$. We denote the $\ell_{p}$-norm by $\lVert x\rVert_{p}\coloneqq\bigl(\sum_{i=1}^{d}\lvert x_{i}\rvert^{p}\bigr)^{1/p}$ for $p\in[1,\infty)$, and $\lVert x\rVert_{\infty}\coloneqq\max_{1\leq i\leq d}\lvert x_{i}\rvert$ for $p=\infty$. The dual norm $\lVert\cdot\rVert_{*}$ corresponding to the norm $\lVert\cdot\rVert$ is defined by $\lVert y\rVert_{*}\coloneqq\max\{\langle x,y\rangle\mid\lVert x\rVert\leq 1\}$. The symbol $\mathbb{E}[\cdot]$ stands for the total mathematical expectation. Finally, the symbols $\mathcal{O}$ and $\Omega$ hide numerical constants that do not depend on any parameters of the problem, and the symbols $\tilde{\mathcal{O}}$ and $\tilde{\Omega}$ additionally hide logarithmic factors.
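As a quick illustration of the dual-norm definition, the following sketch (not from the survey; the vectors are arbitrary) checks numerically that the dual of the $\ell_{1}$-norm is the $\ell_{\infty}$-norm:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(5)

# Candidates on the unit l1-sphere: random points plus the vertices +-e_i
# (the maximum of <x, y> over the l1-ball is attained at a vertex).
xs = rng.standard_normal((10000, 5))
xs /= np.abs(xs).sum(axis=1, keepdims=True)
xs = np.vstack([xs, np.eye(5), -np.eye(5)])

dual = float((xs @ y).max())
assert np.isclose(dual, np.abs(y).max())  # dual norm of l1 is l_inf
```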

We study variational inequalities (VI) of the form

$$\text{find}\ z^{*}\in\mathcal{Z}\ \text{such that}\ \langle F(z^{*}),z-z^{*}\rangle\geq 0\quad\forall z\in\mathcal{Z},\tag{1}$$

where $F\colon\mathcal{Z}\to\mathbb{R}^{d}$ is an operator and $\mathcal{Z}\subseteq\mathbb{R}^{d}$ is a convex set.

To emphasize the generality of formulation (1), we give a few examples of variational inequalities arising in the applied sciences.

Example 1 (Minimization). Consider the minimization problem

$$\min_{z\in\mathcal{Z}}f(z).\tag{2}$$

Let $F(z)\coloneqq\nabla f(z)$. Then, if $f$ is convex, one can prove that $z^{*}\in\mathcal{Z}$ is a solution of (1) if and only if $z^{*}$ is a solution of problem (2).

Example 2 (Saddle point problem). Consider the saddle point problem (SPP)

$$\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}}g(x,y).\tag{3}$$

Suppose that $F(z)\coloneqq F(x,y)=[\nabla_{x}g(x,y),-\nabla_{y}g(x,y)]$ and $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$ with $\mathcal{X}\subseteq\mathbb{R}^{d_{x}}$, $\mathcal{Y}\subseteq\mathbb{R}^{d_{y}}$. Then, if $g$ is convex-concave, one can prove that $z^{*}\in\mathcal{Z}$ is a solution of problem (1) if and only if $z^{*}$ is a solution of problem (3).

The study of saddle point problems is often associated with variational inequalities.

Example 3 (Fixed point problem). Consider the fixed point problem

$$\text{find}\ z^{*}\in\mathbb{R}^{d}\ \text{such that}\ T(z^{*})=z^{*},\tag{4}$$

where $T\colon\mathbb{R}^{d}\to\mathbb{R}^{d}$ is an operator. If we set $F(z)=z-T(z)$, then one can prove that $z^{*}\in\mathcal{Z}=\mathbb{R}^{d}$ is a solution of problem (1) if and only if $F(z^{*})=0$, i.e., $z^{*}$ is a solution of problem (4).
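To make the three examples concrete, here is a minimal numerical sketch (with toy instances of $f$, $g$ and $T$ chosen purely for illustration) of the corresponding operators $F$; on $\mathcal{Z}=\mathbb{R}^{d}$ the VI condition forces $F(z^{*})=0$ at the respective solutions:

```python
import numpy as np

# Example 1: f(z) = ||z - a||_2^2 / 2 gives F(z) = z - a, with solution z* = a.
a = np.array([1.0, -2.0])
F1 = lambda z: z - a

# Example 2: g(x, y) = x * y gives F(x, y) = (y, -x), with saddle point (0, 0).
F2 = lambda z: np.array([z[1], -z[0]])

# Example 3: the contraction T(z) = z / 2 + 1 gives F(z) = z - T(z),
# whose fixed point is z* = 2.
T = lambda z: z / 2 + 1.0
F3 = lambda z: z - T(z)

# On Z = R^d the VI condition <F(z*), z - z*> >= 0 for all z forces F(z*) = 0.
residuals = [np.linalg.norm(F1(a)), np.linalg.norm(F2(np.zeros(2))), abs(F3(2.0))]
assert max(residuals) < 1e-12
```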

For the operator $F$ from (1) we assume the following.

Assumption 1 (Lipschitzness). The operator $F$ is $L$-Lipschitz continuous, i.e., for all $u,v\in\mathcal{Z}$, we have $\lVert F(u)-F(v)\rVert_{*}\leq L\lVert u-v\rVert$.

In the context of problems (2) and (3), $L$-Lipschitzness of the operator means that the functions $f(z)$ and $g(x,y)$ are $L$-smooth.

Assumption 2 (Strong monotonicity). The operator $F$ is $\mu$-strongly monotone, i.e., for all $u,v\in\mathcal{Z}$, we have $\langle F(u)-F(v),u-v\rangle\geq\mu\lVert u-v\rVert_{2}^{2}$. If $\mu=0$, then the operator $F$ is monotone.

In the context of problems (2) and (3), strong monotonicity of $F$ means strong convexity of $f(z)$ and strong convexity-strong concavity of $g(x,y)$. In this paper we focus first on the strongly monotone and monotone cases, but there are also various assumptions relaxing monotonicity and strong monotonicity (e.g., see [55] and references therein).

We note that Assumptions 1 and 2 are sufficient for the existence of a solution to problem (1) (see, e.g., [37]).

Since we work on the set $\mathcal{Z}$, it is useful to introduce the Euclidean projection onto $\mathcal{Z}$,

$$P_{\mathcal{Z}}(z)=\arg\min_{v\in\mathcal{Z}}\lVert z-v\rVert_{2}.$$

To characterize the convergence of the methods for monotone variational inequalities we introduce the gap function,

$$\operatorname{Gap}_{\mathrm{VI}}(z)\coloneqq\sup_{u\in\mathcal{Z}}\,\langle F(u),z-u\rangle.\tag{5}$$

Such a gap function, regarded as a convergence criterion, is more suitable for the following variational inequality problem:

$$\text{find}\ z^{*}\in\mathcal{Z}\ \text{such that}\ \langle F(z),z^{*}-z\rangle\leq 0\quad\text{for all}\ z\in\mathcal{Z}.$$

Such a solution is also called weak or Minty (whereas the solution of (1) is called strong or Stampacchia). However, in view of Assumption 1, $F$ is single-valued and continuous on $\mathcal{Z}$, which means that the two formulations of the variational inequality problem are in fact equivalent [37].

For the minimization problem (2), the functional distance to the solution, i.e., the difference $f(z)-f(z^{*})$, can be used instead of (5). For saddle point problems (3), a slightly different gap function is used, namely,

$$\operatorname{Gap}_{\mathrm{SPP}}(z)\coloneqq\operatorname{gap}(x,y)=\max_{y^{\prime}\in\mathcal{Y}}g(x,y^{\prime})-\min_{x^{\prime}\in\mathcal{X}}g(x^{\prime},y).\tag{6}$$

For both functions (5) and (6) it is crucial that the feasible set be bounded (in fact it is not necessary to take the whole set $\mathcal{Z}$, which can be unbounded; it suffices to take a bounded convex subset $\mathcal{C}$ which contains some solution, see [95]). Therefore it is necessary to define a distance on the set $\mathcal{Z}$. Since this survey covers methods beyond the Euclidean setup, let us introduce a more general notion of distance.

Definition 1 (Bregman divergence). Let $\nu(z)$ be a function that is $1$-strongly convex w.r.t. the norm $\lVert\cdot\rVert$ and differentiable on $\mathcal{Z}$. Then for any two points $z,z^{\prime}\in\mathcal{Z}$ the Bregman divergence (or Bregman distance) $V(z,z^{\prime})$ associated with $\nu(z)$ is defined as

$$V(z,z^{\prime})\coloneqq\nu(z^{\prime})-\nu(z)-\langle\nabla\nu(z),z^{\prime}-z\rangle.$$

We denote the Bregman diameter of the set $\mathcal{Z}$ w.r.t. the divergence $V(z,z^{\prime})$ by $D_{\mathcal{Z},V}\coloneqq\max\{\sqrt{2V(z,z^{\prime})}\mid z,z^{\prime}\in\mathcal{Z}\}$. In the Euclidean case, we simply write $D_{\mathcal{Z}}$ instead of $D_{\mathcal{Z},V}$. Using the definition of $V$, we introduce the so-called proximal operator as follows:

$$\operatorname{prox}_{x}(y)=\arg\min_{z\in\mathcal{Z}}\{\langle y,z\rangle+V(z,x)\}.$$
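For intuition, here is a small sketch of the proximal operator in two standard setups (the entropy closed form is the classical mirror-descent multiplicative update, stated here as background, not taken from the survey): with the Euclidean $\nu$ the prox step is a projected step, while with the entropy $\nu$ on the probability simplex it is a multiplicative update:

```python
import numpy as np

def prox_euclidean(x, y, project):
    # nu(z) = ||z||_2^2 / 2 gives V(z, x) = ||z - x||_2^2 / 2, so the prox
    # step argmin_z { <y, z> + V(z, x) } is the projected step P_Z(x - y).
    return project(x - y)

def prox_entropy(x, y):
    # nu(z) = sum_i z_i log z_i on the probability simplex gives the
    # closed-form multiplicative-weights update.
    w = x * np.exp(-y)
    return w / w.sum()

project_box = lambda z: np.clip(z, 0.0, 1.0)  # Euclidean projection onto [0, 1]^d
x = np.array([0.5, 0.3, 0.2])
y = np.array([1.0, -1.0, 0.0])
p1 = prox_euclidean(x, y, project_box)
p2 = prox_entropy(x, y)
assert np.all((p1 >= 0) & (p1 <= 1))                  # stays in the box
assert np.isclose(p2.sum(), 1.0) and np.all(p2 > 0)   # stays on the simplex
```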

3 Deterministic foundation: Extragradient and other methods

The first and the simplest method for solving the variational inequality (1) is the iterative scheme (also known as the Gradient method)

$$z^{k+1}=P_{\mathcal{Z}}(z^{k}-\gamma F(z^{k})),\tag{7}$$

where $\gamma>0$ is a step size. Note that using the proximal operator associated with the Euclidean Bregman divergence, this method can be rewritten in the form

$$z^{k+1}=\operatorname{prox}_{z^{k}}(\gamma F(z^{k})).$$

The basic result asserts the convergence of the method to the unique solution of (1) for strongly monotone and $L$-Lipschitz operators $F$; it was obtained in the papers [19, 106, 110].

Theorem 1.

If Assumptions 1 and 2 hold and $0<\gamma<2\mu/L^{2}$, then after $k$ iterations method (7) converges to $z^{*}$ with a linear rate:

$$\lVert z^{k}-z^{*}\rVert_{2}^{2}=\mathcal{O}(R_{0}^{2}q^{k}),\quad\text{with}\ q=1-2\gamma\mu+\gamma^{2}L^{2},$$

and R0R_{0} denotes (here and everywhere in the sequel) the norm z0z2\lVert z^{0}-z^{*}\rVert_{2}. For γ=μ/L2\gamma=\mu/L^{2}, we have q=(11/κ2)q=(1-1/\kappa^{2}), κ=L/μ\kappa=L/\mu, thus the upper bound on the number of iterations needed to achieve the ε\varepsilon-solution (i.e., zkz22ε\lVert z^{k}-z^{*}\rVert_{2}^{2}\leq\nobreak\varepsilon) is O(κ2log(R02/ε))\mathcal{O}(\kappa^{2}\log(R_{0}^{2}/\varepsilon)).

Various extensions of this statement (to the case when $F$ is not Lipschitz but satisfies linear growth bounds, or when the values of $F$ are corrupted by noise) can be found in [10, Theorem 1].

When $F$ is a potential operator (see Example 1), method (7) coincides with the gradient projection algorithm. It converges for strongly monotone $F$; moreover, the bounds on the admissible step size are less restrictive ($0<\gamma<2/L$) and the relevant complexity estimate is better ($\mathcal{O}(\kappa\log(R_{0}^{2}/\varepsilon))$) than in Theorem 1; see [104, Theorem 2 in Section 1.4.2].
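Theorem 1 can be checked on a toy strongly monotone (non-potential) operator; the sketch below (an illustrative instance, not from the survey) uses $F(z)=\mathbf{B}z$ with a rotation-plus-identity matrix, for which $\mu=1$, $L=\sqrt{2}$, and the step $\gamma=\mu/L^{2}$ gives the contraction factor $q=1/2$ exactly:

```python
import numpy as np

# Toy strongly monotone, non-potential operator F(z) = B z with
# B = [[1, 1], [-1, 1]]: <Bz, z> = ||z||_2^2 gives mu = 1, and
# B^T B = 2 I gives L = sqrt(2).
B = np.array([[1.0, 1.0], [-1.0, 1.0]])
mu, L = 1.0, np.sqrt(2.0)
gamma = mu / L**2                           # the step from Theorem 1
q = 1 - 2 * gamma * mu + gamma**2 * L**2    # contraction factor, here 1/2

z = np.array([1.0, 1.0])
for _ in range(50):
    z = z - gamma * B @ z                   # method (7) with Z = R^d
errsq = float(np.linalg.norm(z) ** 2)

assert np.isclose(q, 0.5)
assert errsq <= q**50 * 2.0 + 1e-12         # ||z^k - z*||_2^2 <= q^k ||z^0 - z*||_2^2
```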

However, in the general monotone but not strongly monotone case (for instance, for the convex-concave SPP of Example 2), convergence fails. The original statements on the convergence of Uzawa's method (a version of (7)) for saddle point problems [6] were wrong; there are numerous well-known examples where method (7) applied to $F$ corresponding to a bilinear SPP diverges, see, e.g., [104, Figure 39].

There have been many other attempts to recover the convergence of gradient-like methods, not for VIs, but for saddle point problems. One of them is based on the transition to modified Lagrangians when $g(x,y)$ is a Lagrange function, see [45, 104]. However, we focus on the general VI case. A possible approach is based on the idea of regularization: instead of the monotone variational inequality (1), one deals with a regularized inequality in which the monotone operator $F$ is replaced by the strongly monotone operator $F+\varepsilon_{k}T$, where $T(z)$ is a strongly monotone operator and $\varepsilon_{k}>0$ is a regularization parameter. If we denote by $z^{k}$ the solution of the regularized VI, then one can prove that $z^{k}$ converges to $z^{*}$ as $\varepsilon_{k}\to 0$ (see [10]). However, usually the solution $z^{k}$ is not easily available. To address this problem, an iterative regularization technique was proposed in [10], where one step of the basic method (7) is applied to the regularized problem; step sizes and regularization parameters can be adjusted to guarantee convergence.

Another technique is based on the Proximal Point Method, proposed independently by B. Martinet [84] and R. T. Rockafellar [107]. At each iteration this method requires the solution of the VI with the operator $F+cI$, where $c>0$ and $I$ is the identity operator. This is an implicit method (similar to the regularization method); however, there exist numerous implementable versions of Proximal Point. For instance, some of the methods discussed below can be viewed from this perspective.

The breakthrough in methods for solving (non-strongly) monotone variational inequalities was made by Galina Korpelevich [64], who exploited the idea of extrapolation for the gradient method. How this works can be explained with the simplest example of a two-dimensional min-max problem with $g(x,y)=xy$ and $\mathcal{Z}=\mathbb{R}^{2}$. It has the unique saddle point $z^{*}=0$, and at any point $z^{k}$ the direction $F(z^{k})$ is orthogonal to $z^{k}$; thus, the iteration (7) increases the distance to the saddle point. However, if we perform the step (7) to get the extrapolated point $z^{k+1/2}$, the direction $-F(z^{k+1/2})$ is attracted to the saddle point. Thus, the Extragradient method for solving (1) reads

$$z^{k+1/2}=P_{\mathcal{Z}}(z^{k}-\gamma F(z^{k})),$$
$$z^{k+1}=P_{\mathcal{Z}}(z^{k}-\gamma F(z^{k+1/2})).$$
Theorem 2.

Let $F$ satisfy Assumptions 1 and 2 (with $\mu=0$) and let $0<\gamma<1/L$. Then the sequence of iterates $z^{k}$ generated by the Extragradient method converges to $z^{*}$.

For the particular cases of a zero-sum matrix game or the general bilinear problem with $g(x,y)=y^{\top}\mathbf{A}x-b^{\top}x+c^{\top}y$, the method converges linearly, provided that the optimal solution is unique (see [64, Theorem 3]). In this case, the rate of convergence is $\mathcal{O}(\kappa\log(R_{0}^{2}/\varepsilon))$ with $\kappa=\lambda_{\max}(\mathbf{A}\mathbf{A}^{\top})/\lambda_{\min}(\mathbf{A}\mathbf{A}^{\top})$. More general upper bounds for the Extragradient method can be found in [119] and in the recent paper [87]. In particular, for the strongly monotone case the estimate $\mathcal{O}(\kappa\log(R_{0}^{2}/\varepsilon))$ with $\kappa=L/\mu$ holds true (compare with the much worse bound $\mathcal{O}(\kappa^{2}\log(R_{0}^{2}/\varepsilon))$ for the Gradient method). An adaptive version of the Extragradient method (requiring no knowledge of $L$) is proposed in [61].
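The picture described above — the Gradient method spiraling away from the saddle point of $g(x,y)=xy$ while the Extragradient method contracts toward it — can be reproduced in a few lines (illustrative sketch, step size chosen arbitrarily within the admissible range):

```python
import numpy as np

# F for g(x, y) = x * y on R^2: F(x, y) = (y, -x), with saddle point z* = 0.
F = lambda z: np.array([z[1], -z[0]])
gamma = 0.1

# Method (7): since <F(z), z> = 0 here, each step multiplies ||z||_2^2
# by (1 + gamma^2), so the iterates spiral away from the saddle point.
z = np.array([1.0, 0.0])
for _ in range(1000):
    z = z - gamma * F(z)
diverged_norm = float(np.linalg.norm(z))

# Extragradient: extrapolate first, then step from the original point.
z = np.array([1.0, 0.0])
for _ in range(1000):
    z_half = z - gamma * F(z)
    z = z - gamma * F(z_half)
converged_norm = float(np.linalg.norm(z))

assert diverged_norm > 10.0     # the Gradient method diverges
assert converged_norm < 0.1     # the Extragradient method converges
```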

Another version of the Extragradient method for finding saddle points is provided in [65]. In the setup of Example 2, one can exploit just one extrapolation step, for the variables $y$:

$$y^{k+1/2}=P_{\mathcal{Y}}(y^{k}+\gamma\nabla_{y}g(x^{k},y^{k})),$$
$$x^{k+1}=P_{\mathcal{X}}(x^{k}-\gamma\nabla_{x}g(x^{k},y^{k+1/2})),$$
$$y^{k+1}=y^{k}+q(y^{k+1/2}-y^{k}),\tag{8}$$

with $0<\gamma<1/(2L)$ and $0<q<1$. This method converges to the solution, and if $g(x,y)$ is linear in $y$, then the rate of convergence is linear. If we set $q=1$ in method (8), then $y^{k+1}=y^{k+1/2}$ and we obtain the so-called Alternating Gradient Method (alternating descent-ascent). In [123], it was proved that this method has local linear convergence with complexity $\mathcal{O}(\kappa\log(R_{0}^{2}/\varepsilon))$, where $\kappa=L/\mu$.

L. Popov [105] proposed a version of the extrapolation scheme (sometimes this type of scheme is referred to as optimistic or single-call):

$$z^{k+1/2}=P_{\mathcal{Z}}(z^{k}-\gamma F(z^{k-1/2})),$$
$$z^{k+1}=P_{\mathcal{Z}}(z^{k}-\gamma F(z^{k+1/2})).\tag{9}$$

It requires a single calculation of $F$ at each iteration, versus two calculations in the Extragradient method. As shown in [105], method (9) converges for $0<\gamma<1/(3L)$. Rates of convergence for this method were derived recently in [41, 87], namely $\mathcal{O}(\kappa\log(R_{0}^{2}/\varepsilon))$ with $\kappa=L/\mu$ in the strongly monotone case and with $\kappa=\lambda_{\max}(\mathbf{A}\mathbf{A}^{\top})/\lambda_{\min}(\mathbf{A}\mathbf{A}^{\top})$ in the bilinear case. Note that in the general strongly monotone case this estimate is optimal [124], but for the bilinear problem the upper bounds available in the literature for both the Extragradient and optimistic methods are not tight [56]. Meanwhile, the optimal estimate $\mathcal{O}(\sqrt{\kappa}\log(R_{0}^{2}/\varepsilon))$ with $\kappa=\lambda_{\max}(\mathbf{A}\mathbf{A}^{\top})/\lambda_{\min}(\mathbf{A}\mathbf{A}^{\top})$ can be achieved using approaches from [4, 7].
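A sketch of Popov's scheme (9) on the same toy bilinear problem $g(x,y)=xy$ (illustrative instance), keeping the previous operator value so that each iteration makes exactly one call to $F$:

```python
import numpy as np

F = lambda z: np.array([z[1], -z[0]])   # bilinear toy problem, saddle point 0
gamma = 0.1                              # satisfies gamma < 1/(3L) with L = 1
z = np.array([1.0, 0.0])
F_prev = F(z)                            # plays the role of F(z^{-1/2})
evals = 1
for _ in range(1000):
    z_half = z - gamma * F_prev          # extrapolate using the stale value
    F_prev = F(z_half)                   # the single new operator evaluation
    evals += 1
    z = z - gamma * F_prev
assert np.linalg.norm(z) < 0.1           # converges to the saddle point
assert evals == 1001                     # exactly one F call per iteration
```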

An extension of the above schemes to an arbitrary proximal setup was obtained in the work of A. Nemirovski [92], who proposed the Mirror-Prox method for VIs, exploiting the Bregman divergence:

$$z^{k+1/2}=\operatorname{prox}_{z^{k}}(\gamma F(z^{k})),$$
$$z^{k+1}=\operatorname{prox}_{z^{k}}(\gamma F(z^{k+1/2})).\tag{10}$$

This yields the following rate-of-convergence result.

Theorem 3.

Let $F$ satisfy Assumptions 1 and 2 (with $\mu=0$), and let

$$\hat{z}^{k}=\frac{1}{k}\sum_{i=1}^{k}z^{i+1/2},\tag{11}$$

where the $z^{i+1/2}$ are generated by algorithm (10) with $\gamma=1/(\sqrt{2}L)$. Then, after $k$ iterations,

$$\operatorname{Gap}_{\mathrm{VI}}(\hat{z}^{k})=\mathcal{O}\biggl(\frac{LD_{\mathcal{Z},V}^{2}}{k}\biggr).\tag{12}$$
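A minimal sketch of Mirror-Prox (10) with the entropy prox on a small matrix game (the game matrix, step size and iteration count are illustrative choices); the gap (6) of the averaged iterates decays, in line with Theorem 3:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
A /= np.abs(A).max()   # normalize so that the relevant constant L = 1

# Matrix game min_x max_y <y, A x> over two probability simplices, with
# F(x, y) = (A^T y, -A x); the entropy prox is a multiplicative update.
def prox(x, y):
    w = x * np.exp(-y)
    return w / w.sum()

F = lambda x, y: (A.T @ y, -A @ x)

x = y = np.ones(3) / 3
gamma, K = 0.2, 2000
avg_x, avg_y = np.zeros(3), np.zeros(3)
for _ in range(K):
    gx, gy = F(x, y)
    xh, yh = prox(x, gamma * gx), prox(y, gamma * gy)   # extrapolation
    gx, gy = F(xh, yh)
    x, y = prox(x, gamma * gx), prox(y, gamma * gy)     # update
    avg_x += xh / K
    avg_y += yh / K

# Saddle point gap (6) for the bilinear game, computed over pure strategies.
gap = float(np.max(A @ avg_x) - np.min(A.T @ avg_y))
assert 0.0 <= gap < 0.1
```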

Numerous extensions of these original versions of iterative methods for solving variational inequalities were published later. One can highlight Tseng's Forward-Backward Splitting [120], Nesterov's Dual Extrapolation [95], and Malitsky and Tam's Forward-Reflected-Backward method [83]. All these methods have convergence guarantees of the form (12). It turns out that this rate is optimal [101].

4 Stochastic methods: Different setups and assumptions

In this section, we move from deterministic to stochastic methods, i.e., we consider problem (1) with an operator of the form

$$F(z)=\mathbb{E}_{\xi\sim\mathcal{D}}[F_{\xi}(z)],\tag{13}$$

where $\xi$ is a random variable, $\mathcal{D}$ is some (typically unknown) probability distribution and $F_{\xi}\colon\mathcal{Z}\to\mathbb{R}^{d}$ is a stochastic operator. In this setup, calculating the value of the full operator $F$ is computationally expensive or even intractable. Therefore, one has to work mainly with the stochastic realizations $F_{\xi}$.
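A finite-sum style example (hypothetical instance) of the setup (13): sampling one term uniformly gives an unbiased estimator of the full operator:

```python
import numpy as np

rng = np.random.default_rng(3)
# Finite-sum instance of (13): F(z) = (1/n) sum_i (z - a_i), and the
# stochastic oracle F_xi(z) = z - a_xi samples one term uniformly at random.
a = rng.standard_normal((1000, 2))
F_full = lambda z: z - a.mean(axis=0)
F_xi = lambda z: z - a[rng.integers(len(a))]

# Unbiasedness: the average of many single-sample calls approaches F(z).
z0 = np.array([1.0, 1.0])
est = np.mean([F_xi(z0) for _ in range(20000)], axis=0)
assert np.linalg.norm(est - F_full(z0)) < 0.05
```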

4.1 General case

The stochastic formulation (13) of problem (1) was first considered in [60], where the authors proposed a natural stochastic generalization of the Extragradient method (more precisely, of the Mirror-Prox method):

$$z^{k+1/2}=\operatorname{prox}_{z^{k}}(\gamma F_{\xi^{k}}(z^{k})),$$
$$z^{k+1}=\operatorname{prox}_{z^{k}}(\gamma F_{\xi^{k+1/2}}(z^{k+1/2})).\tag{14}$$

Here it is important to note that the variables $\xi^{k}$ and $\xi^{k+1/2}$ are independent and $F_{\xi}(z)$ is an unbiased estimator of $F(z)$. Moreover, $F_{\xi}(z)$ is assumed to satisfy the following condition.

Assumption 3 (Bounded variance). The unbiased operator $F_{\xi}$ has uniformly bounded variance, i.e., for all $\xi\sim\mathcal{D}$ and $u\in\mathcal{Z}$, we have $\mathbb{E}\lVert F_{\xi}(u)-F(u)\rVert_{*}^{2}\leq\sigma^{2}$.

Under this assumption, the following result was established in [60].

Theorem 4.

Let $F_{\xi}$ satisfy Assumptions 1, 2 (with $\mu=0$) and 3, and let $\hat{z}^{k}$ be defined as in (11), where the $z^{i+1/2}$ are generated by algorithm (14) with $\gamma=\min\bigl\{\frac{1}{\sqrt{3}L},\,D_{\mathcal{Z},V}\sqrt{\frac{1}{7k\sigma^{2}}}\bigr\}$. Then, after $k$ iterations, one can guarantee that

$$\mathbb{E}[\operatorname{Gap}_{\mathrm{VI}}(\hat{z}^{k})]=\mathcal{O}\biggl(\frac{LD_{\mathcal{Z},V}^{2}}{k}+D_{\mathcal{Z},V}\sqrt{\frac{\sigma^{2}}{k}}\biggr).\tag{15}$$

In [17], the authors carried out an analysis of algorithm (14) for strongly monotone VIs in the Euclidean case. In particular, under Assumptions 1, 2 and 3 one can guarantee that after $k$ iterations of method (14) one has (here and below we omit numerical constants in the exponential factor)

$$\mathbb{E}\lVert z^{k}-z^{*}\rVert_{2}^{2}=\tilde{\mathcal{O}}\biggl(R_{0}^{2}\exp\biggl(-\frac{\mu k}{L}\biggr)+\frac{\sigma^{2}}{\mu^{2}k}\biggr).\tag{16}$$

Also in [17], the authors obtained lower complexity bounds for solving VIs satisfying Assumptions 1, 2 and 3 with stochastic methods. It turns out that the conclusions of Theorem 4 in the monotone case and estimate (16) are optimal and match the lower bounds up to numerical constants.
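A sketch of the stochastic Extragradient (14) in the Euclidean strongly monotone case (toy operator $F(z)=z$ with additive Gaussian noise, so $\mu=L=1$; all constants here are illustrative): with a constant step the iterates first contract and then fluctuate around a small noise-dominated floor, qualitatively matching the two terms of (16):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.1
# Toy strongly monotone operator F(z) = z (so mu = L = 1) observed through
# an unbiased oracle with additive Gaussian noise of variance sigma^2.
F_stoch = lambda z: z + sigma * rng.standard_normal(z.shape)

z = np.full(2, 10.0)     # R_0^2 = 200
gamma = 0.05
for _ in range(20000):
    z_half = z - gamma * F_stoch(z)   # extrapolation with one sample
    z = z - gamma * F_stoch(z_half)   # update with an independent sample
err = float(np.linalg.norm(z) ** 2)

assert err < 1.0   # far below R_0^2: linear decrease down to a noise floor
```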

Optimistic-like (or single-call) methods have also been investigated in the stochastic setting. The work [41] applies the following update scheme:

$$z^{k+1/2}=P_{\mathcal{Z}}(z^{k}-\gamma F_{\xi^{k-1/2}}(z^{k-1/2})),$$
$$z^{k+1}=P_{\mathcal{Z}}(z^{k}-\gamma F_{\xi^{k+1/2}}(z^{k+1/2})).\tag{17}$$

For this method, in the monotone Euclidean case, the authors proved an estimate similar to (15). In the strongly monotone case, method (17) was investigated in [54], but the estimates obtained there do not meet the lower bounds. Optimal estimates for this scheme were obtained later in [14].

The work [66 G. Kotsalis, G. Lan and T. Li, Simple and optimal methods for stochastic variational inequalities. I: Operator extrapolation. SIAM J. Optim. 32, 2041–2073 (2022) ] deals with a slightly different single-call approach in the non-Euclidean case:

\[
z^{k+1}=\operatorname{prox}_{z^{k}}\bigl(\gamma_{k}F_{\xi^{k}}(z^{k})+\gamma_{k}\alpha_{k}[F_{\xi^{k}}(z^{k})-F_{\xi^{k-1}}(z^{k-1})]\bigr).
\tag{18}
\]

This update rule is a modification of the Forward-Reflected-Backward approach: here $\alpha_{k}$ is a free parameter, while in [83 Y. Malitsky and M. K. Tam, A forward-backward splitting method for monotone inclusions without cocoercivity. SIAM J. Optim. 30, 1451–1472 (2020) ], $\alpha_{k}\equiv 1$. The analysis of method (18) gives optimal estimates in both the strongly monotone and monotone cases.
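In the Euclidean unconstrained case the prox step reduces to a plain subtraction, and the scheme can be sketched in a few lines (our own deterministic illustration, not the authors' code; the operator $F(x,y)=(x+y,\,y-x)$ and all parameter values are our choices):

```python
# Euclidean sketch of update (18) with Z = R^2, where the prox step
# becomes z^{k+1} = z^k - gamma*(F(z^k) + alpha*(F(z^k) - F(z^{k-1}))).
# With alpha = 1 this is the Forward-Reflected-Backward method.

def F(z):
    # strongly monotone linear operator F(x, y) = (x + y, y - x), z* = (0, 0)
    x, y = z
    return (x + y, y - x)

def operator_extrapolation(z0, gamma, alpha, iters):
    z, g_old = z0, F(z0)
    for _ in range(iters):
        g = F(z)
        z = (z[0] - gamma * (g[0] + alpha * (g[0] - g_old[0])),
             z[1] - gamma * (g[1] + alpha * (g[1] - g_old[1])))
        g_old = g
    return z

z_final = operator_extrapolation((5.0, -3.0), gamma=0.1, alpha=1.0, iters=500)
print(z_final)  # converges to the unique solution z* = (0, 0)
```

Only one operator evaluation per iteration is needed, since $F(z^{k-1})$ is remembered from the previous step.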

The theoretical results and guarantees discussed above rely significantly on the bounded variance assumption (Assumption 3). This assumption is quite restrictive (especially when the domain is unbounded) and does not hold for many popular machine learning problems. Moreover, one can even design a strongly monotone variational inequality on an unbounded domain such that method (14) diverges exponentially fast [26 T. Chavdarova, G. Gidel, F. Fleuret and S. Lacoste-Julien, Reducing noise in GAN training with variance reduced extragradient. Adv. Neural Inf. Process. Syst. 32, 393–403 (2019) ]. The authors of [55 Y.-G. Hsieh, F. Iutzeler, J. Malick and P. Mertikopoulos, Explore aggressively, update conservatively: Stochastic extragradient methods with variable stepsize scaling. Adv. Neural Inf. Process. Syst. 33, 16223–16234 (2020) , 48 E. Gorbunov, H. Berard, G. Gidel and N. Loizou, Stochastic extragradient: General analysis and improved rates. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 7865–7901 (2022) ] consider a relaxed form of the bounded variance condition and assume that $\mathbb{E}\lVert F_{\xi}(u)-F(u)\rVert^{2}_{2}\leq\sigma^{2}+\delta\lVert u-z^{*}\rVert_{2}^{2}$ with $\delta\geq 0$ in the Euclidean case. Under this condition and Assumptions 1 and 2, it is proved in [48 E. Gorbunov, H. Berard, G. Gidel and N. Loizou, Stochastic extragradient: General analysis and improved rates. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 7865–7901 (2022) ] that after $k$ iterations of algorithm (14) (when $\mathcal{Z}=\mathbb{R}^{d}$) it holds that

\[
\mathbb{E}\lVert z^{k}-z^{*}\rVert^{2}_{2}=\mathcal{O}\biggl(\kappa R^{2}_{0}\exp\biggl(-\frac{k}{\kappa}\biggr)+\frac{\sigma^{2}}{\mu^{2}k}\biggr),
\tag{19}
\]

where $\kappa=\max\bigl\{\frac{\delta}{\mu^{2}};\frac{L+\sqrt{\delta}}{\mu}\bigr\}$. The same assumption on the stochastic realizations was considered in [67 G. Kotsalis, G. Lan and T. Li, Simple and optimal methods for stochastic variational inequalities. II: Markovian noise and policy evaluation in reinforcement learning. SIAM J. Optim. 32, 1120–1155 (2022) ], where method (18) was used, yielding the estimate

\[
\mathbb{E}\lVert z^{k}-z^{*}\rVert^{2}_{2}=\mathcal{O}\biggl(R^{2}_{0}\exp\biggl(-\frac{\mu k}{L}\biggr)+\frac{\sigma^{2}+\delta^{2}D^{2}_{\mathcal{Z}}}{\mu^{2}k}\biggr).
\tag{20}
\]

Estimates (19) and (20) are competitive: the former is superior in terms of the stochastic (second) term, while the latter is superior in terms of the deterministic (first) term. However, neither of these results deals completely with the issue of bounded noise, because the condition considered above is not fully general. The key to avoiding the bounded variance assumption on $F_{\xi}$ lies in the way stochasticity is generated in method (14). Method (14) is sometimes called Independent Samples Stochastic Extragradient (I-SEG). To address the bounded variance issue, K. Mishchenko et al. [86 K. Mishchenko, D. Kovalev, E. Shulgin, P. Richtárik and Y. Malitsky, Revisiting stochastic extragradient. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 4573–4582 (2020) ] proposed another stochastic modification of the Extragradient algorithm, called Same Sample Stochastic Extragradient (S-SEG):

\[
\begin{aligned}
z^{k+1/2}&=z^{k}-\gamma F_{\xi^{k}}(z^{k}),\\
z^{k+1}&=z^{k}-\gamma F_{\xi^{k}}(z^{k+1/2}).
\end{aligned}
\]

For simplicity, we present the above method for the case when $\mathcal{Z}=\mathbb{R}^{d}$ (so that $F(z^{*})=0$), and refer the reader to [86 K. Mishchenko, D. Kovalev, E. Shulgin, P. Richtárik and Y. Malitsky, Revisiting stochastic extragradient. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 4573–4582 (2020) ] for the more general case of regularized VIs. In contrast to I-SEG, S-SEG uses the same sample $\xi^{k}$ for both steps of iteration $k$. Although such a strategy cannot be implemented in some scenarios (streaming oracle), it can be applied to finite-sum problems, which have been gaining increasing attention in recent years. Moreover, S-SEG relies significantly on the following assumption.
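The same-sample structure can be seen in a minimal scalar sketch (our own toy example, not code from the cited work): each component $F_{i}(z)=\mu z+b_{i}$ is $\mu$-strongly monotone and $\mu$-Lipschitz, the $b_{i}$ sum to zero so that $F(z)=\mu z$ and $z^{*}=0$, and one sample is drawn per iteration and reused in both steps.

```python
import random

# Scalar sketch (our own) of S-SEG: at iteration k, one sample xi^k is
# drawn and the SAME operator F_{xi^k}(z) = mu*z + b is used for both
# the extrapolation and the update step.

def s_seg(bs, mu, gamma, iters, seed=0):
    rng = random.Random(seed)
    z = 5.0
    for _ in range(iters):
        b = rng.choice(bs)                  # draw xi^k once per iteration
        z_half = z - gamma * (mu * z + b)   # extrapolation with F_{xi^k}
        z = z - gamma * (mu * z_half + b)   # update with the same F_{xi^k}
    return z

# the b_i sum to zero, so F(z) = mu*z and the solution is z* = 0
z_final = s_seg(bs=[-1.0, 0.5, 0.5], mu=1.0, gamma=0.05, iters=3000)
print(z_final)  # ends up in a small neighbourhood of z* = 0
```

Replacing the second oracle call by a call with a fresh sample would turn this into (the unconstrained form of) I-SEG.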

Assumption 4. The operator $F_{\xi}(z)$ is $L$-Lipschitz and $\mu$-strongly monotone almost surely in $\xi$, i.e., $\lVert F_{\xi}(z)-F_{\xi}(z^{\prime})\rVert_{2}\leq L\lVert z-z^{\prime}\rVert_{2}$ and $\langle F_{\xi}(z)-F_{\xi}(z^{\prime}),z-z^{\prime}\rangle\geq\mu\lVert z-z^{\prime}\rVert_{2}^{2}$ for all $z,z^{\prime}\in\mathbb{R}^{d}$, almost surely in $\xi$.

The evident difference between the I-SEG and S-SEG setups can be explained through the connection between the Extragradient and the Proximal Point (PP) methods [84 B. Martinet, Régularisation d’inéquations variationnelles par approximations successives. Rev. Française Informat. Recherche Opérationnelle 4, 154–158 (1970) , 107 R. T. Rockafellar, Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14, 877–898 (1976) ]. In the rest of this subsection we assume that $\mathcal{Z}=\mathbb{R}^{d}$ (so that $F(z^{*})=0$). In this setup, PP has the update rule

\[
z^{k+1}=z^{k}-\gamma F(z^{k+1}).
\]

The method converges for any monotone operator $F$ and any $\gamma>0$. However, the update rule of PP is implicit, and in many situations it cannot be computed efficiently. The Extragradient method can be seen as a natural approximation of PP that replaces $z^{k+1}$ in the right-hand side by one gradient step from $z^{k}$:

\[
z^{k+1}=z^{k}-\gamma F(z^{k}-\gamma F(z^{k})).
\]
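The contrast between the implicit and explicit updates is easy to observe numerically. The following sketch (our own example) uses the monotone, non-strongly-monotone rotation operator $F(x,y)=(y,-x)$: the explicit step $z-\gamma F(z)$ increases the norm at every iteration, while the implicit PP step, which for this linear operator can be solved in closed form, contracts it for any $\gamma>0$.

```python
# Our own illustration: for the rotation F(x, y) = (y, -x), the explicit
# step diverges while the implicit PP step contracts.

def explicit_step(z, gamma):
    # z - gamma*F(z); multiplies the norm by sqrt(1 + gamma^2)
    x, y = z
    return (x - gamma * y, y + gamma * x)

def pp_step(z, gamma):
    # solve z_next = z - gamma*F(z_next) in closed form:
    # x' = x - gamma*y', y' = y + gamma*x'  =>  divide by 1 + gamma^2
    x, y = z
    d = 1.0 + gamma * gamma
    return ((x - gamma * y) / d, (y + gamma * x) / d)

def norm(z):
    return (z[0] ** 2 + z[1] ** 2) ** 0.5

z_ex = z_pp = (1.0, 0.0)
for _ in range(100):
    z_ex = explicit_step(z_ex, 0.3)
    z_pp = pp_step(z_pp, 0.3)
print(norm(z_ex), norm(z_pp))  # explicit norm grows, PP norm shrinks
```

This is exactly the behaviour referred to below: plain gradient steps are non-convergent for general monotone Lipschitz VIs, which is what makes a good approximation of PP valuable.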

In addition, when $F$ is $L$-Lipschitz, one can estimate how good this approximation is. Consider $z^{k+1}=z^{k}-\gamma F(z^{k}-\gamma F(z^{k}))$ (the Extragradient step) and $\tilde{z}^{k+1}=z^{k}-\gamma F(\tilde{z}^{k+1})$ (the PP step). Then $\lVert z^{k+1}-\tilde{z}^{k+1}\rVert_{2}$ can be estimated as follows [86 K. Mishchenko, D. Kovalev, E. Shulgin, P. Richtárik and Y. Malitsky, Revisiting stochastic extragradient. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 4573–4582 (2020) ]:

\[
\begin{split}
\lVert z^{k+1}-\tilde{z}^{k+1}\rVert_{2}&=\gamma\lVert F(z^{k}-\gamma F(z^{k}))-F(\tilde{z}^{k+1})\rVert_{2}\\
&\leq\gamma L\lVert z^{k}-\gamma F(z^{k})-\tilde{z}^{k+1}\rVert_{2}=\gamma^{2}L\lVert F(z^{k})-F(\tilde{z}^{k+1})\rVert_{2}\\
&\leq\gamma^{2}L^{2}\lVert z^{k}-\tilde{z}^{k+1}\rVert_{2}=\gamma^{3}L^{2}\lVert F(\tilde{z}^{k+1})\rVert_{2}\\
&\leq\gamma^{3}L^{3}\lVert\tilde{z}^{k+1}-z^{*}\rVert_{2}.
\end{split}
\]

That is, the difference between the Extragradient and PP steps is of order $\mathcal{O}(\gamma^{3})$ rather than $\mathcal{O}(\gamma^{2})$. Since the latter corresponds to the difference between PP and the simple gradient step (7), the Extragradient method approximates PP better than gradient steps do, and gradient steps are known to be non-convergent for general monotone Lipschitz variational inequalities. This approximation property of the Extragradient method is crucial for its convergence and, as the above derivation shows, the approximation argument relies significantly on the Lipschitzness of the operator $F$.
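The cubic dependence can be checked numerically. In the sketch below (our own sanity check; the linear operator $F(x,y)=(x+y,\,y-x)$ and the starting point are our choices), we compare one Extragradient step with the exact PP step from the same point: halving $\gamma$ should shrink the gap roughly by a factor of $2^{3}=8$.

```python
# Our own numerical check of the O(gamma^3) gap between one
# Extragradient step and one exact Proximal Point step.

def F(z):
    # linear operator F(z) = A z with A = [[1, 1], [-1, 1]]
    x, y = z
    return (x + y, y - x)

def eg_step(z, gamma):
    g = F(z)
    z_half = (z[0] - gamma * g[0], z[1] - gamma * g[1])
    g2 = F(z_half)
    return (z[0] - gamma * g2[0], z[1] - gamma * g2[1])

def pp_exact(z, gamma):
    # solve (I + gamma*A) z_next = z via the explicit 2x2 inverse
    x, y = z
    a = 1.0 + gamma
    det = a * a + gamma * gamma
    return ((a * x - gamma * y) / det, (gamma * x + a * y) / det)

def gap(gamma, z=(1.0, 2.0)):
    e, p = eg_step(z, gamma), pp_exact(z, gamma)
    return ((e[0] - p[0]) ** 2 + (e[1] - p[1]) ** 2) ** 0.5

ratio = gap(0.01) / gap(0.005)
print(ratio)  # close to 8, confirming the cubic dependence on gamma
```

Indeed, for a linear $F(z)=Az$ one has $(I+\gamma A)^{-1}=I-\gamma A+\gamma^{2}A^{2}-\gamma^{3}A^{3}+\dots$, while the Extragradient step equals $(I-\gamma A+\gamma^{2}A^{2})z$, so the leading term of the gap is $\gamma^{3}A^{3}z$.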

Let us go back to the differences between I-SEG and S-SEG. In S-SEG, the $k$-th iteration can be regarded as a single Extragradient step for the operator $F_{\xi^{k}}(z)$. Therefore, Lipschitzness and monotonicity of $F_{\xi^{k}}(z)$ (Assumption 4) are important for the analysis of S-SEG. In contrast, I-SEG uses different operators for the extrapolation and update steps. In this case, there is no benefit from the Lipschitzness/monotonicity of the individual operators $F_{\xi}(z)$. Therefore, the analysis of I-SEG naturally relies on the Lipschitzness and monotonicity of $F(z)$, as well as on the closeness (on average) of $F_{\xi}(z)$ and $F(z)$ (Assumption 3).

The convergence of I-SEG was discussed earlier in this section. Regarding S-SEG, one has the following result [86 K. Mishchenko, D. Kovalev, E. Shulgin, P. Richtárik and Y. Malitsky, Revisiting stochastic extragradient. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 4573–4582 (2020) ].

Theorem 5.

Let Assumption 4 hold. Then there exists a choice of step size $\gamma$ (see [48 E. Gorbunov, H. Berard, G. Gidel and N. Loizou, Stochastic extragradient: General analysis and improved rates. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 7865–7901 (2022) ]) such that the output of S-SEG after $k$ iterations satisfies

\[
\mathbb{E}\lVert z^{k}-z^{*}\rVert_{2}^{2}=\mathcal{O}\biggl(\frac{LR^{2}_{0}}{\mu}\exp\biggl(-\frac{\mu k}{L}\biggr)+\frac{\sigma_{*}^{2}}{\mu^{2}k}\biggr),
\]

where $\sigma_{*}^{2}=\mathbb{E}\lVert F_{\xi}(z^{*})\rVert_{2}^{2}$.

This rate is similar to the one known for I-SEG, with the following differences. First, instead of the uniform bound $\sigma^{2}$ on the variance, the rate depends on $\sigma_{*}^{2}$, the variance of $F_{\xi}$ measured at the solution. In many cases $\sigma^{2}=\infty$, while $\sigma_{*}^{2}$ is finite. From this perspective, S-SEG enjoys a better rate of convergence than I-SEG. However, this comes at a price: while the rate of I-SEG depends on the Lipschitz and strong-monotonicity constants of $F$, the rate of S-SEG depends on the worst such constants of the $F_{\xi}$, which can be much worse than those of $F$. In particular, consider the finite-sum setup with uniform sampling of $\xi$: $F(x)=\frac{1}{n}\sum_{i=1}^{n}F_{i}(x)$