Download article (PDF)

This article is published *open access.*

# Smooth monotone stochastic variational inequalities and saddle point problems: A survey

This paper is a survey of methods for solving smooth, (strongly) monotone stochastic variational inequalities.
To begin with, we present the deterministic foundation from which the stochastic methods eventually evolved.
Then we review methods for the general stochastic formulation, and look at the finite-sum setup.
The last parts of the paper are devoted to various recent (not necessarily stochastic) advances in algorithms for variational inequalities.

## 1 Introduction

In its long, more than half-century history of study (going back to the classical article [113
G. Stampacchia, Formes bilinéaires coercitives sur les ensembles convexes.
C. R. Acad. Sci. Paris 258, 4413–4416 (1964)
]), variational inequalities have become one of the most popular and universal optimization formulations.
Variational inequalities are used in various areas of applied mathematics.
Here we can highlight both classical examples from game theory, economics, operator theory, convex analysis [6
K. J. Arrow, L. Hurwicz and H. Uzawa, Studies in linear and non-linear programming.
Stanford Mathematical Studies in the Social Sciences, II, Stanford University Press, Stanford (1958)
, 19
F. E. Browder, Existence and approximation of solutions of nonlinear variational inequalities.
Proc. Nat. Acad. Sci. U.S.A. 56, 1080–1086 (1966)
, 106
R. T. Rockafellar, Convex functions, monotone operators and variational inequalities.
In Theory and Applications of Monotone Operators (Proc. NATO Advanced Study Inst., Venice, 1968), Edizioni “Oderisi”, Gubbio, 35–65 (1969)
, 110
M. Sibony, Méthodes itératives pour les équations et inéquations aux dérivées partielles non linéaires de type monotone.
Calcolo 7, 65–183 (1970)
, 113
G. Stampacchia, Formes bilinéaires coercitives sur les ensembles convexes.
C. R. Acad. Sci. Paris 258, 4413–4416 (1964)
], as well as newer and even more recent applications in optimization and machine learning: non-smooth optimization [93
Y. Nesterov, Smooth minimization of non-smooth functions.
Math. Program. 103, 127–152 (2005)
], unsupervised learning [9
F. Bach, J. Mairal and J. Ponce, Convex sparse matrix factorizations, preprint, arXiv:0812.1869 (2008)
, 36
E. Esser, X. Zhang and T. F. Chan, A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science.
SIAM J. Imaging Sci. 3, 1015–1046 (2010)
, 22
A. Chambolle and T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging.
J. Math. Imaging Vision 40, 120–145 (2011)
], robust/adversarial optimization [11
A. Ben-Tal, L. El Ghaoui and A. Nemirovski, Robust optimization.
Princeton Ser. Appl. Math., Princeton University Press, Princeton (2009)
], GANs [47
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio, Generative adversarial networks.
Commun. ACM 63, 139–144 (2020)
] and reinforcement learning [100
S. Omidshafiei, J. Pazis, C. Amato, J. P. How and J. Vian, Deep decentralized multi-task multi-agent reinforcement learning under partial observability.
In Proceedings of the 34th International Conference on Machine Learning (ICML), Proc. Mach. Learn. Res. 70, 2681–2690 (2017)
, 57
Y. Jin and A. Sidford, Efficiently solving MDPs with stochastic mirror descent.
In Proceedings of the 37th International Conference on Machine Learning (ICML), Proc. Mach. Learn. Res. 119, 4890–4900 (2020)
].
Modern times present new challenges to the community.
The increase in scale of problems and the need to speed up solution processes have sparked a huge interest in *stochastic* formulations of applied tasks, including variational inequalities.
This paper surveys stochastic methods for solving variational inequalities.

## Structure of the paper

In Section 2, we give a formal statement of the variational inequality problem, basic examples, and main assumptions. Section 3 deals with deterministic methods, from which stochastic methods have evolved. Section 4 covers a variety of stochastic methods. Section 5 is devoted to the recent advances in (not necessarily stochastic) variational inequalities and saddle point problems.

## 2 Problem: Setting and assumptions

**Notation****.** We use $\langle x,y\rangle≔\sum_{i=1}^{d}x_{i}y_{i}$ to denote the standard inner product of vectors $x,y\in\mathbb{R}^{d}$, where $x_{i}$ is the $i$-th component of $x$ in the standard basis of $\mathbb{R}^{d}$.
It induces the $\ell_{2}$-norm in $\mathbb{R}^{d}$ by $\lVert x\rVert_{2}≔\sqrt{\langle x,x\rangle}$.
We denote the $\ell_{p}$-norm by $\lVert x\rVert_{p}≔\bigl(\sum_{i=1}^{d}\lvert x_{i}\rvert^{p}\bigr)^{1/{p}}$ for $p\in[1,\infty)$, and $\lVert x\rVert_{\infty}≔\max_{1\leq i\leq d}\lvert x_{i}\rvert$ for $p=\infty$.
The dual norm $\lVert\cdot\rVert_{*}$ corresponding to the norm $\lVert\cdot\rVert$ is defined by $\lVert y\rVert_{*}≔\max\{\langle x,y\rangle\mid\lVert x\rVert\leq 1\}$.
The symbol $\mathbb{E}[\cdot]$ stands for the total mathematical expectation.
Finally, we need to introduce the symbols $\mathcal{O}$ and $\Omega$ to enclose numerical constants that do not depend on any parameters of the problem, and the symbols $\tilde{\mathcal{O}}$ and $\tilde{\Omega}$ to enclose numerical constants and logarithmic factors.

We study variational inequalities (VI) of the form

$\textrm{find}\ z^{*}\in\mathcal{Z}\ \textrm{such that}\ \langle F(z^{*}),z-z^{*}\rangle\geq 0\quad\forall z\in\mathcal{Z},$

where $F\colon\mathcal{Z}\to\mathbb{R}^{d}$ is an operator and $\mathcal{Z}\subseteq\mathbb{R}^{d}$ is a convex set.

To emphasize the extensiveness of formulation (1), we give a few examples of variational inequalities arising in applied sciences.

**Example 1** (Minimization)**.** Consider the minimization problem

$\min_{z\in\mathcal{Z}}f(z).$

Let $F(z)≔\nabla f(z)$. Then, if $f$ is convex, one can prove that $z^{*}\in\mathcal{Z}$ is a solution of (1) if and only if $z^{*}\in\mathcal{Z}$ is a solution of problem (2).

**Example 2** (Saddle point problem)**.** Consider the saddle point problem (SPP)

$\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}}g(x,y).$

Suppose that $F(z)≔F(x,y)=[\nabla_{x}g(x,y),-\nabla_{y}g(x,y)]$ and $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$ with $\mathcal{X}\subseteq\mathbb{R}^{d_{x}}$, $\mathcal{Y}\subseteq\mathbb{R}^{d_{y}}$. Then, if $g$ is convex-concave, one can prove that $z^{*}\in\mathcal{Z}$ is a solution of problem (1) if and only if $z^{*}\in\mathcal{Z}$ is a solution of problem (3).

The study of saddle point problems is often associated with variational inequalities.

**Example 3** (Fixed point problem)**.** Consider the fixed point problem

$\textrm{find}\ z^{*}\in\mathbb{R}^{d}\ \textrm{such that}\ T(z^{*})=z^{*},$

where $T\colon\mathbb{R}^{d}\to\mathbb{R}^{d}$ is an operator. If we set $F(z)=z-T(z)$, then one can prove that $z^{*}\in\mathcal{Z}=\mathbb{R}^{d}$ is a solution of problem (1) if and only if $F(z^{*})=0$, i.e., $z^{*}\in\mathbb{R}^{d}$ is a solution of problem (4).

For the operator $F$ from (1) we assume the following.

**Assumption 1** (Lipschitzness)**.** The operator $F$ is $L$-Lipschitz continuous, i.e., for all $u,v\in\mathcal{Z}$, we have $\lVert F(u)-F(v)\rVert_{*}\leq L\lVert u-\nobreak v\rVert$.

In the context of problems (2) and (3), $L$-Lipschitzness of the operator means that the functions $f(z)$ and $g(x,y)$ are $L$-smooth.

**Assumption 2** (Strong monotonicity)**.** The operator $F$ is $\mu$-strongly monotone, i.e., for all $u,v\in\mathcal{Z}$, we have $\langle F(u)-F(v),u-v\rangle\geq\mu\lVert u-v\rVert^{2}_{2}$.
If $\mu=0$, then the operator $F$ is monotone.

In the context of problems (2) and (3), strong monotonicity of $F$ means strong convexity of $f(z)$ and strong convexity-strong concavity of $g(x,y)$. In this paper we first focus on the strongly monotone and monotone cases. But there are also various assumptions relaxing monotonicity and strong monotonicity (e.g., see [55 Y.-G. Hsieh, F. Iutzeler, J. Malick and P. Mertikopoulos, Explore aggressively, update conservatively: Stochastic extragradient methods with variable stepsize scaling. Adv. Neural Inf. Process. Syst. 33, 16223–16234 (2020) ] and references therein).

We note that Assumptions 1 and 2 are sufficient for the existence of a solution to problem (1) (see, e.g., [37 F. Facchinei and J.-S. Pang, Finite-dimensional variational inequalities and complementarity problems. Springer Series in Operations Research, Springer, New York (2003) ]).

Since we work on the set $\mathcal{Z}$, it is useful to introduce the Euclidean projection onto $\mathcal{Z}$,

$P_{\mathcal{Z}}(z)=\arg\min_{v\in\mathcal{Z}}\lVert z-v\rVert_{2}.$

To characterize the convergence of the methods for monotone variational inequalities we introduce the gap function,

$\operatorname{Gap}_{\mathrm{VI}}(z)≔\sup_{u\in\mathcal{\mathcal{Z}}}[\langle F(u),z-u\rangle].$

Such a gap function, regarded as a convergence criterion, is more suitable for the following variational inequality problem:

$\textrm{find}\ z^{*}\in\mathcal{Z}\ \textrm{such that}\ \langle F(z),z^{*}-z\rangle\leq 0\quad\textrm{for}\ z\in\mathcal{Z}.$

Such a solution is also called weak or Minty (whereas the solution of (1) is called strong or Stampacchia). However, in view of Assumption 1, $F$ is single-valued and continuous on $\mathcal{Z}$, meaning that actually the two indicated formulations of the variational inequality problem are equivalent [37 F. Facchinei and J.-S. Pang, Finite-dimensional variational inequalities and complementarity problems. Springer Series in Operations Research, Springer, New York (2003) ].

For the minimization problem (2), the functional distance to the solution, i.e., the difference $f(z)-f(z^{*})$, can be used instead of (5). For saddle point problems (3), a slightly different gap function is used, namely,

$\operatorname{Gap}_{\mathrm{SPP}}(z)≔\operatorname{gap}(x,y)=\max_{y^{\prime}\in\mathcal{Y}}f(x,y^{\prime})-\min_{x^{\prime}\in\mathcal{X}}f(x^{\prime},y).$

For both functions (5) and (6) it is crucial that the feasible set is bounded (in fact it is not necessary to take the whole set $\mathcal{Z}$, which can be unbounded – it suffices to take a bounded convex subset $\mathcal{C}$ which contains some solution, see [95 Y. Nesterov, Dual extrapolation and its applications to solving variational inequalities and related problems. Math. Program. 109, 319–344 (2007) ]). Therefore it is necessary to define a distance on the set $\mathcal{Z}$. Since this survey covers methods not only in the Euclidean setup, let us introduce a more general notion of distance.

**Definition 1** (Bregman divergence)**.** Let $\nu(z)$ be a function that is $1$-strongly convex w.r.t. the norm $\lVert\cdot\rVert$ and differentiable on $\mathcal{Z}$.
Then for any two points $z,z^{\prime}\in\mathcal{Z}$ the Bregman divergence (or Bregman distance) $V(z,z^{\prime})$ associated with $\nu(z)$ is defined as

$V(z,z^{\prime})≔\nu(z^{\prime})-\nu(z)-\langle\nabla\nu(z),z^{\prime}-z\rangle.$

We denote the Bregman diameter of the set $\mathcal{Z}$ w.r.t. the divergence $V(z,z^{\prime})$ as $D_{\mathcal{Z},V}≔\max\{\sqrt{2V(z,z^{\prime})}\mid z,z^{\prime}\in\mathcal{Z}\}$. In the Euclidean case, we simply write $D_{\mathcal{Z}}$ instead of $D_{\mathcal{Z},V}$. Using the definition of $V$, we introduce the so-called proximal operator as follows:

$\operatorname{prox}_{x}(y)=\arg\min_{z\in\mathcal{Z}}\{\langle y,z\rangle+V(z,x)\}.$

## 3 Deterministic foundation: Extragradient and other methods

The first and the simplest method for solving the variational inequality (1) is the iterative scheme (also known as the Gradient method)

$z^{k+1}=P_{\mathcal{Z}}(z^{k}-\gamma F(z^{k})),$

where $\gamma>0$ is a step size. Note that using the proximal operator associated with the Euclidean Bregman divergence this method can be rewritten in the form

$z^{k+1}=\operatorname{prox}_{z^{k}}(\gamma F(z^{k})).$

The basic result asserts the convergence of the method to the unique solution of (1) for strongly monotones and $L$-Lipschitz operators $F$; it was obtained in the papers [19 F. E. Browder, Existence and approximation of solutions of nonlinear variational inequalities. Proc. Nat. Acad. Sci. U.S.A. 56, 1080–1086 (1966) , 106 R. T. Rockafellar, Convex functions, monotone operators and variational inequalities. In Theory and Applications of Monotone Operators (Proc. NATO Advanced Study Inst., Venice, 1968), Edizioni “Oderisi”, Gubbio, 35–65 (1969) , 110 M. Sibony, Méthodes itératives pour les équations et inéquations aux dérivées partielles non linéaires de type monotone. Calcolo 7, 65–183 (1970) ].

**Theorem 1**

**.**

*If Assumptions 1 and 2 hold and $0<\gamma<2\mu/L^{2}$, then after $k$ iterations method (7) converges to $z^{*}$ with a linear rate:*

$\lVert z^{k}-z^{*}\rVert_{2}^{2}=\mathcal{O}(R_{0}^{2}q^{k}),\quad\textrm{with}\ q=(1-2\gamma\mu+\gamma^{2}L^{2})$

*and $R_{0}$ denotes (here and everywhere in the sequel) the norm $\lVert z^{0}-z^{*}\rVert_{2}$.
For $\gamma=\mu/L^{2}$, we have $q=(1-1/\kappa^{2})$, $\kappa=L/\mu$, thus the upper bound on the number of iterations needed to achieve the $\varepsilon$-solution (i.e., $\lVert z^{k}-z^{*}\rVert_{2}^{2}\leq\nobreak\varepsilon$) is $\mathcal{O}(\kappa^{2}\log(R_{0}^{2}/\varepsilon))$.*

Various extensions of this statement (for the case when $F$ is not Lipschitz, but with linear growth bounds, or when the values of $F$ are corrupted by noise) can be found in [10 A. Bakushinskii and B. Polyak, On the solution of variational inequalities. Sov. Math. Dokl. 15, 1705–1710 (1974) , Theorem 1].

When $F$ is a potential operator (see Example 1) method (7) coincides with the gradient projection algorithm. It converges for strongly monotone $F$. Moreover, the bounds for the admissible step size are less restrictive ($0<\gamma<2/L$) and the relevant complexity estimates are better ($O(\kappa\log(R_{0}^{2}/\varepsilon))$) than in Theorem 1; see [104 B. T. Polyak, Introduction to optimization. Translations Series in Mathematics and Engineering, Optimization Software, Inc., Publications Division, New York (1987) , Theorem 2 in Section 1.4.2].

However, in the general monotone, but not strongly monotone case (for instance, for the convex-concave SPP, Example 2) convergence fails. The original statements on the convergence of Uzawa’s method (a version of (7)) for saddle point problems [6 K. J. Arrow, L. Hurwicz and H. Uzawa, Studies in linear and non-linear programming. Stanford Mathematical Studies in the Social Sciences, II, Stanford University Press, Stanford (1958) ] were wrong; there are numerous well-known examples where method (7) for $F$ corresponding to a bilinear SPP diverges, see, e.g., [104 B. T. Polyak, Introduction to optimization. Translations Series in Mathematics and Engineering, Optimization Software, Inc., Publications Division, New York (1987) , Figure 39].

There have been many other attempts to recover the convergence of gradient-like methods, not for VIs, but for saddle point problems.
One of them is based on the transition to modified Lagrangians when $g(x,y)$ is a Lagrange function, see [45
E. G. Gol’šteĭn, Convergence of the gradient method for finding the saddle points of modified Lagrangian functions.
Èkonom. i Mat. Metody 13, 322–329 (1977)
, 104
B. T. Polyak, Introduction to optimization. Translations Series in Mathematics and Engineering, Optimization Software, Inc., Publications Division, New York (1987)
].
However, we focus on the general VI case.
A possible approach is based on the idea of *regularization*.
Instead of the monotone variational inequality (1) one can deal with a regularized inequality, in which the monotone operator $F$ is replaced by strongly monotone one $F+\varepsilon_{k}T$, where $T(z)$ is a strongly monotone operator and $\varepsilon_{k}>0$ is a regularization parameter.
If we denote by $z^{k}$ the solution of the regularized VI, then one can prove that $z^{k}$ converges to $z^{*}$ as $\varepsilon_{k}\rightarrow 0$ (see [10
A. Bakushinskii and B. Polyak, On the solution of variational inequalities. Sov. Math. Dokl. 15, 1705–1710 (1974)
]).
However, usually the solution $z^{k}$ is not easily available.
To address this problem, an *iterative regularization* technique is proposed in [10
A. Bakushinskii and B. Polyak, On the solution of variational inequalities. Sov. Math. Dokl. 15, 1705–1710 (1974)
], where one step of the basic method (7) is applied for the regularized problem.
Step sizes and regularization parameters can be adjusted to guarantee convergence.

Another technique is based on the Proximal Point Method proposed independently by B. Martinet in [84 B. Martinet, Régularisation d’inéquations variationnelles par approximations successives. Rev. Française Informat. Recherche Opérationnelle 4, 154–158 (1970) ] and by T. Rockafellar in [107 R. T. Rockafellar, Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14, 877–898 (1976) ]. At each iteration this methods requires the solution of the VI with the operator $F+cI$, where $c>0$ and $I$ is the identity operator. This is an implicit method (similar to the regularization method), however there exist numerous implementable versions of Proximal Point. For instance, some methods discussed below can be considered from this point of view.

The breakthrough in methods for solving (non-strongly) monotone variational inequalities was made by Galina Korpelevich [64 G. M. Korpelevič, An extragradient method for finding saddle points and for other problems. Èkonom. i Mat. Metody 12, 747–756 (1976) ]. She exploited the idea of extrapolation for the gradient method. How this works can be explained for the simplest example of a two-dimensional min-max problem with $g(x,y)=xy$ and $\mathcal{Z}=\mathbb{R}^{2}$. It has the unique saddle point $z=0$, and in any point $z^{k}$ the direction $F(z^{k})$ is orthogonal to $z^{k}$; thus, the iteration (7) increases the distance to the saddle point. However, if we perform the step (7) and get the extrapolated point $z^{k+1/2}$, the direction $-F(z^{k+1/2})$ is attracted to the saddle point. Thus, the Extragradient method for solving (1) reads

$\displaystyle z^{k+1/2}=P_{\mathcal{Z}}(z^{k}-\gamma F(z^{k})),$

$\displaystyle z^{k+1}=P_{\mathcal{Z}}(z^{k}-\gamma F(z^{k+1/2})).$

**Theorem 2**

**.**

*Let $F$ satisfy Assumptions 1 and 2 (with $\mu=0$) and let $0<\gamma<1/L$.
Then the sequence of iterates $z^{k}$ generated by the Extragradient method converges to $z^{\star}$.*

For the particular cases of the zero-sum matrix game or the general bilinear problem with $g(x,y)=y^{\top}\mathbf{A}x-b^{\top}x+c^{\top}y$, the method converges linearly, provided that the optimal solution is unique (see [64 G. M. Korpelevič, An extragradient method for finding saddle points and for other problems. Èkonom. i Mat. Metody 12, 747–756 (1976) , Theorem 3]). In this case, the rate of convergence is equal to $\mathcal{O}(\kappa\log(R_{0}^{2}/\varepsilon))$ with $\kappa=\lambda_{\mathrm{max}}(\mathbf{A}\mathbf{A}^{\top})/\lambda_{\mathrm{min}}(\mathbf{A}\mathbf{A}^{\top})$. More general upper bounds for the Extragradient method can be found in [119 P. Tseng, On linear convergence of iterative methods for the variational inequality problem. J. Comput. Appl. Math. 60, 237–252 (1995) ] and in the recent paper [87 A. Mokhtari, A. Ozdaglar and S. Pattathil, A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 1497–1507 (2020) ]. In particular, for the strongly monotone case the estimate $O(\kappa\log(R_{0}^{2}/\varepsilon))$ with $\kappa=L/\mu$ holds true (compare with the much worse bound $O(\kappa^{2}\log(R_{0}^{2}/\varepsilon))$ for the Gradient method). An adaptive version of the Extragradient method (no knowledge of $L$ is required) is proposed in [61 E. N. Khobotov, Modification of the extra-gradient method for solving variational inequalities and certain optimization problems. U.S.S.R. Comput. Math. Math. Phys. 27, 120–127 (1987) ].

Another version of the Extragradient method for finding saddle points is provided in [65 G. M. Korpelevich, Extrapolation gradient methods and their relation to modified Lagrange functions. Èkonom. i Mat. Metody 19, 694–703 (1983) ]. Considering the setup of Example 2, we can exploit just one extrapolating step for the variables $y$:

$\displaystyle y^{k+1/2}=P_{\mathcal{Y}}(y^{k}+\gamma\nabla_{y}g(x^{k},y^{k})),$

$\displaystyle x^{k+1}=P_{\mathcal{X}}(x^{k}-\gamma\nabla_{x}g(x^{k},y^{k+1/2}),$

$\displaystyle y^{k+1}=y^{k}+q(y^{k+1/2}-y^{k}),$

with $0<\gamma<1/(2L)$ and $0<q<1$.
This method converges to the solution and if $g(x,y)$ is linear in $y$, then the rate of convergence is linear.
If we set $q=1$ in method (8), then $y^{k+1}=y^{k+1/2}$ and we get the so-called Alternating Gradient Method (alternating descent-ascent).
In [123
G. Zhang, Y. Wang, L. Lessard and R. B. Grosse, Near-optimal local convergence of alternating gradient descent-ascent for minimax optimization.
In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 7659–7679 (2022)
], it was proved that this method has *local* linear convergence with complexity $O(\kappa\log(R_{0}^{2}/\varepsilon))$, where $\kappa=L/\mu$.

L. Popov [105
L. D. Popov, A modification of the Arrow–Hurwicz method for search of saddle points.
Math. Notes 28, 845–848 (1981)
] proposed a version of extrapolation scheme (sometimes this type of scheme is referred to as *optimistic* or *single-call*):

$\displaystyle z^{k+1/2}=P_{\mathcal{Z}}(z^{k}-\gamma F(z^{k-1/2})),$

$\displaystyle z^{k+1}=P_{\mathcal{Z}}(z^{k}-\gamma F(z^{k+1/2})).$

It requires the single calculation of $F$ at each iteration vs two calculations in the Extragradient method. As shown in [105 L. D. Popov, A modification of the Arrow–Hurwicz method for search of saddle points. Math. Notes 28, 845–848 (1981) ], method (9) converges for $0<\gamma<1/(3L)$. Rates of convergence for this method were derived recently in [41 G. Gidel, H. Berard, G. Vignoud, P. Vincent and S. Lacoste-Julien, A variational inequality perspective on generative adversarial networks, preprint, arXiv:1802.10551 (2018) , 87 A. Mokhtari, A. Ozdaglar and S. Pattathil, A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 1497–1507 (2020) ], i.e., $O(\kappa\log(R_{0}^{2}/\varepsilon))$ with $\kappa=L/\mu$ for the strongly monotone case and $\kappa=\lambda_{\mathrm{max}}(\mathbf{A}\mathbf{A}^{\top})/\lambda_{\mathrm{min}}(\mathbf{A}\mathbf{A}^{\top})$ for the bilinear case. Note that in the general strongly monotone case this estimate is optimal [124 J. Zhang, M. Hong and S. Zhang, On lower iteration complexity bounds for the convex concave saddle point problems. Math. Program. 194, 901–935 (2022) ], but for the bilinear problem the upper bounds available in the literature for both the Extragradient and optimistic methods are not tight [56 A. Ibrahim, W. Azizian, G. Gidel and I. Mitliagkas, Linear lower bounds and conditioning of differentiable games. In International Conference on Machine Learning, Proc. Mach. Learn. Res., 4583–4593 (2020) ]. Meanwhile, optimal estimates $O(\sqrt{\kappa}\log(R_{0}^{2}/\varepsilon))$ with $\kappa=\lambda_{\mathrm{max}}(\mathbf{A}\mathbf{A}^{\top})/\lambda_{\mathrm{min}}(\mathbf{A}\mathbf{A}^{\top})$ can be achieved using approaches from [7 W. Azizian, D. Scieur, I. Mitliagkas, S. Lacoste-Julien and G. Gidel, Accelerating smooth games by manipulating spectral shapes. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 1705–1715 (2020) , 4 M. S. Alkousa, A. V. Gasnikov, D. M. Dvinskikh, D. A. Kovalev and F. S. Stonyakin, Accelerated methods for saddle-point problem. Comput. Math. Math. Phys. 60, 1787–1809 (2020) ].

An extension of the above schemes to an arbitrary proximal setup was obtained in the work of A. Nemirovsky [92 A. Nemirovski, Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15, 229–251 (2004) ]. He proposed the Mirror-Prox method for VIs, exploiting the Bregman divergence:

$\displaystyle z^{k+1/2}=\operatorname{prox}_{z^{k}}(\gamma F(z^{k})),$

$\displaystyle z^{k+1}=\operatorname{prox}_{z^{k}}(\gamma F(z^{k+1/2})).$

This yields the following rate-of-convergence result.

**Theorem 3**

**.**

*Let $F$ satisfy Assumptions 1 and 2 (with $\mu=0$), and let*

$\hat{z}^{k}=\smash[t]{\frac{1}{k}\sum_{i=1}^{k}z^{i+1/2}},$

*where $z^{i+1/2}$ are generated by algorithm (10) with $\gamma=1/(\sqrt{2}L)$.
Then, after $k$ iterations,*

$\operatorname{Gap}_{\mathrm{VI}}(\hat{z}^{k})=\mathcal{O}\biggl(\frac{LD_{\mathcal{Z},V}^{2}}{k}\biggr).$

Numerous extensions of these original versions of iterative methods for solving variational inequalities were published later. One can highlight Tseng’s Forward-Backward Splitting [120 P. Tseng, A modified forward-backward splitting method for maximal monotone mappings. SIAM J. Control Optim. 38, 431–446 (2000) ], Nesterov’s Dual Extrapolation [95 Y. Nesterov, Dual extrapolation and its applications to solving variational inequalities and related problems. Math. Program. 109, 319–344 (2007) ], Malitsky and Tam’s Forward-Reflected-Backward [83 Y. Malitsky and M. K. Tam, A forward-backward splitting method for monotone inclusions without cocoercivity. SIAM J. Optim. 30, 1451–1472 (2020) ]. All methods have convergence guarantees (12). It turns out that this rate is optimal [101 Y. Ouyang and Y. Xu, Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. Math. Program. 185, 1–35 (2021) ].

## 4 Stochastic methods: Different setups and assumptions

In this section, we move from deterministic to stochastic methods, i.e., we consider problem (1) with an operator of the form

$F(z)=\mathbb{E}_{\xi\sim\mathcal{D}}[F_{\xi}(z)],$

where $\xi$ is a random variable, $\mathcal{D}$ is some (typically unknown) probability distribution and $F_{\xi}\colon\mathcal{Z}\to\mathbb{R}^{d}$ is a stochastic operator. In this setup, calculating the value of the full operator $F$ is computationally expensive or even intractable. Therefore, one has to work mainly with stochastic realizations $F_{\xi}$.

### 4.1 General case

The stochastic formulation (13) for problem (1) was first considered by the authors of [60 A. Juditsky, A. Nemirovski and C. Tauvel, Solving variational inequalities with stochastic mirror-prox algorithm. Stoch. Syst. 1, 17–58 (2011) ]. They proposed a natural stochastic generalization of the Extragradient method (more precisely, of the Mirror-Prox methods):

$\displaystyle z^{k+1/2}=\operatorname{prox}_{z^{k}}(\gamma F_{\xi^{k}}(z^{k})),$

$\displaystyle z^{k+1}=\operatorname{prox}_{z^{k}}(\gamma F_{\xi^{k+1/2}}(z^{k+1/2})).$

Here it is important to note that the variables $\xi^{k}$ and $\xi^{k+1/2}$ are independent and $F_{\xi}(z)$ is an unbiased estimator of $F(z)$. Moreover, $F_{\xi}(z)$ is assumed to satisfy the following condition.

**Assumption 3** (Bounded variance)**.** The unbiased operator $F_{\xi}$ has uniformly bounded variance, i.e., for all $\xi\sim\mathcal{D}$ and $u\in\mathcal{Z}$, we have $\mathbb{E}\lVert F_{\xi}(u)-F(u)\rVert^{2}_{*}\leq\sigma^{2}$.

Under this assumption, the following result was established in [60 A. Juditsky, A. Nemirovski and C. Tauvel, Solving variational inequalities with stochastic mirror-prox algorithm. Stoch. Syst. 1, 17–58 (2011) ].

**Theorem 4**

**.**

*Let $F_{\xi}$ satisfy Assumptions 1, 2 (with $\mu=0$) and 3, and let $\hat{z}^{k}$ be defined as in (11), where $z^{i+1/2}$ are generated by algorithm (14) with $\gamma=\min\bigl\{\frac{1}{\sqrt{3}L},D_{\mathcal{Z},V}\sqrt{\frac{1}{7k\sigma^{2}}}\bigr\}$.
Then, after $k$ iterations, one can guarantee that*

$\mathbb{E}[\operatorname{Gap}_{\mathrm{VI}}(\hat{z}^{k})]=\mathcal{O}\biggl(\frac{LD^{2}_{\mathcal{Z},V}}{k}+D_{\mathcal{Z},V}\sqrt{\frac{\sigma^{2}}{k}}\biggr).$

In [17 A. Beznosikov, V. Samokhin and A. Gasnikov, Distributed saddle-point problems: Lower bounds, optimal and robust algorithms, preprint arXiv:2010.13112 (2020) ], the authors carried out an analysis of algorithm (14) for strongly monotone VIs in the Euclidean case. In particular, under Assumptions 1, 2 and 3 one can guarantee that after $k$ iterations of method (14) one has that (here and below we omit numerical constants in the exponential multiplier)

$\mathbb{E}\lVert z^{k}-z^{*}\rVert^{2}_{2}=\tilde{\mathcal{O}}\biggl(R_{0}^{2}\exp\biggl(-\frac{\mu k}{L}\biggr)+\frac{\sigma^{2}}{\mu^{2}k}\biggr).$

Also in [17 A. Beznosikov, V. Samokhin and A. Gasnikov, Distributed saddle-point problems: Lower bounds, optimal and robust algorithms, preprint arXiv:2010.13112 (2020) ], the authors obtained lower complexity bounds for solving VIs satisfying Assumptions 1, 2 and 3 with stochastic methods. It turns out that the conclusions of Theorem 4 in the monotone case and estimate (16) are optimal and meet lower bounds up to numerical constants.

Optimistic-like (or single-call) methods were also investigated in the stochastic setting. The work [41 G. Gidel, H. Berard, G. Vignoud, P. Vincent and S. Lacoste-Julien, A variational inequality perspective on generative adversarial networks, preprint, arXiv:1802.10551 (2018) ] applies the following update scheme:

$\displaystyle z^{k+1/2}=P_{\mathcal{Z}}(z^{k}-\gamma F_{\xi^{k-1/2}}(z^{k-1/2})),$

$\displaystyle z^{k+1}=P_{\mathcal{Z}}(z^{k}-\gamma F_{\xi^{k+1/2}}(z^{k+1/2})).$

For this method, in the monotone Euclidean case, the authors proved an estimate similar to (15). In the strongly monotone case, method (17) was investigated in the paper [54 Y.-G. Hsieh, F. Iutzeler, J. Malick and P. Mertikopoulos, On the convergence of single-call stochastic extra-gradient methods. Adv. Neural Inf. Process. Syst. 32, 6938–6948 (2019) ], but the estimates obtained there do not meet the lower bounds. The optimal estimates for this scheme were obtained later in [14 A. Beznosikov, A. Gasnikov, K. Zainulina, A. Maslovskiy and D. Pasechnyuk, A unified analysis of variational inequality methods: Variance reduction, sampling, quantization and coordinate descent, preprint, arXiv:2201.12206 (2022) ].

The work [66 G. Kotsalis, G. Lan and T. Li, Simple and optimal methods for stochastic variational inequalities. I: Operator extrapolation. SIAM J. Optim. 32, 2041–2073 (2022) ] deals with a slightly different single-call approach in the non-Euclidean case:

$z^{k+1}=\operatorname{prox}_{z^{k}}\bigl(\gamma_{k}F_{\xi^{k}}(z^{k})+\gamma_{k}\alpha_{k}[F_{\xi^{k}}(z^{k})-F_{\xi^{k-1}}(z^{k-1})]\bigr).$

This update rule is a modification of the Forward-Reflected-Backward approach, namely, here $\alpha_{k}$ is a parameter, while in [83 Y. Malitsky and M. K. Tam, A forward-backward splitting method for monotone inclusions without cocoercivity. SIAM J. Optim. 30, 1451–1472 (2020) ], $\alpha_{k}\equiv 1$. The analysis of method (18) gives optimal estimates in both the strongly monotone and monotone cases.

The theoretical results and guarantees discussed above rely in significant manner on the bounded variance assumption (Assumption 3).
This assumption is quite restrictive (especially when the domain is unbounded) and does not hold for many popular machine learning problems.
Moreover, one can even design a strongly monotone variational inequality on an unbounded domain such that method (14) *diverges* exponentially fast [26
T. Chavdarova, G. Gidel, F. Fleuret and S. Lacoste-Julien, Reducing noise in GAN training with variance reduced extragradient.
Adv. Neural Inf. Process. Syst. 32, 393–403 (2019)
].
The authors of [55
Y.-G. Hsieh, F. Iutzeler, J. Malick and P. Mertikopoulos, Explore aggressively, update conservatively: Stochastic extragradient methods with variable stepsize scaling.
Adv. Neural Inf. Process. Syst. 33, 16223–16234 (2020)
, 48
E. Gorbunov, H. Berard, G. Gidel and N. Loizou, Stochastic extragradient: General analysis and improved rates.
In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 7865–7901 (2022)
] consider a relaxed form of the bounded variance condition and assume that $\mathbb{E}\lVert F_{\xi}(u)-F(u)\rVert^{2}_{2}\leq\sigma^{2}+\delta\lVert u-z^{*}\rVert_{2}^{2}$ with $\delta\geq 0$ in the Euclidean case.
Under this condition and Assumptions 1 and 2, it is proved in [48
E. Gorbunov, H. Berard, G. Gidel and N. Loizou, Stochastic extragradient: General analysis and improved rates.
In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 7865–7901 (2022)
] that after $k$ iterations of algorithm (14) (when $\mathcal{Z}=\mathbb{R}^{d}$) it holds that

$\mathbb{E}\lVert z^{k}-z^{*}\rVert^{2}_{2}=\mathcal{O}\biggl(\kappa R^{2}_{0}\exp\biggl(-\frac{k}{\kappa}\biggr)+\frac{\sigma^{2}}{\mu^{2}k}\biggr),$

where $\kappa=\max\bigl\{\frac{\delta}{\mu^{2}};\frac{L+\sqrt{\delta}}{\mu}\bigr\}$. The same assumption on stochastic realizations was considered in [67 G. Kotsalis, G. Lan and T. Li, Simple and optimal methods for stochastic variational inequalities. II: Markovian noise and policy evaluation in reinforcement learning. SIAM J. Optim. 32, 1120–1155 (2022) ], where method (18) was used, yielding the estimate

$\mathbb{E}\lVert z^{k}-z^{*}\rVert^{2}_{2}=\mathcal{O}\biggl(R^{2}_{0}\exp\biggl(-\frac{\mu k}{L}\biggr)+\frac{\sigma^{2}+\delta^{2}D^{2}_{\mathcal{Z}}}{\mu^{2}k}\biggr).$

Estimates (19) and (20) are competitive: the former is superior in terms of the stochastic term (second term), while the latter is superior in terms of the deterministic term (first term). However, none of these results deals completely with the issue of bounded noise, because the condition considered above is not general. The key to avoiding the bounded variance assumption on $F_{\xi}$ lies in the way how stochasticity is generated in method (14). Method (14) is sometimes called Independent Samples Stochastic Extragradient (I-SEG). To address the bounded variance issue, K. Mishchenko et al. [86 K. Mishchenko, D. Kovalev, E. Shulgin, P. Richtárik and Y. Malitsky, Revisiting stochastic extragradient. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 4573–4582 (2020) ] proposed another stochastic modification of the Extragradient algorithm, called Same Sample Stochastic Extragradient (S-SEG):

$\displaystyle z^{k+1/2}=z^{k}-\gamma F_{\xi^{k}}(z^{k}),$

$\displaystyle z^{k+1}=z^{k}-\gamma F_{\xi^{k}}(z^{k+1/2}).$

For simplicity, we present the above method for the case when $\mathcal{Z}=\mathbb{R}^{d}$ ($F(x^{*})=0$), and refer the reader to [86 K. Mishchenko, D. Kovalev, E. Shulgin, P. Richtárik and Y. Malitsky, Revisiting stochastic extragradient. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 4573–4582 (2020) ] for a more general case of regularized VIs. In contrast to I-SEG, S-SEG uses the same sample $\xi^{k}$ for both steps at iteration $k$. Although such a strategy cannot be implemented in some scenarios (streaming oracle), it can be applied to finite-sum problems, which have been gaining an increasing attention in the recent years. Moreover, S-SEG relies in significant manner on the following assumption.

**Assumption 4****.** The operator $F_{\xi}(z)$ is $L$-Lipschitz and $\mu$-strongly monotone almost surely in $\xi$, i.e., $\lVert F_{\xi}(z)-F_{\xi}(z^{\prime})\rVert_{2}\leq L\lVert z-z^{\prime}\rVert_{2}$ and $\langle F_{\xi}(z)-F_{\xi}(z^{\prime}),z-z^{\prime}\rangle\geq\mu\lVert z-z^{\prime}\rVert_{2}^{2}$ for all $z,z^{\prime}\in\mathbb{R}^{d}$, almost surely in $\xi$.

The evident difference between the I-SEG and S-SEG setups can be explained through the connection between the Extragradient and the Proximal Point (PP) methods [84 B. Martinet, Régularisation d’inéquations variationnelles par approximations successives. Rev. Française Informat. Recherche Opérationnelle 4, 154–158 (1970) , 107 R. T. Rockafellar, Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14, 877–898 (1976) ]. In the rest of this subs-section we assume that $\mathcal{Z}=\mathbb{R}^{d}$ ($F(z^{*})=0$). In this setup, PP has the update rule

$z^{k+1}=z^{k}-\gamma F(z^{k+1}).$

The method converges for any monotone operator $F$ and any $\gamma>0$. However, the update rule of PP is implicit and in many situations it cannot be computed efficiently. The Extragradient method can be seen as a natural approximation of PP that substitutes $z^{k+1}$ in the right-hand side by one gradient step from $z^{k}$:

$z^{k+1}=z^{k}-\gamma F(z^{k}-\gamma F(z^{k})).$

In addition, when $F$ is $L$-Lipschitz, one can estimate how good the approximation is. Consider $z^{k+1}=z^{k}-\gamma F(z^{k}-\gamma F(z^{k}))$ (the Extragradient step) and $\tilde{z}^{k+1}=z^{k}-\gamma F(\tilde{z}^{k+1})$ (the PP step). Then $\lVert z^{k+1}-\tilde{z}^{k+1}\rVert_{2}$ can be estimated as follows [86 K. Mishchenko, D. Kovalev, E. Shulgin, P. Richtárik and Y. Malitsky, Revisiting stochastic extragradient. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 4573–4582 (2020) ]:

$\begin{split}&\lVert z^{k+1}-\tilde{z}^{k+1}\rVert_{2}=\gamma\lVert F(z^{k}-\gamma F(z^{k}))-F(\tilde{z}^{k+1})\rVert_{2}\\
&\qquad\leq\gamma L\lVert z^{k}-\gamma F(z^{k})-\tilde{z}^{k+1}\rVert_{2}=\gamma^{2}L\lVert F(z^{k})-F(\tilde{z}^{k+1})\rVert_{2}\\
&\qquad\leq\gamma^{2}L^{2}\lVert z^{k}-\tilde{z}^{k+1}\rVert_{2}=\gamma^{3}L^{2}\lVert F(\tilde{z}^{k+1})\rVert_{2}\\
&\qquad\leq\gamma^{3}L^{3}\lVert\tilde{z}^{k+1}-z^{*}\rVert_{2}.\end{split}$

That is, the difference between the Extragradient and PP steps is of the order $\mathcal{O}(\gamma^{3})$ rather than $\mathcal{O}(\gamma^{2})$. Since the latter corresponds to the difference between PP and the simple gradient step (7), the Extragradient method approximates PP better than gradient steps, which are known to be non-convergent for general monotone Lipschitz variational inequalities. This approximation feature of the Extragradient method is crucial for its convergence and, as the above derivation implies, the approximation argument significantly relies on the Lipschitzness of the operator $F$.

Let us go back to the differences between I-SEG and S-SEG. In S-SEG, the $k$-th iteration can be regarded as a single Extragradient step for the operator $F_{\xi^{k}}(z)$. Therefore, Lipschitzness and monotonicity of $F_{\xi^{k}}(z)$ (Assumption 4) are important for the analysis of S-SEG. In contrast, I-SEG uses different operators for the extrapolation and update steps. In this case, there is no effect from the Lipschitzness/monotonicity of individual $F_{\xi}(z)$s. Therefore, the analysis of I-SEG naturally relies on the Lipschitzness and monotonicity of $F(z)$ as well as on the closeness (on average) of $F_{\xi}(z)$ and $F(z)$ (Assumption 3).

The convergence of I-SEG was discussed earlier in this section. Regarding S-SEG, one has the following result [86 K. Mishchenko, D. Kovalev, E. Shulgin, P. Richtárik and Y. Malitsky, Revisiting stochastic extragradient. In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 4573–4582 (2020) ].

**Theorem 5**

**.**

*Let Assumption 4 hold.
Then there exists a choice of step size $\gamma$ (see [48
E. Gorbunov, H. Berard, G. Gidel and N. Loizou, Stochastic extragradient: General analysis and improved rates.
In International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res., 7865–7901 (2022)
]) such that the output of S-SEG after $k$ iterations satisfies
*

$\mathbb{E}\lVert z^{k}-z^{*}\rVert_{2}^{2}=\mathcal{O}\biggl(\frac{LR^{2}_{0}}{\mu}\exp\biggl(-\frac{\mu k}{L}\biggr)+\frac{\sigma_{*}^{2}}{\mu^{2}k}\biggr),$

*where $\sigma_{*}^{2}=\mathbb{E}\lVert F_{\xi}(z^{*})\rVert_{2}^{2}$.*

This rate is similar to the one known for I-SEG, with the following differences.
First, instead of the uniform bound on the variance $\sigma^{2}$, the rate depends on $\sigma_{*}^{2}$, which is the variance of $F_{\xi}$ measured at the solution.
In many cases, $\sigma^{2}=\infty$, while $\sigma_{*}^{2}$ is finite.
From this perspective, S-SEG enjoys a better rate of convergence than I-SEG.
However, this comes at a price: while the rate of I-SEG depends on the Lipschitz and strong-monotonicity constants of $F$, the rate of S-SEG depends on *the worst* constants of $F_{\xi}$, which can be much worse than those for $F$.
In particular, consider the finite-sum setup with uniform sampling of $\xi$: $F(x)=\frac{1}{n}\sum_{i=1}^{n}F_{i}(x)$