Over their more than half-century history of study (going back to the classical article ), variational inequalities have become one of the most popular and universal optimization formulations. Variational inequalities are used in various areas of applied mathematics. Here one can highlight both classical examples from game theory, economics, operator theory and convex analysis [6, 19, 106, 110, 113], as well as more recent applications in optimization and machine learning: non-smooth optimization, unsupervised learning [9, 36, 22], robust/adversarial optimization, GANs and reinforcement learning [100, 57]. Modern times present new challenges to the community: the increase in the scale of problems and the need to speed up solution processes have sparked a huge interest in stochastic formulations of applied tasks, including variational inequalities. This paper surveys stochastic methods for solving variational inequalities.
Structure of the paper
In Section 2, we give a formal statement of the variational inequality problem, basic examples, and main assumptions. Section 3 deals with deterministic methods, from which stochastic methods have evolved. Section 4 covers a variety of stochastic methods. Section 5 is devoted to the recent advances in (not necessarily stochastic) variational inequalities and saddle point problems.
2 Problem: Setting and assumptions
Notation. We use $\langle x, y \rangle := \sum_{i=1}^n x_i y_i$ to denote the standard inner product of vectors $x, y \in \mathbb{R}^n$, where $x_i$ is the $i$-th component of $x$ in the standard basis of $\mathbb{R}^n$. It induces the $\ell_2$-norm in $\mathbb{R}^n$ by $\|x\|_2 := \sqrt{\langle x, x \rangle}$. We denote the $\ell_p$-norm by $\|x\|_p := \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$ for $p \in [1, \infty)$, and $\|x\|_\infty := \max_{1 \le i \le n} |x_i|$ for $p = \infty$. The dual norm $\|\cdot\|_*$ corresponding to the norm $\|\cdot\|$ is defined by $\|y\|_* := \max\{\langle x, y \rangle \,:\, \|x\| \le 1\}$. The symbol $\mathbb{E}$ stands for the total mathematical expectation. Finally, we use the symbols $\Theta(\cdot)$ and $\mathcal{O}(\cdot)$ to enclose numerical constants that do not depend on any parameters of the problem, and the symbols $\widetilde{\Theta}(\cdot)$ and $\widetilde{\mathcal{O}}(\cdot)$ to additionally enclose logarithmic factors.
We study variational inequalities (VI) of the form

find $x^* \in \mathcal{X}$ such that $\langle F(x^*), x - x^* \rangle \ge 0$ for all $x \in \mathcal{X}$, (1)

where $F: \mathcal{X} \to \mathbb{R}^n$ is an operator and $\mathcal{X} \subseteq \mathbb{R}^n$ is a convex set.
To emphasize the extensiveness of formulation (1), we give a few examples of variational inequalities arising in applied sciences.
Example 1 (Minimization). Consider the minimization problem

$\min_{x \in \mathcal{X}} f(x).$ (2)

If $f$ is convex and differentiable, then setting $F(x) := \nabla f(x)$ reduces (2) to the variational inequality (1).
Example 2 (Saddle point problem). Consider the saddle point problem (SPP)

$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} g(x, y).$ (3)

If $g$ is convex-concave and smooth, then setting $z = (x, y)$ and $F(z) := (\nabla_x g(x, y), -\nabla_y g(x, y))$ reduces (3) to (1) on the set $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$.
The study of saddle point problems is often associated with variational inequalities.
Example 3 (Fixed point problem). Consider the fixed point problem: find $x^* \in \mathbb{R}^n$ such that $T(x^*) = x^*$, where $T: \mathbb{R}^n \to \mathbb{R}^n$ is an operator. Setting $F(x) := x - T(x)$ reduces this problem to (1) with $\mathcal{X} = \mathbb{R}^n$.
For the operator $F$ from (1) we assume the following.
Assumption 1 (Lipschitzness). The operator $F$ is $L$-Lipschitz continuous, i.e., for all $x, y \in \mathcal{X}$, we have $\|F(x) - F(y)\| \le L \|x - y\|$.
Assumption 2 (Strong monotonicity). The operator $F$ is $\mu$-strongly monotone, i.e., for all $x, y \in \mathcal{X}$, we have $\langle F(x) - F(y), x - y \rangle \ge \mu \|x - y\|^2$. If $\mu = 0$, then the operator is monotone.
In the context of problems (2) and (3), strong monotonicity of $F$ means strong convexity of $f$ and strong convexity-strong concavity of $g$. In this paper we first focus on the strongly monotone and monotone cases, but there are also various assumptions relaxing monotonicity and strong monotonicity (e.g., see  and references therein).
Since we work on the set $\mathcal{X}$, it is useful to introduce the Euclidean projection onto $\mathcal{X}$,

$\text{proj}_{\mathcal{X}}(x) := \arg\min_{y \in \mathcal{X}} \|x - y\|_2.$
To characterize the convergence of the methods for monotone variational inequalities we introduce the gap function

$\text{Gap}(x) := \max_{y \in \mathcal{X}} \langle F(y), x - y \rangle.$ (5)
Such a gap function, regarded as a convergence criterion, is more suitable for the following formulation of the variational inequality problem: find $x^* \in \mathcal{X}$ such that $\langle F(y), y - x^* \rangle \ge 0$ for all $y \in \mathcal{X}$.
Such a solution is also called weak or Minty (whereas the solution of (1) is called strong or Stampacchia). However, in view of Assumption 1, $F$ is single-valued and continuous on $\mathcal{X}$, meaning that the two indicated formulations of the variational inequality problem are in fact equivalent .
For the minimization problem (2), the functional distance to the solution, i.e., the difference $f(x) - f(x^*)$, can be used instead of (5). For saddle point problems (3), a slightly different gap function is used, namely,

$\text{Gap}(x, y) := \max_{y' \in \mathcal{Y}} g(x, y') - \min_{x' \in \mathcal{X}} g(x', y).$ (6)
For both functions (5) and (6) it is crucial that the feasible set is bounded (in fact it is not necessary to take the whole set $\mathcal{X}$, which can be unbounded; it suffices to take a bounded convex subset which contains some solution, see ). Therefore it is necessary to define a distance on the set $\mathcal{X}$. Since this survey covers methods not only in the Euclidean setup, let us introduce a more general notion of distance.
Definition 1 (Bregman divergence). Let $d: \mathcal{X} \to \mathbb{R}$ be a function that is $1$-strongly convex w.r.t. the norm $\|\cdot\|$ and differentiable on $\mathcal{X}$. Then for any two points $x, y \in \mathcal{X}$ the Bregman divergence (or Bregman distance) associated with $d$ is defined as

$V(x, y) := d(x) - d(y) - \langle \nabla d(y), x - y \rangle.$
We denote the Bregman diameter of the set $\mathcal{X}$ w.r.t. the divergence $V$ as $D_V$. In the Euclidean case (when $d(x) = \frac{1}{2}\|x\|_2^2$ and $V(x, y) = \frac{1}{2}\|x - y\|_2^2$), we simply write $D$ instead of $D_V$. Using the definition of $V$, we introduce the so-called proximal operator as follows:

$\text{prox}_x(\xi) := \arg\min_{y \in \mathcal{X}} \left\{ \langle \xi, y \rangle + V(y, x) \right\}.$
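To make the proximal operator concrete, here is a minimal Python sketch (an illustration, not from the survey) of two standard instances: the Euclidean projection onto a ball, and the Bregman proximal step on the probability simplex generated by the negative entropy, which has the closed-form multiplicative-weights update.

```python
import math

def proj_ball(x, radius=1.0):
    """Euclidean projection onto the ball {y : ||y||_2 <= radius}."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [radius * v / norm for v in x]

def entropy_prox(x, xi):
    """Bregman proximal step prox_x(xi) on the probability simplex with the
    negative-entropy generator d(y) = sum_i y_i log y_i.
    argmin_y { <xi, y> + V(y, x) } has the closed form y_i ∝ x_i * exp(-xi_i),
    i.e. a multiplicative-weights / exponentiated-gradient update."""
    w = [xv * math.exp(-g) for xv, g in zip(x, xi)]
    s = sum(w)
    return [v / s for v in w]
```

Note that the entropy prox keeps iterates strictly inside the simplex, which is why it is the standard choice for the game-theoretic examples above.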
3 Deterministic foundation: Extragradient and other methods
The first and the simplest method for solving the variational inequality (1) is the iterative scheme (also known as the Gradient method)

$x^{k+1} = \text{proj}_{\mathcal{X}}(x^k - \gamma F(x^k)),$ (7)

where $\gamma > 0$ is a step size. Note that using the proximal operator associated with the Euclidean Bregman divergence this method can be rewritten in the form

$x^{k+1} = \text{prox}_{x^k}(\gamma F(x^k)),$

and $\|\cdot\|$ denotes (here and everywhere in the sequel) the norm $\|\cdot\|_2$. For $\gamma = \mu/L^2$, we have $\|x^{k+1} - x^*\|^2 \le \left(1 - \frac{\mu^2}{L^2}\right)\|x^k - x^*\|^2$, thus the upper bound on the number of iterations needed to achieve the $\varepsilon$-solution (i.e., $\|x^K - x^*\|^2 \le \varepsilon$) is $\mathcal{O}\left(\frac{L^2}{\mu^2}\log\frac{1}{\varepsilon}\right)$.
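As an illustration (a toy example, not from the survey), the following Python sketch runs iteration (7) on the unconstrained strongly monotone operator $F(x) = Ax$ with $A = [[1, 2], [-2, 1]]$: the symmetric part of $A$ is the identity, so $\mu = 1$ and $L = \|A\|_2 = \sqrt{5}$, and the step size $\gamma = \mu/L^2 = 1/5$ yields linear convergence to the solution $x^* = 0$.

```python
def F(x):
    # F(x) = A x with A = [[1, 2], [-2, 1]]: the symmetric part of A is the
    # identity (mu = 1), and ||A||_2 = sqrt(5) = L
    return [x[0] + 2 * x[1], -2 * x[0] + x[1]]

def gradient_method(x, gamma, iters):
    # iteration (7) on X = R^n, so the projection is the identity
    for _ in range(iters):
        fx = F(x)
        x = [xi - gamma * fi for xi, fi in zip(x, fx)]
    return x

x = gradient_method([1.0, 1.0], gamma=1.0 / 5.0, iters=100)
```

For this particular $A$ the per-step contraction factor is exactly $\sqrt{1 - \mu^2/L^2} = \sqrt{0.8}$, so 100 iterations shrink the distance to $x^*$ by roughly five orders of magnitude.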
Various extensions of this statement (for the case when is not Lipschitz, but with linear growth bounds, or when the values of are corrupted by noise) can be found in [10, Theorem 1].
When $F$ is a potential operator (see Example 1), method (7) coincides with the gradient projection algorithm. It converges for strongly monotone $F$. Moreover, the bounds for the admissible step size are less restrictive and the relevant complexity estimates are better ($\mathcal{O}\left(\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$) than in Theorem 1; see [104, Theorem 2 in Section 1.4.2].
However, in the general monotone but not strongly monotone case (for instance, for the convex-concave SPP, Example 2) convergence fails. The original statements on the convergence of Uzawa's method (a version of (7)) for saddle point problems  were wrong; there are numerous well-known examples where method (7) applied to the operator $F$ of a bilinear SPP diverges, see, e.g., [104, Figure 39].
There have been many other attempts to recover the convergence of gradient-like methods, not for general VIs, but for saddle point problems. One of them is based on the transition to modified Lagrangians when $g$ is a Lagrange function, see [45, 104]. However, we focus on the general VI case. A possible approach is based on the idea of regularization: instead of the monotone variational inequality (1) one can deal with a regularized inequality, in which the monotone operator $F$ is replaced by the strongly monotone one $F_\varepsilon = F + \varepsilon R$, where $R$ is a strongly monotone operator and $\varepsilon > 0$ is a regularization parameter. If we denote by $x_\varepsilon^*$ the solution of the regularized VI, then one can prove that $x_\varepsilon^*$ converges to a solution of (1) as $\varepsilon \to 0$ (see ). However, usually the solution $x_\varepsilon^*$ is not easily available. To address this problem, an iterative regularization technique is proposed in , where one step of the basic method (7) is applied to the regularized problem. Step sizes and regularization parameters can be adjusted to guarantee convergence.
Another technique is based on the Proximal Point Method proposed independently by B. Martinet in  and by R. T. Rockafellar in . At each iteration this method requires the solution of the VI with the operator $F_k = F + \lambda(I - x^k)$, where $\lambda > 0$ and $I$ is the identity operator. This is an implicit method (similar to the regularization method); however, there exist numerous implementable versions of Proximal Point. For instance, some methods discussed below can be considered from this point of view.
The breakthrough in methods for solving (non-strongly) monotone variational inequalities was made by Galina Korpelevich . She exploited the idea of extrapolation for the gradient method. How this works can be explained on the simplest example of a two-dimensional min-max problem with $g(x, y) = xy$ and $x, y \in \mathbb{R}$. It has the unique saddle point $z^* = (0, 0)$, and at any point $z = (x, y)$ the direction $-F(z) = (-y, x)$ is orthogonal to $z - z^*$; thus, the iteration (7) increases the distance to the saddle point. However, if we perform the step (7) and get the extrapolated point $z^{k+1/2} = z^k - \gamma F(z^k)$, the direction $-F(z^{k+1/2})$ is attracted to the saddle point. Thus, the Extragradient method for solving (1) reads

$x^{k+1/2} = \text{proj}_{\mathcal{X}}(x^k - \gamma F(x^k)), \qquad x^{k+1} = \text{proj}_{\mathcal{X}}(x^k - \gamma F(x^{k+1/2})).$ (8)
For the particular cases of the zero-sum matrix game or the general bilinear problem, the method converges linearly, provided that the optimal solution is unique (see [64, Theorem 3]). More general upper bounds for the Extragradient method can be found in  and in the recent paper . In particular, for the strongly monotone case the estimate $\mathcal{O}\left(\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$ holds true (compare with the much worse bound $\mathcal{O}\left(\frac{L^2}{\mu^2}\log\frac{1}{\varepsilon}\right)$ for the Gradient method). An adaptive version of the Extragradient method (no knowledge of $L$ is required) is proposed in .
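The divergence/convergence contrast described above is easy to reproduce numerically. The following Python sketch (a toy illustration, not from the survey) runs the plain iteration (7) and the Extragradient iteration (8) on the bilinear operator $F(x, y) = (y, -x)$ coming from $\min_x \max_y xy$: the gradient method spirals away from the saddle point $(0, 0)$, while the Extragradient method converges to it.

```python
import math

def F(z):
    # operator of the bilinear SPP min_x max_y g(x, y) = x*y:
    # F(x, y) = (dg/dx, -dg/dy) = (y, -x)
    x, y = z
    return (y, -x)

def step(z, direction, gamma):
    return (z[0] - gamma * direction[0], z[1] - gamma * direction[1])

def norm(z):
    return math.hypot(z[0], z[1])

gamma, z_gd, z_eg = 0.5, (1.0, 1.0), (1.0, 1.0)
for _ in range(50):
    z_gd = step(z_gd, F(z_gd), gamma)       # method (7): spirals outward
    z_half = step(z_eg, F(z_eg), gamma)     # extrapolation step of (8)
    z_eg = step(z_eg, F(z_half), gamma)     # update step of (8)
```

For this rotation operator the iteration (7) multiplies the distance to the saddle point by $\sqrt{1 + \gamma^2} > 1$ at every step, while (8) multiplies it by $\sqrt{1 - \gamma^2 + \gamma^4} < 1$, which makes the contrast exact rather than merely empirical.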
This method converges to the solution, and if $F$ is linear, then the rate of convergence is linear. A further simplification of method (8) gives the so-called Alternating Gradient Method (alternating descent-ascent). In , it was proved that this method has local linear convergence.
L. Popov  proposed a version of the extrapolation scheme (sometimes this type of scheme is referred to as optimistic or single-call):

$x^{k+1/2} = \text{proj}_{\mathcal{X}}(x^k - \gamma F(x^{k-1/2})), \qquad x^{k+1} = \text{proj}_{\mathcal{X}}(x^k - \gamma F(x^{k+1/2})).$ (9)
It requires a single calculation of $F$ at each iteration vs. two calculations in the Extragradient method. As shown in , method (9) converges for $\gamma \le \frac{1}{3L}$. Rates of convergence for this method were derived recently in [41, 87]: $\mathcal{O}\left(\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$ for the strongly monotone case and a corresponding linear rate for the bilinear case. Note that in the general strongly monotone case this estimate is optimal , but for the bilinear problem the upper bounds available in the literature for both the Extragradient and optimistic methods are not tight . Meanwhile, optimal estimates can be achieved using approaches from [7, 4].
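A minimal sketch of Popov's scheme (9) on the same bilinear toy operator $F(x, y) = (y, -x)$ (an illustration, not from the survey): each iteration performs exactly one new evaluation of $F$, reusing the value computed at the previous leading point.

```python
import math

def F(z):
    # bilinear toy operator F(x, y) = (y, -x), Lipschitz constant L = 1
    x, y = z
    return (y, -x)

def popov(z0, gamma, iters):
    z = z0
    f_prev = F(z0)                # F(z^{-1/2}), initialized with z^{-1/2} = z^0
    for _ in range(iters):
        # extrapolation reuses the PREVIOUS operator value (single-call)
        z_half = (z[0] - gamma * f_prev[0], z[1] - gamma * f_prev[1])
        f_prev = F(z_half)        # the single new operator evaluation
        z = (z[0] - gamma * f_prev[0], z[1] - gamma * f_prev[1])
    return z

z = popov((1.0, 1.0), gamma=0.3, iters=200)
```

The step size $\gamma = 0.3$ satisfies $\gamma \le 1/(3L)$ for $L = 1$, and the iterates converge linearly to the saddle point $(0, 0)$.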
An extension of the above schemes to an arbitrary proximal setup was obtained in the work of A. Nemirovsky . He proposed the Mirror-Prox method for VIs, exploiting the Bregman divergence:

$x^{k+1/2} = \text{prox}_{x^k}(\gamma F(x^k)), \qquad x^{k+1} = \text{prox}_{x^k}(\gamma F(x^{k+1/2})).$ (10)
This yields the following rate-of-convergence result. Let Assumption 1 hold and let the iterates $x^k, x^{k+1/2}$ be generated by algorithm (10) with $\gamma \le \frac{1}{\sqrt{2}L}$. Then, after $K$ iterations,

$\text{Gap}(\bar{x}^K) = \mathcal{O}\left(\frac{L D^2}{K}\right), \qquad \text{where } \bar{x}^K = \frac{1}{K}\sum_{k=0}^{K-1} x^{k+1/2}.$ (12)
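To illustrate the Mirror-Prox scheme (10) in a genuinely non-Euclidean setup, here is a Python sketch (an illustrative toy, not from the survey) for a small diagonal matrix game $\min_p \max_q p^T A q$ over probability simplices, using the negative-entropy Bregman divergence, whose proximal step is the multiplicative-weights update. The duality gap of the averaged iterates decays as predicted by (12).

```python
import math

A = [[2.0, 0.0], [0.0, 1.0]]      # diagonal matrix game, value 2/3

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def transpose(M):
    return [[M[j][i] for j in range(len(M))] for i in range(len(M[0]))]

def operator(p, q):
    # F(p, q) = (A q, -A^T p) for the SPP min_p max_q p^T A q
    return matvec(A, q), [-v for v in matvec(transpose(A), p)]

def entropy_prox(x, xi):
    # Bregman prox with d(y) = sum_i y_i log y_i on the simplex:
    # argmin_y {<xi, y> + V(y, x)} = multiplicative-weights update
    w = [xv * math.exp(-g) for xv, g in zip(x, xi)]
    s = sum(w)
    return [v / s for v in w]

def duality_gap(p, q):
    # exploitability: best responses of both players against (p, q)
    return max(matvec(transpose(A), p)) - min(matvec(A, q))

gamma, K = 0.25, 2000             # gamma <= 1/(sqrt(2)*L) with L = ||A|| = 2
p = q = [0.5, 0.5]
avg_p, avg_q = [0.0, 0.0], [0.0, 0.0]
for _ in range(K):
    gp, gq = operator(p, q)
    p_half = entropy_prox(p, [gamma * v for v in gp])
    q_half = entropy_prox(q, [gamma * v for v in gq])
    gp, gq = operator(p_half, q_half)
    p = entropy_prox(p, [gamma * v for v in gp])
    q = entropy_prox(q, [gamma * v for v in gq])
    avg_p = [a + v / K for a, v in zip(avg_p, p_half)]   # ergodic average
    avg_q = [a + v / K for a, v in zip(avg_q, q_half)]
```

The averaged strategies approach the equilibrium $(1/3, 2/3)$ for both players, and the duality gap decays at the $\mathcal{O}(LD^2/K)$ rate of (12).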
Numerous extensions of these original versions of iterative methods for solving variational inequalities were published later. One can highlight Tseng's Forward-Backward Splitting , Nesterov's Dual Extrapolation  and Malitsky and Tam's Forward-Reflected-Backward . All these methods have convergence guarantees of the form (12). It turns out that this rate is optimal .
4 Stochastic methods: Different setups and assumptions
In this section, we move from deterministic to stochastic methods, i.e., we consider problem (1) with an operator of the form

$F(x) = \mathbb{E}_{\xi \sim \mathcal{D}}\left[F_\xi(x)\right],$ (13)

where $\xi$ is a random variable, $\mathcal{D}$ is some (typically unknown) probability distribution and $F_\xi$ is a stochastic operator. In this setup, calculating the value of the full operator $F$ is computationally expensive or even intractable. Therefore, one has to work mainly with the stochastic realizations $F_\xi$.
4.1 General case
The stochastic formulation (13) for problem (1) was first considered by the authors of . They proposed a natural stochastic generalization of the Extragradient method (more precisely, of the Mirror-Prox method):

$x^{k+1/2} = \text{prox}_{x^k}(\gamma F_{\xi_1^k}(x^k)), \qquad x^{k+1} = \text{prox}_{x^k}(\gamma F_{\xi_2^k}(x^{k+1/2})).$ (14)
Here it is important to note that the samples $\xi_1^k$ and $\xi_2^k$ are independent and $F_\xi$ is an unbiased estimator of $F$, i.e., $\mathbb{E}_\xi[F_\xi(x)] = F(x)$. Moreover, $F_\xi$ is assumed to satisfy the following condition.
Assumption 3 (Bounded variance). The unbiased operator $F_\xi$ has uniformly bounded variance, i.e., for all $x \in \mathcal{X}$, we have $\mathbb{E}_\xi\left[\|F_\xi(x) - F(x)\|^2\right] \le \sigma^2$.
Under this assumption, the following result was established in .
In , the authors carried out an analysis of algorithm (14) for strongly monotone VIs in the Euclidean case. In particular, under Assumptions 1, 2 and 3 one can guarantee that after $K$ iterations of method (14) one has (here and below we omit numerical constants in the exponential multiplier)

$\mathbb{E}\left[\|x^K - x^*\|^2\right] = \mathcal{O}\left(\|x^0 - x^*\|^2 \exp\left(-\frac{\mu}{L}K\right) + \frac{\sigma^2}{\mu^2 K}\right).$ (16)
Also in , the authors obtained lower complexity bounds for solving VIs satisfying Assumptions 1, 2 and 3 with stochastic methods. It turns out that the conclusions of Theorem 4 in the monotone case and estimate (16) are optimal and match the lower bounds up to numerical constants.
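For intuition, the following Python sketch (a toy illustration, not from the survey) runs I-SEG, i.e., scheme (14) with two independent samples per iteration, on the unconstrained strongly monotone operator $F(x) = x$ observed through bounded additive noise, with a decreasing step size; the iterates approach the solution $x^* = 0$ up to the noise floor, in line with the $\mathcal{O}(\sigma^2/(\mu^2 K))$ stochastic term of (16).

```python
import random

random.seed(0)
sigma = 0.2   # noise level: |F_xi(x) - F(x)| <= sigma, so Assumption 3 holds

def F_stoch(x):
    # unbiased oracle for the 1-strongly monotone, 1-Lipschitz operator F(x) = x
    return [v + random.uniform(-sigma, sigma) for v in x]

x = [5.0, -5.0]
for k in range(2000):
    gamma = 2.0 / (k + 10)                        # decreasing step size
    g1 = F_stoch(x)                               # sample xi_1^k
    x_half = [v - gamma * g for v, g in zip(x, g1)]
    g2 = F_stoch(x_half)                          # independent sample xi_2^k
    x = [v - gamma * g for v, g in zip(x, g2)]
```

With a constant step size the iterates would stall at an $\mathcal{O}(\gamma \sigma^2)$ neighborhood of the solution; the decreasing schedule trades the linear rate for convergence of the noise term.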
Optimistic-like (or single-call) methods were also investigated in the stochastic setting. The work  applies the following update scheme (a stochastic version of (9)):

$x^{k+1/2} = \text{proj}_{\mathcal{X}}(x^k - \gamma F_{\xi^{k-1}}(x^{k-1/2})), \qquad x^{k+1} = \text{proj}_{\mathcal{X}}(x^k - \gamma F_{\xi^k}(x^{k+1/2})),$ (17)

i.e., the stochastic oracle is called only once per iteration, at the leading point, and the previous value is reused for extrapolation.
For this method, in the monotone Euclidean case, the authors proved an estimate similar to (15). In the strongly monotone case, method (17) was investigated in the paper , but the estimates obtained there do not meet the lower bounds. The optimal estimates for this scheme were obtained later in .
The work  deals with a slightly different single-call approach in the non-Euclidean case. Its update rule is a modification of the Forward-Reflected-Backward approach with an additional parameter controlling the extrapolation. The analysis of method (18) gives optimal estimates in both the strongly monotone and monotone cases.
The theoretical results and guarantees discussed above rely significantly on the bounded variance assumption (Assumption 3). This assumption is quite restrictive (especially when the domain is unbounded) and does not hold for many popular machine learning problems. Moreover, one can even design a strongly monotone variational inequality on an unbounded domain such that method (14) diverges exponentially fast . The authors of [55, 48] consider a relaxed form of the bounded variance condition in the Euclidean case. Under this condition and Assumptions 1 and 2, it is proved in  that after $K$ iterations of algorithm (14) it holds that
Estimates (19) and (20) are complementary: the former is superior in terms of the stochastic term (the second term), while the latter is superior in terms of the deterministic term (the first term). However, none of these results deals completely with the issue of unbounded variance, because the condition considered above is not fully general. The key to avoiding the bounded variance assumption lies in the way stochasticity is generated in method (14). Method (14) is sometimes called Independent Samples Stochastic Extragradient (I-SEG). To address the bounded variance issue, K. Mishchenko et al.  proposed another stochastic modification of the Extragradient algorithm, called Same Sample Stochastic Extragradient (S-SEG):

$x^{k+1/2} = x^k - \gamma F_{\xi^k}(x^k), \qquad x^{k+1} = x^k - \gamma F_{\xi^k}(x^{k+1/2}).$
For simplicity, we present the above method for the case when $\mathcal{X} = \mathbb{R}^n$ (the unconstrained case), and refer the reader to  for a more general case of regularized VIs. In contrast to I-SEG, S-SEG uses the same sample $\xi^k$ for both steps of iteration $k$. Although such a strategy cannot be implemented in some scenarios (e.g., with a streaming oracle), it can be applied to finite-sum problems, which have been gaining increasing attention in recent years. Moreover, S-SEG relies significantly on the following assumption.
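A minimal Python sketch of S-SEG for a scalar finite-sum toy problem (an illustration, not from the survey): both steps of iteration $k$ reuse the same sample. Here each $F_i(x) = a_i x$, so Assumption 4 below holds with $L_{\max} = \max_i a_i$ and $\mu_{\min} = \min_i a_i$; moreover $F_i(x^*) = 0$ at the solution $x^* = 0$, so the variance at the solution is zero and the method converges linearly to the exact solution.

```python
import random

random.seed(0)
coeffs = [1.0, 10.0]          # F_i(x) = a_i * x, so L_i = mu_i = a_i and
                              # L_max = 10, mu_min = 1 in Assumption 4

def F_i(i, x):
    return coeffs[i] * x

x, gamma = 1.0, 0.05          # step size chosen for the worst constant L_max
for _ in range(200):
    i = random.randrange(len(coeffs))     # one sample per iteration...
    x_half = x - gamma * F_i(i, x)        # ...used in the extrapolation step
    x = x - gamma * F_i(i, x_half)        # ...and reused in the update step
```

Every per-sample factor $1 - \gamma a_i (1 - \gamma a_i)$ is strictly below one here, so the convergence does not depend on the particular random sampling sequence.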
Assumption 4. The operator $F_\xi$ is $L_\xi$-Lipschitz and $\mu_\xi$-strongly monotone almost surely in $\xi$, i.e., $\|F_\xi(x) - F_\xi(y)\| \le L_\xi \|x - y\|$ and $\langle F_\xi(x) - F_\xi(y), x - y \rangle \ge \mu_\xi \|x - y\|^2$ for all $x, y \in \mathbb{R}^n$, almost surely in $\xi$.
The evident difference between the I-SEG and S-SEG setups can be explained through the connection between the Extragradient and the Proximal Point (PP) methods [84, 107]. In the rest of this subsection we assume that $\mathcal{X} = \mathbb{R}^n$ (the unconstrained case). In this setup, PP has the update rule

$x^{k+1} = x^k - \gamma F(x^{k+1}).$
The method converges for any monotone operator $F$ and any $\gamma > 0$. However, the update rule of PP is implicit and in many situations it cannot be computed efficiently. The Extragradient method can be seen as a natural approximation of PP that substitutes $x^{k+1}$ in the right-hand side by one gradient step from $x^k$:

$x^{k+1} = x^k - \gamma F(x^k - \gamma F(x^k)).$
In addition, when $F$ is $L$-Lipschitz, one can estimate how good the approximation is. Consider $\hat{x} = x - \gamma F(x - \gamma F(x))$ (the Extragradient step) and $\tilde{x} = x - \gamma F(\tilde{x})$ (the PP step). Then $\|\hat{x} - \tilde{x}\|$ can be estimated as follows:

$\|\hat{x} - \tilde{x}\| = \gamma \|F(\tilde{x}) - F(x - \gamma F(x))\| \le \gamma L \|\tilde{x} - x + \gamma F(x)\| = \gamma^2 L \|F(x) - F(\tilde{x})\| \le \gamma^3 L^2 \|F(\tilde{x})\|.$
That is, the difference between the Extragradient and PP steps is of the order $\gamma^3$ rather than $\gamma^2$. Since the latter corresponds to the difference between PP and the simple gradient step (7), the Extragradient method approximates PP better than gradient steps, which are known to be non-convergent for general monotone Lipschitz variational inequalities. This approximation feature of the Extragradient method is crucial for its convergence, and, as the above derivation implies, the approximation argument relies significantly on the Lipschitzness of the operator $F$.
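The $\gamma^3$ vs. $\gamma^2$ behavior can be checked numerically. For the linear operator $F(x, y) = (y, -x)$ the implicit PP step has a closed form, and halving $\gamma$ should shrink the gradient-step error by roughly a factor of 4 (order $\gamma^2$) but the Extragradient error by roughly a factor of 8 (order $\gamma^3$). This Python sketch (an illustration, not from the survey) verifies the scaling:

```python
def apply(z):
    # linear monotone operator F(x, y) = (y, -x), i.e. F = J z with
    # J = [[0, 1], [-1, 0]] (L = 1)
    x, y = z
    return (y, -x)

def pp_step(z, gamma):
    # implicit proximal-point step: solve w = z - gamma*F(w),
    # i.e. (I + gamma*J) w = z, using the closed-form 2x2 inverse
    x, y = z
    d = 1.0 + gamma * gamma
    return ((x - gamma * y) / d, (y + gamma * x) / d)

def gd_step(z, gamma):
    fx = apply(z)
    return (z[0] - gamma * fx[0], z[1] - gamma * fx[1])

def eg_step(z, gamma):
    half = gd_step(z, gamma)          # extrapolation
    f = apply(half)
    return (z[0] - gamma * f[0], z[1] - gamma * f[1])

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

z, errs = (1.0, 1.0), {}
for gamma in (0.1, 0.05):
    pp = pp_step(z, gamma)
    errs[gamma] = (dist(gd_step(z, gamma), pp), dist(eg_step(z, gamma), pp))
```

Printing the two error ratios `errs[0.1][i] / errs[0.05][i]` gives approximately 4 for the gradient step and 8 for the Extragradient step, matching the orders derived above.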
Let us go back to the differences between I-SEG and S-SEG. In S-SEG, the $k$-th iteration can be regarded as a single Extragradient step for the operator $F_{\xi^k}$. Therefore, Lipschitzness and monotonicity of $F_{\xi^k}$ (Assumption 4) are important for the analysis of S-SEG. In contrast, I-SEG uses different operators for the extrapolation and update steps. In this case, there is no effect from the Lipschitzness/monotonicity of the individual $F_\xi$'s. Therefore, the analysis of I-SEG naturally relies on the Lipschitzness and monotonicity of $F$ as well as on the closeness (on average) of $F_\xi$ and $F$ (Assumption 3).
The convergence of I-SEG was discussed earlier in this section. Regarding S-SEG, one has the following result .
This rate is similar to the one known for I-SEG, with the following differences. First, instead of the uniform bound $\sigma^2$ on the variance, the rate depends on $\sigma_*^2 := \mathbb{E}\left[\|F_\xi(x^*) - F(x^*)\|^2\right]$, which is the variance of $F_\xi$ measured at the solution. In many cases, $\sigma^2$ is huge or even unbounded, while $\sigma_*^2$ is finite. From this perspective, S-SEG enjoys a better rate of convergence than I-SEG. However, this comes at a price: while the rate of I-SEG depends on the Lipschitz and strong-monotonicity constants of $F$, the rate of S-SEG depends on the worst constants of $F_\xi$, which can be much worse than those of $F$. In particular, consider the finite-sum setup with uniform sampling: $F(x) = \frac{1}{n}\sum_{i=1}^n F_i(x)$, where each $F_i$ is $L_i$-Lipschitz and $\mu_i$-strongly monotone. Then Assumption 4 holds with $L_{\max} = \max_i L_i$ and $\mu_{\min} = \min_i \mu_i$, and these constants appear in the rate from Theorem 3. The authors of  tighten this rate. In particular, they prove that for S-SEG with different step sizes for the extrapolation and update steps one has that
where $\bar{L} = \frac{1}{n}\sum_{i=1}^n L_i$ and $\bar{\mu} = \frac{1}{n}\sum_{i=1}^n \mu_i$. Since $L_{\max}$ is (sometimes considerably) larger than $\bar{L}$, the improvement is noticeable. Moreover, when the constants $L_i$ are known, one can consider the so-called importance sampling : $p_i = \frac{L_i}{\sum_{j=1}^n L_j}$, where $p_i$ is the probability of sampling $F_i$. As the authors of [48