
This article is published open access.

Tensor networks in machine learning

  • Richik Sengupta

    Skolkovo Institute of Science and Technology, Moscow, Russia
  • Soumik Adhikary

    Skolkovo Institute of Science and Technology, Moscow, Russia
  • Ivan Oseledets

    Skolkovo Institute of Science and Technology, Moscow, Russia
  • Jacob Biamonte

    Skolkovo Institute of Science and Technology, Moscow, Russia
A tensor network is a type of decomposition used to express and approximate large arrays of data. A given dataset, quantum state, or higher-dimensional multilinear map is factored and approximated by a composition of smaller multilinear maps. This is reminiscent of how a Boolean function might be decomposed into a gate array: the latter represents a special case of tensor decomposition, in which the tensor entries are replaced by $0$, $1$ and the factorisation becomes exact. The associated techniques are called tensor network methods: the subject developed independently in several distinct fields of study, which have more recently become interrelated through the language of tensor networks. The central questions in the field concern the expressibility of tensor networks and the reduction of computational overheads. A merger of tensor networks with machine learning is natural. On the one hand, machine learning can aid in determining a factorisation of a tensor network approximating a dataset. On the other hand, a given tensor network structure can be viewed as a machine learning model, whose parameters are adjusted to learn or classify a dataset. In this survey we review the basics of tensor networks and explain the ongoing effort to develop the theory of tensor networks in machine learning.

1 Introduction

Tensor networks are ubiquitous in most areas of modern science, including data science [11 A. Cichocki, Tensor networks for big data analytics and large-scale optimization problems, preprint, arXiv:1407.3124 (2014) ], condensed matter physics [19 C. Fernández-González, N. Schuch, M. M. Wolf, J. I. Cirac and D. Pérez-García, Frustration free gapless Hamiltonians for matrix product states. Comm. Math. Phys. 333, 299–333 (2015) ], string theory [22 A. Jahn and J. Eisert, Holographic tensor network models and quantum error correction: A topical review. Quantum Sci. Technol. 6, 033002 (2021) ] and quantum computer science. The manner in which tensors are employed exhibits significant overlap across many of these areas. In data science, tensors are used to represent large datasets. In condensed matter physics and in quantum computer science, tensors are used to represent states of quantum systems.

Manipulating large tensors comes at a high computational cost [28 A. Novikov, D. Podoprikhin, A. Osokin and D. P. Vetrov, Tensorizing neural networks. Adv. Neural Inf. Process. Syst. 28 (2015) ]. This observation has inspired techniques for tensor decomposition that reduce computational complexity while preserving the original data that they represent. Such techniques are now known as tensor network methods.

Tensor networks have risen to prominence in the last fifteen years, with several European schools playing leading roles in their modern development, as a means to describe and approximate quantum states (see the review [14 J. I. Cirac, D. Pérez-García, N. Schuch and F. Verstraete, Matrix product states and projected entangled pair states: Concepts, symmetries, theorems. Rev. Modern Phys. 93, 045003 (2021) ]). The topic however dates back much further, to work of Penrose [32 R. Penrose, Applications of negative dimensional tensors. In Combinatorial Mathematics and its Applications (Proc. Conf., Oxford, 1969), Academic Press, London, 221–244 (1971) ] and, in retrospect, even arose as special cases in work of Cayley [10 A. Cayley, On the theory of groups as depending on the symbolic equation θn=1. Philos. Mag. 7, 40–47 (1854) ]. Tensor networks have a rich modern history in mathematical physics [32 R. Penrose, Applications of negative dimensional tensors. In Combinatorial Mathematics and its Applications (Proc. Conf., Oxford, 1969), Academic Press, London, 221–244 (1971) ], in category theory [44 P. Selinger, A survey of graphical languages for monoidal categories. In New Structures for Physics, Lecture Notes in Phys. 813, Springer, Heidelberg, 289–355 (2011) ], in computer science, algebraic logic and related disciplines [1 J. Baez and M. Stay, Physics, topology, logic and computation: A Rosetta Stone. In New Structures for Physics, Lecture Notes in Phys. 813, Springer, Heidelberg, 95–172 (2011) ]. Such techniques are now becoming more common in data science and machine learning (see the reviews [12 A. Cichocki, N. Lee, I. Oseledets, A.-H. Phan, Q. Zhao and D. P. Mandic, Tensor networks for dimensionality reduction and large-scale optimization. I: Low-rank tensor decompositions. Found. Trends Mach. Learn. 9, 249–429 (2016) , 13 A. Cichocki, A.-H. Phan, Q. Zhao, N. Lee, I. Oseledets, M. Sugiyama and D. P. Mandic, Tensor networks for dimensionality reduction and large-scale optimization. II: Applications and future perspectives. Found. Trends Mach. Learn. 9, 431–673 (2017) ]).

1.1 Basic material about multilinear maps

It might be stated that the objective of linear algebra is to classify linear operators up to isomorphism and to study the simplest representative in each equivalence class. This motivates the prevalence of decompositions such as the SVD, the LU decomposition and the Jordan normal form. A special case of linear operators is given by the linear maps from a vector space $V$ to an arbitrary field such as $\mathbb{C}$ or $\mathbb{R}$. These linear maps form the dual space $V^{\star}$ (the vector space of covectors) to our vector space $V$.

A natural generalisation of linear maps is provided by multilinear maps, i.e., maps that are linear in each argument when the values of the other arguments are fixed. For a given pair of non-negative integers $p$ and $q$, a type-$(p,q)$ tensor $T$ is defined as a multilinear map

\[T\colon\underbrace{V^{*}\times\dots\times V^{*}}_{p\,\text{copies}}\times\underbrace{V\times\dots\times V}_{q\,\text{copies}}\to\mathbb{K},\]

where $\mathbb{K}$ is an arbitrary field. The tensor $T$ is said to be of order (valence) $p+q$. Note that some authors refer to this as the rank $p+q$; we will not follow that convention.

It is often more convenient to view tensors as elements of a vector space known as the tensor product space. In this alternative interpretation, the above $(p,q)$-tensor $T$ can be defined as an element

\[T\in\underbrace{V\otimes\dots\otimes V}_{p\,\text{copies}}\otimes\underbrace{V^{*}\otimes\dots\otimes V^{*}}_{q\,\text{copies}}.\]

Moreover, the universal property of the tensor product of vector spaces states that any multilinear map can be replaced by a unique linear map acting from the tensor product of the vector spaces to the base field.

If we assume the axiom of choice, every vector space admits a Hamel basis. If $\mathbf{e}_{i}$ is such a basis in $V$, then the components of a tensor $T$ are the coefficients of $T$ with respect to the basis $\mathbf{e}_{i}$ and its dual basis $\varepsilon^{j}$ (the basis of the dual space $V^{*}$), that is,

\[T=T_{j_{1}\dots j_{q}}^{i_{1}\dots i_{p}}\,\mathbf{e}_{i_{1}}\otimes\cdots\otimes\mathbf{e}_{i_{p}}\otimes\varepsilon^{j_{1}}\otimes\cdots\otimes\varepsilon^{j_{q}}.\]

Adopting Einstein’s summation convention, summation over repeated indices is implied in (2).

Returning to (1), given $p$ covectors $(c^{1},\ldots,c^{p})$ and $q$ vectors $(v_{1},\ldots,v_{q})$, the value of the tensor is evaluated as

\[T_{j_{1}\dots j_{q}}^{i_{1}\dots i_{p}}\,c^{1}(\mathbf{e}_{i_{1}})\times\cdots\times c^{p}(\mathbf{e}_{i_{p}})\times\varepsilon^{j_{1}}(v_{1})\times\cdots\times\varepsilon^{j_{q}}(v_{q})\in\mathbb{K},\]

where $c^{k}(\mathbf{e}_{i_{k}})$ and $\varepsilon^{j_{l}}(v_{l})$ are numbers obtained by evaluating covectors (functionals) at the corresponding vectors. In quantum computation, the basis vectors $\mathbf{e}_{i_{k}}$ are denoted by $\lvert i_{k}\rangle$ and the basis covectors $\varepsilon^{j_{l}}$ are denoted by $\langle j_{l}\rvert$. Using Dirac’s notation, tensor products are written in compact form as

\[\mathbf{e}_{i_{m}}\otimes\mathbf{e}_{i_{l}}:=\lvert i_{m}i_{l}\rangle,\quad\varepsilon^{j_{m}}\otimes\varepsilon^{j_{l}}:=\langle j_{m}j_{l}\rvert,\quad\mathbf{e}_{i_{m}}\otimes\varepsilon^{j_{l}}:=\lvert i_{m}\rangle\langle j_{l}\rvert,\]

and the tensor $T$ takes the form

\[T=T_{j_{1}\dots j_{q}}^{i_{1}\dots i_{p}}\lvert i_{1}\ldots i_{p}\rangle\langle j_{1}\ldots j_{q}\rvert.\]

Similarly, given a tuple of $p$ covectors $(c^{1},\ldots,c^{p})$ and $q$ vectors $(v_{1},\ldots,v_{q})$, we write them as $\langle c_{1}\ldots c_{p}\rvert$ and $\lvert v_{1}\ldots v_{q}\rangle$, viewed as elements of the corresponding tensor product spaces.

In this notation the evaluation (3) takes the form

\[T_{j_{1}\dots j_{q}}^{i_{1}\dots i_{p}}\langle c_{1}\ldots c_{p}|i_{1}\ldots i_{p}\rangle\langle j_{1}\ldots j_{q}|v_{1}\ldots v_{q}\rangle.\]

Since in quantum computation the vector spaces under consideration, as well as their duals, are Hilbert spaces, Riesz’s representation theorem for linear functionals implies that the evaluation $\langle\,\cdot\,|\,\cdot\,\rangle$ above can be seen as an inner product.
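In a fixed basis, evaluating a tensor on its arguments is just index contraction, which numpy's `einsum` expresses directly. A minimal sketch for a hypothetical type-$(1,1)$ tensor on $\mathbb{R}^{2}$ (the array values are arbitrary, chosen for illustration):

```python
import numpy as np

# A type-(1,1) tensor T on R^2, stored as the array T[i, j] = T^i_j.
T = np.array([[1.0, 2.0],
              [3.0, 4.0]])

c = np.array([1.0, -1.0])  # a covector c, in the dual basis
v = np.array([2.0, 5.0])   # a vector v, in the basis e_i

# T(c, v) = T^i_j c_i v^j: Einstein summation over repeated indices.
value = np.einsum("ij,i,j->", T, c, v)
```

For this $(1,1)$ case the contraction is of course nothing but the bilinear form $c^{\top}Tv$; higher-valence tensors work the same way with longer subscript strings.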

Finally, a tensor can be identified with the array of coefficients $T_{j_{1}\dots j_{q}}^{i_{1}\dots i_{p}}$ in a specific basis decomposition. This approach is not basis independent, but it is useful in applications. Henceforth in this review we will fix the standard basis, which establishes a canonical isomorphism between the vector space and its dual. In the simplest case $p=q=1$, for example, this gives us the following equivalences:

\[T_{ij}\cong T_{ji}\cong T^{ij}\cong T^{ji}\cong T^{i}_{j}\cong T^{j}_{i}.\]

In the more general case this leads to the equivalence between the components $T_{j_{1}\dots j_{q}}^{i_{1}\dots i_{p}}$ and $T_{j_{1}\dots j_{q}i_{1}\dots i_{p}}$. Given a valence-$m$ tensor $T$, the equivalence class formed by raising, lowering and exchanging indices has cardinality $(1+m)!$ (see [4 J. Biamonte and V. Bergholm, Tensor networks in a nutshell, preprint, arXiv:1708.00006 (2017) ]). Recognising this arbitrariness, Penrose considered a graphical depiction of tensors [32 R. Penrose, Applications of negative dimensional tensors. In Combinatorial Mathematics and its Applications (Proc. Conf., Oxford, 1969), Academic Press, London, 221–244 (1971) ], stating that “it now ceases to be important to maintain a distinction between upper and lower indices.” This convention is widely adopted in the contemporary literature.

1.2 Tensor trains aka matrix product states

Consider a tensor $T$ with components $T_{j_{1}j_{2}\dots j_{n}}$, where $j_{k}=1,2,\ldots,d$. Hence, $T$ has $d^{n}$ components, a number that can exceed the total number of electrons in the universe when $d$ is as small as $2$ and $n\approx 300$. Clearly, storing the components of such a large tensor in a computer memory and subsequently manipulating them can become impossible. The good news is that for most practical purposes, a tensor typically contains a large amount of redundant information. This enables factoring $T$ into a sequence of “smaller” tensors.

Tensor trains (see [31 I. V. Oseledets, Tensor-train decomposition. SIAM J. Sci. Comput. 33, 2295–2317 (2011) ]) and matrix product states (MPS) [33 D. Perez-Garcia, F. Verstraete, M. M. Wolf and J. I. Cirac, Matrix product state representations. Quantum Inf. Comput. 7, 401–430 (2007) , 30 R. Orús, Advances on tensor network theory: Symmetries, fermions, entanglement, and holography. Eur. Phys. J. B 87, 280 (2014) ] arose in data science and in condensed matter physics, respectively, where it was shown that any tensor $T$ with components $T_{j_{1}j_{2}\cdots j_{n}}$ admits a decomposition of the form

\[T_{j_{1}j_{2}\dots j_{n}}=\alpha^{\dagger}A_{j_{1}}^{(1)}A_{j_{2}}^{(2)}\cdots A_{j_{n}}^{(n)}\beta,\]

where $A_{j_{k}}^{(k)}$ is an $(r_{k-1}\times r_{k})$-dimensional matrix, and $\alpha$ and $\beta$ are $r_{0}$- and $r_{n}$-dimensional vectors, respectively.
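A factorisation of this kind can be computed by successive singular value decompositions of reshaped tensors, the TT-SVD idea of [31]. Below is a minimal numpy sketch; for simplicity it absorbs the boundary vectors $\alpha$, $\beta$ into end cores of outer bond dimension 1, an equivalent convention:

```python
import numpy as np

def tt_svd(T, eps=1e-12):
    # Factor an n-way array into tensor-train cores G[k] of shape
    # (r_{k-1}, d_k, r_k) by reshaping and taking successive SVDs.
    dims = T.shape
    cores, r_prev = [], 1
    M = T.reshape(dims[0], -1)
    for k in range(len(dims) - 1):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        r = max(1, int(np.sum(s > eps)))   # numerical rank = virtual bond dim
        cores.append(U[:, :r].reshape(r_prev, dims[k], r))
        M = (s[:r, None] * Vt[:r]).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(M.reshape(r_prev, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    # Contract the cores back into the full n-way array.
    out = cores[0]
    for G in cores[1:]:
        out = np.tensordot(out, G, axes=([out.ndim - 1], [0]))
    return out.reshape(out.shape[1:-1])

rng = np.random.default_rng(0)
T = rng.standard_normal((2, 2, 2, 2))
cores = tt_svd(T)
```

Truncating each SVD to a fixed bond dimension instead of the numerical rank turns this exact factorisation into the low-rank approximation used in practice.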

Likewise, an $n$-qubit state $\lvert\psi\rangle\in(\mathbb{C}^{2})^{\otimes n}$, written in the computational basis as $\lvert\psi\rangle=\sum_{j_{1}j_{2}\cdots j_{n}}T_{j_{1}j_{2}\dots j_{n}}\lvert j_{1}j_{2}\dots j_{n}\rangle$, $j_{k}\in\{0,1\}$, can equivalently be expressed as

\[\lvert\psi\rangle=\sum_{j_{1}\dots j_{n}}\langle\alpha\rvert A^{(1)}_{j_{1}}A^{(2)}_{j_{2}}\cdots A^{(n)}_{j_{n}}\lvert\beta\rangle\,\lvert j_{1}j_{2}\dots j_{n}\rangle.\]

Here $A_{j_{k}}^{(k)}$ is an $(r_{k-1}\times r_{k})$-dimensional matrix and $\lvert\alpha\rangle,\lvert\beta\rangle$ are $r_{0}$- and $r_{n}$-dimensional vectors, respectively. Note that here we are adhering to the bra–ket notation, as is customary in quantum mechanics. The representation of $\lvert\psi\rangle$ in (5) is called the matrix product state representation with an open boundary condition (OBC-MPS). See Figure 1 for a graphical representation.

Figure 1. Graphical representation of tensor trains (the open-boundary-condition matrix product state representation); see (4), (5). Each index of the tensor $T_{j_{1}\dots j_{n}}$ is represented in the diagram by an open wire pointing downwards. We call these wires physical bonds. The horizontal wires represent extra indices which are summed over. Such internal wires are known as virtual bonds.

Yet another useful MPS decomposition that a state might admit is the MPS with periodic boundary condition (PBC-MPS) [30 R. Orús, Advances on tensor network theory: Symmetries, fermions, entanglement, and holography. Eur. Phys. J. B 87, 280 (2014) ]. The PBC-MPS representation of an $n$-qubit state

\[\lvert\psi\rangle=\sum_{j_{1}j_{2}\dots j_{n}}T_{j_{1}j_{2}\dots j_{n}}\lvert j_{1}j_{2}\dots j_{n}\rangle,\quad j_{k}=0,1,\]

is given by

\[\lvert\psi\rangle=\sum_{j_{1}j_{2}\dots j_{n}}\operatorname{Tr}\bigl(A^{(1)}_{j_{1}}A^{(2)}_{j_{2}}\cdots A^{(n)}_{j_{n}}\bigr)\lvert j_{1}j_{2}\dots j_{n}\rangle,\]

where $A^{(k)}_{j_{k}}$ is an $r\times r$ matrix. The graphical representation of a PBC-MPS is shown in Figure 2.

Figure 2. Graphical representation of the matrix product state in (6).

An $n$-qubit state $\lvert\psi\rangle=\sum_{j_{1}\dots j_{n}}T_{j_{1}\dots j_{n}}\lvert j_{1}\dots j_{n}\rangle$ has $2^{n}$ independent coefficients $T_{j_{1}\dots j_{n}}$. The MPS representation of $\lvert\psi\rangle$, on the other hand, is less data intensive. If $A^{(k)}_{j_{k}}$ is an $r\times r$ matrix for all $k$, the size of the representation becomes $2nr^{2}$, which is linear in $n$ for constant $r$. The point of the method is to choose $r$ such that a good and compact approximation of $\lvert\psi\rangle$ is obtained. The number $r$ is often also referred to as the virtual bond dimension. Data compression becomes even more effective if the MPS is site independent, that is, if $A^{(k)}_{j_{k}}=A_{j_{k}}$ for all $k$. It has been shown that a site-independent representation of a PBC-MPS always exists if the state is translation invariant [33 D. Perez-Garcia, F. Verstraete, M. M. Wolf and J. I. Cirac, Matrix product state representations. Quantum Inf. Comput. 7, 401–430 (2007) ]. Note that an MPS is invariant under the transformation $A_{j}\to PA_{j}P^{-1}$ for any invertible $P$; this follows from the cyclicity of the trace. Therefore, it is often customary to impose an additional constraint here, viz. $\sum_{j}A_{j}A_{j}^{\dagger}=\mathbf{1}$, in order to fix the gauge freedom [14 J. I. Cirac, D. Pérez-García, N. Schuch and F. Verstraete, Matrix product states and projected entangled pair states: Concepts, symmetries, theorems. Rev. Modern Phys. 93, 045003 (2021) ] (see also the connections to algebraic invariant theory [5 J. Biamonte, V. Bergholm and M. Lanzagorta, Tensor network methods for invariant theory. J. Phys. A 46, 475301 (2013) ]).
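Both the parameter count and the gauge freedom are easy to check numerically. The sketch below (illustrative sizes) builds a site-independent PBC-MPS from two random $r\times r$ matrices, evaluates all $2^{n}$ amplitudes as traces of matrix products as in (6), and verifies that the similarity transformation $A_{j}\to PA_{j}P^{-1}$ leaves every amplitude unchanged:

```python
import numpy as np
from itertools import product

n, r = 8, 3                              # number of qubits, virtual bond dimension
rng = np.random.default_rng(1)
A = {j: rng.standard_normal((r, r)) for j in (0, 1)}  # site-independent cores

def amplitude(cores, bits):
    # T_{j1...jn} = Tr(A_{j1} A_{j2} ... A_{jn})
    M = np.eye(r)
    for j in bits:
        M = M @ cores[j]
    return np.trace(M)

# 2**n = 256 amplitudes generated from only 2 * r**2 = 18 stored parameters.
T = np.array([amplitude(A, bits) for bits in product((0, 1), repeat=n)])

# Gauge freedom: A_j -> P A_j P^{-1} leaves every trace, hence amplitude, fixed.
P = rng.standard_normal((r, r))          # generic, hence invertible, matrix
B = {j: P @ A[j] @ np.linalg.inv(P) for j in A}
T_gauged = np.array([amplitude(B, bits) for bits in product((0, 1), repeat=n)])
```

(The resulting vector is unnormalised; dividing by its 2-norm yields a valid quantum state.)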

2 Machine learning: classical to quantum

2.1 Classical machine learning

At the core of machine learning is the task of data classification. In this task, we are typically provided with a labelled dataset $\mathcal{S}=\{(\mathbf{x}_{j},\mathbf{y}_{j})\}_{j=1}^{M}$, where the vectors $\mathbf{x}_{j}\in\mathbb{R}^{N}$ are the input data (e.g., animal images) and the vectors $\mathbf{y}_{j}\in\mathbb{R}^{d}$ are the corresponding labels (e.g., animal types). The objective is to find a suitable machine learning model $F$ with tunable parameters $\theta\in\mathbb{R}^{k}$ that generates the correct label for a given input $\mathbf{x}\in\mathbb{R}^{N}$. Note that our model can be viewed as a family of functions parameterised by $\theta$: $F$ takes a data vector as an input and outputs a predicted label; for an input datum $\mathbf{x}_{j}$, the predicted label is $F(\mathbf{x}_{j},\theta)$. To ensure that our model generates the correct labels, it needs to be trained; to accomplish this, a training set $\mathcal{T}\subset\mathcal{S}$ is chosen, the elements of which serve as input data to train $F$. Training requires a cost function,

\[\mathcal{L}(\theta)=\sum_{(\mathbf{x}_{j},\mathbf{y}_{j})\in\mathcal{T}}D(F(\mathbf{x}_{j},\theta),\mathbf{y}_{j}),\]

where $D(\cdot\,,\cdot)$ measures the mismatch between the real label and the estimated label. Typical choices for $D$ include, e.g., the negative log-likelihood function, the mean squared error (MSE), etc. [37 L. Rosasco, E. De Vito, A. Caponnetto, M. Piana and A. Verri, Are loss functions all the same? Neural Comput. 16, 1063–1076 (2004) ]. By minimising (7) with respect to $\theta$, one obtains the value $\theta^{\star}\in\operatorname{arg\,min}_{\theta}\mathcal{L}(\theta)$, which completes the training. After the model $F$ has been trained, we can evaluate its performance by feeding it inputs from $\mathcal{S}\setminus\mathcal{T}$ (often referred to as the validation set) and checking the classification accuracy.
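As a concrete, if minimal, instance of this pipeline, take the linear model $F(\mathbf{x},\theta)=\theta^{\top}\mathbf{x}$ and the squared-error mismatch $D$, and minimise $\mathcal{L}$ by plain gradient descent (the data, model and hyperparameters below are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0])        # "ground truth" used to fabricate labels
X = rng.standard_normal((100, 2))         # training inputs x_j
y = X @ theta_true                        # labels y_j (noise-free toy data)

def cost(theta):
    # L(theta) = sum_j D(F(x_j, theta), y_j) with squared-error D
    return np.sum((X @ theta - y) ** 2)

theta = np.zeros(2)                       # initial parameters
for _ in range(500):                      # gradient descent on L
    grad = 2 * X.T @ (X @ theta - y)
    theta -= 1e-3 * grad
```

After training, `theta` approximates `theta_true`, and the same fitted model would then be scored on held-out validation inputs.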

For a more formal description, let us assume that the dataset $\mathcal{S}$ is sampled from a joint probability distribution with density function $p(\mathbf{x},\mathbf{y})$. The role of a classifier model is to approximate the conditional distribution $p(\mathbf{y}\mid\mathbf{x})$. Knowledge of $p(\mathbf{x},\mathbf{y})$ allows us, in principle, to establish theoretical bounds on the performance of the classifier. Consider the generalisation error (also called risk), defined as $\mathcal{G}(\theta)=\mathbb{E}_{p(\mathbf{x},\mathbf{y})}\bigl(D(F(\mathbf{x},\theta),\mathbf{y})\bigr)$. A learning algorithm is said to generalise if $\lim_{M'\to\infty}\mathcal{L}(\theta)/M'=\mathcal{G}(\theta)$, where $M'$ is the cardinality of the training set. However, since in general we do not have access to $p(\mathbf{x},\mathbf{y})$, we can only attempt to bound the difference between the generalisation error and the empirical error, by checking certain stability conditions which ensure that the learning model is not too sensitive to noise in the data [8 O. Bousquet and A. Elisseeff, Stability and generalization. J. Mach. Learn. Res. 2, 499–526 (2002) ]. For example, we can try to ensure that the learning model is not affected if one of the data points is left out during training. The technique of regularisation helps prevent such overfitting.

Several different types of machine learning models $F$ exist, ranging from fairly elementary models, such as perceptrons, to highly involved ones, such as deep neural networks (DNNs) [24 Y. LeCun, Y. Bengio and G. Hinton, Deep learning. Nature 521, 436–444 (2015) , 38 F. Rosenblatt, The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 65, 386 (1958) ]. The choice of $F$ depends heavily on the classification task, the type of the dataset, and the desired training properties. Consider a dataset $\mathcal{S}$ with two classes (a binary dataset) that is linearly separable. That is, (i) $\mathbf{y}_{j}\in\{-1,1\}$ and (ii) one can construct a hyperplane that separates the input data belonging to the different classes. Finding this hyperplane, aka the decision boundary, is therefore sufficient for data classification in $\mathcal{S}$. This task can be accomplished with a simple machine learning model, the perceptron, which is in fact a single McCulloch–Pitts neuron [26 W. S. McCulloch and W. Pitts, A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115–133 (1943) ]. The algorithm starts with the candidate solution for the hyperplane $\mathbf{w}^{\top}\mathbf{x}+b=0$, where $\mathbf{w},b$ are tunable parameters playing the role of $\theta$. It is known from the perceptron convergence theorem that one can always find parameters $\mathbf{w}=\mathbf{w}^{\star},b=b^{\star}$ such that for every $\mathbf{x}_{j}$, if $\mathbf{y}_{j}=-1$, then $\mathbf{w}^{\star\top}\mathbf{x}_{j}+b^{\star}\leq 0$, while if $\mathbf{y}_{j}=1$, then $\mathbf{w}^{\star\top}\mathbf{x}_{j}+b^{\star}>0$.
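The perceptron update rule is short enough to state in full: whenever a training sample is misclassified, the hyperplane parameters are nudged towards it. A sketch on a tiny, linearly separable toy dataset (the data values are illustrative):

```python
import numpy as np

# A tiny linearly separable binary dataset with labels y_j in {-1, +1}.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w, b = np.zeros(2), 0.0
for _ in range(100):                 # repeated passes; for separable data
    for xj, yj in zip(X, y):         # convergence is guaranteed in finitely many updates
        if yj * (w @ xj + b) <= 0:   # sample misclassified (or on the boundary)
            w += yj * xj             # nudge the hyperplane towards x_j
            b += yj
```

On termination every sample satisfies $y_{j}(\mathbf{w}^{\top}\mathbf{x}_{j}+b)>0$, i.e., the learned hyperplane separates the two classes.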

Most datasets of practical importance, however, are not linearly separable and consequently cannot be classified by the perceptron model alone. Assuming that $\mathcal{S}$ is a binary dataset which is not linearly separable, we consider a map $\Lambda\colon\mathbb{R}^{N}\to\Gamma$ with $\dim(\Gamma)>N$, with the proviso that $\Lambda$ is nonlinear in the components of its inputs [6 C. M. Bishop, Pattern recognition and machine learning. Information Science and Statistics, Springer, New York (2006) ]. In machine learning, $\Lambda$ is called a feature map and the vector space $\Gamma$ is known as a feature space. Thus $\Lambda$ nonlinearly maps each input datum $\mathbf{x}_{j}$ to a vector $\Lambda(\mathbf{x}_{j})$ in the feature space. The significance of this step follows from Cover’s theorem on the separability of patterns [16 T. M. Cover, Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. 14, 326–334 (1965) ], which suggests that the transformed dataset $\mathcal{S}'=\{(\Lambda(\mathbf{x}_{j}),\mathbf{y}_{j})\}_{j=1}^{M}$ is more likely to be linearly separable. For a good choice of $\Lambda$, the data classification step now becomes straightforward, as it is sufficient to fit a hyperplane separating the two classes in the feature space. Indeed, the sought-for hyperplane can be constructed using the perceptron model, provided the feature map $\Lambda$ is explicitly known. In fact, a hyperplane can still be constructed even when $\Lambda$ is not explicitly known. A particularly elegant way to accomplish this is via the support vector machine (SVM) [15 C. Cortes and V. Vapnik, Support-vector networks. Mach. Learn. 20, 273–297 (1995) ], by employing the so-called kernel trick [45 J. Shawe-Taylor, N. Cristianini et al., Kernel methods for pattern analysis. Cambridge University Press, Cambridge (2004) ].

Consider again the binary dataset $\mathcal{S}'=\{(\Lambda(\mathbf{x}_{j}),\mathbf{y}_{j})\}_{j=1}^{M}$. The aim is to construct a hyperplane that separates the samples belonging to the two classes. In addition, we would like to maximise the margin of separation. Formally, we search for a set of parameters $\mathbf{w}=\mathbf{w}^{\star},b=b^{\star}$ such that for every $\mathbf{x}_{j}$, if $\mathbf{y}_{j}=-1$, then $\mathbf{w}^{\star\top}\Lambda(\mathbf{x}_{j})+b^{\star}\leq-1$, while if $\mathbf{y}_{j}=1$, then $\mathbf{w}^{\star\top}\Lambda(\mathbf{x}_{j})+b^{\star}\geq 1$. As in the perceptron model, the SVM algorithm starts with the candidate solution $\mathbf{w}^{\top}\mathbf{x}+b$, and the parameters $\mathbf{w},b$ are tuned based on the training data. An interesting aspect of the SVM approach, however, is the dependence of the algorithm on a special subset of the training data called support vectors, namely, those that satisfy the relation $\mathbf{w}^{\star\top}\Lambda(\mathbf{x}_{j})+b^{\star}=\pm 1$. We assume that there are $S$ such vectors $\Lambda(\mathbf{x}_{j}^{(\mathrm{s})})$, $j=1,\dots,S$. With some algebra, which we omit here, it can be shown that the decision boundary is given by the hyperplane

\[\sum_{j=1}^{S}\alpha_{j}^{\star}\mathbf{y}_{j}\bigl[\Lambda(\mathbf{x}_{j}^{(\mathrm{s})})^{\top}\Lambda(\mathbf{x})\bigr]+b^{\star}=0,\]

where

\[b^{\star}=\frac{1}{S}\sum_{j=1}^{S}\biggl(\mathbf{y}_{j}-\sum_{k=1}^{S}\alpha^{\star}_{k}\mathbf{y}_{k}\bigl[\Lambda(\mathbf{x}_{j}^{(\mathrm{s})})^{\top}\Lambda(\mathbf{x}_{k}^{(\mathrm{s})})\bigr]\biggr),\]
\[\alpha^{\star}=\operatorname*{arg\,max}_{\alpha}\biggl(\sum_{j}\alpha_{j}-\frac{1}{2}\sum_{j}\sum_{k}\alpha_{j}\alpha_{k}\mathbf{y}_{j}\mathbf{y}_{k}\,\Lambda(\mathbf{x}_{j})^{\top}\Lambda(\mathbf{x}_{k})\biggr),\]

with the additional conditions $\sum_{j}\alpha_{j}\mathbf{y}_{j}=0$ and $\alpha_{j}\geq 0$. All summations in (10) are over the entire training set. We make a key observation here: from (8), (9) and (10) we see that the expression for the decision boundary has no explicit dependence on the feature vectors $\Lambda(\mathbf{x}_{j})$. Instead, the dependence is solely through inner products of the form $\Lambda(\mathbf{x}_{j})^{\top}\Lambda(\mathbf{x}_{k})$. This allows us to use the kernel trick.

Support functions in optimisation

If $C\subset\mathbb{R}^{n}$ is a nonempty closed convex set, the support function $h_{C}\colon\mathbb{R}^{n}\to\mathbb{R}$ of $C$ is given by $h_{C}(x)=\sup\{\langle x|c\rangle:c\in C\}$, $x\in\mathbb{R}^{n}$. The hyperplane $H(x)=\{y\in\mathbb{R}^{n}:\langle y|x\rangle=h_{C}(x)\}$ is called a supporting hyperplane of $C$ with outer unit normal vector $x$. The function $h_{C}(x)$ outputs the signed distance of $H(x)$ from the origin. The data points that lie on $H(x)$ are called support vectors in the machine learning literature.

Graphical representation of the support function and supporting hyperplane.

Kernel functions

The concept of the kernel function originates in the theory of reproducing kernel Hilbert spaces (RKHS). Consider a Hilbert space $H$ consisting of real-valued functions defined on an arbitrary set $X$. We can define an evaluation functional $E_{x}(f)=f(x)$ that maps each function to a real number. If for every $x\in X$ the linear functional $E_{x}$ is bounded on $H$, we call $H$ an RKHS. Applying Riesz’s theorem mentioned in the introduction, we have the representation $E_{x}(f)=\langle f|K_{x}\rangle$, where $K_{x}\in H$. It follows that $E_{y}(K_{x})=\langle K_{x}|K_{y}\rangle=K_{x}(y)=K(x,y)$.

The function $K\colon X\times X\to\mathbb{R}$ is called the reproducing kernel of the space $H$. Briefly, in an RKHS the evaluation of a function at a point can be replaced by taking an inner product with a function determined by the kernel in the associated function space.

Now, given a feature map $\Phi\colon X\to\Gamma$, where $\Gamma$ is a Hilbert space, we define a normed space $H_{\Phi}=\{f\colon X\to\mathbb{C}:\exists g\in\Gamma,\,f(x)=\langle g|\Phi(x)\rangle\ \forall x\in X\}$ with the norm $\lVert f\rVert_{\Phi}=\inf\{\lVert g\rVert_{\Gamma}:g\in\Gamma,\,f(x)=\langle g|\Phi(x)\rangle\ \forall x\in X\}$. One can show that $H_{\Phi}$ is an RKHS with the kernel $K(x,y)=\langle\Phi(x)|\Phi(y)\rangle$.

Formally, a kernel $k(\mathbf{x}_{j},\mathbf{x}_{k})$ is defined as a function that computes the inner product of the images of $\mathbf{x}_{j},\mathbf{x}_{k}$ in the feature space, i.e., $k(\mathbf{x}_{j},\mathbf{x}_{k})=\Lambda(\mathbf{x}_{j})^{\top}\Lambda(\mathbf{x}_{k})$. Accordingly, all the inner products in (8), (9) and (10) can be replaced by the corresponding kernels. We can then use the kernel trick, that is, assign an analytic expression to the kernel in (8), (9) and (10), provided that the expression satisfies all the conditions required for it to be an actual kernel [39 B. Schölkopf, A. J. Smola, F. Bach et al., Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press (2002) ]. Indeed, every choice of kernel can be implicitly associated with a feature map $\Lambda$. However, in this approach we do not need to know the explicit form of the feature map. In fact, this is the advantage of the kernel trick, as calculating $\Lambda(\mathbf{x})$ directly is less efficient than using the kernels.
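The defining identity $k(\mathbf{x}_{j},\mathbf{x}_{k})=\Lambda(\mathbf{x}_{j})^{\top}\Lambda(\mathbf{x}_{k})$ can be checked directly in a small case: the homogeneous quadratic kernel $k(\mathbf{x},\mathbf{y})=(\mathbf{x}^{\top}\mathbf{y})^{2}$ on $\mathbb{R}^{2}$ is realised by the explicit feature map $\Lambda(\mathbf{x})=(x_{1}^{2},\sqrt{2}\,x_{1}x_{2},x_{2}^{2})$, a standard textbook example (numpy sketch):

```python
import numpy as np

def kernel(x, y):
    # Homogeneous quadratic kernel: k(x, y) = (x^T y)^2, no feature space needed.
    return float(x @ y) ** 2

def feature_map(x):
    # Explicit feature map Lambda: R^2 -> R^3 realising the same kernel.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = kernel(x, y)                       # evaluated without the feature map
rhs = feature_map(x) @ feature_map(y)    # inner product in the feature space
```

The kernel costs $O(N)$ per evaluation, whereas the explicit feature space here is already larger than the input space; for richer kernels (e.g., Gaussian) the implicit feature space is infinite-dimensional and only the kernel form is computable.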

Most modern applications in machine learning, however, involve deep neural networks (DNNs) [43 H. Schulz and S. Behnke, Deep learning. KI-Künstliche Intelligenz 26, 357–363 (2012) ]. A DNN can also be regarded as a collection of perceptrons arranged in a definite fashion. The architecture of a DNN is best understood and visualised in the form of a graph. Specifically, a DNN is composed of an input layer, an output layer, and several intermediate hidden layers. Each layer consists of several nodes, each of which represents a neuron. Edges are allowed only between nodes belonging to adjacent layers: nodes in layer $j$ share edges with nodes in layers $j+1$ and $j-1$. For the sake of simplicity, we consider the case where all the nodes in the $j$-th layer are connected to all the nodes in the $(j\pm 1)$-th layers, i.e., a fully connected neural network.

A DNN takes a data vector $\mathbf{x}$ as an input (at the input layer). This input is subsequently manipulated by the neurons in the next layer (the first hidden layer) to output a transformed vector $\mathbf{x}^{(1)}$, and this process is repeated until the last layer (the output layer) is reached. Consider the $k$-th neuron in the $j$-th layer; for convenience, we denote this neuron by $(k,j)$. It receives an input vector $\mathbf{x}^{(j-1)}$ whose components are the outputs of the neurons in the $(j-1)$-th layer, and then transforms $\mathbf{x}^{(j-1)}$ by the rule

\[\mathbf{x}^{(j-1)}\to\Psi\bigl((\mathbf{w}^{(j-1)}_{k})^{\top}\mathbf{x}^{(j-1)}+b^{(j)}_{k}\bigr)=\mathbf{x}^{(j)}_{k},\]

where \mathbf{w}^{(j-1)}_{k} are the weights associated with the edges that connect the neuron (k,j) to the neurons in the previous layer, b^{(j)}_{k} is the bias of the neuron (k,j), and \Psi(\cdot) is a differentiable nonlinear function known as the activation function. This is the fundamental mathematical operation that propagates data through a DNN, layer by layer, in the forward direction (input layer to output layer), via a series of transformations. The training component of the algorithm is accomplished through the back-propagation step [35]: a cost function is computed by comparing the signal at the output layer (the model-predicted label for the data \mathbf{x}) with the desired signal (the actual label for \mathbf{x}), and the weights and biases are adjusted so as to minimise this cost function. Apart from supervised learning, DNNs are routinely used for unsupervised learning, including generative learning. In generative learning, given access to a training dataset, a machine learning model learns its underlying probability distribution for future sample generation. To formulate this mathematically, consider a dataset \mathcal{S}=\{\mathbf{x}_{j}\}, whose entries \mathbf{x}_{j}\in\mathbb{R}^{N} are independent and identically distributed vectors sampled from a distribution q(\mathbf{x}). The purpose of a generative model is to approximate q(\mathbf{x}), given access to the training data in \mathcal{S}. To achieve this, a machine learning model (with tunable parameters \theta) is trained so that the model-generated distribution p(\mathbf{x},\theta) mimics the true distribution. The standard practice in generative learning is to minimise the negative log-likelihood with respect to the model parameters, which is tantamount to minimising the Kullback–Leibler divergence D_{\mathrm{KL}}(q(\mathbf{x})\,\|\,p(\mathbf{x},\theta)) between the two distributions.
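As a concrete illustration, the layer map in (11) takes a few lines of numpy; the layer widths, the random seed and the tanh activation below are illustrative choices of ours, not taken from the text.

```python
import numpy as np

def layer(x, W, b, activation=np.tanh):
    """One dense layer, x^(j-1) -> Psi(W x^(j-1) + b) = x^(j).

    Row k of W is the weight vector w_k^(j-1); the activation Psi
    (here tanh, one common differentiable choice) acts elementwise.
    """
    return activation(W @ x + b)

# A two-layer forward pass on a toy input.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)                              # input layer signal
x1 = layer(x0, rng.standard_normal((3, 4)), np.zeros(3))
x2 = layer(x1, rng.standard_normal((2, 3)), np.zeros(2))
print(x2.shape)  # (2,)
```

Back-propagation then differentiates a cost built from the final x^(j) with respect to every W and b.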

2.2 Variational algorithms and quantum machine learning (QML)

Variational quantum computing has emerged as the preeminent model for quantum computation. The model merges ideas from machine learning to better utilise modern quantum hardware.

Mathematically, the problem in variational quantum computing can be formulated as follows: given (i) a variational quantum circuit (aka ansatz) U(\theta)\in\mathrm{U}_{\mathbb{C}}(2^{n}) which produces an n-qubit variational state \lvert\psi(\theta)\rangle=U(\theta)\lvert 0\rangle^{\otimes n}, with \theta\in[0,2\pi)^{\times p}, (ii) an objective function \mathcal{H}\in\operatorname{herm}_{\mathbb{C}}(2^{n}), and (iii) the expectation \langle\psi(\theta)\rvert\mathcal{H}\lvert\psi(\theta)\rangle, find

\theta^{\star}\in\operatorname*{arg\,min}_{\theta\in[0,2\pi)^{\times p}}\langle\psi(\theta)\rvert\mathcal{H}\lvert\psi(\theta)\rangle.

Then \lvert\psi(\theta^{\star})\rangle approximates the ground state (the eigenvector corresponding to the lowest eigenvalue) of the Hamiltonian \mathcal{H}. The operator \mathcal{H}, often called the problem Hamiltonian, can suitably encode several classes of problems, so that the solution to a problem is encoded in the ground state of \mathcal{H}. The variational model of quantum computation was shown to be universal in [3].
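A minimal simulated instance of this minimisation, assuming a toy one-qubit ansatz \lvert\psi(\theta)\rangle=R_{y}(\theta)\lvert 0\rangle and problem Hamiltonian \mathcal{H}=Z (both our choices, not the paper's):

```python
import numpy as np

def ry(theta):
    """Single-qubit rotation R_y(theta)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def energy(theta, H):
    """<psi(theta)| H |psi(theta)> for |psi(theta)> = R_y(theta)|0>."""
    psi = ry(theta) @ np.array([1.0, 0.0])
    return psi @ H @ psi

Z = np.diag([1.0, -1.0])                  # problem Hamiltonian H = Z
thetas = np.linspace(0, 2 * np.pi, 1001)  # grid search in place of a real optimiser
best = min(thetas, key=lambda t: energy(t, Z))
print(energy(best, Z))                    # ~ -1.0, the ground-state energy of Z
```

Here energy(\theta)=\cos\theta, so the grid search finds \theta^{\star}\approx\pi, preparing the ground state \lvert 1\rangle.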

Quantum machine learning, both discriminative and generative, emerged as an important application of variational algorithms with suitable modifications to the aforementioned scheme. Indeed, by their very design, variational algorithms are well suited to implement machine learning tasks on a quantum device. The early developments in QML came mainly in the form of classification tasks [27, 18].

Classification of a classical dataset \mathcal{S}=\{(\mathbf{x}_{j},\mathbf{y}_{j})\}_{j=1}^{M} on quantum hardware typically involves four steps. First, the input vector \mathbf{x}_{j} is embedded into an n-qubit state \lvert\psi(\mathbf{x}_{j})\rangle. The effect of data encoding schemes on the expressive power of a quantum machine learning model was studied in [42]. What is the most effective data embedding scheme? Although there are several interesting candidates [25, 34], this question remains largely unanswered. In the second step, a parameterised ansatz U(\theta) is applied to \lvert\psi(\mathbf{x}_{j})\rangle to output \lvert\psi(\mathbf{x}_{j},\theta)\rangle. A number of different ansatzes are in use today, including the hardware-efficient ansatz, the checkerboard ansatz and the tree tensor network ansatz; they are chosen according to the application and the implementation specifications under consideration. The third step is where data is read out of \lvert\psi(\mathbf{x}_{j},\theta)\rangle: expectation values of certain chosen observables (Hermitian operators) are calculated with respect to \lvert\psi(\mathbf{x}_{j},\theta)\rangle to generate a predicted label F(\mathbf{x}_{j},\theta). The measured operators are typically Pauli strings, which form a basis of \operatorname{herm}_{\mathbb{C}}(2^{n}). In the final step, a cost function is constructed as in (7) and minimised by tuning \theta. This approach has been used in several studies to produce successful classifications of practical datasets (see, e.g., [40]).
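The four steps can be sketched on a single qubit; the angle embedding, the one-parameter ansatz and the Pauli-Z readout below are all illustrative assumptions of ours.

```python
import numpy as np

def ry(theta):
    """Single-qubit rotation R_y(theta)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def predict(x, theta):
    """Toy single-qubit classifier following the four steps:
    (1) angle-embed the scalar feature x into |psi(x)> = R_y(x)|0>,
    (2) apply the one-parameter ansatz R_y(theta),
    (3) measure the expectation of Z,
    (4) return its sign as the predicted label.
    """
    psi = ry(theta) @ ry(x) @ np.array([1.0, 0.0])  # steps 1-2
    expect_z = psi @ np.diag([1.0, -1.0]) @ psi     # step 3: <Z> = cos(x + theta)
    return np.sign(expect_z)                         # step 4

print(predict(0.1, 0.0), predict(3.0, 0.0))  # 1.0 -1.0
```

Training would then adjust theta so that the signs match the labels y_j, exactly as in the cost-minimisation step.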

An interesting variation of the above approach was proposed in [41, 21] to implement data classification based on the kernel trick. In this method the Hilbert space is treated as a feature space and the data embedding step, \mathbf{x}_{j}\to\lvert\psi(\mathbf{x}_{j})\rangle, as a feature map. A quantum circuit is used directly to compute the inner product \langle\psi(\mathbf{x}_{j})|\psi(\mathbf{x}_{k})\rangle, using, e.g., the swap test; the result is then employed for data classification by means of classical algorithms such as support vector machines.
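A classical stand-in for this kernel evaluation (the single-qubit feature map below is a toy assumption of ours; on hardware a swap test would estimate the same overlap):

```python
import numpy as np

def embed(x):
    """Toy feature map x -> |psi(x)> = R_y(x)|0>."""
    return np.array([np.cos(x / 2), np.sin(x / 2)])

def kernel(x, y):
    """Quantum kernel |<psi(x)|psi(y)>|^2, the quantity a swap test estimates."""
    return abs(embed(x) @ embed(y)) ** 2

# Gram matrix over three data points, ready to feed a classical SVM.
xs = [0.0, 0.5, np.pi]
gram = np.array([[kernel(a, b) for b in xs] for a in xs])
print(np.allclose(np.diag(gram), 1.0))  # True: the embedded states are normalised
```

For this embedding kernel(x, y) = cos²((x−y)/2), so orthogonal embeddings (here x=0 and x=π) give a kernel value of zero.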

Quantum machine learning has also been used to classify genuine quantum data. Some prominent examples of such applications include classifying phases of matter [47], quantum channel discrimination [23] and entanglement classification [20]. Other machine learning problems with quantum mechanical origins that have been solved by variational algorithms include quantum data compression [36] and denoising of quantum data [7]. Both of these applications use a quantum autoencoder. A quantum autoencoder, much like its classical counterpart, consists of two parts: an encoder and a decoder. The encoder removes redundant information from the input data to produce a minimal low-dimensional representation; this process is known as feature extraction. To ensure that the minimal representation produced by the encoder is faithful, a decoder is used which takes the output of the encoder and tries to reconstruct the original data. Thus, in an autoencoder, the encoder and the decoder are trained in tandem so that the input at the encoder and the output at the decoder closely match each other. While in the classical case the encoders and decoders are chosen to be neural networks, in the quantum version of an autoencoder the neural networks are replaced by variational circuits.

Considerable advances have been made on the front of quantum generative learning as well. In [2] it was shown that generative modelling can be used to prepare quantum states by training shallow quantum circuits. The central idea is to obtain the model-generated probability distribution p(\theta) by performing repeated measurements on a variational state \lvert\psi(\theta)\rangle. The state \lvert\psi(\theta)\rangle is prepared on a short-depth circuit with a fixed ansatz, parameterised by the vector \theta. The target distribution q is constructed in much the same manner, by performing repeated measurements on the target state. The measurement basis (preferably an informationally complete positive operator-valued measure), as expected, is kept the same in both cases. The training objective is therefore to ensure that p(\theta) mimics q, so that the variational circuit learns to prepare the target state. Viewed differently, the same task can be regarded as machine-learning-assisted quantum state tomography [9].
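A toy version of this training loop, with a one-qubit Born distribution and a grid search standing in for a proper optimiser (all modelling choices here are ours):

```python
import numpy as np

def ry(theta):
    """Single-qubit rotation R_y(theta)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def model_dist(theta):
    """Born distribution p(theta) from measuring R_y(theta)|0> in the Z basis."""
    psi = ry(theta) @ np.array([1.0, 0.0])
    return psi ** 2                         # [cos^2(theta/2), sin^2(theta/2)]

def kl(q, p):
    """Kullback-Leibler divergence D_KL(q || p)."""
    return float(np.sum(q * np.log(q / p)))

q = np.array([0.25, 0.75])                  # target distribution (our example)
thetas = np.linspace(0.01, np.pi - 0.01, 2001)
best = min(thetas, key=lambda t: kl(q, model_dist(t)))
print(round(kl(q, model_dist(best)), 4))    # ~ 0.0: the circuit has learnt q
```

The minimiser sits near \theta=2\pi/3, where \cos^{2}(\theta/2)=0.25, i.e., the circuit prepares a state whose measurement statistics reproduce the target.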

3 Tensor networks in machine learning

3.1 Tensor networks in classical machine learning

Recently, tensor network methods have found several applications in machine learning. Here we discuss some of these applications, with a focus on supervised learning models. We return to our labelled dataset \mathcal{S}=\{(\mathbf{x}_{j},\mathbf{y}_{j})\}_{j=1}^{M}, where \mathbf{x}_{j}\in\mathbb{R}^{N}. As mentioned earlier, there are several machine learning models F to choose from to perform a classification on the dataset \mathcal{S}.

However, in more abstract terms, a classifier F can be expressed as a function of the form

F_{\mathbf{W}}(\mathbf{x})=\sum_{j_{1},j_{2},\dots,j_{N}\in\{0,1\}}W_{j_{1}j_{2}\dots j_{N}}x_{1}^{j_{1}}x_{2}^{j_{2}}\cdots x_{N}^{j_{N}},

in the polynomial basis [29]. Here \mathbf{x}\in\mathbb{R}^{N} is an input datum and x_{k}\in\mathbb{R} is the k-th component of \mathbf{x}. The tensor W_{j_{1}j_{2}\dots j_{N}} is what we call the weight tensor; it encodes the tunable parameters in F. Going back to the case of binary classification, that is, \mathbf{y}_{j}\in\{1,-1\}, F(\mathbf{x}) can be regarded as a surface in \mathbb{R}^{N+1} that can be tuned (trained) to act as a decision boundary between the two classes of input data. Indeed, the training can be accomplished by the minimisation

\min_{\mathbf{W}}\biggl(\sum_{j=1}^{M}\lvert\operatorname{sgn}(F_{\mathbf{W}}(\mathbf{x}_{j}))-\mathbf{y}_{j}\rvert^{2}\biggr),\tag{12}

where \operatorname{sgn}(F_{\mathbf{W}}(\mathbf{x}_{j})) is the predicted label. However, in practice we run into a bottleneck when computing (12), since it involves the 2^{N} components of the weight tensor. One way to circumvent this bottleneck is to express the weight tensor as an MPS. Following the observation in Section 1.2, for a suitable choice of the virtual bond dimension r, the MPS representation of the weight tensor W involves only O(\operatorname{poly}N) components, thus making the computation of (12) less resource intensive [46]. It is worth noting that we could alternatively have opted for any other basis in the decomposition of the function F, depending on the optimisation problem at hand.

Yet another application of tensor networks in machine learning can be seen in DNNs. Consider the transformation in (11). For most practically relevant DNNs, this transformation is highly resource intensive: the vectors \mathbf{x}^{(j-1)} are typically very large, and hence computing the inner products (\mathbf{w}^{(j-1)}_{k})^{\top}\cdot\mathbf{x}^{(j-1)} is difficult. This computation can be made efficient using MPS. To do so, the vectors \mathbf{w}^{(j-1)}_{k} and \mathbf{x}^{(j-1)} are first reshaped into tensors and then expressed as MPS. When we express a vector as an MPS we need to keep track of far fewer components than in the original representation. This makes the computation of (11) tractable.
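The saving can be sketched for the inner product itself: two vectors of length 2^n stored as MPS cores are contracted site by site, without ever forming the full vectors. The helper names and shapes below are our assumptions, and the dense expansion is included only to verify the result.

```python
import numpy as np

def mps_inner(a_cores, b_cores):
    """<a, b> for two real vectors stored as MPS, cores of shape (rl, 2, rr).
    The environment E is updated site by site, so the 2^n-vectors never appear."""
    E = np.ones((1, 1))
    for A, B in zip(a_cores, b_cores):
        E = sum(A[:, j, :].T @ E @ B[:, j, :] for j in range(2))
    return float(E[0, 0])

def mps_to_vector(cores):
    """Expand an MPS back into the full 2^n vector (for testing only)."""
    v = np.ones((1, 1))
    for core in cores:
        v = np.einsum('ia,ajb->ijb', v, core).reshape(-1, core.shape[2])
    return v[:, 0]

rng = np.random.default_rng(2)
def random_mps(n, r):
    return [rng.standard_normal((1 if k == 0 else r, 2,
                                 1 if k == n - 1 else r)) for k in range(n)]

a, b = random_mps(5, 2), random_mps(5, 2)
print(np.isclose(mps_inner(a, b), mps_to_vector(a) @ mps_to_vector(b)))  # True
```

For n sites with bond dimension r the contraction costs O(n r^3) arithmetic, versus O(2^n) for the dense dot product.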

3.2 The parent Hamiltonian problem

Consider the quantum state preparation problem using a variational algorithm: given a variational circuit U(\theta) and an n-qubit target state \lvert t\rangle, tune \theta\to\theta^{\star} such that U(\theta^{\star})\lvert 0\rangle^{\otimes n} approximates \lvert t\rangle. To accomplish this task one needs to construct a Hamiltonian \mathcal{H}\in\operatorname{herm}_{\mathbb{C}}(2^{n}) with \lvert t\rangle as its unique ground state, which then serves as the objective function of the algorithm. Constructing such a Hamiltonian for a given target state is known as the parent Hamiltonian problem. The simplest recipe is to set \mathcal{H}=\mathbf{1}-\lvert t\rangle\langle t\rvert. This construction is, however, not always useful, because expressing \mathcal{H} in the basis of Pauli strings, the basis of measurement, may require an exponential number of terms; estimating the expectation of \mathcal{H} in polynomial time then becomes unfeasible.
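The naive recipe is easy to check numerically on two qubits; the Bell-state target below is our example, not one from the text.

```python
import numpy as np

# Naive parent Hamiltonian H = I - |t><t| for the 2-qubit target
# |t> = (|00> + |11>)/sqrt(2). Its ground state is |t> with eigenvalue 0;
# the catch discussed above is that for generic targets the Pauli-string
# expansion of such an H needs exponentially many terms.
t = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
H = np.eye(4) - np.outer(t, t)

evals, evecs = np.linalg.eigh(H)   # ascending eigenvalues
print(np.isclose(evals[0], 0.0), np.isclose(abs(evecs[:, 0] @ t), 1.0))
```

The spectrum is {0, 1, 1, 1}: a unique ground state \lvert t\rangle separated from the rest by a gap of 1.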

Ideally, we want the sought-for Hamiltonian to enjoy the following properties:

  1. The Hamiltonian is non-negative.

  2. The Hamiltonian has a non-degenerate (unique) ground state \lvert t\rangle.

  3. The Hamiltonian is gapped. An n-qubit Hamiltonian \mathcal{H}(n)\geq 0 is said to be gapped if

    \lim_{n\to\infty}\bigl[\dim\ker\mathcal{H}(n)\bigr]=1.\tag{13}

    Validity of (13) ensures that \mathcal{H}(n) is gapped for all finite n.

  4. The Hamiltonian is local. An n-qubit Hamiltonian \mathcal{H}(n) is said to be local if it can be expressed as

    \mathcal{H}(n)=\sum_{j\in 2^{V}}h(j),

    where V is the set of n symbols (qubits) and h(j)=\bigotimes_{k\in j}P_{k}\in\operatorname{herm}_{\mathbb{C}}(2^{n}), with each P_{k}\in\operatorname{herm}_{\mathbb{C}}(2). The Hamiltonian \mathcal{H}(n) is said to be k-local if none of the h(j)'s operates nontrivially on more than k symbols (qubits); a trivial operation refers to the case where P_{k} is the identity for some index k.

  5. The Hamiltonian must have O(\operatorname{poly}n) terms when expressed in the Pauli basis. The number of terms of a Hamiltonian expressed in the Pauli basis is also known as the cardinality of the Hamiltonian, \lVert\mathcal{H}\rVert_{\mathrm{card}} (see [3]).

Hamiltonians with such properties can indeed be constructed if \lvert t\rangle admits a matrix product state representation, albeit with the additional requirement that \lvert t\rangle satisfies the injectivity condition. For the parent Hamiltonian construction, consider the following setting. Let \lvert t\rangle be an n-qubit state written as a translation-invariant, site-independent MPS with periodic boundary conditions:

\lvert t\rangle=\sum_{j_{1}\dots j_{n}}\operatorname{Tr}(A_{j_{1}}\cdots A_{j_{n}})\lvert j_{1}\dots j_{n}\rangle,

where A_{j_{k}}\in\mathrm{Mat}_{\mathbb{C}}(r). For the sake of brevity we will call these matrices Kraus operators.¹ Consider the map

\Gamma_{L}\colon X\to\lvert\psi^{(L)}\rangle_{X}=\sum_{j_{1}\dots j_{L}}\operatorname{Tr}(XA_{j_{1}}\cdots A_{j_{L}})\lvert j_{1}\dots j_{L}\rangle,

where X\in\mathrm{Mat}_{\mathbb{C}}(r). We say that the state \lvert t\rangle is injective with injectivity length L if the map \Gamma_{L} is injective. Several corollaries follow from this definition. A particularly useful one connects the notion of injectivity to the rank of reduced density matrices: for an L-qubit reduced density matrix \rho^{(L)} of \lvert t\rangle, we have \operatorname{rank}(\rho^{(L)})=r^{2} if injectivity holds. It has been shown that in the large-n limit, \rho^{(L)} is given by

\rho^{(L)}=\operatorname{Tr}_{n-L}(\lvert t\rangle\langle t\rvert)=\sum_{\alpha,\beta=1}^{r}\Lambda_{\alpha}\lvert\psi_{\alpha\beta}^{(L)}\rangle\langle\psi_{\alpha\beta}^{(L)}\rvert,\tag{14}

with \lvert\psi_{\alpha\beta}^{(L)}\rangle=\sum_{j_{1}\dots j_{L}}\langle\alpha|A_{j_{1}}\cdots A_{j_{L}}|\beta\rangle\lvert j_{1}\dots j_{L}\rangle, where \lvert\alpha\rangle,\lvert\beta\rangle\in\mathbb{C}^{r} and \Lambda_{\alpha}\in\mathbb{R}_{+}. Equivalently, injectivity means that \{\lvert\psi_{\alpha\beta}^{(L)}\rangle\}_{\alpha\beta} is a linearly independent set.
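Injectivity can be tested numerically: \Gamma_{L} is a linear map from \mathrm{Mat}_{\mathbb{C}}(r) to the L-site space, so it is injective iff its matrix has rank r^{2}. The sketch below (our construction, real-valued for simplicity) contrasts GHZ-like Kraus operators, which fail injectivity for every L, with generic ones.

```python
import numpy as np

def injectivity_rank(kraus, L):
    """Rank of Gamma_L : X -> sum_j Tr(X A_{j1}...A_{jL}) |j1...jL>.

    Since Tr(X B) is linear in X with coefficient matrix B^T, the rows of
    the map's matrix are the flattened words A_{j1}...A_{jL}. Gamma_L is
    injective iff this rank equals r^2, with r the bond dimension.
    """
    r = kraus[0].shape[0]
    words = [np.eye(r)]
    for _ in range(L):
        words = [w @ A for w in words for A in kraus]
    G = np.array([w.T.reshape(-1) for w in words])
    return np.linalg.matrix_rank(G)

# GHZ-like Kraus operators: all words stay diagonal, so the rank saturates at r.
ghz = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]
# Generic Kraus operators become injective already at L = 2 for r = 2.
rng = np.random.default_rng(3)
generic = [rng.standard_normal((2, 2)) for _ in range(2)]
print(injectivity_rank(ghz, 2), injectivity_rank(generic, 2))  # 2 4
```

The rank also matches the corollary above: for the injective case the L-site reduced density matrix has rank r^{2}=4.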

The form of the reduced density matrix in (14) is particularly telling and allows us to construct the parent Hamiltonian of \lvert t\rangle, \mathcal{H}\geq 0. We formally write our parent Hamiltonian as

\mathcal{H}=\sum_{j=1}^{n}h_{j}^{(L)},\tag{15}

where h_{j}^{(L)} operates nontrivially on the L consecutive qubits starting at site j and obeys the condition \ker h_{j}^{(L)}=\operatorname{span}\{\lvert\psi_{\alpha\beta}^{(L)}\rangle\}_{\alpha\beta}. The latter condition combined with (14) ensures that \operatorname{Tr}(h_{j}^{(L)}\rho_{j}^{(L)})=0 for all j, which in turn implies that \lvert t\rangle\in\ker\mathcal{H}. In fact, \ker\mathcal{H}=\operatorname{span}\{\lvert t\rangle\}, provided \lvert t\rangle is injective, and so condition 2 for \mathcal{H} is satisfied. Conditions 3 and 4 are satisfied naturally due to the form of \mathcal{H} in (15). In addition, \mathcal{H} can also be seen to be frustration free, that is, \langle t|\mathcal{H}|t\rangle=0\Rightarrow\langle t|h_{j}^{(L)}|t\rangle=0 for all j. Finally, it was shown in [17] that if \lvert t\rangle is injective, then \mathcal{H} is gapped.
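One way to realise the condition \ker h_{j}^{(L)}=\operatorname{span}\{\lvert\psi_{\alpha\beta}^{(L)}\rangle\} is to take h_{j}^{(L)} to be the orthogonal projector onto the complement of that span. A real-valued numpy sketch of this construction, with generic Kraus operators of our choosing:

```python
import numpy as np

def local_term(kraus, L):
    """Projector h^(L) onto the orthogonal complement of the span of the
    MPS block states |psi_ab^(L)> = sum_j <a|A_{j1}...A_{jL}|b> |j1...jL>,
    so that ker h^(L) = span{|psi_ab^(L)>} holds by construction."""
    words = [np.eye(kraus[0].shape[0])]
    for _ in range(L):
        words = [w @ A for w in words for A in kraus]
    S = np.array([w.reshape(-1) for w in words])   # columns are the psi_ab
    U, s, _ = np.linalg.svd(S, full_matrices=True)
    basis = U[:, : np.sum(s > 1e-12)]              # orthonormal basis of the span
    return np.eye(S.shape[0]) - basis @ basis.T

rng = np.random.default_rng(4)
kraus = [rng.standard_normal((2, 2)) for _ in range(2)]
h = local_term(kraus, 3)

# h annihilates every block state and is a symmetric, non-negative projector.
words = [np.eye(2)]
for _ in range(3):
    words = [w @ A for w in words for A in kraus]
S = np.array([w.reshape(-1) for w in words])
print(np.allclose(h @ S, 0.0), np.allclose(h @ h, h))  # True True
```

Being a projector, each h_{j}^{(L)} is automatically non-negative, so \mathcal{H}=\sum_{j}h_{j}^{(L)}\geq 0 as required.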

4 Conclusion

The importance of matrix product states in physics is due to the ease with which one can calculate and verify important quantities or properties, such as two-point functions, thermal properties, and more. This is also true in machine learning applications. For example, images of size 256\times 256 can be viewed as rank-one tensor networks on \mathbb{R}^{256}. Departing from this linear (train) structure results in tensors with potentially much greater expressibility, at the cost of losing many desirable properties.

Richik Sengupta is a research scientist at the Skolkovo Institute of Science and Technology. r.sengupta@skoltech.ru Soumik Adhikary is a research scientist at the Skolkovo Institute of Science and Technology. s.adhikari@skoltech.ru Ivan Oseledets is a full professor, director of the Center for Artificial Intelligence Technology and head of the Laboratory of Computational Intelligence at the Skolkovo Institute of Science and Technology. i.oseledets@skoltech.ru Jacob Biamonte is a full professor and head of the Laboratory of Quantum Algorithms for Machine Learning and Optimisation at the Skolkovo Institute of Science and Technology. j.biamonte@skoltech.ru

  1. There is indeed a connection between the matrices A_{j_{k}} and completely positive trace-preserving (CPTP) maps, from which the A_{j_{k}} derive their name. For the purpose of this paper, however, we skip the detailed explanation.

References

  1. J. Baez and M. Stay, Physics, topology, logic and computation: A Rosetta Stone. In New Structures for Physics, Lecture Notes in Phys. 813, Springer, Heidelberg, 95–172 (2011)
  2. M. Benedetti, D. Garcia-Pintos, O. Perdomo, V. Leyton-Ortega, Y. Nam and A. Perdomo-Ortiz, A generative modeling approach for benchmarking and training shallow quantum circuits. npj Quantum Inf. 5, 1–9 (2019)
  3. J. Biamonte, Universal variational quantum computation. Phys. Rev. A 103, L030401 (2021)
  4. J. Biamonte and V. Bergholm, Tensor networks in a nutshell, preprint, arXiv:1708.00006 (2017)
  5. J. Biamonte, V. Bergholm and M. Lanzagorta, Tensor network methods for invariant theory. J. Phys. A 46, 475301 (2013)
  6. C. M. Bishop, Pattern recognition and machine learning. Information Science and Statistics, Springer, New York (2006)
  7. D. Bondarenko and P. Feldmann, Quantum autoencoders to denoise quantum data. Phys. Rev. Lett. 124, 130502 (2020)
  8. O. Bousquet and A. Elisseeff, Stability and generalization. J. Mach. Learn. Res. 2, 499–526 (2002)
  9. J. Carrasquilla, G. Torlai, R. G. Melko and L. Aolita, Reconstructing quantum states with generative models. Nat. Mach. Intell. 1, 155–161 (2019)
  10. A. Cayley, On the theory of groups as depending on the symbolic equation θn=1. Philos. Mag. 7, 40–47 (1854)
  11. A. Cichocki, Tensor networks for big data analytics and large-scale optimization problems, preprint, arXiv:1407.3124 (2014)
  12. A. Cichocki, N. Lee, I. Oseledets, A.-H. Phan, Q. Zhao and D. P. Mandic, Tensor networks for dimensionality reduction and large-scale optimization. I: Low-rank tensor decompositions. Found. Trends Mach. Learn. 9, 249–429 (2016)
  13. A. Cichocki, A.-H. Phan, Q. Zhao, N. Lee, I. Oseledets, M. Sugiyama and D. P. Mandic, Tensor networks for dimensionality reduction and large-scale optimization. II: Applications and future perspectives. Found. Trends Mach. Learn. 9, 431–673 (2017)
  14. J. I. Cirac, D. Pérez-García, N. Schuch and F. Verstraete, Matrix product states and projected entangled pair states: Concepts, symmetries, theorems. Rev. Modern Phys. 93, 045003 (2021)
  15. C. Cortes and V. Vapnik, Support-vector networks. Mach. Learn. 20, 273–297 (1995)
  16. T. M. Cover, Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. 14, 326–334 (1965)
  17. M. Fannes, B. Nachtergaele and R. F. Werner, Finitely correlated states on quantum spin chains. Comm. Math. Phys. 144, 443–490 (1992)
  18. E. Farhi and H. Neven, Classification with quantum neural networks on near term processors, preprint, arXiv:1802.06002 (2018)
  19. C. Fernández-González, N. Schuch, M. M. Wolf, J. I. Cirac and D. Pérez-García, Frustration free gapless Hamiltonians for matrix product states. Comm. Math. Phys. 333, 299–333 (2015)
  20. E. Grant, M. Benedetti, S. Cao, A. Hallam, J. Lockhart, V. Stojevic, A. G. Green and S. Severini, Hierarchical quantum classifiers. npj Quantum Inf. 4, 1–8 (2018)
  21. V. Havlíček, A. D. Córcoles, K. Temme, A. W. Harrow, A. Kandala, J. M. Chow and J. M. Gambetta, Supervised learning with quantum-enhanced feature spaces. Nature 567, 209–212 (2019)
  22. A. Jahn and J. Eisert, Holographic tensor network models and quantum error correction: A topical review. Quantum Sci. Technol. 6, 033002 (2021)
  23. A. Kardashin, A. Pervishko, D. Yudin, J. Biamonte et al., Quantum machine learning channel discrimination, preprint, arXiv:2206.09933 (2022)
  24. Y. LeCun, Y. Bengio and G. Hinton, Deep learning. Nature 521, 436–444 (2015)
  25. S. Lloyd, M. Schuld, A. Ijaz, J. Izaac and N. Killoran, Quantum embeddings for machine learning, preprint, arXiv:2001.03622 (2020)
  26. W. S. McCulloch and W. Pitts, A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115–133 (1943)
  27. K. Mitarai, M. Negoro, M. Kitagawa and K. Fujii, Quantum circuit learning. Phys. Rev. A 98, 032309 (2018)
  28. A. Novikov, D. Podoprikhin, A. Osokin and D. P. Vetrov, Tensorizing neural networks. Adv. Neural Inf. Process. Syst. 28 (2015)
  29. A. Novikov, M. Trofimov and I. Oseledets, Exponential machines, preprint, arXiv:1605.03795 (2016)
  30. R. Orús, Advances on tensor network theory: Symmetries, fermions, entanglement, and holography. Eur. Phys. J. B 87, 280 (2014)
  31. I. V. Oseledets, Tensor-train decomposition. SIAM J. Sci. Comput. 33, 2295–2317 (2011)
  32. R. Penrose, Applications of negative dimensional tensors. In Combinatorial Mathematics and its Applications (Proc. Conf., Oxford, 1969), Academic Press, London, 221–244 (1971)
  33. D. Perez-Garcia, F. Verstraete, M. M. Wolf and J. I. Cirac, Matrix product state representations. Quantum Inf. Comput. 7, 401–430 (2007)
  34. A. Pérez-Salinas, A. Cervera-Lierta, E. Gil-Fuster and J. I. Latorre, Data re-uploading for a universal quantum classifier. Quantum 4, 226 (2020)
  35. R. Rojas, The backpropagation algorithm. In Neural Networks, Springer, 149–182 (1996)
  36. J. Romero, J. P. Olson and A. Aspuru-Guzik, Quantum autoencoders for efficient compression of quantum data. Quantum Sci. Technol. 2, 045001 (2017)
  37. L. Rosasco, E. De Vito, A. Caponnetto, M. Piana and A. Verri, Are loss functions all the same? Neural Comput. 16, 1063–1076 (2004)
  38. F. Rosenblatt, The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 65, 386 (1958)
  39. B. Schölkopf, A. J. Smola, F. Bach et al., Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press (2002)
  40. M. Schuld, A. Bocharov, K. M. Svore and N. Wiebe, Circuit-centric quantum classifiers. Phys. Rev. A 101, 032308 (2020)
  41. M. Schuld and N. Killoran, Quantum machine learning in feature Hilbert spaces. Phys. Rev. Lett. 122, 040504 (2019)
  42. M. Schuld, R. Sweke and J. J. Meyer, Effect of data encoding on the expressive power of variational quantum-machine-learning models. Phys. Rev. A 103, 032430 (2021)
  43. H. Schulz and S. Behnke, Deep learning. KI-Künstliche Intelligenz 26, 357–363 (2012)
  44. P. Selinger, A survey of graphical languages for monoidal categories. In New Structures for Physics, Lecture Notes in Phys. 813, Springer, Heidelberg, 289–355 (2011)
  45. J. Shawe-Taylor, N. Cristianini et al., Kernel methods for pattern analysis. Cambridge University Press, Cambridge (2004)
  46. E. Stoudenmire and D. J. Schwab, Supervised learning with tensor networks. Adv. Neural Inf. Process. Syst. 29 (2016)
  47. A. V. Uvarov, A. S. Kardashin and J. D. Biamonte, Machine learning phase transitions with a quantum processor. Phys. Rev. A 102, 012415 (2020)

Cite this article

Richik Sengupta, Soumik Adhikary, Ivan Oseledets, Jacob Biamonte, Tensor networks in machine learning. Eur. Math. Soc. Mag. (2022), published online first

DOI 10.4171/MAG/101
This open access article is published by EMS Press under a CC BY 4.0 license, with the exception of logos and branding of the European Mathematical Society and EMS Press, and where otherwise noted.