-
In classical statistics the probability of one sample in a physical ensemble is determined by the energy of the sample, which is given by the Hamiltonian H of the model. Taking a spin system as an example, a configuration is represented by the set of spin values at every site,
$ \sigma=\{s_1, s_2, ..., s_N\} $, and the thermodynamic properties of the system at a certain temperature T are governed by the ensemble whose samples are distributed according to the Hamiltonian $H(\sigma)$ via the Boltzmann distribution $ P(\sigma)\propto\exp(-H(\sigma)/T). $
(1) Once the Hamiltonian of the system is known, one can obtain the ensemble by certain sampling methods, such as Markov chain Monte Carlo (MCMC). In this work we focus on the inverse problem, i.e., extracting the Hamiltonian, or equivalently estimating the probability, when an ensemble
$ \{\sigma^{(1)}, \sigma^{(2)}, ..., \sigma^{(M)}\} $ is known, where M is an integer large enough to form an ensemble. Clearly, it is difficult to estimate the high-dimensional probability distribution with traditional methods, owing to the so-called curse of dimensionality [16]. Fortunately the ANN provides a good solution to this problem [25, 27]. Such a network can be helpful in two aspects. Firstly, a numerical effective model can be constructed via this network in a more straightforward way. Explicitly, based on the original ensemble, each sample $\sigma^{(i)}$ can be reconstructed as an effective mode $\Sigma^{(i)}$, such as the vortex in the XY model, to obtain a correctly distributed ensemble [28]. With the network, the energy, or equivalently the probability, of $\Sigma^{(i)}$ can be extracted, and thus a numerical interaction of the effective mode is obtained, $ H(\Sigma)=-T \ln (P(\Sigma))+C $
(2) up to a constant C, which corresponds to the partition function. Secondly, if the ensemble can be measured in the laboratory, even in a more macroscopic degree of freedom, a numerical Hamiltonian can be constructed by the network. Once a Hamiltonian has been obtained, the properties of the system at different temperatures can be estimated by generating a new ensemble according to the Boltzmann distribution with the help of a standard sampling method. It should be noted that the temperature dependence is explicitly introduced as a physical prior for the construction of the numerical Hamiltonian with ANNs [26].
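As a minimal sketch of this forward sampling step, the following Python code draws an ensemble from $P\propto\exp(-H/T)$ with the Metropolis algorithm. The Hamiltonian here is a stand-in 1D Ising chain (an assumption for illustration, not the model used in this work); any numerical $H(\sigma)$, e.g. one extracted by a trained network, could be plugged in instead.

```python
import math
import random

def metropolis_sample(H, sigma0, T, n_sweeps, flip):
    """Generic Metropolis sampler: draws configurations from P ∝ exp(-H/T).

    H    -- callable returning the energy of a configuration; in the text this
            could be a network-extracted numerical Hamiltonian.
    flip -- callable proposing a local update at site i.
    """
    sigma = list(sigma0)
    E = H(sigma)
    samples = []
    for _ in range(n_sweeps):
        for i in range(len(sigma)):
            new = flip(sigma, i)
            E_new = H(new)
            # accept with probability min(1, exp(-(E_new - E)/T))
            if E_new <= E or random.random() < math.exp(-(E_new - E) / T):
                sigma, E = new, E_new
        samples.append(list(sigma))
    return samples

# Stand-in Hamiltonian: ferromagnetic 1D Ising chain with periodic boundary
# (a hypothetical example -- any learned H(sigma) could replace it).
def ising_H(s):
    return -sum(s[i] * s[(i + 1) % len(s)] for i in range(len(s)))

def spin_flip(s, i):
    new = list(s)
    new[i] = -new[i]
    return new

random.seed(0)
ens = metropolis_sample(ising_H, [1] * 16, T=1.0, n_sweeps=200, flip=spin_flip)
```

Because the sampler only ever calls `H` as a black box, replacing `ising_H` by a network's output changes nothing in the sampling loop.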
-
We further consider the quantum version of the above problem, i.e., based on several known ensembles at one or more temperatures, extracting the action by estimating the probability density of each sample. Quantum effects can be introduced into classical statistics either by treating the Hamiltonian as a quantum operator in
$ \exp(-{\hat H}/T) $, or by integrating over all possible evolution processes, which is known as the path-integral approach to quantum theory. At zero temperature, the weight of a quantum state is a complex factor, $\exp(-i S)$, which is inapplicable for real-valued ANNs. However, in the finite-temperature case the imaginary-time formalism compactifies the time direction onto a circle of circumference $\beta=T^{-1}$ in the imaginary time $\tau=i t$. In such a formalism the partition function is $ Z=\int D\Phi \exp(-S[\Phi]), $
(3) where the action is
$ S[\Phi]=\int_0^\beta d\tau d^3x[(\partial_\tau\Phi)^2+(\nabla\Phi)^2+ V(\Phi)] $ in general, if we only consider the usual local and Lorentz-covariant bosonic field system as an example. For fermionic cases a proper Hubbard-Stratonovich transformation can be applied to obtain a bosonic one. Although the temperature dependence appears more complicated than in the classical case, where it is $e^{-E/T}$, it is still tractable if the discretized formalism is written down explicitly as$ \begin{aligned}[b] S[\Phi]=\;&\sum \Delta\tau (\Delta x)^3\left[(\frac{\Delta \Phi}{\Delta \tau})^2+(\nabla\Phi)^2+V(\Phi)\right]\\ =\;&\sum (\Delta x)^3\left[\frac{(\Delta \Phi)^2}{\Delta \tau}+\Delta\tau((\nabla\Phi)^2+V(\Phi))\right]\\ =\;&\beta^{-1}K+\beta V \end{aligned}$
(4) where
$ \Delta\tau=\beta/N_\tau $, the first term is the kinetic part, denoted $K\equiv N_\tau\sum (\Delta x)^3{(\Delta \Phi)^2}$, and the second term includes all the time-independent terms, denoted $V\equiv N_\tau^{-1}\sum (\Delta x)^3{[(\nabla\Phi)^2+V(\Phi)]}$. The temperature dependences of these two terms are different and separable with regard to the quantum fields. It is evident that once the actions of a sample at two given temperatures can be estimated as$ \begin{aligned}[b]& S_1[\Phi]=\beta_1^{-1}K[\Phi]+\beta_1 V[\Phi]+C_1\\ &S_2[\Phi]=\beta_2^{-1}K[\Phi]+\beta_2 V[\Phi]+C_2, \end{aligned} $
(5) the terms K and V can be solved for, and the action at any third temperature is
$ S_3[\Phi]=\frac{\beta_1(\beta_3^2-\beta_2^2)}{\beta_3(\beta_1^2-\beta_2^2)}S_1+ \frac{\beta_2(\beta_1^2-\beta_3^2)}{\beta_3(\beta_1^2-\beta_2^2)}S_2+C_3, $
(6) where
$ C_3=\frac{\beta_1(\beta_2^2-\beta_3^2)}{\beta_3(\beta_1^2-\beta_2^2)}C_1+ \frac{\beta_2(\beta_3^2-\beta_1^2)}{\beta_3(\beta_1^2-\beta_2^2)}C_2. $
(7) Employing the ANNs twice, as in the classical case, we can extract the probability (up to a global constant) of any sample in a correctly distributed ensemble at two given temperatures. Equivalently, the temperature-independent terms K and V can be determined by feeding two ensembles, generated via certain sampling algorithms, into ANNs. The network is then trained to output the numerical action, or probability density, associated with each sample. Utilizing two such ensembles at distinct temperatures allows the interaction details to be encoded within two respective ANNs, denoted
$T_1$ and $T_2$. This methodology obviates the need for an analytical expression of the Lagrangian density. A schematic flowchart is shown in Fig. 1. This quantum version of the algorithm indicates that only two ensembles are required to reconstruct the interaction details of a system, although in principle the whole evolution history along the imaginary time contributes to the quantum partition function.
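To make the two-temperature trick concrete, the following sketch (with made-up values of K and V for one sample, and the constants $C_i$ set to zero for simplicity) solves the linear system of Eq. (5) for K and V, and checks that the combination in Eq. (6) reproduces the action at a third temperature.

```python
# Illustrative values only: a "true" K and V for one sample, constants C_i = 0.
K_true, V_true = 3.7, 1.2
b1, b2, b3 = 20.0, 40.0, 80.0                  # the three beta values used later

S = lambda b: K_true / b + b * V_true          # Eq. (4): S = beta^{-1} K + beta V
S1, S2 = S(b1), S(b2)

# Eq. (5) as a 2x2 linear system for (K, V), solved by Cramer's rule:
#   [1/b1  b1] [K]   [S1]
#   [1/b2  b2] [V] = [S2]
det = b2 / b1 - b1 / b2
K = (S1 * b2 - b1 * S2) / det
V = (S2 / b1 - S1 / b2) / det

# Eq. (6): predict S3 directly as a linear combination of S1 and S2.
den = b3 * (b1**2 - b2**2)
S3 = b1 * (b3**2 - b2**2) / den * S1 + b2 * (b1**2 - b3**2) / den * S2
```

Both routes agree: `S3` equals `S(b3)` built from the recovered K and V, which is exactly the statement of Eq. (6).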
-
A simple 1D quantum mechanical system suffices to test the above algorithm. The Lagrangian is
$ L=\frac{1}{2}\left(\frac{dx}{d\tau}\right)^{2}+V_{k}(x). $
(8) We introduce a standard double-well potential,
$ V_{k}(x)=\frac{\lambda_{k}}{4}\left(x^{2}-\frac{\mu_{k}^{2}}{\lambda_k}\right)^{2}. $
(9) Here
$\lambda_k$ is the coupling constant with mass dimension $[\lambda_k] = [m]^5$, and $\mu_k$ is a parameter with mass dimension $[\mu_k] = [m]^{3/2}$. Supposing the particle mass parameter (in the kinetic energy term) is m, we work with dimensionless quantities rescaled by proper powers of m (e.g., $\mu^2_k/m^3\rightarrow \mu^2_k$ and $\lambda_k/m^5\rightarrow \lambda_k$) throughout this paper. The properties of the system at finite temperature T can be described through the partition function$ \begin{aligned}[b] Z =\;& \int _{x(\beta)=x(0)}Dx\ e^{-S_E[x(\tau)]} =\int\prod\limits_{j=-N+1}^{N+1}\frac{dx_{j}}{\sqrt{2\pi a}}\\ & \times\exp\left\{-\sum\limits_{i=-N+1}^{N+1}\left[\frac{(x_{i+1}-x_{i})^{2}}{2a}+a{V_k(x_{i})}\right]\right\}, \end{aligned}$
(10) where
$ k_B = \hbar = 1 $, $\beta = 1/T$ is the inverse temperature, and $S_E$ is the Euclidean action in the imaginary-time formalism, given by$ \begin{aligned}[b] S_{E}[x(\tau)]=\;&\int_{0}^{\beta}d\tau\ \mathcal{L}_{E}[x(\tau)]\\ =\;&\int_{0}^{\beta}d\tau\ \left[\frac{1}{2}\left(\frac{dx}{d\tau}\right)^{2}+V_{k}(x)\right] . \end{aligned} $
(11) When the interaction coupling is not too large, the system is governed by the semi-classical solutions of the equation of motion. In order to derive a non-trivial semi-classical solution explicitly in the 0+1-dimensional field system, we consider the 1D quantum mechanical system with a Higgs-like interaction potential. This is a very popular potential to realize spontaneous symmetry breaking [32, 34, 35] in higher-dimensional systems, while for the 1D case no such mechanism exists. Instead, for such a system a type of tunneling solution known as the kink is a close analog of the instanton in QCD.
The semi-classical solutions are obtained by minimizing the action and solving the equation of motion. There are two nontrivial solutions, given by
$ x(\tau) = \pm\frac{\mu_k}{\sqrt{\lambda_k}} \tanh \left[\frac{\mu_{k}}{\sqrt{2}}(\tau -\tau_0)\right]. $
(12) These solutions approach
$\pm \mu_k/\sqrt{\lambda_k}$ at $\tau=\pm \infty$, which means the two solutions interpolate between the two minima over an infinitely long imaginary time. Such behavior is consistent with the ground state of the Schrödinger equation, which is supposed to peak at the middle of the two minima of the potential. The solutions with the plus and minus sign are called the kink and anti-kink solutions, respectively; they represent the tunneling process if one transforms the solution back to the real-time formulation. In the numerical simulation, as suggested in Ref. [36], we choose
$\lambda_k=4$ and $\mu_{k}/\sqrt{\lambda_{k}}=1.4$, and adopt the traditional MCMC for sampling at different temperatures [36]. We then use an ANN to estimate the probability of each sample and to reconstruct the Lagrangian density using two ensembles. In order to apply the MCMC sampling, the continuous $x(\tau)$ should first be discretized, and a finer lattice certainly requires a larger number of steps before convergence is achieved. As a practical example in our work, the simulations for different temperatures ($\beta = T^{-1} = 80, 40, 20$) were done with the same lattice size (256) and the same number of sweeps ($N_{MC} = 5 \times 10^6$). After discarding the first 50,000 steps, we chose 10,000 configurations randomly from the remaining sequence as the training set and another 10,000 samples as the testing set for each β.
-
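A compact sketch of this sampling step might look as follows. It uses a smaller lattice (64 sites instead of 256) and far fewer sweeps than the production runs quoted above, purely for illustration, and the Metropolis step size is an arbitrary choice; the local action matches the discretization in Eq. (10), with the potential minima at $\pm\mu_k/\sqrt{\lambda_k}=\pm 1.4$.

```python
import math
import random

lam, f = 4.0, 1.4                 # lambda_k and f = mu_k / sqrt(lambda_k)
V = lambda x: lam / 4.0 * (x * x - f * f) ** 2   # double well, minima at +/- f

def local_action(x, i, a):
    """Terms of the discretized Euclidean action that involve site i."""
    n = len(x)
    xm, xp = x[(i - 1) % n], x[(i + 1) % n]      # periodic in imaginary time
    return ((x[i] - xm) ** 2 + (xp - x[i]) ** 2) / (2 * a) + a * V(x[i])

def sweep(x, a, step=0.5):
    """One Metropolis sweep over all lattice sites."""
    for i in range(len(x)):
        old, s_old = x[i], local_action(x, i, a)
        x[i] = old + random.uniform(-step, step)
        if random.random() >= math.exp(min(0.0, s_old - local_action(x, i, a))):
            x[i] = old                            # reject the proposal

random.seed(1)
beta, n_tau = 20.0, 64            # toy lattice, much smaller than the paper's
a = beta / n_tau                  # lattice spacing in imaginary time
x = [0.0] * n_tau                 # cold start
for _ in range(2000):
    sweep(x, a)
```

After thermalization the sites fluctuate around the two wells, and occasional kink/anti-kink segments connect them, as described in the text.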
The CAN is a suitable neural network for extracting the probability density of each sample with continuous degrees of freedom [20, 25]. Here we adopt it to learn the probability density of each sample in quantum systems. The algorithm is constructed according to the Maximum Likelihood Estimate (MLE), which is employed to estimate the probability density in an unsupervised manner [37]. Two basic properties of a probability should be satisfied. Firstly, the given probability is positive. Secondly, two similar configurations give similar, continuously varying values of the probability. To achieve these two properties in the CAN, we propose factorizing the whole probability of a sample as the product of conditional probabilities at each site and using an appropriate mixture of Beta distributions as the prior probability to ensure the positivity and continuity requirements. The Beta distribution
$ {\mathcal B}(a, b) $ with two parameters a and b is defined as continuous within a finite interval. Therefore, the output layers of the neural network are designed to have two channels, one for each parameter, and the conditional probabilities at each site are expressed as [38]$ f_{\theta}(s_{i}|s_{1}, \cdots, s_{i- 1})=\frac{\Gamma(a_{i}+b_{i})}{\Gamma(a_{i})\Gamma(b_{i})}s_{i}^{a_{i}-1}(1-s_{i})^{b_{i}-1}, $
(13) where
$\Gamma(a)$ is the gamma function, $\{\theta\}$ is the set of trainable parameters of the networks, and $s_i$ is the configuration of the system at site i. The outputs of the hidden layers are $a \equiv (a_1, a_2, \ldots)$ and $b \equiv (b_1, b_2, \ldots)$. The conditional probability is realized by adding a mask that veils all the sites whose indices are larger than a given position before a sample is passed to the convolutional layers. This setup agrees with the locality of microscopic interactions and is capable of preserving any high-order interactions from a restricted Boltzmann machine perspective [39]. With such a conditional probability Ansatz, the joint probability of a sample is$ q_{\theta}(s)=\prod\limits_{i = 1}^{N}f_{\theta}(s_{i}|s_{1}, \cdots, s_{i-1}). $
(14) The loss function used in training is designed by maximizing the probability of the ensemble (training set) according to the MLE principle, i.e., the most physical ensemble is the most likely to emerge. Hence the loss is the negative logarithm of the mixture distribution obtained from the network,
$ L=-\sum\limits_{s\sim q_{\rm data}}\log(q_{\theta}(s)) $
(15) and the Adam optimizer is used to minimize this loss. Within this framework, the problem of density estimation is converted into finding the
$\{a_i\}$ and $\{b_i\}$ for each site that work for all the samples in the ensemble. In this way we have sidestepped finding a different (unknown) probability for each sample, while achieving the positivity and similarity of the sample probability. The only Ansatz is the way of factorizing the whole probability of a sample. It would work even better if the explicit form of the conditional probability (the Beta distribution in this work) were chosen to have better expressive ability. In an unsupervised manner, the training process is equivalent to finding the correct probability of each sample. The only training data we need are two physically distributed ensembles at different temperatures; no analytical Lagrangian density expression is required by the CAN. These training data are obtained via the traditional MCMC simulation [33].
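The pieces above, Eqs. (13)-(15), fit in a few lines of Python. Here `toy_net` is a hypothetical stand-in for the trained CAN: a real network maps the masked prefix values to $(a_i, b_i)$, while this one only looks at the prefix length. Site values are assumed to be rescaled into the open interval (0, 1), as the Beta distribution requires.

```python
import math

def beta_logpdf(s, a, b):
    """Log of the conditional Beta density of Eq. (13) at s in (0, 1)."""
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1) * math.log(s) + (b - 1) * math.log(1 - s))

def joint_log_prob(sample, net):
    """Eq. (14): log q_theta(s) = sum_i log f_theta(s_i | s_1, ..., s_{i-1}).

    `net` maps the masked prefix (s_1, ..., s_{i-1}) to the Beta parameters
    (a_i, b_i) of site i; the mask is emulated here by slicing the sample.
    """
    return sum(beta_logpdf(s_i, *net(sample[:i])) for i, s_i in enumerate(sample))

def nll_loss(ensemble, net):
    """Eq. (15): L = -sum over training samples of log q_theta(s)."""
    return -sum(joint_log_prob(s, net) for s in ensemble)

# Hypothetical stand-in for the CAN: Beta parameters depend only on the
# prefix length, not on the prefix values as in the real network.
toy_net = lambda prefix: (2.0 + len(prefix), 2.0)

ensemble = [[0.3, 0.6, 0.5], [0.4, 0.5, 0.7]]
loss = nll_loss(ensemble, toy_net)   # Adam would minimize this over net weights
```

In the real setup the free quantities are the network weights $\theta$ behind $(a_i, b_i)$, and Adam drives `loss` down over the MCMC-generated training set; the log-density returned by `joint_log_prob` is exactly the quantity identified with $-S$ up to a constant.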
-
In this section we show the training and validation stages of the CANs. It will be seen that the CANs succeed in reproducing the correct action of most configurations at different temperatures without knowing the analytical form of the Lagrangian density (i.e., unsupervised). To train the networks we have prepared three ensembles at
$\beta=T^{-1}=$ 20, 40 and 80 with the MCMC simulation. We train the CANs to estimate the probability of each sample in these three ensembles for 10,000 epochs. From Figure 2 we find that the $\beta=40$ case converges fastest, while the $\beta=20$ and $\beta=80$ cases converge more slowly. This means the network is good at learning an ensemble containing different kinds of configurations, because then both the kinetic and potential terms contribute to the averaged loss equally. The two limiting cases are harder to learn because the network focuses on the kinetic or potential part alone, so the loss from the other part is more difficult to reduce. Nevertheless, with a large enough number of epochs (larger than 2000) all the networks have converged. The CANs estimate the probability well, as shown in Figure 3. The first column shows the histograms of both the analytical (blue) and CAN-output (red) actions on the test data. Because the constant in Eq. (6) cannot be determined, we have shifted the histogram of the CAN by hand to place the peaks at the same position, and the same shifting constant is adopted in the further comparisons. It is clear that they almost coincide with each other. To show the estimating ability, a sample-wise comparison of these two actions is shown in the second column. Up to the above-mentioned constant, the two actions fall on a straight line whose slope equals 1. This means the CANs reproduce not only the correct distribution (histogram) of the action for a given ensemble but, more importantly, the correct action for each sample. The third column shows typical configurations in the ensembles at each temperature. At larger β (low temperature) a configuration appears as a chain of multiple kinks and anti-kinks. When the temperature increases, the number of kinks and anti-kinks reduces and eventually vanishes. This behavior agrees with our expectation and means the ensembles are correctly distributed.
From these results it is clear that the CANs have the capacity to learn the probability distribution at a single temperature, and they successfully learned most of the samples' actions up to a network-dependent constant without introducing the analytical Lagrangian density into the CANs.
Figure 3. (color online) (a)-(c) Comparisons of the action distributions from MCMC (analytical) and CANs; (d)-(f) comparisons of the action on testing data by MC and CANs; (g)-(i) Monte Carlo configurations
$x_i$ for $\beta=$ 20, 40, 80, respectively. Once two networks, corresponding to two different temperatures, have been trained, the action of each sample in a third ensemble at a different temperature can be extracted with Eq. (6), up to a sample-independent but network-dependent constant. In the next section we will use networks trained at two of the above three temperatures to predict the sample actions at the third temperature.
-
With the trained networks we are ready to estimate actions at one temperature from those at two other temperatures. Pretending to have neither the network nor the analytical Lagrangian density at
$\beta_3$, we find the action of an arbitrary sample at $\beta_3$ by making use of the two trained neural networks at $\beta_1$ and $\beta_2$, according to the procedure presented in Section 3 using Eq. (6). As we have already obtained three networks at three temperatures, we choose any two of them as $\beta_1$ and $\beta_2$ to predict the remaining one, $\beta_3$, whose network becomes idle. The predicted total action, the kinetic part (proportional to $\beta^{-1}$) and the potential part (proportional to β) of the 2-to-1 prediction, together with the analytical actions, are shown in the first, second and third columns, respectively, of Figure 4. Clearly the first ($\beta_{40, 80}$ to $\beta_{20}$) and last ($\beta_{20, 40}$ to $\beta_{80}$) rows are extrapolation cases, while the second ($\beta_{20, 80}$ to $\beta_{40}$) is interpolation. In these figures we list two quantities, $R^2$ and $\bar{D}$, to show the predicting ability of the CANs. $R^2$ is the mean square of the prediction error, $R^2=\langle{(S_{True}-S_{CANs})^2}\rangle$, and $\bar{D}$ is the mean distance to the red line with slope 1. We find that the interpolation case, i.e. $\beta_{20, 80}$ to $\beta_{40}$, works best.
Figure 4. (color online) (a)-(c) Comparisons of the action on testing data by MC and the prediction net; (d)-(f) the same comparison for the kinetic part; (g)-(i) the same comparison for the potential part.
The reason can be understood by examining the kinetic and potential part distributions. In the
$\beta_{40, 80}$ to $\beta_{20}$ case (first row), the $\beta_{40}$ and $\beta_{80}$ ensembles are dominated by multi-kink and anti-kink configurations, whose kinetic parts are relatively larger because of the larger derivative at the jumps, i.e., from $\pm 2$ to $\mp 2$ in our computation. However, the predicted $\beta_{20}$ ensemble has fewer kinks/anti-kinks, which makes the kinetic part estimation not good enough. On the other hand, in the $\beta_{20, 40}$ to $\beta_{80}$ case (third row), the $\beta_{80}$ ensemble has more multi-kinks/anti-kinks, i.e., more sites with values equal to $\pm 2$, while in the $\beta_{20, 40}$ ensembles more sites have absolute values less than 2. This makes the potential part estimation not good enough. In the interpolation case, the training data include ensembles at both low and high temperature, which makes the training data cover more kinds of configurations, so both the kinetic and potential parts work well in the $\beta_{20, 80}$ to $\beta_{40}$ case. Physically, the low- and high-temperature ensembles typically correspond to distinct phases of the system. Once the network has assimilated this information, it is equipped to predict the system's behavior at any intermediate temperature. This capability is crucial for a detailed exploration of the phase diagram.
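A complementary numerical observation, consistent with but not stated in the text, is that the combination coefficients of Eq. (6) themselves behave very differently in the three cases. The short sketch below evaluates them for the three 2-to-1 predictions used above:

```python
def eq6_coeffs(b1, b2, b3):
    """Coefficients of S1 and S2 in Eq. (6) for predicting S3."""
    den = b3 * (b1**2 - b2**2)
    return b1 * (b3**2 - b2**2) / den, b2 * (b1**2 - b3**2) / den

for b1, b2, b3 in [(40, 80, 20), (20, 80, 40), (20, 40, 80)]:
    c1, c2 = eq6_coeffs(b1, b2, b3)
    print(f"({b1},{b2}) -> {b3}: c1 = {c1:+.2f}, c2 = {c2:+.2f}")
# prints:
# (40,80) -> 20: c1 = +2.50, c2 = -1.00
# (20,80) -> 40: c1 = +0.40, c2 = +0.40
# (20,40) -> 80: c1 = -1.00, c2 = +2.50
```

The interpolation combines the two network estimates with small positive weights, while each extrapolation multiplies one network's output by 2.5 and subtracts the other, so independent per-sample estimation errors are amplified rather than averaged. This numerical amplification complements the configuration-coverage argument given above.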
Building imaginary-time thermal field theory with artificial neural networks
- Received Date: 2024-05-12
- Available Online: 2024-07-01
Abstract: In this study, we introduce a novel approach in quantum field theories to estimate the action using the artificial neural networks (ANNs). The estimation is achieved by learning on system configurations governed by the Boltzmann factor,