论文提出了一种新的深度神经网络模型家族，并没有规定一个离散的隐藏层序列，而是使用神经网络将隐藏状态的导数参数化。然后使用黑箱微分方程求解器计算该网络的输出。这些连续深度模型的内存成本是固定的，它们根据输入调整评估策略，并显式地用数值精度来换取运算速度。论文展示了连续深度残差网络和连续时间隐变量模型的特性。此外，还构建了连续的归一化流，一个可以使用最大似然方法训练的生成模型，无需对数据维度进行分割或排序。至于训练，论文展示了如何基于任意 ODE 求解器进行可扩展的反向传播，无需访问 ODE 求解器的内部操作。这使得在较大模型内也可以实现 ODE 的端到端训练。
 64 次浏览
 0 次分享
 1 次下载
注脚
1. Neural Ordinary Differential Equations Ricky T. Q. Chen*, Yulia Rubanova*, Jesse Bettencourt*, David Duvenaud University of Toronto, Vector Institute Toronto, Canada {rtqichen, rubanova, jessebett, duvenaud}@cs.toronto.edu arXiv:1806.07366v2 [cs.LG] 3 Oct 2018 Abstract We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a black box differential equation solver. These continuousdepth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed. We demonstrate these properties in continuousdepth residual networks and continuoustime latent variable models. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows endtoend training of ODEs within larger models. 1 Introduction Residual Network ODE Network 5 5 Models such as residual networks, recurrent neural network decoders, and normalizing flows build com 4 4 plicated transformations by composing a sequence of transformations to a hidden state: 3 3 Depth Depth ht+1 = ht + f (ht , θt ) (1) 2 2 where t ∈ {0 . . . T } and ht ∈ RD . These iterative updates can be seen as an Euler discretization of a 1 1 continuous transformation (Lu et al., 2017; Haber 0 0 and Ruthotto, 2017; Ruthotto and Haber, 2018). 5 0 5 5 0 5 Input/Hidden/Output Input/Hidden/Output What happens as we add more layers and take smaller Figure 1: Left: A Residual network defines a steps? In the limit, we parameterize the continuous discrete sequence of finite transformations. dynamics of hidden units using an ordinary differen Right: A ODE network defines a vector tial equation (ODE) specified by a neural network: field, which continuously transforms the state. dh(t) = f (h(t), t, θ) (2) Both: Circles represent evaluation locations. dt Starting from the input layer h(0), we can define the output layer h(T ) to be the solution to this ODE initial value problem at some time T . This value can be computed by a blackbox differential equation solver, which evaluates the hidden unit dynamics f wherever necessary to determine the solution with the desired accuracy. Figure 1 contrasts these two approaches. Defining and evaluating models using ODE solvers has several benefits: Memory efficiency In Section 2, we show how to compute gradients of a scalarvalued loss with respect to all inputs of any ODE solver, without backpropagating through the operations of the solver. Not storing any intermediate quantities of the forward pass allows us to train our models with nearly constant memory cost as a function of depth, a major bottleneck of training deep models. *Equal contribution.
2.Adaptive computation Euler’s method is perhaps the simplest method for solving ODEs. There have since been more than 120 years of development of efficient and accurate ODE solvers (Runge, 1895; Kutta, 1901; Hairer et al., 1987). Modern ODE solvers provide guarantees about the growth of approximation error, monitor the level of error, and adapt their evaluation strategy on the fly to achieve the requested level of accuracy. This allows the cost of evaluating a model to scale with problem complexity. After training, accuracy can be reduced for realtime or lowpower applications. Parameter efficiency When the hidden unit dynamics are parameterized as a continuous function of time, the parameters of nearby “layers” are automatically tied together. In Section 3, we show that this reduces the number of parameters required on a supervised learning task. Scalable and invertible normalizing flows An unexpected sidebenefit of continuous transforma tions is that the change of variables formula becomes easier to compute. In Section 4, we derive this result and use it to construct a new class of invertible density models that avoids the singleunit bottleneck of normalizing flows, and can be trained directly by maximum likelihood. Continuous timeseries models Unlike recurrent neural networks, which require discretizing observation and emission intervals, continuouslydefined dynamics can naturally incorporate data which arrives at arbitrary times. In Section 5, we construct and demonstrate such a model. 2 Reversemode automatic differentiation of ODE solutions The main technical difficulty in training continuousdepth networks is performing reversemode differentiation (also known as backpropagation) through the ODE solver. Differentiating through the operations of the forward pass is straightforward, but incurs a high memory cost and introduces additional numerical error. We treat the ODE solver as a black box, and compute gradients using the adjoint method (Pontryagin et al., 1962). We present a modern proof of this method in Appendix B. This approach computes gradients by solving a second, augmented ODE backwards in time, and is applicable to all ODE solvers. This approach scales linearly with problem size, has low memory cost, and explicitly controls numerical error. Consider optimizing a scalarvalued loss function L(), whose input is the result of an ODE solver: t1 L(z(t1 )) = L f (z(t), t, θ)dt = L (ODESolve(z(t0 ), f, t0 , t1 , θ)) (3) t0 To optimize L, we require gradients with respect to its parameters: z(t0 ), t0 , t1 , and θ. The first step is to determining how the gradient of the loss depends on the hidden state z(t) at each instant. This quantity is called the adjoint a(t) = −∂L/∂z(t). Its dynamics are given by another ODE, which can be thought of as the instantaneous analog of the chain rule: da(t) ∂f (z(t), t, θ) = −a(t)T (4) dt ∂z We can compute ∂L/∂z(t0 ) by another call to an ODE solver. This solver must run backwards, starting from the “initial value” of ∂L/∂z(t1 ). State One complication is that solving this ODE re Adjoint State quires the knowing value of z(t) along its entire trajectory. However, we can simply recompute z(t) backwards in time together with the adjoint, starting from its final value z(t1 ). Computing the gradients with respect to the pa rameters θ requires evaluating a third integral, Figure 2: Reversemode differentiation through an which depends on both z(t) and a(t): ODE solver requires solving an augmented system t0 dL ∂f (z(t), t, θ) backwards in time. This adjoint state is updated by = a(t)T dt (5) dθ t1 ∂θ the gradient at each observation. 2
3.All three of these integrals can be computed in a single call to an ODE solver, which concatenates the original state, the adjoint, and the other partial derivatives into a single vector. Algorithm 1 shows how to construct the necessary dynamics, and call an ODE solver to compute all gradients at once. Algorithm 1 Reversemode derivative of an ODE initial value problem Input: dynamics parameters θ, start time t0 , stop time t1 , final state z(t1 ), loss gradient ∂L/∂z(t1 ) ∂L ∂L T ∂t1 = ∂z(t1 ) f (z(t1 ), t1 , θ) Compute gradient w.r.t. t1 ∂L ∂L s0 = [z(t1 ), ∂z(t 1) , 0, − ∂t1 ] Define initial augmented state def aug_dynamics([z(t), a(t), −, −], t, θ): Define dynamics on augmented state return [f (z(t), t, θ), −a(t)T ∂f ∂z , −a(t)T ∂f ∂θ , −a(t)T ∂f ∂t ] Concatenate timederivatives ∂L ∂L ∂L [z(t0 ), ∂z(t0 ) , ∂θ , ∂t0 ] = ODESolve(s0 , aug_dynamics, t1 , t0 , θ) Solve reversetime ODE ∂L ∂L ∂L ∂L return ∂z(t 0) , , ∂θ ∂t0 ∂t1, Return all gradients Most ODE solvers have the option to output the state z(t) at multiple times. When the loss depends on these intermediate states, the reversemode derivative must be broken into a sequence of separate solves, one between each consecutive pair of outputs (Figure 2). At each observation, the adjoint must be adjusted by the corresponding direct gradient ∂L/∂z(ti ). The results above extend those of Stapor et al. (2018, section 2.4.2). Detailed derivations are provided in Appendix B. Appendix C provides Python code which computes all derivatives for scipy.integrate.odeint by extending the autograd automatic differentiation package. This code also supports all higherorder derivatives. 3 Replacing residual networks with ODEs for supervised learning In this section, we experimentally investigate the training of neural ODEs for supervised learning. Software To solve ODE initial value problems numerically, we use the implicit Adams method implemented in LSODE and VODE and interfaced through the scipy.integrate package. Being an implicit method, it has better guarantees than explicit methods such as RungeKutta but requires solving a nonlinear optimization problem at every step. This setup makes direct backpropagation through the integrator difficult. We implement the adjoint sensitivity method in Python’s autograd framework (Maclaurin et al., 2015). For the experiments in this section, we evaluated the hidden state dynamics and their derivatives on the GPU using Tensorflow, which were then called from the Fortran ODE solvers, which were called from Python autograd code. Model Architectures We experiment with a small residual network which downsamples the input twice then applies 6 residual blocks, which are replaced by an ODESolve module in the ODENet variant. We also test a network which directly backpropagates through a RungeKutta integrator, referred to as RKNet. Table 1 shows test error, number of parameters, and memory cost. L denotes the number of lay Table 1: Performance on MNIST. † From LeCun ers in the ResNet, and L ˜ is the number of func et al. (1998). tion evaluations that the ODE solver requests in a single forward pass, which can be interpreted Test Error # Params Memory Time as an implicit number of layers. 1Layer MLP † 1.60% 0.24 M   We find that ODENets and RKNets can ResNet 0.41% 0.60 M O(L) O(L) RKNet 0.47% 0.22 M ˜ O(L) ˜ O(L) achieve around the same performance as the ˜ ODENet 0.42% 0.22 M O(1) O(L) ResNet, while using fewer parameters. For ref erence, a neural net with a single hidden layer of 300 units has around the same number of parameters as the ODENet and RKNet architecture that we tested. Error Control in ODENets ODE solvers can approximately ensure that the output is within a given tolerance of the true solution. Changing this tolerance changes the behavior of the network. 3
4.We first verify that error can indeed be controlled in Figure 3a. The time spent by the forward call is proportional to the number of function evaluations (Figure 3b), so tuning the tolerance gives us a tradeoff between accuracy and computational cost. One could train with high accuracy, but switch to a lower accuracy at test time. Figure 3: Statistics of a trained ODENet. (NFE = number of function evaluations.) Figure 3c) shows a surprising result: the number of evaluations in the backward pass is roughly half of the forward pass. This suggests that the adjoint sensitivity method is not only more memory efficient, but also more computationally efficient than directly backpropagating through the integrator, because the latter approach will need to backprop through each function evaluation in the forward pass. Network Depth It’s not clear how to define the ‘depth‘ of an ODE solution. A related quantity is the number of evaluations of the hidden state dynamics required, a detail delegated to the ODE solver and dependent on the initial state or input. Figure 3d shows that he number of function evaluations increases throughout training, presumably adapting to increasing complexity of the model. 4 Continuous Normalizing Flows The discretized equation (1) also appears in normalizing flows (Rezende and Mohamed, 2015) and the NICE framework (Dinh et al., 2014). These methods use the change of variables theorem to compute exact changes in probability if samples are transformed through a bijective function f : ∂f z1 = f (z0 ) =⇒ log p(z1 ) = log p(z0 ) − log det (6) ∂z0 An example is the planar normalizing flow (Rezende and Mohamed, 2015): ∂h z(t + 1) = z(t) + uh(wT z(t) + b), log p(z(t + 1)) = log p(z(t)) − log 1 + uT (7) ∂z Generally, the main bottleneck to using the change of variables formula is computing of the deter minant of the Jacobian ∂f/∂z, which has a cubic cost in either the dimension of z, or the number of hidden units. Recent work explores the tradeoff between the expressiveness of normalizing flow layers and computational cost (Kingma et al., 2016; Tomczak and Welling, 2016; Berg et al., 2018). Surprisingly, moving from a discrete set of layers to a continuous transformation simplifies the computation of the change in normalizing constant: Theorem 1 (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable with probability p(z(t)) dependent on time. Let dz dt = f (z(t), t) be a differential equation describing a continuousintime transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z and continuous in t, then the change in log probability also follows a differential equation, ∂ log p(z(t)) df = −tr (8) ∂t dz(t) Proof in Appendix A. Instead of the log determinant in (6), we now only require a trace operation. Also unlike standard finite flows, the differential equation f does not need to be bijective, since if uniqueness is satisfied, then the entire transformation is automatically bijective. 4
5.As an example application of the instantaneous change of variables, we can examine the continuous analog of the planar flow, and its change in normalization constant: dz(t) ∂ log p(z(t)) ∂h = uh(wT z(t) + b), = −uT (9) dt ∂t ∂z(t) Given an initial distribution p(z(0)), we can sample from p(z(t)) and evaluate its density by solving this combined ODE. Using multiple hidden units with linear cost While det is not a linear function, the trace function is, which implies tr( n Jn ) = n tr(Jn ). Thus if our dynamics is given by a sum of functions then the differential equation for the log density is also a sum: M M dz(t) d log p(z(t)) ∂fn = fn (z(t)), = tr (10) dt n=1 dt n=1 ∂z This means we can cheaply evaluate flow models having many hidden units, with a cost only linear in the number of hidden units M . Evaluating such ‘wide’ flow layers using standard normalizing flows costs O(M 3 ), meaning that standard NF architectures use many layers of only a single hidden unit. Timedependent dynamics We can specify the parameters of a flow as a function of t, making the differential equation f (z(t), t) change with t. This is parameterization is a kind of hypernetwork (Ha et al., 2016). We also introduce a gating mechanism for each hidden unit, dz dt = n σn (t)fn (z) where σn (t) ∈ (0, 1) is a neural network that learns when the dynamic fn (z) should be applied. We call these models continuous normalizing flows (CNF). 4.1 Experiments with Continuous Normalizing Flows We first compare continuous and discrete planar flows at learning to sample from a known distribution. We show that a planar CNF with M hidden units can be at least as expressive as a planar NF with K = M layers, and sometimes much more expressive. Density matching We configure the CNF as described above, and train for 10,000 iterations using Adam (Kingma and Ba, 2014). In contrast, the NF is trained for 500,000 iterations using RMSprop (Hinton et al., 2012), as suggested by Rezende and Mohamed (2015). For this task, we minimize KL (q(x) p(x)) as the loss function where q is the flow model and the target density p(·) can be evaluated. Figure 4 shows that CNF generally achieves lower loss. Maximum Likelihood Training A useful property of continuoustime normalizing flows is that we can compute the reverse transformation for about the same cost as the forward pass, which cannot be said for normalizing flows. This lets us train the flow on a density estimation task by performing K=2 K=8 K=32 M=2 M=8 M=32 CNF NF 1 10 20 30 CNF NF 2 10 20 30 CNF NF 3 10 20 30 (a) Target (b) NF (c) CNF (d) Loss vs. K/M Figure 4: Comparison of normalizing flows versus continuous normalizing flows. The model capacity of normalizing flows is determined by their depth (K), while continuous normalizing flows can also increase capacity by increasing width (M), making them easier to train. 5
6. 5% 20% 40% 60% 80% 100% 5% 20% 40% 60% 80% 100% Samples Density Samples Density Target Target NF NF (a) Two Circles (b) Two Moons Figure 5: Visualizing the transformation from noise to data. Continuoustime normalizing flows are reversible, so we can train on a density estimation task and still be able to sample from the learned density efficiently. maximum likelihood estimation, which maximizes Ep(x) [log q(x)] where q(·) is computed using the appropriate change of variables theorem, then afterwards reverse the CNF to generate random samples from q(x). For this task, we use 64 hidden units for CNF, and 64 stacked onehiddenunit layers for NF. Figure 5 shows the learned dynamics. Instead of showing the initial Gaussian distribution, we display the transformed distribution after a small amount of time which shows the locations of the initial planar flows. Interestingly, to fit the Two Circles distribution, the CNF rotates the planar flows so that the particles can be evenly spread into circles. While the CNF transformations are smooth and interpretable, we find that NF transformations are very unintuitive and this model has difficulty fitting the two moons dataset in Figure 5b. 5 A generative latent function timeseries model Applying neural networks to irregularlysampled data such as medical records, network traffic, or neural spiking data is difficult. Typically, observations are put into bins of fixed duration, and the latent dynamics are discretized in the same way. This leads to difficulties with missing data and ill defined latent variables. Missing data can be addressed using generative timeseries models (Álvarez and Lawrence, 2011; Futoma et al., 2017; Mei and Eisner, 2017; Soleimani et al., 2017a) or data imputation (Che et al., 2018). Another approach concatenates timestamp information to the input of an RNN (Choi et al., 2016; Lipton et al., 2016; Du et al., 2016; Li, 2017). We present a continuoustime, generative approach to modeling time series. Our model represents each time series by a latent trajectory. Each trajectory is determined from a local initial state, zt0 , and a global set of latent dynamics shared across all time series. Given observation times t0 , t1 , . . . , tN and an initial state zt0 , an ODE solver produces zt1 , . . . , ztN , which describe the latent state at each observation.We define this generative model formally through a sampling procedure: zt0 ∼ p(zt0 ) (11) zt1 , zt2 , . . . , ztN = ODESolve(zt0 , f, θf , t0 , . . . , tN ) (12) each xti ∼ p(xzti , θx ) (13) Function f is a timeinvariant function that takes the value z at the current time step and outputs the gradient: ∂z(t)/∂t = f (z(t), θf ). We parametrize this function using a neural net. Because f is time invariant, given any latent state z(t), the entire latent trajectory is uniquely defined. Extrapolating this latent trajectory lets us make predictions arbitrarily far forwards or backwards in time. Training and Prediction We can train this latentvariable model as a variational autoen coder (Kingma and Welling, 2014; Rezende et al., 2014), with sequencevalued observations. Our recognition net is an RNN, which consumes the data sequentially backwards in time, and out puts qφ (z0 x1 , x2 , . . . , xN ). A detailed algorithm can be found in Appendix D. Using ODEs as a Animated Figures 4 and 5 available at: http://www.cs.toronto.edu/~jessebett/nodes/ 6
7. ODE Solve(zt0 , f, ✓f , t0 , ..., tM ) RNN encoder q(zt0 xt0 ...xtN ) zt 1 ztN +1 z ht 0 ht 1 ht N µ zt 0 zt N tM … ~ Latent space Data space Time x(t) x ˆ(t) t0 t1 tN tN +1 tM t0 t1 tN tN +1 tM Observed Unobserved Prediction Extrapolation Figure 6: Computation graph of the latent ODE model. generative model allows us to make predictions for arbitrary time points t1 ...tM on a continuous timeline. Poisson Process likelihoods The fact that an observation oc curred often tells us something about the latent state. For ex λ(t) ample, a patient may be more likely to take a medical test if they are sick. The rate of events can be parameterized by a function of the latent state: p(event at time t z(t)) = λ(z(t)). Given this rate function, the likelihood of a set of indepen dent observation times in the interval [tstart , tend ] is given by an t inhomogeneous Poisson process (Palm, 1943): Figure 7: Fitting a latent ODE dy N namics model with a Poisson pro tend cess likelihood. Dots show event log p(t1 . . . tN  tstart , tend ) = log λ(z(ti )) − λ(z(t))dt times. The line is the learned inten i=1 tstart sity λ(t) of the Poisson process. We can parameterize λ(·) using another neural network. Con veniently, we can evaluate both the latent trajectory and the Poisson process likelihood together in a single call to an ODE solver. Figure 7 shows the event rate learned by such a model on a toy dataset. A Poisson process likelihood on observations can be combined with an event rate can easily be combined with a data likelihood to give a joint distribution over observations and their times. 5.1 Timeseries Latent ODE Experiments We investigate the ability of the latent ODE model to fit and extrapolate time series. The recognition network is an RNN with 25 hidden units. We use a 4dimensional latent space. We parameterize the dynamics function f with a onehiddenlayer network with 20 hidden units. The decoder computing p(xti zti ) is another neural network with one hidden layer with 20 hidden units. Our baseline was a recurrent neural net with 25 hidden units trained to minimize negative Gaussian loglikelihood. We trained a second version of this RNN whose inputs were concatenated with the time difference to the next observation to aid RNN with irregular observations. Bidirectional spiral dataset We generated a dataset of 1000 2dimensional spirals, each starting at a different point, sampled at 100 equallyspaced timesteps. The dataset contains two types of spirals: half are clockwise while the other half counterclockwise. To make the task more realistic, we add gaussian noise to the observations. Time series with irregular time points To Table 2: Predictive RMSE on test set generate irregular timestamps, we randomly sample points from each trajectory without re # Observations 30/100 50/100 100/100 placement (n = {30, 50, 100}). We report pre RNN 0.3937 0.3202 0.1813 dictive rootmeansquared error (RMSE) on 100 Latent ODE 0.1642 0.1502 0.1346 time points extending beyond those that were used for training. Table 2 shows that the latent ODE has substantially lower predictive RMSE. 7
8.Figure 9: Dataspace trajectories decoded from varying one dimension of zt0 . Color indicates progression through time, starting at purple and ending at red. Note that the trajectories on the left are counterclockwise, while the trajectories on the right are clockwise. Figure 8 shows examples of spiral reconstructions with 30 subsampled points. Reconstructions from the latent ODE were obtained by sampling from the posterior over latent trajectories and decoding it to dataspace. Examples with varying number of time points are shown in Appendix E. We observed that reconstructions and extrapolations are consistent with the ground truth regardless of number of observed points and despite the noise. Latent space interpolation Figure 8c shows latent trajectories projected onto the first two dimensions of the latent space. The trajecto ries form two separate clusters of trajectories, one decoding to clockwise spirals, the other to counterclockwise. Figure 9 shows that the la tent trajectories change smoothly as a function of the initial point z(t0 ), switching from a clock wise to a counterclockwise spiral. (a) Recurrent Neural Network 6 Scope and Limitations Minibatching Second, the use of mini batches is less straightforward than for stan dard neural networks. One can still batch to gether evaluations through the ODE solver by (b) Latent Neural Ordinary Differential Equation concatenating the states of each batch element Ground Truth together, creating a combined ODE with dimen Observation sion D × K. In some cases, controlling error on Prediction all batch elements together might require eval Extrapolation uating the combined system K times more of ten than if each system was solved individually, leading to a more than O(K) cost to evaluate K (c) Latent Trajectories batch elements. However, in practice the num ber of evaluations did not increase substantially Figure 8: (a): Reconstruction and extrapolation when using minibatches. of spirals with irregular time points by a recurrent neural network. (b): Reconstructions and extrapo lations by a latent neural ODE. Blue curve shows Uniqueness When do continuous dynamics model prediction. Red shows extrapolation. (c) A have a unique solution? Picard’s existence the projection of inferred 4dimensional latent ODE orem (Coddington and Levinson, 1955) states trajectories onto their first two dimensions. Color that the solution to an initial value problem ex indicates the direction of the corresponding trajec ists and is unique if the differential equation is tory. The model has learned latent dynamics which uniformly Lipschitz continuous in z and contin distinguishes the two directions. uous in t. This theorem holds for our model if the neural network has finite weights and uses Lipshitz nonlinearities, such as tanh or relu. Reversibility Even if the forward trajectory is invertible in principle, in practice there will be three compounding sources of error in the gradient: 1) Numerical error introduced in the forward ODE solver; 2) Information lost due to multiple initial values mapping to the same final state; and 3) Numerical error introduced in the reverse ODE solver. 8
9.Error from 1 and 3 can be made as small as desired, at the cost of more computation. Error from 2 could be expected to be a problem if the system dynamics encoded optimizationlike, convergent dynamics. Note that exactly convergent transformations are disallowed by Lipschitz dynamics. In our experiments we did not find this to be a practical problem, and we informally checked that reversing many layers of continuous normalizing flows recovered the initial Gaussian density. If necessary, exact reversibility could be recovered by storing the initial state and solving the reverse pass as a boundaryvalue problem instead of as an initialvalue problem. If truly convergent dynamics are required, replacing the ODE solver with a differentiable blackbox optimizer, as in OptNet (Amos and Kolter, 2017), would be more appropriate. 7 Related Work The interpretation of residual networks He et al. (2016) as approximate ODE solvers spurred research into exploiting reversibility and approximate computation in ResNets (Chang et al., 2017; Lu et al., 2017). We demonstrate these same properties in more generality by directly using an ODE solver. Adaptive computation One can adapt computation time by training secondary neural networks to choose the number of evaluations of recurrent or residual networks (Graves, 2016; Jernite et al., 2016; Figurnov et al., 2017). However, this introduces overhead both at training and test time, and extra parameters that need to be fit. In contrast, ODE solvers offer wellstudied, computationally cheap, and generalizable rules for adapting the amount of computation. Reversibility Recent work developed reversible versions of residual networks (Gomez et al., 2017; Haber and Ruthotto, 2017; Chang et al., 2017), which gives the same memory savings as our approach. However, this method requires the use of restricted architectures which partition the hidden units. Our approach does not have these restrictions. Learning differential equations Much recent work has proposed learning differential equations from data. One can train feedforward or recurrent neural networks to approximate a differential equation (Raissi and Karniadakis, 2018; Raissi et al., 2018a; Long et al., 2017), with applica tions such as fluid simulation (Wiewel et al., 2018). There is also significant work on connecting Gaussian Processes (GPs) and ODE solvers (Schober et al., 2014). GPs have been adapted to fit differential equations (Raissi et al., 2018b) and can naturally model continuoustime effects and interventions (Soleimani et al., 2017b; Schulam and Saria, 2017). Ryder et al. (2018) use stochastic variational inference to recover the solution of a given stochastic differential equation. Differentiating through ODE solvers The dolfin library (Farrell et al., 2013) implements adjoint computation for general ODE and PDE solutions, but only by backpropagating through the individual operations of the forward solver. The Stan library (Carpenter et al., 2015) implements gradient estimation through ODE solutions using forward sensitivity analysis. However, forward sensitivity analysis is quadratictime in the number of variables, whereas the adjoint sensitivity analysis is linear (Carpenter et al., 2015; Zhang and Sandu, 2014). Melicher et al. (2017) used the adjoint method to train bespoke latent dynamic models. In contrast, by providing a generic vectorJacobian product, we allow an ODE solver to be trained endtoend with any other differentiable model components. While use of vectorJacobian products for solving the adjoint method has been explored in optimal control (Andersson, 2013), we explore a fully general integration of blackbox ODE solvers into automatic differentiation (Baydin et al., 2018) and highlight its potential in deep learning and generative modeling. 8 Conclusion We investigated the use of blackbox ODE solvers as a model component, developing new models for timeseries modeling, supervised learning, and density estimation. These models are evaluated adaptively, and allow explicit control of the tradeoff between computation speed and accuracy. Finally, we derived an instantaneous version of the change of variables formula, and developed continuoustime normalizing flows, which can scale to large layer sizes. 9
10.9 Acknowledgements We thank Wenyi Wang and Geoff Roeder for help with proofs, and Daniel Duckworth, Ethan Fetaya, Hossein Soleimani, Eldad Haber, Ken Caluwaerts, and Daniel FlamShepherd for feedback on early drafts. We thank Chris Rackauckas, Dougal Maclaurin, and Matthew James Johnson for helpful discussions. References Mauricio A Álvarez and Neil D Lawrence. Computationally efficient convolved multiple output Gaussian processes. Journal of Machine Learning Research, 12(May):1459–1500, 2011. Brandon Amos and J Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, pages 136–145, 2017. Joel Andersson. A generalpurpose software framework for dynamic optimization. PhD thesis, 2013. Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of machine learning research, 18 (153):1–153, 2018. Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018. Bob Carpenter, Matthew D Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betan court. The Stan math library: Reversemode automatic differentiation in c++. arXiv preprint arXiv:1509.07164, 2015. Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. Reversible architectures for arbitrarily deep residual neural networks. arXiv preprint arXiv:1709.03698, 2017. Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values. Scientific Reports, 8(1):6085, 2018. URL https://doi.org/10.1038/s41598018242719. Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. Doctor AI: Predicting clinical events via recurrent neural networks. In Proceedings of the 1st Machine Learning for Healthcare Conference, volume 56 of Proceedings of Machine Learning Research, pages 301–318. PMLR, 18–19 Aug 2016. URL http://proceedings.mlr.press/ v56/Choi16.html. Earl A Coddington and Norman Levinson. Theory of ordinary differential equations. Tata McGraw Hill Education, 1955. Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Nonlinear independent components estimation. arXiv preprint arXiv:1410.8516, 2014. Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel GomezRodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. In International Conference on Knowledge Discovery and Data Mining, pages 1555–1564. ACM, 2016. Patrick Farrell, David Ham, Simon Funke, and Marie Rognes. Automated derivation of the adjoint of highlevel transient finite element programs. SIAM Journal on Scientific Computing, 2013. Michael Figurnov, Maxwell D Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, and Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. arXiv preprint, 2017. J. Futoma, S. Hariharan, and K. Heller. Learning to Detect Sepsis with a Multitask Gaussian Process RNN Classifier. ArXiv eprints, 2017. Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems, pages 2211–2221, 2017. 10
11.Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016. David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016. Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34 (1):014004, 2017. E. Hairer, S.P. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I – Nonstiff Problems. Springer, 1987. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning lecture 6a overview of minibatch gradient descent, 2012. Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in recurrent neural networks. arXiv preprint arXiv:1611.06188, 2016. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. Diederik P. Kingma and Max Welling. Autoencoding variational Bayes. International Conference on Learning Representations, 2014. Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016. W. Kutta. Beitrag zur näherungsweisen Integration totaler Differentialgleichungen. Zeitschrift für Mathematik und Physik, 46:435–453, 1901. Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. Yang Li. Timedependent representation for neural event sequence prediction. arXiv preprint arXiv:1708.00065, 2017. Zachary C Lipton, David Kale, and Randall Wetzel. Directly modeling missing data in sequences with RNNs: Improved classification of clinical time series. In Proceedings of the 1st Machine Learning for Healthcare Conference, volume 56 of Proceedings of Machine Learning Research, pages 253– 270. PMLR, 18–19 Aug 2016. URL http://proceedings.mlr.press/v56/Lipton16.html. Z. Long, Y. Lu, X. Ma, and B. Dong. PDENet: Learning PDEs from Data. ArXiv eprints, 2017. Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121, 2017. Dougal Maclaurin, David Duvenaud, and Ryan P Adams. Autograd: Reversemode differentiation of native Python. In ICML workshop on Automatic Machine Learning, 2015. Hongyuan Mei and Jason M Eisner. The neural Hawkes process: A neurally selfmodulating multivariate point process. In Advances in Neural Information Processing Systems, pages 6757– 6767, 2017. Valdemar Melicher, Tom Haber, and Wim Vanroose. Fast derivatives of likelihood functionals for ODE based models using adjointstate method. Computational Statistics, 32(4):1621–1643, 2017. Conny Palm. Intensitätsschwankungen im fernsprechverker. Ericsson Technics, 1943. Lev Semenovich Pontryagin, EF Mishchenko, VG Boltyanskii, and RV Gamkrelidze. The mathemat ical theory of optimal processes. 1962. 11
12.M. Raissi and G. E. Karniadakis. Hidden physics models: Machine learning of nonlinear partial differential equations. Journal of Computational Physics, pages 125–141, 2018. Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Multistep neural networks for data driven discovery of nonlinear dynamical systems. arXiv preprint arXiv:1801.01236, 2018a. Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Numerical Gaussian processes for timedependent and nonlinear partial differential equations. SIAM Journal on Scientific Computing, 40(1):A172–A198, 2018b. Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pages 1278–1286, 2014. Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015. C. Runge. Über die numerische Auflösung von Differentialgleichungen. Mathematische Annalen, 46: 167–178, 1895. Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations. arXiv preprint arXiv:1804.04272, 2018. T. Ryder, A. Golightly, A. S. McGough, and D. Prangle. Blackbox Variational Inference for Stochastic Differential Equations. ArXiv eprints, 2018. Michael Schober, David Duvenaud, and Philipp Hennig. Probabilistic ODE solvers with RungeKutta means. In Advances in Neural Information Processing Systems 25, 2014. Peter Schulam and Suchi Saria. Whatif reasoning with counterfactual Gaussian processes. arXiv preprint arXiv:1703.10651, 2017. Hossein Soleimani, James Hensman, and Suchi Saria. Scalable joint models for reliable uncertainty aware event prediction. IEEE transactions on pattern analysis and machine intelligence, 2017a. Hossein Soleimani, Adarsh Subbaswamy, and Suchi Saria. Treatmentresponse models for coun terfactual reasoning with continuoustime, continuousvalued interventions. arXiv preprint arXiv:1704.02038, 2017b. Jos Stam. Stable fluids. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 121–128. ACM Press/AddisonWesley Publishing Co., 1999. Paul Stapor, Fabian Froehlich, and Jan Hasenauer. Optimization and uncertainty analysis of ODE models using second order adjoint sensitivity analysis. bioRxiv, page 272005, 2018. Jakub M Tomczak and Max Welling. Improving variational autoencoders using Householder flow. arXiv preprint arXiv:1611.09630, 2016. Steffen Wiewel, Moritz Becher, and Nils Thuerey. Latentspace physics: Towards learning the temporal evolution of fluid flow. arXiv preprint arXiv:1802.10123, 2018. Hong Zhang and Adrian Sandu. Fatode: a library for forward, adjoint, and tangent linear integration of ODEs. SIAM Journal on Scientific Computing, 36(5):C504–C523, 2014. 12
13.Appendix A Proof of the Instantaneous Change of Variables Theorem Theorem (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable with probability p(z(t)) dependent on time. Let dz dt = f (z(t), t) be a differential equation describing a continuousintime transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z and continuous in t, then the change in log probability also follows a differential equation: ∂ log p(z(t)) df = −tr (t) ∂t dz Proof. To prove this theorem, we take the infinitesimal limit of finite changes of log p(z(t)) through time. First we denote the transformation of z over an ε change in time as z(t + ε) = Tε (z(t)) (14) We assume that f is Lipschitz continuous in z(t) and continuous in t, so every initial value problem has a unique solution by Picard’s existence theorem. We also assume z(t) is bounded. These conditions imply that f , Tε , and ∂ T are all bounded. In the following, we use these conditions to exchange limits and products. ∂z ε ∂ log p(z(t)) We can write the differential equation ∂t using the discrete change of variables formula, and the definition of the derivative: ∂ ∂ log p(z(t)) log p(z(t)) − log det ∂z Tε (z(t)) − log p(z(t)) = lim (15) ∂t ε→0 + ε ∂ log det ∂z Tε (z(t)) = − lim (16) ε→0+ ε ∂ ∂ log det T (z(t)) ∂z ε = − lim ∂ε ∂ (by L’Hôpital’s rule) (17) ε→0 + ∂ε ε ∂ ∂ ∂ε det T (z(t)) ∂z ε ∂ log(z) = − lim ∂ =1 (18) ε→0+ det ∂z Tε (z(t)) ∂z z=1 ∂ ∂ 1 =− lim det Tε (z(t)) lim ∂ (19) ε→0+ ∂ε ∂z ε→0+ det T (z(t)) ∂z ε bounded =1 ∂ ∂ = − lim det Tε (z(t)) (20) ε→0+ ∂ε ∂z The derivative of the determinant can be expressed using Jacobi’s formula, which gives ∂ log p(z(t)) ∂ ∂ ∂ = − lim tr adj Tε (z(t)) Tε (z(t)) (21) ∂t ε→0+ ∂z ∂ε ∂z ∂ ∂ ∂ = −tr lim adj Tε (z(t)) lim Tε (z(t)) (22) ε→0+ ∂z ε→0+ ∂ε ∂z =I ∂ ∂ = −tr lim Tε (z(t)) (23) ε→0+ ∂ε ∂z Substituting Tε with its Taylor series expansion and taking the limit, we complete the proof. ∂ log p(z(t)) ∂ ∂ = −tr lim z + εf (z(t), t) + O(ε2 ) + O(ε3 ) + . . . (24) ∂t ε→0+ ∂ε ∂z ∂ ∂ = −tr lim I+ εf (z(t), t) + O(ε2 ) + O(ε3 ) + . . . (25) ε→0+ ∂ε ∂z ∂ = −tr lim f (z(t), t) + O(ε) + O(ε2 ) + . . . (26) ε→0+ ∂z ∂ = −tr f (z(t), t) (27) ∂z 13
14.A.1 Special Cases ∂f T Planar CNF. Let f (z) = uh(wz + b), then ∂z = u ∂h ∂z . Since the trace of an outer product is the inner product, we have ∂ log p(z) ∂h T ∂h = −tr u = −uT (28) ∂t ∂z ∂z This is the parameterization we use in all of our experiments. Hamiltonian CNF. The continuous analog of NICE (Dinh et al., 2014) is a Hamiltonian flow, which splits ∂ log p(z) the data into two equal partitions and is a volumepreserving transformation, implying that ∂t = 0. We can verify this. Let dz1:d dt f (zd+1:D ) dzd+1:D = (29) g(z1:d ) dt Then because the Jacobian is all zeros on its diagonal, the trace is zero. This is a volumepreserving flow. A.2 Connection to FokkerPlanck and Liouville PDEs The FokkerPlanck equation is a wellknown partial differential equation (PDE) that describes the probability density function of a stochastic differential equation as it changes with time. We relate the instantaneous change of variables to the special case of FokkerPlanck with zero diffusion, the Liouville equation. As with the instantaneous change of variables, let z(t) ∈ RD evolve through time following dz(t) dt = f (z(t), t). Then Liouville equation describes the change in density of z–a fixed point in space–as a PDE, D ∂p(z, t) ∂ =− [fi (z, t)p(z, t)] (30) ∂t i=1 ∂z i However, (30) cannot be easily used as it requires the partial derivatives of p(z,t) ∂z , which is typically approximated using finite difference. This type of PDE has its own literature on efficient and accurate simulation (Stam, 1999). Instead of evaluating p(·, t) at a fixed point, if we follow the trajectory of a particle z(t), we obtain ∂p(z(t), t) ∂p(z(t), t) ∂z(t) ∂p(z(t), t) = + ∂t ∂z(t) ∂t ∂t partial derivative from first argument, z(t) partial derivative from second argument, t D ✘ D D ✘✘ ∂p(z(t), t) ∂z ✘i✘(t) ∂fi (z(t), t) ∂p(z(t), ✘ t) (31) = ✘ ✘ ✘ − p(z(t), t) − ✘t)✘✘ fi (z(t), i=1 ✘ ✘✘∂zi (t) ∂t i=1 ∂zi ✘ i=1 ✘ ✘ ✘ ∂zi (t) D ∂fi (z(t), t) =− p(z(t), t) i=1 ∂zi We arrive at the instantaneous change of variables by taking the log, D ∂ log p(z(t), t) 1 ∂p(z(t), t) ∂fi (z(t), t) = =− (32) ∂t p(z(t), t) ∂t i=1 ∂zi While still a PDE, (32) can be combined with z(t) to form an ODE of size D + 1, d z(t) f (z(t), t) = D ∂fi (z(t),t) (33) dt log p(z(t), t) − i=1 ∂t Compared to the FokkerPlanck and Liouville equations, the instantaneous change of variables is of more practical impact as it can be numerically solved much more easily, requiring an extra state of D for following the trajectory of z(t). Whereas an approach based on finite difference approximation of the Liouville equation would require a grid size that is exponential in D. Appendix B A Modern Proof of the Adjoint Method We present an alternative proof to the adjoint method (Pontryagin et al., 1962) that is short and easy to follow. 14
15.B.1 Continuous Backpropagation dz(t) Let z(t) follow the differential equation dt = f (z(t), t, θ), where θ are the parameters. We will prove that if we define an adjoint state dL a(t) = (34) dz(t) then it follows the differential equation da(t) ∂f (z(t), t, θ) = −a(t) (35) dt ∂z(t) For ease of notation, we denote vectors as row vectors, whereas the main text uses column vectors. The adjoint state is the gradient with respect to the hidden state at a specified time t. In standard neural networks, the gradient of a hidden layer ht depends on the gradient from the next layer ht+1 by chain rule dL dL dht+1 = . (36) dht dht+1 dht With a continuous hidden state, we can write the transformation after an ε change in time as t+ε z(t + ε) = f (z(t), t, θ)dt + z(t) = Tε (z(t), t) (37) t and chain rule can also be applied dL dL dz(t + ε) ∂Tε (z(t), t) = or a(t) = a(t + ε) (38) ∂z(t) dz(t + ε) dz(t) ∂z(t) The proof of (35) follows from the definition of derivative: da(t) a(t + ε) − a(t) = lim (39) dt ε→0+ ε ∂ a(t + ε) − a(t + ε) ∂z(t) Tε (z(t)) = lim (by Eq 38) (40) ε→0 + ε ∂ a(t + ε) − a(t + ε) ∂z(t) z(t) + εf (z(t), t, θ) + O(ε2 ) = lim (Taylor series around z(t)) ε→0+ ε (41) a(t + ε) − a(t + ε) I + ε ∂f (z(t),t,θ) ∂z(t) + O(ε )2 = lim (42) ε→0+ ε −εa(t + ε) ∂f (z(t),t,θ) ∂z(t) + O(ε2 ) = lim (43) ε→0+ ε ∂f (z(t), t, θ) = lim −a(t + ε) + O(ε) (44) ε→0+ ∂z(t) ∂f (z(t), t, θ) = −a(t) (45) ∂z(t) We pointed out the similarity between adjoint method and backpropagation (eq. 38). Similarly to backpropaga tion, ODE for the adjoint state needs to be solved backwards in time. We specify the constraint on the last time point, which is simply the gradient of the loss wrt the last time point, and can obtain the gradients with respect to the hidden state at any time, including the initial value. t0 dL ∂f (z(t), t, θ) a(tN ) = a(t0 ) = a(t) dt (46) dz(tN ) tN ∂z(t) initial condition of adjoint diffeq. gradient wrt. initial value Here we assumed that loss function L depends only on the last time point tN . If function L depends also on intermediate time points t1 , t2 , . . . , tN −1 , etc., we can repeat the adjoint step for each of the intervals [tN −1 , tN ], [tN −2 , tN −1 ] in the backward order and sum up the obtained gradients. B.2 Gradients wrt. θ and t We can generalize (35) to obtain gradients with respect to θ–a constant wrt. t–and and the initial and end times, t0 and tN . We view θ and t as states with constant differential equations and write ∂θ(t) dt(t) =0 =1 (47) ∂t dt 15
16.We can then combine these with z to form an augmented state with corresponding differential equation and adjoint state, d z f ([z, θ, t]) a , aaug := aθ , aθ (t) := dL , at (t) := dL θ (t) = faug ([z, θ, t]) := 0 dt t 1 a dθ(t) dt(t) t (48) Note this formulates the augmented ODE as an autonomous (timeinvariant) ODE, but the derivations in the previous section still hold as this is a special case of a timevariant ODE. The Jacobian of f has the form ∂f ∂f ∂f ∂faug ∂z ∂θ ∂t =0 0 0 (t) (49) ∂[z, θ, t] 0 0 0 where each 0 is a matrix of zeros with the appropriate dimensions. We plug this into (35) to obtain daaug (t) ∂faug = − a(t) aθ (t) at (t) (t) = − a ∂f ∂z a ∂f ∂θ a ∂f ∂t (t) (50) dt ∂[z, θ, t] The first element is the adjoint differential equation (35), as expected. The second element can be used to obtain the total gradient with respect to the parameters, by integrating over the full interval. t0 dL ∂f (z(t), t, θ) = a(t) dt (51) dθ tN ∂θ Note the negative sign cancels out since we integrate backwards from tN to t0 . Finally, we also get gradients with respect to t0 and tN , the start and end of the integration interval. t0 dL ∂f (z(tN ), tN , θ) dL ∂f (z(t), t, θ) = −a(tN ) = a(t) dt (52) dtN ∂tN dt0 tN ∂t Between (35), (46), (51), and (52) we have gradients for all possible inputs to an initial value problem solver. Appendix C Autograd Implementation import scipy . i n t e g r a t e i m p o r t a u t o g r a d . numpy a s np from a u t o g r a d . e x t e n d i m p o r t p r i m i t i v e , d e f v j p _ a r g n u m s from a u t o g r a d i m p o r t make_vjp from a u t o g r a d . m i s c i m p o r t f l a t t e n from a u t o g r a d . b u i l t i n s i m p o r t t u p l e odeint = primitive ( scipy . integrate . odeint ) d e f g r a d _ o d e i n t _ a l l ( y t , f u n c , y0 , t , f u n c _ a r g s , ∗∗ k w a r g s ) : # E x t e n d e d from " S c a l a b l e I n f e r e n c e o f O r d i n a r y D i f f e r e n t i a l # E q u a t i o n Models o f B i o c h e m i c a l P r o c e s s e s " , Sec . 2 . 4 . 2 # F a b i a n F r o e h l i c h , C a r o l i n Loos , J a n H a s e n a u e r , 2017 # h t t p s : / / a r x i v . org / pdf /1711.08079. pdf T , D = np . s h a p e ( y t ) flat_args , unflatten = f l a t t e n ( func_args ) def f l a t _ f u n c (y , t , f l a t _ a r g s ) : r e t u r n func ( y , t , ∗ u n f l a t t e n ( f l a t _ a r g s ) ) def unpack ( x ) : # y, vjp_y , vjp_t , vjp_args r e t u r n x [ 0 : D] , x [D: 2 ∗ D] , x [ 2 ∗ D] , x [ 2 ∗ D + 1 : ] def augmented_dynamics ( augmented_state , t , f l a t _ a r g s ) : Note that we’ve overloaded t to be both a part of the state and the (dummy) independent variable. The distinction is clear given context, so we keep t as the independent variable for consistency with the rest of the text. 16
17. # O r g i n a l s y s t e m a u g m e n t e d w i t h v j p _ y , v j p _ t and v j p _ a r g s . y , vjp_y , _ , _ = unpack ( a u g m e n t e d _ s t a t e ) v j p _ a l l , d y _ d t = make_vjp ( f l a t _ f u n c , argnum = ( 0 , 1 , 2 ) ) ( y , t , f l a t _ a r g s ) v j p _ y , v j p _ t , v j p _ a r g s = v j p _ a l l (− v j p _ y ) r e t u r n np . h s t a c k ( ( d y _ d t , v j p _ y , v j p _ t , v j p _ a r g s ) ) def v j p _ a l l ( g ,∗∗ kwargs ) : v j p _ y = g [ −1 , : ] vjp_t0 = 0 time_vjp_list = [] v j p _ a r g s = np . z e r o s ( np . s i z e ( f l a t _ a r g s ) ) f o r i i n r a n g e ( T − 1 , 0 , −1): # Compute e f f e c t o f moving c u r r e n t t i m e . v j p _ c u r _ t = np . d o t ( f u n c ( y t [ i , : ] , t [ i ] , ∗ f u n c _ a r g s ) , g [ i , : ] ) t i m e _ v j p _ l i s t . append ( v j p _ c u r _ t ) vjp_t0 = vjp_t0 − vjp_cur_t # Run a u g m e n t e d s y s t e m b a c k w a r d s t o t h e p r e v i o u s o b s e r v a t i o n . aug_y0 = np . h s t a c k ( ( y t [ i , : ] , v j p _ y , v j p _ t 0 , v j p _ a r g s ) ) a u g _ a n s = o d e i n t ( a u g m e n t e d _ d y n a m i c s , aug_y0 , np . a r r a y ( [ t [ i ] , t [ i − 1 ] ] ) , t u p l e ( ( f l a t _ a r g s , ) ) , ∗∗ k w a r g s ) _ , vjp_y , vjp_t0 , v j p _ a r g s = unpack ( aug_ans [ 1 ] ) # Add g r a d i e n t from c u r r e n t o u t p u t . vjp_y = vjp_y + g [ i − 1 , : ] t i m e _ v j p _ l i s t . append ( v j p _ t 0 ) v j p _ t i m e s = np . h s t a c k ( t i m e _ v j p _ l i s t ) [ : : − 1 ] r e t u r n None , v j p _ y , v j p _ t i m e s , u n f l a t t e n ( v j p _ a r g s ) return vjp_all def grad_argnums_wrapper ( a l l _ v j p _ b u i l d e r ) : # A g e n e r i c a u t o g r a d h e l p e r f u n c t i o n . Takes a f u n c t i o n t h a t # b u i l d s v j p s f o r a l l a r g u m e n t s , and w r a p s i t t o r e t u r n o n l y r e q u i r e d v j p s . d e f b u i l d _ s e l e c t e d _ v j p s ( argnums , ans , c o m b i n e d _ a r g s , k w a r g s ) : v j p _ f u n c = a l l _ v j p _ b u i l d e r ( ans , ∗ c o m b i n e d _ a r g s , ∗∗ k w a r g s ) def chosen_vjps ( g ) : # R e t u r n w h i c h e v e r v j p s were a s k e d f o r . a l l _ v j p s = vjp_func ( g ) r e t u r n [ a l l _ v j p s [ argnum ] f o r argnum i n argnums ] return chosen_vjps return build_selected_vjps defvjp_argnums ( odeint , grad_argnums_wrapper ( g r a d _ o d e i n t _ a l l ) ) Appendix D Algorithm for training the latent ODE model To obtain the latent representation zt0 , we traverse the sequence using RNN and obtain parameters of distribution q(zt0 {xti , ti }i , θenc ). The algorithm follows a standard VAE algorithm with an RNN variational posterior and an ODESolve model: 1. Run an RNN encoder through the time series and infer the parameters for a posterior over zt0 : q(zt0 {xti , ti }i , φ) = N (zt0 µzt0 , σz0 ), (53) where µz0 , σz0 comes from hidden state of RNN({xti , ti }i , φ) 2. Sample zt0 ∼ q(zt0 {xti , ti }i ) 3. Obtain zt1 , zt2 , . . . , ztM by solving ODE ODESolve(zt0 , f, θf , t0 , . . . , tM ), where f is the function defining the gradient dz/dt as a function of z 17
18. 4. Maximize ELBO = M i=1 log p(xti zti , θx ) + log p(zt0 ) − log q(zt0 {xti , ti }i , φ), where p(zt0 ) = N (0, 1) Appendix E Extra Figures Ground Truth Observation Prediction Extrapolation (a) 30 time points (b) 50 time points (c) 100 time points Figure 10: Spiral reconstructions using a latent ODE with a variable number of noisy observations. 18
相关Slides

论文提出了一种新的深度神经网络模型家族，并没有规定一个离散的隐藏层序列，而是使用神经网络将隐藏状态的导数参数化。然后使用黑箱微分方程求解器计算该网络的输出。这些连续深度模型的内存成本是固定的，它们根据输入调整评估策略，并显式地用数值精度来换取运算速度。论文展示了连续深度残差网络和连续时间隐变量模型的特性。此外，还构建了连续的归一化流，一个可以使用最大似然方法训练的生成模型，无需对数据维度进行分割或排序。至于训练，论文展示了如何基于任意 ODE 求解器进行可扩展的反向传播，无需访问 ODE 求解器的内部操作。这使得在较大模型内也可以实现 ODE 的端到端训练。

1. 中国人工智能产业发展迅速，但整体实力仍落后于美国 2. 中国企业价值链布局侧重技术层和应用层，对需要长周期的基础层关注 度较小 3. 科技巨头生态链博弈正在展开，创业企业则积极发力垂直行业解决方案， 深耕巨头的数据洼地，打造护城河 4. 政府端是目前人工智能切入智慧政务和公共安全应用场景的主要渠道， 早期进入的企业逐步建立行业壁垒，未来需要解决数据割裂问题以获得长 足发展 5. 人工智能在金融领域的应用最为深入，应用场景逐步由以交易安全为主 向变革金融经营全过程扩展 6. 医疗行业人工智能应用发展快速，但急需建立标准化的人工智能产品市 场准入机制并加强医疗数据库的建设 7. 以无人驾驶技术为主导的汽车行业将迎来产业链的革新 8. 人工智能在制造业领域的应用潜力被低估，优质数据资源未被充分利 用 9. 人工智能加速新零售全渠道的融合，传统零售企业与创业企业结成伙 伴关系，围绕人、货、场、链搭建应用场景 10. 政策与资本双重驱动推动人工智能产业区域间竞赛，京沪深领跑全 国，杭州发展逐步加速 11. 各地政府以建设产业园的方式发挥人工智能产业在推动新旧动能转换 中的作用 2. 杭州未来科技城抓住人工智能产业快速发展的机会并取得显著成绩， 未来可以从人才、技术、创新三要素入手进一步打造产业竞争力

张首晟教授讲述了量子的世界，人工智能未来的发展以及区块链技术的展望与发展趋势。张提出区块链与人工智能共生的见解，人工智能需要数据，但是数据往往被中心化平台垄断，因而阻碍创新。从这种意义上人工智能有所欠缺。 加密经济学创造了一个对于数据提供者有正确激励机制的数据市场。人工智能能够依赖这个数据市场起飞。