Machine Learning · Hamiltonian Neural Networks · Differential Geometry · Symplectic Geometry · Physics-Informed ML

Hamiltonian Neural Networks from a Differential Geometry Perspective

Ata Hakçıl, Abscondita Research Team · 32 min read

Because when given the simplest nail in the universe, sometimes you just need a nuclear powered sledgehammer.

We have committed a crime.

What do you do when you have a frontier concept that is just a few years old and is already complicated enough? Well, you set up a Tinder date between that concept and one of the most haunting branches of mathematics, of course. So we are taking a look at Hamiltonian Neural Networks from a differential-geometry perspective.

Look, we didn't ask for this. But nowadays it seems every single piece of ML research boils down to the same thing:

Put a transformer in it and make it lame!

Do they get the job done? Usually — at significant compute cost. But they might as well be the single most uninspired architecture we have. So this time we're doing the opposite. We're going to chase something beautiful and let the structure of the problem do the heavy lifting, instead of throwing a billion parameters at it and praying to hyperparameter deities that our losses converge.

Achievement Unlocked
Offended the Entire ML Community
20G

We are not the first to think the aesthetics matter:

It is more important to have beauty in one's equations than to have them fit experiment.

Paul Dirac, Physicist, professional aesthete

We are going to work on what is possibly the simplest physical system in the universe — the kind engineering students see in their second physics course and never think about again. And to talk about it, we are going to need symplectic manifolds, differential forms, the Lie derivative, Cartan's magic formula, the Poisson bracket, and Noether's theorem.

Mathematical overkill level: 3/10
concept: Differential Geometry
fine

If that ratio of machinery-to-mass strikes you as deranged, we never claimed otherwise. But we promise: somewhere underneath this absurdity is a genuinely beautiful idea about machine learning — you can teach a neural network to obey a conservation law it is never shown, purely by changing the shape of the thing it is allowed to learn.

INFO

Every terrifying word above has a precise, checkable meaning, and we are going to define each one in terms of things that are easier to digest (partial derivatives, dot products, determinants, the chain rule) and then compute with it, instead of waving our hands and hoping you nod along. If you can differentiate $x^2$ and remember what a determinant is, you should be able to follow every line here. The whole post runs on six tools:

  • a partial derivative $\tfrac{\partial H}{\partial q}$: how $H$ changes when you nudge $q$ and freeze everything else;
  • the gradient $\nabla H = \left(\tfrac{\partial H}{\partial q}, \tfrac{\partial H}{\partial p}\right)$: every partial stacked into one vector;
  • the dot product, for turning two vectors into a number;
  • the multivariable chain rule, $\tfrac{d}{dt}H(q(t),p(t)) = \tfrac{\partial H}{\partial q}\tfrac{dq}{dt} + \tfrac{\partial H}{\partial p}\tfrac{dp}{dt}$;
  • the determinant of a $2\times 2$ matrix, read as a signed area;
  • Green's theorem, which trades a loop integral for the area it encloses.

Everything else — forms, the symplectic structure, Poisson brackets — we build out of those, in front of you. A bit like a cooking show, I guess.

The crime scene

Here's the setup that started it.

Take that mass on a spring. Unit mass, unit spring constant, no friction. Its entire universe is two numbers: position $q$ and momentum $p$. The total energy is

$$H(q, p) = \underbrace{\tfrac{1}{2} p^{2}}_{\text{kinetic}} \;+\; \underbrace{\tfrac{1}{2} q^{2}}_{\text{potential}}$$

and the equations of motion are the cleanest things in all of physics. Here $\tfrac{dq}{dt}$ is the velocity (the rate position changes) and $\tfrac{dp}{dt}$ is the force (the rate momentum changes). The first equation below just says velocity equals momentum, which is true when the mass is $1$. The second is Hooke's law, $F = -kq$ with $k = 1$: pull the mass to the right, and the spring hauls it back left.

$$\frac{dq}{dt} = p, \qquad \frac{dp}{dt} = -q$$

Watch the two pictures move together: the mass bobbing in real space on the left, and the state $(q, p)$ tracing a closed loop on the right. That loop lives in phase space, the system drawn in its own coordinates, and we'll obsess over it for the rest of the post:

[Interactive figure: a spring and its shadow in phase space. Left panel: real space, where the mass bobs about q = 0. Right panel: phase space, where the state (q, p) orbits. Sample readout: q = 1.70, p = 0.00, H = 1.44.]

Now do the obvious machine-learning thing.

NO. Nooooo. Put down the transformer. Not that obvious thing.

Collect a pile of $(q, p)$ states and their measured time-derivatives $(\tfrac{dq}{dt}, \tfrac{dp}{dt})$, and train a perfectly reasonable little MLP to map one to the other. It fits the training data to five decimal places. You feel great. You leave your room and tell your mom "Mama, I did it," as she gives you a confused look. Then you let it predict forward in time for a few hundred swings, and oh no…

The energy does not stay constant. The orbit, which should be a perfect closed circle in phase space, slowly spirals: either inward until the mass grinds to a halt, or outward until your frictionless spring is somehow flinging a mass around with more energy than you gave it. Your neural network has, depending on the sign of its errors, either invented heat death or built a perpetual-motion machine.
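The spiral is easy to reproduce without any network at all. The sketch below is an illustration under stated assumptions, not the post's experiment: it takes the true spring field $(p, -q)$ and adds a tiny radial error `eps * (q, p)` as a hypothetical stand-in for a learned model's systematic error, then integrates with RK4. Along this field $\tfrac{dH}{dt} = 2\,\texttt{eps}\,H$, so the sign of `eps` decides between heat death and perpetual motion.

```python
import math

def rk4_step(field, state, dt):
    """One classical Runge-Kutta (RK4) step for d(state)/dt = field(state)."""
    q, p = state
    k1 = field((q, p))
    k2 = field((q + 0.5 * dt * k1[0], p + 0.5 * dt * k1[1]))
    k3 = field((q + 0.5 * dt * k2[0], p + 0.5 * dt * k2[1]))
    k4 = field((q + dt * k3[0], p + dt * k3[1]))
    return (q + dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0,
            p + dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0)

def energy(state):
    q, p = state
    return 0.5 * (q * q + p * p)

def make_field(eps):
    """True spring field (p, -q) plus a hypothetical radial error eps * (q, p)."""
    return lambda s: (s[1] + eps * s[0], -s[0] + eps * s[1])

def energy_after(eps, periods=20, dt=0.01):
    """Energy after integrating from (q, p) = (1, 0), where H starts at 0.5."""
    field = make_field(eps)
    state = (1.0, 0.0)
    for _ in range(int(round(periods * 2 * math.pi / dt))):
        state = rk4_step(field, state, dt)
    return energy(state)

for eps in (0.0, 1e-3, -1e-3):
    print(f"eps={eps:+.0e}  H after 20 periods = {energy_after(eps):.4f}")
```

An error of one part in a thousand, invisible in a slope-fitting loss, compounds into a large energy change over twenty periods, while the `eps = 0` row stays at its starting value.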

INFO

The network is not bad at physics. It is bad at geometry. It learned a vector field with no reason to conserve anything, because nothing in its architecture knew that "conservation" was even a category of thing. So it's not the model's fault. It's yours.

Fixing that is not a matter of more data or a cleverer loss; it's a matter of giving the network the right shape to learn into. That shape has a name, and the name is symplectic.

So here is the plan. We are going to explain why that spiral is inevitable for a naive model — and how to make it impossible — three times over: once as geometry, once as a learning problem, and once as a symmetry. They turn out to be the same sentence wearing three different hats. Helmets on.


Movement I — The arena is a manifold

Phase space is not a plane

Physics is often built on small white lies we tell ourselves to make a problem easier to hold. It's why, when you ask a physicist "what is spin," they say "it's like a ball spinning, except it's not a ball and it isn't spinning."

Achievement Unlocked
Offended the Entire Physics Community Too
20G

One such lie we tell undergrads is that the state of the spring "is a point in the $(q, p)$ plane." It's close enough to compute with, but calling phase space "a plane" is doing about as much heavy lifting as calling a 747 "a plane." Same word; wildly different object underneath.

Here is the honest version, and then immediately the plain-language version. Position lives in a configuration space $Q$ (for the spring, just the line of possible positions). Momentum is not another position; it's the thing you pair with a velocity to extract a number (a kinetic energy). An object whose whole job is "eat a vector, return a number" is a covector, and it lives in a separate space, the dual. Glue one dual space onto every point of $Q$ and you get the cotangent bundle $T^{*}Q$. That is phase space.

INFO

If "covector" made your eyes glaze, good news: you already use one constantly. In ML, when you write $\Delta L \approx \nabla L \cdot \Delta w$, the gradient $\nabla L$ isn't really "an arrow in weight space." It's the gadget you feed a step $\Delta w$ in order to get back a number, the change in loss. That "eat a vector → return a number" gadget is a covector (a 1-form). A plain vector is a direction you move; a covector is a ruler you measure against. Momentum is a ruler, not a move, and that's the entire reason phase space isn't a naive plane.

Mathematical overkill level: 5/10
concept: The Cotangent Bundle T*Q
still fine

This sounds like pedantry until it bites. Ask the question the naive picture cannot answer: what does it even mean for a flow to "preserve volume" here, when the two axes carry different units? Position times momentum is an action, not an area — you can't lay a ruler across phase space. You need a purpose-built instrument. Building that instrument, carefully, is the rest of this movement.

The instrument: a symplectic form

That instrument is the symplectic form $\omega$. In canonical coordinates it is

$$\omega = dq \wedge dp .$$

We'll define the pieces from the ground up. A 1-form like $dq$ is a little machine that eats a vector $v=(v_q,v_p)$ and reports one number, here how far you moved in $q$: $dq(v)=v_q$. A general 1-form is $\alpha = a\,dq + b\,dp$, acting by $\alpha(v) = a\,v_q + b\,v_p$. With that, the differential of our energy is nothing but the total-derivative formula you already know, finally given a name:

$$dH = \frac{\partial H}{\partial q}\,dq + \frac{\partial H}{\partial p}\,dp .$$

A 2-form like $dq \wedge dp$ eats two vectors and returns a number, and this particular one returns the signed area of the parallelogram they span:

INFO

$$\omega(u, v) = u_q v_p - u_p v_q = \det\begin{pmatrix} u_q & v_q \\ u_p & v_p \end{pmatrix}$$

That's just the $2\times 2$ determinant you've seen a hundred times: the signed area of the parallelogram with sides $u$ and $v$. Two facts read straight off it, both of which we cash in later:

  • Antisymmetry: swap the inputs and the sign flips, $\omega(v,u) = -\omega(u,v)$.
  • Zero on repeats: feed it the same vector twice and you get $\omega(v,v)=0$. A "parallelogram" with two identical sides is flat; it has no area.

Tattoo that second fact somewhere visible. It is, almost single-handedly, why energy is conserved.
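Both facts can be checked in a few lines. This is a minimal sketch: the function name `omega` and the sample vectors are ours, chosen only for illustration.

```python
def omega(u, v):
    """Canonical symplectic form dq ^ dp applied to u = (u_q, u_p), v = (v_q, v_p)."""
    return u[0] * v[1] - u[1] * v[0]

u, v = (2.0, 1.0), (0.5, 3.0)

area = omega(u, v)       # signed area of the parallelogram spanned by u and v
swapped = omega(v, u)    # antisymmetry: same magnitude, opposite sign
repeat = omega(u, u)     # zero on repeats: a flat "parallelogram"

print(area, swapped, repeat)  # 5.5 -5.5 0.0
```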

$\omega$ carries two structural properties:

  • it is closed, written $d\omega = 0$, meaning there's no "source" of area anywhere (we cash this in exactly once, in a slick proof later), and
  • it is non-degenerate: the only vector that measures zero against every other vector is the zero vector itself.

Non-degeneracy is the magic clause, because it makes $\omega$ a perfect, invertible dictionary between vectors and 1-forms: hand it a vector and you get a 1-form; because nothing is invisible to it, you can run the dictionary backwards and turn any 1-form into exactly one vector. A metric $g$ (think: the ordinary dot product) gives the same kind of dictionary; that's literally how the gradient $\nabla H$ is defined, by translating the 1-form $dH$ into a vector. The only difference is that $g$ is symmetric while $\omega$ is antisymmetric, and that one sign flip is the whole post.

The 90° turn, derived (not asserted)

Our network learns a scalar $H$, whose differential $dH$ is a 1-form that, at every point, measures how steeply the energy landscape climbs. We now turn that 1-form into a flow, through $\omega$. The vector field $X_H$ we want is defined by one deceptively small equation:

$$\iota_{X_H}\,\omega = dH .$$

Let's not admire it; let's solve it. The symbol $\iota_X$ (the interior product) just means "plug $X$ into the first slot of $\omega$ and leave the second open," which leaves a 1-form behind. Compute it directly:

$$(\iota_X \omega)(v) = \omega(X, v) = X_q v_p - X_p v_q = (-X_p)\,v_q + (X_q)\,v_p \;\Longrightarrow\; \iota_X\omega = -X_p\,dq + X_q\,dp .$$

Now set that equal to $dH = \tfrac{\partial H}{\partial q}\,dq + \tfrac{\partial H}{\partial p}\,dp$ and match the $dq$ and $dp$ pieces:

$$-X_p = \frac{\partial H}{\partial q}, \qquad X_q = \frac{\partial H}{\partial p} \quad\Longrightarrow\quad X_H = \left( \frac{\partial H}{\partial p},\; -\frac{\partial H}{\partial q} \right).$$

There they are: Hamilton's equations, derived in three lines. Plug in the spring's $H = \tfrac12(q^2+p^2)$ and you get $X_H = (p, -q)$, exactly $\tfrac{dq}{dt} = p,\ \tfrac{dp}{dt} = -q$. And now the "90° rotation" is not a slogan but a computation. Stack the two dictionaries side by side:

$$\nabla H = \left( \frac{\partial H}{\partial q}, \frac{\partial H}{\partial p} \right), \qquad X_H = \left( \frac{\partial H}{\partial p}, -\frac{\partial H}{\partial q} \right) = \underbrace{\begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}}_{J}\,\nabla H .$$

That matrix $J$ rotates any vector by $-90^\circ$. So $X_H$ is literally the gradient, turned a quarter turn. The gradient climbs straight up the energy bowl; rotate it and it runs along the contour rings instead. Drag the white dot below and flip between the two dictionaries: same $dH$, same rings, opposite fate:

[Interactive figure: dH → flow. Pick the machine that turns the 1-form into motion. The (q, p) plane shows ∇H and X_H at a draggable start point (the white dot), with a readout of the energy H(q, p); sample readout: start 1.625, conserved. Caption: feeding dH through ω rotates it 90° onto the rings. The particle rides a level set, so the energy never moves.]
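The quarter turn can be sketched for any scalar function, not just the spring. The snippet below is an illustration only: it approximates the partial derivatives with central differences (the step `h` and the function names are our assumptions) and applies the rotation $X_H = J\,\nabla H$.

```python
def grad(H, q, p, h=1e-5):
    """Central-difference approximation of (dH/dq, dH/dp)."""
    dHdq = (H(q + h, p) - H(q - h, p)) / (2 * h)
    dHdp = (H(q, p + h) - H(q, p - h)) / (2 * h)
    return dHdq, dHdp

def hamiltonian_field(H, q, p):
    """X_H = J grad(H) = (dH/dp, -dH/dq): the gradient turned a quarter turn."""
    dHdq, dHdp = grad(H, q, p)
    return dHdp, -dHdq

def spring(q, p):
    return 0.5 * (q * q + p * p)

# For the spring the result should match (p, -q) up to rounding.
print(hamiltonian_field(spring, 1.7, 0.3))
```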

Why the quarter-turn conserves energy

Here's the payoff, short enough to do twice. How fast does energy change as the system flows along $X_H$? Multivariable chain rule, then substitute $\tfrac{dq}{dt} = \partial H/\partial p$ and $\tfrac{dp}{dt} = -\partial H/\partial q$:

$$\frac{dH}{dt} = \frac{\partial H}{\partial q}\,\frac{dq}{dt} + \frac{\partial H}{\partial p}\,\frac{dp}{dt} = \frac{\partial H}{\partial q}\frac{\partial H}{\partial p} + \frac{\partial H}{\partial p}\left(-\frac{\partial H}{\partial q}\right) = 0 .$$

The two terms are identical with opposite signs. They cancel for any $H$ whatsoever. That's the entire content of energy conservation, and you just checked it with first-year calculus. The slick, coordinate-free way to say precisely the same thing is

$$\frac{dH}{dt} = dH(X_H) = \omega(X_H, X_H) = 0,$$

where the first step is the chain rule, the second is the defining equation $\iota_{X_H}\omega = dH$ fed the vector $X_H$, and the final step is free because $\omega$ is antisymmetric, and anything antisymmetric, fed two copies of one vector, returns zero (the "zero on repeats" fact from earlier). Energy is conserved because a quantity, paired with itself, backwards, is zero. We brought a cotangent bundle to a knife fight and the knife was $\omega(X,X)=0$.
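"For any $H$ whatsoever" is a claim worth poking at. The sketch below, with test Hamiltonians and function names chosen by us, evaluates the energy rate $\nabla H \cdot X$ under both dictionaries: the symplectic one ($X = J\nabla H$) and the metric one ($X = \nabla H$). Derivatives are central differences, so this is a numerical illustration, not a proof.

```python
import math

def grad(H, q, p, h=1e-5):
    """Central-difference approximation of (dH/dq, dH/dp)."""
    dHdq = (H(q + h, p) - H(q - h, p)) / (2 * h)
    dHdp = (H(q, p + h) - H(q, p - h)) / (2 * h)
    return dHdq, dHdp

def energy_rate_hamiltonian(H, q, p):
    """dH/dt along X_H = (dH/dp, -dH/dq): the two terms cancel."""
    a, b = grad(H, q, p)
    return a * b + b * (-a)

def energy_rate_gradient(H, q, p):
    """dH/dt along the plain gradient: |grad H|^2, never negative."""
    a, b = grad(H, q, p)
    return a * a + b * b

hamiltonians = {
    "spring": lambda q, p: 0.5 * (q * q + p * p),
    "pendulum": lambda q, p: 0.5 * p * p - math.cos(q),
    "quartic": lambda q, p: 0.5 * p * p + 0.25 * q ** 4,
}

for name, H in hamiltonians.items():
    print(name,
          energy_rate_hamiltonian(H, 0.8, -1.3),
          energy_rate_gradient(H, 0.8, -1.3))
```

The cancellation holds even with noisy finite-difference derivatives, because it is a property of the rotation, not of the accuracy of $\nabla H$.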

The flow preserves area, two ways

Energy conservation pins each trajectory to one ring. There's a second, deeper invariant: the flow preserves area in phase space. The beginner-friendly proof is one line of partial derivatives. A flow inflates or shrinks volume according to the divergence of its vector field, so just compute it:

$$\nabla\cdot X_H = \frac{\partial}{\partial q}\!\left(\frac{\partial H}{\partial p}\right) + \frac{\partial}{\partial p}\!\left(-\frac{\partial H}{\partial q}\right) = \frac{\partial^2 H}{\partial q\,\partial p} - \frac{\partial^2 H}{\partial p\,\partial q} = 0 .$$

It vanishes because mixed partials commute, the most boring fact in calculus. A divergence-free flow is incompressible: it can stir phase space but never compress or inflate a patch of it. That is Liouville's theorem, and for a Hamiltonian system you get it for free.
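The second, coordinate-free route is where the "closed" property of $\omega$ gets cashed in. What follows is a sketch using the standard statement of Cartan's magic formula, $\mathcal{L}_X = d\,\iota_X + \iota_X\,d$, where the Lie derivative $\mathcal{L}_X\omega$ measures how $\omega$ changes when dragged along the flow of $X$ (the notation $\mathcal{L}_X$ is ours; the post has not introduced it):

```latex
\mathcal{L}_{X_H}\omega
  = d(\iota_{X_H}\omega) + \iota_{X_H}(d\omega)
  = d(dH) + \iota_{X_H}(0)
  = 0 .
```

The first term dies because $d(dH) = 0$, which is the same commuting-mixed-partials fact as before; the second dies because $\omega$ is closed. A vanishing Lie derivative says the flow carries $\omega$ along unchanged, and since $\omega$ is the area instrument, area is preserved.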

Mathematical overkill level: 8/10
concept: Cartan's Magic Formula
why?

Drop a small square patch of initial conditions into phase space and let it flow. For the spring, the real (Hamiltonian) flow just turns the patch rigidly: same shape, same area forever. A field with non-zero divergence, like the MLP's, balloons or shrinks it. Toggle between them and watch the area readout:

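[Interactive figure: Liouville's theorem. Does phase-space area survive the flow? A square patch is advected in the (q, p) plane, with a readout of its phase-space area relative to the starting value of 1.00. Caption for the divergent field: the square balloons as it turns. A field with divergence cannot keep area fixed, so it cannot be the flow of any energy.]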

For the spring, the Hamiltonian flow is a rigid rotation, so Liouville's theorem here is the thrilling claim that spinning a shape doesn't change its area. We know. But the one-line proof that makes it obvious for the spring is the identical proof that makes it true for a galaxy of stars or a Hamiltonian Monte Carlo sampler. That is the whole point of the sledgehammer: the spring is where you can check, by hand, that the machinery is honest.
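The patch experiment can be sketched by hand. The code below makes two assumptions of ours: it uses the exact flow of the spring (a rotation) rather than a numerical integrator, and it stands in for a divergent learned field with the hypothetical linear field $(p + \epsilon q,\ -q + \epsilon p)$, whose exact flow is the same rotation scaled by $e^{\epsilon t}$. Area is measured with the shoelace formula, which is exact here because linear maps send polygons to polygons.

```python
import math

def polygon_area(points):
    """Shoelace formula for the area of a simple polygon."""
    total = 0.0
    n = len(points)
    for i in range(n):
        q0, p0 = points[i]
        q1, p1 = points[(i + 1) % n]
        total += q0 * p1 - q1 * p0
    return abs(total) / 2.0

def spring_flow(point, t):
    """Exact flow of (dq/dt, dp/dt) = (p, -q): a clockwise rotation by angle t."""
    q, p = point
    return (q * math.cos(t) + p * math.sin(t),
            -q * math.sin(t) + p * math.cos(t))

def leaky_flow(point, t, eps):
    """Exact flow of the hypothetical field (p + eps*q, -q + eps*p)."""
    q, p = spring_flow(point, t)
    scale = math.exp(eps * t)
    return (scale * q, scale * p)

square = [(1.0, 0.0), (1.5, 0.0), (1.5, 0.5), (1.0, 0.5)]
t, eps = 2 * math.pi, -0.05   # one full period; eps is an arbitrary illustrative value

area0 = polygon_area(square)
ratio_hamiltonian = polygon_area([spring_flow(pt, t) for pt in square]) / area0
ratio_leaky = polygon_area([leaky_flow(pt, t, eps) for pt in square]) / area0

print(f"Hamiltonian flow: area x{ratio_hamiltonian:.3f}")
print(f"leaky flow:       area x{ratio_leaky:.3f}")
```

The leaky field has divergence $2\epsilon$, and the area ratio it produces is $e^{2\epsilon t}$, which is Liouville's theorem read in reverse: the divergence is exactly the logarithmic rate of area change.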


Movement II — The network never sees an energy

Here is the part about Hamiltonian Neural Networks that, in our opinion, just doesn't get enough love.

The training target is a lie of omission

A Hamiltonian Neural Network (Greydanus, Dzamba & Yosinski, 2019) is almost insultingly simple to state: instead of a network that outputs the vector field $(\tfrac{dq}{dt}, \tfrac{dp}{dt})$, you build a network $H_\theta(q, p)$ that outputs a single number, and you define the dynamics as its symplectic gradient, the exact $\iota_{X_H}\omega = dH$ construction we derived above, computed with autodiff.

But look hard at what you train it on. You have trajectory data: states $(q, p)$ and their measured slopes $(\tfrac{dq}{dt}, \tfrac{dp}{dt})$. You do not have energy labels. Nobody ever measured $H$. The loss compares the network's derivatives to the observed slopes:

$$\mathcal{L}(\theta) = \left\lVert\, \frac{\partial H_\theta}{\partial p} - \frac{dq}{dt} \,\right\rVert^{2} + \left\lVert\, -\frac{\partial H_\theta}{\partial q} - \frac{dp}{dt} \,\right\rVert^{2}$$

Every term is a derivative of the network. The network's actual output, the scalar $H_\theta$, appears nowhere in the loss. It is never supervised, never compared to a target, never even named in the data. The Hamiltonian is a latent variable: a hidden scalar the network is forced to invent, whose only job is to have the right slopes.

WARN

You are doing supervised regression on the gradient of a function while leaving the function itself completely unobserved. It's like being handed thousands of slope readings from across a mountain range and asked to reconstruct the terrain. You can do it, but only up to one thing slopes can never tell you: the absolute altitude. Sea level is a free parameter. So is the additive constant in $H$ (add $5$ to $H$ and $dH$ doesn't budge). The dynamics don't care, because they only ever see $dH$, never $H$.
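The sea-level ambiguity shows up directly in the loss. In the sketch below, which uses finite differences in place of autodiff and candidate functions picked by us, the slope loss cannot tell the true energy from the true energy plus five, while a genuinely wrong energy is penalized.

```python
def grad(H, q, p, h=1e-5):
    """Central-difference approximation of (dH/dq, dH/dp)."""
    dHdq = (H(q + h, p) - H(q - h, p)) / (2 * h)
    dHdp = (H(q, p + h) - H(q, p - h)) / (2 * h)
    return dHdq, dHdp

def slope_loss(H, samples):
    """Mean squared mismatch between X_H = (H_p, -H_q) and the observed slopes."""
    total = 0.0
    for q, p, dq, dp in samples:
        dHdq, dHdp = grad(H, q, p)
        total += (dHdp - dq) ** 2 + (-dHdq - dp) ** 2
    return total / len(samples)

# Synthetic "observations" of the unit spring on a grid: (q, p, dq/dt, dp/dt) = (q, p, p, -q).
samples = [(q / 4, p / 4, p / 4, -q / 4)
           for q in range(-8, 9) for p in range(-8, 9)]

H_true = lambda q, p: 0.5 * (q * q + p * p)
H_shifted = lambda q, p: 0.5 * (q * q + p * p) + 5.0   # same slopes, different sea level
H_wrong = lambda q, p: 0.5 * q * q + p * p             # different slopes

for name, H in [("true", H_true), ("shifted", H_shifted), ("wrong", H_wrong)]:
    print(f"{name:8s} loss = {slope_loss(H, samples):.3e}")
```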

"Is there an energy at all?" is a geometry question

Now the deep bit, and we are going to earn it rather than assert it. Recovering a scalar from its slopes is only possible if the slopes are genuinely the slopes of something. Not every field of arrows is a gradient. So: given a learned field $X = (X_q, X_p)$, when does a potential $H$ exist with $X = X_H$?

Run it through the same machine. The 1-form attached to $X$ is $\iota_X\omega = -X_p\,dq + X_q\,dp$. For this to equal $dH$, it has to be closed: its exterior derivative must vanish. For any 1-form $\alpha = a\,dq + b\,dp$ that derivative is

$$d\alpha = \left(\frac{\partial b}{\partial q} - \frac{\partial a}{\partial p}\right) dq \wedge dp .$$

Substitute $a = -X_p$, $b = X_q$:

$$d(\iota_X \omega) = \left(\frac{\partial X_q}{\partial q} + \frac{\partial X_p}{\partial p}\right) dq \wedge dp = (\nabla\cdot X)\; dq \wedge dp .$$

So the abstract condition "$\iota_X\omega$ is closed" is, in plain coordinates, exactly "$X$ is divergence-free." And on a simply-connected patch like our plane, closed implies exact (the Poincaré lemma), so a potential $H$ exists if and only if the field has zero divergence. The symplectic field $(p,-q)$ has divergence $\partial p/\partial q + \partial(-q)/\partial p = 0$ and passes. A generic MLP that regresses $(\tfrac{dq}{dt}, \tfrac{dp}{dt})$ directly has no reason on earth to be divergence-free, so its 1-form is not closed, so no $H$ exists, so there is nothing for a conservation law to even be about.
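The divergence test is cheap to run on any field you can evaluate. A minimal sketch, assuming central differences and, as a stand-in for a learned field, the hypothetical leaky field $(p + \epsilon q,\ -q + \epsilon p)$ with $\epsilon = 0.1$:

```python
def divergence(field, q, p, h=1e-5):
    """Central-difference estimate of dX_q/dq + dX_p/dp at (q, p)."""
    dXq_dq = (field(q + h, p)[0] - field(q - h, p)[0]) / (2 * h)
    dXp_dp = (field(q, p + h)[1] - field(q, p - h)[1]) / (2 * h)
    return dXq_dq + dXp_dp

def spring_field(q, p):
    return (p, -q)

def leaky_field(q, p, eps=0.1):
    # Hypothetical non-Hamiltonian field; its exact divergence is 2 * eps.
    return (p + eps * q, -q + eps * p)

print(divergence(spring_field, 0.7, -0.4))  # passes: an H can exist
print(divergence(leaky_field, 0.7, -0.4))   # fails: no H can exist
```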

INFO

You can watch the obstruction. By Green's theorem, integrating the recovered slope-form around a closed loop equals the divergence piled up inside it:

$$\oint_{\partial R} \iota_X\omega = \iint_R (\nabla\cdot X)\; dq\,dp .$$

If $X$ is divergence-free the right side is zero, so the loop integral comes home to where it started and a single-valued $H$ exists. If not, you get a non-zero gap: the recovered "energy" is path-dependent, which is mathematician for "it isn't a function at all." The slider below dials the divergence; the gap you see is literally that area integral.

Drag the slider below from real physics toward what an MLP learns, and watch the recovered altitude refuse to return to where it started:

[Interactive figure: can this field of arrows come from an energy at all? Left panel: the field in the (q, p) plane and the loop we walk. Right panel: recovered altitude as we walk one full lap, measured against the start altitude.]

An MLP only ever sees the field of arrows on the left, never an energy value. To test whether an energy could even exist, we try to rebuild its "altitude" by walking once around a closed loop (we use a circle, but any loop gives the same verdict). The rule: a real energy must bring you back to the altitude you started at. Flat and home means an energy exists; a line that ends higher than it began means it doesn't.

At real physics the arrows circulate (no leak); slide toward the MLP and they spiral outward, a source the energy can't explain. In the state shown, every lap climbs by Δ = 2.54 and never comes back down: an Escher staircase. No energy can hand the same point two different altitudes, so no energy exists at all. That is exactly why a free-form MLP can't conserve one: there is nothing there to conserve.

DANGER

A direct-field MLP doesn't fail to conserve energy because it was under-trained. It fails because the 1-form of the field it learned is not closed, so the loop integral is path-dependent, so there is no single-valued energy function in existence for it to conserve. The HNN conserves energy for the opposite reason: by only ever emitting the flow defined by $dH$ of an actual scalar, every field it can possibly produce has an exact 1-form by construction, and is therefore divergence-free, area-preserving, and energy-conserving, with no extra loss term and no penalty. Conservation isn't learned. It's the only thing the architecture can express.

The only thing this test can never pin down is the overall "sea level", the additive constant in $H$ from earlier. Which is fine, because the dynamics never depended on it.
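The loop test itself fits in a dozen lines. This sketch integrates $\iota_X\omega = -X_p\,dq + X_q\,dp$ counterclockwise around a circle with a midpoint rule and compares the gap with the Green's-theorem prediction, divergence times enclosed area. The leaky field and the value `eps = 0.1` are our illustrative assumptions, not the widget's settings.

```python
import math

def loop_gap(field, radius=1.0, n=2000):
    """Midpoint-rule integral of -X_p dq + X_q dp counterclockwise around a circle."""
    total = 0.0
    ds = 2 * math.pi / n
    for k in range(n):
        s = (k + 0.5) * ds
        q, p = radius * math.cos(s), radius * math.sin(s)
        dq, dp = -radius * math.sin(s) * ds, radius * math.cos(s) * ds
        Xq, Xp = field(q, p)
        total += -Xp * dq + Xq * dp
    return total

def spring_field(q, p):
    return (p, -q)

def leaky_field(q, p, eps=0.1):
    # Hypothetical stand-in for a learned field; divergence is 2 * eps everywhere.
    return (p + eps * q, -q + eps * p)

gap_spring = loop_gap(spring_field)
gap_leaky = loop_gap(leaky_field)
green_prediction = (2 * 0.1) * (math.pi * 1.0 ** 2)   # divergence times disc area

print(f"spring gap: {gap_spring:.2e}")
print(f"leaky gap:  {gap_leaky:.6f}  (Green's theorem predicts {green_prediction:.6f})")
```

A zero gap means the walk comes home and a single-valued $H$ exists; a non-zero gap is the Escher staircase, one step per lap.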


Movement III — Conservation is a symmetry in disguise

This is the section that makes us ask: "If Emmy Noether were alive to see her work utilized by cavemen like us, would she be excited or horrified?"

We have energy conservation. Time for the theorem that says energy was never special to begin with.

Mathematical overkill level: 10/10
concept: Noether's Theorem + the Poisson Bracket
ridiculous

The Poisson bracket, derived

Noether's theorem says every continuous symmetry of a system has a matching conserved quantity, and vice versa. In our symplectic language it becomes startlingly mechanical. First, one new gadget: the Poisson bracket of two functions, defined as $\omega$ evaluated on their two Hamiltonian fields, $\{F, G\} = \omega(X_F, X_G)$. Expand it with the coordinate formulas we already derived ($X_F = (F_p, -F_q)$, same for $G$):

$$\{F, G\} = \omega(X_F, X_G) = (F_p)(-G_q) - (-F_q)(G_p) = \frac{\partial F}{\partial q}\frac{\partial G}{\partial p} - \frac{\partial F}{\partial p}\frac{\partial G}{\partial q} .$$

Now the one identity that runs the whole show. How does any quantity $G$ change along the dynamics? Chain rule once more, then substitute Hamilton's equations:

$$\frac{dG}{dt} = \frac{\partial G}{\partial q}\frac{dq}{dt} + \frac{\partial G}{\partial p}\frac{dp}{dt} = \frac{\partial G}{\partial q}\frac{\partial H}{\partial p} - \frac{\partial G}{\partial p}\frac{\partial H}{\partial q} = \{G, H\} .$$

Read it both ways and Noether falls out for free:

  • $\{G, H\} = 0 \iff G$ is conserved as the system evolves under $H$.
  • By antisymmetry $\{G, H\} = -\{H, G\}$, that's the same equation as "$H$ is unchanged by the flow generated by $G$", i.e. $G$ generates a symmetry.

Symmetry and conservation are, quite literally, one equation $\{G, H\} = 0$ read left-to-right or right-to-left. Energy is conserved because $\{H, H\} = 0$: antisymmetry, again, the gift that keeps on giving.

Pick a quantity below and watch its value as the spring swings. The widget offers three: the energy $\tfrac12(q^2+p^2)$ with $\{G,H\}=0$, the position $q$ with $\{G,H\}=p$, and a stretch $qp$ with $\{G,H\}=p^2-q^2$. Only the energy has a vanishing bracket, and only the energy stays flat:

[Interactive figure: which quantities stay constant as the spring swings? Left panel: the orbit in the (q, p) plane drawn over the level sets of the selected quantity; conserved ⟺ the orbit runs along them, not across. Right panel: value of the selected quantity over time against its starting value, with the test {G, H} = 0 → conserved.]

Some quantities hold dead still while the spring oscillates; others slosh around. Noether's theorem says the constant ones are exactly the symmetries. With the energy ½(q²+p²) selected, the plot is a flat line: the energy never changes while the spring swings. It is conserved, and it is precisely the charge of the rotation symmetry of phase space.
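The three brackets quoted for the widget can be checked numerically. A minimal sketch, assuming central-difference derivatives and an arbitrary test point of our choosing:

```python
def partials(F, q, p, h=1e-5):
    """Central-difference approximation of (dF/dq, dF/dp)."""
    Fq = (F(q + h, p) - F(q - h, p)) / (2 * h)
    Fp = (F(q, p + h) - F(q, p - h)) / (2 * h)
    return Fq, Fp

def poisson(F, G, q, p):
    """{F, G} = F_q G_p - F_p G_q at the point (q, p)."""
    Fq, Fp = partials(F, q, p)
    Gq, Gp = partials(G, q, p)
    return Fq * Gp - Fp * Gq

H = lambda q, p: 0.5 * (q * q + p * p)   # the spring's energy
position = lambda q, p: q
stretch = lambda q, p: q * p

q0, p0 = 1.7, 0.4
print("{H, H}  =", poisson(H, H, q0, p0))          # vanishes: conserved
print("{q, H}  =", poisson(position, H, q0, p0))   # should be p
print("{qp, H} =", poisson(stretch, H, q0, p0))    # should be p^2 - q^2
```

Since $\tfrac{dG}{dt} = \{G, H\}$, each printed value is also the instantaneous rate of change of that quantity at the test point.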

Uhhh.. So?

Here is the thing we find genuinely lovely, and the reason the latent scalar from Movement II matters so much:

INFO

Let's be precise, because the honest version is better than the hype. The HNN conserves energy no matter what, because it always has some scalar $H_\theta$ and $\{H_\theta, H_\theta\} = 0$: antisymmetry, one final time. It does not automatically conserve momentum or angular momentum; a plain $H_\theta(q,p)$ has no reason to.

But here is the lever a direct-field MLP will never have: impose a symmetry on the learned energy, and Noether hands you back the matching conserved quantity, exactly, for free. Build an $H_\theta$ that is blind to absolute position, and momentum is conserved. Blind to absolute orientation, and angular momentum is. The MLP has no scalar, so no bracket, so no Noether. It is not that it conserves the wrong things; it has nowhere for a conserved quantity to live.

On the spring, that lever has exactly one notch. A single degree of freedom bolted to a wall has one continuous symmetry, the phase-space rotation, generated (in a coincidence special to the oscillator) by $H$ itself, so Noether politely hands us back the energy we already proved in Movement I. The wall is precisely why momentum leaks: it pins the potential $\tfrac12 q^2$ to an absolute position, breaking translation symmetry. Cut the wall, let two masses interact freely, and translation symmetry, and with it conserved momentum, comes right back. The real power shows up the moment a system is big enough to own more than one symmetry.
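Cutting the wall is easy to try. The sketch below is our own toy setup, not the post's experiment: two unit masses with either a spring between them (an energy that depends only on $q_1 - q_2$) or a spring from each mass to a wall at the origin. The field is built from $H$ by central differences and integrated with RK4; the step size, duration, and starting state are arbitrary illustrative choices.

```python
def H_pair(q1, q2, p1, p2):
    """Two unit masses joined by a unit spring: depends on q1 - q2 only."""
    return 0.5 * (p1 * p1 + p2 * p2) + 0.5 * (q1 - q2) ** 2

def H_wall(q1, q2, p1, p2):
    """Each mass tied to a wall at the origin: translation symmetry is broken."""
    return 0.5 * (p1 * p1 + p2 * p2) + 0.5 * (q1 * q1 + q2 * q2)

def make_field(H, h=1e-5):
    """Hamiltonian field (H_p1, H_p2, -H_q1, -H_q2) via central differences."""
    def field(state):
        def d(i):
            up, dn = list(state), list(state)
            up[i] += h
            dn[i] -= h
            return (H(*up) - H(*dn)) / (2 * h)
        return [d(2), d(3), -d(0), -d(1)]
    return field

def rk4_step(f, s, dt):
    k1 = f(s)
    k2 = f([x + 0.5 * dt * k for x, k in zip(s, k1)])
    k3 = f([x + 0.5 * dt * k for x, k in zip(s, k2)])
    k4 = f([x + dt * k for x, k in zip(s, k3)])
    return [x + dt * (a + 2 * b + 2 * c + e) / 6.0
            for x, a, b, c, e in zip(s, k1, k2, k3, k4)]

def total_momentum_after(H, state, t_end=20.0, dt=0.01):
    f = make_field(H)
    for _ in range(int(round(t_end / dt))):
        state = rk4_step(f, state, dt)
    return state[2] + state[3]

start = [0.0, 1.5, 0.7, -0.2]   # (q1, q2, p1, p2); total momentum starts at 0.5

P_pair = total_momentum_after(H_pair, start)
P_wall = total_momentum_after(H_wall, start)
print(f"shift-invariant H: total momentum {P_pair:.6f}")
print(f"walled H:          total momentum {P_wall:.6f}")
```

Nothing in the integrator knows about momentum. The shift-invariant energy keeps $p_1 + p_2$ fixed because $\partial H/\partial q_1 + \partial H/\partial q_2 = 0$; the walled energy has no such cancellation.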

Why an ML person should care: this is weight-sharing for physics

YOU BETTER NOT SKIP THIS ONE

Okay — this one box is the entire reason we wrote this article. We know most of you are just thumb-scrolling by now, so we made it as loud and obnoxious as humanly possible, purely so you would stop and go what in god's name is happening here. Now that we have your attention:

Tell an ML person you built a model that conserves energy and you get a polite nod. But every conservation law is, secretly, a generalization guarantee — that is the entire content of Noether's theorem. "Conserves energy" already means "the learned dynamics don't depend on absolute time: fit them on one window and they hold for all time." Make the energy blind to position too, and the same theorem hands you conserved momentum, which reads "fit the system in one place and it is correct everywhere." That second sentence is a CNN's weight-sharing, restated for things that move. Conservation was never the selling point: it was generalization wearing a physicist's hat.

Somewhere in here, buried deep in these gorgeous mechanics, there has to be a holy grail of ML generalization that we are, unfortunately, collectively too dumb to figure out and use to its potential.

If "a symmetry gives you a conservation law" still sounds like a physicist's party trick, notice that you already bet your career on the same idea. A CNN bakes in translation symmetry through weight-sharing: a feature learned in one corner of an image works in every corner. You don't hope the network discovers that a cat is a cat wherever it lands — you build it in, shrinking the hypothesis space to functions that already respect the symmetry. The payoff is the whole reason CNNs won: far fewer parameters, and a pattern seen once generalizes across the entire plane.

A symmetry-constrained HNN is the identical move, one level up. Make $H_\theta$ depend only on relative coordinates and it goes blind to rigid shifts of the whole system. Because invariance of the scalar becomes equivariance of the flow $X_H = J\nabla H_\theta$, the two things you actually want fall straight out:

  • Generalization, for free. Train near the origin, and the shift-invariant energy makes the learned dynamics automatically correct for the same system placed anywhere. You never had to show it examples at every location — the symmetry transports them for you. That is a CNN's "learn the pattern once, apply everywhere," but for motion.
  • An exact invariant, as a bonus. Noether upgrades that same shift-invariance into conserved momentum — held to machine precision along every trajectory. A CNN's equivariance only ever buys generalization; in a dynamical setting the identical constraint buys generalization and a conservation law. You get to cash the symmetry twice.

This is just the geometric deep learning program — CNNs, GNNs, equivariant nets — restated for things that move: name the symmetries your problem has, build them into the energy, and let the geometry hand back both the invariances and the sample-efficiency. The HNN is simply the member of that family whose scalar happens to be an energy, and Noether is the receipt.


Let's actually build the thing

Enough cathedral-building. The astonishing part is how little code turns all of that geometry into a working network. The symplectic gradient $\iota_{X_H}\omega = dH$, the conceptual heart of three movements, is two lines of autograd. We deployed a cotangent bundle, Cartan's magic formula, and Noether's theorem to justify two lines of code. We regret nothing.

import torch
import torch.nn as nn
 
 
class HNN(nn.Module):
    """Outputs a single number — the latent energy H(q, p) — and nothing else."""
 
    def __init__(self, hidden=128):
        super().__init__()
        self.H = nn.Sequential(
            nn.Linear(2, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )
 
    def field(self, x):
        """The symplectic gradient J @ grad(H). This *is* the equation iota_{X_H} w = dH, in code."""
        x = x.requires_grad_(True)
        H = self.H(x).sum()                          # one scalar to differentiate
        dH, = torch.autograd.grad(H, x, create_graph=True)
        dHdq, dHdp = dH[:, 0], dH[:, 1]
        return torch.stack([dHdp, -dHdq], dim=1)     # the 90-degree turn. yes, we had an entire section about a minus sign.

That torch.stack([dHdp, -dHdq]) is the $90^\circ$ rotation. The minus sign is the antisymmetry of $\omega$. Everything we proved about conservation is now structurally true of this network, and there's no way to write it down without conservation: the architecture literally cannot represent a divergence-bearing field.

Training matches slopes, never energies — the unsupervised target from Movement II:

model = HNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
 
for step in range(2000):
    x, dx_true = sample_spring_batch()       # states and their *observed* slopes (data helper, not shown here)
    dx_pred = model.field(x)
    loss = ((dx_pred - dx_true) ** 2).mean() # we never tell it what the energy is
    opt.zero_grad()
    loss.backward()
    opt.step()
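The loop above calls sample_spring_batch() without showing it. A minimal sketch of what such a sampler might look like follows, assuming the unit spring $H = \tfrac12(q^2 + p^2)$ so that the true slopes are $(p, -q)$. The batch size, sampling radius, and optional noise level are hypothetical defaults of ours, not values from the actual script.

```python
import torch


def sample_spring_batch(batch_size=256, radius=3.0, noise=0.0):
    """Hypothetical sampler: random states (q, p) and the spring's slopes.

    Assumes H = 1/2 (q^2 + p^2), hence (dq/dt, dp/dt) = (p, -q).
    `radius` and `noise` are illustrative choices, not taken from the post.
    """
    # States drawn uniformly from the square [-radius, radius]^2.
    x = (torch.rand(batch_size, 2) * 2 - 1) * radius
    q, p = x[:, 0], x[:, 1]
    # Hamilton's equations for the unit spring.
    dx = torch.stack([p, -q], dim=1)
    # Optional measurement noise on the observed slopes.
    dx = dx + noise * torch.randn_like(dx)
    return x, dx


x, dx_true = sample_spring_batch(batch_size=8)
print(x.shape, dx_true.shape)
```

Note that the sampler hands back slopes only. The energy never appears in the training data, which is the whole point.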

And the baseline that invents free energy — same capacity, same data, but it emits a field directly, with no scalar and therefore no geometry:

class MLP(nn.Module):
    """Learns the vector field directly. There is no H anywhere in here."""
 
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 2),                # (dq/dt, dp/dt), and not a scalar in sight
        )
 
    def field(self, x):
        return self.net(x)

Train both, then integrate each forward for fifty full periods with RK4, a plain, non-symplectic integrator, so nothing is quietly conserving energy for the networks. (Reach for a symplectic integrator here and its own structure could mask part of the drift, muddying the comparison.) Here is an actual, reproducible run; every number below came out of the script, none were typed by hand:

terminal
$ python train_spring.py
[data] H = 1/2 (q^2 + p^2), true field (dq/dt, dp/dt) = (p, -q)
[HNN ] step    0 | slope MSE 3.89e+00
[HNN ] step 2000 | slope MSE 4.01e-05
[MLP ] step    0 | slope MSE 4.15e+00
[MLP ] step 2000 | slope MSE 3.41e-05

rollout: RK4 (non-symplectic), 50 periods, from (q,p)=(2,0), H0=2.000
              final drift   max drift   area
true field       -0.000%      0.000%    x1.000   <- integrator floor
HNN              +0.003%      0.205%    x1.001
MLP             -26.777%     27.023%    x0.659

Both networks fit the training slopes essentially perfectly, to ~4e-5, so the MLP is not a worse function approximator. The true field row is the control: RK4 itself leaks essentially nothing over fifty periods, so any drift in the rows beneath it is the model's doing, not the integrator's. And the two models could not be more different. The HNN holds the true energy to 0.2% at its very worst and keeps phase-space area pinned at ×1.001. The MLP bleeds away 27% of its energy, its orbit spiralling inward as its phase-space area collapses to ×0.66: a slow heat-death, exactly the non-zero divergence Movement I warned about. The difference was never accuracy. It was that one of them was allowed to learn a field with non-zero divergence, and the other, by construction, was not.
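For the curious, the rollout and the drift measurement can be sketched roughly as follows. This is our reconstruction under stated assumptions, not the actual train_spring.py: the step count per period (200) and the helper names `rk4_step`, `rollout`, and `energy` are hypothetical, and it runs the control row only, using the true field $(p, -q)$ in float64.

```python
import math
import torch


def rk4_step(field, x, dt):
    """One classical fourth-order Runge-Kutta step for dx/dt = field(x)."""
    k1 = field(x)
    k2 = field(x + 0.5 * dt * k1)
    k3 = field(x + 0.5 * dt * k2)
    k4 = field(x + dt * k3)
    return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)


def rollout(field, x0, periods=50, steps_per_period=200):
    """Integrate forward; the unit spring has period 2*pi."""
    dt = 2 * math.pi / steps_per_period
    x = x0.clone()
    traj = [x]
    for _ in range(periods * steps_per_period):
        # Detach so a learned field does not accumulate an autograd graph.
        x = rk4_step(field, x, dt).detach()
        traj.append(x)
    return torch.stack(traj)              # shape (T + 1, batch, 2)


def energy(x):
    """True spring energy H = 1/2 (q^2 + p^2)."""
    return 0.5 * (x ** 2).sum(dim=-1)


def true_field(x):
    """Ground-truth slopes (dq/dt, dp/dt) = (p, -q)."""
    return torch.stack([x[:, 1], -x[:, 0]], dim=1)


x0 = torch.tensor([[2.0, 0.0]], dtype=torch.float64)
traj = rollout(true_field, x0)
H = energy(traj).squeeze(-1)
H0 = H[0]
final_drift = ((H[-1] - H0) / H0).item()
max_drift = ((H - H0).abs().max() / H0).item()
print(f"final drift {final_drift:+.3%}   max drift {max_drift:.3%}")
```

To score a trained network instead, pass its `field` method in place of `true_field` and use an `x0` of the model's dtype. The energy is always measured with the true $H$, never the learned one.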


Where the geometry runs out

We would be bad guests if we wheeled out this much machinery and pretended it solved everything. The symplectic story is exact, beautiful, and assumes a frictionless, energy-conserving, canonically-coordinatized universe. Reality is rarely so polite.

The instant you add friction, energy is genuinely not conserved, $\omega$ is genuinely not preserved, and a pure HNN is now confidently wrong. The fixes restore a weaker structure: port-Hamiltonian and Dissipative HNNs learn an energy term and a separate dissipation term, and contact geometry generalizes the symplectic form to systems that bleed energy on purpose. The spring with a damper is no longer a knife fight; bring the cotangent bundle back.
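To see the failure concretely, take the same unit spring and bolt on a linear damper. The damping coefficient $\gamma > 0$ is our assumption for illustration; the rest follows from the chain rule.

```latex
\begin{aligned}
\dot q &= p, \qquad \dot p = -q - \gamma p, \qquad H = \tfrac12\,(q^2 + p^2) \\[4pt]
\frac{dH}{dt} &= q\,\dot q + p\,\dot p = qp + p\,(-q - \gamma p) = -\gamma\,p^2 \;\le\; 0 \\[4pt]
\nabla \cdot (\dot q, \dot p) &= \frac{\partial p}{\partial q} + \frac{\partial (-q - \gamma p)}{\partial p} = -\gamma \;\neq\; 0
\end{aligned}
```

Energy now drains at rate $\gamma p^2$ and the field has constant divergence $-\gamma$. A network whose output is always $J\,\nabla H$ has zero divergence by construction, so it cannot fit this field no matter how long you train it.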


Closing thoughts

Hamilton wrote down these equations in the 1830s, to reformulate a mechanics that already worked. Joseph-Louis Lagrange had laid the foundations with Lagrangian mechanics in the 1750s, when he mailed his work to Euler. They were not thinking about gradient descent, automatic differentiation, or the regrettable tendency of neural networks to manufacture energy from nothing. And yet the cleanest way we have today to stop a network from violating thermodynamics is to hand it their 200-year-old geometry and let the antisymmetry of a 2-form do the rest.

That's the lesson we keep relearning, in security and in ML alike: the win usually isn't a bigger hammer. It's noticing what shape the problem already has, and refusing to learn anything that doesn't fit. Give the network a scalar instead of a field and conservation stops being a thing you hope for and becomes a thing that cannot not happen. We didn't teach it to conserve energy. We made non-conservation inexpressible. Good luck creating free energy now, bozo.

Are you all quite finished?

A mass on a frictionless spring, Unbothered

We are. The spring, of course, is still going.


Normally this is where we pitch you a consultation. But if you read a differential-geometry-of-neural-networks post to the end, you are not a sales lead — you are one of us.

Didn't like our post? Send your hatemails here.

Are you named Sam Greydanus? We absolutely love your work. 👉👈