Hamiltonian Neural Networks from a Differential Geometry Perspective
Because when given the simplest nail in the universe, sometimes you just need a nuclear-powered sledgehammer.
We have committed a crime.
What do you do when you have a frontier concept that is just a few years old and is already complicated enough? Well, you set up a Tinder date between that concept and one of the most haunting branches of mathematics, of course. So we are taking a look at Hamiltonian Neural Networks from a differential-geometry perspective.
Look, we didn't ask for this. But nowadays it seems every single piece of ML research boils down to the same thing:

Do they get the job done? Usually — at significant compute cost. But they might as well be the single most uninspired architecture we have. So this time we're doing the opposite. We're going to chase something beautiful and let the structure of the problem do the heavy lifting, instead of throwing a billion parameters at it and praying to hyperparameter deities that our losses converge.
We are not the first to think the aesthetics matter:

“It is more important to have beauty in one's equations than to have them fit experiment.”
Paul Dirac
We are going to work on what is possibly the simplest physical system in the universe — the kind engineering students see in their second physics course and never think about again. And to talk about it, we are going to need symplectic manifolds, differential forms, the Lie derivative, Cartan's magic formula, the Poisson bracket, and Noether's theorem.
If that ratio of machinery-to-mass strikes you as deranged, we never claimed otherwise. But we promise: somewhere underneath this absurdity is a genuinely beautiful idea about machine learning — you can teach a neural network to obey a conservation law it is never shown, purely by changing the shape of the thing it is allowed to learn.
Every terrifying word above has a precise, checkable meaning, and we are going to define each one in terms of things that are easier to digest — partial derivatives, dot products, determinants, the chain rule — and then compute with it, instead of waving our hands and hoping you nod along. If you can differentiate and remember what a determinant is, you should be able to follow every line here. The whole post runs on six tools:
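- a partial derivative $\partial f/\partial x$ — how $f$ changes when you nudge $x$ and freeze everything else;
- the gradient $\nabla f$ — every partial stacked into one vector;
- the dot product, for turning two vectors into a number;
- the multivariable chain rule, $\frac{d}{dt} f(x(t), y(t)) = \frac{\partial f}{\partial x}\dot x + \frac{\partial f}{\partial y}\dot y$;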
- the determinant of a matrix, read as a signed area;
- Green's theorem, which trades a loop integral for the area it encloses.
Everything else — forms, the symplectic structure, Poisson brackets — we build out of those, in front of you. A bit like a cooking show, I guess.
The crime scene
Here's the setup that started it.
Take that mass on a spring. Unit mass, unit spring constant, no friction. Its entire universe is two numbers: position $q$ and momentum $p$. The total energy is
$$H(q, p) = \tfrac{1}{2}p^2 + \tfrac{1}{2}q^2,$$
and the equations of motion are the cleanest things in all of physics. Here $\dot q$ is the velocity (the rate position changes) and $\dot p$ is the force (the rate momentum changes). The first equation below just says velocity equals momentum — true when the mass is $1$. The second is Hooke's law, $F = -kq$ with $k = 1$: pull the mass to the right, and the spring hauls it back left.
$$\dot q = p, \qquad \dot p = -q.$$
Watch the two pictures move together — the mass bobbing in real space on the left, and the state $(q, p)$ tracing a closed loop on the right. That loop, the system drawn in its own coordinates, is the phase space we'll obsess over for the rest of the post:
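If you would rather check the loop than watch it, here is a minimal sketch that evaluates the closed-form solution of the unit spring and confirms the state stays on a ring of constant energy. The names `energy` and `exact_flow` are our own labels for illustration, and the start point is an arbitrary choice.

```python
import math

def energy(q, p):
    """Total energy of the unit-mass, unit-stiffness spring."""
    return 0.5 * p * p + 0.5 * q * q

def exact_flow(q0, p0, t):
    """Closed-form solution of dq/dt = p, dp/dt = -q: a rotation of phase space by angle t."""
    c, s = math.cos(t), math.sin(t)
    return q0 * c + p0 * s, p0 * c - q0 * s

q0, p0 = 1.0, 0.0                      # arbitrary start: stretched, at rest
orbit = [exact_flow(q0, p0, 0.1 * k) for k in range(200)]
energies = [energy(q, p) for q, p in orbit]
spread = max(energies) - min(energies)
print(f"energy spread along the orbit: {spread:.2e}")
```

Every sampled state has the same energy up to floating-point rounding, which is exactly what "a closed loop" means in these coordinates.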
Now do the obvious machine-learning thing.
NO. Nooooo. Put down the transformer. Not that obvious thing.
Collect a pile of states $(q, p)$ and their measured time-derivatives $(\dot q, \dot p)$, and train a perfectly reasonable little MLP to map one to the other. It fits the training data to five decimal places. You feel great. You leave your room and tell your mom "Mama, I did it," as she gives you a confused look. Then you let it predict forward in time for a few hundred swings, and oh no…
The energy does not stay constant. The orbit, which should be a perfect closed circle in phase space, slowly spirals — either inward until the mass grinds to a halt, or outward until your frictionless spring is somehow flinging a mass around with more energy than you gave it. Your neural network has, depending on the sign of its errors, either invented heat death or built a perpetual-motion machine.
The network is not bad at physics. It is bad at geometry. It learned a vector field with no reason to conserve anything, because nothing in its architecture knew that "conservation" was even a category of thing. So it's not the model's fault. It's yours.
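To see how little error it takes, here is a hedged sketch in plain Python. Nothing is trained: `leaky_field` is a hypothetical stand-in for a learned field, equal to the true spring field plus a small radial error `eps` that we picked by hand. Integrating it with a standard RK4 step shows the energy draining or growing depending on the sign of that error.

```python
import math

def leaky_field(q, p, eps):
    """True spring field (p, -q) plus a small radial error; eps is an assumed error size."""
    return p + eps * q, -q + eps * p

def rk4_step(f, q, p, dt):
    """One classical fourth-order Runge-Kutta step for a planar field f."""
    k1q, k1p = f(q, p)
    k2q, k2p = f(q + 0.5 * dt * k1q, p + 0.5 * dt * k1p)
    k3q, k3p = f(q + 0.5 * dt * k2q, p + 0.5 * dt * k2p)
    k4q, k4p = f(q + dt * k3q, p + dt * k3p)
    return (q + dt * (k1q + 2 * k2q + 2 * k3q + k4q) / 6,
            p + dt * (k1p + 2 * k2p + 2 * k3p + k4p) / 6)

def final_energy(eps, periods=50, steps_per_period=200):
    """Energy after integrating from (1, 0), a state whose energy is 0.5."""
    dt = 2 * math.pi / steps_per_period
    f = lambda q, p: leaky_field(q, p, eps)
    q, p = 1.0, 0.0
    for _ in range(periods * steps_per_period):
        q, p = rk4_step(f, q, p, dt)
    return 0.5 * (q * q + p * p)

e_true = final_energy(0.0)       # exact field: energy stays put
e_sink = final_energy(-1e-3)     # tiny inward error: slow heat death
e_source = final_energy(+1e-3)   # tiny outward error: free energy
print(e_true, e_sink, e_source)
```

For this particular stand-in the energy obeys dE/dt = 2 * eps * E, so an error of one part in a thousand compounds into a large drift over fifty periods.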
Fixing that is not a matter of more data or a cleverer loss; it's a matter of giving the network the right shape to learn into. That shape has a name, and the name is symplectic.
So here is the plan. We are going to explain why that spiral is inevitable for a naive model — and how to make it impossible — three times over: once as geometry, once as a learning problem, and once as a symmetry. They turn out to be the same sentence wearing three different hats. Helmets on.
Movement I — The arena is a manifold
Phase space is not a plane
Physics is often built on small white lies we tell ourselves to make a problem easier to hold. It's why, when you ask a physicist "what is spin," they say "it's like a ball spinning, except it's not a ball and it isn't spinning."
One such lie we tell undergrads is that the state of the spring "is a point in the plane." It's close enough to compute with — but calling phase space "a plane" is doing about as much heavy lifting as calling a 747 "a plane." Same word; wildly different object underneath.
Here is the honest version, and then immediately the plain-language version. Position $q$ lives in a configuration space $Q$ (for the spring, just the line of possible positions). Momentum $p$ is not another position — it's the thing you pair with a velocity to extract a number (a kinetic energy). An object whose whole job is "eat a vector, return a number" is a covector, and it lives in a separate space, the dual. Glue one dual space onto every point of $Q$ and you get the cotangent bundle $T^*Q$. That is phase space.
If "covector" made your eyes glaze, good news — you already use one constantly. In ML, when you write $\Delta L \approx \nabla L \cdot \Delta w$, the gradient $\nabla L$ isn't really "an arrow in weight space." It's the gadget you feed a step $\Delta w$ in order to get back a number, the change in loss. That "eat a vector → return a number" gadget is a covector (a 1-form). A plain vector is a direction you move; a covector is a ruler you measure against. Momentum is a ruler, not a move — that's the entire reason phase space isn't a naive plane.
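A tiny numerical illustration of the gradient acting as a ruler. The quadratic `loss`, the weights and the step are all made up for this sketch; the only point is that pairing the gradient with a step returns a number that predicts the change in loss.

```python
def loss(w):
    """A made-up quadratic loss on two weights, purely for illustration."""
    return (w[0] - 1.0) ** 2 + 3.0 * (w[1] + 2.0) ** 2

def grad_loss(w):
    """Hand-computed partial derivatives of the loss above."""
    return [2.0 * (w[0] - 1.0), 6.0 * (w[1] + 2.0)]

def pair(covector, vector):
    """The covector eats a vector and returns one number."""
    return sum(c * v for c, v in zip(covector, vector))

w = [0.5, -1.0]
step = [1e-3, -2e-3]                       # a small move in weight space
predicted = pair(grad_loss(w), step)       # dL(step): a number, not an arrow
w_new = [wi + si for wi, si in zip(w, step)]
actual = loss(w_new) - loss(w)
print(predicted, actual)
```

The two numbers agree to first order in the step size, which is all a 1-form ever promises.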
This sounds like pedantry until it bites. Ask the question the naive picture cannot answer: what does it even mean for a flow to "preserve volume" here, when the two axes carry different units? Position times momentum is an action, not an area — you can't lay a ruler across phase space. You need a purpose-built instrument. Building that instrument, carefully, is the rest of this movement.
The instrument: a symplectic form
That instrument is the symplectic form $\omega$. In canonical coordinates it is
$$\omega = dq \wedge dp.$$
We'll define the pieces from the ground up. A 1-form like $dq$ is a little machine that eats a vector $v = (v_q, v_p)$ and reports one number — here, how far you moved in $q$: $dq(v) = v_q$. A general 1-form is $\alpha = a\,dq + b\,dp$, acting by $\alpha(v) = a\,v_q + b\,v_p$. With that, the differential of our energy is nothing but the total-derivative formula you already know, finally given a name:
$$dH = \frac{\partial H}{\partial q}\,dq + \frac{\partial H}{\partial p}\,dp.$$
A 2-form like $dq \wedge dp$ eats two vectors and returns a number — and this particular one returns the signed area of the parallelogram they span:
$$(dq \wedge dp)(u, v) = u_q v_p - u_p v_q = \det\begin{pmatrix} u_q & v_q \\ u_p & v_p \end{pmatrix}.$$
That's just the $2 \times 2$ determinant you've seen a hundred times — the area of the parallelogram with sides $u$ and $v$. Two facts read straight off it, both of which we cash in later:
- Antisymmetry: swap the inputs and the sign flips, $\omega(u, v) = -\omega(v, u)$.
- Zero on repeats: feed it the same vector twice and you get $\omega(v, v) = 0$ — a "parallelogram" with two identical sides is flat, it has no area.
Tattoo that second fact somewhere visible. It is, almost single-handedly, why energy is conserved.
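Both facts fit in a few lines of Python. This is a minimal sketch of the canonical form on the plane; the function name `omega` and the two test vectors are our own choices.

```python
def omega(u, v):
    """Canonical symplectic form dq ^ dp on vectors u = (u_q, u_p) and v = (v_q, v_p)."""
    return u[0] * v[1] - u[1] * v[0]

u = (2.0, 1.0)
v = (0.5, 3.0)

area = omega(u, v)        # signed area of the parallelogram spanned by u and v
swapped = omega(v, u)     # antisymmetry: same size, opposite sign
repeated = omega(u, u)    # zero on repeats: a flat parallelogram
print(area, swapped, repeated)   # 5.5 -5.5 0.0
```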
$\omega$ carries two structural properties:
- it is closed, written $d\omega = 0$ — there's no "source" of area anywhere (we cash this in exactly once, in a slick proof later), and
- it is non-degenerate — the only vector that measures zero against every other vector is the zero vector itself.
Non-degeneracy is the magic clause, because it makes $\omega$ a perfect, invertible dictionary between vectors and 1-forms: hand it a vector and you get a 1-form; because nothing is invisible to it, you can run the dictionary backwards and turn any 1-form into exactly one vector. A metric $g$ (think: the ordinary dot product) gives the same kind of dictionary — that's literally how the gradient $\nabla H$ is defined, by translating the 1-form $dH$ into a vector. The only difference is that $g$ is symmetric while $\omega$ is antisymmetric, and that one sign flip is the whole post.
The 90° turn, derived (not asserted)
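Our network learns a scalar $H$, whose differential $dH$ is a 1-form — at every point it points "uphill" on the energy landscape. We now turn that 1-form into a flow, through $\omega$. The vector field $X_H$ we want is defined by one deceptively small equation:
$$\iota_{X_H}\omega = dH.$$
Let's not admire it — let's solve it. The symbol $\iota_{X_H}$ (the interior product) just means "plug $X_H$ into the first slot of $\omega$ and leave the second open," which leaves a 1-form behind. Write $X_H = (\dot q, \dot p)$ and compute it directly:
$$\iota_{X_H}\omega = (dq \wedge dp)(X_H, \cdot) = \dot q\,dp - \dot p\,dq.$$
Now set that equal to $dH = \frac{\partial H}{\partial q}\,dq + \frac{\partial H}{\partial p}\,dp$ and match the $dq$ and $dp$ pieces:
$$\dot q = \frac{\partial H}{\partial p}, \qquad \dot p = -\frac{\partial H}{\partial q}.$$
There they are — Hamilton's equations, derived in three lines. Plug in the spring's $H = \tfrac{1}{2}p^2 + \tfrac{1}{2}q^2$ and you get $\dot q = p$, $\dot p = -q$, exactly the equations of motion we started with. And now the "90° rotation" is not a slogan but a computation. Stack the two dictionaries side by side:
$$\nabla H = \begin{pmatrix} \partial H/\partial q \\ \partial H/\partial p \end{pmatrix} \ (\text{via } g), \qquad X_H = \begin{pmatrix} \partial H/\partial p \\ -\partial H/\partial q \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} \nabla H = J\,\nabla H \ (\text{via } \omega).$$
That matrix $J$ rotates any vector by $90^\circ$ clockwise. So $X_H$ is literally the gradient, turned a quarter turn. The gradient climbs straight up the energy bowl; rotate it and it runs along the contour rings instead. Drag the white dot below and flip between the two dictionaries — same $H$, same rings, opposite fate: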
Feeding dH through ω rotates it 90° onto the rings. The particle rides a level set — energy never moves.
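The quarter turn is easy to check numerically. The sketch below estimates the gradient by central differences and rotates it; `grad`, `symplectic_gradient` and the pendulum energy are our own illustrative choices, not anything taken from a library.

```python
import math

def grad(H, q, p, h=1e-5):
    """Central-difference estimate of (dH/dq, dH/dp)."""
    dHdq = (H(q + h, p) - H(q - h, p)) / (2 * h)
    dHdp = (H(q, p + h) - H(q, p - h)) / (2 * h)
    return dHdq, dHdp

def symplectic_gradient(H, q, p):
    """X_H = J grad(H) with J = [[0, 1], [-1, 0]]: the gradient turned a quarter turn."""
    dHdq, dHdp = grad(H, q, p)
    return dHdp, -dHdq

def spring(q, p):
    return 0.5 * p * p + 0.5 * q * q

def pendulum(q, p):
    """A second, non-quadratic energy, chosen only to show nothing is special to the spring."""
    return 0.5 * p * p - math.cos(q)

g = grad(pendulum, 0.7, -0.3)
x = symplectic_gradient(pendulum, 0.7, -0.3)
dot = g[0] * x[0] + g[1] * x[1]      # X_H is perpendicular to grad(H)
xq, xp = symplectic_gradient(spring, 0.4, 1.5)
print((xq, xp), dot)
```

For the spring the rotated gradient comes out as (p, -q), and for any energy the dot product with the plain gradient is zero: the flow runs along the rings, never across them.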
Why the quarter-turn conserves energy
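Here's the payoff, short enough to do twice. How fast does energy change as the system flows along $X_H$? Multivariable chain rule, then substitute $\dot q = \partial H/\partial p$ and $\dot p = -\partial H/\partial q$:
$$\frac{dH}{dt} = \frac{\partial H}{\partial q}\,\dot q + \frac{\partial H}{\partial p}\,\dot p = \frac{\partial H}{\partial q}\frac{\partial H}{\partial p} - \frac{\partial H}{\partial p}\frac{\partial H}{\partial q} = 0.$$
The two terms are identical with opposite signs. They cancel for any $H$ whatsoever — that's the entire content of energy conservation, and you just checked it with first-year calculus. The slick, coordinate-free way to say precisely the same thing is
$$\frac{dH}{dt} = dH(X_H) = (\iota_{X_H}\omega)(X_H) = \omega(X_H, X_H) = 0,$$
where the final step is free because $\omega$ is antisymmetric, and anything antisymmetric, fed two copies of one vector, returns zero — the "zero on repeats" fact from the primer. Energy is conserved because a quantity, paired with itself, backwards, is zero. We brought a cotangent bundle to a knife fight and the knife was $\omega(X_H, X_H) = 0$.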
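As a worked check on something other than the spring, take the pendulum energy $H = \tfrac{1}{2}p^2 - \cos q$. The example is our own choice; the steps are the same chain-rule computation as above.

```latex
\begin{aligned}
\dot q &= \frac{\partial H}{\partial p} = p, \qquad
\dot p = -\frac{\partial H}{\partial q} = -\sin q, \\
\frac{dH}{dt} &= \frac{\partial H}{\partial q}\,\dot q + \frac{\partial H}{\partial p}\,\dot p
= (\sin q)(p) + (p)(-\sin q) = 0 .
\end{aligned}
```

The potential changed shape, the cancellation did not.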
The flow preserves area, two ways
Energy conservation pins each trajectory to one ring. There's a second, deeper invariant: the flow preserves area in phase space. The beginner-friendly proof is one line of partial derivatives. A flow inflates or shrinks volume according to the divergence of its vector field, so just compute it:
$$\nabla \cdot X_H = \frac{\partial \dot q}{\partial q} + \frac{\partial \dot p}{\partial p} = \frac{\partial^2 H}{\partial q\,\partial p} - \frac{\partial^2 H}{\partial p\,\partial q} = 0.$$
It vanishes because mixed partials commute — the most boring fact in calculus. A divergence-free flow is incompressible: it can stir phase space but never compress or inflate a patch of it. That is Liouville's theorem, and for a Hamiltonian system you get it for free.
Drop a small square patch of initial conditions into phase space and let it flow. A real (Hamiltonian) flow just turns the patch rigidly — same shape, same area forever; the MLP's divergent field balloons it. Toggle between them and watch the area readout:
The square balloons as it turns. A field with divergence cannot keep area fixed — so it cannot be the flow of any energy.
For the spring, the Hamiltonian flow is a rigid rotation, so Liouville's theorem here is the thrilling claim that spinning a shape doesn't change its area. We know. But the one-line proof that makes it obvious for the spring is the identical proof that makes it true for a galaxy of stars or a Hamiltonian Monte Carlo sampler. That is the whole point of the sledgehammer: the spring is where you can check, by hand, that the machinery is honest.
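Here is a small sketch of the patch experiment in plain Python. It pushes the four corners of a square along two fields and measures the area with the shoelace formula. Both fields are linear, so the corners are enough: a square maps to a parallelogram. The leak size `eps` and the square's position are assumptions we picked by hand.

```python
import math

def rk4_step(f, q, p, dt):
    """One classical Runge-Kutta step for a planar field f(q, p) -> (dq/dt, dp/dt)."""
    k1q, k1p = f(q, p)
    k2q, k2p = f(q + 0.5 * dt * k1q, p + 0.5 * dt * k1p)
    k3q, k3p = f(q + 0.5 * dt * k2q, p + 0.5 * dt * k2p)
    k4q, k4p = f(q + dt * k3q, p + dt * k3p)
    return (q + dt * (k1q + 2 * k2q + 2 * k3q + k4q) / 6,
            p + dt * (k1p + 2 * k2p + 2 * k3p + k4p) / 6)

def shoelace_area(points):
    """Unsigned area of a polygon given its vertices in order."""
    total = 0.0
    for (q1, p1), (q2, p2) in zip(points, points[1:] + points[:1]):
        total += q1 * p2 - q2 * p1
    return abs(total) / 2

def flow_patch(f, corners, t_end, steps=2000):
    """Push every corner of the patch along the field f for time t_end."""
    dt = t_end / steps
    moved = []
    for q, p in corners:
        for _ in range(steps):
            q, p = rk4_step(f, q, p, dt)
        moved.append((q, p))
    return moved

eps = 0.05                                  # assumed leak strength, chosen by hand

def spring_field(q, p):
    return p, -q                            # divergence 0

def source_field(q, p):
    return p + eps * q, -q + eps * p        # divergence 2 * eps

square = [(1.0, 0.0), (1.2, 0.0), (1.2, 0.2), (1.0, 0.2)]
t_end = 2 * math.pi
ratio_spring = shoelace_area(flow_patch(spring_field, square, t_end)) / shoelace_area(square)
ratio_source = shoelace_area(flow_patch(source_field, square, t_end)) / shoelace_area(square)
print(ratio_spring, ratio_source)
```

The Hamiltonian patch keeps its area, while the leaky one grows by the factor exp(2 * eps * t_end), which is the divergence integrated over time.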
Movement II — The network never sees an energy
Here is the part about Hamiltonian Neural Networks that, in our opinion, just doesn't get enough love.
The training target is a lie of omission
A Hamiltonian Neural Network (Greydanus, Dzamba & Yosinski, 2019) is almost insultingly simple to state: instead of a network that outputs the vector field $(\dot q, \dot p)$, you build a network that outputs a single number, $H_\theta(q, p)$, and you define the dynamics as its symplectic gradient — the exact construction we derived above, computed with autodiff.
But look hard at what you train it on. You have trajectory data: states $(q, p)$ and their measured slopes $(\dot q, \dot p)$. You do not have energy labels. Nobody ever measured $H$. The loss compares the network's derivatives to the observed slopes:
$$\mathcal{L}(\theta) = \left\| \frac{\partial H_\theta}{\partial p} - \dot q \right\|^2 + \left\| \frac{\partial H_\theta}{\partial q} + \dot p \right\|^2.$$
Every term is a derivative of the network. The network's actual output — the scalar $H_\theta$ — appears nowhere in the loss. It is never supervised, never compared to a target, never even named in the data. The Hamiltonian is a latent variable: a hidden scalar the network is forced to invent, whose only job is to have the right slopes.
You are doing supervised regression on the gradient of a function while leaving the function itself completely unobserved. It's like being handed thousands of slope readings from across a mountain range and asked to reconstruct the terrain. You can do it — but only up to one thing slopes can never tell you: the absolute altitude. Sea level is a free parameter. So is the additive constant in $H$ (add $c$ to $H$ and $dH$ doesn't budge). The dynamics don't care, because they only ever see $dH$, never $H$.
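The sea-level ambiguity can be checked directly. In this sketch two candidate energies differ by a constant, and a slope-matching loss cannot tell them apart. The helper names, the offset of 1000 and the three data points are all made up for illustration, and derivatives are taken by central differences rather than autodiff.

```python
def field_from(H, q, p, h=1e-5):
    """Symplectic gradient (dH/dp, -dH/dq) of a scalar H, by central differences."""
    dHdq = (H(q + h, p) - H(q - h, p)) / (2 * h)
    dHdp = (H(q, p + h) - H(q, p - h)) / (2 * h)
    return dHdp, -dHdq

def H_sea_level(q, p):
    return 0.5 * p * p + 0.5 * q * q

def H_mountain(q, p):
    return H_sea_level(q, p) + 1000.0      # same terrain, different altitude

def slope_loss(H, data):
    """Mean squared mismatch between predicted and observed slopes."""
    total = 0.0
    for (q, p), (dq, dp) in data:
        fq, fp = field_from(H, q, p)
        total += (fq - dq) ** 2 + (fp - dp) ** 2
    return total / len(data)

# Observed slopes of the true spring at a few hand-picked states.
data = [((q, p), (p, -q)) for q, p in [(1.0, 0.0), (0.3, -0.8), (-1.1, 0.5)]]
loss_sea = slope_loss(H_sea_level, data)
loss_mountain = slope_loss(H_mountain, data)
print(loss_sea, loss_mountain)
```

Both losses are zero up to rounding, so nothing in the training signal can ever pin down the constant.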
"Is there an energy at all?" is a geometry question
Now the deep bit, and we are going to earn it rather than assert it. Recovering a scalar from its slopes is only possible if the slopes are genuinely the slopes of something. Not every field of arrows is a gradient. So: given a learned field $F = (f, g)$, meaning $\dot q = f$ and $\dot p = g$, when does a potential $H$ exist with $F = X_H$?
Run it through the same machine. The 1-form attached to $F$ is $\iota_F\omega = f\,dp - g\,dq$. For this to equal $dH$, it has to be closed — its exterior derivative must vanish. For any 1-form $\alpha = a\,dq + b\,dp$ that derivative is
$$d\alpha = \left( \frac{\partial b}{\partial q} - \frac{\partial a}{\partial p} \right) dq \wedge dp.$$
Substitute $a = -g$, $b = f$:
$$d(\iota_F\omega) = \left( \frac{\partial f}{\partial q} + \frac{\partial g}{\partial p} \right) dq \wedge dp = (\nabla \cdot F)\, dq \wedge dp.$$
So the abstract condition "$\iota_F\omega$ is closed" is, in plain coordinates, exactly "$F$ is divergence-free." And on a simply-connected patch like our plane, closed implies exact (the Poincaré lemma), so a potential exists if and only if the field has zero divergence. The symplectic field $X_H$ has divergence $0$ and passes. A generic MLP that regresses $(\dot q, \dot p)$ directly has no reason on earth to be divergence-free, so its 1-form is not closed, so no $H$ exists, so there is nothing for a conservation law to even be about.
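The closedness test is one finite-difference away. In the sketch below, `hamiltonian_field` is the symplectic gradient of an arbitrary quartic energy, and `generic_field` is a hand-written field with no energy behind it, standing in for what a direct-field model might emit. Both are our own examples.

```python
def divergence(field, q, p, h=1e-5):
    """Central-difference estimate of d(qdot)/dq + d(pdot)/dp."""
    dfq = (field(q + h, p)[0] - field(q - h, p)[0]) / (2 * h)
    dfp = (field(q, p + h)[1] - field(q, p - h)[1]) / (2 * h)
    return dfq + dfp

def hamiltonian_field(q, p):
    """Symplectic gradient of H = p^2/2 + q^4/4 (an arbitrary non-quadratic energy)."""
    return p, -q ** 3

def generic_field(q, p):
    """A field written down directly; its exact divergence is 0.1 + 0.4 * p."""
    return p + 0.1 * q, -q + 0.2 * p * p

for point in [(0.5, 0.5), (-1.0, 2.0)]:
    print(point, divergence(hamiltonian_field, *point), divergence(generic_field, *point))
```

The Hamiltonian field reads zero everywhere; the generic one does not, so no scalar energy can generate it.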
You can watch the obstruction. By Green's theorem, integrating the recovered slope-form around a closed loop equals the divergence piled up inside it:
$$\oint_{\partial D} \left( f\,dp - g\,dq \right) = \iint_D (\nabla \cdot F)\; dq\,dp.$$
If $F$ is divergence-free the right side is zero, so the loop integral comes home to where it started and a single-valued $H$ exists. If not, you get a non-zero gap — the recovered "energy" is path-dependent, which is mathematician for it isn't a function at all. The slider below dials the divergence; the gap you see is literally that area integral.
Drag the slider below from real physics toward what an MLP learns, and watch the recovered altitude refuse to return to where it started:
An MLP only ever sees the field of arrows on the left — never an energy value. To test whether an energy could even exist, we try to rebuild its “altitude” by walking once around a closed loop (we use a circle, but any loop gives the same verdict). The rule: a real energy must bring you back to the altitude you started at.
At real physics the arrows circulate (no leak); slide toward the MLP and they spiral outward — a source the energy can’t explain.
Flat & home → an energy exists. A line that ends higher than it began → it doesn’t.
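The same walk around the loop can be done in a few lines. This sketch integrates the slope-form f dp - g dq counterclockwise around the unit circle with a midpoint rule. The leak strength 0.1 is an arbitrary choice of ours, standing in for the slider.

```python
import math

def loop_integral(field, radius=1.0, n=20000):
    """Integrate f dp - g dq counterclockwise around a circle (midpoint rule)."""
    total = 0.0
    dtheta = 2 * math.pi / n
    for k in range(n):
        theta = (k + 0.5) * dtheta
        q, p = radius * math.cos(theta), radius * math.sin(theta)
        dq = -radius * math.sin(theta) * dtheta
        dp = radius * math.cos(theta) * dtheta
        f, g = field(q, p)
        total += f * dp - g * dq
    return total

def spring_field(q, p):
    return p, -q

def make_leaky(eps):
    """Spring field plus a radial source of strength eps (divergence 2 * eps)."""
    return lambda q, p: (p + eps * q, -q + eps * p)

gap_true = loop_integral(spring_field)        # comes home: an energy exists
gap_leaky = loop_integral(make_leaky(0.1))    # does not come home
print(gap_true, gap_leaky)
```

The leaky gap matches Green's theorem: divergence 2 * 0.1 times the enclosed area pi.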
A direct-field MLP doesn't fail to conserve energy because it was under-trained. It fails because the field it learned is not closed, so the loop integral is path-dependent, so there is no single-valued energy function in existence for it to conserve. The HNN conserves energy for the opposite reason: by only ever emitting $X_H$ of an actual scalar $H$, every field it can possibly produce is exact by construction — divergence-free, area-preserving, energy-conserving, with no extra loss term and no penalty. Conservation isn't learned. It's the only thing the architecture can express.
The only thing this test can never pin down is the overall "sea level" — the additive constant in $H$ from earlier. Which is fine, because the dynamics never depended on it.
Movement III — Conservation is a symmetry in disguise
This is the section that makes us question: "If Emmy Noether was alive to see her work utilized by cavemen like us, would she be excited or horrified?"
We have energy conservation. Time for the theorem that says energy was never special to begin with.
The Poisson bracket, derived
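Noether's theorem says every continuous symmetry of a system has a matching conserved quantity, and vice versa. In our symplectic language it becomes startlingly mechanical. First, one new gadget — the Poisson bracket of two functions $f$ and $g$, defined as $\omega$ evaluated on their two Hamiltonian fields, $\{f, g\} = \omega(X_f, X_g)$. Expand it with the coordinate formulas we already derived ($X_f = (\partial f/\partial p,\ -\partial f/\partial q)$, same for $g$):
$$\{f, g\} = \frac{\partial f}{\partial q}\frac{\partial g}{\partial p} - \frac{\partial f}{\partial p}\frac{\partial g}{\partial q}.$$
Now the one identity that runs the whole show. How does any quantity $f$ change along the dynamics? Chain rule once more, then substitute Hamilton's equations:
$$\frac{df}{dt} = \frac{\partial f}{\partial q}\,\dot q + \frac{\partial f}{\partial p}\,\dot p = \frac{\partial f}{\partial q}\frac{\partial H}{\partial p} - \frac{\partial f}{\partial p}\frac{\partial H}{\partial q} = \{f, H\}.$$
Read it both ways and Noether falls out for free:
- $\{f, H\} = 0$ means $f$ is conserved as the system evolves under $H$.
- By antisymmetry $\{H, f\} = -\{f, H\}$, that's the same equation as "$H$ is unchanged by the flow generated by $f$" — i.e. $f$ generates a symmetry.
Symmetry and conservation are, quite literally, one equation read left-to-right or right-to-left. Energy is conserved because $\{H, H\} = 0$ — antisymmetry, again, the gift that keeps on giving.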
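The bracket is short enough to compute by hand or by machine. Below is a minimal sketch using central differences; the function name `bracket` and the sample point are our own choices.

```python
def bracket(f, g, q, p, h=1e-5):
    """Poisson bracket {f, g} = f_q * g_p - f_p * g_q, by central differences."""
    f_q = (f(q + h, p) - f(q - h, p)) / (2 * h)
    f_p = (f(q, p + h) - f(q, p - h)) / (2 * h)
    g_q = (g(q + h, p) - g(q - h, p)) / (2 * h)
    g_p = (g(q, p + h) - g(q, p - h)) / (2 * h)
    return f_q * g_p - f_p * g_q

def H(q, p):
    """Spring energy."""
    return 0.5 * p * p + 0.5 * q * q

def position(q, p):
    return q

q0, p0 = 0.6, -1.3
energy_rate = bracket(H, H, q0, p0)            # energy against itself
position_rate = bracket(position, H, q0, p0)   # dq/dt along the flow, which is p
flipped = bracket(H, position, q0, p0)         # antisymmetry flips the sign
print(energy_rate, position_rate, flipped)
```

The energy's bracket with itself vanishes, while the position's bracket returns the momentum, so position is not conserved.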
Pick a quantity below and watch its value as the spring swings. The widget offers three: the energy $H$ with $\{H, H\} = 0$, the position $q$ with $\{q, H\} = p$, and a stretch whose bracket with $H$ is also non-zero. Only the energy has a vanishing bracket — and only the energy stays flat:
Some quantities hold dead still while the spring oscillates; others slosh around. Noether’s theorem says the constant ones are exactly the symmetries. Pick a quantity and watch its value as time runs.
Legend: the orbit, and the level sets of the energy. Conserved ⟺ the orbit runs along them, not across.
A flat line: the energy never changes while the spring swings. It is conserved — and it is precisely the charge of the rotation symmetry of phase space.
Uhhh.. So?
Here is the thing we find genuinely lovely, and the reason the latent scalar from Movement II matters so much:
Let's be precise, because the honest version is better than the hype. The HNN conserves its learned energy no matter what, because it always has some scalar $H$ and $\{H, H\} = 0$ — antisymmetry, one final time. It does not automatically conserve momentum or angular momentum; a plain $H$ has no reason to.
But here is the lever a direct-field MLP will never have: impose a symmetry on the learned energy, and Noether hands you back the matching conserved quantity — exactly, for free. Build an $H$ that is blind to absolute position, and momentum is conserved. Blind to absolute orientation, and angular momentum is. The MLP has no scalar, so no bracket, so no Noether — it is not that it conserves the wrong things, it has nowhere for a conserved quantity to live.
On the spring, that lever has exactly one notch. A single degree of freedom bolted to a wall has one continuous symmetry — the phase-space rotation, generated (in a coincidence special to the oscillator) by $H$ itself — so Noether politely hands us back the energy we already proved in Movement I. The wall is precisely why momentum leaks: it pins the potential to an absolute position, breaking translation symmetry. Cut the wall, let two masses interact freely, and translation symmetry — and conserved momentum — come right back. The real power shows up the moment a system is big enough to own more than one symmetry.
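Cutting the wall is easy to try. The sketch below uses a hand-written two-mass energy that depends only on the separation q1 - q2 (no network, and the initial state is arbitrary), integrates Hamilton's equations with RK4, and tracks the total momentum.

```python
def field(state):
    """Hamilton's equations for H = p1^2/2 + p2^2/2 + (q1 - q2)^2 / 2.
    H depends only on q1 - q2, so it is blind to shifting both masses together."""
    q1, q2, p1, p2 = state
    stretch = q1 - q2
    return (p1, p2, -stretch, stretch)

def rk4_step(state, dt):
    """One classical Runge-Kutta step on a 4-component state tuple."""
    def shifted(s, k, c):
        return tuple(si + c * ki for si, ki in zip(s, k))
    k1 = field(state)
    k2 = field(shifted(state, k1, 0.5 * dt))
    k3 = field(shifted(state, k2, 0.5 * dt))
    k4 = field(shifted(state, k3, dt))
    return tuple(s + dt * (a + 2 * b + 2 * c + d) / 6
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

state = (0.0, 1.5, 0.7, -0.2)            # arbitrary initial (q1, q2, p1, p2)
total_momentum = [state[2] + state[3]]
for _ in range(5000):
    state = rk4_step(state, 0.01)
    total_momentum.append(state[2] + state[3])
drift = max(total_momentum) - min(total_momentum)
print(f"total momentum drift: {drift:.2e}")
```

The two momenta trade back and forth, but their sum holds to rounding error, because the forces on the two masses are equal and opposite by construction.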
Why an ML person should care: this is weight-sharing for physics
Okay — this one box is the entire reason we wrote this article. We know most of you are just thumb-scrolling by now, so we made it as loud and obnoxious as humanly possible, purely so you would stop and go what in god's name is happening here. Now that we have your attention:
Tell an ML person you built a model that conserves energy and you get a polite nod. But every conservation law is, secretly, a generalization guarantee — that is the entire content of Noether's theorem. "Conserves energy" already means "the learned dynamics don't depend on absolute time: fit them on one window and they hold for all time." Make the energy blind to position too, and the same theorem hands you conserved momentum, which reads "fit the system in one place and it is correct everywhere." That second sentence is a CNN's weight-sharing, restated for things that move. Conservation was never the selling point: it was generalization wearing a physicist's hat.
Here somewhere, buried deep in these gorgeous mechanics, there has to be the holy grail of ML and generalization that we are unfortunately collectively just too dumb to figure out and utilize to its potential.
If "a symmetry gives you a conservation law" still sounds like a physicist's party trick, notice that you already bet your career on the same idea. A CNN bakes in translation symmetry through weight-sharing: a feature learned in one corner of an image works in every corner. You don't hope the network discovers that a cat is a cat wherever it lands — you build it in, shrinking the hypothesis space to functions that already respect the symmetry. The payoff is the whole reason CNNs won: far fewer parameters, and a pattern seen once generalizes across the entire plane.
A symmetry-constrained HNN is the identical move, one level up. Make $H$ depend only on relative coordinates and it goes blind to rigid shifts of the whole system — and because invariance of the scalar becomes equivariance of the flow $X_H$, the two things you actually want fall straight out:
- Generalization, for free. Train near the origin, and the shift-invariant energy makes the learned dynamics automatically correct for the same system placed anywhere. You never had to show it examples at every location — the symmetry transports them for you. That is a CNN's "learn the pattern once, apply everywhere," but for motion.
- An exact invariant, as a bonus. Noether upgrades that same shift-invariance into conserved momentum — held to machine precision along every trajectory. A CNN's equivariance only ever buys generalization; in a dynamical setting the identical constraint buys generalization and a conservation law. You get to cash the symmetry twice.
This is just the geometric deep learning program — CNNs, GNNs, equivariant nets — restated for things that move: name the symmetries your problem has, build them into the energy, and let the geometry hand back both the invariances and the sample-efficiency. The HNN is simply the member of that family whose scalar happens to be an energy, and Noether is the receipt.
Let's actually build the thing
Enough cathedral-building. The astonishing part is how little code turns all of that geometry into a working network. The symplectic gradient — the conceptual heart of three movements — is two lines of autograd. We deployed a cotangent bundle, Cartan's magic formula, and Noether's theorem to justify two lines of code. We regret nothing.
```python
import torch
import torch.nn as nn


class HNN(nn.Module):
    """Outputs a single number — the latent energy H(q, p) — and nothing else."""

    def __init__(self, hidden=128):
        super().__init__()
        self.H = nn.Sequential(
            nn.Linear(2, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def field(self, x):
        """The symplectic gradient J @ grad(H). This *is* the equation iota_{X_H} w = dH, in code."""
        x = x.requires_grad_(True)
        H = self.H(x).sum()  # one scalar to differentiate
        dH, = torch.autograd.grad(H, x, create_graph=True)
        dHdq, dHdp = dH[:, 0], dH[:, 1]
        return torch.stack([dHdp, -dHdq], dim=1)  # the 90-degree turn. yes, we had an entire section about a minus sign.
```
That torch.stack([dHdp, -dHdq]) is the rotation. The minus sign is the antisymmetry of $\omega$. Everything we proved about conservation is now structurally true of this network, and there's no way to write it down without conservation — the architecture literally cannot represent a divergence-bearing field.
Training matches slopes, never energies — the unsupervised target from Movement II:
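```python
def sample_spring_batch(n=256):
    # Hypothetical stand-in (not from the original post): random states in
    # [-2, 2]^2 paired with the exact spring slopes (dq/dt, dp/dt) = (p, -q).
    x = 4 * torch.rand(n, 2) - 2
    dx = torch.stack([x[:, 1], -x[:, 0]], dim=1)
    return x, dx


model = HNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(2000):
    x, dx_true = sample_spring_batch()  # states and their *observed* slopes
    dx_pred = model.field(x)
    loss = ((dx_pred - dx_true) ** 2).mean()  # we never tell it what the energy is
    opt.zero_grad()
    loss.backward()
    opt.step()
```
And the baseline that invents free energy — same capacity, same data, but it emits a field directly, with no scalar and therefore no geometry: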
```python
class MLP(nn.Module):
    """Learns the vector field directly. There is no H anywhere in here."""

    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 2),  # (dq/dt, dp/dt), and not a scalar in sight
        )

    def field(self, x):
        return self.net(x)
```
Train both, then integrate each forward for fifty full periods with RK4 — a plain, non-symplectic integrator, so nothing is quietly conserving energy for the networks. (Reach for a symplectic integrator here and it would conserve energy even for the MLP, hiding the entire effect.) Here is an actual, reproducible run — every number below came out of the script, none were typed by hand:
Both networks fit the training slopes essentially perfectly — to ~4e-5, so the MLP is not a worse function approximator. The true field row is the control: RK4 itself leaks essentially nothing over fifty periods, so any drift below it is the model's doing, not the integrator's. And the two models could not be more different. The HNN holds the true energy to 0.2% at its very worst and keeps phase-space area pinned at ×1.001. The MLP bleeds away 27% of its energy — its orbit spiralling inward as its phase-space area collapses to ×0.66, a slow heat-death, exactly the non-zero divergence Movement I warned about. The difference was never accuracy. It was that one of them was allowed to learn a field with non-zero divergence, and the other, by construction, was not.
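The control row is worth reproducing in isolation. The sketch below integrates the true spring field for fifty periods with RK4 and, for contrast, with explicit Euler. The step size (100 steps per period) is our own assumption, not necessarily the setting used for the run above, so treat the output as an illustration of the idea rather than a reproduction of those numbers.

```python
import math

def spring_field(q, p):
    """The true field of the unit spring."""
    return p, -q

def euler_step(f, q, p, dt):
    """Explicit Euler: cheap, and it pumps energy into the oscillator."""
    dq, dp = f(q, p)
    return q + dt * dq, p + dt * dp

def rk4_step(f, q, p, dt):
    """Classical fourth-order Runge-Kutta."""
    k1q, k1p = f(q, p)
    k2q, k2p = f(q + 0.5 * dt * k1q, p + 0.5 * dt * k1p)
    k3q, k3p = f(q + 0.5 * dt * k2q, p + 0.5 * dt * k2p)
    k4q, k4p = f(q + dt * k3q, p + dt * k3p)
    return (q + dt * (k1q + 2 * k2q + 2 * k3q + k4q) / 6,
            p + dt * (k1p + 2 * k2p + 2 * k3p + k4p) / 6)

def worst_energy_error(step, periods=50, steps_per_period=100):
    """Largest relative energy error seen while integrating from (1, 0)."""
    dt = 2 * math.pi / steps_per_period
    q, p = 1.0, 0.0
    e0 = 0.5 * (q * q + p * p)
    worst = 0.0
    for _ in range(periods * steps_per_period):
        q, p = step(spring_field, q, p, dt)
        worst = max(worst, abs(0.5 * (q * q + p * p) - e0) / e0)
    return worst

rk4_error = worst_energy_error(rk4_step)
euler_error = worst_energy_error(euler_step)
print(f"RK4 worst relative energy error:   {rk4_error:.2e}")
print(f"Euler worst relative energy error: {euler_error:.2e}")
```

With RK4 the integrator's own leak is orders of magnitude below the model drifts discussed above, which is what makes it a fair referee. Euler shows why the control matters: a sloppy integrator can manufacture energy all by itself.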
Where the geometry runs out
We would be bad guests if we wheeled out this much machinery and pretended it solved everything. The symplectic story is exact, beautiful, and assumes a frictionless, energy-conserving, canonically-coordinatized universe. Reality is rarely so polite.
The instant you add friction, energy is genuinely not conserved, $\omega$ is genuinely not preserved, and a pure HNN is now confidently wrong. The fixes restore a weaker structure: port-Hamiltonian and Dissipative HNNs learn an energy term and a separate dissipation term, and contact geometry generalizes the symplectic form to systems that bleed energy on purpose. The spring with a damper is no longer a knife fight; bring the cotangent bundle back.
Closing thoughts
Hamilton wrote down these equations roughly … ago, to reformulate a mechanics that already worked. Joseph-Louis Lagrange set up the foundations with Lagrangian Mechanics … ago when he mailed his work to Euler. They were not thinking about gradient descent, automatic differentiation, or the regrettable tendency of neural networks to manufacture energy from nothing. And yet the cleanest way we have today to stop a network from violating thermodynamics is to hand it their 200-year-old geometry and let the antisymmetry of a 2-form do the rest.
That's the lesson we keep relearning, in security and in ML alike: the win usually isn't a bigger hammer. It's noticing what shape the problem already has, and refusing to learn anything that doesn't fit. Give the network a scalar instead of a field and conservation stops being a thing you hope for and becomes a thing that cannot not happen. We didn't teach it to conserve energy. We made non-conservation inexpressible. Good luck creating free energy now, bozo.
Are you all quite finished?
We are. The spring, of course, is still going.
Normally this is where we pitch you a consultation. But if you read a differential-geometry-of-neural-networks post to the end, you are not a sales lead — you are one of us.
Didn't like our post? Send your hatemails here.
Are you named Sam Greydanus? We absolutely love your work. 👉👈