Hacker News — AI on Front Page

Building a custom octocopter from scratch with no prior hardware experience


240 pts · 52 comments on Hacker News

Day 30: Learning curve
Jun 28, 2026

The drone flies through all single, dual, and SOME TRIPLE motor failures!! But it's sim-only, and the path to get here had a significant learning curve.

Surviving worst-case dual motor failures (2x speed)

Before training the huge policy with domain randomization and all of the bells and whistles, I decided to train a sim-only policy, which would be ready in about an hour and a half on CPU. I'm really glad I did this, because it definitely didn't work out the first time. Here's a (maybe too transparent) timeline of the process:

| # | What I tried | Result |
|---|---|---|
| 1 | Baseline PPO, high exploration, always 2 faults. Ran overnight. | Failed. Entropy climbed 11→22 the whole time, and the run crashed at 20M steps while trying to save (it was fine, I had checkpoints, but still). |
| 2 | Lowered exploration, kept always-2-faults and no curriculum (straight to dual). | This one seemed to be working. It looked broken at 2M (crashing at step 7, very negative entropy; Gaussian differential entropy can go negative when the variance collapses), but it trained through that. I killed it deliberately because I wanted to add single-motor failures first. |
| 3 | Added a hover->single->dual curriculum. | Broke everything. It turned out the curriculum's 4M steps of pure hover exposed two latent bugs (without the curriculum, failures from step 0 gave enough training signal to bulldoze past them). |
| 4 | (Not a config -- an operator error.) | A zombie training process from the night before was running alongside the new one, both writing the same checkpoint filenames. |
| 5 | Residual actions (u = hover + action). | Commands got driven to ±3, saturated and lopsided -- tip-over in 7 steps. |
| 6 | Stripped/re-added the curriculum, low exploration. | 0% survival at every checkpoint -- and it crashed at step 7 every single time, even during the pure-hover phase. That's what finally told me the bug had to be systemic. |
| 7 | The two real fixes (below). | Worked. Hover learned by 0.5M steps, 100% survival on hover/single/dual by ~9.5M. |
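The aside in run 2 about negative entropy is easy to check. Below is a minimal sketch, assuming a diagonal Gaussian policy with one action dimension per motor (8 in total); the function name is hypothetical and this is not the author's code.

```python
import math
from typing import Sequence


def diag_gaussian_entropy(stds: Sequence[float]) -> float:
    """Differential entropy (in nats) of a Gaussian with independent dimensions.

    Each dimension contributes 0.5 * ln(2 * pi * e * sigma^2), which is
    negative whenever sigma^2 < 1 / (2 * pi * e), i.e. sigma below ~0.242.
    """
    return sum(0.5 * math.log(2 * math.pi * math.e * s * s) for s in stds)


if __name__ == "__main__":
    print(diag_gaussian_entropy([1.0] * 8))   # broad policy: positive
    print(diag_gaussian_entropy([0.01] * 8))  # collapsed variance: negative
```

For scale, σ = 1 on all eight dimensions gives about 11.35 nats, while σ = 0.01 gives roughly -25.5, so a strongly negative reading just means the policy's standard deviations have shrunk well below 1.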

Everything from #3 onward -- the entropy panic, the residual detour, the curriculum back-and-forth -- was me poking at symptoms of two underlying bugs.

1. The actions were getting stuck. The Gaussian policy outputs unbounded means, but the env hard-clips commands to [0,1] while PPO computes gradients on the unclipped value -- so once a motor command drifts past the clip edge there's no corrective gradient pulling it back inside, and it stays there. With 8 motors this produces a lopsided tip-over, and no hyperparameter fixes it because it's not a hyperparameter problem. Fix: squash the action through tanh and apply it as a residual around hover throttle, so commands can never hit the hard clip and an untrained net already hovers. This alone bumped untrained survival from 7 steps to 205.

2. Staying alive paid nothing. An open-loop hover test printed r = 0.00 every step: at the ~1.9 m altitude where the drone actually settles, the +0.1 survival bonus was exactly cancelled by the -0.1 altitude penalty. Because the drone is only marginally stable, every episode eventually crashes (-10), so "hover 200 steps then crash" and "crash immediately" had the same return. Fix: bump r_survive from 0.1 → 1.0, so hovering pays +0.9/step and PPO finally has a reason to stay up.
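Both fixes are small enough to sketch in a few lines. The Python below is a minimal illustration, not the author's implementation: the hover throttle of 0.5, the choice to apply the squash on the environment side, and every function name are assumptions. Only the 0.1 survival bonus, the 0.1 altitude penalty at the settle height, the -10 crash penalty, and the 1.0 replacement bonus come from the post.

```python
import math

# Assumed constant: the post does not give the real normalized hover throttle.
HOVER = 0.5          # hypothetical hover throttle in [0, 1]
ALT_PENALTY = 0.1    # altitude penalty at the ~1.9 m settle height (from the post)
CRASH = -10.0        # terminal crash penalty (from the post)


# --- Fix 1: tanh-squashed residual around hover ---
def clipped_command(a: float) -> float:
    """Old behaviour: use the raw Gaussian action and hard-clip it to [0, 1]."""
    return min(max(a, 0.0), 1.0)


def squashed_command(a: float, hover: float = HOVER) -> float:
    """New behaviour: hover plus a tanh-bounded residual, always inside [0, 1]."""
    scale = min(hover, 1.0 - hover)  # widest symmetric swing that stays in bounds
    return hover + scale * math.tanh(a)


def sensitivity(f, a: float, eps: float = 1e-4) -> float:
    """Central finite difference: how much the command moves when a is nudged."""
    return (f(a + eps) - f(a - eps)) / (2 * eps)


# --- Fix 2: make staying alive worth something ---
def episode_return(steps_alive: int, r_survive: float) -> float:
    """Undiscounted return: steps_alive hover steps, then the crash penalty."""
    per_step = r_survive - ALT_PENALTY  # 0.0 with the old bonus, 0.9 with the new
    return steps_alive * per_step + CRASH


if __name__ == "__main__":
    print(squashed_command(0.0))               # zero action -> exactly hover throttle
    print(sensitivity(clipped_command, 2.5))   # past the clip edge: no response at all
    print(sensitivity(squashed_command, 2.5))  # small but nonzero
    print(episode_return(200, 0.1), episode_return(0, 0.1))  # -10.0 -10.0
    print(episode_return(200, 1.0))            # hovering now clearly pays
```

In this sketch the tanh lives in the environment's action mapping, so PPO's log-probabilities are still computed on the raw Gaussian sample; if the squash were moved inside the policy instead, the log-probability would need the usual tanh change-of-variables correction.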

Results

Final policy is a 43.4k-parameter MLP.

Learning curve across 20M training steps: survival and reward by fault class

It even generalizes to 3-motor failures it was never trained on, as long as recovery is physically possible. Even when I killed 3 adjacent motors (physically unrecoverable), it fought for 7.2 seconds and sank instead of tumbling.

Surviving triple motor failures (out-of-distribution)

One nice surprise: the "uncompensatable-yaw" cases (two same-spin motors 90° apart, where in theory the drone should just spin freely) aren't actually uncompensatable. The policy holds the yaw rate to within ~13°/s in all of them -- a slow drift, not a free spin. My heuristic for flagging those was too pessimistic.
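The post doesn't show the flagging heuristic itself. As a rough sketch of which cases it covers, assume a flat octocopter with motors every 45° and alternating spin directions (both assumptions, as are all names below); then "two same-spin motors 90° apart" means motor pairs two positions apart.

```python
from itertools import combinations

N_MOTORS = 8  # octocopter


def motor_angle(i: int) -> float:
    """Assumed flat layout: motor arms evenly spaced every 45 degrees."""
    return i * 360.0 / N_MOTORS


def motor_spin(i: int) -> int:
    """Assumed alternating spin: +1 for even indices, -1 for odd ones."""
    return 1 if i % 2 == 0 else -1


def separation(i: int, j: int) -> float:
    """Smallest angle between two motor arms, in degrees."""
    d = abs(motor_angle(i) - motor_angle(j)) % 360.0
    return min(d, 360.0 - d)


def flagged_yaw_pairs() -> list[tuple[int, int]]:
    """Dual failures a purely geometric heuristic would call 'uncompensatable
    yaw': two same-spin motors 90 degrees apart."""
    return [
        (i, j)
        for i, j in combinations(range(N_MOTORS), 2)
        if motor_spin(i) == motor_spin(j) and separation(i, j) == 90.0
    ]


if __name__ == "__main__":
    pairs = flagged_yaw_pairs()
    print(len(pairs), pairs)
```

Under those assumptions the check flags 8 of the 28 possible dual failures. It encodes only geometry and says nothing about whether the surviving motors can rebalance the yaw torque, which is one way a heuristic like this can end up too pessimistic.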

Next step: the real, sim-to-real-able policy!
