Hierarchos: Preliminary Findings From a 232M Recurrent Memory-Augmented Assistant Model [P]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
[Project Release / Research Draft] Hierarchos at 232M Parameters: Preliminary Findings From a Recurrent Memory-Augmented Assistant Model
Technical Report: July 2nd, 2026
Project: Hierarchos / KortexHOS
Authors: Makhi Burroughs / netcat420, Lost Time, and the Hierarchos project team
TL;DR:
We built and trained Hierarchos, an experimental 232M-parameter recurrent, memory-augmented language model, from scratch. It is not a GPT-3/3.5-class model, but it demonstrates that a hybrid non-Transformer architecture (combining an RWKV backbone, hierarchical manager/worker loops, differentiable slot-based LTM, and a deterministic suffix automaton) can survive training, avoid collapse, and maintain short-form instruction coherence. Most of our breakthroughs came from fixing subtle train/inference parity mismatches and numerical stability bugs.
- Dataset: netcat420/Experiment_0.1 (Alpaca format)
- Training: 13 epochs on an RTX 6000 Blackwell (96GB) rental.
1. Introduction & Background
Modern LLMs are heavily dominated by Transformer scaling. Hierarchos explores a different path: can recurrent state, explicit memory retrieval, hierarchical iterative computation, and bounded local inference make a small model vastly more parameter-efficient?
Hierarchos isn't a direct clone of any single architecture, but a hybrid inspired by:
- RWKV-style recurrence: For efficient sequence processing without traditional attention.
- Titans-style neural memory: For persistent test-time memory.
- Hierarchical reasoning (HRM): Multi-level recurrent modules (Manager/Worker) to iteratively refine state.
2. Architecture Overview
[Token Input] -> [ROSA Suffix Matcher / DeepEmbed Modulator]
        |
        v
[Long-Term Memory] <-> [Top-k Associative Lookup]
        |
        v
[Manager Recurrent Cell] -> (Produces Context Plan & Drift Vector)
        |
        v
[Worker Recurrent Cell] -> (Refines local state / clamps drift)
        |
        v
[RWKV Backbone (Clamped Channel-Mix)] -> [Next-Token Logits]

Key Components:
- ROSA: A deterministic suffix-automaton path predicting continuation tokens based on exact repeated suffix patterns.
- DeepEmbed: A token-specific modulation path that influences RWKV channel mixing.
- LTM Subsystem: Learned slow-memory keys/values combined with fast working-memory values.
- Manager/Worker Loop: High-level manager handles broad context to produce a target plan; the lower-level worker refines token-local state using a regularized drift vector.
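To make the ROSA idea concrete, here is a minimal sketch of suffix-based continuation prediction. It is not the project's implementation: a real suffix automaton answers this kind of query far more efficiently, whereas this illustration swaps in a naive scan for readability. The function name `predict_from_longest_suffix` and the `max_len` cap are assumptions made for the example.

```python
def predict_from_longest_suffix(history, max_len=8):
    """Return the token that followed the most recent earlier occurrence
    of the longest repeated suffix of `history`, or None if nothing repeats.

    Naive scan for illustration only; a suffix automaton would replace it.
    """
    n = len(history)
    # Try the longest candidate suffix first, then shorten it.
    for length in range(min(max_len, n - 1), 0, -1):
        suffix = history[n - length:]
        # Scan earlier start positions, most recent first.
        for start in range(n - length - 1, -1, -1):
            if history[start:start + length] == suffix:
                return history[start + length]
    return None


# The suffix [5, 7] appeared earlier and was followed by 9.
print(predict_from_longest_suffix([5, 7, 9, 5, 7]))  # 9
# No repeated suffix, so there is no deterministic prediction.
print(predict_from_longest_suffix([1, 2, 3, 4]))  # None
```

Because the prediction depends only on exact matches in the token history, this path is deterministic and needs no learned parameters, which is presumably why it can sit alongside the learned components as a separate signal.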
3. Core Engineering Lessons (The "Gotchas")
A low training loss does not guarantee coherent chat. We had to fix several critical state-contract and numerical stability bugs to make the model usable:
1. Chat/Training Drift Mismatch
- The Bug: During live streaming chat, the loop was feeding the previous drift state back into the model on every single token. During training, this state is reseeded at Truncated Backpropagation Through Time (TBPTT) chunk boundaries.
- The Fix: We aligned the inference code with training so that the drift state is reseeded only at chunk boundaries. Before this fix, live chat logits diverged sharply from the training-time behavior; after the fix, logit error dropped to near-zero.
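One way to catch this class of bug is a parity test that runs the same tokens through a chunked, training-style pass and a token-by-token streaming pass, then compares logits. The sketch below uses a toy scalar cell, not the Hierarchos code. `toy_step`, `train_forward`, and `stream_forward` are hypothetical names, and the sketch assumes one reading of the bug description: the drift fed into the model is refreshed only at chunk boundaries during training.

```python
import math


def toy_step(hidden, drift_in, token):
    """Hypothetical scalar recurrent cell: returns (new_hidden, new_drift, logit)."""
    new_hidden = math.tanh(0.5 * hidden + 0.1 * token + drift_in)
    return new_hidden, 0.3 * new_hidden, 2.0 * new_hidden


def train_forward(tokens, chunk_len):
    """Chunked pass: the drift input stays fixed inside a chunk."""
    hidden, drift_in, logits = 0.0, 0.0, []
    for start in range(0, len(tokens), chunk_len):
        latest_drift = drift_in
        for token in tokens[start:start + chunk_len]:
            hidden, latest_drift, logit = toy_step(hidden, drift_in, token)
            logits.append(logit)
        drift_in = latest_drift  # reseed only at the chunk boundary
    return logits


def stream_forward(tokens, chunk_len, feed_back_every_token):
    """Streaming pass, with the buggy behaviour selectable by a flag."""
    hidden, drift_in, logits = 0.0, 0.0, []
    for i, token in enumerate(tokens):
        hidden, latest_drift, logit = toy_step(hidden, drift_in, token)
        logits.append(logit)
        if feed_back_every_token or (i + 1) % chunk_len == 0:
            drift_in = latest_drift
    return logits


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


tokens = [1, 3, 2, 5, 4, 1, 2, 3]
reference = train_forward(tokens, chunk_len=4)
buggy_error = max_abs_error(reference, stream_forward(tokens, 4, feed_back_every_token=True))
fixed_error = max_abs_error(reference, stream_forward(tokens, 4, feed_back_every_token=False))
print(f"buggy: {buggy_error:.4f}  fixed: {fixed_error:.4f}")
```

In this toy setup the aligned streaming pass reproduces the chunked logits exactly, while the every-token feedback variant drifts away from the second token onward. A test of this shape is cheap enough to run on every change to the inference loop.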
2. Supervised LTM Inner Updates Mismatch
- The Bug: Giving the model supervised memory updates during training that it can't replicate during zero-label live inference creates a crutch. The model learns to rely on a hidden training-only helper signal.
- The Fix (v0.20.4): Implemented --ltm-training-mode read-only. Training keeps the memory structures but stops doing supervised fast-memory writes, perfectly mirroring inference.
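The sketch below illustrates the idea with a toy slot memory: a top-k associative read over slow values plus fast working-memory values, and a supervised write that becomes a no-op in read-only mode. This is an assumption-laden illustration, not the Hierarchos LTM. The class `SlotMemory`, its method names, and the `"supervised"` mode label are hypothetical; only the `read-only` mode name comes from the flag above.

```python
import math


class SlotMemory:
    """Toy slot memory: learned slow values plus fast working-memory values."""

    def __init__(self, keys, slow_values, training_mode="read-only"):
        self.keys = keys
        self.slow_values = slow_values
        self.fast_values = [[0.0] * len(v) for v in slow_values]
        # "read-only" mirrors the flag value; "supervised" is a made-up label.
        self.training_mode = training_mode

    def read(self, query, k=2):
        """Softmax-weighted read over the k slots whose keys best match the query."""
        scores = [sum(q * x for q, x in zip(query, key)) for key in self.keys]
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        peak = max(scores[i] for i in top)
        weights = [math.exp(scores[i] - peak) for i in top]
        total = sum(weights)
        out = [0.0] * len(self.slow_values[0])
        for w, i in zip(weights, top):
            for d in range(len(out)):
                out[d] += (w / total) * (self.slow_values[i][d] + self.fast_values[i][d])
        return out

    def supervised_write(self, slot, target, lr=0.5):
        """Move a slot's fast value toward a label-derived target.

        Skipped entirely in read-only mode, since inference has no labels.
        """
        if self.training_mode == "read-only":
            return False
        row = self.fast_values[slot]
        for d in range(len(row)):
            current = self.slow_values[slot][d] + row[d]
            row[d] += lr * (target[d] - current)
        return True


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mem = SlotMemory(keys, [[1.0], [2.0], [3.0]], training_mode="read-only")
before = mem.read([1.0, 0.0])
wrote = mem.supervised_write(0, [10.0])
after = mem.read([1.0, 0.0])
print(wrote, before == after)
```

With the write path disabled, reads during training see exactly the memory state that inference would see, so the model cannot learn to depend on label-derived updates.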
3. Unbounded RWKV Channel Mixing
- The Bug: Long runs exposed activation spikes in the ReLU-squared channel-mix FFN path, which were amplified by DeepEmbed modulation into NaN gradients.
- The Fix: Implemented key clamps (--rwkv-channel-mix-key-clamp 12.0), DeepEmbed clamps (4.0), and excluded DeepEmbed identity gates from AdamW weight decay.
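A scalar sketch shows why the two clamps bound the activation. It assumes the clamps are symmetric (values limited to plus or minus the threshold) and that DeepEmbed acts as a multiplicative scale on the squared-ReLU output. Both are guesses about the real code, and `clamp` and `channel_mix_unit` are hypothetical names.

```python
def clamp(x, limit):
    """Limit x to the symmetric range [-limit, limit]."""
    return max(-limit, min(limit, x))


def channel_mix_unit(key_pre, deepembed_scale, key_clamp=12.0, deepembed_clamp=4.0):
    """One channel of a ReLU-squared FFN with clamped key and modulation.

    Assumed form: relu(clamp(key))**2 * clamp(scale). Not the project's code.
    """
    key = clamp(key_pre, key_clamp)
    activation = max(key, 0.0) ** 2
    scale = clamp(deepembed_scale, deepembed_clamp)
    return activation * scale


# An extreme pre-activation and modulation value, as in a spike.
unclamped = max(1000.0, 0.0) ** 2 * 50.0
clamped = channel_mix_unit(1000.0, 50.0)
print(unclamped, clamped)  # 50000000.0 576.0
```

Under these assumptions the magnitude of each channel is capped at 12.0 squared times 4.0, which is 576, so a single spike can no longer grow quadratically with the pre-activation before it reaches the gradient.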
4. Evaluation & Smoke Test Results
Because cloud costs add up, we benchmarked the model locally on a ROG Ally using a CPU preset, capped at 100 examples per task (--eval-limit 100), with passive learning disabled and working memory cleared to mimic static chat.
Bounded Local Benchmark Metrics (--eval-limit 100)
| Benchmark | Metric | Score | Std. Err. |
|---|---|---|---|
| ARC Easy | acc | 0.3600 | 0.0482 |
| ARC Easy | acc_norm | 0.3200 | 0.0469 |
| HellaSwag | acc | 0.3400 | 0.0476 |
| HellaSwag | acc_norm | 0.3700 | 0.0485 |
| TruthfulQA MC1 | acc | 0.2200 | 0.0416 |
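The Std. Err. column is consistent with the sample standard error of a mean over 100 binary outcomes. The sketch below recomputes it. The formula choice (sample variance with an n - 1 denominator) is an assumption about how the evaluation tool computes it, and `bernoulli_stderr` is a name introduced here.

```python
import math


def bernoulli_stderr(accuracy, n):
    """Sample standard error of the mean for n binary (0/1) outcomes."""
    return math.sqrt(accuracy * (1.0 - accuracy) / (n - 1))


reported = {
    "ARC Easy acc": (0.36, 0.0482),
    "ARC Easy acc_norm": (0.32, 0.0469),
    "HellaSwag acc": (0.34, 0.0476),
    "HellaSwag acc_norm": (0.37, 0.0485),
    "TruthfulQA MC1 acc": (0.22, 0.0416),
}

for name, (score, stderr) in reported.items():
    recomputed = round(bernoulli_stderr(score, 100), 4)
    print(f"{name}: reported {stderr:.4f}, recomputed {recomputed:.4f}")
```

Standard errors near 0.05 mean the intervals are wide. For example, ARC Easy acc of 0.36 with plus or minus 1.96 standard errors spans roughly 0.27 to 0.45, so these numbers are best read as a smoke test rather than a ranking.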
Real-world Coherence Check:
- The Good: Assistant-shaped, follows short instruction prompts well due to the Alpaca training data. Nontrivial commonsense and QA signal suggests the weights didn't collapse, though with only 100 examples per task the error bars are wide.
- The Bad: Brittle on long context lengths, weak on arithmetic/factual recall. Coherence is comparable to the GPT-2 era, not modern GPT-3.5+ systems.
5. Proposed Ablation & Scaling Plan
We want to transform this from a promising prototype into a rigorous scientific result. Our next step requires scaling tiers and isolated component testing.
Proposed Isolation Testing (Ablations)
- No LTM / Read-Only LTM: Isolating exactly how much slot memory helps.
- No ROSA / No DeepEmbed: Evaluating the real token-efficiency gains of suffix-matching and modulation.
- Baseline Matches: Running a direct Transformer 232M and RWKV-only 232M on the exact same token budget to measure comparative architecture efficiency.
Future Scaling Target Tiers
| Tier | Model Size | Token Target | Purpose |
|---|---|---|---|
| Scout | 300M–500M | 20B–50B | Validate loss slope and stability scaling. |
| Real v1 | 1B–1.5B | 100B–300B | Test architecture limits beyond small-scale behavior. |
| Serious | 3B | 600B–1.5T | Establish a truly competitive local open-source alternative. |
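For context on these targets, the tokens-per-parameter ratio of each tier can be worked out directly from the table. The sketch below does the arithmetic; the `TIERS` dictionary and the function name are introduced here for illustration. As a reference point, the compute-optimal analysis in Hoffmann et al. is commonly summarized as roughly 20 training tokens per parameter, so every tier here would train well past that heuristic.

```python
# (parameter range, token range) per tier, copied from the table above.
TIERS = {
    "Scout": ((300e6, 500e6), (20e9, 50e9)),
    "Real v1": ((1e9, 1.5e9), (100e9, 300e9)),
    "Serious": ((3e9, 3e9), (600e9, 1.5e12)),
}


def tokens_per_param_range(params, tokens):
    """Lowest and highest tokens-per-parameter ratio a tier allows."""
    (p_lo, p_hi), (t_lo, t_hi) = params, tokens
    return t_lo / p_hi, t_hi / p_lo


for name, (params, tokens) in TIERS.items():
    lo, hi = tokens_per_param_range(params, tokens)
    print(f"{name}: {lo:.0f} to {hi:.0f} tokens per parameter")
```

This prints 40 to 167 for Scout, 67 to 300 for Real v1, and 200 to 500 for Serious.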
Target Data Mix for Foundation Training:
Instead of jumping straight into instruction SFT data, a scaled run will prioritize high-quality base data:
- 35-50%: FineWeb / FineWeb-Edu style clean web text
- 20-30%: Dolma / DCLM curated web data
- 8-15%: Code and tech documentation
- 5-12%: Math, science, and academic proofs
- 1-5%: In-house assistant conversational SFT (applied exclusively in late-stage tuning)
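Since the mix is given as ranges, any concrete run still has to pick shares that sum to 100%. The sketch below checks a candidate mix against the ranges. The category keys, the `validate_mix` helper, and the example shares are all hypothetical and are not a mix the project has committed to.

```python
# Percentage ranges from the list above; the key names are made up here.
MIX_RANGES = {
    "clean_web": (35, 50),
    "curated_web": (20, 30),
    "code_docs": (8, 15),
    "math_science": (5, 12),
    "assistant_sft": (1, 5),
}


def validate_mix(mix, ranges=MIX_RANGES):
    """Return a list of problems; an empty list means the mix is valid."""
    problems = []
    for name, (lo, hi) in ranges.items():
        share = mix.get(name, 0)
        if not lo <= share <= hi:
            problems.append(f"{name}={share} outside {lo}-{hi}")
    total = sum(mix.values())
    if total != 100:
        problems.append(f"shares sum to {total}, expected 100")
    return problems


example_mix = {
    "clean_web": 45,
    "curated_web": 25,
    "code_docs": 15,
    "math_science": 10,
    "assistant_sft": 5,
}
midpoint_mix = {name: (lo + hi) / 2 for name, (lo, hi) in MIX_RANGES.items()}

print(validate_mix(example_mix))   # []
print(validate_mix(midpoint_mix))  # ['shares sum to 90.5, expected 100']
```

Note that the range midpoints only add up to 90.5%, so a valid mix has to sit toward the upper end of at least some categories.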
6. What We Can (and Cannot) Claim Safely
What is supported by the data:
- Hierarchos is a functional, coherent 232M experimental assistant checkpoint.
- Combining recurrent sequence loops, memory slots, and hierarchical workers is viable and stable with the right clamps.
- The findings provide a solid engineering roadmap for non-Transformer architecture stability.
What is NOT supported (Do not hype this!):
- No claims of GPT-3.5 level math, coding, or logic.
- No claims of superiority over attention/Transformer models at equal parameter counts yet (baselines pending).
- Not production-ready for heavily quantized or low-bit local deployments yet due to drift sensitivity.
Final Thoughts
Hierarchos 232M shows that small, alternative architectures are still a deeply fruitful area of LLM research if you can conquer the train/inference state drift.
We would love to hear feedback from anyone working on recurrent neural memory or hierarchical backbones! Full code, scripts, and logs are in progress.
References:
Brown et al. **Language Models are Few-Shot Learners.** arXiv:2005.14165. https://arxiv.org/abs/2005.14165
Hoffmann et al. **Training Compute-Optimal Large Language Models.** arXiv:2203.15556. https://arxiv.org/abs/2203.15556
Peng et al. **RWKV: Reinventing RNNs for the Transformer Era.** arXiv:2305.13048. https://arxiv.org/abs/2305.13048
Behrouz et al. **Titans: Learning to Memorize at Test Time.** arXiv:2501.00663. https://arxiv.org/abs/2501.00663
Wang et al. **Hierarchical Reasoning Model.** arXiv:2506.21734. https://arxiv.org/abs/2506.21734
Zellers et al. **HellaSwag: Can a Machine Really Finish Your Sentence?** arXiv:1905.07830. https://arxiv.org/abs/1905.07830
Clark et al. **Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge.** arXiv:1803.05457. https://arxiv.org/abs/1803.05457
Lin et al. **TruthfulQA: Measuring How Models Mimic Human Falsehoods.** arXiv:2109.07958. https://arxiv.org/abs/2109.07958
Hugging Face. **FineWeb dataset.** https://huggingface.co/datasets/HuggingFaceFW/fineweb
Hugging Face. **FineWeb-Edu dataset.** https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
Allen AI. **Dolma dataset.** https://huggingface.co/datasets/allenai/dolma
DataComp-LM. **DCLM Baseline dataset.** https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0
GitHub repository with the architecture and the released model weights: https://github.com/necat101/Hierarchos