A Gravitational Interpretation of Fine-Tuning Reversion
Authors: Samuele Poppi, Nils Lukas
Organization: MBZUAI (Mohamed Bin Zayed University of Artificial Intelligence)
Published: 2026-06-26 (arXiv:2606.28525)
Abstract
Post-alignment safety degradation arises from geometric properties of training history, where fine-tuning reversion follows a persistent direction defined by early training dynamics.
Fine-tuning on harmless data can partially undo behaviors acquired earlier in training. Safety can erode under benign post-alignment updates, unlearned capabilities can re-emerge, latent traits can transfer through apparently unrelated supervision, and related post-alignment fragility appears in other generative settings. We argue these phenomena are usefully viewed through a common training-history lens. Our hypothesis is geometric: large early training phases create dominant behavioral manifolds, while later alignment or specialization phases are shallower displacements from them. Subsequent fine-tuning can therefore inherit a persistent reversion component pointing back toward a witness of the dominant manifold. We call this the gravitational interpretation of fine-tuning reversion. Across our main settings, representational drift rapidly acquires a component along a history-defined reversion direction (v_rev). In our main track, alignment with v_rev rises from cos = 0.429 ± 0.052 after the first update to 0.647 ± 0.021 by step 20. Across 24 run-step pairs, every observed alignment exceeds the p99 of an isotropic activation-space null. We demonstrate that selectively blocking motion along v_rev changes the final alignment at T = 100 from 0.648 ± 0.009 to -0.211 ± 0.021 and reduces harmfulness from 19.0% ± 4.0% to 8.5% ± 1.5% with little task cost. These results support v_rev as a causally relevant mediator of early post-alignment reversion in our setup. Importantly, we do not claim that v_rev is the unique safety direction, nor that the dominant manifold is directly observed; rather, we identify a robust, history-defined direction that explains and partially controls early reversion dynamics.
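The three quantities the abstract relies on (cosine alignment of drift with v_rev, the p99 of an isotropic null, and blocking motion along v_rev) can be illustrated with a small NumPy sketch. The abstract does not say how v_rev is constructed or how the blocking is implemented, so everything here is an assumption made for illustration: v_rev is taken as the normalized difference between a hypothetical pre-alignment activation (`h_base`) and a hypothetical aligned activation (`h_aligned`), the null is sampled from an isotropic Gaussian, and blocking is a plain orthogonal projection. All data are synthetic.

```python
import numpy as np


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def isotropic_null_p99(dim, n_samples, rng):
    """99th percentile of cos(random direction, fixed direction).

    Under an isotropic Gaussian the distribution does not depend on the
    fixed direction, so the first coordinate axis is used for convenience.
    """
    samples = rng.standard_normal((n_samples, dim))
    cosines = samples[:, 0] / np.linalg.norm(samples, axis=1)
    return float(np.percentile(cosines, 99))


def block_direction(update, v):
    """Remove the component of `update` along `v` (orthogonal projection)."""
    v = unit(v)
    return update - np.dot(update, v) * v


rng = np.random.default_rng(0)
dim = 512  # hypothetical activation width

# ASSUMPTION: v_rev points from the aligned state back toward a
# pre-alignment "witness" state. Both vectors are synthetic stand-ins.
h_base = rng.standard_normal(dim)
h_aligned = h_base + 0.5 * rng.standard_normal(dim)
v_rev = unit(h_base - h_aligned)

# Synthetic drift: a reversion component plus an unrelated direction.
drift = 0.6 * v_rev + 0.8 * unit(rng.standard_normal(dim))

alignment = cosine(drift, v_rev)
p99 = isotropic_null_p99(dim, 10_000, rng)
blocked = block_direction(drift, v_rev)

print(f"alignment with v_rev: {alignment:.3f}")
print(f"isotropic null p99:   {p99:.3f}")
print(f"alignment after blocking: {cosine(blocked, v_rev):.3e}")
```

In high dimensions the null cosine concentrates near zero, so even a moderate alignment clears the p99 threshold easily. Exact projection drives the alignment to roughly zero, whereas the abstract reports a negative value (-0.211) after blocking, so the intervention used in the paper evidently differs from this simple projection in ways the abstract does not specify.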