r/LocalLLaMA · July 3, 2026 · 11 min read

Particle Scattering Sampler for llama.cpp


https://github.com/IceFog72/llama.cpp

I added an experimental sampler to llama.cpp called scatter.

The short version: it slightly smooths the model’s next-token probability distribution inside the already-selected top candidates. It is meant to make generation less rigid without doing the usual “raise temperature and wake up the garbage tail” thing.

It uses a light-scattering metaphor, but the implementation is not real physics. The actual operation is much simpler: a cheap local diffusion / moving-average step over token rank.

Think of the model’s next-token distribution as a beam. The strongest candidate is rank 1, the next strongest is rank 2, and so on. scatter lets nearby ranks exchange a little probability mass. Rank 1 can lose some mass to rank 2, 3, 4, etc. Rank 5 can exchange with nearby ranks. But it does not pull random deep-tail tokens into play.

That is the main point:

flatten the head of the distribution without leaking probability into the deep tail.

Status

Implemented and tested as an experimental sampler for llama.cpp.

What is currently implemented:

- Native sampler API:
  - llama_sampler_init_scatter(...)
  - llama_sampler_init_scatter_ext(...)
- Sampler-chain name: scatter
- Sampler-chain character: r
- Included in the default sampler chain between xtc and temperature
- Disabled by default: with default parameters it returns a noop sampler
- Fixed scattering strength
- Optional adaptive strength using entropy feedback
- Optional repeated-token absorption
- Optional collision / mean-free-path gating
- Invariant tests in tests/test-sampling.cpp

What problem is it trying to solve?

Temperature is blunt.

If you raise temperature, you flatten the whole distribution. That can make the model more creative, but it also gives more probability to weak tail tokens. Sometimes that is fine. Sometimes it causes weird word choices, broken formatting, bad identifiers, or incoherent jumps.

scatter is more local.

It only operates inside a chosen top-K candidate set after earlier samplers have already filtered the distribution. So if top-k, top-p, min-p, XTC, etc. have already removed bad candidates, scatter only reshapes what survived.

So instead of this:

"Make everything more random, including the tail."

it does this:

"Take the strongest surviving candidates and locally soften the differences between nearby ranks."

How it works

The sampler runs after earlier samplers have already defined the candidate “medium.”

Recommended order:

penalties -> dry -> top_n_sigma -> top_k -> typ_p -> top_p -> min_p -> xtc -> scatter -> temperature -> dist

So by the time scatter runs, the candidate list has already been filtered by the normal samplers.

Per generated token, it does the following.

1. Collision gate

First, scatter may decide to do nothing for this token.

The flag is:

--scatter-collision N

Default:

--scatter-collision 1.0

With collision = 1.0, scattering runs on every token.

With collision = 0.25, scattering only fires about 25% of the time. That gives a mean free path of roughly 4 tokens.

If scattering does not fire, the sampler leaves the candidates completely untouched:

no sorting
no truncation
no renormalization
no write-back

This is useful if you want occasional stronger “deflections” instead of constant weak smoothing.
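A minimal C sketch of the gate, for illustration only. The RNG here is splitmix64, an assumption standing in for whatever generator the fork actually uses, and scatter_rng, rng_uniform and scatter_collides are hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in RNG (splitmix64); the fork's actual RNG may differ. */
typedef struct { uint64_t state; } scatter_rng;

static uint64_t rng_next(scatter_rng *r) {
    uint64_t z = (r->state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Uniform float in [0, 1) from the top 24 bits, which a float holds exactly. */
static float rng_uniform(scatter_rng *r) {
    return (float)(rng_next(r) >> 40) * (1.0f / 16777216.0f);
}

/* Decides whether scattering fires for the current token.
   collision >= 1 fires every token without drawing; otherwise it fires with
   probability `collision`, giving a mean free path of about 1 / collision tokens. */
static bool scatter_collides(scatter_rng *r, float collision) {
    assert(collision > 0.0f);
    if (collision >= 1.0f) return true;
    return rng_uniform(r) < collision;
}
```

With collision = 0.25 the firing rate settles near 25%, which is the roughly 4-token mean free path mentioned above. The early return for collision >= 1 is just a choice in this sketch so the default setting never consumes random numbers.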

2. Rank-space diffusion

The sampler takes the top k surviving candidates, sorts them by logit, and softmaxes them into probabilities.

Then it applies a local Gaussian smoothing kernel over rank distance:

K_ij = exp(-((i - j)^2) / (2 * radius^2))
q_i = sum_j K_ij * p_j / sum_j K_ij

In plain English:

tokens close in rank share more probability
tokens far away in rank share little or none
rank distance matters, not token meaning
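the kernel is row-normalized, so each q_i is a weighted average of nearby p_j; the q_i do not sum to exactly 1 near the edges of the window, which is why the blend below is renormalized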

Then the direct distribution and scattered distribution are blended:

p_i = normalize((1 - strength) * p_i + strength * q_i)

So:

strength = 0.0 means no scattering
strength = 0.1 means mostly original distribution, slightly smoothed
strength = 0.3 means stronger local flattening

The kernel depends only on rank distance, so it is computed once and cached.
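The blend itself is a few lines. A sketch with a hypothetical scatter_blend helper that works in place on the medium's probabilities:

```c
#include <assert.h>

/* p_i <- normalize((1 - strength) * p_i + strength * q_i), in place. */
static void scatter_blend(float *p, const float *q, int k, float strength) {
    assert(k > 0 && strength >= 0.0f && strength <= 1.0f);
    float sum = 0.0f;
    for (int i = 0; i < k; ++i) {
        p[i] = (1.0f - strength) * p[i] + strength * q[i];
        sum += p[i];
    }
    assert(sum > 0.0f);
    for (int i = 0; i < k; ++i) p[i] /= sum;   /* restores total mass of 1 */
}
```

When p and q are sorted the same way, any convex blend of them is too, so with a sorted q this step does not reorder the medium.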

The sampler can also repeat this diffusion step several times:

--scatter-steps N

Roughly speaking, multiple steps behave like a wider blur:

n steps ≈ one step with radius * sqrt(n)

So normally you should keep steps low.
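The sqrt(n) rule of thumb comes from variances adding when Gaussian blurs are composed. A quick check, valid only for the pure kernel away from the edges of the top-K window (it ignores the strength blend and the per-step renormalization):

```latex
\sigma_{\text{total}}^2 = \underbrace{r^2 + r^2 + \cdots + r^2}_{n\ \text{steps}} = n\,r^2
\quad\Longrightarrow\quad r_{\text{eff}} = r\sqrt{n}

r = 2.5,\ n = 4 \;\Longrightarrow\; r_{\text{eff}} = 2.5 \times 2 = 5.0
```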

3. Adaptive strength

There is also an optional adaptive mode:

--scatter-adaptive

This uses the normalized entropy of the current top-K distribution:

H_norm = -sum_i p_i * log(p_i) / log(K)

Interpretation:

H_norm = 0.0 -> very sharp distribution
H_norm = 1.0 -> very diffuse distribution

Then the sampler compares that entropy to a target:

--scatter-entropy-target 0.55

If the distribution is sharper than the target, the sampler increases the effective scattering strength.

If the distribution is already diffuse, it lowers the effective scattering strength.

Conceptually:

H_norm < target: sharp beam -> denser medium -> stronger scattering
H_norm > target: diffuse -> thinner medium -> weaker scattering

The adaptive strength is clamped between:

--scatter-strength-min N
--scatter-strength-max N
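The post does not spell out the exact entropy-to-strength formula, only its direction and the clamp. The sketch below fills that gap with an invented mapping, strength = base * target / H_norm, purely for illustration; scatter_adaptive_strength and the mapping itself are hypothetical, not the fork's actual rule:

```c
#include <assert.h>

/* HYPOTHETICAL mapping: strength rises when h_norm < target, falls when
   h_norm > target, equals `base` at the target, and is clamped to [s_min, s_max]. */
static float scatter_adaptive_strength(float base, float h_norm, float target,
                                       float s_min, float s_max) {
    assert(s_min <= s_max && target > 0.0f);
    /* near-one-hot medium: go straight to the ceiling */
    float s = h_norm > 1e-6f ? base * (target / h_norm) : s_max;
    if (s < s_min) s = s_min;
    if (s > s_max) s = s_max;
    return s;
}
```

With the preset values (base 0.14, target 0.55), a fully diffuse medium (H_norm = 1) would drop to about 0.077 under this mapping, and anything sharper than H_norm of roughly 0.26 would hit the 0.30 ceiling.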

Example:

--scatter-adaptive \
--scatter-strength 0.14 \
--scatter-strength-min 0.02 \
--scatter-strength-max 0.30 \
--scatter-entropy-target 0.55

Important caveat: adaptive mode pushes strength up when the model is confident. That is deliberate, similar in spirit to XTC, but confident distributions are often confident for a good reason. This may be good for creative prose, but risky for code, math, JSON, exact names, and strict instruction following.

4. Optional repeated-token absorption

There is an optional repetition-related feature:

--scatter-absorption N
--scatter-absorption-last-n N

If a token appeared recently, it can lose some scattered mass:

p_i *= exp(-absorption * repeat_count_i)

This is not meant to replace normal repetition penalty or DRY. It is a soft extra effect restricted to the top-K scattering medium.

You can think of it as a small frequency penalty applied only to candidates currently inside the medium.

Use this carefully. Strong absorption can damage:

names
style markers
dialogue punctuation
intentional repetition
repeated structural tokens
exact formatting

A light value is safer:

--scatter-absorption 0.06 --scatter-absorption-last-n 64

5. Write-back

After scattering, the final probabilities are converted back to logits:

log(max(p_i, eps))

Then the candidate list is truncated to the top-K medium.

So when scattering fires, only the top --scatter-k candidates remain.

Important behavior: order stability

scatter is designed to be order-stable.

The row-normalized Gaussian kernel preserves the rank ordering of a sorted distribution, up to floating-point noise. So if the top token was rank 1 before scattering, it should still be rank 1 afterward.

That means:

argmax is preserved
greedy decoding is unaffected
the sampler does not aggressively reorder candidates
it mostly softens the probability gaps between nearby ranks

The deliberate exception is absorption. If repeated-token absorption is enabled, repeated tokens can drop in rank.

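Order stability alone does not separate scatter from temperature, which also keeps the rank order. The difference is where the mass goes.

Temperature rescales every logit and can lift deep-tail tokens.

scatter only moves probability locally inside the selected top-K medium.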

Tests

tests/test-sampling.cpp checks:

order stability
argmax preservation
normalization
truncation
collision-0 noop behavior
absorption behavior on synthetic distributions

CLI reference

Flag	Default	Meaning
`--scatter-k N`	`64`	Top-K scattering medium. These are the candidates that participate in scattering. Also truncates to this size when scattering fires.
`--scatter-strength N`	`0.0`	Blend between original and scattered distribution. `0.0` means disabled.
`--scatter-radius N`	`2.5`	Gaussian rank radius. Higher means probability moves across a wider rank neighborhood.
`--scatter-steps N`	`1`	Number of repeated diffusion passes.
`--scatter-adaptive`	off	Enables entropy-feedback strength.
`--scatter-strength-min N`	`0.02`	Lower bound for adaptive strength.
`--scatter-strength-max N`	`0.30`	Upper bound for adaptive strength.
`--scatter-entropy-target N`	`0.55`	Target normalized entropy for adaptive mode.
`--scatter-absorption N`	`0.0`	Repeated-token damping. `0.0` means disabled.
`--scatter-absorption-last-n N`	`64`	History window for absorption.
`--scatter-collision N`	`1.0`	Per-token probability that scattering fires. Mean free path is roughly `1 / collision` tokens.

The sampler is active only if at least one of these is true:

--scatter-strength > 0
--scatter-adaptive is enabled
--scatter-absorption > 0

And these must also be valid:

k > 1
radius > 0
steps > 0
collision > 0

Otherwise it becomes a noop.
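The activation rules reduce to one predicate. A sketch with a hypothetical scatter_params struct whose fields mirror the CLI flags (not the fork's actual struct):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of settings that decide noop vs active. */
typedef struct {
    int   k;
    float strength;
    float radius;
    int   steps;
    bool  adaptive;
    float absorption;
    float collision;
} scatter_params;

/* Active only if some effect is requested AND the medium is valid. */
static bool scatter_is_active(const scatter_params *sp) {
    assert(sp);
    bool requested = sp->strength > 0.0f || sp->adaptive || sp->absorption > 0.0f;
    bool valid = sp->k > 1 && sp->radius > 0.0f && sp->steps > 0 && sp->collision > 0.0f;
    return requested && valid;
}
```

With every field at its CLI default (k 64, strength 0.0, radius 2.5, steps 1, adaptive off, absorption 0.0, collision 1.0) nothing is requested, which is why the defaults cost nothing.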

It is already in the default sampler chain. To remove it completely, remove scatter from --samplers or remove r from --sampler-seq.

Leaving all scatter flags at default also disables it for free.

Recommended sampler order

penalties -> dry -> top_n_sigma -> top_k -> typ_p -> top_p -> min_p -> xtc -> scatter -> temperature -> dist

Why this order?

Penalties and DRY apply pressure against repeated paths.
Top-k, top-p, min-p, and XTC define the candidate medium.
scatter redistributes probability only inside that medium.
Temperature and final sampling happen afterward.

This is the default chain order.

Presets

Subtle fixed scattering

Good if you want a light effect.

--scatter-strength 0.12 \
--scatter-radius 2.0 \
--scatter-k 64

Creative-writing starting point

A stronger but still reasonable starting point for prose / RP.

--scatter-strength 0.18 \
--scatter-radius 2.5 \
--scatter-k 64

Adaptive medium

Lets entropy decide how strong the scattering should be.

--scatter-adaptive \
--scatter-strength 0.14 \
--scatter-strength-min 0.02 \
--scatter-strength-max 0.30 \
--scatter-entropy-target 0.55 \
--scatter-radius 2.5 \
--scatter-k 64

Collision-gated scattering

Mean free path is about 4 tokens:

--scatter-collision 0.25 \
--scatter-strength 0.30 \
--scatter-radius 2.5 \
--scatter-k 64

This means scattering fires less often, but when it does fire, it hits harder.

In testing, occasional strong deflections tend to preserve local coherence better than constant weak blur at the same average strength.

Note: this makes generation stochastic even before the final dist sampler, because the collision gate draws from its own RNG, seeded from the common sampling seed.

Adaptive + light absorption

--scatter-adaptive \
--scatter-strength 0.14 \
--scatter-strength-min 0.02 \
--scatter-strength-max 0.30 \
--scatter-entropy-target 0.55 \
--scatter-absorption 0.06 \
--scatter-absorption-last-n 64 \
--scatter-radius 2.5 \
--scatter-k 64

Keep absorption low. Too much absorption can damage repeated names, punctuation, formatting, and intentional style patterns.

Full example command

./build/bin/llama-cli \
  -m model.gguf \
  --samplers "penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;scatter;temperature" \
  --temp 0.8 \
  --top-k 40 \
  --top-p 0.95 \
  --min-p 0.05 \
  --scatter-adaptive \
  --scatter-strength 0.14 \
  --scatter-strength-min 0.02 \
  --scatter-strength-max 0.30 \
  --scatter-entropy-target 0.55 \
  --scatter-radius 2.5 \
  --scatter-k 64

Compact sampler sequence:

--sampler-seq edskypmxrt

Where:

e = penalties
d = dry
s = top_n_sigma
k = top_k
y = typ_p
p = top_p
m = min_p
x = xtc
r = scatter
t = temperature

What this is probably good for

Likely useful for:

creative writing
RP prose
less rigid word choice
soft exploration
avoiding high-temperature chaos
making the model less “locked” to the top token

Risky for:

code
math
strict instruction following
JSON
grammar-constrained output
rare exact names
identifiers
tasks where the top token is usually correct for a reason

Adaptive mode is especially risky for strict tasks because it increases strength when the model is confident.

Native API

Fixed diffusion mode:

/// Fixed diffusion mode. strength <= 0, k <= 1, radius <= 0, or steps <= 0 -> noop.
LLAMA_API struct llama_sampler * llama_sampler_init_scatter(
        int32_t k,
        float   strength,
        float   radius,
        int32_t steps);

Extended mode:

/// Extended: adaptive medium, repeated-token absorption, collision gating.
LLAMA_API struct llama_sampler * llama_sampler_init_scatter_ext(
        int32_t  k,
        float    strength,
        float    radius,
        int32_t  steps,
        bool     adaptive,
        float    strength_min,
        float    strength_max,
        float    entropy_target,
        float    absorption,
        int32_t  absorption_last_n,
        float    collision,
        uint32_t seed);

The common CLI path uses:

llama_sampler_init_scatter_ext(...)

with the chain seed.

The sampler is a normal llama_sampler_i implementation:

does not need llama_context
does not call llama_decode
does not touch the KV cache
accept records sampled tokens only when absorption is enabled
reset clears absorption history and reseeds the collision RNG
clone copies both state and settings

Files changed

include/llama.h
src/llama-sampler.cpp
common/common.h
common/sampling.cpp
common/arg.cpp
tests/test-sampling.cpp
docs/particle-scattering-sampler.md

Known limitations

The biggest limitation is that rank distance is not semantic distance.

Rank 5 and rank 6 are close in probability rank, but they may not be similar tokens. So the phrase “nearby candidates” only means nearby in sorted probability order, not nearby in meaning.

Because of that, strength, radius, and steps should stay modest.

This is a practical, cheap sampler, not a semantic diffusion model.

Possible future work

1. Embedding-metric diffusion

Instead of using rank distance:

K_ij = exp(-((i - j)^2) / (2 * radius^2))

use token embedding distance:

K_ij = exp(-||e_i - e_j||^2 / (2σ^2))

That would scatter through actual token embedding space instead of rank space.

This would fix the main limitation, but it requires sampler access to the token embedding matrix, so it needs a small API addition.

2. Token-class gating

A cheaper approximation would be to only let similar token classes scatter with each other.

For example:

words with words
punctuation with punctuation
leading-space tokens with leading-space tokens
numeric tokens with numeric tokens

This would prevent some obviously bad rank-neighbor mixing without needing full embedding access.
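A rough sketch of class gating. The classifier is hypothetical and simplified (it assumes plain token text with a literal leading space, and only applies the leading-space rule to words), so treat it as illustration only. gate_kernel zeroes cross-class entries of an existing row-normalized kernel and renormalizes each row:

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical token classes, judged from the token's text. */
typedef enum { TOK_WORD, TOK_SPACE_WORD, TOK_NUMBER, TOK_PUNCT, TOK_OTHER } tok_class;

static tok_class classify_token(const char *text) {
    assert(text);
    int lead_space = text[0] == ' ';
    const unsigned char *s = (const unsigned char *)text + lead_space;
    if (*s == '\0') return TOK_OTHER;
    if (isdigit(*s)) return TOK_NUMBER;
    if (isalpha(*s)) return lead_space ? TOK_SPACE_WORD : TOK_WORD;
    if (ispunct(*s)) return TOK_PUNCT;
    return TOK_OTHER;
}

/* Zero kernel entries between different classes, then renormalize each row.
   The diagonal (same token, same class) is kept, so rows stay non-empty
   for any kernel with a positive diagonal such as the Gaussian one. */
static void gate_kernel(float *w, const tok_class *cls, int k) {
    for (int i = 0; i < k; ++i) {
        float row = 0.0f;
        for (int j = 0; j < k; ++j) {
            if (cls[i] != cls[j]) w[i * k + j] = 0.0f;
            row += w[i * k + j];
        }
        assert(row > 0.0f);
        for (int j = 0; j < k; ++j) w[i * k + j] /= row;
    }
}
```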

3. Branched-flow lookahead

A more expensive idea: do short rollouts for several candidate branches, then rerank by branch score.

That would require extra forward passes and temporary KV sequences, so it should probably be a separate sampler rather than part of scatter.

Practical summary

scatter is a local head-flattening sampler.

It does not make the model “globally more random” like temperature. It does not pull in the deep tail. It only redistributes probability among the strongest surviving candidates after the normal sampler filters have already done their job.

The intended use case is creative generation where you want slightly more varied wording and less top-token rigidity, but you do not want the chaos that comes from simply raising temperature.

submitted by /u/Pristine_Income9554
