
Particle Scattering Sampler for llama.cpp

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

https://github.com/IceFog72/llama.cpp

I added an experimental sampler to llama.cpp called scatter.

The short version: it slightly smooths the model’s next-token probability distribution inside the already-selected top candidates. It is meant to make generation less rigid without doing the usual “raise temperature and wake up the garbage tail” thing.

It uses a light-scattering metaphor, but the implementation is not real physics. The actual operation is much simpler: a cheap local diffusion / moving-average step over token rank.

Think of the model’s next-token distribution as a beam. The strongest candidate is rank 1, the next strongest is rank 2, and so on. scatter lets nearby ranks exchange a little probability mass. Rank 1 can lose some mass to rank 2, 3, 4, etc. Rank 5 can exchange with nearby ranks. But it does not pull random deep-tail tokens into play.

That is the main point:

flatten the head of the distribution without leaking probability into the deep tail.

Status

Implemented and tested as an experimental sampler for llama.cpp.

What is currently implemented:

  • Native sampler API:
    • llama_sampler_init_scatter(...)
    • llama_sampler_init_scatter_ext(...)
  • Sampler-chain name: scatter
  • Sampler-chain character: r
  • Included in the default sampler chain between xtc and temperature
  • Disabled by default: with default parameters it returns a noop sampler
  • Fixed scattering strength
  • Optional adaptive strength using entropy feedback
  • Optional repeated-token absorption
  • Optional collision / mean-free-path gating
  • Invariant tests in tests/test-sampling.cpp

What problem is it trying to solve?

Temperature is blunt.

If you raise temperature, you flatten the whole distribution. That can make the model more creative, but it also gives more probability to weak tail tokens. Sometimes that is fine. Sometimes it causes weird word choices, broken formatting, bad identifiers, or incoherent jumps.

scatter is more local.

It only operates inside a chosen top-K candidate set after earlier samplers have already filtered the distribution. So if top-k, top-p, min-p, XTC, etc. have already removed bad candidates, scatter only reshapes what survived.

So instead of this:

"Make everything more random, including the tail." 

it does this:

"Take the strongest surviving candidates and locally soften the differences between nearby ranks." 

How it works

The sampler runs after earlier samplers have already defined the candidate “medium.”

Recommended order:

penalties -> dry -> top_n_sigma -> top_k -> typ_p -> top_p -> min_p -> xtc -> scatter -> temperature -> dist 

So by the time scatter runs, the candidate list has already been filtered by the normal samplers.

Per generated token, it does the following.

1. Collision gate

First, scatter may decide to do nothing for this token.

The flag is:

--scatter-collision N 

Default:

--scatter-collision 1.0 

With collision = 1.0, scattering runs every token.

With collision = 0.25, scattering only fires about 25% of the time. That gives a mean free path of roughly 4 tokens.

If scattering does not fire, the sampler leaves the candidates completely untouched:

  • no sorting
  • no truncation
  • no renormalization
  • no write-back

This is useful if you want occasional stronger “deflections” instead of constant weak smoothing.
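A minimal sketch of what such a gate could look like, assuming a per-sampler Mersenne Twister RNG. The name `collision_gate` and its layout are made up for illustration and are not taken from the fork.

```cpp
#include <cstdint>
#include <random>

// Illustrative only: `collision_gate` is a hypothetical name, not the fork's code.
struct collision_gate {
    std::mt19937 rng;
    double       collision;  // per-token probability that scattering fires

    collision_gate(double collision_prob, std::uint32_t seed)
        : rng(seed), collision(collision_prob) {}

    // True when scattering should run for the current token.
    bool fires() {
        if (collision >= 1.0) return true;   // default: fire on every token
        if (collision <= 0.0) return false;  // never fires
        return std::bernoulli_distribution(collision)(rng);
    }
};
```

With `collision = 0.25` the gap between firings follows a geometric distribution with mean 1 / 0.25 = 4 tokens, which is where the "mean free path" figure comes from.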

2. Rank-space diffusion

The sampler takes the top k surviving candidates, sorts them by logit, and softmaxes them into probabilities.

Then it applies a local Gaussian smoothing kernel over rank distance:

K_ij = exp(-((i - j)^2) / (2 * radius^2))
q_i = sum_j(K_ij * p_j) / sum_j(K_ij)

In plain English:

  • tokens close in rank share more probability
  • tokens far away in rank share little or none
  • rank distance matters, not token meaning
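  • the kernel is row-normalized, so each q_i is a weighted average of nearby p_j; the total can drift slightly from 1 near the edges of the window, which is why the blend below renormalizes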

Then the direct distribution and scattered distribution are blended:

p_i = normalize((1 - strength) * p_i + strength * q_i) 

So:

  • strength = 0.0 means no scattering
  • strength = 0.1 means mostly original distribution, slightly smoothed
  • strength = 0.3 means stronger local flattening

The kernel depends only on rank distance, so it is computed once and cached.
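The two formulas above can be sketched as a single pass like this. It is a rough illustration under my own naming (`scatter_step` is hypothetical), and it recomputes the kernel weights inline rather than caching them as the real sampler does.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One diffusion pass over rank. `p` is expected to be sorted in descending
// order and to sum to 1. Hypothetical helper, not the fork's actual code.
std::vector<float> scatter_step(const std::vector<float> & p, float strength, float radius) {
    assert(radius > 0.0f && strength >= 0.0f && strength <= 1.0f);
    const size_t k = p.size();
    std::vector<float> out(k);
    float total = 0.0f;
    for (size_t i = 0; i < k; ++i) {
        float num = 0.0f;
        float den = 0.0f;
        for (size_t j = 0; j < k; ++j) {
            const float d = float(i) - float(j);
            const float w = std::exp(-(d * d) / (2.0f * radius * radius));  // K_ij
            num += w * p[j];
            den += w;
        }
        const float q = num / den;                          // row-normalized average q_i
        out[i] = (1.0f - strength) * p[i] + strength * q;   // blend with the original
        total += out[i];
    }
    for (float & v : out) v /= total;                       // renormalize to sum to 1
    return out;
}
```

Calling `scatter_step(p, 0.3f, 2.5f)` on `{0.6, 0.2, 0.1, 0.06, 0.04}` lowers the top probability and raises the lower ranks while keeping the order intact.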

The sampler can also repeat this diffusion step several times:

--scatter-steps N 

Roughly speaking, multiple steps behave like a wider blur:

n steps ≈ one step with radius * sqrt(n) 

So normally you should keep steps low.
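The sqrt(n) rule comes from how Gaussian blurs compose. As a back-of-the-envelope derivation, assuming strength = 1 and ignoring edge effects at both ends of the top-K window (r is the radius, and sigma_n is my own symbol for the effective radius after n passes):

```latex
\sigma_n^2 = \underbrace{r^2 + r^2 + \dots + r^2}_{n\ \text{passes}} = n\,r^2
\quad\Longrightarrow\quad
\sigma_n = r\sqrt{n}
```

For example, radius 2.5 with 4 steps at full strength behaves roughly like one pass at radius 5. With strength s < 1, each pass mixes the identity with the blur, so the per-pass variance is closer to s * r^2 and the effective spread grows more slowly.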

3. Adaptive strength

There is also an optional adaptive mode:

--scatter-adaptive 

This uses the normalized entropy of the current top-K distribution:

H_norm = -sum_i p_i * log(p_i) / log(K) 

Interpretation:

H_norm = 0.0 -> very sharp distribution
H_norm = 1.0 -> very diffuse distribution
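As a small sketch, the normalized entropy can be computed like this. The function name is mine, and treating zero-probability entries as contributing nothing is an assumption about how the edge case is handled.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Normalized Shannon entropy of a probability vector:
// 0 for a one-hot distribution, 1 for a uniform one.
// Hypothetical helper name, not the fork's code.
float normalized_entropy(const std::vector<float> & p) {
    assert(p.size() > 1);  // log(K) is zero for K = 1
    float h = 0.0f;
    for (float v : p) {
        if (v > 0.0f) {
            h -= v * std::log(v);  // entries with p = 0 are skipped (0 * log 0 taken as 0)
        }
    }
    return h / std::log(float(p.size()));
}
```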

Then the sampler compares that entropy to a target:

--scatter-entropy-target 0.55 

If the distribution is sharper than the target, the sampler increases the effective scattering strength.

If the distribution is already diffuse, it lowers the effective scattering strength.

Conceptually:

H_norm < target: sharp beam -> denser medium -> stronger scattering
H_norm > target: diffuse -> thinner medium -> weaker scattering

The adaptive strength is clamped between:

--scatter-strength-min N
--scatter-strength-max N

Example:

--scatter-adaptive \
--scatter-strength 0.14 \
--scatter-strength-min 0.02 \
--scatter-strength-max 0.30 \
--scatter-entropy-target 0.55

Important caveat: adaptive mode pushes strength up when the model is confident. That is deliberate, similar in spirit to XTC, but confident distributions are often confident for a good reason. This may be good for creative prose, but risky for code, math, JSON, exact names, and strict instruction following.

4. Optional repeated-token absorption

There is an optional repetition-related feature:

--scatter-absorption N
--scatter-absorption-last-n N

If a token appeared recently, it can lose some scattered mass:

p_i *= exp(-absorption * repeat_count_i) 

This is not meant to replace normal repetition penalty or DRY. It is a soft extra effect restricted to the top-K scattering medium.

You can think of it as a small frequency penalty applied only to candidates currently inside the medium.
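Here is a rough sketch of that damping step. The helper name `apply_absorption` is hypothetical, and the renormalization at the end is my assumption; the post only gives the multiplicative formula.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Damps candidates that appeared recently, then renormalizes (assumed step).
// repeat_count[i] = how often candidate i occurs in the last-n history window.
// Hypothetical helper, not the fork's actual code.
void apply_absorption(std::vector<float> & p, const std::vector<int> & repeat_count, float absorption) {
    assert(p.size() == repeat_count.size());
    float total = 0.0f;
    for (size_t i = 0; i < p.size(); ++i) {
        p[i] *= std::exp(-absorption * float(repeat_count[i]));
        total += p[i];
    }
    if (total > 0.0f) {
        for (float & v : p) v /= total;
    }
}
```

With absorption 0.06, a token seen three times in the window is scaled by exp(-0.18), about 0.84, so a close rank-1 / rank-2 pair such as 0.40 vs 0.38 can swap places.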

Use this carefully. Strong absorption can damage:

  • names
  • style markers
  • dialogue punctuation
  • intentional repetition
  • repeated structural tokens
  • exact formatting

A light value is safer:

--scatter-absorption 0.06
--scatter-absorption-last-n 64

5. Write-back

After scattering, the final probabilities are converted back to logits:

log(max(p_i, eps)) 

Then the candidate list is truncated to the top-K medium.

So when scattering fires, only the top --scatter-k candidates remain.
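A minimal sketch of the probability-to-logit conversion follows. The function name and the value of `eps` are assumptions for illustration; the post does not say which epsilon the sampler uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Converts probabilities back to logits, clamping at eps so that a zero
// probability becomes a very negative but finite logit.
// Hypothetical helper; the eps default is an assumed value.
std::vector<float> probs_to_logits(const std::vector<float> & p, float eps = 1e-20f) {
    assert(eps > 0.0f);
    std::vector<float> logits(p.size());
    for (size_t i = 0; i < p.size(); ++i) {
        logits[i] = std::log(std::max(p[i], eps));
    }
    return logits;
}
```

Writing log-probabilities back works because a later softmax over log(p) recovers p when p sums to 1, so the downstream temperature and dist samplers see the scattered distribution.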

Important behavior: order stability

scatter is designed to be order-stable.

The row-normalized Gaussian kernel preserves the rank ordering of a sorted distribution, up to floating-point noise. So if the top token was rank 1 before scattering, it should still be rank 1 afterward.

That means:

  • argmax is preserved
  • greedy decoding is unaffected
  • the sampler does not aggressively reorder candidates
  • it mostly softens the probability gaps between nearby ranks

The deliberate exception is absorption. If repeated-token absorption is enabled, repeated tokens can drop in rank.

This is one of the main differences from temperature.

Temperature can lift deep-tail tokens.

scatter only moves probability locally inside the selected top-K medium.

Tests

tests/test-sampling.cpp checks:

  • order stability
  • argmax preservation
  • normalization
  • truncation
  • collision-0 noop behavior
  • absorption behavior on synthetic distributions

CLI reference

| Flag | Default | Meaning |
|---|---|---|
| --scatter-k N | 64 | Top-K scattering medium. These are the candidates that participate in scattering. Also truncates to this size when scattering fires. |
| --scatter-strength N | 0.0 | Blend between original and scattered distribution. 0.0 means disabled. |
| --scatter-radius N | 2.5 | Gaussian rank radius. Higher means probability moves across a wider rank neighborhood. |
| --scatter-steps N | 1 | Number of repeated diffusion passes. |
| --scatter-adaptive | off | Enables entropy-feedback strength. |
| --scatter-strength-min N | 0.02 | Lower bound for adaptive strength. |
| --scatter-strength-max N | 0.30 | Upper bound for adaptive strength. |
| --scatter-entropy-target N | 0.55 | Target normalized entropy for adaptive mode. |
| --scatter-absorption N | 0.0 | Repeated-token damping. 0.0 means disabled. |
| --scatter-absorption-last-n N | 64 | History window for absorption. |
| --scatter-collision N | 1.0 | Per-token probability that scattering fires. Mean free path is roughly 1 / collision tokens. |

The sampler is active only if at least one of these is true:

  • --scatter-strength > 0
  • --scatter-adaptive is enabled
  • --scatter-absorption > 0

And these must also be valid:

  • k > 1
  • radius > 0
  • steps > 0
  • collision > 0

Otherwise it becomes a noop.
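The activation rule can be written as a small predicate. This is a sketch only: the `scatter_params` struct and its field names are hypothetical stand-ins that mirror the CLI defaults, not the fork's actual parameter struct.

```cpp
// Hypothetical parameter bundle mirroring the CLI defaults listed in the
// CLI reference; names are illustrative, not the fork's.
struct scatter_params {
    int   k          = 64;
    float strength   = 0.0f;
    float radius     = 2.5f;
    int   steps      = 1;
    bool  adaptive   = false;
    float absorption = 0.0f;
    float collision  = 1.0f;
};

// True when the sampler would do real work; false means it acts as a noop.
bool scatter_is_active(const scatter_params & sp) {
    const bool wants_effect = sp.strength > 0.0f || sp.adaptive || sp.absorption > 0.0f;
    const bool valid        = sp.k > 1 && sp.radius > 0.0f && sp.steps > 0 && sp.collision > 0.0f;
    return wants_effect && valid;
}
```

A default-constructed `scatter_params` returns false here, which matches the "disabled by default" behavior described above.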

It is already in the default sampler chain. To remove it completely, remove scatter from --samplers or remove r from --sampler-seq.

Leaving all scatter flags at default also disables it for free.

Recommended sampler order

penalties -> dry -> top_n_sigma -> top_k -> typ_p -> top_p -> min_p -> xtc -> scatter -> temperature -> dist 

Why this order?

  1. Penalties and DRY apply pressure against repeated paths.
  2. Top-k, top-p, min-p, and XTC define the candidate medium.
  3. scatter redistributes probability only inside that medium.
  4. Temperature and final sampling happen afterward.

This is the default chain order.

Presets

Subtle fixed scattering

Good if you want a light effect.

--scatter-strength 0.12 \
--scatter-radius 2.0 \
--scatter-k 64

Creative-writing starting point

A stronger but still reasonable starting point for prose / RP.

--scatter-strength 0.18 \
--scatter-radius 2.5 \
--scatter-k 64

Adaptive medium

Lets entropy decide how strong the scattering should be.

--scatter-adaptive \
--scatter-strength 0.14 \
--scatter-strength-min 0.02 \
--scatter-strength-max 0.30 \
--scatter-entropy-target 0.55 \
--scatter-radius 2.5 \
--scatter-k 64

Collision-gated scattering

Mean free path is about 4 tokens:

--scatter-collision 0.25 \
--scatter-strength 0.30 \
--scatter-radius 2.5 \
--scatter-k 64

This means scattering fires less often, but when it does fire, it hits harder.

In testing, occasional strong deflections tend to preserve local coherence better than constant weak blur at the same average strength.

Note: this makes generation stochastic even before the final dist sampler, because the collision gate draws from its own RNG, which is seeded from the common sampler seed.

Adaptive + light absorption

--scatter-adaptive \
--scatter-strength 0.14 \
--scatter-strength-min 0.02 \
--scatter-strength-max 0.30 \
--scatter-entropy-target 0.55 \
--scatter-absorption 0.06 \
--scatter-absorption-last-n 64 \
--scatter-radius 2.5 \
--scatter-k 64

Keep absorption low. Too much absorption can damage repeated names, punctuation, formatting, and intentional style patterns.

Full example command

./build/bin/llama-cli \
  -m model.gguf \
  --samplers "penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;scatter;temperature" \
  --temp 0.8 \
  --top-k 40 \
  --top-p 0.95 \
  --min-p 0.05 \
  --scatter-adaptive \
  --scatter-strength 0.14 \
  --scatter-strength-min 0.02 \
  --scatter-strength-max 0.30 \
  --scatter-entropy-target 0.55 \
  --scatter-radius 2.5 \
  --scatter-k 64

Compact sampler sequence:

--sampler-seq edskypmxrt 

Where:

e = penalties
d = dry
s = top_n_sigma
k = top_k
y = typ_p
p = top_p
m = min_p
x = xtc
r = scatter
t = temperature

What this is probably good for

Likely useful for:

  • creative writing
  • RP prose
  • less rigid word choice
  • soft exploration
  • avoiding high-temperature chaos
  • making the model less “locked” to the top token

Risky for:

  • code
  • math
  • strict instruction following
  • JSON
  • grammar-constrained output
  • rare exact names
  • identifiers
  • tasks where the top token is usually correct for a reason

Adaptive mode is especially risky for strict tasks because it increases strength when the model is confident.

Native API

Fixed diffusion mode:

/// Fixed diffusion mode. strength <= 0, k <= 1, radius <= 0, or steps <= 0 -> noop.
LLAMA_API struct llama_sampler * llama_sampler_init_scatter(
    int32_t k,
    float   strength,
    float   radius,
    int32_t steps);

Extended mode:

/// Extended: adaptive medium, repeated-token absorption, collision gating.
LLAMA_API struct llama_sampler * llama_sampler_init_scatter_ext(
    int32_t  k,
    float    strength,
    float    radius,
    int32_t  steps,
    bool     adaptive,
    float    strength_min,
    float    strength_max,
    float    entropy_target,
    float    absorption,
    int32_t  absorption_last_n,
    float    collision,
    uint32_t seed);

The common CLI path uses:

llama_sampler_init_scatter_ext(...) 

with the chain seed.

The sampler is a normal llama_sampler_i implementation:

  • does not need llama_context
  • does not call llama_decode
  • does not touch the KV cache
  • accept records sampled tokens only when absorption is enabled
  • reset clears absorption history and reseeds the collision RNG
  • clone copies both state and settings

Files changed

  • include/llama.h
  • src/llama-sampler.cpp
  • common/common.h
  • common/sampling.cpp
  • common/arg.cpp
  • tests/test-sampling.cpp
  • docs/particle-scattering-sampler.md

Known limitations

The biggest limitation is that rank distance is not semantic distance.

Rank 5 and rank 6 are close in probability rank, but they may not be similar tokens. So the phrase “nearby candidates” only means nearby in sorted probability order, not nearby in meaning.

Because of that, strength, radius, and steps should stay modest.

This is a practical, cheap sampler, not a semantic diffusion model.

Possible future work

1. Embedding-metric diffusion

Instead of using rank distance:

K_ij = exp(-((i - j)^2) / (2 * radius^2)) 

use token embedding distance:

K_ij = exp(-||e_i - e_j||^2 / (2σ^2)) 

That would scatter through actual token embedding space instead of rank space.

This would fix the main limitation, but it requires sampler access to the token embedding matrix, so it needs a small API addition.
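To make the idea concrete, here is a hedged sketch of such a kernel. Nothing like this exists in the fork yet; `embedding_kernel` is a hypothetical name, and the sketch simply assumes the candidate embeddings are already available as plain float vectors.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Row-normalized Gaussian kernel over embedding distance (future-work idea).
// emb[i] is the embedding of candidate i; how a sampler would obtain it is
// left open. Hypothetical helper, not part of the current implementation.
std::vector<std::vector<float>> embedding_kernel(const std::vector<std::vector<float>> & emb, float sigma) {
    assert(sigma > 0.0f);
    const size_t k = emb.size();
    std::vector<std::vector<float>> K(k, std::vector<float>(k));
    for (size_t i = 0; i < k; ++i) {
        float row_sum = 0.0f;
        for (size_t j = 0; j < k; ++j) {
            assert(emb[i].size() == emb[j].size());
            float d2 = 0.0f;  // squared Euclidean distance ||e_i - e_j||^2
            for (size_t c = 0; c < emb[i].size(); ++c) {
                const float diff = emb[i][c] - emb[j][c];
                d2 += diff * diff;
            }
            K[i][j] = std::exp(-d2 / (2.0f * sigma * sigma));
            row_sum += K[i][j];
        }
        for (float & w : K[i]) w /= row_sum;  // each row sums to 1
    }
    return K;
}
```

Unlike the rank kernel, this one depends on which tokens are in the candidate set, so it could not be computed once and cached across tokens.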

2. Token-class gating

A cheaper approximation would be to only let similar token classes scatter with each other.

For example:

  • words with words
  • punctuation with punctuation
  • leading-space tokens with leading-space tokens
  • numeric tokens with numeric tokens

This would prevent some obviously bad rank-neighbor mixing without needing full embedding access.

3. Branched-flow lookahead

A more expensive idea: do short rollouts for several candidate branches, then rerank by branch score.

That would require extra forward passes and temporary KV sequences, so it should probably be a separate sampler rather than part of scatter.

Practical summary

scatter is a local head-flattening sampler.

It does not make the model “globally more random” like temperature. It does not pull in the deep tail. It only redistributes probability among the strongest surviving candidates after the normal sampler filters have already done their job.

The intended use case is creative generation where you want slightly more varied wording and less top-token rigidity, but you do not want the chaos that comes from simply raising temperature.

submitted by /u/Pristine_Income9554