trained a prompt injection detector using ml-intern and DeepSeek v4 Flash, runs in the browser
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
| Trained a prompt injection classifier using https://huggingface.co/spaces/av-codes/prompt-injection-detector --- I've been interested in prompt injections and agentic security for a while, and wanted to see how a purpose-built ML agent compares to general-purpose coding agents for this kind of task. Here's roughly how it went:
For v1, I went with DistilBERT targeting CPU inference. After a few parameter sweeps, the agent launched a full run and landed at F1 95.87%. I also tried training an HRM-Text model, but the agent didn't figure it out and set up a TRM run instead (different architecture, no positional encoding). When I steered it back to HRM with the correct paper, the training script wasn't optimized for my hardware. I spent $20 on HF remote training with a T4, but it fumbled after epoch 1 because agent didn't follow training routine from the paper and used wrong optimiser/params leading to params blowing up. For v2, I found a larger synthetic dataset from Bordair and re-trained the DistilBERT. That's the model in the Space above. What surprised me:
The obvious gap: the synthetic dataset means the train/test splits might be too similar. Not a proper scientific approach, but it's the most pleasant ML experience I've had with an agentic tool so far. The HRM run is still pending. I'm curious to learn about other people's experiences with these tools. Thank you! [link] [comments] |
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.