Improving machine-translated novels via style transfer — looking for advice on the faithfulness/fluency tradeoff [P]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
Hey all.
I recently started working on a project to improve machine-translated webnovels via style transfer. The basic idea is to take the clunky translated prose and rewrite it into something that reads like it was written by a professional author, while staying as faithful as possible to the original text.
The source material is mostly amateur/MTL output full of sentence structures carried over directly from Chinese, awkward honorifics, over-translated idioms, that kind of thing. The goal isn't retranslation from the source but a cleanup of the English output.
The tricky part is that I have no clean parallel data (rough MTL text paired with polished rewrites) for supervised approaches.
I've been looking at a few directions:
- STRAP (Krishna et al., EMNLP 2020): reframe style transfer as paraphrase generation, create pseudo-parallel pairs automatically, and fine-tune a style-specific inverse paraphrase model. Seems like the cleanest unsupervised framing. Unfortunately, it works at the sentence level, and I need a way to maintain context over thousands of pages.
- Translating away Translationese (Jalota et al., EMNLP 2023): directly targets the "sounds like a translation" problem with a self-supervised + LM fluency + semantic similarity loss setup.
- Fine-tuning on target-style prose: collect high-quality English novels and fine-tune a small LLM to rewrite in that register.
- Just use a local LLM: run a local LLM and give it guidelines on what to rewrite and what to leave unchanged. No fine-tuning needed, just hoping the model can handle it from the prompt alone.
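For the last option, one way to get some context beyond a single sentence is to rewrite paragraph by paragraph and feed the most recently rewritten paragraphs back into each prompt. Below is a minimal sketch of that loop, not a tested pipeline. The names `GUIDELINES`, `build_prompt`, `rewrite_chapter`, and `rewrite_fn` are hypothetical; `rewrite_fn` stands in for whatever call runs the local model, and the window of 3 paragraphs is an arbitrary assumption.

```python
from collections import deque
from typing import Callable, Deque, List

# Hypothetical guideline text; the wording is an assumption, not a tuned prompt.
GUIDELINES = (
    "Rewrite the TARGET paragraph as fluent literary English. "
    "Keep every event, name, and term. Do not add or remove content."
)


def build_prompt(context: List[str], target: str, guidelines: str = GUIDELINES) -> str:
    """Assemble one prompt: guidelines, recent rewritten context, then the target."""
    context_block = "\n".join(context) if context else "(none)"
    return (
        f"{guidelines}\n\n"
        f"CONTEXT (already rewritten, do not repeat):\n{context_block}\n\n"
        f"TARGET:\n{target}\n"
    )


def rewrite_chapter(
    paragraphs: List[str],
    rewrite_fn: Callable[[str], str],
    context_size: int = 3,
) -> List[str]:
    """Rewrite paragraphs in order, carrying the last `context_size` outputs along.

    `rewrite_fn` is a placeholder for the local LLM call: prompt in, text out.
    """
    recent: Deque[str] = deque(maxlen=context_size)  # rolling window of outputs
    rewritten: List[str] = []
    for paragraph in paragraphs:
        prompt = build_prompt(list(recent), paragraph)
        output = rewrite_fn(prompt).strip()
        rewritten.append(output)
        recent.append(output)  # the oldest entry drops out once the window is full
    return rewritten
```

A rolling window like this only covers local coherence. Anything that has to stay consistent across the whole novel (names, titles, recurring phrases) would still need something like a glossary passed in with the guidelines.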
A few things I'm stuck on:
- Is the faithfulness/fluency tradeoff actually manageable at the sentence level, or do I need paragraph-level context or more to preserve narrative coherence?
- How do people handle domain-specific terminology and catchphrase-type things that need to survive the rewrite unchanged? Hard constraints during decoding, or just hope the model learns to leave them alone?
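On the terminology question, a middle ground between hard decoding constraints and hoping for the best is to swap protected terms for placeholders before the rewrite, restore them afterwards, and flag any output where a placeholder went missing. The sketch below assumes a hand-maintained term list; the `[[TERM_n]]` placeholder format and the function names are hypothetical. It is not constrained decoding, and it does nothing to stop the model from dropping a placeholder, it only detects that it happened.

```python
import re
from typing import Dict, List, Tuple


def mask_terms(text: str, terms: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace each protected term with a placeholder; return text and mapping."""
    mapping: Dict[str, str] = {}      # placeholder -> original term
    term_to_ph: Dict[str, str] = {}   # original term -> placeholder
    # Longest first, so "Nine Suns Palm" wins over "Nine Suns" at the same spot.
    ordered = sorted(set(terms), key=len, reverse=True)
    if not ordered:
        return text, mapping
    pattern = re.compile("|".join(re.escape(t) for t in ordered))

    def replace(match: "re.Match[str]") -> str:
        term = match.group(0)
        if term not in term_to_ph:
            placeholder = f"[[TERM_{len(term_to_ph)}]]"
            term_to_ph[term] = placeholder
            mapping[placeholder] = term
        return term_to_ph[term]

    return pattern.sub(replace, text), mapping


def unmask_terms(text: str, mapping: Dict[str, str]) -> str:
    """Put the original terms back in place of their placeholders."""
    for placeholder, term in mapping.items():
        text = text.replace(placeholder, term)
    return text


def missing_placeholders(rewritten: str, mapping: Dict[str, str]) -> List[str]:
    """List placeholders the rewrite dropped, so the passage can be retried."""
    return [ph for ph in mapping if ph not in rewritten]


source = "Elder Li used Nine Suns Palm. Nine Suns burned."
masked, mapping = mask_terms(source, ["Nine Suns", "Nine Suns Palm", "Elder Li"])
restored = unmask_terms(masked, mapping)  # round-trips back to `source`
```

The check is exact string matching, so inflected or re-cased variants of a term would slip through unless they are added to the list explicitly.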
Happy to hear about similar projects, relevant papers I might have missed, or just general lessons from working in this space. Thanks.