News / #multimodal Tag Multimodal 500 articles archived under #multimodal · RSS Sign in to follow Ollama releases dev-tools 28d ago v0.30.5-rc0: llama.cpp version update (#16511) Bump llama.cpp to b9509, which includes the upstream Gemma 4 12B multimodal projector fixes for the n_head=0 divide-by-zero crash seen on x86/CUDA/Linux/Windows. Fixes #16479 Fixes #16489 Fixes #16491 Fixes #16492 Fixes #16495 11 Hugging Face Daily Papers research 28d ago Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning Abstract Stable-Layers uses reinforcement learning with vision-language model feedback to improve layer decomposition without paired data, employing Flow-GRPO and LoRA adaptation for optimized policy training. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We present… 38 Hugging Face Daily Papers research 28d ago SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes Abstract Vision-language models demonstrate strong performance on isolated spatial reasoning tasks but fail to maintain coherent spatial understanding and reliable actions during multi-turn interactive feedback in 3D environments. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 15 Hugging Face Daily Papers research 28d ago Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs Abstract Research reveals significant disparities between text and image generation capabilities in multimodal models, with effective textual knowledge editing not transferring reliably to visual output, necessitating modality-aware editing approaches. Generated by… 9 r/MachineLearning community 28d ago Repo for implementations of various Transformer Attn mechanisms [P] Initially, I developed this so I can easily switch between different Attention mechanisms for my Small Language Model (SLM) experiments and benchmarking. However, I also realized that these implementations can be applicable in Computer Vision, modernize Vision Encoders, RL, and… 14 Hugging Face Daily Papers research 28d ago OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs Abstract OVO-S-Bench presents a comprehensive benchmark for evaluating streaming spatial intelligence in multimodal language models through human-annotated questions spanning multiple abstraction levels. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal agents in robotics,… 23 arXiv — Machine Learning research 28d ago Self-Distilled Policy Gradient arXiv:2606.04036v1 Announce Type: new Abstract: On-policy self-distillation, where a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. Actually, it can be… 33 arXiv — Machine Learning research 28d ago KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models arXiv:2606.04180v1 Announce Type: new Abstract: Vision-language foundation models such as CLIP and SigLIP provide widely used representations for multimodal learning systems. While these models are typically compared through downstream performance, such evaluations often do not… 8 arXiv — Machine Learning research 28d ago A Geometric View of Counterfactual Behavior: Interaction of Boundary Proximity and Local Support arXiv:2606.04209v1 Announce Type: new Abstract: Counterfactual explanations seek small, semantically meaningful changes to an input that alter a model's prediction, and are widely used to interpret and audit machine learning systems. In modern vision, language, and multimodal… 13 arXiv — NLP / Computation & Language research 28d ago MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A arXiv:2606.04231v1 Announce Type: new Abstract: Recent advances in multimodal retrieval-augmented generation (MM-RAG) have shifted toward minimal parsing, relying on page-level images for producing retriever embeddings and for answer generation. While efficient, this trend often… 24 arXiv — NLP / Computation & Language research 28d ago VCIFBench: Evaluating Complex Instruction Following for Video Understanding arXiv:2606.04588v1 Announce Type: new Abstract: Multimodal large language models have made rapid progress in video understanding, yet existing benchmarks largely rely on simple prompts and provide limited evidence about whether models can satisfy explicit output constraints. We… 27 arXiv — NLP / Computation & Language research 28d ago A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs arXiv:2606.04596v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understood. We study positional bias in multi-video summarization, where the… 34 arXiv — NLP / Computation & Language research 28d ago Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM arXiv:2606.04719v1 Announce Type: new Abstract: The Transformer's quadratic complexity with input length imposes an unsustainable computational load on large language models (LLMs). In contrast, the Selective Scan Structured State-Space Model, or Mamba, addresses this… 7 arXiv — NLP / Computation & Language research 28d ago Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation arXiv:2606.04046v1 Announce Type: cross Abstract: In embodied vision-language decision making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action Models (VLMs & VLAs) are powerful tools with different benefits: VLMs are better at… 31 arXiv — NLP / Computation & Language research 28d ago Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1) arXiv:2606.04240v1 Announce Type: cross Abstract: Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The… 7 arXiv — NLP / Computation & Language research 28d ago VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark arXiv:2606.04244v1 Announce Type: cross Abstract: Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, specifically when… 7 arXiv — NLP / Computation & Language research 28d ago Video2LoRA: Parametric Video Internalization for Vision-Language Models arXiv:2606.04351v1 Announce Type: cross Abstract: Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference cost scales with every frame and every repeated query. We introduce Video2LoRA, a method for parametric video… 13 arXiv — NLP / Computation & Language research 28d ago Stateful Visual Encoders for Vision-Language Models arXiv:2606.04433v1 Announce Type: cross Abstract: Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language… 32 Hugging Face Daily Papers research 28d ago MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation Abstract MapAgent is an industrial-grade agentic architecture that combines vision-language processing with constraint-aware reasoning to produce specification-compliant lane maps, achieving high automation rates in large-scale urban mapping. Generated by… 21 Hugging Face Daily Papers research 28d ago Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching Abstract Wide-baseline matching presents a challenging spatial reasoning testbed for multimodal large language models, requiring systematic evaluation and training frameworks that current models lack, prompting the introduction of ReasonMatch-Bench and Dynamic Correspondence… 28 Hugging Face Daily Papers research 28d ago WALL-WM: Carving World Action Modeling at the Event Joints Abstract WALL-WM advances video-action learning by using semantic events as learning units instead of fixed action chunks, enabling more flexible and scalable vision-language-action training and inference. Generated by Qwen/Qwen2.5-Coder-32B-Instruct WALL-WM is a World Action… 32 r/LocalLLaMA community 29d ago How to use audio and vision modalities in llama.cpp? How to use audio and vision modalities in llama.cpp with Gemma4 12B it? I’m on release b9494, but when I run llama-cli it shows “modalities: text” only, and crashes if I try to add an image.   submitted by   /u/No-Leave-4512 [link]   [comments] 20 llama.cpp releases dev-tools 29d ago b9494 mtmd: enable non-causal vision for gemma 4 unified ( #24082 ) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU) Ubuntu x64 (Vulkan) Ubuntu… 22 r/LocalLLaMA community 29d ago Introducing Gemma 4 12B: a unified, encoder-free multimodal model   submitted by   /u/johnnyApplePRNG [link]   [comments] 4 Hugging Face Daily Papers research 29d ago Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models Abstract YOLO26 addresses real-time vision challenges through a unified model family with NMS-free inference, improved training strategies, and multi-task capabilities spanning detection, segmentation, and pose estimation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Real-time… 28 Hacker News — AI on Front Page community 29d ago Gemma 4 12B: A unified, encoder-free multimodal model Article URL: https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/ Comments URL: https://news.ycombinator.com/item?id=48385906 Points: 263 # Comments: 95 28 Maarten Grootendorst research 29d ago A Visual Guide to Gemma 4 12B An in-depth explainer to Gemma 4 12B; a unified, encoder-free multimodal model! 38 r/LocalLLaMA community 29d ago google/gemma-4-12B · Hugging Face Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned… 29 Hugging Face Daily Papers research 29d ago Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling Abstract Researchers identify a perceptual judgment bias in multimodal large language models where visual evidence is overlooked for textual plausibility, and propose a training framework using a perturbed dataset and reward modeling to improve perceptual fidelity and evaluation… 18 Stratechery (Ben Thompson) community 29d ago The Nvidia AI PC, Project Solara, Microsoft AI The Nvidia AI PC feels like a relic of another AI era; Microsoft's vision for devices at Build was much more compelling. 5 r/LocalLLaMA community 29d ago Holo3.1 35B/9B/4B/0.8B (Qwen 3.5 finetunes) from Hcompany (which seems to be a French company): Holo3.1: Fast & Local Computer Use Agents Model Description Holo3.1 is our latest family of Vision-Language Models (VLMs) for computer use agents. Building on Holo3, it expands support beyond browser and desktop automation to… 25 Hugging Face Daily Papers research 29d ago World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning Abstract Controlled concrete reasoning combines visual simulation with abstract reasoning through a training method that uses privileged future information to improve prediction accuracy and robustness. Generated by Qwen/Qwen2.5-Coder-32B-Instruct World models and multimodal… 19 arXiv — Machine Learning research 29d ago Cross-Modal Contrastive Learning of ECG and Angiography Representations for Severe Stenosis Classification arXiv:2606.02605v1 Announce Type: new Abstract: Coronary artery stenosis is a common cardiovascular disease, with severe, untreated cases posing significant risks of heart attack. Although coronary (X-ray) angiograms remain the standard for stenosis diagnosis, they are invasive,… 24 arXiv — Machine Learning research 29d ago CL-DMDF:Dynamic Multimodal Data Fusion Model Based on Contrastive Learning arXiv:2606.02659v1 Announce Type: new Abstract: Multimodal data fusion involves integrating and analyzing information from multiple modalities to uncover latent correlations and complementary patterns, thereby enhancing data processing and decision-making. While existing methods… 5 arXiv — Machine Learning research 29d ago Before Fusion, Ask What to Keep: Contextual Calibration of Multimodal Signals arXiv:2606.02679v1 Announce Type: new Abstract: Multimodal systems often benefit from combining information across language, sound, and visual streams, but this benefit is not guaranteed. A modality that is useful for one input may become distracting for another, and local… 24 arXiv — NLP / Computation & Language research 29d ago Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation arXiv:2606.02684v1 Announce Type: cross Abstract: On-Policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which… 33 arXiv — Machine Learning research 29d ago QUIVER: Quantum-Informed Views for Enhanced Representations in Large ML Models arXiv:2606.02785v1 Announce Type: new Abstract: Large machine learning models benefit substantially from multimodal inputs that provide a complementary view of the same example. We introduce QUIVER (QUantum-Informed Views for Enhanced Representations, a paradigm that enriches… 21 arXiv — Machine Learning research 29d ago Spectral-Progressive Thought Flow for Lightweight Multimodal Reasoning arXiv:2606.02842v1 Announce Type: new Abstract: Multimodal spatial reasoning often relies on long chains of intermediate textual and visual thoughts, where accumulating visual tokens and dense cross-modal attention incur substantial computation and memory overhead. To address… 6 arXiv — Machine Learning research 29d ago BYORn: Bootstrap Your Own Responses to Defend Large Vision-Language Models Against Backdoor Attacks arXiv:2606.02947v1 Announce Type: new Abstract: Supervised fine-tuning is the predominant approach for adapting autoregressive vision-language models to downstream tasks. Recent work has shown that this paradigm is highly vulnerable to backdoor attacks, and that existing… 17 arXiv — Machine Learning research 29d ago Constitutional On-Policy Safe Distillation arXiv:2606.03089v1 Announce Type: new Abstract: On-policy self-distillation (OPSD) has emerged as an efficient post-training paradigm by using a teacher conditioned on privileged information to provide dense token-level supervision. Prior work has shown that OPSD can collapse in… 15 arXiv — Machine Learning research 29d ago Learning to See via Epiretinal Implant Stimulation in silico with Model-Based Deep Reinforcement Learning arXiv:2606.03118v1 Announce Type: new Abstract: Objective: Diseases such as age-related macular degeneration and retinitis pigmentosa cause the degradation of the photoreceptor layer. One approach to restore vision is to electrically stimulate the surviving retinal ganglion… 38 arXiv — NLP / Computation & Language research 29d ago EURO-5K: When Does Domain Pretraining Matter? Benchmarking Transformers for EU Reporting Obligation Extraction arXiv:2606.02971v1 Announce Type: new Abstract: Extracting reporting obligations from EU legislation is critical for assessing and reducing regulatory reporting burden. However, distinguishing reporting requirements from structurally similar provisions requires specialised legal… 10 arXiv — NLP / Computation & Language research 29d ago Coherence Maximization Improves Pluralistic Alignment arXiv:2606.03110v1 Announce Type: new Abstract: Aligning AI systems with diverse human values requires value specifications grounded in concrete examples, but generating such examples without extensive human supervision remains an open challenge. We investigate what makes these… 16 arXiv — NLP / Computation & Language research 29d ago See, Infer, Intervene: Proactive World Modeling for Goal-Oriented Social Intelligence arXiv:2606.03371v1 Announce Type: new Abstract: Multimodal retail agents should not only recognize what a customer is doing, but also decide whether and how to assist before an explicit request is made. We study this setting through the See--Infer--Intervene (SII) framework,… 17 arXiv — NLP / Computation & Language research 29d ago Beyond the Literal: Decomposing Pragmatic Intent in Multimodal Meme Understanding arXiv:2606.03604v1 Announce Type: new Abstract: When asked what a meme or sarcastic post means, Large Vision Language Models (LVLMs) tend to describe what the image shows rather than what the author is trying to communicate. Standard instruction tuning entangles a post's literal… 37 arXiv — NLP / Computation & Language research 29d ago Does Language Shift Break Medical Vision-Language Models? Indonesian Radiology Visual Question Answering Case Study arXiv:2606.03693v1 Announce Type: new Abstract: Medical Vision-Language Models (VLMs) are typically evaluated on English radiology visual question answering benchmarks, leaving their robustness under non-English clinical language largely unexplored. We introduce IndoRad-VQA, an… 10 arXiv — NLP / Computation & Language research 29d ago Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models arXiv:2606.03793v1 Announce Type: new Abstract: Multimodal Large Language Models integrate visual perception into language reasoning, introducing a continuous attack surface susceptible to adversarial attacks. Prior work on MLLM robustness has focused largely on English-centric… 19 arXiv — NLP / Computation & Language research 29d ago VESTA: Visual Exploration with Statistical Tool Agents arXiv:2606.00384v1 Announce Type: cross Abstract: Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-based systems leverage language and vision-language models (VLMs) to iteratively propose and… 10 arXiv — NLP / Computation & Language research 29d ago Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection arXiv:2606.02812v1 Announce Type: cross Abstract: Modeling patient trajectories from longitudinal electronic health records (EHRs) requires reasoning over sparse, noisy, and long-context multimodal sequences. Existing LLM-based multi-agent systems address context length but… 38 MIT News — AI research 29d ago MIT researchers teach AI models to interpret charts The new ChartNet training dataset could improve the accuracy of vision-language models that help analyze business trends or interpret scientific figures. 29 Page 9 of 10 · 500 articles ← Newer Older →