Multimodal: 500 articles archived under #multimodal
arXiv — Machine Learning research 2h ago SAOT: Self-Supervised Continual Graph Learning with Structure-Aware Optimal Transport arXiv:2607.00377v1 Announce Type: new Abstract: Self-supervised Continual Graph Learning (CGL) aims to successively learn from a graph sequence with different tasks without label supervision - a paradigm that has attracted widespread attention. Most existing self-supervised CGL… 31 arXiv — Machine Learning research 2h ago AdaBoosting Text Prompts for Vision-Language Models arXiv:2607.00684v1 Announce Type: new Abstract: The classification accuracy of pretrained Vision-Language Models (VLMs) relies on the quality of the text prompts. Handcrafted templates and Large Language Model (LLM)-generated descriptions not only make predictions more… 25 arXiv — Machine Learning research 2h ago Language-Critique Imitation Learning from Suboptimal Demonstrations arXiv:2607.01225v1 Announce Type: new Abstract: Prior work on imitation learning from suboptimal demonstrations typically relies on compressed supervision signals such as confidence estimates, discriminator scores, or importance weights. These scalar signals are inherently… 32 arXiv — Machine Learning research 2h ago Steal the Patch Size: Adversarially Manipulate Vision-Language Models arXiv:2607.00174v1 Announce Type: cross Abstract: We present a black-box model-stealing attack that recovers private vision-tokenizer configurations of deployed vision-language models (VLMs), including the visual patch size and input preprocessing pipeline. 
The key idea is a… 34 arXiv — Machine Learning research 2h ago Leveraging Multimodality for Real-Time Classification of Transients and Variables found by the Zwicky Transient Facility arXiv:2607.00228v1 Announce Type: cross Abstract: Modern time-domain surveys such as the Zwicky Transient Facility (ZTF) generate hundreds of thousands of alerts each night, making real-time decisions for follow-up observations a central challenge in time-domain astronomy.… 28 arXiv — NLP / Computation & Language research 2h ago Selective Test-Time Debiasing for CLIP via Reward Gating arXiv:2607.00423v1 Announce Type: new Abstract: Vision language models (VLMs) demonstrate strong zero-shot performance, but often perpetuate social stereotypes in person-centric queries, yielding skewed demographic distributions. Current debiasing methods apply uniform bias… 22 arXiv — NLP / Computation & Language research 2h ago Towards Developing a Multimodal Chat Assistant for University Stakeholders: RAG-based Approach arXiv:2607.01115v1 Announce Type: new Abstract: University stakeholders often face difficulties in accessing timely and reliable information, especially in developing countries, where there are very few intelligent support systems. Existing rule-based chatbots are unable to… 15 arXiv — NLP / Computation & Language research 2h ago Surfacing Variations to Calibrate Perceived Reliability of MLLM-generated Image Descriptions arXiv:2507.15692v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) provide new opportunities for blind and low vision (BLV) people to access visual information in their daily lives. 
However, these models often produce errors that are difficult to detect… 22 arXiv — NLP / Computation & Language research 2h ago Rosetta: Composable Native Multimodal Pretraining arXiv:2607.00293v1 Announce Type: cross Abstract: Achieving true artificial general intelligence requires foundation models capable of integrating new modalities without forgetting prior knowledge. However, accommodating continuous generative objectives alongside discrete… 15 arXiv — NLP / Computation & Language research 2h ago StochasT: Learning with Stochastic Turn Depth for Visual Instruction Tuning arXiv:2607.00465v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) rely extensively on Visual Instruction Tuning (VIT) to elicit their multimodal reasoning capabilities. However, we find a discrepancy: VIT often packs multiple language tasks about the same… 8 arXiv — NLP / Computation & Language research 2h ago MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos arXiv:2607.00491v1 Announce Type: cross Abstract: Benchmarks for vision-language models (VLMs) mostly test observational spatial reasoning: models describe relations already visible in the input. Existing what-if tasks typically vary the observer while keeping the scene fixed.… 21 r/LocalLLaMA community 8h ago My reasons to run local models: I can finetune any model on any dataset I want. I can use techniques like speculative decoding and other SOTA approaches to get the max tps. The LLM providers like Anthropic and OpenAI are not getting access to my data. The hardware is reusable for vision, text, and speech, and I can run… 10 r/LocalLLaMA community 11h ago Open benchmark: how well can multimodal LLMs read a calendar week-view from a screenshot? Humans ~99%, Q4 local models..... Some backstory I've been working on my local agent (openclaw), and I wanted to give it the skill to reconstruct calendar entries from a photo of the screen. 
I couldn't get at the calendar through an API (long story), so a photo was the only low-friction way to export the data.… 16 Hugging Face Daily Papers research 13h ago Hierarchical Experimentalist Agents Abstract HExA enables large language models to improve through active experimentation and skill learning in novel domains without requiring training or external supervision. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language models (LLMs) are increasingly used to take… 24 Hugging Face Daily Papers research 14h ago Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models Abstract Act2Answer protocol evaluates embodied vision-language-action models by having agents answer questions through physical actions, revealing knowledge retention and generalization patterns across different semantic categories. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 35 Hugging Face Daily Papers research 17h ago QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents Abstract A testbed called QVal is introduced for evaluating dense supervision signals in long-horizon LLM agent tasks by measuring how well method scores align with Q-values, enabling fair comparison of different supervision approaches without training. Generated by… 22 Hugging Face Daily Papers research 20h ago MuSViT: A Foundation Vision Model for Sheet Music Representation Abstract MuSViT is a vision transformer-based foundation model pre-trained on millions of sheet music pages that demonstrates superior performance in music score recognition and symbol detection tasks through both linear probing and fine-tuning approaches. Generated by… 10 Hugging Face Daily Papers research 21h ago Xiaomi-GUI-0 Technical Report Abstract A native multimodal GUI agent trained in real-device environments demonstrates superior performance and stability compared to traditional benchmark-based approaches. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Graphical user interface (GUI) agents build on… 7 arXiv — Machine Learning research 1d ago PPT-Eval: A Benchmark for Computer-Use Agents on PowerPoint Tasks arXiv:2606.31154v1 Announce Type: new Abstract: Creating and editing slides is a rich, multimodal activity that is ubiquitous in professional and educational settings, making it an ideal testbed for real-world computer-use agents. Microsoft PowerPoint is among the most widely… 25 arXiv — Machine Learning research 1d ago Resolving superposition in AI for interpretability and cross-modal alignment in patient-neuronal images arXiv:2606.31394v1 Announce Type: new Abstract: Artificial intelligence is transforming our capability to solve biological challenges. In dimensionality bottleneck regimes exacerbated by high-dimensional biological data, Neural networks force distinct concepts into the lower… 14 arXiv — Machine Learning research 1d ago CoMet: Context and Multiplicity Decomposition for Multimodal Uncertainty Estimation arXiv:2606.32012v1 Announce Type: new Abstract: Uncertainty estimation has been a long-standing challenge in AI models; it amounts to "knowing what you don't know," and metacognition is notoriously difficult even for humans (cf. the Dunning-Kruger effect). Although it is still… 8 arXiv — Machine Learning research 1d ago FedLAB: Traceable Semantic Codebooks for Federated Multimodal Graph Foundation Learning arXiv:2606.32016v1 Announce Type: new Abstract: Multimodal graph foundation models aim to learn reusable knowledge from graphs enriched with text, images, attributes, and relational topology, thereby supporting diverse graph-centric and modality-centric tasks. 
In practice,… 14 arXiv — NLP / Computation & Language research 1d ago QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents arXiv:2606.32034v1 Announce Type: cross Abstract: LLM agents increasingly act over long horizons, where a single trajectory can contain hundreds or thousands of actions. In these settings, outcome-only rewards provide too sparse guidance, failing to inform the model about the… 36 arXiv — Machine Learning research 1d ago Listening Between the Lines: Joint Learning of ASR Embeddings and LLM-Augmented Linguistics for Dementia Detection arXiv:2606.30675v1 Announce Type: cross Abstract: Early detection of dementia through speech analysis offers a non-invasive screening alternative, but capturing both acoustic and linguistic biomarkers remains challenging. We propose a multimodal framework leveraging Whisper for… 28 arXiv — NLP / Computation & Language research 1d ago ViTL: Temporal Logic-Guided Zero-Shot Natural Language Navigation via Vision-Language Models arXiv:2606.30696v1 Announce Type: cross Abstract: Enabling robots to follow natural language commands to complete zero-shot long-horizon tasks remains challenging. It requires extracting implicit temporal and logical constraints from natural language commands and executing… 4 arXiv — NLP / Computation & Language research 1d ago Building a Multimodal Dataset of Academic Paper for Keyword Extraction arXiv:2606.31069v1 Announce Type: new Abstract: Up to this point, keyword extraction task typically relies solely on textual data. 
Neglecting visual details and audio features from image and audio modalities leads to deficiencies in information richness and overlooks potential… 14 arXiv — NLP / Computation & Language research 1d ago LOPA: Enhancing Spoken Language Assessment via Latent Ordinal Prototype Alignment arXiv:2606.31310v1 Announce Type: new Abstract: Fueled by increasing model scale and multimodal inputs, Multimodal Large Language Models (MLLMs) have emerged as a promising paradigm for Spoken Language Assessment (SLA). While effective, this paradigm often overlooks the… 9 arXiv — NLP / Computation & Language research 1d ago Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue arXiv:2606.31719v1 Announce Type: new Abstract: In collaborative dialogue, shared perception does not guarantee shared interpretation. Mutual understanding must be established through interaction. We investigate whether vision-language models (VLMs) can distinguish what could be… 22 arXiv — NLP / Computation & Language research 1d ago CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield arXiv:2606.31796v1 Announce Type: new Abstract: We study three complementary techniques for training compute-efficient language models. (1) Selective supervision and per-token efficiency. Selective Ground Truth Token Training (SGT) concentrates supervision on the ~15% of output… 14 arXiv — NLP / Computation & Language research 1d ago DigitalCoach: Communication and Grounding Gaps in Human and Agentic Computer Use Coaching arXiv:2606.31980v1 Announce Type: new Abstract: Agents are increasingly capable of automating software tasks, but can they teach humans how to use software themselves? 
We introduce DigitalCoach, a multimodal dataset of 72 human expert-novice computer use coaching sessions… 36 arXiv — NLP / Computation & Language research 1d ago Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision arXiv:2606.32038v1 Announce Type: new Abstract: When does training language models (LMs) to generate explanations of their predictions yield faithful introspection, rather than superficial imitation? We study LMs trained to explain which features of their inputs influenced their… 30 arXiv — NLP / Computation & Language research 1d ago ASR-Agnostic Multimodal Spectrotemporal Modeling for Early Dementia Detection arXiv:2606.30646v1 Announce Type: cross Abstract: Speech recruits the same executive, attentional, and working memory processes underlying instrumental activities of daily living, or IADLs, providing a non-invasive proxy for cognitive assessment. Yet most speech-based dementia… 18 arXiv — NLP / Computation & Language research 1d ago ADAPT: Attention Dynamics Alignment with Preference Tuning for Faithful MLLMs arXiv:2606.31054v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) are critically hampered by hallucination, generating content inconsistent with the provided image. In this paper, we identify an internal signature of hallucination: progressive… 37 arXiv — NLP / Computation & Language research 1d ago Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents arXiv:2606.31270v1 Announce Type: cross Abstract: Computer-use agents, which leverage multimodal large language models (MLLMs) to operate computers and complete tasks, have attracted significant attention for their utility and versatility. A major challenge in developing these… 20 arXiv — NLP / Computation & Language research 1d ago Visual Semantic Entropy: Do Vision Language Models Recognize Visual Ambiguity? 
arXiv:2606.31407v1 Announce Type: cross Abstract: Vision-language models can produce confident answers on visually ambiguous inputs, resulting in biased predictions. Common entropy-based methods, such as Semantic Entropy (SE), rely on output diversity. Yet our analysis shows… 15 arXiv — NLP / Computation & Language research 1d ago RCT: A Robot-Collected Touch-Vision-Language Dataset for Tactile Generalization arXiv:2606.31694v1 Announce Type: cross Abstract: For robots manipulating open-world objects, tactile representations must generalize to unseen materials. We introduce RCT (Robotic Contact Tactile), a robot-collected touch-vision-language dataset with 29,279 tactile frames from… 18 arXiv — NLP / Computation & Language research 1d ago MECoBench: A Systematic Study of Multimodal Agent Collaboration in Embodied Environments arXiv:2606.31966v1 Announce Type: cross Abstract: Recent multimodal large language models (MLLMs) have strong potential as embodied agents, but their ability to collaborate in visually grounded environments remains underexplored. To address this gap, we introduce MECoBench, a… 4 arXiv — NLP / Computation & Language research 1d ago From Multimodal Perception to Strategic Reasoning: A Survey on AI-Generated Game Commentary arXiv:2506.17294v3 Announce Type: replace Abstract: The advent of artificial intelligence has propelled AI-Generated Game Commentary (AI-GGC) into a rapidly expanding research area, offering advantages such as scalable availability and personalized narration. However, existing… 17 Hugging Face Daily Papers research 1d ago BrainJanus: A Unified Model for Understanding and Generation across Brain, Vision, and Language Abstract BrainJanus represents the first unified brain model integrating brain, vision, and language through a shared Omni space, enabling bidirectional mapping between neural activity and sensory stimuli via a tokenized representation and autoregressive architecture. 
Generated… 38 r/MachineLearning community 1d ago Anyone looking into the new MARS2 Workshop/Competition @ ECCV 2026? I saw Tec-do posting it. [D] I recently came across the announcement for the MARS2 Workshop (Multimodal Reasoning Competition) at ECCV 2026. From what I understand, it focuses on multimodal reasoning and test-time reasoning (“slow thinking”), especially applied to video and real-world scenarios like… 30 Hugging Face Daily Papers research 1d ago DOPD: Dual On-policy Distillation Abstract DOPD addresses privilege illusion in on-policy distillation by dynamically routing token-level supervision between teacher and student policies based on advantage gaps and probabilities, improving capability transfer in large and vision-language models. Generated by… 6 r/MachineLearning community 1d ago 80TB+ of astronomy for the HDD-poor: crossmatch the Universe from your laptop [R] Today is the day you (🫵!) get access to 80TB plus of data from over 30 astronomical surveys in one place. 4GB of RAM is enough even at Gaia Scale. Check out our writeup here: https://huggingface.co/blog/hugging-science/multimodal-universe-hats And a tutorial here… 6 Hugging Face Daily Papers research 1d ago Drop-Then-Recovery: How Redundant Are Vision-Language-Action Models? Abstract Research reveals that language backbones in Vision-Language-Action models are highly redundant for robotic manipulation tasks, while vision and action pathways are more critical, suggesting need for deliberate capacity allocation in future architectures. Generated by… 11 Hugging Face Daily Papers research 1d ago Mind the Heads: Topological Representation Alignment for Multimodal LLMs Abstract HeRA aligns individual attention heads in MLLMs to preserve local neighborhood relationships across modalities, improving vision-centric task performance and reducing visual hallucinations. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Representation alignment has… 27 Hugging Face Daily Papers research 1d ago RaysUp: Ultra-light Universal Feature Upsampling via Geometry-Aware Ray Representation Abstract RaysUp is a lightweight, task-agnostic feature upsampling framework that reconstructs high-resolution features using geometry-aware ray domain techniques with improved efficiency and accuracy. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Pre-trained Vision Foundation… 37 Hugging Face Daily Papers research 1d ago Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation Abstract ILLUME-X is a unified multimodal paradigm that enhances text-image generation through improved data efficiency, stable training processes, and comprehensive evaluation metrics. Generated by Qwen/Qwen2.5-Coder-32B-Instruct The advancement of generative AI models capable… 17 Hugging Face Daily Papers research 1d ago ZooClaw-FashionSigLIP2: Distilled Fine-tuning for Robust Fashion Retrieval Abstract A fashion-specialized vision-language model achieves superior retrieval performance through full fine-tuning with knowledge distillation and weight interpolation, outperforming existing methods on a new benchmark while addressing structural biases in existing datasets.… 32 Hugging Face Daily Papers research 2d ago Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning Abstract A new benchmark evaluates multimodal large language models' ability to reason over dynamic visual evidence through controlled temporal-logical operations rather than simple object recognition. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Recent interest in multimodal… 25 Hugging Face Daily Papers research 2d ago Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature Abstract A novel pipeline called MatMMExtract is introduced that processes compound scientific figures into individual panels and generates structured annotations using large language models, creating a comprehensive dataset for vision-language learning in materials science.… 16 arXiv — Machine Learning research 2d ago Can AI Draw Science? A Benchmark for Evaluating Scientific Figure Generation by Text-to-Image and Multimodal Models arXiv:2606.28406v1 Announce Type: new Abstract: Text-to-image and multimodal generative models are increasingly used to produce scientific figures such as mechanism diagrams, experimental-design schematics, conceptual frameworks, and graphical abstracts. Yet existing… 36