Tag

Multimodal

500 articles archived under #multimodal · RSS

arXiv — Machine Learning research 2h ago

SAOT: Self-Supervised Continual Graph Learning with Structure-Aware Optimal Transport

arXiv:2607.00377v1 Announce Type: new Abstract: Self-supervised Continual Graph Learning (CGL) aims to successively learn from a graph sequence with different tasks without label supervision - a paradigm that has attracted widespread attention. Most existing self-supervised CGL…

31
arXiv — Machine Learning research 2h ago

AdaBoosting Text Prompts for Vision-Language Models

arXiv:2607.00684v1 Announce Type: new Abstract: The classification accuracy of pretrained Vision-Language Models (VLMs) relies on the quality of the text prompts. Handcrafted templates and Large Language Model (LLM)-generated descriptions not only make predictions more…

25
arXiv — Machine Learning research 2h ago

Language-Critique Imitation Learning from Suboptimal Demonstrations

arXiv:2607.01225v1 Announce Type: new Abstract: Prior work on imitation learning from suboptimal demonstrations typically relies on compressed supervision signals such as confidence estimates, discriminator scores, or importance weights. These scalar signals are inherently…

32
arXiv — Machine Learning research 2h ago

Steal the Patch Size: Adversarially Manipulate Vision-Language Models

arXiv:2607.00174v1 Announce Type: cross Abstract: We present a black-box model-stealing attack that recovers private vision-tokenizer configurations of deployed vision-language models (VLMs), including the visual patch size and input preprocessing pipeline. The key idea is a…

34
arXiv — Machine Learning research 2h ago

Leveraging Multimodality for Real-Time Classification of Transients and Variables found by the Zwicky Transient Facility

arXiv:2607.00228v1 Announce Type: cross Abstract: Modern time-domain surveys such as the Zwicky Transient Facility (ZTF) generate hundreds of thousands of alerts each night, making real-time decisions for follow-up observations a central challenge in time-domain astronomy.…

28
arXiv — NLP / Computation & Language research 2h ago

Selective Test-Time Debiasing for CLIP via Reward Gating

arXiv:2607.00423v1 Announce Type: new Abstract: Vision language models (VLMs) demonstrate strong zero-shot performance, but often perpetuate social stereotypes in person-centric queries, yielding skewed demographic distributions. Current debiasing methods apply uniform bias…

22
arXiv — NLP / Computation & Language research 2h ago

Towards Developing a Multimodal Chat Assistant for University Stakeholders: RAG-based Approach

arXiv:2607.01115v1 Announce Type: new Abstract: University stakeholders often face difficulties in accessing timely and reliable information, especially in developing countries, where there are very few intelligent support systems. Existing rule-based chatbots are unable to…

15
arXiv — NLP / Computation & Language research 2h ago

Surfacing Variations to Calibrate Perceived Reliability of MLLM-generated Image Descriptions

arXiv:2507.15692v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) provide new opportunities for blind and low vision (BLV) people to access visual information in their daily lives. However, these models often produce errors that are difficult to detect…

22
arXiv — NLP / Computation & Language research 2h ago

Rosetta: Composable Native Multimodal Pretraining

arXiv:2607.00293v1 Announce Type: cross Abstract: Achieving true artificial general intelligence requires foundation models capable of integrating new modalities without forgetting prior knowledge. However, accommodating continuous generative objectives alongside discrete…

15
arXiv — NLP / Computation & Language research 2h ago

StochasT: Learning with Stochastic Turn Depth for Visual Instruction Tuning

arXiv:2607.00465v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) rely extensively on Visual Instruction Tuning (VIT) to elicit their multimodal reasoning capabilities. However, we find a discrepancy: VIT often packs multiple language tasks about the same…

8
arXiv — NLP / Computation & Language research 2h ago

MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos

arXiv:2607.00491v1 Announce Type: cross Abstract: Benchmarks for vision-language models (VLMs) mostly test observational spatial reasoning: models describe relations already visible in the input. Existing what-if tasks typically vary the observer while keeping the scene fixed.…

21
r/LocalLLaMA community 8h ago

My reasons to run local models

I can finetune any model on any dataset I want. I can use techniques like speculative decoding and other SOTA approaches to get the max tps. The LLM providers like Anthropic and OpenAI are not getting access to my data. The hardware is reusable for vision, text, and speech, and I can run…

10
r/LocalLLaMA community 11h ago

Open benchmark: how well can multimodal LLMs read a calendar week-view from a screenshot? Humans ~99%, Q4 local models.....

Some backstory I've been working on my local agent (openclaw), and I wanted to give it the skill to reconstruct calendar entries from a photo of the screen. I couldn't get at the calendar through an API (long story), so a photo was the only low-friction way to export the data.…

16
Hugging Face Daily Papers research 13h ago

Hierarchical Experimentalist Agents

Abstract HExA enables large language models to improve through active experimentation and skill learning in novel domains without requiring training or external supervision. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language models (LLMs) are increasingly used to take…

24
Hugging Face Daily Papers research 14h ago

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

Abstract Act2Answer protocol evaluates embodied vision-language-action models by having agents answer questions through physical actions, revealing knowledge retention and generalization patterns across different semantic categories. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

35
Hugging Face Daily Papers research 17h ago

QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

Abstract A testbed called QVal is introduced for evaluating dense supervision signals in long-horizon LLM agent tasks by measuring how well method scores align with Q-values, enabling fair comparison of different supervision approaches without training. Generated by…

22
Hugging Face Daily Papers research 20h ago

MuSViT: A Foundation Vision Model for Sheet Music Representation

Abstract MuSViT is a vision transformer-based foundation model pre-trained on millions of sheet music pages that demonstrates superior performance in music score recognition and symbol detection tasks through both linear probing and fine-tuning approaches. Generated by…

10
Hugging Face Daily Papers research 21h ago

Xiaomi-GUI-0 Technical Report

Abstract A native multimodal GUI agent trained in real-device environments demonstrates superior performance and stability compared to traditional benchmark-based approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Graphical user interface (GUI) agents build on…

7
arXiv — Machine Learning research 1d ago

PPT-Eval: A Benchmark for Computer-Use Agents on PowerPoint Tasks

arXiv:2606.31154v1 Announce Type: new Abstract: Creating and editing slides is a rich, multimodal activity that is ubiquitous in professional and educational settings, making it an ideal testbed for real-world computer-use agents. Microsoft PowerPoint is among the most widely…

25
arXiv — Machine Learning research 1d ago

Resolving superposition in AI for interpretability and cross-modal alignment in patient-neuronal images

arXiv:2606.31394v1 Announce Type: new Abstract: Artificial intelligence is transforming our capability to solve biological challenges. In dimensionality bottleneck regimes exacerbated by high-dimensional biological data, neural networks force distinct concepts into the lower…

14
arXiv — Machine Learning research 1d ago

CoMet: Context and Multiplicity Decomposition for Multimodal Uncertainty Estimation

arXiv:2606.32012v1 Announce Type: new Abstract: Uncertainty estimation has been a long-standing challenge in AI models; it amounts to "knowing what you don't know," and metacognition is notoriously difficult even for humans (cf. the Dunning-Kruger effect). Although it is still…

8
arXiv — Machine Learning research 1d ago

FedLAB: Traceable Semantic Codebooks for Federated Multimodal Graph Foundation Learning

arXiv:2606.32016v1 Announce Type: new Abstract: Multimodal graph foundation models aim to learn reusable knowledge from graphs enriched with text, images, attributes, and relational topology, thereby supporting diverse graph-centric and modality-centric tasks. In practice,…

14
arXiv — NLP / Computation & Language research 1d ago

QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

arXiv:2606.32034v1 Announce Type: cross Abstract: LLM agents increasingly act over long horizons, where a single trajectory can contain hundreds or thousands of actions. In these settings, outcome-only rewards provide too sparse guidance, failing to inform the model about the…

36
arXiv — Machine Learning research 1d ago

Listening Between the Lines: Joint Learning of ASR Embeddings and LLM-Augmented Linguistics for Dementia Detection

arXiv:2606.30675v1 Announce Type: cross Abstract: Early detection of dementia through speech analysis offers a non-invasive screening alternative, but capturing both acoustic and linguistic biomarkers remains challenging. We propose a multimodal framework leveraging Whisper for…

28
arXiv — NLP / Computation & Language research 1d ago

ViTL: Temporal Logic-Guided Zero-Shot Natural Language Navigation via Vision-Language Models

arXiv:2606.30696v1 Announce Type: cross Abstract: Enabling robots to follow natural language commands to complete zero-shot long-horizon tasks remains challenging. It requires extracting implicit temporal and logical constraints from natural language commands and executing…

4
arXiv — NLP / Computation & Language research 1d ago

Building a Multimodal Dataset of Academic Paper for Keyword Extraction

arXiv:2606.31069v1 Announce Type: new Abstract: Up to this point, the keyword extraction task has typically relied solely on textual data. Neglecting visual details and audio features from image and audio modalities leads to deficiencies in information richness and overlooks potential…

14
arXiv — NLP / Computation & Language research 1d ago

LOPA: Enhancing Spoken Language Assessment via Latent Ordinal Prototype Alignment

arXiv:2606.31310v1 Announce Type: new Abstract: Fueled by increasing model scale and multimodal inputs, Multimodal Large Language Models (MLLMs) have emerged as a promising paradigm for Spoken Language Assessment (SLA). While effective, this paradigm often overlooks the…

9
arXiv — NLP / Computation & Language research 1d ago

Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue

arXiv:2606.31719v1 Announce Type: new Abstract: In collaborative dialogue, shared perception does not guarantee shared interpretation. Mutual understanding must be established through interaction. We investigate whether vision-language models (VLMs) can distinguish what could be…

22
arXiv — NLP / Computation & Language research 1d ago

CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield

arXiv:2606.31796v1 Announce Type: new Abstract: We study three complementary techniques for training compute-efficient language models. (1) Selective supervision and per-token efficiency. Selective Ground Truth Token Training (SGT) concentrates supervision on the ~15% of output…

14
arXiv — NLP / Computation & Language research 1d ago

DigitalCoach: Communication and Grounding Gaps in Human and Agentic Computer Use Coaching

arXiv:2606.31980v1 Announce Type: new Abstract: Agents are increasingly capable of automating software tasks, but can they teach humans how to use software themselves? We introduce DigitalCoach, a multimodal dataset of 72 human expert-novice computer use coaching sessions…

36
arXiv — NLP / Computation & Language research 1d ago

Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision

arXiv:2606.32038v1 Announce Type: new Abstract: When does training language models (LMs) to generate explanations of their predictions yield faithful introspection, rather than superficial imitation? We study LMs trained to explain which features of their inputs influenced their…

30
arXiv — NLP / Computation & Language research 1d ago

ASR-Agnostic Multimodal Spectrotemporal Modeling for Early Dementia Detection

arXiv:2606.30646v1 Announce Type: cross Abstract: Speech recruits the same executive, attentional, and working memory processes underlying instrumental activities of daily living, or IADLs, providing a non-invasive proxy for cognitive assessment. Yet most speech-based dementia…

18
arXiv — NLP / Computation & Language research 1d ago

ADAPT: Attention Dynamics Alignment with Preference Tuning for Faithful MLLMs

arXiv:2606.31054v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) are critically hampered by hallucination, generating content inconsistent with the provided image. In this paper, we identify an internal signature of hallucination: progressive…

37
arXiv — NLP / Computation & Language research 1d ago

Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents

arXiv:2606.31270v1 Announce Type: cross Abstract: Computer-use agents, which leverage multimodal large language models (MLLMs) to operate computers and complete tasks, have attracted significant attention for their utility and versatility. A major challenge in developing these…

20
arXiv — NLP / Computation & Language research 1d ago

Visual Semantic Entropy: Do Vision Language Models Recognize Visual Ambiguity?

arXiv:2606.31407v1 Announce Type: cross Abstract: Vision-language models can produce confident answers on visually ambiguous inputs, resulting in biased predictions. Common entropy-based methods, such as Semantic Entropy (SE), rely on output diversity. Yet our analysis shows…

15
arXiv — NLP / Computation & Language research 1d ago

RCT: A Robot-Collected Touch-Vision-Language Dataset for Tactile Generalization

arXiv:2606.31694v1 Announce Type: cross Abstract: For robots manipulating open-world objects, tactile representations must generalize to unseen materials. We introduce RCT (Robotic Contact Tactile), a robot-collected touch-vision-language dataset with 29,279 tactile frames from…

18
arXiv — NLP / Computation & Language research 1d ago

MECoBench: A Systematic Study of Multimodal Agent Collaboration in Embodied Environments

arXiv:2606.31966v1 Announce Type: cross Abstract: Recent multimodal large language models (MLLMs) have strong potential as embodied agents, but their ability to collaborate in visually grounded environments remains underexplored. To address this gap, we introduce MECoBench, a…

4
arXiv — NLP / Computation & Language research 1d ago

From Multimodal Perception to Strategic Reasoning: A Survey on AI-Generated Game Commentary

arXiv:2506.17294v3 Announce Type: replace Abstract: The advent of artificial intelligence has propelled AI-Generated Game Commentary (AI-GGC) into a rapidly expanding research area, offering advantages such as scalable availability and personalized narration. However, existing…

17
Hugging Face Daily Papers research 1d ago

BrainJanus: A Unified Model for Understanding and Generation across Brain, Vision, and Language

Abstract BrainJanus represents the first unified brain model integrating brain, vision, and language through a shared Omni space, enabling bidirectional mapping between neural activity and sensory stimuli via a tokenized representation and autoregressive architecture. Generated…

38
r/MachineLearning community 1d ago

Anyone looking into the new MARS2 Workshop/Competition @ ECCV 2026? I saw Tec-do posting it. [D]

I recently came across the announcement for the MARS2 Workshop (Multimodal Reasoning Competition) at ECCV 2026. From what I understand, it focuses on multimodal reasoning and test-time reasoning (“slow thinking”), especially applied to video and real-world scenarios like…

30
Hugging Face Daily Papers research 1d ago

DOPD: Dual On-policy Distillation

Abstract DOPD addresses privilege illusion in on-policy distillation by dynamically routing token-level supervision between teacher and student policies based on advantage gaps and probabilities, improving capability transfer in large and vision-language models. Generated by…

6
r/MachineLearning community 1d ago

80TB+ of astronomy for the HDD-poor: crossmatch the Universe from your laptop [R]

Today is the day you (🫵!) get access to 80TB plus of data from over 30 astronomical surveys in one place. 4GB of RAM is enough even at Gaia Scale. Check out our writeup here: https://huggingface.co/blog/hugging-science/multimodal-universe-hats And a tutorial here…

6
Hugging Face Daily Papers research 1d ago

Drop-Then-Recovery: How Redundant Are Vision-Language-Action Models?

Abstract Research reveals that language backbones in Vision-Language-Action models are highly redundant for robotic manipulation tasks, while vision and action pathways are more critical, suggesting need for deliberate capacity allocation in future architectures. Generated by…

11
Hugging Face Daily Papers research 1d ago

Mind the Heads: Topological Representation Alignment for Multimodal LLMs

Abstract HeRA aligns individual attention heads in MLLMs to preserve local neighborhood relationships across modalities, improving vision-centric task performance and reducing visual hallucinations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Representation alignment has…

27
Hugging Face Daily Papers research 1d ago

RaysUp: Ultra-light Universal Feature Upsampling via Geometry-Aware Ray Representation

Abstract RaysUp is a lightweight, task-agnostic feature upsampling framework that reconstructs high-resolution features using geometry-aware ray domain techniques with improved efficiency and accuracy. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Pre-trained Vision Foundation…

37
Hugging Face Daily Papers research 1d ago

Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation

Abstract ILLUME-X is a unified multimodal paradigm that enhances text-image generation through improved data efficiency, stable training processes, and comprehensive evaluation metrics. Generated by Qwen/Qwen2.5-Coder-32B-Instruct The advancement of generative AI models capable…

17
Hugging Face Daily Papers research 1d ago

ZooClaw-FashionSigLIP2: Distilled Fine-tuning for Robust Fashion Retrieval

Abstract A fashion-specialized vision-language model achieves superior retrieval performance through full fine-tuning with knowledge distillation and weight interpolation, outperforming existing methods on a new benchmark while addressing structural biases in existing datasets.…

32
Hugging Face Daily Papers research 2d ago

Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning

Abstract A new benchmark evaluates multimodal large language models' ability to reason over dynamic visual evidence through controlled temporal-logical operations rather than simple object recognition. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Recent interest in multimodal…

25
Hugging Face Daily Papers research 2d ago

Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature

Abstract A novel pipeline called MatMMExtract is introduced that processes compound scientific figures into individual panels and generates structured annotations using large language models, creating a comprehensive dataset for vision-language learning in materials science.…

16
arXiv — Machine Learning research 2d ago

Can AI Draw Science? A Benchmark for Evaluating Scientific Figure Generation by Text-to-Image and Multimodal Models

arXiv:2606.28406v1 Announce Type: new Abstract: Text-to-image and multimodal generative models are increasingly used to produce scientific figures such as mechanism diagrams, experimental-design schematics, conceptual frameworks, and graphical abstracts. Yet existing…

36
