Tag

Multimodal

500 articles archived under #multimodal

arXiv — Machine Learning research 2d ago

Can AI Draw Science? A Benchmark for Evaluating Scientific Figure Generation by Text-to-Image and Multimodal Models

arXiv:2606.28406v1 Announce Type: new Abstract: Text-to-image and multimodal generative models are increasingly used to produce scientific figures such as mechanism diagrams, experimental-design schematics, conceptual frameworks, and graphical abstracts. Yet existing…

36
arXiv — Machine Learning research 2d ago

Counterfactual Residual Data Augmentation for Regression

arXiv:2606.28460v1 Announce Type: new Abstract: Data-driven modeling in real-world regression tasks often suffers from limited training samples, high collection costs, and noisy observations. Inspired by the impact of data augmentation in vision and language, we propose a novel…

21
arXiv — Machine Learning research 2d ago

NIVA: A Multimodal Foundation Model for Actionable Earth System Intelligence

arXiv:2606.28546v1 Announce Type: new Abstract: Recent advances in AI-driven weather and climate modeling have improved forecast skill while reducing computational cost. However, existing data-driven approaches are limited in their ability to model coupled Earth system dynamics,…

9
arXiv — Machine Learning research 2d ago

ML-Powered LDAP Reconnaissance Detection using Weak Supervision

arXiv:2606.28917v1 Announce Type: new Abstract: Lightweight Directory Access Protocol (LDAP) is a protocol that allows users to query and modify Active Directory (AD) data. By default, all users have read access to all AD data through LDAP, making it a common initial tool for…

14
arXiv — Machine Learning research 2d ago

DLR: Zero-Inference-Cost Latent Residuals for Low-Rank Pre-Training

arXiv:2606.28932v1 Announce Type: new Abstract: Large language models have driven recent progress in language and multimodal AI, yet pre-training them at scale is prohibitively expensive. Low-rank pre-training, which factorizes each weight matrix into a rank-r product to reduce…

35
arXiv — Machine Learning research 2d ago

AMR: Adaptive Modality Routing for Multimodal Polyglot Speaker Identification

arXiv:2606.29335v1 Announce Type: new Abstract: Multimodal speaker identification systems face two key challenges in real-world deployment: missing modalities and language mismatch between training and testing conditions. In practical scenarios, background multi-speaker…

14
arXiv — Machine Learning research 2d ago

Do Models Read What They Write? Causal Registers in Scratchpad Reasoning

arXiv:2606.29522v1 Announce Type: new Abstract: A central hope behind process supervision is that models can expose intermediate variables that matter for their later behavior. For this to help with alignment, a scratchpad must be tied to the computation: when the model writes a…

29
arXiv — NLP / Computation & Language research 2d ago

SEAD: Competence-Aware On-Policy Distillation via Entropy-Guided Supervision

arXiv:2606.28562v1 Announce Type: new Abstract: On-policy distillation (OPD) has a property absent in offline distillation and RL: teacher supervision quality depends on student competence. Incoherent rollouts yield noisy gradients; already-mastered tokens yield redundant ones.…

10
arXiv — NLP / Computation & Language research 2d ago

EVLA: An Electro-Aware Multimodal Assistant for Physically-Grounded Driving Reasoning and Control

arXiv:2606.28938v1 Announce Type: new Abstract: Modern vision-language models (VLMs) for driving assistants typically treat vehicle dynamics as a black box, resulting in decisions that lack awareness of the vehicle's real-time electro-mechanical state. To bridge this gap, we…

26
arXiv — NLP / Computation & Language research 2d ago

Can OCR-VLMs Read Devanagari? A Stress-Test Benchmark and Post-Correction Study

arXiv:2606.29213v1 Announce Type: new Abstract: OCR systems, ranging from classical engines to specialised OCR vision-language models (OCR-VLMs) and frontier multimodal LLMs, report strong results on English and Chinese document benchmarks, yet their behaviour on Indic scripts…

30
arXiv — NLP / Computation & Language research 2d ago

Hybrid Retriever Evolution for Multimodal Document Reasoning Agents

arXiv:2606.29648v1 Announce Type: new Abstract: Different retrievers, including lexical, semantic, and multimodal approaches, provide highly complementary strengths for multimodal document understanding, yet most systems combine them through fixed pipelines that cannot adapt to…

33
arXiv — NLP / Computation & Language research 2d ago

Resolution Thresholds in VLM Detection of Harmful ASCII Art Across Construction Modes and Languages

arXiv:2606.29649v1 Announce Type: new Abstract: Large Vision-Language Models (VLMs) are increasingly deployed as content moderation tools, yet they remain vulnerable to jailbreak attacks in which harmful text is visually encoded as ASCII art. This can allow inappropriate or…

31
arXiv — NLP / Computation & Language research 2d ago

Can MLLMs Critique Like Humans? Evaluating Open-Ended Aesthetic Reasoning in Multimodal Large Language Models

arXiv:2606.29689v1 Announce Type: new Abstract: Open-ended aesthetic critique is a challenge for multimodal large language models (MLLMs): unlike multiple-choice aesthetic benchmarks, it has no single correct answer, and most aesthetic evaluation has measured models against…

8
arXiv — NLP / Computation & Language research 2d ago

DAIN: Dynamic Agent-Based Interaction Network for Efficient and Collaborative Multimodal Reasoning

arXiv:2606.30189v1 Announce Type: new Abstract: Current multimodal fusion approaches, particularly those based on static Mixture-of-Experts (MoE) architectures, often struggle to provide the adaptive and efficient collaborative reasoning required by complex real-world…

14
arXiv — NLP / Computation & Language research 2d ago

Forewarned is Forearmed: When Non-Sequential Embedding Turns Into an Anomaly Detector

arXiv:2606.30196v1 Announce Type: new Abstract: This paper offers an in-depth analysis of non-sequential multimodal sentence-level embeddings, with a particular focus on the SONAR model. We demonstrate that certain embedding dimensions are sensitive to perturbations and can…

25
Hugging Face Daily Papers research 2d ago

Orca: The World is in Your Mind

Abstract: Orca establishes a unified world latent space through next-state-prediction modeling using multimodal data and demonstrates superior performance in downstream tasks compared to specialized baselines. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) We introduce Orca, an…

38
Hugging Face Daily Papers research 2d ago

TACO: Tool-Augmented Credit Optimization for Agentic Tool Use

Abstract: Tool-Augmented Credit Optimization (TACO) improves multimodal agent performance by distinguishing useful, redundant, or misleading code operations through dual advantage channels: Differential Answer-Probe Reward for individual tool contribution and Outcome-Gated…

38
Hugging Face Daily Papers research 2d ago

Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

Abstract: A new benchmark evaluates multimodal large language models' ability to understand video content and perform GUI tasks, while a novel keyframe extraction method improves performance on both video question answering and video-guided agentic tasks. Generated by…

28
Hacker News — AI on Front Page community 2d ago

.self: A new top-level domain designed to support self-hosting

Article URL: https://hccf.onmy.cloud/2026/06/21/reclaiming-our-digital-selves-hccfs-vision-for-a-human-centered-top-level-domain/ Comments URL: https://news.ycombinator.com/item?id=48724230 Points: 246 # Comments: 154

24
r/MachineLearning community 2d ago

I built a demo agricultural planning system with an AI advisor for small-scale farmers in Nicaragua using NASA data [p]

(This was deleted before, but I don't know whether it was Reddit's filters or the moderators; if it was the moderators, I will not post it again after you delete it, sorry.) (The name will probably change soon because I didn't realize "AgroVision" is already a registered trademark lol.)…

15
r/MachineLearning community 2d ago

I do historical swordfighting and noticed AI struggles to track it. I’m building an open dataset to help fix this. Does my schema make sense? [P]

Hi everyone, I’m a historical swordfighter (HEMA practitioner), and while I’m not a computer vision engineer or a roboticist, I’ve been reading a lot about the current bottlenecks in embodied AI, specifically around the Sim2Real gap and thin-object tracking. It occurred to me…

18
r/MachineLearning community 3d ago

ECCV 2026 Final Decisions after Provisional Acceptance [D]

Has anyone actually received final acceptance following their provisional acceptance email from ECCV 2026? I am very confused. Thank you so much.

15
arXiv — Machine Learning research 3d ago

HybridCodec: Modeling Discrete and Continuous Representations for Efficient Speech Language Models

arXiv:2606.27627v1 Announce Type: new Abstract: Discrete audio representations have become increasingly popular for building multimodal text-audio systems and integrating audio capabilities into Large Language Models (LLMs). However, numerous studies report performance…

7
arXiv — Machine Learning research 3d ago

Are Time-Series Foundation Models Ready for E-Nose Data? An Empirical Assessment of Their Embeddings

arXiv:2606.27672v1 Announce Type: new Abstract: Inspired by advances in natural language processing and computer vision, "time-series foundation models" (TSFMs) have recently been introduced with the promise of strong generalization across diverse time-series tasks, including…

5
arXiv — Machine Learning research 3d ago

Dual-Learning based Penalized Multi-Align Clustering for Multi-View Incomplete and Disorderly Data

arXiv:2606.27984v1 Announce Type: new Abstract: Multimodal feature fusion can effectively capture complex patterns in real-world data by integrating complementary information from different modalities. However, in many applications, such as boiler combustion monitoring,…

18
arXiv — Machine Learning research 3d ago

Beyond Sparse Supervision: Diffusion-Guided Learning for Few-Shot Graph Fraud Detection

arXiv:2606.28134v1 Announce Type: new Abstract: Graph-based fraud detection is essential for safeguarding large-scale transaction systems, where undetected anomalies may lead to substantial financial losses and security risks. Real-world fraud graphs pose two coupled challenges:…

12
arXiv — Machine Learning research 3d ago

Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge

arXiv:2606.27527v1 Announce Type: cross Abstract: Large Language Models (LLMs) possess broad conceptual knowledge acquired through large-scale text pretraining, yet their potential to supervise models in other modalities remains underexplored. In this work, we propose…

10
arXiv — Machine Learning research 3d ago

Learning from Annotation Uncertainty: Entropy-Aware Curriculum for Speech Emotion Recognition

arXiv:2606.27536v1 Announce Type: cross Abstract: Speech emotion recognition (SER) often relies on hard consensus labels that collapse annotator disagreement. We study distribution-based supervision for 9-class SER on MSP-Podcast 2.0 using a WavLM-Base multitask model for…

23
arXiv — NLP / Computation & Language research 3d ago

Dialogue to Detection: A Multimodal Hybrid NLP Pipeline for Insurance Fraud Detection

arXiv:2606.28002v1 Announce Type: new Abstract: Insurance fraud imposes substantial financial losses and operational inefficiencies, raising premiums and impacting trust among legitimate policyholders. Early detection at FNOL remains a persistent challenge. Existing approaches…

25
arXiv — NLP / Computation & Language research 3d ago

Vision-Default, Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models

arXiv:2606.28273v1 Announce Type: new Abstract: Vision-language models must reconcile visual evidence with memorized world knowledge when the two conflict. How they resolve this conflict shapes the reliability of multimodal systems, yet prior work characterizes it behaviorally…

31
arXiv — NLP / Computation & Language research 3d ago

DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection

arXiv:2606.27499v1 Announce Type: cross Abstract: Research on agent memory has matured rapidly, but almost entirely on the text side: few existing benchmarks ask, in an interactive environment, when an agent genuinely needs to remember what it saw rather than what it could write…

11
arXiv — NLP / Computation & Language research 3d ago

Aloe-Vision: Robust Vision-Language Models for Healthcare

arXiv:2606.27500v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) specialized in healthcare are emerging as a promising research direction due to their potential impact in clinical and biomedical applications. However, progress is constrained by the scarcity…

28
arXiv — NLP / Computation & Language research 3d ago

Joint Transcription and Decryption of Images of Encrypted Handwritten Documents: A Comparison with the Traditional Pipeline

arXiv:2606.27700v1 Announce Type: cross Abstract: Historical encrypted manuscripts present a challenging problem at the intersection of cryptology, linguistics, paleography, and computer vision. Current automatic decipherment approaches usually rely on a two-stage pipeline:…

7
arXiv — NLP / Computation & Language research 3d ago

EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning

arXiv:2603.09731v3 Announce Type: replace-cross Abstract: Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from…

34
arXiv — NLP / Computation & Language research 3d ago

Multimodal Evaluator Preference Collapse: Cross-Modal Coupling in Self-Evolving Agents

arXiv:2606.16682v3 Announce Type: replace-cross Abstract: When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using…

4
arXiv — NLP / Computation & Language research 3d ago

SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning

arXiv:2606.22873v3 Announce Type: replace-cross Abstract: Vision-language models (VLMs) are increasingly deployed in consumer, medical, financial, and enterprise applications. This broad deployment expands the safety surface: risks can arise from multimodal question answering,…

31
TechCrunch — AI news-outlet 4d ago

SoftBank’s CEO isn’t the only one with questions about Elon Musk’s orbital data center hype

Not everyone is buying Elon Musk’s vision for orbital data centers.

19
TechCrunch — AI news-outlet 4d ago

Apple Vision Pro exec is reportedly leaving for OpenAI

Paul Meade, the Apple vice president in charge of the Vision Pro headset, is reportedly leaving the company to join OpenAI’s hardware team.

22
r/LocalLLaMA community 4d ago

Agentic Cyberdeck Dev

I developed this around August '25, but never had real polished panels. So, here we are with some decent panels, and new speakers for voice AI inferencing. This has local agentic GPS, chat, voice, vision analysis. This is a fun little project that I come back around to until I…

12
r/LocalLLaMA community 4d ago

New deepseek vision model incoming?

Hello guys, it seems like DeepSeek added a new vision mode to their application. Does this mean that they will release a new vision model? Edit: Guys, it is not an OCR model. I have just asked it to describe multiple images, which had no text in them.

19
Hugging Face Daily Papers research 5d ago

ABACUS: Adapting Unified Foundation Model for Bridging Image Count Understanding and Generation

Abstract: ABACUS is a unified vision-language model that performs object counting and related tasks through innovative spatial grounding, boundary-aware counting policies, and self-critical learning strategies. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) ABACUS is a unified…

16
r/LocalLLaMA community 5d ago

Can Qwen3.6-35B-A3B on an RTX 3060 Replace Google Vision for Receipt-to-JSON Extraction?

I tried replacing Google Vision in my receipt pipeline with a local Qwen model. I had an old LINE message bot where I could send a receipt photo, it would go to Google Vision, get parsed into JSON, and saved in SQLite. Recently I tried again, but locally. Setup: RTX 3060 12GB…

8
r/LocalLLaMA community 5d ago

Gemma 4 12b needs glasses

Having a lot of fun using Gemma 4 as an assistant, but I am growing frustrated with the poor default image resolution setting for image vision. Tasks like identifying smaller text in an image that Qwen 3.6 flies through, Gemma 4 is never able to decipher. Even larger overall…

31
arXiv — Machine Learning research 6d ago

\chisao{}: A GPU-Native Parallel Optimizer for Multimodal Black-Box Functions via Convergence-Anticonvergence Oscillation

arXiv:2606.26164v1 Announce Type: new Abstract: Finding all modes of a multimodal black-box function is a fundamental challenge in optimization, Bayesian inference, and scientific computing. Existing approaches -- basin-hopping, CMA-ES, multistart gradient descent -- operate…

26
arXiv — Machine Learning research 6d ago

When Does Quality-Aware Multimodal Fusion Matter? A Leakage-Safe Diagnostic for Decision-Level Dependence

arXiv:2606.26473v1 Announce Type: new Abstract: Many multimodal systems estimate the reliability of each modality and weight their contributions to the final prediction. However, it remains unclear whether these scores influence model decisions or merely correlate with…

20
arXiv — NLP / Computation & Language research 6d ago

Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA

arXiv:2606.27023v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) applied to Medical Visual Question Answering (VQA) tend to produce overconfident outputs regardless of actual correctness, and existing verbalized confidence calibration methods, developed…

15
arXiv — Machine Learning research 6d ago

Automating Potential-based Reward Shaping with Vision Language Model Guidance

arXiv:2606.27180v1 Announce Type: new Abstract: Sparse rewards are inherently challenging for reinforcement learning agents as they lack intermediate feedback to guide exploration and to correctly attribute the sparse success rewards to relevant parts of the trajectory. Naive…

36
arXiv — Machine Learning research 6d ago

Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders

arXiv:2606.27321v1 Announce Type: new Abstract: Sparse autoencoders (SAEs) have become a leading tool for interpreting the representations of vision foundation models, decomposing their polysemantic activations into a larger set of sparse, more monosemantic features. The Top-$k$…

22
arXiv — Machine Learning research 6d ago

Dot-Flik: A Scalable Edge AI Architecture for Distributed Insect Monitoring

arXiv:2606.26121v1 Announce Type: cross Abstract: Global insect population declines necessitate scalable, continuous monitoring systems, yet existing vision-based solutions remain constrained by high hardware costs, energy demands, and reliance on centralized processing or cloud…

11
arXiv — NLP / Computation & Language research 6d ago

Low Resource Multimodal Translation of Nepali Spoken Words into Emotion-Conditioned Sign Language Avatars

arXiv:2606.26107v1 Announce Type: new Abstract: Sign language communication systems that integrate emotional expression remain underexplored, particularly for low-resource languages. This pilot study presents NEST-V1 (Nepali Emotion and Speech Transformer - Version 1), a…

37
