Tag Multimodal 500 articles archived under #multimodal arXiv — Machine Learning research 2d ago Can AI Draw Science? A Benchmark for Evaluating Scientific Figure Generation by Text-to-Image and Multimodal Models arXiv:2606.28406v1 Announce Type: new Abstract: Text-to-image and multimodal generative models are increasingly used to produce scientific figures such as mechanism diagrams, experimental-design schematics, conceptual frameworks, and graphical abstracts. Yet existing… 36 arXiv — Machine Learning research 2d ago Counterfactual Residual Data Augmentation for Regression arXiv:2606.28460v1 Announce Type: new Abstract: Data-driven modeling in real-world regression tasks often suffers from limited training samples, high collection costs, and noisy observations. Inspired by the impact of data augmentation in vision and language, we propose a novel… 21 arXiv — Machine Learning research 2d ago NIVA: A Multimodal Foundation Model for Actionable Earth System Intelligence arXiv:2606.28546v1 Announce Type: new Abstract: Recent advances in AI-driven weather and climate modeling have improved forecast skill while reducing computational cost. However, existing data-driven approaches are limited in their ability to model coupled Earth system dynamics,… 9 arXiv — Machine Learning research 2d ago ML-Powered LDAP Reconnaissance Detection using Weak Supervision arXiv:2606.28917v1 Announce Type: new Abstract: Lightweight Directory Access Protocol (LDAP) is a protocol that allows users to query and modify Active Directory (AD) data. By default, all users have read access to all AD data through LDAP, making it a common initial tool for… 14 arXiv — Machine Learning research 2d ago DLR: Zero-Inference-Cost Latent Residuals for Low-Rank Pre-Training arXiv:2606.28932v1 Announce Type: new Abstract: Large language models have driven recent progress in language and multimodal AI, yet pre-training them at scale is prohibitively expensive. 
Low-rank pre-training, which factorizes each weight matrix into a rank-r product to reduce… 35 arXiv — Machine Learning research 2d ago AMR: Adaptive Modality Routing for Multimodal Polyglot Speaker Identification arXiv:2606.29335v1 Announce Type: new Abstract: Multimodal speaker identification systems face two key challenges in real-world deployment: missing modalities and language mismatch between training and testing conditions. In practical scenarios, background multi-speaker… 14 arXiv — Machine Learning research 2d ago Do Models Read What They Write? Causal Registers in Scratchpad Reasoning arXiv:2606.29522v1 Announce Type: new Abstract: A central hope behind process supervision is that models can expose intermediate variables that matter for their later behavior. For this to help with alignment, a scratchpad must be tied to the computation: when the model writes a… 29 arXiv — NLP / Computation & Language research 2d ago SEAD: Competence-Aware On-Policy Distillation via Entropy-Guided Supervision arXiv:2606.28562v1 Announce Type: new Abstract: On-policy distillation (OPD) has a property absent in offline distillation and RL: teacher supervision quality depends on student competence. Incoherent rollouts yield noisy gradients; already-mastered tokens yield redundant ones.… 10 arXiv — NLP / Computation & Language research 2d ago EVLA: An Electro-Aware Multimodal Assistant for Physically-Grounded Driving Reasoning and Control arXiv:2606.28938v1 Announce Type: new Abstract: Modern vision-language models (VLMs) for driving assistants typically treat vehicle dynamics as a black box, resulting in decisions that lack awareness of the vehicle's real-time electro-mechanical state. To bridge this gap, we… 26 arXiv — NLP / Computation & Language research 2d ago Can OCR-VLMs Read Devanagari? 
A Stress-Test Benchmark and Post-Correction Study arXiv:2606.29213v1 Announce Type: new Abstract: OCR systems, ranging from classical engines to specialised OCR vision-language models (OCR-VLMs) and frontier multimodal LLMs, report strong results on English and Chinese document benchmarks, yet their behaviour on Indic scripts… 30 arXiv — NLP / Computation & Language research 2d ago Hybrid Retriever Evolution for Multimodal Document Reasoning Agents arXiv:2606.29648v1 Announce Type: new Abstract: Different retrievers, including lexical, semantic, and multimodal approaches, provide highly complementary strengths for multimodal document understanding, yet most systems combine them through fixed pipelines that cannot adapt to… 33 arXiv — NLP / Computation & Language research 2d ago Resolution Thresholds in VLM Detection of Harmful ASCII Art Across Construction Modes and Languages arXiv:2606.29649v1 Announce Type: new Abstract: Large Vision-Language Models (VLMs) are increasingly deployed as content moderation tools, yet they remain vulnerable to jailbreak attacks in which harmful text is visually encoded as ASCII art. This can allow inappropriate or… 31 arXiv — NLP / Computation & Language research 2d ago Can MLLMs Critique Like Humans? 
Evaluating Open-Ended Aesthetic Reasoning in Multimodal Large Language Models arXiv:2606.29689v1 Announce Type: new Abstract: Open-ended aesthetic critique is a challenge for multimodal large language models (MLLMs): unlike multiple-choice aesthetic benchmarks, it has no single correct answer, and most aesthetic evaluation has measured models against… 8 arXiv — NLP / Computation & Language research 2d ago DAIN: Dynamic Agent-Based Interaction Network for Efficient and Collaborative Multimodal Reasoning arXiv:2606.30189v1 Announce Type: new Abstract: Current multimodal fusion approaches, particularly those based on static Mixture-of-Experts (MoE) architectures, often struggle to provide the adaptive and efficient collaborative reasoning required by complex real-world… 14 arXiv — NLP / Computation & Language research 2d ago Forewarned is Forearmed: When Non-Sequential Embedding Turns Into an Anomaly Detector arXiv:2606.30196v1 Announce Type: new Abstract: This paper offers an in-depth analysis of non-sequential multimodal sentence-level embeddings, with a particular focus on the SONAR model. We demonstrate that certain embedding dimensions are sensitive to perturbations and can… 25 Hugging Face Daily Papers research 2d ago Orca: The World is in Your Mind Abstract Orca establishes a unified world latent space through next-state-prediction modeling using multimodal data and demonstrates superior performance in downstream tasks compared to specialized baselines. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce Orca, an… 38 Hugging Face Daily Papers research 2d ago TACO: Tool-Augmented Credit Optimization for Agentic Tool Use Abstract Tool-Augmented Credit Optimization (TACO) improves multimodal agent performance by distinguishing useful, redundant, or misleading code operations through dual advantage channels: Differential Answer-Probe Reward for individual tool contribution and Outcome-Gated… 38 Hugging Face Daily Papers research 2d ago Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction Abstract A new benchmark evaluates multimodal large language models' ability to understand video content and perform GUI tasks, while a novel keyframe extraction method improves performance on both video question answering and video-guided agentic tasks. Generated by… 28 Hacker News — AI on Front Page community 2d ago .self: A new top-level domain designed to support self-hosting Article URL: https://hccf.onmy.cloud/2026/06/21/reclaiming-our-digital-selves-hccfs-vision-for-a-human-centered-top-level-domain/ Comments URL: https://news.ycombinator.com/item?id=48724230 Points: 246 # Comments: 154 24 r/MachineLearning community 2d ago I built a demo agricultural planning system with an AI advisor for small-scale farmers in Nicaragua using NASA data [p] (this was deleted before but i dont know if it was the filters of reddit or the moderators, if is the moderators i will not post it again after you delete it sorry.) (The name will probably change soon because I didn't realize "AgroVision" is already a registered trademark lol.)… 15 r/MachineLearning community 2d ago I do historical swordfighting and noticed AI struggles to track it. I’m building an open dataset to help fix this. Does my schema make sense? 
[P] Hi everyone, I’m a historical swordfighter (HEMA practitioner), and while I’m not a computer vision engineer or a roboticist, I’ve been reading a lot about the current bottlenecks in embodied AI, specifically around the Sim2Real gap and thin-object tracking. It occurred to me… 18 r/MachineLearning community 3d ago ECCV 2026 Final Decisions after Provisional Acceptance [D] Has anyone actually received final acceptance following their provisional acceptance email from ECCV 2026? I am very confused. Thank you so much. submitted by /u/Land_Heavy 15 arXiv — Machine Learning research 3d ago HybridCodec: Modeling Discrete and Continuous Representations for Efficient Speech Language Models arXiv:2606.27627v1 Announce Type: new Abstract: Discrete audio representations have become increasingly popular for building multimodal text-audio systems and integrating audio capabilities into Large Language Models (LLMs). However, numerous studies report performance… 7 arXiv — Machine Learning research 3d ago Are Time-Series Foundation Models Ready for E-Nose Data? An Empirical Assessment of Their Embeddings arXiv:2606.27672v1 Announce Type: new Abstract: Inspired by advances in natural language processing and computer vision, "time-series foundation models" (TSFMs) have recently been introduced with the promise of strong generalization across diverse time-series tasks, including… 5 arXiv — Machine Learning research 3d ago Dual-Learning based Penalized Multi-Align Clustering for Multi-View Incomplete and Disorderly Data arXiv:2606.27984v1 Announce Type: new Abstract: Multimodal feature fusion can effectively capture complex patterns in real-world data by integrating complementary information from different modalities. 
However, in many applications, such as boiler combustion monitoring,… 18 arXiv — Machine Learning research 3d ago Beyond Sparse Supervision: Diffusion-Guided Learning for Few-Shot Graph Fraud Detection arXiv:2606.28134v1 Announce Type: new Abstract: Graph-based fraud detection is essential for safeguarding large-scale transaction systems, where undetected anomalies may lead to substantial financial losses and security risks. Real-world fraud graphs pose two coupled challenges:… 12 arXiv — Machine Learning research 3d ago Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge arXiv:2606.27527v1 Announce Type: cross Abstract: Large Language Models (LLMs) possess broad conceptual knowledge acquired through large-scale text pretraining, yet their potential to supervise models in other modalities remains underexplored. In this work, we propose… 10 arXiv — Machine Learning research 3d ago Learning from Annotation Uncertainty: Entropy-Aware Curriculum for Speech Emotion Recognition arXiv:2606.27536v1 Announce Type: cross Abstract: Speech emotion recognition (SER) often relies on hard consensus labels that collapse annotator disagreement. We study distribution-based supervision for 9-class SER on MSP-Podcast 2.0 using a WavLM-Base multitask model for… 23 arXiv — NLP / Computation & Language research 3d ago Dialogue to Detection: A Multimodal Hybrid NLP Pipeline for Insurance Fraud Detection arXiv:2606.28002v1 Announce Type: new Abstract: Insurance fraud imposes substantial financial losses and operational inefficiencies, raising premiums and impacting trust among legitimate policyholders. Early detection at FNOL remains a persistent challenge. 
Existing approaches… 25 arXiv — NLP / Computation & Language research 3d ago Vision-Default, Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models arXiv:2606.28273v1 Announce Type: new Abstract: Vision-language models must reconcile visual evidence with memorized world knowledge when the two conflict. How they resolve this conflict shapes the reliability of multimodal systems, yet prior work characterizes it behaviorally… 31 arXiv — NLP / Computation & Language research 3d ago DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection arXiv:2606.27499v1 Announce Type: cross Abstract: Research on agent memory has matured rapidly, but almost entirely on the text side: few existing benchmarks ask, in an interactive environment, when an agent genuinely needs to remember what it saw rather than what it could write… 11 arXiv — NLP / Computation & Language research 3d ago Aloe-Vision: Robust Vision-Language Models for Healthcare arXiv:2606.27500v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) specialized in healthcare are emerging as a promising research direction due to their potential impact in clinical and biomedical applications. However, progress is constrained by the scarcity… 28 arXiv — NLP / Computation & Language research 3d ago Joint Transcription and Decryption of Images of Encrypted Handwritten Documents: A Comparison with the Traditional Pipeline arXiv:2606.27700v1 Announce Type: cross Abstract: Historical encrypted manuscripts present a challenging problem at the intersection of cryptology, linguistics, paleography, and computer vision. 
Current automatic decipherment approaches usually rely on a two-stage pipeline:… 7 arXiv — NLP / Computation & Language research 3d ago EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning arXiv:2603.09731v3 Announce Type: replace-cross Abstract: Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from… 34 arXiv — NLP / Computation & Language research 3d ago Multimodal Evaluator Preference Collapse: Cross-Modal Coupling in Self-Evolving Agents arXiv:2606.16682v3 Announce Type: replace-cross Abstract: When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using… 4 arXiv — NLP / Computation & Language research 3d ago SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning arXiv:2606.22873v3 Announce Type: replace-cross Abstract: Vision-language models (VLMs) are increasingly deployed in consumer, medical, financial, and enterprise applications. This broad deployment expands the safety surface: risks can arise from multimodal question answering,… 31 TechCrunch — AI news-outlet 4d ago SoftBank’s CEO isn’t the only one with questions about Elon Musk’s orbital data center hype Not everyone is buying Elon Musk’s vision for orbital data centers. 19 TechCrunch — AI news-outlet 4d ago Apple Vision Pro exec is reportedly leaving for OpenAI Paul Meade, the Apple vice president in charge of the Vision Pro headset, is reportedly leaving the company to join OpenAI’s hardware team. 22 r/LocalLLaMA community 4d ago Agentic Cyberdeck Dev I developed this around August '25, but never had real polished panels. So, here we are with some decent panels, and new speakers for voice AI inferencing. 
This has local agentic GPS, chat, voice, vision analysis. This is a fun little project that I come back around to until I… 12 r/LocalLLaMA community 4d ago New deepseek vision model incoming? Hello guys, it seems like DeepSeek added a new vision mode to their application. Does this mean that they will release a new vision model? Edit: Guys, it is not an OCR model. I have just asked it to describe multiple images, which had no text in them.   submitted by  … 19 Hugging Face Daily Papers research 5d ago ABACUS: Adapting Unified Foundation Model for Bridging Image Count Understanding and Generation Abstract ABACUS is a unified vision-language model that performs object counting and related tasks through innovative spatial grounding, boundary-aware counting policies, and self-critical learning strategies. Generated by Qwen/Qwen2.5-Coder-32B-Instruct ABACUS is a unified… 16 r/LocalLLaMA community 5d ago Can Qwen3.6-35B-A3B on an RTX 3060 Replace Google Vision for Receipt-to-JSON Extraction? I tried replacing Google Vision in my receipt pipeline with a local Qwen model. I had an old LINE message bot where I could send a receipt photo, it would go to Google Vision, get parsed into JSON, and saved in SQLite. Recently I tried again, but locally. Setup: RTX 3060 12GB… 8 r/LocalLLaMA community 5d ago Gemma 4 12b needs glasses Having a lot of fun using Gemma 4 as an assistant, but am growing frustrated with the poor default image resolution setting for image vision. Tasks like identifying smaller text in an image that Qwen 3.6 flies through, Gemma 4 is never able to decipher. Even larger overall… 31 arXiv — Machine Learning research 6d ago \chisao{}: A GPU-Native Parallel Optimizer for Multimodal Black-Box Functions via Convergence-Anticonvergence Oscillation arXiv:2606.26164v1 Announce Type: new Abstract: Finding all modes of a multimodal black-box function is a fundamental challenge in optimization, Bayesian inference, and scientific computing. 
Existing approaches -- basin-hopping, CMA-ES, multistart gradient descent -- operate… 26 arXiv — Machine Learning research 6d ago When Does Quality-Aware Multimodal Fusion Matter? A Leakage-Safe Diagnostic for Decision-Level Dependence arXiv:2606.26473v1 Announce Type: new Abstract: Many multimodal systems estimate the reliability of each modality and weight their contributions to the final prediction. However, it remains unclear whether these scores influence model decisions or merely correlate with… 20 arXiv — NLP / Computation & Language research 6d ago Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA arXiv:2606.27023v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) applied to Medical Visual Question Answering (VQA) tend to produce overconfident outputs regardless of actual correctness, and existing verbalized confidence calibration methods, developed… 15 arXiv — Machine Learning research 6d ago Automating Potential-based Reward Shaping with Vision Language Model Guidance arXiv:2606.27180v1 Announce Type: new Abstract: Sparse rewards are inherently challenging for reinforcement learning agents as they lack intermediate feedback to guide exploration and to correctly attribute the sparse success rewards to relevant parts of the trajectory. Naive… 36 arXiv — Machine Learning research 6d ago Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders arXiv:2606.27321v1 Announce Type: new Abstract: Sparse autoencoders (SAEs) have become a leading tool for interpreting the representations of vision foundation models, decomposing their polysemantic activations into a larger set of sparse, more monosemantic features. 
The Top-$k$… 22 arXiv — Machine Learning research 6d ago Dot-Flik: A Scalable Edge AI Architecture for Distributed Insect Monitoring arXiv:2606.26121v1 Announce Type: cross Abstract: Global insect population declines necessitate scalable, continuous monitoring systems, yet existing vision-based solutions remain constrained by high hardware costs, energy demands, and reliance on centralized processing or cloud… 11 arXiv — NLP / Computation & Language research 6d ago Low Resource Multimodal Translation of Nepali Spoken Words into Emotion-Conditioned Sign Language Avatars arXiv:2606.26107v1 Announce Type: new Abstract: Sign language communication systems that integrate emotional expression remain underexplored, particularly for low-resource languages. This pilot study presents NEST-V1 (Nepali Emotion and Speech Transformer - Version 1), a… 37