planning - 2026-01-27

Advances and Innovations in the Multi-Agent Robotic System (MARS) Challenge

Authors:Li Kang, Heng Zhou, Xiufeng Song, Rui Li, Bruno N. Y. Chen, Ziye Wang, Ximeng Meng, Stone Tao, Yiran Qin, Xiaohong Liu, Ruimao Zhang, Lei Bai, Yilun Du, Hao Su, Philip Torr, Zhenfei Yin, Ruihao Gong, Yejun Zeng, Fengjun Zhong, Shenghao Jin, Jinyang Guo, Xianglong Liu, Xiaojun Jia, Tianqi Shan, Wenqi Ren, Simeng Qin, Jialing Yang, Xiaoyu Ma, Tianxing Chen, Zixuan Li, Zijian Cai, Yan Qin, Yusen Qin, Qiangyu Chen, Kaixuan Wang, Zhaoming Han, Yao Mu, Ping Luo, Yuanqi Yao, Haoming Song, Jan-Nico Zaech, Fabien Despinoy, Danda Pani Paudel, Luc Van Gool
Date:2026-01-26 17:56:19

Recent advancements in multimodal large language models and vision-languageaction models have significantly driven progress in Embodied AI. As the field transitions toward more complex task scenarios, multi-agent system frameworks are becoming essential for achieving scalable, efficient, and collaborative solutions. This shift is fueled by three primary factors: increasing agent capabilities, enhancing system efficiency through task delegation, and enabling advanced human-agent interactions. To address the challenges posed by multi-agent collaboration, we propose the Multi-Agent Robotic System (MARS) Challenge, held at the NeurIPS 2025 Workshop on SpaVLE. The competition focuses on two critical areas: planning and control, where participants explore multi-agent embodied planning using vision-language models (VLMs) to coordinate tasks and policy execution to perform robotic manipulation in dynamic environments. By evaluating solutions submitted by participants, the challenge provides valuable insights into the design and coordination of embodied multi-agent systems, contributing to the future development of advanced collaborative AI systems.

Detection of high-frequency gravitational waves using SRF cavities

Authors:M. Wenskat, B. Giaccone, J. Branlard, V. Chouhan, C. Dokuyucu, L. Fischer, I. Gonin, A. Grassellino, W. Hillert, T. Khabiboulline, T. Krokotsch, F. Ludwig, G. Marconato, A. Melnychuk, G. Moortgat-Pick, A. Muhs, A. Netepenko, Y. Orlov, M. Paulsen, K. Peters, L. Pfeiffer, S. Posen, O. Pronitchev, H. Schlarb
Date:2026-01-26 17:42:03

Today, apart from some isolated R&D efforts, there are no gravitational wave (GW) experiments, yet which explore a large part of the vast frequency range above the LIGO/Virgo band. It is planned to establish an experiment at Deutsches Elektronen-Synchrotron (DESY) and at the Superconducting Quantum Materials and Systems (SQMS) Center at Fermi National Accelerator Laboratory (Fermilab) to search for high-frequency GWs in the frequency range of 10 kHz to 100 MHz. The basic idea is to use superconducting radiofrequency (SRF) cavities to detect tiny harmonic deformations induced by GWs which change the boundary conditions of the oscillating electromagnetic field. This paper summarizes the challenging environmental boundary requirements, and the R&D to operate a cavity using a low level RF (LLRF) system which pushes beyond state-of-the-art accuracy and resolutions and a seismic noise mitigated cryostat at 1.8 K. The focus of this paper is the warm and cold commissioning of a prototype cavity, built 20 years ago during the MAGO collaboration, and its first measurement in our collaborative research project.

Scalable Transit Delay Prediction at City Scale: A Systematic Approach with Multi-Resolution Feature Engineering and Deep Learning

Authors:Emna Boudabbous, Mohamed Karaa, Lokman Sboui, Julio Montecinos, Omar Alam
Date:2026-01-26 14:30:50

Urban bus transit agencies need reliable, network-wide delay predictions to provide accurate arrival information to passengers and support real-time operational control. Accurate predictions help passengers plan their trips, reduce waiting time, and allow operations staff to adjust headways, dispatch extra vehicles, and manage disruptions. Although real-time feeds such as GTFS-Realtime (GTFS-RT) are now widely available, most existing delay prediction systems handle only a few routes, depend on hand-crafted features, and offer little guidance on how to design a scalable, reusable architecture. We present a city-scale prediction pipeline that combines multi-resolution feature engineering, dimensionality reduction, and deep learning. The framework generates 1,683 spatiotemporal features by exploring 23 aggregation combinations over H3 cells, routes, segments, and temporal patterns, and compresses them into 83 components using Adaptive PCA while preserving 95% of the variance. To avoid the "giant cluster" problem that occurs when dense urban areas fall into a single H3 region, we introduce a hybrid H3+topology clustering method that yields 12 balanced route clusters (coefficient of variation 0.608) and enables efficient distributed training. We compare five model architectures on six months of bus operations from the Société de transport de Montréal (STM) network in Montréal. A global LSTM with cluster-aware features achieves the best trade-off between accuracy and efficiency, outperforming transformer models by 18 to 52% while using 275 times fewer parameters. We also report multi-level evaluation at the elementary segment, segment, and trip level with walk-forward validation and latency analysis, showing that the proposed pipeline is suitable for real-time, city-scale deployment and can be reused for other networks with limited adaptation.

TC-IDM: Grounding Video Generation for Executable Zero-shot Robot Motion

Authors:Weishi Mi, Yong Bao, Xiaowei Chi, Xiaozhu Ju, Zhiyuan Qin, Kuangzhi Ge, Kai Tang, Peidong Jia, Shanghang Zhang, Jian Tang
Date:2026-01-26 10:06:56

The vision-language-action (VLA) paradigm has enabled powerful robotic control by leveraging vision-language models, but its reliance on large-scale, high-quality robot data limits its generalization. Generative world models offer a promising alternative for general-purpose embodied AI, yet a critical gap remains between their pixel-level plans and physically executable actions. To this end, we propose the Tool-Centric Inverse Dynamics Model (TC-IDM). By focusing on the tool's imagined trajectory as synthesized by the world model, TC-IDM establishes a robust intermediate representation that bridges the gap between visual planning and physical control. TC-IDM extracts the tool's point cloud trajectories via segmentation and 3D motion estimation from generated videos. Considering diverse tool attributes, our architecture employs decoupled action heads to project these planned trajectories into 6-DoF end-effector motions and corresponding control signals. This plan-and-translate paradigm not only supports a wide range of end-effectors but also significantly improves viewpoint invariance. Furthermore, it exhibits strong generalization capabilities across long-horizon and out-of-distribution tasks, including interacting with deformable objects. In real-world evaluations, the world model with TC-IDM achieves an average success rate of 61.11 percent, with 77.7 percent on simple tasks and 38.46 percent on zero-shot deformable object tasks. It substantially outperforms end-to-end VLA-style baselines and other inverse dynamics models.

Revisiting Aerial Scene Classification on the AID Benchmark

Authors:Subhajeet Das, Susmita Ghosh, Abhiroop Chatterjee
Date:2026-01-26 08:39:02

Aerial images play a vital role in urban planning and environmental preservation, as they consist of various structures, representing different types of buildings, forests, mountains, and unoccupied lands. Due to its heterogeneous nature, developing robust models for scene classification remains a challenge. In this study, we conduct a literature review of various machine learning methods for aerial image classification. Our survey covers a range of approaches from handcrafted features (e.g., SIFT, LBP) to traditional CNNs (e.g., VGG, GoogLeNet), and advanced deep hybrid networks. In this connection, we have also designed Aerial-Y-Net, a spatial attention-enhanced CNN with multi-scale feature fusion mechanism, which acts as an attention-based model and helps us to better understand the complexities of aerial images. Evaluated on the AID dataset, our model achieves 91.72% accuracy, outperforming several baseline architectures.

PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR

Authors:James Burgess, Jan N. Hansen, Duo Peng, Yuhui Zhang, Alejandro Lozano, Min Woo Sun, Emma Lundberg, Serena Yeung-Levy
Date:2026-01-26 06:46:16

Search agents are language models (LMs) that reason and search knowledge bases (or the web) to answer questions; recent methods supervise only the final answer accuracy using reinforcement learning with verifiable rewards (RLVR). Most RLVR search agents tackle general-domain QA, which limits their relevance to technical AI systems in science, engineering, and medicine. In this work we propose training agents to search and reason over scientific papers -- this tests technical question-answering, it is directly relevant to real scientists, and the capabilities will be crucial to future AI Scientist systems. Concretely, we release a search corpus of 16 million biomedical paper abstracts and construct a challenging factoid QA dataset called PaperSearchQA with 60k samples answerable from the corpus, along with benchmarks. We train search agents in this environment to outperform non-RL retrieval baselines; we also perform further quantitative analysis and observe interesting agent behaviors like planning, reasoning, and self-verification. Our corpus, datasets, and benchmarks are usable with the popular Search-R1 codebase for RLVR training and released on https://huggingface.co/collections/jmhb/papersearchqa. Finally, our data creation methods are scalable and easily extendable to other scientific domains.

\textsc{NaVIDA}: Vision-Language Navigation with Inverse Dynamics Augmentation

Authors:Weiye Zhu, Zekai Zhang, Xiangchen Wang, Hewei Pan, Teng Wang, Tiantian Geng, Rongtao Xu, Feng Zheng
Date:2026-01-26 06:16:17

Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions and act coherently in visually rich environments. However, most existing methods rely on reactive state-action mappings without explicitly modeling how actions causally transform subsequent visual observations. Lacking such vision-action causality, agents cannot anticipate the visual changes induced by its own actions, leading to unstable behaviors, weak generalization, and cumulative error along trajectory. To address these issues, we introduce \textsc{NaVIDA} (\textbf{Nav}igation with \textbf{I}nverse \textbf{D}ynamics \textbf{A}ugmentation), a unified VLN framework that couples policy learning with action-grounded visual dynamics and adaptive execution. \textsc{NaVIDA} augments training with chunk-based inverse-dynamics supervision to learn causal relationship between visual changes and corresponding actions. To structure this supervision and extend the effective planning range, \textsc{NaVIDA} employs hierarchical probabilistic action chunking (HPAC), which organizes trajectories into multi-step chunks and provides discriminative, longer-range visual-change cues. To further curb error accumulation and stabilize behavior at inference, an entropy-guided mechanism adaptively sets the execution horizon of action chunks. Extensive experiments show that \textsc{NaVIDA} achieves superior navigation performance compared to state-of-the-art methods with fewer parameters (3B vs. 8B). Real-world robot evaluations further validate the practical feasibility and effectiveness of our approach. Code and data will be available upon acceptance.

Agentic Very Long Video Understanding

Authors:Aniket Rege, Arka Sadhu, Yuliang Li, Kejie Li, Ramya Korlakai Vinayak, Yuning Chai, Yong Jae Lee, Hyo Jin Kim
Date:2026-01-26 05:20:47

The advent of always-on personal AI assistants, enabled by all-day wearable devices such as smart glasses, demands a new level of contextual understanding, one that goes beyond short, isolated events to encompass the continuous, longitudinal stream of egocentric video. Achieving this vision requires advances in long-horizon video understanding, where systems must interpret and recall visual and audio information spanning days or even weeks. Existing methods, including large language models and retrieval-augmented generation, are constrained by limited context windows and lack the ability to perform compositional, multi-hop reasoning over very long video streams. In this work, we address these challenges through EGAgent, an enhanced agentic framework centered on entity scene graphs, which represent people, places, objects, and their relationships over time. Our system equips a planning agent with tools for structured search and reasoning over these graphs, as well as hybrid visual and audio search capabilities, enabling detailed, cross-modal, and temporally coherent reasoning. Experiments on the EgoLifeQA and Video-MME (Long) datasets show that our method achieves state-of-the-art performance on EgoLifeQA (57.5%) and competitive performance on Video-MME (Long) (74.1%) for complex longitudinal video understanding tasks.

Contact Plan Design For Optical Interplanetary Communications

Authors:Jason Gerard, Juan A. Fraire, Sandra Cespedes
Date:2026-01-26 05:10:42

Space exploration missions generate rapidly increasing volumes of scientific telemetry that far exceed the capacity of today's manually scheduled, RF-based deep-space infrastructure. Free-space optical (FSO) communications promise orders of magnitude higher throughput, but their narrow beams require precise pointing, acquisition, and tracking (PAT) for link establishment and tightly synchronized contact schedules. Critically, no existing contact plan design (CPD) framework accounts for optical head retargeting delay, the time spent during coarse pointing and link acquisition before data transmission begins, which directly reduces usable contact time. Retargeting delay is the dominant impairment unique to optical networks, which induces a seconds-to-minutes-long mechanical pointing process for an optical terminal's laser from its current partner to the next receiver. This paper introduces the first PAT-aware CPD framework for optical interplanetary backhaul networks. The model captures directional temporal flows across both direct-to-Earth optical links and two-hop relay paths using delay/disruption-tolerant networking (DTN) satellites. We also introduce an optical network duty-cycle metric that quantifies the proportion of time spent transmitting to the contact window duration, exposing capacity lost to retargeting delay. Our results show that our MILP scheduler delivers over 30 percent higher network capacity than a greedy algorithm. More importantly, the results uncover a fundamental behavioral shift: when retargeting delays are modeled accurately, optimal schedules favor fewer but longer optical links that maximize throughput while minimizing retargeting overhead. These findings demonstrate that zero-delay assumptions substantially overestimate achievable performance and yield unrealistic contact plans.

Neurocomputational Mechanisms of Syntactic Transfer in Bilingual Sentence Production

Authors:Ahmet Yavuz Uluslu, Elliot Murphy
Date:2026-01-26 01:00:58

We discuss the benefits of incorporating into the study of bilingual production errors and their traditionally documented timing signatures (e.g., event-related potentials) certain types of oscillatory signatures, which can offer new implementational-level constraints for theories of bilingualism. We argue that a recent neural model of language, ROSE, can offer a neurocomputational account of syntactic transfer in bilingual production, capturing some of its formal properties and the scope of morphosyntactic sequencing failure modes. We take as a case study cross-linguistic influence (CLI) and attendant theories of functional inhibition/competition, and present these as being driven by specific oscillatory failure modes during L2 sentence planning. We argue that modeling CLI in this way not only offers the kind of linking hypothesis ROSE was built to encourage, but also licenses the exploration of more spatiotemporally complex biomarkers of language dysfunction than more commonly discussed neural signatures.

An Experimental Comparison of Cognitive Forcing Functions for Execution Plans in AI-Assisted Writing: Effects On Trust, Overreliance, and Perceived Critical Thinking

Authors:Ahana Ghosh, Advait Sarkar, Siân Lindley, Christian Poelitz
Date:2026-01-25 23:03:29

Generative AI (GenAI) tools improve productivity in knowledge workflows such as writing, but also risk overreliance and reduced critical thinking. Cognitive forcing functions (CFFs) mitigate these risks by requiring active engagement with AI output. As GenAI workflows grow more complex, systems increasingly present execution plans for user review. However, these plans are themselves AI-generated and prone to overreliance, and the effectiveness of applying CFFs to AI plans remains underexplored. We conduct a controlled experiment in which participants completed AI-assisted writing tasks while reviewing AI-generated plans under four CFF conditions: Assumption (argument analysis), WhatIf (hypothesis testing), Both, and a no-CFF control. A follow-up think-aloud and interview study qualitatively compared these conditions. Results show that the Assumption CFF most effectively reduced overreliance without increasing cognitive load, while participants perceived the WhatIf CFF as most helpful. These findings highlight the value of plan-focused CFFs for supporting critical reflection in GenAI-assisted knowledge work.

Agentic AI for Self-Driving Laboratories in Soft Matter: Taxonomy, Benchmarks,and Open Challenges

Authors:Xuanzhou Chen, Audrey Wang, Stanley Yin, Hanyang Jiang, Dong Zhang
Date:2026-01-25 17:44:19

Self-driving laboratories (SDLs) close the loop between experiment design, automated execution, and data-driven decision making, and they provide a demanding testbed for agentic AI under expensive actions, noisy and delayed feedback, strict feasibility and safety constraints, and non-stationarity. This survey uses soft matter as a representative setting but focuses on the AI questions that arise in real laboratories. We frame SDL autonomy as an agent environment interaction problem with explicit observations, actions, costs, and constraints, and we use this formulation to connect common SDL pipelines to established AI principles. We review the main method families that enable closed loop experimentation, including Bayesian optimization and active learning for sample efficient experiment selection, planning and reinforcement learning for long horizon protocol optimization, and tool using agents that orchestrate heterogeneous instruments and software. We emphasize verifiable and provenance aware policies that support debugging, reproducibility, and safe operation. We then propose a capability driven taxonomy that organizes systems by decision horizon, uncertainty modeling, action parameterization, constraint handling, failure recovery, and human involvement. To enable meaningful comparison, we synthesize benchmark task templates and evaluation metrics that prioritize cost aware performance, robustness to drift, constraint violation behavior, and reproducibility. Finally, we distill lessons from deployed SDLs and outline open challenges in multi-modal representation, calibrated uncertainty, safe exploration, and shared benchmark infrastructure.

Less Is More: Scalable Visual Navigation from Limited Data

Authors:Yves Inglin, Jonas Frey, Changan Chen, Marco Hutter
Date:2026-01-25 12:44:23

Imitation learning provides a powerful framework for goal-conditioned visual navigation in mobile robots, enabling obstacle avoidance while respecting human preferences and social norms. However, its effectiveness depends critically on the quality and diversity of training data. In this work, we show how classical geometric planners can be leveraged to generate synthetic trajectories that complement costly human demonstrations. We train Less is More (LiMo), a transformer-based visual navigation policy that predicts goal-conditioned SE(2) trajectories from a single RGB observation, and find that augmenting limited expert demonstrations with planner-generated supervision yields substantial performance gains. Through ablations and complementary qualitative and quantitative analyses, we characterize how dataset scale and diversity affect planning performance. We demonstrate real-robot deployment and argue that robust visual navigation is enabled not by simply collecting more demonstrations, but by strategically curating diverse, high-quality datasets. Our results suggest that scalable, embodiment-specific geometric supervision is a practical path toward data-efficient visual navigation.

Robust Computational Extraction of Non-Enhancing Hypercellular Tumor Regions from Clinical Imaging Data

Authors:A. Brawanski, Th. Schaffer, F. Raab, K. -M. Schebesch, M. Schrey, Chr. Doenitz, A. M. Tomé, E. W. Lang
Date:2026-01-25 11:39:41

Accurate identification of non-enhancing hypercellular (NEH) tumor regions is an unmet need in neuro-oncological imaging, with significant implications for patient management and treatment planning. We present a robust computational framework that generates probability maps of NEH regions from routine MRI data, leveraging multiple network architectures to address the inherent variability and lack of clear imaging boundaries. Our approach was validated against independent clinical markers -- relative cerebral blood volume (rCBV) and enhancing tumor recurrence location (ETRL) -- demonstrating both methodological robustness and biological relevance. This framework enables reliable, non-invasive mapping of NEH tumor compartments, supporting their integration as imaging biomarkers in clinical workflows and advancing precision oncology for brain tumor patients.

Multi-Criteria Inverse Robustness in Radiotherapy Planning Using Semidefinite Programming

Authors:Jan Schröeder, Yair Censor, Philipp Süss, Karl-Heinz Küfer
Date:2026-01-25 08:47:20

Radiotherapy planning naturally leads to a multi-criteria optimization problem which is subject to different sources of uncertainty. In order to find the desired treatment plan, a decision maker must balance these objectives as well as the level of robustness towards uncertainty against each other. This paper showcases a quantitative approach to do so, which combines the theoretical model with the ability to deal with practical challenges. To this end, the uncertainty, which can be expressed via the so-called dose-influence matrix, is modelled using interval matrices. We use inverse robustness to introduce an additional objective, which aims to maximize the volume of the uncertainty set. A multi-criteria approach allows to handle the uncertainty while keeping appropriate values of the other objective functions. We solve the resulting quadratically constrained quadratic optimization problem (QCQP) by first relaxing it to a convex semidefinite problem (SDP) and then reconstructing optimal solutions of the QCQP from solutions of the SDP.

EntWorld: A Holistic Environment and Benchmark for Verifiable Enterprise GUI Agents

Authors:Ying Mo, Yu Bai, Dapeng Sun, Yuqian Shi, Yukai Miao, Li Chen, Dan Li
Date:2026-01-25 06:58:15

Recent advances in Multimodal Large Language Models (MLLMs) have enabled agents to operate in open-ended web and operating system environments. However, existing benchmarks predominantly target consumer-oriented scenarios (e.g., e-commerce and travel booking), failing to capture the complexity and rigor of professional enterprise workflows. Enterprise systems pose distinct challenges, including high-density user interfaces, strict business logic constraints, and a strong reliance on precise, state-consistent information retrieval-settings in which current generalist agents often struggle. To address this gap, we introduce EntWorld, a large-scale benchmark consisting of 1,756 tasks across six representative enterprise domains, including customer relationship management (CRM), information technology infrastructure library (ITIL), and enterprise resource planning (ERP) systems. Unlike previous datasets that depend on fragile execution traces or extensive manual annotation, EntWorld adopts a schema-grounded task generation framework that directly reverse-engineers business logic from underlying database schemas, enabling the synthesis of realistic, long-horizon workflows. Moreover, we propose a SQL-based deterministic verification mechanism in building datasets that replaces ambiguous visual matching with rigorous state-transition validation. Experimental results demonstrate that state-of-the-art models (e.g., GPT-4.1) achieve 47.61% success rate on EntWorld, substantially lower than the human performance, highlighting a pronounced enterprise gap in current agentic capabilities and the necessity of developing domain-specific agents. We release EntWorld as a rigorous testbed to facilitate the development and evaluation of the next generation of enterprise-ready digital agents.

Uni-RS: A Spatially Faithful Unified Understanding and Generation Model for Remote Sensing

Authors:Weiyu Zhang, Yuan Hu, Yong Li, Yu Liu
Date:2026-01-25 03:22:26

Unified remote sensing multimodal models exhibit a pronounced spatial reversal curse: Although they can accurately recognize and describe object locations in images, they often fail to faithfully execute the same spatial relations during text-to-image generation, where such relations constitute core semantic information in remote sensing. Motivated by this observation, we propose Uni-RS, the first unified multimodal model tailored for remote sensing, to explicitly address the spatial asymmetry between understanding and generation. Specifically, we first introduce explicit Spatial-Layout Planning to transform textual instructions into spatial layout plans, decoupling geometric planning from visual synthesis. We then impose Spatial-Aware Query Supervision to bias learnable queries toward spatial relations explicitly specified in the instruction. Finally, we develop Image-Caption Spatial Layout Variation to expose the model to systematic geometry-consistent spatial transformations. Extensive experiments across multiple benchmarks show that our approach substantially improves spatial faithfulness in text-to-image generation, while maintaining strong performance on multimodal understanding tasks like image captioning, visual grounding, and VQA tasks.

A Unified Approach to Concurrent, Parallel Map-Reduce in R using Futures

Authors:Henrik Bengtsson
Date:2026-01-24 20:18:21

The R ecosystem offers a rich variety of map-reduce application programming interfaces (APIs) for iterative computations, yet parallelizing code across these diverse frameworks requires learning multiple, often incompatible, parallel APIs. The futurize package addresses this challenge by providing a single function, futurize(), which transpiles sequential map-reduce expressions into their parallel equivalents in the future ecosystem, which performs all the heavy lifting. By leveraging R's native pipe operator, users can parallelize existing code with minimal refactoring -- often by simply appending `|> futurize()' to an expression. The package supports classical map-reduce functions from base R, purrr, crossmap, foreach, plyr, BiocParallel, e.g., lapply(xs, fcn) |> futurize() and map(xs, fcn) |> futurize(), as well as a growing set of domain-specific packages, e.g., boot, caret, glmnet, lme4, mgcv, and tm. By abstracting away the underlying parallel machinery, and unifying handling of future options, the package enables developers to declare what to parallelize via futurize(), and end-users to choose how via plan(). This article describes the philosophy, design, and implementation of futurize, demonstrates its usage across various map-reduce paradigms, and discusses its role in simplifying parallel computing in R.

MetaWorld: Skill Transfer and Composition in a Hierarchical World Model for Grounding High-Level Instructions

Authors:Yutong Shen, Hangxu Liu, Kailin Pei, Ruizhe Xia, Tongtong Feng
Date:2026-01-24 16:11:45

Humanoid robot loco-manipulation remains constrained by the semantic-physical gap. Current methods face three limitations: Low sample efficiency in reinforcement learning, poor generalization in imitation learning, and physical inconsistency in VLMs. We propose MetaWorld, a hierarchical world model that integrates semantic planning and physical control via expert policy transfer. The framework decouples tasks into a VLM-driven semantic layer and a latent dynamics model operating in a compact state space. Our dynamic expert selection and motion prior fusion mechanism leverages a pre-trained multi-expert policy library as transferable knowledge, enabling efficient online adaptation via a two-stage framework. VLMs serve as semantic interfaces, mapping instructions to executable skills and bypassing symbol grounding. Experiments on Humanoid-Bench show MetaWorld outperforms world model-based RL in task completion and motion coherence. Our code will be found at https://anonymous.4open.science/r/metaworld-2BF4/

BMDS-Net: A Bayesian Multi-Modal Deep Supervision Network for Robust Brain Tumor Segmentation

Authors:Yan Zhou, Zhen Huang, Yingqiu Li, Yue Ouyang, Suncheng Xiang, Zehua Wang
Date:2026-01-24 16:06:43

Accurate brain tumor segmentation from multi-modal magnetic resonance imaging (MRI) is a prerequisite for precise radiotherapy planning and surgical navigation. While recent Transformer-based models such as Swin UNETR have achieved impressive benchmark performance, their clinical utility is often compromised by two critical issues: sensitivity to missing modalities (common in clinical practice) and a lack of confidence calibration. Merely chasing higher Dice scores on idealized data fails to meet the safety requirements of real-world medical deployment. In this work, we propose BMDS-Net, a unified framework that prioritizes clinical robustness and trustworthiness over simple metric maximization. Our contribution is three-fold. First, we construct a robust deterministic backbone by integrating a Zero-Init Multimodal Contextual Fusion (MMCF) module and a Residual-Gated Deep Decoder Supervision (DDS) mechanism, enabling stable feature learning and precise boundary delineation with significantly reduced Hausdorff Distance, even under modality corruption. Second, and most importantly, we introduce a memory-efficient Bayesian fine-tuning strategy that transforms the network into a probabilistic predictor, providing voxel-wise uncertainty maps to highlight potential errors for clinicians. Third, comprehensive experiments on the BraTS 2021 dataset demonstrate that BMDS-Net not only maintains competitive accuracy but, more importantly, exhibits superior stability in missing-modality scenarios where baseline models fail. The source code is publicly available at https://github.com/RyanZhou168/BMDS-Net.

DiffusionCinema: Text-to-Aerial Cinematography

Authors:Valerii Serpiva, Artem Lykov, Jeffrin Sam, Aleksey Fedoseev, Dzmitry Tsetserukou
Date:2026-01-24 11:12:30

We propose a novel Unmanned Aerial Vehicles (UAV) assisted creative capture system that leverages diffusion models to interpret high-level natural language prompts and automatically generate optimal flight trajectories for cinematic video recording. Instead of manually piloting the drone, the user simply describes the desired shot (e.g., "orbit around me slowly from the right and reveal the background waterfall"). Our system encodes the prompt along with an initial visual snapshot from the onboard camera, and a diffusion model samples plausible spatio-temporal motion plans that satisfy both the scene geometry and shot semantics. The generated flight trajectory is then executed autonomously by the UAV to record smooth, repeatable video clips that match the prompt. User evaluation using NASA-TLX showed a significantly lower overall workload with our interface (M = 21.6) compared to a traditional remote controller (M = 58.1), demonstrating a substantial reduction in perceived effort. Mental demand (M = 11.5 vs. 60.5) and frustration (M = 14.0 vs. 54.5) were also markedly lower for our system, confirming clear usability advantages in autonomous text-driven flight control. This project demonstrates a new interaction paradigm: text-to-cinema flight, where diffusion models act as the "creative operator" converting story intentions directly into aerial motion.

Physical Prompt Injection Attacks on Large Vision-Language Models

Authors:Chen Ling, Kai Hu, Hangcheng Liu, Xingshuo Han, Tianwei Zhang, Changhai Ou
Date:2026-01-24 09:13:28

Large Vision-Language Models (LVLMs) are increasingly deployed in real-world intelligent systems for perception and reasoning in open physical environments. While LVLMs are known to be vulnerable to prompt injection attacks, existing methods either require access to input channels or depend on knowledge of user queries, assumptions that rarely hold in practical deployments. We propose the first Physical Prompt Injection Attack (PPIA), a black-box, query-agnostic attack that embeds malicious typographic instructions into physical objects perceivable by the LVLM. PPIA requires no access to the model, its inputs, or internal pipeline, and operates solely through visual observation. It combines offline selection of highly recognizable and semantically effective visual prompts with strategic environment-aware placement guided by spatiotemporal attention, ensuring that the injected prompts are both perceivable and influential on model behavior. We evaluate PPIA across 10 state-of-the-art LVLMs in both simulated and real-world settings on tasks including visual question answering, planning, and navigation, PPIA achieves attack success rates up to 98%, with strong robustness under varying physical conditions such as distance, viewpoint, and illumination. Our code is publicly available at https://github.com/2023cghacker/Physical-Prompt-Injection-Attack.

High-Fidelity Longitudinal Patient Simulation Using Real-World Data

Authors:Yu Akagi, Tomohisa Seki, Hiromasa Ito, Toru Takiguchi, Kazuhiko Ohe, Yoshimasa Kawazoe
Date:2026-01-24 05:27:49

Simulation is a powerful tool for exploring uncertainty. Its potential in clinical medicine is transformative and includes personalized treatment planning and virtual clinical trials. However, simulating patient trajectories is challenging because of complex biological and sociocultural influences. Here, we show that real-world clinical records can be leveraged to empirically model patient timelines. We developed a generative simulator model that takes a patient's history as input and synthesizes fine-grained, realistic future trajectories. The model was pretrained on more than 200 million clinical records. It produced high-fidelity future timelines, closely matching event occurrence rates, laboratory test results, and temporal dynamics in real patient future data. It also accurately estimated future event probabilities, with observed-to-expected ratios consistently near 1.0 across diverse outcomes and time horizons. Our results reveal the untapped value of real-world data in electronic health records and introduce a scalable framework for in silico modeling of clinical care.

Real-Time, Energy-Efficient, Sampling-Based Optimal Control via FPGA Acceleration

Authors:Tanmay Desai, Brian Plancher, R. Iris Bahar
Date:2026-01-23 23:47:09

Autonomous mobile robots (AMRs), used for search-and-rescue and remote exploration, require fast and robust planning and control schemes. Sampling-based approaches for Model Predictive Control, especially approaches based on the Model Predictive Path Integral Control (MPPI) algorithm, have recently proven both to be highly effective for such applications and to map naturally to GPUs for hardware acceleration. However, both GPU and CPU implementations of such algorithms can struggle to meet tight energy and latency budgets on battery-constrained AMR platforms that leverage embedded compute. To address this issue, we present an FPGA-optimized MPPI design that exposes fine-grained parallelism and eliminates synchronization bottlenecks via deep pipelining and parallelism across algorithmic stages. This results in an average 3.1x to 7.5x speedup over optimized implementations on an embedded GPU and CPU, respectively, while simultaneously achieving a 2.5x to 5.4x reduction in energy usage. These results demonstrate that FPGA architectures are a promising direction for energy-efficient and high-performance edge robotics.

Hierarchical Informative Path Planning via Graph Guidance and Trajectory Optimization

Authors:Avraiem Iskandar, Shamak Dutta, Kevin Murrant, Yash Vardhan Pant, Stephen L. Smith
Date:2026-01-23 23:27:20

We study informative path planning (IPP) with travel budgets in cluttered environments, where an agent collects measurements of a latent field modeled as a Gaussian process (GP) to reduce uncertainty at target locations. Graph-based solvers provide global guarantees but assume pre-selected measurement locations, while continuous trajectory optimization supports path-based sensing but is computationally intensive and sensitive to initialization in obstacle-dense settings. We propose a hierarchical framework with three stages: (i) graph-based global planning, (ii) segment-wise budget allocation using geometric and kernel bounds, and (iii) spline-based refinement of each segment with hard constraints and obstacle pruning. By combining global guidance with local refinement, our method achieves lower posterior uncertainty than graph-only and continuous baselines, while running faster than continuous-space solvers (up to 9x faster than gradient-based methods and 20x faster than black-box optimizers) across synthetic cluttered environments and Arctic datasets.

Evaluation of the beam-induced depolarization of the HJET target at the EIC

Authors:A. A. Poblaguev
Date:2026-01-23 23:08:45

The Polarized Atomic Hydrogen Gas Jet Target (HJET) has played a central role in the absolute calibration of proton beam polarization at RHIC and is foreseen as a key element of the hadron polarimetry program at the future Electron--Ion Collider (EIC). The substantially higher beam current, reduced bunch spacing, and shorter bunch length planned for EIC operation motivate a careful reassessment of possible beam-induced depolarization of the jet target. In this paper, the depolarization of ground-state hydrogen atoms caused by the time-dependent magnetic field of the circulating polarized proton beam is quantitatively evaluated. The hydrogen atom is treated as a four-level hyperfine system in a holding magnetic field, and transitions driven by harmonic components of the bunch-induced magnetic field are analyzed using time-dependent quantum-mechanical evolution along atomic trajectories. Numerical tracking of hydrogen atoms through the beam region is performed using nominal EIC beam parameters. It is shown that, for a holding field of $120\,\mathrm{mT}$ (as used at RHIC), the resulting depolarization of the jet target at the EIC is negligibly small, $\lesssim 0.01\%$, and well below the level relevant for EIC polarization accuracy requirements. The stability of this result with respect to plausible variations of the EIC proton beam parameters is also evaluated.

Advancing Improvisation in Human-Robot Construction Collaboration: Taxonomy and Research Roadmap

Authors:David Wireko Atibila, Vineet R. Kamat, Carol C. Menassa
Date:2026-01-23 23:08:34

The construction industry faces productivity stagnation, skilled labor shortages, and safety concerns. While robotic automation offers solutions, construction robots struggle to adapt to unstructured, dynamic sites. Central to this is improvisation, adapting to unexpected situations through creative problem-solving, which remains predominantly human. In construction's unpredictable environments, collaborative human-robot improvisation is essential for workflow continuity. This research develops a six-level taxonomy classifying human-robot collaboration (HRC) based on improvisation capabilities. Through systematic review of 214 articles (2010-2025), we categorize construction robotics across: Manual Work (Level 0), Human-Controlled Execution (Level 1), Adaptive Manipulation (Level 2), Imitation Learning (Level 3), Human-in-Loop BIM Workflow (Level 4), Cloud-Based Knowledge Integration (Level 5), and True Collaborative Improvisation (Level 6). Analysis reveals current research concentrates at lower levels, with critical gaps in experiential learning and limited progression toward collaborative improvisation. A five-dimensional radar framework illustrates progressive evolution of Planning, Cognitive Role, Physical Execution, Learning Capability, and Improvisation, demonstrating how complementary human-robot capabilities create team performance exceeding individual contributions. The research identifies three fundamental barriers: technical limitations in grounding and dialogic reasoning, conceptual gaps between human improvisation and robotics research, and methodological challenges. We recommend future research emphasizing improved human-robot communication via Augmented/Virtual Reality interfaces, large language model integration, and cloud-based knowledge systems to advance toward true collaborative improvisation.

A Unified Kantorovich Duality for Multimarginal Optimal Transport

Authors:Yehya Cheryala, Mokhtar Z. Alaya, Salim Bouzebda
Date:2026-01-23 21:06:46

Multimarginal optimal transport (MOT) has gained increasing attention in recent years, notably due to its relevance in machine learning and statistics, where one seeks to jointly compare and align multiple probability distributions. This paper presents a unified and complete Kantorovich duality theory for MOT problem on general Polish product spaces with bounded continuous cost function. For marginal compact spaces, the duality identity is derived through a convex-analytic reformulation, that identifies the dual problem as a Fenchel-Rockafellar conjugate. We obtain dual attainment and show that optimal potentials may always be chosen in the class of $c$-conjugate families, thereby extending classical two-marginal conjugacy principle into a genuinely multimarginal setting. In non-compact setting, where direct compactness arguments are unavailable, we recover duality via a truncation-tightness procedure based on weak compactness of multimarginal transference plans and boundedness of the cost. We prove that the dual value is preserved under restriction to compact subsets and that admissible dual families can be regularized into uniformly bounded $c$-conjugate potentials. The argument relies on a refined use of $c$-splitting sets and their equivalence with multimarginal $c$-cyclical monotonicity. We then obtain dual attainment and exact primal-dual equality for MOT on arbitrary Polish spaces, together with a canonical representation of optimal dual potentials by $c$-conjugacy. These results provide a structural foundation for further developments in probabilistic and statistical analysis of MOT, including stability, differentiability, and asymptotic theory under marginal perturbations.

VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents

Authors:Zirui Wang, Junyi Zhang, Jiaxin Ge, Long Lian, Letian Fu, Lisa Dunlap, Ken Goldberg, XuDong Wang, Ion Stoica, David M. Chan, Sewon Min, Joseph E. Gonzalez
Date:2026-01-23 18:43:34

Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations. Our experiments reveal notable limitations: models struggle to effectively leverage long context, performing worse with an unbounded history than with truncated windows. Furthermore, we find that several text-based symbolic tasks become substantially harder once rendered visually. However, explicit goal observations, textual feedback, and exploratory demonstrations in partially observable or unknown-dynamics settings for supervised finetuning yield consistent gains, highlighting concrete failure modes and pathways for improving multi-step visual decision-making. Code, data, and models can be found at: https://visgym.github.io/.

On the transportation cost norm on finite metric graphs

Authors:Georges Skandalis, Alain Valette
Date:2026-01-23 16:07:02

For a finite metric graph $X=(V,E,\ell)$, where $V$ is endowed with the shortest path metric, we consider the transportation cost problem associated with the distance $d$ on $V$. Namely, for $f$ a function with total sum 0 on $V$, write $f=\sum_{a,b\in V}P(a,b)(δ_a-δ_b)$ where the transportation plan $P$ satisfies $P(a,b)\geq 0$ for $(a,b)\in V\times V$. The cost of $P$ is $W(P):=\sum_{a,b\in V}P(a,b)d(a,b)$ and the transportation norm of $f$ is $\|f\|_{TC}=\min_P W(P)$ where $P$ runs over all transportation plans for $f$. In this semi-survey paper, we give short proofs for the following statements: 1)There always exists an optimal transportation plan supported in $V_+\times V_-$ where $V_+=\{x\in V: f(x)>0\}$ and $V_-=\{x\in V: f(x)<0\}$. If $X$ is a metric tree, we may moreover assume that this plan involves at most $|Supp(f)|-1$ transports. 2) There always exists an optimal transportation plan supported in the set of edges of $X$. 3) Better, there always exists an optimal transportation plan supported in some spanning tree of $X$. We use this to reprove known formulae for the transportation norm when $X$ is either a tree or a cycle.