planning - 2026-01-28

Information-Theoretic Detection of Bimanual Interactions for Dual-Arm Robot Plan Generation

Authors:Elena Merlo, Marta Lagomarsino, Arash Ajoudani
Date:2026-01-27 17:38:58

Programming by demonstration is a strategy to simplify the robot programming process for non-experts via human demonstrations. However, its adoption for bimanual tasks is an underexplored problem due to the complexity of hand coordination, which also hinders data recording. This paper presents a novel one-shot method for processing a single RGB video of a bimanual task demonstration to generate an execution plan for a dual-arm robotic system. To detect hand coordination policies, we apply Shannon's information theory to analyze the information flow between scene elements and leverage scene graph properties. The generated plan is a modular behavior tree that assumes different structures based on the desired arms coordination. We validated the effectiveness of this framework through multiple subject video demonstrations, which we collected and made open-source, and exploiting data from an external, publicly available dataset. Comparisons with existing methods revealed significant improvements in generating a centralized execution plan for coordinating two-arm systems.

An Interpretable Recommendation Model for Psychometric Data, With an Application to Gerontological Primary Care

Authors:Andre Paulino de Lima, Paula Castro, Suzana Carvalho Vaz de Andrade, Rosa Maria Marcucci, Ruth Caldeira de Melo, Marcelo Garcia Manzato
Date:2026-01-27 17:29:21

There are challenges that must be overcome to make recommender systems useful in healthcare settings. The reasons are varied: the lack of publicly available clinical data, the difficulty that users may have in understanding the reasons why a recommendation was made, the risks that may be involved in following that recommendation, and the uncertainty about its effectiveness. In this work, we address these challenges with a recommendation model that leverages the structure of psychometric data to provide visual explanations that are faithful to the model and interpretable by care professionals. We focus on a narrow healthcare niche, gerontological primary care, to show that the proposed recommendation model can assist the attending professional in the creation of personalised care plans. We report results of a comparative offline performance evaluation of the proposed model on healthcare datasets that were collected by research partners in Brazil, as well as the results of a user study that evaluates the interpretability of the visual explanations the model generates. The results suggest that the proposed model can advance the application of recommender systems in this healthcare niche, which is expected to grow in demand , opportunities, and information technology needs as demographic changes become more pronounced.

SCOPE: Smooth Convex Optimization for Planned Evolution of Deformable Linear Objects

Authors:Ali Jnadi, Hadi Salloum, Yaroslav Kholodov, Alexander Gasnikov, Karam Almaghout
Date:2026-01-27 16:01:56

We present SCOPE, a fast and efficient framework for modeling and manipulating deformable linear objects (DLOs). Unlike conventional energy-based approaches, SCOPE leverages convex approximations to significantly reduce computational cost while maintaining smooth and physically plausible deformations. This trade-off between speed and accuracy makes the method particularly suitable for applications requiring real-time or near-real-time response. The effectiveness of the proposed framework is demonstrated through comprehensive simulation experiments, highlighting its ability to generate smooth shape trajectories under geometric and length constraints.

Localized Latent Editing for Dose-Response Modeling in Botulinum Toxin Injection Planning

Authors:Estèphe Arnaud, Mohamed Daoudi, Pierre Guerreschi
Date:2026-01-27 13:26:29

Botulinum toxin (Botox) injections are the gold standard for managing facial asymmetry and aesthetic rejuvenation, yet determining the optimal dosage remains largely intuitive, often leading to suboptimal outcomes. We propose a localized latent editing framework that simulates Botulinum Toxin injection effects for injection planning through dose-response modeling. Our key contribution is a Region-Specific Latent Axis Discovery method that learns localized muscle relaxation trajectories in StyleGAN2's latent space, enabling precise control over specific facial regions without global side effects. By correlating these localized latent trajectories with injected toxin units, we learn a predictive dose-response model. We rigorously compare two approaches: direct metric regression versus image-based generative simulation on a clinical dataset of N=360 images from 46 patients. On a hold-out test set, our framework demonstrates moderate-to-strong structural correlations for geometric asymmetry metrics, confirming that the generative model correctly captures the direction of morphological changes. While biological variability limits absolute precision, we introduce a hybrid "Human-in-the-Loop" workflow where clinicians interactively refine simulations, bridging the gap between pathological reconstruction and cosmetic planning.

ScenePilot-Bench: A Large-Scale Dataset and Benchmark for Evaluation of Vision-Language Models in Autonomous Driving

Authors:Yujin Wang, Yutong Zheng, Wenxian Fan, Tianyi Wang, Hongqing Chu, Daxin Tian, Bingzhao Gao, Jianqiang Wang, Hong Chen
Date:2026-01-27 13:17:50

In this paper, we introduce ScenePilot-Bench, a large-scale first-person driving benchmark designed to evaluate vision-language models (VLMs) in autonomous driving scenarios. ScenePilot-Bench is built upon ScenePilot-4K, a diverse dataset comprising 3,847 hours of driving videos, annotated with multi-granularity information including scene descriptions, risk assessments, key participant identification, ego trajectories, and camera parameters. The benchmark features a four-axis evaluation suite that assesses VLM capabilities in scene understanding, spatial perception, motion planning, and GPT-Score, with safety-aware metrics and cross-region generalization settings. We benchmark representative VLMs on ScenePilot-Bench, providing empirical analyses that clarify current performance boundaries and identify gaps for driving-oriented reasoning. ScenePilot-Bench offers a comprehensive framework for evaluating and advancing VLMs in safety-critical autonomous driving contexts.

From Scattered to Structured: A Vision for Automating Architectural Knowledge Management

Authors:Jan Keim, Angelika Kaplan
Date:2026-01-27 12:42:16

Software architecture is inherently knowledge-centric. The architectural knowledge is distributed across heterogeneous software artifacts such as requirements documents, design diagrams, code, and documentation, making it difficult for developers to access and utilize this knowledge effectively. Moreover, as systems evolve, inconsistencies frequently emerge between these artifacts, leading to architectural erosion and impeding maintenance activities. We envision an automated pipeline that systematically extracts architectural knowledge from diverse artifacts, links them, identifies and resolves inconsistencies, and consolidates this knowledge into a structured knowledge base. This knowledge base enables critical activities such as architecture conformance checking and change impact analysis, while supporting natural language question-answering to improve access to architectural knowledge. To realize this vision, we plan to develop specialized extractors for different artifact types, design a unified knowledge representation schema, implement consistency checking mechanisms, and integrate retrieval-augmented generation techniques for conversational knowledge access.

Self-Reconfiguration Planning for Deformable Quadrilateral Modular Robots

Authors:Jie Gu, Hongrun Gao, Zhihao Xia, Yirun Sun, Chunxu Tian, Dan Zhang
Date:2026-01-27 11:32:04

For lattice modular self-reconfigurable robots (MSRRs), maintaining stable connections during reconfiguration is crucial for physical feasibility and deployability. This letter presents a novel self-reconfiguration planning algorithm for deformable quadrilateral MSRRs that guarantees stable connection. The method first constructs feasible connect/disconnect actions using a virtual graph representation, and then organizes these actions into a valid execution sequence through a Dependence-based Reverse Tree (DRTree) that resolves interdependencies. We also prove that reconfiguration sequences satisfying motion characteristics exist for any pair of configurations with seven or more modules (excluding linear topologies). Finally, comparisons with a modified BiRRT algorithm highlight the superior efficiency and stability of our approach, while deployment on a physical robotic platform confirms its practical feasibility.

RoamScene3D: Immersive Text-to-3D Scene Generation via Adaptive Object-aware Roaming

Authors:Jisheng Chu, Wenrui Li, Rui Zhao, Wangmeng Zuo, Shifeng Chen, Xiaopeng Fan
Date:2026-01-27 10:10:55

Generating immersive 3D scenes from texts is a core task in computer vision, crucial for applications in virtual reality and game development. Despite the promise of leveraging 2D diffusion priors, existing methods suffer from spatial blindness and rely on predefined trajectories that fail to exploit the inner relationships among salient objects. Consequently, these approaches are unable to comprehend the semantic layout, preventing them from exploring the scene adaptively to infer occluded content. Moreover, current inpainting models operate in 2D image space, struggling to plausibly fill holes caused by camera motion. To address these limitations, we propose RoamScene3D, a novel framework that bridges the gap between semantic guidance and spatial generation. Our method reasons about the semantic relations among objects and produces consistent and photorealistic scenes. Specifically, we employ a vision-language model (VLM) to construct a scene graph that encodes object relations, guiding the camera to perceive salient object boundaries and plan an adaptive roaming trajectory. Furthermore, to mitigate the limitations of static 2D priors, we introduce a Motion-Injected Inpainting model that is fine-tuned on a synthetic panoramic dataset integrating authentic camera trajectories, making it adaptive to camera motion. Extensive experiments demonstrate that with semantic reasoning and geometric constraints, our method significantly outperforms state-of-the-art approaches in producing consistent and photorealistic scenes. Our code is available at https://github.com/JS-CHU/RoamScene3D.

Judgelight: Trajectory-Level Post-Optimization for Multi-Agent Path Finding via Closed-Subwalk Collapsing

Authors:Yimin Tang, Sven Koenig, Erdem Bıyık
Date:2026-01-27 09:20:14

Multi-Agent Path Finding (MAPF) is an NP-hard problem with applications in warehouse automation and multi-robot coordination. Learning-based MAPF solvers offer fast and scalable planning but often produce feasible trajectories that contain unnecessary or oscillatory movements. We propose Judgelight, a post-optimization method that improves trajectory quality after a MAPF solver generates a feasible schedule. Judgelight collapses closed subwalks in agents' trajectories to remove redundant movements while preserving all feasibility constraints. We formalize this process as MAPF-Collapse, prove that it is NP-hard, and present an exact optimization approach by formulating it as integer linear programming (ILP) problem. Experimental results show Judgelight consistently reduces solution cost by around 20%, particularly for learning-based solvers, producing trajectories that are better suited for real-world deployment.

Self-Supervised Path Planning in Unstructured Environments via Global-Guided Differentiable Hard Constraint Projection

Authors:Ziqian Wang, Chenxi Fang, Zhen Zhang
Date:2026-01-27 08:37:21

Deploying deep learning agents for autonomous navigation in unstructured environments faces critical challenges regarding safety, data scarcity, and limited computational resources. Traditional solvers often suffer from high latency, while emerging learning-based approaches struggle to ensure deterministic feasibility. To bridge the gap from embodied to embedded intelligence, we propose a self-supervised framework incorporating a differentiable hard constraint projection layer for runtime assurance. To mitigate data scarcity, we construct a Global-Guided Artificial Potential Field (G-APF), which provides dense supervision signals without manual labeling. To enforce actuator limitations and geometric constraints efficiently, we employ an adaptive neural projection layer, which iteratively rectifies the coarse network output onto the feasible manifold. Extensive benchmarks on a test set of 20,000 scenarios demonstrate an 88.75\% success rate, substantiating the enhanced operational safety. Closed-loop experiments in CARLA further validate the physical realizability of the planned paths under dynamic constraints. Furthermore, deployment verification on an NVIDIA Jetson Orin NX confirms an inference latency of 94 ms, showing real-time feasibility on resource-constrained embedded hardware. This framework offers a generalized paradigm for embedding physical laws into neural architectures, providing a viable direction for solving constrained optimization in mechatronics. Source code is available at: https://github.com/wzq-13/SSHC.git.

GeoSSA: Geometric Sparrow Search Algorithm for UAV Path Planning and Engineering Design Optimization

Authors:Junhao Wei, Wenxuan Zhu, Qingyang Xu, Yanxiao Li, Yifu Zhao, Zikun Li, Ran Zhang, Yanzhao Gu, Jinhong Song, Yapeng Wang, Zhiwen Wang, Ngai Cheong, Sio-Kei Im, Xu Yang
Date:2026-01-27 08:26:24

The Sparrow Search Algorithm (SSA), characterized by its simple structure and ease of implementation, nevertheless suffers from an insufficient balance between exploration and exploitation, making it prone to premature convergence and slow optimization progress. To address these shortcomings, this paper proposes a Geometric Sparrow Search Algorithm (GeoSSA). By integrating Good Nodes Set initialization, a Sine-Cosine Enhanced Producer position update strategy, and a Triangular-Walk Enhanced Edge Sparrow update strategy, GeoSSA significantly improves the global exploration ability, local exploitation efficiency, and convergence stability of the original SSA. To thoroughly validate the effectiveness of GeoSSA, we conducted ablation studies, qualitative analysis, and comparative experiments on 23 benchmark functions against state-of-the-art algorithms. Experimental results show that GeoSSA achieves the best or near-best performance in terms of average fitness, standard deviation, Wilcoxon tests, and Friedman rankings, with an Overall Effectiveness ($OE$) of 95.65\%. Its overall performance is significantly superior to all compared algorithms. In three-dimensional UAV path planning tasks, GeoSSA demonstrates excellent stability and superior path quality. In four categories of engineering design optimization problems, GeoSSA consistently attains the highest solution accuracy and strongest stability. GeoSSA not only exhibits outstanding global optimization performance on standard benchmark functions but also shows strong robustness and generalization ability in practical applications such as UAV path planning and engineering design. Therefore, GeoSSA provides an efficient and reliable solution framework for complex optimization problems.

The Psychological Science of Artificial Intelligence: A Rapidly Emerging Field of Psychology

Authors:Zheng Yan, Ru-Yuan Zhang
Date:2026-01-27 08:21:51

The psychological science of artificial intelligence (AI) can be broadly defined as an emerging field of psychology that examines all AI-related mental and behavioral processes from the perspective of psychology. This field has been growing exponentially in the recent decade. This review synthesizes the existing literature on the psychological science of AI with a goal to provide a comprehensive conceptual framework for planning, conducting, and assessing scientific research in the field. It consists of six parts, starting with an overview of the entire field of the psychological science of artificial intelligence, then synthesizing the literature in each of the four specific areas (i.e., Psychology of designing AI, psychology of using AI, AI for examining psychological processes, and AI for advancing psychological methods), and concluding with an outlook on the field in the future.

Perception-to-Pursuit: Track-Centric Temporal Reasoning for Open-World Drone Detection and Autonomous Chasing

Authors:Venkatakrishna Reddy Oruganti
Date:2026-01-27 07:57:29

Autonomous drone pursuit requires not only detecting drones but also predicting their trajectories in a manner that enables kinematically feasible interception. Existing tracking methods optimize for prediction accuracy but ignore pursuit feasibility, resulting in trajectories that are physically impossible to intercept 99.9% of the time. We propose Perception-to-Pursuit (P2P), a track-centric temporal reasoning framework that bridges detection and actionable pursuit planning. Our method represents drone motion as compact 8-dimensional tokens capturing velocity, acceleration, scale, and smoothness, enabling a 12-frame causal transformer to reason about future behavior. We introduce the Intercept Success Rate (ISR) metric to measure pursuit feasibility under realistic interceptor constraints. Evaluated on the Anti-UAV-RGBT dataset with 226 real drone sequences, P2P achieves 28.12 pixel average displacement error and 0.597 ISR, representing a 77% improvement in trajectory prediction and 597x improvement in pursuit feasibility over tracking-only baselines, while maintaining perfect drone classification accuracy (100%). Our work demonstrates that temporal reasoning over motion patterns enables both accurate prediction and actionable pursuit planning.

Instance-Guided Radar Depth Estimation for 3D Object Detection

Authors:Chen-Chou Lo, Patrick Vandewalle
Date:2026-01-27 07:53:24

Accurate depth estimation is fundamental to 3D perception in autonomous driving, supporting tasks such as detection, tracking, and motion planning. However, monocular camera-based 3D detection suffers from depth ambiguity and reduced robustness under challenging conditions. Radar provides complementary advantages such as resilience to poor lighting and adverse weather, but its sparsity and low resolution limit its direct use in detection frameworks. This motivates the need for effective Radar-camera fusion with improved preprocessing and depth estimation strategies. We propose an end-to-end framework that enhances monocular 3D object detection through two key components. First, we introduce InstaRadar, an instance segmentation-guided expansion method that leverages pre-trained segmentation masks to enhance Radar density and semantic alignment, producing a more structured representation. InstaRadar achieves state-of-the-art results in Radar-guided depth estimation, showing its effectiveness in generating high-quality depth features. Second, we integrate the pre-trained RCDPT into the BEVDepth framework as a replacement for its depth module. With InstaRadar-enhanced inputs, the RCDPT integration consistently improves 3D detection performance. Overall, these components yield steady gains over the baseline BEVDepth model, demonstrating the effectiveness of InstaRadar and the advantage of explicit depth supervision in 3D object detection. Although the framework lags behind Radar-camera fusion models that directly extract BEV features, since Radar serves only as guidance rather than an independent feature stream, this limitation highlights potential for improvement. Future work will extend InstaRadar to point cloud-like representations and integrate a dedicated Radar branch with temporal cues for enhanced BEV fusion.

LightSBB-M: Bridging Schrödinger and Bass for Generative Diffusion Modeling

Authors:Alexandre Alouadi, Pierre Henry-Labordère, Grégoire Loeper, Othmane Mazhar, Huyên Pham, Nizar Touzi
Date:2026-01-27 07:50:59

The Schrodinger Bridge and Bass (SBB) formulation, which jointly controls drift and volatility, is an established extension of the classical Schrodinger Bridge (SB). Building on this framework, we introduce LightSBB-M, an algorithm that computes the optimal SBB transport plan in only a few iterations. The method exploits a dual representation of the SBB objective to obtain analytic expressions for the optimal drift and volatility, and it incorporates a tunable parameter beta greater than zero that interpolates between pure drift (the Schrodinger Bridge) and pure volatility (Bass martingale transport). We show that LightSBB-M achieves the lowest 2-Wasserstein distance on synthetic datasets against state-of-the-art SB and diffusion baselines with up to 32 percent improvement. We also illustrate the generative capability of the framework on an unpaired image-to-image translation task (adult to child faces in FFHQ). These findings demonstrate that LightSBB-M provides a scalable, high-fidelity SBB solver that outperforms existing SB and diffusion baselines across both synthetic and real-world generative tasks. The code is available at https://github.com/alexouadi/LightSBB-M.

Curiosity Driven Knowledge Retrieval for Mobile Agents

Authors:Sijia Li, Xiaoyu Tan, Shahir Ali, Niels Schmidt, Gengchen Ma, Xihe Qiu
Date:2026-01-27 07:46:05

Mobile agents have made progress toward reliable smartphone automation, yet performance in complex applications remains limited by incomplete knowledge and weak generalization to unseen environments. We introduce a curiosity driven knowledge retrieval framework that formalizes uncertainty during execution as a curiosity score. When this score exceeds a threshold, the system retrieves external information from documentation, code repositories, and historical trajectories. Retrieved content is organized into structured AppCards, which encode functional semantics, parameter conventions, interface mappings, and interaction patterns. During execution, an enhanced agent selectively integrates relevant AppCards into its reasoning process, thereby compensating for knowledge blind spots and improving planning reliability. Evaluation on the AndroidWorld benchmark shows consistent improvements across backbones, with an average gain of six percentage points and a new state of the art success rate of 88.8\% when combined with GPT-5. Analysis indicates that AppCards are particularly effective for multi step and cross application tasks, while improvements depend on the backbone model. Case studies further confirm that AppCards reduce ambiguity, shorten exploration, and support stable execution trajectories. Task trajectories are publicly available at https://lisalsj.github.io/Droidrun-appcard/.

A Collaborative Extended Reality Prototype for 3D Surgical Planning and Visualization

Authors:Shi Qiu, Ruiyang Li, Qixuan Liu, Yuqi Tong, Yue Qiu, Yinqiao Wang, Yan Li, Chi-Wing Fu, Pheng-Ann Heng
Date:2026-01-27 07:42:51

We present a collaborative extended reality (XR) prototype for 3D surgical planning and visualization. Our system consists of three key modules: XR-based immersive surgical planning, cloud-based data management, and coordinated stereoscopic 3D displays for interactive visualization. We describe the overall workflow, core functionalities, implementations and setups. By conducting user studies on a liver resection surgical planning case, we demonstrate the effectiveness of our prototype and provide practical insights to inspire future advances in medical XR collaboration.

Bridging Visual and Wireless Sensing: A Unified Radiation Field for 3D Radio Map Construction

Authors:Chaozheng Wen, Jingwen Tong, Zehong Lin, Chenghong Bian, Jun Zhang
Date:2026-01-27 05:35:50

The emerging applications of next-generation wireless networks (e.g., immersive 3D communication, low-altitude networks, and integrated sensing and communication) necessitate high-fidelity environmental intelligence. 3D radio maps have emerged as a critical tool for this purpose, enabling spectrum-aware planning and environment-aware sensing by bridging the gap between physical environments and electromagnetic signal propagation. However, constructing accurate 3D radio maps requires fine-grained 3D geometric information and a profound understanding of electromagnetic wave propagation. Existing approaches typically treat optical and wireless knowledge as distinct modalities, failing to exploit the fundamental physical principles governing both light and electromagnetic propagation. To bridge this gap, we propose URF-GS, a unified radio-optical radiation field representation framework for accurate and generalizable 3D radio map construction based on 3D Gaussian splatting (3D-GS) and inverse rendering. By fusing visual and wireless sensing observations, URF-GS recovers scene geometry and material properties while accurately predicting radio signal behavior at arbitrary transmitter-receiver (Tx-Rx) configurations. Experimental results demonstrate that URF-GS achieves up to a 24.7% improvement in spatial spectrum prediction accuracy and a 10x increase in sample efficiency for 3D radio map construction compared with neural radiance field (NeRF)-based methods. This work establishes a foundation for next-generation wireless networks by integrating perception, interaction, and communication through holistic radiation field reconstruction.

Before Smelling the Video: A Two-Stage Pipeline for Interpretable Video-to-Scent Plans

Authors:Kaicheng Wang, Kevin Zhongyang Shao, Ruiqi Chen, Sep Makhsous, Denise Wilson
Date:2026-01-27 05:05:29

Olfactory cues can enhance immersion in interactive media, yet smell remains rare because it is difficult to author and synchronize with dynamic video. Prior olfactory interfaces rely on designer triggers and fixed event-to-odor mappings that do not scale to unconstrained content. This work examines whether semantic planning for smell is intelligible to people before physical scent delivery. We present a video-to-scent planning pipeline that separates visual semantic extraction using a vision-language model from semantic-to-olfactory inference using a large language model. Two survey studies compare system-generated scent plans with over-inclusive and naive baselines. Results show consistent preference for plans that prioritize perceptually salient cues and align scent changes with visible actions, supporting semantic planning as a foundation for future olfactory media systems.

Optimal Motion Planning for Two Square Robots in a Rectilinear Environment

Authors:Pankaj K. Agarwal, Mark de Berg, Benjamin Holmgren, Alex Steiger, Martijn Struijs
Date:2026-01-27 03:24:22

Let $\mathcal{W} \subset \mathbb{R}^2$ be a rectilinear polygonal environment (that is, a rectilinear polygon potentially with holes) with a total of $n$ vertices, and let $A,B$ be two robots, each modeled as an axis-aligned unit square, that can move rectilinearly inside $\mathcal{W}$. The goal is to compute a collision-free motion plan $\boldsymbolπ$, that is, a motion plan that continuously moves $A$ from $s_A$ to $t_A$ and $B$ from $s_B$ to $t_B$ so that $A$ and $B$ remain inside $\mathcal{W}$ and do not collide with each other during the motion. We study two variants of this problem which are focused additionally on the optimality of $\boldsymbolπ$, and obtain the following results. 1. Min-Sum: Here the goal is to compute a motion plan that minimizes the sum of the lengths of the paths of the robots. We present an $O(n^4\log{n})$-time algorithm for computing an optimal solution to the min-sum problem. This is the first polynomial-time algorithm to compute an optimal, collision-free motion of two robots amid obstacles in a planar polygonal environment. 2. Min-Makespan: Here the robots can move with at most unit speed, and the goal is to compute a motion plan that minimizes the maximum time taken by a robot to reach its target location. We prove that the min-makespan variant is NP-hard.

One Global Model, Many Behaviors: Stockout-Aware Feature Engineering and Dynamic Scaling for Multi-Horizon Retail Demand Forecasting with a Cost-Aware Ordering Policy (VN2 Winner Report)

Authors:Bartosz Szabłowski
Date:2026-01-26 19:36:52

Inventory planning for retail chains requires translating demand forecasts into ordering decisions, including asymmetric shortages and holding costs. The VN2 Inventory Planning Challenge formalizes this setting as a weekly decision-making cycle with a two-week product delivery lead time, where the total cost is defined as the shortage cost plus the holding cost. This report presents the winning VN2 solution: a two-stage predict-then-optimize pipeline that combines a single global multi-horizon forecasting model with a cost-aware ordering policy. The forecasting model is trained in a global paradigm, jointly using all available time series. A gradient-boosted decision tree (GBDT) model implemented in CatBoost is used as the base learner. The model incorporates stockout-aware feature engineering to address censored demand during out-of-stock periods, per-series scaling to focus learning on time-series patterns rather than absolute levels, and time-based observation weights to reflect shifts in demand patterns. In the decision stage, inventory is projected to the start of the delivery week, and a target stock level is calculated that explicitly trades off shortage and holding costs. Evaluated by the official competition simulation in six rounds, the solution achieved first place by combining a strong global forecasting model with a lightweight cost-aware policy. Although developed for the VN2 setting, the proposed approach can be extended to real-world applications and additional operational constraints.

Advances and Innovations in the Multi-Agent Robotic System (MARS) Challenge

Authors:Li Kang, Heng Zhou, Xiufeng Song, Rui Li, Bruno N. Y. Chen, Ziye Wang, Ximeng Meng, Stone Tao, Yiran Qin, Xiaohong Liu, Ruimao Zhang, Lei Bai, Yilun Du, Hao Su, Philip Torr, Zhenfei Yin, Ruihao Gong, Yejun Zeng, Fengjun Zhong, Shenghao Jin, Jinyang Guo, Xianglong Liu, Xiaojun Jia, Tianqi Shan, Wenqi Ren, Simeng Qin, Jialing Yang, Xiaoyu Ma, Tianxing Chen, Zixuan Li, Zijian Cai, Yan Qin, Yusen Qin, Qiangyu Chen, Kaixuan Wang, Zhaoming Han, Yao Mu, Ping Luo, Yuanqi Yao, Haoming Song, Jan-Nico Zaech, Fabien Despinoy, Danda Pani Paudel, Luc Van Gool
Date:2026-01-26 17:56:19

Recent advancements in multimodal large language models and vision-languageaction models have significantly driven progress in Embodied AI. As the field transitions toward more complex task scenarios, multi-agent system frameworks are becoming essential for achieving scalable, efficient, and collaborative solutions. This shift is fueled by three primary factors: increasing agent capabilities, enhancing system efficiency through task delegation, and enabling advanced human-agent interactions. To address the challenges posed by multi-agent collaboration, we propose the Multi-Agent Robotic System (MARS) Challenge, held at the NeurIPS 2025 Workshop on SpaVLE. The competition focuses on two critical areas: planning and control, where participants explore multi-agent embodied planning using vision-language models (VLMs) to coordinate tasks and policy execution to perform robotic manipulation in dynamic environments. By evaluating solutions submitted by participants, the challenge provides valuable insights into the design and coordination of embodied multi-agent systems, contributing to the future development of advanced collaborative AI systems.

Detection of high-frequency gravitational waves using SRF cavities

Authors:M. Wenskat, B. Giaccone, J. Branlard, V. Chouhan, C. Dokuyucu, L. Fischer, I. Gonin, A. Grassellino, W. Hillert, T. Khabiboulline, T. Krokotsch, F. Ludwig, G. Marconato, A. Melnychuk, G. Moortgat-Pick, A. Muhs, A. Netepenko, Y. Orlov, M. Paulsen, K. Peters, L. Pfeiffer, S. Posen, O. Pronitchev, H. Schlarb
Date:2026-01-26 17:42:03

Today, apart from some isolated R&D efforts, there are no gravitational wave (GW) experiments, yet which explore a large part of the vast frequency range above the LIGO/Virgo band. It is planned to establish an experiment at Deutsches Elektronen-Synchrotron (DESY) and at the Superconducting Quantum Materials and Systems (SQMS) Center at Fermi National Accelerator Laboratory (Fermilab) to search for high-frequency GWs in the frequency range of 10 kHz to 100 MHz. The basic idea is to use superconducting radiofrequency (SRF) cavities to detect tiny harmonic deformations induced by GWs which change the boundary conditions of the oscillating electromagnetic field. This paper summarizes the challenging environmental boundary requirements, and the R&D to operate a cavity using a low level RF (LLRF) system which pushes beyond state-of-the-art accuracy and resolutions and a seismic noise mitigated cryostat at 1.8 K. The focus of this paper is the warm and cold commissioning of a prototype cavity, built 20 years ago during the MAGO collaboration, and its first measurement in our collaborative research project.

Scalable Transit Delay Prediction at City Scale: A Systematic Approach with Multi-Resolution Feature Engineering and Deep Learning

Authors:Emna Boudabbous, Mohamed Karaa, Lokman Sboui, Julio Montecinos, Omar Alam
Date:2026-01-26 14:30:50

Urban bus transit agencies need reliable, network-wide delay predictions to provide accurate arrival information to passengers and support real-time operational control. Accurate predictions help passengers plan their trips, reduce waiting time, and allow operations staff to adjust headways, dispatch extra vehicles, and manage disruptions. Although real-time feeds such as GTFS-Realtime (GTFS-RT) are now widely available, most existing delay prediction systems handle only a few routes, depend on hand-crafted features, and offer little guidance on how to design a scalable, reusable architecture. We present a city-scale prediction pipeline that combines multi-resolution feature engineering, dimensionality reduction, and deep learning. The framework generates 1,683 spatiotemporal features by exploring 23 aggregation combinations over H3 cells, routes, segments, and temporal patterns, and compresses them into 83 components using Adaptive PCA while preserving 95% of the variance. To avoid the "giant cluster" problem that occurs when dense urban areas fall into a single H3 region, we introduce a hybrid H3+topology clustering method that yields 12 balanced route clusters (coefficient of variation 0.608) and enables efficient distributed training. We compare five model architectures on six months of bus operations from the Société de transport de Montréal (STM) network in Montréal. A global LSTM with cluster-aware features achieves the best trade-off between accuracy and efficiency, outperforming transformer models by 18 to 52% while using 275 times fewer parameters. We also report multi-level evaluation at the elementary segment, segment, and trip level with walk-forward validation and latency analysis, showing that the proposed pipeline is suitable for real-time, city-scale deployment and can be reused for other networks with limited adaptation.

GUIGuard: Toward a General Framework for Privacy-Preserving GUI Agents

Authors:Yanxi Wang, Zhiling Zhang, Wenbo Zhou, Weiming Zhang, Jie Zhang, Qiannan Zhu, Yu Shi, Shuxin Zheng, Jiyan He
Date:2026-01-26 11:33:40

GUI agents enable end-to-end automation through direct perception of and interaction with on-screen interfaces. However, these agents frequently access interfaces containing sensitive personal information, and screenshots are often transmitted to remote models, creating substantial privacy risks. These risks are particularly severe in GUI workflows: GUIs expose richer, more accessible private information, and privacy risks depend on interaction trajectories across sequential scenes. We propose GUIGuard, a three-stage framework for privacy-preserving GUI agents: (1) privacy recognition, (2) privacy protection, and (3) task execution under protection. We further construct GUIGuard-Bench, a cross-platform benchmark with 630 trajectories and 13,830 screenshots, annotated with region-level privacy grounding and fine-grained labels of risk level, privacy category, and task necessity. Evaluations reveal that existing agents exhibit limited privacy recognition, with state-of-the-art models achieving only 13.3% accuracy on Android and 1.4% on PC. Under privacy protection, task-planning semantics can still be maintained, with closed-source models showing stronger semantic consistency than open-source ones. Case studies on MobileWorld show that carefully designed protection strategies achieve higher task accuracy while preserving privacy. Our results highlight privacy recognition as a critical bottleneck for practical GUI agents. Project: https://futuresis.github.io/GUIGuard-page/

Towards a Comprehensive Understanding of Planetary Systems through Population-Level, Large-Scale Surveys

Authors:Francisco J. Pozuelos, Pedro J. Amado, Jesús Aceituno, Marina Centenera-Merino, Stefan Cikota, Javier Flores, Julius Göhring, Sergio León-Saval, Kalaga Madhav, Giuseppe Morello, Abani Nayak, Jose L. Ortiz, David Pérez-Medialdea, María Isabel Ruiz-López, Miguel A. Sánchez-Carrasco, Alejandro Sánchez-López
Date:2026-01-26 10:59:13

Over the past three decades, exoplanet research has delivered an extensive census of planets spanning a wide range of masses, sizes, and orbital configurations. Despite this progress, the physical interpretation of these populations remains severely limited, as precise constraints on planetary masses, interior structures, and atmospheres are available only for a small, highly selected subset of targets. As a result, most known exoplanets remain physically ambiguous, preventing the construction of robust population-level trends and limiting our understanding of planet formation, evolution, and habitability. In the coming decades, missions such as PLATO, Earth 2.0, and the Nancy Grace Roman Space Telescope will dramatically expand the number of exoplanets detected. However, without a corresponding capability to characterise planetary masses and atmospheres at scale, these discoveries will remain largely detection-driven. Current and planned facilities, including JWST and ELT-class instruments, excel at detailed studies of individual systems but are intrinsically unsuited for large, homogeneous surveys. This white paper identifies population-level physical characterisation as a fundamental science challenge for the 2040s and motivates the need for a new observational paradigm. We outline how photonics-enabled, modular telescope architectures can deliver the survey speed, stability, and scalability required to jointly probe planetary interiors and atmospheres across statistically meaningful samples, thereby enabling a comprehensive and physically grounded understanding of planetary systems.

TC-IDM: Grounding Video Generation for Executable Zero-shot Robot Motion

Authors:Weishi Mi, Yong Bao, Xiaowei Chi, Xiaozhu Ju, Zhiyuan Qin, Kuangzhi Ge, Kai Tang, Peidong Jia, Shanghang Zhang, Jian Tang
Date:2026-01-26 10:06:56

The vision-language-action (VLA) paradigm has enabled powerful robotic control by leveraging vision-language models, but its reliance on large-scale, high-quality robot data limits its generalization. Generative world models offer a promising alternative for general-purpose embodied AI, yet a critical gap remains between their pixel-level plans and physically executable actions. To this end, we propose the Tool-Centric Inverse Dynamics Model (TC-IDM). By focusing on the tool's imagined trajectory as synthesized by the world model, TC-IDM establishes a robust intermediate representation that bridges the gap between visual planning and physical control. TC-IDM extracts the tool's point cloud trajectories via segmentation and 3D motion estimation from generated videos. Considering diverse tool attributes, our architecture employs decoupled action heads to project these planned trajectories into 6-DoF end-effector motions and corresponding control signals. This plan-and-translate paradigm not only supports a wide range of end-effectors but also significantly improves viewpoint invariance. Furthermore, it exhibits strong generalization capabilities across long-horizon and out-of-distribution tasks, including interacting with deformable objects. In real-world evaluations, the world model with TC-IDM achieves an average success rate of 61.11 percent, with 77.7 percent on simple tasks and 38.46 percent on zero-shot deformable object tasks. It substantially outperforms end-to-end VLA-style baselines and other inverse dynamics models.

Revisiting Aerial Scene Classification on the AID Benchmark

Authors:Subhajeet Das, Susmita Ghosh, Abhiroop Chatterjee
Date:2026-01-26 08:39:02

Aerial images play a vital role in urban planning and environmental preservation, as they consist of various structures, representing different types of buildings, forests, mountains, and unoccupied lands. Due to its heterogeneous nature, developing robust models for scene classification remains a challenge. In this study, we conduct a literature review of various machine learning methods for aerial image classification. Our survey covers a range of approaches from handcrafted features (e.g., SIFT, LBP) to traditional CNNs (e.g., VGG, GoogLeNet), and advanced deep hybrid networks. In this connection, we have also designed Aerial-Y-Net, a spatial attention-enhanced CNN with multi-scale feature fusion mechanism, which acts as an attention-based model and helps us to better understand the complexities of aerial images. Evaluated on the AID dataset, our model achieves 91.72% accuracy, outperforming several baseline architectures.

PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR

Authors:James Burgess, Jan N. Hansen, Duo Peng, Yuhui Zhang, Alejandro Lozano, Min Woo Sun, Emma Lundberg, Serena Yeung-Levy
Date:2026-01-26 06:46:16

Search agents are language models (LMs) that reason and search knowledge bases (or the web) to answer questions; recent methods supervise only the final answer accuracy using reinforcement learning with verifiable rewards (RLVR). Most RLVR search agents tackle general-domain QA, which limits their relevance to technical AI systems in science, engineering, and medicine. In this work we propose training agents to search and reason over scientific papers -- this tests technical question-answering, it is directly relevant to real scientists, and the capabilities will be crucial to future AI Scientist systems. Concretely, we release a search corpus of 16 million biomedical paper abstracts and construct a challenging factoid QA dataset called PaperSearchQA with 60k samples answerable from the corpus, along with benchmarks. We train search agents in this environment to outperform non-RL retrieval baselines; we also perform further quantitative analysis and observe interesting agent behaviors like planning, reasoning, and self-verification. Our corpus, datasets, and benchmarks are usable with the popular Search-R1 codebase for RLVR training and released on https://huggingface.co/collections/jmhb/papersearchqa. Finally, our data creation methods are scalable and easily extendable to other scientific domains.

\textsc{NaVIDA}: Vision-Language Navigation with Inverse Dynamics Augmentation

Authors:Weiye Zhu, Zekai Zhang, Xiangchen Wang, Hewei Pan, Teng Wang, Tiantian Geng, Rongtao Xu, Feng Zheng
Date:2026-01-26 06:16:17

Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions and act coherently in visually rich environments. However, most existing methods rely on reactive state-action mappings without explicitly modeling how actions causally transform subsequent visual observations. Lacking such vision-action causality, agents cannot anticipate the visual changes induced by its own actions, leading to unstable behaviors, weak generalization, and cumulative error along trajectory. To address these issues, we introduce \textsc{NaVIDA} (\textbf{Nav}igation with \textbf{I}nverse \textbf{D}ynamics \textbf{A}ugmentation), a unified VLN framework that couples policy learning with action-grounded visual dynamics and adaptive execution. \textsc{NaVIDA} augments training with chunk-based inverse-dynamics supervision to learn causal relationship between visual changes and corresponding actions. To structure this supervision and extend the effective planning range, \textsc{NaVIDA} employs hierarchical probabilistic action chunking (HPAC), which organizes trajectories into multi-step chunks and provides discriminative, longer-range visual-change cues. To further curb error accumulation and stabilize behavior at inference, an entropy-guided mechanism adaptively sets the execution horizon of action chunks. Extensive experiments show that \textsc{NaVIDA} achieves superior navigation performance compared to state-of-the-art methods with fewer parameters (3B vs. 8B). Real-world robot evaluations further validate the practical feasibility and effectiveness of our approach. Code and data will be available upon acceptance.