planning - 2026-02-25

Efficient Hierarchical Any-Angle Path Planning on Multi-Resolution 3D Grids

Authors:Victor Reijgwart, Cesar Cadena, Roland Siegwart, Lionel Ott

Date:2026-02-24 18:18:36

Hierarchical, multi-resolution volumetric mapping approaches are widely used to represent large and complex environments as they can efficiently capture their occupancy and connectivity information. Yet widely used path planning methods such as sampling and trajectory optimization do not exploit this explicit connectivity information, and search-based methods such as A* suffer from scalability issues in large-scale high-resolution maps. In many applications, Euclidean shortest paths form the underpinning of the navigation system. For such applications, any-angle planning methods, which find optimal paths by connecting corners of obstacles with straight-line segments, provide a simple and efficient solution. In this paper, we present a method that has the optimality and completeness properties of any-angle planners while overcoming computational tractability issues common to search-based methods by exploiting multi-resolution representations. Extensive experiments on real and synthetic environments demonstrate the proposed approach's solution quality and speed, outperforming even sampling-based methods. The framework is open-sourced to allow the robotics and planning community to build on our research.

OCR-Agent: Agentic OCR with Capability and Memory Reflection

Authors:Shimin Wen, Zeyu Zhang, Xingdou Bian, Hongjie Zhu, Lulu He, Layi Shama, Daji Ergu, Ying Cai

Date:2026-02-24 16:10:27

Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks through iterative optimization methods.However, these models generally lack effective self-correction mechanisms, making it difficult for them to independently rectify cognitive biases. Consequently, during multi-turn revisions, they often fall into repetitive and ineffective attempts, failing to achieve stable improvements in answer quality.To address this issue, we propose a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection. This framework guides the model to first diagnose errors and generate a correction plan via Capability Reflection, then leverage Memory Reflection to review past attempts to avoid repetition and explore new solutions, and finally, optimize the answer through rigorous re-reasoning. Experiments on the challenging OCRBench v2 benchmark show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by +2.0 on English and +1.2 on Chinese subsets, while achieving state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5) - surpassing even larger fine-tuned models. Our method demonstrates that structured, self-aware reflection can significantly enhance VLMs' reasoning robustness without additional training. Code: https://github.com/AIGeeksGroup/OCR-Agent.

From Perception to Action: An Interactive Benchmark for Vision Reasoning

Authors:Yuhao Wu, Maojia Song, Yihuai Lan, Lei Wang, Zhiqiang Hu, Yao Xiao, Heng Zhou, Weihua Zheng, Dylan Raharja, Soujanya Poria, Roy Ka-Wei Lee

Date:2026-02-24 15:33:02

Understanding the physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet, prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive 3D, physics-driven testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans and cannot robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.

Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks

Authors:Sanjay Haresh, Daniel Dijkman, Apratim Bhattacharyya, Roland Memisevic

Date:2026-02-24 15:30:55

Many dexterous manipulation tasks are non-markovian in nature, yet little attention has been paid to this fact in the recent upsurge of the vision-language-action (VLA) paradigm. Although they are successful in bringing internet-scale semantic understanding to robotics, existing VLAs are primarily "stateless" and struggle with memory-dependent long horizon tasks. In this work, we explore a way to impart both spatial and temporal memory to a VLA by incorporating a language scratchpad. The scratchpad makes it possible to memorize task-specific information, such as object positions, and it allows the model to keep track of a plan and progress towards subgoals within that plan. We evaluate this approach on a split of memory-dependent tasks from the ClevrSkills environment, on MemoryBench, as well as on a challenging real-world pick-and-place task. We show that incorporating a language scratchpad significantly improves generalization on these tasks for both non-recurrent and recurrent models.

Telemetry-Based Server Selection in the Quantum Internet via Cross-Layer Runtime Estimation

Authors:Masaki Nagai, Hideaki Kawaguchi, Shin Nishio, Takahiko Satoh

Date:2026-02-24 15:25:55

The Quantum Internet will allow clients to delegate quantum workloads to remote servers over heterogeneous networks, but choosing the server that minimizes end-to-end execution time is difficult because server processing, feedforward classical communication, and entanglement distribution can overlap in protocol-dependent ways and shift the runtime bottleneck. We propose $T_{\max}$, a lightweight runtime score that sums coarse telemetry from multiple layers to obtain a conservative ranking for online server selection without calibrating weights for each deployment. Using NetSquid discrete-event simulations of a modified parameter-blind VQE (PB-VQE) workload, we evaluate $T_{\max}$ on pools of 10,000 heterogeneous candidates (selecting among up to 100 per decision) across crossover and bottleneck-dominated regimes, including temporal jitter scenarios and jobs with multiple shots. $T_{\max}$ achieves single-digit mean regret normalized by the oracle (below 10%) in both regimes and remains in the single-digit range under classical communication latency jitter for multi-shot jobs, while performance degrades for single-shot jobs under severe jitter. To connect performance to deployment planning, we derive an operating map based on requirements relating distance and entanglement rate requirements to protocol level counts, quantify how simple multiuser contention shifts the crossover, and use Sobol global sensitivity analysis to identify regime-dependent bottlenecks. These findings suggest that simple cross-layer telemetry can enable practical server selection while providing actionable provisioning guidance for emerging Quantum Internet services.

EKF-Based Depth Camera and Deep Learning Fusion for UAV-Person Distance Estimation and Following in SAR Operations

Authors:Luka Šiktar, Branimir Ćaran, Bojan Šekoranja, Marko Švaco

Date:2026-02-24 14:37:36

Search and rescue (SAR) operations require rapid responses to save lives or property. Unmanned Aerial Vehicles (UAVs) equipped with vision-based systems support these missions through prior terrain investigation or real-time assistance during the mission itself. Vision-based UAV frameworks aid human search tasks by detecting and recognizing specific individuals, then tracking and following them while maintaining a safe distance. A key safety requirement for UAV following is the accurate estimation of the distance between camera and target object under real-world conditions, achieved by fusing multiple image modalities. UAVs with deep learning-based vision systems offer a new approach to the planning and execution of SAR operations. As part of the system for automatic people detection and face recognition using deep learning, in this paper we present the fusion of depth camera measurements and monocular camera-to-body distance estimation for robust tracking and following. Deep learning-based filtering of depth camera data and estimation of camera-to-body distance from a monocular camera are achieved with YOLO-pose, enabling real-time fusion of depth information using the Extended Kalman Filter (EKF) algorithm. The proposed subsystem, designed for use in drones, estimates and measures the distance between the depth camera and the human body keypoints, to maintain the safe distance between the drone and the human target. Our system provides an accurate estimated distance, which has been validated against motion capture ground truth data. The system has been tested in real time indoors, where it reduces the average errors, root mean square error (RMSE) and standard deviations of distance estimation up to 15,3\% in three tested scenarios.

KCFRC: Kinematic Collision-Aware Foothold Reachability Criteria for Legged Locomotion

Authors:Lei Ye, Haibo Gao, Huaiguang Yang, Peng Xu, Haoyu Wang, Tie Liu, Junqi Shan, Zongquan Deng, Liang Ding

Date:2026-02-24 12:46:34

Legged robots face significant challenges in navigating complex environments, as they require precise real-time decisions for foothold selection and contact planning. While existing research has explored methods to select footholds based on terrain geometry or kinematics, a critical gap remains: few existing methods efficiently validate the existence of a non-collision swing trajectory. This paper addresses this gap by introducing KCFRC, a novel approach for efficient foothold reachability analysis. We first formally define the foothold reachability problem and establish a sufficient condition for foothold reachability. Based on this condition, we develop the KCFRC algorithm, which enables robots to validate foothold reachability in real time. Our experimental results demonstrate that KCFRC achieves remarkable time efficiency, completing foothold reachability checks for a single leg across 900 potential footholds in an average of 2 ms. Furthermore, we show that KCFRC can accelerate trajectory optimization and is particularly beneficial for contact planning in confined spaces, enhancing the adaptability and robustness of legged robots in challenging environments.

Fast-Response Balancing Capacity of Alkaline Electrolyzers

Authors:Marvin Dorn, Julian Hoffmann, André Weber, Veit Hagenmeyer

Date:2026-02-24 12:31:08

The energy transition requires flexible technologies to maintain grid stability, and electrolyzers are playing an increasingly important role in meeting this need. While previous studies often question the dynamic capabilities of large-scale alkaline electrolyzer systems, we assess their potential to provide balancing services using real manufacturer data. Unlike common approaches, we propose the decoupling between the total electrolyzer power and a smaller fractions of power actually offered on balancing markets. Adapting an existing methodology, we analyze alkaline electrolyzer systems and extend the assessment to Germany and Europe. Our results show that large-scale electrolyzers are technically capable of delivering fast-response balancing services, with significantly lower dynamic requirements than previously assumed. The planned electrolyzers in Germany could cover the entire balancing capacity market, potentially saving around 13 % of their electricity costs, excluding energy balancing revenues. The decoupling also resolves part of the trade-off for electrolyzer manufacturers, enabling the design of less dynamic but more stable systems.

Confidence Distributions and Related Themes

Authors:Nils Lid Hjort, Tore Schweder

Date:2026-02-24 12:18:48

This is the guest editors' general introduction to a Special Issue of the Journal of Statistical Planning and Inference, dedicated to confidence distributions and related themes. Confidence distributions (CDs) are distributions for parameters of interest, constructed via a statistical model after analysing the data. As such they serve the same purpose for the frequentist statisticians as the posterior distributions for the Bayesians. There have been several attempts in the literature to put up a clear theory for such confidence distributions, from Fisher's fiducial inference and onwards. There are certain obstacles and difficulties involved in these attempts, both conceptually and operationally, which have contributed to the CDs being slow in entering statistical mainstream. Recently there is a renewed surge of interest in CDs and various related themes, however, reflected in both series of new methodological research, advanced applications to substantive sciences, and dissemination and communication via workshops and conferences. The present special issue of the JSPI is a collection of papers emanating from the {\it Inference With Confidence} workshop in Oslo, May 2015. Several of the papers appearing here were first presented at that workshop. The present collection includes however also new research papers from other scholars in the field.

POMDPPlanners: Open-Source Package for POMDP Planning

Authors:Yaacov Pariente, Vadim Indelman

Date:2026-02-24 11:50:04

We present POMDPPlanners, an open-source Python package for empirical evaluation of Partially Observable Markov Decision Process (POMDP) planning algorithms. The package integrates state-of-the-art planning algorithms, a suite of benchmark environments with safety-critical variants, automated hyperparameter optimization via Optuna, persistent caching with failure recovery, and configurable parallel simulation -- reducing the overhead of extensive simulation studies. POMDPPlanners is designed to enable scalable, reproducible research on decision-making under uncertainty, with particular emphasis on risk-sensitive settings where standard toolkits fall short.

Infrared spectropolarimetry of a C-class solar flare footpoint plasma -- I. Spectral features and forward modelling

Authors:Z. Vashalomidze, C. Quintero Noda, T. V. Zaqarashvili, M. Benko, D. Kuridze, P. Gömöry, J. Rybák, S. Lomineishvili, M. Collados, C. Denker, M. Verma, C. Kuckein, A. Asensio Ramos

Date:2026-02-24 11:36:04

We performed high-spatial resolution spectropolarimetric observations of active region NOAA 13363 during a C-class flare with the Gregor Infrared Spectrograph (GRIS) on 16 July 2023. We examine the coupling between the photosphere and the chromosphere, studying the polarimetric signals during a period that encompasses the decaying phase of a C-class flare and the appearance of a new C-class flare at the same location. We focus on the analysis of various spectral lines. In particular, we study the Si I 10827 Å, Ca I 10833.4 Å, Na I 10834.9 Å, and Ca I 10838.9 Å photospheric lines, as well as the He I 10830 Å triplet. GRIS data revealed the presence of flare-related red- and blueshifted spectral line components, reaching Doppler velocities up to 90 km/s, and complex Si I profiles where the He i spectral line contribution is blueshifted. In contrast, the photospheric Ca i and Na i transitions remained unchanged, indicating that the flare did not modify the physical conditions of the lower photosphere. We combined that information with simultaneous imaging in the Ca ii H line and TiO band with the improved High-resolution Fast Imager (HiFI+), finding that the flare emission did not affect the inverse granulation or nearby plage, in agreement with the results from GRIS. We also complement the previous studies with a forward modelling computation, concluding that the He I spectral line emission reflects a complex response of the flaring chromosphere. Radiative excitation from coronal EUV irradiation, energy deposition by flare-accelerated electrons, and dynamic field-aligned plasma flows likely act together to produce the observed supersonic downflows and upflows. We plan to expand these findings through inversions of the He I 10830 Å triplet signals in the future.

VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving

Authors:Jie Wang, Guang Li, Zhijian Huang, Chenxu Dang, Hangjun Ye, Yahong Han, Long Chen

Date:2026-02-24 11:33:44

The significance of cross-view 3D geometric modeling capabilities for autonomous driving is self-evident, yet existing Vision-Language Models (VLMs) inherently lack this capability, resulting in their mediocre performance. While some promising approaches attempt to mitigate this by constructing Q&A data for auxiliary training, they still fail to fundamentally equip VLMs with the ability to comprehensively handle diverse evaluation protocols. We thus chart a new course, advocating for the infusion of VLMs with the cross-view geometric grounding of mature 3D foundation models, closing this critical capability gap in autonomous driving. In this spirit, we propose a novel architecture, VGGDrive, which empowers Vision-language models with cross-view Geometric Grounding for autonomous Driving. Concretely, to bridge the cross-view 3D geometric features from the frozen visual 3D model with the VLM's 2D visual features, we introduce a plug-and-play Cross-View 3D Geometric Enabler (CVGE). The CVGE decouples the base VLM architecture and effectively empowers the VLM with 3D features through a hierarchical adaptive injection mechanism. Extensive experiments show that VGGDrive enhances base VLM performance across five autonomous driving benchmarks, including tasks like cross-view risk perception, motion prediction, and trajectory planning. It's our belief that mature 3D foundation models can empower autonomous driving tasks through effective integration, and we hope our initial exploration demonstrates the potential of this paradigm to the autonomous driving community.

Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning

Authors:Xu Wan, Yansheng Wang, Wenqi Huang, Mingyang Sun

Date:2026-02-24 09:35:43

Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which directly hinders learning efficiency on difficult samples during large language models post-training. In this paper, we introduce Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework to improve the data efficiency in large language models post-training. It dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while holding a lower bound guarantee for policy improvement. Extensive experiments further demonstrate that BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks. Crucially, BAPO successfully resolves 40.7% of problems that base models consistently fail to solve.

WeirNet: A Large-Scale 3D CFD Benchmark for Geometric Surrogate Modeling of Piano Key Weirs

Authors:Lisa Lüddecke, Michael Hohmann, Sebastian Eilermann, Jan Tillmann-Mumm, Pezhman Pourabdollah, Mario Oertel, Oliver Niggemann

Date:2026-02-24 09:19:28

Reliable prediction of hydraulic performance is challenging for Piano Key Weir (PKW) design because discharge capacity depends on three-dimensional geometry and operating conditions. Surrogate models can accelerate hydraulic-structure design, but progress is limited by scarce large, well-documented datasets that jointly capture geometric variation, operating conditions, and functional performance. This study presents WeirNet, a large 3D CFD benchmark dataset for geometric surrogate modeling of PKWs. WeirNet contains 3,794 parametric, feasibility-constrained rectangular and trapezoidal PKW geometries, each scheduled at 19 discharge conditions using a consistent free-surface OpenFOAM workflow, resulting in 71,387 completed simulations that form the benchmark and with complete discharge coefficient labels. The dataset is released as multiple modalities compact parametric descriptors, watertight surface meshes and high-resolution point clouds together with standardized tasks and in-distribution and out-of-distribution splits. Representative surrogate families are benchmarked for discharge coefficient prediction. Tree-based regressors on parametric descriptors achieve the best overall accuracy, while point- and mesh-based models remain competitive and offer parameterization-agnostic inference. All surrogates evaluate in milliseconds per sample, providing orders-of-magnitude speedups over CFD runtimes. Out-of-distribution results identify geometry shift as the dominant failure mode compared to unseen discharge values, and data-efficiency experiments show diminishing returns beyond roughly 60% of the training data. By publicly releasing the dataset together with simulation setups and evaluation pipelines, WeirNet establishes a reproducible framework for data-driven hydraulic modeling and enables faster exploration of PKW designs during the early stages of hydraulic planning.

Autonomous Laboratory Agent via Customized Domain-Specific Language Model and Modular AI Interface

Authors:Zhuo Diao, Kouma Matsumoto, Linfeng Hou, Hayato Yamashita, Masayuki Abe

Date:2026-02-24 08:20:00

We introduce a system architecture that addresses a fundamental challenge in deploying language-model agents for autonomous control of scientific instrumentation: ensuring reliability in safety-critical environments. The framework integrates probabilistic reasoning by domain-specialized language models with deterministic execution layers that enforce constraints through structured validation and modular orchestration. By separating intent interpretation, experimental planning, and command verification, the architecture translates high-level scientific goals into verifiable experimental actions. We demonstrate this approach in real-time atomic-resolution scanning probe microscopy experiments operated at room temperature, where the system autonomously generates control strategies, invokes corrective modules, and maintains stable operation under experimentally challenging conditions. Quantitative evaluations show that domain-adapted small language models achieve high routing robustness and command accuracy while operating on consumer-grade hardware. Beyond a specific instrument, the framework establishes a general computational principle for deploying language-model agents in safety-critical experimental workflows, providing a pathway toward scalable autonomous laboratories.

Robot Local Planner: A Periodic Sampling-Based Motion Planner with Minimal Waypoints for Home Environments

Authors:Keisuke Takeshita, Takahiro Yamazaki, Tomohiro Ono, Takashi Yamamoto

Date:2026-02-24 07:46:31

The objective of this study is to enable fast and safe manipulation tasks in home environments. Specifically, we aim to develop a system that can recognize its surroundings and identify target objects while in motion, enabling it to plan and execute actions accordingly. We propose a periodic sampling-based whole-body trajectory planning method, called the "Robot Local Planner (RLP)." This method leverages unique features of home environments to enhance computational efficiency, motion optimality, and robustness against recognition and control errors, all while ensuring safety. The RLP minimizes computation time by planning with minimal waypoints and generating safe trajectories. Furthermore, overall motion optimality is improved by periodically executing trajectory planning to select more optimal motions. This approach incorporates inverse kinematics that are robust to base position errors, further enhancing robustness. Evaluation experiments demonstrated that the RLP outperformed existing methods in terms of motion planning time, motion duration, and robustness, confirming its effectiveness in home environments. Moreover, application experiments using a tidy-up task achieved high success rates and short operation times, thereby underscoring its practical feasibility.

TrajGPT-R: Generating Urban Mobility Trajectory with Reinforcement Learning-Enhanced Generative Pre-trained Transformer

Authors:Jiawei Wang, Chuang Yang, Jiawei Yong, Xiaohang Xu, Hongjun Wang, Noboru Koshizuka, Shintaro Fukushima, Ryosuke Shibasaki, Renhe Jiang

Date:2026-02-24 07:44:19

Mobility trajectories are essential for understanding urban dynamics and enhancing urban planning, yet access to such data is frequently hindered by privacy concerns. This research introduces a transformative framework for generating large-scale urban mobility trajectories, employing a novel application of a transformer-based model pre-trained and fine-tuned through a two-phase process. Initially, trajectory generation is conceptualized as an offline reinforcement learning (RL) problem, with a significant reduction in vocabulary space achieved during tokenization. The integration of Inverse Reinforcement Learning (IRL) allows for the capture of trajectory-wise reward signals, leveraging historical data to infer individual mobility preferences. Subsequently, the pre-trained model is fine-tuned using the constructed reward model, effectively addressing the challenges inherent in traditional RL-based autoregressive methods, such as long-term credit assignment and handling of sparse reward environments. Comprehensive evaluations on multiple datasets illustrate that our framework markedly surpasses existing models in terms of reliability and diversity. Our findings not only advance the field of urban mobility modeling but also provide a robust methodology for simulating urban data, with significant implications for traffic management and urban development planning. The implementation is publicly available at https://github.com/Wangjw6/TrajGPT_R.

SurgAtt-Tracker: Online Surgical Attention Tracking via Temporal Proposal Reranking and Motion-Aware Refinement

Authors:Rulin Zhou, Guankun Wang, An Wang, Yujie Ma, Lixin Ouyang, Bolin Cui, Junyan Li, Chaowei Zhu, Mingyang Li, Ming Chen, Xiaopin Zhong, Peng Lu, Jiankun Wang, Xianming Liu, Hongliang Ren

Date:2026-02-24 07:30:51

Accurate and stable field-of-view (FoV) guidance is critical for safe and efficient minimally invasive surgery, yet existing approaches often conflate visual attention estimation with downstream camera control or rely on direct object-centric assumptions. In this work, we formulate surgical attention tracking as a spatio-temporal learning problem and model surgeon focus as a dense attention heatmap, enabling continuous and interpretable frame-wise FoV guidance. We propose SurgAtt-Tracker, a holistic framework that robustly tracks surgical attention by exploiting temporal coherence through proposal-level reranking and motion-aware refinement, rather than direct regression. To support systematic training and evaluation, we introduce SurgAtt-1.16M, a large-scale benchmark with a clinically grounded annotation protocol that enables comprehensive heatmap-based attention analysis across procedures and institutions. Extensive experiments on multiple surgical datasets demonstrate that SurgAtt-Tracker consistently achieves state-of-the-art performance and strong robustness under occlusion, multi-instrument interference, and cross-domain settings. Beyond attention tracking, our approach provides a frame-wise FoV guidance signal that can directly support downstream robotic FoV planning and automatic camera control.

Conflict-Based Search for Multi-Agent Path Finding with Elevators

Authors:Haitong He, Xuemian Wu, Shizhe Zhao, Zhongqiang Ren

Date:2026-02-24 03:29:27

This paper investigates a problem called Multi-Agent Path Finding with Elevators (MAPF-E), which seeks conflict-free paths for multiple agents from their start to goal locations that may locate on different floors, and the agents can use elevators to travel between floors. The existence of elevators complicates the interaction among the agents and introduces new challenges to the planning. On the one hand, elevators can cause many conflicts among the agents due to its relatively long traversal time across floors, especially when many agents need to reach a different floor. On the other hand, the planner has to reason in a larger state space including the states of the elevators, besides the locations of the agents.

SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens

Authors:Anindita Ghosh, Vladislav Golyanik, Taku Komura, Philipp Slusallek, Christian Theobalt, Rishabh Dabral

Date:2026-02-24 02:09:12

Synthesizing text-driven 3D human motion within realistic scenes requires learning both semantic intent ("walk to the couch") and physical feasibility (e.g., avoiding collisions). Current methods use generative frameworks that simultaneously learn high-level planning and low-level contact reasoning, and rely on computationally expensive 3D scene data such as point clouds or voxel occupancy grids. We propose SceMoS, a scene-aware motion synthesis framework that shows that structured 2D scene representations can serve as a powerful alternative to full 3D supervision in physically grounded motion synthesis. SceMoS disentangles global planning from local execution using lightweight 2D cues and relying on (1) a text-conditioned autoregressive global motion planner that operates on a bird's-eye-view (BEV) image rendered from an elevated corner of the scene, encoded with DINOv2 features, as the scene representation, and (2) a geometry-grounded motion tokenizer trained via a conditional VQ-VAE, that uses 2D local scene heightmap, thus embedding surface physics directly into a discrete vocabulary. This 2D factorization reaches an efficiency-fidelity trade-off: BEV semantics capture spatial layout and affordance for global reasoning, while local heightmaps enforce fine-grained physical adherence without full 3D volumetric reasoning. SceMoS achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark, reducing the number of trainable parameters for scene encoding by over 50%, showing that 2D scene cues can effectively ground 3D human-scene interaction.

Diffusion Modulation via Environment Mechanism Modeling for Planning

Authors:Hanping Zhang, Yuhong Guo

Date:2026-02-23 23:41:22

Diffusion models have shown promising capabilities in trajectory generation for planning in offline reinforcement learning (RL). However, conventional diffusion-based planning methods often fail to account for the fact that generating trajectories in RL requires unique consistency between transitions to ensure coherence in real environments. This oversight can result in considerable discrepancies between the generated trajectories and the underlying mechanisms of a real environment. To address this problem, we propose a novel diffusion-based planning method, termed as Diffusion Modulation via Environment Mechanism Modeling (DMEMM). DMEMM modulates diffusion model training by incorporating key RL environment mechanisms, particularly transition dynamics and reward functions. Experimental results demonstrate that DMEMM achieves state-of-the-art performance for planning with offline reinforcement learning.

Directly from Alpha to Omega: Controllable End-to-End Vector Floor Plan Generation

Authors:Shidong Wang, Renato Pajarola

Date:2026-02-23 21:31:08

Automated floor plan generation aims to create residential layouts by arranging rooms within a given boundary, balancing topological, geometric, and aesthetic considerations. The existing methods typically use a multi-step pipeline with intermediate representations to decompose the prediction process into several sub-tasks, limiting model flexibility and imposing predefined solution paths. This often results in unreasonable outputs when applied to data unsuitable for these predefined paths, making it challenging for these methods to match human designers, who do not restrict themselves to a specific set of design workflows. To address these limitations, we introduce CE2EPlan, a controllable end-to-end topology- and geometry-enhanced diffusion model that removes restrictions on the generative process of AI design tools. Instead, it enables the model to learn how to design floor plans directly from data, capturing a wide range of solution paths from input boundaries to complete layouts. Extensive experiments demonstrate that our method surpasses all existing approaches using the multi-step pipeline, delivering higher-quality results with enhanced user control and greater diversity in output, bringing AI design tools closer to the versatility of human designers.

Learning Physical Principles from Interaction: Self-Evolving Planning via Test-Time Memory

Authors:Haoyang Li, Yang You, Hao Su, Leonidas Guibas

Date:2026-02-23 20:18:35

Reliable object manipulation requires understanding physical properties that vary across objects and environments. Vision-language model (VLM) planners can reason about friction and stability in general terms; however, they often cannot predict how a specific ball will roll on a particular surface or which stone will provide a stable foundation without direct experience. We present PhysMem, a memory framework that enables VLM robot planners to learn physical principles from interaction at test time, without updating model parameters. The system records experiences, generates candidate hypotheses, and verifies them through targeted interaction before promoting validated knowledge to guide future decisions. A central design choice is verification before application: the system tests hypotheses against new observations rather than applying retrieved experience directly, reducing rigid reliance on prior experience when physical conditions change. We evaluate PhysMem on three real-world manipulation tasks and simulation benchmarks across four VLM backbones. On a controlled brick insertion task, principled abstraction achieves 76% success compared to 23% for direct experience retrieval, and real-world experiments show consistent improvement over 30-minute deployment sessions.

Quantifying the Expectation-Realisation Gap for Agentic AI Systems

Authors:Sebastian Lobentanzer

Date:2026-02-23 19:16:30

Agentic AI systems are deployed with expectations of substantial productivity gains, yet rigorous empirical evidence reveals systematic discrepancies between pre-deployment expectations and post-deployment outcomes. We review controlled trials and independent validations across software engineering, clinical documentation, and clinical decision support to quantify this expectation-realisation gap. In software development, experienced developers expected a 24% speedup from AI tools but were slowed by 19% -- a 43 percentage-point calibration error. In clinical documentation, vendor claims of multi-minute time savings contrast with measured reductions of less than one minute per note, and one widely deployed tool showed no statistically significant effect. In clinical decision support, externally validated performance falls substantially below developer-reported metrics. These shortfalls are driven by workflow integration friction, verification burden, measurement construct mismatches, and systematic heterogeneity in treatment effects. The evidence motivates structured planning frameworks that require explicit, quantified benefit expectations with human oversight costs factored in.

Simulation-Ready Cluttered Scene Estimation via Physics-aware Joint Shape and Pose Optimization

Authors:Wei-Cheng Huang, Jiaheng Han, Xiaohan Ye, Zherong Pan, Kris Hauser

Date:2026-02-23 18:58:24

Estimating simulation-ready scenes from real-world observations is crucial for downstream planning and policy learning tasks. Regretfully, existing methods struggle in cluttered environments, often exhibiting prohibitive computational cost, poor robustness, and restricted generality when scaling to multiple interacting objects. We propose a unified optimization-based formulation for real-to-sim scene estimation that jointly recovers the shapes and poses of multiple rigid objects under physical constraints. Our method is built on two key technical innovations. First, we leverage the recently introduced shape-differentiable contact model, whose global differentiability permits joint optimization over object geometry and pose while modeling inter-object contacts. Second, we exploit the structured sparsity of the augmented Lagrangian Hessian to derive an efficient linear system solver whose computational cost scales favorably with scene complexity. Building on this formulation, we develop an end-to-end real-to-sim scene estimation pipeline that integrates learning-based object initialization, physics-constrained joint shape-pose optimization, and differentiable texture refinement. Experiments on cluttered scenes with up to 5 objects and 22 convex hulls demonstrate that our approach robustly reconstructs physically valid, simulation-ready object shapes and poses.

On a discrete max-plus transportation problem

Authors:Sergio Mayorga, Eugene Stepanov, Pedro Barrios

Date:2026-02-23 18:47:18

We provide an explicit algorithm to solve the idempotent analogue of the discrete Monge-Kantorovich optimal mass transportation problem with the usual real number field replaced by the tropical (max-plus) semiring, in which addition is defined as the maximum and product is defined as usual addition, with minus infinity and zero playing the roles of additive and multiplicative identities. Such a problem may be naturally called tropical or "max-plus" optimal transportation problem. We show that the solutions to the latter, called the optimal tropical plans, may not correspond to perfect matchings even if the data (max-plus probability measures) have all weights equal to zero, in contrast with the classical discrete optimal transportation analogue, where perfect matching optimal plans in similar situations always exist. Nevertheless, in some randomized situation the existence of perfect matching optimal tropical plans may occur rather frequently. At last, we prove that the uniqueness of solutions of the optimal tropical transportation problem is quite rare.

NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning

Authors:Jiahui Fu, Junyu Nan, Lingfeng Sun, Hongyu Li, Jianing Qian, Jennifer L. Barry, Kris Kitani, George Konidaris

Date:2026-02-23 18:35:18

Solving long-horizon tasks requires robots to integrate high-level semantic reasoning with low-level physical interaction. While vision-language models (VLMs) and video generation models can decompose tasks and imagine outcomes, they often lack the physical grounding necessary for real-world execution. We introduce NovaPlan, a hierarchical framework that unifies closed-loop VLM and video planning with geometrically grounded robot execution for zero-shot long-horizon manipulation. At the high level, a VLM planner decomposes tasks into sub-goals and monitors robot execution in a closed loop, enabling the system to recover from single-step failures through autonomous re-planning. To compute low-level robot actions, we extract and utilize both task-relevant object keypoints and human hand poses as kinematic priors from the generated videos, and employ a switching mechanism to choose the better one as a reference for robot actions, maintaining stable execution even under heavy occlusion or depth inaccuracy. We demonstrate the effectiveness of NovaPlan on three long-horizon tasks and the Functional Manipulation Benchmark (FMB). Our results show that NovaPlan can perform complex assembly tasks and exhibit dexterous error recovery behaviors without any prior demonstrations or training. Project page: https://nova-plan.github.io/

HeatPrompt: Zero-Shot Vision-Language Modeling of Urban Heat Demand from Satellite Images

Authors:Kundan Thota, Xuanhao Mu, Thorsten Schlachter, Veit Hagenmeyer

Date:2026-02-23 17:22:54

Accurate heat-demand maps play a crucial role in decarbonizing space heating, yet most municipalities lack detailed building-level data needed to calculate them. We introduce HeatPrompt, a zero-shot vision-language energy modeling framework that estimates annual heat demand using semantic features extracted from satellite images, basic Geographic Information System (GIS), and building-level features. We feed pretrained Large Vision Language Models (VLMs) with a domain-specific prompt to act as an energy planner and extract the visual attributes such as roof age, building density, etc, from the RGB satellite image that correspond to the thermal load. A Multi-Layer Perceptron (MLP) regressor trained on these captions shows an $R^2$ uplift of 93.7% and shrinks the mean absolute error (MAE) by 30% compared to the baseline model. Qualitative analysis shows that high-impact tokens align with high-demand zones, offering lightweight support for heat planning in data-scarce regions.

MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Autonomous Driving

Authors:Junli Wang, Xueyi Liu, Yinan Zheng, Zebing Xing, Pengfei Li, Guang Li, Kun Ma, Guang Chen, Hangjun Ye, Zhongpu Xia, Long Chen, Qichao Zhang

Date:2026-02-23 17:17:26

Generative models have shown great potential in trajectory planning. Recent studies demonstrate that anchor-guided generative models are effective in modeling the uncertainty of driving behaviors and improving overall performance. However, these methods rely on discrete anchor vocabularies that must sufficiently cover the trajectory distribution during testing to ensure robustness, inducing an inherent trade-off between vocabulary size and model performance. To overcome this limitation, we propose MeanFuser, an end-to-end autonomous driving method that enhances both efficiency and robustness through three key designs. (1) We introduce Gaussian Mixture Noise (GMN) to guide generative sampling, enabling a continuous representation of the trajectory space and eliminating the dependency on discrete anchor vocabularies. (2) We adapt ``MeanFlow Identity" to end-to-end planning, which models the mean velocity field between GMN and trajectory distribution instead of the instantaneous velocity field used in vanilla flow matching methods, effectively eliminating numerical errors from ODE solvers and significantly accelerating inference. (3) We design a lightweight Adaptive Reconstruction Module (ARM) that enables the model to implicitly select from all sampled proposals or reconstruct a new trajectory when none is satisfactory via attention weights. Experiments on the NAVSIM closed-loop benchmark demonstrate that MeanFuser achieves outstanding performance without the supervision of the PDM Score. and exceptional inference efficiency, offering a robust and efficient solution for end-to-end autonomous driving. Our code and model are available at https://github.com/wjl2244/MeanFuser.

A Quantum Internet Protocol Suite Beyond Layering

Authors:Angela Sara Cacciapuoti, Marcello Caleffi

Date:2026-02-23 16:03:30

Layering, the protocol organization principle underpinning the classical Internet, is ill-suited to the Quantum Internet, built around entanglement, which is non-local and stateful. This paper proposes a quantum-native organizational principle based on dynamic composition, which replaces static layering with a distributed orchestration fabric driven by the node local state and in-band control. Each node runs a Dynamic Kernel that i) constructs a local PoA of candidate steps to advance a service intent, and ii) executes the PoA by composing atomic micro-protocols into context-aware procedures (the meta-protocols). Quantum packets carry an in-band control-field (the meta-header) containing the service intent and an append-only list of action-commit records, termed as stamps. Successive nodes exploit this minimal, authoritative history to construct their local PoAs. As quantum packets progress, these local commits collectively induce a network-wide, direct acyclic graph that certifies end-to-end service fulfillment, without requiring global synchronization. In contrast to classical encapsulation, the proposed suite enforces order by certification: dependency-aware local scheduling decides what may run at a certain node, stamps certify what did run and constrain subsequent planning. By embedding procedural control within the quantum packet, the design ensures coherence and consistency between entanglement-state evolution and control-flow, preventing divergence between resource state ad protocol logic, while remaining MP-agnostic and implementation-decoupled. The resulting suite is modular, adaptable to entanglement dynamics, and scalable. It operates correctly with or without optional control-plane hints. Indeed, when present, hints can steer QoS policies, without changing semantics. We argue that dynamic composition is the organizing principle required for a truly quantum-native Internet.