planning - 2026-02-26

Petri Net Relaxation for Infeasibility Explanation and Sequential Task Planning

Authors:Nguyen Cong Nhat Le, John G. Rogers, Claire N. Bonial, Neil T. Dantam

Date:2026-02-25 16:39:50

Plans often change due to changes in the situation or our understanding of the situation. Sometimes, a feasible plan may not even exist, and identifying such infeasibilities is useful to determine when requirements need adjustment. Common planning approaches focus on efficient one-shot planning in feasible cases rather than updating domains or detecting infeasibility. We propose a Petri net reachability relaxation to enable robust invariant synthesis, efficient goal-unreachability detection, and helpful infeasibility explanations. We further leverage incremental constraint solvers to support goal and constraint updates. Empirically, compared to baselines, our system produces a comparable number of invariants, detects up to 2 times more infeasibilities, performs competitively in one-shot planning, and outperforms in sequential plan updates in the tested domains.
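
For readers unfamiliar with the formalism, the following is a minimal sketch of standard Petri net firing semantics with an exhaustive reachability search on a tiny bounded net. It is not the relaxation proposed in the paper. The net, the place names, and the helper functions are hypothetical and serve only to illustrate what it means for a goal marking to be unreachable.

```python
from collections import deque

# Hypothetical toy net: each transition maps to (tokens consumed, tokens produced).
TRANSITIONS = {
    "pick": ({"at_shelf": 1}, {"holding": 1}),
    "place": ({"holding": 1}, {"at_table": 1}),
}


def enabled(marking, pre):
    """A transition is enabled when every input place holds enough tokens."""
    return all(marking.get(p, 0) >= n for p, n in pre.items())


def fire(marking, pre, post):
    """Return the marking obtained by consuming `pre` and producing `post`."""
    new = dict(marking)
    for p, n in pre.items():
        new[p] -= n
    for p, n in post.items():
        new[p] = new.get(p, 0) + n
    return new


def reachable(initial, goal):
    """Breadth-first search over markings; terminates only for bounded nets."""
    def key(m):
        return tuple(sorted((p, n) for p, n in m.items() if n > 0))

    seen = {key(initial)}
    queue = deque([initial])
    while queue:
        m = queue.popleft()
        # The goal is covered when every goal place holds at least the required tokens.
        if all(m.get(p, 0) >= n for p, n in goal.items()):
            return True
        for pre, post in TRANSITIONS.values():
            if enabled(m, pre):
                nxt = fire(m, pre, post)
                if key(nxt) not in seen:
                    seen.add(key(nxt))
                    queue.append(nxt)
    return False


start = {"at_shelf": 1}
print(reachable(start, {"at_table": 1}))   # covered after firing pick, then place
print(reachable(start, {"at_drawer": 1}))  # no transition ever marks at_drawer
```

Exhaustive search like this scales poorly and does not terminate on unbounded nets, which is one reason a relaxation of the reachability problem is attractive.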

Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos

Authors:Matthew Strong, Wei-Jer Chang, Quentin Herau, Jiezhi Yang, Yihan Hu, Chensheng Peng, Wei Zhan

Date:2026-02-25 16:38:53

Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. Recent advances in large feedforward spatial models demonstrate that point maps and ego-motion can be inferred in a single forward pass, suggesting a promising direction for scalable driving perception. We therefore propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos. Unlike prior self-supervised approaches that focus primarily on frame-to-frame consistency, we posit that safe and reactive driving depends critically on temporal context. To this end, we leverage a feedforward architecture equipped with a lightweight autoregressive module, trained using multi-modal supervisory signals that guide the model to jointly predict current and future point maps, camera poses, semantic segmentation, and motion masks. Multi-modal teachers provide sequence-level pseudo-supervision, enabling our model, LFG, to learn a unified pseudo-4D representation from raw YouTube videos without poses, labels, or LiDAR. The resulting encoder not only transfers effectively to downstream autonomous driving planning on the NAVSIM benchmark, surpassing multi-camera and LiDAR baselines with only a single monocular camera, but also yields strong performance when evaluated on a range of semantic, geometric, and qualitative motion prediction tasks. These geometry- and motion-aware features position LFG as a compelling video-centric foundation model for autonomous driving.

Visual Milestone Planning in a Hybrid Development Context

Authors:Eduardo Miranda

Date:2026-02-25 16:25:18

This paper explains the Visual Milestone Planning (VMP) method using an agile vocabulary to facilitate its adoption by agile practitioners as a front end for a hybrid development process. VMP is a visual and collaborative planning approach that promotes a shared understanding of the work approach and commitment through the direct manipulation by team members of the reified planning constructs involved in the development of the plan. Once the product backlog has been established and relevant milestones identified, a novel construct called the milestone planning matrix is used to document the allocation of product backlog items to milestones. The milestone due dates are later determined by grouping sticky notes representing the work to be performed into time-boxes called work packages and accommodating them on a resource- and time-scaled scheduling canvas, much as one would in a game of Tetris.

PatchDenoiser: Parameter-efficient multi-scale patch learning and fusion denoiser for medical images

Authors:Jitindra Fartiyal, Pedro Freire, Sergei K. Turitsyn, Sergei G. Solovski

Date:2026-02-25 15:08:43

Medical images are essential for diagnosis, treatment planning, and research, but their quality is often degraded by noise from low-dose acquisition, patient motion, or scanner limitations, affecting both clinical interpretation and downstream analysis. Traditional filtering approaches often over-smooth and lose fine anatomical details, while deep learning methods, including CNNs, GANs, and transformers, may struggle to preserve such details or require large, computationally expensive models, limiting clinical practicality. We propose PatchDenoiser, a lightweight, energy-efficient multi-scale patch-based denoising framework. It decomposes denoising into local texture extraction and global context aggregation, fused via a spatially aware patch fusion strategy. This design enables effective noise suppression while preserving fine structural and anatomical details. PatchDenoiser is ultra-lightweight, with far fewer parameters and lower computational complexity than CNN-, GAN-, and transformer-based denoisers. On the 2016 Mayo Low-Dose CT dataset, PatchDenoiser consistently outperforms state-of-the-art CNN- and GAN-based methods in PSNR and SSIM. It is robust to variations in slice thickness, reconstruction kernels, and HU windows, generalizes across scanners without fine-tuning, and reduces parameters by ~9x and energy consumption per inference by ~27x compared with conventional CNN denoisers. PatchDenoiser thus provides a practical, scalable, and computationally efficient solution for medical image denoising, balancing performance, robustness, and clinical deployability.

Dream-SLAM: Dreaming the Unseen for Active SLAM in Dynamic Environments

Authors:Xiangqi Meng, Pengxu Hou, Zhenjun Zhao, Javier Civera, Daniel Cremers, Hesheng Wang, Haoang Li

Date:2026-02-25 14:48:49

In addition to the core tasks of simultaneous localization and mapping (SLAM), active SLAM involves generating robot actions that enable effective and efficient exploration of unknown environments. However, existing active SLAM pipelines are limited by three main factors. First, they inherit the restrictions of the underlying SLAM modules that they may be using. Second, their motion planning strategies are typically shortsighted and lack long-term vision. Third, most approaches struggle to handle dynamic scenes. To address these limitations, we propose a novel monocular active SLAM method, Dream-SLAM, which is based on dreaming cross-spatio-temporal images and semantically plausible structures of partially observed dynamic environments. The generated cross-spatio-temporal images are fused with real observations to mitigate noise and data incompleteness, leading to more accurate camera pose estimation and a more coherent 3D scene representation. Furthermore, we integrate dreamed and observed scene structures to enable long-horizon planning, producing farsighted trajectories that promote efficient and thorough exploration. Extensive experiments on both public and self-collected datasets demonstrate that Dream-SLAM outperforms state-of-the-art methods in localization accuracy, mapping quality, and exploration efficiency. Source code will be publicly available upon paper acceptance.

MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving

Authors:Lingjun Zhang, Yujian Yuan, Changjie Wu, Xinyuan Chang, Xin Cai, Shuang Zeng, Linzhe Shi, Sijin Wang, Hang Zhang, Mu Xu

Date:2026-02-25 14:34:50

Vision-Language Models (VLMs) exhibit strong reasoning capabilities, showing promise for end-to-end autonomous driving systems. However, Chain-of-Thought (CoT), the reasoning strategy most widely used by VLMs, faces critical challenges. Existing textual CoT leaves a large gap between the text semantic space and the trajectory physical space. Although a recent approach uses future images in place of text as the CoT process, it lacks clear planning-oriented objective guidance to generate images with accurate scene evolution. To address these issues, we propose MindDriver, a progressive multimodal reasoning framework that enables VLMs to imitate human-like progressive thinking for autonomous driving. MindDriver comprises semantic understanding, semantic-to-physical space imagination, and physical-space trajectory planning. To achieve aligned reasoning processes in MindDriver, we develop a feedback-guided automatic data annotation pipeline to generate aligned multimodal reasoning training data. Furthermore, we develop a progressive reinforcement fine-tuning method to optimize the alignment through progressive high-level reward-based learning. MindDriver demonstrates superior performance in both nuScenes open-loop and Bench2Drive closed-loop evaluation. Code is available at https://github.com/hotdogcheesewhite/MindDriver.

Traffic-aware Hierarchical Integrated Thermal and Energy Management for Connected HEVs

Authors:Jie Han, Arash Khalatbarisoltani, Hai L. Vu, Xiaosong Hu, Jun Yang

Date:2026-02-25 13:42:04

The energy and thermal management systems of hybrid electric vehicles (HEVs) are inherently interdependent. With the ongoing deployment of intelligent transportation systems (ITSs) and increasing vehicle connectivity, the integration of traffic information has become crucial for improving both energy efficiency and thermal comfort in modern vehicles. To enhance fuel economy, this paper proposes a novel traffic-aware hierarchical integrated thermal and energy management (TA-ITEM) strategy for connected HEVs. In the upper layer, global reference trajectories for battery state of charge (SOC) and cabin temperature are planned using traffic flow speed information obtained from ITSs. In the lower layer, a real-time model predictive control (MPC)-based ITEM controller is developed, which incorporates a novel Transformer-based speed predictor with driving condition recognition (TF-DCR) to enable anticipatory tracking of the reference trajectories. Numerical simulations are conducted under various driving cycles and ambient temperature conditions. The results demonstrate that the proposed TA-ITEM approach outperforms conventional rule-based and MPC-SP approaches, with average fuel consumption reductions of 56.36% and 5.84%, respectively, while maintaining superior thermal regulation and cabin comfort. These findings confirm the effectiveness and strong generalization capability of TA-ITEM and underscore the advantages of incorporating traffic information.

Enhancing Cellular-enabled Collaborative Robots Planning through GNSS data for SAR Scenarios

Authors:Arnau Romero, Carmen Delgado, Jana Baguer, Raúl Suárez, Xavier Costa-Pérez

Date:2026-02-25 13:29:38

Cellular-enabled collaborative robots are becoming paramount in Search-and-Rescue (SAR) and emergency response. Crucially dependent on resilient mobile network connectivity, they serve as invaluable assets for tasks like rapid victim localization and the exploration of hazardous, otherwise unreachable areas. However, their reliance on battery power and the need for persistent, low-latency communication limit operational time and mobility. To address this, and considering the evolving capabilities of 5G/6G networks, we propose a novel SAR framework that includes Mission Planning and Mission Execution phases and that optimizes robot deployment. By considering parameters such as the exploration area size, terrain elevation, robot fleet size, communication-influenced energy profiles, desired exploration rate, and target response time, our framework determines the minimum number of robots required and their optimal paths to ensure effective coverage and timely data backhaul over mobile networks. Our results demonstrate the trade-offs between number of robots, explored area, and response time for wheeled and quadruped robots. Further, we quantify the impact of terrain elevation data on mission time and energy consumption, showing the benefits of incorporating real-world environmental factors that might also affect mobile signal propagation and connectivity into SAR planning. This framework provides critical insights for leveraging next-generation mobile networks to enhance autonomous SAR operations.

SunnyParking: Multi-Shot Trajectory Generation and Motion State Awareness for Human-like Parking

Authors:Jishu Miao, Han Chen, Jiankun Zhai, Qi Liu, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi

Date:2026-02-25 08:35:58

Autonomous parking fundamentally differs from on-road driving due to its frequent direction changes and complex maneuvering requirements. However, existing End-to-End (E2E) planning methods often simplify the parking task into a geometric path regression problem, neglecting explicit modeling of the vehicle's kinematic state. This "dimensionality deficiency" easily leads to physically infeasible trajectories and deviates from real human driving behavior, particularly at critical gear-shift points in multi-shot parking scenarios. In this paper, we propose SunnyParking, a novel dual-branch E2E architecture that achieves motion state awareness by jointly predicting spatial trajectories and discrete motion state sequences (e.g., forward/reverse). Additionally, we introduce a Fourier feature-based representation of target parking slots to overcome the resolution limitations of traditional bird's-eye view (BEV) approaches, enabling high-precision target interactions. Experimental results demonstrate that our framework generates more robust and human-like trajectories in complex multi-shot parking scenarios, while significantly improving gear-shift point localization accuracy compared to state-of-the-art methods. We open-source a new parking dataset of the CARLA simulator, specifically designed to evaluate full prediction capabilities under complex maneuvers.

Self-Correcting VLA: Online Action Refinement via Sparse World Imagination

Authors:Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li, Guoli Yang, Heng Tao Shen

Date:2026-02-25 06:58:06

Standard vision-language-action (VLA) models rely on fitting statistical data priors, limiting their robust understanding of underlying physical dynamics. Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states. World action models have emerged as a promising paradigm that integrates imagination and control to enable predictive planning. However, they rely on implicit context modeling, lacking explicit mechanisms for self-improvement. To solve these problems, we propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. We first design sparse world imagination by integrating auxiliary predictive heads to forecast current task progress and future trajectory trends, thereby constraining the policy to encode short-term physical evolution. Then we introduce the online action refinement module to reshape progress-dependent dense rewards, adjusting trajectory orientation based on the predicted sparse future states. Evaluations on challenging robot manipulation tasks from simulation benchmarks and real-world settings demonstrate that SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines, alongside a 14% gain in real-world experiments. Code is available at https://github.com/Kisaragi0/SC-VLA.

ABM-UDE: Developing Surrogates for Epidemic Agent-Based Models via Scientific Machine Learning

Authors:Sharv Murgai, Utkarsh Utkarsh, Kyle C. Nguyen, Alan Edelman, Erin C. S. Acquesta, Christopher Vincent Rackauckas

Date:2026-02-25 05:19:43

Agent-based epidemic models (ABMs) encode behavioral and policy heterogeneity but are too slow for nightly hospital planning. We develop county-ready surrogates that learn directly from exascale ABM trajectories using Universal Differential Equations (UDEs): mechanistic SEIR-family ODEs with a neural-parameterized contact rate $κ_φ(u,t)$ (no additive residual). Our contributions are threefold: we adapt multiple shooting and an observer-based prediction-error method (PEM) to stabilize identification of neural-augmented epidemiological dynamics across intervention-driven regime shifts; we enforce positivity and mass conservation and show the learned contact-rate parameterization yields a well-posed vector field; and we quantify accuracy, calibration, and compute against ABM ensembles and UDE baselines. On a representative ExaEpi scenario, PEM-UDE reduces mean MSE by 77% relative to single-shooting UDE (3.00 vs. 13.14) and by 20% relative to MS-UDE (3.75). Reliability improves in parallel: empirical coverage of ABM $10$-$90$% and $25$-$75$% bands rises from 0.68/0.43 (UDE) and 0.79/0.55 (MS-UDE) to 0.86/0.61 with PEM-UDE and 0.94/0.69 with MS+PEM-UDE, indicating calibrated uncertainty rather than overconfident fits. Inference runs in seconds on commodity CPUs (20-35 s per $\sim$90-day forecast), enabling nightly "what-if" sweeps on a laptop. Relative to a $\sim$100 CPU-hour ABM reference run, this yields $\sim10^{4}\times$ lower wall-clock time per scenario. This closes the realism-cadence gap, supports threshold-aware decision-making (e.g., maintaining ICU occupancy $<75$%), preserves mechanistic interpretability, and enables calibrated, risk-aware scenario planning on standard institutional hardware. Beyond epidemics, the ABM$\to$UDE recipe provides a portable path to distill agent-based simulators into fast, trustworthy surrogates for other scientific domains.
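
To make the structure of such a surrogate concrete, here is a minimal sketch of an SEIR model whose contact rate depends on state and time. It makes several assumptions that are not from the paper: the hand-written `contact_rate` stands in for the neural network $κ_φ(u,t)$, the rates `sigma` and `gamma` and the population are arbitrary, and the integrator is a fixed-step RK4. The point is only that a multiplicative contact rate keeps the compartment flows balanced, so the total population is conserved.

```python
def contact_rate(u, t):
    """Hypothetical stand-in for a learned contact rate: drops after day 30."""
    return 0.6 if t < 30.0 else 0.25


def seir_rhs(u, t, sigma=1 / 5.0, gamma=1 / 7.0):
    """SEIR right-hand side; the four terms sum to zero, so S+E+I+R is conserved."""
    s, e, i, r = u
    n = s + e + i + r
    infection = contact_rate(u, t) * s * i / n
    return (-infection, infection - sigma * e, sigma * e - gamma * i, gamma * i)


def rk4_step(u, t, dt):
    """One classical Runge-Kutta step of size dt."""
    def shifted(base, slope, h):
        return tuple(x + h * k for x, k in zip(base, slope))

    k1 = seir_rhs(u, t)
    k2 = seir_rhs(shifted(u, k1, dt / 2), t + dt / 2)
    k3 = seir_rhs(shifted(u, k2, dt / 2), t + dt / 2)
    k4 = seir_rhs(shifted(u, k3, dt), t + dt)
    return tuple(
        x + dt / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(u, k1, k2, k3, k4)
    )


def simulate(u0, days=90, dt=0.1):
    """Integrate from t=0 and return the list of (S, E, I, R) states."""
    u, t, traj = u0, 0.0, [u0]
    for _ in range(int(round(days / dt))):
        u = rk4_step(u, t, dt)
        t += dt
        traj.append(u)
    return traj


traj = simulate((9990.0, 0.0, 10.0, 0.0))
print("total population drift:", max(abs(sum(u) - 10000.0) for u in traj))
print("peak infectious:", max(u[2] for u in traj))
```

In a UDE, `contact_rate` would be a trainable network fitted to ABM trajectories while the surrounding compartment structure stays mechanistic.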

Diagnosis-Driven Co-planning of Network Reinforcement and BESS for Distribution Grid with High Penetration of Electric Vehicles

Authors:Linhan Fang, Elias Raffoul, Xingpeng Li

Date:2026-02-25 04:45:24

While the rapid proliferation of electric vehicles (EVs) accelerates net-zero goals, uncoordinated charging activities impose severe operational challenges on distribution grids, including exacerbated peak loads, thermal overloading, and voltage violations. To overcome the computational intractability of jointly optimizing grid infrastructure reinforcements and battery energy storage system (BESS) installations, this paper proposes a novel three-stage diagnosis-driven co-planning (DDCP) framework. The methodology integrates a violation detection and quantification (VDQ) model to systematically identify system breaches, and a violation-mitigated BESS planning (VMBP) model for optimal BESS siting and sizing. Specifically, Stage I of the DDCP framework diagnoses critical bottleneck lines that render standalone BESS solutions infeasible. Stage II targets cable upgrades exclusively at the Top-N prioritized bottleneck lines, and Stage III then executes the optimal BESS deployment using a network-enhanced VMBP model. Furthermore, this study quantifies the EV hosting capacity thresholds before and after BESS integration across varying EV adoption rates and base voltages. Finally, a comprehensive comparative analysis evaluates four mitigation approaches: the VDQ-driven cable upgrade (VCU) model, the VMBP model, system-wide voltage uprating, and the proposed DDCP framework. The results demonstrate that the DDCP framework not only resolves the complex joint-optimization hurdle but also achieves high techno-economic superiority in addressing high-EV-penetration challenges.

VasGuideNet: Vascular Topology-Guided Couinaud Liver Segmentation with Structural Contrastive Loss

Authors:Chaojie Shen, Jingjun Gu, Zihao Zhao, Ruocheng Li, Cunyuan Yang, Jiajun Bu, Lei Wu

Date:2026-02-25 03:50:48

Accurate Couinaud liver segmentation is critical for preoperative surgical planning and tumor localization. However, existing methods primarily rely on image intensity and spatial location cues, without explicitly modeling vascular topology. As a result, they often produce indistinct boundaries near vessels and show limited generalization under anatomical variability. We propose VasGuideNet, the first Couinaud segmentation framework explicitly guided by vascular topology. Specifically, skeletonized vessels, Euclidean distance transform (EDT)-derived geometry, and k-nearest neighbor (kNN) connectivity are encoded into topology features using Graph Convolutional Networks (GCNs). These features are then injected into a 3D encoder-decoder backbone via a cross-attention fusion module. To further improve inter-class separability and anatomical consistency, we introduce a Structural Contrastive Loss (SCL) with a global memory bank. On Task08_HepaticVessel and our private LASSD dataset, VasGuideNet achieves Dice scores of 83.68% and 76.65% with RVDs of 1.68 and 7.08, respectively. It consistently outperforms representative baselines including UNETR, Swin UNETR, and G-UNETR++, delivering higher Dice/mIoU and lower RVD across datasets, demonstrating its effectiveness for anatomically consistent segmentation. Code is available at https://github.com/Qacket/VasGuideNet.git.

Geometric Priors for Generalizable World Models via Vector Symbolic Architecture

Authors:William Youngwoo Chung, Calvin Yeung, Hansen Jin Lillemark, Zhuowen Zou, Xiangjian Liu, Mohsen Imani

Date:2026-02-25 00:41:42

A key challenge in artificial intelligence and neuroscience is understanding how neural systems learn representations that capture the underlying dynamics of the world. Most world models represent the transition function with unstructured neural networks, limiting interpretability, sample efficiency, and generalization to unseen states or action compositions. We address these issues with a generalizable world model grounded in Vector Symbolic Architecture (VSA) principles as geometric priors. Our approach utilizes learnable Fourier Holographic Reduced Representation (FHRR) encoders to map states and actions into a high-dimensional complex vector space with learned group structure and models transitions with element-wise complex multiplication. We formalize the framework's group-theoretic foundation and show how training such structured representations to be approximately invariant enables strong multi-step composition directly in latent space and strong generalization across various experiments. On a discrete grid world environment, our model achieves 87.5% zero-shot accuracy on unseen state-action pairs, obtains 53.6% higher accuracy on 20-timestep horizon rollouts, and demonstrates 4x higher robustness to noise relative to an MLP baseline. These results highlight how training to have latent group structure yields generalizable, data-efficient, and interpretable world models, providing a principled pathway toward structured models for real-world planning and reasoning.
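
The core operation, modeling a transition as element-wise complex multiplication of unit phasors, can be illustrated with a short sketch. This is not the paper's model: the encoders here are fixed random phases rather than learned, and the 1D integer grid, `DIM`, and function names are assumptions made for illustration. Because multiplying phasors adds their phases, binding a state with an action yields the encoding of the next state, and actions compose by multiplication.

```python
import cmath
import math
import random

DIM = 256
rng = random.Random(0)
# Hypothetical fixed random base phases; the paper learns its encoders instead.
PHASES = [rng.uniform(-math.pi, math.pi) for _ in range(DIM)]


def encode(k):
    """Encode integer k as a vector of unit phasors exp(i * k * theta_j)."""
    return [cmath.exp(1j * k * theta) for theta in PHASES]


def bind(a, b):
    """FHRR binding: element-wise complex multiplication, so phases add."""
    return [x * y for x, y in zip(a, b)]


def similarity(a, b):
    """Real part of the normalized Hermitian inner product, in [-1, 1]."""
    return sum((x * y.conjugate()).real for x, y in zip(a, b)) / len(a)


state = encode(3)                          # position 3 on a 1D grid
step_right = encode(1)                     # action "+1"
two_steps = bind(step_right, step_right)   # composed action "+2"
predicted = bind(state, two_steps)         # latent prediction for position 5

print(similarity(predicted, encode(5)))  # essentially 1: matches position 5
print(similarity(predicted, encode(4)))  # small in magnitude: a different position
```

Multi-step rollouts in this representation never leave latent space: composing n actions is n multiplications, and decoding is a similarity lookup against candidate state encodings.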

A CFD-Based Investigation of Local Luminal Curvature as a Primary Determinant of Hemodynamic Environments in Cerebral Aneurysms

Authors:Marcella P. A. Dallavanzi, José L. Gasche, Iago L. Oliveira

Date:2026-02-24 22:29:37

The relationship between vascular morphology and hemodynamics is fundamental to understanding the natural history of cerebral aneurysms (CAs). While global geometric indices have been widely studied, the local interaction between luminal curvature and wall shear stress (WSS) remains poorly characterized. This study analyzed a large cohort of CAs to investigate how local surface morphology relates to hemodynamics. This was performed via CFD flow simulations of a set of 76 patient-specific CA geometries using the OpenFOAM library. Geometry and pulsatile inflow conditions were modeled based on patient arterial diameter and age. Blood was assumed to be a Newtonian incompressible fluid flowing in a laminar regime. We utilized a geometric framework to classify the aneurysm lumen into spherical-like and saddle-like patches based on Gaussian curvature. Our results demonstrate a robust, statistically significant correlation between these curvature types and hemodynamic metrics, regardless of rupture status or aneurysm type. Specifically, saddle-like patches, predominantly found at the aneurysm neck, are associated with high time-averaged WSS, a low oscillatory shear index, and intense near-wall vortical activity as identified by the lambda2-criterion. In contrast, spherical-like patches, dominant at the dome, correspond to regions of flow impingement characterized by lower time-averaged WSS and an elevated oscillatory shear index. These findings suggest that wall curvature is a primary determinant of local hemodynamics. By bridging the gap between local wall morphology and pathological wall markers, this work suggests that curvature-based mapping can serve as a powerful tool for identifying vulnerable regions susceptible to thinning and rupture. This objective geometric assessment offers valuable insights for risk stratification and the precision planning of endovascular interventions.
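
The patch classification described above rests on the sign of the Gaussian curvature, which is the product of the two principal curvatures. The sketch below shows that rule in isolation. The function names, the tolerance, and the third "flat-or-cylindrical" label for near-zero curvature are assumptions for illustration; the study's actual surface-processing pipeline is not reproduced here.

```python
def gaussian_curvature(k1, k2):
    """Gaussian curvature K is the product of the principal curvatures."""
    return k1 * k2


def classify_patch(k1, k2, tol=1e-12):
    """Label a surface point by the sign of K.

    K > 0: both principal curvatures bend the same way (spherical-like).
    K < 0: they bend in opposite directions (saddle-like).
    |K| <= tol: treated as flat or cylindrical (an assumed extra category).
    """
    k = gaussian_curvature(k1, k2)
    if k > tol:
        return "spherical-like"
    if k < -tol:
        return "saddle-like"
    return "flat-or-cylindrical"


# Principal curvatures in 1/mm for three hypothetical surface points.
print(classify_patch(0.4, 0.3))    # dome-like region
print(classify_patch(0.5, -0.2))   # neck-like region
print(classify_patch(0.5, 0.0))    # cylinder-like vessel segment
```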

Unified Complementarity-Based Contact Modeling and Planning for Soft Robots

Authors:Milad Azizkhani, Yue Chen

Date:2026-02-24 19:37:36

Soft robots were introduced in large part to enable safe, adaptive interaction with the environment, and this interaction relies fundamentally on contact. However, modeling and planning contact-rich interactions for soft robots remain challenging: dense contact candidates along the body create redundant constraints and rank-deficient LCPs, while the disparity between high stiffness and low friction introduces severe ill-conditioning. Existing approaches rely on problem-specific approximations or penalty-based treatments. This letter presents a complementarity-based framework for soft-robot contact modeling and planning that brings contact modeling, manipulation, and planning into a single, physically consistent formulation. We develop a robust Linear Complementarity Problem (LCP) model tailored to discretized soft robots and address these challenges with a three-stage conditioning pipeline: inertial rank selection to remove redundant contacts, Ruiz equilibration to correct scale disparity and ill-conditioning, and lightweight Tikhonov regularization on normal blocks. Building on the same formulation, we introduce a kinematically guided warm-start strategy that enables dynamic trajectory optimization through contact using Mathematical Programs with Complementarity Constraints (MPCC) and demonstrate its effectiveness on contact-rich ball manipulation tasks. In conclusion, the proposed framework, CUSP, provides a new foundation for unifying contact modeling, simulation, and planning in soft robotics.
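
Two of the conditioning steps named above are standard numerical tools and can be sketched on a small dense matrix. The code below is an illustration under assumptions, not the letter's implementation: it runs Ruiz equilibration on an arbitrary 2x2 matrix with badly mismatched scales, and applies Tikhonov regularization to the whole diagonal rather than only to normal-contact blocks. The matrix values, iteration count, and `eps` are hypothetical, and the sketch assumes no row or column is entirely zero.

```python
import math


def ruiz_equilibrate(a, iterations=20):
    """Scale rows and columns so their infinity norms approach 1.

    Returns (scaled, d1, d2) where scaled[i][j] == d1[i] * a[i][j] * d2[j].
    Assumes every row and column has at least one nonzero entry.
    """
    n, m = len(a), len(a[0])
    s = [row[:] for row in a]
    d1, d2 = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        # Each pass divides by the square root of the current row/column max.
        r = [1.0 / math.sqrt(max(abs(x) for x in row)) for row in s]
        c = [1.0 / math.sqrt(max(abs(s[i][j]) for i in range(n))) for j in range(m)]
        for i in range(n):
            for j in range(m):
                s[i][j] *= r[i] * c[j]
        d1 = [d * x for d, x in zip(d1, r)]
        d2 = [d * x for d, x in zip(d2, c)]
    return s, d1, d2


def tikhonov(a, eps):
    """Add eps to every diagonal entry (whole-diagonal variant, for simplicity)."""
    return [[x + (eps if i == j else 0.0) for j, x in enumerate(row)]
            for i, row in enumerate(a)]


# Hypothetical matrix mixing a stiff entry (1e4) with a soft one (1e-2).
A = [[1e4, 1.0], [1.0, 1e-2]]
scaled, d1, d2 = ruiz_equilibrate(A)
regularized = tikhonov(scaled, eps=1e-8)
print("row max norms:", [max(abs(x) for x in row) for row in scaled])
print("scaled matrix:", scaled)
```

After scaling, a solution of the equilibrated system has to be mapped back through `d1` and `d2` to recover the solution of the original problem.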

Efficient Hierarchical Any-Angle Path Planning on Multi-Resolution 3D Grids

Authors:Victor Reijgwart, Cesar Cadena, Roland Siegwart, Lionel Ott

Date:2026-02-24 18:18:36

Hierarchical, multi-resolution volumetric mapping approaches are widely used to represent large and complex environments as they can efficiently capture their occupancy and connectivity information. Yet widely used path planning methods such as sampling and trajectory optimization do not exploit this explicit connectivity information, and search-based methods such as A* suffer from scalability issues in large-scale high-resolution maps. In many applications, Euclidean shortest paths form the underpinning of the navigation system. For such applications, any-angle planning methods, which find optimal paths by connecting corners of obstacles with straight-line segments, provide a simple and efficient solution. In this paper, we present a method that has the optimality and completeness properties of any-angle planners while overcoming computational tractability issues common to search-based methods by exploiting multi-resolution representations. Extensive experiments on real and synthetic environments demonstrate the proposed approach's solution quality and speed, outperforming even sampling-based methods. The framework is open-sourced to allow the robotics and planning community to build on our research.

OCR-Agent: Agentic OCR with Capability and Memory Reflection

Authors:Shimin Wen, Zeyu Zhang, Xingdou Bian, Hongjie Zhu, Lulu He, Layi Shama, Daji Ergu, Ying Cai

Date:2026-02-24 16:10:27

Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks through iterative optimization methods. However, these models generally lack effective self-correction mechanisms, making it difficult for them to independently rectify cognitive biases. Consequently, during multi-turn revisions, they often fall into repetitive and ineffective attempts, failing to achieve stable improvements in answer quality. To address this issue, we propose a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection. This framework guides the model to first diagnose errors and generate a correction plan via Capability Reflection, then leverage Memory Reflection to review past attempts to avoid repetition and explore new solutions, and finally optimize the answer through rigorous re-reasoning. Experiments on the challenging OCRBench v2 benchmark show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by +2.0 on English and +1.2 on Chinese subsets, while achieving state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5), surpassing even larger fine-tuned models. Our method demonstrates that structured, self-aware reflection can significantly enhance VLMs' reasoning robustness without additional training. Code: https://github.com/AIGeeksGroup/OCR-Agent.

From Perception to Action: An Interactive Benchmark for Vision Reasoning

Authors:Yuhao Wu, Maojia Song, Yihuai Lan, Lei Wang, Zhiqiang Hu, Yao Xiao, Heng Zhou, Weihua Zheng, Dylan Raharja, Soujanya Poria, Roy Ka-Wei Lee

Date:2026-02-24 15:33:02

Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet, prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive 3D, physics-driven testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans or to robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.

Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks

Authors:Sanjay Haresh, Daniel Dijkman, Apratim Bhattacharyya, Roland Memisevic

Date:2026-02-24 15:30:55

Many dexterous manipulation tasks are non-Markovian in nature, yet little attention has been paid to this fact in the recent upsurge of the vision-language-action (VLA) paradigm. Although they are successful in bringing internet-scale semantic understanding to robotics, existing VLAs are primarily "stateless" and struggle with memory-dependent long-horizon tasks. In this work, we explore a way to impart both spatial and temporal memory to a VLA by incorporating a language scratchpad. The scratchpad makes it possible to memorize task-specific information, such as object positions, and it allows the model to keep track of a plan and progress towards subgoals within that plan. We evaluate this approach on a split of memory-dependent tasks from the ClevrSkills environment, on MemoryBench, as well as on a challenging real-world pick-and-place task. We show that incorporating a language scratchpad significantly improves generalization on these tasks for both non-recurrent and recurrent models.

Telemetry-Based Server Selection in the Quantum Internet via Cross-Layer Runtime Estimation

Authors:Masaki Nagai, Hideaki Kawaguchi, Shin Nishio, Takahiko Satoh

Date:2026-02-24 15:25:55

The Quantum Internet will allow clients to delegate quantum workloads to remote servers over heterogeneous networks, but choosing the server that minimizes end-to-end execution time is difficult because server processing, feedforward classical communication, and entanglement distribution can overlap in protocol-dependent ways and shift the runtime bottleneck. We propose $T_{\max}$, a lightweight runtime score that sums coarse telemetry from multiple layers to obtain a conservative ranking for online server selection without calibrating weights for each deployment. Using NetSquid discrete-event simulations of a modified parameter-blind VQE (PB-VQE) workload, we evaluate $T_{\max}$ on pools of 10,000 heterogeneous candidates (selecting among up to 100 per decision) across crossover and bottleneck-dominated regimes, including temporal jitter scenarios and jobs with multiple shots. $T_{\max}$ achieves single-digit mean regret normalized by the oracle (below 10%) in both regimes and remains in the single-digit range under classical communication latency jitter for multi-shot jobs, while performance degrades for single-shot jobs under severe jitter. To connect performance to deployment planning, we derive a requirements-based operating map relating distance and entanglement-rate requirements to protocol-level counts, quantify how simple multiuser contention shifts the crossover, and use Sobol global sensitivity analysis to identify regime-dependent bottlenecks. These findings suggest that simple cross-layer telemetry can enable practical server selection while providing actionable provisioning guidance for emerging Quantum Internet services.
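
As a rough illustration of the selection rule and the oracle-normalized regret described above, the following minimal sketch ranks candidates by a summed telemetry score and compares the chosen server with the best one in hindsight. The `Candidate` fields, the function names, and the use of a known `true_runtime_s` are all hypothetical; the abstract does not specify which telemetry terms are summed or how they are measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical per-server telemetry; field names are assumptions."""
    name: str
    processing_s: float     # coarse estimate of server processing time
    classical_s: float      # coarse estimate of feedforward classical communication
    entanglement_s: float   # coarse estimate of entanglement distribution time
    true_runtime_s: float   # actual end-to-end runtime, known only in simulation


def score(c: Candidate) -> float:
    # Summing the layers ignores any overlap between them, so the score
    # is a conservative (pessimistic) stand-in for the real runtime.
    return c.processing_s + c.classical_s + c.entanglement_s


def select(candidates: list[Candidate]) -> Candidate:
    # Pick the candidate with the smallest summed score.
    return min(candidates, key=score)


def normalized_regret(candidates: list[Candidate]) -> float:
    # Extra runtime of the chosen server relative to the oracle choice,
    # expressed as a fraction of the oracle runtime.
    chosen = select(candidates)
    oracle = min(c.true_runtime_s for c in candidates)
    return (chosen.true_runtime_s - oracle) / oracle


if __name__ == "__main__":
    pool = [
        Candidate("A", 1.0, 0.5, 2.0, true_runtime_s=2.2),
        Candidate("B", 0.8, 0.4, 2.5, true_runtime_s=2.0),
        Candidate("C", 2.0, 1.0, 3.0, true_runtime_s=3.5),
    ]
    print(select(pool).name, normalized_regret(pool))
```

In this toy pool the summed score prefers a server that is not the true optimum, which is the situation the regret metric is meant to quantify.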

EKF-Based Depth Camera and Deep Learning Fusion for UAV-Person Distance Estimation and Following in SAR Operations

Authors:Luka Šiktar, Branimir Ćaran, Bojan Šekoranja, Marko Švaco

Date:2026-02-24 14:37:36

Search and rescue (SAR) operations require rapid responses to save lives or property. Unmanned Aerial Vehicles (UAVs) equipped with vision-based systems support these missions through prior terrain investigation or real-time assistance during the mission itself. Vision-based UAV frameworks aid human search tasks by detecting and recognizing specific individuals, then tracking and following them while maintaining a safe distance. A key safety requirement for UAV following is the accurate estimation of the distance between camera and target object under real-world conditions, achieved by fusing multiple image modalities. UAVs with deep learning-based vision systems offer a new approach to the planning and execution of SAR operations. As part of the system for automatic people detection and face recognition using deep learning, in this paper we present the fusion of depth camera measurements and monocular camera-to-body distance estimation for robust tracking and following. Deep learning-based filtering of depth camera data and estimation of camera-to-body distance from a monocular camera are achieved with YOLO-pose, enabling real-time fusion of depth information using the Extended Kalman Filter (EKF) algorithm. The proposed subsystem, designed for use in drones, estimates and measures the distance between the depth camera and the human body keypoints in order to maintain a safe distance between the drone and the human target. Our system provides an accurate estimated distance, which has been validated against motion capture ground truth data. The system has been tested in real time indoors, where it reduces the average error, root mean square error (RMSE), and standard deviation of distance estimation by up to 15.3% in three tested scenarios.
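
To make the fusion idea concrete, here is a minimal sketch of a scalar Kalman filter that fuses a depth-camera distance with a monocular distance estimate. It is not the authors' filter: the random-walk distance model, the noise variances, and all function names are assumptions chosen for illustration. Because both measurements observe the distance directly, the measurement model is linear and the EKF update reduces to the standard Kalman update shown here.

```python
from typing import Optional, Tuple


def kf_predict(x: float, p: float, q: float) -> Tuple[float, float]:
    # Random-walk model (assumed): distance stays the same, uncertainty grows by q.
    return x, p + q


def kf_update(x: float, p: float, z: float, r: float) -> Tuple[float, float]:
    # Standard scalar Kalman update for a direct measurement z with variance r.
    k = p / (p + r)              # Kalman gain
    x_new = x + k * (z - x)      # pull the estimate toward the measurement
    p_new = (1.0 - k) * p        # uncertainty shrinks after the update
    return x_new, p_new


def fuse_step(
    x: float,
    p: float,
    z_depth: Optional[float],
    z_mono: Optional[float],
    q: float = 0.01,        # process noise variance (assumed)
    r_depth: float = 0.04,  # depth-camera noise variance (assumed)
    r_mono: float = 0.25,   # monocular-estimate noise variance (assumed)
) -> Tuple[float, float]:
    # One filter cycle: predict, then apply whichever measurements are available.
    x, p = kf_predict(x, p, q)
    if z_depth is not None:
        x, p = kf_update(x, p, z_depth, r_depth)
    if z_mono is not None:
        x, p = kf_update(x, p, z_mono, r_mono)
    return x, p


if __name__ == "__main__":
    x, p = 3.0, 1.0  # initial distance guess in metres and its variance
    for z_d, z_m in [(2.1, 2.4), (None, 2.2), (2.0, 2.1)]:
        x, p = fuse_step(x, p, z_d, z_m)
        print(round(x, 3), round(p, 4))
```

Applying the two updates sequentially lets the filter keep running when one sensor drops out, for example when the depth reading is rejected for a frame.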

KCFRC: Kinematic Collision-Aware Foothold Reachability Criteria for Legged Locomotion

Authors:Lei Ye, Haibo Gao, Huaiguang Yang, Peng Xu, Haoyu Wang, Tie Liu, Junqi Shan, Zongquan Deng, Liang Ding

Date:2026-02-24 12:46:34

Legged robots face significant challenges in navigating complex environments, as they require precise real-time decisions for foothold selection and contact planning. While existing research has explored methods to select footholds based on terrain geometry or kinematics, a critical gap remains: few existing methods efficiently validate the existence of a collision-free swing trajectory. This paper addresses this gap by introducing KCFRC, a novel approach for efficient foothold reachability analysis. We first formally define the foothold reachability problem and establish a sufficient condition for foothold reachability. Based on this condition, we develop the KCFRC algorithm, which enables robots to validate foothold reachability in real time. Our experimental results demonstrate that KCFRC achieves remarkable time efficiency, completing foothold reachability checks for a single leg across 900 potential footholds in an average of 2 ms. Furthermore, we show that KCFRC can accelerate trajectory optimization and is particularly beneficial for contact planning in confined spaces, enhancing the adaptability and robustness of legged robots in challenging environments.

Fast-Response Balancing Capacity of Alkaline Electrolyzers

Authors:Marvin Dorn, Julian Hoffmann, André Weber, Veit Hagenmeyer

Date:2026-02-24 12:31:08

The energy transition requires flexible technologies to maintain grid stability, and electrolyzers are playing an increasingly important role in meeting this need. While previous studies often question the dynamic capabilities of large-scale alkaline electrolyzer systems, we assess their potential to provide balancing services using real manufacturer data. Unlike common approaches, we propose decoupling the total electrolyzer power from the smaller fraction of power actually offered on balancing markets. Adapting an existing methodology, we analyze alkaline electrolyzer systems and extend the assessment to Germany and Europe. Our results show that large-scale electrolyzers are technically capable of delivering fast-response balancing services, with significantly lower dynamic requirements than previously assumed. The planned electrolyzers in Germany could cover the entire balancing capacity market, potentially saving around 13 % of their electricity costs, excluding energy balancing revenues. The decoupling also resolves part of the trade-off for electrolyzer manufacturers, enabling the design of less dynamic but more stable systems.

Confidence Distributions and Related Themes

Authors:Nils Lid Hjort, Tore Schweder

Date:2026-02-24 12:18:48

This is the guest editors' general introduction to a Special Issue of the Journal of Statistical Planning and Inference, dedicated to confidence distributions and related themes. Confidence distributions (CDs) are distributions for parameters of interest, constructed via a statistical model after analysing the data. As such they serve the same purpose for frequentist statisticians as posterior distributions do for Bayesians. There have been several attempts in the literature to set up a clear theory for such confidence distributions, from Fisher's fiducial inference onwards. There are certain obstacles and difficulties involved in these attempts, both conceptually and operationally, which have contributed to CDs being slow to enter the statistical mainstream. Recently, however, there has been a renewed surge of interest in CDs and various related themes, reflected in new methodological research, advanced applications to substantive sciences, and dissemination and communication via workshops and conferences. The present special issue of the JSPI is a collection of papers emanating from the Inference With Confidence workshop in Oslo, May 2015. Several of the papers appearing here were first presented at that workshop. The present collection, however, also includes new research papers from other scholars in the field.
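
A standard textbook illustration of a confidence distribution, not taken from the special issue itself, is the mean of a normal sample with known variance. Assuming $X_1, \dots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ with $\sigma$ known, observed mean $\bar{x}$, and $\Phi$ the standard normal distribution function:

```latex
% Confidence distribution for the normal mean (sigma known)
C(\mu) = \Phi\!\left( \frac{\sqrt{n}\,(\mu - \bar{x})}{\sigma} \right),
\qquad
C^{-1}(\alpha) = \bar{x} + \frac{\sigma}{\sqrt{n}}\,\Phi^{-1}(\alpha).

% Its quantiles reproduce the usual two-sided confidence interval
\left[\, C^{-1}(\alpha/2),\; C^{-1}(1 - \alpha/2) \,\right]
= \left[\, \bar{x} - z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}},\;
           \bar{x} + z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}} \,\right].
```

Here $C(\mu)$ is a distribution function in $\mu$ computed after the data are observed, and its quantiles deliver confidence intervals at every level simultaneously, which is the sense in which a CD plays the role of a posterior for a frequentist.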

POMDPPlanners: Open-Source Package for POMDP Planning

Authors:Yaacov Pariente, Vadim Indelman

Date:2026-02-24 11:50:04

We present POMDPPlanners, an open-source Python package for empirical evaluation of Partially Observable Markov Decision Process (POMDP) planning algorithms. The package integrates state-of-the-art planning algorithms, a suite of benchmark environments with safety-critical variants, automated hyperparameter optimization via Optuna, persistent caching with failure recovery, and configurable parallel simulation -- reducing the overhead of extensive simulation studies. POMDPPlanners is designed to enable scalable, reproducible research on decision-making under uncertainty, with particular emphasis on risk-sensitive settings where standard toolkits fall short.

Infrared spectropolarimetry of a C-class solar flare footpoint plasma -- I. Spectral features and forward modelling

Authors:Z. Vashalomidze, C. Quintero Noda, T. V. Zaqarashvili, M. Benko, D. Kuridze, P. Gömöry, J. Rybák, S. Lomineishvili, M. Collados, C. Denker, M. Verma, C. Kuckein, A. Asensio Ramos

Date:2026-02-24 11:36:04

We performed high-spatial-resolution spectropolarimetric observations of active region NOAA 13363 during a C-class flare with the Gregor Infrared Spectrograph (GRIS) on 16 July 2023. We examine the coupling between the photosphere and the chromosphere, studying the polarimetric signals during a period that encompasses the decaying phase of a C-class flare and the appearance of a new C-class flare at the same location. We focus on the analysis of various spectral lines. In particular, we study the Si I 10827 Å, Ca I 10833.4 Å, Na I 10834.9 Å, and Ca I 10838.9 Å photospheric lines, as well as the He I 10830 Å triplet. GRIS data revealed the presence of flare-related red- and blueshifted spectral line components, reaching Doppler velocities up to 90 km/s, and complex Si I profiles where the He I spectral line contribution is blueshifted. In contrast, the photospheric Ca I and Na I transitions remained unchanged, indicating that the flare did not modify the physical conditions of the lower photosphere. We combined that information with simultaneous imaging in the Ca II H line and TiO band with the improved High-resolution Fast Imager (HiFI+), finding that the flare emission did not affect the inverse granulation or nearby plage, in agreement with the results from GRIS. We also complement the previous studies with a forward modelling computation, concluding that the He I spectral line emission reflects a complex response of the flaring chromosphere. Radiative excitation from coronal EUV irradiation, energy deposition by flare-accelerated electrons, and dynamic field-aligned plasma flows likely act together to produce the observed supersonic downflows and upflows. We plan to expand these findings through inversions of the He I 10830 Å triplet signals in the future.

VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving

Authors:Jie Wang, Guang Li, Zhijian Huang, Chenxu Dang, Hangjun Ye, Yahong Han, Long Chen

Date:2026-02-24 11:33:44

The significance of cross-view 3D geometric modeling capabilities for autonomous driving is self-evident, yet existing Vision-Language Models (VLMs) inherently lack this capability, resulting in their mediocre performance. While some promising approaches attempt to mitigate this by constructing Q&A data for auxiliary training, they still fail to fundamentally equip VLMs with the ability to comprehensively handle diverse evaluation protocols. We thus chart a new course, advocating for the infusion of VLMs with the cross-view geometric grounding of mature 3D foundation models, closing this critical capability gap in autonomous driving. In this spirit, we propose a novel architecture, VGGDrive, which empowers Vision-language models with cross-view Geometric Grounding for autonomous Driving. Concretely, to bridge the cross-view 3D geometric features from the frozen visual 3D model with the VLM's 2D visual features, we introduce a plug-and-play Cross-View 3D Geometric Enabler (CVGE). The CVGE decouples the base VLM architecture and effectively empowers the VLM with 3D features through a hierarchical adaptive injection mechanism. Extensive experiments show that VGGDrive enhances base VLM performance across five autonomous driving benchmarks, including tasks like cross-view risk perception, motion prediction, and trajectory planning. It's our belief that mature 3D foundation models can empower autonomous driving tasks through effective integration, and we hope our initial exploration demonstrates the potential of this paradigm to the autonomous driving community.

Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning

Authors:Xu Wan, Yansheng Wang, Wenqi Huang, Mingyang Sun

Date:2026-02-24 09:35:43

Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which directly hinder learning efficiency on difficult samples during large language model post-training. In this paper, we introduce Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework that improves data efficiency in large language model post-training. It dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while retaining a lower-bound guarantee for policy improvement. Extensive experiments further demonstrate that BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks. Crucially, BAPO successfully resolves 40.7% of problems that base models consistently fail to solve.

ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning

Authors:Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee

Date:2026-02-24 09:23:12

We introduce ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution. It turns math problems into a controlled, correctness-checkable benchmark with tool sets, enabling systematic evaluation of model reliability under (1) large, overlapping tool catalogs and (2) the absence of the intended capability. ToolMATH provides actionable diagnostic evidence of failure modes in tool-augmented agents, helping identify the control mechanisms required for robustness. ToolMATH contains roughly 8k questions and 12k tools; we provide an additional hard set, ToolMATHHard, with questions and tools. Our evaluation reveals that the key failure factor is the inability to reason, which leads to accumulated errors in intermediate results that constrain later decisions. Tool-list redundancy does not simply add noise but amplifies small early deviations into irreversible execution drift. The benchmark highlights that when the intended capability is missing, distractor tools can sometimes serve as partial substitutes in solution paths, yet they can also mislead models into ungrounded tool trajectories. Finally, comparisons between tool-use protocols emphasize that improvements come less from local action selection and more from long-range plan coherence and disciplined use of observations.