planning - 2025-05-31

Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models

Authors:Haohan Chi, Huan-ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, Leichen Wang, Xingtao Hu, Hao Sun, Hang Zhao, Hao Zhao
Date:2025-05-29 17:59:46

Vision-Language-Action (VLA) models for autonomous driving show promise but falter in unstructured corner case scenarios, largely due to a scarcity of targeted benchmarks. To address this, we introduce Impromptu VLA. Our core contribution is the Impromptu VLA Dataset: over 80,000 meticulously curated video clips, distilled from over 2M source clips drawn from 8 open-source large-scale datasets. This dataset is built upon our novel taxonomy of four challenging unstructured categories and features rich, planning-oriented question-answering annotations and action trajectories. Crucially, experiments demonstrate that VLAs trained with our dataset achieve substantial performance gains on established benchmarks--improving closed-loop NeuroNCAP scores and collision rates, and reaching near state-of-the-art L2 accuracy in open-loop nuScenes trajectory prediction. Furthermore, our Q&A suite serves as an effective diagnostic, revealing clear VLM improvements in perception, prediction, and planning. Our code, data and models are available at https://github.com/ahydchh/Impromptu-VLA.

Computerized Modeling of Electrophysiology and Pathoelectrophysiology of the Atria -- How Much Detail is Needed?

Authors:Olaf Dössel, Axel Loewe
Date:2025-05-29 17:51:40

This review focuses on the computerized modeling of the electrophysiology of the human atria, emphasizing the simulation of common arrhythmias such as atrial flutter (AFlut) and atrial fibrillation (AFib). Which components of the model are necessary to accurately model arrhythmogenic tissue modifications, including remodeling, cardiomyopathy, and fibrosis, to ensure reliable simulations? The central question explored is the level of detail required for trustworthy simulations for a specific context of use. The review discusses the balance between model complexity and computational efficiency, highlighting the risks of oversimplification and excessive detail. It covers various aspects of atrial modeling, from cellular to whole atria levels, including the influence of atrial geometry, fiber direction, anisotropy, and wall thickness on simulation outcomes. The article also examines the impact of different modeling approaches, such as volumetric 3D models, bilayer models, and single surface models, on the realism of simulations. In addition, it reviews the latest advances in the modeling of fibrotic tissue and the verification and validation of atrial models. The intended use of these models in planning and optimization of atrial ablation strategies is discussed, with a focus on personalized modeling for individual patients and cohort-based approaches for broader applications. The review concludes by emphasizing the importance of integrating experimental data and clinical validation to enhance the utility of computerized atrial models to improve patient outcomes.

Dual-Task Graph Neural Network for Joint Seizure Onset Zone Localization and Outcome Prediction using Stereo EEG

Authors:Syeda Abeera Amir, Artur Agaronyan, William Gaillard, Chima Oluigbo, Syed Muhammad Anwar
Date:2025-05-29 17:14:28

Accurately localizing the brain regions that trigger seizures and predicting whether a patient will be seizure-free after surgery are vital for surgical planning and patient management in drug-resistant epilepsy. Stereo-electroencephalography (sEEG) delivers high-fidelity intracranial recordings that enable clinicians to precisely locate epileptogenic networks. However, this clinical identification is subjective and dependent on the expertise of the clinical team. Data-driven approaches in this domain are sparse, despite the fact that sEEG offers the high temporal fidelity related to seizure dynamics that can be leveraged using graph structures ideal for imitating brain networks. In this study, we introduce a dual-task graph neural network (GNN) framework that operates on windowed sEEG recordings to jointly predict seizure-freedom outcomes and identify seizure-onset-zone (SOZ) channels. We assemble non-overlapping 10-second windows from 51 clinical seizures spread across 20 pediatric patients, with sEEG data annotated by clinical experts. For each temporal window we construct a functional connectivity graph via thresholded Pearson correlations and extract rich node features (spectral, statistical, wavelet, Hjorth and local graph features), alongside six global graph descriptors. We optimize a combined cross-entropy loss with a tunable task weight, and select model hyper-parameters via Optuna. Under window-level 10-fold cross-validation, the model achieves a mean graph-level accuracy of $89.31 \pm 0.0976\%$ for seizure-freedom prediction and a node-level SOZ localization accuracy of $94.72 \pm 0.0041\%$. For the best-performing model, we ran additive and leave-one-out ablation studies to explore feature importance for graph- and node-level accuracy.
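
Below is a minimal illustrative sketch (not the authors' code) of two pieces the abstract describes: building a thresholded Pearson-correlation connectivity graph from one 10-second sEEG window, and combining the graph-level and node-level cross-entropy losses with a tunable task weight. The window shape, threshold, and weight `lam` are assumed values for illustration.

```python
# Minimal sketch (not the authors' code): thresholded Pearson connectivity
# graph from one sEEG window, plus the combined dual-task loss.
import numpy as np
import torch
import torch.nn.functional as F

def connectivity_graph(window: np.ndarray, threshold: float = 0.3) -> np.ndarray:
    """window: (n_channels, n_samples) -> binary adjacency (n_channels, n_channels)."""
    corr = np.corrcoef(window)                 # Pearson correlation between channels
    adj = (np.abs(corr) >= threshold).astype(float)
    np.fill_diagonal(adj, 0.0)                 # no self-loops
    return adj

def dual_task_loss(graph_logits, graph_label, node_logits, node_labels, lam=0.5):
    """Combined cross-entropy: graph-level seizure-freedom + node-level SOZ."""
    loss_graph = F.cross_entropy(graph_logits.unsqueeze(0), graph_label.view(1))
    loss_node = F.cross_entropy(node_logits, node_labels)
    return lam * loss_graph + (1.0 - lam) * loss_node

# Toy usage: a 10-second window of 64 channels sampled at 1 kHz.
window = np.random.randn(64, 10_000)
adj = connectivity_graph(window)
graph_logits = torch.randn(2)                  # seizure-free vs. not
node_logits = torch.randn(64, 2)               # per-channel SOZ vs. non-SOZ
loss = dual_task_loss(graph_logits, torch.tensor(1), node_logits,
                      torch.randint(0, 2, (64,)))
```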

Inference-time Scaling of Diffusion Models through Classical Search

Authors:Xiangcheng Zhang, Haowei Lin, Haotian Ye, James Zou, Jianzhu Ma, Yitao Liang, Yilun Du
Date:2025-05-29 16:22:40

Classical search algorithms have long underpinned modern artificial intelligence. In this work, we tackle the challenge of inference-time control in diffusion models -- adapting generated outputs to meet diverse test-time objectives -- using principles from classical search. We propose a general framework that orchestrates local and global search to efficiently navigate the generative space. It employs a theoretically grounded local search via annealed Langevin MCMC and performs compute-efficient global exploration using breadth-first and depth-first tree search. We evaluate our approach on a range of challenging domains, including planning, offline reinforcement learning, and image generation. Across all tasks, we observe significant gains in both performance and efficiency. These results show that classical search provides a principled and practical foundation for inference-time scaling in diffusion models. Project page at diffusion-inference-scaling.github.io.
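
One way to picture the global-search component is a breadth-first expansion of partial denoising trajectories pruned to a beam by a test-time objective. The sketch below is illustrative only: `denoise_step` and `score` are placeholders standing in for a real diffusion sampler (possibly with annealed Langevin refinement) and a verifier, and the branching and beam sizes are arbitrary.

```python
# Illustrative sketch (not the paper's implementation) of search over
# denoising trajectories: branch each candidate at every step and keep the
# candidates that a test-time objective scores best.
import heapq
import numpy as np

def denoise_step(x: np.ndarray, t: int, rng: np.random.Generator) -> np.ndarray:
    # Placeholder for one reverse-diffusion step (possibly with Langevin refinement).
    return x + 0.1 * rng.standard_normal(x.shape)

def score(x: np.ndarray) -> float:
    # Placeholder test-time objective (higher is better).
    return -float(np.sum(x ** 2))

def tree_search_sampler(shape, n_steps=10, branch=3, beam=4, seed=0):
    rng = np.random.default_rng(seed)
    frontier = [rng.standard_normal(shape)]            # start from pure noise
    for t in reversed(range(n_steps)):
        children = [denoise_step(x, t, rng) for x in frontier for _ in range(branch)]
        # Breadth-first expansion pruned to a beam of the best-scoring candidates.
        frontier = heapq.nlargest(beam, children, key=score)
    return max(frontier, key=score)

sample = tree_search_sampler((2, 2))
```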

MAPLE: A Mobile Assistant with Persistent Finite State Machines for Recovery Reasoning

Authors:Linqiang Guo, Wei Liu, Yi Wen Heng, Tse-Hsun Chen, Yang Wang
Date:2025-05-29 16:08:51

Mobile GUI agents aim to autonomously complete user-instructed tasks across mobile apps. Recent advances in Multimodal Large Language Models (MLLMs) enable these agents to interpret UI screens, identify actionable elements, and perform interactions such as tapping or typing. However, existing agents remain reactive: they reason only over the current screen and lack a structured model of app navigation flow, limiting their ability to understand context, detect unexpected outcomes, and recover from errors. We present MAPLE, a state-aware multi-agent framework that abstracts app interactions as a Finite State Machine (FSM). We computationally model each UI screen as a discrete state and user actions as transitions, allowing the FSM to provide a structured representation of the app execution. MAPLE consists of specialized agents responsible for five phases of task execution: planning, execution, verification, error recovery, and knowledge retention. These agents collaborate to dynamically construct FSMs in real time based on perception data extracted from the UI screen, allowing the GUI agents to track navigation progress and flow, validate action outcomes through pre- and post-conditions of the states, and recover from errors by rolling back to previously stable states. Our evaluation results on two challenging cross-app benchmarks, Mobile-Eval-E and SPA-Bench, show that MAPLE outperforms the state-of-the-art baseline, improving task success rate by up to 12%, recovery success by 13.8%, and action accuracy by 6.5%. Our results highlight the importance of structured state modeling in guiding mobile GUI agents during task execution. Moreover, our FSM representation can be integrated into future GUI agent architectures as a lightweight, model-agnostic memory layer to support structured planning, execution verification, and error recovery.
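
The FSM abstraction can be pictured with a small sketch like the one below (illustrative only, not MAPLE's implementation): screens become named states with post-conditions checked against the observed UI, and a history stack of verified states supports rollback when an action produces an unexpected screen. The state names and conditions are invented for the example.

```python
# A minimal sketch, not MAPLE itself: screens as FSM states with
# post-conditions, plus rollback to the last verified (stable) state.
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class State:
    name: str
    postcondition: Callable[[dict], bool]      # checks the observed UI after arriving

@dataclass
class AppFSM:
    states: Dict[str, State]
    current: str
    history: list = field(default_factory=list)   # stack of verified (stable) states

    def transition(self, action: str, next_state: str, observation: dict) -> bool:
        target = self.states[next_state]
        if target.postcondition(observation):
            self.history.append(self.current)      # remember the stable state we left
            self.current = next_state
            return True
        return False                               # unexpected outcome: stay put

    def rollback(self) -> str:
        if self.history:
            self.current = self.history.pop()
        return self.current

# Toy usage: a "home -> search" transition whose post-condition checks the screen title.
fsm = AppFSM(
    states={
        "home": State("home", lambda obs: obs.get("title") == "Home"),
        "search": State("search", lambda obs: obs.get("title") == "Search"),
    },
    current="home",
)
ok = fsm.transition("tap_search", "search", {"title": "Settings"})  # wrong screen appeared
if not ok:
    fsm.rollback()   # recover by returning to the last verified state
```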

Individual differences in the cognitive mechanisms of planning strategy discovery

Authors:Ruiqi He, Falk Lieder
Date:2025-05-29 14:57:34

People employ efficient planning strategies. But how are these strategies acquired? Previous research suggests that people can discover new planning strategies through learning from reinforcements, a process known as metacognitive reinforcement learning (MCRL). While prior work has shown that MCRL models can learn new planning strategies and explain participants' experience-driven strategy discovery better than alternative mechanisms, it also revealed significant individual differences in metacognitive learning. Furthermore, when fitted to human data, these models exhibit a slower rate of strategy discovery than humans. In this study, we investigate whether incorporating cognitive mechanisms that might facilitate human strategy discovery can bring models of MCRL closer to human performance. Specifically, we consider intrinsically generated metacognitive pseudo-rewards, subjective effort valuation, and termination deliberation. Analysis of planning task data shows that a larger proportion of participants used at least one of these mechanisms, with significant individual differences in their usage and varying impacts on strategy discovery. Metacognitive pseudo-rewards, subjective effort valuation, and learning the value of acting without further planning were found to facilitate strategy discovery. While these enhancements provided valuable insights into individual differences and the effect of these mechanisms on strategy discovery, they did not fully close the gap between model and human performance, prompting further exploration of additional factors that people might use to discover new planning strategies.

Humanoid Loco-manipulation Planning based on Graph Search and Reachability Maps

Authors:Masaki Murooka, Iori Kumagai, Mitsuharu Morisawa, Fumio Kanehiro, Abderrahmane Kheddar
Date:2025-05-29 14:48:25

In this letter, we propose an efficient and highly versatile loco-manipulation planning method for humanoid robots. Loco-manipulation planning is a key technological brick enabling humanoid robots to autonomously transport objects by manipulating them. We formulate planning of the alternation and sequencing of footsteps and grasps as a graph search problem with a new transition model that allows for a flexible representation of loco-manipulation. Our transition model is quickly evaluated by relocating and switching the reachability maps depending on the motion of both the robot and object. We evaluate our approach by applying it to loco-manipulation use-cases, such as a bobbin rolling operation with regrasping, where the motion is automatically planned by our framework.

Long Duration Inspection of GNSS-Denied Environments with a Tethered UAV-UGV Marsupial System

Authors:Simón Martínez-Rozas, David Alejo, José Javier Carpio, Fernando Caballero, Luis Merino
Date:2025-05-29 14:05:25

Unmanned Aerial Vehicles (UAVs) have become essential tools in inspection and emergency response operations due to their high maneuverability and ability to access hard-to-reach areas. However, their limited battery life significantly restricts their use in long-duration missions. This paper presents a novel tethered marsupial robotic system composed of a UAV and an Unmanned Ground Vehicle (UGV), specifically designed for autonomous, long-duration inspection tasks in Global Navigation Satellite System (GNSS)-denied environments. The system extends the UAV's operational time by supplying power through a tether connected to high-capacity battery packs carried by the UGV. We detail the hardware architecture based on off-the-shelf components to ensure replicability and describe our full-stack software framework, which is composed of open-source components and built upon the Robot Operating System (ROS). The proposed software architecture enables precise localization using a Direct LiDAR Localization (DLL) method and ensures safe path planning and coordinated trajectory tracking for the integrated UGV-tether-UAV system. We validate the system through three field experiments: (1) a manual flight endurance test to estimate the operational duration, (2) an autonomous navigation test, and (3) an inspection mission to demonstrate autonomous inspection capabilities. Experimental results confirm the robustness and autonomy of the system, its capacity to operate in GNSS-denied environments, and its potential for long-endurance, autonomous inspection and monitoring tasks.

Agentic Robot: A Brain-Inspired Framework for Vision-Language-Action Models in Embodied Agents

Authors:Zhejian Yang, Yongchao Chen, Xueyang Zhou, Jiangyue Yan, Dingjie Song, Yinuo Liu, Yuting Li, Yu Zhang, Pan Zhou, Hechang Chen, Lichao Sun
Date:2025-05-29 13:56:49

Long-horizon robotic manipulation poses significant challenges for autonomous systems, requiring extended reasoning, precise execution, and robust error recovery across complex sequential tasks. Current approaches, whether based on static planning or end-to-end visuomotor policies, suffer from error accumulation and lack effective verification mechanisms during execution, limiting their reliability in real-world scenarios. We present Agentic Robot, a brain-inspired framework that addresses these limitations through Standardized Action Procedures (SAP)--a novel coordination protocol governing component interactions throughout manipulation tasks. Drawing inspiration from Standardized Operating Procedures (SOPs) in human organizations, SAP establishes structured workflows for planning, execution, and verification phases. Our architecture comprises three specialized components: (1) a large reasoning model that decomposes high-level instructions into semantically coherent subgoals, (2) a vision-language-action executor that generates continuous control commands from real-time visual inputs, and (3) a temporal verifier that enables autonomous progression and error recovery through introspective assessment. This SAP-driven closed-loop design supports dynamic self-verification without external supervision. On the LIBERO benchmark, Agentic Robot achieves state-of-the-art performance with an average success rate of 79.6%, outperforming SpatialVLA by 6.1% and OpenVLA by 7.4% on long-horizon tasks. These results demonstrate that SAP-driven coordination between specialized components enhances both performance and interpretability in sequential manipulation, suggesting significant potential for reliable autonomous systems. Project Github: https://agentic-robot.github.io.
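
The SAP-style closed loop can be summarized schematically as below. This is an assumption-laden sketch, not the authors' code: `plan`, `execute`, and `verify` stand in for the reasoning model, the VLA executor, and the temporal verifier, and the retry budget is arbitrary.

```python
# Schematic sketch of a plan / execute / verify closed loop (illustrative only).
from typing import Callable, List

def sap_loop(instruction: str,
             plan: Callable[[str], List[str]],
             execute: Callable[[str], None],
             verify: Callable[[str], bool],
             max_retries: int = 2) -> bool:
    subgoals = plan(instruction)                    # reasoning model: instruction -> subgoals
    for subgoal in subgoals:
        for attempt in range(max_retries + 1):
            execute(subgoal)                        # VLA executor: subgoal -> low-level actions
            if verify(subgoal):                     # temporal verifier: introspective check
                break
        else:
            return False                            # subgoal never verified -> task failed
    return True

# Toy usage with stub components.
done = sap_loop(
    "put the cup on the shelf",
    plan=lambda instr: ["grasp cup", "move to shelf", "place cup"],
    execute=lambda sg: None,
    verify=lambda sg: True,
)
```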

From Knowledge to Noise: CTIM-Rover and the Pitfalls of Episodic Memory in Software Engineering Agents

Authors:Tobias Lindenbauer, Georg Groh, Hinrich Schütze
Date:2025-05-29 13:19:29

We introduce CTIM-Rover, an AI agent for Software Engineering (SE) built on top of AutoCodeRover (Zhang et al., 2024) that extends agentic reasoning frameworks with an episodic memory, more specifically, a general and repository-level Cross-Task-Instance Memory (CTIM). While existing open-source SE agents mostly rely on ReAct (Yao et al., 2023b), Reflexion (Shinn et al., 2023), or Code-Act (Wang et al., 2024), all of these reasoning and planning frameworks inefficiently discard their long-term memory after a single task instance. As repository-level understanding is pivotal for identifying all locations requiring a patch for fixing a bug, we hypothesize that SE is particularly well positioned to benefit from CTIM. For this, we build on the Experiential Learning (EL) approach ExpeL (Zhao et al., 2024), proposing a Mixture-Of-Experts (MoEs) inspired approach to create both a general-purpose and repository-level CTIM. We find that CTIM-Rover does not outperform AutoCodeRover in any configuration and thus conclude that neither ExpeL nor DoT-Bank (Lingam et al., 2024) scale to real-world SE problems. Our analysis indicates noise introduced by distracting CTIM items or exemplar trajectories as the likely source of the performance degradation.

Designing the Future of Entrepreneurship Education: Exploring an AI-Empowered Scaffold System for Business Plan Development

Authors:Junhua Zhu, Lan Luo
Date:2025-05-29 10:35:55

Entrepreneurship education equips students to transform innovative ideas into actionable entrepreneurship plans, yet traditional approaches often struggle to provide the personalized guidance and practical alignment needed for success. Focusing on the business plan as a key learning tool and evaluation method, this study investigates the design needs for an AI-empowered scaffold system to address these challenges. Based on qualitative insights from educators and students, the findings highlight three critical dimensions for system design: mastery of business plan development, alignment with entrepreneurial learning goals, and integration of adaptive system features. These findings underscore the transformative potential of AI in bridging gaps in entrepreneurship education while emphasizing the enduring value of human mentorship and experiential learning.

RSFAKE-1M: A Large-Scale Dataset for Detecting Diffusion-Generated Remote Sensing Forgeries

Authors:Zhihong Tan, Jiayi Wang, Huiying Shi, Binyuan Huang, Hongchen Wei, Zhenzhong Chen
Date:2025-05-29 09:30:46

Detecting forged remote sensing images is becoming increasingly critical, as such imagery plays a vital role in environmental monitoring, urban planning, and national security. While diffusion models have emerged as the dominant paradigm for image generation, their impact on remote sensing forgery detection remains underexplored. Existing benchmarks primarily target GAN-based forgeries or focus on natural images, limiting progress in this critical domain. To address this gap, we introduce RSFAKE-1M, a large-scale dataset of 500K forged and 500K real remote sensing images. The fake images are generated by ten diffusion models fine-tuned on remote sensing data, covering six generation conditions such as text prompts, structural guidance, and inpainting. This paper presents the construction of RSFAKE-1M along with a comprehensive experimental evaluation using both existing detectors and unified baselines. The results reveal that diffusion-based remote sensing forgeries remain challenging for current methods, and that models trained on RSFAKE-1M exhibit notably improved generalization and robustness. Our findings underscore the importance of RSFAKE-1M as a foundation for developing and evaluating next-generation forgery detection approaches in the remote sensing domain. The dataset and other supplementary materials are available at https://huggingface.co/datasets/TZHSW/RSFAKE/.

VLM-RRT: Vision Language Model Guided RRT Search for Autonomous UAV Navigation

Authors:Jianlin Ye, Savvas Papaioannou, Panayiotis Kolios
Date:2025-05-29 09:15:44

Path planning is a fundamental capability of autonomous Unmanned Aerial Vehicles (UAVs), enabling them to efficiently navigate toward a target region or explore complex environments while avoiding obstacles. Traditional path-planning methods, such as Rapidly-exploring Random Trees (RRT), have proven effective but often encounter significant challenges. These include high search space complexity, suboptimal path quality, and slow convergence, issues that are particularly problematic in high-stakes applications like disaster response, where rapid and efficient planning is critical. To address these limitations and enhance path-planning efficiency, we propose Vision Language Model RRT (VLM-RRT), a hybrid approach that integrates the pattern recognition capabilities of Vision Language Models (VLMs) with the path-planning strengths of RRT. By leveraging VLMs to provide initial directional guidance based on environmental snapshots, our method biases sampling toward regions more likely to contain feasible paths, significantly improving sampling efficiency and path quality. Extensive quantitative and qualitative experiments with various state-of-the-art VLMs demonstrate the effectiveness of this proposed approach.
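
The biased-sampling idea can be illustrated with a short sketch (the function names and the rectangular VLM-suggested region are made up for illustration): with some probability, samples are drawn from the region the VLM points to rather than uniformly over the workspace.

```python
# Illustrative sketch of VLM-biased RRT sampling (not the paper's code).
import random

def biased_sample(workspace, vlm_region, bias=0.6):
    """workspace / vlm_region: ((xmin, xmax), (ymin, ymax))."""
    region = vlm_region if random.random() < bias else workspace
    (xmin, xmax), (ymin, ymax) = region
    return (random.uniform(xmin, xmax), random.uniform(ymin, ymax))

# Toy usage: the VLM suggests the upper-right quadrant of a 100 x 100 map.
workspace = ((0.0, 100.0), (0.0, 100.0))
vlm_region = ((50.0, 100.0), (50.0, 100.0))
samples = [biased_sample(workspace, vlm_region) for _ in range(1000)]
```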

UPP: Unified Path Planner with Adaptive Safety and Optimality

Authors:Jatin Kumar Arora, Shubhendu Bhasin
Date:2025-05-29 07:34:56

We are surrounded by robots helping us perform complex tasks. Robots have a wide range of applications, from industrial automation to personalized assistance. However, with great technological innovation come significant challenges. One of the major challenges in robotics is path planning. Despite advancements such as graph search, sampling, and potential field methods, most path planning algorithms focus either on optimality or on safety. Very little research addresses both simultaneously. We propose a Unified Path Planner (UPP) that uses modified heuristics and a dynamic safety cost function to balance safety and optimality. The level of safety can be adjusted via tunable parameters, trading off against computational complexity. We demonstrate the planner's performance in simulations, showing how parameter variation affects results. UPP is compared with various traditional and safe-optimal planning algorithms across different scenarios. We also validate it on a TurtleBot, where the robot successfully finds safe and sub-optimal paths.
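
One simple way to picture the safety-optimality trade-off (a generic sketch, not the authors' exact heuristics or cost function) is a grid A* whose edge cost adds a clearance penalty scaled by a tunable safety weight:

```python
# Generic sketch: A* with a tunable safety penalty added to the step cost.
import heapq

def a_star_safe(start, goal, neighbors, clearance, safety_weight=1.0):
    """neighbors(n) -> iterable of (next_node, step_cost); clearance(n) -> distance to nearest obstacle."""
    def h(n):                                             # admissible heuristic: Manhattan distance
        return abs(n[0] - goal[0]) + abs(n[1] - goal[1])

    open_set = [(h(start), 0.0, start, [start])]
    seen = set()
    while open_set:
        _, g, node, path = heapq.heappop(open_set)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt, step in neighbors(node):
            safety_penalty = safety_weight / (clearance(nxt) + 1e-6)   # penalize low clearance
            g2 = g + step + safety_penalty
            heapq.heappush(open_set, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None

# Toy usage on an empty 10 x 10 grid where clearance grows with distance from the border.
nbrs = lambda n: [((n[0] + dx, n[1] + dy), 1.0) for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]
                  if 0 <= n[0] + dx < 10 and 0 <= n[1] + dy < 10]
clr = lambda n: min(n[0], n[1], 9 - n[0], 9 - n[1]) + 1
path = a_star_safe((0, 0), (9, 9), nbrs, clr, safety_weight=2.0)
```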

Stochastic Production Planning in Manufacturing Systems

Authors:Dragos-Patru Covei
Date:2025-05-29 06:41:37

We extend the stochastic production planning framework to manufacturing systems, where the set of admissible production configurations is described by a general smooth convex domain $\omega$. In our setting, production operations continue as long as the production inventory $y(t)$ remains inside the capacity limits of $\omega$ and are halted once the state exits this region, i.e., $\tau = \inf\{t>0 : \Vert y(t)-x_{0}\Vert > \mathrm{dist}(x_{0},\partial\omega)\}$. The running cost is partitioned into a quadratic production cost $a(p)=\Vert p\Vert^{2}$ and an inventory holding cost modeled by a positive continuous function $b(y)$. We derive the associated Hamilton--Jacobi--Bellman (HJB) equation, verify the supermartingale property of the value function, and characterize the optimal feedback control. Techniques inspired by Lasry, Lions and Alvarez enable us to prove existence and uniqueness within this generalized production planning framework. Numerical experiments and a real-world example illustrate the practical relevance of our results.
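
For context, in this class of models with diffusion dynamics $dy = p\,dt + \sigma\,dW$ and running cost $\Vert p\Vert^{2} + b(y)$, the HJB equation typically takes the following form (a sketch only; the paper's exact normalization and boundary setup may differ):

```latex
% Sketch only: the standard HJB for this class of models, assuming the
% controlled diffusion dy = p dt + \sigma dW and running cost ||p||^2 + b(y).
\[
  \frac{\sigma^{2}}{2}\,\Delta u(y) \;-\; \frac{1}{4}\,\lVert \nabla u(y) \rVert^{2}
  \;+\; b(y) \;=\; 0 \quad \text{in } \omega,
  \qquad u = 0 \ \text{on } \partial\omega,
\]
% with the minimizing (optimal feedback) control
\[
  p^{*}(y) \;=\; -\tfrac{1}{2}\,\nabla u(y).
\]
```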

PhotoArtAgent: Intelligent Photo Retouching with Language Model-Based Artist Agents

Authors:Haoyu Chen, Keda Tao, Yizao Wang, Xinlei Wang, Lei Zhu, Jinjin Gu
Date:2025-05-29 06:00:51

Photo retouching is integral to photographic art, extending far beyond simple technical fixes to heighten emotional expression and narrative depth. While artists leverage expertise to create unique visual effects through deliberate adjustments, non-professional users often rely on automated tools that produce visually pleasing results but lack interpretative depth and interactive transparency. In this paper, we introduce PhotoArtAgent, an intelligent system that combines Vision-Language Models (VLMs) with advanced natural language reasoning to emulate the creative process of a professional artist. The agent performs explicit artistic analysis, plans retouching strategies, and outputs precise parameters to Lightroom through an API. It then evaluates the resulting images and iteratively refines them until the desired artistic vision is achieved. Throughout this process, PhotoArtAgent provides transparent, text-based explanations of its creative rationale, fostering meaningful interaction and user control. Experimental results show that PhotoArtAgent not only surpasses existing automated tools in user studies but also achieves results comparable to those of professional human artists.

Diffusion-Based Generative Models for 3D Occupancy Prediction in Autonomous Driving

Authors:Yunshen Wang, Yicheng Liu, Tianyuan Yuan, Yucheng Mao, Yingshi Liang, Xiuyu Yang, Honggang Zhang, Hang Zhao
Date:2025-05-29 05:34:22

Accurately predicting 3D occupancy grids from visual inputs is critical for autonomous driving, but current discriminative methods struggle with noisy data, incomplete observations, and the complex structures inherent in 3D scenes. In this work, we reframe 3D occupancy prediction as a generative modeling task using diffusion models, which learn the underlying data distribution and incorporate 3D scene priors. This approach enhances prediction consistency, noise robustness, and better handles the intricacies of 3D spatial structures. Our extensive experiments show that diffusion-based generative models outperform state-of-the-art discriminative approaches, delivering more realistic and accurate occupancy predictions, especially in occluded or low-visibility regions. Moreover, the improved predictions significantly benefit downstream planning tasks, highlighting the practical advantages of our method for real-world autonomous driving applications.

Redundancy Parameterization of the ABB YuMi Robot Arm

Authors:Alexander J. Elias, John T. Wen
Date:2025-05-29 05:26:15

The ABB YuMi is a 7-DOF collaborative robot arm with a complex, redundant kinematic structure. Path planning for the YuMi is challenging, especially with joint limits considered. The redundant degree of freedom is parameterized by the Shoulder-Elbow-Wrist (SEW) angle, called the arm angle by ABB, but the exact definition must be known for path planning outside the RobotStudio simulator. We provide the first complete and validated definition of the SEW angle used for the YuMi. It follows the conventional SEW angle formulation with the shoulder-elbow direction chosen to be the direction of the fourth joint axis. Our definition also specifies the shoulder location, making it compatible with any choice of reference vector. A previous attempt to define the SEW angle exists in the literature, but it is incomplete and deviates from the behavior observed in RobotStudio. Because our formulation fits within the general SEW angle framework, we also obtain the expression for the SEW angle Jacobian and complete numerical conditions for all algorithmic singularities. Finally, we demonstrate using IK-Geo, our inverse kinematics (IK) solver based on subproblem decomposition, to find all IK solutions using 2D search. Code examples are available in a publicly accessible repository.
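
For readers unfamiliar with the quantity, the conventional SEW-angle formulation that the paper's definition builds on can be sketched as follows (general formulation with an arbitrary reference vector, not the YuMi-specific choices; the joint positions in the toy usage are made up):

```python
# Sketch of the conventional SEW-angle computation: the angle of the
# shoulder-elbow-wrist plane about the shoulder-wrist axis, measured from a
# reference vector (general formulation, not the YuMi-specific definition).
import numpy as np

def sew_angle(shoulder, elbow, wrist, reference=np.array([0.0, 0.0, 1.0])):
    w = wrist - shoulder
    w = w / np.linalg.norm(w)                       # shoulder-to-wrist axis
    e = elbow - shoulder
    e_perp = e - np.dot(e, w) * w                   # elbow direction projected off the axis
    r_perp = reference - np.dot(reference, w) * w   # reference direction projected off the axis
    # Signed angle from the reference plane to the arm plane, about the S-W axis.
    return float(np.arctan2(np.dot(w, np.cross(r_perp, e_perp)),
                            np.dot(r_perp, e_perp)))

# Toy usage with made-up joint positions (meters).
psi = sew_angle(np.array([0.0, 0.0, 0.0]),
                np.array([0.2, 0.1, -0.2]),
                np.array([0.4, 0.0, -0.4]))
```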

SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model

Authors:Bowen Chen, Keyan Chen, Mohan Yang, Zhengxia Zou, Zhenwei Shi
Date:2025-05-29 02:38:34

High-resolution (HR) remote sensing imagery plays a vital role in a wide range of applications, including urban planning and environmental monitoring. However, due to limitations in sensors and data transmission links, the images acquired in practice often suffer from resolution degradation. Remote Sensing Image Super-Resolution (RSISR) aims to reconstruct HR images from low-resolution (LR) inputs, providing a cost-effective and efficient alternative to direct HR image acquisition. Existing RSISR methods primarily focus on low-level characteristics in pixel space, while neglecting the high-level understanding of remote sensing scenes. This may lead to semantically inconsistent artifacts in the reconstructed results. Motivated by this observation, our work aims to explore the role of high-level semantic knowledge in improving RSISR performance. We propose a Semantic-Guided Super-Resolution framework, SeG-SR, which leverages Vision-Language Models (VLMs) to extract semantic knowledge from input images and uses it to guide the super resolution (SR) process. Specifically, we first design a Semantic Feature Extraction Module (SFEM) that utilizes a pretrained VLM to extract semantic knowledge from remote sensing images. Next, we propose a Semantic Localization Module (SLM), which derives a series of semantic guidance from the extracted semantic knowledge. Finally, we develop a Learnable Modulation Module (LMM) that uses semantic guidance to modulate the features extracted by the SR network, effectively incorporating high-level scene understanding into the SR pipeline. We validate the effectiveness and generalizability of SeG-SR through extensive experiments: SeG-SR achieves state-of-the-art performance on two datasets and consistently delivers performance improvements across various SR architectures. Codes can be found at https://github.com/Mr-Bamboo/SeG-SR.
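
The modulation idea can be illustrated with a generic FiLM-style scale-and-shift sketch (an assumption for illustration, not the paper's exact LMM): a semantic guidance vector predicts per-channel gains and biases applied to the SR backbone features.

```python
# Generic sketch of semantic feature modulation (FiLM-style scale-and-shift).
import torch
import torch.nn as nn

class SemanticModulation(nn.Module):
    def __init__(self, feat_channels: int, sem_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(sem_dim, feat_channels)   # per-channel scale
        self.to_beta = nn.Linear(sem_dim, feat_channels)    # per-channel shift

    def forward(self, feats: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) SR backbone features; sem: (B, D) semantic guidance.
        gamma = self.to_gamma(sem).unsqueeze(-1).unsqueeze(-1)
        beta = self.to_beta(sem).unsqueeze(-1).unsqueeze(-1)
        return feats * (1.0 + gamma) + beta

# Toy usage: modulate 64-channel features with a 512-d semantic embedding.
mod = SemanticModulation(64, 512)
out = mod(torch.randn(2, 64, 32, 32), torch.randn(2, 512))
```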

Toward Memory-Aided World Models: Benchmarking via Spatial Consistency

Authors:Kewei Lian, Shaofei Cai, Yilun Du, Yitao Liang
Date:2025-05-29 01:28:57

The ability to simulate the world in a spatially consistent manner is a crucial requirement for effective world models. Such a model enables high-quality visual generation, and also ensures the reliability of world models for downstream tasks such as simulation and planning. Designing a memory module is a crucial component for addressing spatial consistency: such a model must not only retain long-horizon observational information, but also enable the construction of explicit or implicit internal spatial representations. However, there is no dataset designed to promote the development of memory modules by explicitly enforcing spatial consistency constraints. Furthermore, most existing benchmarks primarily emphasize visual coherence or generation quality, neglecting the requirement of long-range spatial consistency. To bridge this gap, we construct a dataset and corresponding benchmark by sampling 150 distinct locations within the open-world environment of Minecraft, collecting about 250 hours (20 million frames) of loop-based navigation videos with actions. Our dataset follows a curriculum design of sequence lengths, allowing models to learn spatial consistency on increasingly complex navigation trajectories. Furthermore, our data collection pipeline is easily extensible to new Minecraft environments and modules. Four representative world model baselines are evaluated on our benchmark. Dataset, benchmark, and code are open-sourced to support future research.

Semantic Exploration and Dense Mapping of Complex Environments using Ground Robots Equipped with LiDAR and Panoramic Camera

Authors:Xiaoyang Zhan, Shixin Zhou, Qianqian Yang, Yixuan Zhao, Hao Liu, Srinivas Chowdary Ramineni, Kenji Shimada
Date:2025-05-28 21:27:32

This paper presents a system for autonomous semantic exploration and dense semantic target mapping of a complex unknown environment using a ground robot equipped with a LiDAR-panoramic camera suite. Existing approaches often struggle to balance collecting high-quality observations from multiple view angles and avoiding unnecessary repetitive traversal. To fill this gap, we propose a complete system combining mapping and planning. We first redefine the task as completing both geometric coverage and semantic viewpoint observation. We then manage semantic and geometric viewpoints separately and propose a novel Priority-driven Decoupled Local Sampler to generate local viewpoint sets. This enables explicit multi-view semantic inspection and voxel coverage without unnecessary repetition. Building on this, we develop a hierarchical planner to ensure efficient global coverage. In addition, we propose a Safe Aggressive Exploration State Machine, which allows aggressive exploration behavior while ensuring the robot's safety. Our system includes a plug-and-play semantic target mapping module that integrates seamlessly with state-of-the-art SLAM algorithms for pointcloud-level dense semantic target mapping. We validate our approach through extensive experiments in both realistic simulations and complex real-world environments. Simulation results show that our planner achieves faster exploration and shorter travel distances while guaranteeing a specified number of multi-view inspections. Real-world experiments further confirm the system's effectiveness in achieving accurate dense semantic object mapping of unstructured environments.

Causal-PIK: Causality-based Physical Reasoning with a Physics-Informed Kernel

Authors:Carlota Parés-Morlans, Michelle Yi, Claire Chen, Sarah A. Wu, Rika Antonova, Tobias Gerstenberg, Jeannette Bohg
Date:2025-05-28 20:51:12

Tasks that involve complex interactions between objects with unknown dynamics make planning before execution difficult. These tasks require agents to iteratively improve their actions after actively exploring causes and effects in the environment. For these types of tasks, we propose Causal-PIK, a method that leverages Bayesian optimization to reason about causal interactions via a Physics-Informed Kernel to help guide efficient search for the best next action. Experimental results on the Virtual Tools and PHYRE physical reasoning benchmarks show that Causal-PIK outperforms state-of-the-art results, requiring fewer actions to reach the goal. We also compare Causal-PIK to human studies, including results from a new user study we conducted on the PHYRE benchmark. We find that Causal-PIK remains competitive on tasks that are very challenging, even for human problem-solvers.
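
A bare-bones sketch of Bayesian optimization with a custom, physics-derived kernel is shown below. It is illustrative only: `physics_features`, the kernel form, the UCB acquisition, and the toy reward are stand-ins, not Causal-PIK's actual components.

```python
# Illustrative Bayesian optimization loop with a custom kernel computed on
# physics-derived features of each candidate action (not Causal-PIK itself).
import numpy as np

def physics_features(action: np.ndarray) -> np.ndarray:
    # Placeholder for simulated outcome descriptors (e.g. contact point, momentum transfer).
    return np.array([action[0] ** 2, np.sin(action[1])])

def kernel(a: np.ndarray, b: np.ndarray, length: float = 1.0) -> float:
    d = physics_features(a) - physics_features(b)
    return float(np.exp(-0.5 * np.dot(d, d) / length ** 2))

def gp_posterior(X, y, x_new, noise=1e-4):
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X]) + noise * np.eye(len(X))
    k_star = np.array([kernel(xi, x_new) for xi in X])
    alpha = np.linalg.solve(K, y)
    mean = float(k_star @ alpha)
    var = kernel(x_new, x_new) - float(k_star @ np.linalg.solve(K, k_star))
    return mean, max(var, 1e-9)

def next_action(X, y, rng, n_candidates=256, beta=2.0):
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, 2))
    scores = [m + beta * np.sqrt(v) for m, v in (gp_posterior(X, y, c) for c in candidates)]
    return candidates[int(np.argmax(scores))]

# Toy loop: rewards come from an unknown function of the action.
rng = np.random.default_rng(0)
X = [rng.uniform(-1, 1, 2) for _ in range(3)]
y = np.array([-np.sum(x ** 2) for x in X])
for _ in range(5):
    a = next_action(X, y, rng)
    X.append(a)
    y = np.append(y, -np.sum(a ** 2))
```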

RocqStar: Leveraging Similarity-driven Retrieval and Agentic Systems for Rocq generation

Authors:Nikita Khramov, Andrei Kozyrev, Gleb Solovev, Anton Podkopaev
Date:2025-05-28 20:26:11

Interactive Theorem Proving has repeatedly been shown to be fruitful when combined with Generative Artificial Intelligence. This paper assesses multiple approaches to Rocq generation and illuminates potential avenues for improvement. We highlight the importance of thorough premise selection for generating Rocq proofs and propose a novel approach, leveraging retrieval via a self-attentive embedder model. The evaluation of the designed approach shows up to a 28% relative increase in the generator's performance. We tackle the problem of writing Rocq proofs using a multi-stage agentic system, tailored for formal verification, and demonstrate its high effectiveness. We conduct an ablation study and show the use of multi-agent debate at the planning stage of proof synthesis.
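
The retrieval step can be pictured generically as ranking candidate premises by embedding similarity to the goal, as in the sketch below (the random embeddings, `lemma_*` names, and `top_k` are placeholders for the self-attentive embedder and a real premise database):

```python
# Generic sketch of similarity-driven premise retrieval (illustrative only).
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieve_premises(goal_emb: np.ndarray, premise_embs: dict, top_k: int = 5):
    ranked = sorted(premise_embs.items(),
                    key=lambda kv: cosine(goal_emb, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Toy usage with random embeddings standing in for the self-attentive embedder.
rng = np.random.default_rng(0)
premises = {f"lemma_{i}": rng.standard_normal(128) for i in range(50)}
selected = retrieve_premises(rng.standard_normal(128), premises, top_k=3)
```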

Assembly in Directed Hypergraphs

Authors:Christoph Flamm, Daniel Merkle, Peter F. Stadler
Date:2025-05-28 20:10:22

Assembly theory has received considerable attention in the recent past. Here we analyze the formal framework of this model and show that assembly pathways coincide with certain minimal hyperpaths in B-hypergraphs. This makes it possible to generalize the notion of assembly to general chemical reaction systems and to make explicit the connection to rule-based models of chemistry, in particular DPO graph rewriting. We observe, furthermore, that assembly theory is closely related to retrosynthetic analysis in chemistry. The assembly index fits seamlessly into a large family of cost measures for directed hyperpath problems that also encompasses cost functions used in computational synthesis planning. This allows us to devise a generic approach to compute complexity measures derived from minimal hyperpaths in rule-derived directed hypergraphs using integer linear programming.

Enhancing Lifelong Multi-Agent Path-finding by Using Artificial Potential Fields

Authors:Arseniy Pertzovsky, Roni Stern, Ariel Felner, Roie Zivan
Date:2025-05-28 18:13:10

We explore the use of Artificial Potential Fields (APFs) to solve Multi-Agent Path Finding (MAPF) and Lifelong MAPF (LMAPF) problems. In MAPF, a team of agents must move to their goal locations without collisions, whereas in LMAPF, new goals are generated upon arrival. We propose methods for incorporating APFs in a range of MAPF algorithms, including Prioritized Planning, MAPF-LNS2, and Priority Inheritance with Backtracking (PIBT). Experimental results show that using APF is not beneficial for MAPF but yields up to a 7-fold increase in overall system throughput for LMAPF.
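
The core idea can be sketched as follows (an illustrative toy, not the paper's exact formulation): build a decaying potential field around the paths of already-planned, higher-priority agents and add it to the step cost used when planning the next agent.

```python
# Toy sketch: artificial potential field built from already-planned paths,
# added to the step cost of the next agent's search.
import numpy as np

def apf_from_paths(grid_shape, planned_paths, radius=3, strength=5.0):
    """Sum of decaying potentials around every cell occupied by already-planned agents."""
    field = np.zeros(grid_shape)
    for path in planned_paths:
        for (x, y) in path:
            for dx in range(-radius, radius + 1):
                for dy in range(-radius, radius + 1):
                    nx, ny = x + dx, y + dy
                    if 0 <= nx < grid_shape[0] and 0 <= ny < grid_shape[1]:
                        dist = abs(dx) + abs(dy)
                        if dist <= radius:
                            field[nx, ny] += strength * (radius - dist) / radius
    return field

def step_cost(cell, field, base=1.0):
    return base + field[cell]        # APF penalty steers the search away from congestion

# Toy usage: two already-planned agents on a 20 x 20 grid.
field = apf_from_paths((20, 20), [[(5, i) for i in range(10)], [(i, 15) for i in range(10)]])
cost = step_cost((5, 3), field)
```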

LabUtopia: High-Fidelity Simulation and Hierarchical Benchmark for Scientific Embodied Agents

Authors:Rui Li, Zixuan Hu, Wenxi Qu, Jinouwen Zhang, Zhenfei Yin, Sha Zhang, Xuantuo Huang, Hanqing Wang, Tai Wang, Jiangmiao Pang, Wanli Ouyang, Lei Bai, Wangmeng Zuo, Ling-Yu Duan, Dongzhan Zhou, Shixiang Tang
Date:2025-05-28 17:50:53

Scientific embodied agents play a crucial role in modern laboratories by automating complex experimental workflows. Compared to typical household environments, laboratory settings impose significantly higher demands on perception of physical-chemical transformations and long-horizon planning, making them an ideal testbed for advancing embodied intelligence. However, progress in this area has long been hampered by the lack of a suitable simulator and benchmarks. In this paper, we address this gap by introducing LabUtopia, a comprehensive simulation and benchmarking suite designed to facilitate the development of generalizable, reasoning-capable embodied agents in laboratory settings. Specifically, it integrates i) LabSim, a high-fidelity simulator supporting multi-physics and chemically meaningful interactions; ii) LabScene, a scalable procedural generator for diverse scientific scenes; and iii) LabBench, a hierarchical benchmark spanning five levels of complexity from atomic actions to long-horizon mobile manipulation. LabUtopia supports 30 distinct tasks and includes more than 200 scene and instrument assets, enabling large-scale training and principled evaluation in high-complexity environments. We demonstrate that LabUtopia offers a powerful platform for advancing the integration of perception, planning, and control in scientific-purpose agents and provides a rigorous testbed for exploring the practical capabilities and generalization limits of embodied intelligence in future research.

HDDLGym: A Tool for Studying Multi-Agent Hierarchical Problems Defined in HDDL with OpenAI Gym

Authors:Ngoc La, Ruaridh Mon-Williams, Julie A. Shah
Date:2025-05-28 17:10:43

In recent years, reinforcement learning (RL) methods have been widely tested using tools like OpenAI Gym, though many tasks in these environments could also benefit from hierarchical planning. However, there is a lack of a tool that enables seamless integration of hierarchical planning with RL. Hierarchical Domain Definition Language (HDDL), used in classical planning, introduces a structured approach well-suited for model-based RL to address this gap. To bridge this integration, we introduce HDDLGym, a Python-based tool that automatically generates OpenAI Gym environments from HDDL domains and problems. HDDLGym serves as a link between RL and hierarchical planning, supporting multi-agent scenarios and enabling collaborative planning among agents. This paper provides an overview of HDDLGym's design and implementation, highlighting the challenges and design choices involved in integrating HDDL with the Gym interface, and applying RL policies to support hierarchical planning. We also provide detailed instructions and demonstrations for using the HDDLGym framework, including how to work with existing HDDL domains and problems from International Planning Competitions, exemplified by the Transport domain. Additionally, we offer guidance on creating new HDDL domains for multi-agent scenarios and demonstrate the practical use of HDDLGym in the Overcooked domain. By leveraging the advantages of HDDL and Gym, HDDLGym aims to be a valuable tool for studying RL in hierarchical planning, particularly in multi-agent contexts.

Articulatory modeling of the S-shaped F2 trajectories observed in Öhman's spectrographic analysis of VCV syllables

Authors:Frédéric Berthommier
Date:2025-05-28 15:12:53

The synthesis of Öhman's VCV sequences with intervocalic plosive consonants was first achieved 30 years ago using the DRM model. However, this approach remains primarily acoustic and lacks articulatory constraints. In this study, the same 75 VCVs are analyzed, but generated with the Maeda model, using trajectory planning that differentiates vowel-to-vowel transitions from consonantal influences. Synthetic data exhibit similar characteristics to Öhman's sequences, including the presence of S-shaped F2 trajectories. Furthermore, locus equations (LEs) for F2 and F3 are computed from synthetic CV data to investigate their underlying determinism, leading to a reassessment of conventional interpretations. The findings indicate that, although articulatory planning is structured separately for vowel and consonant groups, S-shaped F2 trajectories emerge from a composite mechanism governed by the coordinated synergy of all articulators.
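
For context, a locus equation in the conventional sense (the quantity the abstract computes for F2 and F3 from the synthetic CV data) is a per-consonant linear regression of the formant frequency at transition onset against its value at the vowel target:

```latex
% Standard locus-equation definition, given for context (the paper's exact
% measurement points may differ): a per-consonant linear fit across vowels.
\[
  F2_{\text{onset}} \;=\; k \, F2_{\text{vowel}} \;+\; c,
\]
% where the slope k indexes the degree of consonant-vowel coarticulation
% (k near 0: fixed consonantal locus; k near 1: strong assimilation to the vowel).
```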

GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control

Authors:Anthony Chen, Wenzhao Zheng, Yida Wang, Xueyang Zhang, Kun Zhan, Peng Jia, Kurt Keutzer, Shanghang Zhang
Date:2025-05-28 14:46:51

Recent advancements in world models have revolutionized dynamic environment simulation, allowing systems to foresee future states and assess potential actions. In autonomous driving, these capabilities help vehicles anticipate the behavior of other road users, perform risk-aware planning, accelerate training in simulation, and adapt to novel scenarios, thereby enhancing safety and reliability. However, current approaches either fail to maintain robust 3D geometric consistency or accumulate artifacts during occlusion handling, both critical for reliable safety assessment in autonomous navigation tasks. To address this, we introduce GeoDrive, which explicitly integrates robust 3D geometry conditions into driving world models to enhance spatial understanding and action controllability. Specifically, we first extract a 3D representation from the input frame and then obtain its 2D rendering based on the user-specified ego-car trajectory. To enable dynamic modeling, we propose a dynamic editing module during training to enhance the renderings by editing the positions of the vehicles. Extensive experiments demonstrate that our method significantly outperforms existing models in both action accuracy and 3D spatial awareness, leading to more realistic, adaptable, and reliable scene modeling for safer autonomous driving. Additionally, our model can generalize to novel trajectories and offers interactive scene editing capabilities, such as object editing and object trajectory control.

Complete Catalog of Laser Locking Configurations for LISA

Authors:Gerhard Heinzel, Javier Álvarez-Vizoso, Miguel Dovale-Álvarez
Date:2025-05-28 14:37:13

The Laser Interferometer Space Antenna (LISA) will enable direct observations of low-frequency gravitational waves, offering unprecedented insight into astrophysical and cosmological phenomena. LISA's heterodyne interferometric measurement system requires phase-locking five of its six onboard lasers with tunable frequency offsets to ensure that all beatnotes remain within the metrology system's operational range, despite Doppler-induced frequency shifts. The selection of these offset frequencies -- collectively forming a frequency plan -- is a complex optimization problem constrained by the spacecraft's orbital dynamics and instrument limitations. While previous work established an algorithmic solution for deriving time-dependent frequency plans, this study takes a complementary approach by systematically analyzing and cataloging all possible laser locking configurations. We present an automated method to explore, validate, and classify viable locking schemes, identifying 36 unique non-frequency-swapping configurations and 72 additional frequency-swapping configurations for an arbitrary choice of primary laser. This exhaustive classification provides a foundation for frequency planning across the full range of operational scenarios.