planning - 2025-05-28

Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

Authors:Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Zheng Huang, Mingyu Liu, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, Chunhua Shen

Date:2025-05-27 17:29:31

Active vision, also known as active perception, refers to the process of actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. Recently, the use of Multimodal Large Language Models (MLLMs) as central planning and decision-making modules in robotic systems has gained extensive attention. However, despite the importance of active perception in embodied intelligence, there is little to no exploration of how MLLMs can be equipped with or learn active perception capabilities. In this paper, we first provide a systematic definition of MLLM-based active perception tasks. We point out that the recently proposed GPT-o3 model's zoom-in search strategy can be regarded as a special case of active perception; however, it still suffers from low search efficiency and inaccurate region selection. To address these issues, we propose ACTIVE-O3, a purely reinforcement learning based training framework built on top of GRPO, designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 across both general open-world tasks, such as small-object and dense object grounding, and domain-specific scenarios, including small object detection in remote sensing and autonomous driving, as well as fine-grained interactive segmentation. In addition, ACTIVE-O3 also demonstrates strong zero-shot reasoning abilities on the V* Benchmark, without relying on any explicit reasoning data. We hope that our work can provide a simple codebase and evaluation protocol to facilitate future research on active perception in MLLMs.
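As a hedged illustration of the GRPO ingredient named in the abstract: GRPO scores each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value baseline. The sketch below shows only that group-normalization step. The function name `group_relative_advantages`, the reward values, and the choice of population standard deviation are assumptions made for illustration; this is not the ACTIVE-O3 training code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group (an assumption; implementations differ)
    # eps guards against division by zero when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four candidate zoom-in regions proposed for one image.
advantages = group_relative_advantages([0.0, 1.0, 1.0, 2.0])
```

Responses that score above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged.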

Automation of a Matching On-Shell Calculator

Authors:Javier López Miras, Fuensanta Vilches

Date:2025-05-27 15:46:42

We introduce $\texttt{mosca}$, a $\texttt{Mathematica}$ package designed to facilitate on-shell calculations in effective field theories (EFTs). This initial release focuses on the reduction of Green's bases to physical bases, as well as transformations between arbitrary operator bases. The core of the package is based on a diagrammatic on-shell matching procedure, grounded in the equivalence of physical observables derived from both redundant and non-redundant Lagrangians. $\texttt{mosca}$ offers a complete set of tools for performing basis transformations, diagram isomorphism detection, numerical substitution of kinematic configurations, and symbolic manipulation of algebraic expressions. Planned future developments include extension to one-loop computations, thus providing support for EFT renormalization directly in a physical basis and automated computation of one-loop finite matching, including contributions from evanescent operators. The package, along with example notebooks and documentation, is available at: https://gitlab.com/matchingonshell/mosca.

robostrategy: Field and Target Assignment Optimization in the Sloan Digital Sky Survey V

Authors:Michael R. Blanton, Joleen K. Carlberg, Tom Dwelly, Ilija Medan, S. Drew Chojnowski, Kevin Covey, Megan C. Davis, John Donor, Pramod Gupta, Alexander Ji, Jennifer A. Johnson, Juna A. Kollmeier, Jose Sanchez-Gallego, Conor Sayres, Eleonora Zari

Date:2025-05-27 15:23:49

We present an algorithmic method for efficiently planning a long-term, large-scale multi-object spectroscopy program. The Sloan Digital Sky Survey V (SDSS-V) Focal Plane System performs multi-object spectroscopy using 500 robotic positioners to place fibers feeding optical and infrared spectrographs across a wide field. SDSS-V uses this system to observe targets throughout the year at two observatories in support of the science goals of its Milky Way Mapper and Black Hole Mapper programs. These science goals require observations of objects over time with preferred temporal spacings (referred to as "cadences"), which can differ from object to object even in the same area of sky. robostrategy is the software we use to construct our planned observations so that they can best achieve the desired goals given the time available as a function of sky brightness and local sidereal time, and to assign fibers to targets during specific observations. We use linear programming techniques to seek optimal allocations of time under the constraints given. We present the methods and example results obtained with this software.

Collision Probability Estimation for Optimization-based Vehicular Motion Planning

Authors:Leon Tolksdorf, Arturo Tejada, Christian Birkner, Nathan van de Wouw

Date:2025-05-27 13:16:03

Many motion planning algorithms for automated driving require estimating the probability of collision (POC) to account for uncertainties in the measurement and estimation of the motion of road users. Common POC estimation techniques often utilize sampling-based methods that suffer from computational inefficiency and a non-deterministic estimation, i.e., each estimation result for the same inputs is slightly different. In contrast, optimization-based motion planning algorithms require computationally efficient POC estimation, ideally using deterministic estimation, such that typical optimization algorithms for motion planning retain feasibility. Estimating the POC analytically, however, is challenging because it depends on understanding the collision conditions (e.g., vehicle's shape) and characterizing the uncertainty in motion prediction. In this paper, we propose an approach in which we estimate the POC between two vehicles by over-approximating their shapes by a multi-circular shape approximation. The position and heading of the predicted vehicle are modelled as random variables, contrasting with the literature, where the heading angle is often neglected. We guarantee that the provided POC is an over-approximation, which is essential in providing safety guarantees, and present a computationally efficient algorithm for computing the POC estimate for Gaussian uncertainty in the position and heading. This algorithm is then used in a path-following stochastic model predictive controller (SMPC) for motion planning. With the proposed algorithm, the SMPC generates reproducible trajectories while the controller retains its feasibility in the presented test cases and demonstrates the ability to handle varying levels of uncertainty.
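To make the multi-circular over-approximation concrete, here is a minimal geometric sketch. It covers a rectangular vehicle footprint with equal circles placed along its longitudinal axis and flags a potential collision whenever any two circles overlap. The function names, the number of circles, and the vehicle dimensions are hypothetical; the paper's exact circle placement and its probabilistic computation are not reproduced here.

```python
import math

def covering_circles(x, y, heading, length, width, n=3):
    """Return (centers, radius) of n equal circles whose union contains the rectangle."""
    seg = length / n
    # Each circle must reach the corners of its own length/n-by-width slice.
    radius = math.hypot(seg / 2.0, width / 2.0)
    centers = []
    for i in range(n):
        offset = -length / 2.0 + (i + 0.5) * seg  # position along the vehicle axis
        centers.append((x + offset * math.cos(heading),
                        y + offset * math.sin(heading)))
    return centers, radius

def circles_overlap(veh_a, veh_b):
    """Conservative collision check: True if any pair of covering circles intersects."""
    (centers_a, r_a), (centers_b, r_b) = veh_a, veh_b
    return any(math.dist(p, q) <= r_a + r_b for p in centers_a for q in centers_b)

# Hypothetical 4.5 m x 1.8 m vehicles, both heading along the x-axis.
ego = covering_circles(0.0, 0.0, 0.0, 4.5, 1.8)
far_vehicle = covering_circles(10.0, 0.0, 0.0, 4.5, 1.8)
adjacent_vehicle = covering_circles(0.0, 2.0, 0.0, 4.5, 1.8)
```

Because the circles contain the true footprint, the check can report a collision where the rectangles are merely close (as with `adjacent_vehicle`, which is 0.2 m clear of `ego`), but it never misses a real overlap. That one-sided error is what makes a probability computed on the circles an upper bound.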

A Reduction-Driven Local Search for the Generalized Independent Set Problem

Authors:Yiping Liu, Yi Zhou, Zhenxiang Xu, Mingyu Xiao, Jin-Kao Hao

Date:2025-05-27 11:39:05

The Generalized Independent Set (GIS) problem extends the classical maximum independent set problem by incorporating profits for vertices and penalties for edges. This generalized problem has been identified in diverse applications in fields such as forest harvest planning, competitive facility location, social network analysis, and even machine learning. However, solving the GIS problem in large-scale, real-world networks remains computationally challenging. In this paper, we explore data reduction techniques to address this challenge. We first propose 14 reduction rules that can reduce the input graph with rigorous optimality guarantees. We then present a reduction-driven local search (RLS) algorithm that integrates these reduction rules into the pre-processing, the initial solution generation, and the local search components in a computationally efficient way. The RLS is empirically evaluated on 278 graphs arising from different application scenarios. The results indicate that the RLS is highly competitive: for most graphs, it achieves significantly superior solutions compared to other known solvers, and it effectively provides solutions for graphs exceeding 260 million edges, a task at which every other known method fails. Analysis also reveals that the data reduction plays a key role in achieving such a competitive performance.
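The objective described in the first sentence can be sketched as follows: a candidate vertex set earns the profits of its vertices and pays a penalty for every edge whose two endpoints are both selected. This is a simplified reading in which every edge is penalized; the function name `gis_objective` and the toy instance are hypothetical and do not come from the paper.

```python
def gis_objective(selected, profit, penalty):
    """Net benefit: vertex profits minus penalties of edges with both endpoints selected."""
    chosen = set(selected)
    gain = sum(profit[v] for v in chosen)
    loss = sum(p for (u, v), p in penalty.items() if u in chosen and v in chosen)
    return gain - loss

# Hypothetical three-vertex instance.
profit = {"a": 5, "b": 4, "c": 3}
penalty = {("a", "b"): 6, ("b", "c"): 1}

independent_choice = gis_objective({"a", "c"}, profit, penalty)  # no penalized edge
violating_choice = gis_objective({"b", "c"}, profit, penalty)    # pays the (b, c) penalty
```

In this toy instance the independent set {a, c} scores higher, but a cheap penalty can make it worthwhile to select both endpoints of an edge, which is what separates GIS from the classical problem.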

Cardiac Digital Twins at Scale from MRI: Open Tools and Representative Models from ~55000 UK Biobank Participants

Authors:Devran Ugurlu, Shuang Qian, Elliot Fairweather, Charlene Mauger, Bram Ruijsink, Laura Dal Toso, Yu Deng, Marina Strocchi, Reza Razavi, Alistair Young, Pablo Lamata, Steven Niederer, Martin Bishop

Date:2025-05-27 10:52:52

A cardiac digital twin is a virtual replica of a patient's heart for screening, diagnosis, prognosis, risk assessment, and treatment planning of cardiovascular diseases. This requires an anatomically accurate patient-specific 3D structural representation of the heart, suitable for electro-mechanical simulations or study of disease mechanisms. However, generation of cardiac digital twins at scale is demanding and there are no public repositories of models across demographic groups. We describe an automatic open-source pipeline for creating patient-specific left and right ventricular meshes from cardiovascular magnetic resonance images, its application to a large cohort of ~55000 participants from UK Biobank, and the construction of the most comprehensive cohort of adult heart models to date, comprising 1423 representative meshes across sex (male, female), body mass index (range: 16 - 42 kg/m$^2$) and age (range: 49 - 80 years). Our code is available at https://github.com/cdttk/biv-volumetric-meshing/tree/plos2025 , and pre-trained networks, representative volumetric meshes with fibers and UVCs will be made available soon.

RefAV: Towards Planning-Centric Scenario Mining

Authors:Cainan Davidson, Deva Ramanan, Neehar Peri

Date:2025-05-27 10:14:35

Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques are error-prone and prohibitively time-consuming, often relying on hand-crafted structured queries. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of 10,000 diverse natural language queries that describe complex multi-agent interactions relevant to motion planning derived from 1000 driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges. Our code and dataset are available at https://github.com/CainanD/RefAV/ and https://argoverse.github.io/user-guide/tasks/scenario_mining.html

Towards Conversational Development Environments: Using Theory-of-Mind and Multi-Agent Architectures for Requirements Refinement

Authors:Keheliya Gallaba, Ali Arabat, Dayi Lin, Mohammed Sayagh, Ahmed E. Hassan

Date:2025-05-27 10:05:26

Foundation Models (FMs) have shown remarkable capabilities in various natural language tasks. However, their ability to accurately capture stakeholder requirements remains a significant challenge for using FMs for software development. This paper introduces a novel approach that leverages an FM-powered multi-agent system called AlignMind to address this issue. By having a cognitive architecture that enhances FMs with Theory-of-Mind capabilities, our approach considers the mental states and perspectives of software makers. This allows our solution to iteratively clarify the beliefs, desires, and intentions of stakeholders, translating these into a set of refined requirements and a corresponding actionable natural language workflow in the often-overlooked requirements refinement phase of software engineering, which is crucial after initial elicitation. Through a multifaceted evaluation covering 150 diverse use cases, we demonstrate that our approach can accurately capture the intents and requirements of stakeholders, articulating them as both specifications and a step-by-step plan of action. Our findings suggest that the potential for significant improvements in the software development process justifies these investments. Our work lays the groundwork for future innovation in building intent-first development environments, where software makers can seamlessly collaborate with AIs to create software that truly meets their needs.

Construction, Commissioning, and Installation of the Cylindrical GEM Inner Tracker of the BESIII Experiment

Authors:Stefano Gramigna

Date:2025-05-27 09:49:30

BESIII (BEijing Spectrometer III) is a particle physics experiment with a vast physics program centered around the study of charmonium and the $\tau$ lepton. The performance of the spectrometer's inner tracker, the innermost part of a large drift chamber, has been degrading due to aging phenomena related to the large particle rate. Planned upgrades to the BEPCII (Beijing Electron Positron Collider II) collider, servicing the experiment, may further aggravate the problem, with the risk of disrupting the data taking. The Italian component of the BESIII collaboration proposed a detector based on cylindrical GEM (Gas Electron Multiplier) technology to replace the aging inner tracker. The new detector aims to improve the current tracker's spatial resolution in the beam direction by at least a factor of 2 and to ensure the continuation of BESIII's data taking until its end in 2030. After more than 10 years of design, development, and construction, the three layers of the CGEM-IT (Cylindrical GEM Inner Tracker) are finally being installed in the spectrometer. This thesis describes the three final years of the detector's development, which led from diagnosing and resolving mechanical issues preventing the largest layer from powering on to securing the approval of the experiment's internal review committee for installation. Particular focus is given to the technological solutions adopted to overcome the challenges encountered during the development process, which often required a complete rethinking of previous methods and procedures. The thesis concludes with a snapshot of the ongoing installation of the detector, commenting on the results of the preparatory work undertaken to ensure its success.

Two-step dimensionality reduction of human mobility data: From potential landscapes to spatiotemporal insights

Authors:Yunhan Du, Takaaki Aoki, Naoya Fujiwara

Date:2025-05-27 09:19:58

Understanding the spatiotemporal patterns of human mobility is crucial for addressing societal challenges, such as epidemic control and urban transportation optimization. Despite advancements in data collection, the complexity and scale of mobility data continue to pose significant analytical challenges. Existing methods often result in losing location-specific details and fail to fully capture the intricacies of human movement. This study proposes a two-step dimensionality reduction framework to overcome existing limitations. First, we construct a potential landscape of human flow from origin-destination (OD) matrices using combinatorial Hodge theory, preserving essential spatial and structural information while enabling an intuitive visualization of flow patterns. Second, we apply principal component analysis (PCA) to the potential landscape, systematically identifying major spatiotemporal patterns. By implementing this two-step reduction method, we reveal significant shifts during a pandemic, characterized by an overall decline in mobility and stark contrasts between weekdays and holidays. These findings underscore the effectiveness of our framework in uncovering complex mobility patterns and provide valuable insights into urban planning and public health interventions.

COM Adjustment Mechanism Control for Multi-Configuration Motion Stability of Unmanned Deformable Vehicle

Authors:Jun Liu, Hongxun Liu, Cheng Zhang, Jiandang Xing, Shang Jiang, Ping Jiang

Date:2025-05-27 09:16:56

An unmanned deformable vehicle is a wheel-legged robot that transforms between two configurations, a vehicular state and a humanoid state, each with different motion modes and stability characteristics. To address motion stability in multiple configurations, a center-of-mass adjustment mechanism was designed. Further, a motion stability hierarchical control algorithm was proposed, and an electromechanical model based on a two-degree-of-freedom center-of-mass adjustment mechanism was established. A steady-state steering dynamics model for the vehicular state and a gait planning kinematic model for humanoid-state walking were established. A stability hierarchical control strategy was designed to realize the stability control. The results showed that the steady-state steering stability in vehicular state and the walking stability in humanoid state could be significantly improved by controlling the slider motion.

Generalized Coordination of Partially Cooperative Urban Traffic

Authors:Max Bastian Mertens, Michael Buchholz

Date:2025-05-27 08:25:57

Vehicle-to-anything connectivity, especially for autonomous vehicles, promises to increase passenger comfort and safety of road traffic, for example, by sharing perception and driving intention. Cooperative maneuver planning uses connectivity to enhance traffic efficiency, which has, so far, been mainly considered for automated intersection management. In this article, we present a novel cooperative maneuver planning approach that is generalized to various situations found in urban traffic. Our framework handles challenging mixed traffic, that is, traffic comprising both cooperative connected vehicles and other vehicles at any distribution. Our solution is based on an optimization approach accompanied by an efficient heuristic method for high-load scenarios. We extensively evaluate the proposed planner in a distinctly realistic simulation framework and show significant efficiency gains already at a cooperation rate of 40%. Traffic throughput increases, while the average waiting time and the number of stopped vehicles are reduced, without impacting traffic safety.

Frame-Level Captions for Long Video Generation with Complex Multi Scenes

Authors:Guangcong Zheng, Jianlong Yuan, Bo Wang, Haoyang Huang, Guoqing Ma, Nan Duan

Date:2025-05-27 07:39:43

Generating long videos that can show complex stories, like movie scenes from scripts, has great promise and offers much more than short clips. However, current methods that use autoregression with diffusion models often struggle because their step-by-step process naturally leads to a serious error accumulation (drift). Also, many existing ways to make long videos focus on single, continuous scenes, making them less useful for stories with many events and changes. This paper introduces a new approach to solve these problems. First, we propose a novel way to annotate datasets at the frame-level, providing detailed text guidance needed for making complex, multi-scene long videos. This detailed guidance works with a Frame-Level Attention Mechanism to make sure text and video match precisely. A key feature is that each part (frame) within these windows can be guided by its own distinct text prompt. Our training uses Diffusion Forcing to provide the model with the ability to handle time flexibly. We tested our approach on difficult VBench 2.0 benchmarks ("Complex Plots" and "Complex Landscapes") based on the WanX2.1-T2V-1.3B model. The results show our method is better at following instructions in complex, changing scenes and creates high-quality long videos. We plan to share our dataset annotation methods and trained models with the research community. Project page: https://zgctroy.github.io/frame-level-captions .

Spatial RoboGrasp: Generalized Robotic Grasping Control Policy

Authors:Yiqi Huang, Travis Davies, Jiahuan Yan, Jiankai Sun, Xiang Chen, Luhui Hu

Date:2025-05-27 07:22:33

Achieving generalizable and precise robotic manipulation across diverse environments remains a critical challenge, largely due to limitations in spatial perception. While prior imitation-learning approaches have made progress, their reliance on raw RGB inputs and handcrafted features often leads to overfitting and poor 3D reasoning under varied lighting, occlusion, and object conditions. In this paper, we propose a unified framework that couples robust multimodal perception with reliable grasp prediction. Our architecture fuses domain-randomized augmentation, monocular depth estimation, and a depth-aware 6-DoF Grasp Prompt into a single spatial representation for downstream action planning. Conditioned on this encoding and a high-level task prompt, our diffusion-based policy yields precise action sequences, achieving up to 40% improvement in grasp success and 45% higher task success rates under environmental variation. These results demonstrate that spatially grounded perception, paired with diffusion-based imitation learning, offers a scalable and robust solution for general-purpose robotic grasping.

An Empirical Study of Conjugate Gradient Preconditioners for Solving Symmetric Positive Definite Systems of Linear Equations

Authors:Marc A. Tunnell, David F. Gleich

Date:2025-05-27 04:05:16

Despite hundreds of papers on preconditioned linear systems of equations, there remains a significant lack of comprehensive performance benchmarks comparing various preconditioners for solving symmetric positive definite (SPD) systems. In this paper, we present a comparative study of 79 matrices using a broad range of preconditioners. Specifically, we evaluate 10 widely used preconditioners across 108 configurations to assess their relative performance against using no preconditioner. Our focus is on preconditioners that are commonly used in practice, are available in major software packages, and can be utilized as black-box tools without requiring significant \textit{a priori} knowledge. In addition, we compare these against a selection of classical methods. We primarily compare them without regard to the effort needed to compute the preconditioner. Our results show that symmetric positive definite systems are most likely to benefit from incomplete symmetric factorizations, such as incomplete Cholesky (IC). Multigrid methods occasionally do exceptionally well. Simple classical techniques, symmetric Gauss Seidel and symmetric SOR, are not productive. We find that including preconditioner construction costs significantly diminishes the advantages of iterative methods compared to direct solvers, although tuned IC methods often still outperform direct methods. Additionally, ordering strategies such as approximate minimum degree significantly enhance IC effectiveness. We plan to expand the benchmark with larger matrices, additional solvers, and detailed metrics to provide actionable information on SPD preconditioning.
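For readers unfamiliar with how a preconditioner enters the conjugate gradient iteration, the following is a minimal dense, pure-Python sketch using the diagonal (Jacobi) preconditioner. Jacobi is chosen here only because it fits in a few lines; it is not claimed to be one of the configurations benchmarked in the paper, and the function name `pcg_jacobi` and the 3x3 test matrix are hypothetical.

```python
def pcg_jacobi(A, b, tol=1e-10, max_iter=1000):
    """Preconditioned CG for an SPD matrix A (list of rows), with M = diag(A)."""
    n = len(b)

    def matvec(x):
        return [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    inv_diag = [1.0 / A[i][i] for i in range(n)]  # applying M^{-1} is a diagonal scaling
    x = [0.0] * n
    r = list(b)                                   # residual b - A x for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]    # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    for k in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:
            return x, k
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter

# Small hypothetical SPD system.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x, iterations = pcg_jacobi(A, b)
```

Replacing the `inv_diag` scaling with a solve against an incomplete Cholesky factor gives the IC-preconditioned variant discussed in the abstract; only the two lines that compute `z` would change.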

DriveRX: A Vision-Language Reasoning Model for Cross-Task Autonomous Driving

Authors:Muxi Diao, Lele Yang, Hongbo Yin, Zhexu Wang, Yejie Wang, Daxin Tian, Kongming Liang, Zhanyu Ma

Date:2025-05-27 03:21:04

Autonomous driving requires real-time, robust reasoning across perception, prediction, planning, and behavior. However, conventional end-to-end models fail to generalize in complex scenarios due to the lack of structured reasoning. Recent vision-language models (VLMs) have been applied to driving tasks, but they typically rely on isolated modules and static supervision, limiting their ability to support multi-stage decision-making. We present AutoDriveRL, a unified training framework that formulates autonomous driving as a structured reasoning process over four core tasks. Each task is independently modeled as a vision-language question-answering problem and optimized using task-specific reward models, enabling fine-grained reinforcement signals at different reasoning stages. Within this framework, we train DriveRX, a cross-task reasoning VLM designed for real-time decision-making. DriveRX achieves strong performance on a public benchmark, outperforming GPT-4o in behavior reasoning and demonstrating robustness under complex or corrupted driving conditions. Our analysis further highlights the impact of vision encoder design and reward-guided reasoning compression. We will release the AutoDriveRL framework and the DriveRX model to support future research.

Scaling law of urban individual tour behavior

Authors:Xu-Jie Lin, Yitao Yang, Wei-Peng Nie, Xiao-Yong Yan

Date:2025-05-26 23:51:00

Analysing and modeling urban individual tour behavior are of great significance for a wide range of applications such as transportation management and urban planning. However, the urban tour length distribution has long been neglected in the individual mobility models found in recent literature. To fill this gap, we analyse Foursquare user check-in data and find that the urban human tour length distribution follows a power law with exponential truncation. This law also appears in China's urban heavy truck trajectory data, which can reflect commercial vehicle mobility. To reproduce the universal scaling law on tours of urban human and truck mobility, we introduce a tour terminate-continue model. Our model reproduces not only the urban tour length distribution but also Heaps' law and Zipf's law in human and truck mobility, providing a new perspective for characterizing individual human mobility.
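A power law with exponential truncation is commonly written as P(l) proportional to l^(-alpha) * exp(-l / kappa). The sketch below evaluates a normalized discrete version of that form. The parameter values, the integer-valued tour length, and the cutoff `max_len` are assumptions for illustration, not fitted values from the paper.

```python
import math

def truncated_power_law_pmf(alpha, kappa, max_len):
    """Normalized P(l) ~ l**(-alpha) * exp(-l / kappa) for l = 1..max_len."""
    weights = [l ** -alpha * math.exp(-l / kappa) for l in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical parameters: exponent 1.5, truncation scale of 20 stops.
pmf = truncated_power_law_pmf(alpha=1.5, kappa=20.0, max_len=200)
```

For lengths well below `kappa` the distribution behaves like a pure power law; beyond it, the exponential factor suppresses long tours faster than the power law alone would.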

CoRI: Synthesizing Communication of Robot Intent for Physical Human-Robot Interaction

Authors:Junxiang Wang, Emek Barış Küçüktabak, Rana Soltani Zarrin, Zackory Erickson

Date:2025-05-26 21:48:34

Clear communication of robot intent fosters transparency and interpretability in physical human-robot interaction (pHRI), particularly during assistive tasks involving direct human-robot contact. We introduce CoRI, a pipeline that automatically generates natural language communication of a robot's upcoming actions directly from its motion plan and visual perception. Our pipeline first processes the robot's image view to identify human poses and key environmental features. It then encodes the planned 3D spatial trajectory (including velocity and force) onto this view, visually grounding the path and its dynamics. CoRI queries a vision-language model with this visual representation to interpret the planned action within the visual context before generating concise, user-directed statements, without relying on task-specific information. Results from a user study involving robot-assisted feeding, bathing, and shaving tasks across two different robots indicate that CoRI leads to a statistically significant difference in communication clarity compared to a baseline communication strategy. Specifically, CoRI effectively conveys not only the robot's high-level intentions but also crucial details about its motion and any collaborative user action needed.

BlastOFormer: Attention and Neural Operator Deep Learning Methods for Explosive Blast Prediction

Authors:Reid Graves, Anthony Zhou, Amir Barati Farimani

Date:2025-05-26 18:47:50

Accurate prediction of blast pressure fields is essential for applications in structural safety, defense planning, and hazard mitigation. Traditional methods such as empirical models and computational fluid dynamics (CFD) simulations offer limited trade-offs between speed and accuracy; empirical models fail to capture complex interactions in cluttered environments, while CFD simulations are computationally expensive and time-consuming. In this work, we introduce BlastOFormer, a novel Transformer-based surrogate model for full-field maximum pressure prediction from arbitrary obstacle and charge configurations. BlastOFormer leverages a signed distance function (SDF) encoding and a grid-to-grid attention-based architecture inspired by OFormer and Vision Transformer (ViT) frameworks. Trained on a dataset generated using the open-source blastFoam CFD solver, our model outperforms convolutional neural networks (CNNs) and Fourier Neural Operators (FNOs) across both log-transformed and unscaled domains. Quantitatively, BlastOFormer achieves the highest $R^2$ score (0.9516) and lowest error metrics, while requiring only 6.4 milliseconds for inference, more than 600,000 times faster than CFD simulations. Qualitative visualizations and error analyses further confirm BlastOFormer's superior spatial coherence and generalization capabilities. These results highlight its potential as a real-time alternative to conventional CFD approaches for blast pressure estimation in complex environments.

Bridging the Long-Term Gap: A Memory-Active Policy for Multi-Session Task-Oriented Dialogue

Authors:Yiming Du, Bingbing Wang, Yang He, Bin Liang, Baojun Wang, Zhongyang Li, Lin Gui, Jeff Z. Pan, Ruifeng Xu, Kam-Fai Wong

Date:2025-05-26 17:10:43

Existing Task-Oriented Dialogue (TOD) systems primarily focus on single-session dialogues, limiting their effectiveness in long-term memory augmentation. To address this challenge, we introduce MS-TOD, the first multi-session TOD dataset designed to retain long-term memory across sessions, enabling fewer turns and more efficient task completion. This defines a new benchmark task for evaluating long-term memory in multi-session TOD. Based on this new dataset, we propose a Memory-Active Policy (MAP) that improves multi-session dialogue efficiency through a two-stage approach. 1) Memory-Guided Dialogue Planning retrieves intent-aligned history, identifies key QA units via a memory judger, refines them by removing redundant questions, and generates responses based on the reconstructed memory. 2) Proactive Response Strategy detects and corrects errors or omissions, ensuring efficient and accurate task completion. We evaluate MAP on the MS-TOD dataset, focusing on response quality and effectiveness of the proactive strategy. Experiments on MS-TOD demonstrate that MAP significantly improves task success and turn efficiency in multi-session scenarios, while maintaining competitive performance on conventional single-session tasks.

The Problem of Algorithmic Collisions: Mitigating Unforeseen Risks in a Connected World

Authors:Maurice Chiodo, Dennis Müller

Date:2025-05-26 16:22:18

The increasing deployment of Artificial Intelligence (AI) and other autonomous algorithmic systems presents the world with new systemic risks. While focus often lies on the function of individual algorithms, a critical and underestimated danger arises from their interactions, particularly when algorithmic systems operate without awareness of each other, or when those deploying them are unaware of the full algorithmic ecosystem in which deployment is occurring. These interactions can lead to unforeseen, rapidly escalating negative outcomes - from market crashes and energy supply disruptions to potential physical accidents and erosion of public trust - often exceeding the human capacity for effective monitoring and the legal capacities for proper intervention. Current governance frameworks are inadequate as they lack visibility into this complex ecosystem of interactions. This paper outlines the nature of this challenge and proposes some initial policy suggestions centered on increasing transparency and accountability through phased system registration, a licensing framework for deployment, and enhanced monitoring capabilities.

URPlanner: A Universal Paradigm For Collision-Free Robotic Motion Planning Based on Deep Reinforcement Learning

Authors:Fengkang Ying, Hanwen Zhang, Haozhe Wang, Huishi Huang, Marcelo H. Ang Jr

Date:2025-05-26 16:15:42

Collision-free motion planning for redundant robot manipulators in complex environments is yet to be explored. Although recent advancements at the intersection of deep reinforcement learning (DRL) and robotics have highlighted its potential to handle versatile robotic tasks, current DRL-based collision-free motion planners for manipulators are highly costly, hindering their deployment and application. This is due to an overreliance on the minimum distance between the manipulator and obstacles, inadequate exploration and decision-making by DRL, and inefficient data acquisition and utilization. In this article, we propose URPlanner, a universal paradigm for collision-free robotic motion planning based on DRL. URPlanner offers several advantages over existing approaches: it is platform-agnostic, cost-effective in both training and deployment, and applicable to arbitrary manipulators without solving inverse kinematics. To achieve this, we first develop a parameterized task space and a universal obstacle avoidance reward that is independent of minimum distance. Second, we introduce an augmented policy exploration and evaluation algorithm that can be applied to various DRL algorithms to enhance their performance. Third, we propose an expert data diffusion strategy for efficient policy learning, which can produce a large-scale trajectory dataset from only a few expert demonstrations. Finally, the superiority of the proposed methods is comprehensively verified through experiments.

MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents

Authors:Ziming Wei, Bingqian Lin, Zijian Jiao, Yunshuang Nie, Liang Ma, Yuecheng Liu, Yuzheng Zhuang, Xiaodan Liang

Date:2025-05-26 15:48:14

Spatial planning is a crucial part of spatial intelligence, which requires understanding and planning object arrangements from a spatial perspective. AI agents with spatial planning ability can better adapt to various real-world applications, including robotic manipulation, automatic assembly, and urban planning. Recent works have attempted to construct benchmarks for evaluating the spatial intelligence of Multimodal Large Language Models (MLLMs). Nevertheless, these benchmarks primarily focus on spatial reasoning based on typical Visual Question-Answering (VQA) forms, which suffer from a gap between abstract spatial understanding and concrete task execution. In this work, we take a step further to build a comprehensive benchmark called MineAnyBuild, aiming to evaluate the spatial planning ability of open-world AI agents in the Minecraft game. Specifically, MineAnyBuild requires an agent to generate executable architecture building plans based on the given multi-modal human instructions. It involves 4,000 curated spatial planning tasks and also provides a paradigm for infinitely expandable data collection by utilizing rich player-generated content. MineAnyBuild evaluates spatial planning through four core supporting dimensions: spatial understanding, spatial reasoning, creativity, and spatial commonsense. Based on MineAnyBuild, we perform a comprehensive evaluation of existing MLLM-based agents, revealing severe limitations but enormous potential in their spatial planning abilities. We believe MineAnyBuild will open new avenues for the evaluation of spatial intelligence and help promote further development of open-world AI agents capable of spatial planning.

Agentic 3D Scene Generation with Spatially Contextualized VLMs

Authors:Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang

Date:2025-05-26 15:28:17

Despite recent advances in multimodal content generation enabled by vision-language models (VLMs), their ability to reason about and generate structured 3D scenes remains largely underexplored. This limitation constrains their utility in spatially grounded tasks such as embodied AI, immersive simulations, and interactive 3D applications. We introduce a new paradigm that enables VLMs to generate, understand, and edit complex 3D environments by injecting a continually evolving spatial context. Constructed from multimodal input, this context consists of three components: a scene portrait that provides a high-level semantic blueprint, a semantically labeled point cloud capturing object-level geometry, and a scene hypergraph that encodes rich spatial relationships, including unary, binary, and higher-order constraints. Together, these components provide the VLM with a structured, geometry-aware working memory that integrates its inherent multimodal reasoning capabilities with structured 3D understanding for effective spatial reasoning. Building on this foundation, we develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. The pipeline features high-quality asset generation with geometric restoration, environment setup with automatic verification, and ergonomic adjustment guided by the scene hypergraph. Experiments show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work. Further results demonstrate that injecting spatial context enables VLMs to perform downstream tasks such as interactive scene editing and path planning, suggesting strong potential for spatially intelligent systems in computer graphics, 3D vision, and embodied applications.
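To make the idea of a scene hypergraph with unary, binary, and higher-order constraints more concrete, here is a minimal Python sketch of such a data structure. It is not the authors' implementation: the class names (HyperEdge, SceneHypergraph), the method names, and the example relations ("against_wall", "facing", "evenly_spaced") are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HyperEdge:
    """One spatial constraint over one or more scene objects (hypothetical type)."""
    relation: str
    objects: tuple

    @property
    def arity(self):
        # 1 = unary, 2 = binary, 3 or more = higher-order
        return len(self.objects)


@dataclass
class SceneHypergraph:
    """Objects as nodes, constraints as hyperedges of arbitrary arity."""
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_relation(self, relation, *objects):
        if not objects:
            raise ValueError("a relation needs at least one object")
        self.nodes.update(objects)
        edge = HyperEdge(relation, tuple(objects))
        self.edges.append(edge)
        return edge

    def relations_of(self, obj):
        """Return every constraint that mentions the given object."""
        return [e for e in self.edges if obj in e.objects]


# Illustrative scene: all object and relation names are assumptions.
scene = SceneHypergraph()
scene.add_relation("against_wall", "sofa")                          # unary
scene.add_relation("facing", "sofa", "tv")                          # binary
scene.add_relation("evenly_spaced", "chair_1", "chair_2", "chair_3")  # higher-order

for edge in scene.relations_of("sofa"):
    print(edge.relation, edge.arity)
```

Storing each constraint as a tuple of arbitrary length is what distinguishes a hypergraph from an ordinary graph, whose edges always connect exactly two nodes.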

Investment Decisions for Perfect and Imperfect Competition in Ireland's Electricity Market

Authors:Davoud Hosseinnezhad, Mel T. Devine, Seán McGarraghy

Date:2025-05-26 14:53:14

This paper employs a game-theoretic approach to analyze investment decisions in Ireland's electricity market. It compares optimal electricity investment strategies among energy generators under a perfect competition framework with an imperfect Nash-Cournot competition. The model incorporates market price based on competition among generators while accounting for the supply capacity of each firm and each technology, along with the System Non-Synchronous Penetration (SNSP) constraint to reflect operational limitations in renewable energy contribution to the power system. Both models are formulated as single-objective function optimization problems. Furthermore, unit commitment constraints are introduced to the perfect competition model, allowing the model to incorporate binary decision variables to capture energy unit scheduling decisions of online status, startup, and shutdown costs. The proposed models are evaluated under three different demand test cases, using Ireland's electricity generation projections for 2023 to 2033. The results highlight key differences in investment decisions, carbon emissions, and the contribution of renewable technologies in perfect and imperfect competition structures. The findings provide managerial insights for policymakers and stakeholders, supporting optimal investment decisions and generation capacity planning to achieve Ireland's long-term energy objectives.
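As a rough illustration of why perfect and Nash-Cournot competition lead to different outcomes, the sketch below solves the textbook symmetric case in closed form. It is not the paper's model: it assumes a linear inverse demand P(Q) = a - b*Q, n identical firms, and a constant marginal cost c, and it ignores capacity limits, the SNSP constraint, and unit commitment. The function names and numbers are hypothetical.

```python
def perfect_competition(a, b, c):
    """Price-taking firms: the market clears where price equals marginal cost."""
    total_quantity = (a - c) / b
    return {"price": c, "total_quantity": total_quantity}


def cournot(a, b, c, n):
    """Symmetric Nash-Cournot equilibrium for n identical firms.

    Each firm maximizes (P(Q) - c) * q_i taking rivals' output as given,
    which yields q_i = (a - c) / (b * (n + 1)).
    """
    per_firm = (a - c) / (b * (n + 1))
    total_quantity = n * per_firm
    price = a - b * total_quantity
    return {"price": price, "total_quantity": total_quantity, "per_firm": per_firm}


# Illustrative numbers only (assumed, not taken from the paper).
pc = perfect_competition(a=100.0, b=1.0, c=20.0)
nc = cournot(a=100.0, b=1.0, c=20.0, n=3)
print(pc)
print(nc)
```

In this toy setting the Cournot firms withhold output, so the equilibrium price sits above marginal cost and total quantity is lower than under perfect competition; the gap shrinks as n grows.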

Target Tracking via LiDAR-RADAR Sensor Fusion for Autonomous Racing

Authors:Marcello Cellina, Matteo Corno, Sergio Matteo Savaresi

Date:2025-05-26 14:28:13

High-speed multi-vehicle Autonomous Racing will increase the safety and performance of road-going Autonomous Vehicles. Precise vehicle detection and dynamics estimation from a moving platform is a key requirement for planning and executing complex autonomous overtaking maneuvers. To address this requirement, we have developed a Latency-Aware EKF-based Multi Target Tracking algorithm fusing LiDAR and RADAR measurements. The algorithm exploits the different sensor characteristics by explicitly integrating the Range Rate in the EKF Measurement Function, as well as a priori knowledge of the racetrack during state prediction. It can handle Out-Of-Sequence Measurements via Reprocessing using a double State and Measurement Buffer, ensuring sensor delay compensation with no information loss. This algorithm has been implemented on Team PoliMOVE's autonomous racecar, and was validated experimentally by completing a number of fully autonomous overtaking maneuvers at speeds up to 275 km/h.
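The following sketch shows what integrating the range rate into an EKF measurement function can look like in the simplest 2D case. It is an assumption-laden illustration, not the authors' formulation: the state layout [x, y, vx, vy], the choice of a sensor-fixed frame with relative velocities, and the function names are all hypothetical. The range rate is the projection of the relative velocity onto the line of sight, and its Jacobian row is what a standard EKF update would use for linearization.

```python
import math


def radar_measurement(state):
    """Map a target state [x, y, vx, vy] to (range, bearing, range_rate).

    Assumption: position and velocity are relative to the sensor and
    expressed in a sensor-fixed 2D frame.
    """
    x, y, vx, vy = state
    rng = math.hypot(x, y)
    if rng == 0.0:
        raise ValueError("range rate is undefined at zero range")
    bearing = math.atan2(y, x)
    # Relative velocity projected onto the line of sight.
    range_rate = (x * vx + y * vy) / rng
    return rng, bearing, range_rate


def range_rate_jacobian(state):
    """Partial derivatives of range_rate with respect to [x, y, vx, vy]."""
    x, y, vx, vy = state
    rng = math.hypot(x, y)
    if rng == 0.0:
        raise ValueError("Jacobian is undefined at zero range")
    cross = vx * y - vy * x
    return [
        y * cross / rng**3,   # d(range_rate)/dx
        -x * cross / rng**3,  # d(range_rate)/dy
        x / rng,              # d(range_rate)/dvx
        y / rng,              # d(range_rate)/dvy
    ]


# Illustrative state: target 5 m away, moving at (1, 2) m/s relative to the sensor.
example_state = [3.0, 4.0, 1.0, 2.0]
print(radar_measurement(example_state))
print(range_rate_jacobian(example_state))
```

Because the range rate depends directly on velocity, a single RADAR return constrains the velocity states, whereas position-only measurements inform them only through the motion model over successive updates.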

ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving

Authors:Xueyi Liu, Zuodong Zhong, Yuxin Guo, Yun-Fu Liu, Zhiguo Su, Qichao Zhang, Junli Wang, Yinfeng Gao, Yupeng Zheng, Qiao Lin, Huiyong Chen, Dongbin Zhao

Date:2025-05-26 14:12:38

Owing to their powerful vision-language reasoning and generalization abilities, multimodal large language models (MLLMs) have garnered significant attention in the field of end-to-end (E2E) autonomous driving. However, their application to closed-loop systems remains underexplored, and current MLLM-based methods have not shown clear superiority over mainstream E2E imitation learning approaches. In this work, we propose ReasonPlan, a novel MLLM fine-tuning framework designed for closed-loop driving through holistic reasoning with a self-supervised Next Scene Prediction task and a supervised Decision Chain-of-Thought process. This dual mechanism encourages the model to align visual representations with actionable driving context, while promoting interpretable and causally grounded decision making. We curate a planning-oriented decision reasoning dataset, namely PDR, comprising 210k diverse and high-quality samples. Our method outperforms the mainstream E2E imitation learning method by a large margin of 19% L2 and 16.1 driving score on the Bench2Drive benchmark. Furthermore, ReasonPlan demonstrates strong zero-shot generalization on the unseen DOS benchmark, highlighting its adaptability in handling zero-shot corner cases. Code and dataset will be available at https://github.com/Liuxueyi/ReasonPlan.

Universal scaling of intra-urban climate fluctuations

Authors:Marc Duran-Sala, Martin Hendrick, Gabriele Manoli

Date:2025-05-26 13:49:01

Urban-induced changes in local microclimate, such as the urban heat island effect and air pollution, are known to vary with city size, leading to distinctive relations between average climate variables and city-scale quantities (e.g., total population or area). However, these approaches suffer from biases related to the choice of city boundaries and they neglect intra-urban variations of urban characteristics. Here we use high-resolution data of urban temperatures, air quality, population counts, and street intersections from 142 cities worldwide and show that their marginal and joint probability distributions follow universal scaling functions. By using a logarithmic relation between urban spatial features and climate variables, we show that average street network properties are sufficient to characterize the entire variability of the temperature and air pollution fields observed within and across cities. We further demonstrate that traditional models linking climate variables to the distance from the city center fail to reproduce the observed distributions unless the stochasticity of urban structure is fully considered. These findings provide a unified statistical framework for characterizing intra-urban climate variability, with important implications for climate modelling and urban planning.

The Limits of Preference Data for Post-Training

Authors:Eric Zhao, Jessica Dai, Pranjal Awasthi

Date:2025-05-26 13:26:15

Recent progress in strengthening the capabilities of large language models has stemmed from applying reinforcement learning to domains with automatically verifiable outcomes. A key question is whether we can similarly use RL to optimize for outcomes in domains where evaluating outcomes inherently requires human feedback; for example, in tasks like deep research and trip planning, outcome evaluation is qualitative and there are many possible degrees of success. One attractive and scalable modality for collecting human feedback is preference data: ordinal rankings (pairwise or $k$-wise) that indicate, for $k$ given outcomes, which one is preferred. In this work, we study a critical roadblock: preference data fundamentally and significantly limits outcome-based optimization. Even with idealized preference data (infinite, noiseless, and online), the use of ordinal feedback can prevent obtaining even approximately optimal solutions. We formalize this impossibility using voting theory, drawing an analogy between how a model chooses to answer a query and how voters choose a candidate to elect. This indicates that grounded human scoring and algorithmic innovations are necessary for extending the success of RL post-training to domains demanding human feedback. We also explore why these limitations have disproportionately impacted RLHF when it comes to eliciting reasoning behaviors (e.g., backtracking) versus situations where RLHF has been historically successful (e.g., instruction-tuning and safety training), finding that the limitations of preference data primarily suppress RLHF's ability to elicit robust strategies -- a class that encompasses most reasoning behaviors.
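A classic voting-theory example hints at why purely ordinal feedback can be problematic. The sketch below is not the paper's construction; it reproduces the well-known Condorcet paradox, in which three noiseless rankings produce a majority preference cycle, so pairwise comparisons alone identify no consistent best outcome. The function names and the three-annotator profile are hypothetical.

```python
def prefers(ranking, a, b):
    """True if outcome a is ranked above outcome b in one ordinal ranking."""
    return ranking.index(a) < ranking.index(b)


def majority_prefers(rankings, a, b):
    """True if a strict majority of rankings place a above b."""
    votes = sum(prefers(r, a, b) for r in rankings)
    return votes > len(rankings) / 2


def condorcet_winner(rankings):
    """Return the outcome that beats every other in pairwise majority, or None."""
    candidates = rankings[0]
    for c in candidates:
        if all(majority_prefers(rankings, c, o) for o in candidates if o != c):
            return c
    return None


# Three hypothetical annotators ranking three candidate answers, best first.
cyclic_rankings = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]

print(majority_prefers(cyclic_rankings, "A", "B"))  # A beats B
print(majority_prefers(cyclic_rankings, "B", "C"))  # B beats C
print(majority_prefers(cyclic_rankings, "C", "A"))  # C beats A, closing the cycle
print(condorcet_winner(cyclic_rankings))
```

Rankings record only order, not how much better one outcome is than another, which is why cardinal scores can break ties and cycles that ordinal data cannot.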

Deep Active Inference Agents for Delayed and Long-Horizon Environments

Authors:Yavar Taheri Yeganeh, Mohsen Jafari, Andrea Matta

Date:2025-05-26 11:50:22

With the recent success of world-model agents, which extend the core idea of model-based reinforcement learning by learning a differentiable model for sample-efficient control across diverse tasks, active inference (AIF) offers a complementary, neuroscience-grounded paradigm that unifies perception, learning, and action within a single probabilistic framework powered by a generative model. Despite this promise, practical AIF agents still rely on accurate immediate predictions and exhaustive planning, a limitation that is exacerbated in delayed environments requiring plans over long horizons (tens to hundreds of steps). Moreover, most existing agents are evaluated on robotic or vision benchmarks which, while natural for biological agents, fall short of real-world industrial complexity. We address these limitations with a generative-policy architecture featuring (i) a multi-step latent transition that lets the generative model predict an entire horizon in a single look-ahead, (ii) an integrated policy network that enables the transition and receives gradients of the expected free energy, (iii) an alternating optimization scheme that updates model and policy from a replay buffer, and (iv) a single gradient step that plans over long horizons, eliminating exhaustive planning from the control loop. We evaluate our agent in an environment that mimics a realistic industrial scenario with delayed and long-horizon settings. The empirical results confirm the effectiveness of the proposed approach, demonstrating that coupling a world model with the AIF formalism yields an end-to-end probabilistic controller capable of effective decision making in delayed, long-horizon settings without handcrafted rewards or expensive planning.