planning - 2026-05-19

Aurora: Unified Video Editing with a Tool-Using Agent

Authors:Yongsheng Yu, Ziyun Zeng, Zhiyuan Xiao, Zhenghong Zhou, Hang Hua, Wei Xiong, Jiebo Luo
Date:2026-05-18 17:59:03

Recent video editing models have converged on a unified conditioning design: a single diffusion transformer jointly consumes text, source video, and reference images, and one set of weights covers replacement, removal, style transfer, and reference-driven insertion. The design is flexible, but it assumes that the user already provides model-ready text, reference images, and spatial grounding for local edits, which real requests often omit. We present Aurora, an agentic video editing framework that pairs a tool-augmented vision-language model (VLM) agent with a unified video diffusion transformer. The VLM agent maps a raw user request to a structured edit plan aligned with the transformer's conditioning channels, thereby resolving textual and visual underspecification before generation. We train the VLM agent with supervised data for complete edit planning and reference-image selection, together with preference pairs for robust tool use and instruction refinement. We introduce AgentEdit-Bench to evaluate agent-enhanced video editing under textual and visual underspecification. Experiments on AgentEdit-Bench and two existing video editing benchmarks show that Aurora improves over instruction-only baselines and that the VLM agent transfers to compatible frozen video editing models. Project page: https://yeates.github.io/Aurora-Page

Robo-Cortex: A Self-Evolving Embodied Agent via Dual-Grain Cognitive Memory and Autonomous Knowledge Induction

Authors:Nga Teng Chan, Yi Zhang, Yechi Liu, Renwen Cui, Fanhu Zeng, Zeyuan Ding, Xiancong Ren, Zhang Zhang, Qifeng Chen, Jian Liu, Yong Dai, Xiaozhu Ju
Date:2026-05-18 17:52:14

The ability to navigate and interact with complex environments is central to real-world embodied agents, yet navigation in unseen environments remains challenging due to "experiential amnesia," where existing trajectory-driven or reactive policies fail to synthesize generalizable strategies from past interactions. We propose Robo-Cortex, a self-evolving framework that enables robots to autonomously induce navigation heuristics and refine cognitive strategies through a continuous reflection-adaptation loop. By abstracting success patterns and failure pitfalls into natural-language heuristics, Robo-Cortex enables a transition from passive execution to active strategy evolution. Our core innovation is an Autonomous Knowledge Induction (AKI) mechanism that distills multimodal trajectories into a structured Navigation Heuristic Library for knowledge generalization. The architecture further incorporates a Dual-Grain Cognitive Memory system, comprising a Short-term Reflective Memory (SRM) for real-time local progress analysis, and a Long-term Principle Memory (LPM) that abstracts past trajectories into reusable guiding and cautionary principles. To ensure robust decision-making, we introduce a multimodal Imagine-then-Verify loop, where a world model simulates potential outcomes and a VLM-based evaluator validates action plans. Extensive evaluations on IGNav, AR, and AEQA show that Robo-Cortex consistently outperforms strong baselines in both task success and exploration efficiency, with gains of up to +4.16% SPL over the strongest prior method and up to +15.30% SPL under heuristic transfer to unseen environments. Preliminary real-world robotic experiments further support the effectiveness of Robo-Cortex in physical settings.

Mosaic: Towards Efficient Training of Multimodal Models with Spatial Resource Multiplexing

Authors:Yanbo Wang, Yuxuan Wang, Chen Chen, Chunyu Xue, Yu Feng, Anbang Wu, Quan Chen, Yin Chen, Qizhen Weng
Date:2026-05-18 17:44:29

With the wide adoption of Multimodal Models (MMs) in real-world scenarios, it is important to efficiently train emerging MMs that exhibit increasingly complex module architectures. For MM deployment, existing works allocate a GPU to only one MM module in a temporal-multiplexing manner; this compromises training efficiency because a single module often fails to achieve high GPU utilization. To improve GPU utilization and enable efficient MM training, we propose deploying MMs in a temporal-spatial multiplexing manner, allowing multiple MM modules to colocate on a GPU with well-controlled resource quotas. In this paper, we propose Apollo, an efficient MM training system that applies temporal-spatial multiplexing. We first develop a flexible and lightweight execution engine that supports MM training with arbitrary resource quotas, and then build a comprehensive and accurate performance model to estimate module execution time under different allocation plans. With the performance model, we further adopt effective heuristics to derive high-quality MM deployment plans efficiently. Testbed experiments confirm that Apollo effectively improves the training efficiency of popular MMs, with a training speedup of up to 1.31x.

Efficient Lookahead Encoding and Abstracted Width for Learning General Policies in Classical Planning

Authors:Michael Aichmüller, Simon Ståhlberg, Martin Funkquist, Hector Geffner
Date:2026-05-18 17:15:23

Generalized planning aims to learn policies that generalize across collections of instances within a classical planning domain. Recent Graph Neural Network (GNN) approaches have learned nearly perfect policies for several domains. This work improves on the recently published idea of Iterated Width (IW) policies. Therein, the policy broadens its successor scope through an IW-lookahead search that can "jump" over multiple transitions, simplifying the problem structure. Yet, each transition is evaluated individually, leading to unscalable compute costs and expressivity limitations. Furthermore, although IW(1) is attractive because it scales linearly with the number of atoms, it becomes inefficient once thousands of objects are considered, as in the International Planning Competition (IPC) 2023 benchmark. We address both limitations. First, we introduce a vastly more efficient holistic encoding of the entire search tree. It jointly represents IW(1)-reachable states only by their relational differences to the current state, enabling Relational GNNs (R-GNNs) to score all transitions in a single forward pass. Second, we define Abstracted IW(1) to improve scaling through relational abstraction during novelty checks. Rather than testing fully instantiated atoms, it abstracts each atom by replacing all but one argument with its type. The original atom is novel if any of its abstracted forms is novel. This structural compression shifts novelty search scaling from atoms to objects, while preserving meaningful subgoal structure. We evaluate our contributions on the hyperscaling IPC 2023 benchmark and across diverse domains, including domains requiring features beyond the $C_2$ logic fragment. Our policies achieve new state-of-the-art performance, significantly surpassing prior work, including the classical planner LAMA.
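
The abstraction step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the atom representation as `(predicate, args)` tuples, the `obj_type` mapping, and the `<type>` placeholder syntax are all assumptions made for the example.

```python
def abstractions(atom, obj_type):
    """Yield the abstracted forms of an atom: keep one argument, replace the rest by their types.

    `atom` is a hypothetical (predicate, args) tuple; `obj_type` maps object -> type name.
    """
    pred, args = atom
    if not args:
        # Nullary atoms have nothing to abstract.
        yield (pred, ())
        return
    for keep in range(len(args)):
        yield (pred, tuple(a if i == keep else f"<{obj_type[a]}>"
                           for i, a in enumerate(args)))


def is_novel(state_atoms, seen, obj_type):
    """Return True if any atom has an abstracted form not yet in `seen`; record all forms."""
    novel = False
    for atom in state_atoms:
        for form in abstractions(atom, obj_type):
            if form not in seen:
                seen.add(form)
                novel = True
    return novel


if __name__ == "__main__":
    obj_type = {"a": "block", "b": "block", "c": "block"}
    seen = set()
    print(is_novel([("on", ("a", "b"))], seen, obj_type))  # first state
    print(is_novel([("on", ("b", "c"))], seen, obj_type))  # second state
    print(is_novel([("on", ("a", "c"))], seen, obj_type))  # third state
```

In this toy run the third state contains an atom never seen before, yet both of its abstracted forms were already recorded, so it is pruned. With a predicate of arity k, the table holds at most k entries per object rather than one entry per ground atom, which is the sense in which the scaling moves from atoms to objects.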

Weak and Strong Fibrations of Functors

Authors:Isaac Carcacía-Campos, Enrique Macías-Virgós, David Mosquera-Lois
Date:2026-05-18 16:56:32

We develop a homotopical framework for small categories that extends classical invariants of algebraic topology to the categorical setting. Our approach is based on the construction of a genuine path category, obtained through a localization procedure, which allows us to define strong and weak fibrations for functors. We establish their basic properties, introduce a fibrant replacement for functors, and extend homotopical invariants such as the Svarc genus and sectional category to small categories. Finally, we apply this framework to motion planning in small categories, providing categorical analogues of Farber's topological complexity while removing finiteness constraints typical of existing approaches.

SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents

Authors:Wencan Jiang, Jiangning Zhang, Jianbiao Mei, Jinzhuo Liu, Yu Yang, Xiaobin Hu, Zhucun Xue, Yong Liu, Dacheng Tao
Date:2026-05-18 16:43:32

Long-horizon multimodal agents in open-world games must stay goal-directed across many low-level interactions under tight token and latency budgets. Existing approaches often trade off costly per-step reasoning against reactive execution that can drift, repeat failures, and recover poorly. Our key idea is to reuse strategic reasoning across locally stable segments and reinvoke it at event boundaries. We present SPIKE, an adaptive dual controller framework for cost-efficient long-horizon game control. Its Strategic Controller performs low-frequency global planning, failure analysis, and recovery, while its Reactive Controller handles fast local execution under a strict token budget. An Event Trigger monitors visual change, task progress, repeated actions, and failure signals to decide when control should stay reactive or escalate to strategic reasoning. Hierarchical Memory separates short-term experience reuse in the State-Action Memory Bank (SA-MB) from structured evidence in the State Action Knowledge Graph (SA-KG), allowing each controller to retrieve the context it needs. This design reuses strategic proposals over multiple reactive steps, supports local override when plans become stale, and reserves expensive reasoning for moments where extra deliberation is useful. On the Lite-100 split of StarDojo, SPIKE improves Lite-100 success rate (SR) by 5.0 percentage points (38.5% relative) over the strongest Lite-100 baseline and Budgeted SR by 9.3 points (75.6% relative) over the strongest budgeted baseline. It also reduces token consumption by 54.9% and latency by 40.8%. Ablations show that event triggering, reactive override, and heterogeneous memory each contribute to success and recovery, supporting selective reasoning rather than reasoning at every step.
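
The escalation decision made by the Event Trigger can be sketched as a simple predicate over per-step signals. This is only an illustration of the idea: the signal names, the thresholds, and the rule combining them are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class StepSignals:
    """Hypothetical per-step monitoring signals."""
    visual_change: float    # assumed score in [0, 1]
    progress_delta: float   # assumed change in task progress since the last step
    repeated_actions: int   # consecutive identical actions so far
    failure: bool           # whether a failure signal was observed


def should_escalate(s, visual_thresh=0.5, max_repeats=3):
    """Return True when control should move from the reactive to the strategic controller."""
    if s.failure:
        return True
    if s.visual_change >= visual_thresh:
        return True
    # Repeating the same action without progress suggests the current plan is stale.
    if s.repeated_actions >= max_repeats and s.progress_delta <= 0:
        return True
    return False


def count_strategic_calls(trace):
    """Count strategic invocations over a trace, including one initial planning call."""
    return 1 + sum(1 for s in trace if should_escalate(s))
```

Under these assumed rules, a trace of mostly quiet steps triggers few strategic calls, which is how reasoning cost is concentrated at event boundaries instead of being paid at every step.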

Mechanism Design for Connecting Regions Under Disruptions

Authors:Hau Chan, Jianan Lin, Zining Qin, Chenhao Wang
Date:2026-05-18 16:33:18

Man-made and natural disruptions such as planned road construction, bridge closures, and roads blocked by trees, mudslides, or floods can often create obstacles that separate two connected regions. As a result, the travel and reachability of agents from their respective regions to other regions can be affected. To minimize the impact of the obstacles and maintain agent accessibility, we initiate the study of constructing a new pathway (e.g., a detour or new bridge) connecting the regions disconnected by obstacles from the mechanism design perspective. In the problem, each agent in their region has a private location and is required to access the other region. The cost of an agent is the distance from their location to the other region via the pathway. Our goal is to design strategyproof mechanisms that elicit truthful locations from the agents and approximately optimize the social or maximum cost of agents by determining locations in the regions for building a pathway. We provide a characterization of all strategyproof and anonymous mechanisms. For the social and maximum costs, we provide upper and lower bounds on the approximation ratios of strategyproof mechanisms.

Towards the Deployment of the First NectarCAM, a Medium-Sized-Telescope Camera for the Cherenkov Telescope Array Observatory

Authors:Pablo Correa, CTAO NectarCAM Collaboration
Date:2026-05-18 16:07:38

NectarCAM is a Cherenkov camera designed to detect gamma rays with energies between 80 GeV and 50 TeV. It will equip nine medium-sized telescopes (MSTs) of the Cherenkov Telescope Array Observatory. NectarCAM consists of 1855 pixels distributed over 265 modules. Each pixel consists of a photomultiplier tube that is connected to a NECTAr3 chip. This NECTAr3 chip contains a 12-bit digitizer with a GHz sampling rate, and has a typical readout deadtime of ${\sim}0.7$ $μ$s. In these proceedings, we highlight the performance of the NectarCAM in terms of time resolution and charge resolution. We also present the latest calibration techniques that were recently implemented for the camera. Finally, we briefly present the current status and plans of the NectarCAM production; the first production-line NectarCAM will be ready for shipment by Summer 2026, and it is planned to equip one of the MST pathfinders of CTAO.

Not What You Asked For: Typographic Attacks in Household Robot Manipulation

Authors:Ali Iranmanesh, Peng Liu
Date:2026-05-18 16:06:29

Open-vocabulary embodied AI agents increasingly rely on vision-language models such as CLIP for object perception and task grounding. However, the shared embedding space that enables this flexibility introduces a structural vulnerability to typographic attacks, where printed text in a physical scene semantically overrides visual judgment. While prior work has quantified this threat in static 2D benchmarks and 3D navigation tasks, its impact on the full Sense-Plan-Act pipeline of household robot manipulation remains unexplored. This work evaluates typographic attacks in a Habitat-based simulation using the HomeRobot benchmark. We introduce a decoupled perception architecture that exposes a frozen CLIP encoder to adversarial stickers while maintaining geometric grounding via DETIC. In a controlled evaluation pool of 59 attributable episodes, the attack achieves an overall Attack Success Rate (ASR) of 67.8%, rising to 70.0% among fully successful episodes, under uncontrolled viewing angles and occlusion with no perceptual optimization. Critically, we find that perceptual errors propagate through the persistent 3D semantic map to produce kinetic failures, defined here as physically executed grasping and transport of the wrong object driven by an adversarially poisoned semantic state. In these cases, the robot physically grasps and delivers the wrong object to a target receptacle. These results establish typographic misclassification as a real, measurable, and physically consequential threat to the safety of modular manipulation pipelines that prior typographic attack research has left unexamined.

Incorporating vaccine effects into epidemiological models: common pitfalls and solutions

Authors:Casey E. Middleton, Oliver Eales, James M. McCaw, Freya M. Shearer
Date:2026-05-18 15:50:40

Incorporating vaccination into mathematical models appears deceptively simple: models integrate vaccine-derived protections, such as reduced susceptibility to infection, using parameters informed by empirical estimates of vaccine efficacy or effectiveness (VE). In practice, however, empirical VE estimates often do not correspond directly to the parameters of epidemiological models. Here, we extend previous work to demonstrate that in order to accurately parameterize a model, one must consider both a vaccine's mechanism of action and the statistic used to infer VE from empirical data. When a vaccine confers leaky protection -- that is, vaccination partially rather than completely reduces individual infection risk -- we show that common empirical VE estimation methods do not provide directly applicable values for model parameters. Naive (i.e. direct) incorporation of these VE estimates into models results in an underestimate of population-level vaccine impact. To make progress when these estimates are the only available sources for VE, we introduce a parameterization approach which more accurately aligns the modeled effect of vaccination with empirical estimates. Under this adjusted parameterization approach, models predict fewer total infections and lower herd immunity thresholds for leaky vaccines than would be predicted under current parameterization practices. Our parameterization guidelines and adjustment approach can be used to improve accuracy in models that are used in vaccine decision making and public health planning.
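
The mismatch for leaky vaccines can be illustrated with a small worked example. The sketch below uses a standard textbook relationship and is not necessarily the adjustment the authors propose. It assumes both groups face the same cumulative hazard of infection over the study period, that a leaky vaccine multiplies that hazard by `1 - theta`, and that VE is estimated as one minus the ratio of cumulative attack rates. The function names are hypothetical.

```python
import math


def ve_from_cumulative_incidence(theta, cum_hazard):
    """VE estimated as 1 - (attack rate in vaccinated) / (attack rate in unvaccinated)."""
    ar_unvax = 1.0 - math.exp(-cum_hazard)
    ar_vax = 1.0 - math.exp(-(1.0 - theta) * cum_hazard)
    return 1.0 - ar_vax / ar_unvax


def leaky_theta_from_attack_rates(ar_unvax, ar_vax):
    """Invert the assumed leaky model to recover the per-exposure hazard reduction theta."""
    return 1.0 - math.log(1.0 - ar_vax) / math.log(1.0 - ar_unvax)


if __name__ == "__main__":
    theta, cum_hazard = 0.6, 1.0  # illustrative values, not from the paper
    print(ve_from_cumulative_incidence(theta, cum_hazard))
    ar_u = 1.0 - math.exp(-cum_hazard)
    ar_v = 1.0 - math.exp(-(1.0 - theta) * cum_hazard)
    print(leaky_theta_from_attack_rates(ar_u, ar_v))
```

Under these assumptions the cumulative-incidence estimate is always below `theta` and shrinks as the cumulative hazard grows, so plugging it directly into a leaky model understates the protection per exposure.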

HJ-Gauss: A Monte-Carlo HJ Reachability Scheme

Authors:Lekan Molu, Venkatraman Renganathan, Namhoon Cho
Date:2026-05-18 15:43:59

Backward reachable tubes (BRTs), computed via viscous Hamilton-Jacobi (HJ) partial differential equations, provide principled safety certificates for learned controllers and planning algorithms in trustworthy machine learning. However, classical grid-based HJ solvers require an $O(M^n)$ memory footprint for $M$ grid points per dimension and $n$ state dimensions. This renders them impractical for high-dimensional systems. We address this bottleneck with a local PDE linearization that enables a frozen-coefficient sampling scheme for the viscous HJ PDE: a generalized Cole-Hopf-type transformation reduces the nonlinear HJ equation to a sequence of linear heat equations whose solutions admit Gaussian heat-kernel representations. The value function and its spatial gradient are then recovered via roll-outs of Monte Carlo expectations on Gaussian densities, yielding a storage and grid-free algorithm that scales as $N\cdot n$ for $N$ samples. This decoupling of memory from dimensionality enables reachability analysis on problems where grid-based methods are simply impossible. We prove a finite-sample concentration bound with $O(N^{-1/2})$ error and conditional linear convergence for the introduced Monte-Carlo Picard iterative scheme. Numerical validation on pursuit-evasion games demonstrates relative $L^2_{\text{rel}}$ errors of $0.03 - 0.20$, with $14-26$ second wall-clock times per 2D slice on a CPU. Crucially, the method scales with validation on up to (but not limited to) $n=45$-dimensional multi-agent games.
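
To make the Cole-Hopf and heat-kernel idea concrete, here is a minimal sketch for the classical one-dimensional case only, not the generalized, frozen-coefficient scheme of the paper. It assumes the forward viscous equation $v_t + \tfrac{1}{2} v_x^2 = \nu v_{xx}$, for which $\varphi = \exp(-v/(2\nu))$ solves the heat equation $\varphi_t = \nu \varphi_{xx}$, so that $v(x,t) = -2\nu \log \mathbb{E}[\exp(-v_0(x + \sqrt{2\nu t}\,Z)/(2\nu))]$ with $Z$ standard normal. The function name and test problem are chosen for illustration.

```python
import math
import random


def cole_hopf_mc(v0, x, t, nu, n_samples, rng):
    """Grid-free Monte Carlo estimate of v(x, t) for v_t + 0.5 * v_x**2 = nu * v_xx.

    Samples the Gaussian heat kernel (variance 2 * nu * t) around x and averages
    the Cole-Hopf transformed initial condition; only a running sum is stored.
    """
    sigma = math.sqrt(2.0 * nu * t)
    acc = 0.0
    for _ in range(n_samples):
        y = x + sigma * rng.gauss(0.0, 1.0)
        acc += math.exp(-v0(y) / (2.0 * nu))
    return -2.0 * nu * math.log(acc / n_samples)


if __name__ == "__main__":
    rng = random.Random(0)
    nu, x, t = 0.5, 1.0, 1.0
    estimate = cole_hopf_mc(lambda y: 0.5 * y * y, x, t, nu, 50_000, rng)
    # Closed form for v0(y) = y**2 / 2: v = x**2 / (2 * (1 + t)) + nu * log(1 + t).
    exact = x * x / (2.0 * (1.0 + t)) + nu * math.log(1.0 + t)
    print(estimate, exact)
```

No grid is allocated at any point: each query point costs one pass over the samples, and in $n$ dimensions each sample would be a vector of length $n$, which matches the $N\cdot n$ scaling quoted above.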

Geometry-Aware Surrogate for Real-Time Hydrodynamics Estimation of Autonomous Ground Vehicles in Amphibious Environments

Authors:Ammar Waheed, Luke Gallantree, Zohaib Hasnain
Date:2026-05-18 15:24:23

Autonomous ground vehicles operating in shallow water or flood-prone terrains require dynamic models that account for hydrodynamic forces. However, the simulation and planning tools currently available either lack the physical fidelity or are too computationally expensive to run in real time. This work presents a per-surface neural network surrogate that bridges this gap by predicting geometry-resolved hydrodynamic forces at real-time rates, trained entirely on high-fidelity CFD data from two geometrically distinct vehicles. A vehicle-specific Signed Distance Field (SDF) provides per-surface submergence inputs, allowing the model to resolve how loading varies with vehicle geometry, depth, and flow direction. On held-out CFD data, the surrogate achieves a longitudinal-force symmetric MAPE (sMAPE) of 13% and a vertical-force sMAPE of 3-12%, with inference running under 0.9 ms per sample. To evaluate the model under real-world conditions, water wading trials of a full-scale vehicle at different submersion depths are used. Motion-capture-derived kinematics serve as the surrogate inputs, and the resulting predictions are tested to reproduce known physical relationships between force, speed, and depth. The predicted drag follows quadratic speed scaling ($R^2 \geq 0.97$) and the buoyancy intercepts scale linearly with depth ($R^2 = 0.973$). Neither relationship is encoded in the model training loss; both emerge from the per-surface architecture summing individually predicted surface forces. The resulting framework provides a pathway for embedding physically grounded hydrodynamics into the simulation and planning loops that autonomous ground vehicles depend on in amphibious environments.
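
A check for quadratic speed scaling of this kind can be sketched as a least-squares fit followed by an $R^2$ computation. The example below is a simplified illustration, not the authors' evaluation code: it assumes a one-parameter model $F = c\,v^2$ through the origin, and the function name and any data passed to it are hypothetical.

```python
def fit_quadratic_drag(speeds, forces):
    """Least-squares fit of F = c * v**2 (no intercept); returns (c, r_squared)."""
    num = sum(v ** 2 * f for v, f in zip(speeds, forces))
    den = sum(v ** 4 for v in speeds)
    c = num / den
    mean_f = sum(forces) / len(forces)
    ss_res = sum((f - c * v ** 2) ** 2 for v, f in zip(speeds, forces))
    ss_tot = sum((f - mean_f) ** 2 for f in forces)
    return c, 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    speeds = [0.5, 1.0, 1.5, 2.0]            # synthetic speeds, for illustration only
    forces = [3.0 * v ** 2 for v in speeds]  # synthetic, exactly quadratic forces
    print(fit_quadratic_drag(speeds, forces))
```

Forces that grow linearly with speed would give a visibly lower $R^2$ under this fit, which is what makes the statistic useful as a scaling check.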

One Developer Is All You Need: A Case Study of an AI-Augmented One-Person Squad in a Brownfield Enterprise

Authors:Marcelo Vilas Boas, Gustavo Pinto, Edward Roberto Monteiro, Vinicius Fernandes Carida, Danilo Ribeiro
Date:2026-05-18 14:23:17

AI tools are enabling engineers to absorb roles previously distributed across cross-functional squads, yet there is little structured evidence on how to design or evaluate such a one-person squad in a regulated enterprise setting. Without that evidence, organizations adopting this model lack guidance on which design decisions make it viable and which conditions cause it to break down. We report a case study in which a single staff engineer, supported by four AI agents under a Spec-Driven Development workflow, delivered a brownfield product initiative scoped for a four-person squad in half the planned time, with 90% acceptance of AI-generated code on first review, full integration test pass rates, and a reduction of more than 85% in direct staffing cost. The results indicate that AI does not replace team members; it multiplies the throughput of the experienced engineer who remains, making specification quality and institutional knowledge, not model capability, the binding constraints on one-person squad success.

REACT: Environment-Adaptive Architecture for Continuous Formation Navigation of Wheeled Mobile Robots

Authors:Jianghong Dong, Yifeng Zhang, Jiawei Wang, Mengchi Cai, Keqiang Li, Guillaume Sartoretti
Date:2026-05-18 14:11:31

Formation control of wheeled mobile robots (WMRs) has been extensively studied due to its broad applications in fields such as logistics transportation, environmental monitoring, and search and rescue. However, most existing works mainly focus on tracking predefined formations, which limits their adaptability to complex real-world environments. To address this, we propose REACT (Real-time Environment-Adaptive architecture for Continuous formation navigaTion), a hierarchical architecture integrating centralized formation generation and distributed formation maintenance. Specifically, our upper layer generates new environment-adaptive formations when necessary and uses our proposed TCF-R2T (Trajectory-Conflict-Free Robot-to-Target assignment) algorithm to compute conflict-free WMR-to-target assignments in polynomial time, enabling timely formation transitions without trajectory conflicts. At the lower layer, each WMR executes our developed JSTP (Joint Spatio-Temporal trajectory Planning) method to maintain the generated formation by simultaneously optimizing spatial positions and temporal durations, thereby enhancing coordination among WMRs and enabling continuous navigation in obstacle-rich environments and dynamic-obstacle scenarios. Both simulation and real-world experiments validate the effectiveness and practical applicability of REACT. Experimental videos are available on our project website: https://dongjh20.github.io/REACT-website.

Historical Knowledge Graphs for Global Maritime Estimated Time of Arrival

Authors:Neofytos Dimitriou
Date:2026-05-18 13:47:57

Accurate vessel estimated-time-of-arrival forecasts are critical for port operations and decarbonization, yet global-scale travel-time prediction remains difficult without costly contextual data. Herein, I present a methodology for constructing a historical maritime knowledge graph using only Automatic Identification System (AIS) data. First, segmented trajectories are extracted from noisy AIS data using a Gaussian-mixture-model-based preprocessing pipeline. The graph is then constructed by iteratively processing the trajectories and storing speed distributions stratified by vessel type, time of travel, and direction of travel; the resulting global graph comprises 5,433 geohash-3 nodes and 12,334 edges. The graph can be queried to retrieve travel-time predictions between any two locations via a hierarchical, priority-based system that uses historical statistics with principled fallback. On a temporally held-out test set, median RMSE is 22.75 min (segment-level) and 30.90 min (trajectory-level), with 69.1% of trajectories within 20% of actual arrival time. On a second external test set, median RMSE is 27.36 min (segment-level) and 37.46 min (trajectory-level), with 62.1% of trajectories within 20%. These results corroborate the promise of the method, enabling global travel-time prediction and providing a strong foundation for just-in-time arrival planning and emissions reduction.
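
A hierarchical, priority-based lookup with fallback could look roughly like the sketch below. The key layout, the order of the fallback levels, and the use of the median speed are assumptions made for illustration. The abstract states only that statistics are stratified by vessel type, time of travel, and direction of travel.

```python
from statistics import median

KM_PER_NAUTICAL_MILE = 1.852


def lookup_speed(stats, edge, vessel_type, period, direction):
    """Return (median speed in knots, fallback level) from the most specific stratum with data.

    `stats` is a hypothetical dict mapping (edge, vessel_type, period, direction)
    keys to lists of observed speeds; None marks a stratum that was pooled.
    """
    candidates = [
        (edge, vessel_type, period, direction),  # level 0: fully stratified
        (edge, vessel_type, None, direction),    # level 1: drop time of travel
        (edge, vessel_type, None, None),         # level 2: drop direction as well
        (edge, None, None, None),                # level 3: all vessels on the edge
    ]
    for level, key in enumerate(candidates):
        samples = stats.get(key)
        if samples:
            return median(samples), level
    return None, None


def travel_time_hours(distance_km, speed_knots):
    """Convert an edge length and a speed in knots into a travel time in hours."""
    return distance_km / (speed_knots * KM_PER_NAUTICAL_MILE)
```

A trajectory-level estimate would then sum `travel_time_hours` over the edges of a path through the graph, with the returned fallback level giving a rough indication of how specific the underlying statistics were.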

NEWTON: Agentic Planning for Physically Grounded Video Generation

Authors:Yuxiang Feng, Juncheng Wang, Chao Xu, Yijie Qian, Huihan Wang, Wenlong Hou, Yang Liu, Baigui Sun, Yong Liu, Shujun Wang
Date:2026-05-18 13:42:24

Video generation models produce visually compelling results but systematically violate physical commonsense -- on VideoPhy-2, the best model achieves only 32.6% joint accuracy. We identify a specification bottleneck: text prompts are a lossy compression of the physical world, omitting the parameters that fully determine dynamics, and no amount of model scaling can recover what was never specified. From this diagnosis we derive three properties that physics conditioning must satisfy -- sufficiency, dynamism, and verifiability -- and show that no existing approach satisfies all three. We present NEWTON, in which video generation is demoted from the system output to one action inside an agent's toolbox: a learned planner orchestrates physics-aware tools (keyframe generation, scientific computation, prompt refinement) to construct rich conditioning, and a verifier closes the loop for iterative re-planning. The planner is the sole trainable component, optimized on-policy via Flow-GRPO inside the live multi-turn loop. On VideoPhy-2, NEWTON improves joint accuracy from 21.4% to 29.7% on LTX-Video and from 30.7% to 37.4% on Veo-3.1, without modifying either generator. Project page: https://Newton026.github.io/newton

Dynamic robotic cloth folding with efficient Koopman operator-based model predictive control

Authors:Edoardo Caldarelli, Franco Coltraro, Adrià Colomé, Lorenzo Rosasco, Carme Torras
Date:2026-05-18 13:21:38

Robotic cloth folding is a challenging task, particularly when considering dynamic folding tasks, which aim at folding cloth by fast motions that leverage its dynamics. When subject to such fast motions, the complexity of cloth dynamics hinders both system identification and planning of folding trajectories, resulting in a difficult simulation-to-reality transfer when using physical models of cloth. Compared to the dexterity that humans exhibit when performing folding tasks, robotic approaches usually employ small garments with quite rigid dynamics, and are either too slow, or fast but imprecise, requiring several attempts to achieve a reasonably good fold. In this paper, we tackle these challenges by generating fast folding trajectories with a novel model predictive controller, integrating physics-based simulation of cloth dynamics and efficient, kernel-based Koopman operator regression. Koopman operator regression, an increasingly popular machine learning technique for nonlinear system identification, is used to obtain a linear model for the cloth being folded. Such a surrogate model, trained with data from a high-fidelity, physics-based cloth simulator, can then be employed within a suitable model predictive control algorithm, in place of the costly, nonlinear one, to efficiently generate folding trajectories to be executed by a robotic manipulator. Both in simulated and real-robot experiments, we show how the linearization supplied by the Koopman operator-based model can be employed to efficiently generate fast folding trajectories to unseen poses, without sacrificing folding accuracy.

RGB-only Active 3D Scene Graph Generation for Indoor Mobile Robots

Authors:Giorgia Modi, Davide Buoso, Giuseppe Averta, Daniele De Martini
Date:2026-05-18 10:37:39

Current approaches to 3D scene graph generation rely on dedicated depth sensors, such as LiDAR or RGB-D cameras, for metric 3D reconstruction. This limits deployment to specialized robotic platforms and excludes settings where only RGB cameras are available, such as fixed external infrastructure. Existing pipelines also typically operate on passively collected observation trajectories, rather than selecting viewpoints based on the partially built scene representation, and therefore fail to effectively exploit the semantic and spatial information encoded within the graph during exploration. This paper presents a fully visual framework for the active, incremental construction of 3D scene graphs from RGB input only, addressing both limitations. The proposed approach unifies perception and planning around a shared structured representation that captures object semantics, 3D geometry, relational context, and information from multiple viewpoints. Because the framework is hardware-agnostic and relies only on RGB observations, it can incorporate inputs from both onboard robot cameras and fixed external cameras within the same representation. Experiments on the Replica dataset show that the RGB-only pipeline achieves F1-score parity with baselines using ground-truth depth. Active exploration experiments on ReplicaCAD further show that semantic-driven viewpoint selection detects more than twice as many objects as a geometric frontier-based baseline under the same exploration budget. Finally, the external-camera setting demonstrates that complementary RGB views can effectively bootstrap the scene graph and improve contextual understanding at no additional exploration cost.

Fixed External Cameras as Common Prior Maps for Active 3D Scene Graph Generation

Authors:Giorgia Modi, Davide Buoso, Giuseppe Averta, Daniele De Martini
Date:2026-05-18 10:26:07

Commonly available prior information, such as BIM models, floor plans, and remote sensing images, can provide valuable geometric and semantic context for autonomous robotic systems. In this paper, we treat observations from fixed external RGB cameras as Common Prior Maps (CPMs): wide-field views of the environment that initialize a semantic and geometric scene prior before any robot motion begins. We present an RGB-only framework for active, incremental 3D scene graph (3DSG) generation that seamlessly fuses observations from both onboard robot cameras and fixed external cameras within a single hardware-agnostic pipeline. By relying solely on RGB observations processed by a feed-forward 3D reconstruction model, the system treats all cameras - onboard or external - identically, requiring no hardware modifications. A graph-based active semantic exploration framework then directly leverages the partial scene graph to guide the robot toward regions of high semantic uncertainty, progressively completing and refining the prior. Experiments demonstrate that bootstrapping the scene graph with even a single external camera increases initial object recall by up to +79%, and that the richer context of the prior significantly improves the efficiency of subsequent active exploration.

SENSE: Satellite-based ENergy Synthesis for Sustainable Environment

Authors:Kailai Sun, Mingyi He, Heye Huang, Can Rong, Alok Prakash, Baoshen Guo, Shenhao Wang, Jinhua Zhao
Date:2026-05-18 09:13:01

Urban Building Energy Modeling (UBEM) plays a critical role in achieving the United Nations' Sustainable Development Goals 7 and 11. Although existing studies based on satellite imagery and deep learning have achieved remarkable progress, three challenges remain: first, most existing studies are inherently predictive, failing to reflect the generative nature of urban planning; second, although generative AI and diffusion models have seen explosive growth in satellite imagery, they lack urban functional generation (e.g., an energy layer); third, high-quality, high-resolution building energy data aligned with satellite imagery is scarce. Here we propose SENSE (Satellite-based ENergy Synthesis for Sustainable Environment), a unified generative UBEM framework that jointly synthesizes realistic urban satellite imagery and aligned high-quality building energy consumption and height maps. By conditioning on road networks and urban density metrics, SENSE, based on a controllable diffusion model, leverages the knowledge learned by large vision models to generate urban building energy consumption and height information (annotations) in the latent space. Experiments across four cities (New York City, Boston, Lyon, Busan) demonstrate that SENSE achieves high visual fidelity and strong physical consistency, satisfying the ASHRAE standard metric. Experiments also demonstrate that SENSE can generate enough annotated synthetic data using less than 20% labeled energy data, boosting downstream prediction performance by 10% IoU. Compared to SOTA urban energy prediction methods, SENSE significantly reduces prediction error (by 3%-11% NMBE and 1%-9% CVRMSE). This study offers an energy-efficient urban planning and physical generation solution for urban science, energy science and building science. The dataset and code: https://huggingface.co/datasets/skl24/MUSE and https://github.com/kailaisun/GenAI4Urban-Energy/.

A regularization method for planar offset curves and bi-offset recognition

Authors:Rosanna Campagna, Salvatore Mondrone, Tomas Sauer
Date:2026-05-18 09:11:29

Offset curves for planar trajectories are of interest in the generation of tool paths for numerically controlled industrial machines and in trajectory planning methods for autonomous driving systems. Theoretical offset curves may exhibit peculiar singularities, including self-intersections, which limit their use in practical applications. Existing approaches address these issues through geometric filtering techniques that detect and remove undesirable features, but the computation of accurate and well-behaved offset curves remains a challenging task. We assume a first stage of functional approximation of trajectories by penalized Hermite spline regression, which enables the simultaneous fitting of positions and tangents. The regularization is imposed on the second derivatives, effectively mitigating the jerk effect, which is particularly relevant in motion planning and path smoothing applications. Then, taking into account the pointwise geometric properties of the resulting curve, we design two offset curves through the simultaneous approximation of function values and derivatives. Next, we propose a mathematical model to obtain the so-called bi-offset that best fits the original generator curve, also relating the offset range to the pointwise curvature values. The adaptive reconstruction of the center line from the external boundaries is a topic of interest and is the main focus of our work. Numerical experiments confirm the reliability of our approach at every stage of the resolution process.

4DLidarOpen: An Open 4D FMCW Lidar Dataset for Motion-Aware Autonomous Driving

Authors:Kane Qian, Xin Zhao, Yining Shi, Rujun Yan, Zhengqing Pan, Kaojin Zhu, Mengmeng Yang, Kai Sun, Diange Yang, Kun Jiang
Date:2026-05-18 08:55:32

We present 4DLidarOpen, a large-scale open multi-modal dataset for autonomous driving, centered on 4D frequency-modulated continuous-wave (FMCW) Lidar sensing. Unlike conventional time-of-flight Lidar datasets that mainly provide geometric measurements, 4DLidarOpen includes point-wise radial velocity measurements from a forward-facing 4D FMCW Lidar, together with multiple Lidars of different types, including rotating, solid-state, and blind-spot variants, surround-view cameras, and 6-DOF ego-vehicle poses. The dataset was collected in complex urban environments in Beijing and covers dense pedestrian interactions, congested traffic, high-speed driving, and unprotected maneuvers. 4DLidarOpen provides synchronized multi-sensor data and 3D bounding-box annotations with persistent track IDs across five object categories. A hybrid annotation strategy is adopted, where large-scale auto-labeled data support scalable training and human experts refine annotations for the human-annotated training and validation sets. Based on this dataset, we establish benchmarks for 3D object detection, bird's-eye view (BEV) segmentation and flow prediction, and motion forecasting with planning. Extensive experiments show that direct velocity measurements from 4D FMCW Lidar provide complementary motion cues for dynamic-scene understanding. Compared with geometric-only sensing, the velocity-aware representation improves motion-related perception and downstream forecasting and planning, especially in scenarios involving vulnerable road users and fast-moving objects. These results indicate that 4D FMCW Lidar is a promising sensing modality for motion-aware autonomous driving. The dataset and evaluation toolkit are publicly released to support research on 4D scene understanding, multi-Lidar fusion, and velocity-aware perception and planning.

Bench2Drive-Robust: Benchmarking Closed-Loop Autonomous Driving under Deployment Perturbations

Authors:Zhiyuan Zhang, Zhenghao Jin, Yanlun Peng, Xianda Guo, Haoran Liu, Shaofeng Zhang, Xingjun Ma, Zuxuan Wu, Junchi Yan, Xiaosong Jia, Yu-Gang Jiang
Date:2026-05-18 08:45:24

Robustness is a critical requirement for deploying autonomous driving systems in the real world. Existing robustness benchmarks for autonomous driving have made important progress in studying the effects of image-level corruptions, such as adverse weather or camera degradation, on perception modules and open-loop planning outputs. However, deployment can also involve system-level imperfections, such as inference latency and ego-state estimation errors, which remain less studied in closed-loop E2E-AD evaluation. These imperfections can accumulate through the feedback loop and destabilize control. In this work, we present Bench2Drive-Robust, to our knowledge the first device-centric robustness benchmark for closed-loop end-to-end autonomous driving under realistic deployment perturbations. We systematically evaluate deployment-oriented perturbations arising from three major sources: camera-stream failures (frame drop, partial observation), ego-state estimation errors (GPS noise, and speed or odometry errors), and compute-induced control delay (model inference delay). We evaluate representative end-to-end driving methods and analyze their robustness under different perturbation severities. Our results show that these deployment-related perturbations can substantially degrade closed-loop driving performance, revealing robustness challenges that are not fully captured by conventional image-level corruption evaluations. By establishing a closed-loop evaluation protocol and demonstrating the substantial impact of these deployment-oriented perturbations, Bench2Drive-Robust defines practical robustness problems for end-to-end autonomous driving and encourages further research on deployment-aware robust driving systems.

Synergetic capacity planning of public and private EV charging piles via city-scale multi-objective optimization

Authors:Yiwu Hao, Hong Yuan, Nan Zhou, Minda Ma
Date:2026-05-18 08:35:32

Rapid electric vehicle (EV) expansion necessitates optimized charging infrastructure to bridge the persistent gaps between vehicle growth and charger availability. This study develops a demand-driven framework for city-scale EV charging demand assessment and charging pile capacity planning. It employs a bottom-up estimation approach to quantify electricity demand and a Harris Hawks Optimization algorithm to solve capacity planning challenges, capturing spatiotemporal demand variations across powertrain types and guiding allocation over 2022-2030 in Chongqing, China. The results show that (1) compared with June 2022, monthly EV electricity consumption tripled to 57.5 gigawatt-hours by the end of 2024, characterized by significant seasonal volatility and a structural shift in which the combined share of plug-in hybrid electric vehicles and extended-range electric vehicles reached 57.6%, necessitating a transition toward technology-specific infrastructure planning; (2) historical evaluations reveal a marked spatial mismatch, with actual deployment heavily concentrated in the urban core and public charging capacity consistently lagging behind demand, whereas the proposed optimized configuration achieved a superior comprehensive performance score of 0.28, compared to 0.65 for actual deployment, in balancing service adequacy across the "Core-Suburban-Exurban" hierarchy; and (3) by 2030, Chongqing is projected to require approximately 1.8 million charging units to sustain a stable 9:1 private-to-public ratio, and a synergetic strategy is expected to significantly mitigate urban-rural service disparities and enhance overall system resilience and grid compatibility. Ultimately, this study provides a versatile, spatially explicit tool for policymakers to support sustainable and cost-effective EV infrastructure deployment aligned with long-term electrification targets.

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

Authors:Lingtao Mao, Huangyu Dai, Xinyu Sun, Zihan Liang, Ben Chen, Chenyi Lei, Wenwu Ou
Date:2026-05-18 07:03:48

Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieved information. Yet existing benchmarks rarely evaluate this ability in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowledge. We introduce SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. SVFSearch contains 5,000 four-choice test examples and 4,198 auxiliary training examples, each centered on a paused game scene from a real short-video clip. To support fair and reproducible evaluation, SVFSearch provides a frozen offline retrieval environment with a game-domain text corpus, a topic-linked image gallery, and text, image, and multimodal retrieval interfaces, avoiding reliance on uncontrolled web search APIs. We evaluate representative paradigms ranging from direct QA and RAG workflows to Plan-Act-Replan agents and learned search models. Results reveal a large gap between model-only answering, practical agentic search, and oracle knowledge: the best open-source direct-QA model reaches 66.4%, the best practical agent achieves 79.1%, and oracle knowledge reaches 95.4%. Further analysis exposes bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behavior, including over-search, answer-only shortcuts, and retrieval-induced misleading.

WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform

Authors:Yu Shang, Yinzhou Tang, Yiding Ma, Zhuohang Li, Lei Jin, Weikang Su, Xin Jin, Zhaolu Wang, Ziyou Wang, Xin Zhang, Haisheng Su, Weizhen He, Wei Wu, Haoyi Duan, Gordon Wetzstein, Xihui Liu, Dhruv Shah, Zhaoxiang Zhang, Zhibo Chen, Jun Zhu, Yonghong Tian, Tat-Seng Chua, Wenwu Zhu, Chen Gao, Yong Li
Date:2026-05-18 06:18:21

World models have emerged as a central paradigm for embodied intelligence, enabling agents to predict action-conditioned futures and reason about environmental dynamics. However, existing embodied world model benchmarks are still largely confined to vision-only prediction, offline embodied applications, and simulator-based evaluation, making them insufficient for assessing increasingly comprehensive world models. In this work, we introduce WorldArena 2.0, an expanded benchmark that systematically broadens embodied world model evaluation along three dimensions: modality, functionality, and platform. Along the modality dimension, WorldArena 2.0 extends evaluation from vision-only to visuotactile modalities, enabling assessment of multimodal perception and prediction. Along the functionality dimension, it extends beyond policy evaluation and planning to assess world models as interactive RL environments for policy optimization. Along the platform dimension, it moves beyond simulator-only evaluation to a diverse suite of simulated and real-world robotic settings across multiple embodiments. Under a standardized protocol, WorldArena 2.0 evaluates perceptual quality, interactive utility, and cross-platform performance, providing a comprehensive testbed for tracking progress toward embodied world models. The benchmark is available at: https://world-arena.ai.

Assessing the Impact of Source Confusion for GREX-PLUS based on Deep JWST NIRCam Imaging

Authors:Yoshiaki Ono, Akio K. Inoue, Yuma Sugahara, Takeshi Hashigaya, Fumihide Iwamuro, Taiki Bessho, Yuji Ikeda, Matthew L. N. Ashby, Yuichi Harikane, Jarron Leisenring, Takao Nakagawa, Howard A. Smith
Date:2026-05-18 05:43:50

We investigate the effects of source confusion expected in observations with GREX-PLUS, a JAXA L-class space infrared telescope mission candidate with a wide-field infrared camera covering 2-8 $\mu$m with a field of view of 0.50 deg$^2$. For the deep imaging band near 4 $\mu$m, we calculate the GREX-PLUS PSF and ghost based on the latest optical design, and consider two representative imaging performance cases with PSF FWHM values of 0.9 and 1.2 arcsec. We construct simulated GREX-PLUS images at different depths by convolving JWST NIRCam imaging data from JADES, GLASS, CEERS, and COSMOS-Web with the PSF+ghost kernel. Comparing the limiting magnitudes estimated from random aperture photometry using the same aperture sizes, we find that the simulated GREX-PLUS images are shallower than the original JWST images, with larger deviations for deeper original JWST images. This likely reflects unresolved faint sources and extended PSF+ghost wings from bright sources, which elevate background fluctuations in blank regions. Nevertheless, the limiting magnitudes continue to improve with increasing integration time down to ~27 mag, without a clear plateau at depths comparable to the planned GREX-PLUS deep survey, although the improvement becomes progressively less efficient toward longer integrations. Based on Monte Carlo simulations, we estimate detection completeness and correct the number counts for magnitude bias and incompleteness, finding that confusion-induced blending can reduce the completeness even at magnitudes well above the nominal 5-sigma depth. The completeness-corrected number counts agree well with the JWST-based number counts down to around the detection limit. Overall, our results suggest that statistical studies of faint galaxies remain feasible for GREX-PLUS; however, survey planning should account for less efficient depth improvement toward longer integrations due to source confusion.
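
The abstract estimates limiting magnitudes from random aperture photometry. A minimal sketch of that idea follows: the scatter of fluxes measured in apertures placed on blank sky gives the background noise, which is then converted to an n-sigma magnitude. The function name, the plain sample standard deviation, and the zero point are assumptions made here for illustration; the abstract does not state the exact estimator used, and practical analyses often use a sigma-clipped or otherwise robust scatter instead.

```python
import math
import statistics


def limiting_magnitude(aperture_fluxes, zero_point, n_sigma=5.0):
    """n-sigma limiting magnitude from fluxes in randomly placed apertures.

    aperture_fluxes: fluxes summed in blank-sky apertures (image units).
    zero_point: magnitude of a source with unit flux (assumed known).
    """
    # Background fluctuation per aperture; a simple sample standard
    # deviation is used here as a stand-in for a robust estimator.
    sigma = statistics.stdev(aperture_fluxes)
    return zero_point - 2.5 * math.log10(n_sigma * sigma)


# Hypothetical aperture fluxes and zero point, not taken from the paper.
fluxes = [-1.0, 0.0, 1.0]
m_lim = limiting_magnitude(fluxes, zero_point=25.0)
print(f"5-sigma limiting magnitude: {m_lim:.3f}")
```

Under this formula, anything that raises the aperture-to-aperture scatter lowers the limiting magnitude, which matches the abstract's explanation that unresolved faint sources and extended PSF+ghost wings make the simulated images shallower.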

Profit-Oriented Planning and Multi-Market Operation Model for Hybrid Energy Storage Systems

Authors:Lizhong Zhang, Junqi Liu, Jianxiao Wang, Lei Zhu
Date:2026-05-18 05:20:05

The increasing penetration of renewable energy necessitates improved power system flexibility, driving the deployment of independent energy storage operators (ESOs). Existing research extensively investigates capacity sizing for price-taker storage systems or the operational coordination of aggregated distributed resources, but lacks a joint optimization of capacity planning and multi-market bidding for a price-maker ESO with a hybrid energy storage system (HESS) that preserves the technological heterogeneity of the integrated components. We propose a bi-level optimization framework to jointly optimize profit-oriented decisions on capacity and multi-market operation. The upper-level problem determines the optimal capacities of two heterogeneous storage systems while coordinating their bidding across day-ahead joint energy-reserve and real-time balancing markets. The lower-level problems represent market clearing by the system operator (SO). The model is reformulated into a mixed-integer linear program and solved with a Benders' decomposition algorithm. Results demonstrate that the ESO can strategically allocate capacity between energy arbitrage and reserve provision. The system with the high power-to-capacity ratio is used to capture arbitrage profits, while the system with the low power-to-capacity ratio specializes in reserve markets. Internal power transfer between the storage systems can occur when grid access constraints exist. The framework provides differentiated bidding strategies and market participation flexibility for HESS to enhance overall profitability.

Agentic Cost-Aware Query Planning with Knowledge Distillation for Big Data Analytics

Authors:Mahdi Naser-Moghadasi
Date:2026-05-18 04:07:00

Query optimization in big data analytics remains computationally expensive, particularly for resource-constrained environments where traditional optimizers fail to satisfy memory and latency constraints. We present an agentic query planning system that combines a rule-based teacher planner, UCB1 bandit exploration, cost-aware prediction, and knowledge distillation to a lightweight student planner. Our teacher planner generates SQL plans using six key optimization strategies, while UCB1 bandit search efficiently explores the plan space under explicit resource constraints. A Random Forest cost model predicts query latency from plan features, enabling cost-aware decisions. A distilled student planner (Logistic Regression or Gradient Boosting) learns to mimic teacher-bandit decisions for fast inference. Evaluation on NYC Taxi and IMDB datasets demonstrates 23% latency reduction compared to default planners while maintaining 94% constraint satisfaction. The student planner achieves 89% accuracy in replicating optimal plans with 15x faster inference time. Our single-file implementation enables reproducible big-data analytics on resource-limited machines and is publicly available at https://github.com/mahdinaser/agentic-kd-planner.
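
The abstract names UCB1 bandit search over candidate plans under explicit resource constraints. The sketch below shows the standard UCB1 selection rule on a toy problem. The plan list, the memory limit, and the reward shaping (zero reward for plans that exceed the limit, otherwise lower latency means higher reward) are hypothetical choices made for illustration and are not taken from the paper or its repository.

```python
import math


def ucb1_select(counts, totals):
    """Return the index of the arm UCB1 would pull next."""
    # Pull every arm once before the confidence bound is defined.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    t = sum(counts)
    scores = [
        totals[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_ucb1(reward_fn, n_arms, n_rounds):
    """Run UCB1 for n_rounds and return per-arm pull counts and reward sums."""
    counts = [0] * n_arms
    totals = [0.0] * n_arms
    for _ in range(n_rounds):
        arm = ucb1_select(counts, totals)
        counts[arm] += 1
        totals[arm] += reward_fn(arm)
    return counts, totals


# Hypothetical candidate plans: measured latency and peak memory.
PLANS = [
    {"latency_ms": 120.0, "mem_mb": 300.0},
    {"latency_ms": 80.0, "mem_mb": 450.0},
    {"latency_ms": 60.0, "mem_mb": 900.0},  # fastest, but exceeds the limit
]
MEM_LIMIT_MB = 512.0    # assumed resource constraint
MAX_LATENCY_MS = 200.0  # assumed scale that keeps rewards in [0, 1]


def plan_reward(arm):
    """Reward in [0, 1]; plans that violate the memory limit get 0."""
    plan = PLANS[arm]
    if plan["mem_mb"] > MEM_LIMIT_MB:
        return 0.0
    return 1.0 - plan["latency_ms"] / MAX_LATENCY_MS


counts, totals = run_ucb1(plan_reward, n_arms=len(PLANS), n_rounds=200)
best = counts.index(max(counts))
print("pulls per plan:", counts, "most-pulled plan:", best)
```

Folding the constraint into the reward is only one option; filtering infeasible plans before the search would be another. The rewards here are deterministic for simplicity, whereas measured latencies would be noisy in practice.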

Virtues of Ordered Chaos: Planning with Topple Actions in Tabletop Stack Rearrangement

Authors:Hao Lu, Rahul Shome
Date:2026-05-18 03:41:59

Efficient object manipulation strategies have a significant impact on automation applications. In this work, we study stack rearrangement in tabletop settings, with a focus on augmenting the task planning domain with richer nonprehensile aggregating actions, in particular the toppling of objects from a stack to the table. Toppling can compress long sequences of intermediate relocations. Depending on the problem, computed plans need to interleave pick-and-place actions with topple actions. To model an abstraction and compute task plans that include both pick-and-place and topple actions, a novel aggregating gadget for topple is introduced. Using this directed graphical abstraction, candidate task plan computation becomes a variant of the pebble motion problem, treating objects as pebbles. Benchmarks are then reported in an IsaacSim-based physics simulation. Results highlight the clear benefit of faster execution compared with using pick-and-place actions alone. Though this work primarily investigates the topple action, we demonstrate that similar abstractions can model other aggregating actions of interest, like scoop. The current work provides a preliminary but strong indication of the promising benefits of abstractions for rich object interactions in manipulation applications.