planning - 2025-09-28

GMP$^{3}$: Learning-Driven, Bellman-Guided Trajectory Planning for UAVs in Real-Time on SE(3)

Authors:Babak Salamat, Dominik Mattern, Sebastian-Sven Olzem, Gerhard Elsbacher, Christian Seidel, Andrea M. Tonello
Date:2025-09-25 14:56:49

We propose $\text{GMP}^{3}$, a multiphase global path planning framework that generates dynamically feasible three-dimensional trajectories for unmanned aerial vehicles (UAVs) operating in cluttered environments. The framework extends traditional path planning from Euclidean position spaces to the Lie group $\mathrm{SE}(3)$, allowing joint learning of translational motion and rotational dynamics. A modified Bellman-based operator is introduced to support reinforcement learning (RL) policy updates while leveraging prior trajectory information for improved convergence. $\text{GMP}^{3}$ is designed as a distributed framework in which agents influence each other and share policy information along the trajectory: each agent refines its assigned segment and shares the result with its neighbors via a consensus-based scheme, enabling cooperative policy updates and convergence toward a globally shaped path even under kinematic constraints. We also propose DroneManager, a modular ground control software that interfaces the planner with real UAV platforms via the MAVLink protocol, supporting real-time deployment and feedback. Simulation studies and indoor flight experiments validate the effectiveness of the proposed method in constrained 3D environments, demonstrating reliable obstacle avoidance and smooth, feasible trajectories across both position and orientation. The open-source implementation is available at https://github.com/Domattee/DroneManager
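
The distributed consensus scheme can be pictured with a minimal sketch: each agent runs a Bellman-style backup over its own trajectory segment, then averages its value estimates with its neighbors'. All names, the backup form, and the averaging rule below are illustrative assumptions, not the paper's operator.

    import numpy as np

    def local_bellman_update(V, rewards, gamma=0.95):
        """One sweep of a Bellman-style backup along a discretized segment.
        V[i] estimates the cost-to-go from waypoint i to the segment end."""
        for i in range(len(V) - 2, -1, -1):
            V[i] = rewards[i] + gamma * V[i + 1]
        return V

    def consensus_step(values, neighbors, alpha=0.5):
        """Mix each agent's value vector with its neighbors' (a hypothetical
        stand-in for the paper's consensus-based policy-sharing scheme)."""
        mixed = []
        for i, V in enumerate(values):
            avg = np.mean([values[j] for j in neighbors[i]], axis=0)
            mixed.append((1 - alpha) * V + alpha * avg)
        return mixed

    # Three agents, each owning a 5-waypoint segment; a line-graph topology.
    values = [np.zeros(5) for _ in range(3)]
    rewards = [-np.ones(5) for _ in range(3)]
    neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
    for _ in range(20):
        values = [local_bellman_update(V, r) for V, r in zip(values, rewards)]
        values = consensus_step(values, neighbors)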

A Causality-Aware Spatiotemporal Model for Multi-Region and Multi-Pollutant Air Quality Forecasting

Authors:Junxin Lu, Shiliang Sun
Date:2025-09-25 14:54:23

Air pollution, a pressing global problem, threatens public health, environmental sustainability, and climate stability. Achieving accurate and scalable forecasting across spatially distributed monitoring stations is challenging due to intricate multi-pollutant interactions, evolving meteorological conditions, and region-specific spatial heterogeneity. To address this challenge, we propose AirPCM, a novel deep spatiotemporal forecasting model that integrates multi-region, multi-pollutant dynamics with explicit meteorology-pollutant causality modeling. Unlike existing methods limited to single pollutants or localized regions, AirPCM employs a unified architecture to jointly capture cross-station spatial correlations, temporal auto-correlations, and meteorology-pollutant dynamic causality. This empowers fine-grained, interpretable multi-pollutant forecasting across varying geographic and temporal scales, including sudden pollution episodes. Extensive evaluations on multi-scale real-world datasets demonstrate that AirPCM consistently surpasses state-of-the-art baselines in both predictive accuracy and generalization capability. Moreover, the long-term forecasting capability of AirPCM provides actionable insights into future air quality trends and potential high-risk windows, offering timely support for evidence-based environmental governance and carbon mitigation planning.

Gravitational waves from two scalar fields unifying the dark sector with inflation

Authors:Orlando Luongo, Tommaso Mengoni, Paulo M. Sá
Date:2025-09-25 14:11:59

We investigate the gravitational-wave background predicted by a two-scalar-field cosmological model that aims to unify primordial inflation with the dark sector, namely late-time dark energy and dark matter, in a single and self-consistent theoretical framework. The model is constructed from an action inspired by several extensions of general relativity and string-inspired scenarios and features a non-minimal interaction between the two scalar fields, while both remain minimally coupled to gravity. In this context, we derive the gravitational-wave energy spectrum over wavelengths ranging from today's Hubble horizon to those at the end of inflation. We employ the continuous Bogoliubov coefficient formalism, originally introduced to describe particle creation in an expanding Universe in analogy with the well-established mechanism of gravitational particle production, here generalized to gravitons. Using this method, which enables an accurate description of graviton creation across all cosmological epochs, we find that inflation provides the dominant gravitational-wave contribution, while subdominant features arise at the inflation-radiation, radiation-matter, and matter-dark energy transitions, i.e., epochs naturally encoded in our scalar-field picture. The resulting energy density spectrum is then compared with the sensitivity curves of planned next-generation ground- and space-based gravitational-wave observatories. The comparison identifies frequency bands where the predicted signal could be probed, delimiting the windows associated with potentially detectable signals within the bounds of our analyses. The consequences of our scenario are then compared with numerical outcomes, and the corresponding physical properties are discussed in detail.
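
To make the formalism concrete: in the Bogoliubov-coefficient bookkeeping the abstract relies on, the graviton occupation number of a mode $k$ is $n_k = |\beta_k|^2$, and the spectral energy density follows by integrating over modes. A schematic of the standard relations (a sketch up to convention-dependent numerical factors, not the paper's exact normalization; $a$ is the scale factor, $\rho_c$ the critical density, and the factor 2 counts polarizations):

    \[
      \rho_{\mathrm{GW}} = \frac{2}{(2\pi)^3 a^4}\int \mathrm{d}^3 k \; k\,|\beta_k|^2,
      \qquad
      \Omega_{\mathrm{GW}}(k) \equiv \frac{1}{\rho_c}\,\frac{\mathrm{d}\rho_{\mathrm{GW}}}{\mathrm{d}\ln k}
      \propto \frac{k^4\,|\beta_k|^2}{a^4\,\rho_c},
    \]

so each cosmological transition leaves its imprint on $\Omega_{\mathrm{GW}}(k)$ through the evolution of $\beta_k$ across that epoch.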

Multi-Robot Vision-Based Task and Motion Planning for EV Battery Disassembly and Sorting

Authors:Abdelaziz Shaarawy, Cansu Erdogan, Rustam Stolkin, Alireza Rastegarpanah
Date:2025-09-25 11:30:45

Electric-vehicle (EV) battery disassembly requires precise multi-robot coordination, short and reliable motions, and robust collision safety in cluttered, dynamic scenes. We propose a four-layer task-and-motion planning (TAMP) framework that couples symbolic task planning and cost- and accessibility-aware allocation with a TP-GMM-guided motion planner learned from demonstrations. Stereo vision with YOLOv8 provides real-time component localization, while OctoMap-based 3D mapping and FCL (Flexible Collision Library) checks in MoveIt unify predictive digital-twin collision checking with reactive, vision-based avoidance. Validated on two UR10e robots across cable, busbar, service plug, and three leaf-cell removals, the approach yields substantially more compact and safer motions than a default RRTConnect baseline under identical perception and task assignments: average end-effector path length drops by $63.3\%$ and makespan by $8.1\%$; per-arm swept volumes shrink (R1: $0.583\rightarrow0.139\,\mathrm{m}^3$; R2: $0.696\rightarrow0.252\,\mathrm{m}^3$), and mutual overlap decreases by $47\%$ ($0.064\rightarrow0.034\,\mathrm{m}^3$). These results highlight improved autonomy, precision, and safety for multi-robot EV battery disassembly in unstructured, dynamic environments.

Autoregressive End-to-End Planning with Time-Invariant Spatial Alignment and Multi-Objective Policy Refinement

Authors:Jianbo Zhao, Taiyu Ban, Xiangjie Li, Xingtai Gui, Hangning Zhou, Lei Liu, Hongwei Zhao, Bin Li
Date:2025-09-25 09:24:45

The inherent sequential modeling capabilities of autoregressive models make them a formidable baseline for end-to-end planning in autonomous driving. Nevertheless, their performance is constrained by a spatio-temporal misalignment, as the planner must condition future actions on past sensory data. This creates an inconsistent worldview, limiting the upper bound of performance for an otherwise powerful approach. To address this, we propose a Time-Invariant Spatial Alignment (TISA) module that learns to project initial environmental features into a consistent ego-centric frame for each future time step, effectively correcting the agent's worldview without explicit future scene prediction. In addition, we employ a kinematic action prediction head (i.e., acceleration and yaw rate) to ensure physically feasible trajectories. Finally, we introduce a multi-objective post-training stage using Direct Preference Optimization (DPO) to move beyond pure imitation. Our approach provides targeted feedback on specific driving behaviors, offering a more fine-grained learning signal than the single, overall objective used in standard DPO. Our model achieves a state-of-the-art 89.8 PDMS on the NAVSIM dataset among autoregressive models. The video documentation is available at https://tisa-dpo-e2e.github.io/.
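
Why a kinematic action head guarantees feasibility: any (acceleration, yaw-rate) sequence integrates into a trajectory that respects the vehicle model by construction. A minimal sketch with a simple unicycle model (chosen here for illustration; the abstract does not specify the paper's exact vehicle model):

    import math

    def rollout(x, y, yaw, v, actions, dt=0.1):
        """Integrate (acceleration, yaw-rate) actions into a trajectory.
        Feasibility holds by construction: positions are never predicted
        directly, only kinematically consistent state updates."""
        traj = [(x, y, yaw, v)]
        for accel, yaw_rate in actions:
            v += accel * dt
            yaw += yaw_rate * dt
            x += v * math.cos(yaw) * dt
            y += v * math.sin(yaw) * dt
            traj.append((x, y, yaw, v))
        return traj

    # Eight 0.1 s steps of gentle acceleration and a slight left turn.
    path = rollout(0.0, 0.0, 0.0, 5.0, [(0.5, 0.05)] * 8)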

SwinMamba: A hybrid local-global mamba framework for enhancing semantic segmentation of remotely sensed images

Authors:Qinfeng Zhu, Han Li, Liang He, Lei Fan
Date:2025-09-25 09:01:36

Semantic segmentation of remote sensing imagery is a fundamental task in computer vision, supporting a wide range of applications such as land use classification, urban planning, and environmental monitoring. However, this task is often challenged by the high spatial resolution, complex scene structures, and diverse object scales present in remote sensing data. To address these challenges, various deep learning architectures have been proposed, including convolutional neural networks, Vision Transformers, and the recently introduced Vision Mamba. Vision Mamba features a global receptive field and low computational complexity, demonstrating both efficiency and effectiveness in image segmentation. However, its reliance on global scanning tends to overlook critical local features, such as textures and edges, which are essential for achieving accurate segmentation in remote sensing contexts. To tackle this limitation, we propose SwinMamba, a novel framework inspired by the Swin Transformer. SwinMamba integrates localized Mamba-style scanning within shifted windows with a global receptive field to enhance the model's perception of both local and global features. Specifically, the first two stages of SwinMamba perform local scanning to capture fine-grained details, while its subsequent two stages leverage global scanning to fuse broader contextual information. In our model, the use of overlapping shifted windows enhances inter-region information exchange, facilitating more robust feature integration across the entire image. Extensive experiments on the LoveDA and ISPRS Potsdam datasets demonstrate that SwinMamba outperforms state-of-the-art methods, underscoring its effectiveness and potential as a superior solution for semantic segmentation of remotely sensed imagery.
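
The local-scanning stages amount to running a sequence model over tokens inside (shifted) windows. A minimal numpy sketch of the window partition with a Swin-style cyclic shift (window size and shift amount are illustrative assumptions):

    import numpy as np

    def window_partition(feat, win=4, shift=0):
        """Split an HxWxC feature map into win x win windows, optionally
        cyclically shifted so successive stages exchange information
        across window borders (as in Swin-style shifted windows)."""
        if shift:
            feat = np.roll(feat, shift=(-shift, -shift), axis=(0, 1))
        H, W, C = feat.shape
        feat = feat.reshape(H // win, win, W // win, win, C)
        return feat.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)

    x = np.random.rand(16, 16, 8)
    windows = window_partition(x, win=4)           # regular windows
    shifted = window_partition(x, win=4, shift=2)  # shifted windows
    # Each row of `windows` is a local token sequence a Mamba block could scan.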

Mesh Interpolation Graph Network for Dynamic and Spatially Irregular Global Weather Forecasting

Authors:Zinan Zheng, Yang Liu, Jia Li
Date:2025-09-25 08:57:16

Graph neural networks have shown promising results in weather forecasting, which is critical for human activities such as agricultural planning and extreme weather preparation. However, most studies focus on finite and local areas for training, overlooking the influence of broader areas and limiting their ability to generalize effectively. Thus, in this work, we study global weather forecasting over station networks that are irregularly distributed and dynamically varying in practice, requiring the model to generalize to unobserved locations. To address these challenges, we propose a general Mesh Interpolation Graph Network (MIGN) that models irregular weather station forecasting with two key designs: (1) learning spatially irregular data with a regular mesh interpolation network to align the data; (2) leveraging a parametric spherical harmonics location embedding to further enhance spatial generalization ability. Extensive experiments on an up-to-date observation dataset show that MIGN significantly outperforms existing data-driven models. Moreover, we show that MIGN has spatial generalization ability and is capable of generalizing to previously unseen stations.
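
The spherical-harmonics location embedding can be sketched as evaluating harmonics $Y_l^m$ at each station's latitude/longitude and feeding the resulting vector to the network; because the basis is defined everywhere on the sphere, it extends naturally to unseen coordinates. A minimal version using scipy (the degree cutoff and real-part convention are assumptions):

    import numpy as np
    from scipy.special import sph_harm

    def sh_embedding(lat_deg, lon_deg, max_degree=4):
        """Location embedding from spherical harmonics evaluated at a
        station's position; the basis is global, so the embedding is
        defined for coordinates never seen during training."""
        theta = np.deg2rad(lon_deg) % (2 * np.pi)  # azimuthal angle
        phi = np.deg2rad(90.0 - lat_deg)           # polar angle from north pole
        feats = []
        for l in range(max_degree + 1):
            for m in range(-l, l + 1):
                feats.append(sph_harm(m, l, theta, phi).real)
        return np.array(feats)

    emb = sh_embedding(48.2, 16.4)  # a station near Vienna -> 25-dim vector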

MTRDrive: Memory-Tool Synergistic Reasoning for Robust Autonomous Driving in Corner Cases

Authors:Ziang Luo, Kangan Qian, Jiahua Wang, Yuechen Luo, Jinyu Miao, Zheng Fu, Yunlong Wang, Sicong Jiang, Zilin Huang, Yifei Hu, Yuhao Yang, Hao Ye, Mengmeng Yang, Xiaojian Dong, Kun Jiang, Diange Yang
Date:2025-09-25 07:31:27

Vision-Language Models (VLMs) have demonstrated significant potential for end-to-end autonomous driving, yet a substantial gap remains between their current capabilities and the reliability necessary for real-world deployment. A critical challenge is their fragility, characterized by hallucinations and poor generalization in out-of-distribution (OOD) scenarios. To bridge this gap, we introduce MTRDrive, a novel framework that integrates procedural driving experiences with a dynamic toolkit to enhance generalization and proactive decision-making. MTRDrive addresses these limitations through a closed-loop system that combines a memory-based experience retrieval mechanism with dynamic toolkits. This synergy enables the model to interact more effectively with its environment, improving both reasoning and decision-making capabilities with the help of our memory-tool synergistic reasoning. Additionally, we introduce a new benchmark based on complex roadwork construction scenarios to rigorously evaluate zero-shot generalization. Extensive experiments demonstrate the superior effectiveness of our approach. On the public NAVSIM benchmark, our 3B-parameter MTRDrive model achieves an exceptional PDMS of 88.3 without chain-of-thought and sets a state-of-the-art performance bar on high-level planning, with a driving metric score of 79.8\% and a planning accuracy of 82.6\%. Rigorous zero-shot evaluation on the new Roadwork-VLM benchmark shows a strong ability to reason robustly in unseen scenarios, achieving a driving metric score of 80.2\%. These results highlight MTRDrive's potential to advance autonomous driving toward safer and more reliable systems.
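
The memory-based experience retrieval can be illustrated as a nearest-neighbor lookup over embedded driving scenarios whose hits are injected into the VLM's prompt. The embedding dimension, memory schema, and similarity rule below are hypothetical placeholders, not the paper's design:

    import numpy as np

    def retrieve(query_emb, memory, k=3):
        """Return the k most similar stored experiences by cosine similarity.
        Each entry pairs a scenario embedding with a procedural note the
        model can condition on."""
        embs = np.stack([e for e, _ in memory])
        sims = embs @ query_emb / (
            np.linalg.norm(embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
        top = np.argsort(-sims)[:k]
        return [memory[i][1] for i in top]

    memory = [(np.random.rand(64), f"experience #{i}") for i in range(100)]
    hints = retrieve(np.random.rand(64), memory)  # injected into the prompt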

Efficient Construction of Implicit Surface Models From a Single Image for Motion Generation

Authors:Wei-Teng Chu, Tianyi Zhang, Matthew Johnson-Roberson, Weiming Zhi
Date:2025-09-25 02:30:05

Implicit representations have been widely applied in robotics for obstacle avoidance and path planning. In this paper, we explore the problem of constructing an implicit distance representation from a single image. Past methods for implicit surface reconstruction, such as \emph{NeuS} and its variants, generally require a large set of multi-view images as input and long training times. In this work, we propose Fast Image-to-Neural Surface (FINS), a lightweight framework that can reconstruct high-fidelity surfaces and SDF fields based on a single or a small set of images. FINS integrates a multi-resolution hash grid encoder with lightweight geometry and color heads, making training via an approximate second-order optimizer highly efficient and capable of converging within a few seconds. Additionally, we achieve the construction of a neural surface from only a single RGB image by leveraging pre-trained foundation models to estimate the geometry inherent in the image. Our experiments demonstrate that, under the same conditions, our method outperforms state-of-the-art baselines in both convergence speed and accuracy on surface reconstruction and SDF field estimation. Moreover, we demonstrate the applicability of FINS to robot surface-following tasks and show its scalability to a variety of benchmark datasets.
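
The multi-resolution hash grid encoder can be sketched as follows: a simplified, nearest-vertex variant of the Instant-NGP-style encoding (real implementations trilinearly interpolate the eight cell corners; table sizes and primes here are illustrative):

    import numpy as np

    PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

    def hash_encode(xyz, tables, base_res=16, growth=1.5):
        """Concatenate per-level features looked up from hashed grid vertices.
        `tables` is a list of (T, F) learnable arrays, one per level."""
        feats = []
        for level, table in enumerate(tables):
            res = int(base_res * growth ** level)
            idx = np.floor(xyz * res).astype(np.uint64)  # nearest grid vertex
            h = np.bitwise_xor.reduce(idx * PRIMES, axis=-1) % np.uint64(len(table))
            feats.append(table[h])
        return np.concatenate(feats, axis=-1)

    levels, T, F = 8, 2 ** 14, 2
    tables = [np.random.randn(T, F).astype(np.float32) * 1e-4 for _ in range(levels)]
    pts = np.random.rand(1024, 3)       # query points in the unit cube
    encoded = hash_encode(pts, tables)  # (1024, 16) feature vectors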

Bispectral OT: Dataset Comparison using Symmetry-Aware Optimal Transport

Authors:Annabel Ma, Kaiying Hou, David Alvarez-Melis, Melanie Weber
Date:2025-09-25 02:25:24

Optimal transport (OT) is a widely used technique in machine learning, graphics, and vision that aligns two distributions or datasets using their relative geometry. In symmetry-rich settings, however, OT alignments based solely on pairwise geometric distances between raw features can ignore the intrinsic coherence structure of the data. We introduce Bispectral Optimal Transport, a symmetry-aware extension of discrete OT that compares elements via their bispectral representation; the bispectrum is a group Fourier invariant that preserves all signal structure while removing only the variation due to group actions. Empirically, we demonstrate that transport plans computed with Bispectral OT achieve greater class preservation accuracy than naive feature OT on benchmark datasets transformed with visual symmetries, improving the quality of meaningful correspondences that capture the underlying semantic label structure in the dataset while removing nuisance variation not affecting class or content.
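
For the cyclic group (integer translations), the bispectrum has a closed form in terms of the discrete Fourier transform, which makes the idea easy to verify: it is invariant to cyclic shifts yet, unlike the power spectrum, retains phase structure. A small sketch (the OT step indicated in the final comment uses a plain squared-distance cost between bispectra; the paper's exact pipeline may differ):

    import numpy as np

    def bispectrum(x):
        """Bispectrum of a 1-D signal under the cyclic group:
        B[k1, k2] = F[k1] * F[k2] * conj(F[k1 + k2])."""
        F = np.fft.fft(x)
        n = len(x)
        k1, k2 = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
        return F[k1] * F[k2] * np.conj(F[(k1 + k2) % n])

    x = np.random.rand(32)
    x_shift = np.roll(x, 5)
    # Shift-invariance: the two bispectra agree to numerical precision.
    assert np.allclose(bispectrum(x), bispectrum(x_shift))
    # An OT cost matrix between datasets {x_i}, {y_j} would then use
    # C[i, j] = ||bispectrum(x_i) - bispectrum(y_j)||^2 instead of raw features.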

Uncertainty-Aware Active Source Tracking of Marine Pollution using Unmanned Surface Vehicles

Authors:Song Ma, Richard Bucknall, Yuanchang Liu
Date:2025-09-24 22:23:06

This paper proposes an uncertainty-aware marine pollution source tracking framework for unmanned surface vehicles (USVs). By integrating high-fidelity marine pollution dispersion simulation with informative path planning techniques, we demonstrate effective identification of pollution sources in marine environments. The proposed approach is implemented based on the Robot Operating System (ROS), processing real-time sensor data to update probabilistic source location estimates. The system progressively refines the estimation of the source location while quantifying uncertainty levels in its predictions. Experiments conducted in simulated environments with varying source locations, flow conditions, and starting positions demonstrate the framework's ability to localise pollution sources with high accuracy. Results show that the proposed approach achieves reliable source localisation efficiently. This work contributes to the development of fully autonomous environmental monitoring capabilities essential for rapid response to marine pollution incidents.
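
The probabilistic source-location update can be sketched as a Bayesian filter over a grid of candidate sources; the Gaussian likelihood and toy dispersion model below are stand-ins for the high-fidelity simulation the paper uses:

    import numpy as np

    def forward_model(candidates, usv_pos):
        """Toy dispersion: concentration decays with distance from the source."""
        d = np.linalg.norm(candidates - usv_pos, axis=1)
        return 1.0 / (1.0 + d ** 2)

    def bayes_update(prior, candidates, usv_pos, measured, sigma=0.1):
        """Reweight each candidate source by how well it explains the reading."""
        predicted = forward_model(candidates, usv_pos)
        lik = np.exp(-0.5 * ((measured - predicted) / sigma) ** 2)
        post = prior * lik
        return post / post.sum()

    xs, ys = np.meshgrid(np.linspace(0, 100, 50), np.linspace(0, 100, 50))
    candidates = np.stack([xs.ravel(), ys.ravel()], axis=1)
    belief = np.full(len(candidates), 1.0 / len(candidates))
    belief = bayes_update(belief, candidates, np.array([20.0, 30.0]), measured=0.4)
    # The entropy of `belief` quantifies remaining source-location uncertainty,
    # which an informative path planner can use to pick the next waypoint.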

Action-Informed Estimation and Planning: Clearing Clutter on Staircases via Quadrupedal Pedipulation

Authors:Prasanna Sriganesh, Barath Satheeshkumar, Anushree Sabnis, Matthew Travers
Date:2025-09-24 19:42:49

For robots to operate autonomously in densely cluttered environments, they must reason about and potentially physically interact with obstacles to clear a path. Safely clearing a path on challenging terrain, such as a cluttered staircase, requires controlled interaction. For example, a quadrupedal robot can push objects out of the way with one leg while maintaining a stable stance with its other three legs. However, tightly coupled physical actions, such as one-legged pushing, create new constraints on the system that can be difficult to predict at design time. In this work, we present a new method that addresses one such constraint, wherein the object being pushed by a quadrupedal robot with one of its legs becomes occluded from the robot's sensors during manipulation. To address this challenge, we present a tightly coupled perception-action framework that enables the robot to perceive clutter, reason about feasible push paths, and execute the clearing maneuver. Our core contribution is an interaction-aware state estimation loop that uses proprioceptive feedback regarding foot contact and leg position to predict an object's displacement during the occlusion. This prediction guides the perception system to robustly re-detect the object after the interaction, closing the loop between action and sensing to enable accurate tracking even after partial pushes. Using this feedback allows the robot to learn from physical outcomes, reclassifying an object as immovable if a push fails due to it being too heavy. We present results from implementing our approach on a Boston Dynamics Spot robot, showing that our interaction-aware approach achieves higher task success rates and tracking accuracy when pushing objects on stairs than open-loop baselines.
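
The interaction-aware estimation loop can be sketched as dead-reckoning the occluded object with the pushing foot's motion while contact is sensed, then re-seeding the detector with that prediction. This is a simplified planar version under an assumed rigid-contact model; the paper's estimator is not specified beyond the abstract:

    import numpy as np

    def predict_during_occlusion(obj_pos, foot_traj, contact_flags):
        """While the foot is in contact, assume the object moves with the
        foot; otherwise it stays put. Returns the predicted object position
        used to re-initialize detection after the push."""
        pos = np.array(obj_pos, dtype=float)
        for (prev, curr), in_contact in zip(zip(foot_traj, foot_traj[1:]),
                                            contact_flags):
            if in_contact:
                pos += np.asarray(curr) - np.asarray(prev)
        return pos

    foot = [(0.0, 0.0), (0.05, 0.0), (0.10, 0.01), (0.12, 0.01)]
    contact = [True, True, False]  # contact lost on the last step
    guess = predict_during_occlusion((0.5, 0.2), foot, contact)
    # If re-detection near `guess` fails repeatedly, the object can be
    # reclassified as immovable and the planner replans around it.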

Boosting Zero-Shot VLN via Abstract Obstacle Map-Based Waypoint Prediction with TopoGraph-and-VisitInfo-Aware Prompting

Authors:Boqi Li, Siyuan Li, Weiyi Wang, Anran Li, Zhong Cao, Henry X. Liu
Date:2025-09-24 19:21:39

With the rapid progress of foundation models and robotics, vision-language navigation (VLN) has emerged as a key task for embodied agents with broad practical applications. We address VLN in continuous environments, a particularly challenging setting where an agent must jointly interpret natural language instructions, perceive its surroundings, and plan low-level actions. We propose a zero-shot framework that integrates a simplified yet effective waypoint predictor with a multimodal large language model (MLLM). The predictor operates on an abstract obstacle map, producing linearly reachable waypoints, which are incorporated into a dynamically updated topological graph with explicit visitation records. The graph and visitation information are encoded into the prompt, enabling reasoning over both spatial structure and exploration history to encourage exploration and equip the MLLM with local path planning for error correction. Extensive experiments on R2R-CE and RxR-CE show that our method achieves state-of-the-art zero-shot performance, with success rates of 41% and 36%, respectively, outperforming prior state-of-the-art methods.
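A sketch of how the topological graph and visitation records might be serialized into the prompt (node naming and wording below are hypothetical, not the paper's template):

    def graph_to_prompt(nodes, edges, current, visited):
        """Encode the topological graph and visitation records as text so
        the MLLM can reason over spatial structure and exploration history."""
        lines = [f"You are at node {current}."]
        for n in nodes:
            tag = "visited" if n in visited else "unvisited"
            nbrs = [b for a, b in edges if a == n] + \
                   [a for a, b in edges if b == n]
            lines.append(f"Node {n} ({tag}) connects to: {sorted(set(nbrs))}")
        lines.append("Pick the next node that best follows the instruction; "
                     "prefer unvisited nodes unless backtracking corrects an error.")
        return "\n".join(lines)

    prompt = graph_to_prompt(
        nodes=[0, 1, 2, 3], edges=[(0, 1), (1, 2), (1, 3)],
        current=1, visited={0, 1})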

The elusive chase for the first RR Lyr star in a binary system: the case of KIC 2831097

Authors:Ennio Poretti, Jean-Francois Le Borgne, Mercedes Correa, Adam Sodor, Monica Rainer, Maurice Audejean, Eric Denoux, Nicolas Esseiva, Joan Faine', Francesco Fumagalli, Ramon Naves, Alain Klotz
Date:2025-09-24 18:42:41

The scarcity of RR Lyr stars in binary systems is atypical when compared to other classes of variables. It has therefore become a challenge for observers to detect an RR Lyr variable in a binary system. The RR Lyr variable KIC 2831097 was one of the most promising candidates. The phases of maximum brightness in the Kepler photometry showed a regular variation superimposed on a parabolic trend. These variations in the times of maximum brightness (Tmax) were interpreted as a possible light-time travel effect (LTTE) in a wide binary combined with a fast evolutionary change in the period. We planned two spectroscopic runs with the FIES instrument mounted at the NOT to test the hypothesis of binarity. The observations were scheduled at the predicted quadratures of the orbit. The GEOS collaboration complemented the spectroscopic survey with a photometric one. We also analysed Gaia time series and intensive TESS photometry. The RV curves obtained at the quadratures show the same mean RV (-203 km/s), which rules out the possibility of an LTTE. KIC 2831097 is a single high-velocity metal-poor RRc star belonging to the Galactic halo. We revisited the Kepler photometry and detected a weak Blazhko effect consisting of an oscillation of only 1.1% of the period over about 50 d. We also analysed the TESS photometry of Kepler-1601, whose photometry is contaminated by KIC 2831097. In total, we collected 3624 times of maximum brightness. Linear ephemerides cannot fit the whole dataset, but only parts of it. The period shows a tendency to decrease, as if it were an evolutionary effect, but not at a constant rate.
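
The timing analysis rests on the standard linear ephemeris $T_\mathrm{max} = T_0 + P \cdot E$; a minimal least-squares fit and its O-C (observed minus calculated) residuals, in which an LTTE would appear as a periodic oscillation and period evolution as a parabolic trend (the data below are synthetic, for illustration only):

    import numpy as np

    def fit_linear_ephemeris(tmax):
        """Fit Tmax = T0 + P*E to a sorted array of times of maximum.
        Cycle counts E are recovered by rounding against a first-guess period."""
        p_guess = np.median(np.diff(tmax))  # crude initial period
        E = np.round((tmax - tmax[0]) / p_guess)
        P, T0 = np.polyfit(E, tmax, 1)      # least-squares line
        oc = tmax - (T0 + P * E)            # O-C residuals
        return T0, P, E, oc

    # Synthetic example: a 0.35 d period with slight timing noise.
    E_true = np.arange(400.0)
    tmax = 2455000.0 + 0.35 * E_true + np.random.normal(0, 2e-4, 400)
    T0, P, E, oc = fit_linear_ephemeris(tmax)
    # A sinusoidal pattern in `oc` would suggest an LTTE, while a parabola
    # indicates a steadily changing period, as investigated for KIC 2831097.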

Technical Memo: The impact of Nancy Grace Roman Telescope's default image processing on the detectability of moving Solar system objects

Authors:Joseph Masiero
Date:2025-09-24 18:38:49

The Nancy Grace Roman Telescope is scheduled to launch in 2026 to conduct a wide-field survey of the sky at near-infrared wavelengths. Although Roman is unable to track objects moving at non-sidereal rates, there is recent interest in the potential capability of the telescope to support planetary defense by tracking and characterizing asteroids and comets (Holler et al., 2025, arXiv:2508.14412). However, the standard pipeline image processing scheme that the mission is planning to implement for the majority of its survey data will preferentially reject flux from all moving objects during the process of cosmic ray rejection. Here we describe the impact of the default Wide Field Imager (WFI) processing on moving object detection, and possible mitigations that could be employed to recover moving object observations.

Language Models that Think, Chat Better

Authors:Adithya Bhaskar, Xi Ye, Danqi Chen
Date:2025-09-24 17:57:34

Reinforcement learning with verifiable rewards (RLVR) improves language model reasoning by using rule-based rewards in verifiable domains such as mathematics and code. However, RLVR leads to limited generalization for open-ended tasks -- such as outlining essays or making meal plans -- where humans reason routinely. This paper shows that the RLVR paradigm is effective beyond verifiable domains, and introduces **RL** with **M**odel-rewarded **T**hinking (**RLMT**) for general-purpose chat capabilities. Using diverse real-world prompts, RLMT requires LMs to generate long CoT reasoning before responding, and optimizes them with online RL against a preference-based reward model used in RLHF. Across 40 training runs on Llama-3.1-8B and Qwen-2.5-7B (both base and instruct) and multiple optimization algorithms (DPO, PPO, and GRPO), RLMT consistently outperforms standard RLHF pipelines. This includes substantial gains of 3-7 points on three chat benchmarks (AlpacaEval2, WildBench, and ArenaHardV2), along with 1-3 point improvements on other tasks like creative writing and general knowledge. Our best 8B model surpasses GPT-4o in chat and creative writing and rivals Claude-3.7-Sonnet (Thinking). RLMT can also be applied directly to base models without an SFT stage, akin to R1-Zero training. Remarkably, with only 7K prompts, Llama-3.1-8B base trained with our RLMT recipe outperforms Llama-3.1-8B-Instruct post-trained with a complex multi-staged pipeline with 25M+ examples. We close with qualitative and quantitative analyses of how trained models plan their responses. Our results motivate a rethinking of the post-training pipeline and call on future work to understand and employ thinking more broadly.
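
One of the optimizers the paper evaluates, GRPO, uses a group-relative advantage that is simple to compute: sample several (CoT + response) completions per prompt, score each with the reward model, and normalize within the group, so no learned value baseline is needed. A minimal numpy sketch (reward values are dummies for illustration):

    import numpy as np

    def group_relative_advantages(rewards):
        """GRPO-style advantages: z-score each sampled completion's reward
        within its prompt group."""
        r = np.asarray(rewards, dtype=float)
        return (r - r.mean()) / (r.std() + 1e-8)

    # Four sampled completions for one chat prompt, scored by the
    # preference-based reward model; dummy numbers for illustration.
    adv = group_relative_advantages([0.62, 0.48, 0.71, 0.40])
    # Completions with positive advantage are reinforced; the long chain
    # of thought preceding each response is credited through the same
    # policy-gradient objective.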

BBoE: Leveraging Bundle of Edges for Kinodynamic Bidirectional Motion Planning

Authors:Srikrishna Bangalore Raghu, Alessandro Roncone
Date:2025-09-24 17:23:20

In this work, we introduce BBoE, a bidirectional, kinodynamic, sampling-based motion planner that consistently and quickly finds low-cost solutions in environments with varying obstacle clutter. The algorithm combines exploration and exploitation while relying on precomputed robot state traversals, resulting in efficient convergence towards the goal. Our key contributions include: i) a strategy to navigate through obstacle-rich spaces by sorting and sequencing preprocessed forward propagations; and ii) BBoE, a robust bidirectional kinodynamic planner that utilizes this strategy to produce fast and feasible solutions. The proposed framework reduces planning time, diminishes solution cost and increases success rate in comparison to previous approaches.
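
The core strategy of sorting and sequencing preprocessed forward propagations can be sketched as greedily ranking a bundle of precomputed edges by how much closer their endpoints get to the goal. The scoring rule is an illustrative assumption, and collision checking and the exploration phase are omitted:

    import numpy as np

    def best_edge(state, edge_bundle, target):
        """Pick the precomputed state traversal (edge) whose endpoint,
        applied from `state`, makes the most progress toward `target`."""
        endpoints = state + edge_bundle  # edges stored as displacements
        dists = np.linalg.norm(endpoints - target, axis=1)
        return edge_bundle[np.argmin(dists)], dists.min()

    rng = np.random.default_rng(0)
    bundle = rng.uniform(-1, 1, size=(256, 3))  # precomputed propagations
    state, target = np.zeros(3), np.array([5.0, 2.0, 1.0])
    path = [state]
    for _ in range(12):                         # greedy exploitation phase
        edge, _ = best_edge(state, bundle, target)
        state = state + edge                    # collision check omitted here
        path.append(state)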

Certified Learning-Enabled Noise-Aware Motion Planning for Urban Air Mobility

Authors:Jaejeong Park, Mahmoud Elfar, Cody Fleming, Yasser Shoukry
Date:2025-09-24 16:37:43

Urban Air Mobility (UAM) has emerged as a promising solution to alleviate urban congestion and transportation challenges. Nevertheless, the noise generated by eVTOL aircraft poses a significant barrier to public acceptance and regulatory approval, potentially limiting the operational scope and scalability of UAM systems. Hence, the successful adoption of UAM systems hinges on the ability to predict generated noise levels and to develop motion planning strategies that comply with community-level noise regulations while maintaining operational efficiency. To this end, this paper proposes a novel noise-aware motion planning framework for UAM systems that ensures compliance with noise regulations. We first develop a certifiable neural network model to accurately predict eVTOL noise propagation patterns in urban environments, providing provable bounds on its correctness. To achieve a desired level of accuracy, we propose an active sampling strategy to efficiently build the dataset used to train and test the noise model. Next, we develop a noise-aware motion planning algorithm that utilizes the noise model to generate eVTOL trajectories that guarantee compliance with community noise regulations. The algorithm exploits the monotonic structure of the noise model to efficiently sample the configuration space, ensuring that the generated trajectories are both noise-compliant and operationally efficient. We demonstrate the effectiveness of the proposed framework through a number of experiments for Vahana eVTOLs. The results show that the framework can generate noise-compliant flight plans for a fleet of eVTOLs that adhere to community noise regulations while optimizing operational efficiency.
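
Monotonicity is what makes the sampling efficient: if predicted ground noise decreases monotonically with altitude, the lowest compliant altitude can be found by bisection rather than dense sampling. A sketch, with a placeholder noise model standing in for the certified neural network:

    import math

    def min_compliant_altitude(noise_at, limit_db, lo=50.0, hi=1000.0, tol=1.0):
        """Bisection for the lowest altitude whose predicted noise meets the
        community limit; valid because noise_at is monotonically decreasing
        in altitude."""
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if noise_at(mid) <= limit_db:
                hi = mid  # compliant: try flying lower
            else:
                lo = mid  # too loud: must fly higher
        return hi

    # Placeholder model: noise falls off with altitude (spherical spreading).
    noise = lambda alt: 90.0 - 20.0 * math.log10(alt / 10.0)
    alt = min_compliant_altitude(noise, limit_db=55.0)  # about 562 m here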

Parse-Augment-Distill: Learning Generalizable Bimanual Visuomotor Policies from Single Human Video

Authors:Georgios Tziafas, Jiayun Zhang, Hamidreza Kasaei
Date:2025-09-24 16:19:18

Learning visuomotor policies from expert demonstrations is an important frontier in modern robotics research, however, most popular methods require copious efforts for collecting teleoperation data and struggle to generalize out-of-distribution. Scaling data collection has been explored through leveraging human videos, as well as demonstration augmentation techniques. The latter approach typically requires expensive simulation rollouts and trains policies with synthetic image data, therefore introducing a sim-to-real gap. In parallel, alternative state representations such as keypoints have shown great promise for category-level generalization. In this work, we bring these avenues together in a unified framework: PAD (Parse-Augment-Distill), for learning generalizable bimanual policies from a single human video. Our method relies on three steps: (a) parsing a human video demo into a robot-executable keypoint-action trajectory, (b) employing bimanual task-and-motion-planning to augment the demonstration at scale without simulators, and (c) distilling the augmented trajectories into a keypoint-conditioned policy. Empirically, we showcase that PAD outperforms state-of-the-art bimanual demonstration augmentation works relying on image policies with simulation rollouts, both in terms of success rate and sample/cost efficiency. We deploy our framework in six diverse real-world bimanual tasks such as pouring drinks, cleaning trash and opening containers, producing one-shot policies that generalize in unseen spatial arrangements, object instances and background distractors. Supplementary material can be found in the project webpage https://gtziafas.github.io/PAD_project/.

A co-evolving agentic AI system for medical imaging analysis

Authors:Songhao Li, Jonathan Xu, Tiancheng Bao, Yuxuan Liu, Yuchen Liu, Yihang Liu, Lilin Wang, Wenhui Lei, Sheng Wang, Yinuo Xu, Yan Cui, Jialu Yao, Shunsuke Koga, Zhi Huang
Date:2025-09-24 16:15:28

Agentic AI is rapidly advancing in healthcare and biomedical research. However, in medical image analysis, their performance and adoption remain limited due to the lack of a robust ecosystem, insufficient toolsets, and the absence of real-time interactive expert feedback. Here we present "TissueLab", a co-evolving agentic AI system that allows researchers to ask direct questions, automatically plan and generate explainable workflows, and conduct real-time analyses where experts can visualize intermediate results and refine them. TissueLab integrates tool factories across pathology, radiology, and spatial omics domains. By standardizing inputs, outputs, and capabilities of diverse tools, the system determines when and how to invoke them to address research and clinical questions. Across diverse tasks with clinically meaningful quantifications that inform staging, prognosis, and treatment planning, TissueLab achieves state-of-the-art performance compared with end-to-end vision-language models (VLMs) and other agentic AI systems such as GPT-5. Moreover, TissueLab continuously learns from clinicians, evolving toward improved classifiers and more effective decision strategies. With active learning, it delivers accurate results in unseen disease contexts within minutes, without requiring massive datasets or prolonged retraining. Released as a sustainable open-source ecosystem, TissueLab aims to accelerate computational research and translational adoption in medical imaging while establishing a foundation for the next generation of medical AI.

AnchDrive: Bootstrapping Diffusion Policies with Hybrid Trajectory Anchors for End-to-End Driving

Authors:Jinhao Chai, Anqing Jiang, Hao Jiang, Shiyi Mu, Zichong Gu, Shugong Xu
Date:2025-09-24 15:38:41

End-to-end multi-modal planning has become a transformative paradigm in autonomous driving, effectively addressing behavioral multi-modality and the generalization challenge in long-tail scenarios. We propose AnchDrive, a framework for end-to-end driving that effectively bootstraps a diffusion policy to mitigate the high computational cost of traditional generative models. Rather than denoising from pure noise, AnchDrive initializes its planner with a rich set of hybrid trajectory anchors. These anchors are derived from two complementary sources: a static vocabulary of general driving priors and a set of dynamic, context-aware trajectories. The dynamic trajectories are decoded in real-time by a Transformer that processes dense and sparse perceptual features. The diffusion model then learns to refine these anchors by predicting a distribution of trajectory offsets, enabling fine-grained refinement. This anchor-based bootstrapping design allows for efficient generation of diverse, high-quality trajectories. Experiments on the NAVSIM benchmark confirm that AnchDrive sets a new state-of-the-art and shows strong generalizability.
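
The anchor-based bootstrapping amounts to truncating the diffusion chain: refinement starts from hybrid anchors and the model only predicts per-waypoint offsets over a few steps. A toy sketch (the offset predictor below is a dummy placeholder for the diffusion network; shapes and step count are assumptions):

    import numpy as np

    def refine_anchors(anchors, offset_model, steps=3):
        """Iteratively nudge each candidate trajectory by a predicted offset;
        far fewer steps than denoising from pure noise."""
        trajs = anchors.copy()
        for _ in range(steps):
            trajs = trajs + offset_model(trajs)
        return trajs

    # 20 anchors (static vocabulary priors + dynamically decoded candidates),
    # each an 8-waypoint (x, y) trajectory. Dummy shapes for illustration.
    anchors = np.random.randn(20, 8, 2)
    offset_model = lambda t: -0.1 * t  # toy stand-in for the diffusion net
    refined = refine_anchors(anchors, offset_model)
    # A downstream scorer then selects the best refined trajectory.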

LidarScout: Direct Out-of-Core Rendering of Massive Point Clouds

Authors:Philipp Erler, Lukas Herzberger, Michael Wimmer, Markus Schütz
Date:2025-09-24 14:53:52

Large-scale terrain scans are the basis for many important tasks, such as topographic mapping, forestry, agriculture, and infrastructure planning. The resulting point cloud data sets are so massive in size that even basic tasks like viewing take hours to days of pre-processing in order to create level-of-detail structures that allow inspecting the data set in its entirety in real time. In this paper, we propose a method that is capable of instantly visualizing massive country-sized scans with hundreds of billions of points. Upon opening the data set, we first load a sparse subsample of points and initialize an overview of the entire point cloud, immediately followed by a surface reconstruction process to generate higher-quality, hole-free heightmaps. As users start navigating towards a region of interest, we continue to prioritize the heightmap construction process to the user's viewpoint. Once a user zooms in closely, we load the full-resolution point cloud data for that region and update the corresponding heightmap textures with the full-resolution data. As users navigate elsewhere, full-resolution point data that is no longer needed is unloaded, but the updated heightmap textures are retained as a form of medium level of detail. Overall, our method constitutes a form of direct out-of-core rendering for massive point cloud data sets (terabytes, compressed) that requires no preprocessing and no additional disk space. Source code, executable, pre-trained model, and dataset are available at: https://github.com/cg-tuwien/lidarscout
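
The streaming policy described above can be sketched as a priority loop over tiles keyed by distance to the viewpoint, with full-resolution points loaded only when zoomed in. Tile granularity, thresholds, and the budget are illustrative assumptions, not the paper's parameters:

    import heapq

    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    def schedule_tiles(tiles, viewpoint, zoom, budget=4):
        """Return the tiles to process next, nearest-to-viewpoint first.
        Far tiles get heightmap refinement; near tiles (when zoomed in)
        get full-resolution point loads."""
        heap = [(dist(t["center"], viewpoint), t["id"]) for t in tiles]
        heapq.heapify(heap)
        plan = []
        while heap and len(plan) < budget:
            d, tid = heapq.heappop(heap)
            action = "load_full_points" if (zoom > 14 and d < 2.0) \
                     else "refine_heightmap"
            plan.append((tid, action))
        return plan

    tiles = [{"id": i, "center": (i * 1.5, 0.0)} for i in range(10)]
    plan = schedule_tiles(tiles, viewpoint=(0.0, 0.0), zoom=15)
    # Evicted full-resolution tiles keep their updated heightmap textures,
    # which then serve as a medium level of detail.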

U-Mamba2-SSL for Semi-Supervised Tooth and Pulp Segmentation in CBCT

Authors:Zhi Qin Tan, Xiatian Zhu, Owen Addison, Yunpeng Li
Date:2025-09-24 14:19:33

Accurate segmentation of teeth and pulp in Cone-Beam Computed Tomography (CBCT) is vital for clinical applications like treatment planning and diagnosis. However, this process requires extensive expertise and is exceptionally time-consuming, highlighting the critical need for automated algorithms that can effectively utilize unlabeled data. In this paper, we propose U-Mamba2-SSL, a novel semi-supervised learning framework that builds on the U-Mamba2 model and employs a multi-stage training strategy. The framework first pre-trains U-Mamba2 in a self-supervised manner using a disruptive autoencoder. It then leverages unlabeled data through consistency regularization, where we introduce input and feature perturbations to ensure stable model outputs. Finally, a pseudo-labeling strategy is implemented with a reduced loss weighting to minimize the impact of potential errors. U-Mamba2-SSL achieved an average score of 0.872 and a DSC of 0.969 on the validation dataset, demonstrating the superior performance of our approach. The code is available at https://github.com/zhiqin1998/UMamba2.
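
The unlabeled-data objective combines consistency regularization with down-weighted pseudo-labels; a minimal PyTorch-style sketch (the Gaussian input perturbation, the toy model, and the 0.5 weight are assumptions for illustration, not the paper's exact choices):

    import torch
    import torch.nn.functional as F

    def semi_supervised_loss(model, unlabeled, pseudo_weight=0.5):
        """Consistency between two perturbed views, plus a reduced-weight
        cross-entropy against pseudo-labels from the clean prediction."""
        noisy_a = unlabeled + 0.1 * torch.randn_like(unlabeled)  # perturbation
        noisy_b = unlabeled + 0.1 * torch.randn_like(unlabeled)
        logits_a, logits_b = model(noisy_a), model(noisy_b)
        consistency = F.mse_loss(logits_a.softmax(1), logits_b.softmax(1))
        with torch.no_grad():                                    # pseudo-labels
            pseudo = model(unlabeled).argmax(1)
        pseudo_ce = F.cross_entropy(logits_a, pseudo)
        return consistency + pseudo_weight * pseudo_ce

    model = torch.nn.Conv3d(1, 3, kernel_size=3, padding=1)  # toy stand-in
    scans = torch.randn(2, 1, 16, 16, 16)                    # CBCT patches
    loss = semi_supervised_loss(model, scans)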

Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving

Authors:Pengxiang Li, Yinan Zheng, Yue Wang, Huimin Wang, Hang Zhao, Jingjing Liu, Xianyuan Zhan, Kun Zhan, Xianpeng Lang
Date:2025-09-24 13:35:15

End-to-End (E2E) solutions have emerged as a mainstream approach for autonomous driving systems, with Vision-Language-Action (VLA) models representing a new paradigm that leverages pre-trained multimodal knowledge from Vision-Language Models (VLMs) to interpret and interact with complex real-world environments. However, these methods remain constrained by the limitations of imitation learning, which struggles to inherently encode physical rules during training. Existing approaches often rely on complex rule-based post-refinement, employ reinforcement learning that remains largely limited to simulation, or utilize diffusion guidance that requires computationally expensive gradient calculations. To address these challenges, we introduce ReflectDrive, a novel learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion. We first discretize the two-dimensional driving space to construct an action codebook, enabling the use of pre-trained Diffusion Language Models for planning tasks through fine-tuning. Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient computation. Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors. Based on this, we apply local search methods to identify unsafe tokens and determine feasible solutions, which then serve as safe anchors for inpainting-based regeneration. Evaluated on the NAVSIM benchmark, ReflectDrive demonstrates significant advantages in safety-critical trajectory generation, offering a scalable and reliable solution for autonomous driving systems.
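
The action-codebook and reflection steps can be sketched as follows: quantize 2-D waypoints onto a grid of discrete tokens, flag tokens whose cells are occupied, and mask them for inpainting-based regeneration. The grid size, extents, and collision test are illustrative assumptions:

    import numpy as np

    K = 64  # grid resolution per axis

    def to_tokens(traj, lo=-50.0, hi=50.0):
        """Quantize (x, y) waypoints into discrete codebook indices."""
        cells = np.clip(((traj - lo) / (hi - lo) * K).astype(int), 0, K - 1)
        return cells[:, 0] * K + cells[:, 1]

    def unsafe_mask(tokens, blocked_cells):
        """Safety reflection: mark tokens whose grid cell is occupied."""
        return np.array([t in blocked_cells for t in tokens])

    traj = np.array([[0.0, 0.0], [2.0, 1.0], [4.0, 2.5], [6.0, 4.0]])
    tokens = to_tokens(traj)
    blocked = {to_tokens(np.array([[4.0, 2.5]]))[0]}  # an obstacle cell
    mask = unsafe_mask(tokens, blocked)
    # Masked positions are regenerated by the discrete diffusion model,
    # conditioned on the safe tokens, which act as anchors for inpainting.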

Queryable 3D Scene Representation: A Multi-Modal Framework for Semantic Reasoning and Robotic Task Planning

Authors:Xun Li, Rodrigo Santa Cruz, Mingze Xi, Hu Zhang, Madhawa Perera, Ziwei Wang, Ahalya Ravendran, Brandon J. Matthews, Feng Xu, Matt Adcock, Dadong Wang, Jiajun Liu
Date:2025-09-24 12:53:32

To enable robots to comprehend high-level human instructions and perform complex tasks, a key challenge lies in achieving comprehensive scene understanding: interpreting and interacting with the 3D environment in a meaningful way. This requires a smart map that fuses accurate geometric structure with rich, human-understandable semantics. To address this, we introduce the 3D Queryable Scene Representation (3D QSR), a novel framework built on multimedia data that unifies three complementary 3D representations: (1) 3D-consistent novel view rendering and segmentation from panoptic reconstruction, (2) precise geometry from 3D point clouds, and (3) structured, scalable organization via 3D scene graphs. Built on an object-centric design, the framework integrates with large vision-language models to enable semantic queryability by linking multimodal object embeddings, and supporting object-level retrieval of geometric, visual, and semantic information. The retrieved data are then loaded into a robotic task planner for downstream execution. We evaluate our approach through simulated robotic task planning scenarios in Unity, guided by abstract language instructions and using the indoor public dataset Replica. Furthermore, we apply it in a digital duplicate of a real wet lab environment to test QSR-supported robotic task planning for emergency response. The results demonstrate the framework's ability to facilitate scene understanding and integrate spatial and semantic reasoning, effectively translating high-level human instructions into precise robotic task planning in complex 3D environments.

GPU-accelerated FREDopt package for simultaneous dose and LETd proton radiotherapy plan optimization via superiorization methods

Authors:Damian Borys, Jan Gajewski, Tobias Becher, Yair Censor, Renata Kopeć, Marzena Rydygier, Angelo Schiavi, Tomasz Skóra, Anna Spaleniak, Niklas Wahl, Agnieszka Wochnik, Antoni Ruciński
Date:2025-09-24 11:29:42

This study presents FREDopt, a newly developed GPU-accelerated open-source optimization software for simultaneous proton dose and dose-averaged LET (LETd) optimization in IMPT treatment planning. FREDopt was implemented entirely in Python, leveraging CuPy for GPU acceleration and incorporating fast Monte Carlo (MC) simulations from the FRED code. The treatment plan optimization workflow includes pre-optimization and optimization, the latter equipped with a novel superiorization of feasibility-seeking algorithms. Feasibility-seeking requires finding a point that satisfies prescribed constraints. Superiorization interlaces computational perturbations into iterative feasibility-seeking steps to steer them toward a superior feasible point, replacing the need for costly full-fledged constrained optimization. The method was validated on two treatment plans of patients treated in a clinical proton therapy center, with dose and LETd distributions compared before and after reoptimization. Simultaneous dose and LETd optimization using FREDopt led to a substantial reduction of LETd and (dose)x(LETd) in organs at risk (OARs) while preserving target dose conformity. Computational performance evaluation showed execution times of 14-50 minutes, depending on the algorithm and target volume size, which is satisfactory for clinical and research applications while enabling further development of the well-tested, documented open-source software.
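
Superiorization, as described above, interlaces small objective-reducing perturbations into a feasibility-seeking iteration. A generic sketch with sequential projections and summable step sizes; the box constraint and quadratic objective are toy stand-ins for the dose/LETd constraints and cost:

    import numpy as np

    def superiorize(x, projections, grad_phi, iters=100, a=0.9):
        """Feasibility-seeking (projections onto constraint sets) steered by
        perturbations along -grad(phi) with summable steps a**k, so the
        iterates stay feasibility-seeking while phi is reduced."""
        for k in range(iters):
            g = grad_phi(x)
            if np.linalg.norm(g) > 0:
                x = x - (a ** k) * g / np.linalg.norm(g)  # perturbation step
            for proj in projections:                      # feasibility step
                x = proj(x)
        return x

    # Toy problem: stay in a box (constraints) while reducing ||x||^2 (phi).
    box = lambda x: np.clip(x, 0.5, 2.0)
    x = superiorize(np.array([3.0, 3.0]), [box], grad_phi=lambda x: 2 * x)
    # x approaches the feasible point of the box closest to the origin.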

OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving

Authors:Pei Liu, Hongliang Lu, Haichao Liu, Haipeng Liu, Xin Liu, Ruoyu Yao, Shengbo Eben Li, Jun Ma
Date:2025-09-24 10:28:06

Human vision is capable of transforming two-dimensional observations into an egocentric three-dimensional scene understanding, which underpins the ability to translate complex scenes and exhibit adaptive behaviors. This capability, however, remains lacking in current autonomous driving systems, where mainstream approaches primarily rely on depth-based 3D reconstruction rather than true scene understanding. To address this limitation, we propose a novel human-like framework called OmniScene. First, we introduce the OmniScene Vision-Language Model (OmniVLM), a vision-language framework that integrates multi-view and temporal perception for holistic 4D scene understanding. Then, harnessing a teacher-student OmniVLM architecture and knowledge distillation, we embed textual representations into 3D instance features for semantic supervision, enriching feature learning, and explicitly capturing human-like attentional semantics. These feature representations are further aligned with human driving behaviors, forming a more human-like perception-understanding-action architecture. In addition, we propose a Hierarchical Fusion Strategy (HFS) to address imbalances in modality contributions during multimodal integration. Our approach adaptively calibrates the relative significance of geometric and semantic features at multiple abstraction levels, enabling the synergistic use of complementary cues from visual and textual modalities. This learnable dynamic fusion enables a more nuanced and effective exploitation of heterogeneous information. We evaluate OmniScene comprehensively on the nuScenes dataset, benchmarking it against over ten state-of-the-art models across various tasks. Our approach consistently achieves superior results, establishing new benchmarks in perception, prediction, planning, and visual question answering.
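
The Hierarchical Fusion Strategy can be illustrated as learnable gating between geometric and semantic features at each abstraction level. The gating form below is an assumption for illustration; the abstract does not specify the paper's exact fusion operator:

    import torch
    import torch.nn as nn

    class GatedFusion(nn.Module):
        """Per-level fusion that learns how much to trust geometric vs.
        semantic features, conditioned on their concatenation."""
        def __init__(self, dim):
            super().__init__()
            self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

        def forward(self, geo, sem):
            g = self.gate(torch.cat([geo, sem], dim=-1))  # per-channel in (0, 1)
            return g * geo + (1 - g) * sem

    # One fusion module per abstraction level in the hierarchy.
    levels = nn.ModuleList(GatedFusion(d) for d in (64, 128, 256))
    geo64, sem64 = torch.randn(2, 100, 64), torch.randn(2, 100, 64)
    fused = levels[0](geo64, sem64)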

GUIDE: A Diffusion-Based Autonomous Robot Exploration Framework Using Global Graph Inference

Authors:Zijun Che, Yinghong Zhang, Shengyi Liang, Boyu Zhou, Jun Ma, Jinni Zhou
Date:2025-09-24 09:15:24

Autonomous exploration in structured and complex indoor environments remains a challenging task, as existing methods often struggle to appropriately model unobserved space and plan globally efficient paths. To address these limitations, we propose GUIDE, a novel exploration framework that synergistically combines global graph inference with diffusion-based decision-making. We introduce a region-evaluation global graph representation that integrates both observed environmental data and predictions of unexplored areas, enhanced by a region-level evaluation mechanism to prioritize reliable structural inferences while discounting uncertain predictions. Building upon this enriched representation, a diffusion policy network generates stable, foresighted action sequences with significantly reduced denoising steps. Extensive simulations and real-world deployments demonstrate that GUIDE consistently outperforms state-of-the-art methods, achieving up to 18.3% faster coverage completion and a 34.9% reduction in redundant movements.

Report on LEO satellite impacts on ground-based optical astronomy for the Rubin Observatory LSST

Authors:Phanindra Kandula, Lee Kelvin, Erfan Nourbakhsh, Daniel Polin, Tom Prince, Meredith Rawls, Adam Snyder, Brianna Smart, Christopher Stubbs, Anthony Tyson, Zeljko Ivezic, Craig Lage, Clare Saunders
Date:2025-09-24 04:37:23

In August 2025 a workshop was convened to bring together experts to better understand steps that can be taken to mitigate the impact of satellite constellations on astronomical observations. At the time, just over 12,000 operational satellites were in low-Earth orbit (LEO). Although reflected sunlight and emissions all across the electromagnetic spectrum from artificial satellites impact scientific observations and the sky, the workshop focused on reflected sunlight in the wavelength range 330 nm to 1100 nm. This aligns with the Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST) planned imaging observations over the coming decade. Among other conclusions, we affirm previous recommendations that tracked satellite apparent magnitudes should be no brighter than 7th AB mag. The workshop participants discussed over 30 publications, reports, and presentations, and arrived at the Findings and Recommendations presented here. During the workshop, in addition to affirming many existing recommendations and best practices, the group discovered new issues and devised possible mitigations. These were nearly equally divided between advice to satellite builders and operators and to the observational astronomy community. While the workshop prioritized considerations for LSST, our hope is that many of the Findings and Recommendations will also apply to other observatories and constellations, and that all satellite companies will continue to engage in dialog with sky observers across the globe.

Trajectory Planning Using Safe Ellipsoidal Corridors as Projections of Orthogonal Trust Regions

Authors:Akshay Jaitly, Jon Arrizabalaga, Guanrui Li
Date:2025-09-24 03:27:02

Planning collision-free trajectories in complex environments remains a core challenge in robotics. Existing corridor-based planners, which rely on decomposing the free space into collision-free subsets, scale poorly with environmental complexity and require explicit allocation of time windows to trajectory segments. We introduce a new trajectory parameterization that represents trajectories in a nonconvex collision-free corridor as points in a convex Cartesian product of balls. This parameterization allows us to decouple problem size from the geometric complexity of the solution and naturally avoids explicit time allocation by allowing trajectories to evolve continuously inside ellipsoidal corridors. Building on this representation, we formulate the Orthogonal Trust Region Problem (Orth-TRP), a specialized convex program with separable block constraints, and develop a solver that exploits both this parallel structure and the unique structure of each parallel subproblem for efficient optimization. Experiments on a quadrotor trajectory planning benchmark show that our approach produces smoother trajectories and lower runtimes than state-of-the-art corridor-based planners, especially in highly complex environments.
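
The key convenience of this parameterization is that projection onto a Cartesian product of balls decomposes into independent per-ball projections, which is what makes the block-separable trust-region subproblems parallelizable. A minimal sketch (block sizes and radii are illustrative):

    import numpy as np

    def project_product_of_balls(x, centers, radii):
        """Project each block of x onto its ball independently; the blocks
        are decoupled, so this is trivially parallel (one block per segment
        of the trajectory's ellipsoidal corridor, after rescaling)."""
        out = np.empty_like(x)
        for i, (c, r) in enumerate(zip(centers, radii)):
            d = x[i] - c
            n = np.linalg.norm(d)
            out[i] = c + d * (r / n) if n > r else x[i]
        return out

    # Four 3-D blocks, each constrained to its own ball.
    x = np.random.randn(4, 3) * 5
    centers = np.zeros((4, 3))
    radii = np.array([1.0, 2.0, 1.5, 0.5])
    y = project_product_of_balls(x, centers, radii)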