planning - 2025-09-25

Language Models that Think, Chat Better

Authors:Adithya Bhaskar, Xi Ye, Danqi Chen
Date:2025-09-24 17:57:34

Reinforcement learning with verifiable rewards (RLVR) improves language model reasoning by using rule-based rewards in verifiable domains such as mathematics and code. However, RLVR leads to limited generalization for open-ended tasks -- such as outlining essays or making meal plans -- where humans reason routinely. This paper shows that the RLVR paradigm is effective beyond verifiable domains, and introduces **RL** with **M**odel-rewarded **T**hinking (**RLMT**) for general-purpose chat capabilities. Using diverse real-world prompts, RLMT requires LMs to generate long CoT reasoning before responding, and optimizes them with online RL against a preference-based reward model, as used in RLHF. Across 40 training runs on Llama-3.1-8B and Qwen-2.5-7B (both base and instruct) and multiple optimization algorithms (DPO, PPO, and GRPO), RLMT consistently outperforms standard RLHF pipelines. This includes substantial gains of 3-7 points on three chat benchmarks (AlpacaEval2, WildBench, and ArenaHardV2), along with 1-3 point improvements on other tasks like creative writing and general knowledge. Our best 8B model surpasses GPT-4o in chat and creative writing and rivals Claude-3.7-Sonnet (Thinking). RLMT can also be applied directly to base models without an SFT stage, akin to R1-Zero training. Remarkably, with only 7K prompts, Llama-3.1-8B base trained with our RLMT recipe outperforms Llama-3.1-8B-Instruct post-trained with a complex multi-stage pipeline on 25M+ examples. We close with qualitative and quantitative analyses of how trained models plan their responses. Our results motivate a rethinking of the post-training pipeline and call on future work to understand and employ thinking more broadly.
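
As a rough illustration of the training signal described (long thinking traces scored only on the final response, here in a GRPO-style group-normalized form), consider the sketch below; `sample_with_thinking` and `reward_model` are hypothetical stand-ins, not the authors' code:

```python
# Hedged sketch of an RLMT-style reward step (GRPO flavor). The helpers
# `sample_with_thinking` (returns an object with .thinking and .response)
# and `reward_model` (an RLHF-style preference scorer) are assumptions.
import numpy as np

def rlmt_group_advantages(sample_with_thinking, reward_model, prompt, group_size=8):
    completions = [sample_with_thinking(prompt) for _ in range(group_size)]
    # Only the final response, not the thinking trace, is scored.
    rewards = np.array([reward_model(prompt, c.response) for c in completions])
    # GRPO: each sample's advantage is its reward normalized within the group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    return completions, advantages
```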

BBoE: Leveraging Bundle of Edges for Kinodynamic Bidirectional Motion Planning

Authors:Srikrishna Bangalore Raghu, Alessandro Roncone
Date:2025-09-24 17:23:20

In this work, we introduce BBoE, a bidirectional, kinodynamic, sampling-based motion planner that consistently and quickly finds low-cost solutions in environments with varying obstacle clutter. The algorithm combines exploration and exploitation while relying on precomputed robot state traversals, resulting in efficient convergence towards the goal. Our key contributions include: i) a strategy to navigate through obstacle-rich spaces by sorting and sequencing preprocessed forward propagations; and ii) BBoE, a robust bidirectional kinodynamic planner that utilizes this strategy to produce fast and feasible solutions. The proposed framework reduces planning time, diminishes solution cost and increases success rate in comparison to previous approaches.
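
The sorting-and-sequencing strategy in contribution (i) can be pictured as follows; this is an illustrative sketch under stated assumptions, not the BBoE implementation, and every function passed in is hypothetical:

```python
# Sketch of the bundle-of-edges idea: precomputed forward propagations
# ("edges") are ranked by goal progress and tried in order until one is
# collision-free. All callables are hypothetical stand-ins.
def extend_with_bundle(state, goal, bundle, collision_free, apply_edge, distance):
    # Each bundle edge stores a preprocessed state traversal; applying it to
    # the current state is cheap compared to integrating dynamics online.
    ranked = sorted(bundle, key=lambda e: distance(apply_edge(state, e), goal))
    for edge in ranked:
        new_state = apply_edge(state, edge)
        if collision_free(state, new_state):
            return new_state, edge  # best goal-advancing, feasible traversal
    return None, None  # all edges blocked; expand from another node
```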

Certified Learning-Enabled Noise-Aware Motion Planning for Urban Air Mobility

Authors:Jaejeong Park, Mahmoud Elfar, Cody Fleming, Yasser Shoukry
Date:2025-09-24 16:37:43

Urban Air Mobility (UAM) has emerged as a promising solution to alleviate urban congestion and transportation challenges. Nevertheless, the noise generated by eVTOL aircraft poses a significant barrier to public acceptance and regulatory approval, potentially limiting the operational scope and scalability of UAM systems. Hence, the successful adoption of UAM systems hinges on the ability to predict generated noise levels and to develop motion planning strategies that comply with community-level noise regulations while maintaining operational efficiency. To this end, this paper proposes a novel noise-aware motion planning framework for UAM systems that ensures compliance with noise regulations. We first develop a certifiable neural network model to accurately predict eVTOL noise propagation patterns in urban environments, providing provable bounds on its correctness. To achieve a desired level of accuracy, we propose an active sampling strategy to efficiently build the dataset used to train and test the noise model. Next, we develop a noise-aware motion planning algorithm that utilizes the noise model to generate eVTOL trajectories that guarantee compliance with community noise regulations. The algorithm exploits the monotonic structure of the noise model to efficiently sample the configuration space, ensuring that the generated trajectories are both noise-compliant and operationally efficient. We demonstrate the effectiveness of the proposed framework through a number of experiments for Vahana eVTOLs. The results show that the framework can generate noise-compliant flight plans for a fleet of eVTOLs that adhere to community noise regulations while optimizing operational efficiency.
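
One simple consequence of the monotonic structure the paper exploits: if predicted noise at a ground location decreases monotonically with altitude, the lowest compliant altitude can be found by bisection. A hedged sketch, with `noise_db` standing in for the certified noise model (a hypothetical signature):

```python
# Minimal sketch, assuming noise_db(xy, altitude) is monotonically
# decreasing in altitude. Bounds and tolerance are illustrative.
def min_compliant_altitude(noise_db, xy, limit_db, lo=100.0, hi=3000.0, tol=1.0):
    if noise_db(xy, hi) > limit_db:
        return None  # no compliant altitude within the search range
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if noise_db(xy, mid) > limit_db:
            lo = mid  # too loud: fly higher
        else:
            hi = mid  # compliant: try lower
    return hi
```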

Parse-Augment-Distill: Learning Generalizable Bimanual Visuomotor Policies from Single Human Video

Authors:Georgios Tziafas, Jiayun Zhang, Hamidreza Kasaei
Date:2025-09-24 16:19:18

Learning visuomotor policies from expert demonstrations is an important frontier in modern robotics research; however, most popular methods require copious effort to collect teleoperation data and struggle to generalize out-of-distribution. Scaling data collection has been explored through leveraging human videos, as well as demonstration augmentation techniques. The latter approach typically requires expensive simulation rollouts and trains policies with synthetic image data, therefore introducing a sim-to-real gap. In parallel, alternative state representations such as keypoints have shown great promise for category-level generalization. In this work, we bring these avenues together in a unified framework: PAD (Parse-Augment-Distill), for learning generalizable bimanual policies from a single human video. Our method relies on three steps: (a) parsing a human video demo into a robot-executable keypoint-action trajectory, (b) employing bimanual task-and-motion-planning to augment the demonstration at scale without simulators, and (c) distilling the augmented trajectories into a keypoint-conditioned policy. Empirically, we showcase that PAD outperforms state-of-the-art bimanual demonstration augmentation works relying on image policies with simulation rollouts, both in terms of success rate and sample/cost efficiency. We deploy our framework in six diverse real-world bimanual tasks such as pouring drinks, cleaning trash and opening containers, producing one-shot policies that generalize in unseen spatial arrangements, object instances and background distractors. Supplementary material can be found on the project webpage https://gtziafas.github.io/PAD_project/.
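
A minimal sketch of simulator-free augmentation in keypoint space, in the spirit of step (b): replaying a parsed keypoint-action trajectory under sampled rigid transforms yields new, geometrically consistent demonstrations. The SE(2)-only treatment and all names are simplifying assumptions, not PAD's actual pipeline:

```python
# Illustrative augmentation sketch: the same planar rigid transform is
# applied to object keypoints and end-effector actions, so each sampled
# transform produces a new consistent (keypoints, actions) demonstration.
import numpy as np

def augment_demo(keypoints, actions, n_aug=100, workspace=1.0):
    demos = []
    for _ in range(n_aug):
        theta = np.random.uniform(-np.pi, np.pi)
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        t = np.random.uniform(-workspace, workspace, size=2)
        demos.append((keypoints @ R.T + t, actions @ R.T + t))
    return demos
```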

A co-evolving agentic AI system for medical imaging analysis

Authors:Songhao Li, Jonathan Xu, Tiancheng Bao, Yuxuan Liu, Yuchen Liu, Yihang Liu, Lilin Wang, Wenhui Lei, Sheng Wang, Yinuo Xu, Yan Cui, Jialu Yao, Shunsuke Koga, Zhi Huang
Date:2025-09-24 16:15:28

Agentic AI is rapidly advancing in healthcare and biomedical research. However, in medical image analysis, the performance and adoption of such agents remain limited due to the lack of a robust ecosystem, insufficient toolsets, and the absence of real-time interactive expert feedback. Here we present "TissueLab", a co-evolving agentic AI system that allows researchers to ask direct questions, automatically plan and generate explainable workflows, and conduct real-time analyses where experts can visualize intermediate results and refine them. TissueLab integrates tool factories across pathology, radiology, and spatial omics domains. By standardizing inputs, outputs, and capabilities of diverse tools, the system determines when and how to invoke them to address research and clinical questions. Across diverse tasks with clinically meaningful quantifications that inform staging, prognosis, and treatment planning, TissueLab achieves state-of-the-art performance compared with end-to-end vision-language models (VLMs) and other agentic AI systems such as GPT-5. Moreover, TissueLab continuously learns from clinicians, evolving toward improved classifiers and more effective decision strategies. With active learning, it delivers accurate results in unseen disease contexts within minutes, without requiring massive datasets or prolonged retraining. Released as a sustainable open-source ecosystem, TissueLab aims to accelerate computational research and translational adoption in medical imaging while establishing a foundation for the next generation of medical AI.
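
The "standardized tools" idea can be sketched as a small registry interface through which an agent decides when and how to invoke tools; this is illustrative only, not TissueLab's actual API:

```python
# Sketch of a standardized tool interface: each tool declares its inputs,
# outputs, and capability so a planner can chain them. Names are assumptions.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Tool:
    name: str
    capability: str          # e.g. "nuclei_segmentation"
    inputs: List[str]        # e.g. ["wsi_patch"]
    outputs: List[str]       # e.g. ["nuclei_mask"]
    run: Callable[..., dict]

REGISTRY: Dict[str, Tool] = {}

def register(tool: Tool):
    REGISTRY[tool.capability] = tool

def plan_and_run(required: List[str], context: dict) -> dict:
    # Invoke the tool matching each planned capability, threading results
    # through a shared context that experts could inspect and refine.
    for capability in required:
        tool = REGISTRY[capability]
        kwargs = {k: context[k] for k in tool.inputs}
        context.update(tool.run(**kwargs))
    return context
```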

AnchDrive: Bootstrapping Diffusion Policies with Hybrid Trajectory Anchors for End-to-End Driving

Authors:Jinhao Chai, Anqing Jiang, Hao Jiang, Shiyi Mu, Zichong Gu, Shugong Xu
Date:2025-09-24 15:38:41

End-to-end multi-modal planning has become a transformative paradigm in autonomous driving, effectively addressing behavioral multi-modality and the generalization challenge in long-tail scenarios. We propose AnchDrive, a framework for end-to-end driving that effectively bootstraps a diffusion policy to mitigate the high computational cost of traditional generative models. Rather than denoising from pure noise, AnchDrive initializes its planner with a rich set of hybrid trajectory anchors. These anchors are derived from two complementary sources: a static vocabulary of general driving priors and a set of dynamic, context-aware trajectories. The dynamic trajectories are decoded in real-time by a Transformer that processes dense and sparse perceptual features. The diffusion model then learns to refine these anchors by predicting a distribution of trajectory offsets, enabling fine-grained refinement. This anchor-based bootstrapping design allows for efficient generation of diverse, high-quality trajectories. Experiments on the NAVSIM benchmark confirm that AnchDrive sets a new state-of-the-art and shows strong generalizability.
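
A hedged sketch of the anchor-based bootstrapping described above: denoising starts from hybrid anchors rather than pure noise, and the model predicts per-step offsets over a truncated schedule (`offset_model`, the schedule, and shapes are assumptions, not AnchDrive's code):

```python
# Illustrative refinement loop: anchors (static vocabulary + dynamically
# decoded trajectories) seed a short denoising schedule.
import numpy as np

def refine_anchors(anchors, offset_model, n_steps=5, noise_scale=0.1):
    # anchors: (K, T, 2) candidate trajectories
    trajs = anchors + noise_scale * np.random.randn(*anchors.shape)
    for step in reversed(range(n_steps)):      # truncated schedule, not T=1000
        trajs = trajs + offset_model(trajs, step)  # predicted per-step offsets
    return trajs  # diverse refined candidates; a downstream scorer selects one
```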

LidarScout: Direct Out-of-Core Rendering of Massive Point Clouds

Authors:Philipp Erler, Lukas Herzberger, Michael Wimmer, Markus Schütz
Date:2025-09-24 14:53:52

Large-scale terrain scans are the basis for many important tasks, such as topographic mapping, forestry, agriculture, and infrastructure planning. The resulting point cloud data sets are so massive in size that even basic tasks like viewing take hours to days of pre-processing in order to create level-of-detail structures that allow inspecting the data set in its entirety in real time. In this paper, we propose a method that is capable of instantly visualizing massive country-sized scans with hundreds of billions of points. Upon opening the data set, we first load a sparse subsample of points and initialize an overview of the entire point cloud, immediately followed by a surface reconstruction process to generate higher-quality, hole-free heightmaps. As users start navigating towards a region of interest, we continue to prioritize the heightmap construction process to the user's viewpoint. Once a user zooms in closely, we load the full-resolution point cloud data for that region and update the corresponding heightmap textures with the full-resolution data. As users navigate elsewhere, full-resolution point data that is no longer needed is unloaded, but the updated heightmap textures are retained as a form of medium level of detail. Overall, our method constitutes a form of direct out-of-core rendering for massive point cloud data sets (terabytes, compressed) that requires no preprocessing and no additional disk space. Source code, executable, pre-trained model, and dataset are available at: https://github.com/cg-tuwien/lidarscout
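
The viewpoint-driven loading policy can be sketched as a nearest-first tile queue; the tile interface below is a hypothetical stand-in for LidarScout's internals, not its actual data structures:

```python
# Sketch of viewpoint-prioritized out-of-core loading: nearest tiles get
# heightmaps first; full-resolution points are loaded only when zoomed in
# and unloaded afterwards, keeping the heightmap as a medium LOD.
import heapq

def schedule_tiles(tiles, camera, zoom_threshold):
    queue = [(tile.distance_to(camera), tile) for tile in tiles]
    heapq.heapify(queue)
    while queue:
        dist, tile = heapq.heappop(queue)
        if not tile.has_heightmap:
            tile.build_heightmap_from_sparse_points()
        if dist < zoom_threshold:
            tile.load_full_resolution()   # refine the heightmap texture
        elif tile.full_resolution_loaded:
            tile.unload_points()          # heightmap persists as medium LOD
```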

U-Mamba2-SSL for Semi-Supervised Tooth and Pulp Segmentation in CBCT

Authors:Zhi Qin Tan, Xiatian Zhu, Owen Addison, Yunpeng Li
Date:2025-09-24 14:19:33

Accurate segmentation of teeth and pulp in Cone-Beam Computed Tomography (CBCT) is vital for clinical applications like treatment planning and diagnosis. However, this process requires extensive expertise and is exceptionally time-consuming, highlighting the critical need for automated algorithms that can effectively utilize unlabeled data. In this paper, we propose U-Mamba2-SSL, a novel semi-supervised learning framework that builds on the U-Mamba2 model and employs a multi-stage training strategy. The framework first pre-trains U-Mamba2 in a self-supervised manner using a disruptive autoencoder. It then leverages unlabeled data through consistency regularization, where we introduce input and feature perturbations to ensure stable model outputs. Finally, a pseudo-labeling strategy is implemented with a reduced loss weighting to minimize the impact of potential errors. U-Mamba2-SSL achieved an average score of 0.872 and a DSC of 0.969 on the validation dataset, demonstrating the superior performance of our approach. The code is available at https://github.com/zhiqin1998/UMamba2.
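
A compact sketch of the multi-stage objective as described (supervised loss, consistency under input perturbation, and down-weighted pseudo-labels), written in PyTorch with illustrative weights and a Gaussian perturbation as assumptions:

```python
# Hedged sketch of the semi-supervised loss combination; `dice_ce` stands in
# for the segmentation loss, and the perturbation/weights are illustrative.
import torch

def ssl_loss(model, x_lab, y_lab, x_unlab, dice_ce, w_cons=1.0, w_pseudo=0.1):
    sup = dice_ce(model(x_lab), y_lab)
    # Consistency regularization: perturbed inputs should give stable outputs.
    clean = model(x_unlab)
    noisy = model(x_unlab + 0.05 * torch.randn_like(x_unlab))
    cons = torch.mean((clean.softmax(1) - noisy.softmax(1)) ** 2)
    # Pseudo-labeling with a reduced weight to limit error propagation.
    pseudo = dice_ce(noisy, clean.argmax(1).detach())
    return sup + w_cons * cons + w_pseudo * pseudo
```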

Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving

Authors:Pengxiang Li, Yinan Zheng, Yue Wang, Huimin Wang, Hang Zhao, Jingjing Liu, Xianyuan Zhan, Kun Zhan, Xianpeng Lang
Date:2025-09-24 13:35:15

End-to-End (E2E) solutions have emerged as a mainstream approach for autonomous driving systems, with Vision-Language-Action (VLA) models representing a new paradigm that leverages pre-trained multimodal knowledge from Vision-Language Models (VLMs) to interpret and interact with complex real-world environments. However, these methods remain constrained by the limitations of imitation learning, which struggles to inherently encode physical rules during training. Existing approaches often rely on complex rule-based post-refinement, employ reinforcement learning that remains largely limited to simulation, or utilize diffusion guidance that requires computationally expensive gradient calculations. To address these challenges, we introduce ReflectDrive, a novel learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion. We first discretize the two-dimensional driving space to construct an action codebook, enabling the use of pre-trained Diffusion Language Models for planning tasks through fine-tuning. Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient computation. Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors. Based on this, we apply local search methods to identify unsafe tokens and determine feasible solutions, which then serve as safe anchors for inpainting-based regeneration. Evaluated on the NAVSIM benchmark, ReflectDrive demonstrates significant advantages in safety-critical trajectory generation, offering a scalable and reliable solution for autonomous driving systems.
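
The safety-aware reflection mechanism can be sketched as a generate-check-inpaint loop in which safe tokens act as anchors; all helper names below are assumptions, not ReflectDrive's API:

```python
# Illustrative reflection loop: generate trajectory tokens, locate unsafe
# ones via local search, and regenerate only those positions by masked
# (inpainting-style) decoding -- no gradient computation required.
def reflective_generate(diffusion_lm, goal, unsafe_tokens, max_rounds=3):
    tokens = diffusion_lm.generate(condition=goal)   # initial multi-modal plan
    for _ in range(max_rounds):
        bad = unsafe_tokens(tokens)                  # local safety search
        if not bad:
            return tokens                            # plan is compliant
        mask = [i in bad for i in range(len(tokens))]
        # Safe tokens serve as anchors; only masked positions are resampled.
        tokens = diffusion_lm.inpaint(tokens, mask, condition=goal)
    return tokens
```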

Queryable 3D Scene Representation: A Multi-Modal Framework for Semantic Reasoning and Robotic Task Planning

Authors:Xun Li, Rodrigo Santa Cruz, Mingze Xi, Hu Zhang, Madhawa Perera, Ziwei Wang, Ahalya Ravendran, Brandon J. Matthews, Feng Xu, Matt Adcock, Dadong Wang, Jiajun Liu
Date:2025-09-24 12:53:32

To enable robots to comprehend high-level human instructions and perform complex tasks, a key challenge lies in achieving comprehensive scene understanding: interpreting and interacting with the 3D environment in a meaningful way. This requires a smart map that fuses accurate geometric structure with rich, human-understandable semantics. To address this, we introduce the 3D Queryable Scene Representation (3D QSR), a novel framework built on multimedia data that unifies three complementary 3D representations: (1) 3D-consistent novel view rendering and segmentation from panoptic reconstruction, (2) precise geometry from 3D point clouds, and (3) structured, scalable organization via 3D scene graphs. Built on an object-centric design, the framework integrates with large vision-language models to enable semantic queryability by linking multimodal object embeddings, and supporting object-level retrieval of geometric, visual, and semantic information. The retrieved data are then loaded into a robotic task planner for downstream execution. We evaluate our approach through simulated robotic task planning scenarios in Unity, guided by abstract language instructions and using the indoor public dataset Replica. Furthermore, we apply it in a digital duplicate of a real wet lab environment to test QSR-supported robotic task planning for emergency response. The results demonstrate the framework's ability to facilitate scene understanding and integrate spatial and semantic reasoning, effectively translating high-level human instructions into precise robotic task planning in complex 3D environments.
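
Object-level semantic retrieval in such a representation reduces to embedding similarity over graph nodes; a minimal sketch, with the embedding model and node fields as assumptions rather than the 3D QSR implementation:

```python
# Sketch of semantic queryability: embed the language query, match it
# against per-object multimodal embeddings, and return the linked geometry,
# views, and semantics of the best node for downstream task planning.
import numpy as np

def query_scene(text, objects, embed):
    q = embed(text)
    q = q / np.linalg.norm(q)
    scores = [float(np.dot(q, o.embedding / np.linalg.norm(o.embedding)))
              for o in objects]
    best = objects[int(np.argmax(scores))]
    return {"points": best.point_cloud,     # precise geometry
            "views": best.rendered_views,   # panoptic renderings
            "semantics": best.labels}       # scene-graph attributes
```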

GPU-accelerated FREDopt package for simultaneous dose and LETd proton radiotherapy plan optimization via superiorization methods

Authors:Damian Borys, Jan Gajewski, Tobias Becher, Yair Censor, Renata Kopeć, Marzena Rydygier, Angelo Schiavi, Tomasz Skóra, Anna Spaleniak, Niklas Wahl, Agnieszka Wochnik, Antoni Ruciński
Date:2025-09-24 11:29:42

This study presents FREDopt, a newly developed GPU-accelerated open-source optimization software for simultaneous proton dose and dose-averaged LET (LETd) optimization in IMPT treatment planning. FREDopt was implemented entirely in Python, leveraging CuPy for GPU acceleration and incorporating fast Monte Carlo (MC) simulations from the FRED code. The treatment plan optimization workflow includes pre-optimization and optimization, the latter equipped with a novel superiorization of feasibility-seeking algorithms. Feasibility-seeking requires finding a point that satisfies prescribed constraints. Superiorization interlaces computational perturbations into iterative feasibility-seeking steps to steer them toward a superior feasible point, replacing the need for costly full-fledged constrained optimization. The method was validated on two treatment plans of patients treated in a clinical proton therapy center, with dose and LETd distributions compared before and after reoptimization. Simultaneous dose and LETd optimization using FREDopt led to a substantial reduction of LETd and (dose)x(LETd) in organs at risk (OARs) while preserving target dose conformity. Computational performance evaluation showed execution times of 14-50 minutes, depending on the algorithm and target volume size; this is satisfactory for clinical and research applications while enabling further development of the well-tested, documented open-source software.
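
The superiorization idea admits a compact sketch: feasibility-seeking projections are interlaced with summable, objective-reducing perturbations. Here `project_constraints` and `phi_grad` are hypothetical stand-ins for a dose-constraint projector and the gradient of an LETd-weighted target function, not FREDopt's code:

```python
# Minimal superiorization sketch: perturb toward lower phi, then take a
# feasibility-seeking step; the geometrically decaying step sizes are
# summable, which is what superiorization theory requires.
def superiorized_feasibility(x, project_constraints, phi_grad,
                             n_iters=200, beta0=1.0, decay=0.99):
    beta = beta0
    for _ in range(n_iters):
        g = phi_grad(x)
        x = x - beta * g / (abs(g).max() + 1e-12)  # objective-reducing perturbation
        x = project_constraints(x)                 # feasibility-seeking step
        beta *= decay                              # summable perturbation sizes
    return x
```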

OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving

Authors:Pei Liu, Hongliang Lu, Haichao Liu, Haipeng Liu, Xin Liu, Ruoyu Yao, Shengbo Eben Li, Jun Ma
Date:2025-09-24 10:28:06

Human vision is capable of transforming two-dimensional observations into an egocentric three-dimensional scene understanding, which underpins the ability to interpret complex scenes and exhibit adaptive behaviors. This capability, however, remains lacking in current autonomous driving systems, where mainstream approaches primarily rely on depth-based 3D reconstruction rather than true scene understanding. To address this limitation, we propose a novel human-like framework called OmniScene. First, we introduce the OmniScene Vision-Language Model (OmniVLM), a vision-language framework that integrates multi-view and temporal perception for holistic 4D scene understanding. Then, harnessing a teacher-student OmniVLM architecture and knowledge distillation, we embed textual representations into 3D instance features for semantic supervision, enriching feature learning, and explicitly capturing human-like attentional semantics. These feature representations are further aligned with human driving behaviors, forming a more human-like perception-understanding-action architecture. In addition, we propose a Hierarchical Fusion Strategy (HFS) to address imbalances in modality contributions during multimodal integration. Our approach adaptively calibrates the relative significance of geometric and semantic features at multiple abstraction levels, enabling the synergistic use of complementary cues from visual and textual modalities. This learnable dynamic fusion enables a more nuanced and effective exploitation of heterogeneous information. We evaluate OmniScene comprehensively on the nuScenes dataset, benchmarking it against over ten state-of-the-art models across various tasks. Our approach consistently achieves superior results, establishing new benchmarks in perception, prediction, planning, and visual question answering.
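
The gist of a learnable hierarchical fusion can be sketched as a gated combination per abstraction level; dimensions and the gating form below are assumptions, not OmniScene's HFS architecture:

```python
# Hedged PyTorch sketch: learnable gates calibrate the relative weight of
# geometric vs. semantic features, with one fusion module per level.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, geo, sem):
        a = self.gate(torch.cat([geo, sem], dim=-1))  # per-feature weight
        return a * geo + (1.0 - a) * sem              # adaptive combination

# One module per abstraction level, applied bottom-up:
fusers = nn.ModuleList([GatedFusion(256) for _ in range(3)])
```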

GUIDE: A Diffusion-Based Autonomous Robot Exploration Framework Using Global Graph Inference

Authors:Zijun Che, Yinghong Zhang, Shengyi Liang, Boyu Zhou, Jun Ma, Jinni Zhou
Date:2025-09-24 09:15:24

Autonomous exploration in structured and complex indoor environments remains a challenging task, as existing methods often struggle to appropriately model unobserved space and plan globally efficient paths. To address these limitations, we propose GUIDE, a novel exploration framework that synergistically combines global graph inference with diffusion-based decision-making. We introduce a region-evaluation global graph representation that integrates both observed environmental data and predictions of unexplored areas, enhanced by a region-level evaluation mechanism to prioritize reliable structural inferences while discounting uncertain predictions. Building upon this enriched representation, a diffusion policy network generates stable, foresighted action sequences with significantly reduced denoising steps. Extensive simulations and real-world deployments demonstrate that GUIDE consistently outperforms state-of-the-art methods, achieving up to 18.3% faster coverage completion and a 34.9% reduction in redundant movements.

Report on LEO satellite impacts on ground-based optical astronomy for the Rubin Observatory LSST

Authors:Phanindra Kandula, Lee Kelvin, Erfan Nourbakhsh, Daniel Polin, Tom Prince, Meredith Rawls, Adam Snyder, Brianna Smart, Christopher Stubbs, Anthony Tyson, Zeljko Ivezic, Craig Lage, Clare Saunders
Date:2025-09-24 04:37:23

In August 2025 a workshop was convened to bring together experts to better understand steps that can be taken to mitigate the impact of satellite constellations on astronomical observations. At the time, just over 12,000 operational satellites were in low-Earth orbit (LEO). Although reflected sunlight and emissions all across the electromagnetic spectrum from artificial satellites impact scientific observations and the sky, the workshop focused on reflected sunlight in the wavelength range 330 nm to 1100 nm. This aligns with the Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST) planned imaging observations over the coming decade. Among other conclusions, we affirm previous recommendations that tracked satellite apparent magnitudes should be no brighter than 7th AB mag. The workshop participants discussed over 30 publications, reports, and presentations, and arrived at the Findings and Recommendations presented here. During the workshop, in addition to affirming many existing recommendations and best practices, the group discovered new issues and devised possible mitigations. These were nearly equally divided between advice to satellite builders and operators and to the observational astronomy community. While the workshop prioritized considerations for LSST, our hope is that many of the Findings and Recommendations will also apply to other observatories and constellations, and that all satellite companies will continue to engage in dialog with sky observers across the globe.

Trajectory Planning Using Safe Ellipsoidal Corridors as Projections of Orthogonal Trust Regions

Authors:Akshay Jaitly, Jon Arrizabalaga, Guanrui Li
Date:2025-09-24 03:27:02

Planning collision-free trajectories in complex environments remains a core challenge in robotics. Existing corridor-based planners, which rely on decomposition of the free space into collision-free subsets, scale poorly with environmental complexity and require explicit allocation of time windows to trajectory segments. We introduce a new trajectory parameterization that represents trajectories in a nonconvex collision-free corridor as being in a convex Cartesian product of balls. This parameterization allows us to decouple problem size from the geometric complexity of the solution and naturally avoids explicit time allocation by allowing trajectories to evolve continuously inside ellipsoidal corridors. Building on this representation, we formulate the Orthogonal Trust Region Problem (Orth-TRP), a specialized convex program with separable block constraints, and develop a solver that exploits this parallel structure and the unique structure of each parallel subproblem for efficient optimization. Experiments on a quadrotor trajectory planning benchmark show that our approach produces smoother trajectories and lower runtimes than state-of-the-art corridor-based planners, especially in highly complex environments.
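
The benefit of the product-of-balls parameterization is that the core projection step of a trust-region-style solver decomposes into independent per-block ball projections, which is what makes the separable structure parallelizable; a minimal sketch (not the authors' solver):

```python
# Projection onto a Cartesian product of balls decomposes blockwise: each
# block projects onto its own ball independently, so the blocks can be
# processed in parallel.
import numpy as np

def project_product_of_balls(x_blocks, centers, radii):
    out = []
    for x, c, r in zip(x_blocks, centers, radii):
        d = x - c
        n = np.linalg.norm(d)
        out.append(c + d * (r / n) if n > r else x)  # per-ball projection
    return out
```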

A Multiobjective Mathematical Model for Optimal Irrigation Water Allocation

Authors:Nahid Sultana, M. M. Rizvi, G. M. Wali Ullah
Date:2025-09-23 22:42:22

Water resource management is vital for sustainable irrigation and food security under growing climatic and economic pressures. In this study, we develop two single-objective optimization models for irrigation planning, one that maximizes net benefit and the other that minimizes environmental flow deficiency, and benchmark them against previous works. We then extend the analysis to a multiobjective linear programming formulation solved through different approaches to evaluate trade-offs. Numerical experiments on the Muhuri Irrigation Project reveal three outcomes: (i) a complete scenario view with profits ranging from $0.2 \times 10^9$ to $1.497\times 10^9$ and environmental flow deficits between 0 and 1200 GL, where the 1200 GL represents the theoretical annual maximum under a 100 GL uniform monthly target; (ii) explicit trade-offs showing higher profits correspond to greater ecological shortfalls; and (iii) an integration-based approach producing nearly 1000 Pareto-optimal solutions within seconds, greatly outperforming earlier studies.
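
One standard way to trace such profit-versus-deficit trade-off curves for a multiobjective LP is the epsilon-constraint method; the sketch below is illustrative, and the matrices and cost vectors are placeholders, not the paper's Muhuri Irrigation Project model:

```python
# Epsilon-constraint sweep for a two-objective LP: maximize net benefit
# while capping environmental flow deficit at each epsilon in a grid.
import numpy as np
from scipy.optimize import linprog

def pareto_front(c_benefit, c_deficit, A, b, eps_grid):
    front = []
    for eps in eps_grid:
        # Add "deficit <= eps" as an extra linear constraint row.
        A_eps = np.vstack([A, c_deficit])
        b_eps = np.append(b, eps)
        res = linprog(-c_benefit, A_ub=A_eps, b_ub=b_eps)  # maximize benefit
        if res.success:
            front.append((eps, -res.fun))  # (deficit cap, achieved benefit)
    return front
```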

BALANCE: Bitrate-Adaptive Limit-Aware Netcast Content Enhancement Utilizing QUBO and Quantum Annealing

Authors:Animesh Rajpurohit, Michael Kelley, Wei Wang, Krishna Murthy Kattiyan Ramamoorthy
Date:2025-09-23 22:11:56

In an era of increasing data cap constraints, optimizing video streaming quality while adhering to user-defined data caps remains a significant challenge. This paper introduces Bitrate-Adaptive Limit-Aware Netcast Content Enhancement (BALANCE), a novel Quantum framework aimed at addressing this issue. BALANCE intelligently pre-selects video segments based on visual complexity and anticipated data consumption, utilizing the Video Multimethod Assessment Fusion (VMAF) metric to enhance Quality of Experience (QoE). We compare our method against traditional bitrate ladders used in Adaptive Bitrate (ABR) streaming, demonstrating a notable improvement in QoE under equivalent data constraints. We compare the Slack variable approach with the Dynamic Penalization Approach (DPA) by framing the bitrate allocation problem through Quadratic Unconstrained Binary Optimization (QUBO) to effectively enforce data limits. Our results indicate that the DPA consistently outperforms the Slack Variable Method, delivering more valid and optimal solutions as data limits increase. This new quantum approach significantly enhances streaming satisfaction for users with limited data plans.
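
The penalization idea can be sketched as a QUBO over binary segment-rate choices, with the data cap and a one-rate-per-segment rule entering as quadratic penalties; the coefficients and scaling below are assumptions, not BALANCE's exact formulation:

```python
# QUBO sketch: x[s, r] = 1 picks rate r for segment s. Quality (e.g., VMAF)
# is rewarded on the diagonal; the one-hot and data-cap constraints are
# expanded into quadratic penalty terms (constants dropped).
import numpy as np
from itertools import product

def build_qubo(quality, size, cap, lam_cap=1e-6, lam_one=10.0):
    S, R = quality.shape
    n = S * R
    idx = lambda s, r: s * R + r
    Q = np.zeros((n, n))
    # Objective: maximize total quality -> minimize its negative.
    for s, r in product(range(S), range(R)):
        Q[idx(s, r), idx(s, r)] -= quality[s, r]
    # One rate per segment: lam_one * (sum_r x - 1)^2, using x^2 = x.
    for s in range(S):
        for r in range(R):
            Q[idx(s, r), idx(s, r)] -= lam_one
            for r2 in range(r + 1, R):
                Q[idx(s, r), idx(s, r2)] += 2 * lam_one
    # Data cap: lam_cap * (sum size*x - cap)^2, expanded term by term.
    flat = size.reshape(-1)
    for i in range(n):
        Q[i, i] += lam_cap * (flat[i] ** 2 - 2 * cap * flat[i])
        for j in range(i + 1, n):
            Q[i, j] += 2 * lam_cap * flat[i] * flat[j]
    return Q  # pass to a quantum annealer or classical QUBO sampler
```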

Look as You Leap: Planning Simultaneous Motion and Perception for High-DOF Robots

Authors:Qingxi Meng, Emiliano Flores, Carlos Quintero-Peña, Peizhu Qian, Zachary Kingston, Shannan K. Hamlin, Vaibhav Unhelkar, Lydia E. Kavraki
Date:2025-09-23 22:00:51

In this work, we address the problem of planning robot motions for a high-degree-of-freedom (DoF) robot that effectively achieves a given perception task while the robot and the perception target move in a dynamic environment. Achieving navigation and perception tasks simultaneously is challenging, as these objectives often impose conflicting requirements. Existing methods that compute motion under perception constraints fail to account for obstacles, are designed for low-DoF robots, or rely on simplified models of perception. Furthermore, in dynamic real-world environments, robots must replan and react quickly to changes, and directly evaluating the quality of perception (e.g., object detection confidence) is often expensive or infeasible at runtime. This problem is especially important in human-centered environments such as homes and hospitals, where effective perception is essential for safe and reliable operation. To address these challenges, we propose a GPU-parallelized perception-score-guided probabilistic roadmap planner with a neural surrogate model (PS-PRM). The planner explicitly incorporates the estimated quality of a perception task into motion planning for high-DoF robots. Our method uses a learned model to approximate perception scores and leverages GPU parallelism to enable efficient online replanning in dynamic settings. We demonstrate that our planner, evaluated on high-DoF robots, outperforms baseline methods in both static and dynamic environments in both simulation and real-robot experiments.
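
The role of the neural surrogate can be sketched as a perception-aware edge cost for the roadmap; the cost form and names are assumptions, not PS-PRM's implementation:

```python
# Sketch of a perception-score-guided edge cost: a cheap learned surrogate
# replaces expensive detector-confidence evaluation, and edges trade off
# path length against expected perception quality along the edge.
def edge_cost(q_a, q_b, surrogate, target, length, w_perception=1.0):
    score = 0.5 * (surrogate(q_a, target) + surrogate(q_b, target))
    return length(q_a, q_b) + w_perception * (1.0 - score)
```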

vashTimer: A Multi-Purpose, Multimodal Mobile App For Maintaining Passage of Time by means of Visual, Auditory, Speech, and Haptic Alerts

Authors:Aziz N Zeidieh, Sanchita S. Kamath, JooYoung Seo
Date:2025-09-23 21:48:06

Effective time management during presentations is challenging, particularly for Blind and Low-Vision (BLV) individuals, as existing tools often lack accessibility and multimodal feedback. To address this gap, we developed vashTimer: a free, open-source, and accessible iOS application. This paper demonstrates the design and functionality of vashTimer, which provides presenters with a robust tool for temporal awareness. The application delivers highly customizable alerts across four distinct modalities: visual, auditory, speech, and haptic; and supports multiple configurable intervals within a single session. By offering a flexible and non-intrusive time management solution, vashTimer empowers presenters of all visual abilities. The implications of this work extend beyond public speaking to any professional, such as a clinical therapist, who requires discreet temporal cues, fostering greater independence and focus for a wide range of users. This demonstration serves as the foundation for a planned formal user evaluation.

Terra: Hierarchical Terrain-Aware 3D Scene Graph for Task-Agnostic Outdoor Mapping

Authors:Chad R. Samuelson, Abigail Austin, Seth Knoop, Blake Romrell, Gabriel R. Slade, Timothy W. McLain, Joshua G. Mangelson
Date:2025-09-23 21:05:30

Outdoor intelligent autonomous robotic operation relies on a sufficiently expressive map of the environment. Classical geometric mapping methods retain essential structural environment information, but lack a semantic understanding and organization to allow high-level robotic reasoning. 3D scene graphs (3DSGs) address this limitation by integrating geometric, topological, and semantic relationships into a multi-level graph-based map. Outdoor autonomous operations commonly rely on terrain information either due to task-dependence or the traversability of the robotic platform. We propose a novel approach that combines indoor 3DSG techniques with standard outdoor geometric mapping and terrain-aware reasoning, producing terrain-aware place nodes and hierarchically organized regions for outdoor environments. Our method generates a task-agnostic metric-semantic sparse map and constructs a 3DSG from this map for downstream planning tasks, all while remaining lightweight for autonomous robotic operation. Our thorough evaluation demonstrates our 3DSG method performs on par with state-of-the-art camera-based 3DSG methods in object retrieval and surpasses them in region classification while remaining memory efficient. We demonstrate its effectiveness in diverse robotic tasks of object retrieval and region monitoring in both simulation and real-world environments.

MAGIC: Multi-task Gaussian process for joint imputation and classification in healthcare time series

Authors:Dohyun Ku, Catherine D. Chong, Visar Berisha, Todd J. Schwedt, Jing Li
Date:2025-09-23 21:02:39

Time series analysis has emerged as an important tool for improving patient diagnosis and management in healthcare applications. However, these applications commonly face two critical challenges: time misalignment and data sparsity. Traditional approaches address these issues through a two-step process of imputation followed by prediction. We propose MAGIC (Multi-tAsk Gaussian Process for Imputation and Classification), a novel unified framework that simultaneously performs class-informed missing value imputation and label prediction within a hierarchical multi-task Gaussian process coupled with functional logistic regression. To handle intractable likelihood components, MAGIC employs Taylor expansion approximations with bounded error analysis, and parameter estimation is performed using an EM algorithm with block coordinate optimization, supported by convergence analysis. We validate MAGIC through two healthcare applications: prediction of post-traumatic headache improvement following mild traumatic brain injury and prediction of in-hospital mortality within 48 hours after ICU admission. In both applications, MAGIC achieves superior predictive accuracy compared to existing methods. The ability to generate real-time and accurate predictions with limited samples facilitates early clinical assessment and treatment planning, enabling healthcare providers to make more informed treatment decisions.

Agentic Scene Policies: Unifying Space, Semantics, and Affordances for Robot Action

Authors:Sacha Morin, Kumaraditya Gupta, Mahtab Sandhu, Charlie Gauthier, Francesco Argenziano, Kirsty Ellis, Liam Paull
Date:2025-09-23 20:56:00

Executing open-ended natural language queries is a core problem in robotics. While recent advances in imitation learning and vision-language-action models (VLAs) have enabled promising end-to-end policies, these models struggle when faced with complex instructions and new scenes. An alternative is to design an explicit scene representation as a queryable interface between the robot and the world, using query results to guide downstream motion planning. In this work, we present Agentic Scene Policies (ASP), an agentic framework that leverages the advanced semantic, spatial, and affordance-based querying capabilities of modern scene representations to implement a capable language-conditioned robot policy. ASP can execute open-vocabulary queries in a zero-shot manner by explicitly reasoning about object affordances in the case of more complex skills. Through extensive experiments, we compare ASP with VLAs on tabletop manipulation problems and showcase how ASP can tackle room-level queries through affordance-guided navigation, and a scaled-up scene representation. (Project page: https://montrealrobotics.ca/agentic-scene-policies.github.io/)

Stochastic Path Planning in Correlated Obstacle Fields

Authors:Li Zhou, Elvan Ceyhan
Date:2025-09-23 20:30:35

We introduce the Stochastic Correlated Obstacle Scene (SCOS) problem, a navigation setting with spatially correlated obstacles of uncertain blockage status, realistically constrained sensors that provide noisy readings, and costly disambiguation. Modeling the spatial correlation with a Gaussian random field (GRF), we develop Bayesian belief updates that refine blockage probabilities, and use the posteriors to reduce the search space for efficiency. To find the optimal traversal policy, we propose a novel two-stage learning framework. An offline phase learns a robust base policy via optimistic policy iteration augmented with an information bonus to encourage exploration in informative regions, followed by an online rollout policy with periodic base updates via a Bayesian mechanism for information adaptation. This framework supports both Monte Carlo point estimation and distributional reinforcement learning (RL) to learn full cost distributions, leading to stronger uncertainty quantification. We establish theoretical benefits of correlation-aware updating and convergence properties under posterior sampling. Comprehensive empirical evaluations across varying obstacle densities and sensor capabilities demonstrate consistent performance gains over baselines. This framework addresses navigation challenges in environments with adversarial interruptions or clustered natural hazards.
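
The correlation-aware belief update has a natural GP-conditioning sketch: a noisy reading at one location shifts blockage probabilities at correlated locations. The kernel, noise level, and threshold link below are modeling assumptions, not the paper's exact update:

```python
# Sketch of a GRF belief update: blockage is driven by a latent Gaussian
# field f ~ N(mu, K); conditioning on noisy readings y at obs_idx updates
# blockage probabilities P(f > 0) everywhere via the GP posterior.
import numpy as np
from scipy.stats import norm

def update_blockage_beliefs(K, mu, obs_idx, y, noise_var=0.1):
    K_oo = K[np.ix_(obs_idx, obs_idx)] + noise_var * np.eye(len(obs_idx))
    K_ao = K[:, obs_idx]
    gain = K_ao @ np.linalg.inv(K_oo)
    mu_post = mu + gain @ (y - mu[obs_idx])
    var_post = np.clip(np.diag(K - gain @ K_ao.T), 1e-9, None)
    # Blockage probability at every cell, including unobserved ones.
    return norm.cdf(mu_post / np.sqrt(var_post))
```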

Autonomous Elemental Characterization Enabled by a Low Cost Robotic Platform Built Upon a Generalized Software Architecture

Authors:Xuan Cao, Yuxin Wu, Michael L. Whittaker
Date:2025-09-23 20:09:33

Despite the rapidly growing applications of robots in industry, the use of robots to automate tasks in scientific laboratories is less prolific due to the lack of generalized methodologies and the high cost of hardware. This paper focuses on the automation of characterization tasks, aiming to reduce cost while maintaining generalization, and proposes a software architecture for building robotic systems in scientific laboratory environments. A dual-layer (Socket.IO and ROS) action server design is the basic building block, which facilitates the implementation of a web-based front end for user-friendly operations and the use of ROS Behavior Tree for convenient task planning and execution. A robotic platform for automating mineral and material sample characterization is built upon the architecture, with an open source, low-cost three-axis computer numerical control gantry system serving as the main robot. A handheld laser induced breakdown spectroscopy (LIBS) analyzer is integrated with a 3D printed adapter, enabling automated 2D chemical mapping. We demonstrate the utility of automated chemical mapping by scanning of the surface of a spodumene-bearing pegmatite core sample with a 1071-point dense hyperspectral map acquired at a rate of 1520 bits per second. Automated LIBS scanning enables controlled chemical quantification in the laboratory that complements field-based measurements acquired with the same handheld device, linking resource exploration and processing steps in the supply chain for lithium-based battery materials.

Supercomputing for High-speed Avoidance and Reactive Planning in Robots

Authors:Kieran S. Lachmansingh, José R. González-Estrada, Ryan E. Grant, Matthew K. X. J. Pan
Date:2025-09-23 18:48:56

This paper presents SHARP (Supercomputing for High-speed Avoidance and Reactive Planning), a proof-of-concept study demonstrating how high-performance computing (HPC) can enable millisecond-scale responsiveness in robotic control. While modern robots face increasing demands for reactivity in human-robot shared workspaces, onboard processors are constrained by size, power, and cost. Offloading to HPC offers massive parallelism for trajectory planning, but its feasibility for real-time robotics remains uncertain due to network latency and jitter. We evaluate SHARP in a stress-test scenario where a 7-DOF manipulator must dodge high-speed foam projectiles. Using a parallelized multi-goal A* search implemented with MPI on both local and remote HPC clusters, the system achieves mean planning latencies of 22.9 ms (local) and 30.0 ms (remote, ~300 km away), with avoidance success rates of 84% and 88%, respectively. These results show that when round-trip latency remains within the tens-of-milliseconds regime, HPC-side computation is no longer the bottleneck, enabling avoidance well below human reaction times. The SHARP results motivate hybrid control architectures: low-level reflexes remain onboard for safety, while bursty, high-throughput planning tasks are offloaded to HPC for scalability. By reporting per-stage timing and success rates, this study provides a reproducible template for assessing real-time feasibility of HPC-driven robotics. Collectively, SHARP reframes HPC offloading as a viable pathway toward dependable, reactive robots in dynamic environments.
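
The offloading pattern can be sketched with mpi4py: candidate avoidance goals are partitioned across ranks, each rank plans locally, and the cheapest feasible plan wins. `astar_plan` and the plan object's `.cost` field are hypothetical stand-ins for the paper's parallelized search:

```python
# Hedged sketch of MPI-parallel multi-goal planning: round-robin goal split,
# local A* per rank, then an allgather so every rank sees the best plan.
from mpi4py import MPI

def sharp_plan(start, goals, astar_plan):
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    local = [astar_plan(start, g) for g in goals[rank::size]]
    local_best = min((p for p in local if p is not None),
                     key=lambda p: p.cost, default=None)
    candidates = comm.allgather(local_best)
    feasible = [p for p in candidates if p is not None]
    return min(feasible, key=lambda p: p.cost) if feasible else None
```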

HUNT: High-Speed UAV Navigation and Tracking in Unstructured Environments via Instantaneous Relative Frames

Authors:Alessandro Saviolo, Jeffrey Mao, Giuseppe Loianno
Date:2025-09-23 18:07:10

Search and rescue operations require unmanned aerial vehicles to both traverse unknown unstructured environments at high speed and track targets once detected. Achieving both capabilities under degraded sensing and without global localization remains an open challenge. Recent works on relative navigation have shown robust tracking by anchoring planning and control to a visible detected object, but cannot address navigation when no target is in the field of view. We present HUNT (High-speed UAV Navigation and Tracking), a real-time framework that unifies traversal, acquisition, and tracking within a single relative formulation. HUNT defines navigation objectives directly from onboard instantaneous observables such as attitude, altitude, and velocity, enabling reactive high-speed flight during search. Once a target is detected, the same perception-control pipeline transitions seamlessly to tracking. Outdoor experiments in dense forests, container compounds, and search-and-rescue operations with vehicles and mannequins demonstrate robust autonomy where global methods fail.

WolBanking77: Wolof Banking Speech Intent Classification Dataset

Authors:Abdou Karim Kandji, Frédéric Precioso, Cheikh Ba, Samba Ndiaye, Augustin Ndione
Date:2025-09-23 17:34:10

Intent classification models have made a lot of progress in recent years. However, previous studies primarily focus on high-resource language datasets, which results in a gap for low-resource languages and for regions with high rates of illiteracy, where languages are more spoken than read or written. This is the case in Senegal, for example, where Wolof is spoken by around 90% of the population, with an illiteracy rate of 42% for the country. Wolof is spoken by more than 10 million people in the West African region. To tackle such limitations, we release a Wolof Intent Classification Dataset (WolBanking77) for academic research in intent classification. WolBanking77 currently contains 9,791 text sentences in the banking domain and more than 4 hours of spoken sentences. Experiments on various baselines are conducted in this work, including state-of-the-art text and voice models. The results are very promising on this dataset. This paper also provides detailed analyses of the contents of the data. We report baseline F1-score and word error rate metrics for NLP and ASR models, respectively, trained on the WolBanking77 dataset, along with comparisons between models. We plan to maintain and update the dataset and to release open-source code.

Imitation-Guided Bimanual Planning for Stable Manipulation under Changing External Forces

Authors:Kuanqi Cai, Chunfeng Wang, Zeqi Li, Haowen Yao, Weinan Chen, Luis Figueredo, Aude Billard, Arash Ajoudani
Date:2025-09-23 17:19:25

Robotic manipulation in dynamic environments often requires seamless transitions between different grasp types to maintain stability and efficiency. However, achieving smooth and adaptive grasp transitions remains a challenge, particularly when dealing with external forces and complex motion constraints. Existing grasp transition strategies often fail to account for varying external forces and do not optimize motion performance effectively. In this work, we propose an Imitation-Guided Bimanual Planning Framework that integrates efficient grasp transition strategies and motion performance optimization to enhance stability and dexterity in robotic manipulation. Our approach introduces Strategies for Sampling Stable Intersections in Grasp Manifolds for seamless transitions between uni-manual and bi-manual grasps, reducing computational costs and regrasping inefficiencies. Additionally, a Hierarchical Dual-Stage Motion Architecture combines an Imitation Learning-based Global Path Generator with a Quadratic Programming-driven Local Planner to ensure real-time motion feasibility, obstacle avoidance, and superior manipulability. The proposed method is evaluated through a series of force-intensive tasks, demonstrating significant improvements in grasp transition efficiency and motion performance. A video demonstrating our simulation results can be viewed at https://youtu.be/3DhbUsv4eDo.

Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation

Authors:Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, Jason Kuen
Date:2025-09-23 17:05:46

We propose Lavida-O, a unified Masked Diffusion Model (MDM) for multimodal understanding and generation. Unlike existing multimodal MDMs such as MMaDa and Muddit which only support simple image-level understanding tasks and low-resolution image generation, Lavida-O presents a single framework that enables image-level understanding, object grounding, image editing, and high-resolution (1024px) text-to-image synthesis. Lavida-O incorporates a novel Elastic Mixture-of-Transformers (Elastic-MoT) architecture that couples a lightweight generation branch with a larger understanding branch, supported by token compression, universal text conditioning and stratified sampling for efficient and high-quality generation. Lavida-O further incorporates planning and iterative self-reflection in image generation and editing tasks, seamlessly boosting generation quality with its understanding capabilities. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks including RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing, outperforming existing autoregressive models and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev, while offering considerable speedup at inference. These advances establish Lavida-O as a new paradigm for scalable multimodal reasoning and generation.

An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications

Authors:Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, Ahmed E. Hassan
Date:2025-09-23 16:02:09

Foundation model (FM)-based AI agents are rapidly gaining adoption across diverse domains, but their inherent non-determinism and non-reproducibility pose testing and quality assurance challenges. While recent benchmarks provide task-level evaluations, there is limited understanding of how developers verify the internal correctness of these agents during development. To address this gap, we conduct the first large-scale empirical study of testing practices in the AI agent ecosystem, analyzing 39 open-source agent frameworks and 439 agentic applications. We identify ten distinct testing patterns and find that novel, agent-specific methods like DeepEval are seldom used (around 1%), while traditional patterns like negative and membership testing are widely adapted to manage FM uncertainty. By mapping these patterns to canonical architectural components of agent frameworks and agentic applications, we uncover a fundamental inversion of testing effort: deterministic components like Resource Artifacts (tools) and Coordination Artifacts (workflows) consume over 70% of testing effort, while the FM-based Plan Body receives less than 5%. Crucially, this reveals a critical blind spot, as the Trigger component (prompts) remains neglected, appearing in around 1% of all tests. Our findings offer the first empirical testing baseline in FM-based agent frameworks and agentic applications, revealing a rational but incomplete adaptation to non-determinism. To address it, framework developers should improve support for novel testing methods, application developers must adopt prompt regression testing, and researchers should explore barriers to adoption. Strengthening these practices is vital for building more robust and dependable AI agents.
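
The recommended prompt regression testing can be sketched as a pytest case that pins a prompt template and asserts property-level invariants (in the membership/negative style the study observed) rather than exact strings, which FM non-determinism makes brittle. `run_agent` and the template are hypothetical:

```python
# Sketch of a prompt regression test: the prompt is version-controlled and
# the assertions check stable properties of the output, not exact text.
import pytest

PLANNER_PROMPT = "You are a planner. Return steps as a JSON list."

@pytest.mark.parametrize("task", ["book a flight", "summarize a file"])
def test_planner_prompt_regression(task):
    output = run_agent(PLANNER_PROMPT, task)
    # Membership/negative-style assertions tolerate non-determinism.
    assert output.strip().startswith("["), "plan must be a JSON list"
    assert "DROP TABLE" not in output, "no unsafe tool invocations"
    assert len(output) < 4000, "guard against runaway generations"
```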