planning - 2025-09-24

WolBanking77: Wolof Banking Speech Intent Classification Dataset

Authors:Abdou Karim Kandji, Frédéric Precioso, Cheikh Ba, Samba Ndiaye, Augustin Ndione
Date:2025-09-23 17:34:10

Intent classification models have made substantial progress in recent years. However, previous studies primarily focus on high-resource language datasets, leaving a gap for low-resource languages and for regions with high illiteracy rates, where languages are more often spoken than read or written. This is the case in Senegal, for example, where Wolof is spoken by around 90% of the population while the country's illiteracy rate stands at 42%. Wolof is spoken by more than 10 million people across the West African region. To tackle these limitations, we release a Wolof Intent Classification Dataset (WolBanking77) for academic research in intent classification. WolBanking77 currently contains 9,791 text sentences in the banking domain and more than 4 hours of spoken sentences. We conduct experiments on various baselines, including state-of-the-art text and speech models, with promising results on the current dataset. This paper also provides detailed analyses of the dataset's contents. We report baseline f1-scores for NLP models and word error rates for ASR models trained on WolBanking77, along with comparisons between models. We plan to maintain and update the dataset and to release open-source code.
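As a concrete reference for the two reported metrics, here is a minimal sketch of macro F1 (intent classification) and word error rate (ASR); the labels and transcripts are invented for illustration, not drawn from WolBanking77.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

print(macro_f1(["balance", "card", "card"], ["balance", "card", "loan"]))  # ~0.83
print(wer("ma bëgg seet sama compte", "ma bëgg seet compte"))              # 0.2
```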

Imitation-Guided Bimanual Planning for Stable Manipulation under Changing External Forces

Authors:Kuanqi Cai, Chunfeng Wang, Zeqi Li, Haowen Yao, Weinan Chen, Luis Figueredo, Aude Billard, Arash Ajoudani
Date:2025-09-23 17:19:25

Robotic manipulation in dynamic environments often requires seamless transitions between different grasp types to maintain stability and efficiency. However, achieving smooth and adaptive grasp transitions remains a challenge, particularly when dealing with external forces and complex motion constraints. Existing grasp transition strategies often fail to account for varying external forces and do not optimize motion performance effectively. In this work, we propose an Imitation-Guided Bimanual Planning Framework that integrates efficient grasp transition strategies and motion performance optimization to enhance stability and dexterity in robotic manipulation. Our approach introduces Strategies for Sampling Stable Intersections in Grasp Manifolds for seamless transitions between uni-manual and bi-manual grasps, reducing computational costs and regrasping inefficiencies. Additionally, a Hierarchical Dual-Stage Motion Architecture combines an Imitation Learning-based Global Path Generator with a Quadratic Programming-driven Local Planner to ensure real-time motion feasibility, obstacle avoidance, and superior manipulability. The proposed method is evaluated through a series of force-intensive tasks, demonstrating significant improvements in grasp transition efficiency and motion performance. A video demonstrating our simulation results can be viewed at https://youtu.be/3DhbUsv4eDo.
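For intuition on the local-planning stage, the sketch below tracks a waypoint from a global (e.g., imitation-learned) path while staying clear of an obstacle. The paper's local planner is a quadratic program; SLSQP here is only a stand-in solver, and all geometry is invented.

```python
import numpy as np
from scipy.optimize import minimize

global_waypoint = np.array([1.0, 0.5])        # target from the global path generator
obstacle, radius = np.array([0.9, 0.45]), 0.25

def cost(p):
    return np.sum((p - global_waypoint) ** 2)  # stay close to the reference

constraints = [{"type": "ineq",                # keep outside the obstacle disk
                "fun": lambda p: np.linalg.norm(p - obstacle) - radius}]

result = minimize(cost, x0=np.zeros(2), constraints=constraints)
print("local target:", result.x)  # nudged off the global path to clear the obstacle
```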

Lavida-O: Elastic Masked Diffusion Models for Unified Multimodal Understanding and Generation

Authors:Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, Jason Kuen
Date:2025-09-23 17:05:46

We propose Lavida-O, a unified multi-modal Masked Diffusion Model (MDM) capable of image understanding and generation tasks. Unlike existing multimodal diffusion language models such as MMaDa and Muddit, which only support simple image-level understanding tasks and low-resolution image generation, Lavida-O exhibits many new capabilities such as object grounding, image editing, and high-resolution (1024px) image synthesis. It is also the first unified MDM that uses its understanding capabilities to improve image generation and editing results through planning and iterative self-reflection. To allow effective and efficient training and sampling, Lavida-O introduces several novel techniques such as an Elastic Mixture-of-Transformer architecture, universal text conditioning, and stratified sampling. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks such as RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing, outperforming existing autoregressive and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev, while offering considerable speedup at inference.
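For intuition on masked-diffusion decoding, the schematic loop below starts fully masked and commits the most confident positions each step. The uniform random "model" is a placeholder, and Lavida-O's actual networks and stratified sampling schedule are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, length, steps, MASK = 100, 16, 4, -1
tokens = np.full(length, MASK)

def model_logits(tokens):
    # Placeholder for the MDM: per-position vocabulary logits.
    return rng.normal(size=(len(tokens), vocab))

for step in range(steps):
    logits = model_logits(tokens)
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    conf, choice = probs.max(-1), probs.argmax(-1)
    masked = np.flatnonzero(tokens == MASK)
    k = max(1, len(masked) // (steps - step))       # unmasking schedule
    commit = masked[np.argsort(conf[masked])[-k:]]  # most confident slots
    tokens[commit] = choice[commit]

print(tokens)  # fully decoded after `steps` parallel refinement passes
```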

An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications

Authors:Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, Ahmed E. Hassan
Date:2025-09-23 16:02:09

Foundation model (FM)-based AI agents are rapidly gaining adoption across diverse domains, but their inherent non-determinism and non-reproducibility pose testing and quality assurance challenges. While recent benchmarks provide task-level evaluations, there is limited understanding of how developers verify the internal correctness of these agents during development. To address this gap, we conduct the first large-scale empirical study of testing practices in the AI agent ecosystem, analyzing 39 open-source agent frameworks and 439 agentic applications. We identify ten distinct testing patterns and find that novel, agent-specific methods like DeepEval are seldom used (around 1%), while traditional patterns like negative and membership testing are widely adapted to manage FM uncertainty. By mapping these patterns to canonical architectural components of agent frameworks and agentic applications, we uncover a fundamental inversion of testing effort: deterministic components like Resource Artifacts (tools) and Coordination Artifacts (workflows) consume over 70% of testing effort, while the FM-based Plan Body receives less than 5%. Crucially, this reveals a critical blind spot, as the Trigger component (prompts) remains neglected, appearing in around 1% of all tests. Our findings offer the first empirical testing baseline in FM-based agent frameworks and agentic applications, revealing a rational but incomplete adaptation to non-determinism. To address it, framework developers should improve support for novel testing methods, application developers must adopt prompt regression testing, and researchers should explore barriers to adoption. Strengthening these practices is vital for building more robust and dependable AI agents.
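To make the recommended prompt regression testing concrete, here is a hedged pytest-style sketch with a stubbed FM call; `run_agent`, the stub, and the JSON contract are hypothetical, not taken from any framework in the study.

```python
import json

SYSTEM_PROMPT = "Reply with JSON: {\"tool\": <name>, \"args\": <dict>}"

def fake_fm(prompt: str) -> str:
    # Deterministic stub so the test is reproducible in CI.
    return '{"tool": "search", "args": {"query": "weather"}}'

def run_agent(user_msg: str, fm=fake_fm) -> dict:
    return json.loads(fm(SYSTEM_PROMPT + "\nUser: " + user_msg))

def test_prompt_contract():
    out = run_agent("What's the weather?")
    assert set(out) == {"tool", "args"}             # membership testing
    assert out["tool"] in {"search", "calculator"}  # allowed tool set
    assert "rm -rf" not in json.dumps(out)          # negative testing

test_prompt_contract()  # a prompt edit that breaks the contract now fails fast
```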

A Multimodal Stochastic Planning Approach for Navigation and Multi-Robot Coordination

Authors:Mark Gonzales, Ethan Oh, Joseph Moore
Date:2025-09-23 15:43:18

In this paper, we present a receding-horizon, sampling-based planner capable of reasoning over multimodal policy distributions. By using the cross-entropy method to optimize a multimodal policy under a common cost function, our approach increases robustness against local minima and promotes effective exploration of the solution space. We show that our approach naturally extends to multi-robot collision-free planning, enables agents to share diverse candidate policies to avoid deadlocks, and allows teams to minimize a global objective without incurring the computational complexity of centralized optimization. Numerical simulations demonstrate that employing multiple modes significantly improves success rates in trap environments and in multi-robot collision avoidance. Hardware experiments further validate the approach's real-time feasibility and practical performance.
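A minimal sketch of the core idea, cross-entropy optimization over several policy modes under a shared cost, follows; the 1-D bimodal cost and all hyperparameters are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def cost(u):
    # Two basins: the global minimum near u = -2 and a local one near u = +3.
    return np.minimum((u + 2.0) ** 2, 0.5 * (u - 3.0) ** 2 + 0.1)

means, stds = np.array([-4.0, 4.0]), np.array([2.0, 2.0])  # two policy modes
for _ in range(30):
    for m in range(len(means)):
        samples = rng.normal(means[m], stds[m], size=64)   # sampled policies
        elites = samples[np.argsort(cost(samples))[:8]]    # lowest shared cost
        means[m], stds[m] = elites.mean(), elites.std() + 1e-3
print("mode means:", means)  # each mode settles into a different basin
```

Keeping both modes alive is what guards against committing to the local basin; a single-mode CEM started at +4 would converge there and stay.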

Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning

Authors:Guoxin Wang, Jun Zhao, Xinyi Liu, Yanbo Liu, Xuyang Cao, Chao Li, Zhuoyun Liu, Qintian Sun, Fangru Zhou, Haoqiang Xing, Zhenhong Yang
Date:2025-09-23 14:42:31

Medical imaging provides critical evidence for clinical diagnosis, treatment planning, and surgical decisions, yet most existing imaging models are narrowly focused and require multiple specialized networks, limiting their generalization. Although large-scale language and multimodal models exhibit strong reasoning and multi-task capabilities, real-world clinical applications demand precise visual grounding, multimodal integration, and chain-of-thought reasoning. We introduce Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning. The model integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference in a single framework. We propose a novel multimodal training approach and release a curated open-source data suite covering reasoning, detection, segmentation, and document understanding tasks. Evaluations demonstrate that Citrus-V outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks, delivering a unified pipeline from visual grounding to clinical reasoning and supporting precise lesion quantification, automated reporting, and reliable second opinions.

Zero-Shot Multi-Spectral Learning: Reimagining a Generalist Multimodal Gemini 2.5 Model for Remote Sensing Applications

Authors:Ganesh Mallya, Yotam Gigi, Dahun Kim, Maxim Neumann, Genady Beryozkin, Tomer Shekel, Anelia Angelova
Date:2025-09-23 14:40:52

Multi-spectral imagery plays a crucial role in diverse Remote Sensing applications including land-use classification, environmental monitoring and urban planning. These images are widely adopted because their additional spectral bands correlate strongly with physical materials on the ground, such as ice, water, and vegetation, allowing for more accurate identification; their public availability from missions such as Sentinel-2 and Landsat only adds to their value. Currently, the automatic analysis of such data is predominantly managed through machine learning models specifically trained for multi-spectral input, which are costly to train and support. Furthermore, although they provide substantial utility for Remote Sensing, such additional inputs cannot be used with powerful generalist large multimodal models, which are capable of solving many visual problems but are not able to understand specialized multi-spectral signals. To address this, we propose a training-free approach which introduces new multi-spectral data in a Zero-Shot-only mode, as inputs to generalist multimodal models trained on RGB-only inputs. Our approach leverages the multimodal models' understanding of the visual space, adapting the new inputs to that space and injecting domain-specific information as instructions into the model. We exemplify this idea with the Gemini 2.5 model and observe strong Zero-Shot performance gains on popular Remote Sensing benchmarks for land cover and land use classification, demonstrating the easy adaptability of Gemini 2.5 to new inputs. These results highlight the potential for geospatial professionals, working with non-standard specialized inputs, to easily leverage powerful multimodal models such as Gemini 2.5 to accelerate their work, benefiting from their rich reasoning and contextual capabilities grounded in the specialized sensor data.
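A minimal sketch of the adaptation idea under stated assumptions: map extra bands into the model's RGB visual space and describe the band semantics in the instructions. The band choice and prompt wording are illustrative, not the paper's exact recipe.

```python
import numpy as np

def to_display_rgb(bands: np.ndarray) -> np.ndarray:
    """bands: (H, W, 3) raw reflectances -> uint8 image in [0, 255]."""
    lo, hi = np.percentile(bands, 2), np.percentile(bands, 98)
    return (255 * np.clip((bands - lo) / (hi - lo), 0, 1)).astype(np.uint8)

# e.g., a Sentinel-2 false-color composite: NIR (B8), red (B4), green (B3).
scene = np.random.rand(64, 64, 3)  # placeholder for real band data
image = to_display_rgb(scene)

prompt = (
    "This is a false-color composite where red=near-infrared (B8), "
    "green=red (B4), blue=green (B3); bright red indicates vegetation. "
    "Classify the dominant land cover."
)
# `image` and `prompt` would then be passed to the multimodal model.
```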

World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation

Authors:Zhennan Jiang, Kai Liu, Yuxin Qin, Shuai Tian, Yupeng Zheng, Mingcai Zhou, Chao Yu, Haoran Li, Dongbin Zhao
Date:2025-09-23 14:38:15

Robotic manipulation policies are commonly initialized through imitation learning, but their performance is limited by the scarcity and narrow coverage of expert data. Reinforcement learning can refine policies to alleviate this limitation, yet real-robot training is costly and unsafe, while training in simulators suffers from the sim-to-real gap. Recent advances in generative models have demonstrated remarkable capabilities in real-world simulation, with diffusion models in particular excelling at generation. This raises the question of how diffusion model-based world models can be used to enhance pre-trained policies in robotic manipulation. In this work, we propose World4RL, a framework that employs diffusion-based world models as high-fidelity simulators to refine pre-trained policies entirely in imagined environments for robotic manipulation. Unlike prior works that primarily employ world models for planning, our framework enables direct end-to-end policy optimization. World4RL is designed around two principles: pre-training a diffusion world model that captures diverse dynamics on multi-task datasets, and refining policies entirely within the frozen world model to avoid online real-world interactions. We further design a two-hot action encoding scheme tailored for robotic manipulation and adopt diffusion backbones to improve modeling fidelity. Extensive simulation and real-world experiments demonstrate that World4RL provides high-fidelity environment modeling and enables consistent policy refinement, yielding significantly higher success rates compared to imitation learning and other baselines. More visualization results are available at https://world4rl.github.io/.
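The two-hot action encoding admits a compact illustration: a continuous action is split between its two nearest grid bins so it can be decoded exactly. The grid range and size below are assumptions, not the paper's settings.

```python
import numpy as np

def two_hot(value: float, low: float, high: float, bins: int) -> np.ndarray:
    centers = np.linspace(low, high, bins)
    value = np.clip(value, low, high)
    idx = np.searchsorted(centers, value)  # right neighbor on the grid
    vec = np.zeros(bins)
    if idx == 0 or centers[idx] == value:
        vec[idx] = 1.0                     # value falls exactly on a bin
    else:
        left, right = centers[idx - 1], centers[idx]
        w = (value - left) / (right - left)
        vec[idx - 1], vec[idx] = 1.0 - w, w
    return vec

print(two_hot(0.37, low=-1.0, high=1.0, bins=11))
# Decoding recovers the value exactly: (two_hot(...) * centers).sum() == 0.37
```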

Position: Human-Robot Interaction in Embodied Intelligence Demands a Shift From Static Privacy Controls to Dynamic Learning

Authors:Shuning Zhang, Hong Jia, Simin Li, Ting Dang, Yongquan `Owen' Hu, Xin Yi, Hewu Li
Date:2025-09-23 14:10:00

The reasoning capabilities of embodied agents introduce a critical, under-explored inferential privacy challenge: the risk that an agent generates sensitive conclusions from ambient data. This capability creates a fundamental tension between an agent's utility and user privacy, rendering traditional static controls ineffective. To address this, this position paper proposes a framework that reframes privacy as a dynamic learning problem grounded in the theory of Contextual Integrity (CI). Our approach enables agents to proactively learn and adapt to individual privacy norms through interaction, outlining a research agenda to develop embodied agents that are both capable and trustworthy safeguards of user privacy.

VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

Authors:Hao Wang, Eiki Murata, Lingfang Zhang, Ayako Sato, So Fukuda, Ziqi Yin, Wentao Hu, Keisuke Nakao, Yusuke Nakamura, Sebastian Zwirner, Yi-Chia Chen, Hiroyuki Otomo, Hiroki Ouchi, Daisuke Kawahara
Date:2025-09-23 13:46:31

Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges associated with long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, underpinning real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and push forward MLLMs' geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent's markedly improved itinerary recommendations verify that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.

How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective

Authors:Songsong Yu, Yuxin Chen, Hao Ju, Lianjie Jia, Fuxi Zhang, Shaofei Huang, Yuhan Wu, Rundi Cui, Binghao Ran, Zaibin Zhang, Zhedong Zheng, Zhipeng Zhang, Yifan Wang, Lin Song, Lijun Wang, Yanwei Li, Ying Shan, Huchuan Lu
Date:2025-09-23 12:00:14

Visual Spatial Reasoning (VSR) is a core human cognitive ability and a critical requirement for advancing embodied intelligence and autonomous systems. Despite recent progress in Vision-Language Models (VLMs), achieving human-level VSR remains highly challenging due to the complexity of representing and reasoning over three-dimensional space. In this paper, we present a systematic investigation of VSR in VLMs, encompassing a review of existing methodologies across input modalities, model architectures, training strategies, and reasoning mechanisms. Furthermore, we categorize spatial intelligence into three levels of capability, i.e., basic perception, spatial understanding, and spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings. Experiments with state-of-the-art VLMs reveal a pronounced gap between perception and reasoning, as models show competence in basic perceptual tasks but consistently underperform in understanding and planning tasks, particularly in numerical estimation, multi-view reasoning, temporal dynamics, and spatial imagination. These findings underscore the substantial challenges that remain in achieving spatial intelligence, while providing both a systematic roadmap and a comprehensive benchmark to drive future research in the field. The related resources of this study are accessible at https://sibench.github.io/Awesome-Visual-Spatial-Reasoning/.

Teaching RDM in a smart advanced inorganic lab course and its provision in the DALIA platform

Authors:Alexander Hoffmann, Jochen Ortmeyer, Fabian Fink, Charles Tapley Hoyt, Jonathan D. Geiger, Paul Kehrein, Torsten Schrade, Sonja Herres-Pawlis
Date:2025-09-23 11:40:39

Research data management (RDM) is a key data literacy skill that chemistry students must acquire. Concepts such as the FAIR data principles (Findable, Accessible, Interoperable, Reusable) should already be taught and applied in undergraduate studies. Traditionally, research data from labs, theses, and internships were handwritten and stored in inaccessible formats such as PDFs, limiting reuse and machine learning applications. At RWTH Aachen University, a fifth-semester lab course introduces students to the electronic laboratory notebook (ELN) Chemotion, an open-source DFG-funded tool linked to the national NFDI4Chem initiative. Students plan, document, and evaluate experiments digitally, ensuring metadata and analyses are captured for long-term reuse. Chemotion's intuitive interface and repository enable sustainable data sharing. To reinforce RDM, students receive a seminar and access to online training videos with interactive Moodle elements. Herein we highlight the use of the DALIA platform as a discovery tool for the students.

Learning Obstacle Avoidance using Double DQN for Quadcopter Navigation

Authors:Nishant Doshi, Amey Sutvani, Sanket Gujar
Date:2025-09-23 07:27:48

One of the challenges faced by Autonomous Aerial Vehicles is reliable navigation through urban environments. Factors like reduced precision of the Global Positioning System (GPS), narrow spaces, and dynamically moving obstacles make path planning for an aerial robot a complicated task. One of the skills required for an agent to navigate effectively through such an environment is the ability to avoid collisions using information from onboard depth sensors. In this paper, we propose a reinforcement learning approach in which a virtual quadcopter agent equipped with a depth camera learns to navigate through a simulated urban environment.
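For reference, the Double DQN update named in the title decouples action selection (online network) from action evaluation (target network); a minimal sketch with stubbed networks follows. The depth-image encoder and environment from the paper are not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, gamma = 4, 0.99

def q_online(state):  # placeholder for the online Q-network
    return rng.normal(size=n_actions)

def q_target(state):  # placeholder for the slow-moving target network
    return rng.normal(size=n_actions)

def double_dqn_target(reward, next_state, done):
    a_star = int(np.argmax(q_online(next_state)))  # select with online net
    bootstrap = q_target(next_state)[a_star]       # evaluate with target net
    return reward + gamma * bootstrap * (1.0 - done)

print(double_dqn_target(reward=-0.1, next_state=None, done=0.0))
```

Using one network for both selection and evaluation (vanilla DQN) systematically overestimates values; the split above is the paper's titular remedy.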

Content and Quality Analysis of mHealth Apps for Feeding Children with Autism Spectrum Disorder

Authors:Christopher Cofie Kuzagbe, Fabrice Mukarage, Skye Nandi Adams, N'guessan Yves-Roland Douha, Edith Talina Luhanga
Date:2025-09-23 07:04:01

Background: Approximately 1 in 100 children worldwide are diagnosed with Autism Spectrum Disorder (ASD), and 46% to 89% experience significant feeding difficulties. Mobile health applications (mHealth apps) have emerged as a potential tool for scalable support. However, their quality and relevance in managing ASD-related feeding challenges remain unclear. Objective: To identify and evaluate the quality of mHealth apps available in the Africa region addressing feeding difficulties in children with ASD. Methods: A systematic search was conducted on the Apple App Store and Google Play Store between September and October 2024. Applications were included if they were free, in English, updated within the past year, explicitly focused on feeding in children with autism, available in the Africa region, and had more than 100 downloads. Eligible apps were assessed using the Behavior Change Wheel (BCW) framework and rated with the Mobile App Rating Scale (MARS) across four domains: engagement, functionality, aesthetics, and information quality. Results: Of the 326 applications identified, only two iOS apps met all inclusion criteria. EduKitchen-Toddlers Food Games featured child-centered interactive games and sensory-friendly visuals, while Autism Food Coach 2 provided structured caregiver tools, visual meal plans, and progress tracking. Both apps aligned with multiple BCW intervention functions, including education, training, and enablement. MARS scores of 3.7 and 3.9 indicated acceptable to good usability and content quality. Conclusion: There is a critical shortage of high-quality, evidence-based mHealth applications addressing feeding difficulties in children with ASD. Future development should prioritize clinical validation and the integration of comprehensive, caregiver-centered support features to address this gap.

Query-Centric Diffusion Policy for Generalizable Robotic Assembly

Authors:Ziyi Xu, Haohong Lin, Shiqi Liu, Ding Zhao
Date:2025-09-23 06:10:46

The robotic assembly task poses a key challenge in building generalist robots due to the intrinsic complexity of part interactions and the sensitivity to noise perturbations in contact-rich settings. The assembly agent is typically designed in a hierarchical manner: high-level multi-part reasoning and low-level precise control. However, implementing such a hierarchical policy is challenging in practice due to the mismatch between high-level skill queries and low-level execution. To address this, we propose the Query-centric Diffusion Policy (QDP), a hierarchical framework that bridges high-level planning and low-level control by utilizing queries comprising objects, contact points, and skill information. QDP introduces a query-centric mechanism that identifies task-relevant components and uses them to guide low-level policies, leveraging point cloud observations to improve the policy's robustness. We conduct comprehensive experiments on the FurnitureBench in both simulation and real-world settings, demonstrating improved performance in skill precision and long-horizon success rate. In the challenging insertion and screwing tasks, QDP improves the skill-wise success rate by over 50% compared to baselines without structured queries.

Distributionally Robust Safe Motion Planning with Contextual Information

Authors:Kaizer Rahaman, Simran Kumari, Ashish R. Hota
Date:2025-09-23 05:34:06

We present a distributionally robust approach for collision avoidance by incorporating contextual information. Specifically, we embed the conditional distribution of the obstacle's future trajectory, conditioned on the motion of the ego agent, in a reproducing kernel Hilbert space (RKHS) via the conditional kernel mean embedding operator. Then, we define an ambiguity set containing all distributions whose embedding in the RKHS is within a certain distance from the empirical estimate of the conditional mean embedding learnt from past data. Consequently, a distributionally robust collision avoidance constraint is formulated and included in the receding-horizon motion planning formulation of the ego agent. Simulation results show that, in several challenging scenarios, the proposed approach is more successful at avoiding collisions than approaches that do not include contextual information and/or distributional robustness in their formulation.
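A small sketch of the underlying estimate, under assumed kernels and toy data: the conditional mean embedding is computed via kernel ridge regression, and the squared RKHS (MMD) distance to a candidate sample set is the quantity the ambiguity set bounds.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between row-stacked points."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Past data: ego motions X and corresponding obstacle futures Y (toy).
X = rng.normal(size=(50, 2))
Y = X @ np.array([[1.0, 0.2], [-0.3, 1.0]]) + 0.1 * rng.normal(size=(50, 2))

# Embedding weights for a planned ego motion x_query, via kernel ridge
# regression: beta = (K_xx + lam*I)^{-1} k_x(x_query).
lam = 1e-2
x_query = np.array([[0.5, -0.2]])
K_xx = rbf(X, X)
beta = np.linalg.solve(K_xx + lam * np.eye(len(X)), rbf(X, x_query)).ravel()

# Squared RKHS (MMD) distance between the embedded estimate and the
# empirical embedding of candidate future samples Z; the ambiguity set
# contains all distributions for which this stays below a chosen radius.
Z = Y[:10] + 0.05 * rng.normal(size=(10, 2))
K_yy, K_yz, K_zz = rbf(Y, Y), rbf(Y, Z), rbf(Z, Z)
mmd2 = beta @ K_yy @ beta - 2 * beta @ K_yz.mean(axis=1) + K_zz.mean()
print("squared RKHS distance:", mmd2)
```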

Number Adaptive Formation Flight Planning via Affine Deformable Guidance in Narrow Environments

Authors:Yuan Zhou, Jialiang Hou, Guangtong Xu, Fei Gao
Date:2025-09-23 04:39:21

Maintaining a formation with a varying number of drones in narrow environments hinders the convergence of planning to the desired configurations. To address this challenge, this paper proposes a formation planning method guided by Deformable Virtual Structures (DVS) with continuous spatiotemporal transformation. Firstly, to satisfy swarm safety distances and preserve shape-filling integrity for irregular formation geometries, we employ the Lloyd algorithm for uniform PArtitioning and the Hungarian algorithm for ASsignment (PAAS) in the DVS. Subsequently, a spatiotemporal trajectory involving the DVS is planned using primitive-based path search and nonlinear trajectory optimization. The DVS trajectory achieves adaptive transitions with respect to a varying number of drones while ensuring adaptability to narrow environments through affine transformation. Finally, each agent conducts distributed trajectory planning guided by desired spatiotemporal positions within the DVS, while incorporating collision avoidance and dynamic feasibility requirements. In simulation, our method allows up to 15% of the swarm to join or leave in cluttered environments while rapidly restoring the desired formation shape. Compared to cutting-edge formation planning methods, we demonstrate rapid formation recovery capacity and environmental adaptability. Real-world experiments validate the effectiveness and resilience of our formation planning method.
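The PAAS step admits a compact sketch: Lloyd iterations spread formation slots uniformly over a region, then a Hungarian assignment matches drones to slots. The square region and drone count below are toy stand-ins for the paper's DVS geometry.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
region = rng.uniform(0, 1, size=(2000, 2))   # dense samples of the shape
slots = rng.uniform(0, 1, size=(12, 2))      # one slot per drone

for _ in range(20):                          # Lloyd: move each slot to the
    owner = np.argmin(((region[:, None] - slots[None]) ** 2).sum(-1), axis=1)
    for k in range(len(slots)):              # centroid of its Voronoi cell
        if np.any(owner == k):
            slots[k] = region[owner == k].mean(axis=0)

drones = rng.uniform(0, 1, size=(12, 2))     # current drone positions
costs = ((drones[:, None] - slots[None]) ** 2).sum(-1)
row, col = linear_sum_assignment(costs)      # Hungarian assignment
print(list(zip(row, col)))                   # drone i -> slot col[i]
```

Because both steps are cheap, re-running them when a drone joins or leaves gives the number-adaptive behavior the abstract describes.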

PIE: Perception and Interaction Enhanced End-to-End Motion Planning for Autonomous Driving

Authors:Chengran Yuan, Zijian Lu, Zhanqi Zhang, Yimin Zhao, Zefan Huang, Shuo Sun, Jiawei Sun, Jiahui Li, Christina Dao Wen Lee, Dongen Li, Marcelo H. Ang Jr
Date:2025-09-23 03:57:33

End-to-end motion planning is promising for simplifying complex autonomous driving pipelines. However, challenges such as scene understanding and effective prediction for decision-making continue to present substantial obstacles to its large-scale deployment. In this paper, we present PIE, a pioneering framework that integrates advanced perception, reasoning, and intention modeling to dynamically capture interactions between the ego vehicle and surrounding agents. It incorporates a bidirectional Mamba fusion that addresses data compression losses in multimodal fusion of camera and LiDAR inputs, alongside a novel reasoning-enhanced decoder integrating Mamba and Mixture-of-Experts to facilitate scene-compliant anchor selection and optimize adaptive trajectory inference. PIE adopts an action-motion interaction module to effectively utilize state predictions of surrounding agents to refine ego planning. The proposed framework is thoroughly validated on the NAVSIM benchmark. PIE, without using any ensemble and data augmentation techniques, achieves an 88.9 PDM score and 85.6 EPDM score, surpassing the performance of prior state-of-the-art methods. Comprehensive quantitative and qualitative analyses demonstrate that PIE is capable of reliably generating feasible and high-quality ego trajectories.

VLN-Zero: Rapid Exploration and Cache-Enabled Neurosymbolic Vision-Language Planning for Zero-Shot Transfer in Robot Navigation

Authors:Neel P. Bhatt, Yunhao Yang, Rohan Siva, Pranay Samineni, Daniel Milan, Zhangyang Wang, Ufuk Topcu
Date:2025-09-23 03:23:03

Rapid adaptation in unseen environments is essential for scalable real-world autonomy, yet existing approaches rely on exhaustive exploration or rigid navigation policies that fail to generalize. We present VLN-Zero, a two-phase vision-language navigation framework that leverages vision-language models to efficiently construct symbolic scene graphs and enable zero-shot neurosymbolic navigation. In the exploration phase, structured prompts guide VLM-based search toward informative and diverse trajectories, yielding compact scene graph representations. In the deployment phase, a neurosymbolic planner reasons over the scene graph and environmental observations to generate executable plans, while a cache-enabled execution module accelerates adaptation by reusing previously computed task-location trajectories. By combining rapid exploration, symbolic reasoning, and cache-enabled execution, the proposed framework overcomes the computational inefficiency and poor generalization of prior vision-language navigation methods, enabling robust and scalable decision-making in unseen environments. VLN-Zero achieves 2x higher success rate compared to state-of-the-art zero-shot models, outperforms most fine-tuned baselines, and reaches goal locations in half the time with 55% fewer VLM calls on average compared to state-of-the-art models across diverse environments. Codebase, datasets, and videos for VLN-Zero are available at: https://vln-zero.github.io/.
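The cache-enabled execution module can be pictured as a simple memo over task-location pairs; `plan_with_vlm` and the key scheme below are hypothetical stand-ins, not VLN-Zero's actual interface.

```python
def plan_with_vlm(task: str, start: str, goal: str) -> list[str]:
    print("VLM call")                  # the expensive neurosymbolic path
    return [start, "hallway", goal]    # placeholder plan

cache: dict[tuple[str, str, str], list[str]] = {}

def get_plan(task: str, start: str, goal: str) -> list[str]:
    key = (task, start, goal)
    if key not in cache:
        cache[key] = plan_with_vlm(task, start, goal)  # miss: plan and store
    return cache[key]                                  # hit: reuse, no VLM call

get_plan("fetch mug", "kitchen", "office")   # prints "VLM call"
get_plan("fetch mug", "kitchen", "office")   # served from cache
```

Reuse of this kind is one plausible source of the reported reduction in VLM calls and wall-clock time.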

Spatial Envelope MPC: High Performance Driving without a Reference

Authors:Siyuan Yu, Congkai Shen, Yufei Xi, James Dallas, Michael Thompson, John Subosits, Hiroshi Yasuda, Tulga Ersal
Date:2025-09-23 01:16:01

This paper presents a novel envelope-based model predictive control (MPC) framework designed to enable autonomous vehicles to handle high-performance driving across a wide range of scenarios without a predefined reference. In high-performance autonomous driving, safe operation at the vehicle's dynamic limits requires a real-time planning and control framework capable of accounting for key vehicle dynamics and environmental constraints when following a predefined reference trajectory is suboptimal or even infeasible. State-of-the-art planning and control frameworks, however, are predominantly reference-based, which limits their performance in such situations. To address this gap, this work first introduces a computationally efficient vehicle dynamics model tailored for optimization-based control and a continuously differentiable mathematical formulation that accurately captures the entire drivable envelope. This novel model and formulation allow for the direct integration of dynamic feasibility and safety constraints into a unified planning and control framework, thereby removing the necessity for predefined references. The challenge of envelope planning, which refers to maximally approximating the safe drivable area, is tackled by combining reinforcement learning with optimization techniques. The framework is validated through both simulations and real-world experiments, demonstrating its high performance across a variety of tasks, including racing, emergency collision avoidance, and off-road navigation. These results highlight the framework's scalability and broad applicability across a diverse set of scenarios.

Integrated Prediction and Distributionally Robust Optimization for Air Traffic Management

Authors:Haochen Wu, Xinting Zhu, Lishuai Li, Max Z. Li
Date:2025-09-23 00:53:18

Strategic Traffic Management Initiatives (TMIs) such as Ground Delay Programs (GDPs) play a crucial role in mitigating operational costs associated with air traffic demand-capacity imbalances. However, GDPs can only be planned (e.g., duration, delay assignments) with confidence if the future capacities at constrained resources (i.e., airports) are predictable. In reality, such future capacities are uncertain, and predictive models may provide predictions that are vulnerable to errors and distribution shifts. Motivated by the goal of planning optimal GDPs that are distributionally robust against airport capacity prediction errors, we study a fully integrated learning-driven optimization framework. We design a deep learning-based prediction model capable of forecasting arrival and departure capacity distributions across a network of airports. We incorporate the predictions into a distributionally robust formulation of the multi-airport ground holding program (DR-MAGHP). Our results demonstrate that DR-MAGHP can achieve up to a 15.6% improvement over the stochastic programming formulation (SP-MAGHP) under airport capacity distribution shifts. We conclude by outlining future research directions aimed at enhancing both the learning and optimization components of the framework.

An Analysis of Kalman Filter based Object Tracking Methods for Fast-Moving Tiny Objects

Authors:Prithvi Raj Singh, Raju Gottumukkala, Anthony Maida
Date:2025-09-22 22:12:48

Unpredictable movement patterns and a small visual signature make precise tracking of fast-moving tiny objects like a racquetball one of the more challenging problems in computer vision. This challenge is particularly relevant for sports robotics applications, where lightweight and accurate tracking systems can improve robot perception and planning capabilities. While Kalman filter-based tracking methods have shown success in general object tracking scenarios, their performance degrades substantially when dealing with rapidly moving objects that exhibit irregular bouncing behavior. In this study, we evaluate the performance of five state-of-the-art Kalman filter-based tracking methods (OCSORT, DeepOCSORT, ByteTrack, BoTSORT, and StrongSORT) using a custom dataset containing 10,000 annotated racquetball frames captured at 720p-1280p resolution. We focus our analysis on two critical performance factors: inference speed and update frequency per image, examining how these parameters affect tracking accuracy and reliability for fast-moving tiny objects. Our experimental evaluation across four distinct scenarios reveals that DeepOCSORT achieves the lowest tracking error with an average ADE of 31.15 pixels compared to ByteTrack's 114.3 pixels, while ByteTrack demonstrates the fastest processing at 26.6ms average inference time versus DeepOCSORT's 26.8ms. However, our results show that all Kalman filter-based trackers exhibit significant tracking drift, with spatial errors ranging from 3-11cm (ADE values: 31-114 pixels), indicating fundamental limitations in handling the unpredictable motion patterns of fast-moving tiny objects like racquetballs. Our analysis demonstrates that current tracking approaches require substantial improvements, with error rates 3-4x higher than on standard object tracking benchmarks, highlighting the need for specialized methodologies for fast-moving tiny object tracking applications.
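For context, all five trackers build on a constant-velocity Kalman filter, whose linear motion assumption is exactly what irregular bounces violate; a minimal predict/update sketch with invented noise levels follows.

```python
import numpy as np

dt = 1 / 30                                   # frame interval (s)
F = np.eye(4); F[0, 2] = F[1, 3] = dt         # constant-velocity dynamics
H = np.eye(2, 4)                              # we observe position only
Q, R = 1e-2 * np.eye(4), 4.0 * np.eye(2)      # process / measurement noise

x, P = np.zeros(4), 10.0 * np.eye(4)          # state (x, y, vx, vy), covariance
for z in [np.array([100., 200.]), np.array([104., 193.])]:  # detections (px)
    x, P = F @ x, F @ P @ F.T + Q                      # predict
    S = H @ P @ H.T + R                                # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                     # Kalman gain
    x = x + K @ (z - H @ x)                            # update with detection
    P = (np.eye(4) - K @ H) @ P
print("state (x, y, vx, vy):", x)
```

At a bounce, the true velocity flips almost instantaneously while F keeps extrapolating the old velocity, which is a plausible mechanism for the drift reported above.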

5GC-Bench: A Framework for Stress-Testing and Benchmarking 5G Core VNFs

Authors:Ioannis Panitsas, Tolga O. Atalay, Dragoslav Stojadinovic, Angelos Stavrou, Leandros Tassiulas
Date:2025-09-22 22:00:05

The disaggregated, cloud-native design of the 5G Core (5GC) enables flexibility and scalability but introduces significant challenges. Control-plane procedures involve complex interactions across multiple Virtual Network Functions (VNFs), while the user plane must sustain diverse and resource-intensive traffic. Existing tools often benchmark these dimensions in isolation, rely on synthetic workloads, or lack visibility into fine-grained resource usage. This paper presents 5GC-Bench, a modular framework for stress-testing the 5GC under realistic workloads. 5GC-Bench jointly emulates signaling and service traffic, supporting both VNF profiling and end-to-end service-chain analysis. By characterizing bottlenecks and resource demands, it provides actionable insights for capacity planning and performance optimization. We integrated 5GC-Bench with the OpenAirInterface (OAI) 5GC and deployed it on a real 5G testbed, demonstrating its ability to uncover resource constraints and expose cross-VNF dependencies under scenarios that mirror operational 5G deployments. To foster reproducibility and further research, we publicly release all artifacts.

CPT-4DMR: Continuous sPatial-Temporal Representation for 4D-MRI Reconstruction

Authors:Xinyang Wu, Muheng Li, Xia Li, Orso Pusterla, Sairos Safai, Philippe C. Cattin, Antony J. Lomax, Ye Zhang
Date:2025-09-22 21:18:26

Four-dimensional MRI (4D-MRI) is a promising technique for capturing respiratory-induced motion in radiation therapy planning and delivery. Conventional 4D reconstruction methods, which typically rely on phase binning or separate template scans, struggle to capture temporal variability, complicate workflows, and impose heavy computational loads. We introduce a neural representation framework that considers respiratory motion as a smooth, continuous deformation steered by a 1D surrogate signal, completely replacing the conventional discrete sorting approach. The new method fuses motion modeling with image reconstruction through two synergistic networks: the Spatial Anatomy Network (SAN) encodes a continuous 3D anatomical representation, while a Temporal Motion Network (TMN), guided by Transformer-derived respiratory signals, produces temporally consistent deformation fields. Evaluation using a free-breathing dataset of 19 volunteers demonstrates that our template- and phase-free method accurately captures both regular and irregular respiratory patterns, while preserving vessel and bronchial continuity with high anatomical fidelity. The proposed method significantly improves efficiency, reducing the total processing time from approximately five hours required by conventional discrete sorting methods to just 15 minutes of training. Furthermore, it enables inference of each 3D volume in under one second. The framework accurately reconstructs 3D images at any respiratory state, achieves superior performance compared to conventional methods, and demonstrates strong potential for application in 4D radiation therapy planning and real-time adaptive treatment.

Assistive Decision-Making for Right of Way Navigation at Uncontrolled Intersections

Authors:Navya Tiwari, Joseph Vazhaeparampil, Victoria Preston
Date:2025-09-22 20:46:23

Uncontrolled intersections account for a significant fraction of roadway crashes due to ambiguous right-of-way rules, occlusions, and unpredictable driver behavior. While autonomous vehicle research has explored uncertainty-aware decision making, few systems exist to retrofit human-operated vehicles with assistive navigation support. We present a driver-assist framework for right-of-way reasoning at uncontrolled intersections, formulated as a Partially Observable Markov Decision Process (POMDP). Using a custom simulation testbed with stochastic traffic agents, pedestrians, occlusions, and adversarial scenarios, we evaluate four decision-making approaches: a deterministic finite state machine (FSM) and three probabilistic planners (QMDP, POMCP, and DESPOT). Results show that the probabilistic planners outperform the rule-based baseline, achieving up to 97.5 percent collision-free navigation under partial observability, with POMCP prioritizing safety and DESPOT balancing efficiency and runtime feasibility. Our findings highlight the importance of uncertainty-aware planning for driver assistance and motivate future integration of sensor fusion and environment perception modules for real-time deployment in realistic traffic environments.
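QMDP, the simplest of the three planners, scores each action by belief-weighted fully observable Q-values; here is a toy two-state yield/go sketch with invented numbers.

```python
import numpy as np

# States: 0 = cross-traffic yields, 1 = cross-traffic goes.
belief = np.array([0.7, 0.3])

# Q[s, a] from solving the underlying MDP; actions: 0 = wait, 1 = proceed.
Q = np.array([[ 1.0,   5.0],    # yielding: proceeding is efficient
              [ 0.5, -50.0]])   # going: proceeding risks collision

action = int(np.argmax(belief @ Q))  # expected Q-value under the belief
print("wait" if action == 0 else "proceed")  # -> "wait"
```

Because QMDP assumes full observability after one step, it never acts to gather information (e.g., creeping forward to resolve an occlusion), which is where tree-search planners like POMCP and DESPOT can do better.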

Neural Network-Driven Direct CBCT-Based Dose Calculation for Head-and-Neck Proton Treatment Planning

Authors:Muheng Li, Evangelia Choulilitsa, Lisa Fankhauser, Francesca Albertini, Antony Lomax, Ye Zhang
Date:2025-09-22 20:01:32

Accurate dose calculation on cone beam computed tomography (CBCT) images is essential for modern proton treatment planning workflows, particularly when accounting for inter-fractional anatomical changes in adaptive treatment scenarios. Traditional CBCT-based dose calculation suffers from image quality limitations, requiring complex correction workflows. This study develops and validates a deep learning approach for direct proton dose calculation from CBCT images using extended Long Short-Term Memory (xLSTM) neural networks. A retrospective dataset of 40 head-and-neck cancer patients with paired planning CT and treatment CBCT images was used to train an xLSTM-based neural network (CBCT-NN). The architecture incorporates energy token encoding and beam's-eye-view sequence modelling to capture spatial dependencies in proton dose deposition patterns. Training utilized 82,500 paired beam configurations with Monte Carlo-generated ground truth doses. Validation was performed on 5 independent patients using gamma analysis, mean percentage dose error assessment, and dose-volume histogram comparison. The CBCT-NN achieved gamma pass rates of 95.1 ± 2.7% using 2mm/2% criteria. Mean percentage dose errors were 2.6 ± 1.4% in high-dose regions (>90% of max dose) and 5.9 ± 1.9% globally. Dose-volume histogram analysis showed excellent preservation of target coverage metrics (Clinical Target Volume V95% difference: -0.6 ± 1.1%) and organ-at-risk constraints (parotid mean dose difference: -0.5 ± 1.5%). Computation time is under 3 minutes without sacrificing Monte Carlo-level accuracy. This study demonstrates the proof-of-principle of direct CBCT-based proton dose calculation using xLSTM neural networks. The approach eliminates traditional correction workflows while achieving comparable accuracy and computational efficiency suitable for adaptive protocols.

TinyBEV: Cross Modal Knowledge Distillation for Efficient Multi Task Bird's Eye View Perception and Planning

Authors:Reeshad Khan, John Gauch
Date:2025-09-22 19:54:02

We present TinyBEV, a unified, camera-only Bird's Eye View (BEV) framework that distills the full-stack capabilities of a large planning-oriented teacher (UniAD [19]) into a compact, real-time student model. Unlike prior efficient camera-only baselines such as VAD [23] and VADv2 [7], TinyBEV supports the complete autonomy stack (3D detection, HD-map segmentation, motion forecasting, occupancy prediction, and goal-directed planning) within a streamlined 28M-parameter backbone, achieving a 78% reduction in parameters over UniAD [19]. Our model-agnostic, multi-stage distillation strategy combines feature-level, output-level, and adaptive region-aware supervision to effectively transfer high-capacity multi-modal knowledge to a lightweight BEV representation. On nuScenes [4], TinyBEV achieves 39.0 mAP for detection, 1.08 minADE for motion forecasting, and a 0.32 collision rate, while running 5x faster (11 FPS) and requiring only camera input. These results demonstrate that full-stack driving intelligence can be retained in resource-constrained settings, bridging the gap between large-scale, multi-modal perception-planning models and deployment-ready real-time autonomy.
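The named distillation terms can be sketched generically: an MSE feature-matching loss plus a KL output-matching loss, with a teacher-confidence weight standing in for adaptive region-aware supervision. Shapes and the weighting rule are assumptions, not TinyBEV's exact losses.

```python
import torch
import torch.nn.functional as F

student_feat = torch.randn(2, 64, 50, 50)    # student BEV features
teacher_feat = torch.randn(2, 64, 50, 50)    # teacher BEV features (frozen)
student_logits = torch.randn(2, 10, 50, 50)  # student per-cell class logits
teacher_logits = torch.randn(2, 10, 50, 50)  # teacher per-cell class logits

# Region-aware weight: emphasize cells where the teacher is confident.
weight = teacher_logits.softmax(dim=1).amax(dim=1, keepdim=True)  # (2,1,50,50)

# Feature-level supervision: weighted MSE on BEV features.
feat_loss = (weight * (student_feat - teacher_feat) ** 2).mean()

# Output-level supervision: weighted KL between class distributions.
kd = F.kl_div(student_logits.log_softmax(dim=1),
              teacher_logits.softmax(dim=1),
              reduction="none").sum(dim=1, keepdim=True)
loss = feat_loss + (weight * kd).mean()
print(float(loss))
```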

Advancing ARA: the Next-Generation (ARA-Next) DAQ System

Authors:Pawan Giri
Date:2025-09-22 19:48:57

The Askaryan Radio Array (ARA) has been operating at the South Pole for over a decade, searching for ultra-high energy astrophysical and cosmogenic neutrinos using the Askaryan effect. ARA has consistently served as a testbed for innovative trigger designs and advancing electronic upgrades, with ongoing data acquisition (DAQ) improvements over the past 2-3 years and a long-term plan to transition to Radio Frequency System on Chip (RFSoC) technology. This upgrade enables real-time data processing and sophisticated triggers, enhancing efficiency by identifying double pulses from in-ice neutrino interactions, using templates for cosmic rays, searching for real-time coincidences with the IceCube detector observations, and filtering anthropogenic noise through directional analysis. In 2024, two of the five ARA stations received DAQ upgrades, improving the existing electronics, with RFSoC-based DAQ foreseen in the coming years. In this proceedings contribution, recent ARA activities are presented, with emphasis on the planned ARA-Next trigger strategies involving RFSoC technology and the 2024-2025 season upgrades of the existing ATRI-based DAQ system to its revised version.

The Landform Contextual Mesh: Automatically Fusing Surface and Orbital Terrain for Mars 2020

Authors:Marsette Vona
Date:2025-09-22 18:51:19

The Landform contextual mesh fuses 2D and 3D data from up to thousands of Mars 2020 rover images, along with orbital elevation and color maps from Mars Reconnaissance Orbiter, into an interactive 3D terrain visualization. Contextual meshes are built automatically for each rover location during mission ground data system processing, and are made available to mission scientists for tactical and strategic planning in the Advanced Science Targeting Tool for Robotic Operations (ASTTRO). A subset of them are also deployed to the "Explore with Perseverance" public access website.

Haptic Communication in Human-Human and Human-Robot Co-Manipulation

Authors:Katherine H. Allen, Chris Rogers, Elaine S. Short
Date:2025-09-22 18:49:56

When a human dyad jointly manipulates an object, they must communicate about their intended motion plans. Some of that collaboration is achieved through the motion of the manipulated object itself, which we call "haptic communication." In this work, we captured the motion of human-human dyads moving an object together with one participant leading a motion plan about which the follower is uninformed. We then captured the same human participants manipulating the same object with a robot collaborator. By tracking the motion of the shared object using a low-cost IMU, we can directly compare human-human shared manipulation to the motion of those same participants interacting with the robot. Intra-study and post-study questionnaires provided participant feedback on the collaborations, indicating that the human-human collaborations are significantly more fluent, and analysis of the IMU data indicates that it captures objective differences in the motion profiles of the conditions. The differences in objective and subjective measures of accuracy and fluency between the human-human and human-robot trials motivate future research into improving robot assistants for physical tasks by enabling them to send and receive anthropomorphic haptic signals.