planning - 2025-04-15

Decoupled Diffusion Sparks Adaptive Scene Generation

Authors:Yunsong Zhou, Naisheng Ye, William Ljungbergh, Tianyu Li, Jiazhi Yang, Zetong Yang, Hongzi Zhu, Christoffer Petersson, Hongyang Li

Date:2025-04-14 17:59:57

Controllable scene generation could reduce the cost of diverse data collection substantially for autonomous driving. Prior works formulate the traffic layout generation as predictive progress, either by denoising entire sequences at once or by iteratively predicting the next frame. However, full sequence denoising hinders online reaction, while the latter's short-sighted next-frame prediction lacks precise goal-state guidance. Further, the learned model struggles to generate complex or challenging scenarios due to a large number of safe and ordinal driving behaviors from open datasets. To overcome these, we introduce Nexus, a decoupled scene generation framework that improves reactivity and goal conditioning by simulating both ordinal and challenging scenarios from fine-grained tokens with independent noise states. At the core of the decoupled pipeline is the integration of a partial noise-masking training strategy and a noise-aware schedule that ensures timely environmental updates throughout the denoising process. To complement challenging scenario generation, we collect a dataset consisting of complex corner cases. It covers 540 hours of simulated data, including high-risk interactions such as cut-in, sudden braking, and collision. Nexus achieves superior generation realism while preserving reactivity and goal orientation, with a 40% reduction in displacement error. We further demonstrate that Nexus improves closed-loop planning by 20% through data augmentation and showcase its capability in safety-critical data generation.

Layered Multirate Control of Constrained Linear Systems

Authors:Charis Stamouli, Anastasios Tsiamis, Manfred Morari, George J. Pappas

Date:2025-04-14 17:48:34

Layered control architectures have been a standard paradigm for efficiently managing complex constrained systems. A typical architecture consists of: i) a higher layer, where a low-frequency planner controls a simple model of the system, and ii) a lower layer, where a high-frequency tracking controller guides a detailed model of the system toward the output of the higher-layer model. A fundamental problem in this layered architecture is the design of planners and tracking controllers that guarantee both higher- and lower-layer system constraints are satisfied. Toward addressing this problem, we introduce a principled approach for layered multirate control of linear systems subject to output and input constraints. Inspired by discrete-time simulation functions, we propose a streamlined control design that guarantees the lower-layer system tracks the output of the higher-layer system with computable precision. Using this design, we derive conditions and present a method for propagating the constraints of the lower-layer system to the higher-layer system. The propagated constraints are integrated into the design of an arbitrary planner that can handle higher-layer system constraints. Our framework ensures that the output constraints of the lower-layer system are satisfied at all high-level time steps, while respecting its input constraints at all low-level time steps. We apply our approach in a scenario of motion planning, highlighting its critical role in ensuring collision avoidance.

HybridCollab: Unifying In-Person and Remote Collaboration for Cardiovascular Surgical Planning in Mobile Augmented Reality

Authors:Pratham Darrpan Mehta, Rahul Ozhur Narayanan, Vidhi Kulkarni, Timothy Slesnick, Fawwaz Shaw, Duen Horng Chau

Date:2025-04-14 17:31:35

Surgical planning for congenital heart disease traditionally relies on collaborative group examinations of a patient's 3D-printed heart model, a process that lacks flexibility and accessibility. While mobile augmented reality (AR) offers a promising alternative with its portability and familiar interaction gestures, existing solutions limit collaboration to users in the same physical space. We developed HybridCollab, the first iOS AR application that introduces a novel paradigm that enables both in-person and remote medical teams to interact with a shared AR heart model in a single surgical planning session. For example, a team of two doctors in one hospital room can collaborate in real time with another team in a different hospital. Our approach is the first to leverage Apple's GameKit service for surgical planning, ensuring an identical collaborative experience for all participants, regardless of location. Additionally, co-located users can interact with the same anchored heart model in their shared physical space. By bridging the gap between remote and in-person collaboration across medical teams, HybridCollab has the potential for significant real-world impact, streamlining communication and enhancing the effectiveness of surgical planning. Watch the demo: https://youtu.be/hElqJYDuvLM.

An energy optimization method based on mixed-integer model and variational quantum computing algorithm for faster IMPT

Authors:Ya-Nan Zhu, Nimita Shinde, Bowen Lin, Hao Gao

Date:2025-04-14 15:24:23

Intensity-modulated proton therapy (IMPT) offers superior dose conformity with reduced exposure to surrounding healthy tissues compared to conventional photon therapy. Improving IMPT delivery efficiency reduces motion-related uncertainties, enhances plan robustness, and benefits breath-hold techniques by shortening treatment time. Among various factors, energy switching time plays a critical role, making energy layer optimization (ELO) essential. This work develops an energy layer optimization method based on a mixed-integer model and a variational quantum computing algorithm to enhance the efficiency of IMPT. The energy layer optimization problem is modeled as a mixed-integer program, where continuous variables optimize the dose distribution and binary variables indicate energy layer selection. To solve it, iterative convex relaxation decouples the dose-volume constraints, followed by the alternating direction method of multipliers (ADMM) to separate mixed-variable optimization and the minimum monitor unit (MMU) constraint. The resulting beam intensity subproblem, subject to MMU, either admits a closed-form solution or is efficiently solvable via conjugate gradient. The binary subproblem is cast as a quadratic unconstrained binary optimization (QUBO) problem, solvable using variational quantum computing algorithms. With nearly the same plan quality, the proposed method noticeably reduces the number of energies used. For example, compared to conventional IMPT, QC can reduce the number of energy layers from 61 to 35 in the HN case, from 56 to 35 in the lung case, and from 59 to 32 in the abdomen case. The reduced number of energies also results in shorter delivery time, e.g., the delivery time is reduced from 100.6, 232.0, 185.3 seconds to 90.7, 215.4, 154.0 seconds, respectively.
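
As a rough illustration of the QUBO form mentioned above, the following sketch minimizes x^T Q x over binary vectors x by plain classical enumeration. It does not use a variational quantum algorithm, and the function name, the matrix Q, and its values are hypothetical choices made for this example, not taken from the paper.

```python
from itertools import product


def solve_qubo_brute_force(Q):
    """Return (best_x, best_value) minimizing x^T Q x over binary vectors x.

    Exhaustive search is only feasible for very small problems; it is used
    here purely to show what the QUBO objective looks like.
    """
    n = len(Q)
    best_x, best_val = None, float("inf")
    for x in product((0, 1), repeat=n):
        val = sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val


# Hypothetical 3-variable instance: negative diagonal entries reward selecting
# a variable, positive off-diagonal entries penalize selecting neighbours.
Q = [
    [-1.0, 2.0, 0.0],
    [0.0, -1.0, 2.0],
    [0.0, 0.0, -1.0],
]
best_x, best_val = solve_qubo_brute_force(Q)
print(best_x, best_val)
```

In this toy instance each binary variable could stand for one candidate energy layer, but how the paper builds its actual QUBO matrix is not specified in the abstract.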

Essay: A path for the construction of a Muon Collider

Authors:Diktys Stratakis

Date:2025-04-14 15:06:02

Muons are elementary particles and provide cleaner collision events that can explore higher energies compared to composite particles like protons. Muons are also far heavier than their electron cousins, meaning that they emit less synchrotron radiation that effectively limits the energies of circular electron-positron colliders. These characteristics open up the possibility for a Muon Collider to surpass the direct energy reach of the Large Hadron Collider while achieving unprecedented precision measurements of Standard Model processes. In this Essay, after briefly summarizing the progress achieved so far, I identify important missing R&D steps and envision a compelling plan to bring a Muon Collider to reality in the next two decades. A Muon Collider could allow for the exploration of physics that is not available with current technologies. For example, it may provide a way to study the Higgs boson directly or probe new particles, including those related to dark matter or other phenomena beyond the Standard Model.

A Quasi-Steady-State Black Box Simulation Approach for the Generation of g-g-g-v Diagrams

Authors:Frederik Werner, Simon Sagmeister, Mattia Piccinini, Johannes Betz

Date:2025-04-14 13:45:26

The classical g-g diagram, representing the achievable acceleration space for a vehicle, is commonly used as a constraint in trajectory planning and control due to its computational simplicity. To address non-planar road geometries, this concept can be extended to incorporate g-g constraints as a function of vehicle speed and vertical acceleration, commonly referred to as g-g-g-v diagrams. However, the estimation of g-g-g-v diagrams is an open problem. Existing simulation-based approaches struggle to isolate non-transient, open-loop stable states across all combinations of speed and acceleration, while optimization-based methods often require simplified vehicle equations and have potential convergence issues. In this paper, we present a novel, open-source, quasi-steady-state black box simulation approach that applies a virtual inertial force in the longitudinal direction. The method emulates the load conditions associated with a specified longitudinal acceleration while maintaining constant vehicle speed, enabling open-loop steering ramps in a purely QSS manner. Appropriate regulation of the ramp steer rate inherently mitigates transient vehicle dynamics when determining the maximum feasible lateral acceleration. Moreover, treating the vehicle model as a black box eliminates model mismatch issues, allowing the use of high-fidelity or proprietary vehicle dynamics models typically unsuited for optimization approaches. An open-source version of the proposed method is available at: https://github.com/TUM-AVS/GGGVDiagrams
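
To show how a g-g diagram can act as a planning constraint, here is a minimal sketch that approximates the achievable acceleration region by an ellipse. The elliptical shape, the function name, and the limit values are assumptions for illustration; real g-g diagrams are generally not exact ellipses, and the g-g-g-v diagrams described above additionally depend on speed and vertical acceleration.

```python
def within_gg_ellipse(ax, ay, ax_max, ay_max):
    """Check whether a combined acceleration lies inside an elliptical g-g limit.

    ax, ay         : longitudinal and lateral acceleration (m/s^2)
    ax_max, ay_max : assumed pure longitudinal / lateral limits (m/s^2)
    """
    return (ax / ax_max) ** 2 + (ay / ay_max) ** 2 <= 1.0


# Hypothetical limits: 8 m/s^2 longitudinal, 10 m/s^2 lateral.
moderate = within_gg_ellipse(3.0, 4.0, 8.0, 10.0)
aggressive = within_gg_ellipse(7.0, 8.0, 8.0, 10.0)
print(moderate, aggressive)
```

A trajectory planner could reject or slow down any candidate point for which this check fails.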

Breaking the Data Barrier -- Building GUI Agents Through Task Generalization

Authors:Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He

Date:2025-04-14 11:35:02

Graphical User Interface (GUI) agents offer cross-platform solutions for automating complex digital tasks, with significant potential to transform productivity workflows. However, their performance is often constrained by the scarcity of high-quality trajectory data. To address this limitation, we propose training Vision Language Models (VLMs) on data-rich, reasoning-intensive tasks during a dedicated mid-training stage, and then examine how incorporating these tasks facilitates generalization to GUI planning scenarios. Specifically, we explore a range of tasks with readily available instruction-tuning data, including GUI perception, multimodal reasoning, and textual reasoning. Through extensive experiments across 11 mid-training tasks, we demonstrate that: (1) Task generalization proves highly effective, yielding substantial improvements across most settings. For instance, multimodal mathematical reasoning enhances performance on AndroidWorld by an absolute 6.3%. Remarkably, text-only mathematical data significantly boosts GUI web agent performance, achieving a 5.6% improvement on WebArena and 5.4% improvement on AndroidWorld, underscoring notable cross-modal generalization from text-based to visual domains; (2) Contrary to prior assumptions, GUI perception data - previously considered closely aligned with GUI agent tasks and widely utilized for training - has a comparatively limited impact on final performance; (3) Building on these insights, we identify the most effective mid-training tasks and curate optimized mixture datasets, resulting in absolute performance gains of 8.0% on WebArena and 12.2% on AndroidWorld. Our work provides valuable insights into cross-domain knowledge transfer for GUI agents and offers a practical approach to addressing data scarcity challenges in this emerging field. The code, data and models will be available at https://github.com/hkust-nlp/GUIMid.

Impact of rainfall risk on rice production: realized volatility in mean model

Authors:Soham Ghosh, Sujay Mukhoti, Pritee Sharma

Date:2025-04-14 11:30:37

Rural economies are largely dependent upon agriculture, which is greatly determined by climatic conditions such as rainfall. This study aims to forecast agricultural production in Maharashtra, India, using annual data from 1962 to 2021. Since rainfall plays a major role in crop yield, we analyze the impact of rainfall on crop yield using four time series models: ARIMA, ARIMAX, GARCH-ARIMA, and GARCH-ARIMAX. We use rainfall as an external regressor to examine whether it contributes to the performance of the model. 1-step, 2-step, and 3-step ahead forecasts are obtained and the model performance is assessed using MAE and RMSE. The models predict more accurately when rainfall is used as a predictor than when they depend solely on historical production trends (improved outcomes are seen in the ARIMAX and GARCH-ARIMAX models). As such, these findings underscore the need for climate-aware forecasting techniques that provide useful information to policymakers and farmers to aid in agricultural planning.
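
For reference, the two error metrics named above can be computed as in the following sketch. The sample numbers are made up for illustration and are not from the study.

```python
import math


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


# Hypothetical production values and forecasts (arbitrary units).
actual = [10.0, 12.0, 14.0]
forecast = [11.0, 12.0, 12.0]
print(mae(actual, forecast), rmse(actual, forecast))
```

RMSE penalizes large errors more heavily than MAE, which is why the two are often reported together.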

Application of nanodiamond-polymer composite holographic gratings in a very cold neutron interferometer

Authors:Sonja Falmbigl, Roxana H. Ackermann, Elhoucine Hadden, Hanno Filter-Pieler, Tobias Jenke, Juergen Klepp, Christian Pruner, Yasuo Tomita, Martin Fally

Date:2025-04-14 11:26:03

In recent decades, photosensitive materials have been used for the development of optical devices not only for light, but also for cold and very cold neutrons. We show that holographically recorded gratings in nanodiamond-polymer composites (nDPC) form ideal diffraction elements for very cold neutrons. Their advantage of high diffraction efficiency, combined with low angular selectivity as a two-port beam splitter, meets the necessary conditions for application in a very cold neutron interferometer. We provide an overview of the latest achievements in the construction of such a triple Laue interferometer. A first operational test of the interferometer is planned immediately after this conference in May 2025.

A Computational Cognitive Model for Processing Repetitions of Hierarchical Relations

Authors:Zeng Ren, Xinyi Guan, Martin Rohrmeier

Date:2025-04-14 10:08:28

Patterns are fundamental to human cognition, enabling the recognition of structure and regularity across diverse domains. In this work, we focus on structural repeats, patterns that arise from the repetition of hierarchical relations within sequential data, and develop a candidate computational model of how humans detect and understand such structural repeats. Based on a weighted deduction system, our model infers the minimal generative process of a given sequence in the form of a Template program, a formalism that enriches the context-free grammar with repetition combinators. Such representation efficiently encodes the repetition of sub-computations in a recursive manner. As a proof of concept, we demonstrate the expressiveness of our model on short sequences from music and action planning. The proposed model offers broader insights into the mental representations and cognitive mechanisms underlying human pattern recognition.

Progressive Transfer Learning for Multi-Pass Fundus Image Restoration

Authors:Uyen Phan, Ozer Can Devecioglu, Serkan Kiranyaz, Moncef Gabbouj

Date:2025-04-14 09:28:10

Diabetic retinopathy (DR) is a leading cause of vision impairment, making its early diagnosis through fundus imaging critical for effective treatment planning. However, the presence of poor-quality fundus images caused by factors such as inadequate illumination, noise, blurring and other motion artifacts poses a significant challenge for accurate DR screening. In this study, we propose progressive transfer learning (PTL) for multi-pass restoration to iteratively enhance the quality of degraded fundus images, ensuring more reliable DR screening. Unlike previous methods that often focus on single-pass restoration, multi-pass restoration via PTL can achieve superior blind restoration performance that can even improve most of the good-quality fundus images in the dataset. Initially, a Cycle GAN model is trained to restore low-quality images, followed by PTL-induced restoration passes over the latest restored outputs to improve overall quality in each pass. The proposed method can learn blind restoration without requiring any paired data while surpassing its limitations by leveraging progressive learning and fine-tuning strategies to minimize distortions and preserve critical retinal features. To evaluate PTL's effectiveness on multi-pass restoration, we conducted experiments on DeepDRiD, a large-scale fundus imaging dataset specifically curated for diabetic retinopathy detection. Our results demonstrate state-of-the-art performance, showcasing PTL's potential as a superior approach to iterative image quality restoration.

NaviDiffusor: Cost-Guided Diffusion Model for Visual Navigation

Authors:Yiming Zeng, Hao Ren, Shuhang Wang, Junlong Huang, Hui Cheng

Date:2025-04-14 09:06:02

Visual navigation, a fundamental challenge in mobile robotics, demands versatile policies to handle diverse environments. Classical methods leverage geometric solutions to minimize specific costs, offering adaptability to new scenarios but are prone to system errors due to their multi-modular design and reliance on hand-crafted rules. Learning-based methods, while achieving high planning success rates, face difficulties in generalizing to unseen environments beyond the training data and often require extensive training. To address these limitations, we propose a hybrid approach that combines the strengths of learning-based methods and classical approaches for RGB-only visual navigation. Our method first trains a conditional diffusion model on diverse path-RGB observation pairs. During inference, it integrates the gradients of differentiable scene-specific and task-level costs, guiding the diffusion model to generate valid paths that meet the constraints. This approach alleviates the need for retraining, offering a plug-and-play solution. Extensive experiments in both indoor and outdoor settings, across simulated and real-world scenarios, demonstrate zero-shot transfer capability of our approach, achieving higher success rates and fewer collisions compared to baseline methods. Code will be released at https://github.com/SYSU-RoboticsLab/NaviD.

Fusing Bluetooth with Pedestrian Dead Reckoning: A Floor Plan-Assisted Positioning Approach

Authors:Wenxuan Pan, Yang Yang, Mingzhe Chen, Dong Wei, Caili Guo, Shiwen Mao

Date:2025-04-14 06:00:39

Floor plans can provide valuable prior information that helps enhance the accuracy of indoor positioning systems. However, existing research typically faces challenges in efficiently leveraging floor plan information and applying it to complex indoor layouts. To fully exploit information from floor plans for positioning, we propose a floor plan-assisted fusion positioning algorithm (FP-BP) using Bluetooth low energy (BLE) and pedestrian dead reckoning (PDR). In the considered system, a user holding a smartphone walks through a positioning area with BLE beacons installed on the ceiling, and can locate himself in real time. In particular, FP-BP consists of two phases. In the offline phase, FP-BP programmatically extracts map features from a stylized floor plan based on their binary masks, and constructs a mapping function to identify the corresponding map feature of any given position on the map. In the online phase, FP-BP continuously computes BLE positions and PDR results from BLE signals and smartphone sensors, where a novel grid-based maximum likelihood estimation (GML) algorithm is introduced to enhance BLE positioning. Then, a particle filter is used to fuse them and obtain an initial estimate. Finally, FP-BP performs post-position correction to obtain the final position based on its specific map feature. Experimental results show that FP-BP can achieve a real-time mean positioning accuracy of 1.19 m, representing an improvement of over 28% compared to existing floor plan-fused baseline algorithms.
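
The fusion step in the online phase can be pictured with a generic particle filter, sketched below under simplifying assumptions. This is not the FP-BP algorithm: it omits the grid-based maximum likelihood BLE estimator and the floor plan correction, and the function name, noise levels, and Gaussian measurement model are hypothetical.

```python
import math
import random


def particle_filter_step(particles, step_length, heading, ble_xy, rng,
                         motion_noise=0.1, ble_sigma=2.0):
    """One predict/update/resample cycle fusing a PDR step with a BLE fix.

    particles   : list of (x, y) position hypotheses in metres
    step_length : PDR step length in metres
    heading     : PDR heading in radians
    ble_xy      : (x, y) position estimated from BLE signals
    rng         : random.Random instance (passed in for reproducibility)
    """
    # Predict: move every particle by the noisy PDR step.
    moved = []
    for x, y in particles:
        length = step_length + rng.gauss(0.0, motion_noise)
        angle = heading + rng.gauss(0.0, motion_noise)
        moved.append((x + length * math.cos(angle), y + length * math.sin(angle)))

    # Update: weight particles by a Gaussian likelihood of the BLE position.
    bx, by = ble_xy
    weights = [
        math.exp(-((x - bx) ** 2 + (y - by) ** 2) / (2.0 * ble_sigma ** 2))
        for x, y in moved
    ]
    total = sum(weights)
    if total == 0.0:
        # All likelihoods underflowed; fall back to uniform weights.
        weights = [1.0 / len(moved)] * len(moved)
    else:
        weights = [w / total for w in weights]

    # Estimate: weighted mean of the particle positions.
    estimate = (
        sum(w * x for w, (x, _) in zip(weights, moved)),
        sum(w * y for w, (_, y) in zip(weights, moved)),
    )

    # Resample particles in proportion to their weights.
    resampled = rng.choices(moved, weights=weights, k=len(moved))
    return resampled, estimate


rng = random.Random(0)
particles = [(0.0, 0.0)] * 200
particles, estimate = particle_filter_step(
    particles, step_length=1.0, heading=0.0, ble_xy=(1.0, 0.0), rng=rng
)
print(estimate)
```

A map-aware variant would additionally discard or down-weight particles that cross walls, which is where floor plan information typically enters such filters.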

Can VLMs Assess Similarity Between Graph Visualizations?

Authors:Seokweon Jung, Hyeon Jeon, Jeongmin Rhee, Jinwook Seo

Date:2025-04-14 04:08:27

Graph visualizations have been studied for tasks such as clustering and temporal analysis, but how these visual similarities relate to established graph similarity measures remains unclear. In this paper, we explore the potential of Vision Language Models (VLMs) to approximate human-like perception of graph similarity. We generate graph datasets of various sizes and densities and compare VLM-derived visual similarity scores with feature-based measures. Our findings indicate VLMs can assess graph similarity in a manner similar to feature-based measures, even though differences among the measures exist. In future work, we plan to extend our research by conducting experiments on human visual graph perception.

Score Matching Diffusion Based Feedback Control and Planning of Nonlinear Systems

Authors:Karthik Elamvazhuthi, Darshan Gadginmath, Fabio Pasqualetti

Date:2025-04-14 03:04:48

We propose a novel control-theoretic framework that leverages principles from generative modeling -- specifically, Denoising Diffusion Probabilistic Models (DDPMs) -- to stabilize control-affine systems with nonholonomic constraints. Unlike traditional stochastic approaches, which rely on noise-driven dynamics in both forward and reverse processes, our method crucially eliminates the need for noise in the reverse phase, making it particularly relevant for control applications. We introduce two formulations: one where noise perturbs all state dimensions during the forward phase while the control system enforces time reversal deterministically, and another where noise is restricted to the control channels, embedding system constraints directly into the forward process. For controllable nonlinear drift-free systems, we prove that deterministic feedback laws can exactly reverse the forward process, ensuring that the system's probability density evolves correctly without requiring artificial diffusion in the reverse phase. Furthermore, for linear time-invariant systems, we establish a time-reversal result under the second formulation. By eliminating noise in the backward process, our approach provides a more practical alternative to machine learning-based denoising methods, which are unsuitable for control applications due to the presence of stochasticity. We validate our results through numerical simulations on benchmark systems, including a unicycle model in a domain with obstacles, a driftless five-dimensional system, and a four-dimensional linear system, demonstrating the potential for applying diffusion-inspired techniques in linear and nonlinear settings, as well as in settings with state-space constraints.
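
For readers unfamiliar with the unicycle benchmark, the sketch below integrates the standard kinematic unicycle model with a forward Euler step. It illustrates only the drift-free, nonholonomic dynamics (the state does not change when both inputs are zero, and the vehicle cannot move sideways); it does not implement the paper's diffusion-based feedback laws, and the function name and step size are assumptions.

```python
import math


def unicycle_step(state, v, omega, dt):
    """Advance the kinematic unicycle by one forward Euler step.

    state : (x, y, theta) position in metres and heading in radians
    v     : forward speed input (m/s)
    omega : turn rate input (rad/s)
    dt    : time step (s)
    """
    x, y, theta = state
    return (
        x + v * math.cos(theta) * dt,
        y + v * math.sin(theta) * dt,
        theta + omega * dt,
    )


straight = unicycle_step((0.0, 0.0, 0.0), v=1.0, omega=0.0, dt=0.1)
turn_in_place = unicycle_step((0.0, 0.0, 0.0), v=0.0, omega=1.0, dt=0.1)
print(straight, turn_in_place)
```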

SPOT: Spatio-Temporal Pattern Mining and Optimization for Load Consolidation in Freight Transportation Networks

Authors:Sikai Cheng, Amira Hijazi, Jeren Konak, Alan Erera, Pascal Van Hentenryck

Date:2025-04-13 18:14:38

Freight consolidation has significant potential to reduce transportation costs and mitigate congestion and pollution. An effective load consolidation plan relies on carefully chosen consolidation points to ensure alignment with existing transportation management processes, such as driver scheduling, personnel planning, and terminal operations. This complexity represents a significant challenge when searching for optimal consolidation strategies. Traditional optimization-based methods provide exact solutions, but their computational complexity makes them impractical for large-scale instances and they fail to leverage historical data. Machine learning-based approaches address these issues but often ignore operational constraints, leading to infeasible consolidation plans. This work proposes SPOT, an end-to-end approach that integrates the benefits of machine learning (ML) and optimization for load consolidation. The ML component plays a key role in the planning phase by identifying the consolidation points through spatio-temporal clustering and constrained frequent itemset mining, while the optimization selects the most cost-effective feasible consolidation routes for a given operational day. Extensive experiments conducted on industrial load data demonstrate that SPOT significantly reduces travel distance and transportation costs (by about 50% on large terminals) compared to the existing industry-standard load planning strategy and a neighborhood-based heuristic. Moreover, the ML component provides valuable tactical-level insights by identifying frequently recurring consolidation opportunities that guide proactive planning. In addition, SPOT is computationally efficient and can be easily scaled to accommodate large transportation networks.

Bridging Immutability with Flexibility: A Scheme for Secure and Efficient Smart Contract Upgrades

Authors:Tahrim Hossain, Sakib Hassan, Faisal Haque Bappy, Muhammad Nur Yanhaona, Tarannum Shaila Zaman, Tariqul Islam

Date:2025-04-13 16:59:28

The emergence of blockchain technology has revolutionized contract execution through the introduction of smart contracts. Ethereum, the leading blockchain platform, leverages smart contracts to power decentralized applications (DApps), enabling transparent and self-executing systems across various domains. While the immutability of smart contracts enhances security and trust, it also poses significant challenges for updates, defect resolution, and adaptation to changing requirements. Existing upgrade mechanisms are complex, resource-intensive, and costly in terms of gas consumption, often compromising security and limiting practical adoption. To address these challenges, we propose FlexiContracts+, a novel scheme that reimagines smart contracts by enabling secure, in-place upgrades on Ethereum while preserving historical data without relying on multiple contracts or extensive pre-deployment planning. FlexiContracts+ enhances security, simplifies development, reduces engineering overhead, and supports adaptable, expandable smart contracts. Comprehensive testing demonstrates that FlexiContracts+ achieves a practical balance between immutability and flexibility, advancing the capabilities of smart contract systems.

Unification of Consensus-Based Multi-Objective Optimization and Multi-Robot Path Planning

Authors:Michael P. Wozniak

Date:2025-04-13 13:56:54

Multi-agent systems seeking consensus may also have other objective functions to optimize, requiring the research of multi-objective optimization in consensus. Several recent publications have explored this domain using various methods such as weighted-sum optimization and penalization methods. This paper reviews the state of the art for consensus-based multi-objective optimization, poses a multi-agent lunar rover exploration problem seeking consensus and maximization of explored area, and achieves optimal edge weights and steering angles by applying SQP algorithms.
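
The role of edge weights in consensus can be seen in a basic weighted averaging iteration, sketched below. This shows only plain consensus on a fixed graph; it does not include the additional objectives or the SQP optimization mentioned in the abstract, and the graph, weights, and step size are hypothetical.

```python
def consensus_step(values, weights, step_size):
    """One synchronous update: each agent moves toward its weighted neighbours."""
    n = len(values)
    return [
        values[i]
        + step_size * sum(weights[i][j] * (values[j] - values[i]) for j in range(n))
        for i in range(n)
    ]


def run_consensus(values, weights, step_size, iterations):
    """Repeat the consensus update a fixed number of times."""
    for _ in range(iterations):
        values = consensus_step(values, weights, step_size)
    return values


# Hypothetical 3-agent path graph (0 - 1 - 2) with unit, symmetric edge weights.
W = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
final_values = run_consensus([0.0, 3.0, 9.0], W, step_size=0.3, iterations=100)
print(final_values)
```

With symmetric weights and a small enough step size, the agents converge to the average of their initial values; choosing the weights differently changes how fast they get there.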

Embodied Chain of Action Reasoning with Multi-Modal Foundation Model for Humanoid Loco-manipulation

Authors:Yu Hao, Geeta Chandra Raju Bethala, Niraj Pudasaini, Hao Huang, Shuaihang Yuan, Congcong Wen, Baoru Huang, Anh Nguyen, Yi Fang

Date:2025-04-13 11:37:32

Enabling humanoid robots to autonomously perform loco-manipulation tasks in complex, unstructured environments poses significant challenges. This entails equipping robots with the capability to plan actions over extended horizons while leveraging multi-modality to bridge gaps between high-level planning and actual task execution. Recent advancements in multi-modal foundation models have showcased substantial potential in enhancing planning and reasoning abilities, particularly in the comprehension and processing of semantic information for robotic control tasks. In this paper, we introduce a novel framework based on foundation models that applies the embodied chain of action reasoning methodology to autonomously plan actions from textual instructions for humanoid loco-manipulation. Our method integrates a humanoid-specific chain of thought methodology, including detailed affordance and body movement analysis, which provides a breakdown of the task into a sequence of locomotion and manipulation actions. Moreover, we incorporate spatial reasoning based on the observation and target object properties to effectively navigate where the target position may be unseen or occluded. Through rigorous experimental setups on object rearrangement, manipulation, and loco-manipulation tasks in a real-world environment, we evaluate our method's efficacy on the decoupled upper and lower body control and demonstrate the effectiveness of the chain of robotic action reasoning strategies in comprehending human instructions.

AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions

Authors:Xing Zi, Tengjun Ni, Xianjing Fan, Xian Tao, Jun Li, Ali Braytee, Mukesh Prasad

Date:2025-04-13 11:29:31

Accurate and automated captioning of aerial imagery is crucial for applications like environmental monitoring, urban planning, and disaster management. However, this task remains challenging due to complex spatial semantics and domain variability. To address these issues, we introduce AeroLite, a lightweight, tag-guided captioning framework designed to equip small-scale language models (1-3B parameters) with robust and interpretable captioning capabilities specifically for remote sensing images. AeroLite leverages GPT-4o to generate a large-scale, semantically rich pseudo-caption dataset by integrating multiple remote sensing benchmarks, including DLRSD, iSAID, LoveDA, WHU, and RSSCN7. To explicitly capture key semantic elements such as orientation and land-use types, AeroLite employs natural language processing techniques to extract relevant semantic tags. These tags are then learned by a dedicated multi-label CLIP encoder, ensuring precise semantic predictions. To effectively fuse visual and semantic information, we propose a novel bridging multilayer perceptron (MLP) architecture, aligning semantic tags with visual embeddings while maintaining minimal computational overhead. AeroLite's flexible design also enables seamless integration with various pretrained large language models. We adopt a two-stage LoRA-based training approach: the initial stage leverages our pseudo-caption dataset to capture broad remote sensing semantics, followed by fine-tuning on smaller, curated datasets like UCM and Sydney Captions to refine domain-specific alignment. Experimental evaluations demonstrate that AeroLite surpasses significantly larger models (e.g., 13B parameters) in standard captioning metrics, including BLEU and METEOR, while maintaining substantially lower computational costs.

Draw with Thought: Unleashing Multimodal Reasoning for Scientific Diagram Generation

Authors:Zhiqing Cui, Jiahao Yuan, Hanqing Wang, Yanshu Li, Chenxu Du, Zhenglong Ding

Date:2025-04-13 08:22:09

Scientific diagrams are vital tools for communicating structured knowledge across disciplines. However, they are often published as static raster images, losing symbolic semantics and limiting reuse. While Multimodal Large Language Models (MLLMs) offer a pathway to bridging vision and structure, existing methods lack semantic control and structural interpretability, especially on complex diagrams. We propose Draw with Thought (DwT), a training-free framework that guides MLLMs to reconstruct diagrams into editable mxGraph XML code through cognitively-grounded Chain-of-Thought reasoning. DwT enables interpretable and controllable outputs without model fine-tuning by dividing the task into two stages: Coarse-to-Fine Planning, which handles perceptual structuring and semantic specification, and Structure-Aware Code Generation, enhanced by format-guided refinement. To support evaluation, we release Plot2XML, a benchmark of 247 real-world scientific diagrams with gold-standard XML annotations. Extensive experiments across eight MLLMs show that our approach yields high-fidelity, semantically aligned, and structurally valid reconstructions, with human evaluations confirming strong alignment in both accuracy and visual aesthetics, offering a scalable solution for converting static visuals into executable representations and advancing machine understanding of scientific graphics.

DoorBot: Closed-Loop Task Planning and Manipulation for Door Opening in the Wild with Haptic Feedback

Authors:Zhi Wang, Yuchen Mo, Shengmiao Jin, Wenzhen Yuan

Date:2025-04-12 22:35:44

Robots operating in unstructured environments face significant challenges when interacting with everyday objects like doors. They particularly struggle to generalize across diverse door types and conditions. Existing vision-based and open-loop planning methods often lack the robustness to handle varying door designs, mechanisms, and push/pull configurations. In this work, we propose a haptic-aware closed-loop hierarchical control framework that enables robots to explore and open different unseen doors in the wild. Our approach leverages real-time haptic feedback, allowing the robot to adjust its strategy dynamically based on force feedback during manipulation. We test our system on 20 unseen doors across different buildings, featuring diverse appearances and mechanical types. Our framework achieves a 90% success rate, demonstrating its ability to generalize and robustly handle varied door-opening tasks. This scalable solution offers potential applications in broader open-world articulated object manipulation tasks.

Explorer: Robust Collection of Interactable GUI Elements

Authors:Iason Chaimalas, Arnas Vyšniauskas, Gabriel Brostow

Date:2025-04-12 22:02:29

Automation of existing Graphical User Interfaces (GUIs) is important but hard to achieve. Upstream of making the GUI user-accessible or somehow scriptable, even the data collection needed to understand the original interface poses significant challenges. For example, large quantities of general UI data seem helpful for training general machine learning (ML) models, but accessibility for each person can hinge on the ML's precision on a specific app. We therefore take the perspective that a given user needs confidence that the relevant UI elements are being detected correctly throughout one app or digital environment. We mostly assume that the target application is known in advance, so that data collection and ML training can be personalized for the test-time target domain. The proposed Explorer system focuses on detecting on-screen buttons and text-entry fields, i.e., interactables, where the training process has access to a live version of the application. The live application can run on almost any popular platform except iOS phones, and the collection is especially streamlined for Android phones or for desktop Chrome browsers. Explorer also enables the recording of interactive user sessions, and subsequent mapping of how such sessions overlap and sometimes loop back to similar states. We show how having such a map enables a kind of path planning through the GUI, letting a user issue audio commands to get to their destination. Critically, we are releasing our code for Explorer openly at https://github.com/varnelis/Explorer.
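
One way to picture path planning over a map of GUI states is a breadth-first search on a graph whose nodes are screens and whose edges are interactions. The sketch below is an assumption about how such planning could work, not Explorer's actual implementation; the function `shortest_click_path`, the state names, and the action labels are all hypothetical.

```python
from collections import deque


def shortest_click_path(gui_map, start, goal):
    """Return the shortest list of actions from start to goal, or None.

    gui_map: {state: {action_label: next_state}} (hypothetical structure).
    """
    queue = deque([start])
    came_from = {start: None}  # state -> (previous_state, action) or None
    while queue:
        state = queue.popleft()
        if state == goal:
            actions = []
            # Walk back to the start, collecting the actions taken.
            while came_from[state] is not None:
                state, action = came_from[state]
                actions.append(action)
            return actions[::-1]
        for action, nxt in gui_map.get(state, {}).items():
            if nxt not in came_from:
                came_from[nxt] = (state, action)
                queue.append(nxt)
    return None  # goal is unreachable from start


# Toy map with loops back to earlier states, as recorded sessions might contain.
gui_map = {
    "home": {"tap Settings": "settings", "tap Inbox": "inbox"},
    "settings": {"tap Account": "account", "tap Back": "home"},
    "inbox": {"tap Back": "home"},
    "account": {"tap Back": "settings"},
}
path = shortest_click_path(gui_map, "home", "account")
print(path)
```

Tracking visited states in `came_from` is what keeps the search from cycling when sessions loop back to similar screens.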

Text To 3D Object Generation For Scalable Room Assembly

Authors:Sonia Laguna, Alberto Garcia-Garcia, Marie-Julie Rakotosaona, Stylianos Moschoglou, Leonhard Helminger, Sergio Orts-Escolano

Date:2025-04-12 20:13:07

Modern machine learning models for scene understanding, such as depth estimation and object tracking, rely on large, high-quality datasets that mimic real-world deployment scenarios. To address data scarcity, we propose an end-to-end system for synthetic data generation for scalable, high-quality, and customizable 3D indoor scenes. By integrating and adapting text-to-image and multi-view diffusion models with Neural Radiance Field-based meshing, this system generates high-fidelity 3D object assets from text prompts and incorporates them into pre-defined floor plans using a rendering tool. By introducing novel loss functions and training strategies into existing methods, the system supports on-demand scene generation, aiming to alleviate the scarcity of currently available data, which is generally crafted manually by artists. This system advances the role of synthetic data in addressing machine learning training limitations, enabling more robust and generalizable models for real-world applications.

Adaptive Planning Framework for UAV-Based Surface Inspection in Partially Unknown Indoor Environments

Authors:Hanyu Jin, Zhefan Xu, Haoyu Shen, Xinming Han, Kanlong Ye, Kenji Shimada

Date:2025-04-12 17:30:11

Inspecting indoor environments such as tunnels, industrial facilities, and construction sites is essential for infrastructure monitoring and maintenance. While manual inspection in these environments is often time-consuming and potentially hazardous, Unmanned Aerial Vehicles (UAVs) can improve efficiency by autonomously handling inspection tasks. Such inspection tasks usually rely on reference maps for coverage planning. However, in industrial applications, only the floor plans are typically available. The unforeseen obstacles not included in the floor plans will result in outdated reference maps and inefficient or unsafe inspection trajectories. In this work, we propose an adaptive inspection framework that integrates global coverage planning with local reactive adaptation to improve the coverage and efficiency of UAV-based inspection in partially unknown indoor environments. Experimental results in structured indoor scenarios demonstrate the effectiveness of the proposed approach in inspection efficiency and achieving high coverage rates with adaptive obstacle handling, highlighting its potential for enhancing the efficiency of indoor facility inspection.

Concurrent-Allocation Task Execution for Multi-Robot Path-Crossing-Minimal Navigation in Obstacle Environments

Authors:Bin-Bin Hu, Weijia Yao, Yanxin Zhou, Henglai Wei, Chen Lv

Date:2025-04-12 14:15:27

Reducing undesirable path crossings among trajectories of different robots is vital in multi-robot navigation missions, which not only reduces detours and conflict scenarios, but also enhances navigation efficiency and boosts productivity. Despite recent progress in multi-robot path-crossing-minimal (MPCM) navigation, the majority of approaches depend on the minimal squared-distance reassignment of suitable desired points to robots directly. However, if obstacles occupy the passing space, calculating the actual robot-point distances becomes complex or intractable, which may render the MPCM navigation in obstacle environments inefficient or even infeasible. In this paper, the concurrent-allocation task execution (CATE) algorithm is presented to address this problem (i.e., MPCM navigation in obstacle environments). First, the path-crossing-related elements in terms of (i) robot allocation, (ii) desired-point convergence, and (iii) collision and obstacle avoidance are encoded into integer and control barrier function (CBF) constraints. Then, the proposed constraints are used in an online constrained optimization framework, which implicitly yet effectively minimizes the possible path crossings and trajectory length in obstacle environments by minimizing the desired point allocation cost and slack variables in CBF constraints simultaneously. In this way, the MPCM navigation in obstacle environments can be achieved with flexible spatial orderings. Note that the feasibility of solutions and the asymptotic convergence property of the proposed CATE algorithm in obstacle environments are both guaranteed, and the calculation burden is also reduced by concurrently calculating the optimal allocation and the control input directly without the path planning process.
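
For context, the squared-distance reassignment that the abstract attributes to prior approaches can be sketched as below. This is the obstacle-free baseline, not the CATE algorithm; the brute-force search, the function name, and the toy coordinates are assumptions chosen for brevity, and the sketch assumes equal numbers of robots and desired points.

```python
from itertools import permutations


def min_squared_distance_assignment(robots, points):
    """Assign one desired point to each robot, minimizing total squared distance.

    robots, points: equal-length lists of (x, y) tuples.
    Brute force over all permutations, so only suitable for a handful of robots.
    Returns (assignment, cost), where assignment[i] is the point index for robot i.
    """
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(points))):
        cost = sum(
            (rx - points[j][0]) ** 2 + (ry - points[j][1]) ** 2
            for (rx, ry), j in zip(robots, perm)
        )
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm), best_cost


robots = [(0, 0), (4, 0)]
points = [(4, 1), (0, 1)]  # listed in the "crossing" order on purpose
assignment, cost = min_squared_distance_assignment(robots, points)
print(assignment, cost)
```

Here each robot is sent to the point directly above it, so the straight-line paths do not cross. The cost uses straight-line distances only, which is exactly what stops being meaningful once obstacles occupy the passing space.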

Enhancing U.S. swine farm preparedness for infectious foreign animal diseases with rapid access to biosecurity information

Authors:Christian Fleming, Kelsey Mills, Nicolas Cardenas, Jason A. Galvis, Gustavo Machado

Date:2025-04-12 13:29:31

The U.S. launched the Secure Pork Supply (SPS) Plan for Continuity of Business, a voluntary program providing foreign animal disease (FAD) guidance and setting biosecurity standards to maintain business continuity amid FAD outbreaks. The role of biosecurity in disease prevention is well recognized, yet the U.S. swine industry lacks knowledge of individual farm biosecurity plans and the efficacy of existing measures. We describe a multi-sector initiative that formed the Rapid Access Biosecurity (RAB) app consortium with the swine industry, government, and academia. We (i) summarized 7,625 farms using RABapp, (ii) mapped U.S. commercial swine coverage and areas of limited biosecurity, and (iii) examined associations between biosecurity and occurrences of porcine reproductive and respiratory syndrome virus (PRRSV) and porcine epidemic diarrhea virus (PEDV). RABapp, used in 31 states, covers ~47% of U.S. commercial swine. Of 307 Agricultural Statistics Districts with swine, 78% (238) had <50% of those animals in RABapp. We used a mixed-effects logistic regression model, accounting for production company and farm type (breeding vs. non-breeding). Requiring footwear/clothing changes, having multiple carcass disposal locations, hosting other businesses, and greater distance to swine farms reduced infection odds. Rendering carcasses, manure pit storage or land application, multiple perimeter buffer areas, and a larger animal housing area increased risk. This study leveraged RABapp to assess U.S. swine farm biosecurity, revealing gaps in SPS plan adoption that create vulnerable regions. Some biosecurity practices (e.g., footwear changes) lowered PRRSV/PEDV risk, while certain disposal and manure practices increased it. Targeted biosecurity measures and broader RABapp adoption can bolster industry resilience against foreign animal diseases.
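
A mixed-effects logistic regression of this kind can be written generically as follows. The abstract does not state which terms are fixed and which are random, so this sketch assumes a random intercept $u_j$ for production company $j$ and treats farm type as one of the fixed covariates in $\mathbf{x}_{ij}$; all symbols are introduced here for illustration.

```latex
\operatorname{logit}\bigl(\Pr(y_{ij} = 1)\bigr)
  = \beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}_{ij} + u_j,
  \qquad u_j \sim \mathcal{N}(0, \sigma_u^2),
\qquad
\mathrm{OR}_k = \exp(\beta_k)
```

Here $y_{ij}$ indicates whether farm $i$ of company $j$ had a PRRSV or PEDV occurrence. Under this reading, a practice with $\mathrm{OR}_k < 1$ (e.g., footwear/clothing changes) lowers the odds of infection, while $\mathrm{OR}_k > 1$ (e.g., rendering carcasses) raises them.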

Exploration of Plan-Guided Summarization for Narrative Texts: the Case of Small Language Models

Authors:Matt Grenander, Siddharth Varia, Paula Czarnowska, Yogarshi Vyas, Kishaloy Halder, Bonan Min

Date:2025-04-12 04:11:37

Plan-guided summarization attempts to reduce hallucinations in small language models (SLMs) by grounding generated summaries to the source text, typically by targeting fine-grained details such as dates or named entities. In this work, we investigate whether plan-based approaches in SLMs improve summarization in long-document narrative tasks. The length and complexity of narrative texts often make them difficult to summarize faithfully. We analyze existing plan-guided solutions targeting fine-grained details, and also propose our own higher-level, narrative-based plan formulation. Our results show that neither approach significantly improves on a baseline without planning in either summary quality or faithfulness. Human evaluation reveals that while plan-guided approaches are often well grounded to their plan, plans are as likely as summaries to contain hallucinations. As a result, the plan-guided summaries are just as unfaithful as those from models without planning. Our work serves as a cautionary tale for plan-guided approaches to summarization, especially for long, complex domains such as narrative texts.

Hyperlocal disaster damage assessment using bi-temporal street-view imagery and pre-trained vision models

Authors:Yifan Yang, Lei Zou, Bing Zhou, Daoyang Li, Binbin Lin, Joynal Abedin, Mingzheng Yang

Date:2025-04-12 03:52:31

Street-view images offer unique advantages for disaster damage estimation as they capture impacts from a visual perspective and provide detailed, on-the-ground insights. Several investigations have analyzed street-view images for damage estimation, but they mainly focus on post-disaster images. The potential of time-series street-view images remains underexplored. Pre-disaster images provide valuable benchmarks for accurate damage estimations at building and street levels. These images could aid annotators in objectively labeling post-disaster impacts, improving the reliability of labeled data sets for model training, and potentially enhancing the model performance in damage evaluation. The goal of this study is to estimate hyperlocal, on-the-ground disaster damages using bi-temporal street-view images and advanced pre-trained vision models. Street-view images before and after 2024 Hurricane Milton in Horseshoe Beach, Florida, were collected for experiments. The objectives are: (1) to assess the performance gains of incorporating pre-disaster street-view images as a no-damage category in fine-tuning pre-trained models, including Swin Transformer and ConvNeXt, for damage level classification; (2) to design and evaluate a dual-channel algorithm that reads pair-wise pre- and post-disaster street-view images for hyperlocal damage assessment. The results indicate that incorporating pre-disaster street-view images and employing a dual-channel processing framework can significantly enhance damage assessment accuracy. The accuracy improves from 66.14% with the Swin Transformer baseline to 77.11% with the dual-channel Feature-Fusion ConvNeXt model. This research enables rapid, operational damage assessments at hyperlocal spatial resolutions, providing valuable insights to support effective decision-making in disaster management and resilience planning.

Associating transportation planning-related measures with Mild Cognitive Impairment

Authors:Souradeep Chattopadhyay, Guillermo Basulto-Elias, Jun Ha Chang, Matthew Rizzo, Shauna Hallmark, Anuj Sharma, Soumik Sarkar

Date:2025-04-12 00:52:25

Understanding the relationship between mild cognitive impairment and driving behavior is essential to improve road safety, especially among older adults. In this study, we computed certain variables that reflect daily driving habits, such as trips to specific locations (e.g., home, work, medical, social, and errands) of older drivers in Nebraska using geohashing. The computed variables were then analyzed using a two-fold approach involving data visualization and machine learning models (C5.0, Random Forest, Support Vector Machines) to investigate the efficiency of the computed variables in predicting whether a driver is cognitively impaired or unimpaired. The C5.0 model demonstrated robust and stable performance with a median recall of 74%, indicating that our methodology correctly identified 74% of the cognitively impaired drivers. This highlights our model's effectiveness in minimizing false negatives, which is an important consideration given that the cost of missing impaired drivers could be high. Our findings highlight the potential of life space variables in understanding and predicting cognitive decline, offering avenues for early intervention and tailored support for affected individuals.
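
Recall is the fraction of truly impaired drivers that the classifier flags, i.e., TP / (TP + FN), which is why a high recall corresponds to few false negatives. The short sketch below computes it on made-up labels; the data are hypothetical and do not come from the study.

```python
def recall(y_true, y_pred, positive=1):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("recall is undefined without positive examples")
    return tp / (tp + fn)


# Hypothetical labels: 1 = cognitively impaired, 0 = unimpaired.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]
score = recall(y_true, y_pred)
print(score)  # 3 of the 4 impaired drivers are caught
```

Note that the false positive in the toy predictions does not affect recall; it would show up in precision instead.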