planning - 2025-11-25

Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution

Authors:Dingkang Liang, Cheng Zhang, Xiaopeng Xu, Jianzhong Ju, Zhenbo Luo, Xiang Bai

Date:2025-11-24 18:59:17

Task scheduling is critical for embodied AI, enabling agents to follow natural language instructions and execute actions efficiently in 3D physical worlds. However, existing datasets often simplify task planning by ignoring operations research (OR) knowledge and 3D spatial grounding. In this work, we propose Operations Research knowledge-based 3D Grounded Task Scheduling (ORS3D), a new task that requires the synergy of language understanding, 3D grounding, and efficiency optimization. Unlike prior settings, ORS3D demands that agents minimize total completion time by leveraging parallelizable subtasks, e.g., cleaning the sink while the microwave operates. To facilitate research on ORS3D, we construct ORS3D-60K, a large-scale dataset comprising 60K composite tasks across 4K real-world scenes. Furthermore, we propose GRANT, an embodied multi-modal large language model equipped with a simple yet effective scheduling token mechanism to generate efficient task schedules and grounded actions. Extensive experiments on ORS3D-60K validate the effectiveness of GRANT across language understanding, 3D grounding, and scheduling efficiency. The code is available at https://github.com/H-EmbodVis/GRANT

An Anatomy Aware Hybrid Deep Learning Framework for Lung Cancer Tumor Stage Classification

Authors:Saniah Kayenat Chowdhury, Rusab Sarmun, Muhammad E. H. Chowdhury, Sohaib Bassam Zoghoul, Israa Al-Hashimi, Adam Mushtak, Amith Khandakar

Date:2025-11-24 18:01:47

Accurate lung cancer tumor staging is crucial for prognosis and treatment planning. However, it remains challenging for end-to-end deep learning approaches, as such approaches often overlook spatial and anatomical information that are central to the tumor-node-metastasis system. The tumor stage depends on multiple quantitative criteria, including the tumor size and its proximity to the nearest anatomical structures, and small variations can alter the staging outcome. We propose a medically grounded hybrid pipeline that performs staging by explicitly measuring the tumor's size and distance properties rather than treating it as a pure image classification task. Our method employs specialized encoder-decoder networks to precisely segment the lung and adjacent anatomy, including the lobes, tumor, mediastinum, and diaphragm. Subsequently, we extract the necessary tumor properties, i.e. measure the largest tumor dimension and calculate the distance between the tumor and neighboring anatomical structures by a quantitative analysis of the segmentation masks. Finally, we apply rule-based tumor staging aligned with the medical guidelines. This novel framework has been evaluated on the Lung-PET-CT-Dx dataset, demonstrating superior performance compared to traditional deep learning models, achieving an overall classification accuracy of 91.36%. We report the per-stage F1-scores of 0.93 (T1), 0.89 (T2), 0.96 (T3), and 0.90 (T4), a critical evaluation aspect often omitted in prior literature. To our knowledge, this is the first study that embeds explicit clinical context into tumor stage classification. Unlike standard convolutional neural networks that operate in an uninterpretable "black box" manner, our method offers both state-of-the-art performance and transparent decision support.

A primer on treatment planning aspects for temporally modulated pulsed radiation therapy

Authors:Christian Velten, Jiayi Huang, Wolfgang A. Tomé

Date:2025-11-24 17:21:06

Temporally modulated pulsed radiotherapy (TMPRT) delivers conventional fraction doses of radiation using temporally separated pulses of low doses (<30 cGy) yielding fraction-effective dose rates of around 6.7 cGy/min with the goal to exploit tumor radiation hypersensitivity, which was observed in both, preclinical models and in human clinical trials. To facilitate TMPRT, volumetric modulated arc therapy (VMAT) and 3D-CRT planning techniques were developed following the guidelines of the proposed NRG CC-017 trial. Plans were evaluated with respect to homogeneity, conformality, and adherence to dose constraints. Deliverability of plans was assessed using in-phantom measurements for absorbed dose accuracy at low dose rates and using EPID for isodose verification. For VMAT only single arc plans were found to be acceptable due to otherwise unacceptably heterogeneous field doses, while for dynamic conformal arcs machine limtations on the number of monitor units per degree require the use of partial arcs for each pulse. Delivery of plans at low dose rates (< 100 MU/min) was accurate with high Gamma pass rates on modern LINACs and moderate pass rates on legacy LINACs, in line with their general performance. Generally, VMAT is preferred to achieve optimal homogeneity, conformality, and organ-at-risk sparing, while the use of 3D-CRT can increase the availability of TMPRT for more patients and clinics.

Closing Gaps in Emissions Monitoring with Climate TRACE

Authors:Brittany V. Lancellotti, Jordan M. Malof, Aaron Davitt, Gavin McCormick, Shelby Anderson, Pol Carbó-Mestre, Gary Collins, Verity Crane, Zoheyr Doctor, George Ebri, Kevin Foster, Trey M. Gowdy, Michael Guzzardi, John Heal, Heather Hunter, David Kroodsma, Khandekar Mahammad Galib, Paul J. Markakis, Gavin McDonald, Daniel P. Moore, Eric D. Nguyen, Sabina Parvu, Michael Pekala, Christine D. Piatko, Amy Piscopo, Mark Powell, Krsna Raniga, Elizabeth P. Reilly, Michael Robinette, Ishan Saraswat, Patrick Sicurello, Isabella Söldner-Rembold, Raymond Song, Charlotte Underwood, Kyle Bradbury

Date:2025-11-24 16:28:44

Global greenhouse gas emissions estimates are essential for monitoring and mitigation planning. Yet most datasets lack one or more characteristics that enhance their actionability, such as accuracy, global coverage, high spatial and temporal resolution, and frequent updates. To address these gaps, we present Climate TRACE (climatetrace.org), an open-access platform delivering global emissions estimates with enhanced detail, coverage, and timeliness. Climate TRACE synthesizes existing emissions data, prioritizing accuracy, coverage, and resolution, and fills gaps using sector-specific estimation approaches. The dataset is the first to provide globally comprehensive emissions estimates for individual sources (e.g., individual power plants) for all anthropogenic emitting sectors. The dataset spans January 1, 2021, to the present, with a two-month reporting lag and monthly updates. The open-access platform enables non-technical audiences to engage with detailed emissions datasets for most subnational governments worldwide. Climate TRACE supports data-driven climate action at scales where decisions are made, representing a major breakthrough for emissions accounting and mitigation.

Adversarial Patch Attacks on Vision-Based Cargo Occupancy Estimation via Differentiable 3D Simulation

Authors:Mohamed Rissal Hedna, Sesugh Samuel Nder

Date:2025-11-24 16:05:40

Computer vision systems are increasingly adopted in modern logistics operations, including the estimation of trailer occupancy for planning, routing, and billing. Although effective, such systems may be vulnerable to physical adversarial attacks, particularly adversarial patches that can be printed and placed on interior surfaces. In this work, we study the feasibility of such attacks on a convolutional cargo-occupancy classifier using fully simulated 3D environments. Using Mitsuba 3 for differentiable rendering, we optimize patch textures across variations in geometry, lighting, and viewpoint, and compare their effectiveness to a 2D compositing baseline. Our experiments demonstrate that 3D-optimized patches achieve high attack success rates, especially in a denial-of-service scenario (empty to full), where success reaches 84.94 percent. Concealment attacks (full to empty) prove more challenging but still reach 30.32 percent. We analyze the factors influencing attack success, discuss implications for the security of automated logistics pipelines, and highlight directions for strengthening physical robustness. To our knowledge, this is the first study to investigate adversarial patch attacks for cargo-occupancy estimation in physically realistic, fully simulated 3D scenes.

Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving

Authors:Jianhua Han, Meng Tian, Jiangtong Zhu, Fan He, Huixin Zhang, Sitong Guo, Dechang Zhu, Hao Tang, Pei Xu, Yuze Guo, Minzhe Niu, Haojie Zhu, Qichao Dong, Xuechao Yan, Siyuan Dong, Lu Hou, Qingqiu Huang, Xiaosong Jia, Hang Xu

Date:2025-11-24 15:28:25

Autonomous driving heavily relies on accurate and robust spatial perception. Many failures arise from inaccuracies and instability, especially in long-tail scenarios and complex interactions. However, current vision-language models are weak at spatial grounding and understanding, and VLA systems built on them therefore show limited perception and localization ability. To address these challenges, we introduce Percept-WAM, a perception-enhanced World-Awareness-Action Model that is the first to implicitly integrate 2D/3D scene understanding abilities within a single vision-language model (VLM). Instead of relying on QA-style spatial reasoning, Percept-WAM unifies 2D/3D perception tasks into World-PV and World-BEV tokens, which encode both spatial coordinates and confidence. We propose a grid-conditioned prediction mechanism for dense object perception, incorporating IoU-aware scoring and parallel autoregressive decoding, improving stability in long-tail, far-range, and small-object scenarios. Additionally, Percept-WAM leverages pretrained VLM parameters to retain general intelligence (e.g., logical reasoning) and can output perception results and trajectory control outputs directly. Experiments show that Percept-WAM matches or surpasses classical detectors and segmenters on downstream perception benchmarks, achieving 51.7/58.9 mAP on COCO 2D detection and nuScenes BEV 3D detection. When integrated with trajectory decoders, it further improves planning performance on nuScenes and NAVSIM, e.g., surpassing DiffusionDrive by 2.1 in PMDS on NAVSIM. Qualitative results further highlight its strong open-vocabulary and long-tail generalization.

Three-Dimensional Anatomical Data Generation Based on Artificial Neural Networks

Authors:Ann-Sophia Müller, Moonkwang Jeong, Meng Zhang, Jiyuan Tian, Arkadiusz Miernik, Stefanie Speidel, Tian Qiu

Date:2025-11-24 15:07:45

Surgical planning and training based on machine learning requires a large amount of 3D anatomical models reconstructed from medical imaging, which is currently one of the major bottlenecks. Obtaining these data from real patients and during surgery is very demanding, if even possible, due to legal, ethical, and technical challenges. It is especially difficult for soft tissue organs with poor imaging contrast, such as the prostate. To overcome these challenges, we present a novel workflow for automated 3D anatomical data generation using data obtained from physical organ models. We additionally use a 3D Generative Adversarial Network (GAN) to obtain a manifold of 3D models useful for other downstream machine learning tasks that rely on 3D data. We demonstrate our workflow using an artificial prostate model made of biomimetic hydrogels with imaging contrast in multiple zones. This is used to physically simulate endoscopic surgery. For evaluation and 3D data generation, we place it into a customized ultrasound scanner that records the prostate before and after the procedure. A neural network is trained to segment the recorded ultrasound images, which outperforms conventional, non-learning-based computer vision techniques in terms of intersection over union (IoU). Based on the segmentations, a 3D mesh model is reconstructed, and performance feedback is provided.

Enhancing Fast Radio Transient Detection with Mask R-CNN Image Segmentation

Authors:Sergio Belmonte Diaz, Rene P. Breton, Zafiirah Hosenie, Ben W. Stappers

Date:2025-11-24 11:43:45

Traditionally, fast radio transient searches are conducted on dedispersed time series using thresholding techniques based on the statistical properties of the data. However, peaks in dedispersed time series do not directly provide information on the nature of the source. In the DM-time domain, the S/N variation of real, dispersed astrophysical signals forms a characteristic bow tie shape, whereas radio frequency interference (RFI) can take multiple different forms. We have developed a method that bypasses the thresholding step of traditional single-pulse searches in favour of a direct DM-time domain image analysis. The backbone of our pipeline is a Mask R-CNN, a deep learning model designed for object detection, enabling it to identify the bow tie signature and distinguish real sources from RFI. Previous deep learning models often include a snippet of the DM-time domain in their input. We have trained the model on simulated bursts injected on top of real MeerKAT noise observations. We tested the model on MeerKAT follow-up observations of the repeater FRB20240114A and we were able to recover all bursts with a signal-to-noise above the traditional threshold, while detecting two bursts that were fainter. Our new approach considerably reduces the number of candidates above a nominal threshold while being capable of running in real time for typical surveys. We also propose a modified version of the traditional dedispersion plan optimised for this method.

Active Inference is a Subtype of Variational Inference

Authors:Wouter W. L. Nuijten, Mykola Lukashchuk

Date:2025-11-24 10:14:09

Automated decision-making under uncertainty requires balancing exploitation and exploration. Classical methods treat these separately using heuristics, while Active Inference unifies them through Expected Free Energy (EFE) minimization. However, EFE minimization is computationally expensive, limiting scalability. We build on recent theory recasting EFE minimization as variational inference, formally unifying it with Planning-as-Inference and showing the epistemic drive as a unique entropic contribution. Our main contribution is a novel message-passing scheme for this unified objective, enabling scalable Active Inference in factored-state MDPs and overcoming high-dimensional planning intractability.

Facade Segmentation for Solar Photovoltaic Suitability

Authors:Ayca Duran, Christoph Waibel, Bernd Bickel, Iro Armeni, Arno Schlueter

Date:2025-11-24 08:37:35

Building integrated photovoltaic (BIPV) facades represent a promising pathway towards urban decarbonization, especially where roof areas are insufficient and ground-mounted arrays are infeasible. Although machine learning-based approaches to support photovoltaic (PV) planning on rooftops are well researched, automated approaches for facades still remain scarce and oversimplified. This paper therefore presents a pipeline that integrates detailed information on the architectural composition of the facade to automatically identify suitable surfaces for PV application and estimate the solar energy potential. The pipeline fine-tunes SegFormer-B5 on the CMP Facades dataset and converts semantic predictions into facade-level PV suitability masks and PV panel layouts considering module sizes and clearances. Applied to a dataset of 373 facades with known dimensions from ten cities, the results show that installable BIPV potential is significantly lower than theoretical potential, thus providing valuable insights for reliable urban energy planning. With the growing availability of facade imagery, the proposed pipeline can be scaled to support BIPV planning in cities worldwide.

PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation

Authors:Huadai Liu, Kaicheng Luo, Wen Wang, Qian Chen, Peiwen Sun, Rongjie Huang, Xiangang Li, Jieping Ye, Wei Xue

Date:2025-11-24 07:11:12

Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions: semantic consistency, audio-visual temporal synchrony, aesthetic quality, and spatial accuracy; yet existing methods suffer from objective entanglement that conflates competing goals in single loss functions and lack human preference alignment. We introduce PrismAudio, the first framework to integrate Reinforcement Learning into V2A generation with specialized Chain-of-Thought (CoT) planning. Our approach decomposes monolithic reasoning into four specialized CoT modules (Semantic, Temporal, Aesthetic, and Spatial CoT), each paired with targeted reward functions. This CoT-reward correspondence enables multidimensional RL optimization that guides the model to jointly generate better reasoning across all perspectives, solving the objective entanglement problem while preserving interpretability. To make this optimization computationally practical, we propose Fast-GRPO, which employs hybrid ODE-SDE sampling that dramatically reduces the training overhead compared to existing GRPO implementations. We also introduce AudioCanvas, a rigorous benchmark that is more distributionally balanced and covers more realistically diverse and challenging scenarios than existing datasets, with 300 single-event classes and 501 multi-event samples. Experimental results demonstrate that PrismAudio achieves state-of-the-art performance across all four perceptual dimensions on both the in-domain VGGSound test set and out-of-domain AudioCanvas benchmark. The project page is available at https://PrismAudio-Project.github.io.

Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion

Authors:Keyang Lu, Sifan Zhou, Hongbin Xu, Gang Xu, Zhifei Yang, Yikai Wang, Zhen Xiao, Jieyi Long, Ming Li

Date:2025-11-24 04:02:48

Realistic 3D city generation is fundamental to a wide range of applications, including virtual reality and digital twins. However, most existing methods rely on training a single diffusion model, which limits their ability to generate personalized and boundless city-scale scenes. In this paper, we present Yo'City, a novel agentic framework that enables user-customized and infinitely expandable 3D city generation by leveraging the reasoning and compositional capabilities of off-the-shelf large models. Specifically, Yo'City first conceptualize the city through a top-down planning strategy that defines a hierarchical "City-District-Grid" structure. The Global Planner determines the overall layout and potential functional districts, while the Local Designer further refines each district with detailed grid-level descriptions. Subsequently, the grid-level 3D generation is achieved through a "produce-refine-evaluate" isometric image synthesis loop, followed by image-to-3D generation. To simulate continuous city evolution, Yo'City further introduces a user-interactive, relationship-guided expansion mechanism, which performs scene graph-based distance- and semantics-aware layout optimization, ensuring spatially coherent city growth. To comprehensively evaluate our method, we construct a diverse benchmark dataset and design six multi-dimensional metrics that assess generation quality from the perspectives of semantics, geometry, texture, and layout. Extensive experiments demonstrate that Yo'City consistently outperforms existing state-of-the-art methods across all evaluation aspects.

GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving

Authors:Lin Liu, Caiyan Jia, Guanyi Yu, Ziying Song, JunQiao Li, Feiyang Jia, Peiliang Wu, Xiaoshuai Hao, Yandan Luo

Date:2025-11-24 03:45:32

Driving planning is a critical component of end-to-end (E2E) autonomous driving. However, prevailing Imitative E2E Planners often suffer from multimodal trajectory mode collapse, failing to produce diverse trajectory proposals. Meanwhile, Generative E2E Planners struggle to incorporate crucial safety and physical constraints directly into the generative process, necessitating an additional optimization stage to refine their outputs. In this paper, we propose \textit{\textbf{GuideFlow}}, a novel planning framework that leverages Constrained Flow Matching. Concretely, \textit{\textbf{GuideFlow}} explicitly models the flow matching process, which inherently mitigates mode collapse and allows for flexible guidance from various conditioning signals. Our core contribution lies in directly enforcing explicit constraints within the flow matching generation process, rather than relying on implicit constraint encoding. Crucially, \textit{\textbf{GuideFlow}} unifies the training of the flow matching with the Energy-Based Model (EBM) to enhance the model's autonomous optimization capability to robustly satisfy physical constraints. Secondly, \textit{\textbf{GuideFlow}} parameterizes driving aggressiveness as a control signal during generation, enabling precise manipulation of trajectory style. Extensive evaluations on major driving benchmarks (Bench2Drive, NuScenes, NavSim and ADV-NuScenes) validate the effectiveness of \textit{\textbf{GuideFlow}}. Notably, on the NavSim test hard split (Navhard), \textit{\textbf{GuideFlow}} achieved SOTA with an EPDMS score of 43.0. The code will be released.

Asynchronous Distributed Multi-Robot Motion Planning Under Imperfect Communication

Authors:Ardalan Tajbakhsh, Augustinos Saravanos, James Zhu, Evangelos A. Theodorou, Lorenz T. Biegler, Aaron M. Johnson

Date:2025-11-24 02:55:12

This paper addresses the challenge of coordinating multi-robot systems under realistic communication delays using distributed optimization. We focus on consensus ADMM as a scalable framework for generating collision-free, dynamically feasible motion plans in both trajectory optimization and receding-horizon control settings. In practice, however, these algorithms are sensitive to penalty tuning or adaptation schemes (e.g. residual balancing and adaptive parameter heuristics) that do not explicitly consider delays. To address this, we introduce a Delay-Aware ADMM (DA-ADMM) variant that adapts penalty parameters based on real-time delay statistics, allowing agents to down-weight stale information and prioritize recent updates during consensus and dual updates. Through extensive simulations in 2D and 3D environments with double-integrator, Dubins-car, and drone dynamics, we show that DA-ADMM significantly improves robustness, success rate, and solution quality compared to fixed-parameter, residual-balancing, and fixed-constraint baselines. Our results highlight that performance degradation is not solely determined by delay length or frequency, but by the optimizer's ability to contextually reason over delayed information. The proposed DA-ADMM achieves consistently better coordination performance across a wide range of delay conditions, offering a principled and efficient mechanism for resilient multi-robot motion planning under imperfect communication.

CNN-Based Camera Pose Estimation and Localisation of Scan Images for Aircraft Visual Inspection

Authors:Xueyan Oh, Leonard Loh, Shaohui Foong, Zhong Bao Andy Koh, Kow Leong Ng, Poh Kang Tan, Pei Lin Pearlin Toh, U-Xuan Tan

Date:2025-11-24 02:52:36

General Visual Inspection is a manual inspection process regularly used to detect and localise obvious damage on the exterior of commercial aircraft. There has been increasing demand to perform this process at the boarding gate to minimise the downtime of the aircraft and automating this process is desired to reduce the reliance on human labour. Automating this typically requires estimating a camera's pose with respect to the aircraft for initialisation but most existing localisation methods require infrastructure, which is very challenging in uncontrolled outdoor environments and within the limited turnover time (approximately 2 hours) on an airport tarmac. Additionally, many airlines and airports do not allow contact with the aircraft's surface or using UAVs for inspection between flights, and restrict access to commercial aircraft. Hence, this paper proposes an on-site method that is infrastructure-free and easy to deploy for estimating a pan-tilt-zoom camera's pose and localising scan images. This method initialises using the same pan-tilt-zoom camera used for the inspection task by utilising a Deep Convolutional Neural Network fine-tuned on only synthetic images to predict its own pose. We apply domain randomisation to generate the dataset for fine-tuning the network and modify its loss function by leveraging aircraft geometry to improve accuracy. We also propose a workflow for initialisation, scan path planning, and precise localisation of images captured from a pan-tilt-zoom camera. We evaluate and demonstrate our approach through experiments with real aircraft, achieving root-mean-square camera pose estimation errors of less than 0.24 m and 2 degrees for all real scenes.

Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents

Authors:Dayong Liu, Chao Xu, Weihong Chen, Suyu Zhang, Juncheng Wang, Jiankang Deng, Baigui Sun, Yang Liu

Date:2025-11-24 02:02:29

Multimodal Large Language Models (MLLMs) show promising results as decision-making engines for embodied agents operating in complex, physical environments. However, existing benchmarks often prioritize high-level planning or spatial reasoning, leaving the fine-grained action intelligence required for embodied physical interaction underexplored. To address this gap, we introduce CFG-Bench, a new benchmark designed to systematically evaluate this crucial capability. CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modalities question-answer pairs targeting four cognitive abilities: 1) Physical Interaction, 2) Temporal-Causal Relation, 3) Intentional Understanding, and 4) Evaluative Judgment. Together, these dimensions provide a systematic framework for assessing a model's ability to translate visual observations into actionable knowledge, moving beyond mere surface-level recognition. Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions and exhibit profound limitations in the higher-order reasoning of intention and evaluation. Moreover, supervised fine-tuning (SFT) on our data demonstrates that teaching an MLLMs to articulate fine-grained actions directly translates to significant performance gains on established embodied benchmarks. Our analysis highlights these limitations and offers insights for developing more capable and grounded embodied agents.

Lean 5.0: A Predictive, Human-AI, and Ethically Grounded Paradigm for Construction Management

Authors:Atena Khoshkonesh, Mohsen Mohammadagha, Navid Ebrahimi, Narges Sadeghigolshan

Date:2025-11-23 23:11:55

This paper introduces Lean 5.0, a human-centric evolution of Lean-Digital integration that connects predictive analytics, AI collaboration, and continuous learning within Industry 5.0 and Construction 5.0 contexts. A systematic literature review (2019-2024) and a 12-week empirical validation study demonstrate measurable performance gains, including a 13% increase in Plan Percent Complete (PPC), 22% reduction in rework, and 42% improvement in forecast accuracy. The study adopts a mixed-method Design Science Research (DSR) approach aligned with PRISMA 2020 guidelines. The paper also examines integration with digital twin and blockchain technologies to improve traceability, auditability, and lifecycle transparency. Despite limitations related to sample size, single-case design, and study duration, the findings show that Lean 5.0 provides a transformative paradigm connecting human cognition with predictive control in construction management.

An Analysis of Constraint-Based Multi-Agent Pathfinding Algorithms

Authors:Hannah Lee, James D. Motes, Marco Morales, Nancy M. Amato

Date:2025-11-23 20:11:47

This study informs the design of future multi-agent pathfinding (MAPF) and multi-robot motion planning (MRMP) algorithms by guiding choices based on constraint classification for constraint-based search algorithms. We categorize constraints as conservative or aggressive and provide insights into their search behavior, focusing specifically on vanilla Conflict-Based Search (CBS) and Conflict-Based Search with Priorities (CBSw/P). Under a hybrid grid-roadmap representation with varying resolution, we observe that aggressive (priority constraint) formulations tend to solve more instances as agent count or resolution increases, whereas conservative (motion constraint) formulations yield stronger solution quality when both succeed. Findings are synthesized in a decision flowchart, aiding users in selecting suitable constraints. Recommendations extend to Multi-Robot Motion Planning (MRMP), emphasizing the importance of considering topological features alongside problem, solution, and representation features. A comprehensive exploration of the study, including raw data and map performance, is available in our public GitHub Repository: https://GitHub.com/hannahjmlee/constraint-mapf-analysis

C3Po: Cross-View Cross-Modality Correspondence by Pointmap Prediction

Authors:Kuan Wei Huang, Brandon Li, Bharath Hariharan, Noah Snavely

Date:2025-11-23 18:12:42

Geometric models like DUSt3R have shown great advances in understanding the geometry of a scene from pairs of photos. However, they fail when the inputs are from vastly different viewpoints (e.g., aerial vs. ground) or modalities (e.g., photos vs. abstract drawings) compared to what was observed during training. This paper addresses a challenging version of this problem: predicting correspondences between ground-level photos and floor plans. Current datasets for joint photo--floor plan reasoning are limited, either lacking in varying modalities (VIGOR) or lacking in correspondences (WAFFLE). To address these limitations, we introduce a new dataset, C3, created by first reconstructing a number of scenes in 3D from Internet photo collections via structure-from-motion, then manually registering the reconstructions to floor plans gathered from the Internet, from which we can derive correspondence between images and floor plans. C3 contains 90K paired floor plans and photos across 597 scenes with 153M pixel-level correspondences and 85K camera poses. We find that state-of-the-art correspondence models struggle on this task. By training on our new data, we can improve on the best performing method by 34% in RMSE. We also identify open challenges in cross-modal geometric reasoning that our dataset aims to help address.

From LIGO to Fiber- Compact Sagnac Interferometer for Gravitational-Wave Detection

Authors:Farhad Hakimi, Hosain Hakimi

Date:2025-11-23 16:50:30

Gravitational wave detection has transformed astrophysics, granting us direct access to black hole mergers, neutron star collisions, and the cataclysms of stellar death. Yet the great observatories of today, LIGO, Virgo, KAGRA, and the planned Einstein Telescope, rest on Michelson interferometers that, despite their triumphs, confront fundamental barriers of scale, cost, and environmental vulnerability. We envision a new path, a Sagnac-based fiber interferometer that leverages reciprocity and inherent robustness. Its meter-scale, modular design compact enough to fit within a small facility, offers dramatic gains in scalability and affordability over kilometer scale Michelson systems. Tunable to frequency bands where conventional detectors lose sensitivity, it opens the door to compact, versatile, and accessible GW observatories, empowering universities and research centers worldwide. Linked together in a global network, such facilities could transcend mere detection, they could localize cosmic sources and reconstruct them into images, much as black holes were first directly revealed, ushering in a new era of gravitational-wave astronomy and multi-messenger discovery.

Splatblox: Traversability-Aware Gaussian Splatting for Outdoor Robot Navigation

Authors:Samarth Chopra, Jing Liang, Gershom Seneviratne, Yonghan Lee, Jaehoon Choi, Jianyu An, Stephen Cheng, Dinesh Manocha

Date:2025-11-23 16:35:51

We present Splatblox, a real-time system for autonomous navigation in outdoor environments with dense vegetation, irregular obstacles, and complex terrain. Our method fuses segmented RGB images and LiDAR point clouds using Gaussian Splatting to construct a traversability-aware Euclidean Signed Distance Field (ESDF) that jointly encodes geometry and semantics. Updated online, this field enables semantic reasoning to distinguish traversable vegetation (e.g., tall grass) from rigid obstacles (e.g., trees), while LiDAR ensures 360-degree geometric coverage for extended planning horizons. We validate Splatblox on a quadruped robot and demonstrate transfer to a wheeled platform. In field trials across vegetation-rich scenarios, it outperforms state-of-the-art methods with over 50% higher success rate, 40% fewer freezing incidents, 5% shorter paths, and up to 13% faster time to goal, while supporting long-range missions up to 100 meters. Experiment videos and more details can be found on our project page: https://splatblox.github.io

NeuroVascU-Net: A Unified Multi-Scale and Cross-Domain Adaptive Feature Fusion U-Net for Precise 3D Segmentation of Brain Vessels in Contrast-Enhanced T1 MRI

Authors:Mohammad Jafari Vayeghan, Niloufar Delfan, Mehdi Tale Masouleh, Mansour Parvaresh Rizi, Behzad Moshiri

Date:2025-11-23 12:23:20

Precise 3D segmentation of cerebral vasculature from T1-weighted contrast-enhanced (T1CE) MRI is crucial for safe neurosurgical planning. Manual delineation is time-consuming and prone to inter-observer variability, while current automated methods often trade accuracy for computational cost, limiting clinical use. We present NeuroVascU-Net, the first deep learning architecture specifically designed to segment cerebrovascular structures directly from clinically standard T1CE MRI in neuro-oncology patients, addressing a gap in prior work dominated by TOF-MRA-based approaches. NeuroVascU-Net builds on a dilated U-Net and integrates two specialized modules: a Multi-Scale Contextual Feature Fusion ($MSC^2F$) module at the bottleneck and a Cross-Domain Adaptive Feature Fusion ($CDA^2F$) module at deeper hierarchical layers. $MSC^2F$ captures both local and global information via multi-scale dilated convolutions, while $CDA^2F$ dynamically integrates domain-specific features, enhancing representation while keeping computation low. The model was trained and validated on a curated dataset of T1CE scans from 137 brain tumor biopsy patients, annotated by a board-certified functional neurosurgeon. NeuroVascU-Net achieved a Dice score of 0.8609 and precision of 0.8841, accurately segmenting both major and fine vascular structures. Notably, it requires only 12.4M parameters, significantly fewer than transformer-based models such as Swin U-NetR. This balance of accuracy and efficiency positions NeuroVascU-Net as a practical solution for computer-assisted neurosurgical planning.

AutoMAS: A Generic Multi-Agent System for Algorithm Self-Adaptation in Wireless Networks

Authors:Dingli Yuan, Jingchen Peng, Jie Fan, Boxiang Ren, Lu Yang, Peng Liu

Date:2025-11-23 12:00:35

The wireless communication environment has the characteristic of strong dynamics. Conventional wireless networks operate based on the static rules with predefined algorithms, lacking the self-adaptation ability. The rapid development of artificial intelligence (AI) provides a possibility for wireless networks to become more intelligent and fully automated. As such, we plan to integrate the cognitive capability and high intelligence of the emerging AI agents into wireless networks. In this work, we propose AutoMAS, a generic multi-agent system which can autonomously select the most suitable wireless optimization algorithm according to the dynamic wireless environment. Our AutoMAS combines theoretically guaranteed wireless algorithms with agents' perception ability, thereby providing sounder solutions to complex tasks no matter how the environment changes. As an example, we conduct a case study on the classical channel estimation problem, where the mobile user moves in diverse environments with different channel propagation characteristics. Simulation results demonstrate that our AutoMAS can guarantee the highest accuracy in changing scenarios. Similarly, our AutoMAS can be generalized to autonomously handle various tasks in 6G wireless networks with high accuracy.

Wireless Power Transfer and Intent-Driven Network Optimization in AAVs-assisted IoT for 6G Sustainable Connectivity

Authors:Yue Hu, Xiaoming He, Rui Yuan, Shahid Mumtaz

Date:2025-11-23 09:27:24

Autonomous Aerial Vehicle (AAV)-assisted Internet of Things (IoT) represents a collaborative architecture in which AAV allocate resources over 6G links to jointly enhance user-intent interpretation and overall network performance. Owing to this mutual dependence, improvements in intent inference and policy decisions on one component reinforce the efficiency of others, making highly reliable intent prediction and low-latency action execution essential. Although numerous approaches can model intent relationships, they encounter severe obstacles when scaling to high-dimensional action sequences and managing intensive on-board computation. We propose an Intent-Driven Framework for Autonomous Network Optimization comprising prediction and decision modules. First, implicit intent modeling is adopted to mitigate inaccuracies arising from ambiguous user expressions. For prediction, we introduce Hyperdimensional Transformer (HDT), which embeds data into a Hyperdimensional space via Hyperdimensional vector encoding and replaces standard matrix and attention operations with symbolic Hyperdimensional computations. For decision-making, where AAV must respond to user intent while planning trajectories, we design Double Actions based Multi-Agent Proximal Policy Optimization (DA-MAPPO). Building upon MAPPO, it samples actions through two independently parameterized networks and cascades the user-intent network into the trajectory network to maintain action dependencies. We evaluate our framework on a real IoT action dataset with authentic wireless data. Experimental results demonstrate that HDT and DA-MAPPO achieve superior performance across diverse scenarios.

Enhancing UAV Search under Occlusion using Next Best View Planning

Authors:Sigrid Helene Strand, Thomas Wiedemann, Bram Burczek, Dmitriy Shutin

Date:2025-11-23 09:00:33

Search and rescue missions are often critical following sudden natural disasters or in high-risk environmental situations. The most challenging search and rescue missions involve difficult-to-access terrains, such as dense forests with high occlusion. Deploying unmanned aerial vehicles for exploration can significantly enhance search effectiveness, facilitate access to challenging environments, and reduce search time. However, in dense forests, the effectiveness of unmanned aerial vehicles depends on their ability to capture clear views of the ground, necessitating a robust search strategy to optimize camera positioning and perspective. This work presents an optimized planning strategy and an efficient algorithm for the next best view problem in occluded environments. Two novel optimization heuristics, a geometry heuristic, and a visibility heuristic, are proposed to enhance search performance by selecting optimal camera viewpoints. Comparative evaluations in both simulated and real-world settings reveal that the visibility heuristic achieves greater performance, identifying over 90% of hidden objects in simulated forests and offering 10% better detection rates than the geometry heuristic. Additionally, real-world experiments demonstrate that the visibility heuristic provides better coverage under the canopy, highlighting its potential for improving search and rescue missions in occluded environments.

General vs Domain-Specific CNNs: Understanding Pretraining Effects on Brain MRI Tumor Classification

Authors:Helia Abedini, Saba Rahimi, Reza Vaziri

Date:2025-11-23 07:31:41

Brain tumor detection from MRI scans plays a crucial role in early diagnosis and treatment planning. Deep convolutional neural networks (CNNs) have demonstrated strong performance in medical imaging tasks, particularly when pretrained on large datasets. However, it remains unclear which type of pretrained model performs better when only a small dataset is available: those trained on domain-specific medical data or those pretrained on large general datasets. In this study, we systematically evaluate three pretrained CNN architectures for brain tumor classification: RadImageNet DenseNet121 with medical-domain pretraining, EfficientNetV2S, and ConvNeXt-Tiny, which are modern general-purpose CNNs. All models were trained and fine-tuned under identical conditions using a limited-size brain MRI dataset to ensure a fair comparison. Our results reveal that ConvNeXt-Tiny achieved the highest accuracy, followed by EfficientNetV2S, while RadImageNet DenseNet121, despite being pretrained on domain-specific medical data, exhibited poor generalization with lower accuracy and higher loss. These findings suggest that domain-specific pretraining may not generalize well under small-data conditions. In contrast, modern, deeper general-purpose CNNs pretrained on large-scale datasets can offer superior transfer learning performance in specialized medical imaging tasks.

Deep Learning Decision Support System for Open-Pit Mining Optimisation: GPU-Accelerated Planning Under Geological Uncertainty

Authors:Iman Rahimi

Date:2025-11-23 05:27:04

This study presents Part II of an AI-enhanced Decision Support System (DSS), extending Rahimi (2025, Part I) by introducing a fully uncertainty-aware optimization framework for long-term open-pit mine planning. Geological uncertainty is modelled using a Variational Autoencoder (VAE) trained on 50,000 spatial grade samples, enabling the generation of probabilistic, multi-scenario orebody realizations that preserve geological continuity and spatial correlation. These scenarios are optimized through a hybrid metaheuristic engine integrating Genetic Algorithms (GA), Large Neighborhood Search (LNS), Simulated Annealing (SA), and reinforcement-learning-based adaptive control. An ε-constraint relaxation strategy governs the population exploration phase, allowing near-feasible schedule discovery early in the search and gradual tightening toward strict constraint satisfaction. GPU-parallel evaluation enables the simultaneous assessment of 65,536 geological scenarios, achieving near-real-time feasibility analysis. Results demonstrate up to 1.2 million-fold runtime improvement over IBM CPLEX and significantly higher expected NPV under geological uncertainty, confirming the DSS as a scalable and uncertainty-resilient platform for intelligent mine planning.

MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation

Authors:Tao Shen, Xin Wan, Taicai Chen, Rui Zhang, Junwen Pan, Dawei Lu, Fanding Lei, Zhilin Lu, Yunfei Yang, Chen Cheng, Qi She, Chang Liu, Zhenbang Sun

Date:2025-11-23 03:25:39

Unified multimodal models aim to integrate understanding and generation within a single framework, yet bridging the gap between discrete semantic reasoning and high-fidelity visual synthesis remains challenging. We present MammothModa2 (Mammoth2), a unified autoregressive-diffusion (AR-Diffusion) framework designed to effectively couple autoregressive semantic planning with diffusion-based generation. Mammoth2 adopts a serial design: an AR path equipped with generation experts performs global semantic modeling over discrete tokens, while a single-stream Diffusion Transformer (DiT) decoder handles high-fidelity image synthesis. A carefully designed AR-Diffusion feature alignment module combines multi-layer feature aggregation, unified condition encoding, and in-context conditioning to stably align AR's representations with the diffusion decoder's continuous latents. Mammoth2 is trained end-to-end with joint Next-Token Prediction and Flow Matching objectives, followed by supervised fine-tuning and reinforcement learning over both generation and editing. With roughly 60M supervised generation samples and no reliance on pre-trained generators, Mammoth2 delivers strong text-to-image and instruction-based editing performance on public benchmarks, achieving 0.87 on GenEval, 87.2 on DPGBench, and 4.06 on ImgEdit, while remaining competitive with understanding-only backbones (e.g., Qwen3-VL-8B) on multimodal understanding tasks. These results suggest that a carefully coupled AR-Diffusion architecture can provide high-fidelity generation and editing while maintaining strong multimodal comprehension within a single, parameter- and data-efficient model.

EgoVITA: Learning to Plan and Verify for Egocentric Video Reasoning

Authors:Yogesh Kulkarni, Pooyan Fazli

Date:2025-11-23 01:25:17

Reasoning about intentions and actions from a first-person (egocentric) perspective remains a fundamental challenge for multimodal large language models (MLLMs). Unlike third-person (exocentric) videos that capture scenes from an outside observer, egocentric videos reflect the actor's continuously changing viewpoint, introducing partial observability, limited field of view, and self-referenced motion. We introduce $\textbf{EgoVITA}$, a reinforcement learning framework that enables MLLMs to reason through structured planning and verification. Built on Group Relative Policy Optimization (GRPO), EgoVITA alternates between two stages: (1) an $\textbf{egocentric planning phase}$, where the model reasons from a first-person viewpoint to predict a step-by-step plan of future actions, and (2) an $\textbf{exocentric verification phase}$, where it switches to a third-person perspective to check the visual and logical consistency of that plan. Through GRPO, the model learns to make plans that are causally predictive of upcoming visual observations, leading to more coherent and visually grounded reasoning. EgoVITA achieves significant gains on egocentric reasoning tasks, outperforming the baseline Qwen2.5-VL-7B by $\mathbf{+7.7}$ on EgoBlind and $\mathbf{+4.4}$ on EgoOrient, while maintaining strong generalization on exocentric video tasks.

APULSE: A Scalable Hybrid Algorithm for the RCSPP on Large-Scale Dense Graphs

Authors:Nuno Soares, António Grilo

Date:2025-11-23 00:49:33

The resource-constrained shortest path problem (RCSPP) is a fundamental NP-hard optimization challenge with broad applications, from network routing to autonomous navigation. This problem involves finding a path that minimizes a primary cost subject to a budget on a secondary resource. While various RCSPP solvers exist, they often face critical scalability limitations when applied to the large, dense graphs characteristic of complex, real-world scenarios, making them impractical for time-critical planning. This challenge is particularly acute in domains like mission planning for unmanned ground vehicles (UGVs), which demand solutions on large-scale terrain graphs. This paper introduces APULSE, a hybrid label-setting algorithm designed to efficiently solve the RCSPP on such challenging graphs. APULSE integrates a best-first search guided by an A* heuristic with aggressive, Pulse-style pruning mechanisms and a time-bucketing strategy for effective state-space reduction. A computational study, using a large-scale UGV planning scenario, benchmarks APULSE against state-of-the-art algorithms. The results demonstrate that APULSE consistently finds near-optimal solutions while being orders of magnitude faster and more robust, particularly on large problem instances where competing methods fail. This superior scalability establishes APULSE as an effective solution for RCSPP in complex, large-scale environments, enabling capabilities such as interactive decision support and dynamic replanning.