planning - 2026-01-14

RAGShaper: Eliciting Sophisticated Agentic RAG Skills via Automated Data Synthesis

Authors:Zhengwei Tao, Bo Li, Jialong Wu, Guochen Yan, Huanyao Zhang, Jiahao Xu, Haitao Mi, Wentao Zhang
Date:2026-01-13 16:25:07

Agentic Retrieval-Augmented Generation (RAG) empowers large language models to autonomously plan and retrieve information for complex problem-solving. However, the development of robust agents is hindered by the scarcity of high-quality training data that reflects the noise and complexity of real-world retrieval environments. Conventional manual annotation is unscalable and often fails to capture the dynamic reasoning strategies required to handle retrieval failures. To bridge this gap, we introduce RAGShaper, a novel data synthesis framework designed to automate the construction of RAG tasks and robust agent trajectories. RAGShaper incorporates an InfoCurator to build dense information trees enriched with adversarial distractors spanning Perception and Cognition levels. Furthermore, we propose a constrained navigation strategy that forces a teacher agent to confront these distractors, thereby eliciting trajectories that explicitly demonstrate error correction and noise rejection. Comprehensive experiments confirm that models trained on our synthesized corpus significantly outperform existing baselines, exhibiting superior robustness in noise-intensive and complex retrieval tasks.

VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory

Authors:Shaoan Wang, Yuanfei Luo, Xingyu Chen, Aocheng Luo, Dongyue Li, Chang Liu, Sheng Chen, Yangang Zhang, Junzhi Yu
Date:2026-01-13 15:43:43

VLA models have shown promising potential in embodied navigation by unifying perception and planning while inheriting the strong generalization abilities of large VLMs. However, most existing VLA models rely on reactive mappings directly from observations to actions, lacking the explicit reasoning capabilities and persistent memory required for complex, long-horizon navigation tasks. To address these challenges, we propose VLingNav, a VLA model for embodied navigation grounded in linguistic-driven cognition. First, inspired by the dual-process theory of human cognition, we introduce an adaptive chain-of-thought mechanism, which dynamically triggers explicit reasoning only when necessary, enabling the agent to fluidly switch between fast, intuitive execution and slow, deliberate planning. Second, to handle long-horizon spatial dependencies, we develop a visual-assisted linguistic memory module that constructs a persistent, cross-modal semantic memory, enabling the agent to recall past observations to prevent repetitive exploration and infer movement trends for dynamic environments. For the training recipe, we construct Nav-AdaCoT-2.9M, the largest embodied navigation dataset with reasoning annotations to date, enriched with adaptive CoT annotations that induce a reasoning paradigm capable of adjusting both when to think and what to think about. Moreover, we incorporate an online expert-guided reinforcement learning stage, enabling the model to surpass pure imitation learning and to acquire more robust, self-explored navigation behaviors. Extensive experiments demonstrate that VLingNav achieves state-of-the-art performance across a wide range of embodied navigation benchmarks. Notably, VLingNav transfers to real-world robotic platforms in a zero-shot manner, executing various navigation tasks and demonstrating strong cross-domain and cross-task generalization.

Percentile-based probabilistic optimization for systematic and random uncertainties in radiation therapy

Authors:Albin Fredriksson, Erik Engwall, Jenneke de Jong, Johan Sundström
Date:2026-01-13 15:15:43

Geometric uncertainty can degrade treatment quality in radiation therapy. While margins and robust optimization mitigate these effects, they provide only implicit control over clinical goal fulfillment probability. We therefore develop a probabilistic planning framework using a percentile-based optimization function that targets a specified probability of clinical goal fulfillment. Systematic and random uncertainties were explicitly modeled over full treatment courses. A scenario dose approximation method based on interpolation between a fixed set of doses was used, enabling efficient simulation of treatment courses during optimization. The framework was evaluated on a prostate case treated with volumetric-modulated arc therapy (VMAT) and a brain case treated with pencil beam scanning (PBS) proton therapy. Plans were compared to conventional margin-based and worst-case robust optimization using probabilistic evaluation. For the prostate case, probabilistic optimization improved organ at risk (OAR) sparing while maintaining target coverage compared to margin-based planning, increasing average OAR goal fulfillment probability by 13.3 percentage points and reducing 90th percentile OAR doses by an average of 3.5 Gy. For the brain case, probabilistic optimization improved target minimum dose passing probabilities (e.g., 88% vs. 22% for $D_{95}$) and brainstem maximum dose passing probability (70% vs. 30%), while maintaining comparable or improved OAR sparing compared to worst-case optimization. Probabilistic optimization enables explicit and interpretable control over goal fulfillment probabilities. Combining full treatment course modeling with efficient approximate dose calculation, the proposed framework improved the trade-off between target coverage and OAR sparing compared to conventional planning approaches in both photon and proton therapy.
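
The link between a percentile and a goal fulfillment probability can be illustrated with a small sketch. This is not the authors' optimization function; it only evaluates a set of already simulated treatment courses. The dose values, the 54 Gy limit, and the function names are hypothetical, and the nearest-rank percentile definition is an assumption made for the example.

```python
import math

def fulfillment_probability(doses, limit):
    """Fraction of simulated treatment courses whose dose stays at or below the limit."""
    return sum(d <= limit for d in doses) / len(doses)

def percentile(doses, q):
    """Nearest-rank percentile: smallest value with at least q% of the samples at or below it."""
    ordered = sorted(doses)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical OAR maximum doses (Gy) over ten simulated treatment courses.
oar_doses = [50.1, 51.0, 52.3, 52.8, 53.0, 53.4, 53.9, 54.2, 55.0, 56.1]
limit = 54.0  # hypothetical clinical goal: maximum dose <= 54 Gy

prob = fulfillment_probability(oar_doses, limit)
p90 = percentile(oar_doses, 90)
```

Under the nearest-rank definition, the q-th percentile is at or below the limit exactly when at least q% of the scenarios meet the goal, which is why constraining a percentile gives direct control over the fulfillment probability.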

Cities at Play: Improving Equilibria in Urban Neighbourhood Games

Authors:Martin Gairing, Adrian Vetta, Zhanzhan Zhao
Date:2026-01-13 15:14:45

How should cities invest to improve social welfare when individuals respond strategically to local conditions? We model this question using a game-theoretic version of Schelling's bounded neighbourhood model, where agents choose neighbourhoods based on concave, non-monotonic utility functions reflecting local population. While naive improvements may worsen outcomes - analogous to Braess' paradox - we show that carefully designed, small-scale investments can reliably align individual incentives with societal goals. Specifically, modifying utilities at a total cost of at most $0.81 ε^2 \cdot \texttt{opt}$ guarantees that every resulting Nash equilibrium achieves a social welfare of at least $ε\cdot \texttt{opt}$, where $\texttt{opt}$ is the optimum social welfare. Our results formalise how targeted interventions can transform supra-negative outcomes into supra-positive returns, offering new insights into strategic urban planning and decentralised collective behaviour.

Rewriting Video: Text-Driven Reauthoring of Video Footage

Authors:Sitong Wang, Anh Truong, Lydia B. Chilton, Dingzeyu Li
Date:2026-01-13 13:49:05

Video is a powerful medium for communication and storytelling, yet reauthoring existing footage remains challenging. Even simple edits often demand expertise, time, and careful planning, constraining how creators envision and shape their narratives. Recent advances in generative AI suggest a new paradigm: what if editing a video were as straightforward as rewriting text? To investigate this, we present a tech probe and a study on text-driven video reauthoring. Our approach involves two technical contributions: (1) a generative reconstruction algorithm that reverse-engineers video into an editable text prompt, and (2) an interactive probe, Rewrite Kit, that allows creators to manipulate these prompts. A technical evaluation of the algorithm reveals a critical human-AI perceptual gap. A probe study with 12 creators surfaced novel use cases such as virtual reshooting, synthetic continuity, and aesthetic restyling. It also highlighted key tensions around coherence, control, and creative alignment in this new paradigm. Our work contributes empirical insights into the opportunities and challenges of text-driven video reauthoring, offering design implications for future co-creative video tools.

JudgeRLVR: Judge First, Generate Second for Efficient Reasoning

Authors:Jiangshan Duo, Hanyu Li, Hailin Zhang, Yudong Wang, Sujian Li, Liang Zhao
Date:2026-01-13 11:47:42

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for reasoning in Large Language Models. However, optimizing solely for final-answer correctness often drives models into aimless, verbose exploration, where they rely on exhaustive trial-and-error tactics rather than structured planning to reach solutions. While heuristic constraints like length penalties can reduce verbosity, they often truncate essential reasoning steps, creating a difficult trade-off between efficiency and verification. In this paper, we argue that discriminative capability is a prerequisite for efficient generation: by learning to distinguish valid solutions, a model can internalize a guidance signal that prunes the search space. We propose JudgeRLVR, a two-stage judge-then-generate paradigm. In the first stage, we train the model to judge solution responses with verifiable answers. In the second stage, we fine-tune the same model with vanilla generating RLVR initialized from the judge. Compared to vanilla RLVR using the same math-domain training data, JudgeRLVR achieves a better quality-efficiency trade-off for Qwen3-30B-A3B: on in-domain math, it delivers about +3.7 points average accuracy gain with -42% average generation length; on out-of-domain benchmarks, it delivers about +4.5 points average accuracy improvement, demonstrating enhanced generalization.

CoMa: Contextual Massing Generation with Vision-Language Models

Authors:Evgenii Maslov, Valentin Khrulkov, Anastasia Volkova, Anton Gusarov, Andrey Kuznetsov, Ivan Oseledets
Date:2026-01-13 11:44:00

The conceptual design phase in architecture and urban planning, particularly building massing, is complex and heavily reliant on designer intuition and manual effort. To address this, we propose an automated framework for generating building massing based on functional requirements and site context. A primary obstacle to such data-driven methods has been the lack of suitable datasets. Consequently, we introduce the CoMa-20K dataset, a comprehensive collection that includes detailed massing geometries, associated economical and programmatic data, and visual representations of the development site within its existing urban context. We benchmark this dataset by formulating massing generation as a conditional task for Vision-Language Models (VLMs), evaluating both fine-tuned and large zero-shot models. Our experiments reveal the inherent complexity of the task while demonstrating the potential of VLMs to produce context-sensitive massing options. The dataset and analysis establish a foundational benchmark and highlight significant opportunities for future research in data-driven architectural design.

Effective outdoor pathloss prediction: A multi-layer segmentation approach with weighting map

Authors:Yuan Gao, Tao Wen, Wenjing Xie, Jianbo Du, Yong Zeng, Dusit Niyato, Shugong Xu
Date:2026-01-13 11:07:36

Predicting path loss by considering the physical environment is crucial for effective wireless network planning. Traditional methods, such as ray tracing and model-based approaches, often face challenges due to high computational complexity and discrepancies between models and real-world environments. In contrast, deep learning has emerged as a promising alternative, offering accurate path loss predictions with reduced computational complexity. In our research, we introduce a ResNet-based model designed to enhance path loss prediction. We employ innovative techniques to capture key features of the environment by generating transmission (Tx) and reception (Rx) depth maps, as well as a distance map from the geographic data. Recognizing the significant attenuation caused by signal reflection and diffraction, particularly at high frequencies, we have developed a weighting map that emphasizes the areas adjacent to the direct path between Tx and Rx for path loss prediction. Extensive simulations demonstrate that our model outperforms PPNet, RPNet, and Vision Transformer (ViT) by 1.2-3.0 dB on the ITU challenge 2024 and ICASSP 2023 datasets. In addition, the proposed model requires 60% fewer floating point operations (FLOPs) than the benchmarks. Ablation studies further confirm that the inclusion of the weighting map significantly enhances prediction performance.

Large Multimodal Models for Embodied Intelligent Driving: The Next Frontier in Self-Driving?

Authors:Long Zhang, Yuchen Xia
Date:2026-01-13 11:05:12

The advent of Large Multimodal Models (LMMs) offers a promising technology to tackle the limitations of modular design in autonomous driving, which often falters in open-world scenarios requiring sustained environmental understanding and logical reasoning. In addition, embodied artificial intelligence facilitates policy optimization through closed-loop interactions to achieve continuous learning, thereby advancing autonomous driving toward embodied intelligent (EI) driving. However, this capability will be constrained if EI driving relies solely on LMMs without joint decision-making. This article introduces a novel semantics and policy dual-driven hybrid decision framework to tackle this challenge, ensuring continuous learning and joint decision-making. The framework merges LMMs for semantic understanding and cognitive representation with deep reinforcement learning (DRL) for real-time policy optimization. We start by introducing the foundational principles of EI driving and LMMs. Moreover, we examine the emerging opportunities this framework enables, encompassing potential benefits and representative use cases. An experimental case study validates the performance superiority of our framework in completing a lane-change planning task. Finally, several future research directions to empower EI driving are identified to guide subsequent work.

Kantorovich Distance via Spanning Trees: Properties and Algorithms

Authors:Jérémie Bigot, Luis Fredes
Date:2026-01-13 10:02:59

We study optimal transport between probability measures supported on the same finite metric space, where the ground cost is a distance induced by a weighted connected graph. Building on recent work showing that the resulting Kantorovich distance can be expressed as a minimization problem over the set of spanning trees of this underlying graph, we investigate the implications of this reformulation on the construction of an optimal transport plan and a dual potential based on the solution of such an optimization problem. In this setting, we derive an explicit formula for the Kantorovich potential in terms of the imbalanced cumulative mass (a generalization of the cumulative distribution in R) along an optimal spanning tree solving such a minimization problem, under a weak non-degeneracy condition on the pair of measures that guarantees the uniqueness of a dual potential. Our second contribution establishes the existence of an optimal transport plan that can be computed efficiently by a dynamic programming procedure once an optimal spanning tree is known. Finally, we propose a stochastic algorithm based on simulated annealing on the space of spanning trees to compute such an optimal spanning tree. Numerical experiments illustrate the theoretical results and demonstrate the practical relevance of the proposed approach for optimal transport on finite metric spaces.
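
For a single fixed tree, the Kantorovich distance has a classical closed form: each edge contributes its weight times the absolute imbalanced cumulative mass of the subtree hanging below it. The sketch below computes that quantity for one given spanning tree. It does not perform the minimization over spanning trees described in the abstract, and it assumes that the per-tree objective is this standard formula. The function name and the toy path graph are illustrative only.

```python
from collections import defaultdict

def tree_kantorovich(n, edges, mu, nu, root=0):
    """Transport cost between mu and nu on a weighted tree with nodes 0..n-1.

    edges is a list of (u, v, weight) triples forming a tree.
    """
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    # Depth-first traversal from the root; every node appears after its parent.
    parent = {root: None}
    parent_weight = {root: 0.0}
    order = []
    stack = [root]
    while stack:
        u = stack.pop()
        order.append(u)
        for v, w in adj[u]:
            if v not in parent:
                parent[v] = u
                parent_weight[v] = w
                stack.append(v)

    # Imbalanced cumulative mass: (mu - nu) summed over each subtree.
    excess = [mu[i] - nu[i] for i in range(n)]
    total = 0.0
    for u in reversed(order):  # children are processed before their parents
        if parent[u] is not None:
            total += parent_weight[u] * abs(excess[u])
            excess[parent[u]] += excess[u]
    return total

# Toy example: path 0 - 1 - 2 with unit weights, all mass moved from node 0 to node 2.
path_edges = [(0, 1, 1.0), (1, 2, 1.0)]
cost = tree_kantorovich(3, path_edges, [1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
```

On the path graph the unit of mass crosses both edges, so the cost is 2, matching the one-dimensional formula based on cumulative distributions.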

Evaluating the effectiveness of radio frequency interference removal algorithms for single pulse searches

Authors:R. S. Hombal, L. Levin, B. W. Stappers, M. Droog, A. Karastergiou, D. Lumbaa, M. B. Mickaliger, A. Naidu, K. M. Rajwade, J. Sepulveda, B. Shaw, S. Singh, T. Prabu
Date:2026-01-13 09:10:33

Radio Frequency Interference (RFI), the presence of artificial and/or terrestrial signals in astronomical data, poses a great challenge to the search for pulsars and radio transients, such as Rotating Radio Transients (RRATs) and Fast Radio Bursts (FRBs), by obscuring or distorting the signal of interest and resulting in large numbers of erroneous detections. RFI mitigation algorithms aim to remove this interference and improve the chance of detection of transients, but with the growing number of techniques, selecting the most appropriate method for a given survey can be problematic. The choice of method is particularly important in real-time searches planned for next-generation telescopes such as those of the SKAO, where there is no possibility to reprocess the data. In this paper, we explore the algorithm selection problem by injecting pulses into data which simulates several RFI environments. A set of these files is then cleaned using RFI mitigation algorithms and run through a single pulse search pipeline to analyse the recovery of the injected pulses. We examine the recovery of the injected single pulses with an emphasis on a number of cases spanning a range of pulse brightness, width and dispersion measure. The efficacy and side effects of a few popular RFI excision methods, namely IQRM, SKF, and ZDMF are evaluated.
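
As an illustration of one of the excision methods named above, the sketch below assumes that ZDMF refers to the zero-DM filter, which subtracts the mean over all frequency channels from every time sample. The abstract does not expand the acronym, so this reading, the function name, and the toy dynamic spectrum are assumptions.

```python
def zero_dm_filter(dynamic_spectrum):
    """Subtract the per-sample mean over frequency channels.

    dynamic_spectrum is a list of time samples, each a list of channel powers.
    """
    cleaned = []
    for sample in dynamic_spectrum:
        mean = sum(sample) / len(sample)
        cleaned.append([value - mean for value in sample])
    return cleaned

# Toy data: 3 time samples x 4 channels (hypothetical values).
spectrum = [
    [1.0, 1.0, 1.0, 1.0],  # quiet baseline
    [6.0, 6.0, 6.0, 6.0],  # broadband, undispersed RFI spike
    [1.0, 5.0, 1.0, 1.0],  # dispersed pulse seen in a single channel at this instant
]
cleaned = zero_dm_filter(spectrum)
```

The broadband spike is removed entirely, while the narrow signal survives with reduced amplitude and a negative offset in the neighbouring channels, a simple instance of the side effects that an injection-recovery test can quantify.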

AgriAgent: Contract-Driven Planning and Capability-Aware Tool Orchestration in Real-World Agriculture

Authors:Bo Yang, Yu Zhang, Yunkui Chen, Lanfei Feng, Xiao Xu, Nueraili Aierken, Shijian Li
Date:2026-01-13 07:53:09

Intelligent agent systems in real-world agricultural scenarios must handle diverse tasks under multimodal inputs, ranging from lightweight information understanding to complex multi-step execution. However, most existing approaches rely on a unified execution paradigm, which struggles to accommodate large variations in task complexity and incomplete tool availability commonly observed in agricultural environments. To address this challenge, we propose AgriAgent, a two-level agent framework for real-world agriculture. AgriAgent adopts a hierarchical execution strategy based on task complexity: simple tasks are handled through direct reasoning by modality-specific agents, while complex tasks trigger a contract-driven planning mechanism that formulates tasks as capability requirements and performs capability-aware tool orchestration and dynamic tool generation, enabling multi-step and verifiable execution with failure recovery. Experimental results show that AgriAgent achieves higher execution success rates and robustness on complex tasks compared to existing tool-centric agent baselines that rely on unified execution paradigms. All code and data will be released after our work is accepted, to promote reproducible research.

ReCo-KD: Region- and Context-Aware Knowledge Distillation for Efficient 3D Medical Image Segmentation

Authors:Qizhen Lan, Yu-Chun Hsu, Nida Saddaf Khan, Xiaoqian Jiang
Date:2026-01-13 07:44:43

Accurate 3D medical image segmentation is vital for diagnosis and treatment planning, but state-of-the-art models are often too large for clinics with limited computing resources. Lightweight architectures typically suffer significant performance loss. To address these deployment and speed constraints, we propose Region- and Context-aware Knowledge Distillation (ReCo-KD), a training-only framework that transfers both fine-grained anatomical detail and long-range contextual information from a high-capacity teacher to a compact student network. The framework integrates Multi-Scale Structure-Aware Region Distillation (MS-SARD), which applies class-aware masks and scale-normalized weighting to emphasize small but clinically important regions, and Multi-Scale Context Alignment (MS-CA), which aligns teacher-student affinity patterns across feature levels. Implemented on nnU-Net in a backbone-agnostic manner, ReCo-KD requires no custom student design and is easily adapted to other architectures. Experiments on multiple public 3D medical segmentation datasets and a challenging aggregated dataset show that the distilled lightweight model attains accuracy close to the teacher while markedly reducing parameters and inference latency, underscoring its practicality for clinical deployment.

OpenMic: A Multi-Agent-Based Stand-Up Comedy Generation System

Authors:Yuyang Wu, Hanzhong Cao, Jianhao Chen, Yufei Li
Date:2026-01-13 07:26:23

Chinese stand-up comedy generation goes beyond plain text generation, requiring culturally grounded humor, precise timing, stage-performance cues, and implicit multi-step reasoning. Moreover, commonly used Chinese humor datasets are often better suited for humor understanding and evaluation than for long-form stand-up generation, making direct supervision misaligned with the target task. To address these challenges, we present OpenMic, an end-to-end multi-agent system built on AutoGen that transforms a user-provided life topic into a 3-5 minute Chinese stand-up performance and further produces a narrated comedy video. OpenMic orchestrates multiple specialized agents in a multi-round iterative loop-planning to jointly optimize humor, timing, and performability. To mitigate the dataset-task mismatch, we augment generation with retrieval-augmented generation (RAG) for material grounding and idea expansion, and we fine-tune a dedicated JokeWriter to better internalize stand-up-specific setup-punchline structures and long-range callbacks.

An Axiomatic Approach to General Intelligence: SANC(E3) -- Self-organizing Active Network of Concepts with Energy E3

Authors:Daesuk Kwon, Won-gi Paeng
Date:2026-01-13 05:06:07

General intelligence must reorganize experience into internal structures that enable prediction and action under finite resources. Existing systems implicitly presuppose fixed primitive units -- tokens, subwords, pixels, or predefined sensor channels -- thereby bypassing the question of how representational units themselves emerge and stabilize. This paper proposes SANC(E3), an axiomatic framework in which representational units are not given a priori but instead arise as stable outcomes of competitive selection, reconstruction, and compression under finite activation capacity, governed by the explicit minimization of an energy functional E3. SANC(E3) draws a principled distinction between system tokens -- structural anchors such as {here, now, I} and sensory sources -- and tokens that emerge through self-organization during co-occurring events. Five core axioms formalize finite capacity, association from co-occurrence, similarity-based competition, confidence-based stabilization, and the reconstruction-compression-update trade-off. A key feature is a pseudo-memory-mapped I/O mechanism, through which internally replayed Gestalts are processed via the same axiomatic pathway as external sensory input. As a result, perception, imagination, prediction, planning, and action are unified within a single representational and energetic process. From the axioms, twelve propositions are derived, showing that category formation, hierarchical organization, unsupervised learning, and high-level cognitive activities can all be understood as instances of Gestalt completion under E3 minimization.

Hybrid Centralized Distributed Control for Lifelong MAPF over Wireless Connections

Authors:Jinghao Cao, Wanchun Liu, Yonghui Li, Branka Vucetic
Date:2026-01-13 04:38:56

In lifelong multi-agent path finding (MAPF) with many robots, unreliable wireless links and stochastic executions are the norm. Existing approaches typically either rely on centralized planning under idealized communication, or run fully distributed local controllers with fixed communication patterns; they rarely couple communication scheduling with policy learning, and thus struggle when bandwidth is scarce or packets are frequently dropped. We address this joint control-communication problem and propose a hybrid centralized-distributed scheme: a centralized cloud policy sends small residual corrections only when selected, while a lightweight on-board gated recurrent unit (GRU) policy provides a safe default fallback when the wireless connection is unavailable.

Modal Parameter Extraction via Propeller-Driven Vibration Testing

Authors:Gabriele Dessena, Alessandro Pontillo
Date:2026-01-13 01:30:18

Ground Vibration Testing (GVT) supports aircraft certification but often requires lengthy and costly campaigns. Propeller-driven Vibration Testing (PVT) is assessed here as an output-only alternative, in line with Operational Modal Analysis approaches such as Taxi Vibration Testing and Flight Vibration Testing. A cantilever Aluminium 7075-T6 wing spar is instrumented with seven accelerometers and excited by an outboard electric motor and propeller. Seven runs are carried out: a motor-off baseline, five constant-throttle cases, and a manual up-down throttle sweep. The acquired spectra indicate that the dominant resonances remain observable under propeller excitation, while low-throttle conditions introduce narrowband harmonics that may mask structural peaks; the sweep reduces persistent overlap. Modal parameters are identified for the baseline and sweep cases using the Natural Excitation Technique with the Loewner Framework (NExT-LF). The first two modes remain closely matched (Modal Assurance Criterion (MAC) > 0.99), whereas the third mode shows reduced repeatability (MAC = 0.827) and a larger frequency shift, consistent with propeller-induced bending-torsion coupling and non-ideal sweep control. Overall, PVT provides a viable complement to GVT for extracting low-frequency modal information and motivates future work on automated throttle scheduling and coupling-aware test planning.
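
The Modal Assurance Criterion quoted above compares two mode shape vectors through their normalized squared inner product, MAC = |a^H b|^2 / ((a^H a)(b^H b)). A minimal sketch follows; the seven-entry mode shapes are invented for illustration (one entry per accelerometer) and are not data from the paper.

```python
def mac(phi_a, phi_b):
    """Modal Assurance Criterion between two mode shapes (real or complex entries)."""
    cross = sum(complex(a).conjugate() * complex(b) for a, b in zip(phi_a, phi_b))
    norm_a = sum(abs(a) ** 2 for a in phi_a)
    norm_b = sum(abs(b) ** 2 for b in phi_b)
    return abs(cross) ** 2 / (norm_a * norm_b)

# Hypothetical first bending mode at seven sensor locations, baseline vs. sweep run.
baseline = [0.02, 0.08, 0.18, 0.32, 0.50, 0.73, 1.00]
sweep = [0.03, 0.09, 0.17, 0.33, 0.49, 0.74, 0.98]
similarity = mac(baseline, sweep)
```

MAC is 1 for shapes that differ only by a scale factor and 0 for orthogonal shapes, so values above 0.99 indicate nearly identical modes.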

Exploiting DINOv3-Based Self-Supervised Features for Robust Few-Shot Medical Image Segmentation

Authors:Guoping Xu, Jayaram K. Udupa, Weiguo Lu, You Zhang
Date:2026-01-12 23:44:25

Deep learning-based automatic medical image segmentation plays a critical role in clinical diagnosis and treatment planning but remains challenging in few-shot scenarios due to the scarcity of annotated training data. Recently, self-supervised foundation models such as DINOv3, which were trained on large natural image datasets, have shown strong potential for dense feature extraction that can help with the few-shot learning challenge. Yet, their direct application to medical images is hindered by domain differences. In this work, we propose DINO-AugSeg, a novel framework that leverages DINOv3 features to address the few-shot medical image segmentation challenge. Specifically, we introduce WT-Aug, a wavelet-based feature-level augmentation module that enriches the diversity of DINOv3-extracted features by perturbing frequency components, and CG-Fuse, a contextual information-guided fusion module that exploits cross-attention to integrate semantic-rich low-resolution features with spatially detailed high-resolution features. Extensive experiments on six public benchmarks spanning five imaging modalities, including MRI, CT, ultrasound, endoscopy, and dermoscopy, demonstrate that DINO-AugSeg consistently outperforms existing methods under limited-sample conditions. The results highlight the effectiveness of incorporating wavelet-domain augmentation and contextual fusion for robust feature representation, suggesting DINO-AugSeg as a promising direction for advancing few-shot medical image segmentation. Code and data will be made available on https://github.com/apple1986/DINO-AugSeg.

Contact-aware Path Planning for Autonomous Neuroendovascular Navigation

Authors:Aabha Tamhankar, Ron Alterovitz, Ajit S. Puri, Giovanni Pittiglio
Date:2026-01-12 19:24:08

We propose a deterministic and time-efficient contact-aware path planner for neurovascular navigation. The algorithm leverages information from pre- and intra-operative images of the vessels to navigate pre-bent passive tools, by intelligently predicting and exploiting interactions with the anatomy. A kinematic model is derived and employed by the sampling-based planner for tree expansion that utilizes simplified motion primitives. This approach enables fast computation of the feasible path, with negligible loss in accuracy, as demonstrated in diverse and representative anatomies of the vessels. In these anatomical demonstrators, the algorithm shows a 100% convergence rate within 22.8 s in the worst case, with sub-millimeter tracking errors (less than 0.64 mm), and is found effective on anatomical phantoms representative of around 94% of patients.

Video Generation Models in Robotics -- Applications, Research Challenges, Future Directions

Authors:Zhiting Mei, Tenny Yin, Ola Shorinwa, Apurva Badithela, Zhonghe Zheng, Joseph Bruno, Madison Bland, Lihan Zha, Asher Hancock, Jaime Fernández Fisac, Philip Dames, Anirudha Majumdar
Date:2026-01-12 18:57:34

Video generation models have emerged as high-fidelity models of the physical world, capable of synthesizing high-quality videos capturing fine-grained interactions between agents and their environments conditioned on multi-modal user inputs. Their impressive capabilities address many of the long-standing challenges faced by physics-based simulators, driving broad adoption in many problem domains, e.g., robotics. For example, video models enable photorealistic, physically consistent deformable-body simulation without making prohibitive simplifying assumptions, which is a major bottleneck in physics-based simulation. Moreover, video models can serve as foundation world models that capture the dynamics of the world in a fine-grained and expressive way. They thus overcome the limited expressiveness of language-only abstractions in describing intricate physical interactions. In this survey, we provide a review of video models and their applications as embodied world models in robotics, encompassing cost-effective data generation and action prediction in imitation learning, dynamics and rewards modeling in reinforcement learning, visual planning, and policy evaluation. Further, we highlight important challenges hindering the trustworthy integration of video models in robotics, which include poor instruction following, hallucinations such as violations of physics, and unsafe content generation, in addition to fundamental limitations such as significant data curation, training, and inference costs. We present potential future directions to address these open research challenges to motivate research and ultimately facilitate broader applications, especially in safety-critical settings.

Vision-Language Model for Accurate Crater Detection

Authors:Patrick Bauer, Marius Schwinning, Florian Renk, Andreas Weinmann, Hichem Snoussi
Date:2026-01-12 18:08:17

The European Space Agency (ESA), driven by its ambitions for planned lunar missions with the Argonaut lander, has a profound interest in reliable crater detection, since craters pose a risk to safe lunar landings. This task is usually addressed with automated crater detection algorithms (CDA) based on deep learning techniques. It is non-trivial due to the vast number of craters of various sizes and shapes, as well as challenging conditions such as varying illumination and rugged terrain. Therefore, we propose a deep-learning CDA based on the OWLv2 model, which is built on a Vision Transformer that has proven highly effective in various computer vision tasks. For fine-tuning, we utilize a manually labeled dataset from the IMPACT project, which provides crater annotations on high-resolution Lunar Reconnaissance Orbiter Camera Calibrated Data Record images. We insert trainable parameters using a parameter-efficient fine-tuning strategy with Low-Rank Adaptation, and optimize a combined loss function consisting of Complete Intersection over Union (CIoU) for localization and a contrastive loss for classification. We achieve satisfactory visual results, along with a maximum recall of 94.0% and a maximum precision of 73.1% on a test dataset from IMPACT. Our method achieves reliable crater detection across challenging lunar imaging conditions, paving the way for robust crater analysis in future lunar exploration.
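
The localization term mentioned above can be sketched for a single pair of boxes. This follows the commonly published CIoU definition (IoU, minus a normalized center distance, minus an aspect-ratio penalty) and is not taken from the authors' code. The (x1, y1, x2, y2) box format, the function name, and the eps guard are assumptions.

```python
import math

def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection over union.
    inter_w = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    inter_h = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = inter_w * inter_h
    w, h = box[2] - box[0], box[3] - box[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    iou = inter / (w * h + gw * gh - inter + eps)

    # Squared distance between box centers.
    rho2 = ((box[0] + box[2] - gt[0] - gt[2]) / 2) ** 2 \
         + ((box[1] + box[3] - gt[1] - gt[3]) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both.
    enc_w = max(box[2], gt[2]) - min(box[0], gt[0])
    enc_h = max(box[3], gt[3]) - min(box[1], gt[1])
    c2 = enc_w ** 2 + enc_h ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v

perfect = ciou_loss((0, 0, 2, 2), (0, 0, 2, 2))
disjoint = ciou_loss((0, 0, 1, 1), (2, 2, 3, 3))
```

Unlike plain IoU, the loss keeps growing with center distance for non-overlapping boxes, which gives a useful signal for small craters that a prediction misses entirely.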

Structural Approach to Guiding a Present-Biased Agent

Authors:Tatiana Belova, Yuriy Dementiev, Artur Ignatiev, Danil Sagunov
Date:2026-01-12 17:47:38

Time-inconsistent behavior, such as procrastination or abandonment of long-term goals, arises when agents evaluate immediate outcomes disproportionately higher than future ones. This leads to globally suboptimal behavior, where plans are frequently revised or abandoned entirely. In the influential model of Kleinberg and Oren (2014), such behavior is modeled by a present-biased agent navigating a task graph toward a goal, making locally optimal decisions at each step based on discounted future costs. As a result, the agent may repeatedly deviate from initial plans. Recent work by Belova et al. (2024) introduced a two-agent extension of this model, where a fully-aware principal attempts to guide the present-biased agent through a specific set of critical tasks without causing abandonment. This captures a rich class of principal-agent dynamics in behavioral settings. In this paper, we provide a comprehensive algorithmic characterization of this problem. We analyze its computational complexity through the framework of parameterized algorithms, focusing on graph parameters that naturally emerge in this setting, such as treewidth, vertex cover, and feedback vertex set. Our main result is a fixed-parameter tractable algorithm when parameterized by the treewidth of the task graph and the number of distinct (v,t)-path costs. Our algorithm captures several input settings, such as bounded edge costs and restricted task graph structure. We demonstrate that our main result yields efficient algorithms for a number of such configurations. We complement this with tight hardness results that highlight the extreme difficulty of the problem even on the simplest graphs with a bounded number of nodes and constant parameter values, and motivate our choice of parameters. We delineate tractable and intractable regions of the problem landscape, which include answers to open questions of Belova et al. (2024).
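The single-agent behavior described above can be sketched as follows. One common way to formalize present bias is to inflate the immediate edge cost by a factor `bias >= 1` while valuing the rest of the path at its true shortest-path cost, which matches discounting future costs up to scaling. That formalization, the function name `present_biased_path`, and the toy graph are assumptions made for illustration. The sketch assumes an acyclic graph in which the goal is reachable from every node, and it omits rewards, abandonment, and the principal.

```python
from functools import lru_cache

def present_biased_path(graph, start, goal, bias):
    """Simulate a present-biased agent on an acyclic task graph.

    graph maps each node to a dict {successor: edge_cost}.
    Returns the traversed path and the true cost actually paid.
    """
    @lru_cache(maxsize=None)
    def dist(v):
        # True (unbiased) shortest-path cost from v to the goal.
        if v == goal:
            return 0.0
        options = [c + dist(u) for u, c in graph.get(v, {}).items()]
        return min(options) if options else float("inf")

    path, paid, v = [start], 0.0, start
    while v != goal:
        succ = graph[v]
        # Perceived cost: immediate edge inflated by the bias factor,
        # remainder of the path valued at its true cost.
        u = min(succ, key=lambda n: bias * succ[n] + dist(n))
        paid += succ[u]
        path.append(u)
        v = u
    return path, paid

# Hypothetical toy instance: the cheap first step leads to the costlier plan.
tasks = {"s": {"a": 4, "b": 1}, "a": {"t": 4}, "b": {"t": 9}}
unbiased = present_biased_path(tasks, "s", "t", bias=1)
biased = present_biased_path(tasks, "s", "t", bias=3)
```

With `bias=1` the agent follows the true shortest path through `a`. With `bias=3` the low immediate cost of `b` wins and the agent pays more overall.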

On Angels and Demons: Strategic (De)Construction of Dynamic Models

Authors:Davide Catta, Rustam Galimullin, Munyque Mittelmann
Date:2026-01-12 16:19:22

In recent years, there has been growing interest in logics that formalise strategic reasoning about agents capable of modifying the structure of a given model. This line of research has been motivated by applications where a modelled system evolves over time, such as communication networks, security protocols, and multi-agent planning. In this paper, we introduce three logics for reasoning about strategies that modify the topology of weighted graphs. In Strategic Deconstruction Logic, a destructive agent (the demon) removes edges up to a certain cost. In Strategic Construction Logic, a constructive agent (the angel) adds edges within a cost bound. Finally, Strategic Update Logic combines both agents, who may cooperate or compete. We study the expressive power of these logics and the complexity of their model checking problems.

A note on thermodynamics of the production processes

Authors:Vladimir Pokrovskii
Date:2026-01-12 15:02:56

The process of creating goods and services, measured by their value, is considered as a process of creating complexity. This makes it possible to consider the production system as an open thermodynamic system, and to develop a simple heuristic model for the production process. The model includes three production factors: the index of complexity of production equipment (physical capital $K$), human activity (labour $L$), and the substitutive capacity of equipment (substitutive work $P$). The latter is a contribution to economic theory from the thermodynamic approach, which also requires the introduction of technological characteristics of production equipment, such as labour requirement ($\overline{\lambda}$) and energy requirement ($\overline{\varepsilon}$), which indicate the amounts of labour and energy required to operate production equipment. By applying thermodynamic principles to the theory of production, we can understand how labour can be replaced by capital, and derive the production function in four equivalent but different formulations. Two of them are known and used by economists for interpreting production phenomena; the thermodynamic approach gives some foundation for economic theory. The production function allows an unambiguous decomposition of the growth rate of output according to the growth rates of production factors and technological level. The introduction of substitutive work as a factor of production and technological features of capital expands the planning and analysis of production processes.

Pheromone-Focused Ant Colony Optimization algorithm for path planning

Authors:Yi Liu, Hongda Zhang, Zhongxue Gan, Yuning Chen, Ziqing Zhou, Chunlei Meng, Chun Ouyang
Date:2026-01-12 14:44:45

Ant Colony Optimization (ACO) is a prominent swarm intelligence algorithm extensively applied to path planning. However, traditional ACO methods often exhibit shortcomings, such as blind search behavior and slow convergence within complex environments. To address these challenges, this paper proposes the Pheromone-Focused Ant Colony Optimization (PFACO) algorithm, which introduces three key strategies to enhance the problem-solving ability of the ant colony. First, the initial pheromone distribution is concentrated in more promising regions based on the Euclidean distances of nodes to the start and end points, balancing the trade-off between exploration and exploitation. Second, promising solutions are reinforced during colony iterations to intensify pheromone deposition along high-quality paths, accelerating convergence while maintaining solution diversity. Third, a forward-looking mechanism is implemented to penalize redundant path turns, promoting smoother and more efficient solutions. These strategies collectively produce the focused pheromones to guide the ant colony's search, which enhances the global optimization capabilities of the PFACO algorithm, significantly improving convergence speed and solution quality across diverse optimization problems. The experimental results demonstrate that PFACO consistently outperforms comparative ACO algorithms in terms of convergence speed and solution quality.
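The first strategy, concentrating initial pheromone using Euclidean distances to the start and end points, can be sketched as follows. The abstract does not give the exact formula, so the rule here is an assumption: nodes whose detour `d(node, start) + d(node, goal) - d(start, goal)` is small receive more pheromone. The names `initial_pheromone`, `tau_min`, and `tau_max` are hypothetical.

```python
import math

def initial_pheromone(nodes, start, goal, tau_min=0.1, tau_max=1.0):
    """Assign more initial pheromone to nodes near the start-goal line.

    nodes, start, and goal are coordinate tuples. This linear rule is an
    illustrative assumption, not the formula used by PFACO.
    """
    direct = math.dist(start, goal)
    # Extra length incurred by routing through each node (>= 0 by the
    # triangle inequality; clamped to guard against rounding).
    detours = {
        n: max(0.0, math.dist(n, start) + math.dist(n, goal) - direct)
        for n in nodes
    }
    worst = max(detours.values()) or 1.0  # avoid dividing by zero
    return {
        n: tau_max - (tau_max - tau_min) * d / worst
        for n, d in detours.items()
    }

# Hypothetical grid nodes: three on the direct line, one far off it.
tau = initial_pheromone([(0, 0), (1, 0), (2, 0), (1, 3)], (0, 0), (2, 0))
```

Keeping `tau_min` above zero leaves every node reachable by the colony, so the bias toward promising regions does not rule out exploration.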

FlyCo: Foundation Model-Empowered Drones for Autonomous 3D Structure Scanning in Open-World Environments

Authors:Chen Feng, Guiyong Zheng, Tengkai Zhuang, Yongqian Wu, Fangzhan He, Haojia Li, Juepeng Zheng, Shaojie Shen, Boyu Zhou
Date:2026-01-12 14:14:39

Autonomous 3D scanning of open-world target structures via drones remains challenging despite broad applications. Existing paradigms rely on restrictive assumptions or effortful human priors, limiting practicality, efficiency, and adaptability. Recent foundation models (FMs) offer great potential to bridge this gap. This paper investigates a critical research problem: What system architecture can effectively integrate FM knowledge for this task? We answer it with FlyCo, a principled FM-empowered perception-prediction-planning loop enabling fully autonomous, prompt-driven 3D target scanning in diverse unknown open-world environments. FlyCo directly translates low-effort human prompts (text, visual annotations) into precise adaptive scanning flights via three coordinated stages: (1) perception fuses streaming sensor data with vision-language FMs for robust target grounding and tracking; (2) prediction distills FM knowledge and combines multi-modal cues to infer the partially observed target's complete geometry; (3) planning leverages predictive foresight to generate efficient and safe paths with comprehensive target coverage. Building on this, we further design key components to boost open-world target grounding efficiency and robustness, enhance prediction quality in terms of shape accuracy, zero-shot generalization, and temporal stability, and balance long-horizon flight efficiency with real-time computability and online collision avoidance. Extensive challenging real-world and simulation experiments show FlyCo delivers precise scene understanding, high efficiency, and real-time safety, outperforming existing paradigms with lower human effort and verifying the proposed architecture's practicality. Comprehensive ablations validate each component's contribution. FlyCo also serves as a flexible, extensible blueprint, readily leveraging future FM and robotics advances. Code will be released.

ViewMorpher3D: A 3D-aware Diffusion Framework for Multi-Camera Novel View Synthesis in Autonomous Driving

Authors:Farhad G. Zanjani, Hong Cai, Amirhossein Habibian
Date:2026-01-12 13:44:14

Autonomous driving systems rely heavily on multi-view images to ensure accurate perception and robust decision-making. To effectively develop and evaluate perception stacks and planning algorithms, realistic closed-loop simulators are indispensable. While 3D reconstruction techniques such as Gaussian Splatting offer promising avenues for simulator construction, the rendered novel views often exhibit artifacts, particularly in extrapolated perspectives or when available observations are sparse. We introduce ViewMorpher3D, a multi-view image enhancement framework based on image diffusion models, designed to elevate photorealism and multi-view coherence in driving scenes. Unlike single-view approaches, ViewMorpher3D jointly processes a set of rendered views conditioned on camera poses, 3D geometric priors, and temporally adjacent or spatially overlapping reference views. This enables the model to infer missing details, suppress rendering artifacts, and enforce cross-view consistency. Our framework accommodates variable numbers of cameras and flexible reference/target view configurations, making it adaptable to diverse sensor setups. Experiments on real-world driving datasets demonstrate substantial improvements in image quality metrics, effectively reducing artifacts while preserving geometric fidelity.

Data-Driven Stochastic VRP: Integration of Forecast Duration into Optimization for Utility Workforce Management

Authors:Matteo Garbelli
Date:2026-01-12 13:12:46

This paper investigates the integration of machine learning forecasts of intervention durations into a stochastic variant of the Capacitated Vehicle Routing Problem with Time Windows (CVRPTW). In particular, we exploit tree-based gradient boosting (XGBoost) trained on eight years of gas meter maintenance data to produce point predictions and uncertainty estimates, which then drive a multi-objective evolutionary optimization routine. The methodology addresses uncertainty through sub-Gaussian concentration bounds for route-level risk buffers and explicitly accounts for competing operational KPIs through a multi-objective formulation. Empirical analysis of prediction residuals validates the sub-Gaussian assumption underlying the risk model. Empirically, our results show improvements of around 20-25\% in operator utilization and completion rates compared with plans computed using default durations. The integration of uncertainty quantification and risk-aware optimization provides a practical framework for handling stochastic service durations in real-world routing applications.
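A route-level risk buffer from a sub-Gaussian concentration bound can be sketched as follows. It relies on the standard tail bound for a sum of independent sub-Gaussian variables, P(S - E[S] >= t) <= exp(-t^2 / (2 * sum(sigma_i^2))), solved for t at a target overrun probability `delta`. The independence assumption, the function name `route_buffer`, and the sample values are illustrative and not taken from the paper.

```python
import math

def route_buffer(sigmas, delta):
    """Time buffer so that a route overruns with probability at most delta.

    sigmas: sub-Gaussian scale parameters of the per-job duration errors
            (assumed independent), in the same unit as the durations.
    delta:  tolerated overrun probability, 0 < delta < 1.
    """
    variance_proxy = sum(s * s for s in sigmas)
    # Solve exp(-t^2 / (2 * variance_proxy)) = delta for t.
    return math.sqrt(2.0 * variance_proxy * math.log(1.0 / delta))

# Hypothetical route with three jobs whose error scales are in minutes.
predicted = [30.0, 45.0, 20.0]
planned_minutes = sum(predicted) + route_buffer([5.0, 8.0, 4.0], delta=0.05)
```

Because the buffer grows with the square root of the summed variance proxies, pooling uncertainty at route level is less conservative than padding every job separately.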

Anatomy Aware Cascade Network: Bridging Epistemic Uncertainty and Geometric Manifold for 3D Tooth Segmentation

Authors:Bing Yu, Liu Shi, Haitao Wang, Deran Qi, Xiang Cai, Wei Zhong, Qiegen Liu
Date:2026-01-12 12:53:27

Accurate three-dimensional (3D) tooth segmentation from Cone-Beam Computed Tomography (CBCT) is a prerequisite for digital dental workflows. However, achieving high-fidelity segmentation remains challenging due to adhesion artifacts in naturally occluded scans, which are caused by low contrast and indistinct inter-arch boundaries. To address these limitations, we propose the Anatomy Aware Cascade Network (AACNet), a coarse-to-fine framework designed to resolve boundary ambiguity while maintaining global structural consistency. Specifically, we introduce two mechanisms: the Ambiguity Gated Boundary Refiner (AGBR) and the Signed Distance Map guided Anatomical Attention (SDMAA). The AGBR employs an entropy-based gating mechanism to perform targeted feature rectification in high-uncertainty transition zones. Meanwhile, the SDMAA integrates implicit geometric constraints via a signed distance map to enforce topological consistency, preventing the loss of spatial details associated with standard pooling. Experimental results on a dataset of 125 CBCT volumes demonstrate that AACNet achieves a Dice Similarity Coefficient of 90.17\% and a 95\% Hausdorff Distance of 3.63 mm, significantly outperforming state-of-the-art methods. Furthermore, the model exhibits strong generalization on an external dataset with an HD95 of 2.19 mm, validating its reliability for downstream clinical applications such as surgical planning. Code for AACNet is available at https://github.com/shiliu0114/AACNet.
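The Dice Similarity Coefficient reported above is the standard overlap measure 2|A ∩ B| / (|A| + |B|). A minimal sketch, assuming each segmentation is given as a collection of foreground voxel coordinates, is shown below. The function name `dice_coefficient` and the toy voxels are illustrative and unrelated to the AACNet code base.

```python
def dice_coefficient(pred_voxels, target_voxels):
    """Dice overlap between two sets of foreground voxel coordinates."""
    pred, target = set(pred_voxels), set(target_voxels)
    total = len(pred) + len(target)
    if total == 0:
        # Both masks empty: treat as perfect agreement by convention.
        return 1.0
    return 2.0 * len(pred & target) / total

# Hypothetical toy masks sharing two of three voxels each.
pred = [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
truth = [(0, 0, 1), (0, 1, 0), (1, 1, 1)]
score = dice_coefficient(pred, truth)
```

Dice measures volumetric overlap only, which is why boundary-sensitive metrics such as HD95 are usually reported alongside it.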

R3-RECON: Radiance-Field-Free Active Reconstruction via Renderability

Authors:Xiaofeng Jin, Matteo Frosi, Yiran Guo, Matteo Matteucci
Date:2026-01-12 12:37:26

In active reconstruction, an embodied agent must decide where to look next to efficiently acquire views that support high-quality novel-view rendering. Recent work on active view planning for neural rendering largely derives next-best-view (NBV) criteria by backpropagating through radiance fields or estimating information entropy over 3D Gaussian primitives. While effective, these strategies tightly couple view selection to heavy, representation-specific mechanisms and fail to account for the computational and resource constraints required for lightweight online deployment. In this paper, we revisit active reconstruction from a renderability-centric perspective. We propose $\mathbb{R}^{3}$-RECON, a radiance-field-free active reconstruction framework that induces an implicit, pose-conditioned renderability field over SE(3) from a lightweight voxel map. Our formulation aggregates per-voxel online observation statistics into a unified scalar renderability score that is cheap to update and can be queried in closed form at arbitrary candidate viewpoints in milliseconds, without requiring gradients or radiance-field training. This renderability field is strongly correlated with image-space reconstruction error, naturally guiding NBV selection. We further introduce a panoramic extension that estimates omnidirectional (360$^\circ$) view utility to accelerate candidate evaluation. On the standard indoor Replica dataset, $\mathbb{R}^{3}$-RECON achieves more uniform novel-view quality and higher 3D Gaussian splatting (3DGS) reconstruction accuracy than recent active GS baselines with matched view and time budgets.