planning - 2026-01-08

Prediction Intervals for Interim Events in Randomized Clinical Trials with Time-to-Event Endpoints

Authors:Edoardo Ratti, Federico L. Perlino, Stefania Galimberti, Maria G. Valsecchi

Date:2026-01-07 18:58:45

Time-to-event endpoints are central to evaluate treatment efficacy across many disease areas. Many trial protocols include interim analyses within group-sequential designs that control type I error via spending functions or boundary methods. The corresponding operating characteristics depend on the number of looks and the information accrued. Planning interim analyses with time-to-event endpoints is challenging because statistical information depends on the number of observed events. Ensuring adequate follow-up to accrue the required events is therefore critical, making interim prediction of information at scheduled looks and at the final analysis essential. While several methods have been developed to predict the calendar time required to reach a target number of events, to the best of our knowledge there is no established framework that addresses the prediction of the number of events at a future date with corresponding prediction intervals. Starting from an prediction interval approach originally developed in reliability engineering for the number of future component failures, we reformulated and extended it to the context of interim monitoring in clinical trials. This adaptation yields a general framework for event-count prediction intervals in the clinical setting, taking the patient as the unit of analysis and accommodating a range of parametric survival models, patient-level covariates, stagged entry and possible dependence between entry dates and lost to follow-up. Prediction intervals are obtained in a frequentist framework from a bootstrap estimator of the conditional distribution of future events. The performance of the proposed approach is investigated via simulation studies and illustrated by analyzing a real-world phase III trial in childhood acute lymphoblastic leukaemia.

Wow, wo, val! A Comprehensive Embodied World Model Evaluation Turing Test

Authors:Chun-Kai Fan, Xiaowei Chi, Xiaozhu Ju, Hao Li, Yong Bao, Yu-Kai Wang, Lizhang Chen, Zhiyuan Jiang, Kuangzhi Ge, Ying Li, Weishi Mi, Qingpo Wuwu, Peidong Jia, Yulin Luo, Kevin Zhang, Zhiyuan Qin, Yong Dai, Sirui Han, Yike Guo, Shanghang Zhang, Jian Tang

Date:2026-01-07 17:50:37

As world models gain momentum in Embodied AI, an increasing number of works explore using video foundation models as predictive world models for downstream embodied tasks like 3D prediction or interactive generation. However, before exploring these downstream tasks, video foundation models still have two critical questions unanswered: (1) whether their generative generalization is sufficient to maintain perceptual fidelity in the eyes of human observers, and (2) whether they are robust enough to serve as a universal prior for real-world embodied agents. To provide a standardized framework for answering these questions, we introduce the Embodied Turing Test benchmark: WoW-World-Eval (Wow,wo,val). Building upon 609 robot manipulation data, Wow-wo-val examines five core abilities, including perception, planning, prediction, generalization, and execution. We propose a comprehensive evaluation protocol with 22 metrics to assess the models' generation ability, which achieves a high Pearson Correlation between the overall score and human preference (>0.93) and establishes a reliable foundation for the Human Turing Test. On Wow-wo-val, models achieve only 17.27 on long-horizon planning and at best 68.02 on physical consistency, indicating limited spatiotemporal consistency and physical reasoning. For the Inverse Dynamic Model Turing Test, we first use an IDM to evaluate the video foundation models' execution accuracy in the real world. However, most models collapse to $\approx$ 0% success, while WoW maintains a 40.74% success rate. These findings point to a noticeable gap between the generated videos and the real world, highlighting the urgency and necessity of benchmarking World Model in Embodied AI.

Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion

Authors:Yuanfeng Xu, Yuhao Chen, Liang Lin, Guangrun Wang

Date:2026-01-07 16:21:19

The bifurcation of generative modeling into autoregressive approaches for discrete data (text) and diffusion approaches for continuous data (images) hinders the development of truly unified multimodal systems. While Masked Language Models (MLMs) offer efficient bidirectional context, they traditionally lack the generative fidelity of autoregressive models and the semantic continuity of diffusion models. Furthermore, extending masked generation to multimodal settings introduces severe alignment challenges and training instability. In this work, we propose \textbf{CoM-DAD} (\textbf{Co}upled \textbf{M}anifold \textbf{D}iscrete \textbf{A}bsorbing \textbf{D}iffusion), a novel probabilistic framework that reformulates multimodal generation as a hierarchical dual-process. CoM-DAD decouples high-level semantic planning from low-level token synthesis. First, we model the semantic manifold via a continuous latent diffusion process; second, we treat token generation as a discrete absorbing diffusion process, regulated by a \textbf{Variable-Rate Noise Schedule}, conditioned on these evolving semantic priors. Crucially, we introduce a \textbf{Stochastic Mixed-Modal Transport} strategy that aligns disparate modalities without requiring heavy contrastive dual-encoders. Our method demonstrates superior stability over standard masked modeling, establishing a new paradigm for scalable, unified text-image generation.

Stage-specific cancer survival prediction enriched by explainable machine learning

Authors:Parisa Poorhasani, Bogdan Iancu

Date:2026-01-07 14:44:04

Despite the fact that cancer survivability rates vary greatly between stages, traditional survival prediction models have frequently been trained and assessed using examples from all combined phases of the disease. This method may result in an overestimation of performance and ignore the stage-specific variations. Using the SEER dataset, we created and verified explainable machine learning (ML) models to predict stage-specific cancer survivability in colorectal, stomach, and liver cancers. ML-based cancer survival analysis has been a long-standing topic in the literature; however, studies involving the explainability and transparency of ML survivability models are limited. Our use of explainability techniques, including SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME), enabled us to illustrate significant feature-cancer stage interactions that would have remained hidden in traditional black-box models. We identified how certain demographic and clinical variables influenced survival differently across cancer stages and types. These insights provide not only transparency but also clinical relevance, supporting personalized treatment planning. By focusing on stage-specific models, this study provides new insights into the most important factors at each stage of cancer, offering transparency and potential clinical relevance to support personalized treatment planning.

CoINS: Counterfactual Interactive Navigation via Skill-Aware VLM

Authors:Kangjie Zhou, Zhejia Wen, Zhiyong Zhuo, Zike Yan, Pengying Wu, Ieng Hou U, Shuaiyang Li, Han Gao, Kang Ding, Wenhan Cao, Wei Pan, Chang Liu

Date:2026-01-07 14:10:46

Recent Vision-Language Models (VLMs) have demonstrated significant potential in robotic planning. However, they typically function as semantic reasoners, lacking an intrinsic understanding of the specific robot's physical capabilities. This limitation is particularly critical in interactive navigation, where robots must actively modify cluttered environments to create traversable paths. Existing VLM-based navigators are predominantly confined to passive obstacle avoidance, failing to reason about when and how to interact with objects to clear blocked paths. To bridge this gap, we propose Counterfactual Interactive Navigation via Skill-aware VLM (CoINS), a hierarchical framework that integrates skill-aware reasoning and robust low-level execution. Specifically, we fine-tune a VLM, named InterNav-VLM, which incorporates skill affordance and concrete constraint parameters into the input context and grounds them into a metric-scale environmental representation. By internalizing the logic of counterfactual reasoning through fine-tuning on the proposed InterNav dataset, the model learns to implicitly evaluate the causal effects of object removal on navigation connectivity, thereby determining interaction necessity and target selection. To execute the generated high-level plans, we develop a comprehensive skill library through reinforcement learning, specifically introducing traversability-oriented strategies to manipulate diverse objects for path clearance. A systematic benchmark in Isaac Sim is proposed to evaluate both the reasoning and execution aspects of interactive navigation. Extensive simulations and real-world experiments demonstrate that CoINS significantly outperforms representative baselines, achieving a 17\% higher overall success rate and over 80\% improvement in complex long-horizon scenarios compared to the best-performing baseline

Towards Safe Autonomous Driving: A Real-Time Motion Planning Algorithm on Embedded Hardware

Authors:Korbinian Moller, Glenn Johannes Tungka, Lucas Jürgens, Johannes Betz

Date:2026-01-07 13:14:41

Ensuring the functional safety of Autonomous Vehicles (AVs) requires motion planning modules that not only operate within strict real-time constraints but also maintain controllability in case of system faults. Existing safeguarding concepts, such as Online Verification (OV), provide safety layers that detect infeasible planning outputs. However, they lack an active mechanism to ensure safe operation in the event that the main planner fails. This paper presents a first step toward an active safety extension for fail-operational Autonomous Driving (AD). We deploy a lightweight sampling-based trajectory planner on an automotive-grade, embedded platform running a Real-Time Operating System (RTOS). The planner continuously computes trajectories under constrained computational resources, forming the foundation for future emergency planning architectures. Experimental results demonstrate deterministic timing behavior with bounded latency and minimal jitter, validating the feasibility of trajectory planning on safety-certifiable hardware. The study highlights both the potential and the remaining challenges of integrating active fallback mechanisms as an integral part of next-generation safeguarding frameworks. The code is available at: https://github.com/TUM-AVS/real-time-motion-planning

I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing

Authors:Jinghan Yu, Junhao Xiao, Chenyu Zhu, Jiaming Li, Jia Li, HanMing Deng, Xirui Wang, Guoli Jia, Jianjun Li, Zhiyuan Ma, Xiang Bai, Bowen Zhou

Date:2026-01-07 09:29:57

Existing text-guided image editing methods primarily rely on end-to-end pixel-level inpainting paradigm. Despite its success in simple scenarios, this paradigm still significantly struggles with compositional editing tasks that require precise local control and complex multi-object spatial reasoning. This paradigm is severely limited by 1) the implicit coupling of planning and execution, 2) the lack of object-level control granularity, and 3) the reliance on unstructured, pixel-centric modeling. To address these limitations, we propose I2E, a novel "Decompose-then-Action" paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions into a series of atomic actions via Chain-of-Thought reasoning. Further, we also construct I2E-Bench, a benchmark designed for multi-instance spatial reasoning and high-precision editing. Experimental results on I2E-Bench and multiple public benchmarks demonstrate that I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.

VideoMemory: Toward Consistent Video Generation via Memory Integration

Authors:Jinsong Zhou, Yihua Du, Xinli Xu, Luozhou Wang, Zijie Zhuang, Yehang Zhang, Shuaibo Li, Xiaojun Hu, Bolan Su, Ying-cong Chen

Date:2026-01-07 07:10:32

Maintaining consistent characters, props, and environments across multiple shots is a central challenge in narrative video generation. Existing models can produce high-quality short clips but often fail to preserve entity identity and appearance when scenes change or when entities reappear after long temporal gaps. We present VideoMemory, an entity-centric framework that integrates narrative planning with visual generation through a Dynamic Memory Bank. Given a structured script, a multi-agent system decomposes the narrative into shots, retrieves entity representations from memory, and synthesizes keyframes and videos conditioned on these retrieved states. The Dynamic Memory Bank stores explicit visual and semantic descriptors for characters, props, and backgrounds, and is updated after each shot to reflect story-driven changes while preserving identity. This retrieval-update mechanism enables consistent portrayal of entities across distant shots and supports coherent long-form generation. To evaluate this setting, we construct a 54-case multi-shot consistency benchmark covering character-, prop-, and background-persistent scenarios. Extensive experiments show that VideoMemory achieves strong entity-level coherence and high perceptual quality across diverse narrative sequences.

MFC-RFNet: A Multi-scale Guided Rectified Flow Network for Radar Sequence Prediction

Authors:Wenjie Luo, Chuanhu Deng, Chaorong Li, Rongyao Deng, Qiang Yang

Date:2026-01-07 06:24:26

Accurate and high-resolution precipitation nowcasting from radar echo sequences is crucial for disaster mitigation and economic planning, yet it remains a significant challenge. Key difficulties include modeling complex multi-scale evolution, correcting inter-frame feature misalignment caused by displacement, and efficiently capturing long-range spatiotemporal context without sacrificing spatial fidelity. To address these issues, we present the Multi-scale Feature Communication Rectified Flow (RF) Network (MFC-RFNet), a generative framework that integrates multi-scale communication with guided feature fusion. To enhance multi-scale fusion while retaining fine detail, a Wavelet-Guided Skip Connection (WGSC) preserves high-frequency components, and a Feature Communication Module (FCM) promotes bidirectional cross-scale interaction. To correct inter-frame displacement, a Condition-Guided Spatial Transform Fusion (CGSTF) learns spatial transforms from conditioning echoes to align shallow features. The backbone adopts rectified flow training to learn near-linear probability-flow trajectories, enabling few-step sampling with stable fidelity. Additionally, lightweight Vision-RWKV (RWKV) blocks are placed at the encoder tail, the bottleneck, and the first decoder layer to capture long-range spatiotemporal dependencies at low spatial resolutions with moderate compute. Evaluations on four public datasets (SEVIR, MeteoNet, Shanghai, and CIKM) demonstrate consistent improvements over strong baselines, yielding clearer echo morphology at higher rain-rate thresholds and sustained skill at longer lead times. These results suggest that the proposed synergy of RF training with scale-aware communication, spatial alignment, and frequency-aware fusion presents an effective and robust approach for radar-based nowcasting.

Locomotion Beyond Feet

Authors:Tae Hoon Yang, Haochen Shi, Jiacheng Hu, Zhicong Zhang, Daniel Jiang, Weizhuo Wang, Yao He, Zhen Wu, Yuming Chen, Yifan Hou, Monroe Kennedy, Shuran Song, C. Karen Liu

Date:2026-01-07 05:36:39

Most locomotion methods for humanoid robots focus on leg-based gaits, yet natural bipeds frequently rely on hands, knees, and elbows to establish additional contacts for stability and support in complex environments. This paper introduces Locomotion Beyond Feet, a comprehensive system for whole-body humanoid locomotion across extremely challenging terrains, including low-clearance spaces under chairs, knee-high walls, knee-high platforms, and steep ascending and descending stairs. Our approach addresses two key challenges: contact-rich motion planning and generalization across diverse terrains. To this end, we combine physics-grounded keyframe animation with reinforcement learning. Keyframes encode human knowledge of motor skills, are embodiment-specific, and can be readily validated in simulation or on hardware, while reinforcement learning transforms these references into robust, physically accurate motions. We further employ a hierarchical framework consisting of terrain-specific motion-tracking policies, failure recovery mechanisms, and a vision-based skill planner. Real-world experiments demonstrate that Locomotion Beyond Feet achieves robust whole-body locomotion and generalizes across obstacle sizes, obstacle instances, and terrain sequences.

A Vision-Language-Action Model with Visual Prompt for OFF-Road Autonomous Driving

Authors:Liangdong Zhang, Yiming Nie, Haoyang Li, Fanjie Kong, Baobao Zhang, Shunxin Huang, Kai Fu, Chen Min, Liang Xiao

Date:2026-01-07 02:08:18

Efficient trajectory planning in off-road terrains presents a formidable challenge for autonomous vehicles, often necessitating complex multi-step pipelines. However, traditional approaches exhibit limited adaptability in dynamic environments. To address these limitations, this paper proposes OFF-EMMA, a novel end-to-end multimodal framework designed to overcome the deficiencies of insufficient spatial perception and unstable reasoning in visual-language-action (VLA) models for off-road autonomous driving scenarios. The framework explicitly annotates input images through the design of a visual prompt block and introduces a chain-of-thought with self-consistency (COT-SC) reasoning strategy to enhance the accuracy and robustness of trajectory planning. The visual prompt block utilizes semantic segmentation masks as visual prompts, enhancing the spatial understanding ability of pre-trained visual-language models for complex terrains. The COT- SC strategy effectively mitigates the error impact of outliers on planning performance through a multi-path reasoning mechanism. Experimental results on the RELLIS-3D off-road dataset demonstrate that OFF-EMMA significantly outperforms existing methods, reducing the average L2 error of the Qwen backbone model by 13.3% and decreasing the failure rate from 16.52% to 6.56%.

Evolving Programmatic Skill Networks

Authors:Haochen Shi, Xingdi Yuan, Bang Liu

Date:2026-01-07 01:43:25

We study continual skill acquisition in open-ended embodied environments where an agent must construct, refine, and reuse an expanding library of executable skills. We introduce the Programmatic Skill Network (PSN), a framework in which skills are executable symbolic programs forming a compositional network that evolves through experience. PSN defines three core mechanisms instantiated via large language models: (1)REFLECT for structured fault localization over skill compositions, (2) progressive optimization with maturity-aware update gating that stabilizes reliable skills while maintaining plasticity for uncertain ones, and (3) canonical structural refactoring under rollback validation that maintains network compactness. We further show that PSN's learning dynamics exhibit structural parallels to neural network training. Experiments on MineDojo and Crafter demonstrate robust skill reuse, rapid adaptation, and strong generalization across open-ended task distributions.\footnote{We plan to open-source the code.

Online Decision-Making Under Uncertainty for Vehicle-to-Building Systems

Authors:Rishav Sen, Yunuo Zhang, Fangqi Liu, Jose Paolo Talusan, Ava Pettet, Yoshinori Suzue, Ayan Mukhopadhyay, Abhishek Dubey

Date:2026-01-07 00:05:45

Vehicle-to-building (V2B) systems integrate physical infrastructures, such as smart buildings and electric vehicles (EVs) connected to chargers at the building, with digital control mechanisms to manage energy use. By utilizing EVs as flexible energy reservoirs, buildings can dynamically charge and discharge them to optimize energy use and cut costs under time-variable pricing and demand charge policies. This setup leads to the V2B optimization problem, where buildings coordinate EV charging and discharging to minimize total electricity costs while meeting users' charging requirements. However, the V2B optimization problem is challenging because of: (1) fluctuating electricity pricing, which includes both energy charges ($/kWh) and demand charges ($/kW); (2) long planning horizons (typically over 30 days); (3) heterogeneous chargers with varying charging rates, controllability, and directionality (i.e., unidirectional or bidirectional); and (4) user-specific battery levels at departure to ensure user requirements are met. In contrast to existing approaches that often model this setting as a single-shot combinatorial optimization problem, we highlight critical limitations in prior work and instead model the V2B optimization problem as a Markov decision process (MDP), i.e., a stochastic control process. Solving the resulting MDP is challenging due to the large state and action spaces. To address the challenges of the large state space, we leverage online search, and we counter the action space by using domain-specific heuristics to prune unpromising actions. We validate our approach in collaboration with Nissan Advanced Technology Center - Silicon Valley. Using data from their EV testbed, we show that the proposed framework significantly outperforms state-of-the-art methods.

ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing

Authors:Hengjia Li, Liming Jiang, Qing Yan, Yizhi Song, Hao Kang, Zichuan Liu, Xin Lu, Boxi Wu, Deng Cai

Date:2026-01-06 23:43:00

Instruction-driven image editing with unified multimodal generative models has advanced rapidly, yet their underlying visual reasoning remains limited, leading to suboptimal performance on reasoning-centric edits. Reinforcement learning (RL) has been investigated for improving the quality of image editing, but it faces three key challenges: (1) limited reasoning exploration confined to denoising stochasticity, (2) biased reward fusion, and (3) unstable VLM-based instruction rewards. In this work, we propose ThinkRL-Edit, a reasoning-centric RL framework that decouples visual reasoning from image synthesis and expands reasoning exploration beyond denoising. To the end, we introduce Chain-of-Thought (CoT)-based reasoning sampling with planning and reflection stages prior to generation in online sampling, compelling the model to explore multiple semantic hypotheses and validate their plausibility before committing to a visual outcome. To avoid the failures of weighted aggregation, we propose an unbiased chain preference grouping strategy across multiple reward dimensions. Moreover, we replace interval-based VLM scores with a binary checklist, yielding more precise, lower-variance, and interpretable rewards for complex reasoning. Experiments show our method significantly outperforms prior work on reasoning-centric image editing, producing instruction-faithful, visually coherent, and semantically grounded edits.

FROST-Drive: Scalable and Efficient End-to-End Driving with a Frozen Vision Encoder

Authors:Zeyu Dong, Yimin Zhu, Yu Wu, Yu Sun

Date:2026-01-06 23:13:35

End-to-end (E2E) models in autonomous driving aim to directly map sensor inputs to control commands, but their ability to generalize to novel and complex scenarios remains a key challenge. The common practice of fully fine-tuning the vision encoder on driving datasets potentially limits its generalization by causing the model to specialize too heavily in the training data. This work challenges the necessity of this training paradigm. We propose FROST-Drive, a novel E2E architecture designed to preserve and leverage the powerful generalization capabilities of a pretrained vision encoder from a Vision-Language Model (VLM). By keeping the encoder's weights frozen, our approach directly transfers the rich, generalized world knowledge from the VLM to the driving task. Our model architecture combines this frozen encoder with a transformer-based adapter for multimodal fusion and a GRU-based decoder for smooth waypoint generation. Furthermore, we introduce a custom loss function designed to directly optimize for Rater Feedback Score (RFS), a metric that prioritizes robust trajectory planning. We conduct extensive experiments on Waymo Open E2E Dataset, a large-scale datasets deliberately curated to capture the long-tail scenarios, demonstrating that our frozen-encoder approach significantly outperforms models that employ full fine-tuning. Our results provide substantial evidence that preserving the broad knowledge of a capable VLM is a more effective strategy for achieving robust, generalizable driving performance than intensive domain-specific adaptation. This offers a new pathway for developing vision-based models that can better handle the complexities of real-world application domains.

Weather-Aware Transformer for Real-Time Route Optimization in Drone-as-a-Service Operations

Authors:Kamal Mohamed, Lillian Wassim, Ali Hamdi, Khaled Shaban

Date:2026-01-06 19:23:15

This paper presents a novel framework to accelerate route prediction in Drone-as-a-Service operations through weather-aware deep learning models. While classical path-planning algorithms, such as A* and Dijkstra, provide optimal solutions, their computational complexity limits real-time applicability in dynamic environments. We address this limitation by training machine learning and deep learning models on synthetic datasets generated from classical algorithm simulations. Our approach incorporates transformer-based and attention-based architectures that utilize weather heuristics to predict optimal next-node selections while accounting for meteorological conditions affecting drone operations. The attention mechanisms dynamically weight environmental factors including wind patterns, wind bearing, and temperature to enhance routing decisions under adverse weather conditions. Experimental results demonstrate that our weather-aware models achieve significant computational speedup over traditional algorithms while maintaining route optimization performance, with transformer-based architectures showing superior adaptation to dynamic environmental constraints. The proposed framework enables real-time, weather-responsive route optimization for large-scale DaaS operations, representing a substantial advancement in the efficiency and safety of autonomous drone systems.

A Versatile Multimodal Agent for Multimedia Content Generation

Authors:Daoan Zhang, Wenlin Yao, Xiaoyang Wang, Yebowen Hu, Jiebo Luo, Dong Yu

Date:2026-01-06 18:49:47

With the advancement of AIGC (AI-generated content) technologies, an increasing number of generative models are revolutionizing fields such as video editing, music generation, and even film production. However, due to the limitations of current AIGC models, most models can only serve as individual components within specific application scenarios and are not capable of completing tasks end-to-end in real-world applications. In real-world applications, editing experts often work with a wide variety of images and video inputs, producing multimodal outputs -- a video typically includes audio, text, and other elements. This level of integration across multiple modalities is something current models are unable to achieve effectively. However, the rise of agent-based systems has made it possible to use AI tools to tackle complex content generation tasks. To deal with the complex scenarios, in this paper, we propose a MultiMedia-Agent designed to automate complex content creation. Our agent system includes a data generation pipeline, a tool library for content creation, and a set of metrics for evaluating preference alignment. Notably, we introduce the skill acquisition theory to model the training data curation and agent training. We designed a two-stage correlation strategy for plan optimization, including self-correlation and model preference correlation. Additionally, we utilized the generated plans to train the MultiMedia-Agent via a three stage approach including base/success plan finetune and preference optimization. The comparison results demonstrate that the our approaches are effective and the MultiMedia-Agent can generate better multimedia content compared to novel models.

A High-Fidelity Digital Twin for Robotic Manipulation Based on 3D Gaussian Splatting

Authors:Ziyang Sun, Lingfan Bao, Tianhu Peng, Jingcheng Sun, Chengxu Zhou

Date:2026-01-06 17:29:10

Developing high-fidelity, interactive digital twins is crucial for enabling closed-loop motion planning and reliable real-world robot execution, which are essential to advancing sim-to-real transfer. However, existing approaches often suffer from slow reconstruction, limited visual fidelity, and difficulties in converting photorealistic models into planning-ready collision geometry. We present a practical framework that constructs high-quality digital twins within minutes from sparse RGB inputs. Our system employs 3D Gaussian Splatting (3DGS) for fast, photorealistic reconstruction as a unified scene representation. We enhance 3DGS with visibility-aware semantic fusion for accurate 3D labelling and introduce an efficient, filter-based geometry conversion method to produce collision-ready models seamlessly integrated with a Unity-ROS2-MoveIt physics engine. In experiments with a Franka Emika Panda robot performing pick-and-place tasks, we demonstrate that this enhanced geometric accuracy effectively supports robust manipulation in real-world trials. These results demonstrate that 3DGS-based digital twins, enriched with semantic and geometric consistency, offer a fast, reliable, and scalable path from perception to manipulation in unstructured environments.

Subjective-Objective Median-based Importance Technique (SOMIT) to Aid Multi-Criteria Renewable Energy Evaluation

Authors:Ding Ding, Yang Li, Poh Ling Neo, Zhiyuan Wang, Chongwu Xia

Date:2026-01-06 17:02:13

Accelerating the renewable energy transition requires informed decision-making that accounts for the diverse financial, technical, environmental, and social trade-offs across different renewable energy technologies. A critical step in this multi-criteria decision-making (MCDM) process is the determination of appropriate criteria weights. However, deriving these weights often solely involves either subjective assessment from decision-makers or objective weighting methods, each of which has limitations in terms of cognitive burden, potential bias, and insufficient contextual relevance. This study proposes the subjective-objective median-based importance technique (SOMIT), a novel hybrid approach for determining criteria weights in MCDM. By tailoring SOMIT to renewable energy evaluation, the method directly supports applied energy system planning, policy analysis, and technology prioritization under carbon neutrality goals. The practical utility of SOMIT is demonstrated through two MCDM case studies on renewable energy decision-making in India and Saudi Arabia. Using the derived weights from SOMIT, the TOPSIS method ranks the renewable energy alternatives, with solar power achieving the highest performance scores in both cases. The main contributions of this work are five-fold: 1) the proposed SOMIT reduces the number of required subjective comparisons from the conventional quadratic order to a linear order; 2) SOMIT is more robust to outliers in the alternatives-criteria matrix (ACM); 3) SOMIT balances subjective expert knowledge with objective data-driven insights, thereby mitigating bias; 4) SOMIT is inherently modular, allowing both its individual parts and the complete approach to be seamlessly coupled with a wide range of MCDM methods commonly applied in energy systems and policy analysis; 5) a dedicated Python library, pysomit, is developed for SOMIT.

Unified Thinker: A General Reasoning Modular Core for Image Generation

Authors:Sashuai Zhou, Qiang Zhou, Jijin Hu, Hanqing Yang, Yue Cao, Junpeng Ma, Yinchao Ma, Jun Song, Tiezheng Ge, Cheng Yu, Bo Zheng, Zhou Zhao

Date:2026-01-06 15:59:33

Despite impressive progress in high-fidelity image synthesis, generative models still struggle with logic-intensive instruction following, exposing a persistent reasoning--execution gap. Meanwhile, closed-source systems (e.g., Nano Banana) have demonstrated strong reasoning-driven image generation, highlighting a substantial gap to current open-source models. We argue that closing this gap requires not merely better visual generators, but executable reasoning: decomposing high-level intents into grounded, verifiable plans that directly steer the generative process. To this end, we propose Unified Thinker, a task-agnostic reasoning architecture for general image generation, designed as a unified planning core that can plug into diverse generators and workflows. Unified Thinker decouples a dedicated Thinker from the image Generator, enabling modular upgrades of reasoning without retraining the entire generative model. We further introduce a two-stage training paradigm: we first build a structured planning interface for the Thinker, then apply reinforcement learning to ground its policy in pixel-level feedback, encouraging plans that optimize visual correctness over textual plausibility. Extensive experiments on text-to-image generation and image editing show that Unified Thinker substantially improves image reasoning and generation quality.

Fast Surrogate Models for Adaptive Aircraft Trajectory Prediction in En route Airspace

Authors:Nick Pepper, Marc Thomas, Zack Xuereb Conti

Date:2026-01-06 15:00:47

Trajectory prediction (TP) is crucial for ensuring safety and efficiency in modern air traffic management systems. It is, for example, a core component of conflict detection and resolution tools, arrival sequencing algorithms, capacity planning, as well as several future concepts. However, TP accuracy within operational systems is hampered by a range of epistemic uncertainties such as the mass and performance settings of aircraft and the effect of meteorological conditions on aircraft performance. It can also require considerable computational resources. This paper proposes a method for adaptive TP that has two components: first, a fast surrogate TP model based on linear state space models (LSSM)s with an execution time that was 6.7 times lower on average than an implementation of the Base of Aircraft Data (BADA) in Python. It is demonstrated that such models can effectively emulate the BADA aircraft performance model, which is based on the numerical solution of a partial differential equation (PDE), and that the LSSMs can be fitted to trajectories in a dataset of historic flight data. Secondly, the paper proposes an algorithm to assimilate radar observations using particle filtering to adaptively refine TP accuracy. Comparison with baselines using BADA and Kalman filtering demonstrate that the proposed framework improves system identification and state estimation for both climb and descent phases, with 46.3% and 64.7% better estimates for time to top of climb and bottom of descent compared to the best performing benchmark model. In particular, the particle filtering approach provides the flexibility to capture non-linear performance effects including the CAS-Mach transition.

A Bi-directional Adaptive Framework for Agile UAV Landing

Authors:Chunhui Zhao, Xirui Kao, Yilin Lu, Yang Lyu

Date:2026-01-06 14:10:06

Autonomous landing on mobile platforms is crucial for extending quadcopter operational flexibility, yet conventional methods are often too inefficient for highly dynamic scenarios. The core limitation lies in the prevalent ``track-then-descend'' paradigm, which treats the platform as a passive target and forces the quadcopter to perform complex, sequential maneuvers. This paper challenges that paradigm by introducing a bi-directional cooperative landing framework that redefines the roles of the vehicle and the platform. The essential innovation is transforming the problem from a single-agent tracking challenge into a coupled system optimization. Our key insight is that the mobile platform is not merely a target, but an active agent in the landing process. It proactively tilts its surface to create an optimal, stable terminal attitude for the approaching quadcopter. This active cooperation fundamentally breaks the sequential model by parallelizing the alignment and descent phases. Concurrently, the quadcopter's planning pipeline focuses on generating a time-optimal and dynamically feasible trajectory that minimizes energy consumption. This bi-directional coordination allows the system to execute the recovery in an agile manner, characterized by aggressive trajectory tracking and rapid state synchronization within transient windows. The framework's effectiveness, validated in dynamic scenarios, significantly improves the efficiency, precision, and robustness of autonomous quadrotor recovery in complex and time-constrained missions.

SA-ResGS: Self-Augmented Residual 3D Gaussian Splatting for Next Best View Selection

Authors:Kim Jun-Seong, Tae-Hyun Oh, Eduardo Pérez-Pellitero, Youngkyoon Jang

Date:2026-01-06 13:59:07

We propose Self-Augmented Residual 3D Gaussian Splatting (SA-ResGS), a novel framework to stabilize uncertainty quantification and enhancing uncertainty-aware supervision in next-best-view (NBV) selection for active scene reconstruction. SA-ResGS improves both the reliability of uncertainty estimates and their effectiveness for supervision by generating Self-Augmented point clouds (SA-Points) via triangulation between a training view and a rasterized extrapolated view, enabling efficient scene coverage estimation. While improving scene coverage through physically guided view selection, SA-ResGS also addresses the challenge of under-supervised Gaussians, exacerbated by sparse and wide-baseline views, by introducing the first residual learning strategy tailored for 3D Gaussian Splatting. This targeted supervision enhances gradient flow in high-uncertainty Gaussians by combining uncertainty-driven filtering with dropout- and hard-negative-mining-inspired sampling. Our contributions are threefold: (1) a physically grounded view selection strategy that promotes efficient and uniform scene coverage; (2) an uncertainty-aware residual supervision scheme that amplifies learning signals for weakly contributing Gaussians, improving training stability and uncertainty estimation across scenes with diverse camera distributions; (3) an implicit unbiasing of uncertainty quantification as a consequence of constrained view selection and residual supervision, which together mitigate conflicting effects of wide-baseline exploration and sparse-view ambiguity in NBV planning. Experiments on active view selection demonstrate that SA-ResGS outperforms state-of-the-art baselines in both reconstruction quality and view selection robustness.

Probing ionized bubbles around luminous sources during reionization with SKA 21-cm observations

Authors:Arnab Mishra, Kanan Kumar Datta, Chandra Shekhar Murmu, Samir Choudhuri, Iffat Nasreen, Snehasish Saha

Date:2026-01-06 10:23:37

Detecting and characterizing individual ionized bubbles during the Epoch of Reionization (EoR) using the redshifted HI 21-cm signal provides a direct probe of the early ionizing sources and the intergalactic medium. We develop and validate a computationally efficient estimator that operates on gridded visibilities to detect ionized bubbles. This serves as an accurate alternative to the more computationally demanding bare estimator that uses all baselines and frequency channels. Further, we employ a non-parametric foreground-subtraction method based on Gaussian process regression, which minimizes loss of the HI 21-cm signal and yields improved signal-to-noise ratios. Our analysis indicates that ionized bubbles at redshifts $z \sim 7 - 8$ can be detected with SNR $\gtrsim 10$ using $\sim 100$ hours of SKA1-Low AA$^*$ and AA4 observations. We further derive a scaling relation that connects the SNR to the bubble radius, redshift, total observing time, and the mean neutral hydrogen fraction of the surrounding IGM. This helps to quickly predict the observational outcome for any planned observations and is, therefore, useful for devising observational strategies. Finally, we apply a Bayesian likelihood framework with Markov Chain Monte Carlo sampling to the residual visibilities to recover ionized bubble properties, including radius, position, and the mean neutral fraction. The resulting posterior distributions demonstrate accurate recovery of the bubble parameters. This confirms the feasibility of robustly characterizing individual ionized regions with the SKA1-Low.

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

Authors:Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, Jianyu Chen

Date:2026-01-06 09:58:24

Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLM) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how VLM choice and competence translate to downstream VLA policies performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters for fair and efficient comparison. Despite its simplicity, VLM4VLA proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on various downstream tasks across three benchmarks, we find that while VLM initialization offers a consistent benefit over training from scratch, a VLM's general capabilities are poor predictors of its downstream task performance. This challenges common assumptions, indicating that standard VLM competence is necessary but insufficient for effective embodied control. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM's performance on specific embodied skills does not guarantee better downstream control performance. Finally, modality-level ablations identify the visual module in VLM, rather than the language component, as the primary performance bottleneck. We demonstrate that injecting control-relevant supervision into the vision encoder of the VLM yields consistent gains, even when the encoder remains frozen during downstream fine-tuning. This isolates a persistent domain gap between current VLM pretraining objectives and the requirements of embodied action-planning.

Lesion Segmentation in FDG-PET/CT Using Swin Transformer U-Net 3D: A Robust Deep Learning Framework

Authors:Shovini Guha, Dwaipayan Nandi

Date:2026-01-06 09:52:00

Accurate and automated lesion segmentation in Positron Emission Tomography / Computed Tomography (PET/CT) imaging is essential for cancer diagnosis and therapy planning. This paper presents a Swin Transformer UNet 3D (SwinUNet3D) framework for lesion segmentation in Fluorodeoxyglucose Positron Emission Tomography / Computed Tomography (FDG-PET/CT) scans. By combining shifted window self-attention with U-Net style skip connections, the model captures both global context and fine anatomical detail. We evaluate SwinUNet3D on the AutoPET III FDG dataset and compare it against a baseline 3D U-Net. Results show that SwinUNet3D achieves a Dice score of 0.88 and IoU of 0.78, surpassing 3D U-Net (Dice 0.48, IoU 0.32) while also delivering faster inference times. Qualitative analysis demonstrates improved detection of small and irregular lesions, reduced false positives, and more accurate PET/CT fusion. While the framework is currently limited to FDG scans and trained under modest GPU resources, it establishes a strong foundation for future multi-tracer, multi-center evaluations and benchmarking against other transformer-based architectures. Overall, SwinUNet3D represents an efficient and robust approach to PET/CT lesion segmentation, advancing the integration of transformer-based models into oncology imaging workflows.

Electricity Price Forecasting: Bridging Linear Models, Neural Networks and Online Learning

Authors:Btissame El Mahtout, Florian Ziel

Date:2026-01-06 09:35:02

Precise day-ahead forecasts for electricity prices are crucial to ensure efficient portfolio management, support strategic decision-making for power plant operations, enable efficient battery storage optimization, and facilitate demand response planning. However, developing an accurate prediction model is highly challenging in an uncertain and volatile market environment. For instance, although linear models generally exhibit competitive performance in predicting electricity prices with minimal computational requirements, they fail to capture relevant nonlinear relationships. Nonlinear models, on the other hand, can improve forecasting accuracy with a surge in computational costs. We propose a novel multivariate neural network approach that combines linear and nonlinear feed-forward neural structures. Unlike previous hybrid models, our approach integrates online learning and forecast combination for efficient training and accuracy improvement. It also incorporates all relevant characteristics, particularly the fundamental relationships arising from wind and solar generation, electricity demand patterns, related energy fuel and carbon markets, in addition to autoregressive dynamics and calendar effects. Compared to the current state-of-the-art benchmark models, the proposed forecasting method significantly reduces computational cost while delivering superior forecasting accuracy (12-13% RMSE and 15-18% MAE reductions). Our results are derived from a six-year forecasting study conducted on major European electricity markets.

Sample-Efficient Neurosymbolic Deep Reinforcement Learning

Authors:Celeste Veronese, Daniele Meli, Alessandro Farinelli

Date:2026-01-06 09:28:53

Reinforcement Learning (RL) is a well-established framework for sequential decision-making in complex environments. However, state-of-the-art Deep RL (DRL) algorithms typically require large training datasets and often struggle to generalize beyond small-scale training scenarios, even within standard benchmarks. We propose a neuro-symbolic DRL approach that integrates background symbolic knowledge to improve sample efficiency and generalization to more challenging, unseen tasks. Partial policies defined for simple domain instances, where high performance is easily attained, are transferred as useful priors to accelerate learning in more complex settings and avoid tuning DRL parameters from scratch. To do so, partial policies are represented as logical rules, and online reasoning is performed to guide the training process through two mechanisms: (i) biasing the action distribution during exploration, and (ii) rescaling Q-values during exploitation. This neuro-symbolic integration enhances interpretability and trustworthiness while accelerating convergence, particularly in sparse-reward environments and tasks with long planning horizons. We empirically validate our methodology on challenging variants of gridworld environments, both in the fully observable and partially observable setting. We show improved performance over a state-of-the-art reward machine baseline.

Mastering the Game of Go with Self-play Experience Replay

Authors:Jingbin Liu, Xuechun Wang

Date:2026-01-06 08:42:40

The game of Go has long served as a benchmark for artificial intelligence, demanding sophisticated strategic reasoning and long-term planning. Previous approaches such as AlphaGo and its successors, have predominantly relied on model-based Monte-Carlo Tree Search (MCTS). In this work, we present QZero, a novel model-free reinforcement learning algorithm that forgoes search during training and learns a Nash equilibrium policy through self-play and off-policy experience replay. Built upon entropy-regularized Q-learning, QZero utilizes a single Q-value network to unify policy evaluation and improvement. Starting tabula rasa without human data and trained for 5 months with modest compute resources (7 GPUs), QZero achieved a performance level comparable to that of AlphaGo. This demonstrates, for the first time, the efficiency of using model-free reinforcement learning to master the game of Go, as well as the feasibility of off-policy reinforcement learning in solving large-scale and complex environments.

Beyond Sgr A* and M87*: Sub-Microarcsecond Black Hole Shadow Detection via Lunar-based Extremely Long Baseline Interferometry

Authors:Shan-Shan Zhao, Ru-Sen Lu, Lei Liu, Zhiqiang Shen

Date:2026-01-06 08:40:25

The 1.3 mm ground-based very long baseline interferometry (VLBI) array, the Event Horizon Telescope (EHT), is limited by the Earth's diameter and can image the supermassive black hole (SMBH) shadows of only M87* and Sgr A*. Extending the array with an assumed lunar-based telescope could achieve $\sim 0.85\ μ$as angular resolution at 230 GHz, enabling black hole shadow detection for a larger SMBH sample. The concept is motivated by space VLBI missions and lunar exploration, including the ongoing Lunar Orbit VLBI Experiment (LOVEX) aboard QueQiao-2 (Chang'E-7) and the planned International Lunar Research Station (ILRS). We assess shadow detectability for 31 SMBH with predicted large angular sizes, exploring different telescope location and antenna size. Assuming a telescope at the lunar antipode, we simulate the Moon-Earth (u,v) coverage and show that source geometry relative to the Moon's orbit determines whether the primary indicator of shadow, first visibility null, can be sampled. Using a geometric ring model, we identify six high-priority targets: M104, NGC 524, PGC 049940, NGC 5077, NGC 5252, and NGC 1052. Shadows of M104, NGC 5077, and NGC 1052 are detectable with a 5 m lunar-based telescope; PGC 049940 requires 20 m; NGC 524 and NGC 5252 require 100 m. Photon ring detection for Sgr A*, M87*, NGC 1600, and M31 is possible if space telescopes fill the baseline coverage gaps and sensitivity requirements are met. These results provide a clear scientific and technical motivation for lunar-based telescopes in future black hole shadow studies.