planning - 2026-02-05

El Agente Quntur: A research collaborator agent for quantum chemistry

Authors: Juan B. Pérez-Sánchez, Yunheng Zou, Jorge A. Campos-Gonzalez-Angulo, Marcel Müller, Ignacio Gustin, Andrew Wang, Han Hao, Tsz Wai Ko, Changhyeok Choi, Eric S. Isbrandt, Mohammad Ghazi Vakili, Hanyong Xu, Chris Crebolder, Varinia Bernales, Alán Aspuru-Guzik
Date: 2026-02-04 18:38:50

Quantum chemistry is a foundational enabling tool for the fields of chemistry, materials science, computational biology and others. Despite its power, the practical application of quantum chemistry simulations remains in the hands of qualified experts due to methodological complexity, software heterogeneity, and the need for informed interpretation of results. To bridge the accessibility gap for these tools and expand their reach to chemists with broader backgrounds, we introduce El Agente Quntur, a hierarchical, multi-agent AI system designed to operate not merely as an automation tool but as a research collaborator for computational quantum chemistry. Quntur was designed following three main strategies: i) elimination of hard-coded procedural policies in favour of reasoning-driven decisions, ii) construction of general and composable actions that facilitate generalization and efficiency, and iii) implementation of guided deep research to integrate abstract quantum-chemical reasoning across subdisciplines and a detailed understanding of the software's internal logic and syntax. Although instantiated in ORCA, these design principles are applicable to research agents more generally and easily expandable to additional quantum chemistry packages and beyond. Quntur supports the full range of calculations available in ORCA 6.0 and reasons over software documentation and scientific literature to plan, execute, adapt, and analyze in silico chemistry experiments following best practices. We discuss the advances and current bottlenecks in agentic systems operating at the research level in computational chemistry, and outline a roadmap toward a fully autonomous end-to-end computational chemistry research agent.

Fluid Representations in Reasoning Models

Authors: Dmitrii Kharlapenko, Alessandro Stolfo, Arthur Conmy, Mrinmaya Sachan, Zhijing Jin
Date: 2026-02-04 18:34:50

Reasoning language models, which generate long chains of thought, dramatically outperform non-reasoning language models on abstract problems. However, the internal model mechanisms that allow this superior performance remain poorly understood. We present a mechanistic analysis of how QwQ-32B - a model specifically trained to produce extensive reasoning traces - processes abstract structural information. On Mystery Blocksworld - a semantically obfuscated planning domain - we find that QwQ-32B gradually improves its internal representation of actions and concepts during reasoning. The model develops abstract encodings that focus on structure rather than specific action names. Through steering experiments, we establish causal evidence that these adaptations improve problem solving: injecting refined representations from successful traces boosts accuracy, while symbolic representations can replace many obfuscated encodings with minimal performance loss. We find that one of the factors driving reasoning model performance is in-context refinement of token representations, which we dub Fluid Reasoning Representations.

UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization

Authors: Dongchao Yang, Yuanyuan Wang, Dading Chong, Songxiang Liu, Xixin Wu, Helen Meng
Date: 2026-02-04 15:53:41

We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot settings, analogous to large language models. To this end, we make the following two contributions. First, we propose ReasoningCodec, a discrete audio codec that factorizes audio into (i) reasoning tokens, which encode text-aligned, high-level analysis and planning representations for audio understanding and hierarchical generation, and (ii) reconstruction tokens, which encode semantically rich acoustic cues for high-fidelity waveform reconstruction. This design achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior discrete tokenizers. Second, we introduce a unified autoregressive architecture for text and audio, together with multi-stage training and multi-task data construction. Using this framework, we train UniAudio 2.0 on 100B text tokens and 60B audio tokens. Across a wide range of speech, sound, and music tasks, UniAudio 2.0 performs competitively on in-domain evaluations and demonstrates strong few-shot and zero-shot generalization to unseen tasks. Demo, code, and checkpoints will be available at \href{https://dongchaoyang.top/UniAudio2Demo/}{https://dongchaoyang.top/UniAudio2Demo/}.

Stochastic Decision Horizons for Constrained Reinforcement Learning

Authors: Nikola Milosevic, Leonard Franz, Daniel Haeufle, Georg Martius, Nico Scherf, Pavel Kolev
Date: 2026-02-04 14:27:16

Constrained Markov decision processes (CMDPs) provide a principled model for handling constraints, such as safety and other auxiliary objectives, in reinforcement learning. The common approach of using additive-cost constraints and dual variables often hinders off-policy scalability. We propose a Control as Inference formulation based on stochastic decision horizons, where constraint violations attenuate reward contributions and shorten the effective planning horizon via state-action-dependent continuation. This yields survival-weighted objectives that remain replay-compatible for off-policy actor-critic learning. We propose two violation semantics, absorbing and virtual termination, that share the same survival-weighted return but result in distinct optimization structures that lead to SAC/MPO-style policy improvement. Experiments demonstrate improved sample efficiency and favorable return-violation trade-offs on standard benchmarks. Moreover, MPO with virtual termination (VT-MPO) scales effectively to our high-dimensional musculoskeletal Hyfydy setup.
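The survival-weighted return described above can be illustrated with a minimal sketch, assuming an exponential mapping from per-step constraint costs to continuation probabilities (the mapping, the `beta` sensitivity parameter, and the function name are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def survival_weighted_return(rewards, costs, gamma=0.99, beta=5.0):
    """Sketch of a survival-weighted return: each step's constraint cost
    lowers a continuation probability, attenuating later rewards and
    shortening the effective planning horizon. The exponential mapping
    and `beta` are illustrative choices, not the paper's."""
    continuation = np.exp(-beta * np.asarray(costs, dtype=float))  # in (0, 1]
    survival = np.cumprod(continuation)           # prob. the episode is still "alive"
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * survival * np.asarray(rewards, dtype=float)))
```

With all costs zero the continuation probabilities are 1 and the expression reduces to the ordinary discounted return; any violation strictly attenuates everything that follows it.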

Dual Mind World Model Inspired Network Digital Twin for Access Scheduling

Authors: Hrishikesh Dutta, Roberto Minerva, Noel Crespi
Date: 2026-02-04 13:53:55

Emerging networked systems such as industrial IoT and real-time cyber-physical infrastructures demand intelligent scheduling strategies capable of adapting to dynamic traffic, deadlines, and interference constraints. In this work, we present a novel Digital Twin-enabled scheduling framework inspired by the Dual Mind World Model (DMWM) architecture for learning-informed and imagination-driven network control. Unlike conventional rule-based or purely data-driven policies, the proposed DMWM combines short-horizon predictive planning with symbolic model-based rollout, enabling the scheduler to anticipate future network states and adjust transmission decisions accordingly. We implement the framework in a configurable simulation testbed and benchmark its performance against traditional heuristics and reinforcement learning baselines under varied traffic conditions. Our results show that DMWM achieves superior performance in bursty, interference-limited, and deadline-sensitive environments, while maintaining interpretability and sample efficiency. The proposed design bridges the gap between network-level reasoning and low-overhead learning, marking a step toward scalable and adaptive NDT-based network optimization.

A Unified Complementarity-based Approach for Rigid-Body Manipulation and Motion Prediction

Authors: Bingkun Huang, Xin Ma, Nilanjan Chakraborty, Riddhiman Laha
Date: 2026-02-04 13:10:57

Robotic manipulation in unstructured environments requires planners to reason jointly about free-space motion and sustained, frictional contact with the environment. Existing (local) planning and simulation frameworks typically separate these regimes or rely on simplified contact representations, particularly when modeling non-convex or distributed contact patches. Such approximations limit the fidelity of contact-mode transitions and hinder the robust execution of contact-rich behaviors in real time. This paper presents a unified discrete-time modeling framework for robotic manipulation that consistently captures both free motion and frictional contact within a single mathematical formalism (Unicomp). Building on complementarity-based rigid-body dynamics, we formulate free-space motion and contact interactions as coupled linear and nonlinear complementarity problems, enabling principled transitions between contact modes without enforcing fixed-contact assumptions. For planar patch contact, we derive a frictional contact model from the maximum power dissipation principle in which the set of admissible contact wrenches is represented by an ellipsoidal limit surface. This representation captures coupled force-moment effects, including torsional friction, while remaining agnostic to the underlying pressure distribution across the contact patch. The resulting formulation yields a discrete-time predictive model that relates generalized velocities and contact wrenches through quadratic constraints and is suitable for real-time optimization-based planning. Experimental results show that the proposed approach enables stable, physically consistent behavior at interactive speeds across tasks, from planar pushing to contact-rich whole-body maneuvers.

EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models

Authors: Yu Bai, MingMing Yu, Chaojie Li, Ziyi Bai, Xinlong Wang, Börje F. Karlsson
Date: 2026-02-04 13:04:56

Deploying humanoid robots in real-world settings is fundamentally challenging, as it demands tight integration of perception, locomotion, and manipulation under partial-information observations and dynamically changing environments, as well as robust transitions between sub-tasks of different types. Towards addressing these challenges, we propose a novel task - EgoActing - which requires grounding high-level instructions directly into varied, precise, spatially aware humanoid actions. We further instantiate this task by introducing EgoActor, a unified and scalable vision-language model (VLM) that can predict locomotion primitives (e.g., walk, turn, move sideways, change height), head movements, manipulation commands, and human-robot interactions to coordinate perception and execution in real time. We leverage broad supervision over egocentric RGB-only data from real-world demonstrations, spatial reasoning question-answering, and simulated environment demonstrations, enabling EgoActor to make robust, context-aware decisions and perform fluent action inference (under 1s) with both 8B and 4B parameter models. Extensive evaluations in both simulated and real-world environments demonstrate that EgoActor effectively bridges abstract task planning and concrete motor execution, while generalizing across diverse tasks and unseen environments.

SPEAR: An Engineering Case Study of Multi-Agent Coordination for Smart Contract Auditing

Authors: Arnab Mallick, Indraveni Chebolu, Harmesh Rana
Date: 2026-02-04 10:51:19

We present SPEAR, a multi-agent coordination framework for smart contract auditing that applies established MAS patterns in a realistic security analysis workflow. SPEAR models auditing as a coordinated mission carried out by specialized agents: a Planning Agent prioritizes contracts using risk-aware heuristics, an Execution Agent allocates tasks via the Contract Net protocol, and a Repair Agent autonomously recovers from brittle generated artifacts using a programmatic-first repair policy. Agents maintain local beliefs updated through AGM-compliant revision, coordinate via negotiation and auction protocols, and revise plans as new information becomes available. An empirical study compares the multi-agent design with centralized and pipeline-based alternatives under controlled failure scenarios, focusing on coordination, recovery behavior, and resource use.
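The Contract Net allocation used by the Execution Agent follows a standard announce-bid-award round, which can be sketched as follows (the agent interface, bid semantics, and lowest-bid award rule are illustrative assumptions about the protocol in general, not SPEAR's actual implementation):

```python
def contract_net(task, agents):
    """One Contract Net round: the manager announces `task`, each agent
    either returns a bid (its estimated cost) or None to decline, and the
    task is awarded to the lowest bidder. Illustrative sketch only; the
    interface is an assumption, not SPEAR's actual API."""
    bids = {}
    for name, bid_fn in agents.items():
        bid = bid_fn(task)          # call-for-proposals -> propose / refuse
        if bid is not None:
            bids[name] = bid
    return min(bids, key=bids.get) if bids else None
```

A manager would typically run one such round per task, falling back to replanning when every agent refuses.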

Angle dependent dose transformer algorithm for fast proton therapy dose calculations

Authors: Mikołaj Stryja, Danny Lathouwers, Zoltán Perkó
Date: 2026-02-04 09:52:00

Accurate 3D dose calculation for Pencil Beam Scanning Proton Therapy (PBSPT) is typically performed with Monte Carlo (MC) engines, but their runtimes limit adaptive workflows and repeated evaluations. Current deep-learning proton dose engines often require orthogonality between proton rays and the CT grid, forcing computationally expensive beamlet-wise 3D reinterpolation. We propose the Angle-dependent Dose Transformer Algorithm (ADoTA), which eliminates grid rotation by augmenting the model input with a fast analytical beamlet-shape projection that explicitly encodes beam direction. The model was trained on CT data from 108 patients to predict beamlet dose distributions for initial energies of $70$--$270\,\mathrm{MeV}$ over an $80\times110\,\mathrm{mm}^2$ field, and tested on an independent cohort of 50 patients. On the test set, gamma pass rates $(1\%,3\,\mathrm{mm})$ were $99.40\pm0.86\%$ (thorax) and $99.87\pm0.23\%$ (abdomen/pelvis). Single-beamlet inference took $1.72\pm0.8\,\mathrm{ms}$. By avoiding reinterpolation, end-to-end 3D dose computation was reduced by $\approx86\%$ relative to the fastest published reinterpolation-based methods. For full treatment plans, gamma pass rates $\Gamma(2\%,2\,\mathrm{mm})$ with a 10\% dose cut-off reached $98.4\%$ (lung) and $98.9\%$ (prostate). ADoTA provides an angle-aware deep-learning proton dose engine that preserves MC-level accuracy across heterogeneous anatomies while substantially reducing computational overhead.

Safe and Stylized Trajectory Planning for Autonomous Driving via Diffusion Model

Authors: Shuo Pei, Yong Wang, Yuanchen Zhu, Chen Sun, Qin Li, Yanan Zhao, Huachun Tan
Date: 2026-02-04 08:46:05

Achieving safe and stylized trajectory planning in complex real-world scenarios remains a critical challenge for autonomous driving systems. This paper proposes the SDD Planner, a diffusion-based framework designed to effectively reconcile safety constraints with driving styles in real time. The framework integrates two core modules: a Multi-Source Style-Aware Encoder, which employs distance-sensitive attention to fuse dynamic agent data and environmental contexts for heterogeneous safety-style perception; and a Style-Guided Dynamic Trajectory Generator, which adaptively modulates priority weights within the diffusion denoising process to generate user-preferred yet safe trajectories. Extensive experiments demonstrate that SDD Planner achieves state-of-the-art performance. On the StyleDrive benchmark, it improves the SM-PDMS metric by 3.9% over WoTE, the strongest baseline. Furthermore, on the NuPlan Test14 and Test14-hard benchmarks, SDD Planner ranks first with overall scores of 91.76 and 80.32, respectively, outperforming leading methods such as PLUTO. Real-vehicle closed-loop tests further confirm that SDD Planner maintains high safety standards while aligning with preset driving styles, validating its practical applicability for real-world deployment.

GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning

Authors: Guoqing Ma, Siheng Wang, Zeyu Zhang, Shan Yu, Hao Tang
Date: 2026-02-04 08:30:27

Large foundation models have shown strong open-world generalization to complex problems in vision and language, but similar levels of generalization have yet to be achieved in robotics. One fundamental challenge is that the models exhibit limited zero-shot capability, which hampers their ability to generalize effectively to unseen scenarios. In this work, we propose GeneralVLA (Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning), a hierarchical vision-language-action (VLA) model that can be more effective in utilizing the generalization of foundation models, enabling zero-shot manipulation and automatically generating data for robotics. In particular, we study a class of hierarchical VLA models where the high-level ASM (Affordance Segmentation Module) is finetuned to perceive image keypoint affordances of the scene; the mid-level 3DAgent carries out task understanding, skill knowledge, and trajectory planning to produce a 3D path indicating the desired robot end-effector trajectory. The intermediate 3D path prediction then serves as guidance to the low-level, 3D-aware control policy capable of precise manipulation. Compared to alternative approaches, our method requires no real-world robotic data collection or human demonstration, making it much more scalable to diverse tasks and viewpoints. Empirically, GeneralVLA successfully generates trajectories for 14 tasks, significantly outperforming state-of-the-art methods such as VoxPoser. The generated demonstrations can train more robust behavior cloning policies than human demonstrations or data generated by VoxPoser, Scaling-up, and Code-As-Policies. We believe GeneralVLA can be a scalable method for both generating data for robotics and solving novel tasks in a zero-shot setting. Code: https://github.com/AIGeeksGroup/GeneralVLA. Website: https://aigeeksgroup.github.io/GeneralVLA.

AppleVLM: End-to-end Autonomous Driving with Advanced Perception and Planning-Enhanced Vision-Language Models

Authors: Yuxuan Han, Kunyuan Wu, Qianyi Shao, Renxiang Xiao, Zilu Wang, Cansen Jiang, Yi Xiao, Liang Hu, Yunjiang Lou
Date: 2026-02-04 06:37:14

End-to-end autonomous driving has emerged as a promising paradigm integrating perception, decision-making, and control within a unified learning framework. Recently, Vision-Language Models (VLMs) have gained significant attention for their potential to enhance the robustness and generalization of end-to-end driving models in diverse and unseen scenarios. However, existing VLM-based approaches still face challenges, including suboptimal lane perception, language understanding biases, and difficulties in handling corner cases. To address these issues, we propose AppleVLM, an advanced perception and planning-enhanced VLM model for robust end-to-end driving. AppleVLM introduces a novel vision encoder and a planning strategy encoder to improve perception and decision-making. Firstly, the vision encoder fuses spatial-temporal information from multi-view images across multiple timesteps using a deformable transformer mechanism, enhancing robustness to camera variations and facilitating scalable deployment across different vehicle platforms. Secondly, unlike traditional VLM-based approaches, AppleVLM introduces a dedicated planning modality that encodes explicit Bird's-Eye-View spatial information, mitigating language biases in navigation instructions. Finally, a VLM decoder fine-tuned by a hierarchical Chain-of-Thought integrates vision, language, and planning features to output robust driving waypoints. We evaluate AppleVLM in closed-loop experiments on two CARLA benchmarks, achieving state-of-the-art driving performance. Furthermore, we deploy AppleVLM on an AGV platform and successfully showcase real-world end-to-end autonomous driving in complex outdoor environments.

Why Agentic-PRs Get Rejected: A Comparative Study of Coding Agents

Authors: Sota Nakashima, Yuta Ishimoto, Masanari Kondo, Shane McIntosh, Yasutaka Kamei
Date: 2026-02-04 05:24:18

Agentic coding -- software development workflows in which autonomous coding agents plan, implement, and submit code changes with minimal human involvement -- is rapidly gaining traction. Prior work has shown that Pull Requests (PRs) produced using coding agents (Agentic-PRs) are accepted less often than PRs that are not labeled as agentic (Human-PRs). The rejection reasons for a single agent (Claude Code) have been explored, but a comparison of how rejection reasons differ between Agentic-PRs generated by different agents has not yet been performed. This comparison is important since different coding agents are often used for different purposes, which can lead to agent-specific failure patterns. In this paper, we inspect 654 rejected PRs from the AIDev dataset covering five coding agents, as well as a human baseline. Our results show that seven rejection modes occur only in Agentic-PRs, including distrust of AI-generated code. We also observe agent-specific patterns (e.g., automated withdrawal of inactive PRs by Devin), reflecting differences in how agents are configured and used in practice. Notably, a large proportion of rejected PRs (67.9%) lack explicit reviewer feedback, making their rejection reasons difficult to determine. To mitigate this issue, we propose a set of heuristics that reduce the proportion of such cases, offering a practical preprocessing step for future studies of PR rejection in agentic coding.

ALORE: Autonomous Large-Object Rearrangement with a Legged Manipulator

Authors: Zhihai Bi, Yushan Zhang, Kai Chen, Guoyang Zhao, Yulin Li, Jun Ma
Date: 2026-02-04 04:57:36

Endowing robots with the ability to rearrange various large and heavy objects, such as furniture, can substantially alleviate human workload. However, this task is extremely challenging due to the need to interact with diverse objects and efficiently rearrange multiple objects in complex environments while ensuring collision-free loco-manipulation. In this work, we present ALORE, an autonomous large-object rearrangement system for a legged manipulator that can rearrange various large objects across diverse scenarios. The proposed system is characterized by three main features: (i) a hierarchical reinforcement learning training pipeline for multi-object environment learning, where a high-level object velocity controller is trained on top of a low-level whole-body controller to achieve efficient and stable joint learning across multiple objects; (ii) two key modules, a unified interaction configuration representation and an object velocity estimator, that allow a single policy to regulate planar velocity of diverse objects accurately; and (iii) a task-and-motion planning framework that jointly optimizes object visitation order and object-to-target assignment, improving task efficiency while enabling online replanning. Comparisons against strong baselines show consistent superiority in policy generalization, object-velocity tracking accuracy, and multi-object rearrangement efficiency. Key modules are systematically evaluated, and extensive simulations and real-world experiments are conducted to validate the robustness and effectiveness of the entire system, which successfully completes 8 continuous loops to rearrange 32 chairs over nearly 40 minutes without a single failure, and executes long-distance autonomous rearrangement over an approximately 40 m route. The open-source packages are available at https://zhihaibi.github.io/Alore/.
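The joint optimization of visitation order and object-to-target assignment can be illustrated with a brute-force toy version that searches over permutations (the real planner presumably uses a far more scalable formulation; all names and the equal-counts assumption here are illustrative):

```python
from itertools import permutations

def plan_rearrangement(objects, targets, robot, dist):
    """Brute-force joint optimization of object visitation order and
    object-to-target assignment: the robot travels to each object in turn
    and carries it to its assigned target, and we minimize total travel.
    Toy stand-in for the paper's task-and-motion planner; assumes equal
    numbers of objects and targets, and `dist` is any caller-supplied metric."""
    best_cost, best_order, best_assign = float("inf"), None, None
    for order in permutations(range(len(objects))):
        for assign in permutations(range(len(targets))):
            pos, cost = robot, 0.0
            for i in order:
                # go to object i, then carry it to its assigned target
                cost += dist(pos, objects[i]) + dist(objects[i], targets[assign[i]])
                pos = targets[assign[i]]
            if cost < best_cost:
                best_cost, best_order, best_assign = cost, order, assign
    return best_cost, best_order, best_assign
```

This exhaustive form is exponential in the number of objects, which is exactly why a dedicated planning framework with online replanning is needed at the scale of 32 chairs.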

Natural Language Instructions for Scene-Responsive Human-in-the-Loop Motion Planning in Autonomous Driving using Vision-Language-Action Models

Authors: Angel Martinez-Sanchez, Parthib Roy, Ross Greer
Date: 2026-02-04 03:44:56

Instruction-grounded driving, where passenger language guides trajectory planning, requires vehicles to understand intent before motion. However, most prior instruction-following planners rely on simulation or fixed command vocabularies, limiting real-world generalization. doScenes, the first real-world dataset linking free-form instructions (with referentiality) to nuScenes ground-truth motion, enables instruction-conditioned planning. In this work, we adapt OpenEMMA, an open-source MLLM-based end-to-end driving framework that ingests front-camera views and ego-state and outputs 10-step speed-curvature trajectories, to this setting, presenting a reproducible instruction-conditioned baseline on doScenes and investigating the effects of human instruction prompts on predicted driving behavior. We integrate doScenes directives as passenger-style prompts within OpenEMMA's vision-language interface, enabling linguistic conditioning before trajectory generation. Evaluated on 849 annotated scenes using ADE, we observe that instruction conditioning substantially improves robustness by preventing extreme baseline failures, yielding a 98.7% reduction in mean ADE. When such outliers are removed, instructions still influence trajectory alignment, with well-phrased prompts improving ADE by up to 5.1%. We use this analysis to discuss what makes a "good" instruction for the OpenEMMA framework. We release the evaluation prompts and scripts to establish a reproducible baseline for instruction-aware planning. GitHub: https://github.com/Mi3-Lab/doScenes-VLM-Planning
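The ADE metric used in this evaluation is the standard Average Displacement Error: the mean Euclidean distance between predicted and ground-truth trajectory points. A minimal sketch:

```python
import numpy as np

def ade(pred, gt):
    """Average Displacement Error: mean Euclidean distance between
    predicted and ground-truth trajectory points, arrays of shape (T, 2)."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))
```

A handful of extreme failure trajectories can dominate this mean, which is why the abstract reports results both with and without outliers.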

GenMRP: A Generative Multi-Route Planning Framework for Efficient and Personalized Real-Time Industrial Navigation

Authors: Chengzhang Wang, Chao Chen, Jun Tao, Tengfei Liu, He Bai, Song Wang, Longfei Xu, Kaikui Liu, Xiangxiang Chu
Date: 2026-02-04 03:21:21

Existing industrial-scale navigation applications contend with massive road networks, typically employing two main categories of approaches for route planning. The first relies on precomputed road costs for optimal routing and heuristic algorithms for generating alternatives, while the second, generative methods, has recently gained significant attention. However, the former struggles with personalization and route diversity, while the latter fails to meet the efficiency requirements of large-scale real-time scenarios. To address these limitations, we propose GenMRP, a generative framework for multi-route planning. To ensure generation efficiency, GenMRP first introduces a skeleton-to-capillary approach that dynamically constructs a relevant sub-network significantly smaller than the full road network. Within this sub-network, routes are generated iteratively. The first iteration identifies the optimal route, while the subsequent ones generate alternatives that balance quality and diversity using the newly proposed correctional boosting approach. Each iteration incorporates road features, user historical sequences, and previously generated routes into a Link Cost Model to update road costs, followed by route generation using the Dijkstra algorithm. Extensive experiments show that GenMRP achieves state-of-the-art performance with high efficiency in both offline and online environments. To facilitate further research, we have publicly released the training and evaluation dataset. GenMRP has been fully deployed in a real-world navigation app, demonstrating its effectiveness and benefits.
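The iterate-and-reweight route generation can be sketched with plain Dijkstra plus a multiplicative penalty on previously used edges, standing in for the paper's Link Cost Model updates and correctional boosting (the penalty scheme, graph encoding, and names are illustrative assumptions):

```python
import heapq

def dijkstra(adj, src, dst):
    """Standard Dijkstra over an adjacency dict {u: {v: cost}}; returns (cost, path)."""
    dist, prev = {src: 0.0}, {}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return dist[dst], path[::-1]

def multi_route(adj, src, dst, k=2, penalty=2.0):
    """Iteratively generate k routes; after each, inflate the costs of the
    edges just used so later iterations favor diverse alternatives. A toy
    stand-in for GenMRP's learned Link Cost Model update."""
    adj = {u: dict(vs) for u, vs in adj.items()}  # work on a copy
    routes = []
    for _ in range(k):
        cost, path = dijkstra(adj, src, dst)
        routes.append((cost, path))
        for a, b in zip(path, path[1:]):
            adj[a][b] *= penalty  # discourage reuse of this edge
    return routes
```

In the real system the cost update is driven by road features and user history rather than a fixed multiplier, and the search runs over a dynamically constructed sub-network rather than the full graph.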

Charged lepton flavor violating decays $Z\to \ell_\alpha\ell_\beta$ in the inverse seesaw

Authors: Adrián González-Quiterio, Héctor Novales-Sánchez
Date: 2026-02-04 03:04:13

After the confirmation of neutrino masses and mixing by neutrino-oscillation data, the origin of neutrino mass and the occurrence of charged-lepton-flavor non-conservation in nature have become two main objectives for the physics of elementary particles. Taking inspiration from both matters, we address the decays $Z\to\ell_\alpha\ell_\beta$, with $\ell_\alpha\ne\ell_\beta$, thus violating charged-lepton flavor. We calculate the set of contributing one-loop diagrams characterized by virtual neutral leptons, both light and heavy, emerging from the inverse seesaw mechanism for the generation of neutrino mass. By neglecting charged-lepton and light-neutrino masses, and then assuming that the mass spectrum of the heavy neutral leptons is degenerate, we find that a relation $\mathrm{Br}\big( Z\to\ell_\alpha\ell_\beta\big)\propto\big| \eta_{\beta\alpha} \big|^2$, with $\eta$ the matrix describing non-unitarity effects in light-lepton mixing, is fulfilled. Our quantitative analysis, which considers both degenerate and non-degenerate mass spectra of the heavy neutral leptons, takes into account upper bounds on $\eta_{\mu e}$ imposed by current constraints on the decay $\mu\to e\gamma$ from the MEG II experiment; the projected future sensitivity of this experiment is considered as well. We find that, even though current constraints on $Z\to\ell_\alpha\ell_\beta$ by the ATLAS Collaboration remain far from inverse-seesaw contributions, the improved sensitivity of planned machines, such as the Future Circular Collider and the Circular Electron Positron Collider, will be able to probe this mass-generating mechanism through these decays.

OMG-Agent: Toward Robust Missing Modality Generation with Decoupled Coarse-to-Fine Agentic Workflows

Authors: Ruiting Dai, Zheyu Wang, Haoyu Yang, Yihan Liu, Chengzhi Wang, Zekun Zhang, Zishan Huang, Jiaman Cen, Lisi Mo
Date: 2026-02-04 02:25:40

Data incompleteness severely impedes the reliability of multimodal systems. Existing reconstruction methods face distinct bottlenecks: conventional parametric/generative models are prone to hallucinations due to over-reliance on internal memory, while retrieval-augmented frameworks struggle with retrieval rigidity. Critically, these end-to-end architectures are fundamentally constrained by Semantic-Detail Entanglement -- a structural conflict between logical reasoning and signal synthesis that compromises fidelity. In this paper, we present \textbf{\underline{O}}mni-\textbf{\underline{M}}odality \textbf{\underline{G}}eneration Agent (\textbf{OMG-Agent}), a novel framework that shifts the paradigm from static mapping to a dynamic coarse-to-fine Agentic Workflow. By mimicking a \textit{deliberate-then-act} cognitive process, OMG-Agent explicitly decouples the task into three synergistic stages: (1) an MLLM-driven Semantic Planner that resolves input ambiguity via Progressive Contextual Reasoning, creating a deterministic structured semantic plan; (2) a non-parametric Evidence Retriever that grounds abstract semantics in external knowledge; and (3) a Retrieval-Injected Executor that utilizes retrieved evidence as flexible feature prompts to overcome rigidity and synthesize high-fidelity details. Extensive experiments on multiple benchmarks demonstrate that OMG-Agent consistently surpasses state-of-the-art methods, maintaining robustness under extreme missingness, e.g., a $2.6$-point gain on CMU-MOSI at $70$\% missing rates.

Rate-Optimal Noise Annealing in Semi-Dual Neural Optimal Transport: Tangential Identifiability, Off-Manifold Ambiguity, and Guaranteed Recovery

Authors: Raymond Chu, Jaewoong Choi, Dohyun Kwon
Date: 2026-02-04 00:49:38

Semi-dual neural optimal transport learns a transport map via a max-min objective, yet training can converge to incorrect or degenerate maps. We fully characterize these spurious solutions in the common regime where data concentrate on a low-dimensional manifold: the objective is underconstrained off the data manifold, while the on-manifold transport signal remains identifiable. Following Choi, Choi, and Kwon (2025), we study additive-noise smoothing as a remedy and prove new map recovery guarantees as the noise vanishes. Our main practical contribution is a computable terminal noise level $\varepsilon_{\mathrm{stat}}(N)$ that attains the optimal statistical rate, with scaling governed by the intrinsic dimension $m$ of the data. The formula arises from a theoretical unified analysis of (i) quantitative stability of optimal plans, (ii) smoothing-induced bias, and (iii) finite-sample error, yielding rates that depend on $m$ rather than the ambient dimension. Finally, we show that the reduced semi-dual objective becomes increasingly ill-conditioned as $\varepsilon \downarrow 0$. This provides a principled stopping rule: annealing below $\varepsilon_{\mathrm{stat}}(N)$ can $\textit{worsen}$ optimization conditioning without improving statistical accuracy.

Time-to-Event Estimation with Unreliably Reported Events in Medicare Health Plan Payment

Authors: Oana M. Enache, Sherri Rose
Date: 2026-02-04 00:04:44

Time-to-event estimation (i.e., survival analysis) is common in health research, most often using methods that assume proportional hazards and no competing risks. Because both assumptions are frequently invalid, estimators more aligned with real-world settings have been proposed. An effect can be estimated as the difference in areas below the cumulative incidence functions of two groups up to a pre-specified time point. This approach, restricted mean time lost (RMTL), can be used in settings with competing risks as well. We extend RMTL estimation for use in an understudied health policy application in Medicare. Medicare currently supports healthcare payment for over 69 million beneficiaries, most of whom are enrolled in Medicare Advantage plans and receive insurance from private insurers. These insurers are prospectively paid by the federal government for each of their beneficiaries' anticipated health needs using an ordinary least squares linear regression algorithm. As all coefficients are positive and predictor variables are largely insurer-submitted health conditions, insurers are incentivized to upcode, or report more diagnoses than may be accurate. Such gaming is projected to cost the federal government $40 billion in 2025 alone without clear benefit to beneficiaries. We propose several novel estimators of coding intensity and possible upcoding in Medicare Advantage, including accounting for unreliable reporting. We demonstrate estimator performance in simulated data leveraging the National Institutes of Health's All of Us study and also develop an open source R package to simulate realistic labeled upcoding data, which were not previously available.
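The restricted mean time lost described above has a direct computational form: the area below a (step) cumulative incidence function up to a pre-specified horizon, with the effect taken as the difference between two groups. The following is a minimal numpy sketch of that quantity only, not the authors' proposed estimators; the function name and toy numbers are invented for illustration:

```python
import numpy as np

def rmtl(times, cif, tau):
    """Restricted mean time lost up to tau: area under a right-continuous
    step cumulative incidence function F(t) on [0, tau]."""
    times = np.asarray(times, dtype=float)
    cif = np.asarray(cif, dtype=float)
    mask = times <= tau
    # prepend t=0 (F=0) and extend the last observed value out to tau
    t = np.concatenate(([0.0], times[mask], [tau]))
    f = np.concatenate(([0.0], cif[mask],
                        [cif[mask][-1] if mask.any() else 0.0]))
    # area under the step function: sum of height * interval width
    return float(np.sum(f[:-1] * np.diff(t)))

# effect = difference in RMTL between two groups up to tau = 5
t1, F1 = [1, 2, 4], [0.10, 0.25, 0.40]
t2, F2 = [1, 3, 4], [0.05, 0.15, 0.30]
effect = rmtl(t1, F1, 5.0) - rmtl(t2, F2, 5.0)
```

With competing risks, the same integral applies to each cause-specific cumulative incidence function.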

Cross-Frequency Bispectral EEG Analysis of Reach-to-Grasp Planning and Execution

Authors:Sima Ghafoori, Anna Cetera, Ali Rabiee, MH Farhadi, Rahul Singh, Mariusz Furmanek, Yalda Shahriari, Reza Abiri
Date:2026-02-03 21:05:12

Neural control of grasping arises from nonlinear interactions across multiple brain rhythms, yet EEG-based motor decoding has largely relied on linear, second-order spectral features. Here, we examine whether higher-order cross-frequency dynamics distinguish motor planning from execution during natural reach-to-grasp behavior. EEG was recorded in a cue-based paradigm during executed precision and power grips, enabling stage-resolved analysis of preparatory and execution-related neural activity. Cross-frequency bispectral analysis was used to compute bicoherence matrices across canonical frequency band pairs, from which magnitude- and phase-based features were extracted. Classification, permutation-based feature selection, and within-subject statistical testing showed that execution is characterized by substantially stronger and more discriminative nonlinear coupling than planning, with dominant contributions from beta- and gamma-driven interactions. In contrast, decoding of precision versus power grips achieved comparable performance during planning and execution, indicating that grasp-type representations emerge during planning and persist into execution. Spatial and spectral analyses further revealed that informative bispectral features reflect coordinated activity across prefrontal, central, and occipital regions. Despite substantial feature redundancy, effective dimensionality reduction preserved decoding performance. Together, these findings demonstrate that nonlinear cross-frequency coupling provides an interpretable and robust marker of motor planning and execution, extending bispectral EEG analysis to ecologically valid grasping and supporting its relevance for brain-computer interfaces and neuroprosthetic control.
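Cross-frequency bicoherence, the core feature family in the abstract above, is a standard normalized bispectrum estimate. Below is a generic segment-averaged numpy sketch (Hann window, Cauchy-Schwarz normalization bounding values in [0, 1]); it is not the authors' pipeline, and the function name and segment length are assumptions:

```python
import numpy as np

def bicoherence(x, nperseg=256):
    """Segment-averaged bicoherence matrix b(f1, f2) in [0, 1].
    High values indicate quadratic phase coupling between f1, f2, f1+f2."""
    segs = x[: len(x) // nperseg * nperseg].reshape(-1, nperseg)
    X = np.fft.rfft(segs * np.hanning(nperseg), axis=1)
    n = X.shape[1] // 2                      # keep f1 + f2 within the spectrum
    f1 = np.arange(n)[:, None]
    f2 = np.arange(n)[None, :]
    triple = X[:, f1] * X[:, f2] * np.conj(X[:, f1 + f2])
    num = np.abs(triple.mean(axis=0))
    den = np.sqrt((np.abs(X[:, f1] * X[:, f2]) ** 2).mean(axis=0)
                  * (np.abs(X[:, f1 + f2]) ** 2).mean(axis=0))
    return num / np.maximum(den, 1e-12)
```

Magnitude features can be read off band-pair submatrices of this matrix (e.g., beta rows against gamma columns), while phase features come from the complex bispectrum before taking the absolute value.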

Deep-learning-based pan-phenomic data reveals the explosive evolution of avian visual disparity

Authors:Jiao Sun
Date:2026-02-03 18:32:15

The evolution of biological morphology is critical for understanding the diversity of the natural world, yet traditional analyses often involve subjective biases in the selection and coding of morphological traits. This study employs deep learning techniques, utilising a ResNet34 model capable of recognising over 10,000 bird species, to explore avian morphological evolution. We extract weights from the model's final fully connected (fc) layer and investigate the semantic alignment between the high-dimensional embedding space learned by the model and biological phenotypes. The results demonstrate that the high-dimensional embedding space encodes phenotypic convergence. Subsequently, we assess the morphological disparity among various taxa and evaluate the association between morphological disparity and species richness, demonstrating that species richness is the primary driver of morphospace expansion. Moreover, the disparity-through-time analysis reveals a visual "early burst" after the K-Pg extinction. While mainly aimed at evolutionary analysis, this study also provides insights into the interpretability of Deep Neural Networks. We demonstrate that hierarchical semantic structures (biological taxonomy) emerged in the high-dimensional embedding space despite being trained on flat labels. Furthermore, through adversarial examples, we provide evidence that our model in this task can overcome texture bias and learn holistic shape representations (body plans), challenging the prevailing view that CNNs rely primarily on local textures.

Deep-Learning Denoising of Radio Observations for Ultra-High-Energy Cosmic-Ray Detection

Authors:Zhisen Lai, Oscar Macias, Aurélien Benoit-Lévy, Arsène Ferrière, Matías Tueros
Date:2026-02-03 18:21:22

Ultra-high-energy cosmic rays (UHECRs) can be detected via the broadband radio pulses produced by their extensive air showers. The Giant Radio Array for Neutrino Detection (GRAND) is a planned radio observatory that aims to deploy autonomous antenna arrays over areas of order $\sim 10^5\,\mathrm{km}^2$ to detect this emission. However, Galactic and instrumental radio backgrounds make the identification of low signal-to-noise ratio (SNR) pulses a central challenge. Here, we present a deep convolutional denoiser model that jointly processes each GRAND antenna trace in the time and frequency domains, allowing the network to learn transient pulse morphology and broadband spectral features while suppressing background noise. By training the model on $4.1\times 10^5$ simulated traces that include detailed UHECR radio emission and realistic detector response and noise, we find a median output-SNR improvement of $\sim 15-23\,\mathrm{dB}$ in the $50-200~\mathrm{MHz}$ band and a reduction of the normalized mean squared error of the waveform by about an order of magnitude relative to a Hilbert-envelope denoiser baseline. We also verify that applying the denoiser to noise-only windows does not produce spurious pulse candidates. Near the detection threshold, the denoiser increases the number of antennas contributing reliable pulse timing by a factor of $\sim 2-3$, which correspondingly tightens direction reconstruction uncertainties. When we additionally require accurate recovery of the waveform shape, the denoiser yields a median gain of $\sim 3-4$ antennas usable for energy reconstruction at SNR$\simeq 5-6$, strengthening event-level direction and energy estimates in sparse radio arrays.
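The dB figures quoted above compare output SNR to input SNR against a clean reference waveform. As a small, generic utility (not GRAND-specific; the names and toy signal are invented), the metric can be sketched as:

```python
import numpy as np

def snr_db(clean, trace):
    """SNR of `trace` relative to the reference `clean`, in dB."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(trace, dtype=float) - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

# denoising gain in dB = output SNR minus input SNR
rng = np.random.default_rng(1)
clean = np.sin(2 * np.pi * 0.05 * np.arange(1024))
noisy = clean + 0.5 * rng.standard_normal(1024)
denoised = clean + 0.05 * rng.standard_normal(1024)  # stand-in for a model output
gain = snr_db(clean, denoised) - snr_db(clean, noisy)
```

A reported gain of ~15-23 dB corresponds to the noise power shrinking by a factor of roughly 30-200 relative to the input trace.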

Conformal Reachability for Safe Control in Unknown Environments

Authors:Xinhang Ma, Junlin Wu, Yiannis Kantaros, Yevgeniy Vorobeychik
Date:2026-02-03 18:01:38

Designing provably safe control is a core problem in trustworthy autonomy. However, most prior work in this regard assumes either that the system dynamics are known or deterministic, or that the state and action space are finite, significantly limiting application scope. We address this limitation by developing a probabilistic verification framework for unknown dynamical systems which combines conformal prediction with reachability analysis. In particular, we use conformal prediction to obtain valid uncertainty intervals for the unknown dynamics at each time step, with reachability then verifying whether safety is maintained within the conformal uncertainty bounds. Next, we develop an algorithmic approach for training control policies that optimize nominal reward while also maximizing the planning horizon with sound probabilistic safety guarantees. We evaluate the proposed approach in seven safe control settings spanning four domains -- cartpole, lane following, drone control, and safe navigation -- for both affine and nonlinear safety specifications. Our experiments show that the policies we learn achieve the strongest provable safety guarantees while still maintaining high average reward.
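The conformal step in the abstract above rests on a standard split-conformal quantile over one-step dynamics residuals: with a calibration set of prediction errors, the appropriately ranked order statistic bounds a fresh error with probability at least 1 - alpha under exchangeability. A minimal sketch of that quantile alone (not the paper's full reachability framework; names are illustrative):

```python
import numpy as np

def conformal_radius(residuals, alpha=0.05):
    """Split-conformal bound: with probability >= 1 - alpha, a fresh
    one-step prediction error does not exceed this radius."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = len(r)
    k = int(np.ceil((n + 1) * (1 - alpha)))   # conformal quantile rank
    if k > n:
        return float("inf")                   # too little calibration data
    return r[k - 1]

# calibration residuals: ||x_{t+1} - f_hat(x_t, u_t)|| on held-out transitions
rng = np.random.default_rng(0)
res = np.abs(rng.standard_normal(999))
eps = conformal_radius(res, alpha=0.1)
# reachability then inflates each nominal one-step prediction by eps
```

Inflating the predicted next state by this radius at every step is what lets reachability analysis certify safety within the conformal uncertainty bounds.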

BridgeV2W: Bridging Video Generation Models to Embodied World Models via Embodiment Masks

Authors:Yixiang Chen, Peiyan Li, Jiabing Yang, Keji He, Xiangnan Wu, Yuan Xu, Kai Wang, Jing Liu, Nianfeng Liu, Yan Huang, Liang Wang
Date:2026-02-03 17:56:28

Embodied world models have emerged as a promising paradigm in robotics, most of which leverage large-scale Internet videos or pretrained video generation models to enrich visual and motion priors. However, they still face key challenges: a misalignment between coordinate-space actions and pixel-space videos, sensitivity to camera viewpoint, and non-unified architectures across embodiments. To this end, we present BridgeV2W, which converts coordinate-space actions into pixel-aligned embodiment masks rendered from the URDF and camera parameters. These masks are then injected into a pretrained video generation model via a ControlNet-style pathway, which aligns the action control signals with predicted videos, adds view-specific conditioning to accommodate camera viewpoints, and yields a unified world model architecture across embodiments. To mitigate overfitting to static backgrounds, BridgeV2W further introduces a flow-based motion loss that focuses on learning dynamic and task-relevant regions. Experiments on single-arm (DROID) and dual-arm (AgiBot-G1) datasets, covering diverse and challenging conditions with unseen viewpoints and scenes, show that BridgeV2W improves video generation quality compared to prior state-of-the-art methods. We further demonstrate the potential of BridgeV2W on downstream real-world tasks, including policy evaluation and goal-conditioned planning. More results can be found on our project website at https://BridgeV2W.github.io .

Reasoning with Latent Tokens in Diffusion Language Models

Authors:Andre He, Sean Welleck, Daniel Fried
Date:2026-02-03 17:27:46

Discrete diffusion models have recently become competitive with autoregressive models for language modeling, even outperforming them on reasoning tasks requiring planning and global coherence, but they require more computation at inference time. We trace this trade-off to a key mechanism: diffusion models are trained to jointly predict a distribution over all unknown tokens, including those that will not actually be decoded in the current step. Ablating this joint prediction yields faster inference but degrades performance, revealing that accurate prediction at the decoded position relies on joint reasoning about the distribution of undecoded tokens. We interpret these as latent tokens and introduce a method for modulating their number, demonstrating empirically that this enables a smooth tradeoff between inference speed and sample quality. Furthermore, we demonstrate that latent tokens can be introduced into autoregressive models through an auxiliary multi-token prediction objective, yielding substantial improvements on the same reasoning tasks where they have traditionally struggled. Our results suggest that latent tokens, while arising naturally in diffusion, represent a general mechanism for improving performance on tasks requiring global coherence or lookahead.

Training Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling

Authors:Yubao Zhao, Weiquan Huang, Sudong Wang, Ruochen Zhao, Chen Chen, Yao Shu, Chengwei Qin
Date:2026-02-03 16:43:09

Agentic reinforcement learning has enabled large language models to perform complex multi-turn planning and tool use. However, learning in long-horizon settings remains challenging due to sparse, trajectory-level outcome rewards. While prior tree-based methods attempt to mitigate this issue, they often suffer from high variance and computational inefficiency. Through empirical analysis of search agents, we identify a common pattern: performance diverges mainly due to decisions near the tail. Motivated by this observation, we propose Branching Relative Policy Optimization (BranPO), a value-free method that provides step-level contrastive supervision without dense rewards. BranPO truncates trajectories near the tail and resamples alternative continuations to construct contrastive suffixes over shared prefixes, reducing credit ambiguity in long-horizon rollouts. To further boost efficiency and stabilize training, we introduce difficulty-aware branch sampling to adapt branching frequency across tasks, and redundant step masking to suppress uninformative actions. Extensive experiments on various question answering benchmarks demonstrate that BranPO consistently outperforms strong baselines, achieving significant accuracy gains on long-horizon tasks without increasing the overall training budget. Our code is available at \href{https://github.com/YubaoZhao/BranPO}{code}.

OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering

Authors:Yifan Zhu, Xinyu Mu, Tao Feng, Zhonghong Ou, Yuning Gong, Haoran Luo
Date:2026-02-03 16:28:24

Long-horizon omnimodal question answering requires reasoning over text, images, audio, and video. Despite recent progress on OmniLLMs, low-resource long audio-video QA still suffers from costly dense encoding, weak fine-grained retrieval, limited proactive planning, and no clear end-to-end optimization. To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning. It builds an image-audio retrieval-augmented generation module that lets an OmniLLM fetch short, relevant frames and audio snippets from external banks. Moreover, it uses an agent loop that plans, calls tools across turns, and merges retrieved evidence to answer complex queries. Furthermore, we apply group relative policy optimization to jointly improve tool use and answer quality over time. Experiments on OmniVideoBench, WorldSense, and Daily-Omni show that OmniRAG-Agent consistently outperforms prior methods under low-resource settings and achieves strong results, with ablations validating each component.

When Should Agents Coordinate in Differentiable Sequential Decision Problems?

Authors:Caleb Probine, Su Ann Low, David Fridovich-Keil, Ufuk Topcu
Date:2026-02-03 15:55:16

Multi-robot teams must coordinate to operate effectively. When a team operates in an uncoordinated manner, and agents choose actions that are only individually optimal, the team's outcome can suffer. However, in many domains, coordination requires costly communication. We explore the value of coordination in a broad class of differentiable motion-planning problems. In particular, we model coordinated behavior as a spectrum: at one extreme, agents jointly optimize a common team objective, and at the other, agents make unilaterally optimal decisions given their individual decision variables, i.e., they operate at Nash equilibria. We then demonstrate that reasoning about coordination in differentiable motion-planning problems reduces to reasoning about the second-order properties of agents' objectives, and we provide algorithms that use this second-order reasoning to determine at which times a team of agents should coordinate.

Scalable non-separable spatio-temporal Gaussian process models for large-scale short-term weather prediction

Authors:Tim Gyger, Reinhard Furrer, Fabio Sigrist
Date:2026-02-03 15:03:48

Monitoring daily weather fields is critical for climate science, agriculture, and environmental planning, yet fully probabilistic spatio-temporal models become computationally prohibitive at continental scale. We present a case study on short-term forecasting of daily maximum temperature and precipitation across the conterminous United States using novel scalable spatio-temporal Gaussian process methodology. Building on three approximation families - inducing-point methods (FITC), Vecchia approximations, and a hybrid Vecchia-inducing-point full-scale approach (VIF) - we introduce three extensions that address key bottlenecks in large space-time settings: (i) a scalable correlation-based neighbor selection strategy for Vecchia approximations with point-referenced data, enabling accurate conditioning under complex dependence structures, (ii) a space-time kMeans++ inducing-point selection algorithm, and (iii) GPU-accelerated implementations of computationally expensive operations, including matrix operations and neighbor searches. Using both synthetic experiments and a large NOAA station dataset containing approximately 1.7 million space-time observations, we analyze the models with respect to predictive performance, parameter estimation, and computational efficiency. Our results demonstrate that scalable Gaussian process models can yield accurate continental-scale forecasts while remaining computationally feasible, offering practical tools for weather applications.