Counterfactual explanations study what should have changed in order to get an alternative result, enabling end-users to understand machine learning mechanisms with counterexamples. Actionability is defined as the ability to transform the original case to be explained into a counterfactual one. We develop a method for actionable counterfactual explanations that, unlike its predecessors, does not directly leverage training data. Rather, data is only used to learn a density estimator, creating a search landscape in which to apply path planning algorithms to solve the problem and masking the endogenous data, which can be sensitive or private. We put special focus on estimating the data density using Bayesian networks, demonstrating how their enhanced interpretability is useful in high-stakes scenarios in which fairness is a growing concern. Using a synthetic benchmark comprising 15 datasets, our proposal finds more actionable and simpler counterfactuals than the current state-of-the-art algorithms. We also test our algorithm with a real-world Environmental Protection Agency dataset, facilitating a more efficient and equitable study of policies to improve the quality of life in counties of the United States of America. Our proposal captures the interaction of variables, ensuring equity in decisions, as policies to improve certain domains of study (air, water quality, etc.) can be detrimental in others. In particular, the sociodemographic domain is often involved, where we find important variables related to the ongoing housing crisis that can potentially have a severe negative impact on communities.
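The idea of searching a density landscape with a path planner can be illustrated with a minimal sketch. The abstract does not name the planner or the cost function, so everything here is an assumption: the feature space is discretized into a small grid, Dijkstra's algorithm stands in for the path planner, each step costs `1 / density`, and the hypothetical `is_goal` callback stands in for "the classifier predicts the desired class".

```python
import heapq
import math


def density_path(density, start, is_goal, eps=1e-9):
    """Dijkstra search over a 2-D grid of estimated density values.

    Entering a cell costs 1 / (density + eps), so the cheapest path
    stays in high-density regions. Returns (total_cost, path).
    """
    rows, cols = len(density), len(density[0])
    best = {start: 0.0}
    parent = {start: None}
    queue = [(0.0, start)]
    while queue:
        cost, cell = heapq.heappop(queue)
        if cost > best[cell]:
            continue  # stale queue entry, a cheaper route was found already
        if is_goal(cell):
            path = []
            while cell is not None:  # walk back to the start
                path.append(cell)
                cell = parent[cell]
            return cost, path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                new_cost = cost + 1.0 / (density[nr][nc] + eps)
                if new_cost < best.get((nr, nc), math.inf):
                    best[(nr, nc)] = new_cost
                    parent[(nr, nc)] = cell
                    heapq.heappush(queue, (new_cost, (nr, nc)))
    return math.inf, []


# Toy landscape: the middle column has a low-density gap in rows 1 and 2.
grid = [
    [0.9, 0.9, 0.9],
    [0.9, 0.01, 0.9],
    [0.9, 0.01, 0.9],
]
# Hypothetical goal: the classifier flips once the second feature reaches 2.
cost, path = density_path(grid, (2, 0), lambda cell: cell[1] == 2)
print(path)
```

In this toy grid the search detours around the low-density cells instead of crossing them, which is the behaviour one would expect from a counterfactual path that stays in plausible regions of the data.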
We calculate the away-side hadron-triggered modification factor $I_{AA}$ in $AA$ collisions at RHIC and LHC energies for scenarios with and without quark-gluon plasma formation in $pp$ collisions. We find that for both scenarios theoretical results for $I_{AA}$ agree well with the available data for 2.76 TeV Pb+Pb and 0.2 TeV Au+Au collisions. We make predictions for $I_{AA}$ in 7 TeV O+O collisions that are planned at the LHC. Our results show that measuring $I_{OO}$ in the whole centrality interval and at small centrality ($\lesssim 5$%) may give information on the presence of jet quenching in $pp$ collisions.
The efficacy of AI agents in healthcare research is hindered by their reliance on static, predefined strategies. This creates a critical limitation: agents can become better tool-users but cannot learn to become better strategic planners, a crucial skill for complex domains like healthcare. We introduce HealthFlow, a self-evolving AI agent that overcomes this limitation through a novel meta-level evolution mechanism. HealthFlow autonomously refines its own high-level problem-solving policies by distilling procedural successes and failures into a durable, strategic knowledge base. To anchor our research and facilitate reproducible evaluation, we introduce EHRFlowBench, a new benchmark featuring complex, realistic health data analysis tasks derived from peer-reviewed clinical research. Our comprehensive experiments demonstrate that HealthFlow's self-evolving approach significantly outperforms state-of-the-art agent frameworks. This work marks a necessary shift from building better tool-users to designing smarter, self-evolving task-managers, paving the way for more autonomous and effective AI for scientific discovery.
Robotic cutting is a challenging contact-rich manipulation task where the robot must simultaneously negotiate unknown object mechanics, large contact forces, and precise motion requirements. We introduce a new virtual-model control scheme that enables knife rocking motion for robot manipulators, without pre-planned trajectories or precise information of the environment. Motion is generated through interconnection with virtual mechanisms, given by virtual springs, dampers, and masses arranged in a suitable way. Through analysis and experiments, we demonstrate that the controlled robot behavior settles into a periodic motion. Experiments with a Franka manipulator demonstrate robust cuts with five different vegetables, and sub-millimeter slice accuracy from 1 mm to 6 mm at nearly one cut per second. The same controller survives changes in knife shape and cutting board height, and adaptation to a different humanoid manipulator, demonstrating robustness and platform independence.
Manipulating matter with a scanning tunneling microscope (STM) enables creation of atomically defined artificial structures that host designer quantum states. However, the time-consuming nature of the manipulation process, coupled with the sensitivity of the STM tip, constrains the exploration of diverse configurations and limits the size of designed features. In this study, we present a reinforcement learning (RL)-based framework for creating artificial structures by spatially manipulating carbon monoxide (CO) molecules on a copper substrate using the STM tip. The automated workflow combines molecule detection and manipulation, employing deep learning-based object detection to locate CO molecules and linear assignment algorithms to allocate these molecules to designated target sites. We initially perform molecule maneuvering based on randomized parameter sampling for sample bias, tunneling current setpoint and manipulation speed. This dataset is then structured into an action trajectory used to train an RL agent. The model is subsequently deployed on the STM for real-time fine-tuning of manipulation parameters during structure construction. Our approach incorporates path planning protocols coupled with active drift compensation to enable atomically precise fabrication of structures with significantly reduced human input while realizing larger-scale artificial lattices with desired electronic properties. To underpin the efficiency of our approach, we demonstrate the automated construction of an extended artificial graphene lattice and confirm the existence of a characteristic Dirac point in its electronic structure. Further challenges to the scalability of RL-based structural assembly are discussed.
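The allocation of detected molecules to target sites is a linear assignment problem. The sketch below is only an illustration under stated assumptions: positions are hypothetical 2-D coordinates, the cost is Euclidean travel distance, and the optimum is found by brute force over permutations, which is feasible only for a handful of molecules. For realistic sizes a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be the usual choice; the abstract does not say which solver the authors use.

```python
import math
from itertools import permutations


def assign_molecules(molecules, targets):
    """Pick one distinct molecule per target so total travel distance is minimal.

    Returns (assignment, total_distance), where assignment[t] is the index
    of the molecule sent to targets[t]. Brute force, for illustration only.
    """
    best_cost, best_perm = math.inf, None
    for perm in permutations(range(len(molecules)), len(targets)):
        cost = sum(math.dist(molecules[m], targets[t]) for t, m in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm), best_cost


# Hypothetical detected CO positions and lattice sites (arbitrary units).
molecules = [(0, 0), (5, 5), (1, 4)]
targets = [(1, 5), (0, 1)]
assignment, total = assign_molecules(molecules, targets)
print(assignment, total)
```

Here the molecule at (1, 4) is sent to the site at (1, 5) and the molecule at (0, 0) to the site at (0, 1), leaving the distant molecule at (5, 5) unused.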
Accurate whole-heart segmentation is a critical component in the precise diagnosis and interventional planning of cardiovascular diseases. Integrating complementary information from modalities such as computed tomography (CT) and magnetic resonance imaging (MRI) can significantly enhance segmentation accuracy and robustness. However, existing multi-modal segmentation methods face several limitations: severe spatial inconsistency between modalities hinders effective feature fusion; fusion strategies are often static and lack adaptability; and the processes of feature alignment and segmentation are decoupled and inefficient. To address these challenges, we propose a dual-branch U-Net architecture enhanced by reinforcement learning for feature alignment, termed RL-U$^2$Net, designed for precise and efficient multi-modal 3D whole-heart segmentation. The model employs a dual-branch U-shaped network to process CT and MRI patches in parallel, and introduces a novel RL-XAlign module between the encoders. The module employs a cross-modal attention mechanism to capture semantic correspondences between modalities and a reinforcement-learning agent learns an optimal rotation strategy that consistently aligns anatomical pose and texture features. The aligned features are then reconstructed through their respective decoders. Finally, an ensemble-learning-based decision module integrates the predictions from individual patches to produce the final segmentation result. Experimental results on the publicly available MM-WHS 2017 dataset demonstrate that the proposed RL-U$^2$Net outperforms existing state-of-the-art methods, achieving Dice coefficients of 93.1% on CT and 87.0% on MRI, thereby validating the effectiveness and superiority of the proposed approach.
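For readers unfamiliar with the reported metric, the Dice coefficient of two binary masks $A$ and $B$ is $2|A \cap B| / (|A| + |B|)$. A minimal sketch on flattened 0/1 masks follows; the example masks are made up and unrelated to MM-WHS 2017.

```python
def dice_coefficient(pred, truth):
    """Dice overlap between two flattened binary masks given as 0/1 sequences."""
    intersection = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0, 1]
truth = [1, 0, 0, 1, 1]
print(dice_coefficient(pred, truth))  # 2 * 2 / (3 + 3)
```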
Computer perception (CP) technologies (digital phenotyping, affective computing and related passive sensing approaches) offer unprecedented opportunities to personalize healthcare, but provoke concerns about privacy, bias and the erosion of empathic, relationship-centered practice. A comprehensive understanding of perceived risks, benefits, and implementation challenges from those who design, deploy and experience these tools in real-world settings remains elusive. This study provides the first evidence-based account of key stakeholder perspectives on the relational, technical, and governance challenges raised by the integration of CP technologies into patient care. We conducted in-depth, semi-structured interviews with 102 stakeholders: adolescent patients and their caregivers, frontline clinicians, technology developers, and ethics, legal, policy or philosophy scholars. Transcripts underwent thematic analysis by a multidisciplinary team; reliability was enhanced through double coding and consensus adjudication. Stakeholders articulated seven interlocking concern domains: (1) trustworthiness and data integrity; (2) patient-specific relevance; (3) utility and workflow integration; (4) regulation and governance; (5) privacy and data protection; (6) direct and indirect patient harms; and (7) philosophical critiques of reductionism. To operationalize humanistic safeguards, we propose "personalized roadmaps": co-designed plans that predetermine which metrics will be monitored, how and when feedback is shared, thresholds for clinical action, and procedures for reconciling discrepancies between algorithmic inferences and lived experience. By translating these insights into personalized roadmaps, we offer a practical framework for developers, clinicians and policymakers seeking to harness continuous behavioral data while preserving the humanistic core of care.
Vision-Language Navigation (VLN) tasks often leverage panoramic RGB and depth inputs to provide rich spatial cues for action planning, but these sensors can be costly or less accessible in real-world deployments. Recent approaches based on Vision-Language Action (VLA) models achieve strong results with monocular input, yet they still lag behind methods using panoramic RGB-D information. We present MonoDream, a lightweight VLA framework that enables monocular agents to learn a Unified Navigation Representation (UNR). This shared feature representation jointly aligns navigation-relevant visual semantics (e.g., global layout, depth, and future cues) and language-grounded action intent, enabling more reliable action prediction. MonoDream further introduces Latent Panoramic Dreaming (LPD) tasks to supervise the UNR, which train the model to predict latent features of panoramic RGB and depth observations at both current and future steps based on only monocular input. Experiments on multiple VLN benchmarks show that MonoDream consistently improves monocular navigation performance and significantly narrows the gap with panoramic-based agents.
Multi-robot coordination is crucial for autonomous systems, yet real-world deployments often encounter various failures. These include both temporary and permanent disruptions in sensing and communication, which can significantly degrade system robustness and performance if not explicitly modeled. Despite its practical importance, failure-aware coordination remains underexplored in the literature. To bridge the gap between idealized conditions and the complexities of real-world environments, we propose a unified failure-aware coordination framework designed to enable resilient and adaptive multi-robot target tracking under both temporary and permanent failure conditions. Our approach systematically distinguishes between two classes of failures: (1) probabilistic and temporary disruptions, where robots recover from intermittent sensing or communication losses by dynamically adapting paths and avoiding inferred danger zones, and (2) permanent failures, where robots lose sensing or communication capabilities irreversibly, requiring sustained, decentralized behavioral adaptation. To handle these scenarios, the robot team is partitioned into subgroups. Robots that remain connected form a communication group and collaboratively plan using partially centralized nonlinear optimization. Robots experiencing permanent disconnection or failure continue to operate independently through decentralized or individual optimization, allowing them to contribute to the task within their local context. We extensively evaluate our method across a range of benchmark variations and conduct a comprehensive assessment under diverse real-world failure scenarios. Results show that our framework consistently achieves robust performance in realistic environments with unknown danger zones, offering a practical and generalizable solution for the multi-robot systems community.
Modern data analytic workloads increasingly require handling multiple data models simultaneously. Two primary approaches meet this need: polyglot persistence and multi-model database systems. Polyglot persistence employs a coordinator program to manage several independent database systems but suffers from high communication costs due to its physically disaggregated architecture. Meanwhile, existing multi-model database systems rely on a single storage engine optimized for a specific data model, resulting in inefficient processing across diverse data models. To address these limitations, we present M2, a multi-model analytic system with integrated storage engines. M2 treats all data models as first-class entities, composing query plans that incorporate operations across models. To effectively combine data from different models, the system introduces a specialized inter-model join algorithm called multi-stage hash join. Our evaluation demonstrates that M2 outperforms existing approaches by up to 188x speedup on multi-model analytics, confirming the effectiveness of our proposed techniques.
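The abstract does not describe how the multi-stage hash join works, so the following is not that algorithm. It is a sketch of the classic single-stage build-and-probe hash join, shown only to make concrete what joining across data models means; the relational rows, graph edges, and key names are hypothetical.

```python
from collections import defaultdict


def hash_join(left_rows, right_rows, left_key, right_key):
    """Classic hash join: build a hash table on the left input, probe with the right."""
    table = defaultdict(list)
    for row in left_rows:  # build phase
        table[row[left_key]].append(row)
    joined = []
    for row in right_rows:  # probe phase
        for match in table.get(row[right_key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical inputs: a relational table and the edge list of a graph.
users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
edges = [{"src": 1, "dst": 2}, {"src": 1, "dst": 3}, {"src": 3, "dst": 1}]
for row in hash_join(users, edges, "id", "src"):
    print(row)
```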
Ion Cyclotron Range of Frequencies heating (ICRH) and current drive will be essential for sustaining high-performance plasmas in next-generation fusion devices (e.g. ITER, SPARC). ICRH actuators routinely produce localized hot spots on limiters and nearby components, posing serious risks to antenna reliability, material survivability, and overall plasma performance. Remarkably, these hot spots are often strongly asymmetric, even with nominally symmetric plasma conditions and antenna geometries. We show that such asymmetries are intrinsic to the wave physics rather than solely due to misalignment or edge plasma variation. Our results strongly suggest that this asymmetry can be compensated for by using either poloidal phasing control (a capability of, e.g., the WEST traveling wave antenna currently under construction) or modified limiter shapes, suppressing peak sputtering by a factor of $\sim$3 and reducing total erosion by a factor of $\sim$2 compared to state-of-the-art designs. This capability is essential for sustaining high-power, long-duration ICRH operation in reactor-scale devices and other next-generation fusion systems. By distributing heat and particle fluxes more evenly across antenna surfaces, optimized limiter shaping provides a clear pathway to robust, reliable ICRH performance in existing and planned fusion devices.
This paper proposes an adaptive lattice-based motion planning solution to address the problem of generating feasible trajectories for systems, represented by a linearly parameterizable non-linear model operating within a cluttered environment. The system model is considered to have uncertain model parameters. The key idea here is to utilize input/output data online to update the model set containing the uncertain system parameter, as well as a dynamic estimated parameter of the model, so that the associated model estimation error reduces over time. This in turn improves the quality of the motion primitives generated by the lattice-based motion planner using a nominal estimated model selected on the basis of suitable criteria. The motion primitives are also equipped with tubes to account for the model mismatch between the nominal estimated model and the true system model, to guarantee collision-free overall motion. The tubes are of uniform size, which is directly proportional to the size of the model set containing the uncertain system parameter. The adaptive learning module guarantees a reduction in the diameter of the model set as well as in the parameter estimation error between the dynamic estimated parameter and the true system parameter. This directly implies a reduction in the size of the implemented tubes and guarantees that the utilized motion primitives go arbitrarily close to the resolution-optimal motion primitives associated with the true model of the system, thus significantly improving the overall motion planning performance over time. The efficiency of the motion planner is demonstrated by a suitable simulation example that considers a drone model represented by Euler-Lagrange dynamics containing uncertain parameters and operating within a cluttered environment.
The performance of Bayesian optimization (BO), a highly sample-efficient method for expensive black-box problems, is critically governed by the selection of its hyperparameters, including the kernel and acquisition functions. This presents a challenge: an inappropriate combination of these can lead to poor performance and wasted evaluations. While individual improvements to kernel functions (e.g., tree-based kernels, deep kernel learning) and acquisition functions (e.g., multi-step lookahead, tree-based planning) have been explored, the joint and autonomous selection of the best pair of these fundamental hyperparameters has been overlooked. This forces practitioners to rely on heuristics or costly manual training. We propose a simple yet effective framework, BOOST (Bayesian Optimization with Optimal Kernel and Acquisition Function Selection Technique), that automates this selection. BOOST utilizes a lightweight, offline evaluation stage to predict the performance of various kernel-acquisition function pairs and identify the most suitable configuration before expensive evaluations. BOOST partitions data-in-hand into two subsets: a reference subset and a query subset, and it prepares all possible kernel-acquisition pairs from the user's chosen candidates. For each configuration, BOOST conducts internal BO runs using the reference subset, evaluating how effectively each pair guides the search toward the optimum in the unknown query subset, thereby identifying the configuration with the best retrospective performance for future optimization. Experiments on both synthetic benchmark functions and real-world hyperparameter optimization tasks demonstrate that BOOST consistently outperforms standard BO approaches with fixed hyperparameters, highlighting its effectiveness and robustness in diverse problem landscapes.
Access to comprehensive flight operations data remains severely restricted in aviation due to commercial sensitivity and competitive considerations, hindering the development of predictive models for operational planning. This paper investigates whether synthetic data can effectively replace real operational data for training machine learning models in pre-tactical aviation scenarios, that is, predictions made hours to days before operations using only scheduled flight information. We evaluate four state-of-the-art synthetic data generators on three prediction tasks: aircraft turnaround time, departure delays, and arrival delays. Using a Train on Synthetic, Test on Real (TSTR) methodology on over 1.7 million European flight records, we first validate synthetic data quality through fidelity assessments, then assess both predictive performance and the preservation of operational relationships. Our results show that advanced neural network architectures, specifically transformer-based generators, can retain 94-97% of real-data predictive performance while maintaining feature importance patterns informative for operational decision-making. Our analysis reveals that even with real data, prediction accuracy is inherently limited when only scheduled information is available, establishing realistic baselines for pre-tactical forecasting. These findings suggest that high-quality synthetic data can enable broader access to aviation analytics capabilities while preserving commercial confidentiality, though stakeholders must maintain realistic expectations about pre-tactical prediction accuracy given the stochastic nature of flight operations.
This paper introduces CoralGuide, a novel framework designed for path planning and trajectory optimization for tethered multi-robot systems. We focus on marine robotic systems, which commonly use tethered configurations of an Autonomous Surface Vehicle (ASV) and an Autonomous Underwater Vehicle (AUV). CoralGuide provides safe navigation in marine environments by enhancing the A* algorithm with specialized heuristics tailored for tethered ASV-AUV systems. Our method integrates catenary curve modelling for tether management and employs Bezier curve interpolation for smoother trajectory planning, ensuring efficient and synchronized operations without compromising safety. Through simulations and real-world experiments, we have validated CoralGuide's effectiveness in improving path planning and trajectory optimization, demonstrating its potential to significantly enhance operational capabilities in marine research and infrastructure inspection.
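Bezier curve smoothing of a planned path can be sketched with de Casteljau's algorithm, which evaluates a Bezier curve by repeated linear interpolation between consecutive control points. Treating coarse planner waypoints directly as control points is an assumption made for illustration; the abstract does not say how CoralGuide chooses its control points.

```python
def bezier_point(control, t):
    """Evaluate a Bezier curve at parameter t in [0, 1] with de Casteljau's algorithm."""
    points = [tuple(p) for p in control]
    while len(points) > 1:
        # Interpolate between each consecutive pair, shrinking the list by one.
        points = [
            tuple((1 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(points, points[1:])
        ]
    return points[0]


def smooth_path(control, samples=5):
    """Sample the curve at evenly spaced parameter values (samples >= 2)."""
    return [bezier_point(control, i / (samples - 1)) for i in range(samples)]


# Hypothetical waypoints forming a sharp corner at (0, 2).
waypoints = [(0, 0), (0, 2), (2, 2)]
for point in smooth_path(waypoints):
    print(point)
```

The curve starts and ends at the first and last waypoints and rounds off the corner; at t = 0.5 it passes through (0.5, 1.5) rather than the corner itself.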
Medical image segmentation is crucial for disease diagnosis and treatment planning, yet developing robust segmentation models often requires substantial computational resources and large datasets. Existing research shows that pre-trained and finetuned foundation models can boost segmentation performance. However, questions remain about how particular image preprocessing steps may influence segmentation performance across different medical imaging modalities. In particular, edges (abrupt transitions in pixel intensity) are widely acknowledged as vital cues for object boundaries but have not been systematically examined in the pre-training of foundation models. We address this gap by investigating to what extent pre-training with data processed using computationally efficient edge kernels, such as Kirsch, can improve cross-modality segmentation capabilities of a foundation model. Two versions of a foundation model are first trained on either raw or edge-enhanced data across multiple medical imaging modalities, then finetuned on selected raw subsets tailored to specific medical modalities. After systematic investigation using the medical domains Dermoscopy, Fundus, Mammography, Microscopy, OCT, US, and XRay, we discover both increased and reduced segmentation performance across modalities using edge-focused pre-training, indicating the need for a selective application of this approach. To guide such selective applications, we propose a meta-learning strategy. It uses standard deviation and image entropy of the raw image to choose between a model pre-trained on edge-enhanced or on raw data for optimal performance. Our experiments show that integrating this meta-learning layer yields an overall segmentation performance improvement across diverse medical imaging tasks by 16.42% compared to models pre-trained on edge-enhanced data only and 19.30% compared to models pre-trained on raw data only.
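The Kirsch operator mentioned above is a compass edge detector: eight 3x3 kernels, each with three adjacent 5s on the outer ring and -3 elsewhere (0 in the centre), and the edge response at a pixel is the maximum over the eight kernel responses. A minimal pure-Python sketch follows. Leaving border pixels at 0 and the toy image are assumptions for illustration; the abstract does not specify how the authors applied the kernel.

```python
# Outer ring of a 3x3 kernel in clockwise order, starting at the top-left.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
BASE = [5, 5, 5, -3, -3, -3, -3, -3]


def kirsch_kernels():
    """Build the eight compass kernels by rotating the ring values in 45-degree steps."""
    kernels = []
    for shift in range(8):
        kernel = [[0] * 3 for _ in range(3)]
        for i, (r, c) in enumerate(RING):
            kernel[r][c] = BASE[(i - shift) % 8]
        kernels.append(kernel)
    return kernels


def kirsch_edges(image):
    """Maximum compass response at each interior pixel; border pixels stay 0."""
    rows, cols = len(image), len(image[0])
    kernels = kirsch_kernels()
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = max(
                sum(k[i][j] * image[r + i - 1][c + j - 1]
                    for i in range(3) for j in range(3))
                for k in kernels
            )
    return out


# Toy image with a vertical step edge between columns 1 and 2.
step = [[0, 0, 10, 10] for _ in range(4)]
for row in kirsch_edges(step):
    print(row)
```

Because every kernel sums to zero, flat regions produce no response, while pixels next to the intensity step respond strongly.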
We present a message passing approach to Expected Free Energy (EFE) minimization on factor graphs, based on the theory introduced in arXiv:2504.14898. By reformulating EFE minimization as Variational Free Energy minimization with epistemic priors, we transform a combinatorial search problem into a tractable inference problem solvable through standard variational techniques. Applying our message passing method to factorized state-space models enables efficient policy inference. We evaluate our method on environments with epistemic uncertainty: a stochastic gridworld and a partially observable Minigrid task. Agents using our approach consistently outperform conventional KL-control agents on these tasks, showing more robust planning and efficient exploration under uncertainty. In the stochastic gridworld environment, EFE-minimizing agents avoid risky paths, while in the partially observable minigrid setting, they conduct more systematic information-seeking. This approach bridges active inference theory with practical implementations, providing empirical evidence for the efficiency of epistemic priors in artificial agents.
Purpose: Radiation pneumonitis (RP) is a serious complication of intensity-modulated radiation therapy (IMRT) for breast cancer patients, underscoring the need for precise and explainable predictive models. This study presents an Explainable Dual-Omics Filtering (EDOF) model that integrates spatially localized dosiomic and radiomic features for voxel-level RP prediction. Methods: A retrospective cohort of 72 breast cancer patients treated with IMRT was analyzed, including 28 who developed RP. The EDOF model consists of two components: (1) dosiomic filtering, which extracts local dose intensity and spatial distribution features from planning dose maps, and (2) radiomic filtering, which captures texture-based features from pre-treatment CT scans. These features are jointly analyzed using the Explainable Boosting Machine (EBM), a transparent machine learning model that enables feature-specific risk evaluation. Model performance was assessed using five-fold cross-validation, reporting area under the curve (AUC), sensitivity, and specificity. Feature importance was quantified by mean absolute scores, and Partial Dependence Plots (PDPs) were used to visualize nonlinear relationships between RP risk and dual-omic features. Results: The EDOF model achieved strong predictive performance (AUC = 0.95 ± 0.01; sensitivity = 0.81 ± 0.05). The most influential features included dosiomic Intensity Mean, dosiomic Intensity Mean Absolute Deviation, and radiomic SRLGLE. PDPs revealed that RP risk increases beyond 5 Gy and rises sharply between 10 and 30 Gy, consistent with clinical dose thresholds. SRLGLE also captured structural heterogeneity linked to RP in specific lung regions. Conclusion: The EDOF framework enables spatially resolved, explainable RP prediction and may support personalized radiation planning to mitigate pulmonary toxicity.
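The three reported metrics can be made concrete with a short sketch. Sensitivity and specificity are computed from thresholded predictions, and AUC is computed as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one (ties count one half). The labels, scores, and the 0.5 threshold below are made-up illustration values, not data from the study.

```python
def sensitivity_specificity(labels, preds):
    """Return (sensitivity, specificity) for binary labels and binary predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.3, 0.6, 0.8, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]  # hypothetical threshold
print(sensitivity_specificity(labels, preds), auc(labels, scores))
```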
Radiotherapy treatment planning often relies on time-consuming, trial-and-error adjustments that heavily depend on the expertise of specialists, while existing deep learning methods face limitations in generalization, prediction accuracy, and clinical applicability. To tackle these challenges, we propose ADDiff-Dose, an Anatomical-Dose Dual Constraints Conditional Diffusion Model for end-to-end multi-tumor dose prediction. The model employs LightweightVAE3D to compress high-dimensional CT data and integrates multimodal inputs, including target and organ-at-risk (OAR) masks and beam parameters, within a progressive noise addition and denoising framework. It incorporates conditional features via a multi-head attention mechanism and utilizes a composite loss function combining MSE, conditional terms, and KL divergence to ensure both dosimetric accuracy and compliance with clinical constraints. Evaluation on a large-scale public dataset (2,877 cases) and three external institutional cohorts (450 cases in total) demonstrates that ADDiff-Dose significantly outperforms traditional baselines, achieving an MAE of 0.101-0.154 (compared to 0.316 for UNet and 0.169 for GAN models), a DICE coefficient of 0.927 (a 6.8% improvement), and limiting spinal cord maximum dose error to within 0.1 Gy. The average plan generation time per case is reduced to 22 seconds. Ablation studies confirm that the structural encoder enhances compliance with clinical dose constraints by 28.5%. To our knowledge, this is the first study to introduce a conditional diffusion model framework for radiotherapy dose prediction, offering a generalizable and efficient solution for automated treatment planning across diverse tumor sites, with the potential to substantially reduce planning time and improve clinical workflow efficiency.
World models have become increasingly popular as learned traffic simulators. Recent work has explored replacing traditional traffic simulators with world models for policy training. In this work, we explore the robustness of existing metrics for evaluating world models as traffic simulators, to see if the same metrics are suitable for evaluating a world model as a pseudo-environment for policy training. Specifically, we analyze the metametric employed by the Waymo Open Sim-Agents Challenge (WOSAC) and compare world model predictions on standard scenarios where the agents are fully or partially controlled by the world model (partial replay). Furthermore, since we are interested in evaluating the ego action-conditioned world model, we extend the standard WOSAC evaluation domain to include agents that are causal to the ego vehicle. Our evaluations reveal a significant number of scenarios where top-ranking models perform well under no perturbation but fail when the ego agent is forced to replay the original trajectory. To address these cases, we propose new metrics that highlight the sensitivity of world models to uncontrollable objects and evaluate their performance as pseudo-environments for policy training, and we analyze some state-of-the-art world models under these new metrics.
We present MUTE-DSS, a novel digital-twin-based decision support system for minimizing underwater radiated noise (URN) during ship voyage planning. It is a ROS2-centric framework that integrates state-of-the-art acoustic models, combining a semi-empirical reference spectrum for near-field modeling with 3D ray tracing of propagation losses for far-field modeling, and offers real-time computation of the ship noise signature alongside a data-driven Southern resident killer whale distribution model. The proposed DSS performs a two-stage optimization pipeline: Batch Informed Trees for collision-free ship routing and a genetic algorithm for adaptive ship speed profiling under voyage constraints that minimizes cumulative URN exposure to marine mammals. The effectiveness of MUTE-DSS is demonstrated through case studies of ships operating between the Strait of Georgia and the Strait of Juan de Fuca, comparing optimized voyages against baseline trajectories derived from automatic identification system data. Results show substantial reductions in noise exposure level, up to 7.14 dB, corresponding to approximately an 80.68% reduction in a simplified scenario, and an average 4.90 dB reduction, corresponding to approximately a 67.6% reduction in a more realistic dynamic setting. These results illustrate the adaptability and practical utility of the proposed decision support system.
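The percentages quoted alongside the decibel figures follow from the standard relation for energy-like quantities: a reduction of $\Delta$ dB leaves a fraction $10^{-\Delta/10}$ of the original, so the relative reduction is $1 - 10^{-\Delta/10}$. The sketch below assumes the exposure level is such a $10\log_{10}$ quantity, which is consistent with the numbers in the abstract.

```python
def db_to_fraction_reduction(db_reduction):
    """Fraction of an energy-like quantity removed by a reduction given in decibels."""
    return 1.0 - 10.0 ** (-db_reduction / 10.0)


for db in (7.14, 4.90):
    print(f"{db:.2f} dB -> {100 * db_to_fraction_reduction(db):.2f}% reduction")
```

Under this assumption, 7.14 dB corresponds to roughly 80.7% and 4.90 dB to roughly 67.6%.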
Video caching can significantly improve delivery efficiency and enhance the quality of video streaming, which constitutes the majority of wireless communication traffic. Due to limited cache size, caching strategies must be designed to adapt to dynamic user demand in order to maximize system revenue. The system revenue depends on the benefits of delivering the requested videos and the costs of (a) transporting the files to the users and (b) cache replacement. Since the cache content at any point in time impacts the replacement costs in the future, demand predictions over multiple cache placement slots become an important prerequisite for efficient cache planning. Motivated by this, we introduce a novel two-stage privacy-preserving solution for revenue optimization in wireless video caching networks. First, we train a Transformer using privacy-preserving federated learning (FL) to predict multi-slot future demands. Given that prediction results are never entirely accurate, especially for longer horizons, we further combine global content popularity with per-user prediction results to estimate the content demand distribution. Then, in the second stage, we leverage these estimation results to find caching strategies that maximize the long-term system revenue. This latter problem takes on the form of a multi-stage knapsack problem, which we then transform into an integer linear program. Our extensive simulation results demonstrate that (i) our FL solution delivers nearly identical performance to that of the ideal centralized solution and outperforms other existing caching methods, and (ii) our novel revenue optimization approach provides deeper system performance insights than traditional cache hit ratio (CHR)-based optimization approaches.
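To see why cache planning has a knapsack structure, consider a deliberately simplified single-slot version: each video has a size and an expected revenue, and the cache has a fixed capacity. The sketch below solves that 0/1 knapsack by dynamic programming. It ignores replacement costs and the coupling between slots, and it is not the integer linear program of the paper; the sizes and revenues are hypothetical.

```python
def plan_cache(sizes, revenues, capacity):
    """0/1 knapsack: choose videos maximizing revenue within the cache capacity.

    Returns (best_revenue, chosen_indices). Sizes and capacity are integers.
    """
    # best[c] = (revenue, chosen video indices) using at most c units of cache
    best = [(0, ())] * (capacity + 1)
    for i, (size, revenue) in enumerate(zip(sizes, revenues)):
        # Iterate capacities downward so each video is used at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + revenue
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + (i,))
    return best[capacity]


sizes = [2, 3, 4]      # hypothetical file sizes
revenues = [3, 4, 6]   # hypothetical expected revenues
print(plan_cache(sizes, revenues, capacity=5))
```

With these numbers, caching the first two videos (total size 5, revenue 7) beats caching the single most valuable one (size 4, revenue 6).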
Real-time streaming video understanding in domains such as autonomous driving and intelligent surveillance poses challenges beyond conventional offline video processing, requiring continuous perception, proactive decision making, and responsive interaction based on dynamically evolving visual content. However, existing methods rely on alternating perception-reaction or asynchronous triggers and lack task-driven planning and future anticipation, which limits their real-time responsiveness and proactive decision making in evolving video streams. To this end, we propose StreamAgent, which anticipates the temporal intervals and spatial regions expected to contain future task-relevant information to enable proactive and goal-driven responses. Specifically, we integrate question semantics and historical observations by prompting the anticipatory agent to anticipate the temporal progression of key events, align current observations with the expected future evidence, and subsequently adjust the perception action (e.g., attending to task-relevant regions or continuously tracking in subsequent frames). To enable efficient inference, we design a streaming KV-cache memory mechanism that constructs a hierarchical memory structure for selective recall of relevant tokens, enabling efficient semantic retrieval while reducing the overhead of storing all tokens in the traditional KV-cache. Extensive experiments on streaming and long video understanding tasks demonstrate that our method outperforms existing methods in response accuracy and real-time efficiency, highlighting its practical value for real-world streaming scenarios.
AI systems and technologies that can interact with humans in real time face a communication dilemma: when to offer assistance and how frequently. Overly frequent or contextually redundant assistance can cause users to disengage, undermining the long-term benefits of AI assistance. We introduce a cognitive modeling framework based on Partially Observable Markov Decision Processes (POMDPs) that addresses this timing challenge by inferring a user's latent cognitive state related to AI engagement over time. Additionally, our framework incorporates reasoning about the long-term effects of AI assistance, explicitly aiming to avoid actions that could lead the human user to disengage or deactivate the AI. A key component of our approach is counterfactual reasoning: at each time step, the AI considers how well the user would perform independently and weighs the potential boost in performance against the risk of diminishing engagement with the AI. Through simulations, we show that this adaptive strategy significantly outperforms baseline policies in which assistance is always provided or never provided. Our results highlight the importance of balancing short-term decision accuracy with sustained user engagement, showing how communication strategies can be optimized to avoid alert fatigue while preserving the user's receptiveness to AI guidance.
Deep sequence models have achieved notable success in time-series analysis, such as interpolation and forecasting. Recent advances move beyond discrete-time architectures like Recurrent Neural Networks (RNNs) toward continuous-time formulations such as the family of Neural Ordinary Differential Equations (Neural ODEs). Generally, they have shown that capturing the underlying dynamics is beneficial for generic tasks like interpolation, extrapolation, and classification. However, existing methods approximate the dynamics using unconstrained neural networks, which struggle to adapt reliably under distributional shifts. In this paper, we recast time-series problems as a continuous ODE-based optimal control problem. Rather than learning dynamics solely from data, we optimize control actions that steer ODE trajectories toward task objectives, bringing control-theoretical performance guarantees. To achieve this goal, we need to (1) design the appropriate control actions and (2) apply effective optimal control algorithms. As the actions should contain rich context information, we propose to employ a discrete-time model to process past sequences and generate actions, leading to a coordinate model that extracts long-term temporal features to modulate short-term continuous dynamics. During training, we apply model predictive control to plan multi-step future trajectories, minimize a task-specific cost, and greedily select the optimal current action. We show that, under mild assumptions, this multi-horizon optimization leads to exponential convergence to infinite-horizon solutions, indicating that the coordinate model can gain robust and generalizable performance. Extensive experiments on diverse time-series datasets validate our method's superior generalization and adaptability compared to state-of-the-art baselines.
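The receding-horizon idea behind model predictive control (plan several steps ahead, apply only the first action, then replan) can be sketched on a toy scalar ODE. This is a minimal illustration under stated assumptions, not the method of the abstract: the dynamics dx/dt = -x + u, the Euler discretization, the finite candidate action set, the quadratic cost, and all function names are hypothetical, and the exhaustive search stands in for the learned action generator.

```python
from itertools import product


def rollout(x, actions, target, dt=0.1):
    """Euler-integrate the toy ODE dx/dt = -x + u and accumulate a cost."""
    cost = 0.0
    for u in actions:
        x = x + dt * (-x + u)
        cost += (x - target) ** 2 + 0.01 * u ** 2  # tracking + effort penalty
    return x, cost


def mpc_step(x, target, candidates=(0.0, 1.0, 2.0), horizon=3, dt=0.1):
    """Search all action sequences over the horizon; return the first action."""
    best_seq = min(product(candidates, repeat=horizon),
                   key=lambda seq: rollout(x, seq, target, dt)[1])
    return best_seq[0]


def run(x0=0.0, target=1.0, steps=50):
    """Closed loop: replan at every step and apply only the first action."""
    x = x0
    for _ in range(steps):
        u = mpc_step(x, target)
        x, _ = rollout(x, [u], target)
    return x


if __name__ == "__main__":
    print(f"final state: {run():.3f}")
```

The exhaustive search costs `len(candidates) ** horizon` rollouts per step, so it only suits tiny action sets; it is used here purely to keep the example dependency-free.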
The preoperative planning of liver surgery relies on Couinaud segmentation from computed tomography (CT) images, to reduce the risk of bleeding and guide the resection procedure. Using 3D point-based representations, rather than voxelizing the CT volume, has the benefit of preserving the physical resolution of the CT. However, point-based representations need prior knowledge of the liver vessel structure, which is time consuming to acquire. Here, we propose a point-based method for Couinaud segmentation, without explicitly providing the prior liver vessel structure. To allow the model to learn this anatomical liver vessel structure, we add a graph reasoning module on top of the point features. This adds implicit anatomical information to the model, by learning affinities across point neighborhoods. Our method is competitive on the MSD and LiTS public datasets in Dice coefficient and average surface distance scores compared to four pioneering point-based methods. Our code is available at https://github.com/ZhangXiaotong015/GrPn.
Autonomous driving requires accurate scene understanding, including road geometry, traffic agents, and their semantic relationships. In online HD map generation scenarios, raster-based representations are well-suited to vision models but lack geometric precision, while graph-based representations retain structural detail but become unstable without precise maps. To harness the complementary strengths of both, we propose DiffSemanticFusion -- a fusion framework for multimodal trajectory prediction and planning. Our approach reasons over a semantic raster-fused BEV space, enhanced by a map diffusion module that improves both the stability and expressiveness of online HD map representations. We validate our framework on two downstream tasks: trajectory prediction and planning-oriented end-to-end autonomous driving. Experiments on real-world autonomous driving benchmarks, nuScenes and NAVSIM, demonstrate improved performance over several state-of-the-art methods. For the prediction task on nuScenes, we integrate DiffSemanticFusion with the online HD map informed QCNet, achieving a 5.1\% performance improvement. For end-to-end autonomous driving in NAVSIM, DiffSemanticFusion achieves state-of-the-art results, with a 15\% performance gain in NavHard scenarios. In addition, extensive ablation and sensitivity studies show that our map diffusion module can be seamlessly integrated into other vector-based approaches to enhance performance. All artifacts are available at https://github.com/SunZhigang7/DiffSemanticFusion.
We propose a novel Energy-Predictive Drone Service (EPDS) framework for efficient package delivery within a skyway network. The EPDS framework incorporates a formal modeling of an EPDS and an adaptive bidirectional Long Short-Term Memory (Bi-LSTM) machine learning model. This model predicts the energy status and stochastic arrival times of other drones operating in the same skyway network. Leveraging these predictions, we develop a heuristic optimization approach for composite drone services. This approach identifies the most time-efficient and energy-efficient skyway path and recharging schedule for each drone in the network. We conduct extensive experiments using a real-world drone flight dataset to evaluate the performance of the proposed framework.
Planning collision-free paths for a large group of agents is a challenging problem with numerous real-world applications. While recent advances in Multi-Agent Path Finding (MAPF) have shown promising progress, standard MAPF algorithms rely on simplified kinodynamic models, preventing agents from directly following the generated MAPF plan. To bridge this gap, we propose kinodynamic Temporal Plan Graph Planning (kTPG), a multi-agent speed optimization algorithm that efficiently refines a MAPF plan into a kinodynamically feasible plan while accounting for uncertainties and preserving collision-freeness. Building on kTPG, we propose Windowed kTPG (WinkTPG), a MAPF execution framework that incrementally refines MAPF plans using a window-based mechanism, dynamically incorporating agent information during execution to reduce uncertainty. Experiments show that WinkTPG can generate speed profiles for up to 1,000 agents in 1 second and improves solution quality by up to 51.7% over existing MAPF execution methods.
With growing interest in sustainable logistics, electric vehicle (EV)-based deliveries offer a promising alternative for urban distribution. However, EVs face challenges due to their limited battery capacity, requiring careful planning for recharging. This depends on factors such as the charging point (CP) availability, cost, proximity, and vehicles' state of charge (SoC). We propose CARGO, a framework addressing the EV-based delivery route planning problem (EDRP), which jointly optimizes route planning and charging for deliveries within time windows. After proving the problem's NP-hardness, we propose a mixed integer linear programming (MILP)-based exact solution and a computationally efficient heuristic method. Using real-world datasets, we evaluate our methods by comparing the heuristic to the MILP solution, and benchmarking it against baseline strategies, Earliest Deadline First (EDF) and Nearest Delivery First (NDF). The results show up to 39% and 22% reductions in the charging cost over EDF and NDF, respectively, while completing comparable deliveries.
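The two baseline strategies named above can be sketched as simple ordering rules. The abstract does not define them in detail, so the interpretations below are assumptions: Earliest Deadline First sorts deliveries by deadline, and Nearest Delivery First repeatedly visits the closest remaining delivery. Charging decisions and time-window feasibility are ignored, and the dictionary fields are hypothetical.

```python
import math


def edf_order(deliveries):
    """Earliest Deadline First: serve deliveries in order of deadline."""
    return sorted(deliveries, key=lambda d: d["deadline"])


def ndf_order(deliveries, start=(0.0, 0.0)):
    """Nearest Delivery First: greedily visit the closest remaining delivery."""
    remaining = list(deliveries)
    position, order = start, []
    while remaining:
        nearest = min(remaining, key=lambda d: math.dist(position, d["loc"]))
        remaining.remove(nearest)
        order.append(nearest)
        position = nearest["loc"]
    return order


if __name__ == "__main__":
    deliveries = [
        {"id": "a", "loc": (5.0, 0.0), "deadline": 10},
        {"id": "b", "loc": (1.0, 0.0), "deadline": 30},
        {"id": "c", "loc": (2.0, 0.0), "deadline": 20},
    ]
    print([d["id"] for d in edf_order(deliveries)])
    print([d["id"] for d in ndf_order(deliveries)])
```

Neither rule accounts for state of charge or charging cost, which is the gap a joint routing-and-charging formulation is meant to close.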