Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.
The report presents a new initiative for the development of a network of ground-based stations for solar observations in the optical range. Three separate locations in Bulgaria have installed instrumentation for solar dedicated observations. The currently used telescopes (type, mounting, guiding systems) and designated filters (white-light, H-alpha) are described in detail. Test images are also included. Future plans for improvements are briefly discussed.
The current planned space-based gravitational-wave detectors require a bidirectional optical connection, referred to as Backlink, between two adjacent optical benches to provide a mutual phase reference for the local interferometric measurements. However, if the Backlink shows asymmetry between the two propagation directions, the effective optical pathlengths of the counter-propagating beams can introduce a differential phase noise, called non-reciprocity, into the main interferometric measurement that will limit the achievable accuracy in time-delay interferometry (TDI) post-processing. Hence, it is important to understand the properties of the Backlink to ensure that it will not compromise the interferometric detection. The Three-Backlink Experiment (3BL), which consists of an optical test facility with two rotatable benches, was designed under the Laser Interferometer Space Antenna (LISA) framework to study the performance of three Backlink configurations: two fiber-based and one free-beam scheme. In this paper, we report recent experimental results from the 3BL. We describe the commissioning and the subsequent noise mitigation. We achieve a setup noise floor below $1\text{ pm}\sqrt{\text{Hz}}$ across most of the LISA measurement band, and provide an understanding of the current technical limitations. With this low-noise baseline, we measured the performance of the three Backlink implementations under non-rotational conditions. We show that all three Backlinks reach sub-picometer non-reciprocity levels across most of the frequency band, with the remaining part dominated by the mentioned testbed noise. This enabled us to conduct a preliminary study of the Backlink inherent noise, where we emphasized on the backscatter noise intrinsic to a straightforward fiber-based Backlink, as this is the current baseline for LISA.
Comprehensive environment perception is essential for autonomous vehicles to operate safely. It is crucial to detect both dynamic road users and static objects like traffic signs or lanes as these are required for safe motion planning. However, in many circumstances a complete perception of other objects or lanes is not achievable due to limited sensor ranges, occlusions, and curves. In scenarios where an accurate localization is not possible or for roads where no HD maps are available, an autonomous vehicle must rely solely on its perceived road information. Thus, extending local sensing capabilities through collective perception using vehicle-to-vehicle communication is a promising strategy that has not yet been explored for lane detection. Therefore, we propose a real-time capable approach for collective perception of lanes using a spline-based estimation of undetected road sections. We evaluate our proposed fusion algorithm in various situations and road types. We were able to achieve real-time capability and extend the perception range by up to 200%.
Scalable Vector Graphics (SVG) are central to modern web design, and the demand to animate them continues to grow as web environments become increasingly dynamic. Yet automating the animation of vector graphics remains challenging for vision-language models (VLMs) despite recent progress in code generation and motion planning. VLMs routinely mis-handle SVGs, since visually coherent parts are often fragmented into low-level shapes that offer little guidance of which elements should move together. In this paper, we introduce a framework that recovers the semantic structure required for reliable SVG animation and reveals the missing layer that current VLM systems overlook. This is achieved through a statistical aggregation of multiple weak part predictions, allowing the system to stably infer semantics from noisy predictions. By reorganizing SVGs into semantic groups, our approach enables VLMs to produce animations with far greater coherence. Our experiments demonstrate substantial gains over existing approaches, suggesting that semantic recovery is the key step that unlocks robust SVG animation and supports more interpretable interactions between VLMs and vector graphics.
We investigate a forecasting-driven, incentive-compatible service provisioning framework for distributed air-ground integrated edge networks under human-machine coexistence. We consider hybrid players where the computing capacity of edge servers (ESs) are augmented by vehicular-UAV agent pairs (AgPs) that can be proactively dispatched to overloaded hotspots. Unique key challenges should be addressed: highly uncertain spatio-temporal workloads, spatio-temporal coupling between road traffic and UAV capacity, forecast-driven contracting risks, and heterogeneous quality-of-service (QoS) requirements of human users (HuUs) and machine users (MaUs). To cope with these issues, we propose FUSION, a two-stage framework that tightly couples service demand prediction, agent deployment, and task scheduling. In the offline stage, a Pro-LNN module performs intelligent multi-step spatio-temporal demand forecasting at distributed ESs, whose outputs are exploited by an enhanced ant colony optimization-based routing scheme (eACO-VRP) and an auction-based incentive-compatible contracting mechanism (Off-AIC^2), to jointly determine ES-AgP contracts and pre-planned service routes. In the online stage, we formulate congestion-aware task scheduling as an potential game among HuUs, MaUs, and heterogeneous ES/UAV providers, and devise a potential-guided best-response dynamics (PG-BR) algorithm that provably converges to a pure-strategy Nash equilibrium. Extensive experiments on both synthetic and real-world traces show that FUSION significantly improves social welfare, latency, resource utilization, and robustness compared with state-of-the-art baselines.
Human communication is inherently multimodal and social: words, prosody, and body language jointly carry intent. Yet most prior systems model human behavior as a translation task co-speech gesture or text-to-motion that maps a fixed utterance to motion clips-without requiring agentic decision-making about when to move, what to do, or how to adapt across multi-turn dialogue. This leads to brittle timing, weak social grounding, and fragmented stacks where speech, text, and motion are trained or inferred in isolation. We introduce ViBES (Voice in Behavioral Expression and Synchrony), a conversational 3D agent that jointly plans language and movement and executes dialogue-conditioned body actions. Concretely, ViBES is a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone: modality-partitioned transformer experts for speech, facial expression, and body motion. The model processes interleaved multimodal token streams with hard routing by modality (parameters are split per expert), while sharing information through cross-expert attention. By leveraging strong pretrained speech-language models, the agent supports mixed-initiative interaction: users can speak, type, or issue body-action directives mid-conversation, and the system exposes controllable behavior hooks for streaming responses. We further benchmark on multi-turn conversation with automatic metrics of dialogue-motion alignment and behavior quality, and observe consistent gains over strong co-speech and text-to-motion baselines. ViBES goes beyond "speech-conditioned motion generation" toward agentic virtual bodies where language, prosody, and movement are jointly generated, enabling controllable, socially competent 3D interaction. Code and data will be made available at: ai.stanford.edu/~juze/ViBES/
We consider the problem of cooperative manipulation by a mobile multi-manipulator system operating in obstacle-cluttered and highly constrained environments under spatio-temporal task specifications. The task requires transporting a grasped object while respecting both continuous robot dynamics and discrete geometric constraints arising from obstacles and narrow passages. To address this hybrid structure, we propose a multi-rate planning and control framework that combines offline generation of an STL-satisfying object trajectory and collision-free base footprints with online constrained inverse kinematics and continuous-time feedback control. The resulting closed-loop system enables coordinated reconfiguration of multiple manipulators while tracking the desired object motion. The approach is evaluated in high-fidelity physics simulations using three Franka Emika Panda mobile manipulators rigidly grasping an object.
Large-scale AI data center portfolios procure identical SKUs across geographically heterogeneous campuses, yet finance and operations require a single system-level 'world price' per SKU for budgeting and planning. A common practice is deployment-weighted blending of campus prices, which preserves total cost but can trigger Simpson-type aggregation failures: heterogeneous location mixes can reverse SKU rankings and distort decision signals. I formalize cost-preserving blended pricing under location heterogeneity and propose two practical operators that reconcile accounting identity with ranking robustness and production implementability. A two-way fixed-effects operator separates global SKU effects from campus effects and restores exact cost preservation via scalar normalization, providing interpretable decomposition and smoothing under mild missingness. A convex common-weight operator computes a single set of campus weights under accounting constraints to enforce a location-robust benchmark and prevent dominance reversals; I also provide feasibility diagnostics and a slack-based fallback for extreme mix conditions. Simulations and an AI data center OPEX illustration show substantial reductions in ranking violations relative to naive blending while maintaining cost accuracy, with scalable distributed implementation.
Industrial human-robot collaboration requires motion planning that is collision-free, responsive, and ergonomically safe to reduce fatigue and musculoskeletal risk. We propose the Configuration Space Ergonomic Field (CSEF), a continuous and differentiable field over the human joint space that quantifies ergonomic quality and provides gradients for real-time ergonomics-aware planning. An efficient algorithm constructs CSEF from established metrics with joint-wise weighting and task conditioning, and we integrate it into a gradient-based planner compatible with impedance-controlled robots. In a 2-DoF benchmark, CSEF-based planning achieves higher success rates, lower ergonomic cost, and faster computation than a task-space ergonomic planner. Hardware experiments with a dual-arm robot in unimanual guidance, collaborative drilling, and bimanual cocarrying show faster ergonomic cost reduction, closer tracking to optimized joint targets, and lower muscle activation than a point-to-point baseline. CSEF-based planning method reduces average ergonomic scores by up to 10.31% for collaborative drilling tasks and 5.60% for bimanual co-carrying tasks while decreasing activation in key muscle groups, indicating practical benefits for real-world deployment.
The ability to adapt to changing environments is crucial for the autonomous navigation systems of Unmanned Aerial Vehicles (UAVs). However, existing navigation systems adopt fixed execution configurations without considering environmental dynamics based on available computing resources, e.g., with a high execution frequency and task workload. This static approach causes rigid flight strategies and excessive computations, ultimately degrading flight performance or even leading to failures in UAVs. Despite the necessity for an adaptive system, dynamically adjusting workloads remains challenging, due to difficulties in quantifying environmental complexity and modeling the relationship between environment and system configuration. Aiming at adapting to dynamic environments, this paper proposes E-Navi, an environmental-adaptive navigation system for UAVs that dynamically adjusts task executions on the CPUs in response to environmental changes based on available computational resources. Specifically, the perception-planning pipeline of UAVs navigation system is redesigned through dynamic adaptation of mapping resolution and execution frequency, driven by the quantitative environmental complexity evaluations. In addition, E-Navi supports flexible deployment across hardware platforms with varying levels of computing capability. Extensive Hardware-In-the-Loop and real-world experiments demonstrate that the proposed system significantly outperforms the baseline method across various hardware platforms, achieving up to 53.9% navigation task workload reduction, up to 63.8% flight time savings, and delivering more stable velocity control.
World models have shown great utility in improving the task performance of embodied agents. While prior work largely focuses on pixel-space world models, these approaches face practical limitations in GUI settings, where predicting complex visual elements in future states is often difficult. In this work, we explore an alternative formulation of world modeling for GUI agents, where state transitions are described in natural language rather than predicting raw pixels. First, we introduce MobileWorldBench, a benchmark that evaluates the ability of vision-language models (VLMs) to function as world models for mobile GUI agents. Second, we release MobileWorld, a large-scale dataset consisting of 1.4M samples, that significantly improves the world modeling capabilities of VLMs. Finally, we propose a novel framework that integrates VLM world models into the planning framework of mobile agents, demonstrating that semantic world models can directly benefit mobile agents by improving task success rates. The code and dataset is available at https://github.com/jacklishufan/MobileWorld
This work shows how adaptivity can enhance value realization of digital twins in civil engineering. We focus on adapting the state transition models within digital twins represented through probabilistic graphical models. The bi-directional interaction between the physical and virtual domains is modeled using dynamic Bayesian networks. By treating state transition probabilities as random variables endowed with conjugate priors, we enable hierarchical online learning of transition dynamics from a state to another through effortless Bayesian updates. We provide the mathematical framework to account for a larger class of distributions with respect to the current literature. To compute dynamic policies with precision updates we solve parametric Markov decision processes through reinforcement learning. The proposed adaptive digital twin framework enjoys enhanced personalization, increased robustness, and improved cost-effectiveness. We assess our approach on a case study involving structural health monitoring and maintenance planning of a railway bridge.
Covariate adjustment is an approach to improve the precision of trial analyses by adjusting for baseline variables that are prognostic of the primary endpoint. Motivated by the SEARCH Universal HIV Test-and-Treat Trial (2013-2017), we tell our story of developing, evaluating, and implementing a machine learning-based approach for covariate adjustment. We provide the rationale for as well as the practical concerns with such an approach for estimating marginal effects. Using schematics, we illustrate our procedure: targeted machine learning estimation (TMLE) with Adaptive Pre-specification. Briefly, sample-splitting is used to data-adaptively select the combination of estimators of the outcome regression (i.e., the conditional expectation of the outcome given the trial arm and covariates) and known propensity score (i.e., the conditional probability of being randomized to the intervention given the covariates) that minimizes the cross-validated variance estimate and, thereby, maximizes empirical efficiency. We discuss our approach for evaluating finite sample performance with parametric and plasmode simulations, pre-specifying the Statistical Analysis Plan, and unblinding in real-time on video conference with our colleagues from around the world. We present the results from applying our approach in the primary, pre-specified analysis of 8 recently published trials (2022-2024). We conclude with practical recommendations and an invitation to implement our approach in the primary analysis of your next trial.
Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative ``plan-and-infill'' decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18$\times$ speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33$\times$ average speedup.
The ability to perform multi-modal multi-hop reasoning by iteratively integrating information across various modalities and external knowledge is critical for addressing complex real-world challenges. However, existing Multi-modal Large Language Models (MLLMs) are predominantly limited to single-step reasoning, as existing benchmarks lack the complexity needed to evaluate and drive multi-hop abilities. To bridge this gap, we introduce MMhops, a novel, large-scale benchmark designed to systematically evaluate and foster multi-modal multi-hop reasoning. MMhops dataset comprises two challenging task formats, Bridging and Comparison, which necessitate that models dynamically construct complex reasoning chains by integrating external knowledge. To tackle the challenges posed by MMhops, we propose MMhops-R1, a novel multi-modal Retrieval-Augmented Generation (mRAG) framework for dynamic reasoning. Our framework utilizes reinforcement learning to optimize the model for autonomously planning reasoning paths, formulating targeted queries, and synthesizing multi-level information. Comprehensive experiments demonstrate that MMhops-R1 significantly outperforms strong baselines on MMhops, highlighting that dynamic planning and multi-modal knowledge integration are crucial for complex reasoning. Moreover, MMhops-R1 demonstrates strong generalization to tasks requiring fixed-hop reasoning, underscoring the robustness of our dynamic planning approach. In conclusion, our work contributes a challenging new benchmark and a powerful baseline model, and we will release the associated code, data, and weights to catalyze future research in this critical area.
The large-scale deployment of renewable energy sources, particularly offshore wind, requires large-scale transmission grid expansion projects to transmit the produced low-carbon power to the main demand centers. However, the planning and design of such complex projects currently lack a transparent and systematic process that system operators can follow when considering such investments in their grids. This paper identifies and classifies the main technical design constraints and considerations relevant to the planning of transmission grid expansion projects, and more specifically, electrical energy hubs. Seven key areas of interest are identified, namely network integration, HVDC technologies, costs (CAPEX, OPEX, and space requirements), electricity market design, future proofness and modular expandability, reliability-availability-maintainability, and sustainability. Each area of interest is analyzed in terms of its technical and operational relevance, with technical design constraints and considerations derived from such analysis. In addition, a hierarchical classification of the identified constraints and considerations (and therefore areas of interest) is introduced, distinguishing them between three criticality classes, namely hard constraints, main drivers, and key considerations. The dependencies between the different areas are discussed, too. Therefore, this work provides system operators and policymakers with a structured basis to support a transparent planning methodology with clear decision hierarchies for investments in transmission grid expansion projects.
Renewable energy sources play a major role in future net-zero energy systems. However, achieving energy system resilience remains challenging, since renewables depend on weather fluctuations, and future energy systems are subject to major design uncertainty. Existing literature mostly treats these types of uncertainty separately. Therefore, the assessment of uncertainties surrounding climate change and energy system design, and particularly their interactions, is insufficiently understood. To close this gap, we evaluate net load to assess energy system stress without relying on perfect foresight, while maintaining temporal and spatial correlations of the climate system. Net load is calculated from hourly historical and future climate model data translated to energy variables. To scope the extent of plausible energy systems, we consider eight different design scenarios inspired by the European Ten-Year Network Development Plan (TYNDP) and different levels of transmission expansion. We find that climate change impacts on net load are highly sensitive to the energy system design, implying that energy systems can be designed so that they are either hindered or helped by climate change. Furthermore, within a system scenario, climate change can change the frequency and seasonality of high net load events and their technological and meteorological composition. Wind-dominated systems with currently electrified heating levels, for instance, feature a 30% increase of high net load events under climate change, mostly in summer and fall, while fully electrified net zero systems are impacted by high net load events in winter and spring, which decrease by 50% with climate change. Our work thus calls for a wider perspective on energy-climate stress that captures the non-linear interactions of climate change and system design uncertainty, thereby overcoming the current focus on cold Dunkelflauten.
Reinforcement learning (RL) has achieved great success in dexterous grasping, significantly improving grasp performance and generalization from simulation to the real world. However, fine-grained functional grasping, which is essential for downstream manipulation tasks, remains underexplored and faces several challenges: the complexity of specifying goals and reward functions for functional grasps across diverse objects, the difficulty of multi-task RL exploration, and the challenge of sim-to-real transfer. In this work, we propose DemoFunGrasp for universal dexterous functional grasping. We factorize functional grasping conditions into two complementary components - grasping style and affordance - and integrate them into an RL framework that can learn to grasp any object with any functional grasping condition. To address the multi-task optimization challenge, we leverage a single grasping demonstration and reformulate the RL problem as one-step demonstration editing, substantially enhancing sample efficiency and performance. Experimental results in both simulation and the real world show that DemoFunGrasp generalizes to unseen combinations of objects, affordances, and grasping styles, outperforming baselines in both success rate and functional grasping accuracy. In addition to strong sim-to-real capability, by incorporating a vision-language model (VLM) for planning, our system achieves autonomous instruction-following grasp execution.
While existing generation and unified models excel at general image generation, they struggle with tasks requiring deep reasoning, planning, and precise data-to-visual mapping abilities beyond general scenarios. To push beyond the existing limitations, we introduce a new and challenging task: creative table visualization, requiring the model to generate an infographic that faithfully and aesthetically visualizes the data from a given table. To address this challenge, we propose ShowTable, a pipeline that synergizes MLLMs with diffusion models via a progressive self-correcting process. The MLLM acts as the central orchestrator for reasoning the visual plan and judging visual errors to provide refined instructions, the diffusion execute the commands from MLLM, achieving high-fidelity results. To support this task and our pipeline, we introduce three automated data construction pipelines for training different modules. Furthermore, we introduce TableVisBench, a new benchmark with 800 challenging instances across 5 evaluation dimensions, to assess performance on this task. Experiments demonstrate that our pipeline, instantiated with different models, significantly outperforms baselines, highlighting its effective multi-modal reasoning, generation, and error correction capabilities.
This paper presents PyCAALP (Python-based Computer-Aided Assembly Line Planning), a framework for automated Assembly Sequence Planning (ASP) and Production Line Planning (PLP), employing a graph-based approach to model components and joints within production modules. The framework integrates kinematic boundary conditions, such as potential part collisions, to guarantee the feasibility of automated assembly planning. The developed algorithm computes all feasible production sequences, integrating modules for detecting spatial relationships and formulating geometric constraints. The algorithm incorporates additional attributes, including handling feasibility, tolerance matching, and joint compatibility, to manage the high combinatorial complexity inherent in assembly sequence generation. Heuristics, such as Single-Piece Flow assembly and geometrical constraint enforcement, are utilized to further refine the solution space, facilitating more efficient planning for complex assemblies. The PLP stage is formulated as a Mixed-Integer Program (MIP), balancing the total times of a fixed number of manufacturing stations. While some complexity reduction techniques may sacrifice optimality, they significantly reduce the MIPs computational time. Furthermore, the framework enables customization of engineering constraints and supports a flexible trade-off between ASP and PLP. The open-source nature of the framework, available at https://github.com/TUM-utg/PyCAALP, promotes further collaboration and adoption in both industrial and production research applications.
Transmission and eclipse spectroscopy have been invaluable tools for the characterisation of extrasolar planet atmospheres. While they will continue to provide many new insights and discoveries in the decade(s) to come, these methods are running up against sources of stellar noise from stellar surface inhomogeneities and variability. In this white paper we discuss how the next steps in the characterisation of small, temperate rocky planets requires high-contrast imaging, making the planetary systems around our closest neighbouring stars the new frontier in exoplanet science. The Extremely Large Telescopes (ELTs) will be at the forefront of this quest. The Planetary Camera and Spectrograph (PCS) on ESO's ELT and GmagAO-X on the GMT are planned to become operational in the 2035-2040 time-frame, allowing the characterisation of up to dozen(s) of rocky planets around nearby red dwarf stars. We discuss what role there will be still to play for ground-based exoplanet characterisation in the era of the space-borne Habitable Worlds Observatory and LIFE missions.
Laser Wakefield Acceleration (LWFA) is a promising approach for producing high-brightness electron beams in the GeV energy range, offering significant potential for compact next-generation accelerator facilities. In this work, we present a computational study of LWFA in a specially designed single-stage capillary gas-cell target aimed at producing high-quality, GeV-class electron beams. The capillary cell includes a short (~2 mm) injection region at the entrance filled with a helium (He) and nitrogen (N2 ) gas mixture. This is followed by a longer (~12 mm) pure He section, which provides the required acceleration length and limits continuous ionization injection, thereby significantly reducing the energy spread of the accelerated beam. Hydrodynamic simulations are performed to optimize the capillary geometry and generate the required two-section gas-pressure profile. The resulting gas-density distributions for various cases are then directly incorporated in Particle-In-Cell (PIC) simulations to study LWFA. In particular, our hydrodynamic simulations demonstrate how tailored density profiles with longitudinal density tapering in the acceleration section can be realized in a capillary gas cell, while the corresponding PIC simulations reveal how these profiles influence the acceleration process and the resulting beam quality. Using a 100 TW-class laser system with parameters relevant to the L2-DUHA laser at the ELI Beamlines Facility, the PIC results demonstrate electron acceleration to mean energies exceeding 1.0 GeV with high-quality beam properties. Self-injected He electrons are also observed, and their impact on the main beam quality is evaluated. The findings of this study provide valuable insights for upcoming LWFA experiments planned within the EuPRAXIA Project at the ELI Beamlines Facility.
Driven by the ongoing energy transition, shared mobility service providers are emerging actors in electrical power systems which aim to shift combustion-based mobility to electric paradigm. In the meantime, Energy Communities are deployed to enhance the local usage of distributed renewable production. As both ators share the same goal of satisfying the demand at the lowest cost, they could take advantage of their complementarity and coordinate their decisions to enhance each other operation. This paper presents an original Mixed-Integer Second Order Cone Programming long-term Electric Vehicle fleet planning optimization problem that integrates the coordination with a Renewable Energy Community and Vehicle-to-Grid capability. This model is used to assess the economic, energy, and grid performances of their collaboration in a 21 buses low-voltage distribution network. Key results show that, both actors coordination can help reducing the yearly cost up to 11.3 % compared to their stand-alone situation and that it may reduce the stress on the substation transformer by 46 % through the activation of the inherent EVs flexibility when subject to peak penalties from the grid operator.
Imitation learning (IL) has emerged as a central paradigm in autonomous driving. While IL excels in matching expert behavior in open-loop settings by minimizing per-step prediction errors, its performance degrades unexpectedly in closed-loop due to the gradual accumulation of small, often imperceptible errors over time.Over successive planning cycles, these errors compound, potentially resulting in severe failures.Current research efforts predominantly rely on increasingly sophisticated network architectures or high-fidelity training datasets to enhance the robustness of IL planners against error accumulation, focusing on the state-level robustness at a single time point. However, autonomous driving is inherently a continuous-time process, and leveraging the temporal scale to enhance robustness may provide a new perspective for addressing this issue.To this end, we propose a method termed Sequence of Experts (SoE), a temporal alternation policy that enhances closed-loop performance without increasing model size or data requirements. Our experiments on large-scale autonomous driving benchmarks nuPlan demonstrate that SoE method consistently and significantly improves the performance of all the evaluated models, and achieves state-of-the-art performance.This module may provide a key and widely applicable support for improving the training efficiency of autonomous driving models.
Diffusion models have recently emerged as powerful tools for robot motion planning by capturing the multi-modal distribution of feasible trajectories. However, their extension to multi-robot settings with flexible, language-conditioned task specifications remains limited. Furthermore, current diffusion-based approaches incur high computational cost during inference and struggle with generalization because they require explicit construction of environment representations and lack mechanisms for reasoning about geometric reachability. To address these limitations, we present Language-Conditioned Heat-Inspired Diffusion (LCHD), an end-to-end vision-based framework that generates language-conditioned, collision-free trajectories. LCHD integrates CLIP-based semantic priors with a collision-avoiding diffusion kernel serving as a physical inductive bias that enables the planner to interpret language commands strictly within the reachable workspace. This naturally handles out-of-distribution scenarios -- in terms of reachability -- by guiding robots toward accessible alternatives that match the semantic intent, while eliminating the need for explicit obstacle information at inference time. Extensive evaluations on diverse real-world-inspired maps, along with real-robot experiments, show that LCHD consistently outperforms prior diffusion-based planners in success rate, while reducing planning latency.
The scaling of computation throughput continues to outpace improvements in memory bandwidth, making many deep learning workloads memory-bound. Kernel fusion is a key technique to alleviate this problem, but the fusion strategies of existing compilers and frameworks are limited to using local scratchpad memory. When the intermediate results exceed the limited capacity (such as FFN), the fusion fails. Although modern GPUs (like the NVIDIA H100) now incorporate an inter-core connection mechanism known as Distributed Shared Memory(DSM)--providing a larger, high-bandwidth, and low-latency on-chip memory pool--this hardware potential has yet to be exploited by software frameworks. To bridge this gap, we present FlashFuser, the first compiler framework to utilize inter-core connection for kernel fusion on modern GPUs. FlashFuser extends established fusion techniques to the DSM domain through three core contributions. First, we propose a powerful DSM-based communication abstraction that formalizes complex cluster-based data exchange patterns, such as reduce, shuffle and multiply. Second, we introduce a dataflow analyzer that generalizes loop scheduling, resource mapping, and tile selection to the distributed memory hierarchy; it determines the optimal execution order and tile sizes by quantifying data movement across memory levels. Finally, FlashFuser integrates these components into a unified search engine that employs analytical cost modeling and DSM-aware pruning strategies to efficiently discover the optimal execution plan. Our evaluation on an NVIDIA H100 GPU shows that FlashFuser reduces memory access by 58% and delivers kernel speedups of 3.3x against highly-tuned libraries and 4.1x against state-of-the-art compilers, resulting in a 1.24x end-to-end speedup.
This paper presents a method to predict the evolution of a complex traffic scenario with multiple objects. The current state of the scenario is assumed to be known from sensors and the prediction is taking into account various hypotheses about the behavior of traffic participants. This way, the uncertainties regarding the behavior of traffic participants can be modelled in detail. In the first part of this paper a model-based approach is presented to compute Predicted-Occupancy Grids (POG), which are introduced as a grid-based probabilistic representation of the future scenario hypotheses. However, due to the large number of possible trajectories for each traffic participant, the model-based approach comes with a very high computational load. Thus, a machine-learning approach is adopted for the computation of POGs. This work uses a novel grid-based representation of the current state of the traffic scenario and performs the mapping to POGs. This representation consists of augmented cells in an occupancy grid. The adopted machine-learning approach is based on the Random Forest algorithm. Simulations of traffic scenarios are performed to compare the machine-learning with the model-based approach. The results are promising and could enable the real-time computation of POGs for vehicle safety applications. With this detailed modelling of uncertainties, crucial components in vehicle safety systems like criticality estimation and trajectory planning can be improved.
Although multi-modal large language models (MLLMs) have shown strong capabilities across diverse domains, their application in generating fine-grained 3D perception and prediction outputs in autonomous driving remains underexplored. In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework that is also compatible with vision-action (VA) models. Our method jointly performs spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through end-to-end optimization. To obtain both precise geometric information and rich visual appearance, our approach integrates point clouds, multi-view images, and language instructions within a unified MLLM architecture. We further develop a data engine to generate text-occupancy and text-flow QA pairs for 4D spatial understanding. Remarkably, with only a 0.5B Qwen2.5 model as MLLM backbone, DrivePI as a single unified model matches or exceeds both existing VLA models and specialized VA models. Specifically, compared to VLA models, DrivePI outperforms OpenDriveVLA-7B by 2.5% mean accuracy on nuScenes-QA and reduces collision rate by 70% over ORION (from 0.37% to 0.11%) on nuScenes. Against specialized VA models, DrivePI surpasses FB-OCC by 10.3 RayIoU for 3D occupancy on OpenOcc, reduces the mAVE from 0.591 to 0.509 for occupancy flow on OpenOcc, and achieves 32% lower L2 error than VAD (from 0.72m to 0.49m) for planning on nuScenes. Code will be available at https://github.com/happinesslz/DrivePI
Physics-aware driving world model is essential for drive planning, out-of-distribution data synthesis, and closed-loop evaluation. However, existing methods often rely on a single diffusion model to directly map driving actions to videos, which makes learning difficult and leads to physically inconsistent outputs. To overcome these challenges, we propose GenieDrive, a novel framework designed for physics-aware driving video generation. Our approach starts by generating 4D occupancy, which serves as a physics-informed foundation for subsequent video generation. 4D occupancy contains rich physical information, including high-resolution 3D structures and dynamics. To facilitate effective compression of such high-resolution occupancy, we propose a VAE that encodes occupancy into a latent tri-plane representation, reducing the latent size to only 58% of that used in previous methods. We further introduce Mutual Control Attention (MCA) to accurately model the influence of control on occupancy evolution, and we jointly train the VAE and the subsequent prediction module in an end-to-end manner to maximize forecasting accuracy. Together, these designs yield a 7.2% improvement in forecasting mIoU at an inference speed of 41 FPS, while using only 3.47 M parameters. Additionally, a Normalized Multi-View Attention is introduced in the video generation model to generate multi-view driving videos with guidance from our 4D occupancy, significantly improving video quality with a 20.7% reduction in FVD. Experiments demonstrate that GenieDrive enables highly controllable, multi-view consistent, and physics-aware driving video generation.