Multi-agent reinforcement learning (MARL) is a powerful paradigm for solving cooperative and competitive decision-making problems. While many MARL benchmarks have been proposed, few combine continuous state and action spaces with challenging coordination and planning tasks. We introduce CAMAR, a new MARL benchmark designed explicitly for multi-agent pathfinding in environments with continuous actions. CAMAR supports cooperative and competitive interactions between agents and runs efficiently at up to 100,000 environment steps per second. We also propose a three-tier evaluation protocol to better track algorithmic progress and enable deeper analysis of performance. In addition, CAMAR allows the integration of classical planning methods such as RRT and RRT* into MARL pipelines. We use them as standalone baselines and combine RRT* with popular MARL algorithms to create hybrid approaches. We provide a suite of test scenarios and benchmarking tools to ensure reproducibility and fair comparison. Experiments show that CAMAR presents a challenging and realistic testbed for the MARL community.
With the rapid advancement of vehicular communication and autonomous driving technologies, connected vehicle platoon has emerged as a promising approach to improve traffic efficiency and driving safety. Reliable Vehicle-to-Vehicle (V2V) communication is critical to achieving efficient cooperative control. However, in real-world traffic environments, V2V links may suffer from time-varying delay and packet loss, leading to degraded control performance and even safety risks. To mitigate the adverse effects of non-ideal communication, this paper proposes a Dynamic Communication Topology based Multi-Agent Reinforcement Learning (DCT-MARL) algorithm for robust cooperative platoon control. Specifically, the state space is augmented with historical control action and delay to enhance robustness against communication delay. To mitigate the impact of packet loss, a multi-key gated communication mechanism is introduced, which dynamically adjusts the communication topology based on the correlation between agents and their current communication status.Simulation results demonstrate that the proposed DCT-MARL significantly outperforms state-of-the-art methods in terms of string stability and driving comfort, validating its superior robustness and effectiveness.
This paper proposes a novel method to enhance locomotion for a single humanoid robot through cooperative-heterogeneous multi-agent deep reinforcement learning (MARL). While most existing methods typically employ single-agent reinforcement learning algorithms for a single humanoid robot or MARL algorithms for multi-robot system tasks, we propose a distinct paradigm: applying cooperative-heterogeneous MARL to optimize locomotion for a single humanoid robot. The proposed method, multi-agent reinforcement learning for single humanoid locomotion (MASH), treats each limb (legs and arms) as an independent agent that explores the robot's action space while sharing a global critic for cooperative learning. Experiments demonstrate that MASH accelerates training convergence and improves whole-body cooperation ability, outperforming conventional single-agent reinforcement learning methods. This work advances the integration of MARL into single-humanoid-robot control, offering new insights into efficient locomotion strategies.
Multi-agent reinforcement learning (MARL) requires coordinated and stable policy updates among interacting agents. Heterogeneous-Agent Trust Region Policy Optimization (HATRPO) enforces per-agent trust region constraints using Kullback-Leibler (KL) divergence to stabilize training. However, assigning each agent the same KL threshold can lead to slow and locally optimal updates, especially in heterogeneous settings. To address this limitation, we propose two approaches for allocating the KL divergence threshold across agents: HATRPO-W, a Karush-Kuhn-Tucker-based (KKT-based) method that optimizes threshold assignment under global KL constraints, and HATRPO-G, a greedy algorithm that prioritizes agents based on improvement-to-divergence ratio. By connecting sequential policy optimization with constrained threshold scheduling, our approach enables more flexible and effective learning in heterogeneous-agent settings. Experimental results demonstrate that our methods significantly boost the performance of HATRPO, achieving faster convergence and higher final rewards across diverse MARL benchmarks. Specifically, HATRPO-W and HATRPO-G achieve comparable improvements in final performance, each exceeding 22.5%. Notably, HATRPO-W also demonstrates more stable learning dynamics, as reflected by its lower variance.
The Centralized Training with Decentralized Execution (CTDE) paradigm has gained significant attention in multi-agent reinforcement learning (MARL) and is the foundation of many recent algorithms. However, decentralized policies operate under partial observability and often yield suboptimal performance compared to centralized policies, while fully centralized approaches typically face scalability challenges as the number of agents increases. We propose Centralized Permutation Equivariant (CPE) learning, a centralized training and execution framework that employs a fully centralized policy to overcome these limitations. Our approach leverages a novel permutation equivariant architecture, Global-Local Permutation Equivariant (GLPE) networks, that is lightweight, scalable, and easy to implement. Experiments show that CPE integrates seamlessly with both value decomposition and actor-critic methods, substantially improving the performance of standard CTDE algorithms across cooperative benchmarks including MPE, SMAC, and RWARE, and matching the performance of state-of-the-art RWARE implementations.
Multi-agent self-organizing systems (MASOS) exhibit key characteristics including scalability, adaptability, flexibility, and robustness, which have contributed to their extensive application across various fields. However, the self-organizing nature of MASOS also introduces elements of unpredictability in their emergent behaviors. This paper focuses on the emergence of dependency hierarchies during task execution, aiming to understand how such hierarchies arise from agents' collective pursuit of the joint objective, how they evolve dynamically, and what factors govern their development. To investigate this phenomenon, multi-agent reinforcement learning (MARL) is employed to train MASOS for a collaborative box-pushing task. By calculating the gradients of each agent's actions in relation to the states of other agents, the inter-agent dependencies are quantified, and the emergence of hierarchies is analyzed through the aggregation of these dependencies. Our results demonstrate that hierarchies emerge dynamically as agents work towards a joint objective, with these hierarchies evolving in response to changing task requirements. Notably, these dependency hierarchies emerge organically in response to the shared objective, rather than being a consequence of pre-configured rules or parameters that can be fine-tuned to achieve specific results. Furthermore, the emergence of hierarchies is influenced by the task environment and network initialization conditions. Additionally, hierarchies in MASOS emerge from the dynamic interplay between agents' "Talent" and "Effort" within the "Environment." "Talent" determines an agent's initial influence on collective decision-making, while continuous "Effort" within the "Environment" enables agents to shift their roles and positions within the system.
In multi-agent systems, the safe and reliable execution of tasks often depends on agents correctly coordinating their actions. However, in real-world deployments, failures of computational components are inevitable, presenting a critical challenge: ensuring that multi-agent reinforcement learning (MARL) policies remain effective even when some agents malfunction. We propose the Multi-Agent Robust Training Algorithm (MARTA), a plug-and-play framework for training MARL agents to be resilient to potentially severe faults. MARTA operates in cooperative multi-agent settings where agents may lose the ability to execute their intended actions. It learns to identify failure scenarios that are especially detrimental to system performance and equips agents with strategies to mitigate their impact. At the heart of MARTA is a novel adversarial Markov game in which an adversary -- modelled via \emph{Markov switching controls} -- learns to disable agents in high-risk state regions, while the remaining agents are trained to \emph{jointly} best-respond to such targeted malfunctions. To ensure practicality, MARTA enforces a malfunction budget, constraining the adversary to a fixed number of failures and learning robust policies accordingly. We provide theoretical guarantees that MARTA converges to a Markov perfect equilibrium, ensuring agents optimally counteract worst-case faults. Empirically, we show that MARTA achieves state-of-the-art fault-tolerant performance across benchmark environments, including Multi-Agent Particle World and Level-Based Foraging.
Underwater Wireless Sensor Networks (UWSNs) represent a promising technology that enables diverse underwater applications through acoustic communication. However, it encounters significant challenges including harsh communication environments, limited energy supply, and restricted signal transmission. This paper aims to provide efficient and reliable communication in underwater networks with limited energy and communication resources by optimizing the scheduling of communication links and adjusting transmission parameters (e.g., transmit power and transmission rate). The efficient and reliable communication multi-objective optimization problem (ERCMOP) is formulated as a decentralized partially observable Markov decision process (Dec-POMDP). A Traffic Load-Aware Resource Management (TARM) strategy based on deep multi-agent reinforcement learning (MARL) is presented to address this problem. Specifically, a traffic load-aware mechanism that leverages the overhear information from neighboring nodes is designed to mitigate the disparity between partial observations and global states. Moreover, by incorporating a solution space optimization algorithm, the number of candidate solutions for the deep MARL-based decision-making model can be effectively reduced, thereby optimizing the computational complexity. Simulation results demonstrate the adaptability of TARM in various scenarios with different transmission demands and collision probabilities, while also validating the effectiveness of the proposed approach in supporting efficient and reliable communication in underwater networks with limited resources.
Underwater wireless sensor networks (UWSNs) stand as promising technologies facilitating diverse underwater applications. However, the major design issues of the considered system are the severely limited energy supply and unexpected node malfunctions. This paper aims to provide fair, efficient, and reliable (FER) communication to the imperfect and energy-constrained UWSNs (IC-UWSNs). Therefore, we formulate a FER-communication optimization problem (FERCOP) and propose ICRL-JSA to solve the formulated problem. ICRL-JSA is a deep multi-agent reinforcement learning (MARL)-based optimizer for IC-UWSNs through joint link scheduling and power allocation, which automatically learns scheduling algorithms without human intervention. However, conventional RL methods are unable to address the challenges posed by underwater environments and IC-UWSNs. To construct ICRL-JSA, we integrate deep Q-network into IC-UWSNs and propose an advanced training mechanism to deal with complex acoustic channels, limited energy supplies, and unexpected node malfunctions. Simulation results demonstrate the superiority of the proposed ICRL-JSA scheme with an advanced training mechanism compared to various benchmark algorithms.
This paper investigates the fair-effective communication and robustness in imperfect and energy-constrained underwater acoustic sensor networks (IC-UASNs). Specifically, we investigate the impact of unexpected node malfunctions on the network performance under the time-varying acoustic channels. Each node is expected to satisfy Quality of Service (QoS) requirements. However, achieving individual QoS requirements may interfere with other concurrent communications. Underwater nodes rely excessively on the rationality of other underwater nodes when guided by fully cooperative approaches, making it difficult to seek a trade-off between individual QoS and global fair-effective communications under imperfect conditions. Therefore, this paper presents a SEmi-COoperative Power Allocation approach (SECOPA) that achieves fair-effective communication and robustness in IC-UASNs. The approach is distributed multi-agent reinforcement learning (MARL)-based, and the objectives are twofold. On the one hand, each intelligent node individually decides the transmission power to simultaneously optimize individual and global performance. On the other hand, advanced training algorithms are developed to provide imperfect environments for training robust models that can adapt to the time-varying acoustic channels and handle unexpected node failures in the network. Numerical results are presented to validate our proposed approach.
With wireless devices increasingly forming a unified smart network for seamless, user-friendly operations, random access (RA) medium access control (MAC) design is considered a key solution for handling unpredictable data traffic from multiple terminals. However, it remains challenging to design an effective RA-based MAC protocol to minimize collisions and ensure transmission fairness across the devices. While existing multi-agent reinforcement learning (MARL) approaches with centralized training and decentralized execution (CTDE) have been proposed to optimize RA performance, their reliance on centralized training and the significant overhead required for information collection can make real-world applications unrealistic. In this work, we adopt a fully decentralized MARL architecture, where policy learning does not rely on centralized tasks but leverages consensus-based information exchanges across devices. We design our MARL algorithm over an actor-critic (AC) network and propose exchanging only local rewards to minimize communication overhead. Furthermore, we provide a theoretical proof of global convergence for our approach. Numerical experiments show that our proposed MARL algorithm can significantly improve RA network performance compared to other baselines.
Cooperative multi-agent reinforcement learning (MARL) aims to coordinate multiple agents to achieve a common goal. A key challenge in MARL is credit assignment, which involves assessing each agent's contribution to the shared reward. Given the diversity of tasks, agents may perform different types of coordination, with rewards attributed to diverse and often overlapping agent subsets. In this work, we formalize the credit assignment level as the number of agents cooperating to obtain a reward, and address scenarios with multiple coexisting levels. We introduce a multi-level advantage formulation that performs explicit counterfactual reasoning to infer credits across distinct levels. Our method, Multi-level Advantage Credit Assignment (MACA), captures agent contributions at multiple levels by integrating advantage functions that reason about individual, joint, and correlated actions. Utilizing an attention-based framework, MACA identifies correlated agent relationships and constructs multi-level advantages to guide policy learning. Comprehensive experiments on challenging Starcraft v1\&v2 tasks demonstrate MACA's superior performance, underscoring its efficacy in complex credit assignment scenarios.
Digital Twins (DTs) are transforming industries through advanced data processing and analysis, positioning the world of DTs, Digital World, as a cornerstone of nextgeneration technologies including embodied AI. As robotics and automated systems scale, efficient data-sharing frameworks and robust algorithms become critical. We explore the pivotal role of data handling in next-gen networks, focusing on dynamics between application and network providers (AP/NP) in DT ecosystems. We introduce PANAMA, a novel algorithm with Priority Asymmetry for Network Aware Multi-agent Reinforcement Learning (MARL) based multi-agent path finding (MAPF). By adopting a Centralized Training with Decentralized Execution (CTDE) framework and asynchronous actor-learner architectures, PANAMA accelerates training while enabling autonomous task execution by embodied AI. Our approach demonstrates superior pathfinding performance in accuracy, speed, and scalability compared to existing benchmarks. Through simulations, we highlight optimized data-sharing strategies for scalable, automated systems, ensuring resilience in complex, real-world environments. PANAMA bridges the gap between network-aware decision-making and robust multi-agent coordination, advancing the synergy between DTs, wireless networks, and AI-driven automation.
Generative models, especially diffusion and flow-based models, have been promising in offline multi-agent reinforcement learning. However, integrating powerful generative models into this framework poses unique challenges. In particular, diffusion and flow-based policies suffer from low sampling efficiency due to their iterative generation processes, making them impractical in time-sensitive or resource-constrained settings. To tackle these difficulties, we propose OM2P (Offline Multi-Agent Mean-Flow Policy), a novel offline MARL algorithm to achieve efficient one-step action sampling. To address the misalignment between generative objectives and reward maximization, we introduce a reward-aware optimization scheme that integrates a carefully-designed mean-flow matching loss with Q-function supervision. Additionally, we design a generalized timestep distribution and a derivative-free estimation strategy to reduce memory overhead and improve training stability. Empirical evaluations on Multi-Agent Particle and MuJoCo benchmarks demonstrate that OM2P achieves superior performance, with up to a 3.8x reduction in GPU memory usage and up to a 10.8x speed-up in training time. Our approach represents the first to successfully integrate mean-flow model into offline MARL, paving the way for practical and scalable generative policies in cooperative multi-agent settings.
This work leverages adaptive social learning to estimate partially observable global states in multi-agent reinforcement learning (MARL) problems. Unlike existing methods, the proposed approach enables the concurrent operation of social learning and reinforcement learning. Specifically, it alternates between a single step of social learning and a single step of MARL, eliminating the need for the time- and computation-intensive two-timescale learning frameworks. Theoretical guarantees are provided to support the effectiveness of the proposed method. Simulation results verify that the performance of the proposed methodology can approach that of reinforcement learning when the true state is known.
A large amount of work has been done in Multi-Agent Systems (MAS) for modeling and solving problems with multiple interacting agents. However, most LLMs are pretrained independently and not specifically optimized for coordination. Existing LLM fine-tuning frameworks rely on individual rewards, which require complex reward designs for each agent to encourage collaboration. To address these challenges, we model LLM collaboration as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. We develop a multi-agent, multi-turn algorithm, Multi-Agent Group Relative Policy Optimization (MAGRPO), to solve it, building on current RL approaches for LLMs as well as MARL techniques. Our experiments on LLM writing and coding collaboration demonstrate that fine-tuning MAS with MAGRPO enables agents to generate high-quality responses efficiently through effective cooperation. Our approach opens the door to using other MARL methods for LLMs and highlights the associated challenges.
Multi-agent systems (MAS) built on multimodal large language models exhibit strong collaboration and performance. However, their growing openness and interaction complexity pose serious risks, notably jailbreak and adversarial attacks. Existing defenses typically rely on external guard modules, such as dedicated safety agents, to handle unsafe behaviors. Unfortunately, this paradigm faces two challenges: (1) standalone agents offer limited protection, and (2) their independence leads to single-point failure-if compromised, system-wide safety collapses. Naively increasing the number of guard agents further raises cost and complexity. To address these challenges, we propose Evo-MARL, a novel multi-agent reinforcement learning (MARL) framework that enables all task agents to jointly acquire defensive capabilities. Rather than relying on external safety modules, Evo-MARL trains each agent to simultaneously perform its primary function and resist adversarial threats, ensuring robustness without increasing system overhead or single-node failure. Furthermore, Evo-MARL integrates evolutionary search with parameter-sharing reinforcement learning to co-evolve attackers and defenders. This adversarial training paradigm internalizes safety mechanisms and continually enhances MAS performance under co-evolving threats. Experiments show that Evo-MARL reduces attack success rates by up to 22% while boosting accuracy by up to 5% on reasoning tasks-demonstrating that safety and utility can be jointly improved.
We compare the efficacy of learned versus engineered communication strategies in a cooperative multi-agent reinforcement learning (MARL) environment. For the learned approach, we introduce Learned Direct Communication (LDC), where agents generate messages and actions concurrently via a neural network. Our engineered approach, Intention Communication, employs an Imagined Trajectory Generation Module (ITGM) and a Message Generation Network (MGN) to formulate messages based on predicted future states. Both strategies are evaluated on their success rates in cooperative tasks under fully and partially observable conditions. Our findings indicate that while emergent communication is viable, the engineered approach demonstrates superior performance and scalability, particularly as environmental complexity increases.
In the autonomous driving testing methods based on evolving scenarios, the construction method of the driver model, which determines the driving maneuvers of background vehicles (BVs) in the scenario, plays a critical role in generating safety-critical scenarios. In particular, the cooperative adversarial driving characteristics between BVs can contribute to the efficient generation of safety-critical scenarios with high testing value. In this paper, a multi-agent reinforcement learning (MARL) method is used to train and generate a dual-modal driver model (Dual-DM) with non-adversarial and adversarial driving modalities. The model is then connected to a continuous simulated traffic environment to generate complex, diverse and strong interactive safety-critical scenarios through evolving scenario generation method. After that, the generated evolving scenarios are evaluated in terms of fidelity, test efficiency, complexity and diversity. Results show that without performance degradation in scenario fidelity (>85% similarity to real-world scenarios) and complexity (complexity metric: 0.45, +32.35% and +12.5% over two baselines), Dual-DM achieves a substantial enhancement in the efficiency of generating safety-critical scenarios (efficiency metric: 0.86, +195% over two baselines). Furthermore, statistical analysis and case studies demonstrate the diversity of safety-critical evolving scenarios generated by Dual-DM in terms of the adversarial interaction patterns. Therefore, Dual-DM can greatly improve the performance of the generation of safety-critical scenarios through evolving scenario generation method.
This paper presents the first decentralized method to enable real-world 6-DoF manipulation of a cable-suspended load using a team of Micro-Aerial Vehicles (MAVs). Our method leverages multi-agent reinforcement learning (MARL) to train an outer-loop control policy for each MAV. Unlike state-of-the-art controllers that utilize a centralized scheme, our policy does not require global states, inter-MAV communications, nor neighboring MAV information. Instead, agents communicate implicitly through load pose observations alone, which enables high scalability and flexibility. It also significantly reduces computing costs during inference time, enabling onboard deployment of the policy. In addition, we introduce a new action space design for the MAVs using linear acceleration and body rates. This choice, combined with a robust low-level controller, enables reliable sim-to-real transfer despite significant uncertainties caused by cable tension during dynamic 3D motion. We validate our method in various real-world experiments, including full-pose control under load model uncertainties, showing setpoint tracking performance comparable to the state-of-the-art centralized method. We also demonstrate cooperation amongst agents with heterogeneous control policies, and robustness to the complete in-flight loss of one MAV. Videos of experiments: https://autonomousrobots.nl/paper_websites/aerial-manipulation-marl
Next-generation automotive applications require vehicular edge computing (VEC), but current management systems are essentially fixed and reactive. They are suboptimal in extremely dynamic vehicular environments because they are constrained to static optimization objectives and base their decisions on the current network states. This paper presents a novel Semantic-Aware Proactive LLM Orchestration (SP-LLM) framework to address these issues. Our method transforms the traditional Digital Twin (DT) into a Predictive Digital Twin (pDT) that predicts important network parameters such as task arrivals, vehicle mobility, and channel quality. A Large Language Model (LLM) that serves as a cognitive orchestrator is at the heart of our framework. It makes proactive, forward-looking decisions about task offloading and resource allocation by utilizing the pDT's forecasts. The LLM's ability to decipher high-level semantic commands given in natural language is crucial because it enables it to dynamically modify its optimization policy to match evolving strategic objectives, like giving emergency services priority or optimizing energy efficiency. We show through extensive simulations that SP-LLM performs significantly better in terms of scalability, robustness in volatile conditions, and adaptability than state-of-the-art reactive and MARL-based approaches. More intelligent, autonomous, and goal-driven vehicular networks will be possible due to our framework's outstanding capacity to convert human intent into optimal network behavior.
Independent on-policy policy gradient algorithms are widely used for multi-agent reinforcement learning (MARL) in cooperative and no-conflict games, but they are known to converge suboptimally when each agent's policy gradient points toward a suboptimal equilibrium. In this work, we identify a subtler failure mode that arises \textit{even when the expected policy gradients of all agents point toward an optimal solution.} After collecting a finite set of trajectories, stochasticity in independent action sampling can cause the joint data distribution to deviate from the expected joint on-policy distribution. This \textit{sampling error} w.r.t. the joint on-policy distribution produces inaccurate gradient estimates that can lead agents to converge suboptimally. In this paper, we investigate if joint sampling error can be reduced through coordinated action selection and whether doing so improves the reliability of policy gradient learning in MARL. Toward this end, we introduce an adaptive action sampling approach to reduce joint sampling error. Our method, Multi-Agent Proximal Robust On-Policy Sampling (MA-PROPS), uses a centralized behavior policy that we continually adapt to place larger probability on joint actions that are currently under-sampled w.r.t. the current joint policy. We empirically evaluate MA-PROPS in a diverse range of multi-agent games and demonstrate that (1) MA-PROPS reduces joint sampling error more efficiently than standard on-policy sampling and (2) improves the reliability of independent policy gradient algorithms, increasing the fraction of training runs that converge to an optimal joint policy.
Decentralized Multi-Agent Reinforcement Learning (MARL) methods allow for learning scalable multi-agent policies, but suffer from partial observability and induced non-stationarity. These challenges can be addressed by introducing mechanisms that facilitate coordination and high-level planning. Specifically, coordination and temporal abstraction can be achieved through communication (e.g., message passing) and Hierarchical Reinforcement Learning (HRL) approaches to decision-making. However, optimization issues limit the applicability of hierarchical policies to multi-agent systems. As such, the combination of these approaches has not been fully explored. To fill this void, we propose a novel and effective methodology for learning multi-agent hierarchies of message-passing policies. We adopt the feudal HRL framework and rely on a hierarchical graph structure for planning and coordination among agents. Agents at lower levels in the hierarchy receive goals from the upper levels and exchange messages with neighboring agents at the same level. To learn hierarchical multi-agent policies, we design a novel reward-assignment method based on training the lower-level policies to maximize the advantage function associated with the upper levels. Results on relevant benchmarks show that our method performs favorably compared to the state of the art.
The development of reinforcement learning (RL) algorithms has been largely driven by ambitious challenge tasks and benchmarks. Games have dominated RL benchmarks because they present relevant challenges, are inexpensive to run and easy to understand. While games such as Go and Atari have led to many breakthroughs, they often do not directly translate to real-world embodied applications. In recognising the need to diversify RL benchmarks and addressing complexities that arise in embodied interaction scenarios, we introduce Assistax: an open-source benchmark designed to address challenges arising in assistive robotics tasks. Assistax uses JAX's hardware acceleration for significant speed-ups for learning in physics-based simulations. In terms of open-loop wall-clock time, Assistax runs up to $370\times$ faster when vectorising training runs compared to CPU-based alternatives. Assistax conceptualises the interaction between an assistive robot and an active human patient using multi-agent RL to train a population of diverse partner agents against which an embodied robotic agent's zero-shot coordination capabilities can be tested. Extensive evaluation and hyperparameter tuning for popular continuous control RL and MARL algorithms provide reliable baselines and establish Assistax as a practical benchmark for advancing RL research for assistive robotics. The code is available at: https://github.com/assistive-autonomy/assistax.
Despite substantial progress in applying neural networks (NN) to multi-agent reinforcement learning (MARL) areas, they still largely suffer from a lack of transparency and interoperability. However, its implicit cooperative mechanism is not yet fully understood due to black-box networks. In this work, we study an interpretable value decomposition framework via concept bottleneck models, which promote trustworthiness by conditioning credit assignment on an intermediate level of human-like cooperation concepts. To address this problem, we propose a novel value-based method, named Concepts learning for Multi-agent Q-learning (CMQ), that goes beyond the current performance-vs-interpretability trade-off by learning interpretable cooperation concepts. CMQ represents each cooperation concept as a supervised vector, as opposed to existing models where the information flowing through their end-to-end mechanism is concept-agnostic. Intuitively, using individual action value conditioning on global state embeddings to represent each concept allows for extra cooperation representation capacity. Empirical evaluations on the StarCraft II micromanagement challenge and level-based foraging (LBF) show that CMQ achieves superior performance compared with the state-of-the-art counterparts. The results also demonstrate that CMQ provides more cooperation concept representation capturing meaningful cooperation modes, and supports test-time concept interventions for detecting potential biases of cooperation mode and identifying spurious artifacts that impact cooperation.
Constraint-based optimization is a cornerstone of robotics, enabling the design of controllers that reliably encode task and safety requirements such as collision avoidance or formation adherence. However, handcrafted constraints can fail in multi-agent settings that demand complex coordination. We introduce ReCoDe--Reinforcement-based Constraint Design--a decentralized, hybrid framework that merges the reliability of optimization-based controllers with the adaptability of multi-agent reinforcement learning. Rather than discarding expert controllers, ReCoDe improves them by learning additional, dynamic constraints that capture subtler behaviors, for example, by constraining agent movements to prevent congestion in cluttered scenarios. Through local communication, agents collectively constrain their allowed actions to coordinate more effectively under changing conditions. In this work, we focus on applications of ReCoDe to multi-agent navigation tasks requiring intricate, context-based movements and consensus, where we show that it outperforms purely handcrafted controllers, other hybrid approaches, and standard MARL baselines. We give empirical (real robot) and theoretical evidence that retaining a user-defined controller, even when it is imperfect, is more efficient than learning from scratch, especially because ReCoDe can dynamically change the degree to which it relies on this controller.
In this paper, we propose a general digital twin edge computing network comprising multiple vehicles and a server. Each vehicle generates multiple computing tasks within a time slot, leading to queuing challenges when offloading tasks to the server. The study investigates task offloading strategies, queue stability, and resource allocation. Lyapunov optimization is employed to transform long-term constraints into tractable short-term decisions. To solve the resulting problem, an in-context learning approach based on large language model (LLM) is adopted, replacing the conventional multi-agent reinforcement learning (MARL) framework. Experimental results demonstrate that the LLM-based method achieves comparable or even superior performance to MARL.
Efficient exploration in multi-agent reinforcement learning (MARL) is a challenging problem when receiving only a team reward, especially in environments with sparse rewards. A powerful method to mitigate this issue involves crafting dense individual rewards to guide the agents toward efficient exploration. However, individual rewards generally rely on manually engineered shaping-reward functions that lack high-order intelligence, thus it behaves ineffectively than humans regarding learning and generalization in complex problems. To tackle these issues, we combine the above two paradigms and propose a novel framework, LIGHT (Learning Individual Intrinsic reward via Incorporating Generalized Human experTise), which can integrate human knowledge into MARL algorithms in an end-to-end manner. LIGHT guides each agent to avoid unnecessary exploration by considering both individual action distribution and human expertise preference distribution. Then, LIGHT designs individual intrinsic rewards for each agent based on actionable representational transformation relevant to Q-learning so that the agents align their action preferences with the human expertise while maximizing the joint action value. Experimental results demonstrate the superiority of our method over representative baselines regarding performance and better knowledge reusability across different sparse-reward tasks on challenging scenarios.
Cooperative multi-agent reinforcement learning (MARL) is typically formalised as a Decentralised Partially Observable Markov Decision Process (Dec-POMDP), where agents must reason about the environment and other agents' behaviour. In practice, current model-free MARL algorithms use simple recurrent function approximators to address the challenge of reasoning about others using partial information. In this position paper, we argue that the empirical success of these methods is not due to effective Markov signal recovery, but rather to learning simple conventions that bypass environment observations and memory. Through a targeted case study, we show that co-adapting agents can learn brittle conventions, which then fail when partnered with non-adaptive agents. Crucially, the same models can learn grounded policies when the task design necessitates it, revealing that the issue is not a fundamental limitation of the learning models but a failure of the benchmark design. Our analysis also suggests that modern MARL environments may not adequately test the core assumptions of Dec-POMDPs. We therefore advocate for new cooperative environments built upon two core principles: (1) behaviours grounded in observations and (2) memory-based reasoning about other agents, ensuring success requires genuine skill rather than fragile, co-adapted agreements.
Due to practical constraints such as partial observability and limited communication, Centralized Training with Decentralized Execution (CTDE) has become the dominant paradigm in cooperative Multi-Agent Reinforcement Learning (MARL). However, existing CTDE methods often underutilize centralized training or lack theoretical guarantees. We propose Multi-Agent Guided Policy Optimization (MAGPO), a novel framework that better leverages centralized training by integrating centralized guidance with decentralized execution. MAGPO uses an auto-regressive joint policy for scalable, coordinated exploration and explicitly aligns it with decentralized policies to ensure deployability under partial observability. We provide theoretical guarantees of monotonic policy improvement and empirically evaluate MAGPO on 43 tasks across 6 diverse environments. Results show that MAGPO consistently outperforms strong CTDE baselines and matches or surpasses fully centralized approaches, offering a principled and practical solution for decentralized multi-agent learning. Our code and experimental data can be found in https://github.com/liyheng/MAGPO.