multi-agent - 2025-03-02

MARVEL: Multi-Agent Reinforcement Learning for constrained field-of-View multi-robot Exploration in Large-scale environments

Authors:Jimmy Chiun, Shizhe Zhang, Yizhuo Wang, Yuhong Cao, Guillaume Sartoretti
Date:2025-02-27 15:58:42

In multi-robot exploration, a team of mobile robots is tasked with efficiently mapping an unknown environment. While most exploration planners assume omnidirectional sensors like LiDAR, this is impractical for small robots such as drones, where lightweight, directional sensors like cameras may be the only option due to payload constraints. These sensors have a constrained field-of-view (FoV), which adds complexity to the exploration problem, requiring not only optimal robot positioning but also sensor orientation during movement. In this work, we propose MARVEL, a neural framework that leverages graph attention networks, together with a novel frontier and orientation feature fusion technique, to develop a collaborative, decentralized policy using multi-agent reinforcement learning (MARL) for robots with constrained FoV. To handle the large action space of viewpoint planning, we further introduce a novel information-driven action pruning strategy. MARVEL improves multi-robot coordination and decision-making in challenging large-scale indoor environments, while adapting to various team sizes and sensor configurations (i.e., FoV and sensor range) without additional training. Our extensive evaluation shows that MARVEL's learned policies exhibit effective coordinated behaviors, outperforming state-of-the-art exploration planners across multiple metrics. We experimentally demonstrate MARVEL's generalizability in large-scale environments of up to 90 m by 90 m, and validate its practical applicability through successful deployment on a team of real drones.
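
The abstract does not spell out the pruning rule itself; below is a minimal geometric stand-in for an information-driven pruning score, keeping only the k candidate viewpoints expected to observe the most frontier cells within an assumed FoV and range. All names and defaults here are illustrative, not MARVEL's actual learned scoring.

```python
import numpy as np

def prune_viewpoints(candidates, frontiers, fov_deg=90.0, sensor_range=5.0, k=8):
    """Keep the k candidate viewpoints (x, y, heading) expected to observe
    the most frontier cells: a simple geometric stand-in for an
    information-driven pruning score."""
    half_fov = np.deg2rad(fov_deg) / 2.0
    scores = []
    for x, y, heading in candidates:
        dx = frontiers[:, 0] - x
        dy = frontiers[:, 1] - y
        dist = np.hypot(dx, dy)
        # Angular offset of each frontier cell from the sensor heading,
        # wrapped into [0, pi].
        ang = np.abs((np.arctan2(dy, dx) - heading + np.pi) % (2 * np.pi) - np.pi)
        visible = (dist <= sensor_range) & (ang <= half_fov)
        scores.append(visible.sum())
    top = np.argsort(scores)[::-1][:k]
    return [candidates[i] for i in top]

# Toy usage: 20 random viewpoints scored against 200 random frontier cells.
rng = np.random.default_rng(0)
cands = [tuple(v) for v in rng.uniform(0, 10, size=(20, 3))]
fronts = rng.uniform(0, 10, size=(200, 2))
print(prune_viewpoints(cands, fronts)[:3])
```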

RouteRL: Multi-agent reinforcement learning framework for urban route choice with autonomous vehicles

Authors:Ahmet Onur Akman, Anastasia Psarou, Łukasz Gorczyca, Zoltán György Varga, Grzegorz Jamróz, Rafał Kucharski
Date:2025-02-27 13:13:09

RouteRL is a novel framework that integrates multi-agent reinforcement learning (MARL) with a microscopic traffic simulation, facilitating the testing and development of efficient route choice strategies for autonomous vehicles (AVs). The proposed framework simulates the daily route choices of driver agents in a city, including two types: human drivers, emulated using behavioral route choice models, and AVs, modeled as MARL agents optimizing their policies for a predefined objective. RouteRL aims to advance research in MARL, transport modeling, and human-AI interaction for transportation applications. This study presents a technical report on RouteRL, outlines its potential research contributions, and showcases its impact via illustrative examples.

Exponential Topology-enabled Scalable Communication in Multi-agent Reinforcement Learning

Authors:Xinran Li, Xiaolu Wang, Chenjia Bai, Jun Zhang
Date:2025-02-27 03:15:31

In cooperative multi-agent reinforcement learning (MARL), well-designed communication protocols can effectively facilitate consensus among agents, thereby enhancing task performance. Moreover, in large-scale multi-agent systems commonly found in real-world applications, effective communication plays an even more critical role due to the escalated challenge of partial observability compared to smaller-scale setups. In this work, we endeavor to develop a scalable communication protocol for MARL. Unlike previous methods that focus on selecting optimal pairwise communication links, a task that becomes increasingly complex as the number of agents grows, we adopt a global perspective on communication topology design. Specifically, we propose utilizing the exponential topology to enable rapid information dissemination among agents by leveraging its small-diameter and small-size properties. This approach leads to a scalable communication protocol, named ExpoComm. To fully unlock the potential of exponential graphs as communication topologies, we employ memory-based message processors and auxiliary tasks to ground messages, ensuring that they reflect global information and benefit decision-making. Extensive experiments on large-scale cooperative benchmarks, including MAgent and Infrastructure Management Planning, demonstrate the superior performance and robust zero-shot transferability of ExpoComm compared to existing communication strategies. The code is publicly available at https://github.com/LXXXXR/ExpoComm.
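
For intuition, a sketch of the exponential topology the abstract leverages: agent i sends to agents (i + 2^k) mod n for each power of two below n, giving O(log n) degree and diameter. This is the standard construction; the helper below is illustrative and not from the ExpoComm codebase.

```python
import numpy as np

def exponential_topology(n):
    """Directed adjacency matrix where agent i sends to (i + 2^k) % n for
    each power of two k below n, giving O(log n) degree and diameter."""
    adj = np.zeros((n, n), dtype=int)
    hop = 1
    while hop < n:
        for i in range(n):
            adj[i, (i + hop) % n] = 1
        hop *= 2
    return adj

A = exponential_topology(8)
print(A.sum(axis=1))  # each agent talks to log2(8) = 3 neighbors
# Information from any agent reaches all others in <= ceil(log2 n) hops.
```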

Joint Power Allocation and Phase Shift Design for Stacked Intelligent Metasurfaces-aided Cell-Free Massive MIMO Systems with MARL

Authors:Yiyang Zhu, Jiayi Zhang, Enyu Shi, Ziheng Liu, Chau Yuen, Bo Ai
Date:2025-02-27 01:34:34

Cell-free (CF) massive multiple-input multiple-output (mMIMO) systems offer high spectral efficiency (SE) through multiple distributed access points (APs). However, the large number of antennas increases power consumption. We propose incorporating stacked intelligent metasurfaces (SIM) into CF mMIMO systems as a cost-effective, energy-efficient solution. This paper focuses on optimizing the joint power allocation of APs and the phase shift of SIMs to maximize the sum SE. To address this complex problem, we introduce a fully distributed multi-agent reinforcement learning (MARL) algorithm. Our novel algorithm, the noisy value method with a recurrent policy in multi-agent policy optimization (NVR-MAPPO), enhances performance by encouraging diverse exploration under centralized training and decentralized execution. Simulations demonstrate that NVR-MAPPO significantly improves sum SE and robustness across various scenarios.

Combining Planning and Reinforcement Learning for Solving Relational Multiagent Domains

Authors:Nikhilesh Prabhakar, Ranveer Singh, Harsha Kokel, Sriraam Natarajan, Prasad Tadepalli
Date:2025-02-26 16:55:23

Multiagent Reinforcement Learning (MARL) poses significant challenges due to the exponential growth of state and action spaces and the non-stationary nature of multiagent environments. This results in notable sample inefficiency and hinders generalization across diverse tasks. The complexity is further pronounced in relational settings, where domain knowledge is crucial but often underutilized by existing MARL algorithms. To overcome these hurdles, we propose integrating relational planners as centralized controllers with efficient state abstractions and reinforcement learning. This approach proves to be sample-efficient and facilitates effective task transfer and generalization.

Toward Dependency Dynamics in Multi-Agent Reinforcement Learning for Traffic Signal Control

Authors:Yuli Zhang, Shangbo Wang, Dongyao Jia, Pengfei Fan, Ruiyuan Jiang, Hankang Gu, Andy H. F. Chow
Date:2025-02-23 15:29:12

Reinforcement learning (RL) has emerged as a promising data-driven approach for adaptive traffic signal control (ATSC) in complex urban traffic networks, with deep neural networks substantially augmenting its learning capabilities. However, centralized RL becomes impractical for ATSC involving multiple agents due to the exceedingly high dimensionality of the joint action space. Multi-agent RL (MARL) mitigates this scalability issue by decentralizing control to local RL agents. Nevertheless, this decentralized method introduces new challenges: the environment becomes partially observable from the perspective of each local agent due to constrained inter-agent communication. Both centralized RL and MARL exhibit distinct strengths and weaknesses, particularly under heavy intersectional traffic conditions. In this paper, we show that MARL can achieve the optimal global Q-value by separating into multiple independent RL (IRL) processes when no spill-back congestion occurs (no agent dependency) among agents (intersections). In the presence of spill-back congestion (with agent dependency), the maximum global Q-value can be achieved by using centralized RL. Building on these conclusions, we propose a novel Dynamic Parameter Update Strategy for Deep Q-Network (DQN-DPUS), which updates the weights and biases based on the dependency dynamics among agents, i.e., updating only the diagonal sub-matrices for the scenario without spill-back congestion. We validate DQN-DPUS in a simple network with two intersections under varying traffic, and show that the proposed strategy can speed up the convergence rate without sacrificing optimal exploration. The results corroborate our theoretical findings, demonstrating the efficacy of DQN-DPUS in optimizing traffic signal control.
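
One plausible reading of "updating only the diagonal sub-matrices" is masking gradients so each intersection updates only its own block of the joint parameters when no spill-back is detected. The sketch below is an illustrative interpretation under that assumption, not the paper's implementation.

```python
import torch

n_agents, d = 2, 4            # two intersections, d features each
W = torch.nn.Parameter(torch.randn(n_agents * d, n_agents * d))

def block_diagonal_mask(n_agents, d):
    """1s on the agent-wise diagonal blocks, 0s on the off-diagonal
    (inter-agent) blocks."""
    mask = torch.zeros(n_agents * d, n_agents * d)
    for a in range(n_agents):
        mask[a * d:(a + 1) * d, a * d:(a + 1) * d] = 1.0
    return mask

mask = block_diagonal_mask(n_agents, d)
x = torch.randn(8, n_agents * d)
loss = (x @ W).pow(2).mean()
loss.backward()
no_spillback = True            # dependency signal from the traffic state
if no_spillback:
    W.grad *= mask             # update only each agent's own sub-matrix
print(W.grad[:d, d:].abs().max())  # off-diagonal block gradient is zeroed
```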

PMAT: Optimizing Action Generation Order in Multi-Agent Reinforcement Learning

Authors:Kun Hu, Muning Wen, Xihuai Wang, Shao Zhang, Yiwei Shi, Minne Li, Minglong Li, Ying Wen
Date:2025-02-23 08:30:14

Multi-agent reinforcement learning (MARL) faces challenges in coordinating agents due to complex interdependencies within multi-agent systems. Most MARL algorithms use the simultaneous decision-making paradigm but ignore the action-level dependencies among agents, which reduces coordination efficiency. In contrast, the sequential decision-making paradigm provides finer-grained supervision for agent decision order, presenting the potential for handling dependencies via better decision order management. However, determining the optimal decision order remains a challenge. In this paper, we introduce Action Generation with Plackett-Luce Sampling (AGPS), a novel mechanism for agent decision order optimization. We model the order determination task as a Plackett-Luce sampling process to address issues such as ranking instability and vanishing gradients during network training. AGPS realizes credit-based decision order determination by establishing a bridge between the significance of agents' local observations and their decision credits, thus facilitating order optimization and dependency management. Integrating AGPS with the Multi-Agent Transformer, we propose the Prioritized Multi-Agent Transformer (PMAT), a sequential decision-making MARL algorithm with decision order optimization. Experiments on benchmarks including the StarCraft II Multi-Agent Challenge, Google Research Football, and Multi-Agent MuJoCo show that PMAT outperforms state-of-the-art algorithms, greatly enhancing coordination efficiency.
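
A Plackett-Luce sampling process can be implemented without sequential renormalization via the Gumbel trick: perturbing the logits with Gumbel noise and sorting in descending order yields an exact Plackett-Luce sample. A minimal sketch; the credit values below are invented for illustration.

```python
import numpy as np

def plackett_luce_sample(logits, rng):
    """Sample a decision order from a Plackett-Luce distribution.
    Adding i.i.d. Gumbel noise to the logits and sorting descending is an
    exact Plackett-Luce sample, avoiding sequential renormalization."""
    gumbel = rng.gumbel(size=logits.shape)
    return np.argsort(-(logits + gumbel))

rng = np.random.default_rng(0)
credits = np.array([2.0, 0.5, 1.0, 0.1])   # per-agent decision credits (toy)
orders = [tuple(plackett_luce_sample(np.log(credits), rng)) for _ in range(5)]
print(orders)  # agent 0 (highest credit) tends to decide first
```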

Curiosity Driven Multi-agent Reinforcement Learning for 3D Game Testing

Authors:Raihana Ferdous, Fitsum Kifetew, Davide Prandi, Angelo Susi
Date:2025-02-20 14:43:46

Recently, testing games via autonomous agents has shown great promise in tackling challenges faced by the game industry, which has mainly relied on either manual testing or record/replay. In particular, Reinforcement Learning (RL) solutions have shown potential by learning directly from playing the game without the need for human intervention. In this paper, we present cMarlTest, an approach for testing 3D games through curiosity-driven Multi-Agent Reinforcement Learning (MARL). cMarlTest deploys multiple agents that work collaboratively to achieve the testing objective. The use of multiple agents helps resolve issues faced by a single-agent approach. We carried out experiments on different levels of a 3D game, comparing the performance of cMarlTest with a single-agent RL variant. Results are promising: across three different types of coverage criteria, cMarlTest achieved higher coverage than the single-agent variant, and was also more efficient in terms of time taken.

Autonomous Vehicles Using Multi-Agent Reinforcement Learning for Routing Decisions Can Harm Urban Traffic

Authors:Anastasia Psarou, Ahmet Onur Akman, Łukasz Gorczyca, Michał Hoffmann, Zoltán György Varga, Grzegorz Jamróz, Rafał Kucharski
Date:2025-02-18 13:37:02

Autonomous vehicles (AVs) using Multi-Agent Reinforcement Learning (MARL) for simultaneous route optimization may destabilize traffic environments, with human drivers possibly experiencing longer travel times. We study this interaction by simulating human drivers and AVs. Our experiments with standard MARL algorithms reveal that, even in trivial cases, policies often fail to converge to an optimal solution or require long training periods. The problem is amplified by the fact that we cannot rely entirely on simulated training, as there are no accurate models of human routing behavior. At the same time, real-world training in cities risks destabilizing urban traffic systems, increasing externalities such as $CO_2$ emissions, and introducing non-stationarity as human drivers adapt unpredictably to AV behaviors. Centralization can improve convergence in some cases; however, it raises privacy concerns regarding travelers' destination data. In this position paper, we argue that future research must prioritize realistic benchmarks, cautious deployment strategies, and tools for monitoring and regulating AV routing behaviors to ensure sustainable and equitable urban mobility systems.

Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration

Authors:Shao Zhang, Xihuai Wang, Wenhao Zhang, Chaoran Li, Junru Song, Tingyu Li, Lin Qiu, Xuezhi Cao, Xunliang Cai, Wen Yao, Weinan Zhang, Xinbing Wang, Ying Wen
Date:2025-02-17 15:09:45

Agents built on large language models (LLMs) have excelled in turn-by-turn human-AI collaboration but struggle with simultaneous tasks requiring real-time interaction. Latency issues and the challenge of inferring variable human strategies hinder their ability to make autonomous decisions without explicit instructions. Through experiments with current independent System 1 and System 2 methods, we validate the necessity of using Dual Process Theory (DPT) in real-time tasks. We propose DPT-Agent, a novel language agent framework that integrates System 1 and System 2 for efficient real-time simultaneous human-AI collaboration. DPT-Agent's System 1 uses a Finite-state Machine (FSM) and code-as-policy for fast, intuitive, and controllable decision-making. DPT-Agent's System 2 integrates Theory of Mind (ToM) and asynchronous reflection to infer human intentions and perform reasoning-based autonomous decisions. We demonstrate the effectiveness of DPT-Agent through further experiments with rule-based agents and human collaborators, showing significant improvements over mainstream LLM-based frameworks. To the best of our knowledge, DPT-Agent is the first language agent framework that autonomously achieves successful real-time simultaneous human-AI collaboration. Code for DPT-Agent can be found at https://github.com/sjtu-marl/DPT-Agent.

Scalable Multi-Agent Offline Reinforcement Learning and the Role of Information

Authors:Riccardo Zamboni, Enrico Brunetti, Marcello Restelli
Date:2025-02-16 20:28:42

Offline Reinforcement Learning (RL) focuses on learning policies solely from a batch of previously collected data, offering the potential to leverage such datasets effectively without the need for costly or risky active exploration. While recent advances in Offline Multi-Agent RL (MARL) have shown promise, most existing methods either rely on large datasets jointly collected by all agents or on agent-specific datasets collected independently. The former approach ensures strong performance but raises scalability concerns, while the latter emphasizes scalability at the expense of performance guarantees. In this work, we propose a novel scalable routine for both dataset collection and offline learning. Agents first collect diverse datasets coherently with a pre-specified information-sharing network and subsequently learn coherent localized policies without requiring either full observability or falling back to complete decentralization. We theoretically demonstrate that this structured approach allows a multi-agent extension of the seminal Fitted Q-Iteration (FQI) algorithm to globally converge, with high probability, to near-optimal policies. The convergence is subject to error terms that depend on the informativeness of the shared information. Furthermore, we show how this approach allows us to bound the inherent error of the supervised-learning phase of FQI with the mutual information between shared and unshared information. Our algorithm, SCAlable Multi-agent FQI (SCAM-FQI), is then evaluated on a distributed decision-making problem. The empirical results align with our theoretical findings, supporting the effectiveness of SCAM-FQI in achieving a balance between scalability and policy performance.
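
For reference, the single-agent core that SCAM-FQI extends is Fitted Q-Iteration: repeatedly refit Q to one-step Bellman targets computed from a fixed batch. A tabular toy version of that core (the paper's multi-agent, information-sharing machinery is not shown):

```python
import numpy as np

def fitted_q_iteration(batch, n_states, n_actions, gamma=0.99, iters=50):
    """Tabular fitted Q-iteration: each round refits Q to one-step Bellman
    targets computed from a fixed offline batch of (s, a, r, s') tuples."""
    Q = np.zeros((n_states, n_actions))
    s, a, r, s2 = (np.array(x) for x in zip(*batch))
    for _ in range(iters):
        target = r + gamma * Q[s2].max(axis=1)
        total = np.zeros_like(Q)
        count = np.zeros_like(Q)
        np.add.at(total, (s, a), target)     # the "fit" here is just a
        np.add.at(count, (s, a), 1.0)        # per-(s, a) average of targets
        Q = np.where(count > 0, total / np.maximum(count, 1.0), Q)
    return Q

# Toy 2-state batch; state 1, action 1 self-loops with reward 1.
batch = [(0, 0, 0.0, 1), (1, 1, 1.0, 1), (0, 1, 0.0, 0), (1, 0, 0.0, 0)]
print(fitted_q_iteration(batch, n_states=2, n_actions=2))
```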

Cooperative Multi-Agent Planning with Adaptive Skill Synthesis

Authors:Zhiyuan Li, Wenshuai Zhao, Joni Pajarinen
Date:2025-02-14 13:23:18

Despite much progress in training distributed artificial intelligence (AI), building cooperative multi-agent systems with multi-agent reinforcement learning (MARL) faces challenges in sample efficiency, interpretability, and transferability. Unlike traditional learning-based methods that require extensive interaction with the environment, large language models (LLMs) demonstrate remarkable capabilities in zero-shot planning and complex reasoning. However, existing LLM-based approaches heavily rely on text-based observations and struggle with the non-Markovian nature of multi-agent interactions under partial observability. We present COMPASS, a novel multi-agent architecture that integrates vision-language models (VLMs) with a dynamic skill library and structured communication for decentralized closed-loop decision-making. The skill library, bootstrapped from demonstrations, evolves via planner-guided tasks to enable adaptive strategies. COMPASS propagates entity information through multi-hop communication under partial observability. Evaluations on the improved StarCraft Multi-Agent Challenge (SMACv2) demonstrate that COMPASS achieves up to 30% higher win rates than state-of-the-art MARL algorithms in symmetric scenarios.

Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games

Authors:Tong Yang, Bo Dai, Lin Xiao, Yuejie Chi
Date:2025-02-13 21:28:51

Multi-agent reinforcement learning (MARL) lies at the heart of a plethora of applications involving the interaction of a group of agents in a shared unknown environment. A prominent framework for studying MARL is Markov games, with the goal of finding various notions of equilibria in a sample-efficient manner, such as the Nash equilibrium (NE) and the coarse correlated equilibrium (CCE). However, existing sample-efficient approaches either require tailored uncertainty estimation under function approximation, or careful coordination of the players. In this paper, we propose a novel model-based algorithm, called VMG, that incentivizes exploration by biasing the empirical estimate of the model parameters towards those with higher collective best-response values of all the players when fixing the other players' policies, thus encouraging the policy to deviate from its current equilibrium for more exploration. VMG is oblivious to different forms of function approximation, and permits simultaneous and uncoupled policy updates of all players. Theoretically, we also establish that VMG achieves near-optimal regret for finding both the NEs of two-player zero-sum Markov games and the CCEs of multi-player general-sum Markov games under linear function approximation in an online environment, which nearly matches its counterparts with sophisticated uncertainty quantification.

Few is More: Task-Efficient Skill-Discovery for Multi-Task Offline Multi-Agent Reinforcement Learning

Authors:Xun Wang, Zhuoran Li, Hai Zhong, Longbo Huang
Date:2025-02-13 05:47:57

As a data-driven approach, offline MARL learns superior policies solely from offline datasets, making it ideal for domains rich in historical data but with high interaction costs and risks. However, most existing methods are task-specific, requiring retraining for new tasks, which leads to redundancy and inefficiency. To address this issue, in this paper we propose a task-efficient multi-task offline MARL algorithm, Skill-Discovery Conservative Q-Learning (SD-CQL). Unlike existing offline skill-discovery methods, SD-CQL discovers skills by reconstructing the next observation. It then evaluates fixed and variable actions separately and employs behavior-regularized conservative Q-learning to execute the optimal action for each skill. This approach eliminates the need for local-global alignment and enables strong multi-task generalization from limited small-scale source tasks. Extensive experiments on StarCraft II demonstrate the superior generalization performance and task efficiency of SD-CQL. It achieves the best performance on 10 out of 14 task sets, with up to 65% improvement on individual task sets, and is within 4% of the best baseline on the remaining four.

Distributed Value Decomposition Networks with Networked Agents

Authors:Guilherme S. Varela, Alberto Sardinha, Francisco S. Melo
Date:2025-02-11 15:23:05

We investigate the problem of distributed training under partial observability, whereby cooperative multi-agent reinforcement learning (MARL) agents maximize the expected cumulative joint reward. We propose distributed value decomposition networks (DVDN) that generate a joint Q-function that factorizes into agent-wise Q-functions. Whereas the original value decomposition networks rely on centralized training, our approach is suitable for domains where centralized training is not possible and agents must learn by interacting with the physical environment in a decentralized manner while communicating with their peers. DVDN overcomes the need for centralized training by locally estimating the shared objective. We contribute two algorithms, DVDN and DVDN (GT), for the heterogeneous and homogeneous agent settings, respectively. Empirically, both algorithms approximate the performance of value decomposition networks, in spite of the information loss during communication, as demonstrated in ten MARL tasks in three standard environments.
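
The factorization underlying value decomposition networks is simply Q_tot = sum_i Q_i(o_i, a_i); DVDN's contribution is learning this without a centralized trainer. A minimal sketch of the factorization alone (network sizes and dimensions below are arbitrary, and the distributed consensus step is not shown):

```python
import torch
import torch.nn as nn

class AgentQ(nn.Module):
    """Per-agent utility network Q_i(o_i, .) over that agent's actions."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(),
                                 nn.Linear(32, n_actions))
    def forward(self, obs):
        return self.net(obs)

n_agents, obs_dim, n_actions = 3, 6, 4
agents = [AgentQ(obs_dim, n_actions) for _ in range(n_agents)]
obs = torch.randn(8, n_agents, obs_dim)          # batch of joint observations
acts = torch.randint(n_actions, (8, n_agents))   # joint actions taken

# VDN factorization: Q_tot = sum_i Q_i(o_i, a_i).
q_i = torch.stack([agents[i](obs[:, i]).gather(1, acts[:, i:i+1]).squeeze(1)
                   for i in range(n_agents)], dim=1)
q_tot = q_i.sum(dim=1)
print(q_tot.shape)  # torch.Size([8]); TD loss would be taken on q_tot
```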

Who is Helping Whom? Analyzing Inter-dependencies to Evaluate Cooperation in Human-AI Teaming

Authors:Upasana Biswas, Siddhant Bhambri, Subbarao Kambhampati
Date:2025-02-10 19:16:20

The long-standing research challenges of Human-AI Teaming (HAT) and Zero-shot Cooperation (ZSC) have been tackled by applying multi-agent reinforcement learning (MARL) to train an agent by optimizing the environment reward function and evaluating performance through task metrics such as task reward. However, such evaluation focuses only on task completion, while remaining agnostic to 'how' the two agents work with each other. Specifically, we are interested in understanding the cooperation arising within the team when trained agents are paired with humans. To formally address this problem, we propose the concept of interdependence to measure how much agents rely on each other's actions to achieve the shared goal, as a key metric for evaluating cooperation in human-agent teams. Towards this, we ground this concept through a symbolic formalism and define evaluation metrics that allow us to assess the degree of reliance between the agents' actions. We pair state-of-the-art agents trained through MARL for HAT with learned human models for the popular Overcooked domain, and evaluate the team performance for these human-agent teams. Our results demonstrate that trained agents are not able to induce cooperative behavior, reporting very low levels of interdependence across all the teams. We also report that a team's performance is not necessarily correlated with its task reward.

Towards Bio-inspired Heuristically Accelerated Reinforcement Learning for Adaptive Underwater Multi-Agents Behaviour

Authors:Antoine Vivien, Thomas Chaffre, Matthew Stephenson, Eva Artusi, Paulo Santos, Benoit Clement, Karl Sammut
Date:2025-02-10 02:47:33

This paper addresses the coordination of an autonomous multi-agent system (MAS) that aims to solve the coverage planning problem in a complex environment. The considered applications are the detection and identification of objects of interest while covering an area. These tasks, which are highly relevant for space applications, are also of interest across various domains, including the underwater context, which is the focus of this study. In this context, coverage planning is traditionally modelled as a Markov Decision Process (MDP), where a coordinated MAS, a swarm of heterogeneous autonomous underwater vehicles, is required to survey an area and search for objects. This MDP is associated with several challenges: environment uncertainties, communication constraints, and an ensemble of hazards, including time-varying and unpredictable changes in the underwater environment. MARL algorithms can solve highly non-linear problems using deep neural networks and display great scalability as the number of agents increases. Nevertheless, most current results in the underwater domain are limited to simulation due to the high learning time of MARL algorithms. For this reason, a novel strategy is introduced to accelerate convergence by incorporating biologically inspired heuristics to guide the policy during training. Particle swarm optimization (PSO), which is inspired by the behaviour of animal groups, is selected as the heuristic. It allows the policy to explore the highest-quality regions of the action and state spaces from the beginning of training, optimizing the exploration/exploitation trade-off. The resulting agent requires fewer interactions to reach optimal performance. The method is applied to the MSAC algorithm and evaluated for a 2D area coverage mission in a continuous control environment.

Multi-Agent Reinforcement Learning in Wireless Distributed Networks for 6G

Authors:Jiayi Zhang, Ziheng Liu, Yiyang Zhu, Enyu Shi, Bokai Xu, Chau Yuen, Dusit Niyato, Mérouane Debbah, Shi Jin, Bo Ai, Xuemin Shen
Date:2025-02-09 08:35:09

The introduction of intelligent interconnectivity between the physical and human worlds has attracted great attention for future sixth-generation (6G) networks, emphasizing massive capacity, ultra-low latency, and unparalleled reliability. Wireless distributed networks and multi-agent reinforcement learning (MARL), both of which have evolved from centralized paradigms, are two promising solutions for meeting these demands. Given their distinct capabilities, such as decentralization and collaborative mechanisms, integrating these two paradigms holds great promise for unleashing the full power of 6G, attracting significant research and development attention. This paper provides a comprehensive study of MARL-assisted wireless distributed networks for 6G. In particular, we introduce the basic mathematical background and evolution of wireless distributed networks and MARL, as well as demonstrate their interrelationships. Subsequently, we analyze different structures of wireless distributed networks from both homogeneous and heterogeneous perspectives. Furthermore, we introduce the basic concepts of MARL and discuss two typical categories, model-based and model-free. We then present critical challenges faced by MARL-assisted wireless distributed networks, providing important guidance and insights for practical implementation. We also explore the interplay between MARL-assisted wireless distributed networks and emerging techniques, such as the information bottleneck and mirror learning, delivering in-depth analyses and application scenarios. Finally, we outline several compelling research directions for future MARL-assisted wireless distributed networks.

Low-Rank Agent-Specific Adaptation (LoRASA) for Multi-Agent Policy Learning

Authors:Beining Zhang, Aditya Kapoor, Mingfei Sun
Date:2025-02-08 13:57:53

Multi-agent reinforcement learning (MARL) often relies on parameter sharing (PS) to scale efficiently. However, purely shared policies can stifle each agent's unique specialization, reducing overall performance in heterogeneous environments. We propose Low-Rank Agent-Specific Adaptation (LoRASA), a novel approach that treats each agent's policy as a specialized "task" fine-tuned from a shared backbone. Drawing inspiration from parameter-efficient transfer methods, LoRASA appends small, low-rank adaptation matrices to each layer of the shared policy, naturally inducing parameter-space sparsity that promotes both specialization and scalability. We evaluate LoRASA on challenging benchmarks including the StarCraft Multi-Agent Challenge (SMAC) and Multi-Agent MuJoCo (MAMuJoCo), implementing it atop widely used algorithms such as MAPPO and A2PO. Across diverse tasks, LoRASA matches or outperforms existing baselines while reducing memory and computational overhead. Ablation studies on adapter rank, placement, and timing validate the method's flexibility and efficiency. Our results suggest LoRASA's potential to establish a new norm for MARL policy parameterization: combining a shared foundation for coordination with low-rank agent-specific refinements for individual specialization.
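
The adapter structure borrowed from parameter-efficient transfer methods looks roughly like this: a frozen shared weight plus a trainable low-rank delta per agent. A sketch under assumed rank and scaling defaults, not the authors' code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A shared, frozen linear layer plus a small low-rank agent-specific
    delta: y = shared(x) + (x A^T) B^T * (alpha / rank)."""
    def __init__(self, d_in, d_out, rank=4, alpha=8.0):
        super().__init__()
        self.shared = nn.Linear(d_in, d_out)
        for p in self.shared.parameters():
            p.requires_grad_(False)              # frozen shared backbone
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init: the
        self.scale = alpha / rank                # adapter starts contributing 0
    def forward(self, x):
        return self.shared(x) + (x @ self.A.T) @ self.B.T * self.scale

# One LoRALinear per agent would share `shared` while owning its A and B.
layer = LoRALinear(16, 8)
x = torch.randn(4, 16)
print(layer(x).shape)  # torch.Size([4, 8])
```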

LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning

Authors:Hanqing Yang, Jingdi Chen, Marie Siew, Tania Lorido-Botran, Carlee Joe-Wong
Date:2025-02-08 05:26:02

Developing intelligent agents for long-term cooperation in dynamic open-world scenarios is a major challenge in multi-agent systems. Traditional Multi-agent Reinforcement Learning (MARL) frameworks like centralized training decentralized execution (CTDE) struggle with scalability and flexibility. They require centralized long-term planning, which is difficult without custom reward functions, and face challenges in processing multi-modal data. CTDE approaches also assume fixed cooperation strategies, making them impractical in dynamic environments where agents need to adapt and plan independently. To address decentralized multi-agent cooperation, we propose Decentralized Adaptive Knowledge Graph Memory and Structured Communication System (DAMCS) in a novel Multi-agent Crafter environment. Our generative agents, powered by Large Language Models (LLMs), are more scalable than traditional MARL agents by leveraging external knowledge and language for long-term planning and reasoning. Instead of fully sharing information from all past experiences, DAMCS introduces a multi-modal memory system organized as a hierarchical knowledge graph and a structured communication protocol to optimize agent cooperation. This allows agents to reason from past interactions and share relevant information efficiently. Experiments on novel multi-agent open-world tasks show that DAMCS outperforms both MARL and LLM baselines in task efficiency and collaboration. Compared to single-agent scenarios, the two-agent scenario achieves the same goal with 63% fewer steps, and the six-agent scenario with 74% fewer steps, highlighting the importance of adaptive memory and structured communication in achieving long-term goals. We publicly release our project at: https://happyeureka.github.io/damcs.

$TAR^2$: Temporal-Agent Reward Redistribution for Optimal Policy Preservation in Multi-Agent Reinforcement Learning

Authors:Aditya Kapoor, Kale-ab Tessera, Mayank Baranwal, Harshad Khadilkar, Stefano Albrecht, Mingfei Sun
Date:2025-02-07 12:07:57

In cooperative multi-agent reinforcement learning (MARL), learning effective policies is challenging when global rewards are sparse and delayed. This difficulty arises from the need to assign credit across both agents and time steps, a problem that existing methods often fail to address in episodic, long-horizon tasks. We propose Temporal-Agent Reward Redistribution ($TAR^2$), a novel approach that decomposes sparse global rewards into agent-specific, time-step-specific components, thereby providing more frequent and accurate feedback for policy learning. Theoretically, we show that $TAR^2$ (i) aligns with potential-based reward shaping, preserving the same optimal policies as the original environment, and (ii) maintains policy gradient update directions identical to those under the original sparse reward, ensuring unbiased credit signals. Empirical results on two challenging benchmarks, SMACLite and Google Research Football, demonstrate that $TAR^2$ significantly stabilizes and accelerates convergence, outperforming strong baselines like AREL and STAS in both learning speed and final performance. These findings establish $TAR^2$ as a principled and practical solution for agent-temporal credit assignment in sparse-reward multi-agent systems.
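
The key invariant is that the redistributed agent- and time-step-specific rewards must sum back to the original episodic return. A toy illustration, with assumed softmax weights standing in for $TAR^2$'s learned redistribution model:

```python
import numpy as np

def redistribute(episode_return, logits):
    """Split a sparse episodic return into agent- and time-step-specific
    rewards whose total exactly preserves the original return, via a
    softmax over (time, agent) contribution logits."""
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return episode_return * w           # shape (T, n_agents)

T, n_agents = 5, 3
rng = np.random.default_rng(1)
logits = rng.normal(size=(T, n_agents))  # stand-in for a learned scorer
dense = redistribute(10.0, logits)
print(np.isclose(dense.sum(), 10.0))     # return-preserving by construction
```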

An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks

Authors:George Papadopoulos, Andreas Kontogiannis, Foteini Papadopoulou, Chaido Poulianou, Ioannis Koumentis, George Vouros
Date:2025-02-07 09:17:02

Multi-Agent Reinforcement Learning (MARL) has recently emerged as a significant area of research. However, MARL evaluation often lacks systematic diversity, hindering a comprehensive understanding of algorithms' capabilities. In particular, cooperative MARL algorithms are predominantly evaluated on benchmarks such as SMAC and GRF, which primarily feature team game scenarios without adequately assessing the various aspects of agents' capabilities required in fully cooperative real-world tasks such as multi-robot cooperation, warehouse and resource management, search and rescue, and human-AI cooperation. Moreover, MARL algorithms are mainly evaluated on low-dimensional state spaces, and thus their performance on high-dimensional (e.g., image) observations is not well-studied. To fill this gap, this paper highlights the crucial need for expanding systematic evaluation across a wider array of existing benchmarks. To this end, we conduct extensive evaluation and comparisons of well-known MARL algorithms on complex fully cooperative benchmarks, including tasks with images as agents' observations. Interestingly, our analysis shows that many algorithms, hailed as state-of-the-art on SMAC and GRF, may underperform standard MARL baselines on fully cooperative benchmarks. Finally, towards more systematic and better evaluation of cooperative MARL algorithms, we have open-sourced PyMARLzoo+, an extension of the widely used (E)PyMARL libraries, which addresses an open challenge from [TBG++21], facilitating seamless integration with and support for all PettingZoo benchmarks, as well as Overcooked, PressurePlate, Capture Target, and Box Pushing.

Multi-Agent Reinforcement Learning with Focal Diversity Optimization

Authors:Selim Furkan Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, Zachary Yahn, Ling Liu
Date:2025-02-06 20:44:26

The advancement of Large Language Models (LLMs) and their finetuning strategies has triggered renewed interest in multi-agent reinforcement learning. In this paper, we introduce a focal-diversity-optimized multi-agent reinforcement learning approach, coined MARL-Focal, with three unique characteristics. First, we develop an agent-fusion framework for encouraging multiple LLM-based agents to collaborate in producing the final inference output for each LLM query. Second, we develop a focal-diversity-optimized agent selection algorithm that can choose a small subset of the available agents based on how well they can complement one another to generate the query output. Finally, we design a conflict-resolution method to detect output inconsistency among multiple agents and produce our MARL-Focal output through reward-aware and policy-adaptive inference fusion. Extensive evaluations on five benchmarks show that MARL-Focal is cost-efficient and adversarially robust. Our multi-agent fusion model achieves a performance improvement of 5.51% over the best individual LLM agent and offers stronger robustness on the TruthfulQA benchmark. Code is available at https://github.com/sftekin/rl-focal

DECAF: Learning to be Fair in Multi-agent Resource Allocation

Authors:Ashwin Kumar, William Yeoh
Date:2025-02-06 18:29:11

A wide variety of resource allocation problems operate under resource constraints that are managed by a central arbitrator, with agents who evaluate and communicate preferences over these resources. We formulate this broad class of problems as Distributed Evaluation, Centralized Allocation (DECA) problems and propose methods to learn fair and efficient policies in centralized resource allocation. Our methods are applied to learning long-term fairness in a novel and general framework for fairness in multi-agent systems. We present three different methods based on Double Deep Q-Learning: (1) a joint weighted optimization of fairness and utility, (2) a split optimization, learning two separate Q-estimators for utility and fairness, and (3) an online policy perturbation to guide existing black-box utility functions toward fair solutions. Our methods outperform existing fair MARL approaches on multiple resource allocation domains, even when evaluated using diverse fairness functions, and allow for flexible online trade-offs between utility and fairness.
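
Variant (1), the joint weighted optimization, amounts to selecting allocations by a weighted sum of utility and fairness estimates. A toy sketch; the Q-values and weight below are invented for illustration:

```python
import numpy as np

def fair_allocation(q_util, q_fair, beta=0.5):
    """Pick the allocation maximizing a weighted sum of a utility
    Q-estimate and a fairness Q-estimate (variant (1) in the abstract)."""
    return int(np.argmax(beta * q_util + (1.0 - beta) * q_fair))

q_util = np.array([1.0, 0.9, 0.2])   # arbitrator's utility estimates (toy)
q_fair = np.array([0.1, 0.8, 0.9])   # e.g., expected post-allocation equity
print(fair_allocation(q_util, q_fair, beta=1.0))  # 0: pure utility
print(fair_allocation(q_util, q_fair, beta=0.5))  # 1: trades utility for fairness
```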

Deep Meta Coordination Graphs for Multi-agent Reinforcement Learning

Authors:Nikunj Gupta, James Zachary Hare, Rajgopal Kannan, Viktor Prasanna
Date:2025-02-06 12:35:52

This paper presents deep meta coordination graphs (DMCG) for learning cooperative policies in multi-agent reinforcement learning (MARL). Coordination graph formulations encode local interactions and accordingly factorize the joint value function of all agents to improve efficiency in MARL. However, existing approaches rely solely on pairwise relations between agents, which potentially oversimplifies complex multi-agent interactions. DMCG goes beyond these simple direct interactions by also capturing useful higher-order and indirect relationships among agents. It generates novel graph structures accommodating multiple types of interactions and arbitrary lengths of multi-hop connections in coordination graphs to model such interactions. It then employs a graph convolutional network module to learn powerful representations in an end-to-end manner. We demonstrate its effectiveness in multiple coordination problems in MARL where other state-of-the-art methods can suffer from sample inefficiency or fail entirely. All code can be found here: https://github.com/Nikunj-Gupta/dmcg-marl.

PAGNet: Pluggable Adaptive Generative Networks for Information Completion in Multi-Agent Communication

Authors:Zhuohui Zhang, Bin Cheng, Zhipeng Wang, Yanmin Zhou, Gang Li, Ping Lu, Bin He, Jie Chen
Date:2025-02-06 07:55:24

For partially observable cooperative tasks, multi-agent systems must develop effective communication and understand the interplay among agents in order to achieve cooperative goals. However, existing multi-agent reinforcement learning (MARL) methods with communication lack evaluation metrics for information weights and information-level communication modeling. This causes agents to neglect the aggregation of multiple messages, thereby significantly reducing policy learning efficiency. In this paper, we propose pluggable adaptive generative networks (PAGNet), a novel framework that integrates generative models into MARL to enhance communication and decision-making. PAGNet enables agents to synthesize global state representations from weighted local observations and use these representations alongside learned communication weights for coordinated decision-making. This pluggable approach reduces the computational demands typically associated with the joint training of communication and policy networks. Extensive experimental evaluations across diverse benchmarks and communication scenarios demonstrate the significant performance improvements achieved by PAGNet. Furthermore, we analyze the emergent communication patterns and the quality of generated global states, providing insights into operational mechanisms.

Online Location Planning for AI-Defined Vehicles: Optimizing Joint Tasks of Order Serving and Spatio-Temporal Heterogeneous Model Fine-Tuning

Authors:Bokeng Zheng, Bo Rao, Tianxiang Zhu, Chee Wei Tan, Jingpu Duan, Zhi Zhou, Xu Chen, Xiaoxi Zhang
Date:2025-02-06 07:23:40

Advances in artificial intelligence (AI), including foundation models (FMs), are increasingly transforming human society, with smart cities driving the evolution of urban living. Meanwhile, vehicle crowdsensing (VCS) has emerged as a key enabler, leveraging vehicles' mobility and sensor-equipped capabilities. In particular, ride-hailing vehicles can effectively facilitate flexible data collection and contribute towards urban intelligence, despite resource limitations. Therefore, this work explores a promising scenario where edge-assisted vehicles perform the joint tasks of order serving and the emerging foundation model fine-tuning using various urban data. However, integrating the VCS AI task with the conventional order-serving task is challenging due to their inconsistent spatio-temporal characteristics: (i) the distributions of ride orders and data points-of-interest (PoIs) may not coincide geographically, both following a priori unknown patterns; (ii) they have distinct forms of temporal effects, i.e., prolonged waiting makes orders instantly invalid, while data with increased staleness gradually loses its utility for model fine-tuning. To overcome these obstacles, we propose an online framework based on multi-agent reinforcement learning (MARL) with careful augmentation. A new quality-of-service (QoS) metric is designed to characterize and balance the utility of the two joint tasks under the effects of varying data volumes and staleness. We also integrate graph neural networks (GNNs) with MARL to enhance state representations, capturing graph-structured, time-varying dependencies among vehicles and across locations. Extensive experiments on our testbed simulator, utilizing various real-world foundation model fine-tuning tasks and the New York City Taxi ride order dataset, demonstrate the advantage of our proposed method.

Reinforcement Learning on AYA Dyads to Enhance Medication Adherence

Authors:Ziping Xu, Hinal Jajal, Sung Won Choi, Inbal Nahum-Shani, Guy Shani, Alexandra M. Psihogios, Pei-Yao Hung, Susan Murphy
Date:2025-02-06 02:27:35

Medication adherence is critical for the recovery of adolescents and young adults (AYAs) who have undergone hematopoietic cell transplantation (HCT). However, maintaining adherence is challenging for AYAs after hospital discharge, who experience both individual barriers (e.g., physical and emotional symptoms) and interpersonal barriers (e.g., relational difficulties with their care partner, who is often involved in medication management). To optimize the effectiveness of a three-component digital intervention targeting both members of the dyad as well as their relationship, we propose a novel Multi-Agent Reinforcement Learning (MARL) approach to personalize the delivery of interventions. By incorporating domain knowledge, the MARL framework, where each agent is responsible for the delivery of one intervention component, allows for faster learning compared with a flattened agent. Evaluation using a dyadic simulator environment, based on real clinical data, shows a significant improvement in medication adherence (approximately 3%) compared to purely random intervention delivery. The effectiveness of this approach will be further evaluated in an upcoming trial.

Speaking the Language of Teamwork: LLM-Guided Credit Assignment in Multi-Agent Reinforcement Learning

Authors:Muhan Lin, Shuyang Shi, Yue Guo, Vaishnav Tadiparthi, Behdad Chalaki, Ehsan Moradi Pari, Simon Stepputtis, Woojun Kim, Joseph Campbell, Katia Sycara
Date:2025-02-06 02:26:47

Credit assignment, the process of attributing credit or blame to individual agents for their contributions to a team's success or failure, remains a fundamental challenge in multi-agent reinforcement learning (MARL), particularly in environments with sparse rewards. Commonly used approaches such as value decomposition often lead to suboptimal policies in these settings, and designing dense reward functions that align with human intuition can be complex and labor-intensive. In this work, we propose a novel framework where a large language model (LLM) generates dense, agent-specific rewards based on a natural language description of the task and the overall team goal. By learning a potential-based reward function over multiple queries, our method reduces the impact of ranking errors while allowing the LLM to evaluate each agent's contribution to the overall task. Through extensive experiments, we demonstrate that our approach achieves faster convergence and higher policy returns compared to state-of-the-art MARL baselines.
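
The potential-based construction matters because shaped rewards r'_t = r_t + γφ(s_{t+1}) − φ(s_t) provably preserve optimal policies (Ng et al., 1999). A minimal sketch, with made-up potentials standing in for the LLM-derived ones:

```python
import numpy as np

def shaped_rewards(rewards, potentials, gamma=0.99):
    """Potential-based shaping: r'_t = r_t + gamma * phi(s_{t+1}) - phi(s_t).
    Shaping of this form preserves the optimal policy, which is why a
    potential is learned rather than raw reward labels."""
    phi = np.asarray(potentials, dtype=float)   # phi(s_0) ... phi(s_T)
    r = np.asarray(rewards, dtype=float)        # r_0 ... r_{T-1}
    return r + gamma * phi[1:] - phi[:-1]

rewards = [0.0, 0.0, 0.0, 1.0]                  # sparse team reward
potentials = [0.0, 0.2, 0.5, 0.9, 0.0]          # e.g., progress scores (toy)
print(shaped_rewards(rewards, potentials))      # dense learning signal
```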

Energy-Efficient Flying LoRa Gateways: A Multi-Agent Reinforcement Learning Approach

Authors:Abdullahi Isa Ahmed, El Mehdi Amhoud
Date:2025-02-05 17:16:40

With the rapid development of next-generation Internet of Things (NG-IoT) networks, the increasing number of connected devices has led to a surge in power consumption. This rise in energy demand poses significant challenges to resource availability and raises sustainability concerns for large-scale IoT deployments. Efficient energy utilization in communication networks, particularly for power-constrained IoT devices, has thus become a critical area of research. In this paper, we deploy flying LoRa gateways (GWs) mounted on unmanned aerial vehicles (UAVs) to collect data from LoRa end devices (EDs) and transmit it to a central server. Our primary objective is to maximize the global system energy efficiency (EE) of wireless LoRa networks by jointly optimizing transmission power (TP), spreading factor (SF), bandwidth (W), and ED association. To solve this challenging problem, we model it as a partially observable Markov decision process (POMDP), where each flying LoRa GW acts as a learning agent using a cooperative Multi-Agent Reinforcement Learning (MARL) approach under centralized training and decentralized execution (CTDE). Simulation results demonstrate that our proposed method, based on the multi-agent proximal policy optimization (MAPPO) algorithm, significantly improves the global system EE and surpasses conventional MARL schemes.

Learning Efficient Flocking Control based on Gibbs Random Fields

Authors:Dengyu Zhang, Chenghao, Feng Xue, Qingrui Zhang
Date:2025-02-05 08:27:58

Flocking control is essential for multi-robot systems in diverse applications, yet achieving efficient flocking in congested environments poses challenges regarding computational burden, performance optimality, and motion safety. This paper addresses these challenges through a multi-agent reinforcement learning (MARL) framework built on Gibbs Random Fields (GRFs). With GRFs, a multi-robot system is represented by a set of random variables conforming to a joint probability distribution, thus offering a fresh perspective on flocking reward design. A decentralized training and execution mechanism, which enhances the scalability of MARL with respect to the number of robots, is realized using a GRF-based credit assignment method. An action attention module is introduced to implicitly anticipate the motion intentions of neighboring robots, consequently mitigating potential non-stationarity issues in MARL. The proposed framework enables learning an efficient distributed control policy for multi-robot systems in challenging environments with a success rate of around 99%, as demonstrated through thorough comparisons with state-of-the-art solutions in simulations and experiments. Ablation studies are also performed to validate the efficiency of different framework modules.

Wolfpack Adversarial Attack for Robust Multi-Agent Reinforcement Learning

Authors:Sunwoo Lee, Jaebak Hwang, Yonghyeon Jo, Seungyul Han
Date:2025-02-05 02:59:23

Traditional robust methods in multi-agent reinforcement learning (MARL) often struggle against coordinated adversarial attacks in cooperative scenarios. To address this limitation, we propose the Wolfpack Adversarial Attack framework, inspired by wolf hunting strategies, which targets an initial agent and its assisting agents to disrupt cooperation. Additionally, we introduce the Wolfpack-Adversarial Learning for MARL (WALL) framework, which trains robust MARL policies to defend against the proposed Wolfpack attack by fostering system-wide collaboration. Experimental results underscore the devastating impact of the Wolfpack attack and the significant robustness improvements achieved by WALL.

MAGNNET: Multi-Agent Graph Neural Network-based Efficient Task Allocation for Autonomous Vehicles with Deep Reinforcement Learning

Authors:Lavanya Ratnabala, Aleksey Fedoseev, Robinroy Peter, Dzmitry Tsetserukou
Date:2025-02-04 13:29:56

This paper addresses the challenge of decentralized task allocation within heterogeneous multi-agent systems operating under communication constraints. We introduce a novel framework that integrates graph neural networks (GNNs) with a centralized training and decentralized execution (CTDE) paradigm, further enhanced by a tailored Proximal Policy Optimization (PPO) algorithm for multi-agent deep reinforcement learning (MARL). Our approach enables unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) to dynamically allocate tasks efficiently in a 3D grid environment without requiring central coordination. The framework minimizes total travel time while simultaneously avoiding conflicts in task assignments. For cost calculation and routing, we employ reservation-based A* and R* path planners. Experimental results reveal that our method achieves a high 92.5% conflict-free success rate, with only a 7.49% performance gap compared to the centralized Hungarian method, while outperforming a heuristic decentralized baseline based on a greedy approach. Additionally, the framework scales to 20 agents with an allocation processing time of 2.8 s and responds robustly to dynamically generated tasks, underscoring its potential for real-world applications in complex multi-agent scenarios.

Sequential Multi-objective Multi-agent Reinforcement Learning Approach for Predictive Maintenance

Authors:Yan Chen, Cheng Liu
Date:2025-02-04 07:42:58

Existing predictive maintenance (PdM) methods typically focus solely on whether to replace system components, without considering the costs incurred by inspection. However, a well-considered approach should minimize the Remaining Useful Life (RUL) at engine replacement while maximizing the inspection interval. To achieve this, multi-agent reinforcement learning (MARL) can be introduced. However, due to the sequential and mutually constraining nature of these two objectives, conventional MARL is not applicable. Therefore, this paper introduces a novel framework and develops a Sequential Multi-objective Multi-agent Proximal Policy Optimization (SMOMA-PPO) algorithm. Furthermore, to provide comprehensive and effective degradation information to RL agents, we also employ gated recurrent units (GRUs), quantile regression, and probability distribution fitting to develop a GRU-based RUL Prediction (GRP) model. Experiments demonstrate that the GRP method significantly improves the accuracy of RUL predictions in the later stages of system operation compared to existing methods. When incorporating its output into SMOMA-PPO, we achieve at least a 15% reduction in average RUL without unscheduled replacements (UR), nearly a 10% increase in the inspection interval, and an overall decrease in maintenance costs. Importantly, our approach offers a new perspective for addressing multi-objective maintenance planning with sequential constraints, effectively enhancing system reliability and reducing maintenance expenses.

CH-MARL: Constrained Hierarchical Multiagent Reinforcement Learning for Sustainable Maritime Logistics

Authors:Saad Alqithami
Date:2025-02-04 07:13:21

Addressing global challenges such as greenhouse gas emissions and resource inequity demands advanced AI-driven coordination among autonomous agents. We propose CH-MARL (Constrained Hierarchical Multiagent Reinforcement Learning), a novel framework that integrates hierarchical decision-making with dynamic constraint enforcement and fairness-aware reward shaping. CH-MARL employs a real-time constraint-enforcement layer to ensure adherence to global emission caps, while incorporating fairness metrics that promote equitable resource distribution among agents. Experiments conducted in a simulated maritime logistics environment demonstrate considerable reductions in emissions, along with improvements in fairness and operational efficiency. Beyond this domain-specific success, CH-MARL provides a scalable, generalizable solution to multi-agent coordination challenges in constrained, dynamic settings, thus advancing the state of the art in reinforcement learning.

VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play

Authors:Zelai Xu, Chao Yu, Ruize Zhang, Huining Yuan, Xiangmin Yi, Shilong Ji, Chuqi Wang, Wenhao Tang, Yu Wang
Date:2025-02-04 02:07:23

Multi-agent reinforcement learning (MARL) has made significant progress, largely fueled by the development of specialized testbeds that enable systematic evaluation of algorithms in controlled yet challenging scenarios. However, existing testbeds often focus on purely virtual simulations or limited robot morphologies such as robotic arms, quadrupeds, and humanoids, leaving high-mobility platforms with real-world physical constraints like drones underexplored. To bridge this gap, we present VolleyBots, a new MARL testbed where multiple drones cooperate and compete in the sport of volleyball under physical dynamics. VolleyBots features a turn-based interaction model under volleyball rules, a hierarchical decision-making process that combines motion control and strategic play, and a high-fidelity simulation for seamless sim-to-real transfer. We provide a comprehensive suite of tasks ranging from single-drone drills to multi-drone cooperative and competitive tasks, accompanied by baseline evaluations of representative MARL and game-theoretic algorithms. Results in simulation show that while existing algorithms handle simple tasks effectively, they encounter difficulty in complex tasks that require both low-level control and high-level strategy. We further demonstrate zero-shot deployment of a simulation-learned policy to real-world drones, highlighting VolleyBots' potential to propel MARL research involving agile robotic platforms. The project page is at https://sites.google.com/view/thu-volleybots/home.

Asynchronous Cooperative Multi-Agent Reinforcement Learning with Limited Communication

Authors:Sydney Dolan, Siddharth Nayak, Jasmine Jerry Aloor, Hamsa Balakrishnan
Date:2025-02-01 21:04:32

We consider the problem setting in which multiple autonomous agents must cooperatively navigate and perform tasks in an unknown, communication-constrained environment. Traditional multi-agent reinforcement learning (MARL) approaches assume synchronous communications and perform poorly in such environments. We propose AsynCoMARL, an asynchronous MARL approach that uses graph transformers to learn communication protocols from dynamic graphs. AsynCoMARL can accommodate infrequent and asynchronous communications between agents, with edges of the graph only forming when agents communicate with each other. We show that AsynCoMARL achieves similar success and collision rates as leading baselines, despite 26% fewer messages being passed between agents.

The Composite Task Challenge for Cooperative Multi-Agent Reinforcement Learning

Authors:Yurui Li, Yuxuan Chen, Li Zhang, Shijian Li, Gang Pan
Date:2025-02-01 07:07:08

The significant role of division of labor (DOL) in promoting cooperation is widely recognized in real-world applications. Many cooperative multi-agent reinforcement learning (MARL) methods have incorporated the concept of DOL to improve cooperation among agents. However, the tasks used in existing testbeds are typically ones where DOL is not a necessary feature for achieving optimal policies. Additionally, the DOL concept remains underexploited in MARL methods due to the absence of appropriate tasks. To enhance the generality and applicability of MARL methods in real-world scenarios, there is a need to develop tasks that demand multi-agent DOL and cooperation. In this paper, we propose a series of tasks designed to meet these requirements, drawing on real-world rules as guidance for their design. We guarantee that DOL and cooperation are necessary conditions for completing the tasks, and we introduce three factors that expand their diversity to cover more realistic situations. We evaluate 10 cooperative MARL methods on the proposed tasks. The results indicate that all baselines perform poorly on these tasks. To further validate their solvability, we also propose simplified variants of the proposed tasks. Experimental results show that baselines are able to handle these simplified variants, providing evidence of the solvability of the proposed tasks. The source files are available at https://github.com/Yurui-Li/CTC.

O-MAPL: Offline Multi-agent Preference Learning

Authors:The Viet Bui, Tien Mai, Hong Thanh Nguyen
Date:2025-01-31 08:08:20

Inferring reward functions from demonstrations is a key challenge in reinforcement learning (RL), particularly in multi-agent RL (MARL), where large joint state-action spaces and complex inter-agent interactions complicate the task. While prior single-agent studies have explored recovering reward functions and policies from human preferences, similar work in MARL is limited. Existing methods often involve separate stages of supervised reward learning and MARL algorithms, leading to unstable training. In this work, we introduce a novel end-to-end preference-based learning framework for cooperative MARL, leveraging the underlying connection between reward functions and soft Q-functions. Our approach uses a carefully designed multi-agent value decomposition strategy to improve training efficiency. Extensive experiments on SMAC and MAMuJoCo benchmarks show that our algorithm outperforms existing methods across various tasks.
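To make the reward-to-soft-Q connection concrete, here is a minimal sketch, our own construction rather than the authors' code, in which trajectory-segment scores come directly from per-agent Q-networks combined by a simplified (non-monotonic) value-decomposition mixer; all dimensions and module names are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of end-to-end preference-based MARL: no separate reward
# model is trained; segment scores come straight from per-agent soft
# Q-functions combined by a value-decomposition mixer.
n_agents, obs_dim, act_dim, hidden = 3, 16, 5, 64

q_nets = nn.ModuleList(
    nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    for _ in range(n_agents))
mixer = nn.Sequential(nn.Linear(n_agents, hidden), nn.ReLU(), nn.Linear(hidden, 1))

def segment_score(obs, acts):
    """obs: (T, n_agents, obs_dim); acts: (T, n_agents, act_dim) one-hot."""
    per_agent = torch.cat(
        [q_nets[i](torch.cat([obs[:, i], acts[:, i]], dim=-1)) for i in range(n_agents)],
        dim=-1)                                   # (T, n_agents)
    return mixer(per_agent).sum()                 # scalar score for the segment

def preference_loss(seg_a, seg_b, a_preferred):
    """Bradley-Terry loss on a pair of trajectory segments."""
    logit = segment_score(*seg_a) - segment_score(*seg_b)
    target = torch.tensor(1.0 if a_preferred else 0.0)
    return nn.functional.binary_cross_entropy_with_logits(logit, target)
```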

Learning Mean Field Control on Sparse Graphs

Authors:Christian Fabian, Kai Cui, Heinz Koeppl
Date:2025-01-28 17:03:30

Large agent networks are abundant in applications and nature and pose difficult challenges in the field of multi-agent reinforcement learning (MARL) due to their computational and theoretical complexity. While graphon mean field games and their extensions provide efficient learning algorithms for dense and moderately sparse agent networks, the case of realistic, sparser graphs remains largely unsolved. Thus, we propose a novel mean field control model inspired by local weak convergence to include sparse graphs such as power law networks with coefficients above two. Besides a theoretical analysis, we design scalable learning algorithms that apply to the challenging class of graph sequences with finite first moment. We compare our model and algorithms on various examples of synthetic and real-world networks against mean field algorithms based on Lp graphons and graphexes. As it turns out, our approach outperforms existing methods in many examples and on various networks, thanks to a design aimed at an important but so far hard-to-solve class of MARL problems.

Multi-Agent Meta-Offline Reinforcement Learning for Timely UAV Path Planning and Data Collection

Authors:Eslam Eldeeb, Hirley Alves
Date:2025-01-27 14:47:19

Multi-agent reinforcement learning (MARL) has been widely adopted in high-performance computing and complex data-driven decision-making in the wireless domain. However, conventional MARL schemes face many obstacles in real-world scenarios. First, most MARL algorithms are online, which might be unsafe and impractical. Second, MARL algorithms are environment-specific, meaning network configuration changes require model retraining. This letter proposes a novel meta-offline MARL algorithm that combines conservative Q-learning (CQL) and model agnostic meta-learning (MAML). CQL enables offline training by leveraging pre-collected datasets, while MAML ensures scalability and adaptability to dynamic network configurations and objectives. We propose two algorithm variants: independent training (M-I-MARL) and centralized training decentralized execution (M-CTDE-MARL). Simulation results show that the proposed algorithm outperforms conventional schemes, with the CTDE variant achieving 50% faster convergence in dynamic scenarios than the benchmarks. The proposed framework enhances scalability, robustness, and adaptability in wireless communication systems by optimizing UAV trajectories and scheduling policies.
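As a rough illustration of how CQL and MAML can fit together, the sketch below wraps a standard conservative Q-learning loss in a first-order, Reptile-style meta-update across network-configuration "tasks"; it is our simplification of the letter's idea, not its implementation, and all shapes and hyperparameters are placeholders.

```python
import copy
import torch
import torch.nn as nn

# Illustrative sketch: CQL inner loss + first-order meta-update across tasks.
# Single network, no target net, for brevity.

def cql_loss(q_net, batch, gamma=0.99, alpha=1.0):
    obs, act, rew, next_obs = batch                      # act: (B,) long indices
    q = q_net(obs).gather(1, act.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rew + gamma * q_net(next_obs).max(dim=1).values
    td = nn.functional.mse_loss(q, target)
    # Conservative term: push down Q on all actions, push up on dataset actions.
    conservative = (torch.logsumexp(q_net(obs), dim=1) - q).mean()
    return td + alpha * conservative

def meta_update(meta_q, task_batches, inner_lr=1e-3, meta_lr=1e-2, inner_steps=3):
    deltas = [torch.zeros_like(p) for p in meta_q.parameters()]
    for batch in task_batches:
        fast = copy.deepcopy(meta_q)                     # task-specific fast weights
        opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            opt.zero_grad()
            cql_loss(fast, batch).backward()
            opt.step()
        for d, mp, fp in zip(deltas, meta_q.parameters(), fast.parameters()):
            d += (fp.data - mp.data) / len(task_batches)
    with torch.no_grad():
        for mp, d in zip(meta_q.parameters(), deltas):
            mp += meta_lr * d                            # move toward adapted weights
```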

Adaptive AI-based Decentralized Resource Management in the Cloud-Edge Continuum

Authors:Lanpei Li, Jack Bell, Massimo Coppola, Vincenzo Lomonaco
Date:2025-01-27 06:07:09

The increasing complexity of application requirements and the dynamic nature of the Cloud-Edge Continuum present significant challenges for efficient resource management. These challenges stem from the ever-changing infrastructure, which is characterized by additions, removals, and reconfigurations of nodes and links, as well as the variability of application workloads. Traditional centralized approaches struggle to adapt to these changes due to their static nature, while decentralized solutions face challenges such as limited global visibility and coordination overhead. This paper proposes a hybrid decentralized framework for dynamic application placement and resource management. The framework utilizes Graph Neural Networks (GNNs) to embed resource and application states, enabling comprehensive representation and efficient decision-making. It employs a collaborative multi-agent reinforcement learning (MARL) approach, where local agents optimize resource management in their neighborhoods and a global orchestrator ensures system-wide coordination. By combining decentralized application placement with centralized oversight, our framework addresses the scalability, adaptability, and accuracy challenges inherent in the Cloud-Edge Continuum. This work contributes to the development of decentralized application placement strategies, the integration of GNN embeddings, and collaborative MARL systems, providing a foundation for efficient, adaptive and scalable resource management.

Contextual Knowledge Sharing in Multi-Agent Reinforcement Learning with Decentralized Communication and Coordination

Authors:Hung Du, Srikanth Thudumu, Hy Nguyen, Rajesh Vasa, Kon Mouzakis
Date:2025-01-26 22:49:50

Decentralized Multi-Agent Reinforcement Learning (Dec-MARL) has emerged as a pivotal approach for addressing complex tasks in dynamic environments. Existing Multi-Agent Reinforcement Learning (MARL) methodologies typically assume a shared objective among agents and rely on centralized control. However, many real-world scenarios feature agents with individual goals and limited observability of other agents, complicating coordination and hindering adaptability. Existing Dec-MARL strategies prioritize either communication or coordination, lacking an integrated approach that leverages both. This paper presents a novel Dec-MARL framework that integrates peer-to-peer communication and coordination, incorporating goal-awareness and time-awareness into the agents' knowledge-sharing processes. Our framework equips agents with the ability to (i) share contextually relevant knowledge to assist other agents, and (ii) reason based on information acquired from multiple agents, while considering their own goals and the temporal context of prior knowledge. We evaluate our approach through several complex multi-agent tasks in environments with dynamically appearing obstacles. Our work demonstrates that incorporating goal-aware and time-aware knowledge sharing significantly enhances overall performance.

MARL-OT: Multi-Agent Reinforcement Learning Guided Online Fuzzing to Detect Safety Violation in Autonomous Driving Systems

Authors:Linfeng Liang, Xi Zheng
Date:2025-01-24 12:34:04

Autonomous Driving Systems (ADSs) are safety-critical, as real-world safety violations can result in significant losses. Rigorous testing is essential before deployment, with simulation testing playing a key role. However, ADSs are typically complex, consisting of multiple modules such as perception and planning, or well-trained end-to-end autonomous driving systems. Offline methods, such as the Genetic Algorithm (GA), can only generate predefined trajectories for the dynamic objects and, due to their evolutionary nature, struggle to cause safety violations for ADSs rapidly and efficiently across different scenarios. Online methods, such as single-agent reinforcement learning (RL), can quickly adjust the dynamic objects' trajectories online to adapt to different scenarios, but they struggle to capture the complex corner cases of an ADS that arise from the intricate interplay among multiple vehicles. Multi-agent reinforcement learning (MARL) is well suited to such cooperative tasks, but it faces its own challenges, particularly with convergence. This paper introduces MARL-OT, a scalable framework that leverages MARL to detect safety violations of an ADS resulting from surrounding vehicles' cooperation. MARL-OT employs MARL for high-level guidance, triggering various dangerous scenarios for the rule-based online fuzzer to explore potential safety violations of the ADS, thereby generating dynamic, realistic safety violation scenarios. Our approach improves the detected safety violation rate by up to 136.2% compared to the state-of-the-art (SOTA) testing technique.

Scalable Safe Multi-Agent Reinforcement Learning for Multi-Agent System

Authors:Haikuo Du, Fandi Gou, Yunze Cai
Date:2025-01-23 15:01:19

Safety and scalability are two critical challenges faced by practical Multi-Agent Systems (MAS). However, existing Multi-Agent Reinforcement Learning (MARL) algorithms that rely solely on reward shaping are ineffective in ensuring safety, and their scalability is rather limited due to the fixed-size network output. To address these issues, we propose a novel framework, Scalable Safe MARL (SS-MARL), to enhance the safety and scalability of MARL methods. Leveraging the inherent graph structure of MAS, we design a multi-layer message passing network to aggregate local observations and communications of varying sizes. Furthermore, we develop a constrained joint policy optimization method in the setting of local observation to improve safety. Simulation experiments demonstrate that SS-MARL achieves a better trade-off between optimality and safety compared to baselines, and its scalability significantly outperforms the latest methods in scenarios with a large number of agents. The feasibility of our method is also verified by hardware implementation with Mecanum-wheeled vehicles.

WFCRL: A Multi-Agent Reinforcement Learning Benchmark for Wind Farm Control

Authors:Claire Bizon Monroc, Ana Bušić, Donatien Dubuc, Jiamin Zhu
Date:2025-01-23 12:01:17

The wind farm control problem is challenging, since conventional model-based control strategies require tractable models of complex aerodynamical interactions between the turbines and suffer from the curse of dimensionality when the number of turbines increases. Recently, model-free and multi-agent reinforcement learning approaches have been used to address this challenge. In this article, we introduce WFCRL (Wind Farm Control with Reinforcement Learning), the first open suite of multi-agent reinforcement learning environments for the wind farm control problem. WFCRL frames a cooperative Multi-Agent Reinforcement Learning (MARL) problem: each turbine is an agent and can learn to adjust its yaw, pitch or torque to maximize the common objective (e.g. the total power production of the farm). WFCRL also offers turbine load observations that allow optimizing farm performance while limiting turbine structural damage. Interfaces with two state-of-the-art farm simulators are implemented in WFCRL: a static simulator (FLORIS) and a dynamic simulator (FAST.Farm). For each simulator, 10 wind layouts are provided, including 5 real wind farms. Two state-of-the-art online MARL algorithms are implemented to illustrate the scaling challenges. As learning online on FAST.Farm is highly time-consuming, WFCRL offers the possibility of designing transfer learning strategies from FLORIS to FAST.Farm.

BMG-Q: Localized Bipartite Match Graph Attention Q-Learning for Ride-Pooling Order Dispatch

Authors:Yulong Hu, Siyuan Feng, Sen Li
Date:2025-01-23 08:01:24

This paper introduces Localized Bipartite Match Graph Attention Q-Learning (BMG-Q), a novel Multi-Agent Reinforcement Learning (MARL) algorithm framework tailored for ride-pooling order dispatch. BMG-Q advances the ride-pooling decision-making process with the localized bipartite match graph underlying the Markov Decision Process, enabling the development of a novel Graph Attention Double Deep Q Network (GATDDQN) as the MARL backbone to capture the dynamic interactions among ride-pooling vehicles in the fleet. Our approach enriches the state information of each agent with GATDDQN by leveraging a localized bipartite interdependence graph and enables a centralized global coordinator to optimize order matching and agent behavior using Integer Linear Programming (ILP). Enhanced by gradient clipping and localized graph sampling, our GATDDQN improves scalability and robustness. Furthermore, the inclusion of a posterior score function in the ILP captures the online exploration-exploitation trade-off and reduces the potential overestimation bias of agents, thereby elevating the quality of the derived solutions. Through extensive experiments and validation, BMG-Q has demonstrated superior performance in both training and operations for thousands of vehicle agents, outperforming benchmark reinforcement learning frameworks by around 10% in accumulated rewards and showing a significant reduction in overestimation bias of over 50%. Additionally, it maintains robustness amidst task variations and fleet size changes, establishing BMG-Q as an effective, scalable, and robust framework for advancing ride-pooling order dispatch operations.
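As a rough illustration of the backbone idea, the sketch below attends over a vehicle interdependence graph before emitting per-order Q-values; the class name, dimensions, and masking scheme are our assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch (our assumptions, not the authors' GATDDQN): each vehicle
# agent enriches its state by attending over neighbors in the localized
# bipartite interdependence graph before producing per-order Q-values.

class GraphAttentionQ(nn.Module):
    def __init__(self, feat_dim=32, n_orders=10, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.q_head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                    nn.Linear(64, n_orders))

    def forward(self, feats, adj):
        """feats: (n_vehicles, feat_dim); adj: (n, n) bool with self-loops,
        True where vehicles compete for overlapping orders."""
        x = feats.unsqueeze(0)                      # (1, N, D)
        h, _ = self.attn(x, x, x, attn_mask=~adj)   # True in mask = blocked
        return self.q_head(h.squeeze(0))            # (n_vehicles, n_orders)

# A centralized coordinator would then solve an integer linear program that
# picks the order-vehicle assignment maximizing the sum of selected Q-values.
```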

SRMT: Shared Memory for Multi-agent Lifelong Pathfinding

Authors:Alsu Sagirova, Yuri Kuratov, Mikhail Burtsev
Date:2025-01-22 20:08:53

Multi-agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi-agent problems in various environments. One of the principal challenges in MARL is the need for explicit prediction of the agents' behavior to achieve cooperation. To resolve this issue, we propose the Shared Recurrent Memory Transformer (SRMT) which extends memory transformers to multi-agent settings by pooling and globally broadcasting individual working memories, enabling agents to exchange information implicitly and coordinate their actions. We evaluate SRMT on the Partially Observable Multi-Agent Pathfinding problem in a toy Bottleneck navigation task that requires agents to pass through a narrow corridor and on a POGEMA benchmark set of tasks. In the Bottleneck task, SRMT consistently outperforms a variety of reinforcement learning baselines, especially under sparse rewards, and generalizes effectively to longer corridors than those seen during training. On POGEMA maps, including Mazes, Random, and MovingAI, SRMT is competitive with recent MARL, hybrid, and planning-based algorithms. These results suggest that incorporating shared recurrent memory into the transformer-based architectures can enhance coordination in decentralized multi-agent systems. The source code for training and evaluation is available on GitHub: https://github.com/Aloriosa/srmt.
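A minimal sketch of the shared-memory mechanism as we read it (not the released code): each agent's recurrent working memory is pooled with everyone else's, and an attention read over the pool drives the memory update, so information is exchanged implicitly.

```python
import torch
import torch.nn as nn

# Rough sketch of shared recurrent memory: agents read the pooled memories
# of all agents via attention, then update their own memory recurrently.

class SharedMemoryBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rnn = nn.GRUCell(dim, dim)

    def forward(self, obs_emb, memories):
        """obs_emb, memories: (n_agents, dim)."""
        pool = memories.unsqueeze(0)                   # globally broadcast pool (1, N, D)
        query = obs_emb.unsqueeze(0)
        read, _ = self.attn(query, pool, pool)         # each agent reads all memories
        return self.rnn(read.squeeze(0), memories)     # update own working memory
```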

An Offline Multi-Agent Reinforcement Learning Framework for Radio Resource Management

Authors:Eslam Eldeeb, Hirley Alves
Date:2025-01-22 16:25:46

Offline multi-agent reinforcement learning (MARL) addresses key limitations of online MARL, such as safety concerns, expensive data collection, extended training intervals, and high signaling overhead caused by online interactions with the environment. In this work, we propose an offline MARL algorithm for radio resource management (RRM), focusing on optimizing scheduling policies for multiple access points (APs) to jointly maximize the sum and tail rates of user equipment (UEs). We evaluate three training paradigms: centralized, independent, and centralized training with decentralized execution (CTDE). Our simulation results demonstrate that the proposed offline MARL framework outperforms conventional baseline approaches, achieving over a 15% improvement in a weighted combination of sum and tail rates. Additionally, the CTDE framework strikes an effective balance, reducing the computational complexity of centralized methods while addressing the inefficiencies of independent training. These results underscore the potential of offline MARL to deliver scalable, robust, and efficient solutions for resource management in dynamic wireless networks.

Experience-replay Innovative Dynamics

Authors:Tuo Zhang, Leonardo Stella, Julian Barreiro-Gomez
Date:2025-01-21 15:10:14

Despite its groundbreaking success, multi-agent reinforcement learning (MARL) still suffers from instability and nonstationarity. Replicator dynamics, the most well-known model from evolutionary game theory (EGT), provide a theoretical framework for the convergence of the trajectories to Nash equilibria and, as a result, have been used to ensure formal guarantees for MARL algorithms in stable game settings. However, they exhibit the opposite behavior in other settings, which poses the problem of finding alternatives to ensure convergence. In contrast, innovative dynamics, such as the Brown-von Neumann-Nash (BNN) or Smith, result in periodic trajectories with the potential to approximate Nash equilibria. Yet, no MARL algorithms based on these dynamics have been proposed. In response to this challenge, we develop a novel experience replay-based MARL algorithm that incorporates revision protocols as tunable hyperparameters. We demonstrate, by appropriately adjusting the revision protocols, that the behavior of our algorithm mirrors the trajectories resulting from these dynamics. Importantly, our contribution provides a framework capable of extending the theoretical guarantees of MARL algorithms beyond replicator dynamics. Finally, we corroborate our theoretical findings with empirical results.
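For concreteness, here is a worked sketch of the Brown-von Neumann-Nash dynamics the abstract refers to, discretized with an Euler step; the payoff matrix, step size, and renormalization are our illustrative choices.

```python
import numpy as np

# BNN revision protocol: strategies gain mass in proportion to their
# positive excess payoff, with a proportional outflow keeping mass balanced.

def bnn_step(x, payoff_matrix, dt=0.01):
    """x: mixed strategy on the simplex; payoff_matrix: (n, n)."""
    f = payoff_matrix @ x                  # per-strategy payoffs F_i(x)
    avg = x @ f                            # population-average payoff
    gamma = np.maximum(f - avg, 0.0)       # BNN excess-payoff protocol
    x_new = x + dt * (gamma - x * gamma.sum())
    return x_new / x_new.sum()             # renormalize against Euler drift

# Rock-paper-scissors: trajectories head toward the mixed Nash (1/3, 1/3, 1/3).
rps = np.array([[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]])
x = np.array([0.8, 0.1, 0.1])
for _ in range(5000):
    x = bnn_step(x, rps)
```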

Tackling Uncertainties in Multi-Agent Reinforcement Learning through Integration of Agent Termination Dynamics

Authors:Somnath Hazra, Pallab Dasgupta, Soumyajit Dey
Date:2025-01-21 11:31:01

Multi-Agent Reinforcement Learning (MARL) has gained significant traction for solving complex real-world tasks, but the inherent stochasticity and uncertainty in these environments pose substantial challenges to efficient and robust policy learning. While Distributional Reinforcement Learning has been successfully applied in single-agent settings to address risk and uncertainty, its application in MARL is substantially limited. In this work, we propose a novel approach that integrates distributional learning with a safety-focused loss function to improve convergence in cooperative MARL tasks. Specifically, we introduce a Barrier Function based loss that leverages safety metrics, identified from inherent faults in the system, into the policy learning process. This additional loss term helps mitigate risks and encourages safer exploration during the early stages of training. We evaluate our method in the StarCraft II micromanagement benchmark, where our approach demonstrates improved convergence and outperforms state-of-the-art baselines in terms of both safety and task completion. Our results suggest that incorporating safety considerations can significantly enhance learning performance in complex, multi-agent environments.
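The loss combination might look roughly like the following sketch, which pairs a standard quantile-regression distributional TD loss with a barrier-style hinge; the decrease condition, coefficients, and shapes are all our assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

# Schematic: distributional TD loss over quantiles + barrier-function hinge.

def quantile_td_loss(quantiles, target_quantiles, taus):
    """quantiles, target_quantiles: (B, N); taus: (N,) quantile fractions."""
    pred = quantiles.unsqueeze(2).expand(-1, -1, quantiles.size(1))   # (B, N, N)
    targ = target_quantiles.unsqueeze(1).expand_as(pred)
    weight = torch.abs(taus.view(1, -1, 1) - ((targ - pred).detach() < 0).float())
    huber = nn.functional.smooth_l1_loss(pred, targ, reduction="none")
    return (weight * huber).mean()

def barrier_loss(h, h_next, alpha=0.1):
    """h >= 0 encodes a safety metric (e.g., margin to an identified fault
    state); penalize violations of the decrease condition h' - h >= -alpha*h."""
    return torch.relu(-(h_next - h) - alpha * h).mean()

# total_loss = quantile_td_loss(q, q_target, taus) + lam * barrier_loss(h, h_next)
```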

ColorGrid: A Multi-Agent Non-Stationary Environment for Goal Inference and Assistance

Authors:Andrey Risukhin, Kavel Rao, Ben Caffee, Alan Fan
Date:2025-01-17 22:55:33

Autonomous agents' interactions with humans are increasingly focused on adapting to their changing preferences in order to improve assistance in real-world tasks. Effective agents must learn to accurately infer human goals, which are often hidden, to collaborate well. However, existing Multi-Agent Reinforcement Learning (MARL) environments lack the necessary attributes required to rigorously evaluate these agents' learning capabilities. To this end, we introduce ColorGrid, a novel MARL environment with customizable non-stationarity, asymmetry, and reward structure. We investigate the performance of Independent Proximal Policy Optimization (IPPO), a state-of-the-art (SOTA) MARL algorithm, in ColorGrid and find through extensive ablations that, particularly with simultaneous non-stationary and asymmetric goals between a ``leader'' agent representing a human and a ``follower'' assistant agent, ColorGrid is unsolved by IPPO. To support benchmarking future MARL algorithms, we release our environment code, model checkpoints, and trajectory visualizations at https://github.com/andreyrisukhin/ColorGrid.

GAWM: Global-Aware World Model for Multi-Agent Reinforcement Learning

Authors:Zifeng Shi, Meiqin Liu, Senlin Zhang, Ronghao Zheng, Shanling Dong, Ping Wei
Date:2025-01-17 11:01:56

In recent years, Model-based Multi-Agent Reinforcement Learning (MARL) has demonstrated significant advantages over model-free methods in terms of sample efficiency by using independent environment dynamics world models for data sample augmentation. However, without considering the limited sample size, these methods still lag behind model-free methods in terms of final convergence performance and stability. This is primarily due to the world model's insufficient and unstable representation of global states in partially observable environments. This limitation hampers the ability to ensure global consistency in the data samples and results in a time-varying and unstable distribution mismatch between the pseudo data samples generated by the world model and the real samples. This issue becomes particularly pronounced in more complex multi-agent environments. To address this challenge, we propose a model-based MARL method called GAWM, which enhances the centralized world model's ability to achieve globally unified and accurate representation of state information while adhering to the CTDE paradigm. GAWM uniquely leverages an additional Transformer architecture to fuse local observation information from different agents, thereby improving its ability to extract and represent global state information. This enhancement not only improves sample efficiency but also enhances training stability, leading to superior convergence performance, particularly in complex and challenging multi-agent environments. This advancement enables model-based methods to be effectively applied to more complex multi-agent environments. Experimental results demonstrate that GAWM outperforms various model-free and model-based approaches, achieving exceptional performance in the challenging domains of SMAC.

Networked Agents in the Dark: Team Value Learning under Partial Observability

Authors:Guilherme S. Varela, Alberto Sardinha, Francisco S. Melo
Date:2025-01-15 13:01:32

We propose a novel cooperative multi-agent reinforcement learning (MARL) approach for networked agents. In contrast to previous methods that rely on complete state information or joint observations, our agents must learn how to reach shared objectives under partial observability. During training, they collect individual rewards and approximate a team value function through local communication, resulting in cooperative behavior. To describe our problem, we introduce the networked dynamic partially observable Markov game framework, where agents communicate over a switching-topology communication network. Our distributed method, DNA-MARL, uses a consensus mechanism for local communication and gradient descent for local computation. DNA-MARL broadens the range of possible applications of networked agents and is well suited to real-world domains that impose privacy requirements and where messages may not reach their recipients. We evaluate DNA-MARL across benchmark MARL scenarios. Our results highlight the superior performance of DNA-MARL over previous methods.
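A minimal sketch of the consensus step under assumed Metropolis weights: each round, agents average their local team-value estimates with current neighbors on the switching topology, so repeated rounds drive all estimates toward the team average.

```python
import numpy as np

# One consensus round over a (possibly re-sampled) communication graph.

def consensus_round(values, adjacency):
    """values: (n,) local team-value estimates; adjacency: (n, n) 0/1,
    symmetric, re-sampled each round as the network switches."""
    n = len(values)
    deg = adjacency.sum(axis=1)
    new = values.copy()
    for i in range(n):
        for j in range(n):
            if adjacency[i, j]:
                # Metropolis weights keep the mixing matrix doubly stochastic,
                # so iterating converges to the average of the initial values.
                new[i] += (values[j] - values[i]) / (1 + max(deg[i], deg[j]))
    return new
```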

AutoRestTest: A Tool for Automated REST API Testing Using LLMs and MARL

Authors:Tyler Stennett, Myeongsoo Kim, Saurabh Sinha, Alessandro Orso
Date:2025-01-15 05:54:33

As REST APIs have become widespread in modern web services, comprehensive testing of these APIs has become increasingly crucial. Due to the vast search space consisting of operations, parameters, and parameter values along with their complex dependencies and constraints, current testing tools suffer from low code coverage, leading to suboptimal fault detection. To address this limitation, we present a novel tool, AutoRestTest, which integrates the Semantic Operation Dependency Graph (SODG) with Multi-Agent Reinforcement Learning (MARL) and large language models (LLMs) for effective REST API testing. AutoRestTest determines operation-dependent parameters using the SODG and employs five specialized agents (operation, parameter, value, dependency, and header) to identify dependencies of operations and generate operation sequences, parameter combinations, and values. AutoRestTest provides a command-line interface and continuous telemetry on successful operation count, unique server errors detected, and time elapsed. Upon completion, AutoRestTest generates a detailed report highlighting errors detected and operations exercised. In this paper, we introduce our tool and present preliminary results.

Dynamic Pricing in High-Speed Railways Using Multi-Agent Reinforcement Learning

Authors:Enrique Adrian Villarrubia-Martin, Luis Rodriguez-Benitez, David Muñoz-Valero, Giovanni Montana, Luis Jimenez-Linares
Date:2025-01-14 16:19:25

This paper addresses a critical challenge in the high-speed passenger railway industry: designing effective dynamic pricing strategies in the context of competing and cooperating operators. To address this, a multi-agent reinforcement learning (MARL) framework based on a non-zero-sum Markov game is proposed, incorporating random utility models to capture passenger decision making. Unlike prior studies in areas such as energy, airlines, and mobile networks, dynamic pricing for railway systems using deep reinforcement learning has received limited attention. A key contribution of this paper is a parametrisable and versatile reinforcement learning simulator designed to model a variety of railway network configurations and demand patterns while enabling realistic, microscopic modelling of user behaviour, called RailPricing-RL. This environment supports the proposed MARL framework, which models heterogeneous agents competing to maximise individual profits while fostering cooperative behaviour to synchronise connecting services. Experimental results validate the framework, demonstrating how user preferences affect MARL performance and how pricing policies influence passenger choices, utility, and overall system dynamics. This study provides a foundation for advancing dynamic pricing strategies in railway systems, aligning profitability with system-wide efficiency, and supporting future research on optimising pricing policies.

Cooperative Patrol Routing: Optimizing Urban Crime Surveillance through Multi-Agent Reinforcement Learning

Authors:Juan Palma-Borda, Eduardo Guzmán, María-Victoria Belmonte
Date:2025-01-14 11:20:19

The effective design of patrol strategies is a difficult and complex problem, especially in medium and large areas. The objective is to plan, in a coordinated manner, the optimal routes for a set of patrols in a given area, in order to achieve maximum coverage of the area, while also trying to minimize the number of patrols. In this paper, we propose a multi-agent reinforcement learning (MARL) model, based on a decentralized partially observable Markov decision process, to plan unpredictable patrol routes within an urban environment represented as an undirected graph. The model attempts to maximize a target function that characterizes the environment within a given time frame. Our model has been tested to optimize police patrol routes in three medium-sized districts of the city of Malaga. The aim was to maximize surveillance coverage of the most crime-prone areas, based on actual crime data in the city. To address this problem, several MARL algorithms have been studied, and among these the Value Decomposition Proximal Policy Optimization (VDPPO) algorithm exhibited the best performance. We also introduce a novel metric, the coverage index, for the evaluation of the coverage performance of the routes generated by our model. This metric is inspired by the predictive accuracy index (PAI), which is commonly used in criminology to detect hotspots. Using this metric, we have evaluated the model under various scenarios in which the number of agents (or patrols), their starting positions, and the level of information they can observe in the environment have been modified. Results show that the coordinated routes generated by our model achieve a coverage of more than 90% of the 3% of graph nodes with the highest crime incidence, and 65% for 20% of these nodes; 3% and 20% represent the coverage standards for police resource allocation.

Reinforcement Learning for Enhancing Sensing Estimation in Bistatic ISAC Systems with UAV Swarms

Authors:Obed Morrison Atsu, Salmane Naoumi, Roberto Bomfin, Marwa Chafii
Date:2025-01-11 06:57:52

This paper introduces a novel Multi-Agent Reinforcement Learning (MARL) framework to enhance integrated sensing and communication (ISAC) networks using unmanned aerial vehicle (UAV) swarms as sensing radars. By framing the positioning and trajectory optimization of UAVs as a Partially Observable Markov Decision Process, we develop a MARL approach that leverages centralized training with decentralized execution to maximize the overall sensing performance. Specifically, we implement a decentralized cooperative MARL strategy to enable UAVs to develop effective communication protocols, therefore enhancing their environmental awareness and operational efficiency. Additionally, we augment the MARL solution with a transmission power adaptation technique that mitigates interference between the communicating drones and optimizes the efficiency of the learned communication protocol. Despite the increased complexity, our solution demonstrates robust performance and adaptability across various scenarios, providing a scalable and cost-effective enhancement for future ISAC networks.

CoDe: Communication Delay-Tolerant Multi-Agent Collaboration via Dual Alignment of Intent and Timeliness

Authors:Shoucheng Song, Youfang Lin, Sheng Han, Chang Yao, Hao Wu, Shuo Wang, Kai Lv
Date:2025-01-09 12:57:41

Communication has been widely employed to enhance multi-agent collaboration. Previous research has typically assumed delay-free communication, a strong assumption that is challenging to meet in practice. However, real-world agents suffer from channel delays, receiving messages sent at different time points, termed Asynchronous Communication, leading to cognitive biases and breakdowns in collaboration. This paper first defines two communication delay settings in MARL and emphasizes their harm to collaboration. To handle these delays, this paper proposes a novel framework, Communication Delay-tolerant Multi-Agent Collaboration (CoDe). First, CoDe learns an intent representation as messages through future action inference, reflecting the stable future behavioral trends of the agents. Then, CoDe devises a dual alignment mechanism of intent and timeliness to strengthen the fusion process of asynchronous messages. In this way, agents can extract the long-term intent of others, even from delayed messages, and selectively utilize the most recent messages that are relevant to their intent. Experimental results demonstrate that CoDe outperforms baseline algorithms in three MARL benchmarks without delay and exhibits robustness under fixed and time-varying delays.

Revisiting Communication Efficiency in Multi-Agent Reinforcement Learning from the Dimensional Analysis Perspective

Authors:Chuxiong Sun, Peng He, Rui Wang, Changwen Zheng
Date:2025-01-06 10:06:43

In this work, we introduce a novel perspective, i.e., dimensional analysis, to address the challenge of communication efficiency in Multi-Agent Reinforcement Learning (MARL). Our findings reveal that simply optimizing the content and timing of communication at the sending end is insufficient to fully resolve communication efficiency issues. Even after applying optimized and gated messages, dimensional redundancy and confounders still persist in the integrated message embeddings at the receiving end, which negatively impacts communication quality and decision-making. To address these challenges, we propose Dimensional Rational Multi-Agent Communication (DRMAC), designed to mitigate both dimensional redundancy and confounders in MARL. DRMAC incorporates a redundancy-reduction regularization term to encourage the decoupling of information across dimensions within the learned representations of integrated messages. Additionally, we introduce a dimensional mask that dynamically adjusts gradient weights during training to eliminate the influence of decision-irrelevant dimensions. We evaluate DRMAC across a diverse set of multi-agent tasks, demonstrating its superior performance over existing state-of-the-art methods in complex scenarios. Furthermore, the plug-and-play nature of DRMAC's key modules highlights its generalizable performance, serving as a valuable complement rather than a replacement for existing multi-agent communication strategies.
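Our sketch of what the two ingredients could look like, with the exact penalty form and all names assumed: a decorrelation term over embedding dimensions, plus a learned per-dimension gate that suppresses decision-irrelevant dimensions.

```python
import torch
import torch.nn as nn

# Redundancy reduction in the spirit of the abstract: penalize off-diagonal
# correlation between dimensions of the integrated message embedding.

def redundancy_loss(z, eps=1e-6):
    """z: (batch, dim) integrated message embeddings at the receiving end."""
    z = (z - z.mean(0)) / (z.std(0) + eps)      # standardize each dimension
    corr = (z.T @ z) / z.size(0)                # (dim, dim) correlation matrix
    off_diag = corr - torch.diag(torch.diagonal(corr))
    return (off_diag ** 2).sum() / z.size(1)    # penalize cross-dimension coupling

class DimensionalMask(nn.Module):
    """Sigmoid gate per dimension; entries driven toward zero suppress the
    gradients (and influence) of decision-irrelevant dimensions."""
    def __init__(self, dim):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(dim))

    def forward(self, z):
        return z * torch.sigmoid(self.logits)
```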

CORD: Generalizable Cooperation via Role Diversity

Authors:Kanefumi Matsuyama, Kefan Su, Jiangxing Wang, Deheng Ye, Zongqing Lu
Date:2025-01-04 07:53:38

Cooperative multi-agent reinforcement learning (MARL) aims to develop agents that can collaborate effectively. However, most cooperative MARL methods overfit to their training partners, so the learned policies do not generalize well to unseen collaborators, which is a critical issue for real-world deployment. Some methods attempt to address the generalization problem but require prior knowledge or predefined policies of new teammates, limiting real-world applications. To this end, we propose a hierarchical MARL approach to enable generalizable cooperation via role diversity, namely CORD. CORD's high-level controller assigns roles to low-level agents by maximizing the role entropy with constraints. We show this constrained objective can be decomposed into causal influence in role that enables reasonable role assignment, and role heterogeneity that yields coherent, non-redundant role clusters. Evaluated on a variety of cooperative multi-agent tasks, CORD achieves better performance than baselines, especially in generalization tests. Ablation studies further demonstrate the efficacy of the constrained objective in generalizable cooperation.

TACTIC: Task-Agnostic Contrastive pre-Training for Inter-Agent Communication

Authors:Peihong Yu, Manav Mishra, Syed Zaidi, Pratap Tokekar
Date:2025-01-04 03:29:15

The "sight range dilemma" in cooperative Multi-Agent Reinforcement Learning (MARL) presents a significant challenge: limited observability hinders team coordination, while extensive sight ranges lead to distracted attention and reduced performance. While communication can potentially address this issue, existing methods often struggle to generalize across different sight ranges, limiting their effectiveness. We propose TACTIC, Task-Agnostic Contrastive pre-Training strategy Inter-Agent Communication. TACTIC is an adaptive communication mechanism that enhances agent coordination even when the sight range during execution is vastly different from that during training. The communication mechanism encodes messages and integrates them with local observations, generating representations grounded in the global state using contrastive learning. By learning to generate and interpret messages that capture important information about the whole environment, TACTIC enables agents to effectively "see" more through communication, regardless of their sight ranges. We comprehensively evaluate TACTIC on the SMACv2 benchmark across various scenarios with broad sight ranges. The results demonstrate that TACTIC consistently outperforms traditional state-of-the-art MARL techniques with and without communication, in terms of generalizing to sight ranges different from those seen in training, particularly in cases of extremely limited or extensive observability.

Symmetries-enhanced Multi-Agent Reinforcement Learning

Authors:Nikolaos Bousias, Stefanos Pertigkiozoglou, Kostas Daniilidis, George Pappas
Date:2025-01-02 08:41:31

Multi-agent reinforcement learning has emerged as a powerful framework for enabling agents to learn complex, coordinated behaviors but faces persistent challenges regarding its generalization, scalability and sample efficiency. Recent advancements have sought to alleviate those issues by embedding intrinsic symmetries of the systems in the policy. Yet, most dynamical systems exhibit little to no symmetries to exploit. This paper presents a novel framework for embedding extrinsic symmetries in multi-agent system dynamics that enables the use of symmetry-enhanced methods to address systems with insufficient intrinsic symmetries, expanding the scope of equivariant learning to a wide variety of MARL problems. Central to our framework is the Group Equivariant Graphormer, a group-modular architecture specifically designed for distributed swarming tasks. Extensive experiments on a swarm of symmetry-breaking quadrotors validate the effectiveness of our approach, showcasing its potential for improved generalization and zero-shot scalability. Our method achieves significant reductions in collision rates and enhances task success rates across a diverse range of scenarios and varying swarm sizes.

Generative Emergent Communication: Large Language Model is a Collective World Model

Authors:Tadahiro Taniguchi, Ryo Ueda, Tomoaki Nakamura, Masahiro Suzuki, Akira Taniguchi
Date:2024-12-31 02:23:10

This study proposes a unifying theoretical framework called generative emergent communication (generative EmCom) that bridges emergent communication, world models, and large language models (LLMs) through the lens of collective predictive coding (CPC). The proposed framework formalizes the emergence of language and symbol systems through decentralized Bayesian inference across multiple agents, extending beyond conventional discriminative model-based approaches to emergent communication. This study makes the following two key contributions: First, we propose generative EmCom as a novel framework for understanding emergent communication, demonstrating how communication emergence in multi-agent reinforcement learning (MARL) can be derived from control as inference while clarifying its relationship to conventional discriminative approaches. Second, we propose a mathematical formulation showing the interpretation of LLMs as collective world models that integrate multiple agents' experiences through CPC. The framework provides a unified theoretical foundation for understanding how shared symbol systems emerge through collective predictive coding processes, bridging individual cognitive development and societal language evolution. Through mathematical formulations and discussion on prior works, we demonstrate how this framework explains fundamental aspects of language emergence and offers practical insights for understanding LLMs and developing sophisticated AI systems for improving human-AI interaction and multi-agent systems.

Distributed Traffic Control in Complex Dynamic Roadblocks: A Multi-Agent Deep RL Approach

Authors:Noor Aboueleneen, Yahuza Bello, Abdullatif Albaseer, Ahmed Refaey Hussein, Mohamed Abdallah, Ekram Hossain
Date:2024-12-31 01:27:15

Autonomous Vehicles (AVs) represent a transformative advancement in the transportation industry. These vehicles have sophisticated sensors, advanced algorithms, and powerful computing systems that allow them to navigate and operate without direct human intervention. However, AVs' systems still get overwhelmed when they encounter a complex dynamic change in the environment resulting from an accident or a roadblock for maintenance. The advanced features of Sixth Generation (6G) technology are set to offer strong support to AVs, enabling real-time data exchange and management of complex driving maneuvers. This paper proposes a Multi-Agent Reinforcement Learning (MARL) framework to improve AVs' decision-making in dynamic and complex Intelligent Transportation Systems (ITS) utilizing 6G-V2X communication. The primary objective is to enable AVs to avoid roadblocks efficiently by changing lanes while maintaining optimal traffic flow and maximizing the mean harmonic speed. To ensure realistic operations, key constraints such as minimum vehicle speed, roadblock count, and lane change frequency are integrated. We train and test the proposed MARL model with two traffic simulation scenarios using the SUMO and TraCI interface. Through extensive simulations, we demonstrate that the proposed model adapts to various traffic conditions and achieves efficient and robust traffic flow management. The trained model effectively navigates dynamic roadblocks, improving traffic efficiency in AV operations by more than 70% over other benchmark solutions.

Advances in Multi-agent Reinforcement Learning: Persistent Autonomy and Robot Learning Lab Report 2024

Authors:Reza Azadeh
Date:2024-12-30 17:08:05

Multi-Agent Reinforcement Learning (MARL) approaches have emerged as popular solutions to address the general challenges of cooperation in multi-agent environments, where the success of achieving shared or individual goals critically depends on the coordination and collaboration between agents. However, existing cooperative MARL methods face several challenges intrinsic to multi-agent systems, such as the curse of dimensionality, non-stationarity, and the need for a global exploration strategy. Moreover, the presence of agents with constraints (e.g., limited battery life, restricted mobility) or distinct roles further exacerbates these challenges. This document provides an overview of recent advances in Multi-Agent Reinforcement Learning (MARL) conducted at the Persistent Autonomy and Robot Learning (PeARL) lab at the University of Massachusetts Lowell. We briefly discuss various research directions and present a selection of approaches proposed in our most recent publications. For each proposed approach, we also highlight potential future directions to further advance the field.

Game Theory and Multi-Agent Reinforcement Learning: From Nash Equilibria to Evolutionary Dynamics

Authors:Neil De La Fuente, Miquel Noguer i Alonso, Guim Casadellà
Date:2024-12-29 17:15:40

This paper explores advanced topics in complex multi-agent systems building upon our previous work. We examine four fundamental challenges in Multi-Agent Reinforcement Learning (MARL): non-stationarity, partial observability, scalability with large agent populations, and decentralized learning. The paper provides mathematical formulations and analysis of recent algorithmic advancements designed to address these challenges, with a particular focus on their integration with game-theoretic concepts. We investigate how Nash equilibria, evolutionary game theory, correlated equilibrium, and adversarial dynamics can be effectively incorporated into MARL algorithms to improve learning outcomes. Through this comprehensive analysis, we demonstrate how the synthesis of game theory and MARL can enhance the robustness and effectiveness of multi-agent systems in complex, dynamic environments.

SMAC-Hard: Enabling Mixed Opponent Strategy Script and Self-play on SMAC

Authors:Yue Deng, Yan Yu, Weiyu Ma, Zirui Wang, Wenhui Zhu, Jian Zhao, Yin Zhang
Date:2024-12-23 16:36:21

The availability of challenging simulation environments is pivotal for advancing the field of Multi-Agent Reinforcement Learning (MARL). In cooperative MARL settings, the StarCraft Multi-Agent Challenge (SMAC) has gained prominence as a benchmark for algorithms following centralized training with decentralized execution paradigm. However, with continual advancements in SMAC, many algorithms now exhibit near-optimal performance, complicating the evaluation of their true effectiveness. To alleviate this problem, in this work, we highlight a critical issue: the default opponent policy in these environments lacks sufficient diversity, leading MARL algorithms to overfit and exploit unintended vulnerabilities rather than learning robust strategies. To overcome these limitations, we propose SMAC-HARD, a novel benchmark designed to enhance training robustness and evaluation comprehensiveness. SMAC-HARD supports customizable opponent strategies, randomization of adversarial policies, and interfaces for MARL self-play, enabling agents to generalize to varying opponent behaviors and improve model stability. Furthermore, we introduce a black-box testing framework wherein agents are trained without exposure to the edited opponent scripts but are tested against these scripts to evaluate the policy coverage and adaptability of MARL algorithms. We conduct extensive evaluations of widely used and state-of-the-art algorithms on SMAC-HARD, revealing the substantial challenges posed by edited and mixed strategy opponents. Additionally, the black-box strategy tests illustrate the difficulty of transferring learned policies to unseen adversaries. We envision SMAC-HARD as a critical step toward benchmarking the next generation of MARL algorithms, fostering progress in self-play methods for multi-agent systems. Our code is available at https://github.com/devindeng94/smac-hard.

AIR: Unifying Individual and Collective Exploration in Cooperative Multi-Agent Reinforcement Learning

Authors:Guangchong Zhou, Zeren Zhang, Guoliang Fan
Date:2024-12-20 09:18:30

Exploration in cooperative multi-agent reinforcement learning (MARL) remains challenging for value-based agents due to the absence of an explicit policy. Existing approaches include individual exploration based on uncertainty towards the system and collective exploration through behavioral diversity among agents. However, the introduction of additional structures often reduces training efficiency and makes these methods difficult to integrate. In this paper, we propose Adaptive exploration via Identity Recognition (AIR), which consists of two adversarial components: a classifier that recognizes agent identities from their trajectories, and an action selector that adaptively adjusts the mode and degree of exploration. We theoretically prove that AIR can facilitate both individual and collective exploration during training, and experiments also demonstrate the efficiency and effectiveness of AIR across various tasks.

Tacit Learning with Adaptive Information Selection for Cooperative Multi-Agent Reinforcement Learning

Authors:Lunjun Liu, Weilai Jiang, Yaonan Wang
Date:2024-12-20 07:55:59

In multi-agent reinforcement learning (MARL), the centralized training with decentralized execution (CTDE) framework has gained widespread adoption due to its strong performance. However, the further development of CTDE faces two key challenges. First, agents struggle to autonomously assess the relevance of input information for cooperative tasks, impairing their decision-making abilities. Second, in communication-limited scenarios with partial observability, agents are unable to access global information, restricting their ability to collaborate effectively from a global perspective. To address these challenges, we introduce a novel cooperative MARL framework based on information selection and tacit learning. In this framework, agents gradually develop implicit coordination during training, enabling them to infer the cooperative behavior of others in a discrete space without communication, relying solely on local information. Moreover, we integrate gating and selection mechanisms, allowing agents to adaptively filter information based on environmental changes, thereby enhancing their decision-making capabilities. Experiments on popular MARL benchmarks show that our framework can be seamlessly integrated with state-of-the-art algorithms, leading to significant performance improvements.

Understanding Individual Agent Importance in Multi-Agent System via Counterfactual Reasoning

Authors:Jianming Chen, Yawen Wang, Junjie Wang, Xiaofei Xie, jun Hu, Qing Wang, Fanjiang Xu
Date:2024-12-20 07:24:43

Explaining multi-agent systems (MAS) is urgent as these systems become increasingly prevalent in various applications. Previous work has provided explanations for the actions or states of agents, yet falls short in understanding a black-boxed agent's importance within a MAS and the overall team strategy. To bridge this gap, we propose EMAI, a novel agent-level explanation approach that evaluates the individual agent's importance. Inspired by counterfactual reasoning, a larger change in reward caused by the randomized action of an agent indicates its higher importance. We model this as a MARL problem to capture interactions across agents. Utilizing counterfactual reasoning, EMAI learns masking agents to identify the important ones. Specifically, we define the optimization function to minimize the reward difference before and after action randomization and introduce sparsity constraints to encourage the exploration of more action randomization of agents during training. The experimental results in seven multi-agent tasks demonstrate that EMAI achieves higher fidelity in explanations than baselines and provides more effective guidance in practical applications concerning understanding policies, launching attacks, and patching policies.
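In toy form, the underlying counterfactual signal might look like the following, with the environment and policy interface entirely hypothetical; note that EMAI itself learns masking agents rather than brute-forcing rollouts as this sketch does.

```python
import numpy as np

# Toy illustration: an agent's importance is the expected drop in team
# return when its actions are randomized while others follow the policy.

def agent_importance(make_env, policy, agent_id, episodes=20, seed=0):
    rng = np.random.default_rng(seed)

    def rollout(randomize):
        env, total = make_env(), 0.0
        obs, done = env.reset(), False
        while not done:
            actions = policy(obs)
            if randomize:
                actions[agent_id] = rng.integers(env.n_actions)
            obs, reward, done = env.step(actions)
            total += reward
        return total

    base = np.mean([rollout(False) for _ in range(episodes)])
    perturbed = np.mean([rollout(True) for _ in range(episodes)])
    return base - perturbed          # larger drop = more important agent
```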

Novelty-Guided Data Reuse for Efficient and Diversified Multi-Agent Reinforcement Learning

Authors:Yangkun Chen, Kai Yang, Jian Tao, Jiafei Lyu
Date:2024-12-20 03:09:18

Recently, deep Multi-Agent Reinforcement Learning (MARL) has demonstrated its potential to tackle complex cooperative tasks, pushing the boundaries of AI in collaborative environments. However, the efficiency of these systems is often compromised by inadequate sample utilization and a lack of diversity in learning strategies. To enhance MARL performance, we introduce a novel sample reuse approach that dynamically adjusts policy updates based on observation novelty. Specifically, we employ a Random Network Distillation (RND) network to gauge the novelty of each agent's current state, assigning additional sample update opportunities based on the uniqueness of the data. We name our method Multi-Agent Novelty-GuidEd sample Reuse (MANGER). This method increases sample efficiency and promotes exploration and diverse agent behaviors. Our evaluations confirm substantial improvements in MARL effectiveness in complex cooperative scenarios such as Google Research Football and super-hard StarCraft II micromanagement tasks.
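Since the method hinges on RND novelty scores deciding how many update opportunities each sample receives, here is a minimal sketch of that signal; all dimensions and the reuse rule at the end are our own assumptions.

```python
import torch
import torch.nn as nn

# Random Network Distillation: a predictor chases a frozen random target;
# large prediction error marks an observation as novel.

def make_net(in_dim=32, out_dim=64):
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

target = make_net()
for p in target.parameters():            # frozen random target network
    p.requires_grad_(False)
predictor = make_net()
opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def novelty(obs):
    """High prediction error = novel state = extra sample update opportunities."""
    err = ((predictor(obs) - target(obs)) ** 2).mean(dim=-1)
    opt.zero_grad(); err.mean().backward(); opt.step()   # predictor catches up
    return err.detach()

# Assumed reuse rule: n_updates = 1 + (novelty(batch_obs) > threshold).long()
```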

Investigating Relational State Abstraction in Collaborative MARL

Authors:Sharlin Utke, Jeremie Houssineau, Giovanni Montana
Date:2024-12-19 20:34:00

This paper explores the impact of relational state abstraction on sample efficiency and performance in collaborative Multi-Agent Reinforcement Learning. The proposed abstraction is based on spatial relationships in environments where direct communication between agents is not allowed, leveraging the ubiquity of spatial reasoning in real-world multi-agent scenarios. We introduce MARC (Multi-Agent Relational Critic), a simple yet effective critic architecture incorporating spatial relational inductive biases by transforming the state into a spatial graph and processing it through a relational graph neural network. The performance of MARC is evaluated across six collaborative tasks, including a novel environment with heterogeneous agents. We conduct a comprehensive empirical analysis, comparing MARC against state-of-the-art MARL baselines, demonstrating improvements in both sample efficiency and asymptotic performance, as well as its potential for generalization. Our findings suggest that a minimal integration of spatial relational inductive biases as abstraction can yield substantial benefits without requiring complex designs or task-specific engineering. This work provides insights into the potential of relational state abstraction to address sample efficiency, a key challenge in MARL, offering a promising direction for developing more efficient algorithms in spatially complex environments.

Heterogeneous Multi-Agent Reinforcement Learning for Distributed Channel Access in WLANs

Authors:Jiaming Yu, Le Liang, Chongtao Guo, Ziyang Guo, Shi Jin, Geoffrey Ye Li
Date:2024-12-18 13:50:31

This paper investigates the use of multi-agent reinforcement learning (MARL) to address distributed channel access in wireless local area networks. In particular, we consider the challenging yet more practical case where the agents heterogeneously adopt value-based or policy-based reinforcement learning algorithms to train the model. We propose a heterogeneous MARL training framework, named QPMIX, which adopts a centralized training with distributed execution paradigm to enable heterogeneous agents to collaborate. Moreover, we theoretically prove the convergence of the proposed heterogeneous MARL method when using the linear value function approximation. Our method maximizes the network throughput and ensures fairness among stations, therefore, enhancing the overall network performance. Simulation results demonstrate that the proposed QPMIX algorithm improves throughput, mean delay, delay jitter, and collision rates compared with conventional carrier-sense multiple access with collision avoidance in the saturated traffic scenario. Furthermore, the QPMIX is shown to be robust in unsaturated and delay-sensitive traffic scenarios, and promotes cooperation among heterogeneous agents.

A MARL Based Multi-Target Tracking Algorithm Under Jamming Against Radar

Authors:Ziang Wang, Lei Wang, Qi Yi, Yimin Liu
Date:2024-12-17 05:14:16

Unmanned aerial vehicles (UAVs) have played an increasingly important role in military operations and social life. Among all application scenarios, multi-target tracking tasks accomplished by UAV swarms have received extensive attention. However, when UAVs use radar to track targets, the tracking performance can be severely compromised by jammers. To track targets in the presence of jammers, UAVs can use passive radar to localize the jammer. This paper proposes a system where a UAV swarm selects the radar's active or passive work mode to track multiple differently located and potentially jammer-carrying targets. After presenting the optimization problem and proving the difficulty of solving it, we use a multi-agent reinforcement learning algorithm to solve this control problem. We also propose a mechanism based on the simulated annealing algorithm to avoid cases where UAV actions violate constraints. Simulation experiments demonstrate the effectiveness of the proposed algorithm.

Achieving Collective Welfare in Multi-Agent Reinforcement Learning via Suggestion Sharing

Authors:Yue Jin, Shuangqing Wei, Giovanni Montana
Date:2024-12-16 19:44:44

In human society, the conflict between self-interest and collective well-being often obstructs efforts to achieve shared welfare. Related concepts like the Tragedy of the Commons and Social Dilemmas frequently manifest in our daily lives. As artificial agents increasingly serve as autonomous proxies for humans, we propose using multi-agent reinforcement learning (MARL) to address this issue - learning policies to maximise collective returns even when individual agents' interests conflict with the collective one. Traditional MARL solutions involve sharing rewards, values, and policies or designing intrinsic rewards to encourage agents to learn collectively optimal policies. We introduce a novel MARL approach based on Suggestion Sharing (SS), where agents exchange only action suggestions. This method enables effective cooperation without the need to design intrinsic rewards, achieving strong performance while revealing less private information compared to sharing rewards, values, or policies. Our theoretical analysis establishes a bound on the discrepancy between collective and individual objectives, demonstrating how sharing suggestions can align agents' behaviours with the collective objective. Experimental results demonstrate that SS performs competitively with baselines that rely on value or policy sharing or intrinsic rewards.

GTDE: Grouped Training with Decentralized Execution for Multi-agent Actor-Critic

Authors:Mengxian Li, Qi Wang, Yongjun Xu
Date:2024-12-12 03:01:36

The rapid advancement of multi-agent reinforcement learning (MARL) has given rise to diverse training paradigms for learning the policies of each agent in the multi-agent system. The paradigms of decentralized training and execution (DTDE) and centralized training with decentralized execution (CTDE) have been proposed and widely applied. However, as the number of agents increases, the inherent limitations of these frameworks significantly degrade performance metrics such as win rate and total reward. To reduce the influence of the increasing number of agents on these metrics, we propose a novel training paradigm of grouped training with decentralized execution (GTDE). This framework eliminates the need for a centralized module and relies solely on local information, effectively meeting the training requirements of large-scale multi-agent systems. Specifically, we first introduce an adaptive grouping module, which divides agents into different groups based on their observation histories. To implement end-to-end training, GTDE uses Gumbel-Sigmoid for efficient point-to-point sampling on the grouping distribution while ensuring gradient backpropagation, as sketched below. To adapt to the uncertainty in the number of members in a group, two methods are used to implement a group information aggregation module that merges member information within the group. Empirical results show that in a cooperative environment with 495 agents, GTDE increased the total reward by an average of 382% compared to the baseline. In a competitive environment with 64 agents, GTDE achieved a 100% win rate against the baseline.
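A minimal sketch of the Gumbel-Sigmoid trick referenced above, assuming a matrix of group-membership logits; the straight-through estimator keeps assignments binary in the forward pass while letting gradients flow through the relaxation.

```python
import torch

# Differentiable binary sampling (binary concrete / Gumbel-Sigmoid).

def gumbel_sigmoid(logits, tau=1.0, hard=True, eps=1e-9):
    """logits: (n_agents, n_groups) unnormalized membership scores."""
    u = torch.rand_like(logits)
    noise = torch.log(u + eps) - torch.log(1.0 - u + eps)   # logistic noise
    y = torch.sigmoid((logits + noise) / tau)               # relaxed Bernoulli
    if hard:
        # Straight-through: binary memberships forward, soft gradients backward.
        y = (y > 0.5).float() + y - y.detach()
    return y

# memberships = gumbel_sigmoid(group_logits)  # 1 = agent joins that group
```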

Offline Multi-Agent Reinforcement Learning via In-Sample Sequential Policy Optimization

Authors:Zongkai Liu, Qian Lin, Chao Yu, Xiawei Wu, Yile Liang, Donghui Li, Xuetao Ding
Date:2024-12-10 16:19:08

Offline Multi-Agent Reinforcement Learning (MARL) is an emerging field that aims to learn optimal multi-agent policies from pre-collected datasets. Compared to the single-agent case, the multi-agent setting involves a large joint state-action space and coupled behaviors of multiple agents, which bring extra complexity to offline policy optimization. In this work, we revisit the existing offline MARL methods and show that in certain scenarios they can be problematic, leading to uncoordinated behaviors and out-of-distribution (OOD) joint actions. To address these issues, we propose a new offline MARL algorithm, named In-Sample Sequential Policy Optimization (InSPO). InSPO sequentially updates each agent's policy in an in-sample manner, which not only avoids selecting OOD joint actions but also carefully considers teammates' updated policies to enhance coordination. Additionally, by thoroughly exploring low-probability actions in the behavior policy, InSPO can well address the issue of premature convergence to sub-optimal solutions. Theoretically, we prove that InSPO guarantees monotonic policy improvement and converges to a quantal response equilibrium (QRE). Experimental results demonstrate the effectiveness of our method compared to current state-of-the-art offline MARL methods.

Augmenting the action space with conventions to improve multi-agent cooperation in Hanabi

Authors:F. Bredell, H. A. Engelbrecht, J. C. Schoeman
Date:2024-12-09 09:34:40

The card game Hanabi is considered a strong medium for the testing and development of multi-agent reinforcement learning (MARL) algorithms, due to its cooperative nature, hidden information, limited communication and remarkable complexity. Previous research efforts have explored the capabilities of MARL algorithms within Hanabi, focusing largely on advanced architecture design and algorithmic manipulations to achieve state-of-the-art performance for various numbers of cooperators. However, this often leads to complex solution strategies with high computational cost that require large amounts of training data. For humans to solve the Hanabi game effectively, they require the use of conventions, which provide a means to implicitly convey ideas or knowledge based on a predefined, and mutually agreed upon, set of ``rules''. Multi-agent problems containing partial observability, especially when limited communication is present, can benefit greatly from the use of implicit knowledge sharing. In this paper, we propose a novel approach to augmenting the action space using conventions, which act as special cooperative actions that span multiple time steps and multiple agents, requiring agents to actively opt in for them to reach fruition. These conventions are based on existing human conventions, and result in a significant improvement in the performance of existing techniques for self-play and cross-play across various numbers of cooperators within Hanabi.
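
A minimal, hypothetical sketch of action-space augmentation with a convention follows: the extra action index only triggers a multi-step macro behavior when every agent opts in; otherwise play falls back to primitive actions.

```python
# Hypothetical sketch of augmenting a discrete action space with a convention:
# a multi-step macro action that only takes effect if every agent opts in.

class ConventionWrapper:
    def __init__(self, base_actions, convention_steps):
        self.base_actions = base_actions            # e.g. ["play", "discard", "hint"]
        self.convention_steps = convention_steps    # per-agent primitive sequences
        self.CONVENTION = len(base_actions)         # index of the new action
        self.active = False
        self.step_idx = 0

    def resolve(self, joint_action):
        """Map chosen indices to primitive actions, handling the convention."""
        if self.active:  # convention in progress: follow the agreed sequence
            primitives = [seq[self.step_idx] for seq in self.convention_steps]
            self.step_idx += 1
            self.active = self.step_idx < len(self.convention_steps[0])
            return primitives
        if all(a == self.CONVENTION for a in joint_action):
            self.active, self.step_idx = True, 0
            return self.resolve(joint_action)
        # Not all agents opted in: fall back to a default primitive action.
        return [self.base_actions[a] if a != self.CONVENTION else "discard"
                for a in joint_action]

w = ConventionWrapper(["play", "discard", "hint"],
                      [["hint", "play"], ["discard", "play"]])
print(w.resolve([3, 3]))  # both opt in -> first primitive step of the convention
print(w.resolve([0, 0]))  # convention in progress: the agreed sequence overrides choices
```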

Reinforcement Learning for Freeway Lane-Change Regulation via Connected Vehicles

Authors:Ke Sun, Huan Yu
Date:2024-12-05 16:59:31

Lane change decision-making is a complex task due to intricate vehicle-vehicle and vehicle-infrastructure interactions. Existing algorithms for lane-change control often depend on vehicles with a certain level of autonomy (e.g., autonomous or connected autonomous vehicles). To address the challenges posed by low penetration rates of autonomous vehicles and the high costs of precise data collection, this study proposes a dynamic lane change regulation design based on multi-agent reinforcement learning (MARL) to enhance freeway traffic efficiency. The proposed framework leverages multi-lane macroscopic traffic models that describe spatial-temporal dynamics of the density and speed for each lane. Lateral traffic flow between adjacent lanes, resulting from aggregated lane-changing behaviors, is modeled as source terms exchanged between the partial differential equations (PDEs). We propose a lane change regulation strategy using MARL, where one agent is placed at each discretized lane grid. The state of each agent is represented by aggregated vehicle attributes within its grid, generated from the SUMO microscopic simulation environment. The agent's actions are lane-change regulations for vehicles in its grid. Specifically, lane-change regulation signals are computed at a centralized traffic management center and then broadcast to connected vehicles in the corresponding lane grids. Compared to vehicle-level maneuver control, this approach achieves a higher regulation rate by leveraging vehicle connectivity while introducing no critical safety concerns, and accommodating varying levels of connectivity and autonomy within the traffic system. The proposed model is simulated and evaluated in varied traffic scenarios and demand conditions. Experimental results demonstrate that the method improves overall traffic efficiency with minimal additional energy consumption while maintaining driving safety.

HyperMARL: Adaptive Hypernetworks for Multi-Agent RL

Authors:Kale-ab Abebe Tessera, Arrasy Rahman, Stefano V. Albrecht
Date:2024-12-05 15:09:51

Adaptability is critical in cooperative multi-agent reinforcement learning (MARL), where agents must learn specialised or homogeneous behaviours for diverse tasks. While parameter sharing methods are sample-efficient, they often encounter gradient interference among agents, limiting their behavioural diversity. Conversely, non-parameter sharing approaches enable specialisation, but are computationally demanding and sample-inefficient. To address these issues, we propose HyperMARL, a parameter sharing approach that uses hypernetworks to dynamically generate agent-specific actor and critic parameters, without altering the learning objective or requiring preset diversity levels. By decoupling observation- and agent-conditioned gradients, HyperMARL empirically reduces policy gradient variance and facilitates specialisation within fully-shared-parameter (FuPS) training, suggesting it can mitigate cross-agent interference. Across multiple MARL benchmarks involving up to twenty agents -- and requiring homogeneous, heterogeneous, or mixed behaviours -- HyperMARL consistently performs competitively with fully shared, non-parameter-sharing, and diversity-promoting baselines, all while preserving a behavioural diversity level comparable to non-parameter sharing. These findings establish hypernetworks as a versatile approach for MARL across diverse environments.
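
A minimal sketch of the general hypernetwork pattern (not HyperMARL's exact architecture): one shared hypernetwork maps a learned agent embedding to that agent's actor weights, so all agents share parameters yet act with agent-specific policies.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperActor(nn.Module):
    def __init__(self, n_agents, obs_dim, act_dim, embed_dim=16, hidden=64):
        super().__init__()
        self.obs_dim, self.act_dim, self.hidden = obs_dim, act_dim, hidden
        self.agent_embed = nn.Embedding(n_agents, embed_dim)
        # Total parameter count of a two-layer MLP actor.
        n_params = (obs_dim * hidden + hidden) + (hidden * act_dim + act_dim)
        self.hyper = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(),
                                   nn.Linear(128, n_params))

    def forward(self, agent_id, obs):
        p = self.hyper(self.agent_embed(agent_id))  # agent-specific weights
        i = 0
        w1 = p[i:i + self.obs_dim * self.hidden].view(self.hidden, self.obs_dim)
        i += self.obs_dim * self.hidden
        b1 = p[i:i + self.hidden]; i += self.hidden
        w2 = p[i:i + self.hidden * self.act_dim].view(self.act_dim, self.hidden)
        i += self.hidden * self.act_dim
        b2 = p[i:i + self.act_dim]
        h = F.relu(F.linear(obs, w1, b1))
        return F.linear(h, w2, b2)  # action logits for this agent

actor = HyperActor(n_agents=8, obs_dim=10, act_dim=4)
logits = actor(torch.tensor(3), torch.randn(10))
```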

Traffic Co-Simulation Framework Empowered by Infrastructure Camera Sensing and Reinforcement Learning

Authors:Talha Azfar, Ruimin Ke
Date:2024-12-05 07:01:56

Traffic simulations are commonly used to optimize traffic flow, with reinforcement learning (RL) showing promising potential for automated traffic signal control. Multi-agent reinforcement learning (MARL) is particularly effective for learning control strategies for traffic lights in a network using iterative simulations. However, existing methods often assume perfect vehicle detection, which overlooks real-world limitations related to infrastructure availability and sensor reliability. This study proposes a co-simulation framework integrating CARLA and SUMO, which combines high-fidelity 3D modeling with large-scale traffic flow simulation. Cameras mounted on traffic light poles within the CARLA environment use a YOLO-based computer vision system to detect and count vehicles, providing real-time traffic data as input for adaptive signal control in SUMO. MARL agents, trained with four different reward structures, leverage this visual feedback to optimize signal timings and improve network-wide traffic flow. Experiments in the test-bed demonstrate the effectiveness of the proposed MARL approach in enhancing traffic conditions using real-time camera-based detection. The framework also evaluates the robustness of MARL under faulty or sparse sensing and compares the performance of YOLOv5 and YOLOv8 for vehicle detection. Results show that while better accuracy improves performance, MARL agents can still achieve significant improvements with imperfect detection, demonstrating adaptability for real-world scenarios.

Mobile Cell-Free Massive MIMO with Multi-Agent Reinforcement Learning: A Scalable Framework

Authors:Ziheng Liu, Jiayi Zhang, Yiyang Zhu, Enyu Shi, Bo Ai
Date:2024-12-03 17:09:13

Cell-free massive multiple-input multiple-output (mMIMO) offers significant advantages in mobility scenarios, mainly due to the elimination of cell boundaries and strong macro diversity. In this paper, we examine the downlink performance of cell-free mMIMO systems equipped with mobile APs realized as unmanned aerial vehicles, where mobility and power control are jointly considered to effectively enhance coverage and suppress interference. However, the high computational complexity, poor collaboration, limited scalability, and uneven reward distribution of conventional optimization schemes lead to serious performance degradation and instability. These factors complicate the provision of consistent and high-quality service across all user equipments in downlink cell-free mMIMO systems. Consequently, we propose a novel scalable framework enhanced by multi-agent reinforcement learning (MARL) to tackle these challenges. The established framework incorporates a graph neural network (GNN)-aided communication mechanism to facilitate effective collaboration among agents, a permutation architecture to improve scalability, and a directional decoupling architecture to accurately distinguish contributions. In the numerical results, we present comparisons of different optimization schemes and network architectures, which reveal that the proposed scheme can effectively enhance system performance compared to conventional schemes due to the adoption of advanced technologies. In particular, appropriately compressing the observation space of agents is beneficial for achieving a better balance between performance and convergence.

Comparative Analysis of Multi-Agent Reinforcement Learning Policies for Crop Planning Decision Support

Authors:Anubha Mahajan, Shreya Hegde, Ethan Shay, Daniel Wu, Aviva Prins
Date:2024-12-03 00:30:19

In India, the majority of farmers are classified as small or marginal, making their livelihoods particularly vulnerable to economic losses due to market saturation and climate risks. Effective crop planning can significantly impact their expected income, yet existing decision support systems (DSS) often provide generic recommendations that fail to account for real-time market dynamics and the interactions among multiple farmers. In this paper, we evaluate the viability of three multi-agent reinforcement learning (MARL) approaches for optimizing total farmer income and promoting fairness in crop planning: Independent Q-Learning (IQL), where each farmer acts independently without coordination, Agent-by-Agent (ABA), which sequentially optimizes each farmer's policy in relation to the others, and the Multi-agent Rollout Policy, which jointly optimizes all farmers' actions for global reward maximization. Our results demonstrate that while IQL offers computational efficiency with linear runtime, it struggles with coordination among agents, leading to lower total rewards and an unequal distribution of income. Conversely, the Multi-agent Rollout policy achieves the highest total rewards and promotes equitable income distribution among farmers but requires significantly more computational resources, making it less practical for large numbers of agents. ABA strikes a balance between runtime efficiency and reward optimization, offering reasonable total rewards with acceptable fairness and scalability. These findings highlight the importance of selecting appropriate MARL approaches in DSS to provide personalized and equitable crop planning recommendations, advancing the development of more adaptive and farmer-centric agricultural decision-making systems.
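
For reference, a toy tabular Independent Q-Learning (IQL) loop, with a hypothetical market in which a crop's return falls as more farmers plant it, illustrating why uncoordinated learners can saturate the market:

```python
import numpy as np

# Minimal tabular IQL sketch (illustrative; the paper's state, action, and
# reward definitions for crop planning are richer).

n_farmers, n_states, n_crops = 3, 5, 4
Q = np.zeros((n_farmers, n_states, n_crops))   # one Q-table per farmer
alpha, gamma, eps = 0.1, 0.95, 0.1
rng = np.random.default_rng(0)

def step(state, actions):
    # Hypothetical market: a crop's price drops as more farmers plant it.
    counts = np.bincount(actions, minlength=n_crops)
    rewards = np.array([1.0 / counts[a] for a in actions])
    return rng.integers(n_states), rewards

state = 0
for _ in range(10_000):
    actions = np.array([
        rng.integers(n_crops) if rng.random() < eps else int(np.argmax(Q[i, state]))
        for i in range(n_farmers)])
    next_state, rewards = step(state, actions)
    for i in range(n_farmers):          # each farmer updates independently
        td = rewards[i] + gamma * Q[i, next_state].max() - Q[i, state, actions[i]]
        Q[i, state, actions[i]] += alpha * td
    state = next_state
```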

Provable Partially Observable Reinforcement Learning with Privileged Information

Authors:Yang Cai, Xiangyu Liu, Argyris Oikonomou, Kaiqing Zhang
Date:2024-12-01 22:26:27

Partial observability of the underlying states generally presents significant challenges for reinforcement learning (RL). In practice, certain \emph{privileged information}, e.g., the access to states from simulators, has been exploited in training and has achieved prominent empirical successes. To better understand the benefits of privileged information, we revisit and examine several simple and practically used paradigms in this setting. Specifically, we first formalize the empirical paradigm of \emph{expert distillation} (also known as \emph{teacher-student} learning), demonstrating its pitfall in finding near-optimal policies. We then identify a condition of the partially observable environment, the \emph{deterministic filter condition}, under which expert distillation achieves sample and computational complexities that are \emph{both} polynomial. Furthermore, we investigate another useful empirical paradigm of \emph{asymmetric actor-critic}, and focus on the more challenging setting of observable partially observable Markov decision processes. We develop a belief-weighted asymmetric actor-critic algorithm with polynomial sample and quasi-polynomial computational complexities, in which one key component is a new provable oracle for learning belief states that preserve \emph{filter stability} under a misspecified model, which may be of independent interest. Finally, we also investigate the provable efficiency of partially observable multi-agent RL (MARL) with privileged information. We develop algorithms featuring \emph{centralized-training-with-decentralized-execution}, a popular framework in empirical MARL, with polynomial sample and (quasi-)polynomial computational complexities in both paradigms above. Compared with a few recent related theoretical studies, our focus is on understanding practically inspired algorithmic paradigms, without computationally intractable oracles.

Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning

Authors:Emile Anand, Ishani Karmarkar, Guannan Qu
Date:2024-12-01 03:45:17

Designing efficient algorithms for multi-agent reinforcement learning (MARL) is fundamentally challenging because the size of the joint state and action spaces grows exponentially in the number of agents. These difficulties are exacerbated when balancing sequential global decision-making with local agent interactions. In this work, we propose a new algorithm $\texttt{SUBSAMPLE-MFQ}$ ($\textbf{Subsample}$-$\textbf{M}$ean-$\textbf{F}$ield-$\textbf{Q}$-learning) and a decentralized randomized policy for a system with $n$ agents. For $k\leq n$, our algorithm learns a policy for the system in time polynomial in $k$. We show that this learned policy converges to the optimal policy on the order of $\tilde{O}(1/\sqrt{k})$ as the number of subsampled agents $k$ increases. We empirically validate our method in Gaussian squeeze and global exploration settings.
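
A toy sketch of the subsampling idea (not the paper's SUBSAMPLE-MFQ algorithm): a tagged agent's Q-function conditions on its own state plus the empirical state distribution of only $k$ randomly chosen other agents, so the input size depends on $k$ rather than $n$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_agents, n_states, k = 100, 4, 8

def mean_field_feature(other_states, k):
    # Subsample k of the other agents and summarize them by the empirical
    # distribution of their states (the "mean-field" input to Q).
    sub = rng.choice(other_states, size=k, replace=False)
    return np.bincount(sub, minlength=n_states) / k

states = rng.integers(n_states, size=n_agents)
feature = mean_field_feature(states[1:], k)  # agent 0's view of the others
print(feature)  # a length-4 distribution, e.g. input to Q(s_0, feature, a)
```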

Towards Fault Tolerance in Multi-Agent Reinforcement Learning

Authors:Yuchen Shi, Huaxin Pei, Liang Feng, Yi Zhang, Danya Yao
Date:2024-11-30 16:56:29

Agent faults pose a significant threat to the performance of multi-agent reinforcement learning (MARL) algorithms, introducing two key challenges. First, agents often struggle to extract critical information from the chaotic state space created by unexpected faults. Second, transitions recorded before and after faults in the replay buffer affect training unevenly, leading to a sample imbalance problem. To overcome these challenges, this paper enhances the fault tolerance of MARL by combining optimized model architecture with a tailored training data sampling strategy. Specifically, an attention mechanism is incorporated into the actor and critic networks to automatically detect faults and dynamically regulate the attention given to faulty agents. Additionally, a prioritization mechanism is introduced to selectively sample transitions critical to current training needs. To further support research in this area, we design and open-source a highly decoupled code platform for fault-tolerant MARL, aimed at improving the efficiency of studying related problems. Experimental results demonstrate the effectiveness of our method in handling various types of faults, faults occurring in any agent, and faults arising at random times.
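
The prioritization rule below is a hypothetical stand-in, but it illustrates the mechanism: transitions near a fault onset receive higher sampling probability, with importance weights correcting the induced bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n_transitions, fault_time = 10_000, 7_000
t = np.arange(n_transitions)

# Hypothetical priority: decay with distance to the fault, floored at 0.05
# so nominal transitions are still occasionally replayed.
priority = np.maximum(np.exp(-np.abs(t - fault_time) / 500.0), 0.05)
probs = priority / priority.sum()

batch = rng.choice(n_transitions, size=256, p=probs)
# Importance weights correct the bias that non-uniform sampling introduces.
weights = (1.0 / (n_transitions * probs[batch])) ** 0.4
weights /= weights.max()
```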

Assessing How Ride-hailing Rebalancing Strategies Improve the Resilience of Multi-modal Transportation Systems

Authors:Euntak Lee, Rim Slama, Ludovic Leclercq
Date:2024-11-29 23:05:24

The global ride-hailing (RH) industry plays an essential role in multi-modal transportation systems by improving user mobility, particularly as first- and last-mile solutions. However, the flexibility of on-demand mobility services can lead to local supply-demand imbalances. While many RH rebalancing studies focus on nominal scenarios with regular demand patterns, it is crucial to consider disruptions - such as train line interruptions - that negatively impact operational efficiency, resulting in longer travel times, higher costs, increased transfers, and service delays. This study examines how RH rebalancing strategies can strengthen the resilience of multi-modal transportation systems against such disruptions. We incorporate RH services into systems where users choose and switch transportation modes based on their preferences, accounting for uncertainties in demand predictions that reflect discrepancies between forecasts and actual conditions. To address the stochastic supply-demand dynamics in large-scale networks, we propose a multi-agent reinforcement learning (MARL) strategy, specifically utilizing a multi-agent deep deterministic policy gradient (MADDPG) approach. The proposed framework is particularly well-suited for this problem due to its ability to handle continuous action spaces, which are prevalent in real-world transportation systems, and its capacity to enable effective coordination among multiple agents operating in dynamic and decentralized environments. Through a 900 km$^2$ multi-modal traffic simulation, we evaluate the proposed model's performance against four existing RH rebalancing strategies, focusing on its ability to enhance system resilience. The results demonstrate significant improvements in key performance indicators, including user waiting time, resilience metrics, total travel time, and travel distance.

RMIO: A Model-Based MARL Framework for Scenarios with Observation Loss in Some Agents

Authors:Zifeng Shi, Meiqin Liu, Senlin Zhang, Ronghao Zheng, Shanling Dong
Date:2024-11-29 11:45:21

In recent years, model-based reinforcement learning (MBRL) has emerged as a solution to address sample complexity in multi-agent reinforcement learning (MARL) by modeling agent-environment dynamics to improve sample efficiency. However, most MBRL methods assume complete and continuous observations from each agent during the inference stage, which can be overly idealistic in practical applications. A novel model-based MARL approach called RMIO is introduced to address this limitation, specifically designed for scenarios where observations are lost in some agents. RMIO leverages the world model to reconstruct missing observations, and further reduces reconstruction errors through inter-agent information integration to ensure stable multi-agent decision-making. In addition, unlike CTCE methods such as MAMBA, RMIO adopts the CTDE paradigm in standard environments, enabling limited communication only when agents lack observation data and thereby reducing reliance on communication. RMIO also improves asymptotic performance through strategies such as reward smoothing, a dual-layer experience replay buffer, and an RNN-augmented policy model, surpassing previous work. Our experiments conducted in both the SMAC and MaMuJoCo environments demonstrate that RMIO outperforms current state-of-the-art approaches in terms of asymptotic convergence performance and policy robustness, both in standard mission settings and in scenarios involving observation loss.

A Local Information Aggregation based Multi-Agent Reinforcement Learning for Robot Swarm Dynamic Task Allocation

Authors:Yang Lv, Jinlong Lei, Peng Yi
Date:2024-11-29 07:53:05

In this paper, we explore how to optimize task allocation for robot swarms in dynamic environments, emphasizing the necessity of formulating robust, flexible, and scalable strategies for robot cooperation. We introduce a novel framework using a decentralized partially observable Markov decision process (Dec-POMDP), specifically designed for distributed robot swarm networks. At the core of our methodology is the Local Information Aggregation Multi-Agent Deep Deterministic Policy Gradient (LIA_MADDPG) algorithm, which combines centralized training with decentralized execution (CTDE). During the centralized training phase, a local information aggregation (LIA) module is meticulously designed to gather critical data from neighboring robots, enhancing decision-making efficiency. In the distributed execution phase, a strategy improvement method is proposed to dynamically adjust task allocation based on changing and partially observable environmental conditions. Our empirical evaluations show that the LIA module can be seamlessly integrated into various CTDE-based MARL methods, significantly enhancing their performance. Additionally, by comparing LIA_MADDPG with six conventional reinforcement learning algorithms and a heuristic algorithm, we demonstrate its superior scalability, rapid adaptation to environmental changes, and ability to maintain both stability and convergence speed. These results underscore LIA_MADDPG's outstanding performance and its potential to significantly improve dynamic task allocation in robot swarms through enhanced local collaboration and adaptive strategy execution.

Integrating Transit Signal Priority into Multi-Agent Reinforcement Learning based Traffic Signal Control

Authors:Dickness Kakitahi Kwesiga, Suyash Chandra Vishnoi, Angshuman Guin, Michael Hunter
Date:2024-11-28 20:09:12

This study integrates Transit Signal Priority (TSP) into multi-agent reinforcement learning (MARL) based traffic signal control. The first part of the study develops adaptive signal control based on MARL for a pair of coordinated intersections in a microscopic simulation environment. The two agents, one for each intersection, are centrally trained using a value decomposition network (VDN) architecture. The trained agents show slightly better performance compared to coordinated actuated signal control based on overall intersection delay at a volume-to-capacity ratio (v/c) of 0.95. In the second part of the study, the trained signal control agents are used as background signal controllers while developing event-based TSP agents. In one variation, independent TSP agents are formulated and trained under a decentralized training and decentralized execution (DTDE) framework to implement TSP at each intersection. In the second variation, the two TSP agents are centrally trained under a centralized training and decentralized execution (CTDE) framework and VDN architecture to select and implement coordinated TSP strategies across the two intersections. In both cases, the agents converge to the same bus delay value, but the independent agents show high instability throughout the training process. For the test runs, the two independent agents reduce bus delay across the two intersections by 22% compared to the no-TSP case, while the coordinated TSP agents achieve a 27% delay reduction. In both cases, there is only a slight increase in delay for a majority of the side street movements.
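
For reference, a minimal VDN sketch, in which the team value is the sum of the chosen per-agent Q-values and a single TD loss trains all agents:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgentQ(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))
    def forward(self, obs):
        return self.net(obs)

n_agents, obs_dim, n_actions, batch = 2, 8, 5, 32
agents = nn.ModuleList([AgentQ(obs_dim, n_actions) for _ in range(n_agents)])

# Dummy batch standing in for replay-buffer samples.
obs = torch.randn(n_agents, batch, obs_dim)
next_obs = torch.randn(n_agents, batch, obs_dim)
actions = torch.randint(n_actions, (n_agents, batch))
reward = torch.randn(batch)          # shared team reward
gamma = 0.99

q_taken = torch.stack([agents[i](obs[i]).gather(1, actions[i, :, None]).squeeze(1)
                       for i in range(n_agents)])
q_tot = q_taken.sum(0)               # VDN: additive value decomposition
with torch.no_grad():
    next_max = torch.stack([agents[i](next_obs[i]).max(1).values
                            for i in range(n_agents)]).sum(0)
    target = reward + gamma * next_max
loss = F.mse_loss(q_tot, target)     # one TD loss trains every agent network
loss.backward()
```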

Safe Multi-Agent Reinforcement Learning with Convergence to Generalized Nash Equilibrium

Authors:Zeyang Li, Navid Azizan
Date:2024-11-22 16:08:42

Multi-agent reinforcement learning (MARL) has achieved notable success in cooperative tasks, demonstrating impressive performance and scalability. However, deploying MARL agents in real-world applications presents critical safety challenges. Current safe MARL algorithms are largely based on the constrained Markov decision process (CMDP) framework, which enforces constraints only on discounted cumulative costs and lacks an all-time safety assurance. Moreover, these methods often overlook the feasibility issue (the system will inevitably violate state constraints within certain regions of the constraint set), resulting in either suboptimal performance or increased constraint violations. To address these challenges, we propose a novel theoretical framework for safe MARL with $\textit{state-wise}$ constraints, where safety requirements are enforced at every state the agents visit. To resolve the feasibility issue, we leverage a control-theoretic notion of the feasible region, the controlled invariant set (CIS), characterized by the safety value function. We develop a multi-agent method for identifying CISs, ensuring convergence to a Nash equilibrium on the safety value function. By incorporating CIS identification into the learning process, we introduce a multi-agent dual policy iteration algorithm that guarantees convergence to a generalized Nash equilibrium in state-wise constrained cooperative Markov games, achieving an optimal balance between feasibility and performance. Furthermore, for practical deployment in complex high-dimensional systems, we propose $\textit{Multi-Agent Dual Actor-Critic}$ (MADAC), a safe MARL algorithm that approximates the proposed iteration scheme within the deep RL paradigm. Empirical evaluations on safe MARL benchmarks demonstrate that MADAC consistently outperforms existing methods, delivering much higher rewards while reducing constraint violations.

Explainable Multi-Agent Reinforcement Learning for Extended Reality Codec Adaptation

Authors:Pedro Enrique Iturria-Rivera, Raimundas Gaigalas, Medhat Elsayed, Majid Bavand, Yigit Ozcan, Melike Erol-Kantarci
Date:2024-11-21 16:20:31

Extended Reality (XR) services are set to transform applications over 5th and 6th generation wireless networks, delivering immersive experiences. Concurrently, Artificial Intelligence (AI) advancements have expanded their role in wireless networks, however, trust and transparency in AI remain to be strengthened. Thus, providing explanations for AI-enabled systems can enhance trust. We introduce Value Function Factorization (VFF)-based Explainable (X) Multi-Agent Reinforcement Learning (MARL) algorithms, explaining reward design in XR codec adaptation through reward decomposition. We contribute four enhancements to XMARL algorithms. Firstly, we detail architectural modifications to enable reward decomposition in VFF-based MARL algorithms: Value Decomposition Networks (VDN), Mixture of Q-Values (QMIX), and Q-Transformation (Q-TRAN). Secondly, inspired by multi-task learning, we reduce the overhead of vanilla XMARL algorithms. Thirdly, we propose a new explainability metric, Reward Difference Fluctuation Explanation (RDFX), suitable for problems with adjustable parameters. Lastly, we propose adaptive XMARL, leveraging network gradients and reward decomposition for improved action selection. Simulation results indicate that, in XR codec adaptation, the Packet Delivery Ratio reward is the primary contributor to optimal performance compared to the initial composite reward, which included delay and Data Rate Ratio components. Modifications to VFF-based XMARL algorithms, incorporating multi-headed structures and adaptive loss functions, enable the best-performing algorithm, Multi-Headed Adaptive (MHA)-QMIX, to achieve significant average gains over the Adjust Packet Size baseline of up to 10.7%, 41.4%, 33.3%, and 67.9% in XR index, jitter, delay, and Packet Loss Ratio (PLR), respectively.
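
A minimal sketch of reward-decomposition explainability (illustrative, not the paper's exact architecture): the critic keeps one Q-head per reward component, their sum drives control, and the per-component values reveal which reward term dominates a decision.

```python
import torch
import torch.nn as nn

components = ["packet_delivery_ratio", "delay", "data_rate_ratio"]

class DecomposedQ(nn.Module):
    def __init__(self, obs_dim, n_actions, n_components):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                          nn.Linear(64, n_actions))
            for _ in range(n_components)])

    def forward(self, obs):
        per_component = torch.stack([h(obs) for h in self.heads])  # (C, A)
        return per_component, per_component.sum(0)                 # Q = sum_c Q_c

q = DecomposedQ(obs_dim=12, n_actions=4, n_components=len(components))
per_c, total = q(torch.randn(12))
action = int(total.argmax())
for name, qc in zip(components, per_c[:, action]):
    print(f"{name}: {qc.item():.3f}")   # contribution of each reward term
```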

Cooperative Grasping and Transportation using Multi-agent Reinforcement Learning with Ternary Force Representation

Authors:Ing-Sheng Bernard-Tiong, Yoshihisa Tsurumine, Ryosuke Sota, Kazuki Shibata, Takamitsu Matsubara
Date:2024-11-21 08:52:49

Cooperative grasping and transportation require effective coordination to complete the task. This study focuses on an approach leveraging force-sensing feedback, where robots use sensors to detect forces applied by others on an object to achieve coordination. Unlike explicit communication, this avoids delays and interruptions; however, force-sensing is highly sensitive and prone to interference from variations in the grasping environment, such as changes in grasping force, grasping pose, and object size and geometry, which can corrupt force signals and subsequently undermine coordination. We propose multi-agent reinforcement learning (MARL) with a ternary force representation, a force representation that remains consistent under variations in the grasping environment. Simulation and real-world experiments demonstrate the robustness of the proposed method to changes in grasping force, object size and geometry, as well as the inherent sim2real gap.
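
A ternary force representation can be as simple as quantizing readings with a deadband; the threshold below is a hypothetical tuning parameter:

```python
import numpy as np

def ternarize(force, threshold=0.5):
    # Quantize continuous readings to push (+1), neutral (0), or pull (-1);
    # the sign pattern stays consistent when force magnitudes, poses, or
    # object geometry vary.
    return np.sign(force) * (np.abs(force) > threshold)

readings = np.array([2.3, 0.1, -0.8, -0.2, 4.0])   # raw force sensor values
print(ternarize(readings))                          # [ 1.  0. -1. -0.  1.]
```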

Incentives to Build Houses, Trade Houses, or Trade House Building Skills in Simulated Worlds under Various Governing Systems or Institutions: Comparing Multi-agent Reinforcement Learning to Generative Agent-based Model

Authors:Aslan S. Dizaji
Date:2024-11-21 08:52:42

It has been shown that social institutions impact human motivations to produce different behaviours, such as the amount of work performed or specialisation in labor. With advancements in artificial intelligence (AI), specifically large language models (LLMs), it is now possible to perform in-silico simulations to test various hypotheses around this topic. Here, I simulate two somewhat similar worlds using the multi-agent reinforcement learning (MARL) framework of the AI-Economist and the generative agent-based model (GABM) framework of Concordia. In the extended versions of the AI-Economist and Concordia, the agents are able to build houses, trade houses, and trade house building skill. Moreover, along the individualistic-collectivist axis, there is a set of three governing systems: Full-Libertarian, Semi-Libertarian/Utilitarian, and Full-Utilitarian. Additionally, in the extended AI-Economist, the Semi-Libertarian/Utilitarian system is further divided into a set of three governing institutions along the discriminative axis: Inclusive, Arbitrary, and Extractive. Building on these, I show that among the governing systems and institutions of the extended AI-Economist, under the Semi-Libertarian/Utilitarian system and Inclusive institution, the ratios of building houses to trading houses and trading house building skill are higher than the rest. Furthermore, I show that in the extended Concordia, when the central government cares about equality in the society, the Full-Utilitarian system generates agents that build more houses and trade more house building skill. In contrast, these economic activities are higher under the Full-Libertarian system when the central government cares about productivity in the society. Overall, the focus of this paper is to compare and contrast two advanced AI techniques, MARL and GABM, in simulating a similar social phenomenon, with limitations.

Learning to Cooperate with Humans using Generative Agents

Authors:Yancheng Liang, Daphne Chen, Abhishek Gupta, Simon S. Du, Natasha Jaques
Date:2024-11-21 08:36:17

Training agents that can coordinate zero-shot with humans is a key mission in multi-agent reinforcement learning (MARL). Current algorithms focus on training simulated human partner policies which are then used to train a Cooperator agent. The simulated human is produced either through behavior cloning over a dataset of human cooperation behavior, or by using MARL to create a population of simulated agents. However, these approaches often struggle to produce a Cooperator that can coordinate well with real humans, since the simulated humans fail to cover the diverse strategies and styles employed by people in the real world. We show \emph{learning a generative model of human partners} can effectively address this issue. Our model learns a latent variable representation of the human that can be regarded as encoding the human's unique strategy, intention, experience, or style. This generative model can be flexibly trained from any (human or neural policy) agent interaction data. By sampling from the latent space, we can use the generative model to produce different partners to train Cooperator agents. We evaluate our method -- \textbf{G}enerative \textbf{A}gent \textbf{M}odeling for \textbf{M}ulti-agent \textbf{A}daptation (GAMMA) -- on Overcooked, a challenging cooperative cooking game that has become a standard benchmark for zero-shot coordination. We conduct an evaluation with real human teammates, and the results show that GAMMA consistently improves performance, whether the generative model is trained on simulated populations or human datasets. Further, we propose a method for posterior sampling from the generative model that is biased towards the human data, enabling us to efficiently improve performance with only a small amount of expensive human interaction data.

Adversarial Multi-Agent Reinforcement Learning for Proactive False Data Injection Detection

Authors:Kejun Chen, Truc Nguyen, Malik Hassanaly
Date:2024-11-19 00:06:46

Smart inverters are instrumental in the integration of renewable and distributed energy resources (DERs) into the electric grid. Such inverters rely on communication layers for continuous control and monitoring, potentially exposing them to cyber-physical attacks such as false data injection attacks (FDIAs). We propose to construct a defense strategy against a priori unknown FDIAs with a multi-agent reinforcement learning (MARL) framework. The first agent is an adversary that simulates and discovers various FDIA strategies, while the second agent is a defender in charge of detecting and localizing FDIAs. This approach enables the defender to be trained against new FDIAs continuously generated by the adversary. The numerical results demonstrate that the proposed MARL defender outperforms a supervised offline defender. Additionally, we show that the detection skills of an MARL defender can be combined with those of an offline defender through a transfer learning approach.

Joint Precoding and AP Selection for Energy Efficient RIS-aided Cell-Free Massive MIMO Using Multi-agent Reinforcement Learning

Authors:Enyu Shi, Jiayi Zhang, Ziheng Liu, Yiyang Zhu, Chau Yuen, Derrick Wing Kwan Ng, Marco Di Renzo, Bo Ai
Date:2024-11-17 13:19:25

Cell-free (CF) massive multiple-input multiple-output (mMIMO) and reconfigurable intelligent surface (RIS) are two advanced transceiver technologies for realizing future sixth-generation (6G) networks. In this paper, we investigate joint precoding and access point (AP) selection for an energy-efficient RIS-aided CF mMIMO system. To address the associated computational complexity and communication power consumption, we advocate for user-centric dynamic networks in which each user is served by a subset of APs rather than by all of them. Based on the user-centric network, we formulate a joint precoding and AP selection problem to maximize the energy efficiency (EE) of the considered system. To solve this complex nonconvex problem, we propose an innovative double-layer multi-agent reinforcement learning (MARL)-based scheme. Moreover, we propose an adaptive power threshold-based AP selection scheme to further enhance the EE of the considered system. To reduce the computational complexity of the RIS-aided CF mMIMO system, we introduce a fuzzy logic (FL) strategy into the MARL scheme to accelerate convergence. The simulation results show that the proposed FL-based MARL cooperative architecture effectively improves EE performance, offering an 85\% enhancement over the zero-forcing (ZF) method, and achieves faster convergence than standard MARL. It is important to note that increasing the transmission power of the APs or the number of RIS elements can effectively enhance the spectral efficiency (SE) performance, which also leads to an increase in power consumption, resulting in a non-trivial trade-off between the quality of service and EE performance.

Enforcing Cooperative Safety for Reinforcement Learning-based Mixed-Autonomy Platoon Control

Authors:Jingyuan Zhou, Longhao Yan, Jinhao Liang, Kaidi Yang
Date:2024-11-15 08:19:07

It is recognized that the control of mixed-autonomy platoons comprising connected and automated vehicles (CAVs) and human-driven vehicles (HDVs) can enhance traffic flow. Among existing methods, Multi-Agent Reinforcement Learning (MARL) appears to be a promising control strategy because it can manage complex scenarios in real time. However, current research on MARL-based mixed-autonomy platoon control suffers from several limitations. First, existing MARL approaches address safety by penalizing safety violations in the reward function, thus lacking theoretical safety guarantees due to the black-box nature of RL. Second, few studies have explored the cooperative safety of multi-CAV platoons, where CAVs can be coordinated to further enhance the system-level safety involving the safety of both CAVs and HDVs. Third, existing work tends to make the unrealistic assumption that the behavior of HDVs and CAVs is publicly known and rational. To bridge these research gaps, we propose a safe MARL framework for mixed-autonomy platoons. Specifically, this framework (i) characterizes cooperative safety by designing a cooperative Control Barrier Function (CBF), enabling CAVs to collaboratively improve the safety of the entire platoon, (ii) provides a safety guarantee to the MARL-based controller by integrating the CBF-based safety constraints into MARL through a differentiable quadratic programming (QP) layer, and (iii) incorporates a conformal prediction module that enables each CAV to estimate the unknown behaviors of the surrounding vehicles with uncertainty quantification. Simulation results show that our proposed control strategy can effectively enhance system-level safety through CAV cooperation in a mixed-autonomy platoon, with minimal impact on control performance.
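
As a rough illustration of a CBF-based safety layer (a toy single-vehicle model with illustrative parameters, not the paper's differentiable QP layer), the filter below uses cvxpy to return the control closest to the RL action that still satisfies the barrier condition:

```python
import cvxpy as cp

# Toy CBF safety filter. Barrier h = gap - d_min must satisfy
# h_dot + alpha * h >= 0, where the commanded speed adjustment u enters the
# (simplified, illustrative) gap dynamics as h_dot = rel_speed - u.

def safety_filter(u_rl, gap, rel_speed, d_min=5.0, alpha=1.0, u_max=3.0):
    u = cp.Variable()
    h = gap - d_min
    constraints = [rel_speed - u + alpha * h >= 0,   # CBF condition
                   cp.abs(u) <= u_max]               # actuation limit
    prob = cp.Problem(cp.Minimize(cp.square(u - u_rl)), constraints)
    prob.solve()
    return float(u.value)

# The RL policy wants to speed up by 2.5, but the filter caps the command at 0
# because the vehicle is already closing in on the leader.
print(safety_filter(u_rl=2.5, gap=6.0, rel_speed=-1.0))
```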

InvestESG: A multi-agent reinforcement learning benchmark for studying climate investment as a social dilemma

Authors:Xiaoxuan Hou, Jiayi Yuan, Joel Z. Leibo, Natasha Jaques
Date:2024-11-15 00:31:45

InvestESG is a novel multi-agent reinforcement learning (MARL) benchmark designed to study the impact of Environmental, Social, and Governance (ESG) disclosure mandates on corporate climate investments. The benchmark models an intertemporal social dilemma where companies balance short-term profit losses from climate mitigation efforts and long-term benefits from reducing climate risk, while ESG-conscious investors attempt to influence corporate behavior through their investment decisions. Companies allocate capital across mitigation, greenwashing, and resilience, with varying strategies influencing climate outcomes and investor preferences. We are releasing open-source versions of InvestESG in both PyTorch and JAX, which enable scalable and hardware-accelerated simulations for investigating competing incentives in mitigating climate change. Our experiments show that without ESG-conscious investors with sufficient capital, corporate mitigation efforts remain limited under the disclosure mandate. However, when a critical mass of investors prioritizes ESG, corporate cooperation increases, which in turn reduces climate risks and enhances long-term financial stability. Additionally, providing more information about global climate risks encourages companies to invest more in mitigation, even without investor involvement. Our findings align with empirical research using real-world data, highlighting MARL's potential to inform policy by providing insights into large-scale socio-economic challenges through efficient testing of alternative policy and market designs.

DNN Task Assignment in UAV Networks: A Generative AI Enhanced Multi-Agent Reinforcement Learning Approach

Authors:Xin Tang, Qian Chen, Wenjie Weng, Binhan Liao, Jiacheng Wang, Xianbin Cao, Xiaohuan Li
Date:2024-11-13 02:41:02

Unmanned Aerial Vehicles (UAVs) possess high mobility and flexible deployment capabilities, prompting the development of UAVs for various application scenarios within the Internet of Things (IoT). The unique capabilities of UAVs give rise to increasingly critical and complex tasks in uncertain and potentially harsh environments. The substantial amount of data generated from these applications necessitates processing and analysis through deep neural networks (DNNs). However, UAVs encounter challenges due to their limited computing resources when managing DNN models. This paper presents a joint approach that combines multi-agent reinforcement learning (MARL) and generative diffusion models (GDM) for assigning DNN tasks to a UAV swarm, aimed at reducing latency from task capture to result output. To address these challenges, we first consider the task size of the target area to be inspected and the shortest flying path as optimization constraints, employing a greedy algorithm to resolve the subproblem with a focus on minimizing the UAV's flying path and the overall system cost. In the second stage, we introduce a novel DNN task assignment algorithm, termed GDM-MADDPG, which utilizes the reverse denoising process of GDM to replace the actor network in multi-agent deep deterministic policy gradient (MADDPG). This approach generates specific DNN task assignment actions based on agents' observations in a dynamic environment. Simulation results indicate that our algorithm performs favorably compared to benchmarks in terms of path planning, Age of Information (AoI), energy consumption, and task load balancing.

Exploring Multi-Agent Reinforcement Learning for Unrelated Parallel Machine Scheduling

Authors:Maria Zampella, Urtzi Otamendi, Xabier Belaunzaran, Arkaitz Artetxe, Igor G. Olaizola, Giuseppe Longo, Basilio Sierra
Date:2024-11-12 08:27:27

Scheduling problems pose significant challenges in resource, industry, and operational management. This paper addresses the Unrelated Parallel Machine Scheduling Problem (UPMS) with setup times and resources using a Multi-Agent Reinforcement Learning (MARL) approach. The study introduces the Reinforcement Learning environment and conducts empirical analyses, comparing MARL with Single-Agent algorithms. The experiments employ various deep neural network policies for single- and Multi-Agent approaches. Results demonstrate the efficacy of the Maskable extension of the Proximal Policy Optimization (PPO) algorithm in Single-Agent scenarios and the Multi-Agent PPO algorithm in Multi-Agent setups. While Single-Agent algorithms perform adequately in reduced scenarios, Multi-Agent approaches reveal challenges in cooperative learning but offer scalable capacity. This research contributes insights into applying MARL techniques to scheduling optimization, emphasizing the need for algorithmic sophistication balanced with scalability for intelligent scheduling solutions.

A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs

Authors:Myeongsoo Kim, Tyler Stennett, Saurabh Sinha, Alessandro Orso
Date:2024-11-11 16:20:27

As modern web services increasingly rely on REST APIs, their thorough testing has become crucial. Furthermore, the advent of REST API documentation languages, such as the OpenAPI Specification, has led to the emergence of many black-box REST API testing tools. However, these tools often focus on individual test elements in isolation (e.g., APIs, parameters, values), resulting in lower coverage and less effectiveness in fault detection. To address these limitations, we present AutoRestTest, the first black-box tool to adopt a dependency-embedded multi-agent approach for REST API testing that integrates multi-agent reinforcement learning (MARL) with a semantic property dependency graph (SPDG) and Large Language Models (LLMs). Our approach treats REST API testing as a separable problem, where four agents -- API, dependency, parameter, and value agents -- collaborate to optimize API exploration. LLMs handle domain-specific value generation, the SPDG model simplifies the search space for dependencies using a similarity score between API operations, and MARL dynamically optimizes the agents' behavior. Our evaluation of AutoRestTest on 12 real-world REST services shows that it outperforms the four leading black-box REST API testing tools, including those assisted by RESTGPT (which generates realistic test inputs using LLMs), in terms of code coverage, operation coverage, and fault detection. Notably, AutoRestTest is the only tool able to trigger an internal server error in the Spotify service. Our ablation study illustrates that each component of AutoRestTest -- the SPDG, the LLM, and the agent-learning mechanism -- contributes to its overall effectiveness.

OffLight: An Offline Multi-Agent Reinforcement Learning Framework for Traffic Signal Control

Authors:Rohit Bokade, Xiaoning Jin
Date:2024-11-10 21:26:17

Efficient traffic signal control (TSC) is essential for urban mobility, but traditional systems struggle to handle the complexity of real-world traffic. Multi-agent Reinforcement Learning (MARL) offers adaptive solutions, but online MARL requires extensive interactions with the environment, making it costly and impractical. Offline MARL mitigates these challenges by using historical traffic data for training but faces significant difficulties with heterogeneous behavior policies in real-world datasets, where mixed-quality data complicates learning. We introduce OffLight, a novel offline MARL framework designed to handle heterogeneous behavior policies in TSC datasets. To improve learning efficiency, OffLight incorporates Importance Sampling (IS) to correct for distributional shifts and Return-Based Prioritized Sampling (RBPS) to focus on high-quality experiences. OffLight utilizes a Gaussian Mixture Variational Graph Autoencoder (GMM-VGAE) to capture the diverse distribution of behavior policies from local observations. Extensive experiments across real-world urban traffic scenarios show that OffLight outperforms existing offline RL methods, achieving up to a 7.8% reduction in average travel time and 11.2% decrease in queue length. Ablation studies confirm the effectiveness of OffLight's components in handling heterogeneous data and improving policy performance. These results highlight OffLight's scalability and potential to improve urban traffic management without the risks of online learning.
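
For reference, the basic (clipped) importance-sampling correction that such offline methods build on, reweighting logged transitions by the target-to-behavior probability ratio:

```python
import torch

def is_weights(target_logp, behavior_logp, clip=10.0):
    # Ratio of target-policy to behavior-policy action probabilities;
    # clipping controls the variance of the resulting estimator.
    ratio = torch.exp(target_logp - behavior_logp)
    return torch.clamp(ratio, max=clip)

target_logp = torch.log(torch.tensor([0.50, 0.10, 0.30]))
behavior_logp = torch.log(torch.tensor([0.25, 0.40, 0.30]))
w = is_weights(target_logp, behavior_logp)
print(w)  # tensor([2.00, 0.25, 1.00]) -- up/down-weights each transition
```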

Think Smart, Act SMARL! Analyzing Probabilistic Logic Driven Safety in Multi-Agent Reinforcement Learning

Authors:Satchit Chatterji, Erman Acar
Date:2024-11-07 16:59:32

An important challenge for enabling the deployment of reinforcement learning (RL) algorithms in the real world is safety. This has resulted in the recent research field of Safe RL, which aims to learn optimal policies that are safe. One successful approach in that direction is probabilistic logic shields (PLS), a model-based Safe RL technique that uses formal specifications based on probabilistic logic programming, constraining an agent's policy to comply with those specifications in a probabilistic sense. However, safety is inherently a multi-agent concept, since real-world environments often involve multiple agents interacting simultaneously, leading to a complex system which is hard to control. Moreover, safe multi-agent RL (Safe MARL) is still underexplored. In order to address this gap, in this paper we ($i$) introduce Shielded MARL (SMARL) by extending PLS to MARL -- in particular, we introduce Probabilistic Logic Temporal Difference Learning (PLTD) to enable shielded independent Q-learning (SIQL), and introduce shielded independent PPO (SIPPO) using probabilistic logic policy gradients; ($ii$) show its positive effect and use as an equilibrium selection mechanism in various game-theoretic environments including two-player simultaneous games, extensive-form games, stochastic games, and some grid-world extensions in terms of safety, cooperation, and alignment with normative behaviors; and ($iii$) look into the asymmetric case where only one agent is shielded, and show that the shielded agent has a significant influence on the unshielded one, providing further evidence of SMARL's ability to enhance safety and cooperation in diverse multi-agent environments.

Semantic-Aware Resource Management for C-V2X Platooning via Multi-Agent Reinforcement Learning

Authors:Zhiyu Shao, Qiong Wu, Pingyi Fan, Kezhi Wang, Qiang Fan, Wen Chen, Khaled B. Letaief
Date:2024-11-07 12:55:35

This paper presents semantic-aware multi-modal resource allocation (SAMRA) for multi-task scenarios using multi-agent reinforcement learning (MARL), termed SAMRAMARL, applied to platoon systems where cellular vehicle-to-everything (C-V2X) communication is employed. The proposed approach leverages semantic information to optimize the allocation of communication resources. By integrating a distributed MARL algorithm, SAMRAMARL enables autonomous decision-making for each vehicle, optimizing channel assignment, power allocation, and semantic symbol length based on the contextual importance of the transmitted information. This semantic-awareness ensures that both vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications prioritize data that is critical for maintaining safe and efficient platoon operations. The framework also introduces a tailored quality of experience (QoE) metric for semantic communication, aiming to maximize QoE in V2V links while improving the success rate of semantic information transmission (SRS). Extensive simulations have demonstrated that SAMRAMARL outperforms existing methods, achieving significant gains in QoE and communication efficiency in C-V2X platooning scenarios.

CPIG: Leveraging Consistency Policy with Intention Guidance for Multi-agent Exploration

Authors:Yuqian Fu, Yuanheng Zhu, Haoran Li, Zijie Zhao, Jiajun Chai, Dongbin Zhao
Date:2024-11-06 01:40:21

Efficient exploration is crucial in cooperative multi-agent reinforcement learning (MARL), especially in sparse-reward settings. However, due to their reliance on unimodal policies, existing methods are prone to falling into local optima, hindering the effective exploration of better policies. Furthermore, in sparse-reward settings, each agent tends to receive a scarce reward, which poses significant challenges to inter-agent cooperation. This not only increases the difficulty of policy learning but also degrades the overall performance of multi-agent tasks. To address these issues, we propose a Consistency Policy with Intention Guidance (CPIG), with two primary components: (a) introducing a multimodal policy to enhance the agent's exploration capability, and (b) sharing the intention among agents to foster agent cooperation. For component (a), CPIG incorporates a Consistency model as the policy, leveraging its multimodal nature and stochastic characteristics to facilitate exploration. Regarding component (b), we introduce an Intention Learner to deduce the intention on the global state from each agent's local observation. This intention then serves as guidance for the Consistency Policy, promoting cooperation among agents. The proposed method is evaluated in multi-agent particle environments (MPE) and multi-agent MuJoCo (MAMuJoCo). Empirical results demonstrate that our method not only achieves comparable performance to various baselines in dense-reward environments but also significantly enhances performance in sparse-reward settings, outperforming state-of-the-art (SOTA) algorithms by 20%.

Lyapunov-guided Multi-Agent Reinforcement Learning for Delay-Sensitive Wireless Scheduling

Authors:Cheng Zhang, Lan Wei, Ji Fan, Zening Liu, Yongming Huang
Date:2024-11-04 03:20:22

In this paper, a two-stage intelligent scheduler is proposed to minimize the packet-level delay jitter while guaranteeing the delay bound. First, Lyapunov optimization is employed to transform the delay-violation constraint into a sequential slot-level queue stability problem. Second, a hierarchical scheme is proposed to solve the resource allocation between multiple base stations and users, where multi-agent reinforcement learning (MARL) determines the user priorities and the number of scheduled packets, while the underlying scheduler allocates the resources. Our proposed scheme achieves a lower delay jitter and delay violation rate than the Round-Robin Earliest Deadline First algorithm and MARL with a delay violation penalty.
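
A minimal sketch of the Lyapunov step (illustrative parameter choices): a virtual queue converts the long-term delay-violation constraint into slot-level queue stability, and a drift-plus-penalty score trades backlog against the jitter objective.

```python
import numpy as np

def virtual_queue_update(Z, violations, budget):
    # Z grows when the per-slot delay-violation budget is exceeded and drains
    # otherwise; keeping Z stable enforces the long-term constraint.
    return max(Z + violations - budget, 0.0)

def drift_plus_penalty(Z, violations, jitter, V=10.0):
    # Candidate actions are scored by backlog-weighted violations plus V
    # times the jitter objective; the scheduler picks the minimizing action.
    return Z * violations + V * jitter

rng = np.random.default_rng(0)
Z = 0.0
for slot in range(5):
    violations = int(rng.integers(0, 3))   # hypothetical per-slot violations
    Z = virtual_queue_update(Z, violations, budget=1.0)
    print(f"slot {slot}: Z = {Z:.1f}")
```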

Role Play: Learning Adaptive Role-Specific Strategies in Multi-Agent Interactions

Authors:Weifan Long, Wen Wen, Peng Zhai, Lihua Zhang
Date:2024-11-02 07:25:48

The zero-shot coordination problem in multi-agent reinforcement learning (MARL), which requires agents to adapt to unseen agents, has attracted increasing attention. Traditional approaches often rely on the Self-Play (SP) framework to generate a diverse set of policies in a policy pool, which serves to improve the generalization capability of the final agent. However, these frameworks may struggle to capture the full spectrum of potential strategies, especially in real-world scenarios that demand agents balance cooperation with competition. In such settings, agents need strategies that can adapt to varying and often conflicting goals. Drawing inspiration from Social Value Orientation (SVO), where individuals maintain stable value orientations during interactions with others, we propose a novel framework called \emph{Role Play} (RP). RP employs role embeddings to transform the challenge of policy diversity into a more manageable diversity of roles. It trains a common policy with role embedding observations and employs a role predictor to estimate the joint role embeddings of other agents, helping the learning agent adapt to its assigned role. We theoretically prove that an approximate optimal policy can be achieved by optimizing the expected cumulative reward relative to an approximate role-based policy. Experimental results in both cooperative (Overcooked) and mixed-motive games (Harvest, CleanUp) reveal that RP consistently outperforms strong baselines when interacting with unseen agents, highlighting its robustness and adaptability in complex environments.

Language-Driven Policy Distillation for Cooperative Driving in Multi-Agent Reinforcement Learning

Authors:Jiaqi Liu, Chengkai Xu, Peng Hang, Jian Sun, Mingyu Ding, Wei Zhan, Masayoshi Tomizuka
Date:2024-10-31 17:10:01

The cooperative driving technology of Connected and Autonomous Vehicles (CAVs) is crucial for improving the efficiency and safety of transportation systems. Learning-based methods, such as Multi-Agent Reinforcement Learning (MARL), have demonstrated strong capabilities in cooperative decision-making tasks. However, existing MARL approaches still face challenges in terms of learning efficiency and performance. In recent years, Large Language Models (LLMs) have rapidly advanced and shown remarkable abilities in various sequential decision-making tasks. To enhance the learning capabilities of cooperative agents while ensuring decision-making efficiency and cost-effectiveness, we propose LDPD, a language-driven policy distillation method for guiding MARL exploration. In this framework, a teacher agent based on an LLM trains smaller student agents to achieve cooperative decision-making through its own decision-making demonstrations. The teacher agent enhances the observation information of CAVs, utilizes LLMs to perform complex cooperative decision-making reasoning, and leverages carefully designed decision-making tools to achieve expert-level decisions, providing high-quality teaching experiences. The student agent then refines the teacher's prior knowledge into its own model through gradient policy updates. The experiments demonstrate that the students can rapidly improve their capabilities with minimal guidance from the teacher and eventually surpass the teacher's performance. Extensive experiments show that our approach demonstrates better performance and learning efficiency compared to baseline methods.
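
A hypothetical sketch of the distillation loss form (the paper's exact objective may differ): the LLM teacher's action distribution supervises the student policy through a KL term added to the student's RL loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, rl_loss, lam=0.5):
    # KL divergence from the teacher's demonstrated action distribution to
    # the student's policy, added to the student's own RL objective.
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1), teacher_probs,
                  reduction="batchmean")
    return rl_loss + lam * kl   # lam can anneal to 0 as the student matures

student_logits = torch.randn(8, 4, requires_grad=True)   # batch of 8, 4 actions
teacher_probs = torch.softmax(torch.randn(8, 4), dim=-1) # from LLM demonstrations
loss = distillation_loss(student_logits, teacher_probs, rl_loss=torch.tensor(0.2))
loss.backward()
```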

Energy-Aware Multi-Agent Reinforcement Learning for Collaborative Execution in Mission-Oriented Drone Networks

Authors:Ying Li, Changling Li, Jiyao Chen, Christine Roinou
Date:2024-10-29 22:43:26

Mission-oriented drone networks have been widely used for structural inspection, disaster monitoring, border surveillance, etc. Due to the limited battery capacity of drones, the mission execution strategy impacts network performance and mission completion. However, collaborative execution is a challenging problem for drones in such a dynamic environment, as it also involves efficient trajectory design. We leverage multi-agent reinforcement learning (MARL) to manage this challenge, letting each drone learn to collaboratively execute tasks and plan trajectories based on its current status and environment. Simulation results show that the proposed collaborative execution model can successfully complete the mission at least 80% of the time, regardless of task locations and lengths, and can even achieve a 100% success rate when the task density is not too sparse. To the best of our knowledge, our work is one of the pioneer studies on leveraging MARL for collaborative execution in mission-oriented drone networks; the unique value of this work lies in having drone battery levels drive our model design.

A Multi-Agent Reinforcement Learning Testbed for Cognitive Radio Applications

Authors:Sriniketh Vangaru, Daniel Rosen, Dylan Green, Raphael Rodriguez, Maxwell Wiecek, Amos Johnson, Alyse M. Jones, William C. Headley
Date:2024-10-28 20:45:52

Technological trends show that Radio Frequency Reinforcement Learning (RFRL) will play a prominent role in the wireless communication systems of the future. Applications of RFRL range from military communications jamming to enhancing WiFi networks. Before deploying algorithms for these purposes, they must be trained in a simulation environment to ensure adequate performance. For this reason, we previously created the RFRL Gym: a standardized, accessible tool for the development and testing of reinforcement learning (RL) algorithms in the wireless communications space. This environment leveraged the OpenAI Gym framework and featured customizable simulation scenarios within the RF spectrum. However, the RFRL Gym was limited to training a single RL agent per simulation; this is not ideal, as most real-world RF scenarios will contain multiple intelligent agents in cooperative, competitive, or mixed settings, which is a natural consequence of spectrum congestion. Therefore, through integration with Ray RLlib, multi-agent reinforcement learning (MARL) functionality for training and assessment has been added to the RFRL Gym, making it an even more robust tool for RF spectrum simulation. This paper provides an overview of the updated RFRL Gym environment. In this work, the general framework of the tool is described relative to comparable existing resources, highlighting the significant additions and refactoring we have applied to the Gym. Afterward, results from testing various RF scenarios in the MARL environment and future additions are discussed.
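
For readers unfamiliar with the convention such RLlib integrations follow, the sketch below shows the general shape of a dictionary-keyed multi-agent environment, with per-agent observations, rewards, and termination flags. It is a generic, self-contained illustration, not the RFRL Gym's actual interface, and exact RLlib base classes vary across Ray versions.

```python
import numpy as np

class ToySpectrumEnv:
    """Generic dict-keyed multi-agent env skeleton (illustrative only)."""
    def __init__(self, agent_ids=("jammer", "comms"), n_channels=4):
        self.agent_ids = list(agent_ids)
        self.n_channels = n_channels

    def reset(self):
        return {a: np.zeros(self.n_channels) for a in self.agent_ids}

    def step(self, actions):
        # actions: {agent_id: chosen channel}; penalize channel collisions
        chosen = list(actions.values())
        obs, rewards, dones = {}, {}, {}
        for a, ch in actions.items():
            obs[a] = np.eye(self.n_channels)[ch]
            rewards[a] = -1.0 if chosen.count(ch) > 1 else 1.0
            dones[a] = False
        dones["__all__"] = False       # RLlib-style global done flag
        return obs, rewards, dones, {}

env = ToySpectrumEnv()
obs = env.reset()
obs, rew, done, info = env.step({"jammer": 0, "comms": 0})  # collision -> -1
```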

Offline-to-Online Multi-Agent Reinforcement Learning with Offline Value Function Memory and Sequential Exploration

Authors:Hai Zhong, Xun Wang, Zhuoran Li, Longbo Huang
Date:2024-10-25 10:24:19

Offline-to-Online Reinforcement Learning has emerged as a powerful paradigm, leveraging offline data for initialization and online fine-tuning to enhance both sample efficiency and performance. However, most existing research has focused on single-agent settings, with limited exploration of the multi-agent extension, i.e., Offline-to-Online Multi-Agent Reinforcement Learning (O2O MARL). In O2O MARL, two critical challenges become more prominent as the number of agents increases: (i) the risk of unlearning pre-trained Q-values due to distributional shifts during the transition from offline-to-online phases, and (ii) the difficulty of efficient exploration in the large joint state-action space. To tackle these challenges, we propose a novel O2O MARL framework called Offline Value Function Memory with Sequential Exploration (OVMSE). First, we introduce the Offline Value Function Memory (OVM) mechanism to compute target Q-values, preserving knowledge gained during offline training, ensuring smoother transitions, and enabling efficient fine-tuning. Second, we propose a decentralized Sequential Exploration (SE) strategy tailored for O2O MARL, which effectively utilizes the pre-trained offline policy for exploration, thereby significantly reducing the joint state-action space to be explored. Extensive experiments on the StarCraft Multi-Agent Challenge (SMAC) demonstrate that OVMSE significantly outperforms existing baselines, achieving superior sample efficiency and overall performance.
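
One way to picture the OVM idea, under our own simplifying reading: keep a frozen copy of the offline-trained Q-network and prevent online bootstrapped targets from falling below what the offline phase already learned, guarding against unlearning during the transition. The max-based guard rule below is an illustrative assumption, not the paper's exact mechanism.

```python
import torch

def ovm_td_target(reward, next_obs, done, q_target_online, q_offline_frozen,
                  gamma=0.99):
    """TD target that never bootstraps below the frozen offline estimate."""
    with torch.no_grad():
        online_next = q_target_online(next_obs).max(dim=-1).values
        offline_next = q_offline_frozen(next_obs).max(dim=-1).values
        next_v = torch.maximum(online_next, offline_next)  # memory guard
    return reward + gamma * (1.0 - done) * next_v

# toy usage with linear Q-networks over a 6-dim state and 3 actions
q_online = torch.nn.Linear(6, 3)
q_offline = torch.nn.Linear(6, 3)
target = ovm_td_target(torch.ones(8), torch.randn(8, 6), torch.zeros(8),
                       q_online, q_offline)
```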

Multi-Agent Reinforcement Learning with Selective State-Space Models

Authors:Jemma Daniel, Ruan de Kock, Louay Ben Nessir, Sasha Abramowitz, Omayma Mahjoub, Wiem Khlifi, Claude Formanek, Arnu Pretorius
Date:2024-10-25 08:32:21

The Transformer model has demonstrated success across a wide range of domains, including in Multi-Agent Reinforcement Learning (MARL) where the Multi-Agent Transformer (MAT) has emerged as a leading algorithm in the field. However, a significant drawback of Transformer models is their quadratic computational complexity relative to input size, making them computationally expensive when scaling to larger inputs. This limitation restricts MAT's scalability in environments with many agents. Recently, State-Space Models (SSMs) have gained attention due to their computational efficiency, but their application in MARL remains unexplored. In this work, we investigate the use of Mamba, a recent SSM, in MARL and assess whether it can match the performance of MAT while providing significant improvements in efficiency. We introduce a modified version of MAT that incorporates standard and bi-directional Mamba blocks, as well as a novel "cross-attention" Mamba block. Extensive testing shows that our Multi-Agent Mamba (MAM) matches the performance of MAT across multiple standard multi-agent environments, while offering superior scalability to larger agent scenarios. This is significant for the MARL community, because it indicates that SSMs could replace Transformers without compromising performance, whilst also supporting more effective scaling to higher numbers of agents. Our project page is available at https://sites.google.com/view/multi-agent-mamba .

Leveraging Graph Neural Networks and Multi-Agent Reinforcement Learning for Inventory Control in Supply Chains

Authors:Niki Kotecha, Antonio del Rio Chanona
Date:2024-10-24 10:43:04

Inventory control in modern supply chains has attracted significant attention due to the increasing number of disruptive shocks and the challenges posed by complex dynamics, uncertainties, and limited collaboration. Traditional methods, which often rely on static parameters, struggle to adapt to changing environments. This paper proposes a Multi-Agent Reinforcement Learning (MARL) framework with Graph Neural Networks (GNNs) for state representation to address these limitations. Our approach redefines the action space by parameterizing heuristic inventory control policies, making it adaptive as the parameters dynamically adjust based on system conditions. By leveraging the inherent graph structure of supply chains, our framework enables agents to learn the system's topology, and we employ a centralized learning, decentralized execution scheme that allows agents to learn collaboratively while overcoming information-sharing constraints. Additionally, we incorporate global mean pooling and regularization techniques to enhance performance. We test the capabilities of our proposed approach on four different supply chain configurations and conduct a sensitivity analysis. This work paves the way for utilizing MARL-GNN frameworks to improve inventory management in complex, decentralized supply chain environments.
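
The "parameterize a heuristic" idea can be made concrete with a base-stock rule, a standard inventory heuristic: rather than emitting raw order quantities, each agent's action sets the rule's order-up-to level for the current period. Using base-stock specifically is our illustrative assumption.

```python
def base_stock_order(inventory_position: float, order_up_to_level: float) -> float:
    """Order whatever is needed to reach the target level (never negative)."""
    return max(0.0, order_up_to_level - inventory_position)

# The RL action is the heuristic's parameter, adjusted each step as system
# conditions change, rather than the order quantity itself:
action = 42.0   # order-up-to level proposed by the learned policy
order = base_stock_order(inventory_position=30.0, order_up_to_level=action)
print(order)    # 12.0 units ordered this period
```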

Evolutionary Dispersal of Ecological Species via Multi-Agent Deep Reinforcement Learning

Authors:Wonhyung Choi, Inkyung Ahn
Date:2024-10-24 10:21:23

Understanding species dynamics in heterogeneous environments is essential for ecosystem studies. Traditional models assumed homogeneous habitats, but recent approaches include spatial and temporal variability, highlighting species migration. We adopt starvation-driven diffusion (SDD) models as nonlinear diffusion to describe species dispersal based on local resource conditions, showing advantages for species survival. However, accurate prediction remains challenging due to model simplifications. This study uses multi-agent reinforcement learning (MARL) with deep Q-networks (DQN) to simulate single species and predator-prey interactions, incorporating SDD-type rewards. Our simulations reveal evolutionary dispersal strategies, providing insights into species dispersal mechanisms and validating traditional mathematical models.

PyTSC: A Unified Platform for Multi-Agent Reinforcement Learning in Traffic Signal Control

Authors:Rohit Bokade, Xiaoning Jin
Date:2024-10-23 18:10:38

Multi-Agent Reinforcement Learning (MARL) presents a promising approach for addressing the complexity of Traffic Signal Control (TSC) in urban environments. However, existing platforms for MARL-based TSC research face challenges such as slow simulation speeds and convoluted, difficult-to-maintain codebases. To address these limitations, we introduce PyTSC, a robust and flexible simulation environment that facilitates the training and evaluation of MARL algorithms for TSC. PyTSC integrates multiple simulators, such as SUMO and CityFlow, and offers a streamlined API, empowering researchers to explore a broad spectrum of MARL approaches efficiently. PyTSC accelerates experimentation and provides new opportunities for advancing intelligent traffic management systems in real-world applications.

Evolution with Opponent-Learning Awareness

Authors:Yann Bouteiller, Karthik Soma, Giovanni Beltrame
Date:2024-10-22 22:49:04

The universe involves many independent co-learning agents as an ever-evolving part of our observed environment. Yet, in practice, Multi-Agent Reinforcement Learning (MARL) applications are usually constrained to small, homogeneous populations and remain computationally intensive. In this paper, we study how large heterogeneous populations of learning agents evolve in normal-form games. We show how, under assumptions commonly made in the multi-armed bandit literature, Multi-Agent Policy Gradient closely resembles the Replicator Dynamic, and we further derive a fast, parallelizable implementation of Opponent-Learning Awareness tailored for evolutionary simulations. This enables us to simulate the evolution of very large populations made of heterogeneous co-learning agents, under both naive and advanced learning strategies. We demonstrate our approach in simulations of 200,000 agents, evolving in the classic games of Hawk-Dove, Stag-Hunt, and Rock-Paper-Scissors. Each game highlights distinct ways in which Opponent-Learning Awareness affects evolution.
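
As a worked illustration of the replicator connection, here is the bare replicator dynamic on Hawk-Dove, the kind of population update the paper shows Multi-Agent Policy Gradient approximates under bandit-style assumptions. The payoff values (V=2, C=3) are arbitrary choices for the demo.

```python
import numpy as np

V, C = 2.0, 3.0                        # resource value, cost of fighting
A = np.array([[(V - C) / 2, V],        # Hawk vs Hawk, Hawk vs Dove
              [0.0,         V / 2]])   # Dove vs Hawk, Dove vs Dove

x = np.array([0.9, 0.1])               # population shares of Hawk, Dove
for _ in range(2000):
    fitness = A @ x
    x = x + 0.01 * x * (fitness - x @ fitness)   # replicator update
print(x)   # approaches the mixed equilibrium with Hawk share V/C = 2/3
```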

Hierarchical Multi-agent Reinforcement Learning for Cyber Network Defense

Authors:Aditya Vikram Singh, Ethan Rathbun, Emma Graham, Lisa Oakley, Simona Boboila, Alina Oprea, Peter Chin
Date:2024-10-22 18:35:05

Recent advances in multi-agent reinforcement learning (MARL) have created opportunities to solve complex real-world tasks. Cybersecurity is a notable application area, where defending networks against sophisticated adversaries remains a challenging task typically performed by teams of security operators. In this work, we explore novel MARL strategies for building autonomous cyber network defenses that address challenges such as large policy spaces, partial observability, and stealthy, deceptive adversarial strategies. To facilitate efficient and generalized learning, we propose a hierarchical Proximal Policy Optimization (PPO) architecture that decomposes the cyber defense task into specific sub-tasks like network investigation and host recovery. Our approach involves training sub-policies for each sub-task using PPO enhanced with domain expertise. These sub-policies are then leveraged by a master defense policy that coordinates their selection to solve complex network defense tasks. Furthermore, the sub-policies can be fine-tuned and transferred with minimal cost to defend against shifts in adversarial behavior or changes in network settings. We conduct extensive experiments using CybORG Cage 4, the state-of-the-art MARL environment for cyber defense. Comparisons with multiple baselines across different adversaries show that our hierarchical learning approach achieves top performance in terms of convergence speed, episodic return, and several interpretable metrics relevant to cybersecurity, including the fraction of clean machines on the network, precision, and false positives on recoveries.
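
The hierarchical decomposition can be sketched as a master policy that, at each step, selects one trained sub-policy to act. The sub-task names come from the abstract; the selector module and the stand-in sub-policies below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MasterDefensePolicy(nn.Module):
    """Master selects which trained sub-policy handles the current step."""
    def __init__(self, obs_dim, sub_policies):
        super().__init__()
        self.sub_policies = nn.ModuleDict(sub_policies)   # name -> policy
        self.selector = nn.Linear(obs_dim, len(sub_policies))

    def act(self, obs):
        idx = self.selector(obs).argmax(dim=-1).item()
        name = list(self.sub_policies)[idx]
        return name, self.sub_policies[name](obs)         # delegate the action

subs = {"investigate": nn.Linear(8, 4), "recover": nn.Linear(8, 4)}  # stand-ins
master = MasterDefensePolicy(obs_dim=8, sub_policies=subs)
chosen, action_logits = master.act(torch.randn(1, 8))
```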

A New Approach to Solving SMAC Task: Generating Decision Tree Code from Large Language Models

Authors:Yue Deng, Weiyu Ma, Yuxin Fan, Yin Zhang, Haifeng Zhang, Jian Zhao
Date:2024-10-21 13:58:38

StarCraft Multi-Agent Challenge (SMAC) is one of the most commonly used experimental environments in multi-agent reinforcement learning (MARL), where the specific task is to control a set number of allied units to defeat enemy forces. Traditional MARL algorithms often require interacting with the environment for up to 1 million steps to train a model, and the resulting policies are typically non-interpretable with weak transferability. In this paper, we propose a novel approach to solving SMAC tasks called LLM-SMAC. In our framework, agents leverage large language models (LLMs) to generate decision tree code by providing task descriptions. The model further refines itself through self-reflection, using feedback from the rewards provided by the environment. We conduct experiments in SMAC and demonstrate that our method can produce high-quality, interpretable decision trees with minimal environmental exploration. Moreover, these models exhibit strong transferability, successfully applying to similar SMAC environments without modification. We believe this approach offers a new direction for solving decision-making tasks in the future.

FlickerFusion: Intra-trajectory Domain Generalizing Multi-Agent RL

Authors:Woosung Koh, Wonbeen Oh, Siyeol Kim, Suhin Shin, Hyeongjin Kim, Jaein Jang, Junghyun Lee, Se-Young Yun
Date:2024-10-21 10:57:45

Multi-agent reinforcement learning has demonstrated significant potential in addressing complex cooperative tasks across various real-world applications. However, existing MARL approaches often rely on the restrictive assumption that the number of entities (e.g., agents, obstacles) remains constant between training and inference. This overlooks scenarios where entities are dynamically removed or added during the inference trajectory -- a common occurrence in real-world environments like search and rescue missions and dynamic combat situations. In this paper, we tackle the challenge of intra-trajectory dynamic entity composition under zero-shot out-of-domain (OOD) generalization, where such dynamic changes cannot be anticipated beforehand. Our empirical studies reveal that existing MARL methods suffer significant performance degradation and increased uncertainty in these scenarios. In response, we propose FlickerFusion, a novel OOD generalization method that acts as a universally applicable augmentation technique for MARL backbone methods. FlickerFusion stochastically drops out parts of the observation space, emulating being in-domain when inferenced OOD. The results show that FlickerFusion not only achieves superior inference rewards but also uniquely reduces uncertainty vis-\`a-vis the backbone, compared to existing methods. Benchmarks, implementations, and model weights are organized and open-sourced at flickerfusion305.github.io, accompanied by ample demo video renderings.
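
The core augmentation admits a very small sketch: during training, randomly drop entity slots from each observation so the policy routinely sees inputs that resemble entities disappearing at inference time. The (batch, entities, features) tensor layout is our assumption.

```python
import torch

def flicker(obs: torch.Tensor, drop_prob: float = 0.2) -> torch.Tensor:
    """Zero out a random subset of entity slots in each observation."""
    keep = torch.rand(obs.shape[:2], device=obs.device) > drop_prob
    return obs * keep.unsqueeze(-1).float()

augmented = flicker(torch.randn(32, 6, 10))   # 32 obs, 6 entities, 10 features
```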

Cooperation and Fairness in Multi-Agent Reinforcement Learning

Authors:Jasmine Jerry Aloor, Siddharth Nayak, Sydney Dolan, Hamsa Balakrishnan
Date:2024-10-19 00:10:52

Multi-agent systems are trained to maximize shared cost objectives, which typically reflect system-level efficiency. However, in the resource-constrained environments of mobility and transportation systems, efficiency may be achieved at the expense of fairness -- certain agents may incur significantly greater costs or lower rewards compared to others. Tasks could be distributed inequitably, leading to some agents receiving an unfair advantage while others incur disproportionately high costs. It is important to consider the tradeoffs between efficiency and fairness. We consider the problem of fair multi-agent navigation for a group of decentralized agents using multi-agent reinforcement learning (MARL). We consider the reciprocal of the coefficient of variation of the distances traveled by different agents as a measure of fairness and investigate whether agents can learn to be fair without significantly sacrificing efficiency (i.e., increasing the total distance traveled). We find that by training agents using min-max fair distance goal assignments along with a reward term that incentivizes fairness as they move towards their goals, the agents (1) learn a fair assignment of goals and (2) achieve almost perfect goal coverage in navigation scenarios using only local observations. For goal coverage scenarios, we find that, on average, our model yields a 14% improvement in efficiency and a 5% improvement in fairness over a baseline trained using random assignments. Furthermore, an average of 21% improvement in fairness can be achieved compared to a model trained on optimally efficient assignments; this increase in fairness comes at the expense of only a 7% decrease in efficiency. Finally, we extend our method to environments in which agents must complete coverage tasks in prescribed formations and show that it is possible to do so without tailoring the models to specific formation shapes.
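
The fairness measure is simple enough to state directly in code: the reciprocal of the coefficient of variation of the distances traveled, so more even distances score higher.

```python
import numpy as np

def fairness(distances) -> float:
    """Reciprocal coefficient of variation: mean / std of per-agent distances."""
    d = np.asarray(distances, dtype=float)
    return d.mean() / d.std()

print(fairness([10.0, 11.0, 9.5]))   # near-equal distances -> high fairness
print(fairness([2.0, 20.0, 40.0]))   # skewed workloads -> low fairness
```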

Spectrum Sharing using Deep Reinforcement Learning in Vehicular Networks

Authors:Riya Dinesh Deshpande, Faheem A. Khan, Qasim Zeeshan Ahmed
Date:2024-10-16 12:59:59

As the number of devices connected to the vehicular network grows exponentially, addressing the numerous challenges of effectively allocating spectrum in dynamic vehicular environments becomes increasingly difficult, and traditional methods may not suffice to tackle this issue. In vehicular networks, safety-critical messages are involved, and it is important to implement an efficient spectrum allocation paradigm that enables reliable communication and manages congestion in the network. To tackle this, a Deep Q Network (DQN) model is proposed as a solution, leveraging its ability to learn optimal strategies over time and make decisions. The paper presents results and analyses demonstrating the efficacy of the DQN model in enhancing spectrum sharing efficiency. Deep Reinforcement Learning methods for sharing spectrum in vehicular networks have shown promising outcomes, demonstrating the system's ability to adjust to dynamic communication environments. Both SARL and MARL models have exhibited high success rates of V2V communication, with the cumulative reward of the RL model reaching its maximum as training progresses.

Multiple Ships Cooperative Navigation and Collision Avoidance using Multi-agent Reinforcement Learning with Communication

Authors:Y. Wang, Y. Zhao
Date:2024-10-12 09:27:40

In the real world, unmanned surface vehicles (USV) often need to coordinate with each other to accomplish specific tasks. However, achieving cooperative control in multi-agent systems is challenging due to issues such as non-stationarity and partial observability. Recent advancements in Multi-Agent Reinforcement Learning (MARL) provide new perspectives to address these challenges. Therefore, we propose using the multi-agent deep deterministic policy gradient (MADDPG) algorithm with communication to address multiple ships' cooperation problems under partial observability. We developed two tasks based on OpenAI's gym environment: cooperative navigation and cooperative collision avoidance. In these tasks, ships must not only learn effective control strategies but also establish communication protocols with other agents. We analyze the impact of external noise on communication, the effect of inter-agent communication on performance, and the communication patterns learned by the agents. The results demonstrate that our proposed framework effectively addresses cooperative navigation and collision avoidance among multiple vessels, significantly outperforming traditional single-agent algorithms. Agents establish a consistent communication protocol, enabling them to compensate for missing information through shared observations and achieve better coordination.

Kaleidoscope: Learnable Masks for Heterogeneous Multi-agent Reinforcement Learning

Authors:Xinran Li, Ling Pan, Jun Zhang
Date:2024-10-11 05:22:54

In multi-agent reinforcement learning (MARL), parameter sharing is commonly employed to enhance sample efficiency. However, the popular approach of full parameter sharing often leads to homogeneous policies among agents, potentially limiting the performance benefits that could be derived from policy diversity. To address this critical limitation, we introduce \emph{Kaleidoscope}, a novel adaptive partial parameter sharing scheme that fosters policy heterogeneity while still maintaining high sample efficiency. Specifically, Kaleidoscope maintains one set of common parameters alongside multiple sets of distinct, learnable masks for different agents, dictating the sharing of parameters. It promotes diversity among policy networks by encouraging discrepancy among these masks, without sacrificing the efficiencies of parameter sharing. This design allows Kaleidoscope to dynamically balance high sample efficiency with a broad policy representational capacity, effectively bridging the gap between full parameter sharing and non-parameter sharing across various environments. We further extend Kaleidoscope to critic ensembles in the context of actor-critic algorithms, which could help improve value estimations. Our empirical evaluations across extensive environments, including multi-agent particle environment, multi-agent MuJoCo and StarCraft multi-agent challenge v2, demonstrate the superior performance of Kaleidoscope compared with existing parameter sharing approaches, showcasing its potential for performance enhancement in MARL. The code is publicly available at \url{https://github.com/LXXXXR/Kaleidoscope}.
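
A toy version of the learnable-mask idea, with our own simplifications: sigmoid-relaxed masks over a single shared linear layer, and a cosine-similarity penalty that encourages discrepancy among the agents' masks. None of this is Kaleidoscope's actual code.

```python
import torch
import torch.nn as nn

class MaskedSharedLinear(nn.Module):
    """One shared weight matrix, one learnable mask per agent."""
    def __init__(self, n_agents, d_in, d_out):
        super().__init__()
        self.shared = nn.Parameter(torch.randn(d_out, d_in) * 0.02)
        self.masks = nn.Parameter(torch.zeros(n_agents, d_out, d_in))

    def forward(self, x, agent_id):
        w = self.shared * torch.sigmoid(self.masks[agent_id])  # agent-specific
        return x @ w.T

    def diversity_penalty(self):
        # average pairwise cosine similarity between masks; minimizing it
        # pushes the masks (and hence the policies) apart
        m = torch.sigmoid(self.masks).flatten(1)
        norms = m.norm(dim=1, keepdim=True)
        sim = (m @ m.T) / (norms @ norms.T + 1e-8)
        off_diag = sim - torch.eye(len(m))
        return off_diag.sum() / (len(m) * (len(m) - 1))

layer = MaskedSharedLinear(n_agents=3, d_in=8, d_out=4)
out = layer(torch.randn(5, 8), agent_id=0)
loss = layer.diversity_penalty()
```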

Large Legislative Models: Towards Efficient AI Policymaking in Economic Simulations

Authors:Henry Gasztowtt, Benjamin Smith, Vincent Zhu, Qinxun Bai, Edwin Zhang
Date:2024-10-10 20:04:58

The improvement of economic policymaking presents an opportunity for broad societal benefit, a notion that has inspired research towards AI-driven policymaking tools. AI policymaking holds the potential to surpass human performance through the ability to process data quickly at scale. However, existing RL-based methods exhibit sample inefficiency, and are further limited by an inability to flexibly incorporate nuanced information into their decision-making processes. Thus, we propose a novel method in which we instead utilize pre-trained Large Language Models (LLMs), as sample-efficient policymakers in socially complex multi-agent reinforcement learning (MARL) scenarios. We demonstrate significant efficiency gains, outperforming existing methods across three environments. Our code is available at https://github.com/hegasz/large-legislative-models.

Addressing Rotational Learning Dynamics in Multi-Agent Reinforcement Learning

Authors:Baraah A. M. Sidahmed, Tatjana Chavdarova
Date:2024-10-10 14:34:14

Multi-agent reinforcement learning (MARL) has emerged as a powerful paradigm for solving complex problems through agents' cooperation and competition, finding widespread applications across domains. Despite its success, MARL faces a reproducibility crisis. We show that, in part, this issue is related to the rotational optimization dynamics arising from competing agents' objectives, which require methods beyond standard optimization algorithms. We reframe MARL approaches using Variational Inequalities (VIs), offering a unified framework to address such issues. Leveraging optimization techniques designed for VIs, we propose a general approach for integrating gradient-based VI methods capable of handling rotational dynamics into existing MARL algorithms. Empirical results demonstrate significant performance improvements across benchmarks. In the zero-sum games Rock-Paper-Scissors and Matching Pennies, VI methods achieve better convergence to equilibrium strategies, and in the Multi-Agent Particle Environment Predator-Prey task, they also enhance team coordination. These results underscore the transformative potential of advanced optimization techniques in MARL.
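
Extragradient is the canonical gradient-based VI method for such rotational dynamics; whether it is among the exact variants the paper integrates is not stated here, so treat this as a generic illustration. On Matching Pennies, simultaneous gradient play orbits the equilibrium, while the look-ahead step lets iterates spiral in toward (0.5, 0.5).

```python
import numpy as np

A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])            # row player's Matching Pennies payoff

def project(p):
    """Simple renormalization back onto the 2-action simplex."""
    p = np.clip(p, 1e-6, 1.0)
    return p / p.sum()

x = np.array([0.9, 0.1])               # row strategy
y = np.array([0.2, 0.8])               # column strategy
lr = 0.1
for _ in range(500):
    x_h = project(x + lr * A @ y)      # extrapolation ("look-ahead") step
    y_h = project(y - lr * A.T @ x)
    x = project(x + lr * A @ y_h)      # update using look-ahead gradients
    y = project(y - lr * A.T @ x_h)
print(x, y)                            # both approach (0.5, 0.5)
```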

Cooperative Multi-Target Positioning for Cell-Free Massive MIMO with Multi-Agent Reinforcement Learning

Authors:Ziheng Liu, Jiayi Zhang, Enyu Shi, Yiyang Zhu, Derrick Wing Kwan Ng, Bo Ai
Date:2024-10-09 03:15:13

Cell-free massive multiple-input multiple-output (mMIMO) is a promising technology to empower next-generation mobile communication networks. In this paper, to address the computational complexity associated with conventional fingerprint positioning, we consider a novel cooperative positioning architecture that involves certain relevant access points (APs) to establish positioning similarity coefficients. Then, we propose an innovative joint positioning and correction framework employing multi-agent reinforcement learning (MARL) to tackle the challenges of high-dimensional sophisticated signal processing, which mainly leverages the received signal strength information for preliminary positioning, supplemented by the angle of arrival information to refine the initial position estimation. Moreover, to mitigate the bias effects originating from remote APs, we design a cooperative weighted K-nearest neighbor (Co-WKNN)-based estimation scheme to select APs with a high correlation to participate in user positioning. In the numerical results, we present comparisons of various user positioning schemes, which reveal that the proposed MARL-based positioning scheme with Co-WKNN can effectively improve positioning performance. It is important to note that the cooperative positioning architecture is a critical element in striking a balance between positioning performance and computational complexity.

Cooperative and Asynchronous Transformer-based Mission Planning for Heterogeneous Teams of Mobile Robots

Authors:Milad Farjadnasab, Shahin Sirouspour
Date:2024-10-08 21:14:09

Cooperative mission planning for heterogeneous teams of mobile robots presents a unique set of challenges, particularly when operating under communication constraints and limited computational resources. To address these challenges, we propose the Cooperative and Asynchronous Transformer-based Mission Planning (CATMiP) framework, which leverages multi-agent reinforcement learning (MARL) to coordinate distributed decision making among agents with diverse sensing, motion, and actuation capabilities, operating under sporadic ad hoc communication. A Class-based Macro-Action Decentralized Partially Observable Markov Decision Process (CMacDec-POMDP) is also formulated to effectively model asynchronous decision-making for heterogeneous teams of agents. The framework utilizes an asynchronous centralized training and distributed execution scheme that is developed based on the Multi-Agent Transformer (MAT) architecture. This design allows a single trained model to generalize to larger environments and accommodate varying team sizes and compositions. We evaluate CATMiP in a 2D grid-world simulation environment and compare its performance against planning-based exploration methods. Results demonstrate CATMiP's superior efficiency, scalability, and robustness to communication dropouts, highlighting its potential for real-world heterogeneous mobile robot systems. The code is available at https://github.com/mylad13/CATMiP.

Learning Equilibria in Adversarial Team Markov Games: A Nonconvex-Hidden-Concave Min-Max Optimization Problem

Authors:Fivos Kalogiannis, Jingming Yan, Ioannis Panageas
Date:2024-10-08 04:07:26

We study the problem of learning a Nash equilibrium (NE) in Markov games, which is a cornerstone in multi-agent reinforcement learning (MARL). In particular, we focus on infinite-horizon adversarial team Markov games (ATMGs), in which agents that share a common reward function compete against a single opponent, the adversary. These games unify two-player zero-sum Markov games and Markov potential games, resulting in a setting that encompasses both collaboration and competition. Kalogiannis et al. (2023a) provided an efficient equilibrium computation algorithm for ATMGs which presumes knowledge of the reward and transition functions and has no sample complexity guarantees. We contribute a learning algorithm that utilizes MARL policy gradient methods with iteration and sample complexity that is polynomial in the approximation error $\epsilon$ and the natural parameters of the ATMG, resolving the main caveats of the solution by (Kalogiannis et al., 2023a). It is worth noting that previously, the existence of learning algorithms for NE was known for Markov two-player zero-sum and potential games but not for ATMGs. Seen through the lens of min-max optimization, computing an NE in these games constitutes a nonconvex-nonconcave saddle-point problem. Min-max optimization has received extensive study. Nevertheless, the case of nonconvex-nonconcave landscapes remains elusive: in full generality, finding saddle-points is computationally intractable (Daskalakis et al., 2021). We circumvent the aforementioned intractability by developing techniques that exploit the hidden structure of the objective function via a nonconvex-concave reformulation. However, this introduces the challenge of a feasibility set with coupled constraints. We tackle these challenges by establishing novel techniques for optimizing weakly-smooth nonconvex functions, extending the framework of (Devolder et al., 2014).

Distributed Collaborative User Positioning for Cell-Free Massive MIMO with Multi-Agent Reinforcement Learning

Authors:Ziheng Liu, Jiayi Zhang, Enyu Shi, Yiyang Zhu, Derrick Wing Kwan Ng, Bo Ai
Date:2024-10-07 09:37:28

In this paper, we investigate a cell-free massive multiple-input multiple-output system, which exhibits great potential in enhancing the capabilities of next-generation mobile communication networks. We first study the distributed positioning problem to lay the groundwork for solving resource allocation and interference management issues. Instead of relying on computationally and spatially complex fingerprint positioning methods, we propose a novel two-stage distributed collaborative positioning architecture with multi-agent reinforcement learning (MARL) network, consisting of a received signal strength-based preliminary positioning network and an angle of arrival-based auxiliary correction network. Our experimental results demonstrate that the two-stage distributed collaborative user positioning architecture can outperform conventional fingerprint positioning methods in terms of positioning accuracy.

YOLO-MARL: You Only LLM Once for Multi-agent Reinforcement Learning

Authors:Yuan Zhuang, Yi Shen, Zhili Zhang, Yuxiao Chen, Fei Miao
Date:2024-10-05 01:44:11

Advancements in deep multi-agent reinforcement learning (MARL) have positioned it as a promising approach for decision-making in cooperative games. However, it remains challenging for MARL agents to learn cooperative strategies for some game environments. Recently, large language models (LLMs) have demonstrated emergent reasoning capabilities, making them promising candidates for enhancing coordination among the agents. However, due to the model size of LLMs, it can be expensive to frequently query LLMs for the actions that agents take. In this work, we propose You Only LLM Once for MARL (YOLO-MARL), a novel framework that leverages the high-level task planning capabilities of LLMs to improve the policy learning process of multi-agents in cooperative games. Notably, for each game environment, YOLO-MARL only requires a one-time interaction with LLMs in the proposed strategy generation, state interpretation and planning function generation modules, before the MARL policy training process. This avoids the ongoing costs and computational time associated with frequent LLM API calls during training. Moreover, the trained decentralized normal-sized neural network-based policies operate independently of the LLM. We evaluate our method across three different environments and demonstrate that YOLO-MARL outperforms traditional MARL algorithms.
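
The one-time interaction pattern reduces to a small skeleton: query the LLM once per environment for planning code, materialize it, and reuse it throughout MARL training with no further calls. `query_llm` below is a placeholder for whatever provider is used, and the prompt and function contract are our illustrative assumptions.

```python
def query_llm(prompt: str) -> str:
    raise NotImplementedError("call your LLM provider here (one time only)")

def build_planning_fn(env_description: str):
    """One LLM call that returns reusable planning code for this environment."""
    code = query_llm(
        "Given this cooperative game:\n" + env_description +
        "\nWrite a Python function plan(state) that returns a high-level goal.")
    namespace = {}
    exec(code, namespace)          # materialize the generated function
    return namespace["plan"]

# During the subsequent MARL training loop, plan(state) shapes rewards or
# augments observations, with zero additional LLM queries.
```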

Boosting Sample Efficiency and Generalization in Multi-agent Reinforcement Learning via Equivariance

Authors:Joshua McClellan, Naveed Haghani, John Winder, Furong Huang, Pratap Tokekar
Date:2024-10-03 15:25:37

Multi-Agent Reinforcement Learning (MARL) struggles with sample inefficiency and poor generalization [1]. These challenges are partially due to a lack of structure or inductive bias in the neural networks typically used in learning the policy. One such form of structure that is commonly observed in multi-agent scenarios is symmetry. The field of Geometric Deep Learning has developed Equivariant Graph Neural Networks (EGNN) that are equivariant (or symmetric) to rotations, translations, and reflections of nodes. Incorporating equivariance has been shown to improve learning efficiency and decrease error [2]. In this paper, we demonstrate that EGNNs improve the sample efficiency and generalization in MARL. However, we also show that a naive application of EGNNs to MARL results in poor early exploration due to a bias in the EGNN structure. To mitigate this bias, we present Exploration-enhanced Equivariant Graph Neural Networks or E2GN2. We compare E2GN2 to other common function approximators using common MARL benchmarks MPE and SMACv2. E2GN2 demonstrates a significant improvement in sample efficiency, greater final reward convergence, and a 2x-5x gain over standard GNNs in our generalization tests. These results pave the way for more reliable and effective solutions in complex multi-agent systems.

Breaking the mold: The challenge of large scale MARL specialization

Authors:Stefan Juang, Hugh Cao, Arielle Zhou, Ruochen Liu, Nevin L. Zhang, Elvis Liu
Date:2024-10-03 01:16:22

In multi-agent learning, the predominant approach focuses on generalization, often neglecting the optimization of individual agents. This emphasis on generalization limits the ability of agents to utilize their unique strengths, resulting in inefficiencies. This paper introduces Comparative Advantage Maximization (CAM), a method designed to enhance individual agent specialization in multi-agent systems. CAM employs a two-phase process, combining centralized population training with individual specialization through comparative advantage maximization. CAM achieved a 13.2% improvement in individual agent performance and a 14.9% increase in behavioral diversity compared to state-of-the-art systems. The success of CAM highlights the importance of individual agent specialization, suggesting new directions for multi-agent system development.

ComaDICE: Offline Cooperative Multi-Agent Reinforcement Learning with Stationary Distribution Shift Regularization

Authors:The Viet Bui, Thanh Hong Nguyen, Tien Mai
Date:2024-10-02 18:56:10

Offline reinforcement learning (RL) has garnered significant attention for its ability to learn effective policies from pre-collected datasets without the need for further environmental interactions. While promising results have been demonstrated in single-agent settings, offline multi-agent reinforcement learning (MARL) presents additional challenges due to the large joint state-action space and the complexity of multi-agent behaviors. A key issue in offline RL is the distributional shift, which arises when the target policy being optimized deviates from the behavior policy that generated the data. This problem is exacerbated in MARL due to the interdependence between agents' local policies and the expansive joint state-action space. Prior approaches have primarily addressed this challenge by incorporating regularization in the space of either Q-functions or policies. In this work, we introduce a regularizer in the space of stationary distributions to better handle distributional shift. Our algorithm, ComaDICE, offers a principled framework for offline cooperative MARL by incorporating stationary distribution regularization for the global learning policy, complemented by a carefully structured multi-agent value decomposition strategy to facilitate multi-agent training. Through extensive experiments on the multi-agent MuJoCo and StarCraft II benchmarks, we demonstrate that ComaDICE achieves superior performance compared to state-of-the-art offline MARL methods across nearly all tasks.

Sable: a Performant, Efficient and Scalable Sequence Model for MARL

Authors:Omayma Mahjoub, Sasha Abramowitz, Ruan de Kock, Wiem Khlifi, Simon du Toit, Jemma Daniel, Louay Ben Nessir, Louise Beyers, Claude Formanek, Liam Clark, Arnu Pretorius
Date:2024-10-02 16:15:26

As multi-agent reinforcement learning (MARL) progresses towards solving larger and more complex problems, it becomes increasingly important that algorithms exhibit the key properties of (1) strong performance, (2) memory efficiency and (3) scalability. In this work, we introduce Sable, a performant, memory efficient and scalable sequence modeling approach to MARL. Sable works by adapting the retention mechanism in Retentive Networks (Sun et al., 2023) to achieve computationally efficient processing of multi-agent observations with long context memory for temporal reasoning. Through extensive evaluations across six diverse environments, we demonstrate how Sable is able to significantly outperform existing state-of-the-art methods in a large number of diverse tasks (34 out of 45 tested). Furthermore, Sable maintains performance as we scale the number of agents, handling environments with more than a thousand agents while exhibiting a linear increase in memory usage. Finally, we conduct ablation studies to isolate the source of Sable's performance gains and confirm its efficient computational memory usage.

MARLens: Understanding Multi-agent Reinforcement Learning for Traffic Signal Control via Visual Analytics

Authors:Yutian Zhang, Guohong Zheng, Zhiyuan Liu, Quan Li, Haipeng Zeng
Date:2024-10-02 09:24:44

The issue of traffic congestion poses a significant obstacle to the development of global cities. One promising solution to tackle this problem is intelligent traffic signal control (TSC). Recently, TSC strategies leveraging reinforcement learning (RL) have garnered attention among researchers. However, the evaluation of these models has primarily relied on fixed metrics like reward and queue length. This limited evaluation approach provides only a narrow view of the model's decision-making process, impeding its practical implementation. Moreover, effective TSC necessitates coordinated actions across multiple intersections. Existing visual analysis solutions fall short when applied in multi-agent settings. In this study, we delve into the challenge of interpretability in multi-agent reinforcement learning (MARL), particularly within the context of TSC. We propose MARLens, a visual analytics system tailored to understand MARL-based TSC. Our system serves as a versatile platform for both RL and TSC researchers. It empowers them to explore the model's features from various perspectives, revealing its decision-making processes and shedding light on interactions among different agents. To facilitate quick identification of critical states, we have devised multiple visualization views, complemented by a traffic simulation module that allows users to replay specific training scenarios. To validate the utility of our proposed system, we present three comprehensive case studies, incorporate insights from domain experts through interviews, and conduct a user study. These collective efforts underscore the feasibility and effectiveness of MARLens in enhancing our understanding of MARL-based TSC systems and pave the way for more informed and efficient traffic management strategies.

Exploiting Structure in Offline Multi-Agent RL: The Benefits of Low Interaction Rank

Authors:Wenhao Zhan, Scott Fujimoto, Zheqing Zhu, Jason D. Lee, Daniel R. Jiang, Yonathan Efroni
Date:2024-10-01 22:16:22

We study the problem of learning an approximate equilibrium in the offline multi-agent reinforcement learning (MARL) setting. We introduce a structural assumption -- the interaction rank -- and establish that functions with low interaction rank are significantly more robust to distribution shift compared to general ones. Leveraging this observation, we demonstrate that utilizing function classes with low interaction rank, when combined with regularization and no-regret learning, admits decentralized, computationally and statistically efficient learning in offline MARL. Our theoretical results are complemented by experiments that showcase the potential of critic architectures with low interaction rank in offline MARL, contrasting with commonly used single-agent value decomposition architectures.
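
In spirit, a low-interaction-rank critic decomposes the joint value into terms that each touch only a few agents at once. The pairwise version below is our simplified illustration; the paper's function class is defined more generally.

```python
import torch
import torch.nn as nn
from itertools import combinations

class PairwiseCritic(nn.Module):
    """Joint value as a sum of per-pair terms: no component sees >2 agents."""
    def __init__(self, n_agents, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.pairs = list(combinations(range(n_agents), 2))
        self.f = nn.ModuleList(
            nn.Sequential(nn.Linear(2 * (obs_dim + act_dim), hidden),
                          nn.ReLU(), nn.Linear(hidden, 1))
            for _ in self.pairs)

    def forward(self, obs, act):  # obs: (B, n, obs_dim), act: (B, n, act_dim)
        z = torch.cat([obs, act], dim=-1)
        vals = [f(torch.cat([z[:, i], z[:, j]], dim=-1))
                for f, (i, j) in zip(self.f, self.pairs)]
        return torch.stack(vals, dim=0).sum(dim=0)   # (B, 1)

critic = PairwiseCritic(n_agents=3, obs_dim=4, act_dim=2)
q = critic(torch.randn(8, 3, 4), torch.randn(8, 3, 2))
```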

MARLadona - Towards Cooperative Team Play Using Multi-Agent Reinforcement Learning

Authors:Zichong Li, Filip Bjelonic, Victor Klemm, Marco Hutter
Date:2024-09-30 14:26:53

Robot soccer, in its full complexity, poses an unsolved research challenge. Current solutions heavily rely on engineered heuristic strategies, which lack robustness and adaptability. Deep reinforcement learning has gained significant traction in various complex robotics tasks such as locomotion, manipulation, and competitive games (e.g., AlphaZero, OpenAI Five), making it a promising solution to the robot soccer problem. This paper introduces MARLadona, a decentralized multi-agent reinforcement learning (MARL) training pipeline capable of producing agents with sophisticated team play behavior, bridging the shortcomings of heuristic methods. Further, we created an open-source multi-agent soccer environment based on Isaac Gym. Utilizing our MARL framework and a modified global entity encoder as our core architecture, our approach achieves a 66.8% win rate against the HELIOS agent, which employs a state-of-the-art heuristic strategy. Furthermore, we provide an in-depth analysis of the policy behavior and interpret the agent's intentions using the critic network.

Breaking the Curse of Multiagency in Robust Multi-Agent Reinforcement Learning

Authors:Laixi Shi, Jingchu Gai, Eric Mazumdar, Yuejie Chi, Adam Wierman
Date:2024-09-30 08:09:41

Standard multi-agent reinforcement learning (MARL) algorithms are vulnerable to sim-to-real gaps. To address this, distributionally robust Markov games (RMGs) have been proposed to enhance robustness in MARL by optimizing the worst-case performance when game dynamics shift within a prescribed uncertainty set. RMGs remain under-explored, from reasonable problem formulation to the development of sample-efficient algorithms. Two notorious and open challenges are the formulation of the uncertainty set and whether the corresponding RMGs can overcome the curse of multiagency, where the sample complexity scales exponentially with the number of agents. In this work, we propose a natural class of RMGs inspired by behavioral economics, where each agent's uncertainty set is shaped by both the environment and the integrated behavior of other agents. We first establish the well-posedness of this class of RMGs by proving the existence of game-theoretic solutions such as robust Nash equilibria and coarse correlated equilibria (CCE). Assuming access to a generative model, we then introduce a sample-efficient algorithm for learning the CCE whose sample complexity scales polynomially with all relevant parameters. To the best of our knowledge, this is the first algorithm to break the curse of multiagency for RMGs, regardless of the uncertainty set formulation.

Value-Based Deep Multi-Agent Reinforcement Learning with Dynamic Sparse Training

Authors:Pihe Hu, Shaolong Li, Zhuoran Li, Ling Pan, Longbo Huang
Date:2024-09-28 15:57:24

Deep Multi-agent Reinforcement Learning (MARL) relies on neural networks with numerous parameters in multi-agent scenarios, often incurring substantial computational overhead. Consequently, there is an urgent need to expedite training and enable model compression in MARL. This paper proposes the utilization of dynamic sparse training (DST), a technique proven effective in deep supervised learning tasks, to alleviate the computational burdens in MARL training. However, a direct adoption of DST fails to yield satisfactory MARL agents, leading to breakdowns in value learning within deep sparse value-based MARL models. Motivated by this challenge, we introduce an innovative Multi-Agent Sparse Training (MAST) framework aimed at simultaneously enhancing the reliability of learning targets and the rationality of sample distribution to improve value learning in sparse models. Specifically, MAST incorporates the Soft Mellowmax Operator with a hybrid TD-($\lambda$) schema to establish dependable learning targets. Additionally, it employs a dual replay buffer mechanism to enhance the distribution of training samples. Building upon these aspects, MAST utilizes gradient-based topology evolution to exclusively train multiple MARL agents using sparse networks. Our comprehensive experimental investigation across various value-based MARL algorithms on multiple benchmarks demonstrates, for the first time, significant reductions in redundancy of up to $20\times$ in Floating Point Operations (FLOPs) for both training and inference, with less than $3\%$ performance degradation.
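
The Soft Mellowmax component builds on the standard mellowmax operator, reproduced below in isolation: a temperature-controlled smooth alternative to the hard max over Q-values. How MAST combines it with the hybrid TD($\lambda$) schema is not reproduced here.

```python
import torch

def mellowmax(q_values: torch.Tensor, omega: float = 10.0, dim: int = -1):
    """mm_w(q) = (logsumexp(w * q) - log n) / w; a smooth, tunable max."""
    n = q_values.shape[dim]
    return (torch.logsumexp(omega * q_values, dim=dim)
            - torch.log(torch.tensor(float(n)))) / omega

q = torch.tensor([[1.0, 2.0, 3.0]])
print(mellowmax(q, omega=10.0))   # close to max(q) = 3
print(mellowmax(q, omega=0.1))    # close to mean(q) = 2
```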

Multi-agent Reinforcement Learning for Dynamic Dispatching in Material Handling Systems

Authors:Xian Yeow Lee, Haiyan Wang, Daisuke Katsumata, Takaharu Matsui, Chetan Gupta
Date:2024-09-27 03:57:54

This paper proposes a multi-agent reinforcement learning (MARL) approach to learn dynamic dispatching strategies, which is crucial for optimizing throughput in material handling systems across diverse industries. To benchmark our method, we developed a material handling environment that reflects the complexities of an actual system, such as various activities at different locations, physical constraints, and inherent uncertainties. To enhance exploration during learning, we propose a method to integrate domain knowledge in the form of existing dynamic dispatching heuristics. Our experimental results show that our method can outperform heuristics by up to 7.4 percent in terms of median throughput. Additionally, we analyze the effect of different architectures on MARL performance when training multiple agents with different functions. We also demonstrate that the MARL agents' performance can be further improved by using the first iteration of MARL agents as heuristics to train a second iteration of MARL agents. This work demonstrates the potential of applying MARL to learn effective dynamic dispatching strategies that may be deployed in real-world systems to improve business outcomes.

Cat-and-Mouse Satellite Dynamics: Divergent Adversarial Reinforcement Learning for Contested Multi-Agent Space Operations

Authors:Cameron Mehlman, Joseph Abramov, Gregory Falco
Date:2024-09-26 00:32:56

As space becomes increasingly crowded and contested, robust autonomous capabilities for multi-agent environments are gaining critical importance. Current autonomous systems in space primarily rely on optimization-based path planning or long-range orbital maneuvers, which have not yet proven effective in adversarial scenarios where one satellite is actively pursuing another. We introduce Divergent Adversarial Reinforcement Learning (DARL), a two-stage Multi-Agent Reinforcement Learning (MARL) approach designed to train autonomous evasion strategies for satellites engaged with multiple adversarial spacecraft. Our method enhances exploration during training by promoting diverse adversarial strategies, leading to more robust and adaptable evader models. We validate DARL through a cat-and-mouse satellite scenario, modeled as a partially observable multi-agent capture the flag game where two adversarial `cat' spacecraft pursue a single `mouse' evader. DARL's performance is compared against several benchmarks, including an optimization-based satellite path planner, demonstrating its ability to produce highly robust models for adversarial multi-agent space environments.

Language Grounded Multi-agent Reinforcement Learning with Human-interpretable Communication

Authors:Huao Li, Hossein Nourkhiz Mahjoub, Behdad Chalaki, Vaishnav Tadiparthi, Kwonjoon Lee, Ehsan Moradi-Pari, Charles Michael Lewis, Katia P Sycara
Date:2024-09-25 20:49:41

Multi-Agent Reinforcement Learning (MARL) methods have shown promise in enabling agents to learn a shared communication protocol from scratch and accomplish challenging team tasks. However, the learned language is usually not interpretable to humans or other agents not co-trained together, limiting its applicability in ad-hoc teamwork scenarios. In this work, we propose a novel computational pipeline that aligns the communication space between MARL agents with an embedding space of human natural language by grounding agent communications on synthetic data generated by embodied Large Language Models (LLMs) in interactive teamwork scenarios. Our results demonstrate that introducing language grounding not only maintains task performance but also accelerates the emergence of communication. Furthermore, the learned communication protocols exhibit zero-shot generalization capabilities in ad-hoc teamwork scenarios with unseen teammates and novel task states. This work presents a significant step toward enabling effective communication and collaboration between artificial agents and humans in real-world teamwork settings.

Multi-UAV Pursuit-Evasion with Online Planning in Unknown Environments by Deep Reinforcement Learning

Authors:Jiayu Chen, Chao Yu, Guosheng Li, Wenhao Tang, Xinyi Yang, Botian Xu, Huazhong Yang, Yu Wang
Date:2024-09-24 08:40:04

Multi-UAV pursuit-evasion, where pursuers aim to capture evaders, poses a key challenge for UAV swarm intelligence. Multi-agent reinforcement learning (MARL) has demonstrated potential in modeling cooperative behaviors, but most RL-based approaches remain constrained to simplified simulations with limited dynamics or fixed scenarios. Previous attempts to deploy RL policy to real-world pursuit-evasion are largely restricted to two-dimensional scenarios, such as ground vehicles or UAVs at fixed altitudes. In this paper, we address multi-UAV pursuit-evasion by considering UAV dynamics and physical constraints. We introduce an evader prediction-enhanced network to tackle partial observability in cooperative strategy learning. Additionally, we propose an adaptive environment generator within MARL training, enabling higher exploration efficiency and better policy generalization across diverse scenarios. Simulations show our method significantly outperforms all baselines in challenging scenarios, generalizing to unseen scenarios with a 100% capture rate. Finally, we derive a feasible policy via a two-stage reward refinement and deploy the policy on real quadrotors in a zero-shot manner. To our knowledge, this is the first work to derive and deploy an RL-based policy using collective thrust and body rates control commands for multi-UAV pursuit-evasion in unknown environments. The open-source code and videos are available at https://sites.google.com/view/pursuit-evasion-rl.

Scalable Multi-agent Reinforcement Learning for Factory-wide Dynamic Scheduling

Authors:Jaeyeon Jang, Diego Klabjan, Han Liu, Nital S. Patel, Xiuqi Li, Balakrishnan Ananthanarayanan, Husam Dauod, Tzung-Han Juang
Date:2024-09-20 15:16:37

Real-time dynamic scheduling is a crucial but notoriously challenging task in modern manufacturing processes due to its high decision complexity. Recently, reinforcement learning (RL) has been gaining attention as an impactful technique to handle this challenge. However, classical RL methods typically rely on human-made dispatching rules, which are not suitable for large-scale factory-wide scheduling. To bridge this gap, this paper applies a leader-follower multi-agent RL (MARL) concept to obtain desired coordination after decomposing the scheduling problem into a set of sub-problems that are handled by each individual agent for scalability. We further strengthen the procedure by proposing a rule-based conversion algorithm to prevent catastrophic loss of production capacity due to an agent's error. Our experimental results demonstrate that the proposed model outperforms the state-of-the-art deep RL-based scheduling models in various aspects. Additionally, the proposed model provides the most robust scheduling performance to demand changes. Overall, the proposed MARL-based scheduling model presents a promising solution to the real-time scheduling problem, with potential applications in various manufacturing industries.

Putting Data at the Centre of Offline Multi-Agent Reinforcement Learning

Authors:Claude Formanek, Louise Beyers, Callum Rhys Tilbury, Jonathan P. Shock, Arnu Pretorius
Date:2024-09-18 14:13:24

Offline multi-agent reinforcement learning (MARL) is an exciting direction of research that uses static datasets to find optimal control policies for multi-agent systems. Though the field is by definition data-driven, efforts have thus far neglected data in their drive to achieve state-of-the-art results. We first substantiate this claim by surveying the literature, showing how the majority of works generate their own datasets without consistent methodology and provide sparse information about the characteristics of these datasets. We then show why neglecting the nature of the data is problematic, through salient examples of how tightly algorithmic performance is coupled to the dataset used, necessitating a common foundation for experiments in the field. In response, we take a big step towards improving data usage and data awareness in offline MARL, with three key contributions: (1) a clear guideline for generating novel datasets; (2) a standardisation of over 80 existing datasets, hosted in a publicly available repository, using a consistent storage format and easy-to-use API; and (3) a suite of analysis tools that allow us to understand these datasets better, aiding further development.

XP-MARL: Auxiliary Prioritization in Multi-Agent Reinforcement Learning to Address Non-Stationarity

Authors:Jianye Xu, Omar Sobhy, Bassam Alrifaee
Date:2024-09-18 10:10:55

Non-stationarity poses a fundamental challenge in Multi-Agent Reinforcement Learning (MARL), arising from agents simultaneously learning and altering their policies. This creates a non-stationary environment from the perspective of each individual agent, often leading to suboptimal or even unconverged learning outcomes. We propose an open-source framework named XP-MARL, which augments MARL with auxiliary prioritization to address this challenge in cooperative settings. XP-MARL is 1) founded upon our hypothesis that prioritizing agents and letting higher-priority agents establish their actions first would stabilize the learning process and thus mitigate non-stationarity and 2) enabled by our proposed mechanism called action propagation, where higher-priority agents act first and communicate their actions, providing a more stationary environment for others. Moreover, instead of using a predefined or heuristic priority assignment, XP-MARL learns priority-assignment policies with an auxiliary MARL problem, leading to a joint learning scheme. Experiments in a motion-planning scenario involving Connected and Automated Vehicles (CAVs) demonstrate that XP-MARL improves the safety of a baseline model by 84.4% and outperforms a state-of-the-art approach, which improves the baseline by only 12.8%. Code: github.com/cas-lab-munich/sigmarl
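
The action-propagation mechanism admits a compact sketch: agents act in (learned) priority order, and each lower-priority agent sees the actions already committed by higher-priority ones as part of its input. The fixed-slot encoding of prior actions below is our illustrative choice, not the framework's actual code.

```python
import torch

def act_with_propagation(agents, obs, priorities, act_dim=2):
    """Higher-priority agents act first; their actions feed later agents."""
    n = len(agents)
    order = sorted(range(n), key=lambda i: -priorities[i])  # high priority first
    slots = torch.zeros(n, act_dim)      # zero slot = "has not acted yet"
    actions = {}
    for i in order:
        inp = torch.cat([obs[i], slots.flatten()])
        actions[i] = agents[i](inp)      # observes already-committed actions
        slots[i] = actions[i].detach()   # propagate to lower-priority agents
    return actions

# toy usage: 3 agents, obs dim 4, input = 4 + 3 agents * 2 action dims = 10
agents = [torch.nn.Linear(10, 2) for _ in range(3)]
obs = [torch.randn(4) for _ in range(3)]
out = act_with_propagation(agents, obs, priorities=[0.2, 0.9, 0.5])
```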

Hyper-SAMARL: Hypergraph-based Coordinated Task Allocation and Socially-aware Navigation for Multi-Robot Systems

Authors:Weizheng Wang, Aniket Bera, Byung-Cheol Min
Date:2024-09-17 21:20:17

A team of multiple robots seamlessly and safely working in human-filled public environments requires adaptive task allocation and socially-aware navigation that account for dynamic human behavior. Current approaches struggle with highly dynamic pedestrian movement and the need for flexible task allocation. We propose Hyper-SAMARL, a hypergraph-based system for multi-robot task allocation and socially-aware navigation, leveraging multi-agent reinforcement learning (MARL). Hyper-SAMARL models the environmental dynamics between robots, humans, and points of interest (POIs) using a hypergraph, enabling adaptive task assignment and socially-compliant navigation through a hypergraph diffusion mechanism. Our framework, trained with MARL, effectively captures interactions between robots and humans, adapting tasks based on real-time changes in human activity. Experimental results demonstrate that Hyper-SAMARL outperforms baseline models in terms of social navigation, task completion efficiency, and adaptability in various simulated scenarios.

Symplectic Reduction in Infinite Dimensions

Authors:Tobias Diez, Gerd Rudolph
Date:2024-09-09 17:35:53

This paper develops a theory of symplectic reduction in the infinite-dimensional setting, covering both the regular and singular case. Extending the classical work of Marsden, Weinstein, Sjamaar and Lerman, we address challenges unique to infinite dimensions, such as the failure of the Darboux theorem and the absence of the Marle-Guillemin-Sternberg normal form. Our novel approach centers on a normal form of only the momentum map, for which we utilize new local normal form theorems for smooth equivariant maps in the infinite-dimensional setting. This normal form is then used to formulate the theory of singular symplectic reduction in infinite dimensions. We apply our results to important examples like the Yang-Mills equation and the Teichm\"uller space over a Riemann surface.
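
For orientation, the classical finite-dimensional statement that this work generalizes is the Marsden-Weinstein quotient; the following is the textbook formulation, not the paper's infinite-dimensional result. Given a symplectic manifold $(M, \omega)$ carrying a Hamiltonian action of a Lie group $G$ with equivariant momentum map $\mu : M \to \mathfrak{g}^*$, if $0$ is a regular value of $\mu$ and $G$ acts freely and properly on $\mu^{-1}(0)$, then the reduced space
\[
  M_{\mathrm{red}} \;=\; \mu^{-1}(0)/G
\]
carries a unique symplectic form $\omega_{\mathrm{red}}$ satisfying $\pi^* \omega_{\mathrm{red}} = \iota^* \omega$, where $\iota : \mu^{-1}(0) \hookrightarrow M$ is the inclusion and $\pi : \mu^{-1}(0) \to M_{\mathrm{red}}$ the quotient projection. The paper's contribution is to recover such statements when $M$ is infinite-dimensional, where the Darboux theorem fails and the Marle-Guillemin-Sternberg normal form is unavailable.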

Cooperative Decision-Making for CAVs at Unsignalized Intersections: A MARL Approach with Attention and Hierarchical Game Priors

Authors:Jiaqi Liu, Peng Hang, Xiaoxiang Na, Chao Huang, Jian Sun
Date:2024-09-09 15:22:36

The development of autonomous vehicles has shown great potential to enhance the efficiency and safety of transportation systems. However, the decision-making issue in complex human-machine mixed traffic scenarios, such as unsignalized intersections, remains a challenge for autonomous vehicles. While reinforcement learning (RL) has been used to solve complex decision-making problems, existing RL methods still have limitations in dealing with cooperative decision-making of multiple connected autonomous vehicles (CAVs), ensuring safety during exploration, and simulating realistic human driver behaviors. In this paper, a novel and efficient algorithm, Multi-Agent Game-prior Attention Deep Deterministic Policy Gradient (MA-GA-DDPG), is proposed to address these limitations. Our proposed algorithm formulates the decision-making problem of CAVs at unsignalized intersections as a decentralized multi-agent reinforcement learning problem and incorporates an attention mechanism to capture interaction dependencies between the ego CAV and other agents. The attention weights between the ego vehicle and other agents are then used to screen interaction objects and obtain prior hierarchical game relations, based on which a safety inspector module is designed to improve traffic safety. Furthermore, both simulation and hardware-in-the-loop experiments were conducted, demonstrating that our method outperforms other baseline approaches in terms of driving safety, efficiency, and comfort.
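
A generic scaled-dot-product version of the ego-to-neighbor attention step might look like the following Python sketch; the feature dimensions and random inputs are illustrative, not the paper's exact architecture.

import numpy as np

def ego_attention(ego_feat, other_feats, Wq, Wk):
    """Return the ego vehicle's attention weights over surrounding agents."""
    q = ego_feat @ Wq                          # query from the ego state
    keys = other_feats @ Wk                    # keys from neighboring agents
    scores = keys @ q / np.sqrt(q.shape[0])    # scaled dot products
    w = np.exp(scores - scores.max())
    return w / w.sum()                         # softmax -> interaction weights

rng = np.random.default_rng(0)
w = ego_attention(rng.normal(size=8), rng.normal(size=(5, 8)),
                  rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))
print(w)   # weights like these can then screen interaction objects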

Adaptive Multi-Layer Deployment for A Digital Twin Empowered Satellite-Terrestrial Integrated Network

Authors:Yihong Tao, Bo Lei, Haoyang Shi, Jingkai Chen, Xing Zhang
Date:2024-09-09 10:14:30

With the development of satellite communication technology, satellite-terrestrial integrated networks (STIN), which integrate satellite networks and ground networks, can realize seamless global coverage of communication services. Confronting the intricacies of network dynamics, the diversity of resource heterogeneity, and the unpredictability of user mobility, dynamic resource allocation within networks faces formidable challenges. Digital twin (DT), as a new technique, can reflect a physical network to a virtual network to monitor, analyze, and optimize the physical network. Nevertheless, in the process of constructing the DT model, the deployment location and resource allocation of DTs may adversely affect its performance. Therefore, we propose a STIN model, which alleviates the problem of insufficient single-layer deployment flexibility of the traditional edge network by deploying DTs in multi-layer nodes in a STIN. To address the challenge of deploying DTs in the network, we propose multi-layer DT deployment in a STIN to reduce system delay. Then we adopt a multi-agent reinforcement learning (MARL) scheme to explore the optimal strategy of the DT multi-layer deployment problem. The implemented scheme demonstrates a notable reduction in system delay, as evidenced by simulation outcomes.

An Introduction to Centralized Training for Decentralized Execution in Cooperative Multi-Agent Reinforcement Learning

Authors:Christopher Amato
Date:2024-09-04 19:54:40

Multi-agent reinforcement learning (MARL) has exploded in popularity in recent years. Many approaches have been developed, but they can be divided into three main types: centralized training and execution (CTE), centralized training for decentralized execution (CTDE), and decentralized training and execution (DTE). CTDE methods are the most common as they can use centralized information during training but execute in a decentralized manner -- using only information available to that agent during execution. CTDE is the only paradigm that requires a separate training phase where any available information (e.g., other agent policies, underlying states) can be used. As a result, they can be more scalable than CTE methods, do not require communication during execution, and can often perform well. CTDE fits most naturally with the cooperative case, but can potentially be applied in competitive or mixed settings depending on what information is assumed to be observed. This text is an introduction to CTDE in cooperative MARL. It is meant to explain the setting, basic concepts, and common methods. It does not cover all work in CTDE MARL as the subarea is quite extensive. I have included work that I believe is important for understanding the main concepts in the subarea and apologize to those whose work I have omitted.
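
A minimal skeleton of the CTDE setup, sketched in Python with PyTorch: each actor consumes only its local observation, while the critic, used only during training, consumes the joint observation. Network sizes are arbitrary and this is a generic illustration, not any specific algorithm from the text.

import torch
import torch.nn as nn

n_agents, obs_dim, act_dim = 3, 10, 4

# Decentralized actors: one network per agent, local observations only.
actors = [nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                        nn.Linear(64, act_dim)) for _ in range(n_agents)]

# Centralized critic: sees the joint observation, but only during training.
critic = nn.Sequential(nn.Linear(n_agents * obs_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))

obs = torch.randn(n_agents, obs_dim)
logits = [actor(obs[i]) for i, actor in enumerate(actors)]   # decentralized
value = critic(obs.flatten())                                # centralized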

Cooperative Path Planning with Asynchronous Multiagent Reinforcement Learning

Authors:Jiaming Yin, Weixiong Rao, Yu Xiao, Keshuang Tang
Date:2024-09-01 15:48:14

In this paper, we study the shortest path problem (SPP) with multiple source-destination pairs (MSD), namely MSD-SPP, to minimize the average travel time of all shortest paths. The inherent traffic capacity limits within a road network contribute to the competition among vehicles. Multi-agent reinforcement learning (MARL) models cannot offer effective and efficient path-planning cooperation due to the asynchronous decision-making setting in MSD-SPP, where vehicles (a.k.a. agents) cannot simultaneously complete routing actions in the previous time step. To tackle the efficiency issue, we propose to divide an entire road network into multiple sub-graphs and subsequently execute a two-stage process of inter-region and intra-region route planning. To address the asynchrony issue, in the proposed asyn-MARL framework, we first design a global state that exploits a low-dimensional vector to implicitly represent the joint observations and actions of multi-agents. Then we develop a novel trajectory collection mechanism to decrease the redundancy in training trajectories. Additionally, we design a novel actor network to facilitate cooperation among vehicles heading towards the same or close destinations, together with a reachability graph aimed at preventing infinite loops in routing paths. On both synthetic and real road networks, our evaluation results demonstrate that our approach outperforms state-of-the-art planning approaches.

Learning Multi-agent Multi-machine Tending by Mobile Robots

Authors:Abdalwhab Abdalwhab, Giovanni Beltrame, Samira Ebrahimi Kahou, David St-Onge
Date:2024-08-29 19:57:52

Robotics can help address the growing worker shortage challenge of the manufacturing industry. As such, machine tending is a task collaborative robots can tackle that can also substantially boost productivity. Nevertheless, existing robotics systems deployed in that sector rely on a fixed single-arm setup, whereas mobile robots can provide more flexibility and scalability. In this work, we introduce a multi-agent multi-machine tending learning framework for mobile robots based on Multi-Agent Reinforcement Learning (MARL) techniques, with the design of a suitable observation and reward. Moreover, an attention-based encoding mechanism is developed and integrated into the Multi-Agent Proximal Policy Optimization (MAPPO) algorithm to boost its performance for machine-tending scenarios. Our model (AB-MAPPO) outperforms MAPPO in this new challenging scenario in terms of task success, safety, and resource utilization. Furthermore, we provide an extensive ablation study to support our various design decisions.

Exploiting Approximate Symmetry for Efficient Multi-Agent Reinforcement Learning

Authors:Batuhan Yardim, Niao He
Date:2024-08-27 16:11:20

Mean-field games (MFG) have become significant tools for solving large-scale multi-agent reinforcement learning problems under symmetry. However, the assumption of exact symmetry limits the applicability of MFGs, as real-world scenarios often feature inherent heterogeneity. Furthermore, most works on MFG assume access to a known MFG model, which might not be readily available for real-world finite-agent games. In this work, we broaden the applicability of MFGs by providing a methodology to extend any finite-player, possibly asymmetric, game to an "induced MFG". First, we prove that $N$-player dynamic games can be symmetrized and smoothly extended to the infinite-player continuum via explicit Kirszbraun extensions. Next, we propose the notion of $\alpha,\beta$-symmetric games, a new class of dynamic population games that incorporate approximate permutation invariance. For $\alpha,\beta$-symmetric games, we establish explicit approximation bounds, demonstrating that a Nash policy of the induced MFG is an approximate Nash of the $N$-player dynamic game. We show that TD learning converges up to a small bias using trajectories of the $N$-player game with finite-sample guarantees, permitting symmetrized learning without building an explicit MFG model. Finally, for certain games satisfying monotonicity, we prove a sample complexity of $\widetilde{\mathcal{O}}(\varepsilon^{-6})$ for the $N$-agent game to learn an $\varepsilon$-Nash up to symmetrization bias. Our theory is supported by evaluations on MARL benchmarks with thousands of agents.

On Centralized Critics in Multi-Agent Reinforcement Learning

Authors:Xueguang Lyu, Andrea Baisero, Yuchen Xiao, Brett Daley, Christopher Amato
Date:2024-08-26 19:27:06

Centralized Training for Decentralized Execution, where agents are trained offline in a centralized fashion and execute online in a decentralized manner, has become a popular approach in Multi-Agent Reinforcement Learning (MARL). In particular, it has become popular to develop actor-critic methods that train decentralized actors with a centralized critic, where the centralized critic is allowed access to global information about the entire system, including the true system state. Such centralized critics are possible given offline information and are not used for online execution. While these methods perform well in a number of domains and have become a de facto standard in MARL, using a centralized critic in this context has yet to be sufficiently analyzed theoretically or empirically. In this paper, we therefore formally analyze centralized and decentralized critic approaches, and analyze the effect of using state-based critics in partially observable environments. We derive theories contrary to the common intuition: critic centralization is not strictly beneficial, and using state values can be harmful. We further prove that, in particular, state-based critics can introduce unexpected bias and variance compared to history-based critics. Finally, we demonstrate how the theory applies in practice by comparing different forms of critics on a wide range of common multi-agent benchmarks. The experiments show practical issues such as the difficulty of representation learning with partial observability, which highlights why the theoretical problems are often overlooked in the literature.

MASQ: Multi-Agent Reinforcement Learning for Single Quadruped Robot Locomotion

Authors:Qi Liu, Jingxiang Guo, Sixu Lin, Shuaikang Ma, Jinxuan Zhu, Yanjie Li
Date:2024-08-25 08:04:20

This paper proposes a novel method to improve locomotion learning for a single quadruped robot using multi-agent deep reinforcement learning (MARL). Many existing methods use single-agent reinforcement learning for an individual robot or MARL for cooperative tasks in multi-robot systems. Unlike existing methods, this paper proposes using MARL for the locomotion learning of a single quadruped robot. We develop a learning structure called Multi-Agent Reinforcement Learning for Single Quadruped Robot Locomotion (MASQ), considering each leg as an agent to explore the action space of the quadruped robot, sharing a global critic, and learning collaboratively. Experimental results indicate that MASQ not only speeds up learning convergence but also enhances robustness in real-world settings, suggesting that applying MASQ to single robots such as quadrupeds could surpass traditional single-robot reinforcement learning approaches. Our study provides insightful guidance on integrating MARL with single-robot locomotion learning.

Hybrid Training for Enhanced Multi-task Generalization in Multi-agent Reinforcement Learning

Authors:Mingliang Zhang, Sichang Su, Chengyang He, Guillaume Sartoretti
Date:2024-08-24 12:37:03

In multi-agent reinforcement learning (MARL), achieving multi-task generalization to diverse agents and objectives presents significant challenges. Existing online MARL algorithms primarily focus on single-task performance, but their lack of multi-task generalization capabilities typically results in substantial computational waste and limited real-life applicability. Meanwhile, existing offline multi-task MARL approaches are heavily dependent on data quality, often resulting in poor performance on unseen tasks. In this paper, we introduce HyGen, a novel hybrid MARL framework, Hybrid Training for Enhanced Multi-Task Generalization, which integrates online and offline learning to ensure both multi-task generalization and training efficiency. Specifically, our framework extracts potential general skills from offline multi-task datasets. We then train policies to select the optimal skills under the centralized training and decentralized execution (CTDE) paradigm. During this stage, we utilize a replay buffer that integrates both offline data and online interactions. We empirically demonstrate that our framework effectively extracts and refines general skills, yielding impressive generalization to unseen tasks. Comparative analyses on the StarCraft multi-agent challenge show that HyGen outperforms a wide range of existing solely online and offline methods.
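
The hybrid replay idea, mixing a static offline dataset with online interactions, can be sketched as below; the 50/50 mixing ratio and buffer layout are assumptions for illustration, not HyGen's actual implementation.

import random

class HybridBuffer:
    """Replay buffer that samples from both offline data and online rollouts."""

    def __init__(self, offline_data, capacity=10_000, offline_ratio=0.5):
        self.offline = offline_data        # static pre-collected transitions
        self.online = []                   # filled during interaction
        self.capacity, self.ratio = capacity, offline_ratio

    def add(self, transition):
        self.online.append(transition)
        if len(self.online) > self.capacity:
            self.online.pop(0)             # drop the oldest online transition

    def sample(self, batch_size):
        k = int(batch_size * self.ratio) if self.online else batch_size
        batch = random.sample(self.offline, k)
        if self.online:
            batch += random.choices(self.online, k=batch_size - k)
        return batch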

Diffusion-based Episodes Augmentation for Offline Multi-Agent Reinforcement Learning

Authors:Jihwan Oh, Sungnyun Kim, Gahee Kim, Sunghwan Kim, Se-Young Yun
Date:2024-08-23 14:17:17

Offline multi-agent reinforcement learning (MARL) is increasingly recognized as crucial for effectively deploying RL algorithms in environments where real-time interaction is impractical, risky, or costly. In the offline setting, learning from a static dataset of past interactions allows for the development of robust and safe policies without the need for live data collection, which can be fraught with challenges. Building on this foundational importance, we present EAQ, Episodes Augmentation guided by Q-total loss, a novel approach for the offline MARL framework utilizing diffusion models. EAQ integrates the Q-total function directly into the diffusion model as a guidance to maximize the global returns in an episode, eliminating the need for separate training. Our focus primarily lies on cooperative scenarios, where agents are required to act collectively towards achieving a shared goal -- essentially, maximizing global returns. Consequently, we demonstrate that our episodes augmentation in a collaborative manner significantly boosts offline MARL algorithms compared to the original dataset, improving the normalized return by +17.3% and +12.9% for medium and poor behavioral policies in the SMAC simulator, respectively.
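
Conceptually, using a Q-total function as diffusion guidance resembles classifier guidance: each reverse-diffusion step adds a gradient of the return estimate, so generated episodes drift toward higher global return. A toy PyTorch sketch of one such step follows; the denoiser, Q-network, and guidance scale are placeholders, not EAQ's actual update.

import torch

def guided_denoise_step(x, t, denoiser, q_total, guidance_scale=0.1):
    """One reverse-diffusion step nudged by the gradient of a Q-total estimate."""
    x = x.detach().requires_grad_(True)
    q = q_total(x).sum()                       # global-return estimate
    grad = torch.autograd.grad(q, x)[0]        # direction of higher return
    with torch.no_grad():
        x_denoised = denoiser(x, t)            # standard reverse step
        return x_denoised + guidance_scale * grad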

Empirical Equilibria in Agent-based Economic systems with Learning agents

Authors:Kshama Dwarakanath, Svitlana Vyetrenko, Tucker Balch
Date:2024-08-21 23:47:46

We present an agent-based simulator for economic systems with heterogeneous households, firms, central bank, and government agents. These agents interact to define production, consumption, and monetary flow. Each agent type has distinct objectives, such as households seeking utility from consumption and the central bank targeting inflation and production. We define this multi-agent economic system using an OpenAI Gym-style environment, enabling agents to optimize their objectives through reinforcement learning. Standard multi-agent reinforcement learning (MARL) schemes, like independent learning, enable agents to learn concurrently but do not address whether the resulting strategies are at equilibrium. This study integrates the Policy Space Response Oracle (PSRO) algorithm, which has shown superior performance over independent MARL in games with homogeneous agents, with economic agent-based modeling. We use PSRO to develop agent policies approximating Nash equilibria of the empirical economic game, thereby linking to economic equilibria. Our results demonstrate that PSRO strategies achieve lower regret values than independent MARL strategies in our economic system with four agent types. This work aims to bridge artificial intelligence, economics, and empirical game theory towards future research.
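
For reference, the PSRO loop used here follows a standard shape: evaluate the empirical payoff matrix over the current policy population, solve it with a meta-solver, and train a best response to the resulting mixture. A schematic Python version, where best_response(), evaluate(), and meta_solver() are placeholders for an RL trainer, a game simulator, and an equilibrium solver:

def psro(init_policy, best_response, evaluate, meta_solver, iters=10):
    """Grow a policy population by iterated best response to the meta-strategy."""
    population, sigma = [init_policy], None
    for _ in range(iters):
        # Empirical payoff matrix of every policy pair in the population.
        payoffs = [[evaluate(a, b) for b in population] for a in population]
        sigma = meta_solver(payoffs)           # e.g. Nash of the empirical game
        population.append(best_response(population, sigma))
    return population, sigma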

Hokoff: Real Game Dataset from Honor of Kings and its Offline Reinforcement Learning Benchmarks

Authors:Yun Qu, Boyuan Wang, Jianzhun Shao, Yuhang Jiang, Chen Chen, Zhenbin Ye, Lin Liu, Junfeng Yang, Lin Lai, Hongyang Qin, Minwen Deng, Juchao Zhuo, Deheng Ye, Qiang Fu, Wei Yang, Guang Yang, Lanxiao Huang, Xiangyang Ji
Date:2024-08-20 05:38:50

The advancement of Offline Reinforcement Learning (RL) and Offline Multi-Agent Reinforcement Learning (MARL) critically depends on the availability of high-quality, pre-collected offline datasets that represent real-world complexities and practical applications. However, existing datasets often fall short due to their simplicity and lack of realism. To address this gap, we propose Hokoff, a comprehensive set of pre-collected datasets that covers both offline RL and offline MARL, accompanied by a robust framework, to facilitate further research. This data is derived from Honor of Kings, a recognized Multiplayer Online Battle Arena (MOBA) game known for its intricate nature, closely resembling real-life situations. Utilizing this framework, we benchmark a variety of offline RL and offline MARL algorithms. We also introduce a novel baseline algorithm tailored for the inherent hierarchical action space of the game. We reveal the shortcomings of current offline RL approaches in handling task complexity, generalization, and multi-task learning.

Algorithmic Contract Design with Reinforcement Learning Agents

Authors:David Molina Concha, Kyeonghyeon Park, Hyun-Rok Lee, Taesik Lee, Chi-Guhn Lee
Date:2024-08-19 03:48:38

We introduce a novel problem setting for algorithmic contract design, named the principal-MARL contract design problem. This setting extends traditional contract design to account for dynamic and stochastic environments using Markov Games and Multi-Agent Reinforcement Learning. To tackle this problem, we propose a Multi-Objective Bayesian Optimization (MOBO) framework named Constrained Pareto Maximum Entropy Search (cPMES). Our approach integrates MOBO and MARL to explore the highly constrained contract design space, identifying promising incentive and recruitment decisions. cPMES transforms the principal-MARL contract design problem into an unconstrained multi-objective problem, leveraging the probability of feasibility as part of the objectives and ensuring promising designs predicted on the feasibility border are included in the Pareto front. By focusing the entropy prediction on designs within the Pareto set, cPMES mitigates the risk of the search strategy being overwhelmed by entropy from constraints. We demonstrate the effectiveness of cPMES through extensive benchmark studies in synthetic and simulated environments, showing its ability to find feasible contract designs that maximize the principal's objectives. Additionally, we provide theoretical support with a sub-linear regret bound concerning the number of iterations.

Multi-Agent Reinforcement Learning for Autonomous Driving: A Survey

Authors:Ruiqi Zhang, Jing Hou, Florian Walter, Shangding Gu, Jiayi Guan, Florian Röhrbein, Yali Du, Panpan Cai, Guang Chen, Alois Knoll
Date:2024-08-19 03:31:20

Reinforcement Learning (RL) is a potent tool for sequential decision-making and has achieved performance surpassing human capabilities across many challenging real-world tasks. As the extension of RL to the multi-agent system domain, multi-agent RL (MARL) not only needs to learn the control policy but also requires consideration of interactions with all other agents in the environment, mutual influences among different system components, and the distribution of computational resources. This augments the complexity of algorithmic design and poses higher requirements on computational resources. Simultaneously, simulators are crucial for obtaining realistic data, which is fundamental to RL. In this paper, we first propose a series of metrics for simulators and summarize the features of existing benchmarks. Second, to ease comprehension, we review the foundational knowledge and then synthesize recent advances in MARL-based autonomous driving and intelligent transportation systems. Specifically, we examine their environmental modeling, state representation, perception units, and algorithm design. Finally, we discuss open challenges as well as prospects and opportunities. We hope this paper can help researchers integrate MARL technologies and trigger more insightful ideas toward intelligent and autonomous driving.

GNN-Empowered Effective Partial Observation MARL Method for AoI Management in Multi-UAV Network

Authors:Yuhao Pan, Xiucheng Wang, Zhiyao Xu, Nan Cheng, Wenchao Xu, Jun-jie Zhang
Date:2024-08-18 02:29:10

Unmanned Aerial Vehicles (UAVs), due to their low cost and high flexibility, have been widely used in various scenarios to enhance network performance. However, the optimization of UAV trajectories in unknown areas, or areas without sufficient prior information, still faces challenges of poor planning performance and limited distributed execution. These challenges arise when UAVs rely solely on their own observation information and the information from other UAVs within their communicable range, without access to global information. To address these challenges, this paper proposes the Qedgix framework, which combines graph neural networks (GNNs) and the QMIX algorithm to achieve distributed optimization of the Age of Information (AoI) for users in unknown scenarios. The framework utilizes GNNs to extract information from UAVs, users within the observable range, and other UAVs within the communicable range, thereby enabling effective UAV trajectory planning. Due to the discretization and temporal features of AoI indicators, the Qedgix framework employs QMIX to optimize distributed partially observable Markov decision processes (Dec-POMDP) based on centralized training and distributed execution (CTDE) with respect to the mean AoI values of users. By modeling the UAV network optimization problem in terms of AoI and applying the Kolmogorov-Arnold representation theorem, the Qedgix framework achieves efficient neural network training through parameter sharing based on permutation invariance. Simulation results demonstrate that the proposed algorithm significantly improves convergence speed while reducing the mean AoI values of users. The code is available at https://github.com/UNIC-Lab/Qedgix.
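
Since Qedgix builds on QMIX, the monotonic mixing at its core is worth sketching: per-agent utilities are combined by a state-conditioned mixing network whose weights pass through an absolute value, so Q_tot is monotone in every agent's Q and the joint argmax decomposes per agent. A simplified PyTorch version follows (the state-dependent final bias of full QMIX is omitted; sizes are illustrative).

import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """QMIX-style mixer: non-negative hypernetwork weights keep Q_tot
    monotone in each agent's utility."""

    def __init__(self, n_agents, state_dim, embed=32):
        super().__init__()
        self.n_agents, self.embed = n_agents, embed
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed)
        self.hyper_b1 = nn.Linear(state_dim, embed)
        self.hyper_w2 = nn.Linear(state_dim, embed)

    def forward(self, agent_qs, state):          # agent_qs: (batch, n_agents)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed)
        b1 = self.hyper_b1(state)
        h = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1).squeeze(1) + b1)
        w2 = torch.abs(self.hyper_w2(state))     # non-negative -> monotone
        return (h * w2).sum(dim=1, keepdim=True) # Q_tot: (batch, 1)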

SustainDC: Benchmarking for Sustainable Data Center Control

Authors:Avisek Naug, Antonio Guillen, Ricardo Luna, Vineet Gundecha, Desik Rengarajan, Sahand Ghorbanpour, Sajad Mousavi, Ashwin Ramesh Babu, Dejan Markovikj, Lekhapriya D Kashyap, Soumyendu Sarkar
Date:2024-08-14 22:43:52

Machine learning has driven an exponential increase in computational demand, leading to massive data centers that consume significant amounts of energy and contribute to climate change. This makes sustainable data center control a priority. In this paper, we introduce SustainDC, a set of Python environments for benchmarking multi-agent reinforcement learning (MARL) algorithms for data centers (DC). SustainDC supports custom DC configurations and tasks such as workload scheduling, cooling optimization, and auxiliary battery management, with multiple agents managing these operations while accounting for each other's effects. We evaluate various MARL algorithms on SustainDC, showing their performance across diverse DC designs, locations, weather conditions, grid carbon intensity, and workload requirements. Our results highlight significant opportunities for improvement of data center operations using MARL algorithms. Given the increasing use of DCs due to AI, SustainDC provides a crucial platform for the development and benchmarking of advanced algorithms essential for achieving sustainable computing and addressing other heterogeneous real-world challenges.

Bridging Training and Execution via Dynamic Directed Graph-Based Communication in Cooperative Multi-Agent Systems

Authors:Zhuohui Zhang, Bin He, Bin Cheng, Gang Li
Date:2024-08-14 09:16:42

Multi-agent systems must learn to communicate and understand interactions between agents to achieve cooperative goals in partially observed tasks. However, existing approaches lack a dynamic directed communication mechanism and rely on global states, thus diminishing the role of communication in centralized training. Therefore, we propose the Transformer-based graph coarsening network (TGCNet), a novel multi-agent reinforcement learning (MARL) algorithm. TGCNet learns the topological structure of a dynamic directed graph to represent the communication policy and integrates graph coarsening networks to approximate the representation of the global state during training. It also utilizes the Transformer decoder for feature extraction during execution. Experiments on multiple cooperative MARL benchmarks demonstrate state-of-the-art performance compared to popular MARL algorithms. Further ablation studies validate the effectiveness of our dynamic directed graph communication mechanism and graph coarsening networks.

Improving Global Parameter-sharing in Physically Heterogeneous Multi-agent Reinforcement Learning with Unified Action Space

Authors:Xiaoyang Yu, Youfang Lin, Shuo Wang, Kai Lv, Sheng Han
Date:2024-08-14 09:15:11

In a multi-agent system (MAS), action semantics indicates the different influences of agents' actions toward other entities, and can be used to divide agents into groups in a physically heterogeneous MAS. Previous multi-agent reinforcement learning (MARL) algorithms apply global parameter-sharing across different types of heterogeneous agents without careful discrimination of different action semantics. This common implementation decreases the cooperation and coordination between agents in complex situations. However, fully independent agent parameters dramatically increase the computational cost and training difficulty. In order to benefit from the usage of different action semantics while also maintaining a proper parameter-sharing structure, we introduce the Unified Action Space (UAS). The UAS is the union set of all agent actions with different semantics. All agents first calculate their unified representation in the UAS, and then generate their heterogeneous action policies using different available-action-masks. To further improve the training of extra UAS parameters, we introduce a Cross-Group Inverse (CGI) loss to predict other groups' agent policies with the trajectory information. As a universal method for solving the physically heterogeneous MARL problem, we apply the UAS to both value-based and policy-based MARL algorithms, and propose two practical algorithms: U-QMIX and U-MAPPO. Experimental results in the SMAC environment demonstrate the effectiveness of both U-QMIX and U-MAPPO compared with several state-of-the-art MARL methods.
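
The available-action-mask step can be illustrated directly: every agent scores the full unified action space, and each group masks out actions outside its semantics before sampling. A small PyTorch sketch, with sizes and the mask chosen for illustration:

import torch

def masked_policy(logits, available_mask):
    """available_mask: 1 for actions this agent group can execute, else 0."""
    masked = logits.masked_fill(available_mask == 0, float("-inf"))
    return torch.distributions.Categorical(logits=masked)

uas_logits = torch.randn(6)                  # scores over the union of actions
mask = torch.tensor([1, 1, 1, 0, 0, 1])      # this group's available actions
action = masked_policy(uas_logits, mask).sample()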

MAPPO-PIS: A Multi-Agent Proximal Policy Optimization Method with Prior Intent Sharing for CAVs' Cooperative Decision-Making

Authors:Yicheng Guo, Jiaqi Liu, Rongjie Yu, Peng Hang, Jian Sun
Date:2024-08-13 05:58:47

Vehicle-to-Vehicle (V2V) technologies have great potential for enhancing traffic flow efficiency and safety. However, cooperative decision-making in multi-agent systems, particularly in complex human-machine mixed merging areas, remains challenging for connected and autonomous vehicles (CAVs). Intent sharing, a key aspect of human coordination, may offer an effective solution to these decision-making problems, but its application in CAVs is under-explored. This paper presents an intent-sharing-based cooperative method, the Multi-Agent Proximal Policy Optimization with Prior Intent Sharing (MAPPO-PIS), which models the CAV cooperative decision-making problem as a Multi-Agent Reinforcement Learning (MARL) problem. It involves training and updating the agents' policies through the integration of two key modules: the Intention Generator Module (IGM) and the Safety Enhanced Module (SEM). The IGM is specifically crafted to generate and disseminate CAVs' intended trajectories spanning multiple future time-steps. On the other hand, the SEM serves a crucial role in assessing the safety of the decisions made and rectifying them if necessary. A merging area with mixed human-machine traffic flow is selected to validate our method. Results show that MAPPO-PIS significantly improves decision-making performance in multi-agent systems, surpassing state-of-the-art baselines in safety, efficiency, and overall traffic system performance. The code and video demo can be found at: \url{https://github.com/CCCC1dhcgd/A-MAPPO-PIS}.

Enhancing Heterogeneous Multi-Agent Cooperation in Decentralized MARL via GNN-driven Intrinsic Rewards

Authors:Jahir Sadik Monon, Deeparghya Dutta Barua, Md. Mosaddek Khan
Date:2024-08-12 21:38:40

Multi-agent Reinforcement Learning (MARL) is emerging as a key framework for various sequential decision-making and control tasks. Unlike their single-agent counterparts, multi-agent systems necessitate successful cooperation among the agents. The deployment of these systems in real-world scenarios often requires decentralized training, a diverse set of agents, and learning from infrequent environmental reward signals. These challenges become more pronounced under partial observability and the lack of prior knowledge about agent heterogeneity. While notable studies use intrinsic motivation (IM) to address reward sparsity or cooperation in decentralized settings, those dealing with heterogeneity typically assume centralized training, parameter sharing, and agent indexing. To overcome these limitations, we propose the CoHet algorithm, which utilizes a novel Graph Neural Network (GNN) based intrinsic motivation to facilitate the learning of heterogeneous agent policies in decentralized settings, under the challenges of partial observability and reward sparsity. Evaluation of CoHet in the Multi-agent Particle Environment (MPE) and Vectorized Multi-Agent Simulator (VMAS) benchmarks demonstrates superior performance compared to the state-of-the-art in a range of cooperative multi-agent scenarios. Our research is supplemented by an analysis of the impact of the agent dynamics model on the intrinsic motivation module, insights into the performance of different CoHet variants, and its robustness to an increasing number of heterogeneous agents.

Asynchronous Credit Assignment Framework for Multi-Agent Reinforcement Learning

Authors:Yongheng Liang, Hejun Wu, Haitao Wang, Hao Cai
Date:2024-08-07 11:13:26

Credit assignment is a core problem that distinguishes agents' marginal contributions for optimizing cooperative strategies in multi-agent reinforcement learning (MARL). Current credit assignment methods usually assume synchronous decision-making among agents. However, a prerequisite for many realistic cooperative tasks is asynchronous decision-making by agents, without waiting for others to avoid disastrous consequences. To address this issue, we propose an asynchronous credit assignment framework with a problem model called ADEX-POMDP and a multiplicative value decomposition (MVD) algorithm. ADEX-POMDP is an asynchronous problem model with extra virtual agents for a decentralized partially observable Markov decision process. We prove that ADEX-POMDP preserves both the task equilibrium and the algorithm convergence. MVD utilizes multiplicative interaction to efficiently capture the interactions of asynchronous decisions, and we theoretically demonstrate its advantages in handling asynchronous tasks. Experimental results show that on two asynchronous decision-making benchmarks, Overcooked and POAC, MVD not only consistently outperforms state-of-the-art MARL methods but also provides interpretability for asynchronous cooperation.

Environment Complexity and Nash Equilibria in a Sequential Social Dilemma

Authors:Mustafa Yasir, Andrew Howes, Vasilios Mavroudis, Chris Hicks
Date:2024-08-04 21:27:36

Multi-agent reinforcement learning (MARL) methods, while effective in zero-sum or positive-sum games, often yield suboptimal outcomes in general-sum games where cooperation is essential for achieving globally optimal outcomes. Matrix game social dilemmas, which abstract key aspects of general-sum interactions, such as cooperation, risk, and trust, fail to model the temporal and spatial dynamics characteristic of real-world scenarios. In response, our study extends matrix game social dilemmas into more complex, higher-dimensional MARL environments. We adapt a gridworld implementation of the Stag Hunt dilemma to more closely match the decision-space of a one-shot matrix game while also introducing variable environment complexity. Our findings indicate that as complexity increases, MARL agents trained in these environments converge to suboptimal strategies, consistent with the risk-dominant Nash equilibria strategies found in matrix games. Our work highlights the impact of environment complexity on achieving optimal outcomes in higher-dimensional game-theoretic MARL environments.
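
A concrete matrix-game instance helps anchor the finding. With illustrative payoffs (not the paper's exact values), both all-stag and all-hare are Nash equilibria; the hare equilibrium is risk-dominant, which is exactly the strategy the trained agents drift toward as environment complexity grows.

# Stag Hunt payoff structure and a quick risk-dominance check.
# Rows/cols: 0 = hunt stag (cooperate), 1 = hunt hare (defect).
payoff = [[4, 0],    # stag meets stag -> 4; stag meets hare -> 0
          [3, 3]]    # hare always yields 3

# (stag, stag) is payoff-dominant. Risk dominance compares squared
# deviation losses at each equilibrium (symmetric 2x2 criterion):
stag_risk = (payoff[0][0] - payoff[1][0]) ** 2   # (4 - 3)^2 = 1
hare_risk = (payoff[1][1] - payoff[0][1]) ** 2   # (3 - 0)^2 = 9
print("risk-dominant:", "stag" if stag_risk > hare_risk else "hare")  # hare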

Multi-agent reinforcement learning for the control of three-dimensional Rayleigh-Bénard convection

Authors:Joel Vasanth, Jean Rabault, Francisco Alcántara-Ávila, Mikael Mortensen, Ricardo Vinuesa
Date:2024-07-31 12:41:20

Deep reinforcement learning (DRL) has found application in numerous use-cases pertaining to flow control. Multi-agent RL (MARL), a variant of DRL, has been shown to be more effective than single-agent RL in controlling flows exhibiting locality and translational invariance. We present, for the first time, an implementation of MARL-based control of three-dimensional Rayleigh-B\'enard convection (RBC). Control is executed by modifying the temperature distribution along the bottom wall divided into multiple control segments, each of which acts as an independent agent. Two regimes of RBC are considered at Rayleigh numbers $\mathrm{Ra}=500$ and $750$. Evaluation of the learned control policy reveals a reduction in convection intensity by $23.5\%$ and $8.7\%$ at $\mathrm{Ra}=500$ and $750$, respectively. The MARL controller converts irregularly shaped convective patterns to regular straight rolls with lower convection that resemble flow in a relatively more stable regime. We draw comparisons with proportional control at both $\mathrm{Ra}$ and show that MARL is able to outperform the proportional controller. The learned control strategy is complex, featuring different non-linear segment-wise actuator delays and actuation magnitudes. We also perform successful evaluations on a larger domain than used for training, demonstrating that the invariant property of MARL allows direct transfer of the learnt policy.

Architectural Influence on Variational Quantum Circuits in Multi-Agent Reinforcement Learning: Evolutionary Strategies for Optimization

Authors:Michael Kölle, Karola Schneider, Sabrina Egger, Felix Topp, Thomy Phan, Philipp Altmann, Jonas Nüßlein, Claudia Linnhoff-Popien
Date:2024-07-30 11:16:25

In recent years, Multi-Agent Reinforcement Learning (MARL) has found application in numerous areas of science and industry, such as autonomous driving, telecommunications, and global health. Nevertheless, MARL suffers from, for instance, an exponential growth of dimensions. Inherent properties of quantum mechanics help to overcome these limitations, e.g., by significantly reducing the number of trainable parameters. Previous studies have developed an approach that uses gradient-free quantum Reinforcement Learning and evolutionary optimization for variational quantum circuits (VQCs) to reduce the trainable parameters and avoid barren plateaus as well as vanishing gradients. This leads to significantly better performance of VQCs compared to classical neural networks with a similar number of trainable parameters and a reduction in the number of parameters by more than 97% compared to similarly good neural networks. We extend an approach of K\"olle et al. by proposing a Gate-Based, a Layer-Based, and a Prototype-Based concept to mutate and recombine VQCs. Our results show the best performance for mutation-only strategies and the Gate-Based approach. In particular, we observe a significantly better score, higher total and own collected coins, as well as a superior own coin rate for the best agent when evaluated in the Coin Game environment.

Finite-Time Analysis of Asynchronous Multi-Agent TD Learning

Authors:Nicolò Dal Fabbro, Arman Adibi, Aritra Mitra, George J. Pappas
Date:2024-07-29 22:36:07

Recent research endeavours have theoretically shown the beneficial effect of cooperation in multi-agent reinforcement learning (MARL). In a setting involving $N$ agents, this beneficial effect usually comes in the form of an $N$-fold linear convergence speedup, i.e., a reduction - proportional to $N$ - in the number of iterations required to reach a certain convergence precision. In this paper, we show for the first time that this speedup property also holds for a MARL framework subject to asynchronous delays in the local agents' updates. In particular, we consider a policy evaluation problem in which multiple agents cooperate to evaluate a common policy by communicating with a central aggregator. In this setting, we study the finite-time convergence of \texttt{AsyncMATD}, an asynchronous multi-agent temporal difference (TD) learning algorithm in which agents' local TD update directions are subject to asynchronous bounded delays. Our main contribution is providing a finite-time analysis of \texttt{AsyncMATD}, for which we establish a linear convergence speedup while highlighting the effect of time-varying asynchronous delays on the resulting convergence rate.
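<br/>
The setting can be sketched compactly: each agent sends a TD(0) direction that may have been computed at a stale copy of the parameters, and the aggregator averages whatever has arrived. A minimal Python sketch of both pieces, with linear value features and the delay model left as an assumption:

import numpy as np

def td_direction(theta, phi_s, phi_next, reward, gamma=0.9):
    """Semi-gradient TD(0) direction for linear value estimation; theta may
    be a stale copy of the server parameters."""
    delta = reward + gamma * phi_next @ theta - phi_s @ theta
    return delta * phi_s

def async_td_round(theta, delayed_dirs, step_size=0.1):
    """Server step: average the (possibly stale) directions from N agents;
    the averaging is what yields the N-fold speedup discussed above."""
    return theta + step_size * np.mean(delayed_dirs, axis=0)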

Language-Conditioned Offline RL for Multi-Robot Navigation

Authors:Steven Morad, Ajay Shankar, Jan Blumenkamp, Amanda Prorok
Date:2024-07-29 16:49:30

We present a method for developing navigation policies for multi-robot teams that interpret and follow natural language instructions. We condition these policies on embeddings from pretrained Large Language Models (LLMs), and train them via offline reinforcement learning with as little as 20 minutes of randomly-collected data. Experiments on a team of five real robots show that these policies generalize well to unseen commands, indicating an understanding of the LLM latent space. Our method requires no simulators or environment models, and produces low-latency control policies that can be deployed directly to real robots without finetuning. We provide videos of our experiments at https://sites.google.com/view/llm-marl.

Quantum Computing and Neuromorphic Computing for Safe, Reliable, and explainable Multi-Agent Reinforcement Learning: Optimal Control in Autonomous Robotics

Authors:Mazyar Taghavi
Date:2024-07-29 15:43:30

This paper investigates the utilization of Quantum Computing and Neuromorphic Computing for Safe, Reliable, and Explainable Multi-Agent Reinforcement Learning (MARL) in the context of optimal control in autonomous robotics. The objective was to address the challenges of optimizing the behavior of autonomous agents while ensuring safety, reliability, and explainability. Quantum Computing techniques, including Quantum Approximate Optimization Algorithm (QAOA), were employed to efficiently explore large solution spaces and find approximate solutions to complex MARL problems. Neuromorphic Computing, inspired by the architecture of the human brain, provided parallel and distributed processing capabilities, which were leveraged to develop intelligent and adaptive systems. The combination of these technologies held the potential to enhance the safety, reliability, and explainability of MARL in autonomous robotics. This research contributed to the advancement of autonomous robotics by exploring cutting-edge technologies and their applications in multi-agent systems. Codes and data are available.

Counterfactual rewards promote collective transport using individually controlled swarm microrobots

Authors:Veit-Lorenz Heuthe, Emanuele Panizon, Hongri Gu, Clemens Bechinger
Date:2024-07-29 14:26:46

Swarm robots offer fascinating opportunities to perform complex tasks beyond the capabilities of individual machines. Just as a swarm of ants collectively moves a large object, similar functions can emerge within a group of robots through individual strategies based on local sensing. However, realizing collective functions with individually controlled microrobots is particularly challenging due to their micrometer size, large number of degrees of freedom, strong thermal noise relative to the propulsion speed, complex physical coupling between neighboring microrobots, and surface collisions. Here, we implement Multi-Agent Reinforcement Learning (MARL) to generate a control strategy for up to 200 microrobots whose motions are individually controlled by laser spots. During the learning process, we employ so-called counterfactual rewards that automatically assign credit to the individual microrobots, which allows for fast and unbiased training. With the help of this efficient reward scheme, swarm microrobots learn to collectively transport a large cargo object to an arbitrary position and orientation, similar to ant swarms. We demonstrate that this flexible and versatile swarm robotic system is robust to variations in group size, the presence of malfunctioning units, and environmental noise. Such control strategies can potentially enable complex and automated assembly of mobile micromachines, programmable drug delivery capsules, and other advanced lab-on-a-chip applications.
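
The counterfactual-reward scheme follows the difference-reward pattern: each microrobot is credited with the drop in the team objective when its action is replaced by a default one. A toy Python sketch with a stand-in objective (the real objective involves cargo pose and orientation, not a plain sum of forces):

import numpy as np

def team_objective(forces):
    """Toy global objective: cargo displacement along the target direction."""
    return float(np.sum(forces))

def counterfactual_rewards(forces, default=0.0):
    """Credit robot i with G(a) - G(a with robot i's action set to default)."""
    g = team_objective(forces)
    rewards = []
    for i in range(len(forces)):
        cf = forces.copy()
        cf[i] = default                      # marginalize out robot i
        rewards.append(g - team_objective(cf))
    return rewards

print(counterfactual_rewards(np.array([0.5, -0.1, 0.8])))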

Collaborative Adaptation for Recovery from Unforeseen Malfunctions in Discrete and Continuous MARL Domains

Authors:Yasin Findik, Hunter Hasenfus, Reza Azadeh
Date:2024-07-27 02:04:58

Cooperative multi-agent learning plays a crucial role for developing effective strategies to achieve individual or shared objectives in multi-agent teams. In real-world settings, agents may face unexpected failures, such as a robot's leg malfunctioning or a teammate's battery running out. These malfunctions decrease the team's ability to accomplish assigned task(s), especially if they occur after the learning algorithms have already converged onto a collaborative strategy. Current leading approaches in Multi-Agent Reinforcement Learning (MARL) often recover slowly -- if at all -- from such malfunctions. To overcome this limitation, we present the Collaborative Adaptation (CA) framework, highlighting its unique capability to operate in both continuous and discrete domains. Our framework enhances the adaptability of agents to unexpected failures by integrating inter-agent relationships into their learning processes, thereby accelerating the recovery from malfunctions. We evaluated our framework's performance through experiments in both discrete and continuous environments. Empirical results reveal that, in scenarios involving unforeseen malfunctions, state-of-the-art algorithms often converge on sub-optimal solutions, whereas the proposed CA framework mitigates the impact of failures and recovers more effectively.

Advanced deep-reinforcement-learning methods for flow control: group-invariant and positional-encoding networks improve learning speed and quality

Authors:Joongoo Jeon, Jean Rabault, Joel Vasanth, Francisco Alcántara-Ávila, Shilaj Baral, Ricardo Vinuesa
Date:2024-07-25 07:24:41

Flow control is key to maximize energy efficiency in a wide range of applications. However, traditional flow-control methods face significant challenges in addressing non-linear systems and high-dimensional data, limiting their application in realistic energy systems. This study advances deep-reinforcement-learning (DRL) methods for flow control, particularly focusing on integrating group-invariant networks and positional encoding into DRL architectures. Our methods leverage multi-agent reinforcement learning (MARL) to exploit policy invariance in space, in combination with group-invariant networks to ensure local symmetry invariance. Additionally, a positional encoding inspired by the transformer architecture is incorporated to provide location information to the agents, mitigating action constraints from strict invariance. The proposed methods are verified using a case study of Rayleigh-B\'enard convection, where the goal is to minimize the Nusselt number Nu. The group-invariant neural networks (GI-NNs) show faster convergence compared to the base MARL, achieving better average policy performance. The GI-NNs not only cut DRL training time in half but also notably enhance learning reproducibility. Positional encoding further enhances these results, effectively reducing the minimum Nu and stabilizing convergence. Interestingly, group invariant networks specialize in improving learning speed and positional encoding specializes in improving learning quality. These results demonstrate that choosing a suitable feature-representation method according to the purpose as well as the characteristics of each control problem is essential. We believe that the results of this study will not only inspire novel DRL methods with invariant and unique representations, but also provide useful insights for industrial applications.
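
The positional-encoding ingredient is the standard transformer-style sinusoidal code appended to each agent's local observation, so that otherwise invariant networks still know where along the wall they act; a small Python sketch with illustrative dimension choices:

import numpy as np

def positional_encoding(pos, dim=8):
    """Map a scalar agent position to a dim-dimensional sin/cos code."""
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    angles = pos * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

obs = np.random.randn(16)                     # local flow measurements
agent_input = np.concatenate([obs, positional_encoding(pos=3.0)])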

Evaluating Uncertainties in Electricity Markets via Machine Learning and Quantum Computing

Authors:Shuyang Zhu, Ziqing Zhu, Linghua Zhu, Yujian Ye, Siqi Bu, Sasa Z. Djokic
Date:2024-07-23 11:46:13

The analysis of decision-making process in electricity markets is crucial for understanding and resolving issues related to market manipulation and reduced social welfare. Traditional Multi-Agent Reinforcement Learning (MARL) methods can model the decision-making of generation companies (GENCOs), but face challenges due to uncertainties in policy functions, reward functions, and inter-agent interactions. Quantum computing offers a promising solution to resolve these uncertainties, and this paper introduces the Quantum Multi-Agent Deep Q-Network (Q-MADQN) method, which integrates variational quantum circuits into the traditional MARL framework. The main contributions of the paper are: identifying the correspondence between market uncertainties and quantum properties, proposing the Q-MADQN algorithm for simulating electricity market bidding, and demonstrating that Q-MADQN allows for a more thorough exploration and simulates more potential bidding strategies of profit-oriented GENCOs, compared to conventional methods, without compromising computational efficiency. The proposed method is illustrated on the IEEE 30-bus test network, confirming that it offers a more accurate model for simulating complex market dynamics.

B2MAPO: A Batch-by-Batch Multi-Agent Policy Optimization to Balance Performance and Efficiency

Authors:Wenjing Zhang, Wei Zhang, Wenqing Hu, Yifan Wang
Date:2024-07-21 06:58:14

Most multi-agent reinforcement learning approaches adopt two types of policy optimization methods that either update policies simultaneously or sequentially. Simultaneously updating the policies of all agents introduces the non-stationarity problem. Although sequentially updating policies agent-by-agent in an appropriate order improves policy performance, it is prone to low efficiency due to sequential execution, resulting in longer model training and execution time. Intuitively, partitioning the policies of all agents according to their interdependence and updating the joint policy batch-by-batch can effectively balance performance and efficiency. However, how to determine the optimal batch partition of policies and the batch updating order are challenging problems. Firstly, a sequential batched policy updating scheme, B2MAPO (Batch by Batch Multi-Agent Policy Optimization), is proposed with a theoretical guarantee of the monotonically and incrementally tightened bound. Secondly, a universal modularized plug-and-play B2MAPO hierarchical framework, which satisfies the CTDE principle, is designed to conveniently integrate any MARL model to fully exploit and merge their merits, including policy optimality and inference efficiency. Next, a DAG-based B2MAPO algorithm is devised, which is a carefully designed implementation of the B2MAPO framework. Comprehensive experimental results conducted on StarCraftII Multi-agent Challenge and Google Football Research demonstrate that the DAG-based B2MAPO algorithm outperforms baseline methods. Meanwhile, compared with A2PO, our algorithm reduces the model training and execution time by 60.4% and 78.7%, respectively.

POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation

Authors:Alexey Skrynnik, Anton Andreychuk, Anatolii Borzilov, Alexander Chernyavskiy, Konstantin Yakovlev, Aleksandr Panov
Date:2024-07-20 16:37:21

Multi-agent reinforcement learning (MARL) has recently excelled in solving challenging cooperative and competitive multi-agent problems in various environments with, mostly, few agents and full observability. Moreover, a range of crucial robotics-related tasks, such as multi-robot navigation and obstacle avoidance, that have been conventionally approached with classical non-learnable methods (e.g., heuristic search) are currently suggested to be solved by learning-based or hybrid methods. Still, in this domain, it is hard, not to say impossible, to conduct a fair comparison between classical, learning-based, and hybrid approaches due to the lack of a unified framework that supports both learning and evaluation. To this end, we introduce POGEMA, a set of comprehensive tools that includes a fast environment for learning, a generator of problem instances, a collection of pre-defined ones, a visualization toolkit, and a benchmarking tool that allows automated evaluation. We introduce and specify an evaluation protocol defining a range of domain-related metrics computed on the basis of the primary evaluation indicators (such as success rate and path length), allowing a fair multi-fold comparison. The results of such a comparison, which involves a variety of state-of-the-art MARL, search-based, and hybrid methods, are presented.

Navigating the Smog: A Cooperative Multi-Agent RL for Accurate Air Pollution Mapping through Data Assimilation

Authors:Ichrak Mokhtari, Walid Bechkit, Mohamed Sami Assenine, Hervé Rivano
Date:2024-07-17 13:24:27

The rapid rise of air pollution events necessitates accurate, real-time monitoring for informed mitigation strategies. Data Assimilation (DA) methods provide promising solutions, but their effectiveness hinges heavily on optimal measurement locations. This paper presents a novel approach for air quality mapping where autonomous drones, guided by a collaborative multi-agent reinforcement learning (MARL) framework, act as airborne detectives. Moving beyond the limitations of static sensor networks, the drones engage in a synergistic interaction, adapting their flight paths in real time to gather optimal data for DA. Our approach employs a tailored reward function with dynamic credit assignment, enabling drones to prioritize informative measurements without requiring unavailable ground truth data, making it practical for real-world deployments. Extensive experiments using a real-world dataset demonstrate that our solution achieves significantly improved pollution estimates, even with limited drone resources or limited prior knowledge of the pollution plume. Beyond air quality, this solution unlocks possibilities for tackling diverse environmental challenges like wildfire detection and management through scalable and autonomous drone cooperation.

Towards Collaborative Intelligence: Propagating Intentions and Reasoning for Multi-Agent Coordination with Large Language Models

Authors:Xihe Qiu, Haoyu Wang, Xiaoyu Tan, Chao Qu, Yujie Xiong, Yuan Cheng, Yinghui Xu, Wei Chu, Yuan Qi
Date:2024-07-17 13:14:00

Effective collaboration in multi-agent systems requires communicating goals and intentions between agents. Current agent frameworks often suffer from dependencies on single-agent execution and lack robust inter-module communication, frequently leading to suboptimal multi-agent reinforcement learning (MARL) policies and inadequate task coordination. To address these challenges, we present a framework for training large language models (LLMs) as collaborative agents to enable coordinated behaviors in cooperative MARL. Each agent maintains a private intention consisting of its current goal and associated sub-tasks. Agents broadcast their intentions periodically, allowing other agents to infer coordination tasks. A propagation network transforms broadcast intentions into teammate-specific communication messages, sharing relevant goals with designated teammates. The architecture of our framework is structured into planning, grounding, and execution modules. During execution, multiple agents interact in a downstream environment and communicate intentions to enable coordinated behaviors. The grounding module dynamically adapts comprehension strategies based on emerging coordination patterns, while feedback from execution agents influences the planning module, enabling the dynamic re-planning of sub-tasks. Results in collaborative environment simulation demonstrate that intention propagation reduces miscoordination errors by aligning sub-task dependencies between agents. Agents learn when to communicate intentions and which teammates require task details, resulting in emergent coordinated behaviors. This demonstrates the efficacy of intention sharing for cooperative multi-agent RL based on LLMs.

Cooperative Reward Shaping for Multi-Agent Pathfinding

Authors:Zhenyu Song, Ronghao Zheng, Senlin Zhang, Meiqin Liu
Date:2024-07-15 02:44:41

The primary objective of Multi-Agent Pathfinding (MAPF) is to plan efficient and conflict-free paths for all agents. Traditional multi-agent path planning algorithms struggle to achieve efficient distributed path planning for multiple agents. In contrast, Multi-Agent Reinforcement Learning (MARL) has been demonstrated as an effective approach to achieve this objective. By modeling the MAPF problem as a MARL problem, agents can achieve efficient path planning and collision avoidance through distributed strategies under partial observation. However, MARL strategies often lack cooperation among agents due to the absence of global information, which subsequently leads to reduced MAPF efficiency. To address this challenge, this letter introduces a unique reward shaping technique based on Independent Q-Learning (IQL). The aim of this method is to evaluate the influence of one agent on its neighbors and integrate such an interaction into the reward function, leading to active cooperation among agents. This reward shaping method facilitates cooperation among agents while operating in a distributed manner. The proposed approach has been evaluated through experiments across various scenarios with different scales and agent counts. The results are compared with those from other state-of-the-art (SOTA) planners. The evidence suggests that the approach proposed in this letter matches other planners in numerous aspects and outperforms them in scenarios featuring a large number of agents.
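
The shaping idea reduces to mixing an agent's own progress with the delay it imposes on neighbors; a minimal Python sketch, where the weighting beta and the delay measure are assumptions for illustration, not the letter's exact formulation:

def shaped_reward(own_progress, neighbor_delays, beta=0.3):
    """Mix individual progress with the delay the agent imposes on neighbors;
    penalizing imposed delay is what induces active cooperation."""
    return own_progress - beta * sum(neighbor_delays)

r = shaped_reward(own_progress=1.0, neighbor_delays=[0.2, 0.0, 0.5])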

Decentralized multi-agent reinforcement learning algorithm using a cluster-synchronized laser network

Authors:Shun Kotoku, Takatomo Mihana, André Röhm, Ryoichi Horisaki
Date:2024-07-12 09:38:47

Multi-agent reinforcement learning (MARL) studies crucial principles that are applicable to a variety of fields, including wireless networking and autonomous driving. We propose a photonic-based decision-making algorithm to address one of the most fundamental problems in MARL, called the competitive multi-armed bandit (CMAB) problem. Our numerical simulations demonstrate that chaotic oscillations and cluster synchronization of optically coupled lasers, along with our proposed decentralized coupling adjustment, efficiently balance exploration and exploitation while facilitating cooperative decision-making without explicitly sharing information among agents. Our study demonstrates how decentralized reinforcement learning can be achieved by exploiting complex physical processes controlled by simple algorithms.

Communication-Aware Reinforcement Learning for Cooperative Adaptive Cruise Control

Authors:Sicong Jiang, Seongjin Choi, Lijun Sun
Date:2024-07-12 03:28:24

Cooperative Adaptive Cruise Control (CACC) plays a pivotal role in enhancing traffic efficiency and safety in Connected and Autonomous Vehicles (CAVs). Reinforcement Learning (RL) has proven effective in optimizing complex decision-making processes in CACC, leading to improved system performance and adaptability. Among RL approaches, Multi-Agent Reinforcement Learning (MARL) has shown remarkable potential by enabling coordinated actions among multiple CAVs through Centralized Training with Decentralized Execution (CTDE). However, MARL often faces scalability issues, particularly when CACC vehicles suddenly join or leave the platoon, resulting in performance degradation. To address these challenges, we propose Communication-Aware Reinforcement Learning (CA-RL). CA-RL includes a communication-aware module that extracts and compresses vehicle communication information through forward and backward information transmission modules. This enables efficient cyclic information propagation within the CACC traffic flow, ensuring policy consistency and mitigating the scalability problems of MARL in CACC. Experimental results demonstrate that CA-RL significantly outperforms baseline methods in various traffic scenarios, achieving superior scalability, robustness, and overall system performance while maintaining reliable performance despite changes in the number of participating vehicles.

Dynamic Co-Optimization Compiler: Leveraging Multi-Agent Reinforcement Learning for Enhanced DNN Accelerator Performance

Authors:Arya Fayyazi, Mehdi Kamal, Massoud Pedram
Date:2024-07-11 05:22:04

This paper introduces a novel Dynamic Co-Optimization Compiler (DCOC), which employs an adaptive Multi-Agent Reinforcement Learning (MARL) framework to enhance the efficiency of mapping machine learning (ML) models, particularly Deep Neural Networks (DNNs), onto diverse hardware platforms. DCOC incorporates three specialized actor-critic agents within MARL, each dedicated to different optimization facets: one for hardware and two for software. This cooperative strategy results in an integrated hardware/software co-optimization approach, improving the precision and speed of DNN deployments. By focusing on high-confidence configurations, DCOC effectively reduces the search space, achieving remarkable performance over existing methods. Our results demonstrate that DCOC enhances throughput by up to 37.95% while reducing optimization time by up to 42.2% across various DNN models, outperforming current state-of-the-art frameworks.

Hierarchical Consensus-Based Multi-Agent Reinforcement Learning for Multi-Robot Cooperation Tasks

Authors:Pu Feng, Junkang Liang, Size Wang, Xin Yu, Xin Ji, Yiting Chen, Kui Zhang, Rongye Shi, Wenjun Wu
Date:2024-07-11 03:55:55

In multi-agent reinforcement learning (MARL), the Centralized Training with Decentralized Execution (CTDE) framework is pivotal, but it struggles with a gap: training is guided by the global state, whereas execution relies only on local observations and lacks global signals. Inspired by human societal consensus mechanisms, we introduce the Hierarchical Consensus-based Multi-Agent Reinforcement Learning (HC-MARL) framework to address this limitation. HC-MARL employs contrastive learning to foster a global consensus among agents, enabling cooperative behavior without direct communication. This approach enables agents to form a global consensus from local observations, using it as an additional piece of information to guide collaborative actions during execution. To cater to the dynamic requirements of various tasks, consensus is divided into multiple layers, encompassing both short-term and long-term considerations. Short-term observations prompt the creation of an immediate, low-layer consensus, while long-term observations contribute to the formation of a strategic, high-layer consensus. This process is further refined through an adaptive attention mechanism that dynamically adjusts the influence of each consensus layer. This mechanism optimizes the balance between immediate reactions and strategic planning, tailoring it to the specific demands of the task at hand. Extensive experiments and real-world applications in multi-robot systems showcase our framework's superior performance, marking significant advancements over baselines.

Field Deployment of Multi-Agent Reinforcement Learning Based Variable Speed Limit Controllers

Authors:Yuhang Zhang, Zhiyao Zhang, Marcos Quiñones-Grueiro, William Barbour, Clay Weston, Gautam Biswas, Daniel Work
Date:2024-07-10 19:59:59

This article presents the first field deployment of a multi-agent reinforcement-learning (MARL) based variable speed limit (VSL) control system on the I-24 freeway near Nashville, Tennessee. We describe how we train MARL agents in a traffic simulator and directly deploy the simulation-based policy on a 17-mile stretch of Interstate 24 with 67 VSL controllers. We use invalid action masking and several safety guards to ensure the posted speed limits satisfy the real-world constraints from the traffic management center and the Tennessee Department of Transportation. From its launch through April 2024, the system has made approximately 10,000,000 decisions on 8,000,000 trips. The analysis of the controller shows that the MARL policy takes control for up to 98% of the time without intervention from safety guards. The time-space diagrams of traffic speed and control commands illustrate how the algorithm behaves during rush hour. Finally, we quantify the domain mismatch between the simulation and real-world data and demonstrate the robustness of the MARL policy to this mismatch.
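
Invalid action masking of the kind described is simple to sketch: actions that violate the operational rules are excluded before the greedy choice. A minimal illustration under assumed names, not the deployed system's code:

    import numpy as np

    def masked_greedy_action(q_values, valid_mask):
        """Greedy action restricted to speed limits the safety rules allow.

        q_values: per-action estimates from the MARL policy.
        valid_mask: booleans from the safety guards (e.g., bounded step changes
        between updates) -- an illustrative placeholder for the real constraints.
        """
        masked = np.where(valid_mask, q_values, -np.inf)
        return int(np.argmax(masked))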

Hypothetical Minds: Scaffolding Theory of Mind for Multi-Agent Tasks with Large Language Models

Authors:Logan Cross, Violet Xiang, Agam Bhatia, Daniel LK Yamins, Nick Haber
Date:2024-07-09 17:57:15

Multi-agent reinforcement learning (MARL) methods struggle with the non-stationarity of multi-agent systems and fail to adaptively learn online when tested with novel agents. Here, we leverage large language models (LLMs) to create an autonomous agent that can handle these challenges. Our agent, Hypothetical Minds, consists of a cognitively-inspired architecture, featuring modular components for perception, memory, and hierarchical planning over two levels of abstraction. We introduce the Theory of Mind module that scaffolds the high-level planning process by generating hypotheses about other agents' strategies in natural language. It then evaluates and iteratively refines these hypotheses by reinforcing hypotheses that make correct predictions about the other agents' behavior. Hypothetical Minds significantly improves performance over previous LLM-agent and RL baselines on a range of competitive, mixed motive, and collaborative domains in the Melting Pot benchmark, including both dyadic and population-based environments. Additionally, comparisons against LLM-agent baselines and ablations reveal the importance of hypothesis evaluation and refinement for succeeding on complex scenarios.
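
The evaluate-and-refine loop can be pictured as reweighting a bank of hypotheses by their predictive accuracy; a toy sketch, with the scoring rule and data structures assumed rather than taken from the paper:

    def refine_hypotheses(hypotheses, observed_action, lr=0.5):
        """Reinforce hypotheses about an opponent that predicted correctly.

        hypotheses: list of {"predict": callable, "score": float}; the callable
        returns the action the hypothesis expects (an assumption of this sketch).
        """
        for h in hypotheses:
            hit = 1.0 if h["predict"]() == observed_action else 0.0
            h["score"] += lr * (hit - h["score"])  # move score toward accuracy
        return max(hypotheses, key=lambda h: h["score"])  # guides high-level planning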

Diffusive shock acceleration in relativistic, oblique shocks

Authors:Allard Jan van Marle, Artem Bohdan, Anabella Araudo, Fabien Casse, Alexandre Marcowith
Date:2024-07-08 11:50:03

Cosmic rays are charged particles that are accelerated to relativistic speeds by astrophysical shocks. Numerical models have been successful in confirming the acceleration process for (quasi-)parallel shocks, which have the magnetic field aligned with the direction of the shock motion. However, the process is less clear when it comes to (quasi-)perpendicular shocks, where the field makes a large angle with the shock-normal. For such shocks, the angle between the magnetic field and flow ensures that only highly energetic particles can travel upstream at all, reducing the upstream current. This process is further inhibited for relativistic shocks, since the shock can become superluminal when the required particle velocity exceeds the speed of light, effectively inhibiting any upstream particle flow. In order to determine whether such shocks can accelerate particles, we use the particle-in-cell (PIC) method to determine what fraction of particles gets reflected initially at the shock. We then use this as input for a new simulation that combines the PIC method with grid-based magnetohydrodynamics to follow the acceleration (if any) of the particles over a larger time-period in a two-dimensional grid. We find that quasi-perpendicular, relativistic shocks are capable of accelerating particles through the DSA process, provided that the shock has a sufficiently high Alfvenic Mach number.

FedMRL: Data Heterogeneity Aware Federated Multi-agent Deep Reinforcement Learning for Medical Imaging

Authors:Pranab Sahoo, Ashutosh Tripathi, Sriparna Saha, Samrat Mondal
Date:2024-07-08 10:10:07

Despite recent advancements in federated learning (FL) for medical image diagnosis, addressing data heterogeneity among clients remains a significant challenge for practical implementation. A primary hurdle in FL arises from the non-IID nature of data samples across clients, which typically results in a decline in the performance of the aggregated global model. In this study, we introduce FedMRL, a novel federated multi-agent deep reinforcement learning framework designed to address data heterogeneity. FedMRL incorporates a novel loss function to facilitate fairness among clients, preventing bias in the final global model. Additionally, it employs a multi-agent reinforcement learning (MARL) approach to calculate the proximal term $(\mu)$ for the personalized local objective function, ensuring convergence to the global optimum. Furthermore, FedMRL integrates an adaptive weight adjustment method using a Self-organizing map (SOM) on the server side to counteract distribution shifts among clients' local data distributions. We assess our approach using two publicly available real-world medical datasets, and the results demonstrate that FedMRL significantly outperforms state-of-the-art techniques, showing its efficacy in addressing data heterogeneity in federated learning. The code can be found at https://github.com/Pranabiitp/FedMRL.
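
The proximal term $(\mu)$ enters the personalized local objective in the usual FedProx form; a PyTorch sketch, where how $\mu$ is chosen (the MARL part) is the paper's contribution and is simply taken as an input here:

    import torch

    def personalized_local_loss(model, global_model, task_loss, mu):
        """Task loss plus a client-specific proximal term pulling local weights
        toward the global model; mu is computed elsewhere (by MARL in FedMRL)."""
        prox = sum((w - g.detach()).pow(2).sum()
                   for w, g in zip(model.parameters(), global_model.parameters()))
        return task_loss + 0.5 * mu * prox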

Multi-Scenario Combination Based on Multi-Agent Reinforcement Learning to Optimize the Advertising Recommendation System

Authors:Yang Zhao, Chang Zhou, Jin Cao, Yi Zhao, Shaobo Liu, Chiyu Cheng, Xingchen Li
Date:2024-07-03 02:33:20

This paper explores multi-scenario optimization on large platforms using multi-agent reinforcement learning (MARL). We address this by treating scenarios like search, recommendation, and advertising as a cooperative, partially observable multi-agent decision problem. We introduce the Multi-Agent Recurrent Deterministic Policy Gradient (MARDPG) algorithm, which aligns different scenarios under a shared objective and allows for strategy communication to boost overall performance. Our results show marked improvements in metrics such as click-through rate (CTR), conversion rate, and total sales, confirming our method's efficacy in practical settings.

Wildfire Autonomous Response and Prediction Using Cellular Automata (WARP-CA)

Authors:Abdelrahman Ramadan
Date:2024-07-02 19:01:59

Wildfires pose a severe challenge to ecosystems and human settlements, exacerbated by climate change and environmental factors. Traditional wildfire modeling, while useful, often fails to adapt to the rapid dynamics of such events. This report introduces the Wildfire Autonomous Response and Prediction Using Cellular Automata (WARP-CA) model, a novel approach that integrates terrain generation using Perlin noise with the dynamism of Cellular Automata (CA) to simulate wildfire spread. We explore the potential of Multi-Agent Reinforcement Learning (MARL) to manage wildfires by simulating autonomous agents, such as UAVs and UGVs, within a collaborative framework. Our methodology combines world simulation techniques and investigates emergent behaviors in MARL, focusing on efficient wildfire suppression and considering critical environmental factors like wind patterns and terrain features.
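
A toy version of the CA spread step conveys the mechanism; the probabilities and the wind boost below are illustrative, and WARP-CA's Perlin-noise terrain and dynamics are richer than this:

    import numpy as np

    def ca_fire_step(grid, p_spread=0.3, wind=(0, 1), rng=None):
        """One update of a toy wildfire cellular automaton: 0=fuel, 1=burning, 2=burnt.

        Each burning cell ignites fuel neighbors with probability p_spread,
        boosted along the wind direction (assumed coefficients)."""
        rng = rng or np.random.default_rng()
        new = grid.copy()
        for r, c in np.argwhere(grid == 1):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr or dc) and 0 <= rr < grid.shape[0] and 0 <= cc < grid.shape[1]:
                        p = p_spread * (1.5 if (dr, dc) == wind else 1.0)
                        if grid[rr, cc] == 0 and rng.random() < p:
                            new[rr, cc] = 1
            new[r, c] = 2  # burning cell burns out this step
        return new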

Coordination Failure in Cooperative Offline MARL

Authors:Callum Rhys Tilbury, Claude Formanek, Louise Beyers, Jonathan P. Shock, Arnu Pretorius
Date:2024-07-01 14:51:29

Offline multi-agent reinforcement learning (MARL) leverages static datasets of experience to learn optimal multi-agent control. However, learning from static data presents several unique challenges to overcome. In this paper, we focus on coordination failure and investigate the role of joint actions in multi-agent policy gradients with offline data, focusing on a common setting we refer to as the 'Best Response Under Data' (BRUD) approach. By using two-player polynomial games as an analytical tool, we demonstrate a simple yet overlooked failure mode of BRUD-based algorithms, which can lead to catastrophic coordination failure in the offline setting. Building on these insights, we propose an approach to mitigate such failure by prioritising samples from the dataset based on joint-action similarity during policy learning, and demonstrate its effectiveness in detailed experiments. More generally, however, we argue that prioritised dataset sampling is a promising area for innovation in offline MARL that can be combined with other effective approaches such as critic and policy regularisation. Importantly, our work shows how insights drawn from simplified, tractable games can lead to useful, theoretically grounded insights that transfer to more complex contexts. A core dimension of our offering is an interactive notebook, from which almost all of our results can be reproduced in a browser.
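
The proposed prioritisation can be sketched as a softmax over joint-action similarity; the distance measure and temperature are placeholders, since the abstract leaves the exact similarity unspecified:

    import numpy as np

    def priority_weights(dataset_joint_actions, current_joint_action, temperature=1.0):
        """Sampling weights favouring offline transitions whose joint action is
        close to what the current joint policy would do (BRUD-mitigation sketch).

        dataset_joint_actions: array of shape (N, d); current_joint_action: (d,).
        """
        dists = np.linalg.norm(dataset_joint_actions - current_joint_action, axis=-1)
        w = np.exp(-dists / temperature)
        return w / w.sum()  # probabilities for prioritised replay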

Diffusion Models for Offline Multi-agent Reinforcement Learning with Safety Constraints

Authors:Jianuo Huang
Date:2024-06-30 16:05:31

With recent advancements, Multi-agent Reinforcement Learning (MARL) has been extended to various safety-critical scenarios. However, most methods focus on online learning, which presents substantial risks when deployed in real-world settings. Addressing this challenge, we introduce an innovative framework integrating diffusion models within the MARL paradigm. This approach notably enhances the safety of actions taken by multiple agents through risk mitigation while modeling coordinated action. Our framework is grounded in the Centralized Training with Decentralized Execution (CTDE) architecture, augmented by a diffusion model for predicted trajectory generation. Additionally, we incorporate a specialized algorithm to further ensure operational safety. We evaluate our model against baselines on the DSRL benchmark. Experimental results demonstrate that our model not only adheres to stringent safety constraints but also achieves superior performance compared to existing methodologies. This underscores the potential of our approach in advancing the safety and efficacy of MARL in real-world applications.

CuDA2: An approach for Incorporating Traitor Agents into Cooperative Multi-Agent Systems

Authors:Zhen Chen, Yong Liao, Youpeng Zhao, Zipeng Dai, Jian Zhao
Date:2024-06-25 09:59:31

Cooperative Multi-Agent Reinforcement Learning (CMARL) strategies are well known to be vulnerable to adversarial perturbations. Previous works on adversarial attacks have primarily focused on white-box attacks that directly perturb the states or actions of victim agents, often in scenarios with a limited number of attacks. However, gaining complete access to victim agents in real-world environments is exceedingly difficult. To create more realistic adversarial attacks, we introduce a novel method that involves injecting traitor agents into the CMARL system. We model this problem as a Traitor Markov Decision Process (TMDP), where traitors cannot directly attack the victim agents but can influence their formation or positioning through collisions. In TMDP, traitors are trained using the same MARL algorithm as the victim agents, with their reward function set as the negative of the victim agents' reward. Despite this, the training efficiency for traitors remains low because it is challenging for them to directly associate their actions with the victim agents' rewards. To address this issue, we propose the Curiosity-Driven Adversarial Attack (CuDA2) framework. CuDA2 enhances the efficiency and aggressiveness of attacks on the specified victim agents' policies while maintaining the optimal policy invariance of the traitors. Specifically, we employ a pre-trained Random Network Distillation (RND) module, where the extra reward generated by the RND module encourages traitors to explore states unencountered by the victim agents. Extensive experiments on various scenarios from SMAC demonstrate that our CuDA2 framework offers comparable or superior adversarial attack capabilities compared to other baselines.
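
RND itself is standard; a compact PyTorch sketch of the intrinsic bonus used to push traitors toward states the victims have not covered (network sizes here are arbitrary):

    import torch.nn as nn

    class RND(nn.Module):
        """Random Network Distillation: the predictor is trained to match a
        frozen random target; its error on a state is the curiosity bonus."""

        def __init__(self, obs_dim, feat_dim=64):
            super().__init__()
            make = lambda: nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU(),
                                         nn.Linear(feat_dim, feat_dim))
            self.target, self.predictor = make(), make()
            for p in self.target.parameters():
                p.requires_grad_(False)  # the target stays random forever

        def intrinsic_reward(self, obs):
            # High prediction error = rarely visited state = larger bonus.
            return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)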

Temporal Prototype-Aware Learning for Active Voltage Control on Power Distribution Networks

Authors:Feiyang Xu, Shunyu Liu, Yunpeng Qing, Yihe Zhou, Yuwen Wang, Mingli Song
Date:2024-06-25 08:07:00

Active Voltage Control (AVC) on the Power Distribution Networks (PDNs) aims to stabilize the voltage levels to ensure efficient and reliable operation of power systems. With the increasing integration of distributed energy resources, recent efforts have explored employing multi-agent reinforcement learning (MARL) techniques to realize effective AVC. Existing methods mainly focus on the acquisition of short-term AVC strategies, i.e., only learning AVC within the short-term training trajectories of a singular diurnal cycle. However, due to the dynamic nature of load demands and renewable energy, the operation states of real-world PDNs may exhibit significant distribution shifts across varying timescales (e.g., daily and seasonal changes). This can render those short-term strategies suboptimal or even obsolete when performing continuous AVC over extended periods. In this paper, we propose a novel temporal prototype-aware learning method, abbreviated as TPA, to learn time-adaptive AVC under short-term training trajectories. At the heart of TPA are two complementary components, namely multi-scale dynamic encoder and temporal prototype-aware policy, that can be readily incorporated into various MARL methods. The former component integrates a stacked transformer network to learn underlying temporal dependencies at different timescales of the PDNs, while the latter implements a learnable prototype matching mechanism to construct a dedicated AVC policy that can dynamically adapt to the evolving operation states. Experimental results on the AVC benchmark with different PDN sizes demonstrate that the proposed TPA surpasses the state-of-the-art counterparts not only in terms of control performance but also by offering model transferability. Our code is available at https://github.com/Canyizl/TPA-for-AVC.

Decentralized Transformers with Centralized Aggregation are Sample-Efficient Multi-Agent World Models

Authors:Yang Zhang, Chenjia Bai, Bin Zhao, Junchi Yan, Xiu Li, Xuelong Li
Date:2024-06-22 12:40:03

Learning a world model for model-free Reinforcement Learning (RL) agents can significantly improve the sample efficiency by learning policies in imagination. However, building a world model for Multi-Agent RL (MARL) can be particularly challenging due to the scalability issue in a centralized architecture arising from a large number of agents, and also the non-stationarity issue in a decentralized architecture stemming from the inter-dependency among agents. To address both challenges, we propose a novel world model for MARL that learns decentralized local dynamics for scalability, combined with a centralized representation aggregation from all agents. We cast the dynamics learning as an auto-regressive sequence modeling problem over discrete tokens by leveraging the expressive Transformer architecture, in order to model complex local dynamics across different agents and provide accurate and consistent long-term imaginations. As the first pioneering Transformer-based world model for multi-agent systems, we introduce a Perceiver Transformer as an effective solution to enable centralized representation aggregation within this context. Results on Starcraft Multi-Agent Challenge (SMAC) show that it outperforms strong model-free approaches and existing model-based methods in both sample efficiency and overall performance.

Soft-QMIX: Integrating Maximum Entropy For Monotonic Value Function Factorization

Authors:Wentse Chen, Shiyu Huang, Jeff Schneider
Date:2024-06-20 01:55:08

Multi-agent reinforcement learning (MARL) tasks often utilize a centralized training with decentralized execution (CTDE) framework. QMIX is a successful CTDE method that learns a credit assignment function to derive local value functions from a global value function, defining a deterministic local policy. However, QMIX is hindered by its poor exploration strategy. While maximum entropy reinforcement learning (RL) promotes better exploration through stochastic policies, QMIX's process of credit assignment conflicts with the maximum entropy objective and the decentralized execution requirement, making it unsuitable for maximum entropy RL. In this paper, we propose an enhancement to QMIX by incorporating an additional local Q-value learning method within the maximum entropy RL framework. Our approach constrains the local Q-value estimates to maintain the correct ordering of all actions. Due to the monotonicity of the QMIX value function, these updates ensure that locally optimal actions align with globally optimal actions. We theoretically prove the monotonic improvement and convergence of our method to an optimal solution. Experimentally, we validate our algorithm in matrix games, Multi-Agent Particle Environment and demonstrate state-of-the-art performance in SMAC-v2.

VELO: A Vector Database-Assisted Cloud-Edge Collaborative LLM QoS Optimization Framework

Authors:Zhi Yao, Zhiqing Tang, Jiong Lou, Ping Shen, Weijia Jia
Date:2024-06-19 09:41:37

The Large Language Model (LLM) has gained significant popularity and is extensively utilized across various domains. Most LLM deployments occur within cloud data centers, where they encounter substantial response delays and incur high costs, thereby impacting the Quality of Services (QoS) at the network edge. Leveraging vector database caching to store LLM request results at the edge can substantially mitigate the response delays and costs associated with similar requests, which has been overlooked by previous research. Addressing these gaps, this paper introduces a novel Vector database-assisted cloud-Edge collaborative LLM QoS Optimization (VELO) framework. Firstly, we propose the VELO framework, which ingeniously employs a vector database to cache the results of some LLM requests at the edge to reduce the response time of subsequent similar requests. Diverging from direct optimization of the LLM, our VELO framework does not necessitate altering the internal structure of the LLM and is broadly applicable to diverse LLMs. Subsequently, building upon the VELO framework, we formulate the QoS optimization problem as a Markov Decision Process (MDP) and devise an algorithm grounded in Multi-Agent Reinforcement Learning (MARL) to decide whether to request the LLM in the cloud or directly return the results from the vector database at the edge. Moreover, to enhance request feature extraction and expedite training, we refine the policy network of MARL and integrate expert demonstrations. Finally, we implement the proposed algorithm within a real edge system. Experimental findings confirm that our VELO framework substantially enhances user satisfaction by concurrently diminishing delay and resource consumption for edge users utilizing LLMs.
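
The edge-side cache decision can be pictured as a nearest-neighbour lookup over request embeddings; the threshold rule below is only a stand-in for the MARL policy the paper actually learns:

    import numpy as np

    def serve_from_edge(query_emb, cache_embs, cache_answers, threshold=0.9):
        """Return a cached answer when a stored request is similar enough,
        else None so the caller forwards the request to the cloud LLM.

        Cosine similarity and the fixed threshold are assumptions of this sketch.
        """
        sims = cache_embs @ query_emb / (
            np.linalg.norm(cache_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
        best = int(np.argmax(sims))
        return cache_answers[best] if sims[best] >= threshold else None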

Communication-Efficient MARL for Platoon Stability and Energy-efficiency Co-optimization in Cooperative Adaptive Cruise Control of CAVs

Authors:Min Hua, Dong Chen, Kun Jiang, Fanggang Zhang, Jinhai Wang, Bo Wang, Quan Zhou, Hongming Xu
Date:2024-06-17 15:34:37

Cooperative adaptive cruise control (CACC) has been recognized as a fundamental function of autonomous driving, in which platoon stability and energy efficiency are outstanding challenges that are difficult to accommodate in real-world operations. This paper studies the CACC of connected and autonomous vehicles (CAVs) based on a multi-agent reinforcement learning (MARL) algorithm to optimize platoon stability and energy efficiency simultaneously. The optimal use of communication bandwidth is the key to guaranteeing learning performance in real-world driving, and thus this paper proposes a communication-efficient MARL by incorporating quantized stochastic gradient descent (QSGD) and a binary differential consensus (BDC) method into a fully-decentralized MARL framework. We benchmarked the performance of our proposed BDC-MARL algorithm against several non-communicative and communicative MARL algorithms, e.g., IA2C, FPrint, and DIAL, through the evaluation of platoon stability, fuel economy, and driving comfort. Our results show that BDC-MARL achieved the highest energy savings, improving by up to 5.8%, with an average velocity of 15.26 m/s and an inter-vehicle spacing of 20.76 m. In addition, we conducted different information-sharing analyses to assess communication efficacy, along with sensitivity analyses and scalability tests with varying platoon sizes. The practical effectiveness of our approach is further demonstrated using real-world scenarios sourced from the open-source OpenACC dataset.
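
The QSGD component compresses gradients with unbiased stochastic rounding so that only a norm, signs, and small integers need to be communicated; a minimal sketch (the BDC consensus step is not reproduced here):

    import numpy as np

    def qsgd_quantize(grad, levels=4, rng=None):
        """Unbiased stochastic quantization in the spirit of QSGD: each
        coordinate of |grad|/norm is randomly rounded to one of `levels` levels
        so that the expected output equals the input gradient."""
        rng = rng or np.random.default_rng()
        norm = np.linalg.norm(grad)
        if norm == 0.0:
            return grad
        scaled = np.abs(grad) / norm * levels
        lower = np.floor(scaled)
        q = lower + (rng.random(grad.shape) < (scaled - lower))  # random round-up
        return np.sign(grad) * q * norm / levels  # E[output] == grad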

Balancing Performance and Cost for Two-Hop Cooperative Communications: Stackelberg Game and Distributed Multi-Agent Reinforcement Learning

Authors:Yuanzhe Geng, Erwu Liu, Wei Ni, Rui Wang, Yan Liu, Hao Xu, Chen Cai, Abbas Jamalipour
Date:2024-06-17 07:12:44

This paper aims to balance performance and cost in a two-hop wireless cooperative communication network where the source and relays have contradictory optimization goals and make decisions in a distributed manner. This differs from most existing works that have typically assumed that source and relay nodes follow a schedule created implicitly by a central controller. We propose that the relays form an alliance in an attempt to maximize the benefit of relaying while the source aims to increase the channel capacity cost-effectively. To this end, we establish the trade problem as a Stackelberg game, and prove the existence of its equilibrium. Another important aspect is that we use multi-agent reinforcement learning (MARL) to approach the equilibrium in a situation where the instantaneous channel state information (CSI) is unavailable, and the source and relays do not have knowledge of each other's goal. A multi-agent deep deterministic policy gradient-based framework is designed, where the relay alliance and the source act as agents. Experiments demonstrate that the proposed method can obtain an acceptable performance that is close to the game-theoretic equilibrium for all players under time-invariant environments, which considerably outperforms its potential alternatives and is only about 2.9% away from the optimal solution.

The Benefits of Power Regularization in Cooperative Reinforcement Learning

Authors:Michelle Li, Michael Dennis
Date:2024-06-17 06:10:37

Cooperative Multi-Agent Reinforcement Learning (MARL) algorithms, trained only to optimize task reward, can lead to a concentration of power where the failure or adversarial intent of a single agent could decimate the reward of every agent in the system. In the context of teams of people, it is often useful to explicitly consider how power is distributed to ensure no person becomes a single point of failure. Here, we argue that explicitly regularizing the concentration of power in cooperative RL systems can result in systems which are more robust to single agent failure, adversarial attacks, and incentive changes of co-players. To this end, we define a practical pairwise measure of power that captures the ability of any co-player to influence the ego agent's reward, and then propose a power-regularized objective which balances task reward and power concentration. Given this new objective, we show that there always exists an equilibrium where every agent is playing a power-regularized best-response balancing power and task reward. Moreover, we present two algorithms for training agents towards this power-regularized objective: Sample Based Power Regularization (SBPR), which injects adversarial data during training; and Power Regularization via Intrinsic Motivation (PRIM), which adds an intrinsic motivation to regulate power to the training objective. Our experiments demonstrate that both algorithms successfully balance task reward and power, leading to lower-power behavior than the task-reward-only baseline and avoiding catastrophic events in case an agent in the system goes off-policy.
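
The regularized objective can be written down directly; the pairwise power measure itself, and how it is estimated, is the paper's contribution, so it appears only as an input in this sketch:

    def power_regularized_objective(task_return, pairwise_power, lam=0.1):
        """Trade task reward against power concentration: penalising the largest
        co-player influence discourages single points of failure.

        pairwise_power[j]: co-player j's estimated ability to swing the ego
        agent's reward (placeholder for the paper's measure); lam is a
        trade-off coefficient chosen for illustration.
        """
        return task_return - lam * max(pairwise_power)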

Tree Search for Simultaneous Move Games via Equilibrium Approximation

Authors:Ryan Yu, Alex Olshevsky, Peter Chin
Date:2024-06-14 21:02:35

Neural network supported tree-search has shown strong results in a variety of perfect information multi-agent tasks. However, the performance of these methods on partial information games has generally been below competing approaches. Here we study the class of simultaneous-move games, a subclass of partial information games that is most similar to perfect information games: both agents know the game state with the exception of the opponent's move, which is revealed only after each agent makes its own move. Simultaneous move games include popular benchmarks such as Google Research Football and Starcraft. In this study we answer the question: can we take tree search algorithms trained through self-play from perfect information settings and adapt them to simultaneous move games without significant loss of performance? We answer this question by deriving a practical method that attempts to approximate a coarse correlated equilibrium as a subroutine within a tree search. Our algorithm works on cooperative, competitive, and mixed tasks. Our results are better than the current best MARL algorithms on a wide range of accepted baseline environments.

Dispelling the Mirage of Progress in Offline MARL through Standardised Baselines and Evaluation

Authors:Claude Formanek, Callum Rhys Tilbury, Louise Beyers, Jonathan Shock, Arnu Pretorius
Date:2024-06-13 12:54:29

Offline multi-agent reinforcement learning (MARL) is an emerging field with great promise for real-world applications. Unfortunately, the current state of research in offline MARL is plagued by inconsistencies in baselines and evaluation protocols, which ultimately makes it difficult to accurately assess progress, trust newly proposed innovations, and allow researchers to easily build upon prior work. In this paper, we firstly identify significant shortcomings in existing methodologies for measuring the performance of novel algorithms through a representative study of published offline MARL work. Secondly, by directly comparing to this prior work, we demonstrate that simple, well-implemented baselines can achieve state-of-the-art (SOTA) results across a wide range of tasks. Specifically, we show that on 35 out of 47 datasets used in prior work (almost 75% of cases), we match or surpass the performance of the current purported SOTA. Strikingly, our baselines often substantially outperform these more sophisticated algorithms. Finally, we correct for the shortcomings highlighted from this prior work by introducing a straightforward standardised methodology for evaluation and by providing our baseline implementations with statistically robust results across several scenarios, useful for comparisons in future work. Our proposal includes simple and sensible steps that are easy to adopt, which in combination with solid baselines and comparative results, could substantially improve the overall rigour of empirical science in offline MARL moving forward.

Equilibrium Selection for Multi-agent Reinforcement Learning: A Unified Framework

Authors:Runyu Zhang, Jeff Shamma, Na Li
Date:2024-06-13 06:09:20

While there are numerous works in multi-agent reinforcement learning (MARL), most of them focus on designing algorithms and proving convergence to a Nash equilibrium (NE) or other equilibrium such as coarse correlated equilibrium. However, NEs can be non-unique and their performance varies drastically. Thus, it is important to design algorithms that converge to Nash equilibrium with better rewards or social welfare. In contrast, classical game theory literature has extensively studied equilibrium selection for multi-agent learning in normal-form games, demonstrating that decentralized learning algorithms can asymptotically converge to potential-maximizing or Pareto-optimal NEs. These insights motivate this paper to investigate equilibrium selection in the MARL setting. We focus on the stochastic game model, leveraging classical equilibrium selection results from normal-form games to propose a unified framework for equilibrium selection in stochastic games. The proposed framework is highly modular and can extend various learning rules and their corresponding equilibrium selection results from normal-form games to the stochastic game setting.

Efficient Adaptation in Mixed-Motive Environments via Hierarchical Opponent Modeling and Planning

Authors:Yizhe Huang, Anji Liu, Fanqi Kong, Yaodong Yang, Song-Chun Zhu, Xue Feng
Date:2024-06-12 08:48:06

Despite the recent successes of multi-agent reinforcement learning (MARL) algorithms, efficiently adapting to co-players in mixed-motive environments remains a significant challenge. One feasible approach is to hierarchically model co-players' behavior based on inferring their characteristics. However, these methods often encounter difficulties in efficient reasoning and utilization of inferred information. To address these issues, we propose Hierarchical Opponent modeling and Planning (HOP), a novel multi-agent decision-making algorithm that enables few-shot adaptation to unseen policies in mixed-motive environments. HOP is hierarchically composed of two modules: an opponent modeling module that infers others' goals and learns corresponding goal-conditioned policies, and a planning module that employs Monte Carlo Tree Search (MCTS) to identify the best response. Our approach improves efficiency by updating beliefs about others' goals both across and within episodes and by using information from the opponent modeling module to guide planning. Experimental results demonstrate that in mixed-motive environments, HOP exhibits superior few-shot adaptation capabilities when interacting with various unseen agents, and excels in self-play scenarios. Furthermore, the emergence of social intelligence during our experiments underscores the potential of our approach in complex multi-agent environments.

Carbon Market Simulation with Adaptive Mechanism Design

Authors:Han Wang, Wenhao Li, Hongyuan Zha, Baoxiang Wang
Date:2024-06-12 05:08:51

A carbon market is a market-based tool that incentivizes economic agents to align individual profits with the global utility, i.e., reducing carbon emissions to tackle climate change. Cap and trade stands as a critical principle based on allocating and trading carbon allowances (carbon emission credit), enabling economic agents to follow planned emissions and penalizing excess emissions. A central authority is responsible for introducing and allocating those allowances in cap and trade. However, the complexity of carbon market dynamics makes accurate simulation intractable, which in turn hinders the design of effective allocation strategies. To address this, we propose an adaptive mechanism design framework, simulating the market using hierarchical, model-free multi-agent reinforcement learning (MARL). Government agents allocate carbon credits, while enterprises engage in economic activities and carbon trading. This framework illustrates agents' behavior comprehensively. Numerical results show MARL enables government agents to balance productivity, equality, and carbon emissions. Our project is available at https://github.com/xwanghan/Carbon-Simulator.

Multi-agent Reinforcement Learning with Deep Networks for Diverse Q-Vectors

Authors:Zhenglong Luo, Zhiyong Chen, James Welsh
Date:2024-06-12 03:30:10

Multi-agent reinforcement learning (MARL) has become a significant research topic due to its ability to facilitate learning in complex environments. In multi-agent tasks, the state-action value, commonly referred to as the Q-value, can vary among agents because of their individual rewards, resulting in a Q-vector. Determining an optimal policy is challenging, as it involves more than just maximizing a single Q-value. Various optimal policies, such as a Nash equilibrium, have been studied in this context. Algorithms like Nash Q-learning and Nash Actor-Critic have shown effectiveness in these scenarios. This paper extends this research by proposing a deep Q-networks (DQN) algorithm capable of learning various Q-vectors using Max, Nash, and Maximin strategies. The effectiveness of this approach is demonstrated in an environment where dual robotic arms collaborate to lift a pot.
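
For two agents, the Max, Nash, and Maximin choices all reduce to simple operations over the Q-vector; the maximin case, for instance, can be sketched as follows (a table-level illustration, not the paper's full DQN):

    import numpy as np

    def maximin_action(q_matrix):
        """q_matrix[i, j]: our Q-value when we play i and the other agent plays j.
        Maximin picks our action assuming the other agent then minimises it."""
        worst_case = q_matrix.min(axis=1)   # other agent's most damaging reply
        return int(np.argmax(worst_case))   # best of the worst cases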

Adaptive Opponent Policy Detection in Multi-Agent MDPs: Real-Time Strategy Switch Identification Using Running Error Estimation

Authors:Mohidul Haque Mridul, Mohammad Foysal Khan, Redwan Ahmed Rizvee, Md Mosaddek Khan
Date:2024-06-10 17:34:44

In Multi-agent Reinforcement Learning (MARL), accurately perceiving opponents' strategies is essential for both cooperative and adversarial contexts, particularly within dynamic environments. While Proximal Policy Optimization (PPO) and related algorithms such as Actor-Critic with Experience Replay (ACER), Trust Region Policy Optimization (TRPO), and Deep Deterministic Policy Gradient (DDPG) perform well in single-agent, stationary environments, they suffer from high variance in MARL due to non-stationary and hidden policies of opponents, leading to diminished reward performance. Additionally, existing methods in MARL face significant challenges, including the need for inter-agent communication, reliance on explicit reward information, high computational demands, and sampling inefficiencies. These issues render them less effective in continuous environments where opponents may abruptly change their policies without prior notice. Against this background, we present OPS-DeMo (Online Policy Switch-Detection Model), an online algorithm that employs dynamic error decay to detect changes in opponents' policies. OPS-DeMo continuously updates its beliefs using an Assumed Opponent Policy (AOP) Bank and selects corresponding responses from a pre-trained Response Policy Bank. Each response policy is trained against consistently strategizing opponents, reducing training uncertainty and enabling the effective use of algorithms like PPO in multi-agent environments. Comparative assessments show that our approach outperforms PPO-trained models in dynamic scenarios like the Predator-Prey setting, providing greater robustness to sudden policy shifts and enabling more informed decision-making through precise opponent policy insights.
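
The running-error detector can be sketched as an exponentially decayed average of how poorly the assumed opponent policy explains observed actions; the constants below are illustrative, not the paper's:

    def update_running_error(error, predicted_prob, decay=0.9, threshold=0.5):
        """One detection step: predicted_prob is the probability the currently
        assumed opponent policy (from the AOP Bank) gave to the observed action.
        A sustained run of surprising actions pushes the error toward 1;
        crossing the threshold signals a likely strategy switch."""
        error = decay * error + (1.0 - decay) * (1.0 - predicted_prob)
        return error, error > threshold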

Risk Sensitivity in Markov Games and Multi-Agent Reinforcement Learning: A Systematic Review

Authors:Hafez Ghaemi, Shirin Jamshidi, Mohammad Mashreghi, Majid Nili Ahmadabadi, Hamed Kebriaei
Date:2024-06-10 06:19:33

Markov games (MGs) and multi-agent reinforcement learning (MARL) are studied to model decision making in multi-agent systems. Traditionally, the objective in MG and MARL has been risk-neutral, i.e., agents are assumed to optimize a performance metric such as expected return, without taking into account subjective or cognitive preferences of themselves or of other agents. However, ignoring such preferences leads to inaccurate models of decision making in many real-world scenarios in finance, operations research, and behavioral economics. Therefore, when these preferences are present, it is necessary to incorporate a suitable measure of risk into the optimization objective of agents, which opens the door to risk-sensitive MG and MARL. In this paper, we systematically review the literature on risk sensitivity in MG and MARL that has been growing in recent years alongside other areas of reinforcement learning and game theory. We define and mathematically describe the different risk measures used in MG and MARL and, for each measure, discuss the articles that incorporate it. Finally, we identify recent trends in theoretical and applied works in the field and discuss possible directions of future research.

Joint Cooperative Clustering and Power Control for Energy-Efficient Cell-Free XL-MIMO with Multi-Agent Reinforcement Learning

Authors:Ziheng Liu, Jiayi Zhang, Zhilong Liu, Derrick Wing Kwan Ng, Bo Ai
Date:2024-06-08 14:09:14

In this paper, we investigate the amalgamation of cell-free (CF) and extremely large-scale multiple-input multiple-output (XL-MIMO) technologies, referred to as a CF XL-MIMO, as a promising advancement for enabling future mobile networks. To address the computational complexity and communication power consumption associated with conventional centralized optimization, we focus on user-centric dynamic networks in which each user is served by an adaptive subset of access points (APs) rather than all of them. We begin our research by analyzing a joint resource allocation problem for energy-efficient CF XL-MIMO systems, encompassing cooperative clustering and power control design, where all clusters are adaptively adjustable. Then, we propose an innovative double-layer multi-agent reinforcement learning (MARL)-based scheme, which offers an effective strategy to tackle the challenges of high-dimensional signal processing. In the numerical results, we compare various algorithms with different network architectures. These comparisons reveal that the proposed MARL-based cooperative architecture can effectively strike a balance between system performance and communication overhead, thereby improving energy efficiency (EE) performance. It is important to note that increasing the number of user equipments participating in information sharing can effectively enhance spectral efficiency (SE) performance, which also leads to an increase in power consumption, resulting in a non-trivial trade-off between the number of participants and EE performance.

Mini Honor of Kings: A Lightweight Environment for Multi-Agent Reinforcement Learning

Authors:Lin Liu, Jian Zhao, Cheng Hu, Zhengtao Cao, Youpeng Zhao, Zhenbin Ye, Meng Meng, Wenjun Wang, Zhaofeng He, Houqiang Li, Xia Lin, Lanxiao Huang
Date:2024-06-06 11:42:33

Games are widely used as research environments for multi-agent reinforcement learning (MARL), but they pose three significant challenges: limited customization, high computational demands, and oversimplification. To address these issues, we introduce the first publicly available map editor for the popular mobile game Honor of Kings and design a lightweight environment, Mini Honor of Kings (Mini HoK), for researchers to conduct experiments. Mini HoK is highly efficient, allowing experiments to be run on personal PCs or laptops while still presenting sufficient challenges for existing MARL algorithms. We have tested our environment on common MARL algorithms and demonstrated that these algorithms have yet to find optimal solutions within this environment. This facilitates the dissemination and advancement of MARL methods within the research community. Additionally, we hope that more researchers will leverage the Honor of Kings map editor to develop innovative and scientifically valuable new maps. Our code and user manual are available at: https://github.com/tencent-ailab/mini-hok.

Kinetic simulations of electron-positron induced streaming instability in the context of gamma-ray halos around pulsars

Authors:Illya Plotnikov, Allard Jan van Marle, Claire Guépin, Alexandre Marcowith, Pierrick Martin
Date:2024-06-05 05:08:57

The possibility of slow diffusion regions as the origin for extended TeV emission halos around some pulsars (such as PSR J0633+1746 and PSR B0656+14) challenges the standard scaling of the electron diffusion coefficient in the interstellar medium. Self-generated turbulence by electron-positron pairs streaming out of the pulsar wind nebula was proposed as a possible mechanism to produce the enhanced turbulence required to explain the morphology and brightness of these TeV halos. We perform fully kinetic 1D3V particle-in-cell simulations of this instability, considering the case where streaming electrons and positrons have the same density. This implies purely resonant instability as the beam does not carry any current. We compare the linear phase of the instability with analytical theory and find very reasonable agreement. The non-linear phase of the instability is also studied, which reveals that the intensity of saturated waves is consistent with a momentum exchange criterion between a decelerating beam and growing magnetic waves. With the adopted parameters, the instability-driven wavemodes cover both the Alfvénic (fluid) and kinetic scales. The spectrum of the produced waves is non-symmetric, with left-handed circular polarisation waves being strongly damped when entering the ion-cyclotron branch, while right-handed waves are suppressed at smaller wavelength when entering the Whistler branch. The low-wavenumber part of the spectrum remains symmetric when in the Alfvénic branch. As a result, positrons behave dynamically differently compared to electrons. We also observed a second harmonic plasma emission in the wave spectrum. An MHD-PIC approach is warranted to probe hotter beams and investigate the Alfvén branch physics. This work confirms that the self-confinement scenario develops essentially according to analytical expectations [...](abridged)

Representation Learning For Efficient Deep Multi-Agent Reinforcement Learning

Authors:Dom Huh, Prasant Mohapatra
Date:2024-06-05 03:11:44

Sample efficiency remains a key challenge in multi-agent reinforcement learning (MARL). A promising approach is to learn a meaningful latent representation space through auxiliary learning objectives alongside the MARL objective to aid in learning a successful control policy. In our work, we present MAPO-LSO (Multi-Agent Policy Optimization with Latent Space Optimization) which applies a form of comprehensive representation learning devised to supplement MARL training. Specifically, MAPO-LSO proposes a multi-agent extension of transition dynamics reconstruction and self-predictive learning that constructs a latent state optimization scheme that can be trivially extended to current state-of-the-art MARL algorithms. Empirical results demonstrate MAPO-LSO to show notable improvements in sample efficiency and learning performance compared to its vanilla MARL counterpart without any additional MARL hyperparameter tuning on a diverse suite of MARL tasks.

FightLadder: A Benchmark for Competitive Multi-Agent Reinforcement Learning

Authors:Wenzhe Li, Zihan Ding, Seth Karten, Chi Jin
Date:2024-06-04 08:04:23

Recent advances in reinforcement learning (RL) heavily rely on a variety of well-designed benchmarks, which provide environmental platforms and consistent criteria to evaluate existing and novel algorithms. Specifically, in multi-agent RL (MARL), a plethora of benchmarks based on cooperative games have spurred the development of algorithms that improve the scalability of cooperative multi-agent systems. However, for the competitive setting, a lightweight and open-sourced benchmark with challenging gaming dynamics and visual inputs has not yet been established. In this work, we present FightLadder, a real-time fighting game platform, to empower competitive MARL research. Along with the platform, we provide implementations of state-of-the-art MARL algorithms for competitive games, as well as a set of evaluation metrics to characterize the performance and exploitability of agents. We demonstrate the feasibility of this platform by training a general agent that consistently defeats 12 built-in characters in single-player mode, and expose the difficulty of training a non-exploitable agent without human knowledge and demonstrations in two-player mode. FightLadder provides meticulously designed environments to address critical challenges in competitive MARL research, aiming to catalyze a new era of discovery and advancement in the field. Videos and code at https://sites.google.com/view/fightladder/home.

Safe Multi-agent Reinforcement Learning with Natural Language Constraints

Authors:Ziyan Wang, Meng Fang, Tristan Tomilin, Fei Fang, Yali Du
Date:2024-05-30 12:57:35

The role of natural language constraints in Safe Multi-agent Reinforcement Learning (MARL) is crucial, yet often overlooked. While Safe MARL has vast potential, especially in fields like robotics and autonomous vehicles, its full potential is limited by the need to define constraints in pre-designed mathematical terms, which requires extensive domain expertise and reinforcement learning knowledge, hindering its broader adoption. To address this limitation and make Safe MARL more accessible and adaptable, we propose a novel approach named Safe Multi-agent Reinforcement Learning with Natural Language constraints (SMALL). Our method leverages fine-tuned language models to interpret and process free-form textual constraints, converting them into semantic embeddings that capture the essence of prohibited states and behaviours. These embeddings are then integrated into the multi-agent policy learning process, enabling agents to learn policies that minimize constraint violations while optimizing rewards. To evaluate the effectiveness of SMALL, we introduce LaMaSafe, a multi-task benchmark designed to assess the performance of multiple agents in adhering to natural language constraints. Empirical evaluations across various environments demonstrate that SMALL achieves comparable rewards and significantly fewer constraint violations, highlighting its effectiveness in understanding and enforcing natural language constraints.
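
The embedding-based constraint signal can be sketched as a similarity score against the embedded prohibitions, to be subtracted from the reward; the encoder, similarity choice, and scale are assumptions of this sketch:

    import numpy as np

    def constraint_penalty(state_emb, constraint_embs, beta=1.0):
        """Cost that grows as the current state/behaviour embedding approaches
        any embedded natural-language prohibition (cosine similarity)."""
        sims = constraint_embs @ state_emb / (
            np.linalg.norm(constraint_embs, axis=1) * np.linalg.norm(state_emb) + 1e-8)
        return beta * float(np.clip(sims.max(), 0.0, None))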

LAGMA: LAtent Goal-guided Multi-Agent Reinforcement Learning

Authors:Hyungho Na, Il-chul Moon
Date:2024-05-30 12:34:58

In cooperative multi-agent reinforcement learning (MARL), agents collaborate to achieve common goals, such as defeating enemies and scoring a goal. However, learning goal-reaching paths toward such a semantic goal takes a considerable amount of time in complex tasks and the trained model often fails to find such paths. To address this, we present LAtent Goal-guided Multi-Agent reinforcement learning (LAGMA), which generates a goal-reaching trajectory in latent space and provides a latent goal-guided incentive to transitions toward this reference trajectory. LAGMA consists of three major components: (a) quantized latent space constructed via a modified VQ-VAE for efficient sample utilization, (b) goal-reaching trajectory generation via extended VQ codebook, and (c) latent goal-guided intrinsic reward generation to encourage transitions towards the sampled goal-reaching path. The proposed method is evaluated by StarCraft II with both dense and sparse reward settings and Google Research Football. Empirical results show further performance improvement over state-of-the-art baselines.

Efficient Learning in Chinese Checkers: Comparing Parameter Sharing in Multi-Agent Reinforcement Learning

Authors:Noah Adhikari, Allen Gu
Date:2024-05-29 03:27:30

We show that multi-agent reinforcement learning (MARL) with full parameter sharing outperforms independent and partially shared architectures in the competitive perfect-information homogeneous game of Chinese Checkers. To run our experiments, we develop a new MARL environment: variable-size, six-player Chinese Checkers. This custom environment was developed in PettingZoo and supports all traditional rules of the game including chaining jumps. This is, to the best of our knowledge, the first implementation of Chinese Checkers that remains faithful to the true game. Chinese Checkers is difficult to learn due to its large branching factor and potentially infinite horizons. We borrow the concept of branching actions (submoves) from complex action spaces in other RL domains, where a submove may not end a player's turn immediately. This drastically reduces the dimensionality of the action space. Our observation space is inspired by AlphaGo with many binary game boards stacked in a 3D array to encode information. The PettingZoo environment, training and evaluation logic, and analysis scripts can be found on GitHub at https://github.com/noahadhikari/pettingzoo-chinese-checkers.

Safe Multi-Agent Reinforcement Learning with Bilevel Optimization in Autonomous Driving

Authors:Zhi Zheng, Shangding Gu
Date:2024-05-28 14:15:18

Ensuring safety in multi-agent reinforcement learning (MARL), particularly when deploying it in real-world applications such as autonomous driving, emerges as a critical challenge. To address this challenge, traditional safe MARL methods extend MARL approaches to incorporate safety considerations, aiming to minimize safety risk values. However, these safe MARL algorithms often fail to model other agents and lack convergence guarantees, particularly in dynamically complex environments. In this study, we propose a safe MARL method grounded in a Stackelberg model with bi-level optimization, for which convergence analysis is provided. Derived from our theoretical analysis, we develop two practical algorithms, namely Constrained Stackelberg Q-learning (CSQ) and Constrained Stackelberg Multi-Agent Deep Deterministic Policy Gradient (CS-MADDPG), designed to facilitate MARL decision-making in autonomous driving applications. To evaluate the effectiveness of our algorithms, we developed a safe MARL autonomous driving benchmark and conducted experiments on challenging autonomous driving scenarios, such as merges, roundabouts, intersections, and racetracks. The experimental results indicate that our algorithms, CSQ and CS-MADDPG, outperform several strong MARL baselines, such as Bi-AC, MACPO, and MAPPO-L, regarding reward and safety performance. The demos and source code are available at https://github.com/SafeRL-Lab/Safe-MARL-in-Autonomous-Driving.git.

Individual Contributions as Intrinsic Exploration Scaffolds for Multi-agent Reinforcement Learning

Authors:Xinran Li, Zifan Liu, Shibo Chen, Jun Zhang
Date:2024-05-28 12:18:19

In multi-agent reinforcement learning (MARL), effective exploration is critical, especially in sparse reward environments. Although introducing global intrinsic rewards can foster exploration in such settings, it often complicates credit assignment among agents. To address this difficulty, we propose Individual Contributions as intrinsic Exploration Scaffolds (ICES), a novel approach to motivate exploration by assessing each agent's contribution from a global view. In particular, ICES constructs exploration scaffolds with Bayesian surprise, leveraging global transition information during centralized training. These scaffolds, used only in training, help to guide individual agents towards actions that significantly impact the global latent state transitions. Additionally, ICES separates exploration policies from exploitation policies, enabling the former to utilize privileged global information during training. Extensive experiments on cooperative benchmark tasks with sparse rewards, including Google Research Football (GRF) and StarCraft Multi-agent Challenge (SMAC), demonstrate that ICES exhibits superior exploration capabilities compared with baselines. The code is publicly available at https://github.com/LXXXXR/ICES.

LNS2+RL: Combining Multi-Agent Reinforcement Learning with Large Neighborhood Search in Multi-Agent Path Finding

Authors:Yutong Wang, Tanishq Duhan, Jiaoyang Li, Guillaume Sartoretti
Date:2024-05-28 03:45:32

Multi-Agent Path Finding (MAPF) is a critical component of logistics and warehouse management, which focuses on planning collision-free paths for a team of robots in a known environment. Recent work introduced a novel MAPF approach, LNS2, which proposed to repair a quickly obtained set of infeasible paths via iterative replanning, by relying on a fast, yet lower-quality, prioritized planning (PP) algorithm. At the same time, there has been a recent push for Multi-Agent Reinforcement Learning (MARL) based MAPF algorithms, which exhibit improved cooperation over such PP algorithms, although inevitably remaining slower. In this paper, we introduce a new MAPF algorithm, LNS2+RL, which combines the distinct yet complementary characteristics of LNS2 and MARL to effectively balance their individual limitations and get the best of both worlds. During early iterations, LNS2+RL relies on MARL for low-level replanning, which we show eliminates collisions far more effectively than a PP algorithm. Our MARL-based planner allows agents to reason about past and future information to gradually learn cooperative decision-making through a carefully designed curriculum. At later stages of planning, LNS2+RL adaptively switches to a PP algorithm to quickly resolve the remaining collisions, naturally trading off solution quality (number of collisions in the solution) and computational efficiency. Our comprehensive experiments on high-agent-density tasks across various team sizes, world sizes, and map structures consistently demonstrate the superior performance of LNS2+RL compared to many MAPF algorithms, including LNS2, LaCAM, EECBS, and SCRIMP. In maps with complex structures, the advantages of LNS2+RL are particularly pronounced, with LNS2+RL achieving a success rate of over 50% in nearly half of the tested tasks, while the success rates of LaCAM, EECBS, and SCRIMP fall to 0%.
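
A control-flow sketch of the adaptive switch described above (the threshold rule and helper functions are our assumptions; LNS2+RL's actual switching criterion may differ):

```python
def lns2_rl_repair(paths, marl_replan, pp_replan, count_collisions,
                   select_neighborhood, switch_ratio=0.1, max_iters=1000):
    """Iteratively replan a neighborhood of agents: use the slower but more
    cooperative MARL planner while collisions remain plentiful, then switch
    to fast prioritized planning (PP) to clear the remainder."""
    initial = max(count_collisions(paths), 1)
    for _ in range(max_iters):
        n = count_collisions(paths)
        if n == 0:
            break
        agents = select_neighborhood(paths)      # subset of agents to replan
        planner = marl_replan if n > switch_ratio * initial else pp_replan
        paths = planner(paths, agents)
    return paths
```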

Active flow control for drag reduction through multi-agent reinforcement learning on a turbulent cylinder at $Re_D=3900$

Authors:P. Suárez, F. Álcantara-Ávila, A. Miró, J. Rabault, B. Font, O. Lehmkuhl, R. Vinuesa
Date:2024-05-27 20:55:10

This study presents novel drag-reduction active-flow-control (AFC) strategies for a three-dimensional cylinder immersed in a flow at a Reynolds number based on freestream velocity and cylinder diameter of $Re_D=3900$. The cylinder in this subcritical flow regime has been extensively studied in the literature and is considered a classic case of turbulent flow arising from a bluff body. The strategies presented are explored through the use of deep reinforcement learning. The cylinder is equipped with 10 independent zero-net-mass-flux jet pairs, distributed on the top and bottom surfaces, which define the AFC setup. The method is based on the coupling between a computational-fluid-dynamics solver and a multi-agent reinforcement-learning (MARL) framework using the proximal-policy-optimization algorithm. This work introduces a multi-stage training approach to expand the exploration space and enhance drag reduction stabilization. By accelerating training through the exploitation of local invariants with MARL, a drag reduction of approximately 9% is achieved. The cooperative closed-loop strategy developed by the agents is sophisticated, as it utilizes a wide bandwidth of mass-flow-rate frequencies, which classical control methods are unable to match. Notably, the mass cost efficiency is demonstrated to be two orders of magnitude lower than that of classical control methods reported in the literature. These developments represent a significant advancement in active flow control in turbulent regimes, critical for industrial applications.

Flow control of three-dimensional cylinders transitioning to turbulence via multi-agent reinforcement learning

Authors:P. Suárez, F. Álcantara-Ávila, J. Rabault, A. Miró, B. Font, O. Lehmkuhl, R. Vinuesa
Date:2024-05-27 14:30:09

Designing active-flow-control (AFC) strategies for three-dimensional (3D) bluff bodies is a challenging task with critical industrial implications. In this study we explore the potential of discovering novel control strategies for drag reduction using deep reinforcement learning. We introduce a high-dimensional AFC setup on a 3D cylinder, considering Reynolds numbers ($Re_D$) from $100$ to $400$, a range that includes the transition to 3D wake instabilities. The setup involves multiple zero-net-mass-flux jets positioned on the top and bottom surfaces, aligned into two slots. The method relies on coupling the computational-fluid-dynamics solver with a multi-agent reinforcement-learning (MARL) framework based on the proximal-policy-optimization algorithm. MARL offers several advantages: it exploits local invariance, enables adaptable control across geometries, facilitates transfer learning and the cross-application of agents, and results in a significant training speedup. For instance, our results demonstrate $21\%$ drag reduction for $Re_D=300$, outperforming classical periodic control, which yields up to $6\%$ reduction. To the authors' knowledge, the present MARL-based framework represents the first time that training has been conducted on 3D cylinders. This breakthrough paves the way for conducting AFC on progressively more complex turbulent-flow configurations.

Knowing What Not to Do: Leverage Language Model Insights for Action Space Pruning in Multi-agent Reinforcement Learning

Authors:Zhihao Liu, Xianliang Yang, Zichuan Liu, Yifan Xia, Wei Jiang, Yuanyu Zhang, Lijuan Li, Guoliang Fan, Lei Song, Bian Jiang
Date:2024-05-27 06:00:24

Multi-agent reinforcement learning (MARL) is employed to develop autonomous agents that can learn to adopt cooperative or competitive strategies within complex environments. However, a linear increase in the number of agents leads to a combinatorial explosion of the action space, which may result in algorithmic instability, difficulty in convergence, or entrapment in local optima. While researchers have designed a variety of effective algorithms to compress the action space, these methods also introduce new challenges, such as the need for manually designed prior knowledge or reliance on the structure of the problem, which diminishes their applicability. In this paper, we introduce Evolutionary action SPAce Reduction with Knowledge (eSpark), an exploration function generation framework driven by large language models (LLMs) to boost exploration and prune unnecessary actions in MARL. Using just a basic prompt that outlines the overall task and setting, eSpark is capable of generating exploration functions in a zero-shot manner, identifying and pruning redundant or irrelevant state-action pairs, and then achieving autonomous improvement from policy feedback. In reinforcement learning tasks involving inventory management and traffic light control, encompassing a total of 15 scenarios, eSpark consistently outperforms the base MARL algorithm it is combined with, achieving average performance gains of 34.4% and 9.9% on the two types of tasks, respectively. Additionally, eSpark has proven capable of managing situations with a large number of agents, securing a 29.7% improvement in scalability challenges featuring over 500 agents. The code can be found at https://github.com/LiuZhihao2022/eSpark.git.
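
The pruning step can be pictured as an action mask applied before the softmax; a hedged sketch (the exploration-function interface is our assumption):

```python
import torch

def masked_policy_probs(logits, state, exploration_fn):
    """`exploration_fn` plays the role of an LLM-generated predicate that
    keeps only promising actions for this state; masked actions get
    probability zero. Assumes at least one action remains unmasked."""
    keep = exploration_fn(state)                      # bool tensor over actions
    masked = logits.masked_fill(~keep, float("-inf"))
    return torch.softmax(masked, dim=-1)
```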

Variational Offline Multi-agent Skill Discovery

Authors:Jiayu Chen, Bhargav Ganguly, Tian Lan, Vaneet Aggarwal
Date:2024-05-26 00:24:46

Skills are effective temporal abstractions established for sequential decision making, which enable efficient hierarchical learning for long-horizon tasks and facilitate multi-task learning through their transferability. Despite extensive research, gaps remain in multi-agent scenarios, particularly for automatically extracting subgroup coordination patterns in a multi-agent task. To this end, we propose two novel auto-encoder schemes, VO-MASD-3D and VO-MASD-Hier, to simultaneously capture subgroup- and temporal-level abstractions and form multi-agent skills, which, to our knowledge, constitutes the first solution to this challenge. An essential algorithmic component of these schemes is a dynamic grouping function that can automatically detect latent subgroups based on agent interactions in a task. Our method can be applied to offline multi-task data, and the discovered subgroup skills can be transferred across relevant tasks without retraining. Empirical evaluations on StarCraft tasks indicate that our approach significantly outperforms existing hierarchical multi-agent reinforcement learning (MARL) methods. Moreover, skills discovered using our method can effectively reduce the learning difficulty in MARL scenarios with delayed and sparse reward signals.

eQMARL: Entangled Quantum Multi-Agent Reinforcement Learning for Distributed Cooperation over Quantum Channels

Authors:Alexander DeRieux, Walid Saad
Date:2024-05-24 18:43:05

Collaboration is a key challenge in distributed multi-agent reinforcement learning (MARL) environments. Learning frameworks for these decentralized systems must weigh the benefits of explicit player coordination against the communication overhead and computational cost of sharing local observations and environmental data. Quantum computing has sparked a potential synergy between quantum entanglement and cooperation in multi-agent environments, which could enable more efficient distributed collaboration with minimal information sharing. This relationship is largely unexplored, however, as current state-of-the-art quantum MARL (QMARL) implementations rely on classical information sharing rather than entanglement over a quantum channel as a coordination medium. In contrast, in this paper, a novel framework dubbed entangled QMARL (eQMARL) is proposed. The proposed eQMARL is a distributed actor-critic framework that facilitates cooperation over a quantum channel and eliminates local observation sharing via a quantum entangled split critic. Introducing a quantum critic uniquely spread across the agents allows coupling of local observation encoders through entangled input qubits over a quantum channel, which requires no explicit sharing of local observations and reduces classical communication overhead. Further, agent policies are tuned through joint observation-value function estimation via joint quantum measurements, thereby reducing the centralized computational burden. Experimental results show that eQMARL with ${\Psi}^{+}$ entanglement converges to a cooperative strategy up to $17.8\%$ faster and with a higher overall score compared to split classical and fully centralized classical and quantum baselines. The results also show that eQMARL achieves this performance with a constant factor of $25$-times fewer centralized parameters compared to the split classical baseline.

Extended Reality (XR) Codec Adaptation in 5G using Multi-Agent Reinforcement Learning with Attention Action Selection

Authors:Pedro Enrique Iturria-Rivera, Raimundas Gaigalas, Medhat Elsayed, Majid Bavand, Yigit Ozcan, Melike Erol-Kantarci
Date:2024-05-24 18:34:00

Extended Reality (XR) services will revolutionize applications over 5th and 6th generation wireless networks by providing seamless virtual and augmented reality experiences. These applications impose significant challenges on network infrastructure, which can be addressed by machine learning algorithms due to their adaptability. This paper presents a Multi-Agent Reinforcement Learning (MARL) solution for optimizing codec parameters of XR traffic, comparing it to the Adjust Packet Size (APS) algorithm. Our cooperative multi-agent system uses an Optimistic Mixture of Q-Values (oQMIX) approach for handling Cloud Gaming (CG), Augmented Reality (AR), and Virtual Reality (VR) traffic. Enhancements include an attention mechanism and a slate-Markov Decision Process (MDP) for improved action selection. Simulations show our solution outperforms APS with average gains of 30.1%, 15.6%, 16.5%, and 50.3% in XR index, jitter, delay, and Packet Loss Ratio (PLR), respectively. APS tends to increase throughput but also packet losses, whereas oQMIX reduces PLR, delay, and jitter while maintaining goodput.

Controlling Behavioral Diversity in Multi-Agent Reinforcement Learning

Authors:Matteo Bettini, Ryan Kortvelesy, Amanda Prorok
Date:2024-05-23 21:03:33

The study of behavioral diversity in Multi-Agent Reinforcement Learning (MARL) is a nascent yet promising field. In this context, the present work deals with the question of how to control the diversity of a multi-agent system. With no existing approaches to control diversity to a set value, current solutions focus on blindly promoting it via intrinsic rewards or additional loss functions, effectively changing the learning objective and lacking a principled measure for it. To address this, we introduce Diversity Control (DiCo), a method able to control diversity to an exact value of a given metric by representing policies as the sum of a parameter-shared component and dynamically scaled per-agent components. By applying constraints directly to the policy architecture, DiCo leaves the learning objective unchanged, enabling its applicability to any actor-critic MARL algorithm. We theoretically prove that DiCo achieves the desired diversity, and we provide several experiments, both in cooperative and competitive tasks, that show how DiCo can be employed as a novel paradigm to increase performance and sample efficiency in MARL. Multimedia results are available on the paper's website: https://sites.google.com/view/dico-marl.
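
A sketch of the policy parameterization DiCo describes, with the per-agent deviation rescaled to a target norm (the exact normalization used by DiCo may differ; this is illustrative):

```python
import torch
import torch.nn as nn

class DiCoStylePolicy(nn.Module):
    """Actor output = parameter-shared head + per-agent deviation whose
    norm is pinned to a target value, so behavioral distance from the
    shared policy is controlled rather than merely encouraged."""
    def __init__(self, obs_dim, act_dim, n_agents, target_dev=0.5):
        super().__init__()
        self.shared = nn.Linear(obs_dim, act_dim)
        self.per_agent = nn.ModuleList(
            [nn.Linear(obs_dim, act_dim) for _ in range(n_agents)])
        self.target_dev = target_dev

    def forward(self, obs, agent_id):
        base = self.shared(obs)
        dev = self.per_agent[agent_id](obs)
        dev = self.target_dev * dev / (dev.norm(dim=-1, keepdim=True) + 1e-8)
        return base + dev                      # deviation norm held at target
```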

A finite time analysis of distributed Q-learning

Authors:Han-Dong Lim, Donghwan Lee
Date:2024-05-23 00:52:38

Multi-agent reinforcement learning (MARL) has witnessed a remarkable surge in interest, fueled by the empirical success achieved in applications of single-agent reinforcement learning (RL). In this study, we consider a distributed Q-learning scenario, wherein a number of agents cooperatively solve a sequential decision making problem without access to the central reward function, which is an average of the local rewards. In particular, we study the finite-time analysis of a distributed Q-learning algorithm, and provide a new sample complexity result of $\tilde{\mathcal{O}}\left( \min\left\{\frac{1}{\epsilon^2}\frac{t_{\text{mix}}}{(1-\gamma)^6 d_{\min}^4 } ,\frac{1}{\epsilon}\frac{\sqrt{|\mathcal{S}||\mathcal{A}|}}{(1-\sigma_2(\boldsymbol{W}))(1-\gamma)^4 d_{\min}^3} \right\}\right)$ under a tabular lookup-table setting.
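
To fix ideas, one plausible reading of such a distributed scheme is a consensus step over neighbors' Q-tables followed by a local TD update with each agent's own reward (a sketch under our assumptions, with $\boldsymbol{W}$ taken as a doubly stochastic mixing matrix; not the paper's exact algorithm):

```python
import numpy as np

def distributed_q_step(Q, W, transitions, alpha=0.1, gamma=0.99):
    """Q: per-agent tables of shape [N, S, A]; W: [N, N] mixing matrix;
    transitions: per-agent (s, a, local_reward, s_next). Each agent only
    sees its local reward, yet consensus drives the tables toward a
    solution for the average reward."""
    Q_mix = np.tensordot(W, Q, axes=1)        # average neighbors' tables
    for i, (s, a, r, s_next) in enumerate(transitions):
        td = r + gamma * Q_mix[i, s_next].max() - Q_mix[i, s, a]
        Q_mix[i, s, a] += alpha * td          # TD step with the local reward
    return Q_mix
```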

Euclid. I. Overview of the Euclid mission

Authors:Euclid Collaboration, Y. Mellier, Abdurro'uf, J. A. Acevedo Barroso, A. Achúcarro, J. Adamek, R. Adam, G. E. Addison, N. Aghanim, M. Aguena, V. Ajani, Y. Akrami, A. Al-Bahlawan, A. Alavi, I. S. Albuquerque, G. Alestas, G. Alguero, A. Allaoui, S. W. Allen, V. Allevato, A. V. Alonso-Tetilla, B. Altieri, A. Alvarez-Candal, S. Alvi, A. Amara, L. Amendola, J. Amiaux, I. T. Andika, S. Andreon, A. Andrews, G. Angora, R. E. Angulo, F. Annibali, A. Anselmi, S. Anselmi, S. Arcari, M. Archidiacono, G. Aricò, M. Arnaud, S. Arnouts, M. Asgari, J. Asorey, L. Atayde, H. Atek, F. Atrio-Barandela, M. Aubert, E. Aubourg, T. Auphan, N. Auricchio, B. Aussel, H. Aussel, P. P. Avelino, A. Avgoustidis, S. Avila, S. Awan, R. Azzollini, C. Baccigalupi, E. Bachelet, D. Bacon, M. Baes, M. B. Bagley, B. Bahr-Kalus, A. Balaguera-Antolinez, E. Balbinot, M. Balcells, M. Baldi, I. Baldry, A. Balestra, M. Ballardini, O. Ballester, M. Balogh, E. Bañados, R. Barbier, S. Bardelli, M. Baron, T. Barreiro, R. Barrena, J. -C. Barriere, B. J. Barros, A. Barthelemy, N. Bartolo, A. Basset, P. Battaglia, A. J. Battisti, C. M. Baugh, L. Baumont, L. Bazzanini, J. -P. Beaulieu, V. Beckmann, A. N. Belikov, J. Bel, F. Bellagamba, M. Bella, E. Bellini, K. Benabed, R. Bender, G. Benevento, C. L. Bennett, K. Benson, P. Bergamini, J. R. Bermejo-Climent, F. Bernardeau, D. Bertacca, M. Berthe, J. Berthier, M. Bethermin, F. Beutler, C. Bevillon, S. Bhargava, R. Bhatawdekar, D. Bianchi, L. Bisigello, A. Biviano, R. P. Blake, A. Blanchard, J. Blazek, L. Blot, A. Bosco, C. Bodendorf, T. Boenke, H. Böhringer, P. Boldrini, M. Bolzonella, A. Bonchi, M. Bonici, D. Bonino, L. Bonino, C. Bonvin, W. Bon, J. T. Booth, S. Borgani, A. S. Borlaff, E. Borsato, A. Bosco, B. Bose, M. T. Botticella, A. Boucaud, F. Bouche, J. S. Boucher, D. Boutigny, T. Bouvard, R. Bouwens, H. Bouy, R. A. A. Bowler, V. Bozza, E. Bozzo, E. Branchini, G. Brando, S. Brau-Nogue, P. Brekke, M. N. Bremer, M. Brescia, M. -A. Breton, J. Brinchmann, T. Brinckmann, C. Brockley-Blatt, M. Brodwin, L. Brouard, M. L. Brown, S. Bruton, J. Bucko, H. Buddelmeijer, G. Buenadicha, F. Buitrago, P. Burger, C. Burigana, V. Busillo, D. Busonero, R. Cabanac, L. Cabayol-Garcia, M. S. Cagliari, A. Caillat, L. Caillat, M. Calabrese, A. Calabro, G. Calderone, F. Calura, B. Camacho Quevedo, S. Camera, L. Campos, G. Canas-Herrera, G. P. Candini, M. Cantiello, V. Capobianco, E. Cappellaro, N. Cappelluti, A. Cappi, K. I. Caputi, C. Cara, C. Carbone, V. F. Cardone, E. Carella, R. G. Carlberg, M. Carle, L. Carminati, F. Caro, J. M. Carrasco, J. Carretero, P. Carrilho, J. Carron Duque, B. Carry, A. Carvalho, C. S. Carvalho, R. Casas, S. Casas, P. Casenove, C. M. Casey, P. Cassata, F. J. Castander, D. Castelao, M. Castellano, L. Castiblanco, G. Castignani, T. Castro, C. Cavet, S. Cavuoti, P. -Y. Chabaud, K. C. Chambers, Y. Charles, S. Charlot, N. Chartab, R. Chary, F. Chaumeil, H. Cho, G. Chon, E. Ciancetta, P. Ciliegi, A. Cimatti, M. Cimino, M. -R. L. Cioni, R. Claydon, C. Cleland, B. Clément, D. L. Clements, N. Clerc, S. Clesse, S. Codis, F. Cogato, J. Colbert, R. E. Cole, P. Coles, T. E. Collett, R. S. Collins, C. Colodro-Conde, C. Colombo, F. Combes, V. Conforti, G. Congedo, S. Conseil, C. J. Conselice, S. Contarini, T. Contini, L. Conversi, A. R. Cooray, Y. Copin, P. -S. Corasaniti, P. Corcho-Caballero, L. Corcione, O. Cordes, O. Corpace, M. Correnti, M. Costanzi, A. Costille, F. Courbin, L. Courcoult Mifsud, H. M. Courtois, M. -C. Cousinou, G. Covone, T. Cowell, C. Cragg, G. Cresci, S. Cristiani, M. 
Crocce, M. Cropper, P. E Crouzet, B. Csizi, J. -G. Cuby, E. Cucchetti, O. Cucciati, J. -C. Cuillandre, P. A. C. Cunha, V. Cuozzo, E. Daddi, M. D'Addona, C. Dafonte, N. Dagoneau, E. Dalessandro, G. B. Dalton, G. D'Amico, H. Dannerbauer, P. Danto, I. Das, A. Da Silva, R. da Silva, W. d'Assignies Doumerg, G. Daste, J. E. Davies, S. Davini, P. Dayal, T. de Boer, R. Decarli, B. De Caro, H. Degaudenzi, G. Degni, J. T. A. de Jong, L. F. de la Bella, S. de la Torre, F. Delhaise, D. Delley, G. Delucchi, G. De Lucia, J. Denniston, F. De Paolis, M. De Petris, A. Derosa, S. Desai, V. Desjacques, G. Despali, G. Desprez, J. De Vicente-Albendea, Y. Deville, J. D. F. Dias, A. Díaz-Sánchez, J. J. Diaz, S. Di Domizio, J. M. Diego, D. Di Ferdinando, A. M. Di Giorgio, P. Dimauro, J. Dinis, K. Dolag, C. Dolding, H. Dole, H. Domínguez Sánchez, O. Doré, F. Dournac, M. Douspis, H. Dreihahn, B. Droge, B. Dryer, F. Dubath, P. -A. Duc, F. Ducret, C. Duffy, F. Dufresne, C. A. J. Duncan, X. Dupac, V. Duret, R. Durrer, F. Durret, S. Dusini, A. Ealet, A. Eggemeier, P. R. M. Eisenhardt, D. Elbaz, M. Y. Elkhashab, A. Ellien, J. Endicott, A. Enia, T. Erben, J. A. Escartin Vigo, S. Escoffier, I. Escudero Sanz, J. Essert, S. Ettori, M. Ezziati, G. Fabbian, M. Fabricius, Y. Fang, A. Farina, M. Farina, R. Farinelli, S. Farrens, F. Faustini, A. Feltre, A. M. N. Ferguson, P. Ferrando, A. G. Ferrari, A. Ferré-Mateu, P. G. Ferreira, I. Ferreras, I. Ferrero, S. Ferriol, P. Ferruit, D. Filleul, F. Finelli, S. L. Finkelstein, A. Finoguenov, B. Fiorini, F. Flentge, P. Focardi, J. Fonseca, A. Fontana, F. Fontanot, F. Fornari, P. Fosalba, M. Fossati, S. Fotopoulou, D. Fouchez, N. Fourmanoit, M. Frailis, D. Fraix-Burnet, E. Franceschi, A. Franco, P. Franzetti, J. Freihoefer, C. . S. Frenk, G. Frittoli, P. -A. Frugier, N. Frusciante, A. Fumagalli, M. Fumagalli, M. Fumana, Y. Fu, L. Gabarra, S. Galeotta, L. Galluccio, K. Ganga, H. Gao, J. García-Bellido, K. Garcia, J. P. Gardner, B. Garilli, L. -M. Gaspar-Venancio, T. Gasparetto, V. Gautard, R. Gavazzi, E. Gaztanaga, L. Genolet, R. Genova Santos, F. Gentile, K. George, M. Gerbino, Z. Ghaffari, F. Giacomini, F. Gianotti, G. P. S. Gibb, W. Gillard, B. Gillis, M. Ginolfi, C. Giocoli, M. Girardi, S. K. Giri, L. W. K. Goh, P. Gómez-Alvarez, V. Gonzalez-Perez, A. H. Gonzalez, E. J. Gonzalez, J. C. Gonzalez, S. Gouyou Beauchamps, G. Gozaliasl, J. Gracia-Carpio, S. Grandis, B. R. Granett, M. Granvik, A. Grazian, A. Gregorio, C. Grenet, C. Grillo, F. Grupp, C. Gruppioni, A. Gruppuso, C. Guerbuez, S. Guerrini, M. Guidi, P. Guillard, C. M. Gutierrez, P. Guttridge, L. Guzzo, S. Gwyn, J. Haapala, J. Haase, C. R. Haddow, M. Hailey, A. Hall, D. Hall, N. Hamaus, B. S. Haridasu, J. Harnois-Déraps, C. Harper, W. G. Hartley, G. Hasinger, F. Hassani, N. A. Hatch, S. V. H. Haugan, B. Häußler, A. Heavens, L. Heisenberg, A. Helmi, G. Helou, S. Hemmati, K. Henares, O. Herent, C. Hernández-Monteagudo, T. Heuberger, P. C. Hewett, S. Heydenreich, H. Hildebrandt, M. Hirschmann, J. Hjorth, J. Hoar, H. Hoekstra, A. D. Holland, M. S. Holliman, W. Holmes, I. Hook, B. Horeau, F. Hormuth, A. Hornstrup, S. Hosseini, D. Hu, P. Hudelot, M. J. Hudson, M. Huertas-Company, E. M. Huff, A. C. N. Hughes, A. Humphrey, L. K. Hunt, D. D. Huynh, R. Ibata, K. Ichikawa, S. Iglesias-Groth, O. Ilbert, S. Ilić, L. Ingoglia, E. Iodice, H. Israel, U. E. Israelsson, L. Izzo, P. Jablonka, N. Jackson, J. Jacobson, M. Jafariyazani, K. Jahnke, B. Jain, H. Jansen, M. J. Jarvis, J. Jasche, M. Jauzac, N. Jeffrey, M. Jhabvala, Y. Jimenez-Teja, A. 
Jimenez Muñoz, B. Joachimi, P. H. Johansson, S. Joudaki, E. Jullo, J. J. E. Kajava, Y. Kang, A. Kannawadi, V. Kansal, D. Karagiannis, M. Kärcher, A. Kashlinsky, M. V. Kazandjian, F. Keck, E. Keihänen, E. Kerins, S. Kermiche, A. Khalil, A. Kiessling, K. Kiiveri, M. Kilbinger, J. Kim, R. King, C. C. Kirkpatrick, T. Kitching, M. Kluge, M. Knabenhans, J. H. Knapen, A. Knebe, J. -P. Kneib, R. Kohley, L. V. E. Koopmans, H. Koskinen, E. Koulouridis, R. Kou, A. Kovács, I. Kovačić, A. Kowalczyk, K. Koyama, K. Kraljic, O. Krause, S. Kruk, B. Kubik, U. Kuchner, K. Kuijken, M. Kümmel, M. Kunz, H. Kurki-Suonio, F. Lacasa, C. G. Lacey, F. La Franca, N. Lagarde, O. Lahav, C. Laigle, A. La Marca, O. La Marle, B. Lamine, M. C. Lam, A. Lançon, H. Landt, M. Langer, A. Lapi, C. Larcheveque, S. S. Larsen, M. Lattanzi, F. Laudisio, D. Laugier, R. Laureijs, V. Laurent, G. Lavaux, A. Lawrenson, A. Lazanu, T. Lazeyras, Q. Le Boulc'h, A. M. C. Le Brun, V. Le Brun, F. Leclercq, S. Lee, J. Le Graet, L. Legrand, K. N. Leirvik, M. Le Jeune, M. Lembo, D. Le Mignant, M. D. Lepinzan, F. Lepori, A. Le Reun, G. Leroy, G. F. Lesci, J. Lesgourgues, L. Leuzzi, M. E. Levi, T. I. Liaudat, G. Libet, P. Liebing, S. Ligori, P. B. Lilje, C. -C. Lin, D. Linde, E. Linder, V. Lindholm, L. Linke, S. -S. Li, S. J. Liu, I. Lloro, F. S. N. Lobo, N. Lodieu, M. Lombardi, L. Lombriser, P. Lonare, G. Longo, M. López-Caniego, X. Lopez Lopez, J. Lorenzo Alvarez, A. Loureiro, J. Loveday, E. Lusso, J. Macias-Perez, T. Maciaszek, G. Maggio, M. Magliocchetti, F. Magnard, E. A. Magnier, A. Magro, G. Mahler, G. Mainetti, D. Maino, E. Maiorano, E. Maiorano, N. Malavasi, G. A. Mamon, C. Mancini, R. Mandelbaum, M. Manera, A. Manjón-García, F. Mannucci, O. Mansutti, M. Manteiga Outeiro, R. Maoli, C. Maraston, S. Marcin, P. Marcos-Arenal, B. Margalef-Bentabol, O. Marggraf, D. Marinucci, M. Marinucci, K. Markovic, F. R. Marleau, J. Marpaud, J. Martignac, J. Martín-Fleitas, P. Martin-Moruno, E. L. Martin, M. Martinelli, N. Martinet, H. Martin, C. J. A. P. Martins, F. Marulli, D. Massari, R. Massey, D. C. Masters, S. Matarrese, Y. Matsuoka, S. Matthew, B. J. Maughan, N. Mauri, L. Maurin, S. Maurogordato, K. McCarthy, A. W. McConnachie, H. J. McCracken, I. McDonald, J. D. McEwen, C. J. R. McPartland, E. Medinaceli, V. Mehta, S. Mei, M. Melchior, J. -B. Melin, B. Ménard, J. Mendes, J. Mendez-Abreu, M. Meneghetti, A. Mercurio, E. Merlin, R. B. Metcalf, G. Meylan, M. Migliaccio, M. Mignoli, L. Miller, M. Miluzio, B. Milvang-Jensen, J. P. Mimoso, R. Miquel, H. Miyatake, B. Mobasher, J. J. Mohr, P. Monaco, M. Monguió, A. Montoro, A. Mora, A. Moradinezhad Dizgah, M. Moresco, C. Moretti, G. Morgante, N. Morisset, T. J. Moriya, P. W. Morris, D. J. Mortlock, L. Moscardini, D. F. Mota, S. Mottet, L. A. Moustakas, T. Moutard, T. Müller, E. Munari, G. Murphree, C. Murray, N. Murray, P. Musi, S. Nadathur, B. C. Nagam, T. Nagao, K. Naidoo, R. Nakajima, C. Nally, P. Natoli, A. Navarro-Alsina, D. Navarro Girones, C. Neissner, A. Nersesian, S. Nesseris, H. N. Nguyen-Kim, L. Nicastro, R. C. Nichol, M. Nielbock, S. -M. Niemi, S. Nieto, K. Nilsson, J. Noller, P. Norberg, A. Nouri-Zonoz, P. Ntelis, A. A. Nucita, P. Nugent, N. J. Nunes, T. Nutma, I. Ocampo, J. Odier, P. A. Oesch, M. Oguri, D. Magalhaes Oliveira, M. Onoue, T. Oosterbroek, F. Oppizzi, C. Ordenovic, K. Osato, F. Pacaud, F. Pace, C. Padilla, K. Paech, L. Pagano, M. J. Page, E. Palazzi, S. Paltani, S. Pamuk, S. Pandolfi, D. Paoletti, M. Paolillo, P. Papaderos, K. Pardede, G. Parimbelli, A. Parmar, C. Partmann, F. 
Pasian, F. Passalacqua, K. Paterson, L. Patrizii, C. Pattison, A. Paulino-Afonso, R. Paviot, J. A. Peacock, F. R. Pearce, K. Pedersen, A. Peel, R. F. Peletier, M. Pellejero Ibanez, R. Pello, M. T. Penny, W. J. Percival, A. Perez-Garrido, L. Perotto, V. Pettorino, A. Pezzotta, S. Pezzuto, A. Philippon, M. Pierre, O. Piersanti, M. Pietroni, L. Piga, L. Pilo, S. Pires, A. Pisani, A. Pizzella, L. Pizzuti, C. Plana, G. Polenta, J. E. Pollack, M. Poncet, M. Pöntinen, P. Pool, L. A. Popa, V. Popa, J. Popp, C. Porciani, L. Porth, D. Potter, M. Poulain, A. Pourtsidou, L. Pozzetti, I. Prandoni, G. W. Pratt, S. Prezelus, E. Prieto, A. Pugno, S. Quai, L. Quilley, G. D. Racca, A. Raccanelli, G. Rácz, S. Radinović, M. Radovich, A. Ragagnin, U. Ragnit, F. Raison, N. Ramos-Chernenko, C. Ranc, Y. Rasera, N. Raylet, R. Rebolo, A. Refregier, P. Reimberg, T. H. Reiprich, F. Renk, A. Renzi, J. Retre, Y. Revaz, C. Reylé, L. Reynolds, J. Rhodes, F. Ricci, M. Ricci, G. Riccio, S. O. Ricken, S. Rissanen, I. Risso, H. -W. Rix, A. C. Robin, B. Rocca-Volmerange, P. -F. Rocci, M. Rodenhuis, G. Rodighiero, M. Rodriguez Monroy, R. P. Rollins, M. Romanello, J. Roman, E. Romelli, M. Romero-Gomez, M. Roncarelli, P. Rosati, C. Rosset, E. Rossetti, W. Roster, H. J. A. Rottgering, A. Rozas-Fernández, K. Ruane, J. A. Rubino-Martin, A. Rudolph, F. Ruppin, B. Rusholme, S. Sacquegna, I. Sáez-Casares, S. Saga, R. Saglia, M. Sahlén, T. Saifollahi, Z. Sakr, J. Salvalaggio, R. Salvaterra, L. Salvati, M. Salvato, J. -C. Salvignol, A. G. Sánchez, E. Sanchez, D. B. Sanders, D. Sapone, M. Saponara, E. Sarpa, F. Sarron, S. Sartori, B. Sartoris, B. Sassolas, L. Sauniere, M. Sauvage, M. Sawicki, R. Scaramella, C. Scarlata, L. Scharré, J. Schaye, J. A. Schewtschenko, J. -T. Schindler, E. Schinnerer, M. Schirmer, F. Schmidt, F. Schmidt, M. Schmidt, A. Schneider, M. Schneider, P. Schneider, N. Schöneberg, T. Schrabback, M. Schultheis, S. Schulz, N. Schuster, J. Schwartz, D. Sciotti, M. Scodeggio, D. Scognamiglio, D. Scott, V. Scottez, A. Secroun, E. Sefusatti, G. Seidel, M. Seiffert, E. Sellentin, M. Selwood, E. Semboloni, M. Sereno, S. Serjeant, S. Serrano, G. Setnikar, F. Shankar, R. M. Sharples, A. Short, A. Shulevski, M. Shuntov, M. Sias, G. Sikkema, A. Silvestri, P. Simon, C. Sirignano, G. Sirri, J. Skottfelt, E. Slezak, D. Sluse, G. P. Smith, L. C. Smith, R. E. Smith, S. J. A. Smit, F. Soldano, B. G. B. Solheim, J. G. Sorce, F. Sorrenti, E. Soubrie, L. Spinoglio, A. Spurio Mancini, J. Stadel, L. Stagnaro, L. Stanco, S. A. Stanford, J. -L. Starck, P. Stassi, J. Steinwagner, D. Stern, C. Stone, P. Strada, F. Strafella, D. Stramaccioni, C. Surace, F. Sureau, S. H. Suyu, I. Swindells, M. Szafraniec, I. Szapudi, S. Taamoli, M. Talia, P. Tallada-Crespí, K. Tanidis, C. Tao, P. Tarrío, D. Tavagnacco, A. N. Taylor, J. E. Taylor, P. L. Taylor, E. M. Teixeira, M. Tenti, P. Teodoro Idiago, H. I. Teplitz, I. Tereno, N. Tessore, V. Testa, G. Testera, M. Tewes, R. Teyssier, N. Theret, C. Thizy, P. D. Thomas, Y. Toba, S. Toft, R. Toledo-Moreo, E. Tolstoy, E. Tommasi, O. Torbaniuk, F. Torradeflot, C. Tortora, S. Tosi, S. Tosti, M. Trifoglio, A. Troja, T. Trombetti, A. Tronconi, M. Tsedrik, A. Tsyganov, M. Tucci, I. Tutusaus, C. Uhlemann, L. Ulivi, M. Urbano, L. Vacher, L. Vaillon, P. Valageas, I. Valdes, E. A. Valentijn, L. Valenziano, C. Valieri, J. Valiviita, M. Van den Broeck, T. Vassallo, R. Vavrek, J. Vega-Ferrero, B. Venemans, A. Venhola, S. Ventura, G. Verdoes Kleijn, D. Vergani, A. Verma, F. Vernizzi, A. Veropalumbo, G. Verza, C. Vescovi, D. 
Vibert, M. Viel, P. Vielzeuf, C. Viglione, A. Viitanen, F. Villaescusa-Navarro, S. Vinciguerra, F. Visticot, K. Voggel, M. von Wietersheim-Kramsta, W. J. Vriend, S. Wachter, M. Walmsley, G. Walth, D. M. Walton, N. A. Walton, M. Wander, L. Wang, Y. Wang, J. R. Weaver, J. Weller, M. Wetzstein, D. J. Whalen, I. H. Whittam, A. Widmer, M. Wiesmann, J. Wilde, O. R. Williams, H. -A. Winther, A. Wittje, J. H. W. Wong, A. H. Wright, V. Yankelevich, H. W. Yeung, M. Yoon, S. Youles, L. Y. A. Yung, A. Zacchei, L. Zalesky, G. Zamorani, A. Zamorano Vitorelli, M. Zanoni Marc, M. Zennaro, F. M. Zerbi, I. A. Zinchenko, J. Zoubian, E. Zucca, M. Zumalacarregui
Date:2024-05-22 10:05:49

The current standard model of cosmology successfully describes a variety of measurements, but the nature of its main ingredients, dark matter and dark energy, remains unknown. Euclid is a medium-class mission in the Cosmic Vision 2015-2025 programme of the European Space Agency (ESA) that will provide high-resolution optical imaging, as well as near-infrared imaging and spectroscopy, over about 14,000 deg^2 of extragalactic sky. In addition to accurate weak lensing and clustering measurements that probe structure formation over half of the age of the Universe, its primary probes for cosmology, these exquisite data will enable a wide range of science. This paper provides a high-level overview of the mission, summarising the survey characteristics, the various data-processing steps, and data products. We also highlight the main science objectives and expected performance.

Exponential Steepest Ascent from Valued Constraint Graphs of Pathwidth Four

Authors:Artem Kaznatcheev, Melle van Marle
Date:2024-05-21 16:22:06

We examine the complexity of maximising fitness via local search on valued constraint satisfaction problems (VCSPs). We consider two kinds of local ascents: (1) steepest ascents, where each step changes the domain that produces a maximal increase in fitness; and (2) $\prec$-ordered ascents, where -- of the domains with available fitness increasing changes -- each step changes the $\prec$-minimal domain. We provide a general padding argument to simulate any ordered ascent by a steepest ascent. We construct a VCSP that is a path of binary constraints between alternating 2-state and 3-state domains with exponentially long ordered ascents. We apply our padding argument to this VCSP to obtain a Boolean VCSP that has a constraint (hyper)graph of arity 5 and pathwidth 4 with exponential steepest ascents. This is an improvement on the previous best known construction for long steepest ascents, which had arity 8 and pathwidth 7.

Efficient Multi-agent Reinforcement Learning by Planning

Authors:Qihan Liu, Jianing Ye, Xiaoteng Ma, Jun Yang, Bin Liang, Chongjie Zhang
Date:2024-05-20 04:36:02

Multi-agent reinforcement learning (MARL) algorithms have accomplished remarkable breakthroughs in solving large-scale decision-making tasks. Nonetheless, most existing MARL algorithms are model-free, limiting sample efficiency and hindering their applicability in more challenging scenarios. In contrast, model-based reinforcement learning (MBRL), particularly algorithms integrating planning, such as MuZero, has demonstrated superhuman performance with limited data in many tasks. Hence, we aim to boost the sample efficiency of MARL by adopting model-based approaches. However, incorporating planning and search methods into multi-agent systems poses significant challenges. The expansive action space of multi-agent systems often necessitates leveraging the nearly-independent property of agents to accelerate learning. To tackle this issue, we propose the MAZero algorithm, which combines a centralized model with Monte Carlo Tree Search (MCTS) for policy search. We design a novel network structure to facilitate distributed execution and parameter sharing. To enhance search efficiency in deterministic environments with sizable action spaces, we introduce two novel techniques: Optimistic Search Lambda (OS($\lambda$)) and Advantage-Weighted Policy Optimization (AWPO). Extensive experiments on the SMAC benchmark demonstrate that MAZero outperforms model-free approaches in terms of sample efficiency and provides comparable or better performance than existing model-based methods in terms of both sample and computational efficiency. Our code is available at https://github.com/liuqh16/MAZero.
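
For the policy-improvement side, an advantage-weighted loss in the spirit of AWPO might be sketched as follows (the exact MAZero formulation is not reproduced here; names and the clipping constant are assumptions):

```python
import torch

def awpo_style_loss(log_probs, advantages, beta=1.0, max_weight=20.0):
    """Weight search-derived policy targets by exponentiated advantage,
    so actions that MCTS found strong pull the policy harder."""
    weights = torch.clamp(torch.exp(advantages / beta), max=max_weight)
    return -(weights.detach() * log_probs).mean()
```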

LLM-based Multi-Agent Reinforcement Learning: Current and Future Directions

Authors:Chuanneng Sun, Songjun Huang, Dario Pompili
Date:2024-05-17 22:10:23

In recent years, Large Language Models (LLMs) have shown great abilities in various tasks, including question answering, arithmetic problem solving, and poem writing, among others. Although research on LLM-as-an-agent has shown that LLM can be applied to Reinforcement Learning (RL) and achieve decent results, the extension of LLM-based RL to Multi-Agent System (MAS) is not trivial, as many aspects, such as coordination and communication between agents, are not considered in the RL frameworks of a single agent. To inspire more research on LLM-based MARL, in this letter, we survey the existing LLM-based single-agent and multi-agent RL frameworks and provide potential research directions for future research. In particular, we focus on the cooperative tasks of multiple agents with a common goal and communication among them. We also consider human-in/on-the-loop scenarios enabled by the language component in the framework.

Fully Distributed Fog Load Balancing with Multi-Agent Reinforcement Learning

Authors:Maad Ebrahim, Abdelhakim Hafid
Date:2024-05-15 23:44:06

Real-time Internet of Things (IoT) applications require real-time support to handle the ever-growing demand for computing resources to process IoT workloads. Fog Computing provides high availability of such resources in a distributed manner. However, these resources must be efficiently managed to distribute unpredictable traffic demands among heterogeneous Fog resources. This paper proposes a fully distributed load-balancing solution with Multi-Agent Reinforcement Learning (MARL) that intelligently distributes IoT workloads to optimize the waiting time while providing fair resource utilization in the Fog network. These agents use transfer learning for life-long self-adaptation to dynamic changes in the environment. By leveraging distributed decision-making, MARL agents effectively minimize the waiting time compared to a single centralized agent solution and other baselines, reducing end-to-end execution delay. Besides the performance gain, a fully distributed solution allows for a global-scale implementation where agents can work independently in small collaboration regions, leveraging nearby local resources. Furthermore, we analyze the impact of a realistic frequency for observing the state of the environment, unlike the unrealistic common assumption in the literature that observations are readily available in real time for every required action. The findings highlight the trade-off between realism and performance when using an interval-based, gossip-based multicasting protocol instead of assuming real-time observation availability for every generated workload.

A Distributed Approach to Autonomous Intersection Management via Multi-Agent Reinforcement Learning

Authors:Matteo Cederle, Marco Fabris, Gian Antonio Susto
Date:2024-05-14 14:34:24

Autonomous intersection management (AIM) poses significant challenges due to the intricate nature of real-world traffic scenarios and the need for a highly expensive centralised server in charge of simultaneously controlling all the vehicles. This study addresses such issues by proposing a novel distributed approach to AIM utilizing multi-agent reinforcement learning (MARL). We show that by leveraging the 3D surround view technology for advanced assistance systems, autonomous vehicles can accurately navigate intersection scenarios without needing any centralised controller. The contributions of this paper thus include a MARL-based algorithm for the autonomous management of a 4-way intersection and also the introduction of a new strategy called prioritised scenario replay for improved training efficacy. We validate our approach as an innovative alternative to conventional centralised AIM techniques, ensuring the full reproducibility of our results. Specifically, experiments conducted in virtual environments using the SMARTS platform highlight its superiority over benchmarks across various metrics.

Safety Constrained Multi-Agent Reinforcement Learning for Active Voltage Control

Authors:Yang Qu, Jinming Ma, Feng Wu
Date:2024-05-14 09:03:00

Active voltage control presents a promising avenue for relieving power congestion and enhancing voltage quality, taking advantage of the distributed controllable generators in the power network, such as roof-top photovoltaics. While Multi-Agent Reinforcement Learning (MARL) has emerged as a compelling approach to address this challenge, existing MARL approaches tend to overlook the constrained optimization nature of this problem, failing to guarantee safety constraints. In this paper, we formalize the active voltage control problem as a constrained Markov game and propose a safety-constrained MARL algorithm. We extend the primal-dual optimization RL method to multi-agent settings, and augment it with a novel double safety estimation approach to learn the policy and update the Lagrange multiplier. In addition, we propose different cost functions and investigate their influence on the behavior of our constrained MARL method. We evaluate our approach in a power distribution network simulation environment with real-world-scale scenarios. Experimental results demonstrate the effectiveness of the proposed method compared with state-of-the-art MARL methods. This paper is published at \url{https://www.ijcai.org/Proceedings/2024/}.
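
A minimal sketch of the primal-dual step at the heart of such constrained MARL methods (the paper's double safety estimation is not reproduced; variable names are assumptions):

```python
import torch

def primal_dual_losses(log_probs, adv_reward, adv_cost, lam, avg_cost, cost_limit):
    """Primal: maximize reward advantage minus lambda-weighted cost
    advantage. Dual: gradient descent on `lam_loss` raises lambda while
    the cost constraint is violated (project lam to >= 0 after stepping)."""
    policy_loss = -(log_probs * (adv_reward - lam.detach() * adv_cost)).mean()
    lam_loss = -lam * (avg_cost - cost_limit)
    return policy_loss, lam_loss
```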

Towards Energy-Aware Federated Learning via MARL: A Dual-Selection Approach for Model and Client

Authors:Jun Xia, Yi Zhang, Yiyu Shi
Date:2024-05-13 21:02:31

Although Federated Learning (FL) is promising for knowledge sharing among heterogeneous Artificial Intelligence of Things (AIoT) devices, its training performance and energy efficiency are severely restricted in practical battery-driven scenarios due to the ``wooden barrel effect'' caused by the mismatch between homogeneous model paradigms and heterogeneous device capabilities. As a result, due to the various differences among devices, it is hard for existing FL methods to conduct training effectively in energy-constrained scenarios, such as those with device battery constraints. To tackle these issues, we propose an energy-aware FL framework named DR-FL, which considers the energy constraints of both clients and heterogeneous deep learning models to enable energy-efficient FL. Unlike vanilla FL, DR-FL adopts our proposed Multi-Agent Reinforcement Learning (MARL)-based dual-selection method, which allows participating devices to contribute to the global model effectively and adaptively based on their computing capabilities and energy capacities. Experiments conducted on various widely recognized datasets demonstrate that DR-FL can optimize the exchange of knowledge among diverse models in large-scale AIoT systems while adhering to energy limitations. Additionally, it improves the performance of each individual heterogeneous device's model.

Towards Adaptive IMFs -- Generalization of utility functions in Multi-Agent Frameworks

Authors:Kaushik Dey, Satheesh K. Perepu, Abir Das, Pallab Dasgupta
Date:2024-05-13 10:27:11

Intent Management Function (IMF) is an integral part of future-generation networks. In recent years, there has been some work on AI-based IMFs that can handle conflicting intents and prioritize the global objective based on an a priori definition of the utility function and the priorities accorded to competing intents. Some of the earlier works use Multi-Agent Reinforcement Learning (MARL) techniques with Ad hoc Teaming (AHT) approaches for efficient conflict handling in IMFs. However, the success of such frameworks in real-life scenarios requires them to be flexible to business situations. The intent priorities can change, and the utility function, which measures the extent of intent fulfilment, may also vary in definition. This paper proposes a novel mechanism whereby the IMF can generalize to different forms of utility functions and to changes of intent priorities at run-time without additional training. Such generalization ability, without additional training requirements, would help deploy IMFs in live networks where customer intents and priorities change frequently. Results on a network emulator demonstrate the efficacy of the approach and its scalability to new intents; it outperforms existing techniques that require additional training to achieve the same degree of flexibility, thereby saving cost and increasing efficiency and adaptability.

AdaptNet: Rethinking Sensing and Communication for a Seamless Internet of Drones Experience

Authors:Ananya Hazarika, Mehdi Rahmati
Date:2024-05-12 16:10:56

In the evolving era of Unmanned Aerial Vehicles (UAVs), the emphasis has moved from mere data collection to strategically obtaining timely and relevant data within the Internet of Drones (IoD) ecosystem. However, the unpredictable conditions in dynamic IoD environments pose safety challenges for drones. Addressing this, our approach introduces a multi-UAV framework using spatial-temporal clustering and the Fréchet distance to enhance reliability. Seamlessly coupled with Integrated Sensing and Communication (ISAC), it enhances the precision and agility of UAV networks. Our Multi-Agent Reinforcement Learning (MARL) mechanism ensures UAVs adapt strategies through ongoing environmental interactions, enhancing intelligent sensing. This focus ensures operational safety and efficiency, considering data capture and transmission viability. By evaluating the relevance of the sensed information, we communicate only the most crucial data variations beyond a set threshold and optimize bandwidth usage. Our methodology transforms the UAV domain, transitioning drones from data gatherers to adept information orchestrators, establishing a benchmark for efficiency and adaptability in modern aerial systems.

A First Introduction to Cooperative Multi-Agent Reinforcement Learning

Authors:Christopher Amato
Date:2024-05-10 00:50:08

Multi-agent reinforcement learning (MARL) has exploded in popularity in recent years. While numerous approaches have been developed, they can be broadly categorized into three main types: centralized training and execution (CTE), centralized training for decentralized execution (CTDE), and decentralized training and execution (DTE). CTE methods assume centralization during training and execution (e.g., with fast, free, and perfect communication) and have the most information during execution. CTDE methods are the most common, as they leverage centralized information during training while enabling decentralized execution -- using only information available to that agent during execution. Decentralized training and execution methods make the fewest assumptions and are often simple to implement. This text is an introduction to cooperative MARL -- MARL in which all agents share a single, joint reward. It is meant to explain the setting, basic concepts, and common methods for the CTE, CTDE, and DTE settings. It does not cover all work in cooperative MARL as the area is quite extensive. I have included work that I believe is important for understanding the main concepts in the area and apologize to those that I have omitted. Topics include simple applications of single-agent methods to CTE as well as some more scalable methods that exploit the multi-agent structure, independent Q-learning and policy gradient methods and their extensions, as well as value function factorization methods including the well-known VDN, QMIX, and QPLEX approaches, and centralized critic methods including MADDPG, COMA, and MAPPO. I also discuss common misconceptions, the relationship between different approaches, and some open questions.
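
As a concrete anchor for the value-factorization family mentioned above, VDN's additive decomposition is short enough to state in full (a minimal sketch, not tied to any particular codebase):

```python
import torch
import torch.nn as nn

class VDNMixer(nn.Module):
    """VDN factorizes the joint value additively:
    Q_tot(s, a_1..a_n) = sum_i Q_i(o_i, a_i); the joint TD error then
    backpropagates through the sum into each agent's utility network."""
    def forward(self, agent_qs):              # [batch, n_agents]
        return agent_qs.sum(dim=-1, keepdim=True)

# usage sketch: q_tot = VDNMixer()(torch.stack([q1, q2, q3], dim=-1))
```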

Dynamic Deep Factor Graph for Multi-Agent Reinforcement Learning

Authors:Yuchen Shi, Shihong Duan, Cheng Xu, Ran Wang, Fangwen Ye, Chau Yuen
Date:2024-05-09 04:57:26

This work introduces a novel value decomposition algorithm, termed \textit{Dynamic Deep Factor Graphs} (DDFG). Unlike traditional coordination graphs, DDFG leverages factor graphs to articulate the decomposition of value functions, offering enhanced flexibility and adaptability to complex value function structures. Central to DDFG is a graph structure generation policy that innovatively generates factor graph structures on-the-fly, effectively addressing the dynamic collaboration requirements among agents. DDFG strikes an optimal balance between the computational overhead associated with aggregating value functions and the performance degradation inherent in their complete decomposition. Through the application of the max-sum algorithm, DDFG efficiently identifies optimal policies. We empirically validate DDFG's efficacy in complex scenarios, including higher-order predator-prey tasks and the StarCraft II Multi-agent Challenge (SMAC), thus underscoring its capability to surmount the limitations faced by existing value decomposition algorithms. DDFG emerges as a robust solution for MARL challenges that demand nuanced understanding and facilitation of dynamic agent collaboration. The implementation of DDFG is made publicly accessible, with the source code available at \url{https://github.com/SICC-Group/DDFG}.

The Cambridge RoboMaster: An Agile Multi-Robot Research Platform

Authors:Jan Blumenkamp, Ajay Shankar, Matteo Bettini, Joshua Bird, Amanda Prorok
Date:2024-05-03 15:54:20

Compact robotic platforms with powerful compute and actuation capabilities are key enablers for practical, real-world deployments of multi-agent research. This article introduces a tightly integrated hardware, control, and simulation software stack on a fleet of holonomic ground robot platforms designed with this motivation. Our robots, a fleet of customised DJI Robomaster S1 vehicles, offer a balance between small robots that do not possess sufficient compute or actuation capabilities and larger robots that are unsuitable for indoor multi-robot tests. They run a modular ROS2-based optimal estimation and control stack for full onboard autonomy, contain ad-hoc peer-to-peer communication infrastructure, and can zero-shot run multi-agent reinforcement learning (MARL) policies trained in our vectorized multi-agent simulation framework. We present an in-depth review of other platforms currently available, showcase new experimental validation of our system's capabilities, and introduce case studies that highlight the versatility and reliability of our system as a testbed for a wide range of research demonstrations. Our system, as well as supplementary material, is available online at https://proroklab.github.io/cambridge-robomaster.

MESA: Cooperative Meta-Exploration in Multi-Agent Learning through Exploiting State-Action Space Structure

Authors:Zhicheng Zhang, Yancheng Liang, Yi Wu, Fei Fang
Date:2024-05-01 23:19:48

Multi-agent reinforcement learning (MARL) algorithms often struggle to find strategies close to Pareto optimal Nash Equilibrium, owing largely to the lack of efficient exploration. The problem is exacerbated in sparse-reward settings, caused by the larger variance exhibited in policy learning. This paper introduces MESA, a novel meta-exploration method for cooperative multi-agent learning. It learns to explore by first identifying the agents' high-rewarding joint state-action subspace from training tasks and then learning a set of diverse exploration policies to "cover" the subspace. These trained exploration policies can be integrated with any off-policy MARL algorithm for test-time tasks. We first showcase MESA's advantage in a multi-step matrix game. Furthermore, experiments show that with learned exploration policies, MESA achieves significantly better performance in sparse-reward tasks in several multi-agent particle environments and multi-agent MuJoCo environments, and exhibits the ability to generalize to more challenging tasks at test time.

A Meta-Game Evaluation Framework for Deep Multiagent Reinforcement Learning

Authors:Zun Li, Michael P. Wellman
Date:2024-04-30 23:19:39

Evaluating deep multiagent reinforcement learning (MARL) algorithms is complicated by stochasticity in training and sensitivity of agent performance to the behavior of other agents. We propose a meta-game evaluation framework for deep MARL, by framing each MARL algorithm as a meta-strategy, and repeatedly sampling normal-form empirical games over combinations of meta-strategies resulting from different random seeds. Each empirical game captures both self-play and cross-play factors across seeds. These empirical games provide the basis for constructing a sampling distribution, using bootstrapping, over a variety of game analysis statistics. We use this approach to evaluate state-of-the-art deep MARL algorithms on a class of negotiation games. From statistics on individual payoffs, social welfare, and empirical best-response graphs, we uncover strategic relationships among self-play, population-based, model-free, and model-based MARL methods. We also investigate the effect of run-time search as a meta-strategy operator, and find via meta-game analysis that the search version of a meta-strategy generally leads to improved performance.
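
The bootstrap over empirical meta-games might be sketched as follows (the payoff-tensor layout and the statistic are illustrative assumptions):

```python
import numpy as np

def bootstrap_metagame_stat(payoffs, stat_fn, n_boot=1000, seed=0):
    """payoffs[i, j] holds k sampled payoffs of meta-strategy i (an
    algorithm + seed) against meta-strategy j, covering self-play (i == j)
    and cross-play (i != j). Resampling games with replacement yields a
    sampling distribution for any game statistic `stat_fn`."""
    rng = np.random.default_rng(seed)
    n, _, k = payoffs.shape
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, k, size=k)
        empirical_game = payoffs[:, :, idx].mean(axis=-1)  # resampled game
        stats.append(stat_fn(empirical_game))
    return np.array(stats)
```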

Provably Efficient Information-Directed Sampling Algorithms for Multi-Agent Reinforcement Learning

Authors:Qiaosheng Zhang, Chenjia Bai, Shuyue Hu, Zhen Wang, Xuelong Li
Date:2024-04-30 06:48:56

This work designs and analyzes a novel set of algorithms for multi-agent reinforcement learning (MARL) based on the principle of information-directed sampling (IDS). These algorithms draw inspiration from foundational concepts in information theory, and are proven to be sample efficient in MARL settings such as two-player zero-sum Markov games (MGs) and multi-player general-sum MGs. For episodic two-player zero-sum MGs, we present three sample-efficient algorithms for learning Nash equilibrium. The basic algorithm, referred to as MAIDS, employs an asymmetric learning structure where the max-player first solves a minimax optimization problem based on the joint information ratio of the joint policy, and the min-player then minimizes the marginal information ratio with the max-player's policy fixed. Theoretical analyses show that it achieves a Bayesian regret of $\tilde{O}(\sqrt{K})$ for $K$ episodes. To reduce the computational load of MAIDS, we develop an improved algorithm called Reg-MAIDS, which has the same Bayesian regret bound while enjoying less computational complexity. Moreover, by leveraging the flexibility of the IDS principle in choosing the learning target, we propose two methods for constructing compressed environments based on rate-distortion theory, upon which we develop the Compressed-MAIDS algorithm, wherein the learning target is a compressed environment. Finally, we extend Reg-MAIDS to multi-player general-sum MGs and prove that it can learn either the Nash equilibrium or coarse correlated equilibrium in a sample efficient manner.
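
The information-ratio trade-off that IDS optimizes can be illustrated on a toy two-action mixture (a sketch of the general principle, not the paper's MAIDS optimization):

```python
import numpy as np

def min_information_ratio_mix(regret_est, info_gain, n_grid=101):
    """Choose a mixture over two actions minimizing the information
    ratio: (expected regret)^2 / (expected information gain)."""
    best_ratio, best_p = np.inf, 0.0
    for p in np.linspace(0.0, 1.0, n_grid):
        mix = np.array([p, 1.0 - p])
        ratio = (mix @ regret_est) ** 2 / max(mix @ info_gain, 1e-12)
        if ratio < best_ratio:
            best_ratio, best_p = ratio, p
    return best_p, best_ratio
```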

Multi-Agent Synchronization Tasks

Authors:Rolando Fernandez, Garrett Warnell, Derrik E. Asher, Peter Stone
Date:2024-04-29 15:34:32

In multi-agent reinforcement learning (MARL), coordination plays a crucial role in enhancing agents' performance beyond what they could achieve through cooperation alone. The interdependence of agents' actions, coupled with the need for communication, leads to a domain where effective coordination is crucial. In this paper, we introduce and define $\textit{Multi-Agent Synchronization Tasks}$ (MSTs), a novel subset of multi-agent tasks. We describe one MST, which we call $\textit{Synchronized Predator-Prey}$, offering a detailed description that will serve as the basis for evaluating a selection of recent state-of-the-art (SOTA) MARL algorithms explicitly designed to address coordination challenges through the use of communication strategies. Furthermore, we present empirical evidence that reveals the limitations of the algorithms assessed to solve MSTs, demonstrating their inability to scale effectively beyond 2-agent coordination tasks in scenarios where communication is a requisite component. Finally, the results raise questions about the applicability of recent SOTA approaches for complex coordination tasks (i.e., MSTs) and prompt further exploration into the underlying causes of their limitations in this context.

Delay-Aware Multi-Agent Reinforcement Learning for Cooperative Adaptive Cruise Control with Model-based Stability Enhancement

Authors:Jiaqi Liu, Ziran Wang, Peng Hang, Jian Sun
Date:2024-04-24 07:19:43

Cooperative Adaptive Cruise Control (CACC) represents a quintessential control strategy for orchestrating vehicular platoon movement within Connected and Automated Vehicle (CAV) systems, significantly enhancing traffic efficiency and reducing energy consumption. In recent years, data-driven methods, such as reinforcement learning (RL), have been employed to address this task due to their significant advantages in terms of efficiency and flexibility. However, the delay issue, which often arises in real-world CACC systems, is rarely taken into account by current RL-based approaches. To tackle this problem, we propose a Delay-Aware Multi-Agent Reinforcement Learning (DAMARL) framework aimed at achieving safe and stable control for CACC. We model the entire decision-making process using a Multi-Agent Delay-Aware Markov Decision Process (MADA-MDP) and develop a centralized training with decentralized execution (CTDE) MARL framework for distributed control of CACC platoons. An attention mechanism-integrated policy network is introduced to enhance the performance of CAV communication and decision-making. Additionally, a velocity optimization model-based action filter is incorporated to further ensure the stability of the platoon. Experimental results across various delay conditions and platoon sizes demonstrate that our approach consistently outperforms baseline methods in terms of platoon safety, stability, and overall performance.

GRSN: Gated Recurrent Spiking Neurons for POMDPs and MARL

Authors:Lang Qin, Ziming Wang, Runhao Jiang, Rui Yan, Huajin Tang
Date:2024-04-24 02:20:50

Spiking neural networks (SNNs) are widely applied in various fields due to their energy-efficient and fast-inference capabilities. Applying SNNs to reinforcement learning (RL) can significantly reduce the computational resource requirements for agents and improve the algorithm's performance under resource-constrained conditions. However, in current spiking reinforcement learning (SRL) algorithms, the simulation results of multiple time steps can only correspond to a single-step decision in RL. This is quite different from the real temporal dynamics in the brain and also fails to fully exploit the capacity of SNNs to process temporal data. In order to address this temporal mismatch issue and further take advantage of the inherent temporal dynamics of spiking neurons, we propose a novel temporal alignment paradigm (TAP) that leverages the single-step update of spiking neurons to accumulate historical state information in RL and introduces gated units to enhance the memory capacity of spiking neurons. Experimental results show that our method can solve partially observable Markov decision processes (POMDPs) and multi-agent cooperation problems with performance similar to that of recurrent neural networks (RNNs), but with about 50% of the power consumption.
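
One way to picture a gated single-step spiking update is the following cell (a hypothetical sketch; the paper's exact gating and reset may differ, and training would need a surrogate gradient for the spike threshold):

```python
import torch
import torch.nn as nn

class GatedLIFCell(nn.Module):
    """Leaky integrate-and-fire step whose membrane decay is modulated by
    a learned gate, letting one spiking step per RL decision carry
    history in the membrane potential."""
    def __init__(self, in_dim, hidden_dim, v_th=1.0):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden_dim)
        self.gate = nn.Linear(in_dim + hidden_dim, hidden_dim)
        self.v_th = v_th

    def forward(self, x, v):
        g = torch.sigmoid(self.gate(torch.cat([x, v], dim=-1)))
        v = g * v + self.fc(x)                 # gated membrane integration
        spike = (v >= self.v_th).float()       # non-differentiable threshold
        v = v - spike * self.v_th              # soft reset after firing
        return spike, v
```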

Multi-Agent Reinforcement Learning for Energy Networks: Computational Challenges, Progress and Open Problems

Authors:Sarah Keren, Chaimaa Essayeh, Stefano V. Albrecht, Thomas Morstyn
Date:2024-04-24 01:35:27

The rapidly changing architecture and functionality of electrical networks and the increasing penetration of renewable and distributed energy resources have resulted in various technological and managerial challenges. These have rendered traditional centralized energy-market paradigms insufficient due to their inability to support the dynamic and evolving nature of the network. This survey explores how multi-agent reinforcement learning (MARL) can support the decentralization and decarbonization of energy networks and mitigate the associated challenges. This is achieved by specifying key computational challenges in managing energy networks, reviewing recent research progress on addressing them, and highlighting open challenges that may be addressed using MARL.

Multi-agent Reinforcement Learning-based Joint Precoding and Phase Shift Optimization for RIS-aided Cell-Free Massive MIMO Systems

Authors:Yiyang Zhu, Enyu Shi, Ziheng Liu, Jiayi Zhang, Bo Ai
Date:2024-04-22 11:25:37

Cell-free (CF) massive multiple-input multiple-output (mMIMO) is a promising technique for achieving high spectral efficiency (SE) using multiple distributed access points (APs). However, harsh propagation environments often lead to significant communication performance degradation due to high penetration loss. To overcome this issue, we introduce the reconfigurable intelligent surface (RIS) into the CF mMIMO system as a low-cost and power-efficient solution. In this paper, we focus on optimizing the joint precoding design of the RIS-aided CF mMIMO system to maximize the sum SE. This involves optimizing the precoding matrix at the APs and the reflection coefficients at the RIS. To tackle this problem, we propose a fully distributed multi-agent reinforcement learning (MARL) algorithm that incorporates fuzzy logic (FL). Unlike conventional approaches that rely on alternating optimization techniques, our FL-based MARL algorithm only requires local channel state information, which reduces the need for high backhaul capacity. Simulation results demonstrate that our proposed FL-MARL algorithm effectively reduces computational complexity while achieving performance similar to conventional MARL methods.

Multi-AUV Cooperative Underwater Multi-Target Tracking Based on Dynamic-Switching-enabled Multi-Agent Reinforcement Learning

Authors:Shengbo Wang, Chuan Lin, Guangjie Han, Shengchao Zhu, Zhixian Li, Zhenyu Wang, Yunpeng Ma
Date:2024-04-21 13:09:30

In recent years, autonomous underwater vehicle (AUV) swarms have gradually become popular and have been widely promoted for ocean exploration, underwater tracking, and similar applications. In this paper, we propose a multi-AUV cooperative underwater multi-target tracking algorithm that explicitly accounts for real underwater factors. We first model the underwater sonar-based detection and the ocean-current interference affecting the target tracking process. Then, based on software-defined networking (SDN), we regard the AUV swarm as an underwater ad-hoc network and propose a hierarchical software-defined multi-AUV reinforcement learning (HSARL) architecture. Based on the proposed HSARL architecture, we propose the "Dynamic-Switching" mechanism, which includes "Dynamic-Switching Attention" and "Dynamic-Switching Resampling" mechanisms that accelerate the HSARL algorithm's convergence and effectively prevent it from getting stuck in a local optimum. Additionally, we introduce a reward-reshaping mechanism to further accelerate convergence in the early training phase. Finally, based on a proposed AUV classification method, we propose a cooperative tracking algorithm called the Dynamic-Switching-Based MARL (DSBM)-driven tracking algorithm. Evaluation results demonstrate that our DSBM tracking algorithm performs precise underwater multi-target tracking, compared with many recent approaches across various important metrics.

MAexp: A Generic Platform for RL-based Multi-Agent Exploration

Authors:Shaohao Zhu, Jiacheng Zhou, Anjun Chen, Mingming Bai, Jiming Chen, Jinming Xu
Date:2024-04-19 12:00:10

The sim-to-real gap poses a significant challenge in RL-based multi-agent exploration due to scene quantization and action discretization. Existing platforms suffer from inefficient sampling and a lack of diversity in the Multi-Agent Reinforcement Learning (MARL) algorithms they support across different scenarios, restraining their widespread application. To fill these gaps, we propose MAexp, a generic platform for multi-agent exploration that integrates a broad range of state-of-the-art MARL algorithms and representative scenarios. Moreover, we employ point clouds to represent our exploration scenarios, leading to high-fidelity environment mapping and a sampling speed approximately 40 times faster than existing platforms. Furthermore, equipped with an attention-based Multi-Agent Target Generator and a Single-Agent Motion Planner, MAexp can work with arbitrary numbers of agents and accommodate various types of robots. Extensive experiments are conducted to establish the first benchmark featuring several high-performance MARL algorithms across typical scenarios for robots with continuous actions, which highlights the distinct strengths of each algorithm in different scenarios.

Centralized vs. Decentralized Multi-Agent Reinforcement Learning for Enhanced Control of Electric Vehicle Charging Networks

Authors:Amin Shojaeighadikolaei, Zsolt Talata, Morteza Hashemi
Date:2024-04-18 21:50:03

The widespread adoption of electric vehicles (EVs) poses several challenges to power distribution networks and smart grid infrastructure due to the possibility of significantly increasing electricity demands, especially during peak hours. Furthermore, when EVs participate in demand-side management programs, charging expenses can be reduced by using optimal charging control policies that fully utilize real-time pricing schemes. However, devising optimal charging methods and control strategies for EVs is challenging due to various stochastic and uncertain environmental factors. Currently, most EV charging controllers operate based on a centralized model. In this paper, we introduce a novel approach for a distributed and cooperative charging strategy using a Multi-Agent Reinforcement Learning (MARL) framework. Our method is built upon the Deep Deterministic Policy Gradient (DDPG) algorithm for a group of EVs in a residential community, where all EVs are connected to a shared transformer. This method, referred to as CTDE-DDPG, adopts a Centralized Training Decentralized Execution (CTDE) approach to establish cooperation between agents during the training phase, while ensuring distributed and privacy-preserving operation during execution. We theoretically examine the performance of centralized and decentralized critics for the DDPG-based MARL implementation and demonstrate their trade-offs. Furthermore, we numerically explore the efficiency, scalability, and performance of centralized and decentralized critics. Our theoretical and numerical results indicate that, despite higher policy gradient variances and training complexity, the CTDE-DDPG framework significantly improves charging efficiency by reducing total variation by approximately 36% and charging cost by around 9.1% on average...

JointPPO: Diving Deeper into the Effectiveness of PPO in Multi-Agent Reinforcement Learning

Authors:Chenxing Liu, Guizhong Liu
Date:2024-04-18 01:27:02

While Centralized Training with Decentralized Execution (CTDE) has become the prevailing paradigm in Multi-Agent Reinforcement Learning (MARL), it may not be suitable for scenarios in which agents can fully communicate and share observations with each other. Fully centralized methods, also known as Centralized Training with Centralized Execution (CTCE) methods, can fully utilize the observations of all the agents by treating the entire system as a single agent. However, traditional CTCE methods suffer from scalability issues due to the exponential growth of the joint action space. To address these challenges, in this paper we propose JointPPO, a CTCE method that uses Proximal Policy Optimization (PPO) to directly optimize the joint policy of the multi-agent system. JointPPO decomposes the joint policy into conditional probabilities, transforming the decision-making process into a sequence generation task. A Transformer-based joint policy network is constructed and trained with a PPO loss tailored for the joint policy. JointPPO effectively handles a large joint action space and extends PPO to the multi-agent setting in a clear and concise manner. Extensive experiments on the StarCraft Multi-Agent Challenge (SMAC) testbed demonstrate the superiority of JointPPO over strong baselines. Ablation experiments and analyses are conducted to explore the factors influencing JointPPO's performance.
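
The core of JointPPO's CTCE formulation is the autoregressive factorization $\pi(a_1,\dots,a_N \mid s) = \prod_i \pi(a_i \mid s, a_{<i})$, which sidesteps the exponential joint action space by generating actions as a sequence. A minimal PyTorch sketch of that factorization follows; the paper uses a Transformer-based network, whereas the plain MLP conditioning and all names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AutoregressiveJointPolicy(nn.Module):
    # pi(a_1..a_N | s) = prod_i pi(a_i | s, a_<i); the MLP is an
    # illustrative stand-in for the paper's Transformer decoder
    def __init__(self, state_dim, n_agents, n_actions, hidden=128):
        super().__init__()
        self.n_agents, self.n_actions = n_agents, n_actions
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_agents * n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, state):
        batch = state.shape[0]
        prev = torch.zeros(batch, self.n_agents * self.n_actions,
                           device=state.device)
        actions, logps = [], []
        for i in range(self.n_agents):
            logits = self.net(torch.cat([state, prev], dim=-1))
            dist = torch.distributions.Categorical(logits=logits)
            a = dist.sample()
            actions.append(a)
            logps.append(dist.log_prob(a))
            # condition later agents on this agent's one-hot action
            prev[torch.arange(batch), i * self.n_actions + a] = 1.0
        return torch.stack(actions, 1), torch.stack(logps, 1).sum(-1)
```

The summed log-probability is exactly the joint policy's log-probability, which is what a PPO loss over the joint policy would consume.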

Function Approximation for Reinforcement Learning Controller for Energy from Spread Waves

Authors:Soumyendu Sarkar, Vineet Gundecha, Sahand Ghorbanpour, Alexander Shmakov, Ashwin Ramesh Babu, Avisek Naug, Alexandre Pichard, Mathieu Cocho
Date:2024-04-17 02:04:10

Industrial multi-generator Wave Energy Converters (WECs) must handle multiple simultaneous waves coming from different directions, called spread waves. These complex devices in challenging circumstances need controllers with multiple objectives of energy capture efficiency, reduction of structural stress to limit maintenance, and proactive protection against high waves. The Multi-Agent Reinforcement Learning (MARL) controller trained with the Proximal Policy Optimization (PPO) algorithm can handle these complexities. In this paper, we explore different function approximations for the policy and critic networks in modeling the sequential nature of the system dynamics and find that they are key to better performance. We investigated the performance of fully connected neural network (FCN), LSTM, and Transformer model variants with varying depths and gated residual connections. Our results show that the transformer model of moderate depth with gated residual connections around the multi-head attention, multi-layer perceptron, and the transformer block (STrXL) proposed in this paper is optimal and boosts energy efficiency by an average of 22.1% for these complex spread waves over the existing spring damper (SD) controller. Furthermore, unlike the default SD controller, the transformer controller almost eliminated the mechanical stress from the rotational yaw motion for angled waves. Demo: https://tinyurl.com/yueda3jh

Group-Aware Coordination Graph for Multi-Agent Reinforcement Learning

Authors:Wei Duan, Jie Lu, Junyu Xuan
Date:2024-04-17 01:17:10

Cooperative Multi-Agent Reinforcement Learning (MARL) necessitates seamless collaboration among agents, often represented by an underlying relation graph. Existing methods for learning this graph primarily focus on agent-pair relations, neglecting higher-order relationships. While several approaches attempt to extend cooperation modelling to encompass behaviour similarities within groups, they commonly fall short in concurrently learning the latent graph, thereby constraining the information exchange among partially observed agents. To overcome these limitations, we present a novel approach to infer the Group-Aware Coordination Graph (GACG), which is designed to capture both the cooperation between agent pairs based on current observations and group-level dependencies from behaviour patterns observed across trajectories. This graph is further used in graph convolution for information exchange between agents during decision-making. To further ensure behavioural consistency among agents within the same group, we introduce a group distance loss, which promotes group cohesion and encourages specialization between groups. Our evaluations, conducted on StarCraft II micromanagement tasks, demonstrate GACG's superior performance. An ablation study further provides experimental evidence of the effectiveness of each component of our method.

Randomized Exploration in Cooperative Multi-Agent Reinforcement Learning

Authors:Hao-Lun Hsu, Weixin Wang, Miroslav Pajic, Pan Xu
Date:2024-04-16 17:01:38

We present the first study on provably efficient randomized exploration in cooperative multi-agent reinforcement learning (MARL). We propose a unified algorithm framework for randomized exploration in parallel Markov Decision Processes (MDPs), and two Thompson Sampling (TS)-type algorithms, CoopTS-PHE and CoopTS-LMC, incorporating the perturbed-history exploration (PHE) strategy and the Langevin Monte Carlo exploration (LMC) strategy respectively, which are flexible in design and easy to implement in practice. For a special class of parallel MDPs where the transition is (approximately) linear, we theoretically prove that both CoopTS-PHE and CoopTS-LMC achieve a $\widetilde{\mathcal{O}}(d^{3/2}H^2\sqrt{MK})$ regret bound with communication complexity $\widetilde{\mathcal{O}}(dHM^2)$, where $d$ is the feature dimension, $H$ is the horizon length, $M$ is the number of agents, and $K$ is the number of episodes. This is the first theoretical result for randomized exploration in cooperative MARL. We evaluate our proposed method on multiple parallel RL environments, including a deep exploration problem (\textit{i.e.,} $N$-chain), a video game, and a real-world problem in energy systems. Our experimental results support that our framework can achieve better performance, even under conditions of misspecified transition models. Additionally, we establish a connection between our unified framework and the practical application of federated learning.
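
As a rough illustration of the perturbed-history exploration (PHE) strategy in the linear setting: refit a regularized regression on the observed history with i.i.d. pseudo-noise added to every reward, so that acting greedily with respect to the perturbed estimate yields randomized, Thompson-sampling-like exploration. The sketch below assumes linear features and Gaussian pseudo-noise; all names are illustrative.

```python
import numpy as np

def phe_weights(features, rewards, noise_scale, reg=1.0, rng=None):
    # features: (T, d) history of feature vectors; rewards: (T,)
    # refit ridge regression on the history with pseudo-noise added to each
    # observed reward; the greedy action under the perturbed fit behaves
    # like a randomized (posterior-sample-style) exploratory choice
    rng = rng or np.random.default_rng()
    perturbed = rewards + rng.normal(0.0, noise_scale, size=len(rewards))
    A = features.T @ features + reg * np.eye(features.shape[1])
    return np.linalg.solve(A, features.T @ perturbed)
```

Redrawing the pseudo-noise and refitting at each episode is what injects the randomness; the LMC variant instead perturbs via noisy gradient steps.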

Higher Replay Ratio Empowers Sample-Efficient Multi-Agent Reinforcement Learning

Authors:Linjie Xu, Zichuan Liu, Alexander Dockhorn, Diego Perez-Liebana, Jinyu Wang, Lei Song, Jiang Bian
Date:2024-04-15 12:18:09

One of the notorious issues for Reinforcement Learning (RL) is poor sample efficiency. Compared to single-agent RL, sample efficiency for Multi-Agent Reinforcement Learning (MARL) is more challenging because of its inherent partial observability, non-stationary training, and enormous strategy space. While much effort has been devoted to developing new methods and enhancing sample efficiency, we instead examine the widely used episodic training mechanism. In each training step, tens of frames are collected, but only one gradient step is made. We argue that this episodic training could be a source of poor sample efficiency. To better exploit the data already collected, we propose to increase the frequency of gradient updates per environment interaction (a.k.a. the Replay Ratio or Update-To-Data ratio). To show its generality, we evaluate $3$ MARL methods on $6$ SMAC tasks. The empirical results validate that a higher replay ratio significantly improves the sample efficiency of MARL algorithms. The code to reproduce the results presented in this paper is open-sourced at https://anonymous.4open.science/r/rr_for_MARL-0D83/.
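
The proposed change is mechanical: instead of one gradient step per collected batch of frames, take several gradient steps per environment transition. A minimal sketch, assuming generic `env`, `agent`, and `buffer` interfaces (all hypothetical):

```python
def train(env, agent, buffer, total_env_steps, replay_ratio=4):
    # `env`, `agent`, and `buffer` are hypothetical interfaces
    obs = env.reset()
    for _ in range(total_env_steps):
        action = agent.act(obs)
        next_obs, reward, done, info = env.step(action)
        buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs
        # the replay ratio: several gradient steps per environment
        # transition, instead of one gradient step per collected episode
        for _ in range(replay_ratio):
            agent.update(buffer.sample())
```

In the paper's terms, `replay_ratio` is the update-to-data ratio; values above one reuse the collected frames harder at the cost of extra compute.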

The Power in Communication: Power Regularization of Communication for Autonomy in Cooperative Multi-Agent Reinforcement Learning

Authors:Nancirose Piazza, Vahid Behzadan, Stefan Sarkadi
Date:2024-04-09 15:29:16

Communication plays a vital role for coordination in Multi-Agent Reinforcement Learning (MARL) systems. However, misaligned agents can exploit other agents' trust and delegated power to the communication medium. In this paper, we propose power regularization as a method to limit the adverse effects of communication by misaligned agents, specifically communication which impairs the performance of cooperative agents. Power is a measure of the influence one agent's actions have over another agent's policy. By introducing power regularization, we aim to allow designers to control or reduce agents' dependency on communication when appropriate, and make them more resilient to performance deterioration due to misuses of communication. We investigate several environments in which power regularization can be a valuable capability for learning different policies that reduce the effect of power dynamics between agents during communication.

Attention-Driven Multi-Agent Reinforcement Learning: Enhancing Decisions with Expertise-Informed Tasks

Authors:Andre R Kuroswiski, Annie S Wu, Angelo Passaro
Date:2024-04-08 20:06:33

In this paper, we introduce an alternative approach to enhancing Multi-Agent Reinforcement Learning (MARL) through the integration of domain knowledge and attention-based policy mechanisms. Our methodology focuses on the incorporation of domain-specific expertise into the learning process, which simplifies the development of collaborative behaviors. This approach aims to reduce the complexity and learning overhead typically associated with MARL by enabling agents to concentrate on essential aspects of complex tasks, thus optimizing the learning curve. The utilization of attention mechanisms plays a key role in our model. It allows for the effective processing of dynamic context data and nuanced agent interactions, leading to more refined decision-making. Applied in standard MARL scenarios, such as the Stanford Intelligent Systems Laboratory (SISL) Pursuit and Multi-Particle Environments (MPE) Simple Spread, our method has been shown to improve both learning efficiency and the effectiveness of collaborative behaviors. The results indicate that our attention-based approach can be a viable way to improve the efficiency of the MARL training process, integrating domain-specific knowledge at the action level.

Graph Neural Network Meets Multi-Agent Reinforcement Learning: Fundamentals, Applications, and Future Directions

Authors:Ziheng Liu, Jiayi Zhang, Enyu Shi, Zhilong Liu, Dusit Niyato, Bo Ai, Xuemin, Shen
Date:2024-04-07 09:48:05

Multi-agent reinforcement learning (MARL) has become a fundamental component of next-generation wireless communication systems. Theoretically, although MARL has the advantages of low computational complexity and fast convergence rate, there exist several challenges including partial observability, non-stationarity, and scalability. In this article, we investigate a novel MARL with graph neural network-aided communication (GNNComm-MARL) to address the aforementioned challenges by making use of graph attention networks to effectively sample neighborhoods and selectively aggregate messages. Furthermore, we thoroughly study the architecture of GNNComm-MARL and present a systematic design solution. We then present the typical applications of GNNComm-MARL from two aspects: resource allocation and mobility management. The results obtained unveil that GNNComm-MARL can achieve better performance with lower communication overhead compared to conventional communication schemes. Finally, several important research directions regarding GNNComm-MARL are presented to facilitate further investigation.

Heterogeneous Multi-Agent Reinforcement Learning for Zero-Shot Scalable Collaboration

Authors:Xudong Guo, Daming Shi, Junjie Yu, Wenhui Fan
Date:2024-04-05 03:02:57

The emergence of multi-agent reinforcement learning (MARL) is significantly transforming various fields like autonomous vehicle networks. However, real-world multi-agent systems typically contain multiple roles, and the scale of these systems dynamically fluctuates. Consequently, in order to achieve zero-shot scalable collaboration, it is essential that strategies for different roles can be updated flexibly according to the scales, which is still a challenge for current MARL frameworks. To address this, we propose a novel MARL framework named Scalable and Heterogeneous Proximal Policy Optimization (SHPPO), integrating heterogeneity into parameter-shared PPO-based MARL networks. We first leverage a latent network to learn strategy patterns for each agent adaptively. Second, we introduce a heterogeneous layer to be inserted into decision-making networks, whose parameters are specifically generated by the learned latent variables. Our approach is scalable as all the parameters are shared except for the heterogeneous layer, and gains both inter-individual and temporal heterogeneity, allowing SHPPO to adapt effectively to varying scales. SHPPO exhibits superior performance in classic MARL environments like Starcraft Multi-Agent Challenge (SMAC) and Google Research Football (GRF), showcasing enhanced zero-shot scalability, and offering insights into the learned latent variables' impact on team performance by visualization.

Laser Learning Environment: A new environment for coordination-critical multi-agent tasks

Authors:Yannick Molinghen, Raphaël Avalos, Mark Van Achter, Ann Nowé, Tom Lenaerts
Date:2024-04-04 17:05:42

We introduce the Laser Learning Environment (LLE), a collaborative multi-agent reinforcement learning environment in which coordination is central. In LLE, agents depend on each other to make progress (interdependence), must jointly take specific sequences of actions to succeed (perfect coordination), and accomplishing those joint actions does not yield any intermediate reward (zero-incentive dynamics). The challenge of such problems lies in the difficulty of escaping state space bottlenecks caused by interdependence steps, since escaping those bottlenecks is not rewarded. We test multiple state-of-the-art value-based MARL algorithms against LLE and show that they consistently fail at the collaborative task because of their inability to escape state space bottlenecks, even though they successfully achieve perfect coordination. We show that Q-learning extensions such as prioritized experience replay and n-step returns hinder exploration in environments with zero-incentive dynamics, and find that intrinsic curiosity with random network distillation is not sufficient to escape those bottlenecks. We demonstrate the need for novel methods to solve this problem and the relevance of LLE as a cooperative MARL benchmark.

MARL-LNS: Cooperative Multi-agent Reinforcement Learning via Large Neighborhoods Search

Authors:Weizhe Chen, Sven Koenig, Bistra Dilkina
Date:2024-04-03 22:51:54

Cooperative multi-agent reinforcement learning (MARL) has been an increasingly important research topic in the last half-decade because of its great potential for real-world applications. Because of the curse of dimensionality, the popular "centralized training decentralized execution" framework requires a long training time, yet still cannot converge efficiently. In this paper, we propose a general training framework, MARL-LNS, to algorithmically address these issues by training on alternating subsets of agents using existing deep MARL algorithms as low-level trainers, while not introducing any additional parameters to be trained. Based on this framework, we provide three algorithm variants: random large neighborhood search (RLNS), batch large neighborhood search (BLNS), and adaptive large neighborhood search (ALNS), which alternate the subsets of agents differently. We test our algorithms on both the StarCraft Multi-Agent Challenge and Google Research Football, showing that our algorithms can automatically reduce at least 10% of training time while reaching the same final skill level as the original algorithm.
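
The framework's core loop is easy to state: in each iteration, pick a neighborhood (subset) of agents, train only those with an off-the-shelf deep MARL algorithm, and freeze the rest. A minimal sketch of the random variant (RLNS), under an assumed `trainer` interface:

```python
import random

def rlns_train(agents, trainer, iterations, neighborhood_size):
    # `agents` is a list; `trainer` stands in for any existing deep MARL
    # algorithm exposing a hypothetical train(learning_agents, fixed_agents)
    for _ in range(iterations):
        subset = random.sample(agents, neighborhood_size)
        frozen = [a for a in agents if a not in subset]
        trainer.train(learning_agents=subset, fixed_agents=frozen)
```

BLNS and ALNS would differ only in how `subset` is chosen (batched rotation vs. adaptive selection), not in the outer loop.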

EnergAIze: Multi Agent Deep Deterministic Policy Gradient for Vehicle to Grid Energy Management

Authors:Tiago Fonseca, Luis Ferreira, Bernardo Cabral, Ricardo Severino, Isabel Praca
Date:2024-04-02 23:16:17

This paper investigates the increasing roles of Renewable Energy Sources (RES) and Electric Vehicles (EVs). While indicating a new era of sustainable energy, these also introduce complex challenges, including the need to balance supply and demand and to smooth peak consumption amidst rising EV adoption rates. Addressing these challenges requires innovative solutions such as Demand Response (DR), energy flexibility management, Renewable Energy Communities (RECs), and, more specifically for EVs, Vehicle-to-Grid (V2G). However, existing V2G approaches often fall short in real-world adaptability, global REC optimization with other flexible assets, scalability, and user engagement. To bridge this gap, this paper introduces EnergAIze, a Multi-Agent Reinforcement Learning (MARL) energy management framework, leveraging the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm. EnergAIze enables user-centric and multi-objective energy management by allowing each prosumer to select from a range of personal management objectives, thus encouraging engagement. Additionally, it architects data protection and ownership through decentralized computing, where each prosumer can situate an energy management optimization node directly at their own dwelling. The local node not only manages local energy assets but also fosters REC-wide optimization. The efficacy of EnergAIze was evaluated through case studies employing the CityLearn simulation framework. These simulations were instrumental in demonstrating EnergAIze's adeptness at implementing V2G technology within a REC and managing other energy assets. The results show reductions in peak loads, ramping, carbon emissions, and electricity costs at the REC level while optimizing for individual prosumers' objectives.

Distributed Autonomous Swarm Formation for Dynamic Network Bridging

Authors:Raffaele Galliera, Thies Möhlenhof, Alessandro Amato, Daniel Duran, Kristen Brent Venable, Niranjan Suri
Date:2024-04-02 01:45:03

Effective operation and seamless cooperation of robotic systems are a fundamental component of next-generation technologies and applications. In contexts such as disaster response, swarm operations require coordinated behavior and mobility control to be handled in a distributed manner, with the quality of the agents' actions heavily relying on the communication between them and the underlying network. In this paper, we formulate the problem of dynamic network bridging in a novel Decentralized Partially Observable Markov Decision Process (Dec-POMDP), where a swarm of agents cooperates to form a link between two distant moving targets. Furthermore, we propose a Multi-Agent Reinforcement Learning (MARL) approach for the problem based on Graph Convolutional Reinforcement Learning (DGN) which naturally applies to the networked, distributed nature of the task. The proposed method is evaluated in a simulated environment and compared to a centralized heuristic baseline showing promising results. Moreover, a further step in the direction of sim-to-real transfer is presented, by additionally evaluating the proposed approach in a near Live Virtual Constructive (LVC) UAV framework.

Multi-Agent Reinforcement Learning with Control-Theoretic Safety Guarantees for Dynamic Network Bridging

Authors:Raffaele Galliera, Konstantinos Mitsopoulos, Niranjan Suri, Raffaele Romagnoli
Date:2024-04-02 01:30:41

Addressing complex cooperative tasks in safety-critical environments poses significant challenges for Multi-Agent Systems, especially under conditions of partial observability. This work introduces a hybrid approach that integrates Multi-Agent Reinforcement Learning with control-theoretic methods to ensure safe and efficient distributed strategies. Our contributions include a novel setpoint update algorithm that dynamically adjusts agents' positions to preserve safety conditions without compromising the mission's objectives. Through experimental validation, we demonstrate significant advantages over conventional MARL strategies, achieving comparable task performance with zero safety violations. Our findings indicate that integrating safe control with learning approaches not only enhances safety compliance but also achieves good performance in mission objectives.

GOV-REK: Governed Reward Engineering Kernels for Designing Robust Multi-Agent Reinforcement Learning Systems

Authors:Ashish Rana, Michael Oesterle, Jannik Brinkmann
Date:2024-04-01 14:19:00

For multi-agent reinforcement learning systems (MARLS), the problem formulation generally involves investing massive reward engineering effort specific to a given problem. However, this effort often cannot be translated to other problems; worse, it gets wasted when system dynamics change drastically. This problem is further exacerbated in sparse reward scenarios, where a meaningful heuristic can assist in the policy convergence task. We propose GOVerned Reward Engineering Kernels (GOV-REK), which dynamically assigns reward distributions to agents in MARLS during the learning stage. We also introduce governance kernels, which exploit the underlying structure in either the state or joint action space for assigning meaningful agent reward distributions. During the agent learning stage, GOV-REK iteratively explores different reward distribution configurations with a Hyperband-like algorithm to learn ideal agent reward models in a problem-agnostic manner. Our experiments demonstrate that our meaningful reward priors robustly jumpstart the learning process for effectively learning different MARL problems.

Inferring Latent Temporal Sparse Coordination Graph for Multi-Agent Reinforcement Learning

Authors:Wei Duan, Jie Lu, Junyu Xuan
Date:2024-03-28 09:20:15

Effective agent coordination is crucial in cooperative Multi-Agent Reinforcement Learning (MARL). While agent cooperation can be represented by graph structures, prevailing graph learning methods in MARL are limited. They rely solely on one-step observations, neglecting crucial historical experiences, leading to deficient graphs that foster redundant or detrimental information exchanges. Additionally, high computational demands for action-pair calculations in dense graphs impede scalability. To address these challenges, we propose inferring a Latent Temporal Sparse Coordination Graph (LTS-CG) for MARL. LTS-CG leverages agents' historical observations to calculate an agent-pair probability matrix, from which a sparse graph is sampled and used for knowledge exchange between agents, thereby simultaneously capturing agent dependencies and relation uncertainty. The computational complexity of this procedure depends only on the number of agents. This graph learning process is further augmented by two innovative characteristics: Predict-Future, which enables agents to foresee upcoming observations, and Infer-Present, which ensures a thorough grasp of the environmental context from limited data. These features allow LTS-CG to construct temporal graphs from historical and real-time information, promoting knowledge exchange during policy learning and effective collaboration. Graph learning and agent training occur simultaneously in an end-to-end manner. Our results on the StarCraft II benchmark underscore LTS-CG's superior performance.
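
A rough sketch of the graph-inference step: embed each agent's observation history, score agent pairs into a probability matrix, and sample a sparse adjacency with a Gumbel-Softmax relaxation so that sampling stays differentiable during end-to-end training. The dot-product scoring and the per-edge Bernoulli relaxation below are assumptions; only the overall pipeline follows the abstract.

```python
import torch
import torch.nn.functional as F

def sample_sparse_graph(history, encoder, temperature=1.0):
    # `encoder` (hypothetical) maps per-agent histories to embeddings
    z = encoder(history)                         # (n_agents, d)
    edge_logits = z @ z.T                        # agent-pair scores
    # per-edge Bernoulli via a two-class Gumbel-Softmax; hard=True discretizes
    two_class = torch.stack([edge_logits, torch.zeros_like(edge_logits)], -1)
    adj = F.gumbel_softmax(two_class, tau=temperature, hard=True)[..., 0]
    return adj * (1 - torch.eye(adj.shape[0]))   # drop self-loops
```

The sampled adjacency would then gate a graph-convolution message pass between agents during decision-making.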

Paths to Equilibrium in Games

Authors:Bora Yongacoglu, Gürdal Arslan, Lacra Pavel, Serdar Yüksel
Date:2024-03-26 19:58:39

In multi-agent reinforcement learning (MARL) and game theory, agents repeatedly interact and revise their strategies as new data arrives, producing a sequence of strategy profiles. This paper studies sequences of strategies satisfying a pairwise constraint inspired by policy updating in reinforcement learning, where an agent who is best responding in one period does not switch its strategy in the next period. This constraint merely requires that optimizing agents do not switch strategies, but does not constrain the non-optimizing agents in any way, and thus allows for exploration. Sequences with this property are called satisficing paths, and arise naturally in many MARL algorithms. A fundamental question about strategic dynamics is the following: for a given game and initial strategy profile, is it always possible to construct a satisficing path that terminates at an equilibrium? The resolution of this question has implications for the capabilities and limitations of a class of MARL algorithms. We answer this question in the affirmative for normal-form games. Our analysis reveals the counterintuitive insight that reward-deteriorating strategic updates are key to driving play to equilibrium along a satisficing path.
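
The satisficing-path constraint is simple to state in code: any agent that is currently best responding keeps its strategy, while the others are unconstrained (here they switch uniformly at random, which permits the reward-deteriorating updates the paper highlights). A sketch for a finite normal-form game, where `payoffs[i]` is agent $i$'s payoff tensor indexed by the joint action; the uniform switching rule is one illustrative choice, not the paper's construction.

```python
import numpy as np

def satisficing_step(payoffs, profile, rng):
    # payoffs[i]: agent i's payoff tensor, indexed by the joint action tuple
    # profile: current joint action (tuple of ints)
    new_profile = list(profile)
    for i, table in enumerate(payoffs):
        # agent i's utility for each of its actions, opponents held fixed
        u = np.array([table[profile[:i] + (a,) + profile[i + 1:]]
                      for a in range(table.shape[i])])
        if u[profile[i]] < u.max():      # not best responding: free to move
            new_profile[i] = int(rng.integers(len(u)))
    return tuple(new_profile)
```

Iterating `satisficing_step` generates one satisficing path; the paper's result says some such path always reaches an equilibrium in normal-form games.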

Self-Clustering Hierarchical Multi-Agent Reinforcement Learning with Extensible Cooperation Graph

Authors:Qingxu Fu, Tenghai Qiu, Jianqiang Yi, Zhiqiang Pu, Xiaolin Ai
Date:2024-03-26 19:19:16

Multi-Agent Reinforcement Learning (MARL) has been successful in solving many cooperative challenges. However, classic non-hierarchical MARL algorithms still cannot address various complex multi-agent problems that require hierarchical cooperative behaviors. The cooperative knowledge and policies learned in non-hierarchical algorithms are implicit and not interpretable, thereby restricting the integration of existing knowledge. This paper proposes a novel hierarchical MARL model called Hierarchical Cooperation Graph Learning (HCGL) for solving general multi-agent problems. HCGL has three components: a dynamic Extensible Cooperation Graph (ECG) for achieving self-clustering cooperation; a group of graph operators for adjusting the topology of ECG; and an MARL optimizer for training these graph operators. HCGL's key distinction from other MARL models is that the behaviors of agents are guided by the topology of ECG instead of policy neural networks. ECG is a three-layer graph consisting of an agent node layer, a cluster node layer, and a target node layer. To manipulate the ECG topology in response to changing environmental conditions, four graph operators are trained to adjust the edge connections of ECG dynamically. The hierarchical feature of ECG provides a unique approach to merge primitive actions (actions executed by the agents) and cooperative actions (actions executed by the clusters) into a unified action space, allowing us to integrate fundamental cooperative knowledge into an extensible interface. In our experiments, the HCGL model has shown outstanding performance in multi-agent benchmarks with sparse rewards. We also verify that HCGL can easily be transferred to large-scale scenarios with high zero-shot transfer success rates.

MA4DIV: Multi-Agent Reinforcement Learning for Search Result Diversification

Authors:Yiqun Chen, Jiaxin Mao, Yi Zhang, Dehong Ma, Long Xia, Jun Fan, Daiting Shi, Zhicong Cheng, Simiu Gu, Dawei Yin
Date:2024-03-26 06:34:23

Search result diversification (SRD), which aims to ensure that documents in a ranking list cover a broad range of subtopics, is a significant and widely studied problem in Information Retrieval and Web Search. Existing methods primarily utilize a paradigm of "greedy selection", i.e., selecting the document with the highest diversity score one at a time, or optimizing an approximation of the objective function. These approaches tend to be inefficient and are easily trapped in a suboptimal state. To address these challenges, we introduce Multi-Agent reinforcement learning (MARL) for search result DIVersity, which we call MA4DIV. In this approach, each document is an agent and search result diversification is modeled as a cooperative task among multiple agents. By modeling the SRD ranking problem as a cooperative MARL problem, this approach allows for directly optimizing diversity metrics, such as $\alpha$-NDCG, while achieving high training efficiency. We conducted experiments on public TREC datasets and a larger-scale dataset in an industrial setting. The experiments show that MA4DIV achieves substantial improvements in both effectiveness and efficiency over existing baselines, especially on the industrial dataset. The code of MA4DIV is available at https://github.com/chenyiqun/MA4DIV.

MQE: Unleashing the Power of Interaction with Multi-agent Quadruped Environment

Authors:Ziyan Xiong, Bo Chen, Shiyu Huang, Wei-Wei Tu, Zhaofeng He, Yang Gao
Date:2024-03-24 05:13:37

The advent of deep reinforcement learning (DRL) has significantly advanced the field of robotics, particularly in the control and coordination of quadruped robots. However, the complexity of real-world tasks often necessitates the deployment of multi-robot systems capable of sophisticated interaction and collaboration. To address this need, we introduce the Multi-agent Quadruped Environment (MQE), a novel platform designed to facilitate the development and evaluation of multi-agent reinforcement learning (MARL) algorithms in realistic and dynamic scenarios. MQE emphasizes complex interactions between robots and objects, hierarchical policy structures, and challenging evaluation scenarios that reflect real-world applications. We present a series of collaborative and competitive tasks within MQE, ranging from simple coordination to complex adversarial interactions, and benchmark state-of-the-art MARL algorithms. Our findings indicate that hierarchical reinforcement learning can simplify task learning, but also highlight the need for advanced algorithms capable of handling the intricate dynamics of multi-agent interactions. MQE serves as a stepping stone towards bridging the gap between simulation and practical deployment, offering a rich environment for future research in multi-agent systems and robot learning. For open-sourced code and more details of MQE, please refer to https://ziyanx02.github.io/multiagent-quadruped-environment/ .

Sample and Communication Efficient Fully Decentralized MARL Policy Evaluation via a New Approach: Local TD update

Authors:Fnu Hairi, Zifan Zhang, Jia Liu
Date:2024-03-23 21:39:56

In the actor-critic framework for fully decentralized multi-agent reinforcement learning (MARL), one of the key components is the MARL policy evaluation (PE) problem, where a set of $N$ agents work cooperatively to evaluate the value function of the global states for a given policy through communicating with their neighbors. In MARL-PE, a critical challenge is how to lower the sample and communication complexities, which are defined as the number of training samples and communication rounds needed to converge to some $\epsilon$-stationary point. To lower the communication complexity in MARL-PE, a "natural" idea is to perform multiple local TD-update steps between consecutive rounds of communication to reduce the communication frequency. However, the validity of the local TD-update approach remains unclear due to the potential "agent-drift" phenomenon resulting from heterogeneous rewards across agents in general. This leads to an interesting open question: can the local TD-update approach entail low sample and communication complexities? In this paper, we make the first attempt to answer this fundamental question. We focus on the setting of MARL-PE with average reward, which is motivated by many multi-agent network optimization problems. Our theoretical and experimental results confirm that allowing multiple local TD-update steps is indeed an effective approach to lowering the sample and communication complexities of MARL-PE compared to consensus-based MARL-PE algorithms. Specifically, the number of local TD-update steps between two consecutive communication rounds can be as large as $\mathcal{O}(1/\epsilon^{1/2}\log{(1/\epsilon)})$ in order to converge to an $\epsilon$-stationary point of MARL-PE. Moreover, we show theoretically that in order to reach the optimal sample complexity, the communication complexity of the local TD-update approach is $\mathcal{O}(1/\epsilon^{1/2}\log{(1/\epsilon)})$.
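
To make the local TD-update idea concrete: each agent runs several TD(0) steps on its own reward stream between communication rounds, then the agents average their value parameters. The sketch below assumes linear value features, a complete communication graph (full averaging rather than gossip), and hypothetical `agent` attributes (`w`, `phi`, `gamma`, `sample_transition`).

```python
import numpy as np

def local_td_policy_evaluation(agents, comm_rounds, local_steps, alpha):
    # linear value approximation: V(s) ~= w . phi(s); interfaces assumed
    for _ in range(comm_rounds):
        for agent in agents:
            for _ in range(local_steps):   # K local TD(0) steps, no comms
                s, r, s_next = agent.sample_transition()
                td_err = (r + agent.gamma * agent.w @ agent.phi(s_next)
                          - agent.w @ agent.phi(s))
                agent.w = agent.w + alpha * td_err * agent.phi(s)
        # one communication round: consensus averaging of value parameters
        mean_w = np.mean([agent.w for agent in agents], axis=0)
        for agent in agents:
            agent.w = mean_w.copy()
```

The paper's question is how large `local_steps` can be before heterogeneous rewards make the locally updated parameters drift apart faster than the averaging can correct.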

Carbon Footprint Reduction for Sustainable Data Centers in Real-Time

Authors:Soumyendu Sarkar, Avisek Naug, Ricardo Luna, Antonio Guillen, Vineet Gundecha, Sahand Ghorbanpour, Sajad Mousavi, Dejan Markovikj, Ashwin Ramesh Babu
Date:2024-03-21 02:59:56

As machine learning workloads significantly increase energy consumption, sustainable data centers with low carbon emissions are becoming a top priority for governments and corporations worldwide. This requires a paradigm shift in optimizing power consumption in cooling and IT loads, shifting flexible loads based on the availability of renewable energy in the power grid, and leveraging battery storage from the uninterrupted power supply in data centers, using collaborative agents. The complex association between these optimization strategies and their dependencies on variable external factors like weather and the power grid carbon intensity makes this a hard problem. Currently, a real-time controller to optimize all these goals simultaneously in a dynamic real-world setting is lacking. We propose a Data Center Carbon Footprint Reduction (DC-CFR) multi-agent Reinforcement Learning (MARL) framework that optimizes data centers for the multiple objectives of carbon footprint reduction, energy consumption, and energy cost. The results show that the DC-CFR MARL agents effectively resolved the complex interdependencies in optimizing cooling, load shifting, and energy storage in real-time for various locations under real-world dynamic weather and grid carbon intensity conditions. DC-CFR significantly outperformed the industry standard ASHRAE controller with a considerable reduction in carbon emissions (14.5%), energy usage (14.4%), and energy cost (13.7%) when evaluated over one year across multiple geographical regions.

Graph Neural Network-based Multi-agent Reinforcement Learning for Resilient Distributed Coordination of Multi-Robot Systems

Authors:Anthony Goeckner, Yueyuan Sui, Nicolas Martinet, Xinliang Li, Qi Zhu
Date:2024-03-19 18:42:22

Existing multi-agent coordination techniques are often fragile and vulnerable to anomalies such as agent attrition and communication disturbances, which are quite common in the real-world deployment of systems like field robotics. To better prepare these systems for the real world, we present a graph neural network (GNN)-based multi-agent reinforcement learning (MARL) method for resilient distributed coordination of a multi-robot system. Our method, Multi-Agent Graph Embedding-based Coordination (MAGEC), is trained using multi-agent proximal policy optimization (PPO) and enables distributed coordination around global objectives under agent attrition, partial observability, and limited or disturbed communications. We use a multi-robot patrolling scenario to demonstrate our MAGEC method in a ROS 2-based simulator and then compare its performance with prior coordination approaches. Results demonstrate that MAGEC outperforms existing methods in several experiments involving agent attrition and communication disturbance, and provides competitive results in scenarios without such anomalies.

A Scalable and Parallelizable Digital Twin Framework for Sustainable Sim2Real Transition of Multi-Agent Reinforcement Learning Systems

Authors:Chinmay Vilas Samak, Tanmay Vilas Samak, Venkat Krovi
Date:2024-03-16 18:47:04

Multi-agent reinforcement learning (MARL) systems usually require significantly long training times due to their inherent complexity. Furthermore, deploying them in the real world demands a feature-rich environment along with multiple embodied agents, which may not be feasible due to budget or space limitations, not to mention energy consumption and safety issues. This work tries to address these pain points by presenting a sustainable digital twin framework capable of accelerating MARL training by selectively scaling parallelized workloads on-demand, and transferring the trained policies from simulation to reality using minimal hardware resources. The applicability of the proposed digital twin framework is highlighted through two representative use cases, which cover cooperative as well as competitive classes of MARL problems. We study the effect of agent and environment parallelization on training time and that of systematic domain randomization on zero-shot sim2real transfer across both the case studies. Results indicate up to 76.3% reduction in training time with the proposed parallelization scheme and as low as 2.9% sim2real gap using the suggested deployment method.

Beyond Joint Demonstrations: Personalized Expert Guidance for Efficient Multi-Agent Reinforcement Learning

Authors:Peihong Yu, Manav Mishra, Alec Koppel, Carl Busart, Priya Narayan, Dinesh Manocha, Amrit Bedi, Pratap Tokekar
Date:2024-03-13 20:11:20

Multi-Agent Reinforcement Learning (MARL) algorithms face the challenge of efficient exploration due to the exponential increase in the size of the joint state-action space. While demonstration-guided learning has proven beneficial in single-agent settings, its direct applicability to MARL is hindered by the practical difficulty of obtaining joint expert demonstrations. In this work, we introduce a novel concept of personalized expert demonstrations, tailored for each individual agent or, more broadly, each individual type of agent within a heterogeneous team. These demonstrations solely pertain to single-agent behaviors and how each agent can achieve personal goals without encompassing any cooperative elements, thus naively imitating them will not achieve cooperation due to potential conflicts. To this end, we propose an approach that selectively utilizes personalized expert demonstrations as guidance and allows agents to learn to cooperate, namely personalized expert-guided MARL (PegMARL). This algorithm utilizes two discriminators: the first provides incentives based on the alignment of individual agent behavior with demonstrations, and the second regulates incentives based on whether the behaviors lead to the desired outcome. We evaluate PegMARL using personalized demonstrations in both discrete and continuous environments. The experimental results demonstrate that PegMARL outperforms state-of-the-art MARL algorithms in solving coordinated tasks, achieving strong performance even when provided with suboptimal personalized demonstrations. We also showcase PegMARL's capability of leveraging joint demonstrations in the StarCraft scenario and converging effectively even with demonstrations from non-co-trained policies.

Multi-Objective Optimization Using Adaptive Distributed Reinforcement Learning

Authors:Jing Tan, Ramin Khalili, Holger Karl
Date:2024-03-13 18:05:16

The Intelligent Transportation System (ITS) environment is known to be dynamic and distributed, where participants (vehicle users, operators, etc.) have multiple, changing and possibly conflicting objectives. Although Reinforcement Learning (RL) algorithms are commonly applied to optimize ITS applications such as resource management and offloading, most RL algorithms focus on single objectives. In many situations, converting a multi-objective problem into a single-objective one is impossible, intractable or insufficient, making such RL algorithms inapplicable. We propose a multi-objective, multi-agent reinforcement learning (MARL) algorithm with high learning efficiency and low computational requirements, which automatically triggers adaptive few-shot learning in a dynamic, distributed and noisy environment with sparse and delayed reward. We test our algorithm in an ITS environment with edge cloud computing. Empirical results show that the algorithm is quick to adapt to new environments and performs better in all individual and system metrics compared to the state-of-the-art benchmark. Our algorithm also addresses various practical concerns with its modularized and asynchronous online training method. In addition to the cloud simulation, we test our algorithm on a single-board computer and show that it can make inference in 6 milliseconds.

SpaceOctopus: An Octopus-inspired Motion Planning Framework for Multi-arm Space Robot

Authors:Wenbo Zhao, Shengjie Wang, Yixuan Fan, Yang Gao, Tao Zhang
Date:2024-03-13 03:34:00

Space robots have played a critical role in autonomous maintenance and space junk removal. Multi-arm space robots can efficiently complete the target capture and base reorientation tasks due to their flexibility and the collaborative capabilities between the arms. However, the complex coupling properties arising from both the multiple arms and the free-floating base present challenges to the motion planning problems of multi-arm space robots. We observe that the octopus elegantly achieves similar goals when grabbing prey and escaping from danger. Inspired by the distributed control of octopuses' limbs, we develop a multi-level decentralized motion planning framework to manage the movement of different arms of space robots. This motion planning framework integrates naturally with the multi-agent reinforcement learning (MARL) paradigm. The results indicate that our method outperforms the previous method (centralized training). Leveraging the flexibility of the decentralized framework, we reassemble policies trained for different tasks, enabling the space robot to complete trajectory planning tasks while adjusting the base attitude without further learning. Furthermore, our experiments confirm the superior robustness of our method in the face of external disturbances, changing base masses, and even the failure of one arm.

Ensembling Prioritized Hybrid Policies for Multi-agent Pathfinding

Authors:Huijie Tang, Federico Berto, Jinkyoo Park
Date:2024-03-12 11:47:12

Multi-Agent Reinforcement Learning (MARL) based Multi-Agent Path Finding (MAPF) has recently gained attention due to its efficiency and scalability. Several MARL-MAPF methods choose to use communication to enrich the information one agent can perceive. However, existing works still struggle in structured environments with high obstacle density and a high number of agents. To further improve the performance of the communication-based MARL-MAPF solvers, we propose a new method, Ensembling Prioritized Hybrid Policies (EPH). We first propose a selective communication block to gather richer information for better agent coordination within multi-agent environments and train the model with a Q learning-based algorithm. We further introduce three advanced inference strategies aimed at bolstering performance during the execution phase. First, we hybridize the neural policy with single-agent expert guidance for navigating conflict-free zones. Secondly, we propose Q value-based methods for prioritized resolution of conflicts as well as deadlock situations. Finally, we introduce a robust ensemble method that can efficiently collect the best out of multiple possible solutions. We empirically evaluate EPH in complex multi-agent environments and demonstrate competitive performance against state-of-the-art neural methods for MAPF. We open-source our code at https://github.com/ai4co/eph-mapf.

Generalising Multi-Agent Cooperation through Task-Agnostic Communication

Authors:Dulhan Jayalath, Steven Morad, Amanda Prorok
Date:2024-03-11 14:20:13

Existing communication methods for multi-agent reinforcement learning (MARL) in cooperative multi-robot problems are almost exclusively task-specific, training new communication strategies for each unique task. We address this inefficiency by introducing a communication strategy applicable to any task within a given environment. We pre-train the communication strategy without task-specific reward guidance in a self-supervised manner using a set autoencoder. Our objective is to learn a fixed-size latent Markov state from a variable number of agent observations. Under mild assumptions, we prove that policies using our latent representations are guaranteed to converge, and upper bound the value error introduced by our Markov state approximation. Our method enables seamless adaptation to novel tasks without fine-tuning the communication strategy, gracefully supports scaling to more agents than present during training, and detects out-of-distribution events in an environment. Empirical results on diverse MARL scenarios validate the effectiveness of our approach, surpassing task-specific communication strategies in unseen tasks. Our implementation of this work is available at https://github.com/proroklab/task-agnostic-comms.
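
A rough sketch of the pre-training objective: a permutation-invariant set encoder compresses a variable number of agent observations into a fixed-size latent, and a decoder reconstructs the observations from it, with no task reward involved. The DeepSets-style sum pooling and the fixed-size decoder output below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SetAutoencoder(nn.Module):
    # encode a variable-size set of observations into one fixed-size latent
    def __init__(self, obs_dim, latent_dim, max_agents, hidden=128):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, max_agents * obs_dim))

    def forward(self, obs_set):                  # obs_set: (n_agents, obs_dim)
        latent = self.phi(obs_set).sum(dim=0)    # permutation-invariant pool
        recon = self.decoder(latent)             # reconstruction target
        return latent, recon
```

Because the latent is trained without task-specific rewards, the same communication channel can be reused across tasks, which is the inefficiency the paper targets.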

DeepSafeMPC: Deep Learning-Based Model Predictive Control for Safe Multi-Agent Reinforcement Learning

Authors:Xuefeng Wang, Henglin Pu, Hyung Jun Kim, Husheng Li
Date:2024-03-11 03:17:33

Safe multi-agent reinforcement learning (safe MARL) has increasingly gained attention in recent years, emphasizing the need for agents to not only optimize the global return but also adhere to safety requirements through behavioral constraints. Some recent work has integrated control theory with multi-agent reinforcement learning to address the challenge of ensuring safety. However, there have been only very limited applications of Model Predictive Control (MPC) methods in this domain, primarily due to the complex and implicit dynamics characteristic of multi-agent environments. To bridge this gap, we propose a novel method called Deep Learning-Based Model Predictive Control for Safe Multi-Agent Reinforcement Learning (DeepSafeMPC). The key insight of DeepSafeMPC is to leverage a centralized deep learning model to accurately predict environmental dynamics. Our method applies MARL principles to search for optimal solutions. Through the employment of MPC, the actions of agents can be restricted within safe states concurrently. We demonstrate the effectiveness of our approach using the Safe Multi-agent MuJoCo environment, showcasing significant advancements in addressing safety concerns in MARL.
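
As a rough illustration of an MPC layer over a learned dynamics model: sample candidate action sequences, roll them out through the model, discard rollouts that leave the safe set, and execute the first action of the cheapest safe sequence. Random shooting is used here purely for brevity; the cost function, safety predicate, and all interfaces are assumptions rather than the paper's design.

```python
import torch

def mpc_safe_action(dynamics, cost_fn, safe_fn, state, action_dim,
                    horizon=10, n_candidates=256):
    # random-shooting MPC over a learned model (all interfaces hypothetical)
    seqs = torch.rand(n_candidates, horizon, action_dim) * 2 - 1
    costs = torch.zeros(n_candidates)
    safe = torch.ones(n_candidates, dtype=torch.bool)
    s = state.unsqueeze(0).repeat(n_candidates, 1)
    for t in range(horizon):
        s = dynamics(s, seqs[:, t])              # learned one-step model
        costs = costs + cost_fn(s, seqs[:, t])
        safe &= safe_fn(s)                       # stay within the safe set
    costs[~safe] = float("inf")                  # reject unsafe rollouts
    return seqs[costs.argmin(), 0]               # receding horizon: first action
```

Re-planning at every step (receding horizon) is what lets the filter keep agents inside safe states even as the environment evolves.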

Multi-Agent Reinforcement Learning with a Hierarchy of Reward Machines

Authors:Xuejing Zheng, Chao Yu
Date:2024-03-08 06:38:22

In this paper, we study cooperative Multi-Agent Reinforcement Learning (MARL) problems using Reward Machines (RMs) to specify the reward functions, such that prior knowledge of high-level events in a task can be leveraged to improve learning efficiency. Unlike existing work in which RMs have been incorporated into MARL for task decomposition and policy learning in relatively simple domains, or under an assumption of independence among the agents, we present Multi-Agent Reinforcement Learning with a Hierarchy of RMs (MAHRM) that is capable of dealing with more complex scenarios where events among agents can occur concurrently and the agents are highly interdependent. MAHRM exploits the relationships among high-level events to decompose a task into a hierarchy of simpler subtasks that are assigned to small groups of agents, so as to reduce the overall computational complexity. Experimental results in three cooperative MARL domains show that MAHRM outperforms other MARL methods using the same prior knowledge of high-level events.

iTRPL: An Intelligent and Trusted RPL Protocol based on Multi-Agent Reinforcement Learning

Authors:Debasmita Dey, Nirnay Ghosh
Date:2024-03-07 11:28:26

Routing Protocol for Low Power and Lossy Networks (RPL) is the de-facto routing standard in IoT networks. It enables nodes to collaborate and autonomously build ad-hoc networks modeled by tree-like destination-oriented direct acyclic graphs (DODAG). Despite its widespread usage in industry and healthcare domains, RPL is susceptible to insider attacks. Although the state-of-the-art RPL ensures that only authenticated nodes participate in DODAG, such hard security measures are still inadequate to prevent insider threats. This entails a need to integrate soft security mechanisms to support decision-making. This paper proposes iTRPL, an intelligent and behavior-based framework that incorporates trust to segregate honest and malicious nodes within a DODAG. It also leverages multi-agent reinforcement learning (MARL) to make autonomous decisions concerning the DODAG. The framework enables a parent node to compute the trust for its child and decide if the latter can join the DODAG. It tracks the behavior of the child node, updates the trust, computes the rewards (or penalties), and shares with the root. The root aggregates the rewards/penalties of all nodes, computes the overall return, and decides via its $\epsilon$-Greedy MARL module if the DODAG will be retained or modified for the future. A simulation-based performance evaluation demonstrates that iTRPL learns to make optimal decisions with time.

Reaching Consensus in Cooperative Multi-Agent Reinforcement Learning with Goal Imagination

Authors:Liangzhou Wang, Kaiwen Zhu, Fengming Zhu, Xinghu Yao, Shujie Zhang, Deheng Ye, Haobo Fu, Qiang Fu, Wei Yang
Date:2024-03-05 18:07:34

Reaching consensus is key to multi-agent coordination. To accomplish a cooperative task, agents need to coherently select optimal joint actions to maximize the team reward. However, current cooperative multi-agent reinforcement learning (MARL) methods usually do not explicitly take consensus into consideration, which may cause miscoordination problems. In this paper, we propose a model-based consensus mechanism to explicitly coordinate multiple agents. The proposed Multi-agent Goal Imagination (MAGI) framework guides agents to reach consensus with an imagined common goal. The common goal is an achievable state with high value, obtained by sampling from the distribution of future states. We directly model this distribution with a self-supervised generative model, thus alleviating the "curse of dimensionality" problem induced by the multi-agent multi-step policy rollouts commonly used in model-based methods. We show that such an efficient consensus mechanism can guide all agents to cooperatively reach valuable future states. Results on Multi-Agent Particle Environments and the Google Research Football environment demonstrate the superiority of MAGI in both sample efficiency and performance.
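
The goal-imagination step reduces to sampling candidate future states from the generative model and adopting the highest-value one as the team's common goal. The sketch below assumes `generator` and `value_fn` interfaces; it illustrates the mechanism rather than reproducing MAGI's code.

    # Hedged sketch: pick an achievable, high-value imagined state as the goal.
    import numpy as np

    def imagine_common_goal(generator, value_fn, n_samples=64):
        candidates = [generator() for _ in range(n_samples)]  # sampled futures
        values = np.array([value_fn(s) for s in candidates])
        return candidates[int(values.argmax())]

    # Toy usage: states are 2-D points; value prefers a large first coordinate.
    goal = imagine_common_goal(lambda: np.random.randn(2), lambda s: s[0])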

PPS-QMIX: Periodically Parameter Sharing for Accelerating Convergence of Multi-Agent Reinforcement Learning

Authors:Ke Zhang, DanDan Zhu, Qiuhan Xu, Hao Zhou, Ce Zheng
Date:2024-03-05 03:59:01

Training in multi-agent reinforcement learning (MARL) is a time-consuming process caused by the distribution shift of each agent. One drawback is that the strategy of each agent in MARL is learned independently, even though the agents actually act in cooperation. Thus, a central issue in multi-agent reinforcement learning is how to efficiently accelerate the training process. To address this problem, current research leverages a centralized function (CF) across multiple agents to learn each agent's contribution to the team reward. However, CF-based methods introduce joint errors from the other agents into the estimation of the value network. Inspired by federated learning, we propose three simple novel approaches, called Average Periodically Parameter Sharing (A-PPS), Reward-Scalability Periodically Parameter Sharing (RS-PPS), and Partial Personalized Periodically Parameter Sharing (PP-PPS), to accelerate the training of MARL. Agents share their Q-value networks periodically during the training process: agents with the same identity weight the shared parameters by their collected rewards, and may update only part of their networks in each period so that some parameters remain personalized. We apply our approaches to the classical MARL method QMIX and evaluate them on various tasks in the StarCraft Multi-Agent Challenge (SMAC) environment. Numerical experiments yield substantial gains, with an average improvement of 10\%-30\%, and enable winning tasks that QMIX cannot. Our code can be downloaded from https://github.com/ColaZhang22/PPS-QMIX
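
The core of the A-PPS variant is federated-style parameter averaging: every K training steps, each agent's Q-network is overwritten with the element-wise mean of all agents' parameters. A minimal PyTorch sketch, assuming identically structured networks, might look like this.

    # Hedged sketch of A-PPS-style periodic parameter averaging (PyTorch).
    import torch

    @torch.no_grad()
    def average_parameter_sharing(q_networks):
        """Overwrite every agent's parameters with the team-wide average."""
        # zip pairs up corresponding tensors across identically-shaped networks.
        for params in zip(*(net.parameters() for net in q_networks)):
            mean = torch.stack([p.data for p in params]).mean(dim=0)
            for p in params:
                p.data.copy_(mean)

    # In the training loop (sketch):
    # if step % K == 0:
    #     average_parameter_sharing(agent_q_nets)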

SMAUG: A Sliding Multidimensional Task Window-Based MARL Framework for Adaptive Real-Time Subtask Recognition

Authors:Wenjing Zhang, Wei Zhang
Date:2024-03-04 08:04:41

Instead of making behavioral decisions directly from the exponentially expanding joint observation-action space, subtask-based multi-agent reinforcement learning (MARL) methods enable agents to learn how to tackle different subtasks. Most existing subtask-based MARL methods are based on hierarchical reinforcement learning (HRL). However, these approaches often limit the number of subtasks, perform subtask recognition only periodically, and can only identify and execute a specific subtask within a predefined fixed time period, which makes them inflexible and unsuitable for diverse and dynamic scenarios with constantly changing subtasks. To break through the above restrictions, a \textbf{S}liding \textbf{M}ultidimensional t\textbf{A}sk window based m\textbf{U}lti-agent reinforcement learnin\textbf{G} framework (SMAUG) is proposed for adaptive real-time subtask recognition. It leverages a sliding multidimensional task window to extract essential subtask information from trajectory segments concatenated from observed and predicted trajectories of varying lengths. An inference network is designed to iteratively predict future trajectories together with the subtask-oriented policy network. Furthermore, intrinsic motivation rewards are defined to promote subtask exploration and behavior diversity. SMAUG can be integrated with any Q-learning-based approach. Experiments on StarCraft II show that SMAUG not only outperforms all baselines but also exhibits a more prominent and swift rise in rewards during the initial training stage.

Efficient Episodic Memory Utilization of Cooperative Multi-Agent Reinforcement Learning

Authors:Hyungho Na, Yunkyeong Seo, Il-chul Moon
Date:2024-03-02 07:37:05

In cooperative multi-agent reinforcement learning (MARL), agents aim to achieve a common goal, such as defeating enemies or scoring a goal. Existing MARL algorithms are effective but still require significant learning time and often get trapped in local optima on complex tasks, subsequently failing to discover a goal-reaching policy. To address this, we introduce Efficient episodic Memory Utilization (EMU) for MARL, with two primary objectives: (a) accelerating reinforcement learning by leveraging semantically coherent memory from an episodic buffer and (b) selectively promoting desirable transitions to prevent local convergence. To achieve (a), EMU incorporates a trainable encoder/decoder structure alongside MARL, creating coherent memory embeddings that facilitate exploratory memory recall. To achieve (b), EMU introduces a novel reward structure called the episodic incentive, based on the desirability of states. This reward improves the TD target in Q-learning and acts as an additional incentive for desirable transitions. We provide theoretical support for the proposed incentive and demonstrate the effectiveness of EMU compared to conventional episodic control. The proposed method is evaluated in StarCraft II and Google Research Football, and empirical results indicate further performance improvement over state-of-the-art methods.
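
Objective (a) can be pictured as nearest-neighbor recall in a learned embedding space: store each visited state's embedding with the episodic return observed from it, and recall the best nearby return to inform the TD target. The encoder interface and distance threshold below are assumptions for illustration.

    # Hedged sketch of embedding-based episodic memory recall.
    import numpy as np

    class EpisodicMemory:
        def __init__(self, encoder, radius=0.5):
            self.encoder, self.radius = encoder, radius
            self.keys, self.returns = [], []

        def store(self, state, episodic_return):
            self.keys.append(self.encoder(state))
            self.returns.append(episodic_return)

        def recall(self, state):
            """Best return among semantically nearby memories, else None."""
            if not self.keys:
                return None
            query = self.encoder(state)
            dists = np.linalg.norm(np.array(self.keys) - query, axis=1)
            near = np.array(self.returns)[dists < self.radius]
            return float(near.max()) if near.size else None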

Understanding Iterative Combinatorial Auction Designs via Multi-Agent Reinforcement Learning

Authors:Greg d'Eon, Neil Newman, Kevin Leyton-Brown
Date:2024-02-29 18:16:13

Iterative combinatorial auctions are widely used in high-stakes settings such as spectrum auctions. Such auctions can be hard to analyze, making it difficult for bidders to determine how to behave and for designers to optimize auction rules to ensure desirable outcomes such as high revenue or welfare. In this paper, we investigate whether multi-agent reinforcement learning (MARL) algorithms can be used to understand iterative combinatorial auctions, given that these algorithms have recently shown empirical success in several other domains. We find that MARL can indeed benefit auction analysis, but that deploying it effectively is nontrivial. We begin by describing modelling decisions that keep the resulting game tractable without sacrificing important features such as imperfect information or asymmetry between bidders. We also discuss how to navigate pitfalls of various MARL algorithms, how to overcome challenges in verifying convergence, and how to generate and interpret multiple equilibria. We illustrate the promise of our resulting approach by using it to evaluate a specific rule change to a clock auction, finding substantially different auction outcomes due to complex changes in bidders' behavior.

Imagine, Initialize, and Explore: An Effective Exploration Method in Multi-Agent Reinforcement Learning

Authors:Zeyang Liu, Lipeng Wan, Xinrui Yang, Zhuoran Chen, Xingyu Chen, Xuguang Lan
Date:2024-02-28 01:45:01

Effective exploration is crucial to discovering optimal strategies for multi-agent reinforcement learning (MARL) in complex coordination tasks. Existing methods mainly utilize intrinsic rewards to enable committed exploration or use role-based learning for decomposing joint action spaces instead of directly conducting a collective search in the entire action-observation space. However, they often face challenges in obtaining specific joint action sequences to reach successful states in long-horizon tasks. To address this limitation, we propose Imagine, Initialize, and Explore (IIE), a novel method that offers a promising solution for efficient multi-agent exploration in complex scenarios. IIE employs a transformer model to imagine how the agents reach a critical state where they can influence each other's transition functions. Then, we initialize the environment at this state using a simulator before the exploration phase. We formulate the imagination as a sequence modeling problem, where the states, observations, prompts, actions, and rewards are predicted autoregressively. The prompt consists of timestep-to-go, return-to-go, influence value, and one-shot demonstration, specifying the desired state and trajectory as well as guiding the action generation. By initializing agents at critical states, IIE significantly increases the likelihood of discovering potentially important under-explored regions. Despite its simplicity, empirical results demonstrate that our method outperforms multi-agent exploration baselines on the StarCraft Multi-Agent Challenge (SMAC) and SMACv2 environments. In particular, IIE shows improved performance in the sparse-reward SMAC tasks and produces more effective curricula over the initialized states than other generative methods, such as CVAE-GAN and diffusion models.

Learning to Model Diverse Driving Behaviors in Highly Interactive Autonomous Driving Scenarios with Multi-Agent Reinforcement Learning

Authors:Liu Weiwei, Hu Wenxuan, Jing Wei, Lei Lanxin, Gao Lingping, Liu Yong
Date:2024-02-21 02:44:33

Autonomous vehicles trained through Multi-Agent Reinforcement Learning (MARL) have shown impressive results in many driving scenarios. However, the performance of these trained policies can be impacted when faced with diverse driving styles and personalities, particularly in highly interactive situations. This is because conventional MARL algorithms usually operate under the assumption of fully cooperative behavior among all agents and focus on maximizing team rewards during training. To address this issue, we introduce the Personality Modeling Network (PeMN), which includes a cooperation value function and personality parameters to model the varied interactions in highly interactive scenarios. PeMN also enables the training of background traffic flow with diverse behaviors, thereby improving the performance and generalization of the ego vehicle. Our extensive experimental studies, which incorporate different personality parameters in highly interactive driving scenarios, demonstrate that the personality parameters effectively model diverse driving styles and that policies trained with PeMN generalize better than traditional MARL methods.

A Neuro-Symbolic Approach to Multi-Agent RL for Interpretability and Probabilistic Decision Making

Authors:Chitra Subramanian, Miao Liu, Naweed Khan, Jonathan Lenchner, Aporva Amarnath, Sarathkrishna Swaminathan, Ryan Riegel, Alexander Gray
Date:2024-02-21 00:16:08

Multi-agent reinforcement learning (MARL) is well-suited for runtime decision-making in optimizing the performance of systems where multiple agents coexist and compete for shared resources. However, applying common deep learning-based MARL solutions to real-world problems suffers from issues of interpretability, sample efficiency, partial observability, etc. To address these challenges, we present an event-driven formulation, where decision-making is handled by distributed cooperative MARL agents using neuro-symbolic methods. The recently introduced neuro-symbolic Logical Neural Networks (LNN) framework serves as a function approximator for RL, training a rules-based policy that is both logical and interpretable by construction. To enable decision-making under uncertainty and partial observability, we developed a novel probabilistic neuro-symbolic framework, Probabilistic Logical Neural Networks (PLNN), which combines the capabilities of logical reasoning with probabilistic graphical models. In PLNN, the upward/downward inference strategy, inherited from LNN, is coupled with belief bounds by setting the activation function for the logical operator associated with each neural network node to a probability-respecting generalization of the Fr\'echet inequalities. These PLNN nodes form the unifying element that combines probabilistic logic and Bayes nets, permitting inference for variables with unobserved states. We demonstrate our contributions by addressing key MARL challenges for power sharing in a system-on-chip application.

Modelling crypto markets by multi-agent reinforcement learning

Authors:Johann Lussange, Stefano Vrizzi, Stefano Palminteri, Boris Gutkin
Date:2024-02-16 16:28:58

Building on previous foundational work (Lussange et al. 2020), this study introduces a multi-agent reinforcement learning (MARL) model simulating crypto markets, calibrated to Binance's daily closing prices of $153$ cryptocurrencies that were continuously traded between 2018 and 2022. Unlike previous agent-based models (ABM) or multi-agent systems (MAS), which relied on zero-intelligence agents or single-autonomous-agent methodologies, our approach endows agents with reinforcement learning (RL) techniques in order to model crypto markets. This integration is designed to emulate, with a bottom-up approach to complexity inference, both individual and collective agents, ensuring robustness in the recent volatile conditions of such markets and during the COVID-19 era. A key feature of our model is that its autonomous agents perform asset price valuation based on two sources of information: the market prices themselves, and an approximation of the crypto assets' fundamental values beyond those market prices. Our MAS calibration against real market data allows for an accurate emulation of crypto market microstructure and probing of key market behaviors, in both the bearish and bullish regimes of that particular time period.

Optimal Task Assignment and Path Planning using Conflict-Based Search with Precedence and Temporal Constraints

Authors:Yu Quan Chong, Jiaoyang Li, Katia Sycara
Date:2024-02-13 20:07:58

The Multi-Agent Path Finding (MAPF) problem entails finding collision-free paths for a set of agents, guiding them from their start to goal locations. However, MAPF does not account for several practical task-related constraints. For example, agents may need to perform actions at goal locations with specific execution times, adhering to predetermined orders and timeframes. Moreover, goal assignments may not be predefined for agents, and the optimization objective may lack an explicit definition. To incorporate task assignment, path planning, and a user-defined objective into a coherent framework, this paper examines the Task Assignment and Path Finding with Precedence and Temporal Constraints (TAPF-PTC) problem. We augment Conflict-Based Search (CBS) to simultaneously generate task assignments and collision-free paths that adhere to precedence and temporal constraints, maximizing an objective quantified by the return from a user-defined reward function in reinforcement learning (RL). Experimentally, we demonstrate that our algorithm, CBS-TA-PTC, can solve highly challenging bomb-defusing tasks with precedence and temporal constraints efficiently relative to MARL and adapted Target Assignment and Path Finding (TAPF) methods.

Conservative and Risk-Aware Offline Multi-Agent Reinforcement Learning

Authors:Eslam Eldeeb, Houssem Sifaou, Osvaldo Simeone, Mohammad Shehab, Hirley Alves
Date:2024-02-13 12:49:22

Reinforcement learning (RL) has been widely adopted for controlling and optimizing complex engineering systems such as next-generation wireless networks. An important challenge in adopting RL is the need for direct access to the physical environment. This limitation is particularly severe in multi-agent systems, for which conventional multi-agent reinforcement learning (MARL) requires a large number of coordinated online interactions with the environment during training. When only offline data is available, a direct application of online MARL schemes would generally fail due to the epistemic uncertainty entailed by the lack of exploration during training. In this work, we propose an offline MARL scheme that integrates distributional RL and conservative Q-learning to address the environment's inherent aleatoric uncertainty and the epistemic uncertainty arising from the use of offline data. We explore both independent and joint learning strategies. The proposed MARL scheme, referred to as multi-agent conservative quantile regression, addresses general risk-sensitive design criteria and is applied to the trajectory planning problem in drone networks, showcasing its advantages.

Enabling Multi-Agent Transfer Reinforcement Learning via Scenario Independent Representation

Authors:Ayesha Siddika Nipu, Siming Liu, Anthony Harris
Date:2024-02-13 02:48:18

Multi-Agent Reinforcement Learning (MARL) algorithms are widely adopted in tackling complex tasks that require collaboration and competition among agents in dynamic Multi-Agent Systems (MAS). However, learning such tasks from scratch is arduous and may not always be feasible, particularly for MASs with a large number of interactive agents, due to the extensive sample complexity. Therefore, reusing knowledge gained from past experiences or from other agents could efficiently accelerate the learning process and help scale up MARL algorithms. In this study, we introduce a novel framework that enables transfer learning for MARL by unifying various state spaces into fixed-size inputs, making a single unified deep-learning policy viable across different scenarios within a MAS. We evaluated our approach in a range of scenarios within the StarCraft Multi-Agent Challenge (SMAC) environment, and the findings show significant enhancements in multi-agent learning performance when using maneuvering skills learned from other scenarios compared to agents learning from scratch. Furthermore, we adopted Curriculum Transfer Learning (CTL), enabling our deep learning policy to progressively acquire knowledge and skills across pre-designed homogeneous learning scenarios organized by difficulty levels. This process promotes inter- and intra-agent knowledge transfer, leading to high multi-agent learning performance in more complicated heterogeneous scenarios.

Mixed Q-Functionals: Advancing Value-Based Methods in Cooperative MARL with Continuous Action Domains

Authors:Yasin Findik, S. Reza Ahmadzadeh
Date:2024-02-12 16:21:50

Tackling multi-agent learning problems efficiently is a challenging task in continuous action domains. While value-based algorithms excel in sample efficiency when applied to discrete action domains, they are usually inefficient when dealing with continuous actions. Policy-based algorithms, on the other hand, attempt to address this challenge by leveraging critic networks for guiding the learning process and stabilizing the gradient estimation. Limitations in estimating the true return and a tendency to fall into local optima make these methods inefficient and often sub-optimal. In this paper, we diverge from the trend of further enhancing critic networks, and focus on improving the effectiveness of value-based methods in multi-agent continuous domains by concurrently evaluating numerous actions. We propose a novel multi-agent value-based algorithm, Mixed Q-Functionals (MQF), inspired by the idea of Q-Functionals, that enables agents to transform their states into basis functions. Our algorithm fosters collaboration among agents by mixing their action-values. We evaluate the efficacy of our algorithm in six cooperative multi-agent scenarios. Our empirical findings reveal that MQF outperforms four variants of Deep Deterministic Policy Gradient through rapid action evaluation and increased sample efficiency.
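
The Q-Functionals idea that MQF builds on maps a state to coefficients over fixed basis functions of the action, so scoring many candidate actions costs one network pass plus cheap dot products. The polynomial basis below is an illustrative assumption.

    # Hedged sketch: score many actions from one set of state coefficients.
    import numpy as np

    def action_basis(a):
        """Polynomial features of a 1-D action in [-1, 1]."""
        return np.array([1.0, a, a**2, a**3])

    def q_values(coeffs, actions):
        """Evaluate Q(s, a) = basis(a) . coeffs(s) for a batch of actions."""
        basis = np.stack([action_basis(a) for a in actions])  # (N, 4)
        return basis @ coeffs                                 # (N,)

    coeffs = np.array([0.1, 0.5, -0.2, 0.0])  # would come from the state network
    actions = np.linspace(-1, 1, 101)
    best_action = actions[int(np.argmax(q_values(coeffs, actions)))]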

Refined Sample Complexity for Markov Games with Independent Linear Function Approximation

Authors:Yan Dai, Qiwen Cui, Simon S. Du
Date:2024-02-11 01:51:15

Markov Games (MGs) are an important model for Multi-Agent Reinforcement Learning (MARL). It was long believed that the "curse of multi-agents" (i.e., the algorithmic performance drops exponentially with the number of agents) is unavoidable, until several recent works (Daskalakis et al., 2023; Cui et al., 2023; Wang et al., 2023). While these works resolved the curse of multi-agents, when the state spaces are prohibitively large and (linear) function approximations are deployed, they either had a slower convergence rate of $O(T^{-1/4})$ or brought a polynomial dependency on the number of actions $A_{\max}$ -- which is avoidable in single-agent cases even when the loss functions can arbitrarily vary with time. This paper first refines the AVLPR framework by Wang et al. (2023), with the insight of designing *data-dependent* (i.e., stochastic) pessimistic estimation of the sub-optimality gap, allowing a broader choice of plug-in algorithms. When specialized to MGs with independent linear function approximations, we propose novel *action-dependent bonuses* to cover occasionally extreme estimation errors. With the help of state-of-the-art techniques from the single-agent RL literature, we give the first algorithm that tackles the curse of multi-agents, attains the optimal $O(T^{-1/2})$ convergence rate, and avoids $\text{poly}(A_{\max})$ dependency simultaneously.

Risk-Sensitive Multi-Agent Reinforcement Learning in Network Aggregative Markov Games

Authors:Hafez Ghaemi, Hamed Kebriaei, Alireza Ramezani Moghaddam, Majid Nili Ahamdabadi
Date:2024-02-08 18:43:27

Classical multi-agent reinforcement learning (MARL) assumes risk neutrality and complete objectivity for agents. However, in settings where agents need to consider or model human economic or social preferences, a notion of risk must be incorporated into the RL optimization problem. This will be of greater importance in MARL where other human or non-human agents are involved, possibly with their own risk-sensitive policies. In this work, we consider risk-sensitive and non-cooperative MARL with cumulative prospect theory (CPT), a non-convex risk measure and a generalization of coherent measures of risk. CPT is capable of explaining loss aversion in humans and their tendency to overestimate/underestimate small/large probabilities. We propose a distributed sampling-based actor-critic (AC) algorithm with CPT risk for network aggregative Markov games (NAMGs), which we call Distributed Nested CPT-AC. Under a set of assumptions, we prove the convergence of the algorithm to a subjective notion of Markov perfect Nash equilibrium in NAMGs. The experimental results show that subjective CPT policies obtained by our algorithm can be different from the risk-neutral ones, and agents with a higher loss aversion are more inclined to socially isolate themselves in an NAMG.
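
For readers unfamiliar with CPT, its standard value and probability-weighting functions (Tversky and Kahneman, 1992) capture exactly the effects mentioned above: loss aversion and the distortion of small and large probabilities. The parameter values shown are the classic empirical estimates, used here purely as an illustration.

    # CPT's standard value and probability-weighting functions.
    def cpt_value(x, alpha=0.88, lam=2.25):
        """S-shaped value: diminishing sensitivity plus loss aversion (lam > 1)."""
        return x**alpha if x >= 0 else -lam * (-x)**alpha

    def cpt_weight(p, gamma=0.61):
        """Inverse-S weighting: small probabilities are overweighted."""
        return p**gamma / (p**gamma + (1 - p)**gamma)**(1 / gamma)

    print(cpt_value(-10.0))  # a loss looms larger than an equal-sized gain
    print(cpt_weight(0.01))  # > 0.01: small probability is overweighted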

When is Mean-Field Reinforcement Learning Tractable and Relevant?

Authors:Batuhan Yardim, Artur Goldman, Niao He
Date:2024-02-08 15:41:22

Mean-field reinforcement learning has become a popular theoretical framework for efficiently approximating large-scale multi-agent reinforcement learning (MARL) problems exhibiting symmetry. However, questions remain regarding the applicability of mean-field approximations: in particular, their approximation accuracy of real-world systems and conditions under which they become computationally tractable. We establish explicit finite-agent bounds for how well the MFG solution approximates the true $N$-player game for two popular mean-field solution concepts. Furthermore, for the first time, we establish explicit lower bounds indicating that MFGs are poor or uninformative at approximating $N$-player games assuming only Lipschitz dynamics and rewards. Finally, we analyze the computational complexity of solving MFGs with only Lipschitz properties and prove that they are in the class of \textsc{PPAD}-complete problems conjectured to be intractable, similar to general sum $N$ player games. Our theoretical results underscore the limitations of MFGs and complement and justify existing work by proving difficulty in the absence of common theoretical assumptions.

SUB-PLAY: Adversarial Policies against Partially Observed Multi-Agent Reinforcement Learning Systems

Authors:Oubo Ma, Yuwen Pu, Linkang Du, Yang Dai, Ruo Wang, Xiaolei Liu, Yingcai Wu, Shouling Ji
Date:2024-02-06 06:18:16

Recent advancements in multi-agent reinforcement learning (MARL) have opened up vast application prospects, such as swarm control of drones, collaborative manipulation by robotic arms, and multi-target encirclement. However, potential security threats during MARL deployment need more attention and thorough investigation. Recent research reveals that attackers can rapidly exploit the victim's vulnerabilities, generating adversarial policies that result in the failure of specific tasks; for instance, such policies have reduced the winning rate of a superhuman-level Go AI to around 20%. Existing studies predominantly focus on two-player competitive environments, assuming attackers possess complete global state observation. In this study, we unveil, for the first time, the capability of attackers to generate adversarial policies even when restricted to partial observations of the victims in multi-agent competitive environments. Specifically, we propose a novel black-box attack (SUB-PLAY) that incorporates the concept of constructing multiple subgames to mitigate the impact of partial observability and suggests sharing transitions among subpolicies to improve attackers' exploitative ability. Extensive evaluations demonstrate the effectiveness of SUB-PLAY under three typical partial observability limitations. Visualization results indicate that adversarial policies induce significantly different activations of the victims' policy networks. Furthermore, we evaluate three potential defenses aimed at exploring ways to mitigate security threats posed by adversarial policies, providing constructive recommendations for deploying MARL in competitive environments.

Multi-agent Reinforcement Learning for Energy Saving in Multi-Cell Massive MIMO Systems

Authors:Tianzhang Cai, Qichen Wang, Shuai Zhang, Özlem Tuğfe Demir, Cicek Cavdar
Date:2024-02-05 17:15:00

We develop a multi-agent reinforcement learning (MARL) algorithm to minimize the total energy consumption of multiple massive MIMO (multiple-input multiple-output) base stations (BSs) in a multi-cell network while preserving the overall quality-of-service (QoS), by making decisions on the multi-level advanced sleep modes (ASMs) and antenna switching of these BSs. The problem is modeled as a decentralized partially observable Markov decision process (DEC-POMDP) to enable collaboration between individual BSs, which is necessary to tackle inter-cell interference. A multi-agent proximal policy optimization (MAPPO) algorithm is designed to learn a collaborative BS control policy. To enhance its scalability, a modified version called the MAPPO-neighbor policy is further proposed. Simulation results demonstrate that the trained MAPPO agent achieves better performance than baseline policies. Specifically, compared to the auto sleep mode 1 (symbol-level sleeping) algorithm, the MAPPO-neighbor policy reduces power consumption by approximately 8.7% during low-traffic hours and improves energy efficiency by approximately 19% during high-traffic hours.

Near-Optimal Reinforcement Learning with Self-Play under Adaptivity Constraints

Authors:Dan Qiao, Yu-Xiang Wang
Date:2024-02-02 03:00:40

We study the problem of multi-agent reinforcement learning (MARL) with adaptivity constraints -- a new problem motivated by real-world applications where deployments of new policies are costly and the number of policy updates must be minimized. For two-player zero-sum Markov Games, we design a (policy) elimination based algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^3 S^2 ABK})$, while the batch complexity is only $O(H+\log\log K)$. In the above, $S$ denotes the number of states, $A,B$ are the number of actions for the two players respectively, $H$ is the horizon and $K$ is the number of episodes. Furthermore, we prove a batch complexity lower bound $\Omega(\frac{H}{\log_{A}K}+\log\log K)$ for all algorithms with $\widetilde{O}(\sqrt{K})$ regret bound, which matches our upper bound up to logarithmic factors. As a byproduct, our techniques naturally extend to learning bandit games and reward-free MARL within near optimal batch complexity. To the best of our knowledge, these are the first line of results towards understanding MARL with low adaptivity.

Learning and Calibrating Heterogeneous Bounded Rational Market Behaviour with Multi-Agent Reinforcement Learning

Authors:Benjamin Patrick Evans, Sumitra Ganesh
Date:2024-02-01 17:21:45

Agent-based models (ABMs) have shown promise for modelling various real world phenomena incompatible with traditional equilibrium analysis. However, a critical concern is the manual definition of behavioural rules in ABMs. Recent developments in multi-agent reinforcement learning (MARL) offer a way to address this issue from an optimisation perspective, where agents strive to maximise their utility, eliminating the need for manual rule specification. This learning-focused approach aligns with established economic and financial models through the use of rational utility-maximising agents. However, this representation departs from the fundamental motivation for ABMs: that realistic dynamics emerging from bounded rationality and agent heterogeneity can be modelled. To resolve this apparent disparity between the two approaches, we propose a novel technique for representing heterogeneous processing-constrained agents within a MARL framework. The proposed approach treats agents as constrained optimisers with varying degrees of strategic skills, permitting departure from strict utility maximisation. Behaviour is learnt through repeated simulations with policy gradients to adjust action likelihoods. To allow efficient computation, we use parameterised shared policy learning with distributions of agent skill levels. Shared policy learning avoids the need for agents to learn individual policies yet still enables a spectrum of bounded rational behaviours. We validate our model's effectiveness using real-world data on a range of canonical $n$-agent settings, demonstrating significantly improved predictive capability.

Nash Soft Actor-Critic LEO Satellite Handover Management Algorithm for Flying Vehicles

Authors:Jinxuan Chen, Mustafa Ozger, Cicek Cavdar
Date:2024-01-31 10:44:59

Compared with terrestrial networks (TNs), which can only support limited coverage areas, low-earth orbit (LEO) satellites can provide seamless global coverage and high survivability in case of emergencies. Nevertheless, the swift movement of LEO satellites poses a challenge: frequent handovers are inevitable, compromising the quality of service (QoS) of users and leading to discontinuous connectivity. Moreover, considering LEO satellite connectivity for different flying vehicles (FVs) coexisting with ground terminals, an efficient satellite handover decision control and mobility management strategy is required to reduce the number of handovers and allocate resources that align with different users' requirements. In this paper, a novel distributed satellite handover strategy based on Multi-Agent Reinforcement Learning (MARL) and game theory, named Nash-SAC, is proposed to solve these problems. Simulation results show that the Nash-SAC-based handover strategy can effectively reduce handovers by over 16 percent and the blocking rate by over 18 percent, outperforming local benchmarks such as traditional Q-learning. It also greatly improves the network utility used to quantify the performance of the whole system, by up to 48 percent, and caters to different users' requirements, providing reliable and robust connectivity for both FVs and ground terminals.

Camouflage Adversarial Attacks on Multiple Agent Systems

Authors:Ziqing Lu, Guanlin Liu, Lifeng Lai, Weiyu Xu
Date:2024-01-30 19:47:01

Multi-agent reinforcement learning (MARL) systems based on the Markov decision process (MDP) have emerged in many critical applications. To improve the robustness and defense of MARL systems against adversarial attacks, the study of various adversarial attacks on reinforcement learning systems is very important. Previous works on adversarial attacks considered several features of the MDP to attack, such as action poisoning attacks, reward poisoning attacks, and state perception attacks. In this paper, we propose a brand-new form of attack called the camouflage attack on MARL systems. In the camouflage attack, the attackers change the appearances of some objects without changing the actual objects themselves, and the camouflaged appearances may look the same to all the targeted recipient (victim) agents. The camouflaged appearances can mislead the recipient agents into misguided actions. We design algorithms that compute optimal camouflage attacks minimizing the rewards of recipient agents. Our numerical and theoretical results show that camouflage attacks can rival the more conventional, but likely more difficult, state perception attacks. We also investigate cost-constrained camouflage attacks and show numerically how cost budgets affect attack performance.

Fully Independent Communication in Multi-Agent Reinforcement Learning

Authors:Rafael Pina, Varuna De Silva, Corentin Artaud, Xiaolan Liu
Date:2024-01-26 18:42:01

Multi-Agent Reinforcement Learning (MARL) comprises a broad area of research within the field of multi-agent systems. Several recent works have focused specifically on the study of communication approaches in MARL. While multiple communication methods have been proposed, these might still be too complex and not easily transferable to more practical contexts. One reason is the reliance on the well-known parameter-sharing trick. In this paper, we investigate how independent learners in MARL that do not share parameters can communicate. We demonstrate that this setting can incur some problems, for which we propose a new learning scheme as a solution. Our results show that, despite the challenges, independent agents can still learn communication strategies following our method. Additionally, we use this method to investigate how communication in MARL is affected by different network capacities, both with and without parameter sharing. We observe that communication may not always be needed and that the chosen agent network sizes need to be considered together with communication in order to achieve efficient learning.

Peer-to-Peer Energy Trading of Solar and Energy Storage: A Networked Multiagent Reinforcement Learning Approach

Authors:Chen Feng, Andrew L. Liu
Date:2024-01-25 05:05:55

Utilizing distributed renewable and energy storage resources in local distribution networks via peer-to-peer (P2P) energy trading has long been touted as a solution to improve energy systems' resilience and sustainability. Consumers and prosumers (those who have energy generation resources), however, do not have the expertise to engage in repeated P2P trading, and the zero marginal costs of renewables present challenges in determining fair market prices. To address these issues, we propose multi-agent reinforcement learning (MARL) frameworks to help automate consumers' bidding and management of their solar PV and energy storage resources, under a specific P2P clearing mechanism that utilizes the so-called supply-demand ratio. In addition, we show how the MARL frameworks can integrate physical network constraints to realize voltage control, hence ensuring the physical feasibility of the P2P energy trading and paving the way for real-world implementations.
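
One common supply-demand-ratio (SDR) clearing rule sets the internal P2P price between the utility's feed-in tariff and its retail price, moving toward the feed-in tariff as local supply covers more of demand. The linear form below is a hedged simplification; the paper's exact clearing function may differ.

    # Hedged sketch of SDR-based P2P price clearing (one simple variant).
    def p2p_price(total_supply, total_demand, feed_in=0.04, retail=0.12):
        if total_demand <= 0:
            return feed_in                 # no local demand: feed-in tariff applies
        sdr = min(total_supply / total_demand, 1.0)
        # sdr = 0 -> retail price; sdr = 1 -> feed-in tariff.
        return retail + (feed_in - retail) * sdr

    print(p2p_price(total_supply=30.0, total_demand=100.0))  # between the tariffs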

Generalization of Heterogeneous Multi-Robot Policies via Awareness and Communication of Capabilities

Authors:Pierce Howell, Max Rudolph, Reza Torbati, Kevin Fu, Harish Ravichandar
Date:2024-01-23 22:31:34

Recent advances in multi-agent reinforcement learning (MARL) are enabling impressive coordination in heterogeneous multi-robot teams. However, existing approaches often overlook the challenge of generalizing learned policies to teams of new compositions, sizes, and robots. While such generalization might not be important in teams of virtual agents that can retrain policies on-demand, it is pivotal in multi-robot systems that are deployed in the real-world and must readily adapt to inevitable changes. As such, multi-robot policies must remain robust to team changes -- an ability we call adaptive teaming. In this work, we investigate if awareness and communication of robot capabilities can provide such generalization by conducting detailed experiments involving an established multi-robot test bed. We demonstrate that shared decentralized policies, that enable robots to be both aware of and communicate their capabilities, can achieve adaptive teaming by implicitly capturing the fundamental relationship between collective capabilities and effective coordination. Videos of trained policies can be viewed at: https://sites.google.com/view/cap-comm

Emergent Communication Protocol Learning for Task Offloading in Industrial Internet of Things

Authors:Salwa Mostafa, Mateus P. Mota, Alvaro Valcarce, Mehdi Bennis
Date:2024-01-23 17:06:13

In this paper, we leverage a multi-agent reinforcement learning (MARL) framework to jointly learn a computation offloading decision and multichannel access policy with corresponding signaling. Specifically, the base station and industrial Internet of Things mobile devices are reinforcement learning agents that need to cooperate to execute their computation tasks within a deadline constraint. We adopt an emergent communication protocol learning framework to solve this problem. The numerical results illustrate the effectiveness of emergent communication in improving the channel access success rate and the number of successfully computed tasks compared to contention-based, contention-free, and no-communication approaches. Moreover, the proposed task offloading policy outperforms remote and local computation baselines.

Learning Mean Field Games on Sparse Graphs: A Hybrid Graphex Approach

Authors:Christian Fabian, Kai Cui, Heinz Koeppl
Date:2024-01-23 11:52:00

Learning the behavior of large agent populations is an important task for numerous research areas. Although the field of multi-agent reinforcement learning (MARL) has made significant progress towards solving these systems, solutions for many agents often remain computationally infeasible and lack theoretical guarantees. Mean Field Games (MFGs) address both of these issues and can be extended to Graphon MFGs (GMFGs) to include network structures between agents. Despite their merits, the real-world applicability of GMFGs is limited by the fact that graphons only capture dense graphs. Since most empirically observed networks show some degree of sparsity, such as power law graphs, the GMFG framework is insufficient for capturing these network topologies. Thus, we introduce the novel concept of Graphex MFGs (GXMFGs) which builds on the graph theoretical concept of graphexes. Graphexes are the limiting objects to sparse graph sequences that also have other desirable features such as the small world property. Learning equilibria in these games is challenging due to the rich and sparse structure of the underlying graphs. To tackle these challenges, we design a new learning algorithm tailored to the GXMFG setup. This hybrid graphex learning approach leverages that the system mainly consists of a highly connected core and a sparse periphery. After defining the system and providing a theoretical analysis, we state our learning approach and demonstrate its learning capabilities on both synthetic graphs and real-world networks. This comparison shows that our GXMFG learning algorithm successfully extends MFGs to a highly relevant class of hard, realistic learning problems that are not accurately addressed by current MARL and MFG methods.

Backpropagation Through Agents

Authors:Zhiyuan Li, Wenshuai Zhao, Lijun Wu, Joni Pajarinen
Date:2024-01-23 09:07:31

A fundamental challenge in multi-agent reinforcement learning (MARL) is to learn the joint policy in an extremely large search space, which grows exponentially with the number of agents. Moreover, fully decentralized policy factorization significantly restricts the search space, which may lead to sub-optimal policies. In contrast, the auto-regressive joint policy can represent a much richer class of joint policies by factorizing the joint policy into the product of a series of conditional individual policies. While such factorization introduces the action dependency among agents explicitly in sequential execution, it does not take full advantage of the dependency during learning. In particular, the subsequent agents do not give the preceding agents feedback about their decisions. In this paper, we propose a new framework, Back-Propagation Through Agents (BPTA), that directly accounts for both agents' own policy updates and the learning of their dependent counterparts. This is achieved by propagating the feedback through action chains. With the proposed framework, our Bidirectional Proximal Policy Optimisation (BPPO) outperforms state-of-the-art methods. Extensive experiments on matrix games, StarCraft II v2, Multi-agent MuJoCo, and Google Research Football demonstrate the effectiveness of the proposed method.
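
The auto-regressive factorization itself is simple: the joint policy is written as $\pi(a \mid s) = \prod_i \pi_i(a_i \mid s, a_{<i})$, so agents act in a fixed order, each conditioning on the actions already chosen. The sketch below shows only this forward sampling pass; BPTA's contribution, the backward feedback through the action chain, is not reproduced here, and the per-agent policy interface is an assumption.

    # Hedged sketch of sampling from an auto-regressive joint policy.
    def sample_joint_action(policies, state):
        """Sequentially sample one action per agent, conditioning on earlier ones."""
        chosen = []
        for policy in policies:                  # fixed execution order
            a_i = policy(state, tuple(chosen))   # pi_i(a_i | s, a_<i)
            chosen.append(a_i)
        return chosen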

Emergent Dominance Hierarchies in Reinforcement Learning Agents

Authors:Ram Rachum, Yonatan Nakar, Bill Tomlinson, Nitay Alon, Reuth Mirsky
Date:2024-01-21 16:59:45

Modern Reinforcement Learning (RL) algorithms are able to outperform humans in a wide variety of tasks. Multi-agent reinforcement learning (MARL) settings present additional challenges, and successful cooperation in mixed-motive groups of agents depends on a delicate balancing act between individual and group objectives. Social conventions and norms, often inspired by human institutions, are used as tools for striking this balance. In this paper, we examine a fundamental, well-studied social convention that underlies cooperation in both animal and human societies: dominance hierarchies. We adapt the ethological theory of dominance hierarchies to artificial agents, borrowing the established terminology and definitions with as few amendments as possible. We demonstrate that populations of RL agents, operating without explicit programming or intrinsic rewards, can invent, learn, enforce, and transmit a dominance hierarchy to new populations. The dominance hierarchies that emerge have a similar structure to those studied in chickens, mice, fish, and other species.

Measuring Policy Distance for Multi-Agent Reinforcement Learning

Authors:Tianyi Hu, Zhiqiang Pu, Xiaolin Ai, Tenghai Qiu, Jianqiang Yi
Date:2024-01-20 15:34:51

Diversity plays a crucial role in improving the performance of multi-agent reinforcement learning (MARL). Currently, many diversity-based methods have been developed to overcome the drawbacks of excessive parameter sharing in traditional MARL. However, there remains a lack of a general metric to quantify policy differences among agents. Such a metric would not only facilitate the evaluation of diversity evolution in multi-agent systems, but also provide guidance for the design of diversity-based MARL algorithms. In this paper, we propose the multi-agent policy distance (MAPD), a general tool for measuring policy differences in MARL. By learning conditional representations of agents' decisions, MAPD can compute the policy distance between any pair of agents. Furthermore, we extend MAPD to a customizable version, which can quantify differences among agent policies on specified aspects. Based on an online deployment of MAPD, we design a multi-agent dynamic parameter sharing (MADPS) algorithm as an example of MAPD's applications. Extensive experiments demonstrate that our method is effective in measuring differences in agent policies and specific behavioral tendencies. Moreover, in comparison to other methods of parameter sharing, MADPS exhibits superior performance.
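
As a simpler stand-in for MAPD's learned conditional representations, a policy distance can be illustrated by averaging a divergence between two agents' action distributions over sampled states; the Jensen-Shannon form below (requiring SciPy) is one such hedged example.

    # Hedged sketch: mean Jensen-Shannon distance between two policies.
    import numpy as np
    from scipy.spatial.distance import jensenshannon

    def policy_distance(policy_a, policy_b, states):
        """Average JS distance between the agents' action distributions."""
        return float(np.mean([jensenshannon(policy_a(s), policy_b(s))
                              for s in states]))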

The Synergy Between Optimal Transport Theory and Multi-Agent Reinforcement Learning

Authors:Ali Baheri, Mykel J. Kochenderfer
Date:2024-01-18 19:34:46

This paper explores the integration of optimal transport (OT) theory with multi-agent reinforcement learning (MARL). This integration uses OT to handle distributions and transportation problems to enhance the efficiency, coordination, and adaptability of MARL. There are five key areas where OT can impact MARL: (1) policy alignment, where OT's Wasserstein metric is used to align divergent agent strategies towards unified goals; (2) distributed resource management, employing OT to optimize resource allocation among agents; (3) addressing non-stationarity, using OT to adapt to dynamic environmental shifts; (4) scalable multi-agent learning, harnessing OT for decomposing large-scale learning objectives into manageable tasks; and (5) enhancing energy efficiency, applying OT principles to develop sustainable MARL systems. This paper articulates how the synergy between OT and MARL can address scalability issues, optimize resource distribution, align agent policies in cooperative environments, and ensure adaptability in dynamically changing conditions.

Model-Assisted Learning for Adaptive Cooperative Perception of Connected Autonomous Vehicles

Authors:Kaige Qu, Weihua Zhuang, Qiang Ye, Wen Wu, Xuemin Shen
Date:2024-01-18 17:30:25

Cooperative perception (CP) is a key technology to facilitate consistent and accurate situational awareness for connected and autonomous vehicles (CAVs). To tackle the network resource inefficiency issue in traditional broadcast-based CP, unicast-based CP has been proposed to associate CAV pairs for cooperative perception via vehicle-to-vehicle transmission. In this paper, we investigate unicast-based CP among CAV pairs. With the consideration of dynamic perception workloads and channel conditions due to vehicle mobility and dynamic radio resource availability, we propose an adaptive cooperative perception scheme for CAV pairs in a mixed-traffic autonomous driving scenario with both CAVs and human-driven vehicles. We aim to determine when to switch between cooperative perception and stand-alone perception for each CAV pair, and allocate communication and computing resources to cooperative CAV pairs for maximizing the computing efficiency gain under perception task delay requirements. A model-assisted multi-agent reinforcement learning (MARL) solution is developed, which integrates MARL for an adaptive CAV cooperation decision and an optimization model for communication and computing resource allocation. Simulation results demonstrate the effectiveness of the proposed scheme in achieving high computing efficiency gain, as compared with benchmark schemes.

Multi-Agent Reinforcement Learning for Maritime Operational Technology Cyber Security

Authors:Alec Wilson, Ryan Menzies, Neela Morarji, David Foster, Marco Casassa Mont, Esin Turkbeyler, Lisa Gralewski
Date:2024-01-18 17:22:22

This paper demonstrates the potential for autonomous cyber defence to be applied on industrial control systems and provides a baseline environment to further explore Multi-Agent Reinforcement Learning's (MARL) application to this problem domain. It introduces a simulation environment, IPMSRL, of a generic Integrated Platform Management System (IPMS) and explores the use of MARL for autonomous cyber defence decision-making on generic maritime based IPMS Operational Technology (OT). OT cyber defensive actions are less mature than they are for Enterprise IT. This is due to the relatively brittle nature of OT infrastructure originating from the use of legacy systems, design-time engineering assumptions, and lack of full-scale modern security controls. There are many obstacles to be tackled across the cyber landscape due to continually increasing cyber-attack sophistication and the limitations of traditional IT-centric cyber defence solutions. Traditional IT controls are rarely deployed on OT infrastructure, and where they are, some threats aren't fully addressed. In our experiments, a shared critic implementation of Multi Agent Proximal Policy Optimisation (MAPPO) outperformed Independent Proximal Policy Optimisation (IPPO). MAPPO reached an optimal policy (episode outcome mean of 1) after 800K timesteps, whereas IPPO was only able to reach an episode outcome mean of 0.966 after one million timesteps. Hyperparameter tuning greatly improved training performance. Across one million timesteps the tuned hyperparameters reached an optimal policy whereas the default hyperparameters only managed to win sporadically, with most simulations resulting in a draw. We tested a real-world constraint, attack detection alert success, and found that when alert success probability is reduced to 0.75 or 0.9, the MARL defenders were still able to win in over 97.5% or 99.5% of episodes, respectively.

UNEX-RL: Reinforcing Long-Term Rewards in Multi-Stage Recommender Systems with UNidirectional EXecution

Authors:Gengrui Zhang, Yao Wang, Xiaoshuang Chen, Hongyi Qian, Kaiqiao Zhan, Ben Wang
Date:2024-01-12 09:32:34

In recent years, there has been a growing interest in utilizing reinforcement learning (RL) to optimize long-term rewards in recommender systems. Since industrial recommender systems are typically designed as multi-stage systems, RL methods with a single agent face challenges when optimizing multiple stages simultaneously. The reason is that different stages have different observation spaces, and thus cannot be modeled by a single agent. To address this issue, we propose a novel UNidirectional-EXecution-based multi-agent Reinforcement Learning (UNEX-RL) framework to reinforce the long-term rewards in multi-stage recommender systems. We show that unidirectional execution is a key feature of multi-stage recommender systems, bringing new challenges to the applications of multi-agent reinforcement learning (MARL), namely the observation dependency and the cascading effect. To tackle these challenges, we provide a cascading information chain (CIC) method to separate the independent observations from action-dependent observations and use CIC to train UNEX-RL effectively. We also discuss practical variance reduction techniques for UNEX-RL. Finally, we show the effectiveness of UNEX-RL on both public datasets and an online recommender system with over 100 million users. Specifically, UNEX-RL achieves a 0.558% increase in users' usage time compared with single-agent RL algorithms in online A/B experiments, highlighting the effectiveness of UNEX-RL in industrial recommender systems.

Confidence-Based Curriculum Learning for Multi-Agent Path Finding

Authors:Thomy Phan, Joseph Driscoll, Justin Romberg, Sven Koenig
Date:2024-01-11 12:11:24

A wide range of real-world applications can be formulated as Multi-Agent Path Finding (MAPF) problem, where the goal is to find collision-free paths for multiple agents with individual start and goal locations. State-of-the-art MAPF solvers are mainly centralized and depend on global information, which limits their scalability and flexibility regarding changes or new maps that would require expensive replanning. Multi-agent reinforcement learning (MARL) offers an alternative way by learning decentralized policies that can generalize over a variety of maps. While there exist some prior works that attempt to connect both areas, the proposed techniques are heavily engineered and very complex due to the integration of many mechanisms that limit generality and are expensive to use. We argue that much simpler and general approaches are needed to bring the areas of MARL and MAPF closer together with significantly lower costs. In this paper, we propose Confidence-based Auto-Curriculum for Team Update Stability (CACTUS) as a lightweight MARL approach to MAPF. CACTUS defines a simple reverse curriculum scheme, where the goal of each agent is randomly placed within an allocation radius around the agent's start location. The allocation radius increases gradually as all agents improve, which is assessed by a confidence-based measure. We evaluate CACTUS in various maps of different sizes, obstacle densities, and numbers of agents. Our experiments demonstrate better performance and generalization capabilities than state-of-the-art MARL approaches with less than 600,000 trainable parameters, which is less than 5% of the neural network size of current MARL approaches to MAPF.
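
The reverse curriculum at the heart of CACTUS can be sketched in a few lines: sample each goal within an allocation radius of the agent's start cell, and widen the radius once a confidence measure clears a threshold. Grid handling and the confidence test below are illustrative assumptions.

    # Hedged sketch of CACTUS-style goal sampling and radius scheduling.
    import random

    def sample_goal(start, radius, grid_w, grid_h):
        """Random cell within `radius` (Chebyshev distance) of the start."""
        x, y = start
        while True:
            gx = random.randint(max(0, x - radius), min(grid_w - 1, x + radius))
            gy = random.randint(max(0, y - radius), min(grid_h - 1, y + radius))
            if (gx, gy) != start:
                return (gx, gy)

    def update_radius(radius, confidence, threshold=0.9, max_radius=50):
        """Widen the curriculum once agents are confidently succeeding."""
        return min(radius + 1, max_radius) if confidence >= threshold else radius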

A Tensor Network Implementation of Multi Agent Reinforcement Learning

Authors:Sunny Howard
Date:2024-01-08 13:50:49

Recently it has been shown that tensor networks (TNs) have the ability to represent the expected return of a single-agent finite Markov decision process (FMDP). The TN represents a distribution model, where all possible trajectories are considered. When extending these ideas to a multi-agent setting, distribution models suffer from the curse of dimensionality: the exponential relation between the number of possible trajectories and the number of agents. The key advantage of using TNs in this setting is that there exists a large number of established optimisation and decomposition techniques specific to TNs that one can apply to ensure the most efficient representation is found. In this report, these methods are used to form a TN that represents the expected return of a multi-agent reinforcement learning (MARL) task. The model is then applied to a 2-agent random walker example, where it is shown that the policy is correctly optimised using a DMRG technique. Finally, I demonstrate the use of an exact decomposition technique, reducing the number of elements in the tensors by 97.5% without any loss of information.

ClusterComm: Discrete Communication in Decentralized MARL using Internal Representation Clustering

Authors:Robert Müller, Hasan Turalic, Thomy Phan, Michael Kölle, Jonas Nüßlein, Claudia Linnhoff-Popien
Date:2024-01-07 14:53:43

In the realm of Multi-Agent Reinforcement Learning (MARL), prevailing approaches exhibit shortcomings in aligning with human learning, robustness, and scalability. Addressing this, we introduce ClusterComm, a fully decentralized MARL framework in which agents communicate discretely without a central control unit. ClusterComm applies Mini-Batch K-Means clustering to the last hidden layer's activations of an agent's policy network, translating them into discrete messages. This approach outperforms no communication, competes favorably with unbounded continuous communication, and hence offers a simple yet effective strategy for enhancing collaborative task-solving in MARL.
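
The discretization step is easy to picture: cluster the policy network's last hidden-layer activations with Mini-Batch K-Means and transmit the cluster index as the message. The dimensions and scikit-learn usage below are an illustrative sketch, not the authors' code.

    # Hedged sketch of ClusterComm-style message discretization (scikit-learn).
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    hidden_dim, n_messages = 64, 16
    quantizer = MiniBatchKMeans(n_clusters=n_messages, random_state=0)

    # Fit on a buffer of activations gathered during training (stand-in data).
    activation_buffer = np.random.randn(1024, hidden_dim)
    quantizer.fit(activation_buffer)

    # At execution time, an agent's activation becomes a discrete symbol.
    activation = np.random.randn(1, hidden_dim)
    message = int(quantizer.predict(activation)[0])   # in {0, ..., 15}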

Geometric Structure and Polynomial-time Algorithm of Game Equilibria

Authors:Hongbo Sun, Chongkun Xia, Junbo Tan, Bo Yuan, Xueqian Wang, Bin Liang
Date:2024-01-01 13:06:57

Whether a PTAS (polynomial-time approximation scheme) exists for game equilibria has been an open question, and its absence has indications and consequences in three fields: the practicality of methods in algorithmic game theory, non-stationarity and the curse of multiagency in MARL (multi-agent reinforcement learning), and the tractability of PPAD in computational complexity theory. In this paper, we formalize the game equilibrium problem as an optimization problem that splits into two subproblems with respect to policy and value function, which are solved respectively by the interior point method and dynamic programming. Combining these two parts, we obtain an FPTAS (fully PTAS) for the weak approximation (approximating to an $\epsilon$-equilibrium) of any perfect equilibrium of any dynamic game, implying PPAD=FP since the weak approximation problem is PPAD-complete. In addition, we introduce a geometric object called the equilibrium bundle, regarding which, first, perfect equilibria of dynamic games are formalized as zero points of its canonical section; second, the hybrid iteration of dynamic programming and the interior point method is formalized as a line search on it; third, it derives the existence and oddness theorems as an extension of those of Nash equilibria. In experiments, the line search process is animated, and the method is tested on 2000 randomly generated dynamic games, where it converges to a perfect equilibrium in every single case.

Leveraging Partial Symmetry for Multi-Agent Reinforcement Learning

Authors:Xin Yu, Rongye Shi, Pu Feng, Yongkai Tian, Simin Li, Shuhao Liao, Wenjun Wu
Date:2023-12-30 08:13:44

Incorporating symmetry as an inductive bias into multi-agent reinforcement learning (MARL) has led to improvements in generalization, data efficiency, and physical consistency. While prior research has succeeded in using perfect symmetry priors, the realm of partial symmetry in the multi-agent domain remains unexplored. To fill this gap, we introduce the partially symmetric Markov game, a new subclass of the Markov game. We then theoretically show that the performance error introduced by utilizing symmetry in MARL is bounded, implying that the symmetry prior can still be useful even under partial symmetry. Motivated by this insight, we propose the Partial Symmetry Exploitation (PSE) framework, which adaptively incorporates the symmetry prior in MARL under different symmetry-breaking conditions. Specifically, by adaptively adjusting the exploitation of symmetry, our framework achieves superior sample efficiency and overall performance of MARL algorithms. Extensive experiments are conducted to demonstrate the superior performance of the proposed framework over baselines. Finally, we implement the proposed framework in a real-world multi-robot testbed to show its superiority.

Age-of-Information in UAV-assisted Networks: a Decentralized Multi-Agent Optimization

Authors:Mouhamed Naby Ndiaye, El Houcine Bergou, Hajar El Hammouti
Date:2023-12-25 17:28:56

Unmanned aerial vehicles (UAVs) are a highly promising technology with diverse applications in wireless networks. One of their primary uses is the collection of time-sensitive data from Internet of Things (IoT) devices. In UAV-assisted networks, the Age-of-Information (AoI) serves as a fundamental metric for quantifying data timeliness and freshness. In this work, we are interested in a generalized AoI formulation, where each packet's age is weighted based on its generation time. Our objective is to find the optimal UAVs' trajectories and the subsets of selected devices such that the weighted AoI is minimized. To address this challenge, we formulate the problem as a Mixed-Integer Nonlinear Program (MINLP), incorporating time and quality-of-service constraints. To efficiently tackle this complex problem and minimize communication overhead among UAVs, we propose a distributed approach that enables drones to make independent decisions based on locally acquired data. Specifically, we reformulate our problem so that the objective function decomposes into individual rewards. The reformulated problem is solved using a distributed implementation of Multi-Agent Reinforcement Learning (MARL). Our empirical results show that the proposed decentralized approach achieves results nearly equivalent to those of a centralized implementation, with a notable reduction in communication overhead.
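
As a toy illustration of the generalized metric (the paper's exact weighting may differ), a device's weighted AoI can be computed from the generation timestamps of its received packets:

```python
def weighted_aoi(now, generation_times, weight):
    """Generalized Age-of-Information: each packet's age (now - t_gen)
    is scaled by a weight that depends on its generation time."""
    return sum(weight(t) * (now - t) for t in generation_times)

# Example: later-generated packets are weighted more heavily, so letting
# recent data go stale is penalized hardest.
aoi = weighted_aoi(10.0, [2.0, 6.0, 9.0], weight=lambda t: 1.0 + 0.1 * t)
print(aoi)  # 1.2*8 + 1.6*4 + 1.9*1 = 17.9
```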

Context-aware Communication for Multi-agent Reinforcement Learning

Authors:Xinran Li, Jun Zhang
Date:2023-12-25 03:33:08

Effective communication protocols in multi-agent reinforcement learning (MARL) are critical to fostering cooperation and enhancing team performance. To leverage communication, many previous works have proposed to compress local information into a single message and broadcast it to all reachable agents. This simplistic messaging mechanism, however, may fail to provide adequate, critical, and relevant information to individual agents, especially in severely bandwidth-limited scenarios. This motivates us to develop context-aware communication schemes for MARL, aiming to deliver personalized messages to different agents. Our communication protocol, named CACOM, consists of two stages. In the first stage, agents exchange coarse representations in a broadcast fashion, providing context for the second stage. Following this, agents utilize attention mechanisms in the second stage to selectively generate messages personalized for the receivers. Furthermore, we employ the learned step size quantization (LSQ) technique for message quantization to reduce the communication overhead. To evaluate the effectiveness of CACOM, we integrate it with both actor-critic and value-based MARL algorithms. Empirical results on cooperative benchmark tasks demonstrate that CACOM provides clear performance gains over baselines under communication-constrained scenarios. The code is publicly available at https://github.com/LXXXXR/CACOM.
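
A minimal PyTorch sketch of the two-stage pattern; the dimensions and module structure here are illustrative assumptions, not CACOM's published architecture, and the LSQ quantization step is omitted:

```python
import torch
import torch.nn as nn

class TwoStageComm(nn.Module):
    def __init__(self, obs_dim, ctx_dim=8, msg_dim=16):
        super().__init__()
        self.to_ctx = nn.Linear(obs_dim, ctx_dim)   # stage 1: coarse broadcast
        self.attn = nn.MultiheadAttention(ctx_dim, num_heads=1, batch_first=True)
        self.to_msg = nn.Linear(ctx_dim, msg_dim)   # stage 2: personalized message

    def forward(self, obs):                    # obs: (n_agents, obs_dim)
        ctx = self.to_ctx(obs).unsqueeze(0)    # (1, n_agents, ctx_dim)
        # Each sender (query) attends over the receivers' broadcast contexts,
        # tailoring what it sends to who will read it.
        attended, _ = self.attn(ctx, ctx, ctx)
        return self.to_msg(attended.squeeze(0))  # one message row per agent

msgs = TwoStageComm(obs_dim=12)(torch.randn(4, 12))  # shape (4, 16)
```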

CARSS: Cooperative Attention-guided Reinforcement Subpath Synthesis for Solving Traveling Salesman Problem

Authors:Yuchen Shi, Congying Han, Tiande Guo
Date:2023-12-24 05:25:43

This paper introduces CARSS (Cooperative Attention-guided Reinforcement Subpath Synthesis), a novel approach to address the Traveling Salesman Problem (TSP) by leveraging cooperative Multi-Agent Reinforcement Learning (MARL). CARSS decomposes the TSP solving process into two distinct yet synergistic steps: "subpath generation" and "subpath merging." In the former, a cooperative MARL framework is employed to iteratively generate subpaths using multiple agents. In the latter, these subpaths are progressively merged to form a complete cycle. The algorithm's primary objective is to enhance efficiency in terms of training memory consumption, testing time, and scalability, through the adoption of a multi-agent divide and conquer paradigm. Notably, attention mechanisms play a pivotal role in feature embedding and parameterization strategies within CARSS. The training of the model is facilitated by the independent REINFORCE algorithm. Empirical experiments reveal CARSS's superiority compared to single-agent alternatives: it demonstrates reduced GPU memory utilization, accommodates training graphs nearly 2.5 times larger, and exhibits the potential for scaling to even more extensive problem sizes. Furthermore, CARSS substantially reduces testing time and optimization gaps by approximately 50% for TSP instances of up to 1000 vertices, when compared to standard decoding methods.

Dynamic Routing for Integrated Satellite-Terrestrial Networks: A Constrained Multi-Agent Reinforcement Learning Approach

Authors:Yifeng Lyu, Han Hu, Rongfei Fan, Zhi Liu, Jianping An, Shiwen Mao
Date:2023-12-23 03:36:35

The integrated satellite-terrestrial network (ISTN) system has experienced significant growth, offering seamless communication services in remote areas with limited terrestrial infrastructure. However, designing a routing scheme for ISTN is exceedingly difficult, primarily due to the heightened complexity resulting from the inclusion of additional ground stations, along with the requirement to satisfy various constraints related to satellite service quality. To address these challenges, we study packet routing with ground stations and satellites working jointly to transmit packets, while prioritizing fast communication and meeting energy efficiency and packet loss requirements. Specifically, we formulate the problem of packet routing with constraints as a max-min problem using the Lagrange method. Then we propose a novel constrained Multi-Agent Reinforcement Learning (MARL) dynamic routing algorithm named CMADR, which efficiently balances objective improvement and constraint satisfaction during the updating of policy and Lagrange multipliers. Finally, we conduct extensive experiments and an ablation study using the OneWeb and Telesat mega-constellations. Results demonstrate that CMADR reduces packet delay by at least 21% and 15% (on OneWeb and Telesat, respectively), while meeting stringent energy consumption and packet loss rate constraints, outperforming several baseline algorithms.
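
A generic sketch of the primal-dual pattern behind such constrained updates (not CMADR's exact rule): the Lagrange multiplier of a constraint E[cost] <= limit rises while the constraint is violated, shrinks otherwise, and is kept nonnegative:

```python
def dual_update(lmbda, measured_cost, limit, lr=0.01):
    """One Lagrange-multiplier ascent step for a constraint E[cost] <= limit."""
    return max(0.0, lmbda + lr * (measured_cost - limit))

# e.g. an energy constraint: measured 1.3 J/packet against a 1.0 J budget,
# so the multiplier (and hence the penalty in the policy objective) grows.
lmbda = dual_update(lmbda=0.5, measured_cost=1.3, limit=1.0)
```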

First-principle-like reinforcement learning of nonlinear numerical schemes for conservation laws

Authors:Hao-Chen Wang, Meilin Yu, Heng Xiao
Date:2023-12-20 18:38:43

In this study, we present a universal nonlinear numerical scheme design method enabled by multi-agent reinforcement learning (MARL). Unlike contemporary supervised-learning-based and reinforcement-learning-based approaches, the MARL-based method developed here uses no reference data or special numerical treatments; instead, a first-principle-like approach built on fundamental computational fluid dynamics (CFD) principles, including total variation diminishing (TVD) and $k$-exact reconstruction, is used to design nonlinear numerical schemes. The third-order finite volume scheme is employed as the workhorse to test the performance of the MARL-based design method. Numerical results demonstrate that the new MARL-based method strikes a balance between accuracy and numerical dissipation in nonlinear numerical scheme design, and outperforms the third-order MUSCL (Monotonic Upstream-centered Scheme for Conservation Laws) with the van Albada limiter for shock capturing. Furthermore, we demonstrate for the first time that a numerical scheme trained on one-dimensional (1D) Burgers' equation simulations can be directly used for numerical simulations of both 1D and 2D (two-dimensional constructions using the tensor product operation) Euler equations. The MARL-based design environment can, in general, incorporate all types of numerical schemes as simulation machines.

Safe Multi-Agent Reinforcement Learning for Behavior-Based Cooperative Navigation

Authors:Murad Dawood, Sicong Pan, Nils Dengler, Siqi Zhou, Angela P. Schoellig, Maren Bennewitz
Date:2023-12-20 09:23:58

In this paper, we address the problem of behavior-based cooperative navigation of mobile robots using safe multi-agent reinforcement learning (MARL). Our work is the first to focus on cooperative navigation without individual reference targets for the robots, using a single target for the formation's centroid. This eliminates the complexities involved in having several path planners to control a team of robots. To ensure safety, our MARL framework uses model predictive control (MPC) to prevent actions that could lead to collisions during training and execution. We demonstrate the effectiveness of our method in simulation and on real robots, achieving safe behavior-based cooperative navigation without individual reference targets, with zero collisions, and faster target reaching compared to baselines. Finally, we study the impact of MPC safety filters on the learning process, revealing faster convergence during training, and show that our approach can be safely deployed on real robots, even during early stages of training.
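
A minimal stand-in for the safety-filter idea, with a one-step look-ahead only (the actual filter solves a multi-step MPC problem; all names and numbers here are illustrative): scale the commanded velocity down until the predicted next position stays clear of teammates.

```python
import numpy as np

def safety_filter(action, pos, teammates, r_min=0.5, dt=0.1):
    """Return the largest scaling of `action` whose one-step prediction
    keeps at least r_min distance to every teammate; stop otherwise."""
    for scale in np.linspace(1.0, 0.0, 11):
        nxt = pos + scale * action * dt
        if all(np.linalg.norm(nxt - t) >= r_min for t in teammates):
            return scale * action
    return np.zeros_like(action)

# The full command would pass within 0.4 m of a teammate; the filter halves it.
a = safety_filter(np.array([2.0, 0.0]), np.zeros(2), [np.array([0.6, 0.0])])
```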

Cautiously-Optimistic Knowledge Sharing for Cooperative Multi-Agent Reinforcement Learning

Authors:Yanwen Ba, Xuan Liu, Xinning Chen, Hao Wang, Yang Xu, Kenli Li, Shigeng Zhang
Date:2023-12-19 12:17:55

While decentralized training is attractive in multi-agent reinforcement learning (MARL) for its excellent scalability and robustness, its inherent coordination challenges in collaborative tasks result in numerous interactions for agents to learn good policies. To alleviate this problem, action advising methods make experienced agents share their knowledge about what to do, while less experienced agents strictly follow the received advice. However, this method of sharing and utilizing knowledge may hinder the team's exploration of better states, as agents can be unduly influenced by suboptimal or even adverse advice, especially in the early stages of learning. Inspired by the fact that humans can learn not only from the success but also from the failure of others, this paper proposes a novel knowledge sharing framework called Cautiously-Optimistic kNowledge Sharing (CONS). CONS enables each agent to share both positive and negative knowledge and cautiously assimilate knowledge from others, thereby enhancing the efficiency of early-stage exploration and the agents' robustness to adverse advice. Moreover, considering the continuous improvement of policies, agents value negative knowledge more in the early stages of learning and shift their focus to positive knowledge in the later stages. Our framework can be easily integrated into existing Q-learning based methods without introducing additional training costs. We evaluate CONS in several challenging multi-agent tasks and find it excels in environments where optimal behavioral patterns are difficult to discover, surpassing the baselines in terms of convergence rate and final performance.

Multi-agent reinforcement learning using echo-state network and its application to pedestrian dynamics

Authors:Hisato Komatsu
Date:2023-12-19 04:02:50

In recent years, simulations of pedestrians using multi-agent reinforcement learning (MARL) have been studied. This study considered roads in a grid-world environment and implemented pedestrians as MARL agents using an echo-state network and the least-squares policy iteration method. In this environment, we investigated the agents' ability to learn to move forward while avoiding other agents. Specifically, we considered two types of tasks: the choice between a narrow direct route and a broad detour, and bidirectional pedestrian flow in a corridor. The simulation results indicated that learning was successful when the density of agents was not too high.
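
A minimal echo-state reservoir sketch (sizes and spectral radius are illustrative assumptions): the input and recurrent weights stay fixed, and only a linear readout on the reservoir state would be trained, e.g. by least-squares policy iteration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_res = 8, 100                               # hypothetical sizes
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))       # fixed input weights
W = rng.normal(size=(n_res, n_res))                # fixed recurrent weights
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))    # spectral radius < 1

def step(state, obs):
    """One reservoir update with tanh dynamics; the policy would read a
    linear function of the returned state."""
    return np.tanh(W @ state + W_in @ obs)

x = step(np.zeros(n_res), rng.normal(size=n_in))
```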

Multi-Agent Reinforcement Learning for Connected and Automated Vehicles Control: Recent Advancements and Future Prospects

Authors:Min Hua, Dong Chen, Xinda Qi, Kun Jiang, Zemin Eitan Liu, Quan Zhou, Hongming Xu
Date:2023-12-18 10:23:50

Connected and automated vehicles (CAVs) are considered a potential solution for future transportation challenges, aiming to develop systems that are efficient, safe, and environmentally friendly. However, CAV control presents significant challenges due to the complexity of interconnectivity and coordination required among vehicles. Multi-agent reinforcement learning (MARL), which has shown notable advancements in addressing complex problems in autonomous driving, robotics, and human-vehicle interaction, emerges as a promising tool to enhance CAV capabilities. Despite its potential, there is a notable absence of current reviews on mainstream MARL algorithms for CAVs. To fill this gap, this paper offers a comprehensive review of MARL's application in CAV control. The paper begins with an introduction to MARL, explaining its unique advantages in handling complex and multi-agent scenarios. It then presents a detailed survey of MARL applications across various control dimensions for CAVs, including critical scenarios such as platooning control, lane-changing, and unsignalized intersections. Additionally, the paper reviews prominent simulation platforms essential for developing and testing MARL algorithms. Lastly, it examines the current challenges in deploying MARL for CAV control, including macro-micro optimization, communication, mixed traffic, and sim-to-real challenges. Potential solutions discussed include hierarchical MARL, decentralized MARL, adaptive interactions, and offline MARL.

Robust Communicative Multi-Agent Reinforcement Learning with Active Defense

Authors:Lebin Yu, Yunbo Qiu, Quanming Yao, Yuan Shen, Xudong Zhang, Jian Wang
Date:2023-12-16 09:02:56

Communication in multi-agent reinforcement learning (MARL) has recently been proven to effectively promote cooperation among agents. Since communication in real-world scenarios is vulnerable to noise and adversarial attacks, it is crucial to develop robust communicative MARL techniques. However, existing research in this domain has predominantly focused on passive defense strategies, where agents receive all messages equally, making it hard to balance performance and robustness. We propose an active defense strategy, where agents automatically reduce the impact of potentially harmful messages on the final decision. Implementing this strategy raises two challenges: defining unreliable messages and properly adjusting their impact on the final decision. To address them, we design an Active Defense Multi-Agent Communication framework (ADMAC), which estimates the reliability of received messages and adjusts their impact on the final decision accordingly with the help of a decomposable decision structure. The superiority of ADMAC over existing methods is validated by experiments in three communication-critical tasks under four types of attacks.
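
The aggregation side of active defense can be sketched as reliability-weighted message pooling; this is a simplification, and the reliability estimator plus the decomposable decision structure are the learned components ADMAC actually contributes (not shown here):

```python
import numpy as np

def aggregate(messages, reliabilities, eps=1e-8):
    """Down-weight potentially harmful messages by an estimated reliability
    in [0, 1], instead of receiving all messages equally."""
    r = np.asarray(reliabilities, dtype=float)
    w = r / (r.sum() + eps)
    return w @ np.asarray(messages)        # weighted sum over senders

pooled = aggregate(messages=np.random.randn(3, 16),
                   reliabilities=[0.9, 0.8, 0.05])   # third sender ~ignored
```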

Multi-agent Reinforcement Learning: A Comprehensive Survey

Authors:Dom Huh, Prasant Mohapatra
Date:2023-12-15 23:16:54

Multi-agent systems (MAS) are widely prevalent and crucially important in numerous real-world applications, where multiple agents must make decisions to achieve their objectives in a shared environment. Despite their ubiquity, the development of intelligent decision-making agents in MAS poses several open challenges. This survey examines these challenges, placing an emphasis on seminal concepts from game theory (GT) and machine learning (ML) and connecting them to recent advancements in multi-agent reinforcement learning (MARL), i.e., the study of data-driven decision-making within MAS. The objective of this survey is to provide a comprehensive perspective along the various dimensions of MARL, shedding light on the unique opportunities presented in MARL applications while highlighting the inherent challenges that accompany this potential. We hope that our work will not only contribute to the field by analyzing the current landscape of MARL but also motivate future directions with insights for deeper integration of concepts from the related domains of GT and ML. With this in mind, this work delves into a detailed exploration of recent and past efforts in MARL and its related fields, and describes prior solutions, their limitations, and their applications.

Communication-Efficient Soft Actor-Critic Policy Collaboration via Regulated Segment Mixture

Authors:Xiaoxue Yu, Rongpeng Li, Chengchao Liang, Zhifeng Zhao
Date:2023-12-15 13:39:55

Multi-Agent Reinforcement Learning (MARL) has emerged as a foundational approach for addressing diverse, intelligent control tasks in various scenarios like the Internet of Vehicles, Internet of Things, and Unmanned Aerial Vehicles. However, the widely assumed existence of a central node for centralized, federated learning-assisted MARL might be impractical in highly dynamic environments. This can lead to excessive communication overhead, potentially overwhelming the system. To address these challenges, we design a novel communication-efficient, fully distributed algorithm for collaborative MARL under the frameworks of Soft Actor-Critic (SAC) and Decentralized Federated Learning (DFL), named RSM-MASAC. In particular, RSM-MASAC enhances multi-agent collaboration and prioritizes higher communication efficiency in dynamic systems by incorporating the concept of segmented aggregation in DFL and augmenting multiple model replicas from received neighboring policy segments, which are subsequently employed as reconstructed referential policies for mixing. Distinctively diverging from traditional RL approaches, RSM-MASAC introduces new bounds under the framework of Maximum Entropy Reinforcement Learning (MERL). Correspondingly, it adopts a theory-guided mixture metric to regulate the selection of contributive referential policies, thus guaranteeing soft policy improvement during the communication-assisted mixing phase. Finally, the extensive simulations in mixed-autonomy traffic control scenarios verify the effectiveness and superiority of our algorithm.

Situation-Dependent Causal Influence-Based Cooperative Multi-agent Reinforcement Learning

Authors:Xiao Du, Yutong Ye, Pengyu Zhang, Yaning Yang, Mingsong Chen, Ting Wang
Date:2023-12-15 05:09:32

Learning to collaborate has witnessed significant progress in multi-agent reinforcement learning (MARL). However, promoting coordination among agents and enhancing exploration capabilities remain challenges. In multi-agent environments, interactions between agents are limited to specific situations. Effective collaboration between agents thus requires a nuanced understanding of when and how agents' actions influence others. To this end, we propose a novel MARL algorithm named Situation-Dependent Causal Influence-Based Cooperative Multi-agent Reinforcement Learning (SCIC), which incorporates an intrinsic reward mechanism based on a new cooperation criterion measured by situation-dependent causal influence among agents. Our approach detects inter-agent causal influences in specific situations based on this criterion, using causal intervention and conditional mutual information. This effectively assists agents in exploring states that can positively impact other agents, thus promoting cooperation between agents. The resulting update links coordinated exploration and intrinsic reward distribution, which enhances overall collaboration and performance. Experimental results on various MARL benchmarks demonstrate the superiority of our method compared to state-of-the-art approaches.

From Centralized to Self-Supervised: Pursuing Realistic Multi-Agent Reinforcement Learning

Authors:Violet Xiang, Logan Cross, Jan-Philipp Fränken, Nick Haber
Date:2023-12-14 05:32:55

In real-world environments, autonomous agents rely on their egocentric observations. They must learn adaptive strategies to interact with others who possess mixed motivations, discernible only through visible cues. Several Multi-Agent Reinforcement Learning (MARL) methods adopt centralized approaches that involve either centralized training or reward-sharing, often violating the realistic ways in which living organisms, like animals or humans, process information and interact. MARL strategies deploying decentralized training with intrinsic motivation offer a self-supervised approach, enabling agents to develop flexible social strategies through the interaction of autonomous agents. However, by contrasting the self-supervised and centralized methods, we reveal that populations trained with reward-sharing methods surpass those using self-supervised methods in a mixed-motive environment. We link this superiority to specialized role emergence and an agent's expertise in its role. Interestingly, this gap shrinks in pure-motive settings, emphasizing the need for evaluations in more complex, realistic (mixed-motive) environments. Our preliminary results suggest a gap in population performance that can be closed by improving self-supervised methods, thereby pushing MARL closer to real-world readiness.

On Diagnostics for Understanding Agent Training Behaviour in Cooperative MARL

Authors:Wiem Khlifi, Siddarth Singh, Omayma Mahjoub, Ruan de Kock, Abidine Vall, Rihab Gorsane, Arnu Pretorius
Date:2023-12-13 19:10:10

Cooperative multi-agent reinforcement learning (MARL) has made substantial strides in addressing distributed decision-making challenges. However, as multi-agent systems grow in complexity, gaining a comprehensive understanding of their behaviour becomes increasingly difficult. Conventionally, tracking team rewards over time has served as a pragmatic measure to gauge the effectiveness of agents in learning optimal policies. Nevertheless, we argue that relying solely on empirical returns may obscure crucial insights into agent behaviour. In this paper, we explore the application of explainable AI (XAI) tools to gain profound insights into agent behaviour. We employ these diagnostic tools within the Level-Based Foraging and Multi-Robot Warehouse environments and apply them to a diverse array of MARL algorithms. We demonstrate how our diagnostics can enhance the interpretability and explainability of MARL systems, providing a better understanding of agent behaviour.

Efficiently Quantifying Individual Agent Importance in Cooperative MARL

Authors:Omayma Mahjoub, Ruan de Kock, Siddarth Singh, Wiem Khlifi, Abidine Vall, Kale-ab Tessera, Arnu Pretorius
Date:2023-12-13 19:09:37

Measuring the contribution of individual agents is challenging in cooperative multi-agent reinforcement learning (MARL), where team performance is typically inferred from a single shared global reward. Arguably the best current approach to measuring individual agent contributions is to use Shapley values. However, calculating these values is expensive, as the computational complexity grows exponentially with the number of agents. In this paper, we adapt difference rewards into an efficient method for quantifying the contribution of individual agents, referred to as Agent Importance, offering linear computational complexity in the number of agents. We show empirically that the computed values are strongly correlated with the true Shapley values, as well as with the true underlying individual agent rewards, used as the ground truth in environments where these are available. We demonstrate how Agent Importance can be used to help study MARL systems by diagnosing algorithmic failures discovered in prior MARL benchmarking work. Our analysis illustrates Agent Importance as a valuable explainability component for future MARL benchmarks.
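
The core computation can be sketched in a few lines (the evaluation loop and helper names are hypothetical): one counterfactual evaluation per agent, hence cost linear in the number of agents, unlike the exponential cost of exact Shapley values:

```python
def agent_importance(team_reward, counterfactual_rewards):
    """Difference-reward importance per agent: the global reward minus the
    global reward when that agent's action is replaced by a default/no-op."""
    return [team_reward - r_cf for r_cf in counterfactual_rewards]

# Sketch of the surrounding loop (names hypothetical):
# r_cf = [evaluate(joint_action_with_noop(i)) for i in range(n_agents)]
# importance = agent_importance(r_team, r_cf)
```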

How much can change in a year? Revisiting Evaluation in Multi-Agent Reinforcement Learning

Authors:Siddarth Singh, Omayma Mahjoub, Ruan de Kock, Wiem Khlifi, Abidine Vall, Kale-ab Tessera, Arnu Pretorius
Date:2023-12-13 19:06:34

Establishing sound experimental standards and rigour is important in any growing field of research. Deep Multi-Agent Reinforcement Learning (MARL) is one such nascent field. Although exciting progress has been made, MARL has recently come under scrutiny for replicability issues and a lack of standardised evaluation methodology, specifically in the cooperative setting. Although protocols have been proposed to help alleviate the issue, it remains important to actively monitor the health of the field. In this work, we extend a previously published database of evaluation methodology, containing meta-data on MARL publications from top-rated conferences, and compare the findings extracted from this updated database to the trends identified in the original work. Our analysis shows that many of the worrying trends in performance reporting remain. These include the omission of uncertainty quantification, the failure to report all relevant evaluation details, and a narrowing of algorithmic development classes. Promisingly, we do observe a trend towards more difficult scenarios in SMAC-v1, which, if continued into SMAC-v2, will encourage novel algorithmic development. Our data indicate that replicability needs to be approached more proactively by the MARL community to ensure trust in the field as we move towards exciting new frontiers.

Noise Distribution Decomposition based Multi-Agent Distributional Reinforcement Learning

Authors:Wei Geng, Baidi Xiao, Rongpeng Li, Ning Wei, Dong Wang, Zhifeng Zhao
Date:2023-12-12 07:24:15

Generally, a Reinforcement Learning (RL) agent updates its policy by repeatedly interacting with the environment, contingent on the rewards received for observed states and undertaken actions. However, environmental disturbance, commonly leading to noisy observations (e.g., rewards and states), can significantly shape an agent's performance. Furthermore, the learning performance of Multi-Agent Reinforcement Learning (MARL) is more susceptible to noise due to the interference among intelligent agents. It therefore becomes imperative to redesign MARL so as to mitigate the impact of noisy rewards. In this paper, we propose a novel decomposition-based multi-agent distributional RL method: the globally shared noisy reward is approximated by a Gaussian mixture model (GMM) and decomposed into a combination of individual distributional local rewards, with which each agent can be updated locally through distributional RL. Moreover, a diffusion model (DM) is leveraged for reward generation in order to mitigate the costly interaction expenditure of learning distributions. Furthermore, the optimality of the distribution decomposition is theoretically validated, and the loss function is carefully calibrated to avoid decomposition ambiguity. We also verify the effectiveness of the proposed method through extensive simulation experiments with noisy rewards, and evaluate different risk-sensitive policies to demonstrate the superiority of distributional RL in different MARL tasks.
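
The first step, approximating the noisy shared reward with a Gaussian mixture, can be sketched with scikit-learn on synthetic data; the decomposition into per-agent local rewards and the diffusion-model generation are method-specific and omitted:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for noisy, globally shared rewards collected from rollouts.
noisy_rewards = np.concatenate([rng.normal(1.0, 0.3, 500),
                                rng.normal(3.0, 0.5, 500)]).reshape(-1, 1)

# Fit a two-component GMM to the reward samples; the recovered component
# means/weights would then be apportioned into local reward distributions.
gmm = GaussianMixture(n_components=2, random_state=0).fit(noisy_rewards)
print(gmm.means_.ravel(), gmm.weights_)
```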

Multi-Agent Reinforcement Learning for Multi-Cell Spectrum and Power Allocation

Authors:Yiming Zhang, Dongning Guo
Date:2023-12-10 04:06:45

This paper introduces a novel approach to radio resource allocation in multi-cell wireless networks using a fully scalable multi-agent reinforcement learning (MARL) framework. A distributed method is developed where agents control individual cells and determine spectrum and power allocation based on limited local information, yet achieve quality-of-service (QoS) performance comparable to centralized methods using global information. The objective is to minimize packet delays across devices under stochastic arrivals; the approach applies to both conflict-graph abstractions and cellular network configurations. The problem is formulated as a distributed learning task and solved with a multi-agent proximal policy optimization (MAPPO) algorithm using recurrent neural networks and queueing dynamics. This traffic-driven MARL-based solution enables decentralized training and execution, ensuring scalability to large networks. Extensive simulations demonstrate that the proposed methods achieve QoS performance comparable to genie-aided centralized algorithms with significantly less execution time. The trained policies also exhibit scalability and robustness across various network sizes and traffic conditions.

Privacy Preserving Multi-Agent Reinforcement Learning in Supply Chains

Authors:Ananta Mukherjee, Peeyush Kumar, Boling Yang, Nishanth Chandran, Divya Gupta
Date:2023-12-09 21:25:21

This paper addresses privacy concerns in multi-agent reinforcement learning (MARL), specifically within the context of supply chains where individual strategic data must remain confidential. Organizations within the supply chain are modeled as agents, each seeking to optimize their own objectives while interacting with others. As each organization's strategy is contingent on neighboring strategies, maintaining privacy of state- and action-related information is crucial. To tackle this challenge, we propose a game-theoretic, privacy-preserving mechanism, utilizing a secure multi-party computation (MPC) framework in MARL settings. Our major contribution is the successful implementation of a secure MPC framework, SecFloat on EzPC, to solve this problem. However, simply implementing policy gradient methods such as MADDPG operations using SecFloat, while conceptually feasible, would be programmatically intractable. To overcome this hurdle, we devise a novel approach that breaks down the forward and backward passes of the neural network into elementary operations compatible with SecFloat, creating efficient and secure versions of the MADDPG algorithm. Furthermore, we present a learning mechanism that carries out floating point operations in a privacy-preserving manner, an important feature for successful learning in the MARL framework. Experiments reveal on average 68.19% less supply chain wastage under two-party computation (2PC) than with no data sharing, along with on average 42.27% higher cumulative revenue for each player. This work paves the way for practical, privacy-preserving MARL, promising significant improvements in secure computation within supply chain contexts and beyond.

Attention-Guided Contrastive Role Representations for Multi-Agent Reinforcement Learning

Authors:Zican Hu, Zongzhang Zhang, Huaxiong Li, Chunlin Chen, Hongyu Ding, Zhi Wang
Date:2023-12-08 03:40:38

Real-world multi-agent tasks usually involve dynamic team composition with the emergence of roles, which should also be a key to efficient cooperation in multi-agent reinforcement learning (MARL). Drawing inspiration from the correlation between roles and agent's behavior patterns, we propose a novel framework of **A**ttention-guided **CO**ntrastive **R**ole representation learning for **M**ARL (**ACORM**) to promote behavior heterogeneity, knowledge transfer, and skillful coordination across agents. First, we introduce mutual information maximization to formalize role representation learning, derive a contrastive learning objective, and concisely approximate the distribution of negative pairs. Second, we leverage an attention mechanism to prompt the global state to attend to learned role representations in value decomposition, implicitly guiding agent coordination in a skillful role space to yield more expressive credit assignment. Experiments on challenging StarCraft II micromanagement and Google research football tasks demonstrate the state-of-the-art performance of our method and its advantages over existing approaches. Our code is available at [https://github.com/NJU-RL/ACORM](https://github.com/NJU-RL/ACORM).

A Scalable Network-Aware Multi-Agent Reinforcement Learning Framework for Decentralized Inverter-based Voltage Control

Authors:Han Xu, Jialin Zheng, Guannan Qu
Date:2023-12-07 15:42:53

This paper addresses the challenges associated with decentralized voltage control in power grids due to an increase in distributed generations (DGs). Traditional model-based voltage control methods struggle with the rapid energy fluctuations and uncertainties of these DGs. While multi-agent reinforcement learning (MARL) has shown potential for decentralized secondary control, scalability issues arise when dealing with a large number of DGs. This problem lies in the dominant centralized training and decentralized execution (CTDE) framework, where the critics take global observations and actions. To overcome these challenges, we propose a scalable network-aware (SNA) framework that leverages network structure to truncate the input to the critic's Q-function, thereby improving scalability and reducing communication costs during training. Further, the SNA framework is theoretically grounded with provable approximation guarantee, and it can seamlessly integrate with multiple multi-agent actor-critic algorithms. The proposed SNA framework is successfully demonstrated in a system with 114 DGs, providing a promising solution for decentralized voltage control in increasingly complex power grid systems.

MACCA: Offline Multi-agent Reinforcement Learning with Causal Credit Assignment

Authors:Ziyan Wang, Yali Du, Yudi Zhang, Meng Fang, Biwei Huang
Date:2023-12-06 17:59:34

Offline Multi-agent Reinforcement Learning (MARL) is valuable in scenarios where online interaction is impractical or risky. While independent learning in MARL offers flexibility and scalability, accurately assigning credit to individual agents in offline settings poses challenges because interaction with the environment is prohibited. In this paper, we propose a new framework, Multi-Agent Causal Credit Assignment (MACCA), to address credit assignment in the offline MARL setting. MACCA characterizes the generative process as a Dynamic Bayesian Network, capturing relationships between environmental variables, states, actions, and rewards. Estimating this model on offline data, MACCA learns each agent's contribution by analyzing the causal relationships of their individual rewards, ensuring accurate and interpretable credit assignment. Additionally, the modularity of our approach allows it to integrate seamlessly with various offline MARL methods. Theoretically, we prove that, given the offline dataset, the underlying causal structure and the functions generating agents' individual rewards are identifiable, which lays the foundation for the correctness of our modeling. In our experiments, we demonstrate that MACCA not only outperforms state-of-the-art methods but also enhances performance when integrated with other backbones.

BenchMARL: Benchmarking Multi-Agent Reinforcement Learning

Authors:Matteo Bettini, Amanda Prorok, Vincent Moens
Date:2023-12-03 18:15:58

The field of Multi-Agent Reinforcement Learning (MARL) is currently facing a reproducibility crisis. While solutions for standardized reporting have been proposed to address the issue, we still lack a benchmarking tool that enables standardization and reproducibility, while leveraging cutting-edge Reinforcement Learning (RL) implementations. In this paper, we introduce BenchMARL, the first MARL training library created to enable standardized benchmarking across different algorithms, models, and environments. BenchMARL uses TorchRL as its backend, granting it high performance and maintained state-of-the-art implementations while addressing the broad community of MARL PyTorch users. Its design enables systematic configuration and reporting, thus allowing users to create and run complex benchmarks from simple one-line inputs. BenchMARL is open-sourced on GitHub: https://github.com/facebookresearch/BenchMARL

A Survey of Progress on Cooperative Multi-agent Reinforcement Learning in Open Environment

Authors:Lei Yuan, Ziqian Zhang, Lihe Li, Cong Guan, Yang Yu
Date:2023-12-02 08:04:31

Multi-agent Reinforcement Learning (MARL) has gained wide attention in recent years and has made progress in various fields. Specifically, cooperative MARL focuses on training a team of agents to cooperatively achieve tasks that are difficult for a single agent to handle. It has shown great potential in applications such as path planning, autonomous driving, active voltage control, and dynamic algorithm configuration. One research focus of cooperative MARL is how to improve the coordination efficiency of the system, although this research has mainly been conducted in simple, static, and closed environment settings. To promote the application of artificial intelligence in the real world, some research has begun to explore multi-agent coordination in open environments, making progress on settings where important factors might change. However, this line of work still lacks a comprehensive review. In this paper, starting from the concept of reinforcement learning, we subsequently introduce multi-agent systems (MAS), cooperative MARL, typical methods, and test environments. Then, we summarize the research on cooperative MARL from closed to open environments, extract multiple research directions, and introduce typical works. Finally, we summarize the strengths and weaknesses of the current research, and look forward to future development directions and research problems for cooperative MARL in open environments.

Generalisable Agents for Neural Network Optimisation

Authors:Kale-ab Tessera, Callum Rhys Tilbury, Sasha Abramowitz, Ruan de Kock, Omayma Mahjoub, Benjamin Rosman, Sara Hooker, Arnu Pretorius
Date:2023-11-30 14:45:51

Optimising deep neural networks is a challenging task due to complex training dynamics, high computational requirements, and long training times. To address this difficulty, we propose the framework of Generalisable Agents for Neural Network Optimisation (GANNO) -- a multi-agent reinforcement learning (MARL) approach that learns to improve neural network optimisation by dynamically and responsively scheduling hyperparameters during training. GANNO utilises an agent per layer that observes localised network dynamics and accordingly takes actions to adjust these dynamics at a layerwise level to collectively improve global performance. In this paper, we use GANNO to control the layerwise learning rate and show that the framework can yield useful and responsive schedules that are competitive with handcrafted heuristics. Furthermore, GANNO is shown to perform robustly across a wide variety of unseen initial conditions, and can successfully generalise to harder problems than it was trained on. Our work presents an overview of the opportunities that this paradigm offers for training neural networks, along with key challenges that remain to be overcome.
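
In a PyTorch setting, the actuation step of such layerwise control can be sketched as follows; this assumes one optimizer parameter group per layer, which is an illustrative setup choice rather than a claim about GANNO's implementation:

```python
def apply_layerwise_lr_actions(optimizer, actions):
    """Scale each layer's learning rate by its agent's chosen action,
    e.g. a multiplicative action from {0.5, 1.0, 2.0}."""
    for group, action in zip(optimizer.param_groups, actions):
        group["lr"] *= action

# Usage sketch: build the optimizer with one param group per layer, then
# once per control interval call apply_layerwise_lr_actions(opt, agent_actions).
```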

Minimax Exploiter: A Data Efficient Approach for Competitive Self-Play

Authors:Daniel Bairamian, Philippe Marcotte, Joshua Romoff, Gabriel Robert, Derek Nowrouzezahrai
Date:2023-11-28 19:34:40

Recent advances in Competitive Self-Play (CSP) have achieved, or even surpassed, human-level performance in complex game environments such as Dota 2 and StarCraft II using Distributed Multi-Agent Reinforcement Learning (MARL). One core component of these methods relies on creating a pool of learning agents -- consisting of the Main Agent, past versions of this agent, and Exploiter Agents -- where Exploiter Agents learn counter-strategies to the Main Agents. A key drawback of these approaches is the large computational cost and physical time required to train the system, making them impractical to deploy in highly iterative real-life settings such as video game productions. In this paper, we propose the Minimax Exploiter, a game-theoretic approach to exploiting Main Agents that leverages knowledge of its opponents, leading to significant increases in data efficiency. We validate our approach in a diversity of settings, including simple turn-based games, the Arcade Learning Environment, and For Honor, a modern video game. The Minimax Exploiter consistently outperforms strong baselines, demonstrating improved stability and data efficiency, and leading to a robust CSP-MARL method that is both flexible and easy to deploy.

Overview of the experimental quest for the giant pairing vibration

Authors:M. Assié
Date:2023-11-27 07:30:15

The search for the giant pairing vibration (GPV) has a long-standing history, going back to its prediction in the 1970s. First experimental measurements focused on (p,t) transfer reactions in heavy nuclei and did not show convincing evidence. The discovery of a signal compatible with the GPV in the light carbon isotopes renewed interest in the GPV. It triggered new theoretical models showing that the GPV in heavy nuclei might be too wide or too melted to be observed, and new experiments with radioactive probes based on ($^{6}$He,$^{4}$He) transfer.

Determination of equilibrium parameters of the Marle model for polyatomic gases

Authors:Byung-Hoon Hwang
Date:2023-11-17 03:20:05

The BGK model is a relaxation-time approximation of the celebrated Boltzmann equation, and the Marle model is a direct extension of the BGK model in a relativistic framework. In this paper, we introduce the Marle model for polyatomic gases based on the J\"{u}ttner distribution devised in [Ann. Phys., 377, (2017), 414--445], and show the existence of a unique set of equilibrium parameters of the J\"{u}ttner distribution.

JaxMARL: Multi-Agent RL Environments and Algorithms in JAX

Authors:Alexander Rutherford, Benjamin Ellis, Matteo Gallici, Jonathan Cook, Andrei Lupu, Gardar Ingvarsson, Timon Willi, Ravi Hammond, Akbir Khan, Christian Schroeder de Witt, Alexandra Souly, Saptarashmi Bandyopadhyay, Mikayel Samvelyan, Minqi Jiang, Robert Tjarko Lange, Shimon Whiteson, Bruno Lacerda, Nick Hawes, Tim Rocktaschel, Chris Lu, Jakob Nicolaus Foerster
Date:2023-11-16 18:58:43

Benchmarks are crucial in the development of machine learning algorithms, with available environments significantly influencing reinforcement learning (RL) research. Traditionally, RL environments run on the CPU, which limits their scalability with typical academic compute. However, recent advancements in JAX have enabled the wider use of hardware acceleration, enabling massively parallel RL training pipelines and environments. While this has been successfully applied to single-agent RL, it has not yet been widely adopted for multi-agent scenarios. In this paper, we present JaxMARL, the first open-source, Python-based library that combines GPU-enabled efficiency with support for a large number of commonly used MARL environments and popular baseline algorithms. Our experiments show that, in terms of wall clock time, our JAX-based training pipeline is around 14 times faster than existing approaches, and up to 12500x when multiple training runs are vectorized. This enables efficient and thorough evaluations, potentially alleviating the evaluation crisis in the field. We also introduce and benchmark SMAX, a JAX-based approximate reimplementation of the popular StarCraft Multi-Agent Challenge, which removes the need to run the StarCraft II game engine. This not only enables GPU acceleration, but also provides a more flexible MARL environment, unlocking the potential for self-play, meta-learning, and other future applications in MARL. The code is available at https://github.com/flairox/jaxmarl.
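
Usage follows the familiar reset/step pattern. The condensed example below follows the pattern shown in the repository's README; the exact environment name and API details are assumptions that may differ across versions:

```python
import jax
from jaxmarl import make

key = jax.random.PRNGKey(0)
key, key_reset, key_act, key_step = jax.random.split(key, 4)

env = make("MPE_simple_spread_v3")          # a registered MPE task name
obs, state = env.reset(key_reset)

# Sample one random action per agent, then step the (jittable) environment.
act_keys = jax.random.split(key_act, env.num_agents)
actions = {a: env.action_space(a).sample(act_keys[i])
           for i, a in enumerate(env.agents)}
obs, state, reward, done, infos = env.step(key_step, state, actions)
```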

Enhancing Multi-Agent Coordination through Common Operating Picture Integration

Authors:Peihong Yu, Bhoram Lee, Aswin Raghavan, Supun Samarasekara, Pratap Tokekar, James Zachary Hare
Date:2023-11-08 15:08:55

In multi-agent systems, agents possess only local observations of the environment. Communication between teammates thus becomes crucial for enhancing coordination. Past research has primarily focused on encoding local information into embedding messages that are unintelligible to humans. We find that using these messages in agents' policy learning leads to brittle policies when tested on out-of-distribution initial states. We present an approach to multi-agent coordination where each agent is equipped with the capability to integrate its (history of) observations, actions, and received messages into a Common Operating Picture (COP) and disseminate the COP. This process takes into account the dynamic nature of the environment and the shared mission. We conducted experiments in the StarCraft2 environment to validate our approach. Our results demonstrate the efficacy of COP integration, and show that COP-based training leads to robust policies compared to state-of-the-art Multi-Agent Reinforcement Learning (MARL) methods when faced with out-of-distribution initial states.

Learning Decentralized Traffic Signal Controllers with Multi-Agent Graph Reinforcement Learning

Authors:Yao Zhang, Zhiwen Yu, Jun Zhang, Liang Wang, Tom H. Luan, Bin Guo, Chau Yuen
Date:2023-11-07 06:43:15

This paper considers optimal traffic signal control in smart cities, which has been taken as a complex networked system control problem. Given the interacting dynamics among traffic lights and road networks, attaining controller adaptivity and scalability stands out as a primary challenge. Capturing the spatial-temporal correlation among traffic lights under the framework of Multi-Agent Reinforcement Learning (MARL) is a promising solution. Nevertheless, existing MARL algorithms ignore effective information aggregation, which is fundamental for improving the learning capacity of decentralized agents. In this paper, we design a new decentralized control architecture with improved environmental observability to capture the spatial-temporal correlation. Specifically, we first develop a topology-aware information aggregation strategy to extract correlation-related information from unstructured data gathered in the road network. In particular, we translate the road-network topology into a graph shift operator by forming a diffusion process on the topology, which subsequently facilitates the construction of graph signals. A diffusion convolution module is then developed, forming a new MARL algorithm that endows agents with graph-learning capabilities. Extensive experiments based on both synthetic and real-world datasets verify that our proposal outperforms existing decentralized algorithms.
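
A generic graph-signal sketch of the diffusion idea (not the paper's exact module): the graph shift operator is a row-normalized adjacency, i.e. one step of a diffusion process, and the convolution aggregates several diffusion powers of the node signals:

```python
import numpy as np

def diffusion_conv(adjacency, signals, weights):
    """Aggregate K powers of the diffusion operator applied to node signals;
    weights[k] scales the k-hop diffusion term."""
    P = adjacency / adjacency.sum(axis=1, keepdims=True)   # graph shift operator
    out, shifted = 0.0, signals
    for w in weights:
        out = out + w * shifted
        shifted = P @ shifted
    return out

# Toy 3-node path graph: each node's output mixes 0-, 1-, and 2-hop information.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
y = diffusion_conv(A, signals=np.eye(3), weights=[0.5, 0.3, 0.2])
```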

Kindness in Multi-Agent Reinforcement Learning

Authors:Farinaz Alamiyan-Harandi, Mersad Hassanjani, Pouria Ramazi
Date:2023-11-06 19:53:26

In human societies, people often incorporate fairness in their decisions and reciprocate by being kind to those who act kindly. They evaluate the kindness of others' actions not only by monitoring the outcomes but also by considering the intentions. This behavioral concept can be adapted to train cooperative agents in Multi-Agent Reinforcement Learning (MARL). We propose the KindMARL method, where agents' intentions are measured by counterfactual reasoning over the environmental impact of the actions that were available to them. More specifically, the current environment state is compared with an estimate of what the current state would have been had the agent chosen another action. The difference between each agent's reward, as the outcome of its action, and that of its fellow, multiplied by the intention of the fellow, is then taken as the fellow's "kindness". If the reward comparison confirms the agent's superiority, it perceives the fellow's kindness and reduces its own reward. Experimental results in the Cleanup and Harvest environments show that training based on the KindMARL method enabled the agents to earn 89% (resp. 37%) and 44% (resp. 43%) more total reward than training based on the Inequity Aversion and Social Influence methods. The effectiveness of KindMARL is further supported by experiments in a traffic light control problem.
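
A simplified sketch of the reward bookkeeping described above; the counterfactual intention estimate is the substantive learned component and is not shown, and the exact update in the paper may differ:

```python
def kindness_adjust(r_self, r_fellow, fellow_intention):
    """If the reward comparison favors this agent, credit the fellow's
    kindness (reward gap times the fellow's inferred intention) and give
    up that amount of reward; otherwise leave the reward unchanged."""
    gap = r_self - r_fellow
    if gap > 0:                       # comparison confirms this agent's superiority
        kindness = gap * fellow_intention
        return r_self - kindness      # perceive the kindness, reduce own reward
    return r_self

adjusted = kindness_adjust(r_self=2.0, r_fellow=1.0, fellow_intention=0.5)  # 1.5
```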

A Brain-inspired Theory of Collective Mind Model for Efficient Social Cooperation

Authors:Zhuoya Zhao, Feifei Zhao, Shiwen Wang, Yinqian Sun, Yi Zeng
Date:2023-11-06 14:45:43

Social intelligence manifests the capability, often referred to as the Theory of Mind (ToM), to discern others' behavioral intentions, beliefs, and other mental states. ToM is especially important in multi-agent and human-machine interaction environments because each agent needs to understand the mental states of other agents in order to better respond, interact, and collaborate. Recent research indicates that ToM models can infer beliefs and intentions and anticipate future observations and actions; nonetheless, their deployment in tackling intricate tasks remains notably limited. The challenges arise when the number of agents increases and the environment becomes more complex, so that interacting with the environment and predicting each other's mental states becomes difficult and time-consuming. To overcome these limits, we take inspiration from the Theory of Collective Mind (ToCM) mechanism, predicting the observations of all other agents into a unified but plural representation and discerning how our own actions affect this mental-state representation. Based on this foundation, we construct an imaginative space to simulate the multi-agent interaction process, thus improving the efficiency of cooperation among multiple agents in complex decision-making environments. In various cooperative tasks with different numbers of agents, the experimental results highlight the superior cooperative efficiency and performance of our approach compared to Multi-Agent Reinforcement Learning (MARL) baselines. We achieve a consistent boost on SNN- and DNN-based decision networks, and demonstrate that ToCM's inferences about others' mental states can be transferred to new tasks for quick and flexible adaptation.

Learning Independently from Causality in Multi-Agent Environments

Authors:Rafael Pina, Varuna De Silva, Corentin Artaud
Date:2023-11-05 19:12:08

Multi-Agent Reinforcement Learning (MARL) is an area of growing interest in the field of machine learning. Despite notable advances, there are still problems that require investigation. The lazy-agent pathology is a well-known problem in MARL, denoting the situation where some agents in a MARL team do not contribute to the common goal, letting their teammates do all the work. In this work, we investigate this problem from a causality-based perspective. We intend to build a bridge between the fields of MARL and causality and argue for the usefulness of this link. We study a fully decentralised MARL setup where agents need to learn cooperation strategies, and show that there is a causal relation between individual observations and the team reward. The experiments carried out show how this relation can be used to improve independent agents in MARL, resulting not only in better performance as a team but also in more intelligent behaviours in individual agents.

MELCHIORS: The Mercator Library of High Resolution Stellar Spectroscopy

Authors:P. Royer, T. Merle, K. Dsilva, S. Sekaran, H. Van Winckel, Y. Frémat, M. Van der Swaelmen, S. Gebruers, A. Tkachenko, M. Laverick, M. Dirickx, G. Raskin, H. Hensberge, M. Abdul-Masih, B. Acke, M. L. Alonso, S. Bandhu Mahato, P. G. Beck, N. Behara, S. Bloemen, B. Buysschaert, N. Cox, J. Debosscher, P. De Cat, P. Degroote, R. De Nutte, K. De Smedt, B. de Vries, L. Dumortier, A. Escorza, K. Exter, S. Goriely, N. Gorlova, M. Hillen, W. Homan, A. Jorissen, D. Kamath, M. Karjalainen, R. Karjalainen, P. Lampens, A. Lobel, R. Lombaert, P. Marcos-Arenal, J. Menu, F. Merges, E. Moravveji, P. Nemeth, P. Neyskens, R. Ostensen, P. I. Pápics, J. Perez, S. Prins, S. Royer, A. Samadi-Ghadim, H. Sana, A. Sans Fuentes, S. Scaringi, V. Schmid, L. Siess, C. Siopis, K. Smolders, S. Sodor, A. Thoul, S. Triana, B. Vandenbussche, M. Van de Sande, G. Van De Steene, S. Van Eck, P. A. M. van Hoof, A. J. Van Marle, T. Van Reeth, L. Vermeylen, D. Volpi, J. Vos, C. Waelkens
Date:2023-11-05 16:39:34

Over the past decades, libraries of stellar spectra have been used in a large variety of science cases, including as sources of reference spectra for a given object or a given spectral type. Despite the existence of large libraries and the increasing number of large-scale spectral survey projects, there is to date only one very high-resolution spectral library, offering spectra from a few hundred objects in the southern hemisphere (UVES-POP). We aim to extend the sample, offering a finer coverage of effective temperature and surface gravity with a uniform collection of spectra obtained in the northern hemisphere. Between 2010 and 2020, we acquired several thousand echelle spectra of bright stars with the Mercator-HERMES spectrograph located at the Roque de Los Muchachos Observatory on La Palma, whose pipeline offers high-quality data-reduction products. We have also developed methods to correct for the instrumental response in order to approach the true shape of the spectral continuum, and devised a normalisation process to provide a homogeneous normalisation of the full spectral range for most of the objects. We present a new spectral library consisting of 3256 spectra covering 2043 stars. It combines high signal-to-noise ratio and high spectral resolution over the entire range of effective temperatures and luminosity classes. The spectra are presented in four versions: raw; corrected for the instrumental response, with and without correction for atmospheric molecular absorption; and normalised (including the telluric correction).

AlberDICE: Addressing Out-Of-Distribution Joint Actions in Offline Multi-Agent RL via Alternating Stationary Distribution Correction Estimation

Authors:Daiki E. Matsunaga, Jongmin Lee, Jaeseok Yoon, Stefanos Leonardos, Pieter Abbeel, Kee-Eung Kim
Date:2023-11-03 18:56:48

One of the main challenges in offline Reinforcement Learning (RL) is the distribution shift that arises from the learned policy deviating from the data collection policy. This is often addressed by avoiding out-of-distribution (OOD) actions during policy improvement, as their presence can lead to substantial performance degradation. This challenge is amplified in the offline Multi-Agent RL (MARL) setting, since the joint action space grows exponentially with the number of agents. To avoid this curse of dimensionality, existing MARL methods adopt either value decomposition methods or fully decentralized training of individual agents. However, even when combined with standard conservatism principles, these methods can still result in the selection of OOD joint actions in offline MARL. To this end, we introduce AlberDICE, an offline MARL algorithm that alternately performs centralized training of individual agents based on stationary distribution optimization. AlberDICE circumvents the exponential complexity of MARL by computing the best response of one agent at a time while effectively avoiding OOD joint action selection. Theoretically, we show that the alternating optimization procedure converges to Nash policies. In experiments, we demonstrate that AlberDICE significantly outperforms baseline algorithms on a standard suite of MARL benchmarks.

RiskQ: Risk-sensitive Multi-Agent Reinforcement Learning Value Factorization

Authors:Siqi Shen, Chennan Ma, Chao Li, Weiquan Liu, Yongquan Fu, Songzhu Mei, Xinwang Liu, Cheng Wang
Date:2023-11-03 07:18:36

Multi-agent systems are characterized by environmental uncertainty, varying policies of agents, and partial observability, which result in significant risks. In the context of Multi-Agent Reinforcement Learning (MARL), learning coordinated and decentralized policies that are sensitive to risk is challenging. To formulate the coordination requirements in risk-sensitive MARL, we introduce the Risk-sensitive Individual-Global-Max (RIGM) principle as a generalization of the Individual-Global-Max (IGM) and Distributional IGM (DIGM) principles. This principle requires that the collection of risk-sensitive action selections of each agent should be equivalent to the risk-sensitive action selection of the central policy. Current MARL value factorization methods do not satisfy the RIGM principle for common risk metrics such as Value at Risk (VaR) or distorted risk measures. We therefore propose RiskQ to address this limitation: it models the joint return distribution by modeling its quantiles as weighted quantile mixtures of per-agent return distribution utilities. RiskQ satisfies the RIGM principle for VaR and distorted risk metrics. We show that RiskQ can obtain promising performance through extensive experiments. The source code of RiskQ is available at https://github.com/xmu-rl-3dv/RiskQ.
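
Why a weighted quantile mixture satisfies RIGM for VaR can be seen in a few lines of numpy: with positive weights applied quantile-by-quantile, the joint VaR decomposes across agents, so per-agent risk-sensitive greedy choices match the joint one. Shapes, weights, and the mixture form below are illustrative simplifications, not the paper's parameterization.

import numpy as np

n_agents, n_actions, n_quantiles = 2, 3, 8
taus = (np.arange(n_quantiles) + 0.5) / n_quantiles
rng = np.random.default_rng(0)

# per-agent quantile utilities, sorted along the quantile axis so each row
# is a valid quantile function: shape (agent, action, quantile)
z = np.sort(rng.normal(size=(n_agents, n_actions, n_quantiles)), axis=-1)
w = np.array([0.6, 0.4])            # positive mixture weights, one per agent

def var(quantiles, alpha=0.25):
    """Value at Risk: the quantile closest to level alpha."""
    idx = np.argmin(np.abs(taus - alpha))
    return quantiles[..., idx]

# individual risk-sensitive greedy actions
a_ind = [int(np.argmax(var(z[i]))) for i in range(n_agents)]

# joint quantiles for every joint action: weighted quantile mixture
joint = np.zeros((n_actions, n_actions, n_quantiles))
for a0 in range(n_actions):
    for a1 in range(n_actions):
        joint[a0, a1] = w[0] * z[0, a0] + w[1] * z[1, a1]

a_joint = np.unravel_index(np.argmax(var(joint).reshape(-1)), (n_actions, n_actions))
print("individual VaR-greedy:", a_ind, "joint VaR-greedy:", tuple(map(int, a_joint)))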

A Multi-Agent Reinforcement Learning Framework for Evaluating the U.S. Ending the HIV Epidemic Plan

Authors:Dinesh Sharma, Ankit Shah, Chaitra Gopalappa
Date:2023-11-01 21:19:35

Human immunodeficiency virus (HIV) is a major public health concern in the United States, with about 1.2 million people living with HIV and 35,000 newly infected each year. There are considerable geographical disparities in HIV burden and care access across the U.S. The 2019 Ending the HIV Epidemic (EHE) initiative aims to reduce new infections by 90% by 2030, by improving coverage of diagnoses, treatment, and prevention interventions and prioritizing jurisdictions with high HIV prevalence. Identifying the optimal scale-up of intervention combinations will help inform resource allocation. Existing HIV decision-analytic models either evaluate specific cities or the overall national population, thus overlooking jurisdictional interactions or differences. In this paper, we propose a multi-agent reinforcement learning (MARL) model that enables jurisdiction-specific decision analyses while capturing cross-jurisdictional epidemiological interactions. In experimental analyses, conducted on jurisdictions within California and Florida, optimal policies from MARL were significantly different from those generated by single-agent RL, highlighting the influence of jurisdictional variations and interactions. Through comprehensive modeling of HIV and careful formulation of the state space, action space, and reward functions, this work demonstrates the strengths and applicability of MARL for informing public health policies, and provides a framework for expansion to the national level to inform the EHE initiative.

QFree: A Universal Value Function Factorization for Multi-Agent Reinforcement Learning

Authors:Rizhong Wang, Huiping Li, Di Cui, Demin Xu
Date:2023-11-01 08:07:16

Centralized training is widely utilized in the field of multi-agent reinforcement learning (MARL) to ensure the stability of the training process. Once a joint policy is obtained, it is critical to design a value function factorization method to extract optimal decentralized policies for the agents, which needs to satisfy the individual-global-max (IGM) principle. While imposing additional limitations on the IGM function class can help to meet the requirement, it comes at the cost of restricting its application to more complex multi-agent environments. In this paper, we propose QFree, a universal value function factorization method for MARL. We start by developing mathematically equivalent conditions of the IGM principle based on the advantage function, which ensures that the principle holds without any compromise, removing the conservatism of conventional methods. We then establish a more expressive mixing network architecture that can fulfill the equivalent factorization. In particular, a novel loss function is developed by incorporating the equivalent conditions as a regularization term during policy evaluation in the MARL algorithm. Finally, the effectiveness of the proposed method is verified in a nonmonotonic matrix game scenario. Moreover, we show that QFree achieves state-of-the-art performance on a general-purpose complex MARL benchmark environment, the StarCraft Multi-Agent Challenge (SMAC).
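
One plausible reading of the advantage-based regularization is sketched below in PyTorch: an unrestricted mixing network produces Q_tot from per-agent utilities, and a hinge penalty enforces that the joint advantage is non-positive everywhere and zero at the greedy joint action. This is our reading, not the paper's exact loss, and it enumerates joint actions, so it only fits small action spaces.

import torch
import torch.nn as nn

class Mixer(nn.Module):
    """Unrestricted (non-monotonic) mixer from per-agent utilities to Q_tot."""
    def __init__(self, n_agents, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_agents, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, q_agents):                  # (batch, n_agents) -> (batch,)
        return self.net(q_agents).squeeze(-1)

def igm_penalty(mixer, q_all):
    # q_all: (batch, n_joint, n_agents) per-agent utilities for every joint
    # action over the full product action space; under that assumption, the
    # argmax of the summed utilities coincides with the joint action formed
    # by each agent's individual greedy choice.
    b, n_joint, n_agents = q_all.shape
    q_tot = mixer(q_all.reshape(-1, n_agents)).reshape(b, n_joint)
    greedy = q_all.sum(-1).argmax(dim=1)
    adv = q_tot - q_tot.gather(1, greedy.unsqueeze(1))   # joint advantage
    return torch.relu(adv).mean()                 # hinge: advantage must be <= 0

mixer = Mixer(n_agents=2)
q_all = torch.randn(4, 9, 2)    # 2 agents x 3 actions -> 9 enumerated joint actions
print(igm_penalty(mixer, q_all).item())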

Goals are Enough: Inducing AdHoc cooperation among unseen Multi-Agent systems in IMFs

Authors:Kaushik Dey, Satheesh K. Perepu, Abir Das
Date:2023-10-26 14:21:36

Intent-based management will play a critical role in achieving customers' expectations in next-generation mobile networks. Traditional methods cannot perform efficient resource management since they tend to handle each expectation independently. Existing approaches, e.g., those based on multi-agent reinforcement learning (MARL), allocate resources in an efficient fashion when there are conflicting expectations on the network slice. However, in reality, systems are often far too complex to be addressed by a standalone MARL formulation. Often there exists a hierarchical structure of intent fulfilment where multiple pre-trained, self-interested agents may need to be further orchestrated by a supervisor or controller agent. Such agents may arrive in the system ad hoc and then need to be orchestrated along with the other available agents. Retraining the whole system every time is often infeasible given the associated time and cost. Given these challenges, such ad hoc coordination of pre-trained systems could be achieved through an intelligent supervisor agent which incentivizes the pre-trained RL/MARL agents through sets of dynamic contracts (goals or bonuses) and encourages them to act as a cohesive unit towards fulfilling a global expectation. Some approaches use a rule-based supervisor agent and deploy the hierarchical constituent agents sequentially, based on human-coded rules. In the current work, we propose a framework whereby pre-trained agents can be orchestrated in parallel by leveraging an AI-based supervisor agent. For this, we propose to use ad hoc teaming approaches which assign optimal goals to the MARL agents and incentivize them to exhibit certain desired behaviours. Results on a network emulator show that the proposed approach results in faster and improved fulfilment of expectations when compared to rule-based approaches, and even generalizes to changes in environments.

DePAint: A Decentralized Safe Multi-Agent Reinforcement Learning Algorithm considering Peak and Average Constraints

Authors:Raheeb Hassan, K. M. Shadman Wadith, Md. Mamun or Rashid, Md. Mosaddek Khan
Date:2023-10-22 16:36:03

The domain of safe multi-agent reinforcement learning (MARL), despite its potential applications in areas ranging from drone delivery and vehicle automation to the development of zero-energy communities, remains relatively unexplored. The primary challenge involves training agents to learn optimal policies that maximize rewards while adhering to stringent safety constraints, all without the oversight of a central controller. These constraints are critical in a wide array of applications. Moreover, ensuring the privacy of sensitive information in decentralized settings introduces an additional layer of complexity, necessitating innovative solutions that uphold privacy while achieving the system's safety and efficiency goals. In this paper, we address the problem of multi-agent policy optimization in a decentralized setting, where agents communicate with their neighbors to maximize the sum of their cumulative rewards while also satisfying each agent's safety constraints. We consider both peak and average constraints. In this scenario, there is no central controller coordinating the agents and both the rewards and constraints are only known to each agent locally/privately. We formulate the problem as a decentralized constrained multi-agent Markov Decision Problem and propose a momentum-based decentralized policy gradient method, DePAint, to solve it. To the best of our knowledge, this is the first privacy-preserving fully decentralized multi-agent reinforcement learning algorithm that considers both peak and average constraints. We then provide theoretical analysis and empirical evaluation of our algorithm in a number of scenarios and compare its performance to centralized algorithms that consider similar constraints.
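
A minimal numpy sketch of the communication pattern such a method relies on, under our assumptions: each agent mixes its momentum-style gradient tracker and parameters with its neighbors' through doubly stochastic consensus weights, then takes a local ascent step. The constraint handling and the paper's exact estimator are omitted, and local_grad is a stand-in.

import numpy as np

rng = np.random.default_rng(1)
n_agents, dim = 4, 3
W = np.full((n_agents, n_agents), 1.0 / n_agents)   # doubly stochastic mixing weights
theta = rng.normal(size=(n_agents, dim))            # each agent's local policy parameters
u = np.zeros_like(theta)                            # momentum-style gradient trackers

def local_grad(th):
    # stand-in for a sampled policy gradient of an agent's local objective
    return -(th - 1.0) + 0.1 * rng.normal(size=th.shape)

beta, lr = 0.9, 0.05
for t in range(200):
    g = np.stack([local_grad(theta[i]) for i in range(n_agents)])
    u = W @ (beta * u + (1 - beta) * g)   # consensus on momentum estimates
    theta = W @ theta + lr * u            # consensus on parameters + local ascent

print("max disagreement across agents:", np.abs(theta - theta.mean(0)).max())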

Collaborative Adaptation: Learning to Recover from Unforeseen Malfunctions in Multi-Robot Teams

Authors:Yasin Findik, Paul Robinette, Kshitij Jerath, S. Reza Ahmadzadeh
Date:2023-10-19 17:00:09

Cooperative multi-agent reinforcement learning (MARL) approaches tackle the challenge of finding effective multi-agent cooperation strategies for accomplishing individual or shared objectives in multi-agent teams. In real-world scenarios, however, agents may encounter unforeseen failures due to constraints like battery depletion or mechanical issues. Existing state-of-the-art methods in MARL often recover slowly -- if at all -- from such malfunctions once agents have already converged on a cooperation strategy. To address this gap, we present the Collaborative Adaptation (CA) framework. CA introduces a mechanism that guides collaboration and accelerates adaptation from unforeseen failures by leveraging inter-agent relationships. Our findings demonstrate that CA enables agents to act on the knowledge of inter-agent relations, recovering from unforeseen agent failures and selecting appropriate cooperative strategies.

Fact-based Agent modeling for Multi-Agent Reinforcement Learning

Authors:Baofu Fang, Caiming Zheng, Hao Wang
Date:2023-10-18 19:43:38

In multi-agent systems, agents need to interact and collaborate with other agents in the environment. Agent modeling is crucial for facilitating agent interactions and making adaptive cooperation strategies. However, it is challenging for agents to model the beliefs, behaviors, and intentions of other agents in non-stationary environments where all agents' policies are learned simultaneously. In addition, existing methods realize agent modeling through behavior cloning, which assumes that the local information of other agents can be accessed during execution or training. However, this assumption is infeasible in unknown scenarios characterized by unknown agents, such as competing teams, unreliable communication, or federated learning with privacy concerns. To eliminate this assumption and achieve agent modeling in unknown scenarios, a Fact-based Agent modeling (FAM) method is proposed, in which a fact-based belief inference (FBI) network models other agents in a partially observable environment based only on local information. The reward and observation obtained by an agent after taking an action are called facts, and FAM uses facts as the reconstruction target to learn the policy representation of other agents through a variational autoencoder. We evaluate FAM on various Multi-agent Particle Environment (MPE) tasks and compare the results with several state-of-the-art MARL algorithms. Experimental results show that, compared with baseline methods, FAM can effectively improve the efficiency of agent policy learning by enabling adaptive cooperation strategies in multi-agent reinforcement learning tasks, while achieving higher returns in complex competitive-cooperative mixed scenarios.
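
A hedged PyTorch sketch of the fact-based reconstruction idea: encode the agent's local observation into a latent representation of the other agents, and train it by reconstructing the observed facts (reward and next local observation) with a VAE-style ELBO. All dimensions and interfaces below are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class FactVAE(nn.Module):
    def __init__(self, obs_dim, fact_dim, z_dim=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(),
                                 nn.Linear(64, fact_dim))

    def forward(self, local_obs, facts):
        mu, logvar = self.enc(local_obs).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        recon = self.dec(z)                                    # predict the facts
        recon_loss = ((recon - facts) ** 2).mean()
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return recon_loss + kl, z       # z would condition the agent's policy

model = FactVAE(obs_dim=10, fact_dim=11)   # fact = reward (1) + next local obs (10)
loss, z = model(torch.randn(32, 10), torch.randn(32, 11))
loss.backward()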

Combat Urban Congestion via Collaboration: Heterogeneous GNN-based MARL for Coordinated Platooning and Traffic Signal Control

Authors:Xianyue Peng, Hang Gao, Hao Wang, H. Michael Zhang
Date:2023-10-17 02:46:04

Over the years, reinforcement learning has emerged as a popular approach to develop signal control and vehicle platooning strategies either independently or in a hierarchical way. However, jointly controlling both in real time to alleviate traffic congestion presents new challenges, such as the inherent physical and behavioral heterogeneity between signal control and platooning, as well as the coordination between them. This paper proposes an innovative solution to tackle these challenges based on heterogeneous graph multi-agent reinforcement learning and traffic theory. Our approach involves: 1) designing platoon and signal control as distinct reinforcement learning agents with their own sets of observations, actions, and reward functions to optimize traffic flow; 2) designing coordination by incorporating graph neural networks within multi-agent reinforcement learning to facilitate seamless information exchange among agents on a regional scale. We evaluate our approach through SUMO simulation, which shows convergence across various transportation metrics and better performance than signal control or platooning alone.

Robust Multi-Agent Reinforcement Learning via Adversarial Regularization: Theoretical Foundation and Stable Algorithms

Authors:Alexander Bukharin, Yan Li, Yue Yu, Qingru Zhang, Zhehui Chen, Simiao Zuo, Chao Zhang, Songan Zhang, Tuo Zhao
Date:2023-10-16 20:14:06

Multi-Agent Reinforcement Learning (MARL) has shown promising results across several domains. Despite this promise, MARL policies often lack robustness and are therefore sensitive to small changes in their environment. This presents a serious concern for the real-world deployment of MARL algorithms, where the testing environment may slightly differ from the training environment. In this work, we show that we can gain robustness by controlling a policy's Lipschitz constant, and under mild conditions, establish the existence of a Lipschitz and close-to-optimal policy. Based on these insights, we propose a new robust MARL framework, ERNIE, that promotes the Lipschitz continuity of the policies with respect to the state observations and actions by adversarial regularization. The ERNIE framework provides robustness against noisy observations, changing transition dynamics, and malicious actions of agents. However, ERNIE's adversarial regularization may introduce some training instability. To reduce this instability, we reformulate adversarial regularization as a Stackelberg game. We demonstrate the effectiveness of the proposed framework with extensive experiments in traffic light control and particle environments. In addition, we extend ERNIE to mean-field MARL with a formulation based on distributionally robust optimization that outperforms its non-robust counterpart and is of independent interest. Our code is available at https://github.com/abukharin3/ERNIE.
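
The adversarial regularization component lends itself to a compact sketch: find a small observation perturbation that maximally changes the policy output, then penalize that change, which encourages a small effective Lipschitz constant. The inner-loop details below (step counts, epsilon, the stand-in RL loss) are assumptions, not ERNIE's exact procedure.

import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))

def lipschitz_penalty(obs, eps=0.1, steps=3, alpha=0.05):
    # start from a small random perturbation (a zero start has zero gradient)
    delta = (0.01 * torch.randn_like(obs)).requires_grad_(True)
    for _ in range(steps):                         # inner adversarial ascent
        diff = (policy(obs + delta) - policy(obs)).pow(2).sum()
        grad, = torch.autograd.grad(diff, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    # penalize the policy's sensitivity to the worst-case perturbation found
    return (policy(obs + delta.detach()) - policy(obs)).pow(2).mean()

obs = torch.randn(32, 8)
rl_loss = policy(obs).pow(2).mean()                # stand-in for the RL objective
(rl_loss + 0.5 * lipschitz_penalty(obs)).backward()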

Theory of Mind for Multi-Agent Collaboration via Large Language Models

Authors:Huao Li, Yu Quan Chong, Simon Stepputtis, Joseph Campbell, Dana Hughes, Michael Lewis, Katia Sycara
Date:2023-10-16 07:51:19

While Large Language Models (LLMs) have demonstrated impressive accomplishments in both reasoning and planning, their abilities in multi-agent collaboration remain largely unexplored. This study evaluates LLM-based agents in a multi-agent cooperative text game with Theory of Mind (ToM) inference tasks, comparing their performance with Multi-Agent Reinforcement Learning (MARL) and planning-based baselines. We observed evidence of emergent collaborative behaviors and high-order Theory of Mind capabilities among LLM-based agents. Our results reveal limitations in LLM-based agents' planning optimization due to systematic failures in managing long-horizon contexts and hallucination about the task state. We explore the use of explicit belief state representations to mitigate these issues, finding that they enhance task performance and the accuracy of ToM inferences for LLM-based agents.

Robust Multi-Agent Reinforcement Learning by Mutual Information Regularization

Authors:Simin Li, Ruixiao Xu, Jingqiao Xiu, Yuwei Zheng, Pu Feng, Yaodong Yang, Xianglong Liu
Date:2023-10-15 13:35:51

In multi-agent reinforcement learning (MARL), ensuring robustness against unpredictable or worst-case actions by allies is crucial for real-world deployment. Existing robust MARL methods either approximate or enumerate all possible threat scenarios against worst-case adversaries, which is computationally intensive and can reduce robustness. In contrast, human learning efficiently acquires robust behaviors in daily life without preparing for every possible threat. Inspired by this, we frame robust MARL as an inference problem, with worst-case robustness implicitly optimized under all threat scenarios via off-policy evaluation. Within this framework, we demonstrate that Mutual Information Regularization as Robust Regularization (MIR3) during routine training is guaranteed to maximize a lower bound on robustness, without the need for adversaries. Further insights show that MIR3 acts as an information bottleneck, preventing agents from over-reacting to others and aligning policies with robust action priors. In the presence of worst-case adversaries, MIR3 significantly surpasses baseline methods in robustness and training efficiency while maintaining cooperative performance in StarCraft II and robot swarm control. When deploying the robot swarm control algorithm in the real world, our method also outperforms the best baseline by 14.29%.
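
As we read it, the simplest variational form of such a regularizer is a KL term toward a learned, observation-independent action prior, which upper-bounds the mutual information between states and actions (an information-bottleneck view consistent with the abstract). The sketch below is that interpretation with invented dimensions, not the paper's exact objective.

import torch
import torch.nn as nn
import torch.nn.functional as F

n_actions = 5
policy = nn.Sequential(nn.Linear(12, 64), nn.ReLU(), nn.Linear(64, n_actions))
prior_logits = nn.Parameter(torch.zeros(n_actions))   # learned marginal action prior

def mi_regularizer(obs):
    logp = F.log_softmax(policy(obs), dim=-1)
    logq = F.log_softmax(prior_logits, dim=-1)
    # E_s KL(pi(.|s) || q(.)) upper-bounds I(S; A); the bound is tight when
    # q equals the marginal action distribution.
    return (logp.exp() * (logp - logq)).sum(-1).mean()

obs = torch.randn(64, 12)
reg = mi_regularizer(obs)     # add to the usual RL loss with a small weight
reg.backward()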

Robustness to Multi-Modal Environment Uncertainty in MARL using Curriculum Learning

Authors:Aakriti Agrawal, Rohith Aralikatti, Yanchao Sun, Furong Huang
Date:2023-10-12 22:19:36

Multi-agent reinforcement learning (MARL) plays a pivotal role in tackling real-world challenges. However, the seamless transition of trained policies from simulation to the real world requires them to be robust to various environmental uncertainties. Existing works focus on finding Nash equilibria or optimal policies under uncertainty in a single environment variable (i.e., action, state, or reward), because a multi-agent system is itself highly complex and non-stationary. In real-world situations, however, uncertainty can occur in multiple environment variables simultaneously. This work is the first to formulate the generalised problem of robustness to multi-modal environment uncertainty in MARL. To this end, we propose a general robust training approach for multi-modal uncertainty based on curriculum learning techniques. We handle two distinct environmental uncertainties simultaneously and present extensive results across both cooperative and competitive MARL environments, demonstrating that our approach achieves state-of-the-art levels of robustness.

Quantifying Agent Interaction in Multi-agent Reinforcement Learning for Cost-efficient Generalization

Authors:Yuxin Chen, Chen Tang, Ran Tian, Chenran Li, Jinning Li, Masayoshi Tomizuka, Wei Zhan
Date:2023-10-11 06:09:26

Generalization poses a significant challenge in Multi-agent Reinforcement Learning (MARL). The extent to which an agent is influenced by unseen co-players depends on the agent's policy and the specific scenario. A quantitative examination of this relationship sheds light on effectively training agents for diverse scenarios. In this study, we present the Level of Influence (LoI), a metric quantifying the interaction intensity among agents within a given scenario and environment. We observe that, generally, a more diverse set of co-play agents during training enhances the generalization performance of the ego agent; however, this improvement varies across distinct scenarios and environments. LoI proves effective in predicting these improvement disparities within specific scenarios. Furthermore, we introduce a LoI-guided resource allocation method tailored to train a set of policies for diverse scenarios under a constrained budget. Our results demonstrate that strategic resource allocation based on LoI can achieve higher performance than uniform allocation under the same computation budget.

Sample-Efficient Multi-Agent RL: An Optimization Perspective

Authors:Nuoya Xiong, Zhihan Liu, Zhaoran Wang, Zhuoran Yang
Date:2023-10-10 01:39:04

We study multi-agent reinforcement learning (MARL) for general-sum Markov games (MGs) under general function approximation. In order to find the minimal assumption for sample-efficient learning, we introduce a novel complexity measure called the Multi-Agent Decoupling Coefficient (MADC) for general-sum MGs. Using this measure, we propose the first unified algorithmic framework that ensures sample efficiency in learning Nash equilibria, coarse correlated equilibria, and correlated equilibria for both model-based and model-free MARL problems with low MADC. We also show that our algorithm provides sublinear regret comparable to existing works. Moreover, our algorithm combines an equilibrium-solving oracle with a single-objective optimization subprocedure that solves for the regularized payoff of each deterministic joint policy, which avoids solving constrained optimization problems within data-dependent constraints (Jin et al. 2020; Wang et al. 2023) or executing sampling procedures with complex multi-objective optimization problems (Foster et al. 2023), thus being more amenable to empirical implementation.

ZSC-Eval: An Evaluation Toolkit and Benchmark for Multi-agent Zero-shot Coordination

Authors:Xihuai Wang, Shao Zhang, Wenhao Zhang, Wentao Dong, Jingxiao Chen, Ying Wen, Weinan Zhang
Date:2023-10-08 15:49:36

Zero-shot coordination (ZSC) is a new cooperative multi-agent reinforcement learning (MARL) challenge that aims to train an ego agent to work with diverse, unseen partners during deployment. The significant difference between the deployment-time partners' distribution and the training partners' distribution determined by the training algorithm makes ZSC a unique out-of-distribution (OOD) generalization challenge. The potential distribution gap between evaluation and deployment-time partners leads to inadequate evaluation, which is exacerbated by the lack of appropriate evaluation metrics. In this paper, we present ZSC-Eval, the first evaluation toolkit and benchmark for ZSC algorithms. ZSC-Eval consists of: 1) generation of evaluation partner candidates through behavior-preferring rewards to approximate deployment-time partners' distribution; 2) selection of evaluation partners by Best-Response Diversity (BR-Div); 3) measurement of generalization performance with various evaluation partners via the Best-Response Proximity (BR-Prox) metric. We use ZSC-Eval to benchmark ZSC algorithms in Overcooked and Google Research Football environments and obtain novel empirical findings. We also conduct a human experiment with current ZSC algorithms to verify ZSC-Eval's consistency with human evaluation. ZSC-Eval is now available at https://github.com/sjtu-marl/ZSC-Eval.

FP3O: Enabling Proximal Policy Optimization in Multi-Agent Cooperation with Parameter-Sharing Versatility

Authors:Lang Feng, Dong Xing, Junru Zhang, Gang Pan
Date:2023-10-08 07:26:35

Existing multi-agent PPO algorithms lack compatibility with different types of parameter sharing when extending the theoretical guarantee of PPO to cooperative multi-agent reinforcement learning (MARL). In this paper, we propose a novel and versatile multi-agent PPO algorithm for cooperative MARL to overcome this limitation. Our approach builds upon the proposed full-pipeline paradigm, which establishes multiple parallel optimization pipelines by employing various equivalent decompositions of the advantage function. This procedure formulates the interconnections among agents in a more general manner, i.e., as interconnections among pipelines, making it compatible with diverse types of parameter sharing. We provide a solid theoretical foundation for policy improvement and subsequently develop a practical algorithm called Full-Pipeline PPO (FP3O) through several approximations. Empirical evaluations on Multi-Agent MuJoCo and StarCraft II tasks demonstrate that FP3O outperforms other strong baselines and exhibits remarkable versatility across various parameter-sharing configurations.

Accelerate Multi-Agent Reinforcement Learning in Zero-Sum Games with Subgame Curriculum Learning

Authors:Jiayu Chen, Zelai Xu, Yunfei Li, Chao Yu, Jiaming Song, Huazhong Yang, Fei Fang, Yu Wang, Yi Wu
Date:2023-10-07 13:09:37

Learning Nash equilibrium (NE) in complex zero-sum games with multi-agent reinforcement learning (MARL) can be extremely computationally expensive. Curriculum learning is an effective way to accelerate learning, but an under-explored dimension for generating a curriculum is the difficulty-to-learn of the subgames -- games induced by starting from a specific state. In this work, we present a novel subgame curriculum learning framework for zero-sum games. It adopts an adaptive initial state distribution by resetting agents to previously visited states where they can quickly learn to improve performance. Building upon this framework, we derive a subgame selection metric that approximates the squared distance to NE values, and we further adopt a particle-based state sampler for subgame generation. Integrating these techniques leads to our new algorithm, Subgame Automatic Curriculum Learning (SACL), a realization of the subgame curriculum learning framework. SACL can be combined with any MARL algorithm, such as MAPPO. Experiments in the particle-world environment and the Google Research Football environment show that SACL produces much stronger policies than baselines. In the challenging hide-and-seek quadrant environment, SACL produces all four emergent stages and uses only half the samples of MAPPO with self-play. The project website is at https://sites.google.com/view/sacl-rl.
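
A toy sketch of the state-resetting mechanic, under our assumptions: keep a buffer of visited states, weight each by the squared change in its estimated value between policy updates (a proxy for the squared distance to the NE value, in the spirit of the paper's metric), and sample episode initial states in proportion to that weight. The particle-based sampler and all interfaces here are simplified.

import numpy as np

rng = np.random.default_rng(2)

class SubgameSampler:
    def __init__(self):
        self.states, self.v_old = [], []

    def add(self, state, value):
        self.states.append(state)
        self.v_old.append(value)

    def sample_initial_state(self, value_fn, temp=1.0):
        v_new = np.array([value_fn(s) for s in self.states])
        w = (v_new - np.array(self.v_old)) ** 2      # distance-to-NE proxy
        p = np.exp(w / temp)
        p /= p.sum()
        idx = rng.choice(len(self.states), p=p)      # reset where learning is fastest
        self.v_old[idx] = v_new[idx]                 # refresh the stored estimate
        return self.states[idx]

sampler = SubgameSampler()
for s in range(10):                                  # states are ints in this toy
    sampler.add(s, value=0.0)
print("reset to state:", sampler.sample_initial_state(lambda s: 0.1 * s))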

Self-Confirming Transformer for Belief-Conditioned Adaptation in Offline Multi-Agent Reinforcement Learning

Authors:Tao Li, Juan Guevara, Xinhong Xie, Quanyan Zhu
Date:2023-10-06 20:43:08

Offline reinforcement learning (RL) suffers from the distribution shift between the offline dataset and the online environment. In multi-agent RL (MARL), this distribution shift may arise from nonstationary opponents at online test time who display behaviors distinct from those recorded in the offline dataset. Hence, the key to the broader deployment of offline MARL is online adaptation to nonstationary opponents. Recent advances in foundation models, e.g., large language models, have demonstrated the generalization ability of the transformer, an emerging neural network architecture, in sequence modeling, of which offline RL is a special case. One naturally wonders \textit{whether offline-trained transformer-based RL policies adapt to nonstationary opponents online}. We propose a novel auto-regressive training scheme to equip transformer agents with online adaptability, based on the idea of self-augmented pre-conditioning. The transformer agent first learns offline to predict the opponent's action from past observations. When deployed online, this fictitious opponent play, referred to as the belief, is fed back to the transformer, together with other environmental feedback, to generate future actions conditional on the belief. Motivated by self-confirming equilibrium in game theory, the training loss consists of a belief consistency loss, requiring the beliefs to match the opponent's actual actions, and a best response loss, mandating that the agent behave optimally under the belief. We evaluate the online adaptability of the proposed self-confirming transformer (SCT) in a structured environment, iterated prisoner's dilemma games, to demonstrate SCT's belief consistency and equilibrium behaviors, as well as in more involved multi-particle environments to showcase its superior performance against nonstationary opponents over prior transformers and offline MARL baselines.
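
A hedged sketch of how the two losses compose, with the transformer replaced by a GRU stub and every interface invented for illustration; it shows only the shape of the training signal, not the paper's architecture or its exact best-response objective.

import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.GRU(input_size=6, hidden_size=32, batch_first=True)  # stand-in encoder
belief_head = nn.Linear(32, 4)       # distribution over the opponent's actions
action_head = nn.Linear(32 + 4, 4)   # own action values, conditioned on the belief

history = torch.randn(16, 10, 6)           # (batch, time, features)
opp_action = torch.randint(0, 4, (16,))    # opponent's actual next action

_, h = enc(history)
h = h.squeeze(0)                           # (batch, hidden)
belief = F.softmax(belief_head(h), dim=-1)

# belief consistency: predicted belief must match the opponent's actual action
loss_belief = F.cross_entropy(belief_head(h), opp_action)
# best response (crude stand-in): act well under the current belief
q = action_head(torch.cat([h, belief], dim=-1))
loss_br = -q.max(dim=-1).values.mean()
(loss_belief + loss_br).backward()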

Fictitious Cross-Play: Learning Global Nash Equilibrium in Mixed Cooperative-Competitive Games

Authors:Zelai Xu, Yancheng Liang, Chao Yu, Yu Wang, Yi Wu
Date:2023-10-05 07:19:33

Self-play (SP) is a popular multi-agent reinforcement learning (MARL) framework for solving competitive games, where each agent optimizes its policy by treating others as part of the environment. Despite the empirical successes, the theoretical properties of SP-based methods are limited to two-player zero-sum games. However, for mixed cooperative-competitive games where agents on the same team need to cooperate with each other, we show a simple counter-example where SP-based methods cannot converge to a global Nash equilibrium (NE) with high probability. Alternatively, Policy-Space Response Oracles (PSRO) is an iterative framework for learning NE, where the best responses w.r.t. previous policies are learned in each iteration. PSRO can be directly extended to mixed cooperative-competitive settings by jointly learning team best responses with all convergence properties unchanged. However, PSRO requires repeatedly training joint policies from scratch until convergence, which makes it hard to scale to complex games. In this work, we develop a novel algorithm, Fictitious Cross-Play (FXP), which inherits the benefits of both frameworks. FXP simultaneously trains an SP-based main policy and a counter population of best-response policies. The main policy is trained by fictitious self-play and cross-play against the counter population, while the counter policies are trained as the best responses to the main policy's past versions. We validate our method in matrix games and show that FXP converges to global NEs while SP methods fail. We also conduct experiments in a gridworld domain, where FXP achieves higher Elo ratings and lower exploitabilities than baselines, and in a more challenging football game, where FXP defeats SOTA models with a win rate of over 94%.

A Two-stage Based Social Preference Recognition in Multi-Agent Autonomous Driving System

Authors:Jintao Xue, Dongkun Zhang, Rong Xiong, Yue Wang, Eryun Liu
Date:2023-10-05 04:10:56

Multi-Agent Reinforcement Learning (MARL) has become a promising solution for constructing a multi-agent autonomous driving system (MADS) in complex and dense scenarios. However, most methods consider agents acting selfishly, which leads to conflicting behaviors. Some existing works incorporate the concept of social value orientation (SVO) to promote coordination, but they lack knowledge of other agents' SVOs, resulting in conservative maneuvers. In this paper, we tackle this problem by enabling the agents to understand other agents' SVOs. To accomplish this, we propose a two-stage system framework. First, we train a policy by allowing the agents to share their ground-truth SVOs to establish a coordinated traffic flow. Second, we develop a recognition network that estimates agents' SVOs and integrate it with the policy trained in the first stage. Experiments demonstrate that our method significantly improves the performance of the driving policy in MADS compared to two state-of-the-art MARL algorithms.

Multi-Agent Reinforcement Learning for Power Grid Topology Optimization

Authors:Erica van der Sar, Alessandro Zocca, Sandjai Bhulai
Date:2023-10-04 06:37:43

Recent challenges in operating power networks arise from increasing energy demands and unpredictable renewable sources like wind and solar. While reinforcement learning (RL) shows promise in managing these networks through topological actions like bus and line switching, efficiently handling large action spaces as networks grow is crucial. This paper presents a hierarchical multi-agent reinforcement learning (MARL) framework tailored for these expansive action spaces, leveraging the power grid's inherent hierarchical nature. Experimental results indicate the MARL framework's competitive performance with single-agent RL methods. We also compare different RL algorithms for the lower-level agents alongside different policies for the higher-level agents.

Multi-Agent Reinforcement Learning Based on Representational Communication for Large-Scale Traffic Signal Control

Authors:Rohit Bokade, Xiaoning Jin, Christopher Amato
Date:2023-10-03 21:06:51

Traffic signal control (TSC) is a challenging problem within intelligent transportation systems and has been tackled using multi-agent reinforcement learning (MARL). While centralized approaches are often infeasible for large-scale TSC problems, decentralized approaches provide scalability but introduce new challenges, such as partial observability. Communication plays a critical role in decentralized MARL, as agents must learn to exchange information using messages to better understand the system and achieve effective coordination. Deep MARL has been used to enable inter-agent communication by learning communication protocols in a differentiable manner. However, many deep MARL communication frameworks proposed for TSC allow agents to communicate with all other agents at all times, which can add to the existing noise in the system and degrade overall performance. In this study, we propose a communication-based MARL framework for large-scale TSC. Our framework allows each agent to learn a communication policy that dictates "which" part of the message is sent "to whom". In essence, our framework enables agents to selectively choose the recipients of their messages and exchange variable length messages with them. This results in a decentralized and flexible communication mechanism in which agents can effectively use the communication channel only when necessary. We designed two networks, a synthetic $4 \times 4$ grid network and a real-world network based on the Pasubio neighborhood in Bologna. Our framework achieved the lowest network congestion compared to related methods, with agents utilizing $\sim 47-65 \%$ of the communication channel. Ablation studies further demonstrated the effectiveness of the communication policies learned within our framework.

COMPOSER: Scalable and Robust Modular Policies for Snake Robots

Authors:Yuyou Zhang, Yaru Niu, Xingyu Liu, Ding Zhao
Date:2023-10-02 03:20:31

Snake robots have showcased remarkable compliance and adaptability in their interaction with environments, mirroring the traits of their natural counterparts. While their hyper-redundant and high-dimensional characteristics add to this adaptability, they also pose great challenges to robot control. Rather than perceiving the hyper-redundancy and flexibility of snake robots as mere challenges, we see unexplored potential in leveraging these traits to enhance robustness and generalizability at the control policy level. We seek to develop a control policy that effectively breaks down the high dimensionality of snake robots while harnessing their redundancy. In this work, we consider the snake robot as a modular robot and formulate its control as a cooperative Multi-Agent Reinforcement Learning (MARL) problem, with each segment of the snake robot functioning as an individual agent. Specifically, we incorporate a self-attention mechanism to enhance the cooperative behavior between agents. A high-level imagination policy is proposed to provide additional rewards to guide the low-level control policy. We validate the proposed method, COMPOSER, on five snake robot tasks: goal reaching, wall climbing, shape formation, tube crossing, and block pushing. COMPOSER achieves the highest success rate across all tasks when compared to a centralized baseline and four modular policy baselines. Additionally, we show enhanced robustness against module corruption and significantly superior zero-shot generalizability of our proposed method. The videos of this work are available on our project page: https://sites.google.com/view/composer-snake/.

MARL: Multi-scale Archetype Representation Learning for Urban Building Energy Modeling

Authors:Xinwei Zhuang, Zixun Huang, Wentao Zeng, Luisa Caldas
Date:2023-09-29 22:56:19

Building archetypes, representative models of building stock, are crucial for precise energy simulations in Urban Building Energy Modeling. The current widely adopted building archetypes are developed on a nationwide scale, potentially neglecting the impact of local buildings' geometric specificities. We present Multi-scale Archetype Representation Learning (MARL), an approach that leverages representation learning to extract geometric features from a specific building stock. Built upon VQ-AE, MARL encodes building footprints and distills geometric information into latent vectors constrained by multiple architectural downstream tasks. These tailored representations are proven valuable for further clustering and building energy modeling. The advantages of our algorithm are its adaptability to different building footprint sizes, its ability to generate archetypes automatically across multi-scale regions, and its preservation of geometric features across neighborhoods and local ecologies. In our study spanning five regions in LA County, we show that MARL surpasses both conventional and VQ-AE-extracted archetypes in performance. Results demonstrate that geometric feature embeddings significantly improve the accuracy and reliability of energy consumption estimates. Code, dataset and trained models are publicly available: https://github.com/ZixunHuang1997/MARL-BuildingEnergyEstimation

Double-Layer Power Control for Mobile Cell-Free XL-MIMO with Multi-Agent Reinforcement Learning

Authors:Ziheng Liu, Jiayi Zhang, Zhilong Liu, Huahua Xiao, Bo Ai
Date:2023-09-29 09:19:01

Cell-free (CF) extremely large-scale multiple-input multiple-output (XL-MIMO) is regarded as a promising technology for enabling future wireless communication systems, and it has attracted significant attention for its considerable advantages in augmenting degrees of freedom. In this paper, we first investigate a CF XL-MIMO system with base stations equipped with XL-MIMO panels under a dynamic environment. Then, we propose an innovative multi-agent reinforcement learning (MARL)-based power control algorithm that incorporates predictive management and a distributed optimization architecture, providing a dynamic strategy for addressing high-dimensional signal processing problems. Specifically, we compare various MARL-based algorithms, which shows that the proposed MARL-based algorithm effectively strikes a balance between spectral efficiency (SE) performance and convergence time. Moreover, we consider a double-layer power control architecture based on the large-scale fading coefficients between antennas to suppress interference within dynamic systems. Compared to the single-layer architecture, the results unveil that the proposed double-layer architecture achieves a nearly 24% SE performance improvement, especially with massive antennas and smaller antenna spacing.

TaxAI: A Dynamic Economic Simulator and Benchmark for Multi-Agent Reinforcement Learning

Authors:Qirui Mi, Siyu Xia, Yan Song, Haifeng Zhang, Shenghao Zhu, Jun Wang
Date:2023-09-28 09:59:48

Taxation and government spending are crucial tools for governments to promote economic growth and maintain social equity. However, the difficulty of accurately predicting the dynamic strategies of diverse self-interested households presents a challenge for governments seeking to implement effective tax policies. Given its proficiency in modeling other agents in partially observable environments and adaptively learning to find optimal policies, Multi-Agent Reinforcement Learning (MARL) is highly suitable for solving dynamic games between the government and numerous households. Although MARL shows more potential than traditional methods such as genetic algorithms and dynamic programming, there is a lack of large-scale multi-agent reinforcement learning economic simulators. Therefore, we propose a MARL environment, named \textbf{TaxAI}, for dynamic games involving $N$ households, the government, firms, and financial intermediaries, based on the Bewley-Aiyagari economic model. Our study benchmarks two traditional economic methods and seven MARL methods on TaxAI, demonstrating the effectiveness and superiority of MARL algorithms. Moreover, TaxAI's scalability in simulating dynamic interactions between the government and 10,000 households, coupled with real-data calibration, grants it a substantial improvement in scale and realism over existing simulators. TaxAI is thus the most realistic economic simulator available for studying optimal tax policy, and it aims to generate feasible recommendations for governments and individuals.

Cooperation Dynamics in Multi-Agent Systems: Exploring Game-Theoretic Scenarios with Mean-Field Equilibria

Authors:Vaigarai Sathi, Sabahat Shaik, Jaswanth Nidamanuri
Date:2023-09-28 08:57:01

Cooperation is fundamental in Multi-Agent Systems (MAS) and Multi-Agent Reinforcement Learning (MARL), often requiring agents to balance individual gains with collective rewards. In this regard, this paper aims to investigate strategies to invoke cooperation in game-theoretic scenarios, namely the Iterated Prisoner's Dilemma, where agents must optimize both individual and group outcomes. Existing cooperative strategies are analyzed for their effectiveness in promoting group-oriented behavior in repeated games. Modifications are proposed where encouraging group rewards will also result in a higher individual gain, addressing real-world dilemmas seen in distributed systems. The study extends to scenarios with exponentially growing agent populations ($N \longrightarrow +\infty$), where traditional computation and equilibrium determination are challenging. Leveraging mean-field game theory, equilibrium solutions and reward structures are established for infinitely large agent sets in repeated games. Finally, practical insights are offered through simulations using the Multi Agent-Posthumous Credit Assignment trainer, and the paper explores adapting simulation algorithms to create scenarios favoring cooperation for group rewards. These practical implementations bridge theoretical concepts with real-world applications.

Less Is More: Robust Robot Learning via Partially Observable Multi-Agent Reinforcement Learning

Authors:Wenshuai Zhao, Eetu-Aleksi Rantala, Joni Pajarinen, Jorge Peña Queralta
Date:2023-09-26 09:40:35

In many multi-agent and high-dimensional robotic tasks, the controller can be designed in either a centralized or decentralized way. Correspondingly, it is possible to use either single-agent reinforcement learning (SARL) or multi-agent reinforcement learning (MARL) methods to learn such controllers. However, the relationship between these two paradigms remains under-studied in the literature. This work explores research questions in terms of robustness and performance of SARL and MARL approaches to the same task, in order to gain insight into the most suitable methods. We start by analytically showing the equivalence between these two paradigms under the full-state observation assumption. Then, we identify a broad subclass of \textit{Dec-POMDP} tasks where the agents are weakly or partially interacting. In these tasks, we show that partial observations of each agent are sufficient for near-optimal decision-making. Furthermore, we propose to exploit such partially observable MARL to improve the robustness of robots when joint or agent failures occur. Our experiments on both simulated multi-agent tasks and a real robot task with a mobile manipulator validate the presented insights and the effectiveness of the proposed robust robot learning method via partially observable MARL.

Effective Multi-Agent Deep Reinforcement Learning Control with Relative Entropy Regularization

Authors:Chenyang Miao, Yunduan Cui, Huiyun Li, Xinyu Wu
Date:2023-09-26 07:38:19

In this paper, a novel Multi-agent Reinforcement Learning (MARL) approach, Multi-Agent Continuous Dynamic Policy Gradient (MACDPP), is proposed to tackle the issues of limited capability and sample efficiency in various scenarios controlled by multiple agents. It alleviates the inconsistency of multiple agents' policy updates by introducing relative entropy regularization into the Centralized Training with Decentralized Execution (CTDE) framework with an Actor-Critic (AC) structure. Evaluated on multi-agent cooperation and competition tasks as well as traditional control tasks, including OpenAI benchmarks and robot arm manipulation, MACDPP demonstrates significant superiority in learning capability and sample efficiency compared with both related multi-agent and widely implemented single-agent baselines, and therefore expands the potential of MARL in effectively learning challenging control scenarios.
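
A minimal sketch of a relative-entropy-regularized actor step under CTDE, assuming Gaussian policies and a stand-in critic; the penalty coefficient, the unit variance, and all interfaces are our assumptions, not MACDPP's exact update.

import torch
import torch.nn as nn
import torch.distributions as D

actor = nn.Linear(8, 2)                    # outputs the mean of a Gaussian policy
old_mean = torch.randn(32, 2)              # snapshot of the policy before the update
obs = torch.randn(32, 8)

mean = actor(obs)
pi, pi_old = D.Normal(mean, 1.0), D.Normal(old_mean, 1.0)
q_value = -(mean ** 2).sum(-1).mean()      # stand-in for the centralized critic
kl = D.kl_divergence(pi, pi_old).sum(-1).mean()

# maximize the critic value while keeping the new policy close to the old one,
# which damps inconsistent simultaneous updates across agents
loss = -(q_value - 0.1 * kl)
loss.backward()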

Boosting Studies of Multi-Agent Reinforcement Learning on Google Research Football Environment: the Past, Present, and Future

Authors:Yan Song, He Jiang, Haifeng Zhang, Zheng Tian, Weinan Zhang, Jun Wang
Date:2023-09-22 15:50:07

Even though Google Research Football (GRF) was initially benchmarked and studied as a single-agent environment in its original paper, recent years have witnessed an increasing focus on its multi-agent nature by researchers utilizing it as a testbed for Multi-Agent Reinforcement Learning (MARL). However, the absence of standardized environment settings and unified evaluation metrics for multi-agent scenarios hampers the consistent understanding of various studies. Furthermore, the challenging 5-vs-5 and 11-vs-11 full-game scenarios have received limited thorough examination due to their substantial training complexities. To address these gaps, this paper extends the original environment by not only standardizing the environment settings and benchmarking cooperative learning algorithms across different scenarios, including the most challenging full-game scenarios, but also by discussing approaches to enhance football AI from diverse perspectives and introducing related research tools. Specifically, we provide a distributed and asynchronous population-based self-play framework with diverse pre-trained policies for faster training, two football-specific analytical tools for deeper investigation, and an online leaderboard for broader evaluation. The overall expectation of this work is to advance the study of Multi-Agent Reinforcement Learning on Google Research Football environment, with the ultimate goal of benefiting real-world sports beyond virtual games.

Characterizing Speed Performance of Multi-Agent Reinforcement Learning

Authors:Samuel Wiggins, Yuan Meng, Rajgopal Kannan, Viktor Prasanna
Date:2023-09-13 17:26:36

Multi-Agent Reinforcement Learning (MARL) has achieved significant success in large-scale AI systems and big-data applications such as smart grids, surveillance, etc. Existing advancements in MARL algorithms focus on improving the rewards obtained by introducing various mechanisms for inter-agent cooperation. However, these optimizations are usually compute- and memory-intensive, thus leading to suboptimal speed performance in end-to-end training time. In this work, we analyze the speed performance (i.e., latency-bounded throughput) as the key metric in MARL implementations. Specifically, we first introduce a taxonomy of MARL algorithms from an acceleration perspective categorized by (1) training scheme and (2) communication method. Using our taxonomy, we identify three state-of-the-art MARL algorithms - Multi-Agent Deep Deterministic Policy Gradient (MADDPG), Target-oriented Multi-agent Communication and Cooperation (ToM2C), and Networked Multi-Agent RL (NeurComm) - as target benchmark algorithms, and provide a systematic analysis of their performance bottlenecks on a homogeneous multi-core CPU platform. We justify the need for MARL latency-bounded throughput to be a key performance metric in future literature while also addressing opportunities for parallelization and acceleration.

Privacy-Engineered Value Decomposition Networks for Cooperative Multi-Agent Reinforcement Learning

Authors:Parham Gohari, Matthew Hale, Ufuk Topcu
Date:2023-09-13 02:50:57

In cooperative multi-agent reinforcement learning (Co-MARL), a team of agents must jointly optimize the team's long-term rewards to learn a designated task. Optimizing rewards as a team often requires inter-agent communication and data sharing, leading to potential privacy implications. We assume privacy considerations prohibit the agents from sharing their environment interaction data. Accordingly, we propose Privacy-Engineered Value Decomposition Networks (PE-VDN), a Co-MARL algorithm that models multi-agent coordination while provably safeguarding the confidentiality of the agents' environment interaction data. We integrate three privacy-engineering techniques to redesign the data flows of the VDN algorithm, an existing Co-MARL algorithm that consolidates the agents' environment interaction data to train a central controller that models multi-agent coordination. First, we design a distributed computation scheme that eliminates Vanilla VDN's dependency on sharing environment interaction data. Then, we utilize a privacy-preserving multi-party computation protocol to guarantee that the data flows of the distributed computation scheme do not pose new privacy risks. Finally, we enforce differential privacy to preempt inference threats against the agents' training data, i.e., their past environment interactions, when they take actions based on their neural network predictions. We implement PE-VDN in the StarCraft Multi-Agent Challenge (SMAC) and show that it achieves 80% of Vanilla VDN's win rate while maintaining differential privacy levels that provide meaningful privacy guarantees. The results demonstrate that PE-VDN can safeguard the confidentiality of agents' environment interaction data without sacrificing multi-agent coordination.

Emergent Communication in Multi-Agent Reinforcement Learning for Future Wireless Networks

Authors:Marwa Chafii, Salmane Naoumi, Reda Alami, Ebtesam Almazrouei, Mehdi Bennis, Merouane Debbah
Date:2023-09-12 07:40:53

In different wireless network scenarios, multiple network entities need to cooperate in order to achieve a common task with minimum delay and energy consumption. Future wireless networks mandate exchanging high dimensional data in dynamic and uncertain environments, therefore implementing communication control tasks becomes challenging and highly complex. Multi-agent reinforcement learning with emergent communication (EC-MARL) is a promising solution to address high dimensional continuous control problems with partially observable states in a cooperative fashion where agents build an emergent communication protocol to solve complex tasks. This paper articulates the importance of EC-MARL within the context of future 6G wireless networks, which imbues autonomous decision-making capabilities into network entities to solve complex tasks such as autonomous driving, robot navigation, flying base stations network planning, and smart city applications. An overview of EC-MARL algorithms and their design criteria are provided while presenting use cases and research opportunities on this emerging topic.

Decentralized Multi-agent Reinforcement Learning based State-of-Charge Balancing Strategy for Distributed Energy Storage System

Authors:Zheng Xiong, Biao Luo, Bing-Chuan Wang, Xiaodong Xu, Xiaodong Liu, Tingwen Huang
Date:2023-08-29 15:48:49

This paper develops a Decentralized Multi-Agent Reinforcement Learning (Dec-MARL) method to solve the SoC balancing problem in a distributed energy storage system (DESS). First, the SoC balancing problem is formulated as a finite Markov decision process with action constraints derived from demand balance, which can be solved by Dec-MARL. Specifically, a first-order average consensus algorithm is utilized to expand the observations of the DESS state in a fully decentralized way, and the initial actions (i.e., output power) are decided by the agents (i.e., energy storage units) according to these observations. In order to bring the final actions into the allowable range, a counterfactual demand balance algorithm is proposed to balance the total demand and the initial actions. Next, the agents execute the final actions and get local rewards from the environment, and the DESS steps into the next state. Finally, through the first-order average consensus algorithm, the agents obtain the average reward and the expanded observation of the next state for later training. Through the above procedure, Dec-MARL achieves outstanding performance in a fully decentralized system without any expert experience or any complicated model. Moreover, it is flexible and can be extended straightforwardly to other decentralized multi-agent systems. Extensive simulations have validated the effectiveness and efficiency of Dec-MARL.
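
The first-order average consensus step is simple enough to show directly: repeated neighbor averaging drives every agent's local estimate to the network-wide mean without a central coordinator. The ring topology and the readings below are illustrative, not the paper's setup.

import numpy as np

n = 5
local_obs = np.array([0.2, 0.9, 0.4, 0.7, 0.3])    # e.g., local SoC readings
A = np.eye(n)                                      # ring communication graph
for i in range(n):
    A[i, (i - 1) % n] = A[i, (i + 1) % n] = 1.0
W = A / A.sum(axis=1, keepdims=True)               # doubly stochastic weights here

x = local_obs.copy()
for _ in range(50):
    x = W @ x                                      # one consensus iteration

print("true mean:", local_obs.mean(), "consensus estimates:", np.round(x, 4))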

Recent Progress in Energy Management of Connected Hybrid Electric Vehicles Using Reinforcement Learning

Authors:Min Hua, Bin Shuai, Quan Zhou, Jinhai Wang, Yinglong He, Hongming Xu
Date:2023-08-28 14:12:52

The growing adoption of hybrid electric vehicles (HEVs) presents a transformative opportunity for revolutionizing transportation energy systems. The shift towards electrifying transportation aims to curb environmental concerns related to fossil fuel consumption, which necessitates efficient energy management systems (EMS) to optimize energy efficiency. The evolution of EMS from HEVs to connected hybrid electric vehicles (CHEVs) represents a pivotal shift. Beyond individual HEVs, EMS now confronts the intricate energy cooperation requirements of CHEVs, necessitating advanced algorithms for route optimization, charging coordination, and load distribution. Challenges persist in both domains, including optimal energy utilization for HEVs and cooperative eco-driving control (CED) for CHEVs across diverse vehicle types. Reinforcement learning (RL) stands out as a promising tool for addressing these challenges. Specifically, within the realm of CHEVs, the application of multi-agent reinforcement learning (MARL) emerges as a powerful approach for effectively tackling the intricacies of CED control. Despite extensive research, few reviews span from individual vehicles to multi-vehicle scenarios. This review bridges that gap, highlighting challenges, advancements, and the potential contributions of RL-based solutions for future sustainable transportation systems.

Policy Diversity for Cooperative Agents

Authors:Mingxi Tan, Andong Tian, Ludovic Denoyer
Date:2023-08-28 05:23:16

Standard cooperative multi-agent reinforcement learning (MARL) methods aim to find the optimal team cooperative policy for completing a task. However, there may exist multiple different ways of cooperating, and domain experts often need to see them. Identifying a set of significantly different policies can therefore alleviate the task complexity for them. Unfortunately, there is a general lack of effective policy diversity approaches specifically designed for the multi-agent domain. In this work, we propose a method called Moment-Matching Policy Diversity to alleviate this problem. This method can generate team policies that differ to varying degrees by formalizing the difference between team policies as the difference in the actions of selected agents under the different policies. Theoretically, we show that our method is a simple way to implement a constrained optimization problem that regularizes the difference between two trajectory distributions by using the maximum mean discrepancy. The effectiveness of our approach is demonstrated on a challenging team-based shooter.
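
The maximum mean discrepancy term itself is standard and easy to sketch: an RBF-kernel MMD between batches of the selected agents' actions drawn from two team policies, to be used as a diversity regularizer or constraint. Batch shapes and the kernel bandwidth below are assumptions for illustration.

import torch

def mmd_rbf(x, y, sigma=1.0):
    """Biased MMD^2 estimate between sample batches x and y, RBF kernel."""
    def k(a, b):
        d2 = (a.unsqueeze(1) - b.unsqueeze(0)).pow(2).sum(-1)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

acts_pi1 = torch.randn(64, 3)         # selected agents' actions under policy 1
acts_pi2 = torch.randn(64, 3) + 0.5   # the same agents' actions under policy 2
print("MMD^2 estimate:", mmd_rbf(acts_pi1, acts_pi2).item())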

MARL for Decentralized Electric Vehicle Charging Coordination with V2V Energy Exchange

Authors:Jiarong Fan, Hao Wang, Ariel Liebman
Date:2023-08-27 14:06:21

Effective energy management of electric vehicle (EV) charging stations is critical to supporting the transport sector's sustainable energy transition. This paper addresses EV charging coordination by considering vehicle-to-vehicle (V2V) energy exchange as the flexibility to harness in EV charging stations. Moreover, it takes into account EV user experience, such as charging satisfaction and fairness. We propose a Multi-Agent Reinforcement Learning (MARL) approach to coordinate EV charging with V2V energy exchange while considering uncertainties in EV arrival times, energy prices, and solar energy generation. The exploration capability of MARL is enhanced by introducing parameter noise into MARL's neural network models. Experimental results demonstrate the superior performance and scalability of our proposed method compared to traditional optimization baselines. The decentralized execution of the algorithm enables it to effectively handle partial system faults in the charging station.

Collaborative Information Dissemination with Graph-based Multi-Agent Reinforcement Learning

Authors:Raffaele Galliera, Kristen Brent Venable, Matteo Bassani, Niranjan Suri
Date:2023-08-25 21:30:16

Efficient information dissemination is crucial for supporting critical operations across domains like disaster response, autonomous vehicles, and sensor networks. This paper introduces a Multi-Agent Reinforcement Learning (MARL) approach as a significant step forward in achieving more decentralized, efficient, and collaborative information dissemination. We propose a Partially Observable Stochastic Game (POSG) formulation for information dissemination, empowering each agent to decide on message forwarding independently, based on the observation of its one-hop neighborhood. This constitutes a significant paradigm shift from heuristics currently employed in real-world broadcast protocols. Our novel approach harnesses Graph Convolutional Reinforcement Learning and Graph Attention Networks (GATs) with dynamic attention to capture essential network features. We propose two approaches, L-DyAN and HL-DyAN, which differ in terms of the information exchanged among agents. Our experimental results show that our trained policies outperform existing methods, including the state-of-the-art heuristic, in terms of network coverage as well as communication overhead on dynamic networks of varying density and behavior.

Learning Cyber Defence Tactics from Scratch with Multi-Agent Reinforcement Learning

Authors:Jacob Wiebe, Ranwa Al Mallah, Li Li
Date:2023-08-25 14:07:50

Recent advancements in deep learning techniques have opened new possibilities for designing solutions for autonomous cyber defence. Teams of intelligent agents in computer network defence roles may reveal promising avenues to safeguard cyber and kinetic assets. In a simulated game environment, agents are evaluated on their ability to jointly mitigate attacker activity in host-based defence scenarios. Defender systems are evaluated against heuristic attackers with the goals of compromising network confidentiality, integrity, and availability. Value-based Independent Learning and Centralized Training Decentralized Execution (CTDE) cooperative Multi-Agent Reinforcement Learning (MARL) methods are compared, revealing that both approaches outperform a simple multi-agent heuristic defender. This work demonstrates the ability of cooperative MARL to learn effective cyber defence tactics against varied threats.

An Efficient Distributed Multi-Agent Reinforcement Learning for EV Charging Network Control

Authors:Amin Shojaeighadikolaei, Morteza Hashemi
Date:2023-08-24 16:53:52

The increasing trend in adopting electric vehicles (EVs) will significantly impact residential electricity demand, increasing the risk of transformer overload in the distribution grid. To mitigate such risks, there is an urgent need to develop effective EV charging controllers. Currently, most EV charging controllers are based on a centralized approach for managing individual EVs or groups of EVs. In this paper, we introduce a decentralized Multi-agent Reinforcement Learning (MARL) charging framework that prioritizes the preservation of privacy for EV owners. We employ the Centralized Training Decentralized Execution-Deep Deterministic Policy Gradient (CTDE-DDPG) scheme, which provides valuable information to users during training while maintaining privacy during execution. Our results demonstrate that the CTDE framework improves the performance of the charging network by reducing network costs. Moreover, we show that the Peak-to-Average Ratio (PAR) of the total demand is reduced, which, in turn, reduces the risk of transformer overload during peak hours.
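
For reference, the Peak-to-Average Ratio reported above is simply the peak total demand divided by the mean total demand over a horizon; a lower PAR means a flatter load curve. A minimal illustration with hypothetical numbers:

```python
import numpy as np

def peak_to_average_ratio(total_demand: np.ndarray) -> float:
    """total_demand: aggregate load per time slot, shape (T,)."""
    return float(total_demand.max() / total_demand.mean())

demand = np.array([3.1, 4.0, 9.5, 4.2, 3.8])  # hypothetical kW per hour
print(peak_to_average_ratio(demand))           # ~1.93
```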

Perimeter Control with Heterogeneous Metering Rates for Cordon Signals: A Physics-Regularized Multi-Agent Reinforcement Learning Approach

Authors:Jiajie Yu, Pierre-Antoine Laharotte, Yu Han, Wei Ma, Ludovic Leclercq
Date:2023-08-24 13:51:16

Perimeter Control (PC) strategies have been proposed to address urban road network control in oversaturated situations by regulating the transfer flow of the Protected Network (PN) based on the Macroscopic Fundamental Diagram (MFD). The uniform metering rate for cordon signals in most existing studies overlooks the variance of local traffic states at the intersection level, which may cause severe local traffic congestion and degradation of network stability. PC strategies with heterogeneous metering rates for cordon signals allow precise control of the perimeter, but the complexity of the problem increases exponentially with the scale of the PN. This paper leverages a Multi-Agent Reinforcement Learning (MARL)-based traffic signal control framework to decompose this PC problem, which considers heterogeneous metering rates for cordon signals, into multi-agent cooperation tasks. Each agent controls an individual signal located in the cordon, decreasing the dimension of the action space for the controller compared to centralized methods. A physics regularization approach for the MARL framework is proposed to ensure the distributed cordon signal controllers are aware of the global network state by encoding MFD-based knowledge into the action-value functions of the local agents. The proposed PC strategy operates as a two-stage system: a feedback PC strategy detects the overall traffic state within the PN and then distributes local instructions to cordon signal controllers in the MARL framework via the physics regularization. Through numerical tests with different demand patterns in a microscopic traffic environment, the proposed PC strategy shows promising robustness and transferability. It outperforms state-of-the-art feedback PC strategies in increasing network throughput, decreasing distributed delay for gate links, and reducing carbon emissions.
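
One way to read the physics regularization above is as an extra term in each local agent's loss that pulls its action values toward the MFD-derived instruction. The sketch below is an interpretation under stated assumptions; the penalty form, names, and signatures are illustrative, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def physics_regularized_loss(q_values, td_targets, actions,
                             mfd_preferred_actions, beta=0.1):
    """q_values: (B, n_actions); actions, mfd_preferred_actions: (B,) ints."""
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    td_loss = F.mse_loss(q_taken, td_targets)          # usual TD objective
    # Encourage the greedy action to match the MFD-based global instruction.
    physics_loss = F.cross_entropy(q_values, mfd_preferred_actions)
    return td_loss + beta * physics_loss
```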

Using PIC and PIC-MHD to investigate cosmic ray acceleration in mildly relativistic shocks

Authors:Artem Bohdan, Anabella Araudo, Allard Jan van Marle, Fabien Casse, Alexandre Marcowith
Date:2023-08-24 11:40:35

Astrophysical shocks create cosmic rays by accelerating charged particles to relativistic speeds. However, the relative contribution of various types of shocks to the cosmic ray spectrum is still the subject of ongoing debate. Numerical studies have shown that in the non-relativistic regime, oblique shocks are capable of accelerating cosmic rays, depending on the Alfv\'enic Mach number of the shock. We now seek to extend this study into the mildly relativistic regime. In this case, the dependence of the ion reflection rate on the shock obliquity differs from the non-relativistic regime. Faster relativistic shocks are perpendicular for the majority of shock obliquity angles; therefore, their ability to initiate efficient DSA is limited. We define the ion injection rate using a fully kinetic PIC simulation, in which we follow the formation of the shock and determine the fraction of ions that becomes involved in the formation of the shock precursor in the mildly relativistic regime, covering a Lorentz factor range from 1 to 3. Then, with this result, we use a combined PIC-MHD method to model the large-scale evolution of the shock with an ion injection recipe dependent on the local shock obliquity. This methodology accounts for the influence of self-generated or pre-existing upstream turbulence on the shock obliquity, which allows substantially larger and longer simulations than classical hybrid techniques.

${\rm E}(3)$-Equivariant Actor-Critic Methods for Cooperative Multi-Agent Reinforcement Learning

Authors:Dingyang Chen, Qi Zhang
Date:2023-08-23 00:18:17

Identification and analysis of symmetrical patterns in the natural world have led to significant discoveries across various scientific fields, such as the formulation of gravitational laws in physics and advancements in the study of chemical structures. In this paper, we focus on exploiting Euclidean symmetries inherent in certain cooperative multi-agent reinforcement learning (MARL) problems and prevalent in many applications. We begin by formally characterizing a subclass of Markov games with a general notion of symmetries that admits the existence of symmetric optimal values and policies. Motivated by these properties, we design neural network architectures with symmetric constraints embedded as an inductive bias for multi-agent actor-critic methods. This inductive bias results in superior performance in various cooperative MARL benchmarks and impressive generalization capabilities such as zero-shot learning and transfer learning in unseen scenarios with repeated symmetric patterns. The code is available at: https://github.com/dchen48/E3AC.

FoX: Formation-aware exploration in multi-agent reinforcement learning

Authors:Yonghyeon Jo, Sunwoo Lee, Junghyuk Yeom, Seungyul Han
Date:2023-08-22 08:39:44

Recently, deep multi-agent reinforcement learning (MARL) has gained significant popularity due to its success in various cooperative multi-agent tasks. However, exploration remains a challenging problem in MARL due to the partial observability of the agents and an exploration space that can grow exponentially as the number of agents increases. Firstly, in order to address the scalability issue of the exploration space, we define a formation-based equivalence relation on the exploration space and aim to reduce the search space by exploring only meaningful states in different formations. Then, we propose a novel formation-aware exploration (FoX) framework that encourages partially observable agents to visit states in diverse formations by guiding them to be well aware of their current formation solely based on their own observations. Numerical results show that the proposed FoX framework significantly outperforms state-of-the-art MARL algorithms on Google Research Football (GRF) and sparse StarCraft II multi-agent challenge (SMAC) tasks.

Towards Few-shot Coordination: Revisiting Ad-hoc Teamplay Challenge In the Game of Hanabi

Authors:Hadi Nekoei, Xutong Zhao, Janarthanan Rajendran, Miao Liu, Sarath Chandar
Date:2023-08-20 14:44:50

Cooperative Multi-agent Reinforcement Learning (MARL) algorithms with Zero-Shot Coordination (ZSC) have gained significant attention in recent years. ZSC refers to the ability of agents to coordinate zero-shot (without additional interaction experience) with independently trained agents. While ZSC is crucial for cooperative MARL agents, it might not be possible for complex tasks and changing environments. Agents also need to adapt and improve their performance with minimal interaction with other agents. In this work, we show empirically that state-of-the-art ZSC algorithms have poor performance when paired with agents trained with different learning methods, and they require millions of interaction samples to adapt to these new partners. To investigate this issue, we formally define a framework based on Hanabi, a popular cooperative multi-agent game, to evaluate the adaptability of MARL methods. In particular, we create a diverse set of pre-trained agents and define a new metric, adaptation regret, that measures an agent's ability to efficiently adapt and improve its coordination performance when paired with a held-out pool of partners, on top of its ZSC performance. After evaluating several SOTA algorithms using our framework, our experiments reveal that naive Independent Q-Learning (IQL) agents in most cases adapt as quickly as the SOTA ZSC algorithm Off-Belief Learning (OBL). This finding raises an interesting research question: how can we design MARL algorithms with high ZSC performance that also adapt quickly to unseen partners? As a first step, we study the role of different hyper-parameters and design choices on the adaptability of current MARL algorithms. Our experiments show that two categories of hyper-parameters, controlling training data diversity and the optimization process, have a significant impact on the adaptability of Hanabi agents.

Intelligent Communication Planning for Constrained Environmental IoT Sensing with Reinforcement Learning

Authors:Yi Hu, Jinhang Zuo, Bob Iannucci, Carlee Joe-Wong
Date:2023-08-19 22:59:09

Internet of Things (IoT) technologies have enabled numerous data-driven mobile applications and have the potential to significantly improve environmental monitoring and hazard warnings through the deployment of a network of IoT sensors. However, these IoT devices are often power-constrained and utilize wireless communication schemes with limited bandwidth. Such power constraints limit the amount of information each device can share across the network, while bandwidth limitations hinder sensors' coordination of their transmissions. In this work, we formulate the communication planning problem of IoT sensors that track the state of the environment. We seek to optimize sensors' decisions in collecting environmental data under stringent resource constraints. We propose a multi-agent reinforcement learning (MARL) method to find the optimal communication policies for each sensor that maximize the tracking accuracy subject to the power and bandwidth limitations. MARL learns and exploits the spatial-temporal correlation of the environmental data at each sensor's location to reduce redundant reports from the sensors. Experiments on wildfire spread with LoRa wireless network simulators show that our MARL method can learn to balance the need to collect enough data to predict wildfire spread against unknown bandwidth limitations.

DPMAC: Differentially Private Communication for Cooperative Multi-Agent Reinforcement Learning

Authors:Canzhe Zhao, Yanjie Ze, Jing Dong, Baoxiang Wang, Shuai Li
Date:2023-08-19 04:26:23

Communication lays the foundation for cooperation in human society and in multi-agent reinforcement learning (MARL). Humans also desire to maintain their privacy when communicating with others, yet such privacy concerns have not been considered in existing MARL work. To this end, we propose the \textit{differentially private multi-agent communication} (DPMAC) algorithm, which protects the sensitive information of individual agents by equipping each agent with a local message sender with a rigorous $(\epsilon, \delta)$-differential privacy (DP) guarantee. In contrast to directly perturbing the messages with predefined DP noise, as commonly done in privacy-preserving scenarios, we adopt a stochastic message sender for each agent and incorporate the DP requirement into the sender, which automatically adjusts the learned message distribution to alleviate the instability caused by DP noise. Further, we prove the existence of a Nash equilibrium in cooperative MARL with privacy-preserving communication, which suggests that this problem is game-theoretically learnable. Extensive experiments demonstrate a clear advantage of DPMAC over baseline methods in privacy-preserving scenarios.
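
For context, DPMAC learns a stochastic message sender with the DP requirement built in; the classic alternative it contrasts against is clipping each message and adding calibrated Gaussian noise directly. A hedged sketch of that baseline mechanism follows; the sigma formula is the standard analytic Gaussian-mechanism calibration (valid for small epsilon), and all names are illustrative assumptions.

```python
import math
import torch

def private_message(msg: torch.Tensor, clip: float = 1.0,
                    eps: float = 1.0, delta: float = 1e-5) -> torch.Tensor:
    """Clip-and-noise Gaussian mechanism on one outgoing message."""
    norm = msg.norm(p=2).clamp(min=1e-12)
    msg = msg * (clip / norm).clamp(max=1.0)  # bound L2 norm by `clip`
    sensitivity = 2.0 * clip                  # swapping a message moves it <= 2*clip
    sigma = math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / eps
    return msg + torch.randn_like(msg) * sigma
```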

Heterogeneous Multi-Agent Reinforcement Learning via Mirror Descent Policy Optimization

Authors:Mohammad Mehdi Nasiri, Mansoor Rezghi
Date:2023-08-13 10:18:10

This paper presents an extension of the Mirror Descent method to overcome challenges in cooperative Multi-Agent Reinforcement Learning (MARL) settings, where agents have varying abilities and individual policies. The proposed Heterogeneous-Agent Mirror Descent Policy Optimization (HAMDPO) algorithm utilizes the multi-agent advantage decomposition lemma to enable efficient policy updates for each agent while ensuring overall performance improvements. By iteratively updating agent policies through an approximate solution of the trust-region problem, HAMDPO guarantees stability and improves performance. Moreover, the HAMDPO algorithm is capable of handling both continuous and discrete action spaces for heterogeneous agents in various MARL problems. We evaluate HAMDPO on Multi-Agent MuJoCo and StarCraftII tasks, demonstrating its superiority over state-of-the-art algorithms such as HATRPO and HAPPO. These results suggest that HAMDPO is a promising approach for solving cooperative MARL problems and could potentially be extended to address other challenging problems in the field of MARL.

The Impact of Overall Optimization on Warehouse Automation

Authors:Hiroshi Yoshitake, Pieter Abbeel
Date:2023-08-11 09:31:42

In this study, we propose a novel approach for investigating the optimization performance achievable through flexible robot coordination in automated warehouses under multi-agent reinforcement learning (MARL)-based control. Automated systems using robots are expected to achieve more efficient operations than manual systems in terms of overall optimization performance. However, the impact of overall optimization on performance remains unclear in most automated systems due to a lack of suitable control methods. Thus, we propose a centralized training and decentralized execution MARL framework as a practical overall optimization control method. Within this framework, we also propose a single shared critic, trained with global states and rewards, that is applicable to cases in which heterogeneous agents make decisions asynchronously. We apply the proposed MARL framework to task selection for material handling equipment in an automated order-picking simulation and evaluate how far overall optimization outperforms partial optimization by comparing it with other MARL frameworks and rule-based control methods.

GraphCC: A Practical Graph Learning-based Approach to Congestion Control in Datacenters

Authors:Guillermo Bernárdez, José Suárez-Varela, Xiang Shi, Shihan Xiao, Xiangle Cheng, Pere Barlet-Ros, Albert Cabellos-Aparicio
Date:2023-08-09 12:04:41

Congestion Control (CC) plays a fundamental role in optimizing traffic in Data Center Networks (DCN). Currently, DCNs mainly implement two main CC protocols: DCTCP and DCQCN. Both protocols -- and their main variants -- are based on Explicit Congestion Notification (ECN), where intermediate switches mark packets when they detect congestion. The ECN configuration is thus a crucial aspect of the performance of CC protocols. Nowadays, network experts set static ECN parameters carefully selected to optimize average network performance. However, today's high-speed DCNs experience quick and abrupt changes that severely alter the network state (e.g., dynamic traffic workloads, incast events, failures). This leads to under-utilization and sub-optimal performance. This paper presents GraphCC, a novel Machine Learning-based framework for in-network CC optimization. Our distributed solution relies on a novel combination of Multi-agent Reinforcement Learning (MARL) and Graph Neural Networks (GNN), and it is compatible with widely deployed ECN-based CC protocols. GraphCC deploys distributed agents on switches that communicate with their neighbors to cooperate and optimize the global ECN configuration. In our evaluation, we test the performance of GraphCC under a wide variety of scenarios, focusing on the capability of this solution to adapt to new scenarios unseen during training (e.g., new traffic workloads, failures, upgrades). We compare GraphCC with a state-of-the-art MARL-based solution for ECN tuning -- ACC -- and observe that our proposed solution outperforms the state-of-the-art baseline in all of the evaluation scenarios, showing improvements up to $20\%$ in Flow Completion Time as well as significant reductions in buffer occupancy ($38.0-85.7\%$).

Communication-Efficient Cooperative Multi-Agent PPO via Regulated Segment Mixture in Internet of Vehicles

Authors:Xiaoxue Yu, Rongpeng Li, Fei Wang, Chenghui Peng, Chengchao Liang, Zhifeng Zhao, Honggang Zhang
Date:2023-08-08 11:50:53

Multi-Agent Reinforcement Learning (MARL) has become a classic paradigm for solving diverse, intelligent control tasks like autonomous driving in the Internet of Vehicles (IoV). However, the widely assumed existence of a central node to implement centralized federated learning-assisted MARL might be impractical in highly dynamic scenarios, and the excessive communication overheads may overwhelm the IoV system. Therefore, in this paper, we design a communication-efficient cooperative MARL algorithm, named RSM-MAPPO, to reduce the communication overheads in a fully distributed architecture. In particular, RSM-MAPPO enhances multi-agent Proximal Policy Optimization (PPO) by incorporating the idea of segment mixture and augmenting multiple model replicas from received neighboring policy segments. Afterwards, RSM-MAPPO adopts a theory-guided metric to regulate the selection of contributive replicas to guarantee policy improvement. Finally, extensive simulations in a mixed-autonomy traffic control scenario verify the effectiveness of the RSM-MAPPO algorithm.

Cooperative Multi-Type Multi-Agent Deep Reinforcement Learning for Resource Management in Space-Air-Ground Integrated Networks

Authors:Hengxi Zhang, Huaze Tang, Wenbo Ding, Xiao-Ping Zhang
Date:2023-08-08 02:15:06

The Space-Air-Ground Integrated Network (SAGIN), integrating heterogeneous devices including low earth orbit (LEO) satellites, unmanned aerial vehicles (UAVs), and ground users (GUs), holds significant promise for advancing smart city applications. However, resource management of the SAGIN is a pressing challenge: inappropriate resource management causes poor data transmission and hence degrades services in smart cities. In this paper, we develop a comprehensive SAGIN system that encompasses five distinct communication links and propose an efficient cooperative multi-type multi-agent deep reinforcement learning (CMT-MARL) method to address the resource management issue. The experimental results highlight the efficacy of the proposed CMT-MARL, as evidenced by key performance indicators such as the overall transmission rate and transmission success rate. These results underscore the potential value and feasibility of future implementation of the SAGIN.

Mamba: Bringing Multi-Dimensional ABR to WebRTC

Authors:Yueheng Li, Zicheng Zhang, Hao Chen, Zhan Ma
Date:2023-08-07 14:47:45

Contemporary real-time video communication systems, such as WebRTC, use an adaptive bitrate (ABR) algorithm to assure high-quality and low-delay services, e.g., promptly adjusting video bitrate according to the instantaneous network bandwidth. However, target bitrate decisions in the network and bitrate control in the codec are typically uncoordinated, and ignoring the effect of inappropriate resolution and frame rate settings further compromises bitrate control, severely degrading the quality of experience (QoE). To tackle these challenges, we propose Mamba, an end-to-end multi-dimensional ABR algorithm that utilizes multi-agent reinforcement learning (MARL) to maximize the user's QoE by adaptively and collaboratively adjusting encoding factors, including the quantization parameters (QP), resolution, and frame rate, based on observed states such as network conditions and video complexity information in a video conferencing system. We also introduce curriculum learning to improve the training efficiency of MARL. Both the in-lab and real-world evaluation results demonstrate the remarkable efficacy of Mamba.

Asynchronous Decentralized Q-Learning: Two Timescale Analysis By Persistence

Authors:Bora Yongacoglu, Gürdal Arslan, Serdar Yüksel
Date:2023-08-07 01:32:09

Non-stationarity is a fundamental challenge in multi-agent reinforcement learning (MARL), where agents update their behaviour as they learn. Many theoretical advances in MARL avoid the challenge of non-stationarity by coordinating the policy updates of agents in various ways, including synchronizing times at which agents are allowed to revise their policies. Synchronization enables analysis of many MARL algorithms via multi-timescale methods, but such synchrony is infeasible in many decentralized applications. In this paper, we study an asynchronous variant of the decentralized Q-learning algorithm, a recent MARL algorithm for stochastic games. We provide sufficient conditions under which the asynchronous algorithm drives play to equilibrium with high probability. Our solution utilizes constant learning rates in the Q-factor update, which we show to be critical for relaxing the synchrony assumptions of earlier work. Our analysis also applies to asynchronous generalizations of a number of other algorithms from the regret testing tradition, whose performance is analyzed by multi-timescale methods that study Markov chains obtained via policy update dynamics. This work extends the applicability of the decentralized Q-learning algorithm and its relatives to settings in which parameters are selected in an independent manner, and tames non-stationarity without imposing the coordination assumptions of prior work.
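
The constant learning rate that the analysis above hinges on is easy to picture: each agent performs ordinary Q-factor updates from its own local experience with a fixed step size, never decaying it. A minimal single-agent-view sketch (illustrative only; the paper's full algorithm also governs when policies may be revised):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.05, gamma=0.95):
    """Q: (n_states, n_actions) table for one agent; alpha held constant."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])  # constant step size, no decay
    return Q
```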

Communication-Efficient Decentralized Multi-Agent Reinforcement Learning for Cooperative Adaptive Cruise Control

Authors:Dong Chen, Kaixiang Zhang, Yongqiang Wang, Xunyuan Yin, Zhaojian Li, Dimitar Filev
Date:2023-08-04 14:19:36

Connected and autonomous vehicles (CAVs) promise next-gen transportation systems with enhanced safety, energy efficiency, and sustainability. One typical control strategy for CAVs is the so-called cooperative adaptive cruise control (CACC), where vehicles drive in platoons and cooperate to achieve safe and efficient transportation. In this study, we formulate CACC as a multi-agent reinforcement learning (MARL) problem. Diverging from existing MARL methods that use centralized training and decentralized execution, which require not only a centralized communication mechanism but also dense inter-agent communication during training and online adaptation, we propose a fully decentralized MARL framework for enhanced efficiency and scalability. In addition, a quantization-based communication scheme is proposed to reduce the communication overhead without significantly degrading the control performance. This is achieved by employing randomized rounding to quantize each piece of communicated information and communicating only non-zero components after quantization. Extensive experimentation in two distinct CACC settings reveals that the proposed MARL framework consistently achieves superior performance over several contemporary benchmarks in terms of both communication efficiency and control efficacy. In the appendix, we show that our proposed framework's applicability extends beyond CACC, showing promise for broader intelligent transportation systems with intricate action and state spaces.
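
The quantization scheme described above can be made unbiased with randomized rounding: each value is rounded up or down to the nearest grid point with probability proportional to its distance, and only non-zero components are sent. A hedged sketch, where the grid spacing and names are illustrative assumptions:

```python
import numpy as np

def randomized_round(x: np.ndarray, step: float = 0.1) -> np.ndarray:
    """Unbiased stochastic quantization of x onto a grid with spacing `step`."""
    scaled = x / step
    low = np.floor(scaled)
    p_up = scaled - low                            # chance of rounding up
    rounded = low + (np.random.rand(*x.shape) < p_up)
    return rounded * step                          # E[output] == x

msg = np.array([0.03, -0.26, 0.0, 0.51])
q = randomized_round(msg)
payload = {i: v for i, v in enumerate(q) if v != 0.0}  # send non-zeros only
```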

Quantum Multi-Agent Reinforcement Learning for Autonomous Mobility Cooperation

Authors:Soohyun Park, Jae Pyoung Kim, Chanyoung Park, Soyi Jung, Joongheon Kim
Date:2023-08-03 03:29:25

For the Industry 4.0 revolution, cooperative autonomous mobility systems are widely used based on multi-agent reinforcement learning (MARL). However, MARL-based algorithms suffer from huge parameter utilization and convergence difficulties with many agents. To tackle these problems, a quantum MARL (QMARL) algorithm based on the concept of an actor-critic network is proposed, which is beneficial in terms of scalability, to deal with the limitations of the noisy intermediate-scale quantum (NISQ) era. Additionally, our QMARL is also beneficial in terms of efficient parameter utilization and fast convergence due to quantum supremacy. Note that the reward in our QMARL is defined as task precision over computation time across multiple agents; thus, multi-agent cooperation can be realized. For further improvement, an additional technique for scalability is proposed, called the projection value measure (PVM). Based on PVM, our proposed QMARL can achieve the highest reward by reducing the action dimension to a logarithmic scale. Finally, we can conclude that our proposed QMARL with PVM outperforms the other algorithms in terms of efficient parameter utilization, fast convergence, and scalability.

BRNES: Enabling Security and Privacy-aware Experience Sharing in Multiagent Robotic and Autonomous Systems

Authors:Md Tamjid Hossain, Hung Manh La, Shahriar Badsha, Anton Netchaev
Date:2023-08-02 16:57:19

Although experience sharing (ES) accelerates multiagent reinforcement learning (MARL) in an advisor-advisee framework, attempts to apply ES to decentralized multiagent systems have so far relied on trusted environments and overlooked the possibility of adversarial manipulation and inference. Nevertheless, in a real-world setting, some Byzantine attackers, disguised as advisors, may provide false advice to the advisee and catastrophically degrade the overall learning performance. Also, an inference attacker, disguised as an advisee, may conduct several queries to infer the advisors' private information, making the entire ES process questionable in terms of privacy leakage. To address these issues, we propose a novel MARL framework (BRNES) that heuristically selects a dynamic neighbor zone for each advisee at each learning step and adopts a weighted experience aggregation technique to reduce the impact of Byzantine attacks. Furthermore, to keep the agent's private information safe from adversarial inference attacks, we leverage local differential privacy (LDP)-induced noise during the ES process. Our experiments show that our framework outperforms the state-of-the-art in terms of the steps-to-goal, obtained reward, and time-to-goal metrics. Particularly, our evaluation shows that the proposed framework is 8.32x faster than current non-private frameworks and 1.41x faster than private frameworks in an adversarial setting.
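
A rough sketch of the two defenses named above, under loudly stated assumptions (the weighting rule, Laplace noise, and all names are illustrative, not the authors' exact scheme): each advisor perturbs its advice locally before sharing, and the advisee aggregates advice with trust weights that can down-weight suspected Byzantine advisors.

```python
import numpy as np

def aggregate_advice(q_estimates, trust_weights, ldp_scale=0.1):
    """q_estimates: (n_advisors, n_actions); trust_weights: (n_advisors,)."""
    # Local perturbation before sharing gives an LDP-flavored guarantee.
    noisy = q_estimates + np.random.laplace(0.0, ldp_scale, q_estimates.shape)
    w = trust_weights / trust_weights.sum()
    return (w[:, None] * noisy).sum(axis=0)  # weighted experience aggregation
```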

Cooperative Multi-Agent Constrained POMDPs: Strong Duality and Primal-Dual Reinforcement Learning with Approximate Information States

Authors:Nouman Khan, Vijay Subramanian
Date:2023-07-31 10:02:30

We study the problem of decentralized constrained POMDPs in a team setting where multiple non-strategic agents have asymmetric information. Strong duality is established for the setting of infinite-horizon expected total discounted costs when the observations lie in a countable space, the actions are chosen from a finite space, and the immediate cost functions are bounded. Following this, connections with the common-information and approximate information-state approaches are established. The approximate information-states are characterized independently of the Lagrange-multiplier vector so that adaptations of the multiplier (during learning) will not necessitate new representations. Finally, a primal-dual multi-agent reinforcement learning (MARL) framework based on centralized training distributed execution (CTDE) and three-timescale stochastic approximation is developed with the aid of recurrent and feedforward neural networks as function approximators.

Robust Electric Vehicle Balancing of Autonomous Mobility-On-Demand System: A Multi-Agent Reinforcement Learning Approach

Authors:Sihong He, Shuo Han, Fei Miao
Date:2023-07-30 13:40:42

Electric autonomous vehicles (EAVs) are getting attention in future autonomous mobility-on-demand (AMoD) systems due to their economic and societal benefits. However, EAVs' unique charging patterns (long charging times, high charging frequency, unpredictable charging behaviors, etc.) make it challenging to accurately predict the EAV supply in E-AMoD systems. Furthermore, uncertainty in mobility demand prediction makes it an urgent and challenging task to design an integrated vehicle balancing solution under supply and demand uncertainties. Despite the success of reinforcement learning-based E-AMoD balancing algorithms, state uncertainties under EV supply or mobility demand remain unexplored. In this work, we design a multi-agent reinforcement learning (MARL)-based framework for EAV balancing in E-AMoD systems, with adversarial agents to model both the EAV supply and mobility demand uncertainties that may undermine vehicle balancing solutions. We then propose a robust E-AMoD Balancing MARL (REBAMA) algorithm to train a robust EAV balancing policy that balances both the supply-demand ratio and the charging utilization rate across the whole city. Experiments show that our proposed robust method performs better than a non-robust MARL method that does not consider state uncertainties; it improves the reward, charging utilization fairness, and supply-demand fairness by 19.28%, 28.18%, and 3.97%, respectively. Compared with a robust optimization-based method, the proposed MARL algorithm can improve the reward, charging utilization fairness, and supply-demand fairness by 8.21%, 8.29%, and 9.42%, respectively.

Robust Multi-Agent Reinforcement Learning with State Uncertainty

Authors:Sihong He, Songyang Han, Sanbao Su, Shuo Han, Shaofeng Zou, Fei Miao
Date:2023-07-30 12:31:42

In real-world multi-agent reinforcement learning (MARL) applications, agents may not have perfect state information (e.g., due to inaccurate measurement or malicious attacks), which challenges the robustness of agents' policies. Though robustness is increasingly important in MARL deployment, little prior work has studied state uncertainties in MARL, either in problem formulation or algorithm design. Motivated by this robustness issue and the lack of corresponding studies, we study the problem of MARL with state uncertainty in this work. We provide a first attempt at theoretical and empirical analysis of this challenging problem. We first model the problem as a Markov Game with state perturbation adversaries (MG-SPA) by introducing a set of state perturbation adversaries into a Markov Game. We then introduce robust equilibrium (RE) as the solution concept of an MG-SPA. We conduct a fundamental analysis regarding MG-SPA, such as giving conditions under which such a robust equilibrium exists. Then we propose a robust multi-agent Q-learning (RMAQ) algorithm to find such an equilibrium, with convergence guarantees. To handle high-dimensional state-action spaces, we design a robust multi-agent actor-critic (RMAAC) algorithm based on an analytical expression of the policy gradient derived in the paper. Our experiments show that the proposed RMAQ algorithm converges to the optimal value function; our RMAAC algorithm outperforms several MARL and robust MARL methods in multiple multi-agent environments when state uncertainty is present. The source code is public on \url{https://github.com/sihongho/robust_marl_with_state_uncertainty}.

ESP: Exploiting Symmetry Prior for Multi-Agent Reinforcement Learning

Authors:Xin Yu, Rongye Shi, Pu Feng, Yongkai Tian, Jie Luo, Wenjun Wu
Date:2023-07-30 09:49:05

Multi-agent reinforcement learning (MARL) has achieved promising results in recent years. However, most existing reinforcement learning methods require a large amount of data for model training. In addition, data-efficient reinforcement learning requires the construction of strong inductive biases, which are ignored in the current MARL approaches. Inspired by the symmetry phenomenon in multi-agent systems, this paper proposes a framework for exploiting prior knowledge by integrating data augmentation and a well-designed consistency loss into the existing MARL methods. In addition, the proposed framework is model-agnostic and can be applied to most of the current MARL algorithms. Experimental tests on multiple challenging tasks demonstrate the effectiveness of the proposed framework. Moreover, the proposed framework is applied to a physical multi-robot testbed to show its superiority.
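
To make the idea concrete, here is a minimal, hedged sketch of symmetry-based data augmentation paired with a consistency loss, assuming a 2D rotation symmetry; the policy signature and all names are illustrative, not the paper's implementation.

```python
import math
import torch
import torch.nn.functional as F

def rotate2d(xy: torch.Tensor, theta: float) -> torch.Tensor:
    """Rotate a batch of 2D vectors (B, 2) by angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    rot = torch.tensor([[c, -s], [s, c]], dtype=xy.dtype)
    return xy @ rot.T

def consistency_loss(policy, obs_xy: torch.Tensor, theta: float = math.pi / 2):
    """Acting on a rotated observation should equal rotating the action."""
    act = policy(obs_xy)                        # (B, 2) continuous action
    act_rot = policy(rotate2d(obs_xy, theta))   # action on augmented data
    return F.mse_loss(act_rot, rotate2d(act, theta))
```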

Learning to Collaborate by Grouping: a Consensus-oriented Strategy for Multi-agent Reinforcement Learning

Authors:Jingqing Ruan, Xiaotian Hao, Dong Li, Hangyu Mao
Date:2023-07-28 12:41:36

Multi-agent systems require effective coordination between groups and individuals to achieve common goals. However, current multi-agent reinforcement learning (MARL) methods primarily focus on improving individual policies and do not adequately address group-level policies, which leads to weak cooperation. To address this issue, we propose a novel Consensus-oriented Strategy (CoS) that emphasizes group and individual policies simultaneously. Specifically, CoS comprises two main components: (a) the vector quantized group consensus module, which extracts discrete latent embeddings that represent the stable and discriminative group consensus, and (b) the group consensus-oriented strategy, which integrates the group policy using a hypernet and the individual policies using the group consensus, thereby promoting coordination at both the group and individual levels. Through empirical experiments on cooperative navigation tasks with both discrete and continuous spaces, as well as Google research football, we demonstrate that CoS outperforms state-of-the-art MARL algorithms and achieves better collaboration, thus providing a promising solution for achieving effective coordination in multi-agent systems.

MatrixWorld: A pursuit-evasion platform for safe multi-agent coordination and autocurricula

Authors:Lijun Sun, Yu-Cheng Chang, Chao Lyu, Chin-Teng Lin, Yuhui Shi
Date:2023-07-27 13:35:03

Multi-agent reinforcement learning (MARL) achieves encouraging performance in solving complex tasks. However, the safety of MARL policies is one critical concern that impedes their real-world applications. Popular multi-agent benchmarks focus on diverse tasks yet provide limited safety support. Therefore, this work proposes a safety-constrained multi-agent environment: MatrixWorld, based on the general pursuit-evasion game. Particularly, a safety-constrained multi-agent action execution model is proposed for the software implementation of safe multi-agent environments based on diverse safety definitions. It (1) extends the vertex conflict among homogeneous / cooperative agents to heterogeneous / adversarial settings, and (2) proposes three types of resolutions for each type of conflict, aiming at providing rational and unbiased feedback for safe MARL. Besides, MatrixWorld is also a lightweight co-evolution framework for the learning of pursuit tasks, evasion tasks, or both, where more pursuit-evasion variants can be designed based on different practical meanings of safety. As a brief survey, we review and analyze the co-evolution mechanism in the multi-agent setting, which clearly reveals its relationships with autocurricula, self-play, arms races, and adversarial learning. Thus, MatrixWorld can also serve as the first environment for autocurricula research, where ideas can be quickly verified and well understood.

Offline Multi-Agent Reinforcement Learning with Implicit Global-to-Local Value Regularization

Authors:Xiangsen Wang, Haoran Xu, Yinan Zheng, Xianyuan Zhan
Date:2023-07-21 14:37:54

Offline reinforcement learning (RL) has received considerable attention in recent years due to its attractive capability of learning policies from offline datasets without environmental interactions. Despite some success in the single-agent setting, offline multi-agent RL (MARL) remains a challenge. The large joint state-action space and the coupled multi-agent behaviors pose extra complexities for offline policy optimization. Most existing offline MARL studies simply apply offline data-related regularizations on individual agents, without fully considering the multi-agent system at the global level. In this work, we present OMIGA, a new offline multi-agent RL algorithm with implicit global-to-local value regularization. OMIGA provides a principled framework to convert global-level value regularization into equivalent implicit local value regularizations and simultaneously enables in-sample learning, thus elegantly bridging multi-agent value decomposition and policy learning with offline regularizations. Based on comprehensive experiments on the offline multi-agent MuJoCo and StarCraft II micro-management tasks, we show that OMIGA achieves superior performance over the state-of-the-art offline MARL methods in almost all tasks.

Two Tales of Platoon Intelligence for Autonomous Mobility Control: Enabling Deep Learning Recipes

Authors:Soohyun Park, Haemin Lee, Chanyoung Park, Soyi Jung, Minseok Choi, Joongheon Kim
Date:2023-07-19 01:46:38

This paper presents recent deep learning-based achievements for resolving the problem of autonomous mobility control and efficient resource management of autonomous vehicles and UAVs, i.e., (i) multi-agent reinforcement learning (MARL) and (ii) the neural Myerson auction. Representatively, the communication network (CommNet), one of the most popular MARL algorithms, is introduced to enable multiple agents to take actions in a distributed manner toward their shared goals by training all agents' states and actions in a single neural network. Moreover, the neural Myerson auction guarantees truthfulness among multiple agents as well as achieving the optimal revenue of highly dynamic systems. Therefore, we survey recent studies on autonomous mobility control based on MARL and the neural Myerson auction. Furthermore, we emphasize that the integration of MARL and the neural Myerson auction is expected to be critical for efficient and trustworthy autonomous mobility services.

Non-Stationary Policy Learning for Multi-Timescale Multi-Agent Reinforcement Learning

Authors:Patrick Emami, Xiangyu Zhang, David Biagioni, Ahmed S. Zamzam
Date:2023-07-17 19:25:46

In multi-timescale multi-agent reinforcement learning (MARL), agents interact across different timescales. In general, policies for time-dependent behaviors, such as those induced by multiple timescales, are non-stationary. Learning non-stationary policies is challenging and typically requires sophisticated or inefficient algorithms. Motivated by the prevalence of this control problem in real-world complex systems, we introduce a simple framework for learning non-stationary policies for multi-timescale MARL. Our approach uses available information about agent timescales to define a periodic time encoding. In detail, we theoretically demonstrate that the effects of non-stationarity introduced by multiple timescales can be learned by a periodic multi-agent policy. To learn such policies, we propose a policy gradient algorithm that parameterizes the actor and critic with phase-functioned neural networks, which provide an inductive bias for periodicity. The framework's ability to effectively learn multi-timescale policies is validated on a gridworld and building energy management environment.
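
The periodic time encoding described above can be as simple as sine/cosine features of the phase within each agent's known control period, so that a stationary policy can still express periodic, time-dependent behavior. A minimal sketch, with the feature count and names as illustrative assumptions:

```python
import numpy as np

def periodic_encoding(t: int, period: int, k: int = 2) -> np.ndarray:
    """Sin/cos features of the phase of timestep t within `period`."""
    phase = 2.0 * np.pi * (t % period) / period
    feats = [f(h * phase) for h in range(1, k + 1) for f in (np.sin, np.cos)]
    return np.array(feats)

# Agents with periods 4 and 24 receive encodings that repeat on their own
# timescales, e.g. appended to each agent's observation.
print(periodic_encoding(t=7, period=4))
```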

Efficient Adversarial Attacks on Online Multi-agent Reinforcement Learning

Authors:Guanlin Liu, Lifeng Lai
Date:2023-07-15 00:38:55

Due to the broad range of applications of multi-agent reinforcement learning (MARL), understanding the effects of adversarial attacks against MARL models is essential for their safe application. Motivated by this, we investigate the impact of adversarial attacks on MARL. In the considered setup, there is an exogenous attacker who is able to modify the rewards before the agents receive them or manipulate the actions before the environment receives them. The attacker aims to guide each agent into a target policy or maximize the cumulative rewards under some specific reward function chosen by the attacker, while minimizing the amount of manipulation of feedback and actions. We first show the limitations of action-poisoning-only attacks and reward-poisoning-only attacks. We then introduce a mixed attack strategy combining action poisoning and reward poisoning. We show that the mixed attack strategy can efficiently attack MARL agents even if the attacker has no prior information about the underlying environment and the agents' algorithms.

Learning Multiple Coordinated Agents under Directed Acyclic Graph Constraints

Authors:Jaeyeon Jang, Diego Klabjan, Han Liu, Nital S. Patel, Xiuqi Li, Balakrishnan Ananthanarayanan, Husam Dauod, Tzung-Han Juang
Date:2023-07-13 13:41:24

This paper proposes a novel multi-agent reinforcement learning (MARL) method to learn multiple coordinated agents under directed acyclic graph (DAG) constraints. Unlike existing MARL approaches, our method explicitly exploits the DAG structure between agents to achieve more effective learning performance. Theoretically, we propose a novel surrogate value function based on a MARL model with synthetic rewards (MARLM-SR) and prove that it serves as a lower bound of the optimal value function. Computationally, we propose a practical training algorithm that exploits the new notions of a leader agent and a reward generator and distributor agent to guide the decomposed follower agents to better explore the parameter space in environments with DAG constraints. Empirically, we exploit four DAG environments, including a real-world scheduling environment from one of Intel's high-volume packaging and test factories, to benchmark our method and show that it outperforms other non-DAG approaches.

Learning Decentralized Partially Observable Mean Field Control for Artificial Collective Behavior

Authors:Kai Cui, Sascha Hauck, Christian Fabian, Heinz Koeppl
Date:2023-07-12 14:02:03

Recent reinforcement learning (RL) methods have achieved success in various domains. However, multi-agent RL (MARL) remains a challenge in terms of decentralization, partial observability and scalability to many agents. Meanwhile, collective behavior requires resolution of the aforementioned challenges, and remains of importance to many state-of-the-art applications such as active matter physics, self-organizing systems, opinion dynamics, and biological or robotic swarms. Here, MARL via mean field control (MFC) offers a potential solution to scalability, but fails to consider decentralized and partially observable systems. In this paper, we enable decentralized behavior of agents under partial information by proposing novel models for decentralized partially observable MFC (Dec-POMFC), a broad class of problems with permutation-invariant agents allowing for reduction to tractable single-agent Markov decision processes (MDP) with single-agent RL solution. We provide rigorous theoretical results, including a dynamic programming principle, together with optimality guarantees for Dec-POMFC solutions applied to finite swarms of interest. Algorithmically, we propose Dec-POMFC-based policy gradient methods for MARL via centralized training and decentralized execution, together with policy gradient approximation guarantees. In addition, we improve upon state-of-the-art histogram-based MFC by kernel methods, which is of separate interest also for fully observable MFC. We evaluate numerically on representative collective behavior tasks such as adapted Kuramoto and Vicsek swarming models, being on par with state-of-the-art MARL. Overall, our framework takes a step towards RL-based engineering of artificial collective behavior via MFC.

Channel Selection for Wi-Fi 7 Multi-Link Operation via Optimistic-Weighted VDN and Parallel Transfer Reinforcement Learning

Authors:Pedro Enrique Iturria-Rivera, Marcel Chenier, Bernard Herscovici, Burak Kantarci, Melike Erol-Kantarci
Date:2023-07-11 16:35:04

Dense and unplanned IEEE 802.11 Wireless Fidelity (Wi-Fi) deployments and the continuous increase of throughput- and latency-stringent services for users have led machine learning algorithms to be considered promising techniques in industry and academia. Specifically, the ongoing IEEE 802.11be EHT -- Extremely High Throughput, known as Wi-Fi 7 -- amendment proposes, for the first time, Multi-Link Operation (MLO). Among others, this new feature will increase the complexity of channel selection due to the novel multi-interface proposal. In this paper, we present a Parallel Transfer Reinforcement Learning (PTRL)-based cooperative Multi-Agent Reinforcement Learning (MARL) algorithm named Parallel Transfer Reinforcement Learning Optimistic-Weighted Value Decomposition Networks (oVDN) to improve intelligent channel selection in IEEE 802.11be MLO-capable networks. Additionally, we compare the impact of different parallel transfer learning alternatives and a centralized non-transfer MARL baseline. Two PTRL methods are presented: Multi-Agent System (MAS) Joint Q-function Transfer, where the joint Q-function is transferred, and MAS Best/Worst Experience Transfer, where the best and worst experiences are transferred among MASs. Simulation results show that oVDNg -- where only the best experiences are utilized -- is the best algorithm variant. Moreover, oVDNg offers gains of up to 3%, 7.2% and 11% compared with the VDN, VDN-nonQ and non-PTRL baselines. Furthermore, oVDNg achieved a reward convergence gain of 33.3% on the 5 GHz interface over oVDNb and oVDN, where only the worst and both types of experiences are considered, respectively. Finally, our best PTRL alternative showed an improvement over the non-PTRL baseline of up to 40 episodes in convergence speed and up to 135% in reward.

MARBLER: An Open Platform for Standardized Evaluation of Multi-Robot Reinforcement Learning Algorithms

Authors:Reza Torbati, Shubham Lohiya, Shivika Singh, Meher Shashwat Nigam, Harish Ravichandar
Date:2023-07-08 03:58:23

Multi-Agent Reinforcement Learning (MARL) has enjoyed significant recent progress thanks, in part, to the integration of deep learning techniques for modeling interactions in complex environments. This is naturally starting to benefit multi-robot systems (MRS) in the form of multi-robot RL (MRRL). However, existing infrastructure to train and evaluate policies predominantly focuses on the challenges of coordinating virtual agents and ignores characteristics important to robotic systems. Few platforms support realistic robot dynamics, and fewer still can evaluate Sim2Real performance of learned behavior. To address these issues, we contribute MARBLER: Multi-Agent RL Benchmark and Learning Environment for the Robotarium. MARBLER offers a robust and comprehensive evaluation platform for MRRL by marrying Georgia Tech's Robotarium (which enables rapid deployment on physical MRS) and OpenAI's Gym interface (which facilitates standardized use of modern learning algorithms). MARBLER offers a highly controllable environment with realistic dynamics, including barrier certificate-based obstacle avoidance. It allows anyone across the world to train and deploy MRRL algorithms on a physical testbed with reproducibility. Further, we introduce five novel scenarios inspired by common challenges in MRS and provide support for new custom scenarios. Finally, we use MARBLER to evaluate popular MARL algorithms and provide insights into their suitability for MRRL. In summary, MARBLER can be a valuable tool to the MRS research community by facilitating comprehensive and standardized evaluation of learning algorithms on realistic simulations and physical hardware. Links to our open-source framework and videos of real-world experiments can be found at https://shubhlohiya.github.io/MARBLER/.

Learning Multi-Agent Intention-Aware Communication for Optimal Multi-Order Execution in Finance

Authors:Yuchen Fang, Zhenggang Tang, Kan Ren, Weiqing Liu, Li Zhao, Jiang Bian, Dongsheng Li, Weinan Zhang, Yong Yu, Tie-Yan Liu
Date:2023-07-06 16:45:40

Order execution is a fundamental task in quantitative finance, aiming at finishing acquisition or liquidation for a number of trading orders of specific assets. Recent advances in model-free reinforcement learning (RL) provide a data-driven solution to the order execution problem. However, existing works always optimize execution for an individual order, overlooking the practice that multiple orders are specified to execute simultaneously, resulting in suboptimality and bias. In this paper, we first present a multi-agent RL (MARL) method for multi-order execution considering practical constraints. Specifically, we treat every agent as an individual operator trading one specific order, while communicating with the others and collaborating to maximize overall profits. Nevertheless, existing MARL algorithms often incorporate communication among agents by exchanging only the information of their partial observations, which is inefficient in complicated financial markets. To improve collaboration, we then propose a learnable multi-round communication protocol in which the agents communicate their intended actions with each other and refine them accordingly. It is optimized through a novel action value attribution method that is provably consistent with the original learning objective yet more efficient. The experiments on data from two real-world markets have illustrated superior performance, with significantly better collaboration effectiveness achieved by our method.

Cell-Free XL-MIMO Meets Multi-Agent Reinforcement Learning: Architectures, Challenges, and Future Directions

Authors:Zhilong Liu, Jiayi Zhang, Ziheng Liu, Hongyang Du, Zhe Wang, Dusit Niyato, Mohsen Guizani, Bo Ai
Date:2023-07-06 07:50:47

Cell-free massive multiple-input multiple-output (mMIMO) and extremely large-scale MIMO (XL-MIMO) are regarded as promising innovations for the forthcoming generation of wireless communication systems. Their significant advantages in augmenting the number of degrees of freedom have garnered considerable interest. In this article, we first review the essential opportunities and challenges induced by XL-MIMO systems. We then propose the enhanced paradigm of cell-free XL-MIMO, which incorporates multi-agent reinforcement learning (MARL) to provide a distributed strategy for tackling the problem of high-dimension signal processing and costly energy consumption. Based on the unique near-field characteristics, we propose two categories of the low-complexity design, i.e., antenna selection and power control, to adapt to different cell-free XL-MIMO scenarios and achieve the maximum data rate. For inspiration, several critical future research directions pertaining to green cell-free XL-MIMO systems are presented.

Multi-Agent Cooperation via Unsupervised Learning of Joint Intentions

Authors:Shanqi Liu, Weiwei Liu, Wenzhou Chen, Guanzhong Tian, Yong Liu
Date:2023-07-05 10:52:02

The field of cooperative multi-agent reinforcement learning (MARL) has seen widespread use in addressing complex coordination tasks. While value decomposition methods in MARL have been popular, they have limitations in solving tasks with non-monotonic returns, restricting their general application. Our work highlights the significance of joint intentions in cooperation, which can overcome non-monotonic problems and increase the interpretability of the learning process. To this end, we present a novel MARL method that leverages learnable joint intentions. Our method employs a hierarchical framework consisting of a joint intention policy and a behavior policy to formulate the optimal cooperative policy. The joint intentions are autonomously learned in a latent space through unsupervised learning, making the method adaptable to different agent configurations. Our results demonstrate significant performance improvements in both the StarCraft micromanagement benchmark and challenging MAgent domains, showcasing the effectiveness of our method in learning meaningful joint intentions.

Beyond Conservatism: Diffusion Policies in Offline Multi-agent Reinforcement Learning

Authors:Zhuoran Li, Ling Pan, Longbo Huang
Date:2023-07-04 04:40:54

We present a novel Diffusion Offline Multi-agent Model (DOM2) for offline Multi-Agent Reinforcement Learning (MARL). Different from existing algorithms that rely mainly on conservatism in policy design, DOM2 enhances policy expressiveness and diversity based on diffusion. Specifically, we incorporate a diffusion model into the policy network and propose a trajectory-based data-augmentation scheme in training. These key ingredients make our algorithm more robust to environment changes and achieve significant improvements in performance, generalization and data-efficiency. Our extensive experimental results demonstrate that DOM2 outperforms existing state-of-the-art methods in multi-agent particle and multi-agent MuJoCo environments, and generalizes significantly better in shifted environments thanks to its high expressiveness and diversity. Furthermore, DOM2 shows superior data efficiency and can achieve state-of-the-art performance with $20+$ times less data compared to existing algorithms.

Environmental effects on emergent strategy in micro-scale multi-agent reinforcement learning

Authors:Samuel Tovey, David Zimmer, Christoph Lohrmann, Tobias Merkt, Simon Koppenhoefer, Veit-Lorenz Heuthe, Clemens Bechinger, Christian Holm
Date:2023-07-03 13:18:25

Multi-Agent Reinforcement Learning (MARL) is a promising candidate for realizing efficient control of microscopic particles, of which micro-robots are a subset. However, the microscopic particles' environment presents unique challenges, such as Brownian motion at sufficiently small length-scales. In this work, we explore the role of temperature in the emergence and efficacy of strategies in MARL systems using particle-based Langevin molecular dynamics simulations as a realistic representation of micro-scale environments. To this end, we perform experiments on two different multi-agent tasks in microscopic environments at different temperatures, detecting the source of a concentration gradient and rotation of a rod. We find that at higher temperatures, the RL agents identify new strategies for achieving these tasks, highlighting the importance of understanding this regime and providing insight into optimal training strategies for bridging the generalization gap between simulation and reality. We also introduce a novel Python package for studying microscopic agents using reinforcement learning (RL) to accompany our results.

Enhancing the Robustness of QMIX against State-adversarial Attacks

Authors:Weiran Guo, Guanjun Liu, Ziyuan Zhou, Ling Wang, Jiacun Wang
Date:2023-07-03 10:10:34

Deep reinforcement learning (DRL) performance is generally degraded by state-adversarial attacks, i.e., perturbations applied to an agent's observations. Most recent research has concentrated on making single-agent reinforcement learning (SARL) algorithms robust against state-adversarial attacks, and there has been comparatively little work on robust multi-agent reinforcement learning. Using QMIX, one of the popular cooperative multi-agent reinforcement learning algorithms, as an example, we discuss four techniques for improving the robustness of SARL algorithms and extend them to multi-agent scenarios. To increase the robustness of multi-agent reinforcement learning (MARL) algorithms, we train models under a variety of attacks and then evaluate each model against the attacks it was not trained on. In this way, we organize and summarize techniques for enhancing the robustness of MARL.
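
As a concrete illustration of the kind of state-adversarial perturbation such work defends against, here is a minimal FGSM-style sketch (one common attack family, not necessarily one of the four techniques the paper studies); the toy network and epsilon value are illustrative assumptions.

import torch

def fgsm_observation(q_network, obs, epsilon=0.05):
    # State-adversarial perturbation: move the observation a small step in the
    # gradient direction that most reduces the value of the greedy action.
    obs = obs.clone().detach().requires_grad_(True)
    q_network(obs).max(dim=-1).values.sum().backward()
    return (obs - epsilon * obs.grad.sign()).detach()

# Toy Q-network standing in for one agent's utility network.
q_net = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4))
adversarial_obs = fgsm_observation(q_net, torch.zeros(1, 8))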

TVDO: Tchebycheff Value-Decomposition Optimization for Multi-Agent Reinforcement Learning

Authors:Xiaoliang Hu, Pengcheng Guo, Chuanwei Zhou, Tong Zhang, Zhen Cui
Date:2023-06-24 14:19:50

In cooperative multi-agent reinforcement learning (MARL) settings, centralized training with decentralized execution (CTDE) has recently become customary due to physical demands. However, a central dilemma of CTDE is the inconsistency between jointly-trained policies and individually-optimized actions. In this work, we propose a novel value-based multi-objective learning approach, named Tchebycheff value decomposition optimization (TVDO), to overcome this dilemma. In particular, a nonlinear Tchebycheff aggregation method is designed to transform the MARL task into a multi-objective optimization counterpart by tightly constraining the upper bound of individual action-value bias. We theoretically prove that TVDO satisfies the necessary and sufficient condition of individual global max (IGM) with no extra limitations, which exactly guarantees the consistency between the global and individual optimal action-value functions. Empirically, in the climb and penalty games, we verify that TVDO precisely factorizes the global value into individual values with a guarantee of policy consistency. Furthermore, we also evaluate TVDO on the challenging StarCraft II micromanagement tasks, and extensive experiments demonstrate that TVDO achieves more competitive performance than several state-of-the-art MARL methods.
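
For intuition, here is a minimal sketch of the Tchebycheff scalarization that the method's name refers to; the weights, ideal point, and toy values are illustrative assumptions, not the paper's construction.

import numpy as np

def tchebycheff(objectives, weights, ideal):
    # Nonlinear Tchebycheff aggregation: the scalarized value is the worst
    # weighted deviation from the ideal point, so minimizing it bounds every
    # individual objective's bias rather than only their (monotonic) average.
    return np.max(weights * np.abs(objectives - ideal))

objs = np.array([0.7, 0.4, 0.9])     # e.g., per-agent action-value estimates
w = np.array([1.0, 1.0, 1.0])        # illustrative uniform weights
z_star = np.array([1.0, 1.0, 1.0])   # illustrative ideal point
print(tchebycheff(objs, w, z_star))  # 0.6 -- dominated by the worst agent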

Discovering Causality for Efficient Cooperation in Multi-Agent Environments

Authors:Rafael Pina, Varuna De Silva, Corentin Artaud
Date:2023-06-20 18:56:25

In cooperative Multi-Agent Reinforcement Learning (MARL) agents are required to learn behaviours as a team to achieve a common goal. However, while learning a task, some agents may end up learning sub-optimal policies, not contributing to the objective of the team. Such agents are called lazy agents due to their non-cooperative behaviours that may arise from failing to understand whether they caused the rewards. As a consequence, we observe that the emergence of cooperative behaviours is not necessarily a byproduct of being able to solve a task as a team. In this paper, we investigate the application of causality in MARL and how it can be used to penalise these lazy agents. We observe that causality estimations can be used to improve the credit assignment to the agents and show how they can be leveraged to improve independent learning in MARL. Furthermore, we investigate how Amortized Causal Discovery can be used to automate causality detection within MARL environments. The results demonstrate that causality relations between individual observations and the team reward can be used to detect and punish lazy agents, making them develop more intelligent behaviours. This results in improvements not only in the overall performance of the team but also in the agents' individual capabilities. In addition, results show that Amortized Causal Discovery can be used efficiently to find causal relations in MARL.

IMP-MARL: a Suite of Environments for Large-scale Infrastructure Management Planning via MARL

Authors:Pascal Leroy, Pablo G. Morato, Jonathan Pisane, Athanasios Kolios, Damien Ernst
Date:2023-06-20 14:12:29

We introduce IMP-MARL, an open-source suite of multi-agent reinforcement learning (MARL) environments for large-scale Infrastructure Management Planning (IMP), offering a platform for benchmarking the scalability of cooperative MARL methods in real-world engineering applications. In IMP, a multi-component engineering system is subject to a risk of failure due to its components' damage condition. Specifically, each agent plans inspections and repairs for a specific system component, aiming to minimise maintenance costs while cooperating to minimise system failure risk. With IMP-MARL, we release several environments including one related to offshore wind structural systems, in an effort to meet today's needs to improve management strategies to support sustainable and reliable energy systems. Supported by IMP practical engineering environments featuring up to 100 agents, we conduct a benchmark campaign, where the scalability and performance of state-of-the-art cooperative MARL methods are compared against expert-based heuristic policies. The results reveal that centralised training with decentralised execution methods scale better with the number of agents than fully centralised or decentralised RL approaches, while also outperforming expert-based heuristic policies in most IMP environments. Based on our findings, we additionally outline remaining cooperation and scalability challenges that future MARL methods should still address. Through IMP-MARL, we encourage the implementation of new environments and the further development of MARL methods.

Cooperative Multi-Agent Learning for Navigation via Structured State Abstraction

Authors:Mohamed K. Abdelaziz, Mohammed S. Elbamby, Sumudu Samarakoon, Mehdi Bennis
Date:2023-06-20 07:06:17

Cooperative multi-agent reinforcement learning (MARL) for navigation enables agents to cooperate to achieve their navigation goals. Using emergent communication, agents learn a communication protocol to coordinate and share information that is needed to achieve their navigation tasks. In emergent communication, symbols with no pre-specified usage rules are exchanged, in which the meaning and syntax emerge through training. Learning a navigation policy along with a communication protocol in a MARL environment is highly complex due to the huge state space to be explored. To cope with this complexity, this work proposes a novel neural network architecture, for jointly learning an adaptive state space abstraction and a communication protocol among agents participating in navigation tasks. The goal is to come up with an adaptive abstractor that significantly reduces the size of the state space to be explored, without degradation in the policy performance. Simulation results show that the proposed method reaches a better policy, in terms of achievable rewards, resulting in fewer training iterations compared to the case where raw states or fixed state abstraction are used. Moreover, it is shown that a communication protocol emerges during training which enables the agents to learn better policies within fewer training iterations.

Adversarial Search and Tracking with Multiagent Reinforcement Learning in Sparsely Observable Environment

Authors:Zixuan Wu, Sean Ye, Manisha Natarajan, Letian Chen, Rohan Paleja, Matthew C. Gombolay
Date:2023-06-20 05:31:13

We study a search and tracking (S&T) problem where a team of dynamic search agents must collaborate to track an adversarial, evasive agent. The heterogeneous search team may only have access to a limited number of past adversary trajectories within a large search space. This problem is challenging for both model-based searching and reinforcement learning (RL) methods since the adversary exhibits reactionary and deceptive evasive behaviors in a large space leading to sparse detections for the search agents. To address this challenge, we propose a novel Multi-Agent RL (MARL) framework that leverages the estimated adversary location from our learnable filtering model. We show that our MARL architecture can outperform all baselines and achieves a 46% increase in detection rate.

CAMMARL: Conformal Action Modeling in Multi Agent Reinforcement Learning

Authors:Nikunj Gupta, Somjit Nath, Samira Ebrahimi Kahou
Date:2023-06-19 19:03:53

Before taking actions in an environment with more than one intelligent agent, an autonomous agent may benefit from reasoning about the other agents and utilizing a notion of a guarantee or confidence about the behavior of the system. In this article, we propose a novel multi-agent reinforcement learning (MARL) algorithm CAMMARL, which involves modeling the actions of other agents in different situations in the form of confident sets, i.e., sets containing their true actions with a high probability. We then use these estimates to inform an agent's decision-making. For estimating such sets, we use the concept of conformal predictions, by means of which we not only obtain an estimate of the most probable outcome but also quantify the operable uncertainty. For instance, we can predict a set that provably covers the true predictions with high probability (e.g., 95%). Through several experiments in two fully cooperative multi-agent tasks, we show that CAMMARL elevates the capabilities of an autonomous agent in MARL by modeling conformal prediction sets over the behavior of other agents in the environment and utilizing such estimates to enhance its policy learning.
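
A minimal split-conformal sketch of how such action sets can be built from a calibration set; the Dirichlet-sampled probabilities and the particular nonconformity score are illustrative assumptions, not CAMMARL's exact construction.

import numpy as np

def conformal_action_set(cal_probs, cal_actions, test_probs, alpha=0.05):
    # Split conformal prediction: calibrate a score threshold on held-out data
    # so the returned set contains the true action with probability >= 1 - alpha.
    n = len(cal_actions)
    scores = 1.0 - cal_probs[np.arange(n), cal_actions]  # nonconformity = 1 - p(true action)
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    return np.where(1.0 - test_probs <= q)[0]            # all actions within the threshold

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(4), size=200)          # model's predicted action dists
cal_actions = rng.integers(0, 4, size=200)               # observed true actions
test_probs = np.array([0.55, 0.30, 0.10, 0.05])
print(conformal_action_set(cal_probs, cal_actions, test_probs))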

Robust Multi-Agent Control via Maximum Entropy Heterogeneous-Agent Reinforcement Learning

Authors:Simin Li, Yifan Zhong, Jiarong Liu, Jianing Guo, Siyuan Qi, Ruixiao Xu, Xin Yu, Siyi Hu, Haobo Fu, Qiang Fu, Xiaojun Chang, Yujing Hu, Bo An, Xianglong Liu, Yaodong Yang
Date:2023-06-19 06:22:02

In multi-agent reinforcement learning, optimal control with robustness guarantees is critical for deployment in the real world. However, existing methods face challenges related to sample complexity, training instability, potential suboptimal Nash Equilibrium convergence and non-robustness to multiple perturbations. In this paper, we propose a unified framework for learning \emph{stochastic} policies to resolve these issues. We embed cooperative MARL problems into probabilistic graphical models, from which we derive the maximum entropy (MaxEnt) objective optimal for MARL. Based on the MaxEnt framework, we propose the \emph{Heterogeneous-Agent Soft Actor-Critic} (HASAC) algorithm. Theoretically, we prove the monotonic improvement and convergence to \emph{quantal response equilibrium} (QRE) properties of HASAC. Furthermore, HASAC is provably robust against a wide range of real-world uncertainties, including perturbations in rewards, environment dynamics, states, and actions. Finally, we generalize a unified template for MaxEnt algorithmic design named \emph{Maximum Entropy Heterogeneous-Agent Mirror Learning} (MEHAML), which provides any induced method with the same guarantees as HASAC. We evaluate HASAC on seven benchmarks: Bi-DexHands, Multi-Agent MuJoCo, Pursuit-Evade, StarCraft Multi-Agent Challenge, Google Research Football, Multi-Agent Particle Environment, Light Aircraft Game. Results show that HASAC consistently outperforms strong baselines in 34 out of 38 tasks, exhibiting improved training stability, better sample efficiency and sufficient exploration. The robustness of HASAC was further validated when encountering uncertainties in rewards, dynamics, states, and actions of 14 magnitudes, and real-world deployment in a multi-robot arena against these four types of uncertainties. See our page at \url{https://sites.google.com/view/meharl}.
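
As background, the generic MaxEnt objective for a joint policy (a standard textbook form, not quoted from the paper) augments the discounted return with per-agent policy-entropy bonuses weighted by a temperature $\alpha$:

\[
J(\boldsymbol{\pi}) \;=\; \mathbb{E}_{(s_t,\mathbf{a}_t)\sim\boldsymbol{\pi}}\Big[\sum_{t}\gamma^{t}\Big(r(s_t,\mathbf{a}_t) \;+\; \alpha\sum_{i=1}^{n}\mathcal{H}\big(\pi^{i}(\cdot\mid s_t)\big)\Big)\Big].
\]

Maximizing this objective yields the stochastic policies the abstract refers to, since the entropy term rewards spreading probability mass over actions rather than collapsing to a deterministic choice.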

Dynamic Size Message Scheduling for Multi-Agent Communication under Limited Bandwidth

Authors:Qingshuang Sun, Denis Steckelmacher, Yuan Yao, Ann Nowé, Raphaël Avalos
Date:2023-06-16 18:33:11

Communication plays a vital role in multi-agent systems, fostering collaboration and coordination. However, in real-world scenarios where communication is bandwidth-limited, existing multi-agent reinforcement learning (MARL) algorithms often provide agents with a binary choice: either transmitting a fixed number of bytes or no information at all. This limitation hinders the ability to effectively utilize the available bandwidth. To overcome this challenge, we present the Dynamic Size Message Scheduling (DSMS) method, which introduces a finer-grained approach to scheduling by considering the actual size of the information to be exchanged. Our contribution lies in adaptively adjusting message sizes using Fourier transform-based compression techniques, enabling agents to tailor their messages to match the allocated bandwidth while striking a balance between information loss and transmission efficiency. Receiving agents can reliably decompress the messages using the inverse Fourier transform. Experimental results demonstrate that DSMS significantly improves performance in multi-agent cooperative tasks by optimizing the utilization of bandwidth and effectively balancing information value.
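
To illustrate the compression mechanism, here is a minimal sketch that keeps only the k largest-magnitude Fourier coefficients of a message and reconstructs it at the receiver; the message content, k, and top-k selection rule are illustrative assumptions rather than the exact DSMS scheduler.

import numpy as np

def compress(message, k):
    # Keep only the k largest-magnitude Fourier coefficients of the message;
    # the indices plus values are what would actually be transmitted.
    coeffs = np.fft.rfft(message)
    keep = np.argsort(np.abs(coeffs))[-k:]
    return keep, coeffs[keep], len(message)

def decompress(keep, values, n):
    # Receiver rebuilds the spectrum with zeros elsewhere and inverts it.
    coeffs = np.zeros(n // 2 + 1, dtype=complex)
    coeffs[keep] = values
    return np.fft.irfft(coeffs, n)

rng = np.random.default_rng(0)
msg = np.sin(np.linspace(0, 4 * np.pi, 64)) + 0.05 * rng.standard_normal(64)
idx, vals, n = compress(msg, k=4)
recovered = decompress(idx, vals, n)
print(np.abs(msg - recovered).max())   # most of the signal survives far fewer values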

Density-Aware Reinforcement Learning to Optimise Energy Efficiency in UAV-Assisted Networks

Authors:Babatunji Omoniwa, Boris Galkin, Ivana Dusparic
Date:2023-06-14 23:43:18

Unmanned aerial vehicles (UAVs) serving as aerial base stations can be deployed to provide wireless connectivity to mobile users, such as vehicles. However, the density of vehicles on roads often varies spatially and temporally, primarily due to mobility and traffic situations in a geographical area, making it difficult to provide ubiquitous service. Moreover, as energy-constrained UAVs hover in the sky while serving mobile users, they may be faced with interference from nearby UAV cells or other access points sharing the same frequency band, thereby impacting the system's energy efficiency (EE). Recent multi-agent reinforcement learning (MARL) approaches applied to optimise user coverage worked well with reasonably even densities but may not perform as well under uneven user distributions, i.e., in urban road networks with uneven concentrations of vehicles. In this work, we propose a density-aware communication-enabled multi-agent decentralised double deep Q-network (DACEMAD-DDQN) approach that maximises the total system EE by jointly optimising the trajectory of each UAV, the number of connected users, and the UAVs' energy consumption while keeping track of dense and uneven user distributions. Our approach outperforms state-of-the-art MARL approaches in terms of EE by as much as 65%-85%.

Mediated Multi-Agent Reinforcement Learning

Authors:Dmitry Ivanov, Ilya Zisman, Kirill Chernyshev
Date:2023-06-14 10:31:37

The majority of Multi-Agent Reinforcement Learning (MARL) literature equates the cooperation of self-interested agents in mixed environments to the problem of social welfare maximization, allowing agents to arbitrarily share rewards and private information. This results in agents that forgo their individual goals in favour of social good, which can potentially be exploited by selfish defectors. We argue that cooperation also requires agents' identities and boundaries to be respected by making sure that the emergent behaviour is an equilibrium, i.e., a convention that no agent can deviate from and receive higher individual payoffs. Inspired by advances in mechanism design, we propose to solve the problem of cooperation, defined as finding socially beneficial equilibrium, by using mediators. A mediator is a benevolent entity that may act on behalf of agents, but only for the agents that agree to it. We show how a mediator can be trained alongside agents with policy gradient to maximize social welfare subject to constraints that encourage agents to cooperate through the mediator. Our experiments in matrix and iterative games highlight the potential power of applying mediators in MARL.

Hierarchical Task Network Planning for Facilitating Cooperative Multi-Agent Reinforcement Learning

Authors:Xuechen Mu, Hankz Hankui Zhuo, Chen Chen, Kai Zhang, Chao Yu, Jianye Hao
Date:2023-06-14 08:51:43

Exploring sparse-reward multi-agent reinforcement learning (MARL) environments with traps in a collaborative manner is a complex task. Agents typically fail to reach the goal state and fall into traps, which affects the overall performance of the system. To overcome this issue, we present SOMARL, a framework that uses prior knowledge to reduce the exploration space and assist learning. In SOMARL, agents are treated as part of the MARL environment, and symbolic knowledge is embedded using a tree structure to build a knowledge hierarchy. The framework has a two-layer hierarchical structure, comprising a hybrid module with Hierarchical Task Network (HTN) planning and a meta-controller at the higher level, and a MARL-based interactive module at the lower level. The HTN module and meta-controller use Hierarchical Domain Definition Language (HDDL) and the option framework to formalize symbolic knowledge and obtain domain knowledge and a symbolic option set, respectively. Moreover, the HTN module leverages domain knowledge to guide low-level agent exploration by assisting the meta-controller in selecting symbolic options. The meta-controller further computes intrinsic rewards of symbolic options to limit exploration behavior and adjust HTN planning solutions as needed. We evaluate SOMARL on two benchmarks, FindTreasure and MoveBox, and show that it significantly outperforms state-of-the-art MARL and subgoal-based baselines for MARL environments.

Data Poisoning to Fake a Nash Equilibrium in Markov Games

Authors:Young Wu, Jeremy McMahan, Xiaojin Zhu, Qiaomin Xie
Date:2023-06-13 18:01:18

We characterize offline data poisoning attacks on Multi-Agent Reinforcement Learning (MARL), where an attacker may change a data set in an attempt to install a (potentially fictitious) unique Markov-perfect Nash equilibrium for a two-player zero-sum Markov game. We propose the unique Nash set, namely the set of games, specified by their Q functions, with a specific joint policy being the unique Nash equilibrium. The unique Nash set is central to poisoning attacks because the attack is successful if and only if data poisoning pushes all plausible games inside the set. The unique Nash set generalizes the reward polytope commonly used in inverse reinforcement learning to MARL. For zero-sum Markov games, both the unique Nash set and the set of plausible games induced by data are polytopes in the Q function space. We exhibit a linear program to efficiently compute the optimal poisoning attack. Our work sheds light on the structure of data poisoning attacks on offline MARL, a necessary step before one can design more robust MARL algorithms.

Provably Learning Nash Policies in Constrained Markov Potential Games

Authors:Pragnya Alatur, Giorgia Ramponi, Niao He, Andreas Krause
Date:2023-06-13 13:08:31

Multi-agent reinforcement learning (MARL) addresses sequential decision-making problems with multiple agents, where each agent optimizes its own objective. In many real-world instances, the agents may not only want to optimize their objectives, but also ensure safe behavior. For example, in traffic routing, each car (agent) aims to reach its destination quickly (objective) while avoiding collisions (safety). Constrained Markov Games (CMGs) are a natural formalism for safe MARL problems, though generally intractable. In this work, we introduce and study Constrained Markov Potential Games (CMPGs), an important class of CMGs. We first show that a Nash policy for CMPGs can be found via constrained optimization. One tempting approach is to solve it by Lagrangian-based primal-dual methods. As we show, however, in contrast to the single-agent setting, CMPGs do not satisfy strong duality, rendering such approaches inapplicable and potentially unsafe. To solve the CMPG problem, we propose our algorithm Coordinate-Ascent for CMPGs (CA-CMPG), which provably converges to a Nash policy in tabular, finite-horizon CMPGs. Furthermore, we provide the first sample complexity bounds for learning Nash policies in unknown CMPGs, which, under additional assumptions, guarantee safe exploration.

A Versatile Multi-Agent Reinforcement Learning Benchmark for Inventory Management

Authors:Xianliang Yang, Zhihao Liu, Wei Jiang, Chuheng Zhang, Li Zhao, Lei Song, Jiang Bian
Date:2023-06-13 05:22:30

Multi-agent reinforcement learning (MARL) models multiple agents that interact and learn within a shared environment. This paradigm is applicable to various industrial scenarios such as autonomous driving, quantitative trading, and inventory management. However, applying MARL to these real-world scenarios is impeded by many challenges such as scaling up, complex agent interactions, and non-stationary dynamics. To incentivize the research of MARL on these challenges, we develop MABIM (Multi-Agent Benchmark for Inventory Management) which is a multi-echelon, multi-commodity inventory management simulator that can generate versatile tasks with these different challenging properties. Based on MABIM, we evaluate the performance of classic operations research (OR) methods and popular MARL algorithms on these challenging tasks to highlight their weaknesses and potential.

Multi-Agent Reinforcement Learning Guided by Signal Temporal Logic Specifications

Authors:Jiangwei Wang, Shuo Yang, Ziyan An, Songyang Han, Zhili Zhang, Rahul Mangharam, Meiyi Ma, Fei Miao
Date:2023-06-11 23:53:29

Reward design is a key component of deep reinforcement learning, yet some tasks and designers' objectives may be unnatural to define as a scalar cost function. Among the various techniques, formal methods integrated with DRL have garnered considerable attention due to their expressiveness and flexibility in defining rewards and requirements for different states and actions of the agent. However, how to leverage Signal Temporal Logic (STL) to guide multi-agent reinforcement learning reward design remains unexplored. Complex interactions, heterogeneous goals and critical safety requirements in multi-agent systems make this problem even more challenging. In this paper, we propose a novel STL-guided multi-agent reinforcement learning framework. The STL requirements are designed to include both task specifications, according to the objective of each agent, and safety specifications, and the robustness values of the STL specifications are leveraged to generate rewards. We validate the advantages of our method through empirical studies. The experimental results demonstrate significant reward performance improvements compared to MARL without STL guidance, along with a remarkable increase in the overall safety rate of the multi-agent systems.
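
To make the reward-generation idea concrete, here is a minimal sketch of STL quantitative semantics for two simple formulas; the trace values and thresholds are illustrative assumptions, and real STL toolchains handle nested temporal operators that this sketch omits.

import numpy as np

def robustness_always_above(trace, threshold):
    # Robustness of the STL formula G(x >= threshold) over a finite trace:
    # positive means satisfied with margin, negative means violated.
    return np.min(trace - threshold)

def robustness_eventually_above(trace, threshold):
    # Robustness of F(x >= threshold): satisfied if the signal ever clears it.
    return np.max(trace - threshold)

distances = np.array([3.1, 2.5, 1.8, 2.2])                 # e.g., inter-agent distances
safety_reward = robustness_always_above(distances, 1.5)    # 0.3: safe, small margin
task_reward = robustness_eventually_above(distances, 3.0)  # 0.1: goal reached once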

Multi-agent Exploration with Sub-state Entropy Estimation

Authors:Jian Tao, Yang Zhang, Yangkun Chen, Xiu Li
Date:2023-06-10 08:41:57

Researchers have integrated exploration techniques into multi-agent reinforcement learning (MARL) algorithms, drawing on their remarkable success in deep reinforcement learning. Nonetheless, exploration in MARL presents a more substantial challenge, as agents need to coordinate their efforts in order to achieve comprehensive state coverage. Reaching unanimous agreement on which kinds of states warrant exploration can be a struggle for agents in this context. We introduce \textbf{M}ulti-agent \textbf{E}xploration based on \textbf{S}ub-state \textbf{E}ntropy (MESE) to address this limitation. This novel approach incentivizes agents to explore states cooperatively by directing them to achieve consensus via an extra team reward. Calculating the additional reward is based on the novelty of the current sub-state that merits cooperative exploration. MESE employs a conditioned entropy approach to select the sub-state, using particle-based entropy estimation to calculate the entropy. MESE is a plug-and-play module that can be seamlessly integrated into most existing MARL algorithms. Our experiments demonstrate that MESE can substantially improve MAPPO's performance on various tasks in the StarCraft multi-agent challenge (SMAC).
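
For reference, particle-based entropy estimation typically means a k-nearest-neighbour estimator of the kind sketched below (a standard Kozachenko-Leonenko form; the sample data, choice of k, and its use as a novelty signal are illustrative assumptions, not the paper's exact estimator).

import numpy as np
from scipy.special import digamma, gamma

def knn_entropy(x, k=3):
    # Kozachenko-Leonenko particle-based entropy estimate from k-NN distances.
    n, d = x.shape
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    r_k = np.sort(dists, axis=1)[:, k - 1]       # distance to the k-th neighbour
    log_c_d = (d / 2) * np.log(np.pi) - np.log(gamma(d / 2 + 1))  # unit-ball volume
    return digamma(n) - digamma(k) + log_c_d + d * np.mean(np.log(r_k + 1e-12))

rng = np.random.default_rng(0)
samples = rng.standard_normal((500, 2))          # stand-in for visited sub-states
print(knn_entropy(samples))  # close to log(2*pi*e) ~ 2.84 for a 2D standard Gaussian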

iPLAN: Intent-Aware Planning in Heterogeneous Traffic via Distributed Multi-Agent Reinforcement Learning

Authors:Xiyang Wu, Rohan Chandra, Tianrui Guan, Amrit Singh Bedi, Dinesh Manocha
Date:2023-06-09 20:12:02

Navigating safely and efficiently in dense and heterogeneous traffic scenarios is challenging for autonomous vehicles (AVs) due to their inability to infer the behaviors or intentions of nearby drivers. In this work, we introduce a distributed multi-agent reinforcement learning (MARL) algorithm that can predict trajectories and intents in dense and heterogeneous traffic scenarios. Our approach for intent-aware planning, iPLAN, allows agents to infer nearby drivers' intents solely from their local observations. We model two distinct incentives for agents' strategies: Behavioral Incentive for high-level decision-making based on their driving behavior or personality and Instant Incentive for motion planning for collision avoidance based on the current traffic state. Our approach enables agents to infer their opponents' behavior incentives and integrate this inferred information into their decision-making and motion-planning processes. We perform experiments on two simulation environments, Non-Cooperative Navigation and Heterogeneous Highway. In Heterogeneous Highway, results show that, compared with centralized training decentralized execution (CTDE) MARL baselines such as QMIX and MAPPO, our method yields a 4.3% and 38.4% higher episodic reward in mild and chaotic traffic, with 48.1% higher success rate and 80.6% longer survival time in chaotic traffic. We also compare with a decentralized training decentralized execution (DTDE) baseline IPPO and demonstrate a higher episodic reward of 12.7% and 6.3% in mild traffic and chaotic traffic, 25.3% higher success rate, and 13.7% longer survival time.

Robustness Testing for Multi-Agent Reinforcement Learning: State Perturbations on Critical Agents

Authors:Ziyuan Zhou, Guanjun Liu
Date:2023-06-09 02:26:28

Multi-Agent Reinforcement Learning (MARL) has been widely applied in many fields such as smart traffic and unmanned aerial vehicles. However, most MARL algorithms are vulnerable to adversarial perturbations on agent states. Robustness testing for a trained model is an essential step in confirming the trustworthiness of the model against unexpected perturbations. This work proposes a novel Robustness Testing framework for MARL that attacks the states of Critical Agents (RTCA). RTCA has two innovations: 1) a Differential Evolution (DE) based method to select critical agents as victims and to determine the worst-case joint actions for them; and 2) a team cooperation policy evaluation method employed as the objective function for the optimization of DE. Adversarial state perturbations of the critical agents are then generated based on the worst-case joint actions. This is the first robustness testing framework with varying victim agents. RTCA demonstrates outstanding performance in terms of the number of victim agents required and its ability to degrade cooperation policies.

Negotiated Reasoning: On Provably Addressing Relative Over-Generalization

Authors:Junjie Sheng, Wenhao Li, Bo Jin, Hongyuan Zha, Jun Wang, Xiangfeng Wang
Date:2023-06-08 16:57:12

Over-generalization is a thorny issue in cognitive science, where people may become overly cautious due to past experiences. Agents in multi-agent reinforcement learning (MARL) have also been found to suffer from relative over-generalization (RO), as people do, and to get stuck in sub-optimal cooperation. Recent methods have shown that assigning reasoning ability to agents can mitigate RO algorithmically and empirically, but there has been a lack of theoretical understanding of RO, let alone of designing provably RO-free methods. This paper first proves that RO can be avoided when the MARL method satisfies a consistent reasoning requirement under certain conditions. We then introduce a novel reasoning framework, called negotiated reasoning, that builds the connection between reasoning and RO with theoretical justifications. After that, we propose an instantiated algorithm, Stein variational negotiated reasoning (SVNR), which uses Stein variational gradient descent to derive a negotiation policy that provably avoids RO in MARL under maximum entropy policy iteration. The method is further parameterized with neural networks for amortized learning, making computation efficient. Numerical experiments on many RO-challenged environments demonstrate the superiority and efficiency of SVNR compared to state-of-the-art methods in addressing RO.

Progression Cognition Reinforcement Learning with Prioritized Experience for Multi-Vehicle Pursuit

Authors:Xinhang Li, Yiying Yang, Zheng Yuan, Zhe Wang, Qinwen Wang, Chen Xu, Lei Li, Jianhua He, Lin Zhang
Date:2023-06-08 08:10:46

Multi-vehicle pursuit (MVP), such as autonomous police vehicles pursuing suspects, is important but very challenging due to its mission- and safety-critical nature. While multi-agent reinforcement learning (MARL) algorithms have been proposed for the MVP problem in structured grid-pattern roads, the existing algorithms use randomly selected training samples in centralized learning, which leads to homogeneous agents with low collaboration performance. For the more challenging problem of pursuing multiple evading vehicles, these algorithms typically select a fixed target evading vehicle for the pursuing vehicles without considering the dynamic traffic situation, which significantly reduces the pursuing success rate. To address the above problems, this paper proposes Progression Cognition Reinforcement Learning with Prioritized Experience for MVP (PEPCRL-MVP) in urban multi-intersection dynamic traffic scenes. PEPCRL-MVP uses a prioritization network to assess the transitions in the global experience replay buffer according to the parameters of each MARL agent. With the personalized and prioritized experience set selected via the prioritization network, diversity is introduced into the learning process of MARL, which can improve collaboration and task-related performance. Furthermore, PEPCRL-MVP employs an attention module to extract critical features from complex urban traffic environments. These features are used to develop a progression cognition method that adaptively groups pursuing vehicles. Each group efficiently targets one evading vehicle in dynamic driving environments. Extensive experiments conducted with a simulator over unstructured roads of an urban area show that PEPCRL-MVP is superior to other state-of-the-art methods. Specifically, PEPCRL-MVP improves pursuing efficiency by 3.95% over TD3-DMAP and its success rate is 34.78% higher than that of MADDPG. Codes are open sourced.
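
As a minimal sketch of the prioritized-replay idea the method builds on (the priority values, exponent alpha, and simplified importance weights are illustrative assumptions, not the paper's learned prioritization network):

import numpy as np

def prioritized_sample(priorities, batch_size, alpha=0.6, rng=None):
    # Sample transition indices with probability proportional to priority^alpha,
    # so transitions with large TD error are replayed more often than uniformly.
    rng = rng or np.random.default_rng()
    p = np.asarray(priorities, dtype=float) ** alpha
    p /= p.sum()
    idx = rng.choice(len(p), size=batch_size, p=p)
    weights = (len(p) * p[idx]) ** -1.0   # importance weights correct the sampling bias
    return idx, weights / weights.max()

priorities = [0.1, 2.0, 0.5, 3.0, 0.2]    # e.g., per-transition TD-error magnitudes
idx, w = prioritized_sample(priorities, batch_size=3, rng=np.random.default_rng(0))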

A Novel Multi-Agent Deep RL Approach for Traffic Signal Control

Authors:Shijie Wang, Shangbo Wang
Date:2023-06-05 08:20:37

As travel demand increases and urban traffic conditions become more complicated, applying multi-agent deep reinforcement learning (MARL) to traffic signal control has become a hot topic. The rise of Reinforcement Learning (RL) has opened up opportunities for solving Adaptive Traffic Signal Control (ATSC) in complex urban traffic networks, and deep neural networks have further enhanced their ability to handle complex data. Traditional research in traffic signal control is based on centralized Reinforcement Learning techniques. However, in a large-scale road network, centralized RL is infeasible because of the exponential growth of the joint state-action space. In this paper, we propose a Friend-Deep Q-network (Friend-DQN) approach for multiple traffic signal control in urban networks, which is based on an agent-cooperation scheme. In particular, the cooperation between multiple agents can reduce the state-action space and thus speed up convergence. We use the SUMO (Simulation of Urban MObility) platform to evaluate the performance of the Friend-DQN model, and show its feasibility and superiority over other existing methods.

A Unified Framework for Factorizing Distributional Value Functions for Multi-Agent Reinforcement Learning

Authors:Wei-Fang Sun, Cheng-Kuang Lee, Simon See, Chun-Yi Lee
Date:2023-06-04 18:26:25

In fully cooperative multi-agent reinforcement learning (MARL) settings, environments are highly stochastic due to the partial observability of each agent and the continuously changing policies of other agents. To address these issues, we propose a unified framework, called DFAC, for integrating distributional RL with value function factorization methods. This framework generalizes expected value function factorization methods to enable the factorization of return distributions. To validate DFAC, we first demonstrate its ability to factorize the value functions of a simple matrix game with stochastic rewards. Then, we perform experiments on all Super Hard maps of the StarCraft Multi-Agent Challenge and six self-designed Ultra Hard maps, showing that DFAC is able to outperform a number of baselines.

Model-aided Federated Reinforcement Learning for Multi-UAV Trajectory Planning in IoT Networks

Authors:Jichao Chen, Omid Esrafilian, Harald Bayerlein, David Gesbert, Marco Caccamo
Date:2023-06-03 07:16:17

Deploying teams of unmanned aerial vehicles (UAVs) to harvest data from distributed Internet of Things (IoT) devices requires efficient trajectory planning and coordination algorithms. Multi-agent reinforcement learning (MARL) has emerged as a solution, but requires extensive and costly real-world training data. To tackle this challenge, we propose a novel model-aided federated MARL algorithm to coordinate multiple UAVs on a data harvesting mission with only limited knowledge about the environment. The proposed algorithm alternates between building an environment simulation model from real-world measurements, specifically learning the radio channel characteristics and estimating unknown IoT device positions, and federated QMIX training in the simulated environment. Each UAV agent trains a local QMIX model in its simulated environment and continuously consolidates it through federated learning with other agents, accelerating the learning process. A performance comparison with standard MARL algorithms demonstrates that our proposed model-aided FedQMIX algorithm reduces the need for real-world training experiences by around three orders of magnitude while attaining similar data collection performance.
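
The federated consolidation step can be pictured with a FedAvg-style parameter average, as in the minimal sketch below (the layer shapes and synchronous broadcast are illustrative assumptions; the paper's exact aggregation scheme may differ):

import numpy as np

def federated_average(local_params):
    # FedAvg-style consolidation: element-wise mean of each parameter array
    # across the agents' local models.
    return [np.mean(layer, axis=0) for layer in zip(*local_params)]

rng = np.random.default_rng(0)
# Three UAV agents, each holding two toy arrays standing in for local model weights.
agents = [[rng.standard_normal((4, 4)), rng.standard_normal(4)] for _ in range(3)]
global_params = federated_average(agents)
agents = [[p.copy() for p in global_params] for _ in range(3)]  # broadcast averaged model back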

MA2CL:Masked Attentive Contrastive Learning for Multi-Agent Reinforcement Learning

Authors:Haolin Song, Mingxiao Feng, Wengang Zhou, Houqiang Li
Date:2023-06-03 05:32:19

Recent approaches have utilized self-supervised auxiliary tasks as representation learning to improve the performance and sample efficiency of vision-based reinforcement learning algorithms in single-agent settings. However, in multi-agent reinforcement learning (MARL), these techniques face challenges because each agent only receives a partial observation from an environment influenced by others, resulting in correlated observations in the agent dimension. It is therefore necessary to consider agent-level information in representation learning for MARL. In this paper, we propose an effective framework called \textbf{M}ulti-\textbf{A}gent \textbf{M}asked \textbf{A}ttentive \textbf{C}ontrastive \textbf{L}earning (MA2CL), which encourages learned representations to be both temporal- and agent-level predictive by reconstructing the masked agent observation in latent space. Specifically, we use an attention-based reconstruction model for recovery, and the model is trained via contrastive learning. MA2CL allows better utilization of contextual information at the agent level, facilitating the training of MARL agents for cooperation tasks. Extensive experiments demonstrate that our method significantly improves the performance and sample efficiency of different MARL algorithms and outperforms other methods in various vision-based and state-based scenarios. Our code can be found in \url{https://github.com/ustchlsong/MA2CL}

Context-Aware Bayesian Network Actor-Critic Methods for Cooperative Multi-Agent Reinforcement Learning

Authors:Dingyang Chen, Qi Zhang
Date:2023-06-02 21:22:27

Executing actions in a correlated manner is a common strategy for human coordination that often leads to better cooperation, which is also potentially beneficial for cooperative multi-agent reinforcement learning (MARL). However, the recent success of MARL relies heavily on the convenient paradigm of purely decentralized execution, where there is no action correlation among agents for scalability considerations. In this work, we introduce a Bayesian network to inaugurate correlations between agents' action selections in their joint policy. Theoretically, we justify why action dependencies are beneficial by deriving the multi-agent policy gradient formula under such a Bayesian network joint policy and proving its global convergence to Nash equilibria under tabular softmax policy parameterization in cooperative Markov games. Further, by equipping existing MARL algorithms with a recent method of differentiable directed acyclic graphs (DAGs), we develop practical algorithms to learn context-aware Bayesian network policies in partially observable scenarios of varying difficulty. We also dynamically decrease the sparsity of the learned DAG throughout the training process, which leads to weakly or even purely independent policies for decentralized execution. Empirical results on a range of MARL benchmarks show the benefits of our approach.

AccMER: Accelerating Multi-Agent Experience Replay with Cache Locality-aware Prioritization

Authors:Kailash Gogineni, Yongsheng Mei, Peng Wei, Tian Lan, Guru Venkataramani
Date:2023-05-31 21:06:34

Multi-Agent Experience Replay (MER) is a key component of off-policy reinforcement learning (RL) algorithms. By remembering and reusing experiences from the past, experience replay significantly improves the stability of RL algorithms and their learning efficiency. In many scenarios, multiple agents interact in a shared environment during online training under the centralized training and decentralized execution (CTDE) paradigm. Current multi-agent reinforcement learning (MARL) algorithms consider experience replay with uniform sampling or priority weights to improve transition data sample efficiency in the sampling phase. However, moving transition data histories for each agent through the processor memory hierarchy is a performance limiter. Also, as the agents' transitions continuously renew every iteration, the finite cache capacity results in increased cache misses. To this end, we propose AccMER, which repeatedly reuses the transitions (experiences) for a window of $n$ steps in order to improve cache locality and minimize transition data movement, instead of sampling new transitions at each step. Specifically, our optimization uses priority weights to select the transitions so that only high-priority transitions are reused frequently, thereby improving cache performance. Our experimental results on the Predator-Prey environment demonstrate the effectiveness of reusing the essential transitions based on priority weights, where we observe an end-to-end training time reduction of 25.4% (for 32 agents) compared to existing prioritized MER algorithms, without notable degradation in the mean reward.
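
A minimal sketch of the reuse-window idea (buffer size, window length, and batch size are illustrative assumptions; the actual system operates on per-agent transition histories in the memory hierarchy):

import numpy as np

rng = np.random.default_rng(0)
REUSE_WINDOW = 4                 # reuse the same sampled batch for n update steps

priorities = rng.random(10_000)  # per-transition priority weights (illustrative)
cached_idx = None
for step in range(100):
    if step % REUSE_WINDOW == 0:
        # Pay the cost of touching the full buffer only every n-th step; in
        # between, the cached batch stays hot in cache instead of streaming
        # fresh transitions through the memory hierarchy.
        p = priorities / priorities.sum()
        cached_idx = rng.choice(len(p), size=32, p=p)
    batch = cached_idx           # the learner gathers these transitions and updates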

Is Centralized Training with Decentralized Execution Framework Centralized Enough for MARL?

Authors:Yihe Zhou, Shunyu Liu, Yunpeng Qing, Kaixuan Chen, Tongya Zheng, Yanhao Huang, Jie Song, Mingli Song
Date:2023-05-27 03:15:24

Centralized Training with Decentralized Execution (CTDE) has recently emerged as a popular framework for cooperative Multi-Agent Reinforcement Learning (MARL), where agents can use additional global state information to guide training in a centralized way and make their own decisions based only on decentralized local policies. Despite the encouraging results achieved, CTDE makes an independence assumption on agent policies, which prevents agents from adopting global cooperative information from each other during centralized training. Therefore, we argue that existing CTDE methods cannot fully utilize global information for training, leading to inefficient joint-policy exploration and even suboptimal results. In this paper, we introduce a novel Centralized Advising and Decentralized Pruning (CADP) framework for multi-agent reinforcement learning that not only enables an efficacious message exchange among agents during training but also guarantees independent policies for execution. Firstly, CADP endows agents with an explicit communication channel to seek and take advice from other agents for more centralized training. To further ensure decentralized execution, we propose a smooth model pruning mechanism to progressively constrain agent communication into a closed channel without degradation in agent cooperation capability. Empirical evaluations on StarCraft II micromanagement and Google Research Football benchmarks demonstrate that the proposed framework achieves superior performance compared with state-of-the-art counterparts. Our code will be made publicly available.

A Model-Based Solution to the Offline Multi-Agent Reinforcement Learning Coordination Problem

Authors:Paul Barde, Jakob Foerster, Derek Nowrouzezahrai, Amy Zhang
Date:2023-05-26 18:43:16

Training multiple agents to coordinate is an essential problem with applications in robotics, game theory, economics, and social sciences. However, most existing Multi-Agent Reinforcement Learning (MARL) methods are online and thus impractical for real-world applications in which collecting new interactions is costly or dangerous. While these algorithms should leverage offline data when available, doing so gives rise to what we call the offline coordination problem. Specifically, we identify and formalize the strategy agreement (SA) and the strategy fine-tuning (SFT) coordination challenges, two issues at which current offline MARL algorithms fail. Concretely, we reveal that the prevalent model-free methods are severely deficient and cannot handle coordination-intensive offline multi-agent tasks in either toy or MuJoCo domains. To address this setback, we emphasize the importance of inter-agent interactions and propose the very first model-based offline MARL method. Our resulting algorithm, Model-based Offline Multi-Agent Proximal Policy Optimization (MOMA-PPO) generates synthetic interaction data and enables agents to converge on a strategy while fine-tuning their policies accordingly. This simple model-based solution solves the coordination-intensive offline tasks, significantly outperforming the prevalent model-free methods even under severe partial observability and with learned world models.

Towards Efficient Multi-Agent Learning Systems

Authors:Kailash Gogineni, Peng Wei, Tian Lan, Guru Venkataramani
Date:2023-05-22 18:51:12

Multi-Agent Reinforcement Learning (MARL) is an increasingly important research field that can model and control multiple large-scale autonomous systems. Despite its achievements, existing multi-agent learning methods typically involve expensive computations in terms of training time and power, arising from the large observation-action space and a huge number of training steps. Therefore, a key challenge is understanding and characterizing the computationally intensive functions in several popular classes of MARL algorithms during their training phases. Our preliminary experiments reveal new insights into the key modules of MARL algorithms that limit the adoption of MARL in real-world systems. We explore a neighbor sampling strategy to improve cache locality and observe performance improvements ranging from 26.66% (3 agents) to 27.39% (12 agents) during the computationally intensive mini-batch sampling phase. Additionally, we demonstrate that improving the locality leads to an end-to-end training time reduction of 10.2% (for 12 agents) compared to existing multi-agent algorithms, without significant degradation in the mean reward.

Byzantine Robust Cooperative Multi-Agent Reinforcement Learning as a Bayesian Game

Authors:Simin Li, Jun Guo, Jingqiao Xiu, Ruixiao Xu, Xin Yu, Jiakai Wang, Aishan Liu, Yaodong Yang, Xianglong Liu
Date:2023-05-22 09:50:59

In this study, we explore the robustness of cooperative multi-agent reinforcement learning (c-MARL) against Byzantine failures, where any agent can enact arbitrary, worst-case actions due to malfunction or adversarial attack. To address the uncertainty that any agent can be adversarial, we propose a Bayesian Adversarial Robust Dec-POMDP (BARDec-POMDP) framework, which views Byzantine adversaries as nature-dictated types, represented by a separate transition. This allows agents to learn policies grounded in their posterior beliefs about the type of other agents, fostering collaboration with identified allies and minimizing vulnerability to adversarial manipulation. We define the optimal solution to the BARDec-POMDP as an ex post robust Bayesian Markov perfect equilibrium, which we prove exists and weakly dominates the equilibrium of previous robust MARL approaches. To realize this equilibrium, we put forward a two-timescale actor-critic algorithm with almost sure convergence under specific conditions. Experimentation on matrix games, level-based foraging and StarCraft II indicates that, even under worst-case perturbations, our method successfully acquires intricate micromanagement skills and adaptively aligns with allies, demonstrating resilience against non-oblivious adversaries, random allies, observation-based attacks, and transfer-based attacks.

Semantically Aligned Task Decomposition in Multi-Agent Reinforcement Learning

Authors:Wenhao Li, Dan Qiao, Baoxiang Wang, Xiangfeng Wang, Bo Jin, Hongyuan Zha
Date:2023-05-18 10:37:54

The difficulty of appropriately assigning credit is particularly heightened in cooperative MARL with sparse rewards, due to the concurrent time and structural scales involved. Automatic subgoal generation (ASG) has recently emerged as a viable MARL approach inspired by the use of subgoals in intrinsically motivated reinforcement learning. However, end-to-end learning of complex task planning from sparse rewards without prior knowledge undoubtedly requires massive training samples. Moreover, the diversity-promoting nature of existing ASG methods can lead to the "over-representation" of subgoals, generating numerous spurious subgoals of limited relevance to the actual task reward and thus decreasing the sample efficiency of the algorithm. To address this problem, and inspired by disentangled representation learning, we propose a novel "disentangled" decision-making method, Semantically Aligned task decomposition in MARL (SAMA), that prompts pretrained language models with chain-of-thought to suggest potential goals, provide suitable goal decomposition and subgoal allocation, and perform self-reflection-based replanning. Additionally, SAMA incorporates language-grounded RL to train each agent's subgoal-conditioned policy. SAMA demonstrates considerable advantages in sample efficiency compared to state-of-the-art ASG methods, as evidenced by its performance on two challenging sparse-reward tasks, Overcooked and MiniRTS.

Explainable Multi-Agent Reinforcement Learning for Temporal Queries

Authors:Kayla Boggess, Sarit Kraus, Lu Feng
Date:2023-05-17 17:04:29

As multi-agent reinforcement learning (MARL) systems are increasingly deployed throughout society, it is imperative yet challenging for users to understand the emergent behaviors of MARL agents in complex environments. This work presents an approach for generating policy-level contrastive explanations for MARL to answer a temporal user query, which specifies a sequence of tasks completed by agents with possible cooperation. The proposed approach encodes the temporal query as a PCTL logic formula and checks if the query is feasible under a given MARL policy via probabilistic model checking. Such explanations can help reconcile discrepancies between the actual and anticipated multi-agent behaviors. The proposed approach also generates correct and complete explanations to pinpoint reasons that make a user query infeasible. We have successfully applied the proposed approach to four benchmark MARL domains (up to 9 agents in one domain). Moreover, the results of a user study show that the generated explanations significantly improve user performance and satisfaction.

Multi-Agent Reinforcement Learning: Methods, Applications, Visionary Prospects, and Challenges

Authors:Ziyuan Zhou, Guanjun Liu, Ying Tang
Date:2023-05-17 09:53:13

Multi-agent reinforcement learning (MARL) is a widely used Artificial Intelligence (AI) technique. However, current studies and applications need to address its scalability, non-stationarity, and trustworthiness. This paper aims to review methods and applications and point out research trends and visionary prospects for the next decade. First, this paper summarizes the basic methods and application scenarios of MARL. Second, this paper outlines the corresponding research methods and their limitations on safety, robustness, generalization, and ethical constraints that need to be addressed in the practical applications of MARL. In particular, we believe that trustworthy MARL will become a hot research topic in the next decade. In addition, we suggest that considering human interaction is essential for the practical application of MARL in various societies. Therefore, this paper also analyzes the challenges that arise when MARL is applied to human-machine interaction.

An Empirical Study on Google Research Football Multi-agent Scenarios

Authors:Yan Song, He Jiang, Zheng Tian, Haifeng Zhang, Yingping Zhang, Jiangcheng Zhu, Zonghong Dai, Weinan Zhang, Jun Wang
Date:2023-05-16 14:18:53

Little multi-agent reinforcement learning (MARL) research on Google Research Football (GRF) focuses on the 11v11 multi-agent full-game scenario, and to the best of our knowledge, no open benchmark on this scenario has been released to the public. In this work, we fill the gap by providing a population-based MARL training pipeline and hyperparameter settings on the multi-agent football scenario that outperform the difficulty-1.0 bot from scratch within 2 million steps. Our experiments serve as a reference for the expected performance of Independent Proximal Policy Optimization (IPPO), a state-of-the-art multi-agent reinforcement learning algorithm in which each agent optimizes its own policy independently, across various training configurations. Meanwhile, we open-source our training framework Light-MALib, which extends the MALib codebase with a distributed and asynchronous implementation and additional analytical tools for football games. Finally, we provide guidance for building strong football AI with population-based training and release diverse pretrained policies for benchmarking. The goal is to give the community a head start for experimenting on GRF and a simple-to-use population-based training framework for further improving agents through self-play. The implementation is available at https://github.com/Shanghai-Digital-Brain-Laboratory/DB-Football.

More Like Real World Game Challenge for Partially Observable Multi-Agent Cooperation

Authors:Meng Yao, Xueou Feng, Qiyue Yin
Date:2023-05-15 07:19:42

Some standardized environments have been designed for partially observable multi-agent cooperation, but we find most current environments are synchronous, whereas real-world agents often have their own action spaces, leading to asynchrony. Furthermore, a fixed number of agents limits the scalability of the action space, whereas in reality the number of agents can change, resulting in a flexible action space. In addition, current environments are balanced, which is not always the case in the real world, where there may be an ability gap between different parties, leading to asymmetry. Finally, current environments tend to have little stochasticity with simple state transitions, whereas real-world environments can be highly stochastic and extremely risky. To address this gap, we propose the WarGame Challenge (WGC), inspired by wargames. WGC is a lightweight, flexible, and easy-to-use environment with a clear framework that can be easily configured by users. Along with the benchmark, we provide MARL baseline algorithms such as QMIX and a toolkit to help algorithms complete performance tests on WGC. Finally, we present baseline experiment results, which demonstrate the challenges of WGC. We think WGC enriches the partially observable multi-agent cooperation domain and introduces more challenges that better reflect real-world characteristics. Code is released at http://turingai.ia.ac.cn/data_center/show/10.

Stackelberg Decision Transformer for Asynchronous Action Coordination in Multi-Agent Systems

Authors:Bin Zhang, Hangyu Mao, Lijuan Li, Zhiwei Xu, Dapeng Li, Rui Zhao, Guoliang Fan
Date:2023-05-13 07:29:31

Asynchronous action coordination presents a pervasive challenge in Multi-Agent Systems (MAS), which can be represented as a Stackelberg game (SG). However, the scalability of existing Multi-Agent Reinforcement Learning (MARL) methods based on SG is severely constrained by network structures or environmental limitations. To address this issue, we propose the Stackelberg Decision Transformer (STEER), a heuristic approach that resolves the difficulties of hierarchical coordination among agents. STEER efficiently manages decision-making processes in both spatial and temporal contexts by incorporating the hierarchical decision structure of SG, the modeling capability of autoregressive sequence models, and the exploratory learning methodology of MARL. Our research contributes to the development of an effective and adaptable asynchronous action coordination method that can be widely applied to various task types and environmental configurations in MAS. Experimental results demonstrate that our method can converge to Stackelberg equilibrium solutions and outperforms other existing methods in complex scenarios.

Multi-Agent Reinforcement Learning for Network Routing in Integrated Access Backhaul Networks

Authors:Shahaf Yamin, Haim Permuter
Date:2023-05-12 13:03:26

We investigate the problem of wireless routing in integrated access backhaul (IAB) networks consisting of fiber-connected and wireless base stations and multiple users. The physical constraints of these networks prevent the use of a central controller, and base stations have limited access to real-time network conditions. We aim to maximize the packet arrival ratio while minimizing latency. For this purpose, we formulate the problem as a multi-agent partially observed Markov decision process (POMDP). To solve this problem, we develop a Relational Advantage Actor Critic (Relational A2C) algorithm that uses Multi-Agent Reinforcement Learning (MARL) and information about similar destinations to derive a joint routing policy on a distributed basis. We present three training paradigms for this algorithm and demonstrate its ability to achieve near-centralized performance. Our results show that Relational A2C outperforms other reinforcement learning algorithms, leading to increased network efficiency and reduced selfish agent behavior. To the best of our knowledge, this work is the first to optimize the routing strategy for IAB networks.

Boosting Value Decomposition via Unit-Wise Attentive State Representation for Cooperative Multi-Agent Reinforcement Learning

Authors:Qingpeng Zhao, Yuanyang Zhu, Zichuan Liu, Zhi Wang, Chunlin Chen
Date:2023-05-12 00:33:22

In cooperative multi-agent reinforcement learning (MARL), environmental stochasticity and uncertainty grow exponentially with the number of agents, which makes it hard to derive a compact latent representation from partial observations for boosting value decomposition. To tackle these issues, we propose a simple yet powerful method that alleviates partial observability and efficiently promotes coordination by introducing the UNit-wise attentive State Representation (UNSR). In UNSR, each agent learns a compact and disentangled unit-wise state representation produced by transformer blocks, and uses it to compute its local action-value function. The proposed UNSR is used to boost value decomposition with a multi-head attention mechanism that produces efficient credit assignment in the mixing network, providing an efficient reasoning path between the individual value functions and the joint value function. Experimental results demonstrate that our method achieves superior performance and data efficiency compared to strong baselines on the StarCraft II micromanagement challenge. Additional ablation experiments help identify the key factors contributing to UNSR's performance.
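
As a rough illustration of the unit-wise idea, the sketch below (PyTorch; the layer sizes, the two-block encoder, and the mean-pooling fusion are our assumptions, not the authors' exact architecture) encodes per-unit observation slices with transformer blocks and emits a compact state representation plus local action values.

```python
import torch
import torch.nn as nn

class UnitWiseEncoder(nn.Module):
    """Illustrative unit-wise attentive encoder (not the paper's exact model)."""

    def __init__(self, obs_dim, n_actions, embed_dim=64, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(obs_dim, embed_dim)
        block = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=2)
        self.q_head = nn.Linear(embed_dim, n_actions)

    def forward(self, unit_obs):
        # unit_obs: (batch, n_units, obs_dim), per-unit slices of the
        # agent's partial observation
        h = self.blocks(self.embed(unit_obs))   # unit-wise representations
        state = h.mean(dim=1)                   # compact per-agent summary
        return state, self.q_head(state)        # representation + local Q

state, q = UnitWiseEncoder(obs_dim=10, n_actions=5)(torch.randn(2, 8, 10))
```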

Learning Optimal "Pigovian Tax" in Sequential Social Dilemmas

Authors:Yun Hua, Shang Gao, Wenhao Li, Bo Jin, Xiangfeng Wang, Hongyuan Zha
Date:2023-05-10 15:04:11

In multi-agent reinforcement learning, each agent acts to maximize its individual accumulated rewards. Nevertheless, individual accumulated rewards cannot fully reflect how others perceive an agent's actions, resulting in selfish behaviors that undermine global performance. Externality theory, which holds that ``the activities of one economic actor affect the activities of another in ways that are not reflected in market transactions,'' is well suited to analyzing the social dilemmas in MARL. One of its most profound non-market solutions, the ``Pigovian Tax'', which internalizes externalities by taxing those who create negative externalities and subsidizing those who create positive externalities, can serve as a mechanism to resolve MARL's social dilemmas. To internalize the externalities in MARL, the \textbf{L}earning \textbf{O}ptimal \textbf{P}igovian \textbf{T}ax method (LOPT) is proposed, where an additional agent learns the tax/allowance allocation policy so as to approximate the optimal ``Pigovian Tax'' that accurately reflects the externalities for all agents. Furthermore, a reward shaping mechanism based on the approximated optimal ``Pigovian Tax'' is applied to reduce each agent's social cost and alleviate the social dilemmas. Compared with existing state-of-the-art methods, the proposed LOPT achieves higher collective social welfare in both the Escape Room and the Cleanup environments, which shows the superiority of our method in solving social dilemmas.
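
The shaping step itself is simple; below is a toy, budget-balanced version, where the per-agent taxes would come from the learned planner agent (here just an input array). This illustrates the mechanism only, not the LOPT training loop.

```python
import numpy as np

def pigovian_shaping(rewards, taxes):
    """Tax agents that create negative externalities and redistribute
    the proceeds uniformly as allowances (budget-balanced toy version)."""
    rewards = np.asarray(rewards, dtype=float)
    taxes = np.asarray(taxes, dtype=float)
    allowances = np.full_like(taxes, taxes.sum() / taxes.size)
    return rewards - taxes + allowances

# e.g. agent 0 is taxed for a negative externality; all share the proceeds
print(pigovian_shaping(rewards=[1.0, 0.5, 2.0], taxes=[0.6, 0.0, 0.0]))
```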

Fast Teammate Adaptation in the Presence of Sudden Policy Change

Authors:Ziqian Zhang, Lei Yuan, Lihe Li, Ke Xue, Chengxing Jia, Cong Guan, Chao Qian, Yang Yu
Date:2023-05-10 05:42:47

In cooperative multi-agent reinforcement learning (MARL), where an agent coordinates with teammate(s) toward a shared goal, it may suffer from non-stationarity caused by the policy changes of its teammates. Prior works mainly concentrate on policy changes during the training phase or teammates changing across episodes, ignoring the fact that teammates' policies may change suddenly within an episode, which can lead to miscoordination and poor performance. We formulate the problem as an open Dec-POMDP, where we control some agents to coordinate with uncontrolled teammates whose policies can change within one episode. We then develop a new framework, fast teammate adaptation (Fastap), to address the problem. Concretely, we first train versatile teammate policies and assign them to different clusters via the Chinese Restaurant Process (CRP). Then, we train the controlled agent(s) to coordinate with the sampled uncontrolled teammates by capturing their identities as context for fast adaptation. Finally, each agent applies its local information to anticipate the teammates' context for decision-making. This process proceeds alternately, leading to a robust policy that can adapt to any teammates during the decentralized execution phase. We show on multiple multi-agent benchmarks that Fastap achieves superior performance over multiple baselines in both stationary and non-stationary scenarios.
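
For readers unfamiliar with the Chinese Restaurant Process used here for clustering, a minimal sequential sampler looks like the following (the concentration parameter alpha and the sequential arrival of policies are illustrative; Fastap's actual clustering operates on learned policy representations).

```python
import numpy as np

def crp_assign(n_policies, alpha=1.0, seed=0):
    """Sequentially assign items to clusters under a CRP prior."""
    rng = np.random.default_rng(seed)
    sizes, labels = [], []
    for n in range(n_policies):
        # join cluster k with prob size_k / (n + alpha), new one with alpha / (n + alpha)
        probs = np.array(sizes + [alpha], dtype=float) / (n + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(sizes):
            sizes.append(1)      # open a new cluster
        else:
            sizes[k] += 1        # join an existing cluster
        labels.append(k)
    return labels

print(crp_assign(10))
```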

Robust multi-agent coordination via evolutionary generation of auxiliary adversarial attackers

Authors:Lei Yuan, Zi-Qian Zhang, Ke Xue, Hao Yin, Feng Chen, Cong Guan, Li-He Li, Chao Qian, Yang Yu
Date:2023-05-10 05:29:47

Cooperative multi-agent reinforcement learning (CMARL) has been shown to be promising for many real-world applications. Previous works mainly focus on improving coordination ability by solving MARL-specific challenges (e.g., non-stationarity, credit assignment, scalability), but ignore the policy perturbation issue that arises when testing in a different environment. This issue has not been considered in problem formulation or efficient algorithm design. To address it, we first model the problem as a limited policy adversary Dec-POMDP (LPA-Dec-POMDP), where some coordinators in a team might accidentally and unpredictably encounter a limited number of malicious action attacks, while the regular coordinators still strive for the intended goal. Then, we propose Robust Multi-Agent Coordination via Evolutionary Generation of Auxiliary Adversarial Attackers (ROMANCE), which exposes the trained policy to diversified and strong auxiliary adversarial attacks during training, thus achieving high robustness under various policy perturbations. Concretely, to avoid the ego-system overfitting to a specific attacker, we maintain a set of attackers optimized to guarantee high attacking quality and behavioral diversity. The quality objective is to minimize the ego-system's coordination effectiveness, and a novel diversity regularizer based on sparse action is applied to diversify the behaviors among attackers. The ego-system is then paired with a population of attackers selected from the maintained attacker set, and alternately trained against the constantly evolving attackers. Extensive experiments on multiple scenarios from SMAC indicate that ROMANCE provides comparable or better robustness and generalization ability than other baselines.

Mixture of personality improved Spiking actor network for efficient multi-agent cooperation

Authors:Xiyun Li, Ziyi Ni, Jingqing Ruan, Linghui Meng, Jing Shi, Tielin Zhang, Bo Xu
Date:2023-05-10 05:01:59

Adaptive human-agent and agent-agent cooperation is becoming more and more critical in multi-agent reinforcement learning (MARL), where remarkable progress has been made with the help of deep neural networks. However, many established algorithms perform well only during training and generalize poorly when cooperating with unseen partners. Personality theory in cognitive psychology describes how humans handle this cooperation challenge: they first predict others' personalities and then their complex actions. Inspired by this two-step psychological theory, we propose a biologically plausible mixture of personality (MoP) improved spiking actor network (SAN), whereby a determinantal point process is used to simulate the complex formation and integration of different personality types in the MoP, and dynamic and spiking neurons are incorporated into the SAN for efficient reinforcement learning. The benchmark Overcooked task, which demands strongly cooperative cooking, is selected to test the proposed MoP-SAN. The experimental results show that MoP-SAN achieves high performance not only during learning but also in the generalization test (i.e., cooperation with unseen agents), where most counterpart deep actor networks fail. Necessary ablation experiments and visualization analyses were conducted to explain why MoP and SAN are effective in multi-agent reinforcement learning scenarios while deep actor networks perform poorly in the generalization test.

An Algorithm For Adversary Aware Decentralized Networked MARL

Authors:Soumajyoti Sarkar
Date:2023-05-09 16:02:31

Decentralized multi-agent reinforcement learning (MARL) algorithms have become popular in the literature since they allow heterogeneous agents to have their own reward functions, as opposed to canonical multi-agent Markov Decision Process (MDP) settings, which assume a common reward function for all agents. In this work, we follow existing work on collaborative MARL where agents in a connected, time-varying network can exchange information with each other in order to reach a consensus. We introduce vulnerabilities into the consensus updates of existing MARL algorithms, whereby agents can deviate from their usual consensus update; we term these adversarial agents. We then provide an algorithm that allows non-adversarial agents to reach a consensus in the presence of adversaries under a constrained setting.
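
As a concrete, generic example of consensus that tolerates deviating neighbors, a coordinate-wise trimmed mean drops the f most extreme values before averaging. This is a standard robust-aggregation primitive offered purely for intuition, not necessarily the paper's exact update rule.

```python
import numpy as np

def trimmed_mean_consensus(neighbor_params, f):
    """neighbor_params: (n_neighbors, dim) received parameter vectors.
    f: assumed upper bound on the number of adversarial neighbors."""
    assert neighbor_params.shape[0] > 2 * f, "need more honest values than trimmed"
    sorted_vals = np.sort(neighbor_params, axis=0)       # per-coordinate sort
    return sorted_vals[f:neighbor_params.shape[0] - f].mean(axis=0)

params = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [50.0, -50.0]])
print(trimmed_mean_consensus(params, f=1))  # the outlier row has no effect
```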

SMAClite: A Lightweight Environment for Multi-Agent Reinforcement Learning

Authors:Adam Michalski, Filippos Christianos, Stefano V. Albrecht
Date:2023-05-09 15:55:19

There is a lack of standard benchmarks for Multi-Agent Reinforcement Learning (MARL) algorithms. The StarCraft Multi-Agent Challenge (SMAC) has been widely used in MARL research, but is built on top of a heavy, closed-source computer game, StarCraft II. Thus, SMAC is computationally expensive and requires knowledge and use of proprietary tools specific to the game for any meaningful alteration of, or contribution to, the environment. We introduce SMAClite -- a challenge based on SMAC that is both decoupled from StarCraft II and open-source, along with a framework that makes it possible to create new content for SMAClite without any special knowledge. We conduct experiments to show that SMAClite is equivalent to SMAC by training MARL algorithms on SMAClite and reproducing SMAC results. We then show that SMAClite outperforms SMAC in both runtime speed and memory usage.

Latent Interactive A2C for Improved RL in Open Many-Agent Systems

Authors:Keyang He, Prashant Doshi, Bikramjit Banerjee
Date:2023-05-09 04:03:40

There is a prevalence of multiagent reinforcement learning (MARL) methods that engage in centralized training. However, these methods involve obtaining various types of information from the other agents, which may not be feasible in competitive or adversarial settings. A recent method, the interactive advantage actor critic (IA2C), engages in decentralized training coupled with decentralized execution, aiming to predict the other agents' actions from possibly noisy observations. In this paper, we present the latent IA2C, which utilizes an encoder-decoder architecture to learn a latent representation of the hidden state and other agents' actions. Our experiments in two domains -- each populated by many agents -- reveal that the latent IA2C significantly improves sample efficiency by reducing variance and converging faster. Additionally, we introduce open versions of these domains, where the agent population may change over time, and evaluate on these instances as well.

Communication-Robust Multi-Agent Learning by Adaptable Auxiliary Multi-Agent Adversary Generation

Authors:Lei Yuan, Feng Chen, Zhongzhang Zhang, Yang Yu
Date:2023-05-09 01:29:46

Communication can promote coordination in cooperative Multi-Agent Reinforcement Learning (MARL). Existing works mainly focus on improving the communication efficiency of agents, neglecting that real-world communication is much more challenging, as there may exist noise or potential attackers. The robustness of communication-based policies is thus an urgent and severe issue that needs more exploration. In this paper, we posit that an ego system trained with auxiliary adversaries can handle this limitation, and propose an adaptable method of Multi-Agent Auxiliary Adversaries Generation for robust Communication, dubbed MA3C, to obtain a robust communication-based policy. Specifically, we introduce a novel message-attacking approach that models the learning of the auxiliary attacker as a cooperative problem under a shared goal of minimizing the coordination ability of the ego system, with which every information channel may suffer from distinct message attacks. Furthermore, as naive adversarial training may impede the generalization ability of the ego system, we design an attacker-population generation approach based on evolutionary learning. Finally, the ego system is paired with an attacker population and alternately trained against the continuously evolving attackers to improve its robustness, meaning that both the ego system and the attackers are adaptable. Extensive experiments on multiple benchmarks indicate that our proposed MA3C provides comparable or better robustness and generalization ability than other baselines.

Multi-agent Continual Coordination via Progressive Task Contextualization

Authors:Lei Yuan, Lihe Li, Ziqian Zhang, Fuxiang Zhang, Cong Guan, Yang Yu
Date:2023-05-07 15:04:56

Cooperative Multi-agent Reinforcement Learning (MARL) has attracted significant attention and shown potential for many real-world applications. Prior works mainly focus on facilitating coordination ability from different aspects (e.g., non-stationarity, credit assignment) in single-task or multi-task scenarios, ignoring streams of tasks that arrive in a continual manner. This leaves continual coordination largely unexplored, in both problem formulation and algorithm design. To tackle this issue, this paper proposes Multi-Agent Continual Coordination via Progressive Task Contextualization, dubbed MACPro. The key is to obtain a factorized policy, using shared feature-extraction layers but separate, independent task heads, each specializing in a specific class of tasks. The task heads can be progressively expanded based on the learned task contextualization. Moreover, to cater to the popular CTDE paradigm in MARL, each agent learns to predict and adopt the most relevant policy head based on local information in a decentralized manner. We show on multiple multi-agent benchmarks that existing continual learning methods fail, while MACPro achieves close-to-optimal performance. Further results also demonstrate the effectiveness of MACPro from multiple aspects, such as its high generalization ability.

Stackelberg Games for Learning Emergent Behaviors During Competitive Autocurricula

Authors:Boling Yang, Liyuan Zheng, Lillian J. Ratliff, Byron Boots, Joshua R. Smith
Date:2023-05-04 19:27:35

Autocurricular training is an important sub-area of multi-agent reinforcement learning (MARL) that allows multiple agents to learn emergent skills in an unsupervised, co-evolving scheme. The robotics community has experimented with autocurricular training on physically grounded problems, such as robust control and interactive manipulation tasks. However, the asymmetric nature of these tasks makes the generation of sophisticated policies challenging. Indeed, the asymmetry in the environment may implicitly or explicitly provide an advantage to a subset of agents, which could, in turn, lead to a low-quality equilibrium. This paper proposes a novel game-theoretic algorithm, Stackelberg Multi-Agent Deep Deterministic Policy Gradient (ST-MADDPG), which formulates a two-player MARL problem as a Stackelberg game with one player as the `leader' and the other as the `follower' in a hierarchical interaction structure wherein the leader has an advantage. We first demonstrate that the leader's advantage in ST-MADDPG can be used to alleviate the inherent asymmetry in the environment. By exploiting the leader's advantage, ST-MADDPG improves the quality of the co-evolution process and yields more sophisticated and complex strategies that work well even against an unseen strong opponent.

System Neural Diversity: Measuring Behavioral Heterogeneity in Multi-Agent Learning

Authors:Matteo Bettini, Ajay Shankar, Amanda Prorok
Date:2023-05-03 13:58:13

Evolutionary science provides evidence that diversity confers resilience in natural systems. Yet, traditional multi-agent reinforcement learning techniques commonly enforce homogeneity to increase training sample efficiency. When a system of learning agents is not constrained to homogeneous policies, individuals may develop diverse behaviors, resulting in emergent complementarity that benefits the system. Despite this, there is a surprising lack of tools that quantify behavioral diversity. Such techniques would pave the way towards understanding the impact of diversity in collective artificial intelligence and enabling its control. In this paper, we introduce System Neural Diversity (SND): a measure of behavioral heterogeneity in multi-agent systems. We discuss and prove its theoretical properties, and compare it with alternative state-of-the-art behavioral diversity metrics used in the robotics domain. Through simulations of a variety of cooperative multi-robot tasks, we show how our metric constitutes an important tool that enables measurement and control of behavioral heterogeneity. In dynamic tasks, where the problem is affected by repeated disturbances during training, we show that SND allows us to measure latent resilience skills acquired by the agents, while other proxies, such as task performance (reward), fail to do so. Finally, we show how the metric can be employed to control diversity, allowing us to enforce a desired heterogeneity set-point or range. We demonstrate how this paradigm can be used to bootstrap the exploration phase, finding optimal policies faster, thus enabling novel and more efficient MARL paradigms.
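
A generic way to turn this idea into a single number (not the paper's exact SND definition, which the authors derive and analyze formally) is to average a pairwise distance between agents' action distributions over a shared set of probe observations:

```python
import numpy as np

def heterogeneity(action_dists):
    """action_dists: (n_agents, n_probes, n_actions) policy outputs.
    Returns the mean pairwise total-variation distance between agents."""
    n = action_dists.shape[0]
    pair_dists = [
        0.5 * np.abs(action_dists[i] - action_dists[j]).sum(-1).mean()
        for i in range(n) for j in range(i + 1, n)
    ]
    return float(np.mean(pair_dists))

dists = np.random.default_rng(0).dirichlet(np.ones(4), size=(3, 16))
print(heterogeneity(dists))  # 0 would mean perfectly homogeneous policies
```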

Human Machine Co-adaption Interface via Cooperation Markov Decision Process System

Authors:Kairui Guo, Adrian Cheng, Yaqi Li, Jun Li, Rob Duffield, Steven W. Su
Date:2023-05-03 12:00:53

This paper aims to develop a new human-machine interface to improve rehabilitation performance from the perspective of both the user (patient) and the machine (robot) by introducing co-adaptation techniques via model-based reinforcement learning. Previous studies focus more on robot assistance, i.e., improving the control strategy to fulfill the Assist-As-Needed objective. In this study, we treat the full process of robot-assisted rehabilitation as a co-adaptive, mutual learning process and emphasize the adaptation of the user to the machine. To this end, we propose a Co-adaptive MDPs (CaMDPs) model to quantify the learning rates based on cooperative multi-agent reinforcement learning (MARL) in the high abstraction layer of the system. We propose several approaches to cooperatively adjust the Policy Improvement of the two agents within the Policy Iteration framework. Based on the proposed co-adaptive MDPs, the simulation study indicates that the non-stationarity problem can be mitigated using the various proposed Policy Improvement approaches.

On the Complexity of Multi-Agent Decision Making: From Learning in Games to Partial Monitoring

Authors:Dylan J. Foster, Dean P. Foster, Noah Golowich, Alexander Rakhlin
Date:2023-05-01 06:46:22

A central problem in the theory of multi-agent reinforcement learning (MARL) is to understand what structural conditions and algorithmic principles lead to sample-efficient learning guarantees, and how these considerations change as we move from few to many agents. We study this question in a general framework for interactive decision making with multiple agents, encompassing Markov games with function approximation and normal-form games with bandit feedback. We focus on equilibrium computation, in which a centralized learning algorithm aims to compute an equilibrium by controlling multiple agents that interact with an unknown environment. Our main contributions are:
- We provide upper and lower bounds on the optimal sample complexity for multi-agent decision making based on a multi-agent generalization of the Decision-Estimation Coefficient, a complexity measure introduced by Foster et al. (2021) in the single-agent counterpart to our setting. Compared to the best results for the single-agent setting, our bounds have additional gaps. We show that no "reasonable" complexity measure can close these gaps, highlighting a striking separation between single and multiple agents.
- We show that characterizing the statistical complexity for multi-agent decision making is equivalent to characterizing the statistical complexity of single-agent decision making, but with hidden (unobserved) rewards, a framework that subsumes variants of the partial monitoring problem. As a consequence, we characterize the statistical complexity for hidden-reward interactive decision making to the best extent possible. Building on this development, we provide several new structural results, including 1) conditions under which the statistical complexity of multi-agent decision making can be reduced to that of single-agent, and 2) conditions under which the so-called curse of multiple agents can be avoided.

From Explicit Communication to Tacit Cooperation: A Novel Paradigm for Cooperative MARL

Authors:Dapeng Li, Zhiwei Xu, Bin Zhang, Guoliang Fan
Date:2023-04-28 06:56:07

Centralized training with decentralized execution (CTDE) is a widely used learning paradigm that has achieved significant success in complex tasks. However, partial observability and the absence of effectively shared signals between agents often limit its effectiveness in fostering cooperation. While communication can address this challenge, it simultaneously reduces the algorithm's practicality. Drawing inspiration from human team learning, we propose a novel paradigm that facilitates a gradual shift from explicit communication to tacit cooperation. In the initial training stage, we promote cooperation by sharing relevant information among agents and concurrently reconstructing this information from each agent's local trajectory. We then combine the explicitly communicated information with the reconstructed information to obtain mixed information. Throughout training, we progressively reduce the proportion of explicitly communicated information, facilitating a seamless transition to fully decentralized execution without communication. Experimental results in various scenarios demonstrate that the performance of our method without communication can approach or even surpass that of QMIX and communication-based methods.

Centralized control for multi-agent RL in a complex Real-Time-Strategy game

Authors:Roger Creus Castanyer
Date:2023-04-25 17:19:05

Multi-agent Reinforcement Learning (MARL) studies the behaviour of multiple learning agents that coexist in a shared environment. MARL is more challenging than single-agent RL because it involves more complex learning dynamics: the observations and rewards of each agent are functions of all the other agents. In the context of MARL, Real-Time Strategy (RTS) games represent very challenging environments, where multiple players interact simultaneously and control many units of different natures all at once. In fact, RTS games are so challenging for current RL methods that merely being able to tackle them with RL is interesting. This project provides the end-to-end experience of applying RL in the Lux AI v2 Kaggle competition, where competitors design agents to control variable-sized fleets of units and tackle a multi-variable optimization, resource gathering, and allocation problem in a 1v1 scenario against other competitors. We use a centralized approach for training the RL agents, and report multiple design decisions along the way. We provide the source code of the project: https://github.com/roger-creus/centralized-control-lux.

Partially Observable Mean Field Multi-Agent Reinforcement Learning Based on Graph-Attention

Authors:Min Yang, Guanjun Liu, Ziyuan Zhou
Date:2023-04-25 08:38:32

Traditional multi-agent reinforcement learning algorithms are difficult to apply in large-scale multi-agent environments. The introduction of mean-field theory has enhanced the scalability of multi-agent reinforcement learning in recent years. This paper considers partially observable multi-agent reinforcement learning (MARL), where each agent can only observe other agents within a fixed range. This partial observability affects an agent's ability to assess the quality of the actions of surrounding agents. This paper focuses on developing a method to capture more effective information from local observations in order to select more effective actions. Previous work in this field employs probability distributions or weighted mean fields to update the average actions of neighborhood agents, but it does not fully consider the feature information of surrounding neighbors and can lead to local optima. In this paper, we propose a novel multi-agent reinforcement learning algorithm, Partially Observable Mean Field Multi-Agent Reinforcement Learning based on Graph-Attention (GAMFQ), to remedy this flaw. GAMFQ uses a graph attention module and a mean-field module to describe how an agent is influenced by the actions of other agents at each time step. The graph attention module consists of a graph attention encoder and a differentiable attention mechanism that outputs a dynamic graph representing the effectiveness of neighborhood agents with respect to the central agent. The mean-field module approximates the effect of a neighborhood agent on a central agent as the average effect of the effective neighborhood agents. Experiments show that GAMFQ outperforms baselines, including state-of-the-art partially observable mean-field reinforcement learning algorithms. The code for this paper is at \url{https://github.com/yangmin32/GPMF}.
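
For context, mean-field Q-learning methods of this family build on an update of roughly the following form (our paraphrase of the standard mean-field update; the contribution here lies in how the mean action $\bar{a}_j$ is formed from attention-weighted effective neighbors rather than a plain average):

$$Q_j(s, a_j, \bar{a}_j) \leftarrow (1-\alpha)\,Q_j(s, a_j, \bar{a}_j) + \alpha\big[r_j + \gamma\, v_j(s')\big], \qquad v_j(s') = \sum_{a_j} \pi_j(a_j \mid s', \bar{a}_j)\, Q_j(s', a_j, \bar{a}_j).$$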

Stubborn: An Environment for Evaluating Stubbornness between Agents with Aligned Incentives

Authors:Ram Rachum, Yonatan Nakar, Reuth Mirsky
Date:2023-04-24 17:19:15

Recent research in multi-agent reinforcement learning (MARL) has shown success in learning social behavior and cooperation. Social dilemmas between agents in mixed-sum settings have been studied extensively, but there is little research into social dilemmas in fully-cooperative settings, where agents have no prospect of gaining reward at another agent's expense. While fully aligned interests are conducive to cooperation between agents, they do not guarantee it. We propose a measure of "stubbornness" between agents that aims to capture the human social behavior from which it takes its name: a disagreement that gradually escalates and is potentially disastrous. We would like to promote research into the tendency of agents to be stubborn, the reactions of counterpart agents, and the resulting social dynamics. In this paper, we present Stubborn, an environment for evaluating stubbornness between agents with fully-aligned incentives. In our preliminary results, the agents learn to use their partner's stubbornness as a signal for improving the choices that they make in the environment.

Inducing Stackelberg Equilibrium through Spatio-Temporal Sequential Decision-Making in Multi-Agent Reinforcement Learning

Authors:Bin Zhang, Lijuan Li, Zhiwei Xu, Dapeng Li, Guoliang Fan
Date:2023-04-20 14:47:54

In multi-agent reinforcement learning (MARL), self-interested agents attempt to establish equilibrium and achieve coordination depending on game structure. However, existing MARL approaches are mostly bound by the simultaneous actions of all agents in the Markov game (MG) framework, and few works consider the formation of equilibrium strategies via asynchronous action coordination. In view of the advantages of Stackelberg equilibrium (SE) over Nash equilibrium, we construct a spatio-temporal sequential decision-making structure derived from the MG and propose an N-level policy model based on a conditional hypernetwork shared by all agents. This approach allows for asymmetric training with symmetric execution, with each agent responding optimally conditioned on the decisions made by superior agents. Agents can learn heterogeneous SE policies while still maintaining parameter sharing, which leads to reduced cost for learning and storage and enhanced scalability as the number of agents increases. Experiments demonstrate that our method effectively converges to the SE policies in repeated matrix game scenarios, and performs admirably in immensely complex settings including cooperative tasks and mixed tasks.

SocialLight: Distributed Cooperation Learning towards Network-Wide Traffic Signal Control

Authors:Harsh Goel, Yifeng Zhang, Mehul Damani, Guillaume Sartoretti
Date:2023-04-20 12:41:25

Many recent works have turned to multi-agent reinforcement learning (MARL) for adaptive traffic signal control to optimize the travel time of vehicles over large urban networks. However, achieving effective and scalable cooperation among junctions (agents) remains an open challenge, as existing methods often rely on extensive, non-generalizable reward shaping or on non-scalable centralized learning. To address these problems, we propose a new MARL method for traffic signal control, SocialLight, which learns cooperative traffic control policies by estimating, in a distributed manner, each agent's marginal contribution within its local neighborhood. SocialLight relies on the Asynchronous Actor Critic (A3C) framework, and makes learning scalable by learning a locally-centralized critic conditioned on the states and actions of neighboring agents, which agents use to estimate individual contributions via counterfactual reasoning. We further introduce important modifications to the advantage calculation that help stabilize policy updates. These modifications decouple the impact of the neighbors' actions on the computed advantages, thereby reducing the variance of the gradient updates. We benchmark our trained network against state-of-the-art traffic signal control methods on standard benchmarks in two traffic simulators, SUMO and CityFlow. Our results show that SocialLight exhibits improved scalability to larger road networks and better performance across the usual traffic metrics.

Graph Exploration for Effective Multi-agent Q-Learning

Authors:Ainur Zhaikhan, Ali H. Sayed
Date:2023-04-19 10:28:28

This paper proposes an exploration technique for multi-agent reinforcement learning (MARL) with graph-based communication among agents. We assume that the individual rewards received by the agents are independent of the actions of the other agents, while their policies are coupled. In the proposed framework, neighbouring agents collaborate to estimate the uncertainty about the state-action space in order to execute more efficient explorative behaviour. Unlike existing works, the proposed algorithm does not require counting mechanisms and can be applied to continuous-state environments without complex conversion techniques. Moreover, the proposed scheme allows agents to communicate in a fully decentralized manner with minimal information exchange; in continuous-state scenarios, each agent needs to exchange only a single parameter vector. The performance of the algorithm is verified with theoretical results for discrete-state scenarios and with experiments for continuous ones.

Heterogeneous-Agent Reinforcement Learning

Authors:Yifan Zhong, Jakub Grudzien Kuba, Xidong Feng, Siyi Hu, Jiaming Ji, Yaodong Yang
Date:2023-04-19 05:08:02

The necessity for cooperation among intelligent machines has popularised cooperative multi-agent reinforcement learning (MARL) in AI research. However, many research endeavours rely heavily on parameter sharing among agents, which confines them to the homogeneous-agent setting and leads to training instability and a lack of convergence guarantees. To achieve effective cooperation in the general heterogeneous-agent setting, we propose Heterogeneous-Agent Reinforcement Learning (HARL) algorithms that resolve the aforementioned issues. Central to our findings are the multi-agent advantage decomposition lemma and the sequential update scheme. Based on these, we develop the provably correct Heterogeneous-Agent Trust Region Learning (HATRL), and derive HATRPO and HAPPO by tractable approximations. Furthermore, we discover a novel framework named Heterogeneous-Agent Mirror Learning (HAML), which strengthens the theoretical guarantees for HATRPO and HAPPO and provides a general template for cooperative MARL algorithm design. We prove that all algorithms derived from HAML inherently enjoy monotonic improvement of the joint return and convergence to a Nash equilibrium. As a natural outcome, HAML validates more novel algorithms beyond HATRPO and HAPPO, including HAA2C, HADDPG, and HATD3, which generally outperform their existing MA counterparts. We comprehensively test the HARL algorithms on six challenging benchmarks and demonstrate their superior effectiveness and stability for coordinating heterogeneous agents, compared to strong baselines such as MAPPO and QMIX.
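
The multi-agent advantage decomposition lemma referenced above states that, for any ordering $i_1, \dots, i_n$ of the agents, the joint advantage splits into a sum of sequential per-agent advantages, each conditioned on the actions already chosen by predecessors:

$$A_{\pi}^{i_{1:n}}\big(s,\, a^{i_{1:n}}\big) \;=\; \sum_{m=1}^{n} A_{\pi}^{i_m}\big(s,\, a^{i_{1:m-1}},\, a^{i_m}\big),$$

which is what licenses the sequential, agent-by-agent update scheme and its monotonic-improvement guarantees.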

STAS: Spatial-Temporal Return Decomposition for Multi-agent Reinforcement Learning

Authors:Sirui Chen, Zhaowei Zhang, Yaodong Yang, Yali Du
Date:2023-04-15 10:09:03

Centralized Training with Decentralized Execution (CTDE) has proven to be an effective paradigm in cooperative multi-agent reinforcement learning (MARL). One of the major challenges is credit assignment, which aims to credit agents according to their contributions. While prior studies have shown great success, their methods typically fail in episodic reinforcement learning scenarios, where global rewards are revealed only at the end of the episode: they lack the ability to model the complicated temporal structure of the delayed global reward and suffer from inefficiencies. To tackle this, we introduce Spatial-Temporal Attention with Shapley (STAS), a novel method that learns credit assignment in both the temporal and spatial dimensions. It first decomposes the global return back to each time step, then utilizes the Shapley Value to redistribute the individual payoff from the decomposed global reward. To mitigate the computational complexity of the Shapley Value, we introduce an approximation of marginal contributions and utilize Monte Carlo sampling to estimate them. We evaluate our method on an Alice & Bob example and MPE environments across different scenarios. Our results demonstrate that our method effectively assigns spatial-temporal credit, outperforming all state-of-the-art baselines.
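
The Monte Carlo approximation of Shapley values mentioned above is standard: sample random agent orderings and average marginal contributions. A minimal sketch follows, with `coalition_value` as a placeholder for the learned value of a coalition at a decomposed time step (which STAS approximates rather than evaluating exactly).

```python
import numpy as np

def mc_shapley(n_agents, coalition_value, n_samples=1000, seed=0):
    """Estimate per-agent Shapley values by permutation sampling."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(n_agents)
    for _ in range(n_samples):
        members, prev = [], coalition_value([])
        for i in rng.permutation(n_agents):
            members.append(i)
            cur = coalition_value(members)
            phi[i] += cur - prev      # marginal contribution of agent i
            prev = cur
    return phi / n_samples

# toy check: with an additive value, each agent recovers its own contribution
print(mc_shapley(3, lambda c: sum(2.0 * i for i in c)))  # ~[0., 2., 4.]
```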

Model-based Dynamic Shielding for Safe and Efficient Multi-Agent Reinforcement Learning

Authors:Wenli Xiao, Yiwei Lyu, John Dolan
Date:2023-04-13 06:08:10

Multi-Agent Reinforcement Learning (MARL) discovers policies that maximize reward but do not have safety guarantees during the learning and deployment phases. Although shielding with Linear Temporal Logic (LTL) is a promising formal method to ensure safety in single-agent Reinforcement Learning (RL), it results in conservative behaviors when scaling to multi-agent scenarios. Additionally, it poses computational challenges for synthesizing shields in complex multi-agent environments. This work introduces Model-based Dynamic Shielding (MBDS) to support MARL algorithm design. Our algorithm synthesizes distributive shields, which are reactive systems running in parallel with each MARL agent, to monitor and rectify unsafe behaviors. The shields can dynamically split, merge, and recompute based on agents' states. This design enables efficient synthesis of shields to monitor agents in complex environments without coordination overheads. We also propose an algorithm to synthesize shields without prior knowledge of the dynamics model. The proposed algorithm obtains an approximate world model by interacting with the environment during the early stage of exploration, making our MBDS enjoy formal safety guarantees with high probability. We demonstrate in simulations that our framework can surpass existing baselines in terms of safety guarantees and learning performance.

MABL: Bi-Level Latent-Variable World Model for Sample-Efficient Multi-Agent Reinforcement Learning

Authors:Aravind Venugopal, Stephanie Milani, Fei Fang, Balaraman Ravindran
Date:2023-04-12 17:46:23

Multi-agent reinforcement learning (MARL) methods often suffer from high sample complexity, limiting their use in real-world problems where data is sparse or expensive to collect. Although latent-variable world models have been employed to address this issue by generating abundant synthetic data for MARL training, most of these models cannot encode vital global information available during training into their latent states, which hampers learning efficiency. The few exceptions that incorporate global information assume centralized execution of their learned policies, which is impractical in many applications with partial observability. We propose a novel model-based MARL algorithm, MABL (Multi-Agent Bi-Level world model), that learns a bi-level latent-variable world model from high-dimensional inputs. Unlike existing models, MABL is capable of encoding essential global information into the latent states during training while guaranteeing the decentralized execution of learned policies. For each agent, MABL learns a global latent state at the upper level, which is used to inform the learning of an agent latent state at the lower level. During execution, agents exclusively use lower-level latent states and act independently. Crucially, MABL can be combined with any model-free MARL algorithm for policy learning. In our empirical evaluation with complex discrete and continuous multi-agent tasks including SMAC, Flatland, and MAMuJoCo, MABL surpasses SOTA multi-agent latent-variable world models in both sample efficiency and overall performance.

Learning to Communicate and Collaborate in a Competitive Multi-Agent Setup to Clean the Ocean from Macroplastics

Authors:Philipp Dominic Siedler
Date:2023-04-12 14:02:42

Finding a balance between collaboration and competition is crucial for artificial agents in many real-world applications. We investigate this using a Multi-Agent Reinforcement Learning (MARL) setup on the back of a high-impact problem. The accumulation and yearly growth of plastic in the ocean cause irreparable damage to many aspects of oceanic health and the marine ecosystem. To prevent further damage, we need to find ways to remove macroplastics from known plastic patches in the ocean. Here we propose a Graph Neural Network (GNN) based communication mechanism that increases the agents' observation space. In our custom environment, agents control a plastic-collecting vessel. The communication mechanism enables agents to develop a communication protocol using a binary signal. While the goal of the agent collective is to clean up as much as possible, agents are rewarded for the individual amount of macroplastics collected. Hence, agents have to learn to communicate effectively while maintaining high individual performance. We compare our proposed communication mechanism with a multi-agent baseline without the ability to communicate. Results show that communication enables collaboration and significantly increases collective performance. This means agents have learned the importance of communication and found a balance between collaboration and competition.

Improving ABR Performance for Short Video Streaming Using Multi-Agent Reinforcement Learning with Expert Guidance

Authors:Yueheng Li, Qianyuan Zheng, Zicheng Zhang, Hao Chen, Zhan Ma
Date:2023-04-10 15:05:21

In the realm of short video streaming, popular adaptive bitrate (ABR) algorithms developed for classical long-video applications suffer from catastrophic failures because they are tuned solely to adapt bitrates. Instead, short video adaptive bitrate (SABR) algorithms have to jointly determine which video to prefetch and at which bitrate level, without sacrificing users' quality of experience (QoE) or causing noticeable bandwidth wastage. Unfortunately, existing SABR methods are inevitably entangled with slow convergence and poor generalization. Thus, in this paper, we propose Incendio, a novel SABR framework that applies Multi-Agent Reinforcement Learning (MARL) with Expert Guidance to separate the decisions of video ID and video bitrate into respective buffer-management and bitrate-adaptation agents, so as to maximize a system-level utility score modeled as a compound function of QoE and bandwidth-wastage metrics. To train Incendio, it is first initialized by imitating hand-crafted expert rules and then fine-tuned through MARL. Results from extensive experiments indicate that Incendio outperforms the current state-of-the-art SABR algorithm with a 53.2% improvement measured by the utility score, while maintaining low training complexity and inference time.

MARL-iDR: Multi-Agent Reinforcement Learning for Incentive-based Residential Demand Response

Authors:Jasper van Tilburg, Luciano C. Siebert, Jochen L. Cremer
Date:2023-04-08 19:17:38

This paper presents a decentralized Multi-Agent Reinforcement Learning (MARL) approach to an incentive-based Demand Response (DR) program, which aims to maintain the capacity limits of the electricity grid and prevent grid congestion by financially incentivizing residential consumers to reduce their energy consumption. The proposed approach addresses the key challenge of coordinating heterogeneous preferences and requirements from multiple participants while preserving their privacy and minimizing financial costs for the aggregator. The participant agents use a novel Disjunctively Constrained Knapsack Problem optimization to curtail or shift the requested household appliances based on the selected demand reduction. Through case studies with electricity data from $25$ households, the proposed approach effectively reduced energy consumption's Peak-to-Average ratio (PAR) by $14.48$% compared to the original PAR while fully preserving participant privacy. This approach has the potential to significantly improve the efficiency and reliability of the electricity grid, making it an important contribution to the management of renewable energy resources and the growing electricity demand.
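
To make the appliance-scheduling step concrete, here is a toy greedy curtailment routine in the spirit of the knapsack formulation (the paper solves a Disjunctively Constrained Knapsack Problem; this heuristic and the example numbers are ours, for illustration only):

```python
def curtail(appliances, target_kw):
    """appliances: list of (name, load_kw, discomfort).
    Greedily curtail the cheapest discomfort-per-kW loads until the
    selected demand-reduction target is met."""
    chosen, reduced = [], 0.0
    for name, load, disc in sorted(appliances, key=lambda a: a[2] / a[1]):
        if reduced >= target_kw:
            break
        chosen.append(name)
        reduced += load
    return chosen, reduced

print(curtail([("ev", 7.0, 2.0), ("dryer", 3.0, 1.5), ("ac", 2.0, 3.0)], 5.0))
```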

Effective control of two-dimensional Rayleigh--Bénard convection: invariant multi-agent reinforcement learning is all you need

Authors:Colin Vignon, Jean Rabault, Joel Vasanth, Francisco Alcántara-Ávila, Mikael Mortensen, Ricardo Vinuesa
Date:2023-04-05 11:21:21

Rayleigh-B\'enard convection (RBC) is a recurrent phenomenon in several industrial and geoscience flows and a well-studied system from a fundamental fluid-mechanics viewpoint. However, controlling RBC, for example by modulating the spatial distribution of the bottom-plate heating in the canonical RBC configuration, remains a challenging topic for classical control-theory methods. In the present work, we apply deep reinforcement learning (DRL) to controlling RBC. We show that effective RBC control can be obtained by leveraging invariant multi-agent reinforcement learning (MARL), which takes advantage of the locality and translational invariance inherent to RBC flows inside wide channels. The MARL framework applied to RBC allows for an increase in the number of control segments without encountering the curse of dimensionality that would result from a naive increase in the DRL action-size dimension. This is made possible by MARL's ability to re-use the knowledge generated in different parts of the RBC domain. We show in a case study that MARL DRL is able to discover an advanced control strategy that destabilizes the spontaneous RBC double-cell pattern, changes the topology of RBC by coalescing adjacent convection cells, and actively controls the resulting coalesced cell to bring it to a new stable configuration. This modified flow configuration results in reduced convective heat transfer, which is beneficial in several industrial processes. Therefore, our work both shows the potential of MARL DRL for controlling large RBC systems and demonstrates the possibility for DRL to discover strategies that move the RBC configuration between different topological configurations, yielding desirable heat-transfer characteristics. These results are useful both for gaining further understanding of the intrinsic properties of RBC and for developing industrial applications.

Risk-Aware Distributed Multi-Agent Reinforcement Learning

Authors:Abdullah Al Maruf, Luyao Niu, Bhaskar Ramasubramanian, Andrew Clark, Radha Poovendran
Date:2023-04-04 17:56:44

Autonomous cyber and cyber-physical systems need to perform decision-making, learning, and control in unknown environments. Such decision-making can be sensitive to multiple factors, including modeling errors, changes in costs, and impacts of events in the tails of probability distributions. Although multi-agent reinforcement learning (MARL) provides a framework for learning behaviors through repeated interactions with the environment by minimizing an average cost, it is not adequate to overcome the above challenges. In this paper, we develop a distributed MARL approach that solves decision-making problems in unknown environments by learning risk-aware actions. We use the conditional value-at-risk (CVaR) to characterize the cost function that is being minimized, and define a Bellman operator to characterize the value function associated with a given state-action pair. We prove that this operator satisfies a contraction property and converges to the optimal value function. We then propose a distributed MARL algorithm called the CVaR QD-Learning algorithm, and establish that the value functions of individual agents reach consensus. We identify several challenges that arise in the implementation of the CVaR QD-Learning algorithm and present solutions to overcome them. We evaluate the CVaR QD-Learning algorithm through simulations, and demonstrate the effect of a risk parameter on the value functions at consensus.
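
For reference, the conditional value-at-risk of a cost $Z$ at level $\alpha$ admits the standard Rockafellar--Uryasev variational form

$$\mathrm{CVaR}_{\alpha}(Z) \;=\; \min_{\nu \in \mathbb{R}} \left\{ \nu + \frac{1}{\alpha}\, \mathbb{E}\big[(Z - \nu)^{+}\big] \right\},$$

i.e., the expected cost over the worst $\alpha$-fraction of outcomes, which is the quantity a CVaR Bellman operator propagates in place of the plain expectation.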

Regularization of the policy updates for stabilizing Mean Field Games

Authors:Talal Algumaei, Ruben Solozabal, Reda Alami, Hakim Hacid, Merouane Debbah, Martin Takac
Date:2023-04-04 05:45:42

This work studies non-cooperative Multi-Agent Reinforcement Learning (MARL), where multiple agents interact in the same environment and each aims to maximize its individual return. Challenges arise when scaling up the number of agents due to the resulting non-stationarity that the many agents introduce. To address this issue, Mean Field Games (MFG) rely on symmetry and homogeneity assumptions to approximate games with very large populations. Recently, deep reinforcement learning has been used to scale MFG to games with a larger number of states. Current methods rely on smoothing techniques such as averaging the q-values or the updates of the mean-field distribution. This work presents a different approach to stabilizing learning, based on proximal updates of the mean-field policy. We name our algorithm Mean Field Proximal Policy Optimization (MF-PPO), and we empirically show the effectiveness of our method in the OpenSpiel framework.
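
The proximal ingredient is the standard PPO clipped surrogate, applied here (per our reading, as an illustration) to a representative agent's policy while the mean-field distribution is held fixed between updates:

$$\mathcal{L}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}.$$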

Off-Policy Action Anticipation in Multi-Agent Reinforcement Learning

Authors:Ariyan Bighashdel, Daan de Geus, Pavol Jancura, Gijs Dubbelman
Date:2023-04-04 01:44:19

Learning anticipation in Multi-Agent Reinforcement Learning (MARL) is a reasoning paradigm where agents anticipate the learning steps of other agents to improve cooperation among themselves. As MARL uses gradient-based optimization, learning anticipation requires using Higher-Order Gradients (HOG), with so-called HOG methods. Existing HOG methods are based on policy parameter anticipation, i.e., agents anticipate the changes in policy parameters of other agents. Currently, however, these existing HOG methods have only been applied to differentiable games or games with small state spaces. In this work, we demonstrate that in the case of non-differentiable games with large state spaces, existing HOG methods do not perform well and are inefficient due to their inherent limitations related to policy parameter anticipation and multiple sampling stages. To overcome these problems, we propose Off-Policy Action Anticipation (OffPA2), a novel framework that approaches learning anticipation through action anticipation, i.e., agents anticipate the changes in actions of other agents, via off-policy sampling. We theoretically analyze our proposed OffPA2 and employ it to develop multiple HOG methods that are applicable to non-differentiable games with large state spaces. We conduct a large set of experiments and illustrate that our proposed HOG methods outperform the existing ones regarding efficiency and performance.

Effective and Stable Role-Based Multi-Agent Collaboration by Structural Information Principles

Authors:Xianghua Zeng, Hao Peng, Angsheng Li
Date:2023-04-03 07:13:44

Role-based learning is a promising approach to improving the performance of Multi-Agent Reinforcement Learning (MARL). Nevertheless, without manual assistance, current role-based methods cannot guarantee stably discovering a set of roles that effectively decompose a complex task, as they assume either a predefined role structure or practical experience for selecting hyperparameters. In this article, we propose a mathematical Structural Information principles-based Role Discovery method, namely SIRD, and then present a SIRD-optimized MARL framework, namely SR-MARL, for multi-agent collaboration. SIRD transforms role discovery into hierarchical action-space clustering. Specifically, SIRD consists of structuralization, sparsification, and optimization modules, where an optimal encoding tree is generated to perform abstraction for role discovery. SIRD is agnostic to specific MARL algorithms and can be flexibly integrated with various value-function factorization approaches. Empirical evaluations on the StarCraft II micromanagement benchmark demonstrate that, compared with state-of-the-art MARL algorithms, the SR-MARL framework improves the average test win rate by 0.17%, 6.08%, and 3.24%, and reduces the deviation by 16.67%, 30.80%, and 66.30%, under easy, hard, and super-hard scenarios, respectively.

The challenge of redundancy on multi-agent value factorisation

Authors:Siddarth Singh, Benjamin Rosman
Date:2023-03-28 20:41:12

In the field of cooperative multi-agent reinforcement learning (MARL), the standard paradigm is centralised training with decentralised execution, where a central critic conditions the policies of the cooperative agents on a central state. It has been shown that in cases with large numbers of redundant agents, these methods become less effective. In the general case, there are likely to be more agents in an environment than are required to solve the task. These redundant agents reduce performance by enlarging the dimensionality of the state space and increasing the size of the joint policy used to solve the environment. We propose leveraging layerwise relevance propagation (LRP) to separate the learning of the joint value function from the generation of local reward signals, and create a new MARL algorithm: the relevance decomposition network (RDN). We find that although the performance of both baselines, VDN and QMIX, degrades with the number of redundant agents, RDN is unaffected.

Embedding Contextual Information through Reward Shaping in Multi-Agent Learning: A Case Study from Google Football

Authors:Chaoyi Gu, Varuna De Silva, Corentin Artaud, Rafael Pina
Date:2023-03-25 10:21:13

Artificial intelligence has been used to help humans complete difficult tasks in complicated environments by providing optimized strategies for decision-making or by replacing manual labour. In environments with multiple agents, such as football, the most common methods to train agents are Imitation Learning and Multi-Agent Reinforcement Learning (MARL). However, agents trained by Imitation Learning cannot outperform the expert demonstrator, which means humans can hardly gain new insights from the learnt policy. Moreover, MARL is prone to the credit assignment problem, and in environments with sparse reward signals it can be inefficient. The objective of our research is to create a novel reward shaping method that embeds contextual information in the reward function to solve the aforementioned challenges. We demonstrate this in the Google Research Football (GRF) environment. We quantify the contextual information extracted from the game-state observation and use this quantification together with the original sparse reward to create the shaped reward. The experimental results in the GRF environment show that our reward shaping method is a useful addition to state-of-the-art MARL algorithms for training agents in environments with sparse reward signals.
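
The shaping itself reduces to adding a weighted context term to the sparse scoring reward; a toy version follows (the context feature and weight are illustrative placeholders, not the paper's exact quantification):

```python
def shaped_reward(sparse_reward, ball_progress, weight=0.1):
    """sparse_reward: +1/-1 on goals, else 0.
    ball_progress: e.g. normalized advance of the ball toward the
    opponent goal since the last step (a stand-in context feature)."""
    return sparse_reward + weight * ball_progress
```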

Causality Detection for Efficient Multi-Agent Reinforcement Learning

Authors:Rafael Pina, Varuna De Silva, Corentin Artaud
Date:2023-03-24 18:47:44

When learning a task as a team, some agents in Multi-Agent Reinforcement Learning (MARL) may fail to understand their true impact in the performance of the team. Such agents end up learning sub-optimal policies, demonstrating undesired lazy behaviours. To investigate this problem, we start by formalising the use of temporal causality applied to MARL problems. We then show how causality can be used to penalise such lazy agents and improve their behaviours. By understanding how their local observations are causally related to the team reward, each agent in the team can adjust their individual credit based on whether they helped to cause the reward or not. We show empirically that using causality estimations in MARL improves not only the holistic performance of the team, but also the individual capabilities of each agent. We observe that the improvements are consistent in a set of different environments.

Learning Reward Machines in Cooperative Multi-Agent Tasks

Authors:Leo Ardon, Daniel Furelos-Blanco, Alessandra Russo
Date:2023-03-24 15:12:28

This paper presents a novel approach to Multi-Agent Reinforcement Learning (MARL) that combines cooperative task decomposition with the learning of reward machines (RMs) encoding the structure of the sub-tasks. The proposed method helps deal with the non-Markovian nature of the rewards in partially observable environments and improves the interpretability of the learnt policies required to complete the cooperative task. The RMs associated with each sub-task are learnt in a decentralised manner and then used to guide the behaviour of each agent. By doing so, the complexity of a cooperative multi-agent problem is reduced, allowing for more effective learning. The results suggest that our approach is a promising direction for future research in MARL, especially in complex environments with large state spaces and multiple agents.
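
A reward machine is, in essence, a finite-state machine over high-level events whose transitions emit rewards. A minimal encoding (the event labels and the two-step sub-task are illustrative, not from the paper) is:

```python
class RewardMachine:
    """Transitions: {(state, event): (next_state, reward)}."""

    def __init__(self, transitions, initial="u0"):
        self.transitions = transitions
        self.state = initial

    def step(self, event):
        # unknown events leave the machine in place with zero reward
        self.state, reward = self.transitions.get(
            (self.state, event), (self.state, 0.0))
        return reward

# e.g. a sub-task RM: press the button, then open the door
rm = RewardMachine({("u0", "button"): ("u1", 0.0),
                    ("u1", "door"): ("u2", 1.0)})
print(rm.step("button"), rm.step("door"))  # 0.0 1.0
```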

marl-jax: Multi-Agent Reinforcement Learning Framework

Authors:Kinal Mehta, Anuj Mahajan, Pawan Kumar
Date:2023-03-24 05:05:01

Recent advances in Reinforcement Learning (RL) have led to many exciting applications. These advancements have been driven by improvements in both algorithms and engineering, which have resulted in faster training of RL agents. We present marl-jax, a multi-agent reinforcement learning software package for training and evaluating social generalization of the agents. The package is designed for training a population of agents in multi-agent environments and evaluating their ability to generalize to diverse background agents. It is built on top of DeepMind's JAX ecosystem~\cite{deepmind2020jax} and leverages the RL ecosystem developed by DeepMind. Our framework marl-jax is capable of working in cooperative and competitive, simultaneous-acting environments with multiple agents. The package offers an intuitive and user-friendly command-line interface for training a population and evaluating its generalization capabilities. In conclusion, marl-jax provides a valuable resource for researchers interested in exploring social generalization in the context of MARL. The open-source code for marl-jax is available at: \href{https://github.com/kinalmehta/marl-jax}{https://github.com/kinalmehta/marl-jax}

Stochastic Graph Neural Network-based Value Decomposition for MARL in Internet of Vehicles

Authors:Baidi Xiao, Rongpeng Li, Fei Wang, Chenghui Peng, Jianjun Wu, Zhifeng Zhao, Honggang Zhang
Date:2023-03-23 12:14:04

Autonomous driving has witnessed incredible advances in the past several decades, and Multi-Agent Reinforcement Learning (MARL) promises to satisfy the essential need for autonomous vehicle control in wireless connected vehicle networks. In MARL, how to effectively decompose a global feedback signal into the relative contributions of individual agents is one of the most fundamental problems. However, environment volatility due to vehicle movement and wireless disturbance can significantly shape time-varying topological relationships among agents, making Value Decomposition (VD) challenging. To cope with this volatility, it becomes imperative to design a dynamic VD framework. Hence, in this paper, we propose a novel Stochastic VMIX (SVMIX) methodology that takes dynamic topological features into account during VD and incorporates the corresponding components into a multi-agent actor-critic architecture. In particular, a Stochastic Graph Neural Network (SGNN) is leveraged to effectively capture the underlying dynamics of topological features and improve the flexibility of VD against environment volatility. Finally, the superiority of SVMIX is verified through extensive simulations.

NeuronsMAE: A Novel Multi-Agent Reinforcement Learning Environment for Cooperative and Competitive Multi-Robot Tasks

Authors:Guangzheng Hu, Haoran Li, Shasha Liu, Mingjun Ma, Yuanheng Zhu, Dongbin Zhao
Date:2023-03-22 05:30:39

Multi-agent reinforcement learning (MARL) has achieved remarkable success in various challenging problems, and more and more benchmarks have emerged to provide standards for evaluating algorithms in different fields. On the one hand, virtual MARL environments lack knowledge of real-world tasks and actuator abilities; on the other hand, current task-specific multi-robot platforms offer poor support for the generality of multi-agent reinforcement learning algorithms and lack support for transferring from simulation to the real environment. Bridging the gap between virtual MARL environments and real multi-robot platforms is therefore key to promoting the practicability of MARL algorithms. This paper proposes a novel MARL environment for real multi-robot tasks named NeuronsMAE (Neurons Multi-Agent Environment). This environment supports cooperative and competitive multi-robot tasks and is configured with rich parameter interfaces to study multi-agent policy transfer from simulation to reality. With this platform, we evaluate various popular MARL algorithms and build a new MARL benchmark for multi-robot tasks. We hope that this platform will facilitate the research and application of MARL algorithms for real robot tasks. Information about the benchmark and the open-source code will be released.

Cheap Talk Discovery and Utilization in Multi-Agent Reinforcement Learning

Authors:Yat Long Lo, Christian Schroeder de Witt, Samuel Sokota, Jakob Nicolaus Foerster, Shimon Whiteson
Date:2023-03-19 18:38:32

By enabling agents to communicate, recent cooperative multi-agent reinforcement learning (MARL) methods have demonstrated better task performance and more coordinated behavior. Most existing approaches facilitate inter-agent communication by allowing agents to send messages to each other through free communication channels, i.e., cheap talk channels. Current methods require these channels to be constantly accessible and known to the agents a priori. In this work, we lift these requirements such that the agents must discover the cheap talk channels and learn how to use them. Hence, the problem has two main parts: cheap talk discovery (CTD) and cheap talk utilization (CTU). We introduce a novel conceptual framework for both parts and develop a new algorithm based on mutual information maximization that outperforms existing algorithms in CTD/CTU settings. We also release a novel benchmark suite to stimulate future research in CTD/CTU.

Major-Minor Mean Field Multi-Agent Reinforcement Learning

Authors:Kai Cui, Christian Fabian, Anam Tahir, Heinz Koeppl
Date:2023-03-19 14:12:57

Multi-agent reinforcement learning (MARL) remains difficult to scale to many agents. Recent MARL using Mean Field Control (MFC) provides a tractable and rigorous approach to otherwise difficult cooperative MARL. However, the strict MFC assumption of many independent, weakly-interacting agents is too inflexible in practice. We generalize MFC to instead simultaneously model many similar and few complex agents -- as Major-Minor Mean Field Control (M3FC). Theoretically, we give approximation results for finite agent control, and verify the sufficiency of stationary policies for optimality together with a dynamic programming principle. Algorithmically, we propose Major-Minor Mean Field MARL (M3FMARL) for finite agent systems instead of the limiting system. The algorithm is shown to approximate the policy gradient of the underlying M3FC MDP. Finally, we demonstrate its capabilities experimentally in various scenarios. We observe a strong performance in comparison to state-of-the-art policy gradient MARL methods.

Decentralized Multi-Agent Reinforcement Learning for Continuous-Space Stochastic Games

Authors:Awni Altabaa, Bora Yongacoglu, Serdar Yüksel
Date:2023-03-16 14:25:16

Stochastic games are a popular framework for studying multi-agent reinforcement learning (MARL). Recent advances in MARL have focused primarily on games with finitely many states. In this work, we study multi-agent learning in stochastic games with general state spaces and an information structure in which agents do not observe each other's actions. In this context, we propose a decentralized MARL algorithm and we prove the near-optimality of its policy updates. Furthermore, we study the global policy-updating dynamics for a general class of best-reply based algorithms and derive a closed-form characterization of convergence probabilities over the joint policy space.

LCS-TF: Multi-Agent Deep Reinforcement Learning-Based Intelligent Lane-Change System for Improving Traffic Flow

Authors:Lokesh Chandra Das, Myounggyu Won
Date:2023-03-16 04:03:17

Discretionary lane-change is one of the critical challenges for autonomous vehicle (AV) design due to its significant impact on traffic efficiency. Existing intelligent lane-change solutions have primarily focused on optimizing the performance of the ego-vehicle, thereby suffering from limited generalization performance. Recent research has seen an increased interest in multi-agent reinforcement learning (MARL)-based approaches to address the limitation of the ego vehicle-based solutions through close coordination of multiple agents. Although MARL-based approaches have shown promising results, the potential impact of lane-change decisions on the overall traffic flow of a road segment has not been fully considered. In this paper, we present a novel hybrid MARL-based intelligent lane-change system for AVs designed to jointly optimize the local performance for the ego vehicle, along with the global performance focused on the overall traffic flow of a given road segment. With a careful review of the relevant transportation literature, a novel state space is designed to integrate both the critical local traffic information pertaining to the surrounding vehicles of the ego vehicle, as well as the global traffic information obtained from a road-side unit (RSU) responsible for managing a road segment. We create a reward function to ensure that the agents make effective lane-change decisions by considering the performance of the ego vehicle and the overall improvement of traffic flow. A multi-agent deep Q-network (DQN) algorithm is designed to determine the optimal policy for each agent to effectively cooperate in performing lane-change maneuvers. LCS-TF's performance was evaluated through extensive simulations in comparison with state-of-the-art MARL models. In all aspects of traffic efficiency, driving safety, and driver comfort, the results indicate that LCS-TF exhibits superior performance.

SVDE: Scalable Value-Decomposition Exploration for Cooperative Multi-Agent Reinforcement Learning

Authors:Shuhan Qi, Shuhao Zhang, Qiang Wang, Jiajia Zhang, Jing Xiao, Xuan Wang
Date:2023-03-16 03:17:20

Value-decomposition methods, which reduce the difficulty of a multi-agent system by decomposing the joint state-action space into local observation-action spaces, have become popular in cooperative multi-agent reinforcement learning (MARL). However, value-decomposition methods still suffer from tremendous sample consumption during training and a lack of active exploration. In this paper, we propose a scalable value-decomposition exploration (SVDE) method, which includes a scalable training mechanism, an intrinsic reward design, and explorative experience replay. The scalable training mechanism asynchronously decouples strategy learning from environmental interaction, so as to accelerate sample generation in a MapReduce manner. To address the lack of exploration, an intrinsic reward design and explorative experience replay are proposed, so as to enhance exploration by producing diverse samples and filtering non-novel samples, respectively. Empirically, our method achieves the best performance on almost all maps compared to other popular algorithms in a set of StarCraft II micromanagement games. A data-efficiency experiment also shows the acceleration of SVDE in sample collection and policy convergence, and we demonstrate the effectiveness of SVDE's components through a set of ablation experiments.

Conditionally Optimistic Exploration for Cooperative Deep Multi-Agent Reinforcement Learning

Authors:Xutong Zhao, Yangchen Pan, Chenjun Xiao, Sarath Chandar, Janarthanan Rajendran
Date:2023-03-16 02:05:16

Efficient exploration is critical in cooperative deep Multi-Agent Reinforcement Learning (MARL). In this work, we propose an exploration method that effectively encourages cooperative exploration based on the idea of a sequential action-computation scheme. The high-level intuition is that, to perform optimism-based exploration, agents would explore cooperative strategies if each agent's optimism estimate captures a structured dependency relationship with other agents. Assuming agents compute actions following a sequential order at each environment timestep, we provide a perspective that views MARL as tree search iterations by considering agents as nodes at different depths of the search tree. Inspired by the theoretically justified tree search algorithm UCT (Upper Confidence bounds applied to Trees), we develop a method called Conditionally Optimistic Exploration (COE). COE augments each agent's state-action value estimate with an action-conditioned optimistic bonus derived from the visitation count of the global state and the joint actions of preceding agents. COE is performed during training and disabled at deployment, making it compatible with any value decomposition method for centralized training with decentralized execution. Experiments across various cooperative MARL benchmarks show that COE outperforms current state-of-the-art exploration methods on hard-exploration tasks.
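
A minimal sketch of the conditional-optimism mechanism, assuming tabular counts and a UCT-style bonus constant (both illustrative choices): agent i's exploration bonus shrinks with the visitation count of the global state paired with the actions already chosen by the agents preceding it in the fixed order.

```python
import math
from collections import defaultdict

counts = defaultdict(int)  # (state, preceding actions, action) -> visit count

def optimistic_bonus(state, preceding_actions, action, c: float = 1.0) -> float:
    """Count-based bonus conditioned on the preceding agents' joint action."""
    key = (state, tuple(preceding_actions), action)
    counts[key] += 1
    # UCT-style bonus: large for rarely tried conditioned actions, shrinking
    # toward zero as the triple is visited more often.
    return c / math.sqrt(counts[key])

# Agent 2's bonus is conditioned on what agents 0 and 1 already chose.
b = optimistic_bonus(state=(3, 1), preceding_actions=[0, 2], action=1)  # 1.0
```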

Multi-Agent Proximal Policy Optimization For Data Freshness in UAV-assisted Networks

Authors:Mouhamed Naby Ndiaye, El Houcine Bergou, Hajar El Hammouti
Date:2023-03-15 15:03:09

Unmanned aerial vehicles (UAVs) are seen as a promising technology for performing a wide range of tasks in wireless communication networks. In this work, we consider the deployment of a group of UAVs to collect the data generated by IoT devices. Specifically, we focus on the case where the collected data is time-sensitive, and it is critical to maintain its timeliness. Our objective is to optimally design the UAVs' trajectories and the subsets of visited IoT devices such that the global Age-of-Updates (AoU) is minimized. To this end, we formulate the studied problem as a mixed-integer nonlinear program (MINLP) under time and quality-of-service constraints. To efficiently solve the resulting optimization problem, we investigate the cooperative Multi-Agent Reinforcement Learning (MARL) framework and propose an approach based on the popular on-policy algorithm Proximal Policy Optimization (PPO). Our approach leverages the centralized training decentralized execution (CTDE) framework, where the UAVs learn their optimal policies while training a centralized value function. Our simulation results show that the proposed MAPPO approach reduces the global AoU by at least half compared to conventional off-policy reinforcement learning approaches.

Optimizing Trading Strategies in Quantitative Markets using Multi-Agent Reinforcement Learning

Authors:Hengxi Zhang, Zhendong Shi, Yuanquan Hu, Wenbo Ding, Ercan E. Kuruoglu, Xiao-Ping Zhang
Date:2023-03-15 11:47:57

Quantitative markets are characterized by swift dynamics and abundant uncertainties, making the pursuit of profit-driven stock trading actions inherently challenging. Within this context, reinforcement learning (RL), which operates on a reward-centric mechanism for optimal control, has surfaced as a potentially effective solution to the intricate financial decision-making conundrums presented. This paper delves into the fusion of two established financial trading strategies, namely the constant proportion portfolio insurance (CPPI) and the time-invariant portfolio protection (TIPP), with the multi-agent deep deterministic policy gradient (MADDPG) framework. As a result, we introduce two novel multi-agent RL (MARL) methods, CPPI-MADDPG and TIPP-MADDPG, tailored for probing strategic trading within quantitative markets. To validate these innovations, we implemented them on a diverse selection of 100 real-market shares. Our empirical findings reveal that the CPPI-MADDPG and TIPP-MADDPG strategies consistently outpace their traditional counterparts, affirming their efficacy in the realm of quantitative trading.

Multi-agent Attention Actor-Critic Algorithm for Load Balancing in Cellular Networks

Authors:Jikun Kang, Di Wu, Ju Wang, Ekram Hossain, Xue Liu, Gregory Dudek
Date:2023-03-14 15:51:33

In cellular networks, User Equipment (UE) hands off from one Base Station (BS) to another, giving rise to the load balancing problem among the BSs. To address this problem, BSs can work collaboratively to deliver a smooth migration (or handoff) and satisfy the UEs' service requirements. This paper formulates the load balancing problem as a Markov game and proposes a Robust Multi-agent Attention Actor-Critic (Robust-MA3C) algorithm that can facilitate collaboration among the BSs (i.e., agents). In particular, to solve the Markov game and find a Nash equilibrium policy, we embrace the idea of adopting a nature agent to model the system uncertainty. Moreover, we utilize the self-attention mechanism, which encourages high-performance BSs to assist low-performance BSs. In addition, we consider two types of schemes, which can facilitate load balancing for both active UEs and idle UEs. We carry out extensive evaluations by simulations, and the simulation results illustrate that, compared to state-of-the-art MARL methods, the Robust-MA3C scheme can improve the overall performance by up to 45%.

SOCIALGYM 2.0: Simulator for Multi-Agent Social Robot Navigation in Shared Human Spaces

Authors:Zayne Sprague, Rohan Chandra, Jarrett Holtz, Joydeep Biswas
Date:2023-03-09 21:21:05

We present SocialGym 2, a multi-agent navigation simulator for social robot research. Our simulator models multiple autonomous agents, replicating real-world dynamics in complex environments, including doorways, hallways, intersections, and roundabouts. Unlike traditional simulators that concentrate on single robots with basic kinematic constraints in open spaces, SocialGym 2 employs multi-agent reinforcement learning (MARL) to develop optimal navigation policies for multiple robots with diverse, dynamic constraints in complex environments. Built on the PettingZoo MARL library and the Stable Baselines3 API, SocialGym 2 offers an accessible Python interface that integrates with a navigation stack through ROS messaging. SocialGym 2 can be easily installed, is packaged in a Docker container, and provides the capability to swap and evaluate different MARL algorithms, as well as to customize observation and reward functions. We also provide scripts to allow users to create their own environments, and we have conducted benchmarks using various social navigation algorithms, reporting a broad range of social navigation metrics. Project hosted at: https://amrl.cs.utexas.edu/social_gym/index.html

Proactive Multi-Camera Collaboration For 3D Human Pose Estimation

Authors:Hai Ci, Mickel Liu, Xuehai Pan, Fangwei Zhong, Yizhou Wang
Date:2023-03-07 10:01:00

This paper presents a multi-agent reinforcement learning (MARL) scheme for proactive Multi-Camera Collaboration in 3D Human Pose Estimation in dynamic human crowds. Traditional fixed-viewpoint multi-camera solutions for human motion capture (MoCap) are limited in capture space and susceptible to dynamic occlusions. Active camera approaches proactively control camera poses to find optimal viewpoints for 3D reconstruction. However, current methods still face challenges with credit assignment and environment dynamics. To address these issues, our proposed method introduces a novel Collaborative Triangulation Contribution Reward (CTCR) that improves convergence and alleviates multi-agent credit assignment issues resulting from using 3D reconstruction accuracy as the shared reward. Additionally, we jointly train our model with multiple world dynamics learning tasks to better capture environment dynamics and encourage anticipatory behaviors for occlusion avoidance. We evaluate our proposed method in four photo-realistic UE4 environments to ensure validity and generalizability. Empirical results show that our method outperforms fixed and active baselines in various scenarios with different numbers of cameras and humans.

Multi-Target Pursuit by a Decentralized Heterogeneous UAV Swarm using Deep Multi-Agent Reinforcement Learning

Authors:Maryam Kouzeghar, Youngbin Song, Malika Meghjani, Roland Bouffanais
Date:2023-03-03 09:17:43

Multi-agent pursuit-evasion tasks involving intelligent targets are notoriously challenging coordination problems. In this paper, we investigate new ways to learn such coordinated behaviors of unmanned aerial vehicles (UAVs) aimed at keeping track of multiple evasive targets. Within a Multi-Agent Reinforcement Learning (MARL) framework, we specifically propose a variant of the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) method. Our approach addresses multi-target pursuit-evasion scenarios within non-stationary and unknown environments with random obstacles. In addition, given the critical role played by collective exploration in detecting possible targets, we implement heterogeneous roles for the pursuers, with enhanced exploratory actions balanced by exploitation (i.e., tracking) of previously identified targets. Our proposed role-based MADDPG algorithm is not only able to track multiple targets but can also explore for possible targets by means of the proposed Voronoi-based rewarding policy. We implemented, tested and validated our approach in a simulation environment prior to deploying it on a real-world multi-robot system comprising Crazyflie drones. Our results demonstrate that a multi-agent pursuit team can learn highly efficient coordinated control policies in terms of target tracking and exploration, even when confronted with multiple fast evasive targets in complex environments.

Approximating Energy Market Clearing and Bidding With Model-Based Reinforcement Learning

Authors:Thomas Wolgast, Astrid Nieße
Date:2023-03-03 08:26:22

Energy market rules should incentivize market participants to behave in a way that conforms to market and grid requirements. However, they can also provide incentives for undesired and unexpected strategies if the market design is flawed. Multi-agent reinforcement learning (MARL) is a promising new approach to predicting the expected profit-maximizing behavior of energy market participants in simulation. However, reinforcement learning requires many interactions with the system to converge, and the power system environment often involves extensive computations, e.g., optimal power flow (OPF) calculations for market clearing. To tackle this complexity, we provide a model of the energy market to a basic MARL algorithm in the form of a learned OPF approximation and explicit market rules. The learned OPF surrogate model makes explicitly solving the OPF unnecessary. Our experiments demonstrate that the model additionally reduces training time by about one order of magnitude, at the cost of slightly worse performance. Potential applications of our method are market design, more realistic modeling of market participants, and analysis of manipulative behavior.

Toward Risk-based Optimistic Exploration for Cooperative Multi-Agent Reinforcement Learning

Authors:Jihwan Oh, Joonkee Kim, Minchan Jeong, Se-Young Yun
Date:2023-03-03 08:17:57

The multi-agent setting is intricate and unpredictable since the behaviors of multiple agents influence one another. To address this environmental uncertainty, distributional reinforcement learning algorithms that incorporate uncertainty via distributional output have been integrated with multi-agent reinforcement learning (MARL) methods, achieving state-of-the-art performance. However, distributional MARL algorithms still rely on the traditional $\epsilon$-greedy strategy, which does not take the cooperative strategy into account. In this paper, we present a risk-based exploration that leads to collaboratively optimistic behavior by shifting the sampling region of the return distribution. Initially, we take expectations over the upper quantiles of state-action values for exploration, which yields optimistic actions, and gradually shift the sampling region toward the full distribution for exploitation. By ensuring that each agent is exposed to the same level of risk, we can induce cooperatively optimistic actions. Our method shows remarkable performance in multi-agent settings requiring cooperative exploration, appropriately controlling the level of risk via quantile regression.
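
A minimal sketch of the sampling-region shift, assuming sorted per-action quantile estimates and a linear schedule (both illustrative choices, not the paper's exact configuration): early in training only the upper quantiles are averaged (optimism), and the region expands to the full distribution as training progresses.

```python
import numpy as np

def risk_seeking_value(quantiles: np.ndarray, progress: float) -> float:
    """quantiles: sorted per-action quantile estimates; progress in [0, 1]."""
    n = len(quantiles)
    # Fraction of the distribution used, from the top 25% early on to the
    # full 100% at the end (the 0.25 floor is an illustrative assumption).
    frac = 0.25 + 0.75 * progress
    k = max(1, int(round(frac * n)))
    return float(np.mean(quantiles[n - k:]))  # mean of the upper k quantiles

q = np.sort(np.random.randn(32))
value_early = risk_seeking_value(q, progress=0.0)  # optimistic estimate
value_late = risk_seeking_value(q, progress=1.0)   # full expectation
```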

Distributed Learning Meets 6G: A Communication and Computing Perspective

Authors:Shashank Jere, Yifei Song, Yang Yi, Lingjia Liu
Date:2023-03-02 15:15:33

With the ever-improving computing capabilities and storage capacities of mobile devices in line with evolving telecommunication network paradigms, there has been an explosion of research interest towards exploring Distributed Learning (DL) frameworks to realize stringent key performance indicators (KPIs) that are expected in next-generation/6G cellular networks. In conjunction with Edge Computing, Federated Learning (FL) has emerged as the DL architecture of choice in prominent wireless applications. This article lays an outline of how DL in general and FL-based strategies specifically can contribute towards realizing a part of the 6G vision and strike a balance between communication and computing constraints. As a practical use case, we apply Multi-Agent Reinforcement Learning (MARL) within the FL framework to the Dynamic Spectrum Access (DSA) problem and present preliminary evaluation results. Top contemporary challenges in applying DL approaches to 6G networks are also highlighted.

GHQ: Grouped Hybrid Q Learning for Heterogeneous Cooperative Multi-agent Reinforcement Learning

Authors:Xiaoyang Yu, Youfang Lin, Xiangsen Wang, Sheng Han, Kai Lv
Date:2023-03-02 08:45:49

Previous deep multi-agent reinforcement learning (MARL) algorithms have achieved impressive results, typically in homogeneous scenarios. However, heterogeneous scenarios are also very common and usually harder to solve. In this paper, we mainly discuss cooperative heterogeneous MARL problems in the StarCraft Multi-Agent Challenge (SMAC) environment. We first define and describe the heterogeneous problems in SMAC. To comprehensively reveal and study the problem, we add new maps to the original SMAC maps. We find that baseline algorithms fail to perform well on these heterogeneous maps. To address this issue, we propose the Grouped Individual-Global-Max Consistency (GIGM) condition and a novel MARL algorithm, Grouped Hybrid Q Learning (GHQ). GHQ separates agents into several groups and keeps individual parameters for each group, along with a novel hybrid structure for factorization. To enhance coordination between groups, we maximize the Inter-group Mutual Information (IGMI) between groups' trajectories. Experiments on the original and new heterogeneous maps show the superior performance of GHQ compared to other state-of-the-art algorithms.

Parameter Sharing with Network Pruning for Scalable Multi-Agent Deep Reinforcement Learning

Authors:Woojun Kim, Youngchul Sung
Date:2023-03-02 02:17:14

Handling the problem of scalability is one of the essential issues for multi-agent reinforcement learning (MARL) algorithms to be applied to real-world problems typically involving massively many agents. For this, parameter sharing across multiple agents has widely been used since it reduces the training time by decreasing the number of parameters and increasing the sample efficiency. However, using the same parameters across agents limits the representational capacity of the joint policy and consequently, the performance can be degraded in multi-agent tasks that require different behaviors for different agents. In this paper, we propose a simple method that adopts structured pruning for a deep neural network to increase the representational capacity of the joint policy without introducing additional parameters. We evaluate the proposed method on several benchmark tasks, and numerical results show that the proposed method significantly outperforms other parameter-sharing methods.
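
A minimal numpy sketch of the idea, assuming a block-structured mask pattern (the paper's structured pruning is learned rather than fixed like this): all agents share one weight matrix, but each applies its own binary mask, yielding distinct per-agent functions with no additional parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
W_shared = rng.standard_normal((64, 32))  # one weight matrix for all agents

def agent_mask(agent_id: int, n_agents: int, shape) -> np.ndarray:
    # Structured pruning: each agent zeroes out a different block of rows
    # (an illustrative pattern standing in for the learned pruning masks).
    mask = np.ones(shape)
    block = shape[0] // n_agents
    mask[agent_id * block:(agent_id + 1) * block, :] = 0.0
    return mask

def agent_forward(x: np.ndarray, agent_id: int, n_agents: int = 4) -> np.ndarray:
    W_i = W_shared * agent_mask(agent_id, n_agents, W_shared.shape)
    return np.maximum(W_i @ x, 0.0)  # masked linear layer + ReLU

h0 = agent_forward(rng.standard_normal(32), agent_id=0)
h1 = agent_forward(rng.standard_normal(32), agent_id=1)  # different function, same parameters
```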

SCRIMP: Scalable Communication for Reinforcement- and Imitation-Learning-Based Multi-Agent Pathfinding

Authors:Yutong Wang, Bairan Xiang, Shinan Huang, Guillaume Sartoretti
Date:2023-03-01 15:53:59

Trading off performance guarantees in favor of scalability, the Multi-Agent Path Finding (MAPF) community has recently started to embrace Multi-Agent Reinforcement Learning (MARL), where agents learn to collaboratively generate individual, collision-free (but often suboptimal) paths. Scalability is usually achieved by assuming a local field of view (FOV) around the agents, helping scale to arbitrary world sizes. However, this assumption significantly limits the amount of information available to the agents, making it difficult for them to enact the type of joint maneuvers needed in denser MAPF tasks. In this paper, we propose SCRIMP, where agents learn individual policies from even very small (down to 3x3) FOVs, by relying on a highly-scalable global/local communication mechanism based on a modified transformer. We further equip agents with a state-value-based tie-breaking strategy to further improve performance in symmetric situations, and introduce intrinsic rewards to encourage exploration while mitigating the long-term credit assignment problem. Empirical evaluations on a set of experiments indicate that SCRIMP can achieve higher performance with improved scalability compared to other state-of-the-art learning-based MAPF planners with larger FOVs, and even yields similar performance as a classical centralized planner in many cases. Ablation studies further validate the effectiveness of our proposed techniques. Finally, we show that our trained model can be directly implemented on real robots for online MAPF through high-fidelity simulations in gazebo.

Multi-Arm Robot Task Planning for Fruit Harvesting Using Multi-Agent Reinforcement Learning

Authors:Tao Li, Feng Xie, Ya Xiong, Qingchun Feng
Date:2023-03-01 12:39:30

The emergence of harvesting robotics offers a promising solution to the issue of limited agricultural labor resources and the increasing demand for fruits. Despite notable advancements in the field of harvesting robotics, the utilization of such technology in orchards is still limited. The key challenge is to improve operational efficiency. Taking into account inner-arm conflicts, couplings of DoFs, and dynamic tasks, we propose a task planning strategy for a harvesting robot with four arms in this paper. The proposed method employs a Markov game framework to formulate the four-arm robotic harvesting task, which avoids the computational complexity of solving an NP-hard scheduling problem. Furthermore, a multi-agent reinforcement learning (MARL) structure with a fully centralized collaboration protocol is used to train a MARL-based task planning network. Several simulations and orchard experiments are conducted to validate the effectiveness of the proposed method for a multi-arm harvesting robot in comparison with the existing method.

A Variational Approach to Mutual Information-Based Coordination for Multi-Agent Reinforcement Learning

Authors:Woojun Kim, Whiyoung Jung, Myungsik Cho, Youngchul Sung
Date:2023-03-01 12:21:30

In this paper, we propose a new mutual information framework for multi-agent reinforcement learning that enables multiple agents to learn coordinated behaviors by regularizing the accumulated return with the simultaneous mutual information between multi-agent actions. By introducing a latent variable to induce nonzero mutual information between multi-agent actions and applying a variational bound, we derive a tractable lower bound on the considered MMI-regularized objective function. The derived tractable objective can be interpreted as maximum entropy reinforcement learning combined with uncertainty reduction of other agents' actions. Applying policy iteration to maximize the derived lower bound, we propose a practical algorithm named variational maximum mutual information multi-agent actor-critic (VM3-AC), which follows centralized learning with decentralized execution. We evaluated VM3-AC on several games requiring coordination, and numerical results show that VM3-AC outperforms other MARL algorithms in multi-agent tasks requiring high-quality coordination.

Finite-sample Guarantees for Nash Q-learning with Linear Function Approximation

Authors:Pedro Cisneros-Velarde, Sanmi Koyejo
Date:2023-03-01 02:09:49

Nash Q-learning may be considered one of the first and most known algorithms in multi-agent reinforcement learning (MARL) for learning policies that constitute a Nash equilibrium of an underlying general-sum Markov game. Its original proof provided asymptotic guarantees and was for the tabular case. Recently, finite-sample guarantees have been provided using more modern RL techniques for the tabular case. Our work analyzes Nash Q-learning using linear function approximation -- a representation regime introduced when the state space is large or continuous -- and provides finite-sample guarantees that indicate its sample efficiency. We find that the obtained performance nearly matches an existing efficient result for single-agent RL under the same representation and has a polynomial gap when compared to the best-known result for the tabular case.

On Learning Intrinsic Rewards for Faster Multi-Agent Reinforcement Learning based MAC Protocol Design in 6G Wireless Networks

Authors:Luciano Miuccio, Salvatore Riolo, Mehdi Bennis, Daniela Panno
Date:2023-02-28 17:07:51

In this paper, we propose a novel framework for designing a fast convergent multi-agent reinforcement learning (MARL)-based medium access control (MAC) protocol operating in a single cell scenario. The user equipments (UEs) are cast as learning agents that need to learn a proper signaling policy to coordinate the transmission of protocol data units (PDUs) to the base station (BS) over shared radio resources. In many MARL tasks, the conventional centralized training with decentralized execution (CTDE) is adopted, where each agent receives the same global extrinsic reward from the environment. However, this approach involves a long training time. To overcome this drawback, we adopt the concept of learning a per-agent intrinsic reward, in which each agent learns a different intrinsic reward signal based solely on its individual behavior. Moreover, in order to provide an intrinsic reward function that takes into account the long-term training history, we represent it as a long short-term memory (LSTM) network. As a result, each agent updates its policy network considering both the extrinsic reward, which characterizes the cooperative task, and the intrinsic reward that reflects local dynamics. The proposed learning framework yields a faster convergence and higher transmission performance compared to the baselines. Simulation results show that the proposed learning solution yields 75% improvement in convergence speed compared to the most performing baseline.
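
A hedged PyTorch sketch of the per-agent intrinsic reward module: an LSTM over the agent's own observation-action history emits a per-step intrinsic reward that is added to the shared extrinsic reward. The sizes and the additive combination are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class IntrinsicRewardLSTM(nn.Module):
    """Maps an agent's observation-action history to per-step intrinsic rewards."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim + act_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, obs_act_seq: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(obs_act_seq)    # (B, T, hidden): history-aware features
        return self.head(out).squeeze(-1)  # (B, T): intrinsic reward per step

model = IntrinsicRewardLSTM(obs_dim=10, act_dim=4)
seq = torch.randn(2, 5, 14)    # batch of 2 histories, 5 steps each
r_int = model(seq)             # learned per-step intrinsic rewards
r_ext = torch.ones(2, 5)       # shared extrinsic reward (placeholder values)
r_total = r_ext + r_int        # each agent updates its policy on this mix
```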

IQ-Flow: Mechanism Design for Inducing Cooperative Behavior to Self-Interested Agents in Sequential Social Dilemmas

Authors:Bengisu Guresti, Abdullah Vanlioglu, Nazim Kemal Ure
Date:2023-02-28 14:44:29

Achieving and maintaining cooperation between agents to accomplish a common objective is one of the central goals of Multi-Agent Reinforcement Learning (MARL). Nevertheless, in many real-world scenarios, separately trained and specialized agents are deployed into a shared environment, or the environment requires multiple objectives to be achieved by different coexisting parties. These variations among specialties and objectives are likely to cause mixed motives that eventually result in a social dilemma where all the parties are at a loss. In order to resolve this issue, we propose the Incentive Q-Flow (IQ-Flow) algorithm, which modifies the system's reward setup with an incentive regulator agent such that the cooperative policy also corresponds to the self-interested policy for the agents. Unlike existing methods that learn to incentivize self-interested agents, IQ-Flow does not make any assumptions about agents' policies or learning algorithms, which enables the generalization of the developed framework to a wider array of applications. IQ-Flow performs an offline evaluation of the optimality of the learned policies using the data provided by other agents to determine cooperative and self-interested policies. Next, IQ-Flow uses meta-gradient learning to estimate how the policy evaluation changes according to given incentives and modifies the incentives such that the greedy policies for the cooperative objective and the self-interested objective yield the same actions. We present the operational characteristics of IQ-Flow in Iterated Matrix Games. We demonstrate that IQ-Flow outperforms the state-of-the-art incentive design algorithm in the Escape Room and 2-Player Cleanup environments. We further demonstrate that the pretrained IQ-Flow mechanism significantly outperforms the shared-reward setup in the 2-Player Cleanup environment.

On the Role of Emergent Communication for Social Learning in Multi-Agent Reinforcement Learning

Authors:Seth Karten, Siva Kailas, Huao Li, Katia Sycara
Date:2023-02-28 03:23:27

Explicit communication among humans is key to coordinating and learning. Social learning, which uses cues from experts, can greatly benefit from the usage of explicit communication to align heterogeneous policies, reduce sample complexity, and solve partially observable tasks. Emergent communication, a type of explicit communication, studies the creation of an artificial language to encode a high task-utility message directly from data. However, in most cases, emergent communication sends insufficiently compressed messages with little or null information, which also may not be understandable to a third-party listener. This paper proposes an unsupervised method based on the information bottleneck to capture both referential complexity and task-specific utility to adequately explore sparse social communication scenarios in multi-agent reinforcement learning (MARL). We show that our model is able to i) develop a natural-language-inspired lexicon of messages that is independently composed of a set of emergent concepts, which span the observations and intents with minimal bits, ii) develop communication to align the action policies of heterogeneous agents with dissimilar feature models, and iii) learn a communication policy from watching an expert's action policy, which we term `social shadowing'.

AC2C: Adaptively Controlled Two-Hop Communication for Multi-Agent Reinforcement Learning

Authors:Xuefeng Wang, Xinran Li, Jiawei Shao, Jun Zhang
Date:2023-02-24 09:00:34

Learning communication strategies in cooperative multi-agent reinforcement learning (MARL) has recently attracted intensive attention. Early studies typically assumed a fully-connected communication topology among agents, which induces high communication costs and may not be feasible. Some recent works have developed adaptive communication strategies to reduce communication overhead, but these methods cannot effectively obtain valuable information from agents that are beyond the communication range. In this paper, we consider a realistic communication model where each agent has a limited communication range, and the communication topology dynamically changes. To facilitate effective agent communication, we propose a novel communication protocol called Adaptively Controlled Two-Hop Communication (AC2C). After an initial local communication round, AC2C employs an adaptive two-hop communication strategy to enable long-range information exchange among agents to boost performance, which is implemented by a communication controller. This controller determines whether each agent should ask for two-hop messages and thus helps to reduce the communication overhead during distributed execution. We evaluate AC2C on three cooperative multi-agent tasks, and the experimental results show that it outperforms relevant baselines with lower communication costs.

Revisiting the Gumbel-Softmax in MADDPG

Authors:Callum Rhys Tilbury, Filippos Christianos, Stefano V. Albrecht
Date:2023-02-23 06:13:51

MADDPG is an algorithm in multi-agent reinforcement learning (MARL) that extends the popular single-agent method, DDPG, to multi-agent scenarios. Importantly, DDPG is an algorithm designed for continuous action spaces, where the gradient of the state-action value function exists. For this algorithm to work in discrete action spaces, discrete gradient estimation must be performed. For MADDPG, the Gumbel-Softmax (GS) estimator is used -- a reparameterisation which relaxes a discrete distribution into a similar continuous one. This method, however, is statistically biased, and a recent MARL benchmarking paper suggests that this bias makes MADDPG perform poorly in grid-world situations, where the action space is discrete. Fortunately, many alternatives to the GS exist, boasting a wide range of properties. This paper explores several of these alternatives and integrates them into MADDPG for discrete grid-world scenarios. The corresponding impact on various performance metrics is then measured and analysed. It is found that one of the proposed estimators performs significantly better than the original GS in several tasks, achieving up to 55% higher returns, along with faster convergence.

Semantic Information Marketing in The Metaverse: A Learning-Based Contract Theory Framework

Authors:Ismail Lotfi, Dusit Niyato, Sumei Sun, Dong In Kim, Xuemin Shen
Date:2023-02-22 15:52:37

In this paper, we address the problem of designing incentive mechanisms by a virtual service provider (VSP) to hire sensing IoT devices to sell their sensing data to help create and render the digital copy of the physical world in the Metaverse. Due to the limited bandwidth, we propose to use semantic extraction algorithms to reduce the volume of data delivered by the sensing IoT devices. Nevertheless, mechanisms to hire sensing IoT devices to share their data with the VSP and then deliver the constructed digital twin to the Metaverse users are vulnerable to the adverse selection problem. The adverse selection problem, which is caused by information asymmetry between the system entities, becomes harder to solve when the private information of the different entities is multi-dimensional. We propose a novel iterative contract design and use a new variant of multi-agent reinforcement learning (MARL) to solve the modelled multi-dimensional contract problem. To demonstrate the effectiveness of our algorithm, we conduct extensive simulations and measure several key performance metrics of the contract for the Metaverse. Our results show that our designed iterative contract is able to incentivize the participants to interact truthfully, which maximizes the profit of the VSP with minimal individual rationality (IR) and incentive compatibility (IC) violation rates. Furthermore, the proposed learning-based iterative contract framework has limited access to the private information of the participants, which is, to the best of our knowledge, the first of its kind in addressing the adverse selection problem in incentive mechanisms.

MAC-PO: Multi-Agent Experience Replay via Collective Priority Optimization

Authors:Yongsheng Mei, Hanhan Zhou, Tian Lan, Guru Venkataramani, Peng Wei
Date:2023-02-21 03:11:21

Experience replay is crucial for off-policy reinforcement learning (RL) methods. By remembering and reusing the experiences from different past policies, experience replay significantly improves the training efficiency and stability of RL algorithms. Many decision-making problems in practice naturally involve multiple agents and require multi-agent reinforcement learning (MARL) under the centralized training decentralized execution paradigm. Nevertheless, existing MARL algorithms often adopt standard experience replay where the transitions are uniformly sampled regardless of their importance. Finding prioritized sampling weights that are optimized for MARL experience replay has yet to be explored. To this end, we propose MAC-PO, which formulates optimal prioritized experience replay for multi-agent problems as a regret minimization over the sampling weights of transitions. Such optimization is relaxed and solved using the Lagrangian multiplier approach to obtain the closed-form optimal sampling weights. By minimizing the resulting policy regret, we can narrow the gap between the current policy and a nominal optimal policy, thus acquiring an improved prioritization scheme for multi-agent tasks. Our experimental results on Predator-Prey and StarCraft Multi-Agent Challenge environments demonstrate the effectiveness of our method, which better replays important transitions and outperforms other state-of-the-art baselines.

Differentiable Arbitrating in Zero-sum Markov Games

Authors:Jing Wang, Meichen Song, Feng Gao, Boyi Liu, Zhaoran Wang, Yi Wu
Date:2023-02-20 16:05:04

We initiate the study of how to perturb the reward in a zero-sum Markov game with two players to induce a desirable Nash equilibrium, namely arbitrating. Such a problem admits a bi-level optimization formulation. The lower level requires solving the Nash equilibrium under a given reward function, which makes the overall problem challenging to optimize in an end-to-end way. We propose a backpropagation scheme that differentiates through the Nash equilibrium, which provides the gradient feedback for the upper level. In particular, our method only requires a black-box solver for the (regularized) Nash equilibrium (NE). We develop the convergence analysis for the proposed framework with proper black-box NE solvers and demonstrate the empirical successes in two multi-agent reinforcement learning (MARL) environments.

Efficient Communication via Self-supervised Information Aggregation for Online and Offline Multi-agent Reinforcement Learning

Authors:Cong Guan, Feng Chen, Lei Yuan, Zongzhang Zhang, Yang Yu
Date:2023-02-19 16:02:16

Utilizing messages from teammates can improve coordination in cooperative Multi-agent Reinforcement Learning (MARL). Previous works typically combine raw messages from teammates with local information as inputs for the policy. However, neglecting message aggregation poses significant inefficiency for policy learning. Motivated by recent advances in representation learning, we argue that efficient message aggregation is essential for good coordination in cooperative MARL. In this paper, we propose Multi-Agent communication via Self-supervised Information Aggregation (MASIA), where agents aggregate received messages into compact representations with high relevance to augment the local policy. Specifically, we design a permutation-invariant message encoder to generate a common information-aggregated representation from messages and optimize it by reconstructing and shooting future information in a self-supervised manner. Each agent then utilizes the most relevant parts of the aggregated representation for decision-making through a novel message extraction mechanism. Furthermore, considering the potential of offline learning for real-world applications, we build offline benchmarks for multi-agent communication, which, to the best of our knowledge, are the first of their kind. Empirical results demonstrate the superiority of our method in both online and offline settings. We also release the built offline benchmarks as a testbed for validating communication ability, to facilitate future research.

AIIR-MIX: Multi-Agent Reinforcement Learning Meets Attention Individual Intrinsic Reward Mixing Network

Authors:Wei Li, Weiyan Liu, Shitong Shao, Shiyi Huang
Date:2023-02-19 10:25:25

Deducing the contribution of each agent and assigning the corresponding reward is a crucial problem in cooperative Multi-Agent Reinforcement Learning (MARL). Previous studies try to resolve the issue by designing an intrinsic reward function, but the intrinsic reward is simply combined with the environment reward by summation, which leaves the performance of these MARL frameworks unsatisfactory. We propose a novel method named Attention Individual Intrinsic Reward Mixing Network (AIIR-MIX) for MARL, whose contributions are as follows: (a) we construct a novel intrinsic reward network based on the attention mechanism to make teamwork more effective; (b) we propose a mixing network that is able to combine intrinsic and extrinsic rewards non-linearly and dynamically in response to changing conditions of the environment. We compare AIIR-MIX with many state-of-the-art (SOTA) MARL methods on battle games in StarCraft II. The results demonstrate that AIIR-MIX performs admirably and can defeat the current advanced methods in average test win rate. To validate the effectiveness of AIIR-MIX, we conduct additional ablation studies. The results show that AIIR-MIX can dynamically assign each agent a real-time intrinsic reward in accordance with its actual contribution.

Uplink Power Control for Extremely Large-Scale MIMO with Multi-Agent Reinforcement Learning and Fuzzy Logic

Authors:Ziheng Liu, Zhilong Liu, Jiayi Zhang, Huahua Xiao, Bo Ai, Derrick Wing Kwan Ng
Date:2023-02-18 11:25:08

In this paper, we investigate the uplink transmit power optimization problem in cell-free (CF) extremely large-scale multiple-input multiple-output (XL-MIMO) systems. Instead of applying the traditional methods, we propose two signal processing architectures: the centralized training and centralized execution with fuzzy logic as well as the centralized training and decentralized execution with fuzzy logic, respectively, which adopt the amalgamation of multi-agent reinforcement learning (MARL) and fuzzy logic to solve the design problem of power control for the maximization of the system spectral efficiency (SE). Furthermore, the uplink performance of the system adopting maximum ratio (MR) combining and local minimum mean-squared error (L-MMSE) combining is evaluated. Our results show that the proposed methods with fuzzy logic outperform the conventional MARL-based method and signal processing methods in terms of computational complexity. Also, the SE performance under MR combining is even better than that of the conventional MARL-based method.

Promoting Cooperation in Multi-Agent Reinforcement Learning via Mutual Help

Authors:Yunbo Qiu, Yue Jin, Lebin Yu, Jian Wang, Xudong Zhang
Date:2023-02-18 10:07:01

Multi-agent reinforcement learning (MARL) has achieved great progress in cooperative tasks in recent years. However, in the local reward scheme, where only local rewards for each agent are given without global rewards shared by all the agents, traditional MARL algorithms lack sufficient consideration of agents' mutual influence. In cooperative tasks, agents' mutual influence is especially important since agents are supposed to coordinate to achieve better performance. In this paper, we propose a novel algorithm Mutual-Help-based MARL (MH-MARL) to instruct agents to help each other in order to promote cooperation. MH-MARL utilizes an expected action module to generate expected other agents' actions for each particular agent. Then, the expected actions are delivered to other agents for selective imitation during training. Experimental results show that MH-MARL improves the performance of MARL both in success rate and cumulative reward.
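
A hedged sketch of the mutual-help mechanism: a simple expected-action module maps agent i's observation to a distribution over agent j's actions, and j imitates it only when its own behaviour is judged poor. The network shape and the gating rule are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Agent i's expected-action module: maps i's observation (dim 16, assumed)
# to logits over agent j's 5 actions (also assumed).
expected_action_net = nn.Linear(16, 5)

def imitation_loss(obs_i: torch.Tensor, policy_logits_j: torch.Tensor,
                   j_doing_well: bool) -> torch.Tensor:
    """Selective imitation: j copies i's expectation only when j performs poorly."""
    if j_doing_well:
        return torch.zeros(())  # no imitation when j's behaviour is already good
    # Pull j's policy toward the action distribution i expects from it.
    target = expected_action_net(obs_i).softmax(-1).detach()
    return F.kl_div(policy_logits_j.log_softmax(-1), target,
                    reduction="batchmean")

loss = imitation_loss(torch.randn(8, 16), torch.randn(8, 5), j_doing_well=False)
```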

Scalable Multi-Agent Reinforcement Learning with General Utilities

Authors:Donghao Ying, Yuhao Ding, Alec Koppel, Javad Lavaei
Date:2023-02-15 20:47:43

We study the scalable multi-agent reinforcement learning (MARL) with general utilities, defined as nonlinear functions of the team's long-term state-action occupancy measure. The objective is to find a localized policy that maximizes the average of the team's local utility functions without the full observability of each agent in the team. By exploiting the spatial correlation decay property of the network structure, we propose a scalable distributed policy gradient algorithm with shadow reward and localized policy that consists of three steps: (1) shadow reward estimation, (2) truncated shadow Q-function estimation, and (3) truncated policy gradient estimation and policy update. Our algorithm converges, with high probability, to $\epsilon$-stationarity with $\widetilde{\mathcal{O}}(\epsilon^{-2})$ samples up to some approximation error that decreases exponentially in the communication radius. This is the first result in the literature on multi-agent RL with general utilities that does not require the full observability.

Learning Random Access Schemes for Massive Machine-Type Communication with MARL

Authors:Muhammad Awais Jadoon, Adriano Pastore, Monica Navarro, Alvaro Valcarce
Date:2023-02-15 18:21:22

In this paper, we explore various multi-agent reinforcement learning (MARL) techniques to design grant-free random access (RA) schemes for low-complexity, low-power, battery-operated devices in massive machine-type communication (mMTC) wireless networks. We use value decomposition networks (VDN) and QMIX algorithms with parameter sharing (PS) under centralized training and decentralized execution (CTDE) while maintaining scalability. We then compare the policies learned by VDN, QMIX, and deep recurrent Q-network (DRQN), and explore the impact of including agent identifiers in the observation vector. We show that the MARL-based RA schemes can achieve a better throughput-fairness trade-off between agents without having to condition on the agent identifiers. We also present a novel correlated traffic model, which is more descriptive of mMTC scenarios, and show that the proposed algorithm can easily adapt to traffic non-stationarities.
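
For reference, the value decomposition at the heart of VDN is simply additive: the joint action-value is the sum of per-agent utilities, so training can be centralized while each device acts on its local Q-values alone. A minimal sketch with toy numbers:

```python
import numpy as np

def vdn_joint_q(per_agent_qs: list, actions: list) -> float:
    """VDN decomposition: joint Q = sum of each agent's chosen local Q-value."""
    return float(sum(q[a] for q, a in zip(per_agent_qs, actions)))

# Two devices deciding between {wait, transmit} from their local Q-values.
qs = [np.array([0.2, 0.6]), np.array([0.5, 0.1])]
greedy = [int(np.argmax(q)) for q in qs]  # decentralized execution: [1, 0]
print(vdn_joint_q(qs, greedy))            # 1.1, the decomposed joint value
```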

Attacking Fake News Detectors via Manipulating News Social Engagement

Authors:Haoran Wang, Yingtong Dou, Canyu Chen, Lichao Sun, Philip S. Yu, Kai Shu
Date:2023-02-14 21:51:56

Social media is one of the main sources for news consumption, especially among the younger generation. With the increasing popularity of news consumption on various social media platforms, there has been a surge of misinformation, which includes false information or unfounded claims. As various text- and social context-based fake news detectors are proposed to detect misinformation on social media, recent works have started to focus on the vulnerabilities of fake news detectors. In this paper, we present the first adversarial attack framework against Graph Neural Network (GNN)-based fake news detectors to probe their robustness. Specifically, we leverage a multi-agent reinforcement learning (MARL) framework to simulate the adversarial behavior of fraudsters on social media. Research has shown that in real-world settings, fraudsters coordinate with each other to share different news in order to evade the detection of fake news detectors. Therefore, we model our MARL framework as a Markov Game with bot, cyborg, and crowd worker agents, which have their own distinctive costs, budgets, and influence. We then use deep Q-learning to search for the optimal policy that maximizes the rewards. Extensive experimental results on two real-world fake news propagation datasets demonstrate that our proposed framework can effectively sabotage the GNN-based fake news detector performance. We hope this paper can provide insights for future research on fake news detection.

Adaptive Value Decomposition with Greedy Marginal Contribution Computation for Cooperative Multi-Agent Reinforcement Learning

Authors:Shanqi Liu, Yujing Hu, Runze Wu, Dong Xing, Yu Xiong, Changjie Fan, Kun Kuang, Yong Liu
Date:2023-02-14 07:23:59

Real-world cooperation often requires intensive coordination among agents simultaneously. This task has been extensively studied within the framework of cooperative multi-agent reinforcement learning (MARL), and value decomposition methods are among those cutting-edge solutions. However, traditional methods that learn the value function as a monotonic mixing of per-agent utilities cannot solve the tasks with non-monotonic returns. This hinders their application in generic scenarios. Recent methods tackle this problem from the perspective of implicit credit assignment by learning value functions with complete expressiveness or using additional structures to improve cooperation. However, they are either difficult to learn due to large joint action spaces or insufficient to capture the complicated interactions among agents which are essential to solving tasks with non-monotonic returns. To address these problems, we propose a novel explicit credit assignment method to address the non-monotonic problem. Our method, Adaptive Value decomposition with Greedy Marginal contribution (AVGM), is based on an adaptive value decomposition that learns the cooperative value of a group of dynamically changing agents. We first illustrate that the proposed value decomposition can consider the complicated interactions among agents and is feasible to learn in large-scale scenarios. Then, our method uses a greedy marginal contribution computed from the value decomposition as an individual credit to incentivize agents to learn the optimal cooperative policy. We further extend the module with an action encoder to guarantee the linear time complexity for computing the greedy marginal contribution. Experimental results demonstrate that our method achieves significant performance improvements in several non-monotonic domains.
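
A minimal sketch of the greedy marginal contribution used as individual credit, with a toy group value function standing in for the learned adaptive value decomposition:

```python
def marginal_contribution(group_value, agents, i):
    """credit_i = V(group) - V(group without i)."""
    return group_value(agents) - group_value([a for a in agents if a != i])

# Toy group value with diminishing returns in team size (illustrative only;
# the paper learns this value for dynamically changing groups of agents).
v = lambda agents: len(agents) ** 0.5
credits = {i: marginal_contribution(v, [0, 1, 2], i) for i in [0, 1, 2]}
print(credits)  # each agent credited with sqrt(3) - sqrt(2) here
```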

Breaking the Curse of Multiagency: Provably Efficient Decentralized Multi-Agent RL with Function Approximation

Authors:Yuanhao Wang, Qinghua Liu, Yu Bai, Chi Jin
Date:2023-02-13 18:59:25

A unique challenge in Multi-Agent Reinforcement Learning (MARL) is the curse of multiagency, where the description length of the game as well as the complexity of many existing learning algorithms scale exponentially with the number of agents. While recent works successfully address this challenge under the model of tabular Markov Games, their mechanisms critically rely on the number of states being finite and small, and do not extend to practical scenarios with enormous state spaces where function approximation must be used to approximate value functions or policies. This paper presents the first line of MARL algorithms that provably resolve the curse of multiagency under function approximation. We design a new decentralized algorithm -- V-Learning with Policy Replay, which gives the first polynomial sample complexity results for learning approximate Coarse Correlated Equilibria (CCEs) of Markov Games under decentralized linear function approximation. Our algorithm always outputs Markov CCEs, and achieves an optimal rate of $\widetilde{\mathcal{O}}(\epsilon^{-2})$ for finding $\epsilon$-optimal solutions. Also, when restricted to the tabular case, our result improves over the current best decentralized result $\widetilde{\mathcal{O}}(\epsilon^{-3})$ for finding Markov CCEs. We further present an alternative algorithm -- Decentralized Optimistic Policy Mirror Descent, which finds policy-class-restricted CCEs using a polynomial number of samples. In exchange for learning a weaker version of CCEs, this algorithm applies to a wider range of problems under generic function approximation, such as linear quadratic games and MARL problems with low ''marginal'' Eluder dimension.

MANSA: Learning Fast and Slow in Multi-Agent Systems

Authors:David Mguni, Haojun Chen, Taher Jafferjee, Jianhong Wang, Long Fei, Xidong Feng, Stephen McAleer, Feifei Tong, Jun Wang, Yaodong Yang
Date:2023-02-12 13:23:39

In multi-agent reinforcement learning (MARL), independent learning (IL) often shows remarkable performance and easily scales with the number of agents. Yet, IL can be inefficient and runs the risk of failing to train successfully, particularly in scenarios that require agents to coordinate their actions. Centralised learning (CL) enables MARL agents to quickly learn how to coordinate their behaviour, but employing CL everywhere is often prohibitively expensive in real-world applications. Moreover, using CL in value-based methods often requires strong representational constraints (e.g. the individual-global-max condition) that can lead to poor performance if violated. In this paper, we introduce a novel plug & play IL framework named Multi-Agent Network Selection Algorithm (MANSA), which selectively employs CL only at states that require coordination. At its core, MANSA has an additional agent that uses switching controls to quickly learn the best states at which to activate CL during training, using CL only where necessary and vastly reducing the computational burden of CL. Our theory proves that MANSA preserves cooperative MARL convergence properties, boosts IL performance, and can optimally make use of a fixed budget on the number of CL calls. We show empirically in Level-based Foraging (LBF) and the StarCraft Multi-agent Challenge (SMAC) that MANSA achieves fast, superior and more reliable performance while making 40% fewer CL calls in SMAC and activating CL in only 1% of calls in LBF.
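
The switching-control idea can be pictured as a small meta-learner that treats "activate CL at this state or not" as its own decision problem under a budget. The sketch below is a hypothetical, tabular simplification (the class name, state keys, and scalar reward signal are illustrative assumptions, not MANSA's actual design):

```python
import random
from collections import defaultdict

class SwitchingController:
    """Hypothetical sketch of MANSA-style switching control: a meta-agent
    learns, per state, whether centralised learning (CL) is worth invoking,
    subject to a fixed budget of CL calls."""

    def __init__(self, budget, lr=0.1, eps=0.1):
        self.q = defaultdict(lambda: [0.0, 0.0])  # [value of IL, value of CL]
        self.budget = budget
        self.lr, self.eps = lr, eps

    def choose(self, state_key):
        if self.budget <= 0:
            return 0  # budget exhausted: fall back to independent learning
        if random.random() < self.eps:
            return random.randint(0, 1)
        return max((0, 1), key=lambda a: self.q[state_key][a])

    def update(self, state_key, action, reward):
        if action == 1:
            self.budget -= 1  # each CL activation consumes budget
        self.q[state_key][action] += self.lr * (reward - self.q[state_key][action])
```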

Scalability Bottlenecks in Multi-Agent Reinforcement Learning Systems

Authors:Kailash Gogineni, Peng Wei, Tian Lan, Guru Venkataramani
Date:2023-02-10 01:30:01

Multi-Agent Reinforcement Learning (MARL) is a promising area of research that can model and control multiple, autonomous decision-making agents. During online training, MARL algorithms involve performance-intensive computations, such as the exploration and exploitation phases, originating from the large observation-action spaces of multiple agents. In this article, we seek to characterize the scalability bottlenecks in several popular classes of MARL algorithms during their training phases. Our experimental results reveal new insights into the key modules of MARL algorithms that limit scalability, and outline potential strategies that may help address these performance issues.

Quantum Multi-Agent Actor-Critic Networks for Cooperative Mobile Access in Multi-UAV Systems

Authors:Chanyoung Park, Won Joon Yun, Jae Pyoung Kim, Tiago Koketsu Rodrigues, Soohyun Park, Soyi Jung, Joongheon Kim
Date:2023-02-09 05:31:57

This paper proposes a novel algorithm, named quantum multi-agent actor-critic networks (QMACN), for autonomously constructing a robust mobile access system employing multiple unmanned aerial vehicles (UAVs). In the context of facilitating collaboration among multiple UAVs, the application of multi-agent reinforcement learning (MARL) techniques is regarded as a promising approach. These methods enable UAVs to learn collectively, optimizing their actions within a shared environment, ultimately leading to more efficient cooperative behavior. Furthermore, the principles of quantum computing (QC) are employed in our study to enhance the training process and inference capabilities of the UAVs involved. By leveraging the unique computational advantages of quantum computing, our approach aims to boost the overall effectiveness of the UAV system. However, employing QC introduces scalability challenges due to the noisy intermediate-scale quantum (NISQ) limitation associated with qubit usage. The proposed algorithm addresses this issue by implementing a quantum centralized critic, effectively mitigating the constraints imposed by NISQ limitations. Additionally, the advantages of QMACN, in terms of training speed and wireless service quality, are verified via various data-intensive evaluations. Furthermore, this paper validates that a noise injection scheme can be used for handling environmental uncertainties in order to realize robust mobile access.

Shared Information-Based Safe And Efficient Behavior Planning For Connected Autonomous Vehicles

Authors:Songyang Han, Shanglin Zhou, Lynn Pepin, Jiangwei Wang, Caiwen Ding, Fei Miao
Date:2023-02-08 20:31:41

The recent advancements in wireless technology enable connected autonomous vehicles (CAVs) to gather data via vehicle-to-vehicle (V2V) communication, such as processed LIDAR and camera data from other vehicles. In this work, we design an integrated information sharing and safe multi-agent reinforcement learning (MARL) framework for CAVs to take advantage of the extra information when making decisions, improving traffic efficiency and safety. We first use weight-pruned convolutional neural networks (CNNs) to process the raw image and point cloud LIDAR data locally at each autonomous vehicle, and share the CNN-output data with neighboring CAVs. We then design a safe actor-critic algorithm that utilizes both a vehicle's local observation and the information received via V2V communication to explore an efficient behavior planning policy with safety guarantees. Using the CARLA simulator for experiments, we show that our approach improves the CAV system's efficiency in terms of average velocity and comfort under different CAV ratios and different traffic densities. We also show that our approach avoids the execution of unsafe actions and always maintains a safe distance from other vehicles. We construct an obstacle-at-corner scenario to show that shared vision can help CAVs observe obstacles earlier and take action to avoid traffic jams.

Learning Graph-Enhanced Commander-Executor for Multi-Agent Navigation

Authors:Xinyi Yang, Shiyu Huang, Yiwen Sun, Yuxiang Yang, Chao Yu, Wei-Wei Tu, Huazhong Yang, Yu Wang
Date:2023-02-08 14:44:21

This paper investigates the multi-agent navigation problem, which requires multiple agents to reach target goals in a limited time. Multi-agent reinforcement learning (MARL) has shown promising results for solving this issue. However, it is inefficient for MARL to directly explore the (nearly) optimal policy in the large search space, which is exacerbated as the agent number increases (e.g., 10+ agents) or the environment becomes more complex (e.g., a 3D simulator). Goal-conditioned hierarchical reinforcement learning (HRL) provides a promising direction to tackle this challenge by introducing a hierarchical structure that decomposes the search space, where the low-level policy predicts primitive actions under the guidance of the goals derived from the high-level policy. In this paper, we propose Multi-Agent Graph-Enhanced Commander-Executor (MAGE-X), a graph-based goal-conditioned hierarchical method for multi-agent navigation tasks. MAGE-X comprises a high-level Goal Commander and a low-level Action Executor. The Goal Commander predicts a probability distribution over goals and leverages it to assign each agent the most appropriate final target. The Action Executor utilizes graph neural networks (GNNs) to construct a subgraph for each agent that contains only its crucial partners, to improve cooperation. Additionally, the Goal Encoder in the Action Executor captures the relationship between the agent and the designated goal to encourage the agent to reach the final target. The results show that MAGE-X outperforms state-of-the-art MARL baselines, achieving a 100% success rate with only 3 million training steps in multi-agent particle environments (MPE) with 50 agents, and at least a 12% higher success rate and 2x higher data efficiency in a more complicated quadrotor 3D navigation task.

Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning

Authors:Lukas Schäfer, Oliver Slumbers, Stephen McAleer, Yali Du, Stefano V. Albrecht, David Mguni
Date:2023-02-07 12:51:20

Multi-agent reinforcement learning (MARL) requires agents to explore within a vast joint action space to find joint actions that lead to coordination. Existing value-based MARL algorithms commonly rely on random exploration, such as $\epsilon$-greedy, which is neither systematic nor efficient at identifying effective actions in multi-agent problems. Additionally, concurrently training the policies of multiple agents can render the optimisation non-stationary. This can lead to unstable value estimates and highly variant gradients, and ultimately hinder coordination between agents. To address these challenges, we propose ensemble value functions for multi-agent exploration (EMAX), a framework that seamlessly extends value-based MARL algorithms. EMAX leverages an ensemble of value functions for each agent to guide their exploration, reduce the variance of their optimisation, and make their policies more robust to miscoordination. It achieves these benefits by (1) systematically guiding the exploration of agents with a UCB policy towards parts of the environment that require multiple agents to coordinate; (2) computing average value estimates across the ensemble as target values, which reduces the variance of gradients and makes optimisation more stable; and (3) selecting actions at evaluation time by a majority vote across the ensemble, which reduces the likelihood of miscoordination. We first instantiate independent DQN with EMAX and evaluate it in 11 general-sum tasks with sparse rewards, showing that EMAX improves final evaluation returns by 185% across all tasks. We then evaluate EMAX on top of IDQN, VDN and QMIX in 21 common-reward tasks, and show that EMAX improves sample efficiency and final evaluation returns across all tasks over the three vanilla algorithms by 60%, 47%, and 538%, respectively.
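
A tabular sketch of the three ensemble mechanisms described above: UCB exploration from ensemble disagreement, averaged bootstrap targets, and majority-vote greedy action selection. Shapes and hyperparameters are illustrative assumptions; the paper applies these ideas to deep value networks.

```python
import numpy as np

class EnsembleQ:
    """Minimal tabular sketch of EMAX's three uses of a value ensemble."""

    def __init__(self, n_states, n_actions, k=5, beta=1.0):
        self.q = np.random.randn(k, n_states, n_actions) * 0.01  # k members
        self.beta = beta

    def explore_action(self, s):
        mean = self.q[:, s].mean(axis=0)
        std = self.q[:, s].std(axis=0)
        return int(np.argmax(mean + self.beta * std))  # UCB over the ensemble

    def target(self, s_next, gamma, reward):
        # Average across members to reduce the variance of the target.
        return reward + gamma * self.q[:, s_next].mean(axis=0).max()

    def greedy_action(self, s):
        # Majority vote across members at evaluation time.
        votes = self.q[:, s].argmax(axis=1)
        return int(np.bincount(votes).argmax())
```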

Towards Skilled Population Curriculum for Multi-Agent Reinforcement Learning

Authors:Rundong Wang, Longtao Zheng, Wei Qiu, Bowei He, Bo An, Zinovi Rabinovich, Yujing Hu, Yingfeng Chen, Tangjie Lv, Changjie Fan
Date:2023-02-07 12:30:52

Recent advances in multi-agent reinforcement learning (MARL) allow agents to coordinate their behaviors in complex environments. However, common MARL algorithms still suffer from scalability and sparse-reward issues. One promising approach to resolving them is automatic curriculum learning (ACL). ACL involves a student (curriculum learner) training on tasks of increasing difficulty controlled by a teacher (curriculum generator). Despite its success, ACL's applicability is limited by (1) the lack of a general student framework for dealing with the varying number of agents across tasks and the sparse reward problem, and (2) the non-stationarity of the teacher's task due to ever-changing student strategies. To address these limitations, we introduce a novel automatic curriculum learning framework, Skilled Population Curriculum (SPC), which adapts curriculum learning to multi-agent coordination. Specifically, we endow the student with population-invariant communication and a hierarchical skill set, allowing it to learn cooperation and behavior skills from distinct tasks with varying numbers of agents. In addition, we model the teacher as a contextual bandit conditioned on student policies, enabling a team of agents to change its size while still retaining previously acquired skills. We also analyze the inherent non-stationarity of this multi-agent automatic curriculum teaching problem and provide a corresponding regret bound. Empirical results show that our method improves performance, scalability and sample efficiency in several MARL environments.

Dealing With Non-stationarity in Decentralized Cooperative Multi-Agent Deep Reinforcement Learning via Multi-Timescale Learning

Authors:Hadi Nekoei, Akilesh Badrinaaraayanan, Amit Sinha, Mohammad Amini, Janarthanan Rajendran, Aditya Mahajan, Sarath Chandar
Date:2023-02-06 14:10:53

Decentralized cooperative multi-agent deep reinforcement learning (MARL) can be a versatile learning framework, particularly in scenarios where centralized training is either not possible or not practical. One of the critical challenges in decentralized deep MARL is the non-stationarity of the learning environment when multiple agents are learning concurrently. A commonly used and efficient scheme for decentralized MARL is independent learning, in which agents concurrently update their policies independently of each other. We first show that independent learning does not always converge, while sequential learning, where agents update their policies one after another in a sequence, is guaranteed to converge to an agent-by-agent optimal solution. In sequential learning, when one agent updates its policy, all other agents' policies are kept fixed, alleviating the challenge of non-stationarity due to simultaneous updates in other agents' policies. However, it can be slow because only one agent is learning at any time, so it might not always be practical. In this work, we propose a decentralized cooperative MARL algorithm based on multi-timescale learning. In multi-timescale learning, all agents learn simultaneously, but at different learning rates. In our proposed method, when one agent updates its policy, other agents are allowed to update their policies as well, but at a slower rate. This speeds up sequential learning, while also minimizing the non-stationarity caused by other agents updating concurrently. Multi-timescale learning outperforms state-of-the-art decentralized learning methods on a set of challenging multi-agent cooperative tasks in the EPyMARL (Papoudakis et al., 2020) benchmark. This can be seen as a first step towards more general decentralized cooperative deep MARL methods based on multi-timescale learning.
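
A minimal sketch of the core idea, assuming PyTorch and stand-in linear policies: every agent keeps learning concurrently, but with learning rates spread across timescales.

```python
import torch

# Hypothetical sketch: all agents update concurrently, but an agent's
# optimiser runs at a slower learning rate the later it sits in the
# ordering, approximating sequential learning without its serial cost.
base_lr, slowdown = 1e-3, 0.1
agents = [torch.nn.Linear(8, 4) for _ in range(3)]  # stand-in policies
optimizers = [
    torch.optim.Adam(agent.parameters(), lr=base_lr * (slowdown ** rank))
    for rank, agent in enumerate(agents)
]
# Rank 0 learns at 1e-3, rank 1 at 1e-4, rank 2 at 1e-5; rotating the
# ranks over training would let every agent take a turn on the fast
# timescale while the others change slowly underneath it.
```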

Learning Zero-Shot Cooperation with Humans, Assuming Humans Are Biased

Authors:Chao Yu, Jiaxuan Gao, Weilin Liu, Botian Xu, Hao Tang, Jiaqi Yang, Yu Wang, Yi Wu
Date:2023-02-03 09:06:42

There is a recent trend of applying multi-agent reinforcement learning (MARL) to train an agent that can cooperate with humans in a zero-shot fashion without using any human data. The typical workflow is to first repeatedly run self-play (SP) to build a policy pool and then train the final adaptive policy against this pool. A crucial limitation of this framework is that every policy in the pool is optimized w.r.t. the environment reward function, which implicitly assumes that the testing partners of the adaptive policy will be precisely optimizing the same reward function as well. However, human objectives are often substantially biased according to their own preferences, which can differ greatly from the environment reward. We propose a more general framework, Hidden-Utility Self-Play (HSP), which explicitly models human biases as hidden reward functions in the self-play objective. By approximating the reward space as linear functions, HSP adopts an effective technique to generate an augmented policy pool with biased policies. We evaluate HSP on the Overcooked benchmark. Empirical results show that our HSP method produces higher rewards than baselines when cooperating with learned human models, manually scripted policies, and real humans. The HSP policy is also rated as the most assistive policy based on human feedback.

Combining Tree-Search, Generative Models, and Nash Bargaining Concepts in Game-Theoretic Reinforcement Learning

Authors:Zun Li, Marc Lanctot, Kevin R. McKee, Luke Marris, Ian Gemp, Daniel Hennes, Paul Muller, Kate Larson, Yoram Bachrach, Michael P. Wellman
Date:2023-02-01 23:06:23

Multiagent reinforcement learning (MARL) has benefited significantly from population-based and game-theoretic training regimes. One approach, Policy-Space Response Oracles (PSRO), employs standard reinforcement learning to compute response policies via approximate best responses and combines them via meta-strategy selection. We augment PSRO by adding a novel search procedure with generative sampling of world states, and introduce two new meta-strategy solvers based on the Nash bargaining solution. We evaluate PSRO's ability to compute approximate Nash equilibrium, and its performance in two negotiation games: Colored Trails, and Deal or No Deal. We conduct behavioral studies where human participants negotiate with our agents ($N = 346$). We find that search with generative modeling finds stronger policies during both training time and test time, enables online Bayesian co-player prediction, and can produce agents that, when negotiating with humans, achieve social welfare comparable to that of humans trading among themselves.

Task Placement and Resource Allocation for Edge Machine Learning: A GNN-based Multi-Agent Reinforcement Learning Paradigm

Authors:Yihong Li, Xiaoxi Zhang, Tianyu Zeng, Jingpu Duan, Chuan Wu, Di Wu, Xu Chen
Date:2023-02-01 16:45:26

Machine learning (ML) tasks are one of the major workloads in today's edge computing networks. Existing edge-cloud schedulers allocate the requested amounts of resources to each task, falling short of best utilizing the limited edge resources for ML tasks. This paper proposes TapFinger, a distributed scheduler for edge clusters that minimizes the total completion time of ML tasks through co-optimizing task placement and fine-grained multi-resource allocation. To learn the tasks' uncertain resource sensitivity and enable distributed scheduling, we adopt multi-agent reinforcement learning (MARL) and propose several techniques to make it efficient, including a heterogeneous graph attention network as the MARL backbone, a tailored task selection phase in the actor network, and the integration of Bayes' theorem and masking schemes. We first implement a single-task scheduling version, which schedules at most one task at a time. Then we generalize to the multi-task scheduling case, in which a sequence of tasks is scheduled simultaneously. Our design mitigates the expanded decision space and yields fast convergence to optimal scheduling solutions. Extensive experiments using synthetic and test-bed ML task traces show that TapFinger can achieve up to a 54.9% reduction in average task completion time and improve resource efficiency compared to state-of-the-art schedulers.
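
Of the techniques listed, the masking scheme is the most self-contained to illustrate. The snippet below is a generic sketch of action masking, not TapFinger's implementation: invalid placements receive a logit of negative infinity so the policy can never sample them, shrinking the effective decision space.

```python
import torch

def masked_policy(logits, valid):
    # Invalid actions get -inf logits, i.e. exactly zero sampling probability.
    masked = logits.masked_fill(~valid, float("-inf"))
    return torch.distributions.Categorical(logits=masked)

logits = torch.randn(6)                          # e.g. six candidate placements
valid = torch.tensor([1, 0, 1, 1, 0, 1], dtype=torch.bool)
action = masked_policy(logits, valid).sample()   # only indices 0, 2, 3, 5 possible
```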

Off-the-Grid MARL: Datasets with Baselines for Offline Multi-Agent Reinforcement Learning

Authors:Claude Formanek, Asad Jeewa, Jonathan Shock, Arnu Pretorius
Date:2023-02-01 15:41:27

Being able to harness the power of large datasets for developing cooperative multi-agent controllers promises to unlock enormous value for real-world applications. Many important industrial systems are multi-agent in nature and are difficult to model using bespoke simulators. However, in industry, distributed processes can often be recorded during operation, and large quantities of demonstrative data stored. Offline multi-agent reinforcement learning (MARL) provides a promising paradigm for building effective decentralised controllers from such datasets. However, offline MARL is still in its infancy and therefore lacks standardised benchmark datasets and baselines typically found in more mature subfields of reinforcement learning (RL). These deficiencies make it difficult for the community to sensibly measure progress. In this work, we aim to fill this gap by releasing off-the-grid MARL (OG-MARL): a growing repository of high-quality datasets with baselines for cooperative offline MARL research. Our datasets provide settings that are characteristic of real-world systems, including complex environment dynamics, heterogeneous agents, non-stationarity, many agents, partial observability, suboptimality, sparse rewards and demonstrated coordination. For each setting, we provide a range of different dataset types (e.g. Good, Medium, Poor, and Replay) and profile the composition of experiences for each dataset. We hope that OG-MARL will serve the community as a reliable source of datasets and help drive progress, while also providing an accessible entry point for researchers new to the field.

DIFFER: Decomposing Individual Reward for Fair Experience Replay in Multi-Agent Reinforcement Learning

Authors:Xunhan Hu, Jian Zhao, Wengang Zhou, Ruili Feng, Houqiang Li
Date:2023-01-25 13:27:05

Cooperative multi-agent reinforcement learning (MARL) is a challenging task, as agents must learn complex and diverse individual strategies from a shared team reward. However, existing methods struggle to distinguish and exploit important individual experiences, as they lack an effective way to decompose the team reward into individual rewards. To address this challenge, we propose DIFFER, a powerful theoretical framework for decomposing individual rewards to enable fair experience replay in MARL. By enforcing the invariance of network gradients, we establish a partial differential equation whose solution yields the underlying individual reward function. The individual TD-error can then be computed from the solved closed-form individual rewards, indicating the importance of each piece of experience in the learning task and guiding the training process. Our method elegantly achieves an equivalence to the original learning framework when individual experiences are homogeneous, while adapting to achieve greater efficiency and fairness when diversity is observed. Our extensive experiments on popular benchmarks validate the effectiveness of our theory and method, demonstrating significant improvements in learning efficiency and fairness.

On Multi-Agent Deep Deterministic Policy Gradients and their Explainability for SMARTS Environment

Authors:Ansh Mittal, Aditya Malte
Date:2023-01-20 03:17:16

Multi-Agent RL (MARL) is one of the complex problems in the Autonomous Driving literature that hampers the release of fully-autonomous vehicles today. Several simulators have been iterated on since their inception to mitigate the problem of complex scenarios with multiple agents in Autonomous Driving. One such simulator, SMARTS, emphasizes the importance of cooperative multi-agent learning. For this problem, we discuss two approaches, MAPPO and MADDPG, which are based on on-policy and off-policy RL respectively. We compare our results with the state-of-the-art results for this challenge and discuss the potential areas of improvement, while discussing the explainability of these approaches in conjunction with waypoints in the SMARTS environment.

Investigating the Impact of Direct Punishment on the Emergence of Cooperation in Multi-Agent Reinforcement Learning Systems

Authors:Nayana Dasgupta, Mirco Musolesi
Date:2023-01-19 19:33:54

Solving the problem of cooperation is fundamentally important for the creation and maintenance of functional societies. Problems of cooperation are omnipresent within human society, with examples ranging from navigating busy road junctions to negotiating treaties. As the use of AI becomes more pervasive throughout society, the need for socially intelligent agents capable of navigating these complex cooperative dilemmas is becoming increasingly evident. Direct punishment is a ubiquitous social mechanism that has been shown to foster the emergence of cooperation in both humans and non-humans. In the natural world, direct punishment is often strongly coupled with partner selection and reputation and used in conjunction with third-party punishment. The interactions between these mechanisms could potentially enhance the emergence of cooperation within populations. However, no previous work has evaluated the learning dynamics and outcomes emerging from Multi-Agent Reinforcement Learning (MARL) populations that combine these mechanisms. This paper addresses this gap. It presents a comprehensive analysis and evaluation of the behaviors and learning dynamics associated with direct punishment, third-party punishment, partner selection, and reputation. Finally, we discuss the implications of using these mechanisms on the design of cooperative AI systems.

Heterogeneous Multi-Robot Reinforcement Learning

Authors:Matteo Bettini, Ajay Shankar, Amanda Prorok
Date:2023-01-17 19:05:17

Cooperative multi-robot tasks can benefit from heterogeneity in the robots' physical and behavioral traits. In spite of this, traditional Multi-Agent Reinforcement Learning (MARL) frameworks lack the ability to explicitly accommodate policy heterogeneity, and typically constrain agents to share neural network parameters. This enforced homogeneity limits application in cases where the tasks benefit from heterogeneous behaviors. In this paper, we crystallize the role of heterogeneity in MARL policies. Towards this end, we introduce Heterogeneous Graph Neural Network Proximal Policy Optimization (HetGPPO), a paradigm for training heterogeneous MARL policies that leverages a Graph Neural Network for differentiable inter-agent communication. HetGPPO allows communicating agents to learn heterogeneous behaviors while enabling fully decentralized training in partially observable environments. We complement this with a taxonomical overview that exposes more heterogeneity classes than previously identified. To motivate the need for our model, we present a characterization of techniques that homogeneous models can leverage to emulate heterogeneous behavior, and show how this "apparent heterogeneity" is brittle in real-world conditions. Through simulations and real-world experiments, we show that: (i) when homogeneous methods fail due to strong heterogeneous requirements, HetGPPO succeeds, and, (ii) when homogeneous methods are able to learn apparently heterogeneous behaviors, HetGPPO achieves higher resilience to both training and deployment noise.

Mean-Field Control based Approximation of Multi-Agent Reinforcement Learning in Presence of a Non-decomposable Shared Global State

Authors:Washim Uddin Mondal, Vaneet Aggarwal, Satish V. Ukkusuri
Date:2023-01-13 18:55:58

Mean Field Control (MFC) is a powerful approximation tool to solve large-scale Multi-Agent Reinforcement Learning (MARL) problems. However, the success of MFC relies on the presumption that given the local states and actions of all the agents, the next (local) states of the agents evolve conditionally independent of each other. Here we demonstrate that even in a MARL setting where agents share a common global state in addition to their local states evolving conditionally independently (thus introducing a correlation between the state transition processes of individual agents), the MFC can still be applied as a good approximation tool. The global state is assumed to be non-decomposable, i.e., it cannot be expressed as a collection of local states of the agents. We compute the approximation error as $\mathcal{O}(e)$ where $e=\frac{1}{\sqrt{N}}\left[\sqrt{|\mathcal{X}|} +\sqrt{|\mathcal{U}|}\right]$. The size of the agent population is denoted by the term $N$, and $|\mathcal{X}|, |\mathcal{U}|$ respectively indicate the sizes of (local) state and action spaces of individual agents. The approximation error is found to be independent of the size of the shared global state space. We further demonstrate that in the special case where the reward and state transition functions are independent of the action distribution of the population, the error can be improved to $e=\frac{\sqrt{|\mathcal{X}|}}{\sqrt{N}}$. Finally, we devise a Natural Policy Gradient based algorithm that solves the MFC problem with $\mathcal{O}(\epsilon^{-3})$ sample complexity and obtains a policy that is within $\mathcal{O}(\max\{e,\epsilon\})$ error of the optimal MARL policy for any $\epsilon>0$.
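
To make the scaling concrete, here is the bound evaluated on one illustrative configuration (the numbers are arbitrary and serve only to show the $1/\sqrt{N}$ behavior):

```python
from math import sqrt

# Worked instance of the error bound e = (sqrt(|X|) + sqrt(|U|)) / sqrt(N).
N, X, U = 10_000, 10, 5
e = (sqrt(X) + sqrt(U)) / sqrt(N)
print(f"e = {e:.3f}")  # ~0.054: the MFC approximation tightens as N grows
```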

TransfQMix: Transformers for Leveraging the Graph Structure of Multi-Agent Reinforcement Learning Problems

Authors:Matteo Gallici, Mario Martin, Ivan Masmitja
Date:2023-01-13 00:07:08

Coordination is one of the most difficult aspects of multi-agent reinforcement learning (MARL). One reason is that agents normally choose their actions independently of one another. In order to see coordination strategies emerge from the combination of independent policies, recent research has focused on the use of a centralized function (CF) that learns each agent's contribution to the team reward. However, the structure in which the environment is presented to the agents and to the CF is typically overlooked. We have observed that the features used to describe the coordination problem can be represented as vertex features of a latent graph structure. Here, we present TransfQMix, a new approach that uses transformers to leverage this latent structure and learn better coordination policies. Our transformer agents perform graph reasoning over the state of the observable entities. Our transformer Q-mixer learns a monotonic mixing function from a larger graph that includes the internal and external states of the agents. TransfQMix is designed to be entirely transferable, meaning that the same parameters can be used to control and train larger or smaller teams of agents. This makes it possible to deploy promising approaches for saving training time and deriving general policies in MARL, such as transfer learning, zero-shot transfer, and curriculum learning. We report TransfQMix's performance in the Spread and StarCraft II environments. In both settings, it outperforms state-of-the-art Q-learning models, and it demonstrates effectiveness in solving problems that other methods cannot solve.
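
The "vertex features of a latent graph" view can be sketched as a Transformer encoder applied to a variable-length set of entity vectors, which is what makes the parameters team-size agnostic. The snippet below is a schematic illustration under that assumption; the dimensions and the pooling choice are ours, not the paper's.

```python
import torch

# Attention over a set of entity feature vectors: the entity axis is a
# variable-length sequence, so the same weights serve any team size.
encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True),
    num_layers=2,
)
q_head = torch.nn.Linear(32, 5)    # 5 illustrative actions

entities = torch.randn(1, 7, 32)   # batch of 1, seven observable entities
h = encoder(entities).mean(dim=1)  # permutation-tolerant pooling
q_values = q_head(h)               # Q-values independent of entity count
```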

SoK: Adversarial Machine Learning Attacks and Defences in Multi-Agent Reinforcement Learning

Authors:Maxwell Standen, Junae Kim, Claudia Szabo
Date:2023-01-11 04:25:00

Multi-Agent Reinforcement Learning (MARL) is vulnerable to Adversarial Machine Learning (AML) attacks and needs adequate defences before it can be used in real-world applications. We have conducted a survey into the use of execution-time AML attacks against MARL and the defences against those attacks. We surveyed related work on the application of AML in Deep Reinforcement Learning (DRL) and Multi-Agent Learning (MAL) to inform our analysis of AML for MARL. We propose a novel perspective for understanding how an AML attack is perpetrated, by defining Attack Vectors. We develop two new frameworks to address a gap in current modelling frameworks, focusing on the means and tempo of an AML attack against MARL, and identify knowledge gaps and future avenues of research.

Asynchronous Multi-Agent Reinforcement Learning for Efficient Real-Time Multi-Robot Cooperative Exploration

Authors:Chao Yu, Xinyi Yang, Jiaxuan Gao, Jiayu Chen, Yunfei Li, Jijia Liu, Yunfei Xiang, Ruixin Huang, Huazhong Yang, Yi Wu, Yu Wang
Date:2023-01-09 14:53:38

We consider the problem of cooperative exploration, where multiple robots need to cooperatively explore an unknown region as fast as possible. Multi-agent reinforcement learning (MARL) has recently become a trending paradigm for solving this challenge. However, existing MARL-based methods adopt action-making steps as the metric for exploration efficiency by assuming all the agents act in a fully synchronous manner: i.e., every single agent produces an action simultaneously and every single action is executed instantaneously at each time step. Despite its mathematical simplicity, such a synchronous MARL formulation can be problematic for real-world robotic applications. In practice, different robots may take slightly different wall-clock times to accomplish an atomic action, or even periodically get lost due to hardware issues. Simply waiting for every robot to be ready for the next action can be particularly time-inefficient. Therefore, we propose an asynchronous MARL solution, Asynchronous Coordination Explorer (ACE), to tackle this real-world challenge. We first extend a classical MARL algorithm, multi-agent PPO (MAPPO), to the asynchronous setting and additionally apply action-delay randomization to encourage the learned policy to generalize better to varying action delays in the real world. Moreover, each navigation agent is represented as a team-size-invariant CNN-based policy, which greatly benefits real-robot deployment by handling possible robot loss and allows bandwidth-efficient inter-agent communication through low-dimensional CNN features. We first validate our approach in a grid-based scenario. Both simulation and real-robot results show that ACE reduces actual exploration time by over 10% compared with classical approaches. We also apply our framework to a high-fidelity visual-based environment, Habitat, achieving a 28% improvement in exploration efficiency.
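
Action-delay randomization can be pictured as an environment wrapper that holds each agent's action for a random number of ticks before executing it. The wrapper below is a hypothetical sketch of that training-time perturbation, not ACE's code; it assumes an environment whose `step` takes a dict of per-agent actions.

```python
import random

class ActionDelayRandomization:
    """Sketch: each agent's chosen action takes effect only after a random
    number of environment ticks, so the learned policy must tolerate the
    varying wall-clock delays of real robots."""

    def __init__(self, env, max_delay=3):
        self.env, self.max_delay = env, max_delay
        self.pending = []  # entries of [remaining_ticks, agent_id, action]

    def step(self, actions):
        for agent_id, action in actions.items():
            self.pending.append([random.randint(0, self.max_delay), agent_id, action])
        ready = {aid: act for ticks, aid, act in self.pending if ticks == 0}
        # Keep undelivered actions and count each one down by a tick.
        self.pending = [[t - 1, aid, act] for t, aid, act in self.pending if t > 0]
        return self.env.step(ready)  # only actions whose delay elapsed execute
```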

Platoon Leader Selection, User Association and Resource Allocation on a C-V2X based highway: A Reinforcement Learning Approach

Authors:Mohammad Farzanullah, Tho Le-Ngoc
Date:2023-01-09 02:07:22

We consider the problem of dynamic platoon leader selection, user association, channel assignment, and power allocation on a cellular vehicle-to-everything (C-V2X) based highway, where multiple vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) links share the frequency resources. There are multiple roadside units (RSUs) on a highway, and vehicles can form platoons, which has been identified as an advanced use case to increase road efficiency. The traditional optimization methods, requiring global channel information at a central controller, are not viable for high-mobility vehicular networks. To deal with this challenge, we propose a distributed multi-agent reinforcement learning (MARL) for resource allocation (RA). Each platoon leader, acting as an agent, can collaborate with other agents for joint sub-band selection and power allocation for its V2V links, and joint user association and power control for its V2I links. Moreover, each platoon can dynamically select the vehicle most suitable to be the platoon leader. We aim to maximize the V2V and V2I packet delivery probability in the desired latency using the deep Q-learning algorithm. Simulation results indicate that our proposed MARL outperforms the centralized hill-climbing algorithm, and platoon leader selection helps to improve both V2V and V2I performance.

XDQN: Inherently Interpretable DQN through Mimicking

Authors:Andreas Kontogiannis, George Vouros
Date:2023-01-08 13:39:58

Although deep reinforcement learning (DRL) methods have been successfully applied in challenging tasks, their application in real-world operational settings is challenged by the methods' limited ability to provide explanations. Among the paradigms for explainability in DRL is the interpretable box design paradigm, where interpretable models substitute inner constituent models of the DRL method, thus making the DRL method "inherently" interpretable. In this paper we explore this paradigm and propose XDQN, an explainable variation of DQN, which uses an interpretable policy model trained through mimicking. XDQN is evaluated on a complex, real-world operational multi-agent problem, where agents are independent learners solving congestion problems. Specifically, XDQN is evaluated in three MARL scenarios, pertaining to the demand-capacity balancing problem of air traffic management. XDQN achieves performance similar to that of DQN, while its ability to provide global model interpretations and interpretations of local decisions is demonstrated.

Scalable Communication for Multi-Agent Reinforcement Learning via Transformer-Based Email Mechanism

Authors:Xudong Guo, Daming Shi, Wenhui Fan
Date:2023-01-05 05:34:30

Communication can impressively improve cooperation in multi-agent reinforcement learning (MARL), especially for partially-observed tasks. However, existing works either broadcast messages, leading to information redundancy, or learn targeted communication by modeling all the other agents as targets, which is not scalable when the number of agents varies. In this work, to tackle the scalability problem of MARL communication for partially-observed tasks, we propose a novel framework, the Transformer-based Email Mechanism (TEM). The agents adopt local communication, sending messages only to the agents they can observe, without modeling all the agents. Inspired by human cooperation via email forwarding, we design message chains to forward information so as to cooperate with agents outside the observation range. We introduce a Transformer to encode and decode the message chain and selectively choose the next receiver. Empirically, TEM outperforms the baselines on multiple cooperative MARL benchmarks. When the number of agents varies, TEM maintains superior performance without further training.
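
The forwarding decision can be sketched as scoring each observable neighbour against the incoming message and passing the chain to the top-scoring one. The snippet below is a simplified illustration: the paper uses a Transformer to encode and decode the chain, whereas a bilinear scorer stands in here for brevity, and all names are ours.

```python
import torch

# Score (message, candidate-receiver embedding) pairs; only currently
# observed neighbours are candidates, so communication stays local.
score = torch.nn.Bilinear(16, 16, 1)

def next_receiver(message, neighbour_embeddings):
    if len(neighbour_embeddings) == 0:
        return None  # nobody observable: the chain stops here
    msg = message.expand(len(neighbour_embeddings), -1)
    logits = score(msg, torch.stack(neighbour_embeddings)).squeeze(-1)
    return int(torch.argmax(logits))  # index of the chosen neighbour

receiver = next_receiver(torch.randn(1, 16), [torch.randn(16) for _ in range(4)])
```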

Quantum Multi-Agent Actor-Critic Neural Networks for Internet-Connected Multi-Robot Coordination in Smart Factory Management

Authors:Won Joon Yun, Jae Pyoung Kim, Soyi Jung, Jae-Hyun Kim, Joongheon Kim
Date:2023-01-04 04:28:39

As one of the latest fields of interest in both academia and industry, quantum computing has garnered significant attention. Among various topics in quantum computing, variational quantum circuits (VQC) have been noticed for their ability to carry out quantum deep reinforcement learning (QRL). This paper verifies the potential of QRL, which is further realized by extending it to quantum multi-agent reinforcement learning (QMARL), especially for Internet-connected autonomous multi-robot control and coordination in smart factory applications. However, the extension is not straightforward due to the non-stationarity of classical MARL. To cope with this, a centralized training and decentralized execution (CTDE) QMARL framework is proposed under the Internet connection. A smart factory environment with Internet of Things (IoT)-based multiple agents is used to show the efficacy of the proposed algorithm. The simulation corroborates that the proposed QMARL-based autonomous multi-robot control and coordination performs better than the other frameworks.

Distributed Machine Learning for UAV Swarms: Computing, Sensing, and Semantics

Authors:Yahao Ding, Zhaohui Yang, Quoc-Viet Pham, Zhaoyang Zhang, Mohammad Shikh-Bahaei
Date:2023-01-03 01:05:18

Unmanned aerial vehicle (UAV) swarms are considered a promising technique for next-generation communication networks due to their flexibility, mobility, low cost, and ability to collaboratively and autonomously provide services. Distributed learning (DL) enables UAV swarms to intelligently provide communication services, multi-directional remote surveillance, and target tracking. In this survey, we first introduce several popular DL algorithms, such as federated learning (FL), multi-agent reinforcement learning (MARL), distributed inference, and split learning, and present a comprehensive overview of their applications for UAV swarms, such as trajectory design, power control, wireless resource allocation, user assignment, perception, and satellite communications. Then, we present several state-of-the-art applications of UAV swarms in wireless communication systems, such as reconfigurable intelligent surfaces (RIS), virtual reality (VR), and semantic communications, and discuss the problems and challenges that DL-enabled UAV swarms can solve in these applications. Finally, we describe open problems of using DL in UAV swarms and future research directions of DL-enabled UAV swarms. In summary, this article provides a comprehensive overview of DL applications for UAV swarms across extensive scenarios.

Efficient Robustness Assessment via Adversarial Spatial-Temporal Focus on Videos

Authors:Wei Xingxing, Wang Songping, Yan Huanqian
Date:2023-01-03 00:28:57

Adversarial robustness assessment for video recognition models has raised concerns owing to their wide application in safety-critical tasks. Compared with images, videos have much higher dimensionality, which brings huge computational costs when generating adversarial videos. This is especially serious for query-based black-box attacks, where gradient estimation for the threat models is usually utilized and the high dimensionality leads to a large number of queries. To mitigate this issue, we propose to simultaneously eliminate the temporal and spatial redundancy within the video to achieve effective and efficient gradient estimation on the reduced search space, thereby decreasing the query number. To implement this idea, we design the novel Adversarial spatial-temporal Focus (AstFocus) attack on videos, which performs attacks on the simultaneously focused key frames and key regions from the inter-frames and intra-frames of the video. The AstFocus attack is based on the cooperative Multi-Agent Reinforcement Learning (MARL) framework. One agent is responsible for selecting key frames, and another agent is responsible for selecting key regions. These two agents are jointly trained by the common rewards received from the black-box threat models to perform a cooperative prediction. By continuous querying, the reduced search space composed of key frames and key regions becomes precise, and the whole query number becomes less than that on the original video. Extensive experiments on four mainstream video recognition models and three widely used action recognition datasets demonstrate that the proposed AstFocus attack outperforms SOTA methods in fooling rate, query number, time, and perturbation magnitude simultaneously.

Large-Scale Traffic Signal Control by a Nash Deep Q-network Approach

Authors:Yuli Zhang, Shangbo Wang, Ruiyuan Jiang
Date:2023-01-02 12:58:51

Reinforcement Learning (RL) is currently one of the most commonly used techniques for traffic signal control (TSC), as it can adaptively adjust traffic signal phase and duration according to real-time traffic data. However, a fully centralized RL approach is beset with difficulties in a multi-network scenario because of the exponential growth of the state-action space with an increasing number of intersections. Multi-agent reinforcement learning (MARL) can overcome the high-dimension problem by employing the global control of each local RL agent, but it also brings new challenges, such as the failure of convergence caused by the non-stationary Markov Decision Process (MDP). In this paper, we introduce an off-policy Nash deep Q-network (OPNDQN) algorithm, which mitigates the weaknesses of both fully centralized and MARL approaches. OPNDQN sidesteps the large state-action space that traditional algorithms cannot handle by utilizing a fictitious-play approach at each iteration to find the Nash equilibrium among neighboring intersections, from which no intersection has an incentive to unilaterally deviate. One of the main advantages of OPNDQN is mitigating the non-stationarity of the multi-agent Markov process, because it considers the mutual influence among neighboring intersections by sharing their actions. On the other hand, for training a large traffic network, the convergence rate of OPNDQN is higher than that of existing MARL approaches because it does not incorporate all state information of each agent. We conduct extensive experiments using the Simulation of Urban MObility (SUMO) simulator, and show the dominant superiority of OPNDQN over several existing MARL approaches in terms of average queue length, episode training reward and average waiting time.
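
The per-iteration fictitious-play step can be sketched as repeated best responses among neighbouring intersections until no agent benefits from deviating. This is a schematic reconstruction from the abstract, with hypothetical function and argument names:

```python
def fictitious_play(q_values, neighbours, actions, n_actions, max_iters=50):
    """q_values[i](a, ctx) -> estimated utility of intersection i playing a
    while its neighbours play the actions collected in ctx."""
    for _ in range(max_iters):
        changed = False
        for i, q in enumerate(q_values):
            ctx = tuple(actions[j] for j in neighbours[i])
            best = max(range(n_actions), key=lambda a: q(a, ctx))
            if best != actions[i]:
                actions[i], changed = best, True
        if not changed:
            break  # no intersection wants to unilaterally deviate
    return actions
```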

Decentralized Voltage Control with Peer-to-peer Energy Trading in a Distribution Network

Authors:Chen Feng, Andrew L. Lu, Yihsu Chen
Date:2022-12-29 02:32:20

Utilizing distributed renewable and energy storage resources via peer-to-peer (P2P) energy trading has long been touted as a solution to improve the energy system's resilience and sustainability. Consumers and prosumers (those who have energy generation resources), however, do not have the expertise to engage in repeated P2P trading, and the zero marginal costs of renewables present challenges in determining fair market prices. To address these issues, we propose a multi-agent reinforcement learning (MARL) framework to help automate consumers' bidding and management of their solar PV and energy storage resources, under a specific P2P clearing mechanism that utilizes the so-called supply-demand ratio. In addition, we show how the MARL framework can integrate physical network constraints to realize decentralized voltage control, hence ensuring the physical feasibility of the P2P energy trading and paving the way for real-world implementations.

Strangeness-driven Exploration in Multi-Agent Reinforcement Learning

Authors:Ju-Bong Kim, Ho-Bin Choi, Youn-Hee Han
Date:2022-12-27 11:08:49

An efficient exploration strategy is one of the essential issues in cooperative multi-agent reinforcement learning (MARL) algorithms that require complex coordination. In this study, we introduce a new exploration method based on strangeness, which can be easily incorporated into any centralized training and decentralized execution (CTDE)-based MARL algorithm. Strangeness refers to the degree of unfamiliarity of the observations that an agent visits. To give the observation strangeness a global perspective, it is also augmented with the degree of unfamiliarity of the visited entire state. The exploration bonus is obtained from the strangeness, and the proposed exploration method is not much affected by the stochastic transitions commonly observed in MARL tasks. To prevent a high exploration bonus from making the MARL training insensitive to extrinsic rewards, we also propose a separate action-value function, trained on both the extrinsic reward and the exploration bonus, on which the behavioral policy used to generate transitions is based. This makes CTDE-based MARL algorithms more stable when used with an exploration method. Through comparative evaluation in didactic examples and the StarCraft Multi-Agent Challenge, we show that the proposed exploration method achieves significant performance improvements in CTDE-based MARL algorithms.
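
One way to realize such a bonus, sketched below under our own assumptions (the paper defines strangeness via its own networks; here an observation autoencoder's reconstruction error stands in), is to add the unfamiliarity signal only to the reward stream that trains the behavioral action-value function:

```python
import torch

# Reconstruction error of an observation autoencoder serves as the
# "degree of unfamiliarity": poorly reconstructed observations are novel.
ae = torch.nn.Sequential(torch.nn.Linear(32, 8), torch.nn.ReLU(), torch.nn.Linear(8, 32))

def strangeness_bonus(obs):                         # obs: (batch, 32)
    with torch.no_grad():
        return ((ae(obs) - obs) ** 2).mean(dim=-1)  # high error = unfamiliar

obs = torch.randn(4, 32)
r_ext = torch.randn(4)
r_behaviour = r_ext + 0.1 * strangeness_bonus(obs)  # trains the behaviour policy
# r_ext alone would supervise the separate action-value used for evaluation,
# keeping the learned task values insensitive to the exploration bonus.
```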

Learning Individual Policies in Large Multi-agent Systems through Local Variance Minimization

Authors:Tanvi Verma, Pradeep Varakantham
Date:2022-12-27 06:59:00

In multi-agent systems with a large number of agents, the contribution of each agent to the value of other agents is typically minimal (e.g., aggregation systems such as Uber, Deliveroo). In this paper, we consider such multi-agent systems, in which each agent is self-interested and takes a sequence of decisions, and represent them as a Stochastic Non-atomic Congestion Game (SNCG). We derive key properties for equilibrium solutions in the SNCG model with non-atomic and also nearly non-atomic agents. With those key equilibrium properties, we provide a novel Multi-Agent Reinforcement Learning (MARL) mechanism that minimizes variance across the values of agents in the same state. To demonstrate the utility of this new mechanism, we provide detailed results on a real-world taxi dataset and also a generic simulator for aggregation systems. We show that our approach reduces the variance in revenues earned by taxi drivers, while still providing higher joint revenues than leading approaches.

Scalable Multi-Agent Reinforcement Learning for Warehouse Logistics with Robotic and Human Co-Workers

Authors:Aleksandar Krnjaic, Raul D. Steleac, Jonathan D. Thomas, Georgios Papoudakis, Lukas Schäfer, Andrew Wing Keung To, Kuan-Ho Lao, Murat Cubuktepe, Matthew Haley, Peter Börsting, Stefano V. Albrecht
Date:2022-12-22 06:18:41

We consider a warehouse in which dozens of mobile robots and human pickers work together to collect and deliver items within the warehouse. The fundamental problem we tackle, called the order-picking problem, is how these worker agents must coordinate their movement and actions in the warehouse to maximise performance in this task. Established industry methods using heuristic approaches require large engineering efforts to optimise for innately variable warehouse configurations. In contrast, multi-agent reinforcement learning (MARL) can be flexibly applied to diverse warehouse configurations (e.g. size, layout, number/types of workers, item replenishment frequency), and different types of order-picking paradigms (e.g. Goods-to-Person and Person-to-Goods), as the agents can learn how to cooperate optimally through experience. We develop hierarchical MARL algorithms in which a manager agent assigns goals to worker agents, and the policies of the manager and workers are co-trained toward maximising a global objective (e.g. pick rate). Our hierarchical algorithms achieve significant gains in sample efficiency over baseline MARL algorithms and overall pick rates over multiple established industry heuristics in a diverse set of warehouse configurations and different order-picking paradigms.

AdverSAR: Adversarial Search and Rescue via Multi-Agent Reinforcement Learning

Authors:Aowabin Rahman, Arnab Bhattacharya, Thiagarajan Ramachandran, Sayak Mukherjee, Himanshu Sharma, Ted Fujimoto, Samrat Chatterjee
Date:2022-12-20 08:13:29

Search and Rescue (SAR) missions in remote environments often employ autonomous multi-robot systems that learn, plan, and execute a combination of local single-robot control actions, group primitives, and global mission-oriented coordination and collaboration. Often, SAR coordination strategies are manually designed by human experts who can remotely control the multi-robot system and enable semi-autonomous operations. However, in remote environments where connectivity is limited and human intervention is often not possible, decentralized collaboration strategies are needed for fully-autonomous operations. Nevertheless, decentralized coordination may be ineffective in adversarial environments due to sensor noise, actuation faults, or manipulation of inter-agent communication data. In this paper, we propose an algorithmic approach based on adversarial multi-agent reinforcement learning (MARL) that allows robots to efficiently coordinate their strategies in the presence of adversarial inter-agent communications. In our setup, the objective of the multi-robot team is to discover targets strategically in an obstacle-strewn geographical area by minimizing the average time needed to find the targets. It is assumed that the robots have no prior knowledge of the target locations, and they can interact with only a subset of neighboring robots at any time. Based on the centralized training with decentralized execution (CTDE) paradigm in MARL, we utilize a hierarchical meta-learning framework to learn dynamic team-coordination modalities and discover emergent team behavior under complex cooperative-competitive scenarios. The effectiveness of our approach is demonstrated on a collection of prototype grid-world environments with different specifications of benign and adversarial agents, target locations, and agent rewards.

Multi-Agent Reinforcement Learning with Shared Resources for Inventory Management

Authors:Yuandong Ding, Mingxiao Feng, Guozi Liu, Wei Jiang, Chuheng Zhang, Li Zhao, Lei Song, Houqiang Li, Yan Jin, Jiang Bian
Date:2022-12-15 09:35:54

In this paper, we consider the inventory management (IM) problem, where we need to make replenishment decisions for a large number of stock keeping units (SKUs) to balance their supply and demand. In our setting, the constraint on shared resources (such as inventory capacity) couples the otherwise independent control of each SKU. We formulate the problem with this structure as a Shared-Resource Stochastic Game (SRSG) and propose an efficient algorithm called Context-aware Decentralized PPO (CD-PPO). Through extensive experiments, we demonstrate that CD-PPO can accelerate the learning procedure compared with standard MARL algorithms.

SMACv2: An Improved Benchmark for Cooperative Multi-Agent Reinforcement Learning

Authors:Benjamin Ellis, Jonathan Cook, Skander Moalla, Mikayel Samvelyan, Mingfei Sun, Anuj Mahajan, Jakob N. Foerster, Shimon Whiteson
Date:2022-12-14 20:15:19

The availability of challenging benchmarks has played a key role in the recent progress of machine learning. In cooperative multi-agent reinforcement learning, the StarCraft Multi-Agent Challenge (SMAC) has become a popular testbed for centralised training with decentralised execution. However, after years of sustained improvement on SMAC, algorithms now achieve near-perfect performance. In this work, we conduct new analysis demonstrating that SMAC lacks the stochasticity and partial observability to require complex *closed-loop* policies. In particular, we show that an *open-loop* policy conditioned only on the timestep can achieve non-trivial win rates for many SMAC scenarios. To address this limitation, we introduce SMACv2, a new version of the benchmark where scenarios are procedurally generated and require agents to generalise to previously unseen settings (from the same distribution) during evaluation. We also introduce the extended partial observability challenge (EPO), which augments SMACv2 to ensure meaningful partial observability. We show that these changes ensure the benchmark requires the use of *closed-loop* policies. We evaluate state-of-the-art algorithms on SMACv2 and show that it presents significant challenges not present in the original benchmark. Our analysis illustrates that SMACv2 addresses the discovered deficiencies of SMAC and can help benchmark the next generation of MARL methods. Videos of training are available at https://sites.google.com/view/smacv2.
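
The open-loop diagnostic is easy to state in code: a policy that conditions on the timestep alone. The sketch below is an illustrative stand-in (in the paper's analysis the per-timestep action distributions would come from trained behaviour, not random draws); if such a policy attains non-trivial win rates, the benchmark is not exercising closed-loop control.

```python
import numpy as np

class OpenLoopPolicy:
    """A policy that ignores observations entirely and conditions only on
    the timestep, used purely as a benchmark diagnostic."""

    def __init__(self, horizon, n_actions, seed=0):
        self.rng = np.random.default_rng(seed)
        # One fixed action distribution per timestep.
        self.table = self.rng.dirichlet(np.ones(n_actions), size=horizon)

    def act(self, t, observation=None):  # observation deliberately unused
        return int(self.rng.choice(len(self.table[t]), p=self.table[t]))
```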

Hierarchical Strategies for Cooperative Multi-Agent Reinforcement Learning

Authors:Majd Ibrahim, Ammar Fayad
Date:2022-12-14 18:27:58

Adequate strategizing of agents' behaviors is essential to solving cooperative MARL problems. One intuitively beneficial yet uncommon method in this domain is predicting agents' future behaviors and planning accordingly. Leveraging this point, we propose a two-level hierarchical architecture that combines a novel information-theoretic objective with a trajectory prediction model to learn a strategy. To this end, we introduce a latent policy that learns two types of latent strategies: individual $z_A$ and relational $z_R$, using a modified Graph Attention Network module to extract interaction features. We encourage each agent to behave according to the strategy by conditioning its local $Q$ functions on $z_A$, and we further equip agents with a shared $Q$ function that conditions on $z_R$. Additionally, we introduce two regularizers to allow predicted trajectories to be accurate and rewarding. Empirical results on Google Research Football (GRF) and StarCraft (SC) II micromanagement tasks show that our method establishes a new state of the art as, to the best of our knowledge, the first MARL algorithm to solve all super hard SC II scenarios as well as the GRF full game with a win rate higher than $95\%$, thus outperforming all existing methods. Videos and a brief overview of the methods and results are available at: https://sites.google.com/view/hier-strats-marl/home.

Enabling the Wireless Metaverse via Semantic Multiverse Communication

Authors:Jihong Park, Jinho Choi, Seong-Lyun Kim, Mehdi Bennis
Date:2022-12-13 21:21:07

Metaverse over wireless networks is an emerging use case of the sixth generation (6G) wireless systems, posing unprecedented challenges in terms of its multi-modal data transmissions with stringent latency and reliability requirements. Towards enabling this wireless metaverse, in this article we propose a novel semantic communication (SC) framework by decomposing the metaverse into human/machine agent-specific semantic multiverses (SMs). An SM stored at each agent comprises a semantic encoder and a generator, leveraging recent advances in generative artificial intelligence (AI). To improve communication efficiency, the encoder learns the semantic representations (SRs) of multi-modal data, while the generator learns how to manipulate them for locally rendering scenes and interactions in the metaverse. Since these learned SMs are biased towards local environments, their success hinges on synchronizing heterogeneous SMs in the background while communicating SRs in the foreground, turning the wireless metaverse problem into the problem of semantic multiverse communication (SMC). Based on this SMC architecture, we propose several promising algorithmic and analytic tools for modeling and designing SMC, ranging from distributed learning and multi-agent reinforcement learning (MARL) to signaling games and symbolic AI.

Scalable and Sample Efficient Distributed Policy Gradient Algorithms in Multi-Agent Networked Systems

Authors:Xin Liu, Honghao Wei, Lei Ying
Date:2022-12-13 03:44:00

This paper studies a class of multi-agent reinforcement learning (MARL) problems where the reward that an agent receives depends on the states of other agents, but the next state depends only on the agent's own current state and action. We name this setting REC-MARL, standing for REward-Coupled Multi-Agent Reinforcement Learning. REC-MARL has a range of important applications such as real-time access control and distributed power control in wireless networks. This paper presents a distributed policy gradient algorithm for REC-MARL. The proposed algorithm is distributed in two aspects: (i) the learned policy is a distributed policy that maps a local state of an agent to its local action, and (ii) the learning/training is distributed, during which each agent updates its policy based on its own and its neighbors' information. The algorithm converges to a stationary policy, and its iteration complexity bounds depend on the dimensions of local states and actions. Experimental results for real-time access control and power control in wireless networks show that our policy significantly outperforms state-of-the-art algorithms and well-known benchmarks.

CURO: Curriculum Learning for Relative Overgeneralization

Authors:Lin Shi, Qiyuan Liu, Bei Peng
Date:2022-12-06 03:41:08

Relative overgeneralization (RO) is a pathology that can arise in cooperative multi-agent tasks when the optimal joint action's utility falls below that of a sub-optimal joint action. RO can cause the agents to get stuck in local optima or fail to solve cooperative tasks that require significant coordination between agents within a given timestep. In this work, we empirically find that, in multi-agent reinforcement learning (MARL), both value-based and policy gradient MARL algorithms can suffer from RO and fail to learn effective coordination policies. To better overcome RO, we propose a novel approach called curriculum learning for relative overgeneralization (CURO). To solve a target task that exhibits strong RO, in CURO, we first fine-tune the reward function of the target task to generate source tasks on which to train the agent. Then, to effectively transfer the knowledge acquired in one task to the next, we use a transfer learning method that combines value function transfer with buffer transfer, which enables more efficient exploration in the target task. CURO is general and can be applied to both value-based and policy gradient MARL methods. We demonstrate that, when applied to QMIX, HAPPO, and HATRPO, CURO can successfully overcome severe RO, achieve improved performance, and outperform baseline methods in a variety of challenging cooperative multi-agent tasks.

What is the Solution for State-Adversarial Multi-Agent Reinforcement Learning?

Authors:Songyang Han, Sanbao Su, Sihong He, Shuo Han, Haizhao Yang, Shaofeng Zou, Fei Miao
Date:2022-12-06 01:57:33

Various methods for Multi-Agent Reinforcement Learning (MARL) have been developed with the assumption that agents' policies are based on accurate state information. However, policies learned through Deep Reinforcement Learning (DRL) are susceptible to adversarial state perturbation attacks. In this work, we propose a State-Adversarial Markov Game (SAMG) and make the first attempt to investigate different solution concepts of MARL under state uncertainties. Our analysis shows that the commonly used solution concepts of optimal agent policy and robust Nash equilibrium do not always exist in SAMGs. To circumvent this difficulty, we consider a new solution concept called robust agent policy, where agents aim to maximize the worst-case expected state value. We prove the existence of robust agent policy for finite state and finite action SAMGs. Additionally, we propose a Robust Multi-Agent Adversarial Actor-Critic (RMA3C) algorithm to learn robust policies for MARL agents under state uncertainties. Our experiments demonstrate that our algorithm outperforms existing methods when faced with state perturbations and greatly improves the robustness of MARL policies. Our code is public on https://songyanghan.github.io/what_is_solution/.

E-MAPP: Efficient Multi-Agent Reinforcement Learning with Parallel Program Guidance

Authors:Can Chang, Ni Mu, Jiajun Wu, Ling Pan, Huazhe Xu
Date:2022-12-05 07:02:05

A critical challenge in multi-agent reinforcement learning (MARL) is for multiple agents to efficiently accomplish complex, long-horizon tasks. The agents often have difficulties in cooperating on common goals, dividing complex tasks, and planning through several stages to make progress. We propose to address these challenges by guiding agents with programs designed for parallelization, since programs as a representation contain rich structural and semantic information, and are widely used as abstractions for long-horizon tasks. Specifically, we introduce Efficient Multi-Agent Reinforcement Learning with Parallel Program Guidance (E-MAPP), a novel framework that leverages parallel programs to guide multiple agents to efficiently accomplish goals that require planning over $10+$ stages. E-MAPP integrates the structural information from a parallel program, promotes cooperative behaviors grounded in program semantics, and improves time efficiency via a task allocator. We conduct extensive experiments on a series of challenging, long-horizon cooperative tasks in the Overcooked environment. Results show that E-MAPP outperforms strong baselines in terms of completion rate, time efficiency, and zero-shot generalization ability by a large margin.

DACOM: Learning Delay-Aware Communication for Multi-Agent Reinforcement Learning

Authors:Tingting Yuan, Hwei-Ming Chung, Jie Yuan, Xiaoming Fu
Date:2022-12-03 14:20:59

Communication is supposed to improve multi-agent collaboration and overall performance in cooperative multi-agent reinforcement learning (MARL). However, such improvements are often limited in practice since most existing communication schemes ignore communication overheads (e.g., communication delays). In this paper, we demonstrate that ignoring communication delays has detrimental effects on collaboration, especially in delay-sensitive tasks such as autonomous driving. To mitigate this impact, we design a delay-aware multi-agent communication model (DACOM) to adapt communication to delays. Specifically, DACOM introduces a component, TimeNet, that is responsible for adjusting the waiting time of an agent to receive messages from other agents, such that the uncertainty associated with delay can be addressed. Our experiments reveal that DACOM achieves a non-negligible performance improvement over other mechanisms by making a better trade-off between the benefits of communication and the costs of waiting for messages.

Multi-Agent Reinforcement Learning with Reward Delays

Authors:Yuyang Zhang, Runyu Zhang, Yuantao Gu, Na Li
Date:2022-12-02 20:50:48

This paper considers multi-agent reinforcement learning (MARL) where the rewards are received after delays and the delay time varies across agents and across time steps. Based on the V-learning framework, this paper proposes MARL algorithms that efficiently deal with reward delays. When the delays are finite, our algorithm reaches a coarse correlated equilibrium (CCE) with rate $\tilde{\mathcal{O}}(\frac{H^3\sqrt{S\mathcal{T}_K}}{K}+\frac{H^3\sqrt{SA}}{\sqrt{K}})$, where $K$ is the number of episodes, $H$ is the planning horizon, $S$ is the size of the state space, $A$ is the size of the largest action space, and $\mathcal{T}_K$ is the measure of total delay formally defined in the paper. Moreover, our algorithm is extended to cases with infinite delays through a reward-skipping scheme. It achieves a convergence rate similar to that of the finite-delay case.

A Bayesian Framework for Digital Twin-Based Control, Monitoring, and Data Collection in Wireless Systems

Authors:Clement Ruah, Osvaldo Simeone, Bashir Al-Hashimi
Date:2022-12-02 18:13:48

Commonly adopted in the manufacturing and aerospace sectors, digital twin (DT) platforms are increasingly seen as a promising paradigm to control, monitor, and analyze software-based, "open", communication systems. Notably, DT platforms provide a sandbox in which to test artificial intelligence (AI) solutions for communication systems, potentially reducing the need to collect data and test algorithms in the field, i.e., on the physical twin (PT). A key challenge in the deployment of DT systems is to ensure that virtual control optimization, monitoring, and analysis at the DT are safe and reliable, avoiding incorrect decisions caused by "model exploitation". To address this challenge, this paper presents a general Bayesian framework with the aim of quantifying and accounting for model uncertainty at the DT that is caused by limitations in the amount and quality of data available at the DT from the PT. In the proposed framework, the DT builds a Bayesian model of the communication system, which is leveraged to enable core DT functionalities such as control via multi-agent reinforcement learning (MARL), monitoring of the PT for anomaly detection, prediction, data-collection optimization, and counterfactual analysis. To exemplify the application of the proposed framework, we specifically investigate a case-study system encompassing multiple sensing devices that report to a common receiver. Experimental results validate the effectiveness of the proposed Bayesian framework as compared to standard frequentist model-based solutions.

Global Convergence of Localized Policy Iteration in Networked Multi-Agent Reinforcement Learning

Authors:Yizhou Zhang, Guannan Qu, Pan Xu, Yiheng Lin, Zaiwei Chen, Adam Wierman
Date:2022-11-30 15:58:00

We study a multi-agent reinforcement learning (MARL) problem where the agents interact over a given network. The goal of the agents is to cooperatively maximize the average of their entropy-regularized long-term rewards. To overcome the curse of dimensionality and to reduce communication, we propose a Localized Policy Iteration (LPI) algorithm that provably learns a near-globally-optimal policy using only local information. In particular, we show that, despite restricting each agent's attention to only its $\kappa$-hop neighborhood, the agents are able to learn a policy with an optimality gap that decays polynomially in $\kappa$. In addition, we show the finite-sample convergence of LPI to the global optimal policy, which explicitly captures the trade-off between optimality and computational complexity in choosing $\kappa$. Numerical simulations demonstrate the effectiveness of LPI.
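The κ-hop restriction at the heart of LPI can be illustrated in a few lines of code. This toy sketch (graph and states invented) shows how an agent's local view grows with κ:

```python
# Toy sketch of the kappa-hop restriction: each agent observes only the
# states of agents within kappa hops on the interaction graph.
import numpy as np

adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])           # path graph on 4 agents
states = np.array([10.0, 20.0, 30.0, 40.0])

def kappa_hop_view(i, kappa):
    """Return the agents within kappa hops of agent i (simple BFS)."""
    reach = {i}
    frontier = {i}
    for _ in range(kappa):
        frontier = {j for u in frontier for j in np.nonzero(adj[u])[0]} - reach
        reach |= frontier
    return sorted(reach)

for kappa in (1, 2):
    view = kappa_hop_view(0, kappa)
    print(kappa, view, states[view])     # larger kappa -> richer local view
```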

Multi-agent reinforcement learning for wall modeling in LES of flow over periodic hills

Authors:Di Zhou, Michael P. Whitmore, Kevin P. Griffin, H. Jane Bae
Date:2022-11-29 17:57:36

We develop a wall model for large-eddy simulation (LES) that takes into account various pressure-gradient effects using multi-agent reinforcement learning (MARL). The model is trained using low-Reynolds-number flow over periodic hills with agents distributed on the wall along the computational grid points. The model utilizes a wall eddy-viscosity formulation as the boundary condition, which is shown to provide better predictions of the mean velocity field than the typical wall-shear stress formulation. Each agent receives states based on local instantaneous flow quantities at an off-wall location, computes a reward based on the estimated wall-shear stress, and provides an action to update the wall eddy viscosity at each time step. The trained wall model is validated in wall-modeled LES (WMLES) of flow over periodic hills at higher Reynolds numbers, and the results show the effectiveness of the model on flows with pressure gradients. The analysis of the trained model indicates that it is capable of distinguishing between the various pressure-gradient regimes present in the flow.

Multi-Agent Reinforcement Learning for Microprocessor Design Space Exploration

Authors:Srivatsan Krishnan, Natasha Jaques, Shayegan Omidshafiei, Dan Zhang, Izzeddin Gur, Vijay Janapa Reddi, Aleksandra Faust
Date:2022-11-29 17:10:24

Microprocessor architects are increasingly resorting to domain-specific customization in the quest for high performance and energy efficiency. As systems grow in complexity, fine-tuning architectural parameters across multiple sub-systems (e.g., datapath, memory blocks in different hierarchies, interconnects, compiler optimization, etc.) quickly results in a combinatorial explosion of the design space. This makes domain-specific customization an extremely challenging task. Prior work explores using reinforcement learning (RL) and other optimization methods to automatically explore the large design space. However, these methods have traditionally relied on single-agent RL/ML formulations. It is unclear how scalable single-agent formulations are as the complexity of the design space increases (e.g., full-stack System-on-Chip design). Therefore, we propose an alternative formulation that leverages Multi-Agent RL (MARL) to tackle this problem. The key idea behind using MARL is the observation that parameters across different sub-systems are more or less independent, thus allowing a decentralized role to be assigned to each agent. We test this hypothesis by designing a domain-specific DRAM memory controller for several workload traces. Our evaluation shows that the MARL formulation consistently outperforms single-agent RL baselines such as Proximal Policy Optimization and Soft Actor-Critic over different target objectives such as low power and latency. To this end, this work opens the pathway for new and promising research in MARL solutions for hardware architecture search.

Interpreting Primal-Dual Algorithms for Constrained Multiagent Reinforcement Learning

Authors:Daniel Tabas, Ahmed S. Zamzam, Baosen Zhang
Date:2022-11-29 10:23:26

Constrained multiagent reinforcement learning (C-MARL) is gaining importance as MARL algorithms find new applications in real-world systems ranging from energy systems to drone swarms. Most C-MARL algorithms use a primal-dual approach to enforce constraints through a penalty function added to the reward. In this paper, we study the structural effects of this penalty term on the MARL problem. First, we show that the standard practice of using the constraint function as the penalty leads to a weak notion of safety. However, by making simple modifications to the penalty term, we can enforce meaningful probabilistic (chance and conditional value at risk) constraints. Second, we quantify the effect of the penalty term on the value function, uncovering an improved value estimation procedure. We use these insights to propose a constrained multiagent advantage actor critic (C-MAA2C) algorithm. Simulations in a simple constrained multiagent environment affirm that our reinterpretation of the primal-dual method in terms of probabilistic constraints is effective, and that our proposed value estimate accelerates convergence to a safe joint policy.
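To make the penalty mechanism concrete, the sketch below shows a generic primal-dual loop in which a Lagrange multiplier scales a violation indicator added to the reward and is updated by dual ascent. The environment, threshold, and indicator form are toy stand-ins, not the paper's C-MAA2C:

```python
# Hedged sketch of a primal-dual constraint loop: using an indicator of
# violation (rather than the raw cost) as the penalty yields a chance-style
# constraint P(violation) <= limit, in the spirit of the paper's analysis.
import numpy as np

rng = np.random.default_rng(1)
lam, lr_dual, limit = 0.0, 0.05, 0.2   # multiplier, dual step, violation budget
for epoch in range(50):
    # Pretend rollout: per-step costs in [0, 1]; violations are thresholded.
    costs = rng.random(128)
    violations = (costs > 0.8).astype(float)
    penalized_reward = 1.0 - lam * violations          # shaping seen by MARL
    # Dual ascent: grow the multiplier while the chance constraint is violated.
    lam = max(0.0, lam + lr_dual * (violations.mean() - limit))
print(f"final multiplier: {lam:.3f}")
```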

ACE: Cooperative Multi-agent Q-learning with Bidirectional Action-Dependency

Authors:Chuming Li, Jie Liu, Yinmin Zhang, Yuhong Wei, Yazhe Niu, Yaodong Yang, Yu Liu, Wanli Ouyang
Date:2022-11-29 10:22:55

Multi-agent reinforcement learning (MARL) suffers from the non-stationarity problem, i.e., the ever-changing targets at every iteration when multiple agents update their policies at the same time. Starting from first principles, in this paper we manage to solve the non-stationarity problem by proposing bidirectional action-dependent Q-learning (ACE). Central to the development of ACE is the sequential decision-making process wherein only one agent is allowed to take an action at a time. Within this process, each agent maximizes its value function given the actions taken by the preceding agents at the inference stage. In the learning phase, each agent minimizes the TD error that is dependent on how the subsequent agents have reacted to its chosen action. Given the design of bidirectional dependency, ACE effectively turns a multi-agent MDP into a single-agent MDP. We implement the ACE framework by identifying the proper network representation to formulate the action dependency, so that the sequential decision process is computed implicitly in one forward pass. To validate ACE, we compare it with strong baselines on two MARL benchmarks. Empirical experiments demonstrate that ACE outperforms the state-of-the-art algorithms on Google Research Football and the StarCraft Multi-Agent Challenge by a large margin. In particular, on SMAC tasks, ACE achieves a 100% success rate on almost all the hard and super-hard maps. We further study extensive research problems regarding ACE, including extension, generalization, and practicability. Code is made available to facilitate further research.
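The sequential decision process at the core of ACE can be sketched as follows; the network and dimensions below are invented for illustration and implement only the inference-stage idea of conditioning each agent on its predecessors' actions:

```python
# Minimal sketch (invented network, not the ACE implementation): agent i
# picks its action greedily given the actions already chosen by agents < i.
import torch
import torch.nn as nn

n_agents, obs_dim, n_actions = 3, 8, 4

class SeqQ(nn.Module):
    def __init__(self):
        super().__init__()
        # Input: own observation + one-hot actions of all preceding agents.
        self.net = nn.Linear(obs_dim + n_agents * n_actions, n_actions)

    def forward(self, obs, prev_onehot):
        return self.net(torch.cat([obs, prev_onehot], dim=-1))

q = SeqQ()
obs = torch.randn(n_agents, obs_dim)
prev = torch.zeros(n_agents * n_actions)
joint_action = []
for i in range(n_agents):
    a = q(obs[i], prev).argmax().item()
    joint_action.append(a)
    prev[i * n_actions + a] = 1.0       # expose a_i to the next agent
print(joint_action)
```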

Multi-robot Social-aware Cooperative Planning in Pedestrian Environments Using Multi-agent Reinforcement Learning

Authors:Zichen He, Chunwei Song, Lu Dong
Date:2022-11-29 03:38:47

Safe and efficient co-planning of multiple robots in environments with pedestrians is promising for many applications. In this work, we propose a novel social-aware cooperative planner for multiple robots, built on off-policy multi-agent reinforcement learning (MARL) under partial, dimension-varying observations and imperfect perception conditions. We adopt a temporal-spatial graph (TSG)-based social encoder to better extract the importance of the social relations between each robot and the pedestrians in its field of view (FOV). We also introduce a K-step lookahead reward setting in the multi-robot RL framework to avoid aggressive, intrusive, short-sighted, and unnatural motion decisions generated by robots. Moreover, we improve the traditional centralized critic network with a multi-head global attention module to better aggregate local observation information among different robots and guide the process of individual policy updates. Finally, multi-group experimental results verify the effectiveness of the proposed cooperative motion planner.

Learning from Good Trajectories in Offline Multi-Agent Reinforcement Learning

Authors:Qi Tian, Kun Kuang, Furui Liu, Baoxiang Wang
Date:2022-11-28 18:11:26

Offline multi-agent reinforcement learning (MARL) aims to learn effective multi-agent policies from pre-collected datasets, which is an important step toward the deployment of multi-agent systems in real-world applications. However, in practice, the individual behavior policies that generate the multi-agent joint trajectories usually perform at different levels, e.g., one agent may follow a random policy while the other agents follow medium-quality policies. In a cooperative game with a global reward, an agent learned by existing offline MARL methods often inherits this random policy, jeopardizing the performance of the entire team. In this paper, we investigate offline MARL with explicit consideration of the diversity of agent-wise trajectories and propose a novel framework called Shared Individual Trajectories (SIT) to address this problem. Specifically, an attention-based reward decomposition network assigns the credit to each agent through a differentiable key-value memory mechanism in an offline manner. These decomposed credits are then used to reconstruct the joint offline datasets into prioritized experience replay with individual trajectories, after which agents can share their good trajectories and conservatively train their policies with a graph attention network (GAT)-based critic. We evaluate our method in both discrete control (i.e., StarCraft II and the multi-agent particle environment) and continuous control (i.e., multi-agent MuJoCo). The results indicate that our method achieves significantly better results on complex and mixed offline multi-agent datasets, especially when the difference in data quality between individual trajectories is large.

Quantile-based MANOVA: A new tool for inferring multivariate data in factorial designs

Authors:Marléne Baumeister, Marc Ditzhaus, Markus Pauly
Date:2022-11-28 16:00:52

Multivariate analysis-of-variance (MANOVA) is a well-established tool to examine multivariate endpoints. While classical approaches depend on restrictive assumptions like normality and homogeneity, there is a recent trend toward more general and flexible procedures. In this paper, we proceed on this path, but do not follow the typical mean-focused perspective. Instead, we consider general quantiles, in particular the median, for a more robust multivariate analysis. The resulting methodology is applicable to all kinds of factorial designs and shown to be asymptotically valid. Our theoretical results are complemented by an extensive simulation study for small and moderate sample sizes. An illustrative data analysis is also presented.

Software Simulation and Visualization of Quantum Multi-Drone Reinforcement Learning

Authors:Chanyoung Park, Jae Pyoung Kim, Won Joon Yun, Soohyun Park, Soyi Jung, Joongheon Kim
Date:2022-11-24 06:08:24

Quantum machine learning (QML) has received a lot of attention owing to its small number of trainable parameters and its training speed; these advances in QML have led to active research on quantum multi-agent reinforcement learning (QMARL). Existing classical multi-agent reinforcement learning (MARL) features non-stationarity and uncertainty. Therefore, this paper presents a simulation software framework for novel QMARL to control autonomous multi-drone systems, i.e., quantum multi-drone reinforcement learning. Our proposed framework accomplishes reasonable reward convergence and service quality performance with fewer trainable parameters. Furthermore, it shows more stable training results. Lastly, our proposed software allows us to analyze the training process and results.

Contrastive Identity-Aware Learning for Multi-Agent Value Decomposition

Authors:Shunyu Liu, Yihe Zhou, Jie Song, Tongya Zheng, Kaixuan Chen, Tongtian Zhu, Zunlei Feng, Mingli Song
Date:2022-11-23 05:18:42

Value Decomposition (VD) aims to deduce the contributions of agents for decentralized policies in the presence of only global rewards, and has recently emerged as a powerful credit assignment paradigm for tackling cooperative Multi-Agent Reinforcement Learning (MARL) problems. One of the main challenges in VD is to promote diverse behaviors among agents, and existing methods directly encourage the diversity of learned agent networks with various strategies. However, we argue that these dedicated designs for agent networks are still limited by the indistinguishable VD network, leading to homogeneous agent behaviors and thus downgrading the cooperation capability. In this paper, we propose a novel Contrastive Identity-Aware learning (CIA) method, explicitly boosting the credit-level distinguishability of the VD network to break the bottleneck of multi-agent diversity. Specifically, our approach leverages contrastive learning to maximize the mutual information between the temporal credits and the identity representations of different agents, encouraging the full expressiveness of credit assignment and, further, the emergence of individualities. The algorithm implementation of the proposed CIA module is simple yet effective and can be readily incorporated into various VD architectures. Experiments on the SMAC benchmarks and across different VD backbones demonstrate that the proposed method yields results superior to the state-of-the-art counterparts. Our code is available at https://github.com/liushunyu/CIA.
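The mutual-information objective can be approximated with a standard InfoNCE-style contrastive loss. The sketch below is illustrative only (random features stand in for learned credit and identity encoders), not the CIA implementation:

```python
# Sketch of a contrastive objective pairing per-agent credits with identity
# embeddings: positives sit on the diagonal of the similarity matrix, so
# cross-entropy over rows is an InfoNCE-style mutual-information surrogate.
import torch
import torch.nn.functional as F

n_agents, d = 5, 16
credits = torch.randn(n_agents, d, requires_grad=True)     # temporal credit features
identities = torch.randn(n_agents, d, requires_grad=True)  # identity representations

logits = credits @ identities.t()        # similarity matrix (n_agents x n_agents)
labels = torch.arange(n_agents)          # credit i should match identity i
loss = F.cross_entropy(logits, labels)   # InfoNCE: diagonal entries are positives
loss.backward()
print(float(loss))
```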

Learning-based social coordination to improve safety and robustness of cooperative autonomous vehicles in mixed traffic

Authors:Rodolfo Valiente, Behrad Toghi, Mahdi Razzaghpour, Ramtin Pedarsani, Yaser P. Fallah
Date:2022-11-22 02:53:45

It is expected that autonomous vehicles (AVs) and heterogeneous human-driven vehicles (HVs) will coexist on the same road. The safety and reliability of AVs will depend on their social awareness and their ability to engage in complex social interactions in a socially accepted manner. However, AVs are still inefficient in terms of cooperating with HVs and struggle to understand and adapt to human behavior, which is particularly challenging in mixed autonomy. On a road shared by AVs and HVs, the social preferences and individual traits of the HVs are unknown to the AVs. Unlike AVs, which are expected to follow a policy, HVs are particularly difficult to forecast since they do not necessarily follow a stationary policy. To address these challenges, we frame the mixed-autonomy problem as a multi-agent reinforcement learning (MARL) problem and propose an approach that allows AVs to learn the decision-making of HVs implicitly from experience, account for all vehicles' interests, and safely adapt to other traffic situations. In contrast with existing works, we quantify AVs' social preferences and propose a distributed reward structure that introduces altruism into their decision-making process, allowing the altruistic AVs to learn to establish coalitions and influence the behavior of HVs.

Credit-cognisant reinforcement learning for multi-agent cooperation

Authors:F. Bredell, H. A. Engelbrecht, J. C. Schoeman
Date:2022-11-18 09:00:25

Traditional multi-agent reinforcement learning (MARL) algorithms, such as independent Q-learning, struggle when presented with partially observable scenarios and where agents are required to develop delicate action sequences. This is often the result of the reward for a good action only becoming available after other agents have taken theirs, and these actions not being credited accordingly. Recurrent neural networks have proven to be a viable solution strategy for solving these types of problems, resulting in significant performance increases compared to other methods. In this paper, we explore a different approach and focus on the experiences used to update the action-value functions of each agent. We introduce the concept of credit-cognisant rewards (CCRs), which allows an agent to perceive the effect its actions had on the environment as well as on its co-agents. We show that by manipulating these experiences and constructing the reward contained within them to include the rewards received by all the agents within the same action sequence, we are able to improve significantly on the performance of independent deep Q-learning as well as deep recurrent Q-learning. We evaluate and test the performance of CCRs when applied to deep reinforcement learning techniques on a simplified version of the popular card game Hanabi.
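As a toy illustration of rebuilding stored rewards, the sketch below sums co-agents' rewards over the remainder of an action sequence with an assumed discounting scheme; it is a loose paraphrase of the CCR idea, not the paper's construction:

```python
# Toy sketch: before replay, rebuild each stored reward to include what
# co-agents received later in the same action sequence. The discounting
# scheme here is an assumption made for illustration.
def credit_cognisant_rewards(per_agent_rewards, gamma=0.95):
    """per_agent_rewards: list over time steps of per-agent reward lists."""
    n_agents = len(per_agent_rewards[0])
    ccr = []
    for t in range(len(per_agent_rewards)):
        # Each agent perceives the discounted team reward produced by the
        # action sequence from its own action onward.
        future = per_agent_rewards[t:]
        team = sum(gamma ** k * sum(r) for k, r in enumerate(future))
        ccr.append([team] * n_agents)
    return ccr

print(credit_cognisant_rewards([[0, 0], [0, 1], [2, 0]]))
```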

Contextual Transformer for Offline Meta Reinforcement Learning

Authors:Runji Lin, Ye Li, Xidong Feng, Zhaowei Zhang, Xian Hong Wu Fung, Haifeng Zhang, Jun Wang, Yali Du, Yaodong Yang
Date:2022-11-15 10:00:14

The pretrain-finetuning paradigm in large-scale sequence models has made significant progress in natural language processing and computer vision tasks. However, such a paradigm is still hindered by several challenges in Reinforcement Learning (RL), including the lack of self-supervised pretraining algorithms based on offline data and efficient fine-tuning/prompt-tuning over unseen downstream tasks. In this work, we explore how prompts can improve sequence modeling-based offline reinforcement learning (offline-RL) algorithms. Firstly, we propose prompt tuning for offline RL, where a context vector sequence is concatenated with the input to guide conditional policy generation. As such, we can pretrain a model on the offline dataset with a self-supervised loss and learn a prompt to guide the policy towards desired actions. Secondly, we extend our framework to Meta-RL settings and propose Contextual Meta Transformer (CMT); CMT leverages the context among different tasks as the prompt to improve generalization on unseen tasks. We conduct extensive experiments across three different offline-RL settings: offline single-agent RL on the D4RL dataset, offline Meta-RL on the MuJoCo benchmark, and offline MARL on the SMAC benchmark. Superior results validate the strong performance and generality of our methods.

Dynamic Collaborative Multi-Agent Reinforcement Learning Communication for Autonomous Drone Reforestation

Authors:Philipp Dominic Siedler
Date:2022-11-14 13:25:22

We approach autonomous drone-based reforestation with a collaborative multi-agent reinforcement learning (MARL) setup. Agents can communicate as part of a dynamically changing network. We explore collaboration and communication on the back of a high-impact problem: forests are the main resource for controlling rising CO2 levels, yet the global forest volume is decreasing at an unprecedented rate. Many areas are too large and hard to traverse to plant new trees. To efficiently cover as much area as possible, we propose a Graph Neural Network (GNN)-based communication mechanism that enables collaboration. Agents can share location information on areas needing reforestation, which increases the viewed area and the number of planted trees. We compare our proposed communication mechanism with a multi-agent baseline without the ability to communicate. Results show how communication enables collaboration and increases collective performance, planting precision, and the risk-taking propensity of individual agents.

Fleet Rebalancing for Expanding Shared e-Mobility Systems: A Multi-agent Deep Reinforcement Learning Approach

Authors:Man Luo, Bowen Du, Wenzhe Zhang, Tianyou Song, Kun Li, Hongming Zhu, Mark Birkin, Hongkai Wen
Date:2022-11-11 11:25:30

The electrification of shared mobility has become popular across the globe. Many cities have new shared e-mobility systems deployed, with coverage continuously expanding from central areas to the city edges. A key challenge in the operation of these systems is fleet rebalancing, i.e., how EVs should be repositioned to better satisfy future demand. This is particularly challenging in the context of expanding systems, because i) the range of the EVs is limited while charging time is typically long, which constrains the viable rebalancing operations; and ii) the EV stations in the system are dynamically changing, i.e., the legitimate targets for rebalancing operations can vary over time. We tackle these challenges by first investigating rich sets of data collected from a real-world shared e-mobility system for one year, analyzing the operation model, usage patterns, and expansion dynamics of this new mobility mode. With the learned knowledge, we design a high-fidelity simulator, which is able to abstract the key operational details of EV sharing at fine granularity. Then we model the rebalancing task for shared e-mobility systems under continuous expansion as a Multi-Agent Reinforcement Learning (MARL) problem, which directly takes the range and charging properties of the EVs into account. We further propose a novel policy optimization approach with action cascading, which is able to cope with the expansion dynamics and solve the formulated MARL problem. We evaluate the proposed approach extensively, and experimental results show that our approach outperforms the state-of-the-art, offering significant performance gains in both satisfied demand and net revenue.

Policy-Based Reinforcement Learning for Assortative Matching in Human Behavior Modeling

Authors:Ou Deng, Qun Jin
Date:2022-11-08 01:19:14

This paper explores human behavior in virtual networked communities, specifically individuals' or groups' potential and expressive capacity to respond to internal and external stimuli, with assortative matching as a typical example. A modeling approach based on Multi-Agent Reinforcement Learning (MARL) is proposed, adding a multi-head attention function to the A3C algorithm to enhance learning effectiveness. This approach simulates human behavior in certain scenarios through various environmental parameter settings and agent action strategies. In our experiment, reinforcement learning is employed to serve specific agents that learn from environment status and competitor behaviors, optimizing strategies to achieve better results. The simulation includes individual and group levels, displaying possible paths to forming competitive advantages. This modeling approach provides a means for further analysis of the evolutionary dynamics of human behavior, communities, and organizations in various socioeconomic issues.

Satellite Navigation and Coordination with Limited Information Sharing

Authors:Sydney Dolan, Siddharth Nayak, Hamsa Balakrishnan
Date:2022-11-07 16:14:31

We explore space traffic management as an application of collision-free navigation in multi-agent systems where vehicles have limited observation and communication ranges. We investigate the effectiveness of transferring a collision-avoidance multi-agent reinforcement learning (MARL) model trained on a ground environment to a space environment. We demonstrate that the transfer-learning model outperforms a model that is trained directly on the space environment. Furthermore, we find that our approach works well even when we consider the perturbations to satellite dynamics caused by the Earth's oblateness. Finally, we show how our methods can be used to evaluate the benefits of information-sharing between satellite operators in order to improve coordination.

EdgeVision: Towards Collaborative Video Analytics on Distributed Edges for Performance Maximization

Authors:Guanyu Gao, Yuqi Dong, Ran Wang, Xin Zhou
Date:2022-11-06 13:02:08

Deep Neural Network (DNN)-based video analytics significantly improves recognition accuracy in computer vision applications. Deploying DNN models at edge nodes, closer to end users, reduces inference delay and minimizes bandwidth costs. However, these resource-constrained edge nodes may experience substantial delays under heavy workloads, leading to imbalanced workload distribution. While previous efforts focused on optimizing hierarchical device-edge-cloud architectures or centralized clusters for video analytics, we propose addressing these challenges through collaborative distributed and autonomous edge nodes. Despite the intricate control involved, we introduce EdgeVision, a Multi-agent Reinforcement Learning (MARL)-based framework for collaborative video analytics on distributed edges. EdgeVision enables edge nodes to autonomously learn policies for video preprocessing, model selection, and request dispatching. Our approach utilizes an actor-critic-based MARL algorithm enhanced with an attention mechanism to learn optimal policies. To validate EdgeVision, we construct a multi-edge testbed and conduct experiments with real-world datasets. Results demonstrate a performance enhancement of 33.6% to 86.4% compared to baseline methods.

Scalable Multi-Agent Reinforcement Learning through Intelligent Information Aggregation

Authors:Siddharth Nayak, Kenneth Choi, Wenqi Ding, Sydney Dolan, Karthik Gopalakrishnan, Hamsa Balakrishnan
Date:2022-11-03 20:02:45

We consider the problem of multi-agent navigation and collision avoidance when observations are limited to the local neighborhood of each agent. We propose InforMARL, a novel architecture for multi-agent reinforcement learning (MARL) which uses local information intelligently to compute paths for all the agents in a decentralized manner. Specifically, InforMARL aggregates information about the local neighborhood of agents for both the actor and the critic using a graph neural network and can be used in conjunction with any standard MARL algorithm. We show that (1) in training, InforMARL has better sample efficiency and performance than baseline approaches, despite using less information, and (2) in testing, it scales well to environments with arbitrary numbers of agents and obstacles. We illustrate these results using four task environments, including one with predetermined goals for each agent, and one in which the agents collectively try to cover all goals. Code available at https://github.com/nsidn98/InforMARL.

Multi-Agent Reinforcement Learning for Adaptive Mesh Refinement

Authors:Jiachen Yang, Ketan Mittal, Tarik Dzanic, Socratis Petrides, Brendan Keith, Brenden Petersen, Daniel Faissol, Robert Anderson
Date:2022-11-02 00:41:32

Adaptive mesh refinement (AMR) is necessary for efficient finite element simulations of complex physical phenomena, as it allocates a limited computational budget based on the need for higher or lower resolution, which varies over space and time. We present a novel formulation of AMR as a fully-cooperative Markov game, in which each element is an independent agent that makes refinement and de-refinement choices based on local information. We design a novel deep multi-agent reinforcement learning (MARL) algorithm called Value Decomposition Graph Network (VDGN), which solves the two core challenges that AMR poses for MARL: posthumous credit assignment due to agent creation and deletion, and unstructured observations due to the diversity of mesh geometries. For the first time, we show that MARL enables anticipatory refinement of regions that will encounter complex features at future times, thereby unlocking entirely new regions of the error-cost objective landscape that are inaccessible by traditional methods based on local error estimators. Comprehensive experiments show that VDGN policies significantly outperform error-threshold-based policies in global error and cost metrics. We show that learned policies generalize to test problems with physical features, mesh geometries, and longer simulation times that were not seen in training. We also extend VDGN with multi-objective optimization capabilities to find the Pareto front of the tradeoff between cost and error.

Agent-Time Attention for Sparse Rewards Multi-Agent Reinforcement Learning

Authors:Jennifer She, Jayesh K. Gupta, Mykel J. Kochenderfer
Date:2022-10-31 17:54:51

Sparse and delayed rewards pose a challenge to single agent reinforcement learning. This challenge is amplified in multi-agent reinforcement learning (MARL) where credit assignment of these rewards needs to happen not only across time, but also across agents. We propose Agent-Time Attention (ATA), a neural network model with auxiliary losses for redistributing sparse and delayed rewards in collaborative MARL. We provide a simple example that demonstrates how providing agents with their own local redistributed rewards and shared global redistributed rewards motivate different policies. We extend several MiniGrid environments, specifically MultiRoom and DoorKey, to the multi-agent sparse delayed rewards setting. We demonstrate that ATA outperforms various baselines on many instances of these environments. Source code of the experiments is available at https://github.com/jshe/agent-time-attention.

LearningGroup: A Real-Time Sparse Training on FPGA via Learnable Weight Grouping for Multi-Agent Reinforcement Learning

Authors:Je Yang, JaeUk Kim, Joo-Young Kim
Date:2022-10-29 15:09:34

Multi-agent reinforcement learning (MARL) is a powerful technology for constructing interactive artificial intelligence systems in various applications such as multi-robot control and self-driving cars. Unlike supervised learning or single-agent reinforcement learning, which actively exploit network pruning, it remains unclear how pruning will work in multi-agent reinforcement learning, with its cooperative and interactive characteristics. In this paper, we present a real-time sparse training acceleration system named LearningGroup, which adopts network pruning in the training of MARL for the first time with an algorithm/architecture co-design approach. We create sparsity using a weight grouping algorithm and propose an on-chip sparse data encoding loop (OSEL) that enables fast encoding with an efficient implementation. Based on the OSEL's encoding format, LearningGroup performs efficient weight compression and computation workload allocation to multiple cores, where each core handles multiple sparse rows of the weight matrix simultaneously with vector processing units. As a result, the LearningGroup system minimizes the cycle time and memory footprint for sparse data generation by up to 5.72x and 6.81x, respectively. Its FPGA accelerator shows 257.40-3629.48 GFLOPS throughput and 7.10-100.12 GFLOPS/W energy efficiency under various conditions in MARL, which is 7.13x higher and 12.43x more energy efficient than an Nvidia Titan RTX GPU, thanks to fully on-chip training and the highly optimized dataflow/data format provided by the FPGA. Most importantly, the accelerator shows a speedup of up to 12.52x for processing sparse data over the dense case, which is the highest among state-of-the-art sparse training accelerators.

Curiosity-Driven Multi-Agent Exploration with Mixed Objectives

Authors:Roben Delos Reyes, Kyunghwan Son, Jinhwan Jung, Wan Ju Kang, Yung Yi
Date:2022-10-29 02:45:38

Intrinsic rewards have been increasingly used to mitigate the sparse reward problem in single-agent reinforcement learning. These intrinsic rewards encourage the agent to look for novel experiences, guiding the agent to explore the environment sufficiently despite the lack of extrinsic rewards. Curiosity-driven exploration is a simple yet efficient approach that quantifies this novelty as the prediction error of the agent's curiosity module, an internal neural network that is trained to predict the agent's next state given its current state and action. We show here, however, that naively using this curiosity-driven approach to guide exploration in sparse reward cooperative multi-agent environments does not consistently lead to improved results. Straightforward multi-agent extensions of curiosity-driven exploration take into consideration either individual or collective novelty only and thus, they do not provide a distinct but collaborative intrinsic reward signal that is essential for learning in cooperative multi-agent tasks. In this work, we propose a curiosity-driven multi-agent exploration method that has the mixed objective of motivating the agents to explore the environment in ways that are individually and collectively novel. First, we develop a two-headed curiosity module that is trained to predict the corresponding agent's next observation in the first head and the next joint observation in the second head. Second, we design the intrinsic reward formula to be the sum of the individual and joint prediction errors of this curiosity module. We empirically show that the combination of our curiosity module architecture and intrinsic reward formulation guides multi-agent exploration more efficiently than baseline approaches, thereby providing the best performance boost to MARL algorithms in cooperative navigation environments with sparse rewards.
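The two-headed curiosity module and the mixed intrinsic reward translate naturally into code. This minimal PyTorch sketch uses invented sizes and a simplified trunk; it is not the authors' architecture:

```python
# Minimal sketch of a mixed-objective curiosity module: one head predicts the
# agent's own next observation, the other the next joint observation; the
# intrinsic reward is the sum of both prediction errors.
import torch
import torch.nn as nn

obs_dim, act_dim, n_agents = 8, 2, 3

class TwoHeadCuriosity(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU())
        self.own_head = nn.Linear(64, obs_dim)               # individual novelty
        self.joint_head = nn.Linear(64, obs_dim * n_agents)  # collective novelty

    def intrinsic_reward(self, obs, act, next_obs, next_joint_obs):
        h = self.trunk(torch.cat([obs, act], dim=-1))
        e_own = (self.own_head(h) - next_obs).pow(2).mean(-1)
        e_joint = (self.joint_head(h) - next_joint_obs).pow(2).mean(-1)
        return e_own + e_joint                                # mixed objective

mod = TwoHeadCuriosity()
r = mod.intrinsic_reward(torch.randn(4, obs_dim), torch.randn(4, act_dim),
                         torch.randn(4, obs_dim), torch.randn(4, obs_dim * n_agents))
print(r.shape)  # torch.Size([4])
```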

Classifying Ambiguous Identities in Hidden-Role Stochastic Games with Multi-Agent Reinforcement Learning

Authors:Shijie Han, Siyuan Li, Bo An, Wei Zhao, Peng Liu
Date:2022-10-24 00:54:59

Multi-agent reinforcement learning (MARL) is a prevalent learning paradigm for solving stochastic games. In most MARL studies, agents in a game are defined as teammates or enemies beforehand, and the relationships among the agents remain fixed throughout the game. However, in real-world problems, the agent relationships are commonly unknown in advance or dynamically changing. Many multi-party interactions start off by asking: who is on my team? This question arises whether it is the first day at the stock exchange or the kindergarten. Therefore, training policies for such situations in the face of imperfect information and ambiguous identities is an important problem that needs to be addressed. In this work, we develop a novel identity detection reinforcement learning (IDRL) framework that allows an agent to dynamically infer the identities of nearby agents and select an appropriate policy to accomplish the task. In the IDRL framework, a relation network is constructed to deduce the identities of other agents by observing the behaviors of the agents. A danger network is optimized to estimate the risk of false-positive identifications. Beyond that, we propose an intrinsic reward that balances the need to maximize external rewards and accurate identification. After identifying the cooperation-competition pattern among the agents, IDRL applies one of the off-the-shelf MARL methods to learn the policy. To evaluate the proposed method, we conduct experiments on Red-10 card-shedding game, and the results show that IDRL achieves superior performance over other state-of-the-art MARL methods. Impressively, the relation network has the par performance to identify the identities of agents with top human players; the danger network reasonably avoids the risk of imperfect identification. The code to reproduce all the reported results is available online at https://github.com/MR-BENjie/IDRL.

Solving Continuous Control via Q-learning

Authors:Tim Seyde, Peter Werner, Wilko Schwarting, Igor Gilitschenski, Martin Riedmiller, Daniela Rus, Markus Wulfmeier
Date:2022-10-22 22:55:50

While there has been substantial success for solving continuous control with actor-critic methods, simpler critic-only methods such as Q-learning find limited application in the associated high-dimensional action spaces. However, most actor-critic methods come at the cost of added complexity: heuristics for stabilisation, compute requirements and wider hyperparameter search spaces. We show that a simple modification of deep Q-learning largely alleviates these issues. By combining bang-bang action discretization with value decomposition, framing single-agent control as cooperative multi-agent reinforcement learning (MARL), this simple critic-only approach matches performance of state-of-the-art continuous actor-critic methods when learning from features or pixels. We extend classical bandit examples from cooperative MARL to provide intuition for how decoupled critics leverage state information to coordinate joint optimization, and demonstrate surprisingly strong performance across a variety of continuous control tasks.
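The combination of bang-bang discretization with value decomposition can be sketched as one Q-head per action dimension whose maxima are summed, as below (an illustrative reading of the idea, not the paper's code):

```python
# Sketch: treat each action dimension as a cooperative "agent" with a
# bang-bang {-1, +1} action set, and decompose the joint value as the sum
# of per-dimension Q-values.
import torch
import torch.nn as nn

state_dim, n_dims = 6, 3    # e.g., a 3-dof continuous control task

class DecoupledBangBangQ(nn.Module):
    def __init__(self):
        super().__init__()
        # One 2-way Q-head per action dimension, sharing a state encoder.
        self.enc = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(64, 2) for _ in range(n_dims)])

    def forward(self, s):
        h = self.enc(s)
        return torch.stack([head(h) for head in self.heads], dim=1)  # (B, n_dims, 2)

q = DecoupledBangBangQ()
qvals = q(torch.randn(5, state_dim))
greedy = qvals.argmax(-1)                   # index 0 -> -1, index 1 -> +1
action = greedy.float() * 2.0 - 1.0         # bang-bang continuous action
joint_value = qvals.max(-1).values.sum(-1)  # value decomposition: sum per-dim maxima
print(action.shape, joint_value.shape)
```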

RPM: Generalizable Behaviors for Multi-Agent Reinforcement Learning

Authors:Wei Qiu, Xiao Ma, Bo An, Svetlana Obraztsova, Shuicheng Yan, Zhongwen Xu
Date:2022-10-18 07:32:43

Despite the recent advancement in multi-agent reinforcement learning (MARL), MARL agents easily overfit the training environment and perform poorly in evaluation scenarios where other agents behave differently. Obtaining generalizable policies for MARL agents is thus necessary but challenging, mainly due to complex multi-agent interactions. In this work, we model the problem with Markov Games and propose a simple yet effective method, ranked policy memory (RPM), to collect diverse multi-agent trajectories for training MARL policies with good generalizability. The main idea of RPM is to maintain a look-up memory of policies. In particular, we try to acquire various levels of behaviors by saving policies and ranking them by training episode return, i.e., the episode return of agents in the training environment; when an episode starts, the learning agent can then choose a policy from the RPM as the behavior policy. This innovative self-play training framework leverages agents' past policies and guarantees the diversity of multi-agent interaction in the training data. We implement RPM on top of MARL algorithms and conduct extensive experiments on Melting Pot. It has been demonstrated that RPM enables MARL agents to interact with unseen agents in multi-agent generalization evaluation scenarios and complete given tasks, and it significantly boosts performance by up to 402% on average.
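A look-up memory ranked by training episode return can be sketched in a few lines; the bucketing rule and sampling scheme below are assumptions for illustration:

```python
# Toy sketch of a ranked policy memory: policy snapshots are bucketed by
# training episode return, and a behavior policy is sampled across buckets
# at episode start to diversify co-player behaviors.
import random

class RankedPolicyMemory:
    def __init__(self, bucket_size=10.0):
        self.bucket_size = bucket_size
        self.memory = {}                          # rank -> policy snapshots

    def save(self, policy_params, episode_return):
        rank = int(episode_return // self.bucket_size)
        self.memory.setdefault(rank, []).append(policy_params)

    def sample(self):
        rank = random.choice(list(self.memory))   # uniform over skill levels
        return random.choice(self.memory[rank])

rpm = RankedPolicyMemory()
for ret in (3.0, 12.0, 27.0, 31.0):
    rpm.save({"ckpt_return": ret}, ret)
print(rpm.sample())
```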

Distributional Reward Estimation for Effective Multi-Agent Deep Reinforcement Learning

Authors:Jifeng Hu, Yanchao Sun, Hechang Chen, Sili Huang, haiyin piao, Yi Chang, Lichao Sun
Date:2022-10-14 08:31:45

Multi-agent reinforcement learning has drawn increasing attention in practice, e.g., robotics and automatic driving, as it can explore optimal policies using samples generated by interacting with the environment. However, high reward uncertainty still remains a problem when we want to train a satisfactory model, because obtaining high-quality reward feedback is usually expensive and even infeasible. To handle this issue, previous methods mainly focus on passive reward correction. At the same time, recent active reward estimation methods have proven to be a recipe for reducing the effect of reward uncertainty. In this paper, we propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL). Our main idea is to design the multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training. Specifically, we design the multi-action-branch reward estimation to model reward distributions on all action branches. Then we utilize reward aggregation to obtain stable updating signals during training. Our intuition is that consideration of all possible consequences of actions could be useful for learning policies. The superiority of the DRE-MARL is demonstrated using benchmark multi-agent scenarios, compared with the SOTA baselines in terms of both effectiveness and robustness.

Multi-agent Dynamic Algorithm Configuration

Authors:Ke Xue, Jiacheng Xu, Lei Yuan, Miqing Li, Chao Qian, Zongzhang Zhang, Yang Yu
Date:2022-10-13 08:39:32

Automated algorithm configuration relieves users from tedious, trial-and-error tuning tasks. A popular algorithm configuration tuning paradigm is dynamic algorithm configuration (DAC), in which an agent learns dynamic configuration policies across instances by reinforcement learning (RL). However, in many complex algorithms, there may exist different types of configuration hyperparameters, and such heterogeneity may bring difficulties for classic DAC which uses a single-agent RL policy. In this paper, we aim to address this issue and propose multi-agent DAC (MA-DAC), with one agent working for one type of configuration hyperparameter. MA-DAC formulates the dynamic configuration of a complex algorithm with multiple types of hyperparameters as a contextual multi-agent Markov decision process and solves it by a cooperative multi-agent RL (MARL) algorithm. To instantiate, we apply MA-DAC to a well-known optimization algorithm for multi-objective optimization problems. Experimental results show the effectiveness of MA-DAC in not only achieving superior performance compared with other configuration tuning approaches based on heuristic rules, multi-armed bandits, and single-agent RL, but also being capable of generalizing to different problem classes. Furthermore, we release the environments in this paper as a benchmark for testing MARL algorithms, with the hope of facilitating the application of MARL.

Centralized Training with Hybrid Execution in Multi-Agent Reinforcement Learning

Authors:Pedro P. Santos, Diogo S. Carvalho, Miguel Vasco, Alberto Sardinha, Pedro A. Santos, Ana Paiva, Francisco S. Melo
Date:2022-10-12 14:58:32

We introduce hybrid execution in multi-agent reinforcement learning (MARL), a new paradigm in which agents aim to successfully complete cooperative tasks with arbitrary communication levels at execution time by taking advantage of information-sharing among the agents. Under hybrid execution, the communication level can range from a setting in which no communication is allowed between agents (fully decentralized), to a setting featuring full communication (fully centralized), but the agents do not know beforehand which communication level they will encounter at execution time. To formalize our setting, we define a new class of multi-agent partially observable Markov decision processes (POMDPs) that we name hybrid-POMDPs, which explicitly model a communication process between the agents. We contribute MARO, an approach that makes use of an auto-regressive predictive model, trained in a centralized manner, to estimate missing agents' observations at execution time. We evaluate MARO on standard scenarios and extensions of previous benchmarks tailored to emphasize the negative impact of partial observability in MARL. Experimental results show that our method consistently outperforms relevant baselines, allowing agents to act with faulty communication while successfully exploiting shared information.
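MARO's key mechanism, predicting missing teammates' observations auto-regressively, can be sketched as follows; the choice of a GRU and all sizes are assumptions, not the paper's exact model:

```python
# Hedged sketch: when a teammate's observation is not communicated, an
# auto-regressive model (a GRU here, as an assumption) fills it in from the
# history of observations that were received.
import torch
import torch.nn as nn

obs_dim, hidden = 8, 32
predictor = nn.GRU(obs_dim, hidden, batch_first=True)
readout = nn.Linear(hidden, obs_dim)

history = torch.randn(1, 5, obs_dim)       # teammate obs received so far
received = False                           # communication dropped this step

out, _ = predictor(history)
predicted_obs = readout(out[:, -1])        # estimate of the missing observation
obs_used = history[:, -1] if received else predicted_obs
print(obs_used.shape)                      # torch.Size([1, 8])
```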

Phantom -- A RL-driven multi-agent framework to model complex systems

Authors:Leo Ardon, Jared Vann, Deepeka Garg, Tom Spooner, Sumitra Ganesh
Date:2022-10-12 08:37:38

Agent-based modelling (ABM) is a computational approach to modelling complex systems by specifying the behaviour of autonomous decision-making components or agents in the system and allowing the system dynamics to emerge from their interactions. Recent advances in the field of multi-agent reinforcement learning (MARL) have made it feasible to study the equilibrium of complex environments where multiple agents learn simultaneously. However, most ABM frameworks are not RL-native, in that they do not offer concepts and interfaces that are compatible with the use of MARL to learn agent behaviours. In this paper, we introduce a new open-source framework, Phantom, to bridge the gap between ABM and MARL. Phantom is an RL-driven framework for agent-based modelling of complex multi-agent systems, including, but not limited to, economic systems and markets. The framework aims to provide the tools to simplify the ABM specification in a MARL-compatible way, including features to encode dynamic partial observability, agent utility functions, heterogeneity in agent preferences or types, and constraints on the order in which agents can act (e.g., Stackelberg games, or more complex turn-taking environments). In this paper, we present these features and their design rationale, and present two new environments leveraging the framework.

Digital Twin-Based Multiple Access Optimization and Monitoring via Model-Driven Bayesian Learning

Authors:Clement Ruah, Osvaldo Simeone, Bashir Al-Hashimi
Date:2022-10-11 16:14:43

Commonly adopted in the manufacturing and aerospace sectors, digital twin (DT) platforms are increasingly seen as a promising paradigm to control and monitor software-based, "open", communication systems, which play the role of the physical twin (PT). In the general framework presented in this work, the DT builds a Bayesian model of the communication system, which is leveraged to enable core DT functionalities such as control via multi-agent reinforcement learning (MARL) and monitoring of the PT for anomaly detection. We specifically investigate the application of the proposed framework to a simple case-study system encompassing multiple sensing devices that report to a common receiver. The Bayesian model trained at the DT has the key advantage of capturing epistemic uncertainty regarding the communication system, e.g., regarding current traffic conditions, which arise from limited PT-to-DT data transfer. Experimental results validate the effectiveness of the proposed Bayesian framework as compared to standard frequentist model-based solutions.

MARLlib: A Scalable and Efficient Multi-agent Reinforcement Learning Library

Authors:Siyi Hu, Yifan Zhong, Minquan Gao, Weixun Wang, Hao Dong, Xiaodan Liang, Zhihui Li, Xiaojun Chang, Yaodong Yang
Date:2022-10-11 03:11:12

A significant challenge facing researchers in the area of multi-agent reinforcement learning (MARL) pertains to the identification of a library that can offer fast and compatible development for multi-agent tasks and algorithm combinations, while obviating the need to consider compatibility issues. In this paper, we present MARLlib, a library designed to address the aforementioned challenge by leveraging three key mechanisms: 1) a standardized multi-agent environment wrapper, 2) an agent-level algorithm implementation, and 3) a flexible policy mapping strategy. By utilizing these mechanisms, MARLlib can effectively disentangle the intertwined nature of the multi-agent task and the learning process of the algorithm, with the ability to automatically alter the training strategy based on the current task's attributes. The MARLlib library's source code is publicly accessible on GitHub: \url{https://github.com/Replicable-MARL/MARLlib}.

Multi-agent Deep Covering Skill Discovery

Authors:Jiayu Chen, Marina Haliem, Tian Lan, Vaneet Aggarwal
Date:2022-10-07 00:40:59

The use of skills (a.k.a. options) can greatly accelerate exploration in reinforcement learning, especially when only sparse reward signals are available. While option discovery methods have been proposed for individual agents, in multi-agent reinforcement learning settings, discovering collaborative options that can coordinate the behavior of multiple agents and encourage them to visit the under-explored regions of their joint state space has not been considered. To fill this gap, we propose Multi-agent Deep Covering Option Discovery, which constructs multi-agent options by minimizing the expected cover time of the multiple agents' joint state space. Also, we propose a novel framework to adopt the multi-agent options in the MARL process. In practice, a multi-agent task can usually be divided into sub-tasks, each of which can be completed by a sub-group of the agents. Therefore, our algorithm framework first leverages an attention mechanism to find collaborative agent sub-groups that would benefit most from coordinated actions. Then, a hierarchical algorithm, namely HA-MSAC, is developed to learn the multi-agent options for each sub-group, first to complete their sub-tasks and then to integrate them through a high-level policy as the solution to the whole task. This hierarchical option construction allows our framework to strike a balance between scalability and effective collaboration among the agents. The evaluation based on multi-agent collaborative tasks shows that the proposed algorithm can effectively capture the agent interactions with the attention mechanism, successfully identify multi-agent options, and significantly outperform prior works using single-agent options or no options, in terms of both faster exploration and higher task rewards.

Spatial-Temporal-Aware Safe Multi-Agent Reinforcement Learning of Connected Autonomous Vehicles in Challenging Scenarios

Authors:Zhili Zhang, Songyang Han, Jiangwei Wang, Fei Miao
Date:2022-10-05 14:39:07

Communication technologies enable coordination among connected and autonomous vehicles (CAVs). However, it remains unclear how to utilize shared information to improve the safety and efficiency of the CAV system in dynamic and complicated driving scenarios. In this work, we propose a framework of constrained multi-agent reinforcement learning (MARL) with a parallel Safety Shield for CAVs in challenging driving scenarios that includes unconnected hazard vehicles. The coordination mechanisms of the proposed MARL include information sharing and cooperative policy learning, with Graph Convolutional Network (GCN)-Transformer as a spatial-temporal encoder that enhances the agent's environment awareness. The Safety Shield module with Control Barrier Functions (CBF)-based safety checking protects the agents from taking unsafe actions. We design a constrained multi-agent advantage actor-critic (CMAA2C) algorithm to train safe and cooperative policies for CAVs. With the experiment deployed in the CARLA simulator, we verify the performance of the safety checking, spatial-temporal encoder, and coordination mechanisms designed in our method by comparative experiments in several challenging scenarios with unconnected hazard vehicles. Results show that our proposed methodology significantly increases system safety and efficiency in challenging scenarios.
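
The Safety Shield idea can be pictured with a minimal one-dimensional sketch, assuming single-integrator dynamics and a distance-based barrier; the paper's actual CBF formulation for CAVs is richer, so treat this purely as an illustration.

```python
# Minimal one-dimensional sketch of a CBF-based safety shield (an
# illustration, not the paper's CAV formulation): position x must stay above
# a safe distance d_safe, with single-integrator dynamics x' = u.

def cbf_shield(x, u_nominal, d_safe=2.0, alpha=1.0):
    """Clip the learned action so h(x) = x - d_safe stays nonnegative.

    The CBF condition h'(x) + alpha * h(x) >= 0 reduces here to
    u >= -alpha * (x - d_safe), so the shield enforces that lower bound.
    """
    return max(u_nominal, -alpha * (x - d_safe))

x, dt = 5.0, 0.1
for _ in range(50):
    u_rl = -3.0                       # aggressive (unsafe) learned action
    x += dt * cbf_shield(x, u_rl)
print(round(x, 3))                    # settles near d_safe = 2.0, never below
```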

Stateful active facilitator: Coordination and Environmental Heterogeneity in Cooperative Multi-Agent Reinforcement Learning

Authors:Dianbo Liu, Vedant Shah, Oussama Boussif, Cristian Meo, Anirudh Goyal, Tianmin Shu, Michael Mozer, Nicolas Heess, Yoshua Bengio
Date:2022-10-04 18:17:01

In cooperative multi-agent reinforcement learning, a team of agents works together to achieve a common goal. Different environments or tasks may require varying degrees of coordination among agents in order to achieve the goal in an optimal way. The nature of coordination will depend on the properties of the environment -- its spatial layout, distribution of obstacles, dynamics, etc. We term this variation of properties within an environment as heterogeneity. Existing literature has not sufficiently addressed the fact that different environments may have different levels of heterogeneity. We formalize the notions of coordination level and heterogeneity level of an environment and present HECOGrid, a suite of multi-agent RL environments that facilitates empirical evaluation of different MARL approaches across different levels of coordination and environmental heterogeneity by providing a quantitative control over coordination and heterogeneity levels of the environment. Further, we propose a Centralized Training Decentralized Execution learning approach called Stateful Active Facilitator (SAF) that enables agents to work efficiently in high-coordination and high-heterogeneity environments through a differentiable and shared knowledge source used during training and dynamic selection from a shared pool of policies. We evaluate SAF and compare its performance against baselines IPPO and MAPPO on HECOGrid. Our results show that SAF consistently outperforms the baselines across different tasks and different heterogeneity and coordination levels. We release the code for HECOGrid as well as all our experiments.

Faster Last-iterate Convergence of Policy Optimization in Zero-Sum Markov Games

Authors:Shicong Cen, Yuejie Chi, Simon S. Du, Lin Xiao
Date:2022-10-03 16:05:43

Multi-Agent Reinforcement Learning (MARL) -- where multiple agents learn to interact in a shared dynamic environment -- permeates across a wide range of critical applications. While there has been substantial progress on understanding the global convergence of policy optimization methods in single-agent RL, the design and analysis of efficient policy optimization algorithms in the MARL setting present significant challenges which, unfortunately, remain inadequately addressed by existing theory. In this paper, we focus on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games, and study equilibrium finding algorithms in both the infinite-horizon discounted setting and the finite-horizon episodic setting. We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method and the value is updated on a slower timescale. We show that, in the full-information tabular setting, the proposed method achieves a finite-time last-iterate linear convergence to the quantal response equilibrium of the regularized problem, which translates to a sublinear last-iterate convergence to the Nash equilibrium by controlling the amount of regularization. Our convergence results improve upon the best known iteration complexities, and lead to a better understanding of policy optimization in competitive Markov games.
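
The core update can be sketched on a matrix game. The snippet below implements a schematic entropy-regularized optimistic multiplicative weights update and checks that the iterates approach a quantal response equilibrium; the step sizes and the exact form of the optimistic correction are common textbook choices, not necessarily those analyzed in the paper.

```python
import numpy as np

# Schematic sketch of entropy-regularized optimistic multiplicative weights
# on a zero-sum matrix game (step sizes and the optimistic correction are
# common textbook choices, not necessarily the paper's).

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))           # row player maximizes x^T A y
tau, eta = 0.1, 0.1                       # entropy regularization, step size

x = np.ones(4) / 4
y = np.ones(4) / 4
gx_prev = A @ y - tau * np.log(x)         # entropy-regularized gradients
gy_prev = -A.T @ x - tau * np.log(y)
for _ in range(5000):
    gx = A @ y - tau * np.log(x)
    gy = -A.T @ x - tau * np.log(y)
    x = softmax(np.log(x) + eta * (2 * gx - gx_prev))   # optimistic MWU step
    y = softmax(np.log(y) + eta * (2 * gy - gy_prev))
    gx_prev, gy_prev = gx, gy

# At the quantal response equilibrium, each player is a softmax best response:
print(np.allclose(x, softmax(A @ y / tau), atol=1e-2))  # True if converged
```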

Pareto Actor-Critic for Equilibrium Selection in Multi-Agent Reinforcement Learning

Authors:Filippos Christianos, Georgios Papoudakis, Stefano V. Albrecht
Date:2022-09-28 18:14:34

This work focuses on equilibrium selection in no-conflict multi-agent games, where we specifically study the problem of selecting a Pareto-optimal Nash equilibrium among several existing equilibria. It has been shown that many state-of-the-art multi-agent reinforcement learning (MARL) algorithms are prone to converging to Pareto-dominated equilibria due to the uncertainty each agent has about the policy of the other agents during training. To address sub-optimal equilibrium selection, we propose Pareto Actor-Critic (Pareto-AC), which is an actor-critic algorithm that utilises a simple property of no-conflict games (a superset of cooperative games): the Pareto-optimal equilibrium in a no-conflict game maximises the returns of all agents and, therefore, is the preferred outcome for all agents. We evaluate Pareto-AC in a diverse set of multi-agent games and show that it converges to higher episodic returns compared to seven state-of-the-art MARL algorithms and that it successfully converges to a Pareto-optimal equilibrium in a range of matrix games. Finally, we propose PACDCG, a graph neural network extension of Pareto-AC, which is shown to efficiently scale in games with a large number of agents.

More Centralized Training, Still Decentralized Execution: Multi-Agent Conditional Policy Factorization

Authors:Jiangxing Wang, Deheng Ye, Zongqing Lu
Date:2022-09-26 13:29:22

In cooperative multi-agent reinforcement learning (MARL), combining value decomposition with actor-critic enables agents to learn stochastic policies, which are more suitable for the partially observable environment. Given the goal of learning local policies that enable decentralized execution, agents are commonly assumed to be independent of each other, even in centralized training. However, such an assumption may prohibit agents from learning the optimal joint policy. To address this problem, we explicitly take the dependency among agents into centralized training. Although this leads to the optimal joint policy, it may not be factorized for decentralized execution. Nevertheless, we theoretically show that from such a joint policy, we can always derive another joint policy that achieves the same optimality but can be factorized for decentralized execution. To this end, we propose multi-agent conditional policy factorization (MACPF), which takes more centralized training but still enables decentralized execution. We empirically verify MACPF in various cooperative MARL tasks and demonstrate that MACPF achieves better performance or faster convergence than baselines. Our code is available at https://github.com/PKU-RL/FOP-DMAC-MACPF.
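
The factorization step can be illustrated in a toy discrete setting, based on our reading of the abstract rather than the paper's exact architecture: during centralized training the second agent's policy conditions on the first agent's action, and an independent policy is then distilled to match the induced marginal for decentralized execution.

```python
import torch
import torch.nn.functional as F

# Toy sketch of the conditional-factorization idea (our illustration, not the
# paper's exact architecture): agent 2's dependent policy is marginalized over
# agent 1's actions, and an independent policy is distilled to match it.

n_states, n_actions = 3, 2
logits1 = torch.randn(n_states, n_actions)                 # pi_1(a1 | s)
logits2_dep = torch.randn(n_states, n_actions, n_actions)  # pi_2(a2 | s, a1)
logits2_ind = torch.zeros(n_states, n_actions, requires_grad=True)

pi1 = F.softmax(logits1, dim=-1)
pi2_dep = F.softmax(logits2_dep, dim=-1)
marginal = torch.einsum("sa,sab->sb", pi1, pi2_dep)        # sum_a1 pi1 * pi2_dep

opt = torch.optim.Adam([logits2_ind], lr=0.1)
for _ in range(300):                                       # distillation by KL
    log_pi_ind = F.log_softmax(logits2_ind, dim=-1)
    loss = F.kl_div(log_pi_ind, marginal, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.allclose(F.softmax(logits2_ind, dim=-1), marginal, atol=1e-3))
```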

Towards a Standardised Performance Evaluation Protocol for Cooperative MARL

Authors:Rihab Gorsane, Omayma Mahjoub, Ruan de Kock, Roland Dubb, Siddarth Singh, Arnu Pretorius
Date:2022-09-21 16:40:03

Multi-agent reinforcement learning (MARL) has emerged as a useful approach to solving decentralised decision-making problems at scale. Research in the field has been growing steadily with many breakthrough algorithms proposed in recent years. In this work, we take a closer look at this rapid development with a focus on evaluation methodologies employed across a large body of research in cooperative MARL. By conducting a detailed meta-analysis of prior work, spanning 75 papers accepted for publication from 2016 to 2022, we bring to light worrying trends that put into question the true rate of progress. We further consider these trends in a wider context and take inspiration from single-agent RL literature on similar issues with recommendations that remain applicable to MARL. Combining these recommendations, with novel insights from our analysis, we propose a standardised performance evaluation protocol for cooperative MARL. We argue that such a standard protocol, if widely adopted, would greatly improve the validity and credibility of future research, make replication and reproducibility easier, as well as improve the ability of the field to accurately gauge the rate of progress over time by being able to make sound comparisons across different works. Finally, we release our meta-analysis data publicly on our project website for future research on evaluation: https://sites.google.com/view/marl-standard-protocol

Macro-Action-Based Multi-Agent/Robot Deep Reinforcement Learning under Partial Observability

Authors:Yuchen Xiao
Date:2022-09-20 21:13:51

The state-of-the-art multi-agent reinforcement learning (MARL) methods have provided promising solutions to a variety of complex problems. Yet, these methods all assume that agents perform synchronized primitive-action executions, and are therefore not genuinely scalable to long-horizon real-world multi-agent/robot tasks that inherently require agents/robots to asynchronously reason about high-level action selection at varying time durations. The Macro-Action Decentralized Partially Observable Markov Decision Process (MacDec-POMDP) is a general formalization for asynchronous decision-making under uncertainty in fully cooperative multi-agent tasks. In this thesis, we first propose a group of value-based RL approaches for MacDec-POMDPs, where agents are allowed to perform asynchronous learning and decision-making with macro-action-value functions in three paradigms: decentralized learning and control, centralized learning and control, and centralized training for decentralized execution (CTDE). Building on the above work, we formulate a set of macro-action-based policy gradient algorithms under the three training paradigms, where agents are allowed to directly optimize their parameterized policies in an asynchronous manner. We evaluate our methods both in simulation and on real robots over a variety of realistic domains. Empirical results demonstrate the superiority of our approaches in large multi-agent problems and validate the effectiveness of our algorithms for learning high-quality and asynchronous solutions with macro-actions.
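
The asynchronous execution pattern that motivates MacDec-POMDPs can be sketched in a few lines: each agent keeps executing its current macro-action until it terminates, so high-level decision points occur at different times for different agents. The macro-action names and durations below are toy placeholders.

```python
import random

# Schematic of asynchronous macro-action execution (hypothetical macro-action
# names and durations): each agent keeps running its current macro-action
# until it terminates, so high-level decision points are unsynchronized.

MACRO_DURATIONS = {"goto_room": 5, "pick_up": 2, "wait": 1}

class AgentState:
    def __init__(self):
        self.macro, self.remaining = None, 0

random.seed(0)
agents = [AgentState() for _ in range(3)]
decisions = []
for t in range(12):
    for i, agent in enumerate(agents):
        if agent.remaining == 0:                     # previous macro terminated
            agent.macro = random.choice(list(MACRO_DURATIONS))
            agent.remaining = MACRO_DURATIONS[agent.macro]
            decisions.append((t, i, agent.macro))    # a macro-level decision
        agent.remaining -= 1
print(decisions[:6])   # agents re-decide at different, unsynchronized steps
```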

Relational Reasoning via Set Transformers: Provable Efficiency and Applications to MARL

Authors:Fengzhuo Zhang, Boyi Liu, Kaixin Wang, Vincent Y. F. Tan, Zhuoran Yang, Zhaoran Wang
Date:2022-09-20 16:42:59

The cooperative multi-agent reinforcement learning (MARL) framework with permutation-invariant agents has achieved tremendous empirical successes in real-world applications. Unfortunately, the theoretical understanding of this MARL problem is lacking due to the curse of many agents and the limited exploration of the relational reasoning in existing works. In this paper, we verify that the transformer implements complex relational reasoning, and we propose and analyze model-free and model-based offline MARL algorithms with the transformer approximators. We prove that the suboptimality gaps of the model-free and model-based algorithms are independent of and logarithmic in the number of agents respectively, which mitigates the curse of many agents. These results are consequences of a novel generalization error bound of the transformer and a novel analysis of the Maximum Likelihood Estimate (MLE) of the system dynamics with the transformer. Our model-based algorithm is the first provably efficient MARL algorithm that explicitly exploits the permutation invariance of the agents. Our improved generalization bound may be of independent interest and is applicable to other regression problems related to the transformer beyond MARL.
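
A minimal stand-in for the permutation-invariant property is self-attention over agent features followed by mean pooling, as sketched below; this is a toy illustration of the property the paper analyzes, not its exact model.

```python
import torch
import torch.nn as nn

# Minimal sketch of a permutation-invariant critic over agent features:
# per-agent embedding, self-attention (no positional encoding), mean pooling.

class SetCritic(nn.Module):
    def __init__(self, feat_dim=8, embed_dim=32, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(feat_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.value = nn.Linear(embed_dim, 1)

    def forward(self, agent_feats):              # [batch, n_agents, feat_dim]
        h = self.embed(agent_feats)
        h, _ = self.attn(h, h, h)                # relational reasoning step
        return self.value(h.mean(dim=1))         # pooling => permutation invariant

critic = SetCritic()
x = torch.randn(2, 5, 8)
perm = x[:, torch.randperm(5)]                   # shuffle the agent axis
print(torch.allclose(critic(x), critic(perm), atol=1e-5))  # True
```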

Sample-Efficient Multi-Agent Reinforcement Learning with Demonstrations for Flocking Control

Authors:Yunbo Qiu, Yuzhu Zhan, Yue Jin, Jian Wang, Xudong Zhang
Date:2022-09-17 15:24:37

Flocking control is a significant problem in multi-agent systems such as multi-agent unmanned aerial vehicles and multi-agent autonomous underwater vehicles, which enhances the cooperativity and safety of agents. In contrast to traditional methods, multi-agent reinforcement learning (MARL) solves the problem of flocking control more flexibly. However, methods based on MARL suffer from sample inefficiency, since they require a huge number of experiences to be collected from interactions between agents and the environment. We propose a novel method Pretraining with Demonstrations for MARL (PwD-MARL), which can utilize non-expert demonstrations collected in advance with traditional methods to pretrain agents. During the process of pretraining, agents learn policies from demonstrations by MARL and behavior cloning simultaneously, and are prevented from overfitting demonstrations. By pretraining with non-expert demonstrations, PwD-MARL improves sample efficiency in the process of online MARL with a warm start. Experiments show that PwD-MARL improves sample efficiency and policy performance in the problem of flocking control, even with bad or few demonstrations.
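
A plausible form of the pretraining objective, based on our reading of the abstract rather than the paper's exact loss, combines a behavior-cloning term on the demonstrations with the agent's usual RL loss:

```python
import torch
import torch.nn.functional as F

# Sketch of a combined pretraining objective in the spirit of PwD-MARL (our
# reading of the abstract, not the paper's exact loss): behavior cloning on
# the non-expert demonstrations plus the agent's usual RL loss.

def pretrain_loss(policy_logits, demo_actions, rl_loss, bc_weight=0.5):
    """policy_logits: [batch, n_actions]; demo_actions: [batch] demo labels."""
    bc_loss = F.cross_entropy(policy_logits, demo_actions)
    return rl_loss + bc_weight * bc_loss

logits = torch.randn(16, 4, requires_grad=True)   # toy policy outputs
demo_a = torch.randint(0, 4, (16,))               # toy demonstration actions
rl = torch.tensor(0.3)                            # placeholder TD loss
pretrain_loss(logits, demo_a, rl).backward()      # gradients reach the policy
```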

Sub-optimal Policy Aided Multi-Agent Reinforcement Learning for Flocking Control

Authors:Yunbo Qiu, Yue Jin, Jian Wang, Xudong Zhang
Date:2022-09-17 15:10:49

Flocking control is a challenging problem, where multiple agents, such as drones or vehicles, need to reach a target position while maintaining the flock and avoiding collisions with obstacles and collisions among agents in the environment. Multi-agent reinforcement learning has achieved promising performance in flocking control. However, methods based on traditional reinforcement learning require a considerable number of interactions between agents and the environment. This paper proposes a sub-optimal policy aided multi-agent reinforcement learning algorithm (SPA-MARL) to boost sample efficiency. SPA-MARL directly leverages a prior policy that can be manually designed or solved with a non-learning method to aid agents in learning, where the performance of the policy can be sub-optimal. SPA-MARL measures the performance gap between the sub-optimal policy and its own policy, and imitates the sub-optimal policy whenever the latter performs better. We leverage SPA-MARL to solve the flocking control problem. A traditional control method based on artificial potential fields is used to generate a sub-optimal policy. Experiments demonstrate that SPA-MARL can speed up the training process and outperform both the MARL baseline and the sub-optimal policy it leverages.

MA2QL: A Minimalist Approach to Fully Decentralized Multi-Agent Reinforcement Learning

Authors:Kefan Su, Siyuan Zhou, Jiechuan Jiang, Chuang Gan, Xiangjun Wang, Zongqing Lu
Date:2022-09-17 04:54:32

Decentralized learning has shown great promise for cooperative multi-agent reinforcement learning (MARL). However, non-stationarity remains a significant challenge in fully decentralized learning. In this paper, we tackle the non-stationarity problem in the simplest and most fundamental way and propose multi-agent alternate Q-learning (MA2QL), where agents take turns updating their Q-functions by Q-learning. MA2QL is a minimalist approach to fully decentralized cooperative MARL but is theoretically grounded. We prove that when each agent guarantees $\varepsilon$-convergence at each turn, their joint policy converges to a Nash equilibrium. In practice, MA2QL only requires minimal changes to independent Q-learning (IQL). We empirically evaluate MA2QL on a variety of cooperative multi-agent tasks. Results show MA2QL consistently outperforms IQL, which verifies the effectiveness of MA2QL, despite such minimal changes.
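
The alternating scheme is simple enough to sketch in tabular form: only one agent updates its Q-function per turn while the others act greedily, so each learner faces a temporarily stationary team. The environment below is a toy placeholder that rewards the two agents for choosing matching actions.

```python
import numpy as np

# Minimal tabular sketch of alternate Q-learning: agents update in turns
# rather than simultaneously. The toy environment rewards matching actions.

n_agents, n_states, n_actions = 2, 4, 3
Q = [np.zeros((n_states, n_actions)) for _ in range(n_agents)]
alpha, gamma, turn_len = 0.1, 0.9, 100

def step(s, joint_a):
    r = float(joint_a[0] == joint_a[1])          # shared reward for matching
    return (s + 1) % n_states, r

rng = np.random.default_rng(0)
s = 0
for turn in range(20):
    learner = turn % n_agents                    # whose turn it is to learn
    for _ in range(turn_len):
        joint_a = [int(np.argmax(Q[i][s])) for i in range(n_agents)]
        joint_a[learner] = int(rng.integers(n_actions))   # learner explores
        s2, r = step(s, joint_a)
        a = joint_a[learner]
        Q[learner][s, a] += alpha * (r + gamma * Q[learner][s2].max()
                                     - Q[learner][s, a])
        s = s2
print([np.argmax(Q[i], axis=1).tolist() for i in range(n_agents)])  # coordinated
```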

A Robust and Constrained Multi-Agent Reinforcement Learning Electric Vehicle Rebalancing Method in AMoD Systems

Authors:Sihong He, Yue Wang, Shuo Han, Shaofeng Zou, Fei Miao
Date:2022-09-17 03:24:10

Electric vehicles (EVs) play critical roles in autonomous mobility-on-demand (AMoD) systems, but their unique charging patterns increase the model uncertainties in AMoD systems (e.g., state transition probabilities). Since there usually exists a mismatch between the training and test/true environments, incorporating model uncertainty into system design is of critical importance in real-world applications. However, model uncertainties have not yet been explicitly considered in EV AMoD system rebalancing in the existing literature, and the coexistence of model uncertainties and constraints that the decision should satisfy makes the problem even more challenging. In this work, we design a robust and constrained multi-agent reinforcement learning (MARL) framework with state transition kernel uncertainty for EV AMoD systems. We then propose a robust and constrained MARL algorithm (ROCOMA) with robust natural policy gradients (RNPG) that trains a robust EV rebalancing policy to balance the supply-demand ratio and the charging utilization rate across the city under model uncertainty. Experiments show that ROCOMA can learn an effective and robust rebalancing policy. It outperforms non-robust MARL methods in the presence of model uncertainties. It increases the system fairness by 19.6% and decreases the rebalancing costs by 75.8%.

Mean-Field Approximation of Cooperative Constrained Multi-Agent Reinforcement Learning (CMARL)

Authors:Washim Uddin Mondal, Vaneet Aggarwal, Satish V. Ukkusuri
Date:2022-09-15 16:33:38

Mean-Field Control (MFC) has recently been proven to be a scalable tool to approximately solve large-scale multi-agent reinforcement learning (MARL) problems. However, these studies are typically limited to the unconstrained cumulative reward maximization framework. In this paper, we show that one can use the MFC approach to approximate the MARL problem even in the presence of constraints. Specifically, we prove that an $N$-agent constrained MARL problem, with the state and action spaces of each individual agent being of sizes $|\mathcal{X}|$ and $|\mathcal{U}|$ respectively, can be approximated by an associated constrained MFC problem with an error, $e\triangleq \mathcal{O}\left([\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}]/\sqrt{N}\right)$. In a special case where the reward, cost, and state transition functions are independent of the action distribution of the population, we prove that the error can be improved to $e=\mathcal{O}(\sqrt{|\mathcal{X}|}/\sqrt{N})$. Also, we provide a Natural Policy Gradient based algorithm and prove that it can solve the constrained MARL problem within an error of $\mathcal{O}(e)$ with a sample complexity of $\mathcal{O}(e^{-6})$.

MIXRTs: Toward Interpretable Multi-Agent Reinforcement Learning via Mixing Recurrent Soft Decision Trees

Authors:Zichuan Liu, Yuanyang Zhu, Zhi Wang, Yang Gao, Chunlin Chen
Date:2022-09-15 11:39:59

While achieving tremendous success in various fields, existing multi-agent reinforcement learning (MARL) with a black-box neural network architecture makes decisions in an opaque manner that hinders humans from understanding the learned knowledge and how input observations influence decisions. Instead, existing interpretable approaches, such as traditional linear models and decision trees, usually suffer from weak expressivity and low accuracy. To address this apparent dichotomy between performance and interpretability, our solution, MIXing Recurrent soft decision Trees (MIXRTs), is a novel interpretable architecture that can represent explicit decision processes via the root-to-leaf path and reflect each agent's contribution to the team. Specifically, we construct a novel soft decision tree to address partial observability by leveraging the advances in recurrent neural networks, and demonstrate which features influence the decision-making process through the tree-based model. Then, based on the value decomposition framework, we linearly assign credit to each agent by explicitly mixing individual action values to estimate the joint action value using only local observations, providing new insights into how agents cooperate to accomplish the task. Theoretical analysis shows that MIXRTs guarantees the structural constraint on additivity and monotonicity in the factorization of joint action values. Evaluations on the challenging Spread and StarCraft II tasks show that MIXRTs achieves competitive performance compared to widely investigated methods and delivers more straightforward explanations of the decision processes. We explore a promising path toward developing learning algorithms with both high performance and interpretability, potentially shedding light on new interpretable paradigms for MARL.
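
The soft-decision-tree building block can be sketched as follows, as a simplified, non-recurrent illustration of the idea: each inner node routes the input with a sigmoid gate, and the output mixes leaf action-value vectors by path probability, so the learned gates remain inspectable.

```python
import torch
import torch.nn as nn

# Toy depth-2 soft decision tree (a simplified, non-recurrent illustration of
# the building block in MIXRTs): sigmoid gates route the input, and the output
# is a path-probability weighted mixture of leaf action-value vectors.

class SoftTree(nn.Module):
    """Depth-2 soft decision tree: 3 inner gates, 4 leaves."""
    def __init__(self, in_dim=6, n_actions=4):
        super().__init__()
        self.gates = nn.Linear(in_dim, 3)             # one linear gate per node
        self.leaves = nn.Parameter(torch.randn(4, n_actions))

    def forward(self, x):                             # x: [batch, in_dim]
        p = torch.sigmoid(self.gates(x))              # p[:, k] = P(go left at node k)
        paths = torch.stack([
            p[:, 0] * p[:, 1],                        # left, left
            p[:, 0] * (1 - p[:, 1]),                  # left, right
            (1 - p[:, 0]) * p[:, 2],                  # right, left
            (1 - p[:, 0]) * (1 - p[:, 2]),            # right, right
        ], dim=1)
        return paths @ self.leaves                    # [batch, n_actions]

tree = SoftTree()
q_values = tree(torch.randn(5, 6))
print(q_values.shape)   # torch.Size([5, 4])
```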

Skip Training for Multi-Agent Reinforcement Learning Controller for Industrial Wave Energy Converters

Authors:Soumyendu Sarkar, Vineet Gundecha, Sahand Ghorbanpour, Alexander Shmakov, Ashwin Ramesh Babu, Alexandre Pichard, Mathieu Cocho
Date:2022-09-13 00:20:31

Recent Wave Energy Converters (WEC) are equipped with multiple legs and generators to maximize energy generation. Traditional controllers have shown limitations in capturing complex wave patterns, yet controllers must maximize energy capture efficiently. This paper introduces a Multi-Agent Reinforcement Learning controller (MARL), which outperforms the traditionally used spring damper controller. Our initial studies show that the complex nature of the problem makes it hard for training to converge. Hence, we propose a novel skip training approach which enables the MARL training to overcome performance saturation and converge to better controllers than default MARL training, boosting power generation. We also present another novel hybrid training initialization (STHTI) approach, in which the individual agents of the MARL controllers are first trained against the baseline Spring Damper (SD) controller individually, and then trained one agent at a time or all together in future iterations to accelerate convergence. We achieved double-digit gains in energy efficiency over the baseline Spring Damper controller with the proposed MARL controllers using the Asynchronous Advantage Actor-Critic (A3C) algorithm.

Graphon Mean-Field Control for Cooperative Multi-Agent Reinforcement Learning

Authors:Yuanquan Hu, Xiaoli Wei, Junji Yan, Hengxi Zhang
Date:2022-09-11 08:00:39

The marriage between mean-field theory and reinforcement learning has shown a great capacity to solve large-scale control problems with homogeneous agents. To break the homogeneity restriction of mean-field theory, a recent interest is to introduce graphon theory to the mean-field paradigm. In this paper, we propose a graphon mean-field control (GMFC) framework to approximate cooperative multi-agent reinforcement learning (MARL) with nonuniform interactions and show that the approximation error is of order $\mathcal{O}(\frac{1}{\sqrt{N}})$, with $N$ the number of agents. By discretizing the graphon index of GMFC, we further introduce a smaller class of GMFC called block GMFC, which is shown to well approximate cooperative MARL. Our empirical studies on several examples demonstrate that our GMFC approach is comparable with the state-of-the-art MARL algorithms while enjoying better scalability.

Joint Caching and Transmission in the Mobile Edge Network: A Multi-Agent Learning Approach

Authors:Qirui Mi, Ning Yang, Haifeng Zhang, Haijun Zhang, Jun Wang
Date:2022-09-09 07:51:10

The joint caching and transmission optimization problem is challenging due to the deep coupling between the two decisions. This paper proposes an iterative distributed multi-agent learning approach to jointly optimize caching and transmission. The goal of this approach is to minimize the total transmission delay of all users. In this iterative approach, each iteration includes caching optimization and transmission optimization. A multi-agent reinforcement learning (MARL)-based caching network is developed to cache popular files, deciding, for example, which files to evict from the cache and which files to store. Based on the cached files of the caching network, the transmission network delivers cached files to users by single transmission (ST) or joint transmission (JT) with a multi-agent Bayesian learning automaton (MABLA) method. Users then access the edge servers with the minimum transmission delay. The experimental results demonstrate the performance of the proposed multi-agent learning approach.

Mean Field Games on Weighted and Directed Graphs via Colored Digraphons

Authors:Christian Fabian, Kai Cui, Heinz Koeppl
Date:2022-09-08 15:45:20

The field of multi-agent reinforcement learning (MARL) has made considerable progress towards controlling challenging multi-agent systems by employing various learning methods. Many of these approaches focus on empirical and algorithmic aspects of the MARL problems and lack a rigorous theoretical foundation. Graphon mean field games (GMFGs) on the other hand provide a scalable and mathematically well-founded approach to learning problems that involve a large number of connected agents. In standard GMFGs, the connections between agents are undirected, unweighted and invariant over time. Our paper introduces colored digraphon mean field games (CDMFGs) which allow for weighted and directed links between agents that are also adaptive over time. Thus, CDMFGs are able to model more complex connections than standard GMFGs. Besides a rigorous theoretical analysis including both existence and convergence guarantees, we provide a learning scheme and illustrate our findings with an epidemics model and a model of the systemic risk in financial markets.

Learning Sparse Graphon Mean Field Games

Authors:Christian Fabian, Kai Cui, Heinz Koeppl
Date:2022-09-08 15:35:42

Although the field of multi-agent reinforcement learning (MARL) has made considerable progress in recent years, solving systems with a large number of agents remains a hard challenge. Graphon mean field games (GMFGs) enable the scalable analysis of MARL problems that are otherwise intractable. By the mathematical structure of graphons, this approach is limited to dense graphs which are insufficient to describe many real-world networks such as power law graphs. Our paper introduces a novel formulation of GMFGs, called LPGMFGs, which leverages the graph theoretical concept of $L^p$ graphons and provides a machine learning tool to efficiently and accurately approximate solutions for sparse network problems. This especially includes power law networks which are empirically observed in various application areas and cannot be captured by standard graphons. We derive theoretical existence and convergence guarantees and give empirical examples that demonstrate the accuracy of our learning approach for systems with many agents. Furthermore, we extend the Online Mirror Descent (OMD) learning algorithm to our setup to accelerate learning speed, empirically show its capabilities, and conduct a theoretical analysis using the novel concept of smoothed step graphons. In general, we provide a scalable, mathematically well-founded machine learning approach to a large class of otherwise intractable problems of great relevance in numerous research fields.

A Survey on Large-Population Systems and Scalable Multi-Agent Reinforcement Learning

Authors:Kai Cui, Anam Tahir, Gizem Ekinci, Ahmed Elshamanhory, Yannick Eich, Mengguang Li, Heinz Koeppl
Date:2022-09-08 14:58:50

The analysis and control of large-population systems is of great interest to diverse areas of research and engineering, ranging from epidemiology over robotic swarms to economics and finance. An increasingly popular and effective approach to realizing sequential decision-making in multi-agent systems is through multi-agent reinforcement learning, as it allows for an automatic and model-free analysis of highly complex systems. However, the key issue of scalability complicates the design of control and reinforcement learning algorithms particularly in systems with large populations of agents. While reinforcement learning has found resounding empirical success in many scenarios with few agents, problems with many agents quickly become intractable and necessitate special consideration. In this survey, we will shed light on current approaches to tractably understanding and analyzing large-population systems, both through multi-agent reinforcement learning and through adjacent areas of research such as mean-field games, collective intelligence, or complex network theory. These classically independent subject areas offer a variety of approaches to understanding or modeling large-population systems, which may be of great use for the formulation of tractable MARL algorithms in the future. Finally, we survey potential areas of application for large-scale control and identify fruitful future applications of learning algorithms in practical systems. We hope that our survey could provide insight and future directions to junior and senior researchers in theoretical and applied sciences alike.

Learning to Deceive in Multi-Agent Hidden Role Games

Authors:Matthew Aitchison, Lyndon Benke, Penny Sweetser
Date:2022-09-04 07:35:23

Deception is prevalent in human social settings. However, studies into the effect of deception on reinforcement learning algorithms have been limited to simplistic settings, restricting their applicability to complex real-world problems. This paper addresses this by introducing a new mixed competitive-cooperative multi-agent reinforcement learning (MARL) environment inspired by popular role-based deception games such as Werewolf, Avalon, and Among Us. The environment's unique challenge lies in the necessity to cooperate with other agents despite not knowing if they are friend or foe. Furthermore, we introduce a model of deception, which we call Bayesian belief manipulation (BBM) and demonstrate its effectiveness at deceiving other agents in this environment while also increasing the deceiving agent's performance.

Taming Multi-Agent Reinforcement Learning with Estimator Variance Reduction

Authors:Taher Jafferjee, Juliusz Ziomek, Tianpei Yang, Zipeng Dai, Jianhong Wang, Matthew Taylor, Kun Shao, Jun Wang, David Mguni
Date:2022-09-02 13:44:00

Centralised training with decentralised execution (CT-DE) serves as the foundation of many leading multi-agent reinforcement learning (MARL) algorithms. Despite its popularity, it suffers from a critical drawback due to its reliance on learning from a single sample of the joint-action at a given state. As agents explore and update their policies during training, these single samples may poorly represent the actual joint-policy of the system of agents leading to high variance gradient estimates that hinder learning. To address this problem, we propose an enhancement tool that accommodates any actor-critic MARL method. Our framework, Performance Enhancing Reinforcement Learning Apparatus (PERLA), introduces a sampling technique of the agents' joint-policy into the critics while the agents train. This leads to TD updates that closely approximate the true expected value under the current joint-policy rather than estimates from a single sample of the joint-action at a given state. This produces low variance and precise estimates of expected returns, minimising the variance in the critic estimators which typically hinders learning. Moreover, as we demonstrate, by eliminating much of the critic variance from the single sampling of the joint policy, PERLA enables CT-DE methods to scale more efficiently with the number of agents. Theoretically, we prove that PERLA reduces variance in value estimates similar to that of decentralised training while maintaining the benefits of centralised training. Empirically, we demonstrate PERLA's superior performance and ability to reduce estimator variance in a range of benchmarks including Multi-agent Mujoco, and StarCraft II Multi-agent Challenge.
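
The variance-reduction idea can be rendered schematically: rather than bootstrapping the critic from the single joint action observed in the environment, average the bootstrap target over several joint actions drawn from the agents' current policies. All components below are toy placeholders, not PERLA's exact estimator.

```python
import torch

# Schematic of the variance-reduction idea (toy placeholders, not PERLA's
# exact estimator): average the critic's bootstrap target over joint actions
# sampled from the agents' current policies.

def sampled_joint_target(reward, next_obs, policies, critic, gamma=0.99, k=16):
    """Monte-Carlo estimate of r + gamma * E_{a ~ joint policy}[Q(s', a)]."""
    targets = []
    for _ in range(k):
        joint_a = torch.stack([pi(next_obs).sample() for pi in policies])
        targets.append(critic(next_obs, joint_a))
    return reward + gamma * torch.stack(targets).mean(dim=0)

# Toy stand-ins for the policies and the centralized critic:
policies = [lambda o: torch.distributions.Categorical(logits=torch.randn(3))
            for _ in range(2)]
critic = lambda o, a: (o.sum() + a.float().sum()).unsqueeze(0)
target = sampled_joint_target(torch.tensor([1.0]), torch.randn(4), policies, critic)
print(target.shape)   # torch.Size([1])
```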

Reinforcement Learning based Multi-connectivity Resource Allocation in Factory Automation Systems

Authors:Mohammad Farzanullah, Hung V. Vu, Tho Le-Ngoc
Date:2022-08-26 13:29:03

We propose joint user association, channel assignment and power allocation for mobile robot Ultra-Reliable and Low Latency Communications (URLLC) based on multi-connectivity and reinforcement learning. The mobile robots require control messages from the central guidance system at regular intervals. We use a two-phase communication scheme where robots can form multiple clusters. The robots in a cluster are close to each other and can have reliable Device-to-Device (D2D) communications. In Phase I, the APs transmit the combined payload of a cluster to the cluster leader within a latency constraint. The cluster leader broadcasts this message to its members in Phase II. We develop a distributed Multi-Agent Reinforcement Learning (MARL) algorithm for joint user association and resource allocation (RA) for Phase I. The cluster leaders use their local Channel State Information (CSI) to decide the APs for connection along with the sub-band and power level. The cluster leaders utilize multi-connectivity to connect to multiple APs to increase their reliability. The objective is to maximize the successful payload delivery probability for all robots. Illustrative simulation results indicate that the proposed scheme can approach the performance of the centralized algorithm and offer a substantial gain in reliability as compared to single-connectivity (when cluster leaders are able to connect to 1 AP).

An approach to implement Reinforcement Learning for Heterogeneous Vehicular Networks

Authors:Bhavya Peshavaria, Sagar Kavaiya, Dhaval K. Patel
Date:2022-08-26 07:15:14

This paper extends the idea of spectrum sharing in vehicular networks to the Heterogeneous Vehicular Network (HetVNET) based on multi-agent reinforcement learning. Here, multiple vehicle-to-vehicle (V2V) links reuse the spectrum of vehicle-to-infrastructure (V2I) links and also those of other networks. The fast-changing environment in vehicular networks limits the feasibility of centralized CSI collection and channel allocation, so we adopt ML-based methods that can be implemented in a distributed manner across all vehicles. Each On-Board Unit (OBU) senses the signals in the channel and, based on that information, runs RL to autonomously decide which channel to use. Each V2V link acts as an agent in MARL. The idea is to train the RL model in such a way that these agents collaborate rather than compete.

Entropy Enhanced Multi-Agent Coordination Based on Hierarchical Graph Learning for Continuous Action Space

Authors:Yining Chen, Ke Wang, Guanghua Song, Xiaohong Jiang
Date:2022-08-23 01:53:15

In most existing studies on large-scale multi-agent coordination, the control methods aim to learn discrete policies for agents with finite choices. They rarely consider selecting actions directly from continuous action spaces to provide more accurate control, which makes them unsuitable for more complex tasks. To address the control issues posed by large-scale multi-agent systems with continuous action spaces, we propose a novel MARL coordination control method that derives stable continuous policies. By optimizing policies with maximum entropy learning, agents improve their exploration during execution and achieve excellent performance after training. We also employ hierarchical graph attention networks (HGAT) and gated recurrent units (GRU) to improve the scalability and transferability of our method. The experiments show that our method consistently outperforms all baselines in large-scale multi-agent cooperative reconnaissance tasks.

Quantum Multi-Agent Meta Reinforcement Learning

Authors:Won Joon Yun, Jihong Park, Joongheon Kim
Date:2022-08-22 22:46:52

Although quantum supremacy is yet to come, there has recently been an increasing interest in identifying the potential of quantum machine learning (QML) in the looming era of practical quantum computing. Motivated by this, in this article we re-design multi-agent reinforcement learning (MARL) based on the unique characteristics of quantum neural networks (QNNs) having two separate dimensions of trainable parameters: angle parameters affecting the output qubit states, and pole parameters associated with the output measurement basis. Exploiting this dyadic trainability as meta-learning capability, we propose quantum meta MARL (QM2ARL) that first applies angle training for meta-QNN learning, followed by pole training for few-shot or local-QNN training. To avoid overfitting, we develop an angle-to-pole regularization technique injecting noise into the pole domain during angle training. Furthermore, by exploiting the pole as the memory address of each trained QNN, we introduce the concept of pole memory allowing one to save and load trained QNNs using only two-parameter pole values. We theoretically prove the convergence of angle training under the angle-to-pole regularization, and by simulation corroborate the effectiveness of QM2ARL in achieving high reward and fast convergence, as well as of the pole memory in fast adaptation to a time-varying environment.

Formal Contracts Mitigate Social Dilemmas in Multi-Agent RL

Authors:Andreas A. Haupt, Phillip J. K. Christoffersen, Mehul Damani, Dylan Hadfield-Menell
Date:2022-08-22 17:42:03

Multi-agent Reinforcement Learning (MARL) is a powerful tool for training autonomous agents acting independently in a common environment. However, it can lead to sub-optimal behavior when individual incentives and group incentives diverge. Humans are remarkably capable at solving these social dilemmas. It is an open problem in MARL to replicate such cooperative behaviors in selfish agents. In this work, we draw upon the idea of formal contracting from economics to overcome diverging incentives between agents in MARL. We propose an augmentation to a Markov game where agents voluntarily agree to binding transfers of reward, under pre-specified conditions. Our contributions are theoretical and empirical. First, we show that this augmentation makes all subgame-perfect equilibria of all Fully Observable Markov Games exhibit socially optimal behavior, given a sufficiently rich space of contracts. Next, we show that for general contract spaces, and even under partial observability, richer contract spaces lead to higher welfare. Hence, contract space design solves an exploration-exploitation tradeoff, sidestepping incentive issues. We complement our theoretical analysis with experiments. Issues of exploration in the contracting augmentation are mitigated using a training methodology inspired by multi-objective reinforcement learning: Multi-Objective Contract Augmentation Learning (MOCA). We test our methodology in static, single-move games, as well as dynamic domains that simulate traffic, pollution management and common pool resource management.

Transformer-based Value Function Decomposition for Cooperative Multi-agent Reinforcement Learning in StarCraft

Authors:Muhammad Junaid Khan, Syed Hammad Ahmed, Gita Sukthankar
Date:2022-08-15 16:13:16

The StarCraft II Multi-Agent Challenge (SMAC) was created to be a challenging benchmark problem for cooperative multi-agent reinforcement learning (MARL). SMAC focuses exclusively on the problem of StarCraft micromanagement and assumes that each unit is controlled individually by a learning agent that acts independently and only possesses local information; centralized training is assumed to occur with decentralized execution (CTDE). To perform well in SMAC, MARL algorithms must handle the dual problems of multi-agent credit assignment and joint action evaluation. This paper introduces a new architecture TransMix, a transformer-based joint action-value mixing network which we show to be efficient and scalable as compared to the other state-of-the-art cooperative MARL solutions. TransMix leverages the ability of transformers to learn a richer mixing function for combining the agents' individual value functions. It achieves comparable performance to previous work on easy SMAC scenarios and outperforms other techniques on hard scenarios, as well as scenarios that are corrupted with Gaussian noise to simulate fog of war.

Diversifying Message Aggregation in Multi-Agent Communication via Normalized Tensor Nuclear Norm Regularization

Authors:Yuanzhao Zhai, Kele Xu, Bo Ding, Dawei Feng, Zijian Gao, Huaimin Wang
Date:2022-08-10 16:04:49

Aggregating messages is a key component of communication in multi-agent reinforcement learning (Comm-MARL). Recently, graph attention networks (GAT) have become prevalent in Comm-MARL, where agents are represented as nodes and messages are aggregated via weighted passing. While successful, GAT can lead to homogeneity in the strategies of message aggregation, and the ``core'' agent may excessively influence other agents' behaviors, which can severely limit multi-agent coordination. To address this challenge, we first study the adjacency tensor of the communication graph and demonstrate that the homogeneity of message aggregation can be measured by the normalized tensor rank. Since the rank optimization problem is known to be NP-hard, we define a new nuclear norm, a convex surrogate of the normalized tensor rank, to replace the rank. Leveraging this norm, we further propose a plug-and-play regularizer on the adjacency tensor, named Normalized Tensor Nuclear Norm Regularization (NTNNR), to actively enrich the diversity of message aggregation during the training stage. We extensively evaluate GAT with the proposed regularizer in both cooperative and mixed cooperative-competitive scenarios. The results demonstrate that aggregating messages using NTNNR-enhanced GAT improves training efficiency and achieves higher asymptotic performance than existing message aggregation methods. When NTNNR is applied to existing graph-attention Comm-MARL methods, we also observe significant performance improvements on the StarCraft II micromanagement benchmarks.
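
A simplified stand-in for the regularizer, under the assumption that the per-head attention maps are stacked into a tensor of adjacency matrices, is to reward a large sum of singular values (the nuclear norm of each slice), nudging aggregation patterns toward higher rank and hence more diversity; the paper's normalized tensor nuclear norm is more refined than this sketch.

```python
import torch

# Simplified stand-in for the regularizer (the paper's normalized tensor
# nuclear norm is more refined): reward a large sum of singular values of the
# stacked attention matrices to encourage diverse message aggregation.

def nuclear_norm_reg(adjacency, coeff=0.01):
    """adjacency: [n_heads, n_agents, n_agents] stack of attention matrices."""
    sv = torch.linalg.svdvals(adjacency)     # singular values of each slice
    return -coeff * sv.sum()                 # subtract => encourage diversity

attn = torch.softmax(torch.randn(4, 6, 6), dim=-1)   # toy attention weights
task_loss = torch.tensor(1.0)
total_loss = task_loss + nuclear_norm_reg(attn)
print(total_loss.item())
```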

Heterogeneous Multi-agent Zero-Shot Coordination by Coevolution

Authors:Ke Xue, Yutong Wang, Cong Guan, Lei Yuan, Haobo Fu, Qiang Fu, Chao Qian, Yang Yu
Date:2022-08-09 16:16:28

Generating agents that can achieve zero-shot coordination (ZSC) with unseen partners is a new challenge in cooperative multi-agent reinforcement learning (MARL). Recently, some studies have made progress in ZSC by exposing the agents to diverse partners during the training process. They usually involve self-play when training the partners, implicitly assuming that the tasks are homogeneous. However, many real-world tasks are heterogeneous, and hence previous methods may be inefficient. In this paper, we study the heterogeneous ZSC problem for the first time and propose a general method based on coevolution, which coevolves two populations of agents and partners through three sub-processes: pairing, updating and selection. Experimental results on various heterogeneous tasks highlight the necessity of considering the heterogeneous setting and demonstrate that our proposed method is a promising solution for heterogeneous ZSC tasks.

Multi-agent reinforcement learning for intent-based service assurance in cellular networks

Authors:Satheesh K. Perepu, Jean P. Martins, Ricardo Souza S, Kaushik Dey
Date:2022-08-07 14:42:58

Recently, intent-based management has received considerable attention in telecom networks owing to stringent performance requirements for many of the use cases. Several approaches in the literature employ traditional closed-loop driven methods to fulfill the intents on the KPIs. However, these methods treat each closed loop independently, which degrades the combined performance. Moreover, such existing methods are not easily scalable. Multi-agent reinforcement learning (MARL) techniques have shown significant promise in many areas where traditional closed-loop control falls short, typically for complex coordination and conflict management among loops. In this work, we propose a method based on MARL to achieve intent-based management without the need to know a model of the underlying system. Moreover, when there are conflicting intents, the MARL agents can implicitly incentivize the loops to cooperate and promote trade-offs, without human interaction, by prioritizing the important KPIs. Experiments have been performed on a network emulator for optimizing the KPIs of three services. The results demonstrate that the proposed system performs quite well and is able to fulfill all existing intents when there are enough resources, or to prioritize the KPIs when resources are scarce.

A Cooperation Graph Approach for Multiagent Sparse Reward Reinforcement Learning

Authors:Qingxu Fu, Tenghai Qiu, Zhiqiang Pu, Jianqiang Yi, Wanmai Yuan
Date:2022-08-05 06:32:16

Multiagent reinforcement learning (MARL) can solve complex cooperative tasks. However, the efficiency of existing MARL methods relies heavily on well-defined reward functions. Multiagent tasks with sparse reward feedback are especially challenging not only because of the credit distribution problem, but also due to the low probability of obtaining positive reward feedback. In this paper, we design a graph network called Cooperation Graph (CG). The Cooperation Graph is the combination of two simple bipartite graphs, namely, the Agent Clustering subgraph (ACG) and the Cluster Designating subgraph (CDG). Next, based on this novel graph structure, we propose a Cooperation Graph Multiagent Reinforcement Learning (CG-MARL) algorithm, which can efficiently deal with the sparse reward problem in multiagent tasks. In CG-MARL, agents are directly controlled by the Cooperation Graph. And a policy neural network is trained to manipulate this Cooperation Graph, guiding agents to achieve cooperation in an implicit way. This hierarchical feature of CG-MARL provides space for customized cluster-actions, an extensible interface for introducing fundamental cooperation knowledge. In experiments, CG-MARL shows state-of-the-art performance in sparse reward multiagent benchmarks, including the anti-invasion interception task and the multi-cargo delivery task.

Transferable Multi-Agent Reinforcement Learning with Dynamic Participating Agents

Authors:Xuting Tang, Jia Xu, Shusen Wang
Date:2022-08-04 03:16:42

We study multi-agent reinforcement learning (MARL) with centralized training and decentralized execution. During the training, new agents may join, and existing agents may unexpectedly leave the training. In such situations, a standard deep MARL model must be trained again from scratch, which is very time-consuming. To tackle this problem, we propose a special network architecture with a few-shot learning algorithm that allows the number of agents to vary during centralized training. In particular, when a new agent joins the centralized training, our few-shot learning algorithm trains its policy network and value network using a small number of samples; when an agent leaves the training, the training process of the remaining agents is not affected. Our experiments show that using the proposed network architecture and algorithm, model adaptation when new agents join can be 100+ times faster than the baseline. Our work is applicable to any setting, including cooperative, competitive, and mixed.

Heterogeneous-Agent Mirror Learning: A Continuum of Solutions to Cooperative MARL

Authors:Jakub Grudzien Kuba, Xidong Feng, Shiyao Ding, Hao Dong, Jun Wang, Yaodong Yang
Date:2022-08-02 18:16:42

The necessity for cooperation among intelligent machines has popularised cooperative multi-agent reinforcement learning (MARL) in the artificial intelligence (AI) research community. However, many research endeavors have been focused on developing practical MARL algorithms whose effectiveness has been studied only empirically, thereby lacking theoretical guarantees. As recent studies have revealed, MARL methods often achieve performance that is unstable in terms of reward monotonicity or suboptimal at convergence. To resolve these issues, in this paper, we introduce a novel framework named Heterogeneous-Agent Mirror Learning (HAML) that provides a general template for MARL algorithmic designs. We prove that algorithms derived from the HAML template satisfy the desired properties of the monotonic improvement of the joint reward and the convergence to Nash equilibrium. We verify the practicality of HAML by proving that the current state-of-the-art cooperative MARL algorithms, HATRPO and HAPPO, are in fact HAML instances. Next, as a natural outcome of our theory, we propose HAML extensions of two well-known RL algorithms, HAA2C (for A2C) and HADDPG (for DDPG), and demonstrate their effectiveness against strong baselines on StarCraftII and Multi-Agent MuJoCo tasks.

Evaluating Inter-Operator Cooperation Scenarios to Save Radio Access Network Energy

Authors:Xavier Marjou, Tangui Le Gléau, Vincent Messié, Benoit Radier, Tayeb Lemlouma, Gaël Fromentoux
Date:2022-08-02 07:14:28

Reducing energy consumption is crucial to reducing humanity's debt to our planet. Most companies therefore try to reduce their energy consumption while taking care to preserve the service delivered to their customers. To do so, a service provider (SP) typically downscales or shuts down part of its infrastructure in periods of low activity where only a few customers need the service. However, an SP still needs to maintain part of its infrastructure "on", which still requires significant energy. For example, a mobile national operator (MNO) needs to keep most of its radio access network (RAN) active. Could an SP do better by cooperating with other SPs who would temporarily support its users, thus allowing it to temporarily shut down its infrastructure, and then reciprocate during another low-activity period? To answer this question, we investigated a novel collaboration framework based on multi-agent reinforcement learning (MARL) allowing negotiations between SPs, as well as trustful reports from a distributed ledger technology (DLT) to evaluate the amount of energy being saved. We leveraged it to experiment with three different sets of rules (free, recommended, or imposed) regulating the negotiation between multiple SPs (3, 4, 8, or 10). With respect to four cooperation metrics (efficiency, safety, incentive-compatibility, and fairness), the simulations showed that the imposed set of rules proved to be the best mode.

Reward-Sharing Relational Networks in Multi-Agent Reinforcement Learning as a Framework for Emergent Behavior

Authors:Hossein Haeri, Reza Ahmadzadeh, Kshitij Jerath
Date:2022-07-12 23:27:42

In this work, we integrate `social' interactions into the MARL setup through a user-defined relational network and examine the effects of agent-agent relations on the rise of emergent behaviors. Leveraging insights from sociology and neuroscience, our proposed framework models agent relationships using the notion of Reward-Sharing Relational Networks (RSRN), where network edge weights act as a measure of how much one agent is invested in the success of (or `cares about') another. We construct relational rewards as a function of the RSRN interaction weights to collectively train the multi-agent system via a multi-agent reinforcement learning algorithm. The performance of the system is tested for a 3-agent scenario with different relational network structures (e.g., self-interested, communitarian, and authoritarian networks). Our results indicate that reward-sharing relational networks can significantly influence learned behaviors. We posit that RSRN can act as a framework where different relational networks produce distinct emergent behaviors, often analogous to the intuited sociological understanding of such networks.
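
The mechanism reduces to a matrix-vector product, as the sketch below shows for the three network structures mentioned above; the weights are illustrative placeholders.

```python
import numpy as np

# Minimal sketch of reward-sharing relational networks: the reward each agent
# optimizes is a weighted mix of the team's individual rewards, with edge
# weight W[i, j] encoding how much agent i "cares about" agent j.

W_self_interested = np.eye(3)                       # each agent cares only about itself
W_communitarian = np.full((3, 3), 1 / 3)            # everyone cares about everyone
W_authoritarian = np.array([[1.0, 0.0, 0.0],        # all agents care about agent 0
                            [1.0, 0.0, 0.0],
                            [1.0, 0.0, 0.0]])

def relational_rewards(W, individual_rewards):
    return W @ individual_rewards                   # r'_i = sum_j W[i, j] * r_j

r = np.array([1.0, 0.0, -1.0])
for name, W in [("self-interested", W_self_interested),
                ("communitarian", W_communitarian),
                ("authoritarian", W_authoritarian)]:
    print(name, relational_rewards(W, r))
```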

Towards Global Optimality in Cooperative MARL with the Transformation And Distillation Framework

Authors:Jianing Ye, Chenghao Li, Jianhao Wang, Chongjie Zhang
Date:2022-07-12 06:59:13

Decentralized execution is one core demand in cooperative multi-agent reinforcement learning (MARL). Recently, most popular MARL algorithms have adopted decentralized policies to enable decentralized execution and use gradient descent as their optimizer. However, there is hardly any theoretical analysis of these algorithms taking the optimization method into consideration, and we find that various popular MARL algorithms with decentralized policies are suboptimal in toy tasks when gradient descent is chosen as their optimization method. In this paper, we theoretically analyze two common classes of algorithms with decentralized policies -- multi-agent policy gradient methods and value-decomposition methods to prove their suboptimality when gradient descent is used. In addition, we propose the Transformation And Distillation (TAD) framework, which reformulates a multi-agent MDP as a special single-agent MDP with a sequential structure and enables decentralized execution by distilling the learned policy on the derived ``single-agent" MDP. This approach uses a two-stage learning paradigm to address the optimization problem in cooperative MARL, maintaining its performance guarantee. Empirically, we implement TAD-PPO based on PPO, which can theoretically perform optimal policy learning in the finite multi-agent MDPs and shows significant outperformance on a large set of cooperative multi-agent tasks.

VMAS: A Vectorized Multi-Agent Simulator for Collective Robot Learning

Authors:Matteo Bettini, Ryan Kortvelesy, Jan Blumenkamp, Amanda Prorok
Date:2022-07-07 18:48:58

While many multi-robot coordination problems can be solved optimally by exact algorithms, solutions are often not scalable in the number of robots. Multi-Agent Reinforcement Learning (MARL) is gaining increasing attention in the robotics community as a promising solution to tackle such problems. Nevertheless, we still lack the tools that allow us to quickly and efficiently find solutions to large-scale collective learning tasks. In this work, we introduce the Vectorized Multi-Agent Simulator (VMAS). VMAS is an open-source framework designed for efficient MARL benchmarking. It comprises a vectorized 2D physics engine written in PyTorch and a set of twelve challenging multi-robot scenarios. Additional scenarios can be implemented through a simple and modular interface. We demonstrate how vectorization enables parallel simulation on accelerated hardware without added complexity. When comparing VMAS to OpenAI MPE, we show how MPE's execution time increases linearly in the number of simulations while VMAS is able to execute 30,000 parallel simulations in under 10s, making it more than 100x faster. Using VMAS's RLlib interface, we benchmark our multi-robot scenarios using various Proximal Policy Optimization (PPO)-based MARL algorithms. VMAS's scenarios prove challenging in orthogonal ways for state-of-the-art MARL algorithms. The VMAS framework is available at https://github.com/proroklab/VectorizedMultiAgentSimulator. A video of VMAS scenarios and experiments is available at https://youtu.be/aaDRYfiesAY.

The StarCraft Multi-Agent Challenges+ : Learning of Multi-Stage Tasks and Environmental Factors without Precise Reward Functions

Authors:Mingyu Kim, Jihwan Oh, Yongsik Lee, Joonkee Kim, Seonghwan Kim, Song Chong, Se-Young Yun
Date:2022-07-05 12:43:54

In this paper, we propose a novel benchmark called the StarCraft Multi-Agent Challenges+, where agents learn to perform multi-stage tasks and to use environmental factors without precise reward functions. The previous challenge (SMAC), recognized as a standard benchmark for multi-agent reinforcement learning, is mainly concerned with ensuring that all agents cooperatively eliminate approaching adversaries only through fine manipulation with obvious reward functions. This challenge, on the other hand, is interested in the exploration capability of MARL algorithms to efficiently learn implicit multi-stage tasks and environmental factors as well as micro-control. This study covers both offensive and defensive scenarios. In the offensive scenarios, agents must learn to first find opponents and then eliminate them. The defensive scenarios require agents to use topographic features. For example, agents need to position themselves behind protective structures to make it harder for enemies to attack. We investigate MARL algorithms under SMAC+ and observe that recent approaches work well in settings similar to the previous challenges, but misbehave in offensive scenarios. Additionally, we observe that an enhanced exploration approach has a positive effect on performance but is not able to completely solve all scenarios. This study proposes new directions for future research.

NVIF: Neighboring Variational Information Flow for Large-Scale Cooperative Multi-Agent Scenarios

Authors:Jiajun Chai, Yuanheng Zhu, Dongbin Zhao
Date:2022-07-03 06:15:16

Communication-based multi-agent reinforcement learning (MARL) provides information exchange between agents, which promotes cooperation. However, existing methods cannot perform well in large-scale multi-agent systems. In this paper, we adopt neighboring communication and propose Neighboring Variational Information Flow (NVIF) to provide efficient communication for agents. It employs a variational auto-encoder to compress the shared information into a latent state. This communication protocol does not depend on a specific task, so it can be pre-trained to stabilize MARL training. Besides, we combine NVIF with Proximal Policy Optimization (NVIF-PPO) and Deep Q Network (NVIF-DQN), and present a theoretical analysis to illustrate that NVIF-PPO can promote cooperation. We evaluate NVIF-PPO and NVIF-DQN on MAgent, a widely used large-scale multi-agent environment, on two tasks with different map sizes. Experiments show that our method outperforms the compared methods and can learn effective and scalable cooperation strategies in large-scale multi-agent systems.

Distributed Influence-Augmented Local Simulators for Parallel MARL in Large Networked Systems

Authors:Miguel Suau, Jinke He, Mustafa Mert Çelikok, Matthijs T. J. Spaan, Frans A. Oliehoek
Date:2022-07-01 09:33:33

Due to its high sample complexity, simulation is, as of today, critical for the successful application of reinforcement learning. Many real-world problems, however, exhibit overly complex dynamics, which makes their full-scale simulation computationally slow. In this paper, we show how to decompose large networked systems of many agents into multiple local components such that we can build separate simulators that run independently and in parallel. To monitor the influence that the different local components exert on one another, each of these simulators is equipped with a learned model that is periodically trained on real trajectories. Our empirical results reveal that distributing the simulation among different processes not only makes it possible to train large multi-agent systems in just a few hours but also helps mitigate the negative effects of simultaneous learning.

Toward multi-target self-organizing pursuit in a partially observable Markov game

Authors:Lijun Sun, Yu-Cheng Chang, Chao Lyu, Ye Shi, Yuhui Shi, Chin-Teng Lin
Date:2022-06-24 14:59:56

The multiple-target self-organizing pursuit (SOP) problem has wide applications and has been considered a challenging self-organization game for distributed systems, in which intelligent agents cooperatively pursue multiple dynamic targets with partial observations. This work proposes a framework for decentralized multi-agent systems to improve the implicit coordination capabilities in search and pursuit. We model a self-organizing system as a partially observable Markov game (POMG) characterized by large scale, decentralization, partial observation, and no communication. The proposed distributed algorithm, fuzzy self-organizing cooperative coevolution (FSC2), is then leveraged to resolve the three challenges in multi-target SOP: distributed self-organizing search (SOS), distributed task allocation, and distributed single-target pursuit. FSC2 includes a coordinated multi-agent deep reinforcement learning (MARL) method that enables homogeneous agents to learn natural SOS patterns. Additionally, we propose a fuzzy-based distributed task allocation method, which locally decomposes multi-target SOP into several single-target pursuit problems. The cooperative coevolution principle is employed to coordinate distributed pursuers for each single-target pursuit problem. Therefore, the uncertainties of inherent partial observation and distributed decision-making in the POMG can be alleviated. The experimental results demonstrate that, by decomposing the SOP task, FSC2 achieves superior performance compared with other implicit coordination policies fully trained by general MARL algorithms. The scalability of FSC2 is also demonstrated: up to 2048 FSC2 agents perform efficient multi-target SOP with almost 100 percent capture rates. Empirical analyses and ablation studies verify the interpretability, rationality, and effectiveness of the component algorithms in FSC2.

PAC: Assisted Value Factorisation with Counterfactual Predictions in Multi-Agent Reinforcement Learning

Authors:Hanhan Zhou, Tian Lan, Vaneet Aggarwal
Date:2022-06-22 23:34:30

Multi-agent reinforcement learning (MARL) has witnessed significant progress with the development of value function factorization methods. Such factorization allows optimizing a joint action-value function through the maximization of factorized per-agent utilities, thanks to monotonicity. In this paper, we show that in partially observable MARL problems, an agent's ordering over its own actions could impose concurrent constraints (across different states) on the representable function class, causing significant estimation error during training. We tackle this limitation and propose PAC, a new framework leveraging Assistive information generated from Counterfactual Predictions of optimal joint action selection, which enables explicit assistance to value function factorization through a novel counterfactual loss. A variational inference-based information encoding method is developed to collect and encode the counterfactual predictions from an estimated baseline. To enable decentralized execution, we also derive factorized per-agent policies inspired by a maximum-entropy MARL framework. We evaluate the proposed PAC on multi-agent predator-prey and a set of StarCraft II micromanagement tasks. Empirical results demonstrate the improved performance of PAC over state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms on all benchmarks.

Certifiably Robust Policy Learning against Adversarial Communication in Multi-agent Systems

Authors:Yanchao Sun, Ruijie Zheng, Parisa Hassanzadeh, Yongyuan Liang, Soheil Feizi, Sumitra Ganesh, Furong Huang
Date:2022-06-21 07:32:18

Communication is important in many multi-agent reinforcement learning (MARL) problems for agents to share information and make good decisions. However, when deploying trained communicative agents in a real-world application where noise and potential attackers exist, the safety of communication-based policies becomes a severe issue that is underexplored. Specifically, if communication messages are manipulated by malicious attackers, agents relying on untrustworthy communication may take unsafe actions that lead to catastrophic consequences. Therefore, it is crucial to ensure that agents will not be misled by corrupted communication, while still benefiting from benign communication. In this work, we consider an environment with $N$ agents, where the attacker may arbitrarily change the communication from any $C<\frac{N-1}{2}$ agents to a victim agent. For this strong threat model, we propose a certifiable defense by constructing a message-ensemble policy that aggregates multiple randomly ablated message sets. Theoretical analysis shows that this message-ensemble policy can utilize benign communication while being certifiably robust to adversarial communication, regardless of the attacking algorithm. Experiments in multiple environments verify that our defense significantly improves the robustness of trained policies against various types of attacks.
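
As a rough illustration of the ensemble idea, the sketch below acts on randomly ablated subsets of the received messages and aggregates by majority vote; the base policy, subset size, and sampling scheme are illustrative stand-ins, not the paper's certified construction.

```python
import random
from collections import Counter

def ensemble_action(base_policy, obs, messages, subset_size, num_samples=32, seed=0):
    # Act on many randomly ablated message subsets and take a majority vote,
    # so a minority of corrupted senders rarely controls the final action.
    rng = random.Random(seed)
    senders = list(messages.keys())
    votes = Counter()
    for _ in range(num_samples):
        kept = rng.sample(senders, subset_size)   # random message ablation
        votes[base_policy(obs, {k: messages[k] for k in kept})] += 1
    return votes.most_common(1)[0][0]

# Toy usage: a policy that reduces message values to a discrete action.
toy_policy = lambda obs, msgs: int(sum(msgs.values()) + obs) % 3
msgs = {i: 1 for i in range(6)}
msgs[5] = 101                                     # one corrupted sender
print(ensemble_action(toy_policy, obs=0, messages=msgs, subset_size=2))
```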

MASER: Multi-Agent Reinforcement Learning with Subgoals Generated from Experience Replay Buffer

Authors:Jeewon Jeon, Woojun Kim, Whiyoung Jung, Youngchul Sung
Date:2022-06-20 08:12:26

In this paper, we consider cooperative multi-agent reinforcement learning (MARL) with sparse reward. To tackle this problem, we propose a novel method named MASER: MARL with subgoals generated from the experience replay buffer. Under the widely used assumption of centralized training with decentralized execution and consistent Q-value decomposition for MARL, MASER automatically generates proper subgoals for multiple agents from the experience replay buffer by considering both individual Q-values and the total Q-value. Then, MASER designs an individual intrinsic reward for each agent based on actionable representation relevant to Q-learning, so that the agents reach their subgoals while maximizing the joint action value. Numerical results show that MASER significantly outperforms other state-of-the-art MARL algorithms on the StarCraft II micromanagement benchmark.
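
A loose sketch of the subgoal mechanism, under the illustrative simplifications that subgoals are simply the highest-Q states in the buffer and the intrinsic reward is a negative distance in representation space; the paper's actual criteria combine individual and total Q-values.

```python
import numpy as np

def select_subgoals(buffer_states, q_values, num_agents):
    # buffer_states: (B, num_agents, d), q_values: (B, num_agents).
    # Pick, per agent, the buffer state with the highest Q-value as a subgoal.
    best = np.argmax(q_values, axis=0)
    return np.stack([buffer_states[best[i], i] for i in range(num_agents)])

def intrinsic_reward(agent_repr, subgoal_repr, scale=0.1):
    # Reward shrinking the distance to the subgoal in representation space.
    return -scale * np.linalg.norm(agent_repr - subgoal_repr)

states = np.random.randn(100, 3, 8)   # toy replay buffer of 100 transitions
qs = np.random.randn(100, 3)
goals = select_subgoals(states, qs, num_agents=3)
print(intrinsic_reward(states[0, 0], goals[0]))
```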

S2RL: Do We Really Need to Perceive All States in Deep Multi-Agent Reinforcement Learning?

Authors:Shuang Luo, Yinchuan Li, Jiahui Li, Kun Kuang, Furui Liu, Yunfeng Shao, Chao Wu
Date:2022-06-20 07:33:40

Collaborative multi-agent reinforcement learning (MARL) has been widely used in many practical applications, where each agent makes a decision based on its own observation. Most mainstream methods treat each local observation as an entirety when modeling the decentralized local utility functions. However, they ignore the fact that local observation information can be further divided into several entities, and that only some of these entities are helpful for model inference. Moreover, the importance of different entities may change over time. To improve the performance of decentralized policies, the attention mechanism is used to capture features of local information. Nevertheless, existing attention models rely on dense fully connected graphs and cannot effectively single out the important states. To this end, we propose a sparse-state-based MARL (S2RL) framework, which utilizes a sparse attention mechanism to discard irrelevant information in local observations. The local utility functions are estimated through the self-attention and sparse attention mechanisms separately, then combined into a standard joint value function and an auxiliary joint value function in the central critic. We design the S2RL framework as a plug-and-play module, making it general enough to be applied to various methods. Extensive experiments on StarCraft II show that S2RL can significantly improve the performance of many state-of-the-art methods.
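
As one plausible instantiation of sparse attention over observation entities (a sketch assuming simple top-k truncation; the paper's exact sparsification rule may differ), irrelevant entities are dropped and the attention weights are renormalized over the survivors.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(query, keys, values, k):
    # query: (d,), keys/values: (num_entities, d). Keep only the k most
    # relevant entities and renormalize attention over them.
    scores = keys @ query / keys.shape[-1] ** 0.5   # (num_entities,)
    topk = torch.topk(scores, k)
    weights = F.softmax(topk.values, dim=-1)        # attention over kept entities
    return weights @ values[topk.indices]           # (d,) aggregated feature

q = torch.randn(16)
entities = torch.randn(10, 16)                      # 10 observed entities
print(topk_sparse_attention(q, entities, entities, k=3).shape)  # torch.Size([16])
```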

From Multi-agent to Multi-robot: A Scalable Training and Evaluation Platform for Multi-robot Reinforcement Learning

Authors:Zhiuxan Liang, Jiannong Cao, Shan Jiang, Divya Saxena, Jinlin Chen, Huafeng Xu
Date:2022-06-20 06:36:45

Multi-agent reinforcement learning (MARL) has been gaining extensive attention from academia and industry in the past few decades. One of the fundamental problems in MARL is how to evaluate different approaches comprehensively. Most existing MARL methods are evaluated in either video games or simplistic simulated scenarios; it remains unknown how these methods perform in real-world scenarios, especially multi-robot systems. This paper introduces a scalable emulation platform for multi-robot reinforcement learning (MRRL), called SMART, to meet this need. Specifically, SMART consists of two components: 1) a simulation environment that provides a variety of complex interaction scenarios for training and 2) a real-world multi-robot system for realistic performance evaluation. In addition, SMART offers agent-environment APIs that are plug-and-play for algorithm implementation. To illustrate the practicality of our platform, we conduct a case study on the cooperative driving lane-change scenario. Building off the case study, we summarize several unique challenges of MRRL, which have rarely been considered before. Finally, we open-source the simulation environments, associated benchmark tasks, and state-of-the-art baselines to encourage and empower MRRL research.

Cooperative Edge Caching via Multi Agent Reinforcement Learning in Fog Radio Access Networks

Authors:Qi Chang, Yanxiang Jiang, Fu-Chun Zheng, Mehdi Bennis, Xiaohu You
Date:2022-06-20 03:16:17

In this paper, the cooperative edge caching problem in fog radio access networks (F-RANs) is investigated. To minimize the content transmission delay, we formulate the cooperative caching optimization problem to find the globally optimal caching strategy. Considering the non-deterministic polynomial hard (NP-hard) property of this problem, a Multi Agent Reinforcement Learning (MARL)-based cooperative caching scheme is proposed. Our proposed scheme applies a double deep Q-network (DDQN) in every fog access point (F-AP) and introduces a communication process into the multi-agent system. Every F-AP records the historical caching strategies of its associated F-APs as the observations of the communication procedure. By exchanging these observations, F-APs can leverage the cooperation and make the globally optimal caching strategy. Simulation results show that the proposed MARL-based cooperative caching scheme has remarkable performance compared with the benchmark schemes in minimizing the content transmission delay.

Logic-based Reward Shaping for Multi-Agent Reinforcement Learning

Authors:Ingy ElSayed-Aly, Lu Feng
Date:2022-06-17 16:30:27

Reinforcement learning (RL) relies heavily on exploration to learn from its environment and maximize observed rewards. Therefore, it is essential to design a reward function that guarantees optimal learning from the received experience. Previous work has combined automata- and logic-based reward shaping with environment assumptions to provide an automatic mechanism to synthesize the reward function based on the task. However, there is limited work on how to expand logic-based reward shaping to Multi-Agent Reinforcement Learning (MARL). The environment will need to consider the joint state in order to keep track of other agents if the task requires cooperation, thus suffering from the curse of dimensionality with respect to the number of agents. This project explores how logic-based reward shaping for MARL can be designed for different scenarios and tasks. We present a novel method for semi-centralized logic-based MARL reward shaping that is scalable in the number of agents and evaluate it in multiple scenarios.

Towards Human-Level Bimanual Dexterous Manipulation with Reinforcement Learning

Authors:Yuanpei Chen, Tianhao Wu, Shengjie Wang, Xidong Feng, Jiechuang Jiang, Stephen Marcus McAleer, Yiran Geng, Hao Dong, Zongqing Lu, Song-Chun Zhu, Yaodong Yang
Date:2022-06-17 11:09:06

Achieving human-level dexterity is an important open problem in robotics. However, tasks of dexterous hand manipulation, even at the baby level, are challenging to solve through reinforcement learning (RL). The difficulty lies in the high degrees of freedom and the required cooperation among heterogeneous agents (e.g., joints of fingers). In this study, we propose the Bimanual Dexterous Hands Benchmark (Bi-DexHands), a simulator that involves two dexterous hands with tens of bimanual manipulation tasks and thousands of target objects. Specifically, tasks in Bi-DexHands are designed to match different levels of human motor skills according to the cognitive science literature. We built Bi-DexHands in Isaac Gym; this enables highly efficient RL training, reaching 30,000+ FPS on a single NVIDIA RTX 3090. We provide a comprehensive benchmark for popular RL algorithms under different settings; this includes Single-agent/Multi-agent RL, Offline RL, Multi-task RL, and Meta RL. Our results show that PPO-type on-policy algorithms can master simple manipulation tasks that are equivalent to those of up to 48-month-old human babies (e.g., catching a flying object, opening a bottle), while multi-agent RL can further help master manipulations that require skilled bimanual cooperation (e.g., lifting a pot, stacking blocks). Despite the success on each individual task, when it comes to acquiring multiple manipulation skills, existing RL algorithms fail to work in most of the multi-task and few-shot learning settings, which calls for more substantial development from the RL community. Our project is open sourced at https://github.com/PKU-MARL/DexterousHands.

Revisiting Some Common Practices in Cooperative Multi-Agent Reinforcement Learning

Authors:Wei Fu, Chao Yu, Zelai Xu, Jiaqi Yang, Yi Wu
Date:2022-06-15 13:03:05

Many advances in cooperative multi-agent reinforcement learning (MARL) are based on two common design principles: value decomposition and parameter sharing. A typical MARL algorithm of this fashion decomposes a centralized Q-function into local Q-networks with parameters shared across agents. Such an algorithmic paradigm enables centralized training and decentralized execution (CTDE) and leads to efficient learning in practice. Despite all the advantages, we revisit these two principles and show that in certain scenarios, e.g., environments with a highly multi-modal reward landscape, value decomposition and parameter sharing can be problematic and lead to undesired outcomes. In contrast, policy gradient (PG) methods with individual policies provably converge to an optimal solution in these cases, which partially supports some recent empirical observations that PG can be effective in many MARL testbeds. Inspired by our theoretical analysis, we present practical suggestions on implementing multi-agent PG algorithms for either high rewards or diverse emergent behaviors, and empirically validate our findings on a variety of domains, ranging from simplified matrix and grid-world games to complex benchmarks such as the StarCraft Multi-Agent Challenge and Google Research Football. We hope our insights can benefit the community towards developing more general and more powerful MARL algorithms. Check our project website at https://sites.google.com/view/revisiting-marl.
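
For concreteness, the two design principles under scrutiny can be sketched in a few lines: a VDN-style additive decomposition of the joint Q-function with one parameter-shared local network across agents (a minimal sketch, not any specific paper's implementation).

```python
import torch
import torch.nn as nn

class SharedVDN(nn.Module):
    def __init__(self, obs_dim, num_actions):
        super().__init__()
        # A single local Q-network shared by all agents (parameter sharing).
        self.local_q = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, num_actions))

    def forward(self, obs, actions):
        # obs: (batch, n_agents, obs_dim), actions: (batch, n_agents).
        q_all = self.local_q(obs)                              # per-agent Q-values
        q_taken = q_all.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
        return q_taken.sum(dim=-1)                             # additive joint Q

net = SharedVDN(obs_dim=10, num_actions=5)
joint_q = net(torch.randn(4, 3, 10), torch.randint(0, 5, (4, 3)))
print(joint_q.shape)                                           # torch.Size([4])
```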

Finite-Time Analysis of Fully Decentralized Single-Timescale Actor-Critic

Authors:Qijun Luo, Xiao Li
Date:2022-06-12 13:14:14

Decentralized Actor-Critic (AC) algorithms have been widely utilized for multi-agent reinforcement learning (MARL) and have achieved remarkable success. Apart from its empirical success, the theoretical convergence property of decentralized AC algorithms is largely unexplored. Most of the existing finite-time convergence results are derived based on either a double-loop update or a two-timescale step-size rule, and this is the case even for the centralized AC algorithm in the single-agent setting. In practice, the \emph{single-timescale} update is widely utilized, where actor and critic are updated in an alternating manner with step sizes of the same order. In this work, we study a decentralized \emph{single-timescale} AC algorithm. Theoretically, using linear approximation for value and reward estimation, we show that the algorithm has sample complexity of $\tilde{\mathcal{O}}(\varepsilon^{-2})$ under Markovian sampling, which matches the optimal complexity with a double-loop implementation (here, $\tilde{\mathcal{O}}$ hides a logarithmic term). When we reduce to the single-agent setting, our result yields a new sample complexity for centralized AC using a single-timescale update scheme. Central to establishing our complexity results is \emph{the hidden smoothness of the optimal critic variable} that we reveal. We also provide a local action privacy-preserving version of our algorithm and its analysis. Finally, we conduct experiments to show the superiority of our algorithm over the existing decentralized AC algorithms.
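
Schematically, the update rule under study alternates one critic step and one actor step per iteration with step sizes of the same order, rather than a double loop or two-timescale schedule. The sketch below uses toy stand-in gradients purely to show the structure of the loop.

```python
import numpy as np

def single_timescale_ac(theta, w, grad_actor, grad_critic, steps, alpha=1e-2):
    # theta: actor parameters, w: critic parameters; both step sizes are
    # the same order (here, literally equal), i.e. a single timescale.
    for _ in range(steps):
        w = w + alpha * grad_critic(theta, w)         # one critic step
        theta = theta + alpha * grad_actor(theta, w)  # one actor step
    return theta, w

# Toy quadratic stand-ins for the true (unknown) gradients.
ga = lambda th, w: -(th - w)
gc = lambda th, w: (th - w)
print(single_timescale_ac(np.ones(2), np.zeros(2), ga, gc, steps=500))
```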

Scalable Joint Learning of Wireless Multiple-Access Policies and their Signaling

Authors:Mateus P. Mota, Alvaro Valcarce, Jean-Marie Gorce
Date:2022-06-08 12:38:04

In this paper, we apply a multi-agent reinforcement learning (MARL) framework that allows the base station (BS) and the user equipments (UEs) to jointly learn a channel access policy and its signaling in a wireless multiple access scenario. In this framework, the BS and UEs are reinforcement learning (RL) agents that need to cooperate in order to deliver data. Comparison with contention-free and contention-based baselines shows that our framework achieves superior performance in terms of goodput even in high-traffic situations, while maintaining a low collision rate. We also study the scalability of the proposed method, a major open problem in MARL, and provide first results towards addressing it.

Stabilizing Voltage in Power Distribution Networks via Multi-Agent Reinforcement Learning with Transformer

Authors:Minrui Wang, Mingxiao Feng, Wengang Zhou, Houqiang Li
Date:2022-06-08 07:48:42

The increased integration of renewable energy poses a slew of technical challenges for the operation of power distribution networks. Among them, voltage fluctuations caused by the instability of renewable energy are receiving increasing attention. Utilizing MARL algorithms to coordinate multiple control units in the grid, which can handle the rapid changes of power systems, has recently been widely studied for the active voltage control task. However, existing MARL-based approaches ignore the unique nature of the grid and achieve limited performance. In this paper, we introduce the transformer architecture to extract representations adapted to power network problems and propose a Transformer-based Multi-Agent Actor-Critic framework (T-MAAC) to stabilize voltage in power distribution networks. In addition, we adopt a novel auxiliary-task training process tailored to the voltage control task, which improves sample efficiency and facilitates the representation learning of the transformer-based model. We couple T-MAAC with different multi-agent actor-critic algorithms, and the consistent improvements on the active voltage control task demonstrate the effectiveness of the proposed method.

Reward Poisoning Attacks on Offline Multi-Agent Reinforcement Learning

Authors:Young Wu, Jeremy McMahan, Xiaojin Zhu, Qiaomin Xie
Date:2022-06-04 03:15:57

In offline multi-agent reinforcement learning (MARL), agents estimate policies from a given dataset. We study reward-poisoning attacks in this setting where an exogenous attacker modifies the rewards in the dataset before the agents see the dataset. The attacker wants to guide each agent into a nefarious target policy while minimizing the $L^p$ norm of the reward modification. Unlike attacks on single-agent RL, we show that the attacker can install the target policy as a Markov Perfect Dominant Strategy Equilibrium (MPDSE), which rational agents are guaranteed to follow. This attack can be significantly cheaper than separate single-agent attacks. We show that the attack works on various MARL agents including uncertainty-aware learners, and we exhibit linear programs to efficiently solve the attack problem. We also study the relationship between the structure of the datasets and the minimal attack cost. Our work paves the way for studying defense in offline MARL.

Learning Distributed and Fair Policies for Network Load Balancing as Markov Potential Game

Authors:Zhiyuan Yao, Zihan Ding
Date:2022-06-03 08:29:02

This paper investigates the network load balancing problem in data centers (DCs) where multiple load balancers (LBs) are deployed, using the multi-agent reinforcement learning (MARL) framework. The challenges of this problem consist of the heterogeneous processing architecture and dynamic environments, as well as the limited and partial observability of each LB agent in distributed networking systems, which can largely degrade the performance of in-production load balancing algorithms in real-world setups. The centralised-training-decentralised-execution (CTDE) RL scheme has been proposed to improve MARL performance, yet it incurs additional communication and management overhead among agents, especially in distributed networking systems, which favor a distributed and plug-and-play design. We formulate the multi-agent load balancing problem as a Markov potential game, with a carefully and properly designed workload distribution fairness as the potential function. A fully distributed MARL algorithm is proposed to approximate the Nash equilibrium of the game. Experimental evaluations involve both an event-driven simulator and a real-world system, where the proposed MARL load balancing algorithm shows close-to-optimal performance in simulations and superior results over in-production LBs in the real-world system.

Sample-Efficient Reinforcement Learning of Partially Observable Markov Games

Authors:Qinghua Liu, Csaba Szepesvári, Chi Jin
Date:2022-06-02 21:57:47

This paper considers the challenging tasks of Multi-Agent Reinforcement Learning (MARL) under partial observability, where each agent only sees her own individual observations and actions, which reveal incomplete information about the underlying state of the system. This paper studies these tasks under the general model of multiplayer general-sum Partially Observable Markov Games (POMGs), which is significantly larger than the standard model of Imperfect Information Extensive-Form Games (IIEFGs). We identify a rich subclass of POMGs -- weakly revealing POMGs -- in which sample-efficient learning is tractable. In the self-play setting, we prove that a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to find approximate Nash equilibria, correlated equilibria, as well as coarse correlated equilibria of weakly revealing POMGs, in a polynomial number of samples when the number of agents is small. In the setting of playing against adversarial opponents, we show that a variant of our optimistic MLE algorithm is capable of achieving sublinear regret when compared against the optimal maximin policies. To the best of our knowledge, this work provides the first line of sample-efficient results for learning POMGs.

Distributed Training for Deep Learning Models On An Edge Computing Network Using Shielded Reinforcement Learning

Authors:Tanmoy Sen, Haiying Shen
Date:2022-06-01 21:32:44

Edge devices with local computation capability have made distributed deep learning (DL) training on edges possible. In such a method, the cluster head of a cluster of edges schedules DL training jobs from the edges. With such a centralized scheduling method, the cluster head knows all the loads of the edges, which can avoid overloading the cluster edges, but the head itself may become overloaded. To handle this problem, we propose a multi-agent RL (MARL) system that enables each edge to schedule its jobs using RL. However, without coordination among edges, action collisions may occur, in which multiple edges schedule tasks to the same edge and make it overloaded. For this reason, we propose a system called Shielded ReinfOrcement learning (RL) based DL training on Edges (SROLE). In SROLE, the shield deployed on an edge checks for action collisions and provides alternative actions to avoid them. As a central shield for the entire cluster may become a bottleneck, we further propose a decentralized shielding method, where different shields are responsible for different regions of the cluster and coordinate to avoid action collisions on the region boundaries. Our emulation and real-device experiments show that SROLE reduces training time by 59% compared to MARL and centralized RL.

Policy Diagnosis via Measuring Role Diversity in Cooperative Multi-agent RL

Authors:Siyi Hu, Chuanlong Xie, Xiaodan Liang, Xiaojun Chang
Date:2022-06-01 04:58:52

Cooperative multi-agent reinforcement learning (MARL) is making rapid progress for solving tasks in grid worlds and real-world scenarios, in which agents are given different attributes and goals, resulting in different behavior throughout the multi-agent task. In this study, we quantify the agents' behavior difference and build its relationship with the policy performance via Role Diversity, a metric to measure the characteristics of MARL tasks. We define role diversity from three perspectives: action-based, trajectory-based, and contribution-based, to fully measure a multi-agent task. Through theoretical analysis, we find that the error bound in MARL can be decomposed into three parts that have a strong relation to role diversity. The decomposed factors can significantly impact policy optimization in three popular directions: parameter sharing, communication mechanism, and credit assignment. The main experimental platforms are based on the Multiagent Particle Environment (MPE) and the StarCraft Multi-Agent Challenge (SMAC). Extensive experiments clearly show that role diversity can serve as a robust measurement of the characteristics of a multi-agent cooperation task and help diagnose whether the policy fits the current multi-agent system, leading to better policy performance.

A Game-Theoretic Framework for Managing Risk in Multi-Agent Systems

Authors:Oliver Slumbers, David Henry Mguni, Stephen Marcus McAleer, Stefano B. Blumberg, Jun Wang, Yaodong Yang
Date:2022-05-30 21:20:30

In order for agents in multi-agent systems (MAS) to be safe, they need to take into account the risks posed by the actions of other agents. However, the dominant paradigm in game theory (GT) assumes that agents are not affected by risk from other agents and only strive to maximise their expected utility. For example, in hybrid human-AI driving systems, it is necessary to limit large deviations in reward resulting from car crashes. Although there are equilibrium concepts in game theory that take into account risk aversion, they either assume that agents are risk-neutral with respect to the uncertainty caused by the actions of other agents, or they are not guaranteed to exist. We introduce a new GT-based Risk-Averse Equilibrium (RAE) that always produces a solution that minimises the potential variance in reward accounting for the strategy of other agents. Theoretically and empirically, we show RAE shares many properties with a Nash Equilibrium (NE), establishing convergence properties and generalising to risk-dominant NE in certain cases. To tackle large-scale problems, we extend RAE to the PSRO multi-agent reinforcement learning (MARL) framework. We empirically demonstrate the minimum reward variance benefits of RAE in matrix games with high-risk outcomes. Results on MARL experiments show RAE generalises to risk-dominant NE in a trust dilemma game and that it reduces instances of crashing by 7x in an autonomous driving setting versus the best performing baseline.

Residual Q-Networks for Value Function Factorizing in Multi-Agent Reinforcement Learning

Authors:Rafael Pina, Varuna De Silva, Joosep Hook, Ahmet Kondoz
Date:2022-05-30 16:56:06

Multi-Agent Reinforcement Learning (MARL) is useful in many problems that require the cooperation and coordination of multiple agents. Learning optimal policies using reinforcement learning in a multi-agent setting can be very difficult as the number of agents increases. Recent solutions such as Value Decomposition Networks (VDN), QMIX, QTRAN and QPLEX adhere to the centralized training and decentralized execution scheme and perform factorization of the joint action-value functions. However, these methods still suffer from increased environmental complexity, and at times fail to converge in a stable manner. We propose a novel concept of Residual Q-Networks (RQNs) for MARL, which learns to transform the individual Q-value trajectories in a way that preserves the Individual-Global-Max criterion (IGM), but is more robust in factorizing action-value functions. The RQN acts as an auxiliary network that accelerates convergence and becomes obsolete as the agents reach the training objectives. The performance of the proposed method is compared against several state-of-the-art techniques such as QPLEX, QMIX, QTRAN and VDN, in a range of multi-agent cooperative tasks. The results illustrate that the proposed method, in general, converges faster, with increased stability, and shows robust performance in a wider family of environments. The improvements in results are more prominent in environments with severe punishments for non-cooperative behaviours and especially in the absence of complete state information during training time.

Multi-Agent Reinforcement Learning is a Sequence Modeling Problem

Authors:Muning Wen, Jakub Grudzien Kuba, Runji Lin, Weinan Zhang, Ying Wen, Jun Wang, Yaodong Yang
Date:2022-05-30 09:39:45

Large sequence models (SMs) such as the GPT series and BERT have displayed outstanding performance and generalization capabilities on vision, language, and recently reinforcement learning tasks. A natural follow-up question is how to abstract multi-agent decision making into an SM problem and benefit from the prosperous development of SMs. In this paper, we introduce a novel architecture named Multi-Agent Transformer (MAT) that effectively casts cooperative multi-agent reinforcement learning (MARL) into SM problems, wherein the task is to map agents' observation sequences to agents' optimal action sequences. Our goal is to build the bridge between MARL and SMs so that the modeling power of modern sequence models can be unleashed for MARL. Central to our MAT is an encoder-decoder architecture which leverages the multi-agent advantage decomposition theorem to transform the joint policy search problem into a sequential decision-making process; this renders only linear time complexity for multi-agent problems and, most importantly, endows MAT with a monotonic performance improvement guarantee. Unlike prior arts such as Decision Transformer, which fit only pre-collected offline data, MAT is trained by online trial and error from the environment in an on-policy fashion. To validate MAT, we conduct extensive experiments on StarCraftII, Multi-Agent MuJoCo, Dexterous Hands Manipulation, and Google Research Football benchmarks. Results demonstrate that MAT achieves superior performance and data efficiency compared to strong baselines including MAPPO and HAPPO. Furthermore, we demonstrate that MAT is an excellent few-shot learner on unseen tasks regardless of changes in the number of agents. See our project page at https://sites.google.com/view/multi-agent-transformer.
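
The sequence-modeling view can be sketched as follows: encode all agents' observations jointly, then decode actions one agent at a time, each conditioned on the actions already chosen. The tiny modules below are placeholders for illustration, not the actual MAT architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMAT(nn.Module):
    def __init__(self, obs_dim, act_dim, hid=32):
        super().__init__()
        self.encode = nn.Linear(obs_dim, hid)
        # The decoder sees an agent's encoding plus the previous agent's action.
        self.decode = nn.Linear(hid + act_dim, act_dim)
        self.act_dim = act_dim

    def forward(self, obs):                      # obs: (n_agents, obs_dim)
        enc = torch.relu(self.encode(obs))
        prev = torch.zeros(self.act_dim)
        actions = []
        for i in range(obs.shape[0]):            # sequential per-agent decoding
            logits = self.decode(torch.cat([enc[i], prev]))
            a = torch.distributions.Categorical(logits=logits).sample()
            prev = F.one_hot(a, self.act_dim).float()
            actions.append(a)
        return torch.stack(actions)

print(TinyMAT(obs_dim=8, act_dim=4)(torch.randn(5, 8)))  # one action per agent
```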

ALMA: Hierarchical Learning for Composite Multi-Agent Tasks

Authors:Shariq Iqbal, Robby Costales, Fei Sha
Date:2022-05-27 19:12:23

Despite significant progress on multi-agent reinforcement learning (MARL) in recent years, coordination in complex domains remains a challenge. Work in MARL often focuses on solving tasks where agents interact with all other agents and entities in the environment; however, we observe that real-world tasks are often composed of several isolated instances of local agent interactions (subtasks), and each agent can meaningfully focus on one subtask to the exclusion of all else in the environment. In these composite tasks, successful policies can often be decomposed into two levels of decision-making: agents are allocated to specific subtasks and each agent acts productively towards their assigned subtask alone. This decomposed decision making provides a strong structural inductive bias, significantly reduces agent observation spaces, and encourages subtask-specific policies to be reused and composed during training, as opposed to treating each new composition of subtasks as unique. We introduce ALMA, a general learning method for taking advantage of these structured tasks. ALMA simultaneously learns a high-level subtask allocation policy and low-level agent policies. We demonstrate that ALMA learns sophisticated coordination behavior in a number of challenging environments, outperforming strong baselines. ALMA's modularity also enables it to better generalize to new environment configurations. Finally, we find that while ALMA can integrate separately trained allocation and action policies, the best performance is obtained only by training all components jointly. Our code is available at https://github.com/shariqiqbal2810/ALMA

Feudal Multi-Agent Reinforcement Learning with Adaptive Network Partition for Traffic Signal Control

Authors:Jinming Ma, Feng Wu
Date:2022-05-27 09:02:10

Multi-agent reinforcement learning (MARL) has been applied to and shown great potential in multi-intersection traffic signal control, where multiple agents, one for each intersection, must cooperate to optimize traffic flow. To encourage global cooperation, previous work partitions the traffic network into several regions and learns policies for agents in a feudal structure. However, a static network partition fails to adapt to the dynamic traffic flow, which changes frequently over time. To address this, we propose a novel feudal MARL approach with adaptive network partition. Specifically, we first partition the network into several regions according to the traffic flow. To do this, we propose two approaches: one directly uses a graph neural network (GNN) to generate the network partition, and the other uses Monte-Carlo tree search (MCTS) to find the best partition with criteria computed by the GNN. Then, we design a variant of QMIX using a GNN to handle the varying input dimensions produced by the dynamic network partition. Finally, we use a feudal hierarchy to manage agents in each partition and promote global cooperation. By doing so, agents are able to adapt to the traffic flow as required in practice. We empirically evaluate our method both on a synthetic traffic grid and on real-world traffic networks of three cities, widely used in the literature. Our experimental results confirm that our method achieves better performance, in terms of average travel time and queue length, than several leading methods for traffic signal control.

Off-Beat Multi-Agent Reinforcement Learning

Authors:Wei Qiu, Weixun Wang, Rundong Wang, Bo An, Yujing Hu, Svetlana Obraztsova, Zinovi Rabinovich, Jianye Hao, Yingfeng Chen, Changjie Fan
Date:2022-05-27 02:21:04

We investigate model-free multi-agent reinforcement learning (MARL) in environments where off-beat actions are prevalent, i.e., all actions have pre-set execution durations. During execution durations, the environment changes are influenced by, but not synchronised with, action execution. Such a setting is ubiquitous in many real-world problems. However, most MARL methods assume actions are executed immediately after inference, which is often unrealistic and can lead to catastrophic failure for multi-agent coordination with off-beat actions. In order to fill this gap, we develop an algorithmic framework for MARL with off-beat actions. We then propose a novel episodic memory, LeGEM, for model-free MARL algorithms. LeGEM builds agents' episodic memories by utilizing agents' individual experiences. It boosts multi-agent learning by addressing the challenging temporal credit assignment problem raised by the off-beat actions via our novel reward redistribution scheme, alleviating the issue of non-Markovian reward. We evaluate LeGEM on various multi-agent scenarios with off-beat actions, including Stag-Hunter Game, Quarry Game, Afforestation Game, and StarCraft II micromanagement tasks. Empirical results show that LeGEM significantly boosts multi-agent coordination and achieves leading performance and improved sample efficiency.

Trust-based Consensus in Multi-Agent Reinforcement Learning Systems

Authors:Ho Long Fung, Victor-Alexandru Darvariu, Stephen Hailes, Mirco Musolesi
Date:2022-05-25 15:58:34

An often neglected issue in multi-agent reinforcement learning (MARL) is the potential presence of unreliable agents in the environment whose deviations from expected behavior can prevent a system from accomplishing its intended tasks. In particular, consensus is a fundamental underpinning problem of cooperative distributed multi-agent systems. Consensus requires different agents, situated in a decentralized communication network, to reach an agreement out of a set of initial proposals that they put forward. Learning-based agents should adopt a protocol that allows them to reach consensus despite having one or more unreliable agents in the system. This paper investigates the problem of unreliable agents in MARL, considering consensus as a case study. Echoing established results in the distributed systems literature, our experiments show that even a moderate fraction of such agents can greatly impact the ability of reaching consensus in a networked environment. We propose Reinforcement Learning-based Trusted Consensus (RLTC), a decentralized trust mechanism, in which agents can independently decide which neighbors to communicate with. We empirically demonstrate that our trust mechanism is able to handle unreliable agents effectively, as evidenced by higher consensus success rates.

Scalable Multi-Agent Model-Based Reinforcement Learning

Authors:Vladimir Egorov, Aleksei Shpilman
Date:2022-05-25 08:35:00

Recent Multi-Agent Reinforcement Learning (MARL) literature has largely focused on the Centralized Training with Decentralized Execution (CTDE) paradigm. CTDE has been the dominant approach for both cooperative and mixed environments due to its capability to efficiently train decentralized policies. While in mixed environments full autonomy of the agents can be a desirable outcome, cooperative environments allow agents to share information to facilitate coordination. Approaches that leverage this technique are usually referred to as communication methods, as full autonomy of agents is compromised for better performance. Although communication approaches have shown impressive results, they do not fully leverage this additional information during the training phase. In this paper, we propose a new method called MAMBA, which utilizes Model-Based Reinforcement Learning (MBRL) to further leverage centralized training in cooperative environments. We argue that communication between agents is enough to sustain a world model for each agent during the execution phase, while imaginary rollouts can be used for training, removing the necessity to interact with the environment. These properties yield a sample-efficient algorithm that can scale gracefully with the number of agents. We empirically confirm that MAMBA achieves good performance while reducing the number of interactions with the environment by up to an order of magnitude compared to model-free state-of-the-art approaches in the challenging domains of SMAC and Flatland.

MAVIPER: Learning Decision Tree Policies for Interpretable Multi-Agent Reinforcement Learning

Authors:Stephanie Milani, Zhicheng Zhang, Nicholay Topin, Zheyuan Ryan Shi, Charles Kamhoua, Evangelos E. Papalexakis, Fei Fang
Date:2022-05-25 02:38:10

Many recent breakthroughs in multi-agent reinforcement learning (MARL) require the use of deep neural networks, which are challenging for human experts to interpret and understand. On the other hand, existing work on interpretable reinforcement learning (RL) has shown promise in extracting more interpretable decision tree-based policies from neural networks, but only in the single-agent setting. To fill this gap, we propose the first set of algorithms that extract interpretable decision-tree policies from neural networks trained with MARL. The first algorithm, IVIPER, extends VIPER, a recent method for single-agent interpretable RL, to the multi-agent setting. We demonstrate that IVIPER learns high-quality decision-tree policies for each agent. To better capture coordination between agents, we propose a novel centralized decision-tree training algorithm, MAVIPER. MAVIPER jointly grows the trees of each agent by predicting the behavior of the other agents using their anticipated trees, and uses resampling to focus on states that are critical for its interactions with other agents. We show that both algorithms generally outperform the baselines and that MAVIPER-trained agents achieve better-coordinated performance than IVIPER-trained agents on three different multi-agent particle-world environments.
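
The single-agent VIPER step that IVIPER extends can be sketched as behavior cloning into a shallow tree: roll out the trained neural policy, collect state-action pairs, and fit a small decision tree to imitate it. The environment, policy, and sampling below are stand-ins (and assume scikit-learn is available); the DAgger-style resampling refinements are omitted.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def distill_policy(neural_policy, sample_state, num_samples=2000, depth=4,
                   rng=None):
    rng = rng or np.random.default_rng(0)
    states = np.stack([sample_state(rng) for _ in range(num_samples)])
    actions = np.array([neural_policy(s) for s in states])  # expert labels
    # Fit an interpretable surrogate policy to imitate the neural expert.
    return DecisionTreeClassifier(max_depth=depth).fit(states, actions)

# Toy stand-ins: 4-dim states and a threshold "neural" policy.
policy = lambda s: int(s[0] + s[1] > 0)
tree = distill_policy(policy, lambda rng: rng.normal(size=4))
print(tree.predict(np.zeros((1, 4))))
```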

Learning to Advise and Learning from Advice in Cooperative Multi-Agent Reinforcement Learning

Authors:Yue Jin, Shuangqing Wei, Jian Yuan, Xudong Zhang
Date:2022-05-23 09:56:12

Learning to coordinate is a daunting problem in multi-agent reinforcement learning (MARL). Previous works have explored it from many facets, including cognition between agents, credit assignment, communication, expert demonstration, etc. However, less attention has been paid to agents' decision structure and the hierarchy of coordination. In this paper, we explore the spatiotemporal structure of agents' decisions and consider the hierarchy of coordination from the perspective of multilevel emergence dynamics, based on which a novel approach, Learning to Advise and Learning from Advice (LALA), is proposed to improve MARL. Specifically, by distinguishing the hierarchy of coordination, we propose to enhance decision coordination at the meso level with an advisor and leverage a policy discriminator to advise agents' learning at the micro level. The advisor learns to aggregate decision information in both spatial and temporal domains and generates coordinated decisions by employing a spatiotemporal dual graph convolutional neural network with a task-oriented objective function. Each agent learns from the advice via a policy generative adversarial learning method, where a discriminator distinguishes between the policies of the agent and the advisor and boosts both of them based on its judgement. Experimental results indicate the advantage of LALA over baseline approaches in terms of both learning efficiency and coordination capability. The coordination mechanism is investigated from the perspective of multilevel emergence dynamics and from a mutual information point of view, which provides a novel perspective and method to analyze and improve MARL algorithms.

Coordinating Policies Among Multiple Agents via an Intelligent Communication Channel

Authors:Dianbo Liu, Vedant Shah, Oussama Boussif, Cristian Meo, Anirudh Goyal, Tianmin Shu, Michael Mozer, Nicolas Heess, Yoshua Bengio
Date:2022-05-21 14:11:33

In Multi-Agent Reinforcement Learning (MARL), specialized channels are often introduced that allow agents to communicate directly with one another. In this paper, we propose an alternative approach whereby agents communicate through an intelligent facilitator that learns to sift through and interpret signals provided by all agents to improve the agents' collective performance. To ensure that this facilitator does not become a centralized controller, agents are incentivized to reduce their dependence on the messages it conveys, and the messages can only influence the selection of a policy from a fixed set, not instantaneous actions given the policy. We demonstrate the strength of this architecture over existing baselines on several cooperative MARL environments.

Learning Progress Driven Multi-Agent Curriculum

Authors:Wenshuai Zhao, Zhiyuan Li, Joni Pajarinen
Date:2022-05-20 08:16:30

Curriculum reinforcement learning (CRL) aims to speed up learning by gradually increasing the difficulty of a task, usually quantified by the achievable expected return. Inspired by the success of CRL in single-agent settings, a few works have attempted to apply CRL to multi-agent reinforcement learning (MARL), using the number of agents to control task difficulty. However, existing works typically use manually defined curricula such as a linear scheme. In this paper, we first apply state-of-the-art single-agent self-paced CRL to sparse-reward MARL. Although performance is satisfactory, we identify two potential flaws of the curricula generated by existing reward-based CRL methods: (1) tasks with high returns may not provide informative learning signals, and (2) credit assignment becomes even harder in tasks where more agents yield higher returns. We therefore propose self-paced MARL (SPMARL), which prioritizes tasks based on \textit{learning progress} instead of the episode return. Our method not only outperforms baselines in three challenging sparse-reward benchmarks but also converges faster than self-paced CRL.
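
One simple way to operationalize learning progress (an illustrative choice, not necessarily the paper's estimator) is the gap between fast and slow running averages of each task's return: tasks whose returns are changing fastest get sampled more often, regardless of how high the return itself is.

```python
import numpy as np

class ProgressCurriculum:
    def __init__(self, num_tasks, beta=0.3, temp=1.0, seed=0):
        self.fast = np.zeros(num_tasks)   # fast EMA of returns per task
        self.slow = np.zeros(num_tasks)   # slow EMA of returns per task
        self.beta, self.temp = beta, temp
        self.rng = np.random.default_rng(seed)

    def update(self, task, ret):
        self.fast[task] += self.beta * (ret - self.fast[task])
        self.slow[task] += 0.1 * self.beta * (ret - self.slow[task])

    def sample(self):
        progress = np.abs(self.fast - self.slow)   # learning-progress proxy
        p = np.exp(progress / self.temp)
        p /= p.sum()
        return self.rng.choice(len(p), p=p)        # progress-weighted task

curriculum = ProgressCurriculum(num_tasks=4)
curriculum.update(2, ret=1.0)                      # task 2 is improving
print(curriculum.sample())
```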

Cross-subject Action Unit Detection with Meta Learning and Transformer-based Relation Modeling

Authors:Jiyuan Cao, Zhilei Liu, Yong Zhang
Date:2022-05-18 08:17:59

Facial Action Unit (AU) detection is a crucial task for emotion analysis from facial movements. Apparent differences between subjects can sometimes obscure the changes brought by AUs, resulting in inaccurate results. However, most existing deep-learning-based AU detection methods do not consider the identity information of different subjects. This paper proposes a meta-learning-based cross-subject AU detection model to eliminate identity-caused differences. In addition, a transformer-based relation learning module is introduced to learn the latent relations of multiple AUs. To be specific, our proposed work is composed of two sub-tasks. The first sub-task is meta-learning-based AU local region representation learning, called MARL, which learns a discriminative representation of local AU regions that incorporates the shared information of multiple subjects and eliminates identity-caused differences. The second sub-task uses the local region representation of AUs from the first sub-task as input, then adds relationship learning based on the transformer encoder architecture to capture AU relationships. The entire training process is cascaded. Ablation studies and visualizations show that our MARL can eliminate identity-caused differences, thus obtaining a robust and generalized AU discriminative embedding representation. Our results prove that on the two public datasets BP4D and DISFA, our method is superior to the state of the art, improving the F1 score by 1.3% and 1.4%, respectively.

Mobility-Aware Resource Allocation for mmWave IAB Networks: A Multi-Agent Reinforcement Learning Approach

Authors:Bibo Zhang, Ilario Filippini
Date:2022-05-12 10:46:30

MmWaves have been envisioned as a promising direction to provide Gbps wireless access. However, they are susceptible to high path losses and blockages, which directional antennas can only partially mitigate. That makes mmWave networks coverage-limited, thus requiring dense deployments. Integrated access and backhaul (IAB) architectures have emerged as a cost-effective solution for network densification. Resource allocation in mmWave IAB networks must face big challenges to cope with heavy temporal dynamics, such as intermittent links caused by user mobility and blockages from moving obstacles. This makes it extremely difficult to find optimal and adaptive solutions. In this article, exploiting the distributed structure of the problem, we propose a Multi-Agent Reinforcement Learning (MARL) framework to optimize user throughput via flow routing and link scheduling in mmWave IAB networks characterized by user mobility and link outages generated by moving obstacles. The proposed approach implicitly captures the environment dynamics, coordinates the interference, and manages the buffer levels of IAB relay nodes. We design different MARL components, considering full-duplex and half-duplex IAB-nodes. In addition, we provide a communication and coordination scheme for RL agents in an online training framework, addressing the feasibility issues of practical systems. Numerical results show the effectiveness of the proposed approach.

Efficient Distributed Framework for Collaborative Multi-Agent Reinforcement Learning

Authors:Shuhan Qi, Shuhao Zhang, Xiaohan Hou, Jiajia Zhang, Xuan Wang, Jing Xiao
Date:2022-05-11 03:12:49

Multi-agent reinforcement learning for incomplete-information environments has attracted extensive attention from researchers. However, due to slow sample collection and poor sample exploration, there are still some problems in multi-agent reinforcement learning, such as unstable model iteration and low training efficiency. Moreover, most existing distributed frameworks are proposed for single-agent reinforcement learning and are not suitable for the multi-agent setting. In this paper, we design a distributed MARL framework based on the actor-work-learner architecture. In this framework, multiple asynchronous environment interaction modules can be deployed simultaneously, which greatly improves the sample collection speed and sample diversity. Meanwhile, to make full use of computing resources, we decouple the model iteration from environment interaction, and thus accelerate policy iteration. Finally, we verify the effectiveness of the proposed framework in the MaCA military simulation environment and the SMAC 3D real-time strategy gaming environment, both of which have incomplete-information characteristics.

Multi-agent Reinforcement Learning for Dynamic Resource Management in 6G in-X Subnetworks

Authors:Xiao Du, Ting Wang, Qiang Feng, Chenhui Ye, Tao Tao, Yuanming Shi, Mingsong Chen
Date:2022-05-10 16:50:03

The 6G network enables a subnetwork-wide evolution, resulting in a "network of subnetworks". However, due to the dynamic mobility of wireless subnetworks, intra-subnetwork and inter-subnetwork data transmissions will inevitably interfere with each other, which poses a great challenge to radio resource management. Moreover, most of the existing approaches require the instantaneous channel gains between subnetworks, which are usually difficult to collect. To tackle these issues, in this paper we propose a novel and effective intelligent radio resource management method using multi-agent deep reinforcement learning (MARL), which only needs the sum of received power on each channel, named the received signal strength indicator (RSSI), instead of channel gains. However, directly separating individual interference sources from the RSSI is almost impossible. To this end, we further propose a novel MARL architecture, named GA-Net, which integrates a hard attention layer to model the importance distribution of inter-subnetwork relationships based on RSSI and exclude the impact of unrelated subnetworks, and employs a graph attention network with a multi-head attention layer to extract the features and calculate the weights that will impact individual throughput. Experimental results prove that our proposed framework significantly outperforms both traditional and MARL-based methods in various aspects.

Multi-Target Active Object Tracking with Monte Carlo Tree Search and Target Motion Modeling

Authors:Zheng Chen, Jian Zhao, Mingyu Yang, Wengang Zhou, Houqiang Li
Date:2022-05-07 05:08:15

In this work, we are dedicated to multi-target active object tracking (AOT), where there are multiple targets as well as multiple cameras in the environment. The goal is to maximize the overall target coverage of all cameras. Previous work makes the strong assumption that each camera is fixed in a location and only allowed to rotate, which limits its application. In this work, we relax the setting by allowing all cameras to both move along the boundary lines and rotate. In our setting, the action space becomes much larger, which leads to much higher computational complexity in identifying the optimal action. To this end, we propose to leverage the action selection from a multi-agent reinforcement learning (MARL) network to prune the search tree of the Monte Carlo Tree Search (MCTS) method, so as to find the optimal action more efficiently. Besides, we model the motion of the targets to predict their future positions, which yields a better estimation of the future environment state in the MCTS process. We establish a multi-target 2D environment to simulate sports games, and experimental results demonstrate that our method can effectively improve the target coverage.

General sum stochastic games with networked information flows

Authors:Sarah H. Q. Li, Lillian J. Ratliff, Peeyush Kumar
Date:2022-05-05 16:33:24

Inspired by applications such as supply chain management, epidemics, and social networks, we formulate a stochastic game model that addresses three key features common across these domains: 1) network-structured player interactions, 2) pair-wise mixed cooperation and competition among players, and 3) limited global information toward individual decision-making. In combination, these features pose significant challenges for black box approaches taken by deep learning-based multi-agent reinforcement learning (MARL) algorithms and deserve more detailed analysis. We formulate a networked stochastic game with pair-wise general sum objectives and asymmetrical information structure, and empirically explore the effects of information availability on the outcomes of different MARL paradigms such as individual learning and centralized learning decentralized execution.

LDSA: Learning Dynamic Subtask Assignment in Cooperative Multi-Agent Reinforcement Learning

Authors:Mingyu Yang, Jian Zhao, Xunhan Hu, Wengang Zhou, Jiangcheng Zhu, Houqiang Li
Date:2022-05-05 10:46:16

Cooperative multi-agent reinforcement learning (MARL) has made prominent progress in recent years. For training efficiency and scalability, most MARL algorithms make all agents share the same policy or value network. However, in many complex multi-agent tasks, different agents are expected to possess specific abilities to handle different subtasks. In those scenarios, sharing parameters indiscriminately may lead to similar behavior across all agents, which limits exploration efficiency and degrades final performance. To balance training complexity and the diversity of agent behavior, we propose a novel framework to learn dynamic subtask assignment (LDSA) in cooperative MARL. Specifically, we first introduce a subtask encoder to construct a vector representation for each subtask according to its identity. To reasonably assign agents to different subtasks, we propose an ability-based subtask selection strategy, which can dynamically group agents with similar abilities into the same subtask. In this way, agents dealing with the same subtask share their learning of specific abilities, and different subtasks correspond to different specific abilities. We further introduce two regularizers to, respectively, increase the representation difference between subtasks and stabilize training by discouraging agents from frequently changing subtasks. Empirical results show that LDSA learns reasonable and effective subtask assignments for better collaboration and significantly improves learning performance on the challenging StarCraft II micromanagement benchmark and Google Research Football.
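
A minimal numpy sketch of ability-based subtask selection in the spirit of the above, with an assumed subtask-encoder output and per-agent ability vectors; the real method trains these representations end-to-end rather than sampling them.

```python
# Agents are matched to the subtask whose (assumed) encoder embedding
# best fits their ability vector; all tensors here are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_subtasks, dim = 6, 3, 8

subtask_emb = rng.normal(size=(n_subtasks, dim))   # subtask encoder output
agent_ability = rng.normal(size=(n_agents, dim))   # per-agent ability features

logits = agent_ability @ subtask_emb.T             # compatibility scores
logits -= logits.max(axis=1, keepdims=True)        # stabilized softmax
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
assignment = probs.argmax(axis=1)                  # greedy subtask per agent
print("subtask assignment:", assignment)
```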

Hierarchical Control for Cooperative Teams in Competitive Autonomous Racing

Authors:Rishabh Saumil Thakkar, Aryaman Singh Samyal, David Fridovich-Keil, Zhe Xu, Ufuk Topcu
Date:2022-04-27 17:08:56

We investigate the problem of autonomous racing among teams of cooperative agents that are subject to realistic racing rules. Our work extends previous research on hierarchical control in head-to-head autonomous racing by considering a generalized version of the problem while maintaining the two-level hierarchical control structure. A high-level tactical planner constructs a discrete game that encodes the complex rules using simplified dynamics to produce a sequence of target waypoints. The low-level path planner uses these waypoints as a reference trajectory and computes high-resolution control inputs by solving a simplified formulation of a racing game with a simplified representation of the realistic racing rules. We explore two approaches for the low-level path planner: training a multi-agent reinforcement learning (MARL) policy and solving a linear-quadratic Nash game (LQNG) approximation. We evaluate our controllers on simple and complex tracks against three baselines: an end-to-end MARL controller, a MARL controller tracking a fixed racing line, and an LQNG controller tracking a fixed racing line. Quantitative results show our hierarchical methods outperform the baselines in terms of race wins, overall team performance, and compliance with the rules. Qualitatively, we observe the hierarchical controllers mimic actions performed by expert human drivers such as coordinated overtaking, defending against multiple opponents, and long-term planning for delayed advantages.

Toward Policy Explanations for Multi-Agent Reinforcement Learning

Authors:Kayla Boggess, Sarit Kraus, Lu Feng
Date:2022-04-26 20:07:08

Advances in multi-agent reinforcement learning (MARL) enable sequential decision making for a range of exciting multi-agent applications such as cooperative AI and autonomous driving. Explaining agent decisions is crucial for improving system transparency, increasing user satisfaction, and facilitating human-agent collaboration. However, existing works on explainable reinforcement learning mostly focus on the single-agent setting and are not suitable for addressing challenges posed by multi-agent environments. We present novel methods to generate two types of policy explanations for MARL: (i) policy summarization about the agent cooperation and task sequence, and (ii) language explanations to answer queries about agent behavior. Experimental results on three MARL domains demonstrate the scalability of our methods. A user study shows that the generated explanations significantly improve user performance and increase subjective ratings on metrics such as user satisfaction.

PP-MARL: Efficient Privacy-Preserving Multi-Agent Reinforcement Learning for Cooperative Intelligence in Communications

Authors:Tingting Yuan, Hwei-Ming Chung, Xiaoming Fu
Date:2022-04-26 04:08:27

Cooperative intelligence (CI) is expected to become an integral element in next-generation networks because it can aggregate the capabilities and intelligence of multiple devices. Multi-agent reinforcement learning (MARL) is a popular approach for achieving CI in communication problems by enabling effective collaboration among agents to address sequential problems. However, ensuring privacy protection for MARL is a challenging task because of the presence of heterogeneous agents that learn interdependently via sharing information. Applying privacy protection techniques such as data encryption and federated learning to MARL introduces notable overheads (e.g., computation and bandwidth). To overcome these challenges, we propose PP-MARL, an efficient privacy-preserving learning scheme for MARL. PP-MARL leverages homomorphic encryption (HE) and differential privacy (DP) to protect privacy, while introducing split learning to decrease overheads by reducing the volume of shared messages, thereby improving efficiency. We apply and evaluate PP-MARL in two communication-related use cases. Simulation results reveal that PP-MARL can achieve efficient and reliable collaboration with 1.1-6 times better privacy protection and lower overheads (e.g., 84-91% reduction in bandwidth) than state-of-the-art approaches.

Collaborative Auto-Curricula Multi-Agent Reinforcement Learning with Graph Neural Network Communication Layer for Open-ended Wildfire-Management Resource Distribution

Authors:Philipp Dominic Siedler
Date:2022-04-24 20:13:30

Most real-world domains can be formulated as multi-agent (MA) systems. Agents that share intentionality can solve more complex tasks by collaborating, possibly in less time. True cooperative actions are beneficial for egoistic and collective reasons. However, teaching individual agents to sacrifice egoistic benefits for better collective performance seems challenging. We build on a recently proposed Multi-Agent Reinforcement Learning (MARL) mechanism with a Graph Neural Network (GNN) communication layer, in which rarely chosen communication actions were only marginally beneficial. Here we propose a MARL system in which agents can help collaborators perform better while risking low individual performance. We conduct our study in the context of resource distribution for wildfire management. Communicating environmental features and partially observable fire occurrences helps the agent collective pre-emptively distribute resources. Furthermore, we introduce a procedural training environment accommodating auto-curricula and open-endedness towards better generalizability. Our MA communication proposal outperforms a Greedy Heuristic Baseline and a Single-Agent (SA) setup. We further demonstrate how auto-curricula and open-endedness improve the generalizability of our MA proposal.

Exact Formulas for Finite-Time Estimation Errors of Decentralized Temporal Difference Learning with Linear Function Approximation

Authors:Xingang Guo, Bin Hu
Date:2022-04-20 22:02:15

In this paper, we consider the policy evaluation problem in multi-agent reinforcement learning (MARL) and derive exact closed-form formulas for the finite-time mean-squared estimation errors of decentralized temporal difference (TD) learning with linear function approximation. Our analysis hinges upon the fact that the decentralized TD learning method can be viewed as a Markov jump linear system (MJLS). Then standard MJLS theory can be applied to quantify the mean and covariance matrix of the estimation error of the decentralized TD method at every time step. Various implications of our exact formulas on the algorithm performance are also discussed. An interesting finding is that under a necessary and sufficient stability condition, the mean-squared TD estimation error will converge to an exact limit at a specific exponential rate.
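
For reference, a standard form of the decentralized TD(0) update with linear function approximation that such analyses concern, written in assumed notation (the paper's exact formulation may differ): each agent i mixes neighbors' parameters through a weight matrix W and then takes a local TD step.

```latex
% Decentralized TD(0) with linear function approximation (notation assumed).
\theta_i^{t+1} \;=\; \sum_{j \in \mathcal{N}_i} W_{ij}\,\theta_j^{t}
  \;+\; \alpha\,\delta_i^{t}\,\phi(s_t),
\qquad
\delta_i^{t} \;=\; r_i^{t} + \gamma\,\phi(s_{t+1})^{\top}\theta_i^{t}
  - \phi(s_t)^{\top}\theta_i^{t}
```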

Multi-UAV Collision Avoidance using Multi-Agent Reinforcement Learning with Counterfactual Credit Assignment

Authors:Shuangyao Huang, Haibo Zhang, Zhiyi Huang
Date:2022-04-19 00:28:51

Multi-UAV collision avoidance is a challenging task for UAV swarm applications due to the need for tight cooperation among swarm members for collision-free path planning. Centralized Training with Decentralized Execution (CTDE) in Multi-Agent Reinforcement Learning is a promising method for multi-UAV collision avoidance, in which the key challenge is to effectively learn decentralized policies that can maximize a global reward cooperatively. We propose a new multi-agent critic-actor learning scheme called MACA for UAV swarm collision avoidance. MACA uses a centralized critic to maximize the discounted global reward that considers both safety and energy efficiency, and an actor per UAV to find decentralized policies that avoid collisions. To solve the credit assignment problem in CTDE, we design a counterfactual baseline that marginalizes both an agent's state and action, making it possible to evaluate the importance of an agent in the joint observation-action space. To train and evaluate MACA, we design our own simulation environment MACAEnv to closely mimic the realistic behaviors of a UAV swarm. Simulation results show that MACA achieves more than 16% higher average reward than two state-of-the-art MARL algorithms and reduces the failure rate by 90% and the response time by over 99% compared to a conventional UAV swarm collision avoidance algorithm in all test scenarios.
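
A hedged numpy sketch of a counterfactual baseline of this kind: the chosen action's Q-value is compared against the expectation over agent i's alternative actions with the other agents' actions held fixed (COMA-style; the method above additionally marginalizes the agent's state, which this sketch omits).

```python
# Counterfactual advantage for agent i; `q_values` is an assumed output
# of a centralized critic with other agents' actions held fixed.
import numpy as np

def counterfactual_advantage(q_values: np.ndarray, pi_i: np.ndarray, a_i: int):
    """q_values[a] = Q(s, (a, a_-i)); pi_i = agent i's action distribution."""
    baseline = float(pi_i @ q_values)      # E_{a'~pi_i} Q(s, (a', a_-i))
    return q_values[a_i] - baseline

q = np.array([1.0, 2.5, 0.5])
pi = np.array([0.2, 0.5, 0.3])
print(counterfactual_advantage(q, pi, a_i=1))  # 2.5 - 1.6 = 0.9
```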

Towards Comprehensive Testing on the Robustness of Cooperative Multi-agent Reinforcement Learning

Authors:Jun Guo, Yonghong Chen, Yihang Hao, Zixin Yin, Yin Yu, Simin Li
Date:2022-04-17 05:15:51

While deep neural networks (DNNs) have strengthened the performance of cooperative multi-agent reinforcement learning (c-MARL), agent policies can be easily perturbed by adversarial examples. Considering the safety-critical applications of c-MARL, such as traffic management, power management, and unmanned aerial vehicle control, it is crucial to test the robustness of a c-MARL algorithm before it is deployed in the real world. Existing adversarial attacks for MARL could be used for testing, but each is limited to a single robustness aspect (e.g., reward, state, or action), while a c-MARL model could be attacked from any of these aspects. To overcome this challenge, we propose MARLSafe, the first robustness testing framework for c-MARL algorithms. First, motivated by the Markov Decision Process (MDP) formulation, MARLSafe considers the robustness of c-MARL algorithms comprehensively from three aspects, namely state robustness, action robustness, and reward robustness. Any c-MARL algorithm must simultaneously satisfy these robustness aspects to be considered secure. Second, due to the scarcity of c-MARL attacks, we propose c-MARL attacks from multiple aspects as robustness testing algorithms. Experiments in the SMAC environment reveal that many state-of-the-art c-MARL algorithms have low robustness in all aspects, pointing out the urgent need to test and enhance the robustness of c-MARL algorithms.

The Complexity of Markov Equilibrium in Stochastic Games

Authors:Constantinos Daskalakis, Noah Golowich, Kaiqing Zhang
Date:2022-04-08 10:51:01

We show that computing approximate stationary Markov coarse correlated equilibria (CCE) in general-sum stochastic games is computationally intractable, even when there are two players, the game is turn-based, the discount factor is an absolute constant, and the approximation is an absolute constant. Our intractability results stand in sharp contrast to normal-form games where exact CCEs are efficiently computable. A fortiori, our results imply that there are no efficient algorithms for learning stationary Markov CCE policies in multi-agent reinforcement learning (MARL), even when the interaction is two-player and turn-based, and both the discount factor and the desired approximation of the learned policies is an absolute constant. In turn, these results stand in sharp contrast to single-agent reinforcement learning (RL) where near-optimal stationary Markov policies can be efficiently learned. Complementing our intractability results for stationary Markov CCEs, we provide a decentralized algorithm (assuming shared randomness among players) for learning a nonstationary Markov CCE policy with polynomial time and sample complexity in all problem parameters. Previous work for learning Markov CCE policies all required exponential time and sample complexity in the number of players.

Distributed Reinforcement Learning for Robot Teams: A Review

Authors:Yutong Wang, Mehul Damani, Pamela Wang, Yuhong Cao, Guillaume Sartoretti
Date:2022-04-07 15:34:19

Purpose of review: Recent advances in sensing, actuation, and computation have opened the door to multi-robot systems consisting of hundreds/thousands of robots, with promising applications to automated manufacturing, disaster relief, harvesting, last-mile delivery, port/airport operations, or search and rescue. The community has leveraged model-free multi-agent reinforcement learning (MARL) to devise efficient, scalable controllers for multi-robot systems (MRS). This review aims to provide an analysis of the state-of-the-art in distributed MARL for multi-robot cooperation. Recent findings: Decentralized MRS face fundamental challenges, such as non-stationarity and partial observability. Building upon the "centralized training, decentralized execution" paradigm, recent MARL approaches include independent learning, centralized critic, value decomposition, and communication learning approaches. Cooperative behaviors are demonstrated through AI benchmarks and fundamental real-world robotic capabilities such as multi-robot motion/path planning. Summary: This survey reports the challenges surrounding decentralized model-free MARL for multi-robot cooperation and existing classes of approaches. We present benchmarks and robotic applications along with a discussion on current open avenues for research.

UNMAS: Multi-Agent Reinforcement Learning for Unshaped Cooperative Scenarios

Authors:Jiajun Chai, Weifan Li, Yuanheng Zhu, Dongbin Zhao, Zhe Ma, Kewu Sun, Jishiyu Ding
Date:2022-03-28 03:30:26

Multi-agent reinforcement learning methods such as VDN, QMIX, and QTRAN that adopt the centralized training with decentralized execution (CTDE) framework have shown promising results in cooperation and competition. However, in some multi-agent scenarios, the number of agents and the size of the action set actually vary over time. We call these unshaped scenarios, and the methods mentioned above fail to perform satisfactorily in them. In this paper, we propose a new method called Unshaped Networks for Multi-Agent Systems (UNMAS) that adapts to changes in the number of agents and the size of the action set. We propose the self-weighting mixing network to factorize the joint action-value. Its adaptation to changes in the number of agents is attributed to the nonlinear mapping from each agent's Q-value to the joint action-value with individual weights. Besides, to address changes in the action set, each agent constructs an individual action-value network composed of two streams that evaluate the constant environment-oriented subset and the varying unit-oriented subset. We evaluate UNMAS on various StarCraft II micro-management scenarios and compare the results with several state-of-the-art MARL algorithms. The superiority of UNMAS is demonstrated by its highest winning rates, especially on the most difficult scenario 3s5z_vs_3s6z. The agents learn to perform effective cooperative behaviors where other MARL algorithms fail. Animated demonstrations and source code are provided at https://sites.google.com/view/unmas.

Remember and Forget Experience Replay for Multi-Agent Reinforcement Learning

Authors:Pascal Weber, Daniel Wälchli, Mustafa Zeqiri, Petros Koumoutsakos
Date:2022-03-24 19:59:43

We present the extension of the Remember and Forget for Experience Replay (ReF-ER) algorithm to Multi-Agent Reinforcement Learning (MARL). ReF-ER was shown to outperform state-of-the-art algorithms for continuous control in problems ranging from the OpenAI Gym to complex fluid flows. In MARL, the dependencies between the agents are included in the state-value estimator, and the environment dynamics are modeled via the importance weights used by ReF-ER. In collaborative environments, we find the best performance when the value is estimated using individual rewards and the effects of other actions on the transition map are ignored. We benchmark the performance of ReF-ER MARL on the Stanford Intelligent Systems Laboratory (SISL) environments. We find that employing a single feed-forward neural network for the policy and the value function in ReF-ER MARL outperforms state-of-the-art algorithms that rely on complex neural network architectures.

A Decentralised Multi-Agent Reinforcement Learning Approach for the Same-Day Delivery Problem

Authors:Elvin Ngu, Leandro Parada, Jose Javier Escribano Macias, Panagiotis Angeloudis
Date:2022-03-22 12:33:41

Same-Day Delivery services have become increasingly popular in recent years. Previous studies have usually modelled these as a certain class of Dynamic Vehicle Routing Problem (DVRP) in which goods must be delivered from a depot to a set of customers on the same day that the orders are placed. Adaptive exact solution methods for DVRPs can become intractable even for small problem instances. In this paper, we formulate the same-day delivery problem (SDDP) as a Markov Decision Process (MDP) and solve it using a parameter-sharing Deep Q-Network, which corresponds to a decentralised Multi-Agent Reinforcement Learning (MARL) approach. For this, we create a multi-agent grid-based SDD environment consisting of multiple vehicles, a central depot, and dynamic order generation. In addition, we introduce zone-specific order generation and reward probabilities. We compare the performance of our proposed MARL approach against a Mixed Integer Programming (MIP) solution. Results show that our proposed MARL framework performs on par with the MIP-based policy when the number of orders is relatively low. For problem instances with higher order arrival rates, computational results show that the MARL approach underperforms the MIP by up to 30%. The performance gap between the two methods becomes smaller when zone-specific parameters are employed: the gap is reduced from 30% to 3% for a 5x5 grid scenario with 30 orders. Execution time results indicate that the MARL approach is, on average, 65 times faster than the MIP-based policy, and may therefore be more advantageous for real-time control, at least for small-sized instances.
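
A minimal PyTorch sketch of the parameter-sharing Deep Q-Network idea: every vehicle agent queries one shared network on its own observation, so parameters are learned jointly across agents. Dimensions and layer sizes are illustrative assumptions.

```python
# One Q-network shared by all agents; each agent is distinguished only
# by its own observation (an agent id could be appended to the input).
import torch
import torch.nn as nn

class SharedDQN(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)  # Q-value per discrete action

q_net = SharedDQN(obs_dim=16, n_actions=5)
obs_batch = torch.randn(3, 16)             # three vehicles, one shared network
actions = q_net(obs_batch).argmax(dim=1)   # greedy action per vehicle
print(actions)
```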

Model-based Multi-agent Reinforcement Learning: Recent Progress and Prospects

Authors:Xihuai Wang, Zhicheng Zhang, Weinan Zhang
Date:2022-03-20 17:24:47

Significant advances have recently been achieved in Multi-Agent Reinforcement Learning (MARL), which tackles sequential decision-making problems involving multiple participants. However, MARL requires a tremendous number of samples for effective training. On the other hand, model-based methods have been shown to achieve provable advantages in sample efficiency. However, attempts to apply model-based methods to MARL have only started very recently. This paper presents a review of the existing research on model-based MARL, including theoretical analyses, algorithms, and applications, and analyzes the advantages and potential of model-based MARL. Specifically, we provide a detailed taxonomy of the algorithms and point out the pros and cons of each algorithm according to the challenges inherent to multi-agent scenarios. We also outline promising directions for the future development of this field.

Quantum Multi-Agent Reinforcement Learning via Variational Quantum Circuit Design

Authors:Won Joon Yun, Yunseok Kwak, Jae Pyoung Kim, Hyunhee Cho, Soyi Jung, Jihong Park, Joongheon Kim
Date:2022-03-20 03:44:45

In recent years, quantum computing (QC) has been receiving a lot of attention from industry and academia. In particular, among the various QC research topics, the variational quantum circuit (VQC) enables quantum deep reinforcement learning (QRL). Many studies of QRL have shown that QRL is superior to classical reinforcement learning (RL) methods under constraints on the number of training parameters. This paper extends QRL to quantum multi-agent RL (QMARL). However, the extension of QRL to QMARL is not straightforward due to the challenges posed by noisy intermediate-scale quantum (NISQ) devices and the non-stationarity inherent in classical multi-agent RL (MARL). Therefore, this paper proposes a centralized training and decentralized execution (CTDE) QMARL framework with novel VQC designs to cope with these issues. To corroborate the QMARL framework, this paper conducts a QMARL demonstration in a single-hop environment where edge agents offload packets to clouds. The extensive demonstration shows that the proposed QMARL framework achieves a 57.7% higher total reward than classical frameworks.

Backpropagation through Time and Space: Learning Numerical Methods with Multi-Agent Reinforcement Learning

Authors:Elliot Way, Dheeraj S. K. Kapilavai, Yiwei Fu, Lei Yu
Date:2022-03-16 20:50:24

We introduce Backpropagation Through Time and Space (BPTTS), a method for training a recurrent spatio-temporal neural network, that is used in a homogeneous multi-agent reinforcement learning (MARL) setting to learn numerical methods for hyperbolic conservation laws. We treat the numerical schemes underlying partial differential equations (PDEs) as a Partially Observable Markov Game (POMG) in Reinforcement Learning (RL). Similar to numerical solvers, our agent acts at each discrete location of a computational space for efficient and generalizable learning. To learn higher-order spatial methods by acting on local states, the agent must discern how its actions at a given spatiotemporal location affect the future evolution of the state. The manifestation of this non-stationarity is addressed by BPTTS, which allows for the flow of gradients across both space and time. The learned numerical policies are comparable to the SOTA numerics in two settings, the Burgers' Equation and the Euler Equations, and generalize well to other simulation set-ups.

PMIC: Improving Multi-Agent Reinforcement Learning with Progressive Mutual Information Collaboration

Authors:Pengyi Li, Hongyao Tang, Tianpei Yang, Xiaotian Hao, Tong Sang, Yan Zheng, Jianye Hao, Matthew E. Taylor, Wenyuan Tao, Zhen Wang, Fazl Barez
Date:2022-03-16 11:28:23

Learning to collaborate is critical in Multi-Agent Reinforcement Learning (MARL). Previous works promote collaboration by maximizing the correlation of agents' behaviors, which is typically characterized by Mutual Information (MI) in different forms. However, we reveal that sub-optimal collaborative behaviors also emerge with strong correlations, and simply maximizing the MI can, surprisingly, hinder learning towards better collaboration. To address this issue, we propose a novel MARL framework, called Progressive Mutual Information Collaboration (PMIC), for more effective MI-driven collaboration. PMIC uses a new collaboration criterion measured by the MI between global states and joint actions. Based on this criterion, the key idea of PMIC is to maximize the MI associated with superior collaborative behaviors and minimize the MI associated with inferior ones. The two MI objectives play complementary roles by facilitating better collaboration while avoiding falling into sub-optimal behaviors. Experiments on a wide range of MARL benchmarks show the superior performance of PMIC compared with other algorithms.
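
Schematically, this dual objective can be written as follows (our notation, not the paper's exact loss): the MI between global states s and joint actions a is maximized on superior trajectories D+ and minimized on inferior ones D-.

```latex
% Schematic dual MI objective (notation assumed, beta a trade-off weight).
\max_{\boldsymbol{\pi}} \;
  I(s;\mathbf{a})\big|_{\mathcal{D}^{+}}
  \;-\; \beta \, I(s;\mathbf{a})\big|_{\mathcal{D}^{-}},
\qquad
I(s;\mathbf{a}) = \mathbb{E}_{p(s,\mathbf{a})}\!\left[
  \log \frac{p(s,\mathbf{a})}{p(s)\,p(\mathbf{a})}\right]
```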

CTDS: Centralized Teacher with Decentralized Student for Multi-Agent Reinforcement Learning

Authors:Jian Zhao, Xunhan Hu, Mingyu Yang, Wengang Zhou, Jiangcheng Zhu, Houqiang Li
Date:2022-03-16 06:03:14

Due to the partial observability and communication constraints in many multi-agent reinforcement learning (MARL) tasks, centralized training with decentralized execution (CTDE) has become one of the most widely used MARL paradigms. In CTDE, centralized information is dedicated to learning the allocation of the team reward with a mixing network, while the learning of individual Q-values is usually based on local observations. This insufficient utilization of global observation degrades performance in challenging environments. To this end, this work proposes a novel Centralized Teacher with Decentralized Student (CTDS) framework, which consists of a teacher model and a student model. Specifically, the teacher model allocates the team reward by learning individual Q-values conditioned on the global observation, while the student model utilizes the partial observations to approximate the Q-values estimated by the teacher model. In this way, CTDS balances the full utilization of global observation during training with the feasibility of decentralized execution for online inference. Our CTDS framework is generic and ready to be applied on top of existing CTDE methods to boost their performance. We conduct experiments on a challenging set of StarCraft II micromanagement tasks to test the effectiveness of our method, and the results show that CTDS outperforms existing value-based MARL methods.
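
A hedged PyTorch sketch of the teacher-student step described above: a teacher estimates Q-values from the global observation and a per-agent student regresses those targets from its partial observation only. Both networks are toy stand-ins for the actual architectures.

```python
# Distillation from a global-observation teacher to a local-observation
# student; dimensions and single-layer networks are illustrative.
import torch
import torch.nn as nn

global_dim, local_dim, n_actions = 32, 12, 6
teacher = nn.Linear(global_dim, n_actions)   # trained with the mixing network
student = nn.Linear(local_dim, n_actions)    # used at decentralized execution
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

global_obs = torch.randn(8, global_dim)      # batch of global states
local_obs = torch.randn(8, local_dim)        # matching partial observations

with torch.no_grad():
    q_teacher = teacher(global_obs)          # distillation targets
q_student = student(local_obs)
loss = nn.functional.mse_loss(q_student, q_teacher)
opt.zero_grad()
loss.backward()
opt.step()
print("distillation loss:", loss.item())
```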

An Introduction to Multi-Agent Reinforcement Learning and Review of its Application to Autonomous Mobility

Authors:Lukas M. Schmidt, Johanna Brosig, Axel Plinge, Bjoern M. Eskofier, Christopher Mutschler
Date:2022-03-15 06:40:28

Many scenarios in mobility and traffic involve multiple different agents that need to cooperate to find a joint solution. Recent advances in behavioral planning use Reinforcement Learning to find effective and performant behavior strategies. However, as autonomous vehicles and vehicle-to-X communications become more mature, solutions that only utilize single, independent agents leave potential performance gains on the road. Multi-Agent Reinforcement Learning (MARL) is a research field that aims to find optimal solutions for multiple agents that interact with each other. This work aims to give an overview of the field to researchers in autonomous mobility. We first explain MARL and introduce important concepts. Then, we discuss the central paradigms that underlie MARL algorithms, and give an overview of state-of-the-art methods and ideas in each paradigm. With this background, we survey applications of MARL in autonomous mobility scenarios and give an overview of existing scenarios and implementations.

The Multi-Agent Pickup and Delivery Problem: MAPF, MARL and Its Warehouse Applications

Authors:Tim Tsz-Kit Lau, Biswa Sengupta
Date:2022-03-14 13:23:35

We study two state-of-the-art solutions to the multi-agent pickup and delivery (MAPD) problem based on different principles -- multi-agent path-finding (MAPF) and multi-agent reinforcement learning (MARL). Specifically, a recent MAPF algorithm called conflict-based search (CBS) and a current MARL algorithm called shared experience actor-critic (SEAC) are studied. While the performance of these algorithms is measured using quite different metrics in their separate lines of work, we aim to benchmark these two methods comprehensively in a simulated warehouse automation environment.

Stable and Efficient Shapley Value-Based Reward Reallocation for Multi-Agent Reinforcement Learning of Autonomous Vehicles

Authors:Songyang Han, He Wang, Sanbao Su, Yuanyuan Shi, Fei Miao
Date:2022-03-12 03:56:01

With the development of sensing and communication technologies in networked cyber-physical systems (CPSs), multi-agent reinforcement learning (MARL)-based methodologies are being integrated into the control process of physical systems and demonstrate prominent performance in a wide array of CPS domains, such as connected autonomous vehicles (CAVs). However, it remains challenging to mathematically characterize the improvement in the performance of CAVs with communication and cooperation capability. Since each individual autonomous vehicle is inherently self-interested, we cannot assume that all agents will cooperate naturally during the training process. In this work, we propose to reallocate the system's total reward efficiently to motivate stable cooperation among autonomous vehicles. We formally define and quantify how to reallocate the system's total reward to each agent under the proposed transferable utility game, such that communication-based cooperation among agents increases the system's total reward. We prove that the Shapley value-based reward reallocation of MARL lies in the core if the transferable utility game is a convex game; hence, the cooperation is stable and efficient, and the agents should stay in the coalition or the cooperating group. We then propose a cooperative policy learning algorithm with Shapley value reward reallocation. In experiments, compared with several literature algorithms, we show the improvement in the mean episode system reward of CAV systems using our proposed algorithm.
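
For concreteness, a small self-contained computation of the exact Shapley value for a toy convex coalition game; the characteristic function `value` is an assumed stand-in for the system's total reward under a coalition of agents.

```python
# Exact Shapley value by enumerating coalitions (tractable for small n).
from itertools import combinations
from math import factorial

def shapley(n_agents, value):
    players = range(n_agents)
    phi = [0.0] * n_agents
    for i in players:
        others = [j for j in players if j != i]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                # Probability weight of coalition S preceding player i.
                w = factorial(len(S)) * factorial(n_agents - len(S) - 1) \
                    / factorial(n_agents)
                phi[i] += w * (value(set(S) | {i}) - value(set(S)))
    return phi

# Toy convex game: the coalition's value is superadditive in its size.
v = lambda S: len(S) ** 2
print(shapley(3, v))  # [3.0, 3.0, 3.0]; the allocation lies in the core
```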

Breaking the Curse of Dimensionality in Multiagent State Space: A Unified Agent Permutation Framework

Authors:Xiaotian Hao, Hangyu Mao, Weixun Wang, Yaodong Yang, Dong Li, Yan Zheng, Zhen Wang, Jianye Hao
Date:2022-03-10 11:00:53

The state space in Multiagent Reinforcement Learning (MARL) grows exponentially with the agent number. Such a curse of dimensionality results in poor scalability and low sample efficiency, inhibiting MARL for decades. To break this curse, we propose a unified agent permutation framework that exploits the permutation invariance (PI) and permutation equivariance (PE) inductive biases to reduce the multiagent state space. Our insight is that permuting the order of entities in the factored multiagent state space does not change the information. Specifically, we propose two novel implementations: a Dynamic Permutation Network (DPN) and a Hyper Policy Network (HPN). The core idea is to build separate entity-wise PI input and PE output network modules to connect the entity-factored state space and action space in an end-to-end way. DPN achieves such connections by two separate module selection networks, which consistently assign the same input module to the same input entity (guarantee PI) and assign the same output module to the same entity-related output (guarantee PE). To enhance the representation capability, HPN replaces the module selection networks of DPN with hypernetworks to directly generate the corresponding module weights. Extensive experiments in SMAC, Google Research Football and MPE validate that the proposed methods significantly boost the performance and the learning efficiency of existing MARL algorithms. Remarkably, in SMAC, we achieve 100% win rates in almost all hard and super-hard scenarios (never achieved before).
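
A hedged Deep Sets-style sketch of the PI/PE idea (not the paper's exact DPN/HPN modules): a per-entity encoder shared across entities plus sum pooling gives a permutation-invariant state summary, and a shared per-entity head gives permutation-equivariant outputs.

```python
# Permutation-invariant pooling + permutation-equivariant per-entity head.
import torch
import torch.nn as nn

class PIPENet(nn.Module):
    def __init__(self, entity_dim=10, hidden=32, out_per_entity=4):
        super().__init__()
        self.enc = nn.Linear(entity_dim, hidden)       # shared across entities
        self.head = nn.Linear(hidden + hidden, out_per_entity)

    def forward(self, entities: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.enc(entities))             # (batch, n_ent, hidden)
        pooled = h.sum(dim=1, keepdim=True)            # PI summary of the state
        ctx = pooled.expand_as(h)
        return self.head(torch.cat([h, ctx], dim=-1))  # PE per-entity outputs

net = PIPENet()
x = torch.randn(2, 5, 10)
perm = torch.randperm(5)
out, out_p = net(x), net(x[:, perm])
print(torch.allclose(out[:, perm], out_p, atol=1e-5))  # True: equivariant
```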

On-the-fly Strategy Adaptation for ad-hoc Agent Coordination

Authors:Jaleh Zand, Jack Parker-Holder, Stephen J. Roberts
Date:2022-03-08 02:18:11

Training agents in cooperative settings offers the promise of AI agents able to interact effectively with humans (and other agents) in the real world. Multi-agent reinforcement learning (MARL) has the potential to achieve this goal, demonstrating success in a series of challenging problems. However, whilst these advances are significant, the vast majority of focus has been on the self-play paradigm. This often results in a coordination problem, caused by agents learning to make use of arbitrary conventions when playing with themselves. This means that even the strongest self-play agents may have very low cross-play with other agents, including other initializations of the same algorithm. In this paper we propose to solve this problem by adapting agent strategies on the fly, using a posterior belief over the other agents' strategy. Concretely, we consider the problem of selecting a strategy from a finite set of previously trained agents, to play with an unknown partner. We propose an extension of the classic statistical technique, Gibbs sampling, to update beliefs about other agents and obtain close to optimal ad-hoc performance. Despite its simplicity, our method is able to achieve strong cross-play with unseen partners in the challenging card game of Hanabi, achieving successful ad-hoc coordination without knowledge of the partner's strategy a priori.
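
A minimal numpy sketch of the belief-updating idea for a small finite strategy set, using exact Bayesian updates rather than the paper's Gibbs-sampling machinery; the candidate policies and observed actions are toy placeholders.

```python
# Posterior belief over a finite set of previously trained partner
# strategies, updated from the partner's observed actions.
import numpy as np

# Each row: a candidate partner strategy's action distribution (3 actions).
candidates = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.3, 0.3, 0.4]])
belief = np.full(len(candidates), 1 / 3)      # uniform prior

for observed_action in [1, 1, 2, 1]:          # partner's observed moves
    belief *= candidates[:, observed_action]  # Bayes: prior x likelihood
    belief /= belief.sum()

print("posterior:", belief.round(3))
best_guess = belief.argmax()                  # respond to the likeliest partner
print("most likely partner strategy:", best_guess)
```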

Influencing Long-Term Behavior in Multiagent Reinforcement Learning

Authors:Dong-Ki Kim, Matthew Riemer, Miao Liu, Jakob N. Foerster, Michael Everett, Chuangchuang Sun, Gerald Tesauro, Jonathan P. How
Date:2022-03-07 17:32:35

The main challenge of multiagent reinforcement learning is the difficulty of learning useful policies in the presence of other simultaneously learning agents whose changing behaviors jointly affect the environment's transition and reward dynamics. An effective approach that has recently emerged for addressing this non-stationarity is for each agent to anticipate the learning of other agents and influence the evolution of future policies towards desirable behavior for its own benefit. Unfortunately, previous approaches for achieving this suffer from myopic evaluation, considering only a finite number of policy updates. As such, these methods can only influence transient future policies rather than achieving the promise of scalable equilibrium selection approaches that influence the behavior at convergence. In this paper, we propose a principled framework for considering the limiting policies of other agents as time approaches infinity. Specifically, we develop a new optimization objective that maximizes each agent's average reward by directly accounting for the impact of its behavior on the limiting set of policies that other agents will converge to. Our paper characterizes desirable solution concepts within this problem setting and provides practical approaches for optimizing over possible outcomes. As a result of our farsighted objective, we demonstrate better long-term performance than state-of-the-art baselines across a suite of diverse multiagent benchmark domains.

Reliably Re-Acting to Partner's Actions with the Social Intrinsic Motivation of Transfer Empowerment

Authors:Tessa van der Heiden, Herke van Hoof, Efstratios Gavves, Christoph Salge
Date:2022-03-07 13:03:35

We consider multi-agent reinforcement learning (MARL) for cooperative communication and coordination tasks. MARL agents can be brittle because they can overfit their training partners' policies. This overfitting can produce agents that adopt policies that act under the expectation that other agents will act in a certain way rather than react to their actions. Our objective is to bias the learning process towards finding reactive strategies towards other agents' behaviors. Our method, transfer empowerment, measures the potential influence between agents' actions. Results from three simulated cooperation scenarios support our hypothesis that transfer empowerment improves MARL performance. We discuss how transfer empowerment could be a useful principle to guide multi-agent coordination by ensuring reactiveness to one's partner.

Efficient Policy Generation in Multi-Agent Systems via Hypergraph Neural Network

Authors:Bin Zhang, Yunpeng Bai, Zhiwei Xu, Dapeng Li, Guoliang Fan
Date:2022-03-07 10:34:40

The application of deep reinforcement learning in multi-agent systems introduces extra challenges. In a scenario with numerous agents, one of the most important concerns currently being addressed is how to develop sufficient collaboration between diverse agents. To address this problem, we consider the form of agent interaction based on neighborhood and propose a multi-agent reinforcement learning (MARL) algorithm based on the actor-critic method, which can adaptively construct the hypergraph structure representing the agent interaction and further implement effective information extraction and representation learning through hypergraph convolution networks, leading to effective cooperation. Based on different hypergraph generation methods, we present two variants: Actor Hypergraph Convolutional Critic Network (HGAC) and Actor Attention Hypergraph Critic Network (ATT-HGAC). Experiments with different settings demonstrate the advantages of our approach over other existing methods.

Depthwise Convolution for Multi-Agent Communication with Enhanced Mean-Field Approximation

Authors:Donghan Xie, Zhi Wang, Chunlin Chen, Daoyi Dong
Date:2022-03-06 07:42:43

Multi-agent settings remain a fundamental challenge in the reinforcement learning (RL) domain due to partial observability and the lack of accurate real-time interactions across agents. In this paper, we propose a new method based on local communication learning to tackle the multi-agent RL (MARL) challenge in settings where a large number of agents coexist. First, we design a new communication protocol that exploits the ability of depthwise convolution to efficiently extract local relations and learn local communication between neighboring agents. To facilitate multi-agent coordination, we explicitly learn the effect of joint actions by taking the policies of neighboring agents as inputs. Second, we introduce the mean-field approximation into our method to reduce the scale of agent interactions. To more effectively coordinate the behaviors of neighboring agents, we enhance the mean-field approximation with a supervised policy rectification network (PRN) that rectifies real-time agent interactions and with a learnable compensation term that corrects the approximation bias. The proposed method enables efficient coordination and outperforms several baseline approaches on the adaptive traffic signal control (ATSC) task and the StarCraft II multi-agent challenge (SMAC).

Recursive Reasoning Graph for Multi-Agent Reinforcement Learning

Authors:Xiaobai Ma, David Isele, Jayesh K. Gupta, Kikuo Fujimura, Mykel J. Kochenderfer
Date:2022-03-06 00:57:50

Multi-agent reinforcement learning (MARL) provides an efficient way for simultaneously learning policies for multiple agents interacting with each other. However, in scenarios requiring complex interactions, existing algorithms can suffer from an inability to accurately anticipate the influence of self-actions on other agents. Incorporating an ability to reason about other agents' potential responses can allow an agent to formulate more effective strategies. This paper adopts a recursive reasoning model in a centralized-training-decentralized-execution framework to help learning agents better cooperate with or compete against others. The proposed algorithm, referred to as the Recursive Reasoning Graph (R2G), shows state-of-the-art performance on multiple multi-agent particle and robotics games.

Can Mean Field Control (MFC) Approximate Cooperative Multi Agent Reinforcement Learning (MARL) with Non-Uniform Interaction?

Authors:Washim Uddin Mondal, Vaneet Aggarwal, Satish V. Ukkusuri
Date:2022-02-28 19:03:09

Mean-Field Control (MFC) is a powerful tool to solve Multi-Agent Reinforcement Learning (MARL) problems. Recent studies have shown that MFC can well-approximate MARL when the population size is large and the agents are exchangeable. Unfortunately, the presumption of exchangeability implies that all agents uniformly interact with one another, which is not true in many practical scenarios. In this article, we relax the assumption of exchangeability and model the interaction between agents via an arbitrary doubly stochastic matrix. As a result, in our framework, the mean-field `seen' by different agents is different. We prove that, if the reward of each agent is an affine function of the mean-field seen by that agent, then one can approximate such a non-uniform MARL problem via its associated MFC problem within an error of $e=\mathcal{O}(\frac{1}{\sqrt{N}}[\sqrt{|\mathcal{X}|} + \sqrt{|\mathcal{U}|}])$ where $N$ is the population size and $|\mathcal{X}|$, $|\mathcal{U}|$ are the sizes of the state and action spaces, respectively. Finally, we develop a Natural Policy Gradient (NPG) algorithm that can provide a solution to the non-uniform MARL problem with an error of $\mathcal{O}(\max\{e,\epsilon\})$ and a sample complexity of $\mathcal{O}(\epsilon^{-3})$ for any $\epsilon >0$.

Distributed Multi-Agent Reinforcement Learning Based on Graph-Induced Local Value Functions

Authors:Gangshan Jing, He Bai, Jemin George, Aranya Chakrabortty, Piyush K. Sharma
Date:2022-02-26 03:01:51

Achieving distributed reinforcement learning (RL) for large-scale cooperative multi-agent systems (MASs) is challenging because: (i) each agent has access to only limited information; (ii) issues on convergence or computational complexity emerge due to the curse of dimensionality. In this paper, we propose a general computationally efficient distributed framework for cooperative multi-agent reinforcement learning (MARL) by utilizing the structures of graphs involved in this problem. We introduce three coupling graphs describing three types of inter-agent couplings in MARL, namely, the state graph, the observation graph and the reward graph. By further considering a communication graph, we propose two distributed RL approaches based on local value-functions derived from the coupling graphs. The first approach is able to reduce sample complexity significantly under specific conditions on the aforementioned four graphs. The second approach provides an approximate solution and can be efficient even for problems with dense coupling graphs. Here there is a trade-off between minimizing the approximation error and reducing the computational complexity. Simulations show that our RL algorithms have a significantly improved scalability to large-scale MASs compared with centralized and consensus-based distributed RL algorithms.

Hierarchical Control for Head-to-Head Autonomous Racing

Authors:Rishabh Saumil Thakkar, Aryaman Singh Samyal, David Fridovich-Keil, Zhe Xu, Ufuk Topcu
Date:2022-02-25 18:11:52

We develop a hierarchical controller for head-to-head autonomous racing. We first introduce a formulation of a racing game with realistic safety and fairness rules. A high-level planner approximates the original formulation as a discrete game with simplified state, control, and dynamics to easily encode the complex safety and fairness rules and calculates a series of target waypoints. The low-level controller takes the resulting waypoints as a reference trajectory and computes high-resolution control inputs by solving an alternative formulation approximation with simplified objectives and constraints. We consider two approaches for the low-level planner, constructing two hierarchical controllers. One approach uses multi-agent reinforcement learning (MARL), and the other solves a linear-quadratic Nash game (LQNG) to produce control inputs. The controllers are compared against three baselines: an end-to-end MARL controller, a MARL controller tracking a fixed racing line, and an LQNG controller tracking a fixed racing line. Quantitative results show that the proposed hierarchical methods outperform their respective baseline methods in terms of head-to-head race wins and abiding by the rules. The hierarchical controller using MARL for low-level control consistently outperformed all other methods by winning over 90% of head-to-head races and more consistently adhered to the complex racing rules. Qualitatively, we observe the proposed controllers mimicking actions performed by expert human drivers such as shielding/blocking, overtaking, and long-term planning for delayed advantages. We show that hierarchical planning for game-theoretic reasoning produces competitive behavior even when challenged with complex rules and constraints.

Collective Conditioned Reflex: A Bio-Inspired Fast Emergency Reaction Mechanism for Designing Safe Multi-Robot Systems

Authors:Bowei He, Zhenting Zhao, Wenhao Luo, Rui Liu
Date:2022-02-24 07:07:20

A multi-robot system (MRS) is a group of coordinated robots designed to cooperate with each other and accomplish given tasks. Due to the uncertainties in operating environments, the system may encounter emergencies, such as unobserved obstacles, moving vehicles, and extreme weather. Animal groups such as bee colonies initiate collective emergency reaction behaviors such as bypassing obstacles and avoiding predators, similar to a conditioned muscle reflex, which organizes local muscles to avoid hazards as a first response without routing the signal through the brain. Inspired by this, we develop a similar collective conditioned reflex mechanism for multi-robot systems to respond to emergencies. In this study, Collective Conditioned Reflex (CCR), a bio-inspired emergency reaction mechanism, is developed based on animal collective behavior analysis and multi-agent reinforcement learning (MARL). The algorithm uses a physical model to determine whether the robots are experiencing an emergency; then, the rewards of robots involved in the emergency are augmented with corresponding heuristic rewards, which evaluate emergency magnitudes and consequences and decide local robots' participation. CCR is validated on three typical emergency scenarios: turbulence, strong wind, and hidden obstacle. Simulation results demonstrate that CCR improves robot teams' emergency reaction capability with faster reaction speed and safer trajectory adjustment compared with baseline methods.

A Multi-Agent Reinforcement Learning Framework for Off-Policy Evaluation in Two-sided Markets

Authors:Chengchun Shi, Runzhe Wan, Ge Song, Shikai Luo, Rui Song, Hongtu Zhu
Date:2022-02-21 23:36:40

Two-sided markets such as ride-sharing platforms often involve a group of subjects who are making sequential decisions across time and/or location. With the rapid development of smart phones and the internet of things, they have substantially transformed the transportation landscape. In this paper, we consider large-scale fleet management in ride-sharing companies, which involves multiple units in different areas receiving sequences of products (or treatments) over time. Major technical challenges, such as policy evaluation, arise in those studies because (i) spatial and temporal proximities induce interference between locations and times; and (ii) the large number of locations results in the curse of dimensionality. To address both challenges simultaneously, we introduce a multi-agent reinforcement learning (MARL) framework for carrying out policy evaluation in these studies. We propose novel estimators for mean outcomes under different products that are consistent despite the high dimensionality of the state-action space. The proposed estimator works favorably in simulation experiments. We further illustrate our method using a real dataset obtained from a two-sided marketplace company to evaluate the effects of applying different subsidizing policies. A Python implementation of our proposed method is available at https://github.com/RunzheStat/CausalMARL.

MCMARL: Parameterizing Value Function via Mixture of Categorical Distributions for Multi-Agent Reinforcement Learning

Authors:Jian Zhao, Mingyu Yang, Youpeng Zhao, Xunhan Hu, Wengang Zhou, Jiangcheng Zhu, Houqiang Li
Date:2022-02-21 11:28:00

In cooperative multi-agent tasks, a team of agents jointly interact with an environment by taking actions, receiving a team reward and observing the next state. During the interactions, the uncertainty of the environment and reward will inevitably induce stochasticity in the long-term returns, and this randomness can be exacerbated with an increasing number of agents. However, such randomness is ignored by most of the existing value-based multi-agent reinforcement learning (MARL) methods, which only model the expectation of the Q-value for both individual agents and the team. Compared to using the expectations of the long-term returns, it is preferable to directly model the stochasticity by estimating the returns through distributions. With this motivation, this work proposes a novel value-based MARL framework from a distributional perspective, i.e., parameterizing the value function via a Mixture of Categorical distributions for MARL. Specifically, we model both individual Q-values and the global Q-value with categorical distributions. To integrate categorical distributions, we define five basic operations on the distribution, which allow the generalization of expected value function factorization methods (e.g., VDN and QMIX) to their MCMARL variants. We further prove that our MCMARL framework satisfies the Distributional-Individual-Global-Max (DIGM) principle with respect to the expectation of the distribution, which guarantees the consistency between joint and individual greedy action selections in the global Q-value and individual Q-values. Empirically, we evaluate MCMARL on both a stochastic matrix game and a challenging set of StarCraft II micromanagement tasks, showing the efficacy of our framework.

PooL: Pheromone-inspired Communication Framework for Large Scale Multi-Agent Reinforcement Learning

Authors:Zixuan Cao, Mengzhi Shi, Zhanbo Zhao, Xiujun Ma
Date:2022-02-20 03:09:53

Difficulty in scaling poses great problems for multi-agent coordination. Multi-agent Reinforcement Learning (MARL) algorithms applied in small-scale multi-agent systems are hard to extend to large-scale ones because the latter are far more dynamic and the number of interactions increases exponentially with the growing number of agents. Some swarm intelligence algorithms simulate the release and utilization mechanism of pheromones to control large-scale agent coordination. Inspired by such algorithms, PooL, a pheromone-based indirect communication framework applied to large-scale multi-agent reinforcement learning, is proposed in order to solve the large-scale multi-agent coordination problem. Pheromones released by agents of PooL are defined as outputs of most reinforcement learning algorithms, which reflect agents' views of the current environment. The pheromone update mechanism can efficiently organize the information of all agents and simplify the complex interactions among agents into low-dimensional representations. Pheromones perceived by agents can be regarded as a summary of the views of nearby agents, which can better reflect the real situation of the environment. Q-Learning is taken as our base model to implement PooL, and PooL is evaluated in various large-scale cooperative environments. Experiments show agents can capture effective information through PooL and achieve higher rewards than other state-of-the-art methods with lower communication costs.
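
A toy numpy sketch of a pheromone-style indirect communication map consistent with this description: agents deposit values on a grid, the field evaporates over time, and each agent perceives a local average as a low-dimensional summary of nearby agents' views. All constants are illustrative.

```python
# Pheromone grid with deposit, evaporation, and local perception.
import numpy as np

grid = np.zeros((8, 8))
decay, deposit = 0.9, 1.0
agent_positions = [(1, 1), (1, 2), (6, 6)]

for _ in range(5):                      # five environment steps
    grid *= decay                       # evaporation
    for (x, y) in agent_positions:
        grid[x, y] += deposit           # release pheromone at agent's cell

def perceive(grid, x, y):
    """Local pheromone summary: mean over the 3x3 neighborhood."""
    patch = grid[max(x - 1, 0):x + 2, max(y - 1, 0):y + 2]
    return patch.mean()

print(perceive(grid, 1, 1))  # high: two nearby agents reinforce each other
print(perceive(grid, 6, 6))  # lower: isolated agent
```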

Communication-Efficient Actor-Critic Methods for Homogeneous Markov Games

Authors:Dingyang Chen, Yile Li, Qi Zhang
Date:2022-02-18 20:35:00

Recent success in cooperative multi-agent reinforcement learning (MARL) relies on centralized training and policy sharing. Centralized training eliminates the issue of non-stationarity in MARL yet induces large communication costs, and policy sharing is empirically crucial to efficient learning in certain tasks yet lacks theoretical justification. In this paper, we formally characterize a subclass of cooperative Markov games where agents exhibit a certain form of homogeneity such that policy sharing provably incurs no suboptimality. This enables us to develop the first consensus-based decentralized actor-critic method where the consensus update is applied to both the actors and the critics while ensuring convergence. We also develop practical algorithms based on our decentralized actor-critic method to reduce the communication cost during training, while still yielding policies comparable with centralized training.
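
For reference, a generic consensus update on critic parameters of the kind such methods employ, in assumed notation: each agent averages neighbors' parameters through a doubly stochastic weight matrix W before taking its local TD step.

```latex
% Generic consensus-then-local-update step on critic parameters
% (notation assumed; the paper's exact update may differ).
w_i^{t+1} \;=\; \sum_{j \in \mathcal{N}_i} W_{ij}\, w_j^{t}
\;+\; \alpha_t\, \delta_i^{t}\,
  \nabla_{w} V_{w}(s_t)\big|_{w = w_i^{t}}
```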

Distributed Multi-Agent Reinforcement Learning with One-hop Neighbors and Compute Straggler Mitigation

Authors:Baoqian Wang, Junfei Xie, Nikolay Atanasov
Date:2022-02-18 04:55:09

Most multi-agent reinforcement learning (MARL) methods are limited in the scale of problems they can handle. With increasing numbers of agents, the number of training iterations required to find the optimal behaviors increases exponentially due to the exponentially growing joint state and action spaces. This paper tackles this limitation by introducing a scalable MARL method called Distributed multi-Agent Reinforcement Learning with One-hop Neighbors (DARL1N). DARL1N is an off-policy actor-critic method that addresses the curse of dimensionality by restricting information exchanges among the agents to one-hop neighbors when representing value and policy functions. Each agent optimizes its value and policy functions over a one-hop neighborhood, significantly reducing the learning complexity, yet maintaining expressiveness by training with varying neighbor numbers and states. This structure allows us to formulate a distributed learning framework to further speed up the training procedure. Distributed computing systems, however, contain straggler compute nodes, which are slow or unresponsive due to communication bottlenecks, software or hardware problems. To mitigate the detrimental straggler effect, we introduce a novel coded distributed learning architecture, which leverages coding theory to improve the resilience of the learning system to stragglers. Comprehensive experiments show that DARL1N significantly reduces training time without sacrificing policy quality and is scalable as the number of agents increases. Moreover, the coded distributed learning architecture improves training efficiency in the presence of stragglers.

Disentangling Successor Features for Coordination in Multi-agent Reinforcement Learning

Authors:Seung Hyun Kim, Neale Van Stralen, Girish Chowdhary, Huy T. Tran
Date:2022-02-15 21:48:26

Multi-agent reinforcement learning (MARL) is a promising framework for solving complex tasks with many agents. However, a key challenge in MARL is defining private utility functions that ensure coordination when training decentralized agents. This challenge is especially prevalent in unstructured tasks with sparse rewards and many agents. We show that successor features can help address this challenge by disentangling an individual agent's impact on the global value function from that of all other agents. We use this disentanglement to compactly represent private utilities that support stable training of decentralized agents in unstructured tasks. We implement our approach using a centralized training, decentralized execution architecture and test it in a variety of multi-agent environments. Our results show improved performance and training time relative to existing methods and suggest that disentanglement of successor features offers a promising approach to coordination in MARL.

Group-Agent Reinforcement Learning

Authors:Kaiyue Wu, Xiao-Jun Zeng
Date:2022-02-10 16:40:59

The reinforcement learning (RL) process of each agent can benefit greatly if multiple geographically distributed agents perform their separate RL tasks cooperatively. Different from multi-agent reinforcement learning (MARL), where multiple agents share a common environment and must learn to cooperate or compete with each other, in this case each agent has its own separate environment and communicates with others only to share knowledge, with no cooperative or competitive behaviour as a learning outcome. This scenario in fact exists widely in real life and can be exploited in many applications, but it is not yet well understood or well formulated. As a first effort, we propose the group-agent system as a formulation of this scenario and as a third type of RL system, alongside single-agent and multi-agent systems. We then propose a distributed RL framework called DDAL (Decentralised Distributed Asynchronous Learning), designed for group-agent reinforcement learning (GARL). We show through experiments that DDAL achieves desirable performance with very stable training and good scalability.

Understanding Value Decomposition Algorithms in Deep Cooperative Multi-Agent Reinforcement Learning

Authors:Zehao Dou, Jakub Grudzien Kuba, Yaodong Yang
Date:2022-02-10 06:59:08

Value function decomposition is becoming a popular rule of thumb for scaling up multi-agent reinforcement learning (MARL) in cooperative games. For such a decomposition rule to hold, the individual-global-max (IGM) principle must be assumed; that is, the local maxima of the decomposed value function for every agent must jointly attain the global maximum of the joint value function. This principle, however, does not hold in general. As a result, the applicability of value decomposition algorithms remains unclear and their corresponding convergence properties remain unknown. In this paper, we make the first effort to answer these questions. Specifically, we introduce the set of cooperative games in which the value decomposition methods find their validity, which we refer to as decomposable games. In decomposable games, we theoretically prove that applying the multi-agent fitted Q-Iteration algorithm (MA-FQI) will lead to an optimal Q-function. In non-decomposable games, the Q-function estimated by MA-FQI can still converge to the optimum, provided that the Q-function is projected into the decomposable function space at each iteration. In both settings, we consider value function representations by practical deep neural networks and derive the corresponding convergence rates. In summary, our results, for the first time, offer theoretical insights to MARL practitioners regarding when value decomposition algorithms converge and why they perform well.
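
For reference, the IGM principle discussed above is conventionally stated as requiring that the joint greedy action decompose into per-agent greedy actions:

```latex
\arg\max_{\mathbf{a}} Q_{\mathrm{tot}}(\boldsymbol{\tau}, \mathbf{a})
= \Big( \arg\max_{a_1} Q_1(\tau_1, a_1), \; \dots, \; \arg\max_{a_n} Q_n(\tau_n, a_n) \Big)
```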

Attacking c-MARL More Effectively: A Data Driven Approach

Authors:Nhan H. Pham, Lam M. Nguyen, Jie Chen, Hoang Thanh Lam, Subhro Das, Tsui-Wei Weng
Date:2022-02-07 23:28:22

In recent years, a proliferation of methods has been developed for cooperative multi-agent reinforcement learning (c-MARL). However, the robustness of c-MARL agents against adversarial attacks has rarely been explored. In this paper, we propose to evaluate the robustness of c-MARL agents via a model-based approach, named c-MBA. Our proposed formulation can craft much stronger adversarial state perturbations of c-MARL agents to lower total team rewards than existing model-free approaches. In addition, we propose the first victim-agent selection strategy and the first data-driven approach to defining targeted failure states, each of which allows us to develop even stronger adversarial attacks without expert knowledge of the underlying environment. Our numerical experiments on two representative MARL benchmarks illustrate the advantage of our approach over other baselines: our model-based attack consistently outperforms them in all tested environments.

Analysis of Independent Learning in Network Agents: A Packet Forwarding Use Case

Authors:Abu Saleh Md Tayeen, Milan Biswal, Abderrahmen Mtibaa, Satyajayant Misra
Date:2022-02-04 19:15:58

Multi-Agent Reinforcement Learning (MARL) is nowadays widely used to solve complex real-world decision problems in various domains. While MARL can be categorized into independent and cooperative approaches, we consider the independent approach a simple, more scalable, and less costly method for large-scale distributed systems, such as network packet forwarding. In this paper, we quantitatively and qualitatively assess the benefits of such an independent agent learning approach, in particular an IQL-based algorithm, for packet forwarding in computer networking, using the Named Data Networking (NDN) architecture as a driving example. We put multiple IQL-based forwarding strategies (IDQF) to the test and compare their performance against very basic forwarding schemes on simple topologies/traffic models to highlight the major challenges and issues. We discuss the main issues related to the poor performance of IDQF and quantify the impact of these issues in isolation when training and testing the IDQF models under different model tuning parameters and network topologies/characteristics.

Robustness and Adaptability of Reinforcement Learning based Cooperative Autonomous Driving in Mixed-autonomy Traffic

Authors:Rodolfo Valiente, Behrad Toghi, Ramtin Pedarsani, Yaser P. Fallah
Date:2022-02-02 05:15:14

Building autonomous vehicles (AVs) is a complex problem, but enabling them to operate in the real world, where they will be surrounded by human-driven vehicles (HVs), is extremely challenging. Prior works have shown the possibility of creating inter-agent cooperation between a group of AVs that follow a social utility. Such altruistic AVs can form alliances and affect the behavior of HVs to achieve socially desirable outcomes. We identify two major challenges in the co-existence of AVs and HVs. First, the social preferences and individual traits of a given human driver, e.g., selflessness and aggressiveness, are unknown to an AV, and it is almost impossible to infer them in real-time during a short AV-HV interaction. Second, contrary to AVs, which are expected to follow a policy, HVs do not necessarily follow a stationary policy and are therefore extremely hard to predict. To alleviate these challenges, we formulate the mixed-autonomy problem as a multi-agent reinforcement learning (MARL) problem and propose a decentralized framework and reward function for training cooperative AVs. Our approach enables AVs to learn the decision-making of HVs implicitly from experience and optimizes for a social utility while prioritizing safety and allowing adaptability, robustifying altruistic AVs to different human behaviors and constraining them to a safe action space. Finally, we investigate the robustness, safety, and sensitivity of AVs to various HV behavioral traits, and present the settings in which the AVs can learn cooperative policies that are adaptable to different situations.

Trust Region Bounds for Decentralized PPO Under Non-stationarity

Authors:Mingfei Sun, Sam Devlin, Jacob Beck, Katja Hofmann, Shimon Whiteson
Date:2022-01-31 20:39:48

We present trust region bounds for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL), which hold even when the transition dynamics are non-stationary. This new analysis provides a theoretical understanding of the strong performance of two recent actor-critic methods for MARL, which both rely on independent ratios, i.e., probability ratios computed separately for each agent's policy. We show that, despite the non-stationarity that independent ratios cause, a monotonic improvement guarantee still arises as a result of enforcing the trust region constraint over all decentralized policies. We also show that this trust region constraint can be effectively enforced in a principled way by bounding independent ratios based on the number of agents in training, providing a theoretical foundation for proximal ratio clipping. Finally, our empirical results support the hypothesis that the strong performance of IPPO and MAPPO is a direct result of enforcing such a trust region constraint via clipping in centralized training, and of tuning the hyperparameters with regard to the number of agents, as predicted by our theoretical analysis.
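
A minimal sketch of the per-agent clipped surrogate that this analysis concerns; the square-root scaling of the clip range with the number of agents is a placeholder for the paper's agent-dependent bound, not the bound itself.

```python
import torch

def clipped_surrogate(logp_new, logp_old, advantages, n_agents, base_eps=0.2):
    """Clipped surrogate loss with an independent probability ratio per
    agent, as in IPPO/MAPPO. Shrinking the clip range as the team grows
    stands in for the paper's agent-dependent bound (illustrative)."""
    eps = base_eps / max(1, n_agents) ** 0.5   # placeholder scaling
    ratio = torch.exp(logp_new - logp_old)     # independent ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```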

Multi-Agent Reinforcement Learning for Network Load Balancing in Data Center

Authors:Zhiyuan Yao, Zihan Ding, Thomas Clausen
Date:2022-01-27 18:47:59

This paper presents the network load balancing problem, a challenging real-world task for multi-agent reinforcement learning (MARL) methods. Traditional heuristic solutions like Weighted-Cost Multi-Path (WCMP) and Local Shortest Queue (LSQ) are less flexible to changing workload distributions and arrival rates, and balance poorly across multiple load balancers. The cooperative network load balancing task is formulated as a Dec-POMDP problem, which naturally lends itself to MARL methods. To bridge the reality gap for applying learning-based methods, all methods are directly trained and evaluated on an emulation system at moderate to large scale. Experiments on realistic testbeds show that independent and "selfish" load balancing strategies are not necessarily globally optimal, while the proposed MARL solution has superior performance across different realistic settings. Additionally, the potential difficulties of MARL methods for network load balancing are analysed, which helps draw the attention of the learning and networking communities to such challenges.

Exploiting Semantic Epsilon Greedy Exploration Strategy in Multi-Agent Reinforcement Learning

Authors:Hon Tik Tse, Ho-fung Leung
Date:2022-01-26 08:21:11

Multi-agent reinforcement learning (MARL) can model many real world applications. However, many MARL approaches rely on epsilon greedy for exploration, which may discourage visiting advantageous states in hard scenarios. In this paper, we propose a new approach QMIX(SEG) for tackling MARL. It makes use of the value function factorization method QMIX to train per-agent policies and a novel Semantic Epsilon Greedy (SEG) exploration strategy. SEG is a simple extension to the conventional epsilon greedy exploration strategy, yet it is experimentally shown to greatly improve the performance of MARL. We first cluster actions into groups of actions with similar effects and then use the groups in a bi-level epsilon greedy exploration hierarchy for action selection. We argue that SEG facilitates semantic exploration by exploring in the space of groups of actions, which have richer semantic meanings than atomic actions. Experiments show that QMIX(SEG) largely outperforms QMIX and leads to strong performance competitive with current state-of-the-art MARL approaches on the StarCraft Multi-Agent Challenge (SMAC) benchmark.
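
The bi-level selection rule can be sketched in a few lines; the action clustering is assumed to be given, and the two epsilon values are illustrative.

```python
import random

def semantic_epsilon_greedy(q_values, groups, eps_group=0.3, eps_action=0.1):
    """Bi-level epsilon-greedy over groups of semantically similar actions.
    q_values: dict mapping action -> Q estimate; groups: list of action
    lists obtained by clustering actions with similar effects (assumed)."""
    if random.random() < eps_group:
        group = random.choice(groups)               # explore across groups
    else:
        group = max(groups, key=lambda g: max(q_values[a] for a in g))
    if random.random() < eps_action:
        return random.choice(group)                 # explore within group
    return max(group, key=lambda a: q_values[a])    # exploit within group
```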

Iterated Reasoning with Mutual Information in Cooperative and Byzantine Decentralized Teaming

Authors:Sachin Konan, Esmaeil Seraj, Matthew Gombolay
Date:2022-01-20 22:54:32

Information sharing is key in building team cognition and enables coordination and cooperation. High-performing human teams also benefit from acting strategically with hierarchical levels of iterated communication and rationalizability, meaning a human agent can reason about the actions of their teammates in their decision-making. Yet, the majority of prior work in Multi-Agent Reinforcement Learning (MARL) does not support iterated rationalizability and only encourages inter-agent communication, resulting in suboptimal equilibrium cooperation strategies. In this work, we show that reformulating an agent's policy to be conditional on the policies of its neighboring teammates inherently maximizes a lower bound on Mutual Information (MI) when optimizing under Policy Gradient (PG). Building on the idea of decision-making under bounded rationality and cognitive hierarchy theory, we show that our modified PG approach not only maximizes local agent rewards but also implicitly reasons about MI between agents, without the need for any explicit ad-hoc regularization terms. Our approach, InfoPG, outperforms baselines in learning emergent collaborative behaviors and sets the state of the art in decentralized cooperative MARL tasks. Our experiments validate the utility of InfoPG by achieving higher sample efficiency and significantly larger cumulative reward in several complex cooperative multi-agent domains.

Agent-Temporal Attention for Reward Redistribution in Episodic Multi-Agent Reinforcement Learning

Authors:Baicen Xiao, Bhaskar Ramasubramanian, Radha Poovendran
Date:2022-01-12 18:35:46

This paper considers multi-agent reinforcement learning (MARL) tasks where agents receive a shared global reward at the end of an episode. The delayed nature of this reward affects the ability of the agents to assess the quality of their actions at intermediate time-steps. This paper focuses on developing methods to learn a temporal redistribution of the episodic reward to obtain a dense reward signal. Solving such MARL problems requires addressing two challenges: identifying (1) relative importance of states along the length of an episode (along time), and (2) relative importance of individual agents' states at any single time-step (among agents). In this paper, we introduce Agent-Temporal Attention for Reward Redistribution in Episodic Multi-Agent Reinforcement Learning (AREL) to address these two challenges. AREL uses attention mechanisms to characterize the influence of actions on state transitions along trajectories (temporal attention), and how each agent is affected by other agents at each time-step (agent attention). The redistributed rewards predicted by AREL are dense, and can be integrated with any given MARL algorithm. We evaluate AREL on challenging tasks from the Particle World environment and the StarCraft Multi-Agent Challenge. AREL results in higher rewards in Particle World, and improved win rates in StarCraft compared to three state-of-the-art reward redistribution methods. Our code is available at https://github.com/baicenxiao/AREL.
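
One way to picture the redistribution step is as attention scores over time-steps and agents turning a single episodic reward into a dense signal; the softmax normalization below is an assumption of this sketch, not necessarily AREL's exact parameterization.

```python
import torch

def redistribute_reward(episodic_reward, attention_logits):
    """attention_logits: (T, N) scores from a learned agent-temporal
    attention model (assumed trained separately). Returns dense per-step,
    per-agent rewards whose sum equals the original episodic reward."""
    weights = torch.softmax(attention_logits.flatten(), dim=0)
    return (weights * episodic_reward).reshape(attention_logits.shape)
```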

Distributed Cooperative Multi-Agent Reinforcement Learning with Directed Coordination Graph

Authors:Gangshan Jing, He Bai, Jemin George, Aranya Chakrabortty, Piyush. K. Sharma
Date:2022-01-10 04:14:46

Existing distributed cooperative multi-agent reinforcement learning (MARL) frameworks usually assume undirected coordination graphs and communication graphs, and estimate a global reward via consensus algorithms for policy evaluation. Such a framework may incur expensive communication costs and exhibit poor scalability due to the requirement of global consensus. In this work, we study MARL with directed coordination graphs and propose a distributed RL algorithm where the local policy evaluations are based on local value functions. The local value function of each agent is obtained through local communication with its neighbors over a directed, learning-induced communication graph, without using any consensus algorithm. A zeroth-order optimization (ZOO) approach based on parameter perturbation is employed for gradient estimation. By comparing with existing ZOO-based RL algorithms, we show that our proposed distributed RL algorithm guarantees high scalability. A distributed resource allocation example illustrates the effectiveness of our algorithm.
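
The parameter-perturbation gradient estimate at the heart of the ZOO approach can be sketched as a standard two-point estimator; the smoothing radius and direction sampling are generic choices rather than the paper's specific scheme.

```python
import numpy as np

def zoo_gradient(J, theta, delta=1e-2):
    """Two-point zeroth-order gradient estimate of J at theta using a
    random unit direction u; the dimension factor makes the estimator
    unbiased for the delta-smoothed objective."""
    u = np.random.randn(*theta.shape)
    u /= np.linalg.norm(u)
    return theta.size * (J(theta + delta * u) - J(theta - delta * u)) / (2 * delta) * u
```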

A Multi-agent Reinforcement Learning Approach for Efficient Client Selection in Federated Learning

Authors:Sai Qian Zhang, Jieyu Lin, Qi Zhang
Date:2022-01-09 05:55:17

Federated learning (FL) is a training technique that enables client devices to jointly learn a shared model by aggregating locally-computed models without exposing their raw data. While most of the existing work focuses on improving the FL model accuracy, in this paper we focus on improving the training efficiency, which is often a hurdle for adopting FL in real-world applications. Specifically, we design an efficient FL framework which jointly optimizes model accuracy, processing latency, and communication efficiency, all of which are primary design considerations for the real implementation of FL. Inspired by the recent success of Multi-Agent Reinforcement Learning (MARL) in solving complex control problems, we present \textit{FedMarl}, a MARL-based FL framework which performs efficient run-time client selection. Experiments show that FedMarl can significantly improve model accuracy with much lower processing latency and communication cost.

Analyzing Micro-Founded General Equilibrium Models with Many Agents using Deep Reinforcement Learning

Authors:Michael Curry, Alexander Trott, Soham Phade, Yu Bai, Stephan Zheng
Date:2022-01-03 17:00:17

Real economies can be modeled as a sequential imperfect-information game with many heterogeneous agents, such as consumers, firms, and governments. Dynamic general equilibrium (DGE) models are often used for macroeconomic analysis in this setting. However, finding general equilibria is challenging using existing theoretical or computational methods, especially when using microfoundations to model individual agents. Here, we show how to use deep multi-agent reinforcement learning (MARL) to find $\epsilon$-meta-equilibria over agent types in microfounded DGE models. Whereas standard MARL fails to learn non-trivial solutions, our structured learning curricula enable stable convergence to meaningful solutions. Conceptually, our approach is more flexible and does not need unrealistic assumptions, e.g., continuous market clearing, that are commonly used for analytical tractability. Furthermore, our end-to-end GPU implementation enables fast real-time convergence with a large number of RL economic agents. We showcase our approach in open and closed real-business-cycle (RBC) models with 100 worker-consumers, 10 firms, and a social planner who taxes and redistributes. We validate the learned solutions are $\epsilon$-meta-equilibria through best-response analyses, show that they align with economic intuitions, and show our approach can learn a spectrum of qualitatively distinct $\epsilon$-meta-equilibria in open RBC models. As such, we show that hardware-accelerated MARL is a promising framework for modeling the complexity of economies based on microfoundations.

Multi-Agent Reinforcement Learning via Adaptive Kalman Temporal Difference and Successor Representation

Authors:Mohammad Salimibeni, Arash Mohammadi, Parvin Malekzadeh, Konstantinos N. Plataniotis
Date:2021-12-30 18:21:53

Distributed Multi-Agent Reinforcement Learning (MARL) algorithms have attracted a surge of interest lately, mainly due to recent advances in Deep Neural Networks (DNNs). Conventional Model-Based (MB) or Model-Free (MF) RL algorithms are not directly applicable to MARL problems due to their utilization of a fixed reward model for learning the underlying value function. While DNN-based solutions perform well when a single agent is involved, such methods fail to fully generalize to the complexities of MARL problems. In other words, although recently developed DNN-based approaches for multi-agent environments have achieved superior performance, they are still prone to overfitting, high sensitivity to parameter selection, and sample inefficiency. The paper proposes the Multi-Agent Adaptive Kalman Temporal Difference (MAK-TD) framework and its Successor Representation-based variant, referred to as MAK-SR. Intuitively speaking, the main objective is to capitalize on unique characteristics of Kalman Filtering (KF), such as uncertainty modeling and online second-order learning. The proposed MAK-TD/SR frameworks consider the continuous nature of the action space associated with high-dimensional multi-agent environments and exploit Kalman Temporal Difference (KTD) to address the parameter uncertainty. By leveraging the KTD framework, the SR learning procedure is modeled as a filtering problem, where Radial Basis Function (RBF) estimators are used to encode the continuous space into feature vectors. For learning localized reward functions, we resort to Multiple Model Adaptive Estimation (MMAE) to deal with the lack of prior knowledge on the observation noise covariance and observation mapping function. The proposed MAK-TD/SR frameworks are evaluated via several experiments implemented through the OpenAI Gym MARL benchmarks.

Large-scale Machine Learning Cluster Scheduling via Multi-agent Graph Reinforcement Learning

Authors:Xiaoyang Zhao, Chuan Wu
Date:2021-12-26 10:51:48

Efficient scheduling of distributed deep learning (DL) jobs in large GPU clusters is crucial for resource efficiency and job performance. While server sharing among jobs improves resource utilization, interference among co-located DL jobs occurs due to resource contention. Interference-aware job placement has been studied, with white-box approaches based on explicit interference modeling and black-box schedulers based on reinforcement learning. In today's clusters containing thousands of GPU servers, running a single scheduler to manage all arriving jobs in a timely and effective manner is challenging due to the large workload scale. We adopt multiple schedulers in a large-scale cluster/data center, and propose a multi-agent reinforcement learning (MARL) scheduling framework to cooperatively learn fine-grained job placement policies, with the objective of minimizing job completion time (JCT). To achieve topology-aware placements, our proposed framework uses hierarchical graph neural networks to encode the data center topology and server architecture. In view of a common lack of precise reward samples corresponding to different placements, a job interference model is further devised to predict interference levels in the face of various co-locations, for training the MARL schedulers. Testbed and trace-driven evaluations show that our scheduler framework outperforms representative scheduling schemes by more than 20% in terms of average JCT, and is adaptive to various machine learning cluster topologies.

Learning Cooperative Multi-Agent Policies with Partial Reward Decoupling

Authors:Benjamin Freed, Aditya Kapoor, Ian Abraham, Jeff Schneider, Howie Choset
Date:2021-12-23 17:48:04

One of the preeminent obstacles to scaling multi-agent reinforcement learning to large numbers of agents is assigning credit to individual agents' actions. In this paper, we address this credit assignment problem with an approach that we call \textit{partial reward decoupling} (PRD), which attempts to decompose large cooperative multi-agent RL problems into decoupled subproblems involving subsets of agents, thereby simplifying credit assignment. We empirically demonstrate that decomposing the RL problem using PRD in an actor-critic algorithm results in lower variance policy gradient estimates, which improves data efficiency, learning stability, and asymptotic performance across a wide array of multi-agent RL tasks, compared to various other actor-critic approaches. Additionally, we relate our approach to counterfactual multi-agent policy gradient (COMA), a state-of-the-art MARL algorithm, and empirically show that our approach outperforms COMA by making better use of information in agents' reward streams, and by enabling recent advances in advantage estimation to be used.

Local Advantage Networks for Cooperative Multi-Agent Reinforcement Learning

Authors:Raphaël Avalos, Mathieu Reymond, Ann Nowé, Diederik M. Roijers
Date:2021-12-23 10:55:33

Many recent successful off-policy multi-agent reinforcement learning (MARL) algorithms for cooperative partially observable environments focus on finding factorized value functions, leading to convoluted network structures. Building on the structure of independent Q-learners, our LAN algorithm takes a radically different approach, leveraging a dueling architecture to learn, for each agent, a decentralized best-response policy via individual advantage functions. The learning is stabilized by a centralized critic whose primary objective is to reduce the moving-target problem of the individual advantages. The critic, whose network size is independent of the number of agents, is cast aside after learning. Evaluation on the StarCraft II multi-agent challenge benchmark shows that LAN reaches state-of-the-art performance and is highly scalable with respect to the number of agents, opening up a promising alternative direction for MARL research.
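
The dueling decomposition LAN builds on can be written compactly; the layer sizes are arbitrary, and the centralized critic that stabilizes training is omitted from this sketch.

```python
import torch
import torch.nn as nn

class DuelingAgentHead(nn.Module):
    """Per-agent dueling head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).
    In LAN-style training the advantages would additionally be guided by
    a centralized critic that is discarded after learning (not shown)."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)
        self.advantage = nn.Linear(hidden, n_actions)

    def forward(self, obs):
        h = self.trunk(obs)
        a = self.advantage(h)
        return self.value(h) + a - a.mean(dim=-1, keepdim=True)
```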

CGIBNet: Bandwidth-constrained Communication with Graph Information Bottleneck in Multi-Agent Reinforcement Learning

Authors:Qi Tian, Kun Kuang, Baoxiang Wang, Furui Liu, Fei Wu
Date:2021-12-20 07:53:44

Communication is one of the core components for cooperative multi-agent reinforcement learning (MARL). The communication bandwidth, in many real applications, is always subject to certain constraints. To improve communication efficiency, in this article, we propose to simultaneously optimize whom to communicate with and what to communicate for each agent in MARL. By initiating the communication between agents with a directed complete graph, we propose a novel communication model, named Communicative Graph Information Bottleneck Network (CGIBNet), to simultaneously compress the graph structure and the node information with the graph information bottleneck principle. The graph structure compression is designed to cut the redundant edges for determining whom to communicate with. The node information compression aims to address the problem of what to communicate via learning compact node representations. Moreover, CGIBNet is the first universal module for bandwidth-constrained communication, which can be applied to various training frameworks (i.e., policy-based and value-based MARL frameworks) and communication modes (i.e., single-round and multi-round communication). Extensive experiments are conducted in Traffic Control and StarCraft II environments. The results indicate that our method can achieve better performance in bandwidth-constrained settings compared with state-of-the-art algorithms.

Learning to Share in Multi-Agent Reinforcement Learning

Authors:Yuxuan Yi, Ge Li, Yaowei Wang, Zongqing Lu
Date:2021-12-16 08:43:20

In this paper, we study the problem of networked multi-agent reinforcement learning (MARL), where a number of agents are deployed as a partially connected network and each interacts only with nearby agents. Networked MARL requires all agents to make decisions in a decentralized manner to optimize a global objective with restricted communication between neighbors over the network. Inspired by the fact that sharing plays a key role in human learning of cooperation, we propose LToS, a hierarchically decentralized MARL framework that enables agents to learn to dynamically share reward with neighbors so as to encourage agents to cooperate on the global objective through collectives. For each agent, the high-level policy learns how to share reward with neighbors to decompose the global objective, while the low-level policy learns to optimize the local objective induced by the high-level policies in the neighborhood. The two policies form a bi-level optimization and learn alternately. We empirically demonstrate that LToS outperforms existing methods in both social dilemma and networked MARL scenarios across scales.
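
The reward-sharing step can be pictured as each agent's high-level policy emitting weights that route fractions of its reward to neighbors; the row-stochastic weight matrix below is a stand-in for those learned outputs, not LToS's exact parameterization.

```python
import numpy as np

def shared_rewards(rewards, share_weights, adjacency):
    """rewards: (N,) local rewards; share_weights: (N, N) row-stochastic
    weights produced by each agent's high-level policy (assumed);
    adjacency: (N, N) 0/1 neighbor mask, which should include self-loops
    so an agent can keep part of its own reward."""
    w = share_weights * adjacency
    w = w / w.sum(axis=1, keepdims=True)   # renormalize over neighbors
    return w.T @ rewards                   # reward flowing to each agent
```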

Finite-Sample Analysis of Decentralized Q-Learning for Stochastic Games

Authors:Zuguang Gao, Qianqian Ma, Tamer Başar, John R. Birge
Date:2021-12-15 03:33:39

Learning in stochastic games is arguably the most standard and fundamental setting in multi-agent reinforcement learning (MARL). In this paper, we consider decentralized MARL in stochastic games in the non-asymptotic regime. In particular, we establish the finite-sample complexity of fully decentralized Q-learning algorithms in a significant class of general-sum stochastic games (SGs), weakly acyclic SGs, which include the common cooperative MARL setting with an identical reward for all agents (a Markov team problem) as a special case. We focus on the practical yet challenging setting of fully decentralized MARL, where neither the rewards nor the actions of other agents can be observed by each agent. In fact, each agent is completely oblivious to the presence of other decision makers. Both the tabular and the linear function approximation cases are considered. In the tabular setting, we analyze the sample complexity for the decentralized Q-learning algorithm to converge to a Markov perfect equilibrium (Nash equilibrium). With linear function approximation, the results concern convergence to a linear approximated equilibrium, a new notion of equilibrium that we propose, in which each agent's policy is a best reply (to the other agents) within a linear space. Numerical experiments are provided for both settings to demonstrate the results.
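
In this fully decentralized setting, each agent effectively runs single-agent tabular Q-learning on its own observations, oblivious to the other decision makers; a minimal sketch (the paper's algorithm adds further structure, e.g., exploration phases, not shown here):

```python
import numpy as np

class ObliviousQLearner:
    """Tabular Q-learning for one agent that observes neither the actions
    nor the rewards of other agents -- the fully decentralized setting."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha, self.gamma = alpha, gamma

    def update(self, s, a, r, s_next):
        # Standard TD(0) target built only from the agent's own stream.
        target = r + self.gamma * self.Q[s_next].max()
        self.Q[s, a] += self.alpha * (target - self.Q[s, a])
```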

PantheonRL: A MARL Library for Dynamic Training Interactions

Authors:Bidipta Sarkar, Aditi Talati, Andy Shih, Dorsa Sadigh
Date:2021-12-13 21:08:43

We present PantheonRL, a multiagent reinforcement learning software package for dynamic training interactions such as round-robin, adaptive, and ad-hoc training. Our package is designed around flexible agent objects that can be easily configured to support different training interactions, and handles fully general multiagent environments with mixed rewards and n agents. Built on top of StableBaselines3, our package works directly with existing powerful deep RL algorithms. Finally, PantheonRL comes with an intuitive yet functional web user interface for configuring experiments and launching multiple asynchronous jobs. Our package can be found at https://github.com/Stanford-ILIAD/PantheonRL.

The Partially Observable Asynchronous Multi-Agent Cooperation Challenge

Authors:Meng Yao, Qiyue Yin, Jun Yang, Tongtong Yu, Shengqi Shen, Junge Zhang, Bin Liang, Kaiqi Huang
Date:2021-12-07 16:35:14

Multi-agent reinforcement learning (MARL) has received increasing attention for its applications in various domains. Researchers have paid much attention to its partially observable and cooperative settings to meet real-world requirements. For testing the performance of different algorithms, standardized environments are designed, such as the StarCraft Multi-Agent Challenge, one of the most successful MARL benchmarks. To the best of our knowledge, most current environments are synchronous, where agents execute actions at the same pace. However, heterogeneous agents usually have their own action spaces, and there is no guarantee that actions from different agents have the same execution cycle, which leads to asynchronous multi-agent cooperation. Inspired by Wargame, a confrontation game between two armies abstracted from real-world environments, we propose the first Partially Observable Asynchronous multi-agent Cooperation challenge (POAC) for the MARL community. Specifically, POAC supports two teams of heterogeneous agents fighting each other, where an agent selects actions based on its own observations and cooperates asynchronously with its allies. Moreover, POAC is a lightweight, flexible, and easy-to-use environment, which users can configure to meet different experimental requirements such as self-play mode, human-AI mode, and so on. Along with our benchmark, we offer six game scenarios of varying difficulty with built-in rule-based AI as opponents. Finally, since most MARL algorithms are designed for synchronous agents, we revise several representative ones to fit the asynchronous setting, and their relatively poor experimental results validate the challenge POAC poses. Source code is released at \url{http://turingai.ia.ac.cn/data\_center/show}.

Offline Pre-trained Multi-Agent Decision Transformer: One Big Sequence Model Tackles All SMAC Tasks

Authors:Linghui Meng, Muning Wen, Yaodong Yang, Chenyang Le, Xiyun Li, Weinan Zhang, Ying Wen, Haifeng Zhang, Jun Wang, Bo Xu
Date:2021-12-06 08:11:05

Offline reinforcement learning leverages previously collected offline datasets to learn optimal policies without needing to access the real environment. Such a paradigm is also desirable for multi-agent reinforcement learning (MARL) tasks, given the increased interactions among agents and with the environment. Yet in MARL, the paradigm of offline pre-training with online fine-tuning has not been studied, nor are datasets or benchmarks for offline MARL research available. In this paper, we facilitate the research by providing large-scale datasets and using them to examine the Decision Transformer in the context of MARL. We investigate the generalisation of MARL offline pre-training in three aspects: 1) between single agents and multiple agents, 2) from offline pre-training to online fine-tuning, and 3) to multiple downstream tasks with few-shot and zero-shot capabilities. We start by introducing the first offline MARL dataset with diverse quality levels based on the StarCraft II environment, and then propose the novel architecture of the multi-agent decision transformer (MADT) for effective offline learning. MADT leverages the transformer's sequence modelling ability and integrates it seamlessly with both offline and online MARL tasks. A crucial benefit of MADT is that it learns generalisable policies that can transfer between different types of agents under different task scenarios. On the StarCraft II offline dataset, MADT outperforms the state-of-the-art offline RL baselines. When applied to online tasks, the pre-trained MADT significantly improves sample efficiency and enjoys strong performance in both few-shot and zero-shot cases. To the best of our knowledge, this is the first work that studies and demonstrates the effectiveness of offline pre-trained models in terms of sample efficiency and generalisability enhancements in MARL.

LIGS: Learnable Intrinsic-Reward Generation Selection for Multi-Agent Learning

Authors:David Henry Mguni, Taher Jafferjee, Jianhong Wang, Oliver Slumbers, Nicolas Perez-Nieves, Feifei Tong, Li Yang, Jiangcheng Zhu, Yaodong Yang, Jun Wang
Date:2021-12-05 16:50:23

Efficient exploration is important for reinforcement learners to achieve high rewards. In multi-agent systems, coordinated exploration and behaviour is critical for agents to jointly achieve optimal outcomes. In this paper, we introduce a new general framework for improving the coordination and performance of multi-agent reinforcement learners. Our framework, named the Learnable Intrinsic-Reward Generation Selection algorithm (LIGS), introduces an adaptive learner, the Generator, that observes the agents and learns to construct intrinsic rewards online that coordinate the agents' joint exploration and joint behaviour. Using a novel combination of MARL and switching controls, LIGS determines the best states at which to add intrinsic rewards, which leads to a highly efficient learning process. LIGS can subdivide complex tasks, making them easier to solve, and enables systems of MARL agents to quickly solve environments with sparse rewards. LIGS can seamlessly adopt existing MARL algorithms, and our theory shows that it ensures convergence to policies that deliver higher system performance. We demonstrate its superior performance in challenging tasks in Foraging and StarCraft II.

Multi-Agent Intention Sharing via Leader-Follower Forest

Authors:Zeyang Liu, Lipeng Wan, Xue sui, Kewu Sun, Xuguang Lan
Date:2021-12-02 09:37:29

Intention sharing is crucial for efficient cooperation under partially observable environments in multi-agent reinforcement learning (MARL). However, message deceiving, i.e., a mismatch between the propagated intentions and the final decisions, may happen when agents change strategies simultaneously according to received intentions. Message deceiving leads to potential miscoordination and difficulty in policy learning. This paper proposes the leader-follower forest (LFF) to learn the hierarchical relationships between agents based on interdependencies, achieving one-sided intention sharing in multi-agent communication. By limiting the flow of intentions through directed edges, intention sharing via LFF (IS-LFF) can eliminate message deceiving effectively and achieve better coordination. In addition, a two-stage learning algorithm is proposed to train the forest and the agent network. We evaluate IS-LFF on multiple partially observable MARL benchmarks, and the experimental results show that our method outperforms state-of-the-art communication algorithms.

The Power of Communication in a Distributed Multi-Agent System

Authors:Philipp Dominic Siedler
Date:2021-11-30 18:00:58

Single-Agent (SA) Reinforcement Learning systems have shown outstanding results on non-stationary problems. However, Multi-Agent Reinforcement Learning (MARL) can surpass SA systems generally and when scaling. Furthermore, MA systems can be super-powered by collaboration, which can happen through observing others, or through a communication system used to share information between collaborators. Here, we developed a distributed MA learning mechanism with the ability to communicate, based on decentralised partially observable Markov decision processes (Dec-POMDPs) and Graph Neural Networks (GNNs). Minimising the time and energy consumed by training Machine Learning models while improving performance can be achieved by collaborative MA mechanisms. We demonstrate this in a real-world scenario, an offshore wind farm comprising a set of distributed wind turbines, where the objective is to maximise collective efficiency. Compared to a SA system, MA collaboration has shown significantly reduced training time and higher cumulative rewards in unseen and scaled scenarios.

DeepCQ+: Robust and Scalable Routing with Multi-Agent Deep Reinforcement Learning for Highly Dynamic Networks

Authors:Saeed Kaviani, Bo Ryu, Ejaz Ahmed, Kevin Larson, Anh Le, Alex Yahja, Jae H. Kim
Date:2021-11-29 23:05:49

Highly dynamic mobile ad-hoc networks (MANETs) remain one of the most challenging environments in which to develop and deploy robust, efficient, and scalable routing protocols. In this paper, we present the DeepCQ+ routing protocol, which, in a novel manner, integrates emerging multi-agent deep reinforcement learning (MADRL) techniques into existing Q-learning-based routing protocols and their variants, and achieves persistently higher performance across a wide range of topology and mobility configurations. While keeping the overall protocol structure of the Q-learning-based routing protocols, DeepCQ+ replaces statically configured parameterized thresholds and hand-written rules with carefully designed MADRL agents, such that no a priori configuration of such parameters is required. Extensive simulation shows that DeepCQ+ yields significantly increased end-to-end throughput with lower overhead and no apparent degradation of end-to-end delays (hop counts) compared to its Q-learning-based counterparts. Qualitatively, and perhaps more significantly, DeepCQ+ maintains remarkably similar performance gains under many scenarios that it was not trained for, in terms of network sizes, mobility conditions, and traffic dynamics. To the best of our knowledge, this is the first successful application of the MADRL framework to the MANET routing problem that demonstrates a high degree of scalability and robustness, even in environments outside the trained range of scenarios. This implies that our MARL-based DeepCQ+ design significantly improves on the Q-learning-based CQ+ baseline and increases its practicality and explainability, because real-world MANET environments will likely vary outside the trained range of MANET scenarios. Additional techniques to further increase the gains in performance and scalability are discussed.

Evaluating Generalization and Transfer Capacity of Multi-Agent Reinforcement Learning Across Variable Number of Agents

Authors:Bengisu Guresti, Nazim Kemal Ure
Date:2021-11-28 15:29:46

Multi-agent Reinforcement Learning (MARL) problems often require cooperation among agents in order to solve a task. Centralization and decentralization are two approaches used for cooperation in MARL. While fully decentralized methods are prone to converging to suboptimal solutions due to partial observability and nonstationarity, methods involving centralization suffer from scalability limitations and the lazy-agent problem. The centralized training decentralized execution paradigm brings out the best of these two approaches; however, centralized training still has an upper limit of scalability, not only for the acquired coordination performance but also for model size and training time. In this work, we adopt the centralized training with decentralized execution paradigm and investigate the generalization and transfer capacity of the trained models across variable numbers of agents. This capacity is assessed by training variable numbers of agents in a specific MARL problem and then performing greedy evaluations with variable numbers of agents for each training configuration. Thus, we analyze the evaluation performance for each combination of agent count for training versus evaluation. We perform experimental evaluations on predator-prey and traffic junction environments and demonstrate that it is possible to obtain similar or higher evaluation performance by training with fewer agents. We conclude that the optimal number of agents for training may differ from the target number of agents, and argue that transfer across a large number of agents can be a more efficient solution to scaling up than directly increasing the number of agents during training.

Distributed Policy Gradient with Variance Reduction in Multi-Agent Reinforcement Learning

Authors:Xiaoxiao Zhao, Jinlong Lei, Li Li, Jie Chen
Date:2021-11-25 08:07:30

This paper studies a distributed policy gradient in collaborative multi-agent reinforcement learning (MARL), where agents over a communication network aim to find the optimal policy that maximizes the average of all agents' local returns. Due to the non-concave performance function of the policy gradient, existing distributed stochastic optimization methods for convex problems cannot be directly used for policy gradient in MARL. This paper proposes a distributed policy gradient with variance reduction and gradient tracking to address the high variance of policy gradients, and utilizes importance weights to address the distribution shift problem in the sampling process. We then provide an upper bound on the mean-squared stationary gap, which depends on the number of iterations, the mini-batch size, the epoch size, the problem parameters, and the network topology. We further establish the sample and communication complexity to obtain an $\epsilon$-approximate stationary point. Numerical experiments are performed to validate the effectiveness of the proposed algorithm.

Off-Policy Correction For Multi-Agent Reinforcement Learning

Authors:Michał Zawalski, Błażej Osiński, Henryk Michalewski, Piotr Miłoś
Date:2021-11-22 14:23:13

Multi-agent reinforcement learning (MARL) provides a framework for problems involving multiple interacting agents. Despite apparent similarity to the single-agent case, multi-agent problems are often harder to train and analyze theoretically. In this work, we propose MA-Trace, a new on-policy actor-critic algorithm, which extends V-Trace to the MARL setting. The key advantage of our algorithm is its high scalability in a multi-worker setting. To this end, MA-Trace utilizes importance sampling as an off-policy correction method, which allows distributing the computations with no impact on the quality of training. Furthermore, our algorithm is theoretically grounded - we prove a fixed-point theorem that guarantees convergence. We evaluate the algorithm extensively on the StarCraft Multi-Agent Challenge, a standard benchmark for multi-agent algorithms. MA-Trace achieves high performance on all its tasks and exceeds state-of-the-art results on some of them.
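
The off-policy correction ingredient here is truncated importance sampling in the style of V-trace; a sketch of the weight computation follows, with the usual default truncation thresholds rather than necessarily the paper's settings.

```python
import torch

def truncated_is_weights(logp_target, logp_behavior, rho_bar=1.0, c_bar=1.0):
    """Truncated importance ratios used for V-trace-style off-policy
    correction: rho weights the TD error, c weights the trace, and both
    are clipped to bound the variance of the correction."""
    ratio = torch.exp(logp_target - logp_behavior)
    rho = torch.clamp(ratio, max=rho_bar)
    c = torch.clamp(ratio, max=c_bar)
    return rho, c
```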

Episodic Multi-agent Reinforcement Learning with Curiosity-Driven Exploration

Authors:Lulu Zheng, Jiarui Chen, Jianhao Wang, Jiamin He, Yujing Hu, Yingfeng Chen, Changjie Fan, Yang Gao, Chongjie Zhang
Date:2021-11-22 07:34:47

Efficient exploration in deep cooperative multi-agent reinforcement learning (MARL) still remains challenging in complex coordination problems. In this paper, we introduce a novel Episodic Multi-agent reinforcement learning with Curiosity-driven exploration, called EMC. We leverage an insight of popular factorized MARL algorithms that the "induced" individual Q-values, i.e., the individual utility functions used for local execution, are the embeddings of local action-observation histories, and can capture the interaction between agents due to reward backpropagation during centralized training. Therefore, we use prediction errors of individual Q-values as intrinsic rewards for coordinated exploration and utilize episodic memory to exploit explored informative experience to boost policy training. As the dynamics of an agent's individual Q-value function captures the novelty of states and the influence from other agents, our intrinsic reward can induce coordinated exploration to new or promising states. We illustrate the advantages of our method by didactic examples, and demonstrate its significant outperformance over state-of-the-art MARL baselines on challenging tasks in the StarCraft II micromanagement benchmark.
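
The curiosity signal can be sketched as the prediction error of an agent's individual Q-values; the generic predictor network and the scaling coefficient below are assumptions of this sketch, not EMC's exact architecture.

```python
import torch

def curiosity_bonus(predictor, obs_history, q_values, beta=0.1):
    """Intrinsic reward = error of a learned predictor of the agent's
    individual Q-values; large errors flag novel or promising states
    (and influence from other agents propagated via reward backup)."""
    with torch.no_grad():
        pred = predictor(obs_history)
    return beta * (pred - q_values).pow(2).mean(dim=-1)
```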

Calculus of Consent via MARL: Legitimating the Collaborative Governance Supplying Public Goods

Authors:Yang Hu, Zhui Zhu, Sirui Song, Xue Liu, Yang Yu
Date:2021-11-20 16:45:13

Public policies that supply public goods, especially those that involve collaboration by limiting individual liberty, always give rise to controversies over governance legitimacy. Multi-Agent Reinforcement Learning (MARL) methods are appropriate for supporting the legitimacy of public policies that supply public goods at the cost of individual interests. Among these policies, inter-regional collaborative pandemic control is a prominent example, which has become much more important for an increasingly inter-connected world facing a global pandemic like COVID-19. Different patterns of collaborative strategies have been observed among different systems of regions, yet an analytical process for reasoning about the legitimacy of those strategies has been lacking. In this paper, we use inter-regional collaboration for pandemic control as an example to demonstrate the necessity of MARL in reasoning about, and thereby legitimizing, policies enforcing such inter-regional collaboration. Experimental results in an exemplary environment show that our MARL approach is able to demonstrate the effectiveness of, and necessity for, restrictions on individual liberty in the collaborative supply of public goods. Different optimal policies are learned by our MARL agents under different collaboration levels, which change in an interpretable pattern of collaboration that helps to balance the losses suffered by regions of different types and consequently promotes the overall welfare. Meanwhile, policies learned with higher collaboration levels yield higher global rewards, which illustrates the benefit of, and thus provides a novel justification for the legitimacy of, promoting inter-regional collaboration. Therefore, our method shows the capability of MARL in computationally modeling and supporting the theory of calculus of consent developed by Nobel Prize winner J. M. Buchanan.

Interactive Medical Image Segmentation with Self-Adaptive Confidence Calibration

Authors:Wenhao Li, Qisen Xu, Chuyun Shen, Bin Hu, Fengping Zhu, Yuxin Li, Bo Jin, Xiangfeng Wang
Date:2021-11-15 12:38:56

Medical image segmentation is one of the fundamental problems for artificial intelligence-based clinical decision systems. Current automatic medical image segmentation methods often fail to meet clinical requirements. As such, a series of interactive segmentation algorithms have been proposed to utilize expert correction information. However, existing methods suffer from segmentation refinement failures after long-term interaction and from the cost of expert annotation, which hinders clinical application. This paper proposes an interactive segmentation framework, called interactive MEdical segmentation with self-adaptive Confidence CAlibration (MECCA), by introducing corrective action evaluation, which combines action-based confidence learning and multi-agent reinforcement learning (MARL). The evaluation is established through a novel action-based confidence network, and the corrective actions are obtained from MARL. Based on the confidence information, a self-adaptive reward function is designed to provide more detailed feedback, and a simulated label generation mechanism is proposed for unsupervised data to reduce over-reliance on labeled data. Experimental results on various medical image datasets demonstrate the strong performance of the proposed algorithm.

Relative Distributed Formation and Obstacle Avoidance with Multi-agent Reinforcement Learning

Authors:Yuzi Yan, Xiaoxiang Li, Xinyou Qiu, Jiantao Qiu, Jian Wang, Yu Wang, Yuan Shen
Date:2021-11-14 13:02:45

Multi-agent formation together with obstacle avoidance is one of the most actively studied topics in the field of multi-agent systems. Although some classic controllers like model predictive control (MPC) and fuzzy control achieve a certain measure of success, most of them require precise global information which is not accessible in harsh environments. On the other hand, some reinforcement learning (RL) based approaches adopt the leader-follower structure to organize different agents' behaviors, which sacrifices collaboration between agents and thus suffers from bottlenecks in maneuverability and robustness. In this paper, we propose a distributed formation and obstacle avoidance method based on multi-agent reinforcement learning (MARL). Agents in our system use only local and relative information to make decisions and control themselves distributively. Agents in the multi-agent system will quickly reorganize themselves into a new topology if any of them is disconnected. Our method achieves better performance regarding formation error, formation convergence rate, and on-par success rate of obstacle avoidance compared with baselines (both classic control methods and another RL-based method). The feasibility of our method is verified by both simulation and hardware implementation with Ackermann-steering vehicles.

Causal Multi-Agent Reinforcement Learning: Review and Open Problems

Authors:St John Grimbly, Jonathan Shock, Arnu Pretorius
Date:2021-11-12 13:44:31

This paper serves to introduce the reader to the field of multi-agent reinforcement learning (MARL) and its intersection with methods from the study of causality. We highlight key challenges in MARL and discuss these in the context of how causal methods may assist in tackling them. We promote moving toward a 'causality first' perspective on MARL. Specifically, we argue that causality can offer improved safety, interpretability, and robustness, while also providing strong theoretical guarantees for emergent behaviour. We discuss potential solutions for common challenges, and use this context to motivate future research directions.

Multi-agent Reinforcement Learning for Cooperative Lane Changing of Connected and Autonomous Vehicles in Mixed Traffic

Authors:Wei Zhou, Dong Chen, Jun Yan, Zhaojian Li, Huilin Yin, Wanchen Ge
Date:2021-11-11 17:17:24

Autonomous driving has attracted significant research interest in the past two decades as it offers many potential benefits, including relieving drivers from exhausting driving and mitigating traffic congestion, among others. Despite promising progress, lane-changing remains a great challenge for autonomous vehicles (AVs), especially in mixed and dynamic traffic scenarios. Recently, reinforcement learning (RL), a powerful data-driven control method, has been widely explored for lane-changing decision making in AVs, with encouraging results demonstrated. However, the majority of those studies focus on a single-vehicle setting, and lane-changing in the context of multiple AVs coexisting with human-driven vehicles (HDVs) has received scarce attention. In this paper, we formulate the lane-changing decision making of multiple AVs in a mixed-traffic highway environment as a multi-agent reinforcement learning (MARL) problem, where each AV makes lane-changing decisions based on the motions of both neighboring AVs and HDVs. Specifically, a multi-agent advantage actor-critic network (MA2C) is developed with a novel local reward design and a parameter-sharing scheme. In particular, a multi-objective reward function is proposed to incorporate fuel efficiency, driving comfort, and safety of autonomous driving. Comprehensive experimental results, conducted under three different traffic densities and various levels of human driver aggressiveness, show that our proposed MARL framework consistently outperforms several state-of-the-art benchmarks in terms of efficiency, safety, and driver comfort.

On the Use and Misuse of Absorbing States in Multi-agent Reinforcement Learning

Authors:Andrew Cohen, Ervin Teng, Vincent-Pierre Berges, Ruo-Ping Dong, Hunter Henry, Marwan Mattar, Alexander Zook, Sujoy Ganguly
Date:2021-11-10 23:45:08

The creation and destruction of agents in cooperative multi-agent reinforcement learning (MARL) is a critically under-explored area of research. Current MARL algorithms often assume that the number of agents within a group remains fixed throughout an experiment. However, in many practical problems, an agent may terminate before their teammates. This early termination issue presents a challenge: the terminated agent must learn from the group's success or failure which occurs beyond its own existence. We refer to propagating value from rewards earned by remaining teammates to terminated agents as the Posthumous Credit Assignment problem. Current MARL methods handle this problem by placing these agents in an absorbing state until the entire group of agents reaches a termination condition. Although absorbing states enable existing algorithms and APIs to handle terminated agents without modification, practical training efficiency and resource use problems exist. In this work, we first demonstrate that sample complexity increases with the quantity of absorbing states in a toy supervised learning task for a fully connected network, while attention is more robust to variable size input. Then, we present a novel architecture for an existing state-of-the-art MARL algorithm which uses attention instead of a fully connected layer with absorbing states. Finally, we demonstrate that this novel architecture significantly outperforms the standard architecture on tasks in which agents are created or destroyed within episodes as well as standard multi-agent coordination tasks.
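
The architectural change can be pictured as replacing padded absorbing-state inputs with masked attention over only the currently active agents; a minimal sketch with an assumed mask convention:

```python
import torch
import torch.nn as nn

class ActiveAgentAttention(nn.Module):
    """Attend over a variable-size set of active agents instead of
    padding terminated ones with absorbing states."""

    def __init__(self, dim=32, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, agent_embeddings, active_mask):
        # agent_embeddings: (batch, n_agents, dim);
        # active_mask: (batch, n_agents) bool, True where the agent is
        # alive; key_padding_mask expects True where entries are ignored.
        out, _ = self.attn(agent_embeddings, agent_embeddings,
                           agent_embeddings,
                           key_padding_mask=~active_mask)
        return out
```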

PowerGridworld: A Framework for Multi-Agent Reinforcement Learning in Power Systems

Authors:David Biagioni, Xiangyu Zhang, Dylan Wald, Deepthi Vaidhynathan, Rohit Chintala, Jennifer King, Ahmed S. Zamzam
Date:2021-11-10 22:22:07

We present the PowerGridworld software package to provide users with a lightweight, modular, and customizable framework for creating power-systems-focused, multi-agent Gym environments that readily integrate with existing training frameworks for reinforcement learning (RL). Although many frameworks exist for training multi-agent RL (MARL) policies, none allows rapid prototyping and development of the environments themselves, especially in the context of heterogeneous (composite, multi-device) power systems, where power flow solutions are required to define grid-level variables and costs. PowerGridworld is an open-source software package that helps to fill this gap. To highlight PowerGridworld's key features, we present two case studies and demonstrate learning MARL policies using both OpenAI's multi-agent deep deterministic policy gradient (MADDPG) and RLlib's proximal policy optimization (PPO) algorithms. In both cases, at least some subset of agents incorporates elements of the power flow solution at each time step as part of their reward (negative cost) structures.

DeCOM: Decomposed Policy for Constrained Cooperative Multi-Agent Reinforcement Learning

Authors:Zhaoxing Yang, Rong Ding, Haiming Jin, Yifei Wei, Haoyi You, Guiyun Fan, Xiaoying Gan, Xinbing Wang
Date:2021-11-10 12:31:30

In recent years, multi-agent reinforcement learning (MARL) has demonstrated impressive performance in various applications. However, physical limitations, budget restrictions, and many other factors usually impose \textit{constraints} on a multi-agent system (MAS) that cannot be handled by traditional MARL frameworks. Specifically, this paper focuses on constrained MASes where agents work \textit{cooperatively} to maximize the expected team-average return under various constraints on expected team-average costs, and develops a \textit{constrained cooperative MARL} framework, named DeCOM, for such MASes. In particular, DeCOM decomposes the policy of each agent into two modules, which empowers information sharing among agents to achieve better cooperation. In addition, with such modularization, the training algorithm of DeCOM separates the original constrained optimization into an unconstrained optimization on reward and a constraint-satisfaction problem on costs. DeCOM then iteratively solves these problems in a computationally efficient manner, which makes DeCOM highly scalable. We also provide theoretical guarantees on the convergence of DeCOM's policy update algorithm. Finally, we validate the effectiveness of DeCOM with various types of costs in both toy and large-scale (with 500 agents) environments.
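
The separation of reward optimization from constraint satisfaction can be sketched as alternating between a penalized gradient step and a dual update on a Lagrange-style multiplier. The loop below uses quadratic toy objectives and is an illustrative assumption, not DeCOM's actual algorithm.

    import numpy as np

    def constrained_iteration(theta, lam, grad_reward, grad_cost,
                              cost_value, budget, lr=0.01, lr_lam=0.05):
        """One illustrative iteration separating reward optimization from
        constraint satisfaction (not DeCOM's exact update)."""
        # 1) ascent on reward, penalized by the current multiplier
        theta = theta + lr * (grad_reward(theta) - lam * grad_cost(theta))
        # 2) constraint-satisfaction step: raise lam while cost exceeds budget
        lam = max(0.0, lam + lr_lam * (cost_value(theta) - budget))
        return theta, lam

    theta, lam = np.ones(2), 0.0
    for _ in range(200):
        theta, lam = constrained_iteration(
            theta, lam,
            grad_reward=lambda t: -t,           # reward = -||t||^2 / 2
            grad_cost=lambda t: 2 * t,          # cost   = ||t||^2
            cost_value=lambda t: float(t @ t),
            budget=0.1)
    print(theta, lam)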

Cross Modality 3D Navigation Using Reinforcement Learning and Neural Style Transfer

Authors:Cesare Magnetti, Hadrien Reynaud, Bernhard Kainz
Date:2021-11-05 13:11:45

This paper presents the use of Multi-Agent Reinforcement Learning (MARL) to perform navigation in 3D anatomical volumes from medical imaging. We utilize Neural Style Transfer to create synthetic Computed Tomography (CT) agent gym environments and assess the generalization capabilities of our agents on clinical CT volumes. Our framework does not require any labelled clinical data and integrates easily with several image translation techniques, enabling cross-modality applications. Further, we solely condition our agents on 2D slices, breaking ground for 3D guidance in much more difficult imaging modalities, such as ultrasound imaging. This is an important step towards user guidance during the acquisition of standardised diagnostic view planes, improving diagnostic consistency and facilitating better case comparison.

Pulse shape discrimination for GRIT: beam test of a new integrated charge and current preamplifier coupled with high granularity Silicon detectors

Authors:J. -J. Dormard, M. Assié, L. Grassi, E. Rauly, D. Beaumel, G. Brulin, M. Chabot, J. -L. Coacolo, F. Flavigny, B. Genolini, F. Hammache, T. Id Barkach, E. Rindel, Ph. Rosier, N. de Séréville, E. Wanlin
Date:2021-11-04 14:23:50

The GRIT (Granularity, Resolution, Identification, Transparency) Silicon array is intended to measure direct reactions. Its design is based on several layers (three layers in the forward direction, two backward) of custom-made trapezoidal and square detectors. The first stage is 500 {\mu}m thick and features 128x128 orthogonal strips. Pulse shape analysis for particle identification is implemented for this first layer. Given the compactness of this array and the large number of channels involved (>7,500), an integrated preamplifier, iPACI, that provides charge and current information has been developed in the AMS 0.35 {\mu}m BiCMOS technology. The design specifications and results of the test bench are presented. Considering an energy range of 50 MeV and an energy resolution of 12 keV (FWHM) for the preamplifier, the energy resolution for one strip obtained from alpha-source measurements in real conditions is 35 keV. The current output bandwidth is measured at 130 MHz for small signals and the power consumption reaches 40 mW per detector channel. A first beam test was performed, coupling an nTD trapezoidal double-sided strip Silicon detector of GRIT with the iPACI preamplifier and a 64-channel digitizer. Z=1 particles are discriminated with the pulse shape analysis technique down to 2 MeV for protons, 2.5 MeV for deuterons and 3 MeV for tritons. The effect of the strip length due to the trapezoidal shape of the detector is investigated on both the N- and the P-side, showing no significant impact.

Robust Dynamic Bus Control: A Distributional Multi-agent Reinforcement Learning Approach

Authors:Jiawei Wang, Lijun Sun
Date:2021-11-02 23:41:09

The bus system is a critical component of sustainable urban transportation. However, the operation of a bus fleet is unstable in nature, and bus bunching has become a common phenomenon that undermines the efficiency and reliability of bus systems. Recent research has demonstrated the promising application of multi-agent reinforcement learning (MARL) to achieve efficient vehicle holding control and avoid bus bunching. However, existing studies essentially overlook the robustness issue resulting from the various events, perturbations, and anomalies in a transit system, which is of utmost importance when transferring the models for real-world deployment. In this study, we integrate implicit quantile networks and meta-learning to develop a distributional MARL framework -- IQNC-M -- to learn continuous control. The proposed IQNC-M framework achieves efficient and reliable control decisions by better handling various uncertainties and events in real-time transit operations. Specifically, we introduce an interpretable meta-learning module to incorporate global information into the distributional MARL framework, which is an effective solution to circumvent the credit assignment issue in the transit system. In addition, we design a specific learning procedure to train each agent within the framework to pursue a robust control policy. We develop simulation environments based on real-world bus services and passenger demand data and evaluate the proposed framework against both traditional holding control models and state-of-the-art MARL models. Our results show that the proposed IQNC-M framework can effectively handle various extreme events, such as traffic state perturbations, service interruptions, and demand surges, thus improving both the efficiency and reliability of the system.
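
The distributional critic in such frameworks is typically trained with quantile regression. Below is a sketch of the quantile Huber loss used by implicit-quantile-style critics; the tensor shapes and the loss form are standard assumptions for illustration, not the paper's exact training objective.

    import torch

    def quantile_huber_loss(pred_quantiles, target, taus, kappa=1.0):
        """Quantile Huber loss for an implicit-quantile-style critic.
        pred_quantiles: (batch, N) quantile estimates at levels taus;
        target: (batch, N') target samples; taus: (batch, N)."""
        td = target.unsqueeze(1) - pred_quantiles.unsqueeze(2)  # (b, N, N')
        huber = torch.where(td.abs() <= kappa,
                            0.5 * td.pow(2),
                            kappa * (td.abs() - 0.5 * kappa))
        # pinball weight |tau - 1{td < 0}| tilts the Huber loss per quantile
        weight = (taus.unsqueeze(2) - (td.detach() < 0).float()).abs()
        return (weight * huber / kappa).mean()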

Decentralized Cooperative Reinforcement Learning with Hierarchical Information Structure

Authors:Hsu Kao, Chen-Yu Wei, Vijay Subramanian
Date:2021-11-01 09:18:07

Multi-agent reinforcement learning (MARL) problems are challenging due to information asymmetry. To overcome this challenge, existing methods often require a high level of coordination or communication between the agents. We consider two-agent multi-armed bandits (MABs) and Markov decision processes (MDPs) with a hierarchical information structure arising in applications, which we exploit to propose simpler and more efficient algorithms that require no coordination or communication. In this structure, in each step the ``leader'' chooses her action first, and then the ``follower'' decides his action after observing the leader's action. The two agents observe the same reward (and the same state transition in the MDP setting) that depends on their joint action. For the bandit setting, we propose a hierarchical bandit algorithm that achieves a near-optimal gap-independent regret of $\widetilde{\mathcal{O}}(\sqrt{ABT})$ and a near-optimal gap-dependent regret of $\mathcal{O}(\log(T))$, where $A$ and $B$ are the numbers of actions of the leader and the follower, respectively, and $T$ is the number of steps. We further extend these results to the case of multiple followers and the case of a deep hierarchy, for both of which we obtain near-optimal regret bounds. For the MDP setting, we obtain $\widetilde{\mathcal{O}}(\sqrt{H^7S^2ABT})$ regret, where $H$ is the number of steps per episode, $S$ is the number of states, and $T$ is the number of episodes. This matches the existing lower bound in terms of $A$, $B$, and $T$.
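
The leader-follower structure suggests a simple algorithmic pattern: the leader runs a bandit algorithm over its $A$ actions, and for each leader action the follower runs its own bandit over its $B$ actions. The UCB-style sketch below illustrates this pattern only; it is not the paper's algorithm or its exact confidence bounds.

    import numpy as np

    def hierarchical_ucb(bandit_mean, T=5000, seed=0):
        """Illustrative leader-follower UCB on joint means bandit_mean[a, b]."""
        rng = np.random.default_rng(seed)
        A, B = bandit_mean.shape
        n_l, mu_l = np.zeros(A), np.zeros(A)             # leader statistics
        n_f, mu_f = np.zeros((A, B)), np.zeros((A, B))   # follower stats per leader action
        total = 0.0
        for t in range(1, T + 1):
            ucb_l = mu_l + np.sqrt(2 * np.log(t) / np.maximum(n_l, 1))
            a = int(np.argmax(np.where(n_l == 0, np.inf, ucb_l)))
            ucb_f = mu_f[a] + np.sqrt(2 * np.log(t) / np.maximum(n_f[a], 1))
            b = int(np.argmax(np.where(n_f[a] == 0, np.inf, ucb_f)))
            r = bandit_mean[a, b] + rng.normal(0, 0.1)
            n_l[a] += 1; mu_l[a] += (r - mu_l[a]) / n_l[a]
            n_f[a, b] += 1; mu_f[a, b] += (r - mu_f[a, b]) / n_f[a, b]
            total += r
        return total / T

    print(hierarchical_ucb(np.array([[0.2, 0.5], [0.4, 0.9]])))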

Decentralized Multi-Agent Reinforcement Learning: An Off-Policy Method

Authors:Kuo Li, Qing-Shan Jia
Date:2021-10-31 09:08:46

We discuss the problem of decentralized multi-agent reinforcement learning (MARL) in this work. In our setting, the global state, action, and reward are assumed to be fully observable, while each agent's local policy is kept private and thus cannot be shared with others. There is a communication graph over which the agents can exchange information with their neighbors. The agents make individual decisions and cooperate to reach a higher accumulated reward. Towards this end, we first propose a decentralized actor-critic (AC) setting. Then, policy evaluation and policy improvement algorithms are designed for discrete and continuous state-action-space Markov Decision Processes (MDPs), respectively. Furthermore, a convergence analysis is given for the discrete-space case, which guarantees that the policy will be reinforced by alternating between the processes of policy evaluation and policy improvement. In order to validate the effectiveness of the algorithms, we design experiments and compare them with previous algorithms, e.g., Q-learning \cite{watkins1992q} and MADDPG \cite{lowe2017multi}. The results show that our algorithms perform better in terms of both learning speed and final performance. Moreover, the algorithms can be executed in an off-policy manner, which greatly improves data efficiency compared with on-policy algorithms.
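
Decentralized schemes of this kind typically interleave local updates with a gossip (neighbor-averaging) step over the communication graph. A minimal sketch of the gossip step, with a hypothetical row-stochastic mixing matrix:

    import numpy as np

    def consensus_step(params, W):
        """One gossip round: each agent replaces its parameters with a
        weighted average over its neighbors (illustrative of the
        communication pattern, not the paper's exact update).
        params: (n_agents, dim); W: row-stochastic mixing matrix."""
        return W @ params

    W = np.array([[0.5, 0.5, 0.0],
                  [0.25, 0.5, 0.25],
                  [0.0, 0.5, 0.5]])       # a small line-graph topology
    params = np.random.randn(3, 4)
    for _ in range(50):
        params = consensus_step(params, W)
    print(params.std(axis=0))  # rows converge toward a common value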

Learning to Ground Multi-Agent Communication with Autoencoders

Authors:Toru Lin, Minyoung Huh, Chris Stauffer, Ser-Nam Lim, Phillip Isola
Date:2021-10-28 17:57:26

Communication requires having a common language, a lingua franca, between agents. This language could emerge via a consensus process, but it may require many generations of trial and error. Alternatively, the lingua franca can be given by the environment, where agents ground their language in representations of the observed world. We demonstrate a simple way to ground language in learned representations, which facilitates decentralized multi-agent communication and coordination. We find that a standard representation learning algorithm -- autoencoding -- is sufficient for arriving at a grounded common language. When agents broadcast these representations, they learn to understand and respond to each other's utterances and achieve surprisingly strong task performance across a variety of multi-agent communication environments.
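
A minimal sketch of the grounding idea: each agent trains a standard autoencoder on its own observations and broadcasts the latent code as its message. The dimensions and architecture below are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ObsAutoencoder(nn.Module):
        """Illustrative observation autoencoder; broadcasting the latent
        code as the agent's message grounds communication in the observed
        world."""
        def __init__(self, obs_dim=16, msg_dim=4):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(),
                                     nn.Linear(32, msg_dim))
            self.dec = nn.Sequential(nn.Linear(msg_dim, 32), nn.ReLU(),
                                     nn.Linear(32, obs_dim))

        def forward(self, obs):
            msg = self.enc(obs)          # the broadcast message
            recon = self.dec(msg)
            return msg, recon

    ae = ObsAutoencoder()
    obs = torch.randn(8, 16)
    msg, recon = ae(obs)
    loss = nn.functional.mse_loss(recon, obs)  # standard reconstruction loss
    loss.backward()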

V-Learning -- A Simple, Efficient, Decentralized Algorithm for Multiagent RL

Authors:Chi Jin, Qinghua Liu, Yuanhao Wang, Tiancheng Yu
Date:2021-10-27 16:25:55

A major challenge of multiagent reinforcement learning (MARL) is the curse of multiagents, where the size of the joint action space scales exponentially with the number of agents. This remains a bottleneck for designing efficient MARL algorithms even in a basic scenario with finitely many states and actions. This paper resolves this challenge for the model of episodic Markov games. We design a new class of fully decentralized algorithms -- V-learning, which provably learns Nash equilibria (in the two-player zero-sum setting), correlated equilibria and coarse correlated equilibria (in the multiplayer general-sum setting) in a number of samples that only scales with $\max_{i\in[m]} A_i$, where $A_i$ is the number of actions for the $i^{\rm th}$ player. This is in sharp contrast to the size of the joint action space, which is $\prod_{i=1}^m A_i$. V-learning (in its basic form) is a new class of single-agent RL algorithms that converts any adversarial bandit algorithm with suitable regret guarantees into an RL algorithm. Similar to the classical Q-learning algorithm, it performs incremental updates to the value functions. Different from Q-learning, it only maintains estimates of V-values instead of Q-values. This key difference allows V-learning to achieve the claimed guarantees in the MARL setting by simply letting all agents run V-learning independently.
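
To illustrate the two ingredients, the sketch below pairs an incremental V-value update with an EXP3-style adversarial bandit that would select actions at each state. The $(H+1)/(H+t)$ stepsize is the schedule common in this line of work, but the sketch omits the optimism bonus and other details of the actual algorithm.

    import numpy as np

    class Exp3:
        """Adversarial bandit with exponential weights; a V-learning-style
        scheme runs one such bandit per state to select actions (sketch)."""
        def __init__(self, n_actions, eta=0.1):
            self.cum_loss = np.zeros(n_actions)
            self.eta = eta

        def policy(self):
            logits = -self.eta * self.cum_loss
            p = np.exp(logits - logits.max())
            return p / p.sum()

        def update(self, action, loss, prob):
            self.cum_loss[action] += loss / prob  # importance-weighted estimate

    def v_update(V, s, r, s_next, t, H=10):
        """Incremental V-value update with a (H+1)/(H+t) stepsize; the
        actual algorithm also adds an optimism bonus (omitted here)."""
        alpha = (H + 1) / (H + t)
        V[s] = (1 - alpha) * V[s] + alpha * (r + V[s_next])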

Multi-Agent Reinforcement Learning for Active Voltage Control on Power Distribution Networks

Authors:Jianhong Wang, Wangkun Xu, Yunjie Gu, Wenbin Song, Tim C. Green
Date:2021-10-27 09:31:22

This paper presents a problem in power networks that creates an exciting and yet challenging real-world scenario for the application of multi-agent reinforcement learning (MARL). The emerging trend of decarbonisation is placing excessive stress on power distribution networks. Active voltage control is seen as a promising solution to relieve power congestion and improve voltage quality without extra hardware investment, taking advantage of the controllable apparatuses in the network, such as roof-top photovoltaics (PVs) and static var compensators (SVCs). These controllable apparatuses appear in vast numbers and are distributed over a wide geographic area, making MARL a natural candidate. This paper formulates the active voltage control problem in the framework of Dec-POMDP and establishes an open-source environment. It aims to bridge the gap between the power community and the MARL community and to be a driving force towards real-world applications of MARL algorithms. Finally, we analyse the special characteristics of the active voltage control problem that cause challenges (e.g. interpretability) for state-of-the-art MARL approaches, and summarise the potential directions.

A Law of Iterated Logarithm for Multi-Agent Reinforcement Learning

Authors:Gugan Thoppe, Bhumesh Kumar
Date:2021-10-27 08:01:17

In Multi-Agent Reinforcement Learning (MARL), multiple agents interact with a common environment, as well as with each other, to solve a shared problem in sequential decision-making. It has wide-ranging applications in gaming, robotics, finance, etc. In this work, we derive a novel law of iterated logarithm for a family of distributed nonlinear stochastic approximation schemes that is useful in MARL. In particular, our result describes the convergence rate on almost every sample path where the algorithm converges. This result is the first of its kind in the distributed setup and provides deeper insights than the existing ones, which only discuss convergence rates in the expected or the CLT sense. Importantly, our result holds under significantly weaker assumptions: neither does the gossip matrix need to be doubly stochastic, nor do the stepsizes need to be square-summable. As an application, we show that, for the stepsize $n^{-\gamma}$ with $\gamma \in (0, 1)$, the distributed TD(0) algorithm with linear function approximation has a convergence rate of $O(\sqrt{n^{-\gamma} \ln n })$ a.s.; for the $1/n$ type stepsize, the same is $O(\sqrt{n^{-1} \ln \ln n})$ a.s. These decay rates do not depend on the graph depicting the interactions among the different agents.

Learning to Simulate Self-Driven Particles System with Coordinated Policy Optimization

Authors:Zhenghao Peng, Quanyi Li, Ka Ming Hui, Chunxiao Liu, Bolei Zhou
Date:2021-10-26 16:20:23

Self-Driven Particles (SDP) describe a category of multi-agent systems common in everyday life, such as flocking birds and traffic flows. In an SDP system, each agent pursues its own goal and constantly changes its cooperative or competitive behaviors with its nearby agents. Manually designing controllers for such SDP systems is time-consuming, while the resulting emergent behaviors are often neither realistic nor generalizable. Thus the realistic simulation of SDP systems remains challenging. Reinforcement learning provides an appealing alternative for automating the development of controllers for SDP. However, previous multi-agent reinforcement learning (MARL) methods define the agents to be teammates or enemies beforehand, which fails to capture the essence of SDP, where the role of each agent varies between cooperative and competitive even within one episode. To simulate SDP with MARL, a key challenge is to coordinate agents' behaviors while still maximizing individual objectives. Taking traffic simulation as the testing bed, in this work we develop a novel MARL method called Coordinated Policy Optimization (CoPO), which incorporates social psychology principles to learn neural controllers for SDP. Experiments show that the proposed method achieves superior performance compared to MARL baselines on various metrics. Notably, the trained vehicles exhibit complex and diverse social behaviors that improve the performance and safety of the population as a whole. Demo video and source code are available at: https://decisionforce.github.io/CoPO/
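
One simple way to trade off individual and neighborhood objectives, in the spirit of coordinating self-interested agents, is to blend each agent's reward with the mean reward of its neighbors. The sketch below fixes the blend with a single angle parameter; CoPO itself learns a local coordination factor rather than fixing it, so treat this as an illustrative assumption.

    import numpy as np

    def coordinated_rewards(rewards, adjacency, phi=np.pi / 4):
        """Blend each agent's own reward with the mean reward of its
        neighbors (illustrative of coordinated objectives; CoPO learns
        this trade-off instead of fixing phi)."""
        neigh_mean = adjacency @ rewards / np.maximum(adjacency.sum(1), 1)
        return np.cos(phi) * rewards + np.sin(phi) * neigh_mean

    adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
    print(coordinated_rewards(np.array([1.0, 0.0, -1.0]), adj))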

Applications of Multi-Agent Reinforcement Learning in Future Internet: A Comprehensive Survey

Authors:Tianxu Li, Kun Zhu, Nguyen Cong Luong, Dusit Niyato, Qihui Wu, Yang Zhang, Bing Chen
Date:2021-10-26 08:26:55

The future Internet involves several emerging technologies such as 5G and beyond-5G networks, vehicular networks, unmanned aerial vehicle (UAV) networks, and the Internet of Things (IoT). Moreover, the future Internet is becoming heterogeneous and decentralized, with a large number of involved network entities. Each entity may need to make its local decisions to improve the network performance under dynamic and uncertain network environments. Standard learning algorithms such as single-agent Reinforcement Learning (RL) or Deep Reinforcement Learning (DRL) have recently been used to enable each network entity, as an agent, to learn an optimal decision-making policy adaptively through interacting with unknown environments. However, such algorithms fail to model the cooperation or competition among network entities, and simply treat other entities as part of the environment, which may result in the non-stationarity issue. Multi-agent Reinforcement Learning (MARL) allows each network entity to learn its optimal policy by observing not only the environment, but also other entities' policies. As a result, MARL can significantly improve the learning efficiency of the network entities, and it has recently been used to solve various issues in emerging networks. In this paper, we thus review the applications of MARL in emerging networks. In particular, we provide a tutorial of MARL and a comprehensive survey of applications of MARL in the next-generation Internet. We first introduce single-agent RL and MARL, and then review a number of applications of MARL for solving emerging issues in the future Internet. These issues include network access, transmit power control, computation offloading, content caching, packet routing, trajectory design for UAV-aided networks, and network security.

Multi-Agent Advisor Q-Learning

Authors:Sriram Ganapathi Subramanian, Matthew E. Taylor, Kate Larson, Mark Crowley
Date:2021-10-26 00:21:15

In the last decade, there have been significant advances in multi-agent reinforcement learning (MARL), but there are still numerous challenges, such as high sample complexity and slow convergence to stable policies, that need to be overcome before widespread deployment is possible. However, many real-world environments already, in practice, deploy sub-optimal or heuristic approaches for generating policies. An interesting question that arises is how to best use such approaches as advisors to help improve reinforcement learning in multi-agent domains. In this paper, we provide a principled framework for incorporating action recommendations from online sub-optimal advisors in multi-agent settings. We describe the problem of ADvising Multiple Intelligent Reinforcement Agents (ADMIRAL) in nonrestrictive general-sum stochastic game environments and present two novel Q-learning based algorithms: ADMIRAL - Decision Making (ADMIRAL-DM) and ADMIRAL - Advisor Evaluation (ADMIRAL-AE), which allow us to improve learning by appropriately incorporating advice from an advisor (ADMIRAL-DM), and to evaluate the effectiveness of an advisor (ADMIRAL-AE). We analyze the algorithms theoretically and provide fixed-point guarantees regarding their learning in general-sum stochastic games. Furthermore, extensive experiments illustrate that these algorithms can be used in a variety of environments, perform favourably compared to other related baselines, can scale to large state-action spaces, and are robust to poor advice from advisors.
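
A minimal sketch of advice incorporation: follow the advisor with a probability that decays over time, otherwise act greedily on the learned Q-function, which is updated as usual. ADMIRAL-DM's actual rule and its guarantees differ; this shows only the general pattern.

    import numpy as np

    def advised_action(Q, state, advisor_action, t, rng):
        """Follow the (possibly sub-optimal) advisor with decaying
        probability; otherwise act greedily on Q (illustrative sketch,
        not ADMIRAL-DM's exact rule)."""
        if rng.random() < 1.0 / np.sqrt(t + 1):
            return advisor_action
        return int(np.argmax(Q[state]))

    def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        """Standard Q-learning update on the chosen action."""
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])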

Common Information based Approximate State Representations in Multi-Agent Reinforcement Learning

Authors:Hsu Kao, Vijay Subramanian
Date:2021-10-25 02:32:06

Due to information asymmetry, finding optimal policies for Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) is hard with the complexity growing doubly exponentially in the horizon length. The challenge increases greatly in the multi-agent reinforcement learning (MARL) setting where the transition probabilities, observation kernel, and reward function are unknown. Here, we develop a general compression framework with approximate common and private state representations, based on which decentralized policies can be constructed. We derive the optimality gap of executing dynamic programming (DP) with the approximate states in terms of the approximation error parameters and the remaining time steps. When the compression is exact (no error), the resulting DP is equivalent to the one in existing work. Our general framework generalizes a number of methods proposed in the literature. The results shed light on designing practically useful deep-MARL network structures under the "centralized learning distributed execution" scheme.

Convergence Rates of Average-Reward Multi-agent Reinforcement Learning via Randomized Linear Programming

Authors:Alec Koppel, Amrit Singh Bedi, Bhargav Ganguly, Vaneet Aggarwal
Date:2021-10-22 03:48:41

In tabular multi-agent reinforcement learning with the average-cost criterion, a team of agents sequentially interacts with the environment and observes local incentives. We focus on the case where the global reward is a sum of local rewards, the joint policy factorizes into agents' marginals, and the state is fully observable. To date, few global optimality guarantees exist even for this simple setting, as most results yield convergence to stationarity for parameterized policies in large/possibly continuous spaces. To solidify the foundations of MARL, we build upon linear programming (LP) reformulations, for which stochastic primal-dual methods yield a model-free approach to achieve \emph{optimal sample complexity} in the centralized case. We develop multi-agent extensions, whereby agents solve their local saddle-point problems and then perform local weighted averaging. We establish that the sample complexity to obtain near-globally optimal solutions matches tight dependencies on the cardinality of the state and action spaces, and exhibits classical scalings with respect to the network in accordance with multi-agent optimization. Experiments corroborate these results in practice.

State-based Episodic Memory for Multi-Agent Reinforcement Learning

Authors:Xiao Ma, Wu-Jun Li
Date:2021-10-19 09:39:19

Multi-agent reinforcement learning (MARL) algorithms have made promising progress in recent years by leveraging the centralized training and decentralized execution (CTDE) paradigm. However, existing MARL algorithms still suffer from the sample inefficiency problem. In this paper, we propose a simple yet effective approach, called state-based episodic memory (SEM), to improve sample efficiency in MARL. SEM adopts episodic memory (EM) to supervise the centralized training procedure of CTDE in MARL. To the best of our knowledge, SEM is the first work to introduce EM into MARL. We can theoretically prove that, when used for MARL, SEM has lower space and time complexity than state-and-action-based EM (SAEM), which was originally proposed for single-agent reinforcement learning. Experimental results on the StarCraft multi-agent challenge (SMAC) show that introducing episodic memory into MARL can improve sample efficiency, and that SEM can reduce storage and time costs compared with SAEM.

Containerized Distributed Value-Based Multi-Agent Reinforcement Learning

Authors:Siyang Wu, Tonghan Wang, Chenghao Li, Yang Hu, Chongjie Zhang
Date:2021-10-15 15:54:06

Multi-agent reinforcement learning tasks put a high demand on the volume of training samples. Different from its single-agent counterpart, distributed value-based multi-agent reinforcement learning faces the unique challenges of demanding data transfer, inter-process communication management, and a high requirement for exploration. We propose a containerized learning framework to solve these problems. We pack into a container several environment instances, a local learner and buffer, and a carefully designed multi-queue manager which avoids blocking. Local policies of each container are encouraged to be as diverse as possible, and only trajectories with the highest priority are sent to a global learner. In this way, we achieve a scalable, time-efficient, and diverse distributed MARL learning framework with high system throughput. To our knowledge, our method is the first to solve the challenging Google Research Football full game $5\_v\_5$. On the StarCraft II micromanagement benchmark, our method gets $4$-$18\times$ better results compared to state-of-the-art non-distributed MARL algorithms.

Provably Efficient Multi-Agent Reinforcement Learning with Fully Decentralized Communication

Authors:Justin Lidard, Udari Madhushani, Naomi Ehrich Leonard
Date:2021-10-14 14:27:27

A challenge in reinforcement learning (RL) is minimizing the cost of sampling associated with exploration. Distributed exploration reduces sampling complexity in multi-agent RL (MARL). We investigate the benefits to performance in MARL when exploration is fully decentralized. Specifically, we consider a class of online, episodic, tabular $Q$-learning problems under time-varying reward and transition dynamics, in which agents can communicate in a decentralized manner. We show that group performance, as measured by the bound on regret, can be significantly improved through communication when each agent uses a decentralized message-passing protocol, even when limited to sending information up to its $\gamma$-hop neighbors. We prove regret and sample complexity bounds that depend on the number of agents, the communication network structure, and $\gamma$. We show that incorporating more agents and more information sharing into the group learning scheme speeds up convergence to the optimal policy. Numerical simulations illustrate our results and validate our theoretical claims.
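
The $\gamma$-hop constraint can be made concrete by computing, from the adjacency matrix, which agents are reachable within $\gamma$ hops; what is actually communicated along those paths is the paper's protocol, not shown here. An illustrative reachability computation:

    import numpy as np

    def gamma_hop_neighbors(adjacency, gamma):
        """Boolean reachability within gamma hops: the set of agents an
        agent may send information to under a gamma-hop constraint."""
        n = adjacency.shape[0]
        reach = np.eye(n, dtype=int)
        power = np.eye(n, dtype=int)
        for _ in range(gamma):
            power = (power @ adjacency > 0).astype(int)
            reach = ((reach + power) > 0).astype(int)
        return reach.astype(bool)

    line = np.array([[0, 1, 0, 0],
                     [1, 0, 1, 0],
                     [0, 1, 0, 1],
                     [0, 0, 1, 0]])
    print(gamma_hop_neighbors(line, 2).astype(int))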

On Improving Model-Free Algorithms for Decentralized Multi-Agent Reinforcement Learning

Authors:Weichao Mao, Lin F. Yang, Kaiqing Zhang, Tamer Başar
Date:2021-10-12 02:45:12

Multi-agent reinforcement learning (MARL) algorithms often suffer from an exponential sample complexity dependence on the number of agents, a phenomenon known as \emph{the curse of multiagents}. In this paper, we address this challenge by investigating sample-efficient model-free algorithms in \emph{decentralized} MARL, and aim to improve existing algorithms along this line. For learning (coarse) correlated equilibria in general-sum Markov games, we propose \emph{stage-based} V-learning algorithms that significantly simplify the algorithmic design and analysis of recent works, and circumvent a rather complicated no-\emph{weighted}-regret bandit subroutine. For learning Nash equilibria in Markov potential games, we propose an independent policy gradient algorithm with a decentralized momentum-based variance reduction technique. All our algorithms are decentralized in that each agent can make decisions based on only its local information. Neither communication nor centralized coordination is required during learning, leading to a natural generalization to a large number of agents. We also provide numerical simulations to corroborate our theoretical findings.

Learning to Coordinate in Multi-Agent Systems: A Coordinated Actor-Critic Algorithm and Finite-Time Guarantees

Authors:Siliang Zeng, Tianyi Chen, Alfredo Garcia, Mingyi Hong
Date:2021-10-11 20:26:16

Multi-agent reinforcement learning (MARL) has attracted much research attention recently. However, unlike its single-agent counterpart, many theoretical and algorithmic aspects of MARL are not well understood. In this paper, we study the emergence of coordinated behavior by autonomous agents using an actor-critic (AC) algorithm. Specifically, we propose and analyze a class of coordinated actor-critic (CAC) algorithms in which individually parametrized policies have a {\it shared} part (which is jointly optimized among all agents) and a {\it personalized} part (which is only locally optimized). This kind of {\it partially personalized} policy allows agents to learn to coordinate by leveraging peers' past experience and to adapt to individual tasks. The flexibility in our design allows the proposed MARL-CAC algorithm to be used in a {\it fully decentralized} setting, where the agents can only communicate with their neighbors, as well as in a {\it federated} setting, where the agents occasionally communicate with a server while optimizing their (partially personalized) local models. Theoretically, we show that under some standard regularity assumptions, the proposed MARL-CAC algorithm requires $\mathcal{O}(\epsilon^{-\frac{5}{2}})$ samples to achieve an $\epsilon$-stationary solution (defined as a solution whose squared norm of the gradient of the objective function is less than $\epsilon$). To the best of our knowledge, this work provides the first finite-sample guarantee for decentralized AC algorithms with partially personalized policies.

Satisficing Paths and Independent Multi-Agent Reinforcement Learning in Stochastic Games

Authors:Bora Yongacoglu, Gürdal Arslan, Serdar Yüksel
Date:2021-10-09 19:57:21

In multi-agent reinforcement learning (MARL), independent learners are those that do not observe the actions of other agents in the system. Due to the decentralization of information, it is challenging to design independent learners that drive play to equilibrium. This paper investigates the feasibility of using satisficing dynamics to guide independent learners to approximate equilibrium in stochastic games. For $\epsilon \geq 0$, an $\epsilon$-satisficing policy update rule is any rule that instructs the agent to not change its policy when it is $\epsilon$-best-responding to the policies of the remaining players; $\epsilon$-satisficing paths are defined to be sequences of joint policies obtained when each agent uses some $\epsilon$-satisficing policy update rule to select its next policy. We establish structural results on the existence of $\epsilon$-satisficing paths into $\epsilon$-equilibrium in both symmetric $N$-player games and general stochastic games with two players. We then present an independent learning algorithm for $N$-player symmetric games and give high probability guarantees of convergence to $\epsilon$-equilibrium under self-play. This guarantee is made using symmetry alone, leveraging the previously unexploited structure of $\epsilon$-satisficing paths.

Dissipation in 2D degenerate gases with non-vanishing rest mass

Authors:A. R. Mendez, A. L. Garcia-Perciante, G. Chacon-Acosta
Date:2021-10-06 17:43:22

The complete set of transport coefficients for two-dimensional relativistic degenerate gases is derived within a relaxation approximation in kinetic theory, considering both the particle and energy frames. A thorough comparison between Marle's and Anderson-Witting's models is carried out, pointing out the drawbacks of the former when compared both to the latter and to the full Boltzmann equation results in the non-degenerate limit. This task is accomplished by solving the relativistic Uehling-Uhlenbeck equation, in both the particle and energy frames, in order to establish the constitutive equations for the heat flux and the Navier tensor, together with analytical expressions for the transport coefficients in these representations. In particular, the temperature dependence of the thermal conductivity (associated with a generalized thermal force) and the bulk and shear viscosities are analyzed and compared within both models and with the non-degenerate, non-relativistic and ultra-relativistic limits.

Multi-Agent Constrained Policy Optimisation

Authors:Shangding Gu, Jakub Grudzien Kuba, Munning Wen, Ruiqing Chen, Ziyan Wang, Zheng Tian, Jun Wang, Alois Knoll, Yaodong Yang
Date:2021-10-06 14:17:09

Developing reinforcement learning algorithms that satisfy safety constraints is becoming increasingly important in real-world applications. In multi-agent reinforcement learning (MARL) settings, policy optimisation with safety awareness is particularly challenging because each individual agent has to not only meet its own safety constraints, but also consider those of others so that their joint behaviour can be guaranteed safe. Despite its importance, the problem of safe multi-agent learning has not been rigorously studied; very few solutions have been proposed, and there is no shareable testing environment or benchmark. To fill these gaps, in this work, we formulate the safe MARL problem as a constrained Markov game and solve it with policy optimisation methods. Our solutions -- Multi-Agent Constrained Policy Optimisation (MACPO) and MAPPO-Lagrangian -- leverage theories from both constrained policy optimisation and multi-agent trust region learning. Crucially, our methods enjoy theoretical guarantees of both monotonic improvement in reward and satisfaction of safety constraints at every iteration. To examine the effectiveness of our methods, we develop the benchmark suite of Safe Multi-Agent MuJoCo, which involves a variety of MARL baselines. Experimental results show that MACPO/MAPPO-Lagrangian consistently satisfy safety constraints while achieving performance comparable to strong baselines.
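
The Lagrangian variant can be sketched as a clipped PPO surrogate on the reward advantage minus a multiplier-weighted cost surrogate, with dual ascent on the multiplier. This is an illustrative sketch of the general Lagrangian recipe, not MACPO/MAPPO-Lagrangian's exact losses.

    import torch

    def lagrangian_ppo_loss(ratio, adv_reward, adv_cost, lam, clip=0.2):
        """Clipped surrogate on the reward advantage minus lambda times a
        pessimistically clipped cost surrogate (illustrative sketch)."""
        clipped = torch.clamp(ratio, 1 - clip, 1 + clip)
        reward_term = torch.min(ratio * adv_reward, clipped * adv_reward)
        cost_term = torch.max(ratio * adv_cost, clipped * adv_cost)
        return -(reward_term - lam * cost_term).mean()

    def multiplier_step(lam, mean_episode_cost, budget, lr=0.01):
        """Dual ascent: grow the multiplier while the average cost exceeds
        the budget, shrink it (down to zero) otherwise."""
        return max(0.0, lam + lr * (mean_episode_cost - budget))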

Cooperative Multi-Agent Actor-Critic for Privacy-Preserving Load Scheduling in a Residential Microgrid

Authors:Zhaoming Qin, Nanqing Dong, Eric P. Xing, Junwei Cao
Date:2021-10-06 14:05:26

As a scalable data-driven approach, multi-agent reinforcement learning (MARL) has made remarkable advances in solving cooperative residential load scheduling problems. However, the common centralized training strategy of MARL algorithms raises privacy risks for the involved households. In this work, we propose a privacy-preserving multi-agent actor-critic framework where the decentralized actors are trained with distributed critics, such that neither decentralized execution nor distributed training requires global state information. The proposed framework preserves the privacy of the households while implicitly learning the multi-agent credit assignment mechanism. Simulation experiments demonstrate that the proposed framework significantly outperforms the existing privacy-preserving actor-critic framework, and achieves performance comparable to the state-of-the-art actor-critic framework without privacy constraints.

Robustness and sample complexity of model-based MARL for general-sum Markov games

Authors:Jayakumar Subramanian, Amit Sinha, Aditya Mahajan
Date:2021-10-05 20:50:21

Multi-agent reinforcement learning (MARL) is often modeled using the framework of Markov games (also called stochastic games or dynamic games). Most of the existing literature on MARL concentrates on zero-sum Markov games but is not applicable to general-sum Markov games. It is known that the best-response dynamics in general-sum Markov games are not a contraction. Therefore, different equilibria in general-sum Markov games can have different values. Moreover, the Q-function is not sufficient to completely characterize the equilibrium. Given these challenges, model-based learning is an attractive approach for MARL in general-sum Markov games. In this paper, we investigate the fundamental question of \emph{sample complexity} for model-based MARL algorithms in general-sum Markov games. We show two results. We first use Hoeffding-inequality-based bounds to show that $\tilde{\mathcal{O}}( (1-\gamma)^{-4} \alpha^{-2})$ samples per state-action pair are sufficient to obtain an $\alpha$-approximate Markov perfect equilibrium with high probability, where $\gamma$ is the discount factor, and the $\tilde{\mathcal{O}}(\cdot)$ notation hides logarithmic terms. We then use Bernstein-inequality-based bounds to show that $\tilde{\mathcal{O}}( (1-\gamma)^{-1} \alpha^{-2} )$ samples are sufficient. To obtain these results, we study the robustness of Markov perfect equilibrium to model approximations. We show that the Markov perfect equilibrium of an approximate (or perturbed) game is always an approximate Markov perfect equilibrium of the original game, and provide explicit bounds on the approximation error. We illustrate the results via a numerical example.

Influencing Towards Stable Multi-Agent Interactions

Authors:Woodrow Z. Wang, Andy Shih, Annie Xie, Dorsa Sadigh
Date:2021-10-05 16:46:04

Learning in multi-agent environments is difficult due to the non-stationarity introduced by an opponent's or partner's changing behaviors. Instead of reactively adapting to the other agent's (opponent or partner) behavior, we propose an algorithm to proactively influence the other agent's strategy to stabilize, which can restrain the non-stationarity caused by the other agent. We learn a low-dimensional latent representation of the other agent's strategy and the dynamics of how the latent strategy evolves with respect to our robot's behavior. With this learned dynamics model, we can define an unsupervised stability reward to train our robot to deliberately influence the other agent to stabilize towards a single strategy. We demonstrate that stabilizing the other agent improves the efficiency of maximizing the task reward in a variety of simulated environments, including autonomous driving, emergent communication, and robotic manipulation. We show qualitative results on our website: https://sites.google.com/view/stable-marl/.

Divergence-Regularized Multi-Agent Actor-Critic

Authors:Kefan Su, Zongqing Lu
Date:2021-10-01 10:27:42

Entropy regularization is a popular method in reinforcement learning (RL). Although it has many advantages, it alters the RL objective of the original Markov Decision Process (MDP). Though divergence regularization has been proposed to settle this problem, it cannot be trivially applied to cooperative multi-agent reinforcement learning (MARL). In this paper, we investigate divergence regularization in cooperative MARL and propose a novel off-policy cooperative MARL framework, divergence-regularized multi-agent actor-critic (DMAC). Theoretically, we derive the update rule of DMAC, which is naturally off-policy and guarantees monotonic policy improvement and convergence in both the original MDP and the divergence-regularized MDP. We also give a bound on the discrepancy between the converged policy and the optimal policy in the original MDP. DMAC is a flexible framework and can be combined with many existing MARL algorithms. Empirically, we evaluate DMAC in a didactic stochastic game and the StarCraft Multi-Agent Challenge and show that DMAC substantially improves the performance of existing MARL algorithms.
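
A sketch of what a divergence-regularized actor loss can look like for a discrete policy: maximize expected Q under the new policy while penalizing the KL divergence to the previous policy. DMAC's derived update rule differs in detail; this snippet is illustrative.

    import torch
    import torch.nn.functional as F

    def divergence_regularized_actor_loss(logits, old_logits, q_values, beta=0.1):
        """Maximize expected Q under the new policy while penalizing the KL
        divergence to the previous policy (illustrative of divergence
        regularization, not DMAC's exact update)."""
        log_pi = F.log_softmax(logits, dim=-1)
        pi = log_pi.exp()
        old_log_pi = F.log_softmax(old_logits, dim=-1).detach()
        kl = (pi * (log_pi - old_log_pi)).sum(-1)
        expected_q = (pi * q_values).sum(-1)
        return (-expected_q + beta * kl).mean()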

Decentralized Graph-Based Multi-Agent Reinforcement Learning Using Reward Machines

Authors:Jueming Hu, Zhe Xu, Weichang Wang, Guannan Qu, Yutian Pang, Yongming Liu
Date:2021-09-30 21:41:55

In multi-agent reinforcement learning (MARL), it is challenging for a collection of agents to learn complex temporally extended tasks. The difficulties lie in the computational complexity and in learning the high-level ideas behind reward functions. We study the graph-based Markov Decision Process (MDP), where the dynamics of neighboring agents are coupled. We use a reward machine (RM) to encode each agent's task and expose the internal structure of reward functions. RMs have the capacity to describe high-level knowledge and encode non-Markovian reward functions. We propose a decentralized learning algorithm to tackle the computational complexity, called decentralized graph-based reinforcement learning using reward machines (DGRM), which equips each agent with a localized policy, allowing agents to make decisions independently based on the information available to them. DGRM uses the actor-critic structure, and we introduce the tabular Q-function for discrete-state problems. We show that the dependency of the Q-function on other agents decreases exponentially as the distance between them increases. Furthermore, the complexity of DGRM is related to the local information size of the largest $\kappa$-hop neighborhood, and DGRM can find an $O(\rho^{\kappa+1})$-approximation of a stationary point of the objective function. To further improve efficiency, we also propose the deep DGRM algorithm, which uses deep neural networks to approximate the Q-function and policy function to solve large-scale or continuous-state problems. The effectiveness of the proposed DGRM algorithm is evaluated in two case studies, UAV package delivery and COVID-19 pandemic mitigation. Experimental results show that local information is sufficient for DGRM and that agents can accomplish complex tasks with the help of RMs. DGRM improves the global accumulated reward by 119% compared to the baseline in the COVID-19 pandemic mitigation case.
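
A reward machine can be implemented as a small finite-state transducer: transitions are triggered by high-level events and emit rewards. The two-step "pick up, then deliver" task below is a hypothetical example, not one from the paper.

    class RewardMachine:
        """Minimal reward machine: finite states, transitions triggered by
        high-level events, rewards attached to transitions (illustrative)."""
        def __init__(self, transitions, initial=0):
            # transitions: dict (state, event) -> (next_state, reward)
            self.transitions = transitions
            self.state = initial

        def step(self, event):
            # unmatched events leave the state unchanged with zero reward
            self.state, reward = self.transitions.get(
                (self.state, event), (self.state, 0.0))
            return reward

    # hypothetical "pick up package, then deliver" task
    rm = RewardMachine({(0, 'pickup'): (1, 0.1), (1, 'deliver'): (2, 1.0)})
    print(rm.step('deliver'), rm.step('pickup'), rm.step('deliver'))  # 0.0 0.1 1.0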

Stratification of the transverse momentum map

Authors:Maarten Mol
Date:2021-09-28 17:04:06

Given a Hamiltonian action of a proper symplectic groupoid (for instance, a Hamiltonian action of a compact Lie group), we show that the transverse momentum map admits a natural constant rank stratification. To this end, we construct a refinement of the canonical stratification associated to the Lie groupoid action (the orbit type stratification, in the case of a Hamiltonian Lie group action) that seems not to have appeared before, even in the literature on Hamiltonian Lie group actions. This refinement turns out to be compatible with the Poisson geometry of the Hamiltonian action: it is a Poisson stratification of the orbit space, each stratum of which is a regular Poisson manifold that admits a natural proper symplectic groupoid integrating it. The main tools in our proofs (which we believe could be of independent interest) are a version of the Marle-Guillemin-Sternberg normal form theorem for Hamiltonian actions of proper symplectic groupoids and a notion of equivalence between Hamiltonian actions of symplectic groupoids, closely related to Morita equivalence between symplectic groupoids.

LINDA: Multi-Agent Local Information Decomposition for Awareness of Teammates

Authors:Jiahan Cao, Lei Yuan, Jianhao Wang, Shaowei Zhang, Chongjie Zhang, Yang Yu, De-Chuan Zhan
Date:2021-09-26 06:46:51

In cooperative multi-agent reinforcement learning (MARL), where agents only have access to partial observations, efficiently leveraging local information is critical. Over long observation horizons, agents can build \textit{awareness} of teammates to alleviate the problem of partial observability. However, previous MARL methods usually neglect this kind of utilization of local information. To address this problem, we propose a novel framework, multi-agent \textit{Local INformation Decomposition for Awareness of teammates} (LINDA), with which agents learn to decompose local information and build awareness of each teammate. We model the awareness as stochastic random variables and perform representation learning to ensure the informativeness of awareness representations by maximizing the mutual information between awareness and the actual trajectory of the corresponding agent. LINDA is agnostic to specific algorithms and can be flexibly integrated into different MARL methods. Extensive experiments show that the proposed framework learns informative awareness from local partial observations for better collaboration and significantly improves learning performance, especially on challenging tasks.

Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning

Authors:Jakub Grudzien Kuba, Ruiqing Chen, Muning Wen, Ying Wen, Fanglei Sun, Jun Wang, Yaodong Yang
Date:2021-09-23 09:44:35

Trust region methods rigorously enabled reinforcement learning (RL) agents to learn monotonically improving policies, leading to superior performance on a variety of tasks. Unfortunately, when it comes to multi-agent reinforcement learning (MARL), the property of monotonic improvement may not simply apply; this is because agents, even in cooperative games, could have conflicting directions of policy updates. As a result, achieving a guaranteed improvement on the joint policy where each agent acts individually remains an open challenge. In this paper, we extend the theory of trust region learning to MARL. Central to our findings are the multi-agent advantage decomposition lemma and the sequential policy update scheme. Based on these, we develop the Heterogeneous-Agent Trust Region Policy Optimisation (HATRPO) and Heterogeneous-Agent Proximal Policy Optimisation (HAPPO) algorithms. Unlike many existing MARL algorithms, HATRPO/HAPPO do not need agents to share parameters, nor do they need any restrictive assumptions on the decomposability of the joint value function. Most importantly, we justify in theory the monotonic improvement property of HATRPO/HAPPO. We evaluate the proposed methods on a series of Multi-Agent MuJoCo and StarCraft II tasks. Results show that HATRPO and HAPPO significantly outperform strong baselines such as IPPO, MAPPO and MADDPG on all tested tasks, thereby establishing a new state of the art.

Towards Multi-Agent Reinforcement Learning using Quantum Boltzmann Machines

Authors:Tobias Müller, Christoph Roch, Kyrill Schmid, Philipp Altmann
Date:2021-09-22 17:59:24

Reinforcement learning has driven impressive advances in machine learning. Simultaneously, quantum-enhanced machine learning algorithms using quantum annealing are under heavy development. Recently, a multi-agent reinforcement learning (MARL) architecture combining both paradigms has been proposed. This novel algorithm, which utilizes Quantum Boltzmann Machines (QBMs) for Q-value approximation, has outperformed regular deep reinforcement learning in terms of time-steps needed to converge. However, this algorithm was restricted to single-agent and small 2x2 multi-agent grid domains. In this work, we propose an extension to the original concept in order to solve more challenging problems. Similar to classic DQNs, we add an experience replay buffer and use different networks for approximating the target and policy values. The experimental results show that learning becomes more stable and enables agents to find optimal policies in grid domains with higher complexity. Additionally, we assess how parameter sharing influences the agents' behavior in multi-agent domains. Quantum sampling proves to be a promising method for reinforcement learning tasks, but it is currently limited by the QPU size and therefore by the size of the input and the Boltzmann machine.

Locality Matters: A Scalable Value Decomposition Approach for Cooperative Multi-Agent Reinforcement Learning

Authors:Roy Zohar, Shie Mannor, Guy Tennenholtz
Date:2021-09-22 10:08:15

Cooperative multi-agent reinforcement learning (MARL) faces significant scalability issues due to state and action spaces that are exponentially large in the number of agents. As environments grow in size, effective credit assignment becomes increasingly harder and often results in infeasible learning times. Still, in many real-world settings, there exist simplified underlying dynamics that can be leveraged for more scalable solutions. In this work, we exploit such locality structures effectively whilst maintaining global cooperation. We propose a novel, value-based multi-agent algorithm called LOMAQ, which incorporates local rewards in the Centralized Training Decentralized Execution paradigm. Additionally, we provide a direct reward decomposition method for finding these local rewards when only a global signal is provided. We test our method empirically, showing it scales well compared to other methods, significantly improving performance and convergence speed.

Regularize! Don't Mix: Multi-Agent Reinforcement Learning without Explicit Centralized Structures

Authors:Chapman Siu, Jason Traish, Richard Yi Da Xu
Date:2021-09-19 00:58:38

We propose {\em Multi-Agent Regularized Q-learning} (MARQ), which uses regularization for Multi-Agent Reinforcement Learning rather than learning explicit cooperative structures. Many MARL approaches leverage centralized structures in order to exploit global state information or to remove communication constraints when the agents act in a decentralized manner. Instead of learning redundant structures which are removed during agent execution, we propose to leverage the shared experiences of the agents to regularize the individual policies in order to promote structured exploration. We examine several different approaches to how MARQ can either explicitly or implicitly regularize our policies in a multi-agent setting. MARQ aims to address these limitations in the MARL context by applying regularization constraints which can correct bias in off-policy, out-of-distribution agent experiences and promote diverse exploration. Our algorithm is evaluated on several benchmark multi-agent environments, and we show that MARQ consistently outperforms several baselines and state-of-the-art algorithms, learning in fewer steps and converging to higher returns.

Greedy UnMixing for Q-Learning in Multi-Agent Reinforcement Learning

Authors:Chapman Siu, Jason Traish, Richard Yi Da Xu
Date:2021-09-19 00:35:18

This paper introduces Greedy UnMix (GUM) for cooperative multi-agent reinforcement learning (MARL). Greedy UnMix aims to avoid scenarios where MARL methods fail due to the overestimation of values in the large joint state-action space. It addresses this via a conservative Q-learning approach that restricts the state-marginal in the dataset to avoid unobserved joint state-action spaces, whilst concurrently attempting to unmix or simplify the problem space under the centralized training with decentralized execution paradigm. We demonstrate adherence to Q-function lower bounds in Q-learning for MARL scenarios, and demonstrate superior performance compared to existing Q-learning MARL approaches as well as more general MARL algorithms over a set of benchmark MARL tasks, despite its relative simplicity compared with state-of-the-art approaches.
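
The conservative component can be illustrated with a CQL-style regularizer that pushes down Q-values of actions outside the dataset relative to the actions actually taken; the unmixing component of GUM is separate and not shown here.

    import torch

    def conservative_penalty(q_all_actions, q_taken):
        """CQL-style regularizer: logsumexp over all actions minus the Q of
        the dataset action suppresses out-of-distribution action values
        (illustrative of the conservative ingredient, not GUM itself).
        q_all_actions: (batch, n_actions); q_taken: (batch,)."""
        return (torch.logsumexp(q_all_actions, dim=-1) - q_taken).mean()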

Learning Observation-Based Certifiable Safe Policy for Decentralized Multi-Robot Navigation

Authors:Yuxiang Cui, Longzhong Lin, Xiaolong Huang, Dongkun Zhang, Yue Wang, Rong Xiong
Date:2021-09-16 07:16:39

Safety is of great importance in multi-robot navigation problems. In this paper, we propose a control barrier function (CBF) based optimizer that ensures robot safety with both high probability and flexibility, using only sensor measurements. The optimizer takes action commands from the policy network as initial values and then refines them to drive potentially dangerous actions back into safe regions. With the help of a deep transition model that predicts the evolution of surrounding dynamics and the consequences of different actions, the CBF module can guide the optimization over a reasonable time horizon. We also present a novel joint training framework that improves the cooperation between the Reinforcement Learning (RL) based policy and the CBF-based optimizer in both training and inference procedures by utilizing reward feedback from the CBF module. We observe that a policy using our method can achieve a higher success rate while maintaining the safety of multiple robots in significantly fewer episodes compared with other methods. Experiments are conducted in multiple scenarios, both in simulation and the real world, and the results demonstrate the effectiveness of our method in maintaining the safety of multi-robot navigation. Code is available at \url{https://github.com/YuxiangCui/MARL-OCBF}.

Multi-Agent Deep Reinforcement Learning For Persistent Monitoring With Sensing, Communication, and Localization Constraints

Authors:Manav Mishra, Prithvi Poddar, Rajat Agarwal, Jingxi Chen, Pratap Tokekar, P. B. Sujit
Date:2021-09-14 17:16:37

Determining multi-robot motion policies for persistently monitoring a region with limited sensing, communication, and localization constraints in non-GPS environments is a challenging problem. To take the localization constraints into account, in this paper, we consider a heterogeneous robotic system consisting of two types of agents: anchor agents with accurate localization capability and auxiliary agents with low localization accuracy. To localize itself, an auxiliary agent must be within the communication range of an {anchor}, directly or indirectly. The robotic team's objective is to minimize environmental uncertainty through persistent monitoring. We propose a multi-agent deep reinforcement learning (MARL) based architecture with graph convolution called Graph Localized Proximal Policy Optimization (GALOPP), which incorporates the limited sensor field-of-view, communication, and localization constraints of the agents along with persistent monitoring objectives to determine motion policies for each agent. We evaluate the performance of GALOPP on open maps with obstacles, with different numbers of anchor and auxiliary agents. We further study (i) the effect of communication range, obstacle density, and sensing range on the performance, and (ii) compare the performance of GALOPP with non-RL baselines, namely greedy search, random search, and random search with a communication constraint. To assess its generalization capability, we also evaluate GALOPP in two different environments -- 2-room and 4-room. The results show that GALOPP learns the policies and monitors the area well. As a proof of concept, we perform hardware experiments to demonstrate the performance of GALOPP.

ROMAX: Certifiably Robust Deep Multiagent Reinforcement Learning via Convex Relaxation

Authors:Chuangchuang Sun, Dong-Ki Kim, Jonathan P. How
Date:2021-09-14 16:18:35

In a multirobot system, a number of cyber-physical attacks (e.g., communication hijack, observation perturbations) can challenge the robustness of agents. This robustness issue worsens in multiagent reinforcement learning because of the non-stationarity of the environment, caused by simultaneously learning agents whose changing policies affect the transition and reward functions. In this paper, we propose a minimax MARL approach to infer the worst-case policy update of the other agents. As the minimax formulation is computationally intractable to solve, we apply the convex relaxation of neural networks to solve the inner minimization problem. Such convex relaxation enables robustness in interacting with peer agents that may have significantly different behaviors and also achieves a certified bound on the original optimization problem. We evaluate our approach on multiple mixed cooperative-competitive tasks and show that our method outperforms the previous state-of-the-art approaches on this topic.

Exploration in Deep Reinforcement Learning: From Single-Agent to Multiagent Domain

Authors:Jianye Hao, Tianpei Yang, Hongyao Tang, Chenjia Bai, Jinyi Liu, Zhaopeng Meng, Peng Liu, Zhen Wang
Date:2021-09-14 13:16:33

Deep Reinforcement Learning (DRL) and Deep Multi-agent Reinforcement Learning (MARL) have achieved significant success across a wide range of domains, including game AI, autonomous vehicles, and robotics. However, DRL and deep MARL agents are widely known to be sample inefficient: millions of interactions are usually needed even for relatively simple problem settings, preventing wide application and deployment in real-industry scenarios. One key bottleneck is the well-known exploration problem, i.e., how to efficiently explore the environment and collect informative experiences that could benefit policy learning towards optimal policies. This problem becomes more challenging in complex environments with sparse rewards, noisy distractions, long horizons, and non-stationary co-learners. In this paper, we conduct a comprehensive survey on existing exploration methods for both single-agent and multi-agent RL. We start the survey by identifying several key challenges to efficient exploration. Beyond these two main branches, we also include other notable exploration methods with different ideas and techniques. In addition to algorithmic analysis, we provide a comprehensive and unified empirical comparison of different exploration methods for DRL on a set of commonly used benchmarks. Based on our algorithmic and empirical investigation, we finally summarize the open problems of exploration in DRL and deep MARL and point out a few future directions.

On the Approximation of Cooperative Heterogeneous Multi-Agent Reinforcement Learning (MARL) using Mean Field Control (MFC)

Authors:Washim Uddin Mondal, Mridul Agarwal, Vaneet Aggarwal, Satish V. Ukkusuri
Date:2021-09-09 03:52:49

Mean field control (MFC) is an effective way to mitigate the curse of dimensionality of cooperative multi-agent reinforcement learning (MARL) problems. This work considers a collection of $N_{\mathrm{pop}}$ heterogeneous agents that can be segregated into $K$ classes such that the $k$-th class contains $N_k$ homogeneous agents. We aim to prove approximation guarantees of the MARL problem for this heterogeneous system by its corresponding MFC problem. We consider three scenarios where the reward and transition dynamics of all agents are respectively taken to be functions of $(1)$ joint state and action distributions across all classes, $(2)$ individual distributions of each class, and $(3)$ marginal distributions of the entire population. We show that, in these cases, the $K$-class MARL problem can be approximated by MFC with errors given as $e_1=\mathcal{O}(\frac{\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}}{N_{\mathrm{pop}}}\sum_{k}\sqrt{N_k})$, $e_2=\mathcal{O}(\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]\sum_{k}\frac{1}{\sqrt{N_k}})$ and $e_3=\mathcal{O}\left(\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]\left[\frac{A}{N_{\mathrm{pop}}}\sum_{k\in[K]}\sqrt{N_k}+\frac{B}{\sqrt{N_{\mathrm{pop}}}}\right]\right)$, respectively, where $A, B$ are some constants and $|\mathcal{X}|,|\mathcal{U}|$ are the sizes of state and action spaces of each agent. Finally, we design a Natural Policy Gradient (NPG) based algorithm that, in the three cases stated above, can converge to an optimal MARL policy within $\mathcal{O}(e_j)$ error with a sample complexity of $\mathcal{O}(e_j^{-3})$, $j\in\{1,2,3\}$, respectively.
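
As a rough numerical illustration of how these rates scale, one can tabulate the three error terms for a toy population; the constants $A$ and $B$ are unknown, so they are set to 1 here purely as an assumption.

```python
import numpy as np

def mfc_errors(Nk, X, U, A=1.0, B=1.0):
    """Evaluate the rates e1, e2, e3 for class sizes Nk, state-space size X,
    and action-space size U (constants A, B set to 1 for illustration)."""
    Nk = np.asarray(Nk, dtype=float)
    Npop = Nk.sum()
    s = np.sqrt(X) + np.sqrt(U)
    e1 = s / Npop * np.sqrt(Nk).sum()
    e2 = s * (1.0 / np.sqrt(Nk)).sum()
    e3 = s * (A / Npop * np.sqrt(Nk).sum() + B / np.sqrt(Npop))
    return e1, e2, e3

# Scaling every class size by c shrinks all three rates by a factor 1/sqrt(c).
for scale in (1, 2, 4):
    print(scale, mfc_errors([scale * 100, scale * 400], X=10, U=5))
```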

On the Complexity of Computing Markov Perfect Equilibrium in General-Sum Stochastic Games

Authors:Xiaotie Deng, Ningyuan Li, David Mguni, Jun Wang, Yaodong Yang
Date:2021-09-04 05:47:59

Similar to the role of Markov decision processes in reinforcement learning, Stochastic Games (SGs) lay the foundation for the study of multi-agent reinforcement learning (MARL) and sequential agent interactions. In this paper, we show that computing an approximate Markov Perfect Equilibrium (MPE) in a finite-state discounted Stochastic Game to exponential precision is PPAD-complete. We adopt a function with a polynomially bounded description in the strategy space to convert the MPE computation to a fixed-point problem, even though the stochastic game may demand an exponential number of pure strategies, in the number of states, for each agent. The completeness result follows from the reduction of the fixed-point problem to End of the Line. Our results indicate that finding an MPE in SGs is highly unlikely to be NP-hard unless NP = co-NP. Our work offers confidence for MARL research to study MPE computation on general-sum SGs and to develop fruitful algorithms as currently exist for zero-sum SGs.

Multi-agent Natural Actor-critic Reinforcement Learning Algorithms

Authors:Prashant Trivedi, Nandyala Hemachandra
Date:2021-09-03 17:57:17

Multi-agent actor-critic algorithms are an important part of the Reinforcement Learning paradigm. We propose three fully decentralized multi-agent natural actor-critic (MAN) algorithms in this work. The objective is to collectively find a joint policy that maximizes the average long-term return of the agents. In the absence of a central controller, and to preserve privacy, agents communicate some information to their neighbors via a time-varying communication network. We prove the convergence of all three MAN algorithms to a globally asymptotically stable set of the ODE corresponding to the actor update; these algorithms use linear function approximation. We show that the Kullback-Leibler divergence between policies of successive iterates is proportional to the objective function's gradient. We observe that the minimum singular value of the Fisher information matrix is well within the reciprocal of the policy parameter dimension. Using this, we theoretically show that the optimal value of the deterministic variant of the MAN algorithm at each iterate dominates that of the standard gradient-based multi-agent actor-critic (MAAC) algorithm. To our knowledge, this is the first such result in multi-agent reinforcement learning (MARL). To illustrate the usefulness of the proposed algorithms, we implement them on a bi-lane traffic network to reduce average network congestion. We observe an almost 25% reduction in average congestion with two of the MAN algorithms; the average congestion under the third MAN algorithm is on par with the MAAC algorithm. We also consider a generic 15-agent MARL setting; the performance of the MAN algorithms is again as good as that of the MAAC algorithm.
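
The natural-gradient step at the heart of these algorithms preconditions the vanilla policy gradient by the inverse Fisher information matrix. Below is a minimal sketch for a linear-softmax policy; the damping term, step size, and problem sizes are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def natural_gradient_step(theta, states, actions, advantages, lr=0.1, damp=1e-3):
    """One natural actor update for a linear-softmax policy.
    theta: (n_actions, d); states: (T, d); actions: (T,).
    The vanilla gradient is preconditioned by the inverse Fisher matrix
    estimated from the score functions of the sampled state-action pairs."""
    n_a, d = theta.shape
    scores = np.zeros((len(states), n_a * d))
    for t, (s, a) in enumerate(zip(states, actions)):
        pi = softmax(theta @ s)
        g = -np.outer(pi, s)   # d log pi(a|s) / d theta, all rows
        g[a] += s              # plus the indicator of the chosen action
        scores[t] = g.ravel()
    grad = (advantages[:, None] * scores).mean(axis=0)            # vanilla PG
    fisher = scores.T @ scores / len(states) + damp * np.eye(n_a * d)
    return theta + lr * np.linalg.solve(fisher, grad).reshape(n_a, d)

rng = np.random.default_rng(1)
theta = np.zeros((3, 4))
S, A, adv = rng.normal(size=(32, 4)), rng.integers(0, 3, 32), rng.normal(size=32)
theta = natural_gradient_step(theta, S, A, adv)
```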

Is Machine Learning Ready for Traffic Engineering Optimization?

Authors:Guillermo Bernárdez, José Suárez-Varela, Albert López, Bo Wu, Shihan Xiao, Xiangle Cheng, Pere Barlet-Ros, Albert Cabellos-Aparicio
Date:2021-09-03 11:10:14

Traffic Engineering (TE) is a basic building block of the Internet. In this paper, we analyze whether modern Machine Learning (ML) methods are ready to be used for TE optimization. We address this open question through a comparative analysis between the state of the art in ML and the state of the art in TE. To this end, we first present a novel distributed system for TE that leverages the latest advancements in ML. Our system implements a novel architecture that combines Multi-Agent Reinforcement Learning (MARL) and Graph Neural Networks (GNN) to minimize network congestion. In our evaluation, we compare our MARL+GNN system with DEFO, a network optimizer based on Constraint Programming that represents the state of the art in TE. Our experimental results show that the proposed MARL+GNN solution achieves equivalent performance to DEFO in a wide variety of network scenarios including three real-world network topologies. At the same time, we show that MARL+GNN can achieve significant reductions in execution time (from the scale of minutes with DEFO to a few seconds with our solution).

Direct Numerical Simulations of Cosmic-ray Acceleration at Dense Circumstellar Medium: Magnetic Field Amplification by Bell Instability and Maximum Energy

Authors:Tsuyoshi Inoue, Alexandre Marcowith, Gwenael Giacinti, Allard Jan van Marle, Shogo Nishino
Date:2021-08-30 18:00:02

Galactic cosmic rays are believed to be accelerated at supernova remnants. However, whether supernova remnants can be Pevatrons is still very unclear. In this work we argue that PeV cosmic rays can be accelerated during the early phase of a supernova blast wave's expansion into dense red supergiant winds. We solve, in spherical geometry, a system combining a diffusion-convection equation treating cosmic-ray dynamics coupled to magnetohydrodynamics following the gas dynamics. The fast shock expanding in a dense ionized wind is able to trigger the fast non-resonant streaming instability over day timescales, and energizes cosmic rays even under the effect of proton-proton losses. We find that such environments make the blast wave a Pevatron, although the maximum energy depends on various parameters such as the injection rate and the mass-loss rate of the winds. Multi-PeV energies can be reached if the progenitor mass-loss rates are of the order of $10^{-3}$ Msun yr$^{-1}$. It has recently been proposed that, prior to the explosion, hydrogen-rich massive stars can produce enhanced mass-loss rates. These enhanced rates would then favor the production of a Pevatron phase at early times after the shock breakout.

Influence-Based Reinforcement Learning for Intrinsically-Motivated Agents

Authors:Ammar Fayad, Majd Ibrahim
Date:2021-08-28 05:36:10

Discovering successful coordinated behaviors is a central challenge in Multi-Agent Reinforcement Learning (MARL), since it requires exploring a joint action space that grows exponentially with the number of agents. In this paper, we propose a mechanism for achieving sufficient exploration and coordination in a team of agents. Specifically, agents are rewarded for contributing to a more diversified team behavior through proper intrinsic motivation functions. To learn meaningful coordination protocols, we structure agents' interactions with a novel framework in which, at each timestep, an agent simulates counterfactual rollouts of its policy and, through a sequence of computations, assesses the gap between other agents' current behaviors and their targets. Actions that minimize this gap are considered highly influential and are rewarded. We evaluate our approach on a set of challenging tasks with sparse rewards and partial observability that require learning complex cooperative strategies under a proper exploration scheme, such as the StarCraft Multi-Agent Challenge. Our method shows significantly improved performance over different baselines across all tasks.

Settling the Variance of Multi-Agent Policy Gradients

Authors:Jakub Grudzien Kuba, Muning Wen, Yaodong Yang, Linghui Meng, Shangding Gu, Haifeng Zhang, David Henry Mguni, Jun Wang
Date:2021-08-19 10:49:10

Policy gradient (PG) methods are popular reinforcement learning (RL) methods in which a baseline is often applied to reduce the variance of gradient estimates. In multi-agent RL (MARL), although the PG theorem can be naturally extended, the effectiveness of multi-agent PG (MAPG) methods degrades as the variance of gradient estimates increases rapidly with the number of agents. In this paper, we offer a rigorous analysis of MAPG methods by, firstly, quantifying the contributions of the number of agents and of the agents' exploration to the variance of MAPG estimators. Based on this analysis, we derive the optimal baseline (OB) that achieves the minimal variance. In comparison to the OB, we measure the excess variance of existing MARL algorithms such as vanilla MAPG and COMA. For settings with deep neural networks, we also propose a surrogate version of the OB, which can be seamlessly plugged into any existing PG method in MARL. On Multi-Agent MuJoCo and StarCraft benchmarks, our OB technique effectively stabilises training and improves the performance of multi-agent PPO and COMA by a significant margin.
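
A toy numerical check of the underlying fact that a well-chosen baseline leaves the gradient estimate unbiased while cutting its variance; this uses a generic state-value baseline in a one-step bandit, not the paper's optimal baseline.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, T = 4, 200_000
pi = np.full(n_actions, 1.0 / n_actions)      # uniform softmax policy
q = np.array([1.0, 2.0, 3.0, 4.0])            # true action values

a = rng.choice(n_actions, size=T, p=pi)
r = q[a] + rng.normal(size=T)                 # noisy sampled returns

def grad_estimates(b):
    # REINFORCE gradient for logit k: (1{a=k} - pi_k) * (r - b)
    ind = (a[:, None] == np.arange(n_actions)).astype(float)
    return (ind - pi) * (r - b)[:, None]

for b, name in [(0.0, "no baseline"), ((pi * q).sum(), "V(s) baseline")]:
    g = grad_estimates(b)
    print(name, g.mean(axis=0).round(3), g.var(axis=0).round(3))
# Both estimators have the same mean (unbiased); the baseline shrinks variance.
```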

Graph Attention Network-based Multi-agent Reinforcement Learning for Slicing Resource Management in Dense Cellular Network

Authors:Yan Shao, Rongpeng Li, Bing Hu, Yingxiao Wu, Zhifeng Zhao, Honggang Zhang
Date:2021-08-11 07:02:52

Network slicing (NS) management aims to provide various services with distinct requirements over the same physical communication infrastructure while allocating resources on demand. In a dense cellular network scenario containing several slices over multiple base stations (BSs), it remains challenging to design a proper real-time inter-slice resource management strategy that copes with frequent BS handovers and satisfies the fluctuating requirements of distinct services. In this paper, we propose to formulate this challenge as a multi-agent reinforcement learning (MARL) problem in which each BS represents an agent. We then leverage a graph attention network (GAT) to strengthen temporal and spatial cooperation between agents. Furthermore, we incorporate the GAT into deep reinforcement learning (DRL) and correspondingly design an intelligent real-time inter-slice resource management strategy. More specifically, we verify the general effectiveness of GAT for advancing DRL in multi-agent systems by applying it on top of both the value-based deep Q-network (DQN) method and advantage actor-critic (A2C), a combination of policy-based and value-based methods. Finally, we verify the superiority of the GAT-based MARL algorithms through extensive simulations.
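
For concreteness, here is a minimal single-head graph-attention layer of the kind used to mix features across BS agents; the dimensions, fully connected adjacency, and LeakyReLU slope (the usual GAT default) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GATLayer(nn.Module):
    """Single-head graph attention: each agent attends over its neighbors'
    linearly transformed features, with attention weights learned from
    concatenated feature pairs."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)
        self.leaky = nn.LeakyReLU(0.2)

    def forward(self, h, adj):
        # h: (n, in_dim); adj: (n, n) with 1 where BSs can exchange info
        z = self.W(h)                                      # (n, out_dim)
        n = z.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                           z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = self.leaky(self.a(pairs)).squeeze(-1)          # (n, n) raw scores
        e = e.masked_fill(adj == 0, float("-inf"))         # neighbors only
        alpha = torch.softmax(e, dim=-1)                   # attention weights
        return torch.relu(alpha @ z)

# Toy usage: 5 base-station agents with 8-dim slice-state features.
h = torch.randn(5, 8)
adj = torch.ones(5, 5)            # fully connected for illustration
out = GATLayer(8, 16)(h, adj)     # (5, 16) attention-mixed embeddings
```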

Semantic Tracklets: An Object-Centric Representation for Visual Multi-Agent Reinforcement Learning

Authors:Iou-Jen Liu, Zhongzheng Ren, Raymond A. Yeh, Alexander G. Schwing
Date:2021-08-06 22:19:09

Solving complex real-world tasks, e.g., autonomous fleet control, often involves a coordinated team of multiple agents which learn strategies from visual inputs via reinforcement learning. However, many existing multi-agent reinforcement learning (MARL) algorithms do not scale to environments where agents operate on visual inputs. To address this issue, algorithmically, recent works have focused on non-stationarity and exploration. In contrast, we study whether scalability can also be achieved via a disentangled representation. For this, we explicitly construct an object-centric intermediate representation to characterize the states of an environment, which we refer to as "semantic tracklets." We evaluate semantic tracklets on the visual multi-agent particle environment (VMPE) and on the challenging visual multi-agent GFootball environment. Semantic tracklets consistently outperform baselines on VMPE and achieve a +2.4 higher score difference than baselines on GFootball. Notably, this method is the first to successfully learn a strategy for five players in the GFootball environment using only visual data.

Mean-Field Multi-Agent Reinforcement Learning: A Decentralized Network Approach

Authors:Haotian Gu, Xin Guo, Xiaoli Wei, Renyuan Xu
Date:2021-08-05 16:52:36

One of the challenges for multi-agent reinforcement learning (MARL) is designing efficient learning algorithms for a large system in which each agent has only limited or partial information about the entire system. While exciting progress has been made in analyzing decentralized MARL with a network of agents, e.g., for social networks and team video games, little is known theoretically about decentralized MARL with a network of states, which models self-driving vehicles, ride-sharing, and data and traffic routing. This paper proposes a framework of localized training and decentralized execution to study MARL with a network of states. Localized training means that agents only need to collect local information in their neighboring states during the training phase; decentralized execution implies that agents can afterwards execute the learned decentralized policies, which depend only on agents' current states. The theoretical analysis consists of three key components: the first is the reformulation of the MARL system as a networked Markov decision process with teams of agents, enabling updating the associated team Q-function in a localized fashion; the second is the Bellman equation for the value function and the appropriate Q-function on the probability measure space; and the third is the exponential decay property of the team Q-function, facilitating its approximation with good sample efficiency and controllable error. The theoretical analysis paves the way for a new algorithm, LTDE-Neural-AC, an actor-critic approach with over-parameterized neural networks. Its convergence and sample complexity are established and shown to be scalable with respect to the sizes of both agents and states. To the best of our knowledge, this is the first neural-network-based MARL algorithm with network structure and a provable convergence guarantee.

Survey of Recent Multi-Agent Reinforcement Learning Algorithms Utilizing Centralized Training

Authors:Piyush K. Sharma, Rolando Fernandez, Erin Zaroukian, Michael Dorothy, Anjon Basak, Derrik E. Asher
Date:2021-07-29 20:29:12

Much work has been dedicated to the exploration of Multi-Agent Reinforcement Learning (MARL) paradigms implementing a centralized learning with decentralized execution (CLDE) approach to achieve human-like collaboration in cooperative tasks. Here, we discuss variations of centralized training and describe a recent survey of algorithmic approaches. The goal is to explore how different implementations of information-sharing mechanisms in centralized learning may give rise to distinct group coordinated behaviors in multi-agent systems performing cooperative tasks.

Decentralized Multi-Agent Reinforcement Learning for Task Offloading Under Uncertainty

Authors:Yuanchao Xu, Amal Feriani, Ekram Hossain
Date:2021-07-16 20:49:30

Multi-Agent Reinforcement Learning (MARL) is a challenging subarea of Reinforcement Learning due to the non-stationarity of the environments and the large dimensionality of the combined action space. Deep MARL algorithms have been applied to solve different task offloading problems. However, in real-world applications, the information required by the agents (i.e., rewards and states) is subject to noise and alterations. The stability and robustness of deep MARL under such practical challenges remain open research problems. In this work, we apply state-of-the-art MARL algorithms to solve task offloading with reward uncertainty. We show that perturbations in the reward signal can degrade performance compared to learning with perfect rewards. We expect this paper to stimulate more research into studying and addressing the practical challenges of deploying deep MARL solutions in wireless communication systems.

Scalable Evaluation of Multi-Agent Reinforcement Learning with Melting Pot

Authors:Joel Z. Leibo, Edgar Duéñez-Guzmán, Alexander Sasha Vezhnevets, John P. Agapiou, Peter Sunehag, Raphael Koster, Jayd Matyas, Charles Beattie, Igor Mordatch, Thore Graepel
Date:2021-07-14 17:22:14

Existing evaluation suites for multi-agent reinforcement learning (MARL) do not assess generalization to novel situations as their primary objective (unlike supervised-learning benchmarks). Our contribution, Melting Pot, is a MARL evaluation suite that fills this gap, and uses reinforcement learning to reduce the human labor required to create novel test scenarios. This works because one agent's behavior constitutes (part of) another agent's environment. To demonstrate scalability, we have created over 80 unique test scenarios covering a broad range of research topics such as social dilemmas, reciprocity, resource sharing, and task partitioning. We apply these test scenarios to standard MARL training algorithms, and demonstrate how Melting Pot reveals weaknesses not apparent from training performance alone.

Transfer Learning in Multi-Agent Reinforcement Learning with Double Q-Networks for Distributed Resource Sharing in V2X Communication

Authors:Hammad Zafar, Zoran Utkovski, Martin Kasparick, Slawomir Stanczak
Date:2021-07-13 15:50:10

This paper addresses the problem of decentralized spectrum sharing in vehicle-to-everything (V2X) communication networks. The aim is to provide resource-efficient coexistence of vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) links. A recent work on the topic proposes a multi-agent reinforcement learning (MARL) approach based on deep Q-learning, which leverages a fingerprint-based deep Q-network (DQN) architecture. This work considers an extension of that framework combining Double Q-learning (via Double DQN) and transfer learning. The motivation is that Double Q-learning can alleviate the overestimation of action values present in conventional Q-learning, while transfer learning can leverage knowledge acquired by an expert model to accelerate learning in the MARL setting. The proposed algorithm is evaluated in a realistic V2X setting, with synthetic data generated by a geometry-based propagation model that incorporates location-specific geographical descriptors of the simulated environment (outlines of buildings, foliage, and vehicles). The advantages of the proposed approach are demonstrated via numerical simulations.
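
The Double DQN component boils down to one change in the target computation: the online network selects the greedy action and the target network evaluates it. A minimal sketch (network shapes are illustrative):

```python
import torch
import torch.nn as nn

def double_dqn_target(online, target, reward, next_obs, gamma=0.99, done=None):
    """Double Q-learning target: decoupling action selection (online net)
    from action evaluation (target net) curbs the overestimation bias of
    plain DQN, where a single network does both."""
    with torch.no_grad():
        a_star = online(next_obs).argmax(dim=1, keepdim=True)    # select
        q_next = target(next_obs).gather(1, a_star).squeeze(1)   # evaluate
        if done is not None:
            q_next = q_next * (1.0 - done)
        return reward + gamma * q_next

# Toy usage with identical MLPs as the online/target Q-networks.
make_net = lambda: nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 4))
online, target = make_net(), make_net()
target.load_state_dict(online.state_dict())
y = double_dqn_target(online, target,
                      reward=torch.zeros(32), next_obs=torch.randn(32, 10))
```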

Effects of Smart Traffic Signal Control on Air Quality

Authors:Paolo Fazzini, Marco Torre, Valeria Rizza, Francesco Petracchini
Date:2021-07-06 02:48:42

Adaptive traffic signal control (ATSC) in urban traffic networks poses a challenging task due to the complicated dynamics arising in traffic systems. In recent years, several approaches based on multi-agent deep reinforcement learning (MARL) have been studied experimentally. These approaches propose distributed techniques in which each signalized intersection is seen as an agent in a stochastic game whose purpose is to optimize the flow of vehicles in its vicinity. In this setting, the system evolves towards an equilibrium among the agents that proves beneficial for the whole traffic network. A recently developed multi-agent variant of the well-established advantage actor-critic (A2C) algorithm, called MA2C (multi-agent A2C), exploits the promising idea of limited communication among agents. In this scheme, agents share their strategies with neighboring agents, thereby stabilizing the learning process even as the agents grow in number and variety. We evaluated MA2C on two traffic networks located in Bologna (Italy) and found that its control translates into a significant decrease in the amount of pollutants released into the environment.

Mava: a research library for distributed multi-agent reinforcement learning in JAX

Authors:Ruan de Kock, Omayma Mahjoub, Sasha Abramowitz, Wiem Khlifi, Callum Rhys Tilbury, Claude Formanek, Andries Smit, Arnu Pretorius
Date:2021-07-03 16:23:31

Multi-agent reinforcement learning (MARL) research is inherently computationally expensive, and it is often difficult to obtain a sufficient number of experiment samples to test hypotheses and make robust statistical claims. Furthermore, MARL algorithms are typically complex in their design and can be tricky to implement correctly. These aspects of MARL present a difficult challenge when it comes to creating useful software for advanced research. Our criteria for such software are that it should be simple enough to implement new ideas quickly, while at the same time being scalable and fast enough to test those ideas in a reasonable amount of time. In this preliminary technical report, we introduce Mava, a research library for MARL written purely in JAX, that aims to fulfill these criteria. We discuss the design and core features of Mava, and demonstrate its use and performance across a variety of environments. In particular, we show Mava's substantial speed advantage, with improvements of 10-100x compared to other popular MARL frameworks, while maintaining strong performance. This allows researchers to test ideas in a few minutes instead of several hours. Finally, Mava forms part of an ecosystem of libraries that seamlessly integrate with each other to help facilitate advanced research in MARL. We hope Mava will benefit the community and help drive scientifically sound and statistically robust research in the field. The open-source repository for Mava is available at https://github.com/instadeepai/Mava.

Collaborative Visual Navigation

Authors:Haiyang Wang, Wenguan Wang, Xizhou Zhu, Jifeng Dai, Liwei Wang
Date:2021-07-02 15:48:16

As a fundamental problem for Artificial Intelligence, the multi-agent system (MAS) is making rapid progress, mainly driven by multi-agent reinforcement learning (MARL) techniques. However, previous MARL methods have largely focused on grid-world-like or game environments; MAS in visually rich environments has remained less explored. To narrow this gap and emphasize the crucial role of perception in MAS, we propose a large-scale 3D dataset, CollaVN, for multi-agent visual navigation (MAVN). In CollaVN, multiple agents are required to cooperatively navigate across photo-realistic environments to reach target locations. Diverse MAVN variants are explored to make our problem more general. Moreover, a memory-augmented communication framework is proposed. Each agent is equipped with a private, external memory to persistently store communication information. This allows agents to make better use of their past communication information, enabling more efficient collaboration and robust long-term planning. In our experiments, several baselines and evaluation metrics are designed. We also empirically verify the efficacy of our proposed MARL approach across different MAVN task settings.

Federated Dynamic Spectrum Access

Authors:Yifei Song, Hao-Hsuan Chang, Zhou Zhou, Shashank Jere, Lingjia Liu
Date:2021-06-28 20:49:41

Due to the growing volume of data traffic produced by the surge of Internet of Things (IoT) devices, the demand for radio spectrum resources is approaching the limits defined by the Federal Communications Commission (FCC). To this end, Dynamic Spectrum Access (DSA) is considered a promising technology to handle this spectrum scarcity. However, standard DSA techniques often rely on analytical modeling of wireless networks, making them intractable in under-measured network environments. Utilizing neural networks to approximate the network dynamics is therefore an alternative approach. In this article, we introduce a Federated Learning (FL) based framework for the task of DSA, where FL is a distributed machine learning framework that can preserve the privacy of network terminals under heterogeneous data distributions. We discuss the opportunities, challenges, and open problems of this framework. To evaluate its feasibility, we implement a Multi-Agent Reinforcement Learning (MARL)-based realization of the framework, together with initial evaluation results.
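
A minimal sketch of the federated-averaging step such a framework could use to combine locally trained MARL models; FedAvg is a generic choice here and the article does not pin down the aggregation rule, so treat the weighting and shapes as assumptions.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Federated averaging: each terminal trains locally on private
    observations, and only model weights (never raw data) are aggregated,
    weighted by the amount of local data."""
    total = sum(client_sizes)
    return [
        sum(w[i] * (n / total) for w, n in zip(client_weights, client_sizes))
        for i in range(len(client_weights[0]))
    ]

# Toy usage: three terminals holding two-parameter policies of equal shape.
rng = np.random.default_rng(0)
clients = [[rng.normal(size=(8, 4)), rng.normal(size=4)] for _ in range(3)]
global_model = fedavg(clients, client_sizes=[100, 50, 150])
# Each round: broadcast global_model, run local MARL updates, re-aggregate.
```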

RAN Resource Slicing in 5G Using Multi-Agent Correlated Q-Learning

Authors:Hao Zhou, Medhat Elsayed, Melike Erol-Kantarci
Date:2021-06-24 01:09:22

5G is regarded as a revolutionary mobile network, expected to satisfy a vast number of novel services, ranging from remote health care to smart cities. However, the heterogeneous Quality of Service (QoS) requirements of different services and the limited spectrum make radio resource allocation a challenging problem in 5G. In this paper, we propose a multi-agent reinforcement learning (MARL) method for radio resource slicing in 5G. We model each slice as an intelligent agent that competes for limited radio resources, and correlated Q-learning is applied for inter-slice resource block (RB) allocation. The proposed correlated Q-learning based inter-slice RB allocation (COQRA) scheme is compared with Nash Q-learning (NQL), Latency-Reliability-Throughput Q-learning (LRTQ), and the priority proportional fairness (PPF) algorithm. Our simulation results show that COQRA achieves 32.4% lower latency and 6.3% higher throughput than LRTQ, and 5.8% lower latency and 5.9% higher throughput than NQL. Significantly higher throughput and a lower packet drop rate (PDR) are observed in comparison to PPF.

Many Agent Reinforcement Learning Under Partial Observability

Authors:Keyang He, Prashant Doshi, Bikramjit Banerjee
Date:2021-06-17 21:24:29

Recent renewed interest in multi-agent reinforcement learning (MARL) has generated an impressive array of techniques that leverage deep reinforcement learning, primarily actor-critic architectures, and can be applied to a limited range of settings in terms of observability and communication. However, a continuing limitation of much of this work is the curse of dimensionality when it comes to representations based on joint actions, which grow exponentially with the number of agents. In this paper, we squarely focus on this challenge of scalability. We apply the key insight of action anonymity, which leads to permutation invariance of joint actions, to two recently presented deep MARL algorithms, MADDPG and IA2C, and compare these instantiations to another recent technique that leverages action anonymity, viz., mean-field MARL. We show that our instantiations can learn the optimal behavior in a broader class of agent networks than the mean-field method, using a recently introduced pragmatic domain.

Cooperative Multi-Agent Reinforcement Learning Based Distributed Dynamic Spectrum Access in Cognitive Radio Networks

Authors:Xiang Tan, Li Zhou, Haijun Wang, Yuli Sun, Haitao Zhao, Boon-Chong Seet, Jibo Wei, Victor C. M. Leung
Date:2021-06-17 06:52:21

With the development of 5G and the Internet of Things, large numbers of wireless devices need to share the limited spectrum resources. Dynamic spectrum access (DSA) is a promising paradigm to remedy the inefficient spectrum utilization brought about by the historical command-and-control approach to spectrum allocation. In this paper, we investigate the distributed DSA problem for multiple users in a typical multi-channel cognitive radio network. The problem is formulated as a decentralized partially observable Markov decision process (Dec-POMDP), and we propose a centralized offline training and distributed online execution framework based on cooperative multi-agent reinforcement learning (MARL). We employ a deep recurrent Q-network (DRQN) to address the partial observability of the state for each cognitive user. The ultimate goal is to learn a cooperative strategy that maximizes the sum throughput of the cognitive radio network in a distributed fashion, without coordination-information exchange between cognitive users. Finally, we validate the proposed algorithm in various settings through extensive experiments. The simulation results show that the proposed algorithm converges quickly and achieves nearly optimal performance.
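
A minimal DRQN of the kind described: an LSTM summarizes each user's history of partial observations (e.g., sensed channel occupancy) and a linear head outputs per-channel Q-values. Sizes here are illustrative.

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Q-network with an LSTM core: the recurrent hidden state carries a
    summary of past partial observations, standing in for the unobserved
    true state of the spectrum environment."""
    def __init__(self, obs_dim, n_actions, hid=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hid, batch_first=True)
        self.head = nn.Linear(hid, n_actions)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, obs_dim)
        out, hidden = self.lstm(obs_seq, hidden)
        return self.head(out), hidden    # Q-values at every timestep

# Toy usage: 4 users x 10 steps of 6-dim observations, 3 channel choices.
q_net = DRQN(obs_dim=6, n_actions=3)
q_values, h = q_net(torch.randn(4, 10, 6))
actions = q_values[:, -1].argmax(dim=-1)   # act greedily on the latest step
```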

MALib: A Parallel Framework for Population-based Multi-agent Reinforcement Learning

Authors:Ming Zhou, Ziyu Wan, Hanjing Wang, Muning Wen, Runzhe Wu, Ying Wen, Yaodong Yang, Weinan Zhang, Jun Wang
Date:2021-06-05 03:27:08

Population-based multi-agent reinforcement learning (PB-MARL) refers to a family of methods that nest reinforcement learning (RL) algorithms within coupled population dynamics, producing a self-generated sequence of tasks. By leveraging auto-curricula to induce a population of distinct emergent strategies, PB-MARL has achieved impressive success in tackling multi-agent tasks. Despite remarkable prior art in distributed RL frameworks, PB-MARL poses new challenges for parallelizing training due to the additional complexity of multiple nested workloads across sampling, training, and evaluation, involving heterogeneous policy interactions. To solve these problems, we present MALib, a scalable and efficient computing framework for PB-MARL. Our framework comprises three key components: (1) a centralized task dispatching model, which supports self-generated tasks and scalable training with heterogeneous policy combinations; (2) a programming architecture named Actor-Evaluator-Learner, which achieves high parallelism for both training and sampling and meets the evaluation requirements of auto-curriculum learning; (3) a higher-level abstraction of MARL training paradigms, which enables efficient code reuse and flexible deployment on different distributed computing paradigms. Experiments on a series of complex tasks such as multi-agent Atari games show that MALib achieves a throughput higher than 40K FPS on a single machine with 32 CPU cores, a 5x speedup over RLlib, and at least a 3x speedup over OpenSpiel in multi-agent training tasks. MALib is publicly available at https://github.com/sjtu-marl/malib.

Decentralized Q-Learning in Zero-sum Markov Games

Authors:Muhammed O. Sayin, Kaiqing Zhang, David S. Leslie, Tamer Basar, Asuman Ozdaglar
Date:2021-06-04 22:42:56

We study multi-agent reinforcement learning (MARL) in infinite-horizon discounted zero-sum Markov games. We focus on the practical but challenging setting of decentralized MARL, where agents make decisions without coordination by a centralized controller, based only on their own payoffs and the local actions they execute. The agents need not observe the opponent's actions or payoffs, may even be oblivious to the presence of the opponent, and need not be aware of the zero-sum structure of the underlying game -- a setting also referred to as radically uncoupled in the literature on learning in games. In this paper, we develop a radically uncoupled Q-learning dynamics that is both rational and convergent: the learning dynamics converges to the best response to the opponent's strategy when the opponent follows an asymptotically stationary strategy, and when both agents adopt the learning dynamics, they converge to the Nash equilibrium of the game. The key challenge in this decentralized setting is the non-stationarity of the environment from an agent's perspective, since both her own payoffs and the system evolution depend on the actions of other agents, and each agent adapts her policies simultaneously and independently. To address this issue, we develop a two-timescale learning dynamics in which each agent updates her local Q-function and value-function estimates concurrently, with the latter happening at a slower timescale.
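
A sketch of the two-timescale structure: the local Q-function moves on a fast schedule while the value estimate tracks slowly, so Q sees a quasi-static target. The step-size exponents, discount, and the softmax smoothing of the value update are illustrative stand-ins for the paper's exact dynamics.

```python
import numpy as np

def two_timescale_update(q, v, s, a, r, s_next, t, gamma=0.9, beta=0.05):
    """One radically uncoupled update: the agent sees only its own action
    and payoff. Q updates on a fast timescale; the value baseline v on a
    slow one."""
    alpha_fast = 1.0 / (1 + t) ** 0.6        # fast step size
    alpha_slow = 1.0 / (1 + t)               # slower step size
    q[s, a] += alpha_fast * (r + gamma * v[s_next] - q[s, a])
    # Smoothed-greedy value target (numerically stable log-sum-exp).
    m = q[s].max()
    soft_max = m + beta * np.log(np.exp((q[s] - m) / beta).sum())
    v[s] += alpha_slow * (soft_max - v[s])
    return q, v

# Toy usage: 3 states, 2 local actions, random transitions and payoffs.
rng = np.random.default_rng(0)
q, v, s = np.zeros((3, 2)), np.zeros(3), 0
for t in range(1000):
    a = rng.integers(2)
    s_next, r = rng.integers(3), rng.normal()
    q, v = two_timescale_update(q, v, s, a, r, s_next, t)
    s = s_next
```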

Neural Auto-Curricula

Authors:Xidong Feng, Oliver Slumbers, Ziyu Wan, Bo Liu, Stephen McAleer, Ying Wen, Jun Wang, Yaodong Yang
Date:2021-06-04 22:30:25

When solving two-player zero-sum games, multi-agent reinforcement learning (MARL) algorithms often create populations of agents where, at each iteration, a new agent is discovered as the best response to a mixture over the opponent population. Within such a process, the update rules of "who to compete with" (i.e., the opponent mixture) and "how to beat them" (i.e., finding best responses) are underpinned by manually developed game-theoretical principles such as fictitious play and Double Oracle. In this paper, we introduce a novel framework -- Neural Auto-Curricula (NAC) -- that leverages meta-gradient descent to automate the discovery of the learning update rule without explicit human design. Specifically, we parameterise the opponent-selection module by neural networks and the best-response module by optimisation subroutines, and update their parameters solely via interaction with the game engine, where both players aim to minimise their exploitability. Surprisingly, even without human design, the discovered MARL algorithms achieve competitive or even better performance than state-of-the-art population-based game solvers (e.g., PSRO) on Games of Skill, differentiable Lotto, non-transitive Mixture Games, Iterated Matching Pennies, and Kuhn Poker. Additionally, we show that NAC is able to generalise from small games to large games, for example training on Kuhn Poker and outperforming PSRO on Leduc Poker. Our work inspires a promising future direction to discover general MARL algorithms solely from data.

Celebrating Diversity in Shared Multi-Agent Reinforcement Learning

Authors:Chenghao Li, Tonghan Wang, Chengjie Wu, Qianchuan Zhao, Jun Yang, Chongjie Zhang
Date:2021-06-04 00:55:03

Recently, deep multi-agent reinforcement learning (MARL) has shown promise in solving complex cooperative tasks. Its success is partly due to parameter sharing among agents. However, such sharing may lead agents to behave similarly and limit their coordination capacity. In this paper, we aim to introduce diversity into both the optimization and the representation of shared multi-agent reinforcement learning. Specifically, we propose an information-theoretical regularization to maximize the mutual information between agents' identities and their trajectories, encouraging extensive exploration and diverse individualized behaviors. In representation, we incorporate agent-specific modules into the shared neural network architecture, regularized by an L1 norm to promote sharing among agents while preserving necessary diversity. Empirical results show that our method achieves state-of-the-art performance on Google Research Football and super-hard StarCraft II micromanagement tasks.

Cooperative Multi-Agent Transfer Learning with Level-Adaptive Credit Assignment

Authors:Tianze Zhou, Fubiao Zhang, Kun Shao, Kai Li, Wenhan Huang, Jun Luo, Weixun Wang, Yaodong Yang, Hangyu Mao, Bin Wang, Dong Li, Wulong Liu, Jianye Hao
Date:2021-06-01 14:22:57

Extending transfer learning to cooperative multi-agent reinforcement learning (MARL) has recently received much attention. In contrast to the single-agent setting, the coordination indispensable to cooperative MARL constrains each agent's policy. However, existing transfer methods focus exclusively on agent policies and ignore coordination knowledge. We propose a new architecture that realizes robust coordination knowledge transfer through an appropriate decomposition of the overall coordination into several coordination patterns. We use a novel mixing network named level-adaptive QTransformer (LA-QTransformer) to realize agent coordination that accounts for credit assignment, with appropriate coordination patterns for different agents realized by a novel level-adaptive Transformer (LA-Transformer) dedicated to the transfer of coordination knowledge. In addition, we use a novel agent network named Population Invariant agent with Transformer (PIT) to realize coordination transfer in a wider variety of scenarios. Extensive experiments in StarCraft II micromanagement show that LA-QTransformer together with PIT achieves superior performance compared with state-of-the-art baselines.

Shapley Counterfactual Credits for Multi-Agent Reinforcement Learning

Authors:Jiahui Li, Kun Kuang, Baoxiang Wang, Furui Liu, Long Chen, Fei Wu, Jun Xiao
Date:2021-06-01 07:38:34

Centralized Training with Decentralized Execution (CTDE) has been a popular paradigm in cooperative Multi-Agent Reinforcement Learning (MARL) settings and is widely used in many real applications. One of the major challenges in the training process is credit assignment, which aims to deduce the contribution of each agent from the global rewards. Existing credit assignment methods focus on either decomposing the joint value function into individual value functions or measuring the impact of local observations and actions on the global value function. These approaches lack a thorough consideration of the complicated interactions among multiple agents, leading to unsuitable credit assignment and subsequently mediocre results in MARL. We propose Shapley Counterfactual Credit Assignment, a novel method for explicit credit assignment which accounts for coalitions of agents. Specifically, the Shapley Value and its desired properties are leveraged in deep MARL to credit any combination of agents, which grants us the capability to estimate the individual credit of each agent. Despite this capability, the main technical difficulty lies in the computational complexity of the Shapley Value, which grows factorially with the number of agents. We instead utilize an approximation via Monte Carlo sampling, which reduces the sample complexity while maintaining effectiveness. We evaluate our method on StarCraft II benchmarks across different scenarios. Our method significantly outperforms existing cooperative MARL algorithms and achieves state-of-the-art results, with especially large margins on the hardest tasks.
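
The Monte Carlo approximation of the Shapley value is simple to state: sample random agent orderings and average each agent's marginal contribution. A minimal sketch, with a toy coalition-value function standing in for the learned counterfactual value estimate:

```python
import numpy as np

def shapley_mc(value_fn, n_agents, n_perms=2000, rng=None):
    """Estimate per-agent Shapley credit by sampling permutations:
    phi_i = E[v(S u {i}) - v(S)], where S is the set of agents preceding
    i in a uniformly random ordering."""
    rng = rng or np.random.default_rng()
    phi = np.zeros(n_agents)
    for _ in range(n_perms):
        coalition, prev = set(), 0.0
        for i in rng.permutation(n_agents):
            coalition.add(int(i))
            cur = value_fn(coalition)
            phi[i] += cur - prev
            prev = cur
    return phi / n_perms

# Toy coalition value: agents 0 and 1 are synergistic, agent 2 adds nothing.
def toy_value(S):
    return len(S & {0, 1}) + (2.0 if {0, 1} <= S else 0.0)

print(shapley_mc(toy_value, n_agents=3))  # approximately [2.0, 2.0, 0.0]
```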

Tesseract: Tensorised Actors for Multi-Agent Reinforcement Learning

Authors:Anuj Mahajan, Mikayel Samvelyan, Lei Mao, Viktor Makoviychuk, Animesh Garg, Jean Kossaifi, Shimon Whiteson, Yuke Zhu, Animashree Anandkumar
Date:2021-05-31 23:08:05

Reinforcement Learning in large action spaces is a challenging problem. Cooperative multi-agent reinforcement learning (MARL) exacerbates matters by imposing various constraints on communication and observability. In this work, we consider the fundamental hurdle affecting both value-based and policy-gradient approaches: an exponential blowup of the action space with the number of agents. For value-based methods, it poses challenges in accurately representing the optimal value function. For policy gradient methods, it makes training the critic difficult and exacerbates the problem of the lagging critic. We show that from a learning theory perspective, both problems can be addressed by accurately representing the associated action-value function with a low-complexity hypothesis class. This requires accurately modelling the agent interactions in a sample efficient way. To this end, we propose a novel tensorised formulation of the Bellman equation. This gives rise to our method Tesseract, which views the Q-function as a tensor whose modes correspond to the action spaces of different agents. Algorithms derived from Tesseract decompose the Q-tensor across agents and utilise low-rank tensor approximations to model agent interactions relevant to the task. We provide PAC analysis for Tesseract-based algorithms and highlight their relevance to the class of rich observation MDPs. Empirical results in different domains confirm Tesseract's gains in sample efficiency predicted by the theory.
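
A sketch of the rank-decomposed view of the joint Q-function: for a fixed state, approximate Q(a_1, ..., a_n) by a sum of R rank-1 terms with one factor per agent, so storage scales with the sum rather than the product of the action-space sizes. The rank and sizes here are illustrative, not the paper's exact construction.

```python
import numpy as np

def low_rank_q(factors):
    """Reconstruct a joint Q-tensor from a CP (rank-R) decomposition.
    factors[i] has shape (R, |A_i|): agent i's per-action loadings for each
    rank-1 component. Storage is O(R * sum|A_i|) versus O(prod|A_i|) for
    the full tensor."""
    R = factors[0].shape[0]
    q = np.zeros([f.shape[1] for f in factors])
    for r in range(R):
        comp = factors[0][r]
        for f in factors[1:]:
            comp = np.multiply.outer(comp, f[r])
        q += comp
    return q

rng = np.random.default_rng(0)
# 3 agents with 4 actions each; rank-2 approximation for one fixed state.
factors = [rng.normal(size=(2, 4)) for _ in range(3)]
q_joint = low_rank_q(factors)                               # shape (4, 4, 4)
best = np.unravel_index(q_joint.argmax(), q_joint.shape)    # greedy joint action
```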

SHAQ: Incorporating Shapley Value Theory into Multi-Agent Q-Learning

Authors:Jianhong Wang, Yuan Zhang, Yunjie Gu, Tae-Kyun Kim
Date:2021-05-31 14:50:52

Value factorisation is a useful technique for multi-agent reinforcement learning (MARL) in the global reward game; however, its underlying mechanism is not yet fully understood. This paper studies a theoretical framework for value factorisation with interpretability via Shapley value theory. We generalise the Shapley value to the Markov convex game, yielding the Markov Shapley value (MSV), and apply it as a value factorisation method in the global reward game, which is justified by the equivalence between the two games. Based on the properties of the MSV, we derive the Shapley-Bellman optimality equation (SBOE) to evaluate the optimal MSV, which corresponds to an optimal joint deterministic policy. Furthermore, we propose the Shapley-Bellman operator (SBO), which is proved to solve the SBOE. With stochastic approximation and some transformations, a new MARL algorithm called Shapley Q-learning (SHAQ) is established, whose implementation is guided by the theoretical results on the SBO and MSV. We also discuss the relationship between SHAQ and relevant value factorisation methods. In the experiments, SHAQ exhibits not only superior performance on all tasks but also interpretability that agrees with the theoretical analysis. The implementation of this paper is available at https://github.com/hsvgbkhgbv/shapley-q-learning.

MARL with General Utilities via Decentralized Shadow Reward Actor-Critic

Authors:Junyu Zhang, Amrit Singh Bedi, Mengdi Wang, Alec Koppel
Date:2021-05-29 19:05:48

We posit a new mechanism for cooperation in multi-agent reinforcement learning (MARL) based upon any nonlinear function of the team's long-term state-action occupancy measure, i.e., a general utility. This subsumes the cumulative return and also allows one to incorporate risk-sensitivity, exploration, and priors. We derive the Decentralized Shadow Reward Actor-Critic (DSAC), in which agents alternate between policy evaluation (critic), weighted averaging with neighbors (information mixing), and local gradient updates for their policy parameters (actor). DSAC augments the classic critic step by requiring agents to (i) estimate their local occupancy measure in order to (ii) estimate the derivative of the local utility with respect to their occupancy measure, i.e., the "shadow reward". DSAC converges to $\epsilon$-stationarity in $\mathcal{O}(1/\epsilon^{2.5})$ steps, or faster, in $\mathcal{O}(1/\epsilon^{2})$ steps with high probability, depending on the amount of communication. We further establish the non-existence of spurious stationary points for this problem; that is, DSAC finds the globally optimal policy. Experiments demonstrate the merits of goals beyond the cumulative return in cooperative MARL.

KnowSR: Knowledge Sharing among Homogeneous Agents in Multi-agent Reinforcement Learning

Authors:Zijian Gao, Kele Xu, Bo Ding, Huaimin Wang, Yiying Li, Hongda Jia
Date:2021-05-25 02:19:41

Recently, deep reinforcement learning (RL) algorithms have made great progress in the multi-agent domain. However, due to the characteristics of RL, training for complex tasks is resource-intensive and time-consuming. To meet this challenge, a mutual learning strategy between homogeneous agents is essential, yet this is under-explored in previous studies, because most existing methods do not consider using the knowledge contained in agent models. In this paper, we present KnowSR, an adaptation applicable to the majority of multi-agent reinforcement learning (MARL) algorithms that takes advantage of the differences in learning between agents. We employ the idea of knowledge distillation (KD) to share knowledge among agents and thereby shorten the training phase. To empirically demonstrate the robustness and effectiveness of KnowSR, we perform extensive experiments with state-of-the-art MARL algorithms in collaborative and competitive scenarios. The results demonstrate that KnowSR outperforms recently reported methods, emphasizing the importance of the proposed knowledge sharing for MARL.
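
A minimal sketch of the distillation idea: a lagging agent regresses its policy logits toward a peer's softened action distribution alongside its usual RL loss. The temperature, weighting, and shapes are illustrative assumptions, not KnowSR's exact recipe.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Knowledge-distillation term between homogeneous agents: KL divergence
    from the peer's (teacher's) softened action distribution to the
    student's. Scaling by T^2 keeps gradients comparable across temperatures."""
    p_teacher = F.softmax(teacher_logits.detach() / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

# Toy usage: blend the usual RL loss with the peer-distillation term.
student_logits = torch.randn(32, 6, requires_grad=True)
teacher_logits = torch.randn(32, 6)        # from the better-performing peer
rl_loss = torch.tensor(0.0)                # placeholder for the RL objective
loss = rl_loss + 0.5 * distill_loss(student_logits, teacher_logits)
loss.backward()
```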

Searching Collaborative Agents for Multi-plane Localization in 3D Ultrasound

Authors:Xin Yang, Yuhao Huang, Ruobing Huang, Haoran Dou, Rui Li, Jikuan Qian, Xiaoqiong Huang, Wenlong Shi, Chaoyu Chen, Yuanji Zhang, Haixia Wang, Yi Xiong, Dong Ni
Date:2021-05-22 02:48:23

3D ultrasound (US) has become prevalent due to its rich spatial and diagnostic information not contained in 2D US. Moreover, 3D US can contain multiple standard planes (SPs) in one shot, so automatically localizing SPs in 3D US has the potential to improve user-independence and scanning efficiency. However, manual SP localization in 3D US is challenging because of the low image quality, huge search space, and large anatomical variability. In this work, we propose a novel multi-agent reinforcement learning (MARL) framework to simultaneously localize multiple SPs in 3D US. Our contribution is four-fold. First, our proposed method is general and can accurately localize multiple SPs in different challenging US datasets. Second, we equip the MARL system with a recurrent neural network (RNN) based collaborative module, which can strengthen the communication among agents and effectively learn the spatial relationship among planes. Third, we adopt neural architecture search (NAS) to automatically design the network architecture of both the agents and the collaborative module. Last, we believe we are the first to realize automatic SP localization in pelvic US volumes, and we note that our approach can handle both normal and abnormal uterus cases. Extensively validated on two challenging datasets of the uterus and fetal brain, our proposed method achieves average localization accuracies of 7.03 degrees/1.59 mm and 9.75 degrees/1.19 mm, respectively. Experimental results show that our lightweight MARL model has higher accuracy than state-of-the-art methods.

Programming and Deployment of Autonomous Swarms using Multi-Agent Reinforcement Learning

Authors:Jayson Boubin, Codi Burley, Peida Han, Bowen Li, Barry Porter, Christopher Stewart
Date:2021-05-21 23:22:43

Autonomous systems (AS) carry out complex missions by continuously observing the state of their surroundings and taking actions toward a goal. Swarms of AS working together can complete missions faster and more effectively than single AS alone. To build swarms today, developers handcraft their own software for storing, aggregating, and learning from observations. We present the Fleet Computer, a platform for developing and managing swarms. The Fleet Computer provides a programming paradigm that simplifies multi-agent reinforcement learning (MARL) -- an emerging class of algorithms that coordinate swarms of agents. Using just two programmer-provided functions, Map() and Eval(), the Fleet Computer compiles and deploys swarms and continuously updates the reinforcement learning models that govern actions. To conserve compute resources, the Fleet Computer gives priority scheduling to models that contribute to effective actions, drawing a novel link between online learning and resource management. We developed swarms for unmanned aerial vehicles (UAVs) in agriculture and for video analytics on urban traffic. Compared to individual AS, our swarms achieved speedups of 4.4X using 4 UAVs and 62X using 130 video cameras. Compared to a competing approach for building swarms that is widely used in practice, our swarms were 3X more effective while using 3.9X less energy.

Dependent Multi-Task Learning with Causal Intervention for Image Captioning

Authors:Wenqing Chen, Jidong Tian, Caoyun Fan, Hao He, Yaohui Jin
Date:2021-05-18 14:57:33

Recent work on image captioning has mainly followed an extract-then-generate paradigm, pre-extracting a sequence of object-based features and then formulating image captioning as a single sequence-to-sequence task. Although promising, we observed two problems in the generated captions: 1) content inconsistency, where models generate contradictory facts; 2) insufficient informativeness, where models miss parts of the important information. From a causal perspective, the reason is that models have captured spurious statistical correlations between visual features and certain expressions (e.g., visual features of "long hair" and "woman"). In this paper, we propose a dependent multi-task learning framework with causal intervention (DMTCI). Firstly, we introduce an intermediate task, bag-of-categories generation, before the final task, image captioning. The intermediate task helps the model better understand the visual features and thus alleviates the content inconsistency problem. Secondly, we apply Pearl's do-calculus to the model, cutting off the link between the visual features and possible confounders and thus letting the model focus on the causal visual features. Specifically, the high-frequency concept set is considered as a proxy for the confounders, while the real confounders are inferred in the continuous space. Finally, we use a multi-agent reinforcement learning (MARL) strategy to enable end-to-end training and reduce inter-task error accumulation. Extensive experiments show that our model outperforms the baseline models and achieves competitive performance with state-of-the-art models.

Permutation Invariant Policy Optimization for Mean-Field Multi-Agent Reinforcement Learning: A Principled Approach

Authors:Yan Li, Lingxiao Wang, Jiachen Yang, Ethan Wang, Zhaoran Wang, Tuo Zhao, Hongyuan Zha
Date:2021-05-18 04:35:41

Multi-agent reinforcement learning (MARL) becomes more challenging in the presence of more agents, as the capacity of the joint state and action spaces grows exponentially in the number of agents. To address such a challenge of scale, we identify a class of cooperative MARL problems with permutation invariance and formulate it as a mean-field Markov decision process (MDP). To exploit the permutation invariance therein, we propose the mean-field proximal policy optimization (MF-PPO) algorithm, at the core of which is a permutation-invariant actor-critic neural architecture. We prove that MF-PPO attains the globally optimal policy at a sublinear rate of convergence. Moreover, its sample complexity is independent of the number of agents. We validate the theoretical advantages of MF-PPO with numerical experiments in the multi-agent particle environment (MPE). In particular, we show that the inductive bias introduced by the permutation-invariant neural architecture enables MF-PPO to outperform existing competitors with a smaller number of model parameters, which is key to its generalization performance.
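
A sketch of the permutation-invariant building block such an actor-critic relies on: per-agent encodings are mean-pooled, so the output depends only on the empirical distribution of agent states, not on their ordering. Sizes are illustrative.

```python
import torch
import torch.nn as nn

class PermInvariantCritic(nn.Module):
    """Mean-field style critic: encode every agent's state with the same
    network, mean-pool across agents, then decode to a scalar value.
    Pooling makes the output invariant to any reordering of agents."""
    def __init__(self, obs_dim, hid=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(obs_dim, hid), nn.ReLU())
        self.dec = nn.Sequential(nn.Linear(hid, hid), nn.ReLU(),
                                 nn.Linear(hid, 1))

    def forward(self, obs):
        # obs: (batch, n_agents, obs_dim); works for any number of agents
        return self.dec(self.enc(obs).mean(dim=1)).squeeze(-1)

critic = PermInvariantCritic(obs_dim=6)
obs = torch.randn(8, 10, 6)
perm = obs[:, torch.randperm(10)]                 # shuffle the agents
assert torch.allclose(critic(obs), critic(perm), atol=1e-5)
```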

Deep Multi-agent Reinforcement Learning for Highway On-Ramp Merging in Mixed Traffic

Authors:Dong Chen, Mohammad Hajidavalloo, Zhaojian Li, Kaian Chen, Yongqiang Wang, Longsheng Jiang, Yue Wang
Date:2021-05-12 14:37:43

On-ramp merging is a challenging task for autonomous vehicles (AVs), especially in mixed traffic where AVs coexist with human-driven vehicles (HDVs). In this paper, we formulate the mixed-traffic highway on-ramp merging problem as a multi-agent reinforcement learning (MARL) problem, where the AVs (on both the merge lane and the through lane) collaboratively learn a policy to adapt to HDVs and maximize traffic throughput. We develop an efficient and scalable MARL framework that can be used in dynamic traffic where the communication topology may be time-varying. Parameter sharing and local rewards are exploited to foster inter-agent cooperation while achieving great scalability. An action-masking scheme is employed to improve learning efficiency by filtering out invalid/unsafe actions at each step. In addition, a novel priority-based safety supervisor is developed to significantly reduce the collision rate and greatly expedite the training process. A gym-like simulation environment is developed and open-sourced with three different levels of traffic density. We exploit curriculum learning to efficiently learn harder tasks from models trained under simpler settings. Comprehensive experimental results show that the proposed MARL framework consistently outperforms several state-of-the-art benchmarks.
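
The action-masking step amounts to driving the logits of invalid/unsafe actions to negative infinity before normalisation, so they receive zero probability and no gradient pushes mass onto them. A minimal sketch; the mask source (e.g., a safety checker) and sizes are assumptions.

```python
import torch

def masked_action_dist(logits, valid_mask):
    """Mask invalid/unsafe actions by setting their logits to -inf before
    the softmax: masked actions get exactly zero probability."""
    masked = logits.masked_fill(~valid_mask, float("-inf"))
    return torch.distributions.Categorical(logits=masked)

# Toy usage: 3 AVs, 5 longitudinal actions; a hypothetical safety supervisor
# leaves vehicle 0 only one safe action.
logits = torch.randn(3, 5)
valid = torch.ones(3, 5, dtype=torch.bool)
valid[0, 1:] = False                       # vehicle 0 may only take action 0
dist = masked_action_dist(logits, valid)
assert dist.sample()[0].item() == 0        # masked actions are never sampled
```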

Hierarchical RNNs-Based Transformers MADDPG for Mixed Cooperative-Competitive Environments

Authors:Xiaolong Wei, LiFang Yang, Xianglin Huang, Gang Cao, Tao Zhulin, Zhengyang Du, Jing An
Date:2021-05-11 09:22:52

Attention mechanisms are now widely applied in deep learning models. Models built on attention can not only record the relationships between feature positions but also measure the importance of different features based on their weights. By establishing dynamically weighted parameters that select relevant features and discard irrelevant ones, key information can be strengthened and irrelevant information weakened, significantly improving the efficiency of deep learning algorithms. Although transformers have performed very well in many fields, including reinforcement learning, many problems and applications in this area remain to be addressed with transformers. MARL (Multi-Agent Reinforcement Learning) can be viewed as a set of independent agents trying to adapt and learn their way toward a goal. In order to emphasize the relationships between MDP decisions within a given time period, we apply a hierarchical coding method and validate its effectiveness. This paper proposes a hierarchical transformers MADDPG based on RNNs, which we call Hierarchical RNNs-Based Transformers MADDPG (HRTMADDPG). It consists of a lower-level RNN encoder that encodes multiple step sizes within each time sequence, and an upper-level Transformer encoder that learns the correlations between multiple sequences, so that we can capture the causal relationships between sub-sequences and make HRTMADDPG more efficient.

AoI-Aware Resource Allocation for Platoon-Based C-V2X Networks via Multi-Agent Multi-Task Reinforcement Learning

Authors:Mohammad Parvini, Mohammad Reza Javan, Nader Mokari, Bijan Abbasi, Eduard A. Jorswieck
Date:2021-05-10 08:39:56

This paper investigates the problem of age of information (AoI) aware radio resource management for a platooning system. Multiple autonomous platoons exploit the cellular wireless vehicle-to-everything (C-V2X) communication technology to disseminate cooperative awareness messages (CAMs) to their followers while ensuring timely delivery of safety-critical messages to the Road-Side Unit (RSU). Due to the challenges of dynamic channel conditions, centralized resource management schemes that require global information are inefficient and incur large signaling overheads. Hence, we exploit a distributed resource allocation framework based on multi-agent reinforcement learning (MARL), where each platoon leader (PL) acts as an agent and interacts with the environment to learn its optimal policy. Existing MARL algorithms consider a holistic reward function for the group's collective success, which often yields unsatisfactory results and cannot guarantee an optimal policy for each agent. Consequently, motivated by the existing RL literature, we propose a novel MARL framework that trains two critics: a global critic that estimates the global expected reward and motivates the agents toward cooperative behavior, and an exclusive local critic for each agent that estimates the local individual reward. Furthermore, based on the tasks each agent has to accomplish, the individual reward of each agent is decomposed into multiple sub-reward functions, and task-wise value functions are learned separately. Numerical results indicate the effectiveness of our proposed algorithm compared with conventional RL methods applied in this area.

Model-based Multi-agent Policy Optimization with Adaptive Opponent-wise Rollouts

Authors:Weinan Zhang, Xihuai Wang, Jian Shen, Ming Zhou
Date:2021-05-07 16:20:22

This paper investigates model-based methods in multi-agent reinforcement learning (MARL). We specify the dynamics sample complexity and the opponent sample complexity in MARL, and conduct a theoretical analysis of the return discrepancy upper bound. To reduce this upper bound, and thereby keep sample complexity low throughout the learning process, we propose a novel decentralized model-based MARL method, named Adaptive Opponent-wise Rollout Policy Optimization (AORPO). In AORPO, each agent builds its multi-agent environment model, consisting of a dynamics model and multiple opponent models, and trains its policy with the adaptive opponent-wise rollout. We further prove the theoretical convergence of AORPO under reasonable assumptions. Empirical experiments on competitive and cooperative tasks demonstrate that AORPO can achieve improved sample efficiency with comparable asymptotic performance over the compared MARL methods.

Reducing Bus Bunching with Asynchronous Multi-Agent Reinforcement Learning

Authors:Jiawei Wang, Lijun Sun
Date:2021-05-02 02:08:07

The bus system is a critical component of sustainable urban transportation. However, due to the significant uncertainties in passenger demand and traffic conditions, bus operation is unstable in nature, and bus bunching has become a common phenomenon that undermines the reliability and efficiency of bus services. Despite recent advances in multi-agent reinforcement learning (MARL) on traffic control, little research has focused on bus fleet control due to its tricky asynchronous characteristic -- control actions only happen when a bus arrives at a bus stop, and thus agents do not act simultaneously. In this study, we formulate route-level bus fleet control as an asynchronous multi-agent reinforcement learning (ASMR) problem and extend the classical actor-critic architecture to handle the asynchronous issue. Specifically, we design a novel critic network to effectively approximate the marginal contribution for other agents, in which a graph attention neural network is used to conduct inductive learning for policy evaluation. The critic structure also helps the ego agent optimize its policy more efficiently. We evaluate the proposed framework on real-world bus services and actual passenger demand derived from smart card data. Our results show that the proposed model outperforms both traditional headway-based control methods and existing MARL methods.

MARL: Multimodal Attentional Representation Learning for Disease Prediction

Authors:Ali Hamdi, Amr Aboeleneen, Khaled Shaban
Date:2021-05-01 17:47:40

Existing learning models often utilise CT-scan images to predict lung diseases. These models are hampered by high uncertainties that affect lung segmentation and visual feature learning. We introduce MARL, a novel Multimodal Attentional Representation Learning model architecture that learns useful features from multimodal data under uncertainty. We feed the proposed model with both the lung CT-scan images and the patients' respective historical biological records collected over time. Such rich data makes it possible to analyse both spatial and temporal aspects of the disease. MARL employs Fuzzy-based image spatial segmentation to overcome uncertainties in CT-scan images. We then utilise a pre-trained Convolutional Neural Network (CNN) to learn visual representation vectors from the images. We augment the patients' data with statistical features from the segmented images. We develop a Long Short-Term Memory (LSTM) network to represent the augmented data and learn sequential patterns of disease progression. Finally, we feed both CNN and LSTM feature vectors into an attention layer to help focus on the best learning features. We evaluated MARL on regression of lung disease progression and status classification. MARL outperforms state-of-the-art CNN architectures, such as EfficientNet and DenseNet, and baseline prediction models. It achieves a 91% R^2 score, which is higher than the other models by a range of 8% to 27%. Also, MARL achieves 97% and 92% accuracy for binary and multi-class classification, respectively. MARL improves on the accuracy of state-of-the-art CNN models by 19% to 57%. The results show that combining spatial and sequential temporal features produces better discriminative features.
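The final attention fusion of the CNN and LSTM feature vectors could look roughly like the following; dimensions and layer names are assumptions, and only the fusion step is sketched:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Weighs the image (CNN) and record (LSTM) embeddings with a small
    attention layer before the final regression head."""
    def __init__(self, img_dim, rec_dim, hidden=128):
        super().__init__()
        self.proj_img = nn.Linear(img_dim, hidden)
        self.proj_rec = nn.Linear(rec_dim, hidden)
        self.score = nn.Linear(hidden, 1)
        self.head = nn.Linear(hidden, 1)  # e.g. disease progression score

    def forward(self, img_feat, rec_feat):
        feats = torch.stack([self.proj_img(img_feat),
                             self.proj_rec(rec_feat)], dim=1)    # (b, 2, h)
        w = torch.softmax(self.score(torch.tanh(feats)), dim=1)  # (b, 2, 1)
        fused = (w * feats).sum(dim=1)                           # (b, h)
        return self.head(fused)
```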

Mean Field MARL Based Bandwidth Negotiation Method for Massive Devices Spectrum Sharing

Authors:Tianhao Li, Yu Tian, Shuai Yuan, Naijin Liu
Date:2021-04-30 16:07:39

In this paper, a novel bandwidth negotiation mechanism is proposed for massive-device wireless spectrum sharing, in which individual devices locally negotiate bandwidth usage with neighboring devices and globally optimal spectrum utilization is achieved through distributed decision-making. Since only sparse feedback is needed, the proposed mechanism can greatly reduce the signaling overhead. To solve the distributed optimization problem when massive numbers of devices coexist, a mean field multi-agent reinforcement learning (MF-MARL) based bandwidth decision algorithm is proposed, which allows devices to make globally optimal decisions leveraging only neighborhood observations. In simulation, distributed bandwidth negotiation among 1000 devices is demonstrated and the spectrum utilization rate stays above 95%. The proposed method helps reduce spectrum conflicts and increase spectrum utilization for massive-device spectrum sharing.
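At its core, mean-field MARL conditions each device's Q-function on the mean action of its neighbors rather than on the full joint action. A tabular sketch, assuming for illustration that the mean action is discretized into bins (our assumption):

```python
import numpy as np

def mf_q_update(Q, s, a, a_mean, r, s2, a_mean2, lr=0.1, gamma=0.95):
    """One tabular mean-field Q update. Q is indexed by
    (state, own action, discretized neighborhood mean action)."""
    target = r + gamma * np.max(Q[s2, :, a_mean2])  # bootstrap at next mean
    Q[s, a, a_mean] += lr * (target - Q[s, a, a_mean])
    return Q
```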

Neutron-proton pairing in the N=Z radioactive fp-shell nuclei 56Ni and 52Fe probed by pair transfer

Authors:B. Le Crom, M. Assié, Y. Blumenfeld, J. Guillot, H. Sagawa, T. Suzuki, M. Honma, N. L. Achouri, B. Bastin, R. Borcea, W. N. Catford, E. Clement, L. Caceres, M. Caamano, A. Corsi, G. De France, F. Delaunay, N. De Séréville, B. Fernandez-Dominguez, M. Fisichella, S. Franchoo, A. Georgiadou, J. Gibelin, A. Gillibert, F. Hammache, O. Kamalou, A. Knapton, V. Lapoux, S. Leblond, A. O. Macchiavelli, F. M. Marques, A. Matta, L. Menager, P. Morfouace, N. A. Orr, J. Pancin, X. Pereira-Lopez, L. Perrot, J. Piot, E. Pollacco, D. Ramos, T. Roger, F. Rotaru, A. M. Sanchez-Benitez, M. Sénoville, O. Sorlin, M. Stanoiu, I. Stefan, C. Stodel, D. Suzuki, J-C Thomas, M. Vandebrouck
Date:2021-04-21 18:14:04

The isovector and isoscalar components of neutron-proton pairing are investigated in the N=Z unstable nuclei of the \textit{fp}-shell through the two-nucleon transfer reaction (p,$^3$He) in inverse kinematics. The combination of particle and gamma-ray detection with radioactive beams of $^{56}$Ni and $^{52}$Fe, produced by fragmentation at the GANIL/LISE facility, made it possible to carry out this study for the first time in a closed- and an open-shell nucleus in the \textit{fp}-shell. The transfer cross-sections for ground-state to ground-state (J=0$^+$,T=1) and to the first (J=1$^+$,T=0) state were extracted for both cases, together with the transfer cross-section ratios $\sigma$(0$^+$,T=1)/$\sigma$(1$^+$,T=0). They are compared with second-order distorted-wave Born approximation (DWBA) calculations. The enhancement of the ground-state to ground-state pair transfer cross-section close to mid-shell, in $^{52}$Fe, points towards a superfluid phase in the isovector channel. For the "deuteron-like" transfer, very low cross-sections to the first (J=1$^+$,T=0) state were observed both for $^{56}$Ni(p,$^3$He) and $^{52}$Fe(p,$^3$He), and are related to a strong hindrance of this channel due to the spin-orbit effect. No evidence for an isoscalar deuteron-like condensate is observed.

The MUGAST-AGATA-VAMOS campaign : set-up and performance

Authors:M. Assié, E. Clément, A. Lemasson, D. Ramos, A. Raggio, I. Zanon, F. Galtarossa, C. Lenain, J. Casal, F. Flavigny, A. Matta, D. Mengoni, D. Beaumel, Y. Blumenfeld, R. Borcea, D. Brugnara, W. Catford, F. de Oliveira, N. De Séréville, F. Didierjean, C. Aa. Diget, J. Dudouet, B. Fernandez-Dominguez, C. Fougères, G. Frémont, V. Girard-Alcindor, A. Giret, A. Goasduff, A. Gottardo, J. Goupil, F. Hammache, P. R. John, A. Korichi, L. Lalanne, S. Leblond, A. Lefevre, F. Legruel, L. Menager, B. Million, C. Nicolle, F. Noury, E. Rauly, K. Rezynkina, E. Rindel, J. S. Rojo, M. Siciliano, M. Stanoiu, I. Stefan, L. Vatrinet
Date:2021-04-21 18:12:53

The MUGAST-AGATA-VAMOS set-up at GANIL combines the MUGAST highly-segmented silicon array with the state-of-the-art AGATA array and the large acceptance VAMOS spectrometer. The mechanical and electronics integration copes with the constraints of maximum efficiency for each device, in particular {\gamma}-ray transparency for the silicon array. This complete set-up offers a unique opportunity to perform exclusive measurements of direct reactions with the radioactive beams from the SPIRAL1 facility. The performance of the set-up is described through its commissioning and two examples of transfer reactions measured during the campaign. High accuracy spectroscopy of the nuclei of interest, including cross-sections and angular distributions, is achieved through the triple-coincidence measurement. In addition, the Doppler correction of the {\gamma}-ray energies is improved by the detection of the light particles and the use of two-body kinematics, and full rejection of the background contributions is obtained through the identification of heavy residues. Moreover, the system can handle high-intensity beams (up to 10$^8$ pps). The particle identification based on the measurement of the time-of-flight between MUGAST and VAMOS and the reconstruction of the trajectories is investigated.

Agent-Centric Representations for Multi-Agent Reinforcement Learning

Authors:Wenling Shang, Lasse Espeholt, Anton Raichuk, Tim Salimans
Date:2021-04-19 15:43:40

Object-centric representations have recently enabled significant progress in tackling relational reasoning tasks. By building a strong object-centric inductive bias into neural architectures, recent efforts have improved generalization and data efficiency of machine learning algorithms for these problems. One problem class involving relational reasoning that still remains under-explored is multi-agent reinforcement learning (MARL). Here we investigate whether object-centric representations are also beneficial in the fully cooperative MARL setting. Specifically, we study two ways of incorporating an agent-centric inductive bias into our RL algorithm: 1. introducing an agent-centric attention module with explicit connections across agents; 2. adding an agent-centric unsupervised predictive objective (i.e. not using action labels), to be used as an auxiliary loss for MARL, or as the basis of a pre-training step. We evaluate these approaches on the Google Research Football environment as well as DeepMind Lab 2D. Empirically, agent-centric representation learning leads to the emergence of more complex cooperation strategies between agents as well as enhanced sample efficiency and generalization.

Multi-Agent Reinforcement Learning Based Coded Computation for Mobile Ad Hoc Computing

Authors:Baoqian Wang, Junfei Xie, Kejie Lu, Yan Wan, Shengli Fu
Date:2021-04-15 15:50:57

Mobile ad hoc computing (MAHC), which allows mobile devices to directly share their computing resources, is a promising solution to address the growing demands for computing resources required by mobile devices. However, offloading a computation task from a mobile device to other mobile devices is a challenging task due to frequent topology changes and link failures because of node mobility, unstable and unknown communication environments, and the heterogeneous nature of these devices. To address these challenges, in this paper, we introduce a novel coded computation scheme based on multi-agent reinforcement learning (MARL), which has many promising features such as adaptability to network changes, high efficiency and robustness to uncertain system disturbances, consideration of node heterogeneity, and decentralized load allocation. Comprehensive simulation studies demonstrate that the proposed approach can outperform state-of-the-art distributed computing schemes.

Dynamic Coded Caching in Wireless Networks Using Multi-Agent Reinforcement Learning

Authors:Jesper Pedersen, Alexandre Graell i Amat, Fredrik Brännström, Eirik Rosnes
Date:2021-04-14 09:26:26

We consider distributed caching of content across several small base stations (SBSs) in a wireless network, where the content is encoded using a maximum distance separable code. Specifically, we apply soft time-to-live (STTL) cache management policies, where coded packets may be evicted from the caches at periodic times. We propose a reinforcement learning (RL) approach to find coded STTL policies minimizing the overall network load. We demonstrate that such caching policies achieve almost the same network load as policies obtained through optimization, where the latter assumes perfect knowledge of the distribution of times between file requests as well as the distribution of the number of SBSs within communication range of a user placing a request. We also suggest a multi-agent RL (MARL) framework for the scenario of non-uniformly distributed requests in space. For such a scenario, we show that MARL caching policies achieve lower network load as compared to optimized caching policies assuming a uniform request placement. We also provide convincing evidence that synchronous updates offer a lower network load than asynchronous updates for spatially homogeneous renewal request processes due to the memory of the renewal processes.

Towards Resilience for Multi-Agent $QD$-Learning

Authors:Yijing Xie, Shaoshuai Mou, Shreyas Sundaram
Date:2021-04-07 14:33:13

This paper considers the multi-agent reinforcement learning (MARL) problem for a networked (peer-to-peer) system in the presence of Byzantine agents. We build on an existing distributed $Q$-learning algorithm, and allow certain agents in the network to behave in an arbitrary and adversarial manner (as captured by the Byzantine attack model). Under the proposed algorithm, if the network topology is $(2F+1)$-robust and up to $F$ Byzantine agents exist in the neighborhood of each regular agent, we establish the almost sure convergence of all regular agents' value functions to the neighborhood of the optimal value function of all regular agents. For each state, if the optimal $Q$-values of all regular agents corresponding to different actions are sufficiently separated, our approach allows each regular agent to learn the optimal policy for all regular agents.
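A standard way to realize this kind of Byzantine filtering is a coordinate-wise trimmed mean over neighbors' value estimates; the sketch below illustrates that style of resilient aggregation, though the paper's exact update rule may differ:

```python
import numpy as np

def trimmed_mean_consensus(q_self, neighbor_qs, F):
    """Sort neighbor Q estimates entry-wise, discard the F largest and
    F smallest, then average the survivors with the agent's own estimate.
    Requires more than 2F neighbors so at least one value survives."""
    qs = np.sort(np.stack(neighbor_qs), axis=0)  # sort along neighbor axis
    kept = qs[F:qs.shape[0] - F]                 # drop suspected outliers
    return (kept.sum(axis=0) + q_self) / (kept.shape[0] + 1)
```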

Flatland Competition 2020: MAPF and MARL for Efficient Train Coordination on a Grid World

Authors:Florian Laurent, Manuel Schneider, Christian Scheller, Jeremy Watson, Jiaoyang Li, Zhe Chen, Yi Zheng, Shao-Hung Chan, Konstantin Makhnev, Oleg Svidchenko, Vladimir Egorov, Dmitry Ivanov, Aleksei Shpilman, Evgenija Spirovska, Oliver Tanevski, Aleksandar Nikov, Ramon Grunder, David Galevski, Jakov Mitrovski, Guillaume Sartoretti, Zhiyao Luo, Mehul Damani, Nilabha Bhattacharya, Shivam Agarwal, Adrian Egli, Erik Nygren, Sharada Mohanty
Date:2021-03-30 17:13:29

The Flatland competition aimed at finding novel approaches to solve the vehicle re-scheduling problem (VRSP). The VRSP is concerned with scheduling trips in traffic networks and the re-scheduling of vehicles when disruptions occur, for example the breakdown of a vehicle. While solving the VRSP in various settings has been an active area in operations research (OR) for decades, the ever-growing complexity of modern railway networks makes dynamic real-time scheduling of traffic virtually impossible. Recently, multi-agent reinforcement learning (MARL) has successfully tackled challenging tasks where many agents need to be coordinated, such as multiplayer video games. However, the coordination of hundreds of agents in a real-life setting like a railway network remains challenging and the Flatland environment used for the competition models these real-world properties in a simplified manner. Submissions had to bring as many trains (agents) to their target stations in as little time as possible. While the best submissions were in the OR category, participants found many promising MARL approaches. Using both centralized and decentralized learning based approaches, top submissions used graph representations of the environment to construct tree-based observations. Further, different coordination mechanisms were implemented, such as communication and prioritization between agents. This paper presents the competition setup, four outstanding solutions to the competition, and a cross-comparison between them.

KnowRU: Knowledge Reusing via Knowledge Distillation in Multi-agent Reinforcement Learning

Authors:Zijian Gao, Kele Xu, Bo Ding, Huaimin Wang, Yiying Li, Hongda Jia
Date:2021-03-27 12:38:01

Recently, deep Reinforcement Learning (RL) algorithms have achieved dramatic progress in the multi-agent area. However, training on increasingly complex tasks is time-consuming and resource-exhausting. To alleviate this problem, efficiently leveraging historical experience is essential; this is under-explored in previous studies, as most existing methods fail to achieve this goal in a continuously changing system due to their complicated design and environmental dynamics. In this paper, we propose a method named "KnowRU" for knowledge reuse, which can be easily deployed in the majority of multi-agent reinforcement learning algorithms without complicated hand-coded design. We employ the knowledge distillation paradigm to transfer knowledge among agents, with the goal of accelerating the training phase for new tasks while improving the asymptotic performance of agents. To empirically demonstrate the robustness and effectiveness of KnowRU, we perform extensive experiments with state-of-the-art multi-agent reinforcement learning (MARL) algorithms on collaborative and competitive scenarios. The results show that KnowRU outperforms recently reported methods, which emphasizes the importance of the proposed knowledge reuse for MARL.

Safe Multi-Agent Reinforcement Learning through Decentralized Multiple Control Barrier Functions

Authors:Zhiyuan Cai, Huanhui Cao, Wenjie Lu, Lin Zhang, Hao Xiong
Date:2021-03-23 13:54:21

Multi-Agent Reinforcement Learning (MARL) algorithms have shown impressive performance in simulation in recent years, but deploying MARL in real-world applications may raise safety problems. MARL with centralized shields was recently proposed and verified in safety games. However, centralized shielding approaches can be infeasible in several real-world multi-agent applications that involve non-cooperative agents or communication delay. Thus, we propose to combine MARL with decentralized Control Barrier Function (CBF) shields based on available local information. We establish a safe MARL framework with decentralized multiple CBFs and extend Multi-Agent Deep Deterministic Policy Gradient (MADDPG) to MADDPG with decentralized multiple Control Barrier Functions (MADDPG-CBF). Based on a collision-avoidance problem that includes not only cooperative agents but also obstacles, we demonstrate the construction of multiple CBFs with safety guarantees in theory. Experiments are conducted, and the results verify that the proposed safe MARL framework can guarantee the safety of the agents included in MARL.
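To make the shield concrete, here is a hedged single-agent sketch for single-integrator dynamics and a distance-based barrier, reducing the CBF condition to a closed-form half-space projection; the paper's decentralized multiple-CBF construction is not reproduced here:

```python
import numpy as np

def cbf_filter(x, u_rl, x_obs, d_safe=1.0, alpha=1.0):
    """Project the RL action onto the half-space where the barrier
    h(x) = ||x - x_obs||^2 - d_safe^2 satisfies h_dot + alpha*h >= 0
    (single-integrator dynamics x_dot = u assumed)."""
    g = 2.0 * (x - x_obs)                # gradient of h
    h = np.dot(x - x_obs, x - x_obs) - d_safe ** 2
    slack = g @ u_rl + alpha * h         # CBF condition: slack >= 0
    if slack >= 0:
        return u_rl                      # RL action already safe
    return u_rl - (slack / (g @ g)) * g  # minimal safe correction
```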

A Joint Reinforcement-Learning Enabled Caching and Cross-Layer Network Code for Sum-Rate Maximization in F-RAN with D2D Communications

Authors:Mohammed S. Al-Abiad, Md. Zoheb Hassan, Md. Jahangir Hossain
Date:2021-03-22 18:57:45

In this paper, we leverage reinforcement learning (RL) and cross-layer network coding (CLNC) for efficiently pre-fetching users' contents to the local caches and delivering these contents to users in a downlink fog-radio access network (F-RAN) with device-to-device (D2D) communications. In the considered system, fog access points (F-APs) and cache-enabled D2D (CE-D2D) users are equipped with local caches for alleviating traffic burden at the fronthaul, while users' contents can be easily and quickly accommodated. In CLNC, the coding decisions take users' contents, their rates, and power levels of F-APs and CE-D2D users into account, and RL optimizes caching strategy. Towards this goal, a joint content placement and delivery problem is formulated as an optimization problem with a goal to maximize system sum-rate. For this NP-hard problem, we first develop an innovative decentralized CLNC coalition formation (CLNC-CF) algorithm to obtain a stable solution for the content delivery problem, where F-APs and CE-D2D users utilize CLNC resource allocation. By taking the behavior of F-APs and CE-D2D users into account, we then develop a multi-agent RL (MARL) algorithm for optimizing the content placements at both F-APs and CE-D2D users. Simulation results show that the proposed joint CLNC-CF and RL framework can effectively improve the sum-rate by up to 30\%, 60\%, and 150\%, respectively, compared to: 1) an optimal uncoded algorithm, 2) a standard rate-aware-NC algorithm, and 3) a benchmark classical NC with network-layer optimization.

Learning to Robustly Negotiate Bi-Directional Lane Usage in High-Conflict Driving Scenarios

Authors:Christoph Killing, Adam Villaflor, John M. Dolan
Date:2021-03-22 14:46:43

Recently, autonomous driving has made substantial progress in addressing the most common traffic scenarios like intersection navigation and lane changing. However, most of these successes have been limited to scenarios with well-defined traffic rules and require minimal negotiation with other vehicles. In this paper, we introduce a previously unconsidered, yet everyday, high-conflict driving scenario requiring negotiations between agents of equal rights and priorities. There exists no centralized control structure and we do not allow communications. Therefore, it is unknown if other drivers are willing to cooperate, and if so to what extent. We train policies to robustly negotiate with opposing vehicles of an unobservable degree of cooperativeness using multi-agent reinforcement learning (MARL). We propose Discrete Asymmetric Soft Actor-Critic (DASAC), a maximum-entropy off-policy MARL algorithm allowing for centralized training with decentralized execution. We show that using DASAC we are able to successfully negotiate and traverse the scenario considered over 99% of the time. Our agents are robust to an unknown timing of opponent decisions, an unobservable degree of cooperativeness of the opposing vehicle, and previously unencountered policies. Furthermore, they learn to exhibit human-like behaviors such as defensive driving, anticipating solution options and interpreting the behavior of other agents.

Regularized Softmax Deep Multi-Agent $Q$-Learning

Authors:Ling Pan, Tabish Rashid, Bei Peng, Longbo Huang, Shimon Whiteson
Date:2021-03-22 14:18:39

Tackling overestimation in $Q$-learning is an important problem that has been extensively studied in single-agent reinforcement learning, but has received comparatively little attention in the multi-agent setting. In this work, we empirically demonstrate that QMIX, a popular $Q$-learning algorithm for cooperative multi-agent reinforcement learning (MARL), suffers from a more severe overestimation in practice than previously acknowledged, which is not mitigated by existing approaches. We rectify this with a novel regularization-based update scheme that penalizes large joint action-values that deviate from a baseline, and demonstrate its effectiveness in stabilizing learning. Furthermore, we propose to employ a softmax operator, which we efficiently approximate in a novel way in the multi-agent setting, to further reduce the potential overestimation bias. Our approach, Regularized Softmax (RES) Deep Multi-Agent $Q$-Learning, is general and can be applied to any $Q$-learning based MARL algorithm. We demonstrate that, when applied to QMIX, RES avoids severe overestimation and significantly improves performance, yielding state-of-the-art results in a variety of cooperative multi-agent tasks, including the challenging StarCraft II micromanagement benchmarks.
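For reference, one common form of the softmax operator used in place of the hard max when bootstrapping is sketched below; RES additionally approximates it efficiently over joint actions and adds the baseline regularizer, neither of which this sketch attempts:

```python
import torch

def softmax_value(q_values, tau=5.0):
    """Softmax operator: a smooth, lower-variance alternative to max
    that reduces overestimation in the bootstrapped target."""
    w = torch.softmax(tau * q_values, dim=-1)
    return (w * q_values).sum(dim=-1)

# TD target with the softmax operator instead of a hard max:
# target = reward + gamma * softmax_value(q_next)
```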

Local Patch AutoAugment with Multi-Agent Collaboration

Authors:Shiqi Lin, Tao Yu, Ruoyu Feng, Xin Li, Xin Jin, Zhibo Chen
Date:2021-03-20 05:10:05

Data augmentation (DA) plays a critical role in improving the generalization of deep learning models. Recent works on automatically searching for DA policies from data have achieved great success. However, existing automated DA methods generally perform the search at the image level, which limits the exploration of diversity in local regions. In this paper, we propose a more fine-grained automated DA approach, dubbed Patch AutoAugment, to divide an image into a grid of patches and search for the joint optimal augmentation policies for the patches. We formulate it as a multi-agent reinforcement learning (MARL) problem, where each agent learns an augmentation policy for each patch based on its content together with the semantics of the whole image. The agents cooperate with each other to achieve the optimal augmentation effect of the entire image by sharing a team reward. We show the effectiveness of our method on multiple benchmark datasets of image classification and fine-grained image recognition (e.g., CIFAR-10, CIFAR-100, ImageNet, CUB-200-2011, Stanford Cars and FGVC-Aircraft). Extensive experiments demonstrate that our method outperforms the state-of-the-art DA methods while requiring fewer computational resources.

Learning in Nonzero-Sum Stochastic Games with Potentials

Authors:David Mguni, Yutong Wu, Yali Du, Yaodong Yang, Ziyi Wang, Minne Li, Ying Wen, Joel Jennings, Jun Wang
Date:2021-03-16 19:02:01

Multi-agent reinforcement learning (MARL) has become effective in tackling discrete cooperative game scenarios. However, MARL has yet to penetrate settings beyond those modelled by team and zero-sum games, confining it to a small subset of multi-agent systems. In this paper, we introduce a new generation of MARL learners that can handle nonzero-sum payoff structures and continuous settings. In particular, we study the MARL problem in a class of games known as stochastic potential games (SPGs) with continuous state-action spaces. Unlike cooperative games, in which all agents share a common reward, SPGs are capable of modelling real-world scenarios where agents seek to fulfil their individual goals. We prove theoretically that our learning method, SPot-AC, enables independent agents to learn Nash equilibrium strategies in polynomial time. We demonstrate that our framework tackles previously unsolvable tasks such as Coordination Navigation and large selfish routing games, and that it outperforms state-of-the-art MARL baselines such as MADDPG and COMIX in such scenarios.

Multi-Agent Reinforcement Learning for Joint Cooperative Spectrum Sensing and Channel Access in Cognitive UAV Networks

Authors:Weiheng Jiang, Wanxin Yu, Wenbo Wang, Tiancong Huang
Date:2021-03-15 07:39:32

This paper studies the problem of distributed spectrum/channel access for cognitive radio-enabled unmanned aerial vehicles (CUAVs) that overlay upon primary channels. Under the framework of cooperative spectrum sensing and opportunistic transmission, a one-shot optimization problem for channel allocation, aiming to maximize the expected cumulative weighted reward of multiple CUAVs, is formulated. To handle the uncertainty due to the lack of prior knowledge about the primary user activities as well as the lack of the channel-access coordinator, the original problem is cast into a competition and cooperation hybrid multi-agent reinforcement learning (CCH-MARL) problem in the framework of Markov game (MG). Then, a value-iteration-based RL algorithm, which features upper confidence bound-Hoeffding (UCB-H) strategy searching, is proposed by treating each CUAV as an independent learner (IL). To address the curse of dimensionality, the UCB-H strategy is further extended with a double deep Q-network (DDQN). Numerical simulations show that the proposed algorithms are able to efficiently converge to stable strategies, and significantly improve the network performance when compared with the benchmark algorithms such as the vanilla Q-learning and DDQN algorithms.
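The UCB-Hoeffding strategy search amounts to adding a count-based confidence bonus to the Q-values at selection time; a generic sketch (the constant c and the exact bonus form are assumptions):

```python
import numpy as np

def ucb_h_action(q, counts, t, c=1.0):
    """Pick the action maximizing Q plus a Hoeffding-style bonus that
    shrinks as an action is tried more often."""
    bonus = c * np.sqrt(np.log(t + 1) / np.maximum(counts, 1))
    return int(np.argmax(q + bonus))
```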

Adversarial attacks in consensus-based multi-agent reinforcement learning

Authors:Martin Figura, Krishna Chaitanya Kosaraju, Vijay Gupta
Date:2021-03-11 21:44:18

Recently, many cooperative distributed multi-agent reinforcement learning (MARL) algorithms have been proposed in the literature. In this work, we study the effect of adversarial attacks on a network that employs a consensus-based MARL algorithm. We show that an adversarial agent can persuade all the other agents in the network to implement policies that optimize an objective that it desires. In this sense, the standard consensus-based MARL algorithms are fragile to attacks.

LSTMs and Deep Residual Networks for Carbohydrate and Bolus Recommendations in Type 1 Diabetes Management

Authors:Jeremy Beauchamp, Razvan Bunescu, Cindy Marling, Zhongen Li, Chang Liu
Date:2021-03-06 19:06:14

To avoid serious diabetic complications, people with type 1 diabetes must keep their blood glucose levels (BGLs) as close to normal as possible. Insulin dosages and carbohydrate consumption are important considerations in managing BGLs. Since the 1960s, models have been developed to forecast blood glucose levels based on the history of BGLs, insulin dosages, carbohydrate intake, and other physiological and lifestyle factors. Such predictions can be used to alert people of impending unsafe BGLs or to control insulin flow in an artificial pancreas. In past work, we have introduced an LSTM-based approach to blood glucose level prediction aimed at "what if" scenarios, in which people could enter foods they might eat or insulin amounts they might take and then see the effect on future BGLs. In this work, we invert the "what-if" scenario and introduce a similar architecture based on chaining two LSTMs that can be trained to make either insulin or carbohydrate recommendations aimed at reaching a desired BG level in the future. Leveraging a recent state-of-the-art model for time series forecasting, we then derive a novel architecture for the same recommendation task, in which the two LSTM chain is used as a repeating block inside a deep residual architecture. Experimental evaluations using real patient data from the OhioT1DM dataset show that the new integrated architecture compares favorably with the previous LSTM-based approach, substantially outperforming the baselines. The promising results suggest that this novel approach could potentially be of practical use to people with type 1 diabetes for self-management of BGLs.
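A rough sketch of the two-LSTM chain used as the base recommendation block; the input layout and dimensions are assumptions, and the deep residual variant stacks such blocks with skip connections:

```python
import torch
import torch.nn as nn

class ChainedLSTM(nn.Module):
    """The first LSTM encodes the BGL/insulin/carb history; its final
    state seeds the second LSTM, which rolls over the future window to
    regress the recommended carbohydrate or insulin amount."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.past = nn.LSTM(in_dim, hidden, batch_first=True)
        self.future = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x_past, x_future):
        _, state = self.past(x_past)           # encode the history
        out, _ = self.future(x_future, state)  # continue into the future
        return self.head(out[:, -1])           # recommendation
```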

Continuous Coordination As a Realistic Scenario for Lifelong Learning

Authors:Hadi Nekoei, Akilesh Badrinaaraayanan, Aaron Courville, Sarath Chandar
Date:2021-03-04 18:44:03

Current deep reinforcement learning (RL) algorithms are still highly task-specific and lack the ability to generalize to new environments. Lifelong learning (LLL), however, aims at solving multiple tasks sequentially by efficiently transferring and using knowledge between tasks. Despite a surge of interest in lifelong RL in recent years, the lack of a realistic testbed makes robust evaluation of LLL algorithms difficult. Multi-agent RL (MARL), on the other hand, can be seen as a natural scenario for lifelong RL due to its inherent non-stationarity, since the agents' policies change over time. In this work, we introduce a multi-agent lifelong learning testbed that supports both zero-shot and few-shot settings. Our setup is based on Hanabi -- a partially-observable, fully cooperative multi-agent game that has been shown to be challenging for zero-shot coordination. Its large strategy space makes it a desirable environment for lifelong RL tasks. We evaluate several recent MARL methods, and benchmark state-of-the-art LLL algorithms in limited memory and computation regimes to shed light on their strengths and weaknesses. This continual learning paradigm also provides us with a pragmatic way of going beyond centralized training which is the most commonly used training protocol in MARL. We empirically show that the agents trained in our setup are able to coordinate well with unseen agents, without any additional assumptions made by previous works. The code and all pre-trained models are available at https://github.com/chandar-lab/Lifelong-Hanabi.

Learning Emergent Discrete Message Communication for Cooperative Reinforcement Learning

Authors:Sheng Li, Yutai Zhou, Ross Allen, Mykel J. Kochenderfer
Date:2021-02-24 20:44:14

Communication is an important factor that enables agents to work cooperatively in multi-agent reinforcement learning (MARL). Most previous work uses continuous message communication, whose high representational capacity comes at the expense of interpretability. Allowing agents to learn their own discrete message communication protocol, as emerges in a variety of domains, can increase interpretability for human designers and other agents. This paper proposes a method to generate discrete messages analogous to human languages and to achieve communication through a broadcast-and-listen mechanism based on self-attention. We show that discrete message communication has performance comparable to continuous message communication but with a much smaller vocabulary size. Furthermore, we propose an approach that allows humans to interactively send discrete messages to agents.
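Discrete message tokens can be learned end-to-end with a straight-through Gumbel-softmax, a common trick for this purpose; this only illustrates the discretization step, not the paper's broadcast-and-listen self-attention mechanism:

```python
import torch.nn.functional as F

def discrete_message(logits, tau=1.0):
    """Sample a one-hot message token differentiably: hard one-hot in the
    forward pass, soft gradients in the backward pass."""
    return F.gumbel_softmax(logits, tau=tau, hard=True)

# logits = message_head(agent_hidden)  # (batch, vocab_size)
# token = discrete_message(logits)     # one-hot token to broadcast
```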

Credit Assignment with Meta-Policy Gradient for Multi-Agent Reinforcement Learning

Authors:Jianzhun Shao, Hongchang Zhang, Yuhang Jiang, Shuncheng He, Xiangyang Ji
Date:2021-02-24 12:03:37

Reward decomposition is a critical problem in the centralized training with decentralized execution~(CTDE) paradigm for multi-agent reinforcement learning. To take full advantage of global information, which exploits the states from all agents and the related environment for decomposing Q values into individual credits, we propose a general meta-learning-based Mixing Network with Meta Policy Gradient~(MNMPG) framework to distill the global hierarchy for delicate reward decomposition. The excitation signal for learning the global hierarchy is derived from the episode reward difference before and after "exercise updates" through the utility network. Our method is generally applicable to CTDE methods that use a monotonic mixing network. Experiments on the StarCraft II micromanagement benchmark demonstrate that our method, with just a simple utility network, is able to outperform current state-of-the-art MARL algorithms on 4 of 5 super hard scenarios. Better performance can be further achieved when combined with a role-based utility network.

Dealing with Non-Stationarity in MARL via Trust-Region Decomposition

Authors:Wenhao Li, Xiangfeng Wang, Bo Jin, Junjie Sheng, Hongyuan Zha
Date:2021-02-21 14:46:50

Non-stationarity is a thorny issue in cooperative multi-agent reinforcement learning (MARL). One of its causes is the change of agents' policies during the learning process. Some existing works have discussed various consequences of non-stationarity using several kinds of measurement indicators, which makes the objectives or goals of existing algorithms inevitably inconsistent and disparate. In this paper, we introduce a novel notion, the $\delta$-measurement, to explicitly measure the non-stationarity of a policy sequence, which can further be proved to be bounded by the KL-divergence of consecutive joint policies. A straightforward but highly non-trivial approach is to control the divergence of the joint policies by imposing a trust-region constraint on the joint policy, which is difficult to estimate accurately. Although decomposing the joint policy and imposing trust-region constraints on the factorized policies has lower computational complexity, simple policy factorization like the mean-field approximation leads to considerably larger policy divergence, which can be regarded as the trust-region decomposition dilemma. We model the joint policy as a pairwise Markov random field and propose a trust-region decomposition network (TRD-Net) based on message passing to estimate the joint policy divergence more accurately. The Multi-Agent Mirror descent policy algorithm with Trust region decomposition, called MAMT, is established by adjusting the trust-region of the local policies adaptively in an end-to-end manner. MAMT can approximately constrain the divergence of consecutive joint policies to satisfy $\delta$-stationarity and alleviate the non-stationarity problem. Our method brings noticeable and stable performance improvement over baselines in cooperative tasks of different complexity.

Decentralized Deterministic Multi-Agent Reinforcement Learning

Authors:Antoine Grosnit, Desmond Cai, Laura Wynter
Date:2021-02-19 05:10:15

[Zhang, ICML 2018] provided the first decentralized actor-critic algorithm for multi-agent reinforcement learning (MARL) that offers convergence guarantees. In that work, policies are stochastic and are defined on finite action spaces. We extend those results to offer a provably-convergent decentralized actor-critic algorithm for learning deterministic policies on continuous action spaces. Deterministic policies are important in real-world settings. To handle the lack of exploration inherent in deterministic policies, we consider both off-policy and on-policy settings. We provide the expression of a local deterministic policy gradient, decentralized deterministic actor-critic algorithms and convergence guarantees for linearly-approximated value functions. This work will help enable decentralized MARL in high-dimensional action spaces and pave the way for more widespread use of MARL.
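For reference, the local deterministic policy gradient for agent $i$ takes the familiar DPG form below, written informally; the paper derives the decentralized, linearly-approximated version with convergence guarantees:

$$\nabla_{\theta_i} J(\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[ \nabla_{\theta_i} \mu_{\theta_i}(s)\, \nabla_{a_i} Q^{\mu}(s, a_1, \dots, a_N)\big|_{a_j = \mu_{\theta_j}(s)} \right]$$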

RMIX: Learning Risk-Sensitive Policies for Cooperative Reinforcement Learning Agents

Authors:Wei Qiu, Xinrun Wang, Runsheng Yu, Xu He, Rundong Wang, Bo An, Svetlana Obraztsova, Zinovi Rabinovich
Date:2021-02-16 13:58:25

Current value-based multi-agent reinforcement learning methods optimize individual Q values to guide individuals' behaviours via centralized training with decentralized execution (CTDE). However, such an expected, i.e., risk-neutral, Q value is not sufficient even with CTDE due to the randomness of rewards and the uncertainty in environments, which causes the failure of these methods to train coordinating agents in complex environments. To address these issues, we propose RMIX, a novel cooperative MARL method with the Conditional Value at Risk (CVaR) measure over the learned distributions of individuals' Q values. Specifically, we first learn the return distributions of individuals to analytically calculate CVaR for decentralized execution. Then, to handle the temporal nature of the stochastic outcomes during executions, we propose a dynamic risk level predictor for risk level tuning. Finally, we optimize the CVaR policies with CVaR values used to estimate the target in the TD error during centralized training, and the CVaR values are used as auxiliary local rewards to update the local distribution via the Quantile Regression loss. Empirically, we show that our method significantly outperforms state-of-the-art methods on challenging StarCraft II tasks, demonstrating enhanced coordination and improved sample efficiency.
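Computing CVaR analytically from learned quantiles is straightforward; a sketch assuming N equally weighted quantile estimates (the dynamic risk-level predictor and the quantile-regression loss are not shown):

```python
import torch

def cvar_from_quantiles(quantiles, alpha=0.1):
    """CVaR_alpha of the return: the mean of the worst alpha-fraction
    of N equally spaced quantile estimates."""
    q_sorted, _ = torch.sort(quantiles, dim=-1)
    k = max(1, int(alpha * quantiles.shape[-1]))
    return q_sorted[..., :k].mean(dim=-1)
```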

DFAC Framework: Factorizing the Value Function via Quantile Mixture for Multi-Agent Distributional Q-Learning

Authors:Wei-Fang Sun, Cheng-Kuang Lee, Chun-Yi Lee
Date:2021-02-16 03:16:49

In fully cooperative multi-agent reinforcement learning (MARL) settings, the environments are highly stochastic due to the partial observability of each agent and the continuously changing policies of the other agents. To address the above issues, we integrate distributional RL and value function factorization methods by proposing a Distributional Value Function Factorization (DFAC) framework to generalize expected value function factorization methods to their DFAC variants. DFAC extends the individual utility functions from deterministic variables to random variables, and models the quantile function of the total return as a quantile mixture. To validate DFAC, we demonstrate DFAC's ability to factorize a simple two-step matrix game with stochastic rewards and perform experiments on all Super Hard tasks of StarCraft Multi-Agent Challenge, showing that DFAC is able to outperform expected value function factorization baselines.

Diverse Auto-Curriculum is Critical for Successful Real-World Multiagent Learning Systems

Authors:Yaodong Yang, Jun Luo, Ying Wen, Oliver Slumbers, Daniel Graves, Haitham Bou Ammar, Jun Wang, Matthew E. Taylor
Date:2021-02-15 16:40:02

Multiagent reinforcement learning (MARL) has achieved a remarkable amount of success in solving various types of video games. A cornerstone of this success is the auto-curriculum framework, which shapes the learning process by continually creating new challenging tasks for agents to adapt to, thereby facilitating the acquisition of new skills. In order to extend MARL methods to real-world domains outside of video games, we envision in this blue sky paper that maintaining a diversity-aware auto-curriculum is critical for successful MARL applications. Specifically, we argue that \emph{behavioural diversity} is a pivotal, yet under-explored, component for real-world multiagent learning systems, and that significant work remains in understanding how to design a diversity-aware auto-curriculum. We list four open challenges for auto-curriculum techniques, which we believe deserve more attention from this community. Towards validating our vision, we recommend modelling realistic interactive behaviours in autonomous driving as an important test bed, and recommend the SMARTS/ULTRA benchmark.

Modeling the Interaction between Agents in Cooperative Multi-Agent Reinforcement Learning

Authors:Xiaoteng Ma, Yiqin Yang, Chenghao Li, Yiwen Lu, Qianchuan Zhao, Yang Jun
Date:2021-02-10 01:58:28

Value-based methods of multi-agent reinforcement learning (MARL), especially the value decomposition methods, have been demonstrated on a range of challenging cooperative tasks. However, current methods pay little attention to the interaction between agents, which is essential to teamwork in games and real life. This limits the efficiency of value-based MARL algorithms in two aspects: collaborative exploration and value function estimation. In this paper, we propose a novel cooperative MARL algorithm named interactive actor-critic~(IAC), which models the interaction of agents from the perspectives of the policy and the value function. On the policy side, a multi-agent joint stochastic policy is introduced by adopting a collaborative exploration module, which is trained by maximizing the entropy-regularized expected return. On the value side, we use a shared attention mechanism to estimate the value function of each agent, which takes the impact of teammates into consideration. At the implementation level, we extend the value decomposition methods to continuous control tasks and evaluate IAC on benchmark tasks including classic control and multi-agent particle environments. Experimental results indicate that our method outperforms state-of-the-art approaches and achieves better performance in terms of cooperation.

Structured Diversification Emergence via Reinforced Organization Control and Hierarchical Consensus Learning

Authors:Wenhao Li, Xiangfeng Wang, Bo Jin, Junjie Sheng, Yun Hua, Hongyuan Zha
Date:2021-02-09 11:46:12

When solving a complex task, humans spontaneously form teams to complete different parts of the whole task, and the cooperation between teammates improves efficiency. However, in current cooperative MARL methods, the cooperation team is constructed either through heuristics or through end-to-end blackbox optimization. To improve the efficiency of cooperation and exploration, we propose a structured diversification emergence MARL framework named {\sc{Rochico}} based on reinforced organization control and hierarchical consensus learning. {\sc{Rochico}} first learns an adaptive grouping policy through the organization control module, which is established by independent multi-agent reinforcement learning. Further, the hierarchical consensus module, based on hierarchical intentions with a consensus constraint, is introduced after team formation. Simultaneously, utilizing the hierarchical consensus module and a self-supervised intrinsic-reward-enhanced decision module, the proposed cooperative MARL algorithm {\sc{Rochico}} outputs the final diversified multi-agent cooperative policy. All three modules are organically combined to promote the structured diversification emergence. Comparative experiments on four large-scale cooperation tasks show that {\sc{Rochico}} is significantly better than current SOTA algorithms in terms of exploration efficiency and cooperation strength.

Rethinking the Implementation Tricks and Monotonicity Constraint in Cooperative Multi-Agent Reinforcement Learning

Authors:Jian Hu, Siyang Jiang, Seth Austin Harding, Haibin Wu, Shih-wei Liao
Date:2021-02-06 02:28:09

Many complex multi-agent systems such as robot swarms control and autonomous vehicle coordination can be modeled as Multi-Agent Reinforcement Learning (MARL) tasks. QMIX, a widely popular MARL algorithm, has been used as a baseline for the benchmark environments, e.g., Starcraft Multi-Agent Challenge (SMAC), Difficulty-Enhanced Predator-Prey (DEPP). Recent variants of QMIX target relaxing the monotonicity constraint of QMIX, allowing for performance improvement in SMAC. In this paper, we investigate the code-level optimizations of these variants and the monotonicity constraint. (1) We find that such improvements of the variants are significantly affected by various code-level optimizations. (2) The experiment results show that QMIX with normalized optimizations outperforms other works in SMAC; (3) beyond the common wisdom from these works, the monotonicity constraint can improve sample efficiency in SMAC and DEPP. We also discuss why monotonicity constraints work well in purely cooperative tasks with a theoretical analysis. We open-source the code at \url{https://github.com/hijkzzz/pymarl2}.
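For context, the monotonicity constraint under discussion is enforced in QMIX by generating the mixing weights with state-conditioned hypernetworks and taking their absolute value; a simplified sketch follows (see the authors' pymarl2 repository for the tuned implementation):

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """QMIX-style mixer: absolute-valued hypernetwork weights guarantee
    dQ_tot/dQ_i >= 0 for every agent i."""
    def __init__(self, n_agents, state_dim, embed=32):
        super().__init__()
        self.w1 = nn.Linear(state_dim, n_agents * embed)
        self.b1 = nn.Linear(state_dim, embed)
        self.w2 = nn.Linear(state_dim, embed)
        self.b2 = nn.Linear(state_dim, 1)
        self.n_agents, self.embed = n_agents, embed

    def forward(self, q_agents, state):
        # q_agents: (batch, n_agents), state: (batch, state_dim)
        w1 = torch.abs(self.w1(state)).view(-1, self.n_agents, self.embed)
        h = torch.relu(torch.bmm(q_agents.unsqueeze(1), w1).squeeze(1)
                       + self.b1(state))
        w2 = torch.abs(self.w2(state)).unsqueeze(-1)
        return torch.bmm(h.unsqueeze(1), w2).squeeze(-1) + self.b2(state)
```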

Hybrid Information-driven Multi-agent Reinforcement Learning

Authors:William A. Dawson, Ruben Glatt, Edward Rusu, Braden C. Soper, Ryan A. Goldhahn
Date:2021-02-01 17:28:39

Information theoretic sensor management approaches are an ideal solution to state estimation problems when considering the optimal control of multi-agent systems; however, they are too computationally intensive for large state spaces, especially when considering the limited computational resources typical of large-scale distributed multi-agent systems. Reinforcement learning (RL) is a promising alternative which can find approximate solutions to distributed optimal control problems that take into account the resource constraints inherent in many systems of distributed agents. However, RL training can be prohibitively inefficient, especially in low-information environments where agents receive little to no feedback in large portions of the state space. We propose a hybrid information-driven multi-agent reinforcement learning (MARL) approach that utilizes information theoretic models as heuristics to help the agents navigate large sparse state spaces, coupled with information-based rewards in an RL framework to learn higher-level policies. This paper presents our ongoing work towards this objective. Our preliminary findings show that such an approach can result in a system of agents that are approximately three orders of magnitude more efficient at exploring a sparse state space than naive baseline metrics. While the work is still in its early stages, it provides a promising direction for future research.

On a relativistic BGK model for polyatomic gases near equilibrium

Authors:Byung-Hoon Hwang, Tommaso Ruggeri, Seok-Bae Yun
Date:2021-01-31 14:39:22

Recently, a novel relativistic polyatomic BGK model was suggested by Pennisi and Ruggeri [J. of Phys. Conf. Series, 1035, (2018)] to overcome drawbacks of the Anderson-Witting model and the Marle model. In this paper, we prove the unique existence and asymptotic behavior of classical solutions to the relativistic polyatomic BGK model when the initial data is sufficiently close to a global equilibrium.

Safe Multi-Agent Reinforcement Learning via Shielding

Authors:Ingy Elsayed-Aly, Suda Bharadwaj, Christopher Amato, Rüdiger Ehlers, Ufuk Topcu, Lu Feng
Date:2021-01-27 04:27:06

Multi-agent reinforcement learning (MARL) has been increasingly used in a wide range of safety-critical applications, which require guaranteed safety (e.g., no unsafe states are ever visited) during the learning process. Unfortunately, current MARL methods do not have safety guarantees. Therefore, we present two shielding approaches for safe MARL. In centralized shielding, we synthesize a single shield to monitor all agents' joint actions and correct any unsafe action if necessary. In factored shielding, we synthesize multiple shields based on a factorization of the joint state space observed by all agents; the set of shields monitors agents concurrently, and each shield is only responsible for a subset of agents at each step. Experimental results show that both approaches can guarantee the safety of agents during learning without compromising the quality of learned policies; moreover, factored shielding is more scalable in the number of agents than centralized shielding.
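Operationally, a factored shield sits between the policies and the environment; the sketch below uses an entirely hypothetical shield API (is_safe, correct) purely to show the control flow:

```python
def shielded_step(agents, obs, shields, env):
    """Each shield monitors its own subset of agents and overrides any
    joint action it cannot verify as safe (hypothetical API)."""
    actions = {i: agent.act(obs[i]) for i, agent in agents.items()}
    for shield in shields:                 # one shield per factor
        sub = {i: actions[i] for i in shield.agents}
        if not shield.is_safe(sub):
            actions.update(shield.correct(sub))  # minimal safe substitution
    return env.step(actions)
```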

HAMMER: Multi-Level Coordination of Reinforcement Learning Agents via Learned Messaging

Authors:Nikunj Gupta, G Srinivasaraghavan, Swarup Kumar Mohalik, Nishant Kumar, Matthew E. Taylor
Date:2021-01-18 19:00:12

Cooperative multi-agent reinforcement learning (MARL) has achieved significant results, most notably by leveraging the representation-learning abilities of deep neural networks. However, large centralized approaches quickly become infeasible as the number of agents scales, and fully decentralized approaches can miss important opportunities for information sharing and coordination. Furthermore, not all agents are equal -- in some cases, individual agents may not even have the ability to send communication to other agents or explicitly model other agents. This paper considers the case where there is a single, powerful, \emph{central agent} that can observe the entire observation space, and there are multiple, low-powered \emph{local agents} that can only receive local observations and are not able to communicate with each other. The central agent's job is to learn what message needs to be sent to different local agents based on the global observations, not by centrally solving the entire problem and sending action commands, but by determining what additional information an individual agent should receive so that it can make a better decision. In this work we present our MARL algorithm HAMMER, describe where it would be most applicable, and implement it in the cooperative navigation and multi-agent walker domains. Empirical results show that 1) learned communication does indeed improve system performance, 2) results generalize to heterogeneous local agents, and 3) results generalize to different reward structures.

Cooperative and Competitive Biases for Multi-Agent Reinforcement Learning

Authors:Heechang Ryu, Hayong Shin, Jinkyoo Park
Date:2021-01-18 05:52:22

Training a multi-agent reinforcement learning (MARL) algorithm is more challenging than training a single-agent reinforcement learning algorithm, because the result of a multi-agent task strongly depends on the complex interactions among agents and their interactions with a stochastic and dynamic environment. We propose an algorithm that boosts MARL training using the biased action information of other agents based on a friend-or-foe concept. For a cooperative and competitive environment, there are generally two groups of agents: cooperative agents and competitive agents. In the proposed algorithm, each agent updates its value function using its own action and the biased action information of other agents in the two groups. The biased joint action of cooperative agents is computed as the sum of their actual joint action and an imaginary cooperative joint action, obtained by assuming all the cooperative agents jointly maximize the target agent's value function. The biased joint action of competitive agents can be computed similarly. Each agent then updates its own value function using the biased action information, resulting in a biased value function and a corresponding biased policy. Subsequently, the biased policy of each agent inevitably recommends actions that cooperate and compete with other agents, thereby introducing more active interactions among agents and enhancing MARL policy learning. We empirically demonstrate that our algorithm outperforms existing algorithms in various mixed cooperative-competitive environments. Furthermore, the introduced biases gradually decrease as the training proceeds, and the correction based on the imaginary assumption vanishes.
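One loose reading of the imaginary joint action is a gradient nudge on other agents' actions with respect to the target agent's value, upward for cooperators and downward for competitors; the sketch below is our interpretation only, and the names and gradient-step form are assumptions:

```python
import torch

def biased_actions(q_fn, s, a_self, a_coop, a_comp, step=0.1):
    """Bias cooperators' actions up the target agent's value gradient
    and competitors' actions down it (friend-or-foe concept)."""
    a_coop = a_coop.clone().requires_grad_(True)
    a_comp = a_comp.clone().requires_grad_(True)
    q_fn(s, a_self, a_coop, a_comp).sum().backward()
    with torch.no_grad():
        return (a_coop + step * a_coop.grad,   # friends help me
                a_comp - step * a_comp.grad)   # foes hinder me
```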

Coding for Distributed Multi-Agent Reinforcement Learning

Authors:Baoqian Wang, Junfei Xie, Nikolay Atanasov
Date:2021-01-07 00:22:34

This paper aims to mitigate straggler effects in synchronous distributed learning for multi-agent reinforcement learning (MARL) problems. Stragglers arise frequently in a distributed learning system, due to the existence of various system disturbances such as slow-downs or failures of compute nodes and communication bottlenecks. To resolve this issue, we propose a coded distributed learning framework, which speeds up the training of MARL algorithms in the presence of stragglers, while maintaining the same accuracy as the centralized approach. As an illustration, a coded distributed version of the multi-agent deep deterministic policy gradient (MADDPG) algorithm is developed and evaluated. Different coding schemes, including maximum distance separable (MDS) code, random sparse code, replication-based code, and regular low density parity check (LDPC) code are also investigated. Simulations in several multi-robot problems demonstrate the promising performance of the proposed framework.

Derivative-Free Policy Optimization for Linear Risk-Sensitive and Robust Control Design: Implicit Regularization and Sample Complexity

Authors:Kaiqing Zhang, Xiangyuan Zhang, Bin Hu, Tamer Başar
Date:2021-01-04 16:00:46

Direct policy search serves as one of the workhorses in modern reinforcement learning (RL), and its applications in continuous control tasks have recently attracted increasing attention. In this work, we investigate the convergence theory of policy gradient (PG) methods for learning the linear risk-sensitive and robust controller. In particular, we develop PG methods that can be implemented in a derivative-free fashion by sampling system trajectories, and establish both global convergence and sample complexity results in the solutions of two fundamental settings in risk-sensitive and robust control: the finite-horizon linear exponential quadratic Gaussian, and the finite-horizon linear-quadratic disturbance attenuation problems. As a by-product, our results also provide the first sample complexity for the global convergence of PG methods on solving zero-sum linear-quadratic dynamic games, a nonconvex-nonconcave minimax optimization problem that serves as a baseline setting in multi-agent reinforcement learning (MARL) with continuous spaces. One feature of our algorithms is that during the learning phase, a certain level of robustness/risk-sensitivity of the controller is preserved, which we termed as the implicit regularization property, and is an essential requirement in safety-critical control systems.
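Derivative-free PG methods of this kind typically estimate the gradient with two-point zeroth-order sampling: perturb the controller parameters on a small sphere and difference the sampled returns. A generic sketch (smoothing radius and sample count are arbitrary choices):

```python
import numpy as np

def zo_gradient(J, theta, r=0.05, n_samples=64):
    """Two-point zeroth-order estimate of grad J(theta) by sampling
    directions uniformly on a sphere of radius r."""
    d, g = theta.size, np.zeros_like(theta)
    for _ in range(n_samples):
        u = np.random.randn(d)
        u *= r / np.linalg.norm(u)                  # ||u|| = r
        g += (J(theta + u) - J(theta - u)) / (2 * r) * (u / r)
    return (d / n_samples) * g
```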

Effective Communications: A Joint Learning and Communication Framework for Multi-Agent Reinforcement Learning over Noisy Channels

Authors:Tze-Yang Tung, Szymon Kobus, Joan Roig Pujol, Deniz Gunduz
Date:2021-01-02 10:43:41

We propose a novel formulation of the "effectiveness problem" in communications, put forth by Shannon and Weaver in their seminal work [2], by considering multiple agents communicating over a noisy channel in order to achieve better coordination and cooperation in a multi-agent reinforcement learning (MARL) framework. Specifically, we consider a multi-agent partially observable Markov decision process (MA-POMDP), in which the agents, in addition to interacting with the environment can also communicate with each other over a noisy communication channel. The noisy communication channel is considered explicitly as part of the dynamics of the environment and the message each agent sends is part of the action that the agent can take. As a result, the agents learn not only to collaborate with each other but also to communicate "effectively" over a noisy channel. This framework generalizes both the traditional communication problem, where the main goal is to convey a message reliably over a noisy channel, and the "learning to communicate" framework that has received recent attention in the MARL literature, where the underlying communication channels are assumed to be error-free. We show via examples that the joint policy learned using the proposed framework is superior to that where the communication is considered separately from the underlying MA-POMDP. This is a very powerful framework, which has many real world applications, from autonomous vehicle planning to drone swarm control, and opens up the rich toolbox of deep reinforcement learning for the design of multi-user communication systems.

Vehicular Network Slicing for Reliable Access and Deadline-Constrained Data Offloading: A Multi-Agent On-Device Learning Approach

Authors:Md Ferdous Pervej, Shih-Chun Lin
Date:2020-12-31 11:15:10

Efficient data offloading plays a pivotal role in computational-intensive platforms as data rate over wireless channels is fundamentally limited. On top of that, high mobility adds an extra burden in vehicular edge networks (VENs), bolstering the desire for efficient user-centric solutions. Therefore, unlike the legacy inflexible network-centric approach, this paper exploits a software-defined flexible, open, and programmable networking platform for an efficient user-centric, fast, reliable, and deadline-constrained offloading solution in VENs. In the proposed model, each active vehicle user (VU) is served from multiple low-powered access points (APs) by creating a novel virtual cell (VC). A joint node association, power allocation, and distributed resource allocation problem is formulated. As centralized learning is not practical in many real-world problems, following the distributed nature of autonomous VUs, each VU is considered an edge learning agent. To that end, considering practical location-aware node associations, a joint radio and power resource allocation non-cooperative stochastic game is formulated. Leveraging reinforcement learning's (RL) efficacy, a multi-agent RL (MARL) solution is proposed where the edge learners aim to learn the Nash equilibrium (NE) strategies to solve the game efficiently. Besides, real-world map data, with a practical microscopic mobility model, are used for the simulation. Results suggest that the proposed user-centric approach can deliver remarkable performance in VENs. Moreover, the proposed MARL solution delivers near-optimal performance with approximately 3% collision probabilities in case of distributed random access in the uplink.

Fairness-Oriented User Scheduling for Bursty Downlink Transmission Using Multi-Agent Reinforcement Learning

Authors:Mingqi Yuan, Qi Cao, Man-on Pun, Yi Chen
Date:2020-12-30 08:41:51

In this work, we develop practical user scheduling algorithms for downlink bursty traffic with emphasis on user fairness. In contrast to conventional scheduling algorithms that either divide the transmission time slots equally among users or maximize ratios without physical meaning, we propose to use the 5%-tile user data rate (5TUDR) as the metric to evaluate user fairness. Since it is difficult to directly optimize 5TUDR, we first cast the problem into the stochastic game framework and subsequently propose a Multi-Agent Reinforcement Learning (MARL)-based algorithm to perform distributed optimization on the resource block group (RBG) allocation. Furthermore, each MARL agent is designed to take information measured by network counters from multiple network layers (e.g. Channel Quality Indicator, Buffer size) as the input states and the RBG allocation as the action, with a proposed reward function designed to maximize 5TUDR. Extensive simulations are performed to show that the proposed MARL-based scheduler can achieve fair scheduling while maintaining good average network throughput as compared to conventional schedulers.
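
The 5TUDR metric itself is just a low quantile of the per-user rate distribution, as the minimal sketch below shows (with synthetic placeholder rates):

```python
import numpy as np

# Synthetic per-user rates standing in for measured downlink throughputs.
user_rates_mbps = np.random.lognormal(mean=1.0, sigma=0.8, size=1000)

# 5TUDR: the rate below which the worst-off 5% of users fall.
tudr_5 = np.percentile(user_rates_mbps, 5)
print(f"5%-tile user data rate: {tudr_5:.2f} Mbps")

# A fairness-oriented reward can then track improvements in this quantile,
# e.g. reward = new_5tudr - old_5tudr after each RBG allocation step.
```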

Cooperative Policy Learning with Pre-trained Heterogeneous Observation Representations

Authors:Wenlei Shi, Xinran Wei, Jia Zhang, Xiaoyuan Ni, Arthur Jiang, Jiang Bian, Tie-Yan Liu
Date:2020-12-24 04:52:29

Multi-agent reinforcement learning (MARL) has been increasingly explored to learn the cooperative policy towards maximizing a certain global reward. Many existing studies take advantage of graph neural networks (GNN) in MARL to propagate critical collaborative information over the interaction graph, built upon inter-connected agents. Nevertheless, the vanilla GNN approach yields substantial defects in dealing with complex real-world scenarios since the generic message passing mechanism is ineffective between heterogeneous vertices and, moreover, simple message aggregation functions are incapable of accurately modeling the combinational interactions from multiple neighbors. While adopting complex GNN models with more informative message passing and aggregation mechanisms can obviously benefit heterogeneous vertex representations and cooperative policy learning, it could, on the other hand, increase the training difficulty of MARL and demand more intense and direct reward signals compared to the original global reward. To address these challenges, we propose a new cooperative learning framework with pre-trained heterogeneous observation representations. Particularly, we employ an encoder-decoder based graph attention to learn the intricate interactions and heterogeneous representations that can be more easily leveraged by MARL. Moreover, we design a pre-training stage with a local actor-critic algorithm to ease the difficulty in cooperative policy learning. Extensive experiments over real-world scenarios demonstrate that our new approach can significantly outperform existing MARL baselines as well as operational research solutions that are widely used in industry.

QVMix and QVMix-Max: Extending the Deep Quality-Value Family of Algorithms to Cooperative Multi-Agent Reinforcement Learning

Authors:Pascal Leroy, Damien Ernst, Pierre Geurts, Gilles Louppe, Jonathan Pisane, Matthia Sabatelli
Date:2020-12-22 14:53:42

This paper introduces four new algorithms that can be used for tackling multi-agent reinforcement learning (MARL) problems occurring in cooperative settings. All algorithms are based on the Deep Quality-Value (DQV) family of algorithms, a set of techniques that have proven to be successful when dealing with single-agent reinforcement learning problems (SARL). The key idea of DQV algorithms is to jointly learn an approximation of the state-value function $V$, alongside an approximation of the state-action value function $Q$. We follow this principle and generalise these algorithms by introducing two fully decentralised MARL algorithms (IQV and IQV-Max) and two algorithms that are based on the centralised training with decentralised execution training paradigm (QVMix and QVMix-Max). We compare our algorithms with state-of-the-art MARL techniques on the popular StarCraft Multi-Agent Challenge (SMAC) environment. We show competitive results when QVMix and QVMix-Max are compared to well-known MARL techniques such as QMIX and MAVEN and show that QVMix can even outperform them on some of the tested environments, being the algorithm which performs best overall. We hypothesise that this is due to the fact that QVMix suffers less from the overestimation bias of the $Q$ function.
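
A minimal sketch of the underlying DQV idea as described above: the V and Q networks are learned jointly, with both regressed toward a target built from the state-value network. Shapes and hyperparameters are illustrative, and this omits the QMIX-style mixing used by QVMix.

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 8, 4
V = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
Q = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

def dqv_losses(batch, gamma=0.99):
    s, a, r, s2, done = batch                                  # replay-buffer tensors
    with torch.no_grad():
        target = r + gamma * (1 - done) * V(s2).squeeze(-1)    # shared V-based target
    v_loss = nn.functional.mse_loss(V(s).squeeze(-1), target)
    q_sa = Q(s).gather(1, a.unsqueeze(1)).squeeze(1)
    q_loss = nn.functional.mse_loss(q_sa, target)
    return v_loss, q_loss

# Toy batch to show the shapes:
batch = (torch.randn(32, obs_dim), torch.randint(0, n_actions, (32,)),
         torch.randn(32), torch.randn(32, obs_dim), torch.zeros(32))
v_loss, q_loss = dqv_losses(batch)
```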

Multi-Agent Reinforcement Learning for Dynamic Ocean Monitoring by a Swarm of Buoys

Authors:Maryam Kouzehgar, Malika Meghjani, Roland Bouffanais
Date:2020-12-21 19:11:23

The autonomous marine environmental monitoring problem traditionally encompasses an area coverage problem which can only be effectively carried out by a multi-robot system. In this paper, we focus on robotic swarms that are typically operated and controlled by means of simple swarming behaviors obtained from a subtle, yet ad hoc combination of bio-inspired strategies. We propose a novel and structured approach for area coverage using multi-agent reinforcement learning (MARL) which effectively deals with the non-stationarity of environmental features. Specifically, we propose two dynamic area coverage approaches: (1) swarm-based MARL, and (2) coverage-range-based MARL. The former is trained using the multi-agent deep deterministic policy gradient (MADDPG) approach whereas, a modified version of MADDPG is introduced for the latter with a reward function that intrinsically leads to a collective behavior. Both methods are tested and validated with different geometric shaped regions with equal surface area (square vs. rectangle) yielding acceptable area coverage, and benefiting from the structured learning in non-stationary environments. Both approaches are advantageous compared to a naïve swarming method. However, coverage-range-based MARL outperforms the swarm-based MARL with stronger convergence features in learning criteria and higher spreading of agents for area coverage.

MAGNet: Multi-agent Graph Network for Deep Multi-agent Reinforcement Learning

Authors:Aleksandra Malysheva, Daniel Kudenko, Aleksei Shpilman
Date:2020-12-17 17:19:36

Over recent years, deep reinforcement learning has shown strong successes in complex single-agent tasks, and more recently this approach has also been applied to multi-agent domains. In this paper, we propose a novel approach, called MAGNet, to multi-agent reinforcement learning that utilizes a relevance graph representation of the environment obtained by a self-attention mechanism, and a message-generation technique. We applied our MAGNet approach to the synthetic predator-prey multi-agent environment and the Pommerman game, and the results show that it significantly outperforms state-of-the-art MARL solutions, including Multi-agent Deep Q-Networks (MADQN), Multi-agent Deep Deterministic Policy Gradient (MADDPG), and QMIX.

Learning Fair Policies in Decentralized Cooperative Multi-Agent Reinforcement Learning

Authors:Matthieu Zimmer, Claire Glanois, Umer Siddique, Paul Weng
Date:2020-12-17 07:17:36

We consider the problem of learning fair policies in (deep) cooperative multi-agent reinforcement learning (MARL). We formalize it in a principled way as the problem of optimizing a welfare function that explicitly encodes two important aspects of fairness: efficiency and equity. As a solution method, we propose a novel neural network architecture, which is composed of two sub-networks specifically designed for taking into account the two aspects of fairness. In experiments, we demonstrate the importance of the two sub-networks for fair optimization. Our overall approach is general as it can accommodate any (sub)differentiable welfare function. Therefore, it is compatible with various notions of fairness that have been proposed in the literature (e.g., lexicographic maximin, generalized Gini social welfare function, proportional fairness). Our solution method is generic and can be implemented in various MARL settings: centralized training and decentralized execution, or fully decentralized. Finally, we experimentally validate our approach in various domains and show that it can perform much better than previous methods.
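
For concreteness, one welfare function named above, the generalized Gini social welfare function, is a weighted sum of utilities sorted in increasing order with decreasing weights, so the worst-off agents weigh the most; the weight schedule below is an illustrative choice.

```python
import numpy as np

def ggf(utilities, decay=0.5):
    """Generalized Gini social welfare: decreasing weights on sorted utilities."""
    w = decay ** np.arange(len(utilities))    # worst-off agent gets the largest weight
    return float(np.sort(utilities) @ w)

print(ggf(np.array([1.0, 1.0, 1.0])))   # 1.75: equal utilities
print(ggf(np.array([3.0, 0.0, 0.0])))   # 0.75: same total, less equitable -> lower
```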

Robust Multi-Agent Reinforcement Learning with Social Empowerment for Coordination and Communication

Authors:T. van der Heiden, C. Salge, E. Gavves, H. van Hoof
Date:2020-12-15 12:43:17

We consider the problem of robust multi-agent reinforcement learning (MARL) for cooperative communication and coordination tasks. MARL agents, mainly those trained in a centralized way, can be brittle because they can adopt policies that act under the expectation that other agents will act a certain way rather than react to their actions. Our objective is to bias the learning process towards finding strategies that remain reactive towards others' behavior. Social empowerment measures the potential influence between agents' actions. We propose it as an additional reward term, so agents better adapt to other agents' actions. We show that the proposed method results in obtaining higher rewards faster and a higher success rate in three cooperative communication and coordination tasks.

Multi-Objective Optimization of the Textile Manufacturing Process Using Deep-Q-Network Based Multi-Agent Reinforcement Learning

Authors:Zhenglei He, Kim Phuc Tran, Sebastien Thomassey, Xianyi Zeng, Jie Xu, Changhai Yi
Date:2020-12-02 11:37:44

Multi-objective optimization of the textile manufacturing process is an increasing challenge because of the growing complexity involved in the development of the textile industry. The use of intelligent techniques has often been discussed in this domain, and although significant improvements have been reported for certain successful applications, traditional methods struggle with the high complexity of the process as well as its reliance on human intervention. This paper therefore proposes a multi-agent reinforcement learning (MARL) framework that transforms the optimization process into a stochastic game and introduces the deep Q-networks algorithm to train the multiple agents. A utilitarian selection mechanism (an ε-greedy policy in each state) was employed in the stochastic game to avoid the interruption of multiple equilibria and to achieve the correlated equilibrium optimal solutions of the optimization process. The case study results show that the proposed MARL system can achieve optimal solutions for the textile ozonation process and that it performs better than traditional approaches.

On Gibbs states of mechanical systems with symmetries

Authors:Charles-Michel Marle
Date:2020-12-01 15:39:20

Gibbs states for the Hamiltonian action of a Lie group on a symplectic manifold were studied, and their possible applications in Physics and Cosmology were considered, by the French mathematician and physicist Jean-Marie Souriau. They are presented here with detailed proofs of all the stated results. Using an adaptation of the cross product for pseudo-Euclidean three-dimensional vector spaces, we present several examples of such Gibbs states, together with the associated thermodynamic functions, for various two-dimensional symplectic manifolds, including the pseudo-spheres, the Poincaré disk and the Poincaré half-plane.

A Q-values Sharing Framework for Multiagent Reinforcement Learning under Budget Constraint

Authors:Changxi Zhu, Ho-fung Leung, Shuyue Hu, Yi Cai
Date:2020-11-29 04:51:57

In the teacher-student framework, a more experienced agent (teacher) helps accelerate the learning of another agent (student) by suggesting actions to take in certain states. In cooperative multiagent reinforcement learning (MARL), where agents need to cooperate with one another, a student may fail to cooperate well with others even by following the teachers' suggested actions, as the policies of all agents keep changing before convergence. When the number of times that agents communicate with one another is limited (i.e., there is a budget constraint), an advising strategy that uses actions as advice may not be good enough. We propose a partaker-sharer advising framework (PSAF) for cooperative MARL agents learning with a budget constraint. In PSAF, each Q-learner can decide when to ask for Q-values and when to share its Q-values. We perform experiments in three typical multiagent learning problems. Evaluation results show that our approach PSAF outperforms existing advising methods under both unlimited and limited budgets, and we give an analysis of the impact of advising actions and sharing Q-values on agents' learning.
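
A minimal sketch of the budgeted ask/share logic in the spirit of PSAF; the uncertainty and confidence heuristics here are illustrative stand-ins, not the paper's exact criteria.

```python
import numpy as np

class BudgetedAdvising:
    def __init__(self, ask_budget=100, share_budget=100, eps=0.05):
        self.ask_budget, self.share_budget, self.eps = ask_budget, share_budget, eps

    def should_ask(self, q_row):
        # Ask for Q-values when the best and second-best actions are nearly tied.
        top2 = np.sort(q_row)[-2:]
        if (top2[1] - top2[0]) < self.eps and self.ask_budget > 0:
            self.ask_budget -= 1
            return True
        return False

    def should_share(self, visits):
        # Share Q-values only for well-visited states, while budget remains.
        if visits > 50 and self.share_budget > 0:
            self.share_budget -= 1
            return True
        return False

adviser = BudgetedAdvising()
print(adviser.should_ask(np.array([0.10, 0.12, 0.50])))  # clear best -> False
print(adviser.should_ask(np.array([0.49, 0.50, 0.10])))  # near tie  -> True
```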

TLeague: A Framework for Competitive Self-Play based Distributed Multi-Agent Reinforcement Learning

Authors:Peng Sun, Jiechao Xiong, Lei Han, Xinghai Sun, Shuxing Li, Jiawei Xu, Meng Fang, Zhengyou Zhang
Date:2020-11-25 17:24:20

Competitive Self-Play (CSP) based Multi-Agent Reinforcement Learning (MARL) has shown phenomenal breakthroughs recently. Strong AIs have been achieved for several benchmarks, including Dota 2, Glory of Kings, Quake III, and StarCraft II, to name a few. Despite the success, MARL training is extremely data thirsty, typically requiring billions (if not trillions) of frames from the environment during training in order to learn a high-performance agent. This poses non-trivial difficulties for researchers or engineers and prevents the application of MARL to a broader range of real-world problems. To address this issue, in this manuscript we describe a framework, referred to as TLeague, that aims at large-scale training and implements several mainstream CSP-MARL algorithms. The training can be deployed on either a single machine or a cluster of hybrid machines (CPUs and GPUs), where standard Kubernetes is supported in a cloud-native manner. TLeague achieves a high throughput and a reasonable scale-up when performing distributed training. Thanks to the modular design, it is also easy to extend for solving other multi-agent problems or implementing and verifying MARL algorithms. We present experiments over StarCraft II, ViZDoom and Pommerman to show the efficiency and effectiveness of TLeague. The code is open-sourced and available at https://github.com/tencent-ailab/tleague_projpage

PowerNet: Multi-agent Deep Reinforcement Learning for Scalable Powergrid Control

Authors:Dong Chen, Kaian Chen, Zhaojian Li, Tianshu Chu, Rui Yao, Feng Qiu, Kaixiang Lin
Date:2020-11-24 20:22:36

This paper develops an efficient multi-agent deep reinforcement learning algorithm for cooperative controls in powergrids. Specifically, we consider the decentralized inverter-based secondary voltage control problem in distributed generators (DGs), which is first formulated as a cooperative multi-agent reinforcement learning (MARL) problem. We then propose a novel on-policy MARL algorithm, PowerNet, in which each agent (DG) learns a control policy based on (sub-)global reward but local states from its neighboring agents. Motivated by the fact that a local control from one agent has limited impact on agents distant from it, we exploit a novel spatial discount factor to reduce the effect from remote agents, to expedite the training process and improve scalability. Furthermore, a differentiable, learning-based communication protocol is employed to foster the collaborations among neighboring agents. In addition, to mitigate the effects of system uncertainty and random noise introduced during on-policy learning, we utilize an action smoothing factor to stabilize the policy execution. To facilitate training and evaluation, we develop PGSim, an efficient, high-fidelity powergrid simulation platform. Experimental results in two microgrid setups show that the developed PowerNet outperforms a conventional model-based control as well as several state-of-the-art MARL algorithms. The decentralized learning scheme and high sample efficiency also make it viable for large-scale power grids.
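
A minimal sketch of the spatial discount idea: each agent's learning signal weights other agents' rewards by a factor that decays with network distance, so remote agents contribute less. The hop-distance matrix and decay value below are illustrative placeholders.

```python
import numpy as np

def spatially_discounted_rewards(rewards, hop_distance, alpha=0.7):
    """rewards: (n_agents,); hop_distance: (n_agents, n_agents) integer hops."""
    weights = alpha ** hop_distance      # weight 1 for self (distance 0), decaying outward
    return weights @ rewards             # one spatially discounted reward per agent

r = np.array([1.0, 0.0, -1.0, 0.5])      # per-agent local rewards
D = np.array([[0, 1, 2, 3],              # hop distances on a line network
              [1, 0, 1, 2],
              [2, 1, 0, 1],
              [3, 2, 1, 0]])
print(spatially_discounted_rewards(r, D))
```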

Multi-Agent Reinforcement Learning for Markov Routing Games: A New Modeling Paradigm For Dynamic Traffic Assignment

Authors:Zhenyu Shou, Xu Chen, Yongjie Fu, Xuan Di
Date:2020-11-22 02:31:14

This paper aims to develop a paradigm that models the learning behavior of intelligent agents (including but not limited to autonomous vehicles, connected and automated vehicles, or human-driven vehicles with intelligent navigation systems where human drivers follow the navigation instructions completely) with a utility-optimizing goal and the system's equilibrating processes in a routing game among atomic selfish agents. Such a paradigm can assist policymakers in devising optimal operational and planning countermeasures under both normal and abnormal circumstances. To this end, we develop a Markov routing game (MRG) in which each agent learns and updates her own en-route path choice policy while interacting with others in transportation networks. To efficiently solve the MRG, we formulate it as multi-agent reinforcement learning (MARL) and devise a mean field multi-agent deep Q learning (MF-MA-DQL) approach that captures the competition among agents. The linkage between the classical dynamic user equilibrium (DUE) paradigm and our proposed MRG is discussed. We show that the routing behavior of intelligent agents converges to the classical notion of predictive DUE when traffic environments are simulated using dynamic network loading (DNL) models. In other words, the MRG depicts DUEs assuming perfect information and deterministic environments propagated by DNL models. Four examples are solved to illustrate the algorithm efficiency and the consistency between DUE and the MRG equilibrium: a simple network without and with spillback, the Ortuzar-Willumsen (OW) network, and a real-world network near Columbia University's campus in Manhattan, New York City.

Scalable Reinforcement Learning Policies for Multi-Agent Control

Authors:Christopher D. Hsu, Heejin Jeong, George J. Pappas, Pratik Chaudhari
Date:2020-11-16 16:11:12

We develop a Multi-Agent Reinforcement Learning (MARL) method to learn scalable control policies for target tracking. Our method can handle an arbitrary number of pursuers and targets; we show results for tasks consisting of up to 1000 pursuers tracking 1000 targets. We use a decentralized, partially-observable Markov Decision Process framework to model pursuers as agents receiving partial observations (range and bearing) about targets which move using fixed, unknown policies. An attention mechanism is used to parameterize the value function of the agents; this mechanism allows us to handle an arbitrary number of targets. Entropy-regularized off-policy RL methods are used to train a stochastic policy, and we discuss how it enables a hedging behavior between pursuers that leads to a weak form of cooperation in spite of completely decentralized control execution. We further develop a masking heuristic that allows training on smaller problems with few pursuers and targets and execution on much larger problems. Thorough simulation experiments, ablation studies, and comparisons to state-of-the-art algorithms are performed to study the scalability of the approach and the robustness of performance to varying numbers of agents and targets.
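
A minimal sketch of how an attention mechanism lets a value function consume an arbitrary number of target observations: embed each (range, bearing) pair, attend with a query derived from the agent's own state, and pool. The sizes are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AttentionValue(nn.Module):
    def __init__(self, self_dim=4, tgt_dim=2, hid=32):
        super().__init__()
        self.embed = nn.Linear(tgt_dim, hid)
        self.query = nn.Linear(self_dim, hid)
        self.head = nn.Sequential(nn.Linear(hid + self_dim, hid),
                                  nn.ReLU(), nn.Linear(hid, 1))

    def forward(self, own_state, targets):      # targets: (n_targets, tgt_dim)
        keys = self.embed(targets)               # (n, hid)
        q = self.query(own_state)                # (hid,)
        attn = torch.softmax(keys @ q, dim=0)    # (n,) -- valid for any n
        pooled = attn @ keys                     # permutation-invariant summary
        return self.head(torch.cat([pooled, own_state]))

vf = AttentionValue()
print(vf(torch.randn(4), torch.randn(7, 2)).shape)     # 7 targets
print(vf(torch.randn(4), torch.randn(1000, 2)).shape)  # 1000 targets, same network
```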

Emergent Reciprocity and Team Formation from Randomized Uncertain Social Preferences

Authors:Bowen Baker
Date:2020-11-10 20:06:19

Multi-agent reinforcement learning (MARL) has shown recent success in increasingly complex fixed-team zero-sum environments. However, the real world is not zero-sum nor does it have fixed teams; humans face numerous social dilemmas and must learn when to cooperate and when to compete. To successfully deploy agents into the human world, it may be important that they be able to understand and help in our conflicts. Unfortunately, selfish MARL agents typically fail when faced with social dilemmas. In this work, we show evidence of emergent direct reciprocity, indirect reciprocity and reputation, and team formation when training agents with randomized uncertain social preferences (RUSP), a novel environment augmentation that expands the distribution of environments agents play in. RUSP is generic and scalable; it can be applied to any multi-agent environment without changing the original underlying game dynamics or objectives. In particular, we show that with RUSP these behaviors can emerge and lead to higher social welfare equilibria in both classic abstract social dilemmas like Iterated Prisoner's Dilemma as well as in more complex intertemporal environments.

Single and Multi-Agent Deep Reinforcement Learning for AI-Enabled Wireless Networks: A Tutorial

Authors:Amal Feriani, Ekram Hossain
Date:2020-11-06 22:12:40

Deep Reinforcement Learning (DRL) has recently witnessed significant advances that have led to multiple successes in solving sequential decision-making problems in various domains, particularly in wireless communications. The future sixth-generation (6G) networks are expected to provide scalable, low-latency, ultra-reliable services empowered by the application of data-driven Artificial Intelligence (AI). The key enabling technologies of future 6G networks, such as intelligent meta-surfaces, aerial networks, and AI at the edge, involve more than one agent which motivates the importance of multi-agent learning techniques. Furthermore, cooperation is central to establishing self-organizing, self-sustaining, and decentralized networks. In this context, this tutorial focuses on the role of DRL with an emphasis on deep Multi-Agent Reinforcement Learning (MARL) for AI-enabled 6G networks. The first part of this paper will present a clear overview of the mathematical frameworks for single-agent RL and MARL. The main idea of this work is to motivate the application of RL beyond the model-free perspective which was extensively adopted in recent years. Thus, we provide a selective description of RL algorithms such as Model-Based RL (MBRL) and cooperative MARL and we highlight their potential applications in 6G wireless networks. Finally, we overview the state-of-the-art of MARL in fields such as Mobile Edge Computing (MEC), Unmanned Aerial Vehicles (UAV) networks, and cell-free massive MIMO, and identify promising future research directions. We expect this tutorial to stimulate more research endeavors to build scalable and decentralized systems based on MARL.

Multi-Agent Reinforcement Learning for Visibility-based Persistent Monitoring

Authors:Jingxi Chen, Amrish Baskaran, Zhongshun Zhang, Pratap Tokekar
Date:2020-11-02 17:05:40

The Visibility-based Persistent Monitoring (VPM) problem seeks to find a set of trajectories (or controllers) for robots to persistently monitor a changing environment. Each robot has a sensor, such as a camera, with a limited field-of-view that is obstructed by obstacles in the environment. The robots may need to coordinate with each other to ensure no point in the environment is left unmonitored for long periods of time. We model the problem such that there is a penalty that accrues every time step if a point is left unmonitored. However, the dynamics of the penalty are unknown to us. We present a Multi-Agent Reinforcement Learning (MARL) algorithm for the VPM problem. Specifically, we present a Multi-Agent Graph Attention Proximal Policy Optimization (MA-G-PPO) algorithm that takes as input the local observations of all agents combined with a low resolution global map to learn a policy for each agent. The graph attention allows agents to share their information with others leading to an effective joint policy. Our main focus is to understand how effective MARL is for the VPM problem. We investigate five research questions with this broader goal. We find that MA-G-PPO is able to learn a better policy than the non-RL baseline in most cases, the effectiveness depends on agents sharing information with each other, and the policy learnt shows emergent behavior for the agents.

Interpreting Graph Drawing with Multi-Agent Reinforcement Learning

Authors:Ilkin Safarli, Youjia Zhou, Bei Wang
Date:2020-11-02 05:00:14

Applying machine learning techniques to graph drawing has become an emergent area of research in visualization. In this paper, we interpret graph drawing as a multi-agent reinforcement learning (MARL) problem. We first demonstrate that a large number of classic graph drawing algorithms, including force-directed layouts and stress majorization, can be interpreted within the framework of MARL. Using this interpretation, a node in the graph is assigned to an agent with a reward function. Via multi-agent reward maximization, we obtain an aesthetically pleasing graph layout that is comparable to the outputs of classic algorithms. The main strength of a MARL framework for graph drawing is that it not only unifies a number of classic drawing algorithms in a general formulation but also supports the creation of novel graph drawing algorithms by introducing a diverse set of reward functions.
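
A toy rendering of this interpretation: each node is an agent whose reward is the negative stress on its incident edges, and round-robin reward ascent yields a force-directed-style layout. The update rule below is an illustrative gradient step, not the paper's exact formulation.

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]     # a small example graph
pos = np.random.randn(4, 2)                  # each row is one agent's position
ideal = 1.0                                  # desired edge length

def node_reward(i):
    """Negative stress over the edges incident to node-agent i."""
    r = 0.0
    for u, v in edges:
        if i in (u, v):
            d = np.linalg.norm(pos[u] - pos[v])
            r -= (d - ideal) ** 2
    return r

for _ in range(500):                         # agents act in round-robin
    for i in range(len(pos)):
        for u, v in edges:
            if i in (u, v):
                j = v if i == u else u
                diff = pos[i] - pos[j]
                d = np.linalg.norm(diff) + 1e-9
                pos[i] -= 0.1 * (d - ideal) * diff / d   # ascend node reward
print([round(node_reward(i), 3) for i in range(4)])       # near 0 at convergence
```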

An Overview of Multi-Agent Reinforcement Learning from Game Theoretical Perspective

Authors:Yaodong Yang, Jun Wang
Date:2020-11-01 18:24:40

Following the remarkable success of the AlphaGo series, 2019 was a booming year that witnessed significant advances in multi-agent reinforcement learning (MARL) techniques. MARL corresponds to the learning problem in a multi-agent system in which multiple agents learn simultaneously. It is an interdisciplinary domain with a long history that includes game theory, machine learning, stochastic control, psychology, and optimisation. Although MARL has achieved considerable empirical success in solving real-world games, there is a lack of a self-contained overview in the literature that elaborates the game theoretical foundations of modern MARL methods and summarises the recent advances. In fact, the majority of existing surveys are outdated and do not fully cover the recent developments since 2010. In this work, we provide a monograph on MARL that covers both the fundamentals and the latest developments in the research frontier. The goal of our monograph is to provide a self-contained assessment of the current state-of-the-art MARL techniques from a game theoretical perspective. We expect this work to serve as a stepping stone for both new researchers who are about to enter this fast-growing domain and existing domain experts who want to obtain a panoramic view and identify new directions based on recent advances.

Retrieve, Program, Repeat: Complex Knowledge Base Question Answering via Alternate Meta-learning

Authors:Yuncheng Hua, Yuan-Fang Li, Gholamreza Haffari, Guilin Qi, Wei Wu
Date:2020-10-29 18:28:16

A compelling approach to complex question answering is to convert the question to a sequence of actions, which can then be executed on the knowledge base to yield the answer, aka the programmer-interpreter approach. Using training questions similar to the test question, meta-learning enables the programmer to quickly adapt to unseen questions and tackle potential distributional biases. However, this comes at the cost of manually labeling similar questions to learn a retrieval model, which is tedious and expensive. In this paper, we present a novel method that automatically learns a retrieval model alternately with the programmer from weak supervision, i.e., the system's performance with respect to the produced answers. To the best of our knowledge, this is the first attempt to train the retrieval model with the programmer jointly. Our system leads to state-of-the-art performance on a large-scale task for complex question answering over knowledge bases. We have released our code at https://github.com/DevinJake/MARL.

Succinct and Robust Multi-Agent Communication With Temporal Message Control

Authors:Sai Qian Zhang, Jieyu Lin, Qi Zhang
Date:2020-10-27 15:55:08

Recent studies have shown that introducing communication between agents can significantly improve overall performance in cooperative Multi-agent reinforcement learning (MARL). However, existing communication schemes often require agents to exchange an excessive number of messages at run-time under a reliable communication channel, which hinders their practicality in many real-world situations. In this paper, we present \textit{Temporal Message Control} (TMC), a simple yet effective approach for achieving succinct and robust communication in MARL. TMC applies a temporal smoothing technique to drastically reduce the amount of information exchanged between agents. Experiments show that TMC can significantly reduce inter-agent communication overhead without impacting accuracy. Furthermore, TMC demonstrates much better robustness against transmission loss than existing approaches in lossy networking environments.
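
A minimal sketch of the temporal-smoothing idea as described: an agent transmits only when its new message differs enough from the last transmitted one, and the receiver otherwise reuses its cached copy; the threshold is an illustrative choice.

```python
import numpy as np

class SmoothedChannel:
    def __init__(self, threshold=0.1):
        self.threshold, self.last_sent, self.tx_count = threshold, None, 0

    def maybe_send(self, msg):
        """Return the message the receiver sees; count only real transmissions."""
        if self.last_sent is None or np.linalg.norm(msg - self.last_sent) > self.threshold:
            self.last_sent = msg.copy()
            self.tx_count += 1
        return self.last_sent                # receiver's cached copy otherwise

chan = SmoothedChannel()
for t in range(100):
    msg = np.array([np.sin(t / 30.0)])       # slowly varying message
    _ = chan.maybe_send(msg)
print(f"transmitted {chan.tx_count}/100 messages")
```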

Multi-UAV Path Planning for Wireless Data Harvesting with Deep Reinforcement Learning

Authors:Harald Bayerlein, Mirco Theile, Marco Caccamo, David Gesbert
Date:2020-10-23 14:59:30

Harvesting data from distributed Internet of Things (IoT) devices with multiple autonomous unmanned aerial vehicles (UAVs) is a challenging problem requiring flexible path planning methods. We propose a multi-agent reinforcement learning (MARL) approach that, in contrast to previous work, can adapt to profound changes in the scenario parameters defining the data harvesting mission, such as the number of deployed UAVs, number, position and data amount of IoT devices, or the maximum flying time, without the need to perform expensive recomputations or relearn control policies. We formulate the path planning problem for a cooperative, non-communicating, and homogeneous team of UAVs tasked with maximizing collected data from distributed IoT sensor nodes subject to flying time and collision avoidance constraints. The path planning problem is translated into a decentralized partially observable Markov decision process (Dec-POMDP), which we solve through a deep reinforcement learning (DRL) approach, approximating the optimal UAV control policy without prior knowledge of the challenging wireless channel characteristics in dense urban environments. By exploiting a combination of centered global and local map representations of the environment that are fed into convolutional layers of the agents, we show that our proposed network architecture enables the agents to cooperate effectively by carefully dividing the data collection task among themselves, adapt to large complex environments and state spaces, and make movement decisions that balance data collection goals, flight-time efficiency, and navigation constraints. Finally, learning a control policy that generalizes over the scenario parameter space enables us to analyze the influence of individual parameters on collection performance and provide some intuition about system-level benefits.

Integrating LEO Satellites and Multi-UAV Reinforcement Learning for Hybrid FSO/RF Non-Terrestrial Networks

Authors:Ju-Hyung Lee, Jihong Park, Mehdi Bennis, Young-Chai Ko
Date:2020-10-20 09:07:10

A mega-constellation of low-altitude earth orbit (LEO) satellites (SATs) and burgeoning unmanned aerial vehicles (UAVs) are promising enablers for high-speed and long-distance communications in beyond fifth-generation (5G) systems. Integrating SATs and UAVs within a non-terrestrial network (NTN), in this article we investigate the problem of forwarding packets between two faraway ground terminals through SAT and UAV relays using either millimeter-wave (mmWave) radio-frequency (RF) or free-space optical (FSO) link. Towards maximizing the communication efficiency, the real-time associations with orbiting SATs and the moving trajectories of UAVs should be optimized with suitable FSO/RF links, which is challenging due to the time-varying network topology and a huge number of possible control actions. To overcome the difficulty, we lift this problem to multi-agent deep reinforcement learning (MARL) with a novel action dimensionality reduction technique. Simulation results corroborate that our proposed SAT-UAV integrated scheme achieves 1.99x higher end-to-end sum throughput compared to a benchmark scheme with fixed ground relays. While improving the throughput, our proposed scheme also aims to reduce the UAV control energy, yielding 2.25x higher energy efficiency than a baseline method only maximizing the throughput. Lastly, thanks to utilizing hybrid FSO/RF links, the proposed scheme achieves up to 62.56x higher peak throughput and 21.09x higher worst-case throughput than the cases utilizing either RF or FSO links, highlighting the importance of co-designing SAT-UAV associations, UAV trajectories, and hybrid FSO/RF links in beyond-5G NTNs.

Multi-Agent Collaboration via Reward Attribution Decomposition

Authors:Tianjun Zhang, Huazhe Xu, Xiaolong Wang, Yi Wu, Kurt Keutzer, Joseph E. Gonzalez, Yuandong Tian
Date:2020-10-16 17:42:11

Recent advances in multi-agent reinforcement learning (MARL) have achieved super-human performance in games like Quake 3 and Dota 2. Unfortunately, these techniques require orders-of-magnitude more training rounds than humans and don't generalize to new agent configurations even on the same game. In this work, we propose Collaborative Q-learning (CollaQ) that achieves state-of-the-art performance in the StarCraft multi-agent challenge and supports ad hoc team play. We first formulate multi-agent collaboration as a joint optimization on reward assignment and show that each agent has an approximately optimal policy that decomposes into two parts: one part that only relies on the agent's own state, and the other part that is related to states of nearby agents. Following this novel finding, CollaQ decomposes the Q-function of each agent into a self term and an interactive term, with a Multi-Agent Reward Attribution (MARA) loss that regularizes the training. CollaQ is evaluated on various StarCraft maps and shows that it outperforms existing state-of-the-art techniques (i.e., QMIX, QTRAN, and VDN) by improving the win rate by 40% with the same number of samples. In the more challenging ad hoc team play setting (i.e., reweight/add/remove units without re-training or finetuning), CollaQ outperforms previous SoTA by over 30%.
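
A minimal sketch of the described decomposition: each agent's Q-value is a self term over its own observation plus an interactive term over neighboring observations, with a MARA-style penalty pushing the interactive term toward zero when neighbor information is absent. Network shapes are illustrative.

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 6, 5
q_self = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
q_inter = nn.Sequential(nn.Linear(2 * obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

def collaq_q(own_obs, neighbor_obs):
    """Q = self term (own obs only) + interactive term (own + neighbor obs)."""
    return q_self(own_obs) + q_inter(torch.cat([own_obs, neighbor_obs], dim=-1))

def mara_penalty(own_obs):
    # The interactive term should vanish when no neighbor information is present.
    empty = torch.zeros_like(own_obs)
    return q_inter(torch.cat([own_obs, empty], dim=-1)).pow(2).mean()

own, nbr = torch.randn(32, obs_dim), torch.randn(32, obs_dim)
q = collaq_q(own, nbr)          # (32, n_actions)
reg = mara_penalty(own)         # added to the TD loss during training
```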

Cooperative-Competitive Reinforcement Learning with History-Dependent Rewards

Authors:Keyang He, Bikramjit Banerjee, Prashant Doshi
Date:2020-10-15 21:37:07

Consider a typical organization whose worker agents seek to collectively cooperate for its general betterment. However, each individual agent simultaneously seeks to act to secure a larger chunk than its co-workers of the annual increment in compensation, which usually comes from a fixed pot. As such, the individual agent in the organization must cooperate and compete. Another feature of many organizations is that a worker receives a bonus, which is often a fraction of previous year's total profit. As such, the agent derives a reward that is also partly dependent on historical performance. How should the individual agent decide to act in this context? Few methods for the mixed cooperative-competitive setting have been presented in recent years, but these are challenged by problem domains whose reward functions do not depend on the current state and action only. Recent deep multi-agent reinforcement learning (MARL) methods using long short-term memory (LSTM) may be used, but these adopt a joint perspective to the interaction or require explicit exchange of information among the agents to promote cooperation, which may not be possible under competition. In this paper, we first show that the agent's decision-making problem can be modeled as an interactive partially observable Markov decision process (I-POMDP) that captures the dynamic of a history-dependent reward. We present an interactive advantage actor-critic method (IA2C$^+$), which combines the independent advantage actor-critic network with a belief filter that maintains a belief distribution over other agents' models. Empirical results show that IA2C$^+$ learns the optimal policy faster and more robustly than several other baselines including one that uses a LSTM, even when attributed models are incorrect.

Multi-Agent Trust Region Policy Optimization

Authors:Hepeng Li, Haibo He
Date:2020-10-15 17:49:47

We extend trust region policy optimization (TRPO) to multi-agent reinforcement learning (MARL) problems. We show that the policy update of TRPO can be transformed into a distributed consensus optimization problem for multi-agent cases. By making a series of approximations to the consensus optimization model, we propose a decentralized MARL algorithm, which we call multi-agent TRPO (MATRPO). This algorithm can optimize distributed policies based on local observations and private rewards. The agents do not need to know observations, rewards, policies or value/action-value functions of other agents. The agents only share a likelihood ratio with their neighbors during the training process. The algorithm is fully decentralized and privacy-preserving. Our experiments on two cooperative games demonstrate its robust performance on complicated MARL tasks.

Graph Convolutional Value Decomposition in Multi-Agent Reinforcement Learning

Authors:Navid Naderializadeh, Fan H. Hung, Sean Soleyman, Deepak Khosla
Date:2020-10-09 18:01:01

We propose a novel framework for value function factorization in multi-agent deep reinforcement learning (MARL) using graph neural networks (GNNs). In particular, we consider the team of agents as the set of nodes of a complete directed graph, whose edge weights are governed by an attention mechanism. Building upon this underlying graph, we introduce a mixing GNN module, which is responsible for i) factorizing the team state-action value function into individual per-agent observation-action value functions, and ii) explicit credit assignment to each agent in terms of fractions of the global team reward. Our approach, which we call GraphMIX, follows the centralized training and decentralized execution paradigm, enabling the agents to make their decisions independently once training is completed. We show the superiority of GraphMIX as compared to the state-of-the-art on several scenarios in the StarCraft II multi-agent challenge (SMAC) benchmark. We further demonstrate how GraphMIX can be used in conjunction with a recent hierarchical MARL architecture to both improve the agents' performance and enable fine-tuning them on mismatched test scenarios with higher numbers of agents and/or actions.

UneVEn: Universal Value Exploration for Multi-Agent Reinforcement Learning

Authors:Tarun Gupta, Anuj Mahajan, Bei Peng, Wendelin Böhmer, Shimon Whiteson
Date:2020-10-06 19:08:47

VDN and QMIX are two popular value-based algorithms for cooperative MARL that learn a centralized action value function as a monotonic mixing of per-agent utilities. While this enables easy decentralization of the learned policy, the restricted joint action value function can prevent them from solving tasks that require significant coordination between agents at a given timestep. We show that this problem can be overcome by improving the joint exploration of all agents during training. Specifically, we propose a novel MARL approach called Universal Value Exploration (UneVEn) that learns a set of related tasks simultaneously with a linear decomposition of universal successor features. With the policies of already solved related tasks, the joint exploration process of all agents can be improved to help them achieve better coordination. Empirical results on a set of exploration games, challenging cooperative predator-prey tasks requiring significant coordination among agents, and StarCraft II micromanagement benchmarks show that UneVEn can solve tasks where other state-of-the-art MARL methods fail.

RODE: Learning Roles to Decompose Multi-Agent Tasks

Authors:Tonghan Wang, Tarun Gupta, Anuj Mahajan, Bei Peng, Shimon Whiteson, Chongjie Zhang
Date:2020-10-04 09:20:59

Role-based learning holds the promise of achieving scalable multi-agent learning by decomposing complex tasks using roles. However, it is largely unclear how to efficiently discover such a set of roles. To solve this problem, we propose to first decompose joint action spaces into restricted role action spaces by clustering actions according to their effects on the environment and other agents. Learning a role selector based on action effects makes role discovery much easier because it forms a bi-level learning hierarchy -- the role selector searches in a smaller role space and at a lower temporal resolution, while role policies learn in significantly reduced primitive action-observation spaces. We further integrate information about action effects into the role policies to boost learning efficiency and policy generalization. By virtue of these advances, our method (1) outperforms the current state-of-the-art MARL algorithms on 10 of the 14 scenarios that comprise the challenging StarCraft II micromanagement benchmark and (2) achieves rapid transfer to new environments with three times the number of agents. Demonstrative videos are available at https://sites.google.com/view/rode-marl .

Correcting Experience Replay for Multi-Agent Communication

Authors:Sanjeevan Ahilan, Peter Dayan
Date:2020-10-02 20:49:24

We consider the problem of learning to communicate using multi-agent reinforcement learning (MARL). A common approach is to learn off-policy, using data sampled from a replay buffer. However, messages received in the past may not accurately reflect the current communication policy of each agent, and this complicates learning. We therefore introduce a 'communication correction' which accounts for the non-stationarity of observed communication induced by multi-agent learning. It works by relabelling the received message to make it likely under the communicator's current policy, and thus be a better reflection of the receiver's current environment. To account for cases in which agents are both senders and receivers, we introduce an ordered relabelling scheme. Our correction is computationally efficient and can be integrated with a range of off-policy algorithms. We find in our experiments that it substantially improves the ability of communicating MARL systems to learn across a variety of cooperative and competitive tasks.

PettingZoo: Gym for Multi-Agent Reinforcement Learning

Authors:J. K. Terry, Benjamin Black, Nathaniel Grammel, Mario Jayakumar, Ananth Hari, Ryan Sullivan, Luis Santos, Rodrigo Perez, Caroline Horsch, Clemens Dieffendahl, Niall L. Williams, Yashas Lokesh, Praveen Ravi
Date:2020-09-30 06:42:09

This paper introduces the PettingZoo library and the accompanying Agent Environment Cycle ("AEC") games model. PettingZoo is a library of diverse sets of multi-agent environments with a universal, elegant Python API. PettingZoo was developed with the goal of accelerating research in Multi-Agent Reinforcement Learning ("MARL"), by making work more interchangeable, accessible and reproducible akin to what OpenAI's Gym library did for single-agent reinforcement learning. PettingZoo's API, while inheriting many features of Gym, is unique amongst MARL APIs in that it's based around the novel AEC games model. We argue, in part through case studies on major problems in popular MARL environments, that the popular game models are poor conceptual models of games commonly used in MARL and accordingly can promote confusing bugs that are hard to detect, and that the AEC games model addresses these problems.
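
For reference, the AEC-style interaction loop looks roughly like this in recent PettingZoo releases (exact signatures have evolved across versions, so treat this as indicative):

```python
from pettingzoo.butterfly import pistonball_v6

env = pistonball_v6.env()
env.reset(seed=42)
for agent in env.agent_iter():           # cycles through agents one at a time
    obs, reward, termination, truncation, info = env.last()
    action = None if termination or truncation else env.action_space(agent).sample()
    env.step(action)
env.close()
```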

Agent Environment Cycle Games

Authors:J K Terry, Nathaniel Grammel, Benjamin Black, Ananth Hari, Caroline Horsch, Luis Santos
Date:2020-09-28 04:02:08

Partially Observable Stochastic Games (POSGs) are the most general and common model of games used in Multi-Agent Reinforcement Learning (MARL). We argue that the POSG model is conceptually ill suited to software MARL environments, and offer case studies from the literature where this mismatch has led to severely unexpected behavior. In response to this, we introduce the Agent Environment Cycle Games (AEC Games) model, which is more representative of software implementation. We then prove that it is an equivalent model to POSGs. The AEC games model is also uniquely useful in that it can elegantly represent all forms of MARL environments, whereas, for example, POSGs cannot elegantly represent strictly turn-based games like chess.

Byzantine-Resilient Decentralized TD Learning with Linear Function Approximation

Authors:Zhaoxian Wu, Han Shen, Tianyi Chen, Qing Ling
Date:2020-09-23 13:46:23

This paper considers the policy evaluation problem in a multi-agent reinforcement learning (MARL) environment over decentralized and directed networks. The focus is on decentralized temporal difference (TD) learning with linear function approximation in the presence of unreliable or even malicious agents, termed as Byzantine agents. In order to evaluate the quality of a fixed policy in a common environment, agents usually run decentralized TD($\lambda$) collaboratively. However, when some Byzantine agents behave adversarially, decentralized TD($\lambda$) is unable to learn an accurate linear approximation for the true value function. We propose a trimmed-mean based Byzantine-resilient decentralized TD($\lambda$) algorithm to perform policy evaluation in this setting. We establish the finite-time convergence rate, as well as the asymptotic learning error in the presence of Byzantine agents. Numerical experiments corroborate the robustness of the proposed algorithm.
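
The trimmed-mean aggregation that underpins the resilience result is simple to state: per coordinate, each agent discards the b largest and b smallest values received from neighbors before averaging, which bounds the influence of up to b Byzantine neighbors. A minimal sketch:

```python
import numpy as np

def trimmed_mean(neighbor_values, b):
    """neighbor_values: (n_neighbors, dim); drop b extremes per side, per coordinate."""
    s = np.sort(neighbor_values, axis=0)
    return s[b:len(s) - b].mean(axis=0)

vals = np.vstack([np.random.randn(8, 3),        # 8 honest TD-parameter vectors
                  np.full((2, 3), 100.0)])      # 2 Byzantine outliers
print(trimmed_mean(vals, b=2))                  # outliers excluded per coordinate
```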

Ultra-dense Low Data Rate (UDLD) Communication in the THz

Authors:Rohit Singh, Doug Sicker
Date:2020-09-22 16:52:58

In the future, with the advent of Internet of Things (IoT), wireless sensors, and multiple 5G killer applications, an indoor room might be filled with 1000s of devices demanding low data rates. Such high-level densification and mobility of these devices will overwhelm the system and result in higher interference, frequent outages, and lower coverage. The THz band has a massive amount of greenfield spectrum to cater to this dense-indoor deployment. However, a limited coverage range of the THz will require networks to have more infrastructure and depend on non-line-of-sight (NLOS) type communication. This form of communication might not be profitable for network operators and can even result in inefficient resource utilization for devices demanding low data rates. Using distributed device-to-device (D2D) communication in the THz, we can cater to these Ultra-dense Low Data Rate (UDLD) type applications. D2D in THz can be challenging, but with opportunistic allocation and smart learning algorithms, these challenges can be mitigated. We propose a 2-Layered distributed D2D model, where devices use coordinated multi-agent reinforcement learning (MARL) to maximize efficiency and user coverage for dense-indoor deployment. We show that densification and mobility in a network can be used to further the limited coverage range of THz devices, without the need for extra infrastructure or resources.

Lyapunov-Based Reinforcement Learning for Decentralized Multi-Agent Control

Authors:Qingrui Zhang, Hao Dong, Wei Pan
Date:2020-09-20 06:11:42

Decentralized multi-agent control has broad applications, ranging from multi-robot cooperation to distributed sensor networks. In decentralized multi-agent control, systems are complex with unknown or highly uncertain dynamics, where traditional model-based control methods can hardly be applied. Compared with model-based control in control theory, deep reinforcement learning (DRL) is promising to learn the controller/policy from data without knowing the system dynamics. However, directly applying DRL to decentralized multi-agent control is challenging, as interactions among agents make the learning environment non-stationary. More importantly, the existing MARL algorithms cannot ensure the closed-loop stability of a multi-agent system from a control-theoretic perspective, so the learned control policies are likely to generate abnormal or dangerous behaviors in real applications. Hence, without a stability guarantee, the application of the existing MARL algorithms to real multi-agent systems is of great concern, e.g., UAVs, robots, and power systems. In this paper, we aim to propose a new MARL algorithm for decentralized multi-agent control with a stability guarantee. The new MARL algorithm, termed multi-agent soft actor-critic (MASAC), is proposed under the well-known framework of "centralized-training-with-decentralized-execution". The closed-loop stability is guaranteed by the introduction of a stability constraint during the policy improvement in our MASAC algorithm. The stability constraint is designed based on Lyapunov's method in control theory. To demonstrate the effectiveness, we present a multi-agent navigation example to show the efficiency of the proposed MASAC algorithm.

Energy-based Surprise Minimization for Multi-Agent Value Factorization

Authors:Karush Suri, Xiao Qi Shi, Konstantinos Plataniotis, Yuri Lawryshyn
Date:2020-09-16 19:42:42

Multi-Agent Reinforcement Learning (MARL) has demonstrated significant success in training decentralised policies in a centralised manner by making use of value factorization methods. However, addressing surprise across spurious states and approximation bias remain open problems for multi-agent settings. Towards this goal, we introduce the Energy-based MIXer (EMIX), an algorithm which minimizes surprise utilizing the energy across agents. Our contributions are threefold: (1) EMIX introduces a novel surprise minimization technique across multiple agents in the case of multi-agent partially-observable settings. (2) EMIX highlights a practical use of energy functions in MARL with theoretical guarantees and experimental validation of the energy operator. Lastly, (3) EMIX extends Maxmin Q-learning for addressing overestimation bias across agents in MARL. In a study of challenging StarCraft II micromanagement scenarios, EMIX demonstrates consistent stable performance for multi-agent surprise minimization. Moreover, our ablation study highlights the necessity of the energy-based scheme and the need for elimination of overestimation bias in MARL. Our implementation of EMIX can be found at karush17.github.io/emix-web/.

QR-MIX: Distributional Value Function Factorisation for Cooperative Multi-Agent Reinforcement Learning

Authors:Jian Hu, Seth Austin Harding, Haibin Wu, Siyue Hu, Shih-wei Liao
Date:2020-09-09 10:28:44

In Cooperative Multi-Agent Reinforcement Learning (MARL) and under the setting of Centralized Training with Decentralized Execution (CTDE), agents observe and interact with their environment locally and independently. With local observation and random sampling, the randomness in rewards and observations leads to randomness in long-term returns. Existing methods such as Value Decomposition Network (VDN) and QMIX estimate the value of long-term returns as a scalar that does not contain the information of randomness. Our proposed model QR-MIX introduces quantile regression, modeling joint state-action values as a distribution, combining QMIX with Implicit Quantile Network (IQN). However, the monotonicity in QMIX limits the expression of joint state-action value distribution and may lead to incorrect estimation results in non-monotonic cases. Therefore, we propose a flexible loss function to approximate the monotonicity found in QMIX. Our model is not only more tolerant of the randomness of returns, but also more tolerant of the randomness of monotonic constraints. The experimental results demonstrate that QR-MIX outperforms the previous state-of-the-art method QMIX in the StarCraft Multi-Agent Challenge (SMAC) environment.

PAC Reinforcement Learning Algorithm for General-Sum Markov Games

Authors:Ashkan Zehfroosh, Herbert G. Tanner
Date:2020-09-05 21:54:27

This paper presents a theoretical framework for probably approximately correct (PAC) multi-agent reinforcement learning (MARL) algorithms for Markov games. The paper offers an extension to the well-known Nash Q-learning algorithm, using the idea of delayed Q-learning, in order to build a new PAC MARL algorithm for general-sum Markov games. In addition to guiding the design of a provably PAC MARL algorithm, the framework enables checking whether an arbitrary MARL algorithm is PAC. Comparative numerical results demonstrate performance and robustness.

The reinforcement learning-based multi-agent cooperative approach for the adaptive speed regulation on a metallurgical pickling line

Authors:Anna Bogomolova, Kseniia Kingsep, Boris Voskresenskii
Date:2020-08-16 15:10:39

We present a holistic data-driven approach to the problem of productivity increase on the example of a metallurgical pickling line. The proposed approach combines mathematical modeling as a base algorithm with a cooperative Multi-Agent Reinforcement Learning (MARL) system implemented so as to enhance performance by multiple criteria while also meeting safety and reliability requirements and taking into account the unexpected volatility of certain technological processes. We demonstrate how Deep Q-Learning can be applied to a real-life task in heavy industry, resulting in significant improvement over previously existing automation systems. The problem of input data scarcity is solved by a two-step combination of LSTM and CGAN, which helps to embrace both the tabular representation of the data and its sequential properties. Offline RL training, a necessity in this setting, has become possible through the sophisticated probabilistic kinematic environment.

REMAX: Relational Representation for Multi-Agent Exploration

Authors:Heechang Ryu, Hayong Shin, Jinkyoo Park
Date:2020-08-12 10:23:35

Training a multi-agent reinforcement learning (MARL) model with a sparse reward is generally difficult because numerous combinations of interactions among agents induce a certain outcome (i.e., success or failure). Earlier studies have tried to resolve this issue by employing an intrinsic reward to induce interactions that are helpful for learning an effective policy. However, this approach requires extensive prior knowledge for designing an intrinsic reward. To train the MARL model effectively without designing the intrinsic reward, we propose a learning-based exploration strategy to generate the initial states of a game. The proposed method adopts a variational graph autoencoder to represent a game state such that (1) the state can be compactly encoded to a latent representation by considering relationships among agents, and (2) the latent representation can be used as an effective input for a coupled surrogate model to predict an exploration score. The proposed method then finds new latent representations that maximize the exploration scores and decodes these representations to generate initial states from which the MARL model starts training in the game and thus experiences novel and rewardable states. We demonstrate that our method improves the training and performance of the MARL model more than the existing exploration methods.

QPLEX: Duplex Dueling Multi-Agent Q-Learning

Authors:Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, Chongjie Zhang
Date:2020-08-03 17:52:09

We explore value-based multi-agent reinforcement learning (MARL) in the popular paradigm of centralized training with decentralized execution (CTDE). Central to CTDE is the Individual-Global-Max (IGM) principle, which requires consistency between joint and local action selections to support efficient local decision-making. However, in order to achieve scalability, existing MARL methods either limit the representation expressiveness of their value function classes or relax the IGM consistency, which may suffer from instability risk or may not perform well in complex domains. This paper presents a novel MARL approach, called duPLEX dueling multi-agent Q-learning (QPLEX), which takes a duplex dueling network architecture to factorize the joint value function. This duplex dueling structure encodes the IGM principle into the neural network architecture and thus enables efficient value function learning. Theoretical analysis shows that QPLEX achieves a complete IGM function class. Empirical experiments on StarCraft II micromanagement tasks demonstrate that QPLEX significantly outperforms state-of-the-art baselines in both online and offline data collection settings, and also reveal that QPLEX achieves high sample efficiency and can benefit from offline datasets without additional online exploration.
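
A minimal sketch of the advantage-based mixing at the heart of the duplex dueling idea (shapes and the weight network are our assumptions, not the QPLEX implementation):

```python
import torch

def duplex_dueling_mix(q_agent, q_chosen, lam):
    """Advantage-based IGM mixing in the spirit of QPLEX (illustrative sketch).

    q_agent : (batch, n, n_actions) per-agent action values
    q_chosen: (batch, n) values of the actually chosen actions
    lam     : (batch, n) strictly positive weights (e.g. a softplus over a
              state-conditioned attention network -- an assumption here)
    """
    v_i = q_agent.max(dim=2).values       # per-agent state values V_i = max_a Q_i
    adv_i = q_chosen - v_i                # per-agent advantages, always <= 0
    # Joint value: sum of V_i; joint advantage: positively weighted sum of A_i.
    # Since lam > 0 and each A_i peaks (at 0) on the greedy action, the joint
    # greedy action coincides with the per-agent greedy actions (IGM).
    return v_i.sum(dim=1) + (lam * adv_i).sum(dim=1)
```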

Searching Collaborative Agents for Multi-plane Localization in 3D Ultrasound

Authors:Yuhao Huang, Xin Yang, Rui Li, Jikuan Qian, Xiaoqiong Huang, Wenlong Shi, Haoran Dou, Chaoyu Chen, Yuanji Zhang, Huanjia Luo, Alejandro Frangi, Yi Xiong, Dong Ni
Date:2020-07-30 07:23:55

3D ultrasound (US) is widely used due to its rich diagnostic information, portability and low cost. Automated standard plane (SP) localization in a US volume not only improves efficiency and reduces user-dependence, but also boosts 3D US interpretation. In this study, we propose a novel Multi-Agent Reinforcement Learning (MARL) framework to localize multiple uterine SPs in 3D US simultaneously. Our contribution is two-fold. First, we equip the MARL framework with a one-shot neural architecture search (NAS) module to obtain the optimal agent for each plane. Specifically, Gradient-based search using a Differentiable Architecture Sampler (GDAS) is employed to accelerate and stabilize the training process. Second, we propose a novel collaborative strategy to strengthen agents' communication. Our strategy uses a recurrent neural network (RNN) to learn the spatial relationship among SPs effectively. Extensively validated on a large dataset, our approach achieves accuracies of 7.05 degrees/2.21 mm, 8.62 degrees/2.36 mm and 5.93 degrees/0.89 mm for the mid-sagittal, transverse and coronal plane localization, respectively. The proposed MARL framework significantly increases plane localization accuracy while reducing computational cost and model size.

Value-Decomposition Multi-Agent Actor-Critics

Authors:Jianyu Su, Stephen Adams, Peter A. Beling
Date:2020-07-24 00:50:02

The exploitation of extra state information has been an active research area in multi-agent reinforcement learning (MARL). QMIX represents the joint action-value using a non-negative function approximator and achieves the best performance to date on the standard multi-agent benchmark of StarCraft II micromanagement tasks. However, our experiments show that, in some cases, QMIX is incompatible with A2C, a training paradigm that promotes algorithm training efficiency. To obtain a reasonable trade-off between training efficiency and algorithm performance, we extend value decomposition to actor-critics that are compatible with A2C and propose a novel actor-critic framework, value-decomposition actor-critics (VDACs). We evaluate VDACs on the testbed of StarCraft II micromanagement tasks and demonstrate that the proposed framework improves median performance over other actor-critic methods. Furthermore, we use a set of ablation experiments to identify the key factors that contribute to the performance of VDACs.

Towards Joint Learning of Optimal MAC Signaling and Wireless Channel Access

Authors:Alvaro Valcarce, Jakob Hoydis
Date:2020-07-20 08:54:36

Communication protocols are the languages used by network nodes. Before a user equipment (UE) can exchange data with a base station (BS), it must first negotiate the conditions and parameters for that transmission. This negotiation is supported by signaling messages at all layers of the protocol stack. Each year, the mobile communications industry defines and standardizes these messages, which are designed by humans during lengthy technical (and often political) debates. Following this standardization effort, the development phase begins, wherein the industry interprets and implements the resulting standards. But is this massive development undertaking the only way to implement a given protocol? We address the question of whether radios can learn a pre-given target protocol as an intermediate step towards evolving their own. Furthermore, we train cellular radios to emerge a channel access policy that performs optimally under the constraints of the target protocol. We show that multi-agent reinforcement learning (MARL) and learning-to-communicate (L2C) techniques achieve this goal with gains over expert systems. Finally, we provide insight into the transferability of these results to scenarios never seen during training.

Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity

Authors:Kaiqing Zhang, Sham M. Kakade, Tamer Başar, Lin F. Yang
Date:2020-07-15 03:25:24

Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and the planning phases, and avoids the non-stationarity problem that arises when all agents improve their policies simultaneously using samples. Though intuitive and widely used, the sample complexity of model-based MARL algorithms has not been fully investigated. In this paper, our goal is to address this fundamental question. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model. We show that model-based MARL achieves a sample complexity of $\tilde O(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error, and the $\epsilon$-NE policies with a smooth planning oracle, where $\gamma$ is the discount factor and $S$, $A$, $B$ denote the state space and the action spaces of the two agents. We further show that this sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, i.e., it queries state transition samples without reward knowledge, by establishing a matching lower bound. This is in contrast to the usual reward-aware setting, with a $\tilde\Omega(|S|(|A|+|B|)(1-\gamma)^{-3}\epsilon^{-2})$ lower bound, where this model-based approach is near-optimal with only a gap in the $|A|,|B|$ dependence. Our results not only demonstrate the sample efficiency of this basic model-based approach in MARL, but also elaborate on the fundamental tradeoff between its power (easily handling the more challenging reward-agnostic case) and its limitation (being less adaptive and suboptimal in the $|A|,|B|$ dependence), a tradeoff that particularly arises in the multi-agent context.

Decentralized Deep Reinforcement Learning for Network Level Traffic Signal Control

Authors:Jin Guo
Date:2020-07-02 06:58:27

In this thesis, I propose a family of fully decentralized deep multi-agent reinforcement learning (MARL) algorithms to achieve high real-time performance in network-level traffic signal control. In this approach, each intersection is modeled as an agent that plays a Markovian game against the other intersection nodes in a traffic signal network modeled as an undirected graph, to approach the optimal reduction in delay. Following the Partially Observable Markov Decision Process (POMDP) formulation, there are three levels of communication schemes between adjacent learning agents: independent deep Q-learning (IDQL), shared-states reinforcement learning (S2RL), and a shared states & rewards version of S2RL, S2R2L. In these three variants of decentralized MARL schemes, each agent trains its local deep Q-network (DQN) separately, enhanced by convergence-guaranteed techniques such as double DQN, prioritized experience replay, and multi-step bootstrapping. To test the performance of the proposed three MARL algorithms, a SUMO-based simulation platform is developed to mimic the traffic evolution of the real world. Fed with random traffic demand between permitted OD pairs, a 4x4 Manhattan-style grid network is set up as the testbed, and two different vehicle arrival rates are generated for model training and testing. The experimental results show that S2R2L has a quicker convergence rate and better convergent performance than IDQL and S2RL in the training process. Moreover, all three MARL schemes reveal exceptional generalization abilities: their testing results surpass the benchmark Max Pressure (MP) algorithm under the criteria of average vehicle delay, network-level queue length and fuel consumption rate. Notably, S2R2L has the best testing performance, reducing traffic delay by 34.55% and queue length by 10.91% compared with MP.
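
For readers unfamiliar with the convergence stabilisers listed above, a minimal sketch of the double-DQN target each local agent would bootstrap from (network handles and shapes are assumptions for illustration):

```python
import torch

def double_dqn_target(q_net, target_net, next_obs, reward, done, gamma=0.99):
    """Double-DQN bootstrapped target.

    The online network selects the greedy next action while the slower target
    network evaluates it, which reduces the over-estimation bias of vanilla
    Q-learning.  reward/done: (batch,) float tensors.
    """
    with torch.no_grad():
        greedy = q_net(next_obs).argmax(dim=1, keepdim=True)        # selection
        next_q = target_net(next_obs).gather(1, greedy).squeeze(1)  # evaluation
        return reward + gamma * (1.0 - done) * next_q
```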

Opinion Diffusion Software with Strategic Opinion Revelation and Unfriending

Authors:Patrick Shepherd, Mia Weaver, Judy Goldsmith
Date:2020-06-22 19:09:35

We present a novel software suite for modeling social networks and opinion diffusion processes. Much research in social network science has assumed networks with static topologies. More recently, attention has turned to networks that evolve. Although software for modeling both the topological evolution of networks and diffusion processes is constantly improving, very little attention has been paid to agent modeling. Our software is designed to be robust, modular, and extensible, providing the ability to model dynamic social network topologies and multidimensional diffusion processes, different styles of agents including non-homophilic paradigms, as well as a testing environment for multi-agent reinforcement learning (MARL) experiments with diverse sets of agent types. We also illustrate the value of diverse agent modeling and of environments that allow for strategic unfriending. Our work shows that polarization and consensus dynamics, as well as topological clustering effects, may depend more than previously known on individuals' goals for the composition of their neighborhood's opinions.

QTRAN++: Improved Value Transformation for Cooperative Multi-Agent Reinforcement Learning

Authors:Kyunghwan Son, Sungsoo Ahn, Roben Delos Reyes, Jinwoo Shin, Yung Yi
Date:2020-06-22 05:08:36

QTRAN is a multi-agent reinforcement learning (MARL) algorithm capable of learning the largest class of joint-action value functions to date. However, despite its strong theoretical guarantee, it has shown poor empirical performance in complex environments, such as the StarCraft Multi-Agent Challenge (SMAC). In this paper, we identify the performance bottleneck of QTRAN and propose a substantially improved version, coined QTRAN++. Our gains come from (i) stabilizing the training objective of QTRAN, (ii) removing the strict role separation between the action-value estimators of QTRAN, and (iii) introducing a multi-head mixing network for value transformation. Through extensive evaluation, we confirm that our diagnosis is correct, and QTRAN++ successfully bridges the gap between empirical performance and theoretical guarantee. In particular, QTRAN++ achieves new state-of-the-art performance in the SMAC environment. The code will be released.

Breaking the Curse of Many Agents: Provable Mean Embedding Q-Iteration for Mean-Field Reinforcement Learning

Authors:Lingxiao Wang, Zhuoran Yang, Zhaoran Wang
Date:2020-06-21 21:45:50

Multi-agent reinforcement learning (MARL) has achieved significant empirical successes. However, MARL suffers from the curse of many agents. In this paper, we exploit the symmetry of agents in MARL. In its most generic form, we study a mean-field MARL problem. Such mean-field MARL is defined on mean-field states, which are distributions supported on a continuous space. Based on the mean embedding of these distributions, we propose the MF-FQI algorithm, which solves the mean-field MARL problem, and we establish a non-asymptotic analysis for it. We highlight that MF-FQI enjoys a "blessing of many agents" property, in the sense that a larger number of observed agents improves its performance.
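
To make the mean-embedding idea concrete, here is a small sketch (our own illustration, with arbitrary feature count and bandwidth) that summarises an empirical mean-field state with random Fourier features for an RBF kernel:

```python
import numpy as np

def mean_embedding(agent_states, n_features=128, bandwidth=1.0, seed=0):
    """Kernel mean embedding of the empirical agent distribution (sketch).

    The mean-field state -- a distribution over agents -- is summarised by one
    fixed-length vector, which is the mechanism that lets a mean-field method
    sidestep the curse of many agents.

    agent_states: (n_agents, d) samples from the mean-field state
    """
    rng = np.random.default_rng(seed)
    d = agent_states.shape[1]
    # Random Fourier features approximating an RBF kernel of the given bandwidth.
    W = rng.normal(scale=1.0 / bandwidth, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    phi = np.sqrt(2.0 / n_features) * np.cos(agent_states @ W + b)
    return phi.mean(axis=0)      # empirical mean embedding, shape (n_features,)
```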

Deep Implicit Coordination Graphs for Multi-agent Reinforcement Learning

Authors:Sheng Li, Jayesh K. Gupta, Peter Morales, Ross Allen, Mykel J. Kochenderfer
Date:2020-06-19 23:41:49

Multi-agent reinforcement learning (MARL) requires coordination to efficiently solve certain tasks. Fully centralized control is often infeasible in such domains due to the size of joint action spaces. Coordination-graph-based formalization allows reasoning about the joint action based on the structure of interactions. However, such graphs often require domain expertise in their design. This paper introduces the deep implicit coordination graph (DICG) architecture for such scenarios. DICG consists of a module for inferring the dynamic coordination graph structure, which is then used by a graph-neural-network-based module to learn to implicitly reason about the joint actions or values. DICG allows learning the tradeoff between full centralization and decentralization via standard actor-critic methods to significantly improve coordination for domains with a large number of agents. We apply DICG to both centralized-training-centralized-execution and centralized-training-decentralized-execution regimes. We demonstrate that DICG solves the relative overgeneralization pathology in predator-prey tasks as well as outperforms various MARL baselines on the challenging StarCraft II Multi-agent Challenge (SMAC) and traffic junction environments.
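
A rough sketch of the inference step: attention over agent embeddings yields a soft adjacency, over which one round of messages is passed (the single propagation round and layer shapes are our simplifications, not the exact DICG architecture):

```python
import torch
import torch.nn.functional as F

def implicit_coordination_graph(embeddings, temperature=1.0):
    """Infer a soft coordination graph from agent embeddings (sketch).

    embeddings: (n, d), one embedding per agent
    Returns the (n, n) soft adjacency and the aggregated messages.
    """
    d = embeddings.size(1)
    scores = embeddings @ embeddings.t() / d ** 0.5   # dot-product attention
    adjacency = F.softmax(scores / temperature, dim=-1)
    messages = adjacency @ embeddings                 # neighbour aggregation
    return adjacency, messages
```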

Cooperative Multi-Agent Reinforcement Learning with Partial Observations

Authors:Yan Zhang, Michael M. Zavlanos
Date:2020-06-18 19:36:22

In this paper, we propose a distributed zeroth-order policy optimization method for Multi-Agent Reinforcement Learning (MARL). Existing MARL algorithms often assume that every agent can observe the states and actions of all the other agents in the network. This can be impractical in large-scale problems, where sharing the state and action information with multi-hop neighbors may incur significant communication overhead. The advantage of the proposed zeroth-order policy optimization method is that it allows the agents to compute the local policy gradients needed to update their local policy functions using local estimates of the global accumulated rewards that depend on partial state and action information only and can be obtained using consensus. Specifically, to calculate the local policy gradients, we develop a new distributed zeroth-order policy gradient estimator that relies on one-point residual feedback which, compared to existing zeroth-order estimators that also rely on one-point feedback, significantly reduces the variance of the policy gradient estimates, thereby improving learning performance. We show that the proposed distributed zeroth-order policy optimization method with constant stepsize converges to a neighborhood of a policy that is a stationary point of the global objective function. The size of this neighborhood depends on the agents' learning rates, the exploration parameters, and the number of consensus steps used to calculate the local estimates of the global accumulated rewards. Moreover, we provide numerical experiments demonstrating that our new zeroth-order policy gradient estimator is more sample-efficient than other existing one-point estimators.
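
The variance-reduction trick can be sketched in a few lines: the estimator reuses the previous iterate's function value instead of querying the objective twice per step (the smoothing radius `delta` and the toy objective below are illustrative assumptions):

```python
import numpy as np

def residual_feedback_grad(f, theta, prev_value, delta=0.05, rng=np.random):
    """One-point residual-feedback zeroth-order gradient estimate (sketch):
        g_t = (d / delta) * (f(theta_t + delta*u_t) - f_{t-1}) * u_t
    Reusing the previous evaluation f_{t-1} is what lowers the variance
    relative to classic one-point estimators.
    """
    d = theta.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                  # random direction on the unit sphere
    value = f(theta + delta * u)            # single (possibly noisy) query
    grad = (d / delta) * (value - prev_value) * u
    return grad, value                      # carry `value` into the next step

# Illustrative use: ascend a toy concave objective with a constant stepsize.
theta, prev = np.zeros(10), 0.0
for _ in range(1000):
    g, prev = residual_feedback_grad(lambda x: -np.sum(x ** 2), theta, prev)
    theta += 1e-3 * g
```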

Weighted QMIX: Expanding Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning

Authors:Tabish Rashid, Gregory Farquhar, Bei Peng, Shimon Whiteson
Date:2020-06-18 18:34:50

QMIX is a popular $Q$-learning algorithm for cooperative MARL in the centralised training and decentralised execution paradigm. In order to enable easy decentralisation, QMIX restricts the joint action $Q$-values it can represent to be a monotonic mixing of each agent's utilities. However, this restriction prevents it from representing value functions in which an agent's ordering over its actions can depend on other agents' actions. To analyse this representational limitation, we first formalise the objective QMIX optimises, which allows us to view QMIX as an operator that first computes the $Q$-learning targets and then projects them into the space representable by QMIX. This projection returns a representable $Q$-value that minimises the unweighted squared error across all joint actions. We show in particular that this projection can fail to recover the optimal policy even with access to $Q^*$, which primarily stems from the equal weighting placed on each joint action. We rectify this by introducing a weighting into the projection, in order to place more importance on the better joint actions. We propose two weighting schemes and prove that they recover the correct maximal action for any joint action $Q$-values, and therefore for $Q^*$ as well. Based on our analysis and results in the tabular setting, we introduce two scalable versions of our algorithm, Centrally-Weighted (CW) QMIX and Optimistically-Weighted (OW) QMIX and demonstrate improved performance on both predator-prey and challenging multi-agent StarCraft benchmark tasks.
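
As a minimal sketch of the weighting idea (our reading of the optimistic variant; `alpha` is a hyperparameter of the method, and the surrounding QMIX machinery is omitted):

```python
import torch

def ow_qmix_loss(q_tot, target, alpha=0.1):
    """Optimistically-weighted projection loss in the spirit of OW-QMIX (sketch).

    Joint actions whose monotonic approximation currently *under*-estimates the
    target (td > 0) receive full weight 1; all others receive the small weight
    alpha, so the monotonic projection concentrates on promising joint actions
    instead of weighting every joint action equally.
    """
    td = target.detach() - q_tot
    w = torch.where(td > 0, torch.ones_like(td), torch.full_like(td, alpha))
    return (w * td.pow(2)).mean()
```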

On the influence of supra-thermal particle acceleration on the morphology of low-Mach, high-$β$ shocks

Authors:Allard Jan van Marle
Date:2020-06-17 12:11:58

When two galaxy clusters encounter each other, the interaction results in a collisionless shock that is characterized by a low (1-4) sonic Mach number and a high Alfvénic Mach number. Our goal is to determine if, and to what extent, such shocks can accelerate particles to sufficient velocities that they can contribute to the cosmic ray spectrum. We combine two different computational methods, magnetohydrodynamics (MHD) and particle-in-cell (PIC), into a single code that allows us to take advantage of the high computational efficiency of MHD while maintaining the ability to model the behaviour of individual non-thermal particles. Using this method, we perform a series of simulations covering the expected parameter space of galaxy cluster collision shocks. Our results show that for shocks with a sonic Mach number below 2.25 no diffusive shock acceleration can take place because of a lack of instabilities in the magnetic field, whereas for shocks with a sonic Mach number $\geq\,3$ the acceleration is efficient and can accelerate particles to relativistic speeds. In the regime between these two extremes, diffusive shock acceleration can occur but is relatively inefficient because of the time- and space-dependent nature of the instabilities. For those shocks that show efficient acceleration, the instabilities in the upstream gas increase to the point where they change the nature of the shock, which, in turn, will influence the particle injection process.

Multiagent Reinforcement Learning based Energy Beamforming Control

Authors:Liping Bai, Zhongqiang Pang
Date:2020-06-15 23:48:32

Ultra-low-power devices make far-field wireless power transfer a viable option for energy delivery despite the exponential attenuation. Electromagnetic beams are constructed from the stations such that wireless energy is directionally concentrated around the ultra-low-power devices. Energy beamforming faces different challenges compared to information beamforming due to the lack of feedback on channel state. Various methods have been proposed, such as one-bit channel feedback, to enhance energy beamforming capacity, yet they still incur considerable computation overhead and need to be computed centrally. Valuable resources and time are wasted on transferring control information back and forth. In this paper, we propose a novel multi-agent reinforcement learning (MARL) formulation for codebook-based beamforming control. It takes advantage of the inherently distributed structure of a wirelessly powered network and lays the groundwork for fully locally computed beam control algorithms. Source code can be found at https://github.com/BaiLiping/WirelessPowerTransfer.

Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks

Authors:Georgios Papoudakis, Filippos Christianos, Lukas Schäfer, Stefano V. Albrecht
Date:2020-06-14 11:22:53

Multi-agent deep reinforcement learning (MARL) suffers from a lack of commonly-used evaluation tasks and criteria, making comparisons between approaches difficult. In this work, we provide a systematic evaluation and comparison of three different classes of MARL algorithms (independent learning, centralised multi-agent policy gradient, value decomposition) in a diverse range of cooperative multi-agent learning tasks. Our experiments serve as a reference for the expected performance of algorithms across different learning tasks, and we provide insights regarding the effectiveness of different learning approaches. We open-source EPyMARL, which extends the PyMARL codebase to include additional algorithms and allow for flexible configuration of algorithm implementation details such as parameter sharing. Finally, we open-source two environments for multi-agent research which focus on coordination under sparse rewards.

Human and Multi-Agent collaboration in a human-MARL teaming framework

Authors:Neda Navidi, Francoi Chabo, Saga Kurandwa, Iv Lutigma, Vincent Robt, Gregry Szrftgr, Andea Schuh
Date:2020-06-12 16:32:42

Reinforcement learning provides effective results, with agents learning from their observations, received rewards, and interactions with other agents. This study proposes COGMENT, a new open-source MARL framework, to efficiently leverage human and agent interactions as a source of learning. We demonstrate these innovations using a designed real-time environment with unmanned aerial vehicles driven by RL agents collaborating with a human. The results of this study show that the proposed collaborative paradigm and the open-source framework lead to significant reductions in both human effort and exploration costs.

Learning to Communicate Using Counterfactual Reasoning

Authors:Simon Vanneste, Astrid Vanneste, Kevin Mets, Tom De Schepper, Ali Anwar, Siegfried Mercelis, Steven Latré, Peter Hellinckx
Date:2020-06-12 14:02:04

Learning to communicate in order to share state information is an active problem in the area of multi-agent reinforcement learning (MARL). The credit assignment problem, the non-stationarity of the communication environment, and the creation of influenceable agents are major challenges within this research field which need to be overcome in order to learn a valid communication protocol. This paper introduces the novel multi-agent counterfactual communication learning (MACC) method, which adapts counterfactual reasoning to overcome the credit assignment problem for communicating agents. The non-stationarity of the communication environment while learning the communication Q-function is overcome by constructing the communication Q-function from the action policies of the other agents and the Q-function of the action environment. Additionally, a social loss function is introduced to create influenceable agents, a prerequisite for learning a valid communication protocol. Our experiments show that MACC is able to outperform the state-of-the-art baselines in four different scenarios in the Particle environment.

Scalable Multi-Agent Reinforcement Learning for Networked Systems with Average Reward

Authors:Guannan Qu, Yiheng Lin, Adam Wierman, Na Li
Date:2020-06-11 17:23:17

It has long been recognized that multi-agent reinforcement learning (MARL) faces significant scalability issues due to the fact that the size of the state and action spaces are exponentially large in the number of agents. In this paper, we identify a rich class of networked MARL problems where the model exhibits a local dependence structure that allows it to be solved in a scalable manner. Specifically, we propose a Scalable Actor-Critic (SAC) method that can learn a near optimal localized policy for optimizing the average reward with complexity scaling with the state-action space size of local neighborhoods, as opposed to the entire network. Our result centers around identifying and exploiting an exponential decay property that ensures the effect of agents on each other decays exponentially fast in their graph distance.

Multi-Agent Reinforcement Learning in Stochastic Networked Systems

Authors:Yiheng Lin, Guannan Qu, Longbo Huang, Adam Wierman
Date:2020-06-11 16:08:16

We study multi-agent reinforcement learning (MARL) in a stochastic network of agents. The objective is to find localized policies that maximize the (discounted) global reward. In general, scalability is a challenge in this setting because the size of the global state/action space can be exponential in the number of agents. Scalable algorithms are only known in cases where dependencies are static, fixed and local, e.g., between neighbors in a fixed, time-invariant underlying graph. In this work, we propose a Scalable Actor Critic framework that applies in settings where the dependencies can be non-local and stochastic, and provide a finite-time error bound that shows how the convergence rate depends on the speed of information spread in the network. Additionally, as a byproduct of our analysis, we obtain novel finite-time convergence results for a general stochastic approximation scheme and for temporal difference learning with state aggregation, which apply beyond the setting of MARL in networked systems.

The Emergence of Individuality

Authors:Jiechuan Jiang, Zongqing Lu
Date:2020-06-10 14:11:21

Individuality is essential in human society: it induces the division of labor and thus improves efficiency and productivity. Similarly, it should also be key to multi-agent cooperation. Inspired by the observation that individuality means being an individual separate from others, we propose a simple yet efficient method for the emergence of individuality (EOI) in multi-agent reinforcement learning (MARL). EOI learns a probabilistic classifier that predicts a probability distribution over agents given their observation, and gives each agent an intrinsic reward for being correctly predicted by the classifier. The intrinsic reward encourages the agents to visit their own familiar observations, and learning the classifier from such observations makes the intrinsic reward signals stronger and the agents more identifiable. To further enhance the intrinsic reward and promote the emergence of individuality, two regularizers are proposed to increase the discriminability of the classifier. We implement EOI on top of popular MARL algorithms. Empirically, we show that EOI significantly outperforms existing methods in a variety of multi-agent cooperative scenarios.
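
The intrinsic reward is simple to state; below is a sketch of computing it from the probabilistic classifier (the classifier interface is an assumption for illustration, not the authors' code):

```python
import torch
import torch.nn.functional as F

def eoi_intrinsic_reward(classifier, obs, agent_ids):
    """Intrinsic reward in the spirit of EOI: the probability that a shared
    classifier assigns to the *correct* agent identity given only that
    agent's observation.

    classifier: maps observation batch -> logits over the n agents (assumed)
    obs:        (batch, obs_dim) observations
    agent_ids:  (batch,) long tensor of the true agent indices
    """
    with torch.no_grad():
        probs = F.softmax(classifier(obs), dim=-1)               # p(i | o)
        return probs.gather(1, agent_ids.unsqueeze(1)).squeeze(1)
```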

Learning to Model Opponent Learning

Authors:Ian Davies, Zheng Tian, Jun Wang
Date:2020-06-06 17:19:04

Multi-Agent Reinforcement Learning (MARL) considers settings in which a set of coexisting agents interact with one another and their environment. The adaptation and learning of other agents induces non-stationarity in the environment dynamics. This poses a great challenge for value function-based algorithms whose convergence usually relies on the assumption of a stationary environment. Policy search algorithms also struggle in multi-agent settings as the partial observability resulting from an opponent's actions not being known introduces high variance to policy training. Modelling an agent's opponent(s) is often pursued as a means of resolving the issues arising from the coexistence of learning opponents. An opponent model provides an agent with some ability to reason about other agents to aid its own decision making. Most prior works learn an opponent model by assuming the opponent is employing a stationary policy or switching between a set of stationary policies. Such an approach can reduce the variance of training signals for policy search algorithms. However, in the multi-agent setting, agents have an incentive to continually adapt and learn. This means that the assumptions concerning opponent stationarity are unrealistic. In this work, we develop a novel approach to modelling an opponent's learning dynamics which we term Learning to Model Opponent Learning (LeMOL). We show our structured opponent model is more accurate and stable than naive behaviour cloning baselines. We further show that opponent modelling can improve the performance of algorithmic agents in multi-agent settings.

Logical Team Q-learning: An approach towards factored policies in cooperative MARL

Authors:Lucas Cassano, Ali H. Sayed
Date:2020-06-05 17:02:36

We address the challenge of learning factored policies in cooperative MARL scenarios. In particular, we consider the situation in which a team of agents collaborates to optimize a common cost. The goal is to obtain factored policies that determine the individual behavior of each agent so that the resulting joint policy is optimal. The main contribution of this work is the introduction of Logical Team Q-learning (LTQL). LTQL does not rely on assumptions about the environment and hence is generally applicable to any collaborative MARL scenario. We derive LTQL as a stochastic approximation to a dynamic programming method we introduce in this work. We conclude the paper by providing experiments (both in the tabular and deep settings) that illustrate the claims.

A Maximum Mutual Information Framework for Multi-Agent Reinforcement Learning

Authors:Woojun Kim, Whiyoung Jung, Myungsik Cho, Youngchul Sung
Date:2020-06-04 09:43:52

In this paper, we propose a maximum mutual information (MMI) framework for multi-agent reinforcement learning (MARL) that enables multiple agents to learn coordinated behaviors by regularizing the accumulated return with the mutual information between their actions. By introducing a latent variable to induce nonzero mutual information between actions and applying a variational bound, we derive a tractable lower bound on the considered MMI-regularized objective function. Applying policy iteration to maximize the derived lower bound, we propose a practical algorithm named variational maximum mutual information multi-agent actor-critic (VM3-AC), which follows the centralized training with decentralized execution (CTDE) paradigm. We evaluated VM3-AC on several games requiring coordination, and numerical results show that VM3-AC outperforms MADDPG and other MARL algorithms in multi-agent tasks requiring coordination.

Experience Augmentation: Boosting and Accelerating Off-Policy Multi-Agent Reinforcement Learning

Authors:Zhenhui Ye, Yining Chen, Guanghua Song, Bowei Yang, Shen Fan
Date:2020-05-19 13:57:11

Exploration of the high-dimensional state-action space is one of the biggest challenges in Reinforcement Learning (RL), especially in the multi-agent domain. We present a novel technique called Experience Augmentation, which enables time-efficient and boosted learning based on a fast, fair and thorough exploration of the environment. It can be combined with arbitrary off-policy MARL algorithms and is applicable to both homogeneous and heterogeneous environments. We demonstrate our approach by combining it with MADDPG and verifying its performance in two homogeneous and one heterogeneous environment. In the best performing scenario, MADDPG with experience augmentation reaches the convergence reward of vanilla MADDPG in a quarter of the wall-clock time, and its converged performance beats the original model by a significant margin. Our ablation studies show that experience augmentation is a crucial ingredient that accelerates the training process and boosts convergence.

Automating Turbulence Modeling by Multi-Agent Reinforcement Learning

Authors:Guido Novati, Hugues Lascombes de Laroussilhe, Petros Koumoutsakos
Date:2020-05-18 18:45:09

The modeling of turbulent flows is critical to scientific and engineering problems ranging from aircraft design to weather forecasting and climate prediction. Over the last sixty years numerous turbulence models have been proposed, largely based on physical insight and engineering intuition. Recent advances in machine learning and data science have incited new efforts to complement these approaches. To date, all such efforts have focused on supervised learning which, despite demonstrated promise, encounters difficulties in generalizing beyond the distributions of the training data. In this work we introduce multi-agent reinforcement learning (MARL) as an automated discovery tool of turbulence models. We demonstrate the potential of this approach on Large Eddy Simulations of homogeneous and isotropic turbulence using as reward the recovery of the statistical properties of Direct Numerical Simulations. Here, the closure model is formulated as a control policy enacted by cooperating agents, which detect critical spatio-temporal patterns in the flow field to estimate the unresolved sub-grid scale (SGS) physics. The present results are obtained with state-of-the-art algorithms based on experience replay and compare favorably with established dynamic SGS modeling approaches. Moreover, we show that the present turbulence models generalize across grid sizes and flow conditions as expressed by the Reynolds numbers.

Delay-Aware Multi-Agent Reinforcement Learning for Cooperative and Competitive Environments

Authors:Baiming Chen, Mengdi Xu, Zuxin Liu, Liang Li, Ding Zhao
Date:2020-05-11 21:21:50

Action and observation delays are prevalent in real-world cyber-physical systems and may pose challenges for reinforcement learning design. Handling them is a particularly arduous task in multi-agent systems, where the delay of one agent can spread to other agents. To resolve this problem, this paper proposes a novel framework to deal with delays as well as the non-stationary training issue of multi-agent tasks with model-free deep reinforcement learning. We formally define the Delay-Aware Markov Game that incorporates the delays of all agents in the environment. To solve Delay-Aware Markov Games, we apply centralized training and decentralized execution, which allows agents to use extra information to ease the non-stationarity issue of multi-agent systems during training, without the need for a centralized controller during execution. Experiments are conducted in multi-agent particle environments including cooperative communication, cooperative navigation, and competitive experiments. We also test the proposed algorithm in traffic scenarios that require coordination of all autonomous vehicles to show the practical value of delay-awareness. Results show that the proposed delay-aware multi-agent reinforcement learning algorithm greatly alleviates the performance degradation introduced by delay. Codes and demo videos are available at: https://github.com/baimingc/delay-aware-MARL.
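
The core construction behind delay-aware formulations — folding committed-but-unexecuted actions into the state — can be sketched as a small environment wrapper (gym-style interface assumed; illustrative, not the authors' implementation):

```python
from collections import deque

class ActionDelayWrapper:
    """Augment the state with the queue of pending actions under a k-step
    action delay, which restores the Markov property of the decision process.
    """

    def __init__(self, env, delay, noop_action):
        self.env = env
        self.queue = deque([noop_action] * delay)   # committed, not yet executed

    def reset(self):
        obs = self.env.reset()
        return obs, tuple(self.queue)               # augmented state

    def step(self, action):
        executed = self.queue.popleft()             # committed `delay` steps ago
        self.queue.append(action)                   # fires `delay` steps from now
        obs, reward, done, info = self.env.step(executed)
        return (obs, tuple(self.queue)), reward, done, info
```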

Multi-agent Reinforcement Learning for Decentralized Stable Matching

Authors:Kshitija Taywade, Judy Goldsmith, Brent Harrison
Date:2020-05-03 15:28:41

In the real world, people/entities usually find matches independently and autonomously, such as finding jobs, partners, roommates, etc. It is possible that this search for matches starts with no initial knowledge of the environment. We propose the use of a multi-agent reinforcement learning (MARL) paradigm for a spatially formulated decentralized two-sided matching market with independent and autonomous agents. Having autonomous agents acting independently makes our environment very dynamic and uncertain. Moreover, agents lack knowledge of the preferences of other agents and have to explore the environment and interact with other agents to discover their own preferences through noisy rewards. We think such a setting better approximates the real world, and we study the usefulness of our MARL approach for it. Along with the conventional stable matching case, where agents have strictly ordered preferences, we check the applicability of our approach to stable matching with incomplete lists and ties. We investigate our results for stability, level of instability (for unstable results), and fairness. Our MARL approach mostly yields stable and fair outcomes.
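
Stability and the "level of instability" mentioned above reduce to counting blocking pairs; a self-contained check for the strict-preference case:

```python
def blocking_pairs(match, men_pref, women_pref):
    """Return the blocking pairs of a two-sided matching: (m, w) blocks the
    matching if m strictly prefers w to his partner AND w strictly prefers m
    to hers.  An empty list means the matching is stable; the list's length
    is a natural level-of-instability measure.

    match:      dict man -> woman
    men_pref:   dict man -> list of women, most preferred first
    women_pref: dict woman -> list of men, most preferred first
    """
    partner_of = {w: m for m, w in match.items()}
    blocks = []
    for m, prefs in men_pref.items():
        for w in prefs:
            if w == match[m]:
                break                    # only women m strictly prefers
            if women_pref[w].index(m) < women_pref[w].index(partner_of[w]):
                blocks.append((m, w))
    return blocks

# Tiny usage example: this matching has no blocking pair, hence is stable.
match = {"a": "x", "b": "y"}
men = {"a": ["y", "x"], "b": ["y", "x"]}
women = {"x": ["a", "b"], "y": ["b", "a"]}
print(blocking_pairs(match, men, women))   # -> []
```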

F2A2: Flexible Fully-decentralized Approximate Actor-critic for Cooperative Multi-agent Reinforcement Learning

Authors:Wenhao Li, Bo Jin, Xiangfeng Wang, Junchi Yan, Hongyuan Zha
Date:2020-04-17 14:56:29

Traditional centralized multi-agent reinforcement learning (MARL) algorithms are sometimes impractical in complicated applications due to non-interactivity between agents, the curse of dimensionality, and computational complexity. This has motivated several decentralized MARL algorithms. However, existing decentralized methods only handle the fully cooperative setting, in which massive amounts of information need to be transmitted during training. The block coordinate gradient descent scheme they use for successive independent actor and critic steps can simplify the calculation, but it causes serious bias. In this paper, we propose a flexible fully decentralized actor-critic MARL framework, which can accommodate most actor-critic methods and handle large-scale general cooperative multi-agent settings. A primal-dual hybrid gradient descent type algorithm framework is designed to learn individual agents separately for decentralization. From the perspective of each agent, policy improvement and value evaluation are jointly optimized, which stabilizes multi-agent policy learning. Furthermore, our framework achieves scalability and stability in large-scale environments and reduces information transmission through a parameter sharing mechanism and a novel modeling-other-agents method based on theory-of-mind and online supervised learning. Extensive experiments in the cooperative Multi-agent Particle Environment and StarCraft II show that our decentralized MARL instantiation algorithms perform competitively against conventional centralized and decentralized methods.

MARLeME: A Multi-Agent Reinforcement Learning Model Extraction Library

Authors:Dmitry Kazhdan, Zohreh Shams, Pietro Liò
Date:2020-04-16 20:27:38

Multi-Agent Reinforcement Learning (MARL) encompasses a powerful class of methodologies that have been applied in a wide range of fields. An effective way to further empower these methodologies is to develop libraries and tools that could expand their interpretability and explainability. In this work, we introduce MARLeME: a MARL model extraction library, designed to improve explainability of MARL systems by approximating them with symbolic models. Symbolic models offer a high degree of interpretability, well-defined properties, and verifiable behaviour. Consequently, they can be used to inspect and better understand the underlying MARL system and corresponding MARL agents, as well as to replace all/some of the agents that are particularly safety and security critical.

Re-conceptualising the Language Game Paradigm in the Framework of Multi-Agent Reinforcement Learning

Authors:Paul Van Eecke, Katrien Beuls
Date:2020-04-09 17:55:15

In this paper, we formulate the challenge of re-conceptualising the language game experimental paradigm in the framework of multi-agent reinforcement learning (MARL). If successful, future language game experiments will benefit from the rapid and promising methodological advances in the MARL community, while future MARL experiments on learning emergent communication will benefit from the insights and results gained from language game experiments. We strongly believe that this cross-pollination has the potential to lead to major breakthroughs in the modelling of how human-like languages can emerge and evolve in multi-agent systems.

Networked Multi-Agent Reinforcement Learning with Emergent Communication

Authors:Shubham Gupta, Rishi Hazra, Ambedkar Dukkipati
Date:2020-04-06 16:13:23

Multi-Agent Reinforcement Learning (MARL) methods find optimal policies for agents that operate in the presence of other learning agents. Central to achieving this is how the agents coordinate. One way to coordinate is by learning to communicate with each other. Can the agents develop a language while learning to perform a common task? In this paper, we formulate and study a MARL problem where cooperative agents are connected to each other via a fixed underlying network. These agents can communicate along the edges of this network by exchanging discrete symbols. However, the semantics of these symbols are not predefined and, during training, the agents are required to develop a language that helps them accomplish their goals. We propose a method for training these agents using emergent communication. We demonstrate the applicability of the proposed framework by applying it to the problem of managing traffic controllers, where we achieve state-of-the-art performance compared to a number of strong baselines. More importantly, we perform a detailed analysis of the emergent communication to show, for instance, that the developed language is grounded, and we demonstrate its relationship with the underlying network topology. To the best of our knowledge, this is the only work that performs an in-depth analysis of emergent communication in a networked MARL setting while being applicable to a broad class of problems.
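
One standard way to keep a discrete symbol channel differentiable end-to-end is a straight-through Gumbel-Softmax estimator; the sketch below shows how such training is commonly done, as an assumption rather than the authors' exact estimator:

```python
import torch
import torch.nn.functional as F

def discrete_message(logits, tau=1.0):
    """Emit a one-hot symbol in the forward pass while letting gradients flow
    through the soft Gumbel-Softmax sample (straight-through estimator).
    """
    y_soft = F.gumbel_softmax(logits, tau=tau, hard=False)
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
    # Forward value is y_hard; backward gradient is that of y_soft.
    return y_hard + (y_soft - y_soft.detach())
```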

A Deep Ensemble Multi-Agent Reinforcement Learning Approach for Air Traffic Control

Authors:Supriyo Ghosh, Sean Laguna, Shiau Hong Lim, Laura Wynter, Hasan Poonawala
Date:2020-04-03 06:03:53

Air traffic control is an example of a highly challenging operational problem that is readily amenable to human expertise augmentation via decision support technologies. In this paper, we propose a new intelligent decision making framework that leverages multi-agent reinforcement learning (MARL) to dynamically suggest adjustments of aircraft speeds in real-time. The goal of the system is to enhance the ability of an air traffic controller to provide effective guidance to aircraft to avoid air traffic congestion, near-miss situations, and to improve arrival timeliness. We develop a novel deep ensemble MARL method that can concisely capture the complexity of the air traffic control problem by learning to efficiently arbitrate between the decisions of a local kernel-based RL model and a wider-reaching deep MARL model. The proposed method is trained and evaluated on an open-source air traffic management simulator developed by Eurocontrol. Extensive empirical results on a real-world dataset including thousands of aircraft demonstrate the feasibility of using multi-agent RL for the problem of en-route air traffic control and show that our proposed deep ensemble MARL method significantly outperforms three state-of-the-art benchmark approaches.

Multi-agent Reinforcement Learning for Networked System Control

Authors:Tianshu Chu, Sandeep Chinchali, Sachin Katti
Date:2020-04-03 02:21:07

This paper considers multi-agent reinforcement learning (MARL) in networked system control. Specifically, each agent learns a decentralized control policy based on local observations and messages from connected neighbors. We formulate such a networked MARL (NMARL) problem as a spatiotemporal Markov decision process and introduce a spatial discount factor to stabilize the training of each local agent. Further, we propose a new differentiable communication protocol, called NeurComm, to reduce information loss and non-stationarity in NMARL. Based on experiments in realistic NMARL scenarios of adaptive traffic signal control and cooperative adaptive cruise control, an appropriate spatial discount factor effectively enhances the learning curves of non-communicative MARL algorithms, while NeurComm outperforms existing communication protocols in both learning efficiency and control performance.
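
The spatial discount factor can be illustrated in a few lines: neighbours' rewards enter each agent's learning signal attenuated by graph distance (a sketch of the idea; the exact placement of the discount in the authors' objective may differ):

```python
import numpy as np

def spatially_discounted_rewards(rewards, hop_distance, alpha=0.9):
    """Mix neighbours' rewards into each agent's learning signal, attenuated
    by alpha raised to the hop count, so remote agents perturb local training
    less and learning stabilises.

    rewards:      (n,) instantaneous reward of each agent
    hop_distance: (n, n) matrix of shortest-path hop counts between agents
    """
    weights = alpha ** hop_distance      # (n, n), equals 1 on the diagonal
    return weights @ rewards             # (n,) spatially discounted signals
```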

Information State Embedding in Partially Observable Cooperative Multi-Agent Reinforcement Learning

Authors:Weichao Mao, Kaiqing Zhang, Erik Miehling, Tamer Başar
Date:2020-04-02 16:03:42

Multi-agent reinforcement learning (MARL) under partial observability has long been considered challenging, primarily due to the requirement for each agent to maintain a belief over all other agents' local histories -- a domain that generally grows exponentially over time. In this work, we investigate a partially observable MARL problem in which agents are cooperative. To enable the development of tractable algorithms, we introduce the concept of an information state embedding that serves to compress agents' histories. We quantify how the compression error influences the resulting value functions for decentralized control. Furthermore, we propose an instance of the embedding based on recurrent neural networks (RNNs). The embedding is then used as an approximate information state, and can be fed into any MARL algorithm. The proposed embed-then-learn pipeline opens the black-box of existing (partially observable) MARL algorithms, allowing us to establish some theoretical guarantees (error bounds of value functions) while still achieving competitive performance with many end-to-end approaches.

Parallel Knowledge Transfer in Multi-Agent Reinforcement Learning

Authors:Yongyuan Liang, Bangwei Li
Date:2020-03-29 17:42:00

Multi-agent reinforcement learning is a standard framework for modeling multi-agent interactions in real-world scenarios. Inspired by experience sharing in human groups, reusing knowledge in parallel between agents can potentially improve team learning performance, especially in multi-task environments. When all agents interact with the environment and learn simultaneously, how each independent agent selectively learns from the behavioral knowledge of other agents is a problem we need to solve. This paper proposes PAT (Parallel Attentional Transfer), a novel knowledge transfer framework for MARL. We design two acting modes in PAT: student mode and self-learning mode. Each agent trains a decentralized student actor-critic to determine its acting mode at each time step. When agents are unfamiliar with the environment, the shared attention mechanism in student mode effectively selects learning knowledge from other agents to decide the agents' actions. PAT outperforms prior advising approaches in our empirical evaluations. Our approach not only significantly improves team learning rate and global performance, but is also flexible and transferable to various multi-agent systems.

Value-Decomposition Networks based Distributed Interference Control in Multi-platoon Groupcast

Authors:Xiongfeng Guo, Tianhao Wu, Lin Zhang
Date:2020-03-25 03:08:47

Platooning is considered one of the most representative 5G use cases. Due to the small spacing within a platoon, platoons need highly reliable transmission to guarantee driving safety while improving fuel and driving efficiency. However, efficient resource allocation between platoons remains a challenge, especially considering that the channel and power selected by each platoon affect other platoons. Platoons therefore need to coordinate with each other to ensure the groupcast quality of each platoon. To solve these challenges, we model the multi-platoon resource selection problem as a Markov game and then propose a distributed resource allocation algorithm based on Value-Decomposition Networks. Our scheme utilizes the historical data of each platoon for centralized training; in distributed execution, agents only need their local observations to make decisions. At the same time, we decrease the training burden by sharing the neural network parameters. Simulation results show that the proposed algorithm converges well. Compared with another multi-agent reinforcement learning (MARL) algorithm and a random baseline, our proposed solution dramatically reduces the probability of platoon groupcast failure and improves the quality of platoon groupcast.
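
Since the scheme builds on Value-Decomposition Networks, here is a compact sketch of the VDN objective it inherits (shapes are assumptions for illustration):

```python
import torch

def vdn_loss(q_chosen, q_next_max, reward, done, gamma=0.99):
    """Value-Decomposition Network objective (sketch): the joint value is the
    *sum* of per-agent utilities, so a single team TD error trains all agents,
    while each agent still acts greedily on its own Q-head at execution time.

    q_chosen:   (batch, n) per-agent Q-values of the taken actions
    q_next_max: (batch, n) per-agent max Q-values at the next step (target net)
    reward:     (batch,) shared team reward; done: (batch,) float flags
    """
    q_tot = q_chosen.sum(dim=1)
    target = reward + gamma * (1.0 - done) * q_next_max.sum(dim=1)
    return (q_tot - target.detach()).pow(2).mean()
```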

Evolutionary Population Curriculum for Scaling Multi-Agent Reinforcement Learning

Authors:Qian Long, Zihan Zhou, Abhibav Gupta, Fei Fang, Yi Wu, Xiaolong Wang
Date:2020-03-23 17:49:39

In multi-agent games, the complexity of the environment can grow exponentially as the number of agents increases, so it is particularly challenging to learn good policies when the agent population is large. In this paper, we introduce Evolutionary Population Curriculum (EPC), a curriculum learning paradigm that scales up Multi-Agent Reinforcement Learning (MARL) by progressively increasing the population of training agents in a stage-wise manner. Furthermore, EPC uses an evolutionary approach to fix an objective misalignment issue throughout the curriculum: agents successfully trained in an early stage with a small population are not necessarily the best candidates for adapting to later stages with scaled populations. Concretely, EPC maintains multiple sets of agents in each stage, performs mix-and-match and fine-tuning over these sets and promotes the sets of agents with the best adaptability to the next stage. We implement EPC on a popular MARL algorithm, MADDPG, and empirically show that our approach consistently outperforms baselines by a large margin as the number of agents grows exponentially.

ROMA: Multi-Agent Reinforcement Learning with Emergent Roles

Authors:Tonghan Wang, Heng Dong, Victor Lesser, Chongjie Zhang
Date:2020-03-18 04:29:42

The role concept provides a useful tool to design and understand complex multi-agent systems, which allows agents with a similar role to share similar behaviors. However, existing role-based methods use prior domain knowledge and predefine role structures and behaviors. In contrast, multi-agent reinforcement learning (MARL) provides flexibility and adaptability, but less efficiency in complex tasks. In this paper, we synergize these two paradigms and propose a role-oriented MARL framework (ROMA). In this framework, roles are emergent, and agents with similar roles tend to share their learning and to be specialized on certain sub-tasks. To this end, we construct a stochastic role embedding space by introducing two novel regularizers and conditioning individual policies on roles. Experiments show that our method can learn specialized, dynamic, and identifiable roles, which help our method push forward the state of the art on the StarCraft II micromanagement benchmark. Demonstrative videos are available at https://sites.google.com/view/romarl/.

Value Variance Minimization for Learning Approximate Equilibrium in Aggregation Systems

Authors:Tanvi Verma, Pradeep Varakantham
Date:2020-03-16 10:02:42

For effective matching of resources (e.g., taxis, food, bikes, shopping items) to customer demand, aggregation systems have been extremely successful. In aggregation systems, a central entity (e.g., Uber, Food Panda, Ofo) aggregates supply (e.g., drivers, delivery personnel) and matches demand to supply on a continuous basis (sequential decisions). Due to the objective of the central entity to maximize its profits, individual suppliers get sacrificed thereby creating incentive for individuals to leave the system. In this paper, we consider the problem of learning approximate equilibrium solutions (win-win solutions) in aggregation systems, so that individuals have an incentive to remain in the aggregation system. Unfortunately, such systems have thousands of agents and have to consider demand uncertainty and the underlying problem is a (Partially Observable) Stochastic Game. Given the significant complexity of learning or planning in a stochastic game, we make three key contributions: (a) To exploit infinitesimally small contribution of each agent and anonymity (reward and transitions between agents are dependent on agent counts) in interactions, we represent this as a Multi-Agent Reinforcement Learning (MARL) problem that builds on insights from non-atomic congestion games model; (b) We provide a novel variance reduction mechanism for moving joint solution towards Nash Equilibrium that exploits the infinitesimally small contribution of each agent; and finally (c) We provide detailed results on three different domains to demonstrate the utility of our approach in comparison to state-of-the-art methods.

A Multi-Agent Reinforcement Learning Approach For Safe and Efficient Behavior Planning Of Connected Autonomous Vehicles

Authors:Songyang Han, Shanglin Zhou, Jiangwei Wang, Lynn Pepin, Caiwen Ding, Jie Fu, Fei Miao
Date:2020-03-09 19:15:30

The recent advancements in wireless technology enable connected autonomous vehicles (CAVs) to gather information about their environment by vehicle-to-vehicle (V2V) communication. In this work, we design an information-sharing-based multi-agent reinforcement learning (MARL) framework for CAVs, to take advantage of the extra information when making decisions to improve traffic efficiency and safety. The safe actor-critic algorithm we propose has two new techniques: the truncated Q-function and safe action mapping. The truncated Q-function utilizes the shared information from neighboring CAVs such that the joint state and action spaces of the Q-function do not grow in our algorithm for a large-scale CAV system. We prove the bound of the approximation error between the truncated-Q and global Q-functions. The safe action mapping provides a provable safety guarantee for both the training and execution based on control barrier functions. Using the CARLA simulator for experiments, we show that our approach can improve the CAV system's efficiency in terms of average velocity and comfort under different CAV ratios and different traffic densities. We also show that our approach avoids the execution of unsafe actions and always maintains a safe distance from other vehicles. We construct an obstacle-at-corner scenario to show that the shared vision can help CAVs to observe obstacles earlier and take action to avoid traffic jams.

"Other-Play" for Zero-Shot Coordination

Authors:Hengyuan Hu, Adam Lerer, Alex Peysakhovich, Jakob Foerster
Date:2020-03-06 00:39:37

We consider the problem of zero-shot coordination - constructing AI agents that can coordinate with novel partners they have not seen before (e.g. humans). Standard Multi-Agent Reinforcement Learning (MARL) methods typically focus on the self-play (SP) setting where agents construct strategies by playing the game with themselves repeatedly. Unfortunately, applying SP naively to the zero-shot coordination problem can produce agents that establish highly specialized conventions that do not carry over to novel partners they have not been trained with. We introduce a novel learning algorithm called other-play (OP), which enhances self-play by looking for more robust strategies, exploiting the presence of known symmetries in the underlying problem. We characterize OP theoretically as well as experimentally. We study the cooperative card game Hanabi and show that OP agents achieve higher scores when paired with independently trained agents. In preliminary results we also show that our OP agents obtain higher average scores when paired with human players, compared to state-of-the-art SP agents.

Reward Design in Cooperative Multi-agent Reinforcement Learning for Packet Routing

Authors:Hangyu Mao, Zhibo Gong, Zhen Xiao
Date:2020-03-05 02:27:46

In cooperative multi-agent reinforcement learning (MARL), how to design a suitable reward signal to accelerate learning and stabilize convergence is a critical problem. The global reward signal assigns the same global reward to all agents without distinguishing their contributions, while the local reward signal provides different local rewards to each agent based solely on individual behavior. Both reward assignment approaches have shortcomings: the former might encourage lazy agents, while the latter might produce selfish agents. In this paper, we study the reward design problem in cooperative MARL based on packet routing environments. Firstly, we show that the above two reward signals are prone to produce suboptimal policies. Then, inspired by some observations and considerations, we design mixed reward signals that can be used off-the-shelf to learn better policies. Finally, we turn the mixed reward signals into adaptive counterparts, which achieve the best results in our experiments. Other reward signals are also discussed in this paper. As reward design is a very fundamental problem in RL and especially in MARL, we hope that MARL researchers can rethink the rewards used in their systems.
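
The simplest mixed signal interpolates the two extremes discussed above; a one-line sketch (the blending weight `beta` and its adaptation schedule are placeholders, not the paper's exact design):

```python
def mixed_reward(global_reward, local_rewards, beta=0.5):
    """Blend the shared team reward (which can breed lazy agents) with each
    agent's local reward (which can breed selfish agents).  In an adaptive
    variant, beta itself would be adjusted during training.
    """
    return [beta * global_reward + (1.0 - beta) * r_i for r_i in local_rewards]
```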

Scaling Up Multiagent Reinforcement Learning for Robotic Systems: Learn an Adaptive Sparse Communication Graph

Authors:Chuangchuang Sun, Macheng Shen, Jonathan P. How
Date:2020-03-02 17:18:25

The complexity of multiagent reinforcement learning (MARL) in multiagent systems increases exponentially with the number of agents. This scalability issue prevents MARL from being applied in large-scale multiagent systems. However, one critical feature of MARL that is often neglected is that the interactions between agents are quite sparse. Without exploiting this sparsity structure, existing works aggregate information from all of the agents and thus have a high sample complexity. To address this issue, we propose an adaptive sparse attention mechanism by generalizing a sparsity-inducing activation function. A sparse communication graph in MARL is then learned by graph neural networks based on this new attention mechanism. Through this sparsity structure, the agents can communicate in an effective as well as efficient way by selectively attending only to the agents that matter most, and thus the scale of the MARL problem is reduced with little optimality compromised. Comparative results show that our algorithm can learn an interpretable sparse structure and outperforms previous works by a significant margin on applications involving a large-scale multiagent system.
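
For concreteness, here is a sketch of one well-known sparsity-inducing attention activation, sparsemax (Martins & Astudillo, 2016), which projects scores onto the simplex and yields exact zeros so that each agent attends only to the few agents that matter. The paper generalizes this family of activations; we show plain sparsemax only as a representative instance.

```python
import torch

def sparsemax(scores: torch.Tensor) -> torch.Tensor:
    # scores: (..., n_agents) attention logits over other agents
    z, _ = torch.sort(scores, dim=-1, descending=True)
    k = torch.arange(1, scores.shape[-1] + 1, device=scores.device, dtype=scores.dtype)
    cssv = z.cumsum(dim=-1) - 1.0                 # cumulative sums shifted by 1
    support = (z * k) > cssv                      # 1 + k*z_k > cumsum_k
    k_max = support.sum(dim=-1, keepdim=True)     # size of the support set
    tau = cssv.gather(-1, k_max - 1) / k_max.to(scores.dtype)
    return torch.clamp(scores - tau, min=0.0)     # exact zeros outside support
```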

Dynamic Power Allocation and Virtual Cell Formation for Throughput-Optimal Vehicular Edge Networks in Highway Transportation

Authors:Md Ferdous Pervej, Shih-Chun Lin
Date:2020-02-24 22:52:26

This paper investigates highly mobile vehicular networks from users' perspectives in highway transportation. Particularly, a centralized software-defined architecture is introduced in which centralized resources can be assigned, programmed, and controlled using the anchor nodes (ANs) of the edge servers. Unlike legacy networks, where a typical user is served from only one access point (AP), in the proposed system model a vehicle user is served from multiple APs simultaneously. While this increases the reliability and the spectral efficiency of the assisted users, it also necessitates accurate power allocation in all transmission time slots. As such, a joint user association and power allocation problem is formulated to achieve enhanced reliability and weighted user sum rate. However, the formulated problem is a complex combinatorial problem that is remarkably hard to solve. Therefore, fine-grained machine learning algorithms are used to efficiently optimize the joint user associations and power allocations of the APs in a highly mobile vehicular network. Furthermore, a distributed single-agent reinforcement learning algorithm, namely SARL-MARL, is proposed, which obtains solutions nearly identical to the genie-aided optimal ones within a nominal number of training episodes, unlike the baseline solution. Simulation results validate that our solution outperforms existing schemes and can attain genie-aided optimal performances.

Scalable Multi-Agent Inverse Reinforcement Learning via Actor-Attention-Critic

Authors:Wonseok Jeon, Paul Barde, Derek Nowrouzezahrai, Joelle Pineau
Date:2020-02-24 20:30:45

Multi-agent adversarial inverse reinforcement learning (MA-AIRL) is a recent approach that applies single-agent AIRL to multi-agent problems where we seek to recover both policies for our agents and reward functions that promote expert-like behavior. While MA-AIRL has promising results on cooperative and competitive tasks, it is sample-inefficient and has only been validated empirically for small numbers of agents -- its ability to scale to many agents remains an open question. We propose a multi-agent inverse RL algorithm that is more sample-efficient and scalable than previous works. Specifically, we employ multi-agent actor-attention-critic (MAAC) -- an off-policy multi-agent RL (MARL) method -- for the RL inner loop of the inverse RL procedure. In doing so, we are able to increase sample efficiency compared to state-of-the-art baselines, across both small- and large-scale tasks. Moreover, the RL agents trained on the rewards recovered by our method better match the experts than those trained on the rewards derived from the baselines. Finally, our method requires far fewer agent-environment interactions, particularly as the number of agents increases.

Multi-Agent Reinforcement Learning as a Computational Tool for Language Evolution Research: Historical Context and Future Challenges

Authors:Clément Moulin-Frier, Pierre-Yves Oudeyer
Date:2020-02-20 17:26:46

Computational models of emergent communication in agent populations are currently gaining interest in the machine learning community due to recent advances in Multi-Agent Reinforcement Learning (MARL). Current contributions are however still relatively disconnected from the earlier theoretical and computational literature aiming at understanding how language might have emerged from a prelinguistic substance. The goal of this paper is to position recent MARL contributions within the historical context of language evolution research, as well as to extract from this theoretical and computational background a few challenges for future research.

An Efficient Transfer Learning Framework for Multiagent Reinforcement Learning

Authors:Tianpei Yang, Weixun Wang, Hongyao Tang, Jianye Hao, Zhaopeng Meng, Hangyu Mao, Dong Li, Wulong Liu, Chengwei Zhang, Yujing Hu, Yingfeng Chen, Changjie Fan
Date:2020-02-19 07:01:38

Transfer Learning has shown great potential to enhance single-agent Reinforcement Learning (RL) efficiency. Similarly, Multiagent RL (MARL) can also be accelerated if agents can share knowledge with each other. However, it remains an open problem how an agent should learn from other agents. In this paper, we propose a novel Multiagent Policy Transfer Framework (MAPTF) to improve MARL efficiency. MAPTF learns which agent's policy is the best for each agent to reuse, and when to terminate it, by modeling multiagent policy transfer as the option learning problem. Furthermore, in practice, the option module can only collect all agents' local experiences for its update due to the partial observability of the environment. In this setting, each agent's experience may be inconsistent with the others', which may cause inaccuracy and oscillation in the option-value estimation. Therefore, we propose a novel option learning algorithm, successor representation option learning, which solves this by decoupling the environment dynamics from rewards and learning the option-value under each agent's preference. MAPTF can be easily combined with existing deep RL and MARL approaches, and experimental results show it significantly boosts the performance of existing methods in both discrete and continuous state spaces.

Reward Design for Driver Repositioning Using Multi-Agent Reinforcement Learning

Authors:Zhenyu Shou, Xuan Di
Date:2020-02-17 00:10:58

A large portion of passenger requests is reportedly unserviced, partially due to vacant for-hire drivers' cruising behavior during the passenger seeking process. This paper aims to model the multi-driver repositioning task through a mean field multi-agent reinforcement learning (MARL) approach that captures competition among multiple agents. Because the direct application of MARL to the multi-driver system under a given reward mechanism will likely yield a suboptimal equilibrium due to the selfishness of drivers, this study proposes a reward design scheme with which a more desirable equilibrium can be reached. To effectively solve the bilevel optimization problem, with reward design at the upper level and the multi-agent system at the lower level, a Bayesian optimization (BO) algorithm is adopted to speed up the learning process. We then apply the bilevel optimization model to two case studies, namely, e-hailing driver repositioning under service charge and multiclass taxi driver repositioning under NYC congestion pricing. In the first case study, the model is validated by the agreement between the derived optimal control from BO and that from an analytical solution. With a simple piecewise linear service charge, the objective of the e-hailing platform can be increased by 8.4%. In the second case study, an optimal toll charge of $5.1 is found using BO, which improves the objective of city planners by 7.9%, compared to that without any toll charge. Under this optimal toll charge, the number of taxis in the NYC central business district is decreased, indicating a better traffic condition, without substantially increasing the crowdedness of the subway system.

R-MADDPG for Partially Observable Environments and Limited Communication

Authors:Rose E. Wang, Michael Everett, Jonathan P. How
Date:2020-02-16 21:25:44

There are several real-world tasks that would benefit from applying multiagent reinforcement learning (MARL) algorithms, including the coordination among self-driving cars. The real world has challenging conditions for multiagent learning systems, such as its partially observable and nonstationary nature. Moreover, if agents must share a limited resource (e.g. network bandwidth), they must all learn how to coordinate resource use. This paper introduces a deep recurrent multiagent actor-critic framework (R-MADDPG) for handling multiagent coordination under partially observable settings and limited communication. We investigate the effects of recurrency on the performance and communication use of a team of agents. We demonstrate that the resulting framework learns time dependencies for sharing missing observations, handling resource limitations, and developing different communication patterns among agents.
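
A hedged sketch of the recurrent piece of this design: each agent's actor keeps an LSTM hidden state across timesteps, so decisions can depend on the history of partial observations and past messages rather than the current observation alone. Layer sizes and the message channel are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    """Per-agent actor whose memory persists across timesteps."""
    def __init__(self, obs_dim, msg_dim, action_dim, hidden=64):
        super().__init__()
        self.rnn = nn.LSTMCell(obs_dim + msg_dim, hidden)
        self.pi = nn.Linear(hidden, action_dim)

    def forward(self, obs, msg, hc=None):
        # obs: (B, obs_dim) local observation; msg: (B, msg_dim) received message
        h, c = self.rnn(torch.cat([obs, msg], dim=-1), hc)
        return torch.tanh(self.pi(h)), (h, c)  # action in [-1, 1], new memory
```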

Multi-Agent Reinforcement Learning and Human Social Factors in Climate Change Mitigation

Authors:Kyle Tilbury, Jesse Hoey
Date:2020-02-12 18:46:48

Many complex real-world problems, such as climate change mitigation, are intertwined with human social factors. Climate change mitigation, a social dilemma made difficult by the inherent complexities of human behavior, has an impact at a global scale. We propose applying multi-agent reinforcement learning (MARL) in this setting to develop intelligent agents that can influence the social factors at play in climate change mitigation. There are ethical, practical, and technical challenges that must be addressed when deploying MARL in this way. In this paper, we present these challenges and outline an approach to address them. Understanding how intelligent agents can be used to impact human social factors is important to prevent their abuse and can be beneficial in furthering our knowledge of these complex problems as a whole. The challenges we present are not limited to our specific application but are applicable to broader MARL. Thus, developing MARL for social factors in climate change mitigation helps address general problems hindering MARL's applicability to other real-world problems while also motivating discussion on the social implications of MARL deployment.

Learning Structured Communication for Multi-agent Reinforcement Learning

Authors:Junjie Sheng, Xiangfeng Wang, Bo Jin, Junchi Yan, Wenhao Li, Tsung-Hui Chang, Jun Wang, Hongyuan Zha
Date:2020-02-11 07:19:45

This work explores the large-scale multi-agent communication mechanism under a multi-agent reinforcement learning (MARL) setting. We summarize the general categories of topology for communication structures in the MARL literature, which are often manually specified. We then propose a novel framework, termed Learning Structured Communication (LSC), that uses a more flexible and efficient communication topology. Our framework allows for adaptive agent grouping to form different hierarchical formations over episodes, generated by an auxiliary task combined with a hierarchical routing protocol. Given each formed topology, a hierarchical graph neural network is learned to enable effective message generation and propagation among inter- and intra-group communications. In contrast to existing communication mechanisms, our method has an explicit yet learnable design for hierarchical communication. Experiments on challenging tasks show that the proposed LSC enjoys high communication efficiency, scalability, and global cooperation capability.

Mean-Field Controls with Q-learning for Cooperative MARL: Convergence and Complexity Analysis

Authors:Haotian Gu, Xin Guo, Xiaoli Wei, Renyuan Xu
Date:2020-02-10 23:30:39

Multi-agent reinforcement learning (MARL), despite its popularity and empirical success, suffers from the curse of dimensionality. This paper builds the mathematical framework to approximate cooperative MARL by a mean-field control (MFC) approach, and shows that the approximation error is of $\mathcal{O}(\frac{1}{\sqrt{N}})$. By establishing an appropriate form of the dynamic programming principle for both the value function and the Q function, it proposes a model-free kernel-based Q-learning algorithm (MFC-K-Q), which is shown to have a linear convergence rate for the MFC problem, the first of its kind in the MARL literature. It further establishes that the convergence rate and the sample complexity of MFC-K-Q are independent of the number of agents $N$, which provides an $\mathcal{O}(\frac{1}{\sqrt{N}})$ approximation to the MARL problem with $N$ agents in the learning environment. Empirical studies for the network traffic congestion problem demonstrate that MFC-K-Q outperforms existing MARL algorithms when $N$ is large, for instance when $N>50$.
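
As a toy illustration of the mean-field lifting behind this approach, the sketch below replaces the joint N-agent Q-function with a tabular Q over the empirical distribution of agent states, whose dimension is independent of N. The coarse histogram discretization stands in for the paper's kernel regression and is purely our assumption.

```python
import numpy as np

n_states, n_actions = 4, 3

def mean_field_state(agent_states):
    # histogram over local states; its dimension does not grow with the
    # number of agents, which is the point of the mean-field approximation
    mu = np.bincount(agent_states, minlength=n_states) / len(agent_states)
    return tuple(np.round(mu, 1))  # coarse discretization of the simplex

Q = {}  # maps (discretized mu, action) -> value

def q_update(mu, a, r, mu_next, alpha=0.1, gamma=0.95):
    # standard Q-learning update, but on the lifted mean-field state
    best_next = max(Q.get((mu_next, b), 0.0) for b in range(n_actions))
    old = Q.get((mu, a), 0.0)
    Q[(mu, a)] = old + alpha * (r + gamma * best_next - old)
```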

Q-value Path Decomposition for Deep Multiagent Reinforcement Learning

Authors:Yaodong Yang, Jianye Hao, Guangyong Chen, Hongyao Tang, Yingfeng Chen, Yujing Hu, Changjie Fan, Zhongyu Wei
Date:2020-02-10 17:03:58

Recently, deep multiagent reinforcement learning (MARL) has become a highly active research area as many real-world problems can be inherently viewed as multiagent systems. A particularly interesting and widely applicable class of problems is the partially observable cooperative multiagent setting, in which a team of agents learns to coordinate their behaviors conditioned on their private observations and a commonly shared global reward signal. One natural solution is to resort to the centralized training and decentralized execution paradigm. During centralized training, one key challenge is multiagent credit assignment: how to allocate the global rewards to individual agent policies for better coordination towards maximizing system-level benefits. In this paper, we propose a new method called Q-value Path Decomposition (QPD) to decompose the system's global Q-values into individual agents' Q-values. Unlike previous works, which restrict the representation relation between the individual Q-values and the global one, we leverage the integrated gradients attribution technique in deep MARL to directly decompose global Q-values along trajectory paths to assign credits to agents. We evaluate QPD on the challenging StarCraft II micromanagement tasks and show that QPD achieves state-of-the-art performance in both homogeneous and heterogeneous multiagent scenarios compared with existing cooperative MARL algorithms.
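
To ground the attribution step, here is a sketch of integrated gradients applied to credit assignment: each agent's credit is the path integral of the gradient of the global Q along a straight line from a baseline input to the actual input. The `q_net` interface, feature shapes, and the zero baseline are our assumptions for illustration, not QPD's exact pipeline.

```python
import torch

def integrated_gradients(q_net, agent_feats, baseline=None, steps=50):
    # agent_feats: (n_agents, feat_dim); q_net maps the flat joint input to Q_tot
    if baseline is None:
        baseline = torch.zeros_like(agent_feats)
    total_grad = torch.zeros_like(agent_feats)
    for k in range(1, steps + 1):
        # interpolate along the straight-line path from baseline to input
        x = baseline + (k / steps) * (agent_feats - baseline)
        x.requires_grad_(True)
        q = q_net(x.flatten()).sum()      # scalar global Q value
        grad, = torch.autograd.grad(q, x)
        total_grad += grad
    attribution = (agent_feats - baseline) * total_grad / steps
    return attribution.sum(dim=-1)        # one credit score per agent
```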

Qatten: A General Framework for Cooperative Multiagent Reinforcement Learning

Authors:Yaodong Yang, Jianye Hao, Ben Liao, Kun Shao, Guangyong Chen, Wulong Liu, Hongyao Tang
Date:2020-02-10 16:48:03

In many real-world tasks, multiple agents must learn to coordinate with each other given their private observations and limited communication ability. Deep multiagent reinforcement learning (Deep-MARL) algorithms have shown superior performance in such challenging settings. One representative class of work is multiagent value decomposition, which decomposes the global shared multiagent Q-value $Q_{tot}$ into individual Q-values $Q^{i}$ to guide individuals' behaviors, e.g. VDN imposing an additive form and QMIX adopting a monotonic assumption using an implicit mixing method. However, most of the previous efforts impose certain assumptions between $Q_{tot}$ and $Q^{i}$ and lack theoretical grounding. Besides, they do not explicitly consider the agent-level impact of individuals on the whole system when transforming individual $Q^{i}$s into $Q_{tot}$. In this paper, we theoretically derive a general formula of $Q_{tot}$ in terms of $Q^{i}$, based on which we can naturally implement a multi-head attention formation to approximate $Q_{tot}$, resulting in not only a refined representation of $Q_{tot}$ with an agent-level attention mechanism, but also a tractable maximization algorithm for decentralized policies. Extensive experiments demonstrate that our method outperforms state-of-the-art MARL methods on the widely adopted StarCraft benchmark across different scenarios, and attention analysis is further conducted with valuable insights.
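
A minimal sketch of this style of attention mixing: per-head attention weights, computed from a global state and per-agent features, combine the individual $Q^{i}$ values into $Q_{tot}$. Keeping the weights non-negative (via softmax) preserves the monotonicity needed for decentralized argmax. All sizes and the module layout are illustrative assumptions, not the paper's exact network.

```python
import torch
import torch.nn as nn

class AttentionMixer(nn.Module):
    """Mixes per-agent Q values into Q_tot with multi-head attention weights."""
    def __init__(self, state_dim, agent_feat_dim, n_heads=4, key_dim=32):
        super().__init__()
        self.query = nn.Linear(state_dim, n_heads * key_dim)
        self.key = nn.Linear(agent_feat_dim, key_dim)
        self.n_heads, self.key_dim = n_heads, key_dim

    def forward(self, state, agent_feats, q_i):
        # state: (B, state_dim); agent_feats: (B, N, feat_dim); q_i: (B, N)
        q = self.query(state).view(-1, self.n_heads, self.key_dim)
        k = self.key(agent_feats)                        # (B, N, key_dim)
        logits = torch.einsum('bhd,bnd->bhn', q, k) / self.key_dim ** 0.5
        w = torch.softmax(logits, dim=-1)                # non-negative weights
        return torch.einsum('bhn,bn->b', w, q_i)         # sum over heads and agents
```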

Regret Bounds for Decentralized Learning in Cooperative Multi-Agent Dynamical Systems

Authors:Seyed Mohammad Asghari, Yi Ouyang, Ashutosh Nayyar
Date:2020-01-27 23:37:41

Regret analysis is challenging in Multi-Agent Reinforcement Learning (MARL) primarily due to the dynamical environments and the decentralized information among agents. We attempt to solve this challenge in the context of decentralized learning in multi-agent linear-quadratic (LQ) dynamical systems. We begin with a simple setup consisting of two agents and two dynamically decoupled stochastic linear systems, each system controlled by an agent. The systems are coupled through a quadratic cost function. When both systems' dynamics are unknown and there is no communication among the agents, we show that no learning policy can generate sub-linear in $T$ regret, where $T$ is the time horizon. When only one system's dynamics are unknown and there is one-directional communication from the agent controlling the unknown system to the other agent, we propose a MARL algorithm based on the construction of an auxiliary single-agent LQ problem. The auxiliary single-agent problem in the proposed MARL algorithm serves as an implicit coordination mechanism among the two learning agents. This allows the agents to achieve a regret within $O(\sqrt{T})$ of the regret of the auxiliary single-agent problem. Consequently, using existing results for single-agent LQ regret, our algorithm provides a $\tilde{O}(\sqrt{T})$ regret bound. (Here $\tilde{O}(\cdot)$ hides constants and logarithmic factors). Our numerical experiments indicate that this bound is matched in practice. From the two-agent problem, we extend our results to multi-agent LQ systems with certain communication patterns.

On Solving Cooperative MARL Problems with a Few Good Experiences

Authors:Rajiv Ranjan Kumar, Pradeep Varakantham
Date:2020-01-22 12:53:53

Cooperative Multi-agent Reinforcement Learning (MARL) is crucial for cooperative decentralized decision learning in many domains such as search and rescue, drone surveillance, package delivery and fire fighting problems. In these domains, a key challenge is learning with a few good experiences, i.e., positive reinforcements are obtained only in a few situations (e.g., on extinguishing a fire, tracking a crime, or delivering a package) and in most other situations there is zero or negative reinforcement. Learning decisions with a few good experiences is extremely challenging in cooperative MARL problems for three reasons. First, compared to the single-agent case, exploration is harder, as multiple agents have to be coordinated to receive a good experience. Second, the environment is not stationary, as all the agents are learning at the same time (and hence changing policies). Third, the scale of the problem increases significantly with every additional agent. Relevant existing work is extensive and has focused on dealing with a few good experiences in single-agent RL problems or on scalable approaches for handling non-stationarity in MARL problems. Unfortunately, neither of these approaches (or their extensions) is able to address the problem of sparse good experiences effectively. Therefore, we provide a novel fictitious self-imitation approach that is able to simultaneously handle non-stationarity and sparse good experiences in a scalable manner. Finally, we provide a thorough comparison (experimental or descriptive) against relevant cooperative MARL algorithms to demonstrate the utility of our approach.

Individual specialization in multi-task environments with multiagent reinforcement learners

Authors:Marco Jerome Gasparrini, Ricard Solé, Martí Sánchez-Fibla
Date:2019-12-29 15:20:24

There is a growing interest in Multi-Agent Reinforcement Learning (MARL) as a first step towards building general intelligent agents that learn to make low- and high-level decisions in non-stationary, complex environments in the presence of other agents. Previous results point us towards increased conditions for coordination, efficiency/fairness, and common-pool resource sharing. We further study coordination in multi-task environments where several rewarding tasks can be performed, and thus agents don't necessarily need to perform well on all tasks but may specialize under certain conditions. An observation derived from the study is that epsilon-greedy exploration in value-based reinforcement learning methods is not adequate for multi-agent independent learners, because the epsilon parameter that controls the probability of selecting a random action synchronizes the agents artificially and forces them to have deterministic policies at the same time. By using policy-based methods with independent entropy-regularised exploration updates, we achieved a better and smoother convergence. Another result that needs to be further investigated is that with an increased number of agents, specialization tends to become more probable.

Resolving Congestions in the Air Traffic Management Domain via Multiagent Reinforcement Learning Methods

Authors:Theocharis Kravaris, Christos Spatharis, Alevizos Bastas, George A. Vouros, Konstantinos Blekas, Gennady Andrienko, Natalia Andrienko, Jose Manuel Cordero Garcia
Date:2019-12-14 15:06:35

In this article, we report on the efficiency and effectiveness of multiagent reinforcement learning (MARL) methods for the computation of flight delays to resolve congestion problems in the Air Traffic Management (ATM) domain. Specifically, we aim to resolve cases where the demand for airspace use exceeds capacity (demand-capacity problems) by imposing ground delays on flights at the pre-tactical stage of operations (i.e. a few days to a few hours before operation). Casting this into the multiagent domain, agents representing flights need to decide on their own delays w.r.t. their own preferences, having no information about others' payoffs, preferences and constraints, while they plan to execute their trajectories jointly with others, adhering to operational constraints. Specifically, we formalize the problem as a multiagent Markov Decision Process (MA-MDP) and we show that it can be considered as a Markov game in which interacting agents need to reach an equilibrium. What makes the problem more interesting is the dynamic setting in which agents operate, which is also due to the unforeseen, emergent effects of their decisions on the whole system. We propose collaborative multiagent reinforcement learning methods to resolve demand-capacity imbalances. An extensive experimental study on real-world cases shows the potential of the proposed approaches in resolving such problems, while advanced visualizations provide detailed views towards understanding the quality of the solutions provided.

Decentralized Multi-Agent Reinforcement Learning with Networked Agents: Recent Advances

Authors:Kaiqing Zhang, Zhuoran Yang, Tamer Başar
Date:2019-12-09 02:33:57

Multi-agent reinforcement learning (MARL) has long been a significant research topic in both machine learning and control. With the recent development of (single-agent) deep RL, there is a resurgence of interest in developing new MARL algorithms, especially those backed by theoretical analysis. In this paper, we review some recent advances in a sub-area of this topic: decentralized MARL with networked agents. Specifically, multiple agents perform sequential decision-making in a common environment, without the coordination of any central controller, while being allowed to exchange information with their neighbors over a communication network. Such a setting finds broad applications in the control and operation of robots, unmanned vehicles, mobile sensor networks, and the smart grid. This review is built upon several of our research endeavors in this direction, together with some progress made by other researchers along this line. We hope this review will inspire the devotion of more research efforts to this exciting yet challenging area.

Hierarchical Cooperative Multi-Agent Reinforcement Learning with Skill Discovery

Authors:Jiachen Yang, Igor Borovikov, Hongyuan Zha
Date:2019-12-07 20:41:32

Human players in professional team sports achieve high-level coordination by dynamically choosing complementary skills and executing primitive actions to perform these skills. As a step toward creating intelligent agents with this capability for fully cooperative multi-agent settings, we propose a two-level hierarchical multi-agent reinforcement learning (MARL) algorithm with unsupervised skill discovery. Agents learn useful and distinct skills at the low level via independent Q-learning, while they learn to select complementary latent skill variables at the high level via centralized multi-agent training with an extrinsic team reward. The set of low-level skills emerges from an intrinsic reward that solely promotes the decodability of latent skill variables from the trajectory of a low-level skill, without the need for hand-crafted rewards for each skill. For scalable decentralized execution, each agent independently chooses latent skill variables and primitive actions based on local observations. Our overall method enables the use of general cooperative MARL algorithms for training high-level policies and single-agent RL for training low-level skills. Experiments on a stochastic high-dimensional team game show the emergence of useful skills and cooperative team play. The interpretability of the learned skills shows the promise of the proposed method for achieving human-AI cooperation in team sports games.
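
A sketch of the decodability-based intrinsic reward: a discriminator tries to recover the latent skill z from a trajectory summary, and the skill policy is rewarded by the log-likelihood the discriminator assigns to the true z. This mirrors the stated intrinsic objective; the discriminator architecture and trajectory featurization are our assumptions.

```python
import torch
import torch.nn as nn

class SkillDiscriminator(nn.Module):
    """Predicts the latent skill from a low-level trajectory summary."""
    def __init__(self, traj_feat_dim, n_skills, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(traj_feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_skills))

    def intrinsic_reward(self, traj_feat, z):
        # traj_feat: (B, traj_feat_dim) summary of a low-level trajectory
        # z: (B,) long tensor of true skill indices
        logp = torch.log_softmax(self.net(traj_feat), dim=-1)
        return logp.gather(-1, z.unsqueeze(-1)).squeeze(-1)  # log q(z | traj)
```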

Neighborhood Cognition Consistent Multi-Agent Reinforcement Learning

Authors:Hangyu Mao, Wulong Liu, Jianye Hao, Jun Luo, Dong Li, Zhengchao Zhang, Jun Wang, Zhen Xiao
Date:2019-12-03 02:34:11

Social psychology and real experiences show that cognitive consistency plays an important role in keeping human society in order: if people have a more consistent cognition about their environments, they are more likely to achieve better cooperation. Meanwhile, only cognitive consistency within a neighborhood matters, because humans interact directly only with their neighbors. Inspired by these observations, we take the first step to introduce \emph{neighborhood cognitive consistency} (NCC) into multi-agent reinforcement learning (MARL). Our NCC design is quite general and can be easily combined with existing MARL methods. As examples, we propose neighborhood cognition consistent deep Q-learning and Actor-Critic to facilitate large-scale multi-agent cooperation. Extensive experiments on several challenging tasks (i.e., packet routing, wifi configuration, and Google football player control) justify the superior performance of our methods compared with state-of-the-art MARL approaches.
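
One simple way to instantiate a neighborhood consistency penalty: each agent keeps a "cognition" embedding of its local view, and a regularizer pulls the embeddings of neighboring agents together so that nearby agents share a consistent picture of the environment. The squared-distance form below is our assumption of a minimal instantiation, not the paper's exact loss.

```python
import torch

def ncc_loss(h: torch.Tensor, adjacency: torch.Tensor) -> torch.Tensor:
    # h: (N, d) cognition embeddings; adjacency: (N, N) 0/1 neighbor mask
    diff = h.unsqueeze(0) - h.unsqueeze(1)      # (N, N, d) pairwise differences
    sq = (diff ** 2).sum(-1)                    # (N, N) squared distances
    # average squared distance over neighboring pairs only
    return (adjacency * sq).sum() / adjacency.sum().clamp(min=1)
```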

Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms

Authors:Kaiqing Zhang, Zhuoran Yang, Tamer Başar
Date:2019-11-24 22:50:32

Recent years have witnessed significant advances in reinforcement learning (RL), which has registered great success in solving various sequential decision-making problems in machine learning. Most of the successful RL applications, e.g., the games of Go and Poker, robotics, and autonomous driving, involve the participation of more than one single agent, and thus naturally fall into the realm of multi-agent RL (MARL), a domain with a relatively long history that has recently re-emerged due to advances in single-agent RL techniques. Though empirically successful, theoretical foundations for MARL are relatively lacking in the literature. In this chapter, we provide a selective overview of MARL, with a focus on algorithms backed by theoretical analysis. More specifically, we review the theoretical results of MARL algorithms mainly within two representative frameworks, Markov/stochastic games and extensive-form games, in accordance with the types of tasks they address, i.e., fully cooperative, fully competitive, and a mix of the two. We also introduce several significant but challenging applications of these algorithms. Orthogonal to the existing reviews on MARL, we highlight several new angles and taxonomies of MARL theory, including learning in extensive-form games, decentralized MARL with networked agents, MARL in the mean-field regime, (non-)convergence of policy-based methods for learning in games, etc. Some of the new angles extrapolate from our own research endeavors and interests. Our overall goal with this chapter is, beyond providing an assessment of the current state of the field, to identify fruitful future research directions on theoretical studies of MARL. We expect this chapter to serve as a continuing stimulus for researchers interested in working on this exciting while challenging topic.

Inducing Cooperation via Team Regret Minimization based Multi-Agent Deep Reinforcement Learning

Authors:Runsheng Yu, Zhenyu Shi, Xinrun Wang, Rundong Wang, Buhong Liu, Xinwen Hou, Hanjiang Lai, Bo An
Date:2019-11-18 15:41:15

Existing value-factorization based Multi-Agent deep Reinforcement Learning (MARL) approaches perform well in various multi-agent cooperative environments under the centralized training and decentralized execution (CTDE) scheme, where all agents are trained together by the centralized value network and each agent executes its policy independently. However, an issue remains open: in the centralized training process, when the environment for the team is partially observable or non-stationary, i.e., the observation and action information of all the agents cannot represent the global states, existing methods perform poorly and sample-inefficiently. Regret Minimization (RM) can be a promising approach, as it performs well in partially observable and fully competitive settings. However, it tends to model others as opponents and thus cannot work well under the CTDE scheme. In this work, we propose a novel team-RM-based Bayesian MARL with three key contributions: (a) we design a novel RM method to train cooperative agents as a team and obtain a team regret-based policy for that team; (b) we introduce a novel method to decompose the team regret to generate the policy for each agent for decentralized execution; (c) to further improve the performance, we leverage a differential particle filter (a Sequential Monte Carlo method) network to get an accurate estimation of the state for each agent. Experimental results on two-step matrix games (cooperative games) and battle games (large-scale mixed cooperative-competitive games) demonstrate that our algorithm significantly outperforms state-of-the-art methods.

UW-MARL: Multi-Agent Reinforcement Learning for Underwater Adaptive Sampling using Autonomous Vehicles

Authors:Mehdi Rahmati, Mohammad Nadeem, Vidyasagar Sadhu, Dario Pompili
Date:2019-11-11 06:18:04

Near-real-time monitoring of different water-quality variables in uncertain environments such as rivers, lakes, and water reservoirs is critical to protecting aquatic life and preventing the further propagation of potential pollution in the water. In order to measure the physical values in a region of interest, adaptive sampling is helpful as an energy- and time-efficient technique, since an exhaustive search of an area is not feasible with a single vehicle. We propose an adaptive sampling algorithm using multiple well-trained autonomous vehicles as agents in a Multi-Agent Reinforcement Learning (MARL) framework to make an efficient sequence of decisions on the adaptive sampling procedure. The proposed solution is evaluated using experimental data, which is fed into a simulation framework. Experiments were conducted in the Raritan River, Somerset and in Carnegie Lake, Princeton, NJ during July 2019.

SMIX($\lambda$): Enhancing Centralized Value Functions for Cooperative Multi-Agent Reinforcement Learning

Authors:Xinghu Yao, Chao Wen, Yuhui Wang, Xiaoyang Tan
Date:2019-11-11 05:56:13

Learning a stable and generalizable centralized value function (CVF) is a crucial but challenging task in multi-agent reinforcement learning (MARL), as it has to deal with the issue that the joint action space increases exponentially with the number of agents. This paper proposes an approach, named SMIX(${\lambda}$), to address the issue using an efficient off-policy centralized training method within a flexible learner search space. As importance sampling for such off-policy training is both computationally costly and numerically unstable, we propose to use the ${\lambda}$-return as a proxy to compute the TD error. With this new loss function objective, we adopt a modified QMIX network structure as the base to train our model. By further connecting it with the ${Q(\lambda)}$ approach from a unified expectation-correction viewpoint, we show that the proposed SMIX(${\lambda}$) is equivalent to ${Q(\lambda)}$ and hence shares its convergence properties, while not suffering from the aforementioned curse of dimensionality inherent in MARL. Experiments on the StarCraft Multi-Agent Challenge (SMAC) benchmark demonstrate that our approach not only outperforms several state-of-the-art MARL methods by a large margin, but also can be used as a general tool to improve the overall performance of other CTDE-type algorithms by enhancing their CVFs.
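
For reference, here is a generic sketch of the $\lambda$-return target used in place of importance-sampled off-policy corrections: a recursive blend of one-step TD targets and longer Monte Carlo returns along the sampled trajectory, $G_t^{\lambda} = r_t + \gamma\,[(1-\lambda)V(s_{t+1}) + \lambda G_{t+1}^{\lambda}]$. The value-function interface is a simplification; SMIX(${\lambda}$) applies this to a QMIX-style centralized network.

```python
import torch

def lambda_returns(rewards, values, gamma=0.99, lam=0.8):
    # rewards: (T,); values: (T+1,) centralized value estimates V(s_0..s_T)
    G = torch.zeros_like(rewards)
    next_ret = values[-1]                 # bootstrap from the final state
    for t in reversed(range(len(rewards))):
        next_ret = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * next_ret)
        G[t] = next_ret
    return G  # regression targets for the centralized value function
```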

Finite-Sample Analysis of Decentralized Temporal-Difference Learning with Linear Function Approximation

Authors:Jun Sun, Gang Wang, Georgios B. Giannakis, Qinmin Yang, Zaiyue Yang
Date:2019-11-03 17:30:07

Motivated by the emerging use of multi-agent reinforcement learning (MARL) in engineering applications such as networked robotics, swarming drones, and sensor networks, we investigate the policy evaluation problem in a fully decentralized setting, using temporal-difference (TD) learning with linear function approximation to handle large state spaces in practice. The goal of a group of agents is to collaboratively learn the value function of a given policy from locally private rewards observed in a shared environment, through exchanging local estimates with neighbors. Despite their simplicity and widespread use, our theoretical understanding of such decentralized TD learning algorithms remains limited. Existing results were obtained based on i.i.d. data samples, or by imposing an `additional' projection step to control the `gradient' bias incurred by the Markovian observations. In this paper, we provide a finite-sample analysis of fully decentralized TD(0) learning under both i.i.d. and Markovian samples, and prove that all local estimates converge linearly to a small neighborhood of the optimum. The resulting error bounds are the first of their type---in the sense that they hold under the most practical assumptions---which is made possible by means of a novel multi-step Lyapunov analysis.
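
To make the setting concrete, here is a sketch of one round of decentralized TD(0) with linear function approximation: each agent takes a local TD step using its private reward, then the parameter vectors are averaged over the network via a doubly stochastic mixing matrix W. Dimensions, step sizes, and the interleaving of the two steps are illustrative assumptions.

```python
import numpy as np

def decentralized_td_step(theta, phi_s, phi_s_next, local_rewards, W,
                          alpha=0.05, gamma=0.95):
    # theta: (N, d) one parameter vector per agent
    # phi_s, phi_s_next: (d,) shared feature vectors of the common state
    # local_rewards: (N,) private rewards; W: (N, N) doubly stochastic mixing matrix
    N = theta.shape[0]
    for i in range(N):
        delta = (local_rewards[i] + gamma * phi_s_next @ theta[i]
                 - phi_s @ theta[i])                  # local TD error
        theta[i] = theta[i] + alpha * delta * phi_s   # local TD(0) update
    return W @ theta                                  # consensus/mixing step
```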

Integrating independent and centralized multi-agent reinforcement learning for traffic signal network optimization

Authors:Zhi Zhang, Jiachen Yang, Hongyuan Zha
Date:2019-09-23 23:39:00

Traffic congestion in metropolitan areas is a worldwide problem that can be ameliorated by traffic lights that respond dynamically to real-time conditions. Recent studies applying deep reinforcement learning (RL) to optimize single traffic lights have shown significant improvement over conventional control. However, optimization of global traffic conditions over a large road network is fundamentally a cooperative multi-agent control problem, for which single-agent RL is not suitable due to environment non-stationarity and the infeasibility of optimizing over an exponential joint-action space. Motivated by these challenges, we propose QCOMBO, a simple yet effective multi-agent reinforcement learning (MARL) algorithm that combines the advantages of independent and centralized learning. We ensure scalability by selecting actions from individually optimized utility functions, which are shaped to maximize global performance via a novel consistency regularization loss between the individual utilities and a global action-value function. Experiments on diverse road topologies and traffic flow conditions in the SUMO traffic simulator show competitive performance of QCOMBO versus recent state-of-the-art MARL algorithms. We further show that policies trained on small sub-networks can effectively generalize to larger networks under different traffic flow conditions, providing empirical evidence for the suitability of MARL for intelligent traffic control.
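
A minimal sketch of a consistency regularizer of this kind, under our simplified reading: the independent utilities are trained with an extra penalty tying their combined value to the centralized action-value, so that individually greedy actions stay aligned with global performance. The summation-based combination below is our assumption, not necessarily QCOMBO's exact form.

```python
import torch

def consistency_loss(q_global: torch.Tensor, utilities: torch.Tensor) -> torch.Tensor:
    # q_global: (B,) centralized Q(s, joint action)
    # utilities: (B, N) individual utilities U_i(o_i, a_i)
    # penalize disagreement between the combined utilities and the global critic
    return ((q_global - utilities.sum(dim=-1)) ** 2).mean()
```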

Robust Opponent Modeling via Adversarial Ensemble Reinforcement Learning in Asymmetric Imperfect-Information Games

Authors:Macheng Shen, Jonathan P. How
Date:2019-09-18 23:34:22

This paper presents an algorithmic framework for learning robust policies in asymmetric imperfect-information games, where the joint reward could depend on the uncertain opponent type (a private information known only to the opponent itself and its ally). In order to maximize the reward, the protagonist agent has to infer the opponent type through agent modeling. We use multiagent reinforcement learning (MARL) to learn opponent models through self-play, which captures the full strategy interaction and reasoning between agents. However, agent policies learned from self-play can suffer from mutual overfitting. Ensemble training methods can be used to improve the robustness of agent policy against different opponents, but it also significantly increases the computational overhead. In order to achieve a good trade-off between the robustness of the learned policy and the computation complexity, we propose to train a separate opponent policy against the protagonist agent for evaluation purposes. The reward achieved by this opponent is a noisy measure of the robustness of the protagonist agent policy due to the intrinsic stochastic nature of a reinforcement learner. To handle this stochasticity, we apply a stochastic optimization scheme to dynamically update the opponent ensemble to optimize an objective function that strikes a balance between robustness and computation complexity. We empirically show that, under the same limited computational budget, the proposed method results in more robust policy learning than standard ensemble training.

Three-dimensional simulations of non-resonant streaming instability and particle acceleration near non-relativistic astrophysical shocks

Authors:Allard Jan van Marle, Fabien Casse, Alexandre Marcowith
Date:2019-09-16 01:48:15

We use particle-in-magnetohydrodynamics-cells to model particle acceleration and magnetic field amplification in a high Mach, parallel shock in three dimensions and compare the result to 2-D models. This allows us to determine whether 2-D simulations can be relied upon to yield accurate results in terms of particle acceleration, magnetic field amplification and the growth rate of instabilities. Our simulations show that the behaviour of the gas and the evolution of the instabilities are qualitatively similar for both the 2-D and 3-D models, with only minor quantitative differences that relate primarily to the growth speed of the instabilities. The main difference between 2-D and 3-D models can be found in the spectral energy distributions (SEDs) of the non-thermal particles. The 2-D simulations prove to be more efficient, accelerating a larger fraction of the particles and achieving higher velocities. We conclude that, while 2-D models are sufficient to investigate the instabilities in the gas, their results have to be treated with some caution when predicting the expected SED of a given shock.

On Memory Mechanism in Multi-Agent Reinforcement Learning

Authors:Yilun Zhou, Derrik E. Asher, Nicholas R. Waytowich, Julie A. Shah
Date:2019-09-11 17:42:14

Multi-agent reinforcement learning (MARL) extends (single-agent) reinforcement learning (RL) by introducing additional agents and (potentially) partial observability of the environment. Consequently, algorithms for solving MARL problems incorporate various extensions beyond traditional RL methods, such as a learned communication protocol between cooperative agents that enables the exchange of private information, or adaptive modeling of opponents in competitive settings. One popular algorithmic construct is a memory mechanism, such that an agent's decisions can depend not only upon the current state but also upon the history of observed states and actions. In this paper, we study how a memory mechanism can be useful in environments with different properties, such as observability, internality and presence of a communication channel. Using both prior work and new experiments, we show that a memory mechanism is helpful when learning agents need to model other agents and/or when communication is constrained in some way; however, we must be cautious of agents achieving effective memoryfulness through other means.

Signal Instructed Coordination in Cooperative Multi-agent Reinforcement Learning

Authors:Liheng Chen, Hongyi Guo, Yali Du, Fei Fang, Haifeng Zhang, Yaoming Zhu, Ming Zhou, Weinan Zhang, Qing Wang, Yong Yu
Date:2019-09-10 01:28:25

In many real-world problems, a team of agents needs to collaborate to maximize a common reward. Although existing works formulate this problem in a centralized-learning-with-decentralized-execution framework, which avoids the non-stationarity problem in training, their decentralized execution paradigm limits the agents' capability to coordinate. Inspired by the concept of correlated equilibrium, we propose to introduce a coordination signal to address this limitation, and theoretically show that, under mild conditions, decentralized agents with the coordination signal can coordinate their individual policies as if manipulated by a centralized controller. The idea of introducing the coordination signal is to encapsulate coordinated strategies into the signals, and use the signals to instruct the collaboration in decentralized execution. To encourage agents to learn to exploit the coordination signal, we propose Signal Instructed Coordination (SIC), a novel coordination module that can be integrated with most existing MARL frameworks. SIC casts a common signal sampled from a pre-defined distribution to all agents, and introduces an information-theoretic regularization to facilitate the consistency between the observed signal and agents' policies. Our experiments show that SIC consistently improves performance over well-recognized MARL models in both matrix games and a predator-prey game with a high-dimensional strategy space.

Bi-level Actor-Critic for Multi-agent Coordination

Authors:Haifeng Zhang, Weizhe Chen, Zeren Huang, Minne Li, Yaodong Yang, Weinan Zhang, Jun Wang
Date:2019-09-08 17:10:50

Coordination is one of the essential problems in multi-agent systems. Typically, multi-agent reinforcement learning (MARL) methods treat agents equally, and the goal is to solve the Markov game to an arbitrary Nash equilibrium (NE) when multiple equilibria exist, thus lacking a solution for NE selection. In this paper, we treat agents \emph{unequally} and consider the Stackelberg equilibrium as a potentially better convergence point than the Nash equilibrium in terms of Pareto superiority, especially in cooperative environments. Under Markov games, we formally define the bi-level reinforcement learning problem of finding the Stackelberg equilibrium. We propose a novel bi-level actor-critic learning method that allows agents to have different knowledge bases (thus intelligent), while their actions can still be executed simultaneously and distributedly. The convergence proof is given, and the resulting learning algorithm is tested against the state of the art. We find that the proposed bi-level actor-critic algorithm successfully converges to the Stackelberg equilibria in matrix games and finds an asymmetric solution in a highway merge environment.

Efficient Communication in Multi-Agent Reinforcement Learning via Variance Based Control

Authors:Sai Qian Zhang, Qi Zhang, Jieyu Lin
Date:2019-09-06 00:26:05

Multi-agent reinforcement learning (MARL) has recently received considerable attention due to its applicability to a wide range of real-world applications. However, achieving efficient communication among agents has always been an overarching problem in MARL. In this work, we propose Variance Based Control (VBC), a simple yet efficient technique to improve communication efficiency in MARL. By limiting the variance of the exchanged messages between agents during the training phase, the noisy components in the messages can be eliminated effectively, while the useful parts can be preserved and utilized by the agents for better performance. Our evaluation using a challenging set of StarCraft II benchmarks indicates that our method achieves $2$-$10\times$ lower communication overhead than state-of-the-art MARL algorithms, while allowing agents to better collaborate by developing sophisticated strategies.
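
A hedged sketch of the variance-limiting idea: during training, an extra loss term penalizes the variance of each agent's outgoing messages, suppressing noisy components so that only informative, low-variance content is worth transmitting. The penalty form and weighting below are our assumptions of one simple instantiation.

```python
import torch

def message_variance_loss(messages: torch.Tensor, weight: float = 1e-2) -> torch.Tensor:
    # messages: (B, N, msg_dim) messages produced by N agents over a batch
    # penalize per-dimension variance across the batch, averaged over agents
    return weight * messages.var(dim=0).mean()
```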

Normal Form of Equivariant Maps and Singular Symplectic Reduction in Infinite Dimensions with Applications to Gauge Field Theory

Authors:Tobias Diez
Date:2019-09-02 14:45:41

A local normal form theorem for smooth equivariant maps between Fréchet manifolds is established. Moreover, an elliptic version of this theorem is obtained. The proof of these normal form results is inspired by the Lyapunov-Schmidt reduction for dynamical systems and by the Kuranishi method for moduli spaces, and uses a slice theorem for Fréchet manifolds as the main technical tool. As a consequence of this equivariant normal form theorem, the abstract moduli space obtained by factorizing a level set of the equivariant map with respect to the group action carries the structure of a Kuranishi space. Moreover, the theory of singular symplectic reduction is developed in the infinite-dimensional Fréchet setting. By refining the above construction, a normal form for momentum maps similar to the classical Marle-Guillemin-Sternberg normal form is established. Analogous to the reasoning in finite dimensions, this normal form result is then used to show that the reduced phase space decomposes into smooth manifolds, each carrying a natural symplectic structure. Finally, the singular symplectic reduction scheme is further investigated in the situation where the original phase space is an infinite-dimensional cotangent bundle. The fibered structure of the cotangent bundle yields a refinement of the usual orbit-momentum type strata into so-called seams. Using a suitable normal form theorem, it is shown that these seams are manifolds. Taking the harmonic oscillator as an example, the influence of the seams on dynamics is illustrated. The general results stated above are applied to various gauge theory models. The moduli spaces of anti-self-dual connections in four dimensions and of Yang-Mills connections in two dimensions are studied. Moreover, the stratified structure of the reduced phase space of Yang-Mills-Higgs theory is investigated in a Hamiltonian formulation.

Iterative Update and Unified Representation for Multi-Agent Reinforcement Learning

Authors:Jiancheng Long, Hongming Zhang, Tianyang Yu, Bo Xu
Date:2019-08-16 07:39:59

Multi-agent systems have a wide range of applications in cooperative and competitive tasks. As the number of agents increases, nonstationarity gets more serious in multi-agent reinforcement learning (MARL), which brings great difficulties to the learning process. Besides, current mainstream algorithms configure an independent network for each agent, so that memory usage increases linearly with the number of agents, which greatly slows down the interaction with the environment. Inspired by Generative Adversarial Networks (GANs), this paper proposes an iterative update method (IU) to stabilize the nonstationary environment. Further, we add a first-person perspective and represent all agents by a single network, which changes agents' policy computation from sequential to batched. Similar to continual lifelong learning, we realize the iterative update method in this unified representative network (IUUR). In this method, iterative update can greatly alleviate the nonstationarity of the environment, and unified representation can speed up the interaction with the environment and avoid the linear growth of memory usage. Besides, this method does not hinder decentralized execution and distributed deployment. Experiments show that compared with MADDPG, our algorithm achieves state-of-the-art performance and saves wall-clock time by a large margin, especially with more agents.

A Review of Cooperative Multi-Agent Deep Reinforcement Learning

Authors:Afshin OroojlooyJadid, Davood Hajinezhad
Date:2019-08-11 21:40:11

Deep Reinforcement Learning has made significant progress in multi-agent systems in recent years. In this review article, we focus on presenting recent approaches to Multi-Agent Reinforcement Learning (MARL) algorithms. In particular, we focus on five common approaches to modeling and solving cooperative multi-agent reinforcement learning problems: (I) independent learners, (II) fully observable critic, (III) value function factorization, (IV) consensus, and (V) learning to communicate. First, we elaborate on each of these methods, the possible challenges, and how these challenges were mitigated in the relevant papers. If applicable, we further make a connection among different papers in each category. Next, we cover some new emerging research areas in MARL along with the relevant recent papers. Due to the recent success of MARL in real-world applications, we assign a section to provide a review of these applications and the corresponding articles. Also, a list of available environments for MARL research is provided in this survey. Finally, the paper is concluded with proposals on possible research directions.

Large-Scale Traffic Signal Control Using a Novel Multi-Agent Reinforcement Learning

Authors:Xiaoqiang Wang, Liangjun Ke, Zhimin Qiao, Xinghua Chai
Date:2019-08-10 14:19:21

Finding the optimal signal timing strategy is a difficult task for the problem of large-scale traffic signal control (TSC). Multi-Agent Reinforcement Learning (MARL) is a promising method to solve this problem. However, there is still room for improvement in extending to large-scale problems and modeling the behaviors of other agents for each individual agent. In this paper, a new MARL method, called Cooperative double Q-learning (Co-DQL), is proposed, which has several prominent features. It uses a highly scalable independent double Q-learning method based on double estimators and the UCB policy, which can eliminate the over-estimation problem existing in traditional independent Q-learning while ensuring exploration. It uses mean-field approximation to model the interaction among agents, thereby making agents learn a better cooperative strategy. In order to improve the stability and robustness of the learning process, we introduce a new reward allocation mechanism and a local state sharing method. In addition, we analyze the convergence properties of the proposed algorithm. Co-DQL is applied to TSC and tested on a multi-traffic-signal simulator. According to the results obtained on several traffic scenarios, Co-DQL outperforms several state-of-the-art decentralized MARL algorithms. It can effectively shorten the average waiting time of the vehicles in the whole road system.

Multi-Agent Reinforcement Learning Based Frame Sampling for Effective Untrimmed Video Recognition

Authors:Wenhao Wu, Dongliang He, Xiao Tan, Shifeng Chen, Shilei Wen
Date:2019-07-31 08:51:31

Video recognition has drawn great research interest, and significant progress has been made. A suitable frame sampling strategy can improve the accuracy and efficiency of recognition. However, mainstream solutions generally adopt hand-crafted frame sampling strategies, which can degrade performance, especially on untrimmed videos, due to the variation of frame-level saliency. To this end, we concentrate on improving untrimmed video classification by developing a learning-based frame sampling strategy. We intuitively formulate the frame sampling procedure as multiple parallel Markov decision processes, each of which aims at picking out a frame/clip by gradually adjusting an initial sampling. We then propose to solve the problems with multi-agent reinforcement learning (MARL). Our MARL framework is composed of a novel RNN-based context-aware observation network, which jointly models context information among nearby agents and the historical states of a specific agent; a policy network, which generates the probability distribution over a predefined action space at each step; and a classification network for reward calculation as well as final recognition. Extensive experimental results show that our MARL-based scheme remarkably outperforms hand-crafted strategies with various 2D and 3D baseline methods. Our single RGB model achieves performance comparable to the ActivityNet v1.3 champion submission, which used multi-modal multi-model fusion, and sets new state-of-the-art results on YouTube Birds and YouTube Cars.

Arena: a toolkit for Multi-Agent Reinforcement Learning

Authors:Qing Wang, Jiechao Xiong, Lei Han, Meng Fang, Xinghai Sun, Zhuobin Zheng, Peng Sun, Zhengyou Zhang
Date:2019-07-20 05:13:53

We introduce Arena, a toolkit for multi-agent reinforcement learning (MARL) research. In MARL, it usually requires customizing observations, rewards and actions for each agent, changing cooperative-competitive agent-interaction, and playing with/against a third-party agent, etc. We provide a novel modular design, called Interface, for manipulating such routines in essentially two ways: 1) Different interfaces can be concatenated and combined, which extends the OpenAI Gym Wrappers concept to MARL scenarios. 2) During MARL training or testing, interfaces can be embedded in either wrapped OpenAI Gym compatible Environments or raw environment compatible Agents. We offer off-the-shelf interfaces for several popular MARL platforms, including StarCraft II, Pommerman, ViZDoom, Soccer, etc. The interfaces effectively support self-play RL and cooperative-competitive hybrid MARL. Also, Arena can be conveniently extended to your own favorite MARL platform.

Prioritized Guidance for Efficient Multi-Agent Reinforcement Learning Exploration

Authors:Qisheng Wang, Qichao Wang
Date:2019-07-18 02:27:55

Exploration efficiency is a challenging problem in multi-agent reinforcement learning (MARL), as the policy learned by confederate MARL depends on the collaborative approach among multiple agents. Another important problem is that the less informative reward restricts the learning speed of MARL compared with the informative labels in supervised learning. In this work, we leverage a novel communication method to guide MARL to accelerate exploration, and propose a predictive network to forecast the reward of the current state-action pair and use the guidance learned by the predictive network to modify the reward function. An improved prioritized experience replay is employed to better take advantage of the different knowledge learned by different agents, which utilizes the Time-Difference (TD) error more effectively. Experimental results demonstrate that the proposed algorithm outperforms existing methods in cooperative multi-agent environments. We remark that this algorithm can be extended to supervised learning to speed up its training.

Shapley Q-value: A Local Reward Approach to Solve Global Reward Games

Authors:Jianhong Wang, Yuan Zhang, Tae-Kyun Kim, Yunjie Gu
Date:2019-07-11 15:12:33

Cooperative games are a critical research area in multi-agent reinforcement learning (MARL). Global reward games are a subclass of cooperative games, where all agents aim to maximize the global reward. Credit assignment is an important problem studied in global reward games. Most previous works took the view of a non-cooperative-game theoretical framework with the shared reward approach, i.e., each agent being assigned a shared global reward directly. This, however, may give each agent an inaccurate reward for its contribution to the group, which could cause inefficient learning. To deal with this problem, we i) introduce a cooperative-game theoretical framework called the extended convex game (ECG), which is a superset of global reward games, and ii) propose a local reward approach called the Shapley Q-value. The Shapley Q-value is able to distribute the global reward, reflecting each agent's own contribution, in contrast to the shared reward approach. Moreover, we derive an MARL algorithm called Shapley Q-value deep deterministic policy gradient (SQDDPG), using the Shapley Q-value as the critic for each agent. We evaluate SQDDPG on Cooperative Navigation, Prey-and-Predator and Traffic Junction, compared with state-of-the-art algorithms, e.g., MADDPG, COMA, Independent DDPG and Independent A2C. In the experiments, SQDDPG shows a significant improvement in the convergence rate. Finally, we plot the Shapley Q-value and validate the property of fair credit assignment.
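
For intuition, here is one standard Monte Carlo estimator of a Shapley-style credit: agent i's credit is its average marginal contribution to the coalition value over random orderings of the agents. The `coalition_q` evaluator is a hypothetical stand-in for a learned critic over agent subsets; SQDDPG's exact estimator may differ.

```python
import numpy as np

def shapley_q(coalition_q, n_agents, agent_i, n_samples=200, rng=None):
    # coalition_q: hypothetical function mapping a frozenset of agent ids to a Q value
    rng = rng if rng is not None else np.random.default_rng()
    total = 0.0
    for _ in range(n_samples):
        order = rng.permutation(n_agents)
        pos = int(np.where(order == agent_i)[0][0])
        before = frozenset(order[:pos])          # coalition preceding agent i
        # marginal contribution of agent i to this coalition
        total += coalition_q(before | {agent_i}) - coalition_q(before)
    return total / n_samples
```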

Voting-Based Multi-Agent Reinforcement Learning for Intelligent IoT

Authors:Yue Xu, Zengde Deng, Mengdi Wang, Wenjun Xu, Anthony Man-Cho So, Shuguang Cui
Date:2019-07-02 14:12:51

The recent success of single-agent reinforcement learning (RL) in Internet of things (IoT) systems motivates the study of multi-agent reinforcement learning (MARL), which is more challenging but more useful in large-scale IoT. In this paper, we consider a voting-based MARL problem, in which the agents vote to make group decisions and the goal is to maximize the globally averaged returns. To this end, we formulate the MARL problem based on the linear programming form of the policy optimization problem and propose a distributed primal-dual algorithm to obtain the optimal solution. We also propose a voting mechanism through which the distributed learning achieves the same sublinear convergence rate as centralized learning. In other words, the distributed decision making does not slow down the process of achieving global consensus on optimality. Lastly, we verify the convergence of our proposed algorithm with numerical simulations and conduct case studies in practical multi-agent IoT systems.
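
For context, the linear-programming form referred to here is, in its generic single-objective version, the classical occupancy-measure LP; a sketch in our own notation follows (the paper's multi-agent, voting-based variant adds further structure on top of this):

```latex
% Standard occupancy-measure LP for policy optimization (generic form;
% notation is ours, not the paper's).
\begin{align*}
\max_{\mu \ge 0} \quad & \sum_{s,a} \mu(s,a)\, r(s,a) \\
\text{s.t.} \quad & \sum_{a} \mu(s',a) = \sum_{s,a} P(s' \mid s,a)\, \mu(s,a) \quad \forall s', \\
& \sum_{s,a} \mu(s,a) = 1.
\end{align*}
```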

Rethinking Formal Models of Partially Observable Multiagent Decision Making

Authors:Vojtěch Kovařík, Martin Schmid, Neil Burch, Michael Bowling, Viliam Lisý
Date:2019-06-26 14:01:34

Multiagent decision-making in partially observable environments is usually modelled as either an extensive-form game (EFG) in game theory or a partially observable stochastic game (POSG) in multiagent reinforcement learning (MARL). One issue with the current situation is that while most practical problems can be modelled in both formalisms, the relationship of the two models is unclear, which hinders the transfer of ideas between the two communities. A second issue is that while EFGs have recently seen significant algorithmic progress, their classical formalization is unsuitable for efficient presentation of the underlying ideas, such as those around decomposition. To solve the first issue, we introduce factored-observation stochastic games (FOSGs), a minor modification of the POSG formalism which distinguishes between private and public observation and thereby greatly simplifies decomposition. To remedy the second issue, we show that FOSGs and POSGs are naturally connected to EFGs: by "unrolling" a FOSG into its tree form, we obtain an EFG. Conversely, any perfect-recall timeable EFG corresponds to some underlying FOSG in this manner. Moreover, this relationship justifies several minor modifications to the classical EFG formalization that recently appeared as an implicit response to the model's issues with decomposition. Finally, we illustrate the transfer of ideas between EFGs and MARL by presenting three key EFG techniques -- counterfactual regret minimization, sequence form, and decomposition -- in the FOSG framework.

On Poisson structures arising from a Lie group action

Authors:G. M. Beffa, E. L. Mansfield
Date:2019-06-26 00:05:34

We investigate some infinite dimensional Lie algebras and their associated Poisson structures which arise from a Lie group action on a manifold. If $G$ is a Lie group, $\mathfrak{g}$ its Lie algebra and $M$ is a manifold on which $G$ acts, then the set of smooth maps from $M$ to $\mathfrak{g}$ has at least two Lie algebra structures, both satisfying the required property to be a Lie algebroid. We may then apply a construction by Marle to obtain a Poisson bracket on the set of smooth real or complex valued functions on $M\times \mathfrak{g}^*$. In this paper, we investigate these Poisson brackets. We show that the set of examples includes the standard Darboux symplectic structure and the classical Lie-Poisson brackets, but is a strictly larger class of Poisson brackets than these. Our study includes the associated Hamiltonian flows and their invariants, canonical maps induced by the Lie group action, and compatible Poisson structures. Our approach is mainly computational and we detail numerous examples. The Lie brackets from which our results derive arose from the consideration of connections on bundles with zero curvature and constant torsion. We give an alternate derivation of the Lie bracket, suited to applications of Lie group actions that do not involve a Riemannian metric. We also begin a study of the infinite dimensional Poisson brackets which may be obtained by considering a central extension of the Lie algebras.

A Regularized Opponent Model with Maximum Entropy Objective

Authors:Zheng Tian, Ying Wen, Zhichen Gong, Faiz Punakkath, Shihao Zou, Jun Wang
Date:2019-05-17 12:30:59

In a single-agent setting, reinforcement learning (RL) tasks can be cast as an inference problem by introducing a binary random variable o, which stands for "optimality". In this paper, we redefine the binary random variable o in the multi-agent setting and formalize multi-agent reinforcement learning (MARL) as probabilistic inference. We derive a variational lower bound on the likelihood of achieving optimality and name it the Regularized Opponent Model with Maximum Entropy Objective (ROMMEO). From ROMMEO, we present a novel perspective on opponent modeling and show, both theoretically and empirically, how it can improve the performance of training agents in cooperative games. To optimize ROMMEO, we first introduce a tabular Q-iteration method, ROMMEO-Q, with a proof of convergence. We then extend the exact algorithm to complex environments by proposing an approximate version, ROMMEO-AC. We evaluate the two algorithms on the challenging iterated matrix game and a differential game, respectively, and show that they can outperform strong MARL baselines.
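
For context, the single-agent control-as-inference construction that ROMMEO generalizes can be sketched as follows (our notation; the paper's multi-agent redefinition of o differs):

```latex
% Single-agent control-as-inference: the optimality variable o_t has
% likelihood proportional to the exponentiated reward, and the
% log-evidence of being optimal at every step admits a variational
% lower bound over trajectory distributions q.
p(o_t = 1 \mid s_t, a_t) \propto \exp\big(r(s_t, a_t)\big),
\qquad
\log p(o_{1:T} = 1) \ \ge\
\mathbb{E}_{q}\!\left[\sum_{t=1}^{T} r(s_t, a_t)\right] + \mathcal{H}(q).
```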

QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning

Authors:Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, Yung Yi
Date:2019-05-14 06:29:51

We explore value-based solutions for multi-agent reinforcement learning (MARL) tasks in the centralized training with decentralized execution (CTDE) regime popularized recently. VDN and QMIX are representative examples that factorize the joint action-value function into individual ones for decentralized execution. However, VDN and QMIX address only a fraction of factorizable MARL tasks due to their structural constraints on the factorization, such as additivity and monotonicity. In this paper, we propose a new factorization method for MARL, QTRAN, which is free from such structural constraints and takes a new approach: transforming the original joint action-value function into an easily factorizable one with the same optimal actions. QTRAN guarantees more general factorization than VDN or QMIX, thus covering a much wider class of MARL tasks than previous methods do. Our experiments on multi-domain Gaussian-squeeze and modified predator-prey tasks demonstrate QTRAN's superior performance, with especially large margins in games whose payoffs penalize non-cooperative behavior more aggressively.
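
The structural constraints in question can be illustrated with a toy example (our sketch, not the paper's code): under additive or monotonic factorization, the joint greedy action decomposes into per-agent greedy actions, which is exactly the decentralizability property QTRAN preserves while dropping the constraints themselves:

```python
import numpy as np

# VDN sums per-agent utilities; QMIX mixes them monotonically. Either
# way, the joint argmax equals the tuple of per-agent argmaxes, which
# is what enables decentralized execution.
rng = np.random.default_rng(0)
q1, q2 = rng.normal(size=3), rng.normal(size=3)  # per-agent utilities

# VDN: Q_tot(a1, a2) = Q_1(a1) + Q_2(a2)
q_tot_vdn = q1[:, None] + q2[None, :]

# QMIX-style: any mixing that is monotonically increasing in each
# per-agent utility preserves the same greedy actions.
q_tot_mono = np.exp(q1)[:, None] + 2.0 * q2[None, :]

joint = np.unravel_index(q_tot_vdn.argmax(), q_tot_vdn.shape)
assert joint == (q1.argmax(), q2.argmax())  # greedy actions decentralize
```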

Argus: Smartphone-enabled Human Cooperation via Multi-Agent Reinforcement Learning for Disaster Situational Awareness

Authors:Vidyasagar Sadhu, Gabriel Salles-Loustau, Dario Pompili, Saman Zonouz, Vincent Sritapan
Date:2019-04-29 02:16:32

Argus exploits a Multi-Agent Reinforcement Learning (MARL) framework to create a 3D mapping of a disaster scene, using agents present around the incident zone to facilitate rescue operations. The agents can be human bystanders at the disaster scene as well as drones or robots that assist the humans. The agents capture images of the scene using their smartphones (or on-board cameras, in the case of drones) as directed by the MARL algorithm. These images are used to build a 3D map of the disaster scene in real time. Via both simulations and real experiments, we evaluate the framework's effectiveness in tracking the random dynamicity of the environment.

Teaching on a Budget in Multi-Agent Deep Reinforcement Learning

Authors:Ercüment İlhan, Jeremy Gow, Diego Perez-Liebana
Date:2019-04-19 04:22:09

Deep Reinforcement Learning (RL) algorithms can successfully solve complex sequential decision tasks. However, they suffer from poor sample efficiency, a drawback often tackled through knowledge reuse. In Multi-Agent Reinforcement Learning (MARL) this drawback becomes worse, but at the same time agent interactions present a new set of opportunities to leverage knowledge. One promising approach among these is peer-to-peer action advising through a teacher-student framework. Despite originally being introduced for single-agent RL, recent studies show that it can also be applied to multi-agent scenarios with promising empirical results. However, studies in this line of research are still very limited. In this paper, we propose heuristics-based action advising techniques for cooperative decentralised MARL, using a task-level policy based on nonlinear function approximation. By adopting the Random Network Distillation technique, we devise a measure that lets agents assess their knowledge in any given state and initiate the teacher-student dynamics with no prior role assumptions. Experimental results in a gridworld environment show that such an approach may indeed be useful and deserves further investigation.
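
A minimal sketch of the RND-based knowledge measure, as we read the generic technique (network sizes and the advice threshold below are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Random Network Distillation: the predictor's error against a fixed
# random target is high in unfamiliar states, so an agent can, e.g.,
# request advice when the error exceeds a threshold.
obs_dim, feat_dim = 8, 16
target = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, feat_dim))
predictor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, feat_dim))
for p in target.parameters():
    p.requires_grad_(False)  # target stays fixed and random

opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

def novelty(obs):
    """Prediction error = proxy for how little the agent knows this state."""
    return ((predictor(obs) - target(obs)) ** 2).mean(dim=-1)

def train_step(obs_batch):
    loss = novelty(obs_batch).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

obs = torch.randn(32, obs_dim)
print(train_step(obs))                     # error shrinks on visited states
should_ask = novelty(torch.randn(1, obs_dim)) > 0.5  # hypothetical threshold
```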

Interaction-aware Decision Making with Adaptive Strategies under Merging Scenarios

Authors:Yeping Hu, Alireza Nakhaei, Masayoshi Tomizuka, Kikuo Fujimura
Date:2019-04-12 04:01:18

In order to drive safely and efficiently under merging scenarios, autonomous vehicles should be aware of their surroundings and make decisions by interacting with other road participants. Moreover, different strategies should be adopted when the autonomous vehicle interacts with drivers having different levels of cooperativeness. Whether the vehicle is on the merge lane or the main lane also influences the driving maneuvers, since drivers behave differently when they have the right-of-way than otherwise. Many traditional methods have been proposed to solve decision-making problems under merging scenarios. However, these works either are incapable of modeling complicated interactions or require hand-designed rules that cannot properly handle the uncertainties of real-world scenarios. In this paper, we propose an interaction-aware decision making with adaptive strategies (IDAS) approach that lets the autonomous vehicle negotiate the road with other drivers by leveraging their cooperativeness under merging scenarios. A single policy is learned in the multi-agent reinforcement learning (MARL) setting via a curriculum learning strategy, which enables the agent to automatically infer other drivers' various behaviors and make decisions strategically. A masking mechanism is also proposed to prevent the agent from exploring states that violate common sense of human judgment, which also increases learning efficiency. An exemplar merging scenario is used to implement and examine the proposed method.
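
A generic sketch of such a masking mechanism follows (the paper's actual common-sense rules are domain-specific and not reproduced here): invalid actions receive -inf logits, so the policy never samples them and never wastes exploration on them:

```python
import torch

# Action masking: zero out the probability of disallowed actions by
# setting their logits to -inf before building the distribution.
def masked_policy(logits, valid_mask):
    # logits: (batch, n_actions); valid_mask: same shape, 1 = allowed
    masked = logits.masked_fill(valid_mask == 0, float("-inf"))
    return torch.distributions.Categorical(logits=masked)

logits = torch.randn(1, 4)
mask = torch.tensor([[1, 1, 0, 1]])  # hypothetical forbidden maneuver at index 2
dist = masked_policy(logits, mask)
print(dist.probs)  # probability of the masked action is exactly zero
```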

Distributed Power Control for Large Energy Harvesting Networks: A Multi-Agent Deep Reinforcement Learning Approach

Authors:Mohit K. Sharma, Alessio Zappone, Mohamad Assaad, Merouane Debbah, Spyridon Vassilaras
Date:2019-04-01 07:16:51

In this paper, we develop a multi-agent reinforcement learning (MARL) framework to obtain online power control policies for a large energy harvesting (EH) multiple access channel, when only causal information about the EH process and wireless channel is available. In the proposed framework, we model the online power control problem as a discrete-time mean-field game (MFG) and analytically show that the MFG has a unique stationary solution. Next, we leverage the fictitious play property of mean-field games and deep reinforcement learning to learn the stationary solution of the game in a completely distributed fashion. We analytically show that the proposed procedure converges to the unique stationary solution of the MFG, which in turn ensures that the optimal policies can be learned in a completely distributed fashion. In order to benchmark the performance of the distributed policies, we also develop deep neural network (DNN) based centralized as well as distributed online power control schemes. Our simulation results show the efficacy of the proposed power control policies. In particular, the DNN-based centralized power control policies provide very good performance for large EH networks, for which the design of optimal policies is intractable using conventional methods such as Markov decision processes. Further, the performance of both distributed policies is close to the throughput achieved by the centralized policies.

Policy Distillation and Value Matching in Multiagent Reinforcement Learning

Authors:Samir Wadhwania, Dong-Ki Kim, Shayegan Omidshafiei, Jonathan P. How
Date:2019-03-15 15:13:02

Multiagent reinforcement learning (MARL) algorithms have been demonstrated on complex tasks that require the coordination of a team of multiple agents to complete. Existing works have focused on sharing information between agents via centralized critics to stabilize learning, or through communication to increase performance, but do not generally look at how information can be shared between agents to address the curse of dimensionality in MARL. We posit that a multiagent problem can be decomposed into a multi-task problem where each agent explores a subset of the state space instead of the entire state space. This paper introduces a multiagent actor-critic algorithm and a method for combining knowledge from homogeneous agents through distillation and value matching that outperforms policy distillation alone and allows further learning in both discrete and continuous action spaces.
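
A sketch of what a combined distillation-plus-value-matching objective could look like (the weighting beta and the exact loss forms are our illustrative assumptions, not the paper's):

```python
import torch
import torch.nn.functional as F

# Distill the teacher's policy with a KL term and additionally match
# value estimates with an MSE term.
def distill_loss(student_logits, teacher_logits, student_v, teacher_v, beta=1.0):
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )                                         # policy distillation term
    value = F.mse_loss(student_v, teacher_v)  # value-matching term
    return kl + beta * value

s_logits, t_logits = torch.randn(8, 5, requires_grad=True), torch.randn(8, 5)
s_v, t_v = torch.randn(8, requires_grad=True), torch.randn(8)
print(distill_loss(s_logits, t_logits, s_v, t_v).item())
```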

Resource Abstraction for Reinforcement Learning in Multiagent Congestion Problems

Authors:Kleanthis Malialis, Sam Devlin, Daniel Kudenko
Date:2019-03-13 11:54:04

Real-world congestion problems (e.g. traffic congestion) are typically very complex and large-scale. Multiagent reinforcement learning (MARL) is a promising candidate for dealing with this emerging complexity by providing an autonomous and distributed solution to these problems. However, there are three limiting factors that affect the deployability of MARL approaches to congestion problems: learning time, scalability, and decentralised coordination, i.e., no communication between the learning agents. In this paper we introduce Resource Abstraction, an approach that addresses these challenges by allocating the available resources into abstract groups. This abstraction creates new reward functions that provide a more informative signal to the learning agents and aid the coordination amongst them. Experimental work is conducted on two benchmark domains from the literature, an abstract congestion problem and a realistic traffic congestion problem. The current state-of-the-art for solving multiagent congestion problems is a form of reward shaping called difference rewards. We show that the system using Resource Abstraction significantly improves learning speed and scalability, and achieves the highest possible or near-highest joint performance/social welfare for both congestion problems in large-scale scenarios involving up to 1000 reinforcement learning agents.

Multi-Agent Deep Reinforcement Learning for Large-scale Traffic Signal Control

Authors:Tianshu Chu, Jie Wang, Lara Codecà, Zhaojian Li
Date:2019-03-11 18:28:58

Reinforcement learning (RL) is a promising data-driven approach for adaptive traffic signal control (ATSC) in complex urban traffic networks, and deep neural networks further enhance its learning power. However, centralized RL is infeasible for large-scale ATSC due to the extremely high dimension of the joint action space. Multi-agent RL (MARL) overcomes the scalability issue by distributing the global control to each local RL agent, but it introduces new challenges: now the environment becomes partially observable from the viewpoint of each local agent due to limited communication among agents. Most existing studies in MARL focus on designing efficient communication and coordination among traditional Q-learning agents. This paper presents, for the first time, a fully scalable and decentralized MARL algorithm for the state-of-the-art deep RL agent: advantage actor critic (A2C), within the context of ATSC. In particular, two methods are proposed to stabilize the learning procedure, by improving the observability and reducing the learning difficulty of each local agent. The proposed multi-agent A2C is compared against independent A2C and independent Q-learning algorithms, in both a large synthetic traffic grid and a large real-world traffic network of Monaco city, under simulated peak-hour traffic dynamics. Results demonstrate its optimality, robustness, and sample efficiency over other state-of-the-art decentralized MARL algorithms.

Thermal dissipation in two dimensional relativistic Fermi gases with a relaxation time model

Authors:A. R. Méndez, A. L. García-Perciante, G. Chacón-Acosta
Date:2019-03-09 02:54:05

The thermal transport properties of a two-dimensional Fermi gas are explored for the full range of temperatures and densities. The heat flux is established by solving the Uehling-Uhlenbeck equation using a relaxation approximation given by Marle's collisional kernel and considering the temperature and chemical potential gradients as independent thermodynamic forces. It is shown that the corresponding transport coefficients are proportional to each other, which leads to the possibility of defining a generalized thermal force and a single transport coefficient. The behavior of this conductivity with temperature and chemical potential is analyzed, and a discussion of its dependence on the relaxation parameter is also included. The relevance and applications of the results are briefly addressed.

Can Sophisticated Dispatching Strategy Acquired by Reinforcement Learning? - A Case Study in Dynamic Courier Dispatching System

Authors:Yujie Chen, Yu Qian, Yichen Yao, Zili Wu, Rongqi Li, Yinzhi Zhou, Haoyuan Hu, Yinghui Xu
Date:2019-03-07 03:49:07

In this paper, we study a courier dispatching problem (CDP) arising from an online pickup-service platform of Alibaba. The CDP aims to assign a set of couriers to serve pickup requests with stochastic spatial and temporal arrival rates among urban regions. The objective is to maximize the revenue of served requests given a limited number of couriers over a period of time. Many online algorithms, such as dynamic matching and vehicle routing strategies from the existing literature, could be applied to this problem. However, these methods rely on appropriately predefined optimization objectives at each decision point, which are hard to specify in dynamic situations. This paper formulates the CDP as a Markov decision process (MDP) and proposes a data-driven approach to derive the optimal dispatching rule-set under different scenarios. Our method stacks multi-layer images of the spatial-and-temporal map and applies multi-agent reinforcement learning (MARL) techniques to evolve dispatching models, avoiding the learning inefficiency caused by traditional centralized MDP modeling. Through comprehensive experiments on both artificial and real-world datasets, we show that: 1) by utilizing historical data and considering long-term revenue gains, MARL achieves better performance than myopic online algorithms; 2) MARL is able to construct the mapping from complex scenarios to sophisticated decisions such as the dispatching rule; and 3) MARL has the scalability to be adopted in large-scale real-world scenarios.

Market-Based Model in CR-WSN: A Q-Probabilistic Multi-agent Learning Approach

Authors:Dan Wang, Wei Zhang, Bin Song, Xiaojiang Du, Mohsen Guizani
Date:2019-02-26 01:15:47

Ever-increasing urban populations and their material demands have brought unprecedented burdens to cities. Smart cities leverage emerging technologies like the Internet of Things (IoT) and Cognitive Radio Wireless Sensor Networks (CR-WSN) to provide better QoE and QoS for all citizens. However, resource scarcity is an important challenge in CR-WSN. Generally, this problem is handled by auction theory or game theory. To make CR-WSN nodes intelligent and more autonomous in resource allocation, we propose a multi-agent reinforcement learning (MARL) algorithm to learn the optimal resource allocation strategy in an oligopoly market model. First, we model a multi-agent scenario in which the primary users (PUs) are the sellers and the secondary users (SUs) are the buyers. Then, we propose Q-probabilistic multiagent learning (QPML) and apply it to allocate resources in the market. In the multi-agent interactive learning process, the PUs and SUs learn strategies to maximize their benefits and improve spectrum utilization. Experimental results show the efficiency of our QPML approach, which also converges quickly.

Efficient Ridesharing Order Dispatching with Mean Field Multi-Agent Reinforcement Learning

Authors:Minne Li, Zhiwei Qin, Yan Jiao, Yaodong Yang, Zhichen Gong, Jun Wang, Chenxi Wang, Guobin Wu, Jieping Ye
Date:2019-01-31 16:26:48

A fundamental question in any peer-to-peer ridesharing system is how to dispatch users' ride requests to the right drivers in real time, both effectively and efficiently. Traditional rule-based solutions usually work on a simplified problem setting, which requires a sophisticated hand-crafted weight design for either centralized authority control or decentralized multi-agent scheduling systems. Although recent approaches have used reinforcement learning to provide centralized combinatorial optimization algorithms with informative weight values, their single-agent setting can hardly model the complex interactions between drivers and orders. In this paper, we address the order dispatching problem using multi-agent reinforcement learning (MARL), which follows the distributed nature of the peer-to-peer ridesharing problem and possesses the ability to capture the stochastic demand-supply dynamics in large-scale ridesharing scenarios. Being more reliable than centralized approaches, our proposed MARL solutions could also support fully distributed execution through recent advances in the Internet of Vehicles (IoV) and the Vehicle-to-Network (V2N). Furthermore, we adopt the mean field approximation to simplify the local interactions by taking an average action among neighborhoods. The mean field approximation is capable of globally capturing dynamic demand-supply variations by propagating many local interactions between agents and the environment. Our extensive experiments show significant improvements of MARL order dispatching algorithms over several strong baselines on gross merchandise volume (GMV) and order response rate measures. Besides, simulated experiments with real data also justify that our solution can alleviate the supply-demand gap during rush hours, thus possessing the capability of reducing traffic congestion.
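
The mean field approximation can be sketched in toy tabular form, following the general mean-field Q-learning formulation rather than the paper's production system (the discretization of the mean action below is a simplifying assumption of ours):

```python
import numpy as np

# Mean-field idea: each agent conditions its Q-value on the mean action
# of its neighborhood instead of on every neighbor's individual action,
# collapsing the joint action space.
n_actions = 4
Q = np.zeros((10, n_actions, n_actions))  # (state, own action, discretized mean action)

def mean_action(neighbor_actions):
    onehot = np.eye(n_actions)[neighbor_actions]
    return onehot.mean(axis=0)            # empirical action distribution

def td_update(s, a, neighbors, r, s2, alpha=0.1, gamma=0.99):
    abar = int(mean_action(neighbors).argmax())   # crude discretization
    target = r + gamma * Q[s2, :, abar].max()
    Q[s, a, abar] += alpha * (target - Q[s, a, abar])

td_update(s=0, a=1, neighbors=np.array([1, 1, 2]), r=0.5, s2=3)
```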

Value Propagation for Decentralized Networked Deep Multi-agent Reinforcement Learning

Authors:Chao Qu, Shie Mannor, Huan Xu, Yuan Qi, Le Song, Junwu Xiong
Date:2019-01-27 05:56:42

We consider the networked multi-agent reinforcement learning (MARL) problem in a fully decentralized setting, where agents learn to coordinate to achieve joint success. This problem is widely encountered in many areas, including traffic control, distributed control, and smart grids. We assume that the reward function for each agent can be different and is observed only locally by the agent itself. Furthermore, each agent is located at a node of a communication network and can exchange information only with its neighbors. Using softmax temporal consistency and a decentralized optimization method, we obtain a principled and data-efficient iterative algorithm. In the first step of each iteration, an agent computes its local policy and value gradients and then updates only the policy parameters. In the second step, the agent propagates to its neighbors messages based on its value function and then updates its own value function; hence we name the algorithm value propagation. We prove a non-asymptotic convergence rate of O(1/T) with nonlinear function approximation. To the best of our knowledge, it is the first MARL algorithm with a convergence guarantee in the control, off-policy, non-linear function approximation setting. We empirically demonstrate the effectiveness of our approach in experiments.

Modelling Bounded Rationality in Multi-Agent Interactions by Generalized Recursive Reasoning

Authors:Ying Wen, Yaodong Yang, Rui Luo, Jun Wang
Date:2019-01-26 13:55:55

Most multi-agent reinforcement learning (MARL) models assume perfectly rational agents -- a property hardly met in real-world decision making due to individuals' cognitive limitations and/or the intractability of the decision problem. In this paper, we introduce generalized recursive reasoning (GR2) as a novel framework to model agents with different \emph{hierarchical} levels of rationality; our framework enables agents to exhibit varying levels of "thinking" ability, thereby allowing higher-level agents to best respond to less sophisticated learners. We contribute both theoretically and empirically. On the theory side, we devise the hierarchical framework of GR2 through probabilistic graphical models and prove the existence of a perfect Bayesian equilibrium. Within GR2, we propose a practical actor-critic solver and demonstrate its convergence to a stationary point in two-player games through Lyapunov analysis. On the empirical side, we validate our findings on a variety of MARL benchmarks. Precisely, we first illustrate the hierarchical thinking process on the Keynes Beauty Contest, and then demonstrate significant improvements over state-of-the-art opponent-modeling baselines on normal-form games and the cooperative navigation benchmark.

The Multi-Agent Reinforcement Learning in MalmÖ (MARLÖ) Competition

Authors:Diego Perez-Liebana, Katja Hofmann, Sharada Prasanna Mohanty, Noburu Kuno, Andre Kramer, Sam Devlin, Raluca D. Gaina, Daniel Ionita
Date:2019-01-23 21:01:27

Learning in multi-agent scenarios is a fruitful research direction, but current approaches still show scalability problems in multiple games with general reward settings and different opponent types. The Multi-Agent Reinforcement Learning in Malm\"O (MARL\"O) competition is a new challenge that proposes research in this domain using multiple 3D games. The goal of this contest is to foster research in general agents that can learn across different games and opponent types, proposing a challenge as a milestone in the direction of Artificial General Intelligence.

Normal Forms for Dirac-Jacobi bundles and Splitting Theorems for Jacobi Structures

Authors:Jonas Schnitzer
Date:2019-01-01 20:31:18

The aim of this paper is to prove a normal form theorem for Dirac-Jacobi bundles using recent techniques from Bursztyn, Lima and Meinrenken. As the most important consequence, we prove the splitting theorems for Jacobi pairs, which were proposed by Dazord, Lichnerowicz and Marle. As an application, we provide an alternative proof of the splitting theorem for homogeneous Poisson structures.

Finite-Sample Analysis For Decentralized Batch Multi-Agent Reinforcement Learning With Networked Agents

Authors:Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, Tamer Başar
Date:2018-12-06 20:09:40

Despite the increasing interest in multi-agent reinforcement learning (MARL) in multiple communities, understanding its theoretical foundation has long been recognized as a challenging problem. In this work, we address this problem by providing a finite-sample analysis for decentralized batch MARL with networked agents. Specifically, we consider two decentralized MARL settings, where teams of agents are connected by time-varying communication networks, and either collaborate or compete in a zero-sum game setting, without any central controller. These settings cover many conventional MARL settings in the literature. For both settings, we develop batch MARL algorithms that can be implemented in a decentralized fashion, and quantify the finite-sample errors of the estimated action-value functions. Our error analysis captures how the function class, the number of samples within each iteration, and the number of iterations determine the statistical accuracy of the proposed algorithms. Our results, compared to the finite-sample bounds for single-agent RL, involve additional error terms caused by decentralized computation, which is inherent in our decentralized MARL setting. This work appears to be the first finite-sample analysis for batch MARL, a step towards rigorous theoretical understanding of general MARL algorithms in the finite-sample regime.

Deep Multi-Agent Reinforcement Learning with Relevance Graphs

Authors:Aleksandra Malysheva, Tegg Taekyong Sung, Chae-Bong Sohn, Daniel Kudenko, Aleksei Shpilman
Date:2018-11-30 00:38:18

Over recent years, deep reinforcement learning has shown strong successes in complex single-agent tasks, and more recently this approach has also been applied to multi-agent domains. In this paper, we propose a novel approach, called MAGnet, to multi-agent reinforcement learning (MARL) that utilizes a relevance graph representation of the environment obtained by a self-attention mechanism, and a message-generation technique inspired by the NerveNet architecture. We applied our MAGnet approach to the Pommerman game and the results show that it significantly outperforms state-of-the-art MARL solutions, including DQN, MADDPG, and MCTS.

Evolving intrinsic motivations for altruistic behavior

Authors:Jane X. Wang, Edward Hughes, Chrisantha Fernando, Wojciech M. Czarnecki, Edgar A. Duenez-Guzman, Joel Z. Leibo
Date:2018-11-14 18:01:10

Multi-agent cooperation is an important feature of the natural world. Many tasks involve individual incentives that are misaligned with the common good, yet a wide range of organisms from bacteria to insects and humans are able to overcome their differences and collaborate. Therefore, the emergence of cooperative behavior amongst self-interested individuals is an important question for the fields of multi-agent reinforcement learning (MARL) and evolutionary theory. Here, we study a particular class of multi-agent problems called intertemporal social dilemmas (ISDs), where the conflict between the individual and the group is particularly sharp. By combining MARL with appropriately structured natural selection, we demonstrate that individual inductive biases for cooperation can be learned in a model-free way. To achieve this, we introduce an innovative modular architecture for deep reinforcement learning agents which supports multi-level selection. We present results in two challenging environments, and interpret these in the context of cultural and ecological evolution.

Multi-Agent Reinforcement Learning Based Resource Allocation for UAV Networks

Authors:Jingjing Cui, Yuanwei Liu, Arumugam Nallanathan
Date:2018-10-24 14:05:28

Unmanned aerial vehicles (UAVs) are capable of serving as aerial base stations (BSs) that provide both cost-effective and on-demand wireless communications. This article investigates dynamic resource allocation in communication networks enabled by multiple UAVs, with the goal of maximizing long-term rewards. More particularly, each UAV communicates with a ground user by automatically selecting its communicating user, power level, and subchannel without any information exchange among UAVs. To model the uncertainty of the environment, we formulate the long-term resource allocation problem as a stochastic game for maximizing the expected rewards, where each UAV becomes a learning agent and each resource allocation solution corresponds to an action taken by the UAVs. We then develop a multi-agent reinforcement learning (MARL) framework in which each agent discovers its best strategy according to its local observations. More specifically, we propose an agent-independent method, in which all agents conduct a decision algorithm independently but share a common structure based on Q-learning. Finally, simulation results reveal that: 1) appropriate parameters for exploitation and exploration are capable of enhancing the performance of the proposed MARL-based resource allocation algorithm; and 2) the proposed MARL algorithm provides acceptable performance compared to the case with complete information exchange among UAVs, thereby striking a good tradeoff between performance gains and information exchange overheads.

Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning

Authors:Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro A. Ortega, DJ Strouse, Joel Z. Leibo, Nando de Freitas
Date:2018-10-19 19:01:15

We propose a unified mechanism for achieving coordination and communication in Multi-Agent Reinforcement Learning (MARL) by rewarding agents for having causal influence over other agents' actions. Causal influence is assessed using counterfactual reasoning. At each timestep, an agent simulates alternate actions that it could have taken and computes their effect on the behavior of other agents. Actions that lead to bigger changes in other agents' behavior are considered influential and are rewarded. We show that this is equivalent to rewarding agents for having high mutual information between their actions. Empirical results demonstrate that influence leads to enhanced coordination and communication in challenging social dilemma environments, dramatically improving the learning curves of the deep RL agents and leading to more meaningful learned communication protocols. The influence rewards for all agents can be computed in a decentralized way by enabling agents to learn a model of other agents using deep neural networks. In contrast, key previous works on emergent communication in the MARL setting were unable to learn diverse policies in a decentralized manner and had to resort to centralized training. Consequently, the influence reward opens a window of new opportunities for research in this area.
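
A toy version of the counterfactual influence computation (our sketch with made-up numbers): agent A's influence on agent B is the KL divergence between B's policy given A's actual action and B's policy marginalized over the counterfactual actions A could have taken.

```python
import numpy as np

# influence = KL( p(b | s, a_taken) || sum_a' p(b | s, a') p(a' | s) )
def influence_reward(p_b_given_a, p_a, a_taken):
    # p_b_given_a[i] = B's action distribution if A takes action i
    marginal = (p_a[:, None] * p_b_given_a).sum(axis=0)
    cond = p_b_given_a[a_taken]
    return float((cond * np.log(cond / marginal)).sum())  # KL(cond || marginal)

p_b_given_a = np.array([[0.9, 0.1],   # B reacts strongly to A's action 0
                        [0.2, 0.8]])  # ...and differently to action 1
p_a = np.array([0.5, 0.5])
print(influence_reward(p_b_given_a, p_a, a_taken=0))  # > 0: A is influential
```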

M$^3$RL: Mind-aware Multi-agent Management Reinforcement Learning

Authors:Tianmin Shu, Yuandong Tian
Date:2018-09-29 04:33:15

Most of the prior work on multi-agent reinforcement learning (MARL) achieves optimal collaboration by directly controlling the agents to maximize a common reward. In this paper, we address the problem from a different angle. In particular, we consider scenarios with self-interested agents (i.e., worker agents) which have their own minds (preferences, intentions, skills, etc.) and cannot be dictated to perform tasks they do not wish to do. To achieve optimal coordination among these agents, we train a super agent (i.e., the manager) to manage them by first inferring their minds based on both current and past observations, and then initiating contracts that assign suitable tasks to workers with the promise of corresponding bonuses, so that they agree to work together. The manager's objective is to maximize overall productivity while minimizing the payments made to the workers for ad-hoc worker teaming. To train the manager, we propose Mind-aware Multi-agent Management Reinforcement Learning (M^3RL), which consists of agent modeling and policy learning. We evaluate our approach in two environments, Resource Collection and Crafting, simulating multi-agent management problems with various task settings and multiple designs of the worker agents. The experimental results validate the effectiveness of our approach in modeling worker agents' minds online and in achieving optimal ad-hoc teaming with good generalization and fast adaptation.

Hierarchical Deep Multiagent Reinforcement Learning with Temporal Abstraction

Authors:Hongyao Tang, Jianye Hao, Tangjie Lv, Yingfeng Chen, Zongzhang Zhang, Hangtian Jia, Chunxu Ren, Yan Zheng, Zhaopeng Meng, Changjie Fan, Li Wang
Date:2018-09-25 06:19:22

Multiagent reinforcement learning (MARL) is commonly considered to suffer from non-stationary environments and exponentially increasing policy space. It would be even more challenging when rewards are sparse and delayed over long trajectories. In this paper, we study hierarchical deep MARL in cooperative multiagent problems with sparse and delayed reward. With temporal abstraction, we decompose the problem into a hierarchy of different time scales and investigate how agents can learn high-level coordination based on the independent skills learned at the low level. Three hierarchical deep MARL architectures are proposed to learn hierarchical policies under different MARL paradigms. Besides, we propose a new experience replay mechanism to alleviate the issue of the sparse transitions at the high level of abstraction and the non-stationarity of multiagent learning. We empirically demonstrate the effectiveness of our approaches in two domains with extremely sparse feedback: (1) a variety of Multiagent Trash Collection tasks, and (2) a challenging online mobile game, i.e., Fever Basketball Defense.

IntelligentCrowd: Mobile Crowdsensing via Multi-Agent Reinforcement Learning

Authors:Yize Chen, Hao Wang
Date:2018-09-20 19:56:08

The prosperity of smart mobile devices has made mobile crowdsensing (MCS) a promising paradigm for completing complex sensing and computation tasks. In the past, great efforts have been made in the design of incentive mechanisms and task allocation strategies from the MCS platform's perspective to motivate mobile users' participation. However, in practice, MCS participants face many uncertainties coming from their sensing environment as well as other participants' strategies, and how they interact with each other and make sensing decisions is not well understood. In this paper, we take the MCS participants' perspective to derive an online sensing policy that maximizes their payoffs from MCS participation. Specifically, we model the interactions of mobile users and sensing environments as a multi-agent Markov decision process. Each participant cannot observe others' decisions, but needs to decide her effort level in sensing tasks based only on local information, e.g., her own record of sensed signals' quality. To cope with the stochastic sensing environment, we develop an intelligent crowdsensing algorithm, IntelligentCrowd, by leveraging the power of multi-agent reinforcement learning (MARL). Our algorithm leads to the optimal sensing policy for each user to maximize the expected payoff against stochastic sensing environments, and can be implemented at the individual participant's level in a distributed fashion. Numerical simulations demonstrate that IntelligentCrowd significantly improves users' payoffs in sequential MCS tasks under various sensing dynamics.

Developing a self-consistent AGB wind model: I. Chemical, thermal, and dynamical coupling

Authors:Jels Boulangier, Nicola Clementel, Allard Jan van Marle, Leen Decin, Alex de Koter
Date:2018-09-20 11:23:09

The material lost through stellar winds of Asymptotic Giant Branch (AGB) stars is one of the main contributors to the chemical enrichment of galaxies. The general hypothesis of the mass loss mechanism of AGB winds is a combination of stellar pulsations and radiative pressure on dust grains, yet current models still suffer from limitations. Among others, they assume chemical equilibrium of the gas, which may not be justified due to rapid local dynamical changes in the wind. This is important as it is the chemical composition that regulates the thermal structure of the wind, the creation of dust grains in the wind, and ultimately the mass loss by the wind. Using a self-consistent hydrochemical model, we investigated how non-equilibrium chemistry affects the dynamics of the wind. This paper compares a hydrodynamical and a hydrochemical dust-free wind, with focus on the chemical heating and cooling processes. No sustainable wind arises in a purely hydrodynamical model with physically reasonable pulsations. Moreover, temperatures are too high for dust formation to happen, rendering radiative pressure on grains impossible. A hydrochemical wind is even harder to initiate due to efficient chemical cooling. However, temperatures are sufficiently low in dense regions for dust formation to take place. These regions occur close to the star, which is needed for radiation pressure on dust to sufficiently aid in creating a wind. Extending this model self-consistently with dust formation and evolution, and including radiation pressure, will help to understand the mass loss by AGB winds.

Coordination-driven learning in multi-agent problem spaces

Authors:Sean L. Barton, Nicholas R. Waytowich, Derrik E. Asher
Date:2018-09-13 12:44:48

We discuss the role of coordination as a direct learning objective in multi-agent reinforcement learning (MARL) domains. To this end, we present a novel means of quantifying coordination in multi-agent systems, and discuss the implications of using such a measure to optimize coordinated agent policies. This concept has important implications for adversary-aware RL, which we take to be a sub-domain of multi-agent learning.

A Multi-Agent Reinforcement Learning Method for Impression Allocation in Online Display Advertising

Authors:Di Wu, Cheng Chen, Xun Yang, Xiujun Chen, Qing Tan, Jian Xu, Kun Gai
Date:2018-09-10 06:38:22

In online display advertising, guaranteed contracts and real-time bidding (RTB) are two major ways for a publisher to sell impressions. Despite the increasing popularity of RTB, half of online display advertising revenue is still generated from guaranteed contracts. Therefore, simultaneously selling impressions through both guaranteed contracts and RTB is a straightforward choice for a publisher to maximize its yield. However, deriving the optimal strategy to allocate impressions is not a trivial task, especially when the environment is unstable in real-world applications. In this paper, we formulate the impression allocation problem as an auction problem where each contract can submit virtual bids for individual impressions. With this formulation, we derive the optimal impression allocation strategy by solving the optimal bidding functions for contracts. Since the bids from contracts are decided by the publisher, we propose a multi-agent reinforcement learning (MARL) approach to derive cooperative policies for the publisher to maximize its yield in an unstable environment. The proposed approach also resolves common challenges in MARL such as input dimension explosion, reward credit assignment, and non-stationary environments. Experimental evaluations on large-scale real datasets demonstrate the effectiveness of our approach.

MARL-FWC: Optimal Coordination of Freeway Traffic Control Measures

Authors:Ahmed Fares, Walid Gomaa, Mohamed A. Khamis
Date:2018-08-27 08:28:20

The objective of this article is to optimize the overall traffic flow on freeways using multiple ramp metering controls plus complementary Dynamic Speed Limits (DSLs). Optimal freeway operation is reached when the difference between the freeway density and the critical ratio for maximum traffic flow is minimized. In this article, a Multi-Agent Reinforcement Learning for Freeways Control (MARL-FWC) system for ramp metering and DSLs is proposed. MARL-FWC introduces a new microscopic framework at the network level based on collaborative Markov Decision Process modeling (Markov game) and an associated cooperative Q-learning algorithm. The technique incorporates payoff propagation (the Max-Plus algorithm) under the coordination graphs framework, particularly suited for optimal control purposes. MARL-FWC provides three control designs: fully independent, fully distributed, and centralized, suited to different network architectures. MARL-FWC was extensively tested in order to assess the proposed model of the joint payoff as well as the global payoff. Experiments were conducted with heavy traffic flow under the renowned VISSIM traffic simulator to evaluate MARL-FWC. The experimental results show a significant decrease in total travel time and an increase in average speed (compared with the base case) while maintaining optimal traffic flow.

Proton Acceleration in Weak Quasi-parallel Intracluster Shocks: Injection and Early Acceleration

Authors:Ji-Hoon Ha, Dongsu Ryu, Hyesung Kang, Allard Jan van Marle
Date:2018-07-25 00:58:01

Collisionless shocks with low sonic Mach numbers, $M_{\rm s} \lesssim 4$, are expected to accelerate cosmic ray (CR) protons via diffusive shock acceleration (DSA) in the intracluster medium (ICM). However, observational evidence for CR protons in the ICM has yet to be established. Performing particle-in-cell simulations, we study the injection of protons into DSA and the early development of a nonthermal particle population in weak shocks in high $\beta$ ($\approx 100$) plasmas. Reflection of incident protons, self-excitation of plasma waves via CR-driven instabilities, and multiple cycles of shock drift acceleration are essential to the early acceleration of CR protons in supercritical quasi-parallel shocks. We find that only in ICM shocks with $M_{\rm s} \gtrsim M_{\rm s}^*\approx 2.25$, a sufficient fraction of incoming protons are reflected by the overshoot in the shock electric potential and magnetic mirror at locally perpendicular magnetic fields, leading to efficient excitation of magnetic waves via CR streaming instabilities and the injection into the DSA process. Since a significant fraction of ICM shocks have $M_{\rm s} < M_{\rm s}^*$, CR proton acceleration in the ICM might be less efficient than previously expected. This may explain why the diffuse gamma-ray emission from galaxy clusters due to proton-proton collisions has not been detected so far.

Learning Existing Social Conventions via Observationally Augmented Self-Play

Authors:Adam Lerer, Alexander Peysakhovich
Date:2018-06-26 15:46:44

In order for artificial agents to coordinate effectively with people, they must act consistently with existing conventions (e.g. how to navigate in traffic, which language to speak, or how to coordinate with teammates). A group's conventions can be viewed as a choice of equilibrium in a coordination game. We consider the problem of an agent learning a policy for a coordination game in a simulated environment and then using this policy when it enters an existing group. When there are multiple possible conventions we show that learning a policy via multi-agent reinforcement learning (MARL) is likely to find policies which achieve high payoffs at training time but fail to coordinate with the real group into which the agent enters. We assume access to a small number of samples of behavior from the true convention and show that we can augment the MARL objective to help it find policies consistent with the real group's convention. In three environments from the literature - traffic, communication, and team coordination - we observe that augmenting MARL with a small amount of imitation learning greatly increases the probability that the strategy found by MARL fits well with the existing social convention. We show that this works even in an environment where standard training methods very rarely find the true convention of the agent's partners.
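
A sketch of the augmented objective as we read it: the usual policy-gradient loss plus a small imitation term fit on samples of the true convention (the weight lam and the exact loss forms below are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

# MARL objective augmented with behavioral cloning on convention data.
def augmented_loss(logits, actions, advantages, demo_logits, demo_actions, lam=0.1):
    logp = F.log_softmax(logits, dim=-1)
    pg = -(logp[torch.arange(len(actions)), actions] * advantages).mean()
    bc = F.cross_entropy(demo_logits, demo_actions)  # fit observed conventions
    return pg + lam * bc

logits = torch.randn(16, 4, requires_grad=True)
demo_logits = torch.randn(8, 4, requires_grad=True)
loss = augmented_loss(logits, torch.randint(4, (16,)), torch.randn(16),
                      demo_logits, torch.randint(4, (8,)))
print(loss.item())
```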

Multi-Agent Reinforcement Learning via Double Averaging Primal-Dual Optimization

Authors:Hoi-To Wai, Zhuoran Yang, Zhaoran Wang, Mingyi Hong
Date:2018-06-03 21:07:34

Despite the success of single-agent reinforcement learning, multi-agent reinforcement learning (MARL) remains challenging due to complex interactions between agents. Motivated by decentralized applications such as sensor networks, swarm robotics, and power grids, we study policy evaluation in MARL, where agents with jointly observed state-action pairs and private local rewards collaborate to learn the value of a given policy. In this paper, we propose a double averaging scheme, where each agent iteratively performs averaging over both space and time to incorporate neighboring gradient information and local reward information, respectively. We prove that the proposed algorithm converges to the optimal solution at a global geometric rate. In particular, such an algorithm is built upon a primal-dual reformulation of the mean squared projected Bellman error minimization problem, which gives rise to a decentralized convex-concave saddle-point problem. To the best of our knowledge, the proposed double averaging primal-dual optimization algorithm is the first to achieve fast finite-time convergence on decentralized convex-concave saddle-point problems.

GraphIt: A High-Performance DSL for Graph Analytics

Authors:Yunming Zhang, Mengjiao Yang, Riyadh Baghdadi, Shoaib Kamil, Julian Shun, Saman Amarasinghe
Date:2018-05-02 17:38:35

The performance bottlenecks of graph applications depend not only on the algorithm and the underlying hardware, but also on the size and structure of the input graph. Programmers must try different combinations of a large set of techniques to develop the best implementation for a specific algorithm and type of graph. Existing graph frameworks lack flexibility, supporting only a limited set of optimizations. This paper introduces GraphIt, a new DSL for graph computations that generates fast implementations for algorithms with different performance characteristics running on graphs with different sizes and structures. GraphIt separates what is computed (algorithm) from how it is computed (schedule). Programmers specify the algorithm using an algorithm language, and performance optimizations are specified using a scheduling language. The algorithm language simplifies expressing the algorithms. We formulate graph optimizations, including edge traversal direction, data layout, parallelization, cache, NUMA, and kernel fusion optimizations, as tradeoffs among locality, parallelism, and work-efficiency. The scheduling language enables programmers to easily search through this complicated tradeoff space by composing together optimizations. We also built an autotuner to automatically find high-performance schedules. The compiler uses a new scheduling representation, the graph iteration space, to model, compose, and ensure the validity of the large number of optimizations. GraphIt outperforms the next fastest of six state-of-the-art shared-memory frameworks (Ligra, Green-Marl, GraphMat, Galois, Gemini, and Grazelle) on 24 out of 32 experiments by up to 4.8$\times$, and is never more than 43% slower than the fastest framework on the other experiments. GraphIt also reduces the lines of code by up to an order of magnitude compared to the next fastest framework.

Entropy based Independent Learning in Anonymous Multi-Agent Settings

Authors:Tanvi Verma, Pradeep Varakantham, Hoong Chuin Lau
Date:2018-03-27 07:10:20

Efficient sequential matching of supply and demand is a problem of interest in many online-to-offline services: for instance, Uber, Lyft, and Grab match taxis to customers, while UberEats, Deliveroo, FoodPanda, etc. match restaurants to customers. In these online-to-offline service problems, individuals who are responsible for supply (e.g., taxi drivers, delivery bikes or delivery van drivers) earn more by being at the "right" place at the "right" time. We are interested in developing approaches that learn to guide individuals to be in the "right" place at the "right" time (to maximize revenue) in the presence of other similar "learning" individuals, with only local aggregated observation of other agents' states (e.g., only the number of other taxis in the same zone as the current agent). A key characteristic of the domains of interest is that the interactions between individuals are anonymous, i.e., the outcome of an interaction (competing for demand) depends only on the number and not on the identity of the agents. We model these problems using the Anonymous MARL (AyMARL) model. The key contribution of this paper is in employing the principle of maximum entropy to provide a general framework for independent learning that is both empirically effective (even with only local aggregated information about the agent population distribution) and theoretically justified. Our approaches provide a significant improvement in joint and individual revenue, on a generic simulator for online-to-offline services and a real-world taxi problem, over existing approaches. More importantly, this is achieved while having the least variance in revenues earned by the learning individuals, an indicator of fairness.

Cooperative and Distributed Reinforcement Learning of Drones for Field Coverage

Authors:Huy Xuan Pham, Hung Manh La, David Feil-Seifer, Aria Nefian
Date:2018-03-20 04:24:23

This paper proposes a distributed Multi-Agent Reinforcement Learning (MARL) algorithm for a team of Unmanned Aerial Vehicles (UAVs). The proposed MARL algorithm allows UAVs to learn cooperatively to provide full coverage of an unknown field of interest while minimizing the overlapping sections among their fields of view. Two challenges in MARL for such a system are discussed in the paper: first, the complex dynamics of the joint actions of the UAV team, which is solved using game-theoretic correlated equilibrium; and second, the huge-dimensional state space representation, which is tackled with efficient function approximation techniques. We also provide detailed experimental results, in both simulation and physical implementation, showing that the UAV team can successfully learn to accomplish the task.

Fully Decentralized Multi-Agent Reinforcement Learning with Networked Agents

Authors:Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, Tamer Başar
Date:2018-02-23 22:53:32

We consider the problem of \emph{fully decentralized} multi-agent reinforcement learning (MARL), where the agents are located at the nodes of a time-varying communication network. Specifically, we assume that the reward functions of the agents might correspond to different tasks, and are only known to the corresponding agent. Moreover, each agent makes individual decisions based on both the information observed locally and the messages received from its neighbors over the network. Within this setting, the collective goal of the agents is to maximize the globally averaged return over the network through exchanging information with their neighbors. To this end, we propose two decentralized actor-critic algorithms with function approximation, which are applicable to large-scale MARL problems where both the number of states and the number of agents are massively large. Under the decentralized structure, the actor step is performed individually by each agent with no need to infer the policies of others. For the critic step, we propose a consensus update via communication over the network. Our algorithms are fully incremental and can be implemented in an online fashion. Convergence analyses of the algorithms are provided when the value functions are approximated within the class of linear functions. Extensive simulation results with both linear and nonlinear function approximations are presented to validate the proposed algorithms. Our work appears to be the first study of fully decentralized MARL algorithms for networked agents with function approximation, with provable convergence guarantees.
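
The critic consensus step can be sketched as follows (a deliberate simplification with linear parameters and fixed doubly stochastic mixing weights; the paper's algorithms are incremental actor-critic updates with function approximation):

```python
import numpy as np

# Consensus critic update: each agent first takes a local TD step on its
# value parameters, then averages parameters with its network neighbors,
# so information diffuses without any central node.
n_agents, dim = 4, 6
w = np.random.randn(n_agents, dim)                  # per-agent critic weights
C = np.full((n_agents, n_agents), 1.0 / n_agents)   # e.g., complete graph

def consensus_step(w, local_grads, lr=0.01):
    w_half = w - lr * local_grads   # local TD/critic step
    return C @ w_half               # mix with neighbors' parameters

w = consensus_step(w, local_grads=np.random.randn(n_agents, dim))
```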

Collective classical motion on hyperbolic spacetimes of any dimensions

Authors:Ion I. Cotaescu
Date:2018-02-06 14:23:49

The geodesic equations on de Sitter and anti-de Sitter spacetimes of any dimension are the starting point for deriving the general form of the Boltzmann equation in terms of conserved quantities. The simple equations for the non-equilibrium Marle and Anderson-Witting models are derived, and the distributions of the Boltzmann-Marle model on these manifolds are written down, first in terms of conserved quantities and then as functions of canonical variables.

Stationary solutions to the boundary value problem for relativistic BGK model in a slab

Authors:Byung-Hoon Hwang, Seok-Bae Yun
Date:2018-01-25 12:47:11

In this paper, we are concerned with the boundary value problem in a slab for the stationary relativistic BGK model of Marle type, which is a relaxation model of the relativistic Boltzmann equation. In the case of fixed inflow boundary conditions, we establish the existence of unique stationary solutions.

Routing Networks: Adaptive Selection of Non-linear Functions for Multi-Task Learning

Authors:Clemens Rosenbaum, Tim Klinger, Matthew Riemer
Date:2017-11-03 17:07:51

Multi-task learning (MTL) with neural networks leverages commonalities in tasks to improve performance, but often suffers from task interference, which reduces the benefits of transfer. To address this issue we introduce the routing network paradigm, a novel neural network and training algorithm. A routing network is a kind of self-organizing neural network consisting of two components: a router and a set of one or more function blocks. A function block may be any neural network - for example a fully-connected or a convolutional layer. Given an input, the router makes a routing decision, choosing a function block to apply and passing the output back to the router recursively, terminating when a fixed recursion depth is reached. In this way the routing network dynamically composes different function blocks for each input. We employ a collaborative multi-agent reinforcement learning (MARL) approach to jointly train the router and function blocks. We evaluate our model against cross-stitch networks and shared-layer baselines on multi-task settings of the MNIST, mini-imagenet, and CIFAR-100 datasets. Our experiments demonstrate a significant improvement in accuracy, with sharper convergence. In addition, routing networks have nearly constant per-task training cost, while cross-stitch networks scale linearly with the number of tasks. On CIFAR-100 (20 tasks) we obtain cross-stitch performance levels with an 85% reduction in training time.
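
A minimal sketch of the routing-network forward pass as we read it from this description (the block shapes, greedy router, and recursion depth below are illustrative assumptions, not the paper's architecture):

# Sketch of a routing-network forward pass: a router repeatedly picks one
# function block to apply, up to a fixed recursion depth.
import numpy as np

rng = np.random.default_rng(1)
DIM, DEPTH, N_BLOCKS = 16, 3, 4
blocks = [rng.normal(scale=0.1, size=(DIM, DIM)) for _ in range(N_BLOCKS)]
router_w = rng.normal(scale=0.1, size=(DIM, N_BLOCKS))

def forward(x):
    route = []
    for _ in range(DEPTH):                      # fixed recursion depth
        choice = int(np.argmax(x @ router_w))   # router picks a block
        route.append(choice)
        x = np.tanh(x @ blocks[choice])         # apply chosen function block
    return x, route                             # route is what MARL trains

y, route = forward(rng.normal(size=DIM))
print("blocks chosen along the route:", route)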

A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning

Authors:Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Perolat, David Silver, Thore Graepel
Date:2017-11-02 17:34:24

To achieve general intelligence, agents must learn how to interact with others in a shared environment: this is the challenge of multiagent reinforcement learning (MARL). The simplest form is independent reinforcement learning (InRL), where each agent treats its experience as part of its (non-stationary) environment. In this paper, we first observe that policies learned using InRL can overfit to the other agents' policies during training, failing to sufficiently generalize during execution. We introduce a new metric, joint-policy correlation, to quantify this effect. We describe an algorithm for general MARL, based on approximate best responses to mixtures of policies generated using deep reinforcement learning, and empirical game-theoretic analysis to compute meta-strategies for policy selection. The algorithm generalizes previous ones such as InRL, iterated best response, double oracle, and fictitious play. Then, we present a scalable implementation which reduces the memory requirement using decoupled meta-solvers. Finally, we demonstrate the generality of the resulting policies in two partially observable settings: gridworld coordination games and poker.
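
The joint-policy correlation metric compares returns of co-trained policy pairs against pairs drawn from independent training runs; a minimal bookkeeping sketch (the return matrix here is illustrative, not data from the paper):

# Sketch of joint-policy correlation from a matrix R of average returns,
# where R[i, j] pairs player 1's policy from run i with player 2's from run j.
import numpy as np

R = np.array([[10.0, 4.0, 5.0],
              [ 3.0, 9.0, 4.0],
              [ 5.0, 3.0, 11.0]])                  # illustrative returns

diag = np.mean(np.diag(R))                         # co-trained pairs
off = np.mean(R[~np.eye(len(R), dtype=bool)])      # cross-run pairs
jpc = (diag - off) / diag                          # proportional drop when mixing
print(f"average proportional loss from mixing runs: {jpc:.2%}")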

On magnetic field amplification and particle acceleration near non-relativistic astrophysical shocks: Particles in MHD Cells simulations

Authors:Allard Jan van Marle, Fabien Casse, Alexandre Marcowith
Date:2017-09-25 13:38:01

We present simulations of magnetized astrophysical shocks taking into account the interplay between the thermal plasma of the shock and supra-thermal particles. Such interaction is depicted by combining a grid-based magneto-hydrodynamics description of the thermal fluid with particle-in-cell techniques devoted to the dynamics of supra-thermal particles. This approach, which incorporates the use of adaptive mesh refinement features, is potentially a key to simulating astrophysical systems on spatial scales that are beyond the reach of pure particle-in-cell simulations. We consider in this study non-relativistic shocks with various Alfvenic Mach numbers and magnetic field obliquities. We recover all the features of both magnetic field amplification and particle acceleration from previous studies when the magnetic field is parallel to the normal to the shock. In contrast with previous particle-in-cell-hybrid simulations, we find that particle acceleration and magnetic field amplification also occur when the magnetic field is oblique to the normal to the shock, but on larger timescales than in the parallel case. We show that in our simulations the supra-thermal particles experience acceleration thanks to a pre-heating process of the particles, similar to shock drift acceleration, which leads to the corrugation of the shock front. Such oscillations of the shock front and the magnetic field locally help the particles to enter the upstream region, to initiate a non-resonant streaming instability, and finally to induce diffuse particle acceleration.

Analysing Congestion Problems in Multi-agent Reinforcement Learning

Authors:Roxana Rădulescu, Peter Vrancx, Ann Nowé
Date:2017-02-28 10:49:36

Congestion problems are omnipresent in today's complex networks and represent a challenge in many research domains. In the context of Multi-agent Reinforcement Learning (MARL), approaches like difference rewards and resource abstraction have shown promising results in tackling such problems. Resource abstraction was shown to be an ideal candidate for solving large-scale resource allocation problems in a fully decentralized manner. However, its performance and applicability strongly depend on some, until now, undocumented assumptions. Two of the main congestion benchmark problems considered in the literature are the Beach Problem Domain and the Traffic Lane Domain. In both settings the highest system utility is achieved when overcrowding one resource and keeping the rest at optimum capacity. We analyse how abstract grouping can promote this behaviour and how feasible it is to apply this approach in a real-world domain (i.e., what assumptions need to be satisfied and what knowledge is necessary). We introduce a new test problem, the Road Network Domain (RND), where the resources are no longer independent but rather part of a network (e.g., a road network), so that choosing one path will also impact the load on other paths that share common road segments. We demonstrate the application of state-of-the-art MARL methods to this new congestion model and analyse their performance. RND allows us to highlight an important limitation of resource abstraction and to show that the difference rewards approach manages to better capture and inform the agents about the dynamics of the environment.
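
For concreteness, the difference-rewards signal mentioned above is usually written D_i = G(z) - G(z_{-i}), i.e. agent i's marginal contribution to the global utility G; a toy sketch on a congestion problem (the utility curve is an illustrative assumption, not the benchmark's exact form):

# Sketch of difference rewards D_i = G(z) - G(z_{-i}) on a toy congestion
# problem where each agent picks one resource.
import math
from collections import Counter

def resource_utility(attendance, capacity=3):
    # Illustrative congestion curve: utility peaks near capacity.
    return attendance * math.exp(-attendance / capacity)

def global_utility(choices):
    return sum(resource_utility(n) for n in Counter(choices).values())

def difference_reward(choices, i):
    without_i = choices[:i] + choices[i + 1:]   # remove agent i's impact
    return global_utility(choices) - global_utility(without_i)

choices = ["A", "A", "B", "A", "C"]             # each agent picks a resource
print([round(difference_reward(choices, i), 3) for i in range(len(choices))])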

The works of William Rowan Hamilton in geometrical optics and the Malus-Dupin theorem

Authors:Charles-Michel Marle
Date:2017-02-18 18:38:25

The works of William Rowan Hamilton in Geometrical Optics are presented, with emphasis on the Malus-Dupin theorem. According to that theorem, a family of light rays depending on two parameters can be focused to a single point by an optical instrument made of reflecting or refracting surfaces if and only if, before entering the optical instrument, the family of rays is rectangular (\emph{i.e.}, admits orthogonal surfaces). Moreover, that theorem states that a rectangular system of rays remains rectangular after an arbitrary number of reflections through, or refractions across, smooth surfaces of arbitrary shape. The original proof of that theorem due to Hamilton is presented, along with another proof founded on symplectic geometry. It was the proof of that theorem which led Hamilton to introduce his \emph{characteristic function} in Optics, then in Dynamics under the name \emph{action integral}.

A molecular dynamics approach to dissipative relativistic hydrodynamics: propagation of fluctuations

Authors:Leila Shahsavar, Malihe Ghodrat, Afshin Montakhab
Date:2016-11-12 16:50:03

Relativistic generalization of hydrodynamic theory has attracted much attention from a theoretical point of view. It also has many important practical applications in high-energy as well as astrophysical contexts. Despite various attempts to formulate relativistic hydrodynamics, no definitive consensus has been achieved. In this work, we propose to test the predictions of four types of \emph{first-order} hydrodynamic theories for non-perfect fluids in the light of numerically exact molecular dynamics simulations of a fully relativistic particle system in the low density regime. In this regard, we study the propagation of density, velocity and heat fluctuations in a wide range of temperatures using extensive simulations and compare them to the corresponding analytic expressions we obtain for each of the proposed theories. As expected, in the low-temperature classical regime all theories give the same results, consistent with the numerics. In the high-temperature, extremely relativistic regime, not all of the considered theories are distinguishable from one another. However, in the intermediate regime, a meaningful distinction exists among the predictions of the various theories considered here. We find that the predictions of the recent formulation due to Tsumura-Kunihiro-Ohnishi are more consistent with our numerical results than the traditional theories due to Meixner, modified Eckart and modified Marle-Stewart.

The Deep and Transient Universe in the SVOM Era: New Challenges and Opportunities - Scientific prospects of the SVOM mission

Authors:J. Wei, B. Cordier, S. Antier, P. Antilogus, J. -L. Atteia, A. Bajat, S. Basa, V. Beckmann, M. G. Bernardini, S. Boissier, L. Bouchet, V. Burwitz, A. Claret, Z. -G. Dai, F. Daigne, J. Deng, D. Dornic, H. Feng, T. Foglizzo, H. Gao, N. Gehrels, O. Godet, A. Goldwurm, F. Gonzalez, L. Gosset, D. Götz, C. Gouiffes, F. Grise, A. Gros, J. Guilet, X. Han, M. Huang, Y. -F. Huang, M. Jouret, A. Klotz, O. La Marle, C. Lachaud, E. Le Floch, W. Lee, N. Leroy, L. -X. Li, S. C. Li, Z. Li, E. -W. Liang, H. Lyu, K. Mercier, G. Migliori, R. Mochkovitch, P. O'Brien, J. Osborne, J. Paul, E. Perinati, P. Petitjean, F. Piron, Y. Qiu, A. Rau, J. Rodriguez, S. Schanne, N. Tanvir, E. Vangioni, S. Vergani, F. -Y. Wang, J. Wang, X. -G. Wang, X. -Y. Wang, A. Watson, N. Webb, J. J. Wei, R. Willingale, C. Wu, X. -F. Wu, L. -P. Xin, D. Xu, S. Yu, W. -F. Yu, Y. -W. Yu, B. Zhang, S. -N. Zhang, Y. Zhang, X. L. Zhou
Date:2016-10-21 19:03:44

To take advantage of the astrophysical potential of Gamma-Ray Bursts (GRBs), Chinese and French astrophysicists have engaged in the SVOM mission (Space-based multi-band astronomical Variable Objects Monitor). Since major advances in GRB studies result from the synergy between space and ground observations, the SVOM mission implements both space and ground instrumentation. The scientific objectives of the mission put a special emphasis on two categories of GRBs: very distant GRBs at z$>$5, which constitute exceptional cosmological probes, and faint/soft nearby GRBs, which allow probing the nature of the progenitors and the physics at work in the explosion. These goals have a major impact on the design of the mission: the on-board hard X-ray imager is sensitive down to 4 keV and computes on-line image and rate triggers, and the follow-up telescopes on the ground are sensitive in the NIR. At the beginning of the next decade, SVOM will be the main provider of GRB positions and spectral parameters on very short time scales. The SVOM instruments will operate simultaneously with a wide range of powerful astronomical devices. This rare instrumental conjunction, combined with the relevance of the scientific topics connected with GRB studies, warrants a remarkable scientific return for SVOM. In addition, the SVOM instrumentation, primarily designed for GRB studies, composes a unique multi-wavelength observatory with rapid slew capability that will find multiple applications for the whole astronomy community beyond the specific objectives linked to GRBs. This report lists the scientific themes that will benefit from observations made with SVOM, whether they are specific GRB topics or, more generally, all the issues that can take advantage of the multi-wavelength capabilities of SVOM.

From tools in symplectic and Poisson geometry to Souriau's theories of statistical mechanics and thermodynamics

Authors:Charles-Michel Marle
Date:2016-07-30 11:37:22

I present in this paper some tools in Symplectic and Poisson Geometry in view of their applications in Geometric mechanics and Mathematical Physics. After a short discussion of the Lagrangian and Hamiltonian formalisms, including the use of symmetry groups, and a presentation of the Tulczyjew's isomorphisms (which explain some aspects of the relations between these formalisms), I explain the concept of manifold of motions of a mechanical system and its use, due to J.-M. Souriau, in Statistical Mechanics and Thermodynamics. The generalization of the notion of thermodynamic equilibrium in which the one-dimensional group of time translations is replaced by a multi-dimensional, maybe non-commutative Lie group, is discussed and examples of applications in Physics are given.
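
Souriau's generalized Gibbs states alluded to here replace the scalar inverse temperature by an element $\beta$ of the Lie algebra of the symmetry group; schematically (our paraphrase of the standard formulation, not a formula quoted from the paper):

\[ \rho_\beta(x) = \frac{1}{P(\beta)}\, e^{-\langle J(x),\,\beta\rangle}, \qquad P(\beta) = \int_M e^{-\langle J(x),\,\beta\rangle}\, d\lambda(x), \]

where $J : M \to \mathfrak{g}^*$ is the momentum map of the Hamiltonian action on the manifold of motions $M$; the classical equilibrium is recovered when $\beta$ generates the one-dimensional group of time translations.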

Pinwheels in the sky, with dust: 3D modeling of the Wolf-Rayet 98a environment

Authors:Tom Hendrix, Rony Keppens, Allard Jan van Marle, Peter Camps, Maarten Baes, Zakaria Meliani
Date:2016-05-30 14:01:16

The Wolf-Rayet 98a (WR 98a) system is a prime target for interferometric surveys since its identification as a "rotating pinwheel nebula", whose infrared images display a spiral dust lane revolving with a 1.4 year periodicity. WR 98a hosts a WC9+OB star, and the presence of dust is puzzling given the extreme luminosities of Wolf-Rayet stars. We present 3D hydrodynamic models for WR 98a, where dust creation and redistribution are self-consistently incorporated. Our grid-adaptive simulations resolve details in the wind collision region at scales below one percent of the orbital separation (~4 AU), while simulating up to 1300 AU. We cover several orbital periods under conditions where the gas component alone behaves adiabatically, or is subject to effective radiative cooling. In the adiabatic case, mixing between the stellar winds is effective in a well-defined spiral pattern, where optimal conditions for dust creation are met. When radiative cooling is incorporated, the interaction becomes dominated by thermal instabilities along the wind collision region, and dust concentrates in clumps and filaments in a volume-filling fashion, so WR 98a must evolve nearly adiabatically to display the rotating pinwheel structure. We mimic Keck, ALMA or future E-ELT observations and confront them with long-term photometric monitoring. We predict an asymmetry in the dust distribution between the leading and trailing edge of the spiral, show that ALMA and the E-ELT would be able to detect fine-structure in the spiral indicative of Kelvin-Helmholtz development, and confirm the variation in photometry due to the orientation. Historic Keck images are reproduced, but their resolution is insufficient to detect the details we predict.

Maxwell-Juttner distribution for rigidly-rotating flows in spherically symmetric spacetimes using the tetrad formalism

Authors:Victor E. Ambrus, Ion I. Cotaescu
Date:2016-05-23 14:55:16

We consider rigidly rotating states in thermal equilibrium on static spherically symmetric spacetimes. Using the Maxwell-Juttner equilibrium distribution function, constructed as a solution of the relativistic Boltzmann equation, the equilibrium particle flow four-vector, stress-energy tensor and the transport coefficients in the Marle model are computed. Their properties are discussed in view of the topology of the speed-of-light surface induced by the rotation for two classes of spacetimes: maximally symmetric (Minkowski, de Sitter and anti-de Sitter) and charged (Reissner-Nordstrom) black-hole spacetimes. To facilitate our analysis, we employ a non-holonomic comoving tetrad field, obtained unambiguously by applying a Lorentz boost on a fixed background tetrad.
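
For reference, the Maxwell-Juttner distribution used here takes the standard form (our summary, up to metric-signature conventions; not quoted from the paper):

\[ f^{(\mathrm{eq})}(x, p) = \frac{n}{4\pi m^2 c\, k_B T\, K_2\!\left(\frac{m c^2}{k_B T}\right)}\, \exp\!\left(-\frac{p^\mu u_\mu}{k_B T}\right), \]

with $n$ the particle density, $T$ the temperature, $u_\mu$ the macroscopic four-velocity, and $K_2$ a modified Bessel function of the second kind.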

Feature Selection as a Multiagent Coordination Problem

Authors:Kleanthis Malialis, Jun Wang, Gary Brooks, George Frangou
Date:2016-03-16 15:49:37

Datasets with hundreds to tens of thousands of features are the new norm. Feature selection constitutes a central problem in machine learning, where the aim is to derive a representative set of features from which to construct a classification (or prediction) model for a specific task. Our experimental study involves microarray gene expression datasets; these are high-dimensional and noisy datasets that contain genetic data typically used for distinguishing between benign and malignant tissues or classifying different types of cancer. In this paper, we formulate feature selection as a multiagent coordination problem and propose a novel feature selection method using multiagent reinforcement learning. The central idea of the proposed approach is to "assign" a reinforcement learning agent to each feature, where each agent learns to control a single feature; we refer to this approach as MARL. Applying this to microarray datasets creates an enormous multiagent coordination problem between thousands of learning agents. To address the scalability challenge we apply a form of reward shaping called CLEAN rewards. We compare in total nine feature selection methods, including state-of-the-art methods, and show that the proposed method using CLEAN rewards can significantly scale up, thus outperforming the rest of the learning-based methods. We further show that a hybrid variant of MARL achieves the best overall performance.
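
A minimal toy reconstruction of the one-agent-per-feature formulation (our assumptions: independent learners sharing a global validation-accuracy reward; the CLEAN-rewards shaping used in the paper is omitted):

# Sketch: one binary agent per feature chooses include/exclude, and all
# agents share a global reward (cross-validated accuracy). Illustrative only.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
n_feat = X.shape[1]
q = np.zeros((n_feat, 2))                 # Q[f, 0]=exclude, Q[f, 1]=include
eps, lr, rng = 0.2, 0.1, np.random.default_rng(0)

for episode in range(30):
    acts = np.where(rng.random(n_feat) < eps,
                    rng.integers(0, 2, n_feat), q.argmax(axis=1))
    if acts.sum() == 0:
        acts[rng.integers(n_feat)] = 1    # keep at least one feature
    mask = acts.astype(bool)
    reward = cross_val_score(LogisticRegression(max_iter=2000),
                             X[:, mask], y, cv=3).mean()
    for f in range(n_feat):               # every agent gets the global reward
        q[f, acts[f]] += lr * (reward - q[f, acts[f]])

print("selected features:", np.flatnonzero(q.argmax(axis=1)))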

On the observability of bow shocks of Galactic runaway OB stars

Authors:D. M. -A. Meyer, A. -J. van Marle, R. Kuiper, W. Kley
Date:2016-03-15 16:42:30

Massive stars that have been ejected from their parent cluster and are sailing away supersonically through the interstellar medium (ISM) are classified as exiled. They generate circumstellar bow shock nebulae that can be observed. We present two-dimensional, axisymmetric hydrodynamical simulations of a representative sample of stellar wind bow shocks from Galactic OB stars in an ambient medium of densities ranging from n_ISM=0.01 up to 10.0/cm3. Independently of their location in the Galaxy, we confirm that the infrared is the most appropriate waveband in which to search for bow shocks from massive stars. Their spectral energy distribution is a convenient tool for analyzing them, since their emission does not depend on the temporary effects which can affect unstable, thin-shelled bow shocks. Our numerical models of Galactic bow shocks generated by high-mass (~40 Mo) runaway stars yield H$\alpha$ fluxes which could be observed by facilities such as the SuperCOSMOS H-Alpha Survey. The brightest bow shock nebulae are produced in the denser regions of the ISM. We predict that bow shocks in the field observed at H$\alpha$ by means of Rayleigh-sensitive facilities are formed around stars of initial mass larger than about 20 Mo. Our models of bow shocks from OB stars have their emission maximum in the wavelength range 3 <= lambda <= 50 micrometer, which can be up to several orders of magnitude brighter than the runaway stars themselves, particularly for stars of initial mass larger than 20 Mo.

A Hydrodynamic Approach to the Study of Anisotropic Instabilities in Dissipative Relativistic Plasmas

Authors:Esteban Calzetta, Alejandra Kandus
Date:2016-02-04 10:32:50

We develop a purely hydrodynamic formalism to describe collisional, anisotropic instabilities in a relativistic plasma that are usually described with kinetic theory tools. Our main motivation is the fact that coarse-grained models of high particle number systems give clearer and more comprehensive physical descriptions of those systems than purely kinetic approaches, and can be more easily tested experimentally as well as numerically. In particular, we aim at developing a theory that describes both background non-equilibrium fluid configurations and their perturbations, to be able to account for the backreaction of the latter on the former. Our system of equations includes the usual conservation laws for the energy-momentum tensor and for the electric current, and the equations for two new tensors that encode the information about dissipation. To make contact with kinetic theory, we write the different tensors as the moments of a non-equilibrium one-particle distribution function (1pdf) which, for illustrative purposes, we take in the form of a Grad-like ansatz. Although this choice limits the applicability of the formalism to states not far from equilibrium, it retains the main features of the underlying kinetic theory. We assume the validity of the Vlasov-Boltzmann equation, with a collision integral given by the Anderson-Witting prescription, which is more suitable for highly relativistic systems than Marle's (or Bhatnagar, Gross and Krook's) form, and derive the conservation laws by taking its corresponding moments. We apply our developments to study the emergence of instabilities in an anisotropic, but axially symmetric, background. For small departures from isotropy we find the dispersion relation for normal modes, which admit unstable solutions in a wide region of parameter space.

Shape and evolution of wind-blown bubbles of massive stars: on the effect of the interstellar magnetic field

Authors:Allard Jan van Marle, Zakaria Meliani, Alexandre Marcowith
Date:2015-09-01 09:34:07

The winds of massive stars create large (>10 pc) bubbles around their progenitors. As these bubbles expand they encounter the interstellar coherent magnetic field which, depending on its strength, can influence the shape of the bubble. We wish to investigate if, and how much, the interstellar magnetic field can contribute to the shape of an expanding circumstellar bubble around a massive star. We use the MPI-AMRVAC code to make magneto-hydrodynamical simulations of bubbles, using a single star model, combined with several different field strengths: B = 5, 10, and 20 muG for the interstellar magnetic field. This covers the typical field strengths of the interstellar magnetic fields found in the galactic disk and bulge. Furthermore, we present two simulations that include both a 5 muG interstellar magnetic field and a 10,000 K interstellar medium and two different ISM densities to demonstrate how the magnetic field can combine with other external factors to influence the morphology of the circumstellar bubbles. Our results show that low magnetic fields, as found in the galactic disk, inhibit the growth of the circumstellar bubbles in the direction perpendicular to the field. As a result, the bubbles become ovoid, rather than spherical. Strong interstellar fields, such as observed for the galactic bulge, can completely stop the expansion of the bubble in the direction perpendicular to the field, leading to the formation of a tube-like bubble. When combined with a warm, high-density ISM the bubble is greatly reduced in size, causing a dramatic change in the evolution of temporary features inside the bubble. The magnetic field of the interstellar medium can affect the shape of circumstellar bubbles. This effect may have consequences for the shape and evolution of circumstellar nebulae and supernova remnants, which are formed within the main wind-blown bubble.

Multi-agent Reinforcement Learning with Sparse Interactions by Negotiation and Knowledge Transfer

Authors:Luowei Zhou, Pei Yang, Chunlin Chen, Yang Gao
Date:2015-08-21 16:30:25

Reinforcement learning has significant applications for multi-agent systems, especially in unknown dynamic environments. However, most multi-agent reinforcement learning (MARL) algorithms suffer from problems such as exponential computational complexity in the joint state-action space, which makes it difficult to scale up to realistic multi-agent problems. In this paper, a novel algorithm named negotiation-based MARL with sparse interactions (NegoSI) is presented. In contrast to traditional sparse-interaction based MARL algorithms, NegoSI adopts the equilibrium concept and makes it possible for agents to select the non-strict Equilibrium Dominating Strategy Profile (non-strict EDSP) or Meta equilibrium for their joint actions. The presented NegoSI algorithm consists of four parts: the equilibrium-based framework for sparse interactions, the negotiation for the equilibrium set, the minimum variance method for selecting one joint action, and the knowledge transfer of local Q-values. In this integrated algorithm, three techniques, i.e., unshared value functions, equilibrium solutions and sparse interactions, are adopted to achieve privacy protection, better coordination and lower computational complexity, respectively. To evaluate the performance of the presented NegoSI algorithm, two groups of experiments are carried out regarding three criteria: steps of each episode (SEE), rewards of each episode (REE) and average runtime (AR). The first group of experiments is conducted using six grid world games and shows fast convergence and high scalability of the presented algorithm. In the second group of experiments, NegoSI is applied to an intelligent warehouse problem, and simulated results demonstrate the effectiveness of the presented NegoSI algorithm compared with other state-of-the-art MARL algorithms.
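
Purely as a loose, toy skeleton of the control flow suggested by the abstract (the coordination test and the negotiation stand-in below are illustrative placeholders, not the NegoSI algorithm):

# Toy skeleton in the spirit of a sparse-interaction MARL step: act
# independently in most states, coordinate on a joint action otherwise.
import itertools
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 5, 3

class Agent:
    def __init__(self):
        self.q = rng.normal(size=(N_STATES, N_ACTIONS))  # local Q-values
    def greedy_action(self, s):
        return int(self.q[s].argmax())

def coordination_needed(state, agents, gap=0.5):
    # Illustrative test: coordinate only where some agent's Q-values are close.
    return any(np.ptp(a.q[state]) < gap for a in agents)

def sparse_interaction_step(agents, state):
    if not coordination_needed(state, agents):           # sparse interaction:
        return [a.greedy_action(state) for a in agents]  # act independently
    # Negotiation stand-in: among all joint actions, prefer high joint payoff,
    # breaking ties by minimum payoff variance across agents.
    joints = list(itertools.product(range(N_ACTIONS), repeat=len(agents)))
    payoff = lambda j: [a.q[state, act] for a, act in zip(agents, j)]
    best = max(joints, key=lambda j: (sum(payoff(j)), -np.var(payoff(j))))
    return list(best)

print(sparse_interaction_step([Agent(), Agent()], state=2))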

Simplified models of stellar wind anatomy for interpreting high-resolution data: Analytical approach to embedded spiral geometries

Authors:Ward Homan, Leen Decin, Alex de Koter, Allard Jan van Marle, Robin Lombaert, Wouter Vlemmings
Date:2015-04-20 10:08:16

Recent high-resolution observations have shown stellar winds to harbour complexities which strongly deviate from the spherical symmetry generally assumed in standard wind models. One such morphology is the Archimedean spiral, generally believed to be formed by binary interactions, which has been directly observed in multiple sources. We seek to investigate how spiral structures embedded in the spherical outflows of cool stars manifest themselves in the observables. We aim to provide an intuitive bedrock with which upcoming ALMA data can be compared and interpreted. By means of an extended parameter study, we model rotational CO emission from the stellar outflow of asymptotic giant branch stars. To this end, we develop a simplified analytical parametrised description of a 3D spiral structure. This model is embedded into a spherical wind and fed into the 3D radiative transfer code LIME, which produces 3D intensity maps throughout velocity space. Subsequently, we investigate the spectral signature of rotational transitions of CO in the models, as well as the spatial aspect of this emission by means of wide-slit PV diagrams. Additionally, the potential for misinterpretation of the 3D data in a 1D context is quantified. Finally, we simulate ALMA observations to explore the impact of interferometric noise and artifacts on the emission signatures. The spectral signatures of the CO rotational transition v=0 J=3-2 are very efficient at concealing the dual nature of the outflow. Only a select few parameter combinations allow the spectral lines to disclose the presence of the spiral structure. The inability to disentangle the spiral from the spherical signal can result in an incorrect interpretation in a 1D context. Consequently, erroneous mass loss rates would be calculated...

The Hamiltonian Tube Of A Cotangent-Lifted Action

Authors:Miguel Rodriguez-Olmos, Miguel Teixidó-Román
Date:2014-10-14 14:06:35

The Marle-Guillemin-Sternberg (MGS) form is a local model for a neighborhood of an orbit of a Hamiltonian Lie group action on a symplectic manifold. One of the main features of the MGS form is that it simultaneously puts in normal form the existing symplectic structure and momentum map. The main drawback of the MGS form is that it does not have an explicit expression. We obtain an MGS form for cotangent-lifted actions on cotangent bundles that, in addition to its defining features, respects the additional fibered structure present. This model generalizes previous results obtained by T. Schmah for orbits with fully-isotropic momentum. In addition, our construction is explicit up to the integration of a differential equation on $G$. This equation can be easily solved for the groups $SO(3)$ or $SL(2)$, thus giving explicit symplectic coordinates for arbitrary canonical actions of these groups on any cotangent bundle.
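
Schematically, the MGS form models a $G$-invariant neighborhood of the orbit through a point $z$ as (our paraphrase of the standard statement, omitting the normal form of the symplectic structure itself):

\[ Y \simeq G \times_{G_z} \left( \mathfrak{m}^* \times V \right), \]

where $G_z$ is the isotropy subgroup of $z$, $\mathfrak{m}$ a $G_z$-invariant complement of $\mathfrak{g}_z$ in $\mathfrak{g}_\mu$, and $V$ the symplectic slice (symplectic normal space) at $z$.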

Decentralised Multi-Agent Reinforcement Learning for Dynamic and Uncertain Environments

Authors:Andrei Marinescu, Ivana Dusparic, Adam Taylor, Vinny Cahill, Siobhán Clarke
Date:2014-09-16 10:07:52

Multi-Agent Reinforcement Learning (MARL) is a widely used technique for optimization in decentralised control problems. However, most applications of MARL are in static environments, and they are not suitable when agent behaviour and environment conditions are dynamic and uncertain. Addressing uncertainty in such environments remains a challenging problem for MARL-based systems. The dynamic nature of the environment causes previous knowledge of how agents interact to become outdated. Advance knowledge of potential changes, obtained through prediction, significantly supports agents in converging to near-optimal control solutions. In this paper we propose P-MARL, a decentralised MARL algorithm enhanced by a prediction mechanism that provides accurate information regarding upcoming changes in the environment. This prediction is achieved by employing an Artificial Neural Network combined with a Self-Organising Map that detects and matches changes in the environment. The proposed algorithm is validated in a realistic smart-grid scenario, and provides a 92% Pareto-efficient solution to an electric vehicle charging problem.
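
A toy sketch of the predict-then-act loop we read into this description (the moving-average forecaster and threshold policy below are illustrative stand-ins for the paper's ANN + Self-Organising Map pipeline):

# Sketch: forecast the next environment signal, then let each agent act on
# the prediction; everything here is an illustrative stand-in.
import numpy as np

rng = np.random.default_rng(0)
demand = np.sin(np.linspace(0, 8 * np.pi, 200)) + 0.1 * rng.normal(size=200)

def forecast(history, window=5):
    # Stand-in predictor: average of the last few observations.
    return float(np.mean(history[-window:]))

def act(observed, predicted):
    # Toy policy: charge (1) when demand is predicted to drop.
    return 1 if predicted < observed else 0

for t in range(10, 14):
    pred = forecast(demand[:t])
    actions = [act(demand[t - 1], pred) for _ in range(3)]  # 3 agents
    print(f"t={t} predicted={pred:+.2f} actions={actions}")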

A direct proof of Malus' theorem using the symplectic structure of the set of oriented straight lines

Authors:Charles-Michel Marle
Date:2014-09-01 12:21:14

We present a direct proof of Malus' theorem in geometrical Optics, founded on the symplectic structure of the set of all oriented straight lines in a Euclidean affine space.

Eyes in the sky: Interactions between AGB winds and the interstellar magnetic field

Authors:A. J. van Marle, N. L. J. Cox, L. Decin
Date:2014-08-07 08:22:20

We aim to examine the role of the interstellar magnetic field in shaping the extended morphologies of the slow dusty winds of Asymptotic Giant Branch (AGB) stars, in an effort to pinpoint the origin of the so-called eye-shaped CSEs of three carbon-rich AGB stars. In addition, we seek to understand whether this pre-planetary nebula (PN) shaping can be responsible for asymmetries observed in PNe. Hydrodynamical simulations are used to study the effect of typical interstellar magnetic fields on the freely expanding spherical stellar winds as they sweep up the local interstellar medium (ISM). The simulations show that typical Galactic interstellar magnetic fields of 5 to 10 muG are sufficient to alter the spherically expanding shells of AGB stars so that they appear as the characteristic eye shape revealed by far-infrared observations. The typical sizes of the simulated eyes are in accordance with the observed physical sizes. However, the eye shapes are of a transient nature. Depending on the stellar and interstellar conditions, they develop after 20,000 to 200,000 yrs and last for about 50,000 to 500,000 yrs, assuming that the star is at rest relative to the local interstellar medium. Once formed, the eye shape develops lateral outflows parallel to the magnetic field. The "explosion" of a PN in the center of the eye-shaped dust shell gives rise to an asymmetrical nebula with prominent inward-pointing Rayleigh-Taylor instabilities. Interstellar magnetic fields can clearly affect the shaping of wind-ISM interaction shells. The occurrence of the eyes is most strongly influenced by stellar space motion and ISM density. Observability of this transient phase is favoured for lines-of-sight perpendicular to the interstellar magnetic field direction. The simulations indicate that shaping of the pre-PN envelope can strongly affect the shape and size of PNe.

Using numerical models of bow shocks to investigate the circumstellar medium of massive stars

Authors:Allard Jan van Marle, Leen Decin, Nick Cox, Zakaria Meliani
Date:2014-07-07 08:31:26

Many massive stars travel through the interstellar medium (ISM) at supersonic speeds. As a result they form bow shocks at the interface between the stellar wind and the ISM. We use numerical hydrodynamics to reproduce such bow shocks numerically, creating models that can be compared to observations. In this paper we discuss the influence of two physical phenomena, interstellar magnetic fields and the presence of interstellar dust grains, on the observable shape of the bow shocks of massive stars. We find that the interstellar magnetic field, though too weak to restrict the general shape of the bow shock, reduces the size of the instabilities that would otherwise be observed in the bow shock of a red supergiant. The interstellar dust grains, due to their inertia, can penetrate deep into the bow shock structure of a main-sequence O-supergiant, crossing over from the ISM into the stellar wind. Therefore, the dust distribution may not always reflect the morphology of the gas. This is an important consideration for infrared observations, which are dominated by dust emission. Our models clearly show that the bow shocks of massive stars are useful diagnostic tools that can be used to investigate the properties of both the stellar wind and the interstellar medium.

Relativity, the Special Theory, explained to Children (from 7 to 107 years old)

Authors:Charles-Michel Marle
Date:2014-02-01 20:05:14

The author thinks that the main ideas of Relativity Theory can be explained to children (around the age of 15 or 16) without complicated calculations, by using very simple arguments of affine geometry. The proposed approach is presented as a conversation between the author and one of his grand-children. Limited here to the Special Theory, it will be extended to the General Theory elsewhere, as sketched in the conclusion.

Lie, symplectic and Poisson groupoids and their Lie algebroids

Authors:Charles-Michel Marle
Date:2014-02-01 07:37:12

Groupoids are mathematical structures able to describe symmetry properties more general than those described by groups. They were introduced (and named) by H. Brandt in 1926. Around 1950, Charles Ehresmann used groupoids with additional structures (topological and differentiable) as essential tools in topology and differential geometry. In recent years, Mickael Karasev, Alan Weinstein and Stanis{\l}aw Zakrzewski independently discovered that symplectic groupoids can be used for the construction of noncommutative deformations of the algebra of smooth functions on a manifold, with potential applications to quantization. Poisson groupoids were introduced by Alan Weinstein as generalizations of both Poisson Lie groups and symplectic groupoids. We present here the main definitions and first properties relative to groupoids, Lie groupoids, Lie algebroids, symplectic and Poisson groupoids and their Lie algebroids.

The works of Charles Ehresmann on connections: from Cartan connections to connections on fibre bundles

Authors:Charles-Michel Marle
Date:2014-01-31 19:53:04

Around 1923, Elie Cartan introduced affine connections on manifolds and defined the main related concepts: torsion, curvature, holonomy groups. He discussed applications of these concepts in Classical and Relativistic Mechanics; in particular he explained how parallel transport with respect to a connection can be related to the principle of inertia in Galilean Mechanics and, more generally, can be used to model the motion of a particle in a gravitational field. In subsequent papers, Elie Cartan extended these concepts for other types of connections on a manifold: Euclidean, Galilean and Minkowskian connections, which can be considered as special types of affine connections, the group of affine transformations of the affine tangent space being replaced by a suitable subgroup; and more generally, conformal and projective connections, associated to a group which is no longer a subgroup of the affine group. Around 1950, Charles Ehresmann introduced connections on a fibre bundle and, when the bundle has a Lie group as structure group, connection forms on the associated principal bundle, with values in the Lie algebra of the structure group. He called Cartan connections the various types of connections on a manifold previously introduced by E. Cartan, and explained how they can be considered as special cases of connections on a fibre bundle with a Lie group G as structure group: the standard fibre of the bundle is then a homogeneous space G/G'; its dimension is equal to that of the base manifold; a Cartan connection determines an isomorphism of the vector bundle tangent to the base manifold onto the vector bundle of vertical vectors tangent to the fibres of the bundle along a global section. These works are reviewed and some applications of the theory of connections are sketched.

Symmetries of Hamiltonian systems on symplectic and Poisson manifolds

Authors:Charles-Michel Marle
Date:2014-01-31 13:06:40

This text presents some basic notions in symplectic geometry, Poisson geometry, Hamiltonian systems, Lie algebras and Lie groups actions on symplectic or Poisson manifolds, momentum maps and their use for the reduction of Hamiltonian systems. It should be accessible to readers with a general knowledge of basic notions in differential geometry. Full proofs of many results are provided. The Marsden-Weinstein reduction procedure and the Euler-Poincar\'e equation are discussed. Symplectic cocycles involved when a momentum map is equivariant with respect to an affine action whose linear part is the coadjoint action are presented. The Hamiltonian actions of a Lie group on its cotangent bundle obtained by lifting the actions of the group on itself by translations on the left and on the right are fully discussed. We prove that there exists a two-parameter family of deformations of these actions into a pair of mutually symplectically orthogonal Hamiltonian actions whose momentum maps are equivariant with respect to an affine action involving any given Lie group symplectic cocycle. In the last section three classical examples are considered: the spherical pendulum, the motion of a rigid body around a fixed point and the Kepler problem. For each example the Euler-Poincar\'e equation is derived, the first integrals linked to symmetries are given. In this Section, the classical concepts of vector calculus on an Euclidean three-dimensional vector space (scalar, vector and mixed products) are used and their interpretation in terms of concepts such as the adjoint or coadjoint action of the group of rotations are explained.

A Multiagent Reinforcement Learning Algorithm with Non-linear Dynamics

Authors:Sherief Abdallah, Victor Lesser
Date:2014-01-15 05:13:47

Several multiagent reinforcement learning (MARL) algorithms have been proposed to optimize agents' decisions. Due to the complexity of the problem, the majority of the previously developed MARL algorithms assumed that agents either had some knowledge of the underlying game (such as Nash equilibria) and/or observed other agents' actions and the rewards they received. We introduce a new MARL algorithm called the Weighted Policy Learner (WPL), which allows agents to reach a Nash Equilibrium (NE) in benchmark 2-player-2-action games with minimum knowledge. Using WPL, the only feedback an agent needs is its own local reward (the agent does not observe other agents' actions or rewards). Furthermore, WPL does not assume that agents know the underlying game or the corresponding Nash Equilibrium a priori. We experimentally show that our algorithm converges in benchmark two-player-two-action games. We also show that our algorithm converges in the challenging Shapley's game, where previous MARL algorithms failed to converge without knowing the underlying game or the NE. Furthermore, we show that WPL outperforms the state-of-the-art algorithms in a more realistic setting of 100 agents interacting and learning concurrently. An important aspect of understanding the behavior of a MARL algorithm is analyzing the dynamics of the algorithm: how the policies of multiple learning agents evolve over time as agents interact with one another. Such an analysis not only verifies whether agents using a given MARL algorithm will eventually converge, but also reveals the behavior of the MARL algorithm prior to convergence. We analyze our algorithm in two-player-two-action games and show that symbolically proving WPL's convergence is difficult, because of the non-linear nature of WPL's dynamics, unlike previous MARL algorithms that had either linear or piece-wise-linear dynamics. Instead, we numerically solve WPL's dynamics differential equations and compare the solution to the dynamics of previous MARL algorithms.
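
A minimal sketch of a WPL-style update as we read it from this description (the learning rate and simplex projection are our assumptions): gradient steps are weighted by pi(a) when negative and by 1 - pi(a) when positive, which is precisely what makes the dynamics non-linear.

# Sketch of a WPL-style policy update on a 2-action policy.
import numpy as np

def wpl_update(pi, action_values, eta=0.05):
    grad = action_values - action_values @ pi    # advantage of each action
    weight = np.where(grad < 0, pi, 1.0 - pi)    # the WPL-style weighting
    pi = pi + eta * grad * weight
    pi = np.clip(pi, 1e-6, None)                 # project back onto simplex
    return pi / pi.sum()

pi = np.array([0.5, 0.5])
for _ in range(100):
    pi = wpl_update(pi, action_values=np.array([1.0, 0.4]))
print(np.round(pi, 3))                           # mass shifts to action 0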

A 4D synchrotron X-ray tomography study of the formation of hydrocarbon migration pathways in heated organic-rich shale

Authors:Hamed Panahi, Maya Kobchenko, Francois Renard, Adriano Mazzini, Julien Scheibert, Dag Kristian Dysthe, Bjorn Jamtveit, Anders Malthe-Sørenssen, Paul Meakin
Date:2014-01-10 20:37:29

Recovery of oil from oil shales and the natural primary migration of hydrocarbons are closely related processes that have received renewed interest in recent years because of the ever-tightening supply of conventional hydrocarbons and the growing production of hydrocarbons from low-permeability tight rocks. Quantitative models for the conversion of kerogen into oil and gas and the timing of hydrocarbon generation have been well documented. However, the lack of consensus about the kinetics of hydrocarbon formation in source rocks, expulsion timing, and how the resulting hydrocarbons escape from or are retained in the source rocks motivates further investigation. In particular, many mechanisms for the transport of hydrocarbons from the source rocks in which they are generated into adjacent rocks with higher permeabilities and smaller capillary entry pressures have been proposed, and a better understanding of this complex process (primary migration) is needed. To characterize these processes it is imperative to use the latest technological advances. In this study, it is shown how insights into hydrocarbon migration in source rocks can be obtained by using sequential high-resolution synchrotron X-ray tomography. Three-dimensional (3D) images of several immature "shale" samples were constructed at resolutions close to 5 micrometers. This is sufficient to resolve the source rock structure down to the grain level, but very fine-grained silt particles, clay particles and colloids cannot be resolved. Samples used in this investigation came from the R-8 unit in the upper part of the Green River Shale, an organic-rich, varved, lacustrine marl formed in Eocene Lake Uinta, United States of America. One Green River Shale sample was heated in-situ up to 400{\deg}C as X-ray tomography images were recorded. The other samples were scanned before and after heating at 400{\deg}C. During the heating phase, the organic matter was decomposed and gas was released. Gas expulsion from the low-permeability shales was coupled with the formation of microcracks. The main technical difficulty was the numerical extraction of microcracks that have apertures in the 5 to 30 micrometer range (with 5 micrometers being the resolution limit) from a large 3D volume of X-ray attenuation data. The main goal of the work presented here is to develop a methodology to process these 3D data and image the cracks. This methodology is based on several levels of spatial filtering and automatic recognition of connected domains. Supportive petrographic and thermogravimetric data were an important complement to this study. An investigation of the strain field using two-dimensional image correlation analyses was also performed. As one application of the four-dimensional (4D, space + time) microtomography and the developed workflow, we show that fluid generation was accompanied by crack formation. Under different conditions, in the subsurface, this might provide paths for primary migration.
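
The filtering-plus-connected-domain workflow described above can be illustrated with a few lines of standard image processing; this synthetic example (all thresholds and sizes are illustrative assumptions) is not the authors' pipeline:

# Sketch: smooth a 3D attenuation volume, threshold low-density voxels, and
# label connected components as candidate cracks.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
volume = rng.normal(size=(64, 64, 64))            # stand-in attenuation data
volume[30:34, 10:54, 10:54] -= 3.0                # synthetic low-density crack

smoothed = ndimage.gaussian_filter(volume, sigma=1.0)   # spatial filtering
mask = smoothed < -1.0                                  # candidate crack voxels
labels, n = ndimage.label(mask)                         # connected domains
sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
cracks = [i + 1 for i, s in enumerate(sizes) if s > 100]  # drop small noise
print(f"{n} connected components, {len(cracks)} kept as cracks")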

Can the magnetic field in the Orion arm inhibit the growth of instabilities in the bow shock of Betelgeuse?

Authors:Allard Jan van Marle, Leen Decin, Zakaria Meliani
Date:2013-12-20 10:38:27

Many evolved stars travel through space at supersonic velocities, which leads to the formation of bow shocks ahead of the star where the stellar wind collides with the interstellar medium (ISM). Herschel observations of the bow shock of $\alpha$-Orionis show that the shock is almost free of instabilities, despite being, at least in theory, subject to both Kelvin-Helmholtz and Rayleigh-Taylor instabilities. A possible explanation for the lack of instabilities lies in the presence of an interstellar magnetic field. We wish to investigate whether the magnetic field of the interstellar medium (ISM) in the Orion arm can inhibit the growth of instabilities in the bow shock of $\alpha$-Orionis. We used the code MPI-AMRVAC to make magneto-hydrodynamic simulations of a circumstellar bow shock, using the wind parameters derived for $\alpha$-Orionis and interstellar magnetic field strengths of $B\,=\,1.4,\, 3.0$, and $5.0\, \mu$G, which fall within the boundaries of the observed magnetic field strength in the Orion arm of the Milky Way. Our results show that even a relatively weak magnetic field in the interstellar medium can suppress the growth of Rayleigh-Taylor and Kelvin-Helmholtz instabilities, which occur along the contact discontinuity between the shocked wind and the shocked ISM. The presence of even a weak magnetic field in the ISM effectively inhibits the growth of instabilities in the bow shock. This may explain the absence of such instabilities in the Herschel observations of $\alpha$-Orionis.

Pulsating red giant stars in eccentric binary systems discovered from Kepler space-based photometry

Authors:P. G. Beck, K. Hambleton, J. Vos, T. Kallinger, S. Bloemen, A. Tkachenko, R. A. García, R. H. Østensen, C. Aerts, D. W. Kurtz, J. De Ridder, S. Hekker, K. Pavlovski, S. Mathur, K. De Smedt, A. Derekas, E. Corsaro, B. Mosser, H. Van Winckel, D. Huber, P. Degroote, G. R. Davies, A. Prša, J. Debosscher, Y. Elsworth, P. Nemeth, L. Siess, V. S. Schmid, P. I. Pápics, B. L. de Vries, A. J. van Marle, P. Marcos-Arenal, A. Lobel
Date:2013-12-16 20:19:07

The unparalleled photometric data obtained by NASA's Kepler space telescope led to an improved understanding of red giant stars and binary stars. Seismology allows us to constrain the properties of red giants. In addition to eclipsing binaries, eccentric non-eclipsing binaries, exhibiting ellipsoidal modulations, have been detected with Kepler. We aim to study the properties of eccentric binary systems containing a red giant star and to derive the parameters of the primary giant component. We apply asteroseismic techniques to determine the masses and radii of the primary component of each system. For a selected target, light and radial velocity curve modelling techniques are applied to extract the parameters of the system. The effects of stellar interaction on the binary system are studied. The paper presents the asteroseismic analysis of 18 pulsating red giants in eccentric binary systems, for which masses and radii were constrained. The orbital periods of these systems range from 20 to 440 days. From radial velocity measurements we find eccentricities between e=0.2 and 0.76. As a case study we present a detailed analysis of KIC5006817. From seismology we constrain the rotational period of the envelope to be at least 165 d, roughly twice the orbital period. The stellar core rotates 13 times faster than the surface. From the spectrum and radial velocities we expect that the Doppler beaming signal should have a maximum amplitude of 300 ppm in the light curve. Through binary modelling, we determine the mass of the secondary component to be 0.29$\pm$0.03\,$M_\odot$. For KIC5006817 we exclude pseudo-synchronous rotation of the red giant with the orbit. The comparison of the results from seismology and modelling of the light curve shows a possible alignment of the rotational and orbital axes at the 2$\sigma$ level. Red giant eccentric systems could be progenitors of cataclysmic variables and hot subdwarf B stars.

On Henri Poincaré's note "Sur une forme nouvelle des équations de la Mécanique"

Authors:Charles-Michel Marle
Date:2013-02-17 18:10:10

We present in modern language the contents of the famous note published by Henri Poincar\'e in 1901 "Sur une forme nouvelle des \'equations de la M\'ecanique", in which he proves that, when a Lie algebra acts locally transitively on the configuration space of a Lagrangian mechanical system, the well known Euler-Lagrange equations are equivalent to a new system of differential equations defined on the product of the configuration space with the Lie algebra. We write these equations, called the \emph{Euler-Poincar\'e equations}, under an intrinsic form, without any reference to a particular system of local coordinates, and prove that they can be conveniently expressed in terms of the Legendre and momentum maps. We discuss the use of the Euler-Poincar\'e equation for reduction (a procedure sometimes called Lagrangian reduction by modern authors), and compare this procedure with the well known Hamiltonian reduction procedure (formulated in modern terms in 1974 by J.E. Marsden and A. Weinstein). We explain how a break of symmetry in the phase space produces the appearance of a semi-direct product of groups.
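
In modern notation, the Euler-Poincar\'e equations referred to here take the intrinsic form (our summary; signs depend on left- versus right-invariance conventions, and additional momentum-map terms appear for actions on a general configuration space):

\[ \frac{d}{dt}\,\frac{\delta l}{\delta \xi} = \mathrm{ad}^*_{\xi}\,\frac{\delta l}{\delta \xi}, \]

where $l : \mathfrak{g} \to \mathbb{R}$ is the reduced Lagrangian and $\xi(t) \in \mathfrak{g}$ the reduced velocity.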

Ultrarelativistic Transport Coefficients in Two Dimensions

Authors:M. Mendoza, I. Karlin, S. Succi, H. J. Herrmann
Date:2013-01-15 17:13:34

We compute the shear and bulk viscosities, as well as the thermal conductivity of an ultrarelativistic fluid obeying the relativistic Boltzmann equation in 2+1 space-time dimensions. The relativistic Boltzmann equation is taken in the single relaxation time approximation, based on two approaches, the first, due to Marle and using the Eckart decomposition, and the second, proposed by Anderson and Witting and using the Landau-Lifshitz decomposition. In both cases, the local equilibrium is given by a Maxwell-Juettner distribution. It is shown that, apart from slightly different numerical prefactors, the two models lead to a different dependence of the transport coefficients on the fluid temperature, quadratic and linear, for the case of Marle and Anderson-Witting, respectively. However, by modifying the Marle model according to the prescriptions given in Ref.[1], it is found that the temperature dependence becomes the same as for the Anderson-Witting model.
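
Concretely, the two single-relaxation-time models being compared take the following standard forms (our summary; metric and sign conventions vary across the literature):

\[ \text{Marle:}\quad p^\mu \partial_\mu f = -\frac{m}{\tau}\left(f - f^{(\mathrm{eq})}\right), \qquad \text{Anderson-Witting:}\quad p^\mu \partial_\mu f = -\frac{u_\mu p^\mu}{c^2\,\tau}\left(f - f^{(\mathrm{eq})}\right), \]

with $f^{(\mathrm{eq})}$ the local Maxwell-Juettner distribution and $u_\mu$ the hydrodynamic four-velocity in the Eckart and Landau-Lifshitz decompositions, respectively.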

Relativistic gas in a Schwarzschild metric

Authors:Gilberto M. Kremer
Date:2012-12-21 19:57:12

A relativistic gas in a Schwarzschild metric is studied within the framework of a relativistic Boltzmann equation in the presence of gravitational fields, where Marle's model for the collision operator of the Boltzmann equation is employed. The transport coefficients of bulk and shear viscosities and thermal conductivity are determined from the Chapman-Enskog method. It is shown that the transport coefficients depend on the gravitational potential. Expressions for the transport coefficients in the presence of weak gravitational fields in the non-relativistic (low temperatures) and ultra-relativistic (high temperatures) limiting cases are given. Apart from the temperature gradient the heat flux has two relativistic terms. The first one, proposed by Eckart, is due to the inertia of energy and represents an isothermal heat flux when matter is accelerated. The other, suggested by Tolman, is proportional to the gravitational potential gradient and indicates that -- in the absence of an acceleration field -- a state of equilibrium of a relativistic gas in a gravitational field can be attained only if the temperature gradient is counterbalanced by a gravitational potential gradient.
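
In the weak-field limit, the two relativistic contributions to the heat flux described here can be summarized as (our paraphrase in standard notation, not a formula quoted from the paper):

\[ \vec{q} = -\lambda\left(\nabla T + \frac{T}{c^2}\,\vec{a} + \frac{T}{c^2}\,\nabla\Phi\right), \]

so that for a gas at rest ($\vec{a} = 0$) equilibrium requires $\nabla T = -(T/c^2)\,\nabla\Phi$, which is Tolman's condition that a temperature gradient be counterbalanced by the gravitational potential gradient.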

The enigmatic nature of the circumstellar envelope and bow shock surrounding Betelgeuse as revealed by Herschel. I. Evidence of clumps, multiple arcs, and a linear bar-like structure

Authors:L. Decin, N. L. J. Cox, P. Royer, A. J. Van Marle, B. Vandenbussche, D. Ladjal, F. Kerschbaum, R. Ottensamer, M. J. Barlow, J. A. D. L. Blommaert, H. L. Gomez, M. A. T. Groenewegen, T. Lim, B. M. Swinyard, C. Waelkens, A. G. G. M. Tielens
Date:2012-12-19 22:14:38

Context. The interaction between stellar winds and the interstellar medium (ISM) can create complex bow shocks. The photometers on board the Herschel Space Observatory are ideally suited to studying the morphologies of these bow shocks. Aims. We aim to study the circumstellar environment and wind-ISM interaction of the nearest red supergiant, Betelgeuse. Methods. Herschel PACS images at 70, 100, and 160 micron and SPIRE images at 250, 350, and 500 micron were obtained by scanning the region around Betelgeuse. These data were complemented with ultraviolet GALEX data, near-infrared WISE data, and radio 21 cm GALFA-HI data. The observational properties of the bow shock structure were deduced from the data and compared with hydrodynamical simulations. Results. The infrared Herschel images of the environment around Betelgeuse are spectacular, showing the occurrence of multiple arcs at 6-7 arcmin from the central target and the presence of a linear bar at 9 arcmin. Remarkably, no large-scale instabilities are seen in the outer arcs and linear bar. The dust temperature in the outer arcs varies between 40 and 140 K, with the linear bar having the same colour temperature as the arcs. The inner envelope shows clear evidence of a non-homogeneous clumpy structure (beyond 15 arcsec), probably related to the giant convection cells of the outer atmosphere. The non-homogeneous distribution of the material even persists until the collision with the ISM. A strong variation in brightness of the inner clumps at a radius of 2 arcmin suggests a drastic change in mean gas and dust density some 32 000 yr ago. Using hydrodynamical simulations, we try to explain the observed morphology of the bow shock around Betelgeuse. Conclusions: [abbreviated]

Multi-dimensional models of circumstellar shells around evolved massive stars

Authors:Allard Jan van Marle, Rony Keppens
Date:2012-09-20 11:21:02

Massive stars shape their surrounding medium through the force of their stellar winds, which collide with the circumstellar medium. Because the characteristics of these stellar winds vary over the course of the evolution of the star, the circumstellar matter becomes a reflection of the stellar evolution and can be used to determine the characteristics of the progenitor star. In particular, whenever a fast wind phase follows a slow wind phase, the fast wind sweeps up its predecessor in a shell, which is observed as a circumstellar nebula. We make 2-D and 3-D numerical simulations of fast stellar winds sweeping up their slow predecessors to investigate whether numerical models of these shells have to be 3-D, or whether 2-D models are sufficient to reproduce the shells correctly. We focus on those situations where a fast Wolf-Rayet (WR) star wind sweeps up the slower wind emitted by its predecessor, being either a red supergiant or a luminous blue variable. As the fast WR wind expands, it creates a dense shell of swept up material that expands outward, driven by the high pressure of the shocked WR wind. These shells are subject to a fair variety of hydrodynamic-radiative instabilities. If the WR wind is expanding into the wind of a luminous blue variable phase, the instabilities will tend to form a fairly small-scale, regular filamentary lattice with thin filaments connecting knotty features. If the WR wind is sweeping up a red supergiant wind, the instabilities will form larger interconnected structures with less regularity. Our results show that 3-D models, when translated to observed morphologies, give realistic results that can be compared directly to observations. The 3-D structure of the nebula will help to distinguish different progenitor scenarios.

A hydrodynamical model of the circumstellar bubble created by two massive stars

Authors:Allard Jan van Marle, Zakaria Meliani, Alexandre Marcowith
Date:2012-04-10 08:29:22

Numerical models of the wind-blown bubble of massive stars usually only account for the wind of a single star. However, since massive stars are usually formed in clusters, it would be more realistic to follow the evolution of a bubble created by several stars. We develop a two-dimensional (2D) model of the circumstellar bubble created by two massive stars, a 40 solar mass star and a 25 solar mass star, and follow its evolution. The stars are separated by approximately 16 pc and surrounded by a cold medium with a density of 20 particles per cubic cm. We use the MPI-AMRVAC hydrodynamics code to solve the conservation equations of hydrodynamics on a 2D cylindrical grid, using time-dependent models for the wind parameters of the two stars. At the end of the stellar evolution (4.5 and 7.0 million years for the 40 and 25 solar mass stars, respectively), we simulate the supernova explosion of each star. Each star initially creates its own bubble. However, as the bubbles expand they merge, creating a combined, aspherical bubble. The combined bubble evolves over time, influenced by the stellar winds and supernova explosions. The evolution of a wind-blown bubble created by two stars deviates from that of the bubbles around single stars. In particular, once one of the stars has exploded, the bubble is too large for the wind of the remaining star to maintain, and the outer shell starts to disintegrate. The lack of thermal pressure inside the bubble also changes the behavior of circumstellar features close to the remaining star. The supernovae are contained inside the bubble, which reflects part of the energy back into the circumstellar medium.

A far-infrared survey of bow shocks and detached shells around AGB stars and red supergiants

Authors:N. L. J. Cox, F. Kerschbaum, A. -J. van Marle, L. Decin, D. Ladjal, A. Mayer, M. A. T. Groenewegen, S. van Eck, P. Royer, R. Ottensamer, T. Ueta, A. Jorissen, M. Mecina, Z. Meliani, A. Luntzer, J. A. D. L. Blommaert, Th. Posch, B. Vandenbussche, C. Waelkens
Date:2011-10-25 12:52:05

Far-infrared Herschel/PACS images at 70 and 160 micron of a sample of 78 Galactic evolved stars are used to study the (dust) emission structures, originating from stellar wind-ISM interaction. In addition, two-fluid hydrodynamical simulations of the coupled gas and dust in wind-ISM interactions are used to compare with the observations. Four distinct classes of wind-ISM interaction (i.e. "fermata", "eyes", "irregular", and "rings") are identified and basic parameters affecting the morphology are discussed. We detect bow shocks for ~40% of the sample and detached rings for ~20%. De-projected stand-off distances (R_0) -- defined as the distance between the central star and the nearest point of the interaction region -- of the detected bow shocks ("fermata" and "eyes") are derived from the PACS images and compared to previous results, model predictions and the simulations. All observed bow shocks have stand-off distances smaller than 1 pc. Observed and theoretical stand-off distances are used together to independently derive the local ISM density. Both theoretical (analytical) models and hydrodynamical simulations give stand-off distances for adopted stellar properties that are in good agreement with the measured de-projected stand-off distance of wind-ISM bow shocks. The possible detection of a bow shock -- for the distance limited sample -- appears to be governed by its physical size as set roughly by the stand-off distance. In particular, the star's peculiar space velocity and the density of the ISM appear decisive in detecting emission from bow shocks or detached rings. Tentatively, the "eyes" class objects are associated with (visual) binaries, while the "rings" generally do not occur for M-type stars, only for C- or S-type objects that have experienced a thermal pulse.
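
For reference, the stand-off distance used above follows from ram-pressure balance between the stellar wind and the oncoming ISM. A minimal Python sketch (all numerical values are illustrative placeholders, not survey values; the pure-hydrogen mean particle mass is a simplifying assumption):

```python
import math

# Ram-pressure balance between wind and ISM gives the classic bow-shock
# stand-off distance R_0 = sqrt(Mdot * v_wind / (4 pi rho_ISM v_star^2)).
MSUN_PER_YR = 1.989e33 / 3.156e7   # g/s per Msun/yr
KM = 1.0e5                         # cm per km
PC = 3.086e18                      # cm per pc
M_H = 1.67e-24                     # g; pure-hydrogen gas assumed

def standoff_distance(mdot_msun_yr, v_wind_kms, n_ism_cm3, v_star_kms):
    """De-projected stand-off distance R_0 in pc."""
    mdot = mdot_msun_yr * MSUN_PER_YR
    rho_ism = n_ism_cm3 * M_H
    r0 = math.sqrt(mdot * v_wind_kms * KM /
                   (4.0 * math.pi * rho_ism * (v_star_kms * KM) ** 2))
    return r0 / PC

def ism_density(mdot_msun_yr, v_wind_kms, r0_pc, v_star_kms):
    """Invert the same balance: local ISM number density in cm^-3."""
    mdot = mdot_msun_yr * MSUN_PER_YR
    rho = mdot * v_wind_kms * KM / (4.0 * math.pi * (r0_pc * PC) ** 2
                                    * (v_star_kms * KM) ** 2)
    return rho / M_H

# An AGB-like wind (1e-6 Msun/yr at 15 km/s) moving at 30 km/s through a
# n = 1 cm^-3 medium sits at ~0.2 pc, below the 1 pc bound quoted above.
print(standoff_distance(1e-6, 15.0, 1.0, 30.0))
```

Inverting the same balance with a measured R_0, as in the `ism_density` helper, is what allows the ISM density to be derived independently.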

Dust distribution in circumstellar shells

Authors:Allard Jan van Marle, Zakaria Meliani, Rony Keppens, Leen Decin
Date:2011-10-14 08:29:11

We present numerical simulations of the hydrodynamical interactions that produce circumstellar shells. These simulations include several scenarios, such as wind-wind interaction and wind-ISM collisions. In our calculations we have taken into account the presence of dust in the stellar wind. Our results show that, while small dust grains tend to be strongly coupled to the gas, large dust grains are only weakly coupled. As a result, the distribution of the large dust grains is not representative of the gas distribution. Combining these results with observations may give us a new way of validating hydrodynamical models of the circumstellar medium.
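
The size-dependent coupling can be illustrated with an Epstein-drag stopping time, t_s = rho_grain a / (rho_gas v_th): grains with t_s much shorter than the flow timescale follow the gas, while larger grains decouple. A hedged sketch (grain density, gas conditions, and the drag regime are assumptions for illustration, not the paper's parameters):

```python
import math

K_B = 1.38e-16   # erg/K
M_H = 1.67e-24   # g

def stopping_time(a_cm, rho_grain=3.0, n_gas_cm3=1.0e4, t_gas_k=100.0):
    """Epstein-drag stopping time t_s = rho_grain*a/(rho_gas*v_th) in s."""
    rho_gas = n_gas_cm3 * M_H
    v_th = math.sqrt(8.0 * K_B * t_gas_k / (math.pi * M_H))  # gas thermal speed
    return rho_grain * a_cm / (rho_gas * v_th)

# Small grains re-couple within decades; half-micron grains take millennia,
# long enough to drift away from the gas structures they were born in.
for a_um in (0.005, 0.045, 0.5):
    t_s = stopping_time(a_um * 1.0e-4)
    print(f"a = {a_um:5.3f} micron -> t_s ~ {t_s / 3.156e7:.0f} yr")
```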

On the circumstellar medium of massive stars and how it may appear in GRB observations

Authors:Allard Jan van Marle, Rony Keppens, Sung-Chul Yoon, Norbert Langer
Date:2011-10-14 08:22:59

Massive stars lose a large fraction of their original mass over the course of their evolution. These stellar winds shape the surrounding medium according to parameters that are the result of the characteristics of the stars, varying over time as the stars evolve, leading to both permanent and temporary features that can be used to constrain the evolution of the progenitor star. Because long Gamma-Ray Bursts (GRBs) are thought to originate from massive stars, the characteristics of the circumstellar medium (CSM) should be observable in the signal of GRBs. This can occur directly, as the characteristics of the GRB-jet are changed by the medium it collides with, and indirectly because the GRB can only be observed through the extended circumstellar bubble that surrounds each massive star. We use computer simulations to describe the circumstellar features that can be found in the vicinity of massive stars and discuss if, and how, they may appear in GRB observations. Specifically, we make hydrodynamical models of the circumstellar environment of a rapidly rotating, chemically near-homogeneous star, which is represents a GRB progenitor candidate model. The simulations show that the star creates a large scale bubble of shocked wind material, which sweeps up the interstellar medium in an expanding shell. Within this bubble, temporary circumstellar shells, clumps and voids are created as a result of changes in the stellar wind. Most of these temporary features have disappeared by the time the star reaches the end of its life, leaving a highly turbulent circumstellar bubble behind. Placing the same star in a high density environment simplifies the evolution of the CSM as the more confined bubble prohibits the formation of some of the temporary structures.

Spectroscopy of $^{18}$Na: Bridging the two-proton radioactivity of $^{19}$Mg

Authors:M. Assié, F. de Oliveira Santos, T. Davinson, F. de Grancey, L. Achouri, J. Alcántara-Núñez, T. Al Kalanee, J. -C. Angélique, C. Borcea, R. Borcea, L. Caceres, I. Celikovic, V. Chudoba, D. Y. Pang, C. Ducoin, M. Fallot, O. Kamalou, J. Kiener, Y. Lam, A. Lefebvre-Schuhl, G. Lotay, J. Mrazek, L. Perrot, A. M. Sánchez-Benítez, F. Rotaru, M-G. Saint-Laurent, Yu. Sobolev, N. Smirnova, M. Stanoiu, I. Stefan, K. Subotic, P. Ujic, R. Wolski, P. J. Woods
Date:2011-10-06 12:09:05

The unbound nucleus $^{18}$Na, the intermediate nucleus in the two-proton radioactivity of $^{19}$Mg, was studied by the measurement of the resonant elastic scattering reaction $^{17}$Ne(p,$^{17}$Ne)p performed at 4 A.MeV. Spectroscopic properties of the low-lying states were obtained in an R-matrix analysis of the excitation function. Using these new results, we show that the lifetime of the $^{19}$Mg radioactivity can be understood assuming a sequential emission of two protons via low energy tails of $^{18}$Na resonances.
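
For orientation, in the limit of a single isolated level the fitted excitation function reduces to the familiar one-level Breit-Wigner shape; schematically, for elastic scattering (a simplification of the multi-level R-matrix fit actually performed),

$$ \sigma(E) \;=\; \frac{\pi}{k^{2}}\, g_J \, \frac{\Gamma_{p}^{\,2}}{(E-E_{r})^{2} + \Gamma_{\rm tot}^{2}/4}, \qquad g_J = \frac{2J+1}{(2j_p+1)(2j_t+1)}, $$

where $E_r$ is the resonance energy, $\Gamma_p$ the proton partial width, $\Gamma_{\rm tot}$ the total width, and $j_p$, $j_t$ the projectile and target spins; fitting the energies and widths of the low-lying $^{18}$Na levels is what constrains the sequential two-proton picture.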

Formation of Solar Filaments by Steady and Nonsteady Chromospheric Heating

Authors:C. Xia, P. F. Chen, R. Keppens, A. J. van Marle
Date:2011-06-01 05:32:50

It has been established that cold plasma condensations can form in a magnetic loop subject to localized heating of the footpoints. In this paper, we use grid-adaptive numerical simulations of the radiative hydrodynamic equations to parametrically investigate the filament formation process in a pre-shaped loop with both steady and finite-time chromospheric heating. Compared to previous works, we consider low-lying loops with shallow dips, and use a more realistic description for the radiative losses. We demonstrate for the first time that the onset of thermal instability satisfies the linear instability criterion. The onset time of the condensation is \sim 2 hr or more after the localized heating at the footpoint becomes effective, and the growth rate of the thread length varies from 800 km hr$^{-1}$ to 4000 km hr$^{-1}$, depending on the amplitude and the decay length scale characterizing this localized chromospheric heating. We show how single or multiple condensation segments may form in the coronal portion. In the asymmetric heating case, when two segments form, they approach and coalesce, and the coalesced condensation later drains down into the chromosphere. With steady heating, this process repeats with a periodicity of several hours. While our parametric survey confirms and augments earlier findings, we also point out that steady heating is not necessary to sustain the condensation: once the condensation is formed, it can keep growing even after the localized heating ceases. Finally, we show that the condensation can survive continuous buffeting by perturbations resulting from the photospheric p-mode waves.
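
For reference, the linear criterion invoked here is usually stated in its isobaric form (Field 1965) through the net loss function $\mathcal{L}(\rho,T)$ (losses minus gains per unit mass, $\mathcal{L}=0$ in equilibrium); instability requires

$$ \left(\frac{\partial \mathcal{L}}{\partial T}\right)_{\rho} \;-\; \frac{\rho}{T}\left(\frac{\partial \mathcal{L}}{\partial \rho}\right)_{T} \;<\; 0, $$

so that a slightly cooled, compressed parcel loses energy faster and keeps condensing. Whether the paper applies exactly this form or a variant is our assumption here.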

Computing the dust distribution in the bowshock of a fast moving, evolved star

Authors:Allard Jan van Marle, Zakaria Meliani, Rony Keppens, Leen Decin
Date:2011-05-12 07:21:08

We study the hydrodynamical behavior occurring in the turbulent interaction zone of a fast moving red supergiant star, where the circumstellar and interstellar material collide. In this wind-interstellar medium collision, the familiar bow shock, contact discontinuity, and wind termination shock morphology forms, with localized instability development. Our model includes a detailed treatment of dust grains in the stellar wind, and takes into account the drag forces between dust and gas. The dust is treated as pressureless gas components binned per grainsize, for which we use ten representative grainsize bins. Our simulations allow to deduce how dust grains of varying sizes become distributed throughout the circumstellar medium. We show that smaller dust grains (radius <0.045 micro-meters) tend to be strongly bound to the gas and therefore follow the gas density distribution closely, with intricate finestructure due to essentially hydrodynamical instabilities at the wind-related contact discontinuity. Larger grains which are more resistant to drag forces are shown to have their own unique dust distribution, with progressive deviations from the gas morphology. Specifically, small dust grains stay entirely within the zone bound by shocked wind material. The large grains are capable of leaving the shocked wind layer, and can penetrate into the shocked or even unshocked interstellar medium. Depending on how the number of dust grains varies with grainsize, this should leave a clear imprint in infrared observations of bowshocks of red supergiants and other evolved stars.

3-D simulations of shells around massive stars

Authors:Allard Jan van Marle, Rony Keppens, Zakaria Meliani
Date:2011-02-01 09:16:21

As massive stars evolve, their winds change. This causes a series of hydrodynamical interactions in the surrounding medium. Whenever a fast wind follows a slow wind phase, the fast wind sweeps up the slow wind in a shell, which can be observed as a circumstellar nebula. One of the most striking examples of such an interaction is when a massive star changes from a red supergiant into a Wolf-Rayet star. Nebulae resulting from such a transition have been observed around many Wolf-Rayet stars and show detailed, complicated structures owing to local instabilities in the swept-up shells. Shells also form in the case of massive binary stars, where the winds of two stars collide with one another. Along the collision front gas piles up, forming a shell that rotates along with the orbital motion of the binary stars. In this case the shell follows the surface along which the ram pressure of the two colliding winds is in balance. Using the MPI-AMRVAC hydrodynamics code we have made multi-dimensional simulations of these interactions in order to model the formation and evolution of these circumstellar nebulae and explore whether full 3D simulations are necessary to obtain accurate models of such nebulae.

A property of conformally Hamiltonian vector fields; application to the Kepler problem

Authors:Charles-Michel Marle
Date:2010-11-26 08:56:14

Let $X$ be a Hamiltonian vector field defined on a symplectic manifold $(M,\omega)$, $g$ a nowhere vanishing smooth function defined on an open dense subset $M^0$ of $M$. We will say that the vector field $Y = gX$ is conformally Hamiltonian. We prove that when $X$ is complete, when $Y$ is Hamiltonian with respect to another symplectic form $\omega_2$ defined on $M^0$, and when another technical condition is satisfied, there exists a symplectic diffeomorphism from $(M^0,\omega_2)$ onto an open subset of $(M,\omega)$, equivariant with respect to the flows of the vector fields $Y$ on $M^0$ and $X$ on $M$. This result explains why the diffeomorphism of the phase space of the Kepler problem restricted to the negative (resp. positive) values of the energy function, onto an open subset of the cotangent bundle to a three-dimensional sphere (resp. two-sheeted hyperboloid), discovered by Gy\"orgyi (1968) [9], re-discovered by Ligon and Schaaf (1976) [15], whose properties were discussed by Cushman and Duistermaat (1997) [5], is a symplectic diffeomorphism. Infinitesimal symmetries of the Kepler problem are discussed, and it is shown that their space is a Lie algebroid with zero anchor map rather than a Lie algebra.
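
Spelled out in symbols (our transcription of the definitions above, with $H$ a Hamiltonian for $X$ and $H_2$ one for $Y$ relative to $\omega_2$):

$$ i_{X}\,\omega = \mathrm{d}H \ \text{on } M, \qquad Y = g\,X, \qquad i_{Y}\,\omega_{2} = \mathrm{d}H_{2} \ \text{on } M^{0}. $$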

Radiative cooling in numerical astrophysics: the need for adaptive mesh refinement

Authors:Allard Jan van Marle, Rony Keppens
Date:2010-11-11 10:50:30

Energy loss through optically thin radiative cooling plays an important part in the evolution of astrophysical gas dynamics and should therefore be considered a necessary element in any numerical simulation. Although the addition of this physical process to the equations of hydrodynamics is straightforward, it does create numerical challenges that have to be overcome in order to ensure the physical correctness of the simulation. First, the cooling has to be treated (semi-)implicitly, owing to the discrepancies between the cooling timescale and the typical timesteps of the simulation. Secondly, because of its dependence on a tabulated cooling curve, the introduction of radiative cooling creates the necessity for an interpolation scheme. In particular, we will argue that the addition of radiative cooling to a numerical simulation creates the need for extremely high resolution, which can only be fully met through the use of adaptive mesh refinement.
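
The two ingredients named here, a tabulated cooling curve with an interpolation scheme and a (semi-)implicit update, can be sketched as follows; the curve shape, the backward-Euler/bisection choice, and all numbers are illustrative assumptions rather than the paper's actual scheme:

```python
import numpy as np

# Toy tabulated cooling curve Lambda(T) [erg cm^3/s] on a log grid.
logT_tab = np.linspace(4.0, 8.0, 81)
logL_tab = -22.0 - 0.5 * (logT_tab - 5.5) ** 2       # placeholder shape

def Lambda(T):
    """Interpolate the tabulated cooling curve in log-log space."""
    return 10.0 ** np.interp(np.log10(T), logT_tab, logL_tab)

def cool_cell(T, n, dt, kb=1.38e-16, gamma=5.0 / 3.0, t_floor=1.0e4):
    """Backward-Euler step: solve T_new + dt*(gamma-1)*n*Lambda(T_new)/kb = T
    by bisection, which stays stable however stiff the cooling is."""
    def f(Tn):
        return Tn + dt * (gamma - 1.0) * n * Lambda(Tn) / kb - T
    lo, hi = t_floor, T
    if f(lo) >= 0.0:                 # cell cools to the floor within dt
        return t_floor
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if f(mid) > 0.0 else (mid, hi)
    return lo

# A dense 10^6 K cell over a hydro step far longer than its cooling time:
# an explicit update would overshoot to negative temperatures, this does not.
print(cool_cell(T=1.0e6, n=100.0, dt=3.15e10))
```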

Thin shell morphology in the circumstellar medium of massive binaries

Authors:Allard Jan van Marle, Rony Keppens, Zakaria Meliani
Date:2010-11-08 08:58:20

We investigate the morphology of the collision front between the stellar winds of binary components in two long-period binary systems, one consisting of a hydrogen rich Wolf-Rayet star (WNL) and an O-star and the other of a Luminous Blue Variable (LBV) and an O-star. Specifically, we follow the development and evolution of instabilities that form in such a shell, if it is sufficiently compressed, due to both the wind interaction and the orbital motion. We use MPI-AMRVAC to time-integrate the equations of hydrodynamics, combined with optically thin radiative cooling, on an adaptive mesh 3D grid. Using parameters for generic binary systems, we simulate the interaction between the winds of the two stars. The WNL+O star binary shows a typical example of an adiabatic wind collision. The resulting shell is thick and smooth, showing no instabilities. On the other hand, the shell created by the collision of the O star wind with the LBV wind, combined with the orbital motion of the binary components, is susceptible to thin shell instabilities, which create a highly structured morphology. We identify the nature of the instabilities as both linear and non-linear thin-shell instabilities, with distinct differences between the leading and the trailing parts of the collision front. We also find that for binaries containing a star with a (relatively) slow wind, the global shape of the shell is determined more by the slow wind velocity and the orbital motion of the binary, than the ram pressure balance between the two winds. The interaction between massive binary winds needs further parametric exploration, to identify the role and dynamical importance of multiple instabilities at the collision front, as shown here for an LBV+O star system.

Relativistic heat conduction: the kinetic theory approach and comparison with Marle's model

Authors:A. R. Mendez, A. L. Garcia-Perciante
Date:2010-10-07 16:33:36

In order to close the set of relativistic hydrodynamic equations, constitutive relations for the dissipative fluxes are required. In this work we outline the calculation of the corresponding closure for the heat flux in terms of the gradients of the independent scalar state variables: number density and temperature. The results are compared with the ones obtained using Marle's approximation, in which a relativistic correction factor is included for the relaxation time parameter. It is shown that, when such a correction is included, this BGK-like model yields good results in the relativistic case, and not only in the non-relativistic limit as was previously found by other authors.

The relativistic kinetic dispersion relation: Comparison of the relativistic Bhatnagar-Gross-Krook model and Grad's 14-moment expansion

Authors:Makoto Takamoto, Shu-ichiro Inutsuka
Date:2010-06-14 10:45:04

In this paper, we study the Cauchy problem of the linearized kinetic equations for the models of Marle and Anderson-Witting, and compare the resulting dispersion relations with the 14-moment theory. First, we propose a modification of the Marle model to bring the resultant transport coefficients into accord with those obtained from the full Boltzmann equation. Using the modified Marle model and the Anderson-Witting model, we calculate dispersion relations that are kinetically correct within the validity of the BGK approximation. The 14-moment theory that includes the time derivative of the dissipation currents has a causal structure, in contrast to the acausal first-order Chapman-Enskog approximation. However, the dispersion relation of the 14-moment theory does not accurately describe the result of the kinetic equation. Thus, our calculation indicates that keeping these second-order terms does not simply correspond to improving the physical description of relativistic hydrodynamics.
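
For orientation, the two relaxation-time models being compared differ only in their collision terms; in a common convention (our transcription, with $f^{(0)}$ the Jüttner equilibrium distribution, $\tau$ the relaxation time, and $u^{\mu}$ the hydrodynamic four-velocity):

$$ p^{\mu}\partial_{\mu}f = -\frac{m}{\tau}\left(f-f^{(0)}\right) \ \text{(Marle)}, \qquad p^{\mu}\partial_{\mu}f = -\frac{u_{\mu}p^{\mu}}{c^{2}\,\tau}\left(f-f^{(0)}\right) \ \text{(Anderson-Witting)}. $$

Dividing by $p^{0}$ shows that the effective relaxation rate is $\propto m/p^{0}$ for Marle but $\propto u_{\mu}p^{\mu}/p^{0}$ for Anderson-Witting, i.e. dependent on both flow and molecular velocity, which is the distinction the shock-layer comparison later in this collection turns on.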

Numerical models of collisions between core-collapse supernovae and circumstellar shells

Authors:Allard Jan van Marle, Nathan Smith, Stanley P. Owocki, Bob van Veelen
Date:2010-04-16 09:07:23

Recent observations of luminous Type IIn supernovae (SNe) provide compelling evidence that massive circumstellar shells surround their progenitors. In this paper we investigate how the properties of such shells influence the SN lightcurve by conducting numerical simulations of the interaction between an expanding SN and a circumstellar shell ejected a few years prior to core collapse. Our parameter study explores how the emergent luminosity depends on a range of circumstellar shell masses, velocities, geometries, and wind mass-loss rates, as well as variations in the SN mass and energy. We find that the shell mass is the most important parameter, in the sense that higher shell masses (or higher ratios of M_shell/M_SN) lead to higher peak luminosities and higher efficiencies in converting shock energy into visual light. Lower mass shells can also cause high peak luminosities if the shell is slow or if the SN ejecta are very fast, but only for a short time. Sustaining a high luminosity for durations of more than 100 days requires massive circumstellar shells of order 10 M_sun or more. This reaffirms previous comparisons between pre-SN shells and shells produced by giant eruptions of luminous blue variables (LBVs), although the physical mechanism responsible for these outbursts remains uncertain. The lightcurve shape and observed shell velocity can help diagnose the approximate size and density of the circumstellar shell, and it may be possible to distinguish between spherical and bipolar shells with multi-wavelength lightcurves. These models are merely illustrative. One can, of course, achieve even higher luminosities and longer duration light curves from interaction by increasing the explosion energy and shell mass beyond values adopted here.
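
The headline trend, that dissipation is most efficient when the shell mass is comparable to or larger than the ejecta mass, can be seen in a one-zone, momentum-conserving estimate. A Python sketch with illustrative numbers (not values from the simulations):

```python
# A perfectly inelastic collision between ejecta and shell dissipates
# dE = 0.5 * (M_sn*M_sh/(M_sn+M_sh)) * (v_sn - v_sh)^2, which grows with
# the ratio M_sh/M_sn -- the shell-mass trend described in the abstract.
MSUN = 1.989e33          # g
KMS = 1.0e5              # cm/s

def collision_luminosity(m_sn, v_sn, m_sh, v_sh, t_days, efficiency=0.5):
    """Mean luminosity (erg/s) if a fraction of the dissipated kinetic
    energy is radiated over t_days. 'efficiency' is a free assumption."""
    mu = (m_sn * MSUN) * (m_sh * MSUN) / ((m_sn + m_sh) * MSUN)
    de = 0.5 * mu * ((v_sn - v_sh) * KMS) ** 2
    return efficiency * de / (t_days * 86400.0)

# 10 Msun of ejecta at 4000 km/s striking a 10 Msun shell at 100 km/s,
# radiated over 100 days: a superluminous-SN-scale mean luminosity.
print(f"{collision_luminosity(10, 4000, 10, 100, 100):.2e} erg/s")
```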

The rest-frame ultraviolet spectra of GRBs from massive rapidly-rotating stellar progenitors

Authors:Peter B. Robinson, Rosalba Perna, Davide Lazzati, Allard J. van Marle
Date:2010-03-19 16:55:15

The properties of a massive star prior to its final explosion are imprinted in the circumstellar medium (CSM) created by its wind and termination shock. We perform a detailed, comprehensive calculation of the time-variable and angle-dependent transmission spectra of an average-luminosity Gamma-Ray Burst (GRB) which explodes in the CSM structure produced by the collapse of a 20 Msun, rapidly rotating, Z=0.001 progenitor star. We study both the case in which metals are initially in the gaseous phase, as well as the situation in which they are heavily depleted into dust. We find that high-velocity lines from low-ionization states of silicon, carbon, and iron are initially present in the spectrum only if the metals are heavily depleted into dust prior to the GRB explosion. However, such lines disappear on timescales of a fraction of a second for a burst observed on-axis, and of a few seconds for a burst seen at high latitude, making their observation virtually impossible. Rest-frame lines produced in the termination shock are instead clearly visible in all conditions. We conclude that time-resolved, early-time spectroscopy is not a promising way in which the properties of the GRB progenitor wind can be routinely studied. Previous detections of high velocity features in GRB UV spectra must have been due either to a superposition of a physically unrelated absorber or to a progenitor star with very unusual properties.

The hydrodynamics of the supernova remnant Cas A: The influence of the progenitor evolution on the velocity structure and clumping

Authors:B. van Veelen, N. Langer, J. Vink, G. García-Segura, A. J. van Marle
Date:2009-07-07 12:13:43

There are large differences in the proposed progenitor models for the Cas A SNR. One of these differences is the presence or absence of a Wolf-Rayet (WR) phase of the progenitor star. The mass loss history of the progenitor star strongly affects the shape of the supernova remnant (SNR). In this paper we investigate whether the progenitor star of Cas A had a WR phase and, if so, how long it may have lasted. We performed two-dimensional multi-species hydrodynamical simulations of the CSM around the progenitor star for several WR lifetimes, each followed by the interaction of the supernova ejecta with the CSM. We then examined the influence of the length of the WR phase and compared the results of the simulations with the observations of Cas A. The difference in the structure of the CSM for models with different WR lifetimes has a strong impact on the resulting SNR. With increasing WR lifetime, the reverse shock velocity of the SNR decreases and the range of observed velocities in the shocked material increases. Furthermore, if a WR phase occurs, the remnants of the WR shell will be visible in the resulting SNR. Comparing our results with the observations suggests that the progenitor star of Cas A did not have a WR phase. We also find that the quasi-stationary flocculi (QSF) in Cas A are not consistent with clumps from a WR shell that have been shocked and accelerated by the interaction with the SN ejecta. We conclude, however, that for an SN explosion taking place in a CSM shaped by the wind of a short (< 15,000 yr) WR phase, the clumps from the WR shell would be visible inside the SNR.

Analysis of the railway heave induced by soil swelling at a site in southern France

Authors:Anh-Minh Tang, Yu-Jun Cui, Viet-Nam Trinh, Yahel Szerman, Gilles Marchadier
Date:2009-05-06 12:19:59

In order to better understand the heave observed on the railway roadbed of the French high-speed train (TGV) at Chabrillan in southern France, the swelling behaviour of the expansive clayey marl involved, taken from the site by coring, was investigated. The aim of the study is to analyse the part of the heave induced by soil swelling. First, the swell potential was determined by flooding the soil specimen in an oedometer under its in-situ overburden stress. Second, in order to assess the swell induced by the excavation undertaken during the construction of the railway, another method was applied: the soil was first loaded to the in-situ overburden stress existing before the excavation, then flooded and unloaded to its current overburden stress (after the excavation), and the swell induced by this unloading was measured. Finally, the experimental results obtained were analysed together with the results from other laboratory tests performed previously and the data collected from field monitoring. This study allowed the heave induced by soil swelling to be estimated. The part of the heave due to landslide could then be estimated as the difference between the monitored heave and the swelling heave.

Why Global Performance is a Poor Metric for Verifying Convergence of Multi-agent Learning

Authors:Sherief Abdallah
Date:2009-04-15 13:49:42

Experimental verification has been the method of choice for verifying the stability of a multi-agent reinforcement learning (MARL) algorithm as the number of agents grows and theoretical analysis becomes prohibitively complex. For cooperative agents, where the ultimate goal is to optimize some global metric, the stability is usually verified by observing the evolution of the global performance metric over time. If the global metric improves and eventually stabilizes, it is considered a reasonable verification of the system's stability. The main contribution of this note is establishing the need for better experimental frameworks and measures to assess the stability of large-scale adaptive cooperative systems. We show an experimental case study where the stability of the global performance metric can be rather deceiving, hiding an underlying instability in the system that later leads to a significant drop in performance. We then propose an alternative metric that relies on agents' local policies and show, experimentally, that our proposed metric is more effective (than the traditional global performance metric) in exposing the instability of MARL algorithms.
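
A metric of the kind advocated here can be as simple as measuring how much each agent's local policy distribution is still moving between checkpoints. A hedged sketch (the mean total-variation aggregation is our illustrative choice, not necessarily the exact metric proposed in the note):

```python
import numpy as np

def policy_instability(policies_t, policies_t1):
    """Mean total-variation distance between consecutive policy snapshots.
    Each argument: dict agent_id -> array of shape (n_states, n_actions)."""
    dists = []
    for agent, pi_t in policies_t.items():
        pi_t1 = policies_t1[agent]
        tv = 0.5 * np.abs(pi_t - pi_t1).sum(axis=1)   # TV distance per state
        dists.append(tv.mean())
    return float(np.mean(dists))

# Two agents, 3 states, 2 actions: agent "b" is still oscillating even if
# the global reward curve already looks flat.
old = {"a": np.full((3, 2), 0.5), "b": np.array([[1, 0], [0, 1], [1, 0]], float)}
new = {"a": np.full((3, 2), 0.5), "b": np.array([[0, 1], [1, 0], [0, 1]], float)}
print(policy_instability(old, new))   # 0.5: half the agents' policies flipped
```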

Multi-agent Q-Learning of Channel Selection in Multi-user Cognitive Radio Systems: A Two by Two Case

Authors:Husheng Li
Date:2009-03-30 18:07:18

Resource allocation is an important issue in cognitive radio systems. It can be done by carrying out negotiation among secondary users. However, significant overhead may be incurred by the negotiation since it needs to be done frequently due to the rapid change of primary users' activity. In this paper, a channel selection scheme without negotiation is considered for multi-user and multi-channel cognitive radio systems. To avoid collisions incurred by the lack of coordination, each secondary user learns how to select channels according to its experience. Multi-agent reinforcement learning (MARL) is applied in the framework of Q-learning by considering the opponent secondary users as a part of the environment. The dynamics of the Q-learning are illustrated using a Metrick-Polak plot. A rigorous proof of the convergence of Q-learning is provided via the similarity between the Q-learning and Robbins-Monro algorithms, as well as the analysis of convergence of the corresponding ordinary differential equation (via a Lyapunov function). Examples are given and the performance of learning is evaluated by numerical simulations.
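
A minimal sketch of the stateless (bandit-style) variant of this setup, treating the other user as part of the environment; the channel rates, learning parameters, and collision-gives-zero reward rule are illustrative assumptions, not the paper's exact model:

```python
import random

RATES = [1.0, 0.6]          # rewards of channels 0 and 1 when used alone
N_USERS, N_STEPS = 2, 20000
Q = [[0.0, 0.0] for _ in range(N_USERS)]
alpha, eps = 0.05, 0.1

for t in range(N_STEPS):
    # Each user picks a channel epsilon-greedily from its own Q-values,
    # with no negotiation and no view of the other user's choice.
    acts = [random.randrange(2) if random.random() < eps
            else max(range(2), key=lambda a: Q[u][a]) for u in range(N_USERS)]
    for u in range(N_USERS):
        r = 0.0 if acts.count(acts[u]) > 1 else RATES[acts[u]]  # collision -> 0
        Q[u][acts[u]] += alpha * (r - Q[u][acts[u]])            # stateless update

print(Q)   # typically settles near an orthogonal assignment, one user per channel
```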

The inception of Symplectic Geometry: the works of Lagrange and Poisson during the years 1808-1810

Authors:Charles-Michel Marle
Date:2009-02-04 10:21:08

The concept of a symplectic structure first appeared in the works of Lagrange on the so-called "method of variation of the constants". These works are presented, together with those of Poisson, who first defined the composition law called today the "Poisson bracket". The method of variation of the constants is presented using today's mathematical concepts and notations.

On the behaviour of stellar winds that exceed the photon-tiring limit

Authors:Allard Jan van Marle, Stanley P. Owocki, Nir J. Shaviv
Date:2008-12-01 11:16:35

Stars can produce steady-state winds through radiative driving as long as the mechanical luminosity of the wind does not exceed the radiative luminosity at its base. This upper bound on the mass loss rate is known as the photon-tiring limit. Once above this limit, the radiation field is unable to lift all the material out of the gravitational potential of the star, such that only part of it can escape and reach infinity. The rest stalls and falls back toward the stellar surface, making a steady-state wind impossible. Photon-tiring is not an issue for line-driven winds, since they cannot achieve sufficiently high mass loss rates. It can, however, become important if the star exceeds the Eddington limit and continuum interaction becomes the dominant driving mechanism. This paper investigates the time-dependent behaviour of stellar winds that exceed the photon-tiring limit, using 1-D numerical simulations of a porosity-moderated, continuum-driven stellar wind. We find that the regions close to the star show a hierarchical pattern of high density shells moving back and forth, unable to escape the gravitational potential of the star. At larger distances, the flow eventually becomes uniformly outward, though still quite variable. Typically, these winds have a very high density but a terminal flow speed well below the escape speed at the stellar surface. Since most of the radiative luminosity of the star is used to drive the stellar wind, such stars would appear much dimmer than expected from the super-Eddington energy generation at their core. The visible luminosity typically constitutes less than half of the total energy flow and can become as low as ten percent or less for those stars that exceed the photon-tiring limit by a large margin.
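
The limit itself is a simple energy budget: the power needed to lift the wind out of the star's potential cannot exceed the radiative luminosity. Neglecting the kinetic energy carried to infinity, a quick Python estimate (the eta Car-like numbers are illustrative placeholders):

```python
# Photon-tiring limit: Mdot_tir ~ L * R / (G * M), from equating the wind's
# potential-energy flux Mdot*(G*M/R) with the radiative luminosity L.
G, MSUN, RSUN, LSUN = 6.674e-8, 1.989e33, 6.957e10, 3.828e33
YR = 3.156e7

def mdot_tiring(l_lsun, m_msun, r_rsun):
    """Tiring-limit mass-loss rate in Msun/yr."""
    mdot = (l_lsun * LSUN) * (r_rsun * RSUN) / (G * m_msun * MSUN)  # g/s
    return mdot * YR / MSUN

# Illustrative LBV-like numbers: L = 5e6 Lsun, M = 100 Msun, R = 100 Rsun
# give a tiring limit of roughly 0.16 Msun/yr.
print(f"{mdot_tiring(5e6, 100.0, 100.0):.3f} Msun/yr")
```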

Simbol-X: A Formation Flight Mission with an Unprecedented Imaging Capability in the 0.5-80 keV Energy Band

Authors:Gianpiero Tagliaferri, Philippe Ferrando, Jean-Michel Le Duigou, Giovanni Pareschi, Philippe Laurent, Giuseppe Malaguti, Rodolphe Cledassou, Mauro Piermaria, Olivier La Marle, Fabrizio Fiore, Paolo Giommi
Date:2008-10-20 21:13:36

The discovery of X-ray emission from cosmic sources in the 1960s opened a new powerful observing window on the Universe. In fact, the exploration of the X-ray sky during the 70s-90s established X-ray astronomy as a fundamental field of astrophysics. Today, the emission from astrophysical sources is by far best known at energies below 10 keV. The main reason for this situation is purely technical, since grazing incidence reflection has so far been limited to the soft X-ray band. Above 10 keV, all observations have been obtained with collimated detectors or coded mask instruments. To make a leap forward in X-ray astronomy above 10 keV, it is necessary to extend the principle of focusing X-ray optics to higher energies, up to 80 keV and beyond. To this end, ASI and CNES are presently studying the implementation of an X-ray mission called Simbol-X. Taking advantage of emerging technology in mirror manufacturing and spacecraft formation flying, Simbol-X will push grazing incidence imaging up to 80 keV and beyond, providing a strong improvement in both sensitivity and angular resolution compared to all instruments that have operated so far above 10 keV. This technological breakthrough will open a new high-energy window in astrophysics and cosmology. Here we will address the challenges of developing such a distributed and deformable instrument. We will focus on the main performance characteristics of the telescope, such as angular resolution, sensitivity and source localization. We will also describe the specific calibration aspects of a payload distributed over two satellites, and therefore in a configuration that is not "frozen".

Numerical Analysis of Relativistic Boltzmann-kinetic Equations to Solve Relativistic Shock Layer Problems

Authors:Ryosuke Yano, Kojiro Suzuki, Hisayasu Kuroda
Date:2008-10-03 13:05:15

The relativistic shock layer problem was numerically analyzed by using two relativistic Boltzmann-kinetic equations: one is the Marle model, and the other the Anderson-Witting model. For the Marle model, the temperature of the gain term was determined from its relation with the dynamic pressure in the framework of the 14-moment theory. From the numerical results of the relativistic shock layer problem, the behaviors of the projected moments in the nonequilibrium region were clarified. Profiles of the heat flux given by the Marle and Anderson-Witting models differed markedly from the profile of the heat flux approximated by the Navier-Stokes-Fourier law. Additionally, we discuss the differences between the Anderson-Witting and Marle models by focusing on the fact that the relaxation rate of the distribution function depends on both the flow velocity and the molecular velocity for the Anderson-Witting model, while it depends only on the molecular velocity for the Marle model.

Multiple ring nebulae around blue supergiants

Authors:S. M. Chita, N. Langer, A. J. van Marle, G. Garcia-Segura, A. Heger
Date:2008-07-19 07:07:55

In the course of the life of a massive star, wind-wind interaction can give rise to the formation of circumstellar nebulae which are both predicted and observed in nature. We present generic model calculations to predict the properties of such nebulae for blue supergiants. From stellar evolution calculations including rotation, we obtain the time dependence of the stellar wind properties and of the stellar radiation field. These are used as input for hydro-calculations of the circumstellar medium throughout the star's life. Here, we present the results for a rapidly rotating 12 solar mass single star. This star undergoes a blue loop during its post-main-sequence evolution, at the onset of which its contraction spins it up close to critical rotation. Due to the consequent anisotropic mass loss, the blue supergiant wind sweeps up the preceding slow wind into an hourglass structure. Its collision with the previously formed spherical red supergiant wind shell forms a short-lived luminous nebula consisting of two polar caps and a central inner ring. With time, the polar caps evolve into mid-latitude rings which gradually move toward the equatorial plane while the central ring fades. These structures are reminiscent of the observed nebulae around the blue supergiant Sher 25 and the progenitor of SN 1987A. The simple model of an hourglass colliding with a spherical shell retrieves most of the intriguing nebula geometries discovered around blue supergiants, and suggests that they form an evolutionary sequence. Our results indicate that binarity is not required to obtain them.

Numerical simulations of continuum-driven winds of super-Eddington stars

Authors:A. J. van Marle, S. P. Owocki, N. J. Shaviv
Date:2008-06-27 14:49:20

We present the results of numerical simulations of continuum-driven winds of stars that exceed the Eddington limit and compare these against predictions from earlier analytical solutions. Our models are based on the assumption that the stellar atmosphere consists of clumped matter, where the individual clumps have a much larger optical thickness than the matter between the clumps. This `porosity' of the stellar atmosphere reduces the coupling between radiation and matter, since photons tend to escape through the more tenuous gas between the clumps. This allows a star that formally exceeds the Eddington limit to remain stable, yet produce a steady outflow from the region where the clumps become optically thin. We have made a parameter study of wind models for a variety of input conditions in order to explore the properties of continuum-driven winds. The results show that the numerical simulations reproduce quite closely the analytical scalings. The mass loss rates produced in our models are much larger than can be achieved by line driving. This makes continuum driving a good mechanism to explain the large mass loss and flow speeds of giant outbursts, as observed in eta Carinae and other luminous blue variable (LBV) stars. Continuum driving may also be important in population III stars, since line driving becomes ineffective at low metallicities. We also explore the effect of photon tiring and the limits it places on the wind parameters.
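
The porosity effect can be summarized by an effective opacity that drops once individual clumps become optically thick; schematically, in the porosity-length formalism (our paraphrase of the standard form, not a formula quoted from this paper),

$$ \kappa_{\rm eff} \;=\; \kappa\,\frac{1-e^{-\tau_{b}}}{\tau_{b}}, \qquad \tau_{b} \;=\; \kappa\,\rho\,h, $$

where $h$ is the porosity length. For $\tau_b \ll 1$ the medium absorbs as if smooth ($\kappa_{\rm eff}\to\kappa$), while for $\tau_b \gg 1$ photons leak between the clumps and $\kappa_{\rm eff}\to 1/(\rho h)$, which is what lets a formally super-Eddington star stay bound at its dense base yet drive a wind from the region where the clumps become optically thin.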

Calculus on Lie algebroids, Lie groupoids and Poisson manifolds

Authors:Charles-Michel Marle
Date:2008-06-05 07:57:36

We begin with a short presentation of the basic concepts related to Lie groupoids and Lie algebroids, but the main part of this paper deals with Lie algebroids. A Lie algebroid over a manifold is a vector bundle over that manifold whose properties are very similar to those of a tangent bundle. Its dual bundle has properties very similar to those of a cotangent bundle: in the graded algebra of sections of its exterior powers, one can define an operator similar to the exterior derivative. We present the theory of Lie derivatives, Schouten-Nijenhuis brackets and exterior derivatives in the general setting of a Lie algebroid, its dual bundle and their exterior powers. All the results (which, for the most part, are already known) are given with detailed proofs. In the final sections, the results are applied to Poisson manifolds, whose links with Lie algebroids are very close.
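
As a concrete instance of the operator mentioned: for a Lie algebroid $(A,\rho,[\cdot,\cdot])$ over $M$, the differential acts on a section $\alpha$ of $A^{*}$ by

$$ (\mathrm{d}_{A}\alpha)(X,Y) \;=\; \rho(X)\,\alpha(Y) \;-\; \rho(Y)\,\alpha(X) \;-\; \alpha([X,Y]), \qquad X, Y \in \Gamma(A), $$

which reduces to the usual exterior derivative of a 1-form when $A = TM$ and $\rho = \mathrm{id}$.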

Differential calculus on a Lie algebroid and Poisson manifolds

Authors:Charles-Michel Marle
Date:2008-04-15 18:26:15

A Lie algebroid over a manifold is a vector bundle over that manifold whose properties are very similar to those of a tangent bundle. Its dual bundle has properties very similar to those of a cotangent bundle: in the graded algebra of sections of its exterior powers, one can define an operator similar to the exterior derivative. We present in this paper the theory of Lie derivatives, Schouten-Nijenhuis brackets and exterior derivatives in the general setting of a Lie algebroid, its dual bundle and their exterior powers. All the results (which, for the most part, are already known) are given with detailed proofs. In the final sections, the results are applied to Poisson manifolds.

Luminous Blue Variables & Mass Loss near the Eddington Limit

Authors:Stanley P. Owocki, Allard Jan van Marle
Date:2008-01-16 15:52:37

During the course of their evolution, massive stars lose a substantial fraction of their initial mass, both through steady winds and through relatively brief eruptions during their Luminous Blue Variable (LBV) phase. This talk reviews the dynamical driving of this mass loss, contrasting the line-driving of steady winds to the potential role of continuum driving for eruptions during LBV episodes when the star exceeds the Eddington limit. A key theme is to emphasize the inherent limits that self-shadowing places on line-driven mass loss rates, whereas continuum driving can in principle drive mass up to the "photon-tiring" limit, for which the energy to lift the wind becomes equal to the stellar luminosity. We review how the "porosity" of a highly clumped atmosphere can regulate continuum-driven mass loss, but also discuss recent time-dependent simulations of how base mass flux that exceeds the tiring limit can lead to flow stagnation and a complex, time-dependent combination of inflow and outflow regions. A general result is thus that porosity-mediated continuum driving in super-Eddington phases can explain the large, near tiring-limit mass loss inferred for LBV giant eruptions.

The circumstellar medium around a rapidly rotating, chemically homogeneously evolving, possible gamma-ray burst progenitor

Authors:Allard Jan van Marle, Norbert Langer, Sung-Chul Yoon, Guillermo Garcia-Segura
Date:2007-11-29 19:44:11

Rapidly rotating, chemically homogeneously evolving massive stars are considered to be progenitors of long gamma-ray bursts. We present numerical simulations of the evolution of the circumstellar medium around a rapidly rotating 20 Msol star at a metallicity of Z=0.001. Its rotation is fast enough to produce quasi-chemically homogeneous evolution. While conventionally, a star of 20 Msol would not evolve into a Wolf-Rayet stage, the considered model evolves from the main sequence directly to the helium main sequence. We use the time-dependent wind parameters, such as mass loss rate, wind velocity and rotation-induced wind anisotropy, from the evolution model as input for a 2D hydrodynamical simulation. While the outer edge of the pressure-driven circumstellar bubble is spherical, the circumstellar medium close to the star shows strong non-spherical features during and after the periods of near-critical rotation. We conclude that the circumstellar medium around rapidly rotating massive stars differs considerably from the surrounding material of non-rotating stars of similar mass. Multiple blue-shifted high velocity absorption components in gamma-ray burst afterglow spectra are predicted. As a consequence of near-critical rotation and short stellar evolution time scales during the last few thousand years of the star's life, we find a strong deviation of the circumstellar density profile in the polar direction from the 1/R^2 density profile normally associated with stellar winds close to the star.

Continuum driven winds from super-Eddington stars. A tale of two limits

Authors:A. J. van Marle, S. P. Owocki, N. J. Shaviv
Date:2007-08-30 17:58:38

Continuum driving is an effective method to drive a strong stellar wind. It is governed by two limits: the Eddington limit and the photon-tiring limit. A star must exceed the effective Eddington limit for continuum driving to overcome the stellar gravity. The photon-tiring limit places an upper limit on the mass loss rate that can be driven to infinity, given the energy available in the radiation field of the star. Since continuum driving does not require the presence of metals in the stellar atmosphere, it is particularly suited to removing mass from low- and zero-metallicity stars and can play a crucial part in their evolution. Using a porosity length formalism we compute numerical simulations of super-Eddington, continuum-driven winds to explore their behaviour for stars both below and above the photon-tiring limit. We find that below the photon-tiring limit, continuum driving can produce a large, steady mass loss rate at velocities on the order of the escape velocity. If the star exceeds the photon-tiring limit, a steady solution is no longer possible. While the effective mass loss rate is still very large, the wind velocity is much smaller.

Constraints on gamma-ray burst and supernova progenitors through circumstellar absorption lines. (II): Post-LBV Wolf-Rayet stars

Authors:A. J. van Marle, N. Langer, G. Garcia-Segura
Date:2007-04-16 21:23:17

Van Marle et al. (2005) showed that circumstellar absorption lines in early Type Ib/c supernova and gamma-ray burst afterglow spectra may reveal the progenitor evolution of the exploding Wolf-Rayet star. While the quoted paper deals with Wolf-Rayet stars which evolved through a red supergiant stage, we investigate here the initially more massive Wolf-Rayet stars which are thought to evolve through a Luminous Blue Variable (LBV) stage. We perform hydrodynamic simulations of the evolution of the circumstellar medium around a 60 Msol star, from the main sequence through the LBV and Wolf-Rayet stages, up to core collapse. We then compute the column density of the circumstellar matter as a function of radial velocity, time and angle. This allows a comparison with the number and blue-shifts of absorption components in the spectra of LBVs, Wolf-Rayet stars, Type Ib/c supernovae and gamma-ray burst afterglows. Our simulation for the post-LBV stage shows the formation of various absorption components, which are, however, rather short-lived; they dissipate on time scales shorter than 50,000 yr. As the LBV stage is thought to occur at the beginning of core helium burning, the remaining Wolf-Rayet lifetime is expected to be one order of magnitude larger. When interpreting the absorption components in the afterglow spectrum of GRB-021004 as circumstellar, it can be concluded that the progenitor of this source most likely did not evolve through an LBV stage. However, a close binary with a late common-envelope phase (Case C) may produce a circumstellar medium that closely resembles the LBV to Wolf-Rayet evolution, but with a much shorter Wolf-Rayet period.

Models for the circumstellar medium of long gamma-ray burst progenitor candidates

Authors:A. J. van Marle, N. Langer, A. Achterberg, G. Garcia-Segura
Date:2006-11-27 17:02:50

We present hydrodynamical models of circumstellar medium (CSM) of long gamma-ray burst (GRB) progenitor candidates. These are massive stars that have lost a large amount of mass in the form of stellar wind during their evolution. There are two possible ways to probe the CSM of long GRB progenitors. Firstly, the GRB afterglow consists of synchrotron radiation, emitted when the GRB jet sweeps up the surrounding medium. Therefore, the lightcurve is directly related to the density profile of the CSM. The density can either decrease with the radius squared (as is the case for a freely expanding stellar wind) or be constant (as we would expect for shocked wind or the interstellar medium). Secondly, material between the GRB and the observer will absorb part of the afterglow radiation, causing absorption lines in the afterglow spectrum. In some cases, such absorption lines are blue-shifted relative to the source indicating that the material is moving away from the progenitor star. This can be explained in terms of wind interactions in the CSM. We can use the CSM of these stars to investigate their prior evolutionary stage.
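
The two density regimes follow from mass conservation: a freely expanding wind of constant speed $v_{\rm wind}$ and mass loss rate $\dot{M}$ has

$$ \rho_{\rm wind}(R) \;=\; \frac{\dot{M}}{4\pi R^{2} v_{\rm wind}}, $$

falling off as $1/R^{2}$, whereas the shocked wind between the termination shock and the contact discontinuity is hot and nearly isobaric, so its density is roughly constant. Which regime the afterglow probes depends on where the jet's external shock runs relative to the wind termination shock.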

Forming a constant density medium close to long gamma-ray bursts

Authors:A. J. van Marle, N. Langer, A. Achterberg, G. Garcia-Segura
Date:2006-05-29 12:56:10

The progenitor stars of long Gamma-Ray Bursts (GRBs) are thought to be Wolf-Rayet stars, which generate a massive and energetic wind. Nevertheless, about 25 percent of all GRB afterglow light curves indicate a constant density medium close to the exploding star. We explore various ways to produce this, by creating situations where the wind termination shock arrives very close to the star, as the shocked wind material has a nearly constant density. Typically, the distance between a Wolf-Rayet star and the wind termination shock is too large to allow afterglow formation in the shocked wind material. Here, we investigate possible causes allowing for a smaller distance: a high density or a high pressure in the surrounding interstellar medium (ISM), a weak Wolf-Rayet star wind, the presence of a binary companion, and fast motion of the Wolf-Rayet star relative to the ISM. We find that all four scenarios are possible in a limited parameter space, but that none of them is by itself likely to explain the large fraction of constant density afterglows. A low GRB progenitor metallicity and a high GRB energy make the occurrence of a GRB afterglow in a constant density medium more likely. This may be consistent with constant densities being preferentially found for energetic, high redshift GRBs.

Constraints on gamma-ray burst and supernova progenitors through circumstellar absorption lines

Authors:Allard-Jan van Marle, Norbert Langer, Guillermo Garcia-Segura
Date:2005-07-28 09:29:28

Long gamma-ray bursts are thought to be caused by a subset of exploding Wolf-Rayet stars. We argue that the circumstellar absorption lines in early supernova and in gamma-ray burst afterglow spectra may allow us to determine the main properties of the Wolf-Rayet star progenitors which can produce those two events. To demonstrate this, we first simulate the hydrodynamic evolution of the circumstellar medium around a 40 Msun star up to the time of the supernova explosion. Knowledge of density, temperature and radial velocity of the circumstellar matter as function of space and time allows us to compute the column density in the line of sight to the centre of the nebula, as a function of radial velocity, angle, and time. Our column density profiles indicate the possible number, strengths, widths and velocities of absorption line components in supernova and gamma-ray burst afterglow spectra. Our example calculation shows four distinct line features during the Wolf-Rayet stage, at about 0, 50, 150-700 and 2200 km/s, with only those of the lowest and highest velocity present at all times. The 150-700 km/s feature decays rapidly as a function of time after the onset of the Wolf-Rayet stage. It consists of a variable number of components, and, especially in its evolved stage, depends strongly on the particular line of sight. A comparison with absorption lines detected in the afterglow of GRB 021004 suggests that the high velocity absorption component in GRB 021004 may be attributed to the free streaming Wolf-Rayet wind, which is consistent with the steep density drop indicated by the afterglow light curve. The presence of the intermediate velocity components implies that the duration of the Wolf-Rayet phase of the progenitor of GRB 021004 was much smaller than the average Wolf-Rayet life time.
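
The post-processing step described, turning simulated density and velocity fields into velocity-resolved column densities, can be sketched as follows; the grid, profiles, and velocity bins are illustrative placeholders, not the paper's setup:

```python
import numpy as np

def column_density_per_velocity(r, n, v_r, v_edges):
    """Bin N(v) = sum n(r)*dr by radial velocity along one line of sight.
    r [cm] (sorted), n [cm^-3], v_r [km/s] -> column per bin [cm^-2]."""
    dr = np.gradient(r)                     # path length per cell
    idx = np.digitize(v_r, v_edges) - 1     # velocity bin of each cell
    col = np.zeros(len(v_edges) - 1)
    valid = (idx >= 0) & (idx < len(col))
    np.add.at(col, idx[valid], (n * dr)[valid])
    return col

# Toy line of sight: fast free wind inside, slow swept-up shell outside;
# each velocity bin then maps onto a potential absorption component.
r = np.linspace(1e17, 1e19, 400)
n = 1e4 * (r[0] / r) ** 2                   # 1/R^2 free-wind density
v = np.where(r < 3e18, 2200.0, 50.0)        # km/s
edges = np.array([0.0, 100.0, 1000.0, 3000.0])
print(column_density_per_velocity(r, n, v, edges))
```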

A cotangent bundle slice theorem

Authors:Tanya Schmah
Date:2004-09-09 07:52:49

This article concerns cotangent-lifted Lie group actions; our goal is to find local and ``semi-global'' normal forms for these and associated structures. Our main result is a constructive cotangent bundle slice theorem that extends the Hamiltonian slice theorem of Marle, Guillemin and Sternberg. The result applies to all proper cotangent-lifted actions, around points with fully-isotropic momentum values. We also present a ``tangent-level'' commuting reduction result and use it to characterise the symplectic normal space of any cotangent-lifted action. In two special cases, we arrive at splittings of the symplectic normal space, which lead to refinements of the reconstruction equations (bundle equations) for a Hamiltonian vector field. We also note local normal forms for symplectic reduced spaces of cotangent bundles.

Optimal reduction

Authors:Juan-Pablo Ortega
Date:2002-06-28 14:04:35

We generalize various symplectic reduction techniques to the context of the optimal momentum map. Our approach allows the construction of symplectic point and orbit reduced spaces purely within the Poisson category under hypotheses that do not necessarily imply the existence of a momentum map. We construct an orbit reduction procedure for canonical actions on a Poisson manifold that exhibits an interesting interplay with the von Neumann condition previously introduced by the author in his study of singular dual pairs. This condition ensures that the orbits in the momentum space of the optimal momentum map (we call them polar reduced spaces) admit a presymplectic structure that generalizes the Kostant--Kirillov--Souriau symplectic structure of the coadjoint orbits in the dual of a Lie algebra. Using this presymplectic structure, the optimal orbit reduced spaces are symplectic with a form that satisfies a relation identical to the classical one obtained by Marle, Kazhdan, Kostant, and Sternberg for free Hamiltonian actions on a symplectic manifold. In the symplectic case we provide a necessary and sufficient condition for the polar reduced spaces to be symplectic. In general, the presymplectic polar reduced spaces are foliated by symplectic submanifolds that are obtained through a generalization to the optimal context of the so called Sjamaar Principle, already existing in the theory of Hamiltonian singular reduction. We use these ideas in the construction of a family of presymplectic homogeneous manifolds and of its symplectic foliation and we show that these reduction techniques can be implemented in stages in total analogy with the case of free globally Hamiltonian proper actions.

Nonholonomic systems with symmetry allowing a conformally symplectic reduction

Authors:Pedro de M. Rios, Jair Koiller
Date:2002-03-11 20:28:39

Non-holonomic mechanical systems can be described by a degenerate almost-Poisson structure (dropping the Jacobi identity) in the constrained space. If enough symmetries transversal to the constraints are present, the system reduces to a nondegenerate almost-Poisson structure on a ``compressed'' space. Here we show, in the simplest non-holonomic systems, that in favorable circumstances the compressed system is conformally symplectic, although the ``non-compressed'' constrained system never admits a Jacobi structure (in the sense of Marle et al.).

A symplectic slice theorem

Authors:Juan-Pablo Ortega, Tudor S. Ratiu
Date:2001-10-08 14:24:12

We provide a model for an open invariant neighborhood of any orbit in a symplectic manifold endowed with a canonical proper symmetry. Our results generalize the constructions of Marle and of Guillemin and Sternberg for canonical symmetries that have an associated momentum map. In those papers the momentum map played a crucial role in the construction of the tubular model. The present work shows that the so-called Chu map, which exists for any canonical action, unlike the momentum map, can be used instead in the construction of the tubular model. Hamilton's equations for any invariant Hamiltonian function take a particularly simple form in these tubular variables. As an application we find situations, which we call tubewise Hamiltonian, in which the existence of a standard momentum map in invariant neighborhoods is guaranteed.

Study by MOA of extra-solar planets in gravitational microlensing events of high magnification

Authors:I. A. Bond, N. J. Rattenbury, J. Skuljan, F. Abe, R. J. Dodd, J. B. Hearnshaw, M. Honda, J. Jugaku, P. M. Kilmartin, A. Marles, K. Masuda, Y. Matsubara, Y. Muraki, T. Nakamura, G. Nankivell, S. Noda, C. Noguchi, K. Ohnishi, M. Reid, To. Saito, H. Sato, M. Sekiguchi, D. J. Sullivan, T. Sumi, M. Takeuti, Y. Watase, S. Wilkinson, R. Yamada, T. Yanagisawa, P. C. M. Yock
Date:2001-02-11 22:02:16

A search for extra-solar planets was carried out in three gravitational microlensing events of high magnification, MACHO 98-BLG-35, MACHO 99-LMC-2, and OGLE 00-BUL-12. Photometry was derived from observational images by the MOA and OGLE groups using an image subtraction technique. For MACHO 98-BLG-35, additional photometry derived from the MPS and PLANET groups was included. Planetary modeling of the three events was carried out in a super-cluster computing environment. The estimated probability for explaining the data on MACHO 98-BLG-35 without a planet is <1%. The best planetary model has a planet of mass ~(0.4-1.5) X 10^-5 M_Earth at a projected radius of either ~1.5 or ~2.3 AU. We show how multi-planet models can be applied to the data. We calculated exclusion regions for the three events and found that Jupiter-mass planets can be excluded with projected radii from as wide as about 30 AU to as close as around 0.5 AU for MACHO 98-BLG-35 and OGLE 00-BUL-12. For MACHO 99-LMC-2, the exclusion region extends out to around 10 AU and constitutes the first limit placed on a planetary companion to an extragalactic star. We derive a particularly high peak magnification of ~160 for OGLE 00-BUL-12. We discuss the detectability of planets with masses as low as Mercury in this and similar events.
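
For reference, these magnifications follow from the standard point-source point-lens relation; a one-line check in Python (the formula is standard, and the mapping u_0 ~ 1/A_peak is the usual high-magnification approximation):

```python
import math

def magnification(u):
    """Point-source point-lens magnification, u in Einstein radii:
    A(u) = (u^2 + 2) / (u * sqrt(u^2 + 4)); A ~ 1/u for u << 1."""
    return (u * u + 2.0) / (u * math.sqrt(u * u + 4.0))

# A peak magnification of ~160, as derived above for OGLE 00-BUL-12,
# corresponds to a minimum impact parameter u_0 of roughly 1/160.
print(magnification(1.0 / 160.0))
```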

Real-Time Difference Imaging Analysis of MOA Galactic Bulge Observations During 2000

Authors:I. A. Bond, F. Abe, R. J. Dodd, J. B. Hearnshaw, M. Honda, J. Jugaku, P. M. Kilmartin, A. Marles, K. Masuda, Y. Matsubara, Y. Muraki, T. Nakamura, G. Nankivell, S. Noda, C. Noguchi, K. Ohnishi, N. J. Rattenbury, M. Reid, To. Saito, H. Sato, M. Sekiguchi, J. Skuljan, D. J. Sullivan, T. Sumi, M. Takeuti, Y. Watase, S. Wilkinson, R. Yamada, T. Yanagisawa, P. C. M. Yock
Date:2001-02-11 01:36:29

We describe observations of the Galactic Bulge carried out by the MOA group during 2000 that were designed to efficiently detect gravitational microlensing of faint stars in which the magnification is high and/or of short duration. These events are particularly useful for studies of extra-solar planets and faint stars. Approximately 17 square degrees were monitored at a sampling rate of up to 6 times per night. The images were analysed in real-time using a difference imaging technique. Twenty microlensing candidates were detected, of which 8 were alerted to the microlensing community whilst in progress. Approximately half of the candidates had high magnifications (>~10), at least one had very high magnification (>~50), and one exhibited a clear parallax effect. The details of these events are reported here, together with details of the on-line difference imaging technique. Some nova-like events were also observed and these are described, together with one asteroid.

Recent results by the MOA group on gravitational microlensing

Authors:P. Yock, I. Bond, N. Rattenbury, J. Skuljan, T. Sumi, F. Abe, R. Dodd, J. Hearnshaw, M. Honda, J. Jugaku, P. Kilmartin, A. Marles, K. Masuda, Y. Matsubara, Y. Muraki, T. Nakamura, G. Nankivell, S. Noda, C. Noguchi, K. Ohnishi, M. Reid, To. Saito, H. Sato, M. Sekiguchi, D. Sullivan, M. Takeuti, Y. Watase, T. Yanagisawa
Date:2000-07-21 03:02:36

Recent work by the MOA gravitational microlensing group is briefly described, including (i) the current observing strategy, (ii) use of a high-speed parallel computer for analysis of results by inverse ray shooting, (iii) analysis of the light curve of event OGLE-2000-BUL12 in terms of extra-solar planets, and (iv) the MOA alert system using difference imaging.