[Paper Study] Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms

field : RL
understanding : 😃😃

Paper study
Author

hoyeon

Published

May 1, 2023

Abstract

  • ์ตœ๊ทผ ๊ฐ•ํ™”ํ•™์Šต์€ ์ˆœ์ฐจ์ ์ธ ์˜์‚ฌ๊ฒฐ์ • ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š”๋ฐ์— ์ƒ๋‹นํ•œ ์„ฑ๊ณต์„ ๊ฑฐ๋‘์—ˆ๋‹ค.
  • ์ด๋Ÿฌํ•œ ์„ฑ๊ณต๋“ค์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  multi agent reinforcement learning(MARL)์— ๋Œ€ํ•œ ์ด๋ก ์ ์ธ ๊ธฐ์ดˆ๋Š” ์ƒ๋Œ€์ ์œผ๋กœ ๋ถ€์กฑํ•˜๋‹ค.
  • ๋”ฐ๋ผ์„œ ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ๋ช‡ ๊ฐ€์ง€ MARL ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ „์ฒด์ ์œผ๋กœ ์„ค๋ช…ํ•˜๊ณ  ์ด๋ก ์ ์œผ๋กœ ๋ถ„์„ํ•œ๋‹ค.
  • ๋ฟ๋งŒ์•„๋‹ˆ๋ผ ์ค‘์š”ํ•˜์ง€๋งŒ ๋‹ค์†Œ ๋„์ „์ ์ธ ๋‚œ๊ด€๋“ค๋„ ์ œ์‹œํ•œ๋‹ค.
  • ๋…ผ๋ฌธ์˜ ๊ถ๊ทน์ ์ธ ๋ชฉํ‘œ๋Š” ํ˜„์žฌ์˜ MARL ๋ถ„์•ผ์— ๋Œ€ํ•œ ์˜๊ฒฌ,ํ‰๊ฐ€์„ ์ œ์‹œํ•˜๋Š” ๊ฒƒ์„ ๋„˜์–ด์„œ ์œ ์ตํ•œ ์—ฐ๊ตฌ๋ฐฉํ–ฅ์„ ์„ค์ •ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋•๋Š” ๊ฒƒ์ด๋‹ค.

Introduction

  • ์ตœ๊ทผ ๊ฐ•ํ™”ํ•™์Šต์€ ๋ณต์žกํ•œ ํ•จ์ˆ˜๋ฅผ ๊ทผ์‚ฌํ•  ์ˆ˜ ์žˆ๋Š” ๋”ฅ๋Ÿฌ๋‹์˜ ๋ฐœ์ „๊ณผ ๋”๋ถˆ์–ด RL์€ ๋†€๋ž๊ฒŒ ์ง„๋ณดํ–ˆ๋‹ค.

    • ์˜ˆ๋ฅผ ๋“ค๋ฉด playing real-time strategy games, playing car games, etc.
  • ๋Œ€๋ถ€๋ถ„์˜ ์„ฑ๊ณต์—๋Š” ํ•˜๋‚˜์˜ agent๊ฐ€ ์กด์žฌํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ํ•˜๋‚˜ ์ด์ƒ์˜ ๋‹ค์ˆ˜์˜ agent๊ฐ€ ์ฐธ์—ฌํ•˜๋ฉฐ ์ด๋Š” MARL๋กœ ๋ชจ๋ธ๋ง ๋œ๋‹ค.

  • MARL์€ ๊ตฌ์ฒด์ ์œผ๋กœ ๋ญ˜๊นŒ?

    • ๋‹ค์ˆ˜์˜ ์ž๋ฆฝํ•˜์—ฌ ์›€์ง์ด๋Š” agent๋“ค์ด ์กด์žฌํ•˜๊ณ 
    • ์ด๋Ÿฌํ•œ agent๊ฐ€ ๊ณตํ†ต์ ์ธ ํ™˜๊ฒฝ(environment)์— ๋†“์—ฌ์žˆ์„ ๋•Œ์˜
    • ์ˆœ์ฐจ์ ์ธ ์˜์‚ฌ๊ฒฐ์ •๋ฌธ์ œ๋ฅผ ๋‹ค๋ฃฌ๋‹ค.
    • ๊ฐ๊ฐ์˜ agent๋Š” ํ™˜๊ฒฝ,๋‹ค๋ฅธ agent๋“ค๊ณผ์˜ ์ƒํ˜ธ์ž‘์šฉ์„ ํ†ตํ•ด์„œ ๊ทธ๋“ค์ด ์–ป๋Š” ๊ฐ๊ฐ์˜ return์„ ์ตœ๋Œ€ํ™” ํ•˜๋ ค๊ณ  ๋…ธ๋ ฅํ•œ๋‹ค.
  • ํฌ๊ฒŒ MARL์€ ๋‹ค์Œ์˜ ์„ธ ๊ฐ€์ง€๋กœ ๊ตฌ๋ถ„๋œ๋‹ค.

    • fully cooperative \(\to\) agents collaborate to optimize a common return.
      • ex) robots carrying an object to a designated location; multiple autonomous vehicles reaching their destinations.
    • fully competitive \(\to\) agents compete with each other over returns that sum to zero.
      • ex) Go, chess
    • a mix of the two \(\to\) agents may cooperate or compete to optimize a general-sum return (both cooperation and competition are possible).
      • ex) soccer, basketball
  • MARL์ด ๋ฌด์กฐ๊ฑด ์ข‹์•„๋ณด์ด๋Š”๋ฐ? \(\to\) ์–ด๋ ค์šด ์ ์ด ๋งŽ๋‹ค.(๋ณต์Šต)

    • ๊ฐ ์—์ด์ „ํŠธ๋“ค์€ ๊ฐ๊ฐ์˜ return์„ ์ตœ๋Œ€ํ™” ํ•˜๋ คํ•จ.
      • ๊ท ํ˜•์ ์—์„œ ๋น„ํšจ์œจ์ ์ผ ์ˆ˜ ์žˆ์œผ๋ฉฐ(๋‚ด์‹œ๊ท ํ˜•์—์„œ ๋ฐœ์ƒํ•˜๋Š” ๋ฌธ์ œ)
      • ํ†ต์‹ ,ํ˜‘๋ ฅ์ด ํšจ์œจ์ ์œผ๋กœ ์ด๋ฃจ์–ด์งˆ ์ˆ˜ ์žˆ๋Š”๊ฐ€์— ๋Œ€ํ•œ ์ถ”๊ฐ€์ ์ธ ๊ธฐ์ค€์ด ํ•„์š”ํ•จ.
      • ์ ๋Œ€์  agent์— ๋Œ€ํ•œ robustness๋Š” ์ถฉ๋ถ„ํ•œ๊ฐ€?
    • ๋ชจ๋“  agent๋Š” ์ €๋งˆ๋‹ค์˜ policy๋ฅผ ๊ณ„์†ํ•ด์„œ ํ–ฅ์ƒ(์ˆ˜์ •).
      • B๋ผ๋Š” agent๊ฐ€ ์ด์ „๊ณผ ๋‹ค๋ฅธ action์„ ํ•˜๊ฒŒ ๋˜๋ฉด environment๊ฐ€ ๋ณ€ํ•˜๊ฒŒ ๋˜๊ณ 
      • A๋ผ๋Š” agent๊ฐ€ ์ง๋ฉดํ•˜๋Š” environment๋Š” non-stationaryํ•ด์ง€๋Š” ๊ฒƒ์„ ์˜๋ฏธ.
    • ๋ชจ๋“  agent์— ๋Œ€ํ•œ action๋“ค์˜ ์กฐํ•ฉ,๊ฒฐํ•ฉ์€ agentํ•œ๋ช…์ด ์ฆ๊ฐ€ํ• ์ˆ˜๋ก ์ง€์ˆ˜์ ์œผ๋กœ ์ฆ๊ฐ€.
    • ๊ฐ agent๋Š” ๋‹ค๋ฅธ agent๊ฐ€ ๋ฌด์—‡์„ ๊ด€์ธกํ–ˆ๋Š”์ง€๋Š” ์ •ํ™•ํžˆ,๋ชจ๋“ ๊ฒƒ์„ ์•Œ ์ˆ˜ ์—†์Œ.
      • ๋‹ค๋ฅธ agent์˜ ๊ด€์ธก์— ๋Œ€ํ•œ ์ œํ•œ๋œ ์ •๋ณด๋งŒ์„ ๊ฐ€์ง€๊ณ  ๊ฐ agent๋Š” ๊ฒฐ์ •์„ํ•จ.
      • ์ตœ์ ์˜ ๊ฒฐ์ •์ด ์•„๋‹Œ suboptimalํ•œ ๊ฒฐ์ •์„ ๊ฐ€์ ธ์˜ด.