[Paper Study] Neural Machine Translation by jointly learning to align and translate

field : NLP
understanding : 😃😃😃

Paper study
Author

hoyeon

Published

March 28, 2023

Introduction

  • At the time, most neural machine translation systems used a model made up of an encoder and a decoder.
  • These models compress the source sentence into a fixed-length context vector, which degrades performance on long sentences.
  • The paper therefore uses a different context vector for each target word, which mitigates the degradation on long sentences.
  • With this new approach, the model achieved translation performance comparable to the phrase-based system that was the state of the art at the time.

Problem Setting

  • In earlier RNN-based seq2seq models, the source sentence passes through the encoder and becomes a fixed-size context vector, which is then used as the decoder's initial hidden state to generate the output sequence.
  • Because every sentence, whatever its length, must be converted into a fixed-length context vector, this architecture loses, compresses, or omits too much information for long sentences, and performance on long sentences degrades as a result.

Background

  • From a probabilistic perspective, translation amounts to finding the target sentence \({\bf{y}}\) that maximizes the conditional probability of \({\bf{y}}\) given the source sentence \({\bf{x}}\).

\[\hat{\bf{y}} = \underset{\bf{y}}{\text{argmax}}\,p({\bf{y|x}})\]

  • By the chain rule, \(p({\bf{y|x}})\) factorizes as follows.
\[\begin{aligned} p({\bf{y|x}}) &= \prod_{t=1}^Tp(y_t|y_1,y_2,\dots,y_{t-1},{\bf{x}}) \\ &= p(y_1|{\bf{x}})p(y_2|y_1,{\bf{x}})p(y_3|y_1,y_2,{\bf{x}})\dots p(y_T|y_1,\dots,y_{T-1},{\bf{x}}) \end{aligned}\]
  • In the earlier seq2seq model, a well-trained decoder yields, at each time step \(t\), the conditional probability that appears as a factor in the chain rule. That is:
\[\begin{aligned} &p(y_t|y_1,y_2,\dots,y_{t-1},{\bf{x}}) = g(y_{t-1},s_{t},c)\\ &\text{where } \\ &s_t = f(y_{t-1},s_{t-1},c)\\ &c = q(\{h_1,h_2,\dots,h_{T_x}\})\\ &s_t\text{ : hidden state of the decoder at time } t\\ &c\text{ : context vector (encoder output, used as the initial hidden state of the decoder)}\\ &h_t\text{ : hidden state of the encoder at time } t \\ &q\text{ : arbitrary function} \end{aligned}\]
  • In the seq2seq paper reviewed previously, \(q(\{h_1,\dots,h_{T_x}\}) = h_{T_x}\), that is, the last encoder hidden state.
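The chain-rule factorization above can be sketched in a few lines of Python. This is only an illustration: the per-step probabilities below are made-up numbers standing in for what a trained decoder \(g\) would output, and `sentence_log_prob` is a hypothetical helper name, not something from the paper.

```python
import math

def sentence_log_prob(step_probs):
    """Chain rule in log space: log p(y|x) = sum_t log p(y_t | y_<t, x)."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical values of p(y_t | y_1..y_{t-1}, x) for a 3-word target sentence.
step_probs = [0.5, 0.8, 0.25]

log_p = sentence_log_prob(step_probs)
prob = math.exp(log_p)  # equals the product of the three factors
```

Summing log-probabilities instead of multiplying probabilities is the usual choice in practice, since the product of many small factors underflows quickly for long sentences.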

Method

Intuition


Source: paper, Figure 1

  • The figure above shows the model proposed in the paper: the upper part is the decoder and the lower part is the encoder.
  • The decoder is almost identical to the earlier seq2seq architecture, except that a new piece of information (the curved incoming arrow) is fed in, and the figure shows that it is produced by the encoder at the bottom.
  • The encoder uses a bidirectional RNN. Reading the input in both the forward and backward directions, it produces for each position \(j\) information \(h_j\) that covers the whole input sequence while focusing on the \(j\)-th word (token).
  • The information from each position of the input sequence is multiplied by a weight \(\alpha\), which expresses how important that information is for producing a given target, and the products are summed.
  • In other words, the sum is a new piece of information that recombines everything in the input sequence according to how important or relevant it is for predicting the target. This sum is then used to compute the decoder hidden state \(s_t\), which carries the information needed to produce the new output \(y_t\).

Bidirectional vs. unidirectional RNN


Hidden state \(h_t\) in a unidirectional RNN

  • How is it produced? \(\rightarrow\) By combining the input word \(x_t\) with \(h_{t-1}\), the information about the sequence read so far, into a new piece of information.
  • It summarizes the sequence read up to the current word \(x_t\), with particular emphasis on the most recently read word.
  • It does not take into account the part of the sequence that comes after the current input \(x_t\).

Hidden state \(h_t\) in a bidirectional RNN

  • How is it produced? \(\rightarrow\) By passing the input sequence through two separate RNNs, one reading forward and one reading backward.
  • The forward RNN extracts information about the sequence read in order up to the current word \(x_t\), with particular emphasis on the last word read, \(x_t\).
  • The backward RNN starts from the end and extracts information about the sequence read in reverse down to \(x_t\), again with particular emphasis on \(x_t\).
  • Neither direction on its own sees the part of the sequence it has not yet read, but combining the two gives information about the whole sequence plus an emphasis on the specific position \(x_t\).

Intuitive summary


  • The earlier seq2seq model lets the decoder use only what the encoder has squeezed into a fixed-length context vector, so information is lost on long sentences and performance suffers.
  • The paper therefore builds, each time the decoder produces a target word, a new piece of information that recombines the information from every position of the input sequence according to its importance, and uses it for the prediction.
  • Through the weights \(\alpha\), this new information focuses on the context around particular positions when the decoder predicts each target of the output sequence. For that reason the method borrows the term attention and is called the attention mechanism.
  • Using the attention mechanism relieves the encoder of the burden of encoding all the information in the source sentence into a single fixed-length context vector.

Modeling

Decoder

  • In the new model, each conditional probability in the chain rule is defined as follows.
\[\begin{aligned} p(y_i|y_1,\dots,y_{i-1},{\bf{x}}) = g(y_{i-1},s_{i},c_i) \end{aligned}\]
  • In the earlier seq2seq model, the context vector \(c\) was a fixed value that did not change as the target \(y_i\) changed.
  • The proposed model instead uses a distinct context vector \(c_i\) for computing each target \(y_i\).
  • Here \(c_i\) is used to compute the decoder hidden state \(s_{i}\). That is:

\[s_i = f(s_{i-1},y_{i-1},c_i)\]

  • \(c_i\) recombines the information \(h\) from the input sequence according to the importance weights \(\alpha\) in order to predict the target \(y_i\). Concretely:
\[\begin{aligned} c_i = \sum_{j=1}^{T_x}\alpha_{ij}h_{j} \end{aligned}\]
  • Each annotation, that is each hidden state \(h_j\), contains information about the whole sentence but especially about the context around the \(j\)-th position. (This is a consequence of the bidirectional RNN.)

  • Note: even when \(i\) changes, that is when a different target is being predicted, the annotations (the information from the input sequence being consulted) do not change. They can be stored once and reused for every \(y_i\), which is worth keeping in mind when implementing the model.

  • Each weight \(\alpha_{ij}\) is computed as follows.

\[\begin{aligned} \alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})}\\ \text{where } e_{ij} &= a(s_{i-1},h_j) \end{aligned}\]
  • \(\alpha\) is the output of a softmax: the scores \(e\) are rescaled to values between 0 and 1 that sum to 1 over the input positions.
  • Here \(a\) is the alignment model. It scores how well the \(j\)-th position of the input sequence and the \(i\)-th position of the output sequence match (how related they are), and it is computed by a feedforward neural network.
  • Because the (soft) alignment is computed explicitly in this way, unlike earlier machine translation models in which the alignment was a latent (hidden) variable, gradients can be backpropagated through the alignment model, so the alignment between input and output sequences can be learned better than in those earlier models.
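The decoder-side computation above (scores \(e_{ij}\), softmax weights \(\alpha_{ij}\), context vector \(c_i\)) can be sketched as follows. This is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the notes above only say that \(a\) is a feedforward network, and the additive form \(v_a^\top \tanh(W_a s_{i-1} + U_a h_j)\) used here is the one given in the paper's appendix. The dimensions, the random weights, and the helper names (`matvec`, `softmax`, `attention_context`, `rand_matrix`) are all hypothetical.

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    """Numerically stable softmax: alpha_ij = exp(e_ij) / sum_k exp(e_ik)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(s_prev, annotations, W_a, U_a, v_a):
    """Return (alpha_i, c_i) for one decoder step i."""
    Ws = matvec(W_a, s_prev)  # depends only on s_{i-1}, shared by all j
    scores = []
    for h_j in annotations:
        Uh = matvec(U_a, h_j)
        hidden = [math.tanh(a + b) for a, b in zip(Ws, Uh)]
        scores.append(sum(v * t for v, t in zip(v_a, hidden)))  # e_ij
    alpha = softmax(scores)
    dim = len(annotations[0])
    # c_i = sum_j alpha_ij * h_j, computed one coordinate at a time
    context = [sum(a * h[k] for a, h in zip(alpha, annotations)) for k in range(dim)]
    return alpha, context

# Toy setup with random numbers standing in for trained parameters.
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

T_x, h_dim, s_dim, a_dim = 4, 6, 5, 3
annotations = rand_matrix(T_x, h_dim)            # h_1 .. h_{T_x}
s_prev = [rng.uniform(-1, 1) for _ in range(s_dim)]  # s_{i-1}
W_a, U_a = rand_matrix(a_dim, s_dim), rand_matrix(a_dim, h_dim)
v_a = [rng.uniform(-1, 1) for _ in range(a_dim)]

alpha, c_i = attention_context(s_prev, annotations, W_a, U_a, v_a)
```

Since the annotations do not depend on \(i\), the products \(U_a h_j\) could be computed once and reused at every decoder step; the sketch recomputes them only to stay short.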

Encoder

  • The encoder uses a bidirectional RNN, which reads the input sequence in both the forward and the backward direction.

  • The forward RNN \(\overset{\rightarrow}{f}\) reads from \(x_1\) to \(x_{T_x}\) and computes the forward hidden states \((\overset{\rightarrow}{h_1},\dots,\overset{\rightarrow}{h_{T_x}})\).

  • The backward RNN \(\overset{\leftarrow}{f}\) reads from \(x_{T_x}\) to \(x_1\) and computes the backward hidden states \((\overset{\leftarrow}{h_{T_x}},\dots,\overset{\leftarrow}{h_{1}})\).

  • The forward and backward hidden states are concatenated to form the combined hidden state for each position: \[h_j = \left[\overset{\rightarrow}{h}_j^{\,\,T};\overset{\leftarrow}{h}_j^{\,\,T}\right]^T\]

  • \(h_j\) contains information about the entire input sequence, but it is concentrated around position \(j\).
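The encoder described above can be sketched as two independent RNN passes whose states are concatenated per position. This is a simplified illustration: it uses a plain tanh cell instead of the gated unit used in the paper, and the sizes, random weights, and function names (`rnn_step`, `run_rnn`, `bidirectional_annotations`) are assumptions made for the example.

```python
import math
import random

def rnn_step(W, U, x, h):
    """Plain tanh cell (a simplification): h' = tanh(W x + U h)."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(W[k], x))
                  + sum(u * hi for u, hi in zip(U[k], h)))
        for k in range(len(U))
    ]

def run_rnn(W, U, xs):
    """Read xs in the given order; return the hidden state after each input."""
    h = [0.0] * len(U)
    states = []
    for x in xs:
        h = rnn_step(W, U, x, h)
        states.append(h)
    return states

def bidirectional_annotations(xs, fwd, bwd):
    """h_j = [forward h_j ; backward h_j] for every position j."""
    forward = run_rnn(fwd[0], fwd[1], xs)
    # Read the reversed input, then flip the states back to positions 1..T_x.
    backward = run_rnn(bwd[0], bwd[1], xs[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

T_x, x_dim, n = 4, 3, 2                            # toy sizes
xs = rand_matrix(T_x, x_dim)                       # stand-in word vectors
fwd = (rand_matrix(n, x_dim), rand_matrix(n, n))   # forward RNN weights
bwd = (rand_matrix(n, x_dim), rand_matrix(n, n))   # backward RNN weights
annotations = bidirectional_annotations(xs, fwd, bwd)

# Replace only the last word: the forward half of h_1 cannot see it,
# while the backward half of h_1 can.
xs_changed = xs[:-1] + [[1.0, -1.0, 0.5]]
changed = bidirectional_annotations(xs_changed, fwd, bwd)
```

Comparing `annotations[0]` with `changed[0]` shows why the concatenation matters: only the backward half of \(h_1\) reacts to a change at the end of the sentence.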

Experiments


Source: paper, Figure 2

  • RNNenc is the earlier model; RNNsearch is the model proposed in the paper.
  • The suffix \(-50\) denotes a model trained on a dataset with a maximum sentence length of \(50\).
  • RNNsearch outperforms RNNenc. (RNNsearch-30 even scores higher than RNNenc-50.)
  • RNNsearch is more robust to longer sentences, and RNNsearch-50 in particular performs far better on long sentences.

  • The figure above visualizes the weights \(\alpha\).
  • Most of the large values lie on the main diagonal, which shows that the alignment between English and French words is largely monotonic.
  • This is intuitive, since word order in English and French coincides for the most part.
  • There are exceptions, however: adjectives and nouns are typically ordered differently in the two languages.
  • Even when the word order differs, the proposed model still aligns the words correctly.
  • For example, in Figure 3 [European Economic Area] is translated as [zone économique européen].
  • Because of the word-order difference, the model jumps over two words to align zone correctly with Area. (It then steps back to align the remaining words correctly as well.)
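To make the reading of the weight matrix concrete, here is a small sketch that turns a matrix of \(\alpha_{ij}\) into a hard alignment and checks whether it is monotonic. The numbers in `alpha` are invented for illustration and are not taken from the paper's figure; `hard_alignment` and `is_monotonic` are hypothetical helper names.

```python
def hard_alignment(alpha):
    """For each target position i, pick the source position j with the largest alpha_ij."""
    return [max(range(len(row)), key=row.__getitem__) for row in alpha]

def is_monotonic(alignment):
    """True when the aligned source positions never move backwards."""
    return all(a <= b for a, b in zip(alignment, alignment[1:]))

source = ["European", "Economic", "Area"]
target = ["zone", "économique", "européen"]

# Made-up attention weights: one row per target word, one column per source word.
alpha = [
    [0.05, 0.10, 0.85],  # zone       -> mostly attends to Area
    [0.10, 0.80, 0.10],  # économique -> mostly attends to Economic
    [0.85, 0.10, 0.05],  # européen   -> mostly attends to European
]

alignment = hard_alignment(alpha)
pairs = [(target[i], source[j]) for i, j in enumerate(alignment)]
monotonic = is_monotonic(alignment)
```

With these toy weights the large values sit on the anti-diagonal, so the alignment is not monotonic, which mirrors the noun-adjective reordering described above.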

Experiment details

  • RNNenc: the encoder and the decoder each use an RNN with 1000 hidden units.
  • RNNsearch: the encoder consists of a forward and a backward RNN (bidirectional), each with 1000 hidden units, and the decoder has 1000 hidden units.
  • Both RNNsearch and RNNenc use a multilayer neural network with a single maxout hidden layer to compute the conditional probability of each target word.
  • SGD with Adadelta
  • minibatch of 80 sentences
  • beam search
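Beam search is listed above without detail, so here is a generic sketch of the idea: keep the `beam_size` best partial translations at each step instead of only the single best one. This is a textbook-style illustration, not the paper's decoder. The toy table of next-word probabilities, the `</s>` end token, and the function names are all assumptions.

```python
import math

def beam_search(next_log_probs, beam_size, max_len, eos="</s>"):
    """Approximate argmax_y p(y|x) by keeping the beam_size best prefixes per step."""
    beams = [([], 0.0)]  # (prefix, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, lp in next_log_probs(prefix).items():
                candidates.append((prefix + [token], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished prefixes if max_len was reached
    return max(finished, key=lambda c: c[1])

# Hypothetical next-word distributions standing in for a trained decoder.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}

def toy_next_log_probs(prefix):
    dist = TABLE.get(tuple(prefix), {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in dist.items()}

greedy_seq, greedy_score = beam_search(toy_next_log_probs, beam_size=1, max_len=4)
beam_seq, beam_score = beam_search(toy_next_log_probs, beam_size=2, max_len=4)
```

In this toy table the greedy choice (`beam_size=1`) commits to "a" and ends with probability 0.3, while a beam of 2 keeps "b" alive and finds the sequence with probability 0.36. The sketch applies no length normalization.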

Conclusion

  • Neural machine translation models of the time used an encoder-decoder architecture, which suffered from information loss in the fixed-size context vector.
  • The paper therefore additionally uses, when generating each target word, a new piece of information that recombines all the information (hidden states) from the input sequence according to its importance.
  • This relieves the model of having to compress all the information in the input sequence into a fixed-size context vector, and at the same time lets it focus (attention) on the specific parts of the input sequence needed to generate each target.
  • As a result, the model performs well on long sentences, and, unlike earlier models, the alignment model and the translation model can be trained jointly.