[๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer

๐ŸŠ ๋…ผ๋ฌธ ๋งํฌ: https://dl.acm.org/doi/pdf/10.1145/3357384.3357895

Sun, F., Liu, J., Wu, J., Pei, C., Lin, X., Ou, W., & Jiang, P. (2019). BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer.

Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM '19).


  This paper was presented by Alibaba in 2019. As the name suggests, BERT4Rec applies Google's famous BERT to recommendation. The original BERT paper contains a passage defining 'sentence' and 'sequence': a 'sentence' need not be linguistic, but can be an arbitrary span of contiguous text. This is likely where the idea came from that a sequential user-item interaction history can serve as BERT's input.

 

BERT ์› ๋…ผ๋ฌธ์€ ํฌ๊ฒŒ ๋‘ ๊ฐ€์ง€ ํ…Œ์Šคํฌ๋ฅผ ์ง„ํ–‰ํ•œ๋‹ค. MLM (Masked Language Model)๊ณผ NSP (Next Sentence Prediction)์ด๋‹ค. ๋ณธ ๋…ผ๋ฌธ์ธ BERT4Rec์€ MLM ๋ถ€๋ถ„์„ Cloze task๋ผ๊ณ  ๋งํ•˜๊ณ , predicting the random masked items in the sequence์ด๋ผ๊ณ  ํ‘œํ˜„ํ•œ๋‹ค. ๋˜ํ•œ ๋…ผ๋ฌธ์˜ 'by jointly conditioning on their left and right context'๋ผ๋Š” ํ‘œํ˜„๋„ BERT bidirectionalํ•œ ํŠน์ง•์„ ์ž˜ ํ‘œํ˜„ํ•œ ๊ฒƒ ๊ฐ™๋‹ค. BERT4Rec์ด ์ถ”์ฒœ์‹œ์Šคํ…œ์˜ next prediction์„ ์–ด๋–ป๊ฒŒ ์ •์˜ํ–ˆ๋Š”์ง€ ์‚ดํŽด๋ณด๋ฉด์„œ ๋ฆฌ๋ทฐํ•˜๋„๋ก ํ•˜๊ฒ ๋‹ค.

 

 

1. INTRODUCTION

  User์˜ ๋ณ€ํ™”ํ•˜๋Š” ์„ ํ˜ธ๋„ (dynamic preference)์„ ๋ฐ˜์˜ํ•˜๊ธฐ ์œ„ํ•ด sequential recommendations ์—ฐ๊ตฌ๊ฐ€ ์ง„ํ–‰๋˜์—ˆ๋‹ค. ์ด๋Š” user์˜ historical interaction์— time ๊ฐœ๋… (์‹œํ€€์Šค)๋ฅผ ๋ถ€์—ฌํ•˜์—ฌ ๋‹ค์Œ ์•„์ดํ…œ์„ ์ถ”์ฒœํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๊ธฐ์กด ๋‹จ๋ฐฉํ–ฅ ๋ชจ๋ธ(unidirectional model)๋“ค์€ optimalํ•œ representation์„ ํ‘œํ˜„ํ•˜๊ธฐ์— ์ถฉ๋ถ„ํ•˜์ง€ ์•Š๋Š”๋‹ค. ๋‹จ๋ฐฉํ–ฅ ์ถ”์ฒœ์‹œ์Šคํ…œ์€ ์˜ค๋กœ์ง€ ๊ณผ๊ฑฐ ๊ธฐ๋ก๋งŒ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์ถ”์ฒœ์„ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ํ‘œํ˜„๋ ฅ์ด ์ œํ•œ๋œ๋‹ค. Real-world์—์„œ ์œ ์ €์˜ ํ–‰๋™์€ ๋ฐ˜๋“œ์‹œ ์ˆœ์„œ์— ๋”ฐ๋ผ rigidํ•˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์ด๋ผ๊ณ  ๋…ผ๋ฌธ์€ ์„ค๋ช…ํ•œ๋‹ค. ๋”ฐ๋ผ์„œ ์ด๋Ÿฌํ•œ ํ•œ๊ณ„๋ฅผ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด ์–‘๋ฐฉํ–ฅ (bidirectinal)์„ ๊ณ ๋ คํ•˜๋Š” ์ถ”์ฒœ์‹œ์Šคํ…œ ๋ชจ๋ธ์ด ํ•„์š”ํ•˜๋‹ค.

 

๋ณธ ๋…ผ๋ฌธ์€ random mask๋ฅผ ์ ์šฉํ•˜๋Š” Cloze task๋ฅผ ํ†ตํ•ด ์•„์ดํ…œ๋“ค์˜ surrounding context๋ฅผ ์ž˜ ํ‘œํ˜„ํ•œ๋‹ค๊ณ  ์–ธ๊ธ‰ํ•œ๋‹ค. ์ด๋Ÿฌํ•œ Cloze objective๋Š” multiple epochs์ด ๋  ์ˆ˜๋ก ๋” powerfulํ•œ ๋ชจ๋ธ์ด ๋œ๋‹ค. ๋˜ํ•œ BERT4Rec์€ ๋งˆ์ง€๋ง‰ input sequence๋ฅผ [mask]๋กœ ๋‘์–ด์„œ ๋‹ค์Œ ์•„์ดํ…œ ์˜ˆ์ธกํ•˜๋Š” ํ…Œ์Šคํฌ๋ฅผ ์ง„ํ–‰ํ•œ๋‹ค.

 

 

2. RELATED WORK

2.1 General Recommendation

- ์ „ํ†ต์ ์ธ ์ถ”์ฒœ์‹œ์Šคํ…œ์ธ CF, MF

- Mentions deep-learning-based recommenders such as RBM and NCF

2.2 Sequential Recommendation

- Reviews earlier unidirectional sequential models

- GRU4Rec and SASRec also serve as baselines in the performance comparison.

2.3 Attention Mechanism

- ๋ณธ ๋…ผ๋ฌธ์„ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” Transfomer์™€ BERT์— ๋Œ€ํ•œ ์„ ํ–‰ ์ง€์‹์ด ํ•„์š”ํ•˜๋‹ค. Transfomer ๊ด€๋ จ ๋ธ”๋กœ๊ทธ ํฌ์ŠคํŒ…๊ณผ BERT ๋…ผ๋ฌธ ๋งํฌ์ด๋‹ค.

 

3. BERT4REC

3.1 Problem Statement

- Notation and the modeling formulation:

Users -> $\textbf{u} = \{u_1, u_2, \dots, u_{|\textbf{u}|}\}$

Items -> $\textbf{v} = \{v_1, v_2, \dots, v_{|\textbf{v}|}\}$

 

For a user $u \in \textbf{u}$, the item that $u$ interacted with at time step $t$ is:

$$ v^{(u)}_t \in \textbf{v} $$

 

์œ ์ €-์•„์ดํ…œ ์ƒํ˜ธ์ž‘์šฉ ์‹œํ€€์Šค -> $\textbf{S}_u = [{v^{(u)}_1,..., v^{(u)}_t,...,v^{(u)}_{|n_u|}}]$

 

u์˜ interaction sequence์˜ ๊ธธ์ด๊ฐ€ $n_u$์ผ๋•Œ, ๋ชจ๋ธ์€ $n_{u+1}$์‹œ์ ์˜ ์•„์ดํ…œ์„ ์˜ˆ์ธกํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

๋”ฐ๋ผ์„œ  ์•„๋ž˜ ์‹์ด ํ™•๋ฅ  ๋ชจ๋ธ๋ง ์‹์ด ๋œ๋‹ค.

 

$p(v^{(u)}_{n_{u+1}} = v | \textbf{S}_u)$

 

3.2 Model Architecture

BERT4REC์˜ Transformer Layer์™€ model architecture

# Embedding Layer

- Embedding Layer์—์„œ๋Š” input sequence์— ๋Œ€ํ•ด embedding๊ณผ positional encoding์„ ๋”ํ•ด์ค€๋‹ค.

bottom embedding -> $h^0_i  =  v_i + p_i$

 

# Transformer Layer

- A Transformer layer consists of two sub-layers: Multi-Head Self-Attention and a Position-wise Feed-Forward network. Both sub-layers have the same structure as the Transformer layers in BERT.

 

$h^l_i$ : the hidden representation of position $i$ at layer $l$

With $h^l_i \in \mathbb{R}^d$, stacking the $h^l_i$ of all positions into a matrix gives $H^l \in \mathbb{R}^{t \times d}$.

 

< Multi-Head Attention >

$$\text{MH}(H^l) = [\text{head}_1; \text{head}_2; \dots; \text{head}_h]\,W^O$$

$$\text{head}_i = \text{Attention}(H^l W^Q_i,\ H^l W^K_i,\ H^l W^V_i)$$
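A minimal NumPy sketch of this multi-head attention (single sequence, no attention masking or dropout; the weight shapes and the list-of-matrices representation of the per-head projections are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(H, Wq, Wk, Wv, Wo):
    """MH(H) = [head_1; ...; head_h] W^O with scaled dot-product heads.

    H: (t, d); Wq/Wk/Wv: lists of h matrices of shape (d, d/h); Wo: (d, d).
    """
    d_k = Wq[0].shape[1]  # per-head dimension
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):
        Q, K, V = H @ Wq_i, H @ Wk_i, H @ Wv_i
        A = softmax(Q @ K.T / np.sqrt(d_k))  # (t, t) attention weights
        heads.append(A @ V)                  # (t, d/h)
    return np.concatenate(heads, axis=-1) @ Wo
```

Because the attention weights at every position attend over the whole sequence, each item's representation conditions on both its left and right context, which is the bidirectional property emphasized above.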

 

< Position-wise Feed-Forward >

- A two-layer feed-forward network with a GELU activation, applied to each position separately and identically:

$$\text{FFN}(x) = \text{GELU}(x W^{(1)} + b^{(1)})\,W^{(2)} + b^{(2)}$$

# Stacking Transformer Layer

- BERT4Rec์€ L๊ฐœ์˜ Transformer Layer๋“ค์„ ์Œ“์•„ ์˜ฌ๋ฆฐ ๊ตฌ์กฐ์ด๋‹ค.

- ์ผ๋ฐ˜์ ์ธ transformer layer ๊ตฌ์กฐ์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๋ชจ๋ธ ํ•™์Šต์„ ์œ„ํ•ด residual connection๊ณผ LN(Layer Nomalization)์„ ์ง„ํ–‰ํ•œ๋‹ค.

 

# Output Layer

- The wording of this passage is good, so it is quoted verbatim:

- After $L$ layers that hierarchically exchange information across all positions in the previous layer, we get the final output $H^L$ for all items of the input sequence.

The probability of the [mask]ed item $v$ is obtained with a softmax over the item set:

$$P(v) = \text{softmax}\big(\text{GELU}(h^L_t W^P + b^P)\,E^\top + b^O\big)$$

 

$W^P$ is a learnable projection matrix, and $b^P$, $b^O$ are bias terms.

$E$ denotes the embedding matrix of the item set $\textbf{v}$, which BERT4Rec shares with the input embedding layer.
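A sketch of this output computation in NumPy (shapes and random weights are illustrative; `gelu` uses the common tanh approximation):

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def output_distribution(h_t, Wp, bp, E, bo):
    """P(v) = softmax(GELU(h_t W^P + b^P) E^T + b^O) over the item set.

    E is the same embedding matrix used in the input layer (weight sharing).
    """
    return softmax(gelu(h_t @ Wp + bp) @ E.T + bo)
```

Multiplying by $E^\top$ scores the hidden state against every item embedding, so the output distribution directly reuses the input item representations.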

 

3.3 Model Learning

- Conventional unidirectional networks are trained to predict the next item:

- [ input: $v_1, \dots, v_t$ // output: $v_{t+1}$ ]

 

- BERT4Rec์€ Cloze Task ๋ฐฉ์‹์œผ๋กœ Masked Language modeling ํ•™์Šต์„ ์ง„ํ–‰ํ•œ๋‹ค.

 

- ์ตœ์ข…์ ์œผ๋กœ ๊ฐ masked input์˜ loss,  $S'_u$ ๋ฅผ negative log-likelihood๋กœ ์ •์˜ํ•  ์ˆ˜ ์žˆ๋‹ค.

- At test time, a [mask] is appended to the end of the user's behavior sequence, and the model predicts the next item at that position.
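A sketch of the test-time input construction (the left-padding and truncation conventions and the `pad_id` default are my assumptions; the paper only specifies appending [mask] to the end of the sequence):

```python
def make_test_input(history, max_len, mask_id, pad_id=0):
    """Append [mask] after the last behavior, then left-truncate / left-pad
    so the model always sees exactly max_len positions."""
    seq = list(history) + [mask_id]
    seq = seq[-max_len:]                       # keep the most recent items
    return [pad_id] * (max_len - len(seq)) + seq
```

The model's prediction for the final [mask] position is then ranked over all items to produce the recommendation.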

 

4. EXPERIMENTS

  • Datasets: the following four datasets are used:
    Amazon Beauty / Steam / MovieLens-1m / MovieLens-20m
  • Evaluation Metrics:
    Hit Ratio (HR@k) / NDCG@k / MRR
  • Baselines & Implementation:
    BERT4Rec is compared against a range of baselines, including the unidirectional models above.
  • For the remaining experimental details, see the paper.
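The three metrics for a single test case can be sketched as follows. Each takes the 1-based rank of the ground-truth item under the model's scores; the function names are my own, and ties in scores are broken by index order:

```python
import numpy as np

def rank_of_target(scores, target):
    """1-based rank of the target item when items are sorted by score (desc)."""
    order = np.argsort(-np.asarray(scores))
    return int(np.where(order == target)[0][0]) + 1

def hit_ratio_at_k(rank, k):
    """1 if the true item appears in the top-k list, else 0."""
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank, k):
    """Discounted gain of the true item's position within the top-k list."""
    return 1.0 / np.log2(rank + 1) if rank <= k else 0.0

def mrr(rank):
    """Reciprocal rank of the true item."""
    return 1.0 / rank
```

Averaging these per-case values over all test users gives the reported HR@k, NDCG@k, and MRR.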

 

๋ฐ˜์‘ํ˜•