[Paper Review] Wide & Deep Learning for Recommender Systems

🍊 Paper link: https://arxiv.org/pdf/1606.07792.pdf

Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, Hemal Shah


  This paper was published by Google in 2016. The Wide & Deep learning it proposes jointly trains a linear model and a deep neural network. As the paper points out, joint training is not the same as an ensemble, where each model is trained separately. A linear model such as regression forms the wide part, and its strength is memorizing historical records.

 

Deep neural networks need less feature engineering than the memorization approach and generalize well to unseen feature combinations. By training the wide linear model and the DNN at the same time, the paper combines the strengths of memorization and generalization in one model to improve performance. Its contribution also lies in evaluating the model on real Google data.


1. INTRODUCTION

  A recommender system can be viewed as a search ranking system: the input query is a set of user and contextual information, and the output is a ranked list of recommended items. The whole model architecture is built on this premise. The paper proposes a model that combines the two strengths of generalization and memorization.

  • Memorization: learns correlations between features from historical data and recommends items that are directly related. However, because it only fits a linear model to past records, it tends to overfit and struggles to produce novel recommendations.
  • Generalization: a DNN maps features through embeddings to latent vectors of the same dimension and learns from products of these dense representations. This reduces the feature engineering burden. It can also make predictions for unseen data and improves diversity. However, it may over-generalize and recommend items that are too general.

  

Memorization is learned from hand-engineered binary features that encode the co-occurrence of a feature pair. The example given in the paper is:

AND(user_installed_app='netflix', impression_app='pandora')

 

์œ ์ €์˜ ๊ณผ๊ฑฐ ํ–‰๋™์ด  ํ‰๊ฐ€(Rating)์— ๋ผ์นœ ์˜ํ–ฅ์— ๋Œ€ํ•ด ์„ค๋ช…๋ ฅ ์žˆ์œผ๋ฉฐ, ๋งค์šฐ Topicalํ•˜๋‹ค. ๋”ฐ๋ผ์„œ ์ƒํ’ˆ์— ์ง์ ‘์ ์œผ๋กœ ๊ด€๋ จ๋œ ์ •๋ณด๋“ค์„ ์ถ”์ฒœํ•˜๋„๋ก ๋„์™€์ค€๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ diversity๊ฐ€ ๋–จ์–ด์ง€๊ณ  feature engineering์„ ํ•„์š”๋กœ ํ•œ๋‹ค๋Š” ๋‹จ์ ์ด ์กด์žฌํ•œ๋‹ค.

2. RECOMMENDER SYSTEM OVERVIEW

  ๋ณธ ๋…ผ๋ฌธ์˜ ๋ชจ๋ธ์€ ์‹ค์ œ๋กœ ์„œ๋น„์Šคํ•˜๋Š” ๋ชจ๋ธ์ด๊ธฐ ๋•Œ๋ฌธ์—, ๊ฐ„๋‹จํ•˜๊ณ  ์‹ค์šฉ์ ์ธ ์‹œ์Šคํ…œ์„ ์ œ์‹œํ•œ๋‹ค. ๋จผ์ € user์™€ contextual features๋ผ๋Š” query๊ฐ€ ๋“ค์–ด์˜ค๋ฉด Retrieval system์€ ์งง์€ ์•„์ดํ…œ ๋ฆฌ์ŠคํŠธ๋ฅผ ๋ฐ˜ํ™˜ํ•œ๋‹ค. ์‹ค์ œ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์—๋Š” ์ˆ˜๋ฐฑ๋งŒ๊ฐœ์˜ ์•ฑ์ด ์žˆ์œผ๋ฏ€๋กœ, ์ด๋ ‡๊ฒŒ machine-learned ๋ชจ๋ธ๊ณผ human-defined rules์„ ํ†ตํ•ด ๋น ๋ฅธ ๋™์ž‘์ด ํ•„์š”ํ•˜๋‹ค. ์ดํ›„ Ranking ์‹œ์Šคํ…œ์€ ๋ชจ๋“  ํ•ญ๋ชฉ์˜ score๋กœ ์ˆœ์œ„๋ฅผ ๋งค๊ธด๋‹ค. ๋™์‹œ์— User actions์€ Logs ๋ฐ์ดํ„ฐ๋กœ ๋“ค์–ด๊ฐ€์„œ ๋ชจ๋ธ ํ•™์Šต์— ๋“ค์–ด๊ฐ€๊ฒŒ ๋œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์€ Ranking ๋ถ€๋ถ„์— ์ดˆ์ ์„ ๋งž์ถ˜๋‹ค.

3. WIDE & DEEP LEARNING

3.1 The Wide Component

  The wide model is a generalized linear model specialized for memorization.

$$ y = w^Tx + b $$

$x = [x_1, x_2, ..., x_d]$ is the vector of $d$ input features, $w = [w_1, w_2, ..., w_d]$ are the model parameters, and $b$ is the bias.

One of the most important feature transformations is the cross-product transformation, defined as:

$$ \phi_k(x)= \prod_{i=1}^d x_i^{c_{ki}}, \quad c_{ki} \in \{0,1\} $$

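Here $c_{ki}$ is 1 if the $i$-th feature is part of the $k$-th transformation $\phi_k$, and 0 otherwise. For binary features, $\phi_k(x)$ is 1 only when every constituent feature is 1, so it expresses the co-occurrence of features as a binary value.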
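A minimal sketch of the cross-product transformation for binary features, using the paper's netflix/pandora example. The function name `cross_product` and the dictionary encoding of one-hot features are assumptions made for illustration. Features with $c_{ki} = 0$ contribute $x_i^0 = 1$, so the code only multiplies the features that belong to the transformation.

```python
from typing import Dict, List


def cross_product(x: Dict[str, int], members: List[str]) -> int:
    """phi_k(x) = prod_i x_i^{c_ki}; `members` lists the features with c_ki = 1."""
    result = 1
    for name in members:
        result *= x.get(name, 0)  # a one-hot feature that is absent counts as 0
    return result


# One-hot encoded binary features for a single impression (toy values).
x = {
    "user_installed_app=netflix": 1,
    "impression_app=pandora": 1,
    "impression_app=chess": 0,
}

and_feature = ["user_installed_app=netflix", "impression_app=pandora"]
other_feature = ["user_installed_app=netflix", "impression_app=chess"]

print(cross_product(x, and_feature))    # 1: both features are active
print(cross_product(x, other_feature))  # 0: the impression is not chess
```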

 

3.2 The Deep Component

  The deep model is a deep neural network that exploits the strength of generalization. Sparse categorical features are mapped to dense, low-dimensional embeddings. According to the paper, the embedding dimension is usually on the order of 10 to 100. Each hidden layer performs the standard neural network computation below.

$$a^{(l+1)} = f(W^{(l)}a^{(l)} + b^{(l)} )$$

$l$ is the layer number and $f$ is the activation function, typically ReLU. $W^{(l)}, a^{(l)}, b^{(l)}$ are the weights, activations, and bias of the $l$-th layer.
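The deep component can be sketched in plain Python as an embedding lookup followed by ReLU layers. Everything here is a toy assumption rather than the paper's configuration: the helper names (`relu`, `dense`, `deep_forward`), the embedding dimension of 2, the single hidden layer, and all numeric values.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    return [max(0.0, z) for z in v]


def dense(W: Matrix, a: Vector, b: Vector) -> Vector:
    """Compute W a + b for one layer."""
    return [sum(w * x for w, x in zip(row, a)) + bias for row, bias in zip(W, b)]


def deep_forward(
    ids: List[int],
    tables: List[Matrix],
    layers: List[Tuple[Matrix, Vector]],
) -> Vector:
    """Look up one embedding per categorical feature, concatenate, apply ReLU layers."""
    a = [value for table, idx in zip(tables, ids) for value in table[idx]]
    for W, b in layers:
        a = relu(dense(W, a, b))  # a^(l+1) = f(W^(l) a^(l) + b^(l))
    return a


# Two categorical features with vocabulary sizes 2 and 3, embedding dimension 2.
tables = [
    [[0.1, -0.2], [0.4, 0.3]],
    [[-0.5, 0.2], [0.0, 0.1], [0.3, 0.3]],
]
# One hidden layer mapping the 4-dimensional concatenation to 3 units.
layers = [
    (
        [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]],
        [0.0, 0.0, -1.0],
    ),
]

a_final = deep_forward([1, 2], tables, layers)
print(a_final)  # three non-negative activations, roughly 0.1, 0.6 and 0.3
```

In a real model the embedding tables and layer weights are learned during training. Here they are fixed so that the forward pass is easy to follow.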

 

 

 

3.3 Joint Training of Wide & Deep Model

  The Wide & Deep model is a weighted sum of the outputs of the wide part and the deep part. The two parts are trained at the same time, and gradients are backpropagated into both parts simultaneously. This differs from an ensemble, where the models are trained separately. The final prediction of the model is:

$$ p(Y=1|x) = \sigma(w^{T}_{wide}[x,\phi(x)]+w^{T}_{deep}a^{(l_{f})}+b) $$

$p(Y=1|x)$ is the probability that the app is downloaded. The raw features $x$ and the cross-product features $\phi(x)$ are multiplied by the wide weights $w_{wide}$, and this term is added to the deep term. $a^{(l_f)}$ is the output of the final activation in the deep network, which is multiplied by the weights $w_{deep}$. $b$ is the bias and $\sigma$ is the sigmoid function.
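The prediction formula and the idea of joint training can be sketched as follows. This is a simplified illustration: it uses plain SGD on the logistic loss, whereas the paper trains the wide part with FTRL plus L1 regularization and the deep part with AdaGrad. For brevity it only updates the output weights $w_{wide}$, $w_{deep}$ and $b$. In the full model the same error signal is also backpropagated into the hidden layers and embeddings. The names `predict` and `joint_sgd_step` and all numbers are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(w_wide: Vector, wide_x: Vector, w_deep: Vector, a_final: Vector, b: float) -> float:
    """p(Y=1|x) = sigmoid(w_wide^T [x, phi(x)] + w_deep^T a^(l_f) + b)."""
    return sigmoid(dot(w_wide, wide_x) + dot(w_deep, a_final) + b)


def joint_sgd_step(
    w_wide: Vector, wide_x: Vector, w_deep: Vector, a_final: Vector,
    b: float, y: int, lr: float = 0.1,
) -> Tuple[Vector, Vector, float]:
    """One SGD step on the logistic loss; both parts share the same error signal."""
    err = predict(w_wide, wide_x, w_deep, a_final, b) - y  # d(loss)/d(logit)
    new_w_wide = [w - lr * err * x for w, x in zip(w_wide, wide_x)]
    new_w_deep = [w - lr * err * a for w, a in zip(w_deep, a_final)]
    return new_w_wide, new_w_deep, b - lr * err


wide_x = [1.0, 0.0, 1.0]   # [x, phi(x)]: raw binary features plus a cross feature
a_final = [0.1, 0.6, 0.3]  # final activations of the deep part
w_wide, w_deep, b = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], 0.0

p_before = predict(w_wide, wide_x, w_deep, a_final, b)
w_wide, w_deep, b = joint_sgd_step(w_wide, wide_x, w_deep, a_final, b, y=1)
p_after = predict(w_wide, wide_x, w_deep, a_final, b)
print(p_before, p_after)  # starts at 0.5 and moves toward the label y = 1
```

Because a single error term drives both updates, the wide part only has to correct what the deep part gets wrong, which is the practical difference from training two models separately and averaging them.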

 

4. SYSTEM IMPLEMENTATIONS

  See the paper.

 

5. EXPERIMENT RESULTS

  ๋ณธ ๋…ผ๋ฌธ์€ ๋ชจ๋ธ ์„ฑ๋Šฅ ๊ฒ€์ฆ์„ ์œ„ํ•ด ์‹ค์ œ ๊ตฌ๊ธ€ ํ”Œ๋ ˆ์ด ์Šคํ† ์–ด ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค. ์‹ค์ œ ์˜จ๋ผ์ธ ๊ฑฐ๋Œ€ ๋ฐ์ดํ„ฐ์— ๋ชจ๋ธ์„ ์ ์šฉํ•˜์—ฌ ์‹คํ—˜์„ ํ–ˆ๋‹ค๋Š” ์ ์„ ์ €์ž๋“ค์€ contribution์œผ๋กœ ์–ธ๊ธ‰ํ•œ๋‹ค. 3์ฃผ๊ฐ„ ์˜จ๋ผ์ธ A/B ํ…Œ์ŠคํŠธ๋กœ ์‹คํ—˜์„ ์ง„ํ–‰ํ–ˆ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ Wide & deep ๋ชจ๋ธ์ด ๋‹จ์ผ ๋ชจ๋ธ์— ๋น„ํ•ด ๋†’์€ ๋‹ค์šด๋กœ๋“œ ์ฆ๊ฐ€์œจ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค. ์˜คํ”„๋ผ์ธ ํ…Œ์ŠคํŠธ์—์„œ๋„ ๋†’์€ AUC ๋ฅผ ๋ณด์—ฌ์ค€๋‹ค.

 

๋ฐ˜์‘ํ˜•