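[Code Review] Implementing the NCF Model in PyTorch (without pre-training)

This post reviews a PyTorch implementation of the model from the NCF paper that I covered in an earlier post.

I ran it on Google Colab with a standard GPU; training takes about 2 minutes per epoch.

The code is based on guoyang9's GitHub repository. The link to the original paper's code is below.

Original paper code: https://github.com/hexiangnan/neural_collaborative_filtering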

The requirements are as follows:
* python==3.6
* pandas==0.24.2
* numpy==1.16.2
* pytorch==1.0.1
* gensim==3.7.1
* tensorboardX==1.6

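In the original paper, the NCF model embeds users and items as latent vectors. The GMF and MLP embeddings are each pretrained, and the output vectors of the two models are then concatenated and passed through the final NeuMF layer. The code released with the original paper was written for Python 2.x and an early version of Keras. To address this, this post reviews model code that skips pretraining and learns each embedding from scratch.

The implementation is in PyTorch and runs in a Colab environment.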

config.py

# dataset name 
dataset = 'ml-1m'

# model name
model = 'NeuMF-end'

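# paths
main_path = './'
# directory for saved models; main.py reads config.model_path (value assumed here)
model_path = './models/'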

train_rating = main_path + '{}.train.rating.txt'.format(dataset)
test_rating = main_path + '{}.test.rating.txt'.format(dataset)
test_negative = main_path + '{}.test.negative.txt'.format(dataset)

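This file sets the dataset, the model name, and the file paths.

The dataset is available in the original paper's GitHub repository. We use the ml-1m dataset, which also ships with a negative-sample file for the test data.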

train.rating.txt: userID\t itemID\t rating\t timestamp
test.negative.txt: (userID,itemID)\t negativeItemID1\t negativeItemID2 ...negativeItemID99
Each line consists of one positive sample followed by a list of 99 negative samples.
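
As a rough illustration of the test.negative.txt layout, the sketch below parses one made-up line into [user, item] pairs with the positive item first. The IDs are invented for the example, and only 3 negatives are shown instead of 99.

```python
import ast

# A made-up line in the test.negative.txt layout (real lines carry 99 negatives).
line = "(0,25)\t1064\t174\t2791\n"

fields = line.strip().split("\t")
user, pos_item = ast.literal_eval(fields[0])   # "(0,25)" -> (0, 25)
neg_items = [int(x) for x in fields[1:]]

# Positive pair first, then one pair per negative item.
pairs = [[user, pos_item]] + [[user, j] for j in neg_items]
print(pairs)
```

With the real file, each line would expand to 100 pairs for a single user, which is why the test DataLoader later uses a batch size of test_num_ng + 1.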

data_utils.py

import numpy as np 
import pandas as pd 
import scipy.sparse as sp

import torch.utils.data as data

import config


def load_all(test_num=100):
	""" We load all the three file here to save time in each epoch. """
	train_data = pd.read_csv(
		config.train_rating, 
		sep='\t', header=None, names=['user', 'item'], 
		usecols=[0, 1], dtype={0: np.int32, 1: np.int32})

	user_num = train_data['user'].max() + 1
	item_num = train_data['item'].max() + 1

	train_data = train_data.values.tolist()

	# load ratings as a dok matrix
	train_mat = sp.dok_matrix((user_num, item_num), dtype=np.float32)
	for x in train_data:
		train_mat[x[0], x[1]] = 1.0

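	test_data = []
	with open(config.test_negative, 'r') as fd:
		for line in fd:
			if not line.strip():
				continue
			arr = line.strip().split('\t')
			# arr[0] looks like "(user,item)": the positive test pair
			u, pos_item = (int(v) for v in arr[0].strip('()').split(','))
			test_data.append([u, pos_item])
			# the remaining fields are the negative item IDs for this user
			for i in arr[1:]:
				test_data.append([u, int(i)])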
	return train_data, test_data, user_num, item_num, train_mat

The load_all() function returns five values.

train_data: NCF uses only user-item interaction data, so only the user and item columns are loaded from the train rating file.
test_data: the data used for evaluation. For each user it holds the positive test item followed by the 99 negative items, stored as [user_id, item_id] pairs.
user_num: the maximum user index + 1
item_num: the maximum item index + 1
train_mat: the user-item interaction sparse matrix, stored in dictionary-of-keys (DOK) format.

class NCFData(data.Dataset):
	def __init__(self, features, 
				num_item, train_mat=None, num_ng=0, is_training=None):
		super(NCFData, self).__init__()
		""" Note that the labels are only useful when training, we thus 
			add them in the ng_sample() function.
		"""
		self.features_ps = features
		self.num_item = num_item
		self.train_mat = train_mat
		self.num_ng = num_ng
		self.is_training = is_training
		self.labels = [0 for _ in range(len(features))]

	def ng_sample(self):
		assert self.is_training, 'no need to sampling when testing'

		self.features_ng = []
		for x in self.features_ps:
			u = x[0]
			for t in range(self.num_ng):
				j = np.random.randint(self.num_item)
				while (u, j) in self.train_mat:
					j = np.random.randint(self.num_item)
				self.features_ng.append([u, j])

		labels_ps = [1 for _ in range(len(self.features_ps))]
		labels_ng = [0 for _ in range(len(self.features_ng))]

		self.features_fill = self.features_ps + self.features_ng
		self.labels_fill = labels_ps + labels_ng

	def __len__(self):
		return (self.num_ng + 1) * len(self.labels)

	def __getitem__(self, idx):
		features = self.features_fill if self.is_training \
					else self.features_ps
		labels = self.labels_fill if self.is_training \
					else self.labels

		user = features[idx][0]
		item = features[idx][1]
		label = labels[idx]
		return user, item ,label

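Next, we define a dataset so that a DataLoader can be built on top of it.

NCFData inherits from Dataset in torch.utils.data.

The 0/1 labels are needed only during training.
ng_sample() adds num_ng negative samples for each positive pair (num_ng is set to 4 in this run).
train_mat, a dok_matrix, is used to check whether a sampled (user, item) pair is an observed interaction; if it is, a new item is drawn.
__getitem__ returns three values: the user index, the item index, and the label (0 or 1).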
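
The negative sampling step can be sketched in isolation as below. This is a minimal sketch, not the code from data_utils.py: the function name sample_negatives, the toy data, and the plain Python set used in place of the dok_matrix are all assumptions made for illustration.

```python
import random

def sample_negatives(positives, num_item, num_ng, seen, rng):
    """For each positive [user, item] pair, draw num_ng unseen items for that user."""
    negatives = []
    for user, _ in positives:
        for _ in range(num_ng):
            j = rng.randrange(num_item)
            while (user, j) in seen:      # reject observed interactions
                j = rng.randrange(num_item)
            negatives.append([user, j])
    return negatives

# Toy data (hypothetical): 2 users, 6 items.
positives = [[0, 1], [0, 3], [1, 2]]
seen = {(u, i) for u, i in positives}     # stand-in for train_mat
negatives = sample_negatives(positives, num_item=6, num_ng=4,
                             seen=seen, rng=random.Random(0))

features = positives + negatives
labels = [1] * len(positives) + [0] * len(negatives)
print(len(features))   # (num_ng + 1) * len(positives) = 15
```

Because ng_sample() is called at the start of every epoch, each epoch sees a freshly drawn set of negatives.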

model.py

import torch
import torch.nn as nn
import torch.nn.functional as F 


class NCF(nn.Module):
	def __init__(self, user_num, item_num, factor_num, num_layers, dropout, model, GMF_model=None, MLP_model=None):
		super(NCF, self).__init__()
		self.dropout = dropout
		self.model = model
		self.GMF_model = GMF_model
		self.MLP_model = MLP_model

		self.embed_user_GMF = nn.Embedding(user_num, factor_num)
		self.embed_item_GMF = nn.Embedding(item_num, factor_num)
		self.embed_user_MLP = nn.Embedding(
				user_num, factor_num * (2 ** (num_layers - 1)))
		self.embed_item_MLP = nn.Embedding(
				item_num, factor_num * (2 ** (num_layers - 1)))

		MLP_modules = []
		for i in range(num_layers):
			input_size = factor_num * (2 ** (num_layers - i))
			MLP_modules.append(nn.Dropout(p=self.dropout))
			MLP_modules.append(nn.Linear(input_size, input_size//2))
			MLP_modules.append(nn.ReLU())
		self.MLP_layers = nn.Sequential(*MLP_modules)

		predict_size = factor_num * 2

		self.predict_layer = nn.Linear(predict_size, 1)
  
		self._init_weight_()

	def _init_weight_(self):
		# embeddings: small Gaussian noise
		nn.init.normal_(self.embed_user_GMF.weight, std=0.01)
		nn.init.normal_(self.embed_user_MLP.weight, std=0.01)
		nn.init.normal_(self.embed_item_GMF.weight, std=0.01)
		nn.init.normal_(self.embed_item_MLP.weight, std=0.01)

		# MLP tower and prediction layer
		for m in self.MLP_layers:
			if isinstance(m, nn.Linear):
				nn.init.xavier_uniform_(m.weight)
		nn.init.kaiming_uniform_(self.predict_layer.weight, 
								a=1, nonlinearity='sigmoid')

		# all linear biases start at zero
		for m in self.modules():
			if isinstance(m, nn.Linear) and m.bias is not None:
				m.bias.data.zero_()

	def forward(self, user, item):
		# GMF branch: element-wise product of user and item embeddings
		embed_user_GMF = self.embed_user_GMF(user)
		embed_item_GMF = self.embed_item_GMF(item)
		output_GMF = embed_user_GMF * embed_item_GMF

		# MLP branch: concatenated embeddings through the MLP tower
		embed_user_MLP = self.embed_user_MLP(user)
		embed_item_MLP = self.embed_item_MLP(item)
		interaction = torch.cat((embed_user_MLP, embed_item_MLP), -1)
		output_MLP = self.MLP_layers(interaction)

		concat = torch.cat((output_GMF, output_MLP), -1)

		# raw score (logit); no sigmoid is applied here
		prediction = self.predict_layer(concat)
		return prediction.view(-1)

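The NCF model inherits from nn.Module. Because no pretrained models are used, GMF_model and MLP_model are passed as None.

GMF and MLP each learn their own user and item embeddings.
The MLP module gets num_layers blocks of Dropout, Linear, and ReLU, and each Linear layer halves the width.
To help training, the _init_weight_() function initializes each weight.

The forward() function runs the NCF computation step by step.

After embedding, the GMF and MLP outputs are concatenated.
concat = torch.cat((output_GMF, output_MLP), -1)
The concatenated vector is passed through the final output layer to produce the prediction.
prediction = self.predict_layer(concat)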
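
To see how the layer widths fit together, the sketch below recomputes them with the same arithmetic as the NCF constructor. The helper name ncf_layer_sizes is hypothetical; it only mirrors the size formulas and builds no model.

```python
def ncf_layer_sizes(factor_num, num_layers):
    """Return the widths implied by the size formulas in the NCF constructor."""
    mlp_embed_dim = factor_num * (2 ** (num_layers - 1))    # per user / per item
    mlp_layers = []
    for i in range(num_layers):
        input_size = factor_num * (2 ** (num_layers - i))
        mlp_layers.append((input_size, input_size // 2))     # Linear(in, out)
    gmf_out = factor_num
    mlp_out = mlp_layers[-1][1]
    return mlp_embed_dim, mlp_layers, gmf_out + mlp_out

embed_dim, layers, predict_in = ncf_layer_sizes(factor_num=32, num_layers=3)
print(embed_dim)    # 128
print(layers)       # [(256, 128), (128, 64), (64, 32)]
print(predict_in)   # 64
```

With the defaults (factor_num=32, num_layers=3), the MLP branch ends at width 32, the same as the GMF branch, so the prediction layer receives factor_num * 2 = 64 inputs.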

evaluate.py

import numpy as np
import torch


def hit(gt_item, pred_items):
	if gt_item in pred_items:
		return 1
	return 0


def ndcg(gt_item, pred_items):
	if gt_item in pred_items:
		index = pred_items.index(gt_item)
		return np.reciprocal(np.log2(index+2))
	return 0


def metrics(model, test_loader, top_k):
	HR, NDCG = [], []

	for user, item, label in test_loader:
		user = user.cuda()
		item = item.cuda()

		predictions = model(user, item)
		_, indices = torch.topk(predictions, top_k)
		recommends = torch.take(
				item, indices).cpu().numpy().tolist()

		gt_item = item[0].item()
		HR.append(hit(gt_item, recommends))
		NDCG.append(ndcg(gt_item, recommends))

	return np.mean(HR), np.mean(NDCG)

Hit ratio (HR) and NDCG are used as the evaluation metrics.
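
As a small worked example of the two metrics, the sketch below re-implements hit() and ndcg() with the standard library and applies them to a made-up recommendation list. The item IDs are invented for illustration.

```python
import math

def hit(gt_item, pred_items):
    return 1 if gt_item in pred_items else 0

def ndcg(gt_item, pred_items):
    if gt_item in pred_items:
        rank = pred_items.index(gt_item)      # 0-based position in the list
        return 1.0 / math.log2(rank + 2)
    return 0.0

recommends = [42, 7, 13, 99]   # hypothetical top-4 list
print(hit(13, recommends), ndcg(13, recommends))   # 1 0.5
print(hit(42, recommends), ndcg(42, recommends))   # 1 1.0
print(hit(5, recommends), ndcg(5, recommends))     # 0 0.0
```

HR only asks whether the ground-truth item made it into the list, while NDCG also rewards a higher position: first place scores 1.0 and third place scores 1 / log2(4) = 0.5.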

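The metrics() function evaluates the model on the data from test_loader. Each batch holds one user's candidates, with the ground-truth item first, which is why gt_item is read from item[0].

torch.topk selects the indices of the top_k highest-scoring candidates. torch.take then maps those indices to item IDs, and tolist() turns the result into the recommends list.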
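
The topk and take step can be tried on a toy batch as below. The scores and item IDs are made up, and the batch has 5 candidates instead of 100.

```python
import torch

# One hypothetical test batch: the ground-truth item comes first, then negatives.
item = torch.tensor([50, 11, 23, 37, 48])
predictions = torch.tensor([0.8, 0.1, 0.9, 0.3, 0.5])   # made-up model scores

_, indices = torch.topk(predictions, 3)       # positions of the 3 highest scores
recommends = torch.take(item, indices).tolist()

gt_item = item[0].item()
print(recommends)              # [23, 50, 48]
print(gt_item in recommends)   # True
```

Here the ground-truth item 50 has the second-highest score, so it counts as a hit at rank index 1.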

main.py

import os
import time
import argparse
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import torch.backends.cudnn as cudnn
from tensorboardX import SummaryWriter

import model
import config
import evaluate
import data_utils


parser = argparse.ArgumentParser()
parser.add_argument("--lr", 
	type=float, 
	default=0.001, 
	help="learning rate")
parser.add_argument("--dropout", 
	type=float,
	default=0.0,  
	help="dropout rate")
parser.add_argument("--batch_size", 
	type=int, 
	default=256, 
	help="batch size for training")
parser.add_argument("--epochs", 
	type=int,
	default=20,  
	help="training epoches")
parser.add_argument("--top_k", 
	type=int, 
	default=10, 
	help="compute metrics@top_k")
parser.add_argument("--factor_num", 
	type=int,
	default=32, 
	help="predictive factors numbers in the model")
parser.add_argument("--num_layers", 
	type=int,
	default=3, 
	help="number of layers in MLP model")
parser.add_argument("--num_ng", 
	type=int,
	default=4, 
	help="sample negative items for training")
parser.add_argument("--test_num_ng", 
	type=int,
	default=99, 
	help="sample part of negative items for testing")
parser.add_argument("--out", 
	default=True,
	help="save model or not")
parser.add_argument("--gpu", 
	type=str,
	default="0",  
	help="gpu card ID")
args = parser.parse_args()

os.environ["CUDA_VISIBLE_DEVICES"] = args.gpu
cudnn.benchmark = True

Because the code runs as a .py script, the hyperparameters are passed as command-line arguments.

Example invocation: python main.py --batch_size=256 --lr=0.001 --factor_num=16
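
To check how such a command is interpreted, the flags can be handed to parse_args as a list. The sketch below declares only a subset of the arguments from main.py, with the same defaults.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.001)
parser.add_argument("--batch_size", type=int, default=256)
parser.add_argument("--factor_num", type=int, default=32)
parser.add_argument("--num_layers", type=int, default=3)

# Same flags as the example command, passed as a list instead of via the shell.
args = parser.parse_args(["--batch_size=256", "--lr=0.001", "--factor_num=16"])
print(args.factor_num)   # 16 (overrides the default of 32)
print(args.num_layers)   # 3  (default, not given on the command line)
```

Any argument left off the command line keeps its default, so the example run uses factor_num=16 with three MLP layers.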

############################## PREPARE DATASET ##########################
train_data, test_data, user_num ,item_num, train_mat = data_utils.load_all()

# construct the train and test datasets
train_dataset = data_utils.NCFData(
		train_data, item_num, train_mat, args.num_ng, True)
test_dataset = data_utils.NCFData(
		test_data, item_num, train_mat, 0, False)
train_loader = data.DataLoader(train_dataset,
		batch_size=args.batch_size, shuffle=True, num_workers=4)
test_loader = data.DataLoader(test_dataset,
		batch_size=args.test_num_ng+1, shuffle=False, num_workers=0)

########################### CREATE MODEL #################################
GMF_model = None
MLP_model = None

model = model.NCF(user_num, item_num, args.factor_num, args.num_layers, 
						args.dropout, config.model, GMF_model, MLP_model)
model.cuda()
loss_function = nn.BCEWithLogitsLoss()

optimizer = optim.Adam(model.parameters(), lr=args.lr)

# writer = SummaryWriter() # for visualization

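When pretrained models are used, as in the original paper, GMF_model and MLP_model would be loaded here.

This code trains end to end, so both are set to None.

Because the labels are binary (0 or 1), a binary cross-entropy loss (nn.BCEWithLogitsLoss) is used. The optimizer is Adam.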
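
Since forward() returns a raw score without a sigmoid, the loss has to apply the sigmoid itself, which is what nn.BCEWithLogitsLoss does. The sketch below writes out the per-sample formula in plain Python to show the idea. It is a simplified illustration, not PyTorch's numerically stable implementation, and the function name bce_with_logits is made up.

```python
import math

def bce_with_logits(logit, label):
    """Binary cross-entropy on a raw score: the sigmoid is applied inside the loss."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

print(round(bce_with_logits(0.0, 1.0), 4))                      # 0.6931, i.e. ln 2
print(bce_with_logits(3.0, 1.0) < bce_with_logits(3.0, 0.0))    # True
```

A logit of 0 means the model is undecided, giving a loss of ln 2. A confident positive score is cheap when the label is 1 and expensive when the label is 0.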

########################### TRAINING #####################################
count, best_hr = 0, 0
for epoch in range(args.epochs):
	model.train() # Enable dropout (if have).
	start_time = time.time()
	train_loader.dataset.ng_sample()

	for user, item, label in train_loader:
		user = user.cuda()
		item = item.cuda()
		label = label.float().cuda()

		model.zero_grad()
		prediction = model(user, item)
		loss = loss_function(prediction, label)
		loss.backward()
		optimizer.step()
		# writer.add_scalar('data/loss', loss.item(), count)
		count += 1

	model.eval()
	HR, NDCG = evaluate.metrics(model, test_loader, args.top_k)

	elapsed_time = time.time() - start_time
	print("The time elapse of epoch {:03d}".format(epoch) + " is: " + 
			time.strftime("%H: %M: %S", time.gmtime(elapsed_time)))
	print("HR: {:.3f}\tNDCG: {:.3f}".format(np.mean(HR), np.mean(NDCG)))

	if HR > best_hr:
		best_hr, best_ndcg, best_epoch = HR, NDCG, epoch
		if args.out:
			if not os.path.exists(config.model_path):
				os.mkdir(config.model_path)
			torch.save(model, 
				'{}{}.pth'.format(config.model_path, config.model))

print("End. Best epoch {:03d}: HR = {:.3f}, NDCG = {:.3f}".format(
									best_epoch, best_hr, best_ndcg))

Finally, this part runs training for the given number of epochs.

I ran the .py file in the Colab environment.

%run main.py --batch_size=256 --lr=0.001 --factor_num=16

The time elapse of epoch 000 is: 00: 02: 13
HR: 0.616	NDCG: 0.351
The time elapse of epoch 001 is: 00: 02: 10
HR: 0.656	NDCG: 0.384
The time elapse of epoch 002 is: 00: 02: 12
HR: 0.671	NDCG: 0.396
The time elapse of epoch 003 is: 00: 02: 09
HR: 0.679	NDCG: 0.400
The time elapse of epoch 004 is: 00: 02: 09
HR: 0.682	NDCG: 0.406
The time elapse of epoch 005 is: 00: 02: 09
HR: 0.687	NDCG: 0.408
The time elapse of epoch 006 is: 00: 02: 07
HR: 0.687	NDCG: 0.412
The time elapse of epoch 007 is: 00: 02: 10
HR: 0.686	NDCG: 0.412
The time elapse of epoch 008 is: 00: 02: 08
HR: 0.693	NDCG: 0.417
The time elapse of epoch 009 is: 00: 02: 09
HR: 0.695	NDCG: 0.416
The time elapse of epoch 010 is: 00: 02: 07
HR: 0.689	NDCG: 0.416
The time elapse of epoch 011 is: 00: 02: 08
HR: 0.686	NDCG: 0.412
The time elapse of epoch 012 is: 00: 02: 07
HR: 0.696	NDCG: 0.420
The time elapse of epoch 013 is: 00: 02: 08
HR: 0.692	NDCG: 0.417
The time elapse of epoch 014 is: 00: 02: 08
HR: 0.691	NDCG: 0.416
The time elapse of epoch 015 is: 00: 02: 07
HR: 0.686	NDCG: 0.413
The time elapse of epoch 016 is: 00: 02: 05
HR: 0.684	NDCG: 0.411
The time elapse of epoch 017 is: 00: 02: 08
HR: 0.688	NDCG: 0.415
The time elapse of epoch 018 is: 00: 02: 06
HR: 0.689	NDCG: 0.419
The time elapse of epoch 019 is: 00: 02: 06
HR: 0.688	NDCG: 0.416
End. Best epoch 012: HR = 0.696, NDCG = 0.420
