To SVD or not to SVD [a primer on fair evaluation of recommender systems]


In this tutorial we will go through a full cycle of model tuning and evaluation to perform a fair comparison of recommendation algorithms with Polara.

This will include 2 phases: grid-search for finding (almost) optimal values of hyper-parameters and verification of results via 5-fold cross-validation. We will compare a popular ALS-based matrix factorization (MF) model called Weighted Regularized Matrix Factorization (WRMF a.k.a. iALS) [Hu, 2008] with much simpler SVD-based models.

We will use a standard sparse implementation of SVD from Scipy for the latter and a great library called implicit for iALS. Both are wrapped by Polara and can be accessed via the corresponding recommender model classes. Due to its practicality the implicit library is often recommended to beginners and sometimes even serves as a default tool in production. On the other hand, there are some important yet often overlooked features, which make SVD-based models stand out. Ignoring them in my opinion leads to certain misconceptions and myths, not to say that it also overcomplicates things quite a bit.

Note that by saying SVD I really mean Singular Value Decomposition, not just an arbitrary matrix factorization. In that sense, methods like FunkSVD, SVD++, SVDFeature, etc., are not SVD-based at all, even though historically they use SVD acronym in their names and are often referenced as if they are real substitutes for SVD. These methods utilize another optimization algorithm, typically based on stochastic gradient descent, and do not preserve the algebraic properties of SVD. This is really an important distinction, especially in the view of the following remarks:

  1. SVD-based approach has a number of unique and beneficial properties. To name a few, it produces stable and determenistic output with global guarantees (can be critical for non-regression testing). It admits the same prediction formula for both known and previously unseen users as long as at least one user rating is known (this is especially handy for online regime). It requires minimal tuning and allows to compute and store a single latent feature matrix - either for users or for items - instead of computing and storing both of them. This luxury is not available in the majority of other matrix factorization approaches, definitely not in popular ones.
  2. Thanks to the Lanczos procedure, computational complexity of truncated SVD scales linearly with the number of known observations and with the number of users/items. It scales quadratically only with the rank of decomposition. There are open source implementations, allowing to handle nearly billion-scale problems with one of its efficient randomized versions.
  3. At least in some cases the simplest possible model called PureSVD outperforms other more sophisticated matrix factorization methods [Cremonesi, 2010].
  4. Moreover, PureSVD can be quite easily tuned to perform much better [Nikolakopoulos, 2019].
  5. Finally, it can take a hybrid form to include side information via the generalized formulation (see Chapter 6 of my thesis). Even without hybridization it can be quite successfully applied in the cold start regime [Frolov, 2019].

Despite that impresisve list, SVD-based models rarely get into the list of baselines to compare or to start with.

Hence, this tutorial also aims at performing an assessment of the default choice of many practitioners to see whether it really stays advantageous over the simpler SVD-based approach after a thorough tuning of both models.


  • [Hu, 2008] Hu Y., Koren, Y. and Volinsky, C., 2008. Collaborative Filtering for Implicit Feedback Datasets. In ICDM (Vol. 8, pp. 263-272). Link.
  • [Cremonesi, 2010] Cremonesi, P., Koren, Y. and Turrin, R., 2010. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems (pp. 39-46). ACM. Link.
  • [Nikolakopoulos, 2019] Nikolakopoulos, A.N., Kalantzis, V., Gallopoulos, E. and Garofalakis, J.D., 2019. EigenRec: generalizing PureSVD for effective and efficient top-N recommendations. Knowledge and Information Systems, 58(1), pp.59-81. Link.
  • [Frolov, 2019] Frolov, E. and Oseledets, I., 2019. HybridSVD: When Collaborative Information is Not Enough. To appear in Proceedings of the Thirteenth ACM Conference on Recommender Systems. ACM. Link

Downloading data

We will use the Movielens-10M dataset for our experiments. It's large enough to perform reliable evaluation; however, not that large that you would spend too many hours waiting for results. If you don't plan to play with the code, you may want to run all cells with a single command and leave the notebook running in the background, while reading the text. It takes around 1.5h to complete this notebook on a modern laptop.

Note that you'll need an internet connection in order to run the cell below.

It will automatically download data, store it in a temporary location, and convert into a pandas dataframe. Alternatively, if you have already downloaded the dataset, you can use its local path as an input for the get_movielens_data function instead of tmp_file.

In [1]:
import urllib
from polara import (get_movielens_data, # returns data in the pandas dataframe format
                    RecommenderData) # provides common interface to access data 

url = ''
tmp_file, _ = urllib.request.urlretrieve(url) # this may take some time depending on your internet connection

data = get_movielens_data(tmp_file, include_time=True)

The resulting dataframe has a bit more than 10M ratings, as expected.

In [3]:
(10000054, 4)

Basic data stats:

In [4]:
userid         69878
movieid        10677
rating            10
timestamp    7096905
dtype: int64

Preparing data

As always, you need to firstly define a data model that will provide a common interface for all recommendation algorithms used in experiments:

In [5]:
data_model = RecommenderData(data, 'userid', 'movieid', 'rating', custom_order='timestamp', seed=0)
Fields(userid='userid', itemid='movieid', feedback='rating')

Setting seed=0 ensures controllable randomization when sampling test data, which enhances reproducibility; custom_order allows to select observations for evaluation based on their timestamp, rather than on rating value (more on that later).

Let's look at the default configuration of data model:

In [6]:
{'test_ratio': 0.2,
 'negative_prediction': False,
 'warm_start': True,
 'permute_tops': False,
 'test_sample': None,
 'random_holdout': False,
 'test_fold': 5,
 'shuffle_data': False,
 'holdout_size': 3}

By default, Polara samples 20% of users and marks them for test (test_ratio attribute). These users would be excluded from the training dataset, if warm_start attribute remained set to True (strong generalization test). However, in the iALS case such setting would require running additional half-step optimization (folding-in) for each test user in order to obtain their latent representation. To avoid that we turn the "warm start" setting off and perform standard evaluation (weak generalization test). In that case test users are part of the training (except for the ratings that were held out for evaluation) and one can directly invoke scalar products of latent factors. Note that SVD recommendations do not depend on this setting due to uniform projection formula applicable to both known and "warm" users: $r = VV^\top p$, where $r$ is a vector of predicted relevance scores, $p$ is a vector of any known preferences and $V$ is an item latent features matrix.

In [7]:
data_model.holdout_size = 1 # hold out 1 item from every test user
data_model.random_holdout = False # take items with the latest timstamp
data_model.warm_start = False # standard case
Preparing data...
There are 9986078 events in the training and 13976 events in the holdout.

The holdout_size attribute controls how many user preferences will be used for evaluation. Current configuration instructs data model to holdout one item from every test user. The random_holdout=False setting along with custom_order input argumment of data model make sure that only the latest rated item is taken for evaluation, allowing to avoid "recommendations from future". All these items are available via data_model.test.holdout.

General configuration

technical settings

Sometimes due to the size of dataset evaluation make take considerably longer than actual training time. If that's the case, you may want to give Polara more resources to perform evaluation, which is mostly controlled by changing memory_hard_limit and max_test_workers settings from their defaults. The former defines how much memory is allowed to use when generating predictions for test users. Essentially, Polara avoids running an inefficient by-user-loop for that task and makes calculation in bulk, which allos to invoke linear algebra kernels and speed up calculations. This, however, generates dense $M \times N$ matrix, where $M$ is the number of test users and $N$ is the number of all items seen during training. If this matrix doesn't fit into the memory constraint defined by memory_hard_limit (1Gb by default), calculations will be perfomed on a sequence of groups of $m<M$ users so that each group respects the constraint. If you have enough resources it can be a good idea to increase this memory limit. However, depending on hardware specs, manipulating huge amount of test data can be also slow. In that case it can be useful to still split test users into smaller group, however run calculations on each group in parallel. This can be achieved by setting max_test_workers integer to some number above 0, which will spawn the corresponding number of parallel threads. For example, instead of generating 60Gb matrix in a single run, one can define memory_hard_limit=15 and max_test_workers=4, which may help complete calculations faster.

In our case simply increasing the memory limit to 2Gb is sufficient for optimal performance.

In [8]:
from polara.recommender import defaults
In [9]:
defaults.memory_hard_limit = 2
defaults.max_test_workers = None # None is the default value

common config


In order to avoid undesired effects, related to the positivity bias, models will be trained only on interactions with ratings not lower than 4. This can be achieved by setting feedback_threshold attribute of models to 4. Due to the same reason, during the evaluation only items rated with ratings $\geq 4$ will be counted as true positive and used to calculate evaluation scores. This is controlled by the switch_positive setting.

In [10]:
defaults.switch_positive = 4
init_config = {'feedback_threshold': 4} # alternatively could set defaults.feedback_threshold

The default metric for tuning hyper-parameters and selecting the best model will be Mean Reciprocal Rank (MRR).

In [11]:
target_metric = 'mrr'
model tuning

All MF models will be tested on the following grid of rank values (number of latent features):

In [12]:
max_rank = 150
rank_grid = range(10, max_rank+1, 10)

Creating and tuning models

We will start from a simple PureSVD model and use it as a baseline in comparison with its own scaling modification and iALS algorithm.


On one hand, tuning SVD is limited due to its strict least squares formulation, which doesn't leave too much freedom comparing to more general matrix factorization approaches. On the other hand, this means less parameters to tune. Moreover, for whatever configuration of hyper-parameters, once you compute some SVD-based model of rank $r$ you can immediately obtain a model of rank $r' < r$ by a simple truncation procedure, which doesn't require recomputing the model. This makes grid-search with SVD very efficient and allows to explore a broader hyper-parameter space with less time.

In [13]:
try: # import package to track grid search progress
    from ipypb import track # lightweight progressbar, doesn't depend on widgets
except ImportError:
    from tqdm import tqdm_notebook as track

from polara import SVDModel
from polara.evaluation.pipelines import find_optimal_svd_rank

%matplotlib inline
In [14]:
psvd = SVDModel(data_model)
In [15]:
# the model will be computed only once for the max value of rank
psvd_best_rank, psvd_rank_scores = find_optimal_svd_rank(psvd,
100% 15/15 [01:03<00:04, 4.21s/it]

Note that in this case the most of the time is spent on evaluation, rather than on model computation. The model was computed only once. You can verify it by calling psvd.training_time list attribute and seeing that it contains only one entry:

In [16]:

Let's see how quality of recommendations changes with rank (number of latent features).

In [17]:
ax = psvd_rank_scores.plot(ylim=(0, None))
ax.set_xlabel('# of latent factors')

Scaled PureSVD model

We will employ a simple scaling trick over the rating matrix $R$ that was proposed by the authors of the EIGENREC model [Nikolakopoulos2019]: $R \rightarrow RD^{f-1},$ where $D$ is a diagonal scaling matrix with elements corresponding to the norm of the matrix columns (or square root of the number of nonzero elements in each column for the binary case). Parameter $f$ controls the effect of scaling and typically lies in the range [0, 1]. Finding the optimal value is an experimental task and will be performed via grid-search. We will use built-in support for such model in Polara. If you're interested in technical aspects of this implementation, see Reproducing EIGENREC results tutorial.

In [18]:
from polara.recommender.models import ScaledSVD
from polara.evaluation.pipelines import find_optimal_config # generic routine for grid-search

Now we have to compute SVD model for every value of $f$. However, we can still avoid computing the model for each rank value by the virtue of rank truncation.

In [19]:
def fine_tune_scaledsvd(model, ranks, scale_params, target_metric, config=None):
    rev_ranks = sorted(ranks, key=lambda x: -x) # descending order helps avoiding model recomputation
    param_grid = [(s1, r) for s1 in scale_params for r in rev_ranks]
    param_names = ('col_scaling', 'rank')
    return find_optimal_config(model,
                               force_build=False, # avoid recomputing the model

We already know an approximate range of values for the scaling factor. You may also want to play with other values, especially when working with a different dataset.

In [20]:
ssvd = ScaledSVD(data_model) # create model
scaling = [0.2, 0.4, 0.6, 0.8]

ssvd_best_config, ssvd_scores = fine_tune_scaledsvd(ssvd,
100% 60/60 [05:03<00:04, 5.05s/it]

Note that during this grid search the model was computed only len(scaling)=4 number of times, other points were found via rank truncation. Let's see how quality changes with different values of scaling parameter $f$.

In [21]:
for cs in scaling:
    cs_scores = ssvd_scores.xs(cs, level='col_scaling')
    ax = cs_scores.plot(label=f'col_scaling: {cs}')
ax.set_title(f'Recommendations quality for {ssvd.method} model')
ax.set_ylim(0, None)
In [22]:
ssvd_rank_scores = ssvd_scores.xs(ssvd_best_config['col_scaling'], level='col_scaling')

The optimal set of hyper-parameters:

In [23]:
{'col_scaling': 0.6, 'rank': 130}


Using implicit library in Polara is almost as simple as using SVD-based models. Make sure you have it installed in your python environment (follow instructions at ).

In [24]:
import os; os.environ["MKL_NUM_THREADS"] = "1" # as required by implicit
import numpy as np

from polara.recommender.external.implicit.ialswrapper import ImplicitALS
from polara.evaluation.pipelines import random_grid, set_config

defining hyper-parameter grid

Hyper-parameter space in that case is much broader. We will start by adjusting all hyper-parameters expect the rank value and then, once an optimal config is found, we will perform full grid-search over the range of rank values defined by rank_grid.

In [25]:
als_params = dict(alpha = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100],
                  epsilon = [0.01, 0.03, 0.1, 0.3, 1],
                  weight_func = [None, np.sign, np.sqrt, np.log2, np.log10],
                  regularization = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3],
                  rank = [40] # enforce rank value for quick exploration of other parameters

In order to avoid too long computation time, grid-search is performed over 60 random points, which is enough to get within 5% of the optimum with 95% confidence. The grid is generated with the built-in random_grid function.

In [26]:
als_param_grid, als_param_names = random_grid(als_params, n=60)
In [27]:
ials = ImplicitALS(data_model) # create model
In [28]:
ials_best_config, ials_grid_scores = find_optimal_config(ials,
                                                         als_param_grid, # hyper-parameters grid
                                                         als_param_names, # hyper-parameters' names
100% 60/60 [24:18<00:26, 24.30s/it]

rank tuning

In contrast to the case of SVD-based algorithms, iALS requires recomputing the model for every new rank value, therefore in addition to the previous 60 times, the model will be computed len(rank_grid) more times for all rank values.

In [29]:
ials_best_rank, ials_rank_scores = find_optimal_config(ials,
                                                       # configs are applied in the order they're provided
100% 15/15 [07:56<00:45, 31.71s/it]

Let's combine the best rank value with other optimal parameters:

In [30]:
{'alpha': 0.3,
 'epsilon': 0.3,
 'weight_func': <ufunc 'sqrt'>,
 'regularization': 0.03,
 'rank': 60}

visualizing rank tuning results

We can now see how all three algorithms compare to each other.

In [31]:
def plot_rank_scores(scores):
    ax = None
    for sc in scores:
        ax = sc.sort_index().plot(, ax=ax)
    ax.set_ylim(0, None)
    ax.set_title('Recommendations quality')
    ax.set_xlabel('# of latent factors')
    return ax
In [32]:

It can be seen that scaling (PureSVD-s line) has a significant impact on the quality of recommendations. This is, however, a preliminary result, which is yet to be verified via cross-validation.

Models comparison

The results above were computed only with a single split into train-test corresponding to a single fold. In order to verify the obtained results, perform a full CV with optimal parameters fixed. It can be achieved with the built-in run_cv_experiment function from Polara's evaluation engine as shown below.

cross-validation experiment

In [33]:
from polara.evaluation import evaluation_engine as ee

Fixing optimal configurations:

In [34]:
set_config(psvd, {'rank': psvd_best_rank})
set_config(ssvd, ssvd_best_config)
set_config(ials, ials_best_config)

Performing 5-fold CV:

In [35]:
models = [psvd, ssvd, ials]
metrics = ['ranking', 'relevance', 'experience']

# run experiments silently
data_model.verbose = False
for model in models:
    model.verbose = False

# perform cross-validation on models, report scores according to metrics
cv_results = ee.run_cv_experiment(models,
100% 5/5 [05:23<01:07, 64.52s/it]

The output contains results for all folds:

In [36]:
type relevance ranking experience
metric hr mrr coverage
fold model
1 PureSVD 0.076857 0.029101 0.085902
PureSVD-s 0.084729 0.032221 0.148946
iALS 0.076428 0.028240 0.093489
2 PureSVD 0.079868 0.029385 0.089836
PureSVD-s 0.086953 0.034059 0.150539

plotting results

We will plot average scores and confidence intervals for them. The following function will do this based on raw input from CV:

In [37]:
def plot_cv_results(scores, subplot_size=(6, 3.5)):
    scores_mean = scores.mean(level='model')
    scores_errs = ee.sample_ci(scores, level='model')
    # remove top-level columns with classes of metrics (for convenience)
    scores_mean.columns = scores_mean.columns.get_level_values(1)
    scores_errs.columns = scores_errs.columns.get_level_values(1)
    # plot results
    n = len(scores_mean.columns)
    return, rot=0,
                                subplots=True, layout=(1, n),
                                figsize=(subplot_size[0]*n, subplot_size[1]),
In [38]:

The difference between PureSVD and iALS is not significant.

In contrast, the advantage of the scaled version of PureSVD denoted as `PureSVD-s` over the other models is much more pronounced making it a clear favorite.
Interestingly, the difference is especially pronounced in terms of the coverage metric, which is defined as the ratio of unique recommendations generated for all test users to the total number of items in the training data. This indicates that generated recommendations are not only more relevant but also are significantly more diverse.

comparing training time

Another important practical aspect is how long does it take to compute a model? Sometimes the best model in terms of quality of recommendations can be the slowest to compute. You can check each model's training time by accessing the training_time list attribute. It holds the history of trainings, hence, if you have just performed 5-fold CV experiment, the last 5 entries in this list will correspond to the training time on each fold. This information can be used to get average time with some error bounds as shown below.

In [39]:
import pandas as pd
In [40]:
timings = {}
for model in models:
    timings[f'{model.method} rank {model.rank}'] = model.training_time[-5:]
In [41]:
time_df = pd.DataFrame(timings)
time_df.mean(), rot=0, title='Computation time for optimal config, s');

PureSVD-s compares favoribly to the iALS, even though it requires higher rank value, which results in a longer training time comparing to PureSVD. Another interesting measure is what time does it take to achieve approximately the same quality by all models. Note that all models give approximately the same quality at the optimal rank of iALS. Let's compare training time for this value of rank.

In [42]:
fixed_rank_timings = {}
for model in models:
    model.rank = ials_best_config['rank']
    fixed_rank_timings[model.method] = model.training_time[-1]

pd.Series(fixed_rank_timings), title=f'Rank {ials.rank} computation time, s')
<matplotlib.axes._subplots.AxesSubplot at 0x211016ee4a8>

By all means computing SVD on this dataset is much faster than ALS. This may, however, vary on other datasets due to a different sparsity structure. Nevertheless, you can still expect, that SVD-based models will be perfroming well due to the usage of highly optimized BLAS and LAPACK routines.

Bonus: scaling for iALS

You may reasonably question whether that scaling trick also works for non SVD-based models. Let's verify its applicability for iALS. We will reuse Polara's built-in scaling functions in order to create a new class of the scaled iALS-based model.

In [43]:
from polara.recommender.models import ScaledMatrixMixin
In [44]:
class ScaledIALS(ScaledMatrixMixin, ImplicitALS): pass # similarly to how PureSVD is extended to its scaled version
In [45]:
sals = ScaledIALS(data_model)

In order to save time, we will utilize the optimal configuration for scaling, found by tuning scaled version of PureSVD. Alternatively, you could include scaling parameters into the grid search step by extending als_param_grid and als_param_names variables. However, taking configuration of PureSVD-s should be a good enough approximation at least for verifying the effect of scaling. The tuning itself has to be repeated from the beginning.

hyper-parameter tuning

In [46]:
sals_best_config, sals_param_scores = find_optimal_config(sals,
                                                                       ssvd_best_config], # the rank value will be overriden
100% 60/60 [26:27<00:30, 26.45s/it]
In [47]:
sals_best_rank, sals_rank_scores = find_optimal_config(sals,
100% 15/15 [09:09<00:46, 36.58s/it]

visualizing rank tuning results

In [48]:

There seem to be no difference between the original and scaled versions of iALS. Let's verify this with CV experiment.


You only need to perform CV computations for the new model. Configuration of data will be the same as previously, as the data_model instance ensures reproducible data state.

In [49]:
{'alpha': 0.1,
 'epsilon': 0.01,
 'weight_func': <ufunc 'sqrt'>,
 'regularization': 3.0,
 'rank': 60}
In [50]:
set_config(sals, sals_best_config)
In [51]:
sals.verbose = False
sals_cv_results = ee.run_cv_experiment([sals],
100% 5/5 [03:36<00:44, 43.26s/it]
In [52]:
Surprisingly, the iALS model remains largely insensitive to the scaling trick. At least in the current settings and for the current dataset.

Remark: You may want to repeat all experiments in a different setting with random_holdout set to True. My own results indicate that in this case iALS performs slightly better than PureSVD and also becomes more responsive to scaling, giving the same result as the scaled version of SVD. However, the scaled version of SVD is still easier to compute and tune.


With a proper tuning the quality of recommendations of one of the simplest SVD-based models can be substantially improved. Despite common beliefs, it turns out that PureSVD with simple scaling trick compares favorably to a much more popular iALS algorithm. The former not only generates more relevant recommendations, but also makes them more diverse and potentially more interesting. In addition to that, it has a number of unique advantages and merely requires to do from scipy.sparse.linalg import svds to get started. Of course, the obtained results may not necessarily hold on all other datasets and require further verification.

However, in the view of all its features and advantages, the scaled SVD-based model certainly deserves to be included into the list of the default baselines.

The result obtained here resembles situation in the Natural Language Processing field, where simple SVD-based model with proper tuning turns out to be competitive on a variety of downstream tasks even in comparison with Neural Network-based models. In the end it's all about adequate tuning and fair comparison.

As a final remark, this tutorial is a part of a series of tutorials demonstrating usage scenarios for the Polara framework. Polara is designed to support openness and reproducibility of research. It provides controlled environment and rich functionality with high level abstractions, which allows conducting thorough experiments with minimal efforts and can be used to either quickly test known ideas or implement something new.


Comments powered by Disqus