Outperforming LightFM with HybridSVD in cold start

Introduction

LightFM is a very popular tool, which finds its applications both in small projects and in production systems. On a high level, the general idea of representing users and items as combinations of their features, used in the LightFM model, is not new. It was implemented in a number of preceding works, not to mention the obvious link to Factorization Machines. Another notable member of this family is SVDFeature, which came around in 2011 and in some sense was even closer to LightFM than FM (the naming, though, is misleading, as it is not an SVD-based model). In fact, the feature combination approach can be traced back to Microsoft's MatchBox system from 2009. Its authors used the term trait vector for a combined representation and employed an elaborate probabilistic framework on top of it. Still, even though the idea itself was not new, the LightFM model was made convenient and extremely easy to use, which has probably made this framework popular among practitioners. Moreover, it provided several ranking optimization schemes and had fairly good computational performance.

On the other hand, I have also seen some complaints from data scientists about the actual prediction quality of LightFM. I could not find a more or less rigorous evaluation of LightFM, where it would be compared against strong baselines and, more importantly, all models would undergo appropriate tuning. The examples section of LightFM's documentation merely provides quick-start demos, not real performance tests. The paper from the RecSys workshop also does not provide much evidence on the model's performance. Numerous online tutorials only repeat basic configuration steps, leaving the whole tuning aspect aside. Considering how often practitioners are advised to start with LightFM (in my experience), I have decided to put LightFM's capabilities to a real test.

In this tutorial we will conduct a thorough experiment with proper tuning and testing of LightFM against some strong baselines in the cold start setting.

The baseline models in this tutorial will be based on my favorite SVD (the real one, not just some matrix factorization). I have summarized the reasons why I promote SVD-based recommendation algorithms as the default choice in my previous post. However, I have not yet described how to use SVD in the cold start scenario. The theoretical foundation as well as more experiments can be found in our recent paper, where we demonstrate how to adapt both PureSVD and our HybridSVD model to the cold start scenario. Here, I will provide only the minimum necessary material for understanding what is going on, focusing more on technical aspects instead.

For all experiments we will be using Polara, as it already provides a high-level API for experimenting with LightFM and SVD-based models, including a set of convenient tools for building the entire tuning and evaluation pipeline. We will also employ an advanced optimization procedure for LightFM to ensure that the obtained model is close to its optimal configuration. Conversely, as we will see, tuning the SVD-based models is much more straightforward and less cumbersome. They depend on fewer hyper-parameters and also have deterministic output. The question remains whether it is possible to outperform LightFM with these simpler models. Let us find out.

Getting data

We will use the StackExchange dataset provided in one of the examples from the LightFM documentation. It consists of users who have answered certain questions. In addition to the user-answer interaction matrix, the dataset also provides additional information in the form of tags assigned to each answered question. The data-reading code below is simply copied from the documentation.

Downloading and preprocessing data

In [1]:
import numpy as np
import pandas as pd
from lightfm.datasets import fetch_stackexchange
from polara.recommender.coldstart.data import ItemColdStartData
from polara.tools.display import print_frames # to print df's side-by-side
In [2]:
data = fetch_stackexchange('crossvalidated',
                           test_set_fraction=0.1,
                           indicator_features=False,
                           tag_features=True)

The data variable contains both training and test datasets, as well as tag assignments and their labels:

In [3]:
print(data.keys())
dict_keys(['train', 'test', 'item_features', 'item_feature_labels'])

Now, we need to convert it into the Polara-compatible format of pandas DataFrames. Mind the field names for users and items in the result:

In [4]:
entities = ['users', 'items']
training_data = pd.DataFrame(dict(zip(entities, data['train'].nonzero())))
test_data = pd.DataFrame(dict(zip(entities, data['test'].nonzero())))
In [5]:
seed = 321 # to be used in data model, LightFM, and optimization routines
training_data = training_data.sample(frac=1, random_state=seed) # shuffle data

We also need to convert item features (tags) into a dataframe with a certain structure. This will enable many of Polara's built-in functions for data manipulation.

In [6]:
# convert sparse matrix into `item - tag list` pd.Series object
# make use of sparse CSR format for efficiency
item_tags = (
    pd.Series(
        np.array_split(
            # convert back from indices to tag labels
            np.take(data['item_feature_labels'],
                    data['item_features'].indices),
            # split tags into groups by items
            data['item_features'].indptr[1:-1]))
    .rename_axis('items')
    .to_frame('tags') # Polara expects dataframe
)
In [7]:
print_frames([training_data.head(), # data for training and validation
              test_data.head(), # data for testing
              item_tags.head()]) # item features data
Out[7]:
users items
35417 1088 41856
12389 205 38838
48719 2081 20186
25104 635 49982
46833 1878 30914
users items
0 16 70628
1 18 66569
2 18 66609
3 18 68964
4 18 69780
tags
items
0 [bayesian, prior, elicitation]
1 [distributions, normality]
2 [software, open-source]
3 [distributions, statistical-significance]
4 [machine-learning]

Preparing data model for training and validation

We need to wrap our data into an instance of Polara's RecommenderData subclass, designed for automating cold start experiments. This also allows sharing data among many recommender models and propagating a consistent state across all of them. In order to fine-tune the models we will additionally split the training data into training and validation sets using Polara's built-in functionality.

Once optimal hyper-parameters are found, the model will be retrained on the full training data and verified against the initially provided test set.
In [8]:
data_model = ItemColdStartData(
    training_data,
    *training_data.columns, # userid, itemid
    item_features=item_tags,
    seed=seed)
print(data_model)
ItemColdStartData with Fields(userid='users', itemid='items', feedback=None)
In [9]:
data_model.test_ratio = 0.05 # take 5% of items as cold (at random)
data_model.prepare()
Preparing data...
1 unique users entities within 2 holdout interactions were filtered. Reason: not in the training data.
Done.
There are 54975 events in the training and 2853 events in the holdout.

Let us look at the holdout data, which will serve as a validation set:

In [10]:
print_frames([data_model.test.holdout.head(), data_model.test.holdout.tail()])
Out[10]:
users items_cold
12389 39 0
48717 1 1
25266 75 2
27901 1242 2
30278 32 3
users items_cold
8223 29 1920
36348 677 1921
13834 53 1922
38558 343 1923
41979 225 1924

The procedure for both validation and final evaluation will be the following:

  1. For every unique cold item, we generate a list of candidate users who, according to our models, are most likely to be interested in this item.
  2. We then verify the list of predicted users against the actual users from the holdout.

We will use standard evaluation metrics to assess the quality of predictions.
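For intuition, here is a toy sketch of this evaluation logic. Polara computes the metrics for us, so this is only an illustration; the helper name and the inputs predicted_users and holdout_users are hypothetical.

import numpy as np

def cold_item_precision(predicted_users, holdout_users, topn=10):
    'Toy illustration of the evaluation logic, not Polara internals.'
    hits = []
    for item, recommended in predicted_users.items():
        true_users = holdout_users.get(item, set())
        # fraction of the top-n candidate users confirmed by the holdout
        hits.append(len(set(recommended[:topn]) & true_users) / topn)
    return np.mean(hits)

# hypothetical inputs: item -> ranked list of users, item -> set of holdout users
print(cold_item_precision({0: [5, 7, 9]}, {0: {7}}, topn=3))  # -> 0.333...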

Recommender models

LightFM

As stated in the introduction, LightFM is an efficient and convenient instrument for building hybrid recommender systems. There are, however, several downsides in its implementation (at least at the moment of writing this post).

First of all, there is no off-the-shelf support for the warm start scenario (e.g., for new users with several known interactions), which would allow estimating strong generalization of the algorithm. Basically, there is no folding-in implementation, even though it seems to be in high demand (see github issues here, here or here) and it is not that difficult to implement. Likewise, there is no support for session-based recommendations either. The sometimes recommended fit_partial method is not really an answer to these problems. It updates both user and item embeddings, and sometimes that is not what you want. For example, fitting newly introduced users would also affect the latent representation of some known items, potentially changing recommendations for already present users without any action on their part. Such an unintended change in recommendations can be confusing for users and lead to lower satisfaction with a recommendation service.
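To illustrate what folding-in means here, below is a generic sketch (this is not LightFM's API) of how a new user could be projected onto an already trained factorization model by solving a small regularized least-squares problem while keeping the item embeddings fixed. The function and variable names are made up for this example.

import numpy as np

def fold_in_user(item_factors, interactions, reg=1e-3):
    'Generic folding-in sketch: fit a new user vector with item factors fixed.'
    V = item_factors                            # (n_items, rank) matrix
    # regularized least squares: (V^T V + reg*I) u = V^T p
    gram = V.T @ V + reg * np.eye(V.shape[1])
    return np.linalg.solve(gram, V.T @ interactions)

# hypothetical usage with random factors and an implicit feedback vector
V = np.random.rand(100, 8)
p = np.zeros(100); p[[3, 17, 42]] = 1.0         # items the new user interacted with
user_vector = fold_in_user(V, p)
scores = V @ user_vector                        # rank all items for this new user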

Another aspect, which can be critical in certain situations, is the lack of determinism. LightFM's output is deterministic only under strict conditions: single-threaded execution with a fixed input data order. Multithreading causes race conditions that are resolved at a low system level beyond user control. This is not unique to LightFM, of course, and is shared across all matrix or tensor factorization algorithms based on naive SGD. Input data reordering and, probably, some other factors may also lead to noticeably different results. Even in this notebook, if you set the number of LightFM threads to 1 (see the num_threads variable below), the results may still change once you restart the notebook kernel. If your business has specific requirements on, e.g., non-regression testing, this model may not be suitable for you. Conversely, SVD-based models are free of such issues. This is also among the reasons why matrix factorization models like SVDFeature or FunkSVD should not be conflated with real SVD-based models.

The model

In [11]:
from polara.recommender.coldstart.models import LightFMItemColdStart
In [12]:
num_threads = 4 # number of parallel threads used by LightFM
max_rank = 200 # the max number of latent features used in tuning

Setting the number of threads greater than 1 is not critical in our setup, as the result will fluctuate by a relatively small margin, which should not dramatically change the overall picture of the algorithm's performance. Nevertheless, you can try running in a fully deterministic setting and verify that. However, you will have to wait a bit longer during the LightFM tuning phase due to the sequential execution.

In [13]:
def create_lightfm_model(data_model, item_features, num_threads, seed=None):
    'Prepare LightFM model and fix initial configuration.'
    model = LightFMItemColdStart(data_model, item_features=item_features)
    model.loss = 'warp'
    model.learning_schedule = 'adagrad'
    model.seed = seed
    model.fit_params['num_threads'] = num_threads
    return model
In [14]:
lfm = create_lightfm_model(data_model, item_tags, num_threads, seed)

Tuning

As the LightFM algorithm depends on many hyper-parameters and has a stochastic nature, it presents a hard problem for optimization and requires a fair bit of tuning. There are many great tools that make life a bit easier in this regard, e.g., HyperOpt or Optuna. I personally prefer the latter as it typically allows writing more concise code. We will be using it in this tutorial as well. If you don't have optuna installed you can simply run

pip install optuna

in your python environment.

Instead of performing random search for hyper-parameter optimization we will employ a more advanced and flexible technique based on the Tree-structured Parzen Estimator (TPE). It helps to iteratively narrow down the hyper-parameter search subspace and preemptively disregard unpromising search regions. A few more details along with resources for further reading can be found in the optuna documentation. There is also a nice comparison of different techniques in this blog post.

In [15]:
import optuna
try: # import lightweight progressbar
    from ipypb import track
except ImportError: # fallback to default
    from tqdm.auto import tqdm as track

Note that tuning a high number of hyper-parameters exponentially increases the complexity of the task. The authors of optuna even recommend refraining from tuning unimportant variables. We will therefore concentrate on 3 main hyper-parameters:

  1. dimensionality of the latent space (rank of the decomposition),
  2. importance of item features, controlled by the item_alpha regularization parameter,
  3. number of epochs.

We will leave other hyper-parameters with their default values. My preliminary experiments showed that there was no big difference in the final result after changing them, as long as their values remained within a reasonable range. As always, you are free to verify that on your own by adding more hyper-parameters into the search space. You will need to modify the objective function defined below.

To aid the learning process, we will sample item_alpha from a log-uniform distribution and rank values from a range of positive integer numbers up to a certain threshold. Note that we also use set_user_attr routine to store additional information related to each tuning trial.

In [16]:
def evaluate_lightfm(model):
    '''Convenience function for evaluating LightFM.
    It disables user bias terms to improve quality in cold start.'''
    model._model.user_biases *= 0.0
    return model.evaluate()

def find_target_metric(metrics, target_metric):
    'Convenience function to quickly extract the required metric.'
    for metric in metrics:
        if hasattr(metric, target_metric):
            return getattr(metric, target_metric)

def lightfm_objective(model, target_metric):
    'Objective function factory for optuna trials.'
    def objective(trial):
        # sample hyper-parameter values
        model.rank = trial.suggest_int('rank', 1, max_rank)
        model.item_alpha = trial.suggest_loguniform('item_alpha', 1e-10, 1e-0)
        # train model silently and evaluate
        model.verbose = False
        model.build()
        metrics = evaluate_lightfm(model)
        target = find_target_metric(metrics, target_metric)
        # store trial-specific information for later use
        trial.set_user_attr('epochs', model.fit_params['epochs'])
        trial.set_user_attr('metrics', metrics)
        return target
    return objective

One of the limitations of TPE is that it does not take into account possible dependencies between hyper-parameters. In addition to that, parameters like rank or epochs significantly influence the training time. In order to avoid spending too much time on getting a reasonably good model, we will use a specific tuning procedure.

We take the number of epochs out of the main loop. The idea is to exclude early on those configurations that achieve high scores solely due to a large number of epochs. Hence, we start with a smaller number of epochs and increase it gradually, letting TPE discard unpromising search directions along the way. Hopefully, it will not discard good search regions that can potentially produce good results with a higher number of epochs. Likewise, we also gradually decrease the trial budget for each subsequent number of epochs, assuming that a narrower search space requires less exploration to achieve reasonable quality.

In [17]:
n_trials = {
# epochs: # trials
    15: 30,
    25: 25,
    50: 20,
    75: 15,
    100: 10,
    150: 5
}

We will specifically target the Precision@10 metric and will analyze other metrics during the final evaluation.

In [18]:
target_metric = 'precision'
objective = lightfm_objective(lfm, target_metric)

study = optuna.create_study(
    direction = 'maximize',
    sampler = optuna.samplers.TPESampler(seed=seed)
)

optuna.logging.disable_default_handler() # do not report progress
for num_epochs, num_trials in track(n_trials.items()):
    lfm.fit_params['epochs'] = num_epochs
    study.optimize(objective, n_trials=num_trials, n_jobs=1, catch=None)
100% 6/6 [06:10<00:36, 61.66s/it]

Our optimal configuration can be retrieved as follows (note that it can be slightly different on your machine):

In [19]:
print(f'The best value of {target_metric}={study.best_value:0.4f} was achieved with '
      f'rank={study.best_params["rank"]} and item_alpha={study.best_params["item_alpha"]:.02e} '
      f'within {study.best_trial.user_attrs["epochs"]} epochs.')
The best value of precision=0.0344 was achieved with rank=9 and item_alpha=7.26e-07 within 100 epochs.

Visualizing tuning results

Let's plot the result to see whether there are any trends and whether we need to continue our parameter search in some new region of the hyper-parameter subspace.

In [20]:
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use(['seaborn-notebook']) # see plt.style.available
In [21]:
trials_df = (study.trials_dataframe() # all trials history
             .loc[:, ['value', 'params', 'user_attrs']]
             .rename(columns={'': target_metric}))
trials_df.columns = trials_df.columns.get_level_values(1)
trials_df.sort_values('precision').tail()
Out[21]:
precision item_alpha rank epochs metrics
92 0.033974 2.986615e-07 6 100 [(0.03397402597402597, 0.25204923647780786, No...
99 0.034078 1.901221e-06 12 100 [(0.034077922077922075, 0.25326630512344794, N...
53 0.034078 5.043306e-06 25 25 [(0.034077922077922075, 0.2526167166167166, No...
60 0.034234 3.091518e-09 9 50 [(0.034233766233766234, 0.2541265401265401, No...
91 0.034390 7.262623e-07 9 100 [(0.034389610389610394, 0.2558374958374958, No...

We will treat low-score points (likely obtained at the initial tuning stage) as outliers and only take into account the top-20% results.

In [22]:
# take only top-20% of the points for analysis
(trials_df.query(f'{target_metric}>{0.8*study.best_value}')
          .boxplot(column=target_metric, by='epochs', grid=False))
plt.suptitle(''); # hide auto-generated box-plot title

Even though this plot is not suitable for making conclusions about the actual performance of the LightFM model, it roughly indicates that, with the current choice of optimization mechanism, increasing the number of epochs further is unlikely to significantly improve the quality of predictions. In other words, our tuning procedure has reached a nearly stable state and is most likely somewhere close to a local optimum. We will, therefore, proceed with the best found configuration without any further adjustments. Below is a plot characterizing all trials.

In [23]:
plt.figure(figsize=(10, 5))
# limit plot display region
plt.xlim(0, 50)
plt.ylim(0.8*study.best_value, study.best_value + 0.002)
# plot points
marker_scale = 5
sc = plt.scatter(
    trials_df['rank'], trials_df[target_metric],
    # circle size indicates the number of epochs 
    s=trials_df['epochs']*marker_scale,
    # color encodes the value of item_alpha (logscale)
    c=np.log10(trials_df['item_alpha']),
    alpha=0.5, cmap="viridis", marker='o'
)
# prepare legend handles for each number of epochs
legend_labels = n_trials.keys()
legend_handles = [
    plt.scatter([], [], s=n_epochs*marker_scale, marker='o',
                color='lightgrey', edgecolors='darkgrey')
    for n_epochs in legend_labels
]
# add legend, ensuring 1:1 scale with the main plot
plt.legend(
    legend_handles, legend_labels,
    scatterpoints=1, ncol=len(legend_labels),
    title='# epochs', borderpad=1.1,
    labelspacing=1.5, markerscale=1,
    loc='lower right'
)    
# add colorbar for item_alpha values
clb = plt.colorbar(sc)
clb.set_label('item_alpha, [log10]', rotation=-90, labelpad=20)
# annotate plot
plt.title('LightFM hyper-parameters tuning')
plt.xlabel('rank')
plt.ylabel(target_metric);

We see that even smaller numbers of epochs produce a relatively good score, which also supports the assumption that we have probably found nearly optimal values. How good is this result? To answer that question we need to introduce the baselines.

PureSVD

This will be our first baseline. We will simply reuse the scaled version of PureSVD, which has been shown to perform very competitively against other models. For more details, see the "Reproducing EIGENREC results" tutorial. The only difference is that the model is additionally adapted for the cold start evaluation scenario. In this modification the latent representation $v$ of a cold item can be obtained by solving the following linear system: $$ W^\top v = f, $$ where $f$ is a one-hot vector of real item features (tags in our case), and $W=V^\top F$ is a precomputed linear mapping between the learned latent space represented by the right singular vectors $V$ and the feature matrix $F$ (the one-hot encoding of tags for known items). Everything here can be calculated efficiently. For more details I invite you to check our paper. After you find the latent representation of a cold item, you can use the standard scalar product to find relevant users.
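To make the formula above more tangible, here is a minimal dense sketch of the projection step. Polara performs this internally and more efficiently; the variables V, F and f below are assumptions matching the notation in the text.

import numpy as np

def cold_item_embedding(V, F, f):
    '''Sketch of the cold item projection:
    V - (n_items, rank) right singular vectors,
    F - (n_items, n_tags) one-hot tag encoding of known items,
    f - (n_tags,) one-hot tag vector of the cold item.'''
    W = V.T @ F                                  # rank x n_tags linear mapping
    # W^T v = f is overdetermined in general -> least-squares solution
    v, *_ = np.linalg.lstsq(W.T, f, rcond=None)
    return v

# relevant users are then found via scalar products with the user factors:
# scores = user_factors @ cold_item_embedding(V, F, f)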

The model

In the cold start scenario the model depends on side features (needed to construct the matrix $F$). This information can be provided via an input argument during instantiation of the model itself. An alternative and more robust way, used here, is to provide features in the data model constructor. Hence, the required item_tags variable will be taken from the data_model object.

In [24]:
from polara.recommender.coldstart.models import ScaledSVDItemColdStart
In [25]:
svd = ScaledSVDItemColdStart(data_model)

Tuning

Computing and evaluating SVD is going to be blazingly fast, as the dataset is small and the model is computed only once for every scaling value thanks to a simple rank truncation procedure. Hence, we can create a very dense parameter grid, which can still be fully explored with grid search in a reasonable amount of time.

In [26]:
from polara.evaluation.pipelines import find_optimal_config # generic routine for grid-search
In [27]:
def fine_tune_scaledsvd(model, ranks, scale_params, target_metric):
    'Efficiently tuning SVD rank for different scaling parameter values.'
    # descending order helps avoiding model recomputation
    rev_ranks = sorted(ranks, key=lambda x: -x)
    param_grid = [(s, r) for s in scale_params for r in rev_ranks]
    param_names = ('col_scaling', 'rank')
    config, scores = find_optimal_config(
        model, param_grid, param_names, target_metric,
        return_scores=True, force_build=False, iterator=track
    )
    return config, scores
In [28]:
# define the hyper-parameters grid
rank_grid = [1,] + list(range(5, max_rank+1, 5)) # 1, 5, 10, ..., max_rank
scaling_grid = [-0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8] # 1.0 is for PureSVD
In [29]:
# perform tuning
svd_best_config, svd_scores = fine_tune_scaledsvd(
    svd, rank_grid, scaling_grid, target_metric
)
100% 369/369 [01:06<00:00, 0.18s/it]
In [30]:
print(f'The best value of {target_metric}={svd_scores.max():.4f} was achieved with '
      f'rank={svd_best_config["rank"]} and scaling parameter={svd_best_config["col_scaling"]}.')
The best value of precision=0.0298 was achieved with rank=45 and scaling parameter=0.6.

Recall that in this case the result is fully deterministic and reproducible.

Visualizing tuning results

The resulting hyper-parameter search grid can be conveniently represented as a heatmap:

In [31]:
(svd_scores.sort_index()
           .unstack(level='col_scaling')
           .iloc[::3] # don't display all rank values
           .style
           .format("{:.4f}")
           .background_gradient(cmap='viridis', high=0.2, axis=None))
Out[31]:
col_scaling -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8
rank
1 0.0224 0.0231 0.0231 0.0231 0.0231 0.0231 0.0231 0.0231 0.0231
15 0.0255 0.0253 0.0257 0.0261 0.0259 0.0259 0.0260 0.0263 0.0269
30 0.0266 0.0267 0.0269 0.0268 0.0270 0.0268 0.0282 0.0282 0.0275
45 0.0294 0.0289 0.0290 0.0290 0.0291 0.0290 0.0290 0.0298 0.0288
60 0.0286 0.0286 0.0287 0.0289 0.0289 0.0292 0.0286 0.0285 0.0287
75 0.0279 0.0279 0.0281 0.0283 0.0285 0.0287 0.0286 0.0286 0.0291
90 0.0279 0.0276 0.0276 0.0277 0.0278 0.0282 0.0279 0.0283 0.0285
105 0.0279 0.0279 0.0284 0.0282 0.0284 0.0283 0.0285 0.0287 0.0286
120 0.0279 0.0279 0.0276 0.0276 0.0281 0.0282 0.0282 0.0281 0.0282
135 0.0270 0.0271 0.0272 0.0274 0.0273 0.0273 0.0276 0.0276 0.0281
150 0.0266 0.0268 0.0272 0.0273 0.0273 0.0276 0.0276 0.0276 0.0278
165 0.0264 0.0269 0.0266 0.0267 0.0268 0.0275 0.0273 0.0274 0.0270
180 0.0256 0.0257 0.0261 0.0263 0.0261 0.0262 0.0266 0.0268 0.0274
195 0.0252 0.0251 0.0253 0.0256 0.0254 0.0258 0.0263 0.0260 0.0261

As we can see, the optimal rank values sit somewhere around rank 50. The prediction quality is not impressive, though. Surprisingly, unlike in many other examples, this time the scaling does not help much and the model performs poorly. This may indicate that properly handling side information plays a more critical role than data debiasing. Seemingly, SVD is unable to reliably connect the tag-based description with the latent representation. We can try to fix this problem with the help of the hybrid version of SVD. This is going to be the second and the main baseline.

HybridSVD

HybridSVD allows imposing the desired structure on the latent feature space using side information.

Hopefully, this structure will be more appropriate for discovering relationships between the real and the latent features, and will improve prediction quality in the cold start. The hybrid functionality of SVD is enabled by the use of additional information about similarities between users and/or items. These similarities should be computed beforehand and utilize side features (item tags in our case). The model still uses SVD for computations, even though internally it is applied to a special auxiliary matrix instead of the original one. It, therefore, remains a "real" SVD-based model with all its advantages. Once again, you can look up the details in our paper. The process of computing the model and generating recommendations remains largely the same, except for the additional handling of similarity information. Polara already provides support for this. We only need to invoke the necessary methods and classes. See the related data pre-processing procedure below.

Preparing similarity data

We start by computing the tag-based similarity between items using Polara's built-in functions. You can use your own favorite methods to compute similarities; there is a lot of freedom here.

In [32]:
from polara.recommender.coldstart.data import ItemColdStartSimilarityData
from polara.lib.similarity import combine_similarity_data
In [33]:
training_tags = item_tags.reindex(training_data['items'].unique(), fill_value=[])
tag_similarity = combine_similarity_data(training_tags, similarity_type='cosine')
print('Similarity matrix density is '
      f'{tag_similarity.nnz / np.prod(tag_similarity.shape):.1%}.')
Similarity matrix density is 8.4%.

This similarity matrix will be fed into a special data model instance, which will take care of providing consistent on-demand access to similarity information for all dependent recommender models. The input should be a dictionary specifying the types of entities, the similarities between them, and the corresponding indexing information to decode rows (or columns) of the similarity matrices:

In [34]:
similarities = {'users': None, # we have no user features
                'items': tag_similarity}
sim_indices = {'users': None,
               'items': training_tags.index}

All other functionality remains the same as in the standard data model:

In [35]:
data_model_sim = ItemColdStartSimilarityData(
    training_data,
    *training_data.columns,
    relations_matrices=similarities, # new input args
    relations_indices=sim_indices, # new input args
    item_features=item_tags,
    seed=seed
)
print(data_model_sim)
ItemColdStartSimilarityData with Fields(userid='users', itemid='items', feedback=None)

We also use the same configuration for the new data model to ensure the same data pre-processing and the same overall setup.

In [36]:
data_model_sim.test_ratio = data_model.test_ratio
data_model_sim.prepare()
Preparing data...
1 unique users entities within 2 holdout interactions were filtered. Reason: not in the training data.
Done.
There are 54975 events in the training and 2853 events in the holdout.

The model

In order to generate the latent representation of a cold item the model uses the same linear system as in the standard PureSVD case. The main difference is how the linear transformation matrix $W$ is computed: $$ W=V^\top SF. $$ $S$ is an item similarity matrix and $V$ is now $S$-orthogonal, i.e., $V^\top S V = I$. The model also introduces an additional weighting hyper-parameter $\alpha \in [0, 1]$: $$ S=(1-\alpha)I + \alpha Z, $$ where $Z$ is the actual similarity matrix computed above (see the similarities variable). Higher values of $\alpha$ make the model more sensitive to side information, while setting $\alpha=0$ turns the model back into PureSVD. To simplify the process, we will not tune its value. We will just make it relatively high to emphasize the importance of tag information for the model. This will help us verify the assumption that tag information is critical for building an adequate latent representation with SVD.
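For illustration only, the two formulas above could be assembled as follows. Polara builds these matrices for us; Z, V and F below mirror the notation in the text and are assumptions of this sketch.

import scipy.sparse as sp

def blended_similarity(Z, alpha=0.9):
    'S = (1 - alpha) * I + alpha * Z, with Z a sparse item similarity matrix.'
    return (1 - alpha) * sp.identity(Z.shape[0], format='csr') + alpha * Z

# the cold start mapping then uses S instead of the plain feature matrix:
# W = V.T @ (blended_similarity(Z, alpha=0.9) @ F)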

In [37]:
from polara.recommender.coldstart.models import ScaledHybridSVDItemColdStart
In [38]:
hsvd = ScaledHybridSVDItemColdStart(data_model_sim)
hsvd.features_weight = 0.9 # the value of alpha

On the technical side, the model takes as input the product of several sparse matrices, which forms an auxiliary matrix. As the computations are based on matrix-vector products (due to the Lanczos procedure used in the truncated SVD), we are presented with a choice:

  • we can either avoid explicitly forming the auxiliary matrix and just successively compute matrix-vector products with all the involved components, or
  • we can precompute and store the auxiliary matrix first and then perform a single matrix-vector multiplication.

Computational efficiency of the aforementioned approaches strongly depends on the sparsity structure of the input matrices. Therefore, the optimal choice should be decided case by case. There is a special attribute to control this behavior in the HybridSVD model:

In [39]:
hsvd.precompute_auxiliary_matrix = True # faster in this case
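To give a feeling for the trade-off between the two options above, here is a rough illustration with SciPy. This is not Polara's actual code; the factor matrices are random placeholders.

import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

# suppose the auxiliary matrix is a product A = B @ C of two sparse factors
B = sp.random(1000, 500, density=0.01, format='csr')
C = sp.random(500, 800, density=0.01, format='csr')

# option 1: never form A, chain matrix-vector products inside a LinearOperator
A_implicit = LinearOperator(
    shape=(B.shape[0], C.shape[1]),
    matvec=lambda x: B @ (C @ x),
    rmatvec=lambda x: C.T @ (B.T @ x),
    dtype=B.dtype,
)
# option 2: precompute the product once and reuse it
A_explicit = B @ C

# both forms are accepted by the Lanczos-based truncated SVD
u, s, vt = svds(A_implicit, k=10)
u, s, vt = svds(A_explicit, k=10)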

Tuning

We have already spent some time tuning PureSVD. Assuming that the main contribution of HybridSVD is related to tag handling, we can simply reuse the scaling value obtained earlier. The only parameter left for tuning is the rank of the decomposition. Conveniently, it requires computing the model only once! We can use Polara's built-in find_optimal_svd_rank routine for that.

Note that some calculations with the tag similarity matrix may eat up to 10Gb of RAM due to its relatively high density.
In [40]:
from polara.evaluation.pipelines import find_optimal_svd_rank
In [41]:
hsvd.col_scaling = svd_best_config['col_scaling'] # reuse PureSVD config
# perform rank tuning (will compute the model only once for the max_rank value)
hsvd_best_rank, hsvd_rank_scores = find_optimal_svd_rank(
    hsvd, rank_grid, target_metric, return_scores=True, iterator=track
)
# restore item features embeddings with the maximal rank value
hsvd.update_item_features_transform()
Updating items relations matrix
100% 41/41 [00:08<00:00, 0.20s/it]
In [42]:
print(f'The best {target_metric}={hsvd_rank_scores.loc[hsvd_best_rank]:.4f} '
      f'is achieved at rank={hsvd_best_rank}')
The best precision=0.0458 is achieved at rank=135

The result is much better than for PureSVD! Let us compare all three models together.

Comparing tuning results for all models

Note that this is still a preliminary result, awaiting confirmation on the test data, provided later in this tutorial.

In [43]:
ax = hsvd_rank_scores.sort_index().plot(label='HybridSVD', marker='.')
svd_scores.groupby('rank').max().plot(ax=ax, label='SVD', ls=':')
ax.axhline(study.best_value, label='LightFM (best)', ls='--')
ax.legend()
ax.set_ylabel('precision')
ax.set_title('Item Cold Start');

We see that HybridSVD has very good capacity compared to the other models in terms of prediction quality. It seems that HybridSVD can be further improved with higher rank values. There is also some room for improvement in a more careful tuning of the scaling parameter and the weighting coefficient $\alpha$.

Interestingly, you can achieve the LightFM's quality with HybridSVD of rank 10, while the standard SVD-based model is unable to get even close to LightFM at any rank value!

Here is a quick check:

In [44]:
hsvd.rank = 10 # Polara automatically truncates latent factors in SVD-based models
print(f'HybridSVD\'s {target_metric} at rank {hsvd.rank} is '
      f'{find_target_metric(hsvd.evaluate(), target_metric):.4f}\n'
      f'LightFM\'s best {target_metric} score is {study.best_value:.4f}')
HybridSVD's precision at rank 10 is 0.0366
LightFM's best precision score is 0.0344

It is time to finally verify the obtained result on the test data.

Evaluation of models

Preparing data

In order to disable the Polara's built-in data splitting mechanism, we will employ two handy methods prepare_training_only and set_test_data. The former will instruct the data_model to utilize the whole training data for training without splitting it, and the latter will inject the test data, ensuring its overall consistency.

In [45]:
data_model_sim.prepare_training_only()
data_model_sim.set_test_data(holdout=test_data)
Preparing data...
Done.
There are 57830 events in the training and 0 events in the holdout.
Done. There are 4307 events in the holdout.

Note that in the data splitting provided by the fetch_stackexchange function some of the items belong to both the training and the test set. We will ignore this and treat all items in the test set as cold items. Recall that in our evaluation setup we generate a list of candidate users for every cold item and then verify the list against the actual interactions present in the holdout.

Preparing the models to compare

We need to ensure that all models use the same data to avoid any accidental discrepancies in the testing procedure. After that we train the models once again with the values of hyper-parameters, found during the tuning phase, and report the final results.

LightFM

In [46]:
lfm = create_lightfm_model(data_model_sim, item_tags, num_threads, seed)
lfm.rank = study.best_params['rank']
lfm.item_alpha = study.best_params['item_alpha']
lfm.fit_params['epochs'] = study.best_trial.user_attrs['epochs']
lfm.build()
LightFM training time: 4.090s

SVD

In [47]:
svd = ScaledSVDItemColdStart(data_model_sim)
svd.col_scaling = svd_best_config['col_scaling']
svd.rank = svd_best_config['rank']
svd.build()
PureSVD(cs)-s training time: 0.112s

HybridSVD

In [48]:
hsvd.rank = hsvd_best_rank
hsvd.build()
Updating items relations matrix
Performing sparse Cholesky decomposition for items similarity
Cholesky decomposition computation time: 01m:47s
HybridSVD(cs)-s training time: 52.765s
Building items projector for HybridSVD(cs)-s
    Solving triangular system: 1.852s
    Applying Cholesky factor: 48.022s

Analyzing the results

Let us first gather all the scores into a single dataframe, which is more convenient.

In [49]:
from polara.evaluation.evaluation_engine import consolidate_metrics
In [50]:
all_scores = {
    'SVD (best)': svd.evaluate(),
    'LightFM (best)': evaluate_lightfm(lfm),
    f'HybridSVD (rank {hsvd.rank})': hsvd.evaluate()
}
In [51]:
all_scores_df = pd.concat([consolidate_metrics(scores, model, False)
                           for model, scores in all_scores.items()])

During the tuning phase we focused on the precision metric. It is now time to look at other metrics as well, particularly the coverage score. It is calculated as the ratio of all unique recommendations generated by an algorithm to all unique entities of the same type present in the training data. Obviously, the maximum value is 1 (100% coverage), and it can typically be achieved only with randomly generated recommendations. The coverage metric characterizes the tendency of an algorithm to generate the same recommendations over and over again and is, therefore, linked to the diversity of recommendations. Higher diversity helps mitigate the famous "Harry Potter" problem, sometimes also called the "bananas problem". I personally like the term "tyranny of the majority". More diverse recommendations are likely to improve the overall user experience as long as the relevance of recommendations remains high enough. They may also improve overall product sales. However, in practice there is an inverse relationship between diversity and accuracy of recommendations: increasing one of them may decrease the other. The underlying phenomenon is the susceptibility of many recommendation algorithms to popularity biases in long-tail distributed data. We have already tried to partially address that problem by introducing the scaling parameter for SVD. Note that SGD-based algorithms like LightFM also have some control over it via different sampling schemes and customized optimization objectives.
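As a side note, the coverage definition above is easy to express directly. Below is a minimal sketch, not Polara's implementation, with hypothetical inputs.

import numpy as np

def coverage(recommendations, n_entities_in_training):
    '''Share of unique recommended entities (users, in the item cold start
    case) among all entities of that type seen in the training data.'''
    return np.unique(recommendations).size / n_entities_in_training

# hypothetical example: 3 cold items, top-2 recommended users each,
# 10 users present in the training data
print(coverage(np.array([[1, 2], [2, 3], [1, 2]]), 10))  # -> 0.3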

In [52]:
(all_scores_df
 .dropna(axis=1) # skip irrelevant metrics
 .loc[:, :'coverage']
 .style.bar(subset=[target_metric, 'coverage'], color='#5fba7d', align='mid')
)
Out[52]:
precision recall miss_rate nDCG coverage
SVD (best) 0.0228118 0.19749 0.80251 0.130738 0.016765
LightFM (best) 0.0322757 0.285704 0.714296 0.20512 0.190934
HybridSVD (rank 135) 0.0374726 0.328895 0.671105 0.237991 0.104315

As indicated by the green horizontal bars in the table above, the standard SVD-based model performs poorly both in terms of recommendation accuracy and in terms of diversity. Likewise, LightFM presents a trade-off between these two metrics. However, we have seen that HybridSVD has not yet reached its best performance. Note that lower rank values make SVD-based models insensitive to frequent variations in the observed user behavior, hence leading to a lower diversity of recommendations in the end. Let us try to increase the rank of the decomposition to verify that.

In [53]:
hsvd.rank = 400
hsvd.build()
HybridSVD(cs)-s training time: 02m:42s
Building items projector for HybridSVD(cs)-s
    Solving triangular system: 7.070s
    Applying Cholesky factor: 02m:28s
In [54]:
# add new HybridSVD scores to the evaluation results
all_scores_df.loc[f'HybridSVD (rank {hsvd.rank})', :] = [
    score for metrics in hsvd.evaluate() for score in metrics
]
In [55]:
all_scores_df[['precision', 'recall', 'nDCG', 'coverage']].T.plot.bar(rot=0);
plt.title('Combined evaluation results');

Remarkably, with HybridSVD we were able to substantially increase the diversity of recommendations without spoiling them.

In fact, we have even slightly increased the accuracy both in terms of relevance of recommendations (indicated by precision and recall) and in terms of ranking (indicated by nDCG metric)!

Conclusion

This tutorial demonstrates the capabilities of the HybridSVD model. The dataset we used here has a strong dependence on side information, and HybridSVD apparently takes the most out of it, significantly outperforming its competitors. If you find this tutorial helpful and use it on other data (or with other models), let me know about your results! Do you observe the same behavior?

Of course, this tutorial does not suggest that HybridSVD will always be better than any other hybrid model, including LightFM. However, HybridSVD at least presents an easy-to-tune yet very competitive baseline. It inherits the main advantages of the standard PureSVD approach and extends its functionality. Moreover, its hyper-parameters have a straightforward and intuitive effect on the quality of recommendations.

There are certain technical challenges, related to implementation of the algorithm, which I'm not going to discuss here (the post is already too long). Some of them will be addressed in forthcoming papers or in blog posts. However, if you have any specific question in mind, let me know in the comments section below or post an issue in the Polara's github repository.

To SVD or not to SVD [a primer on fair evaluation of recommender systems]

Introduction

In this tutorial we will go through a full cycle of model tuning and evaluation to perform a fair comparison of recommendation algorithms with Polara.

This will include 2 phases: grid-search for finding (almost) optimal values of hyper-parameters and verification of results via 5-fold cross-validation. We will compare a popular ALS-based matrix factorization (MF) model called Weighted Regularized Matrix Factorization (WRMF a.k.a. iALS) [Hu, 2008] with much simpler SVD-based models.

We will use the standard sparse implementation of SVD from Scipy for the latter and a great library called implicit for iALS. Both are wrapped by Polara and can be accessed via the corresponding recommender model classes. Due to its practicality, the implicit library is often recommended to beginners and sometimes even serves as a default tool in production. On the other hand, there are some important yet often overlooked features which make SVD-based models stand out. Ignoring them, in my opinion, leads to certain misconceptions and myths, not to mention that it also overcomplicates things quite a bit.

Note that by saying SVD I really mean Singular Value Decomposition, not just an arbitrary matrix factorization. In that sense, methods like FunkSVD, SVD++, SVDFeature, etc., are not SVD-based at all, even though historically they use the SVD acronym in their names and are often referenced as if they were real substitutes for SVD. These methods utilize another optimization algorithm, typically based on stochastic gradient descent, and do not preserve the algebraic properties of SVD. This is really an important distinction, especially in view of the following remarks:

  1. The SVD-based approach has a number of unique and beneficial properties. To name a few, it produces stable and deterministic output with global guarantees (which can be critical for non-regression testing). It admits the same prediction formula for both known and previously unseen users as long as at least one user rating is known (this is especially handy for the online regime). It requires minimal tuning and allows computing and storing a single latent feature matrix - either for users or for items - instead of computing and storing both of them. This luxury is not available in the majority of other matrix factorization approaches, definitely not in popular ones.
  2. Thanks to the Lanczos procedure, the computational complexity of truncated SVD scales linearly with the number of known observations and with the number of users/items. It scales quadratically only with the rank of the decomposition. There are open-source implementations allowing one to handle nearly billion-scale problems with one of its efficient randomized versions.
  3. At least in some cases the simplest possible model called PureSVD outperforms other more sophisticated matrix factorization methods [Cremonesi, 2010].
  4. Moreover, PureSVD can be quite easily tuned to perform much better [Nikolakopoulos, 2019].
  5. Finally, it can take a hybrid form to include side information via the generalized formulation (see Chapter 6 of my thesis). Even without hybridization it can be quite successfully applied in the cold start regime [Frolov, 2019].

Despite that impressive list, SVD-based models rarely make it into the list of baselines to compare against or to start with.

Hence, this tutorial also aims at performing an assessment of the default choice of many practitioners to see whether it really stays advantageous over the simpler SVD-based approach after a thorough tuning of both models.

References

  • [Hu, 2008] Hu Y., Koren, Y. and Volinsky, C., 2008. Collaborative Filtering for Implicit Feedback Datasets. In ICDM (Vol. 8, pp. 263-272). Link.
  • [Cremonesi, 2010] Cremonesi, P., Koren, Y. and Turrin, R., 2010. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems (pp. 39-46). ACM. Link.
  • [Nikolakopoulos, 2019] Nikolakopoulos, A.N., Kalantzis, V., Gallopoulos, E. and Garofalakis, J.D., 2019. EigenRec: generalizing PureSVD for effective and efficient top-N recommendations. Knowledge and Information Systems, 58(1), pp.59-81. Link.
  • [Frolov, 2019] Frolov, E. and Oseledets, I., 2019. HybridSVD: When Collaborative Information is Not Enough. To appear in Proceedings of the Thirteenth ACM Conference on Recommender Systems. ACM. Link

Downloading data

We will use the Movielens-10M dataset for our experiments. It's large enough to perform reliable evaluation, yet not so large that you would spend too many hours waiting for results. If you don't plan to play with the code, you may want to run all cells with a single command and leave the notebook running in the background while reading the text. It takes around 1.5h to complete this notebook on a modern laptop.

Note that you'll need an internet connection in order to run the cell below.

It will automatically download the data, store it in a temporary location, and convert it into a pandas dataframe. Alternatively, if you have already downloaded the dataset, you can use its local path as an input for the get_movielens_data function instead of tmp_file.

In [1]:
import urllib
from polara import (get_movielens_data, # returns data in the pandas dataframe format
                    RecommenderData) # provides common interface to access data 

url = 'http://files.grouplens.org/datasets/movielens/ml-10m.zip'
tmp_file, _ = urllib.request.urlretrieve(url) # this may take some time depending on your internet connection

data = get_movielens_data(tmp_file, include_time=True)
data.head()

The resulting dataframe has a bit more than 10M ratings, as expected.

In [3]:
data.shape
Out[3]:
(10000054, 4)

Basic data stats:

In [4]:
data.apply('nunique')
Out[4]:
userid         69878
movieid        10677
rating            10
timestamp    7096905
dtype: int64

Preparing data

As always, you first need to define a data model that will provide a common interface for all recommendation algorithms used in the experiments:

In [5]:
data_model = RecommenderData(data, 'userid', 'movieid', 'rating', custom_order='timestamp', seed=0)
data_model.fields
Out[5]:
Fields(userid='userid', itemid='movieid', feedback='rating')

Setting seed=0 ensures controllable randomization when sampling test data, which enhances reproducibility; custom_order allows selecting observations for evaluation based on their timestamp, rather than on rating value (more on that later).

Let's look at the default configuration of data model:

In [6]:
data_model.get_configuration()
Out[6]:
{'test_ratio': 0.2,
 'negative_prediction': False,
 'warm_start': True,
 'permute_tops': False,
 'test_sample': None,
 'random_holdout': False,
 'test_fold': 5,
 'shuffle_data': False,
 'holdout_size': 3}

By default, Polara samples 20% of users and marks them for the test (test_ratio attribute). These users would be excluded from the training dataset if the warm_start attribute remained set to True (strong generalization test). However, in the iALS case such a setting would require running an additional half-step optimization (folding-in) for each test user in order to obtain their latent representation. To avoid that, we turn the "warm start" setting off and perform standard evaluation (weak generalization test). In that case test users are part of the training (except for the ratings that were held out for evaluation) and one can directly invoke scalar products of latent factors. Note that SVD recommendations do not depend on this setting due to the uniform projection formula applicable to both known and "warm" users: $r = VV^\top p$, where $r$ is a vector of predicted relevance scores, $p$ is a vector of any known preferences and $V$ is the item latent feature matrix.
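For clarity, here is a tiny numerical sketch of that projection formula; the matrices are random stand-ins, not the actual model.

import numpy as np

# r = V V^T p works for any preference vector p, whether the user was
# part of the training data or is completely new
V = np.linalg.qr(np.random.rand(100, 10))[0]  # orthonormal item factors
p = np.zeros(100); p[[3, 42]] = 5.0           # known ratings of some user
r = V @ (V.T @ p)                             # predicted relevance scores
top10 = np.argsort(-r)[:10]                   # recommend the highest-scored items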

In [7]:
data_model.holdout_size = 1 # hold out 1 item from every test user
data_model.random_holdout = False # take items with the latest timestamp
data_model.warm_start = False # standard case
data_model.prepare()
Preparing data...
Done.
There are 9986078 events in the training and 13976 events in the holdout.

The holdout_size attribute controls how many user preferences will be used for evaluation. The current configuration instructs the data model to hold out one item from every test user. The random_holdout=False setting along with the custom_order input argument of the data model make sure that only the latest rated item is taken for evaluation, which allows avoiding "recommendations from the future". All these items are available via data_model.test.holdout.

General configuration

technical settings

Sometimes, due to the size of the dataset, evaluation may take considerably longer than the actual training time. If that's the case, you may want to give Polara more resources to perform evaluation, which is mostly controlled by changing the memory_hard_limit and max_test_workers settings from their defaults. The former defines how much memory Polara is allowed to use when generating predictions for test users. Essentially, Polara avoids running an inefficient by-user loop for that task and makes the calculation in bulk, which allows invoking linear algebra kernels and speeding up calculations. This, however, generates a dense $M \times N$ matrix, where $M$ is the number of test users and $N$ is the number of all items seen during training. If this matrix doesn't fit into the memory constraint defined by memory_hard_limit (1Gb by default), calculations will be performed on a sequence of groups of $m<M$ users so that each group respects the constraint.

If you have enough resources, it can be a good idea to increase this memory limit. However, depending on hardware specs, manipulating a huge amount of test data can also be slow. In that case it can be useful to still split test users into smaller groups, but run calculations on each group in parallel. This can be achieved by setting the max_test_workers integer to some number above 0, which will spawn the corresponding number of parallel threads. For example, instead of generating a 60Gb matrix in a single run, one can define memory_hard_limit=15 and max_test_workers=4, which may help complete calculations faster.

In our case simply increasing the memory limit to 2Gb is sufficient for optimal performance.

In [8]:
from polara.recommender import defaults
In [9]:
defaults.memory_hard_limit = 2
defaults.max_test_workers = None # None is the default value

common config

evaluation

In order to avoid undesired effects related to the positivity bias, models will be trained only on interactions with ratings not lower than 4. This can be achieved by setting the feedback_threshold attribute of models to 4. For the same reason, during the evaluation only items rated with ratings $\geq 4$ will be counted as true positives and used to calculate evaluation scores. This is controlled by the switch_positive setting.

In [10]:
defaults.switch_positive = 4
init_config = {'feedback_threshold': 4} # alternatively could set defaults.feedback_threshold

The default metric for tuning hyper-parameters and selecting the best model will be Mean Reciprocal Rank (MRR).

In [11]:
target_metric = 'mrr'

model tuning

All MF models will be tested on the following grid of rank values (number of latent features):

In [12]:
max_rank = 150
rank_grid = range(10, max_rank+1, 10)

Creating and tuning models

We will start with a simple PureSVD model and use it as a baseline in comparison with its own scaling modification and the iALS algorithm.

PureSVD

On one hand, tuning SVD is limited due to its strict least squares formulation, which doesn't leave much freedom compared to more general matrix factorization approaches. On the other hand, this means fewer parameters to tune. Moreover, for any configuration of hyper-parameters, once you compute an SVD-based model of rank $r$ you can immediately obtain a model of rank $r' < r$ by a simple truncation procedure, which doesn't require recomputing the model. This makes grid search with SVD very efficient and allows exploring a broader hyper-parameter space in less time.
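The truncation trick is easy to see with SciPy's sparse SVD; below is a small sketch under the assumption of a random dense matrix.

import numpy as np
from scipy.sparse.linalg import svds

A = np.random.rand(200, 150)
u, s, vt = svds(A, k=50)            # the expensive step, done only once
order = np.argsort(-s)              # ensure descending order of singular values
u, s, vt = u[:, order], s[order], vt[order]

def truncated(u, s, vt, rank):
    'Obtain a lower-rank model by slicing, with no recomputation.'
    return u[:, :rank], s[:rank], vt[:rank]

u20, s20, vt20 = truncated(u, s, vt, 20)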

In [13]:
try: # import package to track grid search progress
    from ipypb import track # lightweight progressbar, doesn't depend on widgets
except ImportError:
    from tqdm import tqdm_notebook as track

from polara import SVDModel
from polara.evaluation.pipelines import find_optimal_svd_rank

%matplotlib inline
In [14]:
psvd = SVDModel(data_model)
In [15]:
# the model will be computed only once for the max value of rank
psvd_best_rank, psvd_rank_scores = find_optimal_svd_rank(psvd,
                                                         rank_grid,
                                                         target_metric,
                                                         config=init_config,
                                                         return_scores=True,
                                                         iterator=track)
100% 15/15 [01:03<00:04, 4.21s/it]

Note that in this case most of the time is spent on evaluation rather than on model computation. The model was computed only once. You can verify this by calling the psvd.training_time list attribute and seeing that it contains only one entry:

In [16]:
psvd.training_time
Out[16]:
[9.6594064]

Let's see how quality of recommendations changes with rank (number of latent features).

In [17]:
ax = psvd_rank_scores.plot(ylim=(0, None))
ax.set_xlabel('# of latent factors')
ax.set_ylabel(target_metric.upper());

Scaled PureSVD model

We will employ a simple scaling trick over the rating matrix $R$ that was proposed by the authors of the EIGENREC model [Nikolakopoulos, 2019]: $R \rightarrow RD^{f-1},$ where $D$ is a diagonal scaling matrix with elements corresponding to the norms of the matrix columns (or the square root of the number of nonzero elements in each column in the binary case). The parameter $f$ controls the effect of scaling and typically lies in the range [0, 1]. Finding the optimal value is an experimental task and will be performed via grid search. We will use the built-in support for such a model in Polara. If you're interested in the technical aspects of this implementation, see the Reproducing EIGENREC results tutorial.
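A minimal sketch of that scaling, assuming a sparse ratings matrix R in CSR format (Polara's ScaledSVD does the equivalent for us):

import numpy as np
import scipy.sparse as sp

def scale_columns(R, f=0.6):
    'Apply R -> R D^(f-1), with D holding the Euclidean norms of the columns.'
    col_norms = np.sqrt(np.asarray(R.power(2).sum(axis=0)).ravel())
    col_norms[col_norms == 0] = 1.0        # leave empty columns untouched
    return R @ sp.diags(np.power(col_norms, f - 1))

# truncated SVD is then computed on the rescaled matrix instead of R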

In [18]:
from polara.recommender.models import ScaledSVD
from polara.evaluation.pipelines import find_optimal_config # generic routine for grid-search

Now we have to compute the SVD model for every value of $f$. However, we can still avoid computing the model for each rank value by virtue of rank truncation.

In [19]:
def fine_tune_scaledsvd(model, ranks, scale_params, target_metric, config=None):
    rev_ranks = sorted(ranks, key=lambda x: -x) # descending order helps avoiding model recomputation
    param_grid = [(s1, r) for s1 in scale_params for r in rev_ranks]
    param_names = ('col_scaling', 'rank')
    return find_optimal_config(model,
                               param_grid,
                               param_names,
                               target_metric,
                               init_config=config,
                               return_scores=True,
                               force_build=False, # avoid recomputing the model
                               iterator=track)

We already know an approximate range of values for the scaling factor. You may also want to play with other values, especially when working with a different dataset.

In [20]:
ssvd = ScaledSVD(data_model) # create model
scaling = [0.2, 0.4, 0.6, 0.8]

ssvd_best_config, ssvd_scores = fine_tune_scaledsvd(ssvd,
                                                    rank_grid,
                                                    scaling,
                                                    target_metric,
                                                    config=init_config)
100% 60/60 [05:03<00:04, 5.05s/it]

Note that during this grid search the model was computed only len(scaling)=4 times; the other points were obtained via rank truncation. Let's see how the quality changes with different values of the scaling parameter $f$.

In [21]:
for cs in scaling:
    cs_scores = ssvd_scores.xs(cs, level='col_scaling')
    ax = cs_scores.plot(label=f'col_scaling: {cs}')
ax.set_title(f'Recommendations quality for {ssvd.method} model')
ax.set_ylim(0, None)
ax.set_ylabel(target_metric.upper())
ax.legend();
In [22]:
ssvd_rank_scores = ssvd_scores.xs(ssvd_best_config['col_scaling'], level='col_scaling')

The optimal set of hyper-parameters:

In [23]:
ssvd_best_config
Out[23]:
{'col_scaling': 0.6, 'rank': 130}

iALS

Using the implicit library in Polara is almost as simple as using the SVD-based models. Make sure you have it installed in your Python environment (follow the instructions at https://github.com/benfred/implicit ).

In [24]:
import os; os.environ["MKL_NUM_THREADS"] = "1" # as required by implicit
import numpy as np

from polara.recommender.external.implicit.ialswrapper import ImplicitALS
from polara.evaluation.pipelines import random_grid, set_config

defining hyper-parameter grid

The hyper-parameter space in this case is much broader. We will start by adjusting all hyper-parameters except the rank value; then, once an optimal config is found, we will perform a full grid search over the range of rank values defined by rank_grid.

In [25]:
als_params = dict(alpha = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100],
                  epsilon = [0.01, 0.03, 0.1, 0.3, 1],
                  weight_func = [None, np.sign, np.sqrt, np.log2, np.log10],
                  regularization = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3],
                  rank = [40] # enforce rank value for quick exploration of other parameters
                 )

To avoid overly long computation times, the grid search is performed over 60 random points, which is enough to hit the top 5% of the search space with 95% confidence. The grid is generated with the built-in random_grid function.
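
The "60 points" figure follows from the standard random search argument: the probability that at least one of $n$ uniformly sampled configurations lands in the top 5% of the search space is $1 - 0.95^n$. A quick sanity check:

n = 60
print(1 - 0.95 ** n)  # ~0.954, i.e. at least 95% confidence of hitting the top 5%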

In [26]:
als_param_grid, als_param_names = random_grid(als_params, n=60)
In [27]:
ials = ImplicitALS(data_model) # create model
In [28]:
ials_best_config, ials_grid_scores = find_optimal_config(ials,
                                                         als_param_grid, # hyper-parameters grid
                                                         als_param_names, # hyper-parameters' names
                                                         target_metric,
                                                         init_config=init_config,
                                                         return_scores=True,
                                                         iterator=track)
100% 60/60 [24:18<00:26, 24.30s/it]

rank tuning

In contrast to the SVD-based algorithms, iALS requires recomputing the model for every new rank value. Therefore, in addition to the previous 60 runs, the model will be computed len(rank_grid) more times (15 in this case), once per rank value.

In [29]:
ials_best_rank, ials_rank_scores = find_optimal_config(ials,
                                                       rank_grid,
                                                       'rank',
                                                       target_metric,
                                                       # configs are applied in the order they're provided
                                                       init_config=[init_config,
                                                                    ials_best_config],
                                                       return_scores=True,
                                                       iterator=track)
100% 15/15 [07:56<00:45, 31.71s/it]

Let's combine the best rank value with other optimal parameters:

In [30]:
ials_best_config.update(ials_best_rank)
ials_best_config
Out[30]:
{'alpha': 0.3,
 'epsilon': 0.3,
 'weight_func': <ufunc 'sqrt'>,
 'regularization': 0.03,
 'rank': 60}

visualizing rank tuning results

We can now see how all three algorithms compare to each other.

In [31]:
def plot_rank_scores(scores):
    ax = None
    for sc in scores:
        ax = sc.sort_index().plot(label=sc.name, ax=ax)
    ax.set_ylim(0, None)
    ax.set_title('Recommendations quality')
    ax.set_xlabel('# of latent factors')
    ax.set_ylabel(target_metric.upper());
    ax.legend()
    return ax
In [32]:
plot_rank_scores([ssvd_rank_scores,
                  ials_rank_scores,
                  psvd_rank_scores]);

It can be seen that scaling (PureSVD-s line) has a significant impact on the quality of recommendations. This is, however, a preliminary result, which is yet to be verified via cross-validation.

Model comparison

The results above were computed with a single train-test split, i.e., on a single fold. To verify them, we perform a full CV with the optimal parameters fixed. This can be achieved with the built-in run_cv_experiment function from Polara's evaluation engine, as shown below.

cross-validation experiment

In [33]:
from polara.evaluation import evaluation_engine as ee

Fixing optimal configurations:

In [34]:
set_config(psvd, {'rank': psvd_best_rank})
set_config(ssvd, ssvd_best_config)
set_config(ials, ials_best_config)

Performing 5-fold CV:

In [35]:
models = [psvd, ssvd, ials]
metrics = ['ranking', 'relevance', 'experience']

# run experiments silently
data_model.verbose = False
for model in models:
    model.verbose = False

# perform cross-validation on models, report scores according to metrics
cv_results = ee.run_cv_experiment(models,
                                  metrics=metrics,
                                  iterator=track)
100% 5/5 [05:23<01:07, 64.52s/it]

The output contains results for all folds:

In [36]:
cv_results.head()
Out[36]:
type             relevance   ranking  experience
metric                  hr       mrr    coverage
fold model
1    PureSVD      0.076857  0.029101    0.085902
     PureSVD-s    0.084729  0.032221    0.148946
     iALS         0.076428  0.028240    0.093489
2    PureSVD      0.079868  0.029385    0.089836
     PureSVD-s    0.086953  0.034059    0.150539

plotting results

We will plot the average scores along with their confidence intervals. The following function does this based on the raw CV output:

In [37]:
def plot_cv_results(scores, subplot_size=(6, 3.5)):
    scores_mean = scores.mean(level='model')
    scores_errs = ee.sample_ci(scores, level='model')
    # remove top-level columns with classes of metrics (for convenience)
    scores_mean.columns = scores_mean.columns.get_level_values(1)
    scores_errs.columns = scores_errs.columns.get_level_values(1)
    # plot results
    n = len(scores_mean.columns)
    return scores_mean.plot.bar(yerr=scores_errs, rot=0,
                                subplots=True, layout=(1, n),
                                figsize=(subplot_size[0]*n, subplot_size[1]),
                                legend=False);
In [38]:
plot_cv_results(cv_results);

The difference between PureSVD and iALS is not significant.

In contrast, the advantage of the scaled version of PureSVD, denoted as `PureSVD-s`, over the other models is much more pronounced, making it a clear favorite.
Interestingly, the difference is especially striking in terms of the coverage metric, which is defined as the ratio of the number of unique recommendations generated for all test users to the total number of items in the training data. This indicates that the generated recommendations are not only more relevant but also significantly more diverse.
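
For reference, here is a minimal illustration of how such a coverage metric can be computed. This is a simplified stand-in for the sake of clarity, not Polara's internal implementation.

import numpy as np

def catalog_coverage(recommended_items, n_items_in_training):
    # share of catalog items appearing at least once in the users' top-n lists
    unique_items = np.unique(np.concatenate([np.asarray(r) for r in recommended_items]))
    return len(unique_items) / n_items_in_training

# 3 test users with top-3 recommendations, 10 items in the training catalog
print(catalog_coverage([[0, 1, 2], [1, 2, 3], [2, 3, 4]], 10))  # 0.5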

comparing training time

Another important practical aspect is how long it takes to compute a model. Sometimes the best model in terms of recommendations quality can be the slowest to compute. You can check each model's training time by accessing its training_time list attribute. It holds the history of training runs; hence, if you have just performed a 5-fold CV experiment, the last 5 entries in this list will correspond to the training time on each fold. This information can be used to get the average time with some error bounds, as shown below.

In [39]:
import pandas as pd
In [40]:
timings = {}
for model in models:
    timings[f'{model.method} rank {model.rank}'] = model.training_time[-5:]
In [41]:
time_df = pd.DataFrame(timings)
time_df.mean().plot.bar(yerr=time_df.std(), rot=0, title='Computation time for optimal config, s');

PureSVD-s compares favorably to iALS, even though it requires a higher rank value, which results in a longer training time compared to PureSVD. Another interesting measure is how long it takes each model to achieve approximately the same quality. Note that all models give approximately the same quality at the optimal rank of iALS. Let's compare training times for this rank value.

In [42]:
fixed_rank_timings = {}
for model in models:
    model.rank = ials_best_config['rank']
    model.build()
    fixed_rank_timings[model.method] = model.training_time[-1]

pd.Series(fixed_rank_timings).plot.bar(rot=0, title=f'Rank {ials.rank} computation time, s')
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x211016ee4a8>

Clearly, computing SVD on this dataset is much faster than ALS. This may, however, vary on other datasets due to a different sparsity structure. Nevertheless, you can still expect SVD-based models to perform well thanks to the highly optimized BLAS and LAPACK routines they rely on.

Bonus: scaling for iALS

You may reasonably question whether the scaling trick also works for non-SVD-based models. Let's verify its applicability for iALS. We will reuse Polara's built-in scaling functionality to create a new class for the scaled iALS model.

In [43]:
from polara.recommender.models import ScaledMatrixMixin
In [44]:
class ScaledIALS(ScaledMatrixMixin, ImplicitALS): pass # similarly to how PureSVD is extended to its scaled version
In [45]:
sals = ScaledIALS(data_model)

In order to save time, we will reuse the optimal scaling configuration found while tuning the scaled version of PureSVD. Alternatively, you could include the scaling parameters into the grid search step by extending the als_param_grid and als_param_names variables. Still, taking the configuration of PureSVD-s should be a good enough approximation, at least for verifying the effect of scaling. The rest of the tuning, however, has to be repeated from the beginning.

hyper-parameter tuning

In [46]:
sals_best_config, sals_param_scores = find_optimal_config(sals,
                                                          als_param_grid,
                                                          als_param_names,
                                                          target_metric,
                                                          init_config=[init_config,
                                                                       ssvd_best_config], # the rank value will be overriden
                                                          return_scores=True,
                                                          iterator=track)
100% 60/60 [26:27<00:30, 26.45s/it]
In [47]:
sals_best_rank, sals_rank_scores = find_optimal_config(sals,
                                                       rank_grid,
                                                       'rank',
                                                       target_metric,
                                                       init_config=[init_config,
                                                                    ssvd_best_config,
                                                                    sals_best_config],
                                                       return_scores=True,
                                                       iterator=track)
100% 15/15 [09:09<00:46, 36.58s/it]

visualizing rank tuning results

In [48]:
plot_rank_scores([ssvd_rank_scores,
                  sals_rank_scores,
                  ials_rank_scores,
                  psvd_rank_scores]);

There seems to be no difference between the original and scaled versions of iALS. Let's verify this with a CV experiment.

cross-validation

You only need to perform the CV computations for the new model. The data configuration will be the same as before, since the data_model instance ensures a reproducible data state.

In [49]:
sals_best_config.update(sals_best_rank)
sals_best_config
Out[49]:
{'alpha': 0.1,
 'epsilon': 0.01,
 'weight_func': <ufunc 'sqrt'>,
 'regularization': 3.0,
 'rank': 60}
In [50]:
set_config(sals, sals_best_config)
In [51]:
sals.verbose = False
sals_cv_results = ee.run_cv_experiment([sals],
                                       metrics=metrics,
                                       iterator=track)
100% 5/5 [03:36<00:44, 43.26s/it]
In [52]:
plot_cv_results(cv_results.append(sals_cv_results));

Surprisingly, the iALS model remains largely insensitive to the scaling trick, at least in the current setting and for the current dataset.

Remark: You may want to repeat all experiments in a different setting with random_holdout set to True. My own results indicate that in this case iALS performs slightly better than PureSVD and also becomes more responsive to scaling, giving the same result as the scaled version of SVD. However, the scaled version of SVD is still easier to compute and tune.

Conclusion

With proper tuning, the quality of recommendations of one of the simplest SVD-based models can be substantially improved. Despite common beliefs, it turns out that PureSVD with a simple scaling trick compares favorably to the much more popular iALS algorithm. The former not only generates more relevant recommendations, but also makes them more diverse and potentially more interesting. In addition to that, it has a number of unique advantages and merely requires from scipy.sparse.linalg import svds to get started. Of course, the obtained results may not necessarily hold on all other datasets and require further verification.
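
For the record, here is roughly what such a bare-bones PureSVD baseline looks like. This is only a sketch assuming a binary user-item matrix; scaling, cold start handling, filtering of already seen items, and evaluation are all left out.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

R = sp.random(1000, 200, density=0.01, format='csr')  # toy interaction matrix
R.data[:] = 1  # binarize

_, _, Vt = svds(R, k=40)          # item latent space
V = Vt.T                          # shape: (n_items, rank)
scores = R.dot(V).dot(V.T)        # project users onto the latent space and reconstruct
top_n = np.argsort(-scores, axis=1)[:, :10]  # top-10 items per user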

However, in view of all its features and advantages, the scaled SVD-based model certainly deserves a place in the list of default baselines.

The result obtained here resembles the situation in the Natural Language Processing field, where a simple SVD-based model with proper tuning turns out to be competitive on a variety of downstream tasks, even in comparison with neural network-based models. In the end, it is all about adequate tuning and fair comparison.

As a final remark, this tutorial is part of a series of tutorials demonstrating usage scenarios for the Polara framework. Polara is designed to support openness and reproducibility of research. It provides a controlled environment and rich functionality with high-level abstractions, which allows conducting thorough experiments with minimal effort, and it can be used either to quickly test known ideas or to implement something new.

About this blog

"Oh, not again..."

"Is this yet another blog on machine learning containing ramblings on trivial topics and reincarnation of some dusty guides?" If this was one of your first thoughts, while being redirected to this website, then you are probably in the right place.

General vision

One of the main motivations for me to start this blog was the desire to share and elaborate on underexplored or overlooked ideas with potentially high practical impact. I will be focusing primarily on the topic of recommender systems and will present ideas through the lens of my experience working on both research and industrial projects. To some extent, it will have an anti-hype or anti-mainstream flavor. This means that sometimes I will revisit some good old concepts and attempt to make the most out of them. Check out my tutorial on using SVD to get a glimpse of what it may look like.

Like any personal blog, it is going to be opinionated to a certain degree. In a sense, it will reflect my "eigen" values and views, hence the name "EigenTheories" (although some people insist that I just misspelled the word "Eugene", which is a western variant of my name... who knows). It may also sometimes happen that I unknowingly present some common knowledge as a revelation. Shame on me in that case, and I will appreciate it if you take the time to tell me about that in the comments section! On the other hand, in a fully authoritarian way, I also reserve the right to occasionally blog about some other not-so-fancy stuff (like how to configure this blog or how to make pandas work more efficiently), which is useful for my work and research. No complaints are accepted here.

Values

From the scientific perspective, recommender systems constitute a vibrant interdisciplinary field with a whirling mix of math, engineering, behavioral sciences, philosophy, ethics, and even quantum computing. OK, the last one is more about grabbing your attention: there is an opinion that it does not add too much to what is already known from cross approximation theory and the maximum volume approach. Nevertheless, the field of recommender systems provides many opportunities for attacking practical problems with math. The beauty of mathematics has never stopped fascinating me, even though I was never trained as a pure mathematician. I am always amazed by the elegance of solutions achieved by an appropriate choice of mathematical abstractions and with the knowledge of some fundamental results. I hope to make this blog some sort of a tribute to that from a practical viewpoint.

Commitments

Two aspects that I will try to commit to in this blog are rigorous experimentation and reproducibility. Such commitment means that from time to time there will be somewhat longish posts with a certain level of technical detail seasoned with source code. I will also write about reproducing others' results. I will be using Polara – the recommender systems framework that helps make reproducible research easier (disclaimer: I am the author). As an example, I invite you to verify at least some of the results from our recent paper right in your browser! No setup required. No excuses not to try.

I find reproducibility and openness of research especially important in light of the recent critical reappraisal of many state-of-the-art methods based on both neural networks and classical algorithms. As funny as it may sound, many modern SOTA algorithms have only just reached the level of properly tuned decade-old baselines. Do not get me wrong, I am not trying to defy any of the recent achievements. In fact, I would be more than happy to have these achievements advance us to some new technological edges. It just requires a bit more careful work, in my opinion. The least I can do to help the situation is to point out some weak points that could be improved, make my own research transparent, and enable others to place my work under scrutiny. Hopefully, blogging will help me with that.

Embracing uncertainty

Developing good recommender systems is difficult due to many uncertainties and even controversies in modelling the decision-making process. There is not even an agreement on how to perform quality evaluation and how to reliably compare algorithms with each other. Moreover, practitioners frequently observe a discrepancy between online A/B tests conducted with real users and offline verification on historical data, which is commonly used at the initial phases of research. Of course, this does not mean that the recommender systems field is broken and doomed. It just shows that there is still a lot to be learned. This is a long journey, which I will be reflecting on down the road.

I have probably said more than enough already to introduce the blog. As a final remark, it won't be a regular blog with a weekly posting schedule. Crafting something interesting takes time and requires a tremendous amount of effort. We will see how it goes. May the algorithms guide you.