Introduction

This report is being created for the final and only project of the Advanced Topics in Data Science (ATDS) course. This project entails the usage of a Farfetch dataset, containing anonymized information about luxury fashion brand purchases, for the creation of a brand recommender system.

The objectives of this project encompass 3 different tasks.
Task 1 - Data preprocessing and visualization.
Task 2 - Recommender systems using popularity, association rules or collaborative filtering; using both ROC and precision-recall curves with top N of 1,2 and 6 for comparisons.
Task 3 - Open ended task involving the improvement of results by employing contextual information other than user id and brand.

For task 1 almost no preprocessing was required, and as such, an extensive visualization process was undertaken. The table and plot outputs were, due to size, placed at the end of the report as an annex. In task 2 the set of recommender systems is created utilizing different data subsets of different dimensions and the results compared using cross-validation. An example for each model type for a single user is also presented. For task 3, an LSTM model was selected to employ the contextual information and attempt to improve results.

Import Libraries

To make it easier to run and thus replicate attained results all required libraries are imported here and should be properly installed by the user before running. It is important to stress that the order in which the libraries are imported is relevant as there are some cases of function masking, which can cause problems when libraries are loading in a different order.

library(keras)
library(tensorflow)
library(recommenderlab)
library(dplyr)
library(tidyr)
library(readr)
library(plotly)
library(e1071)
library(ggpubr)
library(ggplot2)
library(reshape2)
library(psych)
library(proxy)
library(groupdata2)
library(abind)
library(caret)

Read Data

The first task that must be undertaken is the reading of the datasets from files into memory to allow further manipulation. Here relative paths are used, and it is assumed that the files are in the same directory as the markdown file. Of course, these paths can be changed by the user to load the dataset from wherever it would be most convenient.

Both files that were provided for this project are read, with the first one being user_brands, containing most of the information and the second one containing only the Farfetch created 120-dimensional brand embeddings. The contents of the second file will only be used in task 3.

userBrandsTible = read_csv("users_brands.csv")
userBrandsDF = as.data.frame(userBrandsTible)
brandFeatures = read_csv("brand_features.csv")

Task 1

Task 1 entails both the data visualization and preprocessing steps. An initial analysis of the contents led to the conclusion that no preprocessing steps are required for data visualization. And preprocessing required is problem dependent and is left for when it is required.

Content Inspection

The first visualization task that was undertaken is the inspection of some data entries, not only to check the particular values and value formats but also the respective types and any possible anomalies in the loading process.

head(userBrandsDF)

##                                                            user_id brand_id
## 1 65b4c425c57d4d5cd6a09b6721f76dfe76fca72000a58ad1b252ef4ba10dfb0d 92850413
## 2 e45ae0e683895ab429483c3aaeba836c9e6904692daebad297c5e370d735d6df 59833398
## 3 ab112830f538338e386af96be442ea1d9e81d76fabc17b16fffa8f98be478241 72430441
## 4 577321c8f3b5bff9e5470184606207ed04edc14a82f706a92e9fe62c998d2aba 29631653
## 5 75624f1d112034145f181fb895479093f94ea2a7de0e48d582609e35dcd89193 50327475
## 6 36e368697be4d8d37b94338581c0b963f99db8c2c0b508bf80e28ede94c18930 48210076
##   platform     country user_segment purchase_date perc_sale
## 1 35279833  country_76    segment_2           154       0.0
## 2 35279833 country_144    segment_1           114       0.0
## 3 35279833  country_35    segment_4           204  (25, 50]
## 4 82921252 country_177    segment_3           139  (25, 50]
## 5 35279833 country_125    segment_0           170       0.0
## 6 59713904  country_74    segment_2           174       0.0

str(userBrandsDF)

## 'data.frame':    846490 obs. of  7 variables:
##  $ user_id      : chr  "65b4c425c57d4d5cd6a09b6721f76dfe76fca72000a58ad1b252ef4ba10dfb0d" "e45ae0e683895ab429483c3aaeba836c9e6904692daebad297c5e370d735d6df" "ab112830f538338e386af96be442ea1d9e81d76fabc17b16fffa8f98be478241" "577321c8f3b5bff9e5470184606207ed04edc14a82f706a92e9fe62c998d2aba" ...
##  $ brand_id     : num  92850413 59833398 72430441 29631653 50327475 ...
##  $ platform     : num  35279833 35279833 35279833 82921252 35279833 ...
##  $ country      : chr  "country_76" "country_144" "country_35" "country_177" ...
##  $ user_segment : chr  "segment_2" "segment_1" "segment_4" "segment_3" ...
##  $ purchase_date: num  154 114 204 139 170 174 172 142 161 139 ...
##  $ perc_sale    : chr  "0.0" "0.0" "(25, 50]" "(25, 50]" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   user_id = col_character(),
##   ..   brand_id = col_double(),
##   ..   platform = col_double(),
##   ..   country = col_character(),
##   ..   user_segment = col_character(),
##   ..   purchase_date = col_double(),
##   ..   perc_sale = col_character()
##   .. )

From this the first thing to note is the overall dataset shape, containing 846490 observations from 7 different variables.

By performing this analysis some important data facets were identified. The first is that, although some variables have numerical type, all but purchace_date are categorical in meaning, enforcing the need from special care on their investigation. Another important factor regards the values for user_id, which were encoded in a large string which can make results difficult to analyze individually. Finally, the last feature that is important to note is that perc_sale presents its values in the form of discount intervals.

Statistical Moments

To better understand the overall data distributions an extensive statistical moment table was created, containing, among others, variable types, missing value counts, extremes, rages, quantiles, mean, mode, variance and number of unique values. Although, as most variables are categorical in nature, and as such these statistical moments have little meaning, they still provide important insight to the data.

Mode <- function(x) {
  uniques <- unique(x)
  uniques[which.max(tabulate(match(x, uniques)))]
}

ModeCount <- function(x) {
  uniques <- unique(x)
  max(tabulate(match(x, uniques)))
}

userBrandsStatistics = data.frame(matrix(ncol = ncol(userBrandsDF), nrow = 0))
colnames(userBrandsStatistics) = colnames(userBrandsDF)
for (column in colnames(userBrandsDF)){
  columnData = pull(userBrandsDF,column)
  userBrandsStatistics["type",column] = mode(columnData)
  userBrandsStatistics["na count",column] = sum(is.na(columnData))
  userBrandsStatistics["min",column] = min(columnData)
  userBrandsStatistics["25 quantile",column] = tryCatch(
    {quantile(columnData, 0.25)},
    error = function(e) { return(NA) }
  )
  userBrandsStatistics["mean",column] = mean(columnData)
  userBrandsStatistics["median",column] = median(columnData)
  userBrandsStatistics["75 quantile",column] = tryCatch(
    {quantile(columnData, 0.75)},
    error = function(e) { return(NA) }
  )
  userBrandsStatistics["max",column] = max(columnData)
  userBrandsStatistics["range",column] = tryCatch(
    {as.numeric(userBrandsStatistics["max",column])- as.numeric(userBrandsStatistics["min",column])},
    error = function(e) { return(NA) }
  )
  userBrandsStatistics["IQR",column] = tryCatch(
    {as.numeric(userBrandsStatistics["75quantile",column])- as.numeric(userBrandsStatistics["25quantile",column])},
    error = function(e) { return(NA) }
  )
  userBrandsStatistics["sd",column] = sd(columnData)
  userBrandsStatistics["var",column] = var(columnData)
  userBrandsStatistics["coef var",column] = as.numeric(userBrandsStatistics["sd",column]) / as.numeric(userBrandsStatistics["mean",column])
  userBrandsStatistics["skewness",column] = tryCatch(
    {skewness(columnData)},
    error = function(e) { return(NA) }
  )
  userBrandsStatistics["kurtosis",column] = tryCatch(
    {kurtosis(columnData)},
    error = function(e) { return(NA) }
  )
  userBrandsStatistics["mode",column] = Mode(columnData)
  userBrandsStatistics["mode count",column] = ModeCount(columnData)
  userBrandsStatistics["unique count",column] = length(unique(columnData))
  userBrandsStatistics["sum",column] = tryCatch(
    {sum(columnData)},
    error = function(e) { return(NA) }
  )
}
userBrandsStatistics

##                                                                       user_id
## type                                                                character
## na count                                                                    0
## min          00001fb7f8a0bab989afb3b2e3f87091f19c624f4c918f5e1fbb3c2ee12e9db6
## 25 quantile                                                              <NA>
## mean                                                                     <NA>
## median                                                                   <NA>
## 75 quantile                                                              <NA>
## max          ffffede449ad3486ffee78949e0d34db24c36de30f71e1a20182df8d658281a2
## range                                                                    <NA>
## IQR                                                                      <NA>
## sd                                                                       <NA>
## var                                                                      <NA>
## coef var                                                                 <NA>
## skewness                                                                 <NA>
## kurtosis                                                                 <NA>
## mode         e646f8406d203c00922016d163433c7665bdd4d3b6260e72626a2900c7910727
## mode count                                                                 24
## unique count                                                           329213
## sum                                                                      <NA>
##                         brand_id          platform     country user_segment
## type                     numeric           numeric   character    character
## na count                       0                 0           0            0
## min                       102532          15766995   country_0    segment_0
## 25 quantile             27968276          35279833        <NA>         <NA>
## mean            52775913.8391416  47381156.6653605        <NA>         <NA>
## median                  51215738          39857033        <NA>         <NA>
## 75 quantile             80735391          59713904        <NA>         <NA>
## max                     99855154          96462699  country_99    segment_5
## range                   99752622          80695704        <NA>         <NA>
## IQR                         <NA>              <NA>        <NA>         <NA>
## sd              29031126.4920308  13546980.7862736        <NA>         <NA>
## var              842806305396293   183520688423666        <NA>         <NA>
## coef var       0.550082876452244 0.285914944667815        <NA>         <NA>
## skewness     0.00453599969320028 0.559633782166231        <NA>         <NA>
## kurtosis       -1.25236279783164 -0.64813779627103        <NA>         <NA>
## mode                    89083147          35279833 country_118    segment_1
## mode count                 23022            379237      189617       355539
## unique count                1771                12         175            6
## sum               44674283305695    40107675305661        <NA>         <NA>
##                   purchase_date perc_sale
## type                    numeric character
## na count                      0         0
## min                           1   (0, 25]
## 25 quantile                  67      <NA>
## mean           119.200774964855      <NA>
## median                      129      <NA>
## 75 quantile                 170      <NA>
## max                         231       0.0
## range                       230      <NA>
## IQR                        <NA>      <NA>
## sd             64.4743898286592      <NA>
## var            4156.94694377791      <NA>
## coef var      0.540889015592967      <NA>
## skewness     -0.169298217417195      <NA>
## kurtosis      -1.10177341210125      <NA>
## mode                        161       0.0
## mode count                14798    383519
## unique count                231         5
## sum                   100902264      <NA>

The most important takeaways from this large table are as follows:

For user_id the most common value only appears 24 times, while there are 329213 unique values for less than 3 times as many observations. This leads to the conclusion that the dataset is extremely sparce and special care will have to be had in its analysis.

For brand_id the mode if far more common, with 23022 appearances, being as such very well represented. With 1771 unique values there is much less sparsity than for user_id, although is still large enough for an individual analysis to be unreasonable.

The last variable that will further detailed is purchace_date, for which it was verified that a total of 230 unique values are present, following a mostly symmetric distribution. Due to the kurtosis values, it is also concluded that the distribution is closer to uniform than normal.

From this point although an individual analysis could be done for each other variables, the conclusions are almost identical to those for brand_id. The only important remark to make is that the variables platform, user_segment and perc_sale have a low enough number of unique values to warrant an individual value analysis.

Unique Value Analysis

With this in mind, the unique values for these 3 variables were obtained and studied. From this it was possible to see that the values for platform and user_segment are not meaningful, being only randomly assigned ids. Perc_sale, one the other hand, shows 5 values, 4 of which are 25% sale intervals (25% to 50%, 75% to 100%, etc.) and the last, with value 0, indicates no discount. This suggests that perc_sale is ordinal, and as such could be label encoded with little loss of meaning.

print(unique(userBrandsDF$platform))

##  [1] 35279833 82921252 59713904 39857033 50298854 34959346 24380949 54454273
##  [9] 15766995 22734142 47514527 96462699

print(unique(userBrandsDF$user_segment))

## [1] "segment_2" "segment_1" "segment_4" "segment_3" "segment_0" "segment_5"

print(unique(userBrandsDF$perc_sale))

## [1] "0.0"       "(25, 50]"  "(0, 25]"   "(50, 75]"  "(75, 100]"

To finalize the set of data statistical moments, a second table containing moments that should be more meaningful for categorical variables was created. For this table, instead of utilizing the values themselves, the number of occurrences per unique value is employed.

uniqueStatistics = data.frame(matrix(ncol = ncol(userBrandsDF), nrow = 0))
colnames(uniqueStatistics) = colnames(userBrandsDF)
for (column in colnames(userBrandsDF)){
  columnData = (userBrandsDF %>% group_by_at(column) %>% tally())$n
  uniqueStatistics["type",column] = mode(columnData)
  uniqueStatistics["count",column] = length(columnData)
  uniqueStatistics["na count",column] = sum(is.na(columnData))
  uniqueStatistics["min",column] = min(columnData)
  uniqueStatistics["25 quantile",column] = tryCatch(
    {quantile(columnData, 0.25)},
    error = function(e) { return(NA) }
  )
  uniqueStatistics["mean",column] = mean(columnData)
  uniqueStatistics["median",column] = median(columnData)
  uniqueStatistics["75 quantile",column] = tryCatch(
    {quantile(columnData, 0.75)},
    error = function(e) { return(NA) }
  )
  uniqueStatistics["max",column] = max(columnData)
  uniqueStatistics["range",column] = tryCatch(
    {as.numeric(uniqueStatistics["max",column])- as.numeric(uniqueStatistics["min",column])},
    error = function(e) { return(NA) }
  )
  uniqueStatistics["IQR",column] = tryCatch(
    {as.numeric(uniqueStatistics["75 quantile",column]) - as.numeric(uniqueStatistics["25 quantile",column])},
    error = function(e) { return(NA) }
  )
  uniqueStatistics["sd",column] = sd(columnData)
  uniqueStatistics["var",column] = var(columnData)
  uniqueStatistics["coef var",column] = as.numeric(uniqueStatistics["sd",column]) / as.numeric(uniqueStatistics["mean",column])
  uniqueStatistics["skewness",column] = tryCatch(
    {skewness(columnData)},
    error = function(e) { return(NA) }
  )
  uniqueStatistics["kurtosis",column] = tryCatch(
    {kurtosis(columnData)},
    error = function(e) { return(NA) }
  )
  uniqueStatistics["mode",column] = Mode(columnData)
  uniqueStatistics["mode count",column] = ModeCount(columnData)
  uniqueStatistics["unique count",column] = length(unique(columnData))
}
print(uniqueStatistics)

##                       user_id         brand_id          platform
## type                  numeric          numeric           numeric
## count                  329213             1771                12
## na count                    0                0                 0
## min                         1                1                 3
## 25 quantile                 1               14            376.25
## mean         2.57125326156622 477.972896668549  70540.8333333333
## median                      2               57             12824
## 75 quantile                 3              229           34408.5
## max                        24            23022            379237
## range                      23            23021            379234
## IQR                         2              215          34032.25
## sd           2.33704794337009  1726.6482940855  136023.169010974
## var          5.46179308961037 2981314.33146838  18502302507.7879
## coef var     0.90891394414665 3.61243975572708  1.92828979448273
## skewness     2.69042099494101 7.72615321730558    1.544551106458
## kurtosis     9.35063859226698 72.5998431649631 0.527163457775885
## mode                        1                1                 3
## mode count             132091               75                 2
## unique count               24              585                11
##                       country      user_segment     purchase_date
## type                  numeric           numeric           numeric
## count                     175                 6               231
## na count                    0                 0                 0
## min                         1             42100              1982
## 25 quantile                 9           55811.5            2860.5
## mean         4837.08571428571  141081.666666667  3664.45887445887
## median                    122          118175.5              3379
## 75 quantile              2147         162651.25              4157
## max                    189617            355539             14798
## range                  189616            313439             12816
## IQR                      2138         106839.75            1296.5
## sd           17755.2890066876  117286.073484735  1320.66591463699
## var          315250287.711002  13756023033.4667  1744158.45808395
## coef var     3.67065833757083 0.831334618138913 0.360398618153959
## skewness     7.47568085149853 0.813596576523874  3.65298340534392
## kurtosis     68.0561319345571 -0.97206332327874  23.3827130959389
## mode                        1             46523              4029
## mode count                  9                 1                 2
## unique count              127                 6               227
##                      perc_sale
## type                   numeric
## count                        5
## na count                     0
## min                        908
## 25 quantile              48699
## mean                    169298
## median                   91704
## 75 quantile             321660
## max                     383519
## range                   382611
## IQR                     272961
## sd            171773.801569098
## var              29506238905.5
## coef var      1.01462392685736
## skewness     0.251431661408326
## kurtosis     -2.15079936814689
## mode                     91704
## mode count                   1
## unique count                 5

At a first glance it is immediately obvious that the results which were obtained are far more helpful. From this table, the obtained conclusions were as follows:

For user_id it is verified that there are 132091 users that only appear once, although over half of them appear at least twice. Of course, the vast majority of users show less than 3 times. This distribution is also far from normal, being heavily left skewed.

Brand_id shows an even more left skewed distribution. With 75 brands having only one entry, and thus being very poorly represented, with an accurate description of them not being feasible. Even if the bottom 50% brands were discarded, with a median value of 57, it would still be very difficult to properly understand the distribution of the remaining. Of course, the top 50% brands present a much larger number of entries and, as such, most of the dataset would still be kept.

Both platform and country show similar results to the ones presented before, thus making separate analysis for each of these variable’s values redundant and undesirable.

User_segment, on the other hand, has a distribution relatively close to normal, and with the most infrequent value appearing 42100 times, an analysis per segment would be reasonable.

With purchace_date being the only numerical variable, this analysis does not add much. It is furthermore not reasonable to make a model per date, regardless of the number of occurrences, since we would expect a recommender system to work on different, later dates.

Finally, perc_sale shows a surprising result given the very low number of unique values. This is the fact that the least common value only appears 908 times. Nevertheless, disregarding this, the variable presents a mostly centered distribution.

Distributions

The different statistical summary tables previously analyzed provided a great deal of insight into the dataset. Despite this, due to the aggregation of information into a set of single values a lot of detail is lost. Taking this into consideration the following step involves the further study of variable distributions aided by a large set of graphical plots. This set of plots is comprised of bar plots, histograms, boxplots, normal Q-Q plots, pie plots and a line chart.

Bar Plots

Given the mostly categorical nature of the data, one of the most important plot types is the bar plot, for which all available variables are applicable, even if not necessarily very compatible. With the intent of making these understandable, a maximum of 50 different classes/values is displayed at a time, which has as a downside not giving a full overview of the data. The most important takeaways from these plots are:

for (column in colnames(userBrandsDF)){
  columnData = pull(userBrandsDF,column)
  t = table(columnData)
  barplot(
    sort(t, decreasing=TRUE)[1:min(length(unique(columnData)),50)],
    xaxt = "n", main=paste("Barplot of",column)
  )
}

From the bar plot of user_id, despite only a minimal number of values being shown it can be clearly be seen that the higher the value the less frequent it is. It is easy to imagine this plot going through all values, with a longer and longer tail.

For brand_id a different story is told, with only a few very common brands that overshadow the vast majority. This of course, means that only a minority of brands is properly represented, with all others lacking in terms of information.

This is even more noticeable for the platform variable, with 2 platforms having each more prevalence than all others combined.

The same is repeated for the variable country, with one being by far the most prevalent. Upon a quick search it was concluded that this is likely the USA, with the second highest being the UK.

The user_segment plot is, in contrast, very well balanced, with the most common class not being much more common than the others. similarly, purchace_date is also quite balanced, apart from the outliers, possibly around holidays

Finally, perc_sale has two values that are by far the most common and one which barely appears. A further analysis of this was performed, which will be detailed further along the visualization process.

Histograms

After bar plots the creation of histograms for all applicable variables proceeded. Given the existence of only one numerical variable, and it being related to time, which usually does not give much insight on its own, these plots were expected not be very useful, which was confirmed. Thus, not a lot of insight is gained from this alone. Nevertheless, they were kept for completeness of the process.

for (column in colnames(userBrandsDF)){
  columnData = pull(userBrandsDF,column)
  tryCatch(
    {
      hist(columnData, main=paste("Histogram of",column))
    },
    error = function(e) {}
  )
}

Box Plots

Box plots followed the histograms, but once again being mostly a tool for analysis of numerical variables they presented little in the way of new information. Once again, they were, nevertheless, kept.

for (column in colnames(userBrandsDF)){
  columnData = pull(userBrandsDF,column)
  tryCatch(
    {
      boxplot(columnData, main=paste("Boxplot of",column))
    },
    error = function(e) {}
  )
}

Q-Q Plots

To verify the conclusions regarding the normality of the different variables a set of Q-Q plots was created. Despite many different attempts, due to the length of time it takes for the plots to be generated, Rmarkdown does not plot them. As such, to observe them they must be done in the command line. Therefore, in this case the images are loaded from a file.

for (column in colnames(userBrandsDF)){
  columnData = pull(userBrandsDF,column)
  t = table(columnData)
  tryCatch(
    {
      ggqqplot(userBrandsDF,x=column, main=paste("Normal Q-Q Plot of",column))
    },
    error = function(e) {print(column)}
  )
}

brand_id QQplot platform QQplot purchace_date QQplot

From the analysis of these plots, it was verified that most of points fall outside the confidence bands, and as such it becomes obvious that the data is, in fact, not normally distributed.

Pie Charts

The pie chart is a tool that is often ignored or even discouraged in a data visualization pipeline, with more importance given to the bar plot. However, is this case, due to the overly large number of unique categorical values and the heavy left skewness of variable frequencies, the pie chart can be very useful for a brief and informative plot.

for (column in colnames(userBrandsDF)){
  columnData = pull(userBrandsDF,column)
  t = table(columnData)
  theshold = 2.1
  relFreq = as.integer(t)/sum(t)*100
  if (length(which(relFreq<=theshold))>0){
    slices = c(
      relFreq[which(relFreq>theshold)],
      sum(relFreq[which(relFreq<=theshold)])
    )
    labels = c(
      names(t)[which(relFreq>theshold)],
      "Others"
    )
  }
  else{
    slices = relFreq
    labels = names(t)
  }
  if (length(slices)>=2){
    labels <- paste(labels, signif(slices,2), sep=" - ") # add percents to labels
    labels <- paste(labels,"%",sep="") # ad % to labels
    pie(
      slices,labels = labels, 
      col=rainbow(length(labels)),
      main=paste("Pie Chart of",column)
    )
  }
}

For the creation of the pie charts, any variable whose occurrence is less than 2.1% is combined into the “other” slice, while if there are less than 2 slices after the process, the plot is skipped. The important information obtained from this set of plots is as follows:

The chart of brand_id shows that there are 4 brands that dominate. These, nevertheless, only account for 10% of the total entries.

For platform there are 2 very common values, with others having roughly the same amount of participation.

The third pie chart of country shows a more diverse distribution, with 11 total countries having more than 2% participation. This however means that the remaining 164 account for only 29% of all purchases.

It is clear from the pie chart of user_segment that segment 1 is the most common, with 2 and 4 closely behind.

Finally, the last chart shows that a lack of any discount is the most common, while a discount between 75 and 100 is the least.

Line plot

The final plot type that will be presented is this section is the line plot. This was only done for the purchace_date variable, with the data aggregated and counted per individual date.

plot(
  (userBrandsDF %>% group_by(purchase_date) %>% tally())$n,
  type="l",xlab="purchase_date", ylab="Purchace Count"
)

In this data there is a clear important and central point, the outlier with a very high value. In the initial analysis of the statistical moments, it was concluded that this was due to holidays. Combining this with the knowledge that this data likely stems largely from the USA (a heavily Christian country), It was concluded that this spike comes immediately after Christmas. A larger than normal amount is observed before this, which declines (due to the purchase Christmas gifts), followed by a spike and rapid decline (caused by after Christmas gift swapping and heavily discounted prices).

Relations

With the combined knowledge obtained from the statistical tables and the set of univariate plots, the understanding of each individual variable is now very solid. However, there is still a quite large gap in the understanding of the dataset, regarding the relations between variable distributions. Similarly, to the analysis of single variable distributions, an initial analysis of statistical moments was performed, which was followed by a set of bivariate graphical plots.

Distribution by group

As mentioned, the first dual variable analysis entails the obtention of a statistical moment table, this time with the distributions being obtained split by the values of other variables. Due to the number of unique values, only 2 variables were split by their values, user_segment and perc_sale.

A second set of tables that is also obtained is the set contingency tables for platform, user_segment and perc_sale. Platform was here included, unlike for the statistics table, as the contingency table occupies less space, thus permitting the utilization of a variable with slightly more unique values.

describeBy(userBrandsDF, userBrandsDF$perc_sale)

## 
##  Descriptive statistics by group 
## group: (0, 25]
##               vars     n        mean          sd   median     trimmed
## user_id*         1 91704         NaN          NA       NA         NaN
## brand_id         2 91704 51775886.64 29956092.56 50039784 52116569.85
## platform         3 91704 48430440.77 13669659.47 50298854 47596890.34
## country*         4 91704         NaN          NA       NA         NaN
## user_segment*    5 91704         NaN          NA       NA         NaN
## purchase_date    6 91704      136.87       55.69      151      142.93
## perc_sale*       7 91704         NaN          NA       NA         NaN
##                      mad      min      max    range  skew kurtosis       se
## user_id*              NA      Inf     -Inf     -Inf    NA       NA       NA
## brand_id      38608176.1   287368 99855154 99567786  0.01    -1.26 98921.58
## platform      15481043.8 15766995 82921252 67154257  0.45    -0.73 45140.21
## country*              NA      Inf     -Inf     -Inf    NA       NA       NA
## user_segment*         NA      Inf     -Inf     -Inf    NA       NA       NA
## purchase_date       34.1        1      231      230 -0.97     0.16     0.18
## perc_sale*            NA      Inf     -Inf     -Inf    NA       NA       NA
## ------------------------------------------------------------ 
## group: (25, 50]
##               vars      n        mean          sd   median     trimmed
## user_id*         1 321660         NaN          NA       NA         NaN
## brand_id         2 321660 51393299.65 28839280.88 49160230 51384848.70
## platform         3 321660 47585810.05 13656657.47 39857033 46573225.17
## country*         4 321660         NaN          NA       NA         NaN
## user_segment*    5 321660         NaN          NA       NA         NaN
## purchase_date    6 321660      124.33       66.93      143      127.29
## perc_sale*       7 321660         NaN          NA       NA         NaN
##                       mad      min      max    range  skew kurtosis       se
## user_id*               NA      Inf     -Inf     -Inf    NA       NA       NA
## brand_id      33605662.76   102532 99855154 99752622  0.08    -1.23 50849.41
## platform       6786156.72 15766995 82921252 67154257  0.57    -0.61 24079.41
## country*               NA      Inf     -Inf     -Inf    NA       NA       NA
## user_segment*          NA      Inf     -Inf     -Inf    NA       NA       NA
## purchase_date       66.72        1      231      230 -0.47    -1.05     0.12
## perc_sale*             NA      Inf     -Inf     -Inf    NA       NA       NA
## ------------------------------------------------------------ 
## group: (50, 75]
##               vars     n        mean          sd   median    trimmed
## user_id*         1 48699         NaN          NA       NA        NaN
## brand_id         2 48699 54486519.88 29055104.84 53637929 55031252.0
## platform         3 48699 46455785.69 13485960.30 35279833 45224198.0
## country*         4 48699         NaN          NA       NA        NaN
## user_segment*    5 48699         NaN          NA       NA        NaN
## purchase_date    6 48699      112.16       71.83      117      111.7
## perc_sale*       7 48699         NaN          NA       NA        NaN
##                       mad      min      max    range  skew kurtosis        se
## user_id*               NA      Inf     -Inf     -Inf    NA       NA        NA
## brand_id      39915704.73   102532 99855154 99752622 -0.05    -1.30 131662.60
## platform       6786156.72 15766995 82921252 67154257  0.75    -0.38  61111.35
## country*               NA      Inf     -Inf     -Inf    NA       NA        NA
## user_segment*          NA      Inf     -Inf     -Inf    NA       NA        NA
## purchase_date       99.33        1      231      230  0.00    -1.42      0.33
## perc_sale*             NA      Inf     -Inf     -Inf    NA       NA        NA
## ------------------------------------------------------------ 
## group: (75, 100]
##               vars   n        mean          sd   median    trimmed         mad
## user_id*         1 908         NaN          NA       NA        NaN          NA
## brand_id         2 908 59962973.04 26208729.16 67279918 61786532.7 28372576.99
## platform         3 908 45458870.80 13099593.18 35279833 44023605.4   475154.03
## country*         4 908         NaN          NA       NA        NaN          NA
## user_segment*    5 908         NaN          NA       NA        NaN          NA
## purchase_date    6 908      100.49       60.28       91       97.9       63.75
## perc_sale*       7 908         NaN          NA       NA        NaN          NA
##                    min      max    range  skew kurtosis       se
## user_id*           Inf     -Inf     -Inf    NA       NA       NA
## brand_id        422351 98398869 97976518 -0.59    -0.97 869767.2
## platform      22734142 82921252 60187110  0.94    -0.06 434725.3
## country*           Inf     -Inf     -Inf    NA       NA       NA
## user_segment*      Inf     -Inf     -Inf    NA       NA       NA
## purchase_date        1      231      230  0.38    -0.80      2.0
## perc_sale*         Inf     -Inf     -Inf    NA       NA       NA
## ------------------------------------------------------------ 
## group: 0.0
##               vars      n        mean          sd   median     trimmed
## user_id*         1 383519         NaN          NA       NA         NaN
## brand_id         2 383519 53940412.92 28906400.07 51522426 54673053.97
## platform         3 383519 47080670.17 13414669.37 39857033 46212375.95
## country*         4 383519         NaN          NA       NA         NaN
## user_segment*    5 383519         NaN          NA       NA         NaN
## purchase_date    6 383519      111.61       62.06      105      109.84
## perc_sale*       7 383519        0.00        0.00        0        0.00
##                       mad      min      max    range  skew kurtosis       se
## user_id*               NA      Inf     -Inf     -Inf    NA       NA       NA
## brand_id      36410011.04   102532 99855154 99752622 -0.05    -1.26 46676.75
## platform       6786156.72 15766995 96462699 80695704  0.55    -0.69 21661.40
## country*               NA      Inf     -Inf     -Inf    NA       NA       NA
## user_segment*          NA      Inf     -Inf     -Inf    NA       NA       NA
## purchase_date       69.68        1      231      230  0.26    -0.94     0.10
## perc_sale*           0.00        0        0        0   NaN      NaN     0.00

describeBy(userBrandsDF, userBrandsDF$user_segment)

## 
##  Descriptive statistics by group 
## group: segment_0
##               vars     n        mean          sd   median     trimmed
## user_id*         1 46523         NaN          NA       NA         NaN
## brand_id         2 46523 54863739.58 28752727.30 52029885 55716830.28
## platform         3 46523 44850103.80 13508129.43 35279833 43456710.19
## country*         4 46523         NaN          NA       NA         NaN
## user_segment*    5 46523         NaN          NA       NA         NaN
## purchase_date    6 46523       96.95       65.99       88       93.65
## perc_sale*       7 46523        0.00        0.00        0        0.00
##                       mad      min      max    range  skew kurtosis        se
## user_id*               NA      Inf     -Inf     -Inf    NA       NA        NA
## brand_id      37531618.70   102532 99855154 99752622 -0.09    -1.24 133304.62
## platform             0.00 15766995 82921252 67154257  0.93    -0.01  62626.96
## country*               NA      Inf     -Inf     -Inf    NA       NA        NA
## user_segment*          NA      Inf     -Inf     -Inf    NA       NA        NA
## purchase_date       84.51        1      231      230  0.31    -1.10      0.31
## perc_sale*           0.00        0        0        0   NaN      NaN      0.00
## ------------------------------------------------------------ 
## group: segment_1
##               vars      n        mean          sd   median    trimmed
## user_id*         1 355539         NaN          NA       NA        NaN
## brand_id         2 355539 52702464.62 29244060.11 51215738 53081995.2
## platform         3 355539 46729967.90 13489723.67 35279833 45625890.3
## country*         4 355539         NaN          NA       NA        NaN
## user_segment*    5 355539         NaN          NA       NA        NaN
## purchase_date    6 355539      118.29       65.22      128      119.2
## perc_sale*       7 355539        0.00        0.00        0        0.0
##                      mad      min      max    range  skew kurtosis       se
## user_id*              NA      Inf     -Inf     -Inf    NA       NA       NA
## brand_id      36507260.7   102532 99855154 99752622  0.01    -1.26 49044.92
## platform       6786156.7 15766995 82921252 67154257  0.67    -0.50 22623.48
## country*              NA      Inf     -Inf     -Inf    NA       NA       NA
## user_segment*         NA      Inf     -Inf     -Inf    NA       NA       NA
## purchase_date       77.1        1      231      230 -0.15    -1.15     0.11
## perc_sale*           0.0        0        0        0   NaN      NaN     0.00
## ------------------------------------------------------------ 
## group: segment_2
##               vars      n        mean          sd   median     trimmed
## user_id*         1 165977         NaN          NA       NA         NaN
## brand_id         2 165977 52628970.53 28962098.94 50688913 52995154.25
## platform         3 165977 47942258.92 13621260.31 39857033 47050281.96
## country*         4 165977         NaN          NA       NA         NaN
## user_segment*    5 165977         NaN          NA       NA         NaN
## purchase_date    6 165977      120.85       63.48      132      122.24
## perc_sale*       7 165977        0.00        0.00        0        0.00
##                       mad      min      max    range  skew kurtosis       se
## user_id*               NA      Inf     -Inf     -Inf    NA       NA       NA
## brand_id      35543493.61   102532 99855154 99752622  0.02    -1.25 71089.60
## platform       7261310.75 15766995 82921252 67154257  0.50    -0.71 33434.38
## country*               NA      Inf     -Inf     -Inf    NA       NA       NA
## user_segment*          NA      Inf     -Inf     -Inf    NA       NA       NA
## purchase_date       72.65        1      231      230 -0.21    -1.05     0.16
## perc_sale*           0.00        0        0        0   NaN      NaN     0.00
## ------------------------------------------------------------ 
## group: segment_3
##               vars     n        mean          sd   median     trimmed
## user_id*         1 83677         NaN          NA       NA         NaN
## brand_id         2 83677 52752868.80 28824015.54 51215738 53183983.96
## platform         3 83677 48162380.68 13465602.85 39857033 47488958.89
## country*         4 83677         NaN          NA       NA         NaN
## user_segment*    5 83677         NaN          NA       NA         NaN
## purchase_date    6 83677      122.89       62.67      131      124.56
## perc_sale*       7 83677        0.00        0.00        0        0.00
##                       mad      min      max    range  skew kurtosis       se
## user_id*               NA      Inf     -Inf     -Inf    NA       NA       NA
## brand_id      35628848.38   102532 99855154 99752622 -0.01    -1.25 99644.06
## platform       7261310.75 15766995 96462699 80695704  0.41    -0.82 46550.33
## country*               NA      Inf     -Inf     -Inf    NA       NA       NA
## user_segment*          NA      Inf     -Inf     -Inf    NA       NA       NA
## purchase_date       71.16        1      231      230 -0.23    -0.98     0.22
## perc_sale*           0.00        0        0        0   NaN      NaN     0.00
## ------------------------------------------------------------ 
## group: segment_4
##               vars      n        mean          sd   median     trimmed
## user_id*         1 152674         NaN          NA       NA         NaN
## brand_id         2 152674 52455483.79 28859666.54 50640759 52797427.29
## platform         3 152674 48430734.75 13550298.59 50298854 47678877.83
## country*         4 152674         NaN          NA       NA         NaN
## user_segment*    5 152674         NaN          NA       NA         NaN
## purchase_date    6 152674      122.46       63.33      133      124.18
## perc_sale*       7 152674        0.00        0.00        0        0.00
##                       mad      min      max    range  skew kurtosis       se
## user_id*               NA      Inf     -Inf     -Inf    NA       NA       NA
## brand_id      35472100.49   102532 99855154 99752622  0.01    -1.24 73859.91
## platform      15481043.81 15766995 82921252 67154257  0.42    -0.78 34678.98
## country*               NA      Inf     -Inf     -Inf    NA       NA       NA
## user_segment*          NA      Inf     -Inf     -Inf    NA       NA       NA
## purchase_date       71.16        1      231      230 -0.24    -1.01     0.16
## perc_sale*           0.00        0        0        0   NaN      NaN     0.00
## ------------------------------------------------------------ 
## group: segment_5
##               vars     n        mean          sd   median     trimmed
## user_id*         1 42100         NaN          NA       NA         NaN
## brand_id         2 42100 52876175.86 28742864.84 51215738 53341463.56
## platform         3 42100 48106366.65 13275117.65 39857033 47633955.08
## country*         4 42100         NaN          NA       NA         NaN
## user_segment*    5 42100         NaN          NA       NA         NaN
## purchase_date    6 42100      125.85       62.79      133      127.93
## perc_sale*       7 42100        0.00        0.00        0        0.00
##                       mad      min      max    range  skew kurtosis        se
## user_id*               NA      Inf     -Inf     -Inf    NA       NA        NA
## brand_id      35628848.38   102532 99855154 99752622 -0.02    -1.26 140084.15
## platform      15481043.81 15766995 82921252 67154257  0.33    -0.93  64698.96
## country*               NA      Inf     -Inf     -Inf    NA       NA        NA
## user_segment*          NA      Inf     -Inf     -Inf    NA       NA        NA
## purchase_date       71.16        1      231      230 -0.26    -0.95      0.31
## perc_sale*           0.00        0        0        0   NaN      NaN      0.00

table(userBrandsDF$platform, userBrandsDF$user_segment, userBrandsDF$perc_sale)

## , ,  = (0, 25]
## 
##           
##            segment_0 segment_1 segment_2 segment_3 segment_4 segment_5
##   15766995         2         8         1         3         2         0
##   22734142         2        26         4         2         3         0
##   24380949        55       200        72        47        60        33
##   34959346       107      1238       414       155       295        65
##   35279833      2005     17431      7139      3352      6522      1617
##   39857033        89      1955       937       536       991       263
##   47514527         0         0         0         0         0         0
##   50298854       150      1115       278        56       195        28
##   54454273        29       320        84        17        59        13
##   59713904      1194     16143      8401      4160      8071      2103
##   82921252       165      1656       736       314       657       129
##   96462699         0         0         0         0         0         0
## 
## , ,  = (25, 50]
## 
##           
##            segment_0 segment_1 segment_2 segment_3 segment_4 segment_5
##   15766995         4        25        22         5         8         4
##   22734142         4       145        24         7        16         5
##   24380949       247       826       290       160       241       117
##   34959346       360      3996      1433       514      1034       229
##   35279833      7842     65595     27692     12237     23499      5221
##   39857033       445      7536      4168      2202      3873      1080
##   47514527         0         1         0         0         0         0
##   50298854       665      4366      1046       233       697        84
##   54454273        79       851       309        89       201        26
##   59713904      4554     52331     27069     13280     25753      6292
##   82921252       485      5418      2744      1111      2420       450
##   96462699         0         0         0         0         0         0
## 
## , ,  = (50, 75]
## 
##           
##            segment_0 segment_1 segment_2 segment_3 segment_4 segment_5
##   15766995         1        11         2         0         2         4
##   22734142         1        24         6         1         2         1
##   24380949        22       120        45        24        56        16
##   34959346        45       507       185        63       143        41
##   35279833      1225     10680      4630      1973      3862       812
##   39857033        71      1349       810       459       802       194
##   47514527         0         1         0         0         0         0
##   50298854       127       679       144        43       107        11
##   54454273         9       128        30        16        29         2
##   59713904       597      6702      3692      1894      3655       831
##   82921252        86       746       384       181       351        65
##   96462699         0         0         0         0         0         0
## 
## , ,  = (75, 100]
## 
##           
##            segment_0 segment_1 segment_2 segment_3 segment_4 segment_5
##   15766995         0         0         0         0         0         0
##   22734142         0         1         0         0         0         0
##   24380949         0         1         1         0         0         0
##   34959346         0        11         3         2         5         1
##   35279833        27       236        91        32        57        10
##   39857033         7        38        13         6        18         2
##   47514527         0         0         0         0         0         0
##   50298854         4        20         1         0         2         1
##   54454273         1         1         0         0         0         0
##   59713904        12       135        57        23        48        10
##   82921252         3        17         5         1         5         0
##   96462699         0         0         0         0         0         0
## 
## , ,  = 0.0
## 
##           
##            segment_0 segment_1 segment_2 segment_3 segment_4 segment_5
##   15766995         6        40        24         6        29         3
##   22734142        16        72        31         5        24         9
##   24380949       686      1387       480       456       368       296
##   34959346       556      4345      1734       702      1348       384
##   35279833     14407     73158     32090     17119     29242      9434
##   39857033       604      7015      3624      2502      3834      1449
##   47514527         0         0         1         0         0         0
##   50298854      1007      5345      1336       472       966       164
##   54454273       137       945       313       106       251        51
##   59713904      7436     55753     30965     17935     30680     10043
##   82921252       947      4890      2417      1173      2191       507
##   96462699         0         0         0         3         0         0

Upon analyzing the statistics split by perc_sale it was concluded that any variations are simply a matter of randomness and not caused by any underlying distribution difference. The main important thing to point out is that the value that was verified to be extremely rare is that of “(75, 100]”, with only 908 occurrences.

Similar conclusions are derived from the statistics table by user segment. Once again concluding that any distribution variations are being caused by random chance.

The 3 variable contingency table, however, presents far more meaningful results. the first one to point out is that platform 96462699 only appears 3 times and always for perc_sale of 0 and for segment_3. Platform 47514527 is also quite rare, appearing only in a handful of instances for segments 1 and 2, which are the most populated. Regardless, there does not appear to be any strong correlation amongst these 3 variables, with occurrences being mostly independent and only affected by the relative frequency of each value.

These results imply that the task might be very difficult and bad results would not be unexpected.

Correlation Heatmap

With the intent of further investigating variable correlation, a correlation heatmap was created. Upon its investigation is obvious that no variable has high correlation with another, and as such, to improve the chances at obtaining good results, heavily nonlinear approaches are likely to be required.

userBrandsDFNum = userBrandsDF[, sapply(userBrandsDF, is.numeric)]
corMatrix = round(cor(userBrandsDFNum),2)
corMap = melt(corMatrix)
ggplot(data = corMap, aes(x=Var2, y=Var1, fill=value)) + geom_tile()

Grouped Box Plots

In a final effort to identify strong variable relations box plots split by class were created. To make these plots understandable the values were only divided by the 3 variables with the least number of unique variables: platform, user_segment and perc_sale.

For the two plots regarding brand_id and purchace_date divided by platform, the 2 most interesting cases, platforms 47514527 and 96462699 can be ruled out, as these were the ones identified as almost nonexistent. The case for 22734142 is, however, more relevant as this class has enough occurrences for this to be considered a real relation with purchace_date. Other cases, however, do not present enough variation to be considered relevant.

Similarly, all others, but the last plot, containing purchace_date divided by perc_sale, do not present any significant differences or relations. Regarding this last plot, a relation between the perc_sale value of “(0, 25]” and purchase_date seems to exist, with more purchases with slight discount being made at later dates.

for (column1 in c("platform","user_segment","perc_sale")){
  for (column2 in names(userBrandsDF)){
    if (column1==column2){
      next()
    }
    tryCatch(
      {
        boxplot(
          userBrandsDF[,column2] ~ userBrandsDF[,column1],
          main=paste("Boxplots of",column1,"vs",column2),
          las=2, ylab="", xlab=""
        )

      },
      error = function(e) {}
    )
  }
}

Stacked Area Plots

The last plot type in this visualization approach is the stacked area plot, where the count per class for each date is obtained. Two types were created, the normal, with just the counts per day; and a cumulative sum variant, to see the overall evolution is sales.

stackedPlot = function(data, timeColumn, targetColumn, legend=FALSE, cumulative=FALSE){
  time <- rep(sort(unique(data[,timeColumn])),each=length(unique(data[,targetColumn])))  # x Axis
  
  v1 = (data %>% group_by_at(vars(timeColumn,targetColumn)) %>% tally())
  v2 = (data %>% expand(data[timeColumn],data[targetColumn]) %>% group_by_at(vars(timeColumn,targetColumn)))
  value = bind_rows(v1, v2) %>% 
    distinct_at(vars(timeColumn, targetColumn), .keep_all = TRUE)  %>% 
    arrange_at(vars(timeColumn, targetColumn)) # y Axis
  if (cumulative){
    value = (value %>% group_by_at(targetColumn) %>% arrange_at(timeColumn) %>% mutate(cs = cumsum(n)))$cs %>% replace_na(0)
  } else {
    value = value$n  %>% replace_na(0)
  }
  group = rep(sort(unique(data[,targetColumn])),times=length(unique(data[,timeColumn]))) # group, one shape per group
  plotData = data.frame(time, value, group)
  ggplot(plotData, aes(x=time, y=value, fill=as.character(group))) + 
      geom_area(show.legend=legend)
}
stackedPlot(userBrandsDF, "purchase_date", "brand_id", cumulative=FALSE, legend=FALSE)

stackedPlot(userBrandsDF, "purchase_date", "brand_id", cumulative=TRUE, legend=FALSE)

stackedPlot(userBrandsDF, "purchase_date", "platform", cumulative=FALSE, legend=TRUE)

stackedPlot(userBrandsDF, "purchase_date", "platform", cumulative=TRUE, legend=TRUE)

stackedPlot(userBrandsDF, "purchase_date", "country", cumulative=FALSE, legend=FALSE)

stackedPlot(userBrandsDF, "purchase_date", "country", cumulative=TRUE, legend=FALSE)

stackedPlot(userBrandsDF, "purchase_date", "user_segment", cumulative=FALSE, legend=TRUE)

stackedPlot(userBrandsDF, "purchase_date", "user_segment", cumulative=TRUE, legend=TRUE)

stackedPlot(userBrandsDF, "purchase_date", "perc_sale", cumulative=FALSE, legend=TRUE)

stackedPlot(userBrandsDF, "purchase_date", "perc_sale", cumulative=TRUE, legend=TRUE)

The first two plots are those of brand_id. The first thing to note is that the structure of the non-cumulative plot is identical to that of the line plot, with only the colored areas giving further information. However, due to the large number of different classes, it becomes impossible to analyse individual class distributions. Still, these two plots were kept as they were found to be quite pretty.

The second pair of plots, pertaining to the occurrences of platform values, present no clear relation, with values of each class simply increasing proportionally to the overall sales.

The same is repeated for the next two sets of plots, for country and user_segment, again with no clear relation shown.

The last pair, on the other hand, shows a relation between the “(25, 50]” value of perc_sale and time, with a much higher proportion of sales belonging to this class in the latter half of the dates. Other values, however, do not present any strong relation with time.

Preprocessing

With a thorough data understanding gained by the extensive visualization methodology, some very basic preprocessing followed. This preprocessing simply entails transforming the data from a data frame to a binary ratings matrix and obtaining “recommentderlab” evaluation schemes.

Four schemes were obtained, each using 10-fold cross-validation, with different restrictions on the number of times each user must appear. The minimum numbers of occurrences per user are 1, 2, 5 and 10. Furthermore the goal of predicting the next brand the user will purchase was selected, thus with all but one entry per user being employed as known and the remaining value as unknown.

userBrandsBRM = as(userBrandsDF[c("user_id","brand_id")],"binaryRatingMatrix")

e1 = evaluationScheme(userBrandsBRM, method="cross-validation", k=10, given=-1)

userBrandsBRM2 <- userBrandsBRM[rowCounts(userBrandsBRM)>=2,]
e2 = evaluationScheme(userBrandsBRM2, method="cross-validation", k=10, given=-1)

userBrandsBRM5 <- userBrandsBRM[rowCounts(userBrandsBRM)>=5,]
e5 = evaluationScheme(userBrandsBRM5, method="cross-validation", k=10, given=-1)

userBrandsBRM10 <- userBrandsBRM[rowCounts(userBrandsBRM)>=10,]
e10 = evaluationScheme(userBrandsBRM10, method="cross-validation", k=10, given=-1)

Task 2

The second task of the project entails, as stated in the beginning of this report, the obtention of different recommender systems employing 3 methods, popularity, association rules and collaborative filtering; and the analysis of their respective results resorting to ROC and precision-recall curves.

Popularity

The recommender that was first developed is the simplest one, using popularity. This recommender works by simply finding which brands occur more often, and then predicting, while removing already seen brands, the N most common. For an example, let’s take the 6 first users as training data and the 7th as test. Bellow we can see the table containing the number brand of occurrences for the 6 users. It is possible to see that in this case all values are equally common. If we intend to predict 2 values, then for the 7th user 2 of these values would be obtained as predictions. Which in this case would be incorrect, as none of these are in this user’s data.

colCounts(userBrandsBRM[1:6,])[colCounts(userBrandsBRM[1:6,]) != 0]

## 26493521 26715155 33008187 36189858 45417945 51404770 55675036 63020180 
##        1        1        1        1        1        1        1        1 
## 70512514 78369697 86469062 
##        1        1        1

modelPop = Recommender(userBrandsBRM[1:6,],"POPULAR")
recsPop = predict(modelPop, userBrandsBRM[7,], n=2)

inspect(getRatingMatrix(userBrandsBRM[7,]))

##     items                       
## [1] {19686531,47093199,67196564}

as.integer(getList(recsPop)[[1]])

## [1] 70512514 26715155

Given this, the final results, for which all 4 of the evaluation schemes are used, were obtained. The precision-recall and ROC curves, for the top-N values in (1,2,6), can be seen below.

results = evaluate(e10, list())
methods <- list(
  "popular" = list(name="POPULAR", param = NULL)
)
results$popular1 = evaluate(e1, methods, type="topNList", n=c(1,2,6), progress=FALSE)$popular
results$popular2 = evaluate(e2, methods, type="topNList", n=c(1,2,6), progress=FALSE)$popular
results$popular5 = evaluate(e5, methods, type="topNList", n=c(1,2,6), progress=FALSE)$popular
results$popular10 = evaluate(e10, methods, type="topNList", n=c(1,2,6), progress=FALSE)$popular

plot(results,annotate=TRUE)

plot(results, "prec/rec", annotate=TRUE)

A first glance at either of the images immediately shows a trend, whereas the data gets more restricted, the results worsen. In fact, for equal values of top-N, the larger datasets attain better results in all metrics. The comparison between different top-N is, however, more difficult; but, to make it both, as simple and consistent as possible, a way to combine both metrics into one was devised. For the ROC curves, the final performance was devised as the TPR divided by the FPR; while for the precision-recall curves, the value obtained by multiplying both was selected. It is clear that this is far from perfect, with for ROC lower values of N having a clear advantage, while for precision-recall, the opposite being true. Nevertheless, as the comparison of these curves is almost always done taking into consideration the context in which the recommender system will be employed, the selected aggregation method is about as good as can be for a metric that can be both globally used and that can be immediately interpreted from a plot without requiring an external calculation (as with for example F1).

By utilizing these rules for the popular1 curve, N of 1 is selected as the best for the ROC, while the value N of 6 is selected for the precision-recall. Despite popular1’s results being objectively the best, as some of the methods that will be later analyzed cannot use the full dataset, and in fact, a set smaller than that used for the popular10 is not feasible for a fair model comparison, only popular10 will be kept in the curve set.

Association Rules

The second recommender is that obtained from association rules. This method is far more complex than the previous one and requires some tunning of hyperparameters. This method works by finding the sets of brands that users tend to buy together and create rules indicating their relations. For example, users who buy brand 1 might commonly also buy brand 2, so a rule “brand_1 => brand_2” might be generated. The capability to find popular brands also exists, creating rules such as “ => brand_3”, where brand 3 is predicted regardless of the other brands a user has purchased.

Lets once again see an example using the same 7 users as before, with user 7 being used for testing. Below the set of transactions and rules obtained by the apriori algorithm is presented. Then, passing all but one of the items from user 7 the 2 most likely brands are predicted. The brands are selected in order by the confidence of applicable rules, with rules that are not applicable, obviously being ignored. Unfortunately, none of the rules generated from the 6 users is applicable to the 7th one, and as such, no brand is predicted.

inspect(getRatingMatrix(userBrandsBRM[1:6,]))

##     items                       
## [1] {70512514}                  
## [2] {63020180}                  
## [3] {26493521,26715155,86469062}
## [4] {33008187,45417945}         
## [5] {78369697}                  
## [6] {36189858,51404770,55675036}

modelAr = Recommender(userBrandsBRM[1:6,], "AR", parameter=list(support=0, confidence=0))
rules = getModel(modelAr)$rule_base
inspect(head(rules))

##     lhs           rhs        support   confidence coverage  lift count
## [1] {33008187} => {45417945} 0.1666667 1          0.1666667 6    1    
## [2] {45417945} => {33008187} 0.1666667 1          0.1666667 6    1    
## [3] {26493521} => {26715155} 0.1666667 1          0.1666667 6    1    
## [4] {26715155} => {26493521} 0.1666667 1          0.1666667 6    1    
## [5] {26493521} => {86469062} 0.1666667 1          0.1666667 6    1    
## [6] {86469062} => {26493521} 0.1666667 1          0.1666667 6    1

recsAR = predict(modelAr, userBrandsBRM[7,], n=2)

inspect(getRatingMatrix(userBrandsBRM[7,]))

##     items                       
## [1] {19686531,47093199,67196564}

as.integer(getList(recsAR)[[1]])

## integer(0)

The values of support and confidence are the most important and after some search the values of 0.002 for support and 0.2 for confidence were found to perform the best. Bellow the precision-recall and ROC curves can be seen for the same set of top N predictions.

results$popular1 = NULL
results$popular2 = NULL
results$popular5 = NULL

methods <- list(
  "ar" = list(name="AR", parameter=list(support=0.002, confidence=0.2))
)
results$ar1 = evaluate(e1, methods, type="topNList", n=c(1,2,6), progress=FALSE)$ar
results$ar2 = evaluate(e2, methods, type="topNList", n=c(1,2,6), progress=FALSE)$ar
results$ar5 = evaluate(e5, methods, type="topNList", n=c(1,2,6), progress=FALSE)$ar
results$ar10 = evaluate(e10, methods, type="topNList", n=c(1,2,6), progress=FALSE)$ar

plot(results,annotate=TRUE)

plot(results, "prec/rec", annotate=TRUE)

Upon an initial look at the ROC curve, it becomes obvious that something is not as it would be expected, with the ar1 and ar2 curves collapsing into a single point and even the ar5 being shorter than what has previously been observed. An understanding of how FPR is calculated indicates that for a N only slightly bigger than 1, all curves, regardless of performance, should present very similar values. This is not observed, and by knowing how association rules work, this leads to the conclusion that in some instances, or even the majority for ar1 and ar2, not enough rules are generated to obtain the requested top-N predictions. This is, of course, caused by the selected support and confidence thresholds. Despite a lot of effort, it was not possible to find more lax values without making the computational time unreasonably long, and thus, these results are something that will have to be accepted.

By employing the aforementioned method to compare the overall results of different top-N values on the ar10 (which has the best results), for the ROC curve, N of 1 is the best, while for precision-recall, N of 6 is. Furthermore, comparing ar10 with popular10 shows for the former a higher TPR, precision and by a larger margin recall, thus achieving objectively better results.

For the same reasons as those presented for the popular models, only ar10 will be used henceforth in any of the comparison plots.

Collaborative Filtering

The last recommender system to be created in task 2 employs collaborative filtering. There are two important variants to this method, user-based and item-based, with both being tested. User-based collaborative filtering works by finding the similarity between users in terms of the items that were purchased, while item-based collaborative filtering selects new items by finding the similarity between different items.

Once again, an example with the first 7 users is presented. Then the models for each were created and the predictions for user 7 can be seen. Again, like for all other methods, none of the results are accurate, with both variants providing the same recommendations. This, however, would be expected, given the very high sparsity in the data.

modelUBCF = Recommender(userBrandsBRM[1:6,], "UBCF", parameter=list(method="cosine",nn=2)) 
modelIBCF = Recommender(userBrandsBRM[1:6,], "IBCF", parameter=list(method="cosine",k=2))

recsUBCF = predict(modelUBCF, userBrandsBRM[7,], n=2)
recsIBCF = predict(modelUBCF, userBrandsBRM[7,], n=2)

inspect(getRatingMatrix(userBrandsBRM[7,]))

##     items                       
## [1] {19686531,47093199,67196564}

as.integer(getList(recsUBCF)[[1]])

## [1] 63020180 70512514

as.integer(getList(recsIBCF)[[1]])

## [1] 63020180 70512514

Unlike for the previous methods, not all evaluation schemes were possible to use as with the two largest ones there were memory limitations. As such, only the schemes where users appear more than 5 and 10 times were used. For a fair comparison, the same top-N set was employed, and the obtained performance curves are presented below.

results$ar1 = NULL
results$ar2 = NULL
results$ar5 = NULL

methods <- list(
"UBCF" = list(name="UBCF", param = NULL),
"IBCF" = list(name="IBCF", param = NULL)
)
colabFilter = evaluate(e5, methods, type="topNList", n=c(1,2,6), progress=FALSE)
results$UBCF5 = colabFilter$UBCF
results$IBCF5 = colabFilter$IBCF
colabFilter = evaluate(e10, methods, type="topNList", n=c(1,2,6), progress=FALSE)
results$UBCF10 = colabFilter$UBCF
results$IBCF10 = colabFilter$IBCF

plot(results,annotate=TRUE)

plot(results, "prec/rec", annotate=TRUE)

This set of results presents similar findings to those for the set of popular models, with larger datasets leading to improved performance in all instances. Another fact that is clear is the superiority (in this case) of the item-based collaborative filtering over the user-based one, with the former achieving undeniably better results. One thing of note is that for the UBCF5 precision-recall curve, precision improves with N, which indicates that the later predictions are far more accurate.

Much like it was done for the previous two model types, employing the stated comparison methodology it is concluded that in the ROC curve, for both UBCF5 and IBCF5, N of 1 is the best, while for precision-recall, 6 is chosen as the optimal. Finally, comparing UBCF10 or IBCF10 with either ar10 or popular10 shows significantly worse results.

Task 3

The final task involves the utilization of the contextual information present in the dataset to improve results. With this goal, an ambitious method employing an LSTM was selected. Due to the complexity of this strategy many decisions had to be made and parameters to be tuned. As such, a more in-depth implementation explanation and detailing will be provided.

A LSTM expects, as its input, ordered sequences (typically by time), and as such, the first step was to order the full dataset by purchace_date.

As the sequences of user purchases are not related to amongst different users, other than similarities in the statistical distributions, feeding them to the LSTM in sequence is bound to create spurious relations based on their arbitrary order. With this in mind, the dataset was divided by user into independent sequences of different lengths.

With the aforementioned steps concluded the data is now in its base format, with a list of user sequences. Now, each variable needs to be treated individually to be passed to the model.

Firstly, at this point, user_id no longer holds meaningful information and was removed.

It is the goal to predict brand_id, as such, knowing it for the current step is not realistic, therefore all values were shifted by one. Thus, the model has access to the brand_id in the previous step, and in case there is no previous step, a value of 0 was selected to fill the empty space. Furthermore, the LSTM does not handle categorical variables, and thus, brand_id was encoded using the brand_features dataset, with the previously placed values of 0 being replaced by all 0s with the same length (120) as the encoding. This set of steps encompasses only the input of the brand_id. For the output, the variable was one hot encoded and no shifting occurred.

The variables platform and user_segment were one hot encoded, while purchace_date was left as is (numerical).

Perc_sale presents an interesting case, as much like for brand_id it does not make sense that it would be known before the purchase. Taking this into consideration, it was also shifted. Unlike all other variables for which the values have no relation, this one is obviously ordinal. Therefore, it was encoded between 1 and 5, with 0 being left to fill the lack of value due to shifting.

As country will be handled differently, by passing into an embeding layer, it was label encoded, and as such after this process we were left with 3 different sets of data. The X set with all but the country variable. The X country set, containing only the country. And the Y set, containing only the one hot encoded brand.

shift = function(array, n){
  c(rep(0, n), head(array, -n))
}
get_x = function(data, toShift=c(), toDrop=c()){
  X = list()
  for (entry in names(data)){
    entryData = array(,c(0,0,0))
    for (column in names(data[[entry]])){
      if (column %in% toDrop){
        next()
      }
      temp = data[[entry]][1:(nrow(data[[entry]])-1),column]
      if (column %in% toShift){
        temp = shift(temp,1)
      }
      entryData = tryCatch(
        abind(entryData,array(temp,c(length(temp),1)),along=3),
        error = function(e){
          array(cbind(entryData,temp),c(length(temp),1,1))
        }
      )
    }
    columns = names(data[[entry]])
    dimnames(entryData)[[3]]=as.list(c(columns[!columns %in% toDrop]))
    X = append(X, list(entryData))
    names(X)[length(X)]=entry
  }
  X
}

get_y = function(data, y){
  Y = list()
  entryNames = paste0(y,"_",levels(data[[1]][,y]))
  for (entry in names(data)){
    entryData = data.frame(data[[entry]][2:nrow(data[[entry]]),y])
    dummy = dummyVars(~ .,data = entryData)
    entryDummy = predict(dummy, entryData)
    entryArray = array(entryDummy, c(nrow(entryDummy),1,ncol(entryDummy)))
    dimnames(entryArray)[[3]]=entryNames
    Y[[entry]] = entryArray
  }
  Y
}

split_brand_features = function(brandFeatures){
  featureMap = list()
  for (row in 1:nrow(brandFeatures)){
    rowFeatures = brandFeatures[[row,"features"]]
    rowBrand = as.character(brandFeatures[[row,"brand_id"]])
    rowFeatures = gsub('\\[|\\]|\n', '', rowFeatures) %>% 
      strsplit(" ") %>% 
      unlist() %>% 
      as.numeric()
    rowFeatures = rowFeatures[!is.na(rowFeatures)]
    featureMap[[rowBrand]]=rowFeatures
  }
  featureMap
}

encode_brand_features = function(brands, brandMap){
  encodedBrands = list()
  counter=0
  for (brand in brands){
    counter = counter+1
    brand = as.numeric(brand)
    if (brand == 0){
      encodedBrand = array(0,c(1,1,length(brandMap[[1]])))
    }
    else{
      encodedBrand = array(brandMap[[brand]],c(1,1,length(brandMap[[brand]])))
    }
    
    encodedBrands[[counter]] = encodedBrand
  }
  encodedBrands = do.call(abind, list(encodedBrands, along=1))
  dimnames(encodedBrands)[[3]] = paste0("brand_",c(1:dim(encodedBrands)[3]))
  encodedBrands
}


set.seed(1)
toShift = c("brand_id","perc_sale")
toDrop = c("user_id","country")

userBrandsDF10 = as.data.frame(userBrandsDF %>% group_by(user_id) %>% filter(n()>=10))
userBrandsDF10$user_id = as.factor(userBrandsDF10$user_id)
userBrandsDF10$brand_id = as.factor(userBrandsDF10$brand_id)
userBrandsDF10$country = as.numeric(factor(userBrandsDF10$country))-1

userBrandsSplit = split(userBrandsDF10, userBrandsDF10$user_id)
userBrandsSplitX = get_x(userBrandsSplit, toShift, toDrop)

toDrop = names(userBrandsSplit[[1]])[names(userBrandsSplit[[1]]) != "country"]
countryX = get_x(userBrandsSplit, c(), toDrop)

Y = get_y(userBrandsSplit, "brand_id")


brandMap = split_brand_features(brandFeatures)
platformLevels = levels(as.factor(userBrandsDF10$platform))
userSegmentLevels = levels(as.factor(userBrandsDF10$user_segment))
X = list()
for (name in names(userBrandsSplitX)){
  user = userBrandsSplitX[[name]]
  encode_brand_features(user[,,"brand_id"], brandMap)
  brandEncodedUser = abind(user, encode_brand_features(user[,,"brand_id"], brandMap), along=3)
  columns = dimnames(brandEncodedUser)[[3]]
  columns = columns["brand_id" != columns]
  brandEncodedUser = array(brandEncodedUser[,,columns],c(dim(brandEncodedUser)[1:2],dim(brandEncodedUser)[3]-1))
  dimnames(brandEncodedUser)[[3]] = columns
  

  userSegmentValues = factor(brandEncodedUser[,,"user_segment"], levels=userSegmentLevels)
  userSegmentDF = data.frame(userSegmentValues)
  userSegmentHot = predict(dummyVars(~ userSegmentValues, data = userSegmentDF),userSegmentDF)
  userSegmentArray = array(userSegmentHot,c(nrow(userSegmentHot),1,ncol(userSegmentHot)))
  dimnames(userSegmentArray)[[3]]=userSegmentLevels
  oneHotSegment = abind(brandEncodedUser, userSegmentArray, along=3)
  
  columns = dimnames(oneHotSegment)[[3]]
  columns = columns["user_segment" != columns]
  oneHotSegment = array(oneHotSegment[,,columns],c(dim(oneHotSegment)[1:2],dim(oneHotSegment)[3]-1))
  dimnames(oneHotSegment)[[3]] = columns
  

  platformValues = factor(oneHotSegment[,,"platform"], levels=platformLevels)
  platformDF = data.frame(platformValues)
  platformHot = predict(dummyVars(~ platformValues, data = platformDF),platformDF)
  platformArray = array(platformHot,c(nrow(platformHot),1,ncol(platformHot)))
  dimnames(platformArray)[[3]]=paste0("platform_",platformLevels)
  oneHotPlatform = abind(oneHotSegment, platformArray, along=3)
  
  columns = dimnames(oneHotPlatform)[[3]]
  columns = columns["platform" != columns]
  oneHotPlatform = array(oneHotPlatform[,,columns],c(dim(oneHotPlatform)[1:2],dim(oneHotPlatform)[3]-1))
  dimnames(oneHotPlatform)[[3]] = columns
  
  
  oneHotPlatform[,,"perc_sale"][oneHotPlatform[,,"perc_sale"] == "0.0"] = 1
  oneHotPlatform[,,"perc_sale"][oneHotPlatform[,,"perc_sale"] == "(0, 25]"] = 2
  oneHotPlatform[,,"perc_sale"][oneHotPlatform[,,"perc_sale"] == "(25, 50]"] = 3
  oneHotPlatform[,,"perc_sale"][oneHotPlatform[,,"perc_sale"] == "(50, 75]"] = 4
  oneHotPlatform[,,"perc_sale"][oneHotPlatform[,,"perc_sale"] == "(75, 100]"] = 5
  
  X[[name]] = oneHotPlatform
  class(X[[name]]) = "numeric"
}

for (name in names(countryX)){
  countryX[[name]] = array(countryX[[name]], c(dim(countryX[[name]])[1],1))
}

Due to time and computational limitations, it was not feasible to test a large number of model topologies, and as such, the model was left at the bare minimal for our intent. It consists of 2 input layers, the first of which takes the X set and directly passes it to the LSTM layer. The other takes the X country set, passes it first by an embedding layer and then to the LSTM. There is a single LSTM layer, directly followed by a dense output layer with a softmax activation.

One important note for the LSTM layer is that the parameter stateful is set to FALSE, thus resetting its internal state (memory not weights) at each fit or prediction operation, thus ensuring that each sequence is treated independently.

build_model = function(){
  mainInput = layer_input(shape = c(1,dim(X[[1]])[3]), name = 'mainInput')

  countryInput = layer_input(shape = c(1), name = 'countryInput') 
    
  countryEmbeding = countryInput %>%
    layer_embedding(input_dim = length(levels(as.factor(userBrandsDF10$country))), output_dim = 1, name ='countryEmbeding')
  
  lstmLayer = layer_concatenate(c(mainInput, countryEmbeding), name = "fullInput") %>%
    layer_lstm(units = 1, # size of the layer
         return_sequences = TRUE,
         stateful = FALSE, name = "LSTM") %>%
    time_distributed(keras::layer_dense(units = dim(Y[[1]])[3], activation = 'softmax'), name = "output")
  
  lstmModel = keras_model(
    inputs = c(mainInput, countryInput),
    outputs = c(lstmLayer)
  )
    
  lstmModel %>%
      compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = c('accuracy'))
  lstmModel
}

It is not possible to use the created evaluation schemes for training, but a serious attempt to emulate them as closely as possible was done. First, the data is split into 10 folds. For each fold the model is totally reset (including weights) and the respective data is obtained. Then, for each user, the model is trained on its sequence, resetting memory each time. With the training complete, the predictions are obtained in a similar manner, resetting the memory each time. For consistency with the previous approaches of task 2, although all outputs are used for training, only the last is saved for performance comparisons. In the end, the output of this process is a list containing 10 lists; one for each fold. Each of these lists in turn contains another list for each user. Finally, each of these contains the predicted class indexes in the order of highest predicted value (predicted likelihood).

create_k_folds = function(data, k){
  foldSize = length(data) %/% k
  foldExtra = length(data) %% k
  sampleIndex = 1:length(data)
  folds = list()
  for (fold in 1:k){
    if (foldExtra>=fold){
      folds[[fold]] = sort(sample(sampleIndex, foldSize+1))
    }
    else{
      folds[[fold]] = sort(sample(sampleIndex, foldSize))
    }
    sampleIndex = sampleIndex[!sampleIndex %in% folds[[fold]]]
  }
  folds
}


preds = list()
folds = create_k_folds(X, k = 10)
for (fold in 1:10){
  lstmModel = build_model() # reset model for each fold
  
  trainX = X[unlist(folds[-fold])]
  trainCountry = countryX[unlist(folds[-fold])]
  trainY = Y[unlist(folds[-fold])]
  
  validationX = X[unlist(folds[fold])]
  validationCountry = countryX[unlist(folds[fold])]

  
  for (entry in 1:length(trainX)){
    x = list(trainX[[entry]], trainCountry[[entry]])
    y = trainY[[entry]]
    lstmModel %>% fit(
      x = x,
      y = y,
      batch_size = 1,
      epochs = 1,
      shuffle = FALSE,
      verbose = FALSE,
      view_metrics = FALSE
    )
  }
  
  predFold = list()
  for (entry in 1:length(validationX)){
    x = list(validationX[[entry]], validationCountry[[entry]])
    pred = predict(lstmModel, x , batch_size=1)
    pred = pred[nrow(pred),,]
    predFold[[entry]] = order(-pred)
  }
  preds[[fold]] = predFold
}

The final step in the process is the creation of the precision-recall and ROC curves for the predictions. First the metrics for each user are obtained, then they are averaged per fold, and finally, the results for all folds are averaged.

Before looking at the results it bears mentioning that the parameters for the number of epochs, LSTM layer size and embedding size were tuned, but due to the limited amount of time and lack of access to a TensorFlow compatible GPU (which increased train times over 100X), are far from optimal.

Thus, the best obtained results are presented below.

precision_at_n = function(pred, real, N){
  TP = 0
  for (row in 1:nrow(pred)){
    if (real[row] %in% pred[row,1:N]){
      TP = TP +1
    }
  }
  P = nrow(pred)*N
  precision = TP/P
  precision
}
recall_at_n = function(pred, real, N){
  TP = 0
  for (row in 1:nrow(pred)){
    if (real[row] %in% pred[row,1:N]){
      TP = TP +1
    }
  }
  recall = TP/(nrow(pred))
  recall
}
FPR_at_n = function(pred, real, N, factorCount){
  TP = 0
  for (row in 1:nrow(pred)){
    if (real[row] %in% pred[row,1:N]){
      TP = TP + 1
    }
  }
  FPR = (nrow(pred)*N - TP) / (nrow(pred) * factorCount)
  FPR
}


plot_prec_rec = function(results,xlim=c(0,1),ylim=c(0,1)){
  plot(1, type="n", xlab="recall", ylab="precision",xlim=xlim,ylim=ylim)
  legendText=c()
  for (resultIdx in 1:length(results)){
    result = results[[resultIdx]]
    x = result[,"recall"]
    y = result[,"precision"]
    lines(x,y,type="b",pch=resultIdx,col=resultIdx)
    legendText=c(legendText,names(results)[resultIdx])
  }
  legend(
    "bottomright", 
    legend=legendText,
    col=1:length(results),
    lty=1, 
    pch =1:length(results), 
    cex=1,
    bty = "n"
  )
}

plot_ROC = function(results,xlim=c(0,1),ylim=c(0,1)){
  plot(1, type="n", xlab="FPR", ylab="TPR",xlim=xlim,ylim=ylim)
  legendText=c()
  for (resultIdx in 1:length(results)){
    result = results[[resultIdx]]
    x = result[,"FPR"]
    y = result[,"TPR"]
    lines(x,y,type="b",pch=resultIdx,col=resultIdx)
    legendText=c(legendText,names(results)[resultIdx])
  }
  legend(
    "bottomright", 
    legend=legendText,
    col=1:length(results),
    lty=1, 
    pch =1:length(results), 
    cex=1,
    bty = "n"
  )
}

topN = c(1,2,6)
results$UBCF5 = NULL
results$IBCF5 = NULL
results = avg(results)
result = data.frame(row.names = topN)
for (N in topN){
  precision = c()
  recall = c()
  FPR = c()
  for (fold in 1:10){
    pred = do.call(rbind,preds[[fold]])
    testY = Y[unlist(folds[fold])]
    real = c()
    for (user in testY){
      real = c(real, which.max(user[nrow(user),,]))
    }
    precision = c(precision, precision_at_n(pred, real, N))
    recall = c(recall, recall_at_n(pred, real, N))
    FPR = c(FPR, FPR_at_n(pred, real, N, length(levels(as.factor(userBrandsDF$brand_id)))))
  }
  result[as.character(N),"precision"] = mean(precision)
  result[as.character(N),"recall"] = mean(recall)
  result[as.character(N),"TPR"] = mean(recall)
  result[as.character(N),"FPR"] = mean(FPR)
}
results$LSTM10 = result

maxPrecision = 0
maxRecall = 0
maxFPR = 0
for (result in results){
  maxPrecision = max(maxPrecision, max(result[,"precision"]))
  maxRecall = max(maxRecall, max(result[,"recall"]))
  maxFPR = max(maxFPR, max(result[,"FPR"]))
}
plot_prec_rec(results, xlim=c(0,maxRecall), ylim=c(0,maxPrecision))

plot_ROC(results, xlim=c(0,maxFPR), ylim=c(0,maxRecall))

On the previous ROC and precision-recall curves, it is possible to see that the LSTM algorithm achieves betters results than the second best ar10. Surprisingly, despite some variety of hyperparameters tested, the simplest employing an embedding and LSTM layer sizes of 1, with training being done for a single epoch attained the best results.

Conclusion

From the work performed there are a set of finding of great importance related to the relative performance of different methods and dataset sizes. Overall, restricting the dataset to users with more transactions reduced the performance of all methods in all top-N configurations. It was further found that from the original set of methods, association rules perform the best, with popular in second. Both instances of collaborative filtering achieved the worst results, with user-based being by far the worst. Given the low values for recall, a greater top-N of 6 might be preferable, to increase the chance that at least one recommendation is accurate. Finally, regarding the developed LSTM model, it was possible to verify that by using it the overall results can be improved, with the downside of a significantly increased computational cost.

The work performed has many limitations, almost all of which computational in nature. The collaborative filtering models ran into memory limitations with the largest dataset subset, while the LSTM models took too long to train for an extensive combination of model hyperparameters and topologies to be tested, which hindered the results.

The principal avenue for future work lies in the continuation of finetuning the various LSTM parameters. To speed up this process an inheritance from an unknown rich uncle would be welcome, allowing the obtention of a GPU that is compatible with TensorFlow and Keras.

ATDS Project

José Pinto - up201603713

4/06/2021