This report is being created for the final and only project of the Advanced Topics in Data Science (ATDS) course. This project entails the usage of a Farfetch dataset, containing anonymized information about luxury fashion brand purchases, for the creation of a brand recommender system.
The objectives of this project encompass 3 different tasks.
Task 1 - Data preprocessing and visualization.
Task 2 - Recommender systems using popularity, association rules or collaborative filtering; using both ROC and precision-recall curves with top N of 1,2 and 6 for comparisons.
Task 3 - Open ended task involving the improvement of results by employing contextual information other than user id and brand.
For task 1 almost no preprocessing was required, and as such, an extensive visualization process was undertaken. The table and plot outputs were, due to size, placed at the end of the report as an annex. In task 2 the set of recommender systems is created utilizing different data subsets of different dimensions and the results compared using cross-validation. An example for each model type for a single user is also presented. For task 3, an LSTM model was selected to employ the contextual information and attempt to improve results.
To make it easier to run and thus replicate attained results all required libraries are imported here and should be properly installed by the user before running. It is important to stress that the order in which the libraries are imported is relevant as there are some cases of function masking, which can cause problems when libraries are loading in a different order.
library(keras)
library(tensorflow)
library(recommenderlab)
library(dplyr)
library(tidyr)
library(readr)
library(plotly)
library(e1071)
library(ggpubr)
library(ggplot2)
library(reshape2)
library(psych)
library(proxy)
library(groupdata2)
library(abind)
library(caret)
The first task that must be undertaken is the reading of the datasets from files into memory to allow further manipulation. Here relative paths are used, and it is assumed that the files are in the same directory as the markdown file. Of course, these paths can be changed by the user to load the dataset from wherever it would be most convenient.
Both files that were provided for this project are read, with the first one being user_brands, containing most of the information and the second one containing only the Farfetch created 120-dimensional brand embeddings. The contents of the second file will only be used in task 3.
userBrandsTible = read_csv("users_brands.csv")
userBrandsDF = as.data.frame(userBrandsTible)
brandFeatures = read_csv("brand_features.csv")
Task 1 entails both the data visualization and preprocessing steps. An initial analysis of the contents led to the conclusion that no preprocessing steps are required for data visualization. And preprocessing required is problem dependent and is left for when it is required.
The first visualization task that was undertaken is the inspection of some data entries, not only to check the particular values and value formats but also the respective types and any possible anomalies in the loading process.
head(userBrandsDF)
## user_id brand_id
## 1 65b4c425c57d4d5cd6a09b6721f76dfe76fca72000a58ad1b252ef4ba10dfb0d 92850413
## 2 e45ae0e683895ab429483c3aaeba836c9e6904692daebad297c5e370d735d6df 59833398
## 3 ab112830f538338e386af96be442ea1d9e81d76fabc17b16fffa8f98be478241 72430441
## 4 577321c8f3b5bff9e5470184606207ed04edc14a82f706a92e9fe62c998d2aba 29631653
## 5 75624f1d112034145f181fb895479093f94ea2a7de0e48d582609e35dcd89193 50327475
## 6 36e368697be4d8d37b94338581c0b963f99db8c2c0b508bf80e28ede94c18930 48210076
## platform country user_segment purchase_date perc_sale
## 1 35279833 country_76 segment_2 154 0.0
## 2 35279833 country_144 segment_1 114 0.0
## 3 35279833 country_35 segment_4 204 (25, 50]
## 4 82921252 country_177 segment_3 139 (25, 50]
## 5 35279833 country_125 segment_0 170 0.0
## 6 59713904 country_74 segment_2 174 0.0
str(userBrandsDF)
## 'data.frame': 846490 obs. of 7 variables:
## $ user_id : chr "65b4c425c57d4d5cd6a09b6721f76dfe76fca72000a58ad1b252ef4ba10dfb0d" "e45ae0e683895ab429483c3aaeba836c9e6904692daebad297c5e370d735d6df" "ab112830f538338e386af96be442ea1d9e81d76fabc17b16fffa8f98be478241" "577321c8f3b5bff9e5470184606207ed04edc14a82f706a92e9fe62c998d2aba" ...
## $ brand_id : num 92850413 59833398 72430441 29631653 50327475 ...
## $ platform : num 35279833 35279833 35279833 82921252 35279833 ...
## $ country : chr "country_76" "country_144" "country_35" "country_177" ...
## $ user_segment : chr "segment_2" "segment_1" "segment_4" "segment_3" ...
## $ purchase_date: num 154 114 204 139 170 174 172 142 161 139 ...
## $ perc_sale : chr "0.0" "0.0" "(25, 50]" "(25, 50]" ...
## - attr(*, "spec")=
## .. cols(
## .. user_id = col_character(),
## .. brand_id = col_double(),
## .. platform = col_double(),
## .. country = col_character(),
## .. user_segment = col_character(),
## .. purchase_date = col_double(),
## .. perc_sale = col_character()
## .. )
From this the first thing to note is the overall dataset shape, containing 846490 observations from 7 different variables.
By performing this analysis some important data facets were identified. The first is that, although some variables have numerical type, all but purchace_date are categorical in meaning, enforcing the need from special care on their investigation. Another important factor regards the values for user_id, which were encoded in a large string which can make results difficult to analyze individually. Finally, the last feature that is important to note is that perc_sale presents its values in the form of discount intervals.
To better understand the overall data distributions an extensive statistical moment table was created, containing, among others, variable types, missing value counts, extremes, rages, quantiles, mean, mode, variance and number of unique values. Although, as most variables are categorical in nature, and as such these statistical moments have little meaning, they still provide important insight to the data.
Mode <- function(x) {
uniques <- unique(x)
uniques[which.max(tabulate(match(x, uniques)))]
}
ModeCount <- function(x) {
uniques <- unique(x)
max(tabulate(match(x, uniques)))
}
userBrandsStatistics = data.frame(matrix(ncol = ncol(userBrandsDF), nrow = 0))
colnames(userBrandsStatistics) = colnames(userBrandsDF)
for (column in colnames(userBrandsDF)){
columnData = pull(userBrandsDF,column)
userBrandsStatistics["type",column] = mode(columnData)
userBrandsStatistics["na count",column] = sum(is.na(columnData))
userBrandsStatistics["min",column] = min(columnData)
userBrandsStatistics["25 quantile",column] = tryCatch(
{quantile(columnData, 0.25)},
error = function(e) { return(NA) }
)
userBrandsStatistics["mean",column] = mean(columnData)
userBrandsStatistics["median",column] = median(columnData)
userBrandsStatistics["75 quantile",column] = tryCatch(
{quantile(columnData, 0.75)},
error = function(e) { return(NA) }
)
userBrandsStatistics["max",column] = max(columnData)
userBrandsStatistics["range",column] = tryCatch(
{as.numeric(userBrandsStatistics["max",column])- as.numeric(userBrandsStatistics["min",column])},
error = function(e) { return(NA) }
)
userBrandsStatistics["IQR",column] = tryCatch(
{as.numeric(userBrandsStatistics["75quantile",column])- as.numeric(userBrandsStatistics["25quantile",column])},
error = function(e) { return(NA) }
)
userBrandsStatistics["sd",column] = sd(columnData)
userBrandsStatistics["var",column] = var(columnData)
userBrandsStatistics["coef var",column] = as.numeric(userBrandsStatistics["sd",column]) / as.numeric(userBrandsStatistics["mean",column])
userBrandsStatistics["skewness",column] = tryCatch(
{skewness(columnData)},
error = function(e) { return(NA) }
)
userBrandsStatistics["kurtosis",column] = tryCatch(
{kurtosis(columnData)},
error = function(e) { return(NA) }
)
userBrandsStatistics["mode",column] = Mode(columnData)
userBrandsStatistics["mode count",column] = ModeCount(columnData)
userBrandsStatistics["unique count",column] = length(unique(columnData))
userBrandsStatistics["sum",column] = tryCatch(
{sum(columnData)},
error = function(e) { return(NA) }
)
}
userBrandsStatistics
## user_id
## type character
## na count 0
## min 00001fb7f8a0bab989afb3b2e3f87091f19c624f4c918f5e1fbb3c2ee12e9db6
## 25 quantile <NA>
## mean <NA>
## median <NA>
## 75 quantile <NA>
## max ffffede449ad3486ffee78949e0d34db24c36de30f71e1a20182df8d658281a2
## range <NA>
## IQR <NA>
## sd <NA>
## var <NA>
## coef var <NA>
## skewness <NA>
## kurtosis <NA>
## mode e646f8406d203c00922016d163433c7665bdd4d3b6260e72626a2900c7910727
## mode count 24
## unique count 329213
## sum <NA>
## brand_id platform country user_segment
## type numeric numeric character character
## na count 0 0 0 0
## min 102532 15766995 country_0 segment_0
## 25 quantile 27968276 35279833 <NA> <NA>
## mean 52775913.8391416 47381156.6653605 <NA> <NA>
## median 51215738 39857033 <NA> <NA>
## 75 quantile 80735391 59713904 <NA> <NA>
## max 99855154 96462699 country_99 segment_5
## range 99752622 80695704 <NA> <NA>
## IQR <NA> <NA> <NA> <NA>
## sd 29031126.4920308 13546980.7862736 <NA> <NA>
## var 842806305396293 183520688423666 <NA> <NA>
## coef var 0.550082876452244 0.285914944667815 <NA> <NA>
## skewness 0.00453599969320028 0.559633782166231 <NA> <NA>
## kurtosis -1.25236279783164 -0.64813779627103 <NA> <NA>
## mode 89083147 35279833 country_118 segment_1
## mode count 23022 379237 189617 355539
## unique count 1771 12 175 6
## sum 44674283305695 40107675305661 <NA> <NA>
## purchase_date perc_sale
## type numeric character
## na count 0 0
## min 1 (0, 25]
## 25 quantile 67 <NA>
## mean 119.200774964855 <NA>
## median 129 <NA>
## 75 quantile 170 <NA>
## max 231 0.0
## range 230 <NA>
## IQR <NA> <NA>
## sd 64.4743898286592 <NA>
## var 4156.94694377791 <NA>
## coef var 0.540889015592967 <NA>
## skewness -0.169298217417195 <NA>
## kurtosis -1.10177341210125 <NA>
## mode 161 0.0
## mode count 14798 383519
## unique count 231 5
## sum 100902264 <NA>
The most important takeaways from this large table are as follows:
For user_id the most common value only appears 24 times, while there are 329213 unique values for less than 3 times as many observations. This leads to the conclusion that the dataset is extremely sparce and special care will have to be had in its analysis.
For brand_id the mode if far more common, with 23022 appearances, being as such very well represented. With 1771 unique values there is much less sparsity than for user_id, although is still large enough for an individual analysis to be unreasonable.
The last variable that will further detailed is purchace_date, for which it was verified that a total of 230 unique values are present, following a mostly symmetric distribution. Due to the kurtosis values, it is also concluded that the distribution is closer to uniform than normal.
From this point although an individual analysis could be done for each other variables, the conclusions are almost identical to those for brand_id. The only important remark to make is that the variables platform, user_segment and perc_sale have a low enough number of unique values to warrant an individual value analysis.
With this in mind, the unique values for these 3 variables were obtained and studied. From this it was possible to see that the values for platform and user_segment are not meaningful, being only randomly assigned ids. Perc_sale, one the other hand, shows 5 values, 4 of which are 25% sale intervals (25% to 50%, 75% to 100%, etc.) and the last, with value 0, indicates no discount. This suggests that perc_sale is ordinal, and as such could be label encoded with little loss of meaning.
print(unique(userBrandsDF$platform))
## [1] 35279833 82921252 59713904 39857033 50298854 34959346 24380949 54454273
## [9] 15766995 22734142 47514527 96462699
print(unique(userBrandsDF$user_segment))
## [1] "segment_2" "segment_1" "segment_4" "segment_3" "segment_0" "segment_5"
print(unique(userBrandsDF$perc_sale))
## [1] "0.0" "(25, 50]" "(0, 25]" "(50, 75]" "(75, 100]"
To finalize the set of data statistical moments, a second table containing moments that should be more meaningful for categorical variables was created. For this table, instead of utilizing the values themselves, the number of occurrences per unique value is employed.
uniqueStatistics = data.frame(matrix(ncol = ncol(userBrandsDF), nrow = 0))
colnames(uniqueStatistics) = colnames(userBrandsDF)
for (column in colnames(userBrandsDF)){
columnData = (userBrandsDF %>% group_by_at(column) %>% tally())$n
uniqueStatistics["type",column] = mode(columnData)
uniqueStatistics["count",column] = length(columnData)
uniqueStatistics["na count",column] = sum(is.na(columnData))
uniqueStatistics["min",column] = min(columnData)
uniqueStatistics["25 quantile",column] = tryCatch(
{quantile(columnData, 0.25)},
error = function(e) { return(NA) }
)
uniqueStatistics["mean",column] = mean(columnData)
uniqueStatistics["median",column] = median(columnData)
uniqueStatistics["75 quantile",column] = tryCatch(
{quantile(columnData, 0.75)},
error = function(e) { return(NA) }
)
uniqueStatistics["max",column] = max(columnData)
uniqueStatistics["range",column] = tryCatch(
{as.numeric(uniqueStatistics["max",column])- as.numeric(uniqueStatistics["min",column])},
error = function(e) { return(NA) }
)
uniqueStatistics["IQR",column] = tryCatch(
{as.numeric(uniqueStatistics["75 quantile",column]) - as.numeric(uniqueStatistics["25 quantile",column])},
error = function(e) { return(NA) }
)
uniqueStatistics["sd",column] = sd(columnData)
uniqueStatistics["var",column] = var(columnData)
uniqueStatistics["coef var",column] = as.numeric(uniqueStatistics["sd",column]) / as.numeric(uniqueStatistics["mean",column])
uniqueStatistics["skewness",column] = tryCatch(
{skewness(columnData)},
error = function(e) { return(NA) }
)
uniqueStatistics["kurtosis",column] = tryCatch(
{kurtosis(columnData)},
error = function(e) { return(NA) }
)
uniqueStatistics["mode",column] = Mode(columnData)
uniqueStatistics["mode count",column] = ModeCount(columnData)
uniqueStatistics["unique count",column] = length(unique(columnData))
}
print(uniqueStatistics)
## user_id brand_id platform
## type numeric numeric numeric
## count 329213 1771 12
## na count 0 0 0
## min 1 1 3
## 25 quantile 1 14 376.25
## mean 2.57125326156622 477.972896668549 70540.8333333333
## median 2 57 12824
## 75 quantile 3 229 34408.5
## max 24 23022 379237
## range 23 23021 379234
## IQR 2 215 34032.25
## sd 2.33704794337009 1726.6482940855 136023.169010974
## var 5.46179308961037 2981314.33146838 18502302507.7879
## coef var 0.90891394414665 3.61243975572708 1.92828979448273
## skewness 2.69042099494101 7.72615321730558 1.544551106458
## kurtosis 9.35063859226698 72.5998431649631 0.527163457775885
## mode 1 1 3
## mode count 132091 75 2
## unique count 24 585 11
## country user_segment purchase_date
## type numeric numeric numeric
## count 175 6 231
## na count 0 0 0
## min 1 42100 1982
## 25 quantile 9 55811.5 2860.5
## mean 4837.08571428571 141081.666666667 3664.45887445887
## median 122 118175.5 3379
## 75 quantile 2147 162651.25 4157
## max 189617 355539 14798
## range 189616 313439 12816
## IQR 2138 106839.75 1296.5
## sd 17755.2890066876 117286.073484735 1320.66591463699
## var 315250287.711002 13756023033.4667 1744158.45808395
## coef var 3.67065833757083 0.831334618138913 0.360398618153959
## skewness 7.47568085149853 0.813596576523874 3.65298340534392
## kurtosis 68.0561319345571 -0.97206332327874 23.3827130959389
## mode 1 46523 4029
## mode count 9 1 2
## unique count 127 6 227
## perc_sale
## type numeric
## count 5
## na count 0
## min 908
## 25 quantile 48699
## mean 169298
## median 91704
## 75 quantile 321660
## max 383519
## range 382611
## IQR 272961
## sd 171773.801569098
## var 29506238905.5
## coef var 1.01462392685736
## skewness 0.251431661408326
## kurtosis -2.15079936814689
## mode 91704
## mode count 1
## unique count 5
At a first glance it is immediately obvious that the results which were obtained are far more helpful. From this table, the obtained conclusions were as follows:
For user_id it is verified that there are 132091 users that only appear once, although over half of them appear at least twice. Of course, the vast majority of users show less than 3 times. This distribution is also far from normal, being heavily left skewed.
Brand_id shows an even more left skewed distribution. With 75 brands having only one entry, and thus being very poorly represented, with an accurate description of them not being feasible. Even if the bottom 50% brands were discarded, with a median value of 57, it would still be very difficult to properly understand the distribution of the remaining. Of course, the top 50% brands present a much larger number of entries and, as such, most of the dataset would still be kept.
Both platform and country show similar results to the ones presented before, thus making separate analysis for each of these variable’s values redundant and undesirable.
User_segment, on the other hand, has a distribution relatively close to normal, and with the most infrequent value appearing 42100 times, an analysis per segment would be reasonable.
With purchace_date being the only numerical variable, this analysis does not add much. It is furthermore not reasonable to make a model per date, regardless of the number of occurrences, since we would expect a recommender system to work on different, later dates.
Finally, perc_sale shows a surprising result given the very low number of unique values. This is the fact that the least common value only appears 908 times. Nevertheless, disregarding this, the variable presents a mostly centered distribution.
The different statistical summary tables previously analyzed provided a great deal of insight into the dataset. Despite this, due to the aggregation of information into a set of single values a lot of detail is lost. Taking this into consideration the following step involves the further study of variable distributions aided by a large set of graphical plots. This set of plots is comprised of bar plots, histograms, boxplots, normal Q-Q plots, pie plots and a line chart.
Given the mostly categorical nature of the data, one of the most important plot types is the bar plot, for which all available variables are applicable, even if not necessarily very compatible. With the intent of making these understandable, a maximum of 50 different classes/values is displayed at a time, which has as a downside not giving a full overview of the data. The most important takeaways from these plots are:
for (column in colnames(userBrandsDF)){
columnData = pull(userBrandsDF,column)
t = table(columnData)
barplot(
sort(t, decreasing=TRUE)[1:min(length(unique(columnData)),50)],
xaxt = "n", main=paste("Barplot of",column)
)
}
From the bar plot of user_id, despite only a minimal number of values being shown it can be clearly be seen that the higher the value the less frequent it is. It is easy to imagine this plot going through all values, with a longer and longer tail.
For brand_id a different story is told, with only a few very common brands that overshadow the vast majority. This of course, means that only a minority of brands is properly represented, with all others lacking in terms of information.
This is even more noticeable for the platform variable, with 2 platforms having each more prevalence than all others combined.
The same is repeated for the variable country, with one being by far the most prevalent. Upon a quick search it was concluded that this is likely the USA, with the second highest being the UK.
The user_segment plot is, in contrast, very well balanced, with the most common class not being much more common than the others. similarly, purchace_date is also quite balanced, apart from the outliers, possibly around holidays
Finally, perc_sale has two values that are by far the most common and one which barely appears. A further analysis of this was performed, which will be detailed further along the visualization process.
After bar plots the creation of histograms for all applicable variables proceeded. Given the existence of only one numerical variable, and it being related to time, which usually does not give much insight on its own, these plots were expected not be very useful, which was confirmed. Thus, not a lot of insight is gained from this alone. Nevertheless, they were kept for completeness of the process.
for (column in colnames(userBrandsDF)){
columnData = pull(userBrandsDF,column)
tryCatch(
{
hist(columnData, main=paste("Histogram of",column))
},
error = function(e) {}
)
}
Box plots followed the histograms, but once again being mostly a tool for analysis of numerical variables they presented little in the way of new information. Once again, they were, nevertheless, kept.
for (column in colnames(userBrandsDF)){
columnData = pull(userBrandsDF,column)
tryCatch(
{
boxplot(columnData, main=paste("Boxplot of",column))
},
error = function(e) {}
)
}
To verify the conclusions regarding the normality of the different variables a set of Q-Q plots was created. Despite many different attempts, due to the length of time it takes for the plots to be generated, Rmarkdown does not plot them. As such, to observe them they must be done in the command line. Therefore, in this case the images are loaded from a file.
for (column in colnames(userBrandsDF)){
columnData = pull(userBrandsDF,column)
t = table(columnData)
tryCatch(
{
ggqqplot(userBrandsDF,x=column, main=paste("Normal Q-Q Plot of",column))
},
error = function(e) {print(column)}
)
}
From the analysis of these plots, it was verified that most of points fall outside the confidence bands, and as such it becomes obvious that the data is, in fact, not normally distributed.
The pie chart is a tool that is often ignored or even discouraged in a data visualization pipeline, with more importance given to the bar plot. However, is this case, due to the overly large number of unique categorical values and the heavy left skewness of variable frequencies, the pie chart can be very useful for a brief and informative plot.
for (column in colnames(userBrandsDF)){
columnData = pull(userBrandsDF,column)
t = table(columnData)
theshold = 2.1
relFreq = as.integer(t)/sum(t)*100
if (length(which(relFreq<=theshold))>0){
slices = c(
relFreq[which(relFreq>theshold)],
sum(relFreq[which(relFreq<=theshold)])
)
labels = c(
names(t)[which(relFreq>theshold)],
"Others"
)
}
else{
slices = relFreq
labels = names(t)
}
if (length(slices)>=2){
labels <- paste(labels, signif(slices,2), sep=" - ") # add percents to labels
labels <- paste(labels,"%",sep="") # ad % to labels
pie(
slices,labels = labels,
col=rainbow(length(labels)),
main=paste("Pie Chart of",column)
)
}
}
For the creation of the pie charts, any variable whose occurrence is less than 2.1% is combined into the “other” slice, while if there are less than 2 slices after the process, the plot is skipped. The important information obtained from this set of plots is as follows:
The chart of brand_id shows that there are 4 brands that dominate. These, nevertheless, only account for 10% of the total entries.
For platform there are 2 very common values, with others having roughly the same amount of participation.
The third pie chart of country shows a more diverse distribution, with 11 total countries having more than 2% participation. This however means that the remaining 164 account for only 29% of all purchases.
It is clear from the pie chart of user_segment that segment 1 is the most common, with 2 and 4 closely behind.
Finally, the last chart shows that a lack of any discount is the most common, while a discount between 75 and 100 is the least.
The final plot type that will be presented is this section is the line plot. This was only done for the purchace_date variable, with the data aggregated and counted per individual date.
plot(
(userBrandsDF %>% group_by(purchase_date) %>% tally())$n,
type="l",xlab="purchase_date", ylab="Purchace Count"
)
In this data there is a clear important and central point, the outlier with a very high value. In the initial analysis of the statistical moments, it was concluded that this was due to holidays. Combining this with the knowledge that this data likely stems largely from the USA (a heavily Christian country), It was concluded that this spike comes immediately after Christmas. A larger than normal amount is observed before this, which declines (due to the purchase Christmas gifts), followed by a spike and rapid decline (caused by after Christmas gift swapping and heavily discounted prices).
With the combined knowledge obtained from the statistical tables and the set of univariate plots, the understanding of each individual variable is now very solid. However, there is still a quite large gap in the understanding of the dataset, regarding the relations between variable distributions. Similarly, to the analysis of single variable distributions, an initial analysis of statistical moments was performed, which was followed by a set of bivariate graphical plots.
As mentioned, the first dual variable analysis entails the obtention of a statistical moment table, this time with the distributions being obtained split by the values of other variables. Due to the number of unique values, only 2 variables were split by their values, user_segment and perc_sale.
A second set of tables that is also obtained is the set contingency tables for platform, user_segment and perc_sale. Platform was here included, unlike for the statistics table, as the contingency table occupies less space, thus permitting the utilization of a variable with slightly more unique values.
describeBy(userBrandsDF, userBrandsDF$perc_sale)
##
## Descriptive statistics by group
## group: (0, 25]
## vars n mean sd median trimmed
## user_id* 1 91704 NaN NA NA NaN
## brand_id 2 91704 51775886.64 29956092.56 50039784 52116569.85
## platform 3 91704 48430440.77 13669659.47 50298854 47596890.34
## country* 4 91704 NaN NA NA NaN
## user_segment* 5 91704 NaN NA NA NaN
## purchase_date 6 91704 136.87 55.69 151 142.93
## perc_sale* 7 91704 NaN NA NA NaN
## mad min max range skew kurtosis se
## user_id* NA Inf -Inf -Inf NA NA NA
## brand_id 38608176.1 287368 99855154 99567786 0.01 -1.26 98921.58
## platform 15481043.8 15766995 82921252 67154257 0.45 -0.73 45140.21
## country* NA Inf -Inf -Inf NA NA NA
## user_segment* NA Inf -Inf -Inf NA NA NA
## purchase_date 34.1 1 231 230 -0.97 0.16 0.18
## perc_sale* NA Inf -Inf -Inf NA NA NA
## ------------------------------------------------------------
## group: (25, 50]
## vars n mean sd median trimmed
## user_id* 1 321660 NaN NA NA NaN
## brand_id 2 321660 51393299.65 28839280.88 49160230 51384848.70
## platform 3 321660 47585810.05 13656657.47 39857033 46573225.17
## country* 4 321660 NaN NA NA NaN
## user_segment* 5 321660 NaN NA NA NaN
## purchase_date 6 321660 124.33 66.93 143 127.29
## perc_sale* 7 321660 NaN NA NA NaN
## mad min max range skew kurtosis se
## user_id* NA Inf -Inf -Inf NA NA NA
## brand_id 33605662.76 102532 99855154 99752622 0.08 -1.23 50849.41
## platform 6786156.72 15766995 82921252 67154257 0.57 -0.61 24079.41
## country* NA Inf -Inf -Inf NA NA NA
## user_segment* NA Inf -Inf -Inf NA NA NA
## purchase_date 66.72 1 231 230 -0.47 -1.05 0.12
## perc_sale* NA Inf -Inf -Inf NA NA NA
## ------------------------------------------------------------
## group: (50, 75]
## vars n mean sd median trimmed
## user_id* 1 48699 NaN NA NA NaN
## brand_id 2 48699 54486519.88 29055104.84 53637929 55031252.0
## platform 3 48699 46455785.69 13485960.30 35279833 45224198.0
## country* 4 48699 NaN NA NA NaN
## user_segment* 5 48699 NaN NA NA NaN
## purchase_date 6 48699 112.16 71.83 117 111.7
## perc_sale* 7 48699 NaN NA NA NaN
## mad min max range skew kurtosis se
## user_id* NA Inf -Inf -Inf NA NA NA
## brand_id 39915704.73 102532 99855154 99752622 -0.05 -1.30 131662.60
## platform 6786156.72 15766995 82921252 67154257 0.75 -0.38 61111.35
## country* NA Inf -Inf -Inf NA NA NA
## user_segment* NA Inf -Inf -Inf NA NA NA
## purchase_date 99.33 1 231 230 0.00 -1.42 0.33
## perc_sale* NA Inf -Inf -Inf NA NA NA
## ------------------------------------------------------------
## group: (75, 100]
## vars n mean sd median trimmed mad
## user_id* 1 908 NaN NA NA NaN NA
## brand_id 2 908 59962973.04 26208729.16 67279918 61786532.7 28372576.99
## platform 3 908 45458870.80 13099593.18 35279833 44023605.4 475154.03
## country* 4 908 NaN NA NA NaN NA
## user_segment* 5 908 NaN NA NA NaN NA
## purchase_date 6 908 100.49 60.28 91 97.9 63.75
## perc_sale* 7 908 NaN NA NA NaN NA
## min max range skew kurtosis se
## user_id* Inf -Inf -Inf NA NA NA
## brand_id 422351 98398869 97976518 -0.59 -0.97 869767.2
## platform 22734142 82921252 60187110 0.94 -0.06 434725.3
## country* Inf -Inf -Inf NA NA NA
## user_segment* Inf -Inf -Inf NA NA NA
## purchase_date 1 231 230 0.38 -0.80 2.0
## perc_sale* Inf -Inf -Inf NA NA NA
## ------------------------------------------------------------
## group: 0.0
## vars n mean sd median trimmed
## user_id* 1 383519 NaN NA NA NaN
## brand_id 2 383519 53940412.92 28906400.07 51522426 54673053.97
## platform 3 383519 47080670.17 13414669.37 39857033 46212375.95
## country* 4 383519 NaN NA NA NaN
## user_segment* 5 383519 NaN NA NA NaN
## purchase_date 6 383519 111.61 62.06 105 109.84
## perc_sale* 7 383519 0.00 0.00 0 0.00
## mad min max range skew kurtosis se
## user_id* NA Inf -Inf -Inf NA NA NA
## brand_id 36410011.04 102532 99855154 99752622 -0.05 -1.26 46676.75
## platform 6786156.72 15766995 96462699 80695704 0.55 -0.69 21661.40
## country* NA Inf -Inf -Inf NA NA NA
## user_segment* NA Inf -Inf -Inf NA NA NA
## purchase_date 69.68 1 231 230 0.26 -0.94 0.10
## perc_sale* 0.00 0 0 0 NaN NaN 0.00
describeBy(userBrandsDF, userBrandsDF$user_segment)
##
## Descriptive statistics by group
## group: segment_0
## vars n mean sd median trimmed
## user_id* 1 46523 NaN NA NA NaN
## brand_id 2 46523 54863739.58 28752727.30 52029885 55716830.28
## platform 3 46523 44850103.80 13508129.43 35279833 43456710.19
## country* 4 46523 NaN NA NA NaN
## user_segment* 5 46523 NaN NA NA NaN
## purchase_date 6 46523 96.95 65.99 88 93.65
## perc_sale* 7 46523 0.00 0.00 0 0.00
## mad min max range skew kurtosis se
## user_id* NA Inf -Inf -Inf NA NA NA
## brand_id 37531618.70 102532 99855154 99752622 -0.09 -1.24 133304.62
## platform 0.00 15766995 82921252 67154257 0.93 -0.01 62626.96
## country* NA Inf -Inf -Inf NA NA NA
## user_segment* NA Inf -Inf -Inf NA NA NA
## purchase_date 84.51 1 231 230 0.31 -1.10 0.31
## perc_sale* 0.00 0 0 0 NaN NaN 0.00
## ------------------------------------------------------------
## group: segment_1
## vars n mean sd median trimmed
## user_id* 1 355539 NaN NA NA NaN
## brand_id 2 355539 52702464.62 29244060.11 51215738 53081995.2
## platform 3 355539 46729967.90 13489723.67 35279833 45625890.3
## country* 4 355539 NaN NA NA NaN
## user_segment* 5 355539 NaN NA NA NaN
## purchase_date 6 355539 118.29 65.22 128 119.2
## perc_sale* 7 355539 0.00 0.00 0 0.0
## mad min max range skew kurtosis se
## user_id* NA Inf -Inf -Inf NA NA NA
## brand_id 36507260.7 102532 99855154 99752622 0.01 -1.26 49044.92
## platform 6786156.7 15766995 82921252 67154257 0.67 -0.50 22623.48
## country* NA Inf -Inf -Inf NA NA NA
## user_segment* NA Inf -Inf -Inf NA NA NA
## purchase_date 77.1 1 231 230 -0.15 -1.15 0.11
## perc_sale* 0.0 0 0 0 NaN NaN 0.00
## ------------------------------------------------------------
## group: segment_2
## vars n mean sd median trimmed
## user_id* 1 165977 NaN NA NA NaN
## brand_id 2 165977 52628970.53 28962098.94 50688913 52995154.25
## platform 3 165977 47942258.92 13621260.31 39857033 47050281.96
## country* 4 165977 NaN NA NA NaN
## user_segment* 5 165977 NaN NA NA NaN
## purchase_date 6 165977 120.85 63.48 132 122.24
## perc_sale* 7 165977 0.00 0.00 0 0.00
## mad min max range skew kurtosis se
## user_id* NA Inf -Inf -Inf NA NA NA
## brand_id 35543493.61 102532 99855154 99752622 0.02 -1.25 71089.60
## platform 7261310.75 15766995 82921252 67154257 0.50 -0.71 33434.38
## country* NA Inf -Inf -Inf NA NA NA
## user_segment* NA Inf -Inf -Inf NA NA NA
## purchase_date 72.65 1 231 230 -0.21 -1.05 0.16
## perc_sale* 0.00 0 0 0 NaN NaN 0.00
## ------------------------------------------------------------
## group: segment_3
## vars n mean sd median trimmed
## user_id* 1 83677 NaN NA NA NaN
## brand_id 2 83677 52752868.80 28824015.54 51215738 53183983.96
## platform 3 83677 48162380.68 13465602.85 39857033 47488958.89
## country* 4 83677 NaN NA NA NaN
## user_segment* 5 83677 NaN NA NA NaN
## purchase_date 6 83677 122.89 62.67 131 124.56
## perc_sale* 7 83677 0.00 0.00 0 0.00
## mad min max range skew kurtosis se
## user_id* NA Inf -Inf -Inf NA NA NA
## brand_id 35628848.38 102532 99855154 99752622 -0.01 -1.25 99644.06
## platform 7261310.75 15766995 96462699 80695704 0.41 -0.82 46550.33
## country* NA Inf -Inf -Inf NA NA NA
## user_segment* NA Inf -Inf -Inf NA NA NA
## purchase_date 71.16 1 231 230 -0.23 -0.98 0.22
## perc_sale* 0.00 0 0 0 NaN NaN 0.00
## ------------------------------------------------------------
## group: segment_4
## vars n mean sd median trimmed
## user_id* 1 152674 NaN NA NA NaN
## brand_id 2 152674 52455483.79 28859666.54 50640759 52797427.29
## platform 3 152674 48430734.75 13550298.59 50298854 47678877.83
## country* 4 152674 NaN NA NA NaN
## user_segment* 5 152674 NaN NA NA NaN
## purchase_date 6 152674 122.46 63.33 133 124.18
## perc_sale* 7 152674 0.00 0.00 0 0.00
## mad min max range skew kurtosis se
## user_id* NA Inf -Inf -Inf NA NA NA
## brand_id 35472100.49 102532 99855154 99752622 0.01 -1.24 73859.91
## platform 15481043.81 15766995 82921252 67154257 0.42 -0.78 34678.98
## country* NA Inf -Inf -Inf NA NA NA
## user_segment* NA Inf -Inf -Inf NA NA NA
## purchase_date 71.16 1 231 230 -0.24 -1.01 0.16
## perc_sale* 0.00 0 0 0 NaN NaN 0.00
## ------------------------------------------------------------
## group: segment_5
## vars n mean sd median trimmed
## user_id* 1 42100 NaN NA NA NaN
## brand_id 2 42100 52876175.86 28742864.84 51215738 53341463.56
## platform 3 42100 48106366.65 13275117.65 39857033 47633955.08
## country* 4 42100 NaN NA NA NaN
## user_segment* 5 42100 NaN NA NA NaN
## purchase_date 6 42100 125.85 62.79 133 127.93
## perc_sale* 7 42100 0.00 0.00 0 0.00
## mad min max range skew kurtosis se
## user_id* NA Inf -Inf -Inf NA NA NA
## brand_id 35628848.38 102532 99855154 99752622 -0.02 -1.26 140084.15
## platform 15481043.81 15766995 82921252 67154257 0.33 -0.93 64698.96
## country* NA Inf -Inf -Inf NA NA NA
## user_segment* NA Inf -Inf -Inf NA NA NA
## purchase_date 71.16 1 231 230 -0.26 -0.95 0.31
## perc_sale* 0.00 0 0 0 NaN NaN 0.00
table(userBrandsDF$platform, userBrandsDF$user_segment, userBrandsDF$perc_sale)
## , , = (0, 25]
##
##
## segment_0 segment_1 segment_2 segment_3 segment_4 segment_5
## 15766995 2 8 1 3 2 0
## 22734142 2 26 4 2 3 0
## 24380949 55 200 72 47 60 33
## 34959346 107 1238 414 155 295 65
## 35279833 2005 17431 7139 3352 6522 1617
## 39857033 89 1955 937 536 991 263
## 47514527 0 0 0 0 0 0
## 50298854 150 1115 278 56 195 28
## 54454273 29 320 84 17 59 13
## 59713904 1194 16143 8401 4160 8071 2103
## 82921252 165 1656 736 314 657 129
## 96462699 0 0 0 0 0 0
##
## , , = (25, 50]
##
##
## segment_0 segment_1 segment_2 segment_3 segment_4 segment_5
## 15766995 4 25 22 5 8 4
## 22734142 4 145 24 7 16 5
## 24380949 247 826 290 160 241 117
## 34959346 360 3996 1433 514 1034 229
## 35279833 7842 65595 27692 12237 23499 5221
## 39857033 445 7536 4168 2202 3873 1080
## 47514527 0 1 0 0 0 0
## 50298854 665 4366 1046 233 697 84
## 54454273 79 851 309 89 201 26
## 59713904 4554 52331 27069 13280 25753 6292
## 82921252 485 5418 2744 1111 2420 450
## 96462699 0 0 0 0 0 0
##
## , , = (50, 75]
##
##
## segment_0 segment_1 segment_2 segment_3 segment_4 segment_5
## 15766995 1 11 2 0 2 4
## 22734142 1 24 6 1 2 1
## 24380949 22 120 45 24 56 16
## 34959346 45 507 185 63 143 41
## 35279833 1225 10680 4630 1973 3862 812
## 39857033 71 1349 810 459 802 194
## 47514527 0 1 0 0 0 0
## 50298854 127 679 144 43 107 11
## 54454273 9 128 30 16 29 2
## 59713904 597 6702 3692 1894 3655 831
## 82921252 86 746 384 181 351 65
## 96462699 0 0 0 0 0 0
##
## , , = (75, 100]
##
##
## segment_0 segment_1 segment_2 segment_3 segment_4 segment_5
## 15766995 0 0 0 0 0 0
## 22734142 0 1 0 0 0 0
## 24380949 0 1 1 0 0 0
## 34959346 0 11 3 2 5 1
## 35279833 27 236 91 32 57 10
## 39857033 7 38 13 6 18 2
## 47514527 0 0 0 0 0 0
## 50298854 4 20 1 0 2 1
## 54454273 1 1 0 0 0 0
## 59713904 12 135 57 23 48 10
## 82921252 3 17 5 1 5 0
## 96462699 0 0 0 0 0 0
##
## , , = 0.0
##
##
## segment_0 segment_1 segment_2 segment_3 segment_4 segment_5
## 15766995 6 40 24 6 29 3
## 22734142 16 72 31 5 24 9
## 24380949 686 1387 480 456 368 296
## 34959346 556 4345 1734 702 1348 384
## 35279833 14407 73158 32090 17119 29242 9434
## 39857033 604 7015 3624 2502 3834 1449
## 47514527 0 0 1 0 0 0
## 50298854 1007 5345 1336 472 966 164
## 54454273 137 945 313 106 251 51
## 59713904 7436 55753 30965 17935 30680 10043
## 82921252 947 4890 2417 1173 2191 507
## 96462699 0 0 0 3 0 0
Upon analyzing the statistics split by perc_sale it was concluded that any variations are simply a matter of randomness and not caused by any underlying distribution difference. The main important thing to point out is that the value that was verified to be extremely rare is that of “(75, 100]”, with only 908 occurrences.
Similar conclusions are derived from the statistics table by user segment. Once again concluding that any distribution variations are being caused by random chance.
The 3 variable contingency table, however, presents far more meaningful results. the first one to point out is that platform 96462699 only appears 3 times and always for perc_sale of 0 and for segment_3. Platform 47514527 is also quite rare, appearing only in a handful of instances for segments 1 and 2, which are the most populated. Regardless, there does not appear to be any strong correlation amongst these 3 variables, with occurrences being mostly independent and only affected by the relative frequency of each value.
These results imply that the task might be very difficult and bad results would not be unexpected.
With the intent of further investigating variable correlation, a correlation heatmap was created. Upon its investigation is obvious that no variable has high correlation with another, and as such, to improve the chances at obtaining good results, heavily nonlinear approaches are likely to be required.
userBrandsDFNum = userBrandsDF[, sapply(userBrandsDF, is.numeric)]
corMatrix = round(cor(userBrandsDFNum),2)
corMap = melt(corMatrix)
ggplot(data = corMap, aes(x=Var2, y=Var1, fill=value)) + geom_tile()
In a final effort to identify strong variable relations box plots split by class were created. To make these plots understandable the values were only divided by the 3 variables with the least number of unique variables: platform, user_segment and perc_sale.
For the two plots regarding brand_id and purchace_date divided by platform, the 2 most interesting cases, platforms 47514527 and 96462699 can be ruled out, as these were the ones identified as almost nonexistent. The case for 22734142 is, however, more relevant as this class has enough occurrences for this to be considered a real relation with purchace_date. Other cases, however, do not present enough variation to be considered relevant.
Similarly, all others, but the last plot, containing purchace_date divided by perc_sale, do not present any significant differences or relations. Regarding this last plot, a relation between the perc_sale value of “(0, 25]” and purchase_date seems to exist, with more purchases with slight discount being made at later dates.
for (column1 in c("platform","user_segment","perc_sale")){
for (column2 in names(userBrandsDF)){
if (column1==column2){
next()
}
tryCatch(
{
boxplot(
userBrandsDF[,column2] ~ userBrandsDF[,column1],
main=paste("Boxplots of",column1,"vs",column2),
las=2, ylab="", xlab=""
)
},
error = function(e) {}
)
}
}
The last plot type in this visualization approach is the stacked area plot, where the count per class for each date is obtained. Two types were created, the normal, with just the counts per day; and a cumulative sum variant, to see the overall evolution is sales.
stackedPlot = function(data, timeColumn, targetColumn, legend=FALSE, cumulative=FALSE){
time <- rep(sort(unique(data[,timeColumn])),each=length(unique(data[,targetColumn]))) # x Axis
v1 = (data %>% group_by_at(vars(timeColumn,targetColumn)) %>% tally())
v2 = (data %>% expand(data[timeColumn],data[targetColumn]) %>% group_by_at(vars(timeColumn,targetColumn)))
value = bind_rows(v1, v2) %>%
distinct_at(vars(timeColumn, targetColumn), .keep_all = TRUE) %>%
arrange_at(vars(timeColumn, targetColumn)) # y Axis
if (cumulative){
value = (value %>% group_by_at(targetColumn) %>% arrange_at(timeColumn) %>% mutate(cs = cumsum(n)))$cs %>% replace_na(0)
} else {
value = value$n %>% replace_na(0)
}
group = rep(sort(unique(data[,targetColumn])),times=length(unique(data[,timeColumn]))) # group, one shape per group
plotData = data.frame(time, value, group)
ggplot(plotData, aes(x=time, y=value, fill=as.character(group))) +
geom_area(show.legend=legend)
}
stackedPlot(userBrandsDF, "purchase_date", "brand_id", cumulative=FALSE, legend=FALSE)
stackedPlot(userBrandsDF, "purchase_date", "brand_id", cumulative=TRUE, legend=FALSE)
stackedPlot(userBrandsDF, "purchase_date", "platform", cumulative=FALSE, legend=TRUE)
stackedPlot(userBrandsDF, "purchase_date", "platform", cumulative=TRUE, legend=TRUE)
stackedPlot(userBrandsDF, "purchase_date", "country", cumulative=FALSE, legend=FALSE)
stackedPlot(userBrandsDF, "purchase_date", "country", cumulative=TRUE, legend=FALSE)
stackedPlot(userBrandsDF, "purchase_date", "user_segment", cumulative=FALSE, legend=TRUE)
stackedPlot(userBrandsDF, "purchase_date", "user_segment", cumulative=TRUE, legend=TRUE)
stackedPlot(userBrandsDF, "purchase_date", "perc_sale", cumulative=FALSE, legend=TRUE)
stackedPlot(userBrandsDF, "purchase_date", "perc_sale", cumulative=TRUE, legend=TRUE)
The first two plots are those of brand_id. The first thing to note is that the structure of the non-cumulative plot is identical to that of the line plot, with only the colored areas giving further information. However, due to the large number of different classes, it becomes impossible to analyse individual class distributions. Still, these two plots were kept as they were found to be quite pretty.
The second pair of plots, pertaining to the occurrences of platform values, present no clear relation, with values of each class simply increasing proportionally to the overall sales.
The same is repeated for the next two sets of plots, for country and user_segment, again with no clear relation shown.
The last pair, on the other hand, shows a relation between the “(25, 50]” value of perc_sale and time, with a much higher proportion of sales belonging to this class in the latter half of the dates. Other values, however, do not present any strong relation with time.
With a thorough data understanding gained by the extensive visualization methodology, some very basic preprocessing followed. This preprocessing simply entails transforming the data from a data frame to a binary ratings matrix and obtaining “recommentderlab” evaluation schemes.
Four schemes were obtained, each using 10-fold cross-validation, with different restrictions on the number of times each user must appear. The minimum numbers of occurrences per user are 1, 2, 5 and 10. Furthermore the goal of predicting the next brand the user will purchase was selected, thus with all but one entry per user being employed as known and the remaining value as unknown.
userBrandsBRM = as(userBrandsDF[c("user_id","brand_id")],"binaryRatingMatrix")
e1 = evaluationScheme(userBrandsBRM, method="cross-validation", k=10, given=-1)
userBrandsBRM2 <- userBrandsBRM[rowCounts(userBrandsBRM)>=2,]
e2 = evaluationScheme(userBrandsBRM2, method="cross-validation", k=10, given=-1)
userBrandsBRM5 <- userBrandsBRM[rowCounts(userBrandsBRM)>=5,]
e5 = evaluationScheme(userBrandsBRM5, method="cross-validation", k=10, given=-1)
userBrandsBRM10 <- userBrandsBRM[rowCounts(userBrandsBRM)>=10,]
e10 = evaluationScheme(userBrandsBRM10, method="cross-validation", k=10, given=-1)
The second task of the project entails, as stated in the beginning of this report, the obtention of different recommender systems employing 3 methods, popularity, association rules and collaborative filtering; and the analysis of their respective results resorting to ROC and precision-recall curves.
The recommender that was first developed is the simplest one, using popularity. This recommender works by simply finding which brands occur more often, and then predicting, while removing already seen brands, the N most common. For an example, let’s take the 6 first users as training data and the 7th as test. Bellow we can see the table containing the number brand of occurrences for the 6 users. It is possible to see that in this case all values are equally common. If we intend to predict 2 values, then for the 7th user 2 of these values would be obtained as predictions. Which in this case would be incorrect, as none of these are in this user’s data.
colCounts(userBrandsBRM[1:6,])[colCounts(userBrandsBRM[1:6,]) != 0]
## 26493521 26715155 33008187 36189858 45417945 51404770 55675036 63020180
## 1 1 1 1 1 1 1 1
## 70512514 78369697 86469062
## 1 1 1
modelPop = Recommender(userBrandsBRM[1:6,],"POPULAR")
recsPop = predict(modelPop, userBrandsBRM[7,], n=2)
inspect(getRatingMatrix(userBrandsBRM[7,]))
## items
## [1] {19686531,47093199,67196564}
as.integer(getList(recsPop)[[1]])
## [1] 70512514 26715155
Given this, the final results, for which all 4 of the evaluation schemes are used, were obtained. The precision-recall and ROC curves, for the top-N values in (1,2,6), can be seen below.
results = evaluate(e10, list())
methods <- list(
"popular" = list(name="POPULAR", param = NULL)
)
results$popular1 = evaluate(e1, methods, type="topNList", n=c(1,2,6), progress=FALSE)$popular
results$popular2 = evaluate(e2, methods, type="topNList", n=c(1,2,6), progress=FALSE)$popular
results$popular5 = evaluate(e5, methods, type="topNList", n=c(1,2,6), progress=FALSE)$popular
results$popular10 = evaluate(e10, methods, type="topNList", n=c(1,2,6), progress=FALSE)$popular
plot(results,annotate=TRUE)
plot(results, "prec/rec", annotate=TRUE)
A first glance at either of the images immediately shows a trend, whereas the data gets more restricted, the results worsen. In fact, for equal values of top-N, the larger datasets attain better results in all metrics. The comparison between different top-N is, however, more difficult; but, to make it both, as simple and consistent as possible, a way to combine both metrics into one was devised. For the ROC curves, the final performance was devised as the TPR divided by the FPR; while for the precision-recall curves, the value obtained by multiplying both was selected. It is clear that this is far from perfect, with for ROC lower values of N having a clear advantage, while for precision-recall, the opposite being true. Nevertheless, as the comparison of these curves is almost always done taking into consideration the context in which the recommender system will be employed, the selected aggregation method is about as good as can be for a metric that can be both globally used and that can be immediately interpreted from a plot without requiring an external calculation (as with for example F1).
By utilizing these rules for the popular1 curve, N of 1 is selected as the best for the ROC, while the value N of 6 is selected for the precision-recall. Despite popular1’s results being objectively the best, as some of the methods that will be later analyzed cannot use the full dataset, and in fact, a set smaller than that used for the popular10 is not feasible for a fair model comparison, only popular10 will be kept in the curve set.
The second recommender is that obtained from association rules. This method is far more complex than the previous one and requires some tunning of hyperparameters. This method works by finding the sets of brands that users tend to buy together and create rules indicating their relations. For example, users who buy brand 1 might commonly also buy brand 2, so a rule “brand_1 => brand_2” might be generated. The capability to find popular brands also exists, creating rules such as “ => brand_3”, where brand 3 is predicted regardless of the other brands a user has purchased.
Lets once again see an example using the same 7 users as before, with user 7 being used for testing. Below the set of transactions and rules obtained by the apriori algorithm is presented. Then, passing all but one of the items from user 7 the 2 most likely brands are predicted. The brands are selected in order by the confidence of applicable rules, with rules that are not applicable, obviously being ignored. Unfortunately, none of the rules generated from the 6 users is applicable to the 7th one, and as such, no brand is predicted.
inspect(getRatingMatrix(userBrandsBRM[1:6,]))
## items
## [1] {70512514}
## [2] {63020180}
## [3] {26493521,26715155,86469062}
## [4] {33008187,45417945}
## [5] {78369697}
## [6] {36189858,51404770,55675036}
modelAr = Recommender(userBrandsBRM[1:6,], "AR", parameter=list(support=0, confidence=0))
rules = getModel(modelAr)$rule_base
inspect(head(rules))
## lhs rhs support confidence coverage lift count
## [1] {33008187} => {45417945} 0.1666667 1 0.1666667 6 1
## [2] {45417945} => {33008187} 0.1666667 1 0.1666667 6 1
## [3] {26493521} => {26715155} 0.1666667 1 0.1666667 6 1
## [4] {26715155} => {26493521} 0.1666667 1 0.1666667 6 1
## [5] {26493521} => {86469062} 0.1666667 1 0.1666667 6 1
## [6] {86469062} => {26493521} 0.1666667 1 0.1666667 6 1
recsAR = predict(modelAr, userBrandsBRM[7,], n=2)
inspect(getRatingMatrix(userBrandsBRM[7,]))
## items
## [1] {19686531,47093199,67196564}
as.integer(getList(recsAR)[[1]])
## integer(0)
The values of support and confidence are the most important and after some search the values of 0.002 for support and 0.2 for confidence were found to perform the best. Bellow the precision-recall and ROC curves can be seen for the same set of top N predictions.
results$popular1 = NULL
results$popular2 = NULL
results$popular5 = NULL
methods <- list(
"ar" = list(name="AR", parameter=list(support=0.002, confidence=0.2))
)
results$ar1 = evaluate(e1, methods, type="topNList", n=c(1,2,6), progress=FALSE)$ar
results$ar2 = evaluate(e2, methods, type="topNList", n=c(1,2,6), progress=FALSE)$ar
results$ar5 = evaluate(e5, methods, type="topNList", n=c(1,2,6), progress=FALSE)$ar
results$ar10 = evaluate(e10, methods, type="topNList", n=c(1,2,6), progress=FALSE)$ar
plot(results,annotate=TRUE)
plot(results, "prec/rec", annotate=TRUE)
Upon an initial look at the ROC curve, it becomes obvious that something is not as it would be expected, with the ar1 and ar2 curves collapsing into a single point and even the ar5 being shorter than what has previously been observed. An understanding of how FPR is calculated indicates that for a N only slightly bigger than 1, all curves, regardless of performance, should present very similar values. This is not observed, and by knowing how association rules work, this leads to the conclusion that in some instances, or even the majority for ar1 and ar2, not enough rules are generated to obtain the requested top-N predictions. This is, of course, caused by the selected support and confidence thresholds. Despite a lot of effort, it was not possible to find more lax values without making the computational time unreasonably long, and thus, these results are something that will have to be accepted.
By employing the aforementioned method to compare the overall results of different top-N values on the ar10 (which has the best results), for the ROC curve, N of 1 is the best, while for precision-recall, N of 6 is. Furthermore, comparing ar10 with popular10 shows for the former a higher TPR, precision and by a larger margin recall, thus achieving objectively better results.
For the same reasons as those presented for the popular models, only ar10 will be used henceforth in any of the comparison plots.
The last recommender system to be created in task 2 employs collaborative filtering. There are two important variants to this method, user-based and item-based, with both being tested. User-based collaborative filtering works by finding the similarity between users in terms of the items that were purchased, while item-based collaborative filtering selects new items by finding the similarity between different items.
Once again, an example with the first 7 users is presented. Then the models for each were created and the predictions for user 7 can be seen. Again, like for all other methods, none of the results are accurate, with both variants providing the same recommendations. This, however, would be expected, given the very high sparsity in the data.
modelUBCF = Recommender(userBrandsBRM[1:6,], "UBCF", parameter=list(method="cosine",nn=2))
modelIBCF = Recommender(userBrandsBRM[1:6,], "IBCF", parameter=list(method="cosine",k=2))
recsUBCF = predict(modelUBCF, userBrandsBRM[7,], n=2)
recsIBCF = predict(modelUBCF, userBrandsBRM[7,], n=2)
inspect(getRatingMatrix(userBrandsBRM[7,]))
## items
## [1] {19686531,47093199,67196564}
as.integer(getList(recsUBCF)[[1]])
## [1] 63020180 70512514
as.integer(getList(recsIBCF)[[1]])
## [1] 63020180 70512514
Unlike for the previous methods, not all evaluation schemes were possible to use as with the two largest ones there were memory limitations. As such, only the schemes where users appear more than 5 and 10 times were used. For a fair comparison, the same top-N set was employed, and the obtained performance curves are presented below.
results$ar1 = NULL
results$ar2 = NULL
results$ar5 = NULL
methods <- list(
"UBCF" = list(name="UBCF", param = NULL),
"IBCF" = list(name="IBCF", param = NULL)
)
colabFilter = evaluate(e5, methods, type="topNList", n=c(1,2,6), progress=FALSE)
results$UBCF5 = colabFilter$UBCF
results$IBCF5 = colabFilter$IBCF
colabFilter = evaluate(e10, methods, type="topNList", n=c(1,2,6), progress=FALSE)
results$UBCF10 = colabFilter$UBCF
results$IBCF10 = colabFilter$IBCF
plot(results,annotate=TRUE)
plot(results, "prec/rec", annotate=TRUE)
This set of results presents similar findings to those for the set of popular models, with larger datasets leading to improved performance in all instances. Another fact that is clear is the superiority (in this case) of the item-based collaborative filtering over the user-based one, with the former achieving undeniably better results. One thing of note is that for the UBCF5 precision-recall curve, precision improves with N, which indicates that the later predictions are far more accurate.
Much like it was done for the previous two model types, employing the stated comparison methodology it is concluded that in the ROC curve, for both UBCF5 and IBCF5, N of 1 is the best, while for precision-recall, 6 is chosen as the optimal. Finally, comparing UBCF10 or IBCF10 with either ar10 or popular10 shows significantly worse results.
The final task involves the utilization of the contextual information present in the dataset to improve results. With this goal, an ambitious method employing an LSTM was selected. Due to the complexity of this strategy many decisions had to be made and parameters to be tuned. As such, a more in-depth implementation explanation and detailing will be provided.
A LSTM expects, as its input, ordered sequences (typically by time), and as such, the first step was to order the full dataset by purchace_date.
As the sequences of user purchases are not related to amongst different users, other than similarities in the statistical distributions, feeding them to the LSTM in sequence is bound to create spurious relations based on their arbitrary order. With this in mind, the dataset was divided by user into independent sequences of different lengths.
With the aforementioned steps concluded the data is now in its base format, with a list of user sequences. Now, each variable needs to be treated individually to be passed to the model.
Firstly, at this point, user_id no longer holds meaningful information and was removed.
It is the goal to predict brand_id, as such, knowing it for the current step is not realistic, therefore all values were shifted by one. Thus, the model has access to the brand_id in the previous step, and in case there is no previous step, a value of 0 was selected to fill the empty space. Furthermore, the LSTM does not handle categorical variables, and thus, brand_id was encoded using the brand_features dataset, with the previously placed values of 0 being replaced by all 0s with the same length (120) as the encoding. This set of steps encompasses only the input of the brand_id. For the output, the variable was one hot encoded and no shifting occurred.
The variables platform and user_segment were one hot encoded, while purchace_date was left as is (numerical).
Perc_sale presents an interesting case, as much like for brand_id it does not make sense that it would be known before the purchase. Taking this into consideration, it was also shifted. Unlike all other variables for which the values have no relation, this one is obviously ordinal. Therefore, it was encoded between 1 and 5, with 0 being left to fill the lack of value due to shifting.
As country will be handled differently, by passing into an embeding layer, it was label encoded, and as such after this process we were left with 3 different sets of data. The X set with all but the country variable. The X country set, containing only the country. And the Y set, containing only the one hot encoded brand.
shift = function(array, n){
c(rep(0, n), head(array, -n))
}
get_x = function(data, toShift=c(), toDrop=c()){
X = list()
for (entry in names(data)){
entryData = array(,c(0,0,0))
for (column in names(data[[entry]])){
if (column %in% toDrop){
next()
}
temp = data[[entry]][1:(nrow(data[[entry]])-1),column]
if (column %in% toShift){
temp = shift(temp,1)
}
entryData = tryCatch(
abind(entryData,array(temp,c(length(temp),1)),along=3),
error = function(e){
array(cbind(entryData,temp),c(length(temp),1,1))
}
)
}
columns = names(data[[entry]])
dimnames(entryData)[[3]]=as.list(c(columns[!columns %in% toDrop]))
X = append(X, list(entryData))
names(X)[length(X)]=entry
}
X
}
get_y = function(data, y){
Y = list()
entryNames = paste0(y,"_",levels(data[[1]][,y]))
for (entry in names(data)){
entryData = data.frame(data[[entry]][2:nrow(data[[entry]]),y])
dummy = dummyVars(~ .,data = entryData)
entryDummy = predict(dummy, entryData)
entryArray = array(entryDummy, c(nrow(entryDummy),1,ncol(entryDummy)))
dimnames(entryArray)[[3]]=entryNames
Y[[entry]] = entryArray
}
Y
}
split_brand_features = function(brandFeatures){
featureMap = list()
for (row in 1:nrow(brandFeatures)){
rowFeatures = brandFeatures[[row,"features"]]
rowBrand = as.character(brandFeatures[[row,"brand_id"]])
rowFeatures = gsub('\\[|\\]|\n', '', rowFeatures) %>%
strsplit(" ") %>%
unlist() %>%
as.numeric()
rowFeatures = rowFeatures[!is.na(rowFeatures)]
featureMap[[rowBrand]]=rowFeatures
}
featureMap
}
encode_brand_features = function(brands, brandMap){
encodedBrands = list()
counter=0
for (brand in brands){
counter = counter+1
brand = as.numeric(brand)
if (brand == 0){
encodedBrand = array(0,c(1,1,length(brandMap[[1]])))
}
else{
encodedBrand = array(brandMap[[brand]],c(1,1,length(brandMap[[brand]])))
}
encodedBrands[[counter]] = encodedBrand
}
encodedBrands = do.call(abind, list(encodedBrands, along=1))
dimnames(encodedBrands)[[3]] = paste0("brand_",c(1:dim(encodedBrands)[3]))
encodedBrands
}
set.seed(1)
toShift = c("brand_id","perc_sale")
toDrop = c("user_id","country")
userBrandsDF10 = as.data.frame(userBrandsDF %>% group_by(user_id) %>% filter(n()>=10))
userBrandsDF10$user_id = as.factor(userBrandsDF10$user_id)
userBrandsDF10$brand_id = as.factor(userBrandsDF10$brand_id)
userBrandsDF10$country = as.numeric(factor(userBrandsDF10$country))-1
userBrandsSplit = split(userBrandsDF10, userBrandsDF10$user_id)
userBrandsSplitX = get_x(userBrandsSplit, toShift, toDrop)
toDrop = names(userBrandsSplit[[1]])[names(userBrandsSplit[[1]]) != "country"]
countryX = get_x(userBrandsSplit, c(), toDrop)
Y = get_y(userBrandsSplit, "brand_id")
brandMap = split_brand_features(brandFeatures)
platformLevels = levels(as.factor(userBrandsDF10$platform))
userSegmentLevels = levels(as.factor(userBrandsDF10$user_segment))
X = list()
for (name in names(userBrandsSplitX)){
user = userBrandsSplitX[[name]]
encode_brand_features(user[,,"brand_id"], brandMap)
brandEncodedUser = abind(user, encode_brand_features(user[,,"brand_id"], brandMap), along=3)
columns = dimnames(brandEncodedUser)[[3]]
columns = columns["brand_id" != columns]
brandEncodedUser = array(brandEncodedUser[,,columns],c(dim(brandEncodedUser)[1:2],dim(brandEncodedUser)[3]-1))
dimnames(brandEncodedUser)[[3]] = columns
userSegmentValues = factor(brandEncodedUser[,,"user_segment"], levels=userSegmentLevels)
userSegmentDF = data.frame(userSegmentValues)
userSegmentHot = predict(dummyVars(~ userSegmentValues, data = userSegmentDF),userSegmentDF)
userSegmentArray = array(userSegmentHot,c(nrow(userSegmentHot),1,ncol(userSegmentHot)))
dimnames(userSegmentArray)[[3]]=userSegmentLevels
oneHotSegment = abind(brandEncodedUser, userSegmentArray, along=3)
columns = dimnames(oneHotSegment)[[3]]
columns = columns["user_segment" != columns]
oneHotSegment = array(oneHotSegment[,,columns],c(dim(oneHotSegment)[1:2],dim(oneHotSegment)[3]-1))
dimnames(oneHotSegment)[[3]] = columns
platformValues = factor(oneHotSegment[,,"platform"], levels=platformLevels)
platformDF = data.frame(platformValues)
platformHot = predict(dummyVars(~ platformValues, data = platformDF),platformDF)
platformArray = array(platformHot,c(nrow(platformHot),1,ncol(platformHot)))
dimnames(platformArray)[[3]]=paste0("platform_",platformLevels)
oneHotPlatform = abind(oneHotSegment, platformArray, along=3)
columns = dimnames(oneHotPlatform)[[3]]
columns = columns["platform" != columns]
oneHotPlatform = array(oneHotPlatform[,,columns],c(dim(oneHotPlatform)[1:2],dim(oneHotPlatform)[3]-1))
dimnames(oneHotPlatform)[[3]] = columns
oneHotPlatform[,,"perc_sale"][oneHotPlatform[,,"perc_sale"] == "0.0"] = 1
oneHotPlatform[,,"perc_sale"][oneHotPlatform[,,"perc_sale"] == "(0, 25]"] = 2
oneHotPlatform[,,"perc_sale"][oneHotPlatform[,,"perc_sale"] == "(25, 50]"] = 3
oneHotPlatform[,,"perc_sale"][oneHotPlatform[,,"perc_sale"] == "(50, 75]"] = 4
oneHotPlatform[,,"perc_sale"][oneHotPlatform[,,"perc_sale"] == "(75, 100]"] = 5
X[[name]] = oneHotPlatform
class(X[[name]]) = "numeric"
}
for (name in names(countryX)){
countryX[[name]] = array(countryX[[name]], c(dim(countryX[[name]])[1],1))
}
Due to time and computational limitations, it was not feasible to test a large number of model topologies, and as such, the model was left at the bare minimal for our intent. It consists of 2 input layers, the first of which takes the X set and directly passes it to the LSTM layer. The other takes the X country set, passes it first by an embedding layer and then to the LSTM. There is a single LSTM layer, directly followed by a dense output layer with a softmax activation.
One important note for the LSTM layer is that the parameter stateful is set to FALSE, thus resetting its internal state (memory not weights) at each fit or prediction operation, thus ensuring that each sequence is treated independently.
build_model = function(){
mainInput = layer_input(shape = c(1,dim(X[[1]])[3]), name = 'mainInput')
countryInput = layer_input(shape = c(1), name = 'countryInput')
countryEmbeding = countryInput %>%
layer_embedding(input_dim = length(levels(as.factor(userBrandsDF10$country))), output_dim = 1, name ='countryEmbeding')
lstmLayer = layer_concatenate(c(mainInput, countryEmbeding), name = "fullInput") %>%
layer_lstm(units = 1, # size of the layer
return_sequences = TRUE,
stateful = FALSE, name = "LSTM") %>%
time_distributed(keras::layer_dense(units = dim(Y[[1]])[3], activation = 'softmax'), name = "output")
lstmModel = keras_model(
inputs = c(mainInput, countryInput),
outputs = c(lstmLayer)
)
lstmModel %>%
compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = c('accuracy'))
lstmModel
}
It is not possible to use the created evaluation schemes for training, but a serious attempt to emulate them as closely as possible was done. First, the data is split into 10 folds. For each fold the model is totally reset (including weights) and the respective data is obtained. Then, for each user, the model is trained on its sequence, resetting memory each time. With the training complete, the predictions are obtained in a similar manner, resetting the memory each time. For consistency with the previous approaches of task 2, although all outputs are used for training, only the last is saved for performance comparisons. In the end, the output of this process is a list containing 10 lists; one for each fold. Each of these lists in turn contains another list for each user. Finally, each of these contains the predicted class indexes in the order of highest predicted value (predicted likelihood).
create_k_folds = function(data, k){
foldSize = length(data) %/% k
foldExtra = length(data) %% k
sampleIndex = 1:length(data)
folds = list()
for (fold in 1:k){
if (foldExtra>=fold){
folds[[fold]] = sort(sample(sampleIndex, foldSize+1))
}
else{
folds[[fold]] = sort(sample(sampleIndex, foldSize))
}
sampleIndex = sampleIndex[!sampleIndex %in% folds[[fold]]]
}
folds
}
preds = list()
folds = create_k_folds(X, k = 10)
for (fold in 1:10){
lstmModel = build_model() # reset model for each fold
trainX = X[unlist(folds[-fold])]
trainCountry = countryX[unlist(folds[-fold])]
trainY = Y[unlist(folds[-fold])]
validationX = X[unlist(folds[fold])]
validationCountry = countryX[unlist(folds[fold])]
for (entry in 1:length(trainX)){
x = list(trainX[[entry]], trainCountry[[entry]])
y = trainY[[entry]]
lstmModel %>% fit(
x = x,
y = y,
batch_size = 1,
epochs = 1,
shuffle = FALSE,
verbose = FALSE,
view_metrics = FALSE
)
}
predFold = list()
for (entry in 1:length(validationX)){
x = list(validationX[[entry]], validationCountry[[entry]])
pred = predict(lstmModel, x , batch_size=1)
pred = pred[nrow(pred),,]
predFold[[entry]] = order(-pred)
}
preds[[fold]] = predFold
}
The final step in the process is the creation of the precision-recall and ROC curves for the predictions. First the metrics for each user are obtained, then they are averaged per fold, and finally, the results for all folds are averaged.
Before looking at the results it bears mentioning that the parameters for the number of epochs, LSTM layer size and embedding size were tuned, but due to the limited amount of time and lack of access to a TensorFlow compatible GPU (which increased train times over 100X), are far from optimal.
Thus, the best obtained results are presented below.
precision_at_n = function(pred, real, N){
TP = 0
for (row in 1:nrow(pred)){
if (real[row] %in% pred[row,1:N]){
TP = TP +1
}
}
P = nrow(pred)*N
precision = TP/P
precision
}
recall_at_n = function(pred, real, N){
TP = 0
for (row in 1:nrow(pred)){
if (real[row] %in% pred[row,1:N]){
TP = TP +1
}
}
recall = TP/(nrow(pred))
recall
}
FPR_at_n = function(pred, real, N, factorCount){
TP = 0
for (row in 1:nrow(pred)){
if (real[row] %in% pred[row,1:N]){
TP = TP + 1
}
}
FPR = (nrow(pred)*N - TP) / (nrow(pred) * factorCount)
FPR
}
plot_prec_rec = function(results,xlim=c(0,1),ylim=c(0,1)){
plot(1, type="n", xlab="recall", ylab="precision",xlim=xlim,ylim=ylim)
legendText=c()
for (resultIdx in 1:length(results)){
result = results[[resultIdx]]
x = result[,"recall"]
y = result[,"precision"]
lines(x,y,type="b",pch=resultIdx,col=resultIdx)
legendText=c(legendText,names(results)[resultIdx])
}
legend(
"bottomright",
legend=legendText,
col=1:length(results),
lty=1,
pch =1:length(results),
cex=1,
bty = "n"
)
}
plot_ROC = function(results,xlim=c(0,1),ylim=c(0,1)){
plot(1, type="n", xlab="FPR", ylab="TPR",xlim=xlim,ylim=ylim)
legendText=c()
for (resultIdx in 1:length(results)){
result = results[[resultIdx]]
x = result[,"FPR"]
y = result[,"TPR"]
lines(x,y,type="b",pch=resultIdx,col=resultIdx)
legendText=c(legendText,names(results)[resultIdx])
}
legend(
"bottomright",
legend=legendText,
col=1:length(results),
lty=1,
pch =1:length(results),
cex=1,
bty = "n"
)
}
topN = c(1,2,6)
results$UBCF5 = NULL
results$IBCF5 = NULL
results = avg(results)
result = data.frame(row.names = topN)
for (N in topN){
precision = c()
recall = c()
FPR = c()
for (fold in 1:10){
pred = do.call(rbind,preds[[fold]])
testY = Y[unlist(folds[fold])]
real = c()
for (user in testY){
real = c(real, which.max(user[nrow(user),,]))
}
precision = c(precision, precision_at_n(pred, real, N))
recall = c(recall, recall_at_n(pred, real, N))
FPR = c(FPR, FPR_at_n(pred, real, N, length(levels(as.factor(userBrandsDF$brand_id)))))
}
result[as.character(N),"precision"] = mean(precision)
result[as.character(N),"recall"] = mean(recall)
result[as.character(N),"TPR"] = mean(recall)
result[as.character(N),"FPR"] = mean(FPR)
}
results$LSTM10 = result
maxPrecision = 0
maxRecall = 0
maxFPR = 0
for (result in results){
maxPrecision = max(maxPrecision, max(result[,"precision"]))
maxRecall = max(maxRecall, max(result[,"recall"]))
maxFPR = max(maxFPR, max(result[,"FPR"]))
}
plot_prec_rec(results, xlim=c(0,maxRecall), ylim=c(0,maxPrecision))
plot_ROC(results, xlim=c(0,maxFPR), ylim=c(0,maxRecall))
On the previous ROC and precision-recall curves, it is possible to see that the LSTM algorithm achieves betters results than the second best ar10. Surprisingly, despite some variety of hyperparameters tested, the simplest employing an embedding and LSTM layer sizes of 1, with training being done for a single epoch attained the best results.
From the work performed there are a set of finding of great importance related to the relative performance of different methods and dataset sizes. Overall, restricting the dataset to users with more transactions reduced the performance of all methods in all top-N configurations. It was further found that from the original set of methods, association rules perform the best, with popular in second. Both instances of collaborative filtering achieved the worst results, with user-based being by far the worst. Given the low values for recall, a greater top-N of 6 might be preferable, to increase the chance that at least one recommendation is accurate. Finally, regarding the developed LSTM model, it was possible to verify that by using it the overall results can be improved, with the downside of a significantly increased computational cost.
The work performed has many limitations, almost all of which computational in nature. The collaborative filtering models ran into memory limitations with the largest dataset subset, while the LSTM models took too long to train for an extensive combination of model hyperparameters and topologies to be tested, which hindered the results.
The principal avenue for future work lies in the continuation of finetuning the various LSTM parameters. To speed up this process an inheritance from an unknown rich uncle would be welcome, allowing the obtention of a GPU that is compatible with TensorFlow and Keras.