Building your own recommendation engine
A in-depth tutorial on how you can build your own Recommendation Engine in-house for your product/service. Learn more.
Subscribe to our awesome Newsletter.
Wouldn’t it be super cool if your code provides you with solutions as you’re typing? It’ll inspire you to write some kickass code! We love such kind of code! And thats basically how we were inspired to make a recommendation system - feed it with some raw data, put in iterations and see what it does. From there on, through learning, the machine will prompt out the most relevant results.
We got hold of this creepy, yet interesting thing called “Collaborative Filtering”. So, we just Googled the title, skimmed through a few articles and now we have a basic understanding of the topic. Well, we didn’t do all that - Naveen did. He’s the brain, code and everything in between for this system (and also this article) After he had sufficient understanding of the topic, he made a smart move (at least he thinks so!) and in fact he learnt a lot on this which in turn helped him make his code-writing that much more easy and understandable.
This article explains with examples, the few basic things you need to know before making a recommendation system. For coders, this might be a small step, and for non-coders, its a huge leap. Now, lets make sure we know a few things before taking the plunge,
And those few things will be, * Juccard Similarity * Cosine Similarity * Centered Cosine Similarity
(Yes, we get that they’re all fancy words. Its apparently eye candy for math and code geeks. But, hey it sounds exotic on this article too, so we decided to keep it. Don’t judge!) 🤓
Juccard Similarity
If you are in need of a technical definition, then you can check with Google. But we have a small example to explain the same -
Now, referring to the table below - let’s consider a situation where we have data from certain users on how much they like a said movie
User | Movie 1 | Movie 2 | Movie 3 | Movie 4 |
---|---|---|---|---|
A | 1 | 0 | 0 | 1 |
B | 0 | 1 | 1 | 1 |
C | 0 | 0 | 1 | 1 |
The simple formula here to find the similarity is,
Therefore, the Similarity (A,B) = 1/(4-1) = 1⁄3 = 0.333
This way we can find the similarity between two said users (let’s say A and B) - now knowing what A has liked and sensing the similarity, we can suggest similar movies that B may also like!
Cosine Similarity
We’re taking up our example scenario a notch higher so that you readers can understand better. Now, what if the users were to give some ratings for the movie they liked? Then, we won’t be able to find the similarity percentile between the users using juccard index, and that’s when Cosine Similarity comes to our rescue. Here, we will be using cosine distance to take advantage of the efficiency that we will be touching upon later.
So for some of the weighted values in our dataset (which will be the ratings), we will need to use cosine similarity.
User | Movie 1 | Movie 2 | Movie 3 | Movie 4 |
---|---|---|---|---|
A | 4 | 3 | 5 | |
B | 1 | 3 | ||
C | 3 | 2 | 4 |
Obviously, we can see that Similarity (A,B) < Similarity (A,C) But we need to prove that with our mathematical formulae. The steps to calculate cosine similarity is relatively simple,
Similarity (A,B) = 0.15
Similarity (A,C) = 0.22
From our calculations, we can say that they aren’t so different from one another - now we sense that there’s something wrong (Naveen also thinks so too). This loophole is because, according to our theory the empty spaces in the table are considered to have a zero rating - But they aren’t . So we now need another way to deal with this 🤔
Centered Cosine Similarity
There’s nothing major here (haha, you wish!) We are just going to find the row mean value and subtract them from the allotted values. At the end we’ll get a new table for which we need to find the cosine similarity. Simple yeah?
User | Movie 1 | Movie 2 | Movie 3 | Movie 4 | Mean |
---|---|---|---|---|---|
A | 4 | 3 | 5 | 12⁄3 | |
B | 1 | 3 | 4⁄2 | ||
C | 3 | 2 | 4 | 9⁄3 |
The new table will be as follows,
User | Movie 1 | Movie 2 | Movie 3 | Movie 4 |
---|---|---|---|---|
A | 0 | -1 | 1 | |
B | -1 | 1 | ||
C | 0 | -1 | 1 |
Now we need to apply and find the similarity formula,
Basically, prediction, is the easiest of all - apply a formula and add the subtracted mean back to the value.
The formula for prediction is, R(X,i) = for all values of Y [ Similarity(X,Y).R(Y,i) / Similarity(X,Y) ] i.e, Y = All the users.
So in our case it will be, R(A,3) = [Sim(A,B).R(B,3)+Sim(A,C).R(C,3)) / Sim(A,B)+Sim(A,C)]
r(A,3) = R(A,3) + Mean Value in ROW A
Now, that’s the predicted rating value! Simple Right!
This code is also hosted on Github. Have fun coding everyone!