What is content-based filtering, how does it work, and who uses it?

Content-based filtering is one of the most popular types of recommender system on websites. It is simple, easy to implement, and sometimes performs even better than collaborative filtering. But is it the best method? What are its advantages? In this article I will answer these questions and explain what content-based filtering is and how it works. We will also take a look at real-life examples and companies that use this method in their personalization strategy.

What is content-based filtering?

We usually distinguish three types of recommender systems that are widely used across the Internet:

Rule-based personalization, which is based on decision rules tied to specific behavior, preferences, or demographic criteria
Collaborative filtering, which analyzes the behavior of similar users and recommends products based on their preferences and decisions
Content-based filtering, which uses similarities between products to recommend products that match the user's preferences

We can define content-based filtering as filtering that uses similarities between product names, parameters, attributes, descriptions, and other product data to present products similar to those that attracted the user in the past. To provide recommendations, the system uses only that particular user's profile data, such as previous orders, recently viewed products, searched products, or product ratings.

But today's content-based filtering is far more complex, and we can use more information to deliver better recommendations. There are two main trends in the content-based filtering approach:

Data-driven – using all available product information: not only metadata, but also user-created data such as ratings, tags, and comments. Moreover, some content-based recommender systems use image and multimedia analysis to get more data for personalization.
Algorithmic – using metadata algorithms (which connect clients, products, or even sub-brands and locations to combine data into a multi-dimensional structure) and machine learning algorithms (which allow similarities to be found more efficiently).

Content-based filtering is still evolving. It has grown from a simple recommendation process that compared product names and descriptions into a complex solution that lets marketers automate the whole procedure in a few clicks and use much more data and much more complex algorithms, including machine learning and big data.

How content-based filtering works

The most important thing in a content-based filtering implementation is... data. The more data you have, the better and more accurate the recommendations you can provide. Some information is explicit (provided directly by the user, such as product ratings), and some is implicit (inferred from the user's behavior, such as viewed or searched products).

Content-based filtering uses only the specific user's data, so you don't need data from other users to create recommendations. To show relevant recommendations, you need to set up a similarity metric. A common choice is the dot product of the product's feature vector $x$ and the user's preference vector $y$, where $d$ is the number of features:

$\Large \langle x, y \rangle = \sum_{i=1}^{d} x_i y_i$

Without going into the mathematical details, we can illustrate this with a simple matrix. Each of four products matches some of four topics, and a specific user is interested in two of these topics, as you can see below:

Type	Product1	Product2	Product3	Product4	User
Topic1	1	0	0	0	0
Topic2	0	1	0	1	1
Topic3	0	1	1	0	0
Topic4	1	0	0	1	1
Matched topics	1	1	0	2

The first and second products each match only one topic that the user is interested in (Topic4 for Product1 and Topic2 for Product2). The third product (Product3) matches none of the user's topics. But the last product (Product4) matches two topics (Topic2 and Topic4). This is the signal for the content-based recommendation system to recommend this product to the user.
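
The matching above can be sketched in a few lines of Python. This is only a minimal illustration, assuming that each product and the user are encoded as 0/1 topic vectors copied from the table. The names `products`, `user`, and `dot` are chosen for this example and are not part of any particular library.

```python
# Topic indicator vectors copied from the table (order: Topic1..Topic4).
products = {
    "Product1": [1, 0, 0, 1],
    "Product2": [0, 1, 1, 0],
    "Product3": [0, 0, 1, 0],
    "Product4": [0, 1, 0, 1],
}
user = [0, 1, 0, 1]  # the user is interested in Topic2 and Topic4


def dot(x, y):
    """Inner product: the sum of x_i * y_i over all d dimensions."""
    return sum(xi * yi for xi, yi in zip(x, y))


# With 0/1 vectors, the dot product equals the number of matched topics.
scores = {name: dot(vector, user) for name, vector in products.items()}
best = max(scores, key=scores.get)

print(scores)
print(best)
```

With binary vectors the score is simply a count of shared topics, which is why Product4 comes out on top. A real system would likely use richer feature values than 0 and 1.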

Content-based filtering pros and cons

Like all recommender systems, content-based filtering has its own advantages and disadvantages. They determine the scenarios in which this method is relevant and those in which it is better to use something else.

Advantages

limited to a specific user – in terms of privacy, content-based filtering uses only the data related to a specific user, which means the system won't use other users' data to prepare recommendations
effective for all users – even if a user has unique preferences, content-based filtering will recommend products for them
easy to implement – content-based filtering is widely used in many ecosystems and can be easily implemented on a large scale
easy to start – even with a small amount of data, content-based filtering can create tailored recommendations

Disadvantages

lack of diversity – content-based recommendations are usually straightforward and may be less interesting than other recommendations
excessive specialization – because they strictly match the user's preferences, recommendations can sometimes be very narrow
not so smart – with basic matching criteria and few filter parameters, content-based recommendations won't be as smart as other recommendations

To sum up, content-based filtering is a great recommendation method, but not as smart as one might want. It's easy to implement and quite effective, but for advanced scenarios it's better to combine it with other methods.

Content-based filtering examples

Let's consider a simple scenario. You are reading a blog post about content-based filtering that is tagged with the recommender systems tag. If you spend enough time (for example, one minute) reading this article, I can mark your profile as someone interested in the recommender systems topic. Then the content-based recommender system will show you similar articles, for example, this one about content-based filtering.
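
The blog scenario could be sketched roughly as follows. The one-minute threshold comes from the text. Everything else is an assumption made for illustration: the article titles, the tags, and the helper names `interested_tags` and `recommend` are all hypothetical.

```python
READ_THRESHOLD_SECONDS = 60  # "one minute", as in the scenario

# Hypothetical catalog: article title -> set of tags.
articles = {
    "What is content-based filtering?": {"recommender systems"},
    "Collaborative filtering basics": {"recommender systems"},
    "Email subject line tips": {"email marketing"},
}


def interested_tags(reading_log):
    """Collect the tags of every article read for at least the threshold."""
    tags = set()
    for title, seconds in reading_log:
        if seconds >= READ_THRESHOLD_SECONDS:
            tags |= articles[title]
    return tags


def recommend(reading_log):
    """Return unread articles that share at least one tag of interest."""
    tags = interested_tags(reading_log)
    already_read = {title for title, _ in reading_log}
    return [
        title
        for title, article_tags in articles.items()
        if title not in already_read and article_tags & tags
    ]


# The reader spent 75 seconds on the content-based filtering post.
recommendations = recommend([("What is content-based filtering?", 75)])
print(recommendations)
```

Here the reading time is an implicit signal: the user never says they like the topic, and the profile is inferred from behavior alone.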

Now let's move to a more advanced scenario. You bought a brand-new corner settee for your apartment in an online furniture shop. From the order, we know that this settee has light gray upholstery and is made in the Scandinavian style. The analytics system also noticed that you searched for coffee tables using the internal search. Algorithms can suggest other Scandinavian living room furniture (based on information about product type and product style). Let's create a detailed table for this scenario:

Type	Corner settee	Coffee table 1	Coffee table 2	Pillow 1
Scandinavian style	1	1	0	0
Light gray	1	0	0	1
Living room	1	1	1	1
Searched by user	1	1	1	0
"Enkelhet" collection	1	1	0	1
Matched topics		4	2	3

As we can see, the product nearest to the corner settee is Coffee table 1. It has a Scandinavian style, belongs to the living room category, was searched for by the user, and comes from the same furniture collection. The second coffee table scored only two points, as it is not made in the Scandinavian style and does not come from the same furniture collection.
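
The ranking in the furniture table can be reproduced with a short sketch. It assumes the attribute rows are encoded as 0/1 values in table order. The names `ATTRIBUTES`, `matched_attributes`, and `ranking` are illustrative only.

```python
# Attribute order follows the rows of the furniture table.
ATTRIBUTES = [
    "Scandinavian style",
    "Light gray",
    "Living room",
    "Searched by user",
    '"Enkelhet" collection',
]

corner_settee = [1, 1, 1, 1, 1]  # the purchased item, used as the reference

candidates = {
    "Coffee table 1": [1, 0, 1, 1, 1],
    "Coffee table 2": [0, 0, 1, 1, 0],
    "Pillow 1": [0, 1, 1, 0, 1],
}


def matched_attributes(item, reference):
    """Names of the attributes that both the item and the reference have."""
    return [
        name
        for name, a, b in zip(ATTRIBUTES, item, reference)
        if a and b
    ]


# Sort candidates by the number of shared attributes, highest first.
ranking = sorted(
    candidates,
    key=lambda name: len(matched_attributes(candidates[name], corner_settee)),
    reverse=True,
)

for name in ranking:
    shared = matched_attributes(candidates[name], corner_settee)
    print(name, len(shared), shared)
```

Returning the attribute names rather than only the count makes it easy to explain why a product was recommended.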

But let's look at second place. A pillow can be quite a nice addition to the user's new settee. It comes from the same collection and has the same color. But will a pillow in the same color as the sofa always look nice? That's why we need to be careful when using content-based filtering and make sure that only the relevant parameters are considered in a specific matching.

We can adjust our recommendation criteria and, what's more, use advanced algorithms to check which pillow colors match the sofa upholstery, and then recommend cushions that will be the perfect addition.
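
One simple way to adjust the criteria is to give each attribute a weight and set the weight to zero when that attribute should not count for a given matching. The sketch below assumes the 0/1 attribute values from the furniture table. The weights themselves are hypothetical and chosen only for illustration. Checking which colors actually go well together is outside the scope of this sketch.

```python
ATTRIBUTES = [
    "Scandinavian style",
    "Light gray",
    "Living room",
    "Searched by user",
    '"Enkelhet" collection',
]

corner_settee = [1, 1, 1, 1, 1]
pillow = [0, 1, 1, 0, 1]

# Equal weights reproduce the plain "matched topics" count.
equal_weights = {name: 1.0 for name in ATTRIBUTES}

# Hypothetical adjustment: ignore color when matching pillows to a settee.
pillow_weights = dict(equal_weights)
pillow_weights["Light gray"] = 0.0


def weighted_score(item, reference, weights):
    """Weighted dot product: the sum of w_i * x_i * y_i over all attributes."""
    return sum(
        weights[name] * a * b
        for name, a, b in zip(ATTRIBUTES, item, reference)
    )


plain = weighted_score(pillow, corner_settee, equal_weights)
adjusted = weighted_score(pillow, corner_settee, pillow_weights)
print(plain, adjusted)
```

With the color weight switched off, the pillow no longer gets a point just for sharing the settee's color.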

P.S. Of course, please keep in mind that this scenario has been simplified for explanation purposes. :)