Recommendation Engines

by Krishna on October 9, 2011

I just finished reading “Collective Intelligence in Action”, a book by Satnam Alag about using information from user actions to improve the working and usability of applications. This field has seen much interest in recent years. Most web applications, especially Web 2.0 apps, have massive amounts of data flowing through them in the form of user-generated content and interactions, and that data can be tapped to make the user experience much smoother and to further the goals of the application itself (such as greater sales for an e-commerce site). I recommend the book, though its frequent dives into programming code may not be what you are interested in, and it is a little old (Cuil, anyone?).

There were a few takeaways from the book. One is that even if you don’t have lots of knowledge or experience in the field, there are open source packages to get you started easily. The obvious one is Apache Lucene for search, and the book also provides other examples for clustering, data mining, etc. Another is that a mathematical (and especially statistical) background can help you easily grasp some of the major concepts such as matrices, vectors, correlation, etc. But if you are smart, I am sure that putting in some hard work will help you understand the material over time. There is also much to be said for having some existing general knowledge relevant to the field, such as language usage patterns. Again, these are learnable concepts. Start with Wikipedia and expand out to more specialized links and material.

I was particularly interested in the last chapter of the book, which was about building recommendation engines. When it comes to marketing, you either have to work with a user you know nothing about (which means generic marketing) or one you know something about. The latter is an interesting specimen. You might start with pure demographic information (age, location, income, etc.) and, over time, gather more information about their online activities. If you are someone like Google and the user is a heavy user of Google products, you know (or could know) what they search for, what kind of emails they receive and send, what news, blogs and people they read, what products they are interested in and so on. How do you use that information to tailor the content you are serving to them (with Google, this is mainly ads)?

So, it seems that there are two main approaches: one is “related items” and the other is “related users”. In the first scenario, if a user likes one item, then related items are shown to them. So maybe if I search for “Canon camera”, I also get ads for Nikon. In the second scenario, we look for users similar to the current user and then serve up items liked by those users. So, a person similar to me and doing the same search is an amateur photographer and may also be interested in camera lenses or travel to scenic places. There are pros and cons to both approaches, depending on the number of items, the number of users, how often the items change and so on. It may also be computationally difficult to implement a related-users approach when the number of records is very large.
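To make the two approaches concrete, here is a minimal Python sketch. It is not taken from the book: the users, the items and the choice of Jaccard overlap as the similarity measure are all my own assumptions, picked only to keep the example small.

```python
# Hypothetical toy data: which users liked which items.
likes = {
    "alice": {"canon_camera", "camera_lens", "tripod"},
    "bob": {"canon_camera", "nikon_camera"},
    "carol": {"canon_camera", "camera_lens", "travel_guide"},
    "dave": {"nikon_camera", "tripod"},
}


def jaccard(a, b):
    """Overlap between two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def related_items(item, likes):
    """'Related items': rank other items by how much their audiences overlap."""
    fans = {}  # item -> set of users who liked it
    for user, items in likes.items():
        for i in items:
            fans.setdefault(i, set()).add(user)
    scores = {other: jaccard(fans[item], f)
              for other, f in fans.items() if other != item}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))


def related_users_items(user, likes):
    """'Related users': score unseen items by the similarity of their fans."""
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        sim = jaccard(likes[user], items)
        for i in items - likes[user]:
            scores[i] = scores.get(i, 0.0) + sim
    return sorted(scores, key=lambda i: (-scores[i], i))


print(related_items("canon_camera", likes))
print(related_users_items("bob", likes))
```

Note that the related-users function has to compare the current user against every other user, which hints at why that approach gets expensive as the number of users grows.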

The other aspect of recommendation engines is how you define similarity. Collaborative techniques use user actions, but you could also use a content-based technique where you analyze text associated with the user (such as user comments) to make a determination about recommendations. The book cites the classic example of how Google became the predominant search engine by focusing on the former (by using incoming page links) instead of the text parsing used by the then-leading search engines.
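As a rough illustration of the content-based side, text can be turned into word-count vectors and compared with cosine similarity. This is only a sketch under my own assumptions: the item descriptions are made up, and a real system would at least remove common words and weight rare terms more heavily.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Turn text into a word-count vector (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two word-count vectors; 0.0 if either is empty."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical item descriptions.
dslr = bag_of_words("digital camera with zoom lens")
lens = bag_of_words("zoom lens for digital camera")
cookbook = bag_of_words("easy recipes for weeknight dinners")

print(cosine(dslr, lens))      # shares four of five words
print(cosine(dslr, cookbook))  # shares no words
```

No user actions are involved here at all, which is the key difference from the collaborative techniques: the similarity comes purely from the text.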

Recommendation engines use collaborative filtering for making predictions of user likes. Memory-based algorithms use user ratings of items, but this can be a problem when ratings are hard to come by for certain records. For example, a movie database may have many movies that only a few users have rated (because they are new or not popular), even though others may have hundreds or thousands of ratings. Model-based algorithms (several are available) can solve this problem, though they can be expensive.
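Here is a small sketch of a memory-based prediction, along with the sparse-ratings problem described above. The ratings are invented, and the specific choices (Pearson correlation between users, a plain weighted average over positively correlated neighbours) are simplifying assumptions of mine rather than the book's exact algorithm.

```python
import math

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"movie_a": 5, "movie_b": 3, "movie_c": 4},
    "ben": {"movie_a": 4, "movie_b": 2, "movie_c": 5, "movie_d": 4},
    "cat": {"movie_a": 1, "movie_b": 5, "movie_d": 2},
}


def pearson(r1, r2):
    """Pearson correlation over co-rated items; 0.0 if it cannot be computed."""
    common = sorted(set(r1) & set(r2))
    if len(common) < 2:
        return 0.0
    x = [r1[i] for i in common]
    y = [r2[i] for i in common]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def predict(user, item, ratings):
    """Similarity-weighted average of neighbours' ratings for the item.

    Returns None when no positively correlated neighbour has rated the
    item, which is the sparse-data problem in miniature.
    """
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        sim = pearson(ratings[user], theirs)
        if sim > 0:  # ignore neighbours with opposite or unrelated taste
            num += sim * theirs[item]
            den += sim
    return num / den if den else None


print(predict("ann", "movie_d", ratings))  # driven by ben, whose taste matches ann's
print(predict("ann", "movie_e", ratings))  # nobody has rated it, so None
```

With only three users the second call shows the weakness immediately: an item without ratings from similar users simply cannot be scored this way.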

So, that is that. I do think that recommendation engines still have a way to go. For example, I am not sure that existing sites (like Netflix, Google, Facebook, etc.) use the concept of “family” as opposed to “user” when doing recommendations. At home, my wife and I switch between computers all the time, and “my” data is not necessarily mine, because I could have been signed into Google while she was searching (and vice versa). So all the recipe searches are (mostly) not mine, and most of the technology searches are not hers. Likewise, a Netflix account is shared by the whole family. So there is much scope to evaluate the mixing of user action data between different household members.

{ 2 comments }

Eric Kawalsky October 11, 2011 at 4:10 pm

Mr. Kumar,

You bring up intriguing issues about how to best create a recommendation engine. As someone who is particularly interested in engineering design problems, I am instantly drawn to the difficulties in designing a recommendation engine and how a software engineer makes decisions that will both please the target audience and fulfill the engine’s purpose (whether it is more effective marketing or otherwise). I notice both design problems you present pertain to the situation you would like to design for. In the first design aspect, you ask, “How do you use that information to tailor the content?” Of course, the question is not a simple one to answer because, as you point out, it depends on factors like the size of the databases and variability in a user’s search terms. However, I find the interaction among all these factors particularly fascinating, especially the way these interactions inform the programmer’s decisions. One interesting example of a company that makes these judgments is brainpark.com. This company provides a recommendation engine that searches within the client company to connect disparate users to provide “software [that] learns what you are working on…and feeds you similar and related work that is being done or already has been done before.”

The other design issue you present informs the programmer’s decisions on how to create the recommendation engine by defining parameters for the solution to the design problem. However, it also touches on a common problem engineers run into: the definition of the problem; in this case, how do we define similarity? The definition of the problem is an exciting phase of engineering design because it is a very easy phase to be creative in. Indeed, you already point out that “Google became the predominant search engine by focusing on [collaborative techniques rather than content-based techniques].” How have you found that the definition of a problem influences your software development projects? Thank you for sharing these interesting issues that arise while trying to build a recommendation engine.

Krishna October 12, 2011 at 9:45 am

Thanks for your comment, Eric.
