This weekend I built an application that allows users to rate wines. Once a user has rated some wines, the application uses k-means clustering to recommend wines that the user hasn’t rated but might enjoy.
This wasn’t my first time writing Python code to apply the principles of machine learning to a problem. I had done that before for machine learning courses from Coursera, Udacity, and live bootcamps. These courses helped me learn the principles behind machine learning algorithms by asking me to implement the algorithms from scratch in IPython notebooks. Frustratingly, though, the courses never demonstrated how to take those skills out of the notebook and build algorithms into working software products. The wine recommender app gave me a chance to connect my existing skill set in software engineering to my understanding of machine learning. The result: my first functioning intelligent application.
The Stack: Python, Django
I chose to build my first intelligent application in Python because Python has well-documented open source tools for machine learning. The Coursera course familiarized me with the GraphLab Create suite from Turi, but for this project I switched to SciPy and scikit-learn because I wanted an open source alternative. I used Django because I had heard of it and knew it to be well-documented. I had also written Rails apps before and understood Django to be not wholly unlike Rails. The idea was to leverage as much of what I already knew as I could, so that I could focus on understanding how machine learning is applied in user-facing code.
Impressions of Scikit-Learn and Pandas
I used Pandas to bring CSV data into the app and store the datasets for use by the machine learning algorithm. I was pleasantly surprised by how easy it was to use, and I would definitely use it again for this purpose.
The scikit-learn APIs also felt very similar to the GraphLab Create ones that I knew. It would be possible to use these APIs without fully understanding the machine learning algorithms that they implement. A developer who understands only the basics and the use case could still use scikit-learn to create an intelligent application.
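To give a feel for that step, here is a minimal sketch of reading ratings from CSV and reshaping them into a user-by-wine matrix. It is not the app's actual code: the column names (`user_id`, `wine_id`, `rating`), the helper `load_ratings_matrix`, and the sample data are all made up for illustration, and treating an unrated wine as 0 is a simplifying assumption.

```python
import io

import pandas as pd

# Hypothetical CSV contents; the column names and ratings are invented.
CSV_TEXT = """user_id,wine_id,rating
1,1,5
1,3,4
2,1,1
2,2,2
3,1,5
3,2,5
3,3,4
"""


def load_ratings_matrix(csv_source):
    """Read long-format ratings and pivot them into a user x wine matrix."""
    ratings = pd.read_csv(csv_source)
    # One row per user, one column per wine; unrated wines become 0 (assumption).
    return ratings.pivot_table(
        index="user_id", columns="wine_id", values="rating", fill_value=0
    )


# read_csv accepts a file path or any file-like object, so StringIO keeps this self-contained.
ratings_matrix = load_ratings_matrix(io.StringIO(CSV_TEXT))
print(ratings_matrix)
```

In a real app the argument would more likely be a path to a CSV file or data pulled from the database.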
Now, that’s not my goal, so on to…
The wine recommender application uses k-means clustering to make its wine recommendations. So how does the algorithm work? Well, users rate wines in the application, so we have data about which users like which wines.
The k-means clustering algorithm takes in a matrix of users and wine ratings that looks a little like this:
|        | wine 1 | wine 2 | wine 3 |
Then it sorts users into groups based on the ratings they have made. Those groups represent users with similar taste in wine. To make recommendations, the application finds the other users in the logged-in user’s group and recommends wines that those groupmates have rated highly but the logged-in user has not rated. For example, user 1 and user 3 above have similar ratings for wines 1 and 3, so the algorithm might group them together and recommend wine 2 to user 1, since user 3 rated it quite highly.
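The grouping-then-recommending idea can be sketched with scikit-learn's `KMeans`. This is an illustration under stated assumptions, not the app's implementation: the ratings are invented, 0 stands in for "not rated" (which also nudges the distances between users), and the `recommend` helper with its `min_rating` threshold is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented ratings: rows are users, columns are wines, 0 means "not rated".
ratings = np.array([
    [5, 0, 4],  # user 1
    [1, 2, 0],  # user 2
    [5, 5, 4],  # user 3
])


def recommend(ratings, user_index, n_clusters=2, min_rating=4):
    """Return column indices of unrated wines that the user's groupmates rated highly."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = model.fit_predict(ratings)

    # Everyone who landed in the same cluster, excluding the user themselves.
    same_group = np.flatnonzero(labels == labels[user_index])
    groupmates = [i for i in same_group if i != user_index]
    if not groupmates:
        return []

    recommendations = []
    for wine in np.flatnonzero(ratings[user_index] == 0):
        mate_ratings = ratings[groupmates, wine]
        rated = mate_ratings[mate_ratings > 0]  # ignore groupmates who skipped this wine
        if rated.size and rated.mean() >= min_rating:
            recommendations.append(int(wine))
    return recommendations


print(recommend(ratings, user_index=0))
```

Refitting the model on every call keeps the sketch short; a real application would more likely cluster periodically and store each user's group.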
The K in the name refers to the number of groups that the algorithm forms from the data, and the means refers to the centroid of each group: the average of the data points assigned to it. The algorithm alternates between assigning each point to its nearest centroid and recomputing the centroids until the groups stop changing. There are a few different methods for choosing the starting centroids. The Wikipedia article provides a terse explanation of the Forgy and Random Partition methods if you would like to understand more.
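For the curious, here is a rough NumPy sketch of those two initialization methods, plus a single assign-and-update step. The function names and sample points are my own for illustration. One deliberate departure: the Random Partition version below deals out labels evenly before shuffling so that no cluster starts empty, whereas the textbook version assigns each point to a cluster purely at random.

```python
import numpy as np


def forgy_init(points, k, rng):
    """Forgy: use k distinct, randomly chosen observations as the initial means."""
    chosen = rng.choice(len(points), size=k, replace=False)
    return points[chosen].astype(float)


def random_partition_init(points, k, rng):
    """Random Partition: give every point a random cluster, then average each cluster."""
    # Balanced labels, shuffled, so every cluster has at least one point (a variation).
    labels = rng.permutation(np.arange(len(points)) % k)
    return np.array([points[labels == c].mean(axis=0) for c in range(k)])


def lloyd_step(points, means):
    """One k-means iteration: assign points to the nearest mean, then recompute means."""
    distances = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    new_means = np.array([
        points[labels == c].mean(axis=0) if np.any(labels == c) else means[c]
        for c in range(len(means))
    ])
    return labels, new_means


# Invented 2-D points forming two loose groups.
points = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
rng = np.random.default_rng(0)

forgy_means = forgy_init(points, k=2, rng=rng)
partition_means = random_partition_init(points, k=2, rng=rng)
labels, updated_means = lloyd_step(points, forgy_means)
```

Forgy tends to spread the starting means out across the data, while Random Partition starts them all near the overall center.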
Immense credit goes to this tutorial for walking me through the process of including machine learning algorithms in a Django application. Though the tutorial is not entirely complete (I had to figure out several issues on my own), it’s a thorough walkthrough with which a semi-seasoned programmer should be able to succeed.
This project gave me a much more complete understanding of a) how Django applications are structured and b) where and how machine learning algorithms would fit into the structure of an application.
The next plan is to write an application that uses regression analysis. More on that application later.