I recently had the pleasure of reading James Scott’s “Seeing Like a State,” which examines a certain strain of failure in large centrally-organized projects. These failures come down to the kinds of knowledge available to administrators and governments: aggregates and statistics, as opposed to the kinds of direct experience available to the people living ‘on the ground,’ in situations where the centralized knowledge either fails to or has no chance to describe a complex reality. The book classifies these two different kinds of knowledge as techne (general knowledge) and metis (local knowledge). In my reading, the techne – in both strengths and shortcomings – bears similarity to the knowledge we obtain from traditional algorithms, while metis knowledge is just starting to become available via statistical learning algorithms.
In this (kinda long) post, I will outline some of the major points of Scott’s arguments, and look at how they relate to modern machine learning. In particular, the divides Scott observes between the knowledge of administrators and the knowledge of communities suggest an array of topics for research. Beyond simply looking at the difference between the ways that humans and machines process data, we observe areas where traditional, centralized data analysis has systematically failed. And from these failures, we glean suggestions of where we need to improve machine learning systems to be able to solve the underlying problems.
I recently read Edward Tufte’s ‘Visualizing Quantitative Information,’ a classic book on visualizing statistical data. It reads a little bit like the ‘Elements of Style’ for data visualization: Instead of ‘omit needless words,’ we have ‘maximize data-ink.’ Indeed, the primary goal of the book is to establish some basic design principles, and then show that those principles, creatively applied, can lead to genuinely new modes of representing data.
One of my favorite graphics in the book was a scatter plot adapted from a physics paper, mapping four dimensions in a single graphic. It’s pretty typical to deal with data with much more than three dimensions; I was struck by the relative simplicity with which this scatter plot was able to illustrate four dimensional data.
I hacked out a bit of python code to generate similar images; here’s a 4D scatter plot of the Iris dataset:
I met up with some mathematician friends in Toronto yesterday, who were interested in how one goes about getting started on machine learning and data science and such. There’s piles of great resources out there, of course, but it’s probably worthwhile to write a bit about how I got started, and place some resources that might be of more interest to people coming from a similar background. So here goes.
First off, it’s important to understand that machine learning is a gigantic field, with contributions coming from computer science, statistics, and occasionally even mathematics… But on the bright side, most of the algorithms really aren’t that complicated, and indeed they can’t be if they’re going to run at scale. Overall though, you’ll need to learn some coding, algorithms, and theory.
Oh, and you need to do side-projects. Get your hands dirty with a problem quickly, because it’s the fastest way to actually learn.
Recently I’ve seen a couple nice ‘visual’ explanations of principal component analysis (PCA). The basic idea of PCA is to choose a set of coordinates for describing your data where the coordinate axes point in the directions of maximum variance, dropping coordinates where there isn’t as much variance. So if your data is arranged in a roughly oval shape, the first principal component will lie along the oval’s long axis.
My goal with this post is to look a bit at the derivation of PCA, with an eye towards building intuition for what the mathematics is doing.
This was a bit different from other Kaggle competitions. Typically, a Kaggle competition will provide a large set of data and want to optimize some particular number (say, turning anonymized personal data into a prediction of yearly medical costs). The dataset here intrigued me because it’s about learning from and reconstructing graphs, which is a very different kind of problem. In this post, I’ll discuss my approach and insights on the problem.
I attended a really nice talk by Arash Amini yesterday about detecting ‘communities’ in sparse graphs. The basic problem is: In a big graph (like the Facebook graph, or the graph of scientific papers citations) you have clusters of friends/associates, and you want to pick out those clusters. Dr. Amini and his collaborators have been working on methods to solve this problem in particularly noisy cases. The methods presented were developed for cases where you know the specific number of clusters you are seeking to find. It was a fascinating talk, and I came away wondering if there are good methods to use when you aren’t sure of how many clusters there are.
I’ve finally had a bit of time to finish up the code for the Rock-Paper-Scissors bot discussed in the previous posts; I’ve put the code in a GitHub repository here. Feel free to file issues if you would like it to be a bit more user friendly.
The bot which uses the Fourier transform on move probabilities to search for profitable patterns is named `diaconis`, after Persi Diaconis. It’s currently working just fine, but is a bit slow computationally. On startup, it generates all possible move patterns that it will investigate during the course of each game, as well as some character tables. After each play, it tests some of these patterns, and tries to determine if their predictive power is better than any of the patterns seen thus far. If so, it begins using that pattern to choose move probabilities.
This works fine as a proof-of-concept of the basic ideas. Additional improvements could be had by doing some code optimization to speed things up a bit, and keeping a list of good patterns and allowing a bit more dexterity in switching between the patterns used for prediction.
In the last post, we looked at using an algorithm suggested by Bayes’ Theorem to learn patterns in an opponent’s play and exploit them. The game we’re playing is iterated rock-paper-scissors, with 1000 rounds of play per game. The opponent’s moves are a string of choices, ‘r’, ‘p’, or ‘s’, and if we can predict what they will play, we’ll be able to beat them. In trying to discover patterns automatically we’ll gain some general knowledge about detecting patterns in streams of characters, which has interesting applications ranging from biology (imagine ‘GATC’ instead of ‘rps’) to cryptography.
Fourier analysis is helpful in a wide variety of domains, ranging from music to image encoding. A great example suggested by ‘Building Machine Learning Algorithms with Python‘ is classifying pieces of music by genre. If we’re given a wave-form of a piece of music, automatically detecting its genre is difficult. But applying the Fourier transform breaks the music up into its component frequencies, which turn out to be quite useful in determining whether a song is (say) classical or metal.
I’ve recently been doing some reading on machine learning with a mind towards applying some of my prior knowledge of representation theory. The initial realization that representation theory might have some interesting applications in machine learning came from discussions with Chris Olah at the Toronto HackLab a couple months ago; you can get interesting new insights by exploring new spaces! Over winter break I’ve been reading Bishop’s ‘Pattern Recognition and Machine Learning‘ (slowly), alongside faster reads like ‘Building Machine Learning Systems with Python.‘ As I’ve read, I’ve realized that there is plenty of room for introducing group theory into machine learning in interesting ways. (Note: This is the first of a few posts on this topic.)
There’s a strong tradition of statisticians using group theory, perhaps most famously Persi Diaconis, who used representation theory of the symmetric group to find the mixing time for card shuffling. His notes ‘Group Representations in Probability and Statistics‘ are an excellent place to pick up the background material with a strong eye towards applications. Over the next few posts I’ll make a case for the use of representation theory in machine learning, emphasizing automatic factor selection and Bayesian methods.
First, an extremely brief overview of what machine learning is about, and an introduction to using a Bayesian approach to play RoShamBo, or rock paper scissors. In the second post, I’ll motivate the viewpoint of representation by exploring the Fourier transform and how to use it to beat repetitive opponents. Finally, in the third post I’ll look at how we can use representations to select factors for our Bayesian algorithm by examining likelihood functions as functions on a group.