So when do you have enough data, and how do you make the decision that your success rate is ‘good enough?’ In this post, we’ll look at how the Beta distribution helps us answer this question. First, we’ll get some intuition for the Beta distribution, and then discuss why it’s the right distribution for the problem.

Consider the widget’s tests as independent boolean variables, governed by some hidden parameter **μ**, so that the test succeeds with probability **μ**. Our job, then, is to estimate this parameter: We want a model for P(**μ**=x | **s**, **f** ), the probability distribution of **μ**, conditioned on our observations of success and failure. This is a continuous probability distribution, with **μ** a number between zero and one. (This general setup, by the by, is called ‘*parameter estimation*‘ in the stats literature, as we’re trying to estimate the parameters of a well-known distribution.)

The ‘natural’ distribution to consider for P is the Beta distribution. This is given by:

$$P(\mu = x \mid s, f) = \frac{x^{s-1}(1-x)^{f-1}}{B(s, f)}$$

where $B(s, f)$ is the Beta function, which normalizes P so that its integral is 1. (As it turns out, $B(s, f) = \frac{(s-1)!\,(f-1)!}{(s+f-1)!}$ when **s** and **f** are integers.) I’ll come back to why the Beta distribution is a ‘natural’ choice later.

Looking at this formula and the plot of P for eight successes and two failures, a few things stand out.

- The formula is going to have asymptotes if **s** or **f** are zero. It’s generally a good idea to **include ‘phantom’ observations of one success and one failure**. We’ll do this in all of the plots to come. (More on this later, in the discussion of why the Beta distribution is natural.)
- The numerator indicates that P is 0 at x=0 and x=1 whenever **s** and **f** are bigger than 1, which provides a small sanity check: Once we observe a failure, it’s impossible for **μ** to be 1.
- The peak of the distribution is actually different from the mean. If you’ve only ever looked at normal distributions, this may be surprising: the Beta distribution has a more complicated shape than a normal distribution. In this case, the **most likely** value of **μ** based on our observations so far is *bigger* than 0.8. (In fact, it’s about 0.87.)
- The **expected value** of the probability distribution is 0.8, exactly the ratio of successes.
- Once we add a phantom success and failure, the most likely value (MLE) becomes 0.8, the ratio of observed successes, but the expected value is a bit lower. In fact, this is always the case: With the phantom success and failure, the MLE is **s**/(**s**+**f**).
- The cumulative distribution function (CDF) is P(**μ**≤x), so one minus the CDF is P(**μ**>x). This is super useful for picking a **confidence bound**, and answering questions like ‘what’s the value **x** where we’re 90% sure that **μ** is greater than **x**?’ (In this case, without the phantom observations, that bound is just about 0.63.)
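The numbers in the list above can be checked with a few lines of Python. This is a minimal sketch using only the standard library: `beta_cdf` and `lower_bound` are hypothetical helper names, and the closed-form CDF used here only holds when both parameters are positive integers. (For general parameters, `scipy.stats.beta` provides `cdf` and `ppf` methods that serve the same purpose.)

```python
from math import comb

def beta_cdf(x, a, b):
    """P(mu <= x) for a Beta(a, b) distribution with positive integer a, b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a, n + 1))

def lower_bound(a, b, confidence=0.90, tol=1e-9):
    """Find x such that P(mu > x) equals `confidence`, by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if 1 - beta_cdf(mid, a, b) >= confidence:
            lo = mid  # still confident enough: the bound can move up
        else:
            hi = mid
    return lo

s, f = 8, 2                       # eight successes, two failures
mean = s / (s + f)                # expected value of Beta(s, f)
mode = (s - 1) / (s + f - 2)      # peak of the PDF, valid when s, f > 1
bound = lower_bound(s, f)         # 90% sure that mu exceeds this value

# With one phantom success and one phantom failure: Beta(s + 1, f + 1).
phantom_mode = s / (s + f)
phantom_mean = (s + 1) / (s + f + 2)

print(mean, mode, round(bound, 3), phantom_mode, phantom_mean)
```

The mode and mean come from the standard closed forms for the Beta distribution; only the confidence bound needs a numerical search.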

So just going with the maximum likelihood estimate isn’t giving us very much certainty about **μ**, precisely (as it turns out) because we haven’t collected enough data yet. What happens when we collect more data? Here’s four plots with more observations and the same overall observed success ratio, with a phantom success and failure thrown in.

As we add more data, the PDF becomes more and more tightly distributed around the expected value, 0.8: We have more certainty that the expected value is really close to the observed success ratio.
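One way to put a number on this tightening is the standard deviation of the distribution, which has a closed form for Beta(a, b). The sketch below is illustrative only: the sample sizes are hypothetical (not necessarily the ones plotted), and `beta_sd` is a helper name introduced here.

```python
from math import sqrt

def beta_sd(a, b):
    """Standard deviation of a Beta(a, b) distribution."""
    return sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Hypothetical sample sizes, each with an observed success ratio of 0.8.
for n in (10, 100, 1000):
    s = 4 * n // 5
    f = n - s
    # Add one phantom success and one phantom failure, as in the plots.
    print(n, round(beta_sd(s + 1, f + 1), 4))
```

The spread shrinks roughly like one over the square root of the number of observations, which is why the first few dozen tests buy far more certainty than the last few dozen.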

What we’ve gained by looking at the distribution is a concrete way to measure our certainty about our estimate of **μ**. This is kinda great. By the time we’ve collected a thousand observations here, we’re pretty sure that **μ** is at least 0.79.

We also open ourselves up to a new testing strategy: By setting a confidence cutoff, we can **run tests until we’re reasonably sure that we’ve passed the bound, and then stop**. In this example, we can stop a little while after the hundredth test, when we’re 90% certain that **μ** is at least 0.75, which was the launch criterion. If the distribution tells us we’re doing great, we can stop earlier and save time on testing. (Conversely, we can set an upper bound, so we can stop testing when we’re pretty sure we’ve missed the target and tell the engineers to go back to the drawing board.)
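A stopping rule along these lines might be sketched as follows. Everything here is an assumption made for illustration: the function names, the `min_tests` floor (added so that a single early failure cannot end the run), the `max_tests` cap, and the use of the same 90% cutoff for both passing and failing. The posterior is Beta(s + 1, f + 1), that is, the observed counts plus one phantom success and failure.

```python
from math import exp, lgamma, log

def prob_mu_above(x, s, f):
    """P(mu > x) under Beta(s + 1, f + 1), for integer counts and 0 < x < 1."""
    a, b = s + 1, f + 1
    n = a + b - 1
    log_x, log_1mx = log(x), log(1 - x)
    cdf = 0.0
    for j in range(a, n + 1):
        # Binomial term computed in log space to avoid overflow for large n.
        log_binom = lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
        cdf += exp(log_binom + j * log_x + (n - j) * log_1mx)
    return 1.0 - cdf

def run_until_decided(run_test, target=0.75, confidence=0.90,
                      min_tests=20, max_tests=1000):
    """Run tests until we are confident mu is above (or below) the target."""
    s = f = 0
    while s + f < max_tests:
        if run_test():
            s += 1
        else:
            f += 1
        if s + f < min_tests:
            continue
        p = prob_mu_above(target, s, f)
        if p >= confidence:
            return "pass", s, f
        if p <= 1 - confidence:
            return "fail", s, f
    return "undecided", s, f

# A widget that never fails passes as soon as the minimum is reached.
print(run_until_decided(lambda: True))
```

In practice `run_test` would wrap a real test of the widget. Checking the bound after every test is a design choice that trades some statistical rigor for convenience, and it is worth keeping in mind when picking the cutoff.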

This comes down to Bayes’ theorem and some nice properties of the Beta distribution. Applying Bayes’ theorem to P(**μ**=x | **s**, **f**), we get:

$$P(\mu = x \mid s, f) = \frac{P(s, f \mid \mu = x)\, P(\mu = x)}{P(s, f)}$$

The right side breaks up into three parts: The *likelihood function*, P(**s**, **f** | **μ**=x), is easy to compute: It’s $x^{s}(1-x)^{f}$. The denominator is a *normalizer*, which ensures that the total probability is one. Which leaves just the *prior distribution*, P(**μ**=x), to worry about.

It turns out that the Beta distribution is a **conjugate prior** for this likelihood, meaning that when the prior P(**μ**=x) is a Beta distribution, then the posterior P(**μ**=x | **s**, **f**) is a Beta distribution as well. It also turns out that the Beta distribution with parameters (**1**, **1**) is the uniform distribution on [0, 1]. So, if we start the day with the belief that **μ** is equally likely to be any number in the range [0, 1], then we end up with a Beta distribution with parameters (**s+1**, **f+1**) as the posterior.
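The claim can be checked in one line. As a sketch, write the prior as a Beta distribution with parameters $a$ and $b$ (symbols introduced here for illustration), and take the likelihood of **s** successes and **f** failures to be proportional to $x^{s}(1-x)^{f}$. Dropping the normalizers:

$$
P(\mu = x \mid s, f) \;\propto\; x^{s}(1-x)^{f} \cdot x^{a-1}(1-x)^{b-1} \;=\; x^{(a+s)-1}(1-x)^{(b+f)-1}
$$

The right-hand side is the unnormalized form of a Beta distribution with parameters $a+s$ and $b+f$. Setting $a = b = 1$ makes the prior factor equal to 1 (the uniform prior) and gives parameters $s+1$ and $f+1$.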

And this is why the Beta distribution is natural to consider, and why it makes sense to add a ‘phantom’ success and failure. (Some people prefer to add half a success and half a failure. This is called the Jeffreys prior. It gives less weight to the phantom observations, and – due to the non-uniform shape of the distribution – slightly favors the hypotheses that **μ** is either large or small. But I like integers and the uniform prior, personally.)

The approach we’ve taken here is pretty Bayesian. In addition to using Bayes’ theorem once, we’re also working directly with the distribution here, and taking the stance that our computers can handle the computations and arrive at numerical answers, as opposed to imposing assumptions that the distribution is roughly Gaussian for the sake of generating big plug-and-play formulae. (Incidentally, the methods presented here are essentially a one-sided version of the Jeffreys interval.) We’ve also made no null hypotheses nor computed associated p-values to get where we wanted to go: Our basic assumption is that our tests are independent, and that **μ** could equally well be any number prior to making any observations.

The beta distribution can be found in the scipy.stats package in Python. The plots in this post were made with matplotlib.

I recently had the pleasure of reading James Scott’s “Seeing Like a State,” which examines a certain strain of failure in large centrally-organized projects. These failures come down to the kinds of knowledge available to administrators and governments: aggregates and statistics, as opposed to the kinds of direct experience available to the people living ‘on the ground,’ in situations where the centralized knowledge either fails to or has no chance to describe a complex reality. The book classifies these two different kinds of knowledge as techne (general knowledge) and metis (local knowledge). In my reading, the techne – in both strengths and shortcomings – bears similarity to the knowledge we obtain from traditional algorithms, while metis knowledge is just starting to become available via statistical learning algorithms.

In this (kinda long) post, I will outline some of the major points of Scott’s arguments, and look at how they relate to modern machine learning. In particular, the divides Scott observes between the knowledge of administrators and the knowledge of communities suggest an array of topics for research. Beyond simply looking at the difference between the ways that humans and machines process data, we observe areas where traditional, centralized data analysis has systematically failed. And from these failures, we glean suggestions of where we need to improve machine learning systems to be able to solve the underlying problems.

**An Example In Computation**

I’ll start with a very simple example: Consider the problem of sorting. It’s straightforward: Place a total ordering on the objects you wish to sort (say, alphabetical) and then apply any of a variety of available algorithms, like quick-sort or merge-sort, to arrive at a well-sorted list of objects. This is an example of general knowledge applied to a non-specific problem; it’s a techne solution.

Now, suppose we run a shop and want to sort our rolodex of accounts, so that we can easily find the account of the next customer who walks in. The best sorting may depend on the time of day (or time of month), how long it has been since a given customer last visited, how sensitive each customer is to waiting for their account card, and the ability of the staff to learn whatever system you come up with. Simple alphabetization (the techne solution) seems inadequate: Maybe it’s better to use a move-to-front heuristic, or an alphabetized rolodex with notches on the cards for the most frequent or irritable customers. Either of these would count as a metis solution, as the sorting is taking into account specialized local knowledge. (Also, notice that much of the complexity is coming from the evaluation function, which is a nebulous combination of customer happiness and transaction efficiency; we adopt an adaptive algorithm in part to deal with the complexity in the evaluation function.)
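For concreteness, the move-to-front heuristic can be sketched in a few lines. This is an illustration only; the class name and customer names are hypothetical. Each lookup scans from the front, then moves the card it found to the front, so recently seen customers are found quickly.

```python
class MoveToFrontRolodex:
    """A list of account cards that adapts to how it is used."""

    def __init__(self, cards):
        self.cards = list(cards)

    def lookup(self, name):
        """Return how many cards were inspected, then move the card to the front."""
        position = self.cards.index(name)  # linear scan; ValueError if missing
        self.cards.insert(0, self.cards.pop(position))
        return position + 1

rolodex = MoveToFrontRolodex(["Avery", "Blake", "Casey", "Drew"])
first_visit = rolodex.lookup("Drew")    # Drew starts at the back
second_visit = rolodex.lookup("Drew")   # and is now at the front
print(first_visit, second_visit, rolodex.cards)
```

Nothing about the customers is modeled explicitly: the ordering simply accumulates local experience, which is the sense in which this counts as a metis solution.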

So which do you choose: the techne algorithm or the metis approach? Of course, it depends. There are advantages (things like ease of maintenance and ease of interpretation) which favor the simpler techne solutions. But once the assumptions diverge too radically from reality, something more is required.

A move-to-front strategy is still simple enough to yield to analysis, but more extreme examples abound. Consider image classification, where we’ve seen traditional algorithmic approaches fail for decades, replaced recently with far more success by machine learning algorithms.

Prior to this revolution, image classification involved rather complicated pipelines and the careful curation of features developed by experts who had been working on the classification problem for years. Success with techne approaches required a deep metis knowledge of the field, in the form of extremely detailed local knowledge of the experts assembling the pipelines. More recently, these kinds of solutions have been supplanted by deep neural networks, which have been making fast progress towards parity with humans. And successfully training these networks requires little of the expert image processing knowledge needed in the previous approaches.

A trained neural network is still essentially a metis solution, turning a body of experience (training data) into a specialized knowledge of a problem domain. But the local knowledge of image classification has been offloaded from the experts to the machines. This is a real shift.

**Legibility and Interpretability**

Prior to the rise of computers and machine learning, a metis solution for administration always meant hiring someone, and maybe a large number of someones, to store and execute on the needed local knowledge. For these kinds of problems, a centralized state essentially has three options: Hire a small army of bureaucrats to keep track of the needed information, give up, or (crucially) wield its power to change the conditions on the ground to accommodate a simpler solution.

The failed projects that Scott investigates are particularly extreme examples where centralized powers enforced order to create ‘legible’ systems. Scott’s initial example is German scientific forestry, where the story goes something like this: The king wants to export lumber, which requires knowing the size of the forests and how much lumber they are capable of producing. However, a natural forest is difficult to quantify. There are many kinds of trees, in different stages of growth and in indeterminate number. The solution was to create mono-crop forests, with trees planted at the same time in careful rows, allowing easy counting of trees and prediction of lumber yields. While this worked well for one generation of trees, the mono-cropping depleted the soil quickly, and required a large quota of direct maintenance afterward.

Scott provides a wide variety of other case studies, including urban design, taxes and land rights, sub-Saharan agriculture and villagization programs, and various Soviet programmes, which all relate back to this theme of ‘legibility,’ which essentially refers to the amenability of a community or system to classical statistical analysis. In the forestry example, the natural forest is illegible, with a huge number of variables which are difficult to combine into a real understanding of the forest. The scientific forest is much more legible: Land area and age of the trees translate almost directly into a low variance estimate of how much lumber the forest will produce.

A large part of the promise of machine learning is to produce legibility from complex systems. Given a complicated set of inputs, one wants to determine some easily digested information, like whether a person will default on a loan, whether a picture is a picture of a dog, or whether the author of a comment is a human or not. At the same time, we worry quite a bit about the interpretability of a model: Can we figure out why the model made a certain decision? One can view model-building as a way of transferring illegibility from the observed system to the model itself. If the model is fully interpretable (eg, it can be reduced to the observation of just a couple variables), one may as well have just built a traditional algorithm.

Interestingly, we don’t often talk about changing systems to make them easier for our machine learning systems to understand. We talk about improving the pipeline – often through adding new features or providing cleaner data – but not so often about changing the thing being observed in the first place.

**Superstition and Illegibility**

On a related note, one of the common criticisms of ‘local knowledge’ is that it is superstitious, and can come with irrelevant or even harmful artifacts. These are the demons that science is meant to guard us from.

Firstly, Scott makes the argument that many of the claims of harmful superstitious knowledge are overhyped, while the actual contributions of traditional knowledge to science are completely underrated. For example, the practice of intentionally exposing people in a good state of health to a disease, in order to establish immunity, was apparently known for centuries prior to the invention of vaccinations. Vaccination was a huge step forward in terms of ease, reliability, and safety, but it is also a scientific derivation from a previously known practice. Likewise, it is rare for a new medicine to be synthesized completely from scratch: many are derived from leads given by traditional medicine.

That said, there are certainly huge issues with simply trusting traditional practice. Female genital mutilation seems like a pretty clear example of a terrible practice followed for no particularly good reason.

Likewise, there are fears of our learning models picking up superstitions and spurious correlations. In large, deployed systems, there’s a related problem of ‘vestigial features:’ inputs fed into a model which contribute very little, but which would take a huge amount of effort to prove safe to remove. These excess features can unnecessarily complicate the model, or even obscure simpler insights which might be available if the model were sufficiently reduced to be interpretable. And in fact, this runs to the heart of the statistics-versus-data-science debate: Sometimes we train a large, complicated machine learning model when we haven’t put in the effort to understand that a simple single variable regression would solve the problem. If we don’t have a full understanding of the model, we don’t know if we can replace the whole thing with something much simpler.

One can dream of a system which looks at a data set and tells you whether you’re better off calling in a statistician or just using a random forest.

**French Land Rights and Adversarial Machine Learning**

One of my favorite case studies is on land rights in post-revolutionary France. The government wants to administer rural areas, but it turns out there’s little single-ownership of land, and things are pretty chaotic in the wake of the revolution. There are feudal landowners with existing claims, new claims on land liberated during the revolution, complex usufruct community land rights, a whole system of guilds, and informal agreements within communities which allow this mess to continue more-or-less amicably. The government wants to make things more legible with a new ‘code rural.’

Two proposals were entertained. The first was to throw away all existing agreements, and move to a system of title deeds, private property, and contracts governing land use. (Essentially, how we think of land rights today.) The second proposal, by Lalouette, was to travel around to all of the villages and transcribe all of the existing land agreements into law. The first proposal does nothing to respect the existing players on the ground, in favor of making things easy for the government. Meanwhile, the Lalouette proposal would require a huge amount of administrative work, both to document all of the existing agreements and to dredge it up again when a disagreement winds up in court.

In the end, neither proposal was enacted. The private property system was deemed to be too easy to manipulate by the old aristocracy and bourgeoisie, and would lead to eventual concentration of wealth and power away from the peasantry and proletariat. Meanwhile, the ‘transcribe it all’ solution was far too much work and would be too difficult to enforce. Furthermore, the transcribed agreements would be obsolete almost instantly, as new agreements cropped up to deal with new problems on the ground.

There are a couple things I love about this example. First, it gives a crystal clear illustration of the relationship between knowledge and power. You can centralize administration of a system if and only if you can centralize knowledge of the system. If the facts on the ground are too complicated for the administrators to follow, then the people on the ground can do as they wish whenever the administrators aren’t around.

Second, the Lalouette proposal illustrates a problem we run into in machine learning all the time: a statically trained model in many domains will quickly fall into obsolescence. In fact, the areas where machine learning has been most successful are areas where timing of the model doesn’t matter too much. A good solution for image classification or automatic translation doesn’t need to change very quickly, since images and languages don’t change very quickly.

In many domains, however, the facts on the ground can change quickly, as the thing you’re trying to learn changes its form. This occurs especially in adversarial situations: The adversary changes behaviour, and the model needs to change to react to this new behaviour. This produces problems on both ends of the ML pipeline: a trained model is quickly obsolete, but even worse, your training data also becomes obsolete. And labelling new training data, or even evaluating the model you’ve built, can be very expensive in these situations. These situations require a tight feedback loop between models, data, and human supervisors and analysts.

**Polycropping and Objective Functions**

Another interesting case study is polycropping in sub-Saharan Africa. Polycropping is the process of interspersing many different crops, as opposed to the monocropping which predominates in North America. Scott argues at length that polycropping provides better yields and better variety of crops for local communities, but is more difficult for a streamlined, capitalist economy to understand and interact with. Even though polycropping produces higher yields, there is higher administrative cost for processing the outputs, and the additional outputs may be in crops which are difficult to utilize in downstream industrial systems.

But even this is an overconstrained view of the issue: Scott illustrates the quite wide variety of uses that crops and crop byproducts are put to, such as stalks from harvested plants providing trellising for other crops with no further work. What strikes me is that the polycropping subsistence farmer is driven by a wide range of motivations in their planting choices, well beyond just how much food will be produced. Planting one crop might reduce the work needed to plant a follow-on crop, provide medicine or ceremonial by-products, act as pest repellants, and so on. The best crops may succeed at fulfilling a number of different objectives all at once, reducing the overall number of crops which need to be planted (and the attendant human labor).

We often look at ensembles of models in machine learning, which often perform much better than a single model. But here we see that there is also a great value in diversity of evaluation functions, or what we’re training the model(s) to excel at.

It is fairly common to train a model and deploy it, only to eventually realize that the choice of objective function at training time has biased the results in some undesirable way. For example, a recommendation system trained to maximize revenues might always recommend higher-priced items, since buyers will produce more revenue. Over time, though, consumers come to understand that the site is over-priced relative to other sources and go elsewhere for their purchases. A standard response is then to change the objective function to something ‘better,’ realize the shortcomings of the new objective a few months later, and repeat.

But in fact, it may be better to train an ensemble of models on different objectives, to provide a range of options which are good for different reasons. Such an ensemble provides not just model diversity, but diversity of quality. The difficulty here is that it becomes quite difficult to establish the quality of the trained ensemble: Any evaluation function chosen could be used to train a single model directly, after all!

**Extremely Local Knowledge**

Another favorite example of the divide between local and global knowledge arises in ship navigation. Scott points out that many harbors have special pilots with intimate knowledge of local conditions, who come on board a seagoing ship and pilot it through the harbor and into port. The sea navigator is familiar with global techniques of navigation – how to efficiently get from point A to point B using stars, charts, and GPS – but these methods give no help with the placement of rocks in port or the variety of local currents. For this, specialized local knowledge is required. And not only this: an expert is needed for every port.

This illustrates another kind of area where machine learning has great opportunities for success: areas where there is too much local variation to allow strong global solutions, but enough data available to learn from. Areas where geography has a large impact on outcomes are pretty common; training models for (say) medical diagnosis at each hospital will be more effective than training a single model on all patients globally. For example, Lyme disease will be less common in deserts, and the quality of surgeons on staff at a specific hospital may have a strong bearing on individual outcomes.

This also hints at the value of highly personalized models. Recently, I was discussing the problem of abusive comments on Facebook. Abusive comments and trolls are well-known to discourage people (especially women) from speaking up in public spaces on the internet. And yet the major providers of these public spaces (like Facebook and Twitter) have been pretty much powerless to stop the problem, in part because any top-down, global solution will have too strong an impact on freedom of expression. Facebook, when they delete a post, cite a nebulous breach of ‘community standards,’ but Facebook is a massive collection of communities, with widely varying standards, ranging from Black Lives Matter groups to Trump supporters. A site-wide solution needs to allow both safer spaces for highly targeted groups, and anything-goes spaces for the b-tards. Giving control of models to individuals who have authority over subsites (like Twitter comments in response to one’s own tweets) would provide an excellent way forward for public discourse on the internet.

For me, the book emphasized the importance of overcoming (or circumventing) boundaries in the pursuit of scientific progress. Von Neumann in particular became obsessed with applications (particularly after Gödel’s theorem put an end to the Hilbert programme), and served as a bridge between pure and applied mathematics. Meanwhile, construction of the physical computer brought in a variety of brilliant engineers. It’s clear that the departmental politics at the IAS were still quite strong – the pure mathematicians didn’t have much regard for the engineers, and the computer project ground to a halt quite senselessly after von Neumann left the IAS. Dyson argues that Princeton missed an opportunity to be a world center for computing theory and practice as a result.

The computer project itself bridged numerous applications. One of the primary motivations was running calculations that would allow the construction of the first hydrogen bomb, which brought significant investment from the US military. But alongside these calculations, the computer was used for early attempts at numerical weather prediction, using the grid method originally proposed in the 1930’s, long before data collection and computational power were ready. Other projects included experiments with artificial life and cellular automata, and simulating nuclear reactions inside of stars.

All of this came just on the heels of the Manhattan Project, in which the divisions between applied and pure mathematics were broken down by the pressing need to win the war. Indeed, the ideas that came out of mathematics in response to the war would go on to drive many of the major scientific developments of the second half of the century.

And yeah, there’s been a lot of ink spilt in consideration of why it was that the mid-20th century scientific mega-projects were so successful in comparison to more recent efforts. It’s clear that the computer project didn’t exactly have the same laser focus of the Manhattan Project: the idea of building a universal computing machine is pretty inherently opposed to the notion of laser-focus on a single objective. There’s no formula for fundamental advances.

Diversity is helpful, though, and there’s very little to be gained from senseless silo-ing of academic disciplines. The proximity of a wide variety of problems in the computer project allowed a free circulation of ideas, allowing simple ideas from one area to flow into the fundamentals of another research problem.

Specialization is easy. You have a smaller group of people that you’re trying to impress, and you don’t have to do the messy work of understanding new notations and formalisms. Once you’re on the frontier of knowledge, it’s easy to stay there, instead of learning your way to the frontier of an entirely different area. In contrast, when you pick up a new area, it’s difficult to tell what ideas have already been explored – what’s novel, and what’s so obvious that no one bothers to write it down. You lose your social network and any name recognition you’ve built up.

But notice that the advantage of diversity is in global improvements, while the advantages of specialization are largely personal. Greedy algorithms are prone to get stuck in local optima, and innovation is exploration in a highly non-convex space.

When I was making the choice to hop over to industry, these questions of improving diversity were pretty important. A huge number of essential figures in 20th century math and CS spent time in industry, whether in the WWII-era military or Bell Labs. Going out into industry has provided me a great opportunity to put my hands in the dirt and get a sense for the problems people (and corporate ‘people’) are actually struggling with, and the sorts of requirements that exist for a solution to actually be applicable.

The division between academic math and industry deprives both sides. Mathematics is deprived of a steady supply of problems, while industry loses out on the radically different perspective that a good mathematician can bring to a problem. Since there’s a steady flow of mathematicians into industry, though, academic math suffers more in the relationship.

It’s interesting to note that there’s a much better relationship between industry and academia in computer science: the supply of problems back to the academics is better, in part because the problems of academic and industrial CS are better aligned, and in part because a transition to industry isn’t considered a one-way door out of the university. Still, there are better and worse ways to keep roads open back to academia (apply for ‘research’ jobs over SWE jobs), and better and worse ways to hop into a new field (find a collaborator, rather than jump into the literature single-handedly). This is advice you rarely hear in pure math, though; I think there’s a lot to be gained by trying to keep these roads healthy and open.

Meanwhile, a **space-filling curve** is a mathematical invention of the 19th century, and one of the earlier examples of a fractal. The basic idea is to define a path that passes through every point of a square, while also being *continuous*. This is accomplished by defining a sequence of increasingly twisty paths (H1, H2, H3, …) in such a way that H∞ is well-defined and continuous. Of course, we don’t want an infinitely twisty road, but the model of the space filling curve will still be useful to us.

There are a few important ideas in the space filling curve. The first is a notion that by getting certain properties right in the sequence of curves H1, H2, H3, …, we’ll be able to carry those properties on to the limit curve H∞.

The second main idea is how to get continuity. Thinking of the curve as a function where you’re at the start at time 0, and you always get to the end at time 1, we want an H∞ where *small changes in time produce small changes in position*. The tricky part here is that the path itself gets longer and longer as we try to fill the square, potentially making continuity hard to satisfy… When the length of the path doubles, you’re moving twice as fast.

In fact, because of continuity, you can also “go backwards:” Given a point in the square, you can approximate what time you would have passed through the point on the limit curve H∞, with arbitrary precision. This gives direct proof that the curve actually covers the whole square.

Here’s an example of a space filling curve which is **not** continuous. Define Bk as the curve you get from following these instructions:

1. Start in the lower-left corner.
2. Go to the top of the square, and then move right by 1/k.
3. Move to the bottom of the square, and move right by 1/k.
4. Repeat steps 2 and 3 until you get to the right side of the square.

The problem here is that a very small change in time might take us all the way from the top of the square to the bottom of the square. We need to be twistier to make sure that we don’t jump around in the square. The Moore curve, illustrated above, does this nicely: small changes in time (color) don’t move you from one side of the square to the other.
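The difference shows up clearly on a grid. The sketch below compares a discrete version of the up-and-down sweep with a Hilbert curve (used here in place of the Moore curve, which is a close variant), and measures how far each one can travel in a fixed fraction of the total trip time. All function names are hypothetical, and the index-to-cell conversion is the usual iterative one for the Hilbert curve.

```python
from math import hypot

def hilbert_point(order, d):
    """Map step d (0 <= d < 4**order) to a cell (x, y) on a Hilbert curve."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate or flip the quadrant built so far
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def sweep_point(n, d):
    """Map step d to a cell on the up-and-down column sweep of an n-by-n grid."""
    col, row = divmod(d, n)
    return (col, row) if col % 2 == 0 else (col, n - 1 - row)

def max_jump(points, window, n):
    """Largest distance, in units of the square's side, covered in `window` steps."""
    return max(
        hypot(bx - ax, by - ay)
        for (ax, ay), (bx, by) in zip(points, points[window:])
    ) / n

order = 6
n = 1 << order  # a 64-by-64 grid
hilbert = [hilbert_point(order, d) for d in range(n * n)]
sweep = [sweep_point(n, d) for d in range(n * n)]

# A window of n steps is 1/n of the whole trip on either curve.
print("hilbert:", max_jump(hilbert, n, n))
print("sweep:  ", max_jump(sweep, n, n))
```

On the sweep, a window of 1/n of the trip always spans nearly the full height of the square, no matter how fine the grid. On the Hilbert curve the same window stays inside a small neighborhood, and that neighborhood shrinks as the order grows.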

What happens if we try to use space filling curves to build a city in Cities: Skylines?

My first attempt at building ‘Hilbertville’ was to make large blocks, with a single, winding one-way road for access, using the design of a (second-order) Hilbert Curve. In addition to the roads, though, I placed a number of pedestrian walkways, which allow people on foot to get in and out of these neighborhoods directly. I like to think that this strongly encourages pedestrian transit, though it’s hard to tell what people’s actual overall commuting choices are from the in-game statistics.

Skylines only allows building directly facing a road; corners tend to lead to empty space. You can see a large empty square in the middle of the two blocks pictured above. There are also two smaller rectangles and two small empty squares inside of each of these two blocks. Making the top ‘loop’ a little bit longer removed most of the internal empty space. This internal space is bad from the game perspective; ideally we would still be able to put a park in the empty spaces to give people extra space, but even parks require road access.

Intersections with the main connecting roads end up as ‘sinks’ for all of the traffic congestion. So we should try to reduce the number of such intersections… The Moore curve is a slight variation on the Hilbert curve which puts the ‘start’ and ‘finish’ of the path next to one another. If we merge the start and finish into a wide two-way road, we get this:

We still get the wasted square between neighborhoods, but somewhat reduce the amount of empty interior space. Potentially, we could develop a slightly different pattern and alternate between blocks to eliminate the lost space between blocks. Also, because the entrance and exit to the block coincide, we get to halve the number of intersections with the main road, which is a big win for traffic congestion.

Here’s a view of the full city; it’s still not super big, with a population of about 25k. We still get pretty heavy congestion on the main ring road, though the congestion is much less bad than in some earlier cities I built. In particular, the industrial areas (with lots of truck traffic) do much better with these long and twisty one-way streets.

The empty space is actually caused by all of the turns in the road; fewer corners implies fewer wasted patches of land. The easiest way to deal with this is to just use a ‘back-and-forth’ one-way road, without all of the fancy twists.

The other major issue with this style of road design is access to services. Fire trucks in particular have a long way to go to get to the end of a block; the ‘fire danger’ indicators seem to think this is a bad idea. I’m not sure if it’s actually a problem, though, as the amount of traffic within a block is next to none, allowing pretty quick emergency response in spite of the distance.

Overall, I would say it’s a mixed success. There’s not a strong reason to favor the twisty space-filling curves over simpler back-and-forth one-way streets, and in either case the access for fire and trash trucks seems to be an issue. The twistiness of the space-filling curve is mainly used for getting the right amount of locality to ensure continuity in the limit curve; this doesn’t serve a clear purpose in the design of cities, though, and the many turns end up creating difficult-to-access corner spaces. On the bright side, though, traffic is reduced and pedestrian transit is strongly encouraged by the design of the city.

One of my favorite graphics in the book was a scatter plot adapted from a physics paper, mapping four dimensions in a single graphic. It’s pretty typical to deal with data with much more than three dimensions; I was struck by the relative simplicity with which this scatter plot was able to illustrate four dimensional data.

I hacked out a bit of python code to generate similar images; here’s a 4D scatter plot of the Iris dataset:

The Iris dataset consists of measurements of three species of iris flowers: Iris Setosa (red), Iris Virginica (green), and Iris Versicolor (blue). For each flower, four measurements are taken: petal length, petal width, sepal length, and sepal width. (It turns out the sepals are the small leaves just behind/between the petals, the remnants of the bud that bloomed into the flower.) The data was published in 1936 by R. A. Fisher, who was interested in using higher-dimensional data in classification: Fisher’s paper was titled “The Use of Multiple Measurements in Taxonomic Problems.” And ever since, it’s been a standard data set for testing statistical classification techniques, as well as schemes for high-dimensional plotting. (It’s pretty tiny by modern standards, though.)

In the main grid of the 4D scatter plot, each of the plots shows petal length vs petal width, for a range of sepal lengths and sepal widths. As one moves through the scatter plots to the right, the sepal lengths represented in the scatter plots increase; you can think of a row of scatter plots as frames in an animation where the sepal length is acting as ‘time.’ Likewise, as one moves up through the plots, sepal width increases.

On the top and right side, there are marginal plots of sepal length vs petal width (top), and sepal width vs petal length (right). Note that the divisions of the smaller scatter plots give the binning of sepal length and width: notice that there’s only one blue dot in the middle column of the scatter matrix, corresponding to the right-most dot in the marginal plot at the top.

Finally, the upper-right gives space for one more plot, which I’ve used for an overall marginal plot illustrating just petal length vs petal width.
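The binning behind the main grid can be sketched in a few lines. This is not the code used for the figure above, just a minimal illustration: `bin_index` and `grid_cells` are hypothetical helpers, the bins are assumed to be equal-width, and the handful of rows are illustrative values rather than the full dataset. Each resulting cell would be drawn as its own small petal length vs petal width scatter plot.

```python
def bin_index(value, lo, hi, n_bins):
    """Index of the equal-width bin on [lo, hi] that contains value."""
    if hi == lo:
        return 0
    i = int((value - lo) / (hi - lo) * n_bins)
    return min(i, n_bins - 1)  # the maximum value belongs to the last bin


def grid_cells(rows, n_cols=3, n_rows=3):
    """Group (petal_len, petal_wid, sepal_len, sepal_wid) rows by sepal bins.

    Returns a dict {(col, row): [(petal_len, petal_wid), ...]} where col
    grows with sepal length and row grows with sepal width.
    """
    sepal_lens = [r[2] for r in rows]
    sepal_wids = [r[3] for r in rows]
    cells = {}
    for petal_len, petal_wid, sepal_len, sepal_wid in rows:
        col = bin_index(sepal_len, min(sepal_lens), max(sepal_lens), n_cols)
        row = bin_index(sepal_wid, min(sepal_wids), max(sepal_wids), n_rows)
        cells.setdefault((col, row), []).append((petal_len, petal_wid))
    return cells


# Illustrative measurements: (petal length, petal width, sepal length, sepal width)
flowers = [
    (1.4, 0.2, 5.1, 3.5),
    (1.3, 0.2, 4.7, 3.2),
    (4.7, 1.4, 7.0, 3.2),
    (4.5, 1.5, 6.4, 3.2),
    (6.0, 2.5, 6.3, 3.3),
    (5.1, 1.9, 5.8, 2.7),
]
cells = grid_cells(flowers, n_cols=3, n_rows=3)
for key in sorted(cells):
    print(key, cells[key])
```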

Scatter plots can use a few other tricks for packing in a bit more data: point size, transparency, and color add a bit more data, but are usually useful mostly for comparison of points, and can be impossible to read if over-thought. I tend to think of each of these as a half-dimension: Color in particular tends to fail if used for more than a few discrete labels, or if more complicated than a simple two-color gradient.

The better-known multi-dimensional scatter plot is the scatter matrix, which has one scatter plot for each pair of variables. Here’s an example scatter matrix, for comparison:

Its advantage is that one can have any number of variables (not just four), but one loses the sense of how pairs of variables change as a third variable changes. The plots on the diagonal show the single-variable distributions; this feels a bit like wasted space, though in this case we see that petal characteristics on their own make it very easy to separate the *versicolor* irises.

I met up with some mathematician friends in Toronto yesterday, who were interested in how one goes about getting started on machine learning and data science and such. There are piles of great resources out there, of course, but it’s probably worthwhile to write a bit about how I got started, and place some resources that might be of more interest to people coming from a similar background. So here goes.

First off, it’s important to understand that machine learning is a gigantic field, with contributions coming from computer science, statistics, and occasionally even mathematics… But on the bright side, most of the algorithms really aren’t that complicated, and indeed they can’t be if they’re going to run at scale. Overall though, you’ll need to learn some **coding**, **algorithms**, and **theory**.

Oh, and you need to do side-projects. Get your hands dirty with a problem quickly, because it’s the fastest way to actually learn.

On the theory front, I learned quite a lot by reading Christopher Bishop’s book *Pattern Recognition and Machine Learning*. It’s not an easy book if you don’t know anything, though: Bishop’s description of how K-means works (as a specialization of EM) isn’t terribly useful unless you’ve already got an idea of how K-means works, for example. So I found it very useful to watch talks from Andrew Ng’s Machine Learning course on Coursera, which is pitched at undergraduates, and then go read a relevant chapter from Bishop to get a more in-depth understanding and really think about the mathematics of a method. I find this to be a generally good approach: Find the ‘easy’ resources pitched at undergrads, burn through them quickly, and then dive directly into work aimed at grad students and researchers. The authors of the easy resources put a lot of effort into distilling the fundamental ideas, and it’s immensely helpful to have that starting point before wading into the deep end.

Another book I found quite useful is Flach’s Machine Learning: The Art and Science of Algorithms that Make Sense of Data. It’s a quick read, and deals with tree-based methods in depth, which aren’t quite as popular lately, but are fantastically useful. Decision trees train quickly, require no additional normalization step, and are very interpretable. Random forest, a technique which uses many decision trees together, often provides an as-good-as-anything machine learning solution, and gets one thinking how to make proper use of ensembles.

If you’re coming from mathematics, you probably already know linear algebra in some detail. You will learn it in greater detail. Knowing when it’s useful to compute an eigenvector (answer: always), and having some ability to figure out how to explain the eigenvectors that you find will make you immensely useful.

The way to get good at coding is to get a lot of practice.

For me, this is easy, because I view code as a way to **solve problems** and as a way to **understand structures**. When you start viewing code as a general problem-solving technique, you automatically get a lot more practice: Oh, I need to send grades to all my students? Let’s write a little python script to do it. Bam, practice. Oh, I need to count an obscure combinatorial object? I can write some code to generate the objects, and probably learn a lot about natural ways to arrange the objects in the process. Bam, lots of practice…

If you haven’t written any code at all, Codecademy is a great place to start learning Python. It is stupidly fantastic.

To get a lot of mathy practice, you can work problems at Project Euler, which collects interesting math problems that require a computer program to solve. The first few just ensure that you know your programming language, but they get much more interesting quickly.

At some point, I knew that I was probably going to be interviewing with Google, and started going through preparation for the interview process. The interviews are probably easier for a mathematician than they are for most people: thinking on your feet while writing precise statements on a whiteboard is pretty much exactly what math grad school prepares you for. It’s kind of like a qualifying exam, but with easier questions.

You do need to study, though. I took Steve Yegge’s advice and worked through Skiena’s Algorithm Design Manual, which is a fantastic book that falls solidly in the ‘quick and enlightening’ end of the spectrum. Work through all of the problems and you will probably kick the ass of any interview question that comes your way. (Unless it’s deep-in-the-weeds language knowledge stuff, which, ugh. And tends not to be in Google interview questions, anyways.) You can also read Cormen if you feel like you have lots of time…

You’ll also get a better sense of what algorithmic efficiency really means: in the real world of giant data sets, everything needs to run fast. My general experience is that any interesting algorithm starts its life running in cubic time. After a couple years, it gets pared down to something that runs in full completeness in quadratic time. And then somebody comes up with a burnt-out husk of the algorithm that runs in almost-linear time with a shady statistical guarantee that the answers aren’t complete garbage. Polynomial is cool and all, but at big enough scale, we need algorithms that run in almost-linear time.

- Kaggle is a website for data science competitions. It’s also a great source for side-projects if you don’t know what else to do. Go download a dataset and start hacking.
- Scikit-Learn is the primary machine learning library in Python. Its ‘fit’ and ‘predict’ framework makes it easy to try lots of different algorithms without much friction. You can build something that basically works with almost no effort.
- Matplotlib is the Python library for basic plotting. It’s a bit more painful to use than it really should be, but it’s still a great library. Seaborn and some similar projects try to take some of the rough edges off, though.
- Hacker News is where bored programmers hang out while their code compiles. There are also lots of stories about what’s going on in machine learning and algorithms research. You can also see other people’s random side projects, and aspire to get yours to the top!
- Partially Derivative is a podcast about data science, and is sort of like having the relevant parts of Hacker News in audio form. They also have their own list of resources here.

Recently I’ve seen a couple nice ‘visual’ explanations of principal component analysis (PCA). The basic idea of PCA is to choose a set of coordinates for describing your data where the coordinate axes point in the directions of maximum variance, dropping coordinates where there isn’t as much variance. So if your data is arranged in a roughly oval shape, the first principal component will lie along the oval’s long axis.

My goal with this post is to look a bit at the derivation of PCA, with an eye towards building intuition for what the mathematics is doing.

Suppose we have a matrix of data X, whose rows are data points, and whose columns are the feature values of the data points. To be overly concrete, suppose we have a matrix recording whether each of six people has watched each of four movies. Thus, we have a 6×4 binary matrix.

Our first job is to clean up the data a bit by removing the column means from each of the data points, so that all of the features have mean zero. We’ll call this adjusted matrix $X_m$.

Once we’ve ‘demeaned’ the data, we compute two very important related matrices, the *scatter matrix* and the *Gram matrix*. PCA is fundamentally a way to extract important information from the scatter matrix. The scatter matrix is $S = X_m^T X_m$, and the Gram matrix is $G = X_m X_m^T$. The entries of the scatter matrix are dot products of columns of $X_m$, and the entries of the Gram matrix are dot products of rows. Visually:

And here’s a bit of example python code:

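```
import numpy as np

# Round to two decimals for compact printing.
rnd = lambda x: np.round(x, 2)

# Rows are people, columns are movies; 1 means 'watched'.
X = np.matrix([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])
print(X)

# Remove column means
Xm = X - X.mean(axis=0)
# Alternative: divide columns by their norms (similar in spirit to tf-idf weighting).
# Xm = X / np.linalg.norm(X, axis=0)
print('Column means: ', rnd(Xm.mean(axis=0)))
print('Column norms: ', rnd(np.linalg.norm(Xm, axis=0)))
print('Demeaned X:\n', rnd(Xm))

# With np.matrix, * is matrix multiplication (not elementwise).
S = Xm.transpose() * Xm   # scatter matrix: dot products of columns
G = Xm * Xm.transpose()   # Gram matrix: dot products of rows
print('Scatter matrix:\n', rnd(S))
print('Gram Matrix:\n', rnd(G))
```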

This produces the following output:

```
[[1 1 0 0]
[1 1 0 0]
[1 0 1 1]
[1 1 0 1]
[0 0 1 1]
[0 1 1 1]]
Column means: [[ 0. 0. 0. 0.]]
Column norms: [ 1.15 1.15 1.22 1.15]
Demeaned X:
[[ 0.33 0.33 -0.5 -0.67]
[ 0.33 0.33 -0.5 -0.67]
[ 0.33 -0.67 0.5 0.33]
[ 0.33 0.33 -0.5 0.33]
[-0.67 -0.67 0.5 0.33]
[-0.67 0.33 0.5 0.33]]
Scatter matrix:
[[ 1.33 0.33 -1. -0.67]
[ 0.33 1.33 -1. -0.67]
[-1. -1. 1.5 1. ]
[-0.67 -0.67 1. 1.33]]
Gram Matrix:
[[ 0.92 0.92 -0.58 0.25 -0.92 -0.58]
[ 0.92 0.92 -0.58 0.25 -0.92 -0.58]
[-0.58 -0.58 0.92 -0.25 0.58 -0.08]
[ 0.25 0.25 -0.25 0.58 -0.58 -0.25]
[-0.92 -0.92 0.58 -0.58 1.25 0.58]
[-0.58 -0.58 -0.08 -0.25 0.58 0.92]]
```

Because the scatter matrix is so important, I’m going to linger on it a bit.

Since the entries of the scatter matrix are dot products of columns, we can think of them as dot products of the features that those columns represent. So the (2,3) entry compares the 2nd and 3rd feature. The dot product satisfies $x \cdot y = |x| |y| \cos(\theta)$, where $\theta$ is the angle between the two vectors, so the scatter matrix records a scaled version of the *cosine similarity* of the features. Cosine similarity measures the extent to which two vectors point in the same direction, opposite direction, or whether they’re simply orthogonal. Thus, the scatter matrix asks the extent to which two features ‘point’ in the same direction, multiplied by the overall scale of the features. Finally, since the dot product is commutative, the scatter matrix is symmetric.

If we were to add more data points, the scatter matrix would generally have a larger norm. It seems a bit silly to be so reactive to the size of the data set; if we rescale by the number of data points we get something a bit more robust. Indeed, if you multiply the scatter matrix by $1/(n-1)$, where $n$ is the number of data points, you get the (sample) *covariance matrix* of *X*.

So the scatter matrix records vital statistical data about the relationship between features in our particular dataset.
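As a quick sanity check of the link between the scatter matrix and the covariance matrix, the sketch below rebuilds the example data as a plain NumPy array (an assumption for convenience; the listing above uses `np.matrix`) and compares the rescaled scatter matrix against `np.cov`, which uses the $n-1$ normalization by default.

```python
import numpy as np

# Same 6x4 viewing data as above: rows are people, columns are movies.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

Xm = X - X.mean(axis=0)     # remove column means
S = Xm.T @ Xm               # scatter matrix
n = X.shape[0]              # number of data points

cov_from_scatter = S / (n - 1)
cov_numpy = np.cov(X, rowvar=False)   # columns are the variables

print(np.allclose(cov_from_scatter, cov_numpy))
```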

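Now, the singular value decomposition of $X_m$ is a product $X_m = U \Sigma V^T$, where:

- $U$ and $V$ are orthogonal matrices (so that $U^T U$ and $V^T V$ are identity matrices), and
- $\Sigma$ is a diagonal matrix of the singular values of $X_m$ (these agree with the eigenvalues of $X_m$ only in special cases, such as a symmetric positive semi-definite matrix).

But if we multiply the scatter matrix $X_m^T X_m$ by $V$, we get $X_m^T X_m V = V \Sigma^T \Sigma$ (try it!). This means that the columns of $V$ are eigenvectors for the scatter matrix, with the squared singular values as eigenvalues. (Likewise, the columns of $U$ are eigenvectors of the Gram matrix.)

We can express $X_m$ in terms of these eigenvectors by simply multiplying $X_m V = U \Sigma$. This expression is almost what we’re after.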
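The ‘try it!’ is easy to do numerically. The sketch below uses `np.linalg.svd` on the demeaned example data (rebuilt here as a plain array) and checks the three identities: the right singular vectors are eigenvectors of the scatter matrix, the left singular vectors are eigenvectors of the Gram matrix, and multiplying the data by the right singular vectors gives the left singular vectors scaled by the singular values. The variable names are just for this example.

```python
import numpy as np

X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)
Xm = X - X.mean(axis=0)

# Thin SVD: U is 6x4, sigma holds the 4 singular values, Vt is V transposed.
U, sigma, Vt = np.linalg.svd(Xm, full_matrices=False)
V = Vt.T

S = Xm.T @ Xm   # scatter matrix
G = Xm @ Xm.T   # Gram matrix

# Multiplying a matrix by a 1-D array scales its columns.
print(np.allclose(S @ V, V * sigma**2))   # columns of V: eigenvectors of S
print(np.allclose(G @ U, U * sigma**2))   # columns of U: eigenvectors of G
print(np.allclose(Xm @ V, U * sigma))     # the data in the new coordinates
```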

The main idea of principal component analysis is to sort the eigenvalues and associated eigenvectors by size, and throw away the eigenvectors with small eigenvalues. The singular value decomposition usually consists of matrices of sizes $n \times n$, $n \times m$, and $m \times m$ (for $n$ data points and $m$ features). Restricting to the $r$ largest eigenvalues reduces these sizes to $n \times r$, $r \times r$, and $r \times m$ by throwing away (or zeroing out) rows and columns.

In this picture, we show the PCA obtained by keeping the two largest eigenvectors. Since the resulting data is two dimensional instead of four, we can plot it!

As indicated in the picture, when we squint at the columns of the eigenvector matrix $V$, we can make out what they’re saying about the data. The first is (roughly) asking whether someone tended to watch the first pair or second pair of movies. Looking at the matrix of data, this does seem like an important feature. And the second vector is asking whether someone watched the first or the second movie.

The beauty of PCA is that it makes asking these kinds of questions of the data automatic: What are the combinations of features (and in what amounts) that best decompose the data set?

Additionally, to apply PCA, we only have to keep the matrix $V$ of eigenvectors and perform matrix multiplication on any new data points we encounter, assuming that the new data points are similar enough to our original training data.
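Here is a minimal sketch of that workflow on the movie data, keeping two components. It assumes we store the column means alongside the truncated eigenvector matrix (called `V_r` here, a name chosen for this example), since a new data point has to be demeaned the same way as the training data before it is projected.

```python
import numpy as np

X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)
mean = X.mean(axis=0)
Xm = X - mean

U, sigma, Vt = np.linalg.svd(Xm, full_matrices=False)

r = 2                 # number of components to keep
V_r = Vt[:r].T        # 4x2: the eigenvectors with the largest eigenvalues
Z = Xm @ V_r          # 6x2: each person described by two coordinates

# Projecting a new viewer only needs the stored mean and V_r.
new_point = np.array([1.0, 1.0, 0.0, 0.0])
z_new = (new_point - mean) @ V_r

print(np.round(Z, 2))
print(np.round(z_new, 2))
```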

Note that the sizes of the eigenvalues explicitly describe the contribution of each eigenvector to our dataset $X_m$. As a result, removing the eigenvectors with the smallest eigenvalues has the least impact on the data. There’s another derivation of PCA in which one tries to find the best possible r-dimensional projection of the data $X_m$, in terms of the error in ‘reconstructing’ data from its PCA representation. Minimizing this error leads directly to PCA decomposition with r components.
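The reconstruction view can be checked directly. In the sketch below (variable names are for illustration only), the data is rebuilt from its first r components, and the squared reconstruction error is compared with the sum of the discarded squared singular values. The two agree for every r, which is a standard property of the truncated SVD.

```python
import numpy as np

X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)
Xm = X - X.mean(axis=0)
U, sigma, Vt = np.linalg.svd(Xm, full_matrices=False)

errors = []
tails = []
for r in range(len(sigma) + 1):
    # Rebuild the data from the r largest components only.
    X_r = (U[:, :r] * sigma[:r]) @ Vt[:r]
    errors.append(np.sum((Xm - X_r) ** 2))
    # Sum of the squared singular values that were thrown away.
    tails.append(np.sum(sigma[r:] ** 2))

print(np.round(errors, 3))
print(np.round(tails, 3))
```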

The key takeaway here is that PCA comes down to understanding the *relationship between features in your dataset*, as measured by a dot product. However, there are many other ways to write down a relationship than just the dot product!

For example, we can make a new scatter matrix whose entries are simply the cosines between the feature vectors, instead of the dot product. Or we could use an entirely different function, like: e^(-sum_k (Xik – Xjk)^2). (Sorry, I couldn’t get the latex to render in WordPress!). Suitably ‘good’ functions are called **kernels**. Doing an eigenvector analysis of the scatter matrix evaluated with a kernel function gives rise to kernel PCA. Kernel methods make it possible to find non-linear patterns in the data space.
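The cosine variant is simple enough to write out. In this sketch each entry of the scatter matrix is divided by the norms of the two demeaned feature columns involved; for demeaned data the result is the familiar correlation matrix, which the code checks against `np.corrcoef`. This only illustrates the cosine case, not a full kernel PCA.

```python
import numpy as np

X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)
Xm = X - X.mean(axis=0)

S = Xm.T @ Xm                          # ordinary scatter matrix
norms = np.linalg.norm(Xm, axis=0)     # length of each feature column
C = S / np.outer(norms, norms)         # cosine between each pair of features

print(np.round(C, 2))
print(np.allclose(C, np.corrcoef(X, rowvar=False)))
```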

You can find an implementation of kernel PCA in scikit-learn.

If your data naturally clusters into major segments with radically different behaviour, the standard PCA may not do a great job of describing your data. Instead, you can perform some segmentation or pre-clustering, and then apply a different PCA decomposition to each of the different data clusters.

Bishop’s *Pattern Recognition and Machine Learning* describes a couple nice probability-based interpretations of PCA. The first allows computation of the PCA iteratively using an EM algorithm, which can be faster than explicitly computing eigenvectors with certain large datasets. Bishop also describes a Bayesian approach to determine the number of components one should actually keep.

This was a bit different from other Kaggle competitions. Typically, a Kaggle competition will provide a large set of data and want to optimize some particular number (say, turning anonymized personal data into a prediction of yearly medical costs). The dataset here intrigued me because it’s about learning from and reconstructing graphs, which is a very different kind of problem. In this post, I’ll discuss my approach and insights on the problem.

Kaggle is a data science competition site; given a ‘large’ dataset, participants try to find algorithms which extract useful data to optimize against some ground truth. For this particular competition, the data consisted of social circles from Facebook, and accompanying anonymized data. For example, you would have the list of John Doe’s friends, along with all of the ‘friend’ relationships amongst his friends, where people have lived and worked, where they went to school, what languages they speak, and so on. All of the names of people and places are replaced with sequential numbers to prevent reconstructing identities from the data.

Each John Doe would then indicate their social circles within their set of Facebook friends. These social circles might overlap (for example, if you had a ‘work’ circle and a ‘family’ circle, and happened to work at a family business) or not. It also wasn’t necessary that every friend appear in at least one circle. So the circles were quite free-form. Given the Facebook data, the task is to reconstruct the set of social circles.

The Kaggle data is divided into three sets: Test, Public, and Private. The test data is data with solutions that you’re given as ground truth for designing and fine-tuning your algorithms. The Public and Private sets are just data without solutions. As you make submissions throughout the competition, your solutions are scored against the Public dataset and put on a leaderboard. The results at the end of the competition, though, are determined by the Private dataset, where your solutions are scored at the end of the competition.

I worked for probably a week or week and a half on the problem; I believe I peaked at 7th place on the Public leaderboard. I stopped working on the problem when I moved at the end of August. By the end of the competition in late October, I was ranked 67th on the Public leaderboard.

It’s well-known that people tend to over-fit the data in the Public leaderboard. In this case, there were a total of 110 data instances, of which ‘solutions’ were provided for 60. One third of the remaining 50 instances were used for the Public scoring, and two-thirds were used for the Private scoring. I got the sense from my work with the Test data that the Public set was a little bit strange, and so I tried to restrain myself from putting too much work into doing well on the Public leaderboard, and instead on understanding and doing well with the Test data. This seems to have worked well for me in the end.

Examining the test data, it was clear that there was a great deal of variation in how people identified their social circles, as well as variation in the size and density of their friend networks. For example, some people very carefully placed every one of their friends into some circle, with all circles completely disjoint, while others had just a few large circles with significant (over 80%) overlap. This meant that the overall dataset was actually pretty small relative to the degree of variation.

Second, there’s a great deal of variation in the Facebook demographic data. Amongst my actual Facebook friends, some people flesh out their whole profiles with every school they’ve attended since they were six years old, and others leave everything blank that they can get away with and use a fake name besides. Most people fall somewhere in between, and often don’t update their current location or job if their status changes. This suggests that the Facebook demographic data can be pretty unreliable.

I made some initial attempts at using the demographic data, but eventually completely abandoned it, sticking to using just the graph data. My rationale was that the demographic data should be replicated in the graph data for people with meaningful connections, which saves me from trying to infer who left something blank or didn’t update their profile. Additionally, knowing that a set of friends attended the same university might be sufficient to generate a circle, but that circle could easily be far too large if someone had a few different, non-overlapping interests while they were in school; the equestrian team and the bee-keeping society might have a small intersection. The graph data itself should have the ‘right’ level of granularity built in, in a way that pre-selected demographic data cannot.

Third, the actual evaluation metric used in the competition has a real effect on the optimal strategy for constructing circles. The metric was the ‘edit distance’ between our predicted circles and the ground truth. After an initial best matching of our predicted circles and the ground truth, we count the number of edits required to turn the prediction into the ground truth. An edit consists of adding a person to a circle, deleting a person from a circle, creating a new empty circle, or deleting an empty circle.
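To get a feel for the metric, here is a brute-force sketch of one reading of it. This is an interpretation of the description above, not the official scoring code: `circle_edits` is a made-up name, the best matching is found by trying every permutation (only feasible for a handful of circles), and an unmatched circle is assumed to cost one edit per member plus one edit for creating or deleting the empty circle.

```python
from itertools import permutations


def circle_edits(predicted, truth):
    """Fewest edits that turn the predicted circles into the true circles."""
    size = max(len(predicted), len(truth))
    # Pad the shorter list with None, standing for 'no circle to match'.
    pred = [set(c) for c in predicted] + [None] * (size - len(predicted))
    true = [set(c) for c in truth] + [None] * (size - len(truth))

    def pair_cost(p, t):
        if p is None and t is None:
            return 0
        if p is None:
            return 1 + len(t)   # create an empty circle, then add everyone
        if t is None:
            return len(p) + 1   # remove everyone, then delete the empty circle
        return len(p ^ t)       # one add or delete per differing member

    return min(
        sum(pair_cost(p, t) for p, t in zip(pred, perm))
        for perm in permutations(true)
    )


print(circle_edits([{1, 2, 3}], [{1, 2}, {4}]))
```

In the example, matching {1, 2, 3} with {1, 2} costs one deletion, and the unmatched circle {4} costs one creation plus one addition, so missing a small circle is cheap compared with getting a large circle badly wrong.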

Given the volatility in how the ground truth circles were labelled, I decided to focus on identifying a few large circles that were almost certainly correct. Luckily, large circles have the biggest impact on the final score, and are also easier to detect.

My final solutions used the following process:

- Compute an approximation of the exponential adjacency matrix of the friend graph. The adjacency matrix $A$ of the friend graph is simply the matrix of connections. The exponential adjacency matrix is the sum $\exp(A) = \sum_{n \ge 0} A^n / n!$. The nth power of the adjacency matrix counts the number of length-n paths between various vertices; taking the exponential is summing up paths of all lengths, weighted by length of the path. Conceptually, it gives a better notion of the degree of connectedness of various nodes, by smoothing out inconsistencies in the basic adjacency matrix. For example, if vertices A and B have many common neighbors but no edge directly connecting them, then the AB-entry of the adjacency matrix is 0, but the entry in the exponential adjacency matrix will be quite high. This helps to fill in connections between people who ‘should’ be friends.
- Apply a generic clustering algorithm using the rows of the exponential adjacency matrix as the feature vectors. I began by using K-means clustering, and found that spectral clustering worked better in the end. Both of these clustering algorithms produce a fixed number K of non-overlapping clusters; finding the right value of K is itself a bit of a problem, and it clearly depends on the number of friends a person has.
Spectral clustering works by computing a similarity matrix for the data, and then using some of the eigenvectors of that similarity matrix to reduce the dimensionality of the data. Clustering is then performed on the lower-dimension reduction. I think that the use of the exponential adjacency matrix helped quite a bit here, since it naturally accentuates the similarities in the adjacency matrix. It might be worth following up the particular interaction between the exponential adjacency matrix and spectral clustering algorithms. I found that setting K=10 for people with more than 350 friends and K=6 for fewer than 350 friends worked well for the test data.

- Throw away low-density circles. For a circle with $n$ people, define the density as the number of edges in the circle divided by $n(n-1)/2$ (the maximal possible number of edges). Any circle with density less than 0.25 was thrown out. (This threshold was arrived at by grid search, and then bumped up a bit to be conservative.)
- Augment the remaining circles by adding in people who have more than F friends in the circle, or who are friends with at least 50% of the people in the circle. For graphs with more than 250 vertices, we set F to 8, and otherwise set F=6.
- Include small connected components with at least 5 members and no more than 15 as independent circles.
- Merge circles with more than 75% overlap.
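The first step of the list above can be sketched as follows. The post doesn’t say which approximation was used, so this assumes the simplest one: truncating the power series after a fixed number of terms (`exp_adjacency` and `terms` are names invented for this example). The toy graph has two vertices, 0 and 1, that share three neighbours but have no edge between them.

```python
import numpy as np


def exp_adjacency(A, terms=10):
    """Approximate exp(A) = sum_n A^n / n! by its first `terms` terms."""
    A = np.asarray(A, dtype=float)
    total = np.eye(len(A))   # the n = 0 term
    term = np.eye(len(A))
    for n in range(1, terms):
        term = term @ A / n  # turns A^(n-1)/(n-1)! into A^n/n!
        total += term
    return total


# Vertices 0 and 1 are both friends with 2, 3 and 4, but not with each other.
A = np.zeros((5, 5))
for hub in (0, 1):
    for neighbour in (2, 3, 4):
        A[hub, neighbour] = A[neighbour, hub] = 1

E = exp_adjacency(A)
print(A[0, 1], round(E[0, 1], 2))
```

The (0, 1) entry of the adjacency matrix is zero, while the corresponding entry of the exponential is clearly positive: the three length-2 paths through the shared neighbours already contribute 3/2! = 1.5.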

There are clearly a number of arbitrarily chosen constants here. The overall size of the dataset was far too small to try to ‘learn’ the right values of these constants. I tried at one point bucketing the dataset into four or five classes of circle by circle size, and learning the right constants, but ended up with wildly over-fitted results. In the end, it was down to choosing ‘reasonable’ constants and hoping for the best.
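To show where those constants enter, here is a rough sketch of the density filter and the merge step, with the thresholds exposed as parameters. Several details are assumptions made for the example rather than the original implementation: edges are stored as a set of pairs with each friendship listed once, ‘overlap’ is taken to mean the shared fraction of the smaller circle, and merging is done greedily in a single pass.

```python
def density(circle, edges):
    """Edges inside the circle divided by the maximum possible, n(n-1)/2."""
    n = len(circle)
    if n < 2:
        return 0.0
    inside = sum(1 for a, b in edges if a in circle and b in circle)
    return inside / (n * (n - 1) / 2)


def filter_circles(circles, edges, min_density=0.25):
    """Keep only the circles that are dense enough."""
    return [c for c in circles if density(c, edges) >= min_density]


def overlap(a, b):
    """Shared members as a fraction of the smaller circle (an assumption)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def merge_circles(circles, threshold=0.75):
    """Greedily fold each circle into the first earlier circle it overlaps."""
    merged = []
    for circle in circles:
        circle = set(circle)
        for existing in merged:
            if overlap(circle, existing) > threshold:
                existing |= circle
                break
        else:
            merged.append(circle)
    return merged


edges = {(1, 2), (2, 3), (1, 3), (4, 5)}
print(filter_circles([{1, 2, 3}, {1, 4, 6, 7}], edges))
print(merge_circles([{1, 2, 3, 4, 5}, {1, 2, 3, 4, 6}, {8, 9}]))
```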

It was really fun to work on a graph learning problem! Many thanks are due to Julian McAuley for organizing the contest.

Honestly, I was pretty excited to try out using graph Fourier transforms as feature vectors, but didn’t end up doing this because the Fourier transforms are too slow once we have a graph with more than about 400 vertices… Alas. I still think it would be an interesting approach for smaller circles, though!

The main take-away for me from this contest is that personal data (like university, workplace, and so on) ended up mattering less than the actual structure of the graph. The personal data might be too fine or too coarse to actually capture the social relationships of an individual, but the web of actual connections encodes that structure directly. Another interesting follow-up question is how to ask for data that’s actually more useful in recovering community structure. It might be more useful to have people provide self-defined ‘tags’ describing their interests or communities.

There’s a game called ‘advanced chess’ in which human players use chess computers to explore possible moves, and then move as they will; the exploration here felt similar. I had a number of heuristic ideas which I would code up and watch the outcome of, and then discard the ones which didn’t perform as well. In the absence of a proliferation of training data, one is forced into heuristic choices by necessity. My final solution was a nice mathematical idea (use spectral clustering on the exponential adjacency matrix) augmented by a handful of simple heuristics.

And ultimately, this isn’t an unusual occurrence. The funny bit about Kaggle competitions is that you’re learning with labelled data. Outside of Kaggle, you often don’t actually have training labels, and a major part of the learning problem is figuring out what data to collect in the first place. The size of the data set here almost certainly reflects the difficulty of getting ground-truth data on social circles.

Against this background, Lawrence Lessig’s Code made the case that the internet TAZ was in fact temporary. Lessig argued that the internet’s behaviour is determined by a combination of computer code and legal code, and that while the legal code hadn’t been written yet, it would be soon. His prediction (which has largely been realized) was that the internet would lose its anarchic character through government regulation mixed with a need for security and convenience in commercial transactions. (In addition to these forces, social media also came along, in which people largely sacrificed their anonymity willingly for the convenience of being able to easily communicate with their meatspace social networks.)

In thinking about Bitcoin, it’s useful to see *how* the regulation came to change the internet. The prediction (again pretty much correct) was that regulations would target large companies instead of individual users. Companies are compelled to follow the law under the ultimate threat of not being allowed to operate at all. Because of the tendency for people to glom onto just a few instances of working solutions, it becomes easy to target a few large entities to enact regulation on a broad base of users.

This is instructive for Bitcoin, which has amongst its supporters a large contingent of extreme libertarians. We could call these libertarians fisco-anarchists, who see in Bitcoin an opportunity to free economic transactions from government regulation in the same way the crypto-anarchists wanted liberated communication channels in the nineties. It’s currently 1999 for Bitcoin: The law simply hasn’t moved to regulate Bitcoin *yet*, and when it does it will likely target larger players like Coinbase, Circle, and BitPay instead of trying to regulate individual users.

In fact, the state of New York is in the process of implementing a ‘Bitlicense’ law which will work in exactly this way. The law will require registration of any businesses transacting in Bitcoin or other virtual currencies, and will require, for example, reporting any transactions over $10,000 in value.

I personally expect that we’ll see another successful crypto-currency come into existence, which should capture most of the actual benefits of Bitcoin while avoiding at least some of the major pitfalls. To my mind, the primary advantages offered by the cryptocurrencies include that they are:

- Trans-national, allowing free movement of currency internationally, and have
- Low/no transaction costs, which enable micropayments and will help to remove parasites from the commercial internet, while they
- Provide a space for experimenting with new notions of currency.

While this list looks short, these functions are important and potentially transformational for the economy of the internet specifically and the world more broadly. Indeed, Scott McCloud’s ‘Reinventing Comics’ suggested micropayments as a way to move comics to the web all the way back in 2000, but transaction costs with PayPal and Visa never fell low enough for microtransactions to be viable.

Meanwhile, the biggest problems with Bitcoin are:

- Deflation: The deflationary design is a huge long-term problem, encouraging people to sit on large hoards of Bitcoin instead of using them for small transactions. (Much as dragons like to sit on hoards of gold.) As the value of the Coin increases, the incentive to use it for small transactions decreases. This could eventually lead to a liquidity problem. For comparison, Europe in the Middle Ages suffered severe bullion shortages, which led to entire shiploads of cargo going unsold when no one at the destination city had any money to buy things.
- Power consumption: Aside from the obvious ecological problems, the cost of electricity becomes a barrier to mining, which will further concentrate the distribution of Bitcoins in the hands of a few large mining outfits. There are a few existing crypto-currencies which have fixes for this problem, most notably LiteCoin.
- Prevalence: As Jessamyn West put it, you still can’t use Bitcoin to pay your rent.
- Security: For some time, the best choices for securing your Coins involved either giving them to an untrustworthy exchange, or doing the digital equivalent of hiding your cash in a mattress. I think CoinBase is helping with this, or at least they seem more trustworthy than MtGox…
- Scalability: There are well-documented scalability issues, and a number of ideas in circulation for fixing them.

The deflation and power consumption issues are both architectural problems with Bitcoin, which probably necessitate a successor coin, or at least a major modification of the current architecture. I’ll emphasize that Bitcoin does have real utility and advantages over current payment networks, so we can expect some real interest in making a cryptocurrency which addresses Bitcoin’s shortcomings. Of course, commercial and government organizations will also have an interest in making sure that a successor is more convenient, secure, and regulable.

What might the successor coin look like? Let’s first think about how to ensure that people sign on for the new currency.

Interestingly, the deflationary aspect of Bitcoin has served as one of the primary ways to get people involved, in the form of speculators who view the currency itself as a kind of investment. A good question, then, is how to get people into a new Coin without having a deflationary structure. David Graeber’s book on the history of Debt suggests to me two easy ways to solve this problem:

- It turns out that colonial Europeans often found cashless societies in their travels, which was fairly inconvenient from their perspective. As they set up the colonies, they would require that farmers start paying taxes which could only be paid in the currency of the colonialists. Meanwhile, they were happy to pay out this currency in exchange for their favourite crops or labour. This quickly led to general circulation of currency. For us, it’s easy to imagine, for example, the Amazon Cloud requiring that all payments be made in NewCoin, creating an instant need for those coins.
- Alternately, a conglomeration of large players could directly solve the prevalence problem by announcing at launch time that NewCoin will be accepted (and possibly preferred or discounted) at their sites. Again, sites like Amazon would be a natural choice of member in such a coalition.

We can wager that a non-deflationary successor to Bitcoin will be announced in a top-down way, with a number of players signed on at the outset: you need a strong incentive of one kind or another to adopt an inflationary currency.

The architecture for a successor coin should solve many of the current issues with Bitcoin, and if it’s coming from large commercial players, we can expect an architecture that will enhance security and convenience. My guess is that a successor currency would greatly reduce anonymity to help with these concerns, possibly requiring that users explicitly tie their identities to their accounts. As Lessig points out, commercial interests often coincide with measures that increase regulability.

There are interesting moves in this direction occurring even as I’ve been writing this essay. Square has been running an experiment with Bitcoin, and recently published an extremely interesting article on cryptocurrencies as a kind of infrastructure for the digital commercial space. And just today they’ve endorsed Stellar, a group working on creating easy interchange of cryptocurrencies with other currencies. Stellar has their own cryptocurrency, which seems to be centrally generated and strongly encourages users to tie their identity to their accounts through Facebook sign-in or associating an email address. As always, though, it will take time to see how these developments hold up.

Interesting times.

Thanks to John Mardlin for offering revisions and insight as I was working on this post.

]]>This week I’m at the IMA workshop on Modern Applications of Representation Theory. So far it’s been really cool!

One of the graduate students asked me about how one goes about learning the Linux command line, so I thought I would write down a few of the things I think are most useful on a day-to-day basis. Such a list is sure to stir controversy, so feel free to comment if you see a grievous omission. In fact, I learned the command line mainly through installing Gentoo Linux back before there was any automation whatsoever in the process, and suffering through lengthy forum posts getting every bit of my system more-or-less working. (Note: Starting with Gentoo is probably a bad idea. I chose it at the time because it had the best forums, but there are probably better places to start these days. I mainly use Xubuntu these days.)

So, off to the races. I’m going to skip the really, really basic ones, like ls, cd, apt-get and sudo. In fact, there’s a nice tutorial at LinuxCommand.org which covers a lot of the basics, including IO redirection. Finally, I’m assuming that one is using the bash shell.

- man:

Pulls up the manual page of most commands, a life-saver in a pinch. Try reading the bash manual page if you have a bunch of free time; it’s worth it. In spite of Jefferson’s idealism, not all men are created equal.

- cron:

Run specified programs at prescheduled times, or at boot time. Cron itself is the name of the program; it’s a daemon, meaning it is always running in the background, invisibly doing what it’s meant to do. The actual list of commands lives in a file called a crontab, which is typically edited using the ‘crontab -e’ command, which opens the current user’s crontab in an editor (run it as root to edit root’s crontab). To learn the proper layout of the crontab file, try ‘man 5 crontab’.

Cron is useful for automating mindless tasks, and is thus the favored tool of sys-admins everywhere. I’m pretty sure there are people whose entire careers consist of writing shell scripts and then running them with cron.

- ctrl+r:

Reverse-search through the last couple thousand commands you’ve used in bash. Hit ctrl+r and start typing and Bash will start searching for the last command that used your typed characters. Typing something partial and then hitting ctrl+r again will then cycle back through all commands that match what you’ve typed.

Note that to get back a very recent command, you can just hit the up button repeatedly.

This has probably saved hours of my life. Use it, love it.

- Tab completion:

If you start typing a command, you can finish it by hitting ‘tab.’ If there is more than one possible completion, hitting tab twice will show you a list of everything that matches.

- screen:

Screen is a program for running other programs in the background. This is useful when you want to, say, ssh into your department server, start an enormously long computation, and then log out while having the computation continue. It also saves you from losing work due to lost network connections or accidentally closed terminal windows. Just type ‘screen’ to start a screen session. To leave the screen session (keeping it running in the background), hit ‘ctrl+a’ and then ‘d’. We then say that the screen is ‘detached.’ To see what screens you have running, type ‘screen -ls’. And to reattach to a detached screen, type ‘screen -r NUM’, where NUM is the number that appears in ‘screen -ls’.

- dmesg | tail:

Shows the last few lines of the kernel’s message log. Note that the pipe character ‘|’ takes the output of dmesg (which is the kernel’s whole message buffer, typically everything since the last boot) and runs it into a program called ‘tail’ which just shows the end of whatever you feed it. To see, say, fifty lines of the log, try ‘dmesg | tail -n 50’. By the way, most logs live in ‘/var/log’, if you want to read them directly.

- grep:

Grep is the sledgehammer of the command line. It searches for text. It’s really common to pipe output of other programs into grep. For example, ‘dmesg | grep usb’ will show all lines of the current system log containing the phrase ‘usb’. Grep can also handle regular expressions, if you’re feeling fancy. The -A and -B options show a number of lines after/before the searched terms, which is occasionally useful.

- top:

Shows what processes are hogging your cpu and/or memory. You can then kill them with ‘pkill’.

- ifconfig, ping, and nmap:

The ‘ifconfig’ command lists your current network connections, including your IP and MAC addresses. ‘ping’ sends a very short packet somewhere, and is useful for seeing if your internet connection is working. Try ‘ping -c 5 google.com’, for example. Finally, ‘nmap’ is useful for scanning your local network for other machines. For example, my IP address as I write this might be something like 192.168.42.137. So I know that my local network is probably assigning all of its connections as 192.168.42.X, with X between 1 and 254. Then you can run: ‘nmap 192.168.42.0/24 -sP’ to silently ‘ping’ every address 192.168.42.X and get a report of who ponged. I tend to use this to figure out where my Raspberry Pi is on the wireless network so I can SSH in.
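If you want to play with the piping ideas above without depending on whatever your own machine happens to have logged, here is a small sketch. It assumes bash, and it uses a made-up log file (the log lines and the variable names are invented for illustration) as a stand-in for the output of dmesg; the ‘tail -n’, ‘grep’, and ‘grep -A’ behaviour is the same either way.

```shell
#!/usr/bin/env bash
# Build a small stand-in for a log so the example is reproducible.
# (These log lines are invented for illustration; on a real system
# you would pipe the output of 'dmesg' instead of 'cat "$log"'.)
log=$(mktemp)
cat > "$log" <<'EOF'
[    0.000000] kernel booting
[    1.204000] usb 1-1: new high-speed USB device number 2
[    1.350000] usb 1-1: device descriptor read
[    2.010000] eth0: link up
[    3.500000] usb 1-1: USB disconnect, device number 2
[    4.000000] boot complete
EOF

# 'tail -n 2' keeps only the last two lines of whatever it is fed.
last_two=$(cat "$log" | tail -n 2)

# 'grep usb' keeps only the lines containing "usb" (case-sensitive).
usb_lines=$(cat "$log" | grep usb)

# 'grep -c' counts the matching lines instead of printing them.
usb_count=$(grep -c usb "$log")

# '-A 1' also prints one line of context after each match.
with_context=$(grep -A 1 'link up' "$log")

echo "Last two lines:";          echo "$last_two"
echo "Lines mentioning usb:";    echo "$usb_lines"
echo "Number of usb lines: $usb_count"
echo "Match plus one line after:"; echo "$with_context"

# Clean up the temporary file.
rm -f "$log"
```

Swapping ‘cat "$log"’ for ‘dmesg’ gives the real thing; on some systems reading the kernel log requires sudo.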

Ok, that’s what I’ve got off the top of my head… Any suggestions for more?

]]>