## Seeing Like a Statistical Learning Algorithm

I recently had the pleasure of reading James Scott’s “Seeing Like a State,” which examines a certain strain of failure in large centrally-organized projects. These failures come down to the kinds of knowledge available to administrators and governments: aggregates and statistics, as opposed to the kinds of direct experience available to the people living ‘on the ground,’ in situations where the centralized knowledge either fails to or has no chance to describe a complex reality.  The book classifies these two different kinds of knowledge as techne (general knowledge) and metis (local knowledge).  In my reading, the techne – in both strengths and shortcomings – bears similarity to the knowledge we obtain from traditional algorithms, while metis knowledge is just starting to become available via statistical learning algorithms.

In this (kinda long) post, I will outline some of the major points of Scott’s arguments, and look at how they relate to modern machine learning.  In particular, the divides Scott observes between the knowledge of administrators and the knowledge of communities suggest an array of topics for research.  Beyond simply looking at the difference between the ways that humans and machines process data, we observe areas where traditional, centralized data analysis has systematically failed. And from these failures, we glean suggestions of where we need to improve machine learning systems to be able to solve the underlying problems.

An Example In Computation

I’ll start with a very simple example: Consider the problem of sorting.  It’s straightforward: Place a total ordering on the objects you wish to sort (say, alphabetical) and then apply any of a variety of available algorithms, like quicksort or merge sort, to arrive at a well-sorted list of objects.  This is an example of general knowledge applied to a non-specific problem; it’s a techne solution.

Now, suppose we run a shop and want to sort our rolodex of accounts, so that we can easily find the account of the next customer who walks in.  The best sorting may depend on the time of day (or time of month), how long it has been since a given customer last visited, how sensitive each customer is to waiting for their account card, and the ability of the staff to learn whatever system you come up with.  Simple alphabetization (the techne solution) seems inadequate: Maybe it’s better to use a move-to-front heuristic, or an alphabetized rolodex with notches on the cards for the most frequent or irritable customers.  Either of these would count as a metis solution, as the sorting is taking into account specialized local knowledge.  (Also, notice that much of the complexity is coming from the evaluation function, which is a nebulous combination of customer happiness and transaction efficiency; we adopt an adaptive algorithm in part to deal with the complexity in the evaluation function.)
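To make the rolodex comparison concrete, here is a minimal sketch of the move-to-front heuristic next to a fixed alphabetized rolodex. The customer names, the visit sequence, and the cost model (one unit per card flipped in a front-to-back scan) are all assumptions made up for illustration; a real clerk would jump to the right letter rather than scan from the front.

```python
class MoveToFrontRolodex:
    """A rolodex that moves each looked-up card to the front."""

    def __init__(self, names):
        self.cards = list(names)

    def lookup(self, name):
        """Return how many cards were flipped to find `name`, then move it to the front."""
        position = self.cards.index(name)  # raises ValueError for an unknown customer
        self.cards.insert(0, self.cards.pop(position))
        return position + 1


def alphabetical_cost(names, visits):
    """Total cards flipped when scanning a fixed, alphabetized rolodex from the front."""
    cards = sorted(names)
    return sum(cards.index(visitor) + 1 for visitor in visits)


# Hypothetical customers: Zimmerman is a regular, Chen drops by once.
names = ["Abbott", "Baker", "Chen", "Dietrich", "Zimmerman"]
visits = ["Zimmerman", "Zimmerman", "Chen", "Zimmerman", "Zimmerman"]

rolodex = MoveToFrontRolodex(sorted(names))
mtf_cost = sum(rolodex.lookup(visitor) for visitor in visits)
alpha_cost = alphabetical_cost(names, visits)

print(mtf_cost, alpha_cost)  # 13 cards flipped versus 23
```

The adaptive rolodex wins here only because the visits are heavily skewed toward one customer; with evenly spread visits the advantage disappears, which is exactly the sense in which the choice depends on local knowledge.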

So which do you choose: the techne algorithm or the metis approach? Of course, it depends. There are advantages (things like ease of maintenance and ease of interpretation) which favor the simpler techne solutions.  But once the assumptions diverge too radically from reality, something more is required.

A move-to-front strategy is still simple enough to yield to analysis, but more extreme examples abound.  Consider image classification, where traditional algorithmic approaches struggled for decades and have recently been replaced, with far more success, by machine learning algorithms.

Prior to this revolution, image classification involved rather complicated pipelines and the careful curation of features developed by experts who had been working on the classification problem for years.  Success with techne approaches required a deep metis knowledge of the field, in the form of extremely detailed local knowledge of the experts assembling the pipelines.  More recently, these kinds of solutions have been supplanted by deep neural networks, which have been making fast progress towards parity with humans. And successfully training these networks requires little of the expert image processing knowledge needed in the previous approaches.

A trained neural network is still essentially a metis solution, turning a body of experience (training data) into a specialized knowledge of a problem domain.  But the local knowledge of image classification has been offloaded from the experts to the machines.  This is a real shift.

Legibility and Interpretability

Prior to the rise of computers and machine learning, a metis solution for administration always meant hiring someone, and maybe a large number of someones, to store and execute on the needed local knowledge.  For these kinds of problems, a centralized state essentially has three options: Hire a small army of bureaucrats to keep track of the needed information, give up, or (crucially) wield its power to change the conditions on the ground to accommodate a simpler solution.

The failed projects that Scott investigates are particularly extreme examples where centralized powers enforced order to create ‘legible’ systems.  Scott’s initial example is German scientific forestry, where the story goes something like this: The king wants to export lumber, which requires knowing the size of the forests and how much lumber they are capable of producing.  However, a natural forest is difficult to quantify. There are many kinds of trees, in different stages of growth and of indeterminate number.  The solution was to create mono-crop forests, with trees planted at the same time in careful rows, allowing easy counting of trees and prediction of lumber yields.  While this worked well for one generation of trees, the mono-cropping depleted the soil quickly, and required a great deal of direct maintenance afterward.

Scott provides a wide variety of other case studies, including urban design, taxes and land rights, sub-Saharan agriculture and villagization programs, and various Soviet programmes. All of them relate back to this theme of ‘legibility,’ which essentially refers to the amenability of a community or system to classical statistical analysis. In the forestry example, the natural forest is illegible, with a huge number of variables which are difficult to combine into a real understanding of the forest.  The scientific forest is much more legible: Land area and age of the trees translate almost directly into a low variance estimate of how much lumber the forest will produce.

A large part of the promise of machine learning is to produce legibility from complex systems.  Given a complicated set of inputs, one wants to determine some easily digested information, like whether a person will default on a loan, whether a picture is a picture of a dog, or whether the author of a comment is a human or not.  At the same time, we worry quite a bit about the interpretability of a model: Can we figure out why the model made a certain decision?  One can view model-building as a way of transferring illegibility from the observed system to the model itself.  If the model is fully interpretable (e.g., it can be reduced to the observation of just a couple of variables), one may as well have just built a traditional algorithm.

Interestingly, we don’t often talk about changing systems to make them easier for our machine learning systems to understand.  We talk about improving the pipeline – often through adding new features or providing cleaner data – but not so often about changing the thing being observed in the first place.

Superstition and Illegibility

On a related note, one of the common criticisms of ‘local knowledge’ is that it is superstitious, and can come with irrelevant or even harmful artifacts.  These are the demons that science is meant to guard us from.

Firstly, Scott makes the argument that many of the claims of harmful superstitious knowledge are overhyped, while the actual contributions of traditional knowledge to science are completely underrated.  For example, the practice of intentionally exposing people in a good state of health to a disease, in order to establish immunity, was apparently known for centuries prior to the invention of vaccinations.  Vaccination was a huge step forward in terms of ease, reliability, and safety, but it is also a scientific derivation from a previously known practice.  Likewise, it is rare for a new medicine to be synthesized completely from scratch: many are derived from leads given by traditional medicine.

That said, there are certainly huge issues with simply trusting traditional practice. Female genital mutilation seems like a pretty clear example of a terrible practice followed for no particularly good reason.

Likewise, there are fears of our learning models picking up superstitions and spurious correlations. In large, deployed systems, there’s a related problem of ‘vestigial features’: inputs fed into a model which contribute very little, but which would take a huge amount of effort to show can be safely removed.  These excess features can unnecessarily complicate the model, or even obscure simpler insights which might be available if the model were sufficiently reduced to be interpretable.  And in fact, this runs to the heart of the statistics-versus-data-science debate: Sometimes we train a large, complicated machine learning model when we haven’t put in the effort to understand that a simple single variable regression would solve the problem.  If we don’t have a full understanding of the model, we don’t know if we can replace the whole thing with something much simpler.

One can dream of a system which looks at a data set and tells you whether you’re better off calling in a statistician or just using a random forest.
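A very crude first step toward that dream might look like the sketch below: before reaching for a big model, check whether any single feature already explains most of the variance in the target under a plain least-squares line. The feature names, the toy data, and the 0.9 cutoff are all hypothetical choices of mine, and a high R² on a handful of points is no substitute for an actual statistician.

```python
from statistics import mean


def single_feature_r2(xs, ys):
    """R^2 of an ordinary least-squares line fitting ys from one feature xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant feature or constant target explains nothing
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


def best_single_feature(rows, ys):
    """rows is a list of {feature_name: value} dicts; return (name, R^2) of the best feature."""
    scores = {name: single_feature_r2([row[name] for row in rows], ys) for name in rows[0]}
    best = max(scores, key=scores.get)
    return best, scores[best]


def suggest(rows, ys, threshold=0.9):
    """Hypothetical triage rule: the threshold is arbitrary."""
    name, r2 = best_single_feature(rows, ys)
    if r2 >= threshold:
        return f"try a simple regression on '{name}'"
    return "no single feature suffices; something more flexible may be warranted"


# Toy data: the target is exactly 2 * income + 1, and 'noise' is a vestigial feature.
incomes = [1, 2, 3, 4, 5, 6]
noise = [3, 1, 4, 1, 5, 9]
rows = [{"income": i, "noise": n} for i, n in zip(incomes, noise)]
ys = [2 * i + 1 for i in incomes]

best_name, best_r2 = best_single_feature(rows, ys)
advice = suggest(rows, ys)
print(best_name, advice)
```

This only catches linear, single-variable structure; interactions and nonlinear effects would slip straight past it, which is part of why the real version remains a dream.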

French Land Rights and Adversarial Machine Learning

One of my favorite case studies is on land rights in post-revolutionary France.  The government wants to administer rural areas, but it turns out there’s little single-ownership of land, and things are pretty chaotic in the wake of the revolution.  There are feudal landowners with existing claims, new claims on land liberated during the revolution, complex usufruct community land rights, a whole system of guilds, and informal agreements within communities which allow this mess to continue more-or-less amicably.  The government wants to make things more legible with a new ‘code rural.’

Two proposals were entertained.  The first was to throw away all existing agreements, and move to a system of title deeds, private property, and contracts governing land use.  (Essentially, how we think of land rights today.)  The second proposal, by Lalouette, was to travel around to all of the villages and transcribe all of the existing land agreements into law.  The first proposal does nothing to respect the existing players on the ground, in favor of making things easy for the government.  Meanwhile, the Lalouette proposal would require a huge amount of administrative work, both to document all of the existing agreements and to dredge it up again when a disagreement winds up in court.

In the end, neither proposal was enacted. The private property system was deemed to be too easy to manipulate by the old aristocracy and bourgeoisie, and would lead to eventual concentration of wealth and power away from the peasantry and proletariat.  Meanwhile, the ‘transcribe it all’ solution was far too much work and would be too difficult to enforce.  Furthermore, the transcribed agreements would be obsolete almost instantly, as new agreements cropped up to deal with new problems on the ground.

There are a couple things I love about this example. First, it gives a crystal clear illustration of the relationship between knowledge and power.  You can centralize administration of a system if and only if you can centralize knowledge of the system.  If the facts on the ground are too complicated for the administrators to follow, then the people on the ground can do as they wish whenever the administrators aren’t around.

Second, the Lalouette proposal illustrates a problem we run into in machine learning all the time: in many domains, a statically trained model will quickly fall into obsolescence.  In fact, the areas where machine learning has been most successful are areas where the freshness of the model doesn’t matter too much.  A good solution for image classification or automatic translation doesn’t need to change very quickly, since images and languages don’t change very quickly.

In many domains, however, the facts on the ground can change quickly, as the thing you’re trying to learn changes its form.  This occurs especially in adversarial situations: The adversary changes behaviour, and the model needs to change to react to this new behaviour.  This produces problems on both ends of the ML pipeline: a trained model is quickly obsolete, but even worse, your training data also becomes obsolete.  And labelling new training data, or even evaluating the model you’ve built, can be very expensive in these situations.  These situations require a tight feedback loop between models, data, and human supervisors and analysts.
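One small piece of such a feedback loop can be sketched as a monitor that watches the model’s error rate on the most recent human-labelled examples and raises a flag when it climbs too high. The class name, the window size, the threshold, and the spam/ham labels are all assumptions for illustration; the expensive part, getting those fresh labels from human analysts, is exactly what this sketch takes for granted.

```python
from collections import deque


class DriftMonitor:
    """Tracks the error rate over the most recent labelled predictions."""

    def __init__(self, window=50, max_error_rate=0.2):
        self.outcomes = deque(maxlen=window)  # True marks a wrong prediction
        self.max_error_rate = max_error_rate

    def record(self, predicted, actual):
        self.outcomes.append(predicted != actual)

    def error_rate(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def needs_retraining(self):
        # Only raise the flag once the window is full, to avoid reacting to a few samples.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.error_rate() > self.max_error_rate


monitor = DriftMonitor(window=4, max_error_rate=0.25)

# The model starts out agreeing with the human labels.
for predicted, actual in [("spam", "spam"), ("ham", "ham"), ("spam", "spam"), ("ham", "ham")]:
    monitor.record(predicted, actual)
healthy = monitor.needs_retraining()

# The adversary changes behaviour and the model starts missing spam.
for predicted, actual in [("ham", "spam"), ("ham", "spam")]:
    monitor.record(predicted, actual)
stale = monitor.needs_retraining()

print(healthy, stale)  # False True
```

A sliding window is the simplest possible choice here; it forgets old behaviour quickly, which suits adversarial settings but makes the monitor jumpy when labels are scarce.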

Polycropping and Objective Functions

Another interesting case study is polycropping in sub-Saharan Africa.  Polycropping is the practice of interspersing many different crops, as opposed to the monocropping which predominates in North America. Scott argues at length that polycropping provides better yields and better variety of crops for local communities, but is more difficult for a streamlined, capitalist economy to understand and interact with.  Even though polycropping produces higher yields, there is a higher administrative cost for processing the outputs, and the additional outputs may be in crops which are difficult to utilize in downstream industrial systems.

But even this is an overconstrained view of the issue: Scott illustrates the quite wide variety of uses that crops and crop byproducts are put to, such as stalks from harvested plants providing trellising for other crops with no further work.  What strikes me is that the polycropping subsistence farmer is driven by a wide range of motivations in their planting choices, well beyond just how much food will be produced.  Planting one crop might reduce the work needed to plant a follow-on crop, provide medicine or ceremonial by-products, act as a pest repellent, and so on.  The best crops may succeed at fulfilling a number of different objectives all at once, reducing the overall number of crops which need to be planted (and the attendant human labor).

In machine learning we often use ensembles of models, which frequently perform much better than a single model.  But here we see that there is also great value in diversity of evaluation functions, or what we’re training the model(s) to excel at.

It is fairly common to train a model and deploy it, only to eventually realize that the choice of objective function at training time has biased the results in some undesirable way.  For example, a recommendation system trained to maximize revenues might always recommend higher-priced items, since buyers will produce more revenue.  Over time, though, consumers come to understand that the site is over-priced relative to other sources and go elsewhere for their purchases.  A standard response is then to change the objective function to something ‘better,’ realize the shortcomings of the new objective a few months later, and repeat.

But in fact, it may be better to train an ensemble of models on different objectives, to provide a range of options which are good for different reasons.  Such an ensemble provides not just model diversity, but diversity of quality. The catch is that it becomes quite difficult to establish the quality of the trained ensemble: Any evaluation function chosen could be used to train a single model directly, after all!
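As a toy illustration of the idea, the sketch below uses two hand-written scoring functions as stand-ins for two separately trained models, one per objective, and surfaces the top pick of each instead of collapsing everything into a single ranking. The items, prices, and purchase probabilities are invented, and real models would be learned from data instead of written down like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    price: float
    buy_probability: float  # hypothetical estimate of how likely a visitor is to buy


def expected_revenue(item):
    return item.price * item.buy_probability


def purchase_likelihood(item):
    return item.buy_probability


# Each entry stands in for a model trained on a different objective.
OBJECTIVES = {
    "expected_revenue": expected_revenue,
    "purchase_likelihood": purchase_likelihood,
}


def recommend(items, objectives):
    """Return the top item under each objective, keyed by objective name."""
    return {label: max(items, key=score).name for label, score in objectives.items()}


catalog = [
    Item("luxury blender", price=400.0, buy_probability=0.05),
    Item("standard blender", price=80.0, buy_probability=0.2),
    Item("hand whisk", price=10.0, buy_probability=0.5),
]

picks = recommend(catalog, OBJECTIVES)
print(picks)
```

With these numbers the revenue objective picks the luxury blender and the likelihood objective picks the hand whisk, so a shopper sees both the pricey option and the one they are most likely to actually want. How to score the combined slate is left open, for the reason given above.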

Extremely Local Knowledge

Another favorite example of the divide between local and global knowledge arises in ship navigation.  Scott points out that many harbors have special pilots with intimate knowledge of local conditions, who come on board a seagoing ship and pilot it through the harbor and into port.  The sea navigator is familiar with global techniques of navigation – how to efficiently get from point A to point B using stars, charts, and GPS – but these methods give no help with the placement of rocks in port or the variety of local currents.  For this, specialized local knowledge is required.  And not only this: an expert is needed for every port.

This illustrates another kind of area where machine learning has great opportunities for success: areas where there is too much local variation to allow strong global solutions, but enough data available to learn from.  Areas where geography has a large impact on outcomes are pretty common; training models for (say) medical diagnosis at each hospital may well be more effective than training a single model on all patients globally.  For example, Lyme disease will be less common in deserts, and the quality of surgeons on staff at a specific hospital may have a strong bearing on individual outcomes.
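The simplest possible version of a per-hospital model is a per-hospital base rate, with a fallback to the global rate wherever a site has too little data of its own. The sketch below does just that; the clinic names, the records, and the `min_samples` cutoff are hypothetical, and it assumes at least one record overall.

```python
from collections import defaultdict


def fit_rates(records, min_samples=3):
    """records: iterable of (site, outcome) pairs with outcome 0 or 1.

    Returns the global positive rate and a dict of local rates for every
    site with at least `min_samples` records.
    """
    totals = defaultdict(lambda: [0, 0])  # site -> [positives, count]
    for site, outcome in records:
        totals[site][0] += outcome
        totals[site][1] += 1
    positives = sum(p for p, _ in totals.values())
    count = sum(n for _, n in totals.values())
    global_rate = positives / count
    local_rates = {site: p / n for site, (p, n) in totals.items() if n >= min_samples}
    return global_rate, local_rates


def predict(site, global_rate, local_rates):
    """Use the site's own rate when available, otherwise the global rate."""
    return local_rates.get(site, global_rate)


# Invented diagnosis records: 1 means the patient tested positive.
records = (
    [("forest_clinic", 1)] * 3 + [("forest_clinic", 0)]
    + [("desert_clinic", 0)] * 4
    + [("new_clinic", 1)]
)

global_rate, local_rates = fit_rates(records)
forest = predict("forest_clinic", global_rate, local_rates)
desert = predict("desert_clinic", global_rate, local_rates)
newcomer = predict("new_clinic", global_rate, local_rates)
print(forest, desert, newcomer)
```

The hard cutoff between local and global is a blunt instrument; the trade-off it stands for (local models need enough local data to beat the global one) is the constraint described above.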

This also hints at the value of highly personalized models.  Recently, I was discussing the problem of abusive comments on Facebook. Abusive comments and trolls are well-known to discourage people (especially women) from speaking up in public spaces on the internet.  And yet the major providers of these public spaces (like Facebook and Twitter) have been pretty much powerless to stop the problem, in part because any top-down, global solution will have too strong an impact on freedom of expression.  Facebook, when they delete a post, cite a nebulous breach of ‘community standards,’ but Facebook is a massive collection of communities, with widely varying standards, ranging from Black Lives Matter groups to Trump supporters.  A site-wide solution needs to allow both safer spaces for highly targeted groups, and anything-goes spaces for the b-tards.  Giving control of models to individuals who have authority over subsites (like Twitter comments in response to one’s own tweets) would provide an excellent way forward for public discourse on the internet.