Cataclysmic Mutation

Machine Learning and Whatever Else

Building Your Own Keyboard Layout for Fun and Profit

It’s commonly known, if you’re the kind of person who commonly knows this sort of thing, that the normal QWERTY keyboard layout is not particularly well suited to typing English prose. The usual story is that it was designed to intentionally slow typists down so that the physical hammer mechanism used by early typewriters would have time to clear between key punches so as to not jam. Probably, that’s a bit suspect. But certainly the design was affected by the implementation on these typewriters to some degree. You can switch to the Dvorak layout, and there’s Colemak, and a few lesser known ones as well, but what is the optimal layout? It turns out that mathematicians and computer scientists have actually spent a bit of time thinking about this problem, and it’s a fun little exercise to look at how we might go about answering it.

I’ve seen this question appear two or three times on Reddit now, and I’ve always written a lengthy explanation. My Ph.D. work consisted of a large segment the general optimization problem that we’ll use to model keyboard layout, and I actually gave a talk on using it for that purpose in something of an “Ig nobel” approach. Anyway, it finally occurred to me that I have a convenient place where I can publish this sort of thing.

First, what is the actual problem we’re trying to solve? Basically, we want to be able to type as quickly as possible. Anyone who types with greater than “hunt-and-peck” speeds knows that certain words are harder to type than others. Personally, I notice this whenever I visit “”. Let’s look at that URL to see why. The first bit isn’t so bad really. “N-E-W-S” is typed with my right index finger reaching a bit, the “E-W” rolls off the left hand quite easily as the two characters are adjacent on my QWERTY keyboard layout so you can sort of drum your fingers across the two letters quite quickly. But then I have to hit the “S” with the same finger that just typed the “W”, but down a row. Moving my fingers up and down rows takes much longer than simply pressing a different finger on the home row. So there’s a little delay while I get my ring finger in position after typing the “W”. The “.bbc.” isn’t so bad. The “B” is one of the harder keys to reach, and I generally find the bottom row to be harder to type on than the top row (both of which are harder than the home row obviously), but the repeated keypress on the “B” is as easy as possible, and the “C” uses a different finger, and the “.” is on the other hand, both characteristics helping to allow faster typing of those keys. But the end of the URL, “” has a lot of right hand action, and a lot of jumping up and down rows. Because it will be useful later, let’s think about the difficulty of pressing two physical keys on the keyboard in succession as the distance between those keys.

Note a critical insight here. This distance we’ve defined is between keys or buttons on the keyboard. It has nothing to do with letters. We can (and will) think about removing the binding of letters and symbols to keys so that we can rebind them in more optimal ways. To do this, we need this distance measure between the physical keys where we can put them.

We need a bit more information before we can start though. Namely, what is it that we hope to type quickly. After all, you may think that QWERTY is an awful layout, but if you want to type the sentence, “asdf asdf asdf asdf asdf,” it’s a pretty good system. Generally, we’re probably interested in typing ordinary English prose, but it’s not clear exactly what that means. We need to formalize that description somewhat. The key is to understand that typing quickly is simply being able to type successive pairs of characters needed quickly and easily. If “T-H” and “H-E” are both very easy to type successively, then the word “the” will be easy to type. This means we need a convenient source of what we’ll call 2-gram frequencies. The idea is that, given the type of text you wish to write, can you estimate the frequency of each pair of characters appearing in succession? We could sit down and attempt to reason our way through it, e.g., “t-h appears a lot, e-r does too…” However, a more reliable solution is to simply analyze a bunch of text that seems to be a good representative sample of the type of thing you’ll be typing, and compute the frequencies directly from that.

Let’s take a little detour into classical optimization problems. Introduced in 1957 by Koopmans and Beckman, the Quadratic Assignment Problem (QAP) asks the following question. Can you find the best way to put buildings in available physical locations such that overall amount of traffic between the buildings is minimized? Imagine you’re tasked with building a new campus for a university with 25,000 students, faculty, and staff. The university planning committee hands you a list of all the academic buildings they’ll have (physics, engineering, mathematics, nursing, psychology, literature, etc.) and a table of numbers that tell you how many people walk from one building to another during the course of an average week, for all pairs of buildings on campus. They’ve noticed anecdotally that there are a lot of students who complain that they’re always late for class because they have to walk all the way across campus to get from their calculus class to their physics classes. They want you to try and put related buildings like that closer together when you design their new campus.

Looking at the problem, you realize that what you really want to do is minimize the total amount of miles walked by all the people on campus. If 20 people walk one mile each, and two people walk five miles each, the total walked is just miles. That is, you multiply the distance between the two locations by the amount of traffic flowing between the buildings you assign to those two locations. Sum this over all the location/building assignments, and you have the total miles walked. Minimizing this is the QAP.

Because we’re computer scientists, let’s make this easy and number everything. We have buildings and locations to put them in. Let’s number the buildings from 0 to . If we agree to always write them in the same order, we can represent a possible assignment as a valid permutation of those numbers. For example,

2 7 1 4 0 6 5 3

denotes that building 2 goes in location 0, building 7 in location 1, building 1 in location 2, and so on.

We can now formalize the problem. Take all those distances between locations and stuff them in a matrix such that is the distance between location and location . Similarly, we define a cost or “flow” matrix such that is the amount of traffic flowing between building and building . Our goal is to find the permutation such that the following objective function

is minimized. Expressed in pseudocode, this function is as shown below.

int objective(P)
    f = 0
    for i=0 to N-1
        for j=0 to N-1
            f  = D[i,j] * C[P[i],P[j]]
        end for
    end for
    return f

It should hopefully be starting to fall into place. The problem of finding an optimal keyboard layout for a given set of 2-gram frequencies is just an instance of the QAP. Instead of some form of physical distance between locations, we have a generalized distance metric that intends to conform to the idea that harder to type pairs of keys are “farther apart” than easier to type pairs. And our flow or cost matrix comes directly from the 2-gram frequencies calculated from a large sample corpus of representative text we might wish to type.

To actually solve the underlying QAP instance is challenging. Unlike, say the traveling salesrep problem (TSP), QAP is very difficult to solve well even in practice. Whereas TSP instances of tens of thousands of cities can be routinely solved within a few percent of optimality by simple heuristics, QAP instances much larger than about 35 or so remain very difficult to handle in practice. Smallish instances can be directly solved via integer programming techniques, often based on some form of branch-and-bound algorithm. However, these methods can be quite slow, and fail completely for large instances. Depending on precisely how you formulate the problem, keyboard layout is likely small enough to be handled directly. However, as my expertise is on heuristics, that’s where I will focus my attention.

Going back to Reddit, another common theme is a fascination with genetic algorithms. It’s a fascination that I understand completely. Unfortunately, GAs rather, well…how to put this delicately…suck at QAP. Actually, they’re not so fantastic on a lot of problems, but QAP in particular is kryptonite to whatever super powers your GA might ever have. That said, they are fun to explore, and looking at how we might approach QAP with a GA is instructive on several levels, so I want to talk a bit about how you’d build a GA for this problem.

Without going into all the details on GAs in general, the real issue here is crossover. Recall that every valid potential solution is a permutation, and all the simple crossover operators you might be familiar with from a GA tutorial or two share the problem that they will almost never actually produce permutations if we use them here. Assume the following two permutations:

p1 = 0 1 2 3 4 5 6 7
p2 = 4 5 7 3 1 0 6 2

Let’s imagine doing a typical one-point crossover operation with crossover point equal to three. This gives us the following situation, with the parts of the parents that will not be swapped replaced by xs.

p1' = 0 1 2 x x x x
p2' = 4 5 7 x x x x

If I just swap those two pieces, I get nonsensical results.

c1 = 0 1 2 3 1 0 6 2
c2 = 4 5 7 3 4 5 6 7

In this example, offspring c1 has building 0 in two different places (and the same for buildings 1 and 2) and it doesn’t put buildings 4, 5, or 7 anywhere.

Obviously, this is not good. The reason I get bad results is because the subset of values I chose from each parent contains values not contained in the other, which means that those values must be in the other part of the parents, and will be repeated when I swap the pieces out. For example, if I swap them, the child 2 is going to get “0 1 2” as its first three numbers, but because those weren’t in the first three numbers of parent 2, they must be somewhere else, and since child 2 gets the rest of the values from parent two, you repeat them, which is bad. What if instead, I choose indexes 0, 1, 4, and 5. Now I have

p1' = 0 1 x x 4 5 x x
p2' = 4 5 x x 1 0 x x

If I swap those values to create two partial offspring, I get

c1 = 4 5 x x 1 0 x x
c2 = 0 1 x x 4 5 x x

Completing the crossover operation by copying the x bits down from the two parents yields

c1 = 4 5 2 3 1 0 6 7
c2 = 0 1 7 3 4 5 6 2

which works nicely.

Fortunately, it turns out that there is an algorithm that guarantees exactly this situation. The method is called Cycle crossover (CX), and it’s one of a number of methods that can be used on permutations. More precisely, what CX does is define the algorithm by which you can repeatedly generate these subsets of indices in such a way as to allow you go do a crossover operation without mucking up your permutations. Slightly more formally, what CX tries to do is find subsets of positions such that a “cycle” is formed, where a cycle is defined as a closure on the contained values. OK, that doesn’t help at all. That’s why I provided an example (and code).

Note that the code below actually does a bit more than I let on in the example. In general, finding one cycle isn’t enough. Cycles can be quite short compared to the length of the string, so finding one cycle would not tend to mix the parents very well. Instead, what we do is repeat the process of finding and crossing cycles until we’ve found them all. By the same token, if we crossed every cycle over, we’d end up just swapping the parents whole. Instead, every other cycle is swapped; the rest are passed down unmodified.

void cycle_crossover(const chromosome& p1,
                     const chromosome& p2,
                     chromosome& c1,
                     chromosome& c2)
    mtrandom mt;

    // find a cycle starting from a random point
    int n = (int)p1.length();
    int cycle_len = 0;
    vector<int> perm = mt.permutation(n);
    int index;

    bool mix = true;            // mix cycles from different parents
    while(cycle_len < n) {
        // pick the first point in the cycle
        for(index=0; index<n; index++) {
            if(perm[index] != -1) {
        index = perm[index];

        // store the points in the cycle
        int start = p1[index];
        vector<int> cycle;
        do {
            perm[index] = -1;   // this point has been included in a cycle

            // set index equal to position of p2[index] in p1
            for(int i=0; i<(int)p2.length(); i++) {
                if(p1[i] == p2[index]) {
                    index = i;
        } while(p1[index] != start);

        // now that we have a cycle, update the offspring
        for(unsigned int i=0; i<cycle.size(); i++) {
            if(mix) {
                c1[cycle[i]] = p2[cycle[i]];
                c2[cycle[i]] = p1[cycle[i]];
            } else {
                c1[cycle[i]] = p1[cycle[i]];
                c2[cycle[i]] = p2[cycle[i]];
        mix = !mix;

An interesting question is why one would choose CX over one of the other permutation crossover operators. The answer lies in understanding what properties of a candidate solution are important, and should thus be heritable. Let’s back off of QAP for a moment and think about the traveling salesrep problem (TSP), another permutation problem. In TSP, our goal is to find an ordering of cities such that a tour of each city in the given order is shortest (assuming a return to the first city as well). Again, assuming some numbering of cities, we might have the following two candidate solutions:

p1 = 6 4 3 7 0 2 1 5
p2 = 7 0 2 1 5 6 4 3

Note that these two solutions have no cities in the same location. And yet a bit of thought should convince you that they by necessity have the same cost, as they visit each city in exactly the same order, only starting and ending in a different city. They are the same tour. Generalizing a bit, the important part of a TSP solution is not where exactly the numbers appear in the permutation. It’s what numbers they appear next to in the permutation. So ideally, if we have two good parents in TSP, our crossover operator should try to produce offspring in which the adjacency of cities is similar to that exhibited in the parents.

Compare this to QAP. In QAP, the physical location of a building is important. The building will be connected to every other building with some flow, and the physical distance between locations is very important. Adjacency doesn’t matter at all, as location 0 and location 1 may be on opposite sides of a campus, so the fact that a solution puts one building “next to” another one purely in terms of adjacent location numbers is not a meaningful piece of information. Crossover should be designed to pass down relevant information, and for QAP, it’s absolute position in the chromosome. CX does exactly that.

With a suitable crossover operator, basic familiarity with GAs should allow you to implement one. Mutation can be done simply by swapping two elements at random within the chromosome. Selection and replacement occur as usual.

The one piece we’ve yet to address is how to build a distance matrix for optimizing your keyboard layout. I alluded briefly earlier to the notion that keys that are hard to type consecutively should be further apart according to this distance measure, but there are a lot of ways to interpret that. In addition, elements of that decision will need to take into account the specific model of keyboard you’re trying to optimize. I advocate the following approach: sit down with your keyboard and a pen and paper and start listing little rules and associated penalties. As a really quick example, you might have something like this.

  1. Default distance between two keys is 0
  2. +1 if keys are on the same hand
  3. +1 for each key in the pair that is not on the home row
  4. +10 if the keys use the same finger, but skip one row (e.g., ‘c’ and ‘e’ on a QWERTY keyboard)
  5. +12 if the keys use the same finger, but skip two rows (e.g., ‘c’ and ‘3’ on a QWERTY keyboard
  6. +9 for hard to reach keys (e.g., the ‘=’ key on most American keyboards)

The important thing here is that your rules match your expected keyboard and that the resulting typing model be roughly consistent and representative of realistic typing preferences. It’s not easy to come up with good rules. My advice would be to write a program that lets you easily tweak the rules and generate the resulting distance matrix. That way, you can experiment much more easily, as it’s unlikely you’ll be happy with your model the first few tries. The nice thing about this approach is that you can use it for very non-standard purposes. The most recent Redditor to ask about this was attempting to design a layout that could be used with one finger. That’s a perfect opportunity to use something like this, because you can easily play with suitable models of what it would be like to type on a hypothetical keyboard, and optimize your key assignments accordingly.

Finally, there’s one other really important thing you’d want to do, and it’s a real bummer as it takes us out of the nice pure world of mathematics where the basic abstract QAP formulation is all we need. This last issue is that while your computer will be capable of finding “good” layouts as measured by that overall distance times flow metric, real humans have several other constraints for which violations will make them cranky. People won’t like it if the ‘(‘ character is where the comma key normally goes and the ‘)’ is a shifted ‘2’. Look at your keyboard: you have several issues like this. The square brackets, parenthesis, and curly braces should have a coherent organization (next to their matching symbol, same level of shiftedness, etc.).

In the vernacular of optimization problems, these are called constraints, and constrained optimization is a whole category of study by itself. If you’re doing heuristic search like GAs or tabu search, you have two choices really. You can penalize the fitness of solutions that violate these constraints and hope that the algorithm will find a way to find better objective function values by satisfying them, or you can directly attempt to detect and fix violations. You can also treat the degree of violation as an extra objective to be minimized and do multiobjective optimization. My particular preference is usually to repair violations if I can and penalize when I can’t.

I’ve posted some C++ source code for generating the distance and flow matrices. Note a couple of things. My keyboard model is pretty basic. And by “basic”, I mean “shitty”. Nonetheless, it might give you a base to start from. Also, I did this a few years ago as a graduate student as part of my department’s sort of Ig Nobel Research Day where we all did something goofy with our serious research. As my research was mostly on multiobjective optimization, I generated two sets of flow matrices: one based on a corpus of my dissertation’s LaTeX source code, and the other based on a fairly large C++ code base I used for me research at the time. With multiobjective optimization, you get a set of solutions when your done that represent different trade-offs between the objectives. So I had one keyboard that was really good for typing my dissertation, one keyboard that was really good for typing my C++ code, and a bunch of keyboards in between. However, I punted completely on the constraints mentioned above, so my keyboards were all very screwy.

So anyway, that’s a pretty decent overview of how you can build your own customized keyboard layout using a bit of ingenuity to build a good keyboard model and some standard algorithms to attack the resulting optimization problem. If you try this yourself and come up with something interesting, drop me a line. I’d love to see what you came up with.