The following was a technical question asked to me during the initial interviewing process for Athena Health. I do not know if this question was written by them, but I really enjoyed working on the problem and I thought I would share my solution to it.
Consider a “word” as any sequence of letters A-Z (not limited to just “dictionary words”). For any word with at least two different letters, there are other words composed of the same letters but in a different order (for instance, stationarily/antiroyalist, which happen to both be dictionary words; for our purposes “aaiilnorstty” is also a “word” composed of the same letters as these two).
We can then assign a number to every word, based on where it falls in an alphabetically sorted list of all words made up of the same set of letters. One way to do this would be to generate the entire list of words and find the desired one, but this would be slow if the word is long.
Write a program which takes a word as a command line argument and prints to standard output its number. Do not use the method above of generating the entire list. Your program should be able to accept any word 25 letters or less in length (possibly with some letters repeated), and should use no more than 1 GB of memory and take no more than 500 milliseconds to run.
Sample words, with their rank:
ABAB = 2
AAAB = 1
BAAA = 4
QUESTION = 24572
BOOKKEEPER = 10743
Your program will be judged on how fast it runs and how clearly the code is written. We will be running your program as well as reading the source code, so anything you can do to make this process easier would be appreciated.
Well, generating the full list of permutations for the given word would not only take forever, but would also take up way too much space in memory. Instead, I implemented my own search operation, which doesn’t generate the list of permutations, but still uses properties of that list. If you want to skip right to the code, or read it along with my explanation, you can find it at the following link:
So, let’s assume that we have the list of permutations for the current word in front of us. The list is sorted alphabetically, and our goal is to find the index of our word in that list. The tl;dr of the algorithm is this: look at each letter in our word, and use that letter to calculate how many positions above our word in the list we can safely ignore. We keep a running count of how many positions we can ignore, and as we iterate over the letters in the word, this count eventually becomes the index of our word in its list of permutations (without ever actually generating the list).
So let’s get started with an example. Let’s say our word is “ONOD” (a permutation of the word “DOON”). We want to take our word (“ONOD”) and compute its position in the list of lexicographically ordered permutations. Without even generating the list of permutations, we know that the size of our list (not counting duplicates) will follow the formula:

n! / (a! * b! * c! * ...)

(Where n is the length of the word and a, b, c, … are the frequencies of each of the letters in the word)
So, for our word “ONOD”:
- Size of word = 4
- a -> “O” = 2
- b -> “N” = 1
- c -> “D” = 1
And plugging in those values, the formula becomes:

4! / (2! * 1! * 1!) = 24 / 2 = 12

So we learn that the list of permutations will contain 12 words! In the rest of the explanation, I’m going to include a visualization of the list of permutations so that we can see what’s going on (even though our algorithm never actually computes the list, but rather applies a simple formula N times).
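That count is easy to check in code. Here is a small Python sketch of the formula (the function name `permutation_count` is my own, not from the original problem):

```python
from math import factorial
from collections import Counter

def permutation_count(word):
    """Number of distinct arrangements of the letters in `word`:
    n! divided by the factorial of each letter's frequency."""
    total = factorial(len(word))
    for freq in Counter(word).values():
        total //= factorial(freq)
    return total

print(permutation_count("ONOD"))  # 4! / (2! * 1! * 1!) = 12
```

Note the integer division: the quotient is always exact, so there is no need for floating point.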
So, this is what we know initially. We know from our formula that we will have 12 permutations in our list (I have included the whole list for reference). We also know the following information about our word:
- Size of word: 4
- Frequency of “D”: 1
- Frequency of “N”: 1
- Frequency of “O”: 2
Let’s look at the first letter in the word: “O”. Among the letters in our word (D, N, O), there are 2 that come before “O” alphabetically: “D” and “N”. Because of this, we can compute the size of the sub-lists of permutations that start with “D” or “N”. The frequencies of “D” and “N” make up 50% of the letters in the word, so their combined sub-lists will account for 50% of the total permutations. In other words, we know we can skip (12 * 0.5) = 6 elements.
I have updated our list to show the changes. You can see how if our word starts with an “O”, then we can just ignore everything above it in the list (shaded in dark to demonstrate). Now that we have computed how much we can skip for our first letter, we remove it from our word and repeat the process. Our new data looks like:
- Word: “NOD”
- Size of word: 3
- Frequency of “D”: 1
- Frequency of “N”: 1
- Frequency of “O”: 1
We re-compute the size of our search space for the new word by following our formula, which will be:
(3!)/(1! * 1! * 1!) = 6.
Our new current letter is “N”. This time there is only 1 letter (“D”) that comes earlier alphabetically, and it accounts for 1/3 of the remaining letters in the word. So, we know we can skip:

6 * (1/3) = 2 elements.
We add this to our previous running total, and we have now eliminated 8 words just by looking at the first two letters of the word. Here is the updated table:
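Spelled out as arithmetic, the first two iterations for “ONOD” look like this (a small illustrative sketch, not the full program):

```python
from math import factorial

# Iteration 1: word = "ONOD", frequencies {D: 1, N: 1, O: 2}
space = factorial(4) // (factorial(2) * factorial(1) * factorial(1))  # 12 permutations
skip = space * (1 + 1) // 4   # "D" and "N" are 2 of the 4 letters -> skip 6

# Iteration 2: word = "NOD", frequencies {D: 1, N: 1, O: 1}
space = factorial(3) // (factorial(1) ** 3)  # 6 permutations remain in the sub-list
skip += space * 1 // 3        # only "D" precedes "N" -> skip 2 more

print(skip)  # 8 words eliminated after two letters
```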
We simply keep repeating these steps until we reach the end of the word. All we need to maintain is the frequency table of the letters in the word, plus 3 simple calculations per iteration. The end result is a fast algorithm that can easily compute the index of words 25+ letters in length. Not only that, but the search’s best case performs exactly the same as its worst case. It doesn’t matter how many elements you skip at each iteration, because each skip is a single calculation instead of a search. So searching for “OOND” (the last word in the list) would theoretically take as much time as “DNOO” (the first word in the list). My implementation was able to compute indices for words in just a couple of milliseconds.
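Putting the pieces together, the whole procedure can be sketched in a few lines of Python. This mirrors the steps described above, but it is my own sketch rather than the submitted program, and the helper names are mine:

```python
from math import factorial
from collections import Counter

def arrangements(freqs):
    """Distinct orderings of a multiset of letters, given their frequencies."""
    total = factorial(sum(freqs.values()))
    for f in freqs.values():
        total //= factorial(f)
    return total

def word_rank(word):
    """1-based index of `word` in the sorted list of its distinct permutations."""
    freqs = Counter(word)
    skipped = 0
    for i, letter in enumerate(word):
        remaining = len(word) - i
        space = arrangements(freqs)          # size of the current sub-list
        for smaller in freqs:
            if smaller < letter:
                # Every word starting with `smaller` sits above ours in the list.
                skipped += space * freqs[smaller] // remaining
        freqs[letter] -= 1                   # fix this letter and move on
        if freqs[letter] == 0:
            del freqs[letter]
    return skipped + 1

for w in ["ABAB", "AAAB", "BAAA", "QUESTION", "BOOKKEEPER"]:
    print(w, "=", word_rank(w))  # matches the sample ranks above
```

As in the walkthrough, `space * freqs[smaller] // remaining` is the size of one smaller letter’s sub-list, and the division is always exact.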
An Alternate Solution
While working on my algorithm, I thought of another interesting way to solve the problem. Imagine we do everything exactly the same as in the method above, except that the running total from that method now serves as the “upper window” for our index. At each iteration, we would also keep track of a “lower window”, calculated in almost exactly the same way, except that instead of starting at 0 and adding to it, you start at the size of the list and subtract what you calculate. The upper window is then the smallest index our word could be at, and the lower window is the largest index at which our word could be found. You can terminate the search as soon as the two windows meet.

What this means is that this algorithm has the potential to find the index without looking at every single letter in the word, so it would potentially run through fewer iterations than the one above. If you implement these algorithms in another language or improve upon mine in some way, please leave me a comment and I’d love to take a look at it!
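Here is a hedged Python sketch of that idea, under my reading of the two-window description above (the names and details are my own; I treat both windows as 1-based ranks and stop once they coincide):

```python
from math import factorial
from collections import Counter

def window_rank(word):
    """Rank of `word` among its sorted permutations, tracking both the
    smallest and the largest index it could still occupy."""
    freqs = Counter(word)
    low = 1                                  # smallest possible index so far
    high = factorial(len(word))
    for f in freqs.values():
        high //= factorial(f)                # largest possible index so far
    for i, letter in enumerate(word):
        if low == high:                      # window closed: rank pinned down early
            break
        remaining = len(word) - i
        span = high - low + 1                # permutations sharing the prefix so far
        before = sum(f for c, f in freqs.items() if c < letter)
        after = sum(f for c, f in freqs.items() if c > letter)
        low += span * before // remaining    # sub-lists above ours
        high -= span * after // remaining    # sub-lists below ours
        freqs[letter] -= 1
    return low

print(window_rank("BAAA"))  # 4, decided after inspecting only the first letter
```

For “BAAA”, fixing the first letter shrinks the window to a single index, so the remaining three letters are never examined; in the worst case it still visits every letter, just like the first algorithm.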