Pseudonym Generator with Anagrams
This article describes my project of creating a pseudonym generator. I built a first version a long time ago, but I'm currently reviewing old projects and I decided to improve it a bit. You'll find:
The idea is to generate anagrams that look like real words but aren't, thus inventing new aliases for a given name. It was inspired by rrenaud/Gibberish-Detector.
Scoring Permutations
First, we build a model that counts occurrences of n-gram/letter combinations in a corpus. For instance, victo
was encountered 58 times; it was followed by an i
43 times (probably in words like victoire
) and by an r
15 times (probably in words like victor
, the corpus author's firtsname). Raw text is first normalized to keep lowercase ASCII letters and spaces only. Accentuated characters are preserved but accents are removed. Special tokens are added to mark the beginning (^
) and the end ($
) of a word.
Then, given a query string, we generate permutations of it, evaluate how pronounceable they look, and keep the best ones. The score of a permutation is the product of the score of each of its n-gram/letter combinations. The score of a combination is the ratio between the occurrences of that combination and the total occurrences of the n-gram. For instance, the score of victo-i
is 43/58 = 0.74 and the score of victo-r
is 15/58 = 0.26. For each letter, we only consider the combination with the longest n-gram possible. In the word victor
, for the letter r
, we only consider the score the combination victo-r
, even if icto-r
, cto-r
, to-r
and o-r
are found in the model. This comes from the idea that longer existing sequences will provide better results.
Optimizing Exploration
Iterating through each permutation is a combinatorial problem. A timeout ensures that the generator always answer within a given time frame, but some refinements can be made. Permutations are generated by exploring a tree where nodes are lists of remaining letters to place. The root contains all letters in the query. At a given node, each letter yields a new branch where that letter is removed from the list and appended to the permutation. Each branch is thus associated to a letter. Leaf nodes correspond to nodes originating from a $
branch.
This architecture allows for a nice optimization. First, we can keep track of the score of a permutation while going down the tree: we start at 1 at the root and multiply at each branch with the score of the chosen letter. Second, this value can only decrease while going down the tree: the score of any combination is a number between 0 and 1, so multiplying any positive number by it will return a lower or equal value. If all we are interested in are the top-K permutations, we can keep a list of the top-K scores encountered so far, and discard any branch where the value is already below the worst of those top-K scores.
In practice, this is not always enough, especially with very long sequences. Additional constraints, such as an absolute minimum score, or a minimal value for n-grams length, provide additional help. Also, the tree is explored depth-first, and high score branches are explorered first. In case of timeout, we can hope that our results are as decent as they can be.