Blog

Pseudonym Generator with Anagrams

Dec. 27, 2023

# Introduction This article describes my project of creating a pseudonym generator. I built a first version a long time ago, but I'm currently reviewing old projects and I decided to improve it a bit. You'll find: - the **source code** on [GitHub](https://github.com/ychalier/pseudo), - and the **demo** on my [website](/pseudo/). The idea is to generate anagrams that look like real words but aren't, thus inventing new aliases for a given name. It was inspired by [rrenaud/Gibberish-Detector](https://github.com/rrenaud/Gibberish-Detector). # Scoring Permutations First, we build a model that counts occurrences of n-gram/letter combinations in a corpus. For instance, `victo` was encountered 58 times; it was followed by an `i` 43 times (probably in words like `victoire`) and by an `r` 15 times (probably in words like `victor`, the corpus author's firtsname). Raw text is first normalized to keep lowercase ASCII letters and spaces only. Accentuated characters are preserved but accents are removed. Special tokens are added to mark the beginning (`^`) and the end (`$`) of a word.
A screenshot of the spreadsheet showing the results of the counting mentioned above
Excerpt of the model built on Les Misérables by Victor Hugo. Each row is one n-gram. Each column is one letter. Each cell contains the occurrences of the combination n-gram/letter.
Then, given a query string, we generate permutations of it, evaluate how pronounceable they look, and keep the best ones. The score of a permutation is the product of the score of each of its n-gram/letter combinations. The score of a combination is the ratio between the occurrences of that combination and the total occurrences of the n-gram. For instance, the score of `victo-i` is 43/58 = 0.74 and the score of `victo-r` is 15/58 = 0.26. For each letter, we only consider the combination with the longest n-gram possible. In the word `victor`, for the letter `r`, we only consider the score the combination `victo-r`, even if `icto-r`, `cto-r`, `to-r` and `o-r` are found in the model. This comes from the idea that longer existing sequences will provide better results. # Optimizing Exploration Iterating through each permutation is a combinatorial problem. A timeout ensures that the generator always answer within a given time frame, but some refinements can be made. Permutations are generated by exploring a tree where nodes are lists of remaining letters to place. The root contains all letters in the query. At a given node, each letter yields a new branch where that letter is removed from the list and appended to the permutation. Each branch is thus associated to a letter. Leaf nodes correspond to nodes originating from a `$` branch. This architecture allows for a nice optimization. First, we can keep track of the score of a permutation while going down the tree: we start at 1 at the root and multiply at each branch with the score of the chosen letter. Second, this value can only decrease while going down the tree: the score of any combination is a number between 0 and 1, so multiplying any positive number by it will return a lower or equal value. If all we are interested in are the top-K permutations, we can keep a list of the top-K scores encountered so far, and discard any branch where the value is already below the worst of those top-K scores.
A plot of a tree-shaped graph where about half of nodes are red
Plot of the exploration tree for the starting sequence jade. Nodes in red are ignored when only looking for the top-2 permutations (deja and jade), which represents about half of the tree.
In practice, this is not always enough, especially with very long sequences. Additional constraints, such as an absolute minimum score, or a minimal value for n-grams length, provide additional help. Also, the tree is explored depth-first, and high score branches are explorered first. In case of timeout, we can hope that our results are as decent as they can be.