I noticed that the UUID for a developer account I was using was prefixed with cafe. When I showed the coincidence to a colleague, he said, “I wonder how many words of the Danish language could be spelled using ABCDEF.” I’m going to find out.

## The Danish language

I have a copy of the 2019 Danish dictionary lying around (turns out you can get it for free if you ask nicely).

First, let’s load it and see what we have here.

import re
import pandas as pd

dictionary = pd.read_csv("RO2012 fuldformer 2019.txt", delimiter=";")

dictionary

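| | Grundform | Bøjning | Type |
|---|---|---|---|
| 0 | 1. A | A | sb. |
| 1 | 1. A | A'et | sb. |
| 2 | 1. A | A'ets | sb. |
| 3 | 1. A | A's | sb. |
| 4 | 1. A | A'erne | sb. |
| ... | ... | ... | ... |
| 412796 | åsyn | åsyns | sb. |
| 412797 | åsyn | åsynene | sb. |
| 412798 | åsyn | åsynenes | sb. |
| 412799 | åsyn | åsyn | sb. |
| 412800 | åsyn | åsyns | sb. |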

412801 rows × 3 columns

dictionary.nunique()

Grundform     64978
Bøjning      388378
Type             32
dtype: int64


It seems there are around 65 thousand words in their base form in the dictionary, and some inflected forms are counted more than once: there are 412,801 rows but only 388,378 unique values in the Bøjning column. I guess words are listed once for each meaning they might have, for example seas (*have*) and to have (*at have*).
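One way to check that guess would be to look for inflected forms that appear under more than one base form or word class. Here is a minimal sketch on a hand-made toy sample. The rows are illustrative and not copied from the dictionary file, and `toy_rows` and `forms_with_multiple_entries` are names I made up for this example.

```python
from collections import defaultdict

# Hand-made toy rows (base form, inflected form, word class).
# Illustrative only, not copied from the dictionary file.
toy_rows = [
    ("hav", "hav", "sb."),
    ("hav", "have", "sb."),
    ("have", "have", "sb."),
    ("have", "have", "vb."),
    ("abe", "abe", "sb."),
]

def forms_with_multiple_entries(rows):
    """Return the inflected forms listed under more than one (base form, word class) pair."""
    entries = defaultdict(set)
    for base, form, word_class in rows:
        entries[form].add((base, word_class))
    return {form: pairs for form, pairs in entries.items() if len(pairs) > 1}

shared = forms_with_multiple_entries(toy_rows)
print(sorted(shared))  # ['have']
```

On the real data, something like `dictionary.groupby("Bøjning")["Grundform"].nunique()` should give a similar picture, though I have not verified the result against the file here.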

## UUID

The textual representation of a UUID poses some limits on word length (32 hexadecimal digits, split by hyphens into groups of 8-4-4-4-12, and so on). I will set those limits aside and count the words in the dictionary that consist only of the following:

1. Letters a, b, c, d, e & f
2. Digits 0 through 9
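To make the layout concrete, here is a small sketch using Python's standard `uuid` module. The UUID value is made up for illustration, and `fits_in_one_group` is a hypothetical helper that is not part of the query in this post. It only shows how a length limit could be added if one wanted a word to fit inside a single hyphen-separated group.

```python
import string
import uuid

# Any 32 hexadecimal digits parse as a UUID; this value is made up for illustration.
example = uuid.UUID("cafe0000-0000-4000-8000-000000000000")

# The canonical text form is five hyphen-separated groups of hex digits.
group_lengths = [len(group) for group in str(example).split("-")]
print(group_lengths)  # [8, 4, 4, 4, 12]

def fits_in_one_group(word, max_length=max(group_lengths)):
    """True if `word` is non-empty, hex-only, and no longer than the longest group."""
    is_hex = all(char in string.hexdigits for char in word)
    return bool(word) and is_hex and len(word) <= max_length

print(fits_in_one_group("affedede"))  # True
print(fits_in_one_group("kaffe"))     # False
```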

On to the query:

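# Deduplicate the inflected forms before matching.
words = set(dictionary['Bøjning'].to_numpy())
regex = r"[a-fA-F0-9]"
count = 0
matched = []

for word in words:
    # A word qualifies when every one of its characters matches the hex class.
    matches = re.finditer(regex, word)
    if len(tuple(matches)) == len(word):
        count += 1
        matched.append(word)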

matched

['EA',
'feed',
'b',
'ebbede',
'bed',
'F',
'abe',
'affede',
'dB',
'c',
'C',
'fa',
'abbed',
'affedede',
'a',
'ab',
'aede',
'ED',
'ea',
'AF',
'e',
'edb',
'De',
'af',
'EDB',
'BA',
'E',
'fede',
'da',
'abede',
'DAB',
'de',
'abc',
'eb',
'bede',
'B',
'ae',
'ed',
'D',
'f',
'fedede',
'ABC',
'fe',
'CD',
'cd',
'dab',
'fed',
'affed',
'd',

len(matched)

66
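The same filter can be written more compactly with `re.fullmatch`, which only succeeds when the whole string matches the pattern. This is a sketch of an alternative, not the code that produced the output above. The `sample` list is hand-picked for illustration, and `hex_words` is a hypothetical helper name.

```python
import re

# One or more hex characters, and nothing else.
HEX_ONLY = re.compile(r"[a-fA-F0-9]+")

def hex_words(words):
    """Return, sorted, the strings in `words` made up entirely of hex digits."""
    return sorted(
        word for word in words
        if isinstance(word, str) and HEX_ONLY.fullmatch(word)
    )

# Hand-picked sample; not the full dictionary.
sample = ["abe", "kaffe", "EDB", "fede", "hund", "dB"]
print(hex_words(sample))  # ['EDB', 'abe', 'dB', 'fede']
```

The `isinstance` check is there as a precaution: if the column held missing values, pandas would represent them as floats, and those would otherwise raise a `TypeError` inside the regex call.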