The Danish language and UUIDs - Christoffer Hauthorn

I noticed that the UUID for a developer account of mine was prefixed with cafe. When I showed the coincidence to a colleague, he asked, “I wonder how many words of the Danish language could be spelled using ABCDEF?” I’m going to find out.

The Danish language

I have a copy of the 2019 Danish dictionary lying around (turns out you can get it for free if you ask nicely).

First, let’s load it and see what we have here.

import re
import pandas as pd

dictionary = pd.read_csv("RO2012 fuldformer 2019.txt", delimiter=";")

dictionary

	Grundform	Bøjning	Type
0	1. A	A	sb.
1	1. A	A'et	sb.
2	1. A	A'ets	sb.
3	1. A	A's	sb.
4	1. A	A'erne	sb.
...	...	...	...
412796	åsyn	åsyns	sb.
412797	åsyn	åsynene	sb.
412798	åsyn	åsynenes	sb.
412799	åsyn	åsyn	sb.
412800	åsyn	åsyns	sb.

412801 rows × 3 columns

dictionary.nunique()

Grundform     64978
Bøjning      388378
Type             32
dtype: int64

It seems there are around 65 thousand words in their base form in the dictionary, and some inflected forms appear more than once (412,801 rows, but only 388,378 unique forms). I guess they are listed once for each meaning they might have, for example have as in “seas” and have as in at have, “to have”.
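To make the repeated forms concrete, here is a minimal sketch in plain Python that tallies how often each inflected form occurs. The rows are made up to mimic the three columns of the dictionary and are not copied from the real file; the "vb." type label in particular is an assumption.

```python
from collections import Counter

# Illustrative (Grundform, Bøjning, Type) rows; not copied from the real file.
# The "vb." type label is an assumption.
rows = [
    ("hav", "have", "sb."),
    ("have", "have", "sb."),
    ("have", "have", "vb."),
    ("åsyn", "åsyn", "sb."),
]

# Tally how many rows share each inflected form
form_counts = Counter(form for _, form, _ in rows)

# Keep only the forms that occur in more than one row
repeated = {form: n for form, n in form_counts.items() if n > 1}
print(repeated)
```

On the real data, dictionary['Bøjning'].value_counts() should give the same kind of tally directly.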

UUID

The textual representation of a UUID puts some limits on word length (32 hexadecimal digits, split by hyphens into groups of 8-4-4-4-12, and so on). I will count the words in the dictionary that consist only of the following:

Letters a, b, c, d, e & f
Digits 0 through 9
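As a rough illustration of those constraints, the sketch below checks whether a word uses only hexadecimal digits and, if it fits in the first group of 8, pads it with zeros into a full UUID string. The helper names is_hex_word and uuid_with_prefix are hypothetical and only for illustration; uuid.UUID comes from the Python standard library.

```python
import re
import uuid

HEX_WORD = re.compile(r"[0-9a-fA-F]+")
GROUP_LENGTHS = (8, 4, 4, 4, 12)  # hyphen-separated groups in the textual form


def is_hex_word(word: str) -> bool:
    """Return True if every character of the word is a hexadecimal digit."""
    return HEX_WORD.fullmatch(word) is not None


def uuid_with_prefix(word: str) -> uuid.UUID:
    """Return a UUID whose first group starts with the word, zero-padded."""
    if not is_hex_word(word) or len(word) > GROUP_LENGTHS[0]:
        raise ValueError(f"{word!r} does not fit in the first UUID group")
    # 32 hexadecimal digits in total; uuid.UUID inserts the hyphens itself
    hex_digits = word.lower().ljust(32, "0")
    return uuid.UUID(hex=hex_digits)


print(uuid_with_prefix("cafe"))  # cafe0000-0000-0000-0000-000000000000
print(is_hex_word("åsyn"))       # False
```

Note that the padded value is only a syntactically valid UUID string; it does not carry meaningful version or variant bits.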

On to the query:

# Unique inflected forms from the dictionary
words = set(dictionary['Bøjning'].to_numpy())

# A word qualifies only if every character is a hexadecimal digit
regex = r"[a-fA-F0-9]+"
matched = [word for word in words if re.fullmatch(regex, word)]

matched

['EA',
 'feed',
 'b',
 'ebbede',
 'badede',
 'bed',
 'F',
 'abe',
 'affede',
 'dB',
 'c',
 'C',
 'fa',
 'abbed',
 'affedede',
 'dada',
 'a',
 'ab',
 'aede',
 'ED',
 'cad',
 'ad',
 'ea',
 'AF',
 'e',
 'fade',
 'AD',
 'edb',
 'De',
 'af',
 'EDB',
 'fadede',
 'BA',
 'E',
 'fede',
 'da',
 'facade',
 'abede',
 'DAB',
 'de',
 'abc',
 'eb',
 'fad',
 'bede',
 'B',
 'ae',
 'ed',
 'D',
 'f',
 'fedede',
 'ABC',
 'fe',
 'CD',
 'cd',
 'dab',
 'fed',
 'affed',
 'CAD',
 'd',
 'bade',
 'bebe',
 'ba',
 'EF',
 'ebbe',
 'A',
 'bad']

len(matched)

I guess the answer can almost fit on your screen. 66 words.
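One caveat about that number: hexadecimal digits are case-insensitive, so entries such as ABC and abc would spell the same thing inside a UUID. A minimal sketch of folding the case, using only a handful of entries copied from the list above rather than the full result:

```python
# A handful of entries copied from the matched list above
sample = ["EA", "ea", "ABC", "abc", "CD", "cd", "dB", "facade"]

# Hexadecimal digits are case-insensitive, so fold everything to lower case
distinct = sorted({word.lower() for word in sample})
print(distinct)  # ['abc', 'cd', 'db', 'ea', 'facade']
```

Applying the same set comprehension to matched would give the number of distinct spellings.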