The Danish language and UUIDs
I noticed that the UUID for a developer account was prefixed with cafe. When I showed the coincidence to a colleague, he asked, “I wonder how many words of the Danish language could be spelled using ABCDEF?” I’m going to find out.
The Danish language
I have a copy of the Danish dictionary 2019 lying around (turns out you can get it for free if you ask nicely).
First, let’s load it and see what we have here.
import re
import pandas as pd
dictionary = pd.read_csv("RO2012 fuldformer 2019.txt", delimiter=";")
dictionary
| | Grundform | Bøjning | Type |
|---|---|---|---|
| 0 | 1. A | A | sb. |
| 1 | 1. A | A'et | sb. |
| 2 | 1. A | A'ets | sb. |
| 3 | 1. A | A's | sb. |
| 4 | 1. A | A'erne | sb. |
| ... | ... | ... | ... |
| 412796 | åsyn | åsyns | sb. |
| 412797 | åsyn | åsynene | sb. |
| 412798 | åsyn | åsynenes | sb. |
| 412799 | åsyn | åsyn | sb. |
| 412800 | åsyn | åsyns | sb. |
412801 rows × 3 columns
dictionary.nunique()
Grundform 64978
Bøjning 388378
Type 32
dtype: int64
It seems like there are around 65 thousand words in their base form in the dictionary. There are also fewer unique inflected forms (388,378) than rows (412,801), so some forms are counted more than once. I guess they are listed for each meaning they might have, for example “seas” (have) and “to have” (at have).
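To see which forms repeat, one could tally the Bøjning column. The following is only a minimal sketch: it uses the five tail rows printed in the table above rather than the full file, and the names `rows`, `form_counts` and `repeated` are my own.

```python
from collections import Counter

# The last five rows shown in the table above, as (Grundform, Bøjning, Type).
rows = [
    ("åsyn", "åsyns", "sb."),
    ("åsyn", "åsynene", "sb."),
    ("åsyn", "åsynenes", "sb."),
    ("åsyn", "åsyn", "sb."),
    ("åsyn", "åsyns", "sb."),
]

# Count how often each inflected form occurs.
form_counts = Counter(form for _, form, _ in rows)

# Keep only the forms that occur more than once.
repeated = {form: n for form, n in form_counts.items() if n > 1}
print(repeated)  # {'åsyns': 2}
```

Even in this tiny sample a form repeats within the same base form and type, so separate meanings may not be the only reason for the duplicates.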
UUID
The textual representation of a UUID puts some limits on word length: it has 32 hexadecimal digits, and hyphens split the string into groups of 8-4-4-4-12. Setting that aside, I will count the words in the dictionary that consist only of the following:
- Letters a, b, c, d, e & f
- Digits 0 through 9
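The layout described above can be checked with Python's standard `uuid` module. This is a small sketch; the UUID below is made up for illustration, and only its cafe prefix echoes the one that started this post.

```python
import uuid

# A made-up UUID for illustration, not the real account UUID.
example = uuid.UUID("cafe0000-0000-4000-8000-000000000000")

text = str(example)                       # canonical lowercase form with hyphens
groups = text.split("-")                  # the hyphen-separated groups
group_lengths = [len(g) for g in groups]  # hex digits per group
total_digits = len(example.hex)           # hex digits with hyphens removed

print(group_lengths)  # [8, 4, 4, 4, 12]
print(total_digits)   # 32
```

So a word that should sit inside a single group can be at most 12 characters long, and at most 8 if it is to be a prefix.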
On to the query:
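# Unique inflected forms from the dictionary
words = set(dictionary['Bøjning'].to_numpy())
# Matches a single hexadecimal character, upper or lower case
regex = r"[a-fA-F0-9]"
count = 0
matched = []
for word in words:
    matches = re.finditer(regex, word)
    # Keep the word only if every one of its characters matched
    if len(tuple(matches)) == len(word):
        count += 1
        matched.append(word)
matched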
['EA',
'feed',
'b',
'ebbede',
'badede',
'bed',
'F',
'abe',
'affede',
'dB',
'c',
'C',
'fa',
'abbed',
'affedede',
'dada',
'a',
'ab',
'aede',
'ED',
'cad',
'ad',
'ea',
'AF',
'e',
'fade',
'AD',
'edb',
'De',
'af',
'EDB',
'fadede',
'BA',
'E',
'fede',
'da',
'facade',
'abede',
'DAB',
'de',
'abc',
'eb',
'fad',
'bede',
'B',
'ae',
'ed',
'D',
'f',
'fedede',
'ABC',
'fe',
'CD',
'cd',
'dab',
'fed',
'affed',
'CAD',
'd',
'bade',
'bebe',
'ba',
'EF',
'ebbe',
'A',
'bad']
len(matched)
66
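The same test can be written more compactly with `re.fullmatch`, which only succeeds when the whole string matches the pattern. This is an alternative sketch, not the code that produced the list above. `is_hex_word` is a name I made up, and unlike the loop above it rejects empty strings and non-string values. The sample words are taken from the outputs shown earlier.

```python
import re

# One or more hexadecimal characters, and nothing else.
HEX_ONLY = re.compile(r"[0-9a-fA-F]+")

def is_hex_word(word):
    """True if `word` is a non-empty string made only of hexadecimal digits."""
    return isinstance(word, str) and HEX_ONLY.fullmatch(word) is not None

sample = ["facade", "EDB", "A'et", "åsyn", "dB"]
hex_words = [w for w in sample if is_hex_word(w)]
print(hex_words)  # ['facade', 'EDB', 'dB']
```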
I guess the answer can almost fit on your screen. 66 words.
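Coming back to the cafe prefix that started this, here is a hedged sketch of how one might generate a random UUID that begins with one of the matched words. `uuid_with_prefix` is a hypothetical helper of mine, not part of the `uuid` module. It assumes the word has at most 8 characters, so it fits in the first group and leaves the version and variant digits untouched.

```python
import uuid

HEX_DIGITS = set("0123456789abcdefABCDEF")

def uuid_with_prefix(word: str) -> uuid.UUID:
    """Return a random version 4 UUID whose text form starts with `word`."""
    if len(word) > 8 or not set(word) <= HEX_DIGITS:
        raise ValueError(f"{word!r} does not fit in the first UUID group")
    random_hex = uuid.uuid4().hex  # 32 hex digits from a random UUID
    # Overwrite the leading digits with the word, keep the rest as generated.
    return uuid.UUID(hex=word.lower() + random_hex[len(word):])

print(uuid_with_prefix("facade"))
```

The output differs on every run, but it always starts with facade.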