GPM: Generative Password Manager

Community Article Published July 7, 2024
Colab logo GitHub logo Neural tokenization

Password management is up there with cookie popups and ads, a major pain in the ass.

Turns out you don't need to save passwords, they can all be derived from a single master key.

Here's a small experiment using neural networks.

Intuition

Password managers can be seen as a deterministic function: given a target (app, website, etc) and a login ID (username, mail, etc) they produce a password.

password = manager(target, login)

Typically they just save your credentials to ensure you always get the same password.

Despite appearances, neural networks are actually deterministic too. So the manager function above could be a (L)LM.

In this case:

  • the master key would be used to initiate the weights
  • the MLP would take the login and / or target as input prompt
  • the passwords would not be saved but generated as "prediction" of the model

Also, the vocabulary and model can be setup to:

  • satisfy the password requirements
  • have high entropy while being deterministic
  • create memorable passwords like a sentence, contrary to random generation

And the NN cannot be a common / shared model, unless you are ok with typing "password" in a login field.

0. Setup The Hyper Parameters

The generative function is a MLP: it is defined by hyper-parameters.

  • the seed for the random number generators
  • the tensor shapes
  • the input vocabulary (all the ASCII characters)
  • the output vocabulary (alpha / numbers / symbols)
  • the password length, which is the length of the sampling

0.1. Defining the Input Vocabulary

The inputs are projected on the ASCII table, all unicode characters are ignored.

This vocabulary is fixed, whatever the user typed:

INPUT_VOCABULARY = ''.join(chr(__i) for __i in range(128)) # all ASCII characters

0.2. Composing The Output Vocabulary

The output vocabulary dictates the composition of the model output, IE the password.

This vocabulary can contain:

  • lowercase letters
  • uppercase letters
  • digits
  • ASCII symbols, apart from the quotes " and '
VOCABULARY_ALPHA_UPPER = ''.join(chr(__i) for __i in range(65, 91))                             # A-Z
VOCABULARY_ALPHA_LOWER = VOCABULARY_ALPHA_UPPER.lower()                                         # a-z
VOCABULARY_NUMBERS = '0123456789'                                                               # 0-9
VOCABULARY_SYMBOLS = ''.join(chr(__i) for __i in range(33, 48) if chr(__i) not in ["'", '"'])   # !#$%&\()*+,-./

It is generated from the user preferences with:

def compose(lower: bool=True, upper: bool=True, digits: bool=True, symbols: bool=False) -> str:
    return sorted(set(lower * VOCABULARY_ALPHA_LOWER + upper * VOCABULARY_ALPHA_UPPER + digits * VOCABULARY_NUMBERS + symbols * VOCABULARY_SYMBOLS))

By default, it is:

OUTPUT_VOCABULARY = ''.join(compose(1, 1, 1, 0))
# 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

In the end, the meta-parameters are:

N_INPUT_DIM = len(INPUT_VOCABULARY) # all ASCII characters
N_OUTPUT_DIM = N_INPUT_DIM # placeholder, it depends on the user settings

N_CONTEXT_DIM = 8
N_EMBEDDING_DIM = 128

N_PASSWORD_DIM = 16
N_PASSWORD_NONCE = 1

0.3. Casting The Master Key Into The Seed

A naive approach is to interpret the master key as a HEX sequence, then cast into the integer seed:

def seed(key: str) -> int:
    return int(bytes(key, 'utf-8').hex(), 16) % (2 ** 32) # dword

But many inputs produce the same seed:

print(seed('never seen before combination of letters'))
# 1952805491
print(seed('combination of letters'))
# 1952805491
print(b'combination of letters'.hex())
# 636f6d62696e6174696f6e206f66206c657474657273

The encoding of the string 'combination of letters' requires 22 bytes, so it is greater than 2 ** 168. Prepending a prefix means adding a number times 2 ** 176 which leads to the same value modulo 2 ** 32.

To separate the encoding of similar mater keys, it is first hashed using sha256:

def seed(key: str) -> int:
    __hash = hashlib.sha256(string=key.encode('utf-8')).hexdigest()
    return int(__hash[:8], 16) # take the first 4 bytes: the seed is lower than 2 ** 32

Now:

print(seed('never seen before combination of letters'))
# 3588870616
print(seed('combination of letters'))
# 3269272188

1. Preprocessing The Inputs

The inputs are the login information for which the user wants a password:

  • login target
  • login id

Before being handled to the model, they need to be preprocessed to guarantee that the output matches the user expectations.

1.1. Removing Unwanted Characters

First, the inputs should be cleaned to:

  • remove spaces: they serve no purpose and are typos like http://example. com
  • remove unicode characters: many typos produce invisible control characters like chr(2002)

Spaces can be removed with:

def remove_spaces(text: str) -> str:
    return text.replace(' ', '').replace('\t', '')

1.2. Normalizing The Strings

Several variants can be used to point to the same service:

example.com
https://example.com
https://example.com/
ExamPLE.COM

So they need to be normalized with:

def remove_prefix(text: str) -> str:
    __r = r'^((?:ftp|https?):\/\/)'
    return re.sub(pattern=__r, repl='', string=text, flags=re.IGNORECASE)

def remove_suffix(text: str) -> str:
    __r = r'(\/+)$'
    return re.sub(pattern=__r, repl='', string=text, flags=re.IGNORECASE)

Pieced together:

def preprocess(target: str, login: str) -> list:
    __left = remove_suffix(text=remove_prefix(text=remove_spaces(text=target.lower())))
    __right = remove_spaces(text=login.lower())
    return __left + '|' + __right
print(preprocess(target='example.com', login='user'))
# example.com|user
print(preprocess(target='https://example.com', login='user'))
# example.com|user
print(preprocess(target='example.com/', login='USER'))
# example.com|user

2. Encoding The Inputs

2.1. Mapping The Characters To Integers

The mapping between character and integer is a straightforward enumeration:

def mappings(vocabulary: list) -> dict:
    __itos = {__i: __c for __i, __c in enumerate(vocabulary)}
    __stoi = {__c: __i for __i, __c in enumerate(vocabulary)}
    # blank placeholder
    __blank_c = __itos[0] # chr(0)
    __blank_i = 0
    # s => i
    def __encode(c: str) -> int:
        return __stoi.get(c, __blank_i)
    # i => s
    def __decode(i: int) -> str:
        return __itos.get(i, __blank_c)
    # return both
    return {'encode': __encode, 'decode': __decode}

It will remove all the characters outside the input vocabulary, EG unicode characters.

def encode(text: str, stoi: callable) -> list:
    return [stoi(__c) for __c in text] # defaults to 0 if a character is not in the vocabulary

def decode(sequence: list, itos: callable) -> list:
    return ''.join([itos(__i) for __i in sequence]) # defaults to the first character

2.2. Adding Entropy

With a character level embedding the input tensor would look like:

array([101, 120,  97, 109, 112, 108, 101,  46,  99, 111, 109, 124, 117, 115, 101, 114], dtype=int32)

Which means that each repetition in the input would also yield a repetition in the output password.

Just like regular transformer models, using a context as input will make each sample more unique. Instead of a single character, a sample is now composed of the N latest characters:

array([[  0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0, 101],
       [  0,   0,   0,   0,   0,   0, 101, 120],
       [  0,   0,   0,   0,   0, 101, 120,  97],
       [  0,   0,   0,   0, 101, 120,  97, 109],
       [  0,   0,   0, 101, 120,  97, 109, 112],
       [  0,   0, 101, 120,  97, 109, 112, 108],
       [  0, 101, 120,  97, 109, 112, 108, 101],
       [101, 120,  97, 109, 112, 108, 101,  46],
       [120,  97, 109, 112, 108, 101,  46,  99],
       [ 97, 109, 112, 108, 101,  46,  99, 111],
       [109, 112, 108, 101,  46,  99, 111, 109],
       [112, 108, 101,  46,  99, 111, 109, 124],
       [108, 101,  46,  99, 111, 109, 124, 117],
       [101,  46,  99, 111, 109, 124, 117, 115],
       [ 46,  99, 111, 109, 124, 117, 115, 101]], dtype=int32)

This can still be improved. As long as the process is deterministic, the input can be modified in any way.

For example, the successive ordinal values can be accumulated:

def accumulate(x: int, y: int, n: int) -> int:
    return (x + y) % n

The modulo guarantees that the encoding stays within the range of the ASCII encoding.

Also the context can start from the current index, instead of ending on it. Finally the encoded input can be cycled through to create and infinite iterator:

def feed(source: list, nonce: int, dimension: int) -> iter:
    __func = lambda __x, __y: accumulate(x=__x, y=__y + nonce, n=dimension) # add entropy by accumulating the encodings
    return itertools.accumulate(iterable=itertools.cycle(source), func=__func) # infinite iterable

This will allow to create passwords longer than the input text.

2.3. Formatting As A Tensor

Finally, the iterator of encoded inputs is used to generate the tensor X:

def tensor(feed: 'Iterable[int]', length: int, context: int) -> tf.Tensor:
    __x = [[next(feed) for _ in range(context)] for _ in range(length)]
    return tf.constant(tf.convert_to_tensor(value=__x, dtype=tf.dtypes.int32))

This tensor has shape (N_PASSWORD_LENGTH, N_CONTEXT_DIM):

__feed = feed(source=list('kaggle.com|apehex'.encode('utf-8')), nonce=1, dimension=256)
tensor(feed=__feed, length=N_PASSWORD_DIM, context=N_CONTEXT_DIM)
# <tf.Tensor: shape=(16, 8), dtype=int32, numpy=
# array([[107, 205,  53, 157,  10, 112, 159,   3],
#        [115, 225,  94, 192,  49, 151,   0, 102],
#        [223,  75, 173,  21, 125, 234,  80, 127],
#        [227,  83, 193,  62, 160,  17, 119, 224],
#        [ 70, 191,  43, 141, 245,  93, 202,  48],
#        [ 95, 195,  51, 161,  30, 128, 241,  87],
#        [192,  38, 159,  11, 109, 213,  61, 170],
#        [ 16,  63, 163,  19, 129, 254,  96, 209],
#        [ 55, 160,   6, 127, 235,  77, 181,  29],
#        [138, 240,  31, 131, 243,  97, 222,  64],
#        [177,  23, 128, 230,  95, 203,  45, 149],
#        [253, 106, 208, 255,  99, 211,  65, 190],
#        [ 32, 145, 247,  96, 198,  63, 171,  13],
#        [117, 221,  74, 176, 223,  67, 179,  33],
#        [158,   0, 113, 215,  64, 166,  31, 139],
#        [237,  85, 189,  42, 144, 191,  35, 147]], dtype=int32)>

Even though the input strings 'kaggle.com|apehex' had repetitions ("e" and "a") no two lines of the tensor are the same.

The process detailed here will always produce the same tensor X.

3. Creating The MLP Model

Now that all the hyper-parameters are set, creating the MLP is just a formality:

def create_model(
    seed: int,
    n_input_dim: int,
    n_output_dim: int,
    n_context_dim: int=N_CONTEXT_DIM,
    n_embedding_dim: int=N_EMBEDDING_DIM,
) -> tf.keras.Model:
    __model = tf.keras.Sequential()
    # initialize the weights
    __embedding_init = tf.keras.initializers.GlorotNormal(seed=seed)
    __dense_init = tf.keras.initializers.GlorotNormal(seed=(seed ** 2) % (2 ** 32)) # different values
    # embedding
    __model.add(tf.keras.layers.Embedding(input_dim=n_input_dim, output_dim=n_embedding_dim, embeddings_initializer=__embedding_init, name='embedding'))
    # head
    __model.add(tf.keras.layers.Reshape(target_shape=(n_context_dim * n_embedding_dim,), input_shape=(n_context_dim, n_embedding_dim), name='reshape'))
    __model.add(tf.keras.layers.Dense(units=n_output_dim, activation='tanh', use_bias=False, kernel_initializer=__dense_init, name='head'))
    __model.add(tf.keras.layers.Softmax(axis=-1, name='softmax'))
    # compile
    __model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
        loss=tf.keras.losses.CategoricalCrossentropy(from_logits=False, label_smoothing=0., axis=-1, reduction=tf.keras.losses.Reduction.SUM_OVER_BATCH_SIZE, name='loss'))
    return __model

For the purpose of this POC we are using Tensorflow and Keras, but it could actually be done with basic matrix multiplications.

Numpy would be almost as convenient to use and yield the same result.

4. Sampling = Password Generation

The forward pass of the tensor X in the above model will result in the probabilities for each character in the output vocabulary.

This can be directly decoded as a string like this:

def password(model: tf.keras.Model, x: tf.Tensor, itos: callable) -> str:
    __y = tf.squeeze(model(x, training=False))
    __p = list(tf.argmax(__y, axis=-1).numpy())
    return decode(__p, itos=itos)

Evaluation

All the operations are pieced together in the process function:

def process(
    master_key: str,
    login_target: str,
    login_id: str,
    password_length: int,
    password_nonce: int,
    include_lower: bool,
    include_upper: bool,
    include_digits: bool,
    include_symbols: bool,
    input_vocabulary: str=INPUT_VOCABULARY,
    model_context_dim: int=N_CONTEXT_DIM,
    model_embedding_dim: int=N_EMBEDDING_DIM
) -> str:
    # seed to generate the model weights randomly
    __seed = seed(key=master_key)
    # input vocabulary
    __input_mappings = mappings(vocabulary=input_vocabulary)
    __input_dim = len(input_vocabulary)
    # output vocabulary
    __output_vocabulary = compose(lower=include_lower, upper=include_upper, digits=include_digits, symbols=include_symbols)
    __output_mappings = mappings(vocabulary=__output_vocabulary)
    __output_dim = len(__output_vocabulary)
    # inputs
    __inputs = preprocess(target=login_target, login=login_id)
    __source = encode(text=__inputs, stoi=__input_mappings['encode'])
    __feed = feed(source=__source, nonce=password_nonce, dimension=__input_dim)
    __x = tensor(feed=__feed, length=password_length, context=model_context_dim)
    # model
    __model = create_model(seed=__seed, n_input_dim=__input_dim, n_output_dim=__output_dim, n_context_dim=model_context_dim, n_embedding_dim=model_embedding_dim)
    # password
    return password(model=__model, x=__x, itos=__output_mappings['decode'])

We can fix the internal parameters of the model like so:

_process = functools.partial(
    process,
    password_length=32,
    password_nonce=1,
    include_lower=True,
    include_upper=True,
    include_digits=True,
    include_symbols=False,
    input_vocabulary=INPUT_VOCABULARY,
    model_context_dim=N_CONTEXT_DIM,
    model_embedding_dim=N_EMBEDDING_DIM)

Which makes it easier to test the password generation:

print(_process(master_key='test', login_target='huggingface.co', login_id='apehex'))
# UEkmcY3IgIjT7o0ISs7qNon66FIVT1Qi
print(_process(master_key='test', login_target='https://huggingface.co/', login_id='APEHEX'))
# UEkmcY3IgIjT7o0ISs7qNon66FIVT1Qi
print(_process(master_key='anotherkey', login_target='https://huggingface.co/', login_id='APEHEX'))
# 4IazpOJxK4aOrBfnLsWNLrsq7ftzxHfE

As expected the whole process is deterministic: calls with equivalent inputs will always yield the same password, there is no need to save it.

print(_process(master_key='verysecretpassphrase', login_target='example.com', login_id='u s e [email protected]'))
# 4ZUHYALvuXvcSoS1p9j7R64freclXKvf
print(_process(master_key='verysecretpassphrase', login_target='HTTPS://example.com/', login_id='[email protected]'))
# 4ZUHYALvuXvcSoS1p9j7R64freclXKvf

CLI

I wrapped this script in a Python CLI.

python  gpm/main.py --key 'never seen before combination of letters' --target 'http://example.com' --id '[email protected]'
# YRLabEDKqWQrN6JF

The full list of parameters is the following:

Generate / retrieve the password matching the input information

optional arguments:
  -h, --help                                    show this help message and exit
  --key MASTER_KEY, -k MASTER_KEY               the master key (all ASCII)
  --target LOGIN_TARGET, -t LOGIN_TARGET        the login target (URL, IP, name, etc)
  --id LOGIN_ID, -i LOGIN_ID                    the login id (username, email, etc)
  --length PASSWORD_LENGTH, -l PASSWORD_LENGTH  the length of the password (default 16)
  --nonce PASSWORD_NONCE, -n PASSWORD_NONCE     the nonce of the password
  --lower, -a                                   exclude lowercase letters from the password
  --upper, -A                                   exclude uppercase letters from the password
  --digits, -d                                  exclude digits from the password
  --symbols, -s                                 include symbols in the password

Improvements

This POC could be turned into a full-fledged product with:

  • performance improvements:
    • use the base numpy instead of tensorflow
    • replace the model with its base weight tensors and matrix multiplications
  • more output options:
    • generate the password as a bag of words
    • create whole sentences / quotes
    • force the use of certain characters / sub-vocabularies (like the symbols)
  • an actual distribution as:
    • browser extension
    • binary executable (CLI)
    • mobile app