GleghornLab/DSM_650 · Unconditional Sequence Generation does not work

10 days ago

Hi GleghornLab and Synthyra,
@lhallee

Thanks again for releasing this and other protein LLM models.

I'm following your example for "Unconditional Sequence Generation".
If I am running it as is getting this error:
TypeError: GenerateMixin.mask_diffusion_generate() missing 1 required positional argument: 'tokenizer'

If I add 'tokenizer' object (tokenizer=tokenizer) I get this error:

    148 self.special_token_ids = self._get_special_token_ids(extra_tokens)
    149 self.special_token_ids = torch.tensor(self.special_token_ids, device=device).flatten()
--> 151 num_mask_tokens = (input_tokens == mask_token_id).sum().item()
    152 steps = max(1, num_mask_tokens // step_divisor)
    154 trajectory = []
AttributeError: 'bool' object has no attribute 'sum'

lhallee

Gleghorn Lab org 10 days ago

Hey @DaniDubi ,

The readme was outdated and has been updated now. Is running fine on my end. If you still get the error could you paste in a complete step by step on how you got there.
Best,
Logan

The code I ran to confirm:

import torch
from models.modeling_dsm import DSM # Or DSM_ppi for binder generation

# Load a pre-trained model
model_name_or_path = "GleghornLab/DSM_650" # Replace with your model of choice
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DSM.from_pretrained(model_name_or_path).to(device).eval()
tokenizer = model.tokenizer

### Unconditional generation
length = 100
mask_token = tokenizer.mask_token
# optionally, enforce starting with methionine
input_tokens = tokenizer.encode('M' + ''.join([mask_token] * (length - 1)), add_special_tokens=True, return_tensors='pt').to(device)
output = model.mask_diffusion_generate(
    tokenizer=tokenizer,
    input_tokens=input_tokens,
    step_divisor=100,          # lower is slower but better
    temperature=1.0,           # sampling temperature
    remasking="random",        # strategy for remasking tokens not kept
    preview=False,             # set this to True to watch the mask tokens get rilled in real time
    slow=False,                # adds a small delay to the real time filling (because it is usually very fast and watching carefully is hard!)
    return_trajectory=False    # set this to True to return the trajectory of the generation (what you watch in the preview)
) # Note: output will be a tuple if return_trajectory is True

generated_sequences = model.decode_output(output)
print(f"Generated sequence: {generated_sequences[0]}")

The output:

Generated sequence: MKRIDLLFTGFVDQRPHNEEVILVAYGITLGAPASERTGFTRDLQGDLIDERARGGEFRFDMIAKDDFAPAGFTCHGAVHVLRRFIFLGAPDPIYVNMSL

DaniDubi changed discussion status to closed 9 days ago