---
base_model: microsoft/phi-4
library_name: peft
model_name: peleke-phi-4
tags:
- base_model:adapter:microsoft/phi-4
- lora
- sft
- transformers
- trl
- chemistry
- biology
- antibody
- antigen
- protein
- amino-acid
- drug-design
pipeline_tag: text-generation
license: gpl-3.0
datasets:
- silicobio/peleke_antibody-antigen_sabdab
---

# Model Card for peleke-phi-4

This model is a fine-tuned version of [microsoft/phi-4](https://huggingface.co/microsoft/phi-4) for antibody sequence generation.
Given an antigen sequence with annotated epitope residues, it generates the Fv portions of novel heavy- and light-chain antibody sequences.

## Quick start

1. Load the Model

```python
import torch
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'silicobio/peleke-phi-4'
config = PeftConfig.from_pretrained(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Load the base model, resize its embeddings to match the adapter's tokenizer,
# then attach the LoRA adapter.
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()
model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(model, model_name).cuda()
```

2. Format your Input

This model uses `<epi>` and `</epi>` tags to mark epitope residues of interest in the antigen sequence.

When writing inputs by hand, it can be easier to mark epitope residues with square brackets, for example `...CSFS[S][F][V]L[N]WY...`. The following function converts the bracket annotation into the format the model expects.

```python
import re

def format_prompt(antigen_sequence):
    # Convert bracket-annotated residues, e.g. [S], into <epi>S</epi> tags.
    epitope_seq = re.sub(r'\[([A-Z])\]', r'<epi>\1</epi>', antigen_sequence)
    formatted_str = f"Antigen: {epitope_seq}<|im_end|>\nAntibody:"
    return formatted_str
```
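
For example, on a short, purely illustrative bracket-annotated fragment (not a real antigen), the formatter produces:

```python
print(format_prompt("CSFS[S][F][V]L[N]WY"))
# Antigen: CSFS<epi>S</epi><epi>F</epi><epi>V</epi>L<epi>N</epi>WY<|im_end|>
# Antibody:
```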

3. Generate an Antibody Sequence

```python
# `antigen` is the bracket-annotated antigen sequence from step 2.
prompt = format_prompt(antigen)
inputs = tokenizer(prompt, return_tensors="pt")
inputs = {k: v.cuda() for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1000,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,
        use_cache=False,
    )

full_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
# The model's completion follows the first <|im_end|>; strip the
# "Antibody:" prefix and surrounding whitespace to recover the sequence.
antibody_sequence = full_text.split('<|im_end|>')[1].replace('Antibody: ', '').strip()
print(f"Antigen: {antigen}\nAntibody: {antibody_sequence}\n")
```

The output is `|`-delimited: the heavy-chain Fv sequence, followed by the light-chain Fv sequence.

```sh
Antigen: NPPTFSPALL...
Antibody: QVQLVQSGGG...|DIQMTQSPSS...
```
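
To work with the two chains separately downstream, split on the delimiter (a minimal sketch; it assumes the generation contains exactly one `|`):

```python
heavy_fv, light_fv = antibody_sequence.split('|', 1)
print(f"Heavy-chain Fv: {heavy_fv}\nLight-chain Fv: {light_fv}")
```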

## Training procedure

This model was trained with supervised fine-tuning (SFT) via TRL, using a LoRA adapter applied with PEFT.
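
For orientation, the sketch below shows what a TRL SFT + LoRA run over the linked dataset typically looks like. It is illustrative only: the actual hyperparameters, LoRA configuration, dataset split, and prompt formatting used to train this model are not documented here, so every setting shown is an assumption.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Dataset from the model card metadata; the "train" split name is an assumption.
dataset = load_dataset("silicobio/peleke_antibody-antigen_sabdab", split="train")

# Hypothetical LoRA settings -- not the values used for peleke-phi-4.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="microsoft/phi-4",
    args=SFTConfig(output_dir="peleke-phi-4"),
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```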

### Framework versions

- PEFT: 0.17.0
- TRL: 0.19.1
- Transformers: 4.54.0
- PyTorch: 2.7.1
- Datasets: 4.0.0
- Tokenizers: 0.21.2