pan-li committed · Commit d7174d3 · verified · 1 Parent(s): 826f039

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ performance.png filter=lfs diff=lfs merge=lfs -text
37
+ proteinmoe_architecture.png filter=lfs diff=lfs merge=lfs -text
LICENSE CHANGED
@@ -0,0 +1,49 @@
1
+ GENBIO AI COMMUNITY LICENSE AGREEMENT
2
+
3
+ This GenBio AI Community License Agreement (the “License”) constitutes an agreement between you or the legal entity you represent (“you” or “your”) and GENBIO.AI, INC. (“GenBio”), governing your use of the GenBio Materials. If you are using the GenBio Materials on behalf of a legal entity, you represent and warrant to GenBio that you have full legal authority to act on behalf of that legal entity as applicable under the License. If you do not have the authority to accept this License or if you disagree with any or all of the License, you shall not use the GenBio Materials in any manner. By using or distributing any portion or element of the GenBio Materials, you imply your agreement to be bound by the License.
4
+
5
+ “GenBio Materials” means any datasets, code, model weights or any other materials provided by GenBio at the following GitHub Page https://github.com/genbio-ai or Hugging Face Page https://huggingface.co/genbio-ai, including any updates or modifications made from time to time, whether in Source or Object form, and is made available to you under this License.
6
+
7
+
8
+ 1. License Grant.
9
+ 1.1 License Scope. Subject to the terms of this License, GenBio grants you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable and royalty-free limited license under GenBio’s intellectual property or other rights owned by GenBio embodied in the GenBio Materials to use, reproduce, distribute, and create Derivative Works of, and make modifications to, the GenBio Materials for any Non-Commercial Purposes.
10
+ 1.2 Use Restrictions. Restricted activities in relation to the License or use of GenBio Materials include:
11
+ 1.2.1 You shall use the GenBio Materials, Contributions, Derivative Works, Outputs and Output Derivatives (as defined below) solely for Non-Commercial Purposes;
12
+ 1.2.2 You shall not, directly or indirectly: (a) use or provide access to any Outputs or Output Derivatives to train, optimize, improve, or otherwise enhance the functionality or performance of any machine learning models or related technologies that are similar to the GenBio Materials; (b) engage in any form of model distillation or other methods that would achieve the purposes described in subsection (a) above. Notwithstanding the foregoing, you may use Outputs and Output Derivatives to train, optimize, improve, or enhance the functionality or performance of: (i) The GenBio Materials itself; and (ii) downstream Derivative Works of the GenBio Materials;
13
+ 1.2.3 Your use of the GenBio Materials shall be subject to any additional terms and conditions that: (a) GenBio provides to you separately; or (b) GenBio otherwise makes available to you.
14
+
15
+ 2. Sharing and Distribution.
16
+ 2.1 Subject to Section 1, if you distribute or make available the GenBio Materials or a Derivative Work to a third party for your Non-Commercial Purposes, in Source or Object form, you shall:
17
+ 2.1.1 provide a copy of this License to that third party;
18
+ 2.1.2 retain the following attribution notice within a “Notice” text file distributed as a part of such copies: “This is licensed under the GenBio AI Community License Agreement, Copyright © GENBIO.AI, INC. All Rights Reserved”; and
19
+ 2.1.3 prominently display “Powered by GenBio AI” on a related website, user interface, blogpost, about page, or product documentation.
20
+ 2.2 If You create a Derivative Work, you may add your own attribution notice(s) to the “Notice” text file included with that Derivative Work, provided that you clearly indicate which attributions apply to the GenBio Materials and state in the “Notice” text file that you changed the GenBio Materials and how it was modified.
21
+
22
+ 3. Submission of Contribution.
23
+ Unless you explicitly state otherwise, any Contribution intentionally submitted for inclusion in the GenBio Materials by you to GenBio shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with GenBio regarding such Contributions.
24
+
25
+ 4. Export Control.
26
+ You shall comply with the applicable U.S. Foreign Corrupt Practices Act and all applicable export laws, restrictions and regulations of the U.S. Department of Commerce, and any other applicable U.S. and foreign authority.
27
+
28
+ 5. Disclaimer of Warranty.
29
+ GENBIO MATERIALS PROVIDED BY GENBIO OR ANY OUTPUT YOU RECEIVED ARE PROVIDED “AS IS.” EXCEPT TO THE EXTENT PROHIBITED BY LAW, GENBIO MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, WHETHER EXPRESS, IMPLIED OR OTHERWISE, REGARDING THE ACCURACY, COMPLETENESS OR PERFORMANCE OF THE SERVICES AND YOUR OUTPUT, OR WITH RESPECT TO SATISFACTORY QUALITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT.
30
+
31
+ 6. Limitation of Liability.
32
+ In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the GenBio Materials (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
33
+
34
+ 7. General Terms.
35
+ 7.1 Relationship of Parties. You and GenBio are independent contractors, and nothing herein shall be deemed to constitute either party as the agent or representative of the other or both parties as joint venturers or partners for any purpose.
36
+ 7.2 Assignment. This License and the rights and obligations herein may not be assigned or transferred, in whole or in part, by You without the prior written consent of GenBio. Any assignment in violation of this provision is void. GenBio may freely assign or transfer this License, in whole or in part. This License shall be binding upon, and inure to the benefit of, the successors and permitted assigns of the parties.
37
+ 7.3 Governing Law. This License shall be governed, construed and interpreted in accordance with the laws of the State of California, without giving effect to principles of conflicts of law. Each of the parties to this License consents to the exclusive jurisdiction and venue of the courts of the state and federal courts of California.
38
+ 7.4 Severability. If any provision of this License is held to be invalid, illegal or unenforceable in any respect, that provision shall be limited or eliminated to the minimum extent necessary so that this License otherwise remains in full force and effect and enforceable.
39
+
40
+ 8. Definitions.
41
+ 8.1 “Commercial Entity” means any entity engaged in any activity intended for or directed toward commercial advantage or monetary compensation, including, without limitation, the development of any product or service intended to be sold or made available for a fee. For the purpose of this License, references to a Commercial Entity expressly exclude any universities, non-profit organizations, not-for-profit entities, research institutes and educational and government bodies.
42
+ 8.2 “Contribution” means any work of authorship, including the original version of the GenBio Materials and any modifications or additions to that GenBio Materials or Derivative Works thereof, that is intentionally submitted to GenBio for inclusion in the GenBio Materials by the copyright owner or by an individual or legal entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, “submitted” means any form of electronic, verbal, or written communication sent to GenBio or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, GenBio for the purpose of discussing and improving the GenBio Materials, but excluding Outputs and all communications that are conspicuously marked or otherwise designated in writing by the copyright owner as “Not a Contribution”.
43
+ 8.3 “Contributor” means GenBio and any individual or legal entity on behalf of whom a Contribution has been received by GenBio and subsequently incorporated within the GenBio Materials.
44
+ 8.4 “Derivative Work” means any work, whether in Source or Object form, that is based on (or derived from) the GenBio Materials and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the GenBio Materials and Derivative Works thereof.
45
+ 8.5 “Non-Commercial Purposes” means uses not intended for or directed toward commercial advantage or monetary compensation, or the facilitation of development of any product or service to be sold or made available for a fee. For the avoidance of doubt, the provision of Outputs as a service is not a Non-Commercial Purpose.
46
+ 8.6 “Object” means any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
47
+ 8.7 “Output” means any output, including any protein sequence, structure prediction, functional annotation, molecule, descriptions of a molecule, model, sequence, text, and/or image that is elicited directly or indirectly by, or otherwise made available to, you in connection with your use of the GenBio Materials, including, but not limited to, the use of AI-Powered Technology. For the avoidance of doubt, it includes any intermediate results, such as activations across model layers, intermediate outputs from model layers (e.g., attention maps), as well as gradients and embeddings produced by the GenBio Materials.
48
+ 8.8 “Output Derivatives” means any enhancements, modifications and derivative works of Outputs (including, but not limited to, any derivative sequences or molecules).
49
+ 8.9 “Source” means the preferred form for making modifications, including but not limited to GenBio Materials source code, documentation source, and configuration files.
README.md CHANGED
@@ -1,5 +1,167 @@
1
  ---
2
  license: other
3
- license_name: genbio.ai-community-license
4
- license_link: LICENSE
5
  ---
1
  ---
2
  license: other
3
  ---
4
+
5
+ # AIDO.Protein-RAG-16B-proteingym-dms-zeroshot
6
+
7
+ AIDO.Protein-RAG-16B-proteingym-dms-zeroshot is a multimodal protein language model that integrates Multiple Sequence Alignment (MSA) and structural data, building upon the [AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B) foundation. The training process comprises three main stages:
8
+
9
+ 1. 2D RoPE encoding fine-tuning
10
+ 2. Initial training on 100 billion tokens from UniRef50/UniClust30 MSA data
11
+ 3. Subsequent training on 23 billion tokens from AlphaFold Database MSA and structural data
12
+
13
+ ## Model Architecture
14
+
15
+ AIDO.Protein-RAG-16B-proteingym-dms-zeroshot employs an encoder-only transformer architecture in which sparse Mixture-of-Experts (MoE) layers replace the dense MLP layer in each transformer block. The model uses single-amino-acid tokenization, is optimized with a masked language modeling (MLM) objective, and activates 2 of its 8 experts per token via top-2 routing (sketched in code after the table below).
16
+
17
+ <center><img src="proteinmoe_architecture.png" alt="An Overview of AIDO.Protein" style="width:70%; height:auto;" /></center>
18
+
19
+ More architecture details are shown below:
20
+
21
+ | Model Arch Component | Value |
22
+ | ----------------------- | :---: |
23
+ | Num Attention Heads | 36 |
24
+ | Num Hidden Layers | 36 |
25
+ | Hidden Size | 2304 |
26
+ | FFN Hidden Size | 7680 |
27
+ | Num Experts per MoE Layer | 8 |
28
+ | Num Experts Activated per Token | 2 |
29
+ | Vocab Size | 44 |
30
+ | Context Length | 2048 |
31
+
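+ The top-2 routing described above can be sketched as follows; this is a minimal, illustrative version of the expert-dispatch loop in `modeling_ragplm.py` (function and tensor names are ours, not the training code). With `num_experts = 8` and `experts_per_token = 2` as in `config.json`, each token is processed by only two expert MLPs while the other six stay idle.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def top2_moe(hidden, router, experts, k=2):
+     """hidden: [num_tokens, hidden_size]; router: Linear(hidden_size, num_experts); experts: list of MLPs."""
+     logits = router(hidden)                                # [num_tokens, num_experts]
+     weights, selected = torch.topk(logits, k, dim=-1)      # choose the top-2 experts for each token
+     weights = F.softmax(weights, dim=-1)                   # normalize the two routing weights
+     output = torch.zeros_like(hidden)
+     for e, expert in enumerate(experts):                   # dispatch each token to its selected experts
+         token_idx, slot = torch.where(selected == e)
+         if token_idx.numel() > 0:
+             output[token_idx] += weights[token_idx, slot, None] * expert(hidden[token_idx])
+     return output
+ ```
+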
32
+ ## Pre-training of AIDO.Protein-RAG-16B-proteingym-dms-zeroshot
33
+
34
+ Here we briefly introduce the pre-training of AIDO.Protein-RAG-16B-proteingym-dms-zeroshot, which is divided into three stages: (1) 1D -> 2D RoPE encoding fine-tuning; (2) UniRef50/Uniclust30 MSA fine-tuning; (3) AlphaFold Database MSA & structure token fine-tuning.
35
+
36
+ ### Data
37
+
38
+ **UniRef50/Uniclust30 MSA dataset**: We utilized sequences from UniRef50 as queries to search for homologous sequences in UniClust30, subsequently constructing multiple sequence alignments (MSAs). UniRef50 comprises a total of 53.6 million sequences. Using HHblits, we searched all sequences, identifying over 25 homologous sequences for 23.7 million of them. This dataset was directly used as the training set, referred to as `HHblits_MSA`. The remaining 29.9 million sequences were input into MSA Retriever, resulting in 7.7 million sequences with more than 25 homologous sequences. This dataset was designated as `Retriever_MSA`. During training, RAGPLM randomly sampled from the two datasets with probabilities of 0.75 and 0.25. Refer to AIDO.Protein-RAG-3B paper ([link](https://www.biorxiv.org/content/10.1101/2024.12.02.626519v1)) for more information.
39
+
40
+ **AlphaFold Database MSA & Structure dataset**: We downloaded all structural data from the AlphaFold Database and kept only those where more than 40% of amino acids had a pLDDT score > 70. The remaining sequences were clustered using `mmseqs` (`seq id=0.5`), and one representative per cluster was retained, resulting in 46.9 million sequence/structure pairs. For each structure, we used [genbio-ai/AIDO.StructureTokenizer](https://huggingface.co/genbio-ai/AIDO.StructureTokenizer) to obtain structure tokens and embeddings. [MSA Retriever](https://www.biorxiv.org/content/10.1101/2024.12.02.626519v1) was used to obtain the corresponding MSA.
41
+
42
+ ### Training Details
43
+
44
+ Model training is divided into three stages:
45
+
46
+ #### (1) 1D -> 2D RoPE Encoding Fine-tuning
47
+
48
+ Same training data as [AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B), but with [2D rotary position embedding](https://arxiv.org/abs/2406.05347) for token encoding.
49
+
50
+ #### (2) UniRef50/UniClust30 MSA Fine-tuning
51
+
52
+ The model from Stage 1 is further fine-tuned on the UniRef50/Uniclust30 MSA dataset. See the [AIDO.Protein-RAG-3B paper](https://www.biorxiv.org/content/10.1101/2024.12.02.626519v1) for more.
53
+
54
+ #### (3) AlphaFold Database MSA & Structure Fine-tuning
55
+
56
+ We fine-tuned the model with concatenated query and homologous sequences. Structure embeddings (dim = 384) are linearly mapped to 2304 and added to the query token embeddings.
57
+
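+ As a minimal sketch of this conditioning step (shapes follow `str_input_dim = 384` and `hidden_size = 2304` from `config.json`; the tensors below are placeholders, not the actual pipeline):
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ L, str_dim, hidden = 128, 384, 2304
+ str_proj = nn.Linear(str_dim, hidden)        # linear map from structure-embedding dim to model dim
+ token_emb = torch.randn(L, hidden)           # query token embeddings (placeholder values)
+ str_emb = torch.randn(L, str_dim)            # per-residue structure embeddings (placeholder values)
+ cond_emb = token_emb + str_proj(str_emb)     # structure-conditioned query embeddings, shape [L, 2304]
+ ```
+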
58
+ ##### Sequence Masking
59
+
60
+ * Randomly sample `0.05 × L` span positions from a query of length `L`. Span lengths follow a geometric distribution (`p=0.2`), capped at length 10. On average, ~15% of query tokens are masked.
61
+
62
+ * When a residue is selected, its aligned residues across all sequences (MSA column) are also masked.
63
+
64
+ * For masked MSA columns: 80% are replaced with `<MASK>`, 10% with random amino acids, and 10% left unchanged.
65
+
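+ The span-sampling procedure above can be sketched as follows (illustrative only; the exact training implementation may differ):
+
+ ```python
+ import numpy as np
+
+ def sample_masked_positions(L, rate=0.05, p=0.2, max_len=10, rng=np.random.default_rng()):
+     """Sample ~rate*L span starts; span lengths follow Geometric(p), capped at max_len (~15% of tokens)."""
+     starts = rng.choice(L, size=max(1, int(rate * L)), replace=False)
+     masked = set()
+     for s in starts:
+         length = min(int(rng.geometric(p)), max_len)
+         masked.update(range(s, min(s + length, L)))
+     return sorted(masked)   # mask the full MSA column at each position, then apply the 80/10/10 rule
+ ```
+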
66
+ ##### Structure Masking
67
+
68
+ * In 20% of cases, structure embeddings are replaced with 0.
69
+
70
+ * In the other 80% of cases, a masking ratio is drawn from the BetaLinear30 distribution (a mixture of 20% Uniform(0,1) and 80% Beta(3,9)), and that fraction of residues has its structure embeddings zeroed (see the sketch below).
71
+
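+ Drawing a masking ratio from BetaLinear30 then amounts to (a sketch of the mixture stated above):
+
+ ```python
+ import numpy as np
+
+ def sample_betalinear30(rng=np.random.default_rng()):
+     # 20% of the time: Uniform(0, 1); 80% of the time: Beta(3, 9)
+     return rng.uniform() if rng.random() < 0.2 else rng.beta(3, 9)
+ ```
+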
72
+ ##### Positional Embedding
73
+
74
+ We use [2D rotary position embedding](https://arxiv.org/abs/2406.05347) to help the model distinguish token chain identities and residue indices. See AIDO.Protein-RAG-3B paper ([link](https://www.biorxiv.org/content/10.1101/2024.12.02.626519v1)) for more information.
75
+
76
+ ##### Loss Function
77
+
78
+ Total loss is a weighted sum of sequence loss (weight 1.0) and structure loss (weight 0.025).
79
+
80
+ * **Sequence loss**: CrossEntropy loss for masked token prediction.
81
+
82
+ * **Structure loss**: CrossEntropy loss for masked structure token prediction.
83
+
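+ In code, the combined objective amounts to the following (placeholder tensors; the structure-token vocabulary size of 512 is our assumption for illustration, taken from `str_output_dim` in `config.json`):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ # logits/labels at 10 masked positions: 44-token sequence vocabulary, 512 structure tokens (assumed)
+ seq_logits, seq_labels = torch.randn(10, 44), torch.randint(0, 44, (10,))
+ str_logits, str_labels = torch.randn(10, 512), torch.randint(0, 512, (10,))
+
+ total_loss = 1.0 * F.cross_entropy(seq_logits, seq_labels) + 0.025 * F.cross_entropy(str_logits, str_labels)
+ ```
+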
84
+ | Hyper-params | (1) 1D -> 2D fine-tuning | (2) UniRef50/Uniclust30 MSA fine-tuning | (3) AFDB MSA & Structure tokens fine-tuning |
85
+ | --------------------------- | :---------------------: | :------------------------------------: | :----------------------------------------: |
86
+ | Initialized parameters | AIDO.Protein-16B | Stage (1) | Stage (2) |
87
+ | Data | ColabFoldDB, UniRef | HHblits_MSA, Retriever_MSA | AFDB MSA & Structure tokens |
88
+ | Global Batch Size | 512 | 256 | 256 |
89
+ | Sequence length | 2048 | 12800 | 12800 |
90
+ | Per Device Micro Batch Size | 1 | 1 | 1 |
91
+ | Precision | Mixed FP32-FP16 | Mixed FP32-FP16 | Mixed FP32-FP16 |
92
+ | LR | [5e-6,5e-5] | [1e-6, 1e-5] | 1e-5 |
93
+ | Num Tokens | 10 billion | 100 billion | 23 billion |
94
+ | Structure loss weight | N/A | N/A | 0.025 |
95
+
96
+ ### Tokenization
97
+
98
+ We encode protein sequences at single-amino-acid resolution with a 44-token vocabulary, in which 24 tokens represent amino acid types and 20 are special tokens. Each sequence is also suffixed with a `[SEP]` token as a hook for downstream tasks.
99
+
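+ Assuming the bundled tokenizer follows the standard `PreTrainedTokenizer` API, the vocabulary can be inspected directly:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("genbio-ai/AIDO.Protein-RAG-16B-proteingym-dms-zeroshot", trust_remote_code=True)
+ print(len(tokenizer.get_vocab()))        # expected: 44 (24 amino-acid tokens + 20 special tokens)
+ print(tokenizer.tokenize("MKTAYIAKQR"))  # one token per residue
+ ```
+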
100
+ ## Results
101
+
102
+ ### Zero-shot DMS score
103
+
104
+ <center><img src="performance.png" alt="performance" style="width:100%; height:auto;" /></center>
105
+
106
+ ## How to Run
107
+
108
+ ### Load the model and tokenizer
109
+
110
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ tokenizer = AutoTokenizer.from_pretrained("genbio-ai/AIDO.Protein-RAG-16B-proteingym-dms-zeroshot", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained("genbio-ai/AIDO.Protein-RAG-16B-proteingym-dms-zeroshot", trust_remote_code=True, torch_dtype=torch.bfloat16)
+ model = model.eval().to('cuda:0')
+ ```
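+
+ For a rough picture of how zero-shot scoring works, the sketch below estimates the effect of a single substitution by masking its position and comparing mutant and wild-type log-probabilities. It deliberately ignores MSA retrieval, structure embeddings, and any special-token offsets, so treat it as an illustration of the masked-marginal idea rather than the supported pipeline; use `compute_fitness.py` from the repository below for actual scoring.
+
+ ```python
+ import torch
+
+ @torch.no_grad()
+ def masked_marginal(seq, pos, wt, mt):
+     """Toy masked-marginal score for substituting residue `wt` at 0-based `pos` with `mt`."""
+     ids = tokenizer(seq, return_tensors="pt").input_ids.to(model.device)
+     ids[0, pos] = tokenizer.mask_token_id            # assumes a mask token and no prepended special token
+     logits = model(input_ids=ids).logits[0, pos]
+     logp = torch.log_softmax(logits.float(), dim=-1)
+     return (logp[tokenizer.convert_tokens_to_ids(mt)] - logp[tokenizer.convert_tokens_to_ids(wt)]).item()
+ ```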
117
+
118
+ ### Clone the repository and install the environment
119
+
120
+ Please read the introduction of the [repository](https://gitlab.genbio.ai/pan.li/ragplm_zeroshot/-/tree/master) for details on environment setup and how to run the model.
121
+
122
+ ```bash
+ # create and activate the environment
+ conda create -n ragplm python=3.11 -y
+ conda activate ragplm
+
+ # install dependencies
+ pip install tabulate seaborn deepspeed
+ pip install git+https://github.com/genbio-ai/ModelGenerator.git
+
+ # clone the zero-shot evaluation repository
+ git clone [email protected]:pan.li/ragplm_zeroshot.git
+ cd ragplm_zeroshot
+
+ # unpack the bundled DMS and structure data and create the output directory
+ tar xf dms_data.tar.gz
+ tar xf struc_data.tar.gz
+ mkdir output
+ ```
136
+
137
+ ### Run zero-shot
138
+
139
+ ```bash
+ # score a single ProteinGym DMS assay by its id
+ python compute_fitness.py --dms_ids PTEN_HUMAN_Mighell_2018
+ ```
142
+
143
+ ## Citation
144
+
145
+ Please cite AIDO.Protein-RAG-16B-proteingym-dms-zeroshot using the following BibTeX entries:
146
+
147
+ ```
+ @inproceedings{sun_mixture_2024,
+   title = {Mixture of Experts Enable Efficient and Effective Protein Understanding and Design},
+   url = {https://www.biorxiv.org/content/10.1101/2024.11.29.625425v1},
+   doi = {10.1101/2024.11.29.625425},
+   publisher = {bioRxiv},
+   author = {Sun, Ning and Zou, Shuxian and Tao, Tianhua and Mahbub, Sazan and Li, Dian and Zhuang, Yonghao and Wang, Hongyi and Cheng, Xingyi and Song, Le and Xing, Eric P.},
+   year = {2024},
+   booktitle = {NeurIPS 2024 Workshop on AI for New Drug Modalities},
+ }
+
+ @article{Li2024.12.02.626519,
+   author = {Li, Pan and Cheng, Xingyi and Song, Le and Xing, Eric},
+   title = {Retrieval Augmented Protein Language Models for Protein Structure Prediction},
+   url = {https://www.biorxiv.org/content/10.1101/2024.12.02.626519v1},
+   year = {2024},
+   doi = {10.1101/2024.12.02.626519},
+   publisher = {bioRxiv},
+   booktitle = {NeurIPS 2024 Workshop on Machine Learning in Structural Biology},
+ }
+ ```
config.json ADDED
@@ -0,0 +1,65 @@
1
+ {
2
+ "_name_or_path": "Protein/RAGPLM",
3
+ "add_bias_linear": true,
4
+ "add_qkv_bias": true,
5
+ "add_seq_emb_ln": false,
6
+ "add_str_emb_ln": false,
7
+ "apply_query_key_layer_scaling": true,
8
+ "apply_residual_connection_post_layernorm": false,
9
+ "architectures": [
10
+ "RAGPLMForConditionalGeneration"
11
+ ],
12
+ "attention_dropout": 0.0,
13
+ "attention_softmax_in_fp32": true,
14
+ "auto_map": {
15
+ "AutoConfig": "configuration_ragplm.RAGPLMConfig",
16
+ "AutoModel": "modeling_ragplm.RAGPLMModel",
17
+ "AutoModelForCausalLM": "modeling_ragplm.RAGPLMForConditionalGeneration",
18
+ "AutoModelForSeq2SeqLM": "modeling_ragplm.RAGPLMForConditionalGeneration"
19
+ },
20
+ "bias_dropout_fusion": true,
21
+ "classifier_dropout": null,
22
+ "deepnorm": false,
23
+ "eos_token_id": 34,
24
+ "experts_per_token": 2,
25
+ "ffn_hidden_size": 7680,
26
+ "fp32_residual_connection": false,
27
+ "glu_activation": "swiglu",
28
+ "hidden_dropout": 0.0,
29
+ "hidden_size": 2304,
30
+ "is_causal": false,
31
+ "kv_channels": 64,
32
+ "layernorm_epsilon": 1e-05,
33
+ "lora": false,
34
+ "lora_alpha": 16,
35
+ "lora_before_position": false,
36
+ "lora_dropout": 0,
37
+ "lora_r": 8,
38
+ "mlp_lora": false,
39
+ "model_type": "ragplm",
40
+ "moe": true,
41
+ "multi_query_attention": false,
42
+ "multi_query_group_num": 2,
43
+ "num_attention_heads": 36,
44
+ "num_experts": 8,
45
+ "num_layers": 36,
46
+ "original_rope": true,
47
+ "pad_token_id": 0,
48
+ "padded_vocab_size": 640,
49
+ "post_layer_norm": true,
50
+ "qseq_output_dim": null,
51
+ "quantization_bit": 0,
52
+ "rmsnorm": true,
53
+ "rotary_embedding_2d": true,
54
+ "rotary_freq_base": 10000,
55
+ "seq_length": 2048,
56
+ "str_input_dim": 384,
57
+ "str_output_dim": 512,
58
+ "str_vocab_size": null,
59
+ "tie_word_embeddings": false,
60
+ "torch_dtype": "torch.bfloat16",
61
+ "transformers_version": "4.48.3",
62
+ "use_cache": true,
63
+ "use_pytorch_sdpa": true,
64
+ "vocab_size": 128
65
+ }
configuration_ragplm.py ADDED
@@ -0,0 +1,118 @@
1
+ from transformers import PretrainedConfig
2
+ import torch
3
+
4
+ class RAGPLMConfig(PretrainedConfig):
5
+ model_type = "ragplm"
6
+ def __init__(
7
+ self,
8
+ num_layers=28,
9
+ padded_vocab_size=65024,
10
+ hidden_size=4096,
11
+ ffn_hidden_size=13696,
12
+ kv_channels=128,
13
+ num_attention_heads=32,
14
+
15
+ add_str_emb_ln=False, # Add layer norm to the structure embedding layer
16
+ add_seq_emb_ln=False, # Add layer norm to the sequence embedding layer
17
+ str_vocab_size=None,
18
+ str_input_dim=None,
19
+ str_output_dim=None,
20
+ qseq_output_dim=None,
21
+
22
+ seq_length=2048,
23
+ hidden_dropout=0.0,
24
+ classifier_dropout=None,
25
+ attention_dropout=0.0,
26
+ layernorm_epsilon=1e-5,
27
+ glu_activation='geglu',
28
+ torch_dtype=torch.bfloat16,
29
+ rmsnorm=True,
30
+ deepnorm=True,
31
+ apply_residual_connection_post_layernorm=False,
32
+ post_layer_norm=True,
33
+ add_bias_linear=False,
34
+ add_qkv_bias=False,
35
+ bias_dropout_fusion=True,
36
+ multi_query_attention=False,
37
+ multi_query_group_num=1,
38
+ apply_query_key_layer_scaling=True,
39
+ attention_softmax_in_fp32=True,
40
+ fp32_residual_connection=False,
41
+ quantization_bit=0,
42
+ # pre_seq_len=None,
43
+ # prefix_projection=False,
44
+ rotary_embedding_2d=True,
45
+ rotary_freq_base=10000,
46
+ lora=False,
47
+ mlp_lora=False,
48
+ lora_before_position=False, ### Default the QKV LoRA is after the position encoding
49
+ lora_r=8,
50
+ lora_alpha=16,
51
+ lora_dropout=0,
52
+ use_pytorch_sdpa=True,
53
+ is_causal=True,
54
+ moe=False,
55
+ num_experts=16,
56
+ experts_per_token=2,
57
+ **kwargs
58
+ ):
59
+
60
+ if not deepnorm and apply_residual_connection_post_layernorm:
61
+ print(f"Warning: deepnorm is False and apply_residual_connection_post_layernorm is True")
62
+
63
+ self.num_layers = num_layers
64
+ self.vocab_size = padded_vocab_size
65
+ self.padded_vocab_size = padded_vocab_size
66
+ self.hidden_size = hidden_size
67
+ self.ffn_hidden_size = ffn_hidden_size
68
+ self.kv_channels = kv_channels
69
+ self.num_attention_heads = num_attention_heads
70
+ self.add_str_emb_ln = add_str_emb_ln
71
+ self.add_seq_emb_ln = add_seq_emb_ln
72
+ self.str_vocab_size = str_vocab_size
73
+ self.str_input_dim = str_input_dim
74
+ self.str_output_dim = str_output_dim
75
+ self.qseq_output_dim = qseq_output_dim
76
+ self.seq_length = seq_length
77
+ self.hidden_dropout = hidden_dropout
78
+ self.classifier_dropout = classifier_dropout
79
+ self.attention_dropout = attention_dropout
80
+ self.layernorm_epsilon = layernorm_epsilon
81
+ self.torch_dtype = torch_dtype
82
+ self.glu_activation = glu_activation
83
+ self.rmsnorm = rmsnorm
84
+ self.deepnorm = deepnorm
85
+ self.apply_residual_connection_post_layernorm = apply_residual_connection_post_layernorm
86
+ self.post_layer_norm = post_layer_norm
87
+ self.add_bias_linear = add_bias_linear
88
+ self.add_qkv_bias = add_qkv_bias
89
+ self.bias_dropout_fusion = bias_dropout_fusion
90
+ self.multi_query_attention = multi_query_attention
91
+ self.multi_query_group_num = multi_query_group_num
92
+ self.apply_query_key_layer_scaling = apply_query_key_layer_scaling
93
+ self.attention_softmax_in_fp32 = attention_softmax_in_fp32
94
+ self.fp32_residual_connection = fp32_residual_connection
95
+ self.quantization_bit = quantization_bit
96
+ #self.pre_seq_len = pre_seq_len
97
+ #self.prefix_projection = prefix_projection
98
+ self.rotary_embedding_2d = rotary_embedding_2d
99
+ self.rotary_freq_base = rotary_freq_base
100
+ self.is_causal = is_causal
101
+ self.lora = lora
102
+ self.mlp_lora = mlp_lora
103
+ self.lora_before_position = lora_before_position
104
+ self.lora_r = lora_r
105
+ self.lora_alpha = lora_alpha
106
+ self.lora_dropout = lora_dropout
107
+ self.use_pytorch_sdpa = use_pytorch_sdpa
108
+ self.moe = moe
109
+ self.num_experts = num_experts
110
+ self.experts_per_token = experts_per_token
111
+
112
+ super().__init__(**kwargs)
113
+
114
+ if isinstance(torch_dtype, str):
115
+ if torch_dtype.startswith('torch.'):
116
+ self.torch_dtype = eval(torch_dtype)
117
+ else:
118
+ self.torch_dtype = eval(f"torch.{torch_dtype}")
generation_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "eos_token_id": 34,
4
+ "pad_token_id": 0,
5
+ "transformers_version": "4.48.3"
6
+ }
modeling_ragplm.py ADDED
@@ -0,0 +1,1260 @@
1
+ """ PyTorch AIDO.Protein-DMS-16B model. """
2
+
3
+ import math
4
+ import copy
5
+ import warnings
6
+ import re
7
+ import sys
8
+ import os
9
+ import pathlib
10
+ import time
11
+ import argparse
12
+ import random
13
+ import numpy as np
14
+ from tqdm.auto import tqdm, trange
15
+ from functools import partial
16
+
17
+ import torch, deepspeed
18
+ import torch.utils.checkpoint
19
+ import torch.nn.functional as F
20
+ from torch import nn
21
+ from torch.nn import CrossEntropyLoss, LayerNorm, MSELoss, BCEWithLogitsLoss
22
+ from torch.nn.utils import skip_init
23
+ from typing import Optional, Tuple, Union, List, Callable, Dict, Any
24
+ from copy import deepcopy
25
+ from collections import namedtuple
26
+
27
+ from transformers.modeling_outputs import (
28
+ BaseModelOutputWithPast,
29
+ CausalLMOutputWithPast,
30
+ SequenceClassifierOutputWithPast,
31
+ )
32
+ from transformers.modeling_utils import PreTrainedModel
33
+ from transformers.utils import logging
34
+ from transformers.generation.logits_process import LogitsProcessor
35
+ from transformers.generation.utils import LogitsProcessorList, StoppingCriteriaList, GenerationConfig, ModelOutput
36
+
37
+ from .configuration_ragplm import RAGPLMConfig
38
+
39
+ def get_checkpoint_fn():
40
+ if deepspeed.checkpointing.is_configured():
41
+ # checkpoint = deepspeed.checkpointing.non_reentrant_checkpoint
42
+ checkpoint = deepspeed.checkpointing.checkpoint
43
+ else:
44
+ checkpoint = partial(torch.utils.checkpoint.checkpoint, use_reentrant=False)
45
+ # checkpoint = partial(torch.utils.checkpoint.checkpoint, use_reentrant=True)
46
+ return checkpoint
47
+
48
+ # flags required to enable jit fusion kernels
49
+
50
+ if sys.platform != 'darwin':
51
+ torch._C._jit_set_profiling_mode(False)
52
+ torch._C._jit_set_profiling_executor(False)
53
+ torch._C._jit_override_can_fuse_on_cpu(True)
54
+ torch._C._jit_override_can_fuse_on_gpu(True)
55
+
56
+ logger = logging.get_logger(__name__)
57
+
58
+ _CHECKPOINT_FOR_DOC = "Protein/Protein_RAGPLM"
59
+ _CONFIG_FOR_DOC = "RAGPLMConfig"
60
+
61
+
62
+ def default_init(cls, *args, **kwargs):
63
+ return cls(*args, **kwargs)
64
+
65
+ DeepNormCoefficients = namedtuple("DeepNormCoefficients", ["alpha", "beta"])
66
+
67
+ def get_deepnorm_coefficients(config: RAGPLMConfig):
68
+ """
69
+ DeepNorm coefficients from : https://kexue.fm/archives/8978
70
+ """
71
+ num_layers = config.num_layers
72
+ return DeepNormCoefficients(alpha=(2 * num_layers) ** 0.5, beta=(2 * num_layers) ** -0.5)
73
+
74
+
75
+ class InvalidScoreLogitsProcessor(LogitsProcessor):
76
+ def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
77
+ if torch.isnan(scores).any() or torch.isinf(scores).any():
78
+ scores.zero_()
79
+ scores[..., 5] = 5e4
80
+ return scores
81
+
82
+ def split_tensor_along_last_dim(
83
+ tensor: torch.Tensor,
84
+ num_partitions: int,
85
+ contiguous_split_chunks: bool = False,
86
+ ) -> List[torch.Tensor]:
87
+ """Split a tensor along its last dimension.
88
+
89
+ Arguments:
90
+ tensor: input tensor.
91
+ num_partitions: number of partitions to split the tensor
92
+ contiguous_split_chunks: If True, make each chunk contiguous
93
+ in memory.
94
+
95
+ Returns:
96
+ A list of Tensors
97
+ """
98
+ # Get the size and dimension.
99
+ last_dim = tensor.dim() - 1
100
+ last_dim_size = tensor.size()[last_dim] // num_partitions
101
+ # Split.
102
+ tensor_list = torch.split(tensor, last_dim_size, dim=last_dim)
103
+ # Note: torch.split does not create contiguous tensors by default.
104
+ if contiguous_split_chunks:
105
+ return tuple(chunk.contiguous() for chunk in tensor_list)
106
+
107
+ return tensor_list
108
+
109
+ class RotaryEmbedding(torch.nn.Module):
110
+
111
+ def __init__(self, dim, base=10000, precision=torch.half, learnable=False):
112
+ super().__init__()
113
+ inv_freq = 1. / (base ** (torch.arange(0, dim, 2).float() / dim)).to(precision)
114
+ self.dim = dim
115
+ self.base = base
116
+ self.learnable = learnable
117
+ if learnable:
118
+ self.inv_freq = torch.nn.Parameter(inv_freq)
119
+ self.max_seq_len_cached = None
120
+ else:
121
+ self.register_buffer('inv_freq', inv_freq)
122
+ self.max_seq_len_cached = None
123
+ self.cos_cached = None
124
+ self.sin_cached = None
125
+ self.precision = precision
126
+
127
+ def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs):
128
+ # import pdb; pdb.set_trace();
129
+ if f'{prefix}inv_freq' in state_dict:
130
+ super()._load_from_state_dict(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)
131
+ else:
132
+ self.inv_freq.copy_(1. / (self.base ** (torch.arange(0, self.dim, 2).float() / self.dim)).to(self.precision))
133
+
134
+ def forward(self, x, seq_dim=1, seq_len=None):
135
+
136
+ # self.inv_freq = 1. / (self.base ** (torch.arange(0, self.dim, 2).float() / self.dim)).to(x.device)
137
+ if seq_len is None:
138
+ seq_len = x.shape[seq_dim]
139
+ if self.max_seq_len_cached is None or (seq_len > self.max_seq_len_cached):
140
+ self.max_seq_len_cached = None if self.learnable else seq_len
141
+ t = torch.arange(seq_len, device=x.device, dtype=torch.float32)
142
+ # import pdb; pdb.set_trace();
143
+ freqs = torch.einsum('i,j->ij', t, self.inv_freq.to(x.device))
144
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
145
+ emb = torch.cat((freqs, freqs), dim=-1).to(x.device)
146
+ if self.precision == torch.bfloat16 or self.precision == torch.half:
147
+ emb = emb.float()
148
+ # [sx, 1 (b * np), hn]
149
+ cos_cached = emb.cos()[:, None, :]
150
+ sin_cached = emb.sin()[:, None, :]
151
+ if self.precision == torch.bfloat16:
152
+ cos_cached = cos_cached.bfloat16()
153
+ sin_cached = sin_cached.bfloat16()
154
+ elif self.precision == torch.half:
155
+ cos_cached = cos_cached.half()
156
+ sin_cached = sin_cached.half()
157
+ if self.learnable:
158
+ return cos_cached, sin_cached
159
+ self.cos_cached, self.sin_cached = cos_cached, sin_cached
160
+ return self.cos_cached[:seq_len, ...], self.sin_cached[:seq_len, ...]
161
+
162
+ def rotate_half(x):
163
+ x1, x2 = x[..., :x.shape[-1] // 2], x[..., x.shape[-1] // 2:]
164
+ return torch.cat((-x2, x1), dim=x1.ndim - 1) # dim=-1 triggers a bug in earlier torch versions
165
+
166
+ def assert_dim_check(tensor, ndim=None, shape=None):
167
+ if ndim is not None:
168
+ assert tensor.ndim == ndim, f"Exepct tensor.ndim={ndim}. gut got tensor.shape={tensor.shape}"
169
+ if shape is not None:
170
+ assert list(tensor.shape) == list(shape), f"Exepct tensor.shape={shape}. gut got tensor.shape={tensor.shape}"
171
+
172
+ def apply_rotary_pos_emb_index_torch(q, k, cos, sin, position_id): # jitting fails with bf16
173
+ # position_id: [sq, b], q, k: [sq, b, np, hn], cos: [sq, 1, hn] -> [sq, b, 1, hn]
174
+ cos, sin = F.embedding(position_id, cos.squeeze(1)).unsqueeze(2), \
175
+ F.embedding(position_id, sin.squeeze(1)).unsqueeze(2)
176
+ q, k = (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)
177
+ return q, k
178
+
179
+ try:
180
+ # raise 'Errror'
181
+ from apex.normalization import MixedFusedRMSNorm
182
+ from apex.normalization import FusedLayerNorm
183
+ print(f"{__file__}: Use apex.normalization.MixedFusedRMSNorm as RMSNorm")
184
+
185
+ class RMSNorm(MixedFusedRMSNorm):
186
+ def __init__(self, normalized_shape, eps=1e-5, elementwise_affine=True, memory_efficient=False):
187
+ super(RMSNorm, self).__init__(normalized_shape=normalized_shape, eps=eps, elementwise_affine=elementwise_affine, memory_efficient=memory_efficient)
188
+
189
+ def forward(self, input):
190
+ dtype = input.dtype
191
+ with torch.autocast('cuda', enabled=True, dtype=torch.float32, cache_enabled=None):
192
+ output = super().forward(input)
193
+ return output.to(dtype)
194
+
195
+ class LayerNorm(FusedLayerNorm):
196
+ def __init__(self, normalized_shape, eps=1e-5, elementwise_affine=True, memory_efficient=False):
197
+ super(LayerNorm, self).__init__(normalized_shape=normalized_shape, eps=eps, elementwise_affine=elementwise_affine, memory_efficient=memory_efficient)
198
+
199
+ def forward(self, input):
200
+ dtype = input.dtype
201
+ with torch.autocast('cuda', enabled=True, dtype=torch.float32, cache_enabled=None):
202
+ output = super().forward(input)
203
+ return output.to(dtype)
204
+
205
+ except:
206
+ class RMSNorm(torch.nn.Module):
207
+ def __init__(self, normalized_shape, eps=1e-5, device=None, dtype=None, **kwargs):
208
+ super().__init__()
209
+ self.weight = torch.nn.Parameter(torch.empty(normalized_shape, device=device, dtype=dtype))
210
+ self.eps = eps
211
+ @torch.jit.export
212
+ def forward(self, hidden_states: torch.Tensor):
213
+ input_dtype = hidden_states.dtype
214
+ variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
215
+ hidden_states = hidden_states * torch.rsqrt(variance + self.eps)
216
+ return (self.weight * hidden_states).to(input_dtype)
217
+ print(f"{__file__}: Use custom RMSNorm")
218
+
219
+ class CoreAttention(torch.nn.Module):
220
+ def __init__(self, config: RAGPLMConfig, layer_number):
221
+ super(CoreAttention, self).__init__()
222
+
223
+ self.apply_query_key_layer_scaling = config.apply_query_key_layer_scaling
224
+ self.attention_softmax_in_fp32 = config.attention_softmax_in_fp32
225
+ if self.apply_query_key_layer_scaling:
226
+ self.attention_softmax_in_fp32 = True
227
+ self.layer_number = max(1, layer_number)
228
+
229
+ projection_size = config.kv_channels * config.num_attention_heads
230
+
231
+ # Per attention head and per partition values.
232
+ self.hidden_size_per_partition = projection_size
233
+ self.hidden_size_per_attention_head = projection_size // config.num_attention_heads
234
+ self.num_attention_heads_per_partition = config.num_attention_heads
235
+
236
+ coeff = None
237
+ self.norm_factor = math.sqrt(self.hidden_size_per_attention_head)
238
+ if self.apply_query_key_layer_scaling:
239
+ coeff = self.layer_number
240
+ self.norm_factor *= coeff
241
+ self.coeff = coeff
242
+
243
+ self.attention_dropout = torch.nn.Dropout(config.attention_dropout)
244
+
245
+ self.is_causal = config.is_causal
246
+ self.use_pytorch_sdpa = config.use_pytorch_sdpa
247
+
248
+ def forward(self, query_layer, key_layer, value_layer, attention_mask):
249
+ # query_layer, key_layer, value_layer: [seq_len, batch_size, num_heads, head_dim]
250
+ # import pdb; pdb.set_trace();
251
+ pytorch_major_version = int(torch.__version__.split('.')[0])
252
+ # assert pytorch_major_version >= 2, f"Expect PyTorch version > 2.0"
253
+ if pytorch_major_version >= 2 and self.use_pytorch_sdpa:
254
+ dropout_p = self.attention_dropout.p if self.training else 0
255
+ # [seq_len, batch_size, num_heads, head_dim] -> [batch_size, num_heads, seq_len, head_dim]
256
+ query_layer, key_layer, value_layer = [k.permute(1, 2, 0, 3) for k in [query_layer, key_layer, value_layer]]
257
+ # import pdb; pdb.set_trace();
258
+ if attention_mask is None and query_layer.shape[2] == key_layer.shape[2]:
259
+ # context_layer: [batch_size, num_heads, seq_len, head_dim]
260
+ context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer, is_causal=self.is_causal, dropout_p=dropout_p)
261
+ #print(f"torch.nn.functional.scaled_dot_product_attention")
262
+ else:
263
+ if (attention_mask is not None) and (attention_mask.dtype == torch.bool):
264
+ attention_mask = attention_mask.logical_not() ## DO NOT inplace operation!!!!
265
+ #print(f"attention_mask.shape={attention_mask.shape}, attention_mask={attention_mask}")
266
+ else:
267
+ pass
268
+ # print(f"query_layer.shape={query_layer.shape}, key_layer.shape={key_layer.shape}, attention_mask={attention_mask}")
269
+ context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer, attention_mask, dropout_p=dropout_p)
270
+ # [batch_size, num_heads, seq_len, head_dim] -> [seq_len, batch_size, num_heads, head_dim]
271
+ context_layer = context_layer.permute(2, 0, 1, 3)
272
+ # [seq_len, batch_size, 2560]
273
+ new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_partition,)
274
+ context_layer = context_layer.reshape(*new_context_layer_shape)
275
+ else:
276
+ # Raw attention scores
277
+
278
+ # [b, np, sq, sk]
279
+ output_size = (query_layer.size(1), query_layer.size(2), query_layer.size(0), key_layer.size(0))
280
+
281
+ # [sq, b, np, hn] -> [sq, b * np, hn]
282
+ query_layer = query_layer.view(output_size[2], output_size[0] * output_size[1], -1)
283
+ # [sk, b, np, hn] -> [sk, b * np, hn]
284
+ key_layer = key_layer.view(output_size[3], output_size[0] * output_size[1], -1)
285
+
286
+ # preallocting input tensor: [b * np, sq, sk]
287
+ matmul_input_buffer = torch.empty(
288
+ output_size[0] * output_size[1], output_size[2], output_size[3], dtype=query_layer.dtype,
289
+ device=query_layer.device
290
+ )
291
+
292
+ # Raw attention scores. [b * np, sq, sk]
293
+ matmul_result = torch.baddbmm(
294
+ matmul_input_buffer,
295
+ query_layer.transpose(0, 1), # [b * np, sq, hn]
296
+ key_layer.transpose(0, 1).transpose(1, 2), # [b * np, hn, sk]
297
+ beta=0.0,
298
+ alpha=(1.0 / self.norm_factor),
299
+ )
300
+
301
+ # change view to [b, np, sq, sk]
302
+ attention_scores = matmul_result.view(*output_size)
303
+
304
+ # ===========================
305
+ # Attention probs and dropout
306
+ # ===========================
307
+
308
+ # attention scores and attention mask [b, np, sq, sk]
309
+ if self.attention_softmax_in_fp32:
310
+ attention_scores = attention_scores.float()
311
+ if self.coeff is not None:
312
+ attention_scores = attention_scores * self.coeff
313
+ if self.is_causal and attention_mask is None and attention_scores.shape[2] == attention_scores.shape[3]:
314
+ attention_mask = torch.ones(output_size[0], 1, output_size[2], output_size[3],
315
+ device=attention_scores.device, dtype=torch.bool)
316
+ attention_mask.tril_()
317
+ attention_mask = ~attention_mask
318
+ if attention_mask is not None:
319
+ attention_scores = attention_scores.masked_fill(attention_mask, float("-inf"))
320
+ attention_probs = F.softmax(attention_scores, dim=-1)
321
+ attention_probs = attention_probs.type_as(value_layer)
322
+
323
+ # This is actually dropping out entire tokens to attend to, which might
324
+ # seem a bit unusual, but is taken from the original Transformer paper.
325
+ attention_probs = self.attention_dropout(attention_probs)
326
+ # =========================
327
+ # Context layer. [sq, b, hp]
328
+ # =========================
329
+
330
+ # value_layer -> context layer.
331
+ # [sk, b, np, hn] --> [b, np, sq, hn]
332
+
333
+ # context layer shape: [b, np, sq, hn]
334
+ output_size = (value_layer.size(1), value_layer.size(2), query_layer.size(0), value_layer.size(3))
335
+ # change view [sk, b * np, hn]
336
+ value_layer = value_layer.view(value_layer.size(0), output_size[0] * output_size[1], -1)
337
+ # change view [b * np, sq, sk]
338
+ attention_probs = attention_probs.view(output_size[0] * output_size[1], output_size[2], -1)
339
+ # matmul: [b * np, sq, hn]
340
+ context_layer = torch.bmm(attention_probs, value_layer.transpose(0, 1))
341
+ # change view [b, np, sq, hn]
342
+ context_layer = context_layer.view(*output_size)
343
+ # [b, np, sq, hn] --> [sq, b, np, hn]
344
+ context_layer = context_layer.permute(2, 0, 1, 3).contiguous()
345
+ # [sq, b, np, hn] --> [sq, b, hp]
346
+ new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_partition,)
347
+ context_layer = context_layer.view(*new_context_layer_shape)
348
+
349
+ return context_layer
350
+
351
+
352
+ class SelfAttention(torch.nn.Module):
353
+ """Parallel self-attention layer abstract class.
354
+
355
+ Self-attention layer takes input with size [s, b, h]
356
+ and returns output of the same size.
357
+ """
358
+
359
+ def __init__(self, config: RAGPLMConfig, layer_number, device=None):
360
+ super(SelfAttention, self).__init__()
361
+ self.layer_number = max(1, layer_number)
362
+
363
+ self.projection_size = config.kv_channels * config.num_attention_heads
364
+
365
+ # Per attention head and per partition values.
366
+ self.hidden_size_per_attention_head = self.projection_size // config.num_attention_heads
367
+ self.num_attention_heads_per_partition = config.num_attention_heads
368
+
369
+ self.multi_query_attention = config.multi_query_attention
370
+ self.qkv_hidden_size = 3 * self.projection_size
371
+ if self.multi_query_attention:
372
+ self.num_multi_query_groups_per_partition = config.multi_query_group_num
373
+ self.qkv_hidden_size = (
374
+ self.projection_size + 2 * self.hidden_size_per_attention_head * config.multi_query_group_num
375
+ )
376
+ self.query_key_value = nn.Linear(config.hidden_size, self.qkv_hidden_size,
377
+ bias=config.add_bias_linear or config.add_qkv_bias,
378
+ device=device, **_config_to_kwargs(config)
379
+ )
380
+
381
+ self.core_attention = CoreAttention(config, self.layer_number)
382
+
383
+ # Output.
384
+ self.dense = nn.Linear(self.projection_size, config.hidden_size, bias=config.add_bias_linear, device=device, **_config_to_kwargs(config))
385
+
386
+ self.rotary_embedding_2d = config.rotary_embedding_2d
387
+ # dim, base=10000, precision=torch.half, learnable=False
388
+ self.rotary_emb = RotaryEmbedding(self.hidden_size_per_attention_head // 2 if self.rotary_embedding_2d else self.hidden_size_per_attention_head,
389
+ base=config.rotary_freq_base, precision=config.torch_dtype, learnable=False)
390
+
391
+ ##### LoRA
392
+ self.lora = config.lora
393
+ if config.lora:
394
+ self.lora_linear = torch.nn.ModuleDict()
395
+ self.lora_dropout = torch.nn.Dropout(config.lora_dropout)
396
+ self.lora_alpha = config.lora_alpha
397
+ self.lora_r = config.lora_r
398
+ self.lora_before_position = config.lora_before_position
399
+ for name in ('Q', 'K', 'V', 'O'):
400
+ self.lora_linear[f'{name}_A'] = torch.nn.Linear(config.hidden_size, config.lora_r, bias=False)
401
+ self.lora_linear[f'{name}_B'] = torch.nn.Linear(config.lora_r, config.hidden_size, bias=False)
402
+ torch.nn.init.kaiming_uniform_(self.lora_linear[f"{name}_A"].weight, a=math.sqrt(5))
403
+ torch.nn.init.zeros_(self.lora_linear[f'{name}_B'].weight)
404
+
405
+ def forward(
406
+ self, hidden_states, attention_mask, position_ids, kv_cache=None, use_cache=True
407
+ ):
408
+
409
+ # =================================================
410
+ # Pre-allocate memory for key-values for inference.
411
+ # =================================================
412
+ # =====================
413
+ # Query, Key, and Value
414
+ # =====================
415
+
416
+ # Attention heads [sq, b, h] --> [sq, b, (np * 3 * hn)]
417
+ mixed_x_layer = self.query_key_value(hidden_states) # [12800, 1, 6912]
418
+
419
+ if self.multi_query_attention:
420
+ (query_layer, key_layer, value_layer) = mixed_x_layer.split(
421
+ [
422
+ self.num_attention_heads_per_partition * self.hidden_size_per_attention_head,
423
+ self.num_multi_query_groups_per_partition * self.hidden_size_per_attention_head,
424
+ self.num_multi_query_groups_per_partition * self.hidden_size_per_attention_head,
425
+ ],
426
+ dim=-1,
427
+ )
428
+ query_layer = query_layer.view(
429
+ query_layer.size()[:-1] + (self.num_attention_heads_per_partition, self.hidden_size_per_attention_head)
430
+ )
431
+ key_layer = key_layer.view(
432
+ key_layer.size()[:-1] + (self.num_multi_query_groups_per_partition, self.hidden_size_per_attention_head)
433
+ )
434
+ value_layer = value_layer.view(
435
+ value_layer.size()[:-1]
436
+ + (self.num_multi_query_groups_per_partition, self.hidden_size_per_attention_head)
437
+ )
438
+ else:
439
+ new_tensor_shape = mixed_x_layer.size()[:-1] + (self.num_attention_heads_per_partition, 3 * self.hidden_size_per_attention_head) # [12800, 1, 36, 192]
440
+ mixed_x_layer = mixed_x_layer.view(*new_tensor_shape) # [12800, 1, 36, 192]
441
+ # [sq, b, np, 3 * hn] --> 3 [sq, b, np, hn]
442
+ (query_layer, key_layer, value_layer) = split_tensor_along_last_dim(mixed_x_layer, 3)
443
+
444
+ if self.lora and self.lora_before_position:
445
+ scaling = self.lora_alpha / self.lora_r
446
+ query_layer = query_layer + ( self.lora_linear['Q_B'](self.lora_linear['Q_A'](self.lora_dropout(hidden_states))) * scaling ).reshape(query_layer.shape)
447
+ key_layer = key_layer + ( self.lora_linear['K_B'](self.lora_linear['K_A'](self.lora_dropout(hidden_states))) * scaling ).reshape(key_layer.shape)
448
+ value_layer = value_layer + ( self.lora_linear['V_B'](self.lora_linear['V_A'](self.lora_dropout(hidden_states))) * scaling ).reshape(value_layer.shape)
449
+
450
+ # apply relative positional encoding (rotary embedding)
451
+ if position_ids is not None: # [seq_len, 2, batch_size, 32, 2]
452
+
453
+
454
+ if self.rotary_embedding_2d:
455
+ q1, q2 = query_layer.chunk(2, dim=(query_layer.ndim - 1)) # 32
456
+ k1, k2 = key_layer.chunk(2, dim=(key_layer.ndim - 1))
457
+ # import pdb; pdb.set_trace();
458
+ cos, sin = self.rotary_emb(q1, seq_len=position_ids.max() + 1) # 32
459
+ position_ids, block_position_ids = \
460
+ position_ids[:, 0, :].transpose(0, 1).contiguous(), \
461
+ position_ids[:, 1, :].transpose(0, 1).contiguous()
462
+ q1, k1 = apply_rotary_pos_emb_index_torch(q1, k1, cos, sin, position_ids)
463
+ q2, k2 = apply_rotary_pos_emb_index_torch(q2, k2, cos, sin, block_position_ids)
464
+ query_layer = torch.concat([q1, q2], dim=(q1.ndim - 1))
465
+ key_layer = torch.concat([k1, k2], dim=(k1.ndim - 1))
466
+ else:
467
+ # [b, sq] -> [sq, b]
468
+ position_ids = position_ids.transpose(0, 1)
469
+ cos, sin = self.rotary_emb(value_layer, seq_len=position_ids.max() + 1)
470
+ query_layer, key_layer = apply_rotary_pos_emb_index_torch(query_layer, key_layer, cos, sin, position_ids)
471
+
472
+
473
+ if self.lora and not self.lora_before_position:
474
+ # query_layer = query_layer + lora_layer["Q_B"](lora_layer["Q_A"](self.lora_dropout(hidden_states)))* self.scaling
475
+ scaling = self.lora_alpha / self.lora_r
476
+ query_layer = query_layer + ( self.lora_linear['Q_B'](self.lora_linear['Q_A'](self.lora_dropout(hidden_states))) * scaling ).reshape(query_layer.shape)
477
+ key_layer = key_layer + ( self.lora_linear['K_B'](self.lora_linear['K_A'](self.lora_dropout(hidden_states))) * scaling ).reshape(key_layer.shape)
478
+ value_layer = value_layer + ( self.lora_linear['V_B'](self.lora_linear['V_A'](self.lora_dropout(hidden_states))) * scaling ).reshape(value_layer.shape)
479
+
480
+ # adjust key and value for inference
481
+ if kv_cache is not None:
482
+ cache_k, cache_v = kv_cache
483
+ key_layer = torch.cat((cache_k, key_layer), dim=0)
484
+ value_layer = torch.cat((cache_v, value_layer), dim=0)
485
+ if use_cache:
486
+ kv_cache = (key_layer, value_layer)
487
+ else:
488
+ kv_cache = None
489
+
490
+ if self.multi_query_attention:
491
+ key_layer = key_layer.unsqueeze(-2)
492
+ key_layer = key_layer.expand(-1, -1, -1, self.num_attention_heads_per_partition // self.num_multi_query_groups_per_partition, -1)
493
+ key_layer = key_layer.contiguous().view(key_layer.size()[:2] + (self.num_attention_heads_per_partition, self.hidden_size_per_attention_head))
494
+ value_layer = value_layer.unsqueeze(-2)
495
+ value_layer = value_layer.expand(-1, -1, -1, self.num_attention_heads_per_partition // self.num_multi_query_groups_per_partition, -1)
496
+ value_layer = value_layer.contiguous().view(value_layer.size()[:2] + (self.num_attention_heads_per_partition, self.hidden_size_per_attention_head))
497
+
498
+ # ==================================
499
+ # core attention computation
500
+ # ==================================
501
+
502
+ context_layer = self.core_attention(query_layer, key_layer, value_layer, attention_mask) # context_layer: [seq_len, batch_size, num_heads*head_dim]
503
+ output = self.dense(context_layer)
504
+ if self.lora:
505
+ scaling = self.lora_alpha / self.lora_r
506
+ output = output + self.lora_linear['O_B'](self.lora_linear['O_A'](self.lora_dropout(context_layer))) * scaling
507
+
508
+ # =================
509
+ # Output. [sq, b, h]
510
+ # =================
511
+
512
+ # output = context_layer @ self.dense.weight.T + self.dense.bias
513
+ return output, kv_cache
514
+
515
+
516
+ def _config_to_kwargs(args):
517
+ common_kwargs = {
518
+ "dtype": args.torch_dtype,
519
+ }
520
+ return common_kwargs
521
+
522
+
523
+ class MLP(torch.nn.Module):
524
+ """MLP.
525
+
526
+ MLP will take the input with h hidden state, project it to 4*h
527
+ hidden dimension, perform nonlinear transformation, and project the
528
+ state back into h hidden dimension.
529
+ """
530
+
531
+ def __init__(self, config: RAGPLMConfig, device=None):
532
+ super(MLP, self).__init__()
533
+
534
+ self.add_bias = config.add_bias_linear
535
+ self.moe = config.moe
536
+ self.mlp_lora = config.mlp_lora
537
+ self.num_experts = config.num_experts
538
+ self.experts_per_token = config.experts_per_token # 2
539
+
540
+ if self.moe is True and self.mlp_lora is True:
541
+ raise NotImplementedError(f"moe and mlp_lora are both enabled")
542
+
543
+ # Project to 4h. If using swiglu double the output width, see https://arxiv.org/pdf/2002.05202.pdf
544
+ self.dense_h_to_4h = nn.Linear(
545
+ config.hidden_size,
546
+ config.ffn_hidden_size * 2,
547
+ bias=self.add_bias,
548
+ device=device,
549
+ **_config_to_kwargs(config)
550
+ )
551
+
552
+ def swiglu(x):
553
+ x = torch.chunk(x, 2, dim=-1)
554
+ return x[0] * F.silu(x[1])
555
+
556
+ def geglu(x):
557
+ x = torch.chunk(x, 2, dim=-1)
558
+ return x[0] * F.gelu(x[1])
559
+
560
+ if config.glu_activation == 'geglu':
561
+ self.activation_func = geglu
562
+ elif config.glu_activation == 'swiglu':
563
+ self.activation_func = swiglu
564
+ else:
565
+ raise RuntimeError(f"Unsupported glu_activation: {config.glu_activation}")
566
+
567
+ # Project back to h.
568
+ self.dense_4h_to_h = nn.Linear(
569
+ config.ffn_hidden_size,
570
+ config.hidden_size,
571
+ bias=self.add_bias,
572
+ device=device,
573
+ **_config_to_kwargs(config)
574
+ )
575
+
576
+ if self.moe:
577
+ assert self.num_experts > 1
578
+ del self.dense_h_to_4h
579
+ del self.dense_4h_to_h
580
+ self.router = nn.Linear(
581
+ config.hidden_size,
582
+ config.num_experts,
583
+ bias=False,
584
+ device=device,
585
+ dtype=torch.float32
586
+ )
587
+ for i in range(0, self.num_experts):
588
+ self.register_module(f"dense_h_to_4h_{i}", nn.Linear(
589
+ config.hidden_size,
590
+ config.ffn_hidden_size * 2,
591
+ bias=self.add_bias,
592
+ device=device,
593
+ **_config_to_kwargs(config)
594
+ ))
595
+ self.register_module(f"dense_4h_to_h_{i}", nn.Linear(
596
+ config.ffn_hidden_size,
597
+ config.hidden_size,
598
+ bias=self.add_bias,
599
+ device=device,
600
+ **_config_to_kwargs(config)
601
+ ))
602
+
603
+ if self.mlp_lora:
604
+ self.lora_linear = torch.nn.ModuleDict()
605
+ self.lora_dropout = torch.nn.Dropout(config.lora_dropout)
606
+ self.lora_alpha = config.lora_alpha
607
+ self.lora_r = config.lora_r
608
+ for name in ('dense_h_to_4h', 'dense_4h_to_h'):
609
+ if name == 'dense_h_to_4h':
610
+ self.lora_linear[f'{name}_A'] = torch.nn.Linear(config.hidden_size, config.lora_r, bias=False)
611
+ self.lora_linear[f'{name}_B'] = torch.nn.Linear(config.lora_r, config.ffn_hidden_size * 2, bias=False)
612
+ elif name == 'dense_4h_to_h':
613
+ self.lora_linear[f'{name}_A'] = torch.nn.Linear(config.ffn_hidden_size, config.lora_r, bias=False)
614
+ self.lora_linear[f'{name}_B'] = torch.nn.Linear(config.lora_r, config.hidden_size, bias=False)
615
+ torch.nn.init.kaiming_uniform_(self.lora_linear[f"{name}_A"].weight, a=math.sqrt(5))
616
+ torch.nn.init.zeros_(self.lora_linear[f'{name}_B'].weight)
617
+
618
+ def moe_forward(self, hidden_states, expert_idx):
619
+ # hidden_states: torch.Size([503, 1920])
620
+ # import pdb; pdb.set_trace();
621
+ intermediate_parallel = getattr(self, f"dense_h_to_4h_{expert_idx}")(hidden_states) # torch.Size([503, 20480])
622
+ intermediate_parallel = self.activation_func(intermediate_parallel) # torch.Size([503, 10240])
623
+ output = getattr(self, f"dense_4h_to_h_{expert_idx}")(intermediate_parallel) # torch.Size([503, 1920])
624
+ return output
625
+
626
+ def forward(self, hidden_states):
627
+ if self.moe:
628
+ # import pdb; pdb.set_trace();
629
+ s, b, n = hidden_states.shape
630
+ dtype = hidden_states.dtype
631
+ hidden_states = hidden_states.view(-1, hidden_states.size(2)) # [s*b h]
632
+
633
+ route = self.router(hidden_states).to(dtype)
634
+
635
+ weights, selected_experts = torch.topk(route, self.experts_per_token)
636
+ weights = F.softmax(weights, dim=1, dtype=torch.float).to(hidden_states.dtype)
637
+ output = torch.zeros_like(hidden_states, dtype=hidden_states.dtype, device=hidden_states.device)
638
+ for expert_idx in range(self.num_experts):
639
+ batch_idx, nth_expert = torch.where(selected_experts == expert_idx)
640
+ if nth_expert.shape[0] == 0:
641
+ continue
642
+ cur_out = self.moe_forward(hidden_states[batch_idx], expert_idx)
643
+ output[batch_idx] += weights[batch_idx, nth_expert, None] * cur_out
644
+ output = output.reshape(s, b, n)
645
+ else:
646
+ # [s, b, 4hp]
647
+ #intermediate_parallel = hidden_states @ self.dense_h_to_4h.weight.T + self.dense_h_to_4h.bias
648
+ intermediate_parallel = self.dense_h_to_4h(hidden_states)
649
+ if self.mlp_lora:
650
+ scaling = self.lora_alpha / self.lora_r
651
+ intermediate_parallel = intermediate_parallel + ( self.lora_linear['dense_h_to_4h_B'](self.lora_linear['dense_h_to_4h_A'](self.lora_dropout(hidden_states))) * scaling )
652
+
653
+ intermediate_parallel = self.activation_func(intermediate_parallel)
654
+ # [s, b, h]
655
+ output = self.dense_4h_to_h(intermediate_parallel)
656
+ if self.mlp_lora:
657
+ output = output + ( self.lora_linear['dense_4h_to_h_B'](self.lora_linear['dense_4h_to_h_A'](self.lora_dropout(intermediate_parallel))) * scaling )# .reshape(output.shape)
658
+
659
+ #output = intermediate_parallel @ self.dense_4h_to_h.weight.T + self.dense_4h_to_h.bias # self.dense_4h_to_h(intermediate_parallel)
660
+ return output
661
+
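# Illustrative sketch, not part of the original file: the top-k expert routing
# that MLP.forward performs when config.moe is enabled, reduced to plain
# tensors. The experts are stand-in nn.Linear layers and all sizes are toy
# values.
import torch
import torch.nn.functional as F
from torch import nn

tokens, hidden, num_experts, experts_per_token = 6, 4, 4, 2
x = torch.randn(tokens, hidden)
router = nn.Linear(hidden, num_experts, bias=False)
experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_experts))

logits = router(x)                                           # [tokens, num_experts]
weights, selected = torch.topk(logits, experts_per_token)    # best experts_per_token experts per token
weights = F.softmax(weights, dim=1)                          # renormalize over the selected experts only
out = torch.zeros_like(x)
for expert_idx in range(num_experts):
    token_idx, nth = torch.where(selected == expert_idx)     # tokens routed here, and at which rank
    if token_idx.numel() == 0:
        continue
    out[token_idx] += weights[token_idx, nth, None] * experts[expert_idx](x[token_idx])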
662
+ class RAGPLMBlock(torch.nn.Module):
663
+ """A single transformer layer.
664
+
665
+ Transformer layer takes input with size [s, b, h] and returns an
666
+ output of the same size.
667
+ """
668
+
669
+ def __init__(self, config: RAGPLMConfig, layer_number, device=None):
670
+ super(RAGPLMBlock, self).__init__()
671
+ self.layer_number = layer_number
672
+
673
+ self.apply_residual_connection_post_layernorm = config.apply_residual_connection_post_layernorm
674
+
675
+ self.fp32_residual_connection = config.fp32_residual_connection
676
+
677
+ LayerNormFunc = RMSNorm if config.rmsnorm else LayerNorm
678
+ # Layernorm on the input data.
679
+ self.input_layernorm = LayerNormFunc(config.hidden_size, eps=config.layernorm_epsilon)
680
+
681
+ # Self attention.
682
+ self.self_attention = SelfAttention(config, layer_number, device=device)
683
+ self.hidden_dropout = config.hidden_dropout
684
+
685
+ # Layernorm on the attention output
686
+ self.post_attention_layernorm = LayerNormFunc(config.hidden_size, eps=config.layernorm_epsilon)
687
+
688
+ # MLP
689
+ self.mlp = MLP(config, device=device)
690
+
691
+ self.deepnorm_coeff = get_deepnorm_coefficients(config) if config.deepnorm else None
692
+
693
+ def forward(
694
+ self, hidden_states, attention_mask, position_ids, kv_cache=None, use_cache=True,
695
+ ):
696
+ # hidden_states: [s, b, h]
697
+
698
+ layernorm_output = self.input_layernorm(hidden_states)
699
+ # Self attention.
700
+
701
+ attention_output, kv_cache = self.self_attention(
702
+ layernorm_output,
703
+ attention_mask,
704
+ position_ids, # [batch_size, 2, seq_len, 32, 2]
705
+ kv_cache=kv_cache,
706
+ use_cache=use_cache
707
+ )
708
+
709
+ # Residual connection.
710
+ if self.apply_residual_connection_post_layernorm:
711
+ residual = layernorm_output
712
+ else:
713
+ residual = hidden_states
714
+
715
+ layernorm_input = torch.nn.functional.dropout(attention_output, p=self.hidden_dropout, training=self.training)
716
+ if self.deepnorm_coeff is not None:
717
+ layernorm_input = residual*self.deepnorm_coeff.alpha + layernorm_input
718
+ else:
719
+ layernorm_input = residual + layernorm_input
720
+
721
+ # Layer norm post the self attention.
722
+ layernorm_output = self.post_attention_layernorm(layernorm_input)
723
+
724
+
725
+ # MLP.
726
+ mlp_output = self.mlp(layernorm_output)
727
+
728
+ # Second residual connection.
729
+ if self.apply_residual_connection_post_layernorm:
730
+ residual = layernorm_output
731
+ else:
732
+ residual = layernorm_input
733
+
734
+
735
+ output = torch.nn.functional.dropout(mlp_output, p=self.hidden_dropout, training=self.training)
736
+ if self.deepnorm_coeff is not None:
737
+ output = residual*self.deepnorm_coeff.alpha + output
738
+ else:
739
+ output = residual + output
740
+
741
+ return output, kv_cache
742
+
743
+
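# Illustrative sketch, not part of the original file: the residual wiring of
# RAGPLMBlock.forward, assuming apply_residual_connection_post_layernorm is
# False. `ln1`, `attn`, `ln2`, `mlp` stand in for the block's submodules and
# `alpha` for the optional DeepNorm coefficient.
import torch
import torch.nn.functional as F
from torch import nn

def block_forward(x, ln1, attn, ln2, mlp, dropout_p=0.0, alpha=None, training=False):
    a = F.dropout(attn(ln1(x)), p=dropout_p, training=training)
    x = x * alpha + a if alpha is not None else x + a      # first residual (DeepNorm-scaled if alpha is set)
    m = F.dropout(mlp(ln2(x)), p=dropout_p, training=training)
    return x * alpha + m if alpha is not None else x + m   # second residual

y = block_forward(torch.randn(4, 1, 8), nn.Identity(), nn.Identity(), nn.Identity(), nn.Identity())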
744
+ class RAGPLMTransformer(torch.nn.Module):
745
+ """Transformer class."""
746
+
747
+ def __init__(self, config: RAGPLMConfig, device=None):
748
+ super(RAGPLMTransformer, self).__init__()
749
+
750
+ self.config = config
751
+ self.fp32_residual_connection = config.fp32_residual_connection
752
+ self.post_layer_norm = config.post_layer_norm
753
+
754
+ # Number of layers.
755
+ self.num_layers = config.num_layers
756
+
757
+ # Transformer layers.
758
+ def build_layer(layer_number):
759
+ return RAGPLMBlock(config, layer_number, device=device)
760
+
761
+ self.layers = torch.nn.ModuleList([build_layer(i + 1) for i in range(self.num_layers)])
762
+
763
+ if self.post_layer_norm:
764
+ LayerNormFunc = RMSNorm if config.rmsnorm else LayerNorm
765
+ # Final layer norm before output.
766
+ self.final_layernorm = LayerNormFunc(config.hidden_size, eps=config.layernorm_epsilon)
767
+
768
+ self.gradient_checkpointing = False
769
+ # Apply gradient checkpointing every num_checkpoint layers.
+ # For example: num_checkpoint=1 checkpoints every layer, num_checkpoint=2 checkpoints half of the layers.
771
+ self.num_checkpoint = 1
772
+
773
+ def _get_layer(self, layer_number):
774
+ return self.layers[layer_number]
775
+
776
+ def forward(
777
+ self, hidden_states, attention_mask, position_ids, kv_caches=None,
778
+ use_cache: Optional[bool] = True,
779
+ output_hidden_states: Optional[bool] = False,
780
+ ):
781
+ if not kv_caches:
782
+ kv_caches = [None for _ in range(self.num_layers)]
783
+ presents = () if use_cache else None
784
+ if self.gradient_checkpointing and self.training:
785
+ if use_cache:
786
+ logger.warning_once(
787
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
788
+ )
789
+ use_cache = False
790
+
791
+ all_self_attentions = None
792
+ all_hidden_states = () if output_hidden_states else None
793
+
794
+ for index in range(self.num_layers):
795
+ if output_hidden_states:
796
+ all_hidden_states = all_hidden_states + (hidden_states,)
797
+ layer = self._get_layer(index)
798
+ if self.gradient_checkpointing and self.training and torch.is_grad_enabled() and index % self.num_checkpoint == 0:
799
+ # Trick: re-enable gradients on the input to avoid a gradient-checkpointing error when only LoRA parameters are trainable
800
+ if hidden_states.requires_grad is False and deepspeed.checkpointing.is_configured() and (self.config.lora or self.config.mlp_lora):
801
+ # print(f"index={index}, set hidden_states.requires_grad = True")
802
+ hidden_states = hidden_states.clone()
803
+ hidden_states.requires_grad = True
804
+ layer_ret = get_checkpoint_fn()(
805
+ layer,
806
+ hidden_states,
807
+ attention_mask,
808
+ position_ids,
809
+ kv_caches[index],
810
+ use_cache
811
+ )
812
+ else:
813
+ layer_ret = layer(
814
+ hidden_states,
815
+ attention_mask,
816
+ position_ids,
817
+ kv_cache=kv_caches[index],
818
+ use_cache=use_cache
819
+ )
820
+
821
+ hidden_states, kv_cache = layer_ret
822
+ if use_cache:
823
+ presents = presents + (kv_cache,)
824
+
825
+ if output_hidden_states:
826
+ all_hidden_states = all_hidden_states + (hidden_states,)
827
+
828
+ # Final layer norm.
829
+ if self.post_layer_norm:
830
+ hidden_states = self.final_layernorm(hidden_states)
831
+
832
+ return hidden_states, presents, all_hidden_states, all_self_attentions
833
+
834
+
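# Illustrative sketch, not part of the original file: which layer indices the
# `index % self.num_checkpoint == 0` test in RAGPLMTransformer.forward wraps in
# gradient checkpointing, for a toy 8-layer stack.
num_layers = 8
for num_checkpoint in (1, 2, 4):
    checkpointed = [i for i in range(num_layers) if i % num_checkpoint == 0]
    print(num_checkpoint, checkpointed)
# num_checkpoint=1 -> every layer; num_checkpoint=2 -> layers 0, 2, 4, 6; num_checkpoint=4 -> layers 0 and 4.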
835
+ class RAGPLMPreTrainedModel(PreTrainedModel):
836
+ """
837
+ An abstract class to handle weights initialization and
838
+ a simple interface for downloading and loading pretrained models.
839
+ """
840
+
841
+ is_parallelizable = False
842
+ supports_gradient_checkpointing = True
843
+ config_class = RAGPLMConfig
844
+ base_model_prefix = "transformer"
845
+ _no_split_modules = ["RAGPLMBlock"]
846
+
847
+ def _init_weights(self, module: nn.Module):
848
+ """Initialize the weights."""
849
+ return
850
+
851
+ def get_masks(self, input_ids, past_key_values, padding_mask=None):
852
+ batch_size, seq_length = input_ids.shape
853
+ full_attention_mask = torch.ones(batch_size, seq_length, seq_length, device=input_ids.device)
854
+ full_attention_mask.tril_()
855
+ past_length = 0
856
+ if past_key_values:
857
+ past_length = past_key_values[0][0].shape[0]
858
+ if past_length:
859
+ full_attention_mask = torch.cat((torch.ones(batch_size, seq_length, past_length,
860
+ device=input_ids.device), full_attention_mask), dim=-1)
861
+ if padding_mask is not None:
862
+ full_attention_mask = full_attention_mask * padding_mask.unsqueeze(1)
863
+ if not past_length and padding_mask is not None:
864
+ full_attention_mask -= padding_mask.unsqueeze(-1) - 1
865
+ full_attention_mask = (full_attention_mask < 0.5).bool()
866
+ full_attention_mask.unsqueeze_(1)
867
+ return full_attention_mask
868
+
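# Illustrative sketch, not part of the original file: the boolean mask that
# get_masks builds for a toy prompt with no KV cache. True marks positions the
# core attention is NOT allowed to attend to.
import torch

batch_size, seq_length = 1, 4
full = torch.ones(batch_size, seq_length, seq_length)
full.tril_()                    # lower-triangular causal pattern
mask = (full < 0.5).bool()      # invert: True == masked out
mask.unsqueeze_(1)              # [batch, 1, seq, seq]
# mask[0, 0]:
# [[False,  True,  True,  True],
#  [False, False,  True,  True],
#  [False, False, False,  True],
#  [False, False, False, False]]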
869
+ def get_position_ids(self, input_ids, device):
870
+ batch_size, seq_length = input_ids.shape
871
+ position_ids_1 = torch.zeros( seq_length, dtype=torch.long, device=device).unsqueeze(0).repeat(batch_size, 1) # [batch_size, seq_len]
872
+ position_ids_2 = torch.arange(seq_length, dtype=torch.long, device=device).unsqueeze(0).repeat(batch_size, 1) # [batch_size, seq_len]
873
+ position_ids = torch.stack([position_ids_1, position_ids_2], axis=1) # [batch_size, 2, seq_len]
874
+ return position_ids
875
+
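# Illustrative sketch, not part of the original file: what get_position_ids
# returns for a toy batch. The first channel stays at zero and the second is a
# plain 0..seq_len-1 range, matching the model's 2D rotary position encoding.
import torch

batch_size, seq_length = 2, 5
pos1 = torch.zeros(seq_length, dtype=torch.long).unsqueeze(0).repeat(batch_size, 1)
pos2 = torch.arange(seq_length, dtype=torch.long).unsqueeze(0).repeat(batch_size, 1)
position_ids = torch.stack([pos1, pos2], dim=1)   # [batch_size, 2, seq_len]
# position_ids[0]:
# [[0, 0, 0, 0, 0],
#  [0, 1, 2, 3, 4]]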
876
+ def _set_gradient_checkpointing(self, module, value=False):
877
+ if isinstance(module, RAGPLMTransformer):
878
+ module.gradient_checkpointing = value
879
+
880
+ class Embedding(torch.nn.Module):
881
+ """Language model embeddings."""
882
+
883
+ def __init__(self, config: RAGPLMConfig, device=None):
884
+ super(Embedding, self).__init__()
885
+
886
+ self.hidden_size = config.hidden_size
887
+ # Word embeddings (parallel).
888
+ self.word_embeddings = nn.Embedding(
889
+ config.padded_vocab_size,
890
+ self.hidden_size,
891
+ dtype=config.torch_dtype,
892
+ device=device
893
+ )
894
+ self.fp32_residual_connection = config.fp32_residual_connection
895
+
896
+ def forward(self, input_ids):
897
+ # Embeddings.
898
+ words_embeddings = self.word_embeddings(input_ids)
899
+ embeddings = words_embeddings
900
+ # Data format change to avoid explicit transposes: [b s h] --> [s b h].
901
+ embeddings = embeddings.transpose(0, 1).contiguous()
902
+ # If the input flag for fp32 residual connection is set, convert for float.
903
+ if self.fp32_residual_connection:
904
+ embeddings = embeddings.float()
905
+ return embeddings
906
+
907
+ class RAGPLMModel(RAGPLMPreTrainedModel):
908
+ def __init__(self, config: RAGPLMConfig, device=None, empty_init=True):
909
+ super().__init__(config)
910
+ if empty_init:
911
+ init_method = skip_init
912
+ else:
913
+ init_method = default_init
914
+ init_kwargs = {}
915
+ if device is not None:
916
+ init_kwargs["device"] = device
917
+ self.embedding = init_method(Embedding, config, **init_kwargs)
918
+ self.num_layers = config.num_layers
919
+ self.multi_query_group_num = config.multi_query_group_num
920
+ self.kv_channels = config.kv_channels
921
+
922
+ self.str_emb_transform = None
923
+ self.seq_ln = None
924
+ self.str_ln = None
925
+ self.str_embedding = None
926
+
927
+ self.add_str_emb_ln = config.add_str_emb_ln
928
+ self.add_seq_emb_ln = config.add_seq_emb_ln
929
+ if config.str_input_dim is not None and config.str_input_dim > 0:
930
+ # Structure input as codebook: str_input_dim given, str_output_dim given
931
+ self.str_emb_transform = torch.nn.Linear(config.str_input_dim, config.hidden_size, bias=False)
932
+ if config.add_seq_emb_ln:
933
+ self.seq_ln = torch.nn.LayerNorm(config.hidden_size)
934
+ if config.add_str_emb_ln:
935
+ self.str_ln = torch.nn.LayerNorm(config.hidden_size)
936
+
937
+ # if config.str_input_dim is None and config.str_output_dim is not None and config.str_output_dim > 0:
938
+ if config.str_input_dim is None and config.str_vocab_size is not None and config.str_vocab_size > 0:
939
+ # Structure input as index: str_input_dim not given, str_vocab_size gives the vocab size; the structure embedding is nn.Embedding(str_vocab_size+1, hidden_size)
940
+ self.str_embedding = torch.nn.Embedding(config.str_vocab_size+1, config.hidden_size)
941
+
942
+ # Rotary positional embeddings
943
+ self.seq_length = config.seq_length
944
+ rotary_dim = (
945
+ config.hidden_size // config.num_attention_heads if config.kv_channels is None else config.kv_channels
946
+ )
947
+
948
+ # self.rotary_pos_emb = RotaryEmbedding(rotary_dim // 2, base=10000, precision=config.torch_dtype, learnable=False)
949
+ self.encoder = init_method(RAGPLMTransformer, config, **init_kwargs)
950
+ self.output_layer = init_method(nn.Linear, config.hidden_size, config.padded_vocab_size, bias=False,
951
+ dtype=config.torch_dtype, **init_kwargs)
952
+
953
+ if config.str_output_dim is not None and config.str_output_dim > 0:
954
+ self.output_layer_str = init_method(nn.Linear, config.hidden_size, config.str_output_dim, bias=False,
955
+ dtype=config.torch_dtype, **init_kwargs)
956
+ else:
957
+ self.output_layer_str = None
958
+
959
+ if config.qseq_output_dim is not None and config.qseq_output_dim > 0:
960
+ self.output_layer_qseq = init_method(nn.Linear, config.hidden_size, config.qseq_output_dim, bias=False,
961
+ dtype=config.torch_dtype, **init_kwargs)
962
+ else:
963
+ self.output_layer_qseq = None
964
+
965
+ def init_lora_modules(self):
966
+ for name, param in self.named_parameters():
967
+ if 'lora_linear' in name:
968
+ if '_A' in name:
969
+ torch.nn.init.kaiming_uniform_(param, a=math.sqrt(5))
970
+ elif '_B' in name:
971
+ torch.nn.init.zeros_(param)
972
+
973
+ def get_input_embeddings(self):
974
+ return self.embedding.word_embeddings
975
+
976
+ def forward(
977
+ self,
978
+ input_ids,
979
+ position_ids: Optional[torch.Tensor] = None, # position_ids: [batch_size, 2, seq_len]
980
+ attention_mask: Optional[torch.BoolTensor] = None,
981
+ full_attention_mask: Optional[torch.BoolTensor] = None,
982
+ past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None,
983
+ inputs_embeds: Optional[torch.Tensor] = None,
984
+ inputs_str_ids: Optional[torch.Tensor] = None,
985
+ inputs_str_embeds: Optional[torch.Tensor] = None,
986
+ use_cache: Optional[bool] = None,
987
+ output_hidden_states: Optional[bool] = None,
988
+ return_dict: Optional[bool] = None,
989
+ ):
990
+ output_hidden_states = (
991
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
992
+ )
993
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
994
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
995
+
996
+ batch_size, seq_length = input_ids.shape
997
+
998
+ if inputs_embeds is None:
999
+ inputs_embeds = self.embedding(input_ids) # [L, B, E]
1000
+
1001
+ if self.str_emb_transform is not None and inputs_str_embeds is not None:
1002
+ # inputs_embeds: torch.Size([12800, 1, 2304]), inputs_str_embeds: torch.Size([1, 337, 384])
1003
+ assert inputs_str_embeds.ndim == 3, f"inputs_embeds: {inputs_embeds.shape}, inputs_str_embeds: {inputs_str_embeds.shape}"
1004
+ assert inputs_str_embeds.shape[0] == inputs_embeds.shape[1], f"inputs_embeds: {inputs_embeds.shape}, inputs_str_embeds: {inputs_str_embeds.shape}"
1005
+ inputs_str_embeds = inputs_str_embeds.transpose(0, 1) # [L, B, E]
1006
+
1007
+ num_res, num_batch, num_dim = inputs_str_embeds.shape
1008
+ # inputs_embeds: [L, B, E]
1009
+ padding = inputs_embeds.shape[0] - num_res
1010
+ inputs_str_embeds = F.pad(inputs_str_embeds, [0, 0, 0, 0, 0, padding], value=0)
1011
+ str_embs = self.str_emb_transform(inputs_str_embeds)
1012
+
1013
+ if self.add_str_emb_ln:
1014
+ str_embs = self.str_ln(str_embs)
1015
+
1016
+ if self.add_seq_emb_ln:
1017
+ # seq_ln only applies to the query sequence part
1018
+ inputs_embeds = torch.cat([ self.seq_ln(inputs_embeds[:num_res]), inputs_embeds[num_res:] ], dim=0)
1019
+
1020
+ inputs_embeds = inputs_embeds + str_embs
1021
+
1022
+ #if self.add_emb_ln:
1023
+ # inputs_embeds = self.seq_ln(inputs_embeds) + self.str_ln(self.str_emb_transform(inputs_str_embeds))
1024
+ #else:
1025
+ # inputs_embeds = inputs_embeds + self.str_emb_transform(inputs_str_embeds)
1026
+
1027
+ if self.str_embedding is not None and inputs_str_ids is not None:
1028
+
1029
+
1030
+ str_embedding_weight = self.str_embedding.weight # [513, 2304]
1031
+ # Add a dimension represent the padding token
1032
+ str_embedding_weight = F.pad(str_embedding_weight, (0, 0, 0, 1)) # [514, 2304]
1033
+
1034
+ assert inputs_str_ids.max() < str_embedding_weight.shape[0], f"inputs_str_ids.max()={inputs_str_ids.max()}, str_embedding_weight.shape[0]={str_embedding_weight.shape[0]}"
1035
+
1036
+ str_embs = str_embedding_weight[inputs_str_ids] # [B, L, E]
1037
+ str_embs = str_embs.permute([1, 0, 2]) # [L, B, E]
1038
+ num_res, num_batch, num_dim = str_embs.shape
1039
+ padding = inputs_embeds.shape[0] - num_res
1040
+ str_embs = F.pad(str_embs, [0, 0, 0, 0, 0, padding], value=0)
1041
+ inputs_embeds = inputs_embeds + str_embs
1042
+
1043
+ if full_attention_mask is None:
1044
+ if (attention_mask is not None and not attention_mask.all()) or (past_key_values and seq_length != 1):
1045
+ full_attention_mask = self.get_masks(input_ids, past_key_values, padding_mask=attention_mask)
1046
+
1047
+ # Run encoder.
1048
+ hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
1049
+ inputs_embeds, full_attention_mask, position_ids=position_ids,
1050
+ kv_caches=past_key_values, use_cache=use_cache, output_hidden_states=output_hidden_states
1051
+ )
1052
+
1053
+ if not return_dict:
1054
+ return tuple(v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None)
1055
+
1056
+ return BaseModelOutputWithPast(
1057
+ last_hidden_state=hidden_states,
1058
+ past_key_values=presents,
1059
+ hidden_states=all_hidden_states,
1060
+ attentions=all_self_attentions,
1061
+ )
1062
+
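# Illustrative sketch, not part of the original file: how RAGPLMModel.forward
# pads per-residue structure embeddings to the full token length before adding
# them to the sequence embeddings. Shapes are toy values and `str_proj` stands
# in for self.str_emb_transform.
import torch
import torch.nn.functional as F
from torch import nn

seq_len, num_res, batch, hidden, str_dim = 12, 7, 1, 16, 4
inputs_embeds = torch.randn(seq_len, batch, hidden)       # [L, B, E]
str_embeds = torch.randn(num_res, batch, str_dim)         # structure features for the query residues only
str_proj = nn.Linear(str_dim, hidden, bias=False)

padding = inputs_embeds.shape[0] - num_res
str_embeds = F.pad(str_embeds, [0, 0, 0, 0, 0, padding])  # zero-pad along the length dimension
inputs_embeds = inputs_embeds + str_proj(str_embeds)      # structure signal added to the first num_res tokens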
1063
+ class RAGPLMForConditionalGeneration(RAGPLMPreTrainedModel):
1064
+ def __init__(self, config: RAGPLMConfig, empty_init=True, device=None):
1065
+ super().__init__(config)
1066
+
1067
+ self.max_sequence_length = config.max_length
1068
+ self.transformer = RAGPLMModel(config, empty_init=empty_init, device=device)
1069
+ self.config = config
1070
+
1071
+ def _update_model_kwargs_for_generation(
1072
+ self,
1073
+ outputs: ModelOutput,
1074
+ model_kwargs: Dict[str, Any],
1075
+ is_encoder_decoder: bool = False,
1076
+ standardize_cache_format: bool = False,
1077
+ ) -> Dict[str, Any]:
1078
+
1079
+ # update past_key_values
1080
+ model_kwargs["past_key_values"] = self._extract_past_from_model_output(
1081
+ outputs, standardize_cache_format=standardize_cache_format
1082
+ )
1083
+
1084
+ # update attention mask
1085
+ if "attention_mask" in model_kwargs:
1086
+ attention_mask = model_kwargs["attention_mask"]
1087
+ model_kwargs["attention_mask"] = torch.cat(
1088
+ [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
1089
+ )
1090
+
1091
+ if 'full_attention_mask' in model_kwargs:
1092
+ raise NotImplementedError(f"full_attention_mask...")
1093
+ model_kwargs['full_attention_mask'] = F.pad(model_kwargs['full_attention_mask'], [0, 1, 0, 1])
1094
+ if self.config.is_causal:
1095
+ model_kwargs['full_attention_mask'][..., -1] = True
1096
+
1097
+ # update position ids
1098
+ if "position_ids" in model_kwargs:
1099
+ position_ids = model_kwargs["position_ids"]
1100
+ new_position_id = position_ids[..., -1:].clone() # [batch_size, 2, 1]
1101
+ if self.config.rotary_embedding_2d:
1102
+ new_position_id[:, 1] += 1 # Only update the 2nd dimension
1103
+ else:
1104
+ new_position_id[:] += 1
1105
+ model_kwargs["position_ids"] = torch.cat(
1106
+ [position_ids, new_position_id], dim=-1
1107
+ ) # [batch_size, 2, seq_len+1]
1108
+
1109
+ model_kwargs["is_first_forward"] = False
1110
+ return model_kwargs
1111
+
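# Illustrative sketch, not part of the original file: how the incremental
# decoding update above extends the [batch, 2, seq_len] position ids when
# rotary_embedding_2d is enabled -- only the second channel advances.
import torch

position_ids = torch.tensor([[[0, 0, 0],
                              [0, 1, 2]]])        # [batch=1, 2, seq_len=3]
new_position_id = position_ids[..., -1:].clone()  # [1, 2, 1]
new_position_id[:, 1] += 1                        # advance only the within-sequence channel
position_ids = torch.cat([position_ids, new_position_id], dim=-1)
# position_ids:
# [[[0, 0, 0, 0],
#   [0, 1, 2, 3]]]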
1112
+ def prepare_inputs_for_generation(
1113
+ self,
1114
+ input_ids: torch.LongTensor,
1115
+ past_key_values: Optional[torch.Tensor] = None,
1116
+ attention_mask: Optional[torch.Tensor] = None,
1117
+ full_attention_mask: Optional[torch.Tensor] = None,
1118
+ position_ids: Optional[torch.Tensor] = None,
1119
+ use_cache: Optional[bool] = None,
1120
+ is_first_forward: bool = True,
1121
+ **kwargs
1122
+ ) -> dict:
1123
+ # only last token for input_ids if past is not None
1124
+ if position_ids is None:
1125
+ position_ids = self.get_position_ids(input_ids, device=input_ids.device) # position_ids: [batch_size, 2, seq_len]
1126
+ if not is_first_forward:
1127
+ if past_key_values is not None:
1128
+ position_ids = position_ids[..., -1:]
1129
+ input_ids = input_ids[:, -1:]
1130
+ return {
1131
+ "input_ids": input_ids,
1132
+ "past_key_values": past_key_values,
1133
+ "position_ids": position_ids,
1134
+ "attention_mask": attention_mask,
1135
+ "full_attention_mask": full_attention_mask,
1136
+ "return_last_logit": True,
1137
+ "use_cache": use_cache
1138
+ }
1139
+
1140
+ def forward(
1141
+ self,
1142
+ input_ids: Optional[torch.Tensor] = None,
1143
+ position_ids: Optional[torch.Tensor] = None,
1144
+ attention_mask: Optional[torch.Tensor] = None,
1145
+ full_attention_mask: Optional[torch.Tensor] = None,
1146
+ past_key_values: Optional[Tuple[torch.FloatTensor]] = None,
1147
+ inputs_embeds: Optional[torch.Tensor] = None,
1148
+ labels: Optional[torch.Tensor] = None,
1149
+ use_cache: Optional[bool] = None,
1150
+ output_attentions: Optional[bool] = None,
1151
+ output_hidden_states: Optional[bool] = None,
1152
+ return_dict: Optional[bool] = None,
1153
+ return_last_logit: Optional[bool] = False,
1154
+ ):
1155
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
1156
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1157
+
1158
+ transformer_outputs = self.transformer(
1159
+ input_ids=input_ids,
1160
+ position_ids=position_ids, # position_ids: [batch_size, 2, seq_len]
1161
+ attention_mask=attention_mask,
1162
+ full_attention_mask=full_attention_mask,
1163
+ past_key_values=past_key_values,
1164
+ inputs_embeds=inputs_embeds,
1165
+ use_cache=use_cache,
1166
+ output_hidden_states=output_hidden_states,
1167
+ return_dict=return_dict,
1168
+ )
1169
+
1170
+ hidden_states = transformer_outputs[0]
1171
+ if return_last_logit:
1172
+ hidden_states = hidden_states[-1:]
1173
+ lm_logits = self.transformer.output_layer(hidden_states)
1174
+ # output_layer_str
1175
+ lm_logits = lm_logits.transpose(0, 1).contiguous()
1176
+
1177
+ loss = None
1178
+ if labels is not None:
1179
+ lm_logits = lm_logits.to(torch.float32)
1180
+
1181
+ # Shift so that tokens < n predict n
1182
+ shift_logits = lm_logits[..., :-1, :].contiguous()
1183
+ shift_labels = labels[..., 1:].contiguous()
1184
+ # Flatten the tokens
1185
+ loss_fct = CrossEntropyLoss(ignore_index=-100)
1186
+ loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
1187
+
1188
+ lm_logits = lm_logits.to(hidden_states.dtype)
1189
+ loss = loss.to(hidden_states.dtype)
1190
+
1191
+ if not return_dict:
1192
+ output = (lm_logits,) + transformer_outputs[1:]
1193
+ return ((loss,) + output) if loss is not None else output
1194
+
1195
+ return CausalLMOutputWithPast(
1196
+ loss=loss,
1197
+ logits=lm_logits,
1198
+ past_key_values=transformer_outputs.past_key_values,
1199
+ hidden_states=transformer_outputs.hidden_states,
1200
+ attentions=transformer_outputs.attentions,
1201
+ )
1202
+
1203
+ @staticmethod
1204
+ def _reorder_cache(
1205
+ past: Tuple[Tuple[torch.Tensor, torch.Tensor], ...], beam_idx: torch.LongTensor
1206
+ ) -> Tuple[Tuple[torch.Tensor, torch.Tensor], ...]:
1207
+ """
1208
+ This function is used to re-order the `past_key_values` cache if [`~PreTrainedModel.beam_search`] or
1209
+ [`~PreTrainedModel.beam_sample`] is called. This is required to match `past_key_values` with the correct
1210
+ beam_idx at every generation step.
1211
+
1212
+ Output shares the same memory storage as `past`.
1213
+ """
1214
+ return tuple(
1215
+ (
1216
+ layer_past[0].index_select(1, beam_idx.to(layer_past[0].device)),
1217
+ layer_past[1].index_select(1, beam_idx.to(layer_past[1].device)),
1218
+ )
1219
+ for layer_past in past
1220
+ )
1221
+
1222
+ def process_response(self, output, history):
1223
+ content = ""
1224
+ history = deepcopy(history)
1225
+ for response in output.split("<|assistant|>"):
1226
+ if "\n" in response:
1227
+ metadata, content = response.split("\n", maxsplit=1)
1228
+ else:
1229
+ metadata, content = "", response
1230
+ if not metadata.strip():
1231
+ content = content.strip()
1232
+ history.append({"role": "assistant", "metadata": metadata, "content": content})
1233
+ content = content.replace("[[训练时间]]", "2023年")
1234
+ else:
1235
+ history.append({"role": "assistant", "metadata": metadata, "content": content})
1236
+ if history[0]["role"] == "system" and "tools" in history[0]:
1237
+ content = "\n".join(content.split("\n")[1:-1])
1238
+ def tool_call(**kwargs):
1239
+ return kwargs
1240
+ parameters = eval(content)
1241
+ content = {"name": metadata.strip(), "parameters": parameters}
1242
+ else:
1243
+ content = {"name": metadata.strip(), "content": content}
1244
+ return content, history
1245
+
1246
+ @torch.inference_mode()
1247
+ def chat(self, tokenizer, query: str, max_length: int = 2048, num_beams=1,
1248
+ do_sample=True, top_p=0.8, temperature=0.8, logits_processor=None, **kwargs):
1249
+ if logits_processor is None:
1250
+ logits_processor = LogitsProcessorList()
1251
+ logits_processor.append(InvalidScoreLogitsProcessor())
1252
+ gen_kwargs = {"max_length": max_length, "num_beams": num_beams, "do_sample": do_sample, "top_p": top_p,
1253
+ "temperature": temperature, "logits_processor": logits_processor, **kwargs}
1254
+ inputs = tokenizer.build_chat_input(query)
1255
+ inputs = inputs.to(self.device)
1256
+ eos_token_id = [tokenizer.eos_token_id]
1257
+ outputs = self.generate(**inputs, **gen_kwargs, eos_token_id=eos_token_id)
1258
+ outputs = outputs.tolist()[0][len(inputs["input_ids"][0]):-1]
1259
+ response = tokenizer.decode(outputs)
1260
+ return response
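A minimal usage sketch of the class above. Everything repository-specific is an assumption: <repo-id> is a placeholder for the actual Hugging Face model id, add_retriever_tokens() is called so that the 'ID' token appended by build_chat_input exists in the vocabulary, and the query is an arbitrary toy peptide. Treat this as a sketch, not a verified recipe.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "<repo-id>"  # placeholder -- substitute the actual repository id
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
tokenizer.add_retriever_tokens()  # registers the 'ID' token that build_chat_input appends
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True, torch_dtype=torch.bfloat16)

response = model.chat(tokenizer, "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", max_length=128)
print(response)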
performance.png ADDED

Git LFS Details

  • SHA256: 1419e63831d60aabbed4476453f0b2fcd8ca235da3e525a3a6481b582b0a8fcb
  • Pointer size: 131 Bytes
  • Size of remote file: 142 kB
proteinmoe_architecture.png ADDED

Git LFS Details

  • SHA256: 670762ddcb58e41cea704d9fddb6acf9bb109216baff6eaadea401699b56552f
  • Pointer size: 131 Bytes
  • Size of remote file: 450 kB
pytorch_model-00001-of-00007.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2442d42059efa680a6dfcb0098629bb1661582a0b61b1eed4e6ed472dbf1d7b9
3
+ size 4932953760
pytorch_model-00002-of-00007.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bb9e891095ef8df462d4c192accdf3650764acecf8f3512568c8d25c66ea60ea
3
+ size 4999044744
pytorch_model-00003-of-00007.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ae85274ab0bc8fcd790fdb55b0993d6bafffe153511d82047b7c12bca28f225e
3
+ size 4991905042
pytorch_model-00004-of-00007.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0d798373023367553ae425c7e57170ee7dce7ac516aec6edabd8c894f6842731
3
+ size 4963629472
pytorch_model-00005-of-00007.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:900893052b8f4b2a5e939c777a3c9a07b1cc1e52efdb7f688231a8cd650ef89a
3
+ size 4991904870
pytorch_model-00006-of-00007.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f2c940f8773468bc21e771d98f35be0250a4ba2ae77e5429f56cd1cbe8b55c84
3
+ size 4999045148
pytorch_model-00007-of-00007.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9a45ba9f924d8ab3f664f8365f07c40b1247f824acbf662629b324a2022af928
3
+ size 2249880965
pytorch_model.bin.index.json ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1 @@
1
+ {}
tokenization.py ADDED
@@ -0,0 +1,268 @@
1
+ from typing import Sequence, Tuple, List, Union, Optional
2
+ from abc import ABC
3
+ from abc import abstractmethod
4
+ # from .tokenizer import AbstractTokenizer
5
+ import logging
6
+ import itertools
7
+ from transformers import PreTrainedTokenizer
8
+ import torch
9
+ import json
10
+ import numpy as np
11
+
12
+ logger = logging.getLogger(__name__)
13
+
14
+ class ResidueLevelTokenizer(object):
15
+ """
16
+ Tokenizer for Protein Residue Level Tokenization.
17
+ """
18
+ def __init__(self, **kwargs):
19
+ super(ResidueLevelTokenizer, self).__init__()
20
+
21
+ ### Set normal tokens
22
+ self.all_toks = ['[pad]', 'L', 'A', 'G', 'V', 'S', 'E', 'R', 'T', 'I', 'D', 'P', 'K', 'Q', 'N', 'F', 'Y', 'M', 'H', 'W', 'C', 'X', 'B', 'U', 'Z', 'O', '.', '-']
23
+
24
+ ### Set special tokens
25
+ _special_tokens = ['tMASK', 'gMASK', 'sMASK', 'eod', 'sop', 'eop', '</s>' ] # + ['MSA', 'ID'] + [ str(d) for d in range(0, 64) ]
26
+ self.special_tokens = { tok: len(self.all_toks)+i for i,tok in enumerate(_special_tokens) }
27
+ self.special_tokens_decoder = { v:k for k, v in self.special_tokens.items() }
28
+ self.special_tokens['eos'] = self.special_tokens['</s>']
29
+ self.all_toks.extend(_special_tokens)
30
+
31
+ self.vocab = {tok:idx for idx,tok in enumerate(self.all_toks)}
32
+ self.command_token = {'[MASK]':'MASK', '[gMASK]': 'gMASK', '[sMASK]':'sMASK'} # , '[MSA]':'MSA', '[ID]':'ID'}
33
+
34
+ self.gMASK_token_id = self.convert_token_to_id('gMASK')
35
+ self.sop_token_id = self.convert_token_to_id('sop')
36
+ self.eos_token_id = self.convert_token_to_id('</s>')
37
+ # self.id_token_id = self.convert_token_to_id('ID')
38
+ self.pad_token_id = self.convert_token_to_id('[pad]')
39
+
40
+ def __len__(self):
41
+ return len(self.vocab)
42
+
43
+ def get_special_token(self, token):
44
+ return self.special_tokens[token]
45
+
46
+ def get_vocab(self):
47
+ return self.vocab
48
+
49
+ def convert_id_to_token(self, idx):
50
+ idx = int(idx)
51
+ if idx == 0:
52
+ return '[pad]'
53
+ elif idx in self.special_tokens_decoder:
54
+ return f"[{self.special_tokens_decoder[idx]}]"
55
+ else:
56
+ return self.all_toks[idx]
57
+
58
+ def convert_token_to_id(self, token):
59
+ if token == '[pad]':
60
+ return 0
61
+ elif token in self.special_tokens:
62
+ return self.special_tokens[token]
63
+ else:
64
+ return self.vocab[token]
65
+
66
+ def encode(self, sequence, add_eos=True):
67
+ """
68
+ Encode string or list of tokens into array
69
+
70
+ Examples
71
+ ----------------
72
+ encode('[pad]ABDWOKOAKOQA[pad][MSA][3][2][19][3]')
73
+ encode(['A', 'B', 'D', 'MSA', '34', '2'])
74
+ """
75
+
76
+ all_toks = set(self.all_toks)
77
+ a2b = {f"[{t}]":t for t in self.special_tokens if t[0] not in ('[', )}
78
+ all_toks.update( set(a2b.keys()) )
79
+
80
+ if isinstance(sequence, (tuple, list)):
81
+ if sequence[-1] != '</s>' and add_eos:
82
+ sequence = sequence + ['</s>']
83
+ sequence = [ a2b.get(tok, tok) for tok in sequence ]
84
+ return np.array([ self.convert_token_to_id(t) for t in sequence ])
85
+ elif isinstance(sequence, str):
86
+ if not sequence.endswith('</s>') and add_eos:
87
+ sequence = sequence + '</s>'
88
+ s = 0
89
+ e = 1
90
+ tok_list = []
91
+ while s < len(sequence):
92
+ while sequence[s:e] not in all_toks and e < len(sequence):
93
+ e += 1
94
+ assert sequence[s:e] in all_toks, f"Error: sub sequence {sequence[s:]} cannot be parsed"
95
+ tok = sequence[s:e]
96
+ tok = a2b.get(tok, tok) # [gMASK], [sMASK] ...
97
+ tok_id = self.convert_token_to_id(tok)
98
+ tok_list.append(tok_id)
99
+ s = e
100
+ return np.array(tok_list)
101
+ else:
102
+ raise RuntimeError(f"Error: sequence must be list/tuple/str, but got {type(sequence)}")
103
+
104
+ def decode(self, tokens, rem_eos=True, return_str=True):
105
+ if tokens[-1] == self.eos_token_id and rem_eos:
106
+ tokens = tokens[:-1]
107
+ if return_str:
108
+ return "".join([ self.convert_id_to_token(tok) for tok in tokens ])
109
+ else:
110
+ return [ self.convert_id_to_token(tok) for tok in tokens ]
111
+
112
+ def tokenize(self, text, add_eos=True):
113
+ return self.encode(text, add_eos=add_eos)
114
+
115
+ def extend_vocab(self, tokens):
116
+ """Extend the vocab with the list of tokens."""
117
+ for token in tokens:
118
+ if token not in self.vocab:
119
+ self.vocab[token] = len(self.vocab)
120
+ self.all_toks.append(token)
121
+
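# Illustrative sketch, not part of the original file: a round trip through the
# ResidueLevelTokenizer defined above. encode() appends '</s>' by default and
# decode() strips it again, so the original sequence comes back unchanged. The
# peptide string is an arbitrary toy example.
tok = ResidueLevelTokenizer()
ids = tok.encode("MKTAYIA")   # numpy array of residue ids followed by the eos id
seq = tok.decode(ids)         # eos removed, string restored
assert seq == "MKTAYIA"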
122
+ class ProteinTokenizer(PreTrainedTokenizer):
123
+ """
124
+ Protein Tokenizer based on Residue level tokenizer
125
+ """
126
+
127
+ def __init__(
128
+ self,
129
+ vocab_file='xxx',
130
+ padding_side="right",
131
+ clean_up_tokenization_spaces=False,
132
+ encode_special_tokens=True,
133
+ **kwargs
134
+ ):
135
+ self.name = "ProteinTokenizer"
136
+ self.vocab_file = vocab_file
137
+ self.tokenizer = ResidueLevelTokenizer()
138
+ self.special_tokens = self.tokenizer.special_tokens
139
+ self.encode_special_tokens = encode_special_tokens
140
+
141
+ super().__init__(
142
+ padding_side=padding_side,
143
+ clean_up_tokenization_spaces=clean_up_tokenization_spaces,
144
+ **kwargs
145
+ )
146
+
147
+ def get_command(self, token):
148
+ if token in self.special_tokens:
149
+ return self.special_tokens[token]
150
+ assert token in self.tokenizer.special_tokens, f"{token} is not a special token for {self.name}"
151
+ return self.tokenizer.special_tokens[token]
152
+
153
+ @property
154
+ def unk_token(self) -> str:
155
+ return '[pad]'
156
+
157
+ @property
158
+ def pad_token(self) -> str:
159
+ return '[pad]'
160
+
161
+ @property
162
+ def eos_token(self) -> str:
163
+ return '</s>'
164
+
165
+ @property
166
+ def unk_token_id(self) -> int:
167
+ return self.tokenizer.pad_token_id  # '[pad]' doubles as the unk token; return its integer id
168
+
169
+ @property
170
+ def pad_token_id(self) -> int:
171
+ return self.tokenizer.pad_token_id
172
+
173
+ @property
174
+ def eos_token_id(self):
175
+ return self.tokenizer.eos_token_id
176
+
177
+ @property
178
+ def gMASK_token_id(self):
179
+ return self.tokenizer.gMASK_token_id
180
+
181
+ @property
182
+ def sop_token_id(self):
183
+ return self.tokenizer.sop_token_id
184
+
185
+ @property
186
+ def id_token_id(self):
187
+ return self.tokenizer.id_token_id
188
+
189
+ def IdToToken(self, id_):
190
+ return self.tokenizer.convert_id_to_token(id_)
191
+
192
+ def TokenToId(self, token):
193
+ return self.tokenizer.convert_token_to_id(token)
194
+
195
+ @unk_token.setter
196
+ def unk_token(self, value):
197
+ logger.warning("Setting unk_token is not supported, use the default one.")
198
+
199
+ @pad_token.setter
200
+ def pad_token(self, value):
201
+ logger.warning("Setting pad_token is not supported, use the default one.")
202
+
203
+ @eos_token.setter
204
+ def eos_token(self, value):
205
+ logger.warning("Setting eos_token is not supported, use the default one.")
206
+
207
+ @property
208
+ def vocab_size(self):
209
+ return len(self.tokenizer)
210
+
211
+ def encode(self, sequence, add_eos=True):
212
+ return self.tokenizer.encode(sequence, add_eos=add_eos)
213
+
214
+ def decode(self, token_ids, rem_eos=True, return_str=True):
215
+ return self.tokenizer.decode(token_ids, rem_eos=rem_eos, return_str=return_str)
216
+
217
+ def _convert_id_to_token(self, index):
218
+ """Converts an index (integer) in a token (str) using the vocab."""
219
+ return self.tokenizer.convert_id_to_token(index)
220
+
221
+ def get_vocab(self):
222
+ """ Returns vocab as a dict """
223
+ vocab = {self._convert_id_to_token(i): i for i in range(self.vocab_size)}
224
+ return vocab
225
+
226
+ @property
227
+ def eod(self):
228
+ return self.tokenizer.get_special_token('eos')
229
+
230
+ def detokenize(self, Ids, type_token=False):
231
+ new_tokens = self.tokenizer.decode(Ids)
232
+ return new_tokens
233
+
234
+ def tokenize(self, text):
235
+ ids = self.tokenizer.tokenize(text)
236
+ return ids
237
+
238
+ def extend_vocab(self, tokens):
239
+ """ Extend the vocab with the list of tokens """
240
+ self.tokenizer.extend_vocab(tokens)
241
+
242
+ def add_retriever_tokens(self):
243
+ retriever_tokens = ['MSA', 'ID'] + [ str(d) for d in range(0, 64) ]
244
+ self.tokenizer.extend_vocab(retriever_tokens)
245
+ self.tokenizer.command_token['[MSA]'] = 'MSA'
246
+ self.tokenizer.command_token['[ID]'] = 'ID'
247
+
248
+ def add_structure_tokens(self, codebook_size):
249
+ self.tokenizer.extend_vocab( [ str(i) for i in range(codebook_size) ] )
250
+
251
+ def build_chat_input(self, query):
252
+ input_ids = [ self.tokenizer.convert_token_to_id('gMASK'), self.tokenizer.convert_token_to_id('sop') ]
253
+ input_ids += [ self.tokenizer.convert_token_to_id(tok) for tok in query ]
254
+ input_ids += [ self.tokenizer.convert_token_to_id('ID') ]
255
+ # return self.batch_encode_plus([input_ids], return_tensors="pt", is_split_into_words=True)
256
+
257
+ position_ids = torch.stack([torch.zeros(len(input_ids)), torch.arange(len(input_ids))], axis=0).unsqueeze(0).long()
258
+ return {
259
+ 'input_ids': torch.from_numpy(np.array([ input_ids ])).long(),
260
+ 'attention_mask': None,
261
+ 'position_ids': position_ids
262
+ }
263
+
264
+ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
265
+ vocab = self.get_vocab()
266
+ with open(f"{save_directory}/vocab.json", 'w') as f:
267
+ json.dump(vocab, f, indent=4)
268
+ return ( f"{save_directory}/vocab.json", )
tokenizer_config.json ADDED
@@ -0,0 +1,19 @@
1
+ {
2
+ "added_tokens_decoder": {},
3
+ "auto_map": {
4
+ "AutoTokenizer": [
5
+ "tokenization.ProteinTokenizer",
6
+ null
7
+ ]
8
+ },
9
+ "clean_up_tokenization_spaces": false,
10
+ "do_lower_case": false,
11
+ "eos_token": "</s>",
12
+ "extra_special_tokens": {},
13
+ "model_max_length": 1000000000000000019884624838656,
14
+ "pad_token": "[pad]",
15
+ "padding_side": "right",
16
+ "remove_space": false,
17
+ "tokenizer_class": "ProteinTokenizer",
18
+ "unk_token": "[pad]"
19
+ }
vocab.json ADDED
@@ -0,0 +1,549 @@
1
+ {
2
+ "[pad]": 0,
3
+ "L": 1,
4
+ "A": 2,
5
+ "G": 3,
6
+ "V": 4,
7
+ "S": 5,
8
+ "E": 6,
9
+ "R": 7,
10
+ "T": 8,
11
+ "I": 9,
12
+ "D": 10,
13
+ "P": 11,
14
+ "K": 12,
15
+ "Q": 13,
16
+ "N": 14,
17
+ "F": 15,
18
+ "Y": 16,
19
+ "M": 17,
20
+ "H": 18,
21
+ "W": 19,
22
+ "C": 20,
23
+ "X": 21,
24
+ "B": 22,
25
+ "U": 23,
26
+ "Z": 24,
27
+ "O": 25,
28
+ ".": 26,
29
+ "-": 27,
30
+ "[tMASK]": 28,
31
+ "[gMASK]": 29,
32
+ "[sMASK]": 30,
33
+ "[eod]": 31,
34
+ "[sop]": 32,
35
+ "[eop]": 33,
36
+ "[</s>]": 34,
37
+ "0": 35,
38
+ "1": 36,
39
+ "2": 37,
40
+ "3": 38,
41
+ "4": 39,
42
+ "5": 40,
43
+ "6": 41,
44
+ "7": 42,
45
+ "8": 43,
46
+ "9": 44,
47
+ "10": 45,
48
+ "11": 46,
49
+ "12": 47,
50
+ "13": 48,
51
+ "14": 49,
52
+ "15": 50,
53
+ "16": 51,
54
+ "17": 52,
55
+ "18": 53,
56
+ "19": 54,
57
+ "20": 55,
58
+ "21": 56,
59
+ "22": 57,
60
+ "23": 58,
61
+ "24": 59,
62
+ "25": 60,
63
+ "26": 61,
64
+ "27": 62,
65
+ "28": 63,
66
+ "29": 64,
67
+ "30": 65,
68
+ "31": 66,
69
+ "32": 67,
70
+ "33": 68,
71
+ "34": 69,
72
+ "35": 70,
73
+ "36": 71,
74
+ "37": 72,
75
+ "38": 73,
76
+ "39": 74,
77
+ "40": 75,
78
+ "41": 76,
79
+ "42": 77,
80
+ "43": 78,
81
+ "44": 79,
82
+ "45": 80,
83
+ "46": 81,
84
+ "47": 82,
85
+ "48": 83,
86
+ "49": 84,
87
+ "50": 85,
88
+ "51": 86,
89
+ "52": 87,
90
+ "53": 88,
91
+ "54": 89,
92
+ "55": 90,
93
+ "56": 91,
94
+ "57": 92,
95
+ "58": 93,
96
+ "59": 94,
97
+ "60": 95,
98
+ "61": 96,
99
+ "62": 97,
100
+ "63": 98,
101
+ "64": 99,
102
+ "65": 100,
103
+ "66": 101,
104
+ "67": 102,
105
+ "68": 103,
106
+ "69": 104,
107
+ "70": 105,
108
+ "71": 106,
109
+ "72": 107,
110
+ "73": 108,
111
+ "74": 109,
112
+ "75": 110,
113
+ "76": 111,
114
+ "77": 112,
115
+ "78": 113,
116
+ "79": 114,
117
+ "80": 115,
118
+ "81": 116,
119
+ "82": 117,
120
+ "83": 118,
121
+ "84": 119,
122
+ "85": 120,
123
+ "86": 121,
124
+ "87": 122,
125
+ "88": 123,
126
+ "89": 124,
127
+ "90": 125,
128
+ "91": 126,
129
+ "92": 127,
130
+ "93": 128,
131
+ "94": 129,
132
+ "95": 130,
133
+ "96": 131,
134
+ "97": 132,
135
+ "98": 133,
136
+ "99": 134,
137
+ "100": 135,
138
+ "101": 136,
139
+ "102": 137,
140
+ "103": 138,
141
+ "104": 139,
142
+ "105": 140,
143
+ "106": 141,
144
+ "107": 142,
145
+ "108": 143,
146
+ "109": 144,
147
+ "110": 145,
148
+ "111": 146,
149
+ "112": 147,
150
+ "113": 148,
151
+ "114": 149,
152
+ "115": 150,
153
+ "116": 151,
154
+ "117": 152,
155
+ "118": 153,
156
+ "119": 154,
157
+ "120": 155,
158
+ "121": 156,
159
+ "122": 157,
160
+ "123": 158,
161
+ "124": 159,
162
+ "125": 160,
163
+ "126": 161,
164
+ "127": 162,
165
+ "128": 163,
166
+ "129": 164,
167
+ "130": 165,
168
+ "131": 166,
169
+ "132": 167,
170
+ "133": 168,
171
+ "134": 169,
172
+ "135": 170,
173
+ "136": 171,
174
+ "137": 172,
175
+ "138": 173,
176
+ "139": 174,
177
+ "140": 175,
178
+ "141": 176,
179
+ "142": 177,
180
+ "143": 178,
181
+ "144": 179,
182
+ "145": 180,
183
+ "146": 181,
184
+ "147": 182,
185
+ "148": 183,
186
+ "149": 184,
187
+ "150": 185,
188
+ "151": 186,
189
+ "152": 187,
190
+ "153": 188,
191
+ "154": 189,
192
+ "155": 190,
193
+ "156": 191,
194
+ "157": 192,
195
+ "158": 193,
196
+ "159": 194,
197
+ "160": 195,
198
+ "161": 196,
199
+ "162": 197,
200
+ "163": 198,
201
+ "164": 199,
202
+ "165": 200,
203
+ "166": 201,
204
+ "167": 202,
205
+ "168": 203,
206
+ "169": 204,
207
+ "170": 205,
208
+ "171": 206,
209
+ "172": 207,
210
+ "173": 208,
211
+ "174": 209,
212
+ "175": 210,
213
+ "176": 211,
214
+ "177": 212,
215
+ "178": 213,
216
+ "179": 214,
217
+ "180": 215,
218
+ "181": 216,
219
+ "182": 217,
220
+ "183": 218,
221
+ "184": 219,
222
+ "185": 220,
223
+ "186": 221,
224
+ "187": 222,
225
+ "188": 223,
226
+ "189": 224,
227
+ "190": 225,
228
+ "191": 226,
229
+ "192": 227,
230
+ "193": 228,
231
+ "194": 229,
232
+ "195": 230,
233
+ "196": 231,
234
+ "197": 232,
235
+ "198": 233,
236
+ "199": 234,
237
+ "200": 235,
238
+ "201": 236,
239
+ "202": 237,
240
+ "203": 238,
241
+ "204": 239,
242
+ "205": 240,
243
+ "206": 241,
244
+ "207": 242,
245
+ "208": 243,
246
+ "209": 244,
247
+ "210": 245,
248
+ "211": 246,
249
+ "212": 247,
250
+ "213": 248,
251
+ "214": 249,
252
+ "215": 250,
253
+ "216": 251,
254
+ "217": 252,
255
+ "218": 253,
256
+ "219": 254,
257
+ "220": 255,
258
+ "221": 256,
259
+ "222": 257,
260
+ "223": 258,
261
+ "224": 259,
262
+ "225": 260,
263
+ "226": 261,
264
+ "227": 262,
265
+ "228": 263,
266
+ "229": 264,
267
+ "230": 265,
268
+ "231": 266,
269
+ "232": 267,
270
+ "233": 268,
271
+ "234": 269,
272
+ "235": 270,
273
+ "236": 271,
274
+ "237": 272,
275
+ "238": 273,
276
+ "239": 274,
277
+ "240": 275,
278
+ "241": 276,
279
+ "242": 277,
280
+ "243": 278,
281
+ "244": 279,
282
+ "245": 280,
283
+ "246": 281,
284
+ "247": 282,
285
+ "248": 283,
286
+ "249": 284,
287
+ "250": 285,
288
+ "251": 286,
289
+ "252": 287,
290
+ "253": 288,
291
+ "254": 289,
292
+ "255": 290,
293
+ "256": 291,
294
+ "257": 292,
295
+ "258": 293,
296
+ "259": 294,
297
+ "260": 295,
298
+ "261": 296,
299
+ "262": 297,
300
+ "263": 298,
301
+ "264": 299,
302
+ "265": 300,
303
+ "266": 301,
304
+ "267": 302,
305
+ "268": 303,
306
+ "269": 304,
307
+ "270": 305,
308
+ "271": 306,
309
+ "272": 307,
310
+ "273": 308,
311
+ "274": 309,
312
+ "275": 310,
313
+ "276": 311,
314
+ "277": 312,
315
+ "278": 313,
316
+ "279": 314,
317
+ "280": 315,
318
+ "281": 316,
319
+ "282": 317,
320
+ "283": 318,
321
+ "284": 319,
322
+ "285": 320,
323
+ "286": 321,
324
+ "287": 322,
325
+ "288": 323,
326
+ "289": 324,
327
+ "290": 325,
328
+ "291": 326,
329
+ "292": 327,
330
+ "293": 328,
331
+ "294": 329,
332
+ "295": 330,
333
+ "296": 331,
334
+ "297": 332,
335
+ "298": 333,
336
+ "299": 334,
337
+ "300": 335,
338
+ "301": 336,
339
+ "302": 337,
340
+ "303": 338,
341
+ "304": 339,
342
+ "305": 340,
343
+ "306": 341,
344
+ "307": 342,
345
+ "308": 343,
346
+ "309": 344,
347
+ "310": 345,
348
+ "311": 346,
349
+ "312": 347,
350
+ "313": 348,
351
+ "314": 349,
352
+ "315": 350,
353
+ "316": 351,
354
+ "317": 352,
355
+ "318": 353,
356
+ "319": 354,
357
+ "320": 355,
358
+ "321": 356,
359
+ "322": 357,
360
+ "323": 358,
361
+ "324": 359,
362
+ "325": 360,
363
+ "326": 361,
364
+ "327": 362,
365
+ "328": 363,
366
+ "329": 364,
367
+ "330": 365,
368
+ "331": 366,
369
+ "332": 367,
370
+ "333": 368,
371
+ "334": 369,
372
+ "335": 370,
373
+ "336": 371,
374
+ "337": 372,
375
+ "338": 373,
376
+ "339": 374,
377
+ "340": 375,
378
+ "341": 376,
379
+ "342": 377,
380
+ "343": 378,
381
+ "344": 379,
382
+ "345": 380,
383
+ "346": 381,
384
+ "347": 382,
385
+ "348": 383,
386
+ "349": 384,
387
+ "350": 385,
388
+ "351": 386,
389
+ "352": 387,
390
+ "353": 388,
391
+ "354": 389,
392
+ "355": 390,
393
+ "356": 391,
394
+ "357": 392,
395
+ "358": 393,
396
+ "359": 394,
397
+ "360": 395,
398
+ "361": 396,
399
+ "362": 397,
400
+ "363": 398,
401
+ "364": 399,
402
+ "365": 400,
403
+ "366": 401,
404
+ "367": 402,
405
+ "368": 403,
406
+ "369": 404,
407
+ "370": 405,
408
+ "371": 406,
409
+ "372": 407,
410
+ "373": 408,
411
+ "374": 409,
412
+ "375": 410,
413
+ "376": 411,
414
+ "377": 412,
415
+ "378": 413,
416
+ "379": 414,
417
+ "380": 415,
418
+ "381": 416,
419
+ "382": 417,
420
+ "383": 418,
421
+ "384": 419,
422
+ "385": 420,
423
+ "386": 421,
424
+ "387": 422,
425
+ "388": 423,
426
+ "389": 424,
427
+ "390": 425,
428
+ "391": 426,
429
+ "392": 427,
430
+ "393": 428,
431
+ "394": 429,
432
+ "395": 430,
433
+ "396": 431,
434
+ "397": 432,
435
+ "398": 433,
436
+ "399": 434,
437
+ "400": 435,
438
+ "401": 436,
439
+ "402": 437,
440
+ "403": 438,
441
+ "404": 439,
442
+ "405": 440,
443
+ "406": 441,
444
+ "407": 442,
445
+ "408": 443,
446
+ "409": 444,
447
+ "410": 445,
448
+ "411": 446,
449
+ "412": 447,
450
+ "413": 448,
451
+ "414": 449,
452
+ "415": 450,
453
+ "416": 451,
454
+ "417": 452,
455
+ "418": 453,
456
+ "419": 454,
457
+ "420": 455,
458
+ "421": 456,
459
+ "422": 457,
460
+ "423": 458,
461
+ "424": 459,
462
+ "425": 460,
463
+ "426": 461,
464
+ "427": 462,
465
+ "428": 463,
466
+ "429": 464,
467
+ "430": 465,
468
+ "431": 466,
469
+ "432": 467,
470
+ "433": 468,
471
+ "434": 469,
472
+ "435": 470,
473
+ "436": 471,
474
+ "437": 472,
475
+ "438": 473,
476
+ "439": 474,
477
+ "440": 475,
478
+ "441": 476,
479
+ "442": 477,
480
+ "443": 478,
481
+ "444": 479,
482
+ "445": 480,
483
+ "446": 481,
484
+ "447": 482,
485
+ "448": 483,
486
+ "449": 484,
487
+ "450": 485,
488
+ "451": 486,
489
+ "452": 487,
490
+ "453": 488,
491
+ "454": 489,
492
+ "455": 490,
493
+ "456": 491,
494
+ "457": 492,
495
+ "458": 493,
496
+ "459": 494,
497
+ "460": 495,
498
+ "461": 496,
499
+ "462": 497,
500
+ "463": 498,
501
+ "464": 499,
502
+ "465": 500,
503
+ "466": 501,
504
+ "467": 502,
505
+ "468": 503,
506
+ "469": 504,
507
+ "470": 505,
508
+ "471": 506,
509
+ "472": 507,
510
+ "473": 508,
511
+ "474": 509,
512
+ "475": 510,
513
+ "476": 511,
514
+ "477": 512,
515
+ "478": 513,
516
+ "479": 514,
517
+ "480": 515,
518
+ "481": 516,
519
+ "482": 517,
520
+ "483": 518,
521
+ "484": 519,
522
+ "485": 520,
523
+ "486": 521,
524
+ "487": 522,
525
+ "488": 523,
526
+ "489": 524,
527
+ "490": 525,
528
+ "491": 526,
529
+ "492": 527,
530
+ "493": 528,
531
+ "494": 529,
532
+ "495": 530,
533
+ "496": 531,
534
+ "497": 532,
535
+ "498": 533,
536
+ "499": 534,
537
+ "500": 535,
538
+ "501": 536,
539
+ "502": 537,
540
+ "503": 538,
541
+ "504": 539,
542
+ "505": 540,
543
+ "506": 541,
544
+ "507": 542,
545
+ "508": 543,
546
+ "509": 544,
547
+ "510": 545,
548
+ "511": 546
549
+ }