patembed-base
This is a sentence-transformers model trained specifically for patent text embeddings. It is part of the PatenTEB project, which provides state-of-the-art models for patent document understanding and retrieval.
Note: This model uses task-specific instruction prompts during inference for optimal performance (see the prompt example under Usage).
Model Details
- Model Type: Sentence Transformer
- Base Architecture: Distilled from patembed-large using layers {0,2,4,6,8,10,12,14,16,18,20,22}
- Parameters: 193M
- Number of Layers: 12
- Hidden Size: 1024
- Embedding Dimension: 768
- Max Sequence Length: 512 tokens
- Language: English
- License: CC BY-NC-SA 4.0
Model Description
This model is the primary deployment target, distilled from patembed-large. It retains the 1024 hidden size and projects down to 768-dimensional embeddings.
This model is part of the patembed family, developed through multi-task learning on 13 training tasks from the PatenTEB benchmark. For detailed information about the training methodology, architecture, and comprehensive evaluation results, please refer to our paper.
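As a quick sanity check, the dimensions listed above can be read directly off the loaded model. A minimal sketch (the printed values are simply the figures stated in Model Details):
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('datalyes/patembed-base')
print(model.get_sentence_embedding_dimension())  # 768 (embedding dimension)
print(model.max_seq_length)  # 512 (max sequence length in tokens)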
Usage
Using Sentence Transformers
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer('datalyes/patembed-base')
# Encode patent texts
patent_texts = [
    "A method for manufacturing semiconductor devices...",
    "An apparatus for processing chemical compounds...",
]
embeddings = model.encode(patent_texts)
# Compute similarity
from sentence_transformers import util
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
Using Transformers
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('datalyes/patembed-base')
model = AutoModel.from_pretrained('datalyes/patembed-base')
def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padded positions
    token_embeddings = model_output[0]  # first element holds token-level embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Tokenize and encode
texts = ["A method for manufacturing semiconductor devices..."]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded)
    embeddings = mean_pooling(model_output, encoded['attention_mask'])
    embeddings = F.normalize(embeddings, p=2, dim=1)
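Because the embeddings are L2-normalized in the last step, cosine similarity reduces to a plain dot product. Continuing the snippet above (encode two or more texts to get a meaningful matrix):
# Pairwise cosine similarities: dot products of unit-length vectors
similarities = embeddings @ embeddings.T
print(similarities)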
Patent Retrieval Example
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('datalyes/patembed-base')
# Query patent
query = "Method for reducing power consumption in mobile devices"
# Candidate patents
candidates = [
    "A power management system for portable electronic devices...",
    "Chemical composition for battery manufacturing...",
    "Method for wireless data transmission in mobile networks...",
]
# Encode and retrieve
query_emb = model.encode(query)
candidate_embs = model.encode(candidates)
# Compute similarities
scores = util.cos_sim(query_emb, candidate_embs)[0]
# Get ranked results
results = [(candidates[i], scores[i].item()) for i in range(len(candidates))]
results.sort(key=lambda x: x[1], reverse=True)
for patent, score in results:
    print(f"Score: {score:.4f} - {patent[:100]}...")
Intended Use
This model is designed for patent-specific tasks including:
- Patent search and retrieval
- Prior art search
- Patent classification and clustering (see the clustering sketch after this list)
- Technology landscape analysis
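As a concrete illustration of the clustering use case, the sketch below groups patent abstracts with scikit-learn's KMeans. The sample texts and the choice of two clusters are illustrative assumptions, not part of the model:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('datalyes/patembed-base')
abstracts = [
    "A power management system for portable electronic devices...",
    "Method for wireless data transmission in mobile networks...",
    "Chemical composition for battery manufacturing...",
    "An electrolyte additive for lithium-ion batteries...",
]
embeddings = model.encode(abstracts, normalize_embeddings=True)
# Illustrative choice: 2 clusters (roughly electronics vs. chemistry)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
for text, label in zip(abstracts, kmeans.labels_):
    print(label, text[:60])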
For detailed training methodology, evaluation protocols, and performance analysis, please refer to our paper.
Citation
If you use this model, please cite our paper:
@misc{ayaou2025patentebcomprehensivebenchmarkmodel,
      title={PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding}, 
      author={Iliass Ayaou and Denis Cavallucci},
      year={2025},
      eprint={2510.22264},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.22264}
}
Paper: PatenTEB on arXiv (https://arxiv.org/abs/2510.22264)
License
This model is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
Key Terms:
- ✅ You can use, share, and adapt the model
- ✅ You must give appropriate credit
- ❌ You may not use the model for commercial purposes
- ⚠️ If you adapt or build upon this model, you must distribute under the same license
For full license details: https://creativecommons.org/licenses/by-nc-sa/4.0/
Contact
- Authors: Iliass Ayaou, Denis Cavallucci
- Institution: ICUBE Laboratory, INSA Strasbourg
- GitHub: PatentTEB/PatentTEB
- HuggingFace: datalyes