Error when deploying the model on SageMaker: ValueError: sharded is not supported for AutoModel
import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# sagemaker config (role and llm_image were undefined in the original snippet;
# these are the usual definitions)
role = sagemaker.get_execution_role()                     # IAM execution role
llm_image = get_huggingface_llm_image_uri("huggingface")  # TGI container image
instance_type = "ml.g5.48xlarge"
number_of_gpu = 8
health_check_timeout = 900

# Define Model and Endpoint configuration parameters
config = {
    'HF_MODEL_ID': "liuhaotian/llava-v1.5-13b",  # model_id from hf.co/models
    'SM_NUM_GPUS': json.dumps(number_of_gpu),  # number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(1024),  # max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(2048),  # max length of the generation (including input text)
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),  # limits the number of tokens that can be processed in parallel during generation
    'HF_MODEL_QUANTIZE': "bitsandbytes",  # quantize with bitsandbytes (remove to disable)
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config
)

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,  # 15 minutes to load the model
)
LLaVA is not part of HF Transformers, so the TGI container cannot serve it natively; you need to use the llava git repo's LlavaLlamaForCausalLM model loader.
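
As a sketch of what that looks like (assuming the llava package from https://github.com/haotian-liu/LLaVA is installed from source; load_pretrained_model is that repo's helper, not a transformers or sagemaker API):

from llava.model.builder import load_pretrained_model

# Load LLaVA with the repo's own loader instead of transformers' AutoModel.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="liuhaotian/llava-v1.5-13b",
    model_base=None,               # full checkpoint, no separate LoRA base
    model_name="llava-v1.5-13b",
)

On SageMaker that means packaging a custom inference script around this loader instead of relying on the stock TGI image, which only serves architectures it supports natively.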