Error when deploying the model on SageMaker: ValueError: sharded is not supported for AutoModel
import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# sagemaker config (role and llm_image were undefined in the original snippet;
# these are the usual definitions)
role = sagemaker.get_execution_role()                     # IAM execution role
llm_image = get_huggingface_llm_image_uri("huggingface")  # TGI container image
instance_type = "ml.g5.48xlarge"
number_of_gpu = 8
health_check_timeout = 900

# Define Model and Endpoint configuration parameters
config = {
    'HF_MODEL_ID': "liuhaotian/llava-v1.5-13b",  # model_id from hf.co/models
    'SM_NUM_GPUS': json.dumps(number_of_gpu),  # number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(1024),  # max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(2048),  # max length of the generation (including input text)
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),  # limits the number of tokens that can be processed in parallel during generation
    'HF_MODEL_QUANTIZE': "bitsandbytes",  # quantize with bitsandbytes (remove to disable)
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config
)

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,  # 15 minutes to load the model
)
LLaVA is not part of HF Transformers, so the TGI container cannot serve it natively; you need to use the llava git repo's LlavaLlamaForCausalLM model loader.
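
As a sketch of what that looks like (assuming the llava package from https://github.com/haotian-liu/LLaVA is installed from source; load_pretrained_model is that repo's helper, not a transformers or sagemaker API):

from llava.model.builder import load_pretrained_model

# Load LLaVA with the repo's own loader instead of transformers' AutoModel.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="liuhaotian/llava-v1.5-13b",
    model_base=None,               # full checkpoint, no separate LoRA base
    model_name="llava-v1.5-13b",
)

On SageMaker that means packaging a custom inference script around this loader instead of relying on the stock TGI image, which only serves architectures it supports natively.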