Elpis-VL-7B
Introduction
With the rapid development of artificial intelligence, large vision-language models have demonstrated astonishing capabilities, bringing unprecedented transformational opportunities to the security field. Building on this, we conducted a series of explorations and are releasing our Elpis-VL-7B model.
Our technical blog: https://zhuanlan.zhihu.com/p/1910765935360451665
Core Technical Highlights
1. End-to-End Synthetic Data Pipeline for Security Domains
Multi-Source Data Acquisition: Integrates real-world surveillance video (anonymized), public datasets, and automatic annotation pipelines to build a foundational security data pool.
Scene-Targeted Content Generation: Employs generative models to synthesize rare or complex scenarios such as hazardous operations, low-light conditions, and crowd occlusions.
Vision-Language Bridge for QA Generation: Utilizes LVLMs and DS-R1 to produce thought-process-style text QA pairs from the synthetic images, yielding cold-start datasets tailored to safety-critical tasks (see the sketch directly below this list).
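As a rough illustration of the QA-generation step (not the production pipeline), the sketch below queries an LVLM through an OpenAI-compatible endpoint for a reasoning-style QA pair over a synthetic image. The endpoint, model name, prompt, and file path are all placeholders.

import base64
from openai import OpenAI

# Illustrative only: endpoint, model name, and prompt are placeholders.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

def generate_qa(image_path: str) -> str:
    # Encode the synthetic image as a base64 data URL.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    # Ask the LVLM for a reasoning-style QA pair grounded in the image.
    response = client.chat.completions.create(
        model="any-lvlm",  # placeholder model name
        messages=[
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": f"data:image;base64,{image_b64}"}},
                {"type": "text", "text": "Write one safety-inspection question about this scene and answer it step by step."},
            ]},
        ],
    )
    return response.choices[0].message.content

print(generate_qa("/path/to/synthetic_image.png"))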
2. RL-Based Robustness Enhancement for Long-Tail & Hard Cases
Scenario-Aware Reward Learning: Introduces reinforcement-based optimization to improve model robustness under ambiguous conditions such as occlusion, motion blur, and nighttime scenes.
Hard Case Mining Strategies: Constructs high-value RL training sets through methods like failure case analysis, predictive deviation detection, and adversarial simulation (e.g., lighting, occlusion).
Sustainable Model Adaptation Loop: Establishes a feedback-driven enhancement cycle that continually adapts the model to difficult scenarios, improving generalization and deployment reliability (a simplified sketch of the hard-case mining step follows this list).
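The sketch below is a simplified illustration of mining hard cases by predictive deviation, not the actual training code: samples whose predictions disagree with reference labels are collected for RL training. The scoring function and threshold are placeholders.

from dataclasses import dataclass

@dataclass
class Sample:
    image: str          # path or URL of the frame
    question: str       # safety-related question
    reference: str      # ground-truth answer

def agreement_score(prediction: str, reference: str) -> float:
    # Placeholder metric: token overlap between prediction and reference.
    pred, ref = set(prediction.lower().split()), set(reference.lower().split())
    return len(pred & ref) / max(len(ref), 1)

def mine_hard_cases(samples, predict, threshold=0.5):
    # `predict` is any callable mapping (image, question) -> answer string.
    hard_cases = []
    for s in samples:
        if agreement_score(predict(s.image, s.question), s.reference) < threshold:
            hard_cases.append(s)  # keep failures / large deviations for RL
    return hard_cases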
Performance
Using the model
Loading with Hugging Face Transformers
To load the model with Hugging Face Transformers, use the following snippet:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Beagledata001/Elpis-VL-7B", torch_dtype="auto", device_map="auto"
)

# Default processor
processor = AutoProcessor.from_pretrained("Beagledata001/Elpis-VL-7B")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "http://buudoo-virtual-human.oss-cn-beijing.aliyuncs.com/security_scene/elipis-vl-demo.png",
            },
            {"type": "text", "text": "fire extinguisher equipped?"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
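For locally stored images, qwen_vl_utils also accepts plain file paths or file:// URIs in the image field, so the same snippet can run without network access; the path below is a placeholder.

# Local image instead of a URL; the path is a placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/local/image.png"},
            {"type": "text", "text": "fire extinguisher equipped?"},
        ],
    }
]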
vLLM
Run the command below to start an OpenAI-compatible API service:
vllm serve Beagledata001/Elpis-VL-7B --served-model-name Elpis-VL --port 8000 --host 0.0.0.0 --dtype bfloat16 --limit-mm-per-prompt image=5,video=5
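Once the server is running, you can optionally verify that the model has been registered by listing the served models through the standard OpenAI-compatible endpoint (assuming the host and port from the command above):

from openai import OpenAI

# Quick sanity check: list the models registered with the vLLM server.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
print([m.id for m in client.models.list().data])  # should include "Elpis-VL"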
Then you can call the chat API, either with curl or with the OpenAI Python client:
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Elpis-VL",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "http://buudoo-virtual-human.oss-cn-beijing.aliyuncs.com/security_scene/elipis-vl-demo.png"}},
            {"type": "text", "text": "fire extinguisher equipped?"}
        ]}
    ]
    }'
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Elpis-VL",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "http://buudoo-virtual-human.oss-cn-beijing.aliyuncs.com/security_scene/elipis-vl-demo.png"
                    },
                },
                {"type": "text", "text": "fire extinguisher equipped?"},
            ],
        },
    ],
)
print("Chat response:", chat_response)
You can also pass base64-encoded local images (see the OpenAI API documentation for details):
import base64
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

image_path = "/path/to/local/image.png"
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image_text = encoded_image.decode("utf-8")
base64_qwen = f"data:image;base64,{encoded_image_text}"

chat_response = client.chat.completions.create(
    model="Elpis-VL",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": base64_qwen
                    },
                },
                {"type": "text", "text": "fire extinguisher equipped?"},
            ],
        },
    ],
)
print("Chat response:", chat_response)
License
Released under the Apache 2.0 License.
Contact
For questions, feedback, or collaboration, please open an issue on the Hugging Face model repository.