Multi-modal structured data extraction. Give it a JSON schema, some input data (image or text), and optionally a few examples, and you get the schema back as filled-in JSON.
Kind of crazy.
THOUGHTS BELOW: It's super fast and super effective. It is so practical. I'm getting basically instant results on an A100 (40 GB) with vLLM and the 4B size.
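
For anyone wondering what the schema-in, filled-JSON-out flow looks like, here is a rough sketch in Python. It is not the model's real prompt format or API: `build_prompt` and `check_filled` are hypothetical helpers, the template layout is an assumption, and the model reply is hard-coded so the snippet runs without a GPU.

```python
import json


def build_prompt(schema, text, examples=None):
    """Assemble one prompt string from a JSON template, input text, and optional examples.

    The layout here is an assumption for illustration, not the model's real format.
    """
    parts = ["Template:", json.dumps(schema, indent=2)]
    for ex_input, ex_output in examples or []:
        parts += ["Example input:", ex_input, "Example output:", json.dumps(ex_output)]
    parts += ["Input:", text]
    return "\n".join(parts)


def check_filled(schema, filled):
    """Return True if `filled` has exactly the keys of `schema`, recursing into nested dicts."""
    if isinstance(schema, dict):
        return (
            isinstance(filled, dict)
            and set(filled) == set(schema)
            and all(check_filled(schema[k], filled[k]) for k in schema)
        )
    return True  # leaf values are not type-checked in this sketch


schema = {"name": "string", "order": {"item": "string", "quantity": "integer"}}
prompt = build_prompt(schema, "Ana ordered 3 keyboards.")

# Hypothetical model reply, hard-coded in place of a real inference call.
reply = '{"name": "Ana", "order": {"item": "keyboards", "quantity": 3}}'
filled = json.loads(reply)
print(check_filled(schema, filled))
```

In practice you would send `prompt` (or the image plus template) to whatever server is hosting the model and run the same key check on the real reply before trusting it.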