base_model:
- stabilityai/stable-diffusion-xl-base-1.0
- zer0int/LongCLIP-GmP-ViT-L-14
opendiffusionAI sdxl-longclip (RAW version V0.1)
BUGFIXES!!! (Latest update 2025/05/21)
Please note that the initial release had a bug in the tokenizer config.
tokenizer/tokenizer_config.json
needs to have "model_max_length": 248
Additionally, I padded out tokenizer_2 and text_encoder_2 so that normal Python code works with this model now.
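To confirm you have the fixed config, here is a quick sanity check (a sketch using the standard transformers tokenizer class; the subfolder name follows this repo's diffusers layout):

from transformers import CLIPTokenizer

# Load just the LongCLIP tokenizer from this repo and check its limit.
tok = CLIPTokenizer.from_pretrained(
    "opendiffusionai/sdxl-longcliponly", subfolder="tokenizer"
)
print(tok.model_max_length)  # 248 with the fixed config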
What is this?
This is the base SDXL model, but with the CLIP-L text encoder swapped out for "LongCLIP", and the weights for CLIP-G zeroed so that it has no effect.
In theory, it should be possible to replace the CLIP-G model with a dummy placeholder, so that the model takes about 3 GB less VRAM overall. But my attempts at that failed.
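For the curious, here is one way to peek at the zeroed CLIP-G weights (a sketch; it only inspects the weight tensors, and assumes the standard diffusers subfolder layout):

from transformers import CLIPTextModelWithProjection

# Load only the CLIP-G text encoder and report its largest weight magnitude.
te2 = CLIPTextModelWithProjection.from_pretrained(
    "opendiffusionai/sdxl-longcliponly", subfolder="text_encoder_2"
)
print(max(p.abs().max().item() for p in te2.parameters()))  # 0.0 if fully zeroed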
Why is this?
SDXL's largest limitations are primarily due to the lousy CLIP(s) it uses. Not only are they of poor quality, but they have hidden token-count limits that push the effective token count closer to 10. It is believed that one of the reasons CLIP-G was bolted on was to work around the limits of the original CLIP-L. But that makes the model harder to train, and needlessly takes up more memory and time.
So I created this version to experimentally demonstrate a better way.
This allows use of up to 248 tokens with SDXL natively, without the prompt-layering hacks that some diffusion programs resort to.
Long Prompt demo
This prompt is stolen from the LongCLIP demo prompts:
The serene lake surface resembled a flawless mirror, reflecting the soft blue sky and the surrounding greenery. A gentle breeze played across its expanse, ruffling the surface into delicate ripples that gradually spread out, disappearing into the distance. Along the shore, weeping willows swayed gracefully in the light breeze, their long branches dipping into the water, creating a soothing sound as they gently brushed against the surface. In the midst of this serene scene, a pure white swan floated gracefully on the lake. Its elegant neck curved into a graceful arc, giving it an air of dignity.
The point here is that the mention of a "swan" falls beyond the 77-token limit. So if you see a swan, LongCLIP is working.
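If you want to verify the arithmetic yourself, here is a rough sketch that counts how many tokens precede (and include) "swan" in the demo prompt:

from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained(
    "opendiffusionai/sdxl-longcliponly", subfolder="tokenizer"
)
prompt = "The serene lake surface resembled a flawless mirror, ..."  # paste the full demo prompt here
# Count tokens up to and including the first mention of "swan".
prefix = prompt[: prompt.index("swan") + len("swan")]
print(len(tok(prefix).input_ids))  # comfortably past 77, so stock CLIP-L never sees the swan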
Quality Comparison:
I had originally expected to need finetuning after the modifications. I was pleasantly surprised, then, to see that the new raw model combination performs better than SDXL base, out of the box.
Sample image links:
- Before: https://huggingface.co/opendiffusionai/sdxl-longcliponly/resolve/main/2025-05-18_13-26-23-training-sample-0-0-0.png
- After: https://huggingface.co/opendiffusionai/sdxl-longcliponly/resolve/main/2025-05-18_14-49-43-training-sample-0-0-0.png
(The face is more realistic, the clothes are better, and there is no duplication of the coffee cup on the table)
Known Problems
Needs training
This raw version of the model seems to work great with 3-5 word prompts, but quality decays beyond that. So I'm working on a finetuned version.
Some programs blow up on it.
Some programs hardcode a CLIP-L token limit of 77. There isn't a valid reason to do this; it is possible to detect the model's actual token limit.
This is extra unfortunate, since some programs do not merely disallow the extra tokens: they SEE that the new model supports 248 tokens, then complain, "hey! this model supports 248 tokens! I'm not going to allow you to use it."
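For frontend authors: detecting the real limit instead of hardcoding 77 is a one-liner once the pipeline is loaded (a sketch using the diffusers loader):

from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained("opendiffusionai/sdxl-longcliponly")
print(pipe.tokenizer.model_max_length)  # 248 for this model, 77 for stock SDXL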
Programs known to load the model
- InvokeAI
- SD.Next (a bit buggy at times, though)
safetensors file is not known to work (so use the huggingface loader!)
I did a blind conversion of the diffusers format to safetensors format using the OneTrainer conversion tools. However, it is not known to work; even OneTrainer itself does not load it.
It is provided here in the hope that someone so inclined may use it as input to fix the problems in the relevant checkpoint loader.
Sample Python code
The following works, as of 2025/05/21:
import torch
from diffusers import StableDiffusionXLPipeline
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-p", "--prompt", default="sad girl in snow")
parser.add_argument("-m", "--model", default="opendiffusionai/sdxl-longcliponly")
args = parser.parse_args()

# Standard diffusers loading; the bundled LongCLIP tokenizer accepts prompts
# of up to 248 tokens, so no special handling is needed.
pipe = StableDiffusionXLPipeline.from_pretrained(args.model)
pipe = pipe.to(torch.device("cuda"))

# Generate one image and save it.
result = pipe(args.prompt, guidance_scale=7.5, num_inference_steps=30)
image = result.images[0]
image.save("testimg.png")
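To reproduce the swan demo, save the script under any filename and pass the long prompt via -p. If VRAM is tight, passing torch_dtype=torch.float16 to from_pretrained is the usual diffusers memory-saving option, though that combination is untested here.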