Framepack as an image edit/instruct model - a small study

Published September 7, 2025

Qwen Edit and Nano Banana are getting most of the attention in the image-editing space right now, and I recently saw posts about using the Wan I2V model for similar tasks. Having a history with Framepack and Hunyuan Video, I decided to give it a go and see how well it performs at these tasks.

I recently came across ComfyUI-FramePackWrapper_PlusOne, a fork of ComfyUI-FramePackWrapper that adds single-frame inference, and that is what I'm using here together with the regular FramePackWrapper workflow.
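To make "single-frame inference" concrete: a video diffusion model like this one denoises latents shaped (batch, channels, time, height, width), and single-frame generation is simply the time = 1 case of the same model. A minimal shape sketch in Python; the 16 latent channels and 8x spatial downsampling are my assumptions based on HunyuanVideo's VAE, not code taken from the fork:

```python
import torch

# Video latents are (batch, channels, time, latent_h, latent_w).
# The channel count (16) and 8x spatial downsampling are assumptions
# based on HunyuanVideo's VAE, not the fork's actual code.
b, c = 1, 16
w, h = 320, 416  # the resolution used for most examples below

section = torch.zeros(b, c, 9, h // 8, w // 8)  # a 9-frame latent window
single = torch.zeros(b, c, 1, h // 8, w // 8)   # single-frame inference
print(section.shape)  # torch.Size([1, 16, 9, 52, 40])
print(single.shape)   # torch.Size([1, 16, 1, 52, 40])
```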

What is the goal? Getting the model to produce an output where the input image is changed while style, concept, and character are maintained. This is of course something a video model must be able to do across video sequences, but there is additional value if it can do it for a single frame: better control for first frame/last frame workflows, storyboarding, lora creation, etc.

This is not a YouTube video, so I won't go on for 15 minutes before showing some results; I'll give some pointers along the way instead.

Here is Al Bundy, the humble shoe salesman, in the image used as input:

[input image]

The following generations used the resolution you see here, 320x416.

Prompt: "the man puts on sunglasses"

[result image]

This worked really well. Not cherry-picked.

Prompt: "the man puts on a hat"

[result image]

An OK result; I tried a couple of times. The thing about video models as instruct models is that you can't control where in the video sequence the output image lands. Had it produced a 49- or 77-frame sequence, the hat would most likely have ended up on the head. Here, we get a frame from somewhere in the middle of the motion.
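Those frame counts aren't arbitrary. Assuming HunyuanVideo's VAE compresses time 4x (my assumption), n latent frames decode to (n - 1) * 4 + 1 pixel frames, so 13 latents give 49 frames and 20 give 77. A quick sanity check:

```python
# Assumed mapping from latent frames to decoded pixel frames for a
# causal video VAE with a 4x temporal stride (e.g. HunyuanVideo).
def pixel_frames(latent_frames: int) -> int:
    return (latent_frames - 1) * 4 + 1

for n in (9, 13, 20):
    print(f"{n} latent frames -> {pixel_frames(n)} pixel frames")
# 9 latent frames -> 33 pixel frames
# 13 latent frames -> 49 pixel frames
# 20 latent frames -> 77 pixel frames
```

The more latent frames a section covers, the further the motion can progress before the frame you keep; this is also why the latent window size trick mentioned near the end of this article works.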

Prompt: "the man points at the camera"

[result image]

Again, a good result. First try.

Prompt: "the man touches his chin and looks superior"

[result image]

While I'm already impressed by the results, let's give it something more complex.

Prompt: "the man holds up a woman's shoe"

[result image]

[result image]

Technically correct. It started losing character adherence at the small resolution, so I had to go up to 560x720 to get these decent results.
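For the resolution change, the workflow at the end of this article uses a FramePackFindNearestBucket node (base 560) to snap the input to a model-friendly size. Here is a minimal sketch of what such a bucketing step typically does; the step size of 32 and the exact rounding are my assumptions, and the actual node may use a fixed bucket table instead:

```python
def nearest_bucket(width: int, height: int, base: int = 560, step: int = 32):
    """Scale toward base*base total pixels, keep the aspect ratio,
    and snap both sides to multiples of `step` (assumed behaviour)."""
    scale = (base * base / (width * height)) ** 0.5
    w = max(step, round(width * scale / step) * step)
    h = max(step, round(height * scale / step) * step)
    return w, h

print(nearest_bucket(320, 416))  # -> (480, 640) under these assumptions
```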

Conclusion:

Video models are strong at manipulating scenes while maintaining adherence, but they are limited by their training on sequences. Transformative prompts (one thing turns into another), or those that are spatially or temporally distant, are difficult for them. Here, image edit models excel, since they can be trained on anything: "The goat turns into a duck".

The prompts should also reflect the kind of data the model was trained on. Instead of "put a hat on the man's head", use "the man puts on a hat", since that is more akin to the type of training it's had.
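A few more rewordings in the same spirit (my own untested phrasings, mirroring the prompts used above):

- "give the man sunglasses" becomes "the man puts on sunglasses"
- "make him point at us" becomes "the man points at the camera"
- "zoom the image in" becomes "the camera zooms in slowly"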

Consider picking a video model when your goal is variations on the original image rather than something completely different, and when realism is key.

Edit: A realisation after posting, with only limited testing so far: you can increase the "latent window size" to raise the likelihood of getting a "later in the sequence" result, which is generally a stronger edit in many cases (the frame-count sketch earlier shows why: a larger window covers more frames of motion).

Bonus:

The technique seems to work with loras. If you have a concept you want to apply, a lora may help you do so (within reason). Here is an example with my own "camera loras" for Framepack:

Prompt (and lora): "The camera rotates to the left"

[result image]

Prompt (and lora): "The camera zooms in slowly"

[result image]

Subtle result, but then again it's the "slow" zoom.

I'm going to leave you with a few more generations, some successful and some less so.

Prompt: "The man puts on a jacket." (would probably have been better with a higher resolution)

[result image]

Prompt: "The man puts on a dress jacket." (Here I used 576p)

[result image]

Prompt: "The man smiles."

[result image]

I tried this prompt with many seeds and settings, and none worked well. This is the expected result, since the model has never seen Al Bundy smile.

Workflow used for examples:

```json
{ "last_node_id": 86, "last_link_id": 236, "nodes": [ { "id": 55, "type": "MarkdownNote", "pos": [ 567.05908203125, -628.8865966796875 ], "size": [ 459.8609619140625, 285.9714660644531 ], "flags": {}, "order": 0, "mode": 0, "inputs": [], "outputs": [], "properties": {}, "widgets_values": [ "Model links:\n\nhttps://huggingface.co/Kijai/HunyuanVideo_comfy/blob/main/FramePackI2V_HY_fp8_e4m3fn.safetensors\n\nhttps://huggingface.co/Kijai/HunyuanVideo_comfy/blob/main/FramePackI2V_HY_bf16.safetensors\n\nsigclip:\n\nhttps://huggingface.co/Comfy-Org/sigclip_vision_384/tree/main\n\ntext encoder and VAE:\n\nhttps://huggingface.co/Comfy-Org/HunyuanVideo_repackaged/tree/main/split_files" ], "color": "#432", "bgcolor": "#653" }, { "id": 27, "type": "FramePackTorchCompileSettings", "pos": [ 623.3660278320312, -140.94215393066406 ], "size": [ 531.5999755859375, 202 ], "flags": {}, "order": 1, "mode": 0, "inputs": [], "outputs": [ { "name": "torch_compile_args", "type": "FRAMEPACKCOMPILEARGS", "links": [], "slot_index": 0 } ], "properties": { "Node name for S&R": "FramePackTorchCompileSettings", "aux_id": "lllyasviel/FramePack", "ver": "0e5fe5d7ca13c76fb8e13708f4b92e7c7a34f20c" }, "widgets_values": [ "inductor", false, "default", false, 64, true, true ] }, { "id": 12, "type": "VAELoader", "pos": [ 570.5363159179688, -282.70068359375 ], "size": [ 469.0488586425781, 58 ], "flags": {}, "order": 2, "mode": 0, "inputs": [], "outputs": [ { "name": "VAE", "type": "VAE", "links": [ 163, 164 ], "slot_index": 0 } ], "properties": { "Node name for S&R": "VAELoader", "cnr_id": "comfy-core", "ver": "0.3.28" }, "widgets_values": [ "hunyuan_video_vae_fp16.safetensors" ], "color": "#322", "bgcolor": "#533" }, { "id": 33, "type": "VAEDecodeTiled", "pos": [ 2433.923828125, -374.082275390625 ], "size": [ 315, 150 ], "flags": {}, "order": 14, "mode": 0, "inputs": [ { "name": "samples", "type": "LATENT", "link": 224 }, { "name": "vae", "type": "VAE", "link": 164 } ], "outputs": [ { "name": "IMAGE", "type": "IMAGE", "links": [ 165 ] } ], "properties": { "Node name for S&R": "VAEDecodeTiled", "cnr_id": "comfy-core", "ver": "0.3.28" }, "widgets_values": [ 256, 64, 64, 8 ], "color": "#322", "bgcolor": "#533" }, { "id": 13, "type": "DualCLIPLoader", "pos": [ 320.9956359863281, 166.8336181640625 ], "size": [ 340.2243957519531, 130 ], "flags": {}, "order": 3, "mode": 0, "inputs": [], "outputs": [ { "name": "CLIP", "type": "CLIP", "links": [ 102 ], "slot_index": 0 } ], "properties": { "Node name for S&R": "DualCLIPLoader", "cnr_id": "comfy-core", "ver": "0.3.28" }, "widgets_values": [ "clip-vit-large-patch14/model.safetensors", "llava_llama3_fp8/llava_llama3_fp8_scaled.safetensors", "hunyuan_video", "default" ], "color": "#432", "bgcolor": "#653" }, { "id": 52, "type": "LoadFramePackModel", "pos": [ 1779.28662109375, -250.3265380859375 ], "size": [ 480.7601013183594, 174 ], "flags": {}, "order": 4, "mode": 0, "inputs": [ { "name": "compile_args", "type": "FRAMEPACKCOMPILEARGS", "shape": 7, "link": null }, { "name": "lora", "type": "FPLORA", "shape": 7, "link": null } ], "outputs": [ { "name": "model", "type": "FramePackMODEL", "links": [ 225 ], "slot_index": 0 } ], "properties": { "Node name for S&R": "LoadFramePackModel", "aux_id": "kijai/ComfyUI-FramePackWrapper", "ver": "49fe507eca8246cc9d08a8093892f40c1180e88f" }, "widgets_values": [ "hyvideo/hunyuan_fp8/FramePack_F1_I2V_HY_20250503_fp8_e4m3fn.safetensors", "bf16", "fp8_e4m3fn", "offload_device", "sageattn" ] }, { "id": 17, "type": "CLIPVisionEncode", "pos": [ 1133.875732421875, 
536.022705078125 ], "size": [ 380.4000244140625, 78 ], "flags": {}, "order": 9, "mode": 0, "inputs": [ { "name": "clip_vision", "type": "CLIP_VISION", "link": 162 }, { "name": "image", "type": "IMAGE", "link": 160 } ], "outputs": [ { "name": "CLIP_VISION_OUTPUT", "type": "CLIP_VISION_OUTPUT", "links": [ 222 ], "slot_index": 0 } ], "properties": { "Node name for S&R": "CLIPVisionEncode", "cnr_id": "comfy-core", "ver": "0.3.28" }, "widgets_values": [ "center" ], "color": "#233", "bgcolor": "#355" }, { "id": 23, "type": "VHS_VideoCombine", "pos": [ 2802.747802734375, -167.40277099609375 ], "size": [ 319.54736328125, 703.4266357421875 ], "flags": {}, "order": 15, "mode": 0, "inputs": [ { "name": "images", "type": "IMAGE", "link": 165 }, { "name": "audio", "type": "VHS_AUDIO", "shape": 7, "link": null }, { "name": "meta_batch", "type": "VHS_BatchManager", "shape": 7, "link": null }, { "name": "vae", "type": "VAE", "shape": 7, "link": null } ], "outputs": [ { "name": "Filenames", "type": "VHS_FILENAMES", "links": null } ], "properties": { "Node name for S&R": "VHS_VideoCombine", "cnr_id": "comfyui-videohelpersuite", "ver": "0a75c7958fe320efcb052f1d9f8451fd20c730a8" }, "widgets_values": { "frame_rate": 30, "loop_count": 0, "filename_prefix": "FramePack", "format": "video/h264-mp4", "pix_fmt": "yuv420p", "crf": 19, "save_metadata": true, "pingpong": false, "save_output": true, "videopreview": { "hidden": false, "paused": false, "params": { "filename": "", "subfolder": "", "type": "output", "format": "video/h264-mp4", "frame_rate": 30, "workflow": "", "fullpath": "" } } } }, { "id": 19, "type": "LoadImage", "pos": [ 115.66331481933594, 638.7184448242188 ], "size": [ 315, 314 ], "flags": {}, "order": 5, "mode": 0, "inputs": [], "outputs": [ { "name": "IMAGE", "type": "IMAGE", "links": [ 160, 234, 235 ], "slot_index": 0 }, { "name": "MASK", "type": "MASK", "links": [], "slot_index": 1 } ], "title": "Load Image: Start", "properties": { "Node name for S&R": "LoadImage", "cnr_id": "comfy-core", "ver": "0.3.28" }, "widgets_values": [ "", "image" ] }, { "id": 47, "type": "CLIPTextEncode", "pos": [ 883.961669921875, 181.54994201660156 ], "size": [ 400, 200 ], "flags": {}, "order": 7, "mode": 0, "inputs": [ { "name": "clip", "type": "CLIP", "link": 102 } ], "outputs": [ { "name": "CONDITIONING", "type": "CONDITIONING", "links": [ 180, 231 ], "slot_index": 0 } ], "properties": { "Node name for S&R": "CLIPTextEncode", "cnr_id": "comfy-core", "ver": "0.3.28" }, "widgets_values": [ "the man puts on a dress jacket" ], "color": "#232", "bgcolor": "#353" }, { "id": 71, "type": "FramePackFindNearestBucket", "pos": [ 864.4935302734375, 642.2278442382812 ], "size": [ 315, 78 ], "flags": {}, "order": 8, "mode": 0, "inputs": [ { "name": "image", "type": "IMAGE", "link": 234 } ], "outputs": [ { "name": "width", "type": "INT", "links": [ 175 ], "slot_index": 0 }, { "name": "height", "type": "INT", "links": [ 176 ], "slot_index": 1 } ], "properties": { "Node name for S&R": "FramePackFindNearestBucket" }, "widgets_values": [ 560 ] }, { "id": 83, "type": "FramePackSingleFrameSampler", "pos": [ 2293.34375, -21.796367645263672 ], "size": [ 380.4000244140625, 574 ], "flags": {}, "order": 13, "mode": 0, "inputs": [ { "name": "model", "type": "FramePackMODEL", "link": 225 }, { "name": "positive", "type": "CONDITIONING", "link": 231 }, { "name": "negative", "type": "CONDITIONING", "link": 221 }, { "name": "start_latent", "type": "LATENT", "link": 236 }, { "name": "image_embeds", "type": "CLIP_VISION_OUTPUT", "shape": 7, "link": 
222 }, { "name": "initial_samples", "type": "LATENT", "shape": 7, "link": null }, { "name": "reference_latent", "type": "LATENT", "shape": 7, "link": null }, { "name": "reference_image_embeds", "type": "CLIP_VISION_OUTPUT", "shape": 7, "link": null }, { "name": "input_mask", "type": "MASK", "shape": 7, "link": null }, { "name": "reference_mask", "type": "MASK", "shape": 7, "link": null } ], "outputs": [ { "name": "samples", "type": "LATENT", "links": [ 224 ], "slot_index": 0 } ], "properties": { "Node name for S&R": "FramePackSingleFrameSampler" }, "widgets_values": [ 20, false, 0.15, 1, 7.340000000000002, 0, 1696, "fixed", 9, 6, "unipc_bh1", false, 1, 1, 13 ] }, { "id": 73, "type": "ConditioningZeroOut", "pos": [ 1566.576171875, 206.9683837890625 ], "size": [ 317.4000244140625, 26 ], "flags": {}, "order": 10, "mode": 0, "inputs": [ { "name": "conditioning", "type": "CONDITIONING", "link": 180 } ], "outputs": [ { "name": "CONDITIONING", "type": "CONDITIONING", "links": [ 221 ], "slot_index": 0 } ], "properties": { "Node name for S&R": "ConditioningZeroOut" }, "widgets_values": [] }, { "id": 18, "type": "CLIPVisionLoader", "pos": [ 330.4465637207031, 403.89849853515625 ], "size": [ 388.87139892578125, 58 ], "flags": {}, "order": 6, "mode": 0, "inputs": [], "outputs": [ { "name": "CLIP_VISION", "type": "CLIP_VISION", "links": [ 162 ], "slot_index": 0 } ], "properties": { "Node name for S&R": "CLIPVisionLoader", "cnr_id": "comfy-core", "ver": "0.3.28" }, "widgets_values": [ "sigclip_vision.safetensors" ], "color": "#2a363b", "bgcolor": "#3f5159" }, { "id": 20, "type": "VAEEncode", "pos": [ 1815.9931640625, 536.2833862304688 ], "size": [ 210, 46 ], "flags": {}, "order": 12, "mode": 0, "inputs": [ { "name": "pixels", "type": "IMAGE", "link": 177 }, { "name": "vae", "type": "VAE", "link": 163 } ], "outputs": [ { "name": "LATENT", "type": "LATENT", "links": [ 236 ], "slot_index": 0 } ], "properties": { "Node name for S&R": "VAEEncode", "cnr_id": "comfy-core", "ver": "0.3.28" }, "widgets_values": [], "color": "#322", "bgcolor": "#533" }, { "id": 72, "type": "ImageScale", "pos": [ 1471.69873046875, 703.3203125 ], "size": [ 315, 170 ], "flags": {}, "order": 11, "mode": 0, "inputs": [ { "name": "image", "type": "IMAGE", "link": 235 }, { "name": "width", "type": "INT", "widget": { "name": "width" }, "link": 175 }, { "name": "height", "type": "INT", "widget": { "name": "height" }, "link": 176 } ], "outputs": [ { "name": "IMAGE", "type": "IMAGE", "links": [ 177 ], "slot_index": 0 } ], "properties": { "Node name for S&R": "ImageScale" }, "widgets_values": [ "nearest-exact", 512, 512, "disabled" ] } ], "links": [ [ 102, 13, 0, 47, 0, "CLIP" ], [ 160, 19, 0, 17, 1, "IMAGE" ], [ 162, 18, 0, 17, 0, "CLIP_VISION" ], [ 163, 12, 0, 20, 1, "VAE" ], [ 164, 12, 0, 33, 1, "VAE" ], [ 165, 33, 0, 23, 0, "IMAGE" ], [ 175, 71, 0, 72, 1, "INT" ], [ 176, 71, 1, 72, 2, "INT" ], [ 177, 72, 0, 20, 0, "IMAGE" ], [ 180, 47, 0, 73, 0, "CONDITIONING" ], [ 221, 73, 0, 83, 2, "CONDITIONING" ], [ 222, 17, 0, 83, 4, "CLIP_VISION_OUTPUT" ], [ 224, 83, 0, 33, 0, "LATENT" ], [ 225, 52, 0, 83, 0, "FramePackMODEL" ], [ 231, 47, 0, 83, 1, "CONDITIONING" ], [ 234, 19, 0, 71, 0, "IMAGE" ], [ 235, 19, 0, 72, 0, "IMAGE" ], [ 236, 20, 0, 83, 3, "LATENT" ] ], "groups": [ { "id": 2, "title": "Start Image", "bounding": [ 122.08280944824219, 490.6686706542969, 2032.7288818359375, 442.6904602050781 ], "color": "#3f789e", "font_size": 24, "flags": {} } ], "config": {}, "extra": { "ds": { "scale": 0.7513148009015777, "offset": [ -649.9556316071233, 
462.8578082303444 ] }, "frontendVersion": "1.17.3", "VHS_latentpreview": true, "VHS_latentpreviewrate": 0, "VHS_MetadataImage": true, "VHS_KeepIntermediate": true }, "version": 0.4 }
```
