I got this message below
Sizes of tensors must match except in dimension 0. Expected size 128 but got size 256 for tensor number 1 in the list.
## Simple Fix - Height: 832
Actual Grok reply:
Let’s dive deeper into your workflow to resolve the error in the `PyramidFlowSampler` node. The error message indicates a shape mismatch: "can't divide axis of length 23 in chunks of 2," and the tensor shape is `[2, 1, 23, 40, 16]`. This tensor is being processed in a `rearrange` operation with the pattern `b t (h p1) (w p2) c -> b (t h w) (p1 p2 c)`, where `p1 = 2` and `p2 = 2`. Let’s break this down together.
First, let’s interpret the tensor shape `[2, 1, 23, 40, 16]` in the context of the rearrange pattern:
- `b` (batch size) = 2
- `t` (time or frames) = 1
- `h p1` (height dimension) = 23, with `p1 = 2`
- `w p2` (width dimension) = 40, with `p2 = 2`
- `c` (channels) = 16
The error occurs because the operation expects to split the height dimension (`h p1 = 23`) into chunks of `p1 = 2`. This means `h * p1 = 23`, so `h` would have to be `23 / 2 = 11.5`. Since `h` must be an integer, 23 is not divisible by 2, causing the shape mismatch. The width dimension (`w p2 = 40`) must similarly be divisible by `p2 = 2`, which it is (`40 / 2 = 20`).
Now, let’s trace back to where this tensor is coming from. The `PyramidFlowSampler` (Node 50) has an `input_latent` coming from the `PyramidFlowVAEEncode` (Node 54). The `PyramidFlowVAEEncode` takes an image from the `ImageScale` node (Node 57), which resizes the input image to 1280x768 pixels, and then encodes this image into a latent representation.
Here’s the key question: when the image of size 1280x768 is encoded into the latent space, how does the VAE transform the dimensions? Typically, a VAE downscales the spatial dimensions by a factor related to its architecture (often a power of 2, like 8 or 16). Let’s assume a downscaling factor of 8 (a common choice in many VAEs):
- Height: 768 / 8 = 96
- Width: 1280 / 8 = 160
So, the latent shape might be something like `[batch, channels, 96, 160]`. However, the error shows a shape of `[2, 1, 23, 40, 16]`. Notice that:
- 96 / 4 = 24 (close to 23)
- 160 / 4 = 40
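A quick arithmetic check (an assumption, not a reading of the actual VAE code) shows which overall downscale factor best explains the observed latent spatial dims:

```python
# For a 1280x768 input, compute the latent spatial size for a few
# common VAE downscale factors and compare against the observed 23x40.
for factor in (8, 16, 32):
    h, w = 768 // factor, 1280 // factor
    print(f"factor {factor}: latent {h}x{w}")
# factor 32 gives 24x40, closest to the observed 23x40 -- consistent
# with an off-by-one on the height from padding or cropping.
```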
This suggests there might be an additional downscaling or reshaping happening, possibly within the `PyramidFlowVAEEncode` or the sampler itself. The `PyramidFlowVAEEncode` has a widget value `overlap_factor = 0.25`, which might affect how the image is processed, but the main issue seems to be the height dimension (23) not being divisible by 2.
Let’s think about the image dimensions. The `PyramidFlowSampler` also takes `width` and `height` inputs (1280 and 768), which match the `ImageScale` node. However, the latent tensor’s spatial dimensions (23 and 40) suggest that the VAE encoding process has produced a shape that doesn’t align with the sampler’s expectations: the sampler expects the latent height and width to be divisible by the patch size (`p1 = 2`, `p2 = 2`).
Here’s a hypothesis: the VAE encoding process might be producing a latent height of 23 due to the specific downscaling and padding behavior of the `PyramidFlowVAEEncode`. To fix this, we need the latent dimensions to be divisible by 2. Since 23 is odd, we need to adjust the input image dimensions so that, after downscaling, the latent height and width are even numbers.
Let’s calculate what the input image dimensions should be:
- Assume the VAE downscales by a factor of 32 (since 768 / 32 = 24 and 1280 / 32 = 40, but we’re seeing 23 instead of 24 for height, possibly due to padding or cropping).
- For the latent height to be divisible by 2, the original height should be such that `height / 32` is even. For example:
  - If height = 768, then 768 / 32 = 24 (even, but we’re getting 23, possibly due to how the VAE handles boundaries).
  - If height = 832, then 832 / 32 = 26 (even).
- For the width: 1280 / 32 = 40, which is already even.

So, the issue might be with the height. Let’s try adjusting the height to 832 (the next multiple of 32 above 768 that still gives an even latent dimension). This would make the latent height 26 (832 / 32), which is divisible by 2.
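The candidate heights above can be sanity-checked with a short script (again assuming a downscale factor of 32; 800 is included to show why the *next even* multiple matters, not just the next multiple):

```python
# Check which candidate image heights yield an even latent height
# under an assumed VAE downscale factor of 32.
for height in (768, 800, 832):
    latent_h = height // 32
    print(f"height {height} -> latent height {latent_h}, even: {latent_h % 2 == 0}")
```

800 / 32 = 25 is odd, so 832 is the first height above 768 whose latent height (26) divides cleanly by the patch size of 2.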
What do you think about adjusting the height in the `ImageScale` node (Node 57) and the `PyramidFlowSampler` node (Node 50) to 832 instead of 768? You can do this by changing the `height` value in the `PrimitiveNode` (Node 59) from 768 to 832. Does that sound like a reasonable step to try? What might be the implications of changing the image dimensions like this?
https://grok.com/share/bGVnYWN5_701c5593-7224-471a-b05b-1b589e6e71db