Thanks for this great article! I'm learning a lot from the nanoVLM project.
I'm not an expert in gen AI, but I noticed the attention calculation example seems to be missing the scaling factor √(d_k). Is this intentional, for simplification?
import math

d_k = K.shape[-1]                              # key dimension
attention_scores = (Q @ K.T) / math.sqrt(d_k)  # scale logits before softmax
From my understanding, this scaling prevents the dot products from growing too large with d_k and keeps the softmax in a region where the gradients don't vanish.
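For example, here's a quick sanity check I tried (a minimal sketch; the shapes, seed, and torch calls are my own assumptions, not from the article). With standard-normal entries, a query-key dot product has variance d_k, so without the √(d_k) scaling the softmax saturates to near one-hot:

import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 512
Q = torch.randn(1, d_k)            # one query vector (hypothetical shapes)
K = torch.randn(8, d_k)            # eight key vectors

raw = Q @ K.T                      # unscaled scores: std grows like sqrt(d_k)
scaled = raw / math.sqrt(d_k)      # scaled scores: std stays near 1

print(F.softmax(raw, dim=-1))      # nearly one-hot, so gradients vanish
print(F.softmax(scaled, dim=-1))   # smoother, trainable distribution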