In Decoder-type LLMs like GPT4 or Mistral-Large, the output is generated one token (=word part) at a time. That's why they're nicknamed "stochastic parrots": the "thinking" process only happens one step at a time, so it can seem really myopic.
๐ Given its input sentence like "๐๐ฉ๐ข๐ต ๐ช๐ด ๐ต๐ฉ๐ฆ 7๐ต๐ฉ ๐๐ช๐ฃ๐ฐ๐ฏ๐ข๐ค๐ค๐ช ๐ฏ๐ถ๐ฎ๐ฃ๐ฆ๐ณ? ๐๐ฉ๐ฆ 7๐ต๐ฉ ๐๐ช๐ฃ๐ฐ๐ฏ๐ข๐ค๐ค๐ช ๐ฏ๐ถ๐ฎ๐ฃ๐ฆ๐ณ", the Decoder LLM generates, for each token in its vocabulary, a score that represents this token's probability of coming next. For instance: "๐๐จ" gets score 0.56, and "๐๐๐ฃ" gets score 0.35.
๐ค ๐๐ซ๐๐๐๐ฒ ๐๐๐๐จ๐๐ข๐ง๐ is the naive option where you simply take the next most probable token at each step. But this creates paths that maximize very short-term rewards, thus may overlook better paths for the long term (like this time when you played FIFA all evening and arrived unprepared to your school exam on the next day). In our example, the next highest score token might be "๐๐จ", but this will strongly bias the LLM towards giving an hasty response. On the opposite, starting with "๐๐๐ฃ" could have been completed with "๐ฃ๐ฆ ๐ฐ๐ฃ๐ต๐ข๐ช๐ฏ๐ฆ๐ฅ ๐ง๐ณ๐ฐ๐ฎ ๐ค๐ฐ๐ฎ๐ฑ๐ถ๐ต๐ช๐ฏ๐จ ๐ฑ๐ณ๐ฆ๐ท๐ช๐ฐ๐ถ๐ด ๐๐ช๐ฃ๐ฐ๐ฏ๐ข๐ค๐ค๐ช ๐ฏ๐ถ๐ฎ๐ฃ๐ฆ๐ณ๐ด ๐ง๐ช๐ณ๐ด๐ต", which steers the LLM towards a correct reasoning!
๐บ๏ธ ๐๐๐๐ฆ ๐ฌ๐๐๐ซ๐๐ก improves on greedy decoding by generating at each step several paths - called beams - instead of one. This allows the generation to explore a much larger space, thus find better completions. In our example, both the "๐๐จ" and the "๐๐๐ฃ" completion could be tested. โ
๐ I've created a tool to let you visualize it, thank you @joaogante for your great help! ๐๐ง๐ฎ ๐๐ฉ ๐๐๐ง๐: m-ric/beam_search_visualizer
reacted to Jaward's
post with โค๏ธover 1 year ago
Retrieval-Augmented Generation (RAG) Redeemer of the "hallucination problem"
It is fair enough to argue that "hallucinations" in LLMs are just mere reflections of what we humans occasionally do - well it gets worse as we get older, but these models are brain inspired, thus such behaviors are likely inherently unavoidable. After all, we are just dreamers trying make sense of this life.
The best we can do is minimize and control it - but humanly how? By first feeding on relevant facts and then developing a habit that allows us to easily access those facts when needed. This is what RAG is all about - it's just a control mechanism that keeps the LLM aligned with reality and fact.
But How Does RAG Work?
Well, to some extent it is domain-specific but the overall workflow boils down to the following:
1. It makes use of a retrieval mechanism that hunts for facts relevant to a query - this involves an end-to-end backpropagation that leverages a retriever (Query Encoder + Document Index or Source of Truth) with a pre-trained generative model.
2. The generative model then uses the facts retrieved, performs some verification to give a more accurate response.
To summarize, the RAG architecture houses a pre-existing knowledge source model (termed parametric memory), which then utilizes a Source-of-Truth model or vector indexed data (termed non-parametric memory) that is accessed by a pre-trained neural retriever, in order to produce more informed, contextually appropriate and factually correct responses.
Sort of a "Genius Engine" if you might say. If only we humans could harness such, AGI would be much much sooner lol.
In the meantime, I have been Jaward Sesay (Chinese name ่ๆฐ Sujie) - a young Sierra Leonean, aspiring AI Researcher. I like to read, share and try implementing AI research papers. Also like dunking on big tech while rooting for open-source. My mentor @karpathy, I dream of him following me back on X lol. Thanks.