# PSTUTS_RAG RAG evaluation

We compare the RAG chain with two embedding models: (a) the base model and (b) its fine-tuned counterpart.


In [50]:
%load_ext autoreload
%autoreload 2


In [1]:
base_model_HF_id = "Snowflake/snowflake-arctic-embed-s"
ft_model_HF_id = "mbudisic/snowflake-arctic-embed-s-ft-pstuts"

In [2]:
import getpass  # needed by set_api_key_if_not_present below
import logging
import os

import requests
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

from qdrant_client import QdrantClient

from pstuts_rag.rag import RAGChainInstance
import nest_asyncio


from dataclasses import dataclass
from datasets import load_dataset
from langsmith import EvaluationResult
from ragas import EvaluationDataset
from pstuts_rag.evaluator_utils import apply_rag_chain_inplace, summary_stats
from pandas import DataFrame
from langchain_core.runnables import Runnable

load_dotenv()

def set_api_key_if_not_present(key_name, prompt_message=""):
    if len(prompt_message) == 0:
        prompt_message=key_name
    if key_name not in os.environ or not os.environ[key_name]:
        os.environ[key_name] = getpass.getpass(prompt_message)


set_api_key_if_not_present("OPENAI_API_KEY")

logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("langchain").setLevel(logging.WARNING)
nest_asyncio.apply()



The raw data is now stored on Hugging Face, so we can download it directly.

In [3]:
import pstuts_rag.loader

url = "https://huggingface.co/datasets/mbudisic/PsTuts-VQA/raw/main/train.json"
resp = requests.get(url)
resp.raise_for_status()
group = url.split('/')[-1].split('.')[0]
docs_json = pstuts_rag.loader.load_json_string(resp.content.decode('utf-8'), group)



Now, let's create the base chain.

In [4]:
from langchain_openai import OpenAIEmbeddings

from langchain_huggingface import HuggingFaceEmbeddings


qdrant_client = QdrantClient(":memory:")


In [5]:
from dataclasses import dataclass
from dataclasses import field


@dataclass
class DataGroup:
    """Per-model container; every field is assigned after construction."""

    rag: RAGChainInstance = field(init=False)
    dataset: EvaluationDataset = field(init=False)
    result: EvaluationResult = field(init=False)
    statistics: DataFrame = field(init=False)

## Base model

In [6]:

base = DataGroup()
base.rag = RAGChainInstance(name="base",
                            qdrant_client=qdrant_client,
                            llm=ChatOpenAI(model="gpt-4.1-nano"),
                            embeddings=HuggingFaceEmbeddings(model_name=base_model_HF_id))



Now let's populate the datastore of the first chain, build the chain,
and test it with a sample question.


In [7]:
_ = await base.rag.build_chain(docs_json)
response = base.rag.rag_chain.invoke({"question":"What is a layer?"})
response.pretty_print()

<built-in function repr>

A layer is like a separate sheet in your Photoshop document that you can work on independently. You can add new layers, rename them, and change their order. These layers can hold different parts of your image, like colors or drawings, and you can manipulate them without affecting the rest of the image. (Timestamp: 00:02:21)
**REFERENCES**
[
  {
    "title": "Learn layer basics",
    "source": "https://images-tv.adobe.com/avp/vr/b758b4c4-2a74-41f4-8e67-e2f2eab83c6a/01a575ae-f8b7-486c-987b-bcb4f2f4e57d/3868e305-c73c-4931-82a0-5e46f5eb41e5_20170727011800.1280x720at2400_h264.mp4",
    "start": 141.29,
    "stop": 156.87
  },
  {
    "title": "Unlock the Background layer",
    "source": "https://images-tv.adobe.com/avp/vr/b758b4c4-2a74-41f4-8e67-e2f2eab83c6a/696245e0-aaad-42df-b48f-8b44b1f5211a/22729011-a533-48a4-a7a2-0b5f86d4eedd_20170727011751.1280x720at2400_h264.mp4",
    "start": 113.65,
    "stop": 227.99
  }
]


Formal evaluation uses the "golden" dataset, which is also stored
on Hugging Face.

We're going to evaluate on only a portion of it (the first 10 rows of the `train` split).


In [8]:

golden_small_hf = load_dataset("mbudisic/pstuts_rag_qa",split="train[:10]")

base.dataset = EvaluationDataset.from_hf_dataset(golden_small_hf)

In [9]:
_ = await apply_rag_chain_inplace(base.rag.rag_chain, base.dataset )
base.dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference
0,how i use adobe photoshop creative cloud for d...,[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,Here's how to use Perspective Warp in Adobe Ph...,"in adobe photoshop creative cloud, to use pers..."
1,wut is Adobee Photoshoop Cretive Cloud?,[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,Adobe Photoshop Creative Cloud is a version of...,Adobe Photoshop Creative Cloud is a version of...
2,"As a beginner Photoshop user, can you explain ...",[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,The Perspective Warp feature in Adobe Photosho...,Adobe Photoshop Creative Cloud's Perspective W...
3,Who is PhotoSpin in relation to the image used...,[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,PhotoSpin is the company that took the photogr...,PhotoSpin is the company that took the photogr...
4,"How you use Perspective Warp in Photoshop, wha...","[If I turn it on and off, you can see the befo...",[>> What I want to show you in this video is s...,Perspective Warp in Photoshop allows you to ch...,Perspective Warp in Photoshop let you change t...
5,What does the Perspective Warp feature in Phot...,"[If I turn it on and off, you can see the befo...",[>> What I want to show you in this video is s...,The Perspective Warp feature in Photoshop allo...,Perspective Warp in Photoshop allows you to ch...
6,As a Photoshop trainer developing step-by-step...,[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,"Based on the transcript, here is how the Persp...",The new Perspective Warp feature in Adobe Phot...
7,wut is adobee fotoshop cretive clowd?,[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,Adobe Photoshop Creative Cloud is a service th...,Adobe Photoshop Creative Cloud is a version of...
8,Wut duz Perspectiv Warp do in Photoshop?,"[If I turn it on and off, you can see the befo...",[>> What I want to show you in this video is s...,Perspective Warp in Photoshop allows you to ch...,Perspective Warp in Photoshop lets yu change t...
9,"How can I, as a Photoshop trainer, explain to ...",[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,"In the Perspective Warp tutorial, the trainer ...","In the Perspective Warp tutorial, the image us..."


Since we now have the dataset, let's run it through the evaluators.

In [10]:
from ragas.llms import LangchainLLMWrapper
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

In [11]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

base.result = evaluate(
    dataset=base.dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)



In [12]:
base.result.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,faithfulness,factual_correctness(mode=f1),answer_relevancy,context_entity_recall,noise_sensitivity(mode=relevant)
0,how i use adobe photoshop creative cloud for d...,[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,Here's how to use Perspective Warp in Adobe Ph...,"in adobe photoshop creative cloud, to use pers...",0.3,0.25,0.65,0.907531,0.4,0.0625
1,wut is Adobee Photoshoop Cretive Cloud?,[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,Adobe Photoshop Creative Cloud is a version of...,Adobe Photoshop Creative Cloud is a version of...,1.0,1.0,0.67,0.879631,1.0,0.2
2,"As a beginner Photoshop user, can you explain ...",[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,The Perspective Warp feature in Adobe Photosho...,Adobe Photoshop Creative Cloud's Perspective W...,0.4,1.0,0.56,0.933397,0.222222,0.25
3,Who is PhotoSpin in relation to the image used...,[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,PhotoSpin is the company that took the photogr...,PhotoSpin is the company that took the photogr...,1.0,1.0,1.0,0.914422,0.5,0.0
4,"How you use Perspective Warp in Photoshop, wha...","[If I turn it on and off, you can see the befo...",[>> What I want to show you in this video is s...,Perspective Warp in Photoshop allows you to ch...,Perspective Warp in Photoshop let you change t...,0.25,0.666667,0.33,0.959503,0.5,0.333333
5,What does the Perspective Warp feature in Phot...,"[If I turn it on and off, you can see the befo...",[>> What I want to show you in this video is s...,The Perspective Warp feature in Photoshop allo...,Perspective Warp in Photoshop allows you to ch...,0.5,0.666667,0.8,0.982226,1.0,0.333333
6,As a Photoshop trainer developing step-by-step...,[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,"Based on the transcript, here is how the Persp...",The new Perspective Warp feature in Adobe Phot...,0.272727,0.916667,0.5,0.945618,0.4,0.25
7,wut is adobee fotoshop cretive clowd?,[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,Adobe Photoshop Creative Cloud is a service th...,Adobe Photoshop Creative Cloud is a version of...,1.0,0.6,0.6,0.846981,1.0,0.0
8,Wut duz Perspectiv Warp do in Photoshop?,"[If I turn it on and off, you can see the befo...",[>> What I want to show you in this video is s...,Perspective Warp in Photoshop allows you to ch...,Perspective Warp in Photoshop lets yu change t...,1.0,0.666667,0.67,0.926159,1.0,0.333333
9,"How can I, as a Photoshop trainer, explain to ...",[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,"In the Perspective Warp tutorial, the trainer ...","In the Perspective Warp tutorial, the image us...",1.0,0.444444,0.76,0.849011,0.333333,0.222222


In [13]:
base.statistics = summary_stats(base.result.to_pandas())
print( base.statistics.select_dtypes(include="number") \
               .loc["Mean"] )
print( base.statistics.select_dtypes(include="number") \
.loc["StdDev"])

context_recall                      0.672273
faithfulness                        0.721111
factual_correctness(mode=f1)        0.654000
answer_relevancy                    0.914448
context_entity_recall               0.635556
noise_sensitivity(mode=relevant)    0.198472
Name: Mean, dtype: float64
context_recall                      0.111424
faithfulness                        0.081215
factual_correctness(mode=f1)        0.057081
answer_relevancy                    0.014215
context_entity_recall               0.102262
noise_sensitivity(mode=relevant)    0.041861
Name: StdDev, dtype: float64




## Fine-tuned model

In [14]:
ft = DataGroup()
ft.rag = RAGChainInstance(name="ft",
                            qdrant_client=qdrant_client,
                            llm=ChatOpenAI(model="gpt-4.1-nano"),
                            embeddings=HuggingFaceEmbeddings(model_name=ft_model_HF_id))


Some weights of BertModel were not initialized from the model checkpoint at mbudisic/snowflake-arctic-embed-s-ft-pstuts and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
_ = await ft.rag.build_chain(docs_json)
response = ft.rag.rag_chain.invoke({"question":"What is a layer?"})
response.pretty_print()

<built-in function repr>

A layer is something you can work with in Photoshop, like a part of your image that you can edit separately. When you make a new layer, it's like adding a new sheet of paper that you can move or fill with color without affecting the other parts of your image. (Timestamp: 1:41 - 1:56)
**REFERENCES**
[
  {
    "title": "Learn layer basics",
    "source": "https://images-tv.adobe.com/avp/vr/b758b4c4-2a74-41f4-8e67-e2f2eab83c6a/01a575ae-f8b7-486c-987b-bcb4f2f4e57d/3868e305-c73c-4931-82a0-5e46f5eb41e5_20170727011800.1280x720at2400_h264.mp4",
    "start": 141.29,
    "stop": 156.87
  },
  {
    "title": "Unlock the Background layer",
    "source": "https://images-tv.adobe.com/avp/vr/b758b4c4-2a74-41f4-8e67-e2f2eab83c6a/696245e0-aaad-42df-b48f-8b44b1f5211a/22729011-a533-48a4-a7a2-0b5f86d4eedd_20170727011751.1280x720at2400_h264.mp4",
    "start": 113.65,
    "stop": 227.99
  }
]


In [16]:
ft.dataset = EvaluationDataset.from_hf_dataset(golden_small_hf)

In [17]:
_ = await apply_rag_chain_inplace(ft.rag.rag_chain, ft.dataset )
ft.dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference
0,how i use adobe photoshop creative cloud for d...,[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,Here's how to use Perspective Warp in Photosho...,"in adobe photoshop creative cloud, to use pers..."
1,wut is Adobee Photoshoop Cretive Cloud?,[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,Adobe Photoshop Creative Cloud is mentioned as...,Adobe Photoshop Creative Cloud is a version of...
2,"As a beginner Photoshop user, can you explain ...",[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,Adobe Photoshop Creative Cloud's Perspective W...,Adobe Photoshop Creative Cloud's Perspective W...
3,Who is PhotoSpin in relation to the image used...,[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,PhotoSpin is the company that took the photogr...,PhotoSpin is the company that took the photogr...
4,"How you use Perspective Warp in Photoshop, wha...","[If I turn it on and off, you can see the befo...",[>> What I want to show you in this video is s...,Perspective Warp in Photoshop is a tool that a...,Perspective Warp in Photoshop let you change t...
5,What does the Perspective Warp feature in Phot...,"[If I turn it on and off, you can see the befo...",[>> What I want to show you in this video is s...,The Perspective Warp feature in Photoshop allo...,Perspective Warp in Photoshop allows you to ch...
6,As a Photoshop trainer developing step-by-step...,[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,"Based on the transcript, here's how the Perspe...",The new Perspective Warp feature in Adobe Phot...
7,wut is adobee fotoshop cretive clowd?,[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,Adobe Photoshop Creative Cloud is a version of...,Adobe Photoshop Creative Cloud is a version of...
8,Wut duz Perspectiv Warp do in Photoshop?,"[If I turn it on and off, you can see the befo...",[>> What I want to show you in this video is s...,Perspective Warp in Photoshop allows you to ch...,Perspective Warp in Photoshop lets yu change t...
9,"How can I, as a Photoshop trainer, explain to ...",[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,"In the Perspective Warp tutorial, the PhotoSpi...","In the Perspective Warp tutorial, the image us..."


In [18]:
ft.result = evaluate(
    dataset=ft.dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
ft.result.to_pandas()


Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,faithfulness,factual_correctness(mode=f1),answer_relevancy,context_entity_recall,noise_sensitivity(mode=relevant)
0,how i use adobe photoshop creative cloud for d...,[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,Here's how to use Perspective Warp in Photosho...,"in adobe photoshop creative cloud, to use pers...",0.3,0.25,0.47,0.896976,0.4,0.0625
1,wut is Adobee Photoshoop Cretive Cloud?,[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,Adobe Photoshop Creative Cloud is mentioned as...,Adobe Photoshop Creative Cloud is a version of...,1.0,0.6,0.5,0.876445,1.0,0.6
2,"As a beginner Photoshop user, can you explain ...",[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,Adobe Photoshop Creative Cloud's Perspective W...,Adobe Photoshop Creative Cloud's Perspective W...,0.4,1.0,0.5,0.93448,0.222222,0.0
3,Who is PhotoSpin in relation to the image used...,[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,PhotoSpin is the company that took the photogr...,PhotoSpin is the company that took the photogr...,1.0,0.5,0.67,0.914422,0.5,0.0
4,"How you use Perspective Warp in Photoshop, wha...","[If I turn it on and off, you can see the befo...",[>> What I want to show you in this video is s...,Perspective Warp in Photoshop is a tool that a...,Perspective Warp in Photoshop let you change t...,0.25,0.8,0.43,0.960246,0.5,0.2
5,What does the Perspective Warp feature in Phot...,"[If I turn it on and off, you can see the befo...",[>> What I want to show you in this video is s...,The Perspective Warp feature in Photoshop allo...,Perspective Warp in Photoshop allows you to ch...,0.5,0.666667,0.8,0.982226,1.0,0.333333
6,As a Photoshop trainer developing step-by-step...,[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,"Based on the transcript, here's how the Perspe...",The new Perspective Warp feature in Adobe Phot...,0.272727,1.0,0.47,0.0,0.4,0.3
7,wut is adobee fotoshop cretive clowd?,[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,Adobe Photoshop Creative Cloud is a version of...,Adobe Photoshop Creative Cloud is a version of...,1.0,1.0,0.8,0.820373,1.0,0.0
8,Wut duz Perspectiv Warp do in Photoshop?,"[If I turn it on and off, you can see the befo...",[>> What I want to show you in this video is s...,Perspective Warp in Photoshop allows you to ch...,Perspective Warp in Photoshop lets yu change t...,1.0,0.666667,0.67,0.926159,1.0,0.333333
9,"How can I, as a Photoshop trainer, explain to ...",[>> What I want to show you in this video is s...,[>> What I want to show you in this video is s...,"In the Perspective Warp tutorial, the PhotoSpi...","In the Perspective Warp tutorial, the image us...",1.0,1.0,0.67,0.879323,0.333333,0.4


In [19]:
print ("Base")
print( base.dataset[0].retrieved_contexts )
print( base.dataset[0].reference_contexts )
print ("FT")
print( ft.dataset[0].retrieved_contexts )
print( ft.dataset[0].reference_contexts )


Base
[">> What I want to show you in this video is something that is absolutely amazing. It's a brand new feature in Adobe Photoshop Creative Cloud called Perspective Warp. Now I have a photograph open. I didn't take this photo. It was taken by a company called PhotoSpin. And don't forget if you want to follow along, you can download the assets for this video. What I want to do first though is make a copy of it. I'm going to drag it down- this is one way to do it-make a copy. That is not necessary, but this way we get to see kind of a before and an after. Now it will work with just about any image, but your first test is to go up to the word Edit on the pull-down menu and go down, and you better see Perspective Warp. If you don't, no big deal. Just go out to the cloud, and download the latest version of Photoshop. Now what does it do? What does Perspective Warp do? It literally allows me to re-enter a three-dimensional world to change the perspective of the image as if, as the photogra

In [20]:
ft.statistics = summary_stats(ft.result.to_pandas())
print( ft.statistics.select_dtypes(include="number") \
               .loc["Mean"] )
print( ft.statistics.select_dtypes(include="number") \
.loc["StdDev"])

context_recall                      0.672273
faithfulness                        0.748333
factual_correctness(mode=f1)        0.598000
answer_relevancy                    0.819065
context_entity_recall               0.635556
noise_sensitivity(mode=relevant)    0.222917
Name: Mean, dtype: float64
context_recall                      0.111424
faithfulness                        0.081742
factual_correctness(mode=f1)        0.044392
answer_relevancy                    0.092153
context_entity_recall               0.102262
noise_sensitivity(mode=relevant)    0.064911
Name: StdDev, dtype: float64




## Statistics

Let's now pool the results from the base and FT models and see whether the
differences between them are statistically significant.
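
Before pooling, it is worth being clear about what the `StdDev` row holds. The values printed above appear to be standard errors of the mean (sample standard deviation divided by $\sqrt{n}$) rather than raw standard deviations. The sketch below checks this against the ten base `context_recall` scores from the results table. The helper name `mean_and_sem` is hypothetical and not part of `pstuts_rag`; this is only a consistency check, not a statement of how `summary_stats` is implemented.

```python
import math
from statistics import mean, stdev


def mean_and_sem(values):
    """Return (mean, standard error of the mean) for a list of scores."""
    n = len(values)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    return mean(values), stdev(values) / math.sqrt(n)


# Base-model context_recall scores, copied from the results table above.
base_context_recall = [0.3, 1.0, 0.4, 1.0, 0.25, 0.5, 0.272727, 1.0, 1.0, 1.0]

m, sem = mean_and_sem(base_context_recall)
print(f"mean={m:.6f}  sem={sem:.6f}")
```

Both numbers should come out close to the reported 0.672273 and 0.111424. If that reading is right, the `StdDev` values can be combined directly in a z-test without dividing by $\sqrt{n}$ again.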

In [42]:
from typing import Tuple
import pandas as pd
from pstuts_rag.evaluator_utils import combine_stats, z_test, summary_stats
# Create a new DataFrame with renamed index


In [None]:

means, stds = ( combine_stats( 
                              (base.statistics, ft.statistics), field, ["Base","FT"] ) 
               for field in ["Mean","StdDev"] 
               )
summary_base_ft = pd.concat(
    [means, stds],
    axis=1,
    keys=["Mean","StdDev"]
).swaplevel(axis=1).sort_index(axis=1)

summary_base_ft

Next, we'll do a bit of statistics.

We compute a z-test for each metric to determine whether its mean shifted
significantly between base and FT.

A small p-value (e.g. $< 0.05$) would indicate a statistically significant shift. Here, though,
we're simply looking for a `p` value that stands out as smaller than the rest.
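
For reference, a two-sample z-test from summary statistics can be sketched as follows. This is a minimal standalone version, not the `z_test` from `pstuts_rag.evaluator_utils`, whose implementation is not shown here; the name `z_test_sketch` is hypothetical. It assumes the spread arguments are standard errors of the means, takes the difference as second minus first, and returns a two-sided p-value.

```python
import math


def z_test_sketch(mean_a, mean_b, se_a, se_b):
    """Two-sample z-test from summary statistics (hypothetical helper).

    Returns (z, p), where z = (mean_b - mean_a) / sqrt(se_a^2 + se_b^2)
    and p is the two-sided p-value under a standard normal distribution.
    """
    pooled_se = math.sqrt(se_a ** 2 + se_b ** 2)
    if pooled_se == 0.0:
        raise ValueError("both standard errors are zero; z is undefined")
    z = (mean_b - mean_a) / pooled_se
    # Two-sided tail probability: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# answer_relevancy, base vs. FT, using the Mean and StdDev rows printed earlier.
z, p = z_test_sketch(0.914448, 0.819065, 0.014215, 0.092153)
print(f"z={z:.3f}  p={p:.3f}")
```

Under these assumptions the example gives roughly z = -1.02 and p = 0.31. With only ten questions per model the normal approximation is rough, which is one more reason to read the p-values comparatively rather than against a fixed threshold.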

In [22]:


significance = pd.DataFrame(columns=['z_score', 'p_value'])
# The columns of summary_base_ft form a MultiIndex ordered as (metric, stat),
# so each value is addressed as (metric, 'Mean') or (metric, 'StdDev').
for c in summary_base_ft.columns.get_level_values(0).unique():
    z, p = z_test(summary_base_ft.loc['Base', (c, 'Mean')], 
                  summary_base_ft.loc['FT', (c, 'Mean')], 
                  summary_base_ft.loc['Base', (c, 'StdDev')], 
                  summary_base_ft.loc['FT', (c, 'StdDev')])
    significance.loc[c] = [z, p]

significance = significance.sort_values(by='p_value')
significance

Unnamed: 0,z_score,p_value
answer_relevancy,-1.02295,0.306332
factual_correctness(mode=f1),-0.774432,0.438675
noise_sensitivity(mode=relevant),0.31648,0.751638
faithfulness,0.236245,0.813243
context_recall,0.0,1.0
context_entity_recall,0.0,1.0


What we see is that there is no difference in context recall: the mean is
identical for base and FT (z = 0, p = 1), and the same holds for context entity
recall. None of the other metrics shows a significant shift either; the smallest
p-value is about 0.31.

My guess is that this result has to do with the specific application.
These were audio transcripts of fairly short videos. Most transcripts therefore
fit completely into a single chunk, or a few chunks (nodes), of the knowledge graph
used to generate the golden data set.

At the same time, due to video diversity, the transcripts were quite distinct from
each other. Therefore, even the base embedding model likely did as good a job
as it could. Since the embedding models trained here were chosen so that they
could be fine-tuned on a laptop, it's hard to do better.

Let's test that by running a SOTA model (`text-embedding-3-small`) and seeing how well it does.

In [None]:
sota = DataGroup()
sota.rag = RAGChainInstance(name="sota",
                            qdrant_client=qdrant_client,
                            llm=ChatOpenAI(model="gpt-4.1-nano"),
                            embeddings=OpenAIEmbeddings(model="text-embedding-3-small"))

sota.dataset = EvaluationDataset.from_hf_dataset(golden_small_hf)
_ = await sota.rag.build_chain(docs_json)
_ = await apply_rag_chain_inplace(sota.rag.rag_chain, sota.dataset )
sota.result = evaluate(
    dataset=sota.dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config,
    show_progress=True
)
sota.statistics = summary_stats(sota.result.to_pandas())


<built-in function repr>



In [61]:
fields = ["Mean","StdDev"]
summary = pd.concat(
    [combine_stats( (base.statistics, sota.statistics, ft.statistics), field, ("Base","SOTA", "FT") ) 
               for field in fields ],
    axis=1,
    keys=fields
).swaplevel(axis=1).sort_index(axis=1)

summary

Unnamed: 0_level_0,answer_relevancy,answer_relevancy,context_entity_recall,context_entity_recall,context_recall,context_recall,factual_correctness(mode=f1),factual_correctness(mode=f1),faithfulness,faithfulness,noise_sensitivity(mode=relevant),noise_sensitivity(mode=relevant)
Unnamed: 0_level_1,Mean,StdDev,Mean,StdDev,Mean,StdDev,Mean,StdDev,Mean,StdDev,Mean,StdDev
Base,0.914448,0.014215,0.635556,0.102262,0.672273,0.111424,0.654,0.057081,0.721111,0.081215,0.198472,0.041861
SOTA,0.913301,0.013716,0.673333,0.091732,0.672273,0.111424,0.533,0.054651,0.610907,0.085214,0.199487,0.047707
FT,0.819065,0.092153,0.635556,0.102262,0.672273,0.111424,0.598,0.044392,0.748333,0.081742,0.222917,0.064911


Is FT any better or worse than the SOTA model?

In [62]:
significance = pd.DataFrame(columns=['z_score', 'p_value'])
# The columns of summary form a MultiIndex ordered as (metric, stat),
# so each value is addressed as (metric, 'Mean') or (metric, 'StdDev').
for c in summary.columns.get_level_values(0).unique():
    z, p = z_test(summary.loc['FT', (c, 'Mean')], 
                  summary.loc['SOTA', (c, 'Mean')], 
                  summary.loc['FT', (c, 'StdDev')], 
                  summary.loc['SOTA', (c, 'StdDev')])
    significance.loc[c] = [z, p]

significance = significance.sort_values(by='p_value')
significance


Unnamed: 0,z_score,p_value
faithfulness,-1.163832,0.244492
answer_relevancy,1.011458,0.311797
factual_correctness(mode=f1),-0.923176,0.355916
noise_sensitivity(mode=relevant),-0.290844,0.77117
context_entity_recall,0.274994,0.783321
context_recall,0.0,1.0


Looks doubtful: no p-value falls below 0.24. If anything, replacing FT with SOTA lowers faithfulness and factual correctness, though it raises answer relevancy.

Notice that the context recall is still exactly the same.
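
One way to probe why context recall does not move is to compare the retrieved contexts directly: if two chains return the same chunks for every question, a retrieval-only metric has little room to differ. The sketch below uses a hypothetical helper, `identical_retrieval_fraction`, on plain lists of strings with toy data. It is an illustration and was not part of the original run.

```python
from typing import Sequence


def identical_retrieval_fraction(
    contexts_a: Sequence[Sequence[str]],
    contexts_b: Sequence[Sequence[str]],
) -> float:
    """Fraction of questions for which both chains retrieved the same set of chunks."""
    if len(contexts_a) != len(contexts_b):
        raise ValueError("both inputs must cover the same questions")
    if not contexts_a:
        raise ValueError("no questions to compare")
    # Compare as sets so that ranking order alone does not count as a difference.
    same = sum(set(a) == set(b) for a, b in zip(contexts_a, contexts_b))
    return same / len(contexts_a)


# Toy data: two questions; the second one retrieves a different chunk.
chain_a = [["chunk 1", "chunk 2"], ["chunk 3"]]
chain_b = [["chunk 2", "chunk 1"], ["chunk 4"]]
print(identical_retrieval_fraction(chain_a, chain_b))  # 0.5
```

In this notebook the inputs could be built from the `retrieved_contexts` attribute of each sample, for example `[base.dataset[i].retrieved_contexts for i in range(10)]`, since ten questions were evaluated.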


So, in the end, the conclusion is that the embedding model is not the
right place to optimize this RAG chain.