# Setup

In [1]:
# %pip install -q -r requirements.txt

## Config

In [1]:
INPUT_DATASET = 'derek-thomas/labeled-multiple-choice-explained-falcon-reasoning'
REVISION = '536f3b8'
OUTPUT_DATASET = 'derek-thomas/labeled-multiple-choice-explained-falcon-tokenized'

In [2]:
from transformers import AutoTokenizer
from huggingface_hub import get_token, login

login()

In [3]:
BASE_MODEL = 'tiiuae/Falcon3-7B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, token=get_token())

# Prompt Experiment
## Goal
I want to explore a few scenarios for prompt fine-tuning. Let's consider a scenario where we have the following available:
- (Q) Question
- (AC) Answer Choices
- (R) Reasoning
- (FA) Final Answer

How would performance differ if we tried changing the order?

Scenario 1: `Q - AC - R - FA`
This is the most natural order: we want the model to generate reasoning before the final answer. Because decoding is autoregressive, this gives the model the most information before it selects the Final Answer.

Scenario 2: `Q - AC - FA - R`
This is quite awkward. Why would we put our reasoning last? It's faster. Could the model be trained in such a way to "know" the reasoning before responding with it? If so, we could save a lot of tokens. I'm skeptical, but it's worth testing.

Scenario 3: `Q - AC - FA`
This is our fine-tuning control.

Scenario 4: Base
This is our un-fine-tuned control.

In each of these scenarios I will build prompts for fine-tuning with structured generation. In a first pass I had some difficulty getting consistent response formats; fixing that is out of scope here, so structured generation can help a lot.
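
To make the orderings concrete, here is a minimal sketch of the three fine-tuning targets as serialized JSON. The helper name `target_payload` and the sample strings are my own illustrative assumptions, not part of the pipeline. Python dicts keep insertion order and `json.dumps` preserves it, so the key order shown is the order the model would be trained to emit.

```python
import json

def target_payload(order, reasoning, final_answer):
    """Return the assistant target as a JSON string for one scenario.

    order: 'rfa' (reasoning first), 'far' (answer first) or 'fa' (answer only).
    """
    if order == 'rfa':
        payload = {'reasoning': reasoning, 'final_answer': final_answer}
    elif order == 'far':
        payload = {'final_answer': final_answer, 'reasoning': reasoning}
    elif order == 'fa':
        payload = {'final_answer': final_answer}
    else:
        raise ValueError(f'unknown order: {order}')
    return json.dumps(payload)

# Print one target per scenario to compare the key order
for order in ('rfa', 'far', 'fa'):
    print(order, target_payload(order, 'Weather is forecast from orbit.', 'c'))
```

In the `far` case the answer letter is emitted before any reasoning tokens, which is exactly what makes Scenario 2 cheap at inference time and questionable in quality.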

## Implementation
To explore this goal, we will start with [layoric/labeled-multiple-choice-explained](https://huggingface.co/datasets/layoric/labeled-multiple-choice-explained) as our dataset. It has explanations already provided by GPT-3.5-turbo. Given that these explanations are a bit different from what Falcon would produce, it might be useful to generate some from Falcon as well. Based on [this notebook](./poe-generate-falcon-reasoning.ipynb) we have generated Falcon reasoning in this refined dataset: [derek-thomas/labeled-multiple-choice-explained-falcon-reasoning](https://huggingface.co/datasets/derek-thomas/labeled-multiple-choice-explained-falcon-reasoning).

In this notebook we will format our data so that we can try each experiment, and then we will push it to my repo: [derek-thomas/labeled-multiple-choice-explained-falcon-tokenized](https://huggingface.co/datasets/derek-thomas/labeled-multiple-choice-explained-falcon-tokenized).

## Imports

In [4]:
import pandas as pd
from datasets import load_dataset
import json

## Load and Preprocess the Dataset

In [5]:
# Load dataset from Hugging Face Hub
dataset = load_dataset(INPUT_DATASET, split='train')

# Convert to pandas dataframe
df = dataset.to_pandas()
print(f"Before Cleaning: {len(df)} rows")
print(df.columns)

# Drop the prompt column that was used to generate the Falcon reasoning, if it exists
df.drop(columns=['falcon_reasoning_prompt'], errors='ignore', inplace=True)

# Rename the original explanation column so its source (GPT-3.5) is explicit
df.rename(columns={
    'explanation': 'gpt3_5_reasoning',
}, inplace=True)

# Fix formatting
df['question_text'] = df['question_text'].str.replace('"', '', regex=False)
df['gpt3_5_reasoning'] = df['gpt3_5_reasoning'].str.replace('"', "'", regex=False)
df['falcon_reasoning'] = df['falcon_reasoning'].str.replace('"', "'", regex=False)

df

Before Cleaning: 8413 rows
Index(['formatted_question', 'combined_fact', 'answer_key', 'topic',
       'explanation', 'question_text', 'answer_choices',
       'falcon_reasoning_prompt', 'falcon_reasoning'],
      dtype='object')


Unnamed: 0,formatted_question,combined_fact,answer_key,topic,gpt3_5_reasoning,question_text,answer_choices,falcon_reasoning
0,what is satellite technology used for predicti...,satellite technology is used for predicting wh...,c,Technology,a) Seconds and minutes: This option is incorre...,What is satellite technology used for predicting?,(a) Seconds and minutes (b) The strength and m...,- (a) Seconds and minutes: Satellite technolog...
1,what does irradiating food do? (a) relieve pai...,irradiated food improves food safety.,c,Food science,(a) Relieve pain: This option is not correct b...,What does irradiating food do?,(a) Relieve pain (b) Enhance food's nutrients ...,(a) Relieve pain: Irradiating food does not ha...
2,what protects a mammal's skin? (a) fiber folli...,fiber follicles protect mammal skin,a,Biology,b) Exfoliation: Exfoliation is the process of ...,What protects a mammal's skin?,(a) Fiber follicles (b) Exfoliation (c) Resist...,(a) **Fiber follicles**: This is the correct a...
3,what do earthworms do when a segment breaks of...,earthworms can regrow segments that break off,b,Biology,a) Dies: This option is not correct because ea...,What do earthworms do when a segment breaks off?,(a) Dies (b) Regrows it (c) Reproduces (d) Sed...,1. **Option (a): Dies**\n - Earthworms are s...
4,lightning can be bad for what? (a) the environ...,lightning can be bad for the environment.,a,Electricity,b) Rainstorms: Lightning is actually a natural...,Lightning can be bad for what?,(a) The environment (b) Rainstorms (c) Destruc...,(a) The environment: Lightning can release lar...
...,...,...,...,...,...,...,...,...
8408,organisms that can cause infection do what? (a...,organisms that can cause infection make humans...,g,Biology,a) Bandaging open sores is not the correct ans...,Organisms that can cause infection do what?,(a) Bandage open sores (b) Keep flesh clean (c...,(a) Bandage open sores: This action is typical...
8409,fungi are living things that cannot make thei...,fungi are living things that cannot make their...,a,Biology,b) Fungi are living things that can make their...,Fungi are living things that cannot make their...,(a) Food (b) Cells (c) Energy (d) Fruits (e) H...,1. **Read the question and options carefully.*...
8410,an overheated body can use water for: (a) meta...,the evaporation of water from the skin cools t...,g,Biology,a) Metabolic reaction: This option is incorrec...,An overheated body can use water for:?,(a) Metabolic reaction (b) Dehydrating (c) Rai...,- (a) Metabolic reaction: This is incorrect be...
8411,what is essential for cellular respiration for...,plants are essential for cellular respiration ...,f,Biology,a) Electrons are involved in cellular respirat...,What is essential for cellular respiration for...,(a) Electron (b) Glucose (c) Energy (d) Energy...,1. **Glucose (b)**: Glucose is one of the reac...


## Create Prompts from Processed Data

We need to convert our sample into a format similar to below for each of the scenarios. This is ideal since we can use [chat templates](https://huggingface.co/docs/transformers/en/chat_templating) to easily switch models which might have different special tokens.

```
[
    {"content": system_prompt, "role": "system"},
    {"content": user_content, "role": "user"},
    {"content": assistant_response, "role": "assistant"}
]
```

We should include a helpful `system_prompt` with a general trivia prefix, and a suffix that contains instructions that fit each scenario.
The `user_content` will have the Question and answer choices.
The `assistant_response` should reflect the scenario. 
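
As a plain-Python cross-check of that structure, the sketch below builds the same message list without any templating. The function name `build_messages` and the toy question are hypothetical; the notebook itself renders these messages with Jinja.

```python
def build_messages(system_prompt, question_text, answer_choices, assistant_response=None):
    """Build a chat-style message list; the assistant turn is optional."""
    messages = [
        {'role': 'system', 'content': system_prompt},
        {'role': 'user',
         'content': f'Question: {question_text}\nAnswer Choices: {answer_choices}'},
    ]
    # Only fine-tuning examples carry the assistant turn
    if assistant_response is not None:
        messages.append({'role': 'assistant', 'content': assistant_response})
    return messages

# Prompt-only version (system + user), e.g. for inference
prompt_only = build_messages('Answer with a letter.', 'What is 1 + 1?', '(a) 1 (b) 2')

# Full version (system + user + assistant), e.g. for fine-tuning
full = build_messages('Answer with a letter.', 'What is 1 + 1?', '(a) 1 (b) 2',
                      assistant_response={'final_answer': 'b'})
print(len(prompt_only), len(full))
```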

It's best to understand `template_blocks` in a couple of layers.
- The top layer (macro) lets me decide which pieces I want to include. Sometimes I want just the `system`+`user` messages, and for fine-tuning I'll want `system`+`user`+`assistant`.
- System+User:
    - Inside this layer I use Jinja to interpolate the values I want to add
    - I moved `user_content` out to get a feel for how it looks
- Assistant:
    - Here we have an if statement that lets me choose between FA, RFA and FAR
    - Inside that we just have the same interpolation as seen elsewhere

You can see the JSON for the messages structure in `initial` and `full`. Here I'm selecting which macros I want to use.
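
One consequence of interpolating values into a JSON skeleton is that every value must already be valid JSON text. That is why the system prompt is passed through `json.dumps`, the reasoning goes through `| tojson`, and double quotes are stripped from `question_text`, which is pasted between literal quotes. The sketch below illustrates the failure mode using `str.format` as a stand-in for Jinja; the skeleton and sample text are assumptions for illustration only.

```python
import json

# Stand-in for the Jinja skeleton: the value is pasted in as-is
skeleton = '{{"role": "assistant", "content": {{"reasoning": {value}}}}}'

reasoning = 'Option (a) is "too specific".\nOption (c) fits.'

# Wrapping the raw text in quotes by hand yields invalid JSON,
# because the inner quotes and the newline are not escaped
broken = skeleton.format(value='"' + reasoning + '"')
try:
    json.loads(broken)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False

# json.dumps escapes quotes and newlines, so the result parses and round-trips
safe = skeleton.format(value=json.dumps(reasoning))
round_trip = json.loads(safe)['content']['reasoning']
print(parsed_ok, round_trip == reasoning)
```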

In [6]:
from jinja2 import Environment, DictLoader

template_blocks = '''
{%- macro user_message(system_content, question_text, answer_choices) -%}
{
    "role": "system",
    "content": {{ system_content }}
},
{
    "role": "user",
    "content": "Question: {{ question_text }}\\nAnswer Choices: {{ answer_choices }}"

}
{%- endmacro %}

{% macro assistant_response(reasoning, answer_key, response_order='default') -%}
{
    "role": "assistant",
    "content": {
        {% if response_order == 'rfa' -%}
        "reasoning": {{ reasoning | tojson }},
        "final_answer": "{{ answer_key }}"
        {% elif response_order == 'far' -%}
        "final_answer": "{{ answer_key }}",
        "reasoning": {{ reasoning | tojson }}
        {% else -%}
        "final_answer": "{{ answer_key }}"
        {% endif %}
    }
}
{%- endmacro %}
'''

# System + User only (initial template)
initial = '''
[
    {{ user_message(system_content, question_text, answer_choices) }}
]
'''

# Full conversation template
full = '''
[
    {{ user_message(system_content, question_text, answer_choices) }},
    {{ assistant_response(reasoning, answer_key, response_order) }}
]
'''

# Create Jinja environment and load templates
env = Environment(loader=DictLoader({
    'template_blocks': template_blocks,
    'initial': initial,
    'full': full
}))

# Load the macro definitions into the environment
macro_template = env.get_template('template_blocks')
env.globals.update(macro_template.module.__dict__)

# Compile full and initial templates
full_template = env.get_template('full')
initial_template = env.get_template('initial')

### Reasoning Final Answer

In [7]:
rfa_system_content = 'Answer the Question and include your reasoning and the final answer in a json like: {"reasoning": <reasoning about the answer>, "final_answer": <letter corresponding to the answer>}.'
rfa_system_content = json.dumps(rfa_system_content)

# USER Prompt
df['user_prompt_RFA'] = df.apply(lambda row: initial_template.render(
    system_content=rfa_system_content,
    question_text=row['question_text'],
    answer_choices=row['answer_choices']
), axis=1)
df['user_prompt_RFA'] = df['user_prompt_RFA'].apply(json.loads)

#### RFA ChatGPT 3.5 Example

In [8]:
def generate_full_conversation(row, reasoning_key):
    rfa_template_input = {
        'system_content': rfa_system_content,
        'question_text': row['question_text'],
        'answer_choices': row['answer_choices'],
        'answer_key': row['answer_key'],
        'response_order': 'rfa'
    }
    return full_template.render(**rfa_template_input, reasoning=row[reasoning_key])

# Full Conversation GPT3.5
df['conversation_RFA_gpt3_5'] = df.apply(lambda row: generate_full_conversation(row, 'gpt3_5_reasoning'), axis=1)
df['conversation_RFA_gpt3_5'] = df['conversation_RFA_gpt3_5'].apply(json.loads)

# Full Conversation Falcon
df['conversation_RFA_falcon'] = df.apply(lambda row: generate_full_conversation(row, 'falcon_reasoning'), axis=1)
df['conversation_RFA_falcon'] = df['conversation_RFA_falcon'].apply(json.loads)

In [9]:
rfa_test_row = df.conversation_RFA_gpt3_5.iloc[0]
print(rfa_test_row[0]['content'])
print('---')
print(rfa_test_row[1]['content'])
print('---')
print(rfa_test_row[2]['content'].keys())
print('---')
print(rfa_test_row[2]['content']['reasoning'])

Answer the Question and include your reasoning and the final answer in a json like: {"reasoning": <reasoning about the answer>, "final_answer": <letter corresponding to the answer>}.
---
Question: What is satellite technology used for predicting?
Answer Choices: (a) Seconds and minutes (b) The strength and magnitude of an earthquake (c) What it's like outside each day (d) 70-75 degrees fahrenheit (e) Rapid changes occur (f) Dead-ends and false starts. (g) Snow, ice, and rock (h) Around 5 to 27 degrees celsius
---
dict_keys(['reasoning', 'final_answer'])
---
a) Seconds and minutes: This option is incorrect because satellite technology is not used for predicting time intervals. Satellite technology is used for various purposes such as communication, navigation, and weather forecasting, but it is not used for predicting time intervals.

b) The strength and magnitude of an earthquake: This option is incorrect because satellite technology is not used for predicting earthquakes. Earthquake p

#### RFA Falcon Example

In [10]:
rfa_test_row = df.conversation_RFA_falcon.iloc[0]
print(rfa_test_row[0]['content'])
print('---')
print(rfa_test_row[1]['content'])
print('---')
print(rfa_test_row[2]['content'].keys())
print('---')
print(rfa_test_row[2]['content']['reasoning'])

Answer the Question and include your reasoning and the final answer in a json like: {"reasoning": <reasoning about the answer>, "final_answer": <letter corresponding to the answer>}.
---
Question: What is satellite technology used for predicting?
Answer Choices: (a) Seconds and minutes (b) The strength and magnitude of an earthquake (c) What it's like outside each day (d) 70-75 degrees fahrenheit (e) Rapid changes occur (f) Dead-ends and false starts. (g) Snow, ice, and rock (h) Around 5 to 27 degrees celsius
---
dict_keys(['reasoning', 'final_answer'])
---
- (a) Seconds and minutes: Satellite technology is not used to predict seconds and minutes. This is too specific and not what satellite technology is generally used for. Satellite technology is used for broader time scales, such as days, weeks, months, and years.
- (b) The strength and magnitude of an earthquake: While some types of satellite data can be used in conjunction with other information to study seismicity, earthquakes the

At this point we should feel pretty comfortable with our prompt, so let's repeat this for `FAR` and `FA`.

### Final Answer Reasoning

In [11]:
far_system_content = 'Answer the Question and include your Final Answer and the Reasoning in a json like: {"final_answer": <letter corresponding to the answer>, "reasoning": <reasoning about the answer>}.'
far_system_content = json.dumps(far_system_content)

# USER Prompt
df['user_prompt_FAR'] = df.apply(lambda row: initial_template.render(
    system_content=far_system_content,
    question_text=row['question_text'],
    answer_choices=row['answer_choices']
), axis=1)
df['user_prompt_FAR'] = df['user_prompt_FAR'].apply(json.loads)

In [12]:
def generate_full_conversation(row, reasoning_key):
    far_template_input = {
        'system_content': far_system_content,
        'question_text': row['question_text'],
        'answer_choices': row['answer_choices'],
        'answer_key': row['answer_key'],
        'response_order': 'far'
    }
    return full_template.render(**far_template_input, reasoning=row[reasoning_key])

# Full Conversation GPT3.5
df['conversation_FAR_gpt3_5'] = df.apply(lambda row: generate_full_conversation(row, 'gpt3_5_reasoning'), axis=1)
df['conversation_FAR_gpt3_5'] = df['conversation_FAR_gpt3_5'].apply(json.loads)

# Full Conversation Falcon
df['conversation_FAR_falcon'] = df.apply(lambda row: generate_full_conversation(row, 'falcon_reasoning'), axis=1)
df['conversation_FAR_falcon'] = df['conversation_FAR_falcon'].apply(json.loads)

#### FAR ChatGPT 3.5 Example

In [13]:
far_test_row = df.conversation_FAR_gpt3_5.iloc[0]
print(far_test_row[0]['content'])
print('---')
print(far_test_row[1]['content'])
print('---')
print(far_test_row[2]['content'].keys())
print('---')
print(far_test_row[2]['content']['reasoning'])

Answer the Question and include your Final Answer and the Reasoning in a json like: {"final_answer": <letter corresponding to the answer>, "reasoning": <reasoning about the answer>}.
---
Question: What is satellite technology used for predicting?
Answer Choices: (a) Seconds and minutes (b) The strength and magnitude of an earthquake (c) What it's like outside each day (d) 70-75 degrees fahrenheit (e) Rapid changes occur (f) Dead-ends and false starts. (g) Snow, ice, and rock (h) Around 5 to 27 degrees celsius
---
dict_keys(['final_answer', 'reasoning'])
---
a) Seconds and minutes: This option is incorrect because satellite technology is not used for predicting time intervals. Satellite technology is used for various purposes such as communication, navigation, and weather forecasting, but it is not used for predicting time intervals.

b) The strength and magnitude of an earthquake: This option is incorrect because satellite technology is not used for predicting earthquakes. Earthquake p

#### FAR Falcon Example

In [14]:
far_test_row = df.conversation_FAR_falcon.iloc[0]
print(far_test_row[0]['content'])
print('---')
print(far_test_row[1]['content'])
print('---')
print(far_test_row[2]['content'].keys())
print('---')
print(far_test_row[2]['content']['reasoning'])

Answer the Question and include your Final Answer and the Reasoning in a json like: {"final_answer": <letter corresponding to the answer>, "reasoning": <reasoning about the answer>}.
---
Question: What is satellite technology used for predicting?
Answer Choices: (a) Seconds and minutes (b) The strength and magnitude of an earthquake (c) What it's like outside each day (d) 70-75 degrees fahrenheit (e) Rapid changes occur (f) Dead-ends and false starts. (g) Snow, ice, and rock (h) Around 5 to 27 degrees celsius
---
dict_keys(['final_answer', 'reasoning'])
---
- (a) Seconds and minutes: Satellite technology is not used to predict seconds and minutes. This is too specific and not what satellite technology is generally used for. Satellite technology is used for broader time scales, such as days, weeks, months, and years.
- (b) The strength and magnitude of an earthquake: While some types of satellite data can be used in conjunction with other information to study seismicity, earthquakes the

### Final Answer

In [15]:
fa_system_content = 'Answer the Question and include your Final Answer in a json like: {"final_answer": <letter corresponding to the answer>}.'
fa_system_content = json.dumps(fa_system_content)

# USER Prompt
df['user_prompt_FA'] = df.apply(lambda row: initial_template.render(
    system_content=fa_system_content,
    question_text=row['question_text'],
    answer_choices=row['answer_choices']
), axis=1)
df['user_prompt_FA'] = df['user_prompt_FA'].apply(json.loads)

In [16]:
def generate_full_conversation(row):
    fa_template_input = {
        'system_content': fa_system_content,
        'question_text': row['question_text'],
        'answer_choices': row['answer_choices'],
        'answer_key': row['answer_key'],
        'response_order': 'fa'
    }
    return full_template.render(**fa_template_input)

# Full Conversation (final answer only, so no reasoning source is needed)
df['conversation_FA'] = df.apply(lambda row: generate_full_conversation(row), axis=1)
df['conversation_FA'] = df['conversation_FA'].apply(json.loads)


#### FA Example

In [17]:
fa_test_row = df.conversation_FA.iloc[0]
print(fa_test_row[0]['content'])
print('---')
print(fa_test_row[1]['content'])
print('---')
print(fa_test_row[2]['content'])

Answer the Question and include your Final Answer in a json like: {"final_answer": <letter corresponding to the answer>}.
---
Question: What is satellite technology used for predicting?
Answer Choices: (a) Seconds and minutes (b) The strength and magnitude of an earthquake (c) What it's like outside each day (d) 70-75 degrees fahrenheit (e) Rapid changes occur (f) Dead-ends and false starts. (g) Snow, ice, and rock (h) Around 5 to 27 degrees celsius
---
{'final_answer': 'c'}


### Cleanup

In [18]:
df.columns

Index(['formatted_question', 'combined_fact', 'answer_key', 'topic',
       'gpt3_5_reasoning', 'question_text', 'answer_choices',
       'falcon_reasoning', 'user_prompt_RFA', 'conversation_RFA_gpt3_5',
       'conversation_RFA_falcon', 'user_prompt_FAR', 'conversation_FAR_gpt3_5',
       'conversation_FAR_falcon', 'user_prompt_FA', 'conversation_FA'],
      dtype='object')

In [19]:
df = df[['topic', 'question_text', 'answer_key', 'gpt3_5_reasoning', 'falcon_reasoning', 'answer_choices',
         'user_prompt_RFA', 'conversation_RFA_gpt3_5', 'conversation_RFA_falcon',
         'user_prompt_FAR', 'conversation_FAR_gpt3_5', 'conversation_FAR_falcon',
         'user_prompt_FA', 'conversation_FA']]

In [20]:
df

Unnamed: 0,topic,question_text,answer_key,gpt3_5_reasoning,falcon_reasoning,answer_choices,user_prompt_RFA,conversation_RFA_gpt3_5,conversation_RFA_falcon,user_prompt_FAR,conversation_FAR_gpt3_5,conversation_FAR_falcon,user_prompt_FA,conversation_FA
0,Technology,What is satellite technology used for predicting?,c,a) Seconds and minutes: This option is incorre...,- (a) Seconds and minutes: Satellite technolog...,(a) Seconds and minutes (b) The strength and m...,"[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que..."
1,Food science,What does irradiating food do?,c,(a) Relieve pain: This option is not correct b...,(a) Relieve pain: Irradiating food does not ha...,(a) Relieve pain (b) Enhance food's nutrients ...,"[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que..."
2,Biology,What protects a mammal's skin?,a,b) Exfoliation: Exfoliation is the process of ...,(a) **Fiber follicles**: This is the correct a...,(a) Fiber follicles (b) Exfoliation (c) Resist...,"[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que..."
3,Biology,What do earthworms do when a segment breaks off?,b,a) Dies: This option is not correct because ea...,1. **Option (a): Dies**\n - Earthworms are s...,(a) Dies (b) Regrows it (c) Reproduces (d) Sed...,"[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que..."
4,Electricity,Lightning can be bad for what?,a,b) Rainstorms: Lightning is actually a natural...,(a) The environment: Lightning can release lar...,(a) The environment (b) Rainstorms (c) Destruc...,"[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8408,Biology,Organisms that can cause infection do what?,g,a) Bandaging open sores is not the correct ans...,(a) Bandage open sores: This action is typical...,(a) Bandage open sores (b) Keep flesh clean (c...,"[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que..."
8409,Biology,Fungi are living things that cannot make their...,a,b) Fungi are living things that can make their...,1. **Read the question and options carefully.*...,(a) Food (b) Cells (c) Energy (d) Fruits (e) H...,"[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que..."
8410,Biology,An overheated body can use water for:?,g,a) Metabolic reaction: This option is incorrec...,- (a) Metabolic reaction: This is incorrect be...,(a) Metabolic reaction (b) Dehydrating (c) Rai...,"[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que..."
8411,Biology,What is essential for cellular respiration for...,f,a) Electrons are involved in cellular respirat...,1. **Glucose (b)**: Glucose is one of the reac...,(a) Electron (b) Glucose (c) Energy (d) Energy...,"[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que...","[{'role': 'system', 'content': 'Answer the Que..."


### Datasets Format

We want to push this to the Hub for storage, but there is a catch in our format: `content` is currently a `dict` for the "assistant" message, while for "system" and "user" it's a string. `Datasets` is based on Parquet/Arrow, which requires columns of fixed types, meaning `content` should always be a str or always a dict. I'll cast it to str for simplicity.
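
Here is a small sketch of the mixed-type problem and the cast, on a toy conversation. The helpers `content_types` and `stringify_content` are hypothetical names; unlike the notebook's own function, this variant returns a copy rather than modifying the conversation in place.

```python
import json

conversation = [
    {'role': 'system', 'content': 'Answer in JSON.'},
    {'role': 'user', 'content': 'Question: What is 1 + 1?'},
    {'role': 'assistant', 'content': {'final_answer': 'b'}},
]

def content_types(messages):
    """Return the set of Python type names found under 'content'."""
    return {type(m['content']).__name__ for m in messages}

def stringify_content(messages):
    """Return a copy where every non-string content is serialized to JSON."""
    return [
        {**m, 'content': m['content'] if isinstance(m['content'], str)
         else json.dumps(m['content'])}
        for m in messages
    ]

uniform = stringify_content(conversation)
print(content_types(conversation), content_types(uniform))

# The assistant payload can be recovered with json.loads when needed
restored = json.loads(uniform[2]['content'])
```

The trade-off is that anyone consuming the dataset has to call `json.loads` on the assistant content to get the structured answer back.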

In [21]:
cols_to_cast = ['conversation_RFA_gpt3_5', 'conversation_RFA_falcon', 'conversation_FAR_gpt3_5', 'conversation_FAR_falcon', 'conversation_FA']

def cast_content_keys_to_string(conversation):
    # The third message (index 2) is the assistant turn, whose content is a dict
    assistant_dict = conversation[2]
    assistant_dict['content'] = json.dumps(assistant_dict['content'])
    return conversation

# Apply the function to all conversation columns
for col in cols_to_cast:
    df.loc[:, col] = df[col].apply(cast_content_keys_to_string)

## Explore Prompts
Gradio can be really useful for quick inline apps. Here I want to make sure everything is as I expect.

While the above print statements helped me see the format, the gradio app helps me explore a large volume of output easily. 

Note: It's tricky as I can't easily render newlines in strings. So be careful!
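
The app receives its data as a base64 string embedded in the HTML. A minimal sketch of that round trip is below, with a made-up record; the point is that base64 text contains no quotes or newlines, so it can be dropped into a quoted string in the page without any escaping.

```python
import base64
import json

# Toy record with characters that would need escaping inside a quoted string
records = {'conversation_FA': {'0': 'line one\nline "two"'}}

# Encode: JSON -> bytes -> base64 text
encoded = base64.b64encode(json.dumps(records).encode()).decode()

# Decode on the other side (inside the embedded app)
decoded = json.loads(base64.b64decode(encoded).decode())
print(decoded == records)
```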

In [22]:
import base64

from IPython.display import display, HTML

gr_cols = ['conversation_RFA_gpt3_5', 'conversation_RFA_falcon',
           'conversation_FAR_gpt3_5', 'conversation_FAR_falcon',
           'conversation_FA']
gr_df = df[gr_cols]
json_str = json.dumps(gr_df.head(20).to_dict())
encoded_data = base64.b64encode(json_str.encode()).decode()

code = f'''
<html>
	<head>
		<script type="module" crossorigin src="https://cdn.jsdelivr.net/npm/@gradio/lite/dist/lite.js"></script>
		<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/@gradio/lite/dist/lite.css" />
	</head>
	<body>
		<gradio-lite>
        import json
        import gradio as gr
        import pandas as pd
        import base64

        encoded_data = "{encoded_data}"
        decoded_data = json.loads(base64.b64decode(encoded_data).decode())
        
        df = pd.DataFrame(decoded_data)


        # Functions to handle prompts
        def get_prompt(index, prompt_type):
            return df.iloc[index][prompt_type]
        
        def next_prompt(index, prompt_type):
            if index < len(df) - 1:
                index += 1
            return index, get_prompt(index, prompt_type)
        
        def previous_prompt(index, prompt_type):
            if index > 0:
                index -= 1
            return index, get_prompt(index, prompt_type)
        
        # Gradio App
        with gr.Blocks() as demo:
            gr.Markdown("# Prompt Browser")
            with gr.Row():
                prompt_type_dropdown = gr.Dropdown(
                    choices=list(df.columns),
                    value=list(df.columns)[0],
                    label="Select Prompt Type"
                )
                index_display = gr.Textbox("0", label="Index", interactive=False)
        
            prompt_display = gr.JSON(value=df.iloc[0][list(df.columns)[0]], label="Prompt")
            
            with gr.Row():
                prev_button = gr.Button("⬅️ Previous")
                next_button = gr.Button("Next ➡️")
            
            # State to hold the current index
            index_state = gr.State(value=0)
        
            # Button click events
            prev_button.click(
                fn=previous_prompt,
                inputs=[index_state, prompt_type_dropdown],
                outputs=[index_state, prompt_display]
            )
            next_button.click(
                fn=next_prompt,
                inputs=[index_state, prompt_type_dropdown],
                outputs=[index_state, prompt_display]
            )
        
            # Dropdown change event
            prompt_type_dropdown.change(
                fn=lambda index, prompt_type: get_prompt(index, prompt_type),
                inputs=[index_state, prompt_type_dropdown],
                outputs=prompt_display
            )
        
            # Update index display
            index_state.change(
                fn=lambda index: str(index),
                inputs=index_state,
                outputs=index_display
            )
        
        # Launch the app
        demo.launch(height=900)
        
        </gradio-lite>
	</body>
</html>
'''

display(HTML(code))

## Push Dataset to the Hub

It's useful to get a train/test split, then we convert to `Dataset` and push to the Hub. We also want to stratify on `'topic'` so both splits keep the same topic proportions.
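
To show what stratifying buys us, here is a pure-Python sketch of a stratified split on toy labels. This is only an illustration of the idea, not how scikit-learn implements `train_test_split`; the function name `stratified_split` and the label counts are assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling test_size from each label group."""
    rng = random.Random(seed)

    # Group row indices by label
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)

    train_idx, test_idx = [], []
    for idxs in groups.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ['Biology'] * 80 + ['Physics'] * 20
train_idx, test_idx = stratified_split(labels)

# Each topic contributes 20% of its rows to the test split
print(Counter(labels[i] for i in test_idx))
```

Without stratification, a rare topic could end up almost entirely in one split, which would make per-topic evaluation unreliable.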

In [23]:
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split

# Split into train and test sets, stratified by topic
train_df, test_df = train_test_split(df, test_size=0.2, stratify=df['topic'], random_state=42)

# Reset index to avoid index column in the Hugging Face Dataset
train_df.reset_index(drop=True, inplace=True)
test_df.reset_index(drop=True, inplace=True)

# Convert each DataFrame to a Dataset object
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

# Create a DatasetDict with the train and test datasets
dataset_dict = DatasetDict({
    'train': train_dataset,
    'test': test_dataset
})

In [24]:
# Push the dataset to the Hugging Face Hub
dataset_dict.push_to_hub(OUTPUT_DATASET)

CommitInfo(commit_url='https://huggingface.co/datasets/derek-thomas/labeled-multiple-choice-explained-falcon-tokenized/commit/c29e666182bfae43500a88a115df2554216e0012', commit_message='Upload dataset', commit_description='', oid='c29e666182bfae43500a88a115df2554216e0012', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/derek-thomas/labeled-multiple-choice-explained-falcon-tokenized', endpoint='https://huggingface.co', repo_type='dataset', repo_id='derek-thomas/labeled-multiple-choice-explained-falcon-tokenized'), pr_revision=None, pr_num=None)