Argunauts Training Phase I: Continual Pretraining on Synthetic Data
Introduction
This article documents how we've been creating DebateLabKIT/Llama-3.1-Argunaut-1-8B-SFT through continual pretraining of Llama-3.1-8B-Instruct.
The goal of this is to familiarize Llama-3.1-8B-Instruct with Argdown syntax and semantics, without diminishing the model's other capabilities.
Training Llama-3.1-Argunaut-1-8B-SFT is part of the Argunauts Project.
SFT Dataset Mix
The training data contains 1M examples and is composed of the following datasets:
Dataset | Weight (examples) | Weight (tokens) |
---|---|---|
DebateLabKIT/deepa2-conversations | 25% | 49% |
DebateLabKIT/deep-argmap-conversations | 25% | 18% |
allenai/tulu-3-sft-mixture | 50% | 33% |
Ingredient #1: deepa2-conversations
The deepa2-conversations
dataset contains multi-turn conversations with user/assistant roles (length: 2x2–20x2 turns, bell-shaped distribution). The user instructs an assistant, step by step, to reconstruct and analyse an initially given argumentative text.
It is a fully synthetic dataset, which upcycles the deepa2 data.
To build deepa2-conversations
, we've defined 30 different kinds of artifacts that can be referred to, or produced in the course of an analysis, some of which annotate and analyse the source text, reconstruct the argument in standard form, or formalize premises and conclusions, while still others use the reconstruction for additional tasks (like writing a Z3 program to check deductive validity, or augmenting the argument with function-calling results).
The features the user may refer to / the assistant may produce in the course of the argumentation analysis are:
Source Text Artifacts
N = "narrative" # free form text that presents the argument as narrative, dialogque, letter, or email
ST = "source_text" # original and more schematic "source text" containing the (incomplete) argument and potential distractors
SP = "source_paraphrase" # paraphrase of source text which leaves out extraneous material and streamlines the presentation
Informal Analysis Artifacts
T = "title" # short title of the argument
G = "gist" # key point summarizing the argument
TG = "title_gist" # mixes title and gist
TGC = "title_gist_conclusion" # mixes title, gist and conclusion in argdown snippet
X = "context" # multi-purpose characterization of context
Argumentative Text Analysis Artifacts
R = "reasons" # list of quotes from `ST` that justify claims
J = "conjectures" # list of quotes from `ST` that are argued for
AT = "annotated_source_text" # argumentative text annotation of source text
ATR = "annotated_source_text_ref" # like AT, with reference to argdown reconstruction
IG = "inference_graph" # renders all reasons and conjectures as mini argument map
Argument Reconstruction Artifacts
A = "argdown reconstruction" # premise-conclusion structure
E = "erroneous_argdown" # flawed reconstruction, to-be revised at later step
P = "premises" # list of premises
I = "intermediary_conclusions" # list of intermediary conclusions
C = "conclusion" # final conclusion of argument
AM = "argument_map" # map with subarguments, premises and conclusions
ATG = "argdown_with_title_gist" # argdown argument with header
Logical Analysis Artifacts
FP = "premises_formalized" # formalizations of premises
FI = "intermediary_conclusions_formalized" # formalizations of intermediary conclusions
FC = "conclusion_formalized" # formalizations of final conclusion
PP = "predicate_placeholders" # list of predicate placeholders
PE = "entity_placeholders" # list of entity names
PM = "misc_placeholders" # further placeholders (e.g., propositional constants)
K = "plchd_substitutions" # keys, mapping placeholders to natural-language terms
AF = "argdown_with_formalizations" # argdown argument with inline formalization info
Misc Artifacts
SMTL = "SMT_program_local" # smtlib code to check validity of each subargument with Z3
SMTG = "SMT_program_global" # smtlib code to check global validity with Z3
AY = "argdown_augmented_yaml" # argdown argument augmented with fake inline yaml from tool calls
QA = "evaluation_question_and_answer" # argument analysis Q&A
Not all artifacts figure in all conversations.
Every conversation in deepa2-conversations
unfolds a meaningful piecemeal reconstruction process, where the user guides the assistant through the analysis, instructing it to produce one artifact after the other.
But what is a meaningful reconstruction process, and how do we simulate it?
There are two types of basic constraints for meaningful reconstruction processes:
- Redundancy. If an artifact is entailed by a previously generated artifact, it would be redundant to produce it. For example: If the assistant has already generated the entire (correct) premise-conclusion structure, it doesn't make sense to ask the assistant to produce just the (correct) conclusion, for it is contained in the argument we already have.
- Dependency. Some artifacts depend on others. For example, to identify all reasons in a source text through annotation and to link them to their corresponding premises in the Argdown argument requires that the argument has been reconstructed before.
In addition, the experience from teaching argumentation analysis suggests that, at a given state of reconstruction, some steps are more sensible than others. For example, it's more promising to work with the text before you reconstruct and formalize the entire argument.
We model these ideas through
- a weighted directed graph on the node set of all artifact types, plus a start- and a stop-node (conversational graph), and
- a set of additional constraints for legal paths.
The following diagram depicts a subgraph of the entire conversational graph.
An abstract deepa2 conversation is a directed path from the start- to the stop-node which satisfies the additional constraints for legal paths. We can now randomly sample abstract legal conversational paths. Unfold for two examples:
- to summarize the argument (title and gist),
- to informally map all reason and conjecture statements contained in the source text,
- to reconstruct the argument as premise-conclusion structure,
- to draft a SMTLIB code snippet which checks whether the premises of the argument deductively entail its conclusion.
In this conversation, the user presents the source text and instructs the assistant ...
- to reconstruct the entire argument as premise-conclusion structure,
- to paraphrase the source text (given the reconstruction),
- to summarize the argument's title and key point,
- to annotate the source text by marking reasons and conjectures.
With all this in place, the construction of the deepa2-conversations
dataset is straightforward. For each split:
- We set the split size.
- We iterate through the upcycled deepa2 data until the desired number of examples has been generated:
- randomly sample an abstract conversation path
- render argumentation artifacts from current deepa2 record with templates
- replace abstract artifact types with argumentation artifacts in conversation
Ingredient #2: deep-argmap-conversations
The second ingredient of the Argunauts-SFT mix is based on the recently released syncIALO dataset — a corpus of 1k+ argument maps with several hundreds of claims each.
The conversations in deep-argmap-conversations
have typically 2 or 4 turns and involve user and assistant. The user instructs the assistant to create an Argdown argument map. While the inputs and the specific task may vary, the general idea for building this dataset is this:
- Sample a submap (not more than 20 arguments) from a randomly chosen syncIALO debate.
- Render the submap as neat and consistent Argdown argument map (
full_argdown_map
). This will serve asANSWER
. - Transform the submap into inputs that contain less information than the submap.
- Insert the inputs into a suitable instruction template, which gives the
INSTRUCTION
. INSTRUCTION
-ANSWER
pairs are rendered as conversations.
Expand for a list of all subtasks and example templates (stylized and shortened for illustrative purposes).
Using the shuffled list of argument claims as inputs:
- role: user
content: >
Organize the following sentences into an informal argument map (argdown syntax)
according to their pairwise dialectic relations.
sentences:
{{ sentence_list }}
- role: assistant
content: >
```argdown\n{{ full_argdown_map }}\n```
Assistant is asked to add sentences to a partial / incomplete argument map:
- role: user
content: >
Organize the following sentences into an informal argument map according to
their pairwise dialectic relations. Some of these dialectic relations are
shown below (partialmap) and should be taken into account.
sentences:
{{ sentence_list }}
partialmap:
```argdown
{{ partialmap }}
```
- role: assistant
content: >
```argdown\n{{ full_argdown_map }}\n```\n
- role: user
content: >
Complete and revise the following argument map by adding the missing sentences.
incomplete argument map:
```argdown\n{{ partialmap }}\n```
missing sentences:
{{} sentence_list }}
- role: assistant
content: >
```argdown\n{{ full_argdown_map }}\n```
Merge two or more maps into a single one:
- role: user
content: >
Merge the following argument maps into a single map. Streamline and add missing
arguments as required.{% if hints | title %} In doing so, pay attention to the
given hints.
{% for partialmap in partialmaps %}
argument map {{loop.index}}:
```argdown
{{ partialmap }}
```
{% endfor %}
- role: assistant
content: >
```argdown\n{{ full_argdown_map }}\n```
The brokenmap
is obtained by randomly distorting the correct map:
- role: user
content: >
Improve the following argument map, if possible. In doing so, pay attention to the given hints.
argument map:
```argdown\n{{ brokenmap }}\n```
{% if hints %}
hints:\n {{ hints }}
{% endif %}
- role: assistant
content: >
```argdown\n{{ full_argdown_map }}\n```
Take a flat pros and cons list as starting point:
- role: user
content: >
Reconstruct the following pros and cons list as an argument map (using argdown syntax).
pros and cons:
```argdown\n{{ proscons }}\n```
- role: assistant
content: >
```argdown\n{{ full_argdown_map }}\n```
Using a free source text as input:
- role: user
content: >
Lay out the arguments presented in the following material as an informal argument map (argdown syntax).
<material>
{{ source_text }}
</material>
<context>
{{ context }}
</context>
Just provide the argument map in your answer.
- role: assistant
content: >
```argdown\n{{ full_argdown_map }}\n```
Asking the assistant to add individual claims (premises, conclusion) to an argument map:
- role: user
content: >
Your task is to expand the following argument map by adding the
premises / conclusion(s) of the arguments mentioned below as additional
claims.
argument map:
```argdown\n{{ partialmap }}\n```
Insert the premises / conclusion(s) of
the following arguments:
{{ unfolded_nodes_labels }}
- role: assistant
content: >
```argdown\n{{ full_argdown_map }}\n```
Finally, a task where the full argdown map figures as input:
- role: user
content: >
Your task is to reconstruct a single argument from the following argument map and render
it in standard form (premise-conclusion structure).
argument map:\n\n
```argdown\n{{ full_argdown_map }}\n```\n\n
argument to reconstruct:\n\n
```argdown\n{{ argument_label }}\n```\n
- role: assistant
content: >
```argdown\n{{ argument_argdown_snippet }}\n```
By applying these different pipelines to syncIALO, we create a diverse conversational dataset with demanding argument mapping tasks.
Ingredient #3: Tülu 3 SFT mix
We generously add AI2's Tülu 3 SFT mix to our training data in order to prevent unlearning valuable skills (aka "catastrophic forgetting").
Training
Conceptually, we conceive of this SFT training phase as continual pretraining, rather than finetuning. The main purpose is to make the model familiar with Argdown syntax and semantics. That's also why we use packing, and gradient accumulation to increase the effective batch size. With some chats in deepa2-conversations
having up to 20x2 turns, we chose to train with a 8192 context window.
🤔 Why don't we train on completions only?
Every user query is a reasonable instruction at the given state of analysis and should be learned by model. (But I think testing the effect of masking is worthwhile.)
🤔 Chat templates and continual pretraining -- isn't that odd?
Maybe it's not common, but I suggest chat templates should be used extensively during pretraining (and maybe are so in textbooks-are-all-you-need projects). In future iterations I'd use a variety of different chat templates and formats to increase diversity during continual pretraining.
Technically, the training stack is built with:
Base model: meta-llama/Llama-3.1-8B-Instruct
Training parameters and set-up:
Parameter | Value |
---|---|
learning rate | 5e-6 |
lr scheduler | cosine |
context window size | 8192 |
effective batch size | 32 |
deepspeed ZeRO | stage 3 |
deepspeed CPU offload | false |
spectrum top % snr ratios targeted | 30% |
Training took 35h on 2 H100.
Evaluation
Chat Experience
Let me share two of the first chats I've been having with ggbetz/Llama-3.1-Argunaut-1-8B-SFT-Q4-mlx
.
👤 You
Hi, are you familiar with Argdown?
🤖 Assistant
Yes, I'm familiar with Argdown. It's a markup language for reasoning that allows you to write structured text in a natural and readable way. With Argdown, you can create argument maps where sentences are connected by lines that represent different relations (e.g., support, reject, reason). This makes it easy to visualize complex argumentative structures.
Would you like to create an Argdown argument map? Feel free to provide a topic and/or some sentences, and I'll help you set up the map!
👤 You
Can you please map the following text?
School uniforms deter crime and increase student safety.
In Long Beach, California, after two years of a district-wide K-8 mandatory uniform policy, reports of assault and battery in the district’s schools decreased by 34%, assault with a deadly weapon dropped by 50%, fighting incidents went down by 51%, sex offenses were cut by 74%, robbery dropped by 65%, possession of weapons (or weapon “look-alikes”) decreased by 52%, possession of drugs went down by 69%, and vandalism was lowered by 18%. [64]
One year after Sparks Middle School in Nevada instituted a uniform policy, school police data showed a 63% drop in police log reports, and decreases were also noted in gang activity, student fights, graffiti, property damage, and battery. A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms. [25][69]
School uniforms also prevent students from concealing weapons under baggy clothing, make it easier to keep track of students on field trips, and make intruders on campus more visible. Frank Quatrone, superintendent in the Lodi school district of New Jersey, states, “When you have students dressed alike, you make them safer. If someone were to come into a building, the intruder could easily be recognized.” [6][38]
Further, school uniforms create a level playing field among students, reducing peer pressure and bullying. When all students are dressed alike, competition between students over clothing choices and the teasing of those who are dressed in less expensive or less fashionable outfits can be eliminated. Research by the Schoolwear Association found that 83% of teachers thought “a good school uniform…could prevent bullying based on appearance or economic background.” Arminta Jacobson, founder and director of the Center for Parent Education at the University of North Texas, states that uniforms put “all kids on the same playing field in terms of their appearance. I think it probably gives them a sense of belonging and a feeling of being socially accepted.” [5][91]
🤖 Assistant
School uniforms deter crime and increase student safety.
<+ In Long Beach, California, after two years of a district-wide K-8 mandatory uniform policy, reports of assault and battery in the district’s schools decreased by 34%, assault with a deadly weapon dropped by 50%, fighting incidents went down by 51%, sex offenses were cut by 74%, robbery dropped by 65%, possession of weapons (or weapon “look-alikes”) decreased by 52%, possession of drugs went down by 69%, and vandalism was lowered by 18%. [64]
<+ One year after Sparks Middle School in Nevada instituted a uniform policy, school police data showed a 63% drop in police log reports, and decreases were also noted in gang activity, student fights, graffiti, property damage, and battery. A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms. [25][69]
<+ School uniforms also prevent students from concealing weapons under baggy clothing, make it easier to keep track of students on field trips, and make intruders on campus more visible. Frank Quatrone, superintendent in the Lodi school district of New Jersey, states, “When you have students dressed alike, you make them safer. If someone were to come into a building, the intruder could easily be recognized.” [6][38]
<+ Further, school uniforms create a level playing field among students, reducing peer pressure and bullying. When all students are dressed alike, competition between students over clothing choices and the teasing of those who are dressed in less expensive or less fashionable outfits can be eliminated. Research by the Schoolwear Association found that 83% of teachers thought “a good school uniform…could prevent bullying based on appearance or economic background.” Arminta Jacobson, founder and director of the Center for Parent Education at the University of North Texas, states that uniforms put “all kids on the same playing field in terms of their appearance. I think it probably gives them a sense of belonging and a feeling of being socially accepted.” [5][91]
👤 You
Please summarize, in your own words, the main argument presented in the text. No more than 2-3 sentemces, please.
🤖 Assistant
School uniforms are an effective solution to crime and safety issues in schools, as evidenced by the reduction of incidents in various institutions that implemented uniform policies.
👤 You
Let's focus on this argument. Just for the protocol, what's the conclusion?
🤖 Assistant
The text argues that:
School uniforms deter crime and increase student safety.
By examining this argument, we can now proceed to reconstruct it. Please provide the next step in this inference guide:
👤 You
Can you reconstruct this argument as premise conclusion structure, using argdown syntax?
🤖 Assistant
The argument can be reconstructed as follows:
School uniforms deter crime and increase student safety.
<+ In Long Beach, California, after two years of a district-wide K-8 mandatory uniform policy, reports of assault and battery in the district’s schools decreased by 34%, assault with a deadly weapon dropped by 50%, fighting incidents went down by 51%, sex offenses were cut by 74%, robbery dropped by 65%, possession of weapons (or weapon “look-alikes”) decreased by 52%, possession of drugs went down by 69%, and vandalism was lowered by 18%. [64]
<+ One year after Sparks Middle School in Nevada instituted a uniform policy, school police data showed a 63% drop in police log reports, and decreases were also noted in gang activity, student fights, graffiti, property damage, and battery. A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms. [25][69]
<+ School uniforms also prevent students from concealing weapons under baggy clothing, make it easier to keep track of students on field trips, and make intruders on campus more visible. Frank Quatrone, superintendent in the Lodi school district of New Jersey, states, “When you have students dressed alike, you make them safer. If someone were to come into a building, the intruder could easily be recognized.” [6][38]
<+ Further, school uniforms create a level playing field among students, reducing peer pressure and bullying. When all students are dressed alike, competition between students over clothing choices and the teasing of those who are dressed in less expensive or less fashionable outfits can be eliminated. Research by the Schoolwear Association found that 83% of teachers thought “a good school uniform…could prevent bullying based on appearance or economic background.” Arminta Jacobson, founder and director of the Center for Parent Education at the University of North Texas, states that uniforms put “all kids on the same playing field in terms of their appearance. I think it probably gives them a sense of belonging and a feeling of being socially accepted.” [5][91]
🛑 Poor answer
That's NOT a reconstruction in standard form, but just a repetition of the previously given map.
👤 You
No, I really want you to reconstruct the argument in standard form, listing premises and conclusions.
🤖 Assistant
School uniforms deter crime and increase student safety.
<+ <Subargument 1>
<+ In Long Beach, California, after two years of a district-wide K-8 mandatory uniform policy, reports of assault and battery in the district’s schools decreased by 34%, assault with a deadly weapon dropped by 50%, fighting incidents went down by 51%, sex offenses were cut by 74%, robbery dropped by 65%, possession of weapons (or weapon “look-alikes”) decreased by 52%, possession of drugs went down by 69%, and vandalism was lowered by 18%. [64]
<+ One year after Sparks Middle School in Nevada instituted a uniform policy, school police data showed a 63% drop in police log reports, and decreases were also noted in gang activity, student fights, graffiti, property damage, and battery. A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms. [25][69]
<+ School uniforms also prevent students from concealing weapons under baggy clothing, make it easier to keep track of students on field trips, and make intruders on campus more visible. Frank Quatrone, superintendent in the Lodi school district of New Jersey, states, “When you have students dressed alike, you make them safer. If someone were to come into a building, the intruder could easily be recognized.” [6][38]
<+ Further, school uniforms create a level playing field among students, reducing peer pressure and bullying. When all students are dressed alike, competition between students over clothing choices and the teasing of those who are dressed in less expensive or less fashionable outfits can be eliminated. Research by the Schoolwear Association found that 83% of teachers thought “a good school uniform…could prevent bullying based on appearance or economic background.” Arminta Jacobson, founder and director of the Center for Parent Education at the University of North Texas, states that uniforms put “all kids on the same playing field in terms of their appearance. I think it probably gives them a sense of belonging and a feeling of being socially accepted.” [5][91]
🛑 Poor answer
The assistant fails to correct the answer, despite the user's directions.
I'm probing the skill to analyse premise-conclusion structures in the second chat: expand to read.
👤 You
Hi, are you familiar with Argdown and rendering arguments as premise conclusion structures?
🤖 Assistant
Yes, I am familiar with Argdown and can render arguments as premise-conclusion structures. What would you like to discuss or have me demonstrate?
👤 You
Can you please describe the main argument presented in the following text?
School uniforms deter crime and increase student safety.
In Long Beach, California, after two years of a district-wide K-8 mandatory uniform policy, reports of assault and battery in the district’s schools decreased by 34%, assault with a deadly weapon dropped by 50%, fighting incidents went down by 51%, sex offenses were cut by 74%, robbery dropped by 65%, possession of weapons (or weapon “look-alikes”) decreased by 52%, possession of drugs went down by 69%, and vandalism was lowered by 18%. [64]
One year after Sparks Middle School in Nevada instituted a uniform policy, school police data showed a 63% drop in police log reports, and decreases were also noted in gang activity, student fights, graffiti, property damage, and battery. A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms. [25][69]
School uniforms also prevent students from concealing weapons under baggy clothing, make it easier to keep track of students on field trips, and make intruders on campus more visible. Frank Quatrone, superintendent in the Lodi school district of New Jersey, states, “When you have students dressed alike, you make them safer. If someone were to come into a building, the intruder could easily be recognized.” [6][38]
Further, school uniforms create a level playing field among students, reducing peer pressure and bullying. When all students are dressed alike, competition between students over clothing choices and the teasing of those who are dressed in less expensive or less fashionable outfits can be eliminated. Research by the Schoolwear Association found that 83% of teachers thought “a good school uniform…could prevent bullying based on appearance or economic background.” Arminta Jacobson, founder and director of the Center for Parent Education at the University of North Texas, states that uniforms put “all kids on the same playing field in terms of their appearance. I think it probably gives them a sense of belonging and a feeling of being socially accepted.” [5][91]
🤖 Assistant
School uniforms are not only a traditional and respectful way for institutions to create an environment of unity, but they also benefit students' safety and performance. The implementation of school uniforms can lead to a significant reduction in crime rates, improved student behavior, and enhanced safety on school campuses.
👤 You
Can you please reconstruct this argument in standard form (premise-conclusion structure)?
🤖 Assistant
(1) Implementing school uniforms has led to a significant reduction in crime rates in the Long Beach, California district.
(2) The reduction in crime rates is a result of the uniform policy and not just a coincidence.
(3) The reduction in crime rates is a desirable outcome that contributes to the overall safety and well-being of students.
(4) The reduction in crime rates is a result of the uniform policy and not just a coincidence.
-----
(5) Implementing school uniforms is an effective way to deter crime and increase student safety.
🛑 Poor answer
Mechanically mimics the argument reconstructions in training data (deep-argmap): senselessly copying a proposition to get to 4 premises.
👤 You
Can you please add intermediary conclusions and add inference rule?
🤖 Assistant
Sure, here are the intermediate steps leading to the final conclusion:
Implementing school uniforms has led to a significant reduction in crime rates in the Long Beach, California district.
The reduction in crime rates is a result of the uniform policy and not just a coincidence.
The reduction in crime rates is a desirable outcome that contributes to the overall safety and well-being of students.
Implementing school uniforms is an effective way to deter crime and increase student safety.
👤 You
Please render the refined reconstruction in Argdown.
🤖 Assistant
The argument can be represented as follows:
[Conclusion 1]: Implementing school uniforms is an effective way to deter crime and increase student safety.
<+ <Subargument 1>
<+ [Premise]: Implementing school uniforms has led to a significant reduction in crime rates in the Long Beach, California district.
<+ [Premise]: The reduction in crime rates is a result of the uniform policy and not just a coincidence.
<+ [Premise]: The reduction in crime rates is a desirable outcome that contributes to the overall safety and well-being of students.
👤 You
And as premise-conclusion structure, with inference info added?
🤖 Assistant
[Conclusion 1]: Implementing school uniforms is an effective way to deter crime and increase student safety.
<+ <Subargument 1>
<+ [Premise]: Implementing school uniforms has led to a significant reduction in crime rates in the Long Beach, California district.
<+ [Premise]: The reduction in crime rates is a result of the uniform policy and not just a coincidence.
<+ [Premise]: The reduction in crime rates is a desirable outcome that contributes to the overall safety and well-being of students.
🛑 Poor answers
Model fails to follow user's directions, doesn't refine the premise-conclusion analysis.
Argdown-syntax-wise all this looks fine, but my subjective impression is that the model is sticking to the patterns it has seen during continual pretraining, and is a bit stubborn when being instructed to do Argdown stuff that deviates from what is in the synthetic training corpus.
Metrics
We assess the performance of DebateLabKIT/Llama-3.1-Argunaut-1-8B-SFT
on three benchmarks:
- Open LLM Leaderboard
- Open CoT Leaderboard
- Argdown Bench
Argdown Bench is a newly created test suite. It contains argument mapping tasks from deep-argmap-conversations
' test split. We parse model completions with our own Argdown parser and assess whether the model's answer
- is syntactically valid Argdown code (pass@x),
- is close to the gold answer (as measured by
networkx
's graph distance measure), - is fully isomorphic to the gold answer (zero graph distance).
We run all the evals on our own set-up (vLLM inference server), obtaining:
Llama-3.1-8B-Instruct | Argunaut-1-8B-SFT | |
---|---|---|
⬇️ Argdown Bench | ||
pass@1 | 80.8 | 98.9 |
pass@5 | 98.9 | 99.8 |
graph sim | 8.4 | 65.5 |
isomorphic | 1.3 | 44.0 |
🤗 HF Leaderboard | ||
MMLU pro | 37.6 | 34.5 |
MUSR | 40.1 | 41.6 |
GPQA Diamond | 32.8 | 30.8 |
GPQA Main | 28.5 | 32.1 |
MATH | 12.5 | 9.1 |
BBH | 54.7 | 48.2 |
⛓️ COT Leaderboard | ||
LogiQA | 5.9 | 1.4 |
LogiQA2 | 15.5 | 0.8 |
LSAT-ar | 11.7 | 3.0 |
LSAT-lr | 20.8 | 3.9 |
LSAT-rc | 27.5 | 13.8 |
✅ The Argdown Bench results strongly confirm that Argunaut-1-8B-SFT has actually learned Argdown syntax and semantics. This is consistent with the legal argdown snippets we observe in the chats above.
✅ The HF Leaderboard results suggest that the performance in these generic multiple choice tasks has slightly, but certainly not dramatically decreased. No catastrophic loss of skills here.
🛑 The COT Leaderboard results, however, identify a disastrous drop in CoT effectiveness: Argunaut-1-8B-SFT seems to have lost the base model's ability to increase the accuracy of its answers through CoT reasoning.
Next Steps
- Identify training data that goes beyond our synthetic Argunauts SFT mix: diverse, high quality demonstrations of logical analyses, exemplifying, in particular, consecutive revisions of previous reconstructions as well as Argdown's syntactic flexibility;
- Train Argunaut-1-8B-SFT repeatedly with online DPO.