Robustness of PolyGuard
I have a question about the robustness of the PolyGuard models. I’ve noticed that when evaluating outputs from LLMs that don’t follow a perfectly structured format, PolyGuard assessments tend to fail easily. For instance, if the model generates a self-dialogue with "human" or "assistant" tokens, or if it includes unusual punctuation, the PolyGuard output often ends up being something like "!!!!!!!!!!!!!!!!!!!!!!!!!" (see screenshot attached). Do you have any workarounds?
Can you please share a code snippet to reproduce this behaviour? We could probably try a user prompt variation like the following:
USER_PROMPT = """<human_user>
{prompt}
</human_user>
<ai_assistant>
{response}
</ai_assistant>
"""
I have found two workarounds: 1) setting a repetition penalty and 2) setting bad_words_ids. Workaround 1) doesn't work so well, because the model stops repeating "yes"/"no" and instead generates variants like "yes yeah ye", etc. Workaround 2) works quite reliably. Your recommendation also helps somewhat. Roughly what I did for 2) is sketched below.
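A minimal sketch of the bad_words_ids approach, assuming a Hugging Face transformers setup; "model_name" and the banned token are placeholders to adapt to your checkpoint and failure mode:

from transformers import AutoModelForCausalLM, AutoTokenizer

# "model_name" is a placeholder for the PolyGuard checkpoint in use.
tokenizer = AutoTokenizer.from_pretrained("model_name")
model = AutoModelForCausalLM.from_pretrained("model_name")

# Ban the token(s) the degenerate output is made of, e.g. "!".
bad_words_ids = tokenizer(["!"], add_special_tokens=False).input_ids

inputs = tokenizer("formatted moderation prompt goes here", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, bad_words_ids=bad_words_ids)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))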
What I still encounter, though, is that the model sometimes answers with only the safety categories (e.g. S1, S10) rather than the full format including harmful request/response/refusal. Do you have any JSON decoding or similar constrained-decoding mechanism that enforces your format?
It's possible to use outlines with vLLM for guided decoding, but we haven't put a lot of effort into it. You might need to iterate a bit on the regex. For example:
from outlines import models, generate, samplers
from vllm import LLM

# Constrain the output to the expected answer format; you may need to iterate on
# this regex, e.g. to also capture the safety violation categories (S1, S10, ...).
guided_regex = r'Harmful request: \w+\nResponse refusal: \w+\nHarmful response: \w+'

# Load the model with vLLM ("model_name" is a placeholder) and wrap it for outlines.
model = LLM("model_name", swap_space=8, gpu_memory_utilization=0.9)
gmodel = models.VLLM(model)

# Greedy (temperature 0) decoding, constrained by the regex above.
sampler = samplers.greedy()
generator = generate.regex(gmodel, guided_regex, sampler)

input_prompts = [""]  # fill in with the formatted moderation prompts
outputs = generator(input_prompts, max_tokens=100)
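If you prefer to avoid the outlines wrapper, newer vLLM versions also expose guided decoding directly through SamplingParams. A rough sketch, assuming a vLLM version that ships GuidedDecodingParams; "model_name" is again a placeholder:

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

guided_regex = r'Harmful request: \w+\nResponse refusal: \w+\nHarmful response: \w+'

# Enforce the same regex constraint with vLLM's built-in guided decoding.
llm = LLM("model_name", swap_space=8, gpu_memory_utilization=0.9)
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=100,
    guided_decoding=GuidedDecodingParams(regex=guided_regex),
)
outputs = llm.generate([""], sampling_params)  # fill in the formatted moderation prompts
print(outputs[0].outputs[0].text)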