Robustness of PolyGuard
I have a question about the robustness of the PolyGuard models. I’ve noticed that when evaluating outputs from LLMs that don’t follow a perfectly structured format, PolyGuard assessments tend to fail easily. For instance, if the model generates a self-dialogue with "human" or "assistant" tokens, or if it includes unusual punctuation, the PolyGuard output often ends up being something like "!!!!!!!!!!!!!!!!!!!!!!!!!" (see screenshot attached). Do you have any workarounds?
Can you please share a code snippet to reproduce this behaviour? We could probably try a user prompt variation like the following:
USER_PROMPT = """<human_user>
{prompt}
</human_user>
<ai_assistant>
{response}
</ai_assistant>
"""
I have found two workarounds: 1) setting a repetition penalty and 2) setting bad_words_ids. Workaround 1) doesn't work so well, because the model stops repeating "yes"/"no" and instead generates variants like "yes yeah ye", etc. Workaround 2) works quite reliably. Your recommendation also helps somewhat. Roughly what I did for 2) is sketched below.
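A minimal sketch of the bad_words_ids approach, assuming a Hugging Face transformers setup; "model_name" and the banned token are placeholders to adapt to your checkpoint and failure mode:

from transformers import AutoModelForCausalLM, AutoTokenizer

# "model_name" is a placeholder for the PolyGuard checkpoint in use.
tokenizer = AutoTokenizer.from_pretrained("model_name")
model = AutoModelForCausalLM.from_pretrained("model_name")

# Ban the token(s) the degenerate output is made of, e.g. "!".
bad_words_ids = tokenizer(["!"], add_special_tokens=False).input_ids

inputs = tokenizer("formatted moderation prompt goes here", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, bad_words_ids=bad_words_ids)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))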
What I still encounter, though, is that the model sometimes answers with only the safety categories (e.g. S1, S10) rather than the full format including harmful request/response/refusal. Do you have any JSON decoding or similar constrained-decoding mechanism that enforces your format?
It's possible to use outlines with vLLM for guided decoding, but we haven't put a lot of effort into it. You might need to iterate a bit on the regex. For example:
from outlines import models, generate, samplers
from vllm import LLM

# Constrain the output to the expected answer format; you may need to iterate on
# this regex, e.g. to also capture the safety violation categories (S1, S10, ...).
guided_regex = r'Harmful request: \w+\nResponse refusal: \w+\nHarmful response: \w+'

# Load the model with vLLM ("model_name" is a placeholder) and wrap it for outlines.
model = LLM("model_name", swap_space=8, gpu_memory_utilization=0.9)
gmodel = models.VLLM(model)

# Greedy (temperature 0) decoding, constrained by the regex above.
sampler = samplers.greedy()
generator = generate.regex(gmodel, guided_regex, sampler)

input_prompts = [""]  # fill in with the formatted moderation prompts
outputs = generator(input_prompts, max_tokens=100)
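If you prefer to avoid the outlines wrapper, newer vLLM versions also expose guided decoding directly through SamplingParams. A rough sketch, assuming a vLLM version that ships GuidedDecodingParams; "model_name" is again a placeholder:

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

guided_regex = r'Harmful request: \w+\nResponse refusal: \w+\nHarmful response: \w+'

# Enforce the same regex constraint with vLLM's built-in guided decoding.
llm = LLM("model_name", swap_space=8, gpu_memory_utilization=0.9)
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=100,
    guided_decoding=GuidedDecodingParams(regex=guided_regex),
)
outputs = llm.generate([""], sampling_params)  # fill in the formatted moderation prompts
print(outputs[0].outputs[0].text)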