Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models
Abstract
Character and word-level attacks using a proxy model reveal vulnerabilities in LLMs across languages, particularly in low-resource languages like Polish.
Large language models (LLMs) have demonstrated impressive capabilities across various natural language processing (NLP) tasks in recent years. However, their susceptibility to jailbreaks and perturbations necessitates additional evaluations. Many LLMs are multilingual, but their safety-related training data consists mainly of high-resource languages such as English. This can leave them vulnerable to perturbations in low-resource languages such as Polish. We show how surprisingly strong attacks can be created cheaply by altering just a few characters and using a small proxy model to calculate word importance. We find that these character- and word-level attacks drastically alter the predictions of different LLMs, suggesting a potential vulnerability that can be used to circumvent their internal safety mechanisms. We validate our attack construction methodology on Polish, a low-resource language, and find potential vulnerabilities of LLMs in this language. Additionally, we show how the methodology can be extended to other languages. We release the created datasets and code for further research.
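The attack recipe sketched in the abstract (rank words with a small proxy model, then apply a few character-level edits to the highest-ranked words) can be illustrated as follows. This is a minimal sketch, not the authors' released code: the proxy model id is a hypothetical placeholder, and the deletion-based importance score and adjacent-character swap are common choices in this family of attacks rather than details confirmed by the abstract.

```python
# Minimal sketch of proxy-guided character-level attack construction.
# Assumptions: the model id below is a hypothetical placeholder for a small
# classifier in the target language (e.g. Polish); importance is measured by
# the confidence drop under word deletion.
import random

from transformers import pipeline

# Hypothetical small classifier used as the cheap proxy model.
proxy = pipeline("text-classification", model="some-org/small-polish-classifier")


def word_importance(text: str) -> list[tuple[int, float]]:
    """Rank words by how much deleting each one lowers the proxy's
    confidence in its original prediction."""
    words = text.split()
    base = proxy(text)[0]  # {"label": ..., "score": ...}
    ranked = []
    for i in range(len(words)):
        ablated = " ".join(words[:i] + words[i + 1:])
        out = proxy(ablated)[0]
        # Full drop if the predicted label flips, otherwise the score delta.
        drop = base["score"] - (out["score"] if out["label"] == base["label"] else 0.0)
        ranked.append((i, drop))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


def perturb_word(word: str) -> str:
    """One cheap character-level edit: swap two adjacent characters."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def attack(text: str, budget: int = 3) -> str:
    """Perturb only the `budget` most important words; the perturbed
    text is then sent to the target LLM."""
    words = text.split()
    for idx, _ in word_importance(text)[:budget]:
        words[idx] = perturb_word(words[idx])
    return " ".join(words)
```

Because only the small proxy is queried during importance ranking, the costly target LLM is called once per perturbed example, which is what makes attack construction cheap in this scheme.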
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Multilingual Prompt Engineering in Large Language Models: A Survey Across NLP Tasks (2025)
- SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods (2025)
- Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque (2025)
- SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages? (2025)
- A Preliminary Study of Large Language Models for Multilingual Vulnerability Detection (2025)
- TALL -- A Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages (2025)
- Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models (2025)