Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth
Abstract
LLMs struggle to understand the nuanced, context-dependent meanings of Drivelological text, which appears nonsensical but contains deeper semantic layers.
We introduce Drivelology, a unique linguistic phenomenon characterised as "nonsense with depth": utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. While such expressions may resemble surface-level nonsense, they encode implicit meaning that requires contextual inference, moral reasoning, or emotional interpretation. We find that current large language models (LLMs), despite excelling at many natural language processing (NLP) tasks, consistently fail to grasp the layered semantics of Drivelological text. To investigate this, we construct a small but diverse benchmark dataset of over 1,200 meticulously curated examples, with select instances in English, Mandarin, Spanish, French, Japanese, and Korean. Annotation was especially challenging: each example required careful expert review to verify that it truly reflected Drivelological characteristics, and disagreements were resolved through multiple rounds of discussion and adjudication, highlighting the subtle and subjective nature of Drivelology. We evaluate a range of LLMs on classification, generation, and reasoning tasks. Our results reveal clear limitations: models often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss the implied rhetorical function altogether. These findings highlight a deep representational gap in LLMs' pragmatic understanding and challenge the assumption that statistical fluency implies cognitive comprehension. We release our dataset and code to facilitate further research in modelling linguistic depth beyond surface-level coherence.
Community
Introducing Drivelology (幹話文學): a new linguistic phenomenon we define as "nonsense with depth." Our EMNLP 2025 (oral) paper presents a stress test with 1,200+ examples across 5 Drivelology types, revealing distinct failure modes in state-of-the-art LLMs.
Very impressive paper. There seems to be a large gap when translating from Chinese to English; I hope there are further studies in this area.
@Harikyusocials Thanks for the thoughtful question! The gap you're noticing isn't just a Mandarin → English issue. Many Drivelology examples are deliberately "nonsense with depth": syntactically coherent but culturally loaded, paradoxical, or rhetorically subversive. That means some phrases depend heavily on prior cultural knowledge, social cues, or even irony embedded in everyday life.
When such examples are translated, the literal words can cross languages, but the Drivelological sense (the multi-layered humour, paradox, or social critique) often does not. For example, a pun, proverb inversion, or culturally embedded reference may only resonate with readers who share that cultural background. This isn't limited to Mandarin; similar issues arise in other languages as well.
So the difficulty is less about "translation quality" and more about how Drivelology encodes meaning at multiple levels, with implicit cultural or rhetorical signals that don't always carry over neatly. That's exactly why the paper emphasises Drivelology as a benchmark: it highlights the deep gap between surface fluency and genuine cultural-semantic understanding.
Hi, I understand the complexity of what this data conveys (it needs a reader who is fluent not just in the language but also in its landscape and cultural background). What really strikes me is the sources and some of the analysis. Generally speaking, I wouldn't agree with 40-45 percent of any of the Spanish results, and some of the puns depend quite particularly on our use of exclamation and question marks, which are missing here, though the result still gives an interesting interpretation. (Nowadays we drop ¿ and ¡ on touch screens out of laziness, so those texts lack good punctuation, and you can mangle a phrase pretty badly without them; grammatical gender agreement is also strictly required.) There is a certain cultural interaction between Mexican and European (Castilian?) Spanish, but in about ten minutes I found two context or paradox assignments referring to Spain's modern history that really are too edgy. I could probably trace the sources, and they are not originally Spanish but some French way of speaking about our pre-democracy era; it's not a problem of taste, but they genuinely made me cringe, because I have never heard a joke anything like them. Some other texts that are not Spanish but contain some Spanish wordplay have typos (mosa → mosca), and the result is resolved with the best guess closest to the context, making good assumptions, but again, that is filling holes in bad initial data.
@SrRooT Thank you very much for taking the time to read our work and for sharing such thoughtful feedback. We really appreciate your close engagement with the Spanish samples.
All of the Drivelology examples in our dataset were collected directly from the Internet without modification, which explains why some contain typos (e.g., mosa → mosca) or non-standard punctuation. We recognise that this can sometimes distort the intended wordplay, and as you note, the models (and sometimes annotators) end up filling gaps from imperfect input.
On the cultural side, you are absolutely right that some samples reflect interactions between Mexican and European Spanish (we are not sure whether it is specifically Castilian), and that certain paradoxical or edgy references may be rooted in non-Spanish (e.g., French) framings of Spain's modern history. These are important observations: we deliberately did not filter out such cases, precisely because our goal was to capture how Drivelology circulates online "as-is," even if the humour lands awkwardly or uncomfortably in certain cultural contexts.
As we mention in the Limitations section of the paper, this work is ongoing, and the dataset will be expanded and refined. At present, we did not distinguish among Spanish varieties (Mexican vs. Castilian, etc.) mainly because we lacked annotators with both regional expertise and familiarity with Drivelological conventions. Even within other languages, many native speakers still struggle to interpret these texts (i.e. Drivelology in their own language), since the humour often depends on very specific rhetorical cues.
We very much welcome contributions from the community. If you or others are interested, we would be delighted to receive contributions or pull requests on our GitHub repository to help improve coverage and representation of Spanish varieties.
Why haven't you used any SOTA LLMs in your study? Isn't that a major flaw, given that the smallest LLMs are mostly useless at understanding more complex language constructions anyway?
@Tomoomo When we conducted this study back in early 2025, we were definitely aiming to use the best models available. At the time, models like GPT-4.5 and Claude 3.7 were the top of the line. However, as we mentioned in our paper's Limitations section, running experiments at our scale with these models was unfortunately beyond our budget. The API costs are incredibly high, and we were often paying out-of-pocket to keep the project going. On the open-source side, we were limited by our hardware. We ran the experiments on a single RTX 4090 machine, and the Qwen3-14B model was the largest we could reliably fit and test. We would have loved to include even larger models if we had had the resources.
Regarding your second point that smaller LLMs are "mostly useless" for complex language, our findings actually paint a more intricate and, we think, more compelling picture. The idea that performance simply scales with size isn't what we observed across the board. For instance, let's compare the performance of our open-source models against the larger proprietary ones on what was arguably our most difficult task: the "Hard" Narrative Selection (MCQA). As shown in Table 1, our qwen3-8b-instruct model achieved an accuracy of 26.78%. This result was a significant outlier, outperforming both claude-3.5-haiku (11.56%) and gpt-4o-mini (4.67%). This demonstrates that a smaller, 8B-parameter model can develop a specialised capability for subtle reasoning that surpasses even much larger, state-of-the-art models in a specific domain.

Furthermore, when we analysed performance within the Qwen3 model family (from 4B to 14B parameters), the "bigger is always better" narrative was again challenged. As detailed in Table 2, the scaling effects were not linear. For the Drivelology Detection task, the 8B model, when prompted in Mandarin, achieved 78.81% accuracy, significantly outperforming the larger 14B model's 71.78%. We saw another non-linear pattern in the Tagging task, where performance dipped at the 8B size before recovering at 14B.
Taken together, these results suggest that the relationship between model size and the ability to understand complex Drivelological language is not a straight line. Rather than being "useless," smaller and mid-sized models can possess unique and sometimes superior capabilities for specific linguistic tasks.
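For anyone who wants to poke at this behaviour themselves, here is a minimal sketch of the kind of binary detection loop described above. To be clear, the prompt wording, the YES/NO answer format, and the toy examples are illustrative placeholders rather than our exact experimental setup, and Qwen/Qwen3-8B is simply a stand-in for the open-source models we tested:

```python
# Minimal sketch of a binary Drivelology-detection loop.
# Assumptions (not from the paper): the prompt wording, the YES/NO answer
# format, and the toy (text, label) pairs below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen3-8B"  # stand-in for the open-source models discussed above

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

PROMPT = (
    "Decide whether the following text is Drivelology (nonsense with depth) "
    "or plain nonsense. Answer with exactly one word: YES or NO.\n\n"
    "Text: {text}\nAnswer:"
)

def classify(text: str) -> bool:
    """Return True if the model labels the text as Drivelology."""
    messages = [{"role": "user", "content": PROMPT.format(text=text)}]
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        enable_thinking=False,  # Qwen3-specific: skip the thinking preamble
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=8, do_sample=False)
    answer = tokenizer.decode(
        output[0][input_ids.shape[-1]:], skip_special_tokens=True
    )
    return answer.strip().upper().startswith("YES")

# Toy accuracy computation over illustrative (text, is_drivelology) pairs.
examples = [
    ("The best cure for insomnia is to sleep on it.", True),
    ("Purple seven the banana of walks.", False),
]
correct = sum(classify(text) == label for text, label in examples)
print(f"Accuracy: {correct / len(examples):.2%}")
```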
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- MATA (māta): Mindful Assessment of the Telugu Abilities of Large Language Models (2025)
- PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play? (2025)
- Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs (2025)
- Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish (2025)
- Absher: A Benchmark for Evaluating Large Language Models Understanding of Saudi Dialects (2025)
- SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models (2025)
- KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts (2025)