ELTEX: A Framework for Domain-Driven Synthetic Data Generation
Abstract
We present ELTEX (Efficient LLM Token Extraction), a domain-driven framework for generating high-quality synthetic training data in specialized domains. While Large Language Models (LLMs) have shown impressive general capabilities, their performance in specialized domains like cybersecurity remains limited by the scarcity of domain-specific training data. ELTEX addresses this challenge by systematically integrating explicit domain indicator extraction with dynamic prompting to preserve critical domain knowledge throughout the generation process. We demonstrate ELTEX's effectiveness in the context of blockchain-related cyberattack detection, where we fine-tune Gemma-2B using various combinations of real and ELTEX-generated data. Our results show that the ELTEX-enhanced model achieves performance competitive with GPT-4 across both standard classification metrics and uncertainty calibration, while requiring significantly fewer computational resources. We release a curated synthetic dataset of social media texts for cyberattack detection in blockchain. Our work demonstrates that domain-driven synthetic data generation can effectively bridge the performance gap between resource-efficient models and larger architectures in specialized domains.
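To make the two-stage idea in the abstract concrete, here is a minimal sketch of how explicit domain indicator extraction could feed dynamic prompting for synthetic-message generation. It is illustrative only: `call_llm`, the prompt wording, and the JSON indicator format are assumptions, not the paper's actual implementation.

```python
import json
import random
from typing import Callable, List

def extract_domain_indicators(seed_texts: List[str],
                              call_llm: Callable[[str], str]) -> List[str]:
    """Ask an LLM to surface domain-specific signals (attack types, protocols,
    jargon, entities) present in a small pool of real domain texts."""
    prompt = (
        "You are analyzing social media posts about blockchain security.\n"
        "List the domain-specific indicators (attack types, protocols, jargon, "
        "entities) that appear in the following posts, as a JSON array of strings.\n\n"
        + "\n".join(f"- {t}" for t in seed_texts)
    )
    return json.loads(call_llm(prompt))

def build_generation_prompt(indicators: List[str], n_indicators: int = 3) -> str:
    """Dynamically compose a generation prompt around a random subset of
    indicators so each synthetic sample stays anchored to domain knowledge."""
    chosen = random.sample(indicators, k=min(n_indicators, len(indicators)))
    return (
        "Write a realistic social media post discussing a potential blockchain "
        "cyberattack. Naturally incorporate these domain indicators: "
        + ", ".join(chosen)
        + ". Keep it under 280 characters and do not copy the indicators verbatim."
    )

def generate_synthetic_posts(seed_texts: List[str],
                             call_llm: Callable[[str], str],
                             n_samples: int = 100) -> List[str]:
    """End-to-end sketch: extract indicators once, then sample many prompts."""
    indicators = extract_domain_indicators(seed_texts, call_llm)
    return [call_llm(build_generation_prompt(indicators)) for _ in range(n_samples)]
```

The generated posts would then be mixed with real data to fine-tune a small model such as Gemma-2B, as described in the abstract; the mixing ratios and fine-tuning setup are detailed in the paper itself.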
Community
Hey! We just published the paper and released the dataset of synthetic social media messages for early detection of blockchain-related cyberattacks. Let's discuss!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs (2025)
- WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale (2025)
- Enhancing Domain-Specific Retrieval-Augmented Generation: Synthetic Data Generation and Evaluation using Reasoning Models (2025)
- Synthetic Data Generation Using Large Language Models: Advances in Text and Code (2025)
- RewardDS: Privacy-Preserving Fine-Tuning for Large Language Models via Reward Driven Data Synthesis (2025)
- LawGPT: Knowledge-Guided Data Generation and Its Application to Legal LLM (2025)
- Injecting Domain-Specific Knowledge into Large Language Models: A Comprehensive Survey (2025)