arxiv:2510.04016

Thai Semantic End-of-Turn Detection for Real-Time Voice Agents

Published on Oct 5

· Submitted by

Saksorn Ruangtanusak on Oct 7

Upvote

Authors:

Thanapol Popit ,

Saksorn Ruangtanusak

Abstract

Real-time Thai text-only end-of-turn detection using zero-shot and few-shot prompting of compact LLMs and lightweight transformers achieves near-instant accuracy suitable for on-device agents.

AI-generated summary

Fluid voice-to-voice interaction requires reliable and low-latency detection of when a user has finished speaking. Traditional audio-silence end-pointers add hundreds of milliseconds of delay and fail under hesitations or language-specific phenomena. We present, to our knowledge, the first systematic study of Thai text-only end-of-turn (EOT) detection for real-time agents. We compare zero-shot and few-shot prompting of compact LLMs to supervised fine-tuning of lightweight transformers. Using transcribed subtitles from the YODAS corpus and Thai-specific linguistic cues (e.g., sentence-final particles), we formulate EOT as a binary decision over token boundaries. We report a clear accuracy-latency tradeoff and provide a public-ready implementation plan. This work establishes a Thai baseline and demonstrates that small, fine-tuned models can deliver near-instant EOT decisions suitable for on-device agents.

View arXiv page View PDF Add to collection

Community

saksornr

Paper author Paper submitter about 18 hours ago

Fluid voice-to-voice interaction requires reliable and low-latency detection
of when a user has finished speaking. Traditional audio-silence end-pointers
add hundreds of milliseconds of delay and fail under hesitations or
language-specific phenomena. We present, to our knowledge, the first systematic
study of Thai text-only end-of-turn (EOT) detection for real-time agents. We
compare zero-shot and few-shot prompting of compact LLMs to supervised
fine-tuning of lightweight transformers. Using transcribed subtitles from the
YODAS corpus and Thai-specific linguistic cues (e.g., sentence-final
particles), we formulate EOT as a binary decision over token boundaries. We
report a clear accuracy-latency tradeoff and provide a public-ready
implementation plan. This work establishes a Thai baseline and demonstrates
that small, fine-tuned models can deliver near-instant EOT decisions suitable
for on-device agents.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.04016 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2510.04016 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2510.04016 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.