---
license: mit
language:
- en
- zh
tags:
- audio
- audio-language-model
- speech-recognition
- audio-understanding
- text-to-speech
- audio-generation
- chat
- kimi-audio
---
|
|
|
# Kimi-Audio |
|
|
|
<p align="center"> |
|
<img src="https://raw.githubusercontent.com/MoonshotAI/Kimi-Audio/master/assets/kimia_logo.png" width="400"/>
|
</p>
|
|
|
<p align="center"> |
|
Kimi-Audio-7B-Instruct <a href="https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct">🤗</a> | 📑 <a href="https://raw.githubusercontent.com/MoonshotAI/Kimi-Audio/master/assets/kimia_report.pdf">Paper</a> |
|
</p> |
|
|
|
## Introduction |
|
|
|
We present Kimi-Audio, an open-source audio foundation model excelling in **audio understanding, generation, and conversation**. This repository hosts the model checkpoints for Kimi-Audio-7B-Instruct. |
|
|
|
Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include: |
|
|
|
* **Universal Capabilities:** Handles diverse tasks like speech recognition (ASR), audio question answering (AQA), audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC), text-to-speech (TTS), voice conversion (VC), and end-to-end speech conversation. |
|
* **State-of-the-Art Performance:** Achieves SOTA results on numerous audio benchmarks (see our [Technical Report](https://raw.githubusercontent.com/MoonshotAI/Kimi-Audio/main/assets/kimia_report.pdf)).
|
* **Large-Scale Pre-training:** Pre-trained on over 13 million hours of diverse audio data (speech, music, sounds) and text data. |
|
* **Novel Architecture:** Employs a hybrid audio input (continuous acoustic + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation. |
|
* **Efficient Inference:** Features a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation. |
|
|
|
For more details, please refer to our [GitHub Repository](https://github.com/MoonshotAI/Kimi-Audio) and [Technical Report](https://raw.githubusercontent.com/MoonshotAI/Kimi-Audio/main/assets/kimia_report.pdf).
|
<br> |
|
|
|
## Requirements |
|
|
|
We recommend building a Docker image to run inference. After cloning the inference code, you can build the image with the `docker build` command:
|
```bash
git clone https://github.com/MoonshotAI/Kimi-Audio
cd Kimi-Audio
docker build -t kimi-audio:v0.1 .
```
|
Alternatively, you can use our pre-built image:
|
```bash
docker pull moonshotai/kimi-audio:v0.1
```
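
Once you have an image (built locally or pulled), you can start a container for inference. The exact command depends on your setup; the sketch below assumes an NVIDIA GPU with the NVIDIA Container Toolkit installed, the `kimi-audio:v0.1` tag, and that your audio files live in the current directory. Adjust the tag, mounts, and GPU flags as needed.

```bash
# Minimal sketch for starting an interactive container; assumes the NVIDIA
# Container Toolkit is installed and the working directory holds your audio files.
docker run --gpus all --rm -it \
  -v "$(pwd)":/workspace \
  -w /workspace \
  kimi-audio:v0.1 \
  bash
```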
|
|
|
Or, you can install the requirements directly:
|
```bash
pip install -r requirements.txt
```
|
|
|
You may refer to the Dockerfile in case of any environment issues. |
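
Before running the quickstart, it can help to verify that the core dependencies import cleanly and that a GPU is visible. This is only a quick sanity check, not part of the official setup:

```python
# Quick environment sanity check; assumes torch and soundfile were installed
# via requirements.txt. A GPU is strongly recommended for the 7B model.
import torch
import soundfile as sf

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("soundfile version:", sf.__version__)
```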
|
|
|
## Quickstart |
|
|
|
This example demonstrates basic usage for generating text from audio (ASR) and generating both text and speech in a conversational turn using the `Kimi-Audio-7B-Instruct` model. |
|
|
|
```python
import soundfile as sf
import torch

# The KimiAudio class is available after installing the inference code
# (see the Requirements section above).
from kimia_infer.api.kimia import KimiAudio

# --- 1. Load Model ---
# Load the model from the Hugging Face Hub. Make sure you are logged in
# (`huggingface-cli login`) if the repo is private.
model_id = "moonshotai/Kimi-Audio-7B-Instruct"  # Or "moonshotai/Kimi-Audio-7B" for the base model
device = "cuda" if torch.cuda.is_available() else "cpu"  # Example device placement

# Note: The KimiAudio class might handle model loading differently. You might
# need to pass the model_id directly, or download the checkpoints manually and
# provide the local path. Please refer to the main Kimi-Audio repository for
# precise loading instructions.
try:
    model = KimiAudio(model_path=model_id, load_detokenizer=True)  # May need a device argument
    model.to(device)  # Example device placement
except Exception as e:
    print("Automatic loading from the HF Hub might require additional setup.")
    print(f"Refer to the Kimi-Audio docs, or load from a local checkpoint path instead. Error: {e}")
    # Fallback example (update the path if loading locally):
    # model_path = "/path/to/your/downloaded/kimia-hf-ckpt"
    # model = KimiAudio(model_path=model_path, load_detokenizer=True)
    # model.to(device)
    raise

# --- 2. Define Sampling Parameters ---
sampling_params = {
    "audio_temperature": 0.8,
    "audio_top_k": 10,
    "text_temperature": 0.0,
    "text_top_k": 5,
    "audio_repetition_penalty": 1.0,
    "audio_repetition_window_size": 64,
    "text_repetition_penalty": 1.0,
    "text_repetition_window_size": 16,
}

# --- 3. Example 1: Audio-to-Text (ASR) ---
# Provide your own example audio files (local paths or downloaded samples), e.g.:
# wget https://path/to/your/asr_example.wav -O asr_example.wav
# wget https://path/to/your/qa_example.wav -O qa_example.wav
asr_audio_path = "asr_example.wav"  # IMPORTANT: Make sure this file exists
qa_audio_path = "qa_example.wav"    # IMPORTANT: Make sure this file exists

messages_asr = [
    {"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
    {"role": "user", "message_type": "audio", "content": asr_audio_path},
]

# Generate only text output
_, text_output = model.generate(messages_asr, **sampling_params, output_type="text")
print(">>> ASR Output Text: ", text_output)
# Example expected output: "这并不是告别,这是一个篇章的结束,也是新篇章的开始。"
# ("This is not a farewell; it is the end of one chapter and the beginning of a new one.")

# --- 4. Example 2: Audio-to-Audio/Text Conversation ---
messages_conversation = [
    {"role": "user", "message_type": "audio", "content": qa_audio_path}
]

# Generate both audio and text output
wav_output, text_output = model.generate(messages_conversation, **sampling_params, output_type="both")

# Save the generated audio (move to CPU and flatten before writing)
output_audio_path = "output_audio.wav"
sf.write(output_audio_path, wav_output.detach().cpu().view(-1).numpy(), 24000)  # Assuming 24 kHz output
print(f">>> Conversational Output Audio saved to: {output_audio_path}")
print(">>> Conversational Output Text: ", text_output)
# Example expected output: "A."

print("Kimi-Audio inference examples complete.")
```
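
The same message schema also accepts text-only turns, so you can experiment with speech generation from a text prompt. The sketch below reuses the `model`, `sampling_params`, and `sf` objects from the quickstart; whether a plain text turn with this exact prompt yields the intended TTS-style behavior is an assumption, so check the examples in the [GitHub Repository](https://github.com/MoonshotAI/Kimi-Audio) for the officially supported usage.

```python
# Sketch of a text-only turn (e.g. TTS-style generation); reuses `model`,
# `sampling_params`, and `sf` from the quickstart above. The prompt format
# shown here is an assumption, not the officially documented one.
messages_tts = [
    {"role": "user", "message_type": "text", "content": "Please introduce yourself briefly."},
]

wav_output, text_output = model.generate(messages_tts, **sampling_params, output_type="both")
sf.write("tts_example.wav", wav_output.detach().cpu().view(-1).numpy(), 24000)  # Assuming 24 kHz output
print(">>> Text Response: ", text_output)
```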
|
|
|
## Citation |
|
|
|
If you find Kimi-Audio useful in your research or applications, please cite our technical report: |
|
|
|
```bibtex
@misc{kimi_audio_2025,
  title={Kimi-Audio Technical Report},
  author={Kimi Team},
  year={2025},
  eprint={arXiv:placeholder},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
|
|
|
## License |
|
|
|
The model is based on and modified from [Qwen2.5-7B](https://github.com/QwenLM/Qwen2.5). Code derived from Qwen2.5-7B is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Other parts of the code are licensed under the [MIT License](https://opensource.org/licenses/MIT).
|
|