---
license: apache-2.0
datasets:
- Subh775/formatted-hindi-hinglish-cot
language:
- en
- hi
base_model:
- unsloth/Mistral-Small-Instruct-2409
pipeline_tag: text-generation
library_name: adapter-transformers
tags:
- LoRA
- text-generation-inference
- unsloth
---

## Inference Instructions:
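
The snippets below install Unsloth, load the LoRA adapter (base model: `unsloth/Mistral-Small-Instruct-2409`) in 4-bit, and run a simple streaming chat loop; a CUDA-capable GPU is assumed.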

```python
!pip install unsloth
```

```python
from unsloth import FastLanguageModel
from transformers import TextStreamer
import torch

# Load your fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="QuantumInk/Mistral-small-12B-Hinglish-cot",
    max_seq_length=2048,
    load_in_4bit=True
)
FastLanguageModel.for_inference(model)

# Streamer for real-time decoding (skip_prompt avoids re-printing the prompt/history)
text_streamer = TextStreamer(tokenizer, skip_prompt=True)

# Alpaca prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input_text}
### Response:
{output}"""
```
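
For a quick single-turn test, you can format one prompt and stream the completion directly. This is a minimal sketch that reuses the objects defined above; the instruction and question are placeholders.

```python
# Single-turn generation (assumes model, tokenizer, text_streamer and
# alpaca_prompt from the previous cell; example instruction/input are placeholders)
prompt = alpaca_prompt.format(
    instruction="Answer the question in Hinglish.",
    input_text="Machine learning kya hota hai?",
    output=""
)
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
_ = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    streamer=text_streamer
)
```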

```python
# Chat loop with memory
def chat():
    print("💬 Chat with Qwen-2.5-Hindi-Hinglish-COT! Type '\\q' or 'quit' to exit.\n")

    chat_history = ""  # Full chat history with prompts and responses

    while True:
        user_input = input("➤ ")

        if user_input.lower() in ["\\q", "quit"]:
            print("\n👋 Exiting chat. Goodbye!")
            break

        # Format the current prompt
        current_prompt = alpaca_prompt.format(
            instruction="Continue the following conversation.",
            input_text=user_input,
            output=""
        )

        # Add to full chat history
        chat_history += current_prompt + "\n"

        # Tokenize the full prompt
        inputs = tokenizer([chat_history], return_tensors="pt").to("cuda")

        print("\n🤖: ", end="")  # Prepare for streaming output

        # Generate response using streamer
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            no_repeat_ngram_size=2,
            streamer=text_streamer
        )

        # Decode the full output and keep only the newest response
        # (the text after the last "### Response:" marker)
        full_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
        response = full_output.split("### Response:")[-1].strip()

        # Add response to chat history
        chat_history += f"{response}\n"

# Run the chat
chat()
```