🔥 Transfer a model's chat capability, context length, and knowledge to another model in under a minute, with no training.
Imagine being able to create chat models, expand context windows, and transfer domain-specific knowledge between models, all within minutes. Our approach, built on a combination of diff-based techniques and sigmoid ratio calculations, makes this possible.
By taking the diff between the model carrying the desired trait (long context or chat) and the base model, as well as the diff between the base model and the target model, we can transfer features and expand context efficiently, without extensive training or resources.
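The core diff arithmetic can be sketched in a few lines. For illustration, weights are represented here as flat dicts of floats rather than real tensors, and the model names and numbers are made up; in practice each dict would be a full `state_dict` of tensors.

```python
# Sketch: transfer a feature (e.g. chat behavior) via weight diffs.
# Weights are shown as flat dicts of floats; real models use tensors.

def feature_diff(donor, base):
    """Diff between the model carrying the desired feature and its base."""
    return {k: donor[k] - base[k] for k in base}

def apply_diff(target, diff, ratio=1.0):
    """Add the (optionally scaled) feature diff onto the target's weights."""
    return {k: target[k] + ratio * diff[k] for k in target}

# Hypothetical weights: all three models share the same base.
base_model   = {"w0": 0.10, "w1": -0.20}
chat_model   = {"w0": 0.30, "w1": -0.25}   # base + chat fine-tuning
target_model = {"w0": 0.15, "w1": -0.10}   # base + domain fine-tuning

chat_diff = feature_diff(chat_model, base_model)
merged = apply_diff(target_model, chat_diff)
```

Because only the diff is added, whatever the target model learned on top of the base is left in place; the chat behavior rides along as a delta.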
Our method minimizes model degradation and ensures that only the desired information is captured, resulting in high-quality models that can be created with just a single click. Whether you need a chat model, expanded context, or domain-specific knowledge transfer, our approach offers a rapid and effective solution.
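The post does not spell out the exact sigmoid ratio formula, so the following is only one plausible reading, sketched as a per-parameter weighting: the donor's diff is scaled by a sigmoid of how large it is relative to the target's own drift from the base. All function names and constants here are illustrative assumptions, not the author's published method.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_ratio_merge(target, base, donor, scale=4.0):
    """Hypothetical sigmoid-ratio merge (an assumed form, not the exact method).

    Where the donor's diff dominates the target's own drift from the base,
    more of it is transferred; where the target has already moved far on
    its own, less is applied, limiting degradation of what it learned.
    """
    merged = {}
    for k in base:
        donor_diff  = donor[k] - base[k]
        target_diff = target[k] - base[k]
        # Ratio in [0, 1]: share of total movement contributed by the donor.
        ratio = abs(donor_diff) / (abs(donor_diff) + abs(target_diff) + 1e-8)
        merged[k] = target[k] + sigmoid(scale * (ratio - 0.5)) * donor_diff
    return merged

# Hypothetical single-parameter example.
base   = {"w": 0.0}
donor  = {"w": 0.4}   # base + chat fine-tune
target = {"w": 0.1}   # base + domain fine-tune
merged = sigmoid_ratio_merge(target, base, donor)
```

Since the sigmoid output stays in (0, 1), the merged weight always lands strictly between the target's value and a full application of the donor diff.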
In the blog post below, we dive into the details of our method, provide code examples, and showcase the results achieved with this approach. Get ready to revolutionize your model creation process and unlock new possibilities with this powerful technique.
The idea of training a chat model with desired raw data is incredibly appealing.
However, there is a significant problem with this process. Directly training a chat model with raw data can disrupt its output format.
To solve this issue, the common approach is to create Q/A-formatted datasets. However, this method is time-consuming, costly, and may result in information loss or bias during dataset creation.
So, how can we train on raw data effectively? We can exploit the sequential structure of transformer models like Llama, which are stacks of many layers.
I intentionally treat the latter part of the model as the layers responsible for the output format, and designate the middle-to-late layers as the starting point for raw-data training.
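Designating a layer range can be expressed as a simple predicate over parameter names. The `model.layers.<idx>` naming follows the Hugging Face Llama convention; the split index below is purely illustrative, not a value from the post.

```python
# Sketch: mark which transformer blocks belong to the raw-training region.
# Parameter names follow the Hugging Face Llama convention
# ("model.layers.<idx>..."); the split index is an illustrative choice.

RAW_TRAIN_START = 20   # middle-to-late layers onward (hypothetical)

def layer_index(param_name):
    """Extract the block index from a parameter name, or None for
    non-block parameters like embeddings or the LM head."""
    parts = param_name.split(".")
    return int(parts[2]) if parts[:2] == ["model", "layers"] else None

def trainable_on_raw(param_name):
    """True if this parameter sits in the designated raw-training region."""
    idx = layer_index(param_name)
    return idx is not None and idx >= RAW_TRAIN_START

in_region  = trainable_on_raw("model.layers.25.mlp.up_proj.weight")    # True
out_region = trainable_on_raw("model.layers.3.self_attn.q_proj.weight")  # False
```

In a real training loop, such a predicate would typically drive `requires_grad` flags over `model.named_parameters()` so that only the chosen region receives updates.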
You may think the method is to feed chat data to the later layers and then train the middle-to-late layers on raw data, but that's not the case. Such an approach cannot properly address the problem and may even increase model complexity.
The idea presented above doesn't seem bad, so how can we make good use of it? Let's try using a base model.