Trouble pre-training with DCLM and synthetic data
Hi,
We are trying to build a more hardware-friendly version of this model, following the LLM-QAT approach (https://arxiv.org/abs/2305.17888), which trains the model on synthetically generated data (sampled from the softmax output, starting with the BOS token) using a pure distillation loss. This works very well on Phi-3 (3.8B), but we are having trouble with this model.
As per our earlier discussion, generating synthetic data with this method is not really working, so we wanted to use open-source datasets instead. We are looking into FineWeb and DCLM. On DCLM, our training always diverges at some point.
Note that the train loss is the distillation loss.
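For clarity, this is roughly what "distillation loss" means here: the cross-entropy of the student's predicted distribution against the teacher's soft labels at each token position. The snippet below is a minimal pure-Python sketch for a single position, not our actual training code; the function names are illustrative, and subtracting the maximum logit is a common stability measure rather than something specific to our setup.

```python
import math

def softmax(logits):
    # Subtract the max logit before exponentiating to avoid overflow.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits):
    # log-sum-exp computed in a numerically stable way.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def distillation_loss(teacher_logits, student_logits):
    """Cross-entropy of the student against the teacher's soft labels
    for one token position (hypothetical helper, for illustration only)."""
    p = softmax(teacher_logits)          # teacher distribution (soft labels)
    log_q = log_softmax(student_logits)  # student log-probabilities
    return -sum(pi * lqi for pi, lqi in zip(p, log_q))
```

When the student matches the teacher exactly, this loss equals the entropy of the teacher distribution, so it does not go to zero; that is worth keeping in mind when reading the loss curves.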
On FineWeb, it looks a bit better:
(The orange line is the continuation of the purple one.) But you can also see here that, towards the end, the orange line rises to a kind of plateau.
We are training on 96 V100 GPUs (so FP16 only) and use the Adam optimizer with learning-rate warmup and linear decay.
What we have tried so far:
- Different number of warmup steps
- Different learning rates (a lower learning rate only slows down the effect; it still eventually happens)
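
To make the two knobs above concrete, here is a minimal sketch of a warmup-then-linear-decay schedule. It assumes the warmup is also linear, and the function name and all numbers are illustrative, not our actual configuration.

```python
def linear_warmup_decay(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step (hypothetical helper).

    Ramps linearly up to peak_lr over warmup_steps, then decays
    linearly to zero at total_steps.
    """
    if step < warmup_steps:
        # Linear warmup: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Linear decay from peak_lr down to zero.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)

# Example: peak learning rate 1e-4, 1000 warmup steps, 100000 total steps.
lrs = [linear_warmup_decay(s, 1e-4, 1000, 100000) for s in (0, 999, 50500, 100000)]
```

Changing `warmup_steps` or `peak_lr` in a schedule like this is what the two bullets above refer to.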
When I train on the synthetic data (using the "trick" with <endoftext> instead of the actual EOS), the loss diverges much more quickly. This tells me that the quality of the dataset is very important. What surprises me is that I can't make it work on DCLM.
In the pre-processing of DCLM, we just concatenate the different documents and split the result into 4096-sized chunks. We don't insert a separator token or anything else between documents. This is an example:
Shark Reef - Mandalay Bay
Shark Reef - Mandalay Bay
Traveler Rating:
The Place
The Shark Reef transports visitors to an undersea ocean of sights, sounds and encounters. The Shark Reef contains more than 2,000 animals including giant rays, endangered green sea turtles, piranha, jellyfish and sharks.How Denver native Matthew Batt turned a Salt Lake city crackhouse into his home
First-time author and Denver native Matthew Batt tells of his entry into the real world when he and his wife bought a house -- a former crack house -- in Sugarhouse: Turning the Neighborhood Crack House Into Our Home. The tale takes readers on his journey into adulthood, which included stops for dying family membrs, bills, marriage, the oppressive shadow of school and always, the new house.
Batt will be at the Tattered Cover Colfax at 7:30 p.m. tomorrow to sign copies of his book and read from it; Westword caught up with him in advance of that appearance to ask a few questions.
Westword: Why did you decide to write this book?
Matthew Batt: I was a graduate student enrolled at the University of Utah. I was primarily writing fiction. I took a creative nonfiction class with creative and nonfiction writer Robin Hemley, and we had to present a piece to the class. I like to get stuff out of the way, so I offered to go first. I was focused on writing important essays that one thinks one should write, but I was struggling with a topic. All I had around me was stray wood and power tools, and I had to write something. When I actually had to move power tools and wood off my desk to write, I thought, "Oh, why don't I write this?" I always try to imitate other writers, but I have never read anything like this, so I wasn't constrained at all.
How does this book break away from your tendency to act and dress in the way society dictates is proper for your given role?
In terms of literary style, the book allowed me to respond to myself and see how I sound, not unlike an interview question. This book is a pure form of an essay. It is a means to try; not something you write for a college essay, but an endeavor, an attempt. I couldn't put it all down neat and clean because there's something inherently messy about a memoir. It's a work in progress. Writing can be play. It's not something with very precise tools where you can do damage, but it's like a toy store or a sandbox.
Where do you find your inspiration?
Marijuana Deals Near You
From my family. Especially from my really dear friends who are great writers, like the novelists Bruce Machart and Peter Geye. Being a good person and hopefully decent listener comes from my mom and my wife. Jenae is especially my external conscience. If I'm not sure I'm doing the right thing, I run it by her. The other thing is I have a four-, almost five-year-old son. There's nothing to make something fresh and new and weird like having a toddler's eye. It's like having another childhood.
[shortened]
And here is an example from FineWeb:
ADC unit associated with data acquisition. These units inject “quantization nose” since there is quantization conversion uncertainty of ± LSB/2. An N-bit ADC with a sinusoidal signal input has a signal- to-noise ratio. If this ADC is operated in the over sampling mode, then the signal-to-noise ratio is given by SNR = 6.02*N+1.76 dB + Log(OSR), where OSR (over sampling ratio) is defined as the ratio of sampling frequency (fs) to twice the bandwidth limited signal frequency (fo), OSR=fs / 2*fo. These simple examples illustrate that as communication systems become more complex, accurately calculating SNR becomes equally complicated.
Although the SNR as a figure of merit does not generally apply to industrial process instrumentation and control modules, one can calculate a basic number from Dataforth specifications. For example, Dataforth’s SCM5B30 Analog Voltage Input Module, Narrow Bandwidth has a maximum output of 1 VDC (same as RMS) and a maximum noise output of 200 micro-volts, RMS. These specifications give a SNR of 20*log (1÷200E-6) = 74dB, which means a 1 volt output is 5000 times larger than the module noise.
Remember, our Application Engineers can assist you with signal conditioner selection over the phone or via fax and email. Call us at our manufacturing facility in Tucson at 520-741-1404 (fax 520-741-0762) or Email us at [email protected] is the most significant neurological disorder experienced by persons aged over 65 years. Although predominantly associated with increased aging, there are also types of dementia which occur in people under 65 years.
It usually presents as a syndrome of chronic or progressive nature, with changes in
memory, thinking, orientation, comprehension, calculation, learning capacity, language and judgment.
Thus it is a condition which has an impact on many aspects of the lives of both the people with dementia and their family members. It is also likely to result in many encounters with health care professionals across all disciplines and care settings.
The aims of this learning module are to:
- Present the incidence and prevalence of data on dementia
- Provide an overview of the different types of dementia and the pathophysiological
features associated with the different types of dementia
- Highlight the distinctive issues of dementia within Aboriginal & Torres Strait
Islanders communities and Culturally and Linguistically Diverse communities
- Provide an overview of the impact of dementia from the perspective of a person with
dementia and a carer
By the end of this learning module you will have an awareness and understanding of the following:
- Overall trend of incidence and prevalence of dementia across different age groups
- Different types of dementia and pathophysiological features associated with the
different types of dementia
- Distinctive issues of dementia within Aboriginal & Torres Strait Islanders
communities and Culturally and Linguistically Diverse communities
- What a person with dementia and a carer consider the impact of dementia is on
Content Focus Area
This learning module is built around four content focus areas:
- Epidemiological trends of dementia
- Types of dementia and the associated pathophysiological features
- Distinctive issues for special groups
- Impact of dementia on person and his/her family
How to Complete this Module
To complete this learning module work through the activities associated with
the four content focus areas and complete the five learning activities.
References have been provided to enable you to complete the module, while the
range of learning activities provided will facilitate your learning about these
|Your Progress|| |ALBUQUERQUE, N.M. >> Widespread flu activity is being reported in Albuquerque and other parts of New Mexico, and the American Red Cross urges people who have not yet gotten a flu vaccine to get vaccinated now.
The Red Cross also has steps people can take to prevent the spread of the flu virus during this time.
Steps to prevent the flu
The CDC recommends a yearly flu vaccine for everyone six months of age and older as the first and most important step in protecting someone against flu viruses. In addition to getting vaccinated, the Red Cross has some simple steps people can take to help prevent the spread of the flu virus. Parents can also practice these things with their kids to help keep them well:
• Cover the nose and mouth with a tissue or sleeve when coughing or sneezing, and throw the tissue away after use. If a tissue isn't available, cough or sneeze into the elbow, not the hands.
• Wash hands often, especially after coughing or sneezing. If soap and water are not available, use an alcohol-based hand-rub.
• Avoid touching the eyes, nose or mouth.
• Avoid close contact with people who are sick.
• Stay home if sick.
Call the doctor
[shortened]
I notice that there are far fewer "\n" in the FineWeb texts, but apart from that, I don't see any other difference. Also, we don't have any special tokens in there. When we tokenize and pre-process the pre-training data, is there something we need to do? Right now, we just pass it through the tokenizer and concatenate the results.
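
To make the question concrete, here is a minimal sketch of this kind of packing. It is illustrative only, not our actual pipeline code: `pack_documents` and `sep_id` are hypothetical names, the inputs are already-tokenized documents (lists of token ids), and the optional separator is shown only as the alternative we are asking about. With `sep_id=None` it corresponds to what we do today.

```python
def pack_documents(tokenized_docs, chunk_size, sep_id=None):
    """Concatenate tokenized documents and cut them into fixed-size chunks.

    tokenized_docs: iterable of lists of token ids.
    sep_id: optional token id appended after every document
            (hypothetical; None means no separator, as in our current setup).
    The trailing partial chunk is dropped.
    """
    buffer = []
    chunks = []
    for ids in tokenized_docs:
        buffer.extend(ids)
        if sep_id is not None:
            buffer.append(sep_id)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= chunk_size:
            chunks.append(buffer[:chunk_size])
            buffer = buffer[chunk_size:]
    return chunks

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
no_sep = pack_documents(docs, chunk_size=4)
with_sep = pack_documents(docs, chunk_size=4, sep_id=0)
```

Without a separator, a chunk can run straight from the end of one document into the start of the next, which is what the examples above show.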
BTW the med-qa accuracy during training for the fineweb run looks like this:
With our shots and eval method, we get ~35% for the FP16 model.