Pythia-70M Wikipedia Paragraphs Text Generation Model

Model Description

This model is a fine-tuned version of EleutherAI/pythia-70m-deduped, trained on paragraphs from Wikipedia. It is designed for open-ended text generation, particularly of prose.

The base model is a 70 million parameter language model from the Pythia family, which was then fine-tuned on a dataset of Wikipedia paragraphs for 50 epochs. This fine-tuning process adapted the model to generate text that more closely resembles the style and content found in Wikipedia articles.

Intended Uses & Limitations

This model is intended for:

  • Generating Wikipedia-style paragraphs on various topics
  • Assisting in drafting encyclopedic content
  • Text completion tasks for educational or informational writing
  • Further fine-tuning on paragraph-length texts

Limitations:

  • The model can generate incoherent or factually incorrect text.
  • It may struggle with very specialized or technical topics not well-represented in the training data.
  • The model's knowledge is limited to its training data, so it has no information about events after that data was collected.

Usage

This model can be used for text generation with the Hugging Face transformers library. The following Python example loads the model and samples three continuations of a prompt:

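from transformers import pipeline, set_seed

# Fix the random seed so that sampled outputs are reproducible
set_seed(1234567)

# Load the fine-tuned model; use device='cpu' if no GPU is available
generator = pipeline('text-generation', model='agentlans/pythia-70m-wikipedia-paragraphs', device='cuda')

# Sample three continuations of the prompt
results = generator("Your text here...", max_length=3000, num_return_sequences=3, do_sample=True)

for x in results:
    print(x['generated_text'] + '\n')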

Parameters for Customization

You can customize the generation process using various parameters:

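  • max_length: Maximum length of the output in tokens, including the prompt.
  • num_return_sequences: Number of sequences to return.
  • temperature: Controls randomness; lower values make the output more deterministic.
  • top_k and top_p: Control which tokens are eligible for sampling. top_k keeps only the k most likely tokens, and top_p keeps the smallest set of tokens whose cumulative probability reaches p.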
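
To make the effect of temperature, top_k and top_p concrete, here is a minimal toy sketch in plain Python. It is an illustration of the general technique on a four-token vocabulary, not the transformers library's implementation; the function name sampling_distribution and the example logits are hypothetical.

```python
import math


def sampling_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature: divide logits before the softmax.
    # Values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        order = order[:top_k]
    mass = sum(probs[i] for i in order)

    # Top-p: keep the smallest prefix whose cumulative probability,
    # measured on the distribution left after top-k, reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


# Hypothetical logits for a four-token vocabulary.
toy_logits = [2.0, 1.0, 0.0, -1.0]
print(sampling_distribution(toy_logits, top_k=2))
print(sampling_distribution(toy_logits, temperature=0.5))
```

In this toy setting, lowering the temperature moves probability toward the most likely token, while top_k and top_p remove low-probability tokens from consideration entirely.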

Sample Output

Call set_seed(1234567) before generating to reproduce these results.

Input text: "Soon we dropped into a living forest, where cold-tolerant evergreens and boreal animals still evoke the Canadian heritage of an ecosystem pushed south by glaciers 20,000 years ago." Output:
  • Soon we dropped into a living forest, where cold-tolerant evergreens and boreal animals still evoke the Canadian heritage of an ecosystem pushed south by glaciers 20,000 years ago. The next day we had to change our climate again and we worked with a new science, "our grasslands", we could have as well the most suitable habitat suitable for those species we have, and we are open to natural vegetation. Since the old ground has also been destroyed. The vegetation that weaver protect us is the natural habitat... We have done another good day of good work in both the northern and northern regions, so our work would provide a safe habitat for species and ecosystems. This ecosystem is where the natural ecosystems we inhabit with great potential is dependent on its ecosystem services and ecosystems.
  • Soon we dropped into a living forest, where cold-tolerant evergreens and boreal animals still evoke the Canadian heritage of an ecosystem pushed south by glaciers 20,000 years ago. The Canadian people who decided to migrate west through British Columbia were very rich in their wild products, with thousands of young people living nearby and uprooted for as long time. The Fraser settlers were mostly immigrants from the time their arrival in Canada and their wives took refuge with their ancestral relatives and settled on the forest of Gewa. The Aborigines and the Aborigines worked to make their way to the land they are, but the Aboriginal people had left their lands and the villages to them. During their time of arrival the Australians made up an inland strip of land known as the Little Burnocks, they were attacked by the Aborigines who came along their turn.
  • Soon we dropped into a living forest, where cold-tolerant evergreens and boreal animals still evoke the Canadian heritage of an ecosystem pushed south by glaciers 20,000 years ago. This is a very rich ecological history of the ecosystems in the eastern forests of the mountainous Sierra Moultrop, which is only part of the extensive forests of the region which in its east, which have been populated by the evergreen forest biomes for centuries. Some parts of the wild world of these forests are located in the forest around the village of Huebelof, a climate well defined and accessible region. The landscapes of this region are of particular interest in biodiversity including the vegetation of many regions and biodiversity of the forests. In addition to the great biodiversity of all the biodiversity out there, there are numerous biodiversity forests, like the extinct forest, the most biodiversity in the region. It does not cover a part of the forests found in the regions and are understudied or otherwise, by the very first decade of the forest degradation it has in fact occurred in natural history.

Input text: "The important thing was that I had an onion on my belt, which was the style at the time. I didn't have any white onions, because of the war. The only thing you could get was those big yellow ones." Output:
  • The important thing was that I had an onion on my belt, which was the style at the time. I didn't have any white onions, because of the war. The only thing you could get was those big yellow ones. I looked at the stove. I said I couldn't make a dough in any kind, but the room was small enough with the stove when I went to the bottom of the stove, and, for a good night like work, I made them at the stove in the oven. The room was so hot that I could not bring the stove to cook, and I had only made a few short changes in the stove but I only did the cooking for a few days. This kind of experiment, however, had to wait and we were going up to the stove. I was amazed by the manner instructions, and I felt quite uncomfortable with the stove and, after some years of cooking, I thought that it would work better than when I wanted to get food and was ready to eat. But I couldn't get as much time. I also saw, I kept burning bread, and when I found the hot stove, I just had a cold.
  • The important thing was that I had an onion on my belt, which was the style at the time. I didn't have any white onions, because of the war. The only thing you could get was those big yellow ones. I went to the house where I had made him. Then, I got to do something different, but now I knew so much about it and decided to take things first. The first time I finished my game, they worked at a restaurant and I never told them they were going to try a cook, and I kept going to the kitchen, then they would do it. Then some of the ingredients would work in the oven where they were cooking. Then we went to cook the cook and he made the dish in the oven, the cook only two nights."
  • The important thing was that I had an onion on my belt, which was the style at the time. I didn't have any white onions, because of the war. The only thing you could get was those big yellow ones. This is just something that I would try to soak in the sunshine. The bread I had for a little time to do, just get stuck in the end, and I had a very long time to get things done.

Training Data

The model was fine-tuned on a dataset of paragraphs extracted from English Wikipedia. See agentlans/wikipedia-paragraphs for details.

Training Procedure

Training Hyperparameters

  • Learning rate: 5e-05
  • Batch size: 8
  • Optimizer: Adam
  • LR scheduler: Linear
  • Number of epochs: 50
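
As a rough illustration of what a linear scheduler does with the listed learning rate, the sketch below computes the rate at a given step. It assumes no warmup and a hypothetical total of 1,000 training steps; neither value is stated in this card, and the function name linear_lr is illustrative only.

```python
def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr (unused when warmup_steps is 0).
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# total_steps=1000 is a made-up value for illustration.
for step in (0, 250, 500, 750, 1000):
    print(step, linear_lr(step, total_steps=1000))
```

Under these assumptions the rate starts at 5e-05 and falls in a straight line to zero at the final step.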

Evaluation Results

The model achieved the following results on the evaluation set:

  • Loss: 4.3424
  • Accuracy: 0.2728
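  • Perplexity: 31.26 (reported as exp(loss); note that exp(4.3424) ≈ 76.9, whereas exp(3.4424) ≈ 31.26, so the loss and perplexity listed here are not consistent with each other)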
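
Perplexity is the exponential of the mean cross-entropy loss, so the two figures can be checked against each other. The following sketch does that check; it assumes the loss is measured in nats (natural logarithm), and the function name perplexity is illustrative.

```python
import math


def perplexity(mean_cross_entropy_loss):
    """Perplexity for a mean per-token cross-entropy loss given in nats."""
    return math.exp(mean_cross_entropy_loss)


# Loss value as listed in the evaluation results.
print(round(perplexity(4.3424), 2))
# Loss value that would correspond to a perplexity of about 31.26.
print(round(perplexity(3.4424), 2))
```

A loss of 4.3424 corresponds to a perplexity of roughly 76.9, while a perplexity of 31.26 corresponds to a loss of about 3.4424.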

Ethical Considerations

When using this model, consider:

  • Potential biases present in Wikipedia content may be reflected in the model's outputs.
  • The model may generate plausible-sounding but incorrect information, so fact-checking is essential.
  • Use of the model to generate misleading or false information should be avoided.

Additional Information

For more details on the base Pythia model, refer to the EleutherAI/pythia-70m-deduped model card.

Model size: 70.4M parameters (Safetensors, tensor type F32)