Text Summarization
The model used in this summarization task is a T5 Transformer-based language model fine-tuned for abstractive summarization. It generates summaries by treating text summarization as a text-to-text problem, where both the input and the output are sequences of text.
Model Details
The model is a Transformer-based encoder-decoder (e.g., T5 or a similar model) fine-tuned for abstractive summarization, framing summarization as text-to-text generation.
Architecture:
Model Type: Transformer-based encoder-decoder (e.g., T5 or BART)
Pretrained Model: The model uses a pretrained tokenizer and model from the Hugging Face transformers library (e.g., T5ForConditionalGeneration).
Tokenization: Text is tokenized using a subword tokenizer, where long words are split into smaller, meaningful subwords. This helps the model handle a wide variety of inputs, including rare or out-of-vocabulary words.
Input Processing: The model processes the input sequence by truncating or padding the text to fit within the max_input_length of 512 tokens.
Output Generation: The model generates the summary through a text generation process using beam search with a beam width of 4 to explore multiple possible summary sequences at each step.
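As a minimal loading sketch with the Hugging Face transformers library (the "t5-base" checkpoint name is an assumption for illustration; the actual fine-tuned weights may be published under a different name):

```python
# Minimal loading sketch with the Hugging Face transformers library.
# The "t5-base" checkpoint name is an assumption, not taken from this card.
from transformers import AutoTokenizer, T5ForConditionalGeneration

checkpoint = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)
```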
Key Parameters:
Max Input Length: 512 tokens. Ensures the input text is truncated or padded to fit within the model's processing capacity.
Max Target Length: 128 tokens. Restricts the length of the generated summary, balancing concise output with content preservation.
Beam Search: Uses a beam width of 4 (num_beams=4) to explore multiple candidate sequences during generation, helping the model choose the most probable summary.
Early Stopping: The generation process stops early if the model predicts the end of the sequence before reaching the maximum target length.
Generation Process:
Input Tokenization: The input text is tokenized into subword units and passed into the model.
Beam Search: The model generates the next token by keeping the 4 highest-scoring candidate sequences (num_beams=4) at each step, aiming to find the most probable summary sequence.
Output Decoding: The generated summary is decoded from token IDs back into human-readable text using the tokenizer, skipping special tokens like padding or end-of-sequence markers.
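Taken together, the key parameters and the generation steps above correspond to a pipeline along the following lines. This is a sketch under the stated settings (512 input tokens, 128 output tokens, 4 beams, early stopping); the checkpoint name and the "summarize: " task prefix are assumptions rather than details taken from this card.

```python
# End-to-end sketch of the generation process: tokenize, run beam search, decode.
# Parameter values mirror the settings listed above; the checkpoint name and
# the "summarize: " prefix are assumptions.
from transformers import AutoTokenizer, T5ForConditionalGeneration

checkpoint = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)


def summarize(text: str) -> str:
    # 1. Input tokenization: subword units, truncated to the 512-token limit.
    inputs = tokenizer(
        "summarize: " + text,
        max_length=512,
        truncation=True,
        return_tensors="pt",
    )
    # 2. Beam search with num_beams=4, capped at 128 output tokens,
    #    stopping early once all beams reach end-of-sequence.
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=128,
        num_beams=4,
        early_stopping=True,
    )
    # 3. Output decoding: token IDs back to text, skipping special tokens.
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)


print(summarize("Full text of a press release ..."))
```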
Objective:
The model is designed for abstractive summarization, where the goal is to generate a summary that conveys the most important information from the input text in a fluent, concise manner, rather than simply extracting text.
Performance:
The use of beam search improves the coherence and fluency of the generated summary by exploring multiple possibilities rather than relying on a single greedy prediction.
The model's output is evaluated using metrics such as ROUGE, which measures overlap with reference summaries, or other task-specific evaluation metrics.
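As a hedged example of that evaluation step, ROUGE scores can be computed with the Hugging Face evaluate library; the strings below are placeholders rather than actual model outputs.

```python
# Score a generated summary against a reference with ROUGE.
# Requires the `evaluate` and `rouge_score` packages.
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["generated summary of a press release"],
    references=["human-written reference summary"],
)
print(scores)  # rouge1 / rouge2 / rougeL / rougeLsum scores
```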
Training Details
The summarization model was trained on a dataset of press releases scraped from various party websites. These press releases were selected to represent diverse political perspectives and topics, ensuring that the model learned to generate summaries across a wide range of political content.
Data Collection:
Source: Press releases from official party websites, which often contain detailed statements, policy announcements, and responses to current events. These documents were chosen because of their structured format and consistent language use.
Preprocessing: The scraped text was cleaned by removing extraneous HTML tags and irrelevant information and by ensuring that the text content was well formatted for model training.
Text Format: The press releases were processed into suitable text pairs: the original full text as the input and a human-crafted summary (if available) or a custom summary generated by the developers as the target output.
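A hypothetical sketch of this cleaning and pairing step is shown below; the function name and the HTML handling are illustrative assumptions, since the actual scraping and preprocessing code is not published with this card.

```python
# Illustrative cleaning step for a scraped press release: strip markup,
# drop non-content elements, normalise whitespace, and pair the cleaned
# text with its target summary. Names and structure are assumptions.
import re
from bs4 import BeautifulSoup


def clean_press_release(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    # Remove elements that carry no press-release content.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    # Collapse whitespace left behind by the removed markup.
    return re.sub(r"\s+", " ", text).strip()


# One training pair: full cleaned text as input, summary as target.
example = {
    "document": clean_press_release("<html><body><p>Full press release ...</p></body></html>"),
    "summary": "Reference summary written for this release.",
}
```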
Training Objective:
The model was fine-tuned using these press releases to learn the task of abstractive summarization: generating concise, fluent summaries of longer political texts.
The model was trained to capture key information and context, while avoiding irrelevant details, ensuring that it could produce summaries that accurately reflect the essence of each release.
Training Strategy:
Supervised Learning: The model was trained using supervised learning, where each input (press release) was paired with a corresponding summary, enabling the model to learn the mapping from a long document to a short, concise summary.
Optimization: During training, the model's parameters were adjusted using gradient descent and the cross-entropy loss function, which penalizes incorrect predictions and encourages the generation of summaries that match the target.
This training process allowed the model to learn not only the specific language patterns commonly found in political press releases but also the broader context of political discourse.
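A minimal fine-tuning sketch along these lines, using the standard Seq2SeqTrainer recipe from transformers, is shown below. The tiny in-memory dataset, the hyperparameter values, and the output directory name are assumptions for illustration; the card does not publish the actual training configuration.

```python
# Hedged fine-tuning sketch with the standard Seq2SeqTrainer recipe.
# The placeholder dataset and all hyperparameter values are assumptions.
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5ForConditionalGeneration,
)

checkpoint = "t5-base"  # assumed starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

# Placeholder data: each example pairs a press release with its summary.
raw = Dataset.from_dict({
    "document": ["Full text of press release one ...", "Full text of press release two ..."],
    "summary": ["Short summary of release one.", "Short summary of release two."],
})


def preprocess(batch):
    # Tokenize inputs (512 tokens) and target summaries (128 tokens); the
    # cross-entropy loss over the label tokens is applied during training.
    model_inputs = tokenizer(batch["document"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="t5-press-release-summarizer",  # illustrative name
    learning_rate=3e-5,                        # assumed value
    per_device_train_batch_size=8,             # assumed value
    num_train_epochs=3,                        # assumed value
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```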