---
library_name: "tensorflow"
language:
- ary
tags:
- text-normalization
- darija
- moroccan-arabic
- character-level
- lstm
inference: false
license: apache-2.0
---

# Darija Text Normalization Model

This repository contains a **Sequence-to-Sequence LSTM** model trained to normalize Darija text. The model converts noisy or informal Darija into a standardized format using character-level tokenization.
## Model Details

- **Architecture:** Encoder-Decoder LSTM (Sequence-to-Sequence)
- **Task:** Text Normalization
- **Language:** Darija (Moroccan Arabic)
- **Input Tokenizer:** Character-level
- **Target Tokenizer:** Character-level
- **Embedding Dimension:** 50
- **Latent Dimension (LSTM Units):** 128
- **Training Data:** [Darija Open Dataset](https://github.com/darija-open-dataset/dataset)
- **Saved Model Format:** Keras (`.keras`)
- **Tokenizers Format:** JSON (`.json`)
- **Parameters Format:** JSON (`.json`)
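
For orientation, here is a minimal sketch of a training-time model matching these hyperparameters. It assumes the standard Keras functional seq2seq layout and is not the original training script; the vocabulary sizes are illustrative placeholders (the real values are stored in `model_parameters.json`):

```python
import tensorflow as tf

EMBED_DIM = 50     # embedding dimension from the model card
LATENT_DIM = 128   # LSTM units from the model card
INPUT_VOCAB = 64   # placeholder; the real value lives in model_parameters.json
TARGET_VOCAB = 64  # placeholder; the real value lives in model_parameters.json

# Encoder: character ids -> final LSTM states (the context for the decoder).
encoder_inputs = tf.keras.Input(shape=(None,))
enc_emb = tf.keras.layers.Embedding(INPUT_VOCAB, EMBED_DIM)(encoder_inputs)
_, state_h, state_c = tf.keras.layers.LSTM(LATENT_DIM, return_state=True)(enc_emb)

# Decoder: previous character ids + encoder states -> next-character distribution.
decoder_inputs = tf.keras.Input(shape=(None,))
dec_emb = tf.keras.layers.Embedding(TARGET_VOCAB, EMBED_DIM)(decoder_inputs)
dec_out, _, _ = tf.keras.layers.LSTM(
    LATENT_DIM, return_sequences=True, return_state=True
)(dec_emb, initial_state=[state_h, state_c])
outputs = tf.keras.layers.Dense(TARGET_VOCAB, activation="softmax")(dec_out)

model = tf.keras.Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```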
## Files in this Repository

- **darija-text-normalizer.keras:** The trained Keras model.
- **tokenizer_input.json:** Input tokenizer configuration.
- **tokenizer_target.json:** Target tokenizer configuration.
- **model_parameters.json:** Model parameters (such as max sequence lengths and vocabulary sizes).
- **config.yaml:** Complete model configuration details.
- **README.md:** This file, describing the model and its usage.
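
If you are loading the model from the Hub rather than a local clone, `huggingface_hub` can fetch these files. The repo id below is a placeholder; replace it with this model's actual `user/repo` path:

```python
from huggingface_hub import hf_hub_download

# Placeholder repo id; substitute the actual "user/repo" path of this model.
repo_id = "your-username/darija-text-normalizer"

model_path = hf_hub_download(repo_id=repo_id, filename="darija-text-normalizer.keras")
tokenizer_input_path = hf_hub_download(repo_id=repo_id, filename="tokenizer_input.json")
tokenizer_target_path = hf_hub_download(repo_id=repo_id, filename="tokenizer_target.json")
params_path = hf_hub_download(repo_id=repo_id, filename="model_parameters.json")
```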
## How It Works

The model uses an Encoder-Decoder LSTM architecture. The encoder processes the input text into a context vector (its final hidden and cell states). The decoder then uses this context to generate normalized text one character at a time. Because tokenization is character-level, the model can handle spelling variations and words it has never seen, rather than being limited to a fixed word vocabulary.
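
To make the character-level view concrete, here is a standalone sketch (not the shipped tokenizers) showing how a character-level `Tokenizer` turns a word into one integer per character, so spelling variants like `sou9` and `souq` differ in only a single position:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Fit a character-level tokenizer on a tiny sample.
tok = Tokenizer(char_level=True, oov_token="<OOV>")
tok.fit_on_texts(["kn-mchiw l-sou9", "souq"])

# Each word becomes a sequence of per-character ids.
print(tok.texts_to_sequences(["sou9"]))  # ids for 's', 'o', 'u', '9'
print(tok.texts_to_sequences(["souq"]))  # differs from "sou9" in one position only
```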
## Example Usage

Below is an example of how to load the model and tokenizers, rebuild the inference encoder and decoder, and normalize text:
```python
import json

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import tokenizer_from_json

# --- Load the trained Keras model ---
loaded_model = tf.keras.models.load_model("darija-text-normalizer.keras")

# --- Rebuild the inference encoder and decoder ---
# NOTE: the layer indices below assume the common functional seq2seq layout
#   [enc_input, dec_input, enc_embedding, dec_embedding, enc_lstm, dec_lstm, dense].
# Inspect loaded_model.summary() and adjust them if your saved model differs.
decoder_embedding_layer = loaded_model.layers[3]
encoder_lstm_layer = loaded_model.layers[4]
decoder_lstm_layer = loaded_model.layers[5]
decoder_dense_layer = loaded_model.layers[6]

# The encoder's final hidden and cell states are the context handed to the decoder.
_, encoder_state_h, encoder_state_c = encoder_lstm_layer.output
encoder_model = tf.keras.models.Model(loaded_model.input[0], [encoder_state_h, encoder_state_c])

# The inference decoder takes the previous character plus the previous states and
# returns the next-character distribution plus the updated states.
latent_dim_loaded = decoder_lstm_layer.get_config()['units']
decoder_state_input_h = tf.keras.layers.Input(shape=(latent_dim_loaded,))
decoder_state_input_c = tf.keras.layers.Input(shape=(latent_dim_loaded,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_embedding_inf = decoder_embedding_layer(loaded_model.input[1])
decoder_outputs_inf, state_h_inf, state_c_inf = decoder_lstm_layer(
    decoder_embedding_inf, initial_state=decoder_states_inputs
)
decoder_states_inf = [state_h_inf, state_c_inf]
decoder_outputs_inf = decoder_dense_layer(decoder_outputs_inf)
decoder_model = tf.keras.models.Model(
    [loaded_model.input[1]] + decoder_states_inputs,
    [decoder_outputs_inf] + decoder_states_inf
)

# --- Load Tokenizers ---
# tokenizer_from_json expects the raw JSON string, not a parsed dict.
with open("tokenizer_input.json", 'r', encoding='utf-8') as f:
    tokenizer_input = tokenizer_from_json(f.read())

with open("tokenizer_target.json", 'r', encoding='utf-8') as f:
    tokenizer_target = tokenizer_from_json(f.read())

# --- Load Model Parameters ---
with open("model_parameters.json", 'r', encoding='utf-8') as f:
    model_params = json.load(f)
max_input_len = model_params['max_input_len']
max_target_len = model_params['max_target_len']


def normalize_text(input_text, encoder_model, decoder_model, input_tokenizer,
                   target_tokenizer, max_target_len, max_input_len):
    """Normalizes input Darija text using the trained encoder-decoder model."""
    # Encode the input characters and keep the encoder's final LSTM states.
    input_seq = input_tokenizer.texts_to_sequences([input_text])
    padded_input_seq = tf.keras.preprocessing.sequence.pad_sequences(
        input_seq, maxlen=max_input_len, padding='post')
    states_value = encoder_model.predict(padded_input_seq, verbose=0)

    # Seed the decoder with the OOV token id, which serves as the start symbol
    # in this setup (falls back to 0 if the tokenizer has no OOV token).
    start_id = target_tokenizer.word_index.get(target_tokenizer.oov_token, 0)
    target_seq = np.array([start_id]).reshape(1, 1)

    # Greedy decoding: emit the argmax character each step until the model
    # produces an empty/OOV symbol or the output exceeds max_target_len.
    decoded_sentence = ''
    while True:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value, verbose=0)
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = target_tokenizer.index_word.get(sampled_token_index, '')

        if sampled_char and sampled_char != target_tokenizer.oov_token:
            decoded_sentence += sampled_char

        if (sampled_char == '' or sampled_char == target_tokenizer.oov_token
                or len(decoded_sentence) > max_target_len):
            break

        # Feed the sampled character and the updated states back in.
        target_seq = np.array([sampled_token_index]).reshape(1, 1)
        states_value = [h, c]

    return decoded_sentence


# --- Example ---
input_text = "kn-mchiw l-sou9"  # Darija for "we are going to the market"
print("Input text:", input_text)
print("Normalized text:", normalize_text(input_text, encoder_model, decoder_model,
                                          tokenizer_input, tokenizer_target,
                                          max_target_len, max_input_len))
```
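
The loop above performs greedy (argmax) decoding, emitting one character per step until the model produces an empty or OOV symbol or the output reaches `max_target_len`. For a model this small that is usually adequate; beam search over the character distribution is a natural upgrade if you need higher-quality normalizations.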
## Installation Requirements

Make sure to install the following packages:

```bash
pip install tensorflow numpy pyyaml huggingface_hub
```
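
Note that the example code relies on `tf.keras.preprocessing`, which ships with TensorFlow 2.15 and earlier but was removed in Keras 3 (the default from TensorFlow 2.16 onward). If the imports fail, pinning an earlier release is the simplest fix:

```bash
pip install "tensorflow<2.16" numpy pyyaml huggingface_hub
```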
## Troubleshooting Hub Uploads

If you are pushing this model to the Hugging Face Hub from your own upload script and hit an authentication error, the following steps resolve the most common causes:

1. **Get your Hugging Face API token:**
   * Go to [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) and create a new token with the "write" role.
   * If your upload script hard-codes the token (e.g., `HF_API_TOKEN = "YOUR_HF_API_TOKEN"`), replace the placeholder with your actual token.

2. **Enter your Hugging Face username:**
   * Likewise, replace any `HF_USERNAME = "YOUR_HF_USERNAME"` placeholder with the username you use to log in to Hugging Face.

3. **(Recommended) Log in with the Hugging Face CLI:**
   * Open your terminal and run `huggingface-cli login`.
   * Enter your API token when prompted. This securely configures Git to authenticate with Hugging Face.

4. **Check your internet connection:**
   * Ensure you have a stable connection, then retry the upload.

If you still encounter issues, double-check:

* That all required libraries are installed (`tensorflow`, `numpy`, `pyyaml`, `huggingface_hub`).
* That your API token is valid and has "write" permissions.
* That your Hugging Face username is entered correctly.

If the error persists after these steps, it is most likely a transient network issue or a problem on the Hugging Face Hub side.
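
For reference, here is a minimal upload sketch using `huggingface_hub` (the repo id is a placeholder; substitute your own namespace):

```python
from huggingface_hub import HfApi

# Uses the token stored by `huggingface-cli login`.
api = HfApi()

# "your-username/darija-text-normalizer" is a placeholder repo id.
repo_id = "your-username/darija-text-normalizer"
api.create_repo(repo_id=repo_id, exist_ok=True)

# Upload each artifact listed in "Files in this Repository".
for filename in [
    "darija-text-normalizer.keras",
    "tokenizer_input.json",
    "tokenizer_target.json",
    "model_parameters.json",
    "config.yaml",
]:
    api.upload_file(path_or_fileobj=filename, path_in_repo=filename, repo_id=repo_id)
```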