---
library_name: "tensorflow"
language:
- ary
tags:
- text-normalization
- darija
- moroccan-arabic
- character-level
- lstm
inference: false
license: apache-2.0
---

# Darija Text Normalization Model

This repository contains a **Sequence-to-Sequence LSTM** model trained to normalize Darija text. The model converts noisy or informal Darija into a standardized format using character-level tokenization.
## Model Details

- **Architecture:** Encoder-Decoder LSTM (Sequence-to-Sequence)
- **Task:** Text Normalization
- **Language:** Darija (Moroccan Arabic)
- **Input Tokenizer:** Character-level
- **Target Tokenizer:** Character-level
- **Embedding Dimension:** 50
- **Latent Dimension (LSTM Units):** 128
- **Training Data:** [Darija Open Dataset](https://github.com/darija-open-dataset/dataset)
- **Saved Model Format:** Keras (`.keras`)
- **Tokenizers Format:** JSON (`.json`)
- **Parameters Format:** JSON (`.json`)
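
For orientation, here is a minimal sketch of a training-time model matching these hyperparameters. It assumes the standard Keras functional seq2seq layout and is not the original training script; the vocabulary sizes are illustrative placeholders (the real values are stored in `model_parameters.json`):

```python
import tensorflow as tf

EMBED_DIM = 50     # embedding dimension from the model card
LATENT_DIM = 128   # LSTM units from the model card
INPUT_VOCAB = 64   # placeholder; the real value lives in model_parameters.json
TARGET_VOCAB = 64  # placeholder; the real value lives in model_parameters.json

# Encoder: character ids -> final LSTM states (the context for the decoder).
encoder_inputs = tf.keras.Input(shape=(None,))
enc_emb = tf.keras.layers.Embedding(INPUT_VOCAB, EMBED_DIM)(encoder_inputs)
_, state_h, state_c = tf.keras.layers.LSTM(LATENT_DIM, return_state=True)(enc_emb)

# Decoder: previous character ids + encoder states -> next-character distribution.
decoder_inputs = tf.keras.Input(shape=(None,))
dec_emb = tf.keras.layers.Embedding(TARGET_VOCAB, EMBED_DIM)(decoder_inputs)
dec_out, _, _ = tf.keras.layers.LSTM(
    LATENT_DIM, return_sequences=True, return_state=True
)(dec_emb, initial_state=[state_h, state_c])
outputs = tf.keras.layers.Dense(TARGET_VOCAB, activation="softmax")(dec_out)

model = tf.keras.Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```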
## Files in this Repository

- **darija-text-normalizer.keras:** The trained Keras model.
- **tokenizer_input.json:** Input tokenizer configuration.
- **tokenizer_target.json:** Target tokenizer configuration.
- **model_parameters.json:** Model parameters (such as max sequence lengths and vocabulary sizes).
- **config.yaml:** Complete model configuration details.
- **README.md:** This file, describing the model and its usage.
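
If you are loading the model from the Hub rather than a local clone, `huggingface_hub` can fetch these files. The repo id below is a placeholder; replace it with this model's actual `user/repo` path:

```python
from huggingface_hub import hf_hub_download

# Placeholder repo id; substitute the actual "user/repo" path of this model.
repo_id = "your-username/darija-text-normalizer"

model_path = hf_hub_download(repo_id=repo_id, filename="darija-text-normalizer.keras")
tokenizer_input_path = hf_hub_download(repo_id=repo_id, filename="tokenizer_input.json")
tokenizer_target_path = hf_hub_download(repo_id=repo_id, filename="tokenizer_target.json")
params_path = hf_hub_download(repo_id=repo_id, filename="model_parameters.json")
```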
## How It Works

The model uses an Encoder-Decoder LSTM architecture. The encoder processes the input text into a context vector (its final hidden and cell states). The decoder then uses this context to generate normalized text one character at a time. Because tokenization is character-level, the model can handle spelling variations and words it has never seen, rather than being limited to a fixed word vocabulary.
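
To make the character-level view concrete, here is a standalone sketch (not the shipped tokenizers) showing how a character-level `Tokenizer` turns a word into one integer per character, so spelling variants like `sou9` and `souq` differ in only a single position:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Fit a character-level tokenizer on a tiny sample.
tok = Tokenizer(char_level=True, oov_token="<OOV>")
tok.fit_on_texts(["kn-mchiw l-sou9", "souq"])

# Each word becomes a sequence of per-character ids.
print(tok.texts_to_sequences(["sou9"]))  # ids for 's', 'o', 'u', '9'
print(tok.texts_to_sequences(["souq"]))  # differs from "sou9" in one position only
```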
## Example Usage

Below is an example of how to load the model and tokenizers, rebuild the inference encoder and decoder, and normalize text:
```python
import json

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import tokenizer_from_json

# --- Load the trained Keras model ---
loaded_model = tf.keras.models.load_model("darija-text-normalizer.keras")

# --- Rebuild the inference encoder and decoder ---
# NOTE: the layer indices below assume the common functional seq2seq layout
#   [enc_input, dec_input, enc_embedding, dec_embedding, enc_lstm, dec_lstm, dense].
# Inspect loaded_model.summary() and adjust them if your saved model differs.
decoder_embedding_layer = loaded_model.layers[3]
encoder_lstm_layer = loaded_model.layers[4]
decoder_lstm_layer = loaded_model.layers[5]
decoder_dense_layer = loaded_model.layers[6]

# The encoder's final hidden and cell states are the context handed to the decoder.
_, encoder_state_h, encoder_state_c = encoder_lstm_layer.output
encoder_model = tf.keras.models.Model(loaded_model.input[0], [encoder_state_h, encoder_state_c])

# The inference decoder takes the previous character plus the previous states and
# returns the next-character distribution plus the updated states.
latent_dim_loaded = decoder_lstm_layer.get_config()['units']
decoder_state_input_h = tf.keras.layers.Input(shape=(latent_dim_loaded,))
decoder_state_input_c = tf.keras.layers.Input(shape=(latent_dim_loaded,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_embedding_inf = decoder_embedding_layer(loaded_model.input[1])
decoder_outputs_inf, state_h_inf, state_c_inf = decoder_lstm_layer(
    decoder_embedding_inf, initial_state=decoder_states_inputs
)
decoder_states_inf = [state_h_inf, state_c_inf]
decoder_outputs_inf = decoder_dense_layer(decoder_outputs_inf)
decoder_model = tf.keras.models.Model(
    [loaded_model.input[1]] + decoder_states_inputs,
    [decoder_outputs_inf] + decoder_states_inf
)

# --- Load Tokenizers ---
# tokenizer_from_json expects the raw JSON string, not a parsed dict.
with open("tokenizer_input.json", 'r', encoding='utf-8') as f:
    tokenizer_input = tokenizer_from_json(f.read())

with open("tokenizer_target.json", 'r', encoding='utf-8') as f:
    tokenizer_target = tokenizer_from_json(f.read())

# --- Load Model Parameters ---
with open("model_parameters.json", 'r', encoding='utf-8') as f:
    model_params = json.load(f)
max_input_len = model_params['max_input_len']
max_target_len = model_params['max_target_len']


def normalize_text(input_text, encoder_model, decoder_model, input_tokenizer,
                   target_tokenizer, max_target_len, max_input_len):
    """Normalizes input Darija text using the trained encoder-decoder model."""
    # Encode the input characters and keep the encoder's final LSTM states.
    input_seq = input_tokenizer.texts_to_sequences([input_text])
    padded_input_seq = tf.keras.preprocessing.sequence.pad_sequences(
        input_seq, maxlen=max_input_len, padding='post')
    states_value = encoder_model.predict(padded_input_seq, verbose=0)

    # Seed the decoder with the OOV token id, which serves as the start symbol
    # in this setup (falls back to 0 if the tokenizer has no OOV token).
    start_id = target_tokenizer.word_index.get(target_tokenizer.oov_token, 0)
    target_seq = np.array([start_id]).reshape(1, 1)

    # Greedy decoding: emit the argmax character each step until the model
    # produces an empty/OOV symbol or the output exceeds max_target_len.
    decoded_sentence = ''
    while True:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value, verbose=0)
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = target_tokenizer.index_word.get(sampled_token_index, '')

        if sampled_char and sampled_char != target_tokenizer.oov_token:
            decoded_sentence += sampled_char

        if (sampled_char == '' or sampled_char == target_tokenizer.oov_token
                or len(decoded_sentence) > max_target_len):
            break

        # Feed the sampled character and the updated states back in.
        target_seq = np.array([sampled_token_index]).reshape(1, 1)
        states_value = [h, c]

    return decoded_sentence


# --- Example ---
input_text = "kn-mchiw l-sou9"  # Darija for "we are going to the market"
print("Input text:", input_text)
print("Normalized text:", normalize_text(input_text, encoder_model, decoder_model,
                                          tokenizer_input, tokenizer_target,
                                          max_target_len, max_input_len))
```
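
The loop above performs greedy (argmax) decoding, emitting one character per step until the model produces an empty or OOV symbol or the output reaches `max_target_len`. For a model this small that is usually adequate; beam search over the character distribution is a natural upgrade if you need higher-quality normalizations.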
## Installation Requirements

Make sure to install the following packages:

```bash
pip install tensorflow numpy pyyaml huggingface_hub
```
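
Note that the example code relies on `tf.keras.preprocessing`, which ships with TensorFlow 2.15 and earlier but was removed in Keras 3 (the default from TensorFlow 2.16 onward). If the imports fail, pinning an earlier release is the simplest fix:

```bash
pip install "tensorflow<2.16" numpy pyyaml huggingface_hub
```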
## Troubleshooting Hub Uploads

If you are pushing this model to the Hugging Face Hub from your own upload script and hit an authentication error, the following steps resolve the most common causes:

1. **Get your Hugging Face API token:**
   * Go to [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) and create a new token with the "write" role.
   * If your upload script hard-codes the token (e.g., `HF_API_TOKEN = "YOUR_HF_API_TOKEN"`), replace the placeholder with your actual token.

2. **Enter your Hugging Face username:**
   * Likewise, replace any `HF_USERNAME = "YOUR_HF_USERNAME"` placeholder with the username you use to log in to Hugging Face.

3. **(Recommended) Log in with the Hugging Face CLI:**
   * Open your terminal and run `huggingface-cli login`.
   * Enter your API token when prompted. This securely configures Git to authenticate with Hugging Face.

4. **Check your internet connection:**
   * Ensure you have a stable connection, then retry the upload.

If you still encounter issues, double-check:

* That all required libraries are installed (`tensorflow`, `numpy`, `pyyaml`, `huggingface_hub`).
* That your API token is valid and has "write" permissions.
* That your Hugging Face username is entered correctly.

If the error persists after these steps, it is most likely a transient network issue or a problem on the Hugging Face Hub side.
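
For reference, here is a minimal upload sketch using `huggingface_hub` (the repo id is a placeholder; substitute your own namespace):

```python
from huggingface_hub import HfApi

# Uses the token stored by `huggingface-cli login`.
api = HfApi()

# "your-username/darija-text-normalizer" is a placeholder repo id.
repo_id = "your-username/darija-text-normalizer"
api.create_repo(repo_id=repo_id, exist_ok=True)

# Upload each artifact listed in "Files in this Repository".
for filename in [
    "darija-text-normalizer.keras",
    "tokenizer_input.json",
    "tokenizer_target.json",
    "model_parameters.json",
    "config.yaml",
]:
    api.upload_file(path_or_fileobj=filename, path_in_repo=filename, repo_id=repo_id)
```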