#### Keras Embedding Layer
Keras offers an Embedding layer that can be used for neural networks on text data.



It requires that the input data be integer encoded, so that each word is represented by a unique integer. This data preparation step can be performed using the Tokenizer API also provided with Keras.

The Embedding layer can be used:

    * Alone to learn a word embedding that can be saved and used in another model later.
    * As part of a deep learning model where the embedding is learned along with the model itself.
    * To load a pre-trained word embedding model, a type of transfer learning.


Keras __Embedding__ turns positive integers (indexes) into dense vectors of fixed size, e.g. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]. This layer can only be used as the first layer in a model, so it is defined as the first hidden layer of the network.

Important arguments:

    input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1. e.g. if your data is integer encoded to values between 0-10, then the size of the vocabulary would be 11 words.
    output_dim: int >= 0. Dimension of the dense embedding. It defines the size of the output vectors from this layer for each word.
    input_length: Length of input sequences. For example, if all of your input documents are comprised of 1000 words, this would be 1000.
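The three arguments above can be pictured as the shape of a lookup table. The following is a minimal NumPy sketch, not the Keras implementation: the sizes and the random initial values are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical sizes chosen only for illustration.
input_dim = 11      # vocabulary size: integer indexes 0..10
output_dim = 2      # length of each embedding vector
input_length = 3    # words per (padded) document

rng = np.random.default_rng(0)
# The layer's weight matrix: one row of output_dim values per integer index.
embedding_table = rng.uniform(-0.05, 0.05, size=(input_dim, output_dim))

# A batch of two integer-encoded documents, shape (2, input_length).
batch = np.array([[4, 10, 0],
                  [7, 7, 1]])

# Looking up an embedding is plain row indexing into the table.
vectors = embedding_table[batch]
print(vectors.shape)  # (batch size, input_length, output_dim)
```

The same integer always maps to the same row, so the two occurrences of index 7 in the second document receive identical vectors.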

In [62]:
import tensorflow as tf
import numpy as np

In [63]:
from numpy import zeros
from numpy import asarray

from tensorflow.keras.preprocessing.text import Tokenizer

from tensorflow.keras.preprocessing.sequence import pad_sequences

from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten,Embedding


In [64]:
# fix random seed for reproducibility
np.random.seed(123)
tf.random.set_seed(123)

###### Data:
We have 10 text documents, each with a comment about a piece of work a student submitted. Each text document is classified as positive “1” or negative “0”. This is a simple sentiment analysis problem.

In [65]:
# define documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Technique 1: The len() method to find the length of a list in Python. Python has got in-built method – len() to find the size of the list i.e. the length of the list. The len() method accepts an iterable as an argument and it counts and returns the number of elements present in the list.']

# define class labels
# positive is 1 and negative is 0
labels = [1,1,1,1,1,0,0,0,0,0]

Integer encode each document. This means that as input the Embedding layer will have sequences of integers. 

__Tokenizer__

    Class for vectorizing texts, or/and turning texts into sequences (=list of word indexes, where the word of rank i in the dataset (starting at 1) has index i).

__fit_on_texts(texts)__

    Arguments:  
        texts: list of texts to train on.
        
* __fit_on_texts( )__ - Updates internal vocabulary based on a list of texts. This method creates the vocabulary index based on word frequency. 
        
__word_index__ attribute: 

    Dictionary mapping words (str) to their rank/index (int). Only set after fit_on_texts was called.
    
* https://stackoverflow.com/questions/51956000/what-does-keras-tokenizer-method-exactly-do
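To make the frequency-based ranking concrete, here is a rough pure-Python approximation of what fit_on_texts does when it builds word_index. It is a sketch under simplifying assumptions: the helper name build_word_index is hypothetical, and the regular expression is a simpler tokenization rule than the filter list the real Tokenizer uses.

```python
import re
from collections import Counter

def build_word_index(texts):
    """Hypothetical helper: rank words by frequency, starting at index 1."""
    counts = Counter()
    for text in texts:
        # Lowercase, then treat every non-alphanumeric character as a separator.
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    # most_common() sorts by descending count; ties keep first-seen order.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

sample_docs = ['Well done!', 'Good work', 'nice work']
word_index = build_word_index(sample_docs)
print(word_index)
# {'work': 1, 'well': 2, 'done': 3, 'good': 4, 'nice': 5}
```

Index 0 is never assigned to a word, which is why it stays free for padding.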

In [66]:
# Prepare tokenizer
# https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

t = Tokenizer()

t.fit_on_texts(docs)
print(t.word_counts)
print(t.word_index)

# +1 because index 0 is reserved (used for padding) and never assigned to a word
vocab_size = len(t.word_index) + 1
print(vocab_size)

OrderedDict([('well', 1), ('done', 1), ('good', 2), ('work', 3), ('great', 1), ('effort', 2), ('nice', 1), ('excellent', 1), ('weak', 1), ('poor', 2), ('not', 1), ('technique', 1), ('1', 1), ('the', 9), ('len', 3), ('method', 3), ('to', 2), ('find', 2), ('length', 2), ('of', 4), ('a', 1), ('list', 4), ('in', 3), ('python', 2), ('has', 1), ('got', 1), ('built', 1), ('–', 1), ('size', 1), ('i', 1), ('e', 1), ('accepts', 1), ('an', 2), ('iterable', 1), ('as', 1), ('argument', 1), ('and', 2), ('it', 1), ('counts', 1), ('returns', 1), ('number', 1), ('elements', 1), ('present', 1)])
{'the': 1, 'of': 2, 'list': 3, 'work': 4, 'len': 5, 'method': 6, 'in': 7, 'good': 8, 'effort': 9, 'poor': 10, 'to': 11, 'find': 12, 'length': 13, 'python': 14, 'an': 15, 'and': 16, 'well': 17, 'done': 18, 'great': 19, 'nice': 20, 'excellent': 21, 'weak': 22, 'not': 23, 'technique': 24, '1': 25, 'a': 26, 'has': 27, 'got': 28, 'built': 29, '–': 30, 'size': 31, 'i': 32, 'e': 33, 'accepts': 34, 'iterable': 35, 'as':

__texts_to_sequences(texts)__

    Arguments:
        texts: list of texts to turn to sequences.
    Return: list of sequences (one per text input).

* __texts_to_sequences( )__ - Transforms each text in texts to a sequence of integers. So it basically takes each word in the text and replaces it with its corresponding integer value from the word_index dictionary.
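The replacement step can be sketched in a few lines of plain Python. This is only an illustration: encode is a hypothetical helper, the word_index below is a small subset of the one printed earlier, and the tokenization is simplified. Words that are not in the index are skipped, which is also how the Tokenizer behaves when no oov_token is set.

```python
import re

# A subset of the word_index printed above.
word_index = {'work': 4, 'good': 8, 'effort': 9, 'poor': 10, 'not': 23}

def encode(text, word_index):
    """Hypothetical helper: replace each known word with its integer index."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Unknown words are dropped rather than mapped to a placeholder.
    return [word_index[w] for w in words if w in word_index]

print(encode('Poor effort!', word_index))    # [10, 9]
print(encode('not good', word_index))        # [23, 8]
print(encode('fantastic work', word_index))  # [4]
```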

In [67]:
# integer encode the documents
encoded_docs = t.texts_to_sequences(docs)
print(docs)
print(encoded_docs)

['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!', 'Weak', 'Poor effort!', 'not good', 'poor work', 'Technique 1: The len() method to find the length of a list in Python. Python has got in-built method – len() to find the size of the list i.e. the length of the list. The len() method accepts an iterable as an argument and it counts and returns the number of elements present in the list.']
[[17, 18], [8, 4], [19, 9], [20, 4], [21], [22], [10, 9], [23, 8], [10, 4], [24, 25, 1, 5, 6, 11, 12, 1, 13, 2, 26, 3, 7, 14, 14, 27, 28, 7, 29, 6, 30, 5, 11, 12, 1, 31, 2, 1, 3, 32, 33, 1, 13, 2, 1, 3, 1, 5, 6, 34, 15, 35, 36, 15, 37, 16, 38, 39, 16, 40, 1, 41, 2, 42, 43, 7, 1, 3]]


The sequences have different lengths, and Keras needs all inputs in a batch to have the same length. We will pad all input sequences with zeros to the length of the longest document (58 tokens here), since no maxlen is passed. Again, we can do this with the built-in Keras pad_sequences() function.
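A pure-Python sketch of post-padding may help: shorter sequences get trailing zeros, and when no maximum length is given the longest sequence sets the length. The helper pad_post is hypothetical and only mimics pad_sequences(..., padding='post'); over-long sequences keep their last maxlen items, mirroring the default truncating='pre'.

```python
def pad_post(sequences, maxlen=None, value=0):
    """Hypothetical helper: pad every sequence on the right to one length."""
    if maxlen is None:
        # No explicit length: the longest sequence decides.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # too long: keep the last maxlen items
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

print(pad_post([[17, 18], [21], [10, 9, 4]]))
# [[17, 18, 0], [21, 0, 0], [10, 9, 4]]
print(pad_post([[17, 18], [21], [10, 9, 4]], maxlen=2))
# [[17, 18], [21, 0], [9, 4]]
```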

In [68]:
# pad every document with trailing zeros; no maxlen is given,
# so the longest document (58 tokens) sets the padded length
padded_docs = pad_sequences(encoded_docs, padding='post')
print(padded_docs)


[[17 18  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0]
 [ 8  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0]
 [19  9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0]
 [20  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0]
 [21  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0]
 [22  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  

In [86]:
# find the padded input length: total elements / number of documents
print(padded_docs.size)
print(len(padded_docs))
max_input_length =(padded_docs.size)/len(padded_docs)
print(max_input_length)

580
10
58.0


In [78]:
labels=np.array(labels)

In [79]:
print(labels)

[1 1 1 1 1 0 0 0 0 0]


The Embedding layer has a vocabulary of 44 (43 unique words plus the reserved index 0) and an input length of 58. We will choose a small embedding space of 8 dimensions.

The model is a simple binary classification model. 

Importantly, the output from the Embedding layer will be 58 vectors of 8 dimensions each, one for each position in the padded document. We flatten this to a single 464-element vector to pass on to the Dense output layer.

In [81]:
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length= int(max_input_length) ))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

In [82]:
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

In [83]:
# summarize the model
print(model.summary())

Model: "sequential_10"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_10 (Embedding)    (None, 58, 8)             352       
                                                                 
 flatten_8 (Flatten)         (None, 464)               0         
                                                                 
 dense_8 (Dense)             (None, 1)                 465       
                                                                 
Total params: 817
Trainable params: 817
Non-trainable params: 0
_________________________________________________________________
None


The Embedding layer has 352 parameters, i.e. 44 * 8 (vocabulary size * output dimension), where the vocabulary size is the 43 unique words plus one for the reserved index 0.
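The parameter counts in the summary can be reproduced with simple arithmetic. The sketch below uses the sizes reported above (vocabulary 44, embedding dimension 8, padded length 58); the variable names are only illustrative.

```python
vocab_size = 44       # 43 unique words + reserved index 0
embedding_dim = 8
input_length = 58     # padded document length

# Embedding: one embedding_dim-sized row per vocabulary index.
embedding_params = vocab_size * embedding_dim     # 352

# Flatten has no weights; it reshapes (58, 8) into a single vector.
flattened_size = input_length * embedding_dim     # 464

# Dense(1): one weight per flattened input plus one bias.
dense_params = flattened_size * 1 + 1             # 465

total_params = embedding_params + dense_params    # 817
print(embedding_params, flattened_size, dense_params, total_params)
```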

In [87]:
embeddings = model.layers[0].get_weights()[0]

In [88]:
embeddings

array([[-3.73846889e-02,  7.27512687e-03, -2.00686697e-02,
         4.61835787e-03,  2.20515765e-02,  2.88953297e-02,
        -1.92318801e-02, -1.82889774e-03],
       [ 1.53775252e-02, -3.79007459e-02, -1.37082562e-02,
        -2.55100019e-02, -2.73043867e-02,  2.01003626e-03,
         3.68466265e-02, -2.13810559e-02],
       [-8.15294683e-04,  1.74988396e-02, -5.98294660e-03,
        -1.07008219e-03,  3.05635221e-02,  3.97428386e-02,
         1.36615969e-02,  1.20054930e-04],
       [-1.61092356e-03,  2.52256282e-02,  3.06485929e-02,
        -9.68657434e-04, -8.23821872e-03,  4.19750698e-02,
        -1.95727702e-02,  4.12120558e-02],
       [ 1.48617961e-02, -1.51228905e-02,  1.58478655e-02,
         3.37831415e-02,  3.04305442e-02, -1.41870491e-02,
        -1.00730881e-02, -1.44804642e-03],
       [-4.17529456e-02, -7.88740069e-03, -4.96819504e-02,
         4.88086231e-02,  6.93566725e-03, -3.47149149e-02,
         8.68083164e-03,  4.95600142e-02],
       [-4.99248281e-02,  5.387067

In [89]:
embeddings.shape

(44, 8)

In [90]:
# fit the model
model.fit(padded_docs, labels, epochs=50,verbose=0)

<keras.callbacks.History at 0x7fe3d3efca90>

In [91]:
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels,verbose=0)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 60.000002


You could save the learned weights from the Embedding layer to file for later use in other models.

You could also use this model more generally to classify other documents that use the same kind of vocabulary seen in the training dataset.

* https://stackoverflow.com/questions/51235118/how-to-get-word-vectors-from-keras-embedding-layer
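One possible way to save the learned weights and look up a word's vector later is sketched below with NumPy. It is an illustration only: the random matrix stands in for model.layers[0].get_weights()[0], the file name embedding_weights.npy is a hypothetical choice, and word_index is a subset of the Tokenizer's dictionary.

```python
import os
import tempfile
import numpy as np

# Stand-in for model.layers[0].get_weights()[0], shape (vocab_size, output_dim).
embeddings = np.random.default_rng(123).normal(size=(44, 8)).astype('float32')

# Subset of the Tokenizer's word_index, needed to map words to rows.
word_index = {'the': 1, 'of': 2, 'list': 3, 'work': 4}

# Save to a temporary directory; the file name is a hypothetical choice.
path = os.path.join(tempfile.mkdtemp(), 'embedding_weights.npy')
np.save(path, embeddings)

# Later, possibly in another program: reload and look up a word's vector.
restored = np.load(path)
work_vector = restored[word_index['work']]
print(work_vector.shape)
```

The word_index dictionary has to be stored alongside the matrix, because the rows are only meaningful together with the word-to-index mapping.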

## Using Pre-Trained GloVe Embedding

The Keras Embedding layer can also use a word embedding learned elsewhere.

It is common in the field of Natural Language Processing to learn, save, and make freely available word embeddings.

For example, the researchers behind the GloVe method provide a suite of pre-trained word embeddings on their website, released under a public domain license.

The smallest package of embeddings is 822 MB, called “glove.6B.zip”. It was trained on a dataset of six billion tokens (words) with a vocabulary of 400 thousand words. There are a few different embedding vector sizes: 50, 100, 200 and 300 dimensions.

You can download this collection of embeddings from https://nlp.stanford.edu/projects/glove/ and we can seed the Keras Embedding layer with weights from the pre-trained embedding for the words in your training dataset.

After downloading and unzipping, you will see a few files, one of which is “glove.6B.100d.txt”, which contains a 100-dimensional version of the embedding. The code below loads the smaller 50-dimensional file, “glove.6B.50d.txt”.


If you peek inside a file, you will see a token (word) followed by its weights on each line (100 numbers in the 100-dimensional file, 50 in the 50-dimensional one). 

###### Load the GloVe word embedding file into memory as a dictionary of word to embedding array.

__Note__: Filter the embedding for the unique words in the training data.


In [None]:
# mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# load the whole embedding into memory as a dict: word -> vector
embeddings_index = dict()

# each line holds a token followed by its vector components
with open('/content/drive/MyDrive/NLP/Deep_Learning/PGP/WordEmbeddings/glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 400000 word vectors.


Next, create a matrix of one embedding for each word in the training dataset. We can do that by enumerating all unique words in the Tokenizer.word_index and locating the embedding weight vector from the loaded GloVe embedding.

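The result is a matrix of weights only for words we will see during training.

Note that the outputs shown in the rest of this section appear to come from an earlier run with a shorter document set (a 14-word vocabulary, so vocab_size is 15, and sequences padded to length 4). With the documents defined above, vocab_size is 44 and the padded length is 58.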

In [None]:
# Example to create a zero matrix
embedding_matrix_1 = zeros((vocab_size, 5))

In [None]:
embedding_matrix_1

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [None]:
t.word_index.items()

dict_items([('work', 1), ('done', 2), ('good', 3), ('effort', 4), ('poor', 5), ('well', 6), ('great', 7), ('nice', 8), ('excellent', 9), ('weak', 10), ('not', 11), ('could', 12), ('have', 13), ('better', 14)])

In [None]:
embeddings_index.get('weak')

array([-0.26241 , -1.1103  ,  0.50271 , -0.43052 ,  0.37468 , -0.3055  ,
        0.36708 ,  0.25938 , -0.16993 ,  0.54245 ,  0.63919 ,  0.11347 ,
       -0.3919  ,  0.31521 , -0.42901 ,  0.49977 , -0.2376  , -0.79307 ,
        0.34494 , -0.47877 , -0.51945 , -0.50665 ,  0.057701, -0.31797 ,
       -0.080134, -1.0289  , -0.1507  ,  0.50944 ,  0.60715 ,  1.3049  ,
        3.2575  ,  0.11849 ,  1.5057  , -0.36649 , -0.17726 , -0.20931 ,
       -0.59527 , -0.025889, -0.2965  , -1.1387  , -0.52999 ,  0.067286,
        0.094954,  0.049722,  0.51323 , -0.11194 , -0.007111,  0.23775 ,
        0.68874 ,  0.13873 ], dtype=float32)

In [None]:
# create a weight matrix for words in training docs
embedding_matrix = zeros((vocab_size, 50))

for word, i in t.word_index.items():
    print(word)
    print(i)
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

work
1
done
2
good
3
effort
4
poor
5
well
6
great
7
nice
8
excellent
9
weak
10
not
11
could
12
have
13
better
14


In [None]:
print(embedding_matrix[1])

[ 5.13589978e-01  1.96950004e-01 -5.19439995e-01 -8.62179995e-01
  1.54940002e-02  1.09729998e-01 -8.02929997e-01 -3.33609998e-01
 -1.61189993e-04  1.01889996e-02  4.67340015e-02  4.67510015e-01
 -4.74750012e-01  1.10380001e-01  3.93269986e-01 -4.36520010e-01
  3.99839997e-01  2.71090001e-01  4.26499993e-01 -6.06400013e-01
  8.11450005e-01  4.56299990e-01 -1.27260000e-01 -2.24739999e-01
  6.40709996e-01 -1.27670002e+00 -7.22310007e-01 -6.95900023e-01
  2.80450005e-02 -2.30719998e-01  3.79959989e+00 -1.26249999e-01
 -4.79669988e-01 -9.99719977e-01 -2.19760001e-01  5.05649984e-01
  2.59530004e-02  8.05140018e-01  1.99290007e-01  2.87959993e-01
 -1.59150004e-01 -3.04380000e-01  1.60249993e-01 -1.82899997e-01
 -3.85629982e-02 -1.76190004e-01  2.70409994e-02  4.68420014e-02
 -6.28970027e-01  3.57259989e-01]


In [None]:
embeddings_index.get('work')

array([ 5.1359e-01,  1.9695e-01, -5.1944e-01, -8.6218e-01,  1.5494e-02,
        1.0973e-01, -8.0293e-01, -3.3361e-01, -1.6119e-04,  1.0189e-02,
        4.6734e-02,  4.6751e-01, -4.7475e-01,  1.1038e-01,  3.9327e-01,
       -4.3652e-01,  3.9984e-01,  2.7109e-01,  4.2650e-01, -6.0640e-01,
        8.1145e-01,  4.5630e-01, -1.2726e-01, -2.2474e-01,  6.4071e-01,
       -1.2767e+00, -7.2231e-01, -6.9590e-01,  2.8045e-02, -2.3072e-01,
        3.7996e+00, -1.2625e-01, -4.7967e-01, -9.9972e-01, -2.1976e-01,
        5.0565e-01,  2.5953e-02,  8.0514e-01,  1.9929e-01,  2.8796e-01,
       -1.5915e-01, -3.0438e-01,  1.6025e-01, -1.8290e-01, -3.8563e-02,
       -1.7619e-01,  2.7041e-02,  4.6842e-02, -6.2897e-01,  3.5726e-01],
      dtype=float32)
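The matrix-building step above can be sketched on toy data to show what happens to words that are missing from the pre-trained vocabulary: their rows simply stay at zero. The 3-dimensional vectors and the small word index below are made up for illustration and are not taken from GloVe.

```python
import numpy as np

# Toy stand-ins: 3-dimensional "pre-trained" vectors and a small word index.
pretrained = {
    'work': np.array([0.5, 0.2, -0.5], dtype='float32'),
    'good': np.array([0.1, -0.3, 0.7], dtype='float32'),
}
word_index = {'work': 1, 'good': 2, 'len': 3}   # 'len' has no toy vector
vocab_size = len(word_index) + 1                # +1 for reserved index 0

embedding_matrix = np.zeros((vocab_size, 3))
missing = []
for word, i in word_index.items():
    vector = pretrained.get(word)
    if vector is None:
        missing.append(word)        # row i is left as all zeros
    else:
        embedding_matrix[i] = vector

print('Words without a pre-trained vector:', missing)
print(embedding_matrix)
```

Collecting the missing words this way gives a quick check of how much of the training vocabulary the pre-trained embedding actually covers.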

Define our model, fit, and evaluate it as before.

The key difference is that the embedding layer can be seeded with the GloVe word embedding weights. 

    We chose the 50-dimensional version, therefore the Embedding layer must be defined with output_dim set to 50. 
    We do not want to update the learned word weights in this model, therefore we set the trainable argument of the Embedding layer to False.

In [None]:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, weights=[embedding_matrix], input_length=4, trainable=False))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

In [None]:
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

In [None]:
# summarize the model
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 4, 50)             750       
                                                                 
 flatten_1 (Flatten)         (None, 200)               0         
                                                                 
 dense_1 (Dense)             (None, 1)                 201       
                                                                 
Total params: 951
Trainable params: 201
Non-trainable params: 750
_________________________________________________________________
None


In [None]:
# fit the model
model.fit(padded_docs, labels, epochs=500, verbose=0)

<keras.callbacks.History at 0x7fbf39a3b5d0>

In [None]:
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)

print('Accuracy: %f' % (accuracy*100))

Accuracy: 100.000000


In [None]:
predict_label=model.predict(padded_docs)



In [None]:
predict_label = np.round(predict_label).astype(int)

In [None]:
predict_label


array([[1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0]])
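Rounding the sigmoid outputs is equivalent to applying a threshold near 0.5. The sketch below, using made-up probabilities, shows the one edge case where the two differ: np.round rounds exactly 0.5 to the nearest even value (0), whereas an explicit >= 0.5 threshold maps it to 1.

```python
import numpy as np

# Made-up sigmoid outputs, shaped like model.predict() results: (n_docs, 1).
probs = np.array([[0.93], [0.5], [0.07]])

# Approach used above: round to the nearest integer.
rounded = np.round(probs).astype(int)

# Alternative: an explicit decision threshold.
thresholded = (probs >= 0.5).astype(int)

print(rounded.ravel())      # [1 0 0]
print(thresholded.ravel())  # [1 1 0]
```

An explicit threshold also makes it easy to move the decision boundary away from 0.5 if one class should be favoured.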

__References:__

    https://keras.io
    https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer
    https://machinelearningmastery.com
    https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html