NLP

Table of Contents

1. Data process

1.1. Data load

import nltk
nltk.download('punkt')

with open('/home/si/Desktop/textPreprocessing/frankenstein.txt') as f:
    frankensteintext = f.read()
sentences = nltk.sent_tokenize(frankensteintext)
print(f'frankenstein.txt contains {len(sentences)} sentences!')

1.2. Alpha filter

AlphaFilter_words = nltk.word_tokenize(sentences[0])
# keep only tokens that start with a letter (drops punctuation and numbers)
AlphaFilter_words = [token for token in AlphaFilter_words if token[0].isalpha()]
print(f'the first sentence has {len(AlphaFilter_words)} alphabetic tokens')
print(f'the words are {AlphaFilter_words}')

1.3. Stop words

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = stopwords.words("english")  # avoid shadowing the imported stopwords module
print(f"the number of stop words is {len(stop_words)}")

words = word_tokenize(sentences[0])
words = [token for token in words if token not in stop_words]
print(f'the first sentence has {len(words)} tokens after stop word removal')
print(f'the words are {words}')

1.4. Stemming

nltk.download("wordnet")
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import spacy

ps = PorterStemmer()
stemmed_words = []
words = word_tokenize(sentences[0])
for token in words:
    stemmed_words.append(ps.stem(token))
print(f'the first sentence has {len(stemmed_words)} stemmed tokens')
print(f'the stemmed words are {stemmed_words}')

1.5. Lemmatization

nlp = spacy.load('en_core_web_sm')
doc = nlp(sentences[0])
lemma_words = [token.lemma_ for token in doc]
print(f'the first sentence has {len(lemma_words)} tokens')
print(f'the lemmatized words are {lemma_words}')

1.6. Part of speech Tagging

import spacy
from spacy.lang.en.examples import sentences  # note: this replaces the Frankenstein sentences from 1.1 with spaCy's example sentences

nlp = spacy.load('en_core_web_sm')
doc = nlp(sentences[0])

print(doc.text)

for token in doc:
    print(token.text, token.pos_, token.dep_)

1.7. Named Entity Recognition (NER)

print(doc.text)

for ent in doc.ents:
    print(ent.text, ent.label_)

1.8. Frequency Analysis

import nltk
from nltk import FreqDist

# FreqDist over the raw string counts characters, not words
print(sentences[0])
freq_dist = FreqDist(sentences[0])
print(freq_dist.most_common(10))

# FreqDist over word tokens counts words
words = word_tokenize(sentences[0])
print(words)
freq_dist = FreqDist(words)
print(freq_dist.most_common(10))

1.9. One-hot encoding

All inputs should be numerical; categorical features should be one-hot encoded, with indices starting at 1.
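A minimal sketch of this (an assumption of mine, not prescribed by the notes: plain Python with numpy, indices assigned in order of first appearance):

import numpy as np

# hypothetical categorical feature
categories = ["red", "green", "blue", "green"]
# assign each category an index, starting with 1
index = {cat: i + 1 for i, cat in enumerate(dict.fromkeys(categories))}

def one_hot(cat, vocab_size):
    vec = np.zeros(vocab_size)
    vec[index[cat] - 1] = 1   # index starts at 1, so the vector position is index - 1
    return vec

encoded = np.array([one_hot(c, len(index)) for c in categories])
print(index)     # {'red': 1, 'green': 2, 'blue': 3}
print(encoded)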

1.10. Tokenization

  • Breaking text into words (there are many steps to consider)
  • Count word frequencies (a key-value dictionary of counts)
    • sort the whole dictionary by frequency
    • if the vocabulary is too big, remove infrequent words (e.g. misspellings or rare names, …); this keeps the one-hot vectors manageable
  • Encode the texts to sequences
    • each word is replaced by its index in the counted dictionary
    • the number of indices is the one-hot vector length
  • One-hot encode all sequences
    • if the one-hot vectors are not too long, word embedding is not needed (see the sketch after this example)
tests[5] = "this is a cat and a"
tests_dict = {"this": (1, 1), "is": (2, 1), "a": (3, 2), "cat": (4, 1), "and": (5, 1)}  # word -> (index, count)
tests_sequences = [1, 2, 3, 4, 5, 3]
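A minimal end-to-end sketch of the steps above (assumptions: a tiny made-up corpus, plain Python plus numpy, no framework):

from collections import Counter
import numpy as np

texts = ["this is a cat and a", "this is a dog"]   # hypothetical toy corpus

# 1. count word frequencies
counts = Counter(word for text in texts for word in text.split())

# 2. build the index from the sorted dictionary (most frequent first, index starts at 1)
word_index = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

# 3. encode each text to a sequence of indices
sequences = [[word_index[word] for word in text.split()] for text in texts]

# 4. one-hot encode the sequences; the vector length is the vocabulary size
vocab_size = len(word_index)
one_hot = [np.eye(vocab_size)[[i - 1 for i in seq]] for seq in sequences]

print(word_index)
print(sequences)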

1.10.1. Tokenize with nltk

tokenize_words = nltk.word_tokenize(sentences[0])
print(f'the first sentence has {len(tokenize_words)} tokens')
print(f'the tokens are {tokenize_words}')

1.11. Word Embedding

Word embedding maps the high-dimensional one-hot vector to a low-dimensional vector: \[ x_{i} = P^{T} \cdot e_{i} \] Here \(e_{i}\) is the high-dimensional one-hot vector of shape (v, 1) of the collected data, \(P^{T}\) is the parameter matrix of shape (d, v) trained from the data, and \(x_{i}\) is the low-dimensional vector of shape (d, 1) used for further training. The dimension \(d\) is important and can be verified with cross validation. Each row of \(P\) is called a word vector and can be interpreted with classification.

An embedding layer needs the vocabulary size (v), the embedding dimension (d), and the word number (the length the sequences are cut to). The layer has v * d parameters.
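A minimal sketch of such a layer with tf.keras (assumptions: TensorFlow is installed; the sizes v, d and wordnum are illustrative values, not taken from a real dataset):

import numpy as np
from tensorflow.keras.layers import Embedding

v, d, wordnum = 10000, 32, 100                    # hypothetical vocabulary size, embedding dim, sequence length
emb = Embedding(input_dim=v, output_dim=d)        # this layer holds v * d trainable parameters
seq = np.random.randint(1, v, size=(1, wordnum))  # one integer-encoded sequence (a batch of size 1)
print(emb(seq).shape)                             # (1, wordnum, d): every index is mapped to a d-dimensional vector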

1.12. Words Count

1.13. TF-IDF weighting for title and body

The idea is to give separate weights to the title and the body, so that the (short) title still has a large impact on document = title + body: \(\mathrm{TFIDF}(document) = \alpha \cdot \mathrm{TFIDF}(title) + (1 - \alpha) \cdot \mathrm{TFIDF}(body)\)
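A minimal sketch of this weighting with scikit-learn (assumptions: `titles` and `bodies` are made-up parallel lists of strings, `alpha` is chosen by hand, and one vectorizer is fitted on both fields so they share a vocabulary; none of these names come from the notes):

from sklearn.feature_extraction.text import TfidfVectorizer

titles = ["a modern prometheus", "gothic fiction"]                                      # hypothetical data
bodies = ["victor frankenstein creates a creature", "the novel mixes horror and romanticism"]
alpha = 0.7

vectorizer = TfidfVectorizer()
vectorizer.fit(titles + bodies)                  # shared vocabulary for title and body
tfidf_title = vectorizer.transform(titles)
tfidf_body = vectorizer.transform(bodies)
tfidf_document = alpha * tfidf_title + (1 - alpha) * tfidf_body
print(tfidf_document.shape)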

1.14. Vectorizing tf-idf

2. Language model

A model that computes \(P(W)\) or \(P(W_{n} \mid W_{1}, W_{2}, \ldots, W_{n-1})\) is called a language model.
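A minimal sketch of such a model (assumption: a toy bigram estimate \(P(W_{n} \mid W_{n-1})\) from raw counts, which is just one very simple instance of a language model):

from collections import Counter

corpus = ["the cat sat", "the cat ran", "the dog sat"]   # hypothetical toy corpus
unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words[:-1])                # count words that have a successor
    bigrams.update(zip(words, words[1:]))

def p_next(word, prev):
    """Estimate P(word | prev) from bigram and unigram counts."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_next("sat", "cat"))   # 0.5: "cat" is followed by "sat" once and by "ran" once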

2.1. skip-gram

2.2. CBOW

Continuous bag of words: the surrounding context words are used to predict the center word, while skip-gram (2.1) uses the center word to predict its context words (see the sketch below).
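A minimal sketch with gensim (assumptions: gensim 4.x is installed and the toy corpus below is made up; `sg=1` selects skip-gram and `sg=0` selects CBOW):

from gensim.models import Word2Vec

toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
skipgram = Word2Vec(toy_corpus, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(toy_corpus, vector_size=50, window=2, min_count=1, sg=0)
print(skipgram.wv["cat"].shape)   # (50,): the learned word vector for "cat"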

3. Text generation

Encoder: A is an RNN or LSTM layer; all inputs (\(x_{1}\) to \(x_{m}\)) share the same A. \(h_{m}\) is the last hidden state. Passing only \(h_{m}\) to the decoder is enough to generate text, but much of the input content is forgotten.
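A minimal sketch of such an encoder with tf.keras (assumptions: TensorFlow is installed; the sizes are illustrative; only the last state \(h_{m}\) is kept, as described above):

import numpy as np
from tensorflow.keras.layers import Embedding, LSTM

v, d, m = 10000, 32, 20                     # hypothetical vocab size, embedding dim, input length
x = np.random.randint(1, v, size=(1, m))    # one integer-encoded input sequence x_1 ... x_m
emb = Embedding(input_dim=v, output_dim=d)
encoder = LSTM(64, return_state=True)       # the same layer (A) is applied at every step
_, h_m, c_m = encoder(emb(x))               # h_m is the last hidden state handed to the decoder
print(h_m.shape)                            # (1, 64)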

4. seq2seq

After one result is generated in the decoder, the network is updated with cross entropy. All results generated so far are used to predict the next result, until generation is finished; in other words, the model consumes the previously generated symbols as additional input when generating the next one.
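A minimal sketch of this autoregressive loop (assumption: `decoder_step` is a made-up stand-in for the trained decoder, so the loop runs without a real model):

import numpy as np

def decoder_step(generated, h_m):
    """Toy stand-in for the decoder: return a probability distribution over a 5-word vocabulary."""
    logits = np.random.randn(5)
    return np.exp(logits) / np.exp(logits).sum()

h_m = np.zeros(64)                       # last encoder state (placeholder)
generated = [1]                          # start-of-sequence token (hypothetical index 1)
for _ in range(10):                      # generate at most 10 symbols
    probs = decoder_step(generated, h_m)
    next_token = int(np.argmax(probs))   # greedy choice; during training, cross entropy against the target updates the network
    generated.append(next_token)         # the new symbol becomes additional input for the next step
    if next_token == 0:                  # hypothetical end-of-sequence index
        break
print(generated)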

5. Transformer

5.1. simple RNN + attention

Encoder input \(X = x_{1}, x_{2}, \ldots, x_{m}\); decoder input \(X' = x'_{1}, x'_{2}, \ldots, x'_{m}\). After the RNN or LSTM we get \(H = h_{1}, h_{2}, \ldots, h_{m}\). Now, unlike before, instead of passing only the last element \(h_{m}\) to the decoder, we use attention to mix the information of all inputs.

  1. Notation:

    • Encoder: the lower index \(i\) is the position of the input in the encoder
    • Decoder: the upper index \(j\) is the position of the generated item in the decoder

    \(a^{j}_{i}\) is the attention weight used to generate the j-th item (\(s^{j}\)) in the decoder with respect to the i-th input (\(x_{i}\)) in X.

  2. Variables
    • Encoder input, \(X = x_{1}, x_{2}, \ldots, x_{m}\)
    • Encoder shared parameter A: the RNN or LSTM parameters shared across all steps
    • Encoder output, \(H = h_{1}, h_{2}, \ldots, h_{m}\), one output at each step of the RNN or LSTM
    • Decoder initial state \(h_{m}\), also denoted \(s^{0}\)
    • query, \(q^{j} = W_{q}^{j} s^{j-1}\)
    • keys, \(k_{i}^{j} = W_{k}^{j} h_{i}\)
    • Key matrix, \(K^{j} = [k_{1}^{j}, k_{2}^{j}, \ldots, k_{m}^{j}]\)
    • attention weights \(a^{j}_{i}\): \(a^{j} = \mathrm{Softmax}(K^{jT} q^{j})\)
    • context vector, \(c^{j} = a_{1}^{j}h_{1} + a_{2}^{j}h_{2} + \ldots + a_{m}^{j}h_{m}\)
    • Decoder output, \(s^{j} = \tanh(A' \cdot [x'^{j}, s^{j-1}, c^{j-1}]^{T})\)
  3. Update the network: softmax(\(c^{j}\)) gives the prediction, and cross-entropy backpropagation updates the network (\(W^{j} \rightarrow W^{j+1}\)); see the sketch after this list.
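A minimal numpy sketch of one attention step \(j\) (assumption: random matrices stand in for the trained weights \(W_{q}\), \(W_{k}\) and for the encoder outputs \(h_{i}\)):

import numpy as np

m, hidden = 4, 8                            # sequence length and hidden size (hypothetical)
H = np.random.randn(m, hidden)              # encoder outputs h_1 ... h_m, one per row
s_prev = np.random.randn(hidden)            # previous decoder state s^{j-1}
W_q = np.random.randn(hidden, hidden)
W_k = np.random.randn(hidden, hidden)

q = W_q @ s_prev                            # query from the decoder state
K = H @ W_k.T                               # keys k_i = W_k h_i, one per row
scores = K @ q                              # K^T q
a = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights a_1^j ... a_m^j
c = a @ H                                   # context vector c^j = sum_i a_i^j h_i
print(a, c.shape)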

5.2. simple RNN + self attention

Encoder only, \(X = x_{1}, x_{2}, \ldots, x_{m}\), without a decoder and without decoder input. After the RNN or LSTM we get \(H = h_{1}, h_{2}, \ldots, h_{m}\). Instead of relying only on the last element \(h_{m}\), self attention is used to mix the information of all inputs within the encoder itself.

  1. Notation:

    • Encoder: the lower index \(i\) is the position of the input in the encoder
    • Generation: the upper index \(j\) is the position of the generated item

    \(a^{j}_{i}\) is the attention weight used to generate the j-th item in the encoder with respect to the i-th input (\(x_{i}\)) in X.

  2. Variables
    • Encoder input, \(X = x_{1}, x_{2}, \ldots, x_{m}\)
    • Encoder shared parameter A: the RNN or LSTM parameters shared across all steps
    • Encoder output, \(H = h_{1}, h_{2}, \ldots, h_{m}\), one output at each step of the RNN or LSTM
    • query, \(q^{j} = W_{q}^{j} h_{j}\)
    • keys, \(k_{i}^{j} = W_{k}^{j} h_{i}\)
    • Key matrix, \(K^{j} = [k_{1}^{j}, k_{2}^{j}, \ldots, k_{m}^{j}]\)
    • attention weights \(a^{j}_{i}\): \(a^{j} = \mathrm{Softmax}(K^{jT} q^{j})\)
    • context vector, \(c^{j} = a_{1}^{j}h_{1} + a_{2}^{j}h_{2} + \ldots + a_{m}^{j}h_{m}\)
  3. Update the network
    • softmax(\(c^{j}\)) gives the prediction, and cross entropy updates the network (\(W^{j} \rightarrow W^{j+1}\))
  4. Note
    • attention: the query comes from the decoder state, \(q^{j} = W_{q}^{j} s^{j-1}\) with \(s^{j} = \tanh(A' \cdot [x'^{j}, s^{j-1}, c^{j-1}]^{T})\)
    • self attention: the query comes from the encoder state itself, \(q^{j} = W_{q}^{j} h_{j}\)

5.3. attention layer

An attention function can be described as mapping a query and a set of key-value pairs to an output. Encoder input \(X = x_{1}, x_{2}, \ldots, x_{m}\); decoder input \(X' = x'_{1}, x'_{2}, \ldots, x'_{m}\). The RNN or LSTM is removed; only the attention layer is constructed.

  1. Notation:

    • Encoder: the lower index \(i\) is the position of the input in the encoder
    • Decoder: the upper index \(j\) is the position of the generated item in the decoder

    \(a^{j}_{i}\) is the attention weight used to generate the j-th item in the decoder with respect to the i-th input (\(x_{i}\)) in X.

  2. Variables
    • values, \(v_{i}^{j} = W_{v}^{j} x_{i}\)
    • keys, \(k_{i}^{j} = W_{k}^{j} x_{i}\)
    • query, \(q^{j} = W_{q}^{j} x'_{j}\)
    • Key matrix, \(K^{j} = [k_{1}^{j}, k_{2}^{j}, \ldots, k_{m}^{j}]\)
    • attention weights \(a^{j}_{i}\): \(a^{j} = \mathrm{Softmax}(K^{jT} q^{j})\)
    • context vector, \(c^{j} = a_{1}^{j}v_{1}^{j} + a_{2}^{j}v_{2}^{j} + \ldots + a_{m}^{j}v_{m}^{j}\)
  3. Update the network
    • softmax(\(c^{j}\)) gives the prediction, and cross entropy updates the network (\(W^{j} \rightarrow W^{j+1}\))
  4. Note
    • X replaces H, but this is still a seq2seq model (the queries come from X')

5.4. self attention layer

Encoder only, \(X = x_{1}, x_{2}, \ldots, x_{m}\), without a decoder and without decoder input.

  1. Notation:

    • Encoder: the lower index \(i\) is the position of the input in the encoder
    • Generation: the upper index \(j\) is the position of the generated item

    \(a^{j}_{i}\) is the attention weight used to generate the j-th item in the encoder with respect to the i-th input (\(x_{i}\)) in X.

  2. Variables
    • Encoder input, \(X = x_{1}, x_{2}, \ldots, x_{m}\)
    • values, \(v_{i}^{j} = W_{v}^{j} x_{i}\)
    • query, \(q^{j} = W_{q}^{j} x_{j}\)
    • keys, \(k_{i}^{j} = W_{k}^{j} x_{i}\)
    • Key matrix, \(K^{j} = [k_{1}^{j}, k_{2}^{j}, \ldots, k_{m}^{j}]\)
    • attention weights \(a^{j}_{i}\): \(a^{j} = \mathrm{Softmax}(K^{jT} q^{j})\)
    • context vector, \(c^{j} = a_{1}^{j}v_{1}^{j} + a_{2}^{j}v_{2}^{j} + \ldots + a_{m}^{j}v_{m}^{j}\)
  3. Update the network: softmax(\(c^{j}\)) gives the prediction, and cross entropy updates the network (\(W^{j} \rightarrow W^{j+1}\)); see the sketch after this list.
  4. Note
    • the query \(q^{j} = W_{q}^{j} x_{j}\) is computed from X itself, not from X' (there is no decoder input)
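A minimal numpy sketch of one self-attention pass (assumption: random matrices stand in for the trained \(W_{q}\), \(W_{k}\), \(W_{v}\) and for the embedded inputs \(x_{i}\); no scaling or masking is included):

import numpy as np

m, d = 4, 8                               # sequence length and feature dimension (hypothetical)
X = np.random.randn(m, d)                 # embedded inputs x_1 ... x_m, one per row
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))

Q = X @ W_q.T                             # queries q_j = W_q x_j
K = X @ W_k.T                             # keys    k_i = W_k x_i
V = X @ W_v.T                             # values  v_i = W_v x_i

scores = Q @ K.T                          # scores[j, i] = k_i^T q_j
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax over i for every j
C = A @ V                                 # context c^j = sum_i a_i^j v_i, one row per position j
print(C.shape)                            # (m, d)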

5.5. transformer model

(figure: RNN_attention.png)
After 6 stacked multi-head self-attention (encoder) layers come another 6 stacked multi-head attention (decoder) layers, each of which takes the output of the 6 self-attention layers as input.
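A minimal sketch of this stacking with PyTorch's built-in Transformer (assumption: PyTorch is installed; the model sizes are illustrative defaults, not values from the notes):

import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6,   # 6 stacked multi-head self-attention (encoder) layers
                       num_decoder_layers=6,   # 6 stacked decoder layers that also attend to the encoder output
                       batch_first=True)
src = torch.randn(1, 10, 512)                  # embedded encoder input X
tgt = torch.randn(1, 7, 512)                   # embedded decoder input X'
out = model(src, tgt)
print(out.shape)                               # torch.Size([1, 7, 512])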

6. Bert

7. ViT

Author: si

Created: 2025-01-24 Fr 19:30
