NLP
Table of Contents
- 1. Data processing
- 1.1. Data load
- 1.2. Alpha filter
- 1.3. Stop words
- 1.4. Stemming
- 1.5. Lemmatization
- 1.6. Part of speech Tagging
- 1.7. Named Entity Recognition (NER)
- 1.8. Frequency Analysis
- 1.9. One-hot encoding
- 1.10. Tokenization
- 1.11. Word Embedding
- 1.12. Word counts
- 1.13. TF-IDF weighting for title and body
- 1.14. Vectorizing tf-idf
- 2. language model
- 3. text generation
- 4. seq2seq
- 5. Transformer
- 6. Bert
- 7. ViT
1. Data processing
1.1. Data load
import nltk
nltk.download('punkt')

with open('/home/si/Desktop/textPreprocessing/frankenstein.txt') as f:
    frankensteintext = f.read()

sentences = nltk.sent_tokenize(frankensteintext)
print(f'frankenstein.txt contains {len(sentences)} sentences!')
1.2. Alpha filter
AlphaFilter_words = nltk.word_tokenize(sentences[0])
AlphaFilter_words = [token for token in AlphaFilter_words if token[0].isalpha()]
print(f'the length of the first sentence is {len(AlphaFilter_words)}')
print(f'the words are {AlphaFilter_words}')
1.3. Stop words
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = stopwords.words("english")
print(f"the number of stop words is {len(stop_words)}")
words = word_tokenize(sentences[0])
words = [token for token in words if token not in stop_words]
print(f'the length of the first sentence is {len(words)}')
print(f'the words are {words}')
1.4. Stemming
nltk.download("wordnet") from nltk.stem import PorterStemmer from nltk.stem import WordNetLemmatizer from nltk.tokenize import word_tokenize import spacy ps = PorterStemmer() stemmed_words = [] words = word_tokenize(sentences[0]) for token in words: stemmed_words.append(ps.stem(token)) print(f'the length of the first sentence is {len(stemmed_words)}') print(f'the word is {stemmed_words}')
1.5. Lemmatization
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(sentences[0])
lemma_words = [token.lemma_ for token in doc]
print(f'the length of the first sentence is {len(lemma_words)}')
print(f'the lemmas are {lemma_words}')
1.6. Part of speech Tagging
import spacy
from spacy.lang.en.examples import sentences  # note: this overrides the Frankenstein sentences from above

nlp = spacy.load('en_core_web_sm')
doc = nlp(sentences[0])
print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_)
1.7. Named Entity Recognition (NER)
print(doc.text)
for ent in doc.ents:
    print(ent.text, ent.label_)
1.8. Frequency Analysis
from nltk import FreqDist
from nltk.tokenize import word_tokenize

# FreqDist over a raw string counts characters
print(sentences[0])
freq_dist = FreqDist(sentences[0])
print(freq_dist.most_common(10))

# tokenize first to count words instead
words = word_tokenize(sentences[0])
print(words)
freq_dist = FreqDist(words)
print(freq_dist.most_common(10))
1.9. One-hot encoding
All model inputs must be numerical, so categorical values should be one-hot encoded, with indices starting at 1.
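A minimal sketch of one-hot encoding under that convention; the toy vocabulary and the choice to leave index 0 unused are illustrative assumptions, not taken from the notes.

vocab = {"cat": 1, "dog": 2, "bird": 3}

def one_hot(word, size):
    vec = [0] * (size + 1)          # position 0 is reserved, categories occupy 1..size
    vec[vocab[word]] = 1
    return vec

print(one_hot("dog", len(vocab)))   # [0, 0, 1, 0]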
1.10. Tokenization
- Break the text into words (there are many details to consider)
- Count word frequencies (build a key-value dictionary of counts)
- List the dictionary sorted by frequency
- If the list is too big, remove infrequent words (often misspellings or rare names, ...); this keeps the one-hot vectors manageable
- Encode the texts into sequences
- each word is replaced by its index in the counted dictionary
- the number of indices is the length of the one-hot vector
- One-hot encode all sequences
- if the one-hot vectors are not too long, word embedding is not needed
tests[5] = "this is a cat and a"
tests_dict = {"this": {1: 1}, "is": {2: 1}, "a": {3: 2}, "cat": {4: 1}, "and": {5: 1}}  # word -> {index: count}
tests_sequences = [1, 2, 3, 4, 5, 3]
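A sketch of this pipeline in plain Python (the toy texts are made up; no framework tokenizer is assumed): count frequencies, build an index, then encode each text as a sequence.

from collections import Counter

texts = ["this is a cat and a", "this is a dog"]
counts = Counter(word for text in texts for word in text.split())

# sort by frequency and assign indices starting at 1 (0 stays free, e.g. for padding)
index = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

# encode every text as a sequence of dictionary indices
sequences = [[index[word] for word in text.split()] for text in texts]
print(index)
print(sequences)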
1.10.1. Tokenize with nltk
tokenize_words = nltk.word_tokenize(sentences[0])
print(f'the length of the first sentence is {len(tokenize_words)}')
print(f'the words are {tokenize_words}')
1.11. Word Embedding
Word embedding compresses the high-dimensional one-hot vector into a low-dimensional vector: \[ X_{i} = P^{T} \cdot e_{i} \] Here \(e_{i}\) is the high-dimensional one-hot vector of the data, with shape (v, 1); \(P^{T}\) is the parameter matrix trained on the data, with shape (d, v); and \(X_{i}\) is the low-dimensional vector of shape (d, 1) used for further training. The embedding dimension \(d\) is an important hyperparameter and can be chosen with cross validation. Each row of \(P\) is called a word vector (词向量) and can be interpreted, for example through classification.
An embedding layer needs the vocabulary size (v), the embedding dimension (d), and the sequence length (the number of words kept per text after cutting). The layer has v * d trainable parameters.
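A sketch of such a layer, assuming PyTorch is available; v, d and the sequence length are toy values, not taken from the notes.

import torch
import torch.nn as nn

v, d, word_num = 1000, 32, 20                  # vocabulary size, embedding dim, words kept per text
embedding = nn.Embedding(num_embeddings=v, embedding_dim=d)
print(sum(p.numel() for p in embedding.parameters()))   # v * d = 32000 trainable parameters

sequence = torch.randint(0, v, (1, word_num))  # one encoded text of word indices
vectors = embedding(sequence)                  # shape (1, word_num, d): one d-dim vector per word
print(vectors.shape)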
1.12. Word counts
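A small word-count sketch using collections.Counter; the example text is made up.

from collections import Counter
from nltk.tokenize import word_tokenize

text = "the cat sat on the mat and the cat slept"
word_counts = Counter(word_tokenize(text))
print(word_counts.most_common(3))   # [('the', 3), ('cat', 2), ...]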
1.13. TF-IDF weighting for title and body
The idea is to weight the title and the body separately, so that the title has a larger impact on the representation of the document (document = title + body): \( \text{TF-IDF}(document) = \text{TF-IDF}(title) \cdot \alpha + \text{TF-IDF}(body) \cdot (1 - \alpha) \) (see the sketch in 1.14).
1.14. Vectorizing tf-idf
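A sketch that vectorizes title and body separately with scikit-learn's TfidfVectorizer and mixes them with alpha as in 1.13; the toy documents and the alpha value are assumptions for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer

titles = ["cat care basics", "dog training guide"]
bodies = ["how to feed and groom a cat at home", "teach a dog to sit and stay"]
alpha = 0.7

vectorizer = TfidfVectorizer()
vectorizer.fit(titles + bodies)                      # shared vocabulary for both fields
title_tfidf = vectorizer.transform(titles).toarray()
body_tfidf = vectorizer.transform(bodies).toarray()

document_tfidf = alpha * title_tfidf + (1 - alpha) * body_tfidf
print(document_tfidf.shape)                          # (number of documents, vocabulary size)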
2. language model
A model that computes \(P(W)\) or \(P(W_{n} \mid W_{1}, W_{2}, W_{3}, W_{4}, \dots, W_{n-1})\) is called a language model.
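A minimal count-based bigram sketch of such a model, assuming a toy corpus and no smoothing; it estimates \(P(W_{n} \mid W_{n-1})\) from counts.

from collections import Counter

corpus = ["the cat sat", "the cat ran", "the dog sat"]
unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    tokens = ["<s>"] + sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_next(prev, word):
    # P(word | prev) estimated as count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(p_next("the", "cat"))   # 2/3: 'cat' follows 'the' in two of the three sentences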
2.1. skip-gram
2.2. CBOW
Continuous bag of words: the surrounding context words are used to predict the center word, whereas skip-gram uses the center word to predict its context (see the sketch below).
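A sketch of how the training pairs differ between skip-gram (2.1) and CBOW (2.2), assuming a window of 1; the sentence is made up and the models themselves are not trained here.

sentence = ["the", "cat", "sat", "on", "the", "mat"]
window = 1

skipgram_pairs = []   # (center -> one context word): the center word predicts its neighbours
cbow_pairs = []       # (context words -> center): the neighbours together predict the center
for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    skipgram_pairs += [(center, c) for c in context]
    cbow_pairs.append((context, center))

print(skipgram_pairs[:4])   # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
print(cbow_pairs[:2])       # [(['cat'], 'the'), (['the', 'sat'], 'cat')]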
3. text generation
Encoder: A is an RNN or LSTM layer; all inputs (\(x_{1}\) to \(x_{m}\)) share the same A, and \(h_{m}\) is the last state. By giving only \(h_{m}\) to the decoder we can generate text, but much of the input content is forgotten.
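A sketch of this encoder, assuming PyTorch; the sizes are toy values.

import torch
import torch.nn as nn

d_in, d_h, m = 8, 16, 5
encoder = nn.LSTM(input_size=d_in, hidden_size=d_h)   # the shared parameters A used at every step
x = torch.rand(m, 1, d_in)                            # inputs x_1 ... x_m, batch size 1
outputs, (h_n, c_n) = encoder(x)                      # outputs holds h_1 ... h_m
h_m = outputs[-1]                                     # only h_m is handed to the decoder,
print(h_m.shape)                                      # so earlier inputs are easily forgotten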
4. seq2seq
After one output of the decoder is generated, the network is updated with cross entropy. All outputs obtained so far are used to predict the next one, consuming the previously generated symbols as additional input when generating the next, until the whole sequence is finished.
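A minimal sketch of this decode-and-update loop, assuming PyTorch, teacher forcing, and toy sizes; the layer choices and the start-of-sequence id 0 are made up for illustration.

import torch
import torch.nn as nn

vocab_size, hidden = 10, 16
embed = nn.Embedding(vocab_size, hidden)
rnn_cell = nn.GRUCell(hidden, hidden)      # one decoder step
out_proj = nn.Linear(hidden, vocab_size)   # maps the decoder state to vocabulary logits
loss_fn = nn.CrossEntropyLoss()

h_m = torch.zeros(1, hidden)               # last encoder state, used as the initial decoder state
target = torch.tensor([[3, 5, 7]])         # toy target sequence

s = h_m
prev_token = torch.tensor([0])             # assumed start-of-sequence id
loss = 0.0
for j in range(target.size(1)):
    x = embed(prev_token)                  # consume the previously produced symbol as input
    s = rnn_cell(x, s)                     # next decoder state
    logits = out_proj(s)
    loss = loss + loss_fn(logits, target[:, j])   # cross entropy against the gold token
    prev_token = target[:, j]              # teacher forcing: feed the gold token as the next input
loss.backward()                            # update the network with the accumulated loss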
5. Transformer
5.1. simple RNN + attention
Encoder input \(X = x_{1}, x_{2}, \dots, x_{m}\), decoder input \(X' = x'_{1}, x'_{2}, \dots, x'_{m}\). After the RNN or LSTM we get \(H = h_{1}, h_{2}, \dots, h_{m}\). Unlike before, where only the last element \(h_{m}\) is passed to the decoder, attention is used to mix the information of all inputs (a numerical sketch follows this list).
Notation:
- Encoder: the lower index \(i\) is the position of an input in the encoder
- Decoder: the upper index \(j\) is the position of a generated item in the decoder
\(a^{j}_{i}\) is the weight used when generating the j-th item (\(s^{j}\)) in the decoder with respect to the i-th input (\(x_{i}\)) in X.
- Variables
- Encoder input, \(X = x_{1}, x_{2}, \dots, x_{m}\)
- Encoder shared parameters A: the RNN or LSTM parameters shared by all steps
- Encoder output, \(H = h_{1}, h_{2}, \dots, h_{m}\), the output at each step of the RNN or LSTM
- Decoder initial state \(h_{m}\), also denoted \(s^{0}\)
- query, \(q^{j} = W_{q}\, s^{j-1}\) (from the decoder state)
- keys, \(k_{i} = W_{k}\, h_{i}\)
- key matrix, \(K = [k_{1}, k_{2}, \dots, k_{m}]\)
- attention weights \(a^{j}\), \(a^{j} = \mathrm{Softmax}(K^{T} q^{j})\)
- context vector, \(c^{j} = a^{j}_{1} h_{1} + a^{j}_{2} h_{2} + \dots + a^{j}_{m} h_{m}\)
- decoder output, \(s^{j} = \tanh(A' \cdot [x'_{j}, s^{j-1}, c^{j-1}]^{T})\)
- update the network: softmax(\(c^{j}\)) gives the prediction, and cross entropy propagates back to update the network (\(W^{j} \to W^{j+1}\))
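A numerical sketch of one attention step j as defined in the list above, assuming NumPy and toy dimensions; the random values stand in for trained parameters and encoder states.

import numpy as np

d, m = 4, 5                                  # state size and number of encoder steps
rng = np.random.default_rng(0)
H = rng.normal(size=(d, m))                  # encoder outputs h_1 ... h_m as columns
s_prev = rng.normal(size=(d,))               # decoder state s^{j-1}
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))

q = W_q @ s_prev                             # query from the decoder state
K = W_k @ H                                  # keys k_1 ... k_m as columns
scores = K.T @ q                             # one score per encoder position
a = np.exp(scores - scores.max())
a = a / a.sum()                              # a^j = Softmax(K^T q^j)
c = H @ a                                    # context c^j = sum_i a^j_i h_i
print(a.round(3), c.shape)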
5.2. simple RNN + self attention
Only an encoder with input \(X = x_{1}, x_{2}, \dots, x_{m}\); there is no decoder and no decoder input. After the RNN or LSTM we get \(H = h_{1}, h_{2}, \dots, h_{m}\). Instead of relying only on the last element \(h_{m}\), the attention trick is again used to mix the information of all inputs, but now the queries also come from the encoder states.
Notation:
- Encoder: the lower index \(i\) is the position of an input in the encoder
- Generation: the upper index \(j\) is the position of a generated item
\(a^{j}_{i}\) is the weight used when generating the j-th item in the encoder with respect to the i-th input (\(x_{i}\)) in X.
- Variables
- Encoder input, \(X = x_{1}, x_{2}, \dots, x_{m}\)
- Encoder shared parameters A: the RNN or LSTM parameters shared by all steps
- Encoder output, \(H = h_{1}, h_{2}, \dots, h_{m}\), the output at each step of the RNN or LSTM
- query, \(q^{j} = W_{q}\, h_{j}\)
- keys, \(k_{i} = W_{k}\, h_{i}\)
- key matrix, \(K = [k_{1}, k_{2}, \dots, k_{m}]\)
- attention weights \(a^{j}\), \(a^{j} = \mathrm{Softmax}(K^{T} q^{j})\)
- context vector, \(c^{j} = a^{j}_{1} h_{1} + a^{j}_{2} h_{2} + \dots + a^{j}_{m} h_{m}\)
- update the network
- softmax(\(c^{j}\)) gives the prediction, and cross entropy propagates back to update the network (\(W^{j} \to W^{j+1}\))
- Note
- attention: the query comes from the decoder state, \(q^{j} = W_{q}\, s^{j-1}\) with \(s^{j} = \tanh(A' \cdot [x'_{j}, s^{j-1}, c^{j-1}]^{T})\)
- self attention: the query comes from the encoder state, \(q^{j} = W_{q}\, h_{j}\)
5.3. attention layer
An attention function can be described as mapping a query and a set of key-value pairs to an output. Encoder input \(X = x_{1}, x_{2}, \dots, x_{m}\), decoder input \(X' = x'_{1}, x'_{2}, \dots, x'_{m}\). The RNN or LSTM is removed; only the attention layer is constructed.
Notation:
- Encoder: the lower index \(i\) is the position of an input in the encoder
- Decoder: the upper index \(j\) is the position of a generated item in the decoder
\(a^{j}_{i}\) is the weight used when generating the j-th item in the decoder with respect to the i-th input (\(x_{i}\)) in X.
- Variables
- values, \(v_{i} = W_{v}\, x_{i}\)
- keys, \(k_{i} = W_{k}\, x_{i}\)
- query, \(q^{j} = W_{q}\, x'_{j}\)
- key matrix, \(K = [k_{1}, k_{2}, \dots, k_{m}]\)
- attention weights \(a^{j}\), \(a^{j} = \mathrm{Softmax}(K^{T} q^{j})\)
- context vector, \(c^{j} = a^{j}_{1} v_{1} + a^{j}_{2} v_{2} + \dots + a^{j}_{m} v_{m}\)
- update the network
- softmax(\(c^{j}\)) gives the prediction, and cross entropy propagates back to update the network (\(W^{j} \to W^{j+1}\))
- Note
- X replaces H, but this is still a seq2seq model (the queries come from X')
5.4. self attention layer
Only an encoder with input \(X = x_{1}, x_{2}, \dots, x_{m}\); there is no decoder and no decoder input.
Notation:
- Encoder: the lower index \(i\) is the position of an input in the encoder
- Generation: the upper index \(j\) is the position of a generated item
\(a^{j}_{i}\) is the weight used when generating the j-th item in the encoder with respect to the i-th input (\(x_{i}\)) in X.
- Variables
- Encoder input, \(X = x_{1}, x_{2}, \dots, x_{m}\)
- values, \(v_{i} = W_{v}\, x_{i}\)
- keys, \(k_{i} = W_{k}\, x_{i}\)
- query, \(q^{j} = W_{q}\, x_{j}\)
- key matrix, \(K = [k_{1}, k_{2}, \dots, k_{m}]\)
- attention weights \(a^{j}\), \(a^{j} = \mathrm{Softmax}(K^{T} q^{j})\)
- context vector, \(c^{j} = a^{j}_{1} v_{1} + a^{j}_{2} v_{2} + \dots + a^{j}_{m} v_{m}\)
- update the network: softmax(\(c^{j}\)) gives the prediction, and cross entropy propagates back to update the network (\(W^{j} \to W^{j+1}\))
- Note
- in the query \(q^{j} = W_{q}\, x_{j}\) the input is X itself, not X' (a sketch follows this list)
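A sketch of this self-attention layer, assuming NumPy and toy sizes; the random matrices stand in for the trained parameters W_v, W_k, W_q.

import numpy as np

d, m = 4, 5
rng = np.random.default_rng(1)
X = rng.normal(size=(d, m))                  # inputs x_1 ... x_m as columns
W_v = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_q = rng.normal(size=(d, d))

V = W_v @ X                                  # values
K = W_k @ X                                  # keys
Q = W_q @ X                                  # queries come from X itself, not from X'
scores = K.T @ Q                             # (m, m): score of key i for query j
A = np.exp(scores - scores.max(axis=0, keepdims=True))
A = A / A.sum(axis=0, keepdims=True)         # column j is a^j = Softmax(K^T q^j)
C = V @ A                                    # column j is c^j = sum_i a^j_i v_i
print(C.shape)                               # (d, m): one context vector per position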
5.5. transformer model
The encoder consists of 6 stacked multi-head self-attention layers. The decoder consists of another 6 stacked multi-head attention layers, each of which also takes the output of the 6-layer self-attention (encoder) stack as input.
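A minimal sketch of that 6 + 6 stacking using PyTorch's built-in nn.Transformer; the framework choice and the toy tensor sizes are assumptions, since the notes only describe the stacking.

import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6,   # 6 stacked multi-head self-attention (encoder) layers
                       num_decoder_layers=6)   # 6 stacked decoder layers
src = torch.rand(10, 2, 512)   # (source length, batch, d_model)
tgt = torch.rand(7, 2, 512)    # (target length, batch, d_model)
out = model(src, tgt)          # each decoder layer attends over the output of the encoder stack
print(out.shape)               # torch.Size([7, 2, 512])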