Elon Musk has become an internet sensation in recent years, with his views on the future, his fun personality and his passion for technology. By now everyone knows him, either as the electric-car guy or as the guy who builds flamethrowers. He is most active on Twitter, where he shares everything, even memes!
He inspires many young people in the IT industry, so I wanted to do a fun little project: build an AI that generates text based on his previous Twitter posts. I wanted to capture his style and see what kind of weird results it would produce.
The data I'm using was taken directly from Elon Musk's Twitter, both from his posts and from his replies. You can download the dataset at this link.
Importing the libraries:
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
import numpy as np
import pandas as pd
import re
Now I am going to create a function that removes links, @mentions, HTML tags and other noise that would confuse the model, so that we are left with clean text.
# Import the data
data_path = "C:/Users/Dejan/Downloads/elonmusk.csv"
data = pd.read_csv(data_path)

# Function to clean the text
def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove links,
    remove HTML tags, remove @mentions and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\w*\d\w*', '', text)
    text = " ".join(filter(lambda x: not x.startswith("@"), text.split()))
    return text

# Apply the function
data['text'] = data['text'].apply(lambda x: clean_text(x))
data = data['text']
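As a quick sanity check, here is what the function does to a made-up example tweet (the string below is mine, not from the dataset):

# A made-up example, not taken from the dataset
print(clean_text("Launching soon https://t.co/xyz @someone\nSee you there!"))
# -> 'launching soon see you there!'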
Let's define a tokenizer and apply it to the text. This is how we map every word to a numerical representation; we have to do this because neural networks cannot accept strings as input. If you are new to this, there is a great series on YouTube by Lawrence Moroney that I suggest you check out.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data)
total_words = len(tokenizer.word_index) + 1
print(total_words)  # 5952

input_sequences = []
for line in data:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
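To make the n-gram idea concrete, here is a tiny toy illustration (the sentence is made up and is not part of the dataset):

# Toy example of the n-gram expansion above
toy = Tokenizer()
toy.fit_on_texts(["to the moon"])
seq = toy.texts_to_sequences(["to the moon"])[0]   # e.g. [1, 2, 3]
print([seq[:i+1] for i in range(1, len(seq))])     # [[1, 2], [1, 2, 3]]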
Now we have to define max_sequence_length (all of the data must be padded to a fixed length, just as with convolutions), and we also need to convert input_sequences to a numpy array.
max_sequence_length = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_length, padding='pre'))
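If you want to verify the padding, a quick shape check (purely illustrative) looks like this:

# Every sequence now has the same length, padded with zeros on the left
print(input_sequences.shape)   # (number of sequences, max_sequence_length)
print(input_sequences[0])      # mostly zeros, token ids at the end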
We are going to split each sequence so that all of the elements except the last one become our X, and the last element becomes our y. What's more, our y is a one-hot representation over total_words, which can sometimes be a large amount of data (if total_words is 5952, that means each y has the shape (5952,)).
# Create predictors and label
xs, labels = input_sequences[:, :-1], input_sequences[:, -1]
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
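To see the split in terms of shapes (the numbers follow from the values above):

print(xs.shape)   # (number of sequences, max_sequence_length - 1)
print(ys.shape)   # (number of sequences, total_words) -> each y is a (5952,) one-hot vector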
Below is the configuration of our model.
model = Sequential()
model.add(Embedding(total_words, 80, input_length=max_sequence_length-1))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(50))
model.add(tf.keras.layers.Dropout(0.1))
model.add(Dense(total_words//20))
model.add(Dense(total_words, activation='softmax'))
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 56, 80)            476160
_________________________________________________________________
lstm_2 (LSTM)                (None, 56, 100)           72400
_________________________________________________________________
lstm_3 (LSTM)                (None, 50)                30200
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0
_________________________________________________________________
dense_2 (Dense)              (None, 297)               15147
_________________________________________________________________
dense_3 (Dense)              (None, 5952)              1773696
=================================================================
Total params: 2,367,603
Trainable params: 2,367,603
Non-trainable params: 0
I tried a couple of optimizers and found that Adam works best for this example. Let's compile and fit the model:
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=['accuracy'])
history = model.fit(xs, ys, epochs=200, verbose=1)

# Output
Epoch 196/200
1026/1026 [==============================] - 12s 12ms/step - loss: 0.7377 - accuracy: 0.8031
Epoch 197/200
1026/1026 [==============================] - 12s 12ms/step - loss: 0.7363 - accuracy: 0.8025
Epoch 198/200
1026/1026 [==============================] - 12s 12ms/step - loss: 0.7236 - accuracy: 0.8073
Epoch 199/200
1026/1026 [==============================] - 19s 18ms/step - loss: 0.7147 - accuracy: 0.8083
Epoch 200/200
1026/1026 [==============================] - 12s 12ms/step - loss: 0.7177 - accuracy: 0.8070
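If you would like to see how the loss and accuracy evolved over the 200 epochs, a small optional sketch with matplotlib (an extra import, not used anywhere else in this post) would be:

import matplotlib.pyplot as plt

# Plot the training curves stored in the history object
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['loss'], label='loss')
plt.xlabel('epoch')
plt.legend()
plt.show()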
Let's create a for loop that will generate new text based on seed_text and the number of words we define. This part of the code can seem a bit intimidating, but once you read each line carefully, you will see that we have already done something similar before.
# seed_text and next_words are set in the examples below
for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_length - 1, padding='pre')
    predicted = np.argmax(model.predict(token_list), axis=-1)
    output_word = ""
    for word, index in tokenizer.word_index.items():
        if index == predicted:
            output_word = word
            break
    seed_text += " " + output_word

print(seed_text)
Now is the time to play with our model. Wow!
seed_text = "Space is big" next_words = 20 Space is big conflation of cats a lot of civilization by spacex is making a few months of dragon is intense as we seed_text = "i think about flowers" next_words = 30 i think about flowers that on the future it are limited as you could brute force it with tankers to low earth orbit that’s probably faster than liquid temp in year we can have seed_text = "i want to colonize jupiter" next_words = 40 i want to colonize jupiter be words just be order to zero immediate future nor can we ourselves accurately predict what issues we will encounter on a short term fine grained level with in the house with it with a human part of the us
Space is a big conflation of cats!? Who would have known! As you can see, the results the model gives are silly and do not make much sense. As with all deep learning models, there are many things that could be tweaked to generate better results; one idea is sketched below, and I leave the rest to you.
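For example, one simple tweak (just a sketch of one idea, not part of the original code) is to sample the next word from the predicted distribution with a temperature instead of always taking the argmax, which tends to make the generated text less repetitive:

# Hypothetical helper: sample a word id from the predicted probabilities
def sample_with_temperature(probs, temperature=0.8):
    probs = np.log(probs + 1e-9) / temperature
    probs = np.exp(probs) / np.sum(np.exp(probs))
    return np.random.choice(len(probs), p=probs)

# Inside the generation loop, the argmax line could be replaced with:
# predicted = sample_with_temperature(model.predict(token_list)[0])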