Text generation with TensorFlow
Text can be generated word by word or character by character. The second option works well: models manage to reconstruct words as well as punctuation.
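To make the two granularities concrete, here is a minimal sketch that splits the same sentence both ways. The sample sentence and the variable names are illustrative assumptions and do not come from the notebook.

```python
# Illustrative sentence (an assumption, not taken from the corpus).
sample_text = "the oath is taken."

# Word-level tokens: split on whitespace; punctuation stays attached to words.
word_tokens = sample_text.split()

# Character-level tokens: every character, including spaces and punctuation.
char_tokens = list(sample_text)

print(word_tokens)
print(len(char_tokens), sorted(set(char_tokens)))
```

A character-level vocabulary stays small (a few dozen symbols), whereas a word-level vocabulary grows with the corpus.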
Importing packages
import unicodedata
import re
from pprint import pprint
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
Reading the data
('My fellow citizens: I stand here today humbled by the task before us, '
'grateful for the trust you have bestowed, mindful of the sacrifices borne by '
'our ancestors. I thank President Bush for his service to our nation, as well '
'as the generosity and cooperation he has shown throughout this transition. '
'Forty-four Americans have now taken the presidential oath. The words have '
'been spoken during rising tides of prosperity and the still waters of peace. '
'Yet, every so often the oath is taken amidst gathering clouds and raging '
'storms. At these moments, America has carried on not simply because of the '
'skill or vision of those in high office, but because We the People have '
'remained faithful to the ideals of our forbearers, and true to our founding '
'documents. So it has been. So it must be with this generation of Americans. '
'That we are in the midst of crisis is now well understood. Our nation is at '
'war, against a far-reaching network of violence and hatred. Our economy is '
'badly weakened, a consequen')
Cleaning the data
def preprocess_text(text):
    # strip all accents
    new_text = unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode('utf-8')
    # convert to lowercase
    new_text = new_text.lower()
    # keep only letters, commas, periods, apostrophes, hyphens and newlines
    new_text = re.sub('[^a-z,.\'\n-]+', ' ', new_text)
    # remove blank lines and collapse double spaces
    new_text = new_text.replace('\n\n', '').replace('  ', ' ')
    return new_text
clean_corpus = preprocess_text(corpus)
pprint(clean_corpus[:1000])
('my fellow citizens i stand here today humbled by the task before us, '
'grateful for the trust you have bestowed, mindful of the sacrifices borne by '
'our ancestors. i thank president bush for his service to our nation, as well '
'as the generosity and cooperation he has shown throughout this transition. '
'forty-four americans have now taken the presidential oath. the words have '
'been spoken during rising tides of prosperity and the still waters of peace. '
'yet, every so often the oath is taken amidst gathering clouds and raging '
'storms. at these moments, america has carried on not simply because of the '
'skill or vision of those in high office, but because we the people have '
'remained faithful to the ideals of our forbearers, and true to our founding '
'documents. so it has been. so it must be with this generation of americans. '
'that we are in the midst of crisis is now well understood. our nation is at '
'war, against a far-reaching network of violence and hatred. our economy is '
'badly weakened, a consequenc')
Building the vocabulary
print('Corpus size:', len(clean_corpus))
chars = sorted(list(set(clean_corpus)))
print('Total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
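The two dictionaries above are inverse mappings: one turns characters into integer ids, the other turns ids back into characters. The round trip can be sketched on a toy string; `toy_corpus` and the `toy_*` names are assumptions standing in for `clean_corpus` and its vocabulary.

```python
# Toy corpus standing in for clean_corpus (an assumption for illustration).
toy_corpus = "so it has been."

toy_chars = sorted(set(toy_corpus))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# Encode a string to integer ids, then decode it back.
encoded = [toy_char_indices[c] for c in "has"]
decoded = "".join(toy_indices_char[i] for i in encoded)

print(toy_chars)
print(encoded, decoded)
```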
Building the sequences
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(clean_corpus) - maxlen, step):
    sentences.append(clean_corpus[i: i + maxlen])
    next_chars.append(clean_corpus[i + maxlen])
print('Total sequences:', len(sentences))
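The windowing loop is easier to follow on a short string. The sketch below applies the same logic with a smaller window and stride; `toy_text`, `toy_maxlen` and `toy_step` are assumptions chosen so the result fits on a few lines.

```python
toy_text = "hello world"
toy_maxlen = 5   # window length (the notebook uses 40)
toy_step = 2     # stride between windows (the notebook uses 3)

toy_sentences = []
toy_next_chars = []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    # Input: a window of toy_maxlen characters.
    toy_sentences.append(toy_text[i: i + toy_maxlen])
    # Target: the single character that follows the window.
    toy_next_chars.append(toy_text[i + toy_maxlen])

for s, c in zip(toy_sentences, toy_next_chars):
    print(repr(s), '->', repr(c))
```

Because the stride is smaller than the window length, consecutive windows overlap, which is what "semi-redundant" refers to.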
Vectorizing the sequences
print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
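Each sequence becomes a matrix with one row per time step and one column per vocabulary entry, holding a single 1 per row. A minimal sketch of that one-hot encoding, using plain lists and a three-letter vocabulary (both assumptions made to keep the example small):

```python
# Toy vocabulary and sequence (assumptions for illustration).
vocab = ['a', 'b', 'c']
vocab_index = {c: i for i, c in enumerate(vocab)}
sequence = "cab"

# One row per time step, one column per vocabulary entry.
one_hot = [[0] * len(vocab) for _ in sequence]
for t, char in enumerate(sequence):
    one_hot[t][vocab_index[char]] = 1

# Decoding: the position of the 1 in each row gives the character back.
recovered = "".join(vocab[row.index(1)] for row in one_hot)
print(one_hot, recovered)
```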
Designing the model
# build the model: a single LSTM
print('Build model...')
model = keras.Sequential()
model.add(keras.layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(keras.layers.Dense(len(chars), activation='softmax'))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm (LSTM)                  (None, 128)               81920
dense (Dense)                (None, 31)                3999
=================================================================
Total params: 85,919
Trainable params: 85,919
Non-trainable params: 0
_________________________________________________________________
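The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias) and reads the vocabulary size of 31 from the Dense output shape; the helper function names are hypothetical.

```python
def lstm_param_count(units, input_dim):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(inputs, outputs):
    # One weight per input/output pair plus one bias per output.
    return inputs * outputs + outputs

vocab_size = 31   # read from the Dense output shape in the summary
lstm_params = lstm_param_count(128, vocab_size)
dense_params = dense_param_count(128, vocab_size)
print(lstm_params, dense_params, lstm_params + dense_params)
```

The three printed values match the `Param #` column and the total reported by `model.summary()`.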
Configuring training
optimizer = keras.optimizers.Adam(learning_rate=0.01)
model.compile(loss='categorical_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])
Training
Inference
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
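The effect of the temperature in `sample()` can be seen on a small distribution. This sketch reproduces only the rescaling step (log, divide by temperature, renormalize) in plain Python; `apply_temperature` and `toy_probs` are hypothetical names and values, not part of the notebook.

```python
import math

def apply_temperature(probs, temperature):
    # Same rescaling as in sample(): log, divide by temperature, renormalize.
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

toy_probs = [0.6, 0.3, 0.1]   # hypothetical model output
for temperature in [0.2, 1.0, 1.2]:
    rescaled = apply_temperature(toy_probs, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

A temperature below 1 concentrates the probability mass on the most likely character, which tends to give repetitive text. A temperature above 1 flattens the distribution, which gives more varied text with more misspellings.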
def predict_text(start_index):
    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('----- temperature:', temperature)
        generated = ''
        sentence = clean_corpus[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        print(generated, end='')
        for i in range(400):
            # one-hot encode the current window of maxlen characters
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.
            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = indices_char[next_index]
            generated += next_char
            # slide the window: drop the first character, append the new one
            sentence = sentence[1:] + next_char
            print(next_char, end='', flush=True)
        print()
start_index = np.random.randint(0, len(clean_corpus) - maxlen - 1)
predict_text(start_index)
----- temperature: 0.2
----- Generating with seed: "of my colleagues or staffers would excha"
of my colleagues or staffers would exchatting the support the country and the challenges that the change the support the same persons and senator the support that we have to do the same american people and the country that the president that the same contracts that the same country the same country that the country that the same country that the support the same country and the same country that the country where the same country that t
----- temperature: 0.5
----- Generating with seed: "of my colleagues or staffers would excha"
of my colleagues or staffers would exchatting and a capable judge the change the support the same allow that we should lear individual states and supplion senator who have a faith has been in the political program of the support of the caused on the support. i get that you have the crisis who want to work the senate to order that we should ever bestit of the succeed to come to get to come to the planet the sost that our political contra
----- temperature: 1.0
----- Generating with seed: "of my colleagues or staffers would excha"
of my colleagues or staffers would exchans clease here when the other childrens, most our engogions and members movement would sezemes of president senator and aftersers to accomps is. kid our womanbers and flexible to come topight alant, choices and help uspeed is not the few threat in end met amendment to take herse health care-know that consequence made to just stood obligation in greatesm clan emphersable very a drardes who are not
----- temperature: 1.2
----- Generating with seed: "of my colleagues or staffers would excha"
of my colleagues or staffers would exchatenthisis, it was, race. intide-seried from natia. you truahs, the ral begin -- from incregitious know but collinal committee, neam to be princility. undemo, it yearogphilom, and our fadry lenglies. toepre fack inligistimated to no shore are wreates strugnion introrved docure down there's think mudeman very bara. dick islat learned our future, now there about a tcroying recouse spack there agown