Text generation with TensorFlow

Text can be generated word by word or character by character. The second option works well: character-level models manage to reconstruct whole words as well as punctuation.
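To make the difference between the two granularities concrete, here is a minimal sketch on a toy sentence (the sentence is an illustrative assumption, not part of the corpus used below). It only compares the number of tokens and the vocabulary size in each case.

```python
# Toy sentence chosen for illustration only (an assumption, not the tutorial corpus).
text = "so it has been. so it must be."

# Word-level: the model would predict the next word,
# so the vocabulary is the set of distinct whitespace-separated words.
word_tokens = text.split()
word_vocab = sorted(set(word_tokens))

# Character-level: the model would predict the next character,
# so spaces and punctuation are part of the vocabulary.
char_tokens = list(text)
char_vocab = sorted(set(char_tokens))

print(len(word_tokens), "word tokens,", len(word_vocab), "distinct words")
print(len(char_tokens), "character tokens,", len(char_vocab), "distinct characters")
```

A character vocabulary stays small and closed (a few dozen symbols), whereas a word vocabulary grows with the corpus. That is one reason the character-level approach is convenient for a small model.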

Importing packages

import sys
import unicodedata
import re
from pprint import pprint
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras

Reading the data

# INPUT_FILE is the path to the JSON file of speeches (it must be defined beforehand)
speechs = pd.read_json(INPUT_FILE)
speechs.sample(10)

corpus = " ".join(list(speechs["content"]))
pprint(corpus[:1000])

('My fellow citizens: I stand here today humbled by the task before us, '
 'grateful for the trust you have bestowed, mindful of the sacrifices borne by '
 'our ancestors. I thank President Bush for his service to our nation, as well '
 'as the generosity and cooperation he has shown throughout this transition. '
 'Forty-four Americans have now taken the presidential oath. The words have '
 'been spoken during rising tides of prosperity and the still waters of peace. '
 'Yet, every so often the oath is taken amidst gathering clouds and raging '
 'storms. At these moments, America has carried on not simply because of the '
 'skill or vision of those in high office, but because We the People have '
 'remained faithful to the ideals of our forbearers, and true to our founding '
 'documents. So it has been. So it must be with this generation of Americans. '
 'That we are in the midst of crisis is now well understood. Our nation is at '
 'war, against a far-reaching network of violence and hatred. Our economy is '
 'badly weakened, a consequen')

Cleaning the data

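def preprocess_text(text):
    # strip accents: decompose characters, then drop everything that is not ASCII
    new_text = unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode('utf-8')
    # convert to lowercase
    new_text = new_text.lower()
    # keep letters, commas, periods, apostrophes, newlines and hyphens;
    # every other run of characters becomes a single space
    new_text = re.sub('[^a-z,.\'\n-]+', ' ', new_text)
    # turn blank lines and doubled spaces into a single space,
    # so that neighbouring words are not glued together
    new_text = new_text.replace('\n\n', ' ').replace('  ', ' ')

    return new_text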

clean_corpus = preprocess_text(corpus)
pprint(clean_corpus[:1000])

('my fellow citizens i stand here today humbled by the task before us, '
 'grateful for the trust you have bestowed, mindful of the sacrifices borne by '
 'our ancestors. i thank president bush for his service to our nation, as well '
 'as the generosity and cooperation he has shown throughout this transition. '
 'forty-four americans have now taken the presidential oath. the words have '
 'been spoken during rising tides of prosperity and the still waters of peace. '
 'yet, every so often the oath is taken amidst gathering clouds and raging '
 'storms. at these moments, america has carried on not simply because of the '
 'skill or vision of those in high office, but because we the people have '
 'remained faithful to the ideals of our forbearers, and true to our founding '
 'documents. so it has been. so it must be with this generation of americans. '
 'that we are in the midst of crisis is now well understood. our nation is at '
 'war, against a far-reaching network of violence and hatred. our economy is '
 'badly weakened, a consequenc')
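The excerpt above contains no accents or digits, so the effect of the first and third steps is not visible in it. The sketch below replays the accent-stripping, lowercasing and filtering steps on a toy string (the string is an illustrative assumption) to show what each one removes.

```python
import re
import unicodedata

# Toy input chosen for illustration only (an assumption, not from the corpus).
raw = "Café: Forty-four Americans (2009)!"

# Step 1: NFKD splits "é" into "e" plus a combining accent,
# and the ASCII round trip drops the accent.
no_accents = unicodedata.normalize("NFKD", raw).encode("ASCII", "ignore").decode("utf-8")

# Step 2: lowercase everything.
lowered = no_accents.lower()

# Step 3: any run of characters outside a-z , . ' newline - becomes one space.
cleaned = re.sub(r"[^a-z,.'\n-]+", " ", lowered)

print(repr(no_accents))
print(repr(lowered))
print(repr(cleaned))
```

Note that a whole run such as ` (2009)!` collapses into a single space, which is why the cleaned corpus has no digits and no doubled spaces.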

Building the vocabulary

print('Corpus size:', len(clean_corpus))

chars = sorted(list(set(clean_corpus)))
print('Total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
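The two dictionaries are inverse mappings: one turns characters into integer indices, the other turns indices back into characters. A minimal sketch of the round trip on a toy string (the string is an illustrative assumption; the dictionaries are rebuilt here so the snippet runs on its own):

```python
# Toy text chosen for illustration only (an assumption, not the tutorial corpus).
toy_text = "we the people"

# Same construction as above, applied to the toy text.
chars = sorted(set(toy_text))
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

# Encode the text as a list of indices, then decode it back.
encoded = [char_indices[c] for c in toy_text]
decoded = "".join(indices_char[i] for i in encoded)

print(chars)
print(encoded)
print(decoded)
```

Because `chars` is sorted, the mapping is deterministic: the same corpus always yields the same indices, which matters if the model is saved and reloaded later.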

Creating the sequences

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(clean_corpus) - maxlen, step):
    sentences.append(clean_corpus[i: i + maxlen])
    next_chars.append(clean_corpus[i + maxlen])
print('Total sequences:', len(sentences))
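To see what the sliding window produces, here is the same loop on a toy string with smaller values (the string, `maxlen = 5` and `step = 3` are illustrative assumptions). Each window is paired with the character that immediately follows it, which is the target the model learns to predict.

```python
# Toy values chosen for illustration only (assumptions, not the tutorial settings).
toy_text = "so it must be"
maxlen = 5
step = 3

sentences = []
next_chars = []
# Same loop as above: a window of maxlen characters, moved forward by step each time.
for i in range(0, len(toy_text) - maxlen, step):
    sentences.append(toy_text[i: i + maxlen])
    next_chars.append(toy_text[i + maxlen])

for window, target in zip(sentences, next_chars):
    print(repr(window), "->", repr(target))
```

With `step` smaller than `maxlen`, consecutive windows overlap, hence "semi-redundant". A smaller step yields more training examples from the same corpus, at the cost of more memory.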

Vectorizing the sequences

print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
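The loop above one-hot encodes every character: each time step becomes a row with a single 1 in the column of that character. A minimal pure-Python sketch on a three-character vocabulary (the vocabulary and the example window are illustrative assumptions) shows the shape of one example.

```python
# Toy vocabulary and window chosen for illustration only (assumptions).
chars = ["a", "b", "c"]
char_indices = {c: i for i, c in enumerate(chars)}
sentence = "cab"   # plays the role of one window of maxlen = 3 characters
next_char = "a"    # the character that follows the window

# One row per time step, one column per vocabulary entry.
x_row = [[int(char_indices[ch] == j) for j in range(len(chars))] for ch in sentence]
# The target is a single one-hot vector.
y_row = [int(char_indices[next_char] == j) for j in range(len(chars))]

for row in x_row:
    print(row)
print("target:", y_row)
```

In the tutorial code, `x` stacks one such matrix per sequence, giving a shape of `(len(sentences), maxlen, len(chars))`, and `y` stacks the target vectors with shape `(len(sentences), len(chars))`.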

Building the model

# build the model: a single LSTM
print('Build model...')
model = keras.Sequential()
model.add(keras.layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(keras.layers.Dense(len(chars), activation='softmax'))

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 lstm (LSTM)                 (None, 128)               81920     

 dense (Dense)               (None, 31)                3999      

=================================================================
Total params: 85,919
Trainable params: 85,919
Non-trainable params: 0
_________________________________________________________________
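The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with an input kernel, a recurrent kernel and a bias. The Dense layer has one weight per (unit, character) pair plus one bias per character. The sketch below redoes that arithmetic, taking the vocabulary size of 31 from the Dense output shape shown above.

```python
vocab_size = 31  # read from the Dense output shape in the summary
units = 128      # number of LSTM units

# Four gates, each with (vocab_size + units) inputs per unit, plus one bias per unit.
lstm_params = 4 * (units * (vocab_size + units) + units)

# Fully connected layer: a weight matrix plus a bias vector.
dense_params = units * vocab_size + vocab_size

print(lstm_params)                 # 81920
print(dense_params)                # 3999
print(lstm_params + dense_params)  # 85919
```

Note that `maxlen` does not appear in these formulas: the LSTM reuses the same weights at every time step, so a longer window does not add parameters.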

Training configuration

optimizer = keras.optimizers.Adam(learning_rate=0.01)
model.compile(loss='categorical_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])

Training

BATCH_SIZE = 128
EPOCHS = 20
history = model.fit(x, y, 
                    batch_size=BATCH_SIZE, 
                    epochs=EPOCHS)

Inference

def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

def predict_text(start_index):
    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('----- temperature:', temperature)

        generated = ''
        sentence = clean_corpus[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

start_index = np.random.randint(0, len(clean_corpus) - maxlen - 1)
predict_text(start_index)

----- temperature: 0.2
----- Generating with seed: "of my colleagues or staffers would excha"
of my colleagues or staffers would exchatting the support the country and the challenges that the change the support the same persons and senator the support that we have to do the same american people and the country that the president that the same contracts that the same country the same country that the country that the same country that the support the same country and the same country that the country where the same country that t
----- temperature: 0.5
----- Generating with seed: "of my colleagues or staffers would excha"
of my colleagues or staffers would exchatting and a capable judge the change the support the same allow that we should lear individual states and supplion senator who have a faith has been in the political program of the support of the caused on the support. i get that you have the crisis who want to work the senate to order that we should ever bestit of the succeed to come to get to come to the planet the sost that our political contra
----- temperature: 1.0
----- Generating with seed: "of my colleagues or staffers would excha"
of my colleagues or staffers would exchans clease here when the other childrens, most our engogions and members movement would sezemes of president senator and aftersers to accomps is. kid our womanbers and flexible to come topight alant, choices and help uspeed is not the few threat in end met amendment to take herse health care-know that consequence made to just stood obligation in greatesm clan emphersable very a drardes who are not 
----- temperature: 1.2
----- Generating with seed: "of my colleagues or staffers would excha"
of my colleagues or staffers would exchatenthisis, it was, race. intide-seried from natia. you truahs, the ral begin -- from incregitious know but collinal committee, neam to be princility. undemo, it yearogphilom, and our fadry lenglies. toepre fack inligistimated to no shore are wreates strugnion introrved docure down there's think mudeman very bara. dick islat learned our future, now there about a tcroying recouse spack there agown
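The outputs above become less repetitive but more error-prone as the temperature rises. The sketch below applies the same rescaling as `sample` (logarithm, division by the temperature, renormalization) to a small probability vector, without drawing a sample. The helper name `apply_temperature` and the three probabilities are illustrative assumptions, not part of the original code.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability vector the way sample() does, without sampling."""
    scaled = [math.log(p) / temperature for p in probs]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical model output over three characters (an assumption for illustration).
probs = [0.6, 0.3, 0.1]

for temperature in [0.2, 0.5, 1.0, 1.2]:
    rescaled = apply_temperature(probs, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

A temperature below 1 sharpens the distribution, so the most likely character wins almost every time, which explains the loops such as "the same country that the same country" at 0.2. A temperature of 1 leaves the distribution unchanged. Above 1, the distribution flattens, unlikely characters are drawn more often, and invented words appear.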
