How to get the next word from huggingface's gpt-2 model instead of a token?

I am fine-tuning a pre-trained GPT-2 model for my native language. The model uses a byte-level BPE tokenizer. My goal is to predict the next word after a given sequence, but the model predicts the next tokens, which are not complete words. This is what I am doing to predict:

import numpy as np

input_ids = tokenizer.encode(text, return_tensors='tf')
outputs = model.predict(input_ids).logits  # shape: (batch, seq_len, vocab_size)

print("Next most probable tokens:\n" + 100 * '-')
for i in range(outputs.shape[1]):
    # argmax over the vocabulary gives the most probable token at position i
    pred_id = np.argmax(outputs[:, i, :]).item()
    print(tokenizer.decode(pred_id))
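Note that only the logits at the last position score candidates for the token that follows the whole input; the loop above prints a prediction for every position. A minimal numpy sketch of pulling the top-k candidate token ids from the last position, using a small hypothetical logits array in place of the model output:

```python
import numpy as np

def top_k_next_token_ids(logits, k=5):
    """Return ids of the k highest-scoring tokens at the last position.

    logits: array of shape (batch, seq_len, vocab_size), as the model
    above returns.
    """
    last = logits[0, -1, :]            # scores for the token after the input
    return np.argsort(last)[::-1][:k]  # ids sorted by descending score

# Toy logits: batch=1, seq_len=2, vocab_size=6 (stand-in for model output)
logits = np.array([[[0.1, 0.2, 0.3, 0.0, 0.1, 0.3],
                    [0.9, 0.1, 0.8, 0.2, 0.7, 0.0]]])
print(top_k_next_token_ids(logits, k=3))  # ids 0, 2, 4 have the top scores
```

Each returned id would then be passed through tokenizer.decode() to inspect the candidate piece.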

If I input "I am", the above code gives me outputs like " atte", "pli", etc., which are sub-word pieces rather than complete words. This is because the model uses a byte-level BPE tokenizer. I have also tried beam search with model.generate(), but that returns whole sentences, and the word immediately after my input is identical most of the time. This is the code I am using:

input_ids = tokenizer.encode(text, return_tensors='tf')

beam_outputs = model.generate(
    input_ids,
    max_length=100,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
    early_stopping=True,
    do_sample=True,
    top_k=50,
    top_p=0.95,
)

print("Beam Output:\n" + 100 * '-')

for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))
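To turn sequences like these into several distinct candidate next words, one option is to strip the prompt from each decoded sequence and keep only its first whitespace-delimited word, deduplicating across the returned beams. A small sketch on stand-in decoded strings (the helper name is mine, not a transformers API):

```python
def first_words(prompt, decoded_sequences):
    """Collect the distinct first words that follow the prompt."""
    words = []
    for text in decoded_sequences:
        continuation = text[len(prompt):].strip()  # drop the prompt itself
        if not continuation:
            continue
        word = continuation.split()[0]
        if word not in words:
            words.append(word)
    return words

# Stand-in strings mimicking decoded beam outputs for the prompt "I am"
decoded = ["I am an engineering student",
           "I am an architect",
           "I am a doctor"]
print(first_words("I am", decoded))  # ['an', 'a']
```

With num_return_sequences=5, this would yield at most 5 distinct words, fewer when the beams agree on the first word, which is exactly the problem described below.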

This gives something like "I am an engineering student", "I am an architect", etc. for the input "I am". Note that almost every sentence gives " an" as the next word, whereas I want the 5 most probable next words for my input. These examples are of course not from my actual model, since it is not for English, but I hope I have made it clear what I want to do.
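Since a byte-level BPE splits words into sub-word pieces, a complete next word can be assembled by repeatedly taking the predicted next token and stopping once a token opens a new word (GPT-2's byte-level BPE marks word starts with a leading space, shown as 'Ġ' in the raw vocabulary). A sketch of that stopping rule, with a list of pre-decoded pieces standing in for the real model-in-a-loop call:

```python
def assemble_next_word(pieces):
    """Join decoded sub-word pieces into one word, stopping at the
    first piece that starts a new word (leading space)."""
    word = ""
    for piece in pieces:
        if word and piece.startswith(" "):  # boundary: a new word begins
            break
        word += piece
    return word.strip()

# Stand-in decoded pieces, as a byte-level BPE model might emit them
print(assemble_next_word([" atte", "nti", "ve", " student"]))  # 'attentive'
```

In practice the pieces would come from feeding the growing sequence back into the model one token at a time; only the boundary logic is shown here.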

Tags: python, tensorflow, nlp, huggingface-transformers, huggingface-tokenizers
