How to get the next word from Hugging Face's GPT-2 model instead of a token?
I am fine-tuning a pre-trained GPT-2 model for my native language. The model uses a byte-level BPE tokenizer. My goal is to predict the next word from a given sequence, but the model predicts next tokens, which are not complete words. This is what I am doing for prediction:
input_ids = tokenizer.encode(text, return_tensors='tf')
outputs = model.predict(input_ids).logits
print("Next most probable tokens:\n" + 100 * '-')
for i in range(outputs.shape[1]):
    pred_id = np.argmax(outputs[:, i, :]).item()
    print(tokenizer.decode(pred_id))
If I input "I am", the above code gives me outputs like " atte", "pli", etc., where I want a complete word. This is because the model uses a byte-level BPE tokenizer. I have also tried beam search with model.generate(), but that returns whole sentences, and the immediate next word after my input is the same most of the time. This is the code I am using:
input_ids = tokenizer.encode(text, return_tensors='tf')
beam_outputs = model.generate(
    input_ids,
    max_length=100,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
    early_stopping=True,
    do_sample=True,
    top_k=50,
    top_p=0.95,
)
print("Beam Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))
This gives something like "I am an engineering student", "I am an architect", etc. for the input "I am". Note that almost all the sentences give " an" as the next word, whereas I want the 5 most probable next words for my input. These examples are of course not from my actual model, since it is not for English, but I hope I have made it clear what I want to do.
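One idea I have considered (a minimal sketch, not tested against my model): since GPT-2's byte-level BPE marks the start of a new word with a leading "Ġ" in the token string, consecutively predicted tokens could be merged into whole words by joining every token that does not start with "Ġ" onto the previous one. The helper name merge_tokens_to_words below is my own invention, and the token list is a toy stand-in for tokenizer.convert_ids_to_tokens(...) output:

```python
def merge_tokens_to_words(tokens):
    """Join byte-level BPE token strings into whole words.

    Assumes GPT-2's convention: a token beginning with 'Ġ'
    encodes a leading space, i.e. the start of a new word.
    """
    words = []
    for tok in tokens:
        if tok.startswith("Ġ") or not words:
            # start a new word, dropping the space marker
            words.append(tok.lstrip("Ġ"))
        else:
            # continuation token: glue onto the previous word
            words[-1] += tok
    return words

# Toy tokens standing in for tokenizer.convert_ids_to_tokens(ids)
print(merge_tokens_to_words(["I", "Ġam", "Ġatte", "nding"]))
# → ['I', 'am', 'attending']
```

With something like this, I could keep generating tokens greedily and stop as soon as the next predicted token starts with "Ġ", which would mean the current word is complete.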
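I have also wondered whether, instead of beam search (whose returned sequences tend to share a prefix), I should just take the k highest logits at the last position myself and decode each candidate token separately. A minimal sketch of that selection step, with toy logits standing in for outputs[0, -1, :] (the function name top_k_next_tokens is hypothetical):

```python
import numpy as np

def top_k_next_tokens(logits, k=5):
    """Return indices of the k highest-scoring next tokens,
    given a 1-D array of logits for the last position."""
    # argsort is ascending, so reverse and take the first k
    return np.argsort(logits)[::-1][:k].tolist()

# Toy logits over a 5-token vocabulary
logits = np.array([0.1, 2.0, -1.0, 3.5, 0.7])
print(top_k_next_tokens(logits, k=3))
# → [3, 1, 4]
```

Each returned id could then be decoded with tokenizer.decode and, if it is only a word fragment, extended token by token until the word boundary. Would this be the right approach, or is there a built-in way?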
Tags: python, tensorflow, nlp, huggingface-transformers, huggingface-tokenizers