@pythonlessons
Created August 16, 2023 12:45
transformer_attention
import numpy as np
import tensorflow as tf

# PositionalEmbedding and CausalSelfAttention are the layers defined earlier in the tutorial
decoder_vocab_size = 1100
d_model = 512

decoder_embedding_layer = PositionalEmbedding(decoder_vocab_size, d_model)
random_decoder_input = np.random.randint(0, decoder_vocab_size, size=(1, 110))
decoder_embeddings = decoder_embedding_layer(random_decoder_input)
print("decoder_embeddings shape", decoder_embeddings.shape)

causal_self_attention_layer = CausalSelfAttention(num_heads=2, key_dim=512)
causal_self_attention_output = causal_self_attention_layer(decoder_embeddings)
print("causal_self_attention_output shape", causal_self_attention_output.shape)

out1 = causal_self_attention_layer(decoder_embedding_layer(random_decoder_input[:, :50]))  # only the first 50 tokens, truncated before the embedding layer
out2 = causal_self_attention_layer(decoder_embedding_layer(random_decoder_input)[:, :50])  # only the first 50 tokens, truncated after the embedding layer
diff = tf.reduce_max(tf.abs(out1 - out2)).numpy()
print("Difference between the two outputs:", diff)