---
title: "Transformer"
author: "Athos Petri Damiani"
date: "05/01/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
This document is an R translation of the Transformer tutorial published by Google, originally written in Python.

Original tutorial: [https://www.tensorflow.org/tutorials/text/transformer](https://www.tensorflow.org/tutorials/text/transformer)

Colab notebook: [https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/transformer.ipynb#scrollTo=15VYkkSfKE3t](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/transformer.ipynb#scrollTo=15VYkkSfKE3t)
```{r, message=FALSE, warning=FALSE}
library(keras)
library(tidyverse)
library(tensorflow)
library(tfdatasets)
# If not installed yet, installs tfds package and tfds module in Python.
# remotes::install_github("rstudio/tfds")
# tfds::install_tfds()
library(tfds)
```
## Setup input pipeline
```{r}
examples <- tfds::tfds_load(name = 'ted_hrlr_translate/pt_to_en')
train_examples <- examples$train
val_examples <- examples$validation
```
Create a custom subwords tokenizer from the training dataset.
```{r}
# tokenizer_en <- train_examples %>%
#   tensorflow::iterate(function(x) x$en$numpy()) %>%
#   tfds$features$text$SubwordTextEncoder$build_from_corpus(2^13)
# tokenizer_en$save_to_file("tokenizer_en")
tokenizer_en <- tfds$features$text$SubwordTextEncoder$load_from_file("tokenizer_en")
```
```{r}
# tokenizer_pt <- train_examples %>%
#   tensorflow::iterate(function(x) x$pt$numpy()) %>%
#   tfds$features$text$SubwordTextEncoder$build_from_corpus(2^13)
# tokenizer_pt$save_to_file("tokenizer_pt")
tokenizer_pt <- tfds$features$text$SubwordTextEncoder$load_from_file("tokenizer_pt")
```
```{r}
sample_string = 'Transformer is awesome.'
tokenized_string <- tokenizer_en$encode(sample_string)
paste('Tokenized string is ', paste(tokenized_string, collapse = " "))
original_string <- tokenizer_en$decode(tokenized_string)
paste('The original string: ', original_string)
```
```{r}
# decode() expects a list of ids; appending 0L (the padding id, which the
# decoder ignores) forces reticulate to pass a list rather than a scalar.
walk(tokenized_string, ~ print(sprintf("%s ------> %s", .x, tokenizer_en$decode(c(.x, 0L)))))
```
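The chunk above shows that frequent words map to a single id while rarer words split into several subword pieces. A rough intuition for how a trained subword vocabulary segments text is greedy longest-prefix matching; the sketch below illustrates that idea in base R with a hand-made toy vocabulary (`toy_vocab` and `segment()` are hypothetical helpers for illustration only, and the real `SubwordTextEncoder` is considerably more sophisticated).
```{r}
# Illustrative only: a toy greedy longest-match segmenter. Assumes the
# toy vocabulary covers the whole input; "_" marks word boundaries.
toy_vocab <- c("trans", "form", "er_", "is_", "awe", "some", ".")

segment <- function(text, vocab) {
  text <- gsub(" ", "_", tolower(text))
  pieces <- character(0)
  while (nchar(text) > 0) {
    # Take the longest vocabulary entry that prefixes the remaining text.
    hits <- vocab[startsWith(text, vocab)]
    stopifnot(length(hits) > 0)  # vocabulary must cover the input
    piece <- hits[which.max(nchar(hits))]
    pieces <- c(pieces, piece)
    text <- substr(text, nchar(piece) + 1, nchar(text))
  }
  pieces
}

segment("Transformer is awesome.", toy_vocab)
# "trans" "form" "er_" "is_" "awe" "some" "."
```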
```{r}
BUFFER_SIZE = 20000L
BATCH_SIZE = 64L
```
```{r}
encode <- function(lang1, lang2) {
  # Bracket each sentence with its language's start (vocab_size) and
  # end (vocab_size + 1) tokens; c() plays the role of Python's list "+".
  lang1 <- c(tokenizer_pt$vocab_size, tokenizer_pt$encode(lang1$numpy()), tokenizer_pt$vocab_size + 1L)
  lang2 <- c(tokenizer_en$vocab_size, tokenizer_en$encode(lang2$numpy()), tokenizer_en$vocab_size + 1L)
  list(lang1, lang2)
}
```
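`encode()` brackets each tokenized sentence between a start token (`vocab_size`) and an end token (`vocab_size + 1`), ids that lie just past the learned vocabulary, so the embedding layers are sized `vocab_size + 2`. A minimal pure-R illustration of that bracketing, with made-up ids (`add_special_tokens` is a hypothetical helper, not a tfds function):
```{r}
# Illustrative only: start/end bracketing on a toy tokenized sentence.
vocab_size <- 8192L                   # assumed; matches build_from_corpus(2^13)
token_ids  <- c(7915L, 1248L, 7946L)  # made-up subword ids

add_special_tokens <- function(ids, vocab_size) {
  c(vocab_size, ids, vocab_size + 1L)
}

add_special_tokens(token_ids, vocab_size)
# 8192 7915 1248 7946 8193
```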
```{r}
# To keep the example fast, drop sentences longer than 40 tokens,
# as in the original tutorial.
MAX_LENGTH = 40L
```
```{r}
filter_max_length <- function(pt, en, max_length = MAX_LENGTH) {
  tf$size(pt) <= max_length & tf$size(en) <= max_length
}
```
```{r}
tf_encode <- function(en_pt) {
  # Portuguese first: encode() tokenizes its first argument with tokenizer_pt.
  tf$py_function(encode, list(en_pt$pt, en_pt$en), list(tf$int64, tf$int64))
}
```
```{r}
train_dataset = train_examples %>%
  dataset_map(tf_encode) %>%
  dataset_filter(filter_max_length) %>%
  dataset_cache() %>%
  dataset_shuffle(BUFFER_SIZE) %>%
  dataset_padded_batch(BATCH_SIZE, padded_shapes = tuple(list(-1L), list(-1L))) %>%
  dataset_prefetch(1)

val_dataset = val_examples %>%
  dataset_map(tf_encode) %>%
  dataset_filter(filter_max_length) %>%
  dataset_padded_batch(BATCH_SIZE, padded_shapes = tuple(list(-1L), list(-1L)))
```
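`dataset_padded_batch()` pads every sequence in a batch with zeros (the id reserved for padding) up to the length of the longest sequence in that batch, producing a rectangular tensor. A base-R sketch of the same idea (`pad_batch` is a hypothetical helper for illustration, not part of tfdatasets):
```{r}
# Illustrative only: pad a list of integer sequences to a rectangular
# matrix, filling short sequences with the padding id 0.
pad_batch <- function(seqs, pad_value = 0L) {
  max_len <- max(lengths(seqs))
  t(vapply(seqs, function(s) c(s, rep(pad_value, max_len - length(s))),
           integer(max_len)))
}

pad_batch(list(c(3L, 1L, 4L), c(1L, 5L), c(9L)))
#      [,1] [,2] [,3]
# [1,]    3    1    4
# [2,]    1    5    0
# [3,]    9    0    0
```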
```{r}
val_dataset_iter <- reticulate::as_iterator(val_dataset)
```
```{r}
reticulate::iter_next(val_dataset_iter)
```