peterjliu/README.md

Last active September 12, 2025 16:19

README.md

Shortlink: goo.gl/wSuuS9

Supplementary Materials for Generating Wikipedia by Summarizing Long Sequences

The GitHub repository can be found at https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wikisum

cyberandy commented Mar 8, 2018

Great work!

Diego999 commented Mar 23, 2018

Hi peterjliu, any news on the progress?

sai-prasanna commented Mar 31, 2018

Eagerly awaiting the dataset; I want to apply other techniques and compare against the results.

morleytj commented Apr 12, 2018

Hi peterjliu, I was wondering if there were any updates on the dataset availability? I'm hoping to do some reproducibility tests. Thanks!

tfmorris commented Apr 21, 2018

Is this the data set that @lukaszkaiser is referencing in the Google Developers video here? https://youtu.be/O2UvKxaOH7c?t=637
If he's going to boast about it, it would seem to make sense to actually have it be available. It sounds like a great application of @commoncrawl data.

Author

peterjliu commented Apr 26, 2018

Hi folks, check out the updated link now. Thanks for the patience.

Author

peterjliu commented Apr 26, 2018

@rafaelbou @sai-prasanna @vedant @SHohentanner @leisurehippo @coventry @cyberandy @Diego999 @Legalamb77 @tfmorris

Mentioning folks who specifically expressed interest here.

Diego999 commented May 13, 2018

@peterjliu

Thank you for sharing. I was wondering whether it would be possible to store the preprocessed datasets on a local computer (after preprocessing in the cloud), or whether they are too large. Do you have an estimate of the necessary space: 10 GB, 100 GB, 1 TB?

Thank you for your help!

nlothian commented Jun 12, 2018

This looks really useful. I noticed that the pre-processed vocabs seem to be available in the gs://tensor2tensor-data/ bucket too (vocab.wikisum_commoncrawl.32768 and vocab.wikisum_web.32768).

The TODO says you will release the hparams_set, which would be great, but can I request a pre-trained model release too?

hoang-ho commented Oct 6, 2018

Dear all,

Has a pre-trained model been released for this wikisum problem? If so, may I have the link to it?

Thank you so much.

coventry commented Feb 26, 2019

Thanks for linking that, @peterjliu. Am I reading the README.md correctly that training uses a full transformer architecture, rather than a decoder-only architecture with memory-compressed attention?

Training

TODO(rsepassi): Put actual results achieved on wikisum_web and/or
wikisum_commoncrawl and with what hparams_set.

PROBLEM=wikisum_web  # or wikisum_commoncrawl
t2t-trainer \
  --problem=$PROBLEM \
  --model=transformer \
  --hparams_set=transformer_base \
  --train_steps=250000 \
  --eval_steps=100 \
  --data_dir=$DATA_DIR \
  --output_dir=$TRAIN_DIR

rfdearborn commented Jun 7, 2019

Does anyone have processed training examples (i.e., the output of step 3 here) available to share? I'm having trouble getting GCP to release IP addresses for data generation, so I'm hoping to be able to bypass this for the time being...

Also, as @nlothian and @hoang-ho have asked, are pre-trained model weights available anywhere?
