Workflow: git pre-commit hook to check style
Cleanup: GPT and GPT2 attention are very similar besides caching
- fine to just add caching to GPT
Infra: test coverage for t5 causal mask
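A minimal sketch of the property such a test could assert, using a toy lower-triangular mask built in plain Python (the helper name `causal_mask` is hypothetical, not the T5 implementation):

```python
def causal_mask(seq_len):
    """Boolean causal mask: position i may attend to positions j <= i only."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

# Every position sees itself and everything before it, nothing after.
mask = causal_mask(4)
```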
Infra: add test to ModelTesterMixin that loss doesn't change if pad tokens are introduced
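The invariance the mixin test would check: if the loss is averaged over non-pad positions only, appending pad tokens cannot move it. A toy masked-loss sketch (the helper `masked_mean_loss` is hypothetical, standing in for the model's loss):

```python
def masked_mean_loss(per_token_loss, attention_mask):
    """Average loss over non-pad positions; pads (mask == 0) contribute nothing."""
    total = sum(l * m for l, m in zip(per_token_loss, attention_mask))
    return total / sum(attention_mask)

# Appending pad positions must leave the loss unchanged.
base = masked_mean_loss([0.5, 1.5], [1, 1])
padded = masked_mean_loss([0.5, 1.5, 9.9, 9.9], [1, 1, 0, 0])
```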
save_pretrained: should mkdir if save_path doesn't exist
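A minimal sketch of the fix, assuming the standard pattern of writing a weights file into `save_path` (the function body and filename here are illustrative, not the real `save_pretrained`):

```python
import os
import tempfile

def save_pretrained(model_state, save_path):
    """Hypothetical sketch: create the directory before writing instead of crashing."""
    os.makedirs(save_path, exist_ok=True)  # no-op when save_path already exists
    target = os.path.join(save_path, "pytorch_model.bin")
    with open(target, "wb") as f:
        f.write(model_state)
    return target

# Works even when the nested directory doesn't exist yet.
ckpt_dir = os.path.join(tempfile.mkdtemp(), "checkpoints", "step-500")
written = save_pretrained(b"fake-weights", ckpt_dir)
```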
What does
stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler)
do? It is all over the place. Can it be deleted? Thom said yes on Slack.
TranslationPipeline: an abstraction that works for Marian,
e.g. you call
pipeline(task='translation', src_lang='en', tgt_lang='fr')(['I went to the bakery'])
Check in the fastai sortish sampler; add a --sortish-sampler CLI arg for the examples
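A minimal sketch of the "sortish" idea (shuffle, then sort by length within large chunks, so each batch sees similar lengths without a fully deterministic order); the function name, chunk multiplier, and signature are assumptions, not the fastai code:

```python
import random

def sortish_indices(lengths, batch_size, seed=0):
    """Shuffle indices, then sort by length (desc) inside chunks of batch_size * 50."""
    rng = random.Random(seed)
    idxs = list(range(len(lengths)))
    rng.shuffle(idxs)  # noise: different epochs see different chunk contents
    chunk = batch_size * 50
    out = []
    for start in range(0, len(idxs), chunk):
        piece = idxs[start:start + chunk]
        piece.sort(key=lambda i: lengths[i], reverse=True)  # similar lengths batch together
        out.extend(piece)
    return out

order = sortish_indices([5, 3, 9, 1, 7, 2, 8, 4], batch_size=2)
```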
Cleanup: there are many nearly identical Attention implementations. Can they be consolidated?
Easy win: go back through a few shleifer cleanup PRs and do the same thing in TF.
Easy win: go through the examples tests and switch from distilbert to tiny models, e.g. sshleifer/tiny-bart-random
Harder: use profiling tools to figure out why training summarization is 10x more memory-intensive than inference.
Test coverage for CTRL and XLM generation: does use_cache=True change results?
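The test would generate with and without caching and assert the tokens match. A toy decoder illustrating the property (the "model" here is a dummy running-sum function, purely illustrative, not CTRL or XLM):

```python
def next_token(prefix):
    """Toy 'model': the next token is the prefix sum mod 7 (stands in for a forward pass)."""
    return sum(prefix) % 7

def generate(prompt, steps, use_cache=False):
    tokens = list(prompt)
    cache = sum(prompt)  # cached running state instead of re-reading the prefix
    for _ in range(steps):
        if use_cache:
            nxt = cache % 7  # reuse cached state
            cache += nxt
        else:
            nxt = next_token(tokens)  # recompute from the full prefix each step
        tokens.append(nxt)
    return tokens
```

The real test would replace the dummy with `model.generate(..., use_cache=...)` and compare outputs the same way.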
Medium cleanup: refactor determine_archive_file; very similar logic is duplicated all over