The Salesforce CodeGen models are a family of large language models trained on natural language data and then fine-tuned on specialized datasets of code. Models with 350M, 2B, 6B, and 16B parameters are provided in three flavors:
- nl, the base model, trained on The Pile, a large natural language dataset compiled by EleutherAI,
- multi, which is fine-tuned from the nl model on a dataset of code in multiple languages, scraped from GitHub, and
- mono, which is fine-tuned from the multi model on Python code only.
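
The size and flavor combinations above can be enumerated with a short sketch like the one below. The `Salesforce/codegen-{size}-{flavor}` naming pattern is an assumption about how the checkpoints are published on the Hugging Face Hub and is not stated in this document; `checkpoint_name`, `lineage`, and `load` are hypothetical helper names used only for illustration. `load` additionally assumes the `transformers` package is installed and downloads weights when called.

```python
# Sizes and flavors as listed in the text above.
SIZES = ("350M", "2B", "6B", "16B")
FLAVORS = ("nl", "multi", "mono")

# Fine-tuning lineage from the text: nl is the base, multi is
# fine-tuned from nl, and mono is fine-tuned from multi.
PARENT = {"nl": None, "multi": "nl", "mono": "multi"}


def checkpoint_name(size: str, flavor: str) -> str:
    """Build a checkpoint identifier (assumed Hub naming pattern)."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}, expected one of {SIZES}")
    if flavor not in FLAVORS:
        raise ValueError(f"unknown flavor {flavor!r}, expected one of {FLAVORS}")
    return f"Salesforce/codegen-{size}-{flavor}"


def lineage(flavor: str) -> list:
    """Return the chain of flavors from the base model up to `flavor`."""
    if flavor not in FLAVORS:
        raise ValueError(f"unknown flavor {flavor!r}, expected one of {FLAVORS}")
    chain = []
    current = flavor
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    return chain[::-1]


def load(size: str, flavor: str):
    """Load tokenizer and model; needs `transformers` and network access."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = checkpoint_name(size, flavor)
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    return tokenizer, model


if __name__ == "__main__":
    # List all twelve size/flavor combinations.
    for size in SIZES:
        for flavor in FLAVORS:
            print(checkpoint_name(size, flavor), "<-", " -> ".join(lineage(flavor)))
```

For Python-only completion, the mono flavor is the natural choice, and the smallest size (for example `load("350M", "mono")`) is the cheapest way to try the family before moving to a larger model.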