# Training CodeParrot 🦜 from Scratch

{ "title": "Training CodeParrot: An Inside Look at Building a Code Generation Model from Scratch", "content": "In this blog post we'll take a look at what it takes to build the technology behind GitHub CoPilot, an application that provides suggestions to programmers as they code. In this step by step guide, we'll learn how to train a large GPT-2 model called CodeParrot 🦜, entirely from scratch. CodeParrot can auto-complete your Python code - give it a spin here. Let's get to building it from scratch!\n\n## Creating a Large Dataset of Source Code\n\nThe first thing we need is a large training dataset. With the goal to train a Python code generation model, we accessed the GitHub dump available on Google's BigQuery and filtered for all Python files. The result is a 180 GB dataset with 20 million files (available here). After initial training experiments, we found that the duplicates in the dataset severely impacted the model performance. Further investigating the dataset we found that:\n\n* 0.1% of the unique files make up 15% of all files\n* 1% of the unique files make up 35% of all files\n* 10% of the unique files make up 66% of all files\n\nYou can learn more about our findings in this Twitter thread. We removed the duplicates and applied the same cleaning heuristics found in the Codex paper. Codex is the model behind CoPilot and is a GPT-3 model fine-tuned on GitHub code.\n\nThe cleaned dataset is still 50GB big and available on the Hugging Face Hub: codeparrot-clean. With that we can setup a new tokenizer and train a model.\n\n## Initializing the Tokenizer and Model\n\nFirst we need a tokenizer. Let's train one specifically on code so it splits code tokens well. We can take an existing tokenizer (e.g. GPT-2) and directly train it on our own dataset with the train_new_from_iterator() method. We then push it to the Hub.\n\n

The cleaned dataset is still 50 GB in size and available on the Hugging Face Hub: codeparrot-clean. With that we can set up a new tokenizer and train a model.

## Initializing the Tokenizer and Model

First we need a tokenizer. Let's train one specifically on code so it splits code tokens well. We can take an existing tokenizer (e.g. GPT-2) and directly train it on our own dataset with the `train_new_from_iterator()` method. We then push it to the Hub.

```python
# Iterator for training the tokenizer
def batch_iterator(batch_size=10):
    for _ in tqdm(range(0, args.n_examples, batch_size)):
        yield [next(iter_dataset)["content"] for _ in range(batch_size)]

# Base tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
base_vocab = list(bytes_to_unicode().values())

# Load dataset
dataset = load_dataset("lvwerra/codeparrot-clean", split="train", streaming=True)
iter_dataset = iter(dataset)

# Training and saving
new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(),
                                                  vocab_size=args.vocab_size,
                                                  initial_alphabet=base_vocab)
new_tokenizer.save_pretrained(args.tokenizer_name, push_to_hub=args.push_to_hub)
```

Learn more about tokenizers and how to build them in the Hugging Face course.

See that inconspicuous `streaming=True` argument? This small change has a big impact: instead of downloading the full (50 GB) dataset, this will stream individual samples as needed, saving a lot of disk space! Check out the Hugging Face course for more information on streaming.
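A quick way to sanity-check the new tokenizer is to compare how it and the base GPT-2 tokenizer split the same snippet; the repo name below is an assumption based on the dataset namespace above, standing in for wherever the tokenizer was pushed:

```python
from transformers import AutoTokenizer

snippet = "def add(a, b):\n    return a + b"

# Base GPT-2 tokenizer vs. the code-specific one trained above.
# "lvwerra/codeparrot" is an assumed Hub name for the pushed tokenizer.
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
code_tokenizer = AutoTokenizer.from_pretrained("lvwerra/codeparrot")

print(gpt2_tokenizer.tokenize(snippet))
print(code_tokenizer.tokenize(snippet))
```

A tokenizer trained on Python typically learns tokens for recurring patterns such as four-space indentation, so the same snippet tends to encode into noticeably fewer tokens.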

Now, we initialize a new model. We'll use the same hyperparameters as GPT-2 large (1.5B parameters) and adjust the embedding layer to fit our new tokenizer, also adding some stability tweaks. The `scale_attn_by_layer_idx` flag makes sure we scale the attention by the layer id, and `reorder_and_upcast_attn` mainly makes sure that we compute the attention in full precision to avoid numerical issues. We push the freshly initialized model to the same repo as the tokenizer.

```python
# Load codeparrot tokenizer trained for Python code tokenization
tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name)

# Configuration
config_kwargs = {"vocab_size": len(tokenizer),
                 "scale_attn_by_layer_idx": True,
                 "reorder_and_upcast_attn": True}

# Load model with config and push to hub
config = AutoConfig.from_pretrained('gpt2-large', **config_kwargs)
model = AutoModelForCausalLM.from_config(config)
model.save_pretrained(args.model_name, push_to_hub=args.push_to_hub)
```

Now that we have an efficient tokenizer and a freshly initialized model, we can start with the actual training loop.

## Implementing the Training Loop

We train with the [🤗 Accelerate](https://github.com/huggingface/accelerate) library, which allows us to scale the training from our laptop to a multi-GPU machine without changing a single line of code. We just create an accelerator and do some argument housekeeping:

```python
accelerator = Accelerator()
acc_state = {str(k): str(v) for k, v in accelerator.state.__dict__.items()}

parser = HfArgumentParser(TrainingArguments)
args = parser.parse_args()
args = Namespace(**vars(args), **acc_state)
samples_per_step = accelerator.state.num_processes * args.train_batch_size
set_seed(args.seed)
```

We are now ready to train! Let's use the `huggingface_hub` client library to clone the repository with the new tokenizer and model. We will check out a new branch for this experiment. With that setup, we can run many experiments in parallel, and in the end we just merge the best one into the main branch.

```python
# Clone model repository
if accelerator.is_main_process:
    hf_repo = Repository(args.save_dir, clone_from=args.model_ckpt)

# Checkout new branch on repo
if accelerator.is_main_process:
    hf_repo.git_checkout(run_name, create_branch_ok=True)
```

We can directly load the tokenizer and model from the local repository. Since we are dealing with big models, we might want to turn on [gradient checkpointing](https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9) to decrease the GPU memory footprint during training.

```python
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(args.save_dir)
if args.gradient_checkpointing:
    model.gradient_checkpointing_enable()
tokenizer = AutoTokenizer.from_pretrained(args.save_dir)
```

Next up is the dataset. We make training simpler with a dataset that yields examples with a fixed context size. To not waste too much data (some samples are too short or too long), we can concatenate many examples with an EOS token and then chunk them, as sketched below.

The more sequences we prepare together, the smaller the fraction of tokens we discard. Since we want to...
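A minimal sketch of this concatenate-and-chunk idea, written as an iterable dataset (the class name, parameters, and defaults are illustrative assumptions, not the post's actual implementation):

```python
import torch
from torch.utils.data import IterableDataset

# Illustrative sketch of a fixed-context dataset; names and defaults
# are assumptions, not the post's actual implementation.
class FixedContextDataset(IterableDataset):
    def __init__(self, tokenizer, dataset, seq_length=1024, num_texts=10):
        self.tokenizer = tokenizer
        self.dataset = dataset          # e.g. the streamed codeparrot-clean split
        self.seq_length = seq_length    # fixed context size fed to the model
        self.num_texts = num_texts      # examples concatenated per buffer

    def __iter__(self):
        iterator = iter(self.dataset)
        while True:
            try:
                texts = [next(iterator)["content"] for _ in range(self.num_texts)]
            except StopIteration:
                return
            # Join examples with the EOS token and tokenize the whole buffer.
            buffer = self.tokenizer.eos_token.join(texts)
            token_ids = self.tokenizer(buffer)["input_ids"]
            # Slice into full windows; only the trailing remainder is discarded.
            for i in range(0, len(token_ids) - self.seq_length + 1, self.seq_length):
                yield torch.tensor(token_ids[i : i + self.seq_length])
```

Because only the remainder of each buffer is thrown away, raising `num_texts` directly shrinks the discarded fraction, which is exactly the trade-off described above.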

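From here, the core update step with 🤗 Accelerate follows the library's standard pattern: wrap the model, optimizer, and dataloader with `prepare()` and replace the usual `loss.backward()` with `accelerator.backward()`. A minimal sketch reusing the hypothetical dataset class above (the optimizer choice, argument names, and dataloader setup are assumptions, not the post's exact configuration):

```python
from torch.optim import AdamW
from torch.utils.data import DataLoader

# Assumed setup: FixedContextDataset is the sketch above; `dataset`,
# `model`, `tokenizer`, `args`, and `accelerator` come from earlier snippets.
train_dataset = FixedContextDataset(tokenizer, dataset, seq_length=args.seq_length)
train_dataloader = DataLoader(train_dataset, batch_size=args.train_batch_size)
optimizer = AdamW(model.parameters(), lr=args.learning_rate)

# Accelerate moves everything to the right devices and handles distribution.
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

model.train()
for step, batch in enumerate(train_dataloader, start=1):
    # For causal language modeling, the inputs double as the labels.
    loss = model(batch, labels=batch).loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
```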
"is_ai_topic": true }