8 February 2024
Computation and Language
Proposes separating embeddings for NL and code tokens during pre-training
Introduces modality-relative training objectives tailored to text-to-code data
Evaluates on two models and datasets, showing consistent improvements
Measures gains with pass@k and a new incremental pass@k metric
Modality-aware representation learning for text-to-code generation
This paper investigates separating the embedding spaces of natural language and code tokens during the pre-training of text-to-code models. The hypothesis is that, because code tokens such as 'while' carry precise formal semantics, they may gain little from transferring their natural-language usage patterns. The authors experiment with modality-relative training objectives and separated embedding spaces on two models, consistently observing improved text-to-code generation quality on two datasets.
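The core idea can be illustrated with a minimal sketch, not the authors' implementation: route natural-language token ids and code token ids through separate embedding tables before they enter a shared model. The vocabulary size, dimensions, class name, and the `is_code` mask convention below are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ModalitySeparatedEmbedding(nn.Module):
    """Two embedding tables over the same token vocabulary: one used when a
    token occurs in natural-language context, one when it occurs in code."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.nl_embed = nn.Embedding(vocab_size, d_model)
        self.code_embed = nn.Embedding(vocab_size, d_model)

    def forward(self, token_ids: torch.Tensor, is_code: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer ids
        # is_code:   (batch, seq_len) bool, True where the token belongs to code
        nl_vectors = self.nl_embed(token_ids)
        code_vectors = self.code_embed(token_ids)
        # Select per position: code tokens (e.g. 'while' inside a program) get
        # the code-specific vector, NL tokens keep the NL vector.
        return torch.where(is_code.unsqueeze(-1), code_vectors, nl_vectors)


if __name__ == "__main__":
    embed = ModalitySeparatedEmbedding(vocab_size=32000, d_model=512)
    ids = torch.randint(0, 32000, (2, 16))
    mask = torch.zeros(2, 16, dtype=torch.bool)
    mask[:, 8:] = True  # assume the second half of each sequence is code
    print(embed(ids, mask).shape)  # torch.Size([2, 16, 512])
```

Under this setup, a token that appears in both the natural-language prompt and the generated program receives a different vector in each modality, which is one way to realize the separation the paper studies.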