In this role, you will:
- Design and run pre-training, continued pre-training, and mid-training experiments for code models.
- Build and improve data pipelines for large-scale model training, including filtering, deduplication, mixture design, and dataset quality checks.
- Work with code corpora, repositories, tests, execution traces, and synthetic data.
- Develop evaluations for complex repository-level code reasoning tasks.
- Collaborate with researchers and engineers working on ML for code and AI developer tools.
We’ll be happy to have you on our team if you:
- Have hands-on experience with model pre-training, continued pre-training, or mid-training.
- Have strong engineering skills in Python and experience with modern ML frameworks.
- Understand large-scale ML training workflows, including data processing, distributed training, checkpointing, evaluation, experiment tracking, and debugging.
- Have experience working with large datasets and care about data quality, contamination, sampling, and reproducibility.
- Have a background in NLP, ML for software engineering, or a similar domain.
- Enjoy working on research problems with high uncertainty and turning ideas into working experiments.
It would be a plus if you:
- Have experience training or adapting models for code generation, code understanding, software agents, program repair, test generation, or repository-level reasoning.
- Have worked with execution-based data, such as unit tests, traces, logs, compiler feedback, runtime states, or sandboxed code execution.
- Have experience with large-scale distributed training of models with 70B+ parameters.
- Understand evaluation challenges for code models, including benchmark contamination, flaky tests, execution-based scoring, and long-horizon task evaluation.
- Have contributed to ML infrastructure, open-source projects, or research systems.