BigCode Project
BigCode Project is an open scientific collaboration coordinated by Hugging Face and ServiceNow Research, founded in September 2022 with a mandate to develop open-weights code language models and the supporting training-data infrastructure for the broader research community. The collaboration's principal outputs include the StarCoder family of open-weights code language models (StarCoder, StarCoderBase, StarCoder2, and other variants), The Stack training dataset (an open code training corpus, with The Stack v2 spanning approximately 67.5 TB of source code), and published research on responsible code-foundation-model development. As of April 2026, BigCode is one of the principal open-research collaborations on code language models, with cross-institution participation including Hugging Face, ServiceNow Research, the Allen Institute for AI, EleutherAI, and other academic and industry research peers.
At a glance
- Founded: September 2022 as an open scientific collaboration. Coordinated by Hugging Face and ServiceNow Research.
- Status: Open scientific collaboration with cross-institution participation.
- Funding: Hugging Face and ServiceNow Research operating contributions, plus selected industry-cooperative-agreement funding and academic-research-grant funding.
- Coordinators: Leandro von Werra, Co-coordinator (Hugging Face). Harm de Vries, Co-coordinator (ServiceNow Research).
- Other notable participants: Academic and industry researchers from Hugging Face, ServiceNow Research, and other institutions across the broader open-research community.
- Open weights: Yes. The StarCoder family is released through Hugging Face under the OpenRAIL-M license (a responsible-AI license with use-case restrictions); The Stack is released as an open dataset through Hugging Face datasets.
- Flagship outputs: StarCoder family (StarCoder, StarCoderBase, StarCoder2, and other variants), The Stack training dataset, StarChat (instruction-tuned variant), published research output on responsible code-foundation-model development.
Origins
BigCode Project was founded in September 2022 as an open scientific collaboration, following the 2021 to 2022 surge of community interest in open-weights code foundation models. OpenAI's Codex (2021), the principal closed-weights code model and the engine behind GitHub Copilot's initial deployment, had anchored commercial interest in the category, prompting open-research interest in open-weights alternatives.
The 2021 founding of the BigScience workshop by Hugging Face had established an analogous open-research collaboration model for general-purpose foundation models, resulting in BLOOM, the 176-billion-parameter open-weights multilingual model. BigCode was founded as the counterpart collaboration for code, coordinated by Hugging Face and ServiceNow Research.
The 2022 to 2023 founding period built the collaboration's infrastructure, including The Stack training dataset (filtered for permissively licensed source code) and the StarCoder model architecture. StarCoder (May 2023) was the principal first-generation output: a 15.5-billion-parameter model trained on roughly one trillion tokens of permissively licensed code from The Stack, accompanied by published research on responsible code-foundation-model development, including data-attribution mechanisms.
The 2024 release of StarCoder2 (in 3B, 7B, and 15B variants) extended the family, pairing training-data quality improvements from The Stack v2 with multilingual coverage of 600-plus programming languages. From 2024 through 2026, the collaboration has sustained its research output and cross-institution cooperation.
Mission and strategy
BigCode Project's stated mission is the responsible development of open-weights code language models, together with the supporting training-data infrastructure, for the broader research community. The strategic premise is open-research positioning, with emphasis on responsible-AI license frameworks (OpenRAIL-M) and data-attribution mechanisms for code-model training data.
The strategy combines three threads: code-foundation-model research through the StarCoder family; The Stack training dataset, which provides open training-data infrastructure for the research community; and published research on responsible development, including data attribution and licensing frameworks.
The competitive premise rests on BigCode's distinct positioning as an open-research collaboration: cross-institution participation, open-weights and open-data releases, and adoption of a responsible-AI license framework.
Distribution runs through Hugging Face for model weights, Hugging Face datasets for the open training corpora, major academic venues for research output, and cross-institution research cooperation.
Models and products
- StarCoder family. StarCoder (May 2023, 15.5B parameters), StarCoderBase, StarCoder2 (in 3B, 7B, and 15B variants, 2024), StarChat (instruction-tuned variant), and other variants. Open-weights through Hugging Face under OpenRAIL-M license.
- The Stack. Open code training dataset, filtered for permissively licensed source code; The Stack v2 spans approximately 67.5 TB of source code.
- Published research output. Across responsible code-foundation-model development, data-attribution mechanisms, multilingual code-foundation-model research, and other areas.
- Cross-institution research cooperation. Spanning Hugging Face, ServiceNow Research, the Allen Institute for AI, EleutherAI, and other academic and industry research peers.
All of these outputs reach users through Hugging Face (weights and datasets) and major academic venues (research); a minimal loading sketch follows.
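As a concrete illustration of that distribution path, the sketch below loads a StarCoder2 checkpoint with the Hugging Face transformers library. It is a minimal sketch, assuming a recent transformers release and that "bigcode/starcoder2-3b" remains the Hub ID for the 3B variant; the prompt and generation settings are illustrative only.

```python
# Minimal sketch: loading an open-weights StarCoder2 checkpoint from the
# Hugging Face Hub. Assumes `transformers` and `torch` are installed and
# that "bigcode/starcoder2-3b" is the Hub ID for the 3B variant.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

# The base models are trained for code completion rather than chat: prompt
# with the start of a function and let the model continue it.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The instruction-tuned StarChat variants are instead prompted through a chat template rather than raw continuation.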
Benchmarks and standing
BigCode's evaluation framework centers on code-generation benchmarks (HumanEval, HumanEval+, MBPP, MultiPL-E, RepoBench, and related leaderboards) together with published research output. The StarCoder family has been consistently characterized in industry coverage as one of the principal open-weights code-model families globally, alongside Meta AI / FAIR's Code Llama, DeepSeek Coder, and other open-weights code-model families.
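These benchmarks conventionally report pass@k, the probability that at least one of k sampled completions passes a problem's unit tests. The sketch below shows the standard unbiased estimator introduced with HumanEval; it is generic evaluation arithmetic rather than BigCode-specific code, and the sample counts in the final line are chosen purely for illustration.

```python
# Unbiased pass@k estimator used with HumanEval-style benchmarks: given n
# generated samples per problem, of which c pass the unit tests,
# pass@k = 1 - C(n-c, k) / C(n, k), averaged across problems.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without
    replacement from the n generations is among the c correct ones."""
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples per problem, 37 passing, estimating pass@10.
print(round(pass_at_k(n=200, c=37, k=10), 4))
```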
The Stack training dataset has likewise been characterized as one of the principal open code training datasets globally, with adoption across the broader code-model research community.
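To illustrate how that community typically consumes the corpus, the sketch below streams one language slice of The Stack instead of downloading it whole. It is a minimal sketch, assuming the Hugging Face datasets library, the "bigcode/the-stack" Hub ID with its per-language data/<language> layout, and a "content" column holding raw file text; the dataset is gated, so accepting its terms and authenticating to the Hub are prerequisites.

```python
# Minimal sketch: streaming one language subset of The Stack from the
# Hugging Face Hub. Assumes the `datasets` library, prior authentication
# (e.g. `huggingface-cli login`), and the per-language "data/<lang>" layout.
from datasets import load_dataset

ds = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",  # one permissively licensed language slice
    split="train",
    streaming=True,          # iterate lazily; no full-corpus download
)

# The "content" column is assumed to hold the raw source-file text.
for example in ds.take(3):
    print(len(example["content"]))
```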
Coverage has also consistently cited BigCode's adoption of the OpenRAIL-M license framework and its data-attribution mechanisms as contributions to responsible-AI research.
Leadership
As of April 2026, BigCode Project's senior coordinators include:
- Leandro von Werra, Co-coordinator (Hugging Face).
- Harm de Vries, Co-coordinator (ServiceNow Research).
- Academic and industry research participants across Hugging Face, ServiceNow Research, the Allen Institute for AI, EleutherAI, and other institutions.
Sustained cross-institution participation has supported BigCode's research output from 2022 through 2026.
Funding and backers
BigCode Project operates under Hugging Face and ServiceNow Research operating contributions, plus selected industry-cooperative-agreement funding and academic-research-grant funding. Specific BigCode-internal budget allocations are not separately disclosed.
Hugging Face's private capital base (with Series D backing) and ServiceNow's public-company resources (NYSE: NOW) give BigCode stable operating resources, leaving few open questions about near-term funding.
Industry position
BigCode Project occupies a distinctive position as one of the principal open-research collaborations on code language models, combining cross-institution participation, the StarCoder family, The Stack training dataset, and published research on responsible code-foundation-model development. Industry coverage has consistently characterized BigCode as one of the principal open-research code-model collaborations globally.
The 2024 to 2026 period has brought further StarCoder family iteration alongside ongoing cross-institution cooperation across the broader open-research community.
Competitive landscape
- Hugging Face. Coordinator and open-research distribution partner.
- ServiceNow Research. Co-coordinator with industry-research collaboration.
- Meta AI / FAIR Code Llama. Open-weights code-foundation-model peer.
- DeepSeek Coder. Open-weights code-foundation-model peer.
- Codeium / Windsurf, Cursor (Anysphere), Magic, Cognition. Commercial coding-AI peers with different commercial-product positioning.
- OpenAI Codex (legacy), GPT-4 code variants. Closed-weights code-foundation-model alternatives.
- Allen Institute for AI (Ai2), EleutherAI, LAION, BigScience. Open-research peers.
Outlook
- Continued StarCoder family iteration through 2026 and 2027.
- Continued expansion of The Stack training dataset and its open-data infrastructure trajectory.
- Continued cross-institution research cooperation across the broader open-research community.
- Continued responsible-AI license-framework adoption and data-attribution research.
- The trajectory of the Hugging Face and ServiceNow Research strategic partnership.
Sources
- BigCode Project official site. Project reference.
- StarCoder release blog post. May 2023 release announcement.
- StarCoder2 release announcement. 2024.
- The Stack dataset. Open code training dataset.
- Hugging Face. Coordinator and open-research distribution partner.