BigScience

BigScience was an international AI research collaboration coordinated by Hugging Face from 2021 to 2022, producing the BLOOM 176-billion-parameter multilingual language model and demonstrating the viability of open, distributed AI research at frontier scale.

BigScience was a one-year international artificial intelligence research collaboration coordinated by Hugging Face from May 2021 to May 2022, with over 1,200 volunteer researchers and engineers from 38 countries contributing to the development of BLOOM, an open-access 176-billion-parameter multilingual language model trained to generate text in 46 natural languages and 13 programming languages. BLOOM was released in July 2022 and was at the time the largest publicly available multilingual language model. The BigScience collaboration is widely regarded as the principal demonstration that open, distributed, academic-and-industry AI research can produce frontier-scale model artifacts comparable in capability to those from major commercial AI labs.

At a glance

  • Founded: Workshop launched May 2021 in France, coordinated by Hugging Face. Concluded the principal research phase in May 2022, with the BLOOM training run completing and the model being released publicly in July 2022.
  • Status: Time-limited research workshop. The collaboration is no longer an ongoing organization in the conventional sense, but the BLOOM model and other research artifacts continue to be available and cited.
  • Funding: Supported by France's GENCI and IDRIS (the operator of the Jean Zay supercomputer at the CNRS) for the training-compute resources, alongside Hugging Face institutional resources and volunteer contributions from the participating researchers.
  • CEO: No single CEO. The collaboration was coordinated by Hugging Face leadership, particularly Thomas Wolf (Hugging Face Chief Science Officer), and by the leaders of its approximately 30 Working Groups.
  • Other notable leadership: Yacine Jernite (Hugging Face research lead and BigScience contributor), Stéphane Requena (GENCI), and Pierre-François Lavallée (IDRIS).
  • Open weights: Yes. BLOOM was released as an open-access model under the Responsible AI License (RAIL), which allows broad research and commercial use while prohibiting a defined set of harmful uses.
  • Flagship outputs: BLOOM (176-billion-parameter multilingual language model), BLOOMZ (instruction-tuned BLOOM variants), the BLOOM training methodology, the BigScience research papers, and other multilingual research artifacts.

Origins

BigScience emerged from discussions in late 2020 and early 2021 between Thomas Wolf at Hugging Face, Stéphane Requena at GENCI (Grand Équipement National de Calcul Intensif, the French national supercomputing infrastructure), and Pierre-François Lavallée at IDRIS (Institut du développement et des ressources en informatique scientifique, the operator of the Jean Zay public supercomputer). The motivating premise was that the academic and open-source AI research community lacked access to the compute resources needed to train large language models at the scale of contemporary commercial frontier-lab releases. The Jean Zay supercomputer in France, available through GENCI's national infrastructure program, could provide the necessary compute for an academic-led training run.

The BigScience workshop launched in May 2021 with an initial cohort of researchers from approximately 30 countries. Over the next year, the collaboration expanded to over 1,200 volunteer participants organized across approximately 30 Working Groups covering training data curation, model architecture, training methodology, evaluation, multilingual capability, ethics and law, and other research areas. The Working Group structure was a distinctive feature of the collaboration: it allowed distributed contribution from a global community of researchers without requiring centralized employment relationships.

The BLOOM training run began in March 2022 and completed in July 2022. The 176-billion-parameter BLOOM model was at the time the largest publicly available multilingual language model and was trained on the ROOTS corpus, a 1.6-terabyte multilingual text dataset compiled by the BigScience data Working Group. The dataset covered 46 natural languages and 13 programming languages with explicit care taken for ethical curation, regulatory compliance, and language-coverage balance.

BLOOM was released to the public in July 2022 under the Responsible AI License. The release was followed in November 2022 by the BLOOM paper, which provided a comprehensive description of the training methodology and training data, technical disclosure that exceeded the transparency of any contemporary commercial frontier-lab model release.

The BigScience collaboration formally concluded its workshop phase in May 2022; the BLOOM training run and public release followed in July 2022. Subsequent activities (BLOOMZ instruction tuning, ROOTS dataset documentation, follow-on research papers) continued into 2023 and 2024, but the principal collaboration structure had wound down. Many of the participating researchers continued the work through their home institutions or through adjacent open-source AI organizations including the Allen Institute for AI, EleutherAI, LAION, and Hugging Face itself.

Mission and strategy

BigScience's stated mission was to demonstrate the viability of open, distributed, and ethically considered AI research at frontier scale. The collaboration explicitly framed itself as a counter-example to the commercial concentration of frontier AI capability, with the BLOOM model intended as evidence that academic and open-source research communities could produce frontier-tier model artifacts when provided with adequate compute infrastructure.

The strategy combined three threads. First, the BLOOM training run as the principal artifact of the collaboration: a single large-scale model release demonstrating frontier-scale open AI research. Second, the Working Group structure as the organizational innovation, allowing distributed volunteer contribution across multiple research areas. Third, ethics, multilingual coverage, and transparency commitments as principled differentiation from commercial AI lab releases, with explicit attention to language balance, harmful-content filtering, and licensing.

The competitive premise was that open, distributed AI research is structurally important for the AI ecosystem and that the right combination of academic-coordinated research with publicly funded compute infrastructure can produce frontier-tier capability. The premise was validated by BLOOM's release.

The longer-term influence of BigScience is most visible through the precedent the collaboration established: open-source AI research organizations (Allen Institute for AI, EleutherAI, LAION) cite BigScience as a foundational moment for the open-AI movement, and the Working Group methodology has been adopted in adjacent collaborative AI research initiatives.

Models and products

  • BLOOM. Released July 2022. 176-billion-parameter multilingual language model. Open-access under the Responsible AI License. Trained on 1.6 terabytes of text covering 46 natural languages and 13 programming languages.
  • BLOOMZ. Instruction-tuned variants of BLOOM in multiple parameter scales, released in late 2022 and early 2023.
  • ROOTS corpus. 1.6-terabyte multilingual training corpus compiled by the BigScience data Working Group. Available with extensive metadata on language balance, source diversity, and ethical curation methodology.
  • BLOOM training methodology paper. Comprehensive technical disclosure of the BLOOM training run, including data curation, architecture, optimization, and evaluation methodology.
  • BigScience research papers. Academic-publication output across the approximately 30 Working Groups covering training data, ethics, evaluation, multilingual research, and other areas.

The principal distribution channel was Hugging Face for the BLOOM model and dataset releases, with academic-paper publication for research outputs and GitHub for training code and infrastructure.
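Because BLOOM and BLOOMZ are hosted on the Hugging Face Hub, they load through the standard `transformers` API. A minimal sketch follows, using the small `bigscience/bloom-560m` checkpoint so it runs on modest hardware; the full model uses the same calls under the id `bigscience/bloom` (the prompt string and token budget here are illustrative choices, not part of the original release):

```python
# Minimal sketch: generating text with a small BLOOM checkpoint from the
# Hugging Face Hub. The full 176B model uses the same API with the model
# id "bigscience/bloom" but requires substantial GPU memory.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "bigscience/bloom-560m"  # small variant; swap for "bigscience/bloom"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Greedy decoding of a short continuation from an example prompt.
inputs = tokenizer("BigScience was", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same pattern applies to the BLOOMZ variants (e.g. `bigscience/bloomz-560m`), which respond better to natural-language instructions after instruction tuning.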

Benchmarks and standing

BLOOM was characterized at release in July 2022 as the largest publicly available multilingual language model. Benchmark performance was generally competitive with contemporary commercial frontier-lab models on multilingual tasks; English-only benchmarks placed BLOOM somewhat behind comparable commercial models given the broader multilingual training distribution.

The principal contribution of BLOOM was demonstration rather than benchmark leadership. The release validated the viability of open, distributed AI research at frontier scale and provided the open-source community with a frontier-tier model artifact for academic and other research use.

BigScience's standing in the open-source AI ecosystem is anchored on the BLOOM training methodology contribution, the precedent for distributed academic-led AI research, and the influence on subsequent open-source AI organizations. The collaboration is regularly cited as foundational in industry coverage of open-source AI development.

Leadership

BigScience operated through a distributed Working Group structure rather than a conventional senior-leadership hierarchy. The principal coordinating roles included:

  • Thomas Wolf, Hugging Face Chief Science Officer. Senior coordinator of the BigScience collaboration. Co-founded Hugging Face and co-authored "Natural Language Processing with Transformers."
  • Stéphane Requena, GENCI representative. Coordinated access to the Jean Zay supercomputer and the French national-research-infrastructure resources.
  • Pierre-François Lavallée, IDRIS director. Operator of the Jean Zay supercomputer.
  • Working Group leaders. Senior researchers leading approximately 30 research streams covering training data, architecture, optimization, evaluation, multilingual research, ethics, and other areas. The Working Group structure included contribution from researchers at academic institutions globally.
  • Yacine Jernite, Hugging Face research scientist and BigScience contributor.

The collaboration's distributed structure was itself distinctive. No single CEO or executive director coordinated BigScience; the Working Group structure provided the organizational framework, with Hugging Face leadership providing institutional coordination.

Funding and backers

BigScience's funding model was unusual. The principal compute resource (the Jean Zay supercomputer in France) was provided through GENCI and IDRIS as a public-research infrastructure allocation, valued in compute-dollar-equivalent at approximately $4 million. Hugging Face provided institutional coordination and infrastructure resources. Volunteer researchers contributed through their home institutions (universities, industry research labs, and other organizations).

The total resource investment in BigScience is difficult to estimate precisely given the volunteer-and-distributed model, but the public-research-infrastructure allocation, institutional resources, and volunteer time produced a collaboration that operated at substantially lower nominal cost than peer commercial frontier-lab AI training initiatives.

The funding model is widely cited in subsequent open-source AI research organizations as a precedent for combining public-research-infrastructure allocations with distributed academic-and-industry-volunteer contribution.

Industry position

BigScience occupies a distinctive position in recent AI research history. The combination of the BLOOM model as a frontier-tier open-access release, the Working Group organizational innovation, the publicly funded compute resource, and the distributed-volunteer research model produces a profile that no other AI collaboration has matched.

Industry coverage characterized BigScience in 2022 to 2023 as the most consequential open-source AI research collaboration to date. Open-source AI organizations including the Allen Institute for AI, EleutherAI, LAION, and Hugging Face itself have explicitly cited BigScience as a foundational precedent.

The collaboration's time-limited structure means BigScience is no longer an ongoing organization in the conventional sense, but the BLOOM model continues to be available and the influence of the collaboration's methodology is visible in subsequent open-source AI research organizations.

Competitive landscape

BigScience operated as a research collaboration rather than a competitive commercial entity. Related organizations include:

  • Hugging Face. The principal coordinating institution for BigScience. Continues to host BLOOM and BLOOMZ on the Hugging Face Hub and to support adjacent research.
  • GENCI / IDRIS. French national-research-infrastructure providers. Continue to support adjacent academic AI research projects through Jean Zay supercomputer allocations.
  • Allen Institute for AI, EleutherAI, LAION, Mila, Nous Research. Open-source and academic AI research organizations that cite BigScience as a precedent.
  • Meta AI / FAIR (Llama), Mistral AI, DeepSeek, Alibaba Qwen. Commercial open-weights model providers whose subsequent releases extended and surpassed BLOOM-era multilingual capability.

Outlook

BigScience as an active research collaboration concluded in 2022 and has not been reconvened. The collaboration's outlook in 2026 is principally about continued influence and citation rather than future research output. Several open questions remain:

  • The continued availability and use of BLOOM as a research artifact in academic and open-source AI research.
  • Potential successor collaborations following the BigScience Working Group precedent. No major successor collaboration has been announced as of April 2026.
  • The integration of BigScience methodology into future open-source AI research initiatives, particularly around multilingual research and ethical training-data curation.
  • The legacy influence of BigScience on AI policy discussions about open-source AI research and publicly funded AI compute infrastructure.

About the author
Nextomoro

AI Research Lab Intelligence

nextomoro tracks progress for AI research labs, models, and what's next.
