Common Crawl
Common Crawl is a non-profit foundation, founded in 2007 by Gil Elbaz, the entrepreneur and co-founder of Applied Semantics (the company whose contextual-advertising technology became Google AdSense after Google's 2003 acquisition). It produces and openly distributes one of the largest web-crawl corpora available, publishing crawl releases roughly monthly that aggregate billions of web pages into petabyte-scale open archives. Research publications document Common Crawl-derived data in the pre-training corpora of many leading foundation models, including GPT-3, Llama, Mistral, DeepSeek, and Qwen. As of April 2026, Common Crawl is one of the most consequential open-data infrastructure organizations in the AI ecosystem, with nearly two decades of continuous crawl operations and its non-profit open-data structure as its principal distinctions.
At a glance
- Founded: 2007 in San Francisco by Gil Elbaz.
- Status: US 501(c)(3) non-profit foundation; operates as public-good open-data infrastructure.
- Funding: Initial funding from founder Gil Elbaz; ongoing operations supported by grants and donations.
- CEO / Lead: Gil Elbaz, Founder and Chairman.
- Other notable leadership: a small research, engineering, and operations team within the foundation.
- Open weights: Not applicable; Common Crawl is an open-data foundation, not a model producer.
- Flagship outputs: roughly monthly petabyte-scale web-archive releases; the cumulative Common Crawl corpus, a widely used source of foundation-model pre-training data; collaboration with academic and industry researchers on web-data curation and training-data research.
Origins
Common Crawl was founded in 2007 by Gil Elbaz with the mission of providing open access to web-scale crawl data for academic, research, and commercial use. After the Applied Semantics acquisition established him as a senior figure in the internet-infrastructure community, Elbaz funded the Common Crawl Foundation substantially from his own resources to build web-crawl infrastructure that researchers could use without the licensing constraints imposed by commercial search engines.
The 2007 to 2018 period built out the crawl infrastructure and an approximately monthly release cadence. From 2018 to 2024, Common Crawl's strategic importance grew sharply as the foundation-model pre-training era began: research publications document Common Crawl-derived data in the training corpora of GPT-3 (OpenAI), Llama (Meta AI / FAIR), Mistral (Mistral AI), DeepSeek, Qwen (Alibaba), and many other leading foundation models.
The 2024 to 2026 period has seen continued crawl operations alongside growing scrutiny of Common Crawl's role in the pre-training data supply chain, with industry coverage raising questions about web-data licensing, copyright, and web-corpus curation.
Mission and strategy
Common Crawl's stated mission is to produce and openly distribute web-crawl data for academic, research, and commercial use. The strategy combines two threads: first, the roughly monthly crawl operations that produce the corpus; second, the open-data distribution that gives the research community unencumbered access to it.
The premise behind the open-data positioning is that web-scale crawl data is essential research infrastructure, and that an independent non-profit can provide broader and less encumbered access to it than commercial search engines.
Models and products
- Roughly monthly web-crawl releases: petabyte-scale web archives published openly in WARC format, with WAT (metadata) and WET (extracted text) companion files.
- The Common Crawl corpus: the cumulative archive, widely used as a source of foundation-model pre-training data.
- Research engagement: collaboration with academic and industry researchers on web-data curation, training-data research, and open-data infrastructure.
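Crawl data is distributed as gzip-compressed streams of WARC records. As a minimal illustration of the record structure, the sketch below parses the header block of a single uncompressed record; the field names follow the WARC specification, but the sample record itself is invented for illustration.

```python
# Minimal sketch: parse one uncompressed WARC record into headers + payload.
# The sample record is invented; real Common Crawl WARC files are
# gzip-compressed streams containing millions of such records.

SAMPLE_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Date: 2026-04-01T00:00:00Z\r\n"
    "WARC-Target-URI: http://example.com/\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "Hello, crawl!"
)

def parse_warc_record(raw: str) -> tuple[dict, str]:
    """Split one WARC record into (header dict, payload string)."""
    head, _, body = raw.partition("\r\n\r\n")   # blank line ends the headers
    lines = head.split("\r\n")
    headers = {"version": lines[0]}             # e.g. "WARC/1.0"
    for line in lines[1:]:
        key, _, value = line.partition(": ")
        headers[key] = value
    # Content-Length gives the payload size in bytes.
    payload = body[: int(headers["Content-Length"])]
    return headers, payload

headers, payload = parse_warc_record(SAMPLE_RECORD)
print(headers["WARC-Target-URI"])  # http://example.com/
print(payload)                     # Hello, crawl!
```

In practice, libraries such as warcio handle compression, chunking, and the full record grammar; this sketch only shows the header/payload shape of the format.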
Distribution is primarily through AWS S3 (the s3://commoncrawl bucket, hosted under the AWS Open Data program) and the Common Crawl Foundation website, with open access for academic, research, and commercial users.
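Programmatic access typically starts from the per-crawl URL index served at index.commoncrawl.org. The sketch below builds a query URL for that index server and parses one JSON line of the kind the API returns; the crawl label and the sample response line are assumptions for illustration, and no network request is made.

```python
import json
from urllib.parse import urlencode

def cdx_query_url(crawl: str, url_pattern: str) -> str:
    """Build a lookup URL for the Common Crawl index server.

    The crawl label (e.g. "CC-MAIN-2026-04") is an example; the list of
    available crawls is published on the Common Crawl site.
    """
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl}-index?{params}"

query = cdx_query_url("CC-MAIN-2026-04", "example.com/*")
print(query)

# The index API returns one JSON object per line; each record points at
# the WARC file, byte offset, and length where the capture is stored.
# This sample line is invented for illustration.
sample_line = (
    '{"urlkey": "com,example)/", "timestamp": "20260401000000", '
    '"filename": "crawl-data/CC-MAIN-2026-04/segments/x/warc/y.warc.gz", '
    '"offset": "1234", "length": "5678", "status": "200"}'
)
record = json.loads(sample_line)

# With offset and length, a single capture can be fetched with an HTTP
# Range request instead of downloading the whole multi-GB WARC file.
byte_range = (int(record["offset"]),
              int(record["offset"]) + int(record["length"]) - 1)
print(byte_range)  # (1234, 6911)
```

The offset/length fields are what make the corpus practical to sample: a consumer can retrieve individual page captures without streaming entire archive files.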
Benchmarks and standing
Common Crawl is not evaluated against AI benchmarks. Its standing is instead measured by the operational scale of the crawl (roughly monthly cadence, petabyte-scale corpus), the research publications documenting its role in foundation-model pre-training, and its engagement with the research community on web-data infrastructure.
Industry coverage has consistently characterized Common Crawl as one of the most consequential open-data infrastructure organizations in the AI ecosystem, citing its long crawl-operating history and its role in pre-training corpora.
Leadership
As of April 2026, Common Crawl's leadership includes:
- Gil Elbaz, Founder and Chairman.
- Senior research, engineering, and operations staff across the Common Crawl Foundation.
Funding and backers
Initial funding came from founder Gil Elbaz; ongoing operations are supported by grants, donations, and other open-data infrastructure funding.
Industry position
Common Crawl occupies a distinctive position as the principal open-data web-crawl organization, combining a nearly two-decade operating history, a documented role in foundation-model pre-training corpora, and a non-profit open-data structure that commercial crawl operators do not share.
Competitive landscape
- LAION, Hugging Face Datasets, BigScience ROOTS, and BigCode's The Stack: adjacent open-data and open-corpus organizations and projects.
- EleutherAI's The Pile and Ai2's Dolma: open corpora that build on or filter Common Crawl data.
- Internet Archive: an adjacent web-archive organization with a different remit (full-archive preservation rather than a crawl-data corpus for bulk analysis).
- Google and Microsoft Bing: commercial crawl operators whose data is not openly licensed.
- OpenAI, Anthropic, Google DeepMind, Meta AI / FAIR, Mistral AI, DeepSeek, and Alibaba Qwen: foundation-model labs that consume Common Crawl data in their pre-training corpora rather than compete with it directly.
Outlook
- Continued roughly monthly crawl releases through 2026 and 2027.
- Continued academic and research-community engagement on web-data curation and training-data research.
- Continued attention to web-data licensing, copyright, and the broader open-data infrastructure landscape.
- A continued role as a principal open web corpus for foundation-model pre-training.
Sources
- Common Crawl official site. Foundation reference.
- Gil Elbaz Wikipedia. Founder reference.
- Common Crawl AWS Open Data. Distribution infrastructure reference.
- Common Crawl GitHub. Open-source tooling.