FAQ: Archive API - Bright Data Docs

Amazon S3 bucket: Have your Data Snapshot delivered directly to your S3 bucket.
Webhook: Receive your Data Snapshot via webhook for real-time integration into your systems.
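As a rough illustration of the webhook option, the sketch below shows a generic HTTP endpoint that accepts a JSON POST. It is not taken from Bright Data's documentation: the payload shape, the handler name `SnapshotWebhookHandler`, the helper `parse_notification`, and the port are all assumptions made for the example, and a production receiver would also need authentication and HTTPS.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_notification(body: bytes) -> dict:
    """Decode a JSON webhook body; raise ValueError if it is not a JSON object."""
    payload = json.loads(body.decode("utf-8"))
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    return payload


class SnapshotWebhookHandler(BaseHTTPRequestHandler):
    """Hypothetical receiver: the actual delivery payload format is not shown here."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            payload = parse_notification(body)
        except ValueError:
            # Malformed or non-object body: reject it.
            self.send_response(400)
            self.end_headers()
            return
        # Placeholder: hand the payload to your own pipeline here.
        print("received keys:", sorted(payload))
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Start the receiver on the given port (blocks until interrupted)."""
    HTTPServer(("", port), SnapshotWebhookHandler).serve_forever()
```

Calling `serve()` starts the listener. Keeping the parsing in `parse_notification` makes it easy to test separately from the HTTP layer.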

What is Archive API?

How much data is available?

How quickly can I access the data?

How can my data be delivered?

Can I filter Archive's data to get only what I need?

How does Bright Data's Archive compare to Common Crawl?

When working with large-scale web data, freshness, relevance, and accessibility are key. While Common Crawl provides a broad historical snapshot of the web, Bright Data’s Archive API offers real-time, continuously updated data with advanced filtering and delivery options. Here’s how they compare:

Feature	Bright Data’s Archive	Common Crawl
Data Collection	Continuously captures public web data in real time, providing results as recent as “now.”	Periodic web crawling (not real-time), updated monthly or bimonthly, so data can be outdated.
Data Volume	17.5 PB collected in 8 months, covering 118 billion pages (28 billion unique URLs from 40 million domains). Adds ~1 PB and 2 billion unique URLs per week.	250 billion pages collected over 18 years.
Website Coverage & Relevance	Focuses on high-value, relevant website data based on real scraping business needs.	Crawls indiscriminately, including outdated or low-quality pages.
Data Types	Full web pages (JS-rendered)	98.6% HTML and text
Filtering & Delivery	Full discovery and delivery platform: filter by category, domain, language, date, etc. Delivered via Amazon S3 or webhook.	No built-in filtering or delivery. Huge raw WARC files must be processed manually.
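To put the Data Volume row in perspective, here is a back-of-envelope calculation using only the figures quoted in the table. It assumes decimal petabytes (1 PB = 10^15 bytes) and treats the collection periods as exactly 8 months and 18 years. The results are rough averages, not official statistics.

```python
PB = 10**15  # assumption: decimal petabytes

# Figures quoted in the comparison table.
archive_bytes = 17.5 * PB
archive_pages = 118e9
archive_unique_urls = 28e9
archive_months = 8

cc_pages = 250e9
cc_years = 18

# Average stored size per captured page, in kilobytes.
avg_page_kb = archive_bytes / archive_pages / 1000

# Share of captured pages that are unique URLs.
unique_share = archive_unique_urls / archive_pages

# Average collection rate, in pages per month.
archive_pages_per_month = archive_pages / archive_months
cc_pages_per_month = cc_pages / (cc_years * 12)

print(f"Archive average page size: {avg_page_kb:.0f} KB")
print(f"Archive unique-URL share: {unique_share:.1%}")
print(f"Archive pages/month: {archive_pages_per_month / 1e9:.2f} billion")
print(f"Common Crawl pages/month: {cc_pages_per_month / 1e9:.2f} billion")
```

Under these assumptions the Archive averages about 148 KB per page, roughly 24% of captured pages are unique URLs, and the average monthly page count is about 14.75 billion versus about 1.16 billion for Common Crawl.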