Find answers to common questions about Archive API.
What is Archive API?
How much data is available?
How quickly can I access the data?
How can my data be delivered?
Can I filter Archive's data to get only what I need?
How does Bright Data's Archive compare to Common Crawl?
Feature | Bright Data’s Archive | Common Crawl |
---|---|---|
Data Collection | Continuously captures public web data in real time, providing results as recent as “now.” | Periodic web crawling (not real-time), updated monthly or bimonthly. Data can be outdated. |
Data Volume | 17.5 PB collected in 8 months, covering 118 billion pages (28 billion unique URLs from 40 million domains). Adds ~1 PB and 2 billion unique URLs per week. | 250 billion pages collected over 18 years. |
Website Coverage & Relevance | Focuses on high-value, relevant website data based on real scraping business needs. | Crawls indiscriminately, including outdated or low-quality pages. |
Data Types | Full web pages (JS-rendered) | 98.6% HTML and text |
Filtering & Delivery | Full discovery and delivery platform: filter by category, domain, language, date, and more. Delivered via Amazon S3 or webhook. | No built-in filtering or delivery. Users must manually process huge raw WARC files. |
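
To make the filtering row concrete, here is a minimal local sketch of what filtering by category, domain, language, and date means in practice. It does not call the Archive API and does not reflect its actual request schema: `PageRecord`, `ArchiveFilter`, `select`, and every field name are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class PageRecord:
    """Hypothetical metadata for one captured page (not the real schema)."""
    domain: str
    language: str
    category: str
    captured_on: date


@dataclass(frozen=True)
class ArchiveFilter:
    """Hypothetical filter; a criterion left as None is not applied."""
    domains: Optional[FrozenSet[str]] = None
    languages: Optional[FrozenSet[str]] = None
    categories: Optional[FrozenSet[str]] = None
    min_date: Optional[date] = None
    max_date: Optional[date] = None

    def matches(self, rec: PageRecord) -> bool:
        # Every criterion that is set must hold for the record to match.
        if self.domains is not None and rec.domain not in self.domains:
            return False
        if self.languages is not None and rec.language not in self.languages:
            return False
        if self.categories is not None and rec.category not in self.categories:
            return False
        if self.min_date is not None and rec.captured_on < self.min_date:
            return False
        if self.max_date is not None and rec.captured_on > self.max_date:
            return False
        return True


def select(records: Iterable[PageRecord], flt: ArchiveFilter) -> List[PageRecord]:
    """Return the records that satisfy the filter, preserving input order."""
    return [rec for rec in records if flt.matches(rec)]


# Illustrative sample data only.
records = [
    PageRecord("example.com", "en", "news", date(2025, 1, 10)),
    PageRecord("example.com", "de", "news", date(2025, 1, 12)),
    PageRecord("example.org", "en", "shopping", date(2025, 2, 1)),
    PageRecord("example.com", "en", "news", date(2024, 11, 30)),
]

flt = ArchiveFilter(
    domains=frozenset({"example.com"}),
    languages=frozenset({"en"}),
    min_date=date(2025, 1, 1),
)

selected = select(records, flt)
```

With a hosted filtering platform, criteria like these are applied before delivery; with raw WARC files, the equivalent selection has to be done by the consumer after downloading and parsing the data.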