Current Status
We are currently observing elevated error rates across both CloudFront and S3. We are investigating and will update this page as we learn more.
Performance Charts
Interactive charts showing CloudFront and S3 performance trends over time. Hover for values, click legend items to toggle series, and drag the range bar to zoom.
CloudFront - Requests per Second
Total CloudFront request throughput. The platform typically handles several thousand requests per second across all users. Significant spikes often correspond to aggressive bulk downloaders.
CloudFront - Bandwidth
Data transfer volume through CloudFront. Sustained high bandwidth usually indicates bulk dataset downloads.
CloudFront - Error Rate (%)
Percentage of CloudFront requests resulting in errors. 4xx errors are primarily caused by rate limiting applied to users exceeding request thresholds. 5xx errors typically indicate upstream S3 throttling.
S3 - Requests per Second
Direct S3 request throughput for users accessing Common Crawl data from within AWS. In general, S3 access from inside AWS performs better than external CloudFront access.
S3 - Error Rate (%)
Percentage of S3 requests resulting in errors. 503 "Slow Down" errors indicate S3 throttling due to excessive request rates from one or more users.
Access Details & Workarounds
Starting in October 2023, a small number of users with extremely high request rates have intermittently caused elevated traffic to the AWS-hosted open data bucket. The primary symptom is “503 Slow Down” errors. When the bucket is temporarily overwhelmed, these errors affect all users, including those making only a single request.
We worked with AWS’s S3 and network teams to mitigate these issues, and in November 2023 we deployed rate limiting. Since then, performance has been significantly more stable, with only occasional disruptions.
The following workarounds can help when downloading files from the dataset, particularly during periods of elevated error rates:
Using Common Crawl (outside AWS) using the official client
We have an official downloader client at https://github.com/commoncrawl/cc-downloader which can be installed via Cargo with the following command:
cargo install cc-downloader
Please see the cc-downloader GitHub repository for further documentation.
Using Common Crawl (outside AWS) over HTTPS
For bulk downloads, such as whole files, it’s possible to work around these 503s by retrying with a reasonable backoff. Retrying no more than once per second is sufficient. Once a request succeeds, the entire file will be delivered without further rate limit checks.
Here are some recipes for enabling retries for popular command-line tools:
curl: 1 second retries, and download filename taken from URL
curl -O --retry 1000 --retry-all-errors --retry-delay 1 https://data.commoncrawl.org/...
wget: 1 second retries
wget -c -t 0 --retry-on-http-error=503 --waitretry=1 https://data.commoncrawl.org/...
This retry technique does not work well enough for partial file downloads, such as index lookups and downloading individual webpage captures from within a WARC file.
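For scripted full-file downloads, the same retry-on-503 approach shown in the curl and wget recipes can be sketched in Python using only the standard library. This is a minimal sketch, not an official client; the function name and the injectable `opener` parameter are our own, for illustration:

```python
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, max_attempts=1000, delay=1.0,
                     opener=urllib.request.urlopen):
    """Fetch a URL, retrying on HTTP 503 "Slow Down" with a fixed
    backoff of `delay` seconds, mirroring the curl/wget recipes."""
    for _ in range(max_attempts):
        try:
            with opener(url) as resp:
                # Once a request succeeds, the entire body is delivered
                # without further rate-limit checks.
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code != 503:
                raise  # only retry throttling responses
        except urllib.error.URLError:
            pass  # transient network error; retry
        time.sleep(delay)  # retry no more than once per second
    raise RuntimeError("giving up after %d attempts" % max_attempts)
```

As with the command-line recipes, this works for whole-file downloads; it is not suitable for high volumes of small range requests.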
Using Common Crawl (inside AWS) via an S3 client
In general, direct S3 access from within AWS performs better than access via CloudFront (HTTPS). However, most S3 clients split large downloads into many separate requests, which makes them more susceptible to failures when error rates are elevated.
The following configuration raises the multipart threshold so that most single-file downloads are performed as a single S3 request:
$ cat ~/.aws/config
[default]
region = us-east-1
retry_mode = adaptive
max_attempts = 100000
s3 =
  multipart_threshold = 3GB
  multipart_chunksize = 3GB
Using Amazon Athena to access the Columnar Index
Amazon Athena issues many small S3 requests internally and has limited retry behavior for 503 errors. Re-running a failed query will incur additional charges and may also fail.
As an alternative, the Parquet columnar index files can be downloaded and queried locally using DuckDB, which provides fast SQL execution without S3 rate limit concerns. Example in Python:
import duckdb, glob

# Local mirror of the columnar index (adjust the path to your mirror)
files = glob.glob('/home/cc-pds/bucket-mirror/cc-index/table/cc-main/warc/crawl=*/subset=*/*.parquet')
# hive_partitioning=True exposes crawl= and subset= directories as columns
ccindex = duckdb.read_parquet(files, hive_partitioning=True)
duckdb.sql('SELECT COUNT(*) FROM ccindex;').show()
Status History
Status page redesigned: the charts are now interactive rather than static images.
Performance has been consistently good since the July 4 incident was mitigated. Please see the Access Details section above for details on maximizing download speeds.
Starting at UTC 17:34, a single user began sending approximately 100,000 requests per second to our CloudFront endpoint at https://data.commoncrawl.org/. The platform is designed to handle several thousand requests per second across all users combined. This caused degraded performance at both the CloudFront and S3 layers, affecting all users including those accessing data from within AWS and our CDX index server.
The requests were range requests originating from within the US, consistent with attempting to download a subset of the dataset (e.g., webpages in a particular language). While we support this type of access, users must respect rate limits and back off when receiving 4xx or 5xx error responses.
At UTC 21:32, we deployed a user-agent block for the offending client and included a response directing them to contact us. This blocked most of the remaining traffic, though recovery of the CloudFront endpoint and S3 bucket took additional time.
At UTC 22:34, the user reduced their request rate to approximately 7,000 per second but continued using the same user-agent, so the block remained effective. Monitoring continued through the holiday weekend.
Updated instructions to include the new cc-downloader client.
Updated instructions for AWS S3 configuration to mitigate 503 “Slow Down” problems.
After the US holiday weekend, high-volume download activity resumed. Rate limiting is effectively mitigating the impact.
Rate limiting was deployed for CloudFront (HTTPS) access at 15:17 UTC on Nov 21. This has significantly improved fairness of request handling: high-volume users are now rate-limited while normal usage is rarely affected.
Amazon has increased our resource quota.
Extremely high request rates and aggressive retries from a small number of users are causing widespread 503 “Slow Down” responses across S3 and CloudFront (https://data.commoncrawl.org/). This has affected all users throughout October and November. See the Access Details section above for recommended retry strategies.
New status page.