Common Crawl AWS Infrastructure Status

Overall Status

(May 12, 2024) Since last November, performance problems have been fairly minor and short-lived.

Please see below for details on how to maximize download speeds despite 503 Slow Down errors.



CloudFront (https) Performance Screenshot -- past week

Graph-reading hints: We can handle a few thousand requests per second, not millions.
All times are in UTC. Scroll down for daily and monthly screenshots.
CloudFront: 4xx errors are mostly Greg's explicit rate limit rules. 503s come from S3 backpressure.
S3: 503 errors are backpressure; we're not sure what the 4xx errors are.

CloudFront (https) performance screenshot, past week

s3:// Performance Screenshot -- past week

s3 performance screenshot, past week

Details

Starting in October 2023, extremely aggressive downloaders have been causing high traffic to our AWS-hosted open data bucket. The main symptom users will see is “503 Slow Down” errors when accessing our bucket. Once the bucket is temporarily overwhelmed, it sends these 503 errors to everyone, including users sending even a single request.

We worked with AWS’s S3 and network teams to mitigate these issues, and in November 2023 we deployed rate limiting. Since then, performance problems have only occasionally happened.

The following workarounds might be helpful for downloading whole files from our dataset, even when we’re experiencing a lot of 503s:

Using commoncrawl over https

For bulk downloads, such as whole files, it’s possible to work around these 503s by politely retrying many times. Retrying no more than once per second is polite enough. And once you are lucky and get a request through, you’ll get the entire file without any further rate limit checks.

Here are some recipes for enabling retries for popular command-line tools:

curl: 1 second retries, and download filename taken from URL

curl -O --retry 1000 --retry-all-errors --retry-delay 1 https://data.commoncrawl.org/...

wget: 1 second retries

wget -c -t 0 --retry-on-http-error=503 --waitretry=1 https://data.commoncrawl.org/...
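
If you are scripting bulk downloads in Python, the same polite retry loop can be sketched with the requests library. This is only an illustration, not part of our tooling; the URL is a placeholder, and the one-second sleep matches the retry delay used in the recipes above.

import time
import requests

url = 'https://data.commoncrawl.org/...'  # placeholder: fill in the file path you want
outfile = url.rsplit('/', 1)[-1]          # save under the filename taken from the URL

while True:
    resp = requests.get(url, stream=True)
    if resp.status_code == 200:
        with open(outfile, 'wb') as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)
        break
    resp.close()
    time.sleep(1)  # retrying no more than once per second is polite enough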

This retry technique does not work well enough for partial file downloads, such as index lookups and downloading individual webpage captures from within a WARC file.

Using commoncrawl via an S3 client

As you can see in the various graphs, in general direct S3 usage is working better than usage via CloudFront (https). However, most S3 clients tend to split large downloads into many separate requests, and this makes them more vulnerable to the entire download failing after too many retries.

This configuration turns single file downloads into a single transaction:

$ cat ~/.aws/config
[default]
region = us-east-1
retry_mode = adaptive
max_attempts = 100000
s3 =
  multipart_threshold=10GB
  multipart_chunksize=10GB
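
If you use boto3 instead of the AWS CLI, the same idea can be sketched as follows. This assumes the commoncrawl bucket name, a placeholder key, and AWS credentials already configured; the attempt limit and 10 GB thresholds mirror the config file above.

import boto3
from botocore.config import Config
from boto3.s3.transfer import TransferConfig

# adaptive retry mode with a very high attempt limit, mirroring ~/.aws/config above
s3 = boto3.client(
    's3',
    region_name='us-east-1',
    config=Config(retries={'mode': 'adaptive', 'max_attempts': 100000}),
)

# a 10 GB multipart threshold keeps a typical single-file download in one request
xfer = TransferConfig(multipart_threshold=10 * 1024**3,
                      multipart_chunksize=10 * 1024**3)

# placeholder key -- substitute the path of the file you want
s3.download_file('commoncrawl', 'crawl-data/...', 'local-filename', Config=xfer)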

Using Amazon Athena

Amazon Athena makes many small requests to S3 as it does its work. Unfortunately, it does not retry 503 errors very aggressively, and we have been unable to find an Athena configuration that improves its retries. If you simply re-run a failed query, you’ll be billed again and it probably won’t succeed either.

We use Amazon Athena queries ourselves as part of our crawling configuration. Because we make a lot of queries, we downloaded the parquet columnar index files for the past few crawls and used DuckDB to run SQL queries against them. In Python, starting DuckDB and running a query looks like this:

import duckdb, glob

# local mirror of the columnar index; crawl= and subset= are hive partitions
files = glob.glob('/home/cc-pds/bucket-mirror/cc-index/table/cc-main/warc/crawl=*/subset=*/*.parquet')
ccindex = duckdb.read_parquet(files, hive_partitioning=True)

# SQL queries can refer to the Python variable ccindex by name
duckdb.sql('SELECT COUNT(*) FROM ccindex;')
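
As a follow-up example, a typical index lookup might filter on a crawl and a hostname and return the WARC location fields. This sketch assumes the standard cc-index column names (url, url_host_name, warc_filename, warc_record_offset, warc_record_length) and uses an example crawl label:

duckdb.sql("""
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM ccindex
    WHERE crawl = 'CC-MAIN-2024-10'
      AND subset = 'warc'
      AND url_host_name = 'commoncrawl.org'
""")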


Status history

(May 12, 2024) Updated instructions for AWS S3 configuration to mitigate 503 “Slow Down” problems.

(Nov 27, 2023) After the US holiday weekend, starting with the European morning, our aggressive downloaders have returned. Rate limiting is mostly taking care of the problem.

(Nov 22, 2023) Yesterday at 15:17 UTC we deployed rate limiting for CloudFront (https) accesses. This appears to have significantly improved the fairness of request handling, with very aggressive downloaders receiving errors and more gentle usage only occasionally seeing errors.

(Nov 17, 2023) Amazon has increased our resource quota.

(Nov 15, 2023) Very high request rates and aggressive retries from a small number of users are causing many 503 “Slow Down” replies to almost all S3 and CloudFront (https://data.commoncrawl.org/) requests. This has been causing problems for all users during the months of October and November.

Please see below for some hints for how to responsibly retry your downloads.

(Nov 14, 2023) New status page



CloudFront (https) Performance Screenshot -- past day

Graph-reading hints: We can handle a few thousand requests per second, not millions.
CloudFront: 4xx errors are mostly Greg's explicit rate limit rules. 503s come from S3 backpressure.
S3: 503 errors are backpressure; we're not sure what the 4xx errors are.

CloudFront (https) performance screenshot, past day

s3:// Performance Screenshot -- past day

s3 performance screenshot, past day

CloudFront (https) Performance Screenshot -- past month

Graph-reading hints: We can handle a few thousand requests per second, not millions.
CloudFront: 4xx errors are mostly Greg's explicit rate limit rules. 503s come from S3 backpressure.
S3: 503 errors are backpressure; we're not sure what the 4xx errors are.

CloudFront (https) performance screenshot, past month

s3:// Performance Screenshot -- past month

s3 performance screenshot, past month