Common Crawl AWS Infrastructure Status

Overall Status

(Dec 11) Starting around 16:00 UTC, a very aggressive downloader has been causing a moderate rate of 503s for all users. This downloader (or a different one) became far more aggressive around 19:00 UTC, using a very large number of IP addresses and not triggering our rate limit. This resulted in a significant drop in total download bandwidth.

Please see below for details on how to maximize download speeds despite 503 Slow Down errors.



CloudFront (https) Performance Screenshot -- past week

Graph-reading hints: We can handle a few thousand requests per second, not millions.
All times are in UTC. Scroll down for daily and monthly screenshots.



s3:// Performance Screenshot -- past week



Details

Starting in October, extremely aggressive downloaders have been causing high traffic to our AWS-hosted open data bucket. The main symptom users will see is “503 Slow Down” errors when accessing our bucket. Once the bucket is temporarily overwhelmed, it sends these 503 errors to everyone, including users sending even a single request.

We have been working with Amazon’s S3 and network teams to resolve this issue. Together, we have made several configuration changes that have improved performance. We have also experimented with some of Amazon’s WAF tools, without much success. Another fix would be to contact the offending organizations and work with them to bring their request rates down to an acceptable level. However, our logs do not always indicate who is attempting the downloads, which makes it challenging to contact these users and ask them to be more reasonable with their request rates and retries.

The following workarounds might be helpful for downloading whole files from our dataset despite the ongoing problems:

Using commoncrawl over https

For bulk downloads, such as whole files, it’s possible to work around these 503s by politely retrying many times. Retrying no more than once per second is polite enough. Once a request gets through, you’ll receive the entire file without any further rate-limit checks.

Here are some recipes for enabling retries for popular command-line tools:

curl: retry every second, with the download filename taken from the URL

curl -O --retry 1000 --retry-all-errors --retry-delay 1 https://data.commoncrawl.org/...

wget: retry every second

wget -c -t 0 --retry-on-http-error=503 --waitretry=1 https://data.commoncrawl.org/...
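If you are scripting downloads rather than using a command-line tool, the same polite-retry approach works in Python. This is a minimal sketch, not an official client; it assumes the third-party requests library and retries failed whole-file downloads no more than once per second:

import time
import requests

url = 'https://data.commoncrawl.org/...'   # the file you want, same URL as above
outfile = url.split('/')[-1]               # download filename taken from the URL

for attempt in range(1000):                # give up eventually, like curl --retry 1000
    resp = requests.get(url, stream=True)
    if resp.status_code == 200:
        with open(outfile, 'wb') as f:
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)
        break
    resp.close()                           # 503 or other error: wait politely and retry
    time.sleep(1)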

This retry technique does not work well enough for partial file downloads, such as index lookups and downloading individual webpage captures from within a WARC file.
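For context, those partial downloads are HTTP Range requests: an index lookup returns a WARC filename, an offset, and a length, and the capture is fetched by requesting just those bytes. Each of these small requests is separately subject to the rate limiting, so a session that makes many of them is likely to hit a 503 partway through. A sketch, with made-up offset and length values for illustration:

import requests

# placeholders; real values come from an index lookup
# (warc_filename, warc_record_offset, warc_record_length)
warc_url = 'https://data.commoncrawl.org/...'   # the WARC file named by the index record
offset, length = 1000, 2000

resp = requests.get(warc_url,
                    headers={'Range': 'bytes=%d-%d' % (offset, offset + length - 1)})
print(resp.status_code)   # 206 Partial Content on success; a 503 must be retried like any other request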

Using commoncrawl via an S3 client

As you can see in the various graphs, direct S3 usage is working better than access via CloudFront (https). However, most S3 clients split large downloads into many separate requests, which makes them more vulnerable to failing partway through with a 503. We do not yet have advice for configuring an S3 client to avoid this large-file download problem.
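For reference, these are the knobs boto3 exposes for retries and for how large downloads are split. We have not verified that any particular values avoid the 503 problem, so treat this sketch as a starting point for experimentation rather than a recommendation:

import boto3
from botocore.config import Config
from boto3.s3.transfer import TransferConfig

# assumes your AWS credentials and region are already configured
# retry more, with adaptive backoff, instead of boto3's small default
s3 = boto3.client('s3', config=Config(retries={'max_attempts': 20, 'mode': 'adaptive'}))

# split large downloads into fewer, less concurrent pieces
transfer = TransferConfig(multipart_threshold=512 * 1024 * 1024, max_concurrency=2)

s3.download_file('commoncrawl', 'crawl-data/...', 'local-copy.warc.gz', Config=transfer)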

Using Amazon Athena

Amazon Athena makes many small S3 requests as it does its work. Unfortunately, it does not try very hard to retry 503 errors, and we have been unable to find an Athena configuration that improves its retries. If you simply rerun a failed query, you’ll be billed again and it probably won’t succeed either.

We use Amazon Athena queries ourselves as part of our crawling configuration. Because we make a lot of queries, we downloaded the parquet columnar index files for the past few crawls and used DuckDB to run SQL queries against them. In Python, starting DuckDB and running a query looks like this:

import duckdb, glob

# local mirror of the columnar index: one Parquet file per crawl/subset partition
files = glob.glob('/home/cc-pds/bucket-mirror/cc-index/table/cc-main/warc/crawl=*/subset=*/*.parquet')

# read them as a single table, keeping crawl= and subset= as columns
ccindex = duckdb.read_parquet(files, hive_partitioning=True)

# the Python variable name can be used directly as a table name in SQL
duckdb.sql('SELECT COUNT(*) FROM ccindex;')
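Within the same session, a more typical index lookup uses the standard columns of the columnar index; warc_filename, warc_record_offset, and warc_record_length locate a capture inside a WARC file. The domain below is only an illustration:

duckdb.sql("""
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM ccindex
    WHERE subset = 'warc'
      AND url_host_registered_domain = 'example.com'
""").show()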


Status history

(Nov 27) After the US holiday weekend, starting in the European morning, the aggressive downloaders have returned.

(Nov 22) Yesterday at 15:17 UTC we deployed rate limiting for CloudFront (https) accesses. This appears to have significantly improved the fairness of request handling: very aggressive downloaders now receive errors, while gentler users only occasionally see them.

(Nov 17) Amazon has increased our resource quota.

(Nov 15) Very high request rates and aggressive retries from a small number of users are causing many 503 “Slow Down” replies to almost all S3 and CloudFront (https://data.commoncrawl.org/) requests. This has been causing problems for all users throughout October and November.

Please see the Details section above for hints on how to responsibly retry your downloads.

(Nov 14) New status page



CloudFront (https) Performance Screenshot -- past day



s3:// Performance Screenshot -- past day



CloudFront (https) Performance Screenshot -- past month



s3:// Performance Screenshot -- past month