Timeouts Loading Data from S3 to Aurora
Been working on a data migration recently which involves loading a pretty big tab-separated file into an Amazon Aurora (MySQL) database and hit an interesting issue in one of our sub-accounts.
We’d gone through the documented steps of creating an IAM Role, associating it with the cluster using aws rds add-role-to-db-cluster, and setting its ARN as the value of the aurora_load_from_s3_role DB cluster parameter.
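For reference, that association looks something like this from the CLI (the cluster identifier, parameter group name and role ARN below are placeholders, not our real ones):

    # Associate the IAM Role with the Aurora cluster
    aws rds add-role-to-db-cluster \
        --db-cluster-identifier my-aurora-cluster \
        --role-arn arn:aws:iam::123456789012:role/aurora-s3-load

    # Point aurora_load_from_s3_role at the same role
    aws rds modify-db-cluster-parameter-group \
        --db-cluster-parameter-group-name my-cluster-params \
        --parameters "ParameterName=aurora_load_from_s3_role,ParameterValue=arn:aws:iam::123456789012:role/aurora-s3-load,ApplyMethod=immediate"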
We’d added the s3:GetObject and s3:ListBucket permissions to the bucket policy of the data source, using the IAM Role’s ARN as the principal (we decided to use a bucket policy for ease of clean-up after the migration had been completed).
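The bucket policy itself was along these lines (the account ID, role name and bucket name are illustrative):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "AllowAuroraLoad",
          "Effect": "Allow",
          "Principal": { "AWS": "arn:aws:iam::123456789012:role/aurora-s3-load" },
          "Action": ["s3:GetObject", "s3:ListBucket"],
          "Resource": [
            "arn:aws:s3:::my-migration-bucket",
            "arn:aws:s3:::my-migration-bucket/*"
          ]
        }
      ]
    }

(s3:ListBucket applies to the bucket ARN itself, while s3:GetObject applies to the objects beneath it, hence the two Resource entries.)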
But still, no matter what we tried, our LOAD DATA FROM S3 FILE query would always hang for approximately five minutes and then throw us an error:
ERROR 1815 (HY000): Internal error: Unable to initialize S3Stream
What was going on? It felt like a timeout, due to the severe hang before the command bombed out. Checking our tables afterwards verified that no data had been written by the query, either.
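For context, the query itself was of this general shape (the table, bucket and file names here are stand-ins for the real ones):

    -- Tab-separated input, one record per line
    LOAD DATA FROM S3 FILE 's3://my-migration-bucket/data.tsv'
    INTO TABLE my_table
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n';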
An extra-strange thing is that the query worked just fine in our main AWS account (using a different S3 bucket and Role ARN, but with otherwise functionally identical configuration).
We only experienced the issue when trying to run the query from one of our sub-accounts, which we use for performance testing.
Double and triple checking our S3 and IAM setups didn’t flag up anything obvious. So we delved into the documentation, and the most interesting thing appeared to be the fifth point:
Configure your Aurora MySQL DB cluster to allow outbound connections to Amazon S3.
Checking the page linked to in that section implied that our cluster was likely to be misconfigured if we encountered an error like the following:
ERROR 1873 (HY000): Lambda API returned error: Network Connection. Unable to connect to endpoint
However, this didn’t seem to be the case for us. We weren’t seeing this error. Still, my instinct was that we were hitting some sort of network problem, so I decided to check our VPC Endpoint configuration in the console, accessed via VPC -> Endpoints.
Sure enough, I didn’t have a VPC Endpoint configured for the S3 service in this account. Checking our main account showed that we did have an S3 endpoint, which gave another hint as to why it worked there.
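If you’d rather check from the CLI than the console, something like this should do it (assuming us-east-1; the --query expression just trims the output down to the interesting fields):

    aws ec2 describe-vpc-endpoints \
        --filters Name=service-name,Values=com.amazonaws.us-east-1.s3 \
        --query 'VpcEndpoints[].[VpcEndpointId,VpcId,State]'

In our sub-account, this would have come back empty.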
A VPC Endpoint is a resource within AWS that allows you to privately connect a VPC to a particular AWS service. This means that resources normally hosted in private subnets, like database clusters, can be permitted to communicate with AWS services that expect to be reached over the Internet, like Amazon S3. Configuring a VPC Endpoint adds a route to the route tables you specify, meaning you can control exactly which subnets are granted this access.
I created a new endpoint, selecting com.amazonaws.us-east-1.s3 as the Service Name. I provided the VPC ID where my Amazon Aurora cluster was located. Finally, I provided the Route Table IDs that contained the subnets used by my Amazon Aurora cluster, and decided to use the default policy that was provided.
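The equivalent CLI invocation would look roughly like this (the VPC and route table IDs are placeholders):

    # S3 uses a Gateway-type endpoint, which works by adding routes
    # to the route tables you pass in
    aws ec2 create-vpc-endpoint \
        --vpc-endpoint-type Gateway \
        --vpc-id vpc-0abc1234 \
        --service-name com.amazonaws.us-east-1.s3 \
        --route-table-ids rtb-0def5678 rtb-0fed9876

Omitting --policy-document gives you the default full-access endpoint policy, which is what I went with here.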
After creating the endpoint I reran the LOAD DATA FROM S3 query. Our query hit five minutes and didn’t bomb out like before. The anticipation started to build — our datafile was rather large (approximately 13m rows), so we expected it to take a bit of time. Then, finally:
Query OK, 12537768 rows affected (10 min 50.39 sec)
Records: 12537768 Deleted: 0 Skipped: 0 Warnings: 0
It worked! And it was pretty damn fast, too.
Lesson learnt: even if you really, really hope the problem isn’t your VPC configuration, sometimes it is. That, and sometimes you need to use a bit of imagination with AWS’s documentation.
Hopefully I’ll get to write a bit more about my experiences performing data migrations in AWS soon; in particular we’ve been having fun exporting data from Amazon Redshift to S3, then importing that data into an Amazon Aurora cluster. I’ve been really impressed with the simplicity and speed that can be achieved, and I hope never to have to load data from anywhere else ever again.