How and Why to Host Your Sitemaps Inside an AWS S3 Bucket?

Juraj S

6 minutes

Jul 15, 2022

You developed an amazing website, filled it with eye-catching content, and believe strong results will follow – until you realize it's nowhere near enough.

Rules of the game

The truth is that reaching your target audience is anything but simple, since the World Wide Web has its own "rules of the game." For example, if you've ever tried to position your website on the first page of a Google search, you probably realized there are only two ways to get there: the highest bid on paid ads or immaculate SEO.

When it comes to Search Engine Optimization, besides creating unique content, using relevant keywords, building quality backlinks, and ensuring website speed, there is one thing you should keep a very close eye on: having an up-to-date sitemap.

Why doesn’t anyone see your website?

A sitemap is a blueprint of your website – it contains all of the website's pages within a domain, and its primary purpose is to help search engines find, crawl, and index all of your website's content.

How does it work?

To be precise, a sitemap is an XML file that lists all website URLs and additional metadata about each URL. Doing that helps the search engines crawl the site more intelligently.
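
To make the format concrete, here is a minimal sketch in plain Ruby that builds such an XML file by hand. The method names (build_sitemap, xml_escape) and the example URLs are made up for illustration; in a real Rails app a sitemap gem would normally generate this for you.

```ruby
require "date"

# The sitemap protocol requires these five characters to be entity-escaped.
XML_ESCAPES = {
  "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;", "'" => "&apos;"
}.freeze

# Hypothetical helper: escape a URL for use inside an XML element.
def xml_escape(text)
  text.gsub(/[&<>"']/, XML_ESCAPES)
end

# Hypothetical helper: build a minimal sitemap from a list of entries.
# Each entry is a hash with :loc (required) and :lastmod (optional Date).
def build_sitemap(entries)
  lines = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  ]
  entries.each do |entry|
    lines << "  <url>"
    lines << "    <loc>#{xml_escape(entry[:loc])}</loc>"
    lines << "    <lastmod>#{entry[:lastmod].iso8601}</lastmod>" if entry[:lastmod]
    lines << "  </url>"
  end
  lines << "</urlset>"
  lines.join("\n") + "\n"
end

puts build_sitemap([
  { loc: "https://example.com/", lastmod: Date.new(2022, 7, 15) },
  { loc: "https://example.com/blog?page=1&sort=new" }
])
```

Each url entry carries the page address in loc, and optional metadata such as lastmod tells crawlers when the page last changed.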

A crucial part of a well-run website

An up-to-date sitemap is crucial for a well-run website – because most people don't look past the top search engine results, and SEO is your best route for reaching a wider audience.

Any discussion of optimizing a website for search engines has to mention crawling: the process of automatically accessing websites and obtaining data for web indexing.

Is your page relevant to search engines?

A web crawler is an Internet bot, operated by a search engine, that systematically browses the Web to discover pages and the links that lead to more pages. Search engines send crawlers to look for new content by reading sitemaps and following URLs to obtain relevant data.

Google's crawler is called Googlebot, and it comes in two variants: a desktop crawler and a smartphone (mobile) crawler. After these bots gather data, they index all of the content and metadata of a particular website, so the search engine "knows" how relevant each page is to a search query.

One of the biggest problems when it comes to SEO is that every change in the website's content also needs to be reflected in the sitemap if you want people to see that new content.

If you don't update it, your sitemap becomes outdated, and any content you added after the sitemap was first created may stay invisible to search engines.

How and why should you automate sitemap refreshing?

Refreshing sitemaps manually can be very time-consuming, and a forgotten refresh leaves you with an outdated sitemap. Instead, you can regenerate the sitemaps automatically and store them securely with Amazon Web Services, using the Amazon S3 service.

At Devōt, we host our sitemaps inside an AWS S3 bucket – so if you'd like to go that route, here is what you need to know.

How to create an S3 bucket?

If you don't already have one, create an AWS account and search for the S3 service – then go to Buckets.

When creating a bucket, you can leave ACLs disabled, since we control access to this bucket with policies. Make sure 'Block Public Access' is turned off, though, so the crawlers can reach your sitemaps. If that feature is left on, no one will be able to find them.

The next step is creating a policy to attach to an IAM user, which is used to upload sitemaps. AWS Identity and Access Management (IAM) is a web service that helps you securely control access to AWS resources. Inside the platform, search for the IAM service, go to Policies, and create a new policy.

For security reasons, it's best to be very strict when permitting actions. Give that IAM user only the actions it needs to upload and remove files in the bucket, and nothing more. Instead of selecting each one in the visual editor, you can write the JSON directly:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObjectAcl",
        "s3:GetObject",
        "s3:ListBucketMultipartUploads",
        "s3:ListBucketVersions",
        "s3:ListBucket",
        "s3:DeleteObject",
        "s3:PutObjectAcl",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": [
        "arn:aws:s3:::web-review-app-sitemaps",
        "arn:aws:s3:::web-review-app-sitemaps/*"
      ]
    },
    {
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": "s3:ListAllMyBuckets",
      "Resource": "*"
    }
  ]
}
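
If you manage more than one bucket (say, staging and production), it can be handy to generate this policy rather than copy and paste it. The following is only a sketch: the method name sitemap_upload_policy and the constant SITEMAP_BUCKET_ACTIONS are made up for illustration, and the action list is copied from the policy above.

```ruby
require "json"

# Same actions as in the policy above; trim the list further if your
# upload flow turns out not to need some of them.
SITEMAP_BUCKET_ACTIONS = %w[
  s3:PutObject
  s3:GetObjectAcl
  s3:GetObject
  s3:ListBucketMultipartUploads
  s3:ListBucketVersions
  s3:ListBucket
  s3:DeleteObject
  s3:PutObjectAcl
  s3:ListMultipartUploadParts
].freeze

# Hypothetical helper: build the policy document for a given bucket name.
def sitemap_upload_policy(bucket_name)
  bucket_arn = "arn:aws:s3:::#{bucket_name}"
  {
    "Version" => "2012-10-17",
    "Statement" => [
      {
        "Sid" => "VisualEditor0",
        "Effect" => "Allow",
        "Action" => SITEMAP_BUCKET_ACTIONS,
        # The bucket itself (for list actions) and every object inside it.
        "Resource" => [bucket_arn, "#{bucket_arn}/*"]
      },
      {
        "Sid" => "VisualEditor1",
        "Effect" => "Allow",
        "Action" => "s3:ListAllMyBuckets",
        "Resource" => "*"
      }
    ]
  }
end

# Print JSON that can be pasted into the IAM policy editor.
puts JSON.pretty_generate(sitemap_upload_policy("web-review-app-sitemaps"))
```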

After creating this policy, you will need an IAM user who will have that policy applied.

How to create an IAM user?

  • Go to the IAM service and, under Users, select the option "create a user." We use the Access key credential type.
  • Under permissions, find and attach the policy that you have made.
  • When the user is created, save the Access key ID and the Secret access key values in ENV variables. You'll need them in the next step.

Let's use this with Rails!

If you want to implement this in the Rails framework, you can use the aws-sdk-s3 gem.

Assuming you already have a sitemap.rb file in your config folder that generates sitemaps locally with the sitemap_generator gem, you just need to add the aws-sdk-s3 gem, then use SitemapGenerator's AwsSdkAdapter class and ENV variables to supply the required values:

# Folder inside the S3 bucket where the sitemaps are stored
SitemapGenerator::Sitemap.sitemaps_path = "sitemaps/"
SitemapGenerator::Sitemap.adapter = SitemapGenerator::AwsSdkAdapter.new(
  ENV.fetch("S3_BUCKET_NAME"),
  aws_access_key_id: ENV.fetch("AWS_ACCESS_KEY_ID"),
  aws_secret_access_key: ENV.fetch("AWS_SECRET_ACCESS_KEY"),
  region: ENV.fetch("AWS_REGION")
)

To automate this process, you can run the sitemap:refresh Rake task on every deploy. On Heroku, you can add that Rake task to the Procfile as a release command:

release: bundle exec rake sitemap:refresh

Now, remember those crawlers? We'll need to direct them to the S3 bucket. It's simple: just go to your robots.txt file and tell them where the sitemap is:


Sitemap: https://my-websites-s3-bucket.s3.eu-central-1.amazonaws.com/sitemaps/sitemap.xml.gz
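
The URL above follows S3's virtual-hosted-style pattern (bucket name, then region, then the object path). As a small sketch, the robots.txt line can be assembled from the same values the Rails code already reads from ENV; the helper names s3_sitemap_url and robots_sitemap_line are made up for illustration.

```ruby
# Hypothetical helper: build the public URL of the sitemap in the bucket.
# The default path assumes the "sitemaps/" folder and the
# sitemap.xml.gz file name used in this article.
def s3_sitemap_url(bucket:, region:, path: "sitemaps/sitemap.xml.gz")
  "https://#{bucket}.s3.#{region}.amazonaws.com/#{path}"
end

# Hypothetical helper: format the line that goes into robots.txt.
def robots_sitemap_line(url)
  "Sitemap: #{url}"
end

url = s3_sitemap_url(bucket: "my-websites-s3-bucket", region: "eu-central-1")
puts robots_sitemap_line(url)
```

Building the line from the bucket name and region makes it harder for robots.txt to drift out of sync with the bucket your app actually uploads to.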

Heroku

Since we’re fetching the S3 bucket values from ENV variables, we have to store them somewhere. On Heroku, go to your production app, and under Settings – add them to your Config vars.

Now that your Heroku app has the values your Rails code expects, the sitemaps will be refreshed every time the app is redeployed on Heroku.
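
Because ENV.fetch raises a KeyError as soon as a variable is missing, a forgotten config var will make the refresh task fail. A small check like the sketch below can report every missing name at once; the constant and method names are made up for illustration, and the variable names are the ones used in the Rails snippet above.

```ruby
# The four variables the sitemap configuration reads with ENV.fetch.
REQUIRED_SITEMAP_ENV = %w[
  S3_BUCKET_NAME
  AWS_ACCESS_KEY_ID
  AWS_SECRET_ACCESS_KEY
  AWS_REGION
].freeze

# Hypothetical helper: return the names that are absent or blank.
# Accepts any hash-like object so it can be tried without touching ENV.
def missing_sitemap_env(env = ENV)
  REQUIRED_SITEMAP_ENV.reject do |name|
    env.key?(name) && !env[name].to_s.strip.empty?
  end
end

missing = missing_sitemap_env
if missing.empty?
  puts "All sitemap config vars are set."
else
  puts "Missing config vars: #{missing.join(', ')}"
end
```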

Submit your sitemaps to search engines

Let's go back to SEO. Now that your sitemap is automatically refreshing, you need to submit it to search engines. You can do this through Google Search Console. Simply select your website there, then click Sitemaps under Index in the sidebar.

Set the bucket URL as a property (https://my-websites-s3-bucket.s3.amazonaws.com/) and add the rest of the sitemap URL, sitemaps/sitemap.xml.gz. This will most likely show a "Couldn't fetch" error at first, but don't worry: it can take Google up to 10 days to confirm your sitemaps.
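
While you wait, you can check for yourself that the uploaded file is a valid gzipped sitemap. The sketch below decompresses the bytes and lists the URLs inside; sitemap_locs is a made-up helper name, and the commented-out download step assumes the bucket URL used in this article.

```ruby
require "zlib"

# Hypothetical helper: take the raw bytes of a sitemap.xml.gz file
# and return the URLs listed in its <loc> elements.
def sitemap_locs(gzipped_bytes)
  xml = Zlib.gunzip(gzipped_bytes)
  xml.scan(%r{<loc>(.*?)</loc>}m).flatten.map(&:strip)
end

# Usage sketch (needs network access and a publicly readable bucket):
#
#   require "net/http"
#   bytes = Net::HTTP.get(URI("https://my-websites-s3-bucket.s3.eu-central-1.amazonaws.com/sitemaps/sitemap.xml.gz"))
#   puts sitemap_locs(bytes)
```

If the download fails or the list comes back empty, the problem is more likely in the bucket's permissions or the upload step than in Google Search Console.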

The power of mastering SEO

If you own a website, mastering SEO will change the way you see your product. Countless factors help determine the search results, including link structure, keywords, and, most importantly, metadata.

If you want to hide some things that are not relevant to Google's users and would confuse crawlers, you can "tell" Google which parts of your website not to crawl, for example with Disallow rules in robots.txt. Make sure your website is well optimized and running smoothly. The last thing you want is to drive away users with a slow website.

In conclusion, making minor changes to your website can have significant cumulative effects. These changes might seem like incremental improvements, but when combined with other optimizations, they could have a noticeable impact on your site's user experience and performance in organic search results.