The Algorithmic Cookbook
Where Farai Concocts Code

How I Added Absolute URLs On agckb.xyz's RSS Feed


This post is on YouTube too. Check it out!

As I was adding full RSS feeds to Hugo with my custom template, one huge pain point was the inability to have absolute URLs in each RSS entry’s HTML. This issue usually came up on W3C’s Feed Validator as a recommendation, since not having absolute URLs may harm interoperability with RSS readers. In other words, while some RSS readers are able to resolve relative URLs, others might not. Reeder 3, the RSS reader I use on my iThings, for instance, can’t resolve relative URLs, which means relative links don’t go anywhere and images don’t show up.

[Screenshot: the validator’s recommendations for the RSS feed. Of particular note is the last one: “line 46, column 9: description should not contain relative references (137 occurrences)”.]
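To see why this matters, here’s roughly what a reader that *can* resolve relative references does internally, sketched with Node’s WHATWG URL API (the URLs here are illustrative, not from my feed):

```javascript
// How a capable RSS reader resolves a relative reference against the
// entry's base URL. Readers like Reeder 3 skip this step entirely,
// so relative links simply go nowhere. (Illustrative URLs.)
const base = 'https://agckb.xyz/posts/absolute-urls/'

// a root-relative href resolves against the site's origin
console.log(new URL('/images/cover.png', base).href)
// → https://agckb.xyz/images/cover.png

// a plain relative href resolves against the post's URL
console.log(new URL('figure-1.png', base).href)
// → https://agckb.xyz/posts/absolute-urls/figure-1.png
```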

All the approaches I tried within Hugo failed. The most promising was Hugo’s Markdown render hooks, which would emit an absolute URL when the content is being rendered for a feed and a relative one when it’s being rendered as HTML.

The approach is similar to this:

<!--
    in /layouts/_default/_markup/render-link.html
    .Page.Scratch.isFeed is set in the page template in /layouts/_default/single.html
    See https://gohugo.io/functions/scratch/ for more on .Page.Scratch
-->
<a href="{{- if .Page.Scratch.isFeed -}}{{- absURL .Destination -}}{{- else -}}{{- .Destination -}}{{- end -}}">

Whenever a Markdown link is rendered on my Hugo site, that template is used in place of the regular link. Unfortunately, .Content is only rendered once by Hugo, so the hook can’t produce different output for the feed and the page. I’ve done a lot of research on this subject, and the thing that made me give up was reading Refining the RSS by Aral Balkan. Some suggest using regex, but you’ll understand why that’s a bad idea once you read the famous Stack Overflow answer explaining why you can’t parse HTML with regex. Besides, Go’s regex engine makes certain performance guarantees, which means the features that might make this possible, like lookaheads and lookbehinds, aren’t in Go.

To fix this, I decided to think outside the box, literally…ish. Instead of trying to get Hugo to make absolute links in the RSS feed, I decided to write a script that would do it for me. Since agckb.xyz is hosted on GitLab Pages, I already need a CI/CD pipeline to build the website, and with it I can build the website any way I want. Leveraging this, I can build the website and then run a script that turns the relative links into absolute ones.

Here’s how it works. The RSS feed is basically an XML file whose entries contain HTML, which is itself XML-like. Using an XML/HTML parser like cheerio, I can loop through the feed’s entries, pick out the img and a tags with their respective src and href attributes, find the relative URLs and make them absolute.

Here’s the annotated code.

// imports cheerio for XML parsing and fs to read the RSS feed file
const cheerio = require('cheerio')
const fs = require('fs')
const RSS_FEED_FILE = './public/index.xml'

const data = fs.readFileSync(RSS_FEED_FILE, {encoding: 'utf-8'}) //loads the RSS feed file
let $ = cheerio.load(data, {
    xmlMode: true,
}) // initializes cheerio in XML mode

// gets the website's baseurl and strips the last /.
// initially used substr, but that might chop the tld
const baseurl = $('link')[0].firstChild.data.replace(/\/$/, '')

// loops through the feed entries, which are in item tags
$('item').each((i, item) => {
    let posturl = $('link', item)[0].firstChild.data // gets the post's URL
    let description = $('description', item)[0].firstChild.firstChild // extracts the HTML from the <![CDATA[ ]]> in the entry's description
    let html = cheerio.load(description.data, {xmlMode: true}) // loads the HTML into cheerio
    
    //loops through each anchor in the entry's HTML
    // a selector like a:not([href^="http://"]):not([href^="https://"]) would eliminate the prefix checks below
    html('a').each((i, a) => {
        let absurl
        const href = a.attribs.href //gets the anchor's href
        // checks if the link starts with either http:// or https://, which means it's an absolute URL
        const startsWithHttp = href.startsWith('http://')
        const startsWithHttps = href.startsWith('https://')
        if (!(startsWithHttp || startsWithHttps)) { // if it's not absolute
            if(href.startsWith('/')) { //if it starts with a / (from the website's root)
                absurl = `${baseurl}${href}` // append the href to the website's base URL
            } else {
                absurl = `${posturl}${href}` // append the href to the post's URL
            }
            a.attribs.href = absurl //set the anchor's href to absurl
        }
    })

    //similar to how I did it for the anchor, just with src instead of href
    html('img').each((i, img) => {
        let absurl
        const href = img.attribs.src
        const startsWithHttp = href.startsWith('http://')
        const startsWithHttps = href.startsWith('https://')
        if (!(startsWithHttp || startsWithHttps)) {
            if(href.startsWith('/')) {
                absurl = `${baseurl}${href}`
            } else {
                absurl = `${posturl}${href}`
            }
            img.attribs.src = absurl
        }
    })
    description.data = html.html() //updates the description element with the updated src and hrefs
})
fs.writeFileSync(RSS_FEED_FILE, $.xml()) //writes the changes back to the RSS feed file
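One possible cleanup, which isn’t in the script above: the anchor and image loops share the exact same resolution logic, so it could be pulled into a single helper (a sketch; `toAbsolute` is a hypothetical name):

```javascript
// Hypothetical helper capturing the shared logic of both loops: leave
// absolute URLs alone, resolve root-relative hrefs against the site's
// base URL, and resolve everything else against the post's URL.
const toAbsolute = (href, baseurl, posturl) => {
    if (href.startsWith('http://') || href.startsWith('https://')) {
        return href // already absolute
    }
    return href.startsWith('/') ? `${baseurl}${href}` : `${posturl}${href}`
}
```

Each loop body would then reduce to something like `a.attribs.href = toAbsolute(a.attribs.href, baseurl, posturl)`.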

With that done, I had to run the script in GitLab CI, which meant updating the .gitlab-ci.yml file. After plenty of failed attempts, this is the annotated CI configuration I ended up with.

image: node:lts #uses the node.js docker image

pages: #specifies the job for building and deploying the website
  script:
    - wget https://github.com/gohugoio/hugo/releases/download/v0.67.0/hugo_extended_0.67.0_Linux-64bit.deb #downloads Hugo
    - apt install ./hugo_extended_0.67.0_Linux-64bit.deb #installs Hugo
    - rm ./hugo_extended_0.67.0_Linux-64bit.deb #deletes the Hugo download
    - hugo --minify #runs hugo and minifies the HTML
    - npm i cheerio #installs cheerio
    - node ./index.js #runs my script
  only:
    - master #only runs on master branch
  artifacts:
    paths:
      - public # publishes the public/ directory. That's where Hugo built the website.

The main drawback is that the build time increased from 30 seconds to a minute, which, while longer, is still shorter than a build with another static site generator would be (looking at you, Jekyll).
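One way to claw some of that time back might be GitLab CI’s cache, so npm doesn’t reinstall cheerio from scratch on every pipeline run. A sketch of an addition to the pages job (I haven’t measured the actual savings):

```yaml
pages:
  cache:
    paths:
      - node_modules/ # reuse installed packages between pipeline runs
```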

In all, it’s a decent solution; I just wish I could do it in Hugo itself rather than with an external script. Hugo is really big on speed, which means making a lot of tradeoffs, like needing a separate build for Sass support and not allowing plugins. The feed still has other validation recommendations that are very hard to fix since they’re so context-dependent1. I hope to do things properly at some point. There’s so much to consider.


  1. There are lots of problems with escaping in various places, syntax highlighting, URLs and the like. Making things worse is the RSS reader interoperability I’d need to test for. ↩︎