The bane of existence for most small pages: web crawlers. They generate most of the traffic this site sees and make my site stats overly optimistic. We could go with robots.txt, but what if that’s not enough? I can tell a valuable bot not to index some part of my site, but: a) some bots ignore it; b) what if I don’t want some bots to even have the chance to ask?

Get that SEO scanning and LLM training out of here!

Blocking crawlers

The rest of this guide assumes the usual OpenBSD web stack: Relayd and Httpd. Relayd is great, and since it works at a higher level than pf, it can read HTTP headers. Luckily, those crawlers send recognizable “User-Agent” strings which we can block.
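
For context, here is a minimal sketch of where such a protocol lives in relayd.conf; the address, names and ports are placeholders, and TLS keypair details are omitted:

public_addr="192.0.2.1"
table <httpd> { 127.0.0.1 }

http protocol "https" {
    # the filter rules discussed below go here
}

relay "https" {
    listen on $public_addr port 443 tls
    protocol "https"
    forward to <httpd> port 8080
}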

First, let’s see who uses my site the most. Assuming you use the “forwarded”1 log style, we can do:

awk -F '"' '{print $6}' <path to log file> | sort | uniq -c | sort
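
To see the heaviest hitters first, the same pipeline can end with a reverse numeric sort; the head count is just an arbitrary cut-off:

awk -F '"' '{print $6}' <path to log file> | sort | uniq -c | sort -rn | head -20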

Then we need to manually select the agents we want to block. It won’t be easy, as the strings are long and contain a lot of unnecessary information - including plain lies. You need to identify which part of the full User-Agent is common to that crawler and can be used for blocking (see the example below).
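
As an illustration (the exact strings will differ), an SEO crawler tends to identify itself roughly like this:

Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)

Here the stable part is “semrush”, which also appears in the contact URL; since the matching described below is case-sensitive, it is safer to pick a fragment whose casing doesn’t change between versions.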

Then we can create block rules in a Relayd protocol. Relayd doesn’t use regexps; instead it allows case-sensitive Lua globs, where a star matches any sequence of characters.

block request method "GET" header "User-Agent" value "*<common part>*"

Remember that the config assumes last-one-wins, so the block rules should be the last ones that match. I just put them at the end of my config. You can create a `block quick…` rule if you want - it will short-circuit the entire protocol (see the sketch below).
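
Assuming I read the filter grammar correctly, a short-circuiting variant of one of the rules below would look roughly like this:

block request quick method "GET" header "User-Agent" value "*Bytespider*"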

Therefore, my “https” protocol now has a series of blocks:

http protocol "https" {
    # most of the protocol omitted
    block request method "GET" header "User-Agent" value "*Bytespider*"
    block request method "GET" header "User-Agent" value "*ahrefs*"
    block request method "GET" header "User-Agent" value "*censys*"
    block request method "GET" header "User-Agent" value "*commoncrawl*"
    block request method "GET" header "User-Agent" value "*dataforseo*"
    block request method "GET" header "User-Agent" value "*mj12*"
    block request method "GET" header "User-Agent" value "*semrush*"
    block request method "GET" header "User-Agent" value "*webmeup*"
    block request method "GET" header "User-Agent" value "*zoominfo*"
}
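
After editing, the usual workflow is to check the configuration, reload relayd, and then verify that a blocked agent no longer gets a 200 (the hostname is a placeholder):

doas relayd -n -f /etc/relayd.conf
doas rcctl reload relayd
curl -s -A "Bytespider" -o /dev/null -w '%{http_code}\n' https://example.org/

Depending on the rule, relayd may simply drop the connection instead of returning an error page, in which case curl reports a failed transfer rather than a status code.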

(The usage of globs was proposed to me on the OpenBSD mailing list.)