robots.txt is a suggestion

tekeous@usenet.lol to Lemmy Shitpost@lemmy.world – 616 points –

TikTok's spider has been a real offender for me. For one site I host, it burned through 3TB of data over 2 months requesting the same 500 images over and over. It was ignoring the robots.txt too; I ended up having to block their user agent.

Are you sure the caching headers your server is sending for those images are correct? If your server is telling clients not to cache the images, they'll hit the URL again every time.
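Quick way to check what a given image is served with (the URL here is just a placeholder):

```
# Fetch only the response headers and look for Cache-Control / Expires
curl -sI https://example.com/images/photo.jpg | grep -iE 'cache-control|expires'
```

If neither header shows up (or you see no-cache / no-store), a well-behaved client has no reason to reuse what it already downloaded.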

If the image at a particular URL will never change (for example, if your build system inserts a hash into the file name), you can use a far-future expires header to tell clients to cache it indefinitely (e.g. `expires max` in Nginx).
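For what it's worth, a minimal sketch of what that could look like in an Nginx config (the location pattern and root path are placeholders, adjust for your own layout):

```
# Serve static images with a far-future Expires / Cache-Control header.
# Only safe if the content at each URL never changes (e.g. hashed file names).
location ~* \.(png|jpe?g|gif|webp)$ {
    root /var/www/example;   # placeholder document root
    expires max;             # sets Expires far in the future plus a long Cache-Control max-age
}
```

After reloading Nginx, the same curl check above should show the new headers.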

Thanks for the suggestion; it turns out there are no cache headers on these images. They indeed never change, so I'll try that update. Thanks again!