How to Block Search Engine Indexing in Kubernetes with HAProxy

by Alexander Sharov (@kvendingoldo), October 27th, 2024



In Kubernetes, Ingress resources are frequently used as traffic controllers, providing external access to services within the cluster. Ingress is essential for routing incoming traffic to your service. However, there may be scenarios in which you want to prevent search engines from indexing your service's content, for example, when the service is a development environment.


This blog post will walk you through the process of blocking your site's indexing on Kubernetes Ingress using a robots.txt file, preventing search engine bots from crawling and indexing your content.

Prerequisites

To proceed with the tutorial, you should have a basic grasp of core Kubernetes objects, Ingress resources, and the official HAProxy ingress controller. You will also need access to a Kubernetes cluster and the necessary permissions to make configuration changes.

This article assumes that the HAProxy ingress controller is set as the default controller. If it is not, you must add the ingressClassName field to the spec of every Ingress example.

Step 1: Create an Ingress Kubernetes Resource

First, we'll set up a small Ingress resource to expose our service outside of the Kubernetes cluster. Note that at this point, all web crawlers still have access to the service. To apply the code below, use the command kubectl apply -f ingress.yaml.


# file: ingress.yaml
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
 name: hackernoon-close-site-indexing-haproxy-example
 annotations:
   cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
 rules:
   - host: hackernoon-close-site-indexing-haproxy-example.referrs.me
     http:
       paths:
         - path: /
           pathType: Prefix
           backend:
             service:
               name: hackernoon-close-site-indexing-haproxy-example-service
               port:
                 number: 80
 tls:
   - hosts:
       - hackernoon-close-site-indexing-haproxy-example.referrs.me
     secretName: hackernoon-close-site-indexing-haproxy-example


Step 2: Modify the Ingress Configuration

The robots.txt file is used to control how search engines crawl and index a site. This file specifies which URLs search engine crawlers can access on your website.


The most basic file that restricts access to the web service looks like this:

User-agent: *
Disallow: /
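
To see what these two lines mean for a crawler, here is a minimal Python sketch that feeds them to the standard library's robots.txt parser. The helper name is_allowed, the crawler names, and the example.com URL are illustrative assumptions, not part of the original setup.

```python
from urllib.robotparser import RobotFileParser

# The two-line file from the article.
ROBOTS_TXT = "User-agent: *\nDisallow: /\n"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "User-agent: *" applies the rule to every crawler,
# and "Disallow: /" covers every path on the site.
for agent in ("Googlebot", "Bingbot", "SomeOtherCrawler"):
    print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/any/page"))
```

Every crawler that honors robots.txt should be refused for every path, which is the behavior we want for a non-public environment.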


With HAProxy, you do not need to add this file to your web server or website, because HAProxy can return it directly.


This can be achieved with the following configuration, which should be added to the backend section for the specific group of servers:

acl robots_path path_beg /robots.txt
http-request return status 200 content-type "text/plain" lf-string "User-agent: *\nDisallow: /\n" if robots_path
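
The decision this snippet makes can be sketched in Python. This is only a rough model of the two lines, not HAProxy code: the function name handle and the sample request targets are assumptions made for illustration. It shows that path_beg is a prefix match on the URL path (the query string is not part of the path), so anything starting with /robots.txt gets the static response and everything else goes on to the backend servers.

```python
from urllib.parse import urlsplit

ROBOTS_PREFIX = "/robots.txt"
# Same body as the lf-string in the snippet; "\n" becomes a real newline.
ROBOTS_BODY = "User-agent: *\nDisallow: /\n"


def handle(request_target: str):
    """Model of the snippet: return (status, content_type, body) when the
    robots_path ACL matches, or None when the request would be forwarded."""
    path = urlsplit(request_target).path  # path only, without the query string
    if path.startswith(ROBOTS_PREFIX):    # path_beg: prefix match
        return 200, "text/plain", ROBOTS_BODY
    return None


for target in ("/robots.txt", "/robots.txt?lang=en", "/robots.txt.bak", "/index.html"):
    print(target, "->", "static response" if handle(target) else "backend")
```

Because the match is a prefix match, a path such as /robots.txt.bak is also answered by HAProxy. If that matters for your service, an exact-match ACL (path instead of path_beg) is one option to consider.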


Kubernetes annotations control all changes to the HAProxy frontend/backend configuration for a single Ingress resource. The full list of HAProxy annotations can be found in the official documentation on GitHub.


In our case, we need to use haproxy.org/backend-config-snippet with the HAProxy snippet that blocks indexing. To do this, open your Ingress resource YAML file and add the following annotation to the metadata section:


# file: ingress.yaml
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
 # Same name as in Step 1, so kubectl apply updates the existing Ingress
 name: hackernoon-close-site-indexing-haproxy-example
 annotations:
   # …
   haproxy.org/backend-config-snippet: |
     acl robots_path path_beg /robots.txt
     http-request return status 200 content-type "text/plain" lf-string "User-agent: *\nDisallow: /\n" if robots_path
spec:
 # ...


Step 3: Apply the Configuration Changes

After changing the Ingress YAML file, save it and apply it to the Kubernetes cluster with the kubectl command: kubectl apply -f ingress.yaml. The Ingress controller will detect the changes and update the configuration accordingly.

Step 4: Verify the Configuration

Request the robots.txt file to confirm that indexing prevention is working properly. No file is stored anywhere: HAProxy returns this response directly, based on the snippet you supplied in the annotation.


Retrieve the external IP or domain associated with your Ingress resource and append /robots.txt to the URL. Example:

$ curl hackernoon-close-site-indexing-haproxy-example.referrs.me/robots.txt
User-agent: *
Disallow: /


As we can see, the response contains a robots.txt file that disallows crawling of the whole site.
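
If you prefer to automate this check, for example in a deployment pipeline, the same verification can be sketched in Python with the standard library only. The function names and the timeout value are assumptions made for illustration; the expected status, content type, and body are taken from the HAProxy snippet above.

```python
from urllib.request import urlopen

# What the HAProxy snippet is expected to return.
EXPECTED_BODY = "User-agent: *\nDisallow: /\n"


def is_blocking_response(status: int, content_type: str, body: str) -> bool:
    """Return True if the response matches what the snippet should send."""
    return status == 200 and content_type == "text/plain" and body == EXPECTED_BODY


def check_robots(base_url: str, timeout: float = 10.0) -> bool:
    """Fetch <base_url>/robots.txt and check it against the expected response.

    urlopen raises an exception for network errors and for HTTP error
    statuses such as 404, so a missing file surfaces as an exception.
    """
    url = base_url.rstrip("/") + "/robots.txt"
    with urlopen(url, timeout=timeout) as resp:
        status = resp.status
        content_type = resp.headers.get_content_type()
        body = resp.read().decode("utf-8")
    return is_blocking_response(status, content_type, body)
```

Calling check_robots("http://hackernoon-close-site-indexing-haproxy-example.referrs.me") should return True once the annotation is active; substitute the host of your own Ingress.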

Step 5: Test Indexing Prevention

To verify that search engines are not indexing your site, you can run a search for your website on popular search engines. Keep in mind that search results sometimes take some time to reflect changes, so the indexing status may not be fully updated right away.

Conclusion

Annotations make it easy to avoid search engine indexing while using HAProxy Kubernetes Ingress. By adding the appropriate annotation to your Ingress resource, you can prohibit search engine bots from crawling and indexing your website's content. A similar approach can be used with other ingress controllers, such as NGINX, Traefik, and others. A similar annotation can also be used for Kubernetes Gateway API resources, which are actively replacing Ingresses.


As a final note, robots.txt is a time-honored way for website creators to specify whether or not their sites should be crawled by various bots. However, AI crawlers from large language model (LLM) companies frequently ignore the contents of robots.txt and crawl your site regardless. To guard against this, use password protection, the noindex directive, or enterprise load balancer features like HAProxy AI-crawler, which may also be configured as a Kubernetes annotation.