Putting Thanos gRPC Endpoints Behind a K8s Ingress

Wade Rossmann
4 min read · Mar 17, 2023


I’ve recently stepped up my K8s Cluster game and deployed clusters to AWS that properly integrate with AWS [thanks kops] so I can spin up these newfangled “Load Balancers” on-demand. Prior to this I had been hand-allocating NodePorts and suffering.

Anyhow, at $12/mo per pop, allocating a couple LBs per cluster for applications to applicate is no big deal. But for low-traffic internal tooling/monitoring I like to keep expenses low to minimize the chance that I have to explain myself to an unhappy bean-counter.

“But wait… Thanos uses gRPC and that’s just hand-waving around HTTP, right? I should be able to do an Ingress at this”

Which, while technically true, is not exactly extensively documented. So I’m going to bundle the information I scraped together here into as cohesive of a doc as my scattered brain can produce.

Note: This doc assumes that you are using the Nginx Ingress Controller, which is fairly commonly deployed.

First: TLS

The main sticking point is that while gRPC operates over HTTP/2, Thanos’ defaults _do not_ enable TLS. “Wait, doesn’t one imply the other?” I thought so too, but HTTP/2 can also run in cleartext [usually called h2c]; it’s mostly browsers that insist on TLS.

However, in order for Nginx to proxy HTTP/2, TLS is required. This means that you’re going to need some certificates, and either you’ll need to take steps to ensure that all containers can validate them, or you’ll need to disable certificate validation.

For the purpose of this doc I will be assuming that all certs are self-signed and disabling TLS validation. This is not something that I would ordinarily do or advise, but at least in my case all of this traffic is internal and governed by strict security groups, and we don’t consider monitoring data to be sensitive or an attack vector. That said, fixing these certificates is still logged as a technical debt item in our backlog.

If you want to easily bake in certs to K8s deployments, I suggest using cert-manager. Creating a self-signed cert is as easy as:

---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: thanos-storage
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: thanos-storage
spec:
  secretName: thanos-storage-cert
  dnsNames:
    - "*.thanos.svc.cluster.local"
    - "*.thanos"
  issuerRef:
    name: thanos-storage
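
If you do get around to validating these certs instead of skipping verification, it helps to know which names the dnsNames above actually cover. A leading "*." wildcard stands in for exactly one DNS label, so a rough sketch of the matching logic looks like the following. The service name thanos-store is a made-up example, and this is a simplified illustration rather than what any TLS library literally runs:

```python
def san_matches(hostname, dns_names):
    """Return True if hostname is covered by any entry in dns_names.

    Simplified rule: a leading "*" label matches exactly one DNS label;
    every other label must match exactly (case-insensitive).
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    for pattern in dns_names:
        pat_labels = pattern.lower().rstrip(".").split(".")
        if len(pat_labels) != len(host_labels):
            continue  # a wildcard never spans more than one label
        if pat_labels[0] == "*":
            if host_labels[0] and pat_labels[1:] == host_labels[1:]:
                return True
        elif pat_labels == host_labels:
            return True
    return False


# dnsNames copied from the Certificate above.
DNS_NAMES = ["*.thanos.svc.cluster.local", "*.thanos"]

# "thanos-store" is a hypothetical Service name in the "thanos" namespace.
print(san_matches("thanos-store.thanos.svc.cluster.local", DNS_NAMES))
print(san_matches("thanos-store.thanos", DNS_NAMES))
print(san_matches("thanos-store", DNS_NAMES))
```

The takeaway is that the bare service name with no namespace suffix is not covered, so clients would need to dial one of the longer forms.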

Second: Sidecar Ingress

My primary focus is having the Thanos Sidecar data for our clusters accessible to the main Thanos Query instance, so each Prometheus/Thanos “collector” instance got the following Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: thanos-grpc
  namespace: {{ .Release.Namespace }}
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: 'GRPC'
    nginx.ingress.kubernetes.io/force-ssl-redirect: 'true'
    nginx.ingress.kubernetes.io/grpc-backend: 'true'
    nginx.ingress.kubernetes.io/protocol: 'h2c'
    nginx.ingress.kubernetes.io/proxy-read-timeout: '160'
spec:
  rules:
    - host: thanos.clusterA.company.local
      http:
        paths:
          - backend:
              service:
                name: thanos-sidecar
                port:
                  name: thanos-grpc
            path: /
            pathType: ImplementationSpecific
  tls:
    - secretName: {{ .Values.tls.secret }}
      hosts:
        - thanos.clusterA.company.local

The important bits are the annotations. You’ll also want the hostname, eg: thanos.clusterA.company.local, to be an alias to your Ingress Controller’s LB.

Ref: https://thanos.io/tip/operating/cross-cluster-tls-communication.md/#client-clusters-sidecarcingressyaml

Third: TLS for Thanos Storage

Since Sidecar access requires TLS, so too do any Storage Gateways you are talking to via Thanos Query. I won’t exhaustively document the entire deployment manifest, but you’ll want to add something like the following:

spec.template.spec:
  containers[].args:
    - --grpc-server-tls-cert=/certs/tls.crt
    - --grpc-server-tls-key=/certs/tls.key
  containers[].volumeMounts:
    - mountPath: /certs/
      name: thanos-storage-cert
  volumes:
    - name: thanos-storage-cert
      secret:
        secretName: thanos-storage-cert

Where thanos-storage-cert is the name of the secret containing the relevant certificate.

Fourth: Thanos-Query Config

Now you’ll just need to add the grpc-client-* flags to the query args, plus one --store [deprecated] or --endpoint flag for each sidecar instance. Eg:

--grpc-client-tls-secure
--grpc-client-tls-skip-verify
--endpoint=thanos.clusterA.company.local:443

and you should be good to go.
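
If you template your query args from a list of clusters, the pattern is: TLS flags once, then one --endpoint per sidecar. Here’s a minimal sketch of that, using only the flags shown above. The helper name and the clusterB hostname are my own placeholders, not anything Thanos ships:

```python
def thanos_query_args(endpoints, skip_verify=True):
    """Build the TLS-related Thanos Query args for a list of sidecar endpoints.

    The gRPC client TLS flags are emitted once; each endpoint gets its own
    --endpoint flag. skip_verify mirrors --grpc-client-tls-skip-verify, which
    only makes sense while the certs are self-signed and unverifiable.
    """
    args = ["--grpc-client-tls-secure"]
    if skip_verify:
        args.append("--grpc-client-tls-skip-verify")
    args.extend(f"--endpoint={endpoint}" for endpoint in endpoints)
    return args


# clusterB is a hypothetical second cluster, added only for illustration.
SIDECARS = [
    "thanos.clusterA.company.local:443",
    "thanos.clusterB.company.local:443",
]

for arg in thanos_query_args(SIDECARS):
    print(arg)
```

Once the certificates are sorted out properly, passing skip_verify=False drops the skip-verify flag without touching the endpoint list.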

Troubleshooting

There are two tools that I found that will help diagnose issues: grpcurl and grpc_cli, but each has its own foibles.

grpcurl is more user friendly, and works well. Too well. It will work whether or not TLS is enabled, and can give you a false sense of “it works here, but not here?!”

$ grpcurl -insecure thanos.clusterA.company.local:443 list
grpc.health.v1.Health
grpc.reflection.v1alpha.ServerReflection
thanos.Exemplars
thanos.Metadata
thanos.Rules
thanos.Store
thanos.Targets
thanos.info.Info
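
Since grpcurl succeeding doesn’t tell you much about what was negotiated, a lower-level check can help: open a TLS connection with verification disabled [the moral equivalent of -insecure] and ask which protocol ALPN settled on. This is a rough sketch using Python’s standard ssl module; the function names are my own and it is not part of any Thanos or gRPC tooling:

```python
import socket
import ssl


def insecure_h2_context():
    """TLS client context that skips cert validation and offers HTTP/2 via ALPN."""
    ctx = ssl.create_default_context()
    # check_hostname must be disabled before verify_mode can be set to CERT_NONE.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    ctx.set_alpn_protocols(["h2", "http/1.1"])
    return ctx


def negotiated_protocol(host, port=443, timeout=5.0):
    """Connect to host:port over TLS and return the ALPN protocol the server picked.

    Returns "h2" when HTTP/2 was negotiated, "http/1.1" when the server fell
    back, or None when the server did not take part in ALPN at all.
    """
    ctx = insecure_h2_context()
    with socket.create_connection((host, port), timeout=timeout) as raw_sock:
        with ctx.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
            return tls_sock.selected_alpn_protocol()
```

Calling negotiated_protocol("thanos.clusterA.company.local") from a machine that can reach the LB should tell you whether the listener is offering HTTP/2 at all; anything other than "h2" means gRPC clients are going to have a bad time.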

grpc_cli is part of the gRPC dev kit and is equivalently difficult to look directly at, but seems to be more manual and have more control over how the request gets made. That said, I have not been able to figure out how to both enable TLS and disable TLS verification, but at least the error messages are more verbose:

$ grpc_cli ls thanos.clusterA.company.local:443
Received an error when querying services endpoint.
ServerReflectionInfo rpc failed. Error code: 14, message: failed to connect to all addresses; last error: INTERNAL: Trying to connect an http1.x server, debug info: UNKNOWN:Failed to pick subchannel {created_time:"2023-03-16T17:40:55.46469783-07:00", file_line:3246, file:"/builddir/build/BUILD/grpc-1.48.4/src/core/ext/filters/client_channel/client_channel.cc", children:[UNKNOWN:failed to connect to all addresses; last error: INTERNAL: Trying to connect an http1.x server {file:"/builddir/build/BUILD/grpc-1.48.4/src/core/lib/transport/error_utils.cc", file_line:173, created_time:"2023-03-16T17:40:55.464692713-07:00", grpc_status:14}]}

Additionally, grpc_cli produced the same entries in the ingress controller log as Thanos-Query was, which were:

"PRI * HTTP/2.0" 400

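and these seem to be an indication that TLS and/or HTTP/2 were not being negotiated correctly. [PRI * HTTP/2.0 is the opening line of the HTTP/2 connection preface, so the client was speaking HTTP/2 to something that was expecting HTTP/1.x.]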

A “proper” log entry evidently looks more like this:

"POST /thanos.info.Info/Info HTTP/2.0" 200
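
If you’re wading through a lot of ingress controller logs, a small filter can separate the two cases. The sketch below only looks at the quoted request line and the status code that follows it, exactly as in the two entries above; the labels it returns are my own invention, and real log lines carry more fields that this ignores:

```python
import re

# Matches the quoted request line plus the status code that follows it,
# eg: "POST /thanos.info.Info/Info HTTP/2.0" 200
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<target>\S+) HTTP/(?P<version>[\d.]+)" (?P<status>\d{3})'
)


def classify(line):
    """Label an ingress log line by what it suggests about the gRPC connection."""
    match = REQUEST_RE.search(line)
    if match is None:
        return "unparsed"
    if match["method"] == "PRI" and match["target"] == "*":
        # The HTTP/2 connection preface got logged as if it were a request,
        # which means the listener was not speaking HTTP/2 on that connection.
        return "preface-rejected"
    if match["version"].startswith("2") and match["status"] == "200":
        return "ok"
    return "other"


for sample in ('"PRI * HTTP/2.0" 400', '"POST /thanos.info.Info/Info HTTP/2.0" 200'):
    print(classify(sample), "<-", sample)
```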
