Fluent-Bit Connection with OpenObserve: Timeout Errors

TLDR: Hakan saw timeout errors from fluent-bit when sending logs to openobserve, even though the logs were visible. Prabhat tried to address the issue, suggesting increasing the number of ingester replicas. The issue was not resolved in the thread.

Hakan
Sun, 13 Aug 2023 14:52:39 UTC

Hi team, I am trying to connect fluent-bit to openobserve following the blog entry dated June 4th. I am able to see logs in openobserve, but at the same time I constantly get this error message on every fluent-bit pod:

```
[2023/08/13 14:47:21] [ warn] [engine] failed to flush chunk '1-1691937817.530907721.flb', retry in 45 seconds: task_id=162, input=tail.0 > output=http.0 (out_id=0)
[2023/08/13 14:47:21] [error] [output:http:http.0] , HTTP status=504
<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
</body>
</html>
```

Using curl, I can see that the connection, authentication, and ingestion seem to work:

```
curl -v -u :someSecretPasswd -k -d [{}]
*   Trying 10.64.52.3:443...
* Connected to (10.64.52.3) port 443 (#0)
* ALPN: offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
* (304) (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=
*  start date: August 10 00:00:00 2023 GMT
*  expire date: September 9 23:59:59 2024 GMT
*  issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M01
*  SSL certificate verify ok.
* using HTTP/2
* Server auth using Basic with user ''
* h2h3 [:method: POST]
* h2h3 [:path: /api/default/default/_json]
* h2h3 [:scheme: https]
* h2h3 [:authority: ]
* h2h3 [authorization: Basic Zmx1ZW50QG15Lm5ldDpzb21lU2VjcmV0UGFzc3dkCg==]
* h2h3 [user-agent: curl/7.88.1]
* h2h3 [accept: */*]
* h2h3 [content-length: 4]
* h2h3 [content-type: application/x-www-form-urlencoded]
* Using Stream ID: 1 (easy handle 0x14a00bc00)
> POST /api/default/default/_json HTTP/2
> Host:
> authorization: Basic Zmx1ZW50QG15Lm5ldDpzb21lU2VjcmV0UGFzc3dkCg==
> user-agent: curl/7.88.1
> accept: */*
> content-length: 4
> content-type: application/x-www-form-urlencoded
>
* We are completely uploaded and fine
```

What should I do about the error message? I am not 100% sure I am not missing some of the logs pushed to the openobserve API.
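
For context, the Fluent Bit side of this kind of setup is typically an `http` output pointed at OpenObserve's `_json` endpoint. The sketch below is an assumption based on the endpoint visible in the curl test above, not Hakan's actual configuration; the host, credentials, and the `default/default` org/stream are placeholders:

```
# Hypothetical Fluent Bit output for OpenObserve JSON ingestion (not the poster's actual config).
# Host, credentials, and org/stream names are placeholders.
[OUTPUT]
    Name             http
    Match            *
    Host             openobserve.example.com
    Port             443
    URI              /api/default/default/_json
    Format           json
    Json_date_key    _timestamp
    Json_date_format iso8601
    HTTP_User        fluent@example.net
    HTTP_Passwd      someSecretPasswd
    tls              On
    compress         gzip
```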

Prabhat
Sun, 13 Aug 2023 14:59:01 UTC

How many resources have you given to the ingester pod(s)? How much data are you sending to them? Do you think you are overwhelming the ingester pods?

Hakan
Sun, 13 Aug 2023 15:09:19 UTC

I am running the ingester with 2 replicas, each with 0.5 CPU and 500MB memory. Fluent-bit is currently deployed to one cluster consisting of 22 nodes and a total of ~300 pods.

Hakan
Sun, 13 Aug 2023 15:10:31 UTC

would having more replicas help?

Prabhat
Sun, 13 Aug 2023 15:21:52 UTC

Are you getting any logs on the ingester pods that indicate errors? Generally speaking, 22 nodes should not be too much data, but please check for any error logs on the ingester/router pods. Also see if increasing the number of ingester replicas helps.
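
As a rough sketch of those checks on a Kubernetes install (the namespace and label selectors below are assumptions, not from the thread, and depend on how the chart was deployed):

```
# Hypothetical namespace/labels; adjust to your OpenObserve deployment.
kubectl -n openobserve logs -l app.kubernetes.io/component=router --tail=100
kubectl -n openobserve logs -l app.kubernetes.io/component=ingester --tail=100
```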

Hakan
Sun, 13 Aug 2023 15:27:04 UTC

On the router:

```
[2023-08-13T15:23:28Z ERROR openobserve::service::router] : Timeout while waiting for response
[2023-08-13T15:23:28Z INFO actix_web::middleware::logger] 10.64.49.67 "POST /api/default/default/_json HTTP/1.1" 503 34 "2097" "-" "Fluent-Bit" 600.000993
```

Nothing specific on the ingester, except for many responses with code 400 (also on the router).

Prabhat
Sun, 13 Aug 2023 15:28:35 UTC

Looks like the ingesters are overloaded. How is the resource utilization (CPU and memory) on the ingester pods? I think increasing the ingester replicas should help.
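
A quick way to check that, assuming metrics-server is installed in the cluster (the namespace is a guess):

```
# Shows current CPU/memory usage per pod; requires metrics-server.
kubectl -n openobserve top pods
```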

Hakan
Sun, 13 Aug 2023 16:01:44 UTC

Resource utilization looks good, nowhere near the limits. Both the router and the ingester are running 3 replicas, but I am still getting mostly 504 (timeout) errors.

Prabhat
Sun, 13 Aug 2023 16:38:05 UTC

Just to confirm, can you try increasing the ingester replicas to see if that helps?

Hakan
Sun, 13 Aug 2023 17:57:42 UTC

I scaled the ingester up to 10 replicas, but I am still getting 504 timeout errors on fluent-bit.
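
For reference, on a typical Helm-based install this scale-up would look something like the following; the workload name is hypothetical and depends on the release name:

```
# Hypothetical resource name; the ingester is usually a StatefulSet in Helm installs.
kubectl -n openobserve scale statefulset <release>-openobserve-ingester --replicas=10
```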

Prabhat
Sun, 13 Aug 2023 18:03:07 UTC

Cool. It's definitely not a load issue then.

Prabhat
Sun, 13 Aug 2023 18:03:33 UTC

I think we could use some help in debugging. Give me some time

Hakan
Sun, 13 Aug 2023 18:04:36 UTC

Sure, no trouble. On another note, it looks like some prometheus `remote-write` requests are also timing out:

```
[2023-08-13T18:03:17Z ERROR openobserve::service::router] : Timeout while waiting for response
```

Prabhat
Sun, 13 Aug 2023 18:05:06 UTC

Most likely the root cause is the same.

Prabhat
Sun, 13 Aug 2023 18:05:12 UTC

You are using 0.5.1, right?

Hakan
Sun, 13 Aug 2023 18:05:31 UTC

yes

Prabhat
Sun, 13 Aug 2023 18:08:32 UTC

Let me DM you with some details for debugging