TLDR Hakan saw 504 errors from fluent-bit when sending logs to openobserve, even though logs were visible in openobserve. Prabhat suggested increasing the ingester replicas, which did not help. The issue was not resolved in the thread.
How many resources have you given to the ingester pod(s)? How much data are you sending to them? Do you think you are overwhelming the ingester pods?
I am running the ingester with 2 replicas, each with 0.5 CPU and 500MB memory. Fluent-bit is currently deployed to one cluster consisting of 22 nodes and a total of ~300 pods
would having more replicas help?
Are you getting any logs on the ingester pods that indicate errors? Ideally speaking, 22 nodes should not be too much data, but please check for any logs on the ingester/router pods. Also see if increasing the replicas of ingesters helps.
on the router
```[2023-08-13T15:23:28Z ERROR openobserve::service::router]```
Looks like the ingesters are overloaded. How is resource utilization (CPU and memory) on the ingester pods? I think increasing ingester replicas should help.
resource utilization looks good, nowhere near the limits, both router and ingester running 3 replicas, but still getting mostly 504 errors (timeout)
Just to confirm, can you try increasing the replicas of the ingester to see if that helps?
I scaled the ingester up to 10 replicas, but still getting 504 timeout errors on fluent-bit
Cool. It's definitely not a load issue then.
I think we could use some help in debugging. Give me some time
sure, no trouble. on another note, it looks like some prometheus `remote-write` requests are also timing out:
```[2023-08-13T18:03:17Z ERROR openobserve::service::router]```
Most likely the root cause is the same.
You are using 0.5.1, right?
yes
Let me DM you with some details for debugging
Hakan
Sun, 13 Aug 2023 14:52:39 UTC
Hi team, I am trying to connect fluent-bit to openobserve following the blog entry dated June 4th. I am able to see logs in openobserve, but at the same time, I constantly get this error message on every fluent-bit pod:
```[2023/08/13 14:47:21] [ warn] [engine] failed to flush chunk '1-1691937817.530907721.flb', retry in 45 seconds: task_id=162, input=tail.0 > output=http.0 (out_id=0)
[2023/08/13 14:47:21] [error] [output:http:http.0], HTTP status=504
<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
</body>
</html>```
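To get a rough picture of how often flushes fail, the two fluent-bit log lines quoted above can be tallied with a small script. This is only a sketch: the regular expressions are derived from those two sample lines and may not match other fluent-bit versions or outputs, and `summarize` and `SAMPLE` are hypothetical names introduced here.

```python
import re
from collections import Counter

# Patterns derived from the two log lines quoted in this thread (an assumption:
# other fluent-bit versions or plugins may format these messages differently).
STATUS_RE = re.compile(
    r"\[error\] \[output:(?P<output>[^\]]+)\].*HTTP status=(?P<status>\d{3})"
)
RETRY_RE = re.compile(
    r"failed to flush chunk '(?P<chunk>[^']+)', retry in (?P<delay>\d+) seconds"
)


def summarize(lines):
    """Count HTTP error statuses per output and retry notices per chunk."""
    statuses = Counter()
    retried_chunks = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            statuses[(match["output"], int(match["status"]))] += 1
        match = RETRY_RE.search(line)
        if match:
            retried_chunks[match["chunk"]] += 1
    return statuses, retried_chunks


# The two log lines from the message above, used as sample input.
SAMPLE = [
    "[2023/08/13 14:47:21] [ warn] [engine] failed to flush chunk "
    "'1-1691937817.530907721.flb', retry in 45 seconds: task_id=162, "
    "input=tail.0 > output=http.0 (out_id=0)",
    "[2023/08/13 14:47:21] [error] [output:http:http.0], HTTP status=504",
]

if __name__ == "__main__":
    statuses, retried = summarize(SAMPLE)
    for (output, status), count in statuses.items():
        print(f"{output}: HTTP {status} x{count}")
    for chunk, count in retried.items():
        print(f"{chunk}: scheduled for retry x{count}")
```

A chunk that shows up many times is being retried repeatedly. Whether it is eventually delivered or dropped depends on the retry settings of the output, which are not shown in this thread.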
Using the curl command, I can see that the connection, authentication and ingestion seem to work.
```curl -v -u :someSecretPasswd -k -d [{}]
* Trying 10.64.52.3:443...
* Connected to (10.64.52.3) port 443 (#0)
* ALPN: offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
* (304) (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN: server accepted h2
* Server certificate:
* subject: CN=
* start date: August 10 00:00:00 2023 GMT
* expire date: September 9 23:59:59 2024 GMT
* issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M01
* SSL certificate verify ok.
* using HTTP/2
* Server auth using Basic with user ''
* h2h3 [:method: POST]
* h2h3 [:path: /api/default/default/_json]
* h2h3 [:scheme: https]
* h2h3 [:authority: ]
* h2h3 [authorization: Basic Zmx1ZW50QG15Lm5ldDpzb21lU2VjcmV0UGFzc3dkCg==]
* h2h3 [user-agent: curl/7.88.1]
* h2h3 [accept: */*]
* h2h3 [content-length: 4]
* h2h3 [content-type: application/x-www-form-urlencoded]
* Using Stream ID: 1 (easy handle 0x14a00bc00)
> POST /api/default/default/_json HTTP/2
> Host:
> authorization: Basic Zmx1ZW50QG15Lm5ldDpzb21lU2VjcmV0UGFzc3dkCg==
> user-agent: curl/7.88.1
> accept: */*
> content-length: 4
> content-type: application/x-www-form-urlencoded
>
* We are completely uploaded and fine```
What should I do about the error message? I am not sure whether some of the logs pushed to the openobserve API are being lost.
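A single successful curl does not show how often requests time out. One way to check is to send the same small request repeatedly and record the status and latency of each attempt. The sketch below assumes the `/api/{org}/{stream}/_json` path and basic auth seen in the curl trace. The host name, user, and password are placeholders, the `application/json` content type is an assumption (the curl trace used curl's default form content type), and `build_ingest_request` and `probe` are hypothetical helper names.

```python
import base64
import json
import time
import urllib.error
import urllib.request


def build_ingest_request(base_url, org, stream, user, password, records):
    """Build a POST to the JSON ingestion path seen in the curl trace."""
    url = f"{base_url.rstrip('/')}/api/{org}/{stream}/_json"
    body = json.dumps(records).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            # Assumption: the endpoint accepts JSON with this content type.
            "Content-Type": "application/json",
        },
    )


def probe(request, timeout=60.0):
    """Send the request once and return (HTTP status, seconds elapsed)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            response.read()
            status = response.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses such as 504 arrive here; keep the status code.
        status = err.code
    return status, time.monotonic() - start


if __name__ == "__main__":
    # Placeholder host and credentials; replace with real values.
    req = build_ingest_request(
        "https://openobserve.example.com",
        "default",
        "default",
        "fluent@example.com",
        "someSecretPasswd",
        [{"message": "probe"}],
    )
    for attempt in range(20):
        try:
            status, elapsed = probe(req)
            print(f"attempt {attempt}: HTTP {status} in {elapsed:.2f}s")
        except OSError as err:
            # Connection failures and client-side timeouts land here.
            print(f"attempt {attempt}: failed ({err})")
```

If the 504 responses cluster around one fixed latency, that would hint at a timeout in a proxy or load balancer in front of the router, though the thread does not establish where the timeout originates.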