TLDR: Dylan hit connectivity issues while testing the OpenObserve router. Prabhat pointed to etcd problems and suggested that the Istio service mesh might be interfering with connectivity. Reinstalling and increasing memory were also suggested, but the issue remained unresolved.
What version of OpenObserve are you using?
`0.7.0`
```
- name: openobserve-{{ .Environment.Name }}
  chart: openobserve/openobserve
  version: 0.7.0
  labels:
    service-name: openobserve
    is-elk: true
  namespace: {{ .Environment.Name }}
  values:
    - ./openobserve/values/{{ .Environment.Name }}.yaml
```
brb
```
[2023-11-16T18:45:07Z ERROR openobserve::service::router]
```
This basically indicates an issue with etcd
Are you already ingesting a lot of data?
> Are you already ingesting a lot of data?
Not yet, just sending test curls atm. Once the logstash output is sent to the router there will be a good amount of data ingested
gotcha, I do have etcd-2 failing, a team member suggested looking into it
yeah, etcd is turning out to be a problem
maintaining it is hard
We will look at replacing it in the long term.
here's the error from etcd-2
``` k --context=development -n openobserve logs openobserve-development-etcd-2
etcd 19:30:23.36
etcd 19:30:23.36 Welcome to the Bitnami etcd container
etcd 19:30:23.36 Subscribe to project updates by watching
```
try deleting the etcd pod along with its pvc
that should help
it's not able to join the cluster
you have only 1 etcd pod up and running right now. right?
so what seems to have happened is that etcd info of the crashing pod is out of sync with the cluster for whatever be the reason
recreating the pod along with the associated pvc should rectify that
The ns looks like this, only etcd-2 seems to be failing (1/2 ready). I'll try deleting the etcd pod and pvc like you suggested
```
NAME                                                     READY   STATUS             RESTARTS          AGE
devops-shell-20231116172852                              1/1     Running            0                 150m
openobserve-development-alertmanager-5d6679958d-dq6hk    2/2     Running            3 (46h ago)       46h
openobserve-development-compactor-7cb5cf5bb-txs9v        2/2     Running            2 (46h ago)       46h
openobserve-development-etcd-0                           2/2     Running            0                 47h
openobserve-development-etcd-1                           2/2     Running            0                 47h
openobserve-development-etcd-2                           1/2     CrashLoopBackOff   552 (3m35s ago)   46h
openobserve-development-ingester-0                       2/2     Running            2 (46h ago)       46h
openobserve-development-querier-6b446774bb-vg4cj         2/2     Running            2 (46h ago)       46h
openobserve-development-router-8669cfc86d-wr45p          2/2     Running            15 (73m ago)      46h
```
so much better. only 1 pod failing
delete the etcd-2 pod along with the pvc
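(For reference, a minimal sketch of that cleanup, assuming the Bitnami etcd chart's default PVC naming `data-<release>-etcd-<ordinal>`; verify the real PVC name with `kubectl get pvc` first:)
```
# Confirm the PVC name bound to etcd-2 before deleting anything
kubectl --context=development -n openobserve get pvc

# Delete the PVC first (it sits in Terminating while the pod still uses it),
# then the pod; the StatefulSet recreates both with a clean data dir
kubectl --context=development -n openobserve delete pvc data-openobserve-development-etcd-2
kubectl --context=development -n openobserve delete pod openobserve-development-etcd-2
```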
`552 restarts` :eyes:
:laughing:
Let's solve the etcd problem first and then we can look at the connectivity problem
alright, pod and pvc were terminated and the new ones are booting up
quite interesting, now etcd-0 has that same error. etcd-1 and etcd-0 terminated and restarted automatically
``` k --context=development -n openobserve logs openobserve-development-etcd-0
etcd 20:14:53.64
etcd 20:14:53.64 Welcome to the Bitnami etcd container
etcd 20:14:53.64 Subscribe to project updates by watching
```
:sigh: :laughing:
``` k --context=development get pods -n openobserve -w
NAME                                                     READY   STATUS             RESTARTS        AGE
devops-shell-20231116172852                              1/1     Running            0               167m
openobserve-development-alertmanager-5d6679958d-dq6hk    2/2     Running            3 (47h ago)     47h
openobserve-development-compactor-7cb5cf5bb-txs9v        2/2     Running            2 (47h ago)     47h
openobserve-development-etcd-0                           1/2     CrashLoopBackOff   4 (58s ago)     2m49s
openobserve-development-etcd-1                           2/2     Running            1 (3m57s ago)   4m13s
openobserve-development-etcd-2                           2/2     Running            0               5m43s
openobserve-development-ingester-0                       2/2     Running            2 (47h ago)     47h
openobserve-development-querier-6b446774bb-vg4cj         2/2     Running            2 (47h ago)     47h
openobserve-development-router-8669cfc86d-wr45p          2/2     Running            15 (90m ago)    47h
```
You did not delete etcd-0 though. It happened on its own after you tried fixing etcd-2. is that right?
yep
Try doing the same with etcd-0
:+1:
delete pvc and pod
alright we're in the rebooting process
looking good atm
```
NAME                                                     READY   STATUS    RESTARTS        AGE
devops-shell-20231116172852                              1/1     Running   0               172m
openobserve-development-alertmanager-5d6679958d-dq6hk    2/2     Running   3 (47h ago)     47h
openobserve-development-compactor-7cb5cf5bb-txs9v        2/2     Running   2 (47h ago)     47h
openobserve-development-etcd-0                           2/2     Running   0               95s
openobserve-development-etcd-1                           2/2     Running   1 (8m32s ago)   8m48s
openobserve-development-etcd-2                           2/2     Running   0               10m
openobserve-development-ingester-0                       2/2     Running   2 (47h ago)     47h
openobserve-development-querier-6b446774bb-vg4cj         2/2     Running   2 (47h ago)     47h
openobserve-development-router-8669cfc86d-wr45p          2/2     Running   15 (95m ago)    47h
```
Now let's run `curl`
yep, looking like the same output, OOMKilled status for the router
lemme check the logs
then give it more memory
that should fix it
what is the current limit ?
for router
pretty low requests
```
Limits:
  cpu:     2
  memory:  1Gi
Requests:
  cpu:     10m
  memory:  40Mi
```
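(For reference, if the limit did need raising, one hedged route is a helm override like the sketch below; the `router.resources.limits.memory` key path is an assumption to check against the chart's values.yaml, and since this release is managed by helmfile the cleaner place is `./openobserve/values/development.yaml`:)
```
# Hypothetical bump of the router memory limit to 2Gi.
# Key path is assumed -- confirm against the openobserve chart's values.yaml.
helm -n openobserve upgrade openobserve-development openobserve/openobserve \
  --version 0.7.0 --reuse-values \
  --set router.resources.limits.memory=2Gi
```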
that is a generous enough limit
Is it still getting OOM killed?
no, happily running, and like previously:
```
Works
curl -X POST ""
```
`helm -n openobserve ls` what is the output of this ?
and `kubectl -n openobserve get svc`
locally it is
```
NAME                      NAMESPACE     REVISION   UPDATED                                STATUS   CHART               APP VERSION
openobserve-development   openobserve   2          2023-11-14 16:15:51.923371 -0500 EST   failed   openobserve-0.7.0   v0.7.0

kubectl -n openobserve get svc
NAME                                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
openobserve-development-alertmanager    ClusterIP   172.20.53.221    <none>        5080/TCP            47h
openobserve-development-compactor       ClusterIP   172.20.16.64     <none>        5080/TCP            47h
openobserve-development-etcd            ClusterIP   172.20.110.205   <none>        2379/TCP,2380/TCP   47h
openobserve-development-etcd-headless   ClusterIP   None             <none>        2379/TCP,2380/TCP   47h
openobserve-development-ingester        ClusterIP   172.20.196.128   <none>        5080/TCP            47h
openobserve-development-querier         ClusterIP   172.20.75.176    <none>        5080/TCP            47h
openobserve-development-router          ClusterIP   172.20.167.149   <none>        5080/TCP            47h
```
`status=failed` :hmm:
> `status=failed` :hmm:
of what ?
ah got it
I see it
maybe you want to reinstall from scratch
something is wrong with your installation
probably will be a lot easier to just reinstall
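(Roughly, a reinstall for this setup might look like the sketch below; the helmfile environment name and the PVC label selector are assumptions based on the helmfile snippet above and the Bitnami chart defaults:)
```
# Remove the failed release
helm -n openobserve uninstall openobserve-development

# Helm leaves StatefulSet PVCs behind, so clear the old etcd data too
# (label selector assumed -- confirm with `kubectl get pvc --show-labels`)
kubectl --context=development -n openobserve delete pvc -l app.kubernetes.io/name=etcd

# Redeploy via helmfile for the development environment
helmfile -e development apply
```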
gotcha, will do
I reinstalled and now seeing
```NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
openobserve-development openobserve 1 2023-11-16 15:43:56.23242 -0500 EST deployed openobserve-0.7.0 v0.7.0```
``` k --context=development get pods -n openobserve -w
NAME READY STATUS RESTARTS AGE
devops-shell-20231116172852 1/1 Running 0 3h21m
openobserve-development-alertmanager-5d6679958d-xp2xs 2/2 Running 2 (6m32s ago) 6m40s
openobserve-development-compactor-7cb5cf5bb-l89mp 2/2 Running 2 (6m32s ago) 6m40s
openobserve-development-etcd-0 2/2 Running 0 6m40s
openobserve-development-etcd-1 2/2 Running 0 6m40s
openobserve-development-etcd-2 2/2 Running 0 6m40s
openobserve-development-ingester-0 2/2 Running 2 (6m28s ago) 6m40s
openobserve-development-querier-6b446774bb-x7g89 2/2 Running 2 (6m32s ago) 6m40s
openobserve-development-router-8669cfc86d-gtddg 2/2 Running 3 (3m25s ago) 6m40s```
but unfortunately I'm seeing the same behavior:
```#works
curl -X POST ""
```
:confused:
2 things
1. Your installation is good now
2. you seem to have 2 containers for each pod.
are you using a service mesh?
yes istio
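(A quick way to confirm the second container is the Istio sidecar; the pod name below is just taken from the earlier listing, and the main container name may differ:)
```
# Expect output like "router istio-proxy" if the sidecar is injected
kubectl --context=development -n openobserve get pod \
  openobserve-development-router-8669cfc86d-gtddg \
  -o jsonpath='{.spec.containers[*].name}'
```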
istio might have something to do with connectivity
hmm gotcha
it intercepts everything
I have not tested with istio or any other service mesh for now
gotcha, but why would the router die? It looks like it's receiving something
router has trouble connecting to etcd
`[2023-11-16T20:47:12Z ERROR openobserve::common::infra::db::etcd] watching prefix: /zinc/observe/nodes/, get message error: grpc request error: status: Unknown, message: "h2 protocol error: error reading a body from connection: stream closed because of a broken pipe", details: [], metadata: MetadataMap { headers: {} }`
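(One hedged way to sanity-check etcd itself from inside the cluster, assuming the Bitnami image's bundled `etcdctl`; if RBAC auth is enabled you may also need `--user root:$ETCD_ROOT_PASSWORD`:)
```
# Check health of all etcd members from one of the etcd pods
kubectl --context=development -n openobserve exec openobserve-development-etcd-0 -c etcd -- \
  etcdctl endpoint health --cluster
```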
Can you disable istio for this namespace to make sure that it is not what is causing the issue?
I have a very strong feeling that istio is the trouble here
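(If injection is enabled via the namespace label, disabling it for a test might look like the sketch below; if it is driven by a revision label or per-pod annotations, adjust accordingly:)
```
# Turn off automatic sidecar injection for the namespace
kubectl --context=development label namespace openobserve istio-injection=disabled --overwrite

# Recreate the workloads so they come back up without the istio-proxy sidecar
kubectl --context=development -n openobserve rollout restart deployment,statefulset
```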
Gotcha, I'll need to check in w/ my team on that, will get back to you
I haven't heard back from my team but I'd just like to thank you Prabhat for all the help in the past few days, very cool project and I'm excited to use it as a potential replacement for our ELK stack!
and just as a note I'll be on vacation for the next 2 weeks, if you don't hear from me that's why :slightly_smiling_face:
Thank you Dylan and enjoy your vacation. Happy thanksgiving.
Dylan
Thu, 16 Nov 2023 18:50:14 UTC
Hey guys! I'm doing some connectivity testing from our development namespace to our openobserve router, which is in an openobserve ns, and getting some strange behavior. Wonder if you guys have any ideas? I'm able to successfully curl test data from a pod in a separate development ns to the router with `curl -X POST "" -H "Content-Type: application/json" -H "Authorization: Basic XXX" -d '[{"level":"info","job":"test","log":"test message for openobserve"}]'`
however when I try to use `openobserve-development-router.openobserve.svc.cluster.local`
I get
```curl -X POST "openobserve-development-router.openobserve.svc.cluster.local:5080/api/default/default/_json" -H "Authorization: Basic XXX" -H "Content-Type: application/json" -d '[{"level":"info","job":"test","log":"test message for openobserve"}]'
curl: (52) Empty reply from server```
the router starts logging a ton of grpc errors
```[2023-11-16T18:45:07Z ERROR openobserve::service::router] : Failed to connect to host: Internal error: connector has been disconnected
[2023-11-16T18:45:07Z INFO actix_web::middleware::logger] 127.0.0.6 "POST /api/default/default/_json HTTP/1.1" 503 58 "-" "-" "curl/7.68.0" 13.950588
[2023-11-16T18:45:07Z INFO actix_web::middleware::logger] 127.0.0.6 "POST /api/default/default/_json HTTP/1.1" 503 58 "-" "-" "curl/7.68.0" 13.943231
[2023-11-16T18:45:07Z ERROR openobserve::service::router] : Failed to connect to host: Internal error: connector has been disconnected
[2023-11-16T18:45:07Z INFO actix_web::middleware::logger] 127.0.0.6 "POST /api/default/default/_json HTTP/1.1" 503 58 "-" "-" "curl/7.68.0" 13.928565
[2023-11-16T18:45:07Z ERROR openobserve::service::router] : Failed to connect to host: Internal error: connector has been disconnected
[2023-11-16T18:45:07Z INFO actix_web::middleware::logger] 127.0.0.6 "POST /api/default/default/_json HTTP/1.1" 503 58 "-" "-" "curl/7.68.0" 13.915907
[2023-11-16T18:45:07Z INFO actix_web::middleware::logger] 127.0.0.6 "POST /api/default/default/_json HTTP/1.1" 503 58 "-" "-" "curl/7.68.0" 14.004568
[2023-11-16T18:45:07Z INFO actix_web::middleware::logger] 127.0.0.6 "POST /api/default/default/_json HTTP/1.1" 503 58 "-" "-" "curl/7.68.0" 13.919026
[2023-11-16T18:45:07Z INFO actix_web::middleware::logger] 127.0.0.6 "POST /api/default/default/_json HTTP/1.1" 503 58 "-" "-" "curl/7.68.0" 14.138667
[2023-11-16T18:45:07Z ERROR openobserve::service::router] : Failed to connect to host: Internal error: connector has been disconnected
[2023-11-16T18:45:07Z ERROR openobserve::service::router] : Failed to connect to host: Internal error: connector has been disconnected
[2023-11-16T18:45:07Z INFO actix_web::middleware::logger] 127.0.0.6 "POST /api/default/default/_json HTTP/1.1" 503 58 "-" "-" "curl/7.68.0" 13.976337
[2023-11-16T18:45:07Z INFO actix_web::middleware::logger] 127.0.0.6 "POST /api/default/default/_json HTTP/1.1" 503 58 "-" "-" "curl/7.68.0" 13.907410
[2023-11-16T18:45:07Z INFO actix_web::middleware::logger] 127.0.0.6 "POST /api/default/default/_json HTTP/1.1" 503 58 "-" "-" "curl/7.68.0" 13.945222
[2023-11-16T18:45:08Z ERROR openobserve::common::infra::db::etcd] watching prefix: /zinc/observe/user/, error: grpc request error: status: Unavailable, message: "error trying to connect: tcp connect error: Connection refused (os error 111)", details: [], metadata: MetadataMap { headers: {} }
[2023-11-16T18:45:08Z ERROR openobserve::common::infra::db::etcd] watching prefix: /zinc/observe/nodes/, error: grpc request error: status: Unavailable, message: "error trying to connect: tcp connect error: Connection refused (os error 111)", details: [], metadata: MetadataMap { headers: {} }
[2023-11-16T18:45:09Z ERROR openobserve::common::infra::db::etcd] watching prefix: /zinc/observe/user/, error: grpc request error: status: Unavailable, message: "error trying to connect: tcp connect error: Connection refused (os error 111)", details: [], metadata: MetadataMap { headers: {} }
[2023-11-16T18:45:09Z ERROR openobserve::common::infra::db::etcd] watching prefix: /zinc/observe/nodes/, error: grpc request error: status: Unavailable, message: "error trying to connect: tcp connect error: Connection refused (os error 111)", details: [], metadata: MetadataMap { headers: {} }
[2023-11-16T18:45:10Z ERROR openobserve::common::infra::db::etcd] watching prefix: /zinc/observe/user/, error: grpc request error: status: Unavailable, message: "error trying to connect: tcp connect error: Connection refused (os error 111)", details: [], metadata: MetadataMap { headers: {} }```
and the pod goes into an OOMKilled status and restarts.
running `nslookup openobserve-development-router.openobserve.svc.cluster.local` gives
```Server: 172.20.0.10
Address: 172.20.0.10#53
Name: openobserve-development-router.openobserve.svc.cluster.local
Address: 172.20.167.149```
and using that address seems to work fine
``` curl -X POST "" -H "Content-Type: application/json" -H "Authorization: Basic XXX" -d '[{"level":"info","job":"test","log":"test message for openobserve"}]'
{"code":200,"status":[{"name":"default","successful":1,"failed":0}]}```
I'm also seeing an etcd-2 connectivity issue that is likely its own issue: `openobserve-development-etcd-2 1/2 CrashLoopBackOff 531 (59s ago) 44h`
Essentially I want to point the logstash output to `openobserve-development-router.openobserve.svc.cluster.local` to start getting real data from our microservices, but I'm hesitant to do that while the curl is failing
```output: |-
if [fields][type] == "application-logs" {
elasticsearch {
hosts => [""]
user => ""
password => "${}"
index => "application-logs-%{+YYYY.MM.dd}"
ilm_enabled => false
manage_template => false
pool_max => 65536
}
http {
url => [""]
format => "json"
http_method => "post"
content_type => "application/json"
headers => ["Authorization", "Basic XXX"]
mapping => {
"@timestamp" => "%{[@timestamp]}"
"source" => "%{[source]}"
"tags" => "%{[tags]}"
"logdate" => "%{[logdate]}"
"level" => "%{[level]}"
"thread" => "%{[thread]}"
"class" => "%{[class]}"
"line" => "%{[line]}"
"msg" => "%{[msg]}"
"server" => "%{[fields][service]}"
"log_type" => "%{[fields][logType]}"
"host_name" => "%{[host][name]}"
}
}
}```