TLDR: Mike ran into storage space exhaustion and sled corruption while trying out OpenObserve. Prabhat helped identify the cause and suggested a bigger PVC or an object store such as S3. Recovery of the corrupted sled database remains unexplored.
Hey Mike, Thanks for trying out OpenObserve.
> enabled metrics collection
Are you capturing metrics using Prometheus remote write?
Correct
Got it. How fast are you capturing it? How much space was consumed?
I've also added a filter to remove the noisiest metrics to prevent this in the future. Two in particular were huge:
• apiserver_request_slo_duration_seconds_bucket
• apiserver_request_duration_seconds_bucket
Both of these were taking 3.5+ GB for a couple days of retention
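(For context, the filter is just a drop rule in the Prometheus remote_write section - a sketch of the shape it takes is below. The OpenObserve endpoint URL and the surrounding config layout are placeholders, not the exact config used here.)
```yaml
# Prometheus remote_write with a drop rule for the two noisy apiserver histograms.
# The url is a placeholder - point it at your OpenObserve ingestion endpoint.
remote_write:
  - url: "http://openobserve.monitoring.svc:5080/api/default/prometheus/api/v1/write"
    write_relabel_configs:
      # Drop the buckets that were consuming 3.5+ GB over a couple of days.
      - source_labels: [__name__]
        regex: "apiserver_request_slo_duration_seconds_bucket|apiserver_request_duration_seconds_bucket"
        action: drop
```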
The way it works is that OpenObserve first stores incoming data in the WAL (JSON files), uncompressed, and then periodically converts it into compressed parquet.
I was using the default 10 GB PVC in your reference statefulset and exhausted it within 48 hours
ah hang on there's a second one - copy/paste error. both related to apiserver
oh
Looks like you are generating a lot of data.
Are you on a cloud or in your own data center?
This is a small homelab, a k3s cluster with 3 nodes, 1 master. The size surprised me
k8s v1.24.3
Is there a way for you to check whether the data has moved out of the WAL, or if it is still sitting in the WAL on the PVC?
I suspect that data got stuck in the WAL, for some reason did not get cleared out, and filled up the space
checking - spinning up a pod attached to the PVC
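(The inspection pod is just a throwaway pod that mounts the data PVC - something along these lines. The claim name and image are illustrative, not necessarily what was used here.)
```yaml
# Throwaway pod for poking around /data/wal on the OpenObserve PVC.
# claimName is a guess at the statefulset's generated PVC name - adjust to yours.
apiVersion: v1
kind: Pod
metadata:
  name: wal-inspect
spec:
  restartPolicy: Never
  containers:
    - name: shell
      image: busybox:1.36
      command: ["sleep", "3600"]   # keep the pod alive so you can kubectl exec into it
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-openobserve-0
```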
this also caused sled corruption
there's about 785 MB of data in `/data/wal`, mostly under `/data/wal/files`
mostly metrics - I suppose this was ingested data not yet persisted to the parquet format?
are the files in .json format?
yes
perfect. You are right. This was the data that could not get persisted to parquet format.
With your rate of data generation you need a bigger PVC than 10 GB.
The WAL actually works as a kind of buffer: if data arrives faster than you have the compute power to process it, it sits in the WAL and is gradually converted to parquet.
I've filtered the excessively noisy metrics from remoteWrite to avoid sending them, but will also enlarge the PVC
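For the PVC side, the relevant bit is just the storage request in the statefulset's volumeClaimTemplates - an illustrative sketch is below, with the size picked arbitrarily for headroom.
```yaml
# volumeClaimTemplates from the single-node statefulset (illustrative values).
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 50Gi   # up from the default 10Gi that filled in ~48 hours
```
One caveat: volumeClaimTemplates on an existing StatefulSet can't be changed in place, so either resize the PVC directly (works if the StorageClass allows volume expansion) or delete and recreate the StatefulSet.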
can WAL be cleared / existing db recovered somehow, now that space is freed?
If sled is not corrupted, then the next time OpenObserve starts it will see that the data is still in the WAL and will start processing it.
I am not sure how to recover sled. Let me check that. Even for a home lab, if you have the option to push data to S3 (enough bandwidth and internet speed), you should use it together with sled. Your performance should be fine since OpenObserve caches recent data in memory on the node.
Understood - will look at changing that to s3 or another compatible store
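For reference, a single-node install is typically pointed at an S3-compatible store through environment variables on the OpenObserve container - a rough sketch is below. The variable names are recalled from the docs of this era and may differ by version, and the endpoint, bucket, and secret names are made up, so verify against the current OpenObserve documentation.
```yaml
# Environment variables on the OpenObserve container (illustrative; verify names against the docs).
env:
  - name: ZO_LOCAL_MODE
    value: "true"
  - name: ZO_LOCAL_MODE_STORAGE     # switch the object store backend from local disk to s3
    value: "s3"
  - name: ZO_S3_SERVER_URL          # endpoint of the S3-compatible store (e.g. MinIO)
    value: "https://s3.example.internal"
  - name: ZO_S3_REGION_NAME
    value: "us-east-1"
  - name: ZO_S3_BUCKET_NAME
    value: "openobserve-data"
  - name: ZO_S3_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: openobserve-s3        # hypothetical secret holding the credentials
        key: access_key
  - name: ZO_S3_SECRET_KEY
    valueFrom:
      secretKeyRef:
        name: openobserve-s3
        key: secret_key
```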
thanks for taking the time to chat about this, by the way. Appreciate the openness
by the way, I also created my own single-node focused helm chart based on the single-node manifest. I wanted to use helm but add a few bits of flexibility to kick the tires without the full number of components in the official chart
Ah, I see, you are that awesome guy from Reddit.
intention is not to compete with the official chart, just provide another option for simpler deployment. I'll add a bit to the readme there about PVC sizing and using an object store as best practices even for homelab :slightly_smiling_face:
Guilty as charged :slightly_smiling_face: I had set up a few dashboards and an alert rule - would love to be able to recover at least the config if possible, but if not, no big deal - can start fresh. Just let me know if you know of a way to run some kind of sled db recovery
Haven't done a sled recovery yet. Will give it a shot. First, I'll need to figure out how to corrupt it, though.
Mike
Sun, 11 Jun 2023 17:30:28 UTC
Hey all - digging openobserve so far. Running as a single instance with sled right now -- however, I enabled metrics collection and filled up my storage space quicker than I expected :boom: Sled complains of empty/corrupt snapshot:
```
[2023-06-11T17:00:29Z INFO openobserve] Starting OpenObserve v0.4.7
[2023-06-11T17:00:29Z WARN sled::pagecache::snapshot] empty/corrupt snapshot file found
thread 'tokio-runtime-worker' panicked at 'sled db dir create failed: Corruption { at: None, bt: () }', src/infra/db/sled.rs:313:50
stack backtrace:
   0: 0x55b9da001b87 - <unknown>
```
Is there any way to recover from this? Sled docs are... lacking :slightly_smiling_face:. I've cleared space hoping it'd self recover but no dice.