#openobserve

Erroneous Triggering of Alarm in 0.5.1

TLDR Chris expressed experiencing occasional erroneous alerts trigger on a specific stream query. Ashish suggested the issue might be linked to a known duplicates bug. Uncertain, Chris decided to monitor the issue further and report if it persists.

Powered by Struct AI

1

Aug 16, 2023 (3 months ago)
Chris
Photo of md5-7d2796ababc14477f4c09abc2e8edb05
Chris
09:08 AM
0.5.1 on-prem - has anyone noticed that alerts very occasionally trigger erroneously? e.g. i have an alert with sql
select count(message) as occurrences from 'streamname' where message like '%ALERT%'

and a condition of occurrences > 0, duration 5 mins, frequency 5 mins, delay 10 mins.

It was set up a week ago, tested ok, and has been ok until last night it fired.... there certainly isn't anything in the stream containing ALERT..... i did notice some false triggers when i was experimenting but thought that may have just been due to the rapid changes i was doing.... i include the {timestamp} in the template and it agrees with when the alert triggered
Prabhat
Photo of md5-23052f31f8f3c4b1bb3297fbc3a2aec5
Prabhat
09:41 AM
It could be due to duplicate query results bug in 0.5.1? Since alerts rely on standard query and if it gets more rows than expected it will fire.
Chris
Photo of md5-7d2796ababc14477f4c09abc2e8edb05
Chris
09:43 AM
it would have to be more than just a duplicate though... as there aren't any matches at all for that string in that stream (my testing was done on a temporary stream that has been deleted now)
Prabhat
Photo of md5-23052f31f8f3c4b1bb3297fbc3a2aec5
Prabhat
09:44 AM
Ashish any insights?
Chris
Photo of md5-7d2796ababc14477f4c09abc2e8edb05
Chris
09:45 AM
ah, actually there are (forgot i had backfilled some events).... they are from April 26.... would that still be susceptible to the duplicates bug?
Ashish
Photo of md5-9ed257a93c49bf4a991f872cc2ea4cda
Ashish
09:47 AM
do you have _timestamp in incoming data
09:48
Ashish
09:48 AM
if no…even if backdated data is being ingested will have _timestamp with current utc value and can result in alerts
Chris
Photo of md5-7d2796ababc14477f4c09abc2e8edb05
Chris
09:48 AM
i always set @timestamp
Ashish
Photo of md5-9ed257a93c49bf4a991f872cc2ea4cda
Ashish
09:49 AM
_timestamp is default field to get timestamp value…did you change settings to consider @timestamp as timestamp column
Chris
Photo of md5-7d2796ababc14477f4c09abc2e8edb05
Chris
09:50 AM
just checked one of them... oo displays _timestamp of Apr 26, and the GUI displays Apr 26
Ashish
Photo of md5-9ed257a93c49bf4a991f872cc2ea4cda
Ashish
09:51 AM
in that case alert wont be evaluated on 26 apr data
09:51
Ashish
09:51 AM
as query fired is based on _timestamp & not data ingestion time
09:53
Ashish
09:53 AM
can you check querier logs…it should have query fired by alertmanager
09:53
Ashish
09:53 AM
just to be sure…you dont have any real time alerts on same stream
Chris
Photo of md5-7d2796ababc14477f4c09abc2e8edb05
Chris
09:53 AM
i haven't changed the default timestamp field... sure i remember reading that setting @timestamp was correct???? (but can't remember exactly what i read and where)... the timestamps on all my streams are showing the correct _timestamp though
09:54
Chris
09:54 AM
will check the logs... no real-time queries

1

10:26
Chris
10:26 AM
here are the logs... this oo is a single instance so everything is here.... there are actually 3 scheduled alerts on this stream - the above one and two glass-floor ones which (less frequently) check that there is at least one log entry from each of 2 application hosts... all the alerts seemed to run their checks concurrently (but only the ALERT one fired)
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:grpc:in_storage;
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:grpc:in_wal;
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:grpc:in_storage;
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:grpc:storage:enter;
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:grpc:storage:enter;
[2023-08-15T20:32:16Z INFO  openobserve::service::search::grpc::storage] search->storage: org pe, stream streamname, load files 1, scan_size 15170, compressed_size 6814
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:grpc:storage:cache_parquet_files;
[2023-08-15T20:32:16Z INFO  openobserve::service::search::grpc::storage] search->storage: org pe, stream streamname, load files 4, scan_size 318391, compressed_size 40972
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:grpc:storage:cache_parquet_files;
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:wal:enter;
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:grpc:wal:get_file_list;
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:wal:enter;
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:grpc:wal:get_file_list;
[2023-08-15T20:32:16Z INFO  openobserve::service::search::sql] sqlparser: stream_name -> "streamname", fields -> [], partition_key -> [("host", "host-b", Eq)], full_text -> [], time_range -> Some((1692127336196919, 1692131536196919)), order_by -> [], limit -> 0,100
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:grpc:in_wal;
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:grpc:in_storage;
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:grpc:storage:enter;
[2023-08-15T20:32:16Z INFO  openobserve::service::search::grpc::storage] search->storage: org pe, stream streamname, load files 4, scan_size 318391, compressed_size 40972
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:grpc:storage:cache_parquet_files;
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:wal:enter;
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:grpc:wal:get_file_list;
[2023-08-15T20:32:16Z INFO  openobserve::service::search::grpc::wal] wal->search: load files 1, scan_size 19023
[2023-08-15T20:32:16Z INFO  openobserve::service::search::grpc::wal] wal->search: load files 1, scan_size 19023
[2023-08-15T20:32:16Z INFO  openobserve::service::search::grpc::wal] wal->search: load files 1, scan_size 19023
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:grpc:wal:datafusion;
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:grpc:wal:datafusion;
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:grpc:wal:datafusion;
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Query sql: select count(*) as occurrences FROM tbl  where (_timestamp >= 1692127336196919 AND _timestamp < 1692131536196919) AND host = 'host-b' LIMIT 100
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Query sql: select count(*) as occurrences FROM tbl  where (_timestamp >= 1692127336196941 AND _timestamp < 1692131536196941) AND host = 'host-c' LIMIT 100
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Query sql: select count(message) as occurrences FROM tbl  where (_timestamp >= 1692131236196967 AND _timestamp < 1692131536196967) AND message like '%ALARM%' LIMIT 100
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Query took 0.001 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Query all took 0.002 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Query took 0.002 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Query all took 0.002 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Query took 0.003 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Query all took 0.003 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search::grpc::storage] search->storage: org pe, stream streamname, load files 1, into memory cache done
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:grpc:storage:datafusion;
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Query sql: select count(message) as occurrences FROM tbl  where (_timestamp >= 1692131236196967 AND _timestamp < 1692131536196967) AND message like '%ALARM%' LIMIT 100
[2023-08-15T20:32:16Z INFO  openobserve::service::search::grpc::storage] search->storage: org pe, stream streamname, load files 4, into memory cache done
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:grpc:storage:datafusion;
[2023-08-15T20:32:16Z INFO  openobserve::service::search::grpc::storage] search->storage: org pe, stream streamname, load files 4, into memory cache done
[2023-08-15T20:32:16Z INFO  tracing::span] service:search:grpc:storage:datafusion;
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Query sql: select count(*) as occurrences FROM tbl  where (_timestamp >= 1692127336196941 AND _timestamp < 1692131536196941) AND host = 'host-c' LIMIT 100
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Query sql: select count(*) as occurrences FROM tbl  where (_timestamp >= 1692127336196919 AND _timestamp < 1692131536196919) AND host = 'host-b' LIMIT 100
[2023-08-15T20:32:16Z INFO  tracing::span] datafusion::storage::memory::list;
[2023-08-15T20:32:16Z INFO  tracing::span] datafusion::storage::memory::list;
[2023-08-15T20:32:16Z INFO  tracing::span] datafusion::storage::memory::list;

10:27
Chris
10:27 AM
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Query took 0.002 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Query took 0.002 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Query all took 0.002 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Query all took 0.002 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Query took 0.001 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Query all took 0.002 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] merge_write_recordbatch took 0.000 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] merge_write_recordbatch took 0.000 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] merge_write_recordbatch took 0.000 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] merge_rewrite_sql took 0.000 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] merge_rewrite_sql took 0.000 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] merge_rewrite_sql took 0.000 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Merge sql: SELECT sum("occurrences") as "occurrences" FROM  tbl   LIMIT 100
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Merge sql: SELECT sum("occurrences") as "occurrences" FROM  tbl   LIMIT 100
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Merge sql: SELECT sum("occurrences") as "occurrences" FROM  tbl   LIMIT 100
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Merge took 0.001 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Merge took 0.001 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Merge took 0.001 seconds.
[2023-08-15T20:32:16Z INFO  openobserve::service::search] search->grpc: result node: 1, is_querier: true, total: 0, took: 19, files: 5, scan_size: 0
[2023-08-15T20:32:16Z INFO  openobserve::service::search] search->grpc: result node: 1, is_querier: true, total: 0, took: 19, files: 2, scan_size: 0
[2023-08-15T20:32:16Z INFO  openobserve::service::search] search->grpc: result node: 1, is_querier: true, total: 0, took: 19, files: 5, scan_size: 0
[2023-08-15T20:32:16Z INFO  openobserve::service::search] search_in_cluster: query num_batches: 1
[2023-08-15T20:32:16Z INFO  openobserve::service::search] search_in_cluster: query num_batches: 1
[2023-08-15T20:32:16Z INFO  openobserve::service::search] search_in_cluster: query num_batches: 1
[2023-08-15T20:32:16Z INFO  openobserve::service::search] search->result: total: 1, took: 21, scan_size: 0
[2023-08-15T20:32:16Z INFO  openobserve::service::search] search->result: total: 1, took: 21, scan_size: 0
[2023-08-15T20:32:16Z INFO  openobserve::service::search] search->result: total: 1, took: 21, scan_size: 0
[2023-08-15T20:32:16Z INFO  tracing::span] save_trigger;
Ashish
Photo of md5-9ed257a93c49bf4a991f872cc2ea4cda
Ashish
10:32 AM
[2023-08-15T20:32:16Z INFO  openobserve::service::search::datafusion::exec] Query sql: select count(message) as occurrences FROM tbl  where (_timestamp >= 1692131236196967 AND _timestamp < 1692131536196967) AND message like '%ALARM%' LIMIT 100

if we check this log…. having time 2023-08-15T20:32:16Z query is appropriately fired for last 5 mins of duration
Chris
Photo of md5-7d2796ababc14477f4c09abc2e8edb05
Chris
10:33 AM
yep, i couldn't see anything wrong with the epoch values i decoded
Ashish
Photo of md5-9ed257a93c49bf4a991f872cc2ea4cda
Ashish
10:34 AM
can you fire the query select count(message) as occurrences FROM tbl where (_timestamp &gt;= 1692131236196967 AND _timestamp &lt; 1692131536196967) AND message like ‘%ALARM%’ LIMIT 100
10:35
Ashish
10:35 AM
for now you can increase limit to 1000
Chris
Photo of md5-7d2796ababc14477f4c09abc2e8edb05
Chris
10:41 AM
occurrences=0 (once i realised slack had put smart apostrophes in!)
10:43
Chris
10:43 AM
probably irrelevant, but the other alert queries executing concurrently also use a variable called "occurrences" and they would both have had non-zero values
Ashish
Photo of md5-9ed257a93c49bf4a991f872cc2ea4cda
Ashish
10:50 AM
aren’t the messages different for them..which would help us identify if its bug in OO
Chris
Photo of md5-7d2796ababc14477f4c09abc2e8edb05
Chris
10:54 AM
the messages for each of the alerts is different, yes... i haven't had any problems with the other 2 ... just the one occurrence with this one
10:57
Chris
10:57 AM
i use this simple template
{
    "subject": "{alert_name} detected",
    "message": "{alert_name} was detected at {timestamp}"
}

and the text from the one last night was
Alarm was detected at 1692131536219058

Ashish
Photo of md5-9ed257a93c49bf4a991f872cc2ea4cda
Ashish
10:59 AM
can you please raise an issue…please mention abt three alerts ….will try to reproduce the issue
Chris
Photo of md5-7d2796ababc14477f4c09abc2e8edb05
Chris
11:01 AM
will do.... it's not a big problem at the moment, so if it might be related to the duplicates bug then don't mind if it is shelved until we see if it still occurs when the duplicates one is resolved
11:04
Chris
11:04 AM
fwiw, when i experimenting last week, i did suspect that there may be cross-talk between alerts that are evaluated concurrently, but it wasn't controlled enough to get any meaningful evidence... so just a niggling suspicion

OpenObserve

OpenObserve is an open-source, petabyte-scale observability platform for the cloud native realm, offering a 10x cost reduction and 140x less storage use compared to competitors like Elasticsearch or Splunk. Built in Rust for exceptional performance, it offers comprehensive features like logs, metrics, traces, dashboards, and more | Knowledge Base powered by Struct.AI

Indexed 404 threads (74% resolved)

Join Our Community