Skip to main content
Skip table of contents

Health checks, Metrics and KPIs to be monitored

Disk

  • for Persistence should be used less than 75% at any time.

  • disk for WEBAPP should be used less then 85% at any time.

Memory

  • there should be always 1Gb memory free

CPU

  • the load on the server should be less then 75% at any time.

Services should be up and running

Liveness: it shows if the application has started

Readiness: it shows if the application has started successfully and is ready to be used. It returns 200 if everything is fine; it returns 503 - some dependencies are unavailable. It can run the following script (/etc/veridiumid/scripts/check_readyness.sh) to check the status on each server.

Metrics: it presents the response time of different services. The results can be found in files, if the variable enableFileStorage from websecadmin → settings → Advanced → metrics.json is set to true. See chapter VeridiumID Health Check mechanism, to setup different values for metrics thresholds.

Changing this value will need also a service restart (websecadmin and tomcat) so the values will be propagated to metrics.log files.

Application

Liveness

Readiness

Metrics (REST API)

Metric logs (file)

websec

/websec/rest/health/live

/websec/rest/health/ready

  • default dependency checks: cassandra, opa, adservice

  • calling the API with the above dependencies in query params customizes the check. e.g. ?cassandra=true&adservice=false

/websec/rest/health/metrics/status

/var/log/veridiumid/tomcat/metrics.log

websecadmin

/websecadmin/rest/health/live

/websecadmin/rest/health/ready

  • default dependency checks: cassandra.

  • calling the API with the above dependencies in query params customizes the check. e.g. ?cassandra=false

/websecadmin/rest/health/metrics/status

/var/log/veridiumid/websecadmin/metrics.log

adservice

/ADService/rest/health/live

/ADService/rest/health/ready

/ADService/rest/health/metrics/status

/var/log/veridiumid/tomcat/metrics.log

shibboleth

/idp/profile/health/live

/idp/profile/health/ready

  • dependency websec

/idp/profile/admin/metrics

N/A

ssp

/ssp/rest/health/live

/ssp/rest/health/ready

  • default dependency checks: websec, cassandra,

N/A

N/A

dmz

/dmzwebsec/help/

/dmzwebsec/help/

N/A

N/A

opa

/health

*available only on 127.0.0.1 interface

/health

*available only on 127.0.0.1 interface

/metrics

*available only on 127.0.0.1 interface

N/A

fido

/health/live

*available only on 127.0.0.1 interface

/health/ready

*available only on 127.0.0.1 interface

N/A

N/A

haproxy

N/A

N/A

/haproxy?stats

N/A

Metric details:

Application - Metric API for each service.

The response should have the following parameters, and they should be within limits:

  1. cassandra.response.time value less than 50 (if cassandra is down, this field will be missing)

  2. opa.health.check.response.time → status true

  3. zookeeper.response.time value less then 50

  4. if response code for this API is 503, the service websecadmin is unavailable.

  5. this Api also presents for each application, it’s status true/false.

Metrics examples (from metrics.log file):

Elasticsearch is up - status = true; also there is presented the response time

2023-Jul-28 10:15:47 AM [ExportWorkerThread] DEBUG com.veridiumid.metrics.filestorage 2023-07-28T10:15:47.860Z, name=elastic.response.time, componentName=websecadmin-Develop-10.70.62.122, serviceName=websecadmin, serverIdentifier=Develop, ip=10.70.62.122, metricDescription=Elastic response time, metricUnit=MS, metricValue=3.0, tags={hostname=Develop, product-version=3.4.0, ip=10.70.62.122, target-hosts=10.70.60.248,10.70.61.181,10.70.62.122, product-build=8.1.16, built-on=28/07/2023, enabled=true, status=true}

zookeeper is up

2023-Jul-28 10:15:57 AM [ExportWorkerThread] DEBUG com.veridiumid.metrics.filestorage 2023-07-28T10:15:57.917Z, name=zookeeper.response.time, componentName=websecadmin-Develop-10.70.62.122, serviceName=websecadmin, serverIdentifier=Develop, ip=10.70.62.122, metricDescription=Zookeper response time, metricUnit=MS, metricValue=1.0, tags={hostname=Develop, product-version=3.4.0, ip=10.70.62.122, target-hosts=10.70.60.248:2181,10.70.61.181:2181,10.70.62.122:2181/veridiumid/8.1.16, product-build=8.1.16, built-on=28/07/2023, enabled=true, status=true}

Cassandra is up:

2023-Jul-28 10:15:52 AM [ExportWorkerThread] DEBUG com.veridiumid.metrics.filestorage 2023-07-28T10:15:52.888Z, name=cassandra.response.time, componentName=websecadmin-Develop-10.70.62.122, serviceName=websecadmin, serverIdentifier=Develop, ip=10.70.62.122, metricDescription=Cassandra response time in ms, metricUnit=MS, metricValue=2.0, tags={hostname=Develop, product-version=3.4.0, ip=10.70.62.122, target-hosts=10.70.60.248,10.70.61.181,10.70.62.122, product-build=8.1.16, built-on=28/07/2023, enabled=true, status=true}

AD service - checking each LDAP connection:

2023-Jul-28 03:27:11 AM [ExportWorkerThread] DEBUG com.veridiumid.metrics.filestorage 2023-07-28T03:27:11.683Z, name=ldap.query.response.time.qc, componentName=adservice-Develop-10.70.62.122, serviceName=adservice, serverIdentifier=Develop, ip=10.70.62.122, metricDescription=LDAP Query Response time in ms, metricUnit=MS, metricValue=2.0, tags={hostname=Develop, product-version=3.4.0, ip=10.70.62.122, product-build=8.1.15, built-on=27/07/2023, enabled=true, status=true}

2023-Jul-28 03:27:11 AM [ExportWorkerThread] DEBUG com.veridiumid.metrics.filestorage 2023-07-28T03:27:11.683Z, name=ldap.query.response.time.dev1, componentName=adservice-Develop-10.70.62.122, serviceName=adservice, serverIdentifier=Develop, ip=10.70.62.122, metricDescription=LDAP Query Response time in ms, metricUnit=MS, metricValue=18.0, tags={hostname=Develop, product-version=3.4.0, ip=10.70.62.122, product-build=8.1.15, built-on=27/07/2023, enabled=true, status=true}

OPA status:

2023-Jul-28 03:27:12 AM [ExportWorkerThread] DEBUG com.veridiumid.metrics.filestorage 2023-07-28T03:27:12.178Z, name=opa.health.check.response.time, componentName=websec-Develop-10.70.62.122, serviceName=websec, serverIdentifier=Develop, ip=10.70.62.122, metricDescription=OPA health check response time in ms, metricUnit=MS, metricValue=1.0, tags={hostname=Develop, product-version=3.4.0, ip=10.70.62.122, product-build=8.1.15, built-on=27/07/2023, enabled=true, status=true}

Metrics configuration

Metrics thresholds and limits can be viewed and configured in the Admin Dashboard, see image below:

The metrics.json file contains the following values:

  1. probingInterval:

    1. how often, in miliseconds, the probes are executed.

  2. enableFileStorage

    • setting this parameter to true, it will start writing logs to metric.log file

  3.  cassandraThreshold:

    • This parameter provides the current Threshold (in milliseconds) for response times from the Cassandra cluster to the VeridiumID node

    • Default value: 50

  4. hsmThreshold:

    • Optional parameter, if not present it will not be used for status generation

      This parameter provides the current Threshold (in milliseconds) for response times from the HSM to the VeridiumID node

    • Default value: 200

  5. opaHealthCheckThreshold:

    • This parameter provides the current Threshold (in milliseconds) for response times from the VeridiumID node to OPA service

    • Default value: 50

  6. elasticThreshold:

    • This parameter provides the current Threshold (in milliseconds) for response times from the VeridiumID node to Elasticsearch cluster 

    • Default value: 50

  7. zookeeperThreshold:

    • This parameter provides the current Threshold (in milliseconds) for response times from the Zookeeper cluster to the VeridiumID node

    • Default value: 50

  8. kafkaThreshold:

    • This parameter provides the current Threshold (in milliseconds) for response times from the Kafka cluster to the VeridiumID node

    • Default value: 50

  9. websecThreshold:

    • Optional parameter, if not present it will not be used for status generation

    • This parameter provides the current Threshold (in milliseconds) for response times for the RegisterAccount operations

    • Default value: 2000

  10. websecadminThreshold:

    • Optional parameter, if not present it will not be used for status generation

    • This parameter provides the current Threshold (in milliseconds) for response times for basic Websecadmin calls

    • Default value: 2000

  11. serviceMetricsLimit:

    • This parameter provides the limit of metrics reported by service monitoring should be kept in memory

    • Default value: 100000

  12. actionMetricsLimit

    • This parameter provides the limit of metrics reported by action tracking should be kept in memory

    • Default value: 10000

  13. metricsInterval:

    • This parameter provides the interval (in milliseconds) on which the metrics are calculated

    • Default value: 10000

Grok filters for parsing metrics data

The following filters can be used to parse entries in metrics.log file, depending on whether ‘status’ field is present:

CODE
    grok {
      match => { message => [
        "%{GREEDYDATA:_first_timestamp} \[%{DATA:thread}\] %{LOGLEVEL:loglevel} %{DATA:class} %{TIMESTAMP_ISO8601:timestamp}, name=%{DATA:name}, componentName=%{DATA:componentName}, serviceName=%{DATA:serviceName}, serverIdentifier=%{DATA:serverIdentifier}, ip=%{IP:ip}, metricDescription=%{DATA:metricDescription}, metricUnit=%{DATA:metricUnit}, metricValue=%{DATA:metricValue}, tags=\{hostname=%{DATA:hostname}, product-version=%{DATA:product-version}, %{GREEDYDATA:_tags}status=%{DATA:status}\}",
        "%{GREEDYDATA:_first_timestamp} \[%{DATA:thread}\] %{LOGLEVEL:loglevel} %{DATA:class} %{TIMESTAMP_ISO8601:timestamp}, name=%{DATA:name}, componentName=%{DATA:componentName}, serviceName=%{DATA:serviceName}, serverIdentifier=%{DATA:serverIdentifier}, ip=%{IP:ip}, metricDescription=%{DATA:metricDescription}, metricUnit=%{DATA:metricUnit}, metricValue=%{DATA:metricValue}, tags=\{hostname=%{DATA:hostname}, product-version=%{DATA:product-version}, %{GREEDYDATA:_tags}\}"
      ] }
JavaScript errors detected

Please note, these errors can depend on your browser setup.

If this problem persists, please contact our support.