Kubernetes

The recommended setup for infra monitoring with CubeAPM is to use the OpenTelemetry (OTel) Collector to collect metrics from the various infrastructure components and send them to CubeAPM. CubeAPM then provides visualization and alerting on the collected metrics.

The official OTel Collector helm chart is available at https://github.com/open-telemetry/opentelemetry-helm-charts/tree/main/charts/opentelemetry-collector.

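On k8s, the Collector can run in two modes: daemonset (the Collector runs as a daemonset, with one pod on each k8s node) and deployment (the Collector runs as a k8s deployment with a specified number of pods). Complete k8s monitoring requires both: in the configuration below, the daemonset collects node-level data (host metrics, kubelet metrics, and container logs), while the deployment collects cluster-level metrics and k8s events.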
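
As a minimal sketch of how the pod count is specified in deployment mode: this assumes the chart's `replicaCount` value controls the number of pods, and that a single replica is wanted so that cluster-level metrics and events are not collected more than once. In daemonset mode the pod count follows the number of nodes instead.

```yaml
# Fragment of a values file for the deployment release (illustrative only).
mode: deployment
# Assumption: one replica, to avoid duplicate cluster-level data.
replicaCount: 1
```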

Installation

Add the OpenTelemetry Helm chart repository.

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
# Use the following command to update if the repo is already added.
helm repo update open-telemetry

Copy the files below and save them as otel-collector-daemonset.yaml and otel-collector-deployment.yaml respectively. Replace <cubeapm_endpoint> with the address of your CubeAPM server, and edit the files to customize the configuration as per your requirements.

otel-collector-daemonset.yaml

mode: daemonset
image:
  repository: "otel/opentelemetry-collector-contrib"
  # tag: 0.112.0
presets:
  kubernetesAttributes:
    enabled: true
  hostMetrics:
    enabled: true
  kubeletMetrics:
    enabled: true
  logsCollection:
    enabled: true
    # includeCollectorLogs: true
    storeCheckpoints: true
config:
  exporters:
    debug:
      verbosity: detailed
      sampling_initial: 5
      sampling_thereafter: 1
    otlphttp/metrics:
      metrics_endpoint: http://<cubeapm_endpoint>:3130/api/metrics/v1/save/otlp
      retry_on_failure:
        enabled: false
    otlphttp/logs:
      logs_endpoint: http://<cubeapm_endpoint>:3130/api/logs/insert/opentelemetry/v1/logs
      headers:
        Cube-Stream-Fields: k8s.namespace.name,k8s.deployment.name,k8s.statefulset.name
    otlp/traces:
      endpoint: <cubeapm_endpoint>:4317
      tls:
        insecure: true
  processors:
    batch: {}
    resourcedetection:
      detectors: ["system"]
      system:
        hostname_sources: ["os"]
    resource/host.name:
      attributes:
        - key: host.name
          value: "${env:K8S_NODE_NAME}"
          action: upsert
    resource/cube.environment:
      attributes:
        - key: cube.environment
          value: UNSET
          action: upsert
    # filter/metrics:
    #   error_mode: ignore
    #   metrics:
    #     metric:
    #       # only include my-namespace
    #       - resource.attributes["k8s.namespace.name"] != "my-namespace"
    # filter/logs:
    #   error_mode: ignore
    #   logs:
    #     log_record:
    #       # only include my-namespace
    #       - resource.attributes["k8s.namespace.name"] != "my-namespace"
    # transform/logs_redact:
    #   error_mode: ignore
    #   log_statements:
    #     - context: log
    #       statements:
    #         # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/ottl/ottlfuncs#replace_pattern
    #         # - replace_pattern(attributes["http.url"], "client_id=[^&]+", "client_id=[REDACTED]")
    #         - replace_pattern(body, "\"(token|password)\":\"[^\"]*\"", "\"$$1\":\"****\"")
    # transform/logs_extract_fields:
    #   error_mode: ignore
    #   log_statements:
    #     - context: log
    #       statements:
    #         # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/ottl/ottlfuncs#extractpatterns
    #         - set(cache, ExtractPatterns(body, "\\[(?P<log_level>debug|info|warn|warning|error)\\]"))
    #         - flatten(cache, "")
    #         - merge_maps(attributes, cache, "upsert")
    transform/logs_parse_json_body:
      error_mode: ignore
      log_statements:
        - context: log
          conditions:
            - body != nil and IsString(body) and Substring(body, 0, 2) == "{\""
          statements:
            - set(cache, ParseJSON(body))
            - flatten(cache, "")
            - merge_maps(attributes, cache, "upsert")
            # - set(time, Time(attributes["Timestamp"], "%Y-%m-%dT%H:%M:%S%j"))
            # - set(severity_text, "DEBUG") where attributes["Level"] == "Debug"
            # - set(severity_number, 5) where attributes["Level"] == "Debug"
  receivers:
    otlp:
      protocols:
        grpc: {}
        http: {}
    kubeletstats:
      collection_interval: 60s
      insecure_skip_verify: true
      metric_groups:
        - container
        - node
        - pod
        - volume
      extra_metadata_labels:
        # - container.id
        - k8s.volume.type
    hostmetrics:
      collection_interval: 60s
      scrapers:
        cpu:
        disk:
        # load:
        filesystem:
        memory:
        network:
        # paging:
        # processes:
        # process:
        #   mute_process_all_errors: true
  service:
    pipelines:
      traces:
        exporters:
          # - debug
          - otlp/traces
        processors:
          - memory_limiter
          - batch
          # traces would normally have host.name attribute set to pod name.
          # resourcedetection and resource/host.name processors will override
          # it with the node name.
          # - resourcedetection
          # - resource/host.name
          - resource/cube.environment
        receivers:
          - otlp
      metrics:
        exporters:
          # - debug
          - otlphttp/metrics
        processors:
          - memory_limiter
          # - filter/metrics
          - batch
          - resourcedetection
          - resource/host.name
          - resource/cube.environment
        receivers:
          - hostmetrics
          - kubeletstats
      logs:
        exporters:
          # - debug
          - otlphttp/logs
        processors:
          - memory_limiter
          # - filter/logs
          # - transform/logs_redact
          # - transform/logs_extract_fields
          - transform/logs_parse_json_body
          - batch
          - resourcedetection
          - resource/host.name
          - resource/cube.environment

clusterRole:
  rules:
    # needed for receivers.kubeletstats.extra_metadata_labels.(*)
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/v0.89.0/receiver/kubeletstatsreceiver#role-based-access-control
    - apiGroups: [""]
      resources: ["nodes/proxy"]
      verbs: ["get"]

tolerations:
  # If some nodes (like control plane nodes) are tainted, pods won’t get
  # scheduled unless they have matching tolerations. This toleration
  # allows the pod to be scheduled on any tainted node.
  - operator: Exists
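
The blanket toleration above lets the daemonset pods run on every tainted node. Where that is broader than desired, a narrower variant is sketched below. It assumes the only tainted nodes to cover are control plane nodes carrying the standard `node-role.kubernetes.io/control-plane:NoSchedule` taint. The taints actually present on a cluster can be checked with `kubectl describe nodes`.

```yaml
tolerations:
  # Assumption: control plane nodes carry the standard control-plane taint.
  # Nodes with any other taint will not get a Collector pod.
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
```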

otel-collector-deployment.yaml

mode: deployment
image:
  repository: "otel/opentelemetry-collector-contrib"
  # tag: 0.112.0
presets:
  kubernetesAttributes:
    enabled: true
    # extractAllPodLabels: false
  kubernetesEvents:
    enabled: true
  clusterMetrics:
    enabled: true
config:
  exporters:
    debug:
      verbosity: detailed
      sampling_initial: 5
      sampling_thereafter: 1
    otlphttp/metrics:
      metrics_endpoint: http://<cubeapm_endpoint>:3130/api/metrics/v1/save/otlp
      retry_on_failure:
        enabled: false
    otlphttp/k8s-events:
      logs_endpoint: http://<cubeapm_endpoint>:3130/api/logs/insert/opentelemetry/v1/logs
      headers:
        Cube-Stream-Fields: event.domain
  processors:
    batch: {}
    resource/cube.environment:
      attributes:
        - key: cube.environment
          value: UNSET
          action: upsert
    transform/logs_flatten_map:
      error_mode: ignore
      log_statements:
        - context: log
          conditions:
            - body != nil and IsMap(body)
          statements:
            - set(cache, body)
            - flatten(cache, "")
            - merge_maps(attributes, cache, "upsert")
  receivers:
    k8s_cluster:
      collection_interval: 60s
      allocatable_types_to_report:
        - cpu
        - memory
      metrics:
        k8s.node.condition:
          enabled: true
  service:
    pipelines:
      metrics:
        exporters:
          # - debug
          - otlphttp/metrics
        processors:
          - memory_limiter
          - batch
          - resource/cube.environment
        receivers:
          - k8s_cluster
      logs:
        exporters:
          # - debug
          - otlphttp/k8s-events
        processors:
          - memory_limiter
          - transform/logs_flatten_map
          - batch
          - resource/cube.environment
        receivers:
          - k8sobjects

A sample project with examples of additional Collector configuration, e.g., to monitor Redis, MySQL, etc. is available at https://github.com/cubeapm/sample_infra_monitoring.
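
As an illustration of the customization step, the fragment below sketches the values most likely to need editing in either file. The hostname `cubeapm.example.internal` and the environment name `production` are hypothetical. It also assumes that `UNSET` is a placeholder meant to be replaced with the name of your environment, and that uncommenting `tag` is how the Collector image version is pinned.

```yaml
image:
  repository: "otel/opentelemetry-collector-contrib"
  # Pin the Collector version instead of using the chart default.
  tag: 0.112.0
config:
  exporters:
    otlphttp/metrics:
      # Hypothetical hostname standing in for <cubeapm_endpoint>.
      metrics_endpoint: http://cubeapm.example.internal:3130/api/metrics/v1/save/otlp
  processors:
    resource/cube.environment:
      attributes:
        - key: cube.environment
          # Hypothetical environment name replacing UNSET.
          value: production
          action: upsert
```

The same endpoint substitution applies to the logs and traces exporters in the daemonset file and to the k8s-events exporter in the deployment file.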

Install the collector using the following commands:

helm install otel-collector-daemonset open-telemetry/opentelemetry-collector -f otel-collector-daemonset.yaml
helm install otel-collector-deployment open-telemetry/opentelemetry-collector -f otel-collector-deployment.yaml

# Use the following commands to update if already installed.
helm upgrade otel-collector-daemonset open-telemetry/opentelemetry-collector -f otel-collector-daemonset.yaml
helm upgrade otel-collector-deployment open-telemetry/opentelemetry-collector -f otel-collector-deployment.yaml

Monitoring processes

The configuration above monitors the k8s cluster at container-level granularity, which is sufficient in most cases. However, if you run multiple processes in your containers and need to monitor individual processes as well, process-level monitoring can be enabled as follows:

Enable process-level monitoring in the hostmetrics receiver of the OTel Collector daemonset.

otel-collector-daemonset.yaml (hostmetrics)
config:
  receivers:
    hostmetrics:
      scrapers:
        # Enable process level monitoring
        process:
          resource_attributes:
            # Enable cgroup info for each process so that we can
            # link processes to respective containers
            process.cgroup:
              enabled: true
          mute_process_all_errors: true

Process-level monitoring generates a large number of metrics. The volume can be reduced by disabling metrics that are not needed, as follows:

otel-collector-daemonset.yaml (hostmetrics)

config:
  receivers:
    hostmetrics:
      collection_interval: 60s
      scrapers:
        cpu:
        disk:
          exclude:
            devices:
              - ^loop.*$
            match_type: regexp
          metrics:
            system.disk.io:
              enabled: false
            system.disk.merged:
              enabled: false
            system.disk.operation_time:
              enabled: false
            system.disk.operations:
              enabled: false
            system.disk.pending_operations:
              enabled: false
            system.disk.weighted_io_time:
              enabled: false
        # load:
        filesystem:
          exclude_devices:
            devices:
              - ^/dev/loop.*$
            match_type: regexp
          metrics:
            system.filesystem.inodes.usage:
              enabled: false
        memory:
        network:
          metrics:
            system.network.connections:
              enabled: false
            system.network.dropped:
              enabled: false
            system.network.errors:
              enabled: false
            system.network.packets:
              enabled: false
        # paging:
        # processes:
        # Enable process level monitoring
        process:
          resource_attributes:
            # Enable cgroup info for each process so that we can
            # link processes to respective containers
            process.cgroup:
              enabled: true
          metrics:
            process.disk.io:
              enabled: false
            process.memory.virtual:
              enabled: false
            process.uptime:
              enabled: true
          mute_process_all_errors: true

CubeAPM will now show process level stats in Infrastructure > Host page.

Enable the container.id attribute in the kubeletstats receiver to attach the container ID to each container's metrics. This allows processes to be linked to their respective containers.
otel-collector-daemonset.yaml (kubeletstats)
```
config:
  receivers:
    kubeletstats:
      extra_metadata_labels:
        - container.id
```
CubeAPM will now show process level stats in Infrastructure > K8s Pod page as well.
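
When this setting is combined with the full daemonset configuration, note that Helm replaces lists as a whole when merging values files. A list containing only `container.id` would therefore drop the `k8s.volume.type` label set in the main file. A sketch that keeps both labels, assuming the rest of the daemonset file is unchanged:

```yaml
config:
  receivers:
    kubeletstats:
      extra_metadata_labels:
        # Links processes to their containers.
        - container.id
        # Already present in the main daemonset configuration.
        - k8s.volume.type
```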
