Monitoring AWS EC2 CPU credits in Prometheus

Aug 14, 2018

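Most of the Kubernetes nodes at my company are T2 instances. The upside is that they are cheap and flexible, but the downside is just as obvious: when CPU credits are spent faster than they are earned, the CPU credit balance can drop to 0 (or very close to it). At that point the node's CPU performance falls to an extremely low level, and most processes stall or even crash.

This has happened more than once. In one case RabbitMQ was running on a node (yes, not a dedicated node) that also hosted a Chromium pod. That pod ran as a cron job, had a damned bug, and had no retry count configured. As a result, a pile of jobs accumulated on the same node and kept running. After a few days the node's CPU credit balance hit 0, RabbitMQ went down, and many other pods got stuck (or kept restarting).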

I have now added the CloudWatch plugin for Prometheus. When CPU credits are being used up too quickly, it notifies me directly via Telegram, so the problem is solved! (I hope)

Note that the CloudWatch API is not free. There really is no such thing as a free lunch…
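
The metric name used in the alert further down (`aws_ec2_cpucredit_balance_average`) matches what the prometheus/cloudwatch_exporter generates for the EC2 `CPUCreditBalance` metric, so an exporter config along these lines should be enough. This is a minimal sketch under that assumption, not necessarily the exact config in use: the region is a placeholder, and the 300-second period is an assumed value chosen because EC2 reports credit metrics at 5-minute granularity.

```yaml
# Hypothetical cloudwatch_exporter config; the region is a placeholder.
region: ap-northeast-1
metrics:
  # Exposed to Prometheus as aws_ec2_cpucredit_balance_average
  # with an instance_id label.
  - aws_namespace: AWS/EC2
    aws_metric_name: CPUCreditBalance
    aws_dimensions: [InstanceId]
    aws_statistics: [Average]
    # Assumed value: credit metrics are published every 5 minutes.
    period_seconds: 300
```

Every scrape of the exporter triggers CloudWatch API requests, so exporting only the metrics you alert on and scraping less often both help keep the bill down.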

The Prometheus alert rule:

      - name: cpu_credit
        rules:
        - alert: CPUCreditTooHigh
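          # Slope (credits per second) of the credit balance over the last 2 hours.
          # A negative slope means the balance is falling, i.e. credits are spent faster than they are earned.
          # deriv() is used because the balance is a gauge; rate() is meant for counters and never returns a negative value.
          # The window and the `for` duration may need tuning if alerts fire too often.
          expr: avg by (instance_id) (deriv(aws_ec2_cpucredit_balance_average[2h])) < 0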
          for: 2h
          labels:
            severity: critical
          annotations:
            summary: "CPU credit is running low on {{$labels.instance_id}}"
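
A trend-based rule like the one above only reacts to a falling balance. As a complement, one could also alert on the absolute balance, so that a node sitting near 0 is flagged even when the balance is no longer dropping. The sketch below is written as a standalone rule file; the alert name, the threshold of 20 credits, and the 15-minute `for` are hypothetical values that would need adjusting per instance size.

```yaml
groups:
  - name: cpu_credit_floor
    rules:
      # Hypothetical companion rule: fires when the balance itself is low.
      - alert: CPUCreditBalanceLow
        # Threshold of 20 credits is an assumed value, not a recommendation.
        expr: min by (instance_id) (aws_ec2_cpucredit_balance_average) < 20
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "CPU credit balance is below 20 on {{ $labels.instance_id }}"
```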


References