Monitoring of GPU Usage with Tensorflow Models Using Prometheus

Published by Spark开源社区 (Spark Open Source Community)
Understanding how GPU utilization and workloads behave in containerized systems is critical to building efficient software. We create a set of dashboards to monitor and evaluate GPU performance in the context of TensorFlow. We monitor performance in real time to gain insight into GPU load, GPU memory, and temperature metrics in a GPU-enabled Kubernetes system. Visualizing TensorFlow training job metrics in real time with Prometheus allows us to tune and optimize GPU usage. Because TensorFlow jobs can have both GPU and CPU implementations, it is also useful to view detailed real-time performance data from each implementation and choose the better one. To illustrate our system, we will show a live demo that gathers and visualizes GPU metrics on a GPU-enabled Kubernetes cluster with Prometheus and Grafana.
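The abstract does not say which exporter the talk uses, so the following is only a minimal sketch of how the three metrics it names (GPU load, memory, temperature) could be turned into something Prometheus can scrape. It assumes the metrics are read with `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` and rendered in the Prometheus text exposition format. The metric names (`gpu_utilization_percent` and so on) and the helper functions are hypothetical and are not taken from the talk.

```python
import subprocess

# Hypothetical metric names; the talk does not specify an exporter or a naming scheme.
# Each entry: (nvidia-smi query field, Prometheus metric name, help text).
FIELDS = [
    ("utilization.gpu", "gpu_utilization_percent", "GPU utilization in percent"),
    ("memory.used", "gpu_memory_used_mib", "GPU memory in use, in MiB"),
    ("temperature.gpu", "gpu_temperature_celsius", "GPU temperature in degrees Celsius"),
]


def query_nvidia_smi():
    """Return raw CSV rows from nvidia-smi (needs an NVIDIA driver on the host)."""
    query = "index," + ",".join(field for field, _, _ in FIELDS)
    return subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={query}", "--format=csv,noheader,nounits"],
        text=True,
    )


def parse_rows(csv_text):
    """Parse rows such as '0, 35, 1024, 61' into (gpu_index, [values]) pairs.

    Assumes every queried field is numeric; a GPU that reports '[N/A]'
    for a field would need extra handling that this sketch omits.
    """
    samples = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        samples.append((parts[0], [float(v) for v in parts[1:]]))
    return samples


def render_metrics(samples):
    """Render the samples as gauges in the Prometheus text exposition format."""
    lines = []
    for pos, (_, name, help_text) in enumerate(FIELDS):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        for gpu_index, values in samples:
            lines.append(f'{name}{{gpu="{gpu_index}"}} {values[pos]}')
    return "\n".join(lines) + "\n"
```

On a GPU node, `render_metrics(parse_rows(query_nvidia_smi()))` would produce the text that a `/metrics` HTTP endpoint returns. Prometheus could then scrape that endpoint, and Grafana could chart the resulting gauges per `gpu` label.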