咖啡香

Monitoring hanging tasks in a large cluster

In a large cluster, a few unexpectedly long-running tasks sometimes hold back a job's overall runtime, and those tasks are hard to locate among so many nodes. Some ideas for locating them are outlined here. One of them will be implemented in an upcoming release of OGL.
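The post does not state how OGL decides that a task is "long-running", so the following is only a minimal sketch of one common heuristic, not the approach OGL implements. It assumes that each task reports a start time and, once done, an end time. It also assumes that a running task is suspicious when its elapsed time exceeds a multiple of the median duration of the job's already-finished tasks. The `TaskRecord` type, the `find_long_running` function, and the `factor` and `min_seconds` defaults are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class TaskRecord:
    """Hypothetical per-task record; field names are illustrative only."""
    task_id: str
    host: str
    start_time: float                  # seconds since some common epoch
    end_time: Optional[float] = None   # None while the task is still running


def find_long_running(tasks: List[TaskRecord], now: float,
                      factor: float = 3.0,
                      min_seconds: float = 60.0) -> List[TaskRecord]:
    """Return still-running tasks whose elapsed time exceeds both
    factor * (median duration of finished tasks) and min_seconds.

    The factor/min_seconds thresholds are assumptions, not OGL settings.
    """
    finished = [t.end_time - t.start_time for t in tasks if t.end_time is not None]
    if not finished:
        # No finished peers yet, so there is no baseline to compare against.
        return []
    threshold = max(factor * median(finished), min_seconds)
    suspects = [t for t in tasks
                if t.end_time is None and now - t.start_time > threshold]
    # Longest-running first, so the worst offenders are listed at the top.
    return sorted(suspects, key=lambda t: now - t.start_time, reverse=True)


# Usage: three finished tasks (median 100 s) and two still running.
tasks = [
    TaskRecord("t1", "node-01", 0.0, 100.0),
    TaskRecord("t2", "node-02", 0.0, 110.0),
    TaskRecord("t3", "node-03", 0.0, 90.0),
    TaskRecord("t4", "node-04", 0.0),     # running for 400 s at now=400
    TaskRecord("t5", "node-05", 200.0),   # running for 200 s at now=400
]
for t in find_long_running(tasks, now=400.0):
    print(t.task_id, t.host, 400.0 - t.start_time)  # prints: t4 node-04 400.0
```

Comparing against the median of finished peers, rather than a fixed timeout, lets the same rule work for both short and long jobs; the `min_seconds` floor keeps very short jobs from flagging tasks that are only a few seconds behind.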

Customer complaint:

Conditions that define a long-running task:

Options:

