In a large cluster, an unexpected long-running task can sometimes hold back the whole job's runtime, and such tasks are hard to locate at that scale. This note collects a few ideas for locating long-running tasks; one of them will be implemented in the upcoming release of OGL.
Customer complaints:
- DEV cluster: the customer wants to check the issue's SI in order to debug it
- Production cluster: the customer complains about long-running tasks
  - the job cannot complete because of the long-running tasks
  - all other completed tasks are wasted because a few tasks (the long-running ones) remain unfinished (or get killed)
  - when should the long-running task be checked? Checking it only when the job is almost finished has little value, because:
    - almost all tasks are already done, so the resources have already been wasted
    - at that point the customer can already find it with oglview
 
 
Conditions of the long-running task:
- Task count
  - c1.1: one task
  - c1.2: multiple tasks
- Position of the task in the job
  - c2.1: first task of the job
  - c2.2: last task of the job
  - c2.3: somewhere in the middle of the job's task queue
- Spread of the task run times
  - c3.1: Variance(running time) is large
  - c3.2: Variance(running time) ≈ 0
- Root cause of the hang
  - c4.1: environment issue
  - c4.2: logic/data issue
 
Options:
- Configure a timeout for each task; if a task does not finish within that duration, restart it and send an SNMP notification (a watchdog sketch follows this list).
- Configure a timeout for each task; if a task does not finish within that duration, only send an SNMP notification; DEV then attaches to the process on the compute host and runs `oglctrl task jobId:taskId restart`.
- Provide a new callback "onCheckStatus"; the JobRunner pings the service to get its status, because only the service knows its real status (see the callback sketch after this list).
  - StatusContext: isTimeOut
  - Open question: how should a timeout of onCheckStatus itself be handled, via SNMP?
- Check the throughput of each JobRunnerObject per minute (or per second?) and report the JobRunner with the lowest throughput in a Job (see the counter sketch after this list).
  - set the task count to zero when the JobRunnerObject is bound to a Job
  - count both successful and failed tasks
 
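A minimal watchdog sketch for the first two options, assuming a timeout is configured per task. `TaskWatchdog`, `TaskHandle`, `restartTask`, and `sendSnmpTrap` are hypothetical names for illustration only, not real OGL APIs.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical watchdog; TaskHandle, restartTask and sendSnmpTrap are
// placeholders for whatever OGL actually exposes.
class TaskWatchdog {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    /** Arm a timeout for one task; fires only if the task is still unfinished. */
    void watch(TaskHandle task, long timeoutSeconds, boolean autoRestart) {
        timer.schedule(() -> {
            if (!task.isFinished()) {
                sendSnmpTrap("task " + task.id() + " ran longer than " + timeoutSeconds + "s");
                if (autoRestart) {
                    restartTask(task);   // option 1: restart automatically
                }                        // option 2: notify only; DEV attaches and restarts by hand
            }
        }, timeoutSeconds, TimeUnit.SECONDS);
    }

    private void restartTask(TaskHandle task) { /* delegate to the scheduler */ }
    private void sendSnmpTrap(String message) { /* emit the SNMP notification */ }
}

interface TaskHandle {
    String id();
    boolean isFinished();
}
```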
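A sketch of the proposed callback option, assuming the JobRunner polls each bound service on a fixed interval. Only the `onCheckStatus` name and the `isTimeOut` flag come from the notes above; `TaskStatus`, the extra `StatusContext` field, and `JobRunnerPoller` are illustrative assumptions, and bounding the callback with a deadline is one possible answer to the open question about onCheckStatus itself timing out.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

enum TaskStatus { RUNNING, STUCK, DONE }

class StatusContext {
    final String taskId;
    final boolean isTimeOut;          // the runner tells the service a timeout has already elapsed

    StatusContext(String taskId, boolean isTimeOut) {
        this.taskId = taskId;
        this.isTimeOut = isTimeOut;
    }
}

interface JobService {
    /** Only the service knows its real progress, so the JobRunner asks via this callback. */
    TaskStatus onCheckStatus(StatusContext context);
}

class JobRunnerPoller {
    /** Ping the service, but bound the callback itself so a hung service cannot block the runner. */
    TaskStatus pollWithDeadline(JobService service, StatusContext ctx, long deadlineMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Future<TaskStatus> result = worker.submit(() -> service.onCheckStatus(ctx));
            return result.get(deadlineMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            sendSnmpTrap("onCheckStatus timed out for task " + ctx.taskId);
            return TaskStatus.STUCK;  // treat a silent service as stuck and notify
        } catch (InterruptedException | ExecutionException e) {
            return TaskStatus.STUCK;
        } finally {
            worker.shutdownNow();
        }
    }

    private void sendSnmpTrap(String message) { /* emit the SNMP notification */ }
}
```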
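A sketch of the throughput counter option, assuming a once-per-minute sweep. Only the reset-on-bind rule and the "count success and failed tasks" rule come from the notes; the class and method names are made up for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative per-runner counter; names are assumptions, not OGL APIs.
class ThroughputMonitor {
    private final Map<String, AtomicLong> tasksPerRunner = new ConcurrentHashMap<>();

    /** Reset the counter when a JobRunnerObject is bound to a Job. */
    void onBind(String runnerId) {
        tasksPerRunner.put(runnerId, new AtomicLong(0));
    }

    /** Count every finished task, whether it succeeded or failed. */
    void onTaskFinished(String runnerId) {
        tasksPerRunner.computeIfAbsent(runnerId, k -> new AtomicLong()).incrementAndGet();
    }

    /** Called once per minute: report the runner with the lowest throughput and start a new window. */
    String slowestRunner() {
        String slowest = null;
        long min = Long.MAX_VALUE;
        for (Map.Entry<String, AtomicLong> e : tasksPerRunner.entrySet()) {
            long done = e.getValue().getAndSet(0);   // read and reset the per-minute window
            if (done < min) {
                min = done;
                slowest = e.getKey();
            }
        }
        return slowest;
    }
}
```

The runner reported by `slowestRunner()` is only a candidate: a low count can also mean the runner was just bound, so the reset-on-bind rule matters when interpreting the report.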