Title: Job Monitoring - Admin Version

Date: July 9th 2020

Description:
A short tutorial on monitoring jobs, intended for site admins only.

Topics that are included:

Jobs that I have launched
Filter jobs based on gear name, date range, and state
Cancelling Jobs
Restarting Jobs
Get summary of job status

Capture information about jobs: Execution time, queue time, by job, sorting, plots with information about the job id on hover

Requirements:¶

Access to a Flywheel instance.
A Flywheel API key.
A Flywheel Project with ideally the dataset used in the upload-data notebook.
Site Admin Permission
Have some jobs running in your Flywheel Project

NOTE: This notebook uses the test dataset provided by the upload-data notebook. If you have not uploaded this test dataset yet, we strongly recommend that you do so by following the steps in that notebook before proceeding, because this notebook assumes a specific project structure.

WARNING: The metadata of the acquisitions in your test project will be updated and new files will be created after running the scripts below.

Install and Import Dependencies¶

In [ ]:

# Install specific packages required for this notebook
!pip install flywheel-sdk pandas

In [ ]:

# Import packages
from getpass import getpass
import logging
import os
import datetime
import time
import pprint
from dateutil.tz import tzutc

from IPython.display import display, Image
import flywheel
from permission import check_user_permission
import numpy as np
from tqdm import tqdm
import statistics as stats
from scipy import stats as st
import matplotlib.pyplot as plt
from scipy.stats import normaltest

In [ ]:

# Instantiate a logger
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
log = logging.getLogger('root')

Flywheel API Key and Client¶

Get an API key. More on this in the Flywheel SDK documentation.

In [ ]:

API_KEY = getpass('Enter API_KEY here: ')

Instantiate the Flywheel API client

In [ ]:

fw = flywheel.Client(API_KEY or os.environ.get('FW_KEY'))

del API_KEY

Show Flywheel logging information

In [ ]:

log.info('You are now logged in as %s to %s', fw.get_current_user()['email'], fw.get_config()['site']['api_url'])

Check User Minimum Requirements¶

Before we start, we would like to verify that you have the right permissions to proceed with this notebook.

In [ ]:

min_reqs = {
"site": "site_admin",
"group": "ro",
"project": ['jobs_view',
             'jobs_run_cancel',
             'jobs_cancel_any']
}

Find Jobs¶

First, we will show you how to find jobs that you have run previously.

In the example below, we retrieve 2 jobs that you have launched on your instance. You can change the number of jobs returned by modifying the limit argument.

In [ ]:

user_id = fw.get_current_user()['email']

In [ ]:

user_jobs = fw.jobs.find(f'origin.id={user_id}', limit='2')

In [ ]:

pprint.pprint(user_jobs)

Info: To learn more about the different job attributes, please see the SDK documentation. It will come in handy when you filter jobs.

You can also search for jobs that were launched by other users.

In [ ]:

sample_id = input("Please enter the email address of the user whose jobs you wish to search for: ")

In [ ]:

user_jobs = fw.jobs.find(f'origin.id={sample_id}', limit='2')

In [ ]:

pprint.pprint(user_jobs)

Filter jobs based on gear name, date range, and state¶

Gear Name¶

In [ ]:

gear_name = 'mriqc'

In [ ]:

mriqc_jobs = fw.jobs.find(f'gear_info.name={gear_name}', limit='2')

In [ ]:

pprint.pprint(mriqc_jobs)

Date Range¶

In [ ]:

created_by = '2020-07-01'

In [ ]:

filtered_jobs = fw.jobs.find(f'created>{created_by}', limit='2')

In [ ]:

pprint.pprint(filtered_jobs)

State¶

In [ ]:

state = 'complete'

In [ ]:

filtered_jobs = fw.jobs.find(f'state={state}', limit='2')

In [ ]:

pprint.pprint(filtered_jobs)
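
The three filters above can also be combined into a single comma-separated query, which is the form the plot helper later in this notebook uses. The sketch below is one way to assemble such a string. build_job_filter and its parameters are hypothetical names introduced for illustration, and the upper date bound assumes that the finder accepts a created< comparison in the same way it accepts created>.

```python
def build_job_filter(gear_name=None, state=None, created_after=None, created_before=None):
    """Join the supplied conditions into one comma-separated filter string."""
    parts = []
    if gear_name:
        parts.append(f'gear_info.name={gear_name}')
    if state:
        parts.append(f'state={state}')
    if created_after:
        parts.append(f'created>{created_after}')
    if created_before:
        # Assumption: '<' is accepted for dates just like '>' is
        parts.append(f'created<{created_before}')
    return ','.join(parts)


query = build_job_filter(gear_name='mriqc', state='complete', created_after='2020-07-01')
print(query)  # gear_info.name=mriqc,state=complete,created>2020-07-01
```

The resulting string can be passed to fw.jobs.find(query, limit='2') in the same way as the single-condition filters above.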

Cancel Jobs¶

Use the update method to cancel a job that is pending.

In [ ]:

filtered_jobs = fw.jobs.find('state=pending', limit='2')

for job in filtered_jobs:
    job.update(state='cancelled')

Restart Jobs¶

You can also restart a job whose state is failed. However, each job can only be retried once.

To demonstrate, we restart failed mriqc jobs by iterating through the user_jobs list that we defined earlier with the fw.jobs.find() method. We use exception handling so that an attempt to restart a job more than once does not stop the loop.

Once a job has been successfully restarted, a new job_id is returned. We append this new job_id to a list named retried_job.

In [ ]:

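retried_job = list()

for job in user_jobs:
    # Stop once two jobs have been retried
    if len(retried_job) >= 2:
        break
    if job.state == 'failed' and job.gear_info['name'] == 'mriqc':
        try:
            new_job_id = fw.retry_job(job.id)
            retried_job.append(new_job_id)
        except flywheel.ApiException as e:
            # A job can only be retried once, so a second attempt raises an API error
            log.warning('Could not retry job %s: %s', job.id, e)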

In [ ]:

# View the job ID that has been retried
retried_job
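
The selection logic in the retry loop can be tried out without touching a live instance. The sketch below separates "which jobs qualify" from "retry them", using stand-in objects that carry only the state and gear_info attributes the loop reads. select_retry_candidates and the sample job IDs are hypothetical and exist only for illustration.

```python
from types import SimpleNamespace


def select_retry_candidates(jobs, gear_name, max_retries=2):
    """Return up to max_retries failed jobs that were run with gear_name."""
    candidates = []
    for job in jobs:
        if len(candidates) >= max_retries:
            break
        if job.state == 'failed' and job.gear_info['name'] == gear_name:
            candidates.append(job)
    return candidates


# Stand-in objects that mimic only the attributes used above
sample_jobs = [
    SimpleNamespace(id='job-1', state='failed', gear_info={'name': 'mriqc'}),
    SimpleNamespace(id='job-2', state='complete', gear_info={'name': 'mriqc'}),
    SimpleNamespace(id='job-3', state='failed', gear_info={'name': 'other-gear'}),
    SimpleNamespace(id='job-4', state='failed', gear_info={'name': 'mriqc'}),
    SimpleNamespace(id='job-5', state='failed', gear_info={'name': 'mriqc'}),
]

to_retry = select_retry_candidates(sample_jobs, 'mriqc')
print([job.id for job in to_retry])  # ['job-1', 'job-4']
```

With real data, each selected job would then be passed to fw.retry_job(job.id) as in the cell above.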

Job Statistics¶

In this section, we will present an example of calculating, plotting and then using job statistics for the purpose of cancelling jobs that take too long.

To get an overview, you can use the fw.get_jobs_stats() method to view the status of all current jobs within the Flywheel instance.

In [ ]:

fw.get_jobs_stats()
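
fw.get_jobs_stats() reports on the whole instance. For a summary limited to a list you have already retrieved with fw.jobs.find(), a tally by state can be sketched as follows. count_by_state is a hypothetical helper, and the sample objects stand in for real jobs by providing only a state attribute.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_state(jobs):
    """Count how many jobs are in each state."""
    return Counter(job.state for job in jobs)


# Stand-in objects; real jobs returned by the finder also expose .state
sample_jobs = [
    SimpleNamespace(state='complete'),
    SimpleNamespace(state='failed'),
    SimpleNamespace(state='complete'),
    SimpleNamespace(state='pending'),
]

summary = count_by_state(sample_jobs)
for state, count in summary.most_common():
    print(f'{state}: {count}')
```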

Before getting started, we define a few values: the gear name, the earliest creation date of the jobs to include, and the maximum sample size.

Initialize a few values¶

In [ ]:

def validate(date_text):
    try:
        datetime.datetime.strptime(date_text, '%Y-%m-%d')
        log.info('Please proceed to the next cell')
    except ValueError:
        raise ValueError("Incorrect date format, should be YYYY-MM-DD")

In [ ]:

GEAR_NAME = input('Please enter the gear that you wish to print out the information about: ')
CREATED_BY = input('Please enter the date you wish to filter by in this format (yyyy-mm-dd): ')
MAX_SAMPLE_SIZE = int(input('Please enter the max number of jobs you want to analyze: '))

In [ ]:

# Verify if you have entered the right date format
validate(CREATED_BY)

Helpful Function¶

In [ ]:

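def plot(fw_client, gear_name, created_by, sample_size):
    """Plot run times of completed jobs and return a run-time cutoff in minutes (None if it cannot be computed)."""
    run_times = list()
    
    for job in tqdm(fw_client.jobs.find(f'gear_info.name={gear_name},state="complete",created>{created_by}', limit=sample_size)):
        job_container = fw_client.get_job(job.id)
        time_delta = job_container.transitions.complete - job_container.transitions.running
        run_times.append(time_delta.total_seconds()/60)
        
    if not run_times:
        log.warning('No completed %s jobs found', gear_name)
        return None

    plt.hist(run_times)
    plt.title(f'{gear_name} run times in minutes')
    plt.show()

    # normaltest needs at least 8 samples
    if len(run_times) < 8:
        log.warning('Fewer than 8 completed jobs; skipping the cutoff calculation')
        return None

    max_run_time = max(run_times) 
    min_run_time = min(run_times)
    run_time_range = max_run_time - min_run_time
    mu = stats.mean(run_times)
    sd = stats.stdev(run_times)

    # Determine a run_time_cutoff.
    # The null hypothesis of normaltest is that the sample is normally distributed,
    # so a small p-value means the run times are NOT normal.
    s, pval = normaltest(run_times)
    if pval >= 0.01:
        print(f's = {s:.2f}. Distribution is normal (enough)... Using 2*sd + mu as a cutoff')
        run_time_cutoff = 2*sd + mu
    else:
        print(f's = {s:.2f}. Distribution is not normal (enough)... Using max time + 1sd as a cutoff')
        run_time_cutoff = max_run_time + 1*sd

    print(f'range={run_time_range:.2f}\nmu = {mu:.2f}\nsd = {sd:.2f}\ncut off = {run_time_cutoff:.2f}')
    return run_time_cutoff

In [ ]:

# Keep the cutoff so that the monitoring loop below can use it
run_time_cutoff = plot(fw, GEAR_NAME, CREATED_BY, MAX_SAMPLE_SIZE)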
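
The arithmetic behind the two cutoff rules, and the queue time mentioned in the topics list, can be checked on made-up numbers without a Flywheel connection. This is a minimal sketch: job_durations and compute_cutoff are hypothetical helpers, the timestamps and run times are invented, and queue time is assumed here to mean the interval between a job's creation and the moment it starts running.

```python
import datetime
import statistics


def job_durations(created, running, complete):
    """Return (queue_minutes, run_minutes) from three timestamps."""
    queue_minutes = (running - created).total_seconds() / 60
    run_minutes = (complete - running).total_seconds() / 60
    return queue_minutes, run_minutes


def compute_cutoff(run_times, looks_normal):
    """Apply the two rules used in plot(): mean + 2*sd, or max + 1*sd."""
    mu = statistics.mean(run_times)
    sd = statistics.stdev(run_times)
    if looks_normal:
        return mu + 2 * sd
    return max(run_times) + sd


# Invented timestamps for a single job
created = datetime.datetime(2020, 7, 9, 10, 0)
running = datetime.datetime(2020, 7, 9, 10, 5)
complete = datetime.datetime(2020, 7, 9, 10, 35)
print(job_durations(created, running, complete))  # (5.0, 30.0)

# Invented run times in minutes
run_times = [10, 12, 14, 16, 18]
print(f'{compute_cutoff(run_times, looks_normal=True):.2f}')   # 20.32
print(f'{compute_cutoff(run_times, looks_normal=False):.2f}')  # 21.16
```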

In [ ]:

sleep_time = 1              # Amount of time (in min) to sleep between checks

while True:
    print(f"==============================\n{datetime.datetime.now()}\n==============================\n")
   
    
    num_pending = len(fw.jobs.find(f'state=pending,created>{CREATED_BY},gear_info.name={GEAR_NAME}', limit=MAX_SAMPLE_SIZE))
    print(f'{num_pending} pending {GEAR_NAME} jobs')

    running_jobs = fw.jobs.find(f'state=running,created>{CREATED_BY},gear_info.name={GEAR_NAME}', limit=MAX_SAMPLE_SIZE)
    print(f'{len(running_jobs)} running {GEAR_NAME} jobs\n')

    for j in running_jobs:
        job = fw.get_job(j.id)
        time_delta = datetime.datetime.now(tz=tzutc()) - job.transitions.running
        run_time_min = time_delta.total_seconds()/60
        print('{} running for {:.2f} min'.format(job.id, run_time_min))
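        if run_time_min > run_time_cutoff:
            # Cancel with the same update call used in the Cancel Jobs section
            j.update(state='cancelled')
            print(f"{job.id} running for {run_time_min:.2f} min -- cancelled as it is more than the cutoff of {run_time_cutoff:.2f} min")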
            
    print(f'Sleeping {sleep_time} min...')
    time.sleep(60*sleep_time)
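
The elapsed-time check inside the loop can be isolated and tested with fixed timestamps. The sketch below assumes each running job is represented by its ID and a timezone-aware start time. minutes_running and jobs_over_cutoff are hypothetical helpers and the job IDs are made up.

```python
import datetime


def minutes_running(started_at, now=None):
    """Minutes elapsed since started_at (timezone-aware datetimes)."""
    if now is None:
        now = datetime.datetime.now(tz=datetime.timezone.utc)
    return (now - started_at).total_seconds() / 60


def jobs_over_cutoff(started_by_id, cutoff_min, now=None):
    """Return {job_id: minutes} for jobs that have run longer than cutoff_min."""
    over = {}
    for job_id, started_at in started_by_id.items():
        elapsed = minutes_running(started_at, now)
        if elapsed > cutoff_min:
            over[job_id] = elapsed
    return over


utc = datetime.timezone.utc
now = datetime.datetime(2020, 7, 9, 10, 0, tzinfo=utc)
started = {
    'job-a': datetime.datetime(2020, 7, 9, 9, 0, tzinfo=utc),   # running for 60 min
    'job-b': datetime.datetime(2020, 7, 9, 9, 50, tzinfo=utc),  # running for 10 min
}

print(jobs_over_cutoff(started, cutoff_min=30, now=now))  # {'job-a': 60.0}
```

Passing now explicitly keeps the check reproducible. In the monitoring loop it can be omitted so that the current UTC time is used.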
