CRS User Guide
The user describes a job using the same format as in the old CRS. The format is described here using an example.
Let us suppose that a user wants to write a job which takes 2 input files, one stored on NFS disk and one in HPSS tape storage. The job uses an executable which takes a number of input parameters and then writes 3 output files, two of which are to be written to HPSS and one to the NFS area. Moreover, one of the output files is considered mandatory, which means that if it is not present when the user's executable completes, the job is considered a failure. The other two files are optional, which means that if the job does not produce them, it can still be considered a success.
To be considered successful, the job must produce all of the declared mandatory output files, and the user's executable must return the exit code that corresponds to a successful exit (defined in the input cards).
The input file would then look more or less like this (lines which start with # are comments):
# first of all, user should specify the executable
# and its arguments
# Then he/she should specify an e-mail address, so that the job knows where to send e-mail notifications
# Then he/she should leave the magic line:
# How many input files are there?
# What are the input files?
# The first file is stored in HPSS
# this is its directory
# and this is its file name
# The second file is stored in NFS area on a UNIX disk
# this is its directory
# and this is its file name
# now output files
# How many output files will the job produce?
# What are the output files?
# The first one will be an hpss file
# here is its target directory
# and its name
# the file is mandatory, if it is missing at the end of job, the job failed.
# the second hpss file
# the file is optional, if it is missing, no harm is done.
# the third file is UNIX
# where should CRS put standard output and standard error of user's executable at the end of job?
# standard output should go to directory
# to file
# standard error should go to
# and finally, what should be correct exit code of user's executable?
# This line is optional, if it is not present in the job definition file, it means that
# the correct exit code is 0.
# but for the fun of it, let us declare that the successful user's job should end with status 7 (why not?)
Relation Between Job Definition and Job Environment Variables
The user executable should assume that all input and output files are located on the local disk in the directory where the binary is executed. (There is an exception to this rule for STAR, more about it below). The CRS system will define the following environment variables:
- INPUTn (n=0,..., number of input files-1): base name of the input file n.
- ACTUAL_INPUTn: full (directory included) name of input file n.
- INPUT_TYPEn : UNIX or HPSS depending on whether input file n is a UNIX or an HPSS file.
- OUTPUTn : base name of output file n
- ACTUAL_OUTPUTn : full name of output file n
- OUTPUT_TYPEn : file type (UNIX or HPSS) of output file n
- CRS_STDOUT , CRS_STDOUT_DIR : name and directory of standard output file
- CRS_STDERR , CRS_STDERR_DIR : name and directory of standard error file
- OUTPUTNUMSTREAMS, INPUTNUMSTREAMS : number of output and input files.
- MANDATORYOUTPUTn : yes or no depending on whether the output has been declared as mandatory by the user.
For the STAR experiment, for UNIX files, both ACTUAL_INPUTn and INPUTn denote the full name of the input file (including directory).
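A minimal Python sketch of how a user executable might read these variables. The variable names come from the list above; the helper name read_crs_streams, the dictionary layout, and the choice to treat a missing MANDATORYOUTPUTn as "no" are assumptions made for illustration and are not part of CRS.

```python
import os


def read_crs_streams(env=None):
    """Collect input and output file descriptions from the CRS environment.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    if env is None:
        env = os.environ
    n_inputs = int(env.get("INPUTNUMSTREAMS", "0"))
    n_outputs = int(env.get("OUTPUTNUMSTREAMS", "0"))

    inputs = []
    for n in range(n_inputs):  # n = 0, ..., number of input files - 1
        inputs.append({
            "name": env[f"INPUT{n}"],           # base name of input file n
            "actual": env[f"ACTUAL_INPUT{n}"],  # full name, directory included
            "type": env[f"INPUT_TYPE{n}"],      # "UNIX" or "HPSS"
        })

    outputs = []
    for n in range(n_outputs):
        outputs.append({
            "name": env[f"OUTPUT{n}"],
            "actual": env[f"ACTUAL_OUTPUT{n}"],
            "type": env[f"OUTPUT_TYPE{n}"],
            # Assumption: a missing variable is treated as "no".
            "mandatory": env.get(f"MANDATORYOUTPUT{n}", "no") == "yes",
        })
    return inputs, outputs
```

Inside a running job the function would be called without arguments, so that it reads the variables CRS has set. Keep in mind the STAR exception noted above: there INPUTn already carries the full path for UNIX files.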
There are 5 queues to which jobs can be submitted (6 queues for STAR). Queue 5 corresponds to the fastest machines, queue 1 to the slowest. An additional queue, 0, corresponds to CAS machines.
How to Create a Job
- Each time you log on to the submit machine, execute the command: setup_crs.
- Prepare a job definition file. Store it in a directory on the submit machine.
- Execute the command: crs_job -create job_definition_file [options]
- The job will be created. It will be given a name which consists of the job definition file name with time stamp appended to it.
- [-qn] : the job should be submitted to queue n
- [-pn] : the job should be given priority n (0<n<20)
- [-drop] : if all machines in the queue n are occupied, then the job can be executed in a slower queue.
crs_job -create ~/newcrs/production/rcrsuser2_success.jdf -q4 -p5 -drop
It means: create a job using the job definition file ~/newcrs/production/rcrsuser2_success.jdf, submit it to queue 4, if the queue is full execute it in a slower queue, and give it priority 5. Once you have created a job it is registered in the CRS system and has status CREATED.
The order of options is not important; however, options should come after the job definition file.
That's all. You can now see the job in the system using crs_panel or crs_job -stat commands.
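The rules for the create command (job definition file first, options afterwards in any order, priority n with 0<n<20) can be sketched as a small helper that assembles the argument list. The function name build_create_command is hypothetical, and the sketch only builds the command line; it does not run crs_job.

```python
def build_create_command(job_definition_file, queue=None, priority=None, drop=False):
    """Assemble the argument list for `crs_job -create`.

    Options are placed after the job definition file, as the guide requires.
    """
    if priority is not None and not 0 < priority < 20:
        raise ValueError("priority n must satisfy 0 < n < 20")
    args = ["crs_job", "-create", job_definition_file]
    if queue is not None:
        args.append(f"-q{queue}")     # submit to queue n
    if priority is not None:
        args.append(f"-p{priority}")  # give the job priority n
    if drop:
        args.append("-drop")          # allow execution in a slower queue
    return args


# Reproduces the example command shown above.
example = build_create_command(
    "~/newcrs/production/rcrsuser2_success.jdf", queue=4, priority=5, drop=True
)
print(" ".join(example))
```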
How to Submit a Job
There are two ways you can submit a job:
- Do nothing. Once your job is created, it will be submitted for execution when its time is due by the loader daemon.
- Submit it manually. To do this open the crs_panel, select the job and click "submit". You should use this option only if you want to push one or two jobs "ahead of the line"; you should never use it to submit large numbers of jobs.
To block a job select it from the main panel and click "block". To unblock it click "unblock".
To block/unblock jobs from line mode use crs_job -block job_name / crs_job -unblock job_name command.
Only jobs in CREATED state can be blocked. Only jobs in BLOCKED state can be unblocked.
Jobs in the BLOCKED state are ignored by the loader and will not be submitted to condor until they are unblocked by the user.
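The blocking rules can be summarised in a short Python sketch. The function names are hypothetical, and the guide does not state explicitly which state an unblocked job returns to; the sketch assumes it goes back to CREATED.

```python
def block(status):
    """Return the status after blocking; only CREATED jobs can be blocked."""
    if status != "CREATED":
        raise ValueError(f"cannot block a job in state {status}")
    return "BLOCKED"


def unblock(status):
    """Return the status after unblocking; only BLOCKED jobs can be unblocked."""
    if status != "BLOCKED":
        raise ValueError(f"cannot unblock a job in state {status}")
    return "CREATED"  # assumption: an unblocked job becomes CREATED again


def loader_will_submit(status):
    """The loader picks up CREATED jobs and ignores BLOCKED ones."""
    return status == "CREATED"
```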
- Once the job is created it is registered as CREATED in the system. It means that CRS knows about this particular job and has its information in its databases. However, the job is not (yet) known to condor and has not (yet) been submitted to condor.
- Once the job is submitted to condor, either by the user using the "submit" button on crs_panel or by the loader, it changes status to SUBMITTED. This means that the job is known to condor, but is not (yet) running and sits in condor in the idle state.
- Once condor starts the job it changes status to STARTED. A job in this stage copies input files from NFS disks to the local execution directory (if the job requires NFS input) and submits stage requests to ORBS. After the requests are accepted by HPSS the job waits until they fail, time out, or complete.
- Once the HPSS requests are completed, the job starts to import data files from the HPSS cache. First of all it changes status to MAIN-INIT. In this state it does some internal maintenance work (it is very fast, you rarely see jobs in this state).
- Then it changes status to MAIN-IMPORT-WAITING. In this state it waits for an open pftp slot to import the data.
- When there is an open slot it changes status to MAIN-IMPORT and imports data.
- Once all input data is local the job checks how many other CRS jobs are running data reconstruction on this particular node. If there are 2 other jobs running reconstruction, the job enters the MAIN-SLEEP state and waits until at least one of the other jobs completes.
- If there are fewer than 2 reconstruction jobs running, the job starts the main module and changes status to MAIN-EXEC.
- When data reconstruction is done, the job checks whether the executable exit code was correct and whether all mandatory output files are present. If everything is OK it starts exporting data; if not, it goes to the ERROR state.
- First of all the job exports UNIX data files; while it does so it is in the MAIN-EXPORT-UNIX state.
- When it is done with UNIX files it tries to export HPSS files. It enters MAIN-EXPORT-WAITING and waits until there are pftp slots available.
- When there are, it changes status to MAIN-EXPORT-HPSS and starts to export data to HPSS.
- If everything is OK the job ends in status DONE.
- If something failed at any stage the job will end in one of three possible final states:
  - SUBMIT_FAILED - the job failed to be submitted to condor.
  - ERROR - the job failed, but the problem seems to be of a temporary nature (network breakdown, HPSS failure, ...) and it is likely that if the job is reset it will run correctly. Users are encouraged to reset all jobs in ERROR state and give them a second chance.
  - FATAL - the job failed and the failure is likely to be serious and irreversible (for example: bad exit code from the user's executable). Most likely resetting the job will not help, but the user should always investigate the cause, since CRS is not foolproof at determining the cause of failures.
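The state descriptions above can be condensed into a short Python sketch. The state names are taken from the list; the constant and function names are hypothetical, and the code only models the rules as described here, not the actual CRS implementation.

```python
# States visited by a job that runs without problems, in order.
# MAIN-SLEEP is left out because it is entered only when the node is busy.
NORMAL_FLOW = [
    "CREATED", "SUBMITTED", "STARTED",
    "MAIN-INIT", "MAIN-IMPORT-WAITING", "MAIN-IMPORT",
    "MAIN-EXEC",
    "MAIN-EXPORT-UNIX", "MAIN-EXPORT-WAITING", "MAIN-EXPORT-HPSS",
    "DONE",
]

FAILURE_STATES = {"SUBMIT_FAILED", "ERROR", "FATAL"}


def state_after_import(other_reconstruction_jobs):
    """A job sleeps when 2 other reconstruction jobs already run on the node."""
    return "MAIN-SLEEP" if other_reconstruction_jobs >= 2 else "MAIN-EXEC"


def run_succeeded(exit_code, mandatory_outputs_present, expected_exit_code=0):
    """True if the exit code is correct and every mandatory output exists.

    `mandatory_outputs_present` is an iterable of booleans, one per
    mandatory output file.
    """
    return exit_code == expected_exit_code and all(mandatory_outputs_present)


def reset_is_recommended(status):
    """ERROR is described as likely temporary, so a reset is encouraged."""
    return status == "ERROR"
```

For the example job definition, which declares 7 as the correct exit code, run_succeeded(7, [True], expected_exit_code=7) is true, while an exit code of 0 would count as a failure.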
Semaphores and HPSS flags
As the job moves through the various stages of execution it can be temporarily stopped by the user (or the HPSS crew) using a set of flags. There are six flags which can be set by users and one which is controlled by the HPSS crew.
The semaphores controlled by users can be changed from the "Semaphores" subpanel of the main CRS panel.
When a job reaches a particular stage of execution it checks the status of the corresponding flag. If the flag says "go" it continues execution. If it says "stop" it waits until it is allowed to continue.
- The ORBS (Oak Ridge Batch Software) flag. It is checked before the job is about to contact the HPSS interface machines. If the HPSS interface is down, users should close the corresponding semaphore.
- Unix get semaphore - it indicates to CRS whether the NFS disks are OK or not. It is checked before importing data from UNIX disks.
- Pftp get semaphore - it is checked before importing data from the HPSS cache by pftp.
- Job execution semaphore - it is checked before starting execution of user's executable. It should be set to "stop" if there are AFS problems.
- Unix export semaphore - it tells CRS if the NFS disks for exporting data are available.
- PFTP export semaphore - it tells CRS if it is OK to export data to the HPSS DST cache.
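The "go"/"stop" behaviour of a semaphore can be illustrated with a small polling loop. This is a sketch only: the guide does not describe how CRS stores or reads the flags, so the read_flag callable, the polling interval, and the function name are all assumptions.

```python
import time


def wait_for_semaphore(read_flag, poll_seconds=30.0, sleep=time.sleep):
    """Wait until read_flag() returns "go".

    read_flag is a hypothetical callable returning "go" or "stop".
    Returns how many times the flag had to be re-checked.
    """
    waits = 0
    while read_flag() != "go":
        waits += 1
        sleep(poll_seconds)  # the job is paused while the semaphore is closed
    return waits
```

A job would call such a function once before each controlled stage, for example before importing data from UNIX disks or before starting the user's executable.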
How to Reset Job
Any job at any stage of execution can be reset and sent back to CREATED state.
To reset job from panel: select the jobs you want to reset, then click reset.
There are several ways you can reset jobs using line mode commands.
- If you want to reset a couple of jobs: crs_job -reset job_name_1 job_name_2 ....
- If you want to reset all jobs in a particular status: crs_job -reset_status status (for example: crs_job -reset_status ERROR - will reset all jobs in ERROR state).
- If you want to reset a particular list of jobs, create a list of their names in a text file, one name per line (# at the beginning of a line denotes a comment), and then run: crs_job -reset_from_file fn
The last command is useful in conjunction with crs_job -stat. You can redirect the output of crs_job -stat to a temporary file, edit it leaving only the jobs you want to reset, save the file and then run crs_job -reset_from_file.
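The file format used by crs_job -reset_from_file (one job name per line, # at the beginning of a line for comments) can be checked before use with a sketch like the one below. The function name is hypothetical, and skipping blank lines and ignoring surrounding whitespace are assumptions; CRS itself may be stricter.

```python
def read_job_list(text):
    """Return the job names found in the contents of a reset list file."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if not line:              # assumption: blank lines are ignored
            continue
        if line.startswith("#"):  # comment line
            continue
        names.append(line)
    return names


sample = """# jobs to give a second chance
job_a_20040101
# job_b is left alone for now
job_c_20040102
"""
print(read_job_list(sample))
```

The job names in the sample are made up for illustration.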
How to Check Job Status
Once the job is created it will change its status as it goes through various stages of execution. In order to learn about the status of a particular job do:
crs_job -stat | grep job_name
or, using the crs_panel, click "refresh" and find the name of the job to read its status. If the crs_panel has thousands of jobs, finding a particular name might be hard. You can use the "sort" buttons to sort the jobs by name, timestamp, or status (and invert the sorting order). If that does not help, you can type the job name (or part thereof) into the entry field in the lower left corner of the panel and then click "select jobs". The jobs whose names contain the string you typed will be highlighted.
To check the progress of a particular job you can select it from the panel and then look at its crs logfile using the "crs logfile" button. The log is human readable and allows you to see what happened to the job recently.
You can also select a running job and click "spy execdir" to see contents of its execution directory. You can then select the job files and peek at their content. The "get top" and "get ps" allow you to execute top and ps commands on machines on which the selected job is running.
How to Get Fast Status of the CRS Production
From line mode: run the farmstat command.
From main panel: click "production status".
You will get information on how many jobs are in each state.
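The kind of summary that farmstat gives, a count of jobs per state, can be sketched in Python as follows. The sketch assumes the job names and statuses have already been obtained as pairs; the output format of crs_job -stat is not described in this guide, so no parsing is attempted, and the function name is hypothetical.

```python
from collections import Counter


def count_jobs_by_status(jobs):
    """Count jobs per status; `jobs` is an iterable of (name, status) pairs."""
    return Counter(status for _name, status in jobs)
```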
CRS Line Mode Commands
CRS line mode commands are invoked using script
crs_job [command] [options]
A listing of available commands can be obtained by
The available options are:
-stat : get information about known jobs
-stat_show_machines : show status of each job, but instead of status time show machine on which this job runs
-stat_show_problem : show status of each job, and, if the job has a problem, short description of the problem.
-submit jobname : submit job.
-submit_all : submit all jobs in CREATED status
-create_and_submit job_description_file : create and start a job. (The job_description_file can include wildcards). This command is obsolete and should not be used.
-block job_name : block a CREATED job, so that submit daemon ignores it
-unblock job_name : unblock a previously BLOCKED job
-block_created : block all jobs in CREATED state
-unblock_blocked : unblock all jobs in BLOCKED state
-crs_logfile job_name : print content of crs log file of job job_name
-spy_execdir job_name : show content of job work directory (job must be in MAIN* state)
-cat_stdio job_name : show content of stdio file
-tail_stdio job_name : tail content of stdio file
-cat_stderr job_name : show content of stderr file
-tail_stderr job_name : tail content of stderr file
-kill jobname : kill job
-kill_status status : kill all jobs with given status
-archive jobname: archive job jobname
-archive_done : archive all jobs in DONE status
-reset jobname : reset job, bring it to CREATED state
-reset_status status : reset jobs in given status
-reset_from_file fn : reset jobs from file fn
-kill_from_file fn : kill jobs from file fn
-create job_description_file [-qn] [-pn] [-drop]: create a job from the job description
-qn=submit job to queue n; -pn=give job priority n; -drop=allow drop queue
-save_for_debug job_name : save a copy of the job to temporary storage so that it can be debugged later
-show_machines : show the status of the farm machines, as seen by condor
-show_queues : show the status of the farm queues, as seen by condor
-show_crs_jobs_per_machine : show number of CRS jobs per machine
-show_crs_jobs_per_queue : show number of CRS jobs per queue
-get_pftp_link_limits : print the maximum number of pftp links
-change_number_of_input_links : change the max allowed number of input links. This will adjust the allowed number of output links as well
-recent_errors : show list of errored jobs, status time, problem description; order jobs by status time
-get_jobs_with_missing_jobdir : print jobs which have missing job directory
The job_description_file can include wildcards
CRS Panel - Main Panel
The CRS panel serves as a GUI for job control. To start it, execute the crs_panel command.
will give you list of options.
The main panel consists of a listbox which shows list of jobs known to CRS and their statuses. Buttons in the "Job commands" allow user to execute commands which relate to selected jobs. Buttons in "System commands" column allow user to inspect the status of the system.
Buttons at the bottom of the page allow the user to control the flow of the production.
CRS Panel - How to Select Jobs.
Jobs can be selected using mouse (left button click).
To select a range of jobs use left mouse button+shift.
To select individual jobs use left mouse button+ctrl button.
If you would like to select jobs which contain a particular string in their name, go to the entry field in the lower left corner of the panel. Type (or paste) the string into that field. Click "select jobs". Jobs with names that contain the string will be highlighted.
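The effect of "select jobs" amounts to a substring match on job names, which can be sketched as below. The function name is hypothetical, and the sketch assumes the match is case-sensitive, which the guide does not specify.

```python
def select_jobs(job_names, text):
    """Return the job names that contain `text` anywhere in the name."""
    return [name for name in job_names if text in name]
```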
CRS Panel - Job Commands
Buttons in this column relate to individual jobs. Going from the top down you will see buttons which sort jobs according to their name, time, status (ordered according to the logical job flow), execution host and queue. The "inverse sort" button inverts the current sorting order.
"crs logfile" button displays the crs logfile of the selected job. "list job files" lists the content of the job directory on the submit machine. Once this option is selected the user can peek at the content of individual job files.
"submit job" submits a CREATED job to condor and should not be used. "reset job" resets selected jobs. "Archive job" deletes a completed job from CRS but stores some of its log files in an archive directory. (The archive directory should be purged from time to time, or it will fill the disk).
"kill job" kills a job and deletes it from CRS.
"show job details" shows some information about the job status.
"spy execdir" allows user to look at the content of execution directory on the machine on which the job runs.
"get top" and "get ps" buttons execute top and ps commands on the host on which the selected job runs.
CRS Panel - System Commands
- "show machines" displays a panel with information about CRS machines.
- "show archive" - shows the list of jobs which were done using the CRS system in the past, and their results.
- "spy hpss server" - allows the user to peek into the machine which serves as the interface to HPSS. It opens a subpanel which lists HPSS requests known to CRS and their statuses (INPUT/WORKING/OUTPUT). Requests for which the parent job has been deleted are listed as ORPHANED. The user can delete the orphaned requests by clicking "delete orphaned". The panel also allows the user to send "ping" signals to HPSS daemons to check if they are alive.
- "Show I/O files" - shows files belonging to selected jobs and their status.
- "Show HPSS requests" - show status of HPSS requests for selected jobs.
- "Show PFTP links" - shows status of PFTP links for selected jobs.
- "Adjust PFTP links" - normally each experiment is assigned a quota of PFTP connections it is allowed to use at any given time. This is usually between 10 and 20. This quota can be shared between the "incoming" and "outgoing" connections in a way that is convenient for any experiment. This button opens a panel which allows user to change the number of "in" and "out" links. Changing the number of "in" and "out" links can be done by line mode command -change_number_of_input_links n as well.
- condor_q - execute the condor_q command
- condor analyze - execute "condor analyse" command for selected jobs.
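The sharing of the PFTP quota between "in" and "out" links, as described for "Adjust PFTP links", can be illustrated with a short sketch. It assumes that the total quota stays fixed and that the output links are simply whatever the input links leave over; the guide only says that changing the input links adjusts the output links as well, so this rule and the function name are assumptions.

```python
def split_pftp_links(quota, input_links):
    """Return (input_links, output_links) for a fixed total quota.

    Assumption: output links = quota - input links.
    """
    if not 0 <= input_links <= quota:
        raise ValueError("input_links must be between 0 and the quota")
    return input_links, quota - input_links
```

With a quota of 20 links, giving 12 to input would leave 8 for output under this assumption.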
These buttons are at the bottom of the main panel.
The first two buttons are meant to simplify navigating among the jobs:
- "select jobs" - this button is used to select jobs which contain a particular string in the name. Let us assume that you want to select all jobs which have the string "abc" in the name. Type "abc" in the input field next to the "select jobs" button. Then click "select jobs". All jobs which have "abc" as part of the name will become highlighted.
- "print selected" - opens a text window with the names of highlighted jobs. The names can then be cut and pasted into any text file.
The buttons on the loader panel:
- Loader status/loader enable, loader disable - give status of the loader, start and stop it.
- Load by name/ load by creation time - chooses if loader should load jobs according to their names or creation times.
- Buttons below decide from which queues the loader should pick jobs. To stop loading from a particular queue depress its button.
The "production status" button shows a snapshot of the production. It does the same thing as the "farmstat" command in line mode.
"production history" gives a list of jobs which were executed by CRS, with their statuses and times of completion.
"for experts" button opens a panel which gives the user some commands to check the status of CRS daemons. From that panel you can start/stop the loader daemon (this can be done from the loader panel as well), check the status of and start/stop the logbook manager daemon, and ping the HPSS daemons.
"refresh" button refreshes the status of jobs shown in CRS panel.
"Semaphores" button opens the semaphores panel. From this panel users can open/close the production semaphores.
"Help" button gives listing of available help.
"Exit" button closes CRS panel.