
RHIC Condor Policy

by Chris Hollowell last modified Aug 02, 2012 03:43 PM
Detailed Information on Condor Configuration for the RHIC Experiments

Overview

Several Condor configurations are in use on the RCAS/RCRS processor farm.  Each configuration groups together hosts that advertise similar ClassAds (parameters) and accept the same subset of job types.  To view the details (including system memory and CPU benchmark ratings) of the hardware resources allocated to a particular configuration/host group, open the Condor Monitoring Page (RACF authentication required) and select "List Node Configuration" in the lower left margin.

The General Queue

By default, RHIC jobs submitted to Condor run in the general queue, which lets them make opportunistic use of other experiments' unused processing resources.  While this queue gives you access to additional batch hosts, jobs submitted to it may be preempted after 2 hours of execution if the occupied slots are required by the experiment that owns the resource.  To avoid the general queue, add CPU_Experiment == "MYEXPERIMENT" (replace MYEXPERIMENT with your experiment name: brahms, phenix, phobos, or star) to the Requirements line of your job description file, for example Requirements = (CPU_Experiment == "phenix").  See the General Queue Page for more information.
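As an illustration, a minimal job description file that stays out of the general queue might look like the sketch below. Only the Requirements line comes from the policy above; the executable and the output/log file names are hypothetical placeholders, and the vanilla universe is an assumption about a typical batch job.

```condor
# Minimal JDF sketch: run only on the experiment's own nodes (not the general queue).
# The executable and file names below are hypothetical placeholders.
Universe     = vanilla
Executable   = my_analysis.sh
Output       = my_analysis.$(Cluster).$(Process).out
Error        = my_analysis.$(Cluster).$(Process).err
Log          = my_analysis.$(Cluster).log

# Restrict matching to hosts owned by one experiment; replace "star" with yours.
Requirements = (CPU_Experiment == "star")

Queue
```

Passed to condor_submit, a file like this would match only hosts owned by the named experiment. It gives up access to other experiments' idle slots in exchange for avoiding the general queue's 2-hour preemption rule; the experiment's own limits described below still apply.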

Utilizing a Specific Host Group

To select the execution host group for your jobs, modify one or more ClassAds in the job description files (JDFs) you pass to condor_submit.  The parameters affecting host group selection are unique to each experiment and are defined in detail in the next section.  Before submitting a JDF that requests a particular host group, ensure that you are not submitting to the general queue.

Configuration/Policy

PHENIX

Host Group  JDF Requirements                    Condor Memory Limits   Wall Clock Time Limit        Host Group Description/Priority
cas2        CPU_Speed==2, CPU_Type=="cas"       2.5 GB                 3 days                       User Analysis
cas4        CPU_Speed==4, CPU_Type=="anatrain"  2.5 GB                 3 days                       User Analysis/Analysis Train
crs5        CPU_Speed==5, CPU_Type=="highmem"   2.5 GB                 3 days user / 4 days others  Large Memory Validation
crs6        CPU_Speed==6, CPU_Type=="crs"       2.5 GB / None for CRS  36hr user / 3 days crs       User Analysis/Reconstruction
crs6res     CPU_Speed==6, CPU_Type=="crs"       2.5 GB / None for CRS  36hr sim / 3 days others     Reconstruction/Simulation/Analysis Train

Policy

The Analysis Train, Simulation, and Reconstruction queues are reserved for special users, whose jobs are ranked higher in the configurations where they can run (the crs6res queue is not accessible to normal users).  On machines where user jobs coexist with these jobs (such as cas4/crs6), a user job is given only 36 hours of runtime before it can be preempted by CRS/anatrain jobs that want the slot.  Jobs that exceed the memory limit are evicted immediately, with no grace period.

Example

# Run on cas2 or cas4
Requirements = ( ( (CPU_Type == "anatrain") || (CPU_Type == "cas") ) && (CPU_Speed <= 4) && (CPU_Experiment == "phenix") )
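
Following the same pattern, a job can be steered to a single host group by matching both ClassAds listed for it in the PHENIX table. The sketch below targets the crs5 (large memory) group; it assumes your account is permitted to run there, which this page does not spell out.

```condor
# Run only on PHENIX crs5 nodes (CPU_Speed 5, CPU_Type "highmem").
# Assumption: your account is allowed to use this host group.
Requirements = ( (CPU_Type == "highmem") && (CPU_Speed == 5) && (CPU_Experiment == "phenix") )
```

According to the table, user jobs in this group are limited to 3 days of wall clock time and 2.5 GB of memory.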

STAR

General User-Based Policies

Host Group             Machine Status                 JDF Job_Type flag                           Time Limit (if Preempted)            User Groups
crs0, crs0interactive  CPU_Speed==0, CPU_Type=="cas"  "cas", "long" (1 slot)                      cas: 40hr (5hr), long: 10day (5day)  User Analysis
crs0h                  CPU_Speed==0, CPU_Type=="cas"  "cas", "long" (1 slot), "high" (1 slot)     above + high: None (5day)            User Analysis, High Priority
crs3                   CPU_Speed==3, CPU_Type=="crs"  "cas" (1/4 total slots), "crs" (all slots)  cas: None (3hr), crs: None (7day)    CRS, User Analysis
crs4, crs5             above but CPU_Speed==4 or 5    same as above                               same as above                        same as above

Policy

crs0* Machines:

User Analysis jobs (CAS jobs) run on these nodes.  Users can also access the special "long" and "high" queues to get longer runtimes or higher priority, although these queues are rather small compared to the "cas" queue.  CAS jobs preempt each other based on user-priority fair-share.

crs[345] Machines:

Reconstruction (CRS) jobs have priority on all slots and will preempt "cas" jobs.

"cas" jobs can run on 1/4 of all slots on these nodes when "crs" jobs aren't using them.

There are no "long" or "high" slots on these nodes.
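
To illustrate the rules for these nodes, a user analysis job could be directed at the crs3, crs4, and crs5 groups as sketched here. The ClassAd values are taken from the STAR table; folding them into a single range test on CPU_Speed is an assumption about how you might match all three groups at once.

```condor
# User analysis ("cas") job on STAR crs3/crs4/crs5 nodes.
# Such jobs can use at most 1/4 of the slots and can be preempted by "crs" jobs.
Requirements = ( (CPU_Type == "crs") && (CPU_Speed >= 3) && (CPU_Speed <= 5) && (CPU_Experiment == "star") )
+Job_Type = "cas"
```

The table lists no fixed time limit for "cas" jobs on these nodes, but a 3hr limit once preemption applies.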

Exceptions:

"crs" jobs on "cas" CPUs:

All "cas" (crs0*) configurations can run CRS jobs on up to 1/2 of their slots.  CRS jobs have priority there and are not subject to user-priority based preemption.

"starembd" user:

Allowed to run jobs with Job_Type == "crs" on crs3 machines, but subject to the same 3hr preemption as the "cas" jobs that can run there.  Also allowed to run the same jobs on crs0* nodes without being subject to time limits (although user-priority based preemption is still possible).

Example

# Run on crs0-type nodes
Requirements = ( (CPU_Type == "cas") && (CPU_Experiment == "star") )
+Job_Type = "cas"
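
The "long" queue mentioned in the policy is selected the same way, by changing the Job_Type flag. This sketch assumes no further authorization is needed to use the "long" slots, which this page does not state.

```condor
# Run on crs0-type nodes in the "long" queue (listed as 1 slot in the table).
Requirements = ( (CPU_Type == "cas") && (CPU_Experiment == "star") )
+Job_Type = "long"
```

Per the table, "long" jobs have a 10day time limit, or 5day if preempted.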