
Linux Farm

by admin last modified Apr 06, 2009 04:51 PM

General

The Linux Farm at the RACF provides the bulk of the computational power for the RHIC experiments and for the U.S. Tier 1 Center of the ATLAS experiment at the LHC.

The Linux Farm is loosely subdivided, for historical reasons, into CAS (analysis) and CRS (reconstruction) sub-farms. The CRS farm is mostly used for reconstruction: processing raw event data (the raw bits and bytes from the detectors) into reconstructed event data (tracks, hits, collisions). The CRS farm is not generally available for interactive use. The CAS farm is mostly used for analysis of the reconstructed events and is a mix of interactive and batch systems.

The sheer number of identical nodes in the Linux Farm ensures the high availability of the cluster. Users should access a different node if their preferred node is not available.

Further specific information about the Linux Farm hardware is available to authenticated users.

Facility Performance Monitoring

All Linux Farm nodes are available via the Condor batch system. Historical data on the usage and performance of the Condor batch system can be viewed on our public Condor Monitoring page.

Historical data on the usage and performance of the Linux Farm can be viewed on our public Ganglia page. This page requires a valid RACF password for access. For security reasons, this page can only be accessed from a browser launched by a system within the BNL domain.

The Linux Farm consumes a large portion of the available electrical power and requires significant cooling resources. For this reason, ambient temperature at the facility is an important parameter to monitor. Temperature monitoring data is available to authenticated users.

Authenticated users may also view node_guard data for a listing of jobs killed due to excessive memory usage.

Specific Information for users of the U.S. Tier 1 Center for ATLAS

Due to present operational requirements, the U.S. Tier 1 Center ATLAS Linux Farm has been configured with CAS systems only.

Usage of CAS resources is determined by Tier 1 management in consultation with the US ATLAS Computing Group.

The Condor batch system is used to manage the Tier 1 Center Linux Farm (see current policy).

Interactive (non-batch) processes on the ACAS hosts that consume more than 5000 minutes of CPU time are automatically terminated. Processes killed in this manner are logged to /var/log/find_proc.log on the local system.
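
Users who want to check how close a long-running interactive process is to the 5000-minute limit can compare its cumulative CPU time against that threshold. The following is a minimal sketch, not the facility's actual enforcement tool: it assumes the CPU time is given in the "[DD-]HH:MM:SS" format printed by the cputime column of ps on Linux, and the function names are hypothetical.

```python
# Illustrative sketch only; the facility's own monitoring script is not shown here.
CPU_LIMIT_MINUTES = 5000  # limit stated above for interactive ACAS processes


def cputime_to_minutes(cputime: str) -> float:
    """Convert a ps-style cumulative CPU time '[DD-]HH:MM:SS' to minutes."""
    days = 0
    if "-" in cputime:
        # An optional leading 'DD-' field carries whole days.
        day_part, cputime = cputime.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in cputime.split(":"))
    return days * 24 * 60 + hours * 60 + minutes + seconds / 60


def exceeds_limit(cputime: str, limit_minutes: float = CPU_LIMIT_MINUTES) -> bool:
    """Return True if the CPU time is strictly greater than the limit."""
    return cputime_to_minutes(cputime) > limit_minutes


if __name__ == "__main__":
    # Sample values, not real measurements.
    for sample in ("00:05:30", "3-11:20:00", "3-11:20:01"):
        print(sample, cputime_to_minutes(sample), exceeds_limit(sample))
```

The input strings can be taken from a command such as ps -o pid,cputime,comm -u $USER. The comparison is strict because the policy applies to processes that consume more than 5000 minutes.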

Availability Status of Individual Servers

The Linux Farm Alerts list contains color-coded information on the status of individual nodes and related services. Authentication is required for access. This list is automatically updated every 5 minutes.
