How to trace the end user from a failed condor job

by Xin Zhao last modified Mar 22, 2013 05:02 PM
  • The Question: given a known local condor batch job id and the time it was killed, for whatever reason (e.g. exceeding the memory usage limit), how can one find out which end user submitted the job?
  • different cases
    • the simplest case
      • if the job is still running when it is killed, the process tree information shows its Panda Job ID; the Panda Job ID leads to the Panda web page, which shows the end user information for the job.
      • our condor kill script attaches the stdout of the ps tree and emails it to the relevant people.
    • within one day after the job is killed
      • within this time period, condor history files are normally still kept on the CEs and submit hosts. CEs have ~3 days of history, and submit hosts may have < 1 day of history, depending on the intensity of job submission.
      • on each CE, run "condor_history | grep <local condor job id>" to find the CE that the job ran through.
      • on that CE, run "condor_history -l <local condor job id>", which provides the following info:
        • panda queue name
      • one problem: once the job is gone, no globus job log is left behind, so there is no way to trace the job back to the submit host.
      • another problem: even if we know the condor-g job id, how do we find the real job id? One way is to check the condor-g job log, if it still exists.
    • more than one day has passed after job kill
      • the Panda DB is the only source for tracing the job. There are several ways of doing it:
        • as shown in the following example, the Panda DB records where a job ran on a local worker node (<slot number>@<hostname>):
        • modificationHost: 4@acas0188.usatlas.bnl.gov

          combined with the timestamp records, this lets one find the corresponding job ID by querying the Panda DB.

        • use the Panda per-node page, which shows jobs in different states from the last 12 hours or longer:
        •  http://panda.cern.ch/server/pandamon/query?jobsummary=site&site=BNL_CVMFS_1
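
The two manual lookups above (searching each CE's condor history, and splitting a modificationHost value into slot and hostname) can be sketched as small shell helpers. This is only a minimal sketch, not the site's actual tooling: the function names `split_modification_host` and `find_ce_for_job` are hypothetical, the CE host names are supplied by the caller, and it assumes non-interactive ssh access to each CE with `condor_history` on the PATH there.

```shell
# Hypothetical helpers; names and the ssh-based access are assumptions.

# Split a Panda modificationHost value (<slot number>@<hostname>)
# and print it as "<slot number> <hostname>".
split_modification_host() {
    case "$1" in
        *@*) printf '%s %s\n' "${1%%@*}" "${1#*@}" ;;
        *)   echo "expected <slot number>@<hostname>, got: $1" >&2
             return 1 ;;
    esac
}

# Print every CE whose condor history mentions the given job id.
# Usage: find_ce_for_job <local condor job id> <CE host> [<CE host> ...]
find_ce_for_job() {
    job_id=$1
    shift
    for ce in "$@"; do
        # -F: treat the id as a fixed string (the "." is not a wildcard)
        # -w: match whole words only, so 123.0 does not match 1123.0
        if ssh -n "$ce" condor_history | grep -qwF -- "$job_id"; then
            echo "$ce"
        fi
    done
}
```

For example, `split_modification_host 4@acas0188.usatlas.bnl.gov` prints `4 acas0188.usatlas.bnl.gov`. Once `find_ce_for_job` has printed a CE, running "condor_history -l <local condor job id>" on that CE gives the full job record.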
