
Minutes: 8/25/2011

by John S. De Stefano Jr. last modified Aug 25, 2011 03:35 PM
Notes from the ATLAS Frontier meeting on August 25, 2011.

Participants: Florentin Bujor, John DeStefano, Alastair Dewhurst, Dave 
  Dykstra, David Front, Fred Luehring, Andreas Petzold, Emmanouil 
  Vamvakopoulos, Hans von der Schmitt
Early check-in: Andrew Wong
Apologies: Elizabeth Gallas

*** Site status and issues: ***

BNL:
- Upgraded launchpad Tomcats to v6.0.33 per vulnerability reports and fixes
- Two downtimes/degraded events reported by SLS:
  * One "unavailable" report (15 Aug): SLS received no query reply from one 
    launchpad node under heavy load, and its queries to the other node timed 
    out
    - 200+ active threads reached, query times up to 5 secs at peak load
    - SLS requests not logged during peak activity (~20 mins)
    - Service remained available and served requests throughout incident
    - Longest queries were partial reconstruction jobs, completed successfully
    - SLS queries on 2nd node completed but took ~60 secs, timed out on SLS 
      side
    - Investigation to continue
  * One "degraded" report as Squid restart failed on one launchpad node (18 
    Aug)
    - Cyclic pattern: cache validation timed out and restarted at 1-hour 
      intervals, which prevented Squid from accepting requests
      * Required manual kill of Squid processes, move of Squid cache, before 
        Squid would restart (and rebuild new cache) 
    - Not the latest Frontier Squid RPM (2.7.STABLE9-5.1), will upgrade when 
      package is tested for relocation, or when new hardware is deployed
    - The failed restart was a manual one, done to address memory swapping
  * Both incidents noticed and reported by shifters (ELOG #28463, Savannah 
    #85776)
- AWStats not resolving client IPs to host names
  * Custom script not configured to parse multiple entries on hosts behind 
    load balancer
    - Modification needed to parse multiple, quoted values from source IP 
      field
      * Dave's code changes added to launchpads on 24 Aug
      * Additional changes needed to parse IP column into single IP entry
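
Illustrative sketch of the IP-column parsing described above (not Dave's 
actual code changes). It assumes the source IP field holds comma-separated, 
optionally quoted addresses, and that the first valid address is the one to 
keep; the function name first_client_ip is hypothetical.

```python
import ipaddress


def first_client_ip(field):
    """Reduce a multi-valued source-IP field to a single IP entry.

    Assumption: values are comma-separated and may be wrapped in double
    quotes. Returns the first value that parses as an IP address, or None.
    """
    for part in field.split(","):
        # Drop surrounding whitespace and any stray double quotes.
        candidate = part.strip().strip('"').strip()
        if not candidate:
            continue
        try:
            return str(ipaddress.ip_address(candidate))
        except ValueError:
            # Not an address (e.g. "unknown" or "-"); try the next value.
            continue
    return None
```

If the address to keep turns out to be the last entry rather than the first, 
iterating over reversed(field.split(",")) would be the only change needed.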

CERN:
- Alastair requested ToA change to have CERN launchpads fail-over to RAL
  * Savannah #85421, complete
- Tomcat upgrade (Savannah #122978)
  * Worked on test instance, but production Tomcat failed to restart
  * Related outage noted by shifters (Savannah #85879)
- AGIS installation (Savannah #122704)
- Changed ATLAS_FRONTIER_READER password

KIT:
- No issues to report presently
- Will continue investigation of package installation problem

Lyon:
- Configuration of new launchpad underway
  * Monitoring, testing requested
    - Working with Dave to add to AWStats
      * Problems with configuration files: $AWSTATS_CONFIG and $DirData
        - User must customize these values manually
    - Working with Florentin to add to SLS, MRTG
      * Test system will not be added to SLS to reduce confusion
    - Would appreciate Nagios plugin information
      * CMS Online plugin may be useful
- Upgraded Tomcat to 6.0.33
  * Installed as root; had to restore user permissions to directory by hand
  * Also changes to local directories, sym-link to new directory
    - Did not copy necessary servlet config files 
    - Known deficiency of current servlet RPM, to be addressed
  * Installs user cron in non-standard location

RAL:
- IT cloud following up on Alastair's observations of Squid problems, direct 
  WN connections to RAL launchpad (GGUS #73397)
  * No direct connections seen in past week
- Tomcat not yet upgraded, will do soon
- Working to get AWStats in place for shifters: added instructions to shifter 
  checklists

TRIUMF:
- No issues to report

*** Deployment: ***

CERN T0:
- Continued discussion of Frontier testing for T0 processing (Hans)
  * From Elizabeth's notes: "Tier-0 will try Frontier during the Technical 
    Stop"
  * T0 prohibited from dynamic growth in response to job requests
  * Frontier proposed to prevent direct T0 and CAF database access
  * Technical stop ends on 2 Sep
  * Production stage may require separate Frontier instance for T0
    - T0 status somewhere in between "online" and "grid" stages
    - Not sure Frontier is appropriate for express processing (small 
      percentage of load)
      * One hour caching timeframe also would not work for express
      * Express may be considered later
  * Plan: use existing test instance at CERN for testing, and coordinate test 
    with David's development
    - Florentin may have a dedicated test server from Serguei, will follow up

Downtime requests:
- Request from ATLAS: "notify AMOD couple days in advance (if possible), then 
  send reminder 1 day in advance, and then another one 2 hrs before the 
  intervention start"
- Resolution: for known interventions, one notification to AMOD and ADCoS, one 
  day in advance:
  * ADCoS: atlas-project-adc-operations-shifts@cern.ch
  * AMOD: atlas-adc-expert@cern.ch
- Central request to store site downtime in AGIS: 
  https://savannah.cern.ch/support/?113224

Investigating site issues:
- "Bad" queries must be investigated from perspective of both DBA and Frontier 
  admin
  * Access to site logs is required
  * Investigation given higher priority than code development

*** Development: ***

Monitoring:
- SLS updated with meaningful descriptions, logic for multiple site launchpads
- Alastair created shifter instructions, informed ADCoS
- Alastair's instructions for AWStats configuration:
  https://twiki.cern.ch/twiki/bin/view/Sandbox/AlastairDewhurstSandbox
- Problems with MRTG, interpreting and fixing entry generation
  * Will start from scratch, generate data via AGIS
- Sites and monitoring configuration in general is falling into place
  * Nagios monitoring via fnget has been fixed; sites finally noting and 
    addressing problems

Packages:
- Plan to release new frontier-squid rpm version that uses all tarball files
  * Should fix reported issue of erasing system files at upgrade/install time
    - Would appreciate additional review of existing code [repo link]
  * Only .spec file, post-install script to be added to underlying code
    - If log directory does not exist pre-installation, RPM does not install
- Plan to fix issue with upgrading frontier-tomcat rpm version to 6.0.33-1 via 
  quattor
- Servlet RPM will be changed to support multiple instances

*** A.O.B.: ***
None.