Minutes: 7/14/2011

by John S. De Stefano Jr. last modified Jul 14, 2011 01:40 PM
Notes from the ATLAS Frontier meeting on July 14, 2011.

frontier-minutes-20110714.txt — Plain Text, 6 kB (6580 bytes)

Participants:   Florentin Bujor, John DeStefano, Alastair Dewhurst, 
                Dave Dykstra, Elizabeth Gallas, Fred Luehring, 
                Roman Sorokoletov 
Early check-in: Andrew Wong

*** Site Status and Issues: ***

BNL:
- No news

CERN:
- RPM upgrades to production machines
  * RPMs not complete with respect to Quattor/CDB: 
    - Manual AWStats password files
    - Manual servlet database connection details
  * To do: progress on multiple servlet configuration, change default 
    installation prefix to `/`
- Problem with disk on frontier.cern.ch
  * frontier2 switched to temporary primary server while disk on frontier1 was 
    replaced
    - Same disk on backup server has failed: replacing now
- Problem reported with Frontier, COOL, database access at CERN
  * Savannah #68644: https://savannah.cern.ch/bugs/?68644
  * Error ORA-12514 reported: ATLR was overloaded by reconstruction tests
  * New, unusual configuration at T0 affected multiple services
    - Prompted discussions of possibly using Frontier at T0 to buffer/stage 
      incoming client requests
      * May require additional site Squids
      * Dave should be involved in any discussions of implementation
- CDB templates being shared more between CMS and ATLAS
  * Involves launchpad and proxy Squid configurations at CERN
    - Configurations changed such that test machines have their own templates
  * Transition went relatively smoothly
- ATLAS service documentation:
  https://twiki.cern.ch/twiki/bin/view/Atlas/ATLASFrontierSquid
- Savannah #83656
  * Task 433920: rel 16.6.7.1 Athena seg fault in libfrontier_client 2.7.14
  * https://savannah.cern.ch/bugs/?83656
  * Appears to have been a transient network issue

KIT: N/A

LYON: N/A

RAL:
- Running latest Frontier servlet on both production servers
  * Alias change will include both servers in production
  * Both permit remote access for authorized users (via DN)
  * Manual process for adding new server to central monitoring at CERN
  * Need to set up AWStats for new server
    - Florentin will do this eventually; Dave and David currently know how
- Things running smoothly otherwise
  * Transparent problem with memory on one machine

TRIUMF:
- No issues to report
- Discussion (via email) of local Frontier launchpad Squid access policy
  * Project recommendation: permit all incoming, and restrict destination to 
    local Frontier service
    - Site Squids may restrict incoming requests to specific sites/subnets

*** Development, Monitoring, Testing: ***

New Tomcat vulnerability reported
- No patch or upgrade available yet
- Should not affect Frontier servlet instances

Monitoring:
- New Savannah tickets by Dario for tracking Florentin's monitoring 
  improvements to AWStats, MRTG, SAM, SLS, SSB [21274-21278]
  * Improvements to SLS scripts and connectivity: ~80% complete
    - Updates to firewall rules at TRIUMF to permit test connectivity
    - Working to integrate AGIS into SLS scripts, merging CMS solution
      * Differences from CMS distribution: one central location versus six 
        launchpad locations (including CERN "T0")
      * Not currently possible to pull all information from AGIS
      * Recommendation: get scripts working first and replace them later to 
        pull information from AGIS
  * Will work on MRTG next: will also pull data from AGIS
  * HammerCloud can test Conditions access: has failed in some UK T2s
    - Tests for at least one successful connection
    - Serves a similar purpose to the previous SAM tests, which have been
      disabled, except that it does not test all backup combinations
      * General tests migrated to Nagios; Frontier may not have been migrated
        - Included in Florentin's ticket to migrate from SAM to Nagios
    - Can alert sites to otherwise unknown fail-overs and configuration 
      problems

AGIS:
- Florentin working on extracting Frontier and Squid node information from 
  AGIS
  * Elizabeth's discussion notes:
    * Snapshot of Tiers of ATLAS currently used to populate the DB
    * DB schema remains a black box
    * We are supposed to test functionality of API
    * AGIS to be used by Frontier for both monitoring and configuration

New Frontier client (v2.8.3):
- Recommended that the latest client be tested with the latest LCG software
  release
  * Was not included in 60c release, but now included in nightly builds
- Release notes (v2.8.3):
- Fix bug that caused 'unzip unknown error' under relatively rare 
  conditions.  The error code was actually reporting a normal
  condition; it just wasn't being recognized as such.
- Release notes (v2.8.2):
- Update the retry strategy so that when an error is not clearly a
  server error (server errors imply that the proxy was good), try every
  proxy with every server in turn.  Previously, non-server errors
  would cause the strategy to do direct connections to all servers
  after every proxy failed with just the first server.  The old
  strategy was fine for CMS Offline where the first server is a
  round-robin between all servers, but not good for CMS Online or
  ATLAS where that is not the case.
- Make protocol error-induced reloads try "Cache-control: max-age=0"
  before "Pragma: no-cache", because that's gentler on the servers
  since it only asks for the modification times to be immediately
  checked (that is, it revalidates the cache).  $FRONTIER_FORCERELOAD
  variable still by default uses "Pragma: no-cache", but now if the
  value begins with "soft" (that is, "softlong" or "softshort") it
  uses the more gentle refresh.
- Do md5 calculations and unzipping (when needed) as the data is
  received rather than waiting until all data is received.  This
  gives a slight reduction in elapsed time because those calculations
  can be interleaved with I/O.
- Turn low-level error messages into just debug messages when they are 
  caught at a higher level and retried.  At the retrying level they
  were already being printed as warning messages but the output was
  confusing because it showed both 'error' and 'warn'.
- Include ServerError in the types of C++ exceptions that may be thrown.
  Previously it would have been mapped to an UnknownError exception.
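
Illustration of the v2.8.2 retry-strategy change for non-server errors. This
is a rough sketch of the attempt ordering only, not the client's actual code.
The function names are hypothetical. Two things are assumptions not stated in
the release notes: the nesting order (all proxies for one server before moving
to the next server) and direct connections still following the proxy attempts
in the new strategy.

```python
def old_attempt_order(proxies, servers):
    """Pre-2.8.2 order: every proxy with only the first server, then direct."""
    via_proxy = [(p, s) for s in servers[:1] for p in proxies]
    direct = [(None, s) for s in servers]  # None stands for "no proxy"
    return via_proxy + direct


def new_attempt_order(proxies, servers):
    """2.8.2 order: every proxy with every server in turn.

    Assumption: servers form the outer loop, and direct connections
    are still attempted last.
    """
    via_proxy = [(p, s) for s in servers for p in proxies]
    direct = [(None, s) for s in servers]
    return via_proxy + direct


if __name__ == "__main__":
    for attempt in new_attempt_order(["p1", "p2"], ["s1", "s2"]):
        print(attempt)
```

With two proxies and two servers, the old order never pairs a proxy with the
second server, which is the gap noted for CMS Online and ATLAS.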
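
Illustration of the v2.8.2 reload behaviour. This is a minimal sketch of which
HTTP header each kind of reload sends, based only on the release note above.
The function names are hypothetical. The value "long" and the handling of an
unset variable are assumptions, since the notes list only the "soft" values.

```python
def forcereload_header(value):
    """Map a $FRONTIER_FORCERELOAD value to the request header it implies."""
    if not value:
        return None  # assumption: unset or empty means no forced reload
    if value.startswith("soft"):
        # "softlong" / "softshort": only revalidate the cached copy
        return ("Cache-control", "max-age=0")
    # any other value keeps the default hard reload
    return ("Pragma", "no-cache")


def protocol_error_reload_headers():
    """Headers tried, in order, when a protocol error triggers a reload."""
    return [("Cache-control", "max-age=0"), ("Pragma", "no-cache")]


if __name__ == "__main__":
    for value in ("", "softshort", "long"):  # "long" is an assumed value
        print(repr(value), "->", forcereload_header(value))
```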
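
Illustration of the v2.8.2 change to md5 and unzip handling. The sketch below
shows the general technique of hashing and inflating data chunk by chunk as it
arrives, using Python's standard library rather than the client's C code. The
function name is hypothetical. It assumes a zlib-compressed payload and a
checksum computed over the bytes as received, neither of which is confirmed
by the notes.

```python
import hashlib
import zlib


def consume_stream(chunks):
    """Hash and inflate a payload incrementally, one received chunk at a time."""
    digest = hashlib.md5()
    inflater = zlib.decompressobj()
    pieces = []
    for chunk in chunks:
        digest.update(chunk)  # assumption: checksum covers bytes as received
        pieces.append(inflater.decompress(chunk))
    pieces.append(inflater.flush())  # emit anything still buffered
    return digest.hexdigest(), b"".join(pieces)


if __name__ == "__main__":
    raw = zlib.compress(b"conditions payload " * 100)
    # Simulate network reads of 16 bytes each.
    chunks = [raw[i:i + 16] for i in range(0, len(raw), 16)]
    md5_hex, data = consume_stream(chunks)
    print(md5_hex, len(data))
```

Because each chunk is processed as soon as it is read, the CPU work can
overlap with waiting on the next read, which is the saving the note describes.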

*** A.O.B.: ***
- ATLAS Software & Computing talks, on Monday, 18 July:
  https://indico.cern.ch/conferenceDisplay.py?confId=119170
  * Alastair's talk: Frontier News: 15:20 CEST
    - Please contact Alastair with suggestions for inclusion
  * David's talk (remote): Frontier RPMs: 15:35 CEST
  * Florentin's talk: Frontier Monitoring: 15:50 CEST
  * Also note Vakho's Geometry talk at 14:50 CEST