Minutes: 7/28/2011

by John S. De Stefano Jr. last modified Jul 28, 2011 04:55 PM
Notes from the ATLAS Frontier meeting on July 28, 2011.

frontier-minutes-20110728.txt — Plain Text, 8 kB (9012 bytes)

File contents

Participants: Florentin Bujor, John DeStefano, David Alastair, Dave Dykstra, 
              David Front, Elizabeth Gallas, Fred Luehring
Apologies:    Dario Barberis, Andreas Petzold

*** Site reports and issues: ***

AGLT2 (US):
- Problem with Squid after upgrade to latest version
  * Squid wouldn't start due to missing squid.conf file after upgrade
    - Problem with post-install awk script: known to David, will be fixed ASAP
    - Bob manually restored from squid.conf.old
    - Same issue seen at BNL some time ago but could not be reproduced

BNL:
- Log rotation failed on one site cache node (frontier04)
  * Disk partition ran out of space, required manual log deletion and restart
  * Cause remains unknown; not related to David's recent bug fixes
  * David to investigate sending error message to stderr so cron
    will send email, which can be forwarded to service admins
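
The stderr idea relies on standard cron behaviour: cron mails whatever a
job prints to the address in MAILTO. A minimal sketch of a rotation
wrapper follows. The function name `run_rotation`, the `rotate` callable
and the message text are hypothetical placeholders, not the actual
frontier-squid rotation script.

```python
import sys


def run_rotation(rotate, stream=sys.stderr):
    """Run a log-rotation callable and report any failure on `stream`.

    `rotate` is a hypothetical stand-in for the real rotation step.
    Returns a process exit status: 0 on success, 1 on failure.
    """
    try:
        rotate()
    except OSError as exc:
        # cron mails job output, so this line would reach the MAILTO address
        print(f"log rotation failed: {exc}", file=stream)
        return 1
    return 0
```

A cron job would call `sys.exit(run_rotation(...))` so that the failure
is visible both in the mail and in the exit status.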

CERN:
- AWStats stopped working on atlasfrontier2
  * Password file was not replaced after re-installation
    - This has to be put into sindes
    - atlasfrontier2 tomcat crashed
      * Servlet was restarted after a file from sindes was updated (even
        though its contents hadn't changed); the restart failed, possibly
        because the servlet has been restarting every day due to the
        sindes update
      * Sindes update now stopped
      * David will try to reproduce on test server
      * In the future, this file will be generated from a different file
        in sindes, only if the password changes
- Maximum number of threads reached on CERN launchpads
  * Twice in three days (19, 22 Jul)
  * 22 Jul instance caused by long-running query: returned > 1.5 million rows 
    (Gancho)
    - Inappropriate query sent directly by SCT developer extracting conditions 
      into root file
    - Investigation of proper handling of queries
      * Maximum permissible number of retrievable rows should be analyzed
      * David will do this and try to determine if JDBC can be configured
        to limit the number to prevent the Frontier server from crashing
        - Elizabeth suggested better monitoring of queries
          * Queries are logged, but not to any interface outside of system
            - Thread logging to be added to AWStats (low priority)
              * Possible to add long-term query logging of rows returned
          * After limit is added, queries approaching or exceeding limit 
            should trigger monitor alert
    - Investigate Frontier crash by recreating query on development server
      * Need to involve DBAs to monitor ATLR during testing
      * High priority
- Streams replication issue
  * Frontier ignored modification of new ISZT tables because the Frontier
    client reader account lacked read permissions (Savannah #84887)
    - Fix required dummy entry into (and removal from) database tables to 
      update modification times
      * ADC notified, has requested follow-up
- Bind variable investigation and discussion
  * Frequent queries from Frontier are similar except for varying values
    that could be handled as bind variables
    - Fix would be required in CORAL's FrontierAccess module
      * Change already implemented in Frontier server, but unused by
        client to date
        - To be investigated by Andrea, or David if necessary as a low 
          priority
          * May not result in significant increase in performance
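
The bind-variable point can be illustrated generically. The sketch below
uses Python's built-in sqlite3 module as a stand-in; it is not Frontier,
CORAL or Oracle code, and the table and column names are invented for
the example. With literals, each request produces a different statement
text; with a bind variable, the text stays constant and only the bound
value changes.

```python
import sqlite3

# In-memory database with a made-up table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE conditions (run INTEGER, payload TEXT)")
conn.executemany("INSERT INTO conditions VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])

runs = [1, 2, 3]

# Literal values embedded in the SQL: one distinct statement per run.
literal_sql = {f"SELECT payload FROM conditions WHERE run = {r}" for r in runs}

# Bind variable: a single statement text, reused with different values.
bound_sql = "SELECT payload FROM conditions WHERE run = ?"
payloads = [conn.execute(bound_sql, (r,)).fetchone()[0] for r in runs]

print(len(literal_sql), "distinct statements with literals")
print(payloads)
```

Whether this helps in practice depends on how the server handles the
statements, which is the open question noted above.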
      
KIT (Andreas via email):
- Servlet upgrade problem fixed by upgrading to Java 1.6
- No Frontier problems until last week, when a test ran into an HTTP
  timeout; we didn't find a fix, and after rebooting the Frontier machine
  it's fine again
- 3D database migration to new hardware today. Not yet finished due to
  problems reconnecting the streams

LYON:
- Multiple Oracle outages over several days
  * Either Frontier continued working for several hours during first database 
    outage, or SLS did not recognize Frontier outage
    - Possible server misconfiguration; requested logs and config files
      * Issue with SLS not recognizing outage: reading from cache
        - Should be addressed by Florentin's new scripts
      * May pursue via GGUS ticket requesting information on outage

RAL:
- AWStats on one production machine, log rotation working on the other
  * AWStats script for CMS required parameter changes for rotation frequency
  * Not yet configured at CERN to accept requests
    - Dave to configure on central server
- Log hits at RAL did not show significant fail-over requests from CERN
  * 1.5 million hits from SARA, KIT over weekend: few or none expected
    - SARA, KIT issues likely transient; CERN fail-over should be
      investigated (likely via CERN Nagios, or testing fail-over via
      ATLAS job run)
      * May result in a Savannah ticket to verify/fix CERN fail-over
  * CERN is using Lyon as a backup; that should be changed to RAL
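
Checking where hits come from amounts to counting access-log lines per
client. A minimal sketch, assuming the client address is the third
whitespace-separated field as in Squid's native access.log layout; a
customised logformat may place it elsewhere, hence the parameter. The
sample lines and addresses are made up, and mapping addresses to sites
(SARA, KIT, CERN) is left out.

```python
from collections import Counter


def hits_by_client(lines, client_field=2):
    """Count log lines per client host.

    `client_field` is the zero-based index of the client address among
    whitespace-separated fields (2 in Squid's native layout).
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) > client_field:  # skip truncated or empty lines
            counts[fields[client_field]] += 1
    return counts


# Made-up sample lines using documentation-only IP ranges.
sample = [
    "1311811200.123 45 192.0.2.10 TCP_MISS/200 512 GET http://example.invalid/a - DIRECT/192.0.2.1 text/plain",
    "1311811201.456 12 192.0.2.10 TCP_HIT/200 512 GET http://example.invalid/a - NONE/- text/plain",
    "1311811202.789 30 198.51.100.7 TCP_MISS/200 256 GET http://example.invalid/b - DIRECT/192.0.2.1 text/plain",
    "",
]

counts = hits_by_client(sample)
print(counts.most_common())
```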


TRIUMF: N/A

*** Development: ***

New Frontier client (2.8.4):
- Release notes:
  * Fix bug that caused outgoing http connections to remain open after
    the frontier client object was deleted, unless the last query
    happened to have been a large one (>16kbyte).  This bug has existed
    for 4-1/2 years, ever since persistent connections were implemented.
    Most production code only uses one frontier client object so that's
    probably why it wasn't noticed.  The problem was noticed in a CMS
    tool that created and deleted the frontier object many times.

ATLAS software packages:
- New releases of Frontier Squid (2.7.STABLE9-5.5) and AWStats (6.0-3)
  * Include log rotation bug fixes
  * May have re-introduced problem with generating squid.conf
- Timescale for full set of RPMs ready for all sites?
  * Generally ready, stable enough to be tested outside of CERN
  * May be necessary to involve David directly in testing
- Some progress with servlet configuration, not yet ready

Monitoring:
- Reusing and adapting CMS framework for SLS
  * Still receiving 403/forbidden and 500 errors from KIT via frontier1 (not 
    frontier2)
    - KIT admins claim to have added Frontier test nodes to ACLs 
      * Possibly still an IP access error due to alias
    - To update in wiki, and close SLS ticket
- Working with AGIS to test with MRTG
- AWStats needs work on central server to enable remote site reporting
  * John will send around his notes on configuring new sites at CERN
- Frontier server encoding responses
  * First three bytes of encoded responses differ from some servers (TRIUMF)
    - Suggestion: change monitoring tools (SLS) to skip these response bytes
      * Fixed by Florentin
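
The suggested comparison can be sketched as follows. This is a
hypothetical illustration, not Florentin's actual fix: the function name
and sample data are invented, and only the count of three bytes comes
from the discussion above.

```python
def same_payload(a: bytes, b: bytes, skip: int = 3) -> bool:
    """Compare two encoded responses, ignoring the first `skip` bytes."""
    return a[skip:] == b[skip:]


# Made-up responses that differ only in their first three bytes.
reference = b"AAAencoded-response-body"
from_other_server = b"BBBencoded-response-body"

print(same_payload(reference, from_other_server))
```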

*** Operations: ***

Savannah issues:
- 77447: Frontier squid installation problems in the UK
  * Complaints regarding RPM installation (customize.sh, using open-source 
    Squid instead, etc.)
  * Ticket responded to, awaited input, closed, then re-opened
  * Non-default user request now addressed, as of 
    frontier-squid-2.7.STABLE9-5.4
- 78014: COOL data reconstruction issue
  * Related to Frontier misconfiguration at CERN: addressed via config fix
  * Sat idle and unconfirmed for months; closed by Dave
- 84887: Access of condition data
  * Problem with propagated read privileges of ISZT tables for Geometry DB
- Related to exceeded connections issues in CAF trigger jobs:
  * 83573: Too many open connections to the atlr database, never closed.
  * 84714: ATLAS Frontier issues at CERN
    - Multiple areas affected by CERN Frontier launchpads running out of 
      threads
  * task #21645: CAF reprocessing with AtlasP1HLT 16.1.3.6+TMP tag

*** A.O.B.: ***

Discussion of Frontier mailing lists and purposes:
- atlas-adc-frontier list for ATLAS issues and operations
- RACF Frontier list for general discussion

Dave: short vacation 29 Jul - 2 Aug

Related ATLAS S&C Talks:
- Vahko: Geometry DB accessible via Frontier (except for T0)
https://indico.cern.ch/getFile.py/access?contribId=6&sessionId=2&resId=0&materialId=slides&confId=119170
- Tiago: investigating Trigger DB MC processing via Frontier
https://indico.cern.ch/getFile.py/access?contribId=7&sessionId=2&resId=1&materialId=slides&confId=119170
- Alastair: personnel updates; deployment recommendations (hardware, ACLs, configuration, shared Squids, fail-over); monitoring (updates, lack of downtime declaration, question of how to configure central side of site monitoring); software status (package updates, AGIS)
https://indico.cern.ch/getFile.py/access?contribId=8&sessionId=2&resId=0&materialId=slides&confId=119170
- David: Frontier software improvements; package layers; package configuration; cross-experiment utilization; future plans
https://indico.cern.ch/getFile.py/access?contribId=9&sessionId=2&resId=1&materialId=slides&confId=119170
- Florentin: overview of tools; restoring and fixing pre-existing probes; SLS and MRTG clean-up and AGIS integration; SAM test migration to Nagios; future plans; documentation
https://indico.cern.ch/getFile.py/access?contribId=10&sessionId=2&resId=1&materialId=slides&confId=119170