Minutes: 10/6/2011

by John S. De Stefano Jr. last modified Nov 21, 2011 04:45 PM
Notes from the ATLAS Frontier meeting on October 6, 2011.

Participants: Dario Barberis, Misha Borodin, Florentin Bujor, John DeStefano, 
              Alastair Dewhurst, David Front, Dave Dykstra, Andrea Formica, 
              Elizabeth Gallas, Fred Luehring, Shawn McKee, Alexei Sedov, 
              Roman Sorokoletov, Emmanouil Vamvakopoulos

*** Site news and issues: ***

AGLT2:
- Questions on number of Squids per site, especially when sharing Squids 
  between services 
  * Six Squids at AGLT2 across two sites, some serving Frontier and CVMFS
  * RAL: shared Squids have small load even using both Frontier and CVMFS
    - Some smaller sites sharing a single Squid
    - One might be fine for load, two recommended for robustness
    - CVMFS has its own internal caching that reduces load on Squid
    - Cacti graph links for those interested in load numbers:
      * cache (UM; Frontier + CVMFS):
http://umopt1.grid.umich.edu/cacti/graph_view.php?action=preview&host_id=1161    
      * cache0 (MSU; CVMFS):
http://umopt1.grid.umich.edu/cacti/graph_view.php?action=preview&host_id=1457
      * cache1 (MSU; CVMFS):
http://umopt1.grid.umich.edu/cacti/graph_view.php?action=preview&host_id=1458 
      * cache2 (UM; Frontier + CVMFS):
http://umopt1.grid.umich.edu/cacti/graph_view.php?action=preview&host_id=1162
      * cache3 (UM; Frontier + CVMFS):
http://umopt1.grid.umich.edu/cacti/graph_view.php?action=preview&host_id=1376

BNL:
- Reduced launchpad cache lifetime from 60 to 50 minutes to account for two 
  inherent 5-minute delays
- Tested latest RPMs; working on automating launchpad and cache deployment
  * Delayed until next round of RPMs
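
The arithmetic behind the 50-minute setting can be sketched as below. This is an
illustration only: the 60-minute figure and the two 5-minute delays come from
the notes above, while the function name, its parameters, and the assumption
that the delays simply add to the cache lifetime are hypothetical.

```python
def launchpad_lifetime(budget_minutes, delays_minutes):
    """Return the cache lifetime left after subtracting fixed delays.

    Assumption (not stated in the minutes): the delays add linearly to the
    cache lifetime, so the lifetime must shrink by their sum.
    """
    lifetime = budget_minutes - sum(delays_minutes)
    if lifetime <= 0:
        raise ValueError("delays consume the whole time budget")
    return lifetime


# Values from the BNL item: a 60-minute budget and two 5-minute delays.
print(launchpad_lifetime(60, [5, 5]))  # 50
```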

CERN:
- T0 instances added, not yet in production, but being added to monitoring
  * T0 servlet still has same name as production; should be changed before 
    moving to live environment (Florentin, Sergei)
  * Minor issue with CDB template has been fixed
  * Results must be tested against Oracle access before moving instance live
    - ATLAS still needs time to investigate cache results
  * https://sls.cern.ch/sls/service.php?id=ATLAS-T0-Frontier

IN2P3:
- ATLAS database corruption on 4 Oct, possibly related to storage system
  * Databases will move to different storage system
  * French cloud Squid load went to RAL
  * Since T1 Frontier is not (yet) available, it can't fail over to a backup 
    site
    - Lyon is moving in a direction to address access issues
- Testing new Frontier server, new RPMs with Fred
  * Some problems with CVMFS, Frontier configuration
    - Conditions POOL files weren't working within CVMFS; now fixed
    - Will try to test tomorrow: first one job, then follow with a moderate 
      number of jobs (100) over weekend
    - Will set up new site Squid next week as a replacement for the old one, 
      which is on the same server as the old Frontier
    - Will create round robin with new and existing Squids, with identical 
      configuration
  * Latest version of Tomcat, Squid
    - Comment on AWStats: latest version didn't work; resorted to previous 
      version; will send detailed comments later
    - No problems with latest Squid version
- ACL configuration: would like to restrict Squid access to local Frontier
  * Had a question about "EvGenJobOptions http download" files and whether 
    Squid needs to provide access via open requests
    - Site Squid needs access to:
http://atlas-computing.web.cern.ch/atlas-computing/links/kitsDirectory/EvgenJobOpts/
      * Ref:
https://indico.cern.ch/materialDisplay.py?subContId=1&contribId=0&materialId=slides&confId=145551

KIT: --

RAL:
- T1 was "at risk" 4 Oct due to investigation of packet loss on routers
  * Fixed without downtime during recurring network "at risk" period
- RAL would like copies of CERN Quattor templates for reuse
- Tomcat upgrade, cache retention time not yet changed

TRIUMF: --

*** Deployment: ***

Documentation:
- Needs to be revisited, consolidated
- Alastair's new ATLAS TWiki topic as starting point:
  https://twiki.cern.ch/twiki/bin/viewauth/Atlas/FroNTier
- David will make effort to consolidate:
  - David's RPM documentation
  - Flavia's RPM documentation on TWiki
  - BNL documentation
  - Florentin's monitoring TWiki

T0 launchpads:
- Continued analysis of muon result differences (Savannah #86738, #87266)
  * Could actually affect not only muon processing but jobs ATLAS-wide
    - Discussed in Database Coordination meeting
    - Private threads should be added to Savannah tickets (when issues are 
      resolved)
    - Discussion summarized in Elizabeth's DB Coordination Meeting notes:
      http://www-pnp.physics.ox.ac.uk/~gallas/DBAdmin/111003_week_DB.html#DBC
  * Determining proper place to flush cache at processing run start (NEMO?)
    - Dave proposed alternatives: 
      * 15 minute gap between updates and processing <-- should be OK, so long 
        as internal cache lifetime is set to 5 minutes
      * Client-side URL variable for fresh cache runs
  * Updated times from 600 to 300
  * Will notify Dario when all is ready for next T0 test
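
The first alternative (a 15-minute gap with a 5-minute internal cache lifetime)
can be checked with a small sketch like the one below. It is not a description
of how Frontier or Squid expire entries. The function name and the idea of
summing the lifetimes of every cache layer on the path are assumptions made
for illustration, and the minutes do not say how many layers are involved.

```python
def stale_margin(gap_minutes, cache_lifetimes_minutes):
    """Minutes of slack between worst-case cache expiry and processing start.

    Assumption: in the worst case an entry is cached just before the update
    and then re-cached by each layer in turn, so the lifetimes add up.
    """
    worst_case_expiry = sum(cache_lifetimes_minutes)
    return gap_minutes - worst_case_expiry


# One cache layer with the 5-minute lifetime mentioned above.
margin = stale_margin(15, [5])
print(margin > 0)  # True
```

A positive margin means any stale entry would have expired before processing
starts. Under the same assumption, three chained 5-minute layers would leave
no slack at all.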

*** Development: ***

AGIS integration and testing (Florentin)
- Working with Ale DiG to clean up AGIS site entries for ATLAS (not only 
  Frontier)
  * Ensure ToA data is all present in AGIS
  * Additional data request: site and individual node data for round-robin 
    aliases
    - Beijing site Squid seems to be down, using Squid in Japan for data 
      access (Savannah #86866)

LCG release
- Frontier client v2.8.4 (latest) included in LCG 60d, and ATLAS build 17.0.4

Monitoring:
- Two sites in GRIF cloud with low load, could be removed from MRTG list
  * Can be reported to Florentin
- Three sites had a temporary problem
- Other sites (Harvard) reported via local ticketing system (RT)
- MRTG Squid list clean-up and additions: Romania, Italy, UK, etc.
  * Not yet getting data directly from AGIS, but using AGIS data manually

RPM development and testing (Dave, David, John)
- Extensive post-meeting discussion about issues and questions from testing at 
  BNL
  * Results of discussion to be implemented by David when he has time in a few 
    weeks, tested afterward by John

Query analysis and log mining (Roman, Dave)
- Study of unique versus cached queries

SLS development (Florentin)

*** A.O.B.: ***

- Next meeting: NY/CERN time difference would be 5 hours instead of 6 (DST 
  change), but the meeting will not take place due to ATLAS S&W week
- Dario, Dave: WLCG database technical evaluation group workshop (tentatively 
  scheduled for Nov 7-9) 
- Meeting technology: EVO or ReadyTalk
  * EVO should be fine for future meetings
