Minutes: 9/8/2011

by John S. De Stefano Jr. last modified Sep 08, 2011 03:20 PM
Notes from the ATLAS Frontier meeting on September 8, 2011.

Participants:   Stephane, Emannouil, Alastair, Fred, Elizabeth, David, Dave
Early check-in: Florentin Bujor, Andrew Wong
Apologies:      Dario Barberis

*** Site reports and issues: ***

BNL:
- Shutdown of T1 site due to hurricane Irene (Aug 27-29)
  * Fail-over of US T2/T3s to TRIUMF did not work as expected
    - T2/3s, WNs failed to connect to Frontier at TRIUMF
      * Site Squids may be white-listed, but not WNs

CERN:
- Attempted Tomcat RPM upgrade reverted due to bug
  * Bug fix ready; production servers to be upgraded soon

IN2P3-CC:
- Both old and new (test) launchpads have working MRTG entries
- Will work with Fred to stress test new server after reprocessing campaign 
  and CVMFS migration
  * Plan to use current Frontier server in conjunction with new server to add   
    resilience
- Comments on crontab directives
  * System file permissions may prohibit editing via crontab command by 
    administrator after rpm is installed
    - Suggestion: place file directly in /var/spool/cron, or add user to 
      cron.deny/allow (documentation)
- Installed Squid awstats: working well but script has changed, using previous 
  version
  * David has fixed bug in awstats package
- Will request Romanian sites that currently use IN2P3's site Squid to install 
  at least one Squid of their own

KIT:
- N/A 

RAL:
- Network outage on 5 Sep, ~7 hours
  * SLS reported Frontier service degraded (50%) after network recovery, 
    though servers came up as expected and served requests
    http://savannah.cern.ch/bugs/?86390
    - Will investigate via Tomcat logs
- Network "at risk" 13 Sep at 8 AM CEST
  * May result in fail-over for a short time
- ECDF have new Squid; Savannah ticket created to update ToA (#86530)
- Nikhef network hits seen via awstats; 6x normal bandwidth to Netherlands
  * Shifter ticket submitted (GGUS #74152)
  * Problem at SARA: Squid stopped running
- Still plan to update Tomcat
    
TRIUMF:
- No issues to report

*** Deployment: ***

LCG release:
- LCG team urged to have Frontier client v2.8.4 included in LCG 60(d)
  https://savannah.cern.ch/bugs/index.php?86408

Savannah:
- 86115: seg fault in liblcg_FrontierAccess following out-of-memory
  * Process reached 32-bit, 4 GB memory limit; duplicate of "memory leak bug"
  
T0/CAF testing:
- Test machine at CERN used by T0/CAF for Frontier job batch test
  * http://frontier.cern.ch/squidstats/mrtgatlas/Lpad-CERN_2/index.html
  * Two rounds of tests, per Armin:
    "Each round consisted of ~400 reco jobs, submitted at the usual
    Tier-0 pace of ~45/min. So all the jobs started running within
    less than 10 min.
    The jobs access conditions data from COOL three times during
    their run time: in the initialization phases of the RAWtoESD
    (first, sharp peak in the launchpad plot), ESDtoAOD and ESDtoDPD
    (second, smeared peak) steps, respectively. (The two latter steps
    are considerably shorter than the RAWtoESD step.)"
  * Jobs completed successfully
  * Results from jobs using direct Oracle access are being compared with 
    results from jobs using Frontier (359 jobs)
    - ESD container size differences are seen
    - Direct-Oracle has additional MUONALIGN retrievals with 100% correlation 
      with ESD container size
    - These ESD differences are thought to cause the resulting differences in 
      AOD container sizes and TAG variables
    - Differences are reproducible
    - Root cause of differences is still under investigation at the time of 
      this meeting
- Need to cement requirements (cache retention)
- Need to establish architecture for deployment:
  * Dedicated Frontier host (atlasfrontier4) with fail-over to new servlets on 
    general CERN Frontier?
  * Share cache with general site Squid proxies
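
The timing in Armin's description of the test rounds can be checked with a quick calculation. This is a minimal sketch that uses only the approximate figures quoted there (~400 jobs at ~45 jobs/min); the function name is illustrative and not part of any Tier-0 tool.

```python
def submission_window_minutes(n_jobs: int, rate_per_min: float) -> float:
    """Time needed to submit n_jobs at a steady rate (jobs per minute)."""
    if rate_per_min <= 0:
        raise ValueError("rate_per_min must be positive")
    return n_jobs / rate_per_min


# Approximate figures from the quoted test description.
window = submission_window_minutes(400, 45)
print(f"{window:.1f} min")  # 8.9 min
```

At that pace the last job is submitted a little under 9 minutes after the first, consistent with the quoted "less than 10 min".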

Trigger reprocessing:
- Still need work for jobs to use Frontier access
  * Ongoing in Trigger group
  * Request for ATLAS_TRIGGER_REPROC schema to be added to Streams
  * Hope to be able to test by end of month

Frontier sites and fail-over:
- Discussions here, WLCG, ADC to establish open incoming connection policies 
  on Frontier launchpads
  * Proposals to enforce open policies on launchpads, or restrict ATLAS 
    Frontier fail-over definitions only to open and multi-server launchpads 
    (currently: BNL, CERN, RAL)
  * Presented at WLCG T1 SCM:
    https://indico.cern.ch/conferenceDisplay.py?confId=153062
  * Presented at ADC Weekly, who agreed on changes in ToA for back-up sites:
    https://indico.cern.ch/conferenceDisplay.py?confId=153368
- ToA has been modified to change T1 fail-overs:
  https://savannah.cern.ch/bugs/index.php?86461
- ADC requests that all Frontier sites open incoming ACLs
  * Potential concern of opening site to DoS attack should not be reason in 
    itself to deny open access
  * Will continue operation with existing model and recommendation:
    - ATLAS Frontier sites will fail-over to sites with open access 
      configuration and resilient server deployment
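
The fail-over recommendation above amounts to a simple filter: a site qualifies as a back-up only if its launchpad is open and runs more than one server. The sketch below illustrates that rule only. The `Launchpad` record, its field names, the server counts, and the `EXAMPLE-*` sites are all assumptions made for illustration; they do not describe how ToA or AGIS actually store this information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Launchpad:
    site: str
    open_access: bool  # accepts incoming connections from other sites
    n_servers: int     # more than one server is treated as "resilient" here


def failover_candidates(launchpads, home_site):
    """Return sites usable as back-ups for home_site: open and multi-server."""
    return [
        lp.site
        for lp in launchpads
        if lp.site != home_site and lp.open_access and lp.n_servers > 1
    ]


# Illustrative records only: server counts are placeholders and the
# EXAMPLE-* entries are made-up sites, not real launchpads.
launchpads = [
    Launchpad("BNL", True, 2),
    Launchpad("CERN", True, 2),
    Launchpad("RAL", True, 2),
    Launchpad("EXAMPLE-RESTRICTED", False, 2),
    Launchpad("EXAMPLE-SINGLE", True, 1),
]

print(failover_candidates(launchpads, "BNL"))  # ['CERN', 'RAL']
```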

AGIS progress (Florentin):
- I'm still waiting for feedback from the people who should validate what I've 
  done so far (new scripts using AGIS; tests to see if all the data from 
  ToA is in AGIS, etc.) and on the next steps to be done. So, to be 
  continued... But don't worry, I'm actively following this matter. 
- For monitoring:
  * It is possible to have multiple endpoints declared for one site/service
  * There will be a "downtime" field, not at the endpoint level as I expected, 
    but at the service level, useful when an entire site/service is under 
    maintenance or similar.

BNL/Tomcat threads/SLS failure investigation:
- David identified user whose jobs dominated queue during problematic period, 
  which likely caused Tomcat thread build-up and overload
  * User's queries and code being analyzed for possible improvement
  * Based on low increase in all aspects of server load, Dave recommends 
    increasing number of database channels by a factor of 3-4 (from 10 to 
    30-40)
    - Recommendation: increase from 10 to 20 for now
    - Adjust `maxIdle` and `maxActive` values symmetrically in [servlet].xml
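
The change to `maxIdle` and `maxActive` could be scripted along the following lines. This is only a sketch: it assumes the servlet XML holds a Tomcat-style `<Resource>` element carrying both attributes, and it reads "symmetrically" as setting both to the same value. The sample fragment, the resource name, and the `set_db_channels` helper are hypothetical; the real [servlet].xml may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for illustration; not copied from a real deployment.
SAMPLE = """<Context>
  <Resource name="jdbc/frontier" type="javax.sql.DataSource"
            maxActive="10" maxIdle="10" maxWait="-1"/>
</Context>"""


def set_db_channels(xml_text: str, channels: int) -> str:
    """Set maxActive and maxIdle to the same value on every <Resource>."""
    root = ET.fromstring(xml_text)
    for resource in root.iter("Resource"):
        resource.set("maxActive", str(channels))
        resource.set("maxIdle", str(channels))
    return ET.tostring(root, encoding="unicode")


# Apply the interim recommendation from the minutes: 10 -> 20.
updated = set_db_channels(SAMPLE, 20)
print(updated)
```

Other attributes on the element are left untouched, so the script only changes the two values discussed here.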

*** Development: ***
  
Monitoring (Florentin):
- SLS: 
  * I've finished the integration of Dave's suggestions in a new version of 
    the SLS script, but it is not yet deployed in production. As the new 
    script should be used by CMS too, I have to validate it with Dave. I'll 
    contact him some time next week.
- MRTG: 
  * Fixed the site list using info from AGIS. I'm almost done with the new 
    scripts which generate the MRTG configuration. I will validate them with 
    Dave (along with SLS tools) some time next week.

Packages:
- New Frontier servlet package release (3.29-4)
  * Multiple servlet support, configuration for shared and private data
  * http://frontier.cern.ch/dist/rpms/frontier-servletREADME
  * If RPM passes testing, will be installed in production at CERN
- New release of Squid RPM almost ready
  * Fix for possible removal of system files
  * RPM will not create non-existent and non-standard log file and cache 
    directories
- New AWStats RPM release
  * Fixes known issues, passes David's tests, installed in Quattor
- New Tomcat RPM release
  * Also successfully tested and installed in Quattor   

*** A.O.B. ***

N/A