Development of web-based tools for use in hardware clusters doing lattice physics

P. Dreher(a)∗†‡, W. Akers(b)§, J. Chen(b)§, Y. Chen(b)§, C. Watson(b)†§

(a) MIT Laboratory for Nuclear Science, Massachusetts Institute of Technology, Cambridge, MA 02139

(b) Thomas Jefferson National Accelerator Facility, Newport News, VA 23606

∗ Support provided through the U.S. Department of Energy Cooperative Agreement DE-FC02-94ER40818.
† Member, Lattice Hadron Physics Collaboration.
‡ Poster presenter.
§ This work was supported by U.S. Department of Energy contract DE-AC05-84ER40150, under which the Southeastern Universities Research Association (SURA) operates the Thomas Jefferson National Accelerator Facility.

Abstract: Jefferson Lab and MIT are developing a set of web-based tools within the Lattice Hadron Physics Collaboration to allow lattice QCD theorists to treat the computational facilities located at the two sites as a single meta-facility. The prototype Lattice Portal provides researchers with the ability to submit jobs to the cluster, browse data caches, and transfer files between cache and off-line storage. The user can view the configuration of the PBS servers and monitor both the status of all batch queues and the jobs in each queue. Work is starting on expanding the present system to include job submission at the meta-facility level (shared queue), as well as multi-site file transfers and enhanced policy-based data management capabilities.

1. THE META-FACILITY CONCEPT

The next generation of computers for lattice calculations and other grand-challenge numerical simulation problems will be constructed with multi-teraflops computing capabilities. Such machines, dedicated to a particular grand-challenge problem, may not be located at a single geographic site. The Thomas Jefferson National Accelerator Facility and the MIT Laboratory for Nuclear Science have undertaken a joint project to develop a meta-facility for lattice physics calculations spanning geographically dispersed sites. A "meta-facility" for lattice physics will consist of a set of tools and utilities that allow for the efficient distribution of data and computing tasks among multiple lattice gauge facilities. The goals of the meta-facility are to provide a central location for users to submit and monitor jobs, to collect and distribute the data generated by these computing activities, and to report the overall status of the hardware interconnected to the meta-facility.

The full implementation of the meta-facility will provide a primary routing mechanism that gives users the ability to run jobs at one of several sites according to the resources available at any given time. At each site, a batch system will accept jobs from the central routing queue, queue the job(s) sent to that site, and control their execution on the machines at that site. The distributed batch system at each site will also report job status and control information back to the central routing queue.

2. THE PRESENT JLAB-MIT FACILITY

At the present time, JLab and MIT have developed an initial set of web-based tools and utilities that have been installed at both sites. This initial software deployment will become part of the design toward an eventual full implementation of a meta-facility at both sites.
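As an illustration of the kind of operation these tools wrap, the sketch below shows one minimal way a server-side helper could submit a user's job script to a local PBS server and poll its status by shelling out to the standard qsub and qstat utilities. The class and method names are hypothetical and the code is not taken from the Lattice Portal itself; it only indicates the sort of PBS interaction the web front end has to perform.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Hypothetical helper illustrating how a web tool might drive a local
// PBS server.  Only qsub and qstat are standard PBS commands; the
// class itself is not part of the Lattice Portal code.
public class PbsSubmitter {

    // Submit a PBS job script and return the job identifier printed by
    // qsub (typically of the form "1234.servername").
    public static String submit(String scriptPath) throws Exception {
        Process p = Runtime.getRuntime().exec(new String[] { "qsub", scriptPath });
        BufferedReader out =
            new BufferedReader(new InputStreamReader(p.getInputStream()));
        String jobId = out.readLine();
        if (p.waitFor() != 0) {
            throw new Exception("qsub failed for " + scriptPath);
        }
        return jobId;
    }

    // Return the last line printed by qstat for the given job, which
    // contains the job's current state in the batch queue.
    public static String status(String jobId) throws Exception {
        Process p = Runtime.getRuntime().exec(new String[] { "qstat", jobId });
        BufferedReader out =
            new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line, last = null;
        while ((line = out.readLine()) != null) {
            last = line;
        }
        p.waitFor();
        return last;
    }
}

In this picture, the web front end would call PbsSubmitter.submit() with the path of a user-supplied batch script, record the returned job identifier, and later answer status queries through PbsSubmitter.status().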
These web-based tools have been installed to monitor the clusters of machines at both JLab and MIT. These machines currently have a mix of different capabilities and configurations.

Jefferson Lab presently has two clusters operational. The first cluster consists of 16 XP1000 nodes from Compaq. The XP1000 is a 64-bit Alpha 21264 processor (500 MHz) with a 4 MB L2 cache and 64 KB of on-chip cache, a 100 MHz SDRAM memory subsystem, and an integrated 4 GB Wide-Ultra SCSI (10,000 rpm) disk subsystem with dual independent 32/64-bit PCI buses. Eight of the XP1000 nodes have Myrinet cards installed; these nodes are connected to a Myrinet switch. The second cluster consists of 12 dual-processor UP2000 nodes from API Labs. The UP2000 systems have dual 667 MHz 21264 processors with 4 MB caches, 512 MB of memory, and an 18 GB IDE disk (7200 rpm).

At MIT, there are twelve Compaq ES40 machines. Each ES40 has four 667 MHz Alpha EV67 processors and 1 GB of memory in an SMP configuration within each box. The twelve ES40s are connected by both Myrinet hardware and Fast Ethernet. Additional Intel-based PCs serve as backup, file, and batch servers at this site. The entire set of machines is positioned behind an Intel-based front-end machine, which serves both as a firewall and as a repository for the web-based tools and utilities.

The present system allows users at either the JLab or MIT site to launch and monitor batch jobs and to examine batch queue parameters and machine status. The JLab site also allows users to browse the tape file catalog and initiate file transfers to and from the JLab tape storage facilities. A certificate server is currently operational at JLab, with the capability of issuing a personal web certificate for enhanced security for users submitting jobs to that site.

3. FUTURE PLANS AND ENHANCEMENTS

The configuration operational at JLab and MIT has demonstrated a proof of concept. The batch system execution queues at both the JLab and MIT sites operate properly, and a web-based link has been established between the two sites that provides information on the status of the batch jobs running on all clusters. The project will be moving forward with two major upgrade tasks scheduled over the next several months.

- At the present time, batch job submission via the Web is limited to the JLab site. Web services for batch submission of jobs at the MIT site will be installed in the Fall of 2001.

- Tape transfers at both JLab and MIT involve a two-stage process because of the front-end machines acting as firewalls at both sites. Work is ongoing to develop full tape file transfer capability at both sites.

After these immediate tasks have been completed and are operational, there are several longer-term activities planned for the development and implementation of a fully functional meta-facility.

At both the MIT and JLab sites, only the execution queues have been configured in the batch queuing system. The ultimate goal is to develop and install a separate node in the meta-facility that will be dedicated to the functionality of a routing queue. The routing queue would not be attached to any of the individual clusters in the meta-facility, but would instead view all of the execution queues on every cluster at all sites throughout the meta-facility. The routing node will act as a separate communications and traffic manager.

Users would submit jobs only to the node running the routing queue software. The routing queue would have the configuration parameters of each execution queue and would communicate with all of the execution queues located throughout the entire system. Based on the job requirements, the routing queue would examine system loads and available resources throughout the entire meta-facility and direct the job to the execution queue on the cluster with the most available resources. At this point, additional sites beyond the Jefferson Lab and MIT locations would be added.
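A short sketch may help make the routing step concrete. The QueueStatus fields and the selection policy shown here (choose the queue with the most idle nodes, breaking ties in favor of the shorter backlog) are assumptions made purely for illustration; they are not the algorithm that will actually be deployed in the meta-facility.

// Illustrative sketch of a routing decision; the data model and the
// selection policy are assumptions, not the deployed algorithm.
public class RoutingQueueSketch {

    // Snapshot of one site's execution queue as the routing node might see it.
    public static class QueueStatus {
        String site;        // e.g. "JLab" or "MIT"
        int freeNodes;      // nodes currently idle
        int queuedJobs;     // jobs already waiting in this queue

        QueueStatus(String site, int freeNodes, int queuedJobs) {
            this.site = site;
            this.freeNodes = freeNodes;
            this.queuedJobs = queuedJobs;
        }
    }

    // Pick an execution queue that can satisfy the node request and has the
    // most idle nodes; ties are broken in favor of the shorter backlog.
    public static QueueStatus route(QueueStatus[] queues, int nodesRequested) {
        QueueStatus best = null;
        for (int i = 0; i < queues.length; i++) {
            QueueStatus q = queues[i];
            if (q.freeNodes < nodesRequested) {
                continue;   // this site cannot run the job right now
            }
            if (best == null
                    || q.freeNodes > best.freeNodes
                    || (q.freeNodes == best.freeNodes && q.queuedJobs < best.queuedJobs)) {
                best = q;
            }
        }
        return best;        // null means no site can take the job yet
    }

    public static void main(String[] args) {
        QueueStatus[] sites = {
            new QueueStatus("JLab", 10, 2),
            new QueueStatus("MIT", 16, 5)
        };
        QueueStatus target = route(sites, 8);
        System.out.println(target == null ? "hold in routing queue" : "route to " + target.site);
    }
}

A null result corresponds to the case where no site can currently satisfy the request; the job would then simply remain queued at the routing node until resources free up.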
Finally, the routing queue will be integrated with future data grid software currently being developed using Java and XML-based web services [1]. The final system will be a web-based distributed data grid that is site-independent. It will include a distributed batch system augmented with various monitoring tools and management options to deliver a full production-level meta-facility for lattice physics.

REFERENCES

1. W. Watson, I. Bird, J. Chen, B. Hess, A. Kowalski, Y. Chen, "A Web Services Data Analysis Grid", submitted to Concurrency and Computation: Practice and Experience.