CENIC Today -- First Quarter 2013, Volume 16 Issue 1
Larry Smarr Keynote: Campus-scale HP Cyberinfrastructure Required for Data-intensive Research: UCSD's Experience

In this issue:

Keynote Addresses:
Larry Smarr
David McGowan
Greg Bell
2013 Innovations in Networking Awards:
All presentations
Research & Technology Highlights:
International Networks Update
NSF Campus Cyberinfrastructure
US-Mexico Networking
Teaching & Learning Highlights:
Mobile Audiovisual Broadcasting
Update on the Mid-Pacific ICT Center
CA State Parks Reaching Out with PORTS

CENIC News:
Calit2, ESnet, and CENIC Convene 100G and Beyond Workshop
CENIC Network Traffic Doubles in 2013
Network Updates for First Quarter 2013
CENIC Star Performer: Christopher Paolini, SDSU

Watch the keynote address on YouTube

Dr. Smarr's Monday, March 11 Keynote Address revolved around the ways in which a research university must plan, design, and implement its own networking in order to take advantage of ultra-high-performance connectivity at 100G or more. The Keynote Address was given in Calit2's Atkinson Auditorium, a room capable of delivering a Gigabit to each individual seat -- meaning, as Smarr informed the audience, that the Auditorium in full use could carry 200 Gb/s in aggregate. After describing the next incarnation of the UCSD campus cyberinfrastructure, known as PRISM, Smarr discussed the categories of research activities that any such cyberinfrastructure must support and gave compelling examples of each.

Smarr also took care to emphasize the human networking required to ensure that a campus cyberinfrastructure is not only planned properly for researchers but is actually used by them. Multiple face-to-face consultations are an extremely important part of network planning, since researchers very often will not volunteer what they would like to have, or even what they need; and with potentially revolutionary technology like 100G networking, some researchers may have difficulty even predicting its uses. Smarr was also quick to point out that supporting data-intensive science on a campus means serving more than the traditionally tech-heavy disciplines: it includes the humanities as well, which, while they do not yet involve data on the scale of high-energy physics or genomics, are nonetheless beginning to move into the realm of data- and computation-dependent fields.

Happily, the rapidly declining cost per port of 10G networking has made campus-wide 10G cyberinfrastructure affordable: estimates of $80,000 per port in 2005 have come down to only a few hundred dollars in 2013, approaching HPC cluster costs, with prices projected to fall even further in the near future. A localized example shown by Smarr of what this makes possible is the massively parallel 10G switched data analysis resource at the San Diego Supercomputer Center (see Fig. 1), which, among other things, enables a Terabit connection to the Gordon supercomputer, a system designed to maximize access to extremely large datasets. This, Smarr pointed out, "makes the LAN connection faster than the backplane, in essence" and makes possible for the first time a true instantiation of distributed computation.
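To put the Terabit figure in perspective, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration and not from the talk, that a Terabit-class connection is built by aggregating one hundred 10GbE links, and it uses the per-port price points Smarr cited to show how dramatically the switching cost has fallen.

```python
# Back-of-the-envelope sketch. The 100-link aggregation and the exact 2013
# price point are illustrative assumptions; the ~$80,000 (2005) and
# "a few hundred dollars" (2013) per-port figures come from the talk.

PORT_SPEED_GBPS = 10            # 10 Gigabit Ethernet
TARGET_GBPS = 1_000             # a "Terabit" connection

ports_needed = TARGET_GBPS // PORT_SPEED_GBPS    # 100 bonded 10GbE links (assumed)

COST_PER_PORT_2005 = 80_000     # ~$80,000 per 10G port in 2005
COST_PER_PORT_2013 = 300        # "a few hundred dollars" per port in 2013 (assumed)

print(f"10GbE ports needed for {TARGET_GBPS} Gb/s: {ports_needed}")
print(f"Port cost at 2005 prices: ${ports_needed * COST_PER_PORT_2005:,}")
print(f"Port cost at 2013 prices: ${ports_needed * COST_PER_PORT_2013:,}")
```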


Figure 1: SDSC's Massively Parallel 10G Switched Data Analysis Resource

As CENIC begins to introduce 100G connections, this facility and others like it that are coming into existence will have a strong influence on conceptualizing how a campus "terminates" a 100G connection. Smarr used the analogy of impedance matching to describe this process, and presented the design for the next generation of UCSD's campus cyberinfrastructure, called PRISM (see Fig. 2).

PRISM is UCSD's answer to how a campus terminates a 100G connection, designed with the awareness that it must support the following types of research activity:

  1. The Remote Analysis of Large Data Sets
  2. Connection to Remote Campus Compute and Storage Clusters
  3. Remote Access to Campus Data Repositories
  4. Enabling Remote Collaborations

Smarr then gave examples of each of these functions, drawn either from the UCSD campus itself or from projects of great interest to researchers located there.


Figure 2: PRISM@UCSD -- UCSD's Big Data Freeway system connecting instruments, computers, and storage


The Remote Analysis of Large Data Sets

The discovery in 2012 of a particle consistent with the Higgs boson was headline-making news the world over, and the devices used to make this discovery are among the most complex scientific instruments ever built. The Compact Muon Solenoid, or CMS, detector at CERN generates unimaginably vast amounts of data, and in the hierarchy of data centers responsible for making this data globally available, known as the Worldwide LHC Computing Grid, UCSD functions as a Tier 2 center. As Smarr described, data flows from the CERN Tier 0 center to its many child data centers peaked at 32 Gb/s in 2012, while flows into the US Tier 1 center located at Fermilab peaked at 10 Gb/s, and flows into the Tier 2 center at UCSD peaked at 2.4 Gb/s.
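Those peak rates are easier to appreciate as daily volumes. The short sketch below converts each tier's peak flow into terabytes per day, under the purely illustrative assumption that the peak rate were sustained around the clock.

```python
# Convert the peak LHC data flows Smarr cited into terabytes per day,
# assuming (hypothetically, for illustration only) that each peak rate
# were sustained for a full 24 hours.

SECONDS_PER_DAY = 86_400
BITS_PER_BYTE = 8

peak_flows_gbps = {
    "CERN Tier 0 -> child centers": 32.0,
    "US Tier 1 at Fermilab": 10.0,
    "Tier 2 at UCSD": 2.4,
}

for site, gbps in peak_flows_gbps.items():
    tb_per_day = gbps * SECONDS_PER_DAY / BITS_PER_BYTE / 1_000   # gigabits -> terabytes
    print(f"{site}: {gbps} Gb/s peak ~= {tb_per_day:,.0f} TB/day if sustained")
```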

A second example given by Smarr to illustrate the ways in which UCSD's cyberinfrastructure must be designed to enable researchers at UCSD to reach data around the world is the Open Science Grid Consortium, an organization that administers a worldwide grid of technological resources called the Open Science Grid, which facilitates distributed computing for scientific research. The consortium is composed of service and resource providers, researchers from universities and national laboratories, and computing centers across the United States.

Another example of great significance to UCSD is that of climate research, specifically that taking place at the Scripps Institution of Oceanography. The calculations underlying climate simulations must often be performed many times over and on increasingly smaller scales to account for very fine regional variations; these calculations further require vast amounts of data and are often built on the data-heavy results of remote supercomputer simulations.

Connection to Remote Campus Compute and Storage Clusters

Examples shown by Smarr of connecting to remote compute and storage clusters of relevance to UCSD included ocean observing and high-resolution microscopy, specifically UCSD's role in the Ocean Observatories Initiative (OOI) and the National Center for Microscopy and Imaging Research (NCMIR). Both deep ocean observatories and high-resolution microscopy involve large data sets, which must be made available not only to researchers around the world but on the UCSD campus as well. Any research university's cyberinfrastructure must function not only as a gateway to and from the world but also as a backplane for its own research activities, as with the OOI team at the Scripps Institution of Oceanography and the team's server complex located at Calit2 in Atkinson Hall.

Crucially, Smarr reminded the attendees, the OOI and NCMIR projects also make substantial use of "blended" resources, combining federal networks such as ESnet, regional optical networks like CalREN, and commercial cloud services, increasing the variety of data flows that a campus cyberinfrastructure must support.

A third example of this use of a campus cyberinfrastructure as an institutional backplane, and one particularly championed by Smarr, is the analysis of the human gut microbiome. The ecosystem supported by the average human body is a vast one, with the microbes inhabiting the body outnumbering the body's own human cells by ten to one. Displaying microbial populations by phylum is a perfect application for the Calit2 Virtual Room (VRoom), although the analyzed data itself resides at the San Diego Supercomputer Center. Smarr also pointed out that the million-fold reduction in the cost of gene sequencing promises to make the study of the human body as an ecosystem yet one more distributed big-data science that campus networking must support.

Remote Access to Campus Data Repositories

Just as UCSD researchers must be able to access remote data located around the world, Smarr reminded his audience that, as a world-class research university, UCSD generates and houses a great deal of data that must also be made globally available to researchers elsewhere; he specifically named two protein structure analysis projects in this vein, the Protein Data Bank (PDB) and the UCSD Center for Computational Mass Spectrometry's proteomics database.

The Protein Data Bank is an archive of experimentally determined 3-D structures of proteins, nucleic acids, and other complex assemblies, and is one of the largest scientific resources in the life sciences. With a third of a million unique visitors each month, the PDB is eager to establish global load balancing between its UCSD and Rutgers University sites to better serve its users. This requires high bandwidth to move data to off-campus users, as well as between UCSD and Rutgers, so that the database is presented to users as a single seamless resource rather than two joined ones.
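The article does not say how the PDB implements this load balancing; purely as a conceptual sketch, one common way to present two mirrored sites as a single service is to steer each request to whichever site currently responds fastest. The hostnames and the simulated latency probe below are hypothetical placeholders, not details of the PDB's actual setup.

```python
# Conceptual sketch only: present two mirrored sites as one service by
# steering each request to the site with the lowest probed latency.
# Hostnames and the simulated probe are hypothetical; this does not
# describe the PDB's actual implementation.

import random
from typing import Callable, Dict

SITES = ["pdb-west.example.edu (UCSD)", "pdb-east.example.edu (Rutgers)"]

def simulated_probe(site: str) -> float:
    """Stand-in for a real health/latency check (e.g. a timed HTTP request)."""
    return random.uniform(10.0, 120.0)   # latency in milliseconds

def pick_site(probe: Callable[[str], float]) -> str:
    """Return the mirror with the lowest probed latency."""
    latencies: Dict[str, float] = {site: probe(site) for site in SITES}
    return min(latencies, key=latencies.get)

if __name__ == "__main__":
    print("Serving request from:", pick_site(simulated_probe))
```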

The UCSD Center for Computational Mass Spectrometry's proteomics project is another example of a massive life-sciences database focused on protein structure -- in this case as determined by mass spectrometry -- that is rapidly becoming one of the major global repositories of protein-related data, and one that promises only to grow in the future.

Remote Collaborations

The fourth function that Smarr specified as vital for a campus cyberinfrastructure to support was that of enabling remote collaborations, and the examples cited by him included:

  • The CineGrid international digital media production collaboration,
  • Collaboration between the Calit2 VRoom and the University of Illinois at Chicago's Electronic Visualization Lab,
  • Collaboration between Calit2 and Mexico's CICESE over a 10G optical link between the two facilities -- announced at CENIC's 2012 annual conference -- supporting a coupled pair of OptIPortals, and
  • The Pacific-Rim PRAGMA collaboration.

All of these examples involve multiple large datasets which need to be linked and manipulated smoothly among globally separated collaborators in real time. One particularly pertinent example mentioned by Smarr revolved around the 2002-03 SARS epidemic, during which local resources to analyze patient radiographs were swamped by the sheer number of radiographs taken. High-bandwidth networks enabled this data to be shared globally and read by remote researchers, an excellent example of the crucial role that advanced networks can play during global crises.


About CENIC and How to Change Your Subscription:

California's education and research communities leverage their networking resources under CENIC, the Corporation for Education Network Initiatives in California, in order to obtain cost-effective, high-bandwidth networking to support their missions and answer the needs of their faculty, staff, and students. CENIC designs, implements, and operates CalREN, the California Research and Education Network, a high-bandwidth, high-capacity Internet network specially designed to meet the unique requirements of these communities, and to which the vast majority of the state's K-20 educational institutions are connected. In order to facilitate collaboration in education and research, CENIC also provides connectivity to non-California institutions and industry research organizations with which CENIC's Associate researchers and educators are engaged.

CENIC is governed by its member institutions. Representatives from these institutions also donate expertise through their participation in various committees designed to ensure that CENIC is managed effectively and efficiently, and to support the continued evolution of the network as technology advances.

For more information, visit www.cenic.org.

Subscription Information: You can subscribe and unsubscribe to CENIC Updates at http://lists.cenic.org/mailman/listinfo/cenic-announce.

(c) Copyright 2013 CENIC. All Rights Reserved.