PNCC Offers 3 Methods to Get Data:
Globus (required on all applications)
ARIA2 parallel download manager
EMSL User Portal
PNCC data sets are very large, and some comprise multiple terabytes. Distance matters a great deal when transferring this much data over a wide area network: differences in network speed can cause transfer times to vary by a factor of 1,000. Transfer rates of 100-400 Mbps are typical when transferring data from PNCC if your network connection is well-tuned.
To work around some of the inherent problems of transferring large quantities of data across long distances, PNCC recommends that you:
1. Ensure that machines at your institution are well-configured for long-distance data transfer.
2. Choose a file transfer tool that can perform well.
3. Test your transfers end-to-end from EMSL to troubleshoot any problems prior to taking data at PNCC.
Frequently Asked Questions and Tips for Tuning Data Transfer
- How do I get my data?
PNCC processes data and distributes it to users via the EMSL Computing facility (https://www.emsl.pnl.gov) at Pacific Northwest National Laboratory (PNNL) in Richland, Washington. The EMSL computing facility features parallel HPC systems, petabytes of data storage, and dedicated high-speed data transfer capability. We offer three routes for you to access your PNCC data.
- The fastest transfer option is Globus. This is the preferred method because you can get near real-time access to the data as it streams off the microscope and detector. Globus also provides a single point of access for you to see old and new data using a single endpoint rather than needing separate download links for each dataset. However, not all institutions permit inbound Globus transfers. PNCC now requires each proposal to list the Globus ID for at least one team member at the time of proposal submission. Once your project is approved, the PNCC data transfer team will reach out to the principal investigator and team to do a test transfer via a shared endpoint to confirm that Globus works for you and to establish expected transfer speeds to your local storage. Setting up Globus access is straightforward and is completely free for academics and non-profits. To get the ID, go to https://www.globusid.org/create, where you can register for a Globus basic account if your institution doesn’t already have an institutional license.
- The second fastest transfer option is the ARIA2 parallel download manager. For every dataset collected, an autogenerated email is sent to members of the team to notify them that data is available for download. However, this option is only available after a microscope session is complete and the full dataset has been uploaded to the archive. Installing ARIA2 is straightforward, with options to compile from source or download binaries directly for OS X and Windows. Once installed, ARIA2 is very easy to use. At the bottom of each PNCC autogenerated dataset email is a link, intended specifically for ARIA2, that points to the particular dataset. Simply copy this link and launch it with the ARIA2 command line to initiate the transfer. As with Globus, if your local connection is intermittent or the download is halted, ARIA2 will automatically resume after the disruption. ARIA2's parallelized download options are particularly helpful for big files or when data transfer rates are slow between your institution and the PNCC archive.
- The slowest but most accessible transfer option is the “Get Data” tab of the Environmental Molecular Sciences Laboratory's User Portal. To access the datasets, log in to the EMSL User Portal, then either navigate to the “Get Data” tab or click the MyEMSL data download link provided in the autogenerated email mentioned above. Note: The portal login gives you a token needed to see the data, and the token can time out after a while. If you have problems seeing the data, try logging in to the portal on another browser page, then refreshing the data portal page.
If you have any issues or difficulties, please contact our Data Team directly by email. None of these tools will perform well if you are distant from the Pacific Northwest United States and have not addressed the tuning issues described in "Configure and check your machines" below.
- How will I know when I have data ready?
If using Globus, you will usually see your data on our Globus endpoint within 30 minutes of your session starting on the microscope. You will not be notified automatically about the start of the session, but your SPOC may contact you if you have a pre-scheduled reservation. Whether or not you are using Globus to download your data, an auto-generated email will be sent to the project team members when the complete dataset is archived and available for download. This occurs after the end of the session and can take extra time depending on the archival queue. The auto-generated email will have a subject line of “MyEMSL Notification - Data Uploaded for Project XXXXX” and the message will contain direct links to your data in the EMSL data portal and a metalink that may be used to download your data with ARIA2. Note: You may receive multiple copies of this email if your dataset was incomplete upon archiving.
The automatically pre-processed data will be uploaded to the archive separately, and an auto-generated email will again be sent to the project team when that data is available for download. Depending on the queue size, the pre-processed data may appear in the archive later than the raw data. However, the pre-processed data will be visible via Globus as soon as each file is created.
Please contact our Data Team directly by email if you still don't see your dataset more than 24 hours after the end of your session.
- What is Globus? Why are we using it?
Globus is widely used in the scientific research community for sharing and transferring large quantities of data. It is designed to provide secure, reliable, high-performance data access across multiple sites in a single interface. Globus is used day in and day out to transfer PNCC data from OHSU to EMSL and moves multiple terabytes of data per day.
Introductory information about Globus is available here.
- How do I obtain a Globus ID?
In order to use Globus, you will need a Globus ID. The Globus ID identifies you to Globus, and allows you to access Globus endpoints (see below), including the EMSL Data Transfer Node (emsl#dtn) that hosts PNCC user data. Additional information about getting started with Globus is available here.
- What are the minimum technical requirements for running Globus?
The normal Globus mode of operation is to use endpoints at either end of a data transfer. An endpoint is usually a dedicated data storage system with high-performance network interfaces and disk storage, tuned for long-distance data transfers.
It is also possible to set up a personal computer as a Globus endpoint by installing Globus Connect Personal software on it. Note that this will probably not perform as well as a dedicated, tuned endpoint. If you have a local server-class Globus endpoint at your institution, it will likely be better to transfer data to that endpoint, then make a local copy down to your own machine.
- What should I do if Globus doesn't work for me?
As mentioned above in the "How do I get my data" section, PNCC has two other options for data transfer. Please contact our Data Team directly if you have any questions about the other access mechanisms. Note that the auto-generated emails you receive when each dataset is available in the archive also contain specific instructions and links to that dataset.
- What are typical data transfer speeds?
The average data collection results in 2TB+ of data. At 1Gbps transfer speeds (i.e., common internal wired networking), it would take 6+ hours to transfer that amount of data locally from start to finish. The farther your location is from our site, the slower data transfer rates are likely to be. The emsl#msc Globus endpoint available to PNCC users has been observed sending data via Globus to Arizona as fast as 275 megabytes/second. At that rate, one terabyte can be transferred in about an hour. In comparison, a poor cross-country connection has been seen to perform as badly as 250 kilobytes/second, in which case that same terabyte would take a month and a half to transfer. This demonstrates the need to tune network connections at the user institution.
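The two figures quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes a decimal terabyte (10^12 bytes) and a perfectly steady transfer rate, neither of which is stated on this page, and the function name is hypothetical.

```python
def transfer_time_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Idealized transfer time in hours, assuming a constant rate."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / rate_bytes_per_s / 3600.0

TB = 1e12  # assumption: decimal terabyte

fast_hours = transfer_time_hours(TB, 275e6)  # 275 megabytes/second
slow_hours = transfer_time_hours(TB, 250e3)  # 250 kilobytes/second

print(f"1 TB at 275 MB/s: {fast_hours:.2f} hours")      # 1.01 hours
print(f"1 TB at 250 kB/s: {slow_hours / 24:.1f} days")  # 46.3 days
```

Real transfers rarely hold a constant rate, so treat these numbers as lower bounds on the time required.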
- Configure and check your machines.
In order to get good data transfer rates from PNCC, your receiving machine(s) must have:
- A reliable network connection that does not experience errors or dropped packets
- A fast wired network interface (10 gigabit/second is recommended). Do not use wireless for these data transfers!
- Tuning of the operating system (Linux, Windows, or MacOS) that dedicates extra system memory to data transfer
- Fast disk or SSD storage (capable of at least 100 megabyte/second) that is large enough to hold your data set(s)
The farther you are from the Pacific Northwest in the United States, the more important the first three items above are. See here for instructions on operating system tuning.
Background info: At long distances, many megabytes of data can be “in the wire” between two computers at any moment, because data takes time to arrive even at the speed of light in a cable. The computers at either end must be tuned to dedicate at least a “cable full” of their memory to each active data transfer, or they will drop data that will then have to be re-sent. This problem is noticeable at distances of about 500 miles, and can slow down transfers by a factor of 1,000 on cross-country connections. Almost no computer is tuned for this “out of the box,” but fortunately tuning guidelines are available here.
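The “cable full” of data described above is commonly called the bandwidth-delay product: link speed multiplied by round-trip time. The sketch below estimates it, along with the best rate an untuned connection could reach. The 70 ms round-trip time and the 64 KiB window are illustrative assumptions rather than measurements from PNCC, and the function names are hypothetical.

```python
def bdp_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Bytes in flight needed to keep a link of this speed and delay full."""
    return bandwidth_bits_per_s * rtt_s / 8

def window_limited_rate(window_bytes: float, rtt_s: float) -> float:
    """Upper bound on throughput (bytes/s) when at most one window
    of data can be outstanding per round trip."""
    return window_bytes / rtt_s

RTT = 0.070  # assumption: 70 ms round trip on a cross-country path

needed = bdp_bytes(10e9, RTT)                   # 10 gigabit/second interface
untuned = window_limited_rate(64 * 1024, RTT)   # assumed 64 KiB buffer

print(f"Memory needed per transfer: {needed / 1e6:.1f} MB")  # 87.5 MB
print(f"Untuned ceiling: {untuned / 1e6:.2f} MB/s")          # 0.94 MB/s
```

Under these assumptions, a small buffer caps the transfer at under 1 megabyte/second regardless of how fast the network interface is, which is the effect the tuning guidelines are meant to remove.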
- How do I test my data transfer rate?
To perform a data transfer rate test, please contact our Data Team directly through email with the subject line: Test Dataset.
When a new Globus Shared Endpoint is created for your project, a large file named 'testdata' containing random data will be placed in it. You should download that file to test transfer rates to your institution.
If you want to test with known high-performance servers, or compare the speed of distant vs. local transfers, the Energy Sciences Network (ESnet) maintains a set of machines for such testing. They use the GridFTP protocol (which Globus uses) for maximum performance. See https://fasterdata.es.net/performance-testing/DTNs.
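After a test download, the measured rate can be compared with the 100-400 Mbps range described at the top of this page as typical for a well-tuned connection. This is a minimal sketch: the file size and elapsed time below are hypothetical numbers, not the actual size of the 'testdata' file, and the helper names are illustrative.

```python
def throughput_mbps(size_bytes: float, elapsed_s: float) -> float:
    """Average throughput in megabits per second (1 Mbps = 1e6 bits/s)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return size_bytes * 8 / elapsed_s / 1e6

def describe(mbps: float, low: float = 100.0, high: float = 400.0) -> str:
    """Compare a measured rate with the typical range quoted on this page."""
    if mbps < low:
        return "below the typical range; tuning is probably worthwhile"
    if mbps <= high:
        return "within the typical range"
    return "above the typical range"

# Hypothetical measurement: a 10 GB file downloaded in 250 seconds.
rate = throughput_mbps(10e9, 250.0)
print(f"{rate:.0f} Mbps: {describe(rate)}")  # 320 Mbps: within the typical range
```

Running the same calculation for a nearby server and a distant one gives a quick indication of whether distance, rather than local hardware, is the limiting factor.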
- What is ARIA2?
ARIA2 is a multi-protocol data download tool that is available at https://aria2.github.io. ARIA2 has several useful features, including the ability to transfer multiple data streams simultaneously to improve performance. It can use the metalink to your data that is in the “MyEMSL Notification - Data Uploaded for Project XXXXX” email message. It can also resume an interrupted data transfer without having to restart.
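As a rough illustration of launching ARIA2 with the metalink from the notification email, the sketch below assembles an `aria2c` command line with resume and multiple connections enabled. The URL and destination directory are placeholders, the choice of four connections is an assumption, and the helper function is hypothetical; check the ARIA2 documentation for the options that suit your site.

```python
import shlex

def build_aria2_command(metalink_url: str, dest_dir: str, connections: int = 4) -> list[str]:
    """Assemble an aria2c invocation (illustrative helper, not a PNCC tool)."""
    if not 1 <= connections <= 16:
        raise ValueError("aria2c accepts 1 to 16 connections per server")
    return [
        "aria2c",
        "--continue=true",                              # resume partial downloads
        f"--max-connection-per-server={connections}",   # parallel streams
        f"--dir={dest_dir}",                            # where files are written
        metalink_url,
    ]

# Placeholder values: substitute the link copied from your own email.
cmd = build_aria2_command("https://example.org/placeholder.metalink", "/data/pncc")
print(shlex.join(cmd))
```

The script only prints the command so it can be inspected before running; passing the list to `subprocess.run(cmd, check=True)` would start the download.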
- What is the MyEMSL data portal?
The simplest but slowest way to download data is from the EMSL User Portal at https://eus.emsl.pnnl.gov/Portal. Log in to the portal with your EMSL-assigned email address and password. Click on the “Get Data” tab to see a list of instruments that have data available to you. You can navigate to the data set of interest, select the files and folders that you want to download, and then click “Queue Selected Files for Download.” The portal will then prepare a downloadable set of files for you, which can take anywhere from several minutes to hours. You may leave the page and return to it later; the portal will continue preparing your data.
- A “Project Team Access” banner across the corner of the Upload pane indicates that the data set has not yet been released to the public. This means it is available only to approved members of the project and authorized EMSL staff. As an authorized member of the project, you should be able to download this data.
- If you do not see any instruments or data sets on the portal pages, make sure you go to https://eus.emsl.pnnl.gov/Portal and log in, then reload the pages. Also, select the appropriate date range for the data you are looking for.
- Once a downloadable file set (a “cart”) has been prepared you may download it with your web browser, or use a tool like wget, curl, or ARIA2.
- Can we use rsync?
Unfortunately, rsync is blocked by an institutional firewall on our end, so it is not an option. ARIA2 is available as a command-line parallel download manager. See the "How do I get my data" section above for details about ARIA2, and contact the Data Team directly if you have questions about using it.
- Can we send hard drives?
Yes, but this must be approved by scheduling before you ship hard drives to our facility. We prefer data to be transferred via one of the options listed above, but we do make exceptions for projects where the institution is far from the center, has very poor transfer rates, and has large datasets.
- What are my imaging parameters?
For all DATA COLLECTION sessions, a "METADATA_INFO.txt" file is automatically created within 30 minutes of the session data appearing on PNCC compute resources. This "METADATA_INFO.txt" contains all relevant parameters including Cs value, accelerating voltage, pixel size, exposure dose, etc. Keep in mind that the pixel size value comes from the cross-grating grid and can be a few percent off. Contact your SPOC before processing the data to get an accurate value (calibrated on the standard protein dataset). The metadata file can be found at the same directory level as the "frames/" and "relion3_preprocess/" folders containing your raw images and preprocessed data. Open the "METADATA_INFO.txt" file with any text editing program. Please contact your SPOC with any questions about the specific parameters, or the Data Team directly if the file cannot be found in your data directory. Note that the metadata file is not created for screening sessions.