Download Data

PNCC offers four data download methods:

  • Globus (a Globus ID is required on all proposals)

  • MinIO (in testing)

  • ARIA2

  • PNCC User Portal

PNCC data sets are very large; some comprise multiple terabytes. Distance matters a great deal when moving that much data across a wide area network, and how well the connection is tuned can change transfer times by factors of 1,000. Transfer rates of 100-400 Mbps are typical when transferring data from PNCC over a well-tuned network connection.

To work around some of the inherent problems of transferring large quantities of data across long distances, PNCC recommends that you:

1. Ensure that machines at your institution are well-configured for long-distance data transfer.

2. Choose a file transfer tool that can perform well.

3. Test your transfers end-to-end from EMSL to troubleshoot any problems prior to taking data at PNCC.

Frequently Asked Questions and Tips for Tuning Data Transfer

How do I get my data?

PNCC processes data and distributes it to users via the EMSL Computing facility (https://www.emsl.pnl.gov) at Pacific Northwest National Laboratory (PNNL) in Richland, Washington. The EMSL computing facility features parallel HPC systems, petabytes of data storage, and dedicated high-speed data transfer capability. We offer four routes for you to access your PNCC data.

  1. The fastest transfer option is Globus. This is the preferred method because you get near real-time access to the data as it streams off the microscope and detector. Globus also provides a single point of access for both old and new data through a single endpoint, rather than separate download links for each dataset. However, not all institutions permit inbound Globus transfers. PNCC now requires each proposal to list the Globus ID of at least one team member at the time of proposal submission. Once your project is approved, please reach out to our Data Team to run a test transfer via a shared endpoint, confirm that Globus works for you, and establish expected transfer speeds to your local storage. Setting up Globus access is straightforward and is completely free for academics and non-profits. To get an ID, go to https://www.globusid.org/create, where you can register for a Globus basic account if your institution doesn't already have an institutional license.

  2. The second fastest transfer option is MinIO. We are currently testing MinIO (S3 API) to provide users an alternate route to their data. There are many S3-compatible clients, but since this option is still in testing, we have not yet integrated it with the PNCC user account system. Users wanting to test this option should contact our Data Team so that we can issue an access key and provide any additional help needed alongside our MinIO [documentation].

  3. The third fastest transfer option is the ARIA2 parallel download manager. For every dataset collected, an auto-generated email is sent to the team members notifying them that data is available for download. However, this option is only available after a microscope session is complete and the full dataset has been uploaded to the archive. Installing ARIA2 is straightforward, with options to compile from source or download binaries directly for OS X and Windows. Once installed, ARIA2 is easy to use. At the bottom of each PNCC auto-generated dataset email is a link, specific to that dataset, for use with ARIA2. Simply copy this link and pass it to the ARIA2 command line to initiate the transfer. Like Globus, ARIA2 will automatically resume after a disruption if your local connection is intermittent or the download gets halted. ARIA2's parallelized download options are particularly helpful for big files or when transfer rates between your institution and the PNCC archive are slow.

  4. The slowest but most accessible transfer option is the PNCC User Portal's “Get Data.” To access your datasets, log in to the PNCC User Portal, then click the “Get Data” tile, or click the MyEMSL data download link in the auto-generated email mentioned above. Note: The portal login gives you a token needed to see the data, and the token can time out after a while. If you have trouble seeing the data, try logging in to the portal in another browser tab and then refreshing the data portal page.

If you have any issues or difficulties, please contact our Data Team directly by email. None of these tools will perform well if you are far from the Pacific Northwest of the United States and have not addressed the tuning issues described in "Configure and check your machines" below.

How will I know when I have data ready?

If you are using Globus, you will usually see your data on our Globus endpoint within 30 minutes of your session starting on the microscope. You will not be notified automatically when the session starts, but your SPOC may contact you if you have a pre-scheduled reservation. Whether or not you are using Globus to download your data, an auto-generated email will be sent to the project team members when the complete dataset is archived and available for download. This occurs after the end of the session and can take extra time depending on the archival queue. The email will have a subject line of “MyEMSL Notification - Data Uploaded for XXXXX” and will contain direct links to your data in the EMSL data portal as well as a metalink that can be used to download your data with ARIA2. Note: You may receive multiple copies of this email if your dataset was incomplete upon archiving.

The automated pre-processed data will be uploaded to the archive separately, and another auto-generated email will be sent to the project team when that data is available for download. Depending on the queue size, the pre-processed data may appear in the archive later than the raw data. However, it will be visible via Globus as soon as each file is created.

Please contact our Data Team directly by email if you still don't see your dataset more than 24 hours after the end of your session. 

What is Globus? Why are we using it?

Globus is widely used in the scientific research community for sharing and transferring large quantities of data. It is designed to provide secure, reliable, high-performance data access across multiple sites through a single interface. Globus is used day in and day out to transfer PNCC data from OHSU to EMSL and moves multiple terabytes of data per day.

Introductory information about Globus is available here.

How do I obtain a Globus ID?

In order to use Globus, you will need a Globus ID. The Globus ID identifies you to Globus, and allows you to access Globus endpoints (see below), including the EMSL Data Transfer Node (emsl#dtn) that hosts PNCC user data. Additional information about getting started with Globus is available here.

What are the minimum technical requirements for running Globus?

The normal Globus mode of operations is to use endpoints at either end of a data transfer. An endpoint is usually a dedicated data storage system with high performance network interfaces and disk storage, tuned for long distance data transfers.

It is also possible to set up a personal computer as a Globus endpoint by installing Globus Connect Personal software on it. Note that this will probably not perform as well as a dedicated, tuned endpoint. If you have a local server-class Globus endpoint at your institution, it is likely better to transfer data to that endpoint and then make a local copy on your own machine.
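If you prefer the command line to the Globus web app, the Globus CLI can also drive endpoint-to-endpoint transfers. The sketch below makes some assumptions: the endpoint UUIDs and paths are placeholders, so substitute the real values for your project and institution.

```shell
# One-time authentication (opens a browser window).
globus login

# Placeholder endpoint UUIDs and paths -- substitute your own.
SRC="aaaaaaaa-0000-0000-0000-000000000000"   # source (e.g. the EMSL DTN)
DST="bbbbbbbb-1111-1111-1111-111111111111"   # your institutional endpoint

# Recursively transfer a project directory; Globus retries failed files itself.
globus transfer --recursive "$SRC:/my/project/dir" "$DST:/local/storage/dir"
```

Endpoint UUIDs can be looked up with `globus endpoint search`, and transfers submitted this way appear in the same activity view as transfers started from the web app.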

What should I do if Globus doesn't work for me?

As mentioned above in the "How do I get my data" section, PNCC has three other options for data transfer. Please contact our Data Team directly if you have any questions about the other access mechanisms. Note that the auto-generated email you receive when each dataset is available in the archive also contains specific instructions and links for that dataset.

What are typical data transfer speeds?

The average data collection produces 2 TB or more of data. At 1 Gbps (i.e., a common internal wired network speed), it would take six or more hours to transfer that amount of data locally from start to finish. The farther your location is from our site, the slower transfer rates are likely to be. The emsl#msc Globus endpoint available to PNCC users has been observed sending data via Globus to Arizona as fast as 275 megabytes/second. At that rate, one terabyte can be transferred in about an hour. In comparison, a poor cross-country connection has been seen to perform as badly as 250 kilobytes/second, in which case that same terabyte would take a month and a half to transfer, demonstrating the need to tune network connections at the user institution.
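As a sanity check, the transfer-time arithmetic above can be reproduced with a one-line calculation (275 MB/s is the example rate from this section; substitute your own measured rate):

```shell
# Hours to move a dataset: size (TB) * 1e6 MB/TB / rate (MB/s) / 3600 s/hour.
awk -v tb=1 -v mb_per_s=275 'BEGIN { printf "%.1f hours\n", tb*1e6/mb_per_s/3600 }'
# → 1.0 hours
```

Rerunning with `mb_per_s=0.25` (the 250 kilobytes/second worst case) shows why tuning matters.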

Configure and check your machines

In order to get good data transfer rates from PNCC, your receiving machine(s) must have:

  1. A reliable network connection that does not experience errors or dropped packets

  2. A fast wired network interface (10 gigabit/second is recommended). Do not use wireless for these data transfers!

  3. Tuning of the operating system (Linux, Windows, or MacOS) that dedicates extra system memory to data transfer

  4. Fast disk or SSD storage (capable of at least 100 megabyte/second) that is large enough to hold your data set(s)

The farther you are from the Pacific Northwest in the United States, the more important #1, #2, and #3 are. See here for instructions on #3.

Background info: At long distances, many megabytes of data can be “in the wire” between computers that are far apart even when taking the speed of light in a cable into account. The computers at either end must be tuned to dedicate at least a “cable full” of their memory to each active data transfer, or they will drop data that will then have to be re-sent. This problem is noticeable at distances of about 500 miles, and can slow down transfers by a factor of 1,000 on cross-country connections. Almost no computer is tuned for this “out of the box,” but fortunately tuning guidelines are available here.
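That "cable full" is the bandwidth-delay product, which is easy to estimate for your own path. The 1 Gbps bandwidth and 80 ms round-trip time below are illustrative numbers, not measurements of any particular route; an RTT measured with ping to the transfer host can be substituted for rtt_ms.

```shell
# Bytes in flight = bandwidth (bits/s) * round-trip time (s) / 8 bits per byte.
# A 1 Gbps path with an 80 ms round trip needs roughly 10 MB of buffer.
awk -v gbps=1 -v rtt_ms=80 'BEGIN { printf "%.0f MB\n", gbps*1e9*rtt_ms/1000/8/1e6 }'
# → 10 MB
```

If the operating system's TCP buffers are smaller than this figure, a single stream cannot fill the pipe, which is exactly the slowdown described above.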

How do I test my data transfer rate?

To perform a data transfer rate test, please contact our Data Team directly through email with the subject line: Test Dataset.

When a new Globus Shared Endpoint is created for your project, a large file named 'testdata' containing random data will be placed in it.  You should download that file to test transfer rates to your institution.

If you want to test with known high-performance servers, or compare the speed of distant vs. local transfers, the Energy Sciences Network (ESnet) maintains a set of machines for such testing. They use the GridFTP protocol (which Globus uses) for maximum performance. See https://fasterdata.es.net/performance-testing/DTNs.
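For a rough single-stream estimate without any extra software, curl can report the average rate it achieved. The URL below is a placeholder; substitute a large file you are entitled to download, such as the 'testdata' file mentioned above.

```shell
# Placeholder URL -- replace with a large file you have access to.
TEST_URL="https://example.invalid/testdata"

# Discard the payload and print the average download speed in bytes/second.
curl -s -o /dev/null -w 'average speed: %{speed_download} bytes/s\n' "$TEST_URL"
```

A single-stream tool like curl will understate what a parallel tool such as Globus or ARIA2 can achieve, but it makes before/after tuning comparisons easy.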

What is MinIO?

MinIO is an object storage system that uses an Amazon S3 web interface for letting clients manipulate their data. It can manage data replication and scaling over many systems and disks, or it can simply provide an S3 interface to an existing directory on a filesystem (our use-case).

Why MinIO?

The Amazon S3 API is a de facto standard supported by many servers and clients. Since it uses HTTP as a transport, it is reasonably easy to proxy the traffic, and transfers are generally fast. This makes it easier to access data across firewalls, keep traffic secure, maintain high speeds, and give users a diverse set of client options.

How do users use MinIO?

Currently the setup process is manual on both the PNCC end and the user end. Users interested in S3 access should email our Data Team. After verifying that the originating email is associated with the project, we generate login credentials for the project (see below for further details) in the form of a text file. That text file can be used directly with rclone, or the relevant parameters can be extracted for use with another S3 client.

There are many available S3 clients, and the most pertinent information for users is the server name, a username, and a password. Because many users are already familiar with rsync, a good client to recommend is Rclone (https://rclone.org). Rclone supports many different file protocols (not just S3), so it is convenient to provide connection information in the form of a single config file. For example, the following text file, 51799.conf:

             [pncc-51799]
             type = s3
             provider = Other
             env_auth = false
             access_key_id = 51799
             secret_access_key = 1rRmndsfgsdfOAxYSolup+/2K+w=
             endpoint = https://pncc-data.ohsu.edu 

The above tells rclone that the shortcut name `pncc-51799` refers to an `s3` server at `https://pncc-data.ohsu.edu` that can be accessed with the username `51799` and the password `1rRmndsfgsdfOAxYSolup+/2K+w=`.
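Assuming a config file like the example above, a quick way to confirm the credentials work is to list the share before starting a large copy:

```shell
# List top-level prefixes ("directories") visible to this project.
rclone --config 51799.conf lsd pncc-51799:

# List individual files with their sizes.
rclone --config 51799.conf ls pncc-51799:51799
```

If these commands return an access-denied error, contact the Data Team to verify the issued credentials.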

Rclone has many subcommands and flags that can change with major version updates, so it is useful to peruse the docs and familiarize yourself with the basic functionality. The `copy` command provides functionality similar to rsync and might be used with the above config file like this:

             rclone copy --config 51799.conf -P pncc-51799:51799 path/to/destination/.

The above command copies the entire `51799` directory, skipping files that are already present at the destination. Like rsync, it can be run repeatedly to keep your copy up to date with reasonable efficiency. The `-P` flag provides a useful progress readout.

As a final reminder, rclone supports many other useful commands, and it is not the only S3-compatible client. On the Mac, Transmit (https://www.panic.com/transmit/) is a nice GUI file transfer client that speaks S3. There are also ways to mount an S3 share like a normal drive: https://mac.eltima.com/mount-cloud-drive.html. All the information needed to use any of these clients can be gleaned from the Rclone config file.

What is ARIA2?

ARIA2 is a multi-protocol data download tool that is available at https://aria2.github.io. ARIA2 has several useful features, including the ability to transfer multiple data streams simultaneously to improve performance. It can use the metalink to your data that is in the “MyEMSL Notification - Data Uploaded for XXXXX” email message. It can also resume an interrupted data transfer without having to restart.
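A minimal ARIA2 invocation might look like the following; the metalink URL is a placeholder standing in for the link at the bottom of your dataset email.

```shell
# Placeholder metalink URL -- paste the real link from your dataset email.
METALINK_URL="https://example.invalid/mydataset.metalink"

# -c resumes an interrupted transfer; -x 4 and -s 4 allow up to four
# connections per server, which helps on long-distance links.
aria2c -c -x 4 -s 4 "$METALINK_URL"
```

Rerunning the same command after an interruption picks up where the transfer left off rather than starting over.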

How do users download data using MyEMSL's data portal?

The simplest but least performant way to download data is from the PNCC User Portal at https://pnccportal.labworks.org/.

  1. Log in to the portal with your EMSL-assigned email address and password. On the home page, click the “Get Data” tile. On the resulting page, scroll to your project number and the relevant microscope. Click the microscope name and you will be taken to a search page that shows the selected project and instrument and lets you refine your search criteria if desired.

  2. Click the box next to the dataset you want to download and a cart will be created. Preparing the cart may take a few minutes to a couple of hours, depending on file sizes and how many other carts are queued. You may leave the page and return to it later; the portal will continue preparing your data. Once the cart is ready, a link will appear on the page that you can click to start the download directly over your web connection.

  3. If your local connection is intermittent and the download gets halted, you can copy the download link and use the wget or ARIA2 download managers, which can resume after a disruption; ARIA2 additionally offers parallelized download options for big files.

  4. Once you have the files, you can extract them as follows, depending on your computing platform:
                - Macintosh
                                Double click on the downloaded file in the Finder.
                - Windows
                                Download 7-Zip and install it on your machine. Right-click on the downloaded file and choose: 7-Zip > Extract Here
                - Linux
                                From a terminal window, type tar -xf your_tar_file.tar

  5. To keep file sizes small and facilitate the transfer, the movie files and other data may be compressed separately using Zstandard (https://github.com/facebook/zstd). To uncompress, download Zstandard from GitHub and then use the command “zstd -d file.zst” to recover the original raw file. You can write a quick C-shell script to loop over all files, for example:
                foreach stack_file ( *.zst )
                set stack_filename = ${stack_file:r}
                echo ${stack_filename}
                zstd -d ${stack_file}
                end

Notes:

  • A “Project Team Access” banner across the corner of the Upload pane indicates that the data set has not yet been released to the public. This means it is available only to approved members of the project and authorized EMSL staff. As an authorized member of the project, you should be able to download this data.

  • If you do not see any instruments or data sets on the portal pages, make sure you go to https://pnccportal.labworks.org/ and log in, then reload the pages. Also, select the appropriate date range for the data you are looking for.