Mo Data, Mo Problems: How to share big data with ease

Sharing Image Data

At the Imaging Platform we frequently need to exchange images with collaborators and forum users. Working with images can be challenging simply because of their file size. Memory and disk limitations are common irritants when analyzing images, but in some cases, such as multi-dimensional images and whole-slide scans, the files can be large enough that even accessing them becomes a major roadblock, especially when their size starts to rival the capacity of a desktop or laptop hard drive.

1. Packaging Image Data

A common first step before transferring images is to package multiple images together into a single file. There are several popular packaging or archive formats, e.g. tar or zip. To create the archives we recommend the open-source archive utility 7-Zip. Alternatively, on a Mac, select the files, right-click, and choose “Compress N items” from the resulting menu to create a .zip archive.
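
For those who prefer to script this step, the packaging described above can be sketched with Python's standard `zipfile` module. This is a minimal illustration rather than a recommended workflow; the function name `package_images` and the folder and archive names in the usage comment are hypothetical.

```python
import zipfile
from pathlib import Path


def package_images(image_dir, archive_path):
    """Bundle every file under image_dir into a single .zip archive."""
    image_dir = Path(image_dir)
    archive_path = Path(archive_path)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(image_dir.rglob("*")):
            # Skip folders, and skip the archive itself if it lives inside image_dir
            if not path.is_file() or path.resolve() == archive_path.resolve():
                continue
            # Store paths relative to image_dir so the archive unpacks cleanly
            archive.write(path, arcname=path.relative_to(image_dir).as_posix())
    return archive_path


# Hypothetical usage:
# package_images("experiment_01", "experiment_01.zip")
```

Image formats that are already compressed (JPEG, for example) will not shrink much further, so the main benefit here is having one file to transfer instead of many.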

2. Transferring Image Data

Small Image Sets (a single image or a package less than 25MB)
Small image sets could be a single image or a small collection, maybe a dozen images. This kind of data usually isn’t transferred back and forth with any frequency; it tends to be a one-directional transfer.

  • Email: The easiest and most straightforward approach to sharing images is sending them through email. File size limits still exist, but they are becoming less and less noticeable. For example, Gmail has a file size limit of 25MB per email and a total storage limit of 15GB. However, the per-email limit is automatically circumvented by attaching the images using a Google Drive link, which is discussed below.
  • Team Chat: If you’re at a lab or company using team chat, such as Slack, then images can easily be swapped back and forth through messaging. Slack has a file size limit of 1GB. However, uploading such large files is not recommended, since the maximum storage of a Slack channel is 10GB per team member on a standard plan.

Medium Image Sets (less than 2GB)
Medium-size image sets might represent an entire experiment or a more complex image type, such as a stitched image of a tissue or a lengthy time-lapse image.

  • Cloud Storage: [Dropbox](https://www.dropbox.com/help/billing/cost) and Google Drive are prominent cloud storage services that also make it easy to share images. Dropbox offers 2GB of storage in its free plan and Google Drive offers 15GB. Image sets can be uploaded to either service, and a link can then be created to grant collaborators access to the files. Google Drive is integrated into Gmail such that if you attempt to attach a too-large file to an email, Gmail will offer to store the file on Google Drive.
  • File Transfer Service: One notable file transfer service is wetransfer.com, which will transfer up to 2GB of data for free. The data is uploaded to wetransfer.com for a limited amount of time (2 weeks). Then, a link is created that can be shared through email. Anyone with the link can then download the data. This is a convenient way to move medium amounts of data without having to store it long-term in a cloud storage service.

Large Image Sets (more than 2GB)
Large image sets are collections of images that might represent the archive of a project that spanned months or years, or the output of a high-throughput screen. These image sets may not even fit on a single hard drive and might be processed by CellProfiler in headless mode on a computer cluster or cloud service such as Amazon Web Services (AWS).

  • Cloud Services: Cloud services like AWS provide cloud storage similar to Google Drive or Dropbox. In AWS, the storage service S3 can be used to house data that can then be processed by the EC2 or ECS compute services (check out [Distributed CellProfiler](https://github.com/CellProfiler/Distributed-CellProfiler) for documentation on implementing CellProfiler in the cloud). An S3 bucket can be created for the imaging project and a file transfer client such as [Cyberduck](https://cyberduck.io/) can be used to manage the transfer of files to and from the bucket. Note that whoever maintains the AWS infrastructure must provide you and your collaborator with access credentials.
  • FTP: If your group has a file server, then FTP or SFTP is a good option for moving files; SFTP is the safer choice because it encrypts both the credentials and the data in transit. This is like having cloud storage without having to open an account with Dropbox. The administrator of the file server will be able to provide you and your collaborator with a username and password. A free FTP client such as [FileZilla](https://filezilla-project.org/) or Cyberduck provides a user interface for managing the transfer of files.
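
If you would rather script the S3 transfer than use a graphical client, the upload can be sketched in Python with the third-party `boto3` library. This is only an illustration under several assumptions: `boto3` is installed, AWS credentials have already been configured on your machine by whoever maintains the infrastructure, and the function name `upload_directory`, the bucket name, and the prefix are all hypothetical.

```python
from pathlib import Path


def upload_directory(local_dir, bucket, prefix="", client=None):
    """Upload every file under local_dir to the given S3 bucket.

    Returns the list of object keys that were uploaded.
    """
    if client is None:
        # Imported lazily so the function can be exercised without boto3 installed
        import boto3  # third-party: pip install boto3
        client = boto3.client("s3")
    local_dir = Path(local_dir)
    uploaded = []
    for path in sorted(local_dir.rglob("*")):
        if not path.is_file():
            continue
        # Mirror the local folder layout in the object key
        relative = path.relative_to(local_dir).as_posix()
        key = f"{prefix.rstrip('/')}/{relative}" if prefix else relative
        client.upload_file(str(path), bucket, key)
        uploaded.append(key)
    return uploaded


# Hypothetical usage (bucket and prefix are placeholders):
# upload_directory("screen_2017", "my-imaging-project", prefix="raw_images")
```

Accepting the client as a parameter keeps the sketch easy to try with a stand-in object before pointing it at a real bucket.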

When all else fails… 

  • Physical Media: Sometimes the simplest solution is to copy the images to an external hard drive, a DVD, or a flash drive. The media can then be physically taken to a collaborator or sent through a postal service.
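
The size tiers used in this section can be summarized in a few lines of Python. This is a rough sketch that simply encodes the thresholds quoted above (25MB and 2GB); the function names are hypothetical, decimal units (1MB = 1,000,000 bytes) are an assumption, and a set of exactly 2GB is treated as large.

```python
from pathlib import Path

MB = 1000 ** 2  # assumption: decimal megabytes
GB = 1000 ** 3  # assumption: decimal gigabytes


def total_size(path):
    """Return the size in bytes of a file, or of all files under a folder."""
    path = Path(path)
    if path.is_file():
        return path.stat().st_size
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


def suggest_transfer(size_bytes):
    """Map a data size onto the small / medium / large tiers from this post."""
    if size_bytes < 25 * MB:
        return "small: email or team chat"
    if size_bytes < 2 * GB:
        return "medium: cloud storage or a file transfer service"
    return "large: cloud services, FTP/SFTP, or physical media"


# Hypothetical usage:
# print(suggest_transfer(total_size("experiment_01")))
```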

Final Thoughts

The above approaches represent only a handful of the tools, services, and methods available for sharing images and data. New options and updates are bound to appear as long as the demand for more images and data continues to rise. If you’ve got a favorite tool, or you think we’ve left out an important option, please mention it in the comments.
