Keep HPLC data in S3 (AWS, Google Cloud, Azure, etc.)

Chromatography/Mass Spec data can be very heavy, especially spectra. Instead of keeping it all in Postgres, Peaksel can store signals in S3-compatible blob storage (AWS, Google Cloud, Azure Blob Storage, Hetzner).

To activate this, add/uncomment these properties in docker-compose.yml (an example follows the list):

s3.blobs.access_key_id: [key id]
s3.blobs.access_key: [key content]
s3.blobs.bucket: [s3 bucket]
s3.blobs.endpoint: [s3 root URL]
s3.blobs.region: [region]
s3.provider: [cloud provider]
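
Purely as a sketch, assuming Peaksel is defined as a service named peaksel in the same docker-compose.yml and that the properties are passed as environment entries (both of these are assumptions, as are the placeholder values), the AWS case might look like this:

services:
  peaksel:
    environment:
      s3.blobs.access_key_id: AKIAEXAMPLEKEYID   # placeholder
      s3.blobs.access_key: example-secret-key    # placeholder
      s3.blobs.bucket: peaksel-blobs             # placeholder
      s3.blobs.region: eu-central-1              # placeholder
      s3.provider: aws                           # s3.blobs.endpoint is ignored for aws and can be left out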

By default, Peaksel stores binaries in Postgres first and migrates them to S3 after some time (e.g. a day). It's possible to control the following (an example override is shown after the list):

  • How old the blobs must be before they are migrated to S3: job.s3.blobs.older_than_seconds, defaults to 1 day

  • The minimum size of a blob to be considered for migration: job.s3.blobs.min_size_bytes, defaults to 64 kB

  • The minimum size at which an object is uploaded directly to S3, bypassing Postgres: s3.blob.immediate_upload_threshold_bytes, defaults to 1 GB
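
For example, to move data to S3 more aggressively, these defaults could be overridden as follows (the numbers are purely illustrative and go alongside the other s3.* properties):

job.s3.blobs.older_than_seconds: 3600                # migrate blobs older than 1 hour instead of 1 day
job.s3.blobs.min_size_bytes: 131072                  # only migrate blobs of at least 128 kB
s3.blob.immediate_upload_threshold_bytes: 536870912  # blobs over 512 MB go straight to S3, bypassing Postgres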

Cloud providers

The cloud provider property (s3.provider) has the following possible values:

  • aws - use AWS cloud. With this option, the s3.blobs.endpoint property will be ignored.

  • gcp - use Google Cloud. To use Google Cloud Storage buckets, a default project must be set in the interoperability settings. With this option, the s3.blobs.endpoint property will be ignored.

  • hetzner - use Hetzner cloud. With this option, the s3.blobs.endpoint property will be ignored.

  • azures3 - use Azure Blob Storage via S3Proxy (see the Azure Blob Storage section below)

  • other - use any other S3-compatible cloud provider. s3.blobs.endpoint is mandatory with this option: it must be the complete URL, including the region, bucket and anything else that is present, and those parts must match the values in s3.blobs.bucket and s3.blobs.region (see the example below). Support is not guaranteed; if you run into issues with your provider, please contact us so we can add support for it.

If the value is missing or invalid, it defaults to other.
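
As an illustration, a self-hosted S3-compatible store (the hostname, bucket and region below are placeholders) would be configured as other; note that the endpoint is mandatory and must agree with s3.blobs.bucket and s3.blobs.region:

s3.blobs.access_key_id: [key id]
s3.blobs.access_key: [key content]
s3.blobs.bucket: peaksel-blobs
s3.blobs.endpoint: https://s3.us-east-1.storage.example.com
s3.blobs.region: us-east-1
s3.provider: other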

Azure Blob Storage

Peaksel can interact with Azure via S3Proxy. We don't support Azure natively because it's not S3-compatible; if you have any issues with Azure Blob Storage, please contact us about native support. S3Proxy can be run as a Docker container:

services:
  s3proxy:
    restart: always
    container_name: s3proxy
    image: andrewgaul/s3proxy:sha-b6ce601
    ports:
      - "9000:80"
    expose:
      - "80"
    environment:
      - LOG_LEVEL=info # can be set to debug or trace for troubleshooting
      - S3PROXY_ENDPOINT=http://0.0.0.0:80
      - S3PROXY_IDENTITY=local-identity #  value that will be used as access_key_id in peaksel
      - S3PROXY_CREDENTIAL=local-credential #  value that will be used as access_key in peaksel
      - S3PROXY_AUTHORIZATION=none # signature authentication. For peaksel, set 'none'
      - JCLOUDS_PROVIDER=azureblob-sdk # jclouds provider used to talk to Azure Blob Storage
      - JCLOUDS_IDENTITY=serviceaccount # name of the Azure storage account
      - JCLOUDS_CREDENTIAL=accesskey # access key of the storage account
      - JCLOUDS_ENDPOINT=https://serviceaccount.blob.core.windows.net # https://<storage-account-name>.blob.core.windows.net

In Peaksel, the corresponding properties are:

s3.blobs.access_key_id: local-identity
s3.blobs.access_key: local-credential
s3.blobs.bucket: [azure blob container name]
s3.blobs.endpoint: http://localhost:9000
s3.blobs.region: [azure blob container region]
s3.provider: azures3
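
Note that http://localhost:9000 assumes Peaksel reaches S3Proxy through the port published on the Docker host. If Peaksel and S3Proxy run in the same docker-compose network (an assumption about your setup), the proxy can instead be addressed by its service name on the internal port:

s3.blobs.endpoint: http://s3proxy:80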

For more information, refer to a blog post by Microsoft.