Using Google Colab

The easiest way to prepare your data for ProtSpace - no local installation required!

Open In Colab

Overview

The Colab notebook converts protein embeddings into a visualization-ready .parquetbundle file:

  1. Reads your embeddings from an HDF5 file (.h5)
  2. Applies dimensionality reduction (PCA, UMAP, t-SNE, PaCMAP, MDS)
  3. Retrieves annotations from UniProt, InterPro, and NCBI Taxonomy
  4. Creates the .parquetbundle file ready for ProtSpace

Step 1: Get Protein Embeddings

You need an HDF5 file (.h5) containing protein embeddings. There are three ways to obtain one:

Option A: Download from UniProt

  1. Go to uniprot.org
  2. Search for proteins using UniProt query syntax (e.g., (ft_domain:phosphatase) AND (reviewed:true))
  3. Click Download → set Format to Embeddings → Submit job
  4. When the job finishes, download the prepared embedding file from UniProt's Tools Dashboard
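Before submitting a download job, it can help to preview what a query matches. The sketch below composes the example query from step 2 and builds a preview URL. It assumes UniProt's public REST search endpoint and the `accession` and `protein_name` return fields; the helper names `combine_clauses` and `preview_url` are made up for illustration. It only builds the URL and does not download embeddings, which still go through the Download dialog described above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: UniProt's public REST search endpoint for UniProtKB.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def combine_clauses(*clauses: str) -> str:
    """Join query clauses with AND, wrapping each one in parentheses."""
    return " AND ".join(f"({clause})" for clause in clauses)


def preview_url(query: str, size: int = 5) -> str:
    """Build a URL that lists a few matching entries as tab-separated text."""
    params = {
        "query": query,
        "format": "tsv",
        "fields": "accession,protein_name",
        "size": size,
    }
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"


query = combine_clauses("ft_domain:phosphatase", "reviewed:true")
url = preview_url(query)
print(query)
print(url)
```

Opening the printed URL in a browser should list a handful of matching entries, which is a quick way to catch a typo in the query before waiting on a job.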

Option B: Generate from FASTA

Use the dedicated embedding generation notebook:

Open Embedding Generator

This notebook:

  • Takes a FASTA file as input
  • Generates embeddings using various protein language models (ProtT5, ESM2, etc.)
  • Outputs an HDF5 file ready for ProtSpace
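Since the identifiers in the FASTA file end up as the names shown in the visualization, a quick check of the input can save a wasted embedding run. The following is a minimal sketch of such a check, not part of the notebook: `read_fasta` is a hypothetical helper, and the two records are made-up placeholder sequences. It assumes UniProt-style headers (`>sp|ACCESSION|NAME`) should be reduced to the accession.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {identifier: sequence}.

    UniProt-style headers (>sp|P12345|NAME_HUMAN ...) are reduced to the
    accession between the pipes; other headers use their first word.
    """
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            words = line[1:].split()
            if not words:
                raise ValueError("header line without an identifier")
            parts = words[0].split("|")
            current = parts[1] if len(parts) >= 3 and parts[1] else words[0]
            if current in records:
                raise ValueError(f"duplicate identifier: {current}")
            records[current] = ""
        elif current is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[current] += line  # sequences may span several lines
    return records


# Placeholder records, for illustration only.
example = """>sp|P12345|DEMO_HUMAN Hypothetical demo protein
MKTAYIAKQR
QISFVKSHFS
>my_protein_2
MADEEKLPPG
""".splitlines()

records = read_fasta(example)
for identifier, sequence in records.items():
    print(identifier, len(sequence))
```

For a real file, pass an open file handle instead of the example list, for instance `read_fasta(open("proteins.fasta"))`.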

Option C: Use Your Own Embeddings

For advanced users with custom embeddings, save them as an HDF5 file where each protein is stored as a dataset named by its identifier.
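A minimal sketch of that layout using h5py and NumPy is shown below. It assumes one fixed-length vector per protein; the identifiers, the vector length of 1024, the float32 storage type, and the helper names `write_embeddings` and `read_embeddings` are illustrative choices rather than requirements stated by ProtSpace.

```python
import os
import tempfile

import h5py
import numpy as np


def write_embeddings(path, embeddings):
    """Write {identifier: 1-D vector} to HDF5, one dataset per protein."""
    shapes = {np.asarray(vector).shape for vector in embeddings.values()}
    if len(shapes) != 1:
        raise ValueError(f"embeddings have inconsistent shapes: {shapes}")
    with h5py.File(path, "w") as handle:
        for identifier, vector in embeddings.items():
            # The dataset name is the protein identifier. Avoid "/" in
            # identifiers, since HDF5 treats it as a group separator.
            handle.create_dataset(
                identifier, data=np.asarray(vector, dtype=np.float32)
            )


def read_embeddings(path):
    """Read every top-level dataset back into {identifier: array}."""
    with h5py.File(path, "r") as handle:
        return {name: handle[name][:] for name in handle.keys()}


# Random stand-in vectors; replace with real embeddings.
rng = np.random.default_rng(0)
demo = {"P12345": rng.normal(size=1024), "Q67890": rng.normal(size=1024)}

path = os.path.join(tempfile.mkdtemp(), "custom_embeddings.h5")
write_embeddings(path, demo)
loaded = read_embeddings(path)
print(sorted(loaded), loaded["P12345"].shape)
```

Reading the file back, as done at the end, is a cheap way to confirm that every identifier made it into the file with the expected vector length before uploading it.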

Step 2: Run the Notebook

  1. Click the Colab badge above to open the notebook
  2. Run the first cell to install dependencies (~1 minute)
  3. Upload your .h5 embeddings file

Step 3: Configure Options

Annotations

Choose which annotations to include:

Source   | Available Annotations
UniProt  | annotation_score, subcellular_location, protein_families, reviewed, etc.
InterPro | CATH, Pfam, signal_peptide, superfamily
Taxonomy | kingdom, phylum, class, order, family, genus, species

TIP

The first time you select taxonomy annotations, the notebook downloads a database, which takes about a minute.

Dimensionality Reduction

Choose which 2D projections to generate:

  • PCA - Fast, initial overview
  • UMAP - Best balance of speed and quality (recommended)
  • t-SNE - Great for clusters, slower on large datasets
  • PaCMAP - Alternative to t-SNE/UMAP
  • MDS - Preserves pairwise distances
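To make the idea of a 2D projection concrete, here is a minimal PCA sketch in NumPy. It is an illustration under simple assumptions (a dense matrix with one row per protein, random stand-in data) and is not the notebook's own implementation; `pca_2d` is a hypothetical name.

```python
import numpy as np


def pca_2d(matrix):
    """Project rows (n_proteins x n_dims) onto the top two principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing
    # singular value, so the first two capture the most variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(42)
embeddings = rng.normal(size=(200, 1024))  # stand-in for real embeddings
coords = pca_2d(embeddings)
print(coords.shape)  # (200, 2)
```

Each row of `coords` is one point in the scatterplot. The nonlinear methods in the list produce the same kind of two-column output but optimize for different notions of neighborhood or distance.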

Parameters (Optional)

Fine-tune settings for each method:

Method | Parameters
UMAP   | N Neighbors, Min Dist
t-SNE  | Perplexity, Learning Rate
PaCMAP | N Neighbors, MN Ratio, FP Ratio
MDS    | N Init, Max Iter
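Neighborhood-style parameters only make sense relative to the dataset size, which matters most when testing on a small subset. The sketch below shows generic sanity checks one could run before generating projections. The dictionary keys and the `check_parameters` helper are hypothetical and do not correspond to the notebook's widgets; the rules are rules of thumb (scikit-learn's t-SNE, for example, rejects a perplexity that is not smaller than the number of samples).

```python
def check_parameters(n_proteins, params):
    """Return a list of messages for settings that do not fit the dataset size."""
    problems = []
    # Neighbor counts and perplexity should stay below the number of points.
    for key in ("umap_n_neighbors", "pacmap_n_neighbors", "tsne_perplexity"):
        value = params.get(key)
        if value is not None and value >= n_proteins:
            problems.append(
                f"{key}={value} should be smaller than the number of proteins ({n_proteins})"
            )
    min_dist = params.get("umap_min_dist")
    if min_dist is not None and min_dist < 0:
        problems.append("umap_min_dist must not be negative")
    return problems


# 15, 0.1 and 30 are the usual defaults in umap-learn and scikit-learn;
# whether the notebook starts from the same values is an assumption.
settings = {"umap_n_neighbors": 15, "umap_min_dist": 0.1, "tsne_perplexity": 30}

for message in check_parameters(25, settings):
    print(message)
```

With 25 proteins, only the perplexity setting is flagged; lowering it or using a larger subset resolves the warning.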

Step 4: Generate and Download

  1. Click "Generate Bundle"
  2. Wait for processing (time depends on dataset size)
  3. Download your .parquetbundle file

Step 5: Visualize in ProtSpace

  1. Go to protspace.app/explore
  2. Drag & drop your .parquetbundle file onto the scatterplot
  3. Start exploring!

Tips

  • Start small: Test with a subset of proteins first.
  • PCA is fastest: The other methods slow down noticeably as datasets grow. MDS in particular works on all pairwise distances, so its cost grows at least quadratically with the number of proteins.
  • Try multiple methods: For best results, include both PCA and UMAP.
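The advice to start small is easier to appreciate with some arithmetic. The sketch below estimates what a method that needs every pairwise distance (MDS is the usual example) has to handle: the number of protein pairs and the memory for a full distance matrix. It assumes 8-byte float64 values; `pairwise_cost` is a hypothetical helper.

```python
def pairwise_cost(n_proteins, bytes_per_value=8):
    """Return (number of distinct pairs, bytes for a full n x n distance matrix)."""
    pairs = n_proteins * (n_proteins - 1) // 2
    full_matrix_bytes = n_proteins * n_proteins * bytes_per_value
    return pairs, full_matrix_bytes


for n in (1_000, 10_000, 100_000):
    pairs, size = pairwise_cost(n)
    print(f"{n:>7,} proteins: {pairs:>13,} pairs, {size / 1e9:,.3f} GB as float64")
```

Going from 1,000 to 10,000 proteins multiplies the matrix size by 100, from 8 MB to 0.8 GB, which is why a subset that runs in seconds says little about the runtime on the full dataset.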

Alternative: Python CLI

For local processing, automation, or larger datasets, see the Python CLI guide.

Released under the Apache 2.0 License.