Using Google Colab

The easiest way to prepare your data for ProtSpace - no local installation required!

Overview

The Colab notebook converts protein embeddings into a visualization-ready .parquetbundle file:

Reads your embeddings from an HDF5 file (.h5)
Applies dimensionality reduction (PCA, UMAP, t-SNE, PaCMAP, MDS)
Retrieves annotations from UniProt, InterPro, and NCBI Taxonomy
Creates the .parquetbundle file ready for ProtSpace

Step 1: Get Protein Embeddings

You need an HDF5 file (.h5) containing protein embeddings. There are three ways to obtain this:

Option A: Download from UniProt (Recommended)

Go to uniprot.org
Search for proteins using UniProt query syntax (e.g., (ft_domain:phosphatase) AND (reviewed:true))
Click Download, select the Embeddings format, and submit the job
When the job finishes, download the prepared embedding file from UniProt's Tools Dashboard
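
Before submitting an embedding job, it can help to preview which entries a query matches. The sketch below only builds a URL for the public UniProt REST search endpoint using the example query from the steps above. The helper name build_search_url and the page size are illustrative choices, and the embeddings themselves are still downloaded through the website as described.

```python
from urllib.parse import urlencode

# Public UniProtKB search endpoint of the UniProt REST API.
UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(query: str, size: int = 25) -> str:
    """Return a URL that lists the accessions matching a UniProt query.

    `build_search_url` is a hypothetical helper, not part of ProtSpace.
    """
    params = {
        "query": query,
        "format": "list",  # plain text, one accession per line
        "size": size,      # number of results per page
    }
    return f"{UNIPROT_SEARCH_URL}?{urlencode(params)}"


url = build_search_url("(ft_domain:phosphatase) AND (reviewed:true)")
print(url)
```

Opening the printed URL in a browser shows the first page of matching accessions, which is a quick way to confirm the query before requesting embeddings.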

Option B: Generate from FASTA

Use the dedicated embedding generation notebook, which:

Takes a FASTA file as input
Generates embeddings using various protein language models (ProtT5, ESM2, etc.)
Outputs an HDF5 file ready for ProtSpace
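
A quick sanity check of the FASTA file before uploading can catch empty or duplicated identifiers. The sketch below assumes the identifier is the first whitespace-delimited token of each header line; how the embedding notebook actually derives identifiers is not stated here, so treat that rule, the function names, and the toy records as illustrative.

```python
from collections import Counter


def read_fasta_ids(lines):
    """Return the identifier (first token after '>') of every FASTA record."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            ids.append(header.split()[0] if header else "")
    return ids


def find_problems(ids):
    """Return identifiers that are empty or appear more than once."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1 or i == "")


# Toy records with made-up identifiers; the third header repeats the first.
example = """>protein_1 first toy record
MALLHSARVL
>protein_2 second toy record
MKTAYIAKQR
>protein_1 duplicate of the first header
MALLHSARVL
""".splitlines()

ids = read_fasta_ids(example)
problems = find_problems(ids)
print(len(ids), "records; problem identifiers:", problems)
```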

Option C: Use Your Own Embeddings

For advanced users with custom embeddings, save them as an HDF5 file where each protein is stored as a dataset named by its identifier.
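
A minimal sketch of that layout using h5py and NumPy is shown below. The identifiers, the 1024-dimensional random vectors, the float32 dtype, and the file name are placeholders; the sketch also assumes one fixed-length vector per protein.

```python
import h5py
import numpy as np

# Hypothetical identifiers, with random vectors standing in for real embeddings.
rng = np.random.default_rng(seed=0)
embeddings = {
    "protein_1": rng.normal(size=1024).astype(np.float32),
    "protein_2": rng.normal(size=1024).astype(np.float32),
}

# One dataset per protein, named by its identifier.
with h5py.File("my_embeddings.h5", "w") as h5:
    for protein_id, vector in embeddings.items():
        h5.create_dataset(protein_id, data=vector)

# Read the file back to confirm the layout.
with h5py.File("my_embeddings.h5", "r") as h5:
    for protein_id in h5.keys():
        print(protein_id, h5[protein_id].shape, h5[protein_id].dtype)
```

Note that HDF5 treats "/" in a name as a path separator, so an identifier containing a slash would be stored as a nested group rather than a top-level dataset.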

Step 2: Run the Notebook

Click the Colab badge above to open the notebook
Run the first cell to install dependencies (~1 minute)
Upload your .h5 embeddings file
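
Before uploading, it may be worth checking locally that the file contains what you expect. The sketch below counts the top-level datasets and collects their shapes; summarize_embeddings is a hypothetical helper, and the tiny demo file exists only so the example runs on its own.

```python
import h5py
import numpy as np


def summarize_embeddings(path):
    """Return (number of top-level datasets, sorted list of distinct shapes)."""
    shapes = set()
    count = 0
    with h5py.File(path, "r") as h5:
        for item in h5.values():
            if isinstance(item, h5py.Dataset):
                count += 1
                shapes.add(item.shape)
    return count, sorted(shapes)


# Tiny demo file; point `path` at your real embeddings file instead.
path = "demo_embeddings.h5"
with h5py.File(path, "w") as h5:
    h5.create_dataset("protein_1", data=np.zeros(8, dtype=np.float32))
    h5.create_dataset("protein_2", data=np.zeros(8, dtype=np.float32))

count, shapes = summarize_embeddings(path)
print(f"{count} embeddings, shapes: {shapes}")
if len(shapes) > 1:
    print("Warning: embeddings have different shapes")
```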

Step 3: Configure Options

Annotations

Choose which annotations to include from three sources:

Source	Examples
UniProt	protein_families, ec, go_bp, cc_subcellular_location, ...
InterPro	pfam, cath, panther, smart, superfamily, ...
Taxonomy	kingdom, phylum, class, order, family, genus, species

See the ProtSpace Python package for the complete list of available annotations per source.

TIP

The first time you select taxonomy annotations, the notebook downloads a taxonomy database, which takes about 1 minute.

Dimensionality Reduction

Choose which 2D projections to generate:

PCA - Fast, initial overview
UMAP - Best balance of speed and quality (recommended)
t-SNE - Great for clusters, slower on large datasets
PaCMAP - Alternative to t-SNE/UMAP
MDS - Preserves pairwise distances
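
To make concrete what a 2D projection does, here is a minimal PCA sketch in NumPy. It is not the notebook's implementation: it uses a plain singular value decomposition, and the random matrix only stands in for real embeddings.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T


rng = np.random.default_rng(seed=0)
X = rng.normal(size=(100, 1024))  # 100 toy "proteins", 1024 dimensions each
coords = pca_2d(X)
print(coords.shape)  # (100, 2)
```

Each row of coords is one point in the scatterplot. The nonlinear methods in the list produce the same kind of output but optimize different notions of neighborhood or distance preservation.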

Parameters (Optional)

Fine-tune settings for each method:

Method	Parameters
UMAP	N Neighbors, Min Dist
t-SNE	Perplexity, Learning Rate
PaCMAP	N Neighbors, MN Ratio, FP Ratio
MDS	N Init, Max Iter
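
Some of these parameters only make sense relative to the number of proteins in the dataset. The sketch below writes the settings as keyword names used by common Python implementations (umap-learn, scikit-learn, pacmap) and flags values that are too large for a small dataset. The values are illustrative starting points, the notebook's own defaults may differ, and check_params is a hypothetical helper.

```python
# Illustrative starting values; not necessarily the notebook's defaults.
PARAMS = {
    "umap": {"n_neighbors": 15, "min_dist": 0.1},
    "tsne": {"perplexity": 30, "learning_rate": 200.0},
    "pacmap": {"n_neighbors": 10, "MN_ratio": 0.5, "FP_ratio": 2.0},
    "mds": {"n_init": 4, "max_iter": 300},
}


def check_params(params, n_samples):
    """Return warnings for settings that are too large for n_samples points."""
    warnings = []
    if params["tsne"]["perplexity"] >= n_samples:
        warnings.append(
            "t-SNE perplexity must be smaller than the number of proteins"
        )
    for method in ("umap", "pacmap"):
        if params[method]["n_neighbors"] >= n_samples:
            warnings.append(
                f"{method} n_neighbors should be smaller than the number of proteins"
            )
    return warnings


# With only 20 proteins, a perplexity of 30 is flagged.
print(check_params(PARAMS, n_samples=20))
```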

Step 4: Generate and Download

Click "Generate Bundle"
Wait for processing (time depends on dataset size)
Download your .parquetbundle file

Step 5: Visualize in ProtSpace

Go to protspace.app/explore
Drag & drop your .parquetbundle file onto the scatterplot
Start exploring!

Tips

Start small: Test with a subset of proteins first.
PCA is fastest: The other methods slow down noticeably as datasets grow. MDS in particular operates on all pairwise distances, so its cost grows at least quadratically with the number of proteins.
Try multiple methods: For best results, include both PCA and UMAP.
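
One way to follow the "start small" tip is to copy a random subset of embeddings into a new HDF5 file and upload that first. The sketch below assumes the layout described earlier (one top-level dataset per protein); write_subset, the file names, and the toy source file are illustrative.

```python
import random

import h5py
import numpy as np


def write_subset(src_path, dst_path, n, seed=0):
    """Copy n randomly chosen top-level entries from src_path to dst_path."""
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        names = sorted(src.keys())
        chosen = random.Random(seed).sample(names, min(n, len(names)))
        for name in chosen:
            src.copy(name, dst)  # keeps the original dataset name
    return chosen


# Toy source file so the sketch runs on its own.
with h5py.File("full.h5", "w") as h5:
    for i in range(50):
        h5.create_dataset(f"protein_{i}", data=np.zeros(8, dtype=np.float32))

chosen = write_subset("full.h5", "subset.h5", n=10)
print(len(chosen), "embeddings written to subset.h5")
```

A fixed seed keeps the subset reproducible, which makes it easier to compare projections across runs.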

Alternative: Python CLI

For local processing, automation, or larger datasets, see the Python CLI guide.
