Using Google Colab
The easiest way to prepare your data for ProtSpace - no local installation required!
Overview
The Colab notebook converts protein embeddings into a visualization-ready .parquetbundle file:
- Reads your embeddings from an HDF5 file (.h5)
- Applies dimensionality reduction (PCA, UMAP, t-SNE, PaCMAP, MDS)
- Retrieves annotations from UniProt, InterPro, and NCBI Taxonomy
- Creates the .parquetbundle file ready for ProtSpace
Step 1: Get Protein Embeddings
You need an HDF5 file (.h5) containing protein embeddings. There are three ways to obtain this:
Option A: Download from UniProt (Recommended)
- Go to uniprot.org
- Search for proteins using UniProt query syntax (e.g., (ft_domain:phosphatase) AND (reviewed:true))
- Click Download → Select Format Embeddings → Submit job
- Download the results - check UniProt's Tools Dashboard for the prepared embedding file
Option B: Generate from FASTA
Use the dedicated embedding generation notebook:
This notebook:
- Takes a FASTA file as input
- Generates embeddings using various protein language models (ProtT5, ESM2, etc.)
- Outputs an HDF5 file ready for ProtSpace
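Before generating embeddings, it can help to check the FASTA file for duplicate identifiers, since each protein ends up stored under its identifier. The sketch below is a minimal pre-flight check, not part of the notebook. It assumes the identifier is the first whitespace-delimited token of each header line, which is a common convention but may differ from how the notebook parses headers; `header_ids` is a hypothetical helper name.

```python
from collections import Counter
import io


def header_ids(handle):
    """Return the first whitespace-delimited token of every FASTA header line."""
    ids = []
    for line in handle:
        if line.startswith(">"):
            tokens = line[1:].split()
            # Assumption: the first token after ">" is the protein identifier.
            ids.append(tokens[0] if tokens else "")
    return ids


# Tiny in-memory FASTA with one deliberately duplicated header.
example = io.StringIO(">P12345 example protein\nMKT\n>Q67890\nMSA\n>P12345\nMKV\n")
ids = header_ids(example)
duplicates = [name for name, n in Counter(ids).items() if n > 1]
print(len(ids), duplicates)  # 3 ['P12345']
```

For a real file, pass `open("proteins.fasta")` (a placeholder path) instead of the in-memory example.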
Option C: Use Your Own Embeddings
For advanced users with custom embeddings, save them as an HDF5 file where each protein is stored as a dataset named by its identifier.
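As a rough illustration of that layout, the sketch below writes one dataset per protein with h5py. The identifiers, the 1024-dimensional random vectors, the float32 dtype, and the helper name `save_embeddings` are all assumptions made for the example; use your own identifiers and whatever dimensionality your model produces.

```python
import os
import tempfile

import h5py
import numpy as np


def save_embeddings(path, embeddings):
    """Write one top-level HDF5 dataset per protein, named by its identifier."""
    with h5py.File(path, "w") as handle:
        for protein_id, vector in embeddings.items():
            # Note: a "/" in an identifier would create nested groups in HDF5.
            handle.create_dataset(protein_id, data=np.asarray(vector, dtype=np.float32))


# Hypothetical identifiers and random vectors, for illustration only.
rng = np.random.default_rng(0)
example = {
    "P12345": rng.normal(size=1024),
    "Q67890": rng.normal(size=1024),
}

out_path = os.path.join(tempfile.mkdtemp(), "my_embeddings.h5")
save_embeddings(out_path, example)
```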
Step 2: Run the Notebook
- Click the Colab badge above to open the notebook
- Run the first cell to install dependencies (~1 minute)
- Upload your .h5 embeddings file
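If you want to sanity-check a file before uploading it, a short h5py script can count the proteins and confirm that every embedding has the same shape. This is a sketch under the assumption that the file is flat (one top-level dataset per protein); `summarize_embeddings` is a hypothetical helper, and the demo file is created only so the example runs on its own.

```python
import os
import tempfile

import h5py
import numpy as np


def summarize_embeddings(path):
    """Count top-level datasets and collect the distinct shapes they have."""
    count = 0
    shapes = set()
    with h5py.File(path, "r") as handle:
        for item in handle.values():
            if isinstance(item, h5py.Dataset):
                count += 1
                shapes.add(item.shape)
    return count, shapes


# Demo on a throwaway file with hypothetical identifiers.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(demo_path, "w") as handle:
    handle.create_dataset("P12345", data=np.zeros(1024, dtype=np.float32))
    handle.create_dataset("Q67890", data=np.zeros(1024, dtype=np.float32))

count, shapes = summarize_embeddings(demo_path)
print(count, shapes)  # 2 {(1024,)}
```

More than one shape in the result usually means the file mixes embeddings from different models or mixes per-residue with per-protein vectors.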
Step 3: Configure Options
Annotations
Choose which annotations to include:
| Source | Available Annotations |
|---|---|
| UniProt | annotation_score, subcellular_location, protein_families, reviewed, etc. |
| InterPro | CATH, Pfam, signal_peptide, superfamily |
| Taxonomy | kingdom, phylum, class, order, family, genus, species |
TIP
The first time you select taxonomy annotations, the notebook downloads a taxonomy database (~1 minute).
Dimensionality Reduction
Choose which 2D projections to generate:
- PCA - Fast, initial overview
- UMAP - Best balance of speed and quality (recommended)
- t-SNE - Great for clusters, slower on large datasets
- PaCMAP - Alternative to t-SNE/UMAP
- MDS - Preserves pairwise distances
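To make "2D projection" concrete, here is a minimal PCA sketch in plain NumPy: center the embeddings, take the top two principal axes from an SVD, and project onto them. It illustrates the idea only and is not the notebook's implementation; the random input and the function name `pca_2d` are assumptions.

```python
import numpy as np


def pca_2d(embeddings):
    """Project row vectors onto their first two principal components."""
    data = np.asarray(embeddings, dtype=np.float64)
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# 50 hypothetical proteins with 1024-dimensional embeddings.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(50, 1024))
coords = pca_2d(vectors)
print(coords.shape)  # (50, 2)
```

Each row of `coords` is one protein's position in the scatterplot. The other methods produce the same kind of output but optimize different notions of neighborhood or distance.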
Parameters (Optional)
Fine-tune settings for each method:
| Method | Parameters |
|---|---|
| UMAP | N Neighbors, Min Dist |
| t-SNE | Perplexity, Learning Rate |
| PaCMAP | N Neighbors, MN Ratio, FP Ratio |
| MDS | N Init, Max Iter |
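For orientation, the sketch below shows how parameters with these names look in scikit-learn's t-SNE and MDS estimators. Whether the notebook uses scikit-learn internally, and how its form fields map to these arguments, is an assumption here; the values are arbitrary and the input is random data.

```python
import numpy as np
from sklearn.manifold import MDS, TSNE

# 40 hypothetical proteins with 64-dimensional embeddings.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(40, 64))

# Perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10.0, learning_rate=200.0, random_state=0)
tsne_coords = tsne.fit_transform(vectors)

# n_init: number of random restarts; max_iter: iteration cap per restart.
mds = MDS(n_components=2, n_init=2, max_iter=100, random_state=0)
mds_coords = mds.fit_transform(vectors)

print(tsne_coords.shape, mds_coords.shape)
```

The umap-learn and pacmap packages expose similarly named arguments (`n_neighbors` and `min_dist` for UMAP; `n_neighbors`, `MN_ratio`, and `FP_ratio` for PaCMAP), so the same pattern would apply there if those are the libraries in use.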
Step 4: Generate and Download
- Click "Generate Bundle"
- Wait for processing (time depends on dataset size)
- Download your .parquetbundle file
Step 5: Visualize in ProtSpace
- Go to protspace.app/explore
- Drag & drop your .parquetbundle file onto the scatterplot
- Start exploring!
Tips
- Start small: Test with a subset of proteins first.
- PCA is fastest: The other methods become noticeably slower as datasets grow. MDS scales worst because it works from all pairwise distances (at least quadratic in the number of proteins).
- Try multiple methods: For best results, include both PCA and UMAP.
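One way to start small is to copy a random subset of proteins into a new HDF5 file and upload that first. The sketch below assumes a flat file with one top-level dataset per protein; `write_subset`, the file names, and the demo data are hypothetical.

```python
import os
import random
import tempfile

import h5py
import numpy as np


def write_subset(source_path, target_path, n_proteins, seed=0):
    """Copy a reproducible random sample of top-level datasets to a new file."""
    with h5py.File(source_path, "r") as source, h5py.File(target_path, "w") as target:
        names = sorted(source.keys())
        chosen = random.Random(seed).sample(names, min(n_proteins, len(names)))
        for name in chosen:
            # [()] reads the whole dataset into memory as a NumPy array.
            target.create_dataset(name, data=source[name][()])
    return len(chosen)


# Demo: build a 10-protein file, then keep 3 of them.
workdir = tempfile.mkdtemp()
full_path = os.path.join(workdir, "full.h5")
subset_path = os.path.join(workdir, "subset.h5")
with h5py.File(full_path, "w") as handle:
    for i in range(10):
        handle.create_dataset(f"PROT{i:02d}", data=np.full(8, i, dtype=np.float32))

written = write_subset(full_path, subset_path, n_proteins=3)
print(written)  # 3
```

Fixing the seed keeps the subset stable between runs, which makes it easier to compare projections while tuning parameters.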
Alternative: Python CLI
For local processing, automation, or larger datasets, see the Python CLI guide.