
Using Python CLI

For local processing or automation, use the ProtSpace Python package.

Installation

bash
pip install protspace

Quick Start

From a UniProt Query

bash
protspace prepare -q "(ft_domain:kinase) AND (reviewed:true)" -m pca2,umap2

From Local Embeddings

bash
protspace prepare -i embeddings.h5 -m pca2,umap2

From a FASTA File

bash
protspace prepare -i sequences.fasta -e prot_t5 -m pca2,umap2

Parameters

| Parameter | Description |
| --- | --- |
| -q | UniProt query string |
| -i | Input file(s): HDF5 or FASTA (use -i f.h5:name to override model name) |
| -o | Output directory (default: .) |
| -m | Projection methods (comma-separated or repeatable) |
| -a | Annotations: group names, individual names, or CSV path |
| -e | Embedder model shortcut (for FASTA input) |
| -s | Compute sequence similarity via MMseqs2 |
| -v | Verbosity (-v = INFO, -vv = DEBUG) |
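
For automation from Python, the flags in the table can be assembled into an argument list and handed to subprocess. The sketch below is a minimal, hypothetical helper (build_prepare_command is not part of ProtSpace). It uses only the flags listed above and assumes -q and -i are alternatives, which this page suggests but does not state.

```python
import shlex


def build_prepare_command(input_path=None, query=None, methods=("pca2",),
                          annotations=(), embedder=None, output_dir=None):
    """Assemble a `protspace prepare` argument list (hypothetical helper)."""
    # Assumption: exactly one data source, either a file (-i) or a query (-q).
    if (input_path is None) == (query is None):
        raise ValueError("pass exactly one of input_path (-i) or query (-q)")

    argv = ["protspace", "prepare"]
    if query is not None:
        argv += ["-q", query]
    else:
        argv += ["-i", str(input_path)]
    if embedder:
        argv += ["-e", embedder]           # only meaningful for FASTA input
    argv += ["-m", ",".join(methods)]      # comma-separated projection methods
    for annotation in annotations:
        argv += ["-a", annotation]         # -a is repeatable
    if output_dir:
        argv += ["-o", str(output_dir)]
    return argv


argv = build_prepare_command(
    input_path="sequences.fasta",
    embedder="prot_t5",
    methods=["pca2", "umap2"],
    annotations=["default"],
    output_dir="out",
)
print(shlex.join(argv))
# protspace prepare -i sequences.fasta -e prot_t5 -m pca2,umap2 -a default -o out
```

Passing argv to subprocess.run(argv, check=True) would execute the command, provided the protspace executable is on PATH.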

Annotations

Specify annotations with -a:

bash
# Use a predefined group
-a default        # EC, keyword, length, protein_families, reviewed
-a all            # Everything from all sources

# Pick individual sources
-a uniprot -a interpro -a taxonomy -a ted -a biocentral

# Or pick individual annotation names
-a protein_families,reviewed,pfam,genus,species

# Or provide a CSV/TSV file
-a annotations.csv

Available annotation groups:

| Group | Source annotations |
| --- | --- |
| default | EC, keyword, length, protein_families, reviewed |
| uniprot | Gene name, EC, GO terms, subcellular location, length, and more |
| interpro | Pfam, CATH, SMART, CDD, Panther, Superfamily, and more |
| taxonomy | Kingdom, phylum, class, order, family, genus, species |
| ted | AlphaFold TED domain annotations |
| biocentral | Predicted membrane, signal peptide, transmembrane, subcellular location |
| all | All of the above |

CSV format:

csv
identifier,taxonomy,family,function
P12345,Bacteria,Kinase,ATP binding
P67890,Archaea,Phosphatase,Hydrolase
Q54321,Eukaryota,Kinase,Transferase

The identifier column must match protein IDs in your embeddings file.
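
A custom annotation file in this layout can be generated and sanity-checked with Python's standard csv module. The sketch below is illustrative only: the helper names are hypothetical, and embedding_ids stands in for the protein IDs you would read from your own embeddings file (how they are stored there is not covered on this page).

```python
import csv
import io


def write_annotations(rows, fieldnames, handle):
    """Write annotation rows with a header line to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


def missing_identifiers(csv_text, embedding_ids):
    """Return embedding IDs that have no row in the annotation CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if "identifier" not in (reader.fieldnames or []):
        raise ValueError("annotation CSV needs an 'identifier' column")
    annotated = {row["identifier"] for row in reader}
    return sorted(set(embedding_ids) - annotated)


fieldnames = ["identifier", "taxonomy", "family", "function"]
rows = [
    {"identifier": "P12345", "taxonomy": "Bacteria", "family": "Kinase", "function": "ATP binding"},
    {"identifier": "P67890", "taxonomy": "Archaea", "family": "Phosphatase", "function": "Hydrolase"},
    {"identifier": "Q54321", "taxonomy": "Eukaryota", "family": "Kinase", "function": "Transferase"},
]

buffer = io.StringIO()  # use open("annotations.csv", "w", newline="") for a real file
write_annotations(rows, fieldnames, buffer)

# Hypothetical ID list; A00001 has no annotation row and is reported.
embedding_ids = ["P12345", "P67890", "Q54321", "A00001"]
print(missing_identifiers(buffer.getvalue(), embedding_ids))
```

Running such a check before protspace prepare -a annotations.csv makes ID mismatches visible early.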

Projection Methods

Methods require a dimension suffix: 2 for 2D, 3 for 3D.

Dimension Suffix Required

Specify pca2 or pca3, not pca alone — the dimension suffix is mandatory.

| Method | 2D | 3D | Description |
| --- | --- | --- | --- |
| PCA | pca2 | pca3 | Principal Component Analysis |
| UMAP | umap2 | umap3 | Uniform Manifold Approximation and Projection |
| t-SNE | tsne2 | tsne3 | t-distributed Stochastic Neighbor Embedding |
| PaCMAP | pacmap2 | pacmap3 | Pairwise Controlled Manifold Approximation |
| MDS | mds2 | mds3 | Multidimensional Scaling |
| LocalMAP | localmap2 | localmap3 | Local-first alternative to PaCMAP |
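
When method names come from a config file or user input, the suffix rule can be checked before invoking the CLI. The following is a minimal sketch of such a check, not ProtSpace's own validation. KNOWN_METHODS simply mirrors the table above, and split_method is a hypothetical helper.

```python
import re

# Method stems taken from the table above.
KNOWN_METHODS = {"pca", "umap", "tsne", "pacmap", "mds", "localmap"}

# A lowercase stem followed by a single 2 or 3.
_METHOD_RE = re.compile(r"^([a-z]+)([23])$")


def split_method(name):
    """Split a name such as 'umap2' into ('umap', 2); reject anything else."""
    match = _METHOD_RE.match(name)
    if match is None or match.group(1) not in KNOWN_METHODS:
        raise ValueError(
            f"{name!r} is not a valid method; expected e.g. 'pca2' or 'pca3'"
        )
    return match.group(1), int(match.group(2))


print(split_method("umap2"))  # ('umap', 2)
print(split_method("tsne3"))  # ('tsne', 3)

try:
    split_method("pca")       # missing dimension suffix
except ValueError as error:
    print(error)
```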

You can customize parameters inline:

bash
-m "umap2:n_neighbors=50;min_dist=0.1" -m "tsne2:perplexity=50"
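
To make the inline syntax concrete: the method name comes first, a colon separates it from the parameters, and parameters are key=value pairs joined by semicolons. The sketch below decomposes such a string in Python. It is an illustration of the syntax shown above, not ProtSpace's actual parser; parse_method_spec and the numeric coercion are assumptions made for this example.

```python
def _coerce(raw):
    """Turn '50' into 50 and '0.1' into 0.1; leave other text unchanged."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def parse_method_spec(spec):
    """Split 'name:key=value;key=value' into (name, params dict)."""
    name, _, param_text = spec.partition(":")
    params = {}
    # filter(None, ...) drops the empty piece produced when there are no parameters.
    for item in filter(None, param_text.split(";")):
        key, separator, raw = item.partition("=")
        if not separator:
            raise ValueError(f"expected key=value, got {item!r}")
        params[key.strip()] = _coerce(raw.strip())
    return name, params


print(parse_method_spec("umap2:n_neighbors=50;min_dist=0.1"))
# ('umap2', {'n_neighbors': 50, 'min_dist': 0.1})
print(parse_method_spec("pca2"))
# ('pca2', {})
```

On the command line, quote each specification as in the example above so the shell does not treat the semicolon as a command separator.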

TIP

ProtSpace is optimized for 2D visualization — prefer *2 methods over *3.

Embedder Models

When using a FASTA file as input, specify an embedder with -e to generate embeddings via the Biocentral API:

bash
protspace prepare -i sequences.fasta -e prot_t5 -m pca2,umap2

Available models: prot_t5, prost_t5, esm2_8m, esm2_35m, esm2_150m, esm2_650m, esm2_3b, ankh_base, ankh_large, ankh3_large, esmc_300m, esmc_600m

More Info

The protspace prepare command is the recommended all-in-one pipeline. For advanced workflows, the CLI also provides individual subcommands: embed, project, annotate, bundle, serve, and style.

Full CLI reference and advanced usage: ProtSpace Python GitHub

Released under the Apache 2.0 License.