Data Format Reference

ProtSpace stores all visualization data in a single .parquetbundle file. This page describes the file's structure for users who want to understand the format.

What is a .parquetbundle?

A .parquetbundle is a single file containing three Parquet tables, separated by a delimiter:

.parquetbundle file
├── selected_annotations.parquet # Protein metadata and annotations
├── ---PARQUET_DELIMITER---      # Separator
├── projections_metadata.parquet # Projection method information
├── ---PARQUET_DELIMITER---      # Separator
└── projections_data.parquet     # 2D/3D coordinates

This bundled format allows efficient loading in the browser while keeping everything in one convenient file.
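The layout above can be read back with a few lines of Python. This is a minimal sketch, not ProtSpace's own loader: it assumes the delimiter is the literal ASCII byte string shown in the diagram with no surrounding newlines, and the names `split_bundle` and `load_bundle` are hypothetical helpers introduced here for illustration.

```python
import io

# Assumption: the separator is exactly these ASCII bytes, as shown in the layout.
DELIMITER = b"---PARQUET_DELIMITER---"

# Table order as listed in the layout diagram.
TABLE_NAMES = ("selected_annotations", "projections_metadata", "projections_data")


def split_bundle(raw: bytes) -> dict[str, bytes]:
    """Split the raw bytes of a bundle into one blob per table."""
    parts = raw.split(DELIMITER)
    if len(parts) != len(TABLE_NAMES):
        raise ValueError(
            f"expected {len(TABLE_NAMES)} tables, found {len(parts)} parts"
        )
    return dict(zip(TABLE_NAMES, parts))


def load_bundle(path: str) -> dict:
    """Read each table into a DataFrame (requires pandas and pyarrow)."""
    import pandas as pd  # imported lazily so split_bundle works without it

    with open(path, "rb") as fh:
        blobs = split_bundle(fh.read())
    return {name: pd.read_parquet(io.BytesIO(blob)) for name, blob in blobs.items()}
```

Splitting on the raw byte string is the simplest approach, but it would fail if the delimiter bytes happened to occur inside a table's data, which is why the sketch checks the part count before continuing.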

Tables

1. Annotations Table

Contains metadata and biological annotations for each protein.

Column	Type	Description
`identifier`	string	Protein ID (e.g., P12345)
others	string/number	Any biological annotations

The columns gene_name, protein_name, and uniprot_kb_id are tooltip-only: they are shown on hover but excluded from the annotation dropdown.

2. Projections Metadata

Column	Type	Description
`projection_name`	string	Method name (e.g., `PCA_2`)
`dimensions`	integer	2 or 3
`info_json`	json	Method parameters and settings

3. Projections Data

Column	Type	Description
`projection_name`	string	Method name (e.g., `PCA_2`)
`identifier`	string	Protein ID
`x`	float	X coordinate
`y`	float	Y coordinate
`z`	float	Z coordinate (null for 2D)
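Because the table is in long form (one row per protein per projection), a consumer typically filters by `projection_name` before plotting. The sketch below shows that step on plain dictionaries. The helper `coordinates_for` and the sample rows are illustrative assumptions, not part of ProtSpace.

```python
def coordinates_for(rows, projection_name):
    """Map identifier -> (x, y) or (x, y, z) for one projection."""
    coords = {}
    for row in rows:
        if row["projection_name"] != projection_name:
            continue
        point = (row["x"], row["y"])
        # z is null for 2D projections, so only append it when present.
        if row.get("z") is not None:
            point += (row["z"],)
        coords[row["identifier"]] = point
    return coords


# Made-up sample rows, shaped like the columns in the table above.
sample_rows = [
    {"projection_name": "PCA_2", "identifier": "P12345", "x": 0.5, "y": -1.0, "z": None},
    {"projection_name": "PCA_3", "identifier": "P12345", "x": 0.5, "y": -1.0, "z": 2.0},
]
```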

Annotation Types

Categorical: Text values (taxonomy, family). Distinct colors per category.
Multi-Label: Semicolon-separated values (e.g., EC:1.1.1;EC:2.1.1). Displayed as pie charts.
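To make the multi-label convention concrete, a value can be split into its individual labels as follows. This is a sketch under stated assumptions: `split_multilabel` is a hypothetical helper, and trimming whitespace and dropping empty parts are choices made here rather than documented ProtSpace behavior.

```python
def split_multilabel(value: str) -> list[str]:
    """Split a semicolon-separated annotation value into its labels."""
    # Assumption: surrounding whitespace is not significant and empty parts are ignored.
    return [part.strip() for part in value.split(";") if part.strip()]
```

A categorical value without a semicolon comes back as a one-element list, so the same helper can handle both annotation types.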

Missing Values

Proteins with missing, empty, or whitespace-only annotation values are displayed as N/A in the legend and tooltip. N/A items receive a dedicated color (#DDDDDD) and can be toggled, isolated, or reordered in the legend like any other category.
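The rule above could be expressed roughly like this when preparing data. It is an illustrative sketch: `display_label` is a hypothetical helper, and treating `None` and float NaN as "missing" is an assumption about how nulls arrive from a Parquet reader.

```python
import math

NA_LABEL = "N/A"
NA_COLOR = "#DDDDDD"  # dedicated color for N/A items, as stated above


def display_label(value) -> str:
    """Return the label shown for an annotation value, or N/A if it is missing."""
    if value is None:
        return NA_LABEL
    # Assumption: null cells may surface as float NaN (e.g. via pandas).
    if isinstance(value, float) and math.isnan(value):
        return NA_LABEL
    text = str(value)
    # Empty and whitespace-only strings also count as missing.
    return text if text.strip() else NA_LABEL
```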

Scored Annotations

Annotation values can include a numeric score after a pipe character:

Single score: PF00001|1.5e-10
Multiple scores: PF00001|1.5e-10,2.3e-5

Scores are displayed in the protein tooltip when hovering over a point. This is commonly used for InterPro domain E-values.
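A scored value can be separated into its label and scores with a small parser. The following is a minimal sketch, assuming the part after the pipe is a comma-separated list of numbers. `parse_scored` is a hypothetical name, not a ProtSpace function.

```python
def parse_scored(value: str) -> tuple[str, list[float]]:
    """Split 'LABEL|s1,s2,...' into (label, [scores]); no pipe means no scores."""
    label, sep, score_text = value.partition("|")
    if not sep:
        return label, []
    # float() raises ValueError if the suffix is not numeric.
    scores = [float(s) for s in score_text.split(",") if s.strip()]
    return label, scores
```

Because evidence-coded values use the same pipe syntax with a non-numeric suffix, a caller would need to catch the `ValueError` or check the suffix before using this helper.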

Evidence-Coded Annotations

Annotation values can include an ECO evidence code after a pipe character:

Cytoplasm|EXP (experimental evidence)
apoptotic process|IDA (inferred from direct assay)

Recognized evidence codes: EXP, HDA, IDA, TAS, NAS, IC, ISS, SAM, COMB, IMP, IEA

Evidence codes are displayed in the protein tooltip alongside the annotation value.

Creating Files

Use the Google Colab notebook or Python CLI to generate .parquetbundle files.
