Skip to content

Data Format Reference

ProtSpace uses .parquetbundle files - a single file containing all visualization data. This page explains the structure for users who want to understand the file format.

What is a .parquetbundle?

A .parquetbundle is a single file containing three Parquet tables bundled together:

.parquetbundle file
├── selected_annotations.parquet # Protein metadata and annotations
├── ---PARQUET_DELIMITER---      # Separator
├── projections_metadata.parquet # Projection method information
├── ---PARQUET_DELIMITER---      # Separator
└── projections_data.parquet     # 2D/3D coordinates

This bundled format allows efficient loading in the browser while keeping everything in one convenient file.

Tables

1. Annotations Table

Contains metadata and biological annotations for each protein.

ColumnTypeDescription
identifierstringProtein ID (e.g., P12345)
othersstring/numberAny biological annotations

The columns gene_name, protein_name, and uniprot_kb_id are tooltip-only — shown on hover but excluded from the annotation dropdown.

2. Projections Metadata

ColumnTypeDescription
projection_namestringMethod name (e.g., PCA_2)
dimensionsinteger2 or 3
info_jsonjsonMethod parameters and settings

3. Projections Data

ColumnTypeDescription
projection_namestringMethod name (e.g., PCA_2)
identifierstringProtein ID
xfloatX coordinate
yfloatY coordinate
zfloatZ coordinate (null for 2D)

Annotation Types

  • Categorical: Text values (taxonomy, family). Distinct colors per category.
  • Multi-Label: Semicolon-separated values (e.g., EC:1.1.1;EC:2.1.1). Displayed as pie charts.

Missing Values

Proteins with missing, empty, or whitespace-only annotation values are displayed as N/A in the legend and tooltip. N/A items receive a dedicated color (#DDDDDD) and can be toggled, isolated, or reordered in the legend like any other category.

Scored Annotations

Annotation values can include a numeric score after a pipe character:

  • Single score: PF00001|1.5e-10
  • Multiple scores: PF00001|1.5e-10,2.3e-5

Scores are displayed in the protein tooltip when hovering over a point. This is commonly used for InterPro domain E-values.

Evidence-Coded Annotations

Annotation values can include an ECO evidence code after a pipe character:

  • Cytoplasm|EXP (experimental evidence)
  • apoptotic process|IDA (inferred from direct assay)

Recognized evidence codes: EXP, HDA, IDA, TAS, NAS, IC, ISS, SAM, COMB, IMP, IEA

Evidence codes are displayed in the protein tooltip alongside the annotation value.

Creating Files

Use the Google Colab notebook or Python CLI to generate .parquetbundle files.

Released under the Apache 2.0 License.