Data Format Reference

ProtSpace uses .parquetbundle files: each bundle is a single file containing all visualization data. This page explains the structure for users who want to understand the file format.

What is a .parquetbundle?

A .parquetbundle is a single file containing three Parquet tables bundled together, with an optional settings section:

.parquetbundle file
├── selected_annotations.parquet  # Protein metadata and annotations
├── ---PARQUET_DELIMITER---       # Separator
├── projections_metadata.parquet  # Projection method information
├── ---PARQUET_DELIMITER---       # Separator
├── projections_data.parquet      # 2D/3D coordinates
├── ---PARQUET_DELIMITER---       # Optional separator
└── settings.parquet              # Optional: one-row Parquet table with settings_json

This bundled format allows efficient loading in the browser while keeping everything in one convenient file.
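The layout above can be illustrated with a minimal Python sketch that splits the raw bytes of a bundle into its parts. It assumes the separator is the exact byte string ---PARQUET_DELIMITER--- with no surrounding whitespace and that the parts always appear in the order shown; the names split_bundle and PART_NAMES are illustrative and not part of ProtSpace.

```python
DELIMITER = b"---PARQUET_DELIMITER---"  # assumed to appear verbatim between parts

# Part order taken from the layout above; the fourth part is optional.
PART_NAMES = (
    "selected_annotations",
    "projections_metadata",
    "projections_data",
    "settings",
)


def split_bundle(raw: bytes) -> dict[str, bytes]:
    """Split raw bundle bytes into named Parquet payloads (hypothetical helper)."""
    parts = raw.split(DELIMITER)
    if not 3 <= len(parts) <= 4:
        raise ValueError(f"expected 3 or 4 parts, found {len(parts)}")
    # zip stops at the shorter sequence, so "settings" is only present
    # when the bundle actually contains a fourth part.
    return dict(zip(PART_NAMES, parts))
```

Each returned payload is the byte content of one Parquet table, so it could then be handed to a Parquet reader, for example pandas.read_parquet(io.BytesIO(payload)).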

The optional settings section is stored as settings.parquet, a one-row Parquet table with a settings_json column. It stores legend customizations (colors, shapes, ordering, visibility, palette, numeric binning settings) and export options (image dimensions, legend sizing) per annotation. When present, these settings are applied automatically on load so the visualization renders exactly as it was exported.

Tables

1. Annotations Table

Contains metadata and biological annotations for each protein.

Column      Type           Description
identifier  string         Protein ID (e.g., P12345)
others      string/number  Any biological annotations

The columns gene_name, protein_name, and uniprot_kb_id are tooltip-only: they are shown on hover but excluded from the annotation dropdown.

2. Projections Metadata

Column           Type     Description
projection_name  string   Method name (e.g., PCA_2)
dimensions       integer  2 or 3
info_json        json     Method parameters and settings

3. Projections Data

Column           Type    Description
projection_name  string  Method name (e.g., PCA_2)
identifier       string  Protein ID
x                float   X coordinate
y                float   Y coordinate
z                float   Z coordinate (null for 2D)
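To show how the tables relate, the sketch below joins the coordinates of one projection to the annotations table through the shared identifier column. Rows are represented as plain dictionaries for illustration. The function name points_for_projection, the family column, and the identifier Q00001 are assumptions made for this example, not ProtSpace names or data.

```python
def points_for_projection(projection_rows, annotation_rows, projection_name):
    """Attach annotation columns to the coordinates of one projection."""
    # Index annotations by protein identifier for constant-time lookup.
    annotations = {row["identifier"]: row for row in annotation_rows}
    points = []
    for row in projection_rows:
        if row["projection_name"] != projection_name:
            continue
        extra = annotations.get(row["identifier"], {})
        # Coordinate fields are written last so they are never overwritten.
        points.append({**extra, **row})
    return points


# Illustrative rows; "family" and "Q00001" are made up for the example.
projection_rows = [
    {"projection_name": "PCA_2", "identifier": "P12345", "x": 0.1, "y": 0.2, "z": None},
    {"projection_name": "PCA_2", "identifier": "Q00001", "x": 0.3, "y": 0.4, "z": None},
    {"projection_name": "PCA_3", "identifier": "P12345", "x": 1.0, "y": 2.0, "z": 3.0},
]
annotation_rows = [{"identifier": "P12345", "family": "kinase"}]

pca2_points = points_for_projection(projection_rows, annotation_rows, "PCA_2")
```

In this sketch a protein without an annotations row still yields a point; it simply carries no annotation columns.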

Annotation Types

ProtSpace distinguishes three practical annotation shapes:

  • Categorical: plain text values such as taxonomy or family. These get discrete legend entries.
  • Numeric: scalar numeric values such as length. These stay numeric in the file and are binned in the browser at runtime.
  • Multi-Label: semicolon-separated values such as EC:1.1.1;EC:2.1.1. These are displayed as pie charts.
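As a rough illustration of the multi-label shape, a semicolon-separated value can be broken into its individual labels as follows. The helper name split_labels is hypothetical, and trimming whitespace and dropping empty fragments are assumptions rather than documented ProtSpace behavior.

```python
def split_labels(value):
    """Split a semicolon-separated annotation value into individual labels."""
    return [part.strip() for part in value.split(";") if part.strip()]


labels = split_labels("EC:1.1.1;EC:2.1.1")  # ['EC:1.1.1', 'EC:2.1.1']
```

A plain categorical value such as a family name contains no semicolon and therefore yields a single label.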

Numeric Annotations

A column is treated as numeric when every non-empty value is a single finite scalar number. This includes true numeric source values and dense or continuous numeric-looking strings. Sparse or small integer-like string columns stay categorical by default so identifier-style code fields are not silently reclassified.

Numeric detection does not apply to:

  • sparse integer-like string labels such as cluster or code identifiers
  • semicolon-separated multi-value fields
  • pipe-coded score/evidence fields such as PF00001|1.5e-10
  • mixed-format columns
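The core rule (every non-empty value is a single finite scalar number) can be sketched in Python as below. This is only an approximation: it does not implement the heuristic that keeps sparse or small integer-like string columns categorical, because the exact thresholds are not given here. The function names are hypothetical, and treating a column with no non-empty values as non-numeric is an assumption of this sketch.

```python
import math


def is_plain_number(text):
    """True if text parses as one finite number (no ';' or '|' payloads)."""
    try:
        return math.isfinite(float(text))
    except ValueError:
        return False


def looks_numeric(values):
    """Apply the 'every non-empty value is a finite scalar' rule to a column."""
    non_empty = [str(v).strip() for v in values if v is not None and str(v).strip()]
    # Assumption: a column with no usable values is not reported as numeric.
    return bool(non_empty) and all(is_plain_number(v) for v in non_empty)
```

With this sketch, looks_numeric(["1.5", "", "2e-3"]) is true, while semicolon-separated values, pipe-coded values such as PF00001|1.5e-10, and mixed-format columns are rejected because float() cannot parse them.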

For numeric annotations:

  • raw numeric values are stored and exported as numbers
  • legend bins are generated client-side from the raw values plus the saved numeric settings
  • the selected distribution can be linear, quantile, or logarithmic
  • numeric palettes are sequential gradients, not categorical swatches
  • the gradient direction can also be reversed and is persisted as part of the numeric settings
  • unsupported numeric palette IDs are normalized to cividis on import/load
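To make the three distributions concrete, here is a minimal Python sketch that derives bin cut points from raw values. It is not ProtSpace's client-side implementation: the function name bin_edges, the choice of base-10 logarithms, and the inclusive quantile method are all assumptions made for illustration.

```python
import math
import statistics


def bin_edges(values, n_bins, distribution="linear"):
    """Return ascending, de-duplicated cut points for up to n_bins bins."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    data = sorted(v for v in values if v is not None)
    if not data:
        return []
    lo, hi = data[0], data[-1]
    if distribution == "linear":
        edges = [lo + (hi - lo) * i / n_bins for i in range(n_bins + 1)]
    elif distribution == "logarithmic":
        if lo <= 0:
            raise ValueError("logarithmic bins need strictly positive values")
        log_lo, log_hi = math.log10(lo), math.log10(hi)
        edges = [10 ** (log_lo + (log_hi - log_lo) * i / n_bins)
                 for i in range(n_bins + 1)]
    elif distribution == "quantile":
        inner = []
        if n_bins > 1 and len(data) > 1:
            inner = statistics.quantiles(data, n=n_bins, method="inclusive")
        edges = [lo, *inner, hi]
    else:
        raise ValueError(f"unknown distribution: {distribution}")
    # Identical cut points collapse, so fewer bins than requested may remain.
    return sorted(set(edges))
```

For example, bin_edges([1, 1, 1, 1, 5], 4, "quantile") yields only two cut points, because most values are identical and the requested four bins collapse into one.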

Numeric Edge Cases

Numeric binning is data-driven, so the realized number of bins can be lower than Max legend items.

Examples:

  • Linear or logarithmic intervals can be empty and therefore disappear from the legend.
  • Quantile cut points can collapse when many proteins share the same value.
  • Constant numeric columns produce a single bin.
  • All-null numeric columns produce zero bins.
  • Very narrow decimal ranges can require extra precision in the displayed labels.

Numeric legend labels are summaries of the observed values in each realized bin. They are meant for readability, not as the exact bin-membership rule.
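The edge cases above can be reproduced with a small linear-binning sketch in Python. It is an illustration under simplifying assumptions, not the actual ProtSpace algorithm: realized_bins is a hypothetical name, and each bin is summarized as (observed minimum, observed maximum, count) to mirror the idea that labels describe observed values rather than the membership rule.

```python
def realized_bins(values, max_items):
    """Linear binning that keeps only non-empty bins.

    Returns one (observed_min, observed_max, count) tuple per realized bin.
    """
    data = [v for v in values if v is not None]
    if not data:
        return []  # all-null column: zero bins
    lo, hi = min(data), max(data)
    if lo == hi:
        return [(lo, hi, len(data))]  # constant column: a single bin
    width = (hi - lo) / max_items
    buckets = {}
    for v in data:
        # Clamp so that the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), max_items - 1)
        buckets.setdefault(idx, []).append(v)
    # Empty intervals never enter `buckets`, so they vanish from the result.
    return [(min(b), max(b), len(b)) for _, b in sorted(buckets.items())]


bins = realized_bins([1, 2, 9, 10], 5)  # [(1, 2, 2), (9, 10, 2)]
```

Here five bins were requested, but only two are realized because the three middle intervals contain no values.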

Missing Values

Proteins with missing, empty, or whitespace-only annotation values are displayed as N/A in the legend and tooltip. N/A items receive a dedicated color (#DDDDDD) and can be toggled, isolated, or reordered in the legend like any other category.
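A minimal sketch of this normalization in Python, assuming that missing means None or a floating-point NaN (the source does not spell out the exact representation); the names NA_LABEL, NA_COLOR, and legend_label are illustrative.

```python
import math

NA_LABEL = "N/A"
NA_COLOR = "#DDDDDD"  # dedicated color for N/A items, as stated above


def legend_label(value):
    """Map missing, empty, or whitespace-only values to the N/A label."""
    if value is None:
        return NA_LABEL
    if isinstance(value, float) and math.isnan(value):
        return NA_LABEL  # assumption: NaN counts as missing
    text = str(value)
    return text if text.strip() else NA_LABEL
```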

Scored Annotations

Annotation values can include a numeric score after a pipe character:

  • Single score: PF00001|1.5e-10
  • Multiple scores: PF00001|1.5e-10,2.3e-5

Scores are displayed in the protein tooltip when hovering over a point. This is commonly used for InterPro domain E-values.
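The scored format can be parsed with a few lines of Python. This sketch assumes the part after the pipe consists only of comma-separated numbers; parse_scored is a hypothetical helper, and it raises ValueError for any other suffix.

```python
def parse_scored(value):
    """Split 'LABEL|s1,s2,...' into the label and a list of float scores."""
    label, sep, tail = value.partition("|")
    if not sep:
        return label, []  # no pipe: plain annotation without scores
    return label, [float(score) for score in tail.split(",")]


label, scores = parse_scored("PF00001|1.5e-10,2.3e-5")
# label == "PF00001", scores == [1.5e-10, 2.3e-05]
```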

Evidence-Coded Annotations

Annotation values can include an ECO evidence code after a pipe character:

  • Cytoplasm|EXP (experimental evidence)
  • apoptotic process|IDA (inferred from direct assay)

Evidence codes are recognized by pattern: any code of 2 to 5 uppercase letters (e.g., EXP, IDA, IPI, IGI, IEP, COMB) or a raw ECO identifier (e.g., ECO:0000269). This covers all standard GO evidence codes and ECO ontology IDs.

Evidence codes are displayed in the protein tooltip alongside the annotation value.
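The recognition pattern described above could be expressed as a regular expression along these lines. The exact expression ProtSpace uses is not shown here, so EVIDENCE_RE and parse_evidence are assumptions, including the choice to accept any number of digits after ECO:.

```python
import re

# 2 to 5 uppercase letters, or a raw ECO identifier such as ECO:0000269.
EVIDENCE_RE = re.compile(r"[A-Z]{2,5}|ECO:\d+")


def parse_evidence(value):
    """Return (annotation, evidence_code), or (value, None) if no code is found."""
    label, sep, tail = value.rpartition("|")
    if sep and EVIDENCE_RE.fullmatch(tail):
        return label, tail
    return value, None
```

With this sketch, parse_evidence("Cytoplasm|EXP") returns ("Cytoplasm", "EXP"), while a numeric suffix such as the one in PF00001|1.5e-10 does not match the pattern and the value is returned unchanged.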

Creating Files

Use the Google Colab notebook or Python CLI to generate .parquetbundle files.

Export And Import Notes

Numeric annotations round-trip differently from categorical annotations:

  • the bundle stores the raw numeric column, not precomputed bin labels
  • the exported settings remember the numeric palette, gradient direction, target bin count, distribution, hidden bins, and compatible manual order
  • when a bundle is imported again, ProtSpace rebuilds the numeric bins from the raw values and the saved numeric settings

If the saved numeric topology no longer matches the realized one, incompatible numeric hidden/manual state is dropped instead of being applied to the wrong bins.
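The drop-on-mismatch behavior might look roughly like the following Python sketch. The structure of the saved settings is not documented on this page, so the keys bin_ids, hidden, and order, the function name, and the idea of comparing ordered bin identifiers as the "topology" are all hypothetical.

```python
def restore_numeric_state(saved, realized_bin_ids):
    """Reuse saved hidden/manual state only if the bins still line up."""
    if saved.get("bin_ids") == list(realized_bin_ids):
        return {
            "hidden": list(saved.get("hidden", [])),
            "order": list(saved.get("order", [])),
        }
    # Topology changed: drop the state rather than apply it to the wrong bins.
    return {"hidden": [], "order": []}
```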

Released under the Apache 2.0 License.