Data Format Reference

ProtSpace stores all visualization data in a single .parquetbundle file. This page describes the structure of that file for users who want to understand the format.

What is a .parquetbundle?

A .parquetbundle is a single file containing three Parquet tables bundled together:

.parquetbundle file
├── selected_features.parquet    # Protein metadata and annotations
├── ---PARQUET_DELIMITER---      # Separator
├── projections_metadata.parquet # Projection method information
├── ---PARQUET_DELIMITER---      # Separator
└── projections_data.parquet     # 2D/3D coordinates

This bundled format allows efficient loading in the browser while keeping everything in one convenient file.
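
The layout above suggests that a bundle can be split back into its three tables by searching for the delimiter. The following is a minimal sketch, assuming the delimiter is stored as the exact ASCII bytes ---PARQUET_DELIMITER--- with no surrounding padding and that the parts occur in the order shown. The helper names split_bundle and read_bundle are hypothetical and not part of ProtSpace; reading the parts requires the third-party pyarrow package.

```python
import io

# Assumption: the separator is stored as these exact ASCII bytes.
DELIMITER = b"---PARQUET_DELIMITER---"

# Assumption: the parts appear in the order shown in the layout above.
PART_NAMES = ("selected_features", "projections_metadata", "projections_data")


def split_bundle(raw: bytes) -> dict[str, bytes]:
    """Split raw bundle bytes into the three named Parquet payloads."""
    parts = raw.split(DELIMITER)
    if len(parts) != len(PART_NAMES):
        raise ValueError(f"expected {len(PART_NAMES)} parts, found {len(parts)}")
    return dict(zip(PART_NAMES, parts))


def read_bundle(path: str) -> dict:
    """Read every part of a bundle into a pyarrow Table."""
    import pyarrow.parquet as pq  # third-party; imported lazily

    with open(path, "rb") as handle:
        payloads = split_bundle(handle.read())
    return {name: pq.read_table(io.BytesIO(data)) for name, data in payloads.items()}
```

Calling read_bundle("example.parquetbundle") (a placeholder path) would then return a dictionary keyed by table name.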

Tables

1. Annotations Table (selected_features.parquet)

Contains metadata and biological annotations for each protein.

| Column | Type | Description |
| --- | --- | --- |
| identifier | string | Protein ID (e.g., P12345) |
| (other columns) | string/number | Any biological annotations |

2. Projections Metadata

| Column | Type | Description |
| --- | --- | --- |
| projection_name | string | Method name (e.g., PCA_2) |
| dimensions | integer | 2 or 3 |
| info_json | json | Method parameters and settings |
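
If info_json is stored as a JSON-encoded string, a metadata row can be decoded with Python's standard json module. The row below is invented for illustration, and the n_components key is a hypothetical parameter, not a documented field.

```python
import json

# Hypothetical metadata row; the values are illustrative only.
row = {
    "projection_name": "PCA_2",
    "dimensions": 2,
    "info_json": '{"n_components": 2}',
}


def decode_metadata(row: dict) -> dict:
    """Return a copy of the row with the parsed info_json added under 'info'."""
    if row["dimensions"] not in (2, 3):
        raise ValueError(f"dimensions must be 2 or 3, got {row['dimensions']}")
    decoded = dict(row)
    decoded["info"] = json.loads(row["info_json"])
    return decoded


meta = decode_metadata(row)
print(meta["projection_name"], meta["info"])
```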

3. Projections Data

| Column | Type | Description |
| --- | --- | --- |
| projection_name | string | Method name (e.g., PCA_2) |
| identifier | string | Protein ID |
| x | float | X coordinate |
| y | float | Y coordinate |
| z | float | Z coordinate (null for 2D) |
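
Because this table holds the coordinates of every projection in one long table, plotting a single projection means filtering by projection_name and then looking up each protein's annotations by identifier. A minimal pure-Python sketch of that join follows, with rows represented as dictionaries. All sample values are invented: the family column, the second identifier, the PCA_3 name, and the coordinates are made up for illustration.

```python
# Invented sample rows that follow the column layout described above.
annotations = [
    {"identifier": "P12345", "family": "kinase"},
    {"identifier": "Q67890", "family": "protease"},
]
projections_data = [
    {"projection_name": "PCA_2", "identifier": "P12345", "x": 0.5, "y": -1.2, "z": None},
    {"projection_name": "PCA_2", "identifier": "Q67890", "x": 2.0, "y": 0.3, "z": None},
    {"projection_name": "PCA_3", "identifier": "P12345", "x": 0.1, "y": 0.2, "z": 0.3},
]


def points_for(projection_name, projections_data, annotations):
    """Join one projection's coordinates with annotations by identifier."""
    by_id = {row["identifier"]: row for row in annotations}
    points = []
    for row in projections_data:
        if row["projection_name"] != projection_name:
            continue
        point = {"x": row["x"], "y": row["y"]}
        if row["z"] is not None:  # z is null for 2D projections
            point["z"] = row["z"]
        # Proteins without an annotation row keep only their identifier.
        point.update(by_id.get(row["identifier"], {"identifier": row["identifier"]}))
        points.append(point)
    return points


pca_points = points_for("PCA_2", projections_data, annotations)
```

With a DataFrame library the same step would typically be a filter followed by a merge on identifier; the dictionary lookup here only keeps the sketch dependency-free.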

Annotation Types

  • Categorical: Text values (taxonomy, family). Distinct colors per category.
  • Multi-Label: Semicolon-separated values (e.g., EC:1.1.1;EC:2.1.1). Displayed as pie charts.
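
A multi-label value can be turned into individual labels by splitting on the semicolon. The sketch below uses a hypothetical helper, split_labels; how ProtSpace itself treats surrounding whitespace or empty entries is not stated here, so stripping and skipping them is an assumption.

```python
def split_labels(value: str) -> list[str]:
    """Split a semicolon-separated annotation value into labels."""
    # Assumption: surrounding whitespace and empty entries are ignored.
    return [label.strip() for label in value.split(";") if label.strip()]


labels = split_labels("EC:1.1.1;EC:2.1.1")
print(labels)  # ['EC:1.1.1', 'EC:2.1.1']
```

A value without a semicolon yields a one-element list, so categorical and multi-label values can pass through the same function.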

Creating Files

Use the Google Colab notebook or Python CLI to generate .parquetbundle files.

Released under the Apache 2.0 License.