Example Datasets & Formats

Access test data and learn about supported data formats


GitHub Repository

Access Example Datasets & Tutorials

Our GitHub repository contains ready-to-use example datasets for all supported omics types, along with comprehensive tutorials and documentation to help you get started with Profiler.

🌐 Visit GitHub Repository

Data Loading Options

Profiler supports multiple ways to load your omics data, from raw mass spectrometry files to structured tabular data. Below are the three main loading methods available.

🔧 RAW Data Conversion
Mass Spectrometry RAW Files
Convert vendor-specific RAW files directly within Profiler.
Supported Vendors:
  • Waters (.raw)
  • Thermo Fisher (.raw)
  • Bruker (.d folders)
Output Formats:
  • mzML, mzXML, mz5, mzDB
Options:
  • Mass range selection
  • Peak picking
  • Lock mass correction (Waters)

📂 MS Standard Formats
Converted MS Files
Load previously converted mass spectrometry files in standard formats.
Accepted Formats:
  • .mzML
  • .mzXML
Features:
  • Class-based organization
  • Peak height threshold filtering
  • Automatic binning & extraction

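To make the threshold filtering and binning features concrete, here is a minimal sketch of the general idea, not Profiler's actual implementation. The function name `bin_spectrum`, the parameters `bin_width` and `min_height`, and the peak values are all hypothetical.

```python
from collections import defaultdict

def bin_spectrum(peaks, bin_width=1.0, min_height=0.0):
    """Sum peak intensities into fixed-width m/z bins.

    peaks: iterable of (mz, intensity) pairs.
    Peaks below min_height are discarded before binning.
    Returns {bin_start_mz: summed_intensity}, sorted by m/z.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        if intensity < min_height:
            continue  # peak height threshold filtering
        bin_start = (mz // bin_width) * bin_width
        binned[bin_start] += intensity
    return dict(sorted(binned.items()))

# Hypothetical peak list for one spectrum.
peaks = [(542.32, 1200.0), (542.71, 300.0), (543.10, 50.0), (760.58, 900.0)]
binned = bin_spectrum(peaks, bin_width=1.0, min_height=100.0)
print(binned)  # {542.0: 1500.0, 760.0: 900.0}
```

Each bin then becomes one feature column in the tabular format described below.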
🗂️ Tabular Data
Structured Omics Data
Load pre-processed tabular data from various sources and software.
Accepted Formats:
  • CSV, XLSX, TXT, TSV
Native Support:
  • DIA-NN Protein Groups
  • MaxQuant Output
  • Perseus Files

Expected Data Formats by Omics Type

1. Proteomics Data

Standard Tabular Format:

| Class     | Protein1 | Protein2 | Protein3 | ... |
|-----------|----------|----------|----------|-----|
| Control   | 1257.5   | 843.2    | 2341.8   | ... |
| Control   | 1189.3   | 891.5    | 2298.4   | ... |
| Tumor     | 2456.7   | 421.9    | 3892.1   | ... |
| Tumor     | 2389.1   | 398.6    | 3756.8   | ... |

Requirements:

  • First column must be named Class
  • Feature names (proteins, genes) in column headers
  • Numeric intensity/abundance values
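Before uploading, the requirements above can be checked locally. The following is a rough sketch using only the Python standard library; `validate_class_table` is a hypothetical helper, not part of Profiler, and it treats empty cells as permitted missing values.

```python
import csv
import io

def validate_class_table(text, delimiter=","):
    """Return a list of problems found in a Class-first table (empty list = OK)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows or not rows[0]:
        return ["file is empty"]
    header, data = rows[0], rows[1:]
    problems = []
    if header[0].strip() != "Class":
        problems.append(f"first column is {header[0]!r}, expected 'Class'")
    for line_no, row in enumerate(data, start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} cells, got {len(row)}"
            )
            continue
        for name, cell in zip(header[1:], row[1:]):
            if cell.strip() == "":
                continue  # empty cell: treated as a missing value
            try:
                float(cell)
            except ValueError:
                problems.append(
                    f"line {line_no}: {name} has non-numeric value {cell!r}"
                )
    return problems

example = """Class,Protein1,Protein2,Protein3
Control,1257.5,843.2,2341.8
Control,1189.3,891.5,2298.4
Tumor,2456.7,421.9,3892.1
Tumor,2389.1,398.6,3756.8
"""
print(validate_class_table(example))  # []
```

Pass `delimiter="\t"` for TSV or tab-separated TXT files.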

🎯 DIA-NN Protein Groups (native support)

Profiler natively supports DIA-NN protein group files. After you upload the file:

  • Select Gene names or Protein names as feature identifiers
  • Profiler automatically structures the data with the Class column

🎯 MaxQuant Output (native support)

MaxQuant proteinGroups.txt files are directly supported:

  • Choose the Gene names column or the Protein names column as feature identifiers
  • Formatting and Class assignment are handled automatically
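For readers who want to see what this kind of reshaping involves, here is a hedged sketch of turning a proteinGroups-style table (proteins in rows, samples in columns) into the Class-first layout. It is an illustration only, not Profiler's implementation. The helper name, the `sample_classes` mapping, and the choice of `LFQ intensity` columns are assumptions, and the two-protein input is made up.

```python
import csv
import io

def protein_groups_to_class_table(text, feature_col, sample_classes,
                                  prefix="LFQ intensity "):
    """Transpose a tab-separated proteinGroups-style table.

    feature_col: column holding feature identifiers (e.g. "Gene names").
    sample_classes: {sample_name: class_label}, supplied by the user (assumption).
    prefix: prefix of the per-sample intensity columns (assumption).
    Returns a list of rows whose first row is the header starting with "Class".
    """
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    features = [r[feature_col] for r in rows]
    table = [["Class"] + features]
    for sample, label in sample_classes.items():
        column = prefix + sample
        table.append([label] + [float(r[column]) for r in rows])
    return table

# Made-up miniature input with two proteins and two samples.
pg = (
    "Protein names\tGene names\tLFQ intensity C1\tLFQ intensity T1\n"
    "Albumin\tALB\t1257.5\t2456.7\n"
    "Transferrin\tTF\t843.2\t421.9\n"
)
table = protein_groups_to_class_table(
    pg, "Gene names", {"C1": "Control", "T1": "Tumor"}
)
for row in table:
    print(row)
```

Switching `feature_col` to `"Protein names"` corresponds to the other identifier choice listed above.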

🎯 Perseus Files (native support)

Perseus matrix files are supported with feature selection:

  • T: Gene names row
  • T: Protein names row
  • Automatic matrix conversion to Profiler format

2. Metabolomics & Lipidomics Data

Expected Format:

| Class        | Metabolite1 | Lipid_PC_34:1 | Ion_m/z_542.3 | ... |
|--------------|-------------|---------------|---------------|-----|
| Healthy      | 5423.1      | 8932.4        | 1234.5        | ... |
| Healthy      | 5189.7      | 8745.2        | 1198.3        | ... |
| Disease      | 7891.2      | 4532.1        | 2341.7        | ... |
| Disease      | 7654.3      | 4389.6        | 2298.9        | ... |

Supported identifiers:

  • Metabolite names (e.g., Glucose, Lactate)
  • Lipid nomenclature (e.g., PC_34:1, TAG_52:3)
  • m/z values (e.g., mz_542.3201)
  • Retention time + m/z (e.g., RT_12.5_mz_542.3)
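If your features come from an untargeted peak table, column names in the last two styles can be generated consistently with a small helper. This is a sketch; `feature_id` and its rounding parameters are hypothetical, and the source does not state how many decimals Profiler expects.

```python
def feature_id(mz, rt=None, mz_decimals=4, rt_decimals=1):
    """Build a column name such as 'mz_542.3201' or 'RT_12.5_mz_542.3'."""
    mz_part = f"mz_{mz:.{mz_decimals}f}"
    if rt is None:
        return mz_part
    return f"RT_{rt:.{rt_decimals}f}_{mz_part}"

print(feature_id(542.3201))                       # mz_542.3201
print(feature_id(542.3, rt=12.5, mz_decimals=1))  # RT_12.5_mz_542.3
```

Using one fixed precision across a dataset keeps feature names unique and comparable between files.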

3. Transcriptomics (RNA-seq, Gene Expression)

Expected Format:

| Class     | GENE1  | GENE2  | GENE3  | ... |
|-----------|--------|--------|--------|-----|
| WT        | 145.2  | 89.7   | 523.4  | ... |
| WT        | 132.8  | 94.3   | 498.1  | ... |
| Mutant    | 78.4   | 156.9  | 234.7  | ... |
| Mutant    | 82.1   | 149.2  | 221.5  | ... |

Accepted values:

  • Raw read counts
  • TPM (Transcripts Per Million)
  • FPKM/RPKM values
  • Normalized expression values
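As a reminder of how two of these value types relate, the sketch below converts raw read counts to TPM using the standard definition: divide each count by the gene length, then scale so the values sum to one million. The gene lengths and counts are hypothetical, and `counts_to_tpm` is an illustrative helper, not a Profiler function.

```python
def counts_to_tpm(counts, lengths_kb):
    """Convert raw read counts to TPM given gene lengths in kilobases."""
    # Reads per kilobase for each gene.
    rates = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    total = sum(rates.values())
    # Scale so that all values in the sample sum to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}

# Hypothetical counts and gene lengths for a single sample.
counts = {"GENE1": 100, "GENE2": 200, "GENE3": 300}
lengths_kb = {"GENE1": 1.0, "GENE2": 2.0, "GENE3": 3.0}
tpm = counts_to_tpm(counts, lengths_kb)
print(tpm)
```

In this example all three genes have the same read density per kilobase, so they receive equal TPM values even though their raw counts differ.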

4. Survival Analysis Data

Kaplan-Meier Format:

| Overall survival | State | Class      |
|------------------|-------|------------|
| 12               | 1     | Treatment  |
| 24               | 0     | Treatment  |
| 8                | 1     | Control    |
| 36               | 0     | Control    |

Required columns:

  • Overall survival: Time to event or censoring (days, months, or years; use one unit throughout)
  • State: Event indicator (0 = censored, 1 = event occurred)
  • Class: Group/condition for comparison
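To show how these three columns are used, here is a minimal sketch of the standard Kaplan-Meier (product-limit) estimator applied per Class. It is a plain Python illustration under the column semantics above, not Profiler's own code; `kaplan_meier` is a hypothetical helper.

```python
from collections import defaultdict

def kaplan_meier(times, states):
    """Return [(time, survival_probability)] at each time with at least one event.

    times: follow-up times; states: 1 = event occurred, 0 = censored.
    """
    events = defaultdict(int)   # number of events at each time
    leaving = defaultdict(int)  # subjects leaving the risk set at each time
    for t, s in zip(times, states):
        leaving[t] += 1
        if s == 1:
            events[t] += 1
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = events.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]  # events and censored subjects both leave
    return curve

# Rows from the example table: (Overall survival, State, Class).
rows = [(12, 1, "Treatment"), (24, 0, "Treatment"),
        (8, 1, "Control"), (36, 0, "Control")]

groups = defaultdict(lambda: ([], []))
for time, state, label in rows:
    groups[label][0].append(time)
    groups[label][1].append(state)

curves = {label: kaplan_meier(t, s) for label, (t, s) in groups.items()}
print(curves)  # {'Treatment': [(12, 0.5)], 'Control': [(8, 0.5)]}
```

Censored rows (State = 0) never lower the curve; they only shrink the number at risk for later event times.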

Cox Regression Format:

| Overall survival | State | Age | BMI  | Protein_X | Lipid_Y | ... |
|------------------|-------|-----|------|-----------|---------|-----|
| 18               | 1     | 67  | 28.5 | 1234.5    | 892.3   | ... |
| 32               | 0     | 54  | 24.1 | 2341.8    | 1023.7  | ... |
| 9                | 1     | 72  | 31.2 | 987.3     | 654.2   | ... |

Required and optional columns:

  • Overall survival and State (required)
  • Any additional covariates (optional): clinical variables, omics features, etc.
  • Covariates can be numeric or categorical
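A quick way to review a Cox input table is to confirm the two required columns exist and see which covariates parse as numeric versus categorical. The sketch below does that with plain Python. `column_types` is a hypothetical helper, and the `Sex` column is an invented categorical covariate added for illustration; how Profiler itself encodes categorical variables is not stated in the source.

```python
def column_types(header, rows, required=("Overall survival", "State")):
    """Classify each covariate column as 'numeric' or 'categorical'."""
    missing = [name for name in required if name not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    types = {}
    for index, name in enumerate(header):
        if name in required:
            continue
        try:
            for row in rows:
                float(row[index])
            types[name] = "numeric"
        except ValueError:
            types[name] = "categorical"
    return types

# First rows of the example table plus a hypothetical 'Sex' column.
header = ["Overall survival", "State", "Age", "BMI", "Protein_X", "Sex"]
rows = [
    ["18", "1", "67", "28.5", "1234.5", "F"],
    ["32", "0", "54", "24.1", "2341.8", "M"],
    ["9", "1", "72", "31.2", "987.3", "F"],
]
types = column_types(header, rows)
print(types)
```

A column reported as categorical that you expected to be numeric usually points to a stray text value or a locale-specific decimal separator.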

5. Multi-Omics Integration

Integrated Data Format:

| Class   | Protein1 | Protein2 | Metabolite1 | Lipid1  | Gene1 | ... |
|---------|----------|----------|-------------|---------|-------|-----|
| Sample1 | 1257.5   | 843.2    | 5423.1      | 8932.4  | 145.2 | ... |
| Sample2 | 2456.7   | 421.9    | 7891.2      | 4532.1  | 78.4  | ... |

Integration approach:

  • Combine features from multiple omics datasets
  • Ensure sample alignment across datasets
  • Maintain unique feature names (e.g., prefix with data type)
  • Each sample must carry the same Class label in every dataset being combined
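The integration steps above can be sketched as a small merge routine. This is an illustration under stated assumptions, not Profiler's integration code: the `integrate` helper and the input layout are hypothetical, samples are matched by a shared sample ID, and the Class requirement is read as "a sample's label must agree across datasets".

```python
def integrate(datasets):
    """Merge per-omics tables into one row per shared sample.

    datasets: {omics_name: {sample_id: (class_label, {feature: value})}}
    Only samples present in every dataset are kept, and each feature name
    is prefixed with its omics name to keep names unique.
    """
    shared = set.intersection(*(set(d) for d in datasets.values()))
    merged = {}
    for sample in sorted(shared):
        labels = {d[sample][0] for d in datasets.values()}
        if len(labels) != 1:
            raise ValueError(
                f"{sample}: conflicting Class labels {sorted(labels)}"
            )
        row = {"Class": labels.pop()}
        for omics, data in datasets.items():
            for feature, value in data[sample][1].items():
                row[f"{omics}_{feature}"] = value
        merged[sample] = row
    return merged

# Hypothetical inputs; S3 lacks proteomics data and is dropped.
datasets = {
    "prot": {
        "S1": ("Control", {"Protein1": 1257.5}),
        "S2": ("Tumor", {"Protein1": 2456.7}),
    },
    "metab": {
        "S1": ("Control", {"Metabolite1": 5423.1}),
        "S2": ("Tumor", {"Metabolite1": 7891.2}),
        "S3": ("Tumor", {"Metabolite1": 7654.3}),
    },
}
merged = integrate(datasets)
print(merged["S1"])
```

Dropping samples that are missing from any dataset is one possible alignment policy; keeping them with missing values is another, since missing values are supported downstream.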

Important Notes

✅ General Requirements:

  • The Class column is mandatory for all tabular data
  • Feature names must be in column headers (not rows)
  • All values should be numeric (except Class column)
  • Missing values are supported (will be handled in preprocessing)
  • Sample names can be in row index or a separate column

🔧 Automatic Handling:

  • DIA-NN, MaxQuant, Perseus: Automatically formatted by Profiler
  • Missing values: Multiple imputation methods available
  • Normalization: 7 different methods to choose from
  • Batch effects: NeuroCombat correction available

📥 Download Examples:

Visit our GitHub repository to download example datasets for each omics type, including properly formatted files ready to use with Profiler.

📚 Download Example Datasets