# CSV format requirements

## General CSV format requirements

The following are the general format requirements for a CSV file used to create multiple cases:

1. The file must have a .csv extension.
2. The file must contain a \[Data] header.
3. The row after the \[Data] header must contain the column names that identify the data in each column. Column names are case-sensitive.
4. Each row after the column-name row represents a sample.
5. Each column represents a data field.
6. There must be no empty rows between the \[Data] header and the last sample row.
7. A file can contain at most 50 cases.
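
The structural rules above can be checked before upload. The following is a minimal sketch, not the platform's own validator: the function name `check_batch_csv` is hypothetical, and it assumes that one case corresponds to one distinct Family Id value.

```python
import csv
import io

MAX_CASES = 50  # limit stated in the requirements above


def check_batch_csv(filename: str, text: str) -> list[str]:
    """Return the structural problems found; an empty list means none."""
    problems = []
    if not filename.endswith(".csv"):
        problems.append("file must have a .csv extension")

    rows = list(csv.reader(io.StringIO(text)))
    # Locate the [Data] header row.
    data_idx = next(
        (i for i, row in enumerate(rows) if row and row[0].strip() == "[Data]"),
        None,
    )
    if data_idx is None:
        problems.append("missing [Data] header")
        return problems

    body = rows[data_idx + 1:]
    # Trailing blank rows come after the last sample row, so drop them.
    while body and not any(cell.strip() for cell in body[-1]):
        body.pop()
    if not body:
        problems.append("missing column-name row after [Data]")
        return problems
    if any(not any(cell.strip() for cell in row) for row in body):
        problems.append("empty row between [Data] and the last sample row")

    header = body[0]
    samples = [row for row in body[1:] if any(cell.strip() for cell in row)]
    # Assumption: one case per distinct Family Id value.
    if "Family Id" in header:
        col = header.index("Family Id")
        cases = {row[col] for row in samples if len(row) > col}
        if len(cases) > MAX_CASES:
            problems.append(f"{len(cases)} cases exceed the limit of {MAX_CASES}")
    return problems
```

Calling `check_batch_csv("batch.csv", text)` on the file contents returns a list of messages that can be shown to the user before the file is submitted.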

***

## CSV schema

### **1. Mandatory fields**

These fields must always be present in the sample table.

1. Case Type;
2. Family Id;
3. Phenotypes OR Phenotypes Id.

### **2. Conditionally mandatory fields**

Leaving these fields empty creates an empty sample.

1. BioSample Name;
2. Files Names;
3. Storage Provider Id.

This field is mandatory if Files Names is empty:

1. Sample Type.

This field is required if the "auto" option is used for Files Names (only relevant for BSSH):

1. Default Project.
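
As an illustration of the mandatory and conditionally mandatory rules above, here is a minimal sketch. The helper names `missing_columns` and `conditional_problems` are hypothetical, and each sample row is assumed to be a dictionary keyed by column name (for example, as produced by `csv.DictReader`).

```python
def missing_columns(header: list[str]) -> list[str]:
    """Mandatory columns that are absent from the column-name row."""
    missing = [name for name in ("Case Type", "Family Id") if name not in header]
    # Either Phenotypes or Phenotypes Id satisfies the third requirement.
    if "Phenotypes" not in header and "Phenotypes Id" not in header:
        missing.append("Phenotypes OR Phenotypes Id")
    return missing


def conditional_problems(row: dict[str, str]) -> list[str]:
    """Check the rules that depend on the value of Files Names."""
    files = row.get("Files Names", "").strip()
    problems = []
    if not files and not row.get("Sample Type", "").strip():
        problems.append("Sample Type is required when Files Names is empty")
    if files == "auto" and not row.get("Default Project", "").strip():
        problems.append('Default Project is required when Files Names is "auto"')
    return problems
```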

### **3. Optional fields**

The sample table may include these supported optional columns.

1. Boost Genes
2. Clinical Notes
3. Date Of Birth
4. Due Date
5. Execute now
6. Gender. See an [important note](#handling-cases-with-unknown-sex)
7. Gene List Id
8. Kit Id
9. Intersect Bed Id (38.0+)
10. Label Id
11. Opt In
12. Relation
13. Selected Preset
14. Visualization Files

### **4. Custom fields**

The sample table may contain custom columns that carry any additional information relevant to your workflow.

Each custom field must be assigned a unique name without spaces. Data from custom columns is saved per case under the Additional information section of [Case Info](https://help.emg.illumina.com/emedgene-analyze-manual/getting_around_the_platform/cases_tab/case_details).

{% hint style="info" %}
**Note:** In cases with more than one sample, custom fields are only recognized and added to case information if their values appear within the same table row where the Relation field is equal to "proband".
{% endhint %}

#### **Custom field examples:**

| Field (column) name    | Expected input | Field details | Example           |
| ---------------------- | -------------- | ------------- | ----------------- |
| Institution            | Free text      | Custom        | GenoMed Solutions |
| Sample\_Received\_Date | Free text      | Custom        | 24-02-2022        |
| Sample\_Type           | Free text      | Custom        | Amniotic Fluid    |
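
The proband-row rule for custom fields can be sketched as follows. This is an illustration only: `custom_fields` and `KNOWN_COLUMNS` are hypothetical names, the column name "Intersect Bed Id" is assumed to be written without the version suffix, and an empty Relation value is treated as "proband" because that is the documented default.

```python
# Supported (non-custom) column names listed in the CSV schema.
KNOWN_COLUMNS = {
    "BioSample Name", "Boost Genes", "Case Type", "Clinical Notes",
    "Date Of Birth", "Default Project", "Due Date", "Execute now",
    "Family Id", "Files Names", "Gender", "Gene List Id", "Kit Id",
    "Intersect Bed Id", "Label Id", "Opt In", "Phenotypes", "Phenotypes Id",
    "Relation", "Sample Type", "Selected Preset", "Storage Provider Id",
    "Visualization Files",
}


def custom_fields(rows: list[dict[str, str]]) -> dict[str, str]:
    """Collect custom-column values from the proband row of one case."""
    proband = next(
        (r for r in rows if (r.get("Relation") or "proband") == "proband"),
        None,
    )
    if proband is None:
        return {}
    result = {}
    for name, value in proband.items():
        if name in KNOWN_COLUMNS or not value:
            continue
        if " " in name:
            raise ValueError(f"custom field name contains a space: {name!r}")
        result[name] = value
    return result
```

Values entered on non-proband rows (for example, on the mother's row) are ignored by this sketch, mirroring the note above.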

***

## **Batch case .csv file validation rules**

[Mandatory](#mandatory-fields) (highlighted in <mark style="background-color:red;">red</mark>), [Conditionally mandatory](#conditionally-mandatory-fields) (highlighted in <mark style="background-color:orange;">orange</mark>), and [Optional](#optional-fields) fields should be filled in according to the following rules.

<table data-full-width="false"><thead><tr><th>Field (column) name</th><th>Expected input</th><th width="178">Field details</th><th>Example</th></tr></thead><tbody><tr><td><mark style="background-color:orange;">BioSample Name</mark></td><td>Free text</td><td>Conditionally mandatory.<br><br>An empty sample will be created if the field is left blank.</td><td>NA24385</td></tr><tr><td>Boost Genes</td><td>1. "TRUE"<br>2. "FALSE"</td><td>Optional.<br><br>Indicates whether the <a href="../creating_a_single_case/gene_list">Boost genes mode</a> will be used. "TRUE" means that variants in the targeted genes will receive upgraded scores during prioritization by the AI Shortlist algorithm.<br><br>Default value is "FALSE".<br><br>Only considered for proband.</td><td>TRUE</td></tr><tr><td><mark style="background-color:red;">Case Type</mark></td><td><p>1. "Whole Genome"<br>2. "Exome"<br>3. "Custom Panel"<br>4. Array</p><p>5. Custom case type</p></td><td>Mandatory.<br><br>Only considered for proband.</td><td>Whole Genome</td></tr><tr><td>Clinical Notes</td><td>Free text</td><td>Optional</td><td>A 14-year-old boy with a visual acuity of 20/200 in both eyes in whom hearing loss was first noted at 5 years of age on routine screening; audiometry revealed sensorineural hearing loss.</td></tr><tr><td>Date Of Birth</td><td>Date "YYYY-MM-DD"</td><td>Optional</td><td>2013-01-22</td></tr><tr><td><mark style="background-color:orange;">Default Project</mark></td><td>Free text</td><td>Conditionally mandatory.<br><br>Must be filled in if the "auto" option is used for Files Names (only relevant for BSSH).</td><td>GIAB</td></tr><tr><td>Due Date</td><td>Date "YYYY-MM-DD"</td><td>Optional</td><td>2023-05-03</td></tr><tr><td>Execute now</td><td>1. "TRUE"<br>2. "FALSE"</td><td>Optional.<br><br>Default value is "TRUE". 
Use "FALSE" if you don’t want to run the case upon uploading the file.<br><br>Only considered for proband.</td><td>FALSE</td></tr><tr><td><mark style="background-color:red;">Family Id</mark></td><td>Free text</td><td>Mandatory</td><td>RM8392</td></tr><tr><td><mark style="background-color:orange;">Files Names</mark></td><td>1. Semicolon-separated list of paths to <code>.fastq</code>, <code>.fastq.gz</code>, <code>.vcf</code>, <code>.vcf.gz</code>, <code>.bam</code>, <code>.cram</code>, <code>.gt_sample_summary.json</code>, <code>.annotated_cyto.json</code> files without spaces<br>2. "existing"<br>3. "auto" (BSSH)</td><td><p>Conditionally mandatory.<br><br>An empty sample will be created if the field is left blank.<br><br>The "existing" option automatically locates FASTQ files based on the BioSample Name.<br><strong>Note:</strong> If data files for an existing case were sourced from the customer’s external bucket and later removed, attempting to create a case from those files will result in an error.</p><p><br>Learn about the <a href="../../creating_a_single_case/select_sample_type#current-limitation-cram-input-and-reference-compatibility">current limitation for CRAM file input</a>.<br><br>With the "auto" option, BSSH users can automatically locate FASTQ files based on the BioSample Name and Default Project provided.<br><br>When using BSSH without the "auto" option, ensure that your file path is <a href="#required-bssh-file-path-format">formatted correctly</a>.</p></td><td>/GIAB_cases/1/NA24385.dragen.hard-filtered.gvcf.gz;/QA_cases/Other/NA24385.dragen.cnv.vcf.gz;/QA_cases/Other/NA24385.dragen.repeats.vcf;</td></tr><tr><td>Gender</td><td>1. "F"<br>2. "M"<br>3. "U"</td><td>Optional.<br><br>Default value is "U". 
See an <a href="#handling-cases-with-unknown-sex">important note</a>.</td><td>M</td></tr><tr><td>Gene List Id</td><td>Integer</td><td>Optional.<br><br>Must be the id of a previously defined Gene List.<br><br>Only considered for proband.</td><td>12345</td></tr><tr><td>Kit Id</td><td>Integer</td><td><p>Optional.<br></p><p>&#x3C;38.0: ID of a Region of interest BED.</p><p>38.0+: ID of a Coverage BED.<br>Must be the id of a previously defined kit.<br><br>Only considered for proband.</p></td><td>23456</td></tr><tr><td>Intersect Bed Id (38.0+)</td><td>Integer</td><td>Optional.<br><br>ID of a Region of interest BED.<br>Must be the id of a previously defined kit.<br><br>Only considered for proband.</td><td>78957</td></tr><tr><td>Label Id</td><td>Integer</td><td>Optional.<br><br>Must be the id of a previously defined Case Label.<br><br>Only considered for proband.</td><td>34567</td></tr><tr><td>Opt In</td><td>1. "TRUE"<br>2. "FALSE"</td><td>Optional.<br><br>Indicates whether the case subject consented to the <a href="../../analyze_network/analyze_network_setup">extended sharing of data</a> with your network(s).<br><br>Default value is "TRUE".</td><td>FALSE</td></tr><tr><td><mark style="background-color:red;">Phenotypes</mark></td><td><ol><li>Semicolon-separated list of HPO phenotype terms</li><li>"Unaffected" is used for non-affected family members.</li></ol></td><td>Mandatory for proband sample if Phenotypes Id is empty.<br><br>The list must contain fewer than 100 terms.<br><br>It is possible to include non-HPO terms if Phenotypes Id is empty.</td><td>Abnormal pupillary function;Orthotopic os odontoideum;</td></tr><tr><td><mark style="background-color:red;">Phenotypes Id</mark></td><td>Semicolon-separated list of HPO phenotype IDs</td><td><p>Mandatory for proband sample if Phenotypes is empty.<br></p><p>The list must contain fewer than 100 IDs.</p></td><td>HP:0007686;HP:0025375;</td></tr><tr><td>Relation</td><td>1. "proband"<br>2. "mother"<br>3. "father"<br>4. 
"sibling"</td><td>Optional.<br><br>Default value is "proband".<br><br>Values "proband", "father", "mother" can be used only once per Family Id.<br>One sample with Relation "proband" is required per Family Id.</td><td>mother</td></tr><tr><td><mark style="background-color:orange;">Sample Type</mark></td><td>1. "FASTQ"<br>2. "VCF"</td><td>Conditionally mandatory.<br><br>Required if Files Names is empty.<br><br>Only considered for proband.</td><td>FASTQ</td></tr><tr><td>Selected Preset</td><td>1. Free text<br>2. "Default"</td><td>Optional.<br><br>Must be the name of a previously defined Preset. If set to "Default", the default Preset will be applied. If left empty, no Preset will be applied.</td><td>High quality candidates</td></tr><tr><td><mark style="background-color:orange;">Storage Provider Id</mark></td><td>Integer</td><td>Conditionally mandatory.<br><br>Required if Files Names is not empty.<br><br>Must be from the configured storage provider ID list.</td><td>208</td></tr><tr><td>Visualization Files</td><td>Semicolon-separated list of paths to sequence alignment data files of extension <code>.bam</code>, <code>.cram</code>; <code>.tn.bw</code>, <code>.baf.bw</code>, <code>.roh.bed</code>, <code>.lrr.bedgraph</code>, <code>.baf.bedgraph</code></td><td>Optional</td><td>/giab_project/NA24385.bam</td></tr></tbody></table>

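Several of the rules in the table above lend themselves to a pre-upload check. The sketch below is illustrative and covers only a subset of fields: the names `value_problems` and `relation_problems` are hypothetical, rows are assumed to be dictionaries keyed by column name, and the "Intersect Bed Id" column name is assumed to be written without the version suffix.

```python
import re
from collections import Counter
from datetime import datetime

ALLOWED = {
    "Boost Genes": {"TRUE", "FALSE"},
    "Execute now": {"TRUE", "FALSE"},
    "Opt In": {"TRUE", "FALSE"},
    "Gender": {"F", "M", "U"},
    "Relation": {"proband", "mother", "father", "sibling"},
    "Sample Type": {"FASTQ", "VCF"},
}
DATE_FIELDS = ("Date Of Birth", "Due Date")
INTEGER_FIELDS = (
    "Gene List Id", "Kit Id", "Intersect Bed Id", "Label Id",
    "Storage Provider Id",
)
DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def value_problems(row: dict[str, str]) -> list[str]:
    """Check single-row value formats; empty values are skipped."""
    problems = []
    for field, allowed in ALLOWED.items():
        value = row.get(field, "")
        if value and value not in allowed:
            problems.append(f"{field}: unexpected value {value!r}")
    for field in DATE_FIELDS:
        value = row.get(field, "")
        if not value:
            continue
        ok = DATE_SHAPE.fullmatch(value) is not None
        if ok:
            try:
                datetime.strptime(value, "%Y-%m-%d")  # rejects e.g. month 13
            except ValueError:
                ok = False
        if not ok:
            problems.append(f"{field}: expected YYYY-MM-DD")
    for field in INTEGER_FIELDS:
        value = row.get(field, "")
        if value and not value.isdigit():
            problems.append(f"{field}: expected an integer")
    return problems


def relation_problems(rows: list[dict[str, str]]) -> list[str]:
    """Check the per-family Relation rules; empty Relation means proband."""
    counts = Counter((r["Family Id"], r.get("Relation") or "proband") for r in rows)
    problems = []
    for family in sorted({r["Family Id"] for r in rows}):
        if counts[(family, "proband")] != 1:
            problems.append(f"{family}: exactly one proband is required")
        for relation in ("mother", "father"):
            if counts[(family, relation)] > 1:
                problems.append(f"{family}: {relation} can be used only once")
    return problems
```

Running `value_problems` on each row and `relation_problems` on the whole sample table gives a quick list of issues to fix before the file is submitted.
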
#### **Handling a proband sample with unknown sex**

{% hint style="warning" %}
When a user assigns "Unknown" sex to a sample, the system assumes "Female". This affects CNV interpretation on sex chromosomes if *the genetic sex is actually male*:

* **Chromosome X:**\
  CN = 2 is considered reference (REF) for a female genome, so CNVs with two copies are hidden by default. This may cause chromosome X duplications to be missed.
* **Chromosome Y:**\
  CN = 0 is considered reference (REF) for a female genome, so CNVs with zero copies are hidden by default. This may cause chromosome Y deletions to be missed.

To include these variants in the analysis, enable the [Include Reference Homozygosity and No Coverage Calls toggle](https://help.emg.illumina.com/settings/organization_settings_-330+/workbench-and-pipeline/pipeline-versions#include-reference-homozygosity-v36.0) in Workbench & Pipeline Settings.
{% endhint %}
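
The effect described in the warning can be illustrated with a small sketch. This is not the platform's implementation: the function name is hypothetical, and the male reference copy numbers (one copy each of X and Y, ignoring pseudoautosomal regions) are an assumption added for contrast.

```python
# Reference copy number per (sex, chromosome).
# Female values come from the warning above; male values are an assumption.
REFERENCE_CN = {
    ("F", "X"): 2,
    ("F", "Y"): 0,
    ("M", "X"): 1,
    ("M", "Y"): 1,
}


def hidden_as_reference(assigned_sex: str, chromosome: str, copy_number: int) -> bool:
    """True if a call matches the reference copy number and is hidden by default."""
    sex = "F" if assigned_sex == "U" else assigned_sex  # "U" is treated as female
    return copy_number == REFERENCE_CN[(sex, chromosome)]
```

For a genetically male sample entered as "U", a chromosome X duplication (CN = 2) and a chromosome Y deletion (CN = 0) both match the female reference and are therefore hidden unless the toggle is enabled.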

### **Required BSSH file path format:**

For BSSH, file paths must use the actual numeric IDs:

```
/projects/3824821/appresults/2319318/files/119675608
```

instead of aliases:

```
/projects/ABC_DEF_2022-12-22_DEv395/appresults/ABC-GM58342-def/files/ABC-GM58342-def.hard-filtered.vcf.gz
```
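
A quick way to tell the two forms apart is a pattern check. The sketch below is an assumption derived only from the single ID-based example above; the pattern and the function name `is_id_based_bssh_path` are hypothetical and may not cover every valid BSSH path.

```python
import re

# Assumed shape: /projects/<id>/appresults/<id>/files/<id>, all IDs numeric.
BSSH_ID_PATH = re.compile(r"/projects/\d+/appresults/\d+/files/\d+")


def is_id_based_bssh_path(path: str) -> bool:
    """True if the path uses numeric IDs rather than aliases."""
    return BSSH_ID_PATH.fullmatch(path) is not None
```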

### Human-readable path for BSSH files in batch CSV

In version 37, we introduced an enhancement to the batch upload process that allows you to provide a human-readable path in your batch CSV for BSSH files.

#### Validations

When a batch CSV includes a human-readable path, the system performs the following validations for paths in BSSH storage:

1. **Single File in the Path**:
   * If the provided path contains exactly one file or dataset, the batch upload proceeds successfully.
2. **Two Files in the Path**:
   * If the path contains two files with the same name (for example, two pairs of FASTQ files in a dataset), the system will:
     * Select the dataset marked as QCPassed.
     * Fail the batch upload if both datasets are marked as QCPassed, as this indicates conflicting data.
3. **More Than Two Files in the Path**:
   * If the path contains more than two files or datasets, the system fails the batch upload, as the path is considered ambiguous or invalid.
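
The selection rules above can be sketched as follows. This is an illustration, not the platform's code: `Dataset`, `AmbiguousPathError`, and `select_dataset` are hypothetical names, and the handling of zero matches or of two datasets where neither is QCPassed is not described in this document, so the sketch simply rejects those cases.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    qc_passed: bool


class AmbiguousPathError(ValueError):
    """Raised when a human-readable path cannot be resolved to one dataset."""


def select_dataset(matches: list[Dataset]) -> Dataset:
    """Pick the dataset to use for a human-readable BSSH path."""
    if len(matches) == 1:
        return matches[0]
    if len(matches) == 2:
        passed = [d for d in matches if d.qc_passed]
        if len(passed) == 1:
            return passed[0]
        if len(passed) == 2:
            raise AmbiguousPathError("both datasets are marked as QCPassed")
        # Assumption: behavior for this case is not specified in the source.
        raise AmbiguousPathError("neither dataset is marked as QCPassed")
    if not matches:
        # Assumption: an empty match is treated as an invalid path.
        raise AmbiguousPathError("no file or dataset found for the path")
    raise AmbiguousPathError("more than two files or datasets match the path")
```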

#### Error Scenarios

* **Multiple QCPassed Datasets**:\
  If two datasets in the same path are marked as QCPassed, the batch upload will fail with a descriptive error indicating the conflict.
* **Excessive Files in the Path**:\
  If more than two files are found for the provided path, the batch upload will fail, instructing the user to provide a more specific or valid path.

#### Benefits

* Enables customers to use intuitive, human-readable paths in their workflows.
* Automatically handles dataset selection based on quality control status.
