2023 ISB Multisite
This folder contains data about clinically and pathologically diagnosed lymphatic patterns of progression from 332 patients treated at the Inselspital Bern between 2001 and 2018.
Table of Contents
- Cohort Characteristics
- Online Interface
- Curation
- Data Description
- Documentation of Columns
- Mapping Documentation
Cohort Characteristics
Below we show some figures that aim to coarsely characterize the patient cohort in this directory.
Figure 1: Distribution over age, stratified by sex and smoking status.
Figure 2: Distribution over age, stratified by sex and smoking status.
Figure 3: Distribution over primary tumor subsite.
Online Interface
We provide a graphical user interface to view the dataset, which is available at https://lyprox.org/. The GUI has two main functionalities: the patient list and the dashboard. The patient list shows the characteristics of a patient, corresponding to one row of the CSV file, in a visually appealing and intuitive way. The dashboard allows for filtering of the dataset. For example, the user may select all patients with primary tumors extending over the mid-sagittal plane with involvement of ipsilateral level III. The dashboard will then display the number or percentage of patients with metastases in each of the other LNLs.
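The dashboard query from the example can also be approximated directly on the table. The following is only a sketch on made-up rows: it assumes the three-level column names documented further below, that the tumor side is stored as the strings left and right, and that involvement columns are boolean. The helper ipsilateral is hypothetical and not part of LyProX.

```python
import pandas as pd

# Made-up mini table that only borrows the documented three-level column names.
table = pd.DataFrame({
    ("tumor", "1", "side"): ["left", "right", "left"],
    ("tumor", "1", "extension"): [True, True, False],
    ("pathology", "left", "II"): [True, False, False],
    ("pathology", "right", "II"): [False, False, True],
    ("pathology", "left", "III"): [True, False, True],
    ("pathology", "right", "III"): [False, True, False],
})


def ipsilateral(table: pd.DataFrame, modality: str, lnl: str) -> pd.Series:
    """Hypothetical helper: pick the LNL column on the side of the tumor."""
    side = table["tumor", "1", "side"]
    left = table[modality, "left", lnl]
    right = table[modality, "right", lnl]
    return left.where(side == "left", right)


# Patients whose tumor crosses the mid-sagittal plane AND whose
# ipsilateral level III is pathologically involved.
selected = table["tumor", "1", "extension"] & ipsilateral(table, "pathology", "III")
num_selected = int(selected.sum())

# Share of the selected patients with involvement of ipsilateral level II.
share_ipsi_II = float(ipsilateral(table, "pathology", "II")[selected].mean())
print(num_selected, share_ipsi_II)
```

In the real table, missing values would need to be handled before combining the boolean columns.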
Curation
Curation and inclusion criteria will be published in a separate Data in Brief article that is currently under review.
Data Description
The data is provided as a CSV table containing one row for each of the 332 patients. The table has a header with three levels that describe the columns. Below we explain each column in the form of a list with three levels. For example, list entry 1.i.g refers to the column with the three-level header patient | # | nicotine_abuse, which reports the patient's smoking status.
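As an illustration of the three-level header, the sketch below reads a tiny stand-in table with pandas and selects the smoking status column. The two patients and their values are invented for this example. Only the header layout follows the description above.

```python
import io

import pandas as pd

# Tiny stand-in for the real file: three header rows, then one row per patient.
csv_text = """patient,patient,patient,tumor
#,#,#,1
id,age,nicotine_abuse,t_stage
p1,61,True,2
p2,54,False,3
"""

# header=[0, 1, 2] tells pandas to build a three-level column index.
table = pd.read_csv(io.StringIO(csv_text), header=[0, 1, 2])

# A column is addressed by its three header levels.
smoking = table["patient", "#", "nicotine_abuse"]
num_smokers = int(smoking.sum())
print(num_smokers)
```

Note that the second-level header of the tumor columns is read as the string "1", not as an integer.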
Documentation of Columns
1. patient: This top-level header contains general patient information.
    i. #: The second-level header for the patient columns is only a placeholder.
        a. id: The local study ID.
        b. institution: The institution where the patient was treated.
        c. sex: The biological sex of the patient.
        d. age: The age of the patient at the time of diagnosis.
        e. diagnose_date: The date of diagnosis.
        f. alcohol_abuse: Whether the patient was abusing alcohol at the time of diagnosis.
        g. nicotine_abuse: Whether the patient was considered a smoker. This is set to False when the patient had zero pack-years.
        h. hpv_status: The HPV p16 status of the patient.
        i. neck_dissection: Whether the patient underwent a neck dissection. In this dataset, all patients underwent a neck dissection.
        j. tnm_edition: The edition of the TNM classification used.
        k. n_stage: The pN category of the patient (pathologically assessed).
        l. m_stage: The M category of the patient.
        m. extracapsular: Whether the patient had extracapsular spread in any LNL.
2. tumor: This top-level header contains general tumor information.
    i. 1: This second-level header enumerates synchronous tumors.
        a. location: The location of the tumor.
        b. subsite: The subsite of the tumor, specified by ICD-O-3 code.
        c. side: Whether the tumor occurred on the right or left side of the mid-sagittal plane.
        d. central: Whether the tumor was located centrally or not.
        e. extension: Whether the tumor extended over the mid-sagittal line.
        f. volume: The volume of the tumor in cm^3.
        g. stage_prefix: The prefix of the T category.
        h. t_stage: The T category of the tumor.
3. CT: This top-level header contains involvement information from the CT scan.
    i. info: This second-level header contains general information about the CT scan.
        a. date: The date of the CT scan. This was missing for some patients, in which case the date of diagnosis was used as a fallback.
    ii. left: This describes the observed involvement of the left LNLs.
        a. Va: As an example, this describes the clinical involvement of the left LNL Va, as observed in a CT scan.
    iii. right: This describes the observed involvement of the right LNLs.
        a. IIa: As another example, this describes the clinical involvement of the right LNL IIa, as observed in a CT scan.
4. MRI: This top-level header contains involvement information from the MRI scan.
    i. info: This second-level header contains general information about the MRI scan.
        a. date: The date of the MRI scan.
    ii. left: This describes the observed involvement of the left LNLs.
        a. Ia: E.g., this describes the clinical involvement of the left LNL Ia, as observed in an MRI scan.
    iii. right: This describes the observed involvement of the right LNLs.
        a. III: This describes the clinical involvement of the right LNL III, as observed in an MRI scan.
5. PET: This top-level header contains involvement information from the PET scan.
    i. info: This second-level header contains general information about the PET scan.
        a. date: The date of the PET scan.
    ii. left: This describes the observed involvement of the left LNLs.
        a. IV: For instance, this describes the clinical involvement of the left LNL IV, as observed in a PET scan.
    iii. right: This describes the observed involvement of the right LNLs.
        a. III: On the other side, this describes the clinical involvement of the right LNL III, as observed in a PET scan.
6. pathology: This top-level header contains involvement information from the pathology report.
    i. info: This second-level header contains general information about the pathology report.
        a. date: Date of the neck dissection.
    ii. left: Microscopic involvement of the left LNLs.
        a. I: This describes whether the left LNL I was pathologically involved or not.
    iii. right: Microscopic involvement of the right LNLs.
        a. IIb: This describes whether the right sub-LNL IIb was pathologically involved or not.
7. total_dissected: This top-level header contains information about the number of lymph nodes dissected in each LNL.
    i. info: This second-level header contains general information about the pathology report.
        a. date: Date of the neck dissection.
        b. all_lnls: The total number of investigated lymph nodes.
    ii. left: Number of dissected lymph nodes per LNL on the left side.
        a. Va: Number of dissected lymph nodes in the left sub-LNL Va.
        b. Ib_to_III: Total number of dissected lymph nodes in the left LNLs Ib-III. This information is gathered for a particular figure in our publication. Note that this is not just the sum of the dissected nodes in the LNLs Ib to III because some levels were resected en-bloc. Those are included in this column but could not be resolved for the individual LNLs.
    iii. right: Number of dissected lymph nodes per LNL on the right side.
        a. II: Total number of dissected lymph nodes in the right LNL II.
        b. Ib_to_III: Total number of dissected lymph nodes in the right LNLs Ib-III. This information is gathered for a particular figure in our publication. Note that this is not just the sum of the dissected nodes in the LNLs Ib to III because some levels were resected en-bloc. Those are included in this column but could not be resolved for the individual LNLs.
8. positive_dissected: This top-level header contains information about the number of pathologically positive lymph nodes in each LNL.
    i. info: This second-level header contains general information about the findings of metastasis by the pathologist.
        a. date: Date of the neck dissection.
        b. all_lnls: The total number of involved lymph nodes.
        c. largest_node_mm: Size of the largest lymph node in the neck dissection in mm.
        d. largest_node_lnl: LNL where the largest pathological lymph node metastasis was found.
    ii. left: Number of pathologically positive lymph nodes per LNL on the left side.
        a. V: Total number of pathologically positive lymph nodes in the left LNL V.
        b. Ib_to_III: Total number of dissected lymph nodes found to harbor metastases in the left LNLs Ib-III. This information is gathered for a particular figure in our publication. Note that this is not just the sum of the dissected nodes in the LNLs Ib to III because some levels were resected en-bloc. Those are included in this column but could not be resolved for the individual LNLs.
    iii. right: Number of pathologically positive lymph nodes per LNL on the right side.
        a. IIa: Total number of pathologically positive lymph nodes in the right sub-LNL IIa.
        b. Ib_to_III: Total number of dissected lymph nodes found to harbor metastases in the right LNLs Ib-III. This information is gathered for a particular figure in our publication. Note that this is not just the sum of the dissected nodes in the LNLs Ib to III because some levels were resected en-bloc. Those are included in this column but could not be resolved for the individual LNLs.
9. enbloc_dissected: These columns only report the number of lymph nodes that were resected en-bloc. If, e.g., the LNLs II, III, and IV were resected together, then in each of the respective columns, we report the total number of jointly resected lymph nodes and add a symbol (e.g. 'a') to identify the en-bloc resection.
    i. left: This reports the en-bloc resection of the left LNLs.
    ii. right: This reports the en-bloc resection of the right LNLs.
10. enbloc_positive: These columns only report the number of positive lymph nodes that were resected en-bloc. If, e.g., the LNLs II, III, and IV were resected together, then in each of the respective columns, we report the number of jointly resected lymph nodes that were found to harbor metastases and add a symbol (e.g. 'a') to identify the en-bloc resection.
    i. left: For each LNL, this reports the number of en-bloc resected and positive lymph nodes on the left side.
    ii. right: For each LNL, this reports the number of en-bloc resected and positive lymph nodes on the right side.
module mapping
Map the raw.csv data from the 2023-isb-multisite cohort to the data.csv file.
This module defines how the command lyscripts data lyproxify (see the documentation of the lyscripts module) should handle the raw.csv data that was extracted at the Inselspital Bern in order to transform it into a LyProX-compatible data.csv file.
The most important definitions in here are the list EXCLUDE and the dictionary COLUMN_MAP that defines how to construct the new columns based on the raw.csv data. They are described in more detail below:
global EXCLUDE
List of tuples specifying which function to run for which columns to find out if patients/rows should be excluded in the lyproxified data.csv.
The first element of each tuple is the flattened multi-index column name, the second element is the function to run on the column to determine if a patient/row should be excluded:
python
EXCLUDE = [
(column_name, check_function),
]
Essentially, a row is excluded if, for that row, check_function(raw_data[column_name]) evaluates to True.
More information can be found in the documentation of the lyproxify function.
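The exclusion rule can be sketched as follows. This is not the lyscripts implementation, only a minimal illustration that assumes each check function receives a whole column and returns a boolean mask. The check is_missing, the helper apply_exclusions, and the example table are all hypothetical.

```python
import pandas as pd


def is_missing(column: pd.Series) -> pd.Series:
    """Hypothetical check: flag rows whose value in the column is missing."""
    return column.isna()


# Same shape as the documented list: (column_name, check_function).
EXCLUDE = [
    ("age", is_missing),
]


def apply_exclusions(raw_data: pd.DataFrame, exclude: list) -> pd.DataFrame:
    """Drop every row for which at least one check evaluates to True."""
    keep = pd.Series(True, index=raw_data.index)
    for column_name, check_function in exclude:
        keep &= ~check_function(raw_data[column_name])
    return raw_data.loc[keep]


# Invented raw table with one patient lacking an age.
raw_data = pd.DataFrame({"age": [61, None, 54], "sex": ["m", "f", "f"]})
kept = apply_exclusions(raw_data, EXCLUDE)
print(len(kept))
```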
global COLUMN_MAP
This is the actual mapping dictionary that describes how to transform the raw.csv table into the data.csv table that can be fed into and understood by LyProX.
See the lyscripts documentation for details on how this dictionary is used by the lyproxify script.
It contains a tree-like structure that is human-readable and mimics the tree of multi-level headers in the final data.csv file. For every column in the final data.csv file, the dictionary describes from which columns in the raw.csv file the data should be extracted and what function should be applied to it.
It also contains a __doc__ key for every sub-dictionary that describes what the respective column is about. This is used to generate the documentation for the README.md file of this data.
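To make the tree structure more concrete, here is a minimal sketch of such a dictionary and of a recursive walk over it. The leaf keys func and columns, the raw column name raw_age, and the function transform are assumptions made for this illustration. Only the __doc__ key is taken from the description above, and the real lyproxify script may work differently.

```python
import pandas as pd

# Hypothetical miniature mapping that mimics the three header levels.
COLUMN_MAP = {
    "patient": {
        "__doc__": "General patient information.",
        "#": {
            "__doc__": "Placeholder level.",
            "age": {
                "__doc__": "Age at diagnosis.",
                "func": int,             # assumed key: function to apply
                "columns": ["raw_age"],  # assumed key: raw input columns
            },
        },
    },
}


def transform(raw: pd.DataFrame, mapping: dict, prefix: tuple = ()) -> dict:
    """Walk the tree and build one output column per leaf."""
    columns = {}
    for key, value in mapping.items():
        if key == "__doc__":
            continue  # documentation only, not a column
        if "func" in value:
            args = [raw[col] for col in value["columns"]]
            columns[prefix + (key,)] = [value["func"](*row) for row in zip(*args)]
        else:
            columns.update(transform(raw, value, prefix + (key,)))
    return columns


raw = pd.DataFrame({"raw_age": ["61", "54"]})
data = pd.DataFrame(transform(raw, COLUMN_MAP))
print(data)
```

The tuple keys returned by transform become the three-level column header of the resulting table.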
Global Variables
- MRI_OR_CT_COL
- PATHOLOGY_COLS_POSITIVE
- PATHOLOGY_COLS_INVESTIGATED
- ALL_FALSE
- SUBLVL_PATTERN
- IB_TO_III_PATTERN
- EXCLUDE
- COLUMN_MAP
function smpl_date
python
smpl_date(entry)
function smpl_diagnose
python
smpl_diagnose(entry, *_args, **_kwargs)
function robust
python
robust(func: collections.abc.Callable) → Optional[Any]
Make casting function 'robust' by returning None when an error is thrown.
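A wrapper with this behavior could look like the sketch below. It is an assumption about the general idea, not the module's actual code. In particular, catching every Exception is a choice made for this example.

```python
from collections.abc import Callable
from typing import Any, Optional


def robust(func: Callable) -> Callable:
    """Wrap func so that any raised exception results in None."""

    def wrapper(*args, **kwargs) -> Optional[Any]:
        try:
            return func(*args, **kwargs)
        except Exception:
            return None

    return wrapper


robust_int = robust(int)
good = robust_int("42")   # casting succeeds
bad = robust_int("n/a")   # casting fails, so None is returned
print(good, bad)
```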
function get_subsite
python
get_subsite(entry, *_args, **_kwargs) → str | None
Get human-readable subsite from ICD-10 code.
function map_to_lnl
python
map_to_lnl(entry, tumor_side, *_args, **_kwargs) → list[str] | None
Map integers representing the location of the largest LN to the correct LNL.
function has_pathological_t
python
has_pathological_t(entry, *_args, **_kwargs) → bool
Check whether the pathological T-stage is available.
function map_t_stage
python
map_t_stage(clinical, pathological, *_args, **_kwargs) → int | None
Map their T-stage encoding to actual T-stages.
The clinical stage is only used if the pathological stage is not available.
function map_t_stage_prefix
python
map_t_stage_prefix(pathological, *_args, **_kwargs) → str | None
Determine whether T category was assessed clinically or pathologically.
function map_n_stage
python
map_n_stage(entry, *_args, **_kwargs) → int | None
Map their N-stage encoding to actual N-stage.
function map_location
python
map_location(entry, *_args, **_kwargs) → str | None
Map their location encoding to the semantic locations.
function map_side
python
map_side(entry, *_args, **_kwargs) → str | None
Map their side encoding to the semantic side.
function map_ct
python
map_ct(entry, mri_or_ct, *_args, **_kwargs) → bool | None
Call robust(smpl_diagnose) if the patient has a CT diagnosis.
function map_mri
python
map_mri(entry, mri_or_ct, *_args, **_kwargs) → bool | None
Call robust(smpl_diagnose) if the patient has an MRI diagnosis.
function from_pathology
python
from_pathology(entry) → tuple[dict[str, int], bool]
Infer how many nodes in an LNL were investigated/positive per resection, and whether the LNL showed signs of extracapsular extension (ECE).
The way the data was collected is a bit tricky: Generally, they report the number of nodes in an LNL that were investigated or positive (depending on the column one looks at). But if multiple levels were resected and investigated en bloc, they wrote the finding in each LNL and appended a letter to the number. So, if LNL I was resected together with LNL II and they found in total 10 nodes, they would write LNL I: 10a and LNL II: 10a.
Additionally, if extracapsular extension was found, they would add 100 to the number. And if parts of an LNL were resected with another LNL but another part of the LNL was investigated on its own, they would write something like 12 + 4b.
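The encoding described above can be parsed roughly as in the following sketch. It is not the module's implementation. The function name parse_entry, the dictionary key "self" for nodes investigated on their own, and the rule that any number of 100 or more signals ECE are assumptions made for this example.

```python
import re

# One part of an entry: a number, optionally followed by an en-bloc letter.
PART = re.compile(r"(\d+)\s*([a-z])?")


def parse_entry(entry: str) -> tuple[dict[str, int], bool]:
    """Split an entry like '12 + 4b' into counts per resection plus an ECE flag."""
    counts: dict[str, int] = {}
    has_ece = False
    for number, symbol in PART.findall(entry):
        value = int(number)
        if value >= 100:  # assumed: 100 was added to flag ECE
            value -= 100
            has_ece = True
        key = symbol or "self"  # "self": investigated on its own (hypothetical key)
        counts[key] = counts.get(key, 0) + value
    return counts, has_ece


enbloc = parse_entry("10a")      # 10 nodes in en-bloc resection 'a'
mixed = parse_entry("12 + 4b")   # 12 on its own, 4 within en-bloc resection 'b'
with_ece = parse_entry("103")    # 3 nodes, ECE flagged
print(enbloc, mixed, with_ece)
```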
function num_from_pathology
python
num_from_pathology(entry, *_args, **_kwargs) → int | None
Infer number of involved nodes in LNL from pathology report.
function binary_from_pathology
python
binary_from_pathology(entry, *_args, **_kwargs) → bool | None
Infer binary involvement from pathology report.
function num_super_from_pathology
python
num_super_from_pathology(*lnl_entries, lnl='I', side='left') → int | None
Infer number of involved nodes in super LNL (e.g. I, II and V) from pathology.
This involves checking if other LNLs have been resected with the LNL in question. In that case, we do not know if the LNL in question was involved or if it was only one of the co-resected LNLs.
function get_index
python
get_index(side: str, lnl: str) → int
Return the index of the LNL in the PATHOLOGY_COLS_INVESTIGATED array.
function num_Ib_to_III_from_pathology
python
num_Ib_to_III_from_pathology(*lnl_entries, side='left') → int | None
Infer number of involved lymph nodes in LNL Ib to III from pathology.
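Why this count differs from a plain sum over the LNLs Ib, II and III can be illustrated with a small sketch. It assumes that each LNL entry has already been split into counts per resection, with a hypothetical key "self" for nodes investigated on their own and a letter for each en-bloc resection. The function count_ib_to_iii is invented for this example, and the actual function may treat edge cases differently, for instance en-bloc resections that reach beyond LNL III.

```python
def count_ib_to_iii(parsed_levels: list[dict[str, int]]) -> int:
    """Sum nodes over several LNLs, counting each en-bloc resection only once."""
    total = 0
    seen_symbols: set[str] = set()
    for counts in parsed_levels:
        for key, value in counts.items():
            if key == "self":
                total += value  # nodes resolved for this LNL alone
            elif key not in seen_symbols:
                seen_symbols.add(key)  # the same letter reappears in co-resected LNLs
                total += value
    return total


# Invented example: LNL Ib on its own, LNLs II and III resected together as 'a'.
levels = [{"self": 2}, {"a": 10}, {"a": 10}]
total_nodes = count_ib_to_iii(levels)
naive_sum = sum(sum(counts.values()) for counts in levels)
print(total_nodes, naive_sum)
```

In this invented case the naive sum counts the ten jointly resected nodes twice.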
function binary_super_from_pathology
python
binary_super_from_pathology(*lnl_entries, lnl='I', side='left') → bool | None
Infer if super LNL is involved from pathology.
function enbloc_resected_from_pathology
python
enbloc_resected_from_pathology(*lnl_entries) → str | None
Return number and symbol of co-resected LNLs.
function map_ece
python
map_ece(*lnl_entries, **_kwargs)
Infer if the patient had LNL involvement with extra-capsular extension.
In the data, this is encoded by the value 100 being added to the number of positive lymph nodes in an LNL.
function get_ct_date
python
get_ct_date(entry, mri_or_ct, diagnose_date_col, *_args, **_kwargs)
Determine the date of the CT diagnosis.
If the date is missing, the date of diagnosis is used as a fallback.
function get_mri_date
python
get_mri_date(entry, mri_or_ct, *_args, **_kwargs)
Determine the date of the MRI diagnosis.