2023 CLB Multisite
This folder contains the pathology report data on lymphatic involvement in 373 patients with head and neck cancer, diagnosed and treated with neck dissection at the Centre Léon Bérard between 2001 and 2018.
The raw.csv data does contain patients that duplicate patient records in the 2021-clb-oropharynx folder, but they have been filtered out of the data.csv file.
Table of Contents
- Cohort Characteristics
- Curation
- Online Interface
- Data Description
- Documentation of Columns
- Mapping Documentation
Cohort Characteristics
Below we show some figures that aim to coarsely characterize the patient cohort in this directory.
Figure 1: Distribution over age, stratified by sex and smoking status.
Figure 2: Distribution over age, stratified by sex and smoking status.
Figure 3: Distribution over primary tumor subsite.
Curation
Curation and inclusion criteria will be published in a separate Data in Brief article that is currently under review.
Online Interface
We provide a user-friendly and intuitive graphical user interface to view the dataset, which is available at https://lyprox.org/. The GUI has two main functionalities: the patient list and the dashboard. The patient list allows for viewing the characteristics of a patient, corresponding to one row of the csv file, in a visually appealing and intuitive way. The dashboard allows for filtering of the dataset. For example, the user may select all patients with primary tumors extending over the mid-sagittal plane with involvement of ipsilateral level III. The dashboard will then display the number or percentage of patients with metastases in each of the other LNLs.
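The kind of query the dashboard answers can also be reproduced offline. The following is a minimal sketch, not the LyProX implementation: the four patient rows are made up, and it assumes that involvement is read from the pathology columns and stored as True, False, or None (unknown). The helper names `ipsi` and `involvement_counts` are hypothetical.

```python
# Column keys follow the three-level header layout of data.csv.
EXT = ("tumor", "1", "extension")


def ipsi(lnl):
    """Build the (hypothetically used) key of an ipsilateral pathology column."""
    return ("pathology", "ipsi", lnl)


# Made-up rows for illustration only; they are not taken from the dataset.
patients = [
    {EXT: True, ipsi("II"): True, ipsi("III"): True, ipsi("IV"): False},
    {EXT: True, ipsi("II"): False, ipsi("III"): True, ipsi("IV"): True},
    {EXT: False, ipsi("II"): True, ipsi("III"): True, ipsi("IV"): False},
    {EXT: True, ipsi("II"): True, ipsi("III"): False, ipsi("IV"): None},
]


def involvement_counts(rows, lnls):
    """Select patients with midline extension and ipsilateral LNL III
    involvement, then count involvement in each of the requested LNLs."""
    selected = [r for r in rows if r[EXT] and r[ipsi("III")]]
    counts = {
        lnl: sum(1 for r in selected if r[ipsi(lnl)] is True) for lnl in lnls
    }
    return len(selected), counts


n_selected, counts = involvement_counts(patients, ["II", "IV"])
print(n_selected, counts)
```

Unknown entries (None) are counted as not involved here; whether they should instead be excluded from the denominator is a design choice this sketch does not settle.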
Data Description
The data is provided as a CSV table containing one row for each of the 373 patients. The table has a header with three levels that describe the columns. Below we explain each column in the form of a list with three levels. So, for example, list entry 1.i.g refers to a column with the three-level header patient | # | alcohol_abuse, and underneath it we report each patient's history of alcohol abuse.
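To make the three-level header concrete, here is a minimal sketch that reads such a table with Python's standard csv module. The inline sample is made up (including the patient IDs and the way booleans are written) and only mimics the layout described above; it is not an excerpt of data.csv.

```python
import csv
import io

# Made-up sample in the described layout: three header rows, then one row per patient.
sample = io.StringIO(
    "patient,patient,tumor\n"
    "#,#,1\n"
    "id,alcohol_abuse,t_stage\n"
    "P001,True,2\n"
    "P002,False,3\n"
)

reader = csv.reader(sample)
# The first three rows are the header levels; zip them into one tuple per column.
level_0, level_1, level_2 = next(reader), next(reader), next(reader)
columns = list(zip(level_0, level_1, level_2))

# Every remaining row becomes a dict keyed by the three-level header tuple.
rows = [dict(zip(columns, values)) for values in reader]

# Values are read as strings; converting them is left to the caller.
alcohol = [row[("patient", "#", "alcohol_abuse")] for row in rows]
print(columns)
print(alcohol)
```

With pandas, passing `header=[0, 1, 2]` to `pandas.read_csv` yields the same structure as a column MultiIndex.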
Documentation of Columns
1. patient: This top-level header contains general patient information.
    i. #: The second-level header for the patient columns is only a placeholder.
        a. id: The patient ID.
        b. institution: The institution where the patient was treated.
        c. sex: The biological sex of the patient.
        d. age: The age of the patient at the time of diagnosis.
        e. weight: The weight of the patient at the time of diagnosis.
        f. diagnose_date: The date of surgery, used here because the raw file does not specify a date of diagnosis.
        g. alcohol_abuse: Whether the patient was abusing alcohol at the time of diagnosis.
        h. nicotine_abuse: Whether the patient was smoking nicotine at the time of diagnosis.
        i. hpv_status: The HPV p16 status of the patient.
        j. neck_dissection: Whether the patient underwent a neck dissection. In this dataset, all patients underwent a neck dissection.
        k. tnm_edition: The edition of the TNM classification used.
        l. n_stage: The pN category of the patient (pathologically assessed).
        m. m_stage: The M category of the patient. -1 refers to 'X'.
        n. extracapsular: Whether the patient had extracapsular spread. In this dataset, this information is only globally available, not for each individual lymph node level.
2. tumor: This top-level header contains general tumor information.
    i. 1: The second-level header enumerates synchronous tumors.
        a. location: The location of the tumor. This is empty for all patients because we can later infer it from the subsite's ICD-O-3 code.
        b. subsite: The subsite of the tumor, specified by ICD-O-3 code.
        c. central: Whether the tumor is located centrally w.r.t. the mid-sagittal plane.
        d. extension: Whether the tumor extended over the mid-sagittal line.
        e. volume: The volume of the tumor in cm^3.
        f. stage_prefix: The prefix of the T category.
        g. t_stage: The T category of the tumor.
3. pathology: This top-level header contains information from the pathology that received the LNLs resected during the neck dissection.
    i. info: This second-level header contains general information.
        a. date: The date of the pathology report (same as surgery).
    ii. ipsi: This reports the involvement of the ipsilateral LNLs.
        a. III: For example, this column reports the involvement of the ipsilateral LNL III.
    iii. contra: This reports the involvement of the contralateral LNLs.
        a. V: This column reports the pathologic involvement of the contralateral LNL V.
4. diagnostic_consensus: This top-level header contains information about the diagnostic consensus, which we assumed to be negative for each LNL that was not resected during the neck dissection. However, we do not know if it was positive for resected patients. This means that all columns under this top-level header are essentially inferred from looking at missing entries under the pathology columns.
    i. info: This second-level header contains general information.
        a. date: The date of the diagnostic consensus (same as surgery).
    ii. ipsi: This reports the diagnostic consensus of the ipsilateral LNLs.
        a. Ib: E.g., this column reports the diagnostic consensus of the ipsilateral LNL Ib.
    iii. contra: This reports the diagnostic consensus of the contralateral LNLs.
        a. III: Under this column, we report the diagnostic consensus of the contralateral LNL III.
5. total_dissected: This top-level header contains information about the total number of dissected and pathologically investigated lymph nodes per LNL.
    i. info: This second-level header contains general information.
        a. date: The date of the neck dissection.
    ii. ipsi: This reports the total number of dissected lymph nodes per ipsilateral LNL.
        a. II: For instance, this column reports the total number of dissected lymph nodes in ipsilateral LNL II.
        b. Ib_to_III: This column reports the total number of dissected lymph nodes in ipsilateral LNL Ib to III. This column exists for convenience because we created a figure based on this.
    iii. contra: This reports the total number of dissected lymph nodes per contralateral LNL.
        a. VII: This column reports the total number of dissected lymph nodes in contralateral LNL VII.
        b. Ib_to_III: This column reports the total number of dissected lymph nodes in contralateral LNL Ib to III. This column exists for convenience because we created a figure based on this.
6. positive_dissected: This top-level header contains information about the number of dissected lymph nodes per LNL that were pathologically found to be positive.
    i. info: This second-level header contains general information.
        a. date: The date of the neck dissection.
    ii. ipsi: This reports the number of dissected lymph nodes per ipsilateral LNL that were pathologically found to be positive.
        a. IV: Here, we report the number of metastatic lymph nodes in ipsilateral LNL IV.
        b. Ib_to_III: This column reports the number of metastatic dissected lymph nodes in ipsilateral LNL Ib to III. This column exists for convenience because we created a figure based on this.
    iii. contra: This reports the number of dissected lymph nodes per contralateral LNL that were pathologically found to be positive.
        a. Ia: This column reports the number of metastatic lymph nodes in contralateral LNL Ia.
        b. Ib_to_III: This column reports the number of metastatic dissected lymph nodes in contralateral LNL Ib to III. This column exists for convenience because we created a figure based on this.
module mapping
Map the raw.csv data from the 2023-clb-multisite cohort to the data.csv file.
This module defines how the command lyscripts data lyproxify (see here for the documentation of the lyscripts module) should handle the raw.csv data that was extracted at the Centre Léon Bérard in order to transform it into a LyProX-compatible data.csv file.
The most important definitions in here are the list EXCLUDE and the dictionary COLUMN_MAP, which defines how to construct the new columns based on the raw.csv data. They are described in more detail below:
global EXCLUDE
List of tuples specifying which function to run for which columns to find out if patients/rows should be excluded in the lyproxified data.csv.
The first element of each tuple is the flattened multi-index column name, the second element is the function to run on the column to determine if a patient/row should be excluded:
python
EXCLUDE = [
    (column_name, check_function),
]
Essentially, a row is excluded if check_function(raw_data[column_name]) evaluates to True for that row.
More information can be found in the documentation of the lyproxify function.
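The exclusion rule can be illustrated with plain Python lists. This is a minimal sketch and not the lyproxify implementation: the raw column names, the check function `is_flagged`, and the helper `apply_exclusions` are hypothetical, and the check is assumed to return one boolean per row.

```python
def is_flagged(column):
    """Hypothetical check: mark every row whose entry is non-empty."""
    return [value != "" for value in column]


# Made-up raw table, stored column-wise.
raw_data = {
    "patient_id": ["A", "B", "C"],
    "duplicate_flag": ["", "x", ""],
}

EXCLUDE = [
    ("duplicate_flag", is_flagged),
]


def apply_exclusions(data, exclude):
    """Drop every row for which any check function reports True."""
    n_rows = len(next(iter(data.values())))
    keep = [True] * n_rows
    for column_name, check_function in exclude:
        flags = check_function(data[column_name])
        keep = [k and not f for k, f in zip(keep, flags)]
    return {
        name: [v for v, k in zip(values, keep) if k]
        for name, values in data.items()
    }


filtered = apply_exclusions(raw_data, EXCLUDE)
print(filtered)
```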
global COLUMN_MAP
This is the actual mapping dictionary that describes how to transform the raw.csv table into the data.csv table that can be fed into and understood by LyProX.
See here for details on how this dictionary is used by the lyproxify script.
It contains a tree-like structure that is human-readable and mimics the tree of multi-level headers in the final data.csv file. For every column in the final data.csv file, the dictionary describes from which columns in the raw.csv file the data should be extracted and what function should be applied to it.
It also contains a __doc__ key for every sub-dictionary that describes what the respective column is about. This is used to generate the documentation for the README.md file of this data.
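A tree of this kind might look like the following minimal sketch. Only the __doc__ key is confirmed by the description above; the leaf keys "func" and "columns", the raw column names, and the helper `transform_row` are assumptions made for illustration and may differ from what lyproxify actually expects.

```python
# Hypothetical excerpt of a mapping tree that mirrors the three header levels.
COLUMN_MAP = {
    "patient": {
        "__doc__": "General patient information.",
        "#": {
            "__doc__": "Placeholder second-level header.",
            "id": {
                "__doc__": "The patient ID.",
                "func": str,              # assumed key: conversion function
                "columns": ["raw_id"],    # assumed key: raw input columns
            },
            "age": {
                "__doc__": "Age at diagnosis.",
                "func": int,
                "columns": ["raw_age"],
            },
        },
    },
}


def transform_row(raw_row, column_map, prefix=()):
    """Walk the tree and build one output row keyed by header tuples."""
    result = {}
    for key, value in column_map.items():
        if key == "__doc__":
            continue  # documentation only, not a column
        path = prefix + (key,)
        if "func" in value:
            # Leaf: call the function on the listed raw columns.
            args = [raw_row[name] for name in value["columns"]]
            result[path] = value["func"](*args)
        else:
            # Inner node: descend one header level.
            result.update(transform_row(raw_row, value, path))
    return result


new_row = transform_row({"raw_id": 17, "raw_age": "63"}, COLUMN_MAP)
print(new_row)
```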
Global Variables
- TNM_COLS
- IB_TO_III_DISSECTED
- EXCLUDE
- COLUMN_MAP
function smpl_date
python
smpl_date(entry: str) → str
Parse date from string.
function smpl_diagnose
python
smpl_diagnose(entry: str | int, *_args, **_kwargs) → bool
Parse the diagnosis.
function robust
python
robust(func: collections.abc.Callable) → Optional[Any]
Wrapper that makes any type-conversion function 'robust' by simply returning None whenever any exception is thrown.
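The docstring describes a common decorator pattern. A minimal sketch of how such a wrapper could be written, which is not necessarily the module's actual code:

```python
def robust(func):
    """Return a version of `func` that yields None instead of raising."""
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Any failed conversion is treated as a missing value.
            return None
    return wrapper


robust_int = robust(int)
print(robust_int("42"), robust_int("n/a"))
```

Swallowing every exception keeps the mapping running over messy raw data, at the cost of hiding genuine bugs in the wrapped function.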
function get_subsite
python
get_subsite(entry: str, *_args, **_kwargs) → str | None
Get human-readable subsite from ICD-10 code.
function parse_pathology
python
parse_pathology(entry, *_args, **_kwargs) → bool | None
Transform number of positive nodes to True, False or None.
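One plausible reading of this docstring is sketched below. It assumes that a missing or non-numeric entry means the involvement is unknown (None) and that any count above zero means the level is involved; the actual function may handle edge cases differently.

```python
def parse_pathology(entry, *_args, **_kwargs):
    """Map a count of positive nodes to True, False, or None (sketch)."""
    try:
        n_positive = int(entry)
    except (TypeError, ValueError):
        # None, empty strings, and NaN end up here: involvement unknown.
        return None
    return n_positive > 0


print(parse_pathology(3), parse_pathology(0), parse_pathology(None))
```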
function set_diagnostic_consensus
python
set_diagnostic_consensus(entry, *_args, **_kwargs)
Return False, meaning 'healthy', when no entry about a resected LNL is available. This is a hack to tackle the issue described here: https://github.com/rmnldwg/lyprox/issues/92
function extract_hpv
python
extract_hpv(value: int | None, *_args, **_kwargs) → bool | None
Translate the HPV value to a boolean.
function strip_letters
python
strip_letters(entry: str, *_args, **_kwargs) → int
Remove letters following a number.
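A minimal sketch of this behavior using a regular expression. It assumes the entry starts with the number of interest and raises on entries without one; how the real function treats such entries is not documented here.

```python
import re


def strip_letters(entry, *_args, **_kwargs):
    """Keep the leading number of an entry and drop trailing letters (sketch)."""
    match = re.match(r"\s*(\d+)", str(entry))
    if match is None:
        raise ValueError(f"no leading number in {entry!r}")
    return int(match.group(1))


print(strip_letters("4a"), strip_letters("12"))
```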
function clean_cat
python
clean_cat(cat: str) → int
Extract T or N category as integer from the respective string. I.e., turn 'pN2+' into 2.
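The conversion from 'pN2+' to 2 can be sketched as follows. This assumes the category is the first digit in the string and that strings without a digit (for example an 'X' category) should raise; the actual function may treat those cases differently.

```python
import re


def clean_cat(cat):
    """Extract the first digit of a T or N category string as an int (sketch)."""
    match = re.search(r"\d", cat)
    if match is None:
        raise ValueError(f"no digit found in {cat!r}")
    return int(match.group())


print(clean_cat("pN2+"), clean_cat("cT4a"))
```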
function get_tnm_info
python
get_tnm_info(ct7, cn7, pt7, pn7, ct8, cn8, pt8, pn8) → tuple[int, int, int, str]
Determine the TNM edition used based on which versions are available for T and/or N category.
function get_t_category
python
get_t_category(*args, **_kwargs) → int
Extract the T-category.
function get_n_category
python
get_n_category(*args, **_kwargs) → int
Extract the N-category.
function get_tnm_version
python
get_tnm_version(*args, **_kwargs) → int
Extract the TNM version.
function get_tnm_prefix
python
get_tnm_prefix(*args, **_kwargs) → str
Extract the TNM prefix.
function check_excluded
python
check_excluded(column: pandas.core.series.Series) → Index
Check if a patient/row is excluded based on the content of a column.
For the 2023 CLB multisite dataset this is the case when the first column with the three-level header ("Bauwens", "Database", "0_lvl_2") is not empty or does not contain the character 'n'.
function sum_columns
python
sum_columns(*columns, **_kwargs) → int
Sum the values of multiple columns.