Instructions for preparing data files for DeepLR

Before using the DeepLR software, you need to create two CSV files: one for demographic data at the S2 screening visit and one for CT screening data from both the S1 and S2 screening visits. Please read the instructions below carefully when preparing these two files. If your data files are not prepared according to these instructions, the resulting DeepLR predictions will not be accurate. Because DeepLR calculates risk based on the natural history of lung cancer development, it does not apply to individuals who have received any treatment for lung nodules (including cancer treatment or surgical removal of any lung nodule), since such treatment could change the risk of cancer development.

The CT screening data file contains one row per nodule per visit. The file must be saved in CSV format with the file name extension “.csv”. Each patient must have two different visit numbers: one for the S1 visit and one for the current S2 visit. You may label these two visits with any numbers you prefer, but the S2 visit number must be greater than the S1 visit number.
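The visit-numbering rule can be checked before running DeepLR. The following Python sketch is illustrative only and is not part of DeepLR. The function name `check_visits` and the `has_header` switch are hypothetical, and the sketch assumes that the CT file starts with a header row and that pid and visit are stored in columns 1 and 2.

```python
import csv
import io
from collections import defaultdict


def check_visits(csv_text, has_header=True):
    """Return {pid: (s1_visit, s2_visit)} for a CT screening file given as text.

    Assumptions (not stated by DeepLR): column 1 is pid, column 2 is visit,
    and has_header tells whether the first row holds column names.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if has_header:
        rows = rows[1:]

    visits = defaultdict(set)
    for row in rows:
        visits[row[0]].add(float(row[1]))

    result = {}
    for pid, seen in visits.items():
        if len(seen) != 2:
            raise ValueError(
                f"pid {pid}: expected 2 distinct visit numbers, found {sorted(seen)}"
            )
        # The smaller visit number is treated as S1, the larger as S2.
        result[pid] = (min(seen), max(seen))
    return result


# Made-up rows for illustration: two nodules at S1, one nodule at S2.
example = """pid,visit,days,nodule_id
101,1,0,1
101,1,0,2
101,2,365,1
"""
print(check_visits(example))  # {'101': (1.0, 2.0)}
```

A participant with only one visit number, or with three or more, raises a `ValueError` that names the participant, so the file can be corrected before it is passed to DeepLR.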

Demographic data dictionary

Column order | Variable name | Label | Format
1 | pid | Unique participant ID |
2 | age_at_S2 | Age in years |
3 | age_quit_smoking | Age at most recent smoking cessation | For current smokers, leave it blank or impute it by 100
4 | smokeyr | Total number of years the participant smoked cigarettes. |
5 | pkyr | Pack years | Pack years, calculated as: (Total Years Smoked x Cigarettes Per Day / 20). Allows for 2 significant digits.
6 | female | Participant gender | 1=female, 0=male
7 | family_hx_LC | Family history of lung cancer? | 1=yes, 0=no
8 | days_emphysema | Duration in days with emphysema | Computed since the S1 visit. If emphysema is first seen in S2 visit, assign its duration to be 1 day.
9 | N_Scr_Pri2Yr | Total number of screenings received over the past 2 years | Computed from S2 visit date back for 2 years. This is because S2 visit is the current screening visit.
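
The pack-years formula in the dictionary above can be written out as a short Python sketch. This is an illustration under stated assumptions, not DeepLR code: the function name `smoking_totals` and the per-period input format are hypothetical. Summing over separate smoking periods is one way to leave out years in which the participant did not smoke.

```python
def smoking_totals(periods):
    """Return (smokeyr, pkyr) from a list of smoking periods.

    Each period is a hypothetical (start_age, stop_age, cigarettes_per_day)
    tuple. Years between periods are not counted.
    """
    smokeyr = 0.0
    pkyr = 0.0
    for start_age, stop_age, cigarettes_per_day in periods:
        years = stop_age - start_age
        if years < 0:
            raise ValueError("stop_age must not be smaller than start_age")
        smokeyr += years
        # Pack years = years smoked x cigarettes per day / 20
        pkyr += years * cigarettes_per_day / 20
    return smokeyr, pkyr


# One continuous period: 20 years at 20 cigarettes per day.
print(smoking_totals([(20, 40, 20)]))  # (20.0, 20.0)

# Stopped at 40, resumed at 45 with 10 cigarettes per day until 55.
print(smoking_totals([(20, 40, 20), (45, 55, 10)]))  # (30.0, 25.0)
```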

Remarks for the demographics data file creation:

  1. The demographics data must be saved in CSV file format with the file name extension “.csv”.
  2. Variables must be in exactly the same order as indicated by the “Column order” of the “Demographic data dictionary”. That is, pid data are saved in column 1, age_at_S2 data are saved in column 2, …, and N_Scr_Pri2Yr data are saved in column 9. Only the first 9 columns of the demographics data file are read by DeepLR.
  3. One patient per row. All entries must be numeric.
  4. All demographic data must be measured as of the time of the S2 screening.
  5. For individuals who stopped smoking and later resumed, the total years of smoking should not include the years in which the individual did not smoke. The same applies to the pack-years calculation.
  6. The total number of screenings received over the past 2 years is counted back from the S2 visit date.
  7. No missing values are allowed for any demographic variable.
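
The remarks above can be turned into a simple pre-check of the demographics file. The Python sketch below is only an illustration and is not part of DeepLR. The function name `check_demographics` is hypothetical, and the sketch assumes that the file starts with a header row containing the variable names. Because the data dictionary allows age_quit_smoking to be left blank for current smokers, a blank in that column is not reported.

```python
import csv
import io

DEMOGRAPHIC_COLUMNS = [
    "pid", "age_at_S2", "age_quit_smoking", "smokeyr", "pkyr",
    "female", "family_hx_LC", "days_emphysema", "N_Scr_Pri2Yr",
]


def check_demographics(csv_text):
    """Return a list of problems found (an empty list means none were found)."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return ["file is empty"]

    problems = []
    # Assumption: the first row is a header; only the first 9 columns matter.
    if [name.strip() for name in rows[0][:9]] != DEMOGRAPHIC_COLUMNS:
        problems.append("header: first 9 names differ from the data dictionary order")

    seen_pids = set()
    for row_no, row in enumerate(rows[1:], start=1):
        cells = [cell.strip() for cell in row[:9]]
        if len(cells) < 9:
            problems.append(f"row {row_no}: fewer than 9 columns")
            continue
        if cells[0] in seen_pids:
            problems.append(f"row {row_no}: pid {cells[0]} appears more than once")
        seen_pids.add(cells[0])

        for name, cell in zip(DEMOGRAPHIC_COLUMNS, cells):
            if cell == "":
                # Blank is tolerated only for age_quit_smoking (current smokers).
                if name != "age_quit_smoking":
                    problems.append(f"row {row_no}: {name} is blank")
                continue
            try:
                value = float(cell)
            except ValueError:
                problems.append(f"row {row_no}: {name} is not numeric")
                continue
            if name in ("female", "family_hx_LC") and value not in (0.0, 1.0):
                problems.append(f"row {row_no}: {name} must be 0 or 1")
    return problems


# Made-up values for illustration only.
header = ",".join(DEMOGRAPHIC_COLUMNS)
good_file = header + "\n1,65,100,40,40,1,0,1,2\n"
bad_file = header + "\n2,70,,abc,30,2,0,1,1\n"

print(check_demographics(good_file))
print(check_demographics(bad_file))
```

Returning a list of messages rather than stopping at the first problem makes it easier to correct a whole file in one pass.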

Low-dose CT screening data dictionary 

Column order | Variable name | Label | Format
1 | pid | Unique participant ID |
2 | visit | Integer to indicate S1 or S2 visit | Numeric. Visit for S1 must be smaller than visit for S2.
3 | days | Number of days from the first screening visit | It can also be the number of days from any fixed reference date, since DeepLR only uses the number of days between the S1 and S2 visits.
4 | nodule_id | Unique nodule identifier | Numeric. A previously resected or resolved nodule should not be entered into the data file.
5 | lobe_location | Location of epicenter for non-calcified nodules or masses with >= 4 mm diameter | 1=“Right Upper Lobe”, 2=“Right Middle Lobe”, 3=“Right Lower Lobe”, 4=“Left Upper Lobe”, 5=“Lingula”, 6=“Left Lower Lobe”
6 | new_or_first_seen | Whether the nodule is visible at 4 mm or above for the first time in the S2 visit CT scan | 1=yes, 0=no. Leave it blank or impute 0 for the S1 screening visit.
7 | diameter | Average diameter (in mm) | Calculated as the average of the longest diameter and the longest perpendicular diameter in the same CT slice (in mm)
8 | attenuation | Predominant attenuation | 1=“Soft Tissue”, 2=“Ground glass”, 3=“Mixed”, 9=“Other”
9 | spiculated | Spiculated (stellate) margin | 1=yes, 0=no
10 | growing | Whether the nodule has increased in size compared to the prior screening | 1=yes, 0=no (leave it blank or impute 0 for the S1 visit)
11 | incr.density | Whether the nodule has increased in density compared to the prior screening | 1=yes, 0=no (leave it blank or impute 0 for the S1 visit)
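
The numeric codes and the diameter definition in the dictionary above can be applied as in the following Python sketch. It is an illustration only, not DeepLR code: the function names `average_diameter` and `encode_nodule` and their argument names are hypothetical, and the example measurements are made up.

```python
# Codes copied from the CT screening data dictionary.
LOBE_CODES = {
    "Right Upper Lobe": 1,
    "Right Middle Lobe": 2,
    "Right Lower Lobe": 3,
    "Left Upper Lobe": 4,
    "Lingula": 5,
    "Left Lower Lobe": 6,
}
ATTENUATION_CODES = {"Soft Tissue": 1, "Ground glass": 2, "Mixed": 3, "Other": 9}


def average_diameter(longest_mm, perpendicular_mm):
    """Average of the longest and longest perpendicular diameter (same slice)."""
    return (longest_mm + perpendicular_mm) / 2


def encode_nodule(pid, visit, days, nodule_id, lobe, new_or_first_seen,
                  longest_mm, perpendicular_mm, attenuation, spiculated,
                  growing, incr_density):
    """Return one CT file row as a list, in the column order of the dictionary."""
    return [
        pid,
        visit,
        days,
        nodule_id,
        LOBE_CODES[lobe],
        new_or_first_seen,
        average_diameter(longest_mm, perpendicular_mm),
        ATTENUATION_CODES[attenuation],
        int(spiculated),
        growing,
        incr_density,
    ]


row = encode_nodule(
    pid=101, visit=2, days=365, nodule_id=1, lobe="Right Upper Lobe",
    new_or_first_seen=0, longest_mm=6.0, perpendicular_mm=4.0,
    attenuation="Ground glass", spiculated=False, growing=1, incr_density=0,
)
print(row)  # [101, 2, 365, 1, 1, 0, 5.0, 2, 0, 1, 0]
```

Looking labels up in a dictionary means that a misspelled lobe or attenuation label fails with a `KeyError` instead of writing a wrong code to the file.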

Remarks for the CT screening data file creation:

  1. The CT screening data must be saved in CSV file format with file name extension “.csv”.
  2. Variables must be in exactly the same order as indicated by the “Column order”. That is, pid data are saved in column 1, visit data are saved in column 2, …, incr.density data are saved in column 11. Only the first 11 columns of the CT screening data file are read by DeepLR.
  3. Each patient must have two different visit numbers: one for prior S1 visit, and one for current S2 visit. Although you can label the visit by any numerical value you prefer, the S2 visit number must be greater than the S1 visit number. 
  4. One row per nodule per visit. If no nodule was found at a screening visit (either the S1 or the S2 visit), you still need to include a row in the CT screening data file to indicate that a screening was done on that date. In this row, enter data only for the first 3 variables (“pid”, “visit”, and “days”) and leave all other variables blank.
  5. All data entries must be numeric.
  6. For the S2 visit, include only non-calcified nodules whose longest diameter is 4 mm or above. Because diameter is calculated as the average of the longest diameter and the longest perpendicular diameter in the same CT slice, nodules with diameter < 4 mm should still be included as long as their longest diameter is 4 mm or above. Nodules apparently caused by infection should not be included.
  7. S1 visit scan data should include all non-calcified nodules that satisfy either of the following conditions: (1) the longest diameter is 4 mm or above; (2) the diameter is > 1 mm in the S1 scan and the longest diameter increased to 4 mm or above in the S2 scan.
  8. A new nodule (new_or_first_seen=1) is defined as a non-calcified nodule whose longest diameter is 4 mm or above in the S2 visit scan and that was not visible (diameter < 1 mm) in the S1 visit scan.
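
The size-based inclusion rules in the last three remarks can be sketched in Python as follows. This is a minimal illustration and not DeepLR code. The function names are hypothetical, and the sketch assumes that calcified nodules and nodules apparently caused by infection have already been excluded. Here “diameter” means the average diameter defined in the data dictionary, and “longest” means the longest diameter.

```python
def include_at_s2(longest_s2_mm):
    """S2 rows: keep nodules whose longest diameter is 4 mm or above."""
    return longest_s2_mm >= 4


def include_at_s1(longest_s1_mm, diameter_s1_mm, longest_s2_mm=None):
    """S1 rows: keep a nodule if either condition of the S1 remark holds.

    longest_s2_mm is None when the nodule has no S2 measurement.
    """
    if longest_s1_mm >= 4:
        return True
    grew_to_4mm = longest_s2_mm is not None and longest_s2_mm >= 4
    return diameter_s1_mm > 1 and grew_to_4mm


def is_new_nodule(diameter_s1_mm, longest_s2_mm):
    """new_or_first_seen=1: 4 mm or above at S2 and not visible (< 1 mm) at S1."""
    return longest_s2_mm >= 4 and diameter_s1_mm < 1


# Longest 4.2 mm, perpendicular 3.0 mm: the average is 3.6 mm,
# yet the nodule is still included at S2 because the longest diameter is >= 4 mm.
print(include_at_s2(4.2))

# Small at S1 (longest 3.0 mm, diameter 2.5 mm) but 4.5 mm at S2: include at S1.
print(include_at_s1(3.0, 2.5, longest_s2_mm=4.5))

# Not visible at S1 (diameter 0 mm) and 5.0 mm at S2: a new nodule.
print(is_new_nodule(0.0, 5.0))
```

The sketch follows the thresholds exactly as written, so a nodule with an S1 diameter of exactly 1 mm matches neither the “> 1 mm” condition nor the “< 1 mm” condition and should be reviewed by hand.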

Examples of the demographic data file and the CT screening data file, as well as their template files, can be downloaded from here.