De-Identification File Profiles

Most de-identification settings are defined on a per-file type basis.

A de-id profile (YAML or JSON) can be composed of multiple individual file profiles. Each file profile is defined under a specific file profile “key”. Here is a very simple example of a de-id profile which defines two file profiles, dicom and jpg:

# The name of the de-id profile
name: An example
# A description of the de-id profile
description: A very simplistic de-id profile for Dicom and JPG files
# The Dicom definition
dicom:
  fields:
    - name: PatientID
      replace-with: REDACTED

# The JPG definition
jpg:
  date-increment: 10
  fields:
    - name: DateTime
      increment-datetime: true

There are a few global settings to discuss before looking at each file profile specifically.

Global file settings

The following global settings are available:

salt (string)

This optional salt string is used for all hash-based field transformations. Using a different salt value will result in different (but consistent) values for hashed fields. This value can be any string.
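As a rough illustration of why the salt matters, the sketch below hashes a value together with a salt. The helper name salted_hash and the choice of SHA-256 are assumptions made for this example only; they do not describe the tool's actual hashing scheme.

```python
import hashlib


def salted_hash(value: str, salt: str = "") -> str:
    """Hypothetical helper: hash a field value together with a salt."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


# Same salt and same value: the output is always the same (consistent).
first = salted_hash("PATIENT-001", salt="project-a")
second = salted_hash("PATIENT-001", salt="project-a")

# A different salt gives a different, but equally consistent, output.
other = salted_hash("PATIENT-001", salt="project-b")
```

Keeping one salt per project would therefore keep hashed values linkable within that project while making them differ from values hashed with another salt.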

file-filter (string or list)

When set, this controls the filename pattern(s) that a profile will process. Patterns are Unix shell style:

* matches everything
? matches any single character
[seq] matches any character in seq
[!seq] matches any char not in seq

file-filter can be defined as a single string or a list of strings.

For instance, a file-filter defined as ['*.tiff', '*.tif'] will match TIFF files with either the .tiff or the .tif extension.

Default value varies depending on file profile specific settings.
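The pattern rules listed above are the same ones implemented by Python's standard fnmatch module, so a filter can be tried out with the sketch below. The helper matches_filter is hypothetical, and the use of case-sensitive matching (fnmatchcase) is an assumption about how the filter is applied.

```python
from fnmatch import fnmatchcase


def matches_filter(filename, file_filter):
    """Return True if filename matches any pattern in file_filter.

    file_filter may be a single pattern string or a list of patterns.
    """
    patterns = [file_filter] if isinstance(file_filter, str) else list(file_filter)
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


tiff_filter = ["*.tiff", "*.tif"]
print(matches_filter("scan.tif", tiff_filter))       # matched by '*.tif'
print(matches_filter("scan.png", tiff_filter))       # matched by neither pattern
print(matches_filter("a1.dcm", "a[!0-9].dcm"))       # '1' is excluded by [!0-9]
```

Under case-sensitive matching, "scan.TIF" would not match '*.tif', which is consistent with the per-profile defaults listing both lower-case and upper-case extensions.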

date-increment (numeric)

When set, this controls by how much time, in days, to offset each date or datetime field for which the increment-date or increment-datetime transformation is chosen. Positive values result in later dates; negative values result in earlier dates. Incrementing by a multiple of 7 keeps the weekday consistent for shifted dates. Incrementing by a non-integer value also modifies the time of datetime elements (e.g. 0.5 shifts a datetime by 12 hours).
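The arithmetic described above can be checked with a small sketch based on Python's datetime module. The helper name shift is hypothetical; it only illustrates the day-offset behaviour, not the tool's implementation.

```python
from datetime import datetime, timedelta


def shift(value: datetime, date_increment: float) -> datetime:
    """Offset a datetime by date_increment days (may be negative or fractional)."""
    return value + timedelta(days=date_increment)


original = datetime(2020, 2, 20, 10, 30)

later = shift(original, 10)        # positive value: later date
earlier = shift(original, -17)     # negative value: earlier date
same_weekday = shift(original, 14) # multiple of 7: weekday is preserved
half_day = shift(original, 0.5)    # non-integer value: time of day changes too
```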

date-format (string)

The optional string representation of the date found in the metadata of the file. The format interpretation follows the format codes that the 1989 C standard requires. More on how to format here.

Default is to use the Dicom date format “%Y%m%d”.

datetime-format (string)

The optional string representation of the datetime found in the metadata of the file. The format interpretation follows the format codes that the 1989 C standard requires. More on how to format here.

Default is to use the Dicom datetime format “%Y%m%d%H%M%S.%f”.
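To see what the two default formats look like in practice, the sketch below parses and re-serialises values with Python's datetime.strptime and strftime. That the tool interprets the format strings exactly as Python does is an assumption here; the sample values and the custom "%Y-%m-%d" format are made up for illustration.

```python
from datetime import datetime

DATE_FORMAT = "%Y%m%d"                # default date-format
DATETIME_FORMAT = "%Y%m%d%H%M%S.%f"   # default datetime-format

# Parse values written in the default Dicom-style formats.
parsed_date = datetime.strptime("20200220", DATE_FORMAT)
parsed_datetime = datetime.strptime("20200220103000.500000", DATETIME_FORMAT)

# A file that stores dates as 2020-02-20 would need a custom date-format.
custom = datetime.strptime("2020-02-20", "%Y-%m-%d")

# Writing the value back with the same format restores the original string.
round_trip = parsed_datetime.strftime(DATETIME_FORMAT)
```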

uid-prefix-fields (integer)

The optional number of prefix blocks to be kept from the original UID when generating the new hashed UID.

Default is 4.

uid-suffix-fields (integer)

The optional number of suffix blocks to be kept from the original UID when generating the new hashed UID.

Default is 1.

uid-numeric-name (string)

The optional UID prefix to be used when generating new hash UID. Usually it will correspond to an OID registered numeric name. The number of fields in uid-numeric-name must match the uid-prefix-fields.

Default is to use the original UID prefix as defined by uid-prefix-fields.
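The interplay of the three UID settings can be pictured with the sketch below. It only illustrates how prefix and suffix blocks might be preserved around a hashed middle part: the function hash_uid_sketch, the SHA-256 digest, and the way the digest is turned into six numeric blocks are all assumptions, not the tool's actual algorithm.

```python
import hashlib


def hash_uid_sketch(uid, prefix_fields=4, suffix_fields=1, numeric_name=None, salt=""):
    """Hypothetical sketch: keep prefix/suffix blocks, hash the rest."""
    blocks = uid.split(".")
    if numeric_name is not None:
        prefix = numeric_name.split(".")
        if len(prefix) != prefix_fields:
            raise ValueError("uid-numeric-name must have uid-prefix-fields blocks")
    else:
        prefix = blocks[:prefix_fields]
    suffix = blocks[len(blocks) - suffix_fields:] if suffix_fields else []
    digest = hashlib.sha256((salt + uid).encode("utf-8")).hexdigest()
    # Six decimal blocks derived from the digest; str(int(...)) has no leading zeros.
    middle = [str(int(digest[i:i + 5], 16)) for i in range(0, 30, 5)]
    return ".".join(prefix + middle + suffix)


uid = "1.2.840.113619.2.55.3.12345.678"
new_uid = hash_uid_sketch(uid)                                 # defaults: 4 prefix, 1 suffix
renamed = hash_uid_sketch(uid, numeric_name="2.16.840.1")      # registered prefix instead
```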

jitter-range (numeric)

The optional range to be used when offsetting the value by a random number. The new value is offset by a number in [-jitter-range, +jitter-range].

Default is 2.

jitter-type (string)

Either “int” or “float”. “float” draws the random number from a uniform distribution over [-jitter-range, +jitter-range]. “int” draws a random integer from [-jitter-range, +jitter-range].

Default is “float”.
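The two jitter settings can be sketched with Python's random module as follows. The helper jitter and the seeded generator are hypothetical and only illustrate the ranges described above.

```python
import random


def jitter(value, jitter_range=2, jitter_type="float", rng=random):
    """Hypothetical helper: offset value by a random amount within the range."""
    if jitter_type == "float":
        return value + rng.uniform(-jitter_range, jitter_range)
    if jitter_type == "int":
        bound = int(jitter_range)
        return value + rng.randint(-bound, bound)
    raise ValueError(f"unknown jitter-type: {jitter_type!r}")


rng = random.Random(0)  # seeded only to make the illustration repeatable
float_samples = [jitter(70.0, rng=rng) for _ in range(1000)]
int_samples = [jitter(70, jitter_type="int", rng=rng) for _ in range(1000)]
```

With the default range of 2, every sample stays within 2 units of the original value of 70.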

replace-with-insert (bool)

If True, the replace-with action will insert the field into the record if it does not already exist and set its value. If False, replace-with will replace the field value only if the field already exists in the record.

Default is “True”.
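The difference between the two modes can be shown on a plain dictionary standing in for a record. The helper replace_with below is hypothetical and is not the tool's API.

```python
def replace_with(record, field, value, insert=True):
    """Hypothetical helper mimicking replace-with on a dict-based record."""
    if field in record or insert:
        record[field] = value
    return record


# replace-with-insert: true -> the missing field is created.
inserted = replace_with({"PatientID": "12345"}, "PatientName", "REDACTED", insert=True)

# replace-with-insert: false -> the missing field stays absent.
untouched = replace_with({"PatientID": "12345"}, "PatientName", "REDACTED", insert=False)

# In both modes an existing field is replaced.
replaced = replace_with({"PatientID": "12345"}, "PatientID", "REDACTED", insert=False)
```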

filenames (list)

The optional list defining how files get renamed when processed. Each element of filenames must be a dictionary defining at least input-regex and output. input-regex defines the regular expression used to match the input filename and extract the relevant group(s) from it. output defines the filename under which the de-id file will be saved, in Python f-string notation. Optionally, a groups key can be defined to list the transformations to be applied to the named capture group(s) of input-regex or to record field values.

Important

As opposed to fields, the transformations defined under groups do NOT impact the de-id record. The transformations are only made available to output.

An example of filenames definition looks like this for a Dicom profile:

dicom:
    date-increment: -17
    filenames:
    - output: '{SOPInstanceUID}_{regdate}.dcm'
      input-regex: '^(?P<notused>\w+)-(?P<regdate>\d{4}-\d{2}-\d{2}).dcm$'
      groups:
        - name: regdate
          increment-date: true
        - name: SOPInstanceUID
          hashuid: true
    - output: '{filenameuid}_{regdatetime}.dcm'
      input-regex: '^(?P<filenameuid>[\w.]+)-(?P<regdatetime>[\d\s:-]+).dcm$'
      groups:
        - name: filenameuid
          hashuid: true
        - name: regdatetime
          increment-datetime: true

In this example, a file matching the first input-regex (e.g. “acquisition-2020-02-20.dcm”) will be saved as “1.3.12.2.651092.137711.166132.421848.119968.345027.314331_2020-02-03.dcm”, matching the output specification:

SOPInstanceUID is replaced by the value of the corresponding Dicom keyword, transformed by hashuid.
regdate is replaced by the regdate group extracted from the input-regex match and processed by the transformation listed under groups (here, incremented by date-increment).

If multiple input-regex match the filename, the first match in the filenames list gets precedence.
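The renaming logic for the first filenames entry can be approximated in a few lines of Python. This is a sketch under assumptions: the helper rename is hypothetical, the date in the filename is assumed to be parsed with "%Y-%m-%d", and the hashuid step is left out, so the SOPInstanceUID value is a placeholder passed through unchanged.

```python
import re
from datetime import datetime, timedelta

INPUT_REGEX = r"^(?P<notused>\w+)-(?P<regdate>\d{4}-\d{2}-\d{2}).dcm$"
OUTPUT = "{SOPInstanceUID}_{regdate}.dcm"
DATE_INCREMENT = -17


def rename(filename, record):
    """Return the new filename, or None if the input-regex does not match."""
    match = re.match(INPUT_REGEX, filename)
    if match is None:
        return None
    # Values available to output: record fields plus named capture groups.
    values = dict(record)
    values.update(match.groupdict())
    # groups -> regdate -> increment-date: shift the captured date.
    shifted = datetime.strptime(values["regdate"], "%Y-%m-%d") + timedelta(days=DATE_INCREMENT)
    values["regdate"] = shifted.strftime("%Y-%m-%d")
    return OUTPUT.format(**values)


record = {"SOPInstanceUID": "1.2.3.4"}  # placeholder UID, hashuid not applied here
new_name = rename("acquisition-2020-02-20.dcm", record)
```

Because the shifted date lives only in the local values dictionary, the record itself is left unchanged, which mirrors the note that groups transformations do not affect the de-id record.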

fields (list)

The list of field transformations that are applied to the file. Each item in that list must be a dictionary defining the name of the field to be transformed and the transformation to apply to that field. For a Dicom file profile, an example of an item in fields is:

- name: PatientName
  replace-with: REDACTED

which replaces the PatientName Dicom data element value with “REDACTED”.

The different field transformations are described in this section.

name

All file profiles support referencing fields by the key name. How to reference a field varies depending on the file type and is described below for each profile.

regex

In addition, certain file profiles support referencing fields using a regular expression, which is convenient when the same transformation must be performed on a set of fields whose names share some characteristics. For example, for a Dicom file, an item using regex is:

- regex: .*DateTime.*
  increment-datetime: true

which increments all Dicom datetime elements with a keyword matching .*DateTime.*.

File profiles supporting the regex field type are described below.

Danger

Special care is required when using regex to avoid applying multiple actions to the same element. For instance, defining a Dicom profile with the following fields:

- name: AcquisitionDateTime
  increment-datetime: true
- regex: .*DateTime.*
  increment-datetime: true

will cause the AcquisitionDateTime element to be incremented twice!

File profiles supporting regex fields are identified in the profile-specific sections below.
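One way to catch the double-application problem before running a profile is to check every name entry against every regex entry. The sketch below does this with Python's re module; the helper find_double_matches is hypothetical, and the use of re.fullmatch is an assumption about how the tool matches keywords.

```python
import re

fields = [
    {"name": "AcquisitionDateTime", "increment-datetime": True},
    {"regex": ".*DateTime.*", "increment-datetime": True},
]


def find_double_matches(fields):
    """List (name, regex) pairs where a named field is also hit by a regex."""
    names = [item["name"] for item in fields if "name" in item]
    patterns = [item["regex"] for item in fields if "regex" in item]
    return [(name, pattern)
            for name in names
            for pattern in patterns
            if re.fullmatch(pattern, name)]


overlaps = find_double_matches(fields)
```

A non-empty result flags fields that would receive more than one action.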

Dicom specific file settings

File profile key dicom.

patient-age-from-birthdate (boolean)

When set to true, this will set the PatientAge Dicom header as a 3-digit value with a suffix indicating units. For example, an age in days would be 091D, and that same age in months would be 003M. By default, the age will be set using a best-fit approach (i.e. if the age fits in days, then days will be used; otherwise, if it fits in months, then months will be used; otherwise years will be used).

Default is false.

patient-age-units (string)

When set in conjunction with patient-age-from-birthdate, this will act as a preference for which units to use. If the value does not fit into the desired unit, the next level of units will be used. The most common use for this field would be to always use years as the patient age. Valid values are ‘D’, ‘M’, ‘Y’ for Days, Months and Years respectively.
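The best-fit rule and the unit preference can be sketched as below. The helper format_patient_age is hypothetical, and the conversion factors (30 days per month, 365 days per year) are assumptions chosen so that 91 days maps to 003M as in the example above; the tool may compute months and years differently.

```python
def format_patient_age(age_in_days, preferred_unit="D"):
    """Hypothetical sketch of the 3-digit PatientAge value with a unit suffix."""
    units = [("D", 1), ("M", 30), ("Y", 365)]  # assumed conversion factors
    start = [unit for unit, _ in units].index(preferred_unit)
    # Try the preferred unit first, then fall back to the next larger units.
    for unit, days_per_unit in units[start:]:
        value = age_in_days // days_per_unit
        if value <= 999:
            return f"{value:03d}{unit}"
    raise ValueError("age does not fit in three digits")


best_fit = format_patient_age(91)                  # fits in days
in_months = format_patient_age(91, "M")            # months preferred
fallback = format_patient_age(1200)                # too large for days
always_years = format_patient_age(3650, "Y")       # years preferred
```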

remove-private-tags (boolean)

When set to true, the private tags will be removed.

Default is false.

Important

Private tags that are specifically mentioned in the profile will not be removed.

Private creators will be retained for any private tags specified.

For example, for a Dicom file with the following private tag section:

(0029, 0010) Private Creator                     LO: 'SIEMENS CSA HEADER'
(0029, 1008) [CSA Image Header Type]             CS: 'IMAGE NUM 4'
(0029, 1009) [CSA Image Header Version]          LO: '20200122'

And the following profile:

dicom:
  remove-private-tags: true
  fields:
    - name: (0029, SIEMENS CSA HEADER, 08)
      replace-with: 'REDACTED'

The resulting dicom will look like this:

(0029, 0010) Private Creator                     LO: 'SIEMENS CSA HEADER'  # Private creator retained
(0029, 1008) [CSA Image Header Type]             CS: 'REDACTED'            # Private tag replaced
 # CSA Image Header Version removed
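The example above relies on two Dicom conventions: private tags live in odd-numbered groups, and a private creator stored at element 00xx reserves elements xx00 to xxFF of that group. The sketch below models the dataset as a plain dictionary keyed by (group, element); the helpers resolve_private_tag and remove_private_tags are hypothetical and are not part of pydicom or of the tool.

```python
def resolve_private_tag(dataset, group, creator, offset):
    """Map (GGGG, PrivateCreatorName, EE) to a concrete (group, element)."""
    for block in range(0x10, 0x100):
        if dataset.get((group, block)) == creator:
            return (group, (block << 8) | offset)
    return None


def remove_private_tags(dataset, keep):
    """Drop private tags except those in keep and their private creators."""
    keep_blocks = {(group, element >> 8) for group, element in keep}
    kept = {}
    for (group, element), value in dataset.items():
        is_private = group % 2 == 1
        is_kept_creator = element < 0x100 and (group, element) in keep_blocks
        if not is_private or (group, element) in keep or is_kept_creator:
            kept[(group, element)] = value
    return kept


dataset = {
    (0x0010, 0x0010): "Doe^Jane",
    (0x0029, 0x0010): "SIEMENS CSA HEADER",
    (0x0029, 0x1008): "IMAGE NUM 4",
    (0x0029, 0x1009): "20200122",
}
tag = resolve_private_tag(dataset, 0x0029, "SIEMENS CSA HEADER", 0x08)
result = remove_private_tags(dataset, keep={tag})
result[tag] = "REDACTED"  # the replace-with action from the profile
```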

decode (boolean)

When set to True, the Dicom record will be decoded when loaded and the data element VR potentially manipulated according to pydicom default configuration and Flywheel custom pydicom configuration (e.g. unknown VR (UN) will be inferred when possible).

Default is True.

recurse-sequence (boolean)

When set to True, each element of a sequence (VR=SQ) will be processed according to the profile, recursively for all nested sequence elements.

Important

When using this option, the profile fields section must not define fields acting on elements of sequences or using regex.

remove-undefined (boolean)

When set to true, all data elements not defined in the fields section of the profile will be removed. If any field references a nested element in a sequence, the whole sequence element will be kept.

Important

When using this option, particular attention should be paid to the de-id profile to guarantee that the output Dicom still contains the mandatory data elements according to its Information Object Definitions (IOD).

Default is False.

fields

This file profile supports 3 ways to reference a Dicom data element: keyword, tag, or dotty-notation.

keyword: Keyword string as defined in the public Dicom dictionaries (as defined by pydicom), e.g. PatientName.
tag:
- Hexadecimal notation, e.g. 00100010 or 0x00100010.
- Tuple notation, e.g. (0010, 0010).
- Private tag notation in the form (GGGG, PrivateCreatorName, EE), e.g. (0009, "GEMS_IMAG_01", 01). Because the replace-with action upserts the tag if it is not present, the tag VR is inferred from a predefined private dictionary built from pydicom _private_dict.py and flywheel-metadata.
- Repeater group notation for groups in range (50XX, EEEE) and (60XX, EEEE) only. It supports tuple or hexadecimal notation.
dotty-notation: Dot-separated notation for referencing an element within a Dicom sequence. A mix of keywords and tags can be used, e.g. AnatomicRegionSequence.0.CodeValue, 00082218.0.00080102, or AnatomicRegionSequence.0.00080104. In addition, the dotty-notation supports the use of * to reference all indices of the sequence element at once, e.g. AnatomicRegionSequence.*.CodeValue. The notation also supports referencing data elements at any depth, recursively.

Note

The data elements of the Dicom File Meta Information, which follows the 128-byte Dicom File Preamble, can be accessed in the same way as other tags.

Example:

  # using keyword
- name: PatientName
  replace-with: REDACTED

  # using tag (tuple also supported)
- name: 00080104
  replace-with: REDACTED

  # using private tag notation
- name: (0009, "GEMS_IMAG_01", 01)
  replace-with: REDACTED

  # using dotty-notation to access sequence element
- name: 00082218.0.00080102
  replace-with: REDACTED

  # using * to access all element in the sequence
- name: AnatomicRegionSequence.*.CodeValue
  replace-with: REDACTED

  # using repeater group notation
- name: (60xx, 0022)
  replace-with: REDACTED

This file profile also supports regex in field item.

- regex: .*DateTime.*
  increment-datetime: true
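The effect of * in the dotty-notation can be pictured by expanding a path into one concrete path per sequence item. The sketch below works on nested Python lists and dictionaries standing in for a Dicom dataset; the helper expand_wildcards and the sample values are hypothetical.

```python
def expand_wildcards(data, path):
    """Expand every * in a dotty path into concrete list indices."""

    def walk(node, parts, prefix):
        if not parts:
            return [".".join(prefix)]
        head, rest = parts[0], parts[1:]
        if head == "*":
            if not isinstance(node, list):
                return []
            results = []
            for index, item in enumerate(node):
                results.extend(walk(item, rest, prefix + [str(index)]))
            return results
        if isinstance(node, list):
            if not head.isdigit() or int(head) >= len(node):
                return []
            return walk(node[int(head)], rest, prefix + [head])
        if isinstance(node, dict) and head in node:
            return walk(node[head], rest, prefix + [head])
        return []

    return walk(data, path.split("."), [])


dataset = {
    "PatientName": "Doe^Jane",
    "AnatomicRegionSequence": [{"CodeValue": "A"}, {"CodeValue": "B"}],
}
paths = expand_wildcards(dataset, "AnatomicRegionSequence.*.CodeValue")
```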

JPG specific file settings

File profile key jpg.

A good introduction to JPG file format and EXIF metadata can be found here.

This profile treats all Image File Directory (IFD) metadata under the same umbrella, which means that if you define the following field in your profile configuration:

- name: ProcessingSoftware
  remove: true

that field will be removed from both the IFD0 and IFD1 metadata blocks.

remove-exif (boolean)

When set to true, remove the EXIF ImageFileDirectory block from the JPG file.

Default is false.

remove-gps (boolean)

When set to true, remove all the GPS related metadata from the JPG file.

Default is false.

file-filter (list)

Default value is: ['*.jpg', '*.jpeg', '*.JPG', '*.JPEG']

fields

Keywords to be used as name are defined in piexif. The full list of available keywords can be found here.

Example:

- name: DateTime
  increment-datetime: true
- name: Artist
  remove: true
- name: DateTimeOriginal
  increment-datetime: true
- name: PreviewDateTime
  remove: true
- name: DateTimeDigitized
  increment-datetime: true
- name: CameraOwnerName
  replace-with: 'REDACTED'
- name: ImageUniqueID
  hash: true

PNG specific file settings

File profile key png.

A good introduction to PNG file format can be found here.

This file profile only supports the remove action. PNG metadata are referred to as chunks. Referencing both critical and ancillary chunks is supported.

remove-private-chunks (boolean)

When set to true, remove all private chunks from the PNG file.

Default is false.

file-filter (list)

Default value is: ['*.png', '*.PNG']

fields

Example:

- name: tEXt
  remove: true
- name: eXIf
  remove: true

TIFF specific file settings

TIFF and JPG profiles share a lot of similarity given their underlying file format.

remove-private-tags (boolean)

When set to True, remove all private tags. Private tags are tags with index >= 32768.

file-filter (list)

Default value is: ['*.tif', '*.tiff', '*.TIF', '*.TIFF']

fields

The supported keywords are the ones defined in the Pillow package (https://github.com/python-pillow/Pillow). A list of keywords can be found here.

Example:

- name: DateTime
  increment-datetime: true
- name: Software
  remove: true
- name: Model
  replace-with: 'REDACTED'

XML specific file settings

File profile key xml.

file-filter (list)

Default value is: ['*.xml', '*.XML']

fields

The field name uses XPath to reference the DOM element in the tree. If the XPath returns multiple elements, each element will be processed with the specified transformation.

Example:

- name: /Patient/Patient_Date_Of_Birth
  replace-with: '1900-01-01'
- name: /Patient/Patient_Name
  remove: true
- name: /Patient/SUBJECT_ID
  hash: true
- name: /Patient/Visit/Scan/ScanTime
  increment-datetime: true
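The "multiple elements" behaviour can be illustrated with Python's standard xml.etree.ElementTree module. This is only a sketch: ElementTree supports a limited XPath subset and does not accept absolute paths on an element, so the hypothetical helper find_all strips the root name first. The sample XML and the REDACTED replacement are made up, and the tool itself may rely on a different XML library.

```python
import xml.etree.ElementTree as ET

XML = """<Patient>
  <Patient_Name>Jane Doe</Patient_Name>
  <Visit>
    <Scan><ScanTime>2020-02-20 10:30:00</ScanTime></Scan>
    <Scan><ScanTime>2020-02-20 11:00:00</ScanTime></Scan>
  </Visit>
</Patient>"""


def find_all(root, absolute_path):
    """Resolve a simple absolute path such as /Patient/Visit against root."""
    parts = absolute_path.strip("/").split("/")
    if parts[0] != root.tag:
        return []
    if len(parts) == 1:
        return [root]
    return root.findall("./" + "/".join(parts[1:]))


root = ET.fromstring(XML)
scan_times = find_all(root, "/Patient/Visit/Scan/ScanTime")
# The path matches two elements; the same transformation is applied to each.
for element in scan_times:
    element.text = "REDACTED"
```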

JSON specific file settings

File profile key json.

separator (string)

The optional separator string defines which character should be used when referencing nested elements in the JSON file. By default, . is used, so that a nested element in the JSON file can be referenced by a combination of its key and/or list index value.

For example, with the following JSON,

{
   "this":[
      {"package": "is"},
      "neat"
   ]
}

The value stored in "package" can be referenced with the following key: "this.0.package".
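The lookup described above can be reproduced with a short helper that splits the key on the separator and walks the parsed JSON. The function get_nested is hypothetical and only illustrates how keys and list indices combine.

```python
import json


def get_nested(data, key, separator="."):
    """Follow a separator-delimited key through nested dicts and lists."""
    node = data
    for part in key.split(separator):
        if isinstance(node, list):
            node = node[int(part)]  # list items are addressed by index
        else:
            node = node[part]       # dict items are addressed by key
    return node


document = json.loads('{"this": [{"package": "is"}, "neat"]}')
default_separator = get_nested(document, "this.0.package")
custom_separator = get_nested(document, "this/1", separator="/")
```

A custom separator would matter mainly when the JSON keys themselves contain dots.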

file-filter (list)

Default value is: ["*.json", "*.JSON"]

fields

Field name uses a “dotty-notation” to reference the element in the JSON file hierarchy. This file profile is regex compatible.

Example:

- name: timestamp
  increment-datetime: True
- name: info.SiteID
  remove: True
  # regex-sub will be applied first in fields list when processing file
- name: label
  regex-sub:
    - input-regex: '(?P<current_label>.*)'
      output: '{current_label}_{subject.lastname}_{timestamp}'
      groups:
        - name: current_label
          replace-with: 'one_cool_cat'
        - name: subject.lastname
          keep: true
        - name: timestamp
          increment-datetime: True
- regex: info\.subject_raw\..*
  replace-with: "REDACTED"
- name: info.test
  replace-with:
    new: value
    type: dict

Key/value text file settings

File profile key key-value-text-file.

An example for such a text file is:

ObjectType = Image
NDims = 3
BinaryData = True
BinaryDataByteOrderMSB = False
CompressedData = False
TransformMatrix = 1 0 0 0 1 0 0 0 1
Offset = 0 0 0

delimiter (string)

The regular expression to be used when splitting the line in its key/value pair. For example, with the above described file, delimiter should be defined as “\s+=\s+”.

encoding (string)

The optional string defining the encoding to be used to parse the input file. The same encoding will be used for saving the de-id file.

Default is to use the OS default encoding.

ignore-bad-lines (boolean)

When set to true, this optional boolean will ignore the lines that do not match the delimiter provided and log a warning.
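The role of delimiter and ignore-bad-lines can be sketched with Python's re.split. The helper parse_key_value_lines is hypothetical; in particular, raising an error for a bad line when ignore_bad_lines is false is an assumption, since the behaviour in that case is not described here.

```python
import logging
import re

log = logging.getLogger(__name__)


def parse_key_value_lines(lines, delimiter=r"\s+=\s+", ignore_bad_lines=False):
    """Split each line into a key/value pair using a regex delimiter."""
    record = {}
    for number, line in enumerate(lines, start=1):
        parts = re.split(delimiter, line.strip(), maxsplit=1)
        if len(parts) != 2:
            if ignore_bad_lines:
                log.warning("line %d does not match the delimiter, skipped", number)
                continue
            raise ValueError(f"line {number} does not match the delimiter")
        key, value = parts
        record[key] = value
    return record


lines = [
    "ObjectType = Image",
    "NDims = 3",
    "TransformMatrix = 1 0 0 0 1 0 0 0 1",
    "this line has no delimiter",
]
record = parse_key_value_lines(lines, ignore_bad_lines=True)
```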

file-filter (list)

Default value is: ["*.mhd", "*.MHD"]

Table specific file settings

File profile key table.

This profile handles tabular files such as CSV and TSV and can be extended to other tabular formats such as XLS.

reader (string)

The string representing the suffix of the pandas reader to be used. The available suffixes are documented in pandas IO tools.

Default is None.

delimiter (string)

The string representing the delimiter to be used when parsing the input record. It is to be used in combination with the reader setting. For example, with reader=csv, setting delimiter=, parses CSV files, whereas setting delimiter=\t parses TSV files.

Default is None.

CSV specific file settings (csv)

File profile key csv.

Same as Table file settings but with different defaults.

reader (string)

Default value is “csv”.

delimiter (string)

Default value is “,”.

file-filter (list)

Default value is ["*.csv", "*.CSV"]

TSV specific file settings (tsv)

File profile key tsv.

Same as Table file settings but with different defaults.

reader (string)

Default value is “csv”.

delimiter (string)

Default value is “\t”.

file-filter (list)

Default value is ["*.tsv", "*.TSV"]

Filename specific file settings (filename)

File profile key filename.

This profile only implements the filenames functionality (described here) that allows for renaming files and does not have any specific attributes.

For example, the following de-id profile would not transform anything:

filename:
  filenames:
    - input-regex: '(?P<filename>.*)'
      output: '{filename}'

ZIP specific file settings (zip)

This profile handles .zip archives. It behaves differently from the other file profiles because it:

Unzips the archive.
Processes the file members of the archive using the de-id profile.
Re-zips the archive according to its configuration.

hash-subdirectories (boolean)

If set to true, this optional boolean will apply a SHA-256 hash to any subdirectories within the archive.

Default is false.

validate-zip-members (boolean)

If set to true, this optional boolean allows a zip archive to be partially processed. By default, the de-id of a zip archive will fail if a file member cannot be de-identified (e.g. because no profile is associated with its type). Setting this boolean to true will skip the faulty file and not export it in the output archive.

Default is false.
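The unzip, process, and re-zip steps listed above, together with the skip-on-failure behaviour, can be sketched with Python's zipfile module. Everything here is illustrative: deid_zip, its skip_failures flag, and the stand-in member processor upper_text are hypothetical names, and the archive is handled in memory for brevity.

```python
import io
import zipfile


def deid_zip(data, process_member, skip_failures=False):
    """Read an archive, process each file member, and write a new archive."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as source, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        for info in source.infolist():
            if info.is_dir():
                continue
            try:
                processed = process_member(info.filename, source.read(info))
            except ValueError:
                if skip_failures:
                    continue  # faulty member is left out of the output archive
                raise
            target.writestr(info.filename, processed)
    return output.getvalue()


def upper_text(name, content):
    """Stand-in for a real de-id step; only knows how to handle .txt members."""
    if not name.endswith(".txt"):
        raise ValueError(f"no profile associated with {name}")
    return content.upper()


source_bytes = io.BytesIO()
with zipfile.ZipFile(source_bytes, "w") as archive:
    archive.writestr("notes.txt", "patient notes")
    archive.writestr("image.bin", b"\x00\x01")

result = deid_zip(source_bytes.getvalue(), upper_text, skip_failures=True)
with zipfile.ZipFile(io.BytesIO(result)) as archive:
    names = archive.namelist()
    notes = archive.read("notes.txt")
```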

fields

comment is currently the only field specific to ZIP archives.

Example:

- name: comment
  replace-with: 'FLYWHEEL'