.. _file-profiles-page:

De-Identification File Profiles
===============================

Most de-identification settings are defined on a per-file-type basis. A de-id profile (YAML or JSON) can be composed of multiple individual file profiles. Each file profile is defined under its own file profile "key".

Here is a very simple example of a de-id profile which defines two file profiles, ``dicom`` and ``jpg``:

.. code-block:: yaml

    # The name of the de-id profile
    name: An example
    # A description of the de-id profile
    description: A very simplistic de-id profile for Dicom and JPG files

    # The Dicom definition
    dicom:
      fields:
        - name: PatientID
          replace-with: REDACTED

    # The JPG definition
    jpg:
      date-increment: 10
      fields:
        - name: DateTime
          increment-datetime: true

There are a few global settings to discuss before looking at each file profile specifically.

Global file settings
--------------------

The following global settings are available:

salt (string)
.............
This optional salt string is used for all hash-based field transformations. Using a different salt value will result in different (but consistent) values for hashed fields. This value can be any string.

file-filter (string or list)
............................
When set, this controls the filename pattern(s) that a profile will process. Patterns are Unix shell style:

* ``*`` matches everything
* ``?`` matches any single character
* ``[seq]`` matches any character in seq
* ``[!seq]`` matches any character not in seq

``file-filter`` can be defined as a single string or as a list of strings. For instance, a ``file-filter`` defined as ``['*.tiff', '*.tif']`` will match TIFF files with either extension, ``.tiff`` or ``.tif``.

The default value varies by file profile; see the file profile specific settings below.

date-increment (numeric)
........................
When set, this controls by how many days to offset each date or datetime field for which the ``increment-date`` or ``increment-datetime`` transformation is chosen.
Positive values result in later dates, negative values in earlier dates. Incrementing by a multiple of 7 keeps the weekday consistent for shifted dates. Incrementing by a non-integer value also modifies the time of datetime elements (e.g. 0.5 shifts a datetime by 12 hours).

.. _date_format:

date-format (string)
....................
The optional string representation of the date found in the metadata of the file. The format interpretation follows the format codes that the 1989 C standard requires. More on how to format `here `__.

Default is to use the Dicom date format "%Y%m%d".

.. _datetime_format:

datetime-format (string)
........................
The optional string representation of the datetime found in the metadata of the file. The format interpretation follows the format codes that the 1989 C standard requires. More on how to format `here `__.

Default is to use the Dicom datetime format "%Y%m%d%H%M%S.%f".

.. _hashuid_config:

uid-prefix-fields (integer)
...........................
The optional number of prefix blocks to be kept from the original UID when generating the new hash UID.

Default is 4.

uid-suffix-fields (integer)
...........................
The optional number of suffix blocks to be kept from the original UID when generating the new hash UID.

Default is 1.

uid-numeric-name (string)
.........................
The optional UID prefix to be used when generating a new hash UID. Usually it will correspond to an OID registered numeric name. The number of fields in ``uid-numeric-name`` must match ``uid-prefix-fields``.

Default is to use the original UID prefix as defined by ``uid-prefix-fields``.

.. _jitter_config:

jitter-range (numeric)
......................
The optional range to be used when offsetting the value by a random number. The random offset lies in [-jitter-range, +jitter-range].

Default is 2.

jitter-type (string)
....................
Either "int" or "float". "float" draws the random number from a uniform distribution over [-jitter-range, +jitter-range]. "int" draws a random integer from [-jitter-range, +jitter-range].

Default is "float".

.. _replace_with_insert:

replace-with-insert (bool)
..........................
If True, the ``replace-with`` action will insert the field into the record if it does not already exist and set its value. If False, ``replace-with`` will replace the field value *only if* the field already exists in the record.

Default is True.

.. _filenames:

filenames (list)
................
The optional list defining how files get renamed when processed. Each element of ``filenames`` must be a dictionary defining at least ``input-regex`` and ``output``.

``input-regex`` defines the regular expression used to match the input filename and extract the relevant group(s) out of it. ``output`` defines the filename under which the de-id file will be saved, in Python f-string notation. Optionally, a ``groups`` key can be defined to list the transformations to be applied to the ``input-regex`` named capture group(s) or to the record field value.

.. IMPORTANT::

    As opposed to ``fields``, the transformations defined under ``groups`` do NOT impact the de-id record. The transformed values are only made available to ``output``.

An example ``filenames`` definition for a Dicom profile looks like this:

.. code-block:: yaml

    dicom:
      date-increment: -17
      filenames:
        - output: '{SOPInstanceUID}_{regdate}.dcm'
          input-regex: '^(?P<notused>\w+)-(?P<regdate>\d{4}-\d{2}-\d{2}).dcm$'
          groups:
            - name: regdate
              increment-date: true
            - name: SOPInstanceUID
              hashuid: true
        - output: '{filenameuid}_{regdatetime}.dcm'
          input-regex: '^(?P<filenameuid>[\w.]+)-(?P<regdatetime>[\d\s:-]+).dcm$'
          groups:
            - name: filenameuid
              hashuid: true
            - name: regdatetime
              increment-datetime: true

In this example, a file matching the first ``input-regex`` (e.g. "acquisition-2020-02-20.dcm") will be saved as "1.3.12.2.651092.137711.166132.421848.119968.345027.314331_2020-02-03.dcm", matching the ``output`` specification:

* ``SOPInstanceUID`` is replaced by the value of the corresponding Dicom keyword, transformed by ``hashuid``.
* ``regdate`` is replaced by the ``regdate`` group extracted from the match of ``input-regex`` and processed by the transformation listed under ``groups`` (here, shifted by ``date-increment``).

If multiple ``input-regex`` entries match the filename, the first match in the ``filenames`` list takes precedence.

fields (list)
.............
The list of field transformations that are applied to the file. Each item in that list must be a dictionary defining the name of the field to be transformed and the transformation to apply to that field.

For a Dicom file profile, an example of an item in ``fields`` is:

.. code-block:: yaml

    - name: PatientName
      replace-with: REDACTED

which replaces the PatientName Dicom data element value with "REDACTED".

The different field transformations are described in :ref:`this section `.

name
^^^^
All file profiles support referencing fields by the key ``name``. How to reference a field varies depending on the file type and is described below for each profile.

.. _regex:

regex
^^^^^
In addition, certain file profiles support referencing fields using a regular expression, which is convenient when the same transformation must be performed on a set of fields that share some name characteristics.

For a Dicom file, an example of an item using ``regex`` is:

.. code-block:: yaml

    - regex: .*DateTime.*
      increment-datetime: true

which increments all Dicom datetime elements with a keyword matching ``.*DateTime.*``.

File profiles supporting the ``regex`` field type are described below.

.. DANGER::

    Special care is required when using ``regex`` to avoid applying multiple actions to the same element. For instance, defining a Dicom profile with the following fields:

    .. code-block:: yaml

        - name: AcquisitionDateTime
          increment-datetime: true
        - regex: .*DateTime.*
          increment-datetime: true

    will cause the AcquisitionDateTime element to be incremented twice!

Dicom specific file settings
----------------------------

File profile key ``dicom``.

patient-age-from-birthdate (boolean)
....................................
When set to true, this will set the PatientAge Dicom header as a 3-digit value with a suffix indicating units. For example, an age in days would be 091D, and that same age in months would be 003M. By default, the age will be set using a best-fit approach: if the age fits in days, then days will be used; otherwise, if it fits in months, then months will be used; otherwise years will be used.

Default is false.

patient-age-units (string)
..........................
When set in conjunction with ``patient-age-from-birthdate``, this acts as a preference for which units to use. If the value does not fit into the desired unit, the next level of units will be used. The most common use for this field would be to always use years as the patient age. Valid values are 'D', 'M' and 'Y' for days, months and years respectively.

remove-private-tags (boolean)
.............................
When set to true, the private tags will be removed.

Default is false.

.. IMPORTANT::

    Private tags that are specifically mentioned in the profile will not be removed. Private creators will be retained for any private tags specified.

    For example, for a Dicom file with the following private tag section:

    .. code-block:: bash

        (0029, 0010) Private Creator              LO: 'SIEMENS CSA HEADER'
        (0029, 1008) [CSA Image Header Type]      CS: 'IMAGE NUM 4'
        (0029, 1009) [CSA Image Header Version]   LO: '20200122'

    and the following profile:

    .. code-block:: yaml

        dicom:
          remove-private-tags: true
          fields:
            - name: (0029, SIEMENS CSA HEADER, 08)
              replace-with: 'REDACTED'

    the resulting Dicom file will look like this:

    .. code-block:: bash

        (0029, 0010) Private Creator              LO: 'SIEMENS CSA HEADER'  # Private creator retained
        (0029, 1008) [CSA Image Header Type]      CS: 'REDACTED'            # Private tag replaced
                                                                            # CSA Image Header Version removed

decode (boolean)
................
When set to True, the Dicom record will be decoded when loaded and the data element VR potentially manipulated according to the pydicom default configuration and the Flywheel custom pydicom configuration (e.g. an unknown VR (UN) will be inferred when possible).

Default is True.

recurse-sequence (boolean)
..........................
When set to True, each element of a sequence (VR=SQ) will be processed according to the profile, recursively for all nested sequence elements.

.. IMPORTANT::

    When using this option, the profile ``fields`` section must not define fields acting on elements of sequences or using ``regex``.

remove-undefined (boolean)
..........................
When set to true, all data elements not defined in the ``fields`` section of the profile will be removed. If any field references a nested element in a sequence, the whole sequence element will be kept.

.. IMPORTANT::

    When using this option, particular attention should be paid to the de-id profile to guarantee that the output Dicom file still contains the mandatory data elements according to its Information Object Definitions (IOD).

Default is False.

file-filter (list)
..................
Default value is: ``['*.dcm', '*.DCM', '*.ima', '*.IMA']``

fields
......
This file profile supports three ways to reference a Dicom data element: keyword, tag, or dotty-notation.

* *keyword*: Keyword string as defined in the public Dicom dictionaries (as defined by pydicom), e.g. ``PatientName``.

* *tag*:

  * Hexadecimal notation, e.g. ``00100010`` or ``0x00100010``.
  * Tuple notation, e.g. ``(0010, 0010)``.
  * Private tag notation in the form ``(GGGG, PrivateCreatorName, EE)``, e.g. ``(0009, "GEMS_IMAG_01", 01)``. Because the ``replace-with`` action will upsert the tag if it is not present, the tag VR is inferred from predefined private dictionaries built from pydicom ``_private_dict.py`` and flywheel-metadata.
  * Repeater group notation, for groups in the ranges ``(50XX, EEEE)`` and ``(60XX, EEEE)`` only. Both tuple and hexadecimal notation are supported.

* *dotty-notation*: Dot-separated notation for referencing elements within a Dicom sequence. A mix of keywords and tags can be used in that case, e.g. ``AnatomicRegionSequence.0.CodeValue``, ``00082218.0.00080102``, ``AnatomicRegionSequence.0.00080104``. In addition, the dotty-notation supports the use of ``*`` to reference all indices of the sequence element at once, e.g. ``AnatomicRegionSequence.*.CodeValue``. The notation also supports referring to data elements at any depth, recursively.

.. NOTE::

    The data elements in the Dicom File Meta information located in the optional 128 bytes of the Dicom File Preamble can be accessed in the same way as other tags.

Example:

.. code-block:: yaml

    # using keyword
    - name: PatientName
      replace-with: REDACTED
    # using tag (tuple also supported)
    - name: 00080104
      replace-with: REDACTED
    # using private tag notation
    - name: (0009, "GEMS_IMAG_01", 01)
      replace-with: REDACTED
    # using dotty-notation to access a sequence element
    - name: 00082218.0.00080102
      replace-with: REDACTED
    # using * to access all elements in the sequence
    - name: AnatomicRegionSequence.*.CodeValue
      replace-with: REDACTED
    # using repeater group notation
    - name: (60xx, 0022)
      replace-with: REDACTED

This file profile also supports ``regex`` in field items.

.. code-block:: yaml

    - regex: .*DateTime.*
      increment-datetime: true

JPG specific file settings
--------------------------

File profile key ``jpg``.

A good introduction to JPG file format and EXIF metadata can be found `here `__.

This profile treats all Image File Directory (IFD) metadata under the same umbrella, which means that if you define the following field in your profile configuration:

.. code-block:: yaml

    - name: ProcessingSoftware
      remove: true

that field will be removed from both the IFD0 and IFD1 metadata blocks.

remove-exif (boolean)
.....................
When set to true, remove the EXIF ImageFileDirectory block from the JPG file.

Default is false.

remove-gps (boolean)
....................
When set to true, remove all the GPS-related metadata from the JPG file.

Default is false.

file-filter (list)
..................
Default value is: ``['*.jpg', '*.jpeg', '*.JPG', '*.JPEG']``

fields
......
Keywords to be used as ``name`` are defined in piexif. The full list of available keywords can be found `here `__.

Example:

.. code-block:: yaml

    - name: DateTime
      increment-datetime: true
    - name: Artist
      remove: true
    - name: DateTimeOriginal
      increment-datetime: true
    - name: PreviewDateTime
      remove: true
    - name: DateTimeDigitized
      increment-datetime: true
    - name: CameraOwnerName
      replace-with: 'REDACTED'
    - name: ImageUniqueID
      hash: true

PNG specific file settings
--------------------------

File profile key ``png``.

A good introduction to PNG file format can be found `here `__.

This file profile only supports the ``remove`` action. PNG metadata are referred to as chunks. References to both critical and ancillary chunks are supported.

remove-private-chunks (boolean)
...............................
When set to true, remove all private chunks from the PNG file.

Default is false.

file-filter (list)
..................
Default value is: ``['*.png', '*.PNG']``

fields
......
Example:

.. code-block:: yaml

    - name: tEXt
      remove: true
    - name: eXIf
      remove: true

TIFF specific file settings
---------------------------

TIFF and JPG profiles share many similarities given their underlying file formats.

remove-private-tags (boolean)
.............................
When set to True, remove all private tags.
Private tags are tags with index >= 32768.

file-filter (list)
..................
Default value is: ``['*.tif', '*.tiff', '*.TIF', '*.TIFF']``

fields
......
The supported keywords are the ones defined in the Pillow package. A list of keywords can be found `here `__.

Example:

.. code-block:: yaml

    - name: DateTime
      increment-datetime: true
    - name: Software
      remove: true
    - name: Model
      replace-with: 'REDACTED'

XML specific file settings
--------------------------

File profile key ``xml``.

file-filter (list)
..................
Default value is: ``['*.xml', '*.XML']``

fields
......
Field names use XPath to reference the DOM element in the tree. If the XPath returns multiple elements, each element will be processed with the specified transformation.

Example:

.. code-block:: yaml

    - name: /Patient/Patient_Date_Of_Birth
      replace-with: '1900-01-01'
    - name: /Patient/Patient_Name
      remove: true
    - name: /Patient/SUBJECT_ID
      hash: true
    - name: /Patient/Visit/Scan/ScanTime
      increment-datetime: true

JSON specific file settings
---------------------------

File profile key ``json``.

separator (string)
..................
The optional separator string defines which character is used when referencing nested elements in the JSON file. By default, ``.`` is used, so that a nested element in the JSON file can be referenced by a combination of its keys and/or list index values. For example, with the following JSON,

.. code-block:: json

    {
      "this": [
        {"package": "is"},
        "neat"
      ]
    }

the value stored in ``"package"`` can be referenced with the key ``"this.0.package"``.

file-filter (list)
..................
Default value is: ``["*.json", "*.JSON"]``

fields
......
Field names use a "dotty-notation" to reference the element in the JSON file hierarchy. This file profile is ``regex`` compatible.

Example:

.. code-block:: yaml

    - name: timestamp
      increment-datetime: True
    - name: info.SiteID
      remove: True
    # regex-sub will be applied first in fields list when processing file
    - name: label
      regex-sub:
        - input-regex: '(?P<current_label>.*)'
          output: '{current_label}_{subject.lastname}_{timestamp}'
          groups:
            - name: current_label
              replace-with: 'one_cool_cat'
            - name: subject.lastname
              keep: true
            - name: timestamp
              increment-datetime: True
    - regex: info\.subject_raw\..*
      replace-with: "REDACTED"
    - name: info.test
      replace-with:
        new: value
        type: dict

Key/value text file settings
----------------------------

File profile key ``key-value-text-file``.

An example of such a text file is:

.. code-block:: text

    ObjectType = Image
    NDims = 3
    BinaryData = True
    BinaryDataByteOrderMSB = False
    CompressedData = False
    TransformMatrix = 1 0 0 0 1 0 0 0 1
    Offset = 0 0 0

delimiter (string)
..................
The regular expression to be used when splitting each line into its key/value pair. For example, with the file described above, ``delimiter`` should be defined as ``\s+=\s+``.

encoding (string)
.................
The optional string defining the encoding to be used to parse the input file. The same encoding will be used for saving the de-id file.

Default is to use the OS default encoding.

ignore-bad-lines (boolean)
..........................
When set to true, this optional boolean will ignore the lines that do not match the ``delimiter`` provided and log a warning.

file-filter (list)
..................
Default value is: ``["*.mhd", "*.MHD"]``

.. _table-file-profile:

Table specific file settings
----------------------------

File profile key ``table``.

This profile handles tabular files such as CSV and TSV, and can be extended to other tabular formats such as XLS.

reader (string)
...............
The string representing the suffix of the pandas reader to be used. Available suffixes are documented in pandas IO tools.

Default is None.

delimiter (string)
..................
This string represents the delimiter to be used when parsing the input record. It is used in combination with the ``reader`` setting. For example, with ``reader=csv``, setting ``delimiter=,`` parses CSV files, whereas setting ``delimiter=\t`` parses TSV files.

Default is None.

CSV specific file settings (csv)
--------------------------------

File profile key ``csv``.

Same as the :ref:`Table <table-file-profile>` file settings, but with different defaults.

reader (string)
...............
Default value is "csv".

delimiter (string)
..................
Default value is ",".

file-filter (list)
..................
Default value is ``[".csv", ".CSV"]``

TSV specific file settings (tsv)
--------------------------------

File profile key ``tsv``.

Same as the :ref:`Table <table-file-profile>` file settings, but with different defaults.

reader (string)
...............
Default value is "csv".

delimiter (string)
..................
Default value is "\\t".

file-filter (list)
..................
Default value is ``[".tsv", ".TSV"]``

Filename specific file settings (filename)
------------------------------------------

File profile key ``filename``.

This profile only implements the ``filenames`` functionality (described :ref:`here <filenames>`), which allows for renaming files, and does not have any specific attributes.

For example, the following de-id profile would not transform anything:

.. code-block:: yaml

    filename:
      filenames:
        - input-regex: '(?P<filename>.*)'
          output: '{filename}'

ZIP specific file settings (zip)
--------------------------------

This profile handles .zip archives. It behaves differently from the other file profiles because it:

1) Unzips the archive.
2) Processes the file members of the archive using the de-id profile.
3) Re-zips the archive according to its configuration.

hash-subdirectories (boolean)
.............................
If set to true, this optional boolean will apply a SHA-256 hash to any subdirectories within the archive.

Default is false.

validate-zip-members (boolean)
..............................
If set to true, this optional boolean allows for partially processing the zip archive. By default, the de-id of a zip archive will fail if a file member cannot be de-identified (e.g. because no profile is associated with its type). Setting this boolean to true will skip the faulty file and not export it in the output archive.

Default is false.

fields
......
``comment`` is currently the only field specific to ZIP archives.

Example:

.. code-block:: yaml

    - name: comment
      replace-with: 'FLYWHEEL'
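
As a minimal sketch of how the ZIP settings described in this section could be combined, the following de-id profile pairs a ZIP file profile with a Dicom file profile for the archive members. It assumes that the ZIP file profile key is ``zip``, which the section title suggests but which is not stated explicitly; the ``name`` and ``description`` values are purely illustrative.

.. code-block:: yaml

    # Illustrative values only: name and description are made up for this sketch
    name: zip-example
    description: De-identify Dicom files packaged in a zip archive

    # ASSUMPTION: "zip" is taken to be the file profile key for ZIP archives
    zip:
      # Hash the subdirectories found inside the archive
      hash-subdirectories: true
      # Skip members that cannot be de-identified instead of failing
      validate-zip-members: true
      fields:
        - name: comment
          replace-with: 'FLYWHEEL'

    # File profile applied to the Dicom members of the archive
    dicom:
      fields:
        - name: PatientID
          replace-with: REDACTED

With such a profile, the member-processing step would presumably apply the ``dicom`` file profile to archive members matching its ``file-filter``, while members with no associated file profile would be skipped rather than causing the whole archive to fail.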