De-Identification File Profiles
Most de-identification settings are defined on a per-file-type basis.
A de-id profile (YAML or JSON) can be composed of multiple individual file profiles. Each
file profile is defined under its own file profile “key”.
Here is a very simple example of a de-id profile which defines two file profiles, dicom and jpg:
# The name of the de-id profile
name: An example
# A description of the de-id profile
description: A very simplistic de-id profile for Dicom and JPG files
# The Dicom definition
dicom:
  fields:
    - name: PatientID
      replace-with: REDACTED
# The JPG definition
jpg:
  date-increment: 10
  fields:
    - name: DateTime
      increment-datetime: true
There are a few global settings to discuss before looking at each file profile specifically.
Global file settings
The following global settings are available:
salt (string)
This optional salt string is used for all hash-based field transformations. Using a different salt value will result in different (but consistent) values for hashed fields. This value can be any string.
file-filter (string or list)
When set, this controls the filename pattern(s) that a profile will process. Patterns are Unix shell style:
* matches everything
? matches any single character
[seq] matches any character in seq
[!seq] matches any character not in seq
file-filter can be defined as a single string or a list of strings. For instance, a file-filter defined as ['*.tiff', '*.tif'] will match TIFF files with either extension, .tiff or .tif.
The default value depends on the file profile and is listed in the profile-specific settings.
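The pattern rules above are the same ones implemented by Python's standard fnmatch module. The following is a minimal sketch of how a string or list file-filter could be evaluated; the helper name matches_file_filter is hypothetical and is not part of the de-id tooling.

```python
from fnmatch import fnmatchcase


def matches_file_filter(filename, file_filter):
    """Return True if filename matches the pattern, or any pattern in the list.

    Hypothetical helper for illustration only.
    """
    patterns = [file_filter] if isinstance(file_filter, str) else list(file_filter)
    # fnmatchcase applies Unix shell-style rules (*, ?, [seq], [!seq]) case-sensitively
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


print(matches_file_filter("scan.tif", ["*.tiff", "*.tif"]))
print(matches_file_filter("scan.png", ["*.tiff", "*.tif"]))
print(matches_file_filter("a1.dcm", "a[!0].dcm"))
```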
date-increment (numeric)
When set, this controls by how many days each date or datetime field is offset when the increment-date or increment-datetime transformation is chosen. Positive values result in later dates, negative values in earlier dates. Incrementing by a multiple of 7 keeps the weekday consistent for shifted dates. Incrementing by a non-integer value also modifies the time of datetime elements (e.g. 0.5 shifts a datetime by 12 hours).
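The effect of different date-increment values can be checked with plain Python date arithmetic. This is only an illustration of the offsets described above, not the tool's own code; the function name shift is an assumption.

```python
from datetime import datetime, timedelta


def shift(value, date_increment):
    """Offset a datetime by date_increment days (fractions shift the time too)."""
    return value + timedelta(days=date_increment)


original = datetime(2020, 2, 20, 8, 30)

two_weeks_later = shift(original, 14)   # multiple of 7: weekday is preserved
half_day_later = shift(original, 0.5)   # non-integer: time moves by 12 hours
earlier = shift(original, -17)          # negative: earlier date

print(original.strftime("%A"), two_weeks_later.strftime("%A"))
print(half_day_later.isoformat())
print(earlier.date().isoformat())
```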
date-format (string)
The optional string representation of the date found in the metadata of the file. The format interpretation follows the format codes that the 1989 C standard requires. More on how to format here.
Default is to use the Dicom date format “%Y%m%d”.
datetime-format (string)
The optional string representation of the datetime found in the metadata of the file. The format interpretation follows the format codes that the 1989 C standard requires. More on how to format here.
Default is to use the Dicom datetime format “%Y%m%d%H%M%S.%f”.
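As an illustration of how the default date-format and datetime-format strings are interpreted, the sketch below parses a value, shifts it and writes it back using Python's strptime and strftime. The helper shift_string is hypothetical.

```python
from datetime import datetime, timedelta

DATE_FORMAT = "%Y%m%d"                # default date-format
DATETIME_FORMAT = "%Y%m%d%H%M%S.%f"   # default datetime-format


def shift_string(value, fmt, days):
    """Parse value with fmt, offset it by days, and format it back with fmt."""
    parsed = datetime.strptime(value, fmt)
    return (parsed + timedelta(days=days)).strftime(fmt)


print(shift_string("20200220", DATE_FORMAT, 10))
print(shift_string("20200220083000.000000", DATETIME_FORMAT, 0.5))
```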
uid-prefix-fields (integer)
The optional number of prefix blocks to be kept from the original UID when generating the new hash UID.
Default is 4.
uid-suffix-fields (integer)
The optional number of suffix blocks to be kept from the original UID when generating the new hash UID.
Default is 1.
uid-numeric-name (string)
The optional UID prefix to be used when generating the new hash UID. Usually it corresponds to an OID registered numeric name. The number of fields in uid-numeric-name must match uid-prefix-fields.
Default is to use the original UID prefix as defined by uid-prefix-fields.
jitter-range (numeric)
The optional range to be used when offsetting the value by a random number. The random offset is drawn from [-jitter-range, +jitter-range] and added to the original value.
Default is 2.
jitter-type (string)
Either “int” or “float”. “float” draws the random offset from a uniform distribution over [-jitter-range, +jitter-range]. “int” draws a random integer from [-jitter-range, +jitter-range].
Default is “float”.
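A minimal sketch of the two jitter types, assuming the random offset is simply added to the original value. The function jitter and its rng parameter are hypothetical and only mirror the settings described above.

```python
import random


def jitter(value, jitter_range=2, jitter_type="float", rng=random):
    """Offset value by a random amount within [-jitter_range, +jitter_range]."""
    if jitter_type == "int":
        # random integer, both bounds included
        offset = rng.randint(-int(jitter_range), int(jitter_range))
    elif jitter_type == "float":
        # uniform draw over the continuous range
        offset = rng.uniform(-jitter_range, jitter_range)
    else:
        raise ValueError(f"unsupported jitter-type: {jitter_type!r}")
    return value + offset


rng = random.Random(0)  # seeded only to make the illustration repeatable
print(jitter(50, rng=rng))
print(jitter(50, jitter_type="int", rng=rng))
```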
replace-with-insert (bool)
If True, the replace-with action inserts the field into the record if it does not already exist and sets its value. If False, replace-with replaces the field value only if the field already exists in the record.
Default is True.
secret-key (string)
Secret key to use for symmetric encryption and decryption, passed in as a base16 (hex) encoded string. The decoded key must be 16 (AES-128), 24 (AES-192), or 32 (AES-256) bytes long.
An example of how to create a secret key via python is as follows:
>>> import secrets
>>> secrets.token_hex(16)
'f566ee15612f09ecf8dce973e79831fb'
Then, input the created secret key into the profile:
dicom:
  secret-key: f566ee15612f09ecf8dce973e79831fb
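Before placing a key in a profile, its size can be sanity-checked. The sketch below assumes the length requirement applies to the key after hex decoding, which is consistent with the token_hex(16) example; the helper describe_secret_key is hypothetical.

```python
import secrets

# Decoded key size in bytes mapped to the AES variant it selects
VALID_KEY_SIZES = {16: "AES-128", 24: "AES-192", 32: "AES-256"}


def describe_secret_key(hex_key):
    """Return the AES variant for a hex-encoded key, or raise ValueError."""
    key_bytes = bytes.fromhex(hex_key)  # raises ValueError on non-hex input
    if len(key_bytes) not in VALID_KEY_SIZES:
        raise ValueError(f"unsupported key size: {len(key_bytes)} bytes")
    return VALID_KEY_SIZES[len(key_bytes)]


print(describe_secret_key("f566ee15612f09ecf8dce973e79831fb"))
print(describe_secret_key(secrets.token_hex(32)))
```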
force-nonce (string)
Base64 encoded nonce to use for symmetric encryption and decryption. The nonce must be 12 base64 characters long, which corresponds to 8 bytes.
You can create a nonce via python as follows:
>>> import base64
>>> import secrets
>>> base64.b64encode(secrets.token_bytes(8)).decode()
'Gfd5PrWzD38='
Then, input the created nonce into the profile:
csv:
  force-nonce: Gfd5PrWzD38=
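The relation between the encoded length and the byte length of a nonce can be verified by decoding it. This is a small check using only the standard library; the helper nonce_info is hypothetical.

```python
import base64
import secrets


def nonce_info(nonce):
    """Return (number of base64 characters, number of decoded bytes)."""
    raw = base64.b64decode(nonce, validate=True)
    return len(nonce), len(raw)


# The example nonce from this section: 12 characters (padding included), 8 bytes
print(nonce_info("Gfd5PrWzD38="))

# Any 8 random bytes encode to 12 base64 characters
new_nonce = base64.b64encode(secrets.token_bytes(8)).decode()
print(nonce_info(new_nonce))
```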
filenames (list)
The optional list defining how files get renamed when processed. Each element of filenames must be a dictionary defining at least input-regex and output. input-regex defines the regular expression used to match the input filename and extract the relevant group(s) from it. output defines the filename under which the de-id file will be saved, in Python f-string notation.
Optionally, a groups key can be defined to list the transformations to apply to the input-regex named capture group(s) or to the record field value.
Important
As opposed to fields, the transformations defined under groups do NOT impact the de-id record. The transformations are only made available to output.
An example of a filenames definition looks like this for a Dicom profile:
dicom:
  date-increment: -17
  filenames:
    - output: '{SOPInstanceUID}_{regdate}.dcm'
      input-regex: '^(?P<notused>\w+)-(?P<regdate>\d{4}-\d{2}-\d{2}).dcm$'
      groups:
        - name: regdate
          increment-date: true
        - name: SOPInstanceUID
          hashuid: true
    - output: '{filenameuid}_{regdatetime}.dcm'
      input-regex: '^(?P<filenameuid>[\w.]+)-(?P<regdatetime>[\d\s:-]+).dcm$'
      groups:
        - name: filenameuid
          hashuid: true
        - name: regdatetime
          increment-datetime: true
In this example, a file matching the first input-regex (e.g. “acquisition-2020-02-20.dcm”) will be saved as “1.3.12.2.651092.137711.166132.421848.119968.345027.314331_2020-02-03.dcm”, matching the output specification:
SOPInstanceUID is replaced by the value of the corresponding Dicom keyword, transformed by hashuid.
regdate is replaced by the regdate group extracted from the regex match defined by input-regex and processed by the transformation listed under groups (e.g. incremented by date-increment).
If multiple input-regex entries match the filename, the first match in the filenames list takes precedence.
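The renaming logic for the first filenames entry can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's implementation: the date captured from the filename is assumed to be parsed as %Y-%m-%d, and the hashed UID is passed in as a placeholder because the hashuid algorithm is not reproduced here. The function rename is hypothetical.

```python
import re
from datetime import datetime, timedelta

INPUT_REGEX = r"^(?P<notused>\w+)-(?P<regdate>\d{4}-\d{2}-\d{2}).dcm$"
OUTPUT = "{SOPInstanceUID}_{regdate}.dcm"
DATE_INCREMENT = -17


def rename(filename, hashed_uid):
    """Build the output filename, or return None when input-regex does not match."""
    match = re.match(INPUT_REGEX, filename)
    if match is None:
        return None
    groups = match.groupdict()
    # increment-date on the regdate group (assumed %Y-%m-%d representation)
    regdate = datetime.strptime(groups["regdate"], "%Y-%m-%d")
    groups["regdate"] = (regdate + timedelta(days=DATE_INCREMENT)).strftime("%Y-%m-%d")
    # hashuid result is supplied by the caller (placeholder value in this sketch)
    groups["SOPInstanceUID"] = hashed_uid
    return OUTPUT.format(**groups)


print(rename("acquisition-2020-02-20.dcm", hashed_uid="1.2.3"))
print(rename("notes.txt", hashed_uid="1.2.3"))
```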
fields (list)
The list of field transformations that are applied to the file. Each item in that list must be a dictionary defining the name of the field to be transformed and the transformation to apply to that field. For a Dicom file profile, an example of an item in fields is:
- name: PatientName
  replace-with: REDACTED
which replaces the PatientName Dicom data element value with “REDACTED”.
The different field transformations are described in this section.
name
All file profiles support referencing fields by the key name. How to reference a field varies depending on the file type and is described below for each profile.
regex
In addition, certain file profiles support referencing fields with a regular expression, which is convenient when the same transformation must be performed on a set of fields whose names share some characteristics. For a Dicom file, an example of an item using regex is:
- regex: .*DateTime.*
  increment-datetime: true
which increments all Dicom datetime elements with a keyword matching .*DateTime.*.
File profiles supporting the regex field type are described below.
Danger
Special care is required when using regex to avoid applying multiple actions to the same element. For instance, defining a Dicom profile with the following fields:
- name: AcquisitionDateTime
  increment-datetime: true
- regex: .*DateTime.*
  increment-datetime: true
will cause the AcquisitionDateTime element to be incremented twice!
Whether a file profile supports regex fields is stated in the profile-specific sections below.
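One way to catch this kind of overlap before running a profile is to list which fields entries select a given keyword. The check below is a hypothetical helper, not part of the de-id tooling, and it assumes regex entries are matched against the whole keyword.

```python
import re

fields = [
    {"name": "AcquisitionDateTime", "increment-datetime": True},
    {"regex": ".*DateTime.*", "increment-datetime": True},
]


def matching_rules(keyword, fields):
    """Return the indices of all fields entries that would act on keyword."""
    hits = []
    for index, field in enumerate(fields):
        if field.get("name") == keyword:
            hits.append(index)
        elif "regex" in field and re.fullmatch(field["regex"], keyword):
            hits.append(index)
    return hits


# More than one index means the element would be transformed more than once
print(matching_rules("AcquisitionDateTime", fields))
print(matching_rules("RadiopharmaceuticalStartDateTime", fields))
print(matching_rules("ContentDate", fields))
```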
Dicom specific file settings
File profile key dicom.
patient-age-from-birthdate (boolean)
When set to true, this will set the PatientAge Dicom header as a 3-digit value with a suffix indicating units. For example, an age in days would be 091D, and that same age in months would be 003M. By default, the age is set using a best-fit approach: if the age fits in days, days are used; otherwise, if it fits in months, months are used; otherwise years are used.
Default is false.
patient-age-units (string)
When set in conjunction with patient-age-from-birthdate, this will act as a preference for which units to use. If the value does not fit into the desired unit, the next level of units will be used. The most common use for this field would be to always use years as the patient age. Valid values are ‘D’, ‘M’, ‘Y’ for Days, Months and Years respectively.
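The best-fit rule and the unit preference can be sketched as follows. This is a minimal illustration and not the tool's implementation: it assumes that “fits” means the value is at most 999 (three digits) and that months are counted as completed calendar months. The function patient_age is hypothetical.

```python
from datetime import date


def patient_age(birth, reference, preferred_unit=None):
    """Format an age as a 3-digit value plus unit, e.g. 091D, 003M or 010Y."""
    days = (reference - birth).days
    # completed calendar months between the two dates (assumption of this sketch)
    months = (reference.year - birth.year) * 12 + (reference.month - birth.month)
    if reference.day < birth.day:
        months -= 1
    years = months // 12
    candidates = [("D", days), ("M", months), ("Y", years)]
    units = [unit for unit, _ in candidates]
    start = units.index(preferred_unit) if preferred_unit else 0
    # try the preferred unit first, then fall back to the next larger units
    for unit, value in candidates[start:]:
        if value <= 999:
            return f"{value:03d}{unit}"
    raise ValueError("age does not fit in three digits")


print(patient_age(date(2020, 1, 1), date(2020, 4, 1)))
print(patient_age(date(2020, 1, 1), date(2020, 4, 1), preferred_unit="M"))
print(patient_age(date(2010, 1, 1), date(2020, 4, 1)))
print(patient_age(date(2010, 1, 1), date(2020, 4, 1), preferred_unit="Y"))
```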
decode (boolean)
When set to True, the Dicom record will be decoded when loaded and the data element VR potentially manipulated according to pydicom default configuration and Flywheel custom pydicom configuration (e.g. unknown VR (UN) will be inferred when possible).
Default is True.
recurse-sequence (boolean)
When set to True, each element of a sequence (VR=SQ) will be processed according to the profile, recursively for all nested sequence elements.
Important
When using this option, the profile fields section must not define fields acting on elements of sequences or using regex.
remove-undefined (boolean)
When set to true, all data elements not defined in the fields section of the profile will be removed. If any field references a nested element in a sequence, the whole sequence element will be kept.
Important
When using this option, particular attention should be paid to the de-id profile to guarantee that the output Dicom still contains the mandatory data elements according to its Information Object Definitions (IOD).
Default is False.
file-filter (list)
Default value is: ['*.dcm', '*.DCM', '*.ima', '*.IMA']
asymmetric-encryption (boolean)
If true, asymmetric encryption will be used for encryption/decryption. Asymmetric encryption requires public-key for encryption and private-key for decryption.
retain (boolean)
If true, every field modified by any de-id field transformation will have its original value encrypted in the EncryptedAttributesSequence (see PS3.15 E.1.1, https://dicom.nema.org/medical/dicom/current/output/chtml/part15/chapter_E.html#sect_E.1.1).
For example, the following profile would output a de-identified DICOM with PatientID replaced with ANONYMIZED and an EncryptedAttributesSequence that stores the original PatientID, which can later be restored via the decrypt action.
dicom:
  retain: true
  asymmetric-encryption: true
  public-key:
    - /path/to/public_key.pem
  fields:
    - name: PatientID
      replace-with: ANONYMIZED
public-key (list)
Public key .pem file(s) to be used for asymmetric encryption, entered as a list of filepath(s).
dicom:
  public-key:
    - public_key1.pem
    - public_key2.pem
The following creates an x509 keypair with the private key being a 4096-bit RSA key:
openssl req -x509 \
  -newkey rsa:4096 \
  -keyout private_key.pem \
  -out public_key.pem \
  -sha256 \
  -days 3650 \
  -nodes \
  -subj "/C=XX/ST=StateName/L=CityName/O=CompanyName/OU=CompanySectionName/CN=CommonNameOrHostname"
private-key (string)
Private key .pem file to be used for asymmetric decryption, entered as a filepath. The private key must be associated with a public key used for encryption. See above for an example of how to create an x509 keypair for asymmetric encryption.
dicom:
  private-key: private_key.pem
fields
This file profile supports three ways to reference a Dicom data element: keyword, tag or dotty-notation.
keyword: Keyword string as defined in the public Dicom dictionaries (as defined by pydicom), e.g. PatientName.
tag:
- Hexadecimal notation, e.g. 00100010 or 0x00100010.
- Tuple notation, e.g. (0010, 0010).
- Private tag notation in the form (GGGG, PrivateCreatorName, EE), e.g. (0009, "GEMS_IMAG_01", 01). As the replace-with action will upsert the tag if it is not present, the tag VR is inferred from predefined private dictionaries built from pydicom _private_dict.py and flywheel-metadata.
- Repeater group notation, for groups in range (50XX, EEEE) and (60XX, EEEE) only. It supports tuple or hexadecimal notation.
dotty-notation: Dot-separated notation for referencing an element within a Dicom sequence. A mix of keywords and tags can be used, e.g. AnatomicRegionSequence.0.CodeValue, 00082218.0.00080102, AnatomicRegionSequence.0.00080104. In addition, the dotty-notation supports the use of * to reference all indices of the sequence element at once, e.g. AnatomicRegionSequence.*.CodeValue. The notation also supports referring to data elements at any depth recursively.
Note
The data elements in the Dicom File Meta Information (group 0002, stored after the 128-byte File Preamble and the DICM prefix) can be accessed in the same way as other tags.
Example:
# using keyword
- name: PatientName
  replace-with: REDACTED
# using tag (tuple also supported)
- name: 00080104
  replace-with: REDACTED
# using private tag notation
- name: (0009, "GEMS_IMAG_01", 01)
  replace-with: REDACTED
# using dotty-notation to access a sequence element
- name: 00082218.0.00080102
  replace-with: REDACTED
# using * to access all elements in the sequence
- name: AnatomicRegionSequence.*.CodeValue
  replace-with: REDACTED
# using repeater group notation
- name: (60xx, 0022)
  replace-with: REDACTED
This file profile also supports regex in field items.
- regex: .*DateTime.*
  increment-datetime: true
JPG specific file settings
File profile key jpg.
A good introduction to JPG file format and EXIF metadata can be found here.
This profile treats all Image File Directory (IFD) metadata under the same umbrella, which means that if you define the following field in your profile configuration:
- name: ProcessingSoftware
  remove: REDACTED
that field will be redacted from both the IFD0 and IFD1 metadata blocks.
remove-exif (boolean)
When set to true, remove the EXIF ImageFileDirectory block from the JPG file.
Default is false.
remove-gps (boolean)
When set to true, remove all the GPS related metadata from the JPG file.
Default is false.
file-filter (list)
Default value is: ['*.jpg', '*.jpeg', '*.JPG', '*.JPEG']
fields
Keywords to be used as name are defined in piexif. The full list of available keywords can be found here.
Example:
- name: DateTime
  increment-datetime: true
- name: Artist
  remove: true
- name: DateTimeOriginal
  increment-datetime: true
- name: PreviewDateTime
  remove: true
- name: DateTimeDigitized
  increment-datetime: true
- name: CameraOwnerName
  replace-with: 'REDACTED'
- name: ImageUniqueID
  hash: true
PNG specific file settings
File profile key png.
A good introduction to PNG file format can be found here.
This file profile only supports the remove action. PNG metadata are referred to as chunks. References to both critical and ancillary chunks are supported.
remove-private-chunks (boolean)
When set to true, remove all private chunks from the PNG file.
Default is false.
file-filter (list)
Default value is: ['*.png', '*.PNG']
fields
Example:
- name: tEXt
  remove: true
- name: eXIf
  remove: true
TIFF specific file settings
TIFF and JPG profiles share a lot of similarity given their underlying file format.
remove-private-tags (boolean)
When set to True, remove all private tags. Private tags are tags with index >= 32768.
file-filter (list)
Default value is: ['*.tif', '*.tiff', '*.TIF', '*.TIFF']
fields
The supported keywords are the ones defined in the Pillow package (https://github.com/python-pillow/Pillow). A list of keywords can be found here.
Example:
- name: DateTime
  increment-datetime: true
- name: Software
  remove: true
- name: Model
  replace-with: 'REDACTED'
NIfTI specific file settings
File profile key nifti.
Both NIfTI1 and NIfTI2 headers are supported by this profile. For a good introduction to either file format see NIfTI1 here and NIfTI2 here.
file-filter (list)
Default value is: ['*.nii', '*.nii.gz']
fields
Supported header fields are those documented in the nibabel package. While all header fields can be removed and replaced, this should be done with caution as most header fields contain important metadata. Typically, descrip and aux-file are the two header fields most likely to contain sensitive patient information.
Example:
- name: aux-file
  replace-with: 'file-1.json'
- name: descrip
  remove: true
XML specific file settings
File profile key xml.
file-filter (list)
Default value is: ['*.xml', '*.XML']
fields
The field name uses XPath to reference the DOM element in the tree. If the XPath returns multiple elements, each element will be processed with the specified transformation.
Example:
- name: /Patient/Patient_Date_Of_Birth
  replace-with: '1900-01-01'
- name: /Patient/Patient_Name
  remove: true
- name: /Patient/SUBJECT_ID
  hash: true
- name: /Patient/Visit/Scan/ScanTime
  increment-datetime: true
JSON specific file settings
File profile key json.
separator (string)
The optional separator string defines which character is used when referencing a nested element in the JSON file. By default, . is used, so that a nested element in the JSON file can be referenced by a combination of its keys and/or list index values.
For example, with the following JSON,
{
  "this": [
    {"package": "is"},
    "neat"
  ]
}
the value stored in "package" can be referenced with the key "this.0.package".
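The lookup described above can be illustrated with a short resolver that walks the JSON structure one path segment at a time. This is a sketch of the idea only; the function get_by_path is hypothetical and is not the profile's implementation.

```python
def get_by_path(document, path, separator="."):
    """Follow a separator-delimited path through nested dicts and lists."""
    current = document
    for part in path.split(separator):
        if isinstance(current, list):
            current = current[int(part)]  # list elements are addressed by index
        else:
            current = current[part]       # dict elements are addressed by key
    return current


doc = {"this": [{"package": "is"}, "neat"]}

print(get_by_path(doc, "this.0.package"))
# A different separator only changes how the path is split
print(get_by_path(doc, "this/1", separator="/"))
```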
file-filter (list)
Default value is: ["*.json", "*.JSON"]
fields
The field name uses a “dotty-notation” to reference the element in the JSON file hierarchy.
This file profile is regex compatible.
Example:
- name: timestamp
  increment-datetime: True
- name: info.SiteID
  remove: True
# regex-sub will be applied first in the fields list when processing the file
- name: label
  regex-sub:
    - input-regex: '(?P<current_label>.*)'
      output: '{current_label}_{subject.lastname}_{timestamp}'
      groups:
        - name: current_label
          replace-with: 'one_cool_cat'
        - name: subject.lastname
          keep: true
        - name: timestamp
          increment-datetime: True
- regex: info\.subject_raw\..*
  replace-with: "REDACTED"
- name: info.test
replace-with:
new: value
type: dict
Key/value text file settings
File profile key key-value-text-file.
An example for such a text file is:
ObjectType = Image
NDims = 3
BinaryData = True
BinaryDataByteOrderMSB = False
CompressedData = False
TransformMatrix = 1 0 0 0 1 0 0 0 1
Offset = 0 0 0
delimiter (string)
The regular expression used to split each line into its key/value pair. For example, with the file described above, delimiter should be defined as “\s+=\s+”.
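To see what such a delimiter does, the sketch below splits a few lines of the example file with Python's re.split. It assumes each line is split at the first match only; the helper parse_line is hypothetical.

```python
import re

DELIMITER = r"\s+=\s+"


def parse_line(line, delimiter=DELIMITER):
    """Split a line into (key, value), or return None if the delimiter is absent."""
    parts = re.split(delimiter, line.rstrip("\n"), maxsplit=1)
    if len(parts) != 2:
        return None  # a "bad line" that does not match the delimiter
    return parts[0], parts[1]


print(parse_line("NDims = 3"))
print(parse_line("TransformMatrix = 1 0 0 0 1 0 0 0 1"))
print(parse_line("a line without the delimiter"))
```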
encoding (string)
The optional string defining the encoding to be used to parse the input file. The same encoding will be used for saving the de-id file.
Default is to use the OS default encoding.
ignore-bad-lines (boolean)
When set to true, lines that do not match the provided delimiter are ignored and a warning is logged.
file-filter (list)
Default value is: ["*.mhd", "*.MHD"]
Table specific file settings
File profile key table.
This profile handles tabular files such as CSV and TSV and can be extended to other tabular formats such as XLS.
reader (string)
The string representing the suffix of the pandas reader to be used. Available suffixes are documented in pandas IO tools.
Default is None.
delimiter (string)
The string representing the delimiter to be used when parsing the input record. It is to be used in combination with the reader setting. For example, with reader=csv, setting delimiter=, parses CSV files, whereas setting delimiter=\t parses TSV files.
Default is None.
CSV specific file settings (csv)
File profile key csv.
Same as Table file settings but with different defaults.
reader (string)
Default value is “csv”.
delimiter (string)
Default value is “,”.
file-filter (list)
Default value is ['*.csv', '*.CSV']
TSV specific file settings (tsv)
File profile key tsv.
Same as Table file settings but with different defaults.
reader (string)
Default value is “csv”.
delimiter (string)
Default value is “\t”.
file-filter (list)
Default value is ['*.tsv', '*.TSV']
Filename specific file settings (filename)
File profile key filename.
This profile only implements the filenames functionality (described here), which allows files to be renamed, and does not have any specific attributes.
For example, the following de-id profile would not transform anything:
filename:
  filenames:
    - input-regex: (?P<filename>.*)
      output: '{filename}'
ZIP specific file settings (zip)
This profile handles .zip archives. It behaves differently from the other file profiles because it:
Unzips the archive.
Processes the file members of the archive using the de-id profile.
Re-zips the archive according to its configuration.
hash-subdirectories (boolean)
If set to true, this optional boolean applies a SHA-256 hash to the names of any subdirectories within the archive.
Default is false.
validate-zip-members (boolean)
If set to true, this optional boolean allows for partially processing the zip archive. By default, the de-id of a zip archive will fail if a file member cannot be de-identified (e.g. because no profile is associated with its type). Setting this boolean to true will skip the faulty file and not export it in the output archive.
Default is false.
fields
comment is currently the only field specific to ZIP archives.
Example:
- name: comment
  replace-with: 'FLYWHEEL'