import run
Import data from an external storage into a Flywheel project through a connector that is hosted and scaled within a cluster. Storages need to be registered by site admins on the UI (Interfaces menu / External Storage tab) or using `fw-beta admin storage create` in order to make them available for imports.
Usage
Rules
Import rules define which files are selected from the source storage and how they are stored in Flywheel. At least one rule is required for any file to match. Additional rules may be specified to achieve complex import behaviors.
Each rule is tied to a Flywheel hierarchy level where the matching files will be imported and can optionally have a list of include and/or exclude filters. Currently only acquisition level file imports are supported.
Rules are evaluated in order, and for every file the first rule is used where:
- any of the include filters matches (if given), and
- none of the exclude filters match (if given)
Files not matching any of the rules are skipped.
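As an illustration (not the actual connector code), first-match rule evaluation can be sketched in Python, assuming rules are plain dicts with optional `include`/`exclude` regex lists:

```python
import re

def rule_matches(rule, path):
    """Check one rule's include/exclude regex filters against a file path."""
    include = rule.get("include", [])
    exclude = rule.get("exclude", [])
    if include and not any(re.search(pat, path) for pat in include):
        return False  # include filters given, but none matched
    if any(re.search(pat, path) for pat in exclude):
        return False  # an exclude filter matched
    return True

def first_matching_rule(rules, path):
    """Rules are evaluated in order; the first match wins, else the file is skipped."""
    return next((rule for rule in rules if rule_matches(rule, path)), None)

rules = [
    {"include": [r"\.dcm$"]},   # rule 0: DICOM files only
    {"exclude": [r"\.log$"]},   # rule 1: everything except logs
]
```

With these two rules, a `.dcm` file matches rule 0, other files fall through to rule 1, and `.log` files match nothing and are skipped.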
Filters
Include and exclude filters are strings in the form `<field><operator><value>`.
Supported filter fields:
| Field | Type | Description |
|---|---|---|
| `path` | str | File path (relative) |
| `size` | int | File size |
| `ctime` | datetime | Created timestamp |
| `mtime` | datetime | Modified timestamp |
Supported filter operators depending on the value type:
| Operator | Description | Types |
|---|---|---|
| `=~` | regex match | str |
| `!~` | regex not match | str |
| `=` | equal | str, int, float, datetime |
| `!=` | not equal | str, int, float, datetime |
| `<` | less | int, float, datetime |
| `>` | greater | int, float, datetime |
| `<=` | less or equal | int, float, datetime |
| `>=` | greater or equal | int, float, datetime |
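A minimal sketch of how such a filter string could be parsed and evaluated (the `evaluate` helper and its type coercion are assumptions for illustration, not the CLI's actual implementation):

```python
import operator
import re

# Longest operators first so that '<=' is not mistaken for '<'.
OPS = {
    "=~": lambda val, pat: re.search(pat, val) is not None,
    "!~": lambda val, pat: re.search(pat, val) is None,
    "<=": operator.le,
    ">=": operator.ge,
    "!=": operator.ne,
    "=": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
}

def evaluate(filter_str, file_info):
    """Split '<field><operator><value>' and test it against file metadata."""
    for op_str, check in OPS.items():
        field, op, value = filter_str.partition(op_str)
        if op:
            actual = file_info[field.strip()]
            if isinstance(actual, int):  # coerce the value for size comparisons
                return check(actual, int(value))
            return check(actual, value.strip())
    raise ValueError(f"no operator found in {filter_str!r}")
```

For example, `evaluate("size<1024", {"size": 2048})` is false, while `evaluate(r"path=~\.dcm$", {"path": "a/scan.dcm"})` is true.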
Mappings
Imports require metadata in order to place files correctly within a Flywheel project's subject/session/acquisition hierarchy. Mappings are strings in the form `<template>=<pattern>` that allow extracting information from the source file's fields (e.g. `path`) into one or more Flywheel metadata fields (e.g. `session.label` and `acquisition.label`).
The default mapping allows extracting all required metadata fields from each file's path, assuming that files are stored in a compatible folder hierarchy:
`{path}={subject.label}/{session.label}/{acquisition.label}/*`
If any of the required fields are missing after extracting with one or more mappings, the file will be marked as `failed`, but the import will continue processing the remaining data on the storage. Use `--missing-meta skip` to skip such files instead, allowing the overall import to complete without any failures for this reason. Alternatively, use `--fail-fast` to halt the entire import when encountering an error.
Templates
Templates are similar to Python f-strings, formatting metadata associated with a file as a single string. Currently only the `path` source field is available, but more fields will be added in the future.
| Syntax | Description |
|---|---|
| `{field}` | Curly braces for referencing metadata fields |
| `{field/pat/sub}` | `re.sub` pattern for substituting parts of the value |
| `{field:format}` | f-string format spec (`strftime` for timestamps) |
| `{field\|default}` | Default to use instead of `"UNKNOWN"` (for `""`/`None`) |
Combining modifiers is allowed, in the order `/pat/sub` >> `:format` >> `|default`.
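A rough Python sketch of that modifier pipeline (the `render_field` helper is hypothetical; it only mirrors the documented order and the `"UNKNOWN"` default):

```python
import re
from datetime import datetime

def render_field(value, pat_sub=None, fmt=None, default=None):
    """Apply template modifiers in order: /pat/sub, then :format, then |default."""
    if pat_sub is not None:
        pattern, repl = pat_sub
        value = re.sub(pattern, repl, str(value))  # {field/pat/sub}
    if fmt is not None and value not in ("", None):
        value = format(value, fmt)  # {field:format}; strftime specs work for datetimes
    if value in ("", None):
        return default if default is not None else "UNKNOWN"  # {field|default}
    return value
```

For instance, `render_field(datetime(2024, 1, 2), fmt="%Y-%m-%d")` yields `"2024-01-02"`, and an empty value falls back to the default.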
Patterns
Patterns are simplified Python regexes tailored for scraping Flywheel metadata fields like `acquisition.label` from a string with capture groups.
| Syntax | Description |
|---|---|
| `{field}` | Curly braces for capturing (dot-notated/nested) fields |
| `[opt]` | Brackets for making parts of the match optional |
| `*` | Star to match any string of characters (like glob) |
| `.` | Dot to match a literal dot (like glob) |
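A simplified sketch of how such a pattern could be compiled into a plain python regex and used to capture fields (the `extract` helper is illustrative, not the CLI's implementation):

```python
import re

def pattern_to_regex(pattern):
    """Translate the simplified pattern syntax into a plain python regex."""
    fields = []
    def grab(match):
        fields.append(match.group(1))
        return "\0"  # sentinel, restored after escaping below
    out = re.sub(r"\{([\w.]+)\}", grab, pattern)   # {field} captures a field
    out = out.replace(".", r"\.")                   # '.' matches a literal dot
    out = out.replace("[", "(?:").replace("]", ")?")  # '[opt]' is optional
    out = out.replace("*", "[^/]*")                 # '*' matches like glob
    for index in range(len(fields)):
        out = out.replace("\0", f"(?P<f{index}>[^/]+)", 1)
    return out, fields

def extract(pattern, value):
    """Return captured fields as a dict, or None if the value doesn't match."""
    regex, fields = pattern_to_regex(pattern)
    match = re.fullmatch(regex, value)
    return dict(zip(fields, match.groups())) if match else None
```

Applied to the default mapping's pattern, `extract("{subject.label}/{session.label}/{acquisition.label}/*", "S01/2024-01-02/T1w/scan.dcm")` captures the three labels from the path.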
File Types
Running the import with the `--type=<type>` option allows setting the `file.type` metadata field in Flywheel to the specified value. Populating the type is useful for searching and for automatically running gears that are tied to data types.
Import has additional features when importing DICOM data with `--type=dicom`:
- files are parsed using `pydicom` (invalid DICOMs are treated as errors)
- series are grouped by directory and `SeriesInstanceUID` (multiple series per directory are treated as errors)
- series are uploaded as a single zipped file (except single files, e.g. enhanced DICOM)
- metadata fields have tag-based default mappings
- custom `--mappings` can reference DICOM tags
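The directory/series grouping rule can be illustrated with a small sketch that takes pre-parsed `(path, SeriesInstanceUID)` pairs (in the real import these come from `pydicom`; the `group_series` helper is hypothetical):

```python
from collections import defaultdict
from pathlib import PurePosixPath

def group_series(parsed_files):
    """Group (path, SeriesInstanceUID) pairs by directory; flag mixed directories."""
    uids_per_dir = defaultdict(set)
    members = defaultdict(list)
    for path, series_uid in parsed_files:
        directory = str(PurePosixPath(path).parent)
        uids_per_dir[directory].add(series_uid)
        members[(directory, series_uid)].append(path)
    # multiple series in one directory are treated as errors
    errors = sorted(d for d, uids in uids_per_dir.items() if len(uids) > 1)
    return dict(members), errors
```

Each `(directory, SeriesInstanceUID)` group would then be zipped and uploaded as one file, while directories containing more than one series are reported as errors.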
Advanced
For more complex import workflows, for example where files from multiple levels are needed or the pattern mappings vary based on the data type, additional rules can be passed as inline YAML using the `--rule` option:
fw-beta import run ... --rule "include: [path=~csv], mapping: ['path={sub}/{ses}/{acq}/{file}']"
Defaults
Simple imports can usually be expressed with a single rule. The first import rule is defined by default and can be adjusted with command-line options directly:
| Option | Default |
|---|---|
| `--include` | `[]` (include all files) |
| `--exclude` | `[]` (don't exclude any file) |
| `--mapping` | `path={subject}/{session}/{acquisition}/{file}` |
DICOM
When using `--type=dicom`, the import defaults the metadata fields based on DICOM tags. Default mappings (e.g. setting `subject.label` to the value of `PatientID`) are only applied if:
- the field (`subject.label`) is not yet populated via a custom `--mapping`
- the value (`PatientID`) is not empty
The default mappings for DICOM:
- `subject.label` - `PatientID`
- `subject.firstname` - split from `PatientName`
- `subject.lastname` - split from `PatientName`
- `subject.sex` - `PatientSex`
- `session.uid` - `StudyInstanceUID`
- `session.label` - `StudyDescription`
  - fallback to `session.timestamp`
  - fallback to `StudyInstanceUID`
- `session.age` - from `PatientAge` (converted to seconds)
  - fallback to delta between `acquisition.timestamp` and `PatientBirthDate`
- `session.weight` - `PatientWeight`
- `session.operator` - `OperatorsName`
- `session.timestamp` - from `StudyDate` & `StudyTime`
  - fallback to `SeriesDate` & `SeriesTime`
  - fallback to `AcquisitionDateTime`
  - fallback to `AcquisitionDate` & `AcquisitionTime`
  - with respect to `TimezoneOffsetFromUTC`
- `acquisition.uid` - `SeriesInstanceUID`
- `acquisition.label` - `SeriesNumber - SeriesDescription`
  - fallback to `SeriesNumber - ProtocolName`
  - fallback to `acquisition.timestamp` (formatted as `%Y-%m-%dT%H:%M:%S`)
  - fallback to `SeriesInstanceUID`
  - only prefixed if `SeriesNumber` is set
- `acquisition.timestamp` - `AcquisitionDateTime`
  - fallback to `AcquisitionDate` & `AcquisitionTime`
  - fallback to `SeriesDate` & `SeriesTime`
  - fallback to `StudyDate` & `StudyTime`
  - with respect to `TimezoneOffsetFromUTC`
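The `PatientAge` conversion and birth-date fallback for `session.age` can be sketched as follows (the month/year unit lengths are approximations assumed here for illustration, not taken from the source):

```python
from datetime import datetime

# Rough unit lengths in seconds; the M/Y approximations are assumptions.
AGE_UNITS = {"D": 86400, "W": 7 * 86400, "M": 30 * 86400, "Y": 365 * 86400}

def session_age_seconds(patient_age=None, acq_timestamp=None, birth_date=None):
    """Convert a DICOM PatientAge string (AS VR, e.g. '035Y') to seconds,
    falling back to the delta between the acquisition timestamp and
    PatientBirthDate when PatientAge is absent."""
    if patient_age:
        return int(patient_age[:-1]) * AGE_UNITS[patient_age[-1].upper()]
    if acq_timestamp and birth_date:
        born = datetime.strptime(birth_date, "%Y%m%d")  # DICOM DA format
        return int((acq_timestamp - born).total_seconds())
    return None
```

So `"035Y"` becomes 35 years in seconds, and without `PatientAge` the age is derived from the acquisition timestamp and birth date.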
Settings
| Option | Value | Description |
|---|---|---|
| `--overwrite` | `auto` | Overwrite existing files if changed |
| | `never` | Do not overwrite existing files even if changed |
| | `always` | Overwrite existing files even if unchanged |
| `--dry-run` | (flag) | Run without actually uploading data (for testing) |
| `--limit` | `N` | Stop after processing N files (for testing) |
| `--fail-fast` | `N[%]` | Stop processing after reaching a failure threshold |
| `--missing-meta` | `fail` | Fail items with missing metadata |
| | `skip` | Skip items with missing metadata |
| `--storage-config` | (YAML) | Override default storage config |
Some storage settings may be overridden when running an import, which is useful, for example, for reading data from the same bucket but from a different prefix:
fw-beta import run ... --storage-config "prefix: other/prefix"
Output options
The same output options are available as for `fw-beta import get`, the only difference being that `run` follows the progress of the import until it completes. Use the `--no-wait` option to exit immediately instead.
Pressing `CTRL+C` stops the progress monitoring in the CLI, but the import operation will continue running on the cluster. Use `fw-beta import get` to check the import's status or to resume monitoring its progress later.
Referencing files in place
Importing large datasets into Flywheel can incur substantial network costs while transferring and storage costs later for keeping a copy of the data.
Referencing files in place allows importing data into Flywheel without transferring any bytes from one cloud storage bucket to another, saving time, network and storage costs.
The ref-in-place workflow requires that the bucket is registered in Flywheel Core-API as a storage provider, and that this provider is used for creating a storage via `fw-beta admin storage create`.