Defining new pipelines

A pipeline is a sequence of xia2/DIALS commands used to process a Singla dataset. The results are saved in a separate directory named after the pipeline (e.g., the processed/my_pipeline directory). All existing pipelines are defined in the AutoED default configuration file; the defined_pipelines field in that file shows how a pipeline definition is structured.

When AutoED runs a pipeline, it creates a pipeline directory and a script file (a bash script for local processing or a JSON file for SLURM processing). The pipeline script is generated from the script field in the pipeline definition, following the conventions described below. You can view the generated bash script or JSON file directly to check that your pipeline was generated correctly.

Steps such as finding the beam position, plotting spot figures, and converting to NeXus are assumed to be done automatically, so they are not considered part of a pipeline. A pipeline is executed only after these steps have completed.

To understand how a pipeline is defined, let us look at the definition of the default pipeline in the configuration file.

{
    "pipeline_name": "default",
    "type": "xia2",
    "run_condition": true,
    "script": [
        "xia2 image={nexus_file}",
        "goniometer.axis=0,-1,0  dials.fix_distance=True",
        "dials.masking.d_max=9",
        "xia2.settings.remove_blanks=True",
        "input.gain={g.gain};"
    ]
},

Here, the field pipeline_name defines both the name of the pipeline and the name of the pipeline output directory. Do not use space or tab characters when naming a pipeline; use the underscore character _ instead. The field type specifies what kind of pipeline is being defined, currently either dials or xia2. It is mainly used when generating reports, where it tells AutoED what kind of output to expect. The field run_condition allows the pipeline to run only when certain conditions are met (for example, when some parameter is set in the global configuration file or in the local JSON metadata file). If it is set to true, the pipeline always runs, provided it is set to run in the run_pipelines field of the global configuration file. For more details on conditional pipelines, see below. Finally, the script field defines a bash script template that is executed when AutoED runs the pipeline.
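
As a rough illustration of the fields described above, the following Python sketch builds a new pipeline definition and checks it against the naming and type rules before it is added to the configuration file. The check_pipeline helper and the my_pipeline entry are hypothetical and are not part of AutoED; treating all four fields as required is an assumption based on the default pipeline shown above.

```python
import json

# Fields seen in the default pipeline definition (assumed to be required).
REQUIRED_FIELDS = ("pipeline_name", "type", "run_condition", "script")


def check_pipeline(definition):
    """Return a list of problems found in a pipeline definition (empty if none)."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in definition]
    name = str(definition.get("pipeline_name", ""))
    if any(ch.isspace() for ch in name):
        problems.append("pipeline_name must not contain spaces or tabs")
    if "type" in definition and definition["type"] not in ("dials", "xia2"):
        problems.append("type must be 'dials' or 'xia2'")
    if "script" in definition and not isinstance(definition["script"], list):
        problems.append("script must be a list of strings")
    return problems


# A hypothetical pipeline that reuses fragments of the default definition.
my_pipeline = {
    "pipeline_name": "my_pipeline",
    "type": "xia2",
    "run_condition": True,
    "script": ["xia2 image={nexus_file}", "input.gain={g.gain};"],
}

print(check_pipeline(my_pipeline))
# JSON form of the entry, as it would appear in the configuration file.
print(json.dumps(my_pipeline, indent=4))
```

A check like this catches a pipeline name containing spaces before AutoED tries to create the output directory.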

Writing the script field in a pipeline definition

The script field is a template for the bash commands you want to run during pipeline execution. It is a list of strings rather than a single string, so that a long sequence of commands can be split across multiple lines for better readability. There are a few conventions you should be aware of when writing the script field.

  • All strings in the list are concatenated into a single string with spaces between them. If you define a script as

    "script": ["command1",
               "option_1=abc",
               "option_2=123"]

    it will be concatenated into command1 option_1=abc option_2=123. If you have a long list of options (e.g., for a DIALS command), splitting them across separate lines is a good idea.

  • If you do not want a space inserted when two strings are concatenated, end the first string with %%. For example,

    "script": ["command1",
               "option_1=%%",
               "abc"]

    will be concatenated into command1 option_1=abc.

  • Since all lines in the script list are concatenated into one, you should use the semicolon ; to explicitly terminate all bash commands.

  • There is a list of variables you can use in curly brackets (just like in Python f-strings). After concatenating the script, AutoED treats the resulting string as an f-string and replaces the variables in curly brackets {} with their actual values (which depend on the dataset). The available variables are the following:

    • {nexus_file} - The full path to the generated NeXus file. Use it as the input when importing into DIALS or as the xia2 image parameter.

    • {processed_dir} - The full output path for the given pipeline (e.g., /path/to/watched/dir/processed/pipeline_name). You can use {processed_dir}/imported.expt to get the imported.expt file, etc.

    • {imported_file} - Equivalent to {processed_dir}/imported.expt mentioned above.

    • {refl_file} - Equivalent to {processed_dir}/strong.refl.

    • {m.some_field} - Access some_field in the dataset metadata JSON file. For example, if the metadata file has a field space_group, you can access it with {m.space_group}.

    • {g.some_field} - Access values in the AutoED global configuration file. For example, if the field gain is defined in the global configuration file, use {g.gain}.

    • {unit_cell} - A shortcut for {m.unit_cell[0],m.unit_cell[1],..m.unit_cell[5]}. In other words, if the field unit_cell is defined in the dataset JSON metadata file, then {unit_cell} will make a string of this field with comma-separated values (the way this parameter is provided to xia2/DIALS).
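
The concatenation and substitution conventions above can be sketched in Python as follows. This is only an illustration under stated assumptions, not AutoED's actual implementation: join_script and render_script are hypothetical helpers, str.format stands in for the f-string evaluation, and the sample path and values are made up.

```python
from types import SimpleNamespace


def join_script(parts):
    """Join script strings with spaces; a trailing %% suppresses the space."""
    joined = ""
    for part in parts:
        if part.endswith("%%"):
            joined += part[:-2]  # drop the marker and add no space
        else:
            joined += part + " "
    return joined.rstrip()


def render_script(parts, m, g, **variables):
    """Fill {name}, {m.field} and {g.field} placeholders in the joined script."""
    cell = getattr(m, "unit_cell", None) or []
    unit_cell = ",".join(str(value) for value in cell)
    return join_script(parts).format(m=m, g=g, unit_cell=unit_cell, **variables)


# Made-up metadata and configuration values, for illustration only.
m = SimpleNamespace(unit_cell=[10, 20, 30, 90, 90, 90], space_group="P1")
g = SimpleNamespace(gain=1.5)

script = render_script(
    ["xia2 image={nexus_file}", "input.gain=%%", "{g.gain};", "echo {unit_cell};"],
    m, g, nexus_file="/data/sample.nxs",
)
print(script)
```

Printing the rendered string in this way mirrors the advice to inspect the generated script: it makes a missing semicolon or an unwanted space easy to spot.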

Conditional pipelines

To define a conditional pipeline, you can use the m and g variables introduced above (without curly brackets) in a Python conditional expression. The expression is written as a string and should evaluate to a boolean. For example, the run condition of the user pipeline (defined in the default configuration file) checks whether at least one of the unit_cell and space_group fields is defined in the metadata JSON file.

"run_condition": "(m.unit_cell is not None) or (m.space_group is not None)"