augur filter


Filter and subsample a sequence set.

usage: augur filter [-h] --metadata FILE [--sequences FILE]
                    [--sequence-index FILE] [--metadata-chunk-size N]
                    [--metadata-id-columns COLUMN [COLUMN ...]]
                    [--metadata-delimiters DELIMITER [DELIMITER ...]]
                    [--query QUERY] [--query-columns COLUMN [COLUMN ...]]
                    [--min-date DATE] [--max-date DATE]
                    [--exclude-ambiguous-dates-by LEVEL]
                    [--exclude FILE [FILE ...]]
                    [--exclude-where CONDITION [CONDITION ...]]
                    [--exclude-all] [--include FILE [FILE ...]]
                    [--include-where CONDITION [CONDITION ...]]
                    [--min-length N] [--max-length N] [--non-nucleotide]
                    [--group-by COLUMN [COLUMN ...]]
                    [--sequences-per-group N | --subsample-max-sequences N]
                    [--probabilistic-sampling | --no-probabilistic-sampling]
                    [--priority FILE] [--subsample-seed N] [--output FILE]
                    [--output-metadata FILE] [--output-strains FILE]
                    [--output-log FILE]
                    [--empty-output-reporting {error,warn,silent}]

Inputs

Metadata and sequences to be filtered.

--metadata

Sequence metadata.

--sequences, -s

Sequences in FASTA or VCF format.

--sequence-index
Sequence composition report generated by augur index. If not provided, an index will be created on the fly.

--metadata-chunk-size
Maximum number of metadata records to read into memory at a time. Increasing this number can speed up filtering at the cost of more memory used.

Default: 100000

--metadata-id-columns
Names of possible metadata columns containing strain identifier information, ordered by priority. Only one ID column will be inferred.

Default: (‘strain’, ‘name’)

--metadata-delimiters
Delimiters to accept when reading a metadata file. Only one delimiter will be inferred.

Default: (‘,’, ‘\t’)

Metadata filters

Filters to apply to metadata.

--query
Filter strains by attribute. Uses Pandas DataFrame querying, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-query for syntax. (e.g., --query "country == 'Colombia'" or --query "(country == 'USA' & (division == 'Washington'))")

--query-columns
Use alongside --query to specify columns and data types in the format ‘column:type’, where type is one of [‘bool’, ‘float’, ‘int’, ‘str’]. Automatic type inference will be attempted on all unspecified columns used in the query. Example: region:str coverage:float.
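
As a rough illustration of how such a query string selects rows, the sketch below applies the second example expression to a small in-memory table using pandas directly. The table contents and variable names are made up for illustration, and this is not a description of augur filter's internal code path.

```python
import pandas as pd

# Hypothetical metadata table; the values are illustrative only.
metadata = pd.DataFrame(
    {
        "strain": ["A", "B", "C"],
        "country": ["USA", "USA", "Colombia"],
        "division": ["Washington", "Oregon", "Antioquia"],
    }
)

# Same expression as in the --query example above.
query = "(country == 'USA' & (division == 'Washington'))"
selected = metadata.query(query)

# Collect the identifiers of the rows that matched the query.
selected_strains = selected["strain"].tolist()
print(selected_strains)
```

Only rows for which the whole expression evaluates to true are kept, which is the behavior to expect from a --query filter.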

--min-date
Minimal cutoff for date, the cutoff date is inclusive; may be specified as:

  1. an Augur-style numeric date with the year as the integer part (e.g. 2020.42) or

  2. a date in ISO 8601 date format (i.e. YYYY-MM-DD) (e.g. ‘2020-06-04’) or

  3. a backwards-looking relative date in ISO 8601 duration format with optional P prefix (e.g. ‘1W’, ‘P1W’)

--max-date
Maximal cutoff for date, the cutoff date is inclusive; may be specified as:

  1. an Augur-style numeric date with the year as the integer part (e.g. 2020.42) or

  2. a date in ISO 8601 date format (i.e. YYYY-MM-DD) (e.g. ‘2020-06-04’) or

  3. a backwards-looking relative date in ISO 8601 duration format with optional P prefix (e.g. ‘1W’, ‘P1W’)
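
The three accepted cutoff formats can be sketched as a small parser. This is an illustration only: the function names parse_cutoff and passes_min_date are hypothetical, the linear mapping of the numeric fraction onto the days of the year is an assumption, and only day (D) and week (W) durations are handled here.

```python
import re
from datetime import date, timedelta


def parse_cutoff(value: str, today: date) -> date:
    """Turn a cutoff string into a calendar date (illustrative only)."""
    # Numeric date such as "2020.42": the integer part is the year.
    if re.fullmatch(r"\d+(\.\d+)?", value):
        year = int(float(value))
        fraction = float(value) - year
        days_in_year = (date(year + 1, 1, 1) - date(year, 1, 1)).days
        # Assumption: the fraction maps linearly onto the days of the year.
        return date(year, 1, 1) + timedelta(days=int(fraction * days_in_year))
    # Backwards-looking duration such as "1W" or "P1W" (only D and W here).
    match = re.fullmatch(r"P?(\d+)([DW])", value)
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        return today - timedelta(days=amount * (7 if unit == "W" else 1))
    # ISO 8601 calendar date such as "2020-06-04".
    return date.fromisoformat(value)


def passes_min_date(sample_date: date, cutoff: date) -> bool:
    # The cutoff is inclusive, so a sample dated exactly on it passes.
    return sample_date >= cutoff
```

For a --max-date check the comparison would be reversed (sample_date <= cutoff), again inclusive.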

--exclude-ambiguous-dates-by

Possible choices: any, day, month, year

Exclude ambiguous dates by day (e.g., 2020-09-XX), month (e.g., 2020-XX-XX), year (e.g., 200X-10-01), or any date fields. An ambiguous year makes the corresponding month and day ambiguous, too, even if those fields have unambiguous values (e.g., “201X-10-01”). Similarly, an ambiguous month makes the corresponding day ambiguous (e.g., “2010-XX-01”).
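
The cascading rule described above can be sketched as follows. The function name is_ambiguous is hypothetical, and the sketch assumes every date has exactly three dash-separated fields with X marking unknown digits.

```python
def is_ambiguous(date_string: str, level: str) -> bool:
    """Return True if the date is ambiguous at the given level."""
    year, month, day = date_string.split("-")
    year_ambiguous = "X" in year
    # An ambiguous year also makes the month ambiguous.
    month_ambiguous = year_ambiguous or "X" in month
    # An ambiguous month (or year) also makes the day ambiguous.
    day_ambiguous = month_ambiguous or "X" in day
    return {
        "year": year_ambiguous,
        "month": month_ambiguous,
        "day": day_ambiguous,
        # "any" is true as soon as any field contains an X.
        "any": day_ambiguous,
    }[level]
```

With this rule, “201X-10-01” counts as ambiguous by day even though its day field is fully specified.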

--exclude
File(s) with list of strain IDs to exclude. The ID column is determined by --metadata-id-columns.

--exclude-where
Exclude strains matching these conditions. Ex: “host=rat” or “host!=rat”. Multiple values are processed as OR (matching any of those specified will be excluded), not AND.
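
The OR semantics can be illustrated with a short sketch. The helper names parse_condition and matches_any are hypothetical, and the exact, case-sensitive string comparison is a simplifying assumption rather than a statement about how augur filter compares values.

```python
def parse_condition(condition: str):
    """Split "host=rat" or "host!=rat" into (column, operator, value)."""
    if "!=" in condition:
        column, value = condition.split("!=", 1)
        return column, "!=", value
    column, value = condition.split("=", 1)
    return column, "=", value


def matches_any(record: dict, conditions: list) -> bool:
    """True if the record satisfies at least one condition (OR, not AND)."""
    for condition in conditions:
        column, operator, value = parse_condition(condition)
        is_equal = record.get(column) == value
        if (operator == "=" and is_equal) or (operator == "!=" and not is_equal):
            return True
    return False
```

A record with host "rat" and country "USA" matches the pair of conditions host=rat and country=Peru, because one matching condition is enough.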

--exclude-all
Exclude all strains by default. Use this with the include arguments to select a specific subset of strains.

Default: False

--include
File(s) with list of strain IDs to include regardless of priorities, subsampling, or absence of an entry in --sequences. The ID column is determined by --metadata-id-columns.

--include-where
Include strains with these values. Ex: host=rat. Multiple values are processed as OR (having any of those specified will be included), not AND. This rule is applied last and ensures any strains matching these rules will be included regardless of priorities, subsampling, or absence of an entry in --sequences.

Sequence filters

Filters to apply to sequence data.

--min-length
Minimal length of the sequences, only counting standard nucleotide characters A, C, G, or T (case-insensitive).

--max-length
Maximum length of the sequences, only counting standard nucleotide characters A, C, G, or T (case-insensitive).
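
The length used by both options counts only A, C, G, and T, in either case. A minimal sketch of that count, with a hypothetical function name:

```python
def count_standard_nucleotides(sequence: str) -> int:
    """Count A, C, G, and T characters, ignoring case."""
    return sum(1 for base in sequence.upper() if base in "ACGT")
```

Under this definition, a sequence such as "ACGTNN-acgt" has a length of 8, because the N and gap characters are not counted.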

--non-nucleotide

Exclude sequences that contain illegal characters.

Default: False

Subsampling

Options to subsample filtered data.

--group-by
Categories by which to group strains for subsampling. Notes:
  1. Grouping by [‘month’, ‘week’, ‘year’] is only supported when there is a ‘date’ column in the metadata.

  2. ‘week’ uses the ISO week numbering system, where a week starts on a Monday and ends on a Sunday.

  3. ‘month’ and ‘week’ grouping cannot be used together.

  4. Custom columns [‘month’, ‘week’, ‘year’] in the metadata are ignored for grouping. Please rename them if you want to use their values for grouping.
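
ISO week numbering can be surprising around the new year, since the first days of January may belong to the last week of the previous ISO year. The sketch below uses Python's standard library to show this; the function name week_group and the (year, week) tuple as a group key are assumptions for illustration, not augur filter's internal representation.

```python
from datetime import date


def week_group(sample_date: date):
    """Return (ISO year, ISO week number) for a date."""
    iso = sample_date.isocalendar()
    # Index access works whether isocalendar() returns a tuple or a named tuple.
    return (iso[0], iso[1])


# Sunday 2021-01-03 still belongs to the last ISO week of 2020.
print(week_group(date(2021, 1, 3)))
# Monday 2021-01-04 starts the first ISO week of 2021.
print(week_group(date(2021, 1, 4)))
```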

--sequences-per-group
Subsample to no more than this number of strains per category.

--subsample-max-sequences
Subsample to no more than this number of strains; can be used without --group-by.

--probabilistic-sampling
Allow probabilistic sampling during subsampling. This is useful when there are more groups than requested strains. This option only applies when --subsample-max-sequences is provided.

Default: True

--no-probabilistic-sampling

Default: True

--priority
Tab-delimited file with list of priority scores for strains (e.g., “<strain ID>\t<priority>”) and no header. When scores are provided, Augur converts scores to floating point values, sorts strains within each subsampling group from highest to lowest priority, and selects the top N strains per group where N is the calculated or requested number of strains per group. Higher numbers indicate higher priority. Since priorities represent relative values between strains, these values can be arbitrary. The ID column is determined by --metadata-id-columns.
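
The selection rule described above (convert to float, sort each group from highest to lowest, keep the top N) can be sketched like this. The function name select_by_priority and its inputs are hypothetical, and treating strains without a score as lowest priority is an assumption made only for this sketch.

```python
from collections import defaultdict


def select_by_priority(strain_groups: dict, priorities: dict, n: int) -> dict:
    """Pick the n highest-priority strains in each group."""
    by_group = defaultdict(list)
    for strain, group in strain_groups.items():
        by_group[group].append(strain)

    selected = {}
    for group, strains in by_group.items():
        # Scores arrive as text and are converted to floats before sorting.
        # Assumption: a strain with no score sorts last.
        ranked = sorted(
            strains,
            key=lambda strain: float(priorities.get(strain, "-inf")),
            reverse=True,
        )
        selected[group] = ranked[:n]
    return selected
```

Because only the ordering matters, multiplying every score by the same positive constant would leave the selection unchanged.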

--subsample-seed
Random number generator seed to allow reproducible subsampling (with same input data).
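
The effect of a fixed seed can be illustrated with Python's standard random module. This is only a sketch of the general idea; the function name subsample is hypothetical, and the random number generator that augur filter actually uses is not specified here.

```python
import random


def subsample(strains, n, seed):
    """Draw up to n strains; the same seed and input give the same draw."""
    rng = random.Random(seed)
    # Sorting first makes the result independent of the input order.
    return rng.sample(sorted(strains), min(n, len(strains)))


strains = ["s1", "s2", "s3", "s4", "s5"]
first_draw = subsample(strains, 3, seed=42)
second_draw = subsample(strains, 3, seed=42)
```

Here first_draw and second_draw are identical. Changing the input data can change the result even when the seed stays the same.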

Outputs

Options related to outputs. At least one of the possible representations of filtered data (--output, --output-metadata, --output-strains) is required.

--output, --output-sequences, -o

Filtered sequences in FASTA format.

--output-metadata

Metadata for strains that passed filters.

--output-strains
List of strain IDs that passed filters (no header). The ID column is determined by --metadata-id-columns.

--output-log
Tab-delimited file with one row for each filtered strain and the reason it was filtered. Keyword arguments used for a given filter are reported in JSON format in a kwargs column.

--empty-output-reporting

Possible choices: error, warn, silent

How should empty outputs be reported when no strains pass filtering and/or subsampling.

Default: error

Guides

Below are some examples of using augur filter to sample data.

Filtering

The filter command allows you to select various subsets of your input data for different types of analysis. A simple example use of this command would be

augur filter \
  --sequences data/sequences.fasta \
  --metadata data/metadata.tsv \
  --min-date 2012 \
  --output-sequences filtered_sequences.fasta \
  --output-metadata filtered_metadata.tsv

This command will select all sequences with collection date in 2012 or later. The filter command has a large number of options that allow flexible filtering for many common situations. One such use-case is the exclusion of sequences that are known to be outliers (e.g. because of sequencing errors, cell-culture adaptation, …). These can be specified in a separate text file (e.g. exclude.txt):

BRA/2016/FC_DQ75D1
COL/FLR_00034/2015
...

To drop such strains, you can pass the filename to --exclude:

augur filter \
  --sequences data/sequences.fasta \
  --metadata data/metadata.tsv \
  --min-date 2012 \
  --exclude exclude.txt \
  --output-sequences filtered_sequences.fasta \
  --output-metadata filtered_metadata.tsv

Subsampling within augur filter

Another common filtering operation is subsetting of data to achieve a more even spatio-temporal distribution or to cut down data set size to more manageable numbers. The filter command allows you to select a specific number of sequences from specific groups, for example one sequence per month from each country:

augur filter \
  --sequences data/sequences.fasta \
  --metadata data/metadata.tsv \
  --min-date 2012 \
  --exclude exclude.txt \
  --group-by country year month \
  --sequences-per-group 1 \
  --output-sequences subsampled_sequences.fasta \
  --output-metadata subsampled_metadata.tsv

Subsampling using multiple augur filter commands

There are some subsampling strategies in which a single call to augur filter does not suffice. One such strategy is “tiered subsampling”. In this strategy, mutually exclusive sets of filters, each representing a “tier”, are sampled with different subsampling rules. This is commonly used to create geographic tiers. Consider this subsampling scheme:

Sample 100 sequences from Washington state and 50 sequences from the rest of the United States.

This cannot be done in a single call to augur filter. Instead, it can be decomposed into multiple schemes, each handled by a single call to augur filter. Additionally, there is an extra step to combine the intermediate samples.

  1. Sample 100 sequences from Washington state.

  2. Sample 50 sequences from the rest of the United States.

  3. Combine the samples.

Calling augur filter multiple times

A basic approach is to run the augur filter commands directly. This works well for ad-hoc analyses.

# 1. Sample 100 sequences from Washington state
augur filter \
  --sequences sequences.fasta \
  --metadata metadata.tsv \
  --query "state == 'WA'" \
  --subsample-max-sequences 100 \
  --output-strains sample_strains_state.txt

# 2. Sample 50 sequences from the rest of the United States
augur filter \
  --sequences sequences.fasta \
  --metadata metadata.tsv \
  --query "state != 'WA' & country == 'USA'" \
  --subsample-max-sequences 50 \
  --output-strains sample_strains_country.txt

# 3. Combine using augur filter
augur filter \
  --sequences sequences.fasta \
  --metadata metadata.tsv \
  --exclude-all \
  --include sample_strains_state.txt \
            sample_strains_country.txt \
  --output-sequences subsampled_sequences.fasta \
  --output-metadata subsampled_metadata.tsv

Each intermediate sample is represented by a strain list file obtained from --output-strains. The final step uses augur filter with --exclude-all and --include to sample the data based on the intermediate strain list files. If the same strain appears in both files, augur filter will only write it once in each of the final outputs.
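
Conceptually, the combine step takes the union of the intermediate strain lists, keeping each strain once. The sketch below shows that idea on in-memory lists; the function name combine_strain_lists and the strain names are hypothetical, and it does not read the files or reproduce augur filter's output ordering.

```python
def combine_strain_lists(*strain_lists):
    """Merge strain lists, keeping the first occurrence of each strain."""
    seen = set()
    combined = []
    for strains in strain_lists:
        for strain in strains:
            if strain not in seen:
                seen.add(strain)
                combined.append(strain)
    return combined


# Hypothetical contents of the two intermediate strain list files.
state_sample = ["USA/WA-1/2021", "USA/WA-2/2021"]
country_sample = ["USA/OR-1/2021", "USA/WA-1/2021"]
combined = combine_strain_lists(state_sample, country_sample)
```

The duplicated strain appears only once in combined, mirroring the deduplication described above.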

Generalizing subsampling in a workflow

The approach above can be cumbersome with more intermediate samples. To generalize this process and allow for more flexibility, a workflow management system can be used. The following examples use Snakemake.

  1. Add a section in the config file.

subsampling:
  state: --query "state == 'WA'" --subsample-max-sequences 100
  country: --query "state != 'WA' & country == 'USA'" --subsample-max-sequences 50

  2. Add two rules in a Snakefile. If you are building a standard Nextstrain workflow, the output files should be used as input to sequence alignment. See Parts of a whole to learn more about the placement of this step within a workflow.

# 1. Sample 100 sequences from Washington state
# 2. Sample 50 sequences from the rest of the United States
rule intermediate_sample:
    input:
        metadata = "data/metadata.tsv",
    output:
        strains = "results/sample_strains_{sample_name}.txt",
    params:
        augur_filter_args = lambda wildcards: config.get("subsampling", {}).get(wildcards.sample_name, "")
    shell:
        """
        augur filter \
            --metadata {input.metadata} \
            {params.augur_filter_args} \
            --output-strains {output.strains}
        """

# 3. Combine using augur filter
rule combine_intermediate_samples:
    input:
        sequences = "data/sequences.fasta",
        metadata = "data/metadata.tsv",
        intermediate_sample_strains = expand("results/sample_strains_{sample_name}.txt", sample_name=list(config.get("subsampling", {}).keys()))
    output:
        sequences = "results/subsampled_sequences.fasta",
        metadata = "results/subsampled_metadata.tsv",
    shell:
        """
        augur filter \
            --sequences {input.sequences} \
            --metadata {input.metadata} \
            --exclude-all \
            --include {input.intermediate_sample_strains} \
            --output-sequences {output.sequences} \
            --output-metadata {output.metadata}
        """

  3. Run Snakemake targeting the second rule.

snakemake combine_intermediate_samples

Explanation:

  • The configuration section consists of one entry per intermediate sample in the format sample_name: <augur filter arguments>.

  • The first rule is run once per intermediate sample using wildcards and an input function. The output of each run is the sampled strain list.

  • The second rule uses expand() to define input as all the intermediate sampled strain lists, which are passed directly to --include as done in the previous example.
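
To make the mapping from config to files more concrete, the following plain-Python sketch mimics what the two rules derive from the subsampling section. It does not use Snakemake itself: the list comprehension stands in for expand(), and the augur_filter_args function is a hypothetical stand-in for the lambda in the first rule.

```python
# The same subsampling section as in the config file, as a Python dict.
config = {
    "subsampling": {
        "state": "--query \"state == 'WA'\" --subsample-max-sequences 100",
        "country": "--query \"state != 'WA' & country == 'USA'\" --subsample-max-sequences 50",
    }
}

# One intermediate strain list per configured sample name.
sample_names = list(config.get("subsampling", {}).keys())
strain_files = [f"results/sample_strains_{name}.txt" for name in sample_names]


def augur_filter_args(sample_name: str) -> str:
    """Look up the filter arguments for one intermediate sample."""
    return config.get("subsampling", {}).get(sample_name, "")


print(strain_files)
print(augur_filter_args("state"))
```

Adding a new entry to the config section adds one more file to strain_files, which is why new tiers need no changes to the rules.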

It is easy to add or remove intermediate samples. The configuration above can be updated to add another tier in between state and country:

subsampling:
  state: --query "state == 'WA'" --subsample-max-sequences 100
  neighboring_states: --query "state in ['CA', 'ID', 'OR', 'NV']" --subsample-max-sequences 75
  country: --query "country == 'USA' & state not in ['WA', 'CA', 'ID', 'OR', 'NV']" --subsample-max-sequences 50