
annot-tsv(1) Bioinformatics tools annot-tsv(1)

NAME

annot-tsv - transfer annotations from one TSV (tab-separated values) file into another

SYNOPSIS

annot-tsv [OPTIONS]

DESCRIPTION

The program finds overlaps in two sets of genomic regions (for example two CNV call sets) and annotates regions of the target file (-t, --target-file) with information from overlapping regions of the source file (-s, --source-file).

It can transfer one or more columns (-f, --transfer), and the transfer can be made conditional on matching values in one or more columns (-m, --match). In addition to column transfer (-f) and special annotations (-a, --annotate), the program can operate in a simple grep-like mode and print matching lines (when neither -f nor -a is given) or drop matching lines (-x, --drop-overlaps).

All indexes and coordinates are 1-based and inclusive.
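
Because coordinates are 1-based and inclusive, a region's length is end - beg + 1, and two regions that share only a single position still overlap by one base. The arithmetic can be sketched in Python as follows. The function names are hypothetical and this is an illustration of the convention, not the program's own code.

```python
def region_length(beg, end):
    """Length of a 1-based, inclusive region [beg, end]."""
    return end - beg + 1


def overlap_length(beg1, end1, beg2, end2):
    """Number of positions shared by two 1-based, inclusive regions."""
    return max(0, min(end1, end2) - max(beg1, beg2) + 1)


print(region_length(150, 200))             # 51
print(overlap_length(100, 200, 150, 200))  # 51
print(overlap_length(100, 200, 200, 300))  # 1 (position 200 is shared)
print(overlap_length(100, 200, 201, 300))  # 0 (adjacent, no shared position)
```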

OPTIONS

Common Options

-c, --core SRC:TGT

List of names of the core columns, given in the order chromosome, start position, end position, irrespective of the header names and the order in which the columns appear in the source or target files (for example "chr,beg,end:CHROM,START,END"). If both files use the same header names, the TGT names can be omitted (for example "chr,beg,end"). If the SRC or TGT file has no header, 1-based indexes can be given instead (for example "chr,beg,end:3,1,2"). Note that regions are not required; the program can also work with a list of positions (for example "chr,beg,end:CHROM,POS,POS").

-f, --transfer SRC:TGT

Comma-separated list of columns to transfer. If the SRC column does not exist, its name is interpreted as the default value to fill in when a match is found; a dot (".") is filled in when no match is found. If the TGT column does not exist, a new column is created. If the TGT column already exists, its values are overwritten when an overlap is found and left as is otherwise.

-m, --match SRC:TGT

The columns whose values are required to be identical in the source and target records

-o, --output FILE

Output file name; by default the result is printed to standard output

-s, --source-file FILE

Source file with annotations to transfer

-t, --target-file FILE

Target file to be extended with annotations from -s, --source-file

Other options

--allow-dups

Add the same annotations multiple times if multiple overlaps are found

--max-annots INT

Add at most INT annotations per column to save time when many overlaps are found with a single region

--version

Print version string and exit

-a, --annotate LIST

Add one or more special annotations, each optionally followed by its target name, separated by ':'. If no target name is given, the special annotation's name is used in the output header.

cnt

number of overlapping regions

frac

fraction of the target region with an overlap

nbp

number of source base pairs in the overlap

-H, --ignore-headers

Ignore the headers completely and use numeric indexes even when a header exists

-O, --overlap FLOAT

Minimum overlap as a fraction of region length in at least one of the overlapping regions. If -r, --reciprocal is also given, require at least FLOAT overlap with respect to both regions

-r, --reciprocal

Require the -O, --overlap with respect to both overlapping regions

-x, --drop-overlaps

Drop overlapping regions (cannot be combined with -f, --transfer)
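
The interaction of -O, --overlap and -r, --reciprocal described above can be illustrated with a short Python sketch. It is only a reading of the option text, not the program's implementation: the function name passes_overlap is hypothetical, regions are (beg, end) tuples with 1-based inclusive coordinates, and treating the threshold as "greater than or equal" is an assumption.

```python
def passes_overlap(src, tgt, min_frac, reciprocal=False):
    """Return True if the overlap of src and tgt satisfies the threshold.

    src, tgt   -- (beg, end) tuples, 1-based and inclusive
    min_frac   -- the FLOAT given to -O, --overlap
    reciprocal -- True when -r, --reciprocal is given
    """
    shared = max(0, min(src[1], tgt[1]) - max(src[0], tgt[0]) + 1)
    if shared == 0:
        return False
    frac_src = shared / (src[1] - src[0] + 1)
    frac_tgt = shared / (tgt[1] - tgt[0] + 1)
    if reciprocal:
        # Both regions must be covered by at least min_frac
        return frac_src >= min_frac and frac_tgt >= min_frac
    # Without -r it is enough that one of the regions is covered (assumed >=)
    return frac_src >= min_frac or frac_tgt >= min_frac


src = (100, 200)  # 101 bp
tgt = (150, 200)  # 51 bp, fully inside src
print(passes_overlap(src, tgt, 0.8))                   # True: tgt is fully covered
print(passes_overlap(src, tgt, 0.8, reciprocal=True))  # False: only about half of src is covered
print(passes_overlap(src, tgt, 0.5, reciprocal=True))  # True: 51/101 and 51/51 both reach 0.5
```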

EXAMPLE

Both the SRC and TGT input files must be tab-delimited, with or without a header. Their columns can be named differently and can appear in arbitrary order. For example, consider the source file:


#chr   beg   end   sample   type   qual
chr1   100   200   smpl1    DEL    10
chr1   300   400   smpl2    DUP    30

and the target file:


150   200   chr1   smpl1
150   200   chr1   smpl2
350   400   chr1   smpl1
350   400   chr1   smpl2

In the first example we transfer type and quality, but only for regions with a matching sample. Notice that the header is present in SRC but not in TGT, therefore we use column indexes for the latter:


annot-tsv -s src.txt.gz -t tgt.txt.gz -c chr,beg,end:3,1,2 -m sample:4 -f type,qual
150   200   chr1   smpl1   DEL   10
150   200   chr1   smpl2   .     .
350   400   chr1   smpl1   .     .
350   400   chr1   smpl2   DUP   30
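
The logic of this first example can be mimicked in a few lines of Python, which may help in checking what output to expect. This is a simplified sketch, not the program's algorithm: the data are hard-coded from the files shown above, the function name annotate is hypothetical, and only the first overlapping source record is used, so the handling of multiple overlaps is not modeled.

```python
SRC = [  # chr, beg, end, sample, type, qual
    ("chr1", 100, 200, "smpl1", "DEL", "10"),
    ("chr1", 300, 400, "smpl2", "DUP", "30"),
]
TGT = [  # beg, end, chr, sample (no header, hence indexes 3,1,2 and 4)
    (150, 200, "chr1", "smpl1"),
    (150, 200, "chr1", "smpl2"),
    (350, 400, "chr1", "smpl1"),
    (350, 400, "chr1", "smpl2"),
]


def annotate(src_rows, tgt_rows):
    """Append type and qual from an overlapping, sample-matched source row."""
    out = []
    for beg, end, chrom, sample in tgt_rows:
        hits = [
            s for s in src_rows
            if s[0] == chrom and s[3] == sample  # same chromosome, -m sample:4
            and s[1] <= end and beg <= s[2]      # 1-based inclusive overlap
        ]
        # Use the first hit if there is one, otherwise fill in dots
        type_, qual = (hits[0][4], hits[0][5]) if hits else (".", ".")
        out.append((beg, end, chrom, sample, type_, qual))
    return out


for row in annotate(SRC, TGT):
    print("\t".join(map(str, row)))
```

Run as is, the sketch prints the same four records as the annot-tsv output shown above.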

The next example demonstrates the special annotations nbp and cnt, written to output columns named pair and count. In this case we use a target file with a header, so that column names are copied to the output:


#from	to	chrom	sample
150	200	chr1	smpl1
150	200	chr1	smpl2
350	400	chr1	smpl1
350	400	chr1	smpl2


annot-tsv -s src.txt.gz -t tgt_hdr.txt.gz -c chr,beg,end:chrom,from,to -m sample -f type,qual -a nbp,cnt:pair,count
#[1]from	[2]to	[3]chrom	[4]sample	[5]type	[6]qual	[7]pair	[8]count
150	200	chr1	smpl1	DEL	10	51	1
150	200	chr1	smpl2	.	.	0	0
350	400	chr1	smpl1	.	.	0	0
350	400	chr1	smpl2	DUP	30	51	1
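
The values in the pair and count columns can be checked by hand with a small Python sketch of the three special annotations. It follows the one-line definitions of cnt, frac and nbp given under -a, --annotate and is not the program's implementation. The function name special_annotations is hypothetical, and counting each target position only once when several source regions overlap it is an assumption that the option text does not settle.

```python
def special_annotations(tgt, sources):
    """Compute cnt, nbp and frac for one target region.

    tgt     -- (beg, end), 1-based and inclusive
    sources -- list of (beg, end) source regions already matched on
               chromosome and on any -m columns
    """
    beg, end = tgt
    overlaps = [
        (max(beg, s_beg), min(end, s_end))
        for s_beg, s_end in sources
        if s_beg <= end and beg <= s_end
    ]
    cnt = len(overlaps)

    # Assumption: positions covered by several source regions count once
    covered = 0
    last_end = beg - 1
    for o_beg, o_end in sorted(overlaps):
        o_beg = max(o_beg, last_end + 1)
        if o_beg <= o_end:
            covered += o_end - o_beg + 1
            last_end = o_end

    frac = covered / (end - beg + 1)
    return {"cnt": cnt, "nbp": covered, "frac": frac}


print(special_annotations((150, 200), [(100, 200)]))  # {'cnt': 1, 'nbp': 51, 'frac': 1.0}
print(special_annotations((150, 200), []))            # {'cnt': 0, 'nbp': 0, 'frac': 0.0}
```

The first call corresponds to the smpl1 record at 150-200, which overlaps the source region 100-200 by 51 bases; the second corresponds to a record with no sample-matched overlap.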

Either the SRC or the TGT file can be streamed from standard input:


cat src.txt | annot-tsv -t tgt.txt -c chr,beg,end:3,1,2 -m sample:4 -f type,qual -o output.txt
cat tgt.txt | annot-tsv -s src.txt -c chr,beg,end:3,1,2 -m sample:4 -f type,qual -o output.txt

The program can be used in a grep-like mode to print only the matching regions of the target file, without modifying the records:


annot-tsv -s src.txt -t tgt.txt -c chr,beg,end:3,1,2 -m sample:4
150   200   chr1   smpl1
350   400   chr1   smpl2

AUTHORS

The program was written by Petr Danecek and was originally published on GitHub as annot-regs.

COPYING

The MIT/Expat License, see the LICENSE document for details.
Copyright (c) Genome Research Ltd.

12 December 2023 htslib-1.19