Update: To see other one-liners, please check out my gist collection. It’s easier to make a small gist rather than write a “long” post here.
Finally, I have decided to put together a collection of useful one-liners before I forget them all. I will start slowly and add more on-the-fly. It will be a mixture of Bash, R, Perl, Python, and who knows what else.
Command line to remap a position sorted bam (from Heng Li tweet)
samtools collate -uOn128 old-pos-srt.bam tmpxyz | samtools fastq - | bwa mem -pt16 ref.fa - | samtools sort --threads=4 -m4G -o new-pos-srt.bam
The -uOn128
stands for -u uncompressed BAM output
, -O output to stdout and
-n INT number of temporary files [128]
. For the bwa mem
command the -pt16
stands for -p first query file consists of interleaved paired-end sequences
and -t INT number of threads [16]
. So be careful as this assumes you have paired-end reads.
You might have to specify TMPDIR if your BAM is big or if you have access to a quick SSD disk. samtools fastq
can also copy RG tags if you add samtools fastq -t
copy RG, BC and QT tags to the FASTQ header line and/or samtools fastq -T TAGLIST
copy arbitrary tags to the FASTQ header line.
Note for the bwa mem
– there is a new version coming up called bwa-mem2, which should produce the same alignments as bwa mem
but should be ~80% fasters. Right now (May 30, 2019), it doesn’t have a stable release yet, but if you like bwa mem
it should be something to keep your eye on.
Get the number of reads (mapped) from BAM file
I always forget this and I always have to remind myself how to do this simple command so I rather put it down. If you want to get the number of reads from the BAM file, you cannot simply use samtools view -F 4 in.bam | wc -l
or samtools flagstat in.bam
because that would give you the number of mappings and not the number of mapped reads. To get the real number of mapped reads, you would have to do (taken from here):
samtools view -F 4 in.bam | cut -f 1 | sort | uniq | wc -l
This will work for both paired-end and single-end reads. If you have a paired-end and you would like to get only reads where both ends are mapped, you would have to replace -F 4
for -F 12
.
Update: I have recently found that there is an even easier and, most importantly, faster solution (source here). You can use:
samtools view -F 4 in.bam | cut -f 1 | awk '!x[$0]++' | wc -l
Which makes it much faster. Compared to the sort | uniq
this is almost 50% faster.
On BAM file with 149,909,118 input reads sort | uniq
takes:
real 1m8.240s
user 2m57.097s
sys 0m18.001s
while awk '!x[$0]++'
takes:
real 0m42.857s
user 2m30.028s
sys 0m18.179s
Replace multiple spaces (>=1) using sed
Countless times I needed to use my favorite sed to replace multiple spaces with tab. And countless times, I forgot how I did it the last time.
To keep it in my external memory, here we go:
text="hello world"
echo "$text" | sed 's/ */X/g'
results in:
helloXworld
Please note there are two spaces if you use it with *
. With one space it will much every symbol.
I often use it in conjunction uniq -c | sed 's/ *\t/g' | sed 's/^\t//g'
to get a nicely formatted (=tab separated in my case) list of counts:
text=" 2 a\n1 c\n3 b"
echo -e "$text" | sed 's/ */\t/g' | sed 's/^\t//g' | sort -k1n
#1 c
#2 a
#3 b
The original post is here.
Note: If you don’t quote $text
, echo
will strip multiple spaces to just one.
text="hello world"
echo $text
#hello world
echo "$text"
#hello world
Insert line to the beginning of the file (first line)
Another handy sed
command is to insert a line to the beginning of the file. I use it quite often to add a header to a table.
To add the line, just do:
echo -e "a\nb" | sed '1i transcript_id\tlength'
You can also combine it with replacing the first line:
echo -e "a\nb" | tail -n+2 | sed '1i transcript_id\tlength'
Inspiration was found here.