Bash – Jan's Blog of Bioinformatics Bits

I am a simple man and I like simple solutions. Whenever I need to subset a file that doesn't have a nice column structure, I use grep. Often I need to search for a whole list of strings. To be more specific: subset a GTF file by a list of transcript IDs.

A regular grep -f transcripts.txt annot.gtf is very slow (transcripts.txt contains the list of strings to search for in annot.gtf). There are several ways to speed up the search. One is to split transcripts.txt or annot.gtf and then run a for loop over the pieces. Another is to run xargs. However, this doesn't work well with GTF: for some reason it was splitting individual lines, creating nonsense results. The next option is to use an alternative grep implementation, such as ripgrep, but I wanted to stay with regular grep. The option I ended up with is a combination of grep and parallel. Since I search only for ASCII strings and not for patterns, I can add export LC_ALL=C (byte-wise comparison instead of locale-aware matching) and -w -F (whole-word matches of fixed strings instead of regular expressions).

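threads=6

# The C locale makes grep compare raw bytes, which is enough for ASCII-only IDs
export LC_ALL=C
# One grep per transcript ID ({} is replaced by the ID), up to $threads jobs at a time
cat transcripts.txt | parallel -j "$threads" grep -w -F {} annot.gtf > annot.sub.gtf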
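To see what -w and -F actually change, here is a small sketch on a toy file. The transcript IDs are made up for illustration, and the lines are simplified (space-separated rather than a full tab-separated GTF record).

```bash
# Toy annotation with made-up transcript IDs (not from a real GTF).
toy=$(mktemp)
printf '%s\n' \
    'chr1 . exon 1 100 . + . transcript_id "ENST0001.1";' \
    'chr1 . exon 1 100 . + . transcript_id "ENST0001.10";' \
    'chr1 . exon 1 100 . + . transcript_id "ENST000121";' > "$toy"

id='ENST0001.1'

# Plain grep: "." is a regex wildcard and substrings count, so all lines match.
n_regex=$(grep -c "$id" "$toy")
# -F: the ID is a literal string, so ENST000121 no longer matches.
n_fixed=$(grep -c -F "$id" "$toy")
# -w -F: the match must be a whole word, so ENST0001.10 is dropped as well.
n_word=$(grep -c -w -F "$id" "$toy")

echo "regex: $n_regex, fixed: $n_fixed, whole word: $n_word"
# prints: regex: 3, fixed: 2, whole word: 1
rm -f "$toy"
```

So besides being faster, -w -F also protects against versioned IDs that share a prefix.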

Of course, you can combine splitting files (split -l) with parallel or similar. I haven't tested this extensively, but I would assume I/O would be the limiting factor.
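The splitting idea could look roughly like the sketch below. It is not something I benchmarked: it uses plain background jobs instead of parallel, and subset_by_ids is a hypothetical helper name introduced only for this example. It starts one grep per chunk without any job limit, so the chunk size should be chosen so that the number of chunks is close to the number of threads you want.

```bash
# Sketch: split the ID list into chunks and run one fixed-string grep per chunk.
# subset_by_ids is a hypothetical helper, not part of any tool.
# Usage: subset_by_ids IDS_FILE TARGET_FILE [LINES_PER_CHUNK]
subset_by_ids() {
    local ids=$1 target=$2 chunk_lines=${3:-1000}
    local tmpdir chunk

    # Nothing to search for if the ID list is empty.
    [ -s "$ids" ] || return 0
    tmpdir=$(mktemp -d) || return 1

    # Split the ID list into files of at most $chunk_lines lines each.
    split -l "$chunk_lines" "$ids" "$tmpdir/chunk."

    # One background grep per chunk; each job writes to its own file
    # so that concurrent jobs never interleave their output lines.
    for chunk in "$tmpdir"/chunk.*; do
        LC_ALL=C grep -w -F -f "$chunk" "$target" > "$chunk.out" &
    done
    wait

    # Concatenate the per-chunk results in chunk order, then clean up.
    cat "$tmpdir"/chunk.*.out
    rm -rf "$tmpdir"
}
```

It would be called as subset_by_ids transcripts.txt annot.gtf 5000 > annot.sub.gtf. A line that matches IDs from two different chunks is reported twice, which is also the case when running one grep per ID.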

$ var="hello.world.txt"
$ echo ${var%.*} # Remove all after last "."
hello.world
$ echo ${var%%.*} # Remove all after first "."
hello
$ echo ${var#*.} # Remove all before first "."
world.txt
$ echo ${var##*.} # Remove all before last "."
txt
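The same four operators are handy for taking file names apart in scripts. A minimal sketch, with a hypothetical path chosen only for illustration:

```bash
# Hypothetical file path, used only to illustrate the operators.
path="results/sample1.sorted.bam"

file=${path##*/}     # drop the longest prefix ending in "/"
stem=${file%.*}      # drop the shortest suffix starting with "."
ext=${file##*.}      # drop the longest prefix ending in "."
sample=${file%%.*}   # drop the longest suffix starting with "."

printf '%s\n' "$file" "$stem" "$ext" "$sample"
# prints:
# sample1.sorted.bam
# sample1.sorted
# bam
# sample1
```

A single # or % removes the shortest match and a doubled ## or %% removes the longest one, which is the only difference between the "first" and "last" variants.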

> vari <- "hello.world.txt"
> sub("\\.[^.]+$", "", vari) # Remove all after last "."
hello.world
> gsub("\\..*", "", vari) # Remove all after first "."
hello
> sub(".*?\\.", "", vari) # Remove all before first "."
world.txt
> gsub(".*\\.","",vari) # Remove all before last "."
txt

Category: Bash

Speed up grep search

Remove everything after last or before first character in Bash and R