Speed up grep search

I am a simple man and I like simple solutions. Whenever I need to subset a file which doesn’t have nice column structure I use grep. Often, I need to search for a list of strings. To be more specific – subset gtf file by a list of transcript ids.

A regular grep -f transcripts.txt annot.gtf is very slow (transcripts.txt contains the list of strings to search for in annot.gtf). There are several approaches on how to speed up the search. One approach is to split transcripts.txt or annot.gtf and then run for loop. Another one is to run xargs (same link). However, this doesn’t work well with gtf. For some reason if was splitting individual lines creating non-sense results. Next option is to use alternative grep implementations, such as ripgrep. But I wanted to stay with the regular grep. The option I ended up with is a combination of grep and parallel. Since I search only for ASCII string and not a pattern I can add export LC_ALL=C and -w -F.

threads=6

export LC_ALL=C
cat transcripts.txt | parallel -j $threads grep -w -F {} annot.gtf > annot.sub.gtf

Of course, you can combine splitting files (split -l) with parallel or similar. I haven’t done such extensive testing but I would assume the I/O would be the limit.

Remove everything after last or before first character in Bash and R

As simple as it is in Bash is not as easy in R to remove part of the string after character or even after last character.

In Bash, use can simply strip the last part of a string using:

$ var="hello.world.txt"
$ echo ${var%.*} # Remove all after last "."
hello.world
$ echo ${var%%.*} # Remove all after first "."
hello
$ echo ${var#*.} # Remove all before first "."
world.txt 
$ echo ${var##*.} # Remove all before last "."
txt

In R is a bit more complicated (unless you want to use special packages such as stringr). To do the same in R these are the commands you would have to use:

> vari <- "hello.world.txt"
> sub(".[^.]+$", "", vari) # Remove all after last "."
hello.world
> gsub("\\..*", "", vari) # Remove all after first "."
hello
> sub(".*?\\.", "", vari) # Remove all before first "."
world.txt 
> gsub(".*\\.","",vari) # Remove all before last "."
txt