I am a simple man and I like simple solutions. Whenever I need to subset a file that doesn't have a nice column structure, I use grep. Often, I need to search for a whole list of strings. More specifically, I need to subset a GTF file by a list of transcript IDs.
A regular grep -f transcripts.txt annot.gtf is very slow (transcripts.txt contains the list of strings to search for in annot.gtf). There are several ways to speed up the search. One approach is to split transcripts.txt or annot.gtf and then run a for loop. Another is to run xargs (same link). However, this doesn't work well with GTF: for some reason it was splitting individual lines, creating nonsensical results. The next option is to use an alternative grep implementation, such as ripgrep, but I wanted to stay with the regular grep. The option I ended up with is a combination of grep and parallel. Since I search only for ASCII strings and not patterns, I can add export LC_ALL=C (byte-wise matching) and the flags -w (match whole words only) and -F (treat each ID as a fixed string, not a regular expression).
threads=6
export LC_ALL=C
cat transcripts.txt | parallel -j $threads grep -w -F {} annot.gtf > annot.sub.gtf
Of course, you can combine splitting files (split -l) with parallel or similar. I haven't tested this extensively, but I would assume I/O would be the limiting factor.
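For illustration, combining split -l with parallel could look roughly like the sketch below. It assumes GNU parallel is installed. The function name subset_gtf, the default chunk size of 1000 lines and the temporary directory are hypothetical choices for this example, not something benchmarked here. Each chunk of IDs is handed to a single grep -f, so there is one grep process per chunk instead of one per transcript ID.

```shell
#!/usr/bin/env bash
# Sketch only: subset_gtf is a hypothetical helper, not part of grep or parallel.
# Usage: subset_gtf IDS_FILE GTF_FILE OUT_FILE [THREADS] [LINES_PER_CHUNK]
subset_gtf() {
    local ids gtf out threads chunk_lines tmpdir
    ids="$1"
    gtf="$2"
    out="$3"
    threads="${4:-6}"          # assumed default, same as threads=6 above
    chunk_lines="${5:-1000}"   # assumed default chunk size

    tmpdir=$(mktemp -d) || return 1

    # Split the ID list into files with at most $chunk_lines lines each.
    split -l "$chunk_lines" "$ids" "$tmpdir/chunk."

    # Run one grep per chunk file.
    # -F: fixed strings, -w: whole words only, -f: read the patterns from a file.
    # LC_ALL=C makes grep match byte-wise.
    LC_ALL=C parallel -j "$threads" grep -w -F -f {} "$gtf" ::: "$tmpdir"/chunk.* > "$out"

    rm -rf "$tmpdir"
}
```

It would be called as subset_gtf transcripts.txt annot.gtf annot.sub.gtf. Note that the output is written as the jobs finish, so the lines in annot.sub.gtf are not guaranteed to be in the same order as in annot.gtf.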