htmlwidgets::saveWidget Error: pandoc document conversion failed with error 97

I always use htmlwidgets::saveWidget to save my Plotly html plots in R. For some unknown reason, I started to get:

Could not find data file templates/file6efb40d38e3c.html                                        
Error: pandoc document conversion failed with error 97                                       
Execution halted   

I first thought I had some problem with the temp dir, so I tried adding dir.create(tempdir()), but that wasn't it. After some more digging I found this GitHub issue and an answer from cpsievert, which solved the problem.

Instead of using file=outputdir/filename.html, which is how I usually export/save everything in R, I had to split the path into file=filename.html and libdir=outputdir. The error disappeared and I could happily save my beautiful Plotly HTML plots.
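
In code, the change looks roughly like this (a minimal sketch; p stands for any htmlwidget/Plotly object and the paths are placeholders):

library(htmlwidgets)
# The form that failed for me:
# saveWidget(p, file = "outputdir/filename.html")
# The form that works - file name only, the dependencies go to libdir:
saveWidget(p, file = "filename.html", libdir = "outputdir")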

sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: openSUSE Leap 15.0

Matrix products: default
BLAS/LAPACK: /home/joppelt/Miniconda3/envs/riboseq/lib/libopenblasp-r0.3.10.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] htmlwidgets_1.5.3

loaded via a namespace (and not attached):
[1] compiler_3.6.3    htmltools_0.5.1.1 digest_0.6.27     rlang_0.4.10

Changing font in R plots (Arial Narrow)

To my surprise, changing fonts in R plots is not a trivial task. It's even worse if you are asked to change the font to one of the Windows-licensed fonts and you are on Linux. I personally think it's easier and faster to change the font later when editing the exported plot, but some people don't know how (or don't want to know how).

First, I tried the extrafont package. Even with fixes for "No FontName. Skipping", "Error in grid.Call(L_textBounds, as.graphicsAnnot(x$label), x$x, x$y", and attempts to install Windows fonts on Ubuntu, I still didn't get the desired result.

The best (and, for me, the only working) solution was the showtext package (as found here). You will need to download the font you want in .ttf format (.ttc or .otf also work). For me, it was Arial Narrow from here.

> # Install showtext package
> install.packages("showtext")
> # Load the library and install the font
> library("showtext") 
> font_add(family = "arialn", regular = "~/Downloads/arialn.ttf")
> # Before plotting, "activate" showtext
> showtext_auto()
> # Do some fancy plots, add theme(text = element_text(family = "arialn")) to all the features where you want to use this font if you are using ggplot2, and save to PDF
> # Deactivate showtext if you no longer need it
> showtext_auto(FALSE)

And that’s it! You’ll have nice plots with the font you want.
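
For example, a minimal ggplot2 sketch putting the pieces together (assumes ggplot2 is installed and reuses the font file path from the font_add() call above):

# Register the font, activate showtext, use the family in the theme, save to PDF
library(showtext)
library(ggplot2)
font_add(family = "arialn", regular = "~/Downloads/arialn.ttf")
showtext_auto()
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme(text = element_text(family = "arialn"))
ggsave("mtcars_arialn.pdf", p)
showtext_auto(FALSE)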

Or, you can choose any of the free Google fonts from here. If I hadn't found Arial Narrow, I would have gone with Archivo Narrow. With Google fonts, you don't need to download anything.

> # Get the font from Google fonts
> font_add_google("Archivo Narrow", "archivon")

See more examples in the showtext readme.

But one important thing to remember – this will not store the text as text in the PDF but as curves (outlines). If you then import the PDF into a graphics editor, it won't be easy to edit the text: you would have to convert the curves back to text (somehow) and only then edit it as text.

Speed up grep search

I am a simple man and I like simple solutions. Whenever I need to subset a file which doesn't have a nice column structure, I use grep. Often, I need to search for a list of strings; to be more specific, subset a GTF file by a list of transcript IDs.

A regular grep -f transcripts.txt annot.gtf is very slow (transcripts.txt contains the list of strings to search for in annot.gtf). There are several approaches to speeding up the search. One is to split transcripts.txt or annot.gtf and then run a for loop. Another is to use xargs (same link); however, this didn't work well with the GTF: for some reason it was splitting individual lines, producing nonsense results. The next option is to use an alternative grep implementation, such as ripgrep, but I wanted to stay with regular grep. The option I ended up with is a combination of grep and parallel. Since I search only for plain ASCII strings and not regex patterns, I can add export LC_ALL=C and use -w -F.

threads=6

export LC_ALL=C
cat transcripts.txt | parallel -j $threads grep -w -F {} annot.gtf > annot.sub.gtf

Of course, you can combine splitting the files (split -l) with parallel or similar. I haven't done extensive testing, but I would assume I/O would be the limiting factor.
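
A rough sketch of that split-based variant (untested; the chunk size is arbitrary, and a GTF line matching patterns from more than one chunk will appear more than once, just as with the per-pattern approach above):

# Split the pattern list into chunks and run one grep -f per chunk in parallel
split -l 1000 transcripts.txt chunk_
ls chunk_* | parallel -j $threads grep -w -F -f {} annot.gtf > annot.sub.gtf
rm chunk_*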

Strikethrough text in R plots

I wanted to add strikethrough text to an R ggplot2 plot and thought I would have to play with font settings. But I found this nice hint that you don't actually have to. All you need is to insert a combining strikethrough character (Unicode U+0336) after every character of the text (labels, in my case) and ggplot2 will happily render it for you. The trick is the following:

> strk <- stringr::str_replace_all("strikethrough", "(?<=.)", "\u0336")
> strk
[1] "s̶t̶r̶i̶k̶e̶t̶h̶r̶o̶u̶g̶h̶"

In a rendered plot it looks much nicer than the stupid copy-paste you see above.
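
A minimal sketch of how this could be used for axis labels (assumes ggplot2 and stringr are installed; the data are made up):

library(ggplot2)
strk <- stringr::str_replace_all("strikethrough", "(?<=.)", "\u0336")
df <- data.frame(label = c("plain", strk), value = c(1, 2))
# The second bar gets a struck-through x-axis label, no font tricks needed
ggplot(df, aes(x = label, y = value)) +
  geom_col()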

Remove common part of multiple strings in R

In my work, I often get lists of sample names which are way too long. And we all know plots don't like long sample names. The easiest fix is to remove the common parts of the sample names and create a shorter version. By removing the common parts you keep only the unique part of each name, which can still be used to identify the samples.

# Make a function which splits the input vector of strings (name_vect) by a separator (sepa) and returns only unique strings separated by the same separator
rename_samples <- function(name_vect, sepa) {
  samp_names.tmp <- t(as.data.frame(strsplit(name_vect, sepa, fixed = T)))
  ind <- apply(samp_names.tmp, 2, function(x) length(unique(x))) != 1 # keep only blocks that differ between samples
  samp_names.tmp <- as.data.frame(samp_names.tmp[, ind])
  samp_names <- as.character(interaction(samp_names.tmp, sep = sepa))
  return(samp_names)
}
# Prepare a vector of long strings which have common parts which can be stripped
samples <- c("ath.RNASeq.MiwiKO.P30.1",
"ath.RNASeq.MiwiKO.P30.2",
"ath.RNASeq.MiwiKO.P20.1",
"ath.RNASeq.MiwiWT.P30.1",
"ath.RNASeq.MiwiWT.P30.2",
"ath.RNASeq.MiwiWT.P20.1")
# Apply the function
rename_samples(samples, ".")
[1] "MiwiKO.P30.1" "MiwiKO.P30.2" "MiwiKO.P20.1" "MiwiWT.P30.1" "MiwiWT.P30.2" "MiwiWT.P20.1"

And voilà – the long sample names were shortened and are now much nicer and cleaner.

Note: this will work only if the sample names follow the same naming style, i.e. they all have the same number of blocks separated by the sepa character.

stdin and stdout decision in Python

The easiest way I have come up with in Python to automatically decide between a defined input file or stdin, and a defined output file or stdout, is:

import sys

# If intab (input) is defined, open it for reading; if not, assume stdin
if intab:
    f = open(intab, "r")
else:
    f = sys.stdin

# If otab (output) is defined, open it for writing; if not, assume stdout
if otab:
    fout = open(otab, "w")
else:
    fout = sys.stdout

# Do whatever you want 
for line in f:
    fout.write(line)  # example: simply copy the input to the output

# And close files, just to be safe
if f is not sys.stdin:
    f.close()

if fout is not sys.stdout:
    fout.close()

There might be something easier but this is simple and obvious.
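
In a real script, intab and otab would typically come from argparse; a minimal sketch (the argument names here are just my choice):

import argparse
import sys

parser = argparse.ArgumentParser(description="Read from a file or stdin, write to a file or stdout.")
parser.add_argument("-i", "--input", dest="intab", default=None, help="Input file. Default: stdin")
parser.add_argument("-o", "--output", dest="otab", default=None, help="Output file. Default: stdout")
args = parser.parse_args()

# Same decision as above, written as one-liners
f = open(args.intab, "r") if args.intab else sys.stdin
fout = open(args.otab, "w") if args.otab else sys.stdout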

Count the number of mappings (get unique and multi-mappings)

I have always struggled with unique-mapping filtering. This is mainly because every aligner/mapper marks unique mappings differently. Some aligners assign a specific MAPQ (STAR: 255), some assign a SAM tag (bwa aln: XT:A:U), and some just indicate whether the alignment is likely to be unique or not (bwa mem: MAPQ 0 = definitely not unique, higher MAPQ = more likely unique).

Although many people do not support the concept of a unique alignment, I am a simple man and I like to know the number of mappings per read. This is especially relevant for long-read sequencing, where we might want to separate reads with strong support for a single mapping position from those that could map to multiple regions.

To make this aligner/mapper-independent, the solution is quite simple: count how many times we see a single read (by name) mapped in a SAM file. To do this, I made a very simple Python script which does the counting and adds the number of alignments as a new SAM tag.

sam-counts-maps.py (very beta version) counts the number of alignments per read name. So far it cannot handle paired-end or chimeric reads, but I would love to extend it soon.

The use is rather simple:

$ samtools view -h test.bam | python sam-counts-maps.py --tag X0 | samtools view -bh - | samtools sort - > out.bam

The script cannot handle BAM files because I didn't want to depend on pysam, to keep it more compatible (and I don't like using pysam to read/write SAM/BAM because it tends to modify the files in unexpected ways).

The full options are:

usage: sam-count-maps.py [-h] [-i INPUT] [-o OUTPUT] [-t TAG] [-w]

Count number of mappings per read.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input SAM file. Default: stdin
  -o OUTPUT, --output OUTPUT
                        Output SAM file. Default: stdout
  -t TAG, --tag TAG     New tag with number of mappings to be added. Default: X0.
  -w, --overwrite       Do you want to overwrite if --tag already exists?
                        Default: Don't overwrite.

Then, you can simply grep for the number of alignments using the added tag. For example, to get unique alignments:

samtools view -h out.bam | grep -w -E "^@[A-Z][A-Z]|X0:i:1" | samtools view -bh - > unique.bam
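
For the curious, the core counting idea behind the script is easy to sketch in plain Python (just an illustration of the approach, not the actual script; it reads the whole SAM into memory and ignores paired-end and chimeric reads):

import sys
from collections import Counter

lines = sys.stdin.readlines()
# First pass: count how many times each read name (QNAME, first column) is seen
counts = Counter(line.split("\t")[0] for line in lines if not line.startswith("@"))
# Second pass: print header lines as-is and append the count as a new tag (X0 here)
for line in lines:
    if line.startswith("@"):
        sys.stdout.write(line)
    else:
        name = line.split("\t", 1)[0]
        sys.stdout.write(line.rstrip("\n") + "\tX0:i:%d\n" % counts[name])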

Remove everything after last or before first character in Bash and R

Removing the part of a string after a character, or even after the last occurrence of a character, is simple in Bash but not as easy in R.

In Bash, you can simply strip parts of a string using:

$ var="hello.world.txt"
$ echo ${var%.*} # Remove all after last "."
hello.world
$ echo ${var%%.*} # Remove all after first "."
hello
$ echo ${var#*.} # Remove all before first "."
world.txt 
$ echo ${var##*.} # Remove all before last "."
txt

In R it is a bit more complicated (unless you want to use extra packages such as stringr; a sketch of that is at the end of this section). To do the same in R, these are the commands you would have to use:

> vari <- "hello.world.txt"
> sub(".[^.]+$", "", vari) # Remove all after last "."
[1] "hello.world"
> gsub("\\..*", "", vari) # Remove all after first "."
[1] "hello"
> sub(".*?\\.", "", vari) # Remove all before first "."
[1] "world.txt"
> gsub(".*\\.", "", vari) # Remove all before last "."
[1] "txt"
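
And, for completeness, the stringr variant mentioned above could look something like this (a sketch assuming the stringr package; I escaped the leading dot to keep the patterns strict):

> library(stringr)
> str_remove(vari, "\\.[^.]+$") # Remove all after last "."
[1] "hello.world"
> str_remove(vari, "\\..*") # Remove all after first "."
[1] "hello"
> str_remove(vari, "^[^.]+\\.") # Remove all before first "."
[1] "world.txt"
> str_remove(vari, ".*\\.") # Remove all before last "."
[1] "txt"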