Unix, R and python tools for genomics and data science
If you are from fields outside of biology, places to get you started:
tmux is a terminal multiplexer similar to `screen` but with more features. tmux cheatsheet
Theory and quick reference
There are 3 file descriptors, stdin, stdout and stderr (std=standard).
Basically you can:
* redirect stdout to a file
* redirect stderr to a file
* redirect stdout to stderr
* redirect stderr to stdout
* redirect stderr and stdout to a file
* redirect stderr and stdout to stdout
* redirect stderr and stdout to stderr

1 'represents' stdout and 2 stderr. A little note for seeing these things: with the less command you can view both stdout (which will remain in the buffer) and stderr, which will be printed on the screen but erased as you try to 'browse' the buffer.
This will cause the output of a program to be written to a file.
ls -l > ls-l.txt
Here, a file called 'ls-l.txt' will be created and it will contain what you would see on the screen if you type the command 'ls -l' and execute it.
This will cause the stderr output of a program to be written to a file.
grep da * 2> grep-errors.txt
Here, a file called 'grep-errors.txt' will be created and it will contain the stderr portion of the output of the 'grep da *' command.
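The two redirections above can be combined in one command. The following is a minimal sketch, assuming a POSIX shell with `mktemp` available; the function `emit` and the file names `out.txt` and `err.txt` are made up for illustration.

```shell
# Work in a throwaway directory so nothing outside it is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical helper: writes one line to stdout and one line to stderr.
emit() {
    echo "to stdout"
    echo "to stderr" >&2
}

# '>' (short for '1>') captures stdout, '2>' captures stderr.
emit > out.txt 2> err.txt

out_content=$(cat out.txt)
err_content=$(cat err.txt)
echo "out.txt contains: $out_content"
echo "err.txt contains: $err_content"
```

Each stream ends up in its own file, so error messages can be inspected without being mixed into the regular output.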
This will cause the stdout output of a program to be written to the same file descriptor as stderr.
grep da * 1>&2
Here, the stdout portion of the command is sent to stderr; you may notice that in different ways.
This will cause the stderr output of a program to be written to the same file descriptor as stdout.
grep da * 2>&1
Here, the stderr portion of the command is sent to stdout. If you pipe to less, you'll see that lines that normally 'disappear' (as they are written to stderr) are being kept now (because they're on stdout).
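One detail that often surprises people is that redirections are processed left to right, so the position of `2>&1` matters. A small sketch under the same assumptions (POSIX shell, `mktemp`; `emit` and the file names are hypothetical):

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical helper: one line to stdout, one line to stderr.
emit() {
    echo "to stdout"
    echo "to stderr" >&2
}

# stdout is pointed at the file first, then stderr is pointed at
# wherever stdout goes now: both lines land in the file.
emit > both.txt 2>&1

# Reversed order: stderr is pointed at the *current* stdout (captured
# here by the command substitution), and only then is stdout sent to
# the file. The stderr line therefore does not reach the file.
leaked=$(emit 2>&1 > only_stdout.txt)

both_lines=$(wc -l < both.txt | tr -d ' ')
only_stdout=$(cat only_stdout.txt)
echo "lines in both.txt: $both_lines"
echo "only_stdout.txt:   $only_stdout"
echo "captured instead:  $leaked"
```

When both streams should go to the same file, write the file redirection first and `2>&1` after it.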
This will send all output of a program to a file. This is sometimes suitable for cron entries, if you want a command to run in absolute silence.
rm -f $(find / -name core) &> /dev/null
This (thinking of the cron entry) will delete every file called 'core' in any directory. Notice that you should be pretty sure of what a command is doing if you are going to discard its output.
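Note that `&>` is a bash extension; in a plain POSIX `sh` the same effect is written `> /dev/null 2>&1`. The sketch below tries the idea safely inside a temporary directory instead of `/`. The directory layout and file names are invented for the example, and `find ... -exec rm -f {} +` is used here in place of `rm -f $(find ...)` because it also copes with paths that contain spaces.

```shell
# Build a small sandbox with two 'core' files and one file to keep.
workdir=$(mktemp -d)
mkdir -p "$workdir/a/b"
touch "$workdir/core" "$workdir/a/core" "$workdir/a/b/keep.txt"

# Look before deleting: count what the search matches.
found=$(find "$workdir" -type f -name core | wc -l | tr -d ' ')
echo "core files found: $found"

# Delete silently; '> /dev/null 2>&1' is the portable spelling of '&> /dev/null'.
find "$workdir" -type f -name core -exec rm -f {} + > /dev/null 2>&1

remaining=$(find "$workdir" -type f -name core | wc -l | tr -d ' ')
echo "core files remaining: $remaining"
```

Running the `find` without the `-exec` part first is a cheap way to be sure of what the silent command will do.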
chmod 754 myfile: this means the user has read, write and execute permission; members of the same group have read and execute permission but no write permission; everyone else has only read permission.
4 stands for "read",
2 stands for "write",
1 stands for "execute", and
0 stands for "no permission."
So 7 is the combination of permissions 4+2+1 (read, write, and execute), 5 is 4+0+1 (read, no write, and execute), and 4 is 4+0+0 (read, no write, and no execute).
The numbers are sometimes hard to remember, so one can use letters instead: u, g, and o stand for "user", "group", and "other"; "r", "w", and "x" stand for "read", "write", and "execute", respectively.
chmod u+x myfile
chmod g+r myfile
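To check that the numeric and the letter forms agree, here is a short sketch run in a temporary directory. It assumes a POSIX shell with `mktemp`; the helper `mode` is made up for the example and simply keeps the first ten characters of `ls -l`.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch myfile

# Hypothetical helper: print only the permission string from 'ls -l'.
mode() { ls -l "$1" | cut -c1-10; }

# Each digit is a sum: 7 = 4+2+1, 5 = 4+0+1, 4 = 4+0+0.
digits=$(printf '%d%d%d' "$((4 + 2 + 1))" "$((4 + 0 + 1))" "$((4 + 0 + 0))")

# Numeric form.
chmod "$digits" myfile
numeric=$(mode myfile)

# Letter form: '=' sets the permissions of a class exactly.
chmod 000 myfile
chmod u=rwx,g=rx,o=r myfile
symbolic=$(mode myfile)

# '+' only adds a permission to what is already there.
chmod 644 myfile
chmod u+x myfile
after_ux=$(mode myfile)

echo "chmod $digits            -> $numeric"
echo "chmod u=rwx,g=rx,o=r -> $symbolic"
echo "chmod 644; chmod u+x -> $after_ux"
```

The first two commands give the same permission string, while `u+x` changes only the user's execute bit and leaves everything else as it was.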
Samtools, BWA and many others.
It is really important to name your files correctly! See a presentation by Jenny Bryan.
Three principles for (file) names:
* Machine readable (do not put special characters and space in the name)
* Human readable (Easy to figure out what the heck something is, based on its name, add slug)
* Plays well with default ordering:
  * Put something numeric first
  * Use the ISO 8601 standard for dates (YYYY-MM-DD)
  * Left pad other numbers with zeros
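The effect of the ordering rules can be seen with a handful of empty files. This is a sketch only: the file names are invented, and `LC_ALL=C` is set so that `ls` sorts byte by byte regardless of the local language settings.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir unpadded padded dated

# Without zero padding, 10 sorts before 2.
touch unpadded/sample_1.txt unpadded/sample_2.txt unpadded/sample_10.txt

# With zero padding (printf '%03d'), names sort in numeric order.
for i in 1 2 10; do
    touch "padded/$(printf 'sample_%03d.txt' "$i")"
done

# ISO 8601 dates at the start of the name sort chronologically.
touch dated/2016-04-01_run.csv dated/2013-06-26_run.csv dated/2014-02-26_run.csv

unpadded_order=$(cd unpadded && LC_ALL=C ls | paste -sd ' ' -)
padded_order=$(cd padded && LC_ALL=C ls | paste -sd ' ' -)
dated_order=$(cd dated && LC_ALL=C ls | paste -sd ' ' -)

echo "unpadded: $unpadded_order"
echo "padded:   $padded_order"
echo "dated:    $dated_order"
```

Only the padded and dated listings come out in the order a human would expect.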
If you have to rename the files...
Good naming of your files can help you to extract meta data from the file name
* dirdf: Create tidy data frames of file metadata from directory and file names.
    > dir("examples/dataset_1/")
    [1] "2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A01.csv"
    [2] "2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A02.csv"
    [3] "2014-02-26_BRAFWTNEG_FFPEDNA-CRC-1-41_D08.csv"
    [4] "2014-03-05_BRAFWTNEG_FFPEDNA-CRC-REPEAT_H03.csv"
    [5] "2016-04-01_BRAFWTNEG_FFPEDNA-CRC-1-41_E12.csv"
    > library("dirdf")
    > dirdf("examples/dataset_1/", template="date_assay_experiment_well.ext")
            date     assay           experiment well ext                                          pathname
    1 2013-06-26 BRAFWTNEG Plasmid-Cellline-100  A01 csv 2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A01.csv
    2 2013-06-26 BRAFWTNEG Plasmid-Cellline-100  A02 csv 2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A02.csv
    3 2014-02-26 BRAFWTNEG     FFPEDNA-CRC-1-41  D08 csv     2014-02-26_BRAFWTNEG_FFPEDNA-CRC-1-41_D08.csv
    4 2014-03-05 BRAFWTNEG   FFPEDNA-CRC-REPEAT  H03 csv   2014-03-05_BRAFWTNEG_FFPEDNA-CRC-REPEAT_H03.csv
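The same kind of metadata extraction can be done in the shell when the names follow a fixed `date_assay_experiment_well.ext` pattern. This is a rough sketch using only POSIX parameter expansion and `read`; it is not how dirdf works internally, and the variable names are chosen for illustration.

```shell
pathname="2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A01.csv"

ext=${pathname##*.}     # everything after the last dot
stem=${pathname%.*}     # everything before the last dot

# Split the stem on underscores into the four template fields.
IFS=_ read -r file_date assay experiment well <<EOF
$stem
EOF

printf '%s\t%s\t%s\t%s\t%s\n' "$file_date" "$assay" "$experiment" "$well" "$ext"
```

This only works because the fields themselves never contain an underscore, which is one more reason to keep names machine readable.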
Using these tools will greatly improve your working efficiency and get rid of most of your `for` loops.
`brename` and `csvtk`.
5. future: Unified Parallel and Distributed Processing in R for Everyone
6. furrr: Apply Mapping Functions in Parallel using Futures
A blog post by Mark Ziemann: http://genomespot.blogspot.com/2018/03/share-and-backup-data-sets-with-dat.html
# Install new version of R (let's say 3.5.0 in this example)
# Create a new directory for the version of R
fs::dir_create("~/Library/R/3.5/library")
# Re-start R so the .libPaths are updated
# Look up what packages were in your old package library
pkgs
Better R code
Make R a little bit stricter: strict. Also read the offensive programming book.
[Omicsplayground](https://github.com/bigomics/omicsplayground)
A Framework for Building Robust Shiny Apps golem
[bootstraplib](https://rstudio.github.io/bootstraplib/) Tools for styling shiny and rmarkdown from R via Bootstrap (3 or 4) Sass
What They Forgot to Teach You About R by Jennifer Bryan and Jim Hester. You know it is good. RStudio 2020: https://rstudio-conf-2020.github.io/what-they-forgot/
Fundamentals of Data Visualization by Claus O. Wilke.
From Data to Viz leads you to the most appropriate graph for your data. It links to the code to build it and lists common caveats you should avoid.
Data Visualization: A practical introduction A book by Kieran Healy from Duke University. Nice one to have!
Functional programming and unit testing for data munging with R
R workshops some resources for R related materials.
RStartHere A guide to some of the most useful R Packages that we know about, organized by their role in data science.
biobroom: Turn Bioconductor objects into tidy data frames
visdat visualizing your missing data and more.
glue Glue strings to data in R. Small, fast, dependency free interpreted string literals
purrr tutorial by Jenny Bryan. Functional programming in R.
Row-oriented workflows in R with the tidyverse
`pmap` is your friend :)
janitor simple tools for data cleaning in R.
Rstudio tidyeval video
Tidy Eval Meets ggplot2 a blog post.
Tidy evaluation in ggplot2 from tidyverse.
Programming with dplyr: a great read on non-standard evaluation, quoting and quasiquotation. The following two packages help you deal with that.
replyr An R package for fluid use of dplyr.
Introduction of Parameterized dplyr expression using replyr
wrapr: wraps R functions for debugging and better standard evaluation.
`Let` function. Blog post: wrapr: for sweet R code
Easy machine learning pipelines with pipelearner: intro and call for contributors github page
Demystifying ggplot2 Learn how to write ggplot2 extensions.
If you already know the mapping in advance (like the above example) you should use the `.data` pronoun from rlang to make it explicit that you are referring to the `drv` in the layer data and not some other variable named `drv` (which may or may not exist elsewhere). To avoid a similar note from the CMD check about `.data`, use `#' @importFrom rlang .data` in any roxygen code block (typically this should be in the package documentation as generated by `usethis::use_package_doc()`).
- If you know the mapping or facet specification is `col` in advance, use `aes(.data$col)` or `vars(.data$col)`.
- If `col` is a variable that contains the column name as a character vector, use `aes(.data[[col]])` or `vars(.data[[col]])`.
- If you would like the behaviour of col to look and feel like it would within aes() and vars(), use aes({{ col }}) or vars({{ col }}).
`d3heatmap` for interactive heatmaps.
`ggforce` has `geom_sina` for the same purpose.
ComplexHeatmap. Not as flexible as ComplexHeatmap, but can be handy when the function you want has been implemented.
geom_parallel_sets()
`data.table`, `feather`.
`dtplyr` and `tidyfast` are teaming up (well, at least in this blog post)
`devtools::spell_check()`, `goodpractice::gp()` and `pkgdown::build_site()`.
There are many online web-based tools for visualization of (cancer) genomic data. I put my collections here. I use R for visualization. See a nice post on using Python by Radhouane Aniba: Genomic Data Visualization in Python
See https://t.co/yxCb85ctL1: "MDS best choice for preserving outliers, PCA for variance, & T-SNE for clusters" @mikelove @AndrewLBeam
— Rileen Sinha (@RileenSinha) August 25, 2016
paper: Outlier Preservation by Dimensionality Reduction Techniques
rtsne: https://gist.github.com/mikelove/74bbf5c41010ae1dc94281cface90d32
Understanding UMAP: a very nice one to read!
Survival analysis of TCGA patients integrating gene expression (RNASeq) data
Tutorial: Machine Learning For Cancer Classification. It has four parts.
Automation wins in the long run.
STEP 6 is usually missing!
The pic was downloaded from http://biobungalow.weebly.com/bio-bungalow-blog/everybody-knows-the-scientific-method
I am using snakemake and so far I am very happy with it!
Avoid `setwd()` in your R script. `here::here()` comes to the rescue.
Have you ever had a problem reusing one of your own published figures because of the journal's copyright? Here is the solution, from @LorenaABarba:
As an early adopter of the Figshare repository, I came up with a strategy that serves both our open-science and our reproducibility goals, and also helps with this problem: for the main results in any new paper, we would share the data, plotting script and figure under a CC-BY license, by first uploading them to Figshare.