jtleek

R package development - the Leek group way!


Developing R packages

As a modern statistician, one of the most fundamental contributions you can make is to create and distribute software that implements the methods you develop. I have gone so far as to say if you write a paper without software, that paper doesn't exist.

The purposes of this guide are:

  • To explain why writing software is a critical component of being a statistician.
  • To give you an introduction to the process/timing of creating an R package.
  • To help you figure out how to distribute/publicize your software.
  • To remind you that "the perfect is the enemy of the very good".
  • To try to make sure Leek group software has a consistent design.1

Why develop an R package?

Cause you know, you do what your advisor says and stuff.

But there are some real reasons to write software as a statistician that I think are critically important:

  1. You probably got into statistics to have an impact. If you work with Jeff it was probably to have an impact on human health or statistics. Either way, one of the most direct and quantifiable ways to have an impact on the world is to write software that other scientists, educators, and statisticians use. If you write a stats method paper with no software the chance of impacting the world is dramatically reduced.
  2. Software is the new publication. I couldn't name one paper written by a graduate student (other than mine) in the last 2-3 years. But I could tell you about tons of software packages written by students/postdocs (both mine and at other places) that I use. It is the number one way to get your name out there in the statistics community.
  3. If you write a software package you need for yourself, you will save yourself tons of time in sourcing scripts, and remembering where all your code/functions are.

Most importantly, creating an R package is building something. It is something you can point to and say, "I made that". Leaving aside all the tangible benefits to your career, the profession, etc., it is maybe the most gratifying feeling you get when working on research.

When to start writing an R package

As soon as you have 2 functions.

Why 2? After you have more than one function it starts to get easy to lose track of what your functions do, and it starts to be tempting to name your functions foo or tempfunction or some other such nonsense. You are also tempted to put all of the functions in one file and just source it. That was what I did with my first project, which ended up being an epically comical set of about 3,000 lines of code in one R file. Ask my advisor about it sometime, he probably is still laughing about it.

What you need

To start writing an R package you need:

Naming your package

The first step in creating your R package is to give it a name. Hadley has some ideas about it. Here are our rules:
  • Make it googleable - check by googling it.
  • Make sure there is no Bioconductor/CRAN package with the same name.
  • No underscores, dashes or any other special characters/numbers
  • Make it all lower case - people hate having to figure out caps in names of packages.
  • Make it memorable; if you want serious people to use it don't be too cute.
  • Make it as short as you possibly can while staying googleable.
  • Never, under any circumstances, let Rafa or Hector name your package.3
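
The purely mechanical rules above (all lower case, no underscores, dashes, special characters or numbers) are easy to check in code. Here is a minimal sketch. The helper name isValidPackageName is made up for illustration, and it does not check googleability or whether the name is already taken on CRAN/Bioconductor.

```r
## Hypothetical helper (not part of any package): checks only the
## mechanical naming rules, i.e. lower case letters and nothing else.
isValidPackageName <- function(name) {
    grepl("^[a-z]+$", name)
}

isValidPackageName("derfinder")    # TRUE
isValidPackageName("my_package2")  # FALSE: underscore and number
isValidPackageName("MyPackage")    # FALSE: capital letters
```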

Versioning your package

Almost all of our packages will eventually go on Bioconductor. So we are going to use the versioning scheme that is compatible with that platform (with some helpful suggestions from Kasper H.).

The format of the version number will always be x.y.z. When you start any new package the version number should be 0.1.0. Every time you make any change public (e.g., push to GitHub) you should increase z in the version number. If you are making local commits but not making them public to other people you don't need to increase z. You should stay in version 0.1.z basically up until you are ready to submit to Bioconductor (or CRAN) for release.

Before release you can increase y if you perform a major redesign of how the functions are organized or are used. You should never increase x before release.

The first time you submit the package to Bioconductor you should submit it as version number 0.99.z. That way on the next release of Bioconductor it will get bumped to 1.0.0. The next devel version will get bumped to 1.1.0 on Bioconductor. Immediately after releasing, if you plan to keep working on the project, you should bump your version on GitHub to 1.1.0.

Thereafter, again you should keep increasing z every time you make a public change. If you do a major reorganization you should increase y.
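
To make the "increase z on every public change" rule concrete, here is a small sketch in base R. The function name bumpZ is an assumption made for this example (it is not part of devtools or Bioconductor), and it only handles plain x.y.z version strings.

```r
## Hypothetical helper: increase z in an x.y.z version string.
bumpZ <- function(version) {
    parts <- as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
    stopifnot(length(parts) == 3, !anyNA(parts))
    parts[3] <- parts[3] + 1L
    paste(parts, collapse = ".")
}

bumpZ("0.1.0")   # "0.1.1"
bumpZ("0.99.4")  # "0.99.5"

## Base R compares version numbers field by field, so the
## pre-release 0.99.z versions sort before the first release.
package_version("0.99.5") < package_version("1.0.0")  # TRUE
```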

Creating your package

Run this code from R to create your package. It will create a directory called packagename and put some stuff in it (more on this stuff in a second).

## Setup
install.packages(c("devtools", "roxygen2", "knitr"))

## Load the libraries
library("devtools")
library("roxygen2")
library("knitr")

## Create the package directory
create("packagename")

Put your package on GitHub

All packages that are developed by the Leek group will be hosted on GitHub. The accounts are free and it makes it so much easier to share code/get other people to help you with your code. Here are the steps to getting your package on GitHub:

  1. Create a new GitHub repository with the same name (packagename)
  2. In the packagename directory on your local machine, run the command:
    git init
  3. Then run:
    git remote add origin git@github.com:yourusername/packagename.git
  4. Create a file in the packagename directory called README.md
  5. Run the command:
    git add *
  6. Run the command:
    git commit -m 'initial commit'
  7. Run the command:
    git push -u origin master

In summary:

mkdir packagename
cd packagename
git init
git remote add origin git@github.com:yourusername/packagename.git
git add *
git commit -m 'initial commit'
git push -u origin master

  • Use commit messages that will help you remember what you did and why you did it.
  • If you interact very frequently with GitHub you will be interested in setting up SSH keys to avoid typing your password every time you push/pull.
  • You can mark specific versions of your package using Git Tags which allows you to easily check the state of the package at that particular version.
  • If more than one person is working on developing the package or you want to contribute to one, check how to fork a repository. It is an easy way to contribute with a very low burden on the maintainer and no setup.
  • Consider whether you want users to report issues to your package. It is a very good framework for issue management, but can lead to duplicate information if the main issue reporting/tracking system is a mailing list, as in the case of Bioconductor packages. For an example of what GitHub's issue system looks like, check the rCharts issues.

Once you're familiar with basic git and GitHub workflows, GitHub has some more advanced features that you can take advantage of. In particular, GitHub flow is an excellent way to manage contributions, and GitHub organizations can provide a central location for multiple people (e.g. in a lab) to collaborate on software.

The parts of an R package

R functions

The R functions you have written go in the R/ directory in the packagename folder. Each of your R functions should go in a separate file with a .R extension. We are going to use capital R for the extension of the files.

Why? Don't ask questions.

If you define a new class, call the .R file classname-class.R. For example, if you are creating the leek class of objects it would be called leek-class.R. If you are defining a new method for the class it should be named classname-methodname-method.R. For example, a plotting method for the leek class would go in a .R file called leek-plot-method.R.

DESCRIPTION

The DESCRIPTION file is a plain text file that gets generated with the devtools::create command.
  • The package name should go after the colon on the first line.
  • The package title should be a one sentence description of what the package actually does.
  • The description should be a one paragraph description that builds on the title. It should give a user some idea about what kind of data your software should be used on, what the inputs are and what the outputs are.
  • The version should be defined as described above.
  • The authors field may have a @R before the colon which should be deleted. The authors should be in the format author name <email>, for example: Jeff Leek <email>, and should be comma separated.
  • A maintainer field should be added with maintainers listed as a comma separated list. You are the maintainer of your package when you create it. See the section below on after you leave the Leek group for more information.
  • The dependencies (other R packages your software uses/depends on) should be listed in a comma separated list after the R version. One of the dependencies should be the knitr package for the vignette.
  • The License is required to be open source. I like GPL-2 or GPL-3. I like the creative commons licenses, like CC-BY-SA, for manuscripts, but they are not recommended for software. This is a good website for learning more about software licenses. Also see Jeff Atwood's comments on licenses.
  • You should add a line
    VignetteBuilder: knitr
  • You should add a line
    Suggests: knitr, BiocStyle

For example:

Package: packagename
Type: Package
Title: A sentence
Version: 0.1.0
Date: 2013-09-30
Authors@R: c(person("Jeff", "Leek", role = c("aut", "cre", "ths"),
    email = "[email protected]"))
Depends:
    R(>= 3.0.2)
Suggests:
    knitr,
    BiocStyle
Description: A couple sentences that expand the title
License: Artistic-2.0
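
If you want a quick sanity check that a DESCRIPTION file has the fields discussed above, base R can parse it with read.dcf. The sketch below writes a toy DESCRIPTION to a temporary file so that it runs on its own. The field values are placeholders, and the list of required fields is an assumption chosen for this example, not an official checklist.

```r
## Toy DESCRIPTION written to a temporary file (placeholder values).
descLines <- c(
    "Package: packagename",
    "Title: A sentence",
    "Version: 0.1.0",
    "Depends:",
    "    R (>= 3.0.2)",
    "Suggests:",
    "    knitr,",
    "    BiocStyle",
    "VignetteBuilder: knitr",
    "Description: A couple sentences that expand the title",
    "License: Artistic-2.0"
)
descFile <- tempfile()
writeLines(descLines, descFile)

## read.dcf returns a character matrix with one column per field.
desc <- read.dcf(descFile)

## Fields this guide asks for (list assumed for this example).
required <- c("Package", "Title", "Version", "Description",
              "License", "VignetteBuilder")
missingFields <- setdiff(required, colnames(desc))
missingFields           # character(0) when nothing is missing
desc[1, "Version"]      # "0.1.0"
```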

Coding style requirements

I will try to keep the stylistic requirements minimal because they would likely drive you nuts. For now there are:

  1. Your indent should be 4 spaces on every line
  2. Each line can be no more than 80 columns

You can set these as defaults (say in Sublime or RStudio) and then you don't have to worry about them anymore. If you find lines going longer than 80 columns, you will need to break them up, for example by moving pieces of the code into functions.
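
If your editor does not flag long lines for you, a few lines of base R can do it. This is only a sketch: findLongLines is a made-up name, and the temporary file stands in for one of the files in your R/ directory.

```r
## Hypothetical helper: return the line numbers in a file that are
## wider than maxCols columns.
findLongLines <- function(path, maxCols = 80) {
    lines <- readLines(path, warn = FALSE)
    which(nchar(lines, type = "width") > maxCols)
}

## Try it on a temporary file whose second line is 81 columns wide.
tmp <- tempfile(fileext = ".R")
writeLines(c("x <- 1", strrep("a", 81)), tmp)
findLongLines(tmp)  # 2
```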

Documentation

This is how I feel about the relative importance of various components of statistical software development:

[Figure: documentation]

Ideally your software is easy to understand and just works. But this isn't Apple and you don't have a legion of test users to try it out for you. So more likely than not, at least the first several versions of your software will be at least a little hard to use. The first versions will also probably be slower than you would like them to be.

But if your software solves a real problem (it should!) and is well documented (it will be!) then people will use it and you will have a positive impact on the world.

Documentation has two main components. The first component is help files, which go into the man/ folder. The second component is vignettes, which will go in a folder called vignettes/ that you will have to create. I'll tackle each of these separately.

Help (man) files

These files document each of the functions/methods/classes you will expose to your users. The good news is that you don't have to write these files yourself. You will use the roxygen2 package to create the man files. To use roxygen2 you will need to document your functions in the .R files with comments formatted in a specific way. Right before your functions you should have a set of comments that are denoted by the symbol #'. They are structured in the following way:
#' A one sentence description of what your function does
#' 
#' A more detailed description of what the function is and how
#' it works. It may be a paragraph that should not be separated
#' by any spaces. 
#'
#' @param inputParameter1 A description of the input parameter \code{inputParameter1}
#' @param inputParameter2 A description of the input parameter \code{inputParameter2}
#'
#' @return output A description of the object the function outputs 
#'
#' @keywords keywords
#'
#' @export
#' 
#' @examples
#' R code here showing how your function works

myfunction <- function(inputParameter1, inputParameter2) {
    ## Your code goes here
}

You include the @export command if you want the function to be exported (i.e. visible) to your end users. Hadley has a pretty comprehensive guide where you can learn a lot more about how roxygen works. Your function follows immediately after the comments.
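
To see what a filled-in version of the template might look like, here is a small made-up function documented with roxygen2 comments. The function centerVector is invented for this illustration. The tags are the same ones shown in the template, and the file runs as plain R because the roxygen2 lines are just comments.

```r
#' Center a numeric vector
#'
#' Subtracts the mean from every element of a numeric vector so
#' that the result has mean zero.
#'
#' @param x A numeric vector.
#' @param na.rm Should missing values be dropped when computing
#' the mean?
#'
#' @return A numeric vector the same length as \code{x}.
#'
#' @export
#'
#' @examples
#' centerVector(c(1, 2, 3))
centerVector <- function(x, na.rm = FALSE) {
    x - mean(x, na.rm = na.rm)
}

centerVector(c(1, 2, 3))  # -1 0 1
```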

When you have saved functions with roxygen2 style comments you can create the .Rd files (the man files themselves) by running:
library("devtools")
document("packagename")

on the package folder. The package folder must be in the current working directory where you are editing.

Please read Hadley's guide in its entirety to understand how to document packages and in particular, how roxygen2 deals with collation and namespaces.

Vignettes

Documentation in the help files is important and is the primary way that people will figure out your functions if they get stuck. But it is equally (maybe more) critical that you help people get started. The way that you do that is to create a vignette. For our purposes, a vignette is a tutorial that includes the following components:

  • A short introduction that explains
    • The type of data the package can be used on
    • The general purpose of the functions in the package
  • One or more example analyses with
    • A small, real data set
    • An explanation of the key functions
    • An application of these functions to the data
    • A description of the output and how it can be used

We will write vignettes in knitr. Vignettes can generate either HTML from R Markdown, or PDF from LaTeX. In either case, the files should be in packagename/vignettes/. During package building they will be moved to packagename/inst/doc. For HTML vignettes, the vignette file should be vignette.Rmd. For PDF vignettes, it should be vignette.Rnw. Here is some information from Yihui about building vignettes in knitr.

For LaTeX vignettes, you should use the BiocStyle package to style your vignette. This means you will need to add this code to the preamble of your Rnw file:
<
