A workaround for longish factor levels

When working with categorical variables in R, it is sometimes useful to attach short labels to the categories while leaving the actual category values untouched. When we create a factor from a character vector, each category is coded as an integer, and the distinct strings are stored in the levels attribute. Let us create a categorical variable using a character vector:

books <-
  c(
    "Pride and Prejudice by Jane Austen",
    "The Scarlatti Inheritance by Robert Ludlum",
    "The Adventures of Tom Sawyer by Mark Twain",
    "A Tale of Two Cities by Charles Dickens"
  )
readings <- sample(books, 100, replace = TRUE, prob = c(.1, .3, .2, .4))
readings <- factor(readings)
head(readings)
## [1] The Adventures of Tom Sawyer by Mark Twain
## [2] Pride and Prejudice by Jane Austen        
## [3] The Adventures of Tom Sawyer by Mark Twain
## [4] The Scarlatti Inheritance by Robert Ludlum
## [5] A Tale of Two Cities by Charles Dickens   
## [6] The Scarlatti Inheritance by Robert Ludlum
## 4 Levels: A Tale of Two Cities by Charles Dickens ...
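
To make the integer coding described above concrete, here is a minimal sketch on a small, fixed factor. The vector `x` and its values are made up for illustration, so the result does not depend on the random sample drawn earlier.

```r
# A small, fixed factor (illustrative values, not part of the readings data)
x <- factor(c("pear", "apple", "pear", "fig"))

levels(x)      # the distinct categories, sorted: "apple" "fig" "pear"
as.integer(x)  # the integer codes pointing into levels(x): 3 1 3 2
typeof(x)      # "integer"
```

Each element stores only a code; the text lives once in the levels attribute.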

Unfortunately, the levels of this factor are created from rather longish strings. This could prove problematic when presenting the data in tables or plots.

barplot(table(readings))
The axis labels are mangled because they are simply too long!

The length of each category’s text affects our plot’s annotations. One easy solution, especially when working from a data frame, could be to create a new vector (or column) with the factor in an abbreviated form. But what if we have several such factors?

There are packages, e.g. labelled, that help us apply label attributes to variables as well as to individual elements. However, we can also do this in base R by generating more user-friendly names in situ, i.e. without altering the actual categories.

levels(readings)
## [1] "A Tale of Two Cities by Charles Dickens"   
## [2] "Pride and Prejudice by Jane Austen"        
## [3] "The Adventures of Tom Sawyer by Mark Twain"
## [4] "The Scarlatti Inheritance by Robert Ludlum"

We run levels() first because we need to be sure of the order in which the levels appear: it determines the order in which we must supply the new values. We will now create a named factor.

readings <- structure(
  readings, 
  names = c("Dickens", "Austen", "Twain", "Ludlum")[readings]
)

The key to this operation is that we can index the vector of short names with the factor itself when creating the named object. This works because, under the hood, factors are integers: each element's integer code selects the short name in the matching position.
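
The indexing step can be seen in isolation with a small sketch. The objects `f`, `short` and `named` are made up for illustration; the pattern is the same as in the call to structure() above.

```r
f <- factor(c("b", "a", "b", "c"))   # levels: "a" "b" "c"; codes: 2 1 2 3
short <- c("A", "B", "C")            # one label per level, in level order

short[f]                # indexes by the integer codes: "B" "A" "B" "C"
short[as.integer(f)]    # the same thing, written explicitly
short[as.character(f)]  # all NA: character indexing looks up names, and 'short' has none

named <- structure(f, names = short[f])
names(named)            # "B" "A" "B" "C"
```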

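Note that factor() sorts the levels alphabetically by default, and table() does the same with the names it counts, so the short labels must be supplied in the order reported by levels(), not in the order in which the books first appear.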
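
Since a label supplied in the wrong position would silently mislabel a category, it may be worth cross-checking the mapping. The following is a minimal sketch on made-up data (`f`, `named` and `check` are illustrative names): it tabulates the short names against the long levels, and every row should have counts in exactly one column.

```r
f <- factor(c("zebra", "ant", "mole", "ant"))
levels(f)   # "ant" "mole" "zebra": sorted, not in order of appearance

# Short labels supplied in the same order as levels(f)
named <- structure(f, names = c("A", "M", "Z")[f])

# Cross-tabulate short names against long levels
check <- table(short = names(named), long = named)
check

# TRUE when each short name maps to exactly one long level
all(rowSums(check > 0) == 1)
```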

At this point, we can use either the long or the short forms as we please.

barplot(table(names(readings)))
This looks a lot better!
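
For the case of several factors raised earlier, one possible approach is to wrap the idea in a small helper. This is only a sketch: `add_short_names()` is a hypothetical function, not part of base R or any package, and the `authors` and `genres` data are made up. It looks labels up by level name rather than by position, so the order of the levels no longer matters.

```r
# Hypothetical helper: 'lookup' is a named character vector whose names are
# the long levels and whose values are the short labels
add_short_names <- function(f, lookup) {
  stopifnot(is.factor(f), all(levels(f) %in% names(lookup)))
  structure(f, names = unname(lookup[as.character(f)]))
}

# Made-up factors for illustration
authors <- factor(c("Charles Dickens", "Jane Austen", "Charles Dickens"))
genres  <- factor(c("Historical fiction", "Romance", "Historical fiction"))

lookups <- list(
  c("Charles Dickens" = "Dickens", "Jane Austen" = "Austen"),
  c("Historical fiction" = "Hist.", "Romance" = "Rom.")
)

# Apply the helper to each factor with its own lookup table
named <- Map(add_short_names, list(authors, genres), lookups)
names(named[[1]])   # "Dickens" "Austen" "Dickens"
names(named[[2]])   # "Hist." "Rom." "Hist."
```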

Once our work is done, we can easily revert to the status quo ante with unname(readings).
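
A quick sketch of the round trip, using made-up data (`original`, `named` and `restored` are illustrative names), suggests that removing the names leaves the factor exactly as it was:

```r
original <- factor(c("long label one", "long label two", "long label one"))
named <- structure(original, names = c("one", "two")[original])

restored <- unname(named)
names(restored)                 # NULL
identical(restored, original)   # TRUE
```

Note that unname() returns a modified copy, so the result has to be assigned back, e.g. readings <- unname(readings), for the change to stick.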

Do you know a better way of doing this? Please share in the comments section.
