02. Data Groups

In This Vignette

Data Group Terminology
Adding data groups to a pivot table
- Adding data groups based on the data
- Adding data groups explicitly
- Adding data groups that combine values
- More complex layouts of data groups
Formatting data group values
Sorting data groups
Pivot tables as standard tables (row group headings)

Data Group Terminology

Data groups form the column headers and row headers in a pivot table. Typically, the distinct values of a variable form a set of headings (i.e. a set of data groups). Consider the following pivot table:

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addColumnDataGroups("PowerType")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()

In this pivot table, there are three sets of data groups:

The first row of column headings are the train category data groups - “Express Passenger” and “Ordinary Passenger”.
The second row of column headings are the power type data groups - “DMU”, “EMU”, “HST”, etc.
The row headings are the train operating company data groups - “Arriva Trains Wales”, “CrossCountry”, etc.

The data groups form a hierarchy. A data group can have child data groups. In the pivot table above, there are three column data groups at the first level, “Express Passenger”, “Ordinary Passenger” and “Total”. The “Express Passenger” group has four child data groups for the power types “DMU”, “EMU”, “HST” and “Total”. The “Ordinary Passenger” data group has three child groups and the “Total” data group has one child group (a blank group). The second level of the column data groups is thus formed of (4+3+1=) eight data groups.

Adding data groups to a pivot table

There are a few different ways to add data groups to a pivot table.

Adding data groups based on the data
Adding data groups explicitly
Adding data groups that combine values
More complex data group layouts

Adding data groups based on the data

The easiest way to add data groups to a pivot table is to simply generate data groups based on the distinct values that exist for a particular variable (i.e. in a particular column in a data frame). This is the method used in the pivot table above.

Considering parent data groups

By default, when adding additional levels to a pivot table after the first level (i.e. for level 2, 3, etc.), the filter criteria coming from parent data groups is considered. For example, in the pivot table above, there are “Express Passenger” trains with a power type of “HST”, but no “Ordinary Passenger” trains with a power type of “HST”. So while a “HST” data group exists under “Express Passenger”, no “HST” data group exists underneath “Ordinary Passenger”.

Ignoring parent data groups

Sometimes, a more uniform layout is preferred, i.e. in the above example, we would like “HST” to be present underneath “Ordinary Passenger” as well. This can be accomplished by setting the argument onlyCombinationsThatExist to FALSE.

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addColumnDataGroups("PowerType", onlyCombinationsThatExist=FALSE)
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()

Adding data groups explicitly

It is also possible to add data values explicitly, by passing a list of values as the explicitListOfValues argument. When an explicit list of values is specified, the pivot table does not sort the values. For example, if we wished to ignore the “HST” power type completely:

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addColumnDataGroups("PowerType", fromData=FALSE, explicitListOfValues=list("DMU", "EMU"))
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()

Visual Totals

In the above example, some of the totals now do not appear to reconcile, e.g. “CrossCountry DMU” (22133) + “CrossCountry EMU” (0) does not equal “CrossCountry Total” (22865). The difference is the “CrossCountry HST” (732) data group that has been omitted. This may or may not be desirable, depending on how the pivot table is introduced, the title that is used, etc.

If it is desirable for the totals to reconcile, then visual totals can be enabled using the visualTotals argument:

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addColumnDataGroups("PowerType", fromData=FALSE, 
                       explicitListOfValues=list("DMU", "EMU"), visualTotals=TRUE)
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()

Note that finding total/non-total data groups/cells with the findRowDataGroups(), findColumnDataGroups() and findCells() functions won’t work when visual totals are enabled. See the Finding and Formatting vignette for more details on these functions.

Adding data groups that combine values

By specifying values explicitly, it is also possible to combine multiple values into one data group. For example, if the primary focus of an analysis is the two train operating companies operating the most trains (“London Midland” and “CrossCountry”), then “Arrival Trains Wales” and “Virgin” can be combined:

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addColumnDataGroups("PowerType")
pt$addRowDataGroups("TOC", fromData=FALSE, explicitListOfValues=list(
    "London Midland", "CrossCountry", c("Arriva Trains Wales", "Virgin Trains")))
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()

Should an alternative caption be preferred for the combined group, it can be specified by setting the name of the list element. Following the example above, they could simply be labelled as “Other”:

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addColumnDataGroups("PowerType")
pt$addRowDataGroups("TOC", fromData=FALSE, explicitListOfValues=list(
  "London Midland", "CrossCountry", "Other"=c("Arriva Trains Wales", "Virgin Trains")))
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()

More complex layouts of data groups

Adding data groups one-at-a-time for fine-grained control to create a pivot table with an irregular layout is also possible. See the Irregular Layout vignette for more details.

Formatting data group values

Each data group has three values:

A captionTemplate that is the template text used to build the caption. The data type of the captionTemplate is character.
A caption that is the value that is displayed in the pivot table. The data type of the caption is character.
A sortValue that is used to order the group within the set of groups. The sortValue is the same data type as the variable the data group is based on.

The captionTemplate is specified via the caption argument of addColumnDataGroups() or addRowDataGroups(). It allows a prefix or suffix to be added to the caption:

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory", caption="TC:  {value}")
pt$addColumnDataGroups("PowerType", caption="PT:  {value}")
pt$addRowDataGroups("TOC", caption="TOC:  {value}")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()

In all of the examples above, the data groups are text values (i.e. are based on data frame columns of data type character). When data groups are numerical or date/time values, then typically some form of formatting is required. Formatting generates a caption for each data group, i.e. a more readable version of the underlying data group value.

For example, plotting the total number of trains per month for each train operation company:

# derive the date of each train (from the arrival/dep times),
# then the month of each train from the date of each train
library(dplyr)
library(lubridate)
library(pivottabler)
trains <- mutate(bhmtrains, 
   GbttDate=if_else(is.na(GbttArrival), GbttDeparture, GbttArrival),
   GbttMonth=make_date(year=year(GbttDate), month=month(GbttDate), day=1))

pt <- PivotTable$new()
pt$addData(trains)
pt$addColumnDataGroups("GbttMonth")
pt$addColumnDataGroups("PowerType")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()

In the example above, the month headings are not explicitly formatted. The example below shows how to format these headings using the dataFormat argument of addColumnDataGroups():

# derive the date of each train (from the arrival/dep times),
# then the month of each train from the date of each train
library(dplyr)
library(lubridate)
library(pivottabler)
trains <- mutate(bhmtrains, 
   GbttDate=if_else(is.na(GbttArrival), GbttDeparture, GbttArrival),
   GbttMonth=make_date(year=year(GbttDate), month=month(GbttDate), day=1))

pt <- PivotTable$new()
pt$addData(trains)
pt$addColumnDataGroups("GbttMonth", dataFormat=list(format="%B %Y"))
pt$addColumnDataGroups("PowerType")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()

A number of different approaches to formatting are supported:

If the dataFormat is a text value, then pivottabler invokes base::sprintf() with the specified format.
If the dataFormat is a list, then pivottabler invokes base::format(), where the elements in the list become arguments in the function call.
- The example above is equivalent to: base::format(value, format="%B %Y")
If the dataFormat is an R function, then this is invoked for each value.
If dataFormat is not specified, then base::as.character() is invoked to provide a default formatted value.

Some formatting behaviour depends on the data type of the value being formatted. For details see the Appendix: Details vignette.

An example of using a custom R function to accomplish the same formatting is:

# derive the date of each train (from the arrival/dep times), then derive
# the month of each train from the date of each train
library(dplyr)
library(lubridate)
library(pivottabler)
trains <- mutate(bhmtrains, 
   GbttDate=if_else(is.na(GbttArrival), GbttDeparture, GbttArrival),
   GbttMonth=make_date(year=year(GbttDate), month=month(GbttDate), day=1))

# define a custom formatting function
formatDate <- function(x) {
  base::format(x, format="%B %Y")
}

pt <- PivotTable$new()
pt$addData(trains)
pt$addColumnDataGroups("GbttMonth", dataFormat=formatDate)
pt$addColumnDataGroups("PowerType")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()

An example of passing additional parameters to the custom formatting function using the dataFmtFuncArgs argument:

# derive the date of each train (from the arrival/dep times), then derive
# the month of each train from the date of each train
library(dplyr)
library(lubridate)
library(pivottabler)
trains <- mutate(bhmtrains, 
   GbttDate=if_else(is.na(GbttArrival), GbttDeparture, GbttArrival),
   GbttMonth=make_date(year=year(GbttDate), month=month(GbttDate), day=1))

# define a custom formatting function
formatDate <- function(x, formatCode) {
  base::format(x, format=formatCode)
}

pt <- PivotTable$new()
pt$addData(trains)
pt$addColumnDataGroups("GbttMonth", 
                       dataFormat=formatDate, dataFmtFuncArgs=list(formatCode="%B %Y"))
pt$addColumnDataGroups("PowerType")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()

Note that when passing multiple arguments to the custom formatting function, the value parameter must always be named x in the custom function.

Sorting data groups

By default, data groups are sorted into ascending order based on the data group value, e.g.:

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addColumnDataGroups("PowerType")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()

In the above example the train operating companies are sorted into ascending order alphabetically (Arriva, CrossCountry, London, Virgin). Similarly for the train category (Express, Ordinary) and power type (DMU, EMU HST). The totals appear in a fixed position, irrespective of the data group sorting.

The data groups can be sorted into descending order using dataSortOrder argument, e.g. to reverse the order of the train operating companies:

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addColumnDataGroups("PowerType")
pt$addRowDataGroups("TOC", dataSortOrder="desc")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()

The data groups can be sorted into a custom sort order using dataSortOrder and customSortOrder arguments:

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addColumnDataGroups("PowerType")
pt$addRowDataGroups("TOC", dataSortOrder="custom", 
                    customSortOrder=c("London Midland", "Arriva Trains Wales", 
                                     "Virgin Trains", "CrossCountry"))
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()

The data groups can also be sorted using the sortColumnDataGroups or sortRowDataGroups methods. Using these methods, data groups can be sorted in five different ways:

By the data group value - using orderBy="value",
By the data group caption (e.g. if formatting has been applied) - using orderBy="caption",
By the result of a calculation - the default, or set explicitly using orderBy="calculation",
By the data group value with a custom sort order - using orderBy="customByValue" and specifying customOrder,
By the data group caption with a custom sort order - using orderBy="customByCaption" and specifying customOrder.

Note that “data group value” above refers to the value in the row/group header. In the example pivot table above and below, the headers are text values, e.g. “Express Passenger”, “DMU”, “London Midland”, etc (so the value and caption are the same). The data group value and caption tend to be different when the data groups are dates or numbers.

To sort by a number in the cells of the pivot table, which is typically the result of a calculation, use orderBy="calculation".

For example, to sort the train operating companies into descending order by the total number of trains:

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addColumnDataGroups("PowerType")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$sortRowDataGroups(levelNumber=1, orderBy="calculation", sortOrder="desc")
pt$renderPivot()

The pivot is sorted using the totals for each data group. In the pivot table above, the London Midland train operating company is in first place with 48,279 trains. CrossCountry is in second place with 22,928 trains, etc.

Since the pivot table above only contains one calculation, it was not necessary to specify the name of the calculation when sorting the row groups. If a pivot table contains more than one calculation, and there is a wish to sort by a calculation other than the first calculation that was defined, then the calculation can be specified by using the calculationName argument when calling either sortColumnDataGroups or sortRowDataGroups.

When sorting the headings at the second, third, etc. level, the sort does not break the structure coming from the parent data groups, i.e. a separate sort is performed within each parent group, e.g.

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addColumnDataGroups("PowerType")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$sortColumnDataGroups(levelNumber=2, orderBy="calculation", sortOrder="desc")
pt$renderPivot()

In the example above, one sort was carried out for the power type groups under “Express Passenger” (“DMU” in first place with 32,987 trains, “EMU” second with 15,306 trains, etc.) and a separate sort was carried out for the power type groups under “Ordinary Passenger” (“EMU” first with 28,201 trains, “DMU” second with 6,484 trains, etc.).

The sortColumnDataGroups or sortRowDataGroups can also be used to reset the sort order back to an alphabetic order based on the value of each of the groups, e.g.

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addColumnDataGroups("PowerType")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
# the follow line sorts the data based on the totals
pt$sortColumnDataGroups(levelNumber=2, orderBy="calculation", sortOrder="desc")
# the following line resets the sort order back, i.e. removes the sort applied in the line above
pt$sortColumnDataGroups(levelNumber=2, orderBy="value", sortOrder="asc")
pt$renderPivot()

This can be useful when building a pivot table interactively, to quickly reset the sort order without needing to rebuild the entire pivot table.

To sort data groups into a custom order using sortColumnDataGroups or sortRowDataGroups:

library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$sortRowDataGroups(levelNumber=1, orderBy="customByValue", sortOrder="asc",
                     customOrder=c("Arriva Trains Wales", "London Midland", 
                                   "CrossCountry", "Virgin Trains"))
pt$renderPivot()

Pivot tables as standard tables (row group headings)

It is possible to make a pivot table appear more like a standard table by adding headings to the row group columns at the left of the table. To do this, specify the header argument in addRowDataGroups() and specify showRowGroupHeaders=TRUE in renderPivot(), getHtml(), saveHtml() or writeToWorksheet(). (Note this is currently not supported for Latex output).

library(dplyr)
library(lubridate)
library(pivottabler)
trains <- mutate(bhmtrains, 
                 GbttDate=if_else(is.na(GbttArrival), GbttDeparture, GbttArrival),
                 GbttMonth=make_date(year=year(GbttDate), month=month(GbttDate), day=1))
trains <- filter(trains, GbttMonth>=make_date(year=2017, month=1, day=1))

pt <- PivotTable$new()
pt$addData(trains)
pt$addColumnDataGroups("GbttMonth", dataFormat=list(format="%B %Y"))
pt$addColumnDataGroups("PowerType")
pt$addRowDataGroups("TOC", header="Train Company")
pt$addRowDataGroups("TrainCategory", header="Train Category")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot(showRowGroupHeaders=TRUE)

This can be taken a step further by switching off totals for the row data groups and using a different table theme which styles the row data groups to match the cells in the body of the pivot table.

# derive the date of each train (from the arrival/dep times),
# then the month of each train from the date of each train
library(dplyr)
library(lubridate)
library(pivottabler)
trains <- mutate(bhmtrains, 
                 GbttDate=if_else(is.na(GbttArrival), GbttDeparture, GbttArrival),
                 GbttMonth=make_date(year=year(GbttDate), month=month(GbttDate), day=1))
trains <- filter(trains, GbttMonth>=make_date(year=2017, month=1, day=1))

pt <- PivotTable$new()
pt$addData(trains)
pt$addColumnDataGroups("GbttMonth", dataFormat=list(format="%B %Y"))
pt$addColumnDataGroups("PowerType")
pt$addRowDataGroups("TOC", header="Train Company", addTotal=FALSE)
pt$addRowDataGroups("TrainCategory", header="Train Category", addTotal=FALSE)
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$theme <- getStandardTableTheme(pt)
pt$renderPivot(showRowGroupHeaders=TRUE)

Chris Bailiss

2023-10-01