vignettes/v03-calculations.Rmd
v03-calculations.Rmd
Calculations define how (typically numerical) data is to be summarised/aggregated. Common ways of summarising data include sum, avg (mean, median, …), max, min, etc.
Within a pivottabler
pivot table, calculations always
belong to a Calculation Group. Calculation groups allow calculations to
be defined that refer to other calculations.
Every pivot table always has a default calculation group (called
default
). This is sufficient for most scenarios. All the
calculations defined in this vignette sit in the default
calculation group.
Creating additional calculation groups is only necessary for some advanced pivot table layouts - see the Irregular Layout vignette for an example.
Calculations groups are not discussed further in this vignette.
The pivottabler package supports several different ways of calculating the values to display in the cells of the pivot table:
Calculations are added to the pivot table using the
defineCalculation()
function.
The first two of the methods listed above are described in more detail in the following section. The third and fourth methods are less commonly needed and are described at the end of this vignette.
The most common way to calculate the pivot table is to provide an expression that describes how to aggregate the data, e.g. defining a calculation that counts the number of trains:
library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()
The pivottabler package uses the dplyr package by default. The
summariseExpression is therefore an expression that can be used with the
dplyr summarise()
function. The following shows several
different example expressions:
library(pivottabler)
library(dplyr)
library(lubridate)
# derive some additional data
trains <- mutate(bhmtrains,
ArrivalDelta=difftime(ActualArrival, GbttArrival, units="mins"),
ArrivalDelay=ifelse(ArrivalDelta<0, 0, ArrivalDelta))
# create the pivot table
pt <- PivotTable$new()
pt$addData(trains)
pt$addRowDataGroups("TOC", totalCaption="All TOCs")
pt$defineCalculation(calculationName="TotalTrains", caption="Total Trains",
summariseExpression="n()")
pt$defineCalculation(calculationName="MinArrivalDelay", caption="Min Arr. Delay",
summariseExpression="min(ArrivalDelay, na.rm=TRUE)")
pt$defineCalculation(calculationName="MaxArrivalDelay", caption="Max Arr. Delay",
summariseExpression="max(ArrivalDelay, na.rm=TRUE)")
pt$defineCalculation(calculationName="MeanArrivalDelay", caption="Mean Arr. Delay",
summariseExpression="mean(ArrivalDelay, na.rm=TRUE)", format="%.1f")
pt$defineCalculation(calculationName="MedianArrivalDelay", caption="Median Arr. Delay",
summariseExpression="median(ArrivalDelay, na.rm=TRUE)")
pt$defineCalculation(calculationName="IQRArrivalDelay", caption="Delay IQR",
summariseExpression="IQR(ArrivalDelay, na.rm=TRUE)")
pt$defineCalculation(calculationName="SDArrivalDelay", caption="Delay Std. Dev.",
summariseExpression="sd(ArrivalDelay, na.rm=TRUE)", format="%.1f")
pt$renderPivot()
The data.table package can also be used - see the Performance vignette for details. Note
that the “count” summarise expression is specified as .N
when using data.table, not as n()
.
When using data.table, be aware of some strange behaviour that leads to incorrect values in the pivot table when aggregating over columns that are also used in either row/column headings. Again, see the Performance vignette for details.
Calculations can be defined that refer to other calculations, by following these steps:
type="calculation"
,basedOn
argument.calculationExpression
argument. The values of the base
calculations are accessed as elements of the values
list.For example, calculating the percentage of trains with an arrival delay of greater than five minutes:
library(pivottabler)
library(dplyr)
library(lubridate)
# derive some additional data
trains <- mutate(bhmtrains,
ArrivalDelta=difftime(ActualArrival, GbttArrival, units="mins"),
ArrivalDelay=ifelse(ArrivalDelta<0, 0, ArrivalDelta),
DelayedByMoreThan5Minutes=ifelse(ArrivalDelay>5,1,0))
# create the pivot table
pt <- PivotTable$new()
pt$addData(trains)
pt$addRowDataGroups("TOC", totalCaption="All TOCs")
pt$defineCalculation(calculationName="DelayedTrains", caption="Trains Arr. 5+ Mins Late",
summariseExpression="sum(DelayedByMoreThan5Minutes, na.rm=TRUE)")
pt$defineCalculation(calculationName="TotalTrains", caption="Total Trains",
summariseExpression="n()")
pt$defineCalculation(calculationName="DelayedPercent", caption="% Trains Arr. 5+ Mins Late",
type="calculation", basedOn=c("DelayedTrains", "TotalTrains"),
format="%.1f %%",
calculationExpression="values$DelayedTrains/values$TotalTrains*100")
pt$renderPivot()
The base calculations can be hidden by specifying
visible=FALSE
, e.g. to look at how the percentage of trains
more than five minutes late varied by month and train operating
company:
library(pivottabler)
library(dplyr)
library(lubridate)
# derive some additional data
trains <- mutate(bhmtrains,
GbttDateTime=if_else(is.na(GbttArrival), GbttDeparture, GbttArrival),
GbttMonth=make_date(year=year(GbttDateTime), month=month(GbttDateTime), day=1),
ArrivalDelta=difftime(ActualArrival, GbttArrival, units="mins"),
ArrivalDelay=ifelse(ArrivalDelta<0, 0, ArrivalDelta),
DelayedByMoreThan5Minutes=ifelse(ArrivalDelay>5,1,0))
# create the pivot table
pt <- PivotTable$new()
pt$addData(trains)
pt$addColumnDataGroups("GbttMonth", dataFormat=list(format="%B %Y"))
pt$addRowDataGroups("TOC", totalCaption="All TOCs")
pt$defineCalculation(calculationName="DelayedTrains", visible=FALSE,
summariseExpression="sum(DelayedByMoreThan5Minutes, na.rm=TRUE)")
pt$defineCalculation(calculationName="TotalTrains", visible=FALSE,
summariseExpression="n()")
pt$defineCalculation(calculationName="DelayedPercent", caption="% Trains Arr. 5+ Mins Late",
type="calculation", basedOn=c("DelayedTrains", "TotalTrains"),
format="%.1f %%",
calculationExpression="values$DelayedTrains/values$TotalTrains*100")
pt$renderPivot()
By default the calculation headings (if visible) are placed in the pivot table as column headings, underneath any column data groups.
library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="NumberOfTrains", caption="Number of Trains",
summariseExpression="n()")
pt$defineCalculation(calculationName="MaximumSpeedMPH", caption="Maximum Speed (MPH)",
summariseExpression="max(SchedSpeedMPH, na.rm=TRUE)")
pt$renderPivot()
The location of these headings can be moved as illustrated below.
Calculations can be swapped onto the rows using the
addRowCalculationGroups()
method:
library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="NumberOfTrains", caption="Number of Trains",
summariseExpression="n()")
pt$defineCalculation(calculationName="MaximumSpeedMPH", caption="Maximum Speed (MPH)",
summariseExpression="max(SchedSpeedMPH, na.rm=TRUE)")
pt$addRowCalculationGroups()
pt$renderPivot()
In the examples above, the row/column groups were specified first and then the calculations. It is equally possible to specify the calculations first. The calculation names then form the first level of the row/column groups, e.g.
library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$defineCalculation(calculationName="NumberOfTrains", caption="Number of Trains",
summariseExpression="n()")
pt$defineCalculation(calculationName="MaximumSpeedMPH", caption="Maximum Speed (MPH)",
summariseExpression="max(SchedSpeedMPH, na.rm=TRUE)")
pt$addColumnCalculationGroups()
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")
pt$renderPivot()
Similarly, on rows:
library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$defineCalculation(calculationName="NumberOfTrains", caption="Number of Trains",
summariseExpression="n()")
pt$defineCalculation(calculationName="MaximumSpeedMPH", caption="Maximum Speed (MPH)",
summariseExpression="max(SchedSpeedMPH, na.rm=TRUE)")
pt$addRowCalculationGroups()
pt$addColumnDataGroups("PowerType")
pt$addRowDataGroups("TOC")
pt$renderPivot()
Again on rows, but this time using outline layout:
library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$defineCalculation(calculationName="NumberOfTrains", caption="Number of Trains",
summariseExpression="n()")
pt$defineCalculation(calculationName="MaximumSpeedMPH", caption="Maximum Speed (MPH)",
summariseExpression="max(SchedSpeedMPH, na.rm=TRUE)")
pt$addRowCalculationGroups(outlineBefore=TRUE)
pt$addColumnDataGroups("PowerType")
pt$addRowDataGroups("TOC")
pt$renderPivot()
See the Regular Layout vignette for more details on outline layout.
It is possible to specify additional/different filter criteria as
part of a calculation definition. This additional criteria can either be
in the form of a PivotFilters
object or a
PivotFiltersOverrides
object.
For example, to calculate the percentage of trains of each category that each train company operated at weekends, for each cell in the pivot table:
library(dplyr)
library(lubridate)
library(pivottabler)
# get the date of each train and whether that date is a weekday or weekend
trains <- bhmtrains %>%
mutate(GbttDateTime=if_else(is.na(GbttArrival), GbttDeparture, GbttArrival),
DayNumber=wday(GbttDateTime),
WeekdayOrWeekend=ifelse(DayNumber %in% c(1,7), "Weekend", "Weekday"))
# render the pivot table
pt <- PivotTable$new()
pt$addData(trains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")
weekendFilter <- PivotFilters$new(pt, variableName="WeekdayOrWeekend", values="Weekend")
pt$defineCalculation(calculationName="WeekendTrains", summariseExpression="n()",
filters=weekendFilter, visible=FALSE)
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()", visible=FALSE)
pt$defineCalculation(calculationName="WeekendTrainsPercentage",
type="calculation", basedOn=c("WeekendTrains", "TotalTrains"),
format="%.1f %%",
calculationExpression="values$WeekendTrains/values$TotalTrains*100")
pt$renderPivot()
See the Appendix: Calculations vignette for many more examples.
Changing the filter criteria used in a cell enables many additional types of calculations. See the Appendix: Calculations vignette for examples.
Example: % of row total:
library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="CountTrains", summariseExpression="n()",
caption="Count", visible=FALSE)
filterOverrides <- PivotFilterOverrides$new(pt, keepOnlyFiltersFor="TOC")
pt$defineCalculation(calculationName="TOCTotalTrains", filters=filterOverrides,
summariseExpression="n()", caption="TOC Total", visible=FALSE)
pt$defineCalculation(calculationName="PercentageOfTOCTrains", type="calculation",
basedOn=c("CountTrains", "TOCTotalTrains"),
calculationExpression="values$CountTrains/values$TOCTotalTrains*100",
format="%.1f %%", caption="% of TOC")
pt$renderPivot()
See the Appendix: Calculations vignette for an explanation of the above.
Each cell in a pivot table has two values: A rawValue
that is the result of the calculation in the cell. The
rawValue
is typically the same data type as the variable
the calculation is based on, e.g. a sum() of numerical values will
result in a numerical value. The formattedValue
is the
value that is displayed in the pivot table. The data type of the
formattedValue
is typically character.
The formatting of calculation results is specified by setting the
format
parameter when calling the
defineCalculation
function.
A number of different approaches to formatting are supported:
format
is a text value, then pivottabler invokes
base::sprintf()
with the specified format.format
is a list, then pivottabler invokes
base::format()
, where the elements in the list become
arguments in the function call.format
is an R function, then this is invoked for
each value.format
is not specified, then
base::as.character()
is invoked to provide a default
formatted value.The above are the same approaches used when formatting data groups.
Some formatting behaviour depends on the data type of the value being formatted. For details see the Appendix: Details vignette.
base::sprintf()
and
base::format()
The example below shows two different ways of formatting a value to
2dp. The “Mean Arr. Delay 1” column below is formatted using
base::sprintf("%.2f", x)
. The “Mean Arr. Delay 2” column
below is formatted using
base::format(x, digits=2, nsmall=2)
.
library(pivottabler)
library(dplyr)
library(lubridate)
# derive some additional data
trains <- mutate(bhmtrains,
ArrivalDelta=difftime(ActualArrival, GbttArrival, units="mins"),
ArrivalDelay=ifelse(ArrivalDelta<0, 0, ArrivalDelta))
# create the pivot table
pt <- PivotTable$new()
pt$addData(trains)
pt$addRowDataGroups("TOC", totalCaption="All TOCs")
pt$defineCalculation(calculationName="TotalTrains", caption="Total Trains",
summariseExpression="n()")
pt$defineCalculation(calculationName="MeanArrivalDelay1", caption="Mean Arr. Delay 1",
summariseExpression="mean(ArrivalDelay, na.rm=TRUE)",
format="%.2f")
pt$defineCalculation(calculationName="MeanArrivalDelay2", caption="Mean Arr. Delay 2",
summariseExpression="mean(ArrivalDelay, na.rm=TRUE)",
format=list(digits=2, nsmall=2))
pt$renderPivot()
In some countries around the world, the thousand separator character
is “.” and the decimal separator character is “,” (i.e. the opposite to
countries such as the UK and USA). This can be accomplished in
pivottabler
using:
library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory", totalCaption="All Categories")
pt$addRowDataGroups("TOC", totalCaption="All TOCs")
pt$defineCalculation(calculationName="Value1", caption="Value 1",
summariseExpression="mean(SchedSpeedMPH, na.rm=TRUE)*33.33333",
format=list(digits=1, nsmall=0, big.mark=".", decimal.mark=","))
pt$defineCalculation(calculationName="Value2", caption="Value 2",
summariseExpression="sd(SchedSpeedMPH,na.rm=TRUE)*333.33333",
format=list(digits=1, nsmall=1, big.mark=".", decimal.mark=","))
pt$renderPivot()
Using a custom R format function allows bespoke formatting logic to be used. In the following (contrived) example, “Mean Arr. Delay 2” includes a descriptive prefix that shows whether a number is less than, roughly equal to or greater than 3.
library(pivottabler)
library(dplyr)
library(lubridate)
# derive some additional data
trains <- mutate(bhmtrains,
ArrivalDelta=difftime(ActualArrival, GbttArrival, units="mins"),
ArrivalDelay=ifelse(ArrivalDelta<0, 0, ArrivalDelta))
# custom format function
fmtAddComment <- function(x) {
formattedNumber <- sprintf("%.1f", x)
comment <- "-"
if (x < 2.95) comment <- "Below 3: "
else if ((2.95 <= x) && (x < 3.05)) comment <- "Equals 3: "
else if (x >= 3.05) comment <- "Over 3: "
return(paste0(comment, " ", formattedNumber))
}
# create the pivot table
pt <- PivotTable$new()
pt$addData(trains)
pt$addRowDataGroups("TOC", totalCaption="All TOCs")
pt$defineCalculation(calculationName="TotalTrains", caption="Total Trains",
summariseExpression="n()")
pt$defineCalculation(calculationName="MeanArrivalDelay1",
caption="Mean Arr. Delay 1",
summariseExpression="mean(ArrivalDelay, na.rm=TRUE)",
format="%.1f")
pt$defineCalculation(calculationName="MeanArrivalDelay2",
caption="Mean Arr. Delay 2",
summariseExpression="mean(ArrivalDelay, na.rm=TRUE)",
format=fmtAddComment)
pt$renderPivot()
Using the fmtFuncArgs
argument it is also possible to
pass additional arguments to a custom R function used for formatting.
When passing multiple arguments to the custom R function, the cell value
is always passed to the custom function as x
. In the
example below, the number of decimal places is specified using the
fmtFuncArgs
argument:
library(pivottabler)
library(dplyr)
library(lubridate)
# derive some additional data
trains <- mutate(bhmtrains,
ArrivalDelta=difftime(ActualArrival, GbttArrival, units="mins"),
ArrivalDelay=ifelse(ArrivalDelta<0, 0, ArrivalDelta))
# custom format function
fmtNumDP <- function(x, numDP) {
formatCode <- paste0("%.", numDP, "f")
formattedNumber <- sprintf(formatCode, x)
return(formattedNumber)
}
# create the pivot table
pt <- PivotTable$new()
pt$addData(trains)
pt$addRowDataGroups("TOC", totalCaption="All TOCs")
pt$defineCalculation(calculationName="TotalTrains", caption="Total Trains",
summariseExpression="n()")
# define calculations
# note the use of the same custom format function (fmtNumDP)
# but specifying different decimal places
pt$defineCalculation(calculationName="MeanArrivalDelay1", caption="Mean Arr. Delay 1",
summariseExpression="mean(ArrivalDelay, na.rm=TRUE)",
format=fmtNumDP, fmtFuncArgs=list(numDP=1))
pt$defineCalculation(calculationName="MeanArrivalDelay2", caption="Mean Arr. Delay 2",
summariseExpression="mean(ArrivalDelay, na.rm=TRUE)",
format=fmtNumDP, fmtFuncArgs=list(numDP=2))
pt$renderPivot()
By default, where no data exists (for a particular combination of row
and column headers) pivottabler
will leave the pivot table
cell empty. Sometimes it is desirable to display a value in these cells.
This can be specified in two ways in the
defineCalculation()
function - either by specifying a value
for either the noDataValue
or noDataCaption
arguments. The differences between these two options are as follows:
Comparison | noDataValue argument | noDataCaption argument |
---|---|---|
Allowed Data Type(s) |
integer or numeric
|
character |
format argument applies |
Yes (will be formatted) | No (will be displayed as-is) |
Will be used in other calculations | Yes | No |
If the requirement is only to display a different value when there is
no data, then noDataCaption
is the right choice. Both
approaches are demonstrated below, where the Virgin Trains, Ordinary
Passenger cell has no data, so the empty cell value/caption is
shown.
library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()", noDataValue=0)
pt$renderPivot()
library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()", noDataCaption="-")
pt$renderPivot()
A pivot table can display data from multiple data frames. The following summarises the possible functionality:
defineCalculation()
function1.Important: When adding multiple data frames to a pivot table, the data frame columns used for the data groups (i.e. row/column headings) must be conformed, i.e.:
It is also worth noting that only the first data frame added to the pivot table is used when generating the row/column headings.
The example below illustrates using two data frames with a single pivot table:
library(pivottabler)
library(dplyr)
# derive some additional data
trains <- mutate(bhmtrains,
ArrivalDelta=difftime(ActualArrival, GbttArrival, units="mins"),
ArrivalDelay=ifelse(ArrivalDelta<0, 0, ArrivalDelta),
DelayedByMoreThan5Minutes=ifelse(ArrivalDelay>5,1,0)) %>%
select(TrainCategory, TOC, DelayedByMoreThan5Minutes)
# in this example, bhmtraindisruption is joined to bhmtrains
# so that the TrainCategory and TOC columns are present in both
# data frames added to the pivot table
cancellations <- bhmtraindisruption %>%
inner_join(bhmtrains, by="ServiceId") %>%
mutate(CancelledInBirmingham=ifelse(LastCancellationLocation=="BHM",1,0)) %>%
select(TrainCategory, TOC, CancelledInBirmingham)
# create the pivot table
pt <- PivotTable$new()
pt$addData(trains, "trains")
pt$addData(cancellations, "cancellations")
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="DelayedTrains", dataName="trains",
caption="Delayed",
summariseExpression="sum(DelayedByMoreThan5Minutes, na.rm=TRUE)")
pt$defineCalculation(calculationName="CancelledTrains", dataName="cancellations",
caption="Cancelled",
summariseExpression="sum(CancelledInBirmingham, na.rm=TRUE)")
pt$renderPivot()
In the example above, the number of trains more than five minutes
late is calculated from the trains
data frame and the
number of trains cancelled at Birmingham New Street is calculated from
the cancellations
data frame.
The following two methods of calculating cell values are more advanced and less commonly needed.
A custom calculation function allows more complex calculation logic to be used. Such a function is invoked once for each cell in the body of the pivot table. Custom calculation functions always have the same arguments defined:
pivotCalculator
is a helper object that offers various
methods to assist in performing calculations,netFilters
contains the definitions of the filter
criteria coming from the row and column headers in the pivot table,calcFuncArgs
is a list that specifies any additional
arguments that need to be passed to the custom calculation
function,format
provides the formatting definition - this is the
same value specified in the defineCalculation()
call,fmtFuncArgs
is a list that specifies any additional
arguments that need to be passed to a custom format function (if
used),baseValues
provides access to the results of other
calculations in the calculation group,cell
provides access to more details about the
individual cell that is being calculated.
cell
argument is provided to support more advanced
scenarios and is not explained in detail here.For example, if we wish to examine the worst single day performance, we need to:
library(pivottabler)
library(dplyr)
library(lubridate)
# derive some additional data
trains <- mutate(bhmtrains,
GbttDateTime=if_else(is.na(GbttArrival), GbttDeparture, GbttArrival),
GbttDate=make_date(year=year(GbttDateTime), month=month(GbttDateTime), day=day(GbttDateTime)),
GbttMonth=make_date(year=year(GbttDateTime), month=month(GbttDateTime), day=1),
ArrivalDelta=difftime(ActualArrival, GbttArrival, units="mins"),
ArrivalDelay=ifelse(ArrivalDelta<0, 0, ArrivalDelta),
DelayedByMoreThan5Minutes=ifelse(ArrivalDelay>5,1,0))
# custom calculation function
getWorstSingleDayPerformance <- function(pivotCalculator, netFilters, calcFuncArgs,
format, fmtFuncArgs, baseValues, cell) {
# get the data frame
trains <- pivotCalculator$getDataFrame("trains")
# apply the TOC and month filters coming from the headers in the pivot table
filteredTrains <- pivotCalculator$getFilteredDataFrame(trains, netFilters)
# calculate the percentage of trains more than five minutes late by date
dateSummary <- filteredTrains %>%
group_by(GbttDate) %>%
summarise(DelayedPercent = sum(DelayedByMoreThan5Minutes, na.rm=TRUE) / n() * 100) %>%
arrange(desc(DelayedPercent))
# top value
tv <- dateSummary$DelayedPercent[1]
# build the return value
value <- list()
value$rawValue <- tv
value$formattedValue <- pivotCalculator$formatValue(tv, format=format)
return(value)
}
# create the pivot table
pt <- PivotTable$new()
pt$addData(trains, "trains")
pt$addColumnDataGroups("GbttMonth", dataFormat=list(format="%B %Y"))
pt$addRowDataGroups("TOC", totalCaption="All TOCs")
pt$defineCalculation(calculationName="WorstSingleDayDelay", format="%.1f %%",
type="function", calculationFunction=getWorstSingleDayPerformance)
pt$renderPivot()
The return value from the custom function must be a list containing
the raw result value (i.e. unformatted, that is either
integer
or numeric
data type) and a formatted
value (that is character
data type).
Using a custom calculation function also enables additional possibilities, e.g. including additional information in the formatted value, in this case the date of the worst single day performance (where the code changes compared to the example above are highlighted):
library(pivottabler)
library(dplyr)
library(lubridate)
# derive some additional data
trains <- mutate(bhmtrains,
GbttDateTime=if_else(is.na(GbttArrival), GbttDeparture, GbttArrival),
GbttDate=make_date(year=year(GbttDateTime), month=month(GbttDateTime),
day=day(GbttDateTime)),
GbttMonth=make_date(year=year(GbttDateTime), month=month(GbttDateTime), day=1),
ArrivalDelta=difftime(ActualArrival, GbttArrival, units="mins"),
ArrivalDelay=ifelse(ArrivalDelta<0, 0, ArrivalDelta),
DelayedByMoreThan5Minutes=ifelse(ArrivalDelay>5,1,0))
# custom calculation function
getWorstSingleDayPerformance <- function(pivotCalculator, netFilters, calcFuncArgs,
format, fmtFuncArgs, baseValues, cell) {
# get the data frame
trains <- pivotCalculator$getDataFrame("trains")
# apply the TOC and month filters coming from the headers in the pivot table
filteredTrains <- pivotCalculator$getFilteredDataFrame(trains, netFilters)
# calculate the percentage of trains more than five minutes late by date
dateSummary <- filteredTrains %>%
group_by(GbttDate) %>%
summarise(DelayedPercent = sum(DelayedByMoreThan5Minutes, na.rm=TRUE) / n() * 100) %>%
arrange(desc(DelayedPercent))
# top value
tv <- dateSummary$DelayedPercent[1]
date <- dateSummary$GbttDate[1] # << CODE CHANGE <<
# build the return value
value <- list()
value$rawValue <- tv
value$formattedValue <- paste0(format( # << CODE CHANGE (AND BELOW) <<
date, format="%a %d"), ": ", pivotCalculator$formatValue(tv, format=format))
return(value)
}
# create the pivot table
pt <- PivotTable$new()
pt$addData(trains, "trains")
pt$addColumnDataGroups("GbttMonth", dataFormat=list(format="%B %Y"))
pt$addRowDataGroups("TOC", totalCaption="All TOCs")
pt$defineCalculation(calculationName="WorstSingleDayDelay", format="%.1f %%",
type="function", calculationFunction=getWorstSingleDayPerformance)
pt$renderPivot()
Including two values in each cell somewhat reduces the readability however.
It is possible to pass additional arguments as a list to a custom calculation function, as illustrated in the trivial example below.
library(pivottabler)
# custom calculation function
calcFunction <- function(pivotCalculator, netFilters, calcFuncArgs,
format, fmtFuncArgs, baseValues, cell) {
# build the return value
value <- list()
value$rawValue <- calcFuncArgs$result
value$formattedValue <- pivotCalculator$formatValue(calcFuncArgs$result, format=format)
return(value)
}
# create the pivot table
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("PowerType")
pt$addRowDataGroups("TOC", totalCaption="All TOCs")
pt$defineCalculation(calculationName="calcA", caption="A", type="function",
calculationFunction=calcFunction, calcFuncArgs=list(result=1))
pt$defineCalculation(calculationName="calcB", caption="B", type="function",
calculationFunction=calcFunction, calcFuncArgs=list(result=2))
pt$renderPivot()
This is useful as it allows the logic in more complex custom calculation functions to be reused to create variations of the same function - without needing to define the entire function again. For example, taking the example above that calculates the worst single day performance, it is possible to split the two values in a single cell into two cells.
library(pivottabler)
library(dplyr)
library(lubridate)
# derive some additional data
trains <- mutate(bhmtrains,
GbttDateTime=if_else(is.na(GbttArrival), GbttDeparture, GbttArrival),
GbttDate=make_date(year=year(GbttDateTime), month=month(GbttDateTime), day=day(GbttDateTime)),
GbttMonth=make_date(year=year(GbttDateTime), month=month(GbttDateTime), day=1),
ArrivalDelta=difftime(ActualArrival, GbttArrival, units="mins"),
ArrivalDelay=ifelse(ArrivalDelta<0, 0, ArrivalDelta),
DelayedByMoreThan5Minutes=ifelse(ArrivalDelay>5,1,0))
# custom calculation function
getWorstSingleDayPerformance <- function(pivotCalculator, netFilters, calcFuncArgs,
format, fmtFuncArgs, baseValues, cell) {
# get the data frame
trains <- pivotCalculator$getDataFrame("trains")
# apply the TOC and month filters coming from the headers in the pivot table
filteredTrains <- pivotCalculator$getFilteredDataFrame(trains, netFilters)
# calculate the percentage of trains more than five minutes late by date
dateSummary <- filteredTrains %>%
group_by(GbttDate) %>%
summarise(DelayedPercent = sum(DelayedByMoreThan5Minutes, na.rm=TRUE) / n() * 100) %>%
arrange(desc(DelayedPercent))
# top value
tv <- dateSummary$DelayedPercent[1]
date <- dateSummary$GbttDate[1]
if(calcFuncArgs$output=="day") { # << CODE CHANGES HERE <<
# build the return value
value <- list()
value$rawValue <- date
value$formattedValue <- format(date, format="%a %d")
}
else if(calcFuncArgs$output=="performance") { # << CODE CHANGES HERE <<
# build the return value
value <- list()
value$rawValue <- tv
value$formattedValue <- pivotCalculator$formatValue(tv, format=format)
}
return(value)
}
# create the pivot table
pt <- PivotTable$new()
pt$addData(trains, "trains")
pt$addColumnDataGroups("GbttMonth", dataFormat=list(format="%B %Y"))
pt$addRowDataGroups("TOC", totalCaption="All TOCs")
pt$defineCalculation(calculationName="WorstSingleDay", caption="Day",
format="%.1f %%", type="function",
calculationFunction=getWorstSingleDayPerformance,
calcFuncArgs=list(output="day"))
pt$defineCalculation(calculationName="WorstSingleDayPerf", caption="Perf",
format="%.1f %%", type="function",
calculationFunction=getWorstSingleDayPerformance,
calcFuncArgs=list(output="performance"))
pt$renderPivot()
With this approach, the pivot table performs little or no calculations. The values to display are predominantly calculated in R code before the pivot table is created. This pivot table is used primarily as a visualisation mechanism.
Returning to the original simple example of the number of trains operated by each train operating company:
library(pivottabler)
pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$renderPivot()
In the example above, pivottabler
calculated the values
in each pivot table cell. We can alternatively calculate the values
explicitly in R code and instead just use the pivot table to display
them:
library(pivottabler)
# perform the aggregation in R code explicitly
trains <- bhmtrains %>%
group_by(TrainCategory, TOC) %>%
summarise(NumberOfTrains=n()) %>%
ungroup()
# a sample of the aggregated data
head(trains)
## # A tibble: 6 × 3
## TrainCategory TOC NumberOfTrains
## <fct> <fct> <int>
## 1 Express Passenger Arriva Trains Wales 3079
## 2 Express Passenger CrossCountry 22865
## 3 Express Passenger London Midland 14487
## 4 Express Passenger Virgin Trains 8594
## 5 Ordinary Passenger Arriva Trains Wales 830
## 6 Ordinary Passenger CrossCountry 63
# display this pre-calculated data
pt <- PivotTable$new()
pt$addData(trains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", type="value", valueName="NumberOfTrains")
pt$renderPivot()
In the simple version shown above, no totals are displayed. This is because the data frame contains only line-level data, i.e. no aggregated data.
Three workarounds are possible:
Each these examples are presented below.
library(pivottabler)
# perform the aggregation in R code explicitly
trains <- bhmtrains %>%
group_by(TrainCategory, TOC) %>%
summarise(NumberOfTrains=n()) %>%
ungroup()
# display this pre-calculated data
pt <- PivotTable$new()
pt$addData(trains)
pt$addColumnDataGroups("TrainCategory", addTotal=FALSE) # << *** CODE CHANGE *** <<
pt$addRowDataGroups("TOC", addTotal=FALSE) # << *** CODE CHANGE *** <<
pt$defineCalculation(calculationName="TotalTrains", type="value", valueName="NumberOfTrains")
pt$renderPivot()
The totals are calculated in separate data frames then added to the
pivot table using pt$addTotalData()
.
The variableNames must be specified to inform the pivot table about
which total cells each totals data frame relates to. If a given cell is
a subtotal that relates to multiple variables, specify the variable
names using c(var1, var2, var3, ...)
.
library(dplyr)
library(pivottabler)
# perform the aggregation in R code explicitly
trains <- bhmtrains %>%
group_by(TrainCategory, TOC) %>%
summarise(NumberOfTrains=n()) %>%
ungroup()
# calculate the totals/aggregate values
trainsTrainCat <- bhmtrains %>%
group_by(TrainCategory) %>%
summarise(NumberOfTrains=n()) %>%
ungroup()
trainsTOC <- bhmtrains %>%
group_by(TOC) %>%
summarise(NumberOfTrains=n()) %>%
ungroup()
trainsTotal <- bhmtrains %>%
summarise(NumberOfTrains=n())
# display this pre-calculated data
pt <- PivotTable$new()
pt$addData(trains)
pt$addTotalData(trainsTrainCat, variableNames="TrainCategory")
pt$addTotalData(trainsTOC, variableNames="TOC")
pt$addTotalData(trainsTotal, variableNames=NULL)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", type="value", valueName="NumberOfTrains")
pt$renderPivot()
library(pivottabler)
# perform the aggregation in R code explicitly
trains <- bhmtrains %>%
group_by(TrainCategory, TOC) %>%
summarise(NumberOfTrains=n()) %>%
ungroup()
# display this pre-calculated data
pt <- PivotTable$new()
pt$addData(trains)
pt$addColumnDataGroups("TrainCategory")
pt$addRowDataGroups("TOC")
pt$defineCalculation(calculationName="TotalTrains", # << *** CODE CHANGE (AND BELOW) *** <<
type="value", valueName="NumberOfTrains",
summariseExpression="sum(NumberOfTrains)")
pt$renderPivot()
The pivottabler
package supports two different
evaluation modes for computing cell values: batch
and
sequential
. The batch
evaluation mode offers
much higher performance, especially for large pivot tables.
The pivottabler
package is also able to use two
different packages when carrying out summarising cell calculations
(method 1 calculations as described above in this vignette): dplyr and
data.table. data.table offers slightly higher performance when used with
large data frames (over ten million rows) with batch
evaluation mode, though has some quirks/odd behaviours when aggregating
over columns that are also used in the row/column headings of the pivot
table.
Please see the Performance for
more information about pivottabler
performance and
data.table.
If the pivot table contains only one data frame, then
specifying the data frame when calling defineCalculation()
is not necessary.↩︎