
Perturb Counts in a Vector with Small Cells
The perturb_counts function perturbs counts in a numeric vector containing small cells, specifically when only one primary cell is present and secondary cells need to be masked, following Algorithm 3 (A3). The function adjusts the counts by distributing noise to non-primary cells while preserving the overall distribution as much as possible.
Value
A character vector with perturbed counts formatted with digit precision and thousands separator. If perturbation is not feasible, the function returns counts masked using mask_counts().
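The return format described above can be illustrated with base R. This is a minimal sketch, not the package's own formatting code; `format_counts` is a hypothetical helper name, and keeping `NA` as a missing value (rather than the string "NA") is inferred from the example output on this page.

```r
# Hypothetical helper: format counts with a thousands separator,
# keeping missing values as NA instead of the string "NA".
format_counts <- function(x) {
  out <- format(x, big.mark = ",", trim = TRUE, scientific = FALSE)
  out[is.na(x)] <- NA_character_
  out
}

format_counts(c(10, 1208, 0, NA))
# values: "10" "1,208" "0" NA
```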
Details
Perturbation Process Overview:
The function performs perturbation through the following steps:
Identification of Small Cells: Cells with counts greater than 0 and less than the specified threshold are identified as small cells (primary cells).
$$\text{Small Cells} = \{ i \mid 0 < x_i < \text{threshold} \}$$
Adjustment of Small Cells: The counts of small cells are set to the threshold value.
$$x'_i = \left\{ \begin{array}{ll} \text{threshold} & \text{if } x_i \text{ is a small cell} \\ x_i & \text{otherwise} \end{array} \right.$$
Calculation of Total Noise: The total noise to be distributed is calculated as the difference between the original total sum and the adjusted sum.
$$\text{Total Noise} = \sum_{i=1}^{N} x_i - \sum_{i=1}^{N} x'_i$$
Distribution of Noise to Non-Small Cells: The total noise is proportionally distributed to the non-small cells based on their original counts.
Weights Calculation: $$w_i = \frac{x_i}{\sum_{j \in \text{Non-Small Cells}} x_j}$$
Noise Allocation: $$\text{Noise}_i = w_i \times \text{Total Noise}$$
Adjusted Counts: $$x''_i = x'_i + \text{Noise}_i$$
Rounding Adjusted Counts: The adjusted counts are rounded to the nearest integer.
$$x'''_i = \text{round}(x''_i)$$
Adjustment for Rounding Discrepancies: Any remaining noise due to rounding discrepancies is adjusted by iteratively adding or subtracting 1 from the largest counts until the total counts are balanced, ensuring that no count falls below the threshold.
Verification of Proportions: The function checks if the proportions of the non-small cells remain consistent before and after perturbation. If the proportions differ, the function coerces to mask counts using the mask_counts() function.
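The steps above, up to the rounding adjustment, can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: `perturb_sketch` is a hypothetical name, the threshold of 10 is assumed for the example, the rounding fix always adjusts the single largest eligible count, and the proportion check and the fallback to mask_counts() are left out.

```r
# Illustrative sketch only; `perturb_sketch` and `threshold = 10` are
# assumptions made for this example, not the package's API.
perturb_sketch <- function(x, threshold = 10) {
  ok        <- !is.na(x)
  small     <- ok & x > 0 & x < threshold   # identify small (primary) cells
  non_small <- ok & !small
  if (!any(small)) return(x)                # nothing to perturb
  if (sum(x[non_small]) == 0) stop("no non-small counts to absorb the noise")

  adj <- x
  adj[small] <- threshold                   # raise small cells to the threshold

  # Total noise: original total minus adjusted total (zero or negative here)
  total_noise <- sum(x[ok]) - sum(adj[ok])

  # Distribute the noise in proportion to the original non-small counts
  w <- x[non_small] / sum(x[non_small])
  adj[non_small] <- adj[non_small] + w * total_noise

  adj[ok] <- round(adj[ok])                 # round to the nearest integer

  # Rounding fix: move the largest eligible count by 1 until totals match,
  # never letting a non-small count drop below the threshold
  gap <- round(sum(x[ok]) - sum(adj[ok]))
  while (gap != 0) {
    step <- sign(gap)
    cand <- which(non_small & adj + step >= threshold)
    if (length(cand) == 0) stop("cannot balance totals above the threshold")
    i <- cand[which.max(adj[cand])]
    adj[i] <- adj[i] + step
    gap <- gap - step
  }
  adj
}

x1 <- c(5, 11, 43, 55, 65, 121, 1213, 0, NA)
perturb_sketch(x1)
# values: 10 11 43 55 65 121 1208 0 NA
```

With `x1`, the small cell 5 is raised to 10, the total noise is -5, proportional allocation and rounding leave the total one too high, and the final correction takes 1 from the largest count, giving 1208. The total of the non-missing counts (1513) is unchanged.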
Coercion to Mask Counts:
The function coerces to mask counts in the following scenarios:
Multiple Small Cells Detected: If more than one small cell is identified, perturbation is generally not the recommended approach. In this case the function still proceeds with perturbation, but issues a warning recommending threshold-based suppression (see mask_counts() and mask_counts_2()).
Insufficient Available Counts: If the non-small cells do not have enough counts to absorb the total noise without any count falling below the threshold, the operation will lead to information loss.
Proportions Changed After Perturbation: If perturbation alters the original proportions of the non-small cells, the operation will lead to information loss.
All Counts Below Threshold: If all counts in the vector are below the specified threshold, there is no meaningful perturbation possible. In this case, the function coerces to mask_counts() as a more secure alternative.
In the last three cases, the function calls mask_counts() to apply threshold-based cell suppression as a more secure alternative.
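Two of these conditions can be checked before any perturbation is attempted. The sketch below is one possible reading of them, not the package's code: `fallback_reason` is a hypothetical helper, the threshold of 10 is assumed, and "insufficient available counts" is interpreted as the non-small cells having less headroom above the threshold than the small cells need to reach it. The proportion check is not sketched because the text does not say how consistency is measured.

```r
# Hypothetical helper: returns the reason for falling back to masking,
# or NA when neither pre-check applies.
fallback_reason <- function(x, threshold = 10) {
  counts <- x[!is.na(x) & x > 0]            # ignore zeros and missing values
  small  <- counts < threshold

  if (length(counts) > 0 && all(small)) {
    return("all counts below threshold")
  }

  needed    <- sum(threshold - counts[small])    # amount added to small cells
  available <- sum(counts[!small] - threshold)   # headroom in non-small cells
  if (needed > available) {
    return("insufficient available counts")
  }

  NA_character_
}

fallback_reason(c(3, 4, 0, NA))    # "all counts below threshold"
fallback_reason(c(2, 11, 12))      # "insufficient available counts"
fallback_reason(c(5, 11, 43, 55))  # NA
```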
Examples
# Example vectors
x1 <- c(5, 11, 43, 55, 65, 121, 1213, 0, NA)
x2 <- c(1, 1, 1, 55, 65, 121, 1213, 0, NA)
x3 <- c(11, 10, 10, 55, 65, 121, 1213, 0, NA)
# Apply the function
lapply(list(x1, x2, x3), perturb_counts)
#> Warning: Total primary cells: 3. Threshold-based suppression is recommended. See mask_counts() & mask_counts_2()
#> [[1]]
#> [1] "10" "11" "43" "55" "65" "121" "1,208" "0" NA
#>
#> [[2]]
#> [1] "10" "10" "10" "54" "64" "119" "1,190" "0" NA
#>
#> [[3]]
#> [1] "11" "10" "10" "55" "65" "121" "1,213" "0" NA
#>
# Using the function within a data frame
library(dplyr) # assumed here: provides %>%, select(), group_by(), summarise(), ungroup()
data("countmaskr_data")
aggregate_table <- countmaskr_data %>%
  select(-c(id, age)) %>%
  tidyr::gather(block, Characteristics) %>%
  group_by(block, Characteristics) %>%
  summarise(N = n()) %>%
  ungroup()
#> Warning: attributes are not identical across measure variables; they will be dropped
#> `summarise()` has regrouped the output.
#> ℹ Summaries were computed grouped by block and Characteristics.
#> ℹ Output is grouped by block.
#> ℹ Use `summarise(.groups = "drop_last")` to silence this message.
#> ℹ Use `summarise(.by = c(block, Characteristics))` for per-operation grouping
#> (`?dplyr::dplyr_by`) instead.
aggregate_table %>%
  group_by(block) %>%
  mutate(N_masked = perturb_counts(N))
#> # A tibble: 16 × 4
#> # Groups:   block [4]
#>    block     Characteristics                       N N_masked
#>    <chr>     <chr>                             <int> <chr>
#>  1 age_group 18-29                               243 243
#>  2 age_group 30-39                               198 198
#>  3 age_group 40-49                               215 215
#>  4 age_group 50-64                               323 323
#>  5 age_group 65+                                 521 521
#>  6 ethnicity Hispanic                            143 143
#>  7 ethnicity Non-Hispanic                       1346 1,346
#>  8 ethnicity Other                                11 11
#>  9 gender    Female                              728 728
#> 10 gender    Male                                763 762
#> 11 gender    Other                                 9 10
#> 12 race      American Indian/ Pacific Islander    66 66
#> 13 race      Asian                               215 214
#> 14 race      Black                               453 452
#> 15 race      Other                                 6 10
#> 16 race      White                               760 758