Crandore Hub

SelectBoost.FDA

SelectBoost-Style Variable Selection for Functional Data Analysis

Implements 'SelectBoost'-style variable selection workflows for functional data analysis. The package provides FDA-native design and preprocessing objects for raw curves, spline-basis expansions, functional principal component analysis (FPCA) scores, and scalar covariates; grouped stability-selection routines based on repeated subject-level subsampling; multiple selector backends including lasso, group lasso, and sparse-group lasso; FDA-aware grouping functions and calibration helpers for 'SelectBoost'; method-comparison utilities; a formula interface; simulation, benchmarking, and validation helpers with mapped ground truth; targeted sensitivity-study utilities and shipped benchmark summaries for mean 'F1' comparisons between FDA-aware and plain 'SelectBoost' workflows; small example datasets; and an optional adapter to the native stability-selection interface from the 'FDboost' package.

README


<!-- README.md is generated from README.Rmd. Please edit that file -->




# SelectBoost.FDA

<!-- badges: start -->
[![R-CMD-check](https://github.com/fbertran/SelectBoost.FDA/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/fbertran/SelectBoost.FDA/actions/workflows/R-CMD-check.yaml)
[![R-hub](https://github.com/fbertran/SelectBoost.FDA/actions/workflows/rhub.yaml/badge.svg)](https://github.com/fbertran/SelectBoost.FDA/actions/workflows/rhub.yaml)
<!-- badges: end -->


`SelectBoost.FDA` is an R package for variable selection in functional data
analysis. It combines FDA-native preprocessing and design objects with grouped
stability selection, interval summaries, FDA-aware `SelectBoost`, and a small
validation layer for simulation and benchmarking.

The package is designed for workflows where functional predictors are observed
on a grid, represented through basis expansions, or reduced to FPCA scores, and
where strong local or block-wise correlation makes ordinary variable selection
unstable.

## Main features

- FDA-native design objects built directly from raw curves, basis
  representations, FPCA scores, and scalar covariates.
- Train/test-safe preprocessing with identity transforms, standardization,
  spline-basis expansion, and FPCA.
- Grouped stability selection for functional blocks and interval summaries.
- FDA-aware `SelectBoost` wrappers plus a plain `SelectBoost` baseline.
- Simulation, benchmark, and evaluation helpers with mapped ground truth.
- Shipped sensitivity-study benchmark summaries for direct mean `F1`
  comparisons between `selectboost_fda()` and plain `SelectBoost`.

## Installation

You can install the development version from GitHub with:

```r
# install.packages("remotes")
remotes::install_github("fbertran/SelectBoost.FDA")
```

Some workflows rely on optional backends:

- `glmnet` for lasso-based grouped stability selection.
- `grpreg` for group lasso.
- `SGL` for sparse-group lasso.
- `FDboost` and `stabs` for the native `FDboost` stability-selection route.
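
Before running a workflow, it can help to check which of these optional backends are installed. The snippet below is a minimal sketch using only base R's `requireNamespace()`; the variable names `optional_pkgs`, `available`, and `missing_pkgs` are illustrative and not part of the package.

```r
# Optional backends named in the list above.
optional_pkgs <- c("glmnet", "grpreg", "SGL", "FDboost", "stabs")

# requireNamespace() reports availability without attaching the package.
available <- vapply(
  optional_pkgs,
  function(pkg) requireNamespace(pkg, quietly = TRUE),
  logical(1)
)

# Named logical vector: TRUE where the backend can be loaded.
available

# Backends that would still need install.packages().
missing_pkgs <- names(available)[!available]
missing_pkgs
```

Guarding each backend-specific call with `requireNamespace()` is the same pattern the stability-selection examples in this README use for `glmnet`.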

## A first FDA-native workflow

The package ships with small example datasets so the full workflow can start
from raw functional inputs.


``` r
data("spectra_example", package = "SelectBoost.FDA")

idx <- 1:30

design <- fda_design(
  response = spectra_example$response[idx],
  predictors = list(
    signal = fda_grid(
      spectra_example$predictors$signal[idx, ],
      argvals = spectra_example$grid,
      name = "signal",
      unit = "nm"
    ),
    nuisance = fda_grid(
      spectra_example$predictors$nuisance[idx, ],
      argvals = spectra_example$grid,
      name = "nuisance",
      unit = "nm"
    )
  ),
  scalar_covariates = spectra_example$scalar_covariates[idx, ],
  transforms = list(
    signal = fda_fpca(n_components = 3),
    nuisance = fda_bspline(df = 5)
  ),
  scalar_transform = fda_standardize(),
  family = "gaussian"
)

summary(design)
#> FDA design summary
#>   observations: 30 
#>   features: 10 
#>   family: gaussian 
#>   response available: TRUE 
#>   functional predictors: 2 
#>   scalar covariates: 2 
#>  predictor representation n_features
#>   nuisance          basis          5
#>     signal          basis          3
#>        age         scalar          1
#>  treatment         scalar          1
head(selection_map(design, level = "basis"))
#>                 predictor representation basis_type
#> nuisance.spline  nuisance          basis     spline
#> signal.fpca        signal          basis       fpca
#>                 source_representation n_components
#> nuisance.spline                  grid            5
#> signal.fpca                      grid            3
#>                 first_component last_component
#> nuisance.spline              B1             B5
#> signal.fpca                 PC1            PC3
#>                         components domain_start
#> nuisance.spline B1, B2, B3, B4, B5         1100
#> signal.fpca          PC1, PC2, PC3         1100
#>                 domain_end
#> nuisance.spline       2500
#> signal.fpca           2500
```
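
To make the idea of train/test-safe FPCA scores concrete, here is a minimal base-R sketch that does not use the package at all. It assumes simulated toy curves and uses `stats::prcomp()` as a stand-in for FPCA: the components are estimated on the training rows only and the test rows are projected onto them. How `fda_fpca()` is implemented internally is not shown here and may differ.

```r
set.seed(1)

# Hypothetical toy data: 40 smooth curves observed on a 50-point grid.
grid <- seq(0, 1, length.out = 50)
make_curve <- function() {
  rnorm(1) * sin(2 * pi * grid) +
    rnorm(1) * cos(2 * pi * grid) +
    rnorm(length(grid), sd = 0.1)
}
curves <- t(replicate(40, make_curve()))  # rows = subjects, columns = grid points

train <- 1:30
test <- 31:40

# Estimate the components from the training curves only.
pca <- prcomp(curves[train, ], center = TRUE, scale. = FALSE)
train_scores <- pca$x[, 1:3, drop = FALSE]

# Project the test curves with the training centre and loadings,
# so no information from the test rows leaks into the transform.
test_scores <- predict(pca, newdata = curves[test, ])[, 1:3, drop = FALSE]

dim(train_scores)
dim(test_scores)
```

The three retained columns play the same role as the `PC1` to `PC3` features of the `signal` predictor in the design summary above.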

## FDA-aware SelectBoost

`SelectBoost.FDA` extends `SelectBoost` with block-aware and region-aware
grouping while keeping the original perturbation engine.


``` r
fit_sb <- fit_selectboost(
  design,
  mode = "fast",
  steps.seq = c(0.6, 0.3),
  c0lim = FALSE,
  B = 4
)

summary(fit_sb)
#> FDA SelectBoost summary
#>   family: gaussian 
#>   predictors: 4 
#>   mode: fast 
#>   features: 10 
#>   groups: 4 
#>   c0 values: 2
head(selection_map(fit_sb, level = "group", c0 = colnames(fit_sb$feature_selection)[1]))
#>   predictor group_id     group representation
#> 1    signal        1    signal          basis
#> 2  nuisance        2  nuisance          basis
#> 3       age        3       age         scalar
#> 4 treatment        4 treatment         scalar
#>   basis_type source_representation n_features
#> 1       fpca                  grid          3
#> 2     spline                  grid          5
#> 3                           scalar          1
#> 4                           scalar          1
#>   start_position end_position start_argval end_argval
#> 1              1            3          PC1        PC3
#> 2              1            5           B1         B5
#> 3              1            1          age        age
#> 4              1            1    treatment  treatment
#>   domain_start domain_end       c0 mean_selection
#> 1         1100       2500 c0 = 0.6      0.6666667
#> 2         1100       2500 c0 = 0.6      0.2500000
#> 3          age        age c0 = 0.6      0.2500000
#> 4    treatment  treatment c0 = 0.6      1.0000000
#>   max_selection selected_features
#> 1          1.00                 2
#> 2          0.50                 4
#> 3          0.25                 1
#> 4          1.00                 1
```
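
The difference between plain and block-aware grouping can be illustrated with a small base-R sketch. It assumes that plain grouping links two features whenever their absolute correlation reaches the threshold `c0`, and that a block-aware variant additionally requires both features to belong to the same predictor. This is a simplified illustration under those assumptions, not the grouping code used by `SelectBoost` or by this package; all object names are hypothetical.

```r
set.seed(1)
n <- 40

# Two hypothetical blocks of three features each, all driven by one latent
# factor, so correlations are high both within and across blocks.
z <- rnorm(n)
x <- sapply(1:6, function(j) z + rnorm(n, sd = 0.3))
colnames(x) <- c(paste0("a", 1:3), paste0("b", 1:3))
block <- rep(c("a", "b"), each = 3)

c0 <- 0.6

# Plain grouping: link every pair whose absolute correlation reaches c0.
plain_groups <- abs(cor(x)) >= c0

# Block-aware grouping (assumption): keep a link only within the same block.
same_block <- outer(block, block, "==")
block_groups <- plain_groups & same_block

# Number of linked pairs that cross block boundaries under each rule.
sum(plain_groups & !same_block)
sum(block_groups & !same_block)
```

Under the block-aware rule no link crosses a predictor boundary, which is one way to keep a feature from one curve from being perturbed together with features of another.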

## Grouped stability selection

Grouped stability selection is available through a common FDA interface. The
lasso route below requires the optional `glmnet` package.


``` r
if (requireNamespace("glmnet", quietly = TRUE)) {
  fit_stab <- fit_stability(
    design,
    selector = "lasso",
    B = 8,
    cutoff = 0.5,
    seed = 1
  )

  summary(fit_stab)
  head(selection_map(fit_stab, level = "group"))
}
#>   predictor group_id     group representation
#> 1    signal        1    signal          basis
#> 2  nuisance        2  nuisance          basis
#> 3       age        3       age         scalar
#> 4 treatment        4 treatment         scalar
#>   basis_type source_representation n_features
#> 1       fpca                  grid          3
#> 2     spline                  grid          5
#> 3                           scalar          1
#> 4                           scalar          1
#>   start_position end_position start_argval end_argval
#> 1              1            3          PC1        PC3
#> 2              1            5           B1         B5
#> 3              1            1          age        age
#> 4              1            1    treatment  treatment
#>   domain_start domain_end mean_feature_frequency
#> 1         1100       2500              0.4166667
#> 2         1100       2500              0.0500000
#> 3          age        age              0.0000000
#> 4    treatment  treatment              0.2500000
#>   max_feature_frequency selected_features
#> 1                 0.750                 2
#> 2                 0.125                 0
#> 3                 0.000                 0
#> 4                 0.250                 0
#>   group_frequency group_selected
#> 1           0.750           TRUE
#> 2           0.125          FALSE
#> 3           0.000          FALSE
#> 4           0.250          FALSE
```
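
The frequency columns in this output can be read through the following base-R sketch of subsampling-based stability selection. It is a sketch under stated assumptions: the selector is a hypothetical top-`k` marginal-correlation rule standing in for the lasso, and a group counts as selected in a subsample when at least one of its features is selected. The aggregation actually used by `fit_stability()` may differ.

```r
set.seed(1)
n <- 60
p <- 8
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 2 * x[, 1] - 1.5 * x[, 2] + rnorm(n)
group <- rep(c("g1", "g2"), each = 4)

B <- 50        # number of subsamples
cutoff <- 0.5  # selection-frequency threshold
top_k <- 2     # features kept per subsample by the stand-in selector

# Hypothetical stand-in selector: keep the k features with the largest
# absolute marginal correlation with the response.
select_top_k <- function(x, y, k) {
  score <- abs(cor(x, y))[, 1]
  rank(-score) <= k
}

# One row per subsample, one logical column per feature.
selected <- t(replicate(B, {
  idx <- sample(n, size = floor(n / 2))
  select_top_k(x[idx, , drop = FALSE], y[idx], top_k)
}))

# Share of subsamples in which each feature was selected.
feature_frequency <- colMeans(selected)

# Share of subsamples in which any feature of the group was selected.
group_frequency <- tapply(
  seq_len(p), group,
  function(j) mean(rowSums(selected[, j, drop = FALSE]) > 0)
)
group_selected <- group_frequency >= cutoff

feature_frequency
group_frequency
group_selected
```

With this aggregation a group's frequency can never fall below the largest frequency of its member features.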

Interval summaries can be requested directly:


``` r
if (requireNamespace("glmnet", quietly = TRUE)) {
  fit_interval <- interval_stability_selection(
    x = design,
    selector = "lasso",
    width = 4,
    B = 8,
    cutoff = 0.5,
    seed = 1
  )

  head(selection_map(fit_interval, level = "group"))
}
#>   predictor group_id          group representation
#> 1    signal        1    signal[1:3]          basis
#> 2  nuisance        2  nuisance[1:4]          basis
#> 3  nuisance        3  nuisance[5:5]          basis
#> 4       age        4       age[1:1]         scalar
#> 5 treatment        5 treatment[1:1]         scalar
#>   basis_type source_representation n_features
#> 1       fpca                  grid          3
#> 2     spline                  grid          4
#> 3     spline                  grid          1
#> 4                           scalar          1
#> 5                           scalar          1
#>   start_position end_position start_argval end_argval
#> 1              1            3          PC1        PC3
#> 2              1            4           B1         B4
#> 3              5            5           B5         B5
#> 4              1            1          age        age
#> 5              1            1    treatment  treatment
#>       domain_start       domain_end
#> 1             1100             2500
#> 2             1100 2464.10256410256
#> 3 1817.94871794872             2500
#> 4              age              age
#> 5        treatment        treatment
#>   mean_feature_frequency max_feature_frequency
#> 1              0.4166667                 0.750
#> 2              0.0625000                 0.125
#> 3              0.0000000                 0.000
#> 4              0.0000000                 0.000
#> 5              0.2500000                 0.250
#>   selected_features group_frequency group_selected
#> 1                 2           0.750           TRUE
#> 2                 0           0.125          FALSE
#> 3                 0           0.000          FALSE
#> 4                 0           0.000          FALSE
#> 5                 0           0.250          FALSE
#>   interval_start interval_end interval_label
#> 1              1            3    signal[1:3]
#> 2              1            4  nuisance[1:4]
#> 3              5            5  nuisance[5:5]
#> 4              1            1       age[1:1]
#> 5              1            1 treatment[1:1]
```
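
The interval labels in this output are consistent with cutting each predictor's features into consecutive chunks of at most `width` positions. The helper below, `make_intervals()`, is a hypothetical base-R sketch of that rule and is not a function exported by the package.

```r
# Split positions 1..n_features into consecutive intervals of at most `width`.
make_intervals <- function(name, n_features, width) {
  start <- as.integer(seq(1, n_features, by = width))
  end <- as.integer(pmin(start + width - 1, n_features))
  data.frame(
    interval_start = start,
    interval_end = end,
    interval_label = sprintf("%s[%d:%d]", name, start, end),
    stringsAsFactors = FALSE
  )
}

# Five spline coefficients with width = 4 give nuisance[1:4] and nuisance[5:5].
make_intervals("nuisance", n_features = 5, width = 4)

# Three FPCA scores fit in a single interval, signal[1:3].
make_intervals("signal", n_features = 3, width = 4)
```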

## Benchmarking on simulated FDA designs

The validation layer can be used to compare FDA-aware `SelectBoost` with a
plain `SelectBoost` baseline on the same simulated design and mapped truth.


``` r
sim <- simulate_fda_scenario(
  n = 30,
  grid_length = 20,
  representation = "grid",
  seed = 1
)

bench <- benchmark_selection_methods(
  sim,
  methods = c("selectboost", "plain_selectboost"),
  levels = c("feature", "group"),
  selectboost_args = list(B = 3, steps.seq = 0.5, c0lim = FALSE),
  plain_selectboost_args = list(B = 3, steps.seq = 0.5, c0lim = FALSE)
)

head(bench$metrics)
#>     level n_universe n_truth n_selected tp fp fn tn
#> 1 feature         42       9         34  9 25  0  8
#> 2 feature         42       9         38  9 29  0  4
#> 3   group          4       3          4  3  1  0  0
#> 4   group          4       3          4  3  1  0  0
#>   precision recall specificity        f1   jaccard
#> 1 0.2647059      1   0.2424242 0.4186047 0.2647059
#> 2 0.2368421      1   0.1212121 0.3829787 0.2368421
#> 3 0.7500000      1   0.0000000 0.8571429 0.7500000
#> 4 0.7500000      1   0.0000000 0.8571429 0.7500000
#>   selection_rate       c0            method
#> 1      0.8095238 c0 = 0.5       selectboost
#> 2      0.9047619 c0 = 0.5 plain_selectboost
#> 3      1.0000000 c0 = 0.5       selectboost
#> 4      1.0000000 c0 = 0.5 plain_selectboost
#>          scenario representation   family
#> 1 localized_dense           grid gaussian
#> 2 localized_dense           grid gaussian
#> 3 localized_dense           grid gaussian
#> 4 localized_dense           grid gaussian
```
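
The metric columns follow from the confusion counts by the standard definitions. As a check, the hypothetical helper `selection_metrics()` below recomputes them in base R from the `tp`, `fp`, `fn`, and `tn` values of the first row printed above; it is not part of the package.

```r
# Standard selection metrics from confusion counts.
selection_metrics <- function(tp, fp, fn, tn) {
  c(
    precision = tp / (tp + fp),
    recall = tp / (tp + fn),
    specificity = tn / (tn + fp),
    f1 = 2 * tp / (2 * tp + fp + fn),
    jaccard = tp / (tp + fp + fn),
    selection_rate = (tp + fp) / (tp + fp + fn + tn)
  )
}

# Feature-level row for the FDA-aware method: tp = 9, fp = 25, fn = 0, tn = 8.
feature_metrics <- selection_metrics(tp = 9, fp = 25, fn = 0, tn = 8)
round(feature_metrics, 7)
```

Recall is perfect in both feature-level rows, so the `F1` gap between the two methods there comes entirely from the number of false positives (25 versus 29).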

The package also ships a larger saved sensitivity study under
`inst/extdata/benchmarks/`, generated by
`tools/run_selectboost_sensitivity_study.R`. The saved top-setting table keeps
the FDA benchmark settings together with the mean `F1` score of both
algorithms.


``` r
benchmark_dir <- system.file("extdata", "benchmarks", package = "SelectBoost.FDA")
top_settings <- utils::read.csv(
  file.path(benchmark_dir, "selectboost_sensitivity_top_settings.csv"),
  stringsAsFactors = FALSE
)

utils::head(
  top_settings[
    ,
    c(
      "scenario",
      "confounding_strength",
      "active_region_scale",
      "local_correlation",
      "association_method",
      "bandwidth",
      "selectboost_f1_mean",
      "plain_selectboost_f1_mean",
      "delta_mean",
      "win_rate"
    )
  ],
  5
)
#>            scenario confounding_strength
#> 1 confounded_blocks                  0.6
#> 2 confounded_blocks                  1.0
#> 3 confounded_blocks                  0.6
#> 4   localized_dense                  0.6
#> 5 confounded_blocks                  0.6
#>   active_region_scale local_correlation
#> 1                 0.5                 2
#> 2                 0.8                 2
#> 3                 0.8                 2
#> 4                 0.5                 2
#> 5                 0.5                 2
#>   association_method bandwidth selectboost_f1_mean
#> 1           interval         8           0.5362319
#> 2             hybrid         4           0.5885135
#> 3             hybrid         4           0.5833671
#> 4       neighborhood         4           0.4972542
#> 5             hybrid         4           0.5429293
#>   plain_selectboost_f1_mean delta_mean  win_rate
#> 1                 0.4087266 0.12750533 1.0000000
#> 2                 0.4826750 0.10583853 1.0000000
#> 3                 0.4944862 0.08888092 1.0000000
#> 4                 0.4144859 0.08276831 0.6666667
#> 5                 0.4657088 0.07722048 0.6666667
```

In the shipped benchmark, the strongest gains appear in the high-correlation,
narrow-region settings. For example, in the `confounded_blocks` scenario with
`active_region_scale = 0.5`, `local_correlation = 2`, and interval grouping at
`bandwidth = 8`, the saved mean `F1` values are approximately `0.536` for
FDA-aware `SelectBoost` versus `0.409` for plain `SelectBoost`.
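
In the rows printed above, `delta_mean` appears to be the difference between the two mean `F1` columns, and the rows appear to be ordered by that difference. The sketch below checks both observations on the five printed rows only, with the values copied by hand into a small data frame; it makes no claim about how the shipped CSV was generated.

```r
# Values copied from the five rows printed above.
top <- data.frame(
  selectboost_f1_mean = c(0.5362319, 0.5885135, 0.5833671, 0.4972542, 0.5429293),
  plain_selectboost_f1_mean = c(0.4087266, 0.4826750, 0.4944862, 0.4144859, 0.4657088),
  delta_mean = c(0.12750533, 0.10583853, 0.08888092, 0.08276831, 0.07722048)
)

# Recompute the gain of the FDA-aware method over the plain baseline.
recomputed <- top$selectboost_f1_mean - top$plain_selectboost_f1_mean

# Largest discrepancy from the stored column (only rounding error is expected).
max(abs(recomputed - top$delta_mean))

# TRUE when the rows are sorted by decreasing delta_mean.
!is.unsorted(rev(top$delta_mean))
```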

## Further documentation

The package vignettes cover the main workflow families:

- discretized curves
- spectra and interval-aware `SelectBoost`
- basis and FPCA workflows
- methods, calibration, and formula interfaces
- simulation and benchmark workflows

## References

- Bertrand F., Aouadi I., Jung N., Carapito R., Vallat L., Bahram S., and
  Maumy-Bertrand M. SelectBoost: a general algorithm to enhance the performance
  of variable selection methods in correlated datasets. *Bioinformatics*.
  doi:10.1093/bioinformatics/btaa855
- Hofner B., Boccuto L., and Göker M. Stability selection and related
  subsampling-based selection procedures.
- Brockhaus S., Melcher M., Leisch F., and Greven S. FDboost:
  boosting functional regression models.

Versions across snapshots

| Version | Repository | File | Size |
|---|---|---|---|
| 0.5.0 | rolling linux/jammy R-4.5 | SelectBoost.FDA_0.5.0.tar.gz | 458.9 KiB |
| 0.5.0 | rolling linux/noble R-4.5 | SelectBoost.FDA_0.5.0.tar.gz | 458.9 KiB |
| 0.5.0 | rolling source/ R- | SelectBoost.FDA_0.5.0.tar.gz | 458.9 KiB |
| 0.5.0 | latest linux/jammy R-4.5 | SelectBoost.FDA_0.5.0.tar.gz | 458.9 KiB |
| 0.5.0 | latest linux/noble R-4.5 | SelectBoost.FDA_0.5.0.tar.gz | 458.9 KiB |
| 0.5.0 | latest source/ R- | SelectBoost.FDA_0.5.0.tar.gz | 458.9 KiB |
| 0.5.0 | 2026-04-26 source/ R- | SelectBoost.FDA_0.5.0.tar.gz | 458.9 KiB |
| 0.5.0 | 2026-04-23 source/ R- | SelectBoost.FDA_0.5.0.tar.gz | 458.9 KiB |
