
OncoSubtype

Predict Cancer Subtypes Based on TCGA Data using Machine Learning Method

Provides functionality for cancer subtyping using nearest centroids or machine learning methods based on TCGA data.


<!-- README.md is generated from README.Rmd. Please edit that file -->

# OncoSubtype

Provides functionality for cancer subtyping using existing published
methods or machine learning based on TCGA data.

Currently supported mRNA subtyping:

-   LUSC using the [nearest centroids
    method](https://aacrjournals.org/clincancerres/article/16/19/4864/75620/Lung-Squamous-Cell-Carcinoma-mRNA-Expression)
    or a random forest model trained on TCGA data.
-   LUAD using the [nearest centroids
    method](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0036530)
    or a random forest model trained on TCGA data.
-   HNSC using the [nearest centroids
    method](https://www.nature.com/articles/nature14129) or a random
    forest model trained on TCGA data.
-   BLCA using a random forest (rf) model trained on TCGA data.
-   ESCA using a random forest (rf) model trained on TCGA data.
-   ESCC using a random forest (rf) model trained on TCGA data.
-   BRCA using the PAM50 method, based on the R package genefu.
-   STAD using a random forest (rf) model trained on TCGA data.

## Latest release

1.0.0

## Installation

You can install the released version with:

``` r
install.packages("OncoSubtype")
```

## Example

This basic example predicts subtypes for lung squamous cell carcinoma
(LUSC).

### Predict LUSC mRNA Expression Subtypes using [nearest centroids method](https://aacrjournals.org/clincancerres/article/16/19/4864/75620/Lung-Squamous-Cell-Carcinoma-mRNA-Expression)

``` r
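library(OncoSubtype)
library(tidyverse)
# assays() and rowData() are SummarizedExperiment accessors;
# load the package explicitly in case it is not already attached
library(SummarizedExperiment)

# Median-center the example data and extract the centered matrix
data <- get_median_centered(example_fpkm)
data <- assays(data)$centered
# Label rows with gene names
rownames(data) <- rowData(example_fpkm)$external_gene_name

# Use the default method (Wilkerson et al.)
output1 <- centroids_subtype(data, disease = 'LUSC')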
table(output1@subtypes)
#> 
#>     basal classical primitive secretory 
#>        44        65        26        44
```
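To make the idea behind the call above concrete, here is a minimal base R sketch of median centering followed by nearest-centroid assignment. It is not the package's implementation: the centroid values, gene names, and sample names are invented for illustration, and it assumes that genes are centered on their median across samples and that similarity is measured by Pearson correlation.

``` r
set.seed(1)
genes <- paste0("gene", 1:6)

# Hypothetical centroids (genes in rows, subtypes in columns)
centroids <- cbind(
  basal     = c( 2,  1, -1, -2,  0,  0),
  classical = c(-1,  2,  1,  0, -2,  0),
  primitive = c( 0, -2,  2,  1,  0, -1)
)
rownames(centroids) <- genes

# Toy expression matrix: each sample is a noisy copy of one centroid
# shifted by a constant baseline
truth <- c("basal", "classical", "primitive", "basal")
expr <- centroids[, truth] + matrix(rnorm(6 * 4, sd = 0.2), nrow = 6) + 5
colnames(expr) <- paste0("sample", 1:4)

# Median-center each gene (row) across samples
centered <- sweep(expr, 1, apply(expr, 1, median), "-")

# Correlate every sample with every centroid (samples x subtypes)
cors <- cor(centered, centroids, method = "pearson")

# Assign each sample to the subtype with the highest correlation
calls <- colnames(cors)[max.col(cors, ties.method = "first")]
data.frame(sample = colnames(expr), call = calls)
```

In this toy setup every sample should be assigned to the centroid it was generated from, because the noise is small relative to the differences between centroids.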

### Using random forest model by training TCGA LUSC data

``` r
output2 <- ml_subtype(data, disease = 'LUSC', method = 'rf', seed = 123)
table(output2@subtypes)
#> 
#>     basal classical primitive secretory 
#>        43        65        27        44
```
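For readers unfamiliar with the approach, the following sketch shows the general shape of training a random forest on labeled expression data and predicting labels for new samples. It uses the `randomForest` package directly on simulated data; the sample names, gene names, class labels, and split are all hypothetical, and this is not how `ml_subtype()` is implemented internally.

``` r
library(randomForest)

set.seed(123)
n_samples <- 60
n_genes <- 20
labels <- factor(rep(c("A", "B"), each = n_samples / 2))

# Simulated expression: samples in rows, genes in columns
expr <- matrix(rnorm(n_samples * n_genes), nrow = n_samples,
               dimnames = list(paste0("s", seq_len(n_samples)),
                               paste0("gene", seq_len(n_genes))))
# Make the first five genes informative by shifting them in class B
expr[labels == "B", 1:5] <- expr[labels == "B", 1:5] + 3

# Hold out the last 10 samples of each class for testing
train_idx <- c(1:20, 31:50)
fit <- randomForest(x = expr[train_idx, ], y = labels[train_idx],
                    ntree = 200)

# Predict labels for the held-out samples and compare with the truth
pred <- predict(fit, newdata = expr[-train_idx, ])
table(predicted = pred, truth = labels[-train_idx])
```

Setting the seed, as the `seed = 123` argument does in the call above, keeps the fitted forest reproducible between runs.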

### Check the consistency between the two methods

``` r
# confusionMatrix() is from caret; load it in case it is not already attached
library(caret)
confusionMatrix(as.factor(tolower(output1@subtypes)),
                as.factor(tolower(output2@subtypes)))
#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  basal classical primitive secretory
#>   basal        43         1         0         0
#>   classical     0        64         1         0
#>   primitive     0         0        26         0
#>   secretory     0         0         0        44
#> 
#> Overall Statistics
#>                                           
#>                Accuracy : 0.9888          
#>                  95% CI : (0.9602, 0.9986)
#>     No Information Rate : 0.3631          
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.9846          
#>                                           
#>  Mcnemar's Test P-Value : NA              
#> 
#> Statistics by Class:
#> 
#>                      Class: basal Class: classical Class: primitive
#> Sensitivity                1.0000           0.9846           0.9630
#> Specificity                0.9926           0.9912           1.0000
#> Pos Pred Value             0.9773           0.9846           1.0000
#> Neg Pred Value             1.0000           0.9912           0.9935
#> Prevalence                 0.2402           0.3631           0.1508
#> Detection Rate             0.2402           0.3575           0.1453
#> Detection Prevalence       0.2458           0.3631           0.1453
#> Balanced Accuracy          0.9963           0.9879           0.9815
#>                      Class: secretory
#> Sensitivity                    1.0000
#> Specificity                    1.0000
#> Pos Pred Value                 1.0000
#> Neg Pred Value                 1.0000
#> Prevalence                     0.2458
#> Detection Rate                 0.2458
#> Detection Prevalence           0.2458
#> Balanced Accuracy              1.0000
```
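The headline numbers in this output can be checked by hand. The sketch below copies the confusion matrix shown above into base R and recomputes accuracy and Cohen's kappa from their standard definitions; the variable names are chosen here for illustration only.

``` r
# Confusion matrix copied from the output above
# (rows: nearest-centroid calls, columns: random forest calls)
subtypes <- c("basal", "classical", "primitive", "secretory")
cm <- matrix(c(43,  1,  0,  0,
                0, 64,  1,  0,
                0,  0, 26,  0,
                0,  0,  0, 44),
             nrow = 4, byrow = TRUE,
             dimnames = list(Prediction = subtypes, Reference = subtypes))

n <- sum(cm)
# Observed agreement: proportion of samples on the diagonal
accuracy <- sum(diag(cm)) / n
# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
# Cohen's kappa: agreement beyond chance, rescaled to a maximum of 1
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = cohen_kappa), 4)
```

The two methods agree on 177 of 179 samples, which reproduces the reported Accuracy (0.9888) and Kappa (0.9846).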

### Plot important genes

``` r
# Compute scaled variable importance (varImp() is from caret)
vi <- varImp(output2@method, scale = TRUE)
# Plot the 20 most important genes
plot(vi, top = 20)
```

<img src="man/figures/README-varImp-1.png" width="100%" />

### Plot heatmap

``` r
PlotHeat(object = output2, set = 'both', fontsize = 10,
        show_rownames = FALSE, show_colnames = FALSE)
```

<img src="man/figures/README-heatmap-1.png" width="100%" />

## Report bugs or issues [here](https://github.com/DadongZ/OncoSubtype/issues)

Versions across snapshots

| Version | Repository | File | Size |
|---|---|---|---|
| 1.0.0 | rolling linux/jammy R-4.5 | OncoSubtype_1.0.0.tar.gz | 1.3 MiB |
| 1.0.0 | rolling linux/noble R-4.5 | OncoSubtype_1.0.0.tar.gz | 1.3 MiB |
| 1.0.0 | rolling source/ R- | OncoSubtype_1.0.0.tar.gz | 1.3 MiB |
| 1.0.0 | latest linux/jammy R-4.5 | OncoSubtype_1.0.0.tar.gz | 1.3 MiB |
| 1.0.0 | latest linux/noble R-4.5 | OncoSubtype_1.0.0.tar.gz | 1.3 MiB |
| 1.0.0 | latest source/ R- | OncoSubtype_1.0.0.tar.gz | 1.3 MiB |
| 1.0.0 | 2026-04-26 source/ R- | OncoSubtype_1.0.0.tar.gz | 1.3 MiB |
| 1.0.0 | 2026-04-23 source/ R- | OncoSubtype_1.0.0.tar.gz | 1.3 MiB |
| 1.0.0 | 2025-04-20 source/ R- | OncoSubtype_1.0.0.tar.gz | 1.3 MiB |
