<- "~/your/local/path/"
my_local_path <- "https://raw.githubusercontent.com/terrytianyuzhang/HMC/main/docs/data/tiny_cleary_for_HMC.rds"
url <- paste0(my_local_path, "tiny_cleary_for_HMC.rds")
dest_file download.file(url, dest_file, mode = "wb")
<- "https://raw.githubusercontent.com/terrytianyuzhang/HMC/main/docs/data/gene_clustering_for_HMC.rds"
url <- paste0(my_local_path, "gene_clustering_for_HMC.rds")
dest_file download.file(url, dest_file, mode = "wb")
Performing Two-Sample Mean Comparison Using HMC
This tutorial demonstrates how to use the HMC
package to perform high-dimensional two-sample mean comparison, leveraging data-driven projection and cross-fitting for improved power and interpretability. The demonstration uses a sample dataset that can be downloaded via the following command:
library(data.table)
library(HMC)
# Load data
<- readRDS(paste0(my_local_path, "tiny_cleary_for_HMC.rds"))
residual_subset
# Subset to control and treatment groups, and drop guide annotations
<- residual_subset[
control == "non-targeting",
Guides_collapsed_by_gene !"Guides_collapsed_by_gene"
]<- residual_subset[
perturbed == "STAT1",
Guides_collapsed_by_gene !"Guides_collapsed_by_gene"
]
First 3 cells and first 5 genes from the control group:
1:3, 1:5] control[
MALAT1 FTH1 NEAT1 CAP1 FTL
<num> <num> <num> <num> <num>
1: 0.06779834 0.01032825 0.01288475 0.018713570 0.003885878
2: 0.10235660 0.01502890 0.01529569 0.005780347 0.005958204
3: 0.08374384 0.01067323 0.01204160 0.009304871 0.003010400
First 3 cells and first 5 genes from the treatment group:
1:3, 1:5] perturbed[
MALAT1 FTH1 NEAT1 CAP1 FTL
<num> <num> <num> <num> <num>
1: 0.10705896 0.01455491 0.01648723 0.011259870 0.012455195
2: 0.07751155 0.04359122 0.01270208 0.009382217 0.009670901
3: 0.09746205 0.02317766 0.01390659 0.032448719 0.006373856
Next, we apply the test using a subset of genes corresponding to a specific cluster. We load the gene grouping file and isolate genes from cluster 31:
<- readRDS(paste0(my_local_path, "gene_clustering_for_HMC.rds"))
residual_subset <- which(clustering$cluster_index == 31)
gene_to_keep <- control[, ..gene_to_keep]
control_subset <- perturbed[, ..gene_to_keep] perturbed_subset
We now run the comparison function using the selected genes:
set.seed(123)
<- mean_comparison_anchor(
test_result control = control_subset,
treatment = perturbed_subset,
pca_method = "sparse_pca",
classifier_method = "lasso",
lambda_type = "lambda.min",
n_folds = 5,
verbose = FALSE
)
The p-value from the two-sample mean comparison test is:
$p_value test_result
[,1]
[1,] 5.514311e-05
This means the average expression level is significantly different.
The function below extracts the genes frequently selected by the lasso classifier as discriminative features between the two groups. Usually they are the most improtant genes discriminating the two groups.
collect_active_features_proj(test_result)
[1] "ADAM10" "DTX3L" "PHACTR4" "UGCG" "VPS13C"