Performing Two-Sample Mean Comparison Using HMC

This tutorial demonstrates how to use the HMC package to perform high-dimensional two-sample mean comparison, leveraging data-driven projection and cross-fitting for improved power and interpretability. The demonstration uses a sample dataset that can be downloaded via the following command:

my_local_path <- "~/your/local/path/"
url <- "https://raw.githubusercontent.com/terrytianyuzhang/HMC/main/docs/data/tiny_cleary_for_HMC.rds"
dest_file <- paste0(my_local_path, "tiny_cleary_for_HMC.rds")
download.file(url, dest_file, mode = "wb") 

url <- "https://raw.githubusercontent.com/terrytianyuzhang/HMC/main/docs/data/gene_clustering_for_HMC.rds"
dest_file <- paste0(my_local_path, "gene_clustering_for_HMC.rds")
download.file(url, dest_file, mode = "wb") 
library(data.table)
library(HMC)

# Load data
residual_subset <- readRDS(paste0(my_local_path, "tiny_cleary_for_HMC.rds"))

# Subset to control and treatment groups, and drop guide annotations

control <- residual_subset[
    Guides_collapsed_by_gene == "non-targeting",
    !"Guides_collapsed_by_gene"
]
perturbed <- residual_subset[
    Guides_collapsed_by_gene == "STAT1",
    !"Guides_collapsed_by_gene"
]

First 3 cells and first 5 genes from the control group:

control[1:3, 1:5]
       MALAT1       FTH1      NEAT1        CAP1         FTL
        <num>      <num>      <num>       <num>       <num>
1: 0.06779834 0.01032825 0.01288475 0.018713570 0.003885878
2: 0.10235660 0.01502890 0.01529569 0.005780347 0.005958204
3: 0.08374384 0.01067323 0.01204160 0.009304871 0.003010400

First 3 cells and first 5 genes from the treatment group:

perturbed[1:3, 1:5]
       MALAT1       FTH1      NEAT1        CAP1         FTL
        <num>      <num>      <num>       <num>       <num>
1: 0.10705896 0.01455491 0.01648723 0.011259870 0.012455195
2: 0.07751155 0.04359122 0.01270208 0.009382217 0.009670901
3: 0.09746205 0.02317766 0.01390659 0.032448719 0.006373856

Next, we apply the test using a subset of genes corresponding to a specific cluster. We load the gene grouping file and isolate genes from cluster 31:

residual_subset <- readRDS(paste0(my_local_path, "gene_clustering_for_HMC.rds"))
gene_to_keep <- which(clustering$cluster_index == 31)
control_subset <- control[, ..gene_to_keep]
perturbed_subset <- perturbed[, ..gene_to_keep]

We now run the comparison function using the selected genes:

set.seed(123)
test_result <- mean_comparison_anchor(
    control = control_subset,
    treatment = perturbed_subset,
    pca_method = "sparse_pca",
    classifier_method = "lasso",
    lambda_type = "lambda.min",
    n_folds = 5,
    verbose = FALSE
)

The p-value from the two-sample mean comparison test is:

test_result$p_value
             [,1]
[1,] 5.514311e-05

This means the average expression level is significantly different.

The function below extracts the genes frequently selected by the lasso classifier as discriminative features between the two groups. Usually they are the most improtant genes discriminating the two groups.

collect_active_features_proj(test_result)
[1] "ADAM10"  "DTX3L"   "PHACTR4" "UGCG"    "VPS13C"