library("dplyr")
library("Seurat")
library("knitr")
library("ggplot2")
library("BiocManager")
library("here")
#BiocManager::install("EnhancedVolcano")
library("EnhancedVolcano") #volcano plot
#install.packages('DESeq2') #for DEG
library("DESeq2")
library("tidyverse") #tidy up data
if (!require("kableExtra")) {install.packages("kableExtra"); require("kableExtra")} # for color brewer
if (!require("RColorBrewer")) {install.packages("RColorBrewer"); require("RColorBrewer")} # for color brewer
if (!require("sctransform")) {install.packages("sctransform"); require("sctransform")} # for data normalization
if (!require("glmGamPoi")) {BiocManager::install('glmGamPoi'); require("glmGamPoi")} # for data normalization, sctransform
if (!require("cowplot")) {install.packages("cowplot"); require("cowplot")} # for figure layout
if (!require("patchwork")) {install.packages("patchwork"); require("patchwork")} # for figure patching
if (!require("openxlsx")) {install.packages("openxlsx"); require("openxlsx")} # to save .xlsx files
set.seed(12345)
# here()
Welcome to the first meeting meeting of the Single-Cell Omics Research and Education Club!
If this is your time to the club, I want to extend and extra-special welcome to you!
I’m Jonathan Nelson, an Assistant Professor at the University of Southern California. I’m a wet scientist turned wet+dry scientist. I’ve been working with single-cell RNAseq data for the past 5 years and I’m excited to share what I’ve learned with you.
We believe that bioinformatics is a constantly evolving field, and that ongoing learning and professional development is essential to staying up-to-date. We encourage members to share their knowledge and experiences with each other, and to seek out opportunities for continued learning.
We believe that access to bioinformatics support should be available to everyone. We strive to create a welcoming and inclusive environment where all members can feel comfortable asking for help and contributing to the group.
We believe that working together is key to achieving success in bioinformatics. We value the diversity of perspectives and backgrounds that each member brings, and we encourage open communication and the sharing of ideas.
We believe in conducting ourselves with honesty and professionalism in all our interactions. We hold ourselves to high ethical standards and respect the privacy and confidentiality of all members.
We believe in approaching each other with empathy and kindness. We understand that bioinformatics can be a challenging and sometimes frustrating field, and we strive to support each other through these difficulties.
I know a lot of this has been going on in the background for everyeone and I wanted to bring it to the forefront. My expectation is that we have about 6 of these meetings together and then we can re-evaluate if we want to continue as a group or not.
Email me you would like me to add anyone: j.nelson@med.usc.edu
Today’s code (this html file) will be posted to the SCORE website (https://usckrc.github.io/website/score.html)
PhD -> No coding/bioinformatics
I studied biochemistry, physiology, and genetics
Thesis Title: Disruption of Soluble Epoxide Hydrolase as a Therapeutic Target for Stroke
Started with some bulk RNAseq (upset at lack of tools for people who didn’t know how to code)
Was constantly being asked by others about the expression level in a dataset.
So I created an web app in Shiny: https://kcvi.shinyapps.io/START/
Wolf Alice: Don’t Delete The Kisses
https://open.spotify.com/track/1hZyADicbo5k0OQbwcxVEo?si=27d712cd975c4957
df2
df.2
df_2
Df_2
DF_2
Great Article on Type Cases -> https://www.alexhyett.com/snake-case-vs-camel-case-vs-pascal-case-vs-kebab-case/
https://stackoverflow.com/questions/1944910/what-is-your-preferred-style-for-naming-variables-in-r
I’m trying to consistently use PascalCase for my objects in R.
While you have a lot of flexibility in naming objects in R, I’d encourage you to be more intentional and consistent with how you decide to name your objects.
https://pubmed.ncbi.nlm.nih.gov/28851704/
ggvenn -> Good for visualizing plots
https://github.com/NicolasH2/ggvenn
gplots -> Good for identifying genes in categories
Seurat is a popular R package for single-cell RNAseq analysis, pioneer by the Sataja lab.
Hao, Y., Stuart, T., Kowalski, M.H. et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat Biotechnol 42, 293–304 (2024). https://doi.org/10.1038/s41587-023-01767-y
# Install the remotes package
install.packages('remotes')
# Replace 'X.X.X' with your desired version
# My current version is 4.4.0
remotes::install_version(package = 'Seurat', version = package_version('X.X.X'))remotes::install_version("Seurat", "4.4.0", repos = c("https://satijalab.r-universe.dev", getOption("repos")))
# Uploading and Analyzing Two snRNAseq Datasets
SO1 <- Read10X_h5(here("MH1_filtered_feature_bc_matrix.h5"))
SO2 <- Read10X_h5(here("MH3_filtered_feature_bc_matrix.h5"))
SO1 <- CreateSeuratObject(counts = SO1, project = "Dataset", min.cells = 3, min.features = 200)
SO2 <- CreateSeuratObject(counts = SO2, project = "Dataset", min.cells = 3, min.features = 200)
# Add meta data to each object to keep track of them.
# In this case, SO1 is *Sham* and SO2 is treated with *CACPR*. I'm adding this information to the same metadata column that is called *Condition*
SO1 <- AddMetaData(object = SO1, metadata = "CACPR", col.name = "Condition")
SO2 <- AddMetaData(object = SO2, metadata = "Sham", col.name = "Condition")
# Merging and Normal scRNAseq Analysis Pipeline
# It's possible to merge more than 2 objects together by concatenating the `y =` term to `y = c(SO2, SO3)`
SOmerged <- merge(SO1, y = c(SO2), add.cell.ids = c("CACPR", "Sham"), project = "Kidney")
# check on metadata and level it.
head(SOmerged@meta.data)
SOmerged@meta.data$Condition <- factor(SOmerged@meta.data$Condition, levels = c("Sham", "CACPR"))
SOmerged <- NormalizeData(SOmerged, normalization.method = "LogNormalize", scale.factor = 10000)
SOmerged <- FindVariableFeatures(SOmerged, selection.method = "vst", nfeatures = 2000)
SOmerged.genes <- rownames(SOmerged)
SOmerged <- ScaleData(SOmerged, features = SOmerged.genes)
SOmerged <- RunPCA(SOmerged, features = VariableFeatures(object = SOmerged))
SOmerged <- FindNeighbors(SOmerged, dims = 1:15)
SOmerged <- FindClusters(SOmerged, resolution = 0.1)
SOmerged <- RunUMAP(SOmerged, dims = 1:15)
DimPlot(SOmerged, reduction = "umap")
DimPlot(SOmerged, reduction = "umap", group.by = "Condition")
# Notice that there is almost a complete separation between *Sham* and *CACPR* samples. This is probably the combination of a strong
# To find genes that are shared between these two samples that can be used as anchors to cluster similar cells between the two sample we will use the *FindIntegrationAnchors* pipeline that was intiaily developed by former PSI high school student Annie Lackey which she learned from the Seurat website.
# https://satijalab.org/seurat/articles/integration_introduction.html
# FindIntegration Anchors Pipeline
#In order to solve this problem we turn to a feature of Seurat that helps us integrate data. To prep the data we first have to split the object by treatment group. This creates the new feature `SOmerged.list`, so we can continue with standard normalization and finding the features of the new object.
# split the RNA measurements into two layers one for control cells, one for stimulated cells
SOmerged.integrated <- IntegrateLayers(object = SOmerged, method = CCAIntegration, orig.reduction = "pca", new.reduction = "integrated.cca",
verbose = FALSE)
# re-join layers after integration
SOmerged.integrated[["RNA"]] <- JoinLayers(SOmerged.integrated[["RNA"]])
SOmerged.integrated <- FindNeighbors(SOmerged.integrated, reduction = "integrated.cca", dims = 1:30)
SOmerged.integrated <- FindClusters(SOmerged.integrated, resolution = 1)
SOmerged.integrated <- RunUMAP(SOmerged.integrated, dims = 1:30, reduction = "integrated.cca")
DimPlot(SOmerged.integrated, reduction = "umap")
DimPlot(SOmerged.integrated, reduction = "umap", group.by = c("Condition"))
# Comparison of Merged and Integrated Pipelines
# Notice the addition of the ggplot2 code that is layered onto the `DimPlot` function in order to modify the appearance of the graph for clarity as `+ ggtitle("Merged")`.
DimPlot(SOmerged, reduction = "umap", group.by = "Condition") + ggtitle("Merged")
DimPlot(SOmerged.integrated, reduction = "umap", group.by = "Condition") + ggtitle("Integrated")
# Using Patchwork place the graphs side-by-side
f1 <- DimPlot(SOmerged, reduction = "umap", group.by = "Condition") + ggtitle("Merged")
f2 <- DimPlot(SOmerged.integrated, reduction = "umap", group.by = "Condition") + ggtitle("Integrated")
f1 + f2
rm(SOmerged, SOmerged.integrated)
# Removing Ambient RNA with SoupX
# The current strategy that we use to remove doublets relies on [(Young and Behjati, 2020)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7763177/)
# In order for this code to work you need both the *raw_feature_bc_matrix* in addition to the *filteredfiltered_feature_bc_matrix*.
# This pipeline was written by former visiting medical student [(Jeremiah Reyes)](https://twitter.com/imyourNephBro).
tod = Read10X_h5(here("1_raw_feature_bc_matrix.h5")) #Change
toc = Read10X_h5(here("1_filtered_feature_bc_matrix.h5")) #Change
sc = SoupChannel(tod,toc)
#Make the Seurat object from the filtered control data
SO <- Read10X_h5(here("1_filtered_feature_bc_matrix.h5")) #Change
SO <- CreateSeuratObject(counts = SO, project = "Peri-INTACT") #Change
#Cluster the cells with Seurat
SO <- SCTransform(SO, verbose = F)
SO <- RunPCA(SO, verbose = F)
SO <- RunUMAP(SO, dims = 1:30, verbose = F)
SO <- FindNeighbors(SO, dims = 1:30, verbose = F)
SO <- FindClusters(SO, verbose = T)
meta <- SO@meta.data
umap <- SO@reductions$umap@cell.embeddings
clusters <- setNames(meta$seurat_clusters, rownames(meta))
#Sanity Check
length(clusters) #should be equal to nrow(sc$metaData)
nrow(sc$metaData)
sc <- setClusters(sc, clusters)
sc <- setDR(sc, umap)
#Estimate rho
sc = autoEstCont(sc)
#Clean the data
SO_out = adjustCounts(sc)
#Create a new Seurat Object out of the cleaned data
seurat.obj <- CreateSeuratObject(SO_out)
# Remove Doublets with the Doublet Detector Package
# The output of the SoupX chunck feeds right into the doublet detector pipline which is based on [(McGinnis et al., 2019)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6853612/).
VlnPlot(seurat.obj, features = c("nFeature_RNA", "nCount_RNA"), ncol = 2)
# Minimal QC and Filtering (low quality cells) to let doublet find doublets
seurat.obj.f <- subset(seurat.obj, nFeature_RNA > 500)
VlnPlot(seurat.obj.f, features = c("nFeature_RNA", "nCount_RNA"), ncol = 2)
seurat.obj.f
# Pre-process standard workflow
seurat.obj.f <- NormalizeData(object = seurat.obj.f)
seurat.obj.f <- FindVariableFeatures(object = seurat.obj.f)
seurat.obj.f <- ScaleData(object = seurat.obj.f)
seurat.obj.f <- RunPCA(object = seurat.obj.f)
ElbowPlot(seurat.obj.f, ndims = 40)
# PCs between 15-20
seurat.obj.f <- FindNeighbors(object = seurat.obj.f, dims = 1:30)
seurat.obj.f <- FindClusters(object = seurat.obj.f, resolution = 0.03)
seurat.obj.f <- RunUMAP(object = seurat.obj.f, dims = 1:30)
DimPlot(seurat.obj.f, reduction = "umap")
# paramSweep
# A very time-consuming step.
# Calculate each combination of pN and pK
sweep.res.list_seurat.obj.f <- paramSweep(seurat.obj.f, PCs = 1:20, sct = FALSE)
# summarizeSweep
#Summarize each combination of pN and pK
sweep.stats_seurat.obj.f <- summarizeSweep(sweep.res.list_seurat.obj.f, GT = FALSE)
#Select the pK that corresponds to max bcmvn to optimize doublet detection
bcmvn_seurat.obj.f <- find.pK(sweep.stats_seurat.obj.f)
pK <- bcmvn_seurat.obj.f %>%
filter(BCmetric == max(BCmetric)) %>%
select(pK)
#See pK in the Values Environment
pK <- as.numeric(as.character(pK[[1]]))
# summarizeSweep
# Homotypic Doublet Proportion Estimate
annotations <- seurat.obj.f@meta.data$seurat_clusters
homotypic.prop <- modelHomotypic(annotations)
homotypic.prop
# 10X Multiplet Rate Table (the doublet ratio is # of cells recovered divided by 125000) https://kb.10xgenomics.com/hc/en-us/articles/360001378811-What-is-the-maximum-number-of-cells-that-can-be-profiled-
nrow(seurat.obj.f@meta.data)
# This automates the number of doublets based on the number of cells in the Seurat object
# This addition was made by Xiao-Tong Su, postdoc at OHSU in the Ellison Lab
nExp_poi <- round(nrow(seurat.obj.f@meta.data) # To calculate cell number
/125000 # To calculate the doublet ratio
*nrow(seurat.obj.f@meta.data))
nExp_poi
nExp_poi_adj <- round(nExp_poi*(1-homotypic.prop))
# doubletFinder
seurat.obj.f_doublets <- doubletFinder(seurat.obj.f,
PCs = 1:20,
pN = 0.25,
pK = pK,
nExp = nExp_poi_adj,
reuse.pANN = FALSE, sct = FALSE)
colnames(seurat.obj.f_doublets@meta.data)[6] <- "pANN"
colnames(seurat.obj.f_doublets@meta.data)[7] <- "DF.class"
head(seurat.obj.f_doublets@meta.data)
table(seurat.obj.f_doublets@meta.data$DF.class)
DimPlot(seurat.obj.f_doublets, group.by = "DF.class")
VlnPlot(seurat.obj.f_doublets, "nFeature_RNA", group.by = "DF.class")
VlnPlot(seurat.obj.f_doublets, "nCount_RNA", group.by = "DF.class")
# Subset Singlets
seurat.obj.f_singlets <- subset(seurat.obj.f_doublets, DF.class == "Singlet")
seurat.obj.f_singlets
DimPlot(seurat.obj.f_singlets, reduction = "umap")
# Remove Mitochondrial Genes
# Because it isn't clear what the percentage of mitochondrial genes means in a single-*nucleus* RNAseq dataset, we will take out all the mitochondrial genes to remove thier effect on clustering. However, before we remove them from the dataset, we will calculate their values and *stash* them in the metadata.
seurat.obj.f_singlets <- seurat.obj.f_singlets[!grepl("^mt-", rownames(seurat.obj.f_singlets)), ]
# Mito Sanity Check
counts <- GetAssayData(seurat.obj.f_singlets, assay = "RNA")
mito.genes <- grep(pattern = "^mt-", x = rownames(x = counts), value = TRUE)
mito.genes #should be zero
DimPlot(seurat.obj.f_singlets, reduction = "umap", label = T)
# Initial Cluster Identification
# This is a generic pipeline that I use to identify the cell clusters in an unbiased way.
seurat.obj.f_singlets.markers <- FindAllMarkers(seurat.obj.f_singlets, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
seurat.obj.f_singlets.markers %>%
group_by(cluster) %>%
top_n(n = 5, wt = avg_log2FC) -> top5
DoHeatmap(seurat.obj.f_singlets, features = top5$gene) + NoLegend()
seurat.obj.f_singlets.markers %>%
group_by(cluster) %>%
top_n(n = 2, wt = avg_log2FC) -> top2
VlnPlot(seurat.obj.f_singlets,
features = unique(top2$gene),
stack = TRUE,
flip = TRUE,
pt.size = 0)+
NoLegend()
# Dotplot
# This dotplot code was writen by former PSI student Annie Lackey and has been a go-to for the color scheme for dot plots ever since.
DotPlot(seurat.obj.f_singlets, features = unique(top2$gene), dot.scale = 8, cols = c("dodgerblue2", "coral2")) + RotatedAxis()
# Save Output Files
# This is the code to save the seurat object to it keeps all of it's analysis information including clustering information. The most important part of the code here is to make sure to include a *.rds* at the end of the name.
#saveRDS(seurat.obj.f_singlets, here("Singlets.rds)) #Change
I hope that you found this session helpful.
Do you have a coding problem that you’d like some
support on?
Do you have a topic you’d like covered
at a future meeting?
Email me: j.nelson@med.usc.edu
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22631)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] openxlsx_4.2.5.2 patchwork_1.2.0
## [3] cowplot_1.1.3 glmGamPoi_1.12.2
## [5] sctransform_0.4.1 RColorBrewer_1.1-3
## [7] kableExtra_1.3.4 lubridate_1.9.2
## [9] forcats_1.0.0 stringr_1.5.1
## [11] purrr_1.0.2 readr_2.1.5
## [13] tidyr_1.3.1 tibble_3.2.1
## [15] tidyverse_2.0.0 DESeq2_1.40.2
## [17] SummarizedExperiment_1.30.2 Biobase_2.60.0
## [19] MatrixGenerics_1.12.3 matrixStats_1.2.0
## [21] GenomicRanges_1.52.1 GenomeInfoDb_1.36.4
## [23] IRanges_2.34.1 S4Vectors_0.38.2
## [25] BiocGenerics_0.46.0 EnhancedVolcano_1.18.0
## [27] ggrepel_0.9.5 here_1.0.1
## [29] BiocManager_1.30.22 ggplot2_3.5.1
## [31] knitr_1.45 SeuratObject_5.0.1
## [33] Seurat_4.4.0 dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 splines_4.3.1 later_1.3.1
## [4] bitops_1.0-7 polyclip_1.10-6 lifecycle_1.0.4
## [7] rprojroot_2.0.4 globals_0.16.2 lattice_0.21-8
## [10] MASS_7.3-60 magrittr_2.0.3 plotly_4.10.4
## [13] sass_0.4.9 rmarkdown_2.25 jquerylib_0.1.4
## [16] yaml_2.3.7 httpuv_1.6.11 zip_2.3.0
## [19] spam_2.10-0 sp_2.1-3 spatstat.sparse_3.0-3
## [22] reticulate_1.34.0 pbapply_1.7-2 abind_1.4-5
## [25] zlibbioc_1.46.0 rvest_1.0.3 Rtsne_0.17
## [28] RCurl_1.98-1.12 GenomeInfoDbData_1.2.10 irlba_2.3.5.1
## [31] listenv_0.9.1 spatstat.utils_3.0-4 goftest_1.2-3
## [34] spatstat.random_3.2-2 fitdistrplus_1.1-11 parallelly_1.36.0
## [37] svglite_2.1.1 leiden_0.4.3.1 codetools_0.2-19
## [40] DelayedArray_0.26.7 xml2_1.3.6 tidyselect_1.2.1
## [43] spatstat.explore_3.2-5 webshot_0.5.5 jsonlite_1.8.8
## [46] ellipsis_0.3.2 progressr_0.14.0 ggridges_0.5.6
## [49] survival_3.5-5 systemfonts_1.0.5 tools_4.3.1
## [52] ica_1.0-3 Rcpp_1.0.12 glue_1.7.0
## [55] gridExtra_2.3 xfun_0.40 withr_3.0.2
## [58] fastmap_1.1.1 fansi_1.0.4 digest_0.6.33
## [61] timechange_0.2.0 R6_2.5.1 mime_0.12
## [64] colorspace_2.1-0 scattermore_1.2 tensor_1.5
## [67] spatstat.data_3.0-4 utf8_1.2.3 generics_0.1.3
## [70] data.table_1.14.10 httr_1.4.7 htmlwidgets_1.6.2
## [73] S4Arrays_1.2.0 uwot_0.1.16 pkgconfig_2.0.3
## [76] gtable_0.3.6 lmtest_0.9-40 XVector_0.40.0
## [79] htmltools_0.5.8.1 dotCall64_1.1-1 scales_1.3.0
## [82] png_0.1-8 rstudioapi_0.15.0 tzdb_0.4.0
## [85] reshape2_1.4.4 nlme_3.1-162 cachem_1.0.8
## [88] zoo_1.8-12 KernSmooth_2.23-21 parallel_4.3.1
## [91] miniUI_0.1.1.1 pillar_1.9.0 grid_4.3.1
## [94] vctrs_0.6.5 RANN_2.6.1 promises_1.2.1
## [97] xtable_1.8-4 cluster_2.1.4 evaluate_0.23
## [100] cli_3.6.1 locfit_1.5-9.8 compiler_4.3.1
## [103] rlang_1.1.1 crayon_1.5.2 future.apply_1.11.1
## [106] plyr_1.8.9 stringi_1.7.12 viridisLite_0.4.2
## [109] deldir_2.0-2 BiocParallel_1.34.2 munsell_0.5.1
## [112] lazyeval_0.2.2 spatstat.geom_3.2-8 Matrix_1.6-5
## [115] hms_1.1.3 future_1.33.1 shiny_1.8.0
## [118] ROCR_1.0-11 igraph_1.6.0 bslib_0.8.0