Single-cell RNA sequencing (scRNAseq) has become an increasingly important tool for profiling the transcriptional landscape of the kidney, enabling the discovery of cell-type-specific responses in both health and disease. However, analyzing kidney scRNAseq datasets presents unique challenges due to the kidney’s cellular complexity, technical variability in capturing cells, and the disproportionate representation of abundant cell types; all of which can bias downstream analyses. Moreover, no clear standards have been established for how to analyze kidney scRNAseq data, undermining reproducibility and cross-study comparisons between research groups. Here, we present a practical framework for the analysis, annotation, and deposition of kidney scRNAseq datasets. Our approach outlines recommended steps for quality control, ambient RNA correction, detection and removal of artificial doublets, and annotation of cell clusters. We propose standardized naming conventions for kidney cell types to harmonize nomenclature and promote clarity and reproducibility. In addition, we provide practical guidelines for preparing and depositing processed data in public repositories such as the Gene Expression Omnibus (GEO), ensuring accessibility and compliance with FAIR (Findable, Accessible, Interoperable, Reusable) data principles. To support adoption of this framework, we offer template code and a video tutorial for analyzing a kidney scRNAseq dataset, starting with a gene-by-cell expression matrix and ending with an annotated atlas. Each step is explained in detail to help researchers follow the workflow, understand its underlying rationale, and confidently apply it to their own kidney single-cell datasets. Together, this framework aims to foster transparency, reproducibility, and collaborative progress in kidney research.
The goal of this tutorial is to take a scRNAseq dataset from the Cell Ranger outputs stage (cell by gene matrix) to an accurately annotated seurat object (.rds file).
This tutorial was written to guide the analysis of a healthy
whole kidney dataset. While many of the same principals apply
to analyzing a diseased whole kidney dataset, more
caution needs to be taken so that diseased cells do not get
inadvertently filtered out or incorrectly
annotated.
We hope to write future guides on how to approach
analyzing diseased kidney datasets.
Also, the landscape of single-cell technologies is dynamic and improving every day. It’s challenging to set hard thresholds and levels for certain filtering parameters because of this. We hope that the principles underlying this analysis framework stay true, even if the technology and packages used to analyze kidney scRNAseq data evolves.
A manuscript describing this analysis framework can be found here(broken link).
A video tutorial walking through this analysis framework can be found here(broken link).
This R markdown file is meant to guide a researcher through the steps of data analysis.
Once you have analyzed and annotated your dataset and set all the parameters you need to fill out, you can knit the file, and it will create an html report of your analysis for long-term documentation.
We have made every effort to reduce the memory burden in this tutorial. However, certain functions are more memory intensive than others (such as SCTransform).
Therefore, we recommend performing this tutorial on a computer with at least 16GB of RAM.
As you analyze more cells (or larger objects), you may need increased amounts of memory (RAM) to accommodate computationally heavy functions.
While beyond the scope of this tutorial, we recommend the following resources for learning R and R Markdown:
R for Data Science (2nd
Edition)
R
Markdown: The Definitive Guide
R Bloggers
Chatomics
We suggest creating a folder titled “Nephronomics” on your desktop or in a location of your choice. This folder will contain all the files and folders for the analysis. Even though we suggest a title, you can name the file any way you like.
Download the “scRNAseq_guide” RMD file to follow
along, or apply to your own sample here.
Then, place scRNAseq_guide.Rmd in the
“Nephronomics” folder.
Ordered file structure is essential to keep track of the files and folders you are working with. This is especially important when working with large datasets, as it can be easy to lose track of files if they are not organized properly.
We will create two folders: Datasets and Outputs. The Datasets folder will contain the raw and filtered matrices, as well as the metadata file. The Outputs folder will contain the results of the analysis, such as the annotated .rds after analysis.
We utilize the here() package in order to create a relative path to the folders. This allows the code to be able to run on any computer without needing to change the file paths (e.g. working directories).
Read more about the here() package here
Download the tutorial dataset to walk through in
this guide:
Download the raw
matrix
Download the filtered
matrix
Download the metadata
file
NEPHRONOMICS
│ scRNAseq_guide.Rmd
│
│ - - - Datasets
│
Kidney_7_filtered_feature_bc_matrix.h5
│
Kidney_7_raw_feature_bc_matrix.h5
│ metadata.xlsx
│
│ - - - Outputs
Simply remove the hashtags to run the code and install packages for the first time. After the packages are installed, you can comment out the code again by adding the hashtags at the beginning of each line.
if (!require("dplyr")) {install.packages("dplyr"); require("dplyr")}
if (!require("Seurat")) {install.packages("Seurat"); require("Seurat")}
if (!require("patchwork")) {install.packages("patchwork"); require("patchwork")}
if (!require("cowplot")) {install.packages("cowplot"); require("cowplot")}
if (!require("ggpubr")) {install.packages("ggpubr"); require("ggpubr")}
if (!require("plotly")) {install.packages("plotly"); require("plotly")}
if (!require("knitr")) {install.packages("knitr"); require("knitr")}
if (!require("htmlwidgets")) {install.packages("htmlwidgets"); require("htmlwidgets")}
if (!require("tidyverse")) {install.packages("tidyverse"); require("tidyverse")} # for titying up data
if (!require("RColorBrewer")) {install.packages("RColorBrewer"); require("RColorBrewer")} # for color brewer
if (!require("sctransform")) {install.packages("sctransform"); require("sctransform")} # for data normalization
if (!require("openxlsx")) {install.packages("openxlsx"); require("openxlsx")} # to save .xlsx files
if (!require("SoupX")) {install.packages("SoupX"); require("SoupX")}
if (!require("readxl")) {install.packages("readxl"); require("readxl")}
if (!require("hdf5r")) {install.packages("hdf5r"); require("hdf5r")}
if (!require("glmGamPoi")) {BiocManager::install('glmGamPoi'); require("glmGamPoi")} # for data normalization, sctransform
if (!require("EnhancedVolcano")) {BiocManager::install('EnhancedVolcano'); require("EnhancedVolcano")} # volcano plot
library(ggsankey)
library(DoubletFinder)
library(MoMAColors)
set.seed((12345))
here()
#renv::init()
#renv::install("Seurat@5.2.0")
#renv::install("hdf5r")
#renv::install("chris-mcginnis-ucsf/DoubletFinder")
#renv::snapshot()
doubletFinder: version 2.0.6
and
Seurat: version 5.x
for this code.This will be used to automate the addition of metadata to the Seurat object and to save output files (DEG lists and Seurat Object).
# Create a Soup Channel Object(sc) to use for future filtering of ambient RNA
RawDataset = Read10X_h5(here("Datasets", "Kidney_7_raw_feature_bc_matrix.h5"))
FilteredDataset = Read10X_h5(here("Datasets", "Kidney_7_filtered_feature_bc_matrix.h5"))
# Make a Seurat Object from the filtered control data
filtered_kidney <- Read10X_h5(here("Datasets", "Kidney_7_filtered_feature_bc_matrix.h5"))
filtered_kidney <- CreateSeuratObject(counts = filtered_kidney, project = "Filteredkidney_snRNA")
SoupX is an R package designed to identify and remove ambient RNA contamination from droplet-based single-cell RNA sequencing (scRNAseq) data. Ambient RNA, freely floating transcripts that are captured non-specifically during droplet formation, can distort downstream analyses by falsely inflating gene expression levels, particularly in low-UMI cells or empty droplets. SoupX models this “soup” contamination by first estimating the background expression profile from low-UMI droplets that are likely empty. It then uses this ambient RNA profile to adjust gene expression values in cell-containing droplets, accounting for contamination in a probabilistic and gene-specific manner.
When you run autoEstCont(sc)
in SoupX, it automatically
estimates the contamination fraction (rho) by comparing the expression
of known marker genes in presumed “pure” cell types. The graph output
typically includes a plot of the model fit, showing observed versus
expected expression for these genes across a range of contamination
fractions. The point where the model best matches the observed data is
used to estimate the contamination rate. A good fit suggests that the
estimated contamination fraction accurately captures the ambient RNA
contribution, and the plot can help assess whether the correction is
likely to be reliable.
soupx
Manuscript
soupx Github
soupx
Tutorial
sc = SoupChannel(RawDataset,FilteredDataset)
# Cluster the Cells with Seurat
filtered_kidney <- SCTransform(filtered_kidney, verbose = F)
filtered_kidney <- RunPCA(filtered_kidney, verbose = F)
filtered_kidney <- RunUMAP(filtered_kidney, dims = 1:40, verbose = F)
filtered_kidney <- FindNeighbors(filtered_kidney, dims = 1:40, verbose = F)
filtered_kidney <- FindClusters(filtered_kidney, verbose = F)
meta <- filtered_kidney@meta.data
umap <- filtered_kidney@reductions$umap@cell.embeddings
clusters <- setNames(meta$seurat_clusters, rownames(meta))
#Ensure the two following numbers are equal before continuing
check1 <- length(clusters)
check2 <- nrow(sc$metaData)
sc <- setClusters(sc, clusters)
sc <- setDR(sc, umap)
This post rho value is 3.5% with a FWHM of 1.5% - 10% . This means that your estimated amount of ambient RNA is 3.5%.
Ambient RNA levels are acceptable.
#Clean the data
filtered_kidney_out = adjustCounts(sc)
#Create a new Seurat Object out of the cleaned data
seurat.obj <- CreateSeuratObject(filtered_kidney_out)
# Remove unused objects in environment to save RAM
#rm(filtered_kidney, filtered_kidney_out, FilteredDataset, RawDataset, sc, meta, check1, check2, clusters, umap, y, Datasets, Outputs)
DoubletFinder is an R package used to detect and remove doublets; droplets that have captured two or more cells instead of one, from single-cell RNAseq datasets. Doublets can introduce misleading hybrid transcriptional profiles, impacting clustering, differential expression, and trajectory inference. DoubletFinder works by generating artificial doublets from real cell profiles and then identifying real cells that resemble these artificial profiles. It embeds both real and artificial cells in a reduced-dimensional space (usually PCA), computes each cell’s neighborhood density of artificial doublets (pANN score), and uses that to rank the likelihood that each real cell is a doublet.
After running DoubletFinder, you’ll typically interpret results by
examining the metadata field DF.class
, which labels each
cell as “Doublet” or “Singlet.” You can visualize the results using
dimensionality reduction plots (e.g., UMAP), coloring cells by their
classification to check whether doublets cluster separately or are
enriched in certain regions. The pANN
score itself is a
continuous value indicating how doublet-like a cell is, which is useful
for threshold tuning. The expected doublet rate, often informed by cell
loading density and 10X Genomics guidelines, plays a critical role in
setting the classification cutoff.
You set the expected rate of doublets when using DoubletFinder. This is often set as the rate of doublets published by 10x genomics based on their microfluidic device. But this rate does NOT account for an resulting from poor sample preparation.
This is one of the reasons why this tutorial suggests a 2-step doublet filtering. First with DoubletFinder (which does an effective job at identifying homotypic doublets), and then manual doublet curation based on cluster expression characteristic.
This step serves to filter out droplets that may have inadvertently been determined as cells by CellRanger (or other demultiplexing applications).
Not filtering out the empty droplets will interfere with doublet detection downstream.
Start <- ncol(seurat.obj)
p1 <- VlnPlot(seurat.obj, features = c("nFeature_RNA")) + theme(axis.title.x = element_blank()) + theme(legend.position = 'none') + theme(axis.text.x = element_text(angle = 0, hjust = .5))
p2 <- VlnPlot(seurat.obj, features = c("nFeature_RNA"), pt.size = 0) + theme(axis.title.x = element_blank()) + theme(legend.position = 'none') + theme(axis.text.x = element_text(angle = 0, hjust = .5))
p3 <- VlnPlot(seurat.obj, features = c("nCount_RNA")) + theme(axis.title.x = element_blank()) + theme(legend.position = 'none') + theme(axis.text.x = element_text(angle = 0, hjust = .5))
p4 <- VlnPlot(seurat.obj, features = c("nCount_RNA"), pt.size = 0) + theme(axis.title.x = element_blank()) + theme(legend.position = 'none') + theme(axis.text.x = element_text(angle = 0, hjust = .5))
cowplot::plot_grid(p1, p2, p3, p4, ncol = 4, scale = .85) + draw_label("Before Filtering", y = .95, size = 22, fontface = "bold")
# Minimal filtering of low quality cells
seurat.obj.f <- subset(seurat.obj, nFeature_RNA > 500)
p1 <- VlnPlot(seurat.obj.f, features = c("nFeature_RNA")) + theme(axis.title.x = element_blank()) + theme(legend.position = 'none') + theme(axis.text.x = element_text(angle = 0, hjust = .5))
p2 <- VlnPlot(seurat.obj.f, features = c("nFeature_RNA"), pt.size = 0) + theme(axis.title.x = element_blank()) + theme(legend.position = 'none') + theme(axis.text.x = element_text(angle = 0, hjust = .5))
p3 <- VlnPlot(seurat.obj.f, features = c("nCount_RNA")) + theme(axis.title.x = element_blank()) + theme(legend.position = 'none') + theme(axis.text.x = element_text(angle = 0, hjust = .5))
p4 <- VlnPlot(seurat.obj.f, features = c("nCount_RNA"), pt.size = 0) + theme(axis.title.x = element_blank()) + theme(legend.position = 'none') + theme(axis.text.x = element_text(angle = 0, hjust = .5))
cowplot::plot_grid(p1, p2, p3, p4, ncol = 4, scale = .85) + draw_label("After Filtering Low Complexity Cells", y = .95, size = 22, fontface = "bold")
Filter_1 <- ncol(seurat.obj.f)
amountfiltered1 <- Start-Filter_1
#Making a table to display the amount of cells being filtered out with the minimal filtering of low quality cells applied to the snRNAseq dataset.
cell_counts <- data.frame(Phase = c("Before Filtering", "After Filtering", "Amount Filtered"), Cell_Number = c(Start, Filter_1, amountfiltered1))
print(cell_counts)
# Remove unused objects in environment to save RAM
#rm(seurat.obj, p1, p2, p3, p4, cell_counts)
72 cells were removed after filtering low-complexity cells.
# Pre-process standard workflow
seurat.obj.f <- NormalizeData(object = seurat.obj.f)
seurat.obj.f <- FindVariableFeatures(object = seurat.obj.f)
seurat.obj.f <- ScaleData(object = seurat.obj.f)
seurat.obj.f <- RunPCA(object = seurat.obj.f)
# ElbowPlot(seurat.obj.f, ndims = 40)
seurat.obj.f <- FindNeighbors(object = seurat.obj.f, dims = 1:40)
seurat.obj.f <- FindClusters(object = seurat.obj.f, resolution = resolution_value, verbose = F)
# Make a UMAP to visualize the clusters
seurat.obj.f <- RunUMAP(object = seurat.obj.f, dims = 1:40)
DimPlot(seurat.obj.f, reduction = "umap", label = TRUE) +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle(paste0("Initial Clustering for ", sample_name))
This file can be used to determine the prominent features (highly enriched genes) of each cluster.
The file will be located in the Outputs folder.
all_markers <- FindAllMarkers(seurat.obj.f, only.pos = TRUE, min.pct = 0.5, logfc.threshold = 1)
# Order by avg_log2FC
all_markers <- all_markers %>%
arrange(desc(avg_log2FC)) %>%
select(gene, everything())
# Split by cluster (ident column)
marker_list <- split(all_markers, all_markers$cluster)
# Sort names alphanumerically
sorted_cluster_names <- names(marker_list)[order(as.numeric(names(marker_list)))]
# Create a workbook
wb <- createWorkbook()
# Add each cluster as a new worksheet
for (cluster_name in sorted_cluster_names) {
addWorksheet(wb, sheetName = cluster_name)
writeData(wb, sheet = cluster_name, marker_list[[cluster_name]])
}
date <- format(Sys.Date(), "%Y%m%d")
# Save workbook
saveWorkbook(wb, here("Outputs", paste0(date, "_", sample_name, "_FindAllMarkers_By_Cluster.xlsx")), overwrite = TRUE)
cell_type_markers_df <- data.frame(
Subclass = c(
"PTS1", "PTS2", "PTS3", "DTL", "TALA", "TALB", "MD", "DCT1", "DCT2", "CNT",
"PC", "ICA", "ICB", "PODO", "PEC", "URO", "EC", "LYMPH", "FIB", "MES", "VSMC", "LYMPHO", "MACRO"
),
Positive_Marker_1 = c(
"Lrp2", "Lrp2", "Lrp2", "Epha7", "Slc12a1", "Slc12a1", "Nos1", "Slc12a3", "Slc12a3", "Slc8a1",
"Aqp2", "Slc4a1", "Slc26a4", "Nphs1", "Ncam1", "Upk1b", "Flt1", "Lyve1", "Pdgfra", "Pdgfrb", "Pdgfrb", "Ptprc", "Ptprc"
),
Positive_Marker_2 = c(
"Slc5a12", "Slc13a3", "Slc16a9", "Cryab", "Cldn16", "Cldn10", "Slc12a1", "Egf", "Slc8a1", NA,
NA, "Kit", NA, NA, NA, NA, "Emcn", "Kdr", "Pdgfrb", "Piezo2", "Acta2", "Skap1", "Cd74"
),
stringsAsFactors = FALSE
)
# Creating a dot plot to cross reference the features with the cell clusters present, seeing the average expression and percent expressed
markers.to.plot1 <- c("Lrp2", # PTS1, PTS2, PTS3
"Slc5a12", # PTS1
"Slc13a3", # PTS2
"Slc16a9", # PTS3
"Epha7", # DTL
"Cryab", # DTL
"Slc12a1", # TALA, TALB, MD
"Cldn16", # TALA
"Cldn10", # TALB
"Nos1", # MD
"Slc12a3", # DCT1, DCT2
"Egf", # DCT1
"Slc8a1", # CNT, DCT2
"Aqp2", # PC
"Slc4a1", # ICA
"Kit", # ICA
"Slc26a4", # ICB
"Nphs1", # PODO
"Ncam1", # PEC
"Upk1b", # URO
"Flt1", # EC
"Emcn", # EC
"Lyve1", # LYMPH
"Kdr", # LYMPH
"Pdgfra", # FIB
"Pdgfrb", # FIB, MES, VSMC
"Piezo2", # MES
"Acta2", # VSMC
"Ptprc", # LYMPHO, MACRO
"Skap1", # LYMPHO
"Cd74" # MACRO
)
p1 <- DotPlot(seurat.obj.f,
features = markers.to.plot1,
dot.scale = 8,
dot.min = 0,
scale.max = 100,
scale.min = 0,
col.min = -2.5,
col.max = 2.5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip() +
labs(x = "Gene Features", y = "Clusters")
p1
p2 <- as.data.frame(p1$data)
low_pct_features <- p2 %>%
group_by(features.plot) %>%
filter(all(pct.exp <= 75)) %>%
distinct(features.plot) %>%
pull(features.plot)
# low_pct_features
low_pct_features2 <- as.data.frame(low_pct_features)
markers_long <- cell_type_markers_df %>%
pivot_longer(cols = c(Positive_Marker_1, Positive_Marker_2),
names_to = "marker_type", values_to = "marker") %>%
filter(!is.na(marker)) # remove any NAs in marker
# Step 2: Join with low_pct_features2
low_pct_features2_annotated <- low_pct_features2 %>%
left_join(markers_long, by = c("low_pct_features" = "marker")) %>%
rename(cell_type = Subclass) # rename Subclass to cell_type
missing_cell_types <- unique(na.omit(low_pct_features2_annotated$cell_type))
missing_cell_types
The following features were found to have less than 75% of the cells expressing them in the any clusters: Epha7, Cryab, Cldn16, Slc4a1, Ncam1, Lyve1, Pdgfrb, Piezo2, Acta2.
This may make it challenging to identify the following subclasses in this dataset: DTL, TALA, ICA, PEC, LYMPH, FIB, MES, VSMC.
for (i in low_pct_features) {
# Find the subclass (if any) where the marker matches Positive_Marker_1 or _2
subclass_match <- cell_type_markers_df %>%
filter(Positive_Marker_1 == i | Positive_Marker_2 == i) %>%
pull(Subclass)
# If found, use the first match; else NA
subclass_label <- if (length(subclass_match) > 0) subclass_match[1] else NA
# Build the title
title_text <- if (!is.na(subclass_label)) {
paste0(i, " (", subclass_label, ")")
} else {
i
}
# Create plot
p <- FeaturePlot(seurat.obj.f, features = i) &
scale_colour_gradientn(colours = rev(brewer.pal(n = 9, name = "RdYlBu"))) &
ggtitle(title_text)
print(p) # Or save/store as needed
}
# PCs set to 1:20 since it was between 15-20 on the earlier elbow plot
sweep.res.list_seurat.obj.f <- paramSweep(seurat.obj.f, PCs = 1:20, sct = FALSE)
#Summarize each combination of pN and pK
sweep.stats_seurat.obj.f <- summarizeSweep(sweep.res.list_seurat.obj.f, GT = FALSE)
#Select the pK that corresponds to max bcmvn to optimize doublet detection
bcmvn_seurat.obj.f <- find.pK(sweep.stats_seurat.obj.f)
pK <- bcmvn_seurat.obj.f %>%
filter(BCmetric == max(BCmetric)) %>%
select(pK)
#See pK in the Values Environment
pK <- as.numeric(as.character(pK[[1]]))
# Homotypic Doublet Proportion Estimate -------------------------------------------------------------------------------------
annotations <- seurat.obj.f@meta.data$seurat_clusters
homotypic.prop <- modelHomotypic(annotations)
# 10X Multiplet Rate Table (the doublet ratio is # of cells recovered divided by 125000) https://kb.10xgenomics.com/hc/en-us/articles/360001378811-What-is-the-maximum-number-of-cells-that-can-be-profiled-
nExp_poi <- round(nrow(seurat.obj.f@meta.data) # To calculate cell number
/125000 # To calculate the doublet ratio
*nrow(seurat.obj.f@meta.data))
nExp_poi_adj <- round(nExp_poi*(1-homotypic.prop))
seurat.obj.f_doublets <- doubletFinder(seurat.obj.f,
PCs = 1:20,
pN = 0.25,
pK = pK,
nExp = nExp_poi_adj,
reuse.pANN = NULL, sct = FALSE)
# Change names of columns for future analysis
colnames(seurat.obj.f_doublets@meta.data)[grep("^pANN", colnames(seurat.obj.f_doublets@meta.data))] <- "pANN"
colnames(seurat.obj.f_doublets@meta.data)[grep("^DF.class", colnames(seurat.obj.f_doublets@meta.data))] <- "DF.class"
seurat.obj.f_doublets@meta.data$DF.class <- factor(
seurat.obj.f_doublets@meta.data$DF.class,
levels = c("Singlet", "Doublet")
)
table(seurat.obj.f_doublets@meta.data$DF.class)
# Make a value for the doublet count to include in the analysis
doublet_count <- table(seurat.obj.f_doublets@meta.data$DF.class)["Doublet"]
p1 <- VlnPlot(seurat.obj.f_doublets, features = c("nFeature_RNA"), group.by = "DF.class") +
theme(axis.title.x = element_blank(),
legend.position = 'none',
axis.text.x = element_text(angle = 0, hjust = .5))
p2 <- VlnPlot(seurat.obj.f_doublets, features = c("nCount_RNA"), group.by = "DF.class") +
theme(axis.title.x = element_blank(),
legend.position = 'none',
axis.text.x = element_text(angle = 0, hjust = .5))
p3 <- DimPlot(seurat.obj.f_doublets, group.by = "DF.class")
grid <- plot_grid(p1, p2, p3, ncol = 3, align = "h", axis = "tb", rel_widths = c(1, 1, 2))
plot_grid(
ggdraw() + draw_label("Violin and Feature Plot by DF.class", fontface = 'bold', size = 22, hjust = 0.5),
grid,
ncol = 1,
rel_heights = c(0.1, 1)
)
seurat.obj.f_singlets <- subset(seurat.obj.f_doublets, DF.class == "Singlet")
DimPlot(seurat.obj.f_singlets, reduction = "umap", label = T)+
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle(paste0("UMAP for ", sample_name, " after DoubletFinder"))
Filter_2 <- ncol(seurat.obj.f_singlets)
amountfiltered2 <- Filter_1-Filter_2
#Making a table to display the amount of cells being filtered out with the minimal filtering of low quality cells applied to the snRNAseq dataset.
cell_counts <- data.frame(Phase = c("Before Filtering", "After Filtering", "Amount Filtered"), Cell_Number = c(Filter_1, Filter_2, amountfiltered2))
print(cell_counts)
# Remove unused objects in environment to save RAM
# rm(seurat.obj.f, seurat.obj.f_doublets, sweep.res.list_seurat.obj.f, sweep.stats_seurat.obj.f, p, p1, p2)
592 cells were removed after DoubletFinder.
A multidimensional dot plot of known cell type markers is a powerful
manual curation tool used to assess the biological validity of clusters
in single-cell or single-nucleus RNAseq data. By plotting the expression
of canonical marker genes across all clusters; using
DotPlot()
in Seurat, researchers can visualize both the
average expression (color) and the proportion of
expressing cells (dot size) for each marker. This
approach allows for rapid identification of clusters that express
expected marker combinations for defined cell types (e.g., Nphs2 for
podocytes, Lrp2 for proximal tubule), as well as clusters with
conflicting or non-specific expression patterns.
Clusters that show co-expression of markers from multiple, unrelated lineages (e.g., proximal tubule and endothelial) or lack strong expression of any canonical markers are often flagged as poor-quality or artifactual. These clusters may represent doublets or low-quality cells. Manual removal of such clusters, guided by marker expression patterns rather than purely unsupervised algorithms, can significantly improve dataset interpretability and ensure more biologically meaningful downstream analyses. However, this process requires domain expertise and should be applied with care.
Cluster 6: Proximal Tubule (Lrp2)
AND Fibroblast (Pdgfra)
Cluster 13: Proximal Tubule (Lrp2)
AND Endothelial (Flt1)
Cluster 14: Proximal Tubule (Lrp2)
AND Thick Ascending Limb (Slc12a1)
Cluster 16: Weak Proximal Tubule (Lrp2)
Cluster 22: Weak Proximal Tubule (Lrp2)
Cluster 32: Proximal Tubule (Lrp2)
AND Endothelial (Flt1)
doublet_clusters <- c()
code below to categorize
clusters as DOUBLET.# Creating a dot plot to cross reference the features with the cell clusters present, seeing the average expression and percent expressed
f1 <- DotPlot(seurat.obj.f_singlets,
features = markers.to.plot1,
dot.scale = 8,
dot.min = 0,
scale.max = 100,
scale.min = 0,
col.min = -2.5,
col.max = 2.5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip() +
labs(x = "Gene Features", y = "Clusters")
f1
f2 <- as.data.frame(f1$data)
maybe_doublets <- f2 %>%
group_by(id) %>%
filter(all(avg.exp.scaled <= 2.0)) %>%
distinct(id) %>%
pull(id)
maybe_doublets
# Create color palette: grey for all, color for maybe_doublets
all_clusters <- levels(seurat.obj.f_singlets)
cluster_colors <- setNames(rep("grey80", length(all_clusters)), all_clusters)
cluster_colors[maybe_doublets] <- "red"
# Get UMAP coordinates and cluster assignments
umap_data <- Embeddings(seurat.obj.f_singlets, "umap") %>%
as.data.frame() %>%
mutate(cluster = Idents(seurat.obj.f_singlets))
# Compute cluster centroids
cluster_centroids <- umap_data %>%
group_by(cluster) %>%
summarise(UMAP_1 = mean(umap_1), UMAP_2 = mean(umap_2))
# Keep only centroids for maybe_doublets
label_data <- cluster_centroids %>%
filter(cluster %in% maybe_doublets)
# Plot
DimPlot(seurat.obj.f_singlets, reduction = "umap", cols = cluster_colors) +
ggrepel::geom_text_repel(
data = label_data,
aes(x = UMAP_1, y = UMAP_2, label = cluster),
size = 4,
color = "black",
fontface = "bold",
max.overlaps = Inf,
box.padding = 0.5,
point.padding = 0.5
)
Some clusters present are doublets, and one can identify such, as they express various gene markers with a low average expression.
It is vital to refer back to the multidimensional dot plot to ensure that the identity of the clusters as maybe doublets aligns with the data present in the plot. Here, it appears that clusters 6, 13, 14, 16, 22, 32 are still potential doublets in the dataset after previous removal.
# Annotate metadata
seurat.obj.f_singlets@meta.data <- seurat.obj.f_singlets@meta.data %>%
mutate(subclass.doublet = if_else(
seurat_clusters %in% doublet_clusters, "DOUBLET", "SINGLET"
))
# Change the identities to align with new subclass names and make umap
Idents(seurat.obj.f_singlets) <- seurat.obj.f_singlets@meta.data$subclass.doublet
f1 <- DimPlot(seurat.obj.f_singlets, reduction = "umap") +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Before Manual Doublet Filtering")
# Double check the table to ensure SINGLET & DOUBLET cluster appear with this function
table(Idents(seurat.obj.f_singlets))
# Clean out the doublets manually that were identified
seurat.obj.f_cleaned <- subset(seurat.obj.f_singlets, idents = "DOUBLET", invert = TRUE)
# Double check that the identity was removed
table(Idents(seurat.obj.f_cleaned))
# Make umap to double check removal
f2 <- DimPlot(seurat.obj.f_cleaned, reduction = "umap") +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("After Manual Doublet Filtering")
f1+f2
While substantial progress has been made in
identifying the diverse cell types of the kidney, our
understanding remains incomplete, particularly with respect to rare cell
types. The kidney comprises a complex array of epithelial, endothelial,
stromal, and immune cells, each playing specialized roles in maintaining
homeostasis. However, not all predicted or anatomically described cell
types are readily distinguishable in transcriptomic datasets due to
limitations in resolution, depth, and biological variability.
How Many
Cell Types Are in the Kidney and What Do They Do?
Kidney
Precision Medicine Project Annotation of Kidney Cell Types
The list of cell types presented here reflects those we have confidently and reproducibly identified through single-cell RNA sequencing (scRNAseq) analysis of kidney samples containing approximately 5,000 to 20,000 high-quality cells. These cell types represent transcriptionally distinct populations that are robustly detected across independent datasets and are supported by established gene expression signatures, forming a conservative and reliable reference for downstream analyses.
seurat.obj.f_cleaned <- SCTransform(seurat.obj.f_cleaned) %>%
RunPCA() %>%
FindNeighbors(dims = 1:30) %>%
FindClusters(resolution = 2, verbose = FALSE) %>%
RunUMAP(dims = 1:30)
DimPlot(seurat.obj.f_cleaned, reduction = "umap", label = TRUE) +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle(paste0("Non-annotated Clusters for ", sample_name))
# Make mdd for all removed doublets
DotPlot(seurat.obj.f_cleaned,
features = markers.to.plot1,
dot.scale = 8,
dot.min = 0,
scale.max = 100,
scale.min = 0,
col.min = -2.5,
col.max = 2.5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip() +
labs(x = "Gene Features", y = "Clusters")
Use the list of kidney markers in the table above to determine the
identity of each of the clusters. Then in the code below, enter the name
of the cluster (as shown in the table above).
For example, Cluster 0 expresses
Flt1 and Emcn, which are
Endothelial Markers. Therefore add
“EC” into the line of code here:
seurat_clusters == 0 ~ "EC"
to annotate this cluster as
“EC”.
Cluster 4 expresses Slc8a1, which
is a marker for both DCT2 and CNT;
however, the cluster is negative for Slc12a3 making
this cluster CNT . Therefore add “CNT”
into the line of code here: seurat_clusters == 4 ~ "CNT"
to
annotate this cluster as “CNT”.
Cluster 23 expresses Pdgfrb, and
Piezo2, which are Mesangial Cell
Markers. Therefore add “MES” into the line of
code here: seurat_clusters == 23 ~ "MES"
to annotate this
cluster as “MES”. However, since this cluster also expresses a bit of
Acta2 it’s likely there are some VSMCs in the cluster
as well.
Then repeat for the rest of the clusters.
By using the same cell type nomeclature in the
table, we will be able to automatically annotate the
clusters at the class and type level
and provide proportion plots.
If you notice that there are additional clusters that have
characteristics consistent with doublets, then call
them DOUBLETS
and they will be filtered out at the end of
this chunk. However, we do not recluster after this step, as we’ve found
that the removal of this small number of cells doesn’t alter the
clustering significantly.
# Apply names to the clusters by subclass
seurat.obj.f_cleaned@meta.data <- seurat.obj.f_cleaned@meta.data %>% mutate(subclass = dplyr::case_when(
seurat_clusters == 0 ~ "EC",
seurat_clusters == 1 ~ "DCT1",
seurat_clusters == 2 ~ "FIB",
seurat_clusters == 3 ~ "PTS1",
seurat_clusters == 4 ~ "CNT",
seurat_clusters == 5 ~ "TALB",
seurat_clusters == 6 ~ "PTS2",
seurat_clusters == 7 ~ "EC",
seurat_clusters == 8 ~ "PTS2",
seurat_clusters == 9 ~ "PTS1",
seurat_clusters == 10 ~ "PTS2",
seurat_clusters == 11 ~ "EC",
seurat_clusters == 12 ~ "PTS3",
seurat_clusters == 13 ~ "TALA",
seurat_clusters == 14 ~ "PC",
seurat_clusters == 15 ~ "PTS2",
seurat_clusters == 16 ~ "DCT2",
seurat_clusters == 17 ~ "PTS3",
seurat_clusters == 18 ~ "PODO",
seurat_clusters == 19 ~ "DTL",
seurat_clusters == 20 ~ "ICB",
seurat_clusters == 21 ~ "MACRO",
seurat_clusters == 22 ~ "ICA",
seurat_clusters == 23 ~ "MES",
seurat_clusters == 24 ~ "FIB",
seurat_clusters == 25 ~ "DOUBLET",
seurat_clusters == 26 ~ "PTS2",
seurat_clusters == 27 ~ "DOUBLET",
seurat_clusters == 28 ~ "MD",
seurat_clusters == 29 ~ "LYMPHO",
seurat_clusters == 30 ~ "DOUBLET"
))
# Filter out the doublets
seurat.obj.f_cleaned <- subset(seurat.obj.f_cleaned, subset = subclass == "DOUBLET", invert = TRUE)
subclass_order <- c("PTS1", "PTS2", "PTS3", "DTL", "TALA", "TALB", "MD", "DCT1", "DCT2", "CNT", "PC", "ICA", "ICB", "PODO", "PEC", "URO", "EC", "FIB", "MES", "LYMPHO", "MACRO")
seurat.obj.f_cleaned@meta.data$subclass <- factor(seurat.obj.f_cleaned@meta.data$subclass, levels = subclass_order)
# Change the identities to align with new subclass names and make umap
Idents(seurat.obj.f_cleaned) <- seurat.obj.f_cleaned@meta.data$subclass
DimPlot(seurat.obj.f_cleaned, reduction = "umap", label = TRUE, label.size = 4) +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle(paste0("Subclass Annotation for ", sample_name))
Filter_3 <- ncol(seurat.obj.f_cleaned)
amountfiltered3 <- Filter_2-Filter_3
#Making a table to display the amount of cells being filtered out with the minimal filtering of low quality cells applied to the snRNAseq dataset.
cell_counts <- data.frame(Phase = c("Before Filtering", "After Filtering", "Amount Filtered"), Cell_Number = c(Filter_2, Filter_3, amountfiltered3))
print(cell_counts)
947 cells were removed after manual doublet filtering.
The file will be located in the Outputs folder.
# Step 1: Find markers (grouped by subclass)
all_markers <- FindAllMarkers(
seurat.obj.f_cleaned,
only.pos = TRUE,
min.pct = 0.5,
logfc.threshold = 1,
group.by = "subclass" # grouping by subclass instead of cluster ID
)
# Step 2: Sort markers by descending log2FC, put gene column first
all_markers <- all_markers %>%
arrange(desc(avg_log2FC)) %>%
select(gene, everything())
# Step 3: Split markers by subclass (stored in 'cluster' column still)
marker_list <- split(all_markers, all_markers$cluster)
# Step 4: Reorder the list according to your subclass_order
# First ensure you're only including subclasses that are actually present
valid_subclasses <- subclass_order[subclass_order %in% names(marker_list)]
# Reorder marker_list based on that
marker_list <- marker_list[valid_subclasses]
# Step 5: Write to Excel
wb <- createWorkbook()
for (subclass in names(marker_list)) {
addWorksheet(wb, sheetName = subclass)
writeData(wb, sheet = subclass, marker_list[[subclass]])
}
# Step 6: Save file
date <- format(Sys.Date(), "%Y%m%d")
saveWorkbook(
wb,
here("Outputs", paste0(date, "_", sample_name, "_FindAllMarkers_By_Subclass.xlsx")),
overwrite = TRUE
)
While every kidney single-nucleus RNA sequencing (scRNAseq) dataset
reflects unique biological and technical aspects, certain consistent
features can be expected from a well-processed and accurately annotated
dataset. These hallmark features not only serve as quality indicators
but also reflect our current biological understanding of kidney
structure and cellular organization. Below, we describe several features
commonly observed in high-quality scRNAseq datasets from adult mouse
kidneys:
1. Distinct Proximal Tubule Cluster Architecture
Proximal tubule cells (PTs) typically constitute the largest
population of epithelial cells in the kidney and are often recovered in
high abundance. In well-annotated datasets, this population generally
resolves into at least two major clusters: PTS1/PTS2, which share
expression of classic PT markers like and Lrp2, and a nearby but
distinct PTS3 cluster that expresses more distal PT markers such as
Slc16a9. The clear separation of PTS3 suggests both adequate sequencing
depth and sufficient transcriptional diversity resolution in the
dataset.
2. Continuum from Distal Convoluted Tubule to Collecting Duct
Principal Cells
Another expected architectural feature is
the progression of epithelial cell states along the nephron from the
distal convoluted tubule (DCT) to the connecting tubule (CNT) and onward
to principal cells (PC) of the collecting duct. These lineages form a
functional continuum, and when processed correctly, they often appear as
a trajectory or gradient in dimensionality-reduction space (e.g., UMAP).
This continuum reflects both their shared transcriptional programs and
spatial adjacency in the nephron, with marker genes such as Slc12a3
(DCT), Slc8a1 (DCT2/CNT), and Aqp2 (PC) defining the
transitions.
3. Thick Ascending Limb (TAL) as a Dominant Nephron
Segment
The thick ascending limb (TAL) of the loop of
Henle is another prominent epithelial population, often appearing as one
of the largest nephron-derived clusters. This is consistent with the
high number of TAL cells in the kidney cortex and outer medulla and
their active transcriptional programs related to electrolyte transport.
Marker genes such as Slc12a1 and Umod define this population. In
well-annotated datasets, a small adjacent population corresponding to
the macula densa (MD), can be resolved as a transcriptionally distinct
but nearby cluster expressing the gene Nos1. While there is some
4. Clean Diagonal of Cell-Type Markers
If
clusters are correctly annotated, each should be characterized by the
restricted expression of specific marker genes. When annotated as
suggested below, cells will be ordered in a way that produces a clear
diagonal pattern of robust marker expression.
DimPlot(seurat.obj.f_cleaned, reduction = "umap", label = TRUE, label.size = 4) +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle(paste0("Subclass Annotation for ", sample_name))
# Change the order of the identities on the x-axis to create a diagonal on the dotplot, going from top to bottom of the "naming kidney cells" table
DotPlot(seurat.obj.f_cleaned,
features = markers.to.plot1,
group.by = "subclass",
dot.scale = 8,
dot.min = 0,
scale.max = 100,
scale.min = 0,
col.min = -2.5,
col.max = 2.5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip() +
labs(x = "Gene Features", y = "Clusters") +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle(paste0("Subclass Annotation for ", sample_name))
# Change the specificty of the cells to a lower level and make another UMAP
seurat.obj.f_cleaned@meta.data <- seurat.obj.f_cleaned@meta.data %>%
mutate(class = dplyr::case_when(
subclass %in% c("PTS1", "PTS2", "PTS3") ~ "PT",
subclass == "DTL" ~ "DTL",
subclass %in% c("TALA", "TALB", "MD") ~ "TAL",
subclass %in% c("DCT1", "DCT2") ~ "DCT",
subclass == "CNT" ~ "CNT",
subclass == "PC" ~ "PC",
subclass == "ICA" ~ "ICA",
subclass == "ICB" ~ "ICB",
subclass == "PODO" ~ "PODO",
subclass == "PEC" ~ "PEC",
subclass == "URO" ~ "URO",
subclass %in% c("EC", "ECG", "ECP", "LYMPH") ~ "EC",
subclass == "FIB" ~ "FIB",
subclass == "MES" ~ "MES",
subclass == "VSMC" ~ "VSMC",
subclass %in% c("LYMPHO", "MACRO") ~ "IMMUNE",
))
DimPlot(seurat.obj.f_cleaned, group.by = "class", label = TRUE, label.size = 4) +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle(paste0("Class Annotation for ", sample_name))
class_order <- c("PT", "DTL", "TAL", "DCT", "CNT", "PC", "ICA", "ICB", "PODO", "PEC", "URO", "EC", "FIB", "MES", "IMMUNE")
seurat.obj.f_cleaned@meta.data$class <- factor(seurat.obj.f_cleaned@meta.data$class, levels = class_order)
markers.to.plot2 <- c("Lrp2", # PTS1, PTS2, PTS3
"Cryab", # DTL
"Slc12a1", # TAL1, TAL2, MD
"Slc12a3", # DCT1, DCT2
"Slc8a1", # CNT
"Aqp2", # PC
"Kit", # ICA
"Slc26a4", # ICB
"Nphs1", # PODO
"Ncam1", # PEC
"Upk1b", # URO
"Flt1", # EC
"Pdgfra", # FIB
"Pdgfrb", # FIB, MES, VSMC
"Piezo2", # MES
"Ptprc" # LYMPHO, MACRO
)
DotPlot(seurat.obj.f_cleaned,
features = markers.to.plot2,
group.by = "class",
dot.scale = 8,
dot.min = 0,
scale.max = 100,
scale.min = 0,
col.min = -2.5,
col.max = 2.5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip() +
labs(x = "Gene Features", y = "Clusters") +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle(paste0("Class Annotation for ", sample_name))
# Change the specificty of the cells to a further lower level and make another UMAP
seurat.obj.f_cleaned@meta.data <- seurat.obj.f_cleaned@meta.data %>%
mutate(type = dplyr::case_when(
class %in% c("PT", "DTL", "TAL", "DCT", "CNT", "PC", "ICA", "ICB", "PODO", "PEC", "URO") ~ "EPITHELIAL",
class == "EC" ~ "ENDOTHELIAL",
class %in% c("FIB", "MES", "VSMC") ~ "STROMAL",
class == "IMMUNE" ~ "IMMUNE",
))
type_order <- c("EPITHELIAL", "ENDOTHELIAL", "STROMAL", "IMMUNE")
seurat.obj.f_cleaned@meta.data$type <- factor(seurat.obj.f_cleaned@meta.data$type, levels = type_order)
DimPlot(seurat.obj.f_cleaned, group.by = "type", label = TRUE, label.size = 4) +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle(paste0("Type Annotation for ", sample_name))
type_order <- c("EPITHELIAL", "ENDOTHELIAL", "STROMAL", "IMMUNE")
seurat.obj.f_cleaned@meta.data$type <- factor(seurat.obj.f_cleaned@meta.data$type, levels = type_order)
markers.to.plot3 <- c("Ank3", # PTS1, PTS2, PTS3, DTL, TAL1, TAL2, MD, DCT1, DCT2, CNT, PC, ICA, ICB, PODO, PEC, URO
"Flt1", # EC
"Pdgfrb", # FIB, MES, VSMC
"Ptprc" # LYMPHO, MACRO
)
DotPlot(seurat.obj.f_cleaned,
features = markers.to.plot3,
group.by = "type",
dot.scale = 8,
dot.min = 0,
scale.max = 100,
scale.min = 0,
col.min = -2.5,
col.max = 2.5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip() +
labs(x = "Gene Features", y = "Clusters") +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle(paste0("Type Annotation for ", sample_name))
t1 <- table(seurat.obj.f_cleaned@meta.data$subclass)
#t1
prop.t1 <- prop.table(t1)
#prop.t1
t2 <- as.data.frame(t1)
colnames(t2) <- c('Cell_type', 'Frequency')
# Original plot
f1a <- ggplot(t2, aes(fill=Cell_type, y=Frequency, x=1)) +
geom_bar(position="fill", stat = "identity", fun.y = "mean", colour="black") +
theme_classic() +
theme(axis.line = element_line(size = 1, colour = "black"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
text = element_text(size=20),
axis.line.x = element_blank()) +
labs(fill = "Cell Type") +
xlab(NULL) + ggtitle("Subclass")
t1 <- table(seurat.obj.f_cleaned@meta.data$class)
#t1
prop.t1 <- prop.table(t1)
#prop.t1
t2 <- as.data.frame(t1)
colnames(t2) <- c('Cell_type', 'Frequency')
# Original plot
f1b <- ggplot(t2, aes(fill=Cell_type, y=Frequency, x=1)) +
geom_bar(position="fill", stat = "identity", fun.y = "mean", colour="black") +
theme_classic() +
theme(axis.line = element_line(size = 1, colour = "black"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
text = element_text(size=20),
axis.line.x = element_blank()) +
labs(fill = "Cell Type") +
xlab(NULL) + ggtitle("Class")
t1 <- table(seurat.obj.f_cleaned@meta.data$type)
#t1
prop.t1 <- prop.table(t1)
#prop.t1
t2 <- as.data.frame(t1)
colnames(t2) <- c('Cell_type', 'Frequency')
# Original plot
f1c <- ggplot(t2, aes(fill = Cell_type, y = Frequency, x = 1)) +
geom_bar(position = "fill", stat = "identity", colour = "black") +
theme_classic() +
theme(
axis.line = element_line(size = 1, colour = "black"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
text = element_text(size = 20),
axis.line.x = element_blank()
) +
labs(fill = "Cell Type") +
xlab(NULL) +
ggtitle("Type")
f1a + f1b +f1c
Start_name <- paste0(as.character(Start), " cells")
Filter_1_name <- paste0(as.character(Filter_1), " cells")
Filter_2_name <- paste0(as.character(Filter_2), " cells")
Filter_3_name <- paste0(as.character(Filter_3), " cells")
df_paths2 <- data.frame(
group = c("Pass_All", "Filtered_at_3", "Filtered_at_2", "Filtered_at_1"),
CellRanger = c(Start_name, Start_name, Start_name, Start_name),
nFeature = c(Filter_1_name, Filter_1_name, Filter_1_name, paste0(as.character(Start-Filter_1), " cells removed")),
doubletFinder = c(Filter_2_name, Filter_2_name, paste0(as.character(Filter_1-Filter_2), " cells removed"), NA),
ManualDoublets = c(Filter_3_name, paste0(as.character(Filter_2-Filter_3), " cells removed"), NA, NA),
value = c(Start, Filter_2-Filter_3, Filter_1-Filter_2, Start-Filter_1)
)
df_paths2b <- data.frame(
group = c("Pass_All", "Filtered_at_3", "Filtered_at_2", "Filtered_at_1"),
CellRanger = c(Start_name, Start_name, Start_name, Start_name),
nFeature = c(Filter_1_name, Filter_1_name, Filter_1_name, paste0(as.character(Start - Filter_1), " cells removed")),
doubletFinder = c(Filter_2_name, Filter_2_name, paste0(as.character(Filter_1 - Filter_2), " cells removed"), NA),
ManualDoublets = c(Filter_3_name, paste0(as.character(Filter_2 - Filter_3), " cells removed"), NA, NA),
Final = c(Filter_3_name , NA, NA, NA),
value = c(Start, Filter_2 - Filter_3, Filter_1 - Filter_2, Start - Filter_1)
)
# Convert to long format suitable for ggsankey
df_sankey <- df_paths2b %>%
make_long(CellRanger, nFeature, doubletFinder, ManualDoublets, Final, value = "value")
# make removed nodes gray
orig_levels <- levels(factor(df_sankey$node))
orig_palette <- scales::hue_pal()(length(orig_levels))
names(orig_palette) <- orig_levels
# Override only "removed" nodes to gray
custom_palette <- orig_palette
custom_palette[grepl("removed", names(custom_palette))] <- "gray"
# change font size of removed nodes
df_sankey <- df_sankey %>%
filter(!is.na(node))%>%
mutate(is_removed = grepl("removed", node),
label_size = ifelse(is_removed, 3, 4))
# specify order
df_sankey$node <- factor(df_sankey$node, levels = rev(unique(df_sankey$node)))
df_sankey$next_node <- factor(df_sankey$next_node, levels = rev(unique(df_sankey$node)))
# Do the same for next_node
df_sankey$next_node <- factor(df_sankey$next_node, levels = levels(df_sankey$node))
# Plot
ggplot(df_sankey, aes(x = x,
next_x = next_x,
node = node,
next_node = next_node,
value = value,
fill = factor(node))) +
geom_sankey(flow.alpha = 0.7, node.color = "black") +
geom_sankey_label(aes(label = node, size = label_size), color = "white", show.legend = FALSE) +
scale_fill_manual(values = custom_palette) +
scale_color_identity() +
scale_size_identity() +
theme_classic() +
labs(
title = paste0("Filtering Progression Sankey Plot for ", sample_name),
x = ""
) +
theme(
# Remove Y-axis
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.line.y = element_blank(),
axis.title.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
# Remove X-axis lines
axis.line.x = element_blank(),
axis.ticks.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
# Enlarge text
axis.text.x = element_text(size = 14),
axis.title.x = element_text(size = 16),
plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
# Hide legend
legend.position = "none"
)
list1 <- unique(seurat.obj.f_cleaned$subclass)
list2 <- unique(markers_long$Subclass)
missing_subclass <- setdiff(list2, list1)
missing_subclass
The following subclasses were not annotated in this sample:
PEC, URO, LYMPH, VSMC
Ambient RNA levels are acceptable.
It matches the column Sample_ID
in the
spreadsheet with the sample_name
input in the Set
the Sample Name chunk.
meta <- read.xlsx(here("datasets", "metadata.xlsx"), sheet = 1)
meta$sample_ID <- as.character(meta$sample_ID)
seurat.obj.f_cleaned@meta.data$sample <- sample_name
df <- FetchData(seurat.obj.f_cleaned, "sample") %>% rownames_to_column(var = "CellID")
df <- left_join(df, meta, by = c("sample" = "sample_ID")) %>% column_to_rownames(var = "CellID")
seurat.obj.f_cleaned<- AddMetaData(seurat.obj.f_cleaned, df)
seurat.obj.f_cleaned
## An object of class Seurat
## 52663 features across 7266 samples within 2 assays
## Active assay: RNA (32285 features, 2000 variable features)
## 3 layers present: counts, data, scale.data
## 1 other assay present: SCT
## 2 dimensional reductions calculated: pca, umap
This file is processed and can be merged
with replicates in downstream analysis.
It will be saved in the Outputs folder.
This guide is written based on the experience of multiple trainee’s and investigators
These include:
Michael Hutchens
Susan Gurley
Trainee’s that contributed to the thinking that went into this guide
include:
Roman Thomas
Sienna Blanche
Jeff
Karnsomprot
Kevin Burfeind
Xiao-Tong Su
Annie Lackey
Jeremiah Reyes
We are incredibly grateful to our funding agencies for supporting this work including:
We are also incredibly grateful for our institutional support from University of Southern California and particular the Vito M. Campese, MD/UKRO Kidney Research Center of USC.
## R version 4.4.3 (2025-02-28 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MoMAColors_0.0.0.9000 DoubletFinder_2.0.6 ggsankey_0.0.99999
## [4] EnhancedVolcano_1.24.0 ggrepel_0.9.6 glmGamPoi_1.18.0
## [7] hdf5r_1.3.12 readxl_1.4.5 SoupX_1.6.2
## [10] openxlsx_4.2.8 sctransform_0.4.2 RColorBrewer_1.1-3
## [13] lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1
## [16] purrr_1.0.4 readr_2.1.5 tidyr_1.3.1
## [19] tibble_3.2.1 tidyverse_2.0.0 htmlwidgets_1.6.4
## [22] knitr_1.50 plotly_4.10.4 ggpubr_0.6.0
## [25] ggplot2_3.5.1 cowplot_1.1.3 patchwork_1.3.0
## [28] Seurat_5.2.0 SeuratObject_5.0.2 sp_2.2-0
## [31] dplyr_1.1.4 here_1.0.1
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 splines_4.4.3
## [3] later_1.4.1 fields_16.3.1
## [5] cellranger_1.1.0 polyclip_1.10-7
## [7] fastDummies_1.7.5 lifecycle_1.0.4
## [9] rstatix_0.7.2 rprojroot_2.0.4
## [11] globals_0.16.3 lattice_0.22-6
## [13] MASS_7.3-64 backports_1.5.0
## [15] magrittr_2.0.3 limma_3.62.2
## [17] sass_0.4.9 rmarkdown_2.29
## [19] jquerylib_0.1.4 yaml_2.3.10
## [21] httpuv_1.6.15 spam_2.11-1
## [23] zip_2.3.2 spatstat.sparse_3.1-0
## [25] reticulate_1.41.0.1 pbapply_1.7-2
## [27] maps_3.4.2.1 zlibbioc_1.52.0
## [29] abind_1.4-8 Rtsne_0.17
## [31] GenomicRanges_1.58.0 presto_1.0.0
## [33] BiocGenerics_0.52.0 ggstream_0.1.0
## [35] GenomeInfoDbData_1.2.13 IRanges_2.40.1
## [37] S4Vectors_0.44.0 irlba_2.3.5.1
## [39] listenv_0.9.1 spatstat.utils_3.1-3
## [41] goftest_1.2-3 RSpectra_0.16-2
## [43] spatstat.random_3.3-2 fitdistrplus_1.2-2
## [45] parallelly_1.42.0 DelayedMatrixStats_1.28.1
## [47] DelayedArray_0.32.0 codetools_0.2-20
## [49] tidyselect_1.2.1 UCSC.utils_1.2.0
## [51] farver_2.1.2 matrixStats_1.5.0
## [53] stats4_4.4.3 spatstat.explore_3.3-4
## [55] jsonlite_1.9.1 progressr_0.15.1
## [57] Formula_1.2-5 ggridges_0.5.6
## [59] survival_3.8-3 tools_4.4.3
## [61] ica_1.0-3 Rcpp_1.0.14
## [63] glue_1.8.0 SparseArray_1.6.2
## [65] gridExtra_2.3 xfun_0.51
## [67] MatrixGenerics_1.18.1 GenomeInfoDb_1.42.3
## [69] withr_3.0.2 fastmap_1.2.0
## [71] digest_0.6.37 timechange_0.3.0
## [73] R6_2.6.1 mime_0.12
## [75] colorspace_2.1-1 scattermore_1.2
## [77] tensor_1.5 spatstat.data_3.1-6
## [79] generics_0.1.3 data.table_1.17.0
## [81] S4Arrays_1.6.0 httr_1.4.7
## [83] uwot_0.2.3 pkgconfig_2.0.3
## [85] gtable_0.3.6 lmtest_0.9-40
## [87] XVector_0.46.0 htmltools_0.5.8.1
## [89] carData_3.0-5 dotCall64_1.2
## [91] Biobase_2.66.0 scales_1.3.0
## [93] png_0.1-8 spatstat.univar_3.1-2
## [95] rstudioapi_0.17.1 tzdb_0.5.0
## [97] reshape2_1.4.4 nlme_3.1-167
## [99] cachem_1.1.0 zoo_1.8-14
## [101] KernSmooth_2.23-26 vipor_0.4.7
## [103] parallel_4.4.3 miniUI_0.1.1.1
## [105] ggrastr_1.0.2 pillar_1.10.1
## [107] grid_4.4.3 vctrs_0.6.5
## [109] RANN_2.6.2 promises_1.3.2
## [111] car_3.1-3 xtable_1.8-4
## [113] cluster_2.1.8 beeswarm_0.4.0
## [115] evaluate_1.0.3 cli_3.6.4
## [117] compiler_4.4.3 crayon_1.5.3
## [119] rlang_1.1.5 future.apply_1.11.3
## [121] ggsignif_0.6.4 labeling_0.4.3
## [123] ggbeeswarm_0.7.2 plyr_1.8.9
## [125] stringi_1.8.4 viridisLite_0.4.2
## [127] deldir_2.0-4 munsell_0.5.1
## [129] lazyeval_0.2.2 spatstat.geom_3.3-5
## [131] Matrix_1.7-2 RcppHNSW_0.6.0
## [133] hms_1.1.3 sparseMatrixStats_1.18.0
## [135] bit64_4.6.0-1 future_1.34.0
## [137] statmod_1.5.0 shiny_1.10.0
## [139] SummarizedExperiment_1.36.0 ROCR_1.0-11
## [141] igraph_2.1.4 broom_1.0.7
## [143] bslib_0.9.0 bit_4.6.0