Introduction

Single-cell RNA sequencing (scRNAseq) has become an increasingly important tool for profiling the transcriptional landscape of the kidney, enabling the discovery of cell-type-specific responses in both health and disease. However, analyzing kidney scRNAseq datasets presents unique challenges due to the kidney’s cellular complexity, technical variability in capturing cells, and the disproportionate representation of abundant cell types; all of which can bias downstream analyses. Moreover, no clear standards have been established for how to analyze kidney scRNAseq data, undermining reproducibility and cross-study comparisons between research groups. Here, we present a practical framework for the analysis, annotation, and deposition of kidney scRNAseq datasets. Our approach outlines recommended steps for quality control, ambient RNA correction, detection and removal of artificial doublets, and annotation of cell clusters. We propose standardized naming conventions for kidney cell types to harmonize nomenclature and promote clarity and reproducibility. In addition, we provide practical guidelines for preparing and depositing processed data in public repositories such as the Gene Expression Omnibus (GEO), ensuring accessibility and compliance with FAIR (Findable, Accessible, Interoperable, Reusable) data principles. To support adoption of this framework, we offer template code and a video tutorial for analyzing a kidney scRNAseq dataset, starting with a gene-by-cell expression matrix and ending with an annotated atlas. Each step is explained in detail to help researchers follow the workflow, understand its underlying rationale, and confidently apply it to their own kidney single-cell datasets. Together, this framework aims to foster transparency, reproducibility, and collaborative progress in kidney research.

The Goal of This Tutorial

The goal of this tutorial is to take a scRNAseq dataset from the Cell Ranger outputs stage (cell by gene matrix) to an accurately annotated seurat object (.rds file).

Limitations of This Guide

This tutorial was written to guide the analysis of a healthy whole kidney dataset. While many of the same principals apply to analyzing a diseased whole kidney dataset, more caution needs to be taken so that diseased cells do not get inadvertently filtered out or incorrectly annotated.

We hope to write future guides on how to approach analyzing diseased kidney datasets.

Also, the landscape of single-cell technologies is dynamic and improving every day. It’s challenging to set hard thresholds and levels for certain filtering parameters because of this. We hope that the principles underlying this analysis framework stay true, even if the technology and packages used to analyze kidney scRNAseq data evolves.

Manuscript

A manuscript describing this analysis framework can be found here(broken link).

Video Tutorial

A video tutorial walking through this analysis framework can be found here(broken link).

How to Use This R Markdown File

This R markdown file is meant to guide a researcher through the steps of data analysis.

Once you have analyzed and annotated your dataset and set all the parameters you need to fill out, you can knit the file, and it will create an html report of your analysis for long-term documentation.

System Requirements

We have made every effort to reduce the memory burden in this tutorial. However, certain functions are more memory intensive than others (such as SCTransform).

Therefore, we recommend performing this tutorial on a computer with at least 16GB of RAM.

As you analyze more cells (or larger objects), you may need increased amounts of memory (RAM) to accommodate computationally heavy functions.

Resources for Learning R and R Markdown

While beyond the scope of this tutorial, we recommend the following resources for learning R and R Markdown:

Create Folders and Download Files

Create Folder

We suggest creating a folder titled “Nephronomics” on your desktop or in a location of your choice. This folder will contain all the files and folders for the analysis. Even though we suggest a title, you can name the file any way you like.

Download scRNAseq_guide.Rmd

Download the “scRNAseq_guide” RMD file to follow along, or apply to your own sample here. Then, place scRNAseq_guide.Rmd in the “Nephronomics” folder.

This is what your folder should look like:

Create Folder Structure for the Analysis

Ordered file structure is essential to keep track of the files and folders you are working with. This is especially important when working with large datasets, as it can be easy to lose track of files if they are not organized properly.

We will create two folders: Datasets and Outputs. The Datasets folder will contain the raw and filtered matrices, as well as the metadata file. The Outputs folder will contain the results of the analysis, such as the annotated .rds after analysis.

We utilize the here() package in order to create a relative path to the folders. This allows the code to be able to run on any computer without needing to change the file paths (e.g. working directories).

Read more about the here() package here

if (!require("here")) {install.packages("here"); require("here")}

Datasets <- "Datasets"
Outputs <- "Outputs"

if (!dir.exists(here(Datasets))) {dir.create(here(Datasets))}
if (!dir.exists(here(Outputs))) {dir.create(here(Outputs))}

#tree "C:\Users\jonat\OneDrive\Desktop\Nephronomics" /F

Download Tutorial Datasets

Download the tutorial dataset to walk through in this guide:
     Download the raw matrix
     Download the filtered matrix
     Download the metadata file

After running the code, you should see these two folders in the file you created.

Place the raw matrix, filtered matrix, and metadata files in the Datasets folder.

Your folder structure should look like this:

NEPHRONOMICS
    │ scRNAseq_guide.Rmd
    │
    │ - - - Datasets
    │     Kidney_7_filtered_feature_bc_matrix.h5
    │     Kidney_7_raw_feature_bc_matrix.h5
    │     metadata.xlsx
    │
    │ - - - Outputs

Install and Load Packages

The packages below may need to be installed for the first time you run this code.

The following are packages are not on CRAN or Bioconductor and can be installed using the remotes::install_github() function or devtools::install_github() function.

Simply remove the hashtags to run the code and install packages for the first time. After the packages are installed, you can comment out the code again by adding the hashtags at the beginning of each line.

# install.packages("remotes")
# remotes::install_github('chris-mcginnis-ucsf/DoubletFinder', force = TRUE)

# install.packages("devtools")
# devtools::install_github("davidsjoberg/ggsankey")

The following packages are on CRAN or Bioconductor and can be installed using the install.packages() function or BiocManager::install() function.

if (!require("dplyr")) {install.packages("dplyr"); require("dplyr")}
if (!require("Seurat")) {install.packages("Seurat"); require("Seurat")}
if (!require("patchwork")) {install.packages("patchwork"); require("patchwork")}
if (!require("cowplot")) {install.packages("cowplot"); require("cowplot")}
if (!require("ggpubr")) {install.packages("ggpubr"); require("ggpubr")}
if (!require("plotly")) {install.packages("plotly"); require("plotly")}
if (!require("knitr")) {install.packages("knitr"); require("knitr")}
if (!require("htmlwidgets")) {install.packages("htmlwidgets"); require("htmlwidgets")}
if (!require("tidyverse")) {install.packages("tidyverse"); require("tidyverse")} # for titying up data
if (!require("RColorBrewer")) {install.packages("RColorBrewer"); require("RColorBrewer")} # for color brewer
if (!require("sctransform")) {install.packages("sctransform"); require("sctransform")} # for data normalization
if (!require("openxlsx")) {install.packages("openxlsx"); require("openxlsx")} # to save .xlsx files
if (!require("SoupX")) {install.packages("SoupX"); require("SoupX")}
if (!require("readxl")) {install.packages("readxl"); require("readxl")}
if (!require("hdf5r")) {install.packages("hdf5r"); require("hdf5r")}

if (!require("glmGamPoi")) {BiocManager::install('glmGamPoi'); require("glmGamPoi")} # for data normalization, sctransform
if (!require("EnhancedVolcano")) {BiocManager::install('EnhancedVolcano'); require("EnhancedVolcano")} # volcano plot

library(ggsankey)
library(DoubletFinder)
library(MoMAColors)

set.seed((12345))
here()

#renv::init()
#renv::install("Seurat@5.2.0")
#renv::install("hdf5r")
#renv::install("chris-mcginnis-ucsf/DoubletFinder")
#renv::snapshot()

Note: Due to bugs in DoubletFinder, it’s critical to use `doubletFinder: version 2.0.6`and `Seurat: version 5.x` for this code.

Load Dataset

Set the Sample Name

This will be used to automate the addition of metadata to the Seurat object and to save output files (DEG lists and Seurat Object).

sample_name <- "Kidney_7" # Replace with your sample name

# Create a Soup Channel Object(sc) to use for future filtering of ambient RNA 
RawDataset = Read10X_h5(here("Datasets", "Kidney_7_raw_feature_bc_matrix.h5"))
FilteredDataset = Read10X_h5(here("Datasets", "Kidney_7_filtered_feature_bc_matrix.h5"))

# Make a Seurat Object from the filtered control data 
filtered_kidney <- Read10X_h5(here("Datasets", "Kidney_7_filtered_feature_bc_matrix.h5"))
filtered_kidney <- CreateSeuratObject(counts = filtered_kidney, project = "Filteredkidney_snRNA")

This is what your enviroment should look like after loading samples

If it doesn’t, check the location of your files and if you activated the here package

SoupX

SoupX is an R package designed to identify and remove ambient RNA contamination from droplet-based single-cell RNA sequencing (scRNAseq) data. Ambient RNA, freely floating transcripts that are captured non-specifically during droplet formation, can distort downstream analyses by falsely inflating gene expression levels, particularly in low-UMI cells or empty droplets. SoupX models this “soup” contamination by first estimating the background expression profile from low-UMI droplets that are likely empty. It then uses this ambient RNA profile to adjust gene expression values in cell-containing droplets, accounting for contamination in a probabilistic and gene-specific manner.

When you run autoEstCont(sc) in SoupX, it automatically estimates the contamination fraction (rho) by comparing the expression of known marker genes in presumed “pure” cell types. The graph output typically includes a plot of the model fit, showing observed versus expected expression for these genes across a range of contamination fractions. The point where the model best matches the observed data is used to estimate the contamination rate. A good fit suggests that the estimated contamination fraction accurately captures the ambient RNA contribution, and the plot can help assess whether the correction is likely to be reliable.

soupx Manuscript
soupx Github
soupx Tutorial

sc = SoupChannel(RawDataset,FilteredDataset)

# Cluster the Cells with Seurat 
filtered_kidney <- SCTransform(filtered_kidney, verbose = F)
filtered_kidney <- RunPCA(filtered_kidney, verbose = F)
filtered_kidney <- RunUMAP(filtered_kidney, dims = 1:40, verbose = F)
filtered_kidney <- FindNeighbors(filtered_kidney, dims = 1:40, verbose = F)
filtered_kidney <- FindClusters(filtered_kidney, verbose = F)

meta <- filtered_kidney@meta.data
umap <- filtered_kidney@reductions$umap@cell.embeddings
clusters <- setNames(meta$seurat_clusters, rownames(meta))

#Ensure the two following numbers are equal before continuing 
check1 <- length(clusters)
check2 <- nrow(sc$metaData)

sc <- setClusters(sc, clusters)
sc <- setDR(sc, umap)

Estimation of Kidney_7 Ambient RNA levels

sc = autoEstCont(sc)

x <- sc$fit$rhoEst * 100
y <- sc$fit$rhoFWHM * 100

This post rho value is 3.5% with a FWHM of 1.5% - 10% . This means that your estimated amount of ambient RNA is 3.5%.

WARNING: Ambient RNA

Ambient RNA levels are acceptable.

Examples of Low or High Ambient RNA in scRNAseq datasets

#Clean the data
filtered_kidney_out = adjustCounts(sc)
 
#Create a new Seurat Object out of the cleaned data
seurat.obj <- CreateSeuratObject(filtered_kidney_out)

# Remove unused objects in environment to save RAM

#rm(filtered_kidney, filtered_kidney_out, FilteredDataset, RawDataset, sc, meta, check1, check2, clusters, umap, y, Datasets, Outputs)

DoubletFinder

DoubletFinder is an R package used to detect and remove doublets; droplets that have captured two or more cells instead of one, from single-cell RNAseq datasets. Doublets can introduce misleading hybrid transcriptional profiles, impacting clustering, differential expression, and trajectory inference. DoubletFinder works by generating artificial doublets from real cell profiles and then identifying real cells that resemble these artificial profiles. It embeds both real and artificial cells in a reduced-dimensional space (usually PCA), computes each cell’s neighborhood density of artificial doublets (pANN score), and uses that to rank the likelihood that each real cell is a doublet.

After running DoubletFinder, you’ll typically interpret results by examining the metadata field DF.class, which labels each cell as “Doublet” or “Singlet.” You can visualize the results using dimensionality reduction plots (e.g., UMAP), coloring cells by their classification to check whether doublets cluster separately or are enriched in certain regions. The pANN score itself is a continuous value indicating how doublet-like a cell is, which is useful for threshold tuning. The expected doublet rate, often informed by cell loading density and 10X Genomics guidelines, plays a critical role in setting the classification cutoff.

Important Note

You set the expected rate of doublets when using DoubletFinder. This is often set as the rate of doublets published by 10x genomics based on their microfluidic device. But this rate does NOT account for an resulting from poor sample preparation.

This is one of the reasons why this tutorial suggests a 2-step doublet filtering. First with DoubletFinder (which does an effective job at identifying homotypic doublets), and then manual doublet curation based on cluster expression characteristic.

DoubletFinder Manuscript
DoubletFinder Github

Filtering of Low Complexity Cells

This step serves to filter out droplets that may have inadvertently been determined as cells by CellRanger (or other demultiplexing applications).

Not filtering out the empty droplets will interfere with doublet detection downstream.

Start <- ncol(seurat.obj)

p1 <- VlnPlot(seurat.obj, features = c("nFeature_RNA")) + theme(axis.title.x = element_blank()) + theme(legend.position = 'none') + theme(axis.text.x = element_text(angle = 0, hjust = .5))

p2 <- VlnPlot(seurat.obj, features = c("nFeature_RNA"), pt.size = 0) + theme(axis.title.x = element_blank()) + theme(legend.position = 'none') + theme(axis.text.x = element_text(angle = 0, hjust = .5))

p3 <- VlnPlot(seurat.obj, features = c("nCount_RNA")) + theme(axis.title.x = element_blank())  + theme(legend.position = 'none') + theme(axis.text.x = element_text(angle = 0, hjust = .5))

p4 <- VlnPlot(seurat.obj, features = c("nCount_RNA"), pt.size = 0) + theme(axis.title.x = element_blank())  + theme(legend.position = 'none') + theme(axis.text.x = element_text(angle = 0, hjust = .5))

cowplot::plot_grid(p1, p2, p3, p4, ncol = 4, scale = .85) + draw_label("Before Filtering", y = .95, size = 22, fontface = "bold")

# Minimal filtering of low quality cells 
seurat.obj.f <- subset(seurat.obj, nFeature_RNA > 500)

p1 <- VlnPlot(seurat.obj.f, features = c("nFeature_RNA")) + theme(axis.title.x = element_blank()) + theme(legend.position = 'none') + theme(axis.text.x = element_text(angle = 0, hjust = .5))

p2 <- VlnPlot(seurat.obj.f, features = c("nFeature_RNA"), pt.size = 0) + theme(axis.title.x = element_blank()) + theme(legend.position = 'none') + theme(axis.text.x = element_text(angle = 0, hjust = .5))

p3 <- VlnPlot(seurat.obj.f, features = c("nCount_RNA")) + theme(axis.title.x = element_blank())  + theme(legend.position = 'none') + theme(axis.text.x = element_text(angle = 0, hjust = .5))

p4 <- VlnPlot(seurat.obj.f, features = c("nCount_RNA"), pt.size = 0) + theme(axis.title.x = element_blank())  + theme(legend.position = 'none') + theme(axis.text.x = element_text(angle = 0, hjust = .5))

cowplot::plot_grid(p1, p2, p3, p4, ncol = 4, scale = .85) + draw_label("After Filtering Low Complexity Cells", y = .95, size = 22, fontface = "bold")

Filter_1 <- ncol(seurat.obj.f)

amountfiltered1 <- Start-Filter_1

#Making a table to display the amount of cells being filtered out with the minimal filtering of low quality cells applied to the snRNAseq dataset. 
cell_counts <- data.frame(Phase = c("Before Filtering", "After Filtering", "Amount Filtered"), Cell_Number = c(Start, Filter_1, amountfiltered1))

print(cell_counts)

# Remove unused objects in environment to save RAM

#rm(seurat.obj, p1, p2, p3, p4, cell_counts)

72 cells were removed after filtering low-complexity cells.

High-Dimensional Superclustering

Set the Resolution

resolution_value <- 2.0 # Set the resolution value for clustering

# Pre-process standard workflow
seurat.obj.f <- NormalizeData(object = seurat.obj.f)
seurat.obj.f <- FindVariableFeatures(object = seurat.obj.f)
seurat.obj.f <- ScaleData(object = seurat.obj.f)
seurat.obj.f <- RunPCA(object = seurat.obj.f)

# ElbowPlot(seurat.obj.f, ndims = 40)

seurat.obj.f <- FindNeighbors(object = seurat.obj.f, dims = 1:40)
seurat.obj.f <- FindClusters(object = seurat.obj.f, resolution = resolution_value, verbose = F)

# Make a UMAP to visualize the clusters 
seurat.obj.f <- RunUMAP(object = seurat.obj.f, dims = 1:40)
DimPlot(seurat.obj.f, reduction = "umap", label = TRUE) +
   theme(plot.title = element_text(hjust = 0.5)) +
   ggtitle(paste0("Initial Clustering for ", sample_name))

Save Excel Spreadsheet

This code creates an Excel Spreadsheet of DEG List by Cluster

This file can be used to determine the prominent features (highly enriched genes) of each cluster.

The file will be located in the Outputs folder.

all_markers <- FindAllMarkers(seurat.obj.f, only.pos = TRUE, min.pct = 0.5, logfc.threshold = 1)

# Order by avg_log2FC
all_markers <- all_markers %>%
  arrange(desc(avg_log2FC)) %>%
  select(gene, everything())

# Split by cluster (ident column)
marker_list <- split(all_markers, all_markers$cluster)

# Sort names alphanumerically
sorted_cluster_names <- names(marker_list)[order(as.numeric(names(marker_list)))]

# Create a workbook
wb <- createWorkbook()

# Add each cluster as a new worksheet
for (cluster_name in sorted_cluster_names) {
  addWorksheet(wb, sheetName = cluster_name)
  writeData(wb, sheet = cluster_name, marker_list[[cluster_name]])
}

date <- format(Sys.Date(), "%Y%m%d")

# Save workbook
saveWorkbook(wb, here("Outputs", paste0(date, "_", sample_name, "_FindAllMarkers_By_Cluster.xlsx")), overwrite = TRUE)

cell_type_markers_df <- data.frame(
  Subclass = c(
    "PTS1", "PTS2", "PTS3", "DTL", "TALA", "TALB", "MD", "DCT1", "DCT2", "CNT",
    "PC", "ICA", "ICB", "PODO", "PEC", "URO", "EC", "LYMPH", "FIB", "MES", "VSMC", "LYMPHO", "MACRO"
  ),
  Positive_Marker_1 = c(
    "Lrp2", "Lrp2", "Lrp2", "Epha7", "Slc12a1", "Slc12a1", "Nos1", "Slc12a3", "Slc12a3", "Slc8a1",
    "Aqp2", "Slc4a1", "Slc26a4", "Nphs1", "Ncam1", "Upk1b", "Flt1", "Lyve1", "Pdgfra", "Pdgfrb", "Pdgfrb", "Ptprc", "Ptprc"
  ),
  Positive_Marker_2 = c(
    "Slc5a12", "Slc13a3", "Slc16a9", "Cryab", "Cldn16", "Cldn10", "Slc12a1", "Egf", "Slc8a1", NA,
    NA, "Kit", NA, NA, NA, NA, "Emcn", "Kdr", "Pdgfrb", "Piezo2", "Acta2", "Skap1", "Cd74"
  ),
  stringsAsFactors = FALSE
)

# Creating a dot plot to cross reference the features with the cell clusters present, seeing the average expression and percent expressed 

markers.to.plot1 <- c("Lrp2",         # PTS1, PTS2, PTS3
                      "Slc5a12",      # PTS1
                      "Slc13a3",      # PTS2
                      "Slc16a9",      # PTS3
                      "Epha7",        # DTL
                      "Cryab",        # DTL
                      "Slc12a1",      # TALA, TALB, MD
                      "Cldn16",       # TALA
                      "Cldn10",       # TALB
                      "Nos1",         # MD
                      "Slc12a3",      # DCT1, DCT2
                      "Egf",          # DCT1
                      "Slc8a1",       # CNT, DCT2
                      "Aqp2",         # PC
                      "Slc4a1",       # ICA
                      "Kit",          # ICA
                      "Slc26a4",      # ICB
                      "Nphs1",        # PODO
                      "Ncam1",        # PEC
                      "Upk1b",        # URO
                      "Flt1",         # EC
                      "Emcn",         # EC
                      "Lyve1",        # LYMPH
                      "Kdr",          # LYMPH
                      "Pdgfra",       # FIB
                      "Pdgfrb",       # FIB, MES, VSMC
                      "Piezo2",       # MES
                      "Acta2",        # VSMC
                      "Ptprc",        # LYMPHO, MACRO
                      "Skap1",        # LYMPHO
                      "Cd74"          # MACRO
)

p1 <- DotPlot(seurat.obj.f,
features = markers.to.plot1,
dot.scale = 8,
dot.min = 0,
scale.max = 100,
scale.min = 0,
col.min = -2.5,
col.max = 2.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip() +
  labs(x = "Gene Features", y = "Clusters")

p1

Cell Types of Concern

p2 <- as.data.frame(p1$data)

low_pct_features <- p2 %>%
  group_by(features.plot) %>%
  filter(all(pct.exp <= 75)) %>%
  distinct(features.plot) %>%
  pull(features.plot)

# low_pct_features 

low_pct_features2 <- as.data.frame(low_pct_features)

markers_long <- cell_type_markers_df %>%
  pivot_longer(cols = c(Positive_Marker_1, Positive_Marker_2),
               names_to = "marker_type", values_to = "marker") %>%
  filter(!is.na(marker))  # remove any NAs in marker

# Step 2: Join with low_pct_features2
low_pct_features2_annotated <- low_pct_features2 %>%
  left_join(markers_long, by = c("low_pct_features" = "marker")) %>%
  rename(cell_type = Subclass)  # rename Subclass to cell_type

missing_cell_types <- unique(na.omit(low_pct_features2_annotated$cell_type))

missing_cell_types

WARNING: Missing Cell Types

The following features were found to have less than 75% of the cells expressing them in the any clusters: Epha7, Cryab, Cldn16, Slc4a1, Ncam1, Lyve1, Pdgfrb, Piezo2, Acta2.

This may make it challenging to identify the following subclasses in this dataset: DTL, TALA, ICA, PEC, LYMPH, FIB, MES, VSMC.

Options 1: Run the chunk below and see if there are any clusters that are clearly enriched for the cell-type defining markers.

for (i in low_pct_features) {
  
  # Find the subclass (if any) where the marker matches Positive_Marker_1 or _2
  subclass_match <- cell_type_markers_df %>%
    filter(Positive_Marker_1 == i | Positive_Marker_2 == i) %>%
    pull(Subclass)

  # If found, use the first match; else NA
  subclass_label <- if (length(subclass_match) > 0) subclass_match[1] else NA
  
  # Build the title
  title_text <- if (!is.na(subclass_label)) {
    paste0(i, " (", subclass_label, ")")
  } else {
    i
  }
  
  # Create plot
  p <- FeaturePlot(seurat.obj.f, features = i) &
       scale_colour_gradientn(colours = rev(brewer.pal(n = 9, name = "RdYlBu"))) &
       ggtitle(title_text)
  
  print(p)  # Or save/store as needed
}

Option 2: Go to the High-Dimensional Superclustering Chunk and increase the resolution (e.g. change from “2” to “3”), and see if you are able to see clear clusters for missing cell types.

Continue With DoubletFinder

# PCs set to 1:20 since it was between 15-20 on the earlier elbow plot 
sweep.res.list_seurat.obj.f <- paramSweep(seurat.obj.f, PCs = 1:20, sct = FALSE)

#Summarize each combination of pN and pK
sweep.stats_seurat.obj.f <- summarizeSweep(sweep.res.list_seurat.obj.f, GT = FALSE) 

#Select the pK that corresponds to max bcmvn to optimize doublet detection
bcmvn_seurat.obj.f <- find.pK(sweep.stats_seurat.obj.f)

pK <- bcmvn_seurat.obj.f %>% 
  filter(BCmetric == max(BCmetric)) %>%
  select(pK) 

#See pK in the Values Environment
pK <- as.numeric(as.character(pK[[1]]))

# Homotypic Doublet Proportion Estimate -------------------------------------------------------------------------------------
annotations <- seurat.obj.f@meta.data$seurat_clusters  
 
homotypic.prop <- modelHomotypic(annotations)           

# 10X Multiplet Rate Table (the doublet ratio is # of cells recovered divided by 125000) https://kb.10xgenomics.com/hc/en-us/articles/360001378811-What-is-the-maximum-number-of-cells-that-can-be-profiled-

nExp_poi <- round(nrow(seurat.obj.f@meta.data) # To calculate cell number
                  /125000              # To calculate the doublet ratio
                  *nrow(seurat.obj.f@meta.data))

nExp_poi_adj <- round(nExp_poi*(1-homotypic.prop))

seurat.obj.f_doublets <- doubletFinder(seurat.obj.f,
                        PCs = 1:20,
                        pN = 0.25,
                        pK = pK,
                        nExp = nExp_poi_adj,
                        reuse.pANN = NULL, sct = FALSE)

# Change names of columns for future analysis 
colnames(seurat.obj.f_doublets@meta.data)[grep("^pANN", colnames(seurat.obj.f_doublets@meta.data))] <- "pANN"
colnames(seurat.obj.f_doublets@meta.data)[grep("^DF.class", colnames(seurat.obj.f_doublets@meta.data))] <- "DF.class"

seurat.obj.f_doublets@meta.data$DF.class <- factor(
  seurat.obj.f_doublets@meta.data$DF.class,
  levels = c("Singlet", "Doublet")
)
table(seurat.obj.f_doublets@meta.data$DF.class)

# Make a value for the doublet count to include in the analysis
doublet_count <- table(seurat.obj.f_doublets@meta.data$DF.class)["Doublet"]

p1 <- VlnPlot(seurat.obj.f_doublets, features = c("nFeature_RNA"), group.by = "DF.class") +
  theme(axis.title.x = element_blank(),
        legend.position = 'none',
        axis.text.x = element_text(angle = 0, hjust = .5))

p2 <- VlnPlot(seurat.obj.f_doublets, features = c("nCount_RNA"), group.by = "DF.class") +
  theme(axis.title.x = element_blank(),
        legend.position = 'none',
        axis.text.x = element_text(angle = 0, hjust = .5))

p3 <- DimPlot(seurat.obj.f_doublets, group.by = "DF.class")

grid <- plot_grid(p1, p2, p3, ncol = 3, align = "h", axis = "tb", rel_widths = c(1, 1, 2))

plot_grid(
  ggdraw() + draw_label("Violin and Feature Plot by DF.class", fontface = 'bold', size = 22, hjust = 0.5),
  grid,
  ncol = 1,
  rel_heights = c(0.1, 1)
)

Visualization of Singlet Object

seurat.obj.f_singlets <- subset(seurat.obj.f_doublets, DF.class == "Singlet")

DimPlot(seurat.obj.f_singlets, reduction = "umap", label = T)+
   theme(plot.title = element_text(hjust = 0.5)) +
   ggtitle(paste0("UMAP for ", sample_name, " after DoubletFinder"))

Number of Cells Filtered

Filter_2  <- ncol(seurat.obj.f_singlets)

amountfiltered2 <- Filter_1-Filter_2

#Making a table to display the amount of cells being filtered out with the minimal filtering of low quality cells applied to the snRNAseq dataset. 
cell_counts <- data.frame(Phase = c("Before Filtering", "After Filtering", "Amount Filtered"), Cell_Number = c(Filter_1, Filter_2, amountfiltered2))

print(cell_counts)

# Remove unused objects in environment to save RAM
# rm(seurat.obj.f, seurat.obj.f_doublets, sweep.res.list_seurat.obj.f, sweep.stats_seurat.obj.f, p, p1, p2)

592 cells were removed after DoubletFinder.

Manual Doublet Filtering

A multidimensional dot plot of known cell type markers is a powerful manual curation tool used to assess the biological validity of clusters in single-cell or single-nucleus RNAseq data. By plotting the expression of canonical marker genes across all clusters; using DotPlot() in Seurat, researchers can visualize both the average expression (color) and the proportion of expressing cells (dot size) for each marker. This approach allows for rapid identification of clusters that express expected marker combinations for defined cell types (e.g., Nphs2 for podocytes, Lrp2 for proximal tubule), as well as clusters with conflicting or non-specific expression patterns.

Clusters that show co-expression of markers from multiple, unrelated lineages (e.g., proximal tubule and endothelial) or lack strong expression of any canonical markers are often flagged as poor-quality or artifactual. These clusters may represent doublets or low-quality cells. Manual removal of such clusters, guided by marker expression patterns rather than purely unsupervised algorithms, can significantly improve dataset interpretability and ensure more biologically meaningful downstream analyses. However, this process requires domain expertise and should be applied with care.

Example Doublet Clusters

Cluster 6: Proximal Tubule (Lrp2) AND Fibroblast (Pdgfra)

Cluster 13: Proximal Tubule (Lrp2) AND Endothelial (Flt1)

Cluster 14: Proximal Tubule (Lrp2) AND Thick Ascending Limb (Slc12a1)

Cluster 16: Weak Proximal Tubule (Lrp2)

Cluster 22: Weak Proximal Tubule (Lrp2)

Cluster 32: Proximal Tubule (Lrp2) AND Endothelial (Flt1)

The code below will help to identify potential doublet clusters

But you need to examine every cluster with your own eyes

Add the clusters that you determine are doublets to the `doublet_clusters <- c()` code below to categorize clusters as DOUBLET.

# Creating a dot plot to cross reference the features with the cell clusters present, seeing the average expression and percent expressed 

f1 <- DotPlot(seurat.obj.f_singlets,
features = markers.to.plot1,
dot.scale = 8,
dot.min = 0,
scale.max = 100,
scale.min = 0,
col.min = -2.5,
col.max = 2.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip() +
  labs(x = "Gene Features", y = "Clusters")

f1

Clusters of Concern

f2 <- as.data.frame(f1$data)

maybe_doublets <- f2 %>%
  group_by(id) %>%
  filter(all(avg.exp.scaled <= 2.0)) %>%
  distinct(id) %>%
  pull(id)

maybe_doublets

# Create color palette: grey for all, color for maybe_doublets
all_clusters <- levels(seurat.obj.f_singlets)
cluster_colors <- setNames(rep("grey80", length(all_clusters)), all_clusters)
cluster_colors[maybe_doublets] <- "red"

# Get UMAP coordinates and cluster assignments
umap_data <- Embeddings(seurat.obj.f_singlets, "umap") %>%
  as.data.frame() %>%
  mutate(cluster = Idents(seurat.obj.f_singlets))

# Compute cluster centroids
cluster_centroids <- umap_data %>%
  group_by(cluster) %>%
  summarise(UMAP_1 = mean(umap_1), UMAP_2 = mean(umap_2))

# Keep only centroids for maybe_doublets
label_data <- cluster_centroids %>%
  filter(cluster %in% maybe_doublets)

# Plot
DimPlot(seurat.obj.f_singlets, reduction = "umap", cols = cluster_colors) +
  ggrepel::geom_text_repel(
    data = label_data,
    aes(x = UMAP_1, y = UMAP_2, label = cluster),
    size = 4,
    color = "black",
    fontface = "bold",
    max.overlaps = Inf,
    box.padding = 0.5,
    point.padding = 0.5
  )

WARNING! Clusters of Concern

Some clusters present are doublets, and one can identify such, as they express various gene markers with a low average expression.

It is vital to refer back to the multidimensional dot plot to ensure that the identity of the clusters as maybe doublets aligns with the data present in the plot. Here, it appears that clusters 6, 13, 14, 16, 22, 32 are still potential doublets in the dataset after previous removal.

Create List Of Clusters Identified As Doublets

# Add doublet cluster numbers like the code below to classify clusters as doublets
# doublet_clusters <- c(6, 13, 14, 16, 22, 32)

doublet_clusters <- c(6, 13, 14, 16, 22, 32)

Annotate Doublets

# Annotate metadata
seurat.obj.f_singlets@meta.data <- seurat.obj.f_singlets@meta.data %>%
  mutate(subclass.doublet = if_else(
    seurat_clusters %in% doublet_clusters, "DOUBLET", "SINGLET"
  ))

# Change the identities to align with new subclass names and make umap
Idents(seurat.obj.f_singlets) <- seurat.obj.f_singlets@meta.data$subclass.doublet

f1 <- DimPlot(seurat.obj.f_singlets, reduction = "umap") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Before Manual Doublet Filtering")

# Double check the table to ensure SINGLET & DOUBLET cluster appear with this function
table(Idents(seurat.obj.f_singlets))

Remove Doublets

# Clean out the doublets manually that were identified 
seurat.obj.f_cleaned <- subset(seurat.obj.f_singlets, idents = "DOUBLET", invert = TRUE)

# Double check that the identity was removed 
table(Idents(seurat.obj.f_cleaned))

# Make umap to double check removal 
f2 <- DimPlot(seurat.obj.f_cleaned, reduction = "umap") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("After Manual Doublet Filtering")

f1+f2

# Remove unused objects in environment to save RAM
# rm(seurat.obj.f_singlets)

Kidney Cell Annotation

Cell Types of the Kidney

While substantial progress has been made in identifying the diverse cell types of the kidney, our understanding remains incomplete, particularly with respect to rare cell types. The kidney comprises a complex array of epithelial, endothelial, stromal, and immune cells, each playing specialized roles in maintaining homeostasis. However, not all predicted or anatomically described cell types are readily distinguishable in transcriptomic datasets due to limitations in resolution, depth, and biological variability.

Relevant Manuscripts

How Many Cell Types Are in the Kidney and What Do They Do?
Kidney Precision Medicine Project Annotation of Kidney Cell Types

List of kidney cell types

The list of cell types presented here reflects those we have confidently and reproducibly identified through single-cell RNA sequencing (scRNAseq) analysis of kidney samples containing approximately 5,000 to 20,000 high-quality cells. These cell types represent transcriptionally distinct populations that are robustly detected across independent datasets and are supported by established gene expression signatures, forming a conservative and reliable reference for downstream analyses.

Apply the following marker list to name the clusters based on Kidney Cell Types

Re-Cluster Singlets

seurat.obj.f_cleaned <- SCTransform(seurat.obj.f_cleaned) %>%
    RunPCA() %>%
    FindNeighbors(dims = 1:30) %>%
    FindClusters(resolution = 2, verbose = FALSE) %>%
    RunUMAP(dims = 1:30)

DimPlot(seurat.obj.f_cleaned, reduction = "umap", label = TRUE) +  
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle(paste0("Non-annotated Clusters for ", sample_name))

DefaultAssay(seurat.obj.f_cleaned) <- "RNA"

# Make mdd for all removed doublets 

DotPlot(seurat.obj.f_cleaned,
features = markers.to.plot1,
dot.scale = 8,
dot.min = 0,
scale.max = 100,
scale.min = 0,
col.min = -2.5,
col.max = 2.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip() +
  labs(x = "Gene Features", y = "Clusters")

Annotate by Subclass

Use the list of kidney markers in the table above to determine the identity of each of the clusters. Then in the code below, enter the name of the cluster (as shown in the table above).

For example, Cluster 0 expresses Flt1 and Emcn, which are Endothelial Markers. Therefore add “EC” into the line of code here: seurat_clusters == 0 ~ "EC" to annotate this cluster as “EC”.

Cluster 4 expresses Slc8a1, which is a marker for both DCT2 and CNT; however, the cluster is negative for Slc12a3 making this cluster CNT . Therefore add “CNT” into the line of code here: seurat_clusters == 4 ~ "CNT" to annotate this cluster as “CNT”.

Cluster 23 expresses Pdgfrb, and Piezo2, which are Mesangial Cell Markers. Therefore add “MES” into the line of code here: seurat_clusters == 23 ~ "MES" to annotate this cluster as “MES”. However, since this cluster also expresses a bit of Acta2 it’s likely there are some VSMCs in the cluster as well.

Then repeat for the rest of the clusters.

By using the same cell type nomeclature in the table, we will be able to automatically annotate the clusters at the class and type level and provide proportion plots.

If you notice that there are additional clusters that have characteristics consistent with doublets, then call them DOUBLETS and they will be filtered out at the end of this chunk. However, we do not recluster after this step, as we’ve found that the removal of this small number of cells doesn’t alter the clustering significantly.

This Step Takes the Most Hands-On Time

# Apply names to the clusters by subclass
seurat.obj.f_cleaned@meta.data <- seurat.obj.f_cleaned@meta.data %>% mutate(subclass = dplyr::case_when(
seurat_clusters == 0 ~ "EC",
seurat_clusters == 1 ~ "DCT1",
seurat_clusters == 2 ~ "FIB",
seurat_clusters == 3 ~ "PTS1",
seurat_clusters == 4 ~ "CNT",
seurat_clusters == 5 ~ "TALB",
seurat_clusters == 6 ~ "PTS2",
seurat_clusters == 7 ~ "EC",
seurat_clusters == 8 ~ "PTS2",
seurat_clusters == 9 ~ "PTS1",
seurat_clusters == 10 ~ "PTS2",
seurat_clusters == 11 ~ "EC",
seurat_clusters == 12 ~ "PTS3",
seurat_clusters == 13 ~ "TALA",
seurat_clusters == 14 ~ "PC",
seurat_clusters == 15 ~ "PTS2",
seurat_clusters == 16 ~ "DCT2",
seurat_clusters == 17 ~ "PTS3",
seurat_clusters == 18 ~ "PODO",
seurat_clusters == 19 ~ "DTL",
seurat_clusters == 20 ~ "ICB",
seurat_clusters == 21 ~ "MACRO",
seurat_clusters == 22 ~ "ICA",
seurat_clusters == 23 ~ "MES",
seurat_clusters == 24 ~ "FIB",
seurat_clusters == 25 ~ "DOUBLET",
seurat_clusters == 26 ~ "PTS2",
seurat_clusters == 27 ~ "DOUBLET",
seurat_clusters == 28 ~ "MD",
seurat_clusters == 29 ~ "LYMPHO",
seurat_clusters == 30 ~ "DOUBLET"
))

# Filter out the doublets

seurat.obj.f_cleaned <- subset(seurat.obj.f_cleaned, subset = subclass == "DOUBLET", invert = TRUE)

subclass_order <- c("PTS1", "PTS2", "PTS3", "DTL", "TALA", "TALB", "MD", "DCT1", "DCT2", "CNT", "PC", "ICA", "ICB", "PODO", "PEC", "URO", "EC", "FIB", "MES", "LYMPHO", "MACRO")

seurat.obj.f_cleaned@meta.data$subclass <- factor(seurat.obj.f_cleaned@meta.data$subclass, levels = subclass_order)

# Change the identities to align with new subclass names and make umap
Idents(seurat.obj.f_cleaned) <- seurat.obj.f_cleaned@meta.data$subclass

DimPlot(seurat.obj.f_cleaned, reduction = "umap", label = TRUE, label.size = 4) +  
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle(paste0("Subclass Annotation for ", sample_name))

Filter_3  <- ncol(seurat.obj.f_cleaned)

amountfiltered3 <- Filter_2-Filter_3

#Making a table to display the amount of cells being filtered out with the minimal filtering of low quality cells applied to the snRNAseq dataset. 
cell_counts <- data.frame(Phase = c("Before Filtering", "After Filtering", "Amount Filtered"), Cell_Number = c(Filter_2, Filter_3, amountfiltered3))

print(cell_counts)

947 cells were removed after manual doublet filtering.

Save Excel Spreadsheet

This code creates an Excel Spreadsheet of DEG List by Subtype

The file will be located in the Outputs folder.

# Step 1: Find markers (grouped by subclass)
all_markers <- FindAllMarkers(
  seurat.obj.f_cleaned,
  only.pos = TRUE,
  min.pct = 0.5,
  logfc.threshold = 1,
  group.by = "subclass"  # grouping by subclass instead of cluster ID
)

# Step 2: Sort markers by descending log2FC, put gene column first
all_markers <- all_markers %>%
  arrange(desc(avg_log2FC)) %>%
  select(gene, everything())

# Step 3: Split markers by subclass (stored in 'cluster' column still)
marker_list <- split(all_markers, all_markers$cluster)

# Step 4: Reorder the list according to your subclass_order
# First ensure you're only including subclasses that are actually present
valid_subclasses <- subclass_order[subclass_order %in% names(marker_list)]

# Reorder marker_list based on that
marker_list <- marker_list[valid_subclasses]

# Step 5: Write to Excel
wb <- createWorkbook()

for (subclass in names(marker_list)) {
  addWorksheet(wb, sheetName = subclass)
  writeData(wb, sheet = subclass, marker_list[[subclass]])
}

# Step 6: Save file
date <- format(Sys.Date(), "%Y%m%d")
saveWorkbook(
  wb,
  here("Outputs", paste0(date, "_", sample_name, "_FindAllMarkers_By_Subclass.xlsx")),
  overwrite = TRUE
)

Characteristics of Well-Annotated Whole Kidney Datasets

While every kidney single-nucleus RNA sequencing (scRNAseq) dataset reflects unique biological and technical aspects, certain consistent features can be expected from a well-processed and accurately annotated dataset. These hallmark features not only serve as quality indicators but also reflect our current biological understanding of kidney structure and cellular organization. Below, we describe several features commonly observed in high-quality scRNAseq datasets from adult mouse kidneys:

1. Distinct Proximal Tubule Cluster Architecture
Proximal tubule cells (PTs) typically constitute the largest population of epithelial cells in the kidney and are often recovered in high abundance. In well-annotated datasets, this population generally resolves into at least two major clusters: PTS1/PTS2, which share expression of classic PT markers like and Lrp2, and a nearby but distinct PTS3 cluster that expresses more distal PT markers such as Slc16a9. The clear separation of PTS3 suggests both adequate sequencing depth and sufficient transcriptional diversity resolution in the dataset.

2. Continuum from Distal Convoluted Tubule to Collecting Duct Principal Cells
Another expected architectural feature is the progression of epithelial cell states along the nephron from the distal convoluted tubule (DCT) to the connecting tubule (CNT) and onward to principal cells (PC) of the collecting duct. These lineages form a functional continuum, and when processed correctly, they often appear as a trajectory or gradient in dimensionality-reduction space (e.g., UMAP). This continuum reflects both their shared transcriptional programs and spatial adjacency in the nephron, with marker genes such as Slc12a3 (DCT), Slc8a1 (DCT2/CNT), and Aqp2 (PC) defining the transitions.

3. Thick Ascending Limb (TAL) as a Dominant Nephron Segment
The thick ascending limb (TAL) of the loop of Henle is another prominent epithelial population, often appearing as one of the largest nephron-derived clusters. This is consistent with the high number of TAL cells in the kidney cortex and outer medulla and their active transcriptional programs related to electrolyte transport. Marker genes such as Slc12a1 and Umod define this population. In well-annotated datasets, a small adjacent population corresponding to the macula densa (MD), can be resolved as a transcriptionally distinct but nearby cluster expressing the gene Nos1. While there is some

4. Clean Diagonal of Cell-Type Markers
If clusters are correctly annotated, each should be characterized by the restricted expression of specific marker genes. When annotated as suggested below, cells will be ordered in a way that produces a clear diagonal pattern of robust marker expression.

Example of Well-Annotated Whole Kidney Datasets

DimPlot(seurat.obj.f_cleaned, reduction = "umap", label = TRUE, label.size = 4) +  
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle(paste0("Subclass Annotation for ", sample_name))

# Change the order of the identities on the x-axis to create a diagonal on the dotplot, going from top to bottom of the "naming kidney cells" table

DotPlot(seurat.obj.f_cleaned,
features = markers.to.plot1,
group.by = "subclass",
dot.scale = 8,
dot.min = 0,
scale.max = 100,
scale.min = 0,
col.min = -2.5,
col.max = 2.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip() +
  labs(x = "Gene Features", y = "Clusters") +  
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle(paste0("Subclass Annotation for ", sample_name))

Annotate by Class

# Change the specificty of the cells to a lower level and make another UMAP
seurat.obj.f_cleaned@meta.data <- seurat.obj.f_cleaned@meta.data %>%
  mutate(class = dplyr::case_when(
    subclass %in% c("PTS1", "PTS2", "PTS3") ~ "PT",
    subclass == "DTL" ~ "DTL",
    subclass %in% c("TALA", "TALB", "MD") ~ "TAL",
    subclass %in% c("DCT1", "DCT2") ~ "DCT",
    subclass == "CNT" ~ "CNT",
    subclass == "PC" ~ "PC",
    subclass == "ICA" ~ "ICA",
    subclass == "ICB" ~ "ICB",
    subclass == "PODO" ~ "PODO",
    subclass == "PEC" ~ "PEC",
    subclass == "URO" ~ "URO",
    subclass %in% c("EC", "ECG", "ECP", "LYMPH") ~ "EC",
    subclass == "FIB" ~ "FIB",
    subclass == "MES" ~ "MES",
    subclass == "VSMC" ~ "VSMC",
    subclass %in% c("LYMPHO", "MACRO") ~ "IMMUNE",
  ))

DimPlot(seurat.obj.f_cleaned, group.by = "class", label = TRUE, label.size = 4) +
   theme(plot.title = element_text(hjust = 0.5)) +
   ggtitle(paste0("Class Annotation for ", sample_name))

class_order <- c("PT", "DTL", "TAL", "DCT", "CNT", "PC", "ICA", "ICB", "PODO", "PEC", "URO", "EC", "FIB", "MES", "IMMUNE")
seurat.obj.f_cleaned@meta.data$class <- factor(seurat.obj.f_cleaned@meta.data$class, levels = class_order)

markers.to.plot2 <- c("Lrp2",         # PTS1, PTS2, PTS3
                      "Cryab",        # DTL
                      "Slc12a1",      # TAL1, TAL2, MD
                      "Slc12a3",      # DCT1, DCT2
                      "Slc8a1",       # CNT
                      "Aqp2",         # PC
                      "Kit",          # ICA
                      "Slc26a4",      # ICB
                      "Nphs1",        # PODO
                      "Ncam1",        # PEC
                      "Upk1b",        # URO
                      "Flt1",         # EC
                      "Pdgfra",       # FIB
                      "Pdgfrb",       # FIB, MES, VSMC
                      "Piezo2",       # MES
                      "Ptprc"        # LYMPHO, MACRO
)

DotPlot(seurat.obj.f_cleaned,
features = markers.to.plot2,
group.by = "class",
dot.scale = 8,
dot.min = 0,
scale.max = 100,
scale.min = 0,
col.min = -2.5,
col.max = 2.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip() +
  labs(x = "Gene Features", y = "Clusters") +  
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle(paste0("Class Annotation for ", sample_name))

Annotate by Type

# Change the specificty of the cells to a further lower level and make another UMAP
seurat.obj.f_cleaned@meta.data <- seurat.obj.f_cleaned@meta.data %>%
  mutate(type = dplyr::case_when(
    class %in% c("PT", "DTL", "TAL", "DCT", "CNT", "PC", "ICA", "ICB", "PODO", "PEC", "URO") ~ "EPITHELIAL",
    class == "EC" ~ "ENDOTHELIAL",
    class %in% c("FIB", "MES", "VSMC") ~ "STROMAL",
    class == "IMMUNE" ~ "IMMUNE",
  ))

type_order <- c("EPITHELIAL", "ENDOTHELIAL", "STROMAL", "IMMUNE")
seurat.obj.f_cleaned@meta.data$type <- factor(seurat.obj.f_cleaned@meta.data$type, levels = type_order)

DimPlot(seurat.obj.f_cleaned, group.by = "type", label = TRUE, label.size = 4) +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle(paste0("Type Annotation for ", sample_name))

type_order <- c("EPITHELIAL", "ENDOTHELIAL", "STROMAL", "IMMUNE")
seurat.obj.f_cleaned@meta.data$type <- factor(seurat.obj.f_cleaned@meta.data$type, levels = type_order)

markers.to.plot3 <- c("Ank3",         # PTS1, PTS2, PTS3, DTL, TAL1, TAL2, MD, DCT1, DCT2, CNT, PC, ICA, ICB, PODO, PEC, URO
                      "Flt1",         # EC
                      "Pdgfrb",       # FIB, MES, VSMC
                      "Ptprc"         # LYMPHO, MACRO
)

DotPlot(seurat.obj.f_cleaned,
features = markers.to.plot3,
group.by = "type",
dot.scale = 8,
dot.min = 0,
scale.max = 100,
scale.min = 0,
col.min = -2.5,
col.max = 2.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip() +
  labs(x = "Gene Features", y = "Clusters") +  
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle(paste0("Type Annotation for ", sample_name))

Proportion Plots for Kidney_7

t1 <- table(seurat.obj.f_cleaned@meta.data$subclass)
#t1

prop.t1 <- prop.table(t1) 
#prop.t1

t2 <- as.data.frame(t1)
colnames(t2) <- c('Cell_type', 'Frequency')

# Original plot

f1a <- ggplot(t2, aes(fill=Cell_type, y=Frequency, x=1)) + 
  geom_bar(position="fill", stat = "identity", fun.y = "mean", colour="black") +
  theme_classic() +
  theme(axis.line = element_line(size = 1, colour = "black"),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank(),
        text = element_text(size=20),
        axis.line.x = element_blank()) +
  labs(fill = "Cell Type") +
  xlab(NULL) + ggtitle("Subclass")

t1 <- table(seurat.obj.f_cleaned@meta.data$class)
#t1

prop.t1 <- prop.table(t1) 
#prop.t1

t2 <- as.data.frame(t1)
colnames(t2) <- c('Cell_type', 'Frequency')

# Original plot

f1b <- ggplot(t2, aes(fill=Cell_type, y=Frequency, x=1)) + 
  geom_bar(position="fill", stat = "identity", fun.y = "mean", colour="black") +
  theme_classic() +
  theme(axis.line = element_line(size = 1, colour = "black"),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank(),
        text = element_text(size=20),
        axis.line.x = element_blank()) +
  labs(fill = "Cell Type") +
  xlab(NULL) + ggtitle("Class")

t1 <- table(seurat.obj.f_cleaned@meta.data$type)
#t1

prop.t1 <- prop.table(t1) 
#prop.t1

t2 <- as.data.frame(t1)
colnames(t2) <- c('Cell_type', 'Frequency')

# Original plot

f1c <- ggplot(t2, aes(fill = Cell_type, y = Frequency, x = 1)) + 
  geom_bar(position = "fill", stat = "identity", colour = "black") +
  theme_classic() +
  theme(
    axis.line = element_line(size = 1, colour = "black"),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank(),
    text = element_text(size = 20),
    axis.line.x = element_blank()
  ) +
  labs(fill = "Cell Type") +
  xlab(NULL) +
  ggtitle("Type")


f1a + f1b +f1c

QC Reports

Progression of cells Filtering Through the Analysis of Kidney_7

Start_name <- paste0(as.character(Start), " cells")
Filter_1_name <- paste0(as.character(Filter_1), " cells")
Filter_2_name <- paste0(as.character(Filter_2), " cells")
Filter_3_name <- paste0(as.character(Filter_3), " cells")


df_paths2 <- data.frame(
  group = c("Pass_All", "Filtered_at_3", "Filtered_at_2", "Filtered_at_1"),
  CellRanger = c(Start_name,     Start_name,       Start_name,        Start_name),
  nFeature = c(Filter_1_name,  Filter_1_name,    Filter_1_name,     paste0(as.character(Start-Filter_1), " cells removed")),
  doubletFinder = c(Filter_2_name,  Filter_2_name,   paste0(as.character(Filter_1-Filter_2), " cells removed"),    NA),
  ManualDoublets = c(Filter_3_name, paste0(as.character(Filter_2-Filter_3), " cells removed"),      NA,       NA),
  value = c(Start,        Filter_2-Filter_3,          Filter_1-Filter_2,           Start-Filter_1)
)


df_paths2b <- data.frame(
  group = c("Pass_All", "Filtered_at_3", "Filtered_at_2", "Filtered_at_1"),
  CellRanger = c(Start_name, Start_name, Start_name, Start_name),
  nFeature = c(Filter_1_name, Filter_1_name, Filter_1_name, paste0(as.character(Start - Filter_1), " cells removed")),
  doubletFinder = c(Filter_2_name, Filter_2_name, paste0(as.character(Filter_1 - Filter_2), " cells removed"), NA),
  ManualDoublets = c(Filter_3_name, paste0(as.character(Filter_2 - Filter_3), " cells removed"), NA, NA),
  Final = c(Filter_3_name , NA, NA, NA),
  value = c(Start, Filter_2 - Filter_3, Filter_1 - Filter_2, Start - Filter_1)
)

# Convert to long format suitable for ggsankey
df_sankey <- df_paths2b   %>%
  make_long(CellRanger, nFeature, doubletFinder, ManualDoublets, Final, value = "value")

# make removed nodes gray
orig_levels <- levels(factor(df_sankey$node))
orig_palette <- scales::hue_pal()(length(orig_levels))
names(orig_palette) <- orig_levels

# Override only "removed" nodes to gray
custom_palette <- orig_palette
custom_palette[grepl("removed", names(custom_palette))] <- "gray"

# change font size of removed nodes
df_sankey <- df_sankey %>%
  filter(!is.na(node))%>%
  mutate(is_removed = grepl("removed", node),
         label_size = ifelse(is_removed, 3, 4))

# specify order
df_sankey$node <- factor(df_sankey$node, levels = rev(unique(df_sankey$node)))
df_sankey$next_node <- factor(df_sankey$next_node, levels = rev(unique(df_sankey$node)))

# Do the same for next_node
df_sankey$next_node <- factor(df_sankey$next_node, levels = levels(df_sankey$node))


# Plot
ggplot(df_sankey, aes(x = x,
                      next_x = next_x,
                      node = node,
                      next_node = next_node,
                      value = value,
                      fill = factor(node))) +
  geom_sankey(flow.alpha = 0.7, node.color = "black") +
  geom_sankey_label(aes(label = node, size = label_size), color = "white", show.legend = FALSE) +
  scale_fill_manual(values = custom_palette) +   
  scale_color_identity() +
  scale_size_identity() +
  theme_classic() +
labs(
  title = paste0("Filtering Progression Sankey Plot for ", sample_name),
  x = ""
) +
  theme(
    # Remove Y-axis
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.line.y = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),

    # Remove X-axis lines
    axis.line.x = element_blank(),
    axis.ticks.x = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),

    # Enlarge text
    axis.text.x = element_text(size = 14),
    axis.title.x = element_text(size = 16),
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),

    # Hide legend
    legend.position = "none"
  )

Subclass Cell Types That Were Not Annotated in Kidney_7

list1 <- unique(seurat.obj.f_cleaned$subclass)
  
list2 <- unique(markers_long$Subclass)

missing_subclass <- setdiff(list2, list1)

missing_subclass

The following subclasses were not annotated in this sample: PEC, URO, LYMPH, VSMC

Ambient RNA Warning for Kidney_7

Ambient RNA levels are acceptable.

Save Seurat Object

Add Sample Meta.Data to Object

It matches the column Sample_ID in the spreadsheet with the sample_name input in the Set the Sample Name chunk.

meta <- read.xlsx(here("datasets", "metadata.xlsx"), sheet = 1)

meta$sample_ID <- as.character(meta$sample_ID)

seurat.obj.f_cleaned@meta.data$sample <- sample_name

df <- FetchData(seurat.obj.f_cleaned, "sample") %>% rownames_to_column(var = "CellID")

df <- left_join(df, meta, by = c("sample" = "sample_ID")) %>% column_to_rownames(var = "CellID")

seurat.obj.f_cleaned<- AddMetaData(seurat.obj.f_cleaned, df)

seurat.obj.f_cleaned

## An object of class Seurat 
## 52663 features across 7266 samples within 2 assays 
## Active assay: RNA (32285 features, 2000 variable features)
##  3 layers present: counts, data, scale.data
##  1 other assay present: SCT
##  2 dimensional reductions calculated: pca, umap

Check to see if Meta.Data has been added to Seurat Object

Scroll to the right on the table to check for new meta.data columns.

head(seurat.obj.f_cleaned@meta.data)

This saves the Seurat Object as an .RDS file

This file is processed and can be merged with replicates in downstream analysis.

It will be saved in the Outputs folder.

# Add a date to file name
date <- format(Sys.Date(), "%Y%m%d")

#saveRDS(seurat.obj.f_cleaned, here("Outputs", paste0(date, sample, ".rds")))

Acknowledgements

Group Leaders

This guide is written based on the experience of multiple trainee’s and investigators

These include:

Michael Hutchens
Susan Gurley

Trainees

Trainee’s that contributed to the thinking that went into this guide include:

Roman Thomas
Sienna Blanche
Jeff Karnsomprot
Kevin Burfeind
Xiao-Tong Su
Annie Lackey
Jeremiah Reyes

Funding Agencies

We are incredibly grateful to our funding agencies for supporting this work including:

Institutional Support

We are also incredibly grateful for our institutional support from University of Southern California and particular the Vito M. Campese, MD/UKRO Kidney Research Center of USC.

Session Info

sessionInfo()

## R version 4.4.3 (2025-02-28 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] MoMAColors_0.0.0.9000  DoubletFinder_2.0.6    ggsankey_0.0.99999    
##  [4] EnhancedVolcano_1.24.0 ggrepel_0.9.6          glmGamPoi_1.18.0      
##  [7] hdf5r_1.3.12           readxl_1.4.5           SoupX_1.6.2           
## [10] openxlsx_4.2.8         sctransform_0.4.2      RColorBrewer_1.1-3    
## [13] lubridate_1.9.4        forcats_1.0.0          stringr_1.5.1         
## [16] purrr_1.0.4            readr_2.1.5            tidyr_1.3.1           
## [19] tibble_3.2.1           tidyverse_2.0.0        htmlwidgets_1.6.4     
## [22] knitr_1.50             plotly_4.10.4          ggpubr_0.6.0          
## [25] ggplot2_3.5.1          cowplot_1.1.3          patchwork_1.3.0       
## [28] Seurat_5.2.0           SeuratObject_5.0.2     sp_2.2-0              
## [31] dplyr_1.1.4            here_1.0.1            
## 
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.22            splines_4.4.3              
##   [3] later_1.4.1                 fields_16.3.1              
##   [5] cellranger_1.1.0            polyclip_1.10-7            
##   [7] fastDummies_1.7.5           lifecycle_1.0.4            
##   [9] rstatix_0.7.2               rprojroot_2.0.4            
##  [11] globals_0.16.3              lattice_0.22-6             
##  [13] MASS_7.3-64                 backports_1.5.0            
##  [15] magrittr_2.0.3              limma_3.62.2               
##  [17] sass_0.4.9                  rmarkdown_2.29             
##  [19] jquerylib_0.1.4             yaml_2.3.10                
##  [21] httpuv_1.6.15               spam_2.11-1                
##  [23] zip_2.3.2                   spatstat.sparse_3.1-0      
##  [25] reticulate_1.41.0.1         pbapply_1.7-2              
##  [27] maps_3.4.2.1                zlibbioc_1.52.0            
##  [29] abind_1.4-8                 Rtsne_0.17                 
##  [31] GenomicRanges_1.58.0        presto_1.0.0               
##  [33] BiocGenerics_0.52.0         ggstream_0.1.0             
##  [35] GenomeInfoDbData_1.2.13     IRanges_2.40.1             
##  [37] S4Vectors_0.44.0            irlba_2.3.5.1              
##  [39] listenv_0.9.1               spatstat.utils_3.1-3       
##  [41] goftest_1.2-3               RSpectra_0.16-2            
##  [43] spatstat.random_3.3-2       fitdistrplus_1.2-2         
##  [45] parallelly_1.42.0           DelayedMatrixStats_1.28.1  
##  [47] DelayedArray_0.32.0         codetools_0.2-20           
##  [49] tidyselect_1.2.1            UCSC.utils_1.2.0           
##  [51] farver_2.1.2                matrixStats_1.5.0          
##  [53] stats4_4.4.3                spatstat.explore_3.3-4     
##  [55] jsonlite_1.9.1              progressr_0.15.1           
##  [57] Formula_1.2-5               ggridges_0.5.6             
##  [59] survival_3.8-3              tools_4.4.3                
##  [61] ica_1.0-3                   Rcpp_1.0.14                
##  [63] glue_1.8.0                  SparseArray_1.6.2          
##  [65] gridExtra_2.3               xfun_0.51                  
##  [67] MatrixGenerics_1.18.1       GenomeInfoDb_1.42.3        
##  [69] withr_3.0.2                 fastmap_1.2.0              
##  [71] digest_0.6.37               timechange_0.3.0           
##  [73] R6_2.6.1                    mime_0.12                  
##  [75] colorspace_2.1-1            scattermore_1.2            
##  [77] tensor_1.5                  spatstat.data_3.1-6        
##  [79] generics_0.1.3              data.table_1.17.0          
##  [81] S4Arrays_1.6.0              httr_1.4.7                 
##  [83] uwot_0.2.3                  pkgconfig_2.0.3            
##  [85] gtable_0.3.6                lmtest_0.9-40              
##  [87] XVector_0.46.0              htmltools_0.5.8.1          
##  [89] carData_3.0-5               dotCall64_1.2              
##  [91] Biobase_2.66.0              scales_1.3.0               
##  [93] png_0.1-8                   spatstat.univar_3.1-2      
##  [95] rstudioapi_0.17.1           tzdb_0.5.0                 
##  [97] reshape2_1.4.4              nlme_3.1-167               
##  [99] cachem_1.1.0                zoo_1.8-14                 
## [101] KernSmooth_2.23-26          vipor_0.4.7                
## [103] parallel_4.4.3              miniUI_0.1.1.1             
## [105] ggrastr_1.0.2               pillar_1.10.1              
## [107] grid_4.4.3                  vctrs_0.6.5                
## [109] RANN_2.6.2                  promises_1.3.2             
## [111] car_3.1-3                   xtable_1.8-4               
## [113] cluster_2.1.8               beeswarm_0.4.0             
## [115] evaluate_1.0.3              cli_3.6.4                  
## [117] compiler_4.4.3              crayon_1.5.3               
## [119] rlang_1.1.5                 future.apply_1.11.3        
## [121] ggsignif_0.6.4              labeling_0.4.3             
## [123] ggbeeswarm_0.7.2            plyr_1.8.9                 
## [125] stringi_1.8.4               viridisLite_0.4.2          
## [127] deldir_2.0-4                munsell_0.5.1              
## [129] lazyeval_0.2.2              spatstat.geom_3.3-5        
## [131] Matrix_1.7-2                RcppHNSW_0.6.0             
## [133] hms_1.1.3                   sparseMatrixStats_1.18.0   
## [135] bit64_4.6.0-1               future_1.34.0              
## [137] statmod_1.5.0               shiny_1.10.0               
## [139] SummarizedExperiment_1.36.0 ROCR_1.0-11                
## [141] igraph_2.1.4                broom_1.0.7                
## [143] bslib_0.9.0                 bit_4.6.0

A Practical Framework for Kidney scRNAseq Analysis

Processing and Annotating Kidney scRNAseq Datasets

Nelson Lab

27 July, 2025

Introduction

The Goal of This Tutorial

Limitations of This Guide

Manuscript

Video Tutorial

How to Use This R Markdown File

System Requirements

Resources for Learning R and R Markdown

Blogs and Books

Online Courses

Youtube Channels

Create Folders and Download Files

Create Folder

Download scRNAseq_guide.Rmd

This is what your folder should look like:

Create Folder Structure for the Analysis

Download Tutorial Datasets

After running the code, you should see these two folders in the file you created.

Place the raw matrix, filtered matrix, and metadata files in the Datasets folder.

Your folder structure should look like this:

Install and Load Packages

The packages below may need to be installed for the first time you run this code.

The following are packages are not on CRAN or Bioconductor and can be installed using the remotes::install_github() function or devtools::install_github() function.

The following packages are on CRAN or Bioconductor and can be installed using the install.packages() function or BiocManager::install() function.

Note: Due to bugs in DoubletFinder, it’s critical to use doubletFinder: version 2.0.6and Seurat: version 5.x for this code.

Load Dataset

Set the Sample Name

This is what your enviroment should look like after loading samples

If it doesn’t, check the location of your files and if you activated the here package

SoupX

Estimation of Kidney_7 Ambient RNA levels

WARNING: Ambient RNA

Examples of Low or High Ambient RNA in scRNAseq datasets

DoubletFinder

Important Note

Filtering of Low Complexity Cells

High-Dimensional Superclustering

Set the Resolution

Save Excel Spreadsheet

This code creates an Excel Spreadsheet of DEG List by Cluster

Cell Types of Concern

WARNING: Missing Cell Types

Options 1: Run the chunk below and see if there are any clusters that are clearly enriched for the cell-type defining markers.

Option 2: Go to the High-Dimensional Superclustering Chunk and increase the resolution (e.g. change from “2” to “3”), and see if you are able to see clear clusters for missing cell types.

Continue With DoubletFinder

Visualization of Singlet Object

Number of Cells Filtered

Manual Doublet Filtering

Example Doublet Clusters

The code below will help to identify potential doublet clusters

But you need to examine every cluster with your own eyes

Add the clusters that you determine are doublets to the doublet_clusters <- c() code below to categorize clusters as DOUBLET.

Clusters of Concern

WARNING! Clusters of Concern

Create List Of Clusters Identified As Doublets

Annotate Doublets

Remove Doublets

Kidney Cell Annotation

Cell Types of the Kidney

Relevant Manuscripts

List of kidney cell types

Apply the following marker list to name the clusters based on Kidney Cell Types

Re-Cluster Singlets

Annotate by Subclass

This Step Takes the Most Hands-On Time

Save Excel Spreadsheet

This code creates an Excel Spreadsheet of DEG List by Subtype

Characteristics of Well-Annotated Whole Kidney Datasets

Example of Well-Annotated Whole Kidney Datasets

Annotate by Class

Annotate by Type

Proportion Plots for Kidney_7

QC Reports

Progression of cells Filtering Through the Analysis of Kidney_7

Subclass Cell Types That Were Not Annotated in Kidney_7

Ambient RNA Warning for Kidney_7

Note: Due to bugs in DoubletFinder, it’s critical to use `doubletFinder: version 2.0.6`and `Seurat: version 5.x` for this code.

Add the clusters that you determine are doublets to the `doublet_clusters <- c()` code below to categorize clusters as DOUBLET.