library("dplyr")
library("Seurat")
library("knitr")
library("ggplot2")
library("BiocManager")
library("here")
#BiocManager::install("EnhancedVolcano")
library("EnhancedVolcano") #volcano plot
#BiocManager::install("DESeq2") #for DEG (DESeq2 is a Bioconductor package)
library("DESeq2")
library("tidyverse") #tidy up data
library("styler") #tidy up code formatting
if (!require("kableExtra")) {install.packages("kableExtra"); require("kableExtra")} # for table formatting
if (!require("RColorBrewer")) {install.packages("RColorBrewer"); require("RColorBrewer")} # for color brewer
if (!require("sctransform")) {install.packages("sctransform"); require("sctransform")} # for data normalization
if (!require("glmGamPoi")) {BiocManager::install('glmGamPoi'); require("glmGamPoi")} # for data normalization, sctransform
if (!require("cowplot")) {install.packages("cowplot"); require("cowplot")} # for figure layout
if (!require("patchwork")) {install.packages("patchwork"); require("patchwork")} # for figure patching
if (!require("openxlsx")) {install.packages("openxlsx"); require("openxlsx")} # to save .xlsx files
# install.packages("styler")
set.seed(12345)
# here()
Welcome to the Single-Cell Omics Research and Education Club!
If this is your first time at the club, I want to extend an extra-special welcome to you!
I’m Jonathan Nelson, an Assistant Professor at the University of Southern California. I’m a wet scientist turned wet+dry scientist. I’ve been working with single-cell RNAseq data for the past 5 years and I’m excited to share what I’ve learned with you.
We believe that bioinformatics is a constantly evolving field, and that ongoing learning and professional development are essential to staying up-to-date. We encourage members to share their knowledge and experiences with each other, and to seek out opportunities for continued learning.
We believe that access to bioinformatics support should be available to everyone. We strive to create a welcoming and inclusive environment where all members can feel comfortable asking for help and contributing to the group.
We believe that working together is key to achieving success in bioinformatics. We value the diversity of perspectives and backgrounds that each member brings, and we encourage open communication and the sharing of ideas.
We believe in conducting ourselves with honesty and professionalism in all our interactions. We hold ourselves to high ethical standards and respect the privacy and confidentiality of all members.
We believe in approaching each other with empathy and kindness. We understand that bioinformatics can be a challenging and sometimes frustrating field, and we strive to support each other through these difficulties.
I know a lot of this has been going on in the background for everyone and I wanted to bring it to the forefront. My expectation is that we have about 6 of these meetings together and then we can re-evaluate if we want to continue as a group or not.
Email me if you would like me to add anyone: j.nelson@med.usc.edu
Today’s code (this html file) will be posted to the SCORE website (https://usckrc.github.io/website/score.html)
LAUREL: Life Worth Living
https://open.spotify.com/track/1mCvM05OlYWQd77RDxCTLD?si=7d8beb22c2274997
https://github.com/r-lib/styler?tab=readme-ov-file
install.packages("styler")
https://www.youtube.com/watch?app=desktop&v=yUA3NpJLH6I&t=220s&t=156
I wanted to start with one csv file that was flexible to fill out.
library(dplyr)
library(tidyr)
library(ggplot2)
library(here)
# Load data
df <- read.csv(here("Week 3 Reporting Findings", "data", "JWNMentors.csv"))
# Define institution list
institutions <- c(
  "University of Washington",
  "Oregon Health & Science University",
  "University of Southern California"
)
# Categorize mentors
df <- df %>%
  mutate(
    MentorType = ifelse(Mentor %in% institutions, "Institution", "Individual")
  )
# Reshape data
g.gantt <- df %>%
  pivot_longer(cols = 3:4, names_to = "state", values_to = "date") %>%
  mutate(date = as.Date(date, "%m/%d/%Y"))
# Create summarized timeline for institutions
g.gantt_combined <- g.gantt %>%
  filter(MentorType == "Institution") %>%
  group_by(Stage) %>%
  summarise(
    start_date = min(date[state == "Start"]),
    end_date = max(date[state == "End"]),
    Mentor = "Institution"
  ) %>%
  ungroup()
# Define factor levels for Stage
stage_levels <- c(
  "University of Washington",
  "Oregon Health & Science University",
  "University of Southern California",
  "Undergraduate",
  "Graduate School",
  "Postdoc",
  "Junior Faculty"
)
g.gantt$Stage <- factor(g.gantt$Stage, levels = stage_levels)
g.gantt_combined$Stage <- factor(g.gantt_combined$Stage, levels = stage_levels)
# Define y-axis order
desired_order <- rev(c(
  "Institution", "Staffan Bench", "Benjamin Hall", "Nabil Alkayed",
  "Paul Barnes", "Sanjiv Kaul", "David Ellison",
  "Susan Gurley", "Janos Peti-Peterdi"
))
# Define color mapping
desired_colors <- c(
  "University of Washington" = "#32006e",
  "Oregon Health & Science University" = "#575e60",
  "University of Southern California" = "#990000",
  "Undergraduate" = "#C5692E",
  "Graduate School" = "#FEB359",
  "Postdoc" = "#435F90",
  "Junior Faculty" = "#B47E83"
)
# Create plot
ggplot() +
  geom_segment(
    data = g.gantt_combined,
    aes(x = start_date, xend = end_date, y = Mentor, yend = Mentor, color = Stage),
    linewidth = 10 # `size` for lines is deprecated since ggplot2 3.4.0
  ) +
  geom_line(
    data = g.gantt,
    aes(x = date, y = Mentor, color = Stage),
    linewidth = 10
  ) +
  labs(x = "Year", y = NULL, title = "Mentors", color = "Institution and Career Stage") +
  theme_minimal(base_size = 15) +
  scale_y_discrete(limits = desired_order) +
  theme(
    panel.grid = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 18),
    axis.title = element_text(size = 16),
    legend.title = element_text(size = 16, face = "bold"),
    legend.text = element_text(size = 16),
    axis.text = element_text(size = 14),
    axis.ticks.y = element_blank(),
    axis.line.y = element_blank()
  ) +
  scale_color_manual(values = desired_colors, na.value = "gray")
Files are compressed (.gz) to save space.
Cell Ranger generates several files and folders after processing single-cell RNA-seq data. The key outputs are:
https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/outputs/cr-outputs-gex-overview
Outputs compatible with Seurat, Scanpy, IGV, and downstream tools
Can be used with CellBender, Alevin, or custom workflows
A) Filtered Feature Matrix (typically under outs/filtered_feature_bc_matrix/):
Output file format: Which one should you use, MEX or HDF5 (.h5)?
Both formats contain the same information: a gene × cell expression matrix, with barcodes and features.
The difference is in the file structure and format: MEX is a folder of three files (matrix.mtx.gz, barcodes.tsv.gz, features.tsv.gz), while HDF5 packs everything into a single .h5 file.
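To make the MEX layout concrete, here is a minimal sketch that writes a toy matrix in that three-file shape and reads it back with the Matrix package. The gene names, barcodes, counts, and folder name are all made up for illustration, the files are left uncompressed, and the features file is simplified to one column (real Cell Ranger output is gzipped and has three tab-separated columns).

```r
library(Matrix)

# Toy counts: 3 genes (rows) x 4 cell barcodes (columns); values are made up.
counts <- Matrix(
  c(0, 2, 0, 1,
    5, 0, 0, 0,
    0, 0, 3, 4),
  nrow = 3, byrow = TRUE, sparse = TRUE
)
genes <- c("GeneA", "GeneB", "GeneC")   # hypothetical feature names
barcodes <- paste0("CELL", 1:4, "-1")   # hypothetical barcodes

# Write the three MEX components into a temporary folder.
mex_dir <- file.path(tempdir(), "toy_filtered_feature_bc_matrix")
dir.create(mex_dir, showWarnings = FALSE)
writeMM(counts, file.path(mex_dir, "matrix.mtx"))
writeLines(genes, file.path(mex_dir, "features.tsv"))
writeLines(barcodes, file.path(mex_dir, "barcodes.tsv"))

# Read them back and reattach the row/column names.
mex_counts <- as(readMM(file.path(mex_dir, "matrix.mtx")), "CsparseMatrix")
rownames(mex_counts) <- readLines(file.path(mex_dir, "features.tsv"))
colnames(mex_counts) <- readLines(file.path(mex_dir, "barcodes.tsv"))
dim(mex_counts)
```

For real Cell Ranger output you would normally skip the manual steps and call Seurat's Read10X() on the MEX folder or Read10X_h5() on the .h5 file; both should give the same matrix.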
B) raw_feature_bc_matrix/
- Same structure as filtered_feature_bc_matrix, but:
- Includes all barcodes, even empty droplets, damaged cells, and low-quality barcodes
Why is it useful?
- Re-analyze using your own custom filtering
- Identify ambient RNA contamination
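As a rough illustration of custom filtering on a raw matrix, the sketch below simulates a raw count matrix and keeps only barcodes above a total-UMI cutoff. The simulated counts, the barcode names, and the cutoff of 100 are assumptions for the example, not recommended values; dedicated tools model empty droplets far more carefully.

```r
library(Matrix)

set.seed(1)
# Simulated raw matrix: 50 genes x 200 barcodes (made-up numbers).
# The first 20 barcodes mimic real cells; the rest mimic empty droplets.
n_genes <- 50
cell_counts <- matrix(rpois(n_genes * 20, lambda = 20), nrow = n_genes)
empty_counts <- matrix(rpois(n_genes * 180, lambda = 0.1), nrow = n_genes)
raw_counts <- Matrix(cbind(cell_counts, empty_counts), sparse = TRUE)
colnames(raw_counts) <- paste0("BC", seq_len(ncol(raw_counts)))

# Total UMIs per barcode, then keep barcodes above an arbitrary cutoff.
umi_per_barcode <- colSums(raw_counts)
min_umi <- 100 # hypothetical threshold; pick yours from the data
kept <- raw_counts[, umi_per_barcode >= min_umi, drop = FALSE]
ncol(kept)
```

The barcodes that fall below the cutoff are not wasted: their pooled expression profile is one way to get a first look at ambient RNA.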
C) QC and Summary Reports:
- web_summary.html: Interactive summary with plots (genes per cell, mitochondrial %, etc.)
An interactive report viewable in any browser. Shows:
- Total number of estimated cells
- Reads per cell
- Median genes per cell
- % mitochondrial gene expression
- Sequencing saturation
- Clustering preview (if applicable)
- Useful for checking: whether the run worked, how deep the sequencing was, and the general quality of the data (needed for publishing)
- metrics_summary.csv: Tabular summary of sequencing and mapping statistics
D) BAM (Binary Alignment/Map) Files: possorted_genome_bam.bam and .bam.bai
- BAM files = aligned reads in binary format
- possorted_genome_bam.bam: aligned + unaligned reads (incl. Feature Barcode libs)
- unassigned_alignments.bam / sample_alignments.bam: unassigned vs. assigned reads
- Use: troubleshoot, reconvert to FASTQ, explore in IGV (includes barcodes)
E) cloupe.cloupe:
- .cloupe: binary file for Loupe Cell Browser (by 10X Genomics)
- Allows: cluster exploration, gene expression per cluster, manual annotation, marker analysis
- Optional, but useful for biologists and presentations
Processed Data Formats:
- .rds (Seurat): Includes count matrix, metadata, UMAP, clusters
- .h5ad (Scanpy): Same as above but in Python format
- metadata.csv: Per-cell information (cluster, % mito, etc.)
- degs.csv: Differentially expressed genes between clusters or conditions
- umap.png, violin_plot.png: Visual outputs from analysis
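A small sketch of how two of these files can be written from R. The metadata table, its column names, and the output folder are placeholders; in a real analysis the table would come from your Seurat object.

```r
# Hypothetical per-cell metadata (barcodes, clusters, and values are made up).
metadata <- data.frame(
  barcode = c("CELL1-1", "CELL2-1", "CELL3-1"),
  cluster = c("PT", "DCT", "PT"),
  percent_mito = c(1.2, 0.8, 3.5)
)

out_dir <- file.path(tempdir(), "processed")
dir.create(out_dir, showWarnings = FALSE)

# .rds keeps the full R object (types included); .csv is readable anywhere.
saveRDS(metadata, file.path(out_dir, "metadata.rds"))
write.csv(metadata, file.path(out_dir, "metadata.csv"), row.names = FALSE)

# Reading the .rds back returns the same object.
restored <- readRDS(file.path(out_dir, "metadata.rds"))
identical(restored, metadata)
```

The same saveRDS()/readRDS() pair works for a whole Seurat object, which is why .rds is a convenient format to share alongside the plain-text tables.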
These files help others understand, run, and reproduce your project
(on GitHub or similar platforms).
GEO is a public repository managed by the NCBI (National Center for Biotechnology Information).
It’s used to store and share high-throughput gene expression data, including:
- Microarray
- Bulk RNA-seq
- Single-cell/snRNA-seq
- Multi-modal (CITE-seq, spatial transcriptomics)
Steps:
1. Create Meta.data sheet
2. Organize files and get MD5 checksums (FASTQ, matrix, .rds)
3. Upload files via NCBI account
4. Add GEO link and reviewer token to Methods section in your manuscript
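The checksum part of step 2 can be done from R with tools::md5sum(). This is a minimal sketch: the folder and the two placeholder files are invented for the example, so point `files` at your real FASTQ, matrix, and .rds files instead.

```r
# Create two small placeholder files standing in for real submission files.
geo_dir <- file.path(tempdir(), "geo_submission")
dir.create(geo_dir, showWarnings = FALSE)
writeLines("placeholder matrix", file.path(geo_dir, "matrix.mtx"))
writeLines("placeholder barcodes", file.path(geo_dir, "barcodes.tsv"))

# One row per file: file name plus its MD5 checksum.
files <- list.files(geo_dir, full.names = TRUE)
checksums <- data.frame(
  file = basename(files),
  md5 = unname(tools::md5sum(files))
)
checksums
```

The resulting table can be pasted into the checksum columns of the metadata sheet, and recomputed after upload if a transfer looks suspicious.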
Timeline:
- 1–3 days: File preparation
- 1–2 days: File upload (slow: ~2–4 MB/s)
- 1–2 days: GEO QC & processing
Tip: GEO staff is very helpful—especially if you make a mistake!
A .tar file is an archive format that bundles multiple files and folders into a single file, without compressing them.
Think of it like a ZIP file, but without compression by default.
Very common in Linux/Unix systems and in bioinformatics for packaging datasets.
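A quick sketch of bundling a folder into a .tar from R, using the base utils functions tar() and untar(). The folder name and its placeholder files are made up, and `tar = "internal"` is chosen here only so the example does not depend on a system tar program.

```r
# Build a small folder with placeholder files to bundle.
bundle_dir <- file.path(tempdir(), "tar_demo")
dir.create(file.path(bundle_dir, "demo_outs"), recursive = TRUE, showWarnings = FALSE)
writeLines("AAAC-1", file.path(bundle_dir, "demo_outs", "barcodes.tsv"))
writeLines("GeneA", file.path(bundle_dir, "demo_outs", "features.tsv"))

# Work from inside the folder so the archive stores relative paths.
old_wd <- setwd(bundle_dir)
tar("demo_outs.tar", files = "demo_outs", compression = "none", tar = "internal")
listed <- untar("demo_outs.tar", list = TRUE, tar = "internal") # list contents only
setwd(old_wd)
listed
```

Switching to `compression = "gzip"` (and naming the file .tar.gz) adds the compression that a plain .tar lacks.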
A Shiny App is an interactive web application built using the R programming language. It allows you to turn your R scripts, data visualizations, and statistical models into a web interface, without needing to write HTML, JavaScript, or CSS.
A Shiny App lets scientists, clinicians, and collaborators (even
without coding skills):
- Explore clusters and cell types
- Visualize marker gene expression
- Interact with UMAP or t-SNE plots
- Compare experimental conditions
- Download plots or data tables
It transforms static results into an interactive data exploration tool.
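To give a feel for how little code a basic app needs, here is a minimal sketch of a Shiny app that highlights one cluster on a scatter plot. The per-cell table, its coordinates, and the cluster labels are simulated placeholders, not real data, and this is far simpler than a full single-cell explorer.

```r
library(shiny)

# Toy per-cell table with made-up UMAP coordinates and cluster labels.
set.seed(1)
cells <- data.frame(
  umap_1 = rnorm(60),
  umap_2 = rnorm(60),
  cluster = rep(c("A", "B", "C"), each = 20)
)

# User interface: a dropdown to pick a cluster and a plot area.
ui <- fluidPage(
  titlePanel("Toy cluster explorer"),
  selectInput("cluster", "Cluster", choices = unique(cells$cluster)),
  plotOutput("umap")
)

# Server: redraw the plot whenever the selected cluster changes.
server <- function(input, output, session) {
  output$umap <- renderPlot({
    highlighted <- cells$cluster == input$cluster
    plot(cells$umap_1, cells$umap_2,
      col = ifelse(highlighted, "#990000", "grey70"), pch = 19,
      xlab = "UMAP 1", ylab = "UMAP 2"
    )
  })
}

app <- shinyApp(ui, server)
# Launch interactively with: runApp(app)
```

The `ui` object describes what the user sees and `server` describes how outputs react to inputs; that split is the core idea behind every Shiny app, large or small.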
shiny package
https://ellisonlab.shinyapps.io/dct_shinycell/
https://ellisonlab.shinyapps.io/tal_shinycell/
I hope that you found this session helpful.
Do you have a coding problem that you’d like some support on?
Do you have a topic you’d like covered at a future meeting?
Email me: j.nelson@med.usc.edu
## R version 4.4.3 (2025-02-28 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] openxlsx_4.2.8 patchwork_1.3.0
## [3] cowplot_1.1.3 glmGamPoi_1.18.0
## [5] sctransform_0.4.1 RColorBrewer_1.1-3
## [7] kableExtra_1.4.0 styler_1.10.3
## [9] lubridate_1.9.4 forcats_1.0.0
## [11] stringr_1.5.1 purrr_1.0.4
## [13] readr_2.1.5 tidyr_1.3.1
## [15] tibble_3.2.1 tidyverse_2.0.0
## [17] DESeq2_1.46.0 SummarizedExperiment_1.36.0
## [19] Biobase_2.66.0 MatrixGenerics_1.18.1
## [21] matrixStats_1.5.0 GenomicRanges_1.58.0
## [23] GenomeInfoDb_1.42.3 IRanges_2.40.1
## [25] S4Vectors_0.44.0 BiocGenerics_0.52.0
## [27] EnhancedVolcano_1.24.0 ggrepel_0.9.6
## [29] here_1.0.1 BiocManager_1.30.25
## [31] ggplot2_3.5.1 knitr_1.50
## [33] SeuratObject_5.0.2 Seurat_4.4.0
## [35] dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 splines_4.4.3 later_1.4.1
## [4] R.oo_1.27.0 polyclip_1.10-7 lifecycle_1.0.4
## [7] rprojroot_2.0.4 globals_0.16.3 lattice_0.22-6
## [10] MASS_7.3-64 magrittr_2.0.3 plotly_4.10.4
## [13] sass_0.4.9 rmarkdown_2.29 jquerylib_0.1.4
## [16] yaml_2.3.10 httpuv_1.6.15 zip_2.3.2
## [19] spam_2.11-1 sp_2.2-0 spatstat.sparse_3.1-0
## [22] reticulate_1.41.0.1 pbapply_1.7-2 abind_1.4-8
## [25] zlibbioc_1.52.0 Rtsne_0.17 R.cache_0.16.0
## [28] R.utils_2.13.0 GenomeInfoDbData_1.2.13 irlba_2.3.5.1
## [31] listenv_0.9.1 spatstat.utils_3.1-3 goftest_1.2-3
## [34] spatstat.random_3.3-2 fitdistrplus_1.2-2 parallelly_1.42.0
## [37] svglite_2.1.3 leiden_0.4.3.1 codetools_0.2-20
## [40] DelayedArray_0.32.0 xml2_1.3.8 tidyselect_1.2.1
## [43] UCSC.utils_1.2.0 farver_2.1.2 spatstat.explore_3.3-4
## [46] jsonlite_1.9.1 progressr_0.15.1 ggridges_0.5.6
## [49] survival_3.8-3 systemfonts_1.2.1 tools_4.4.3
## [52] ica_1.0-3 Rcpp_1.0.14 glue_1.8.0
## [55] gridExtra_2.3 SparseArray_1.6.2 xfun_0.51
## [58] withr_3.0.2 fastmap_1.2.0 digest_0.6.37
## [61] timechange_0.3.0 R6_2.6.1 mime_0.12
## [64] colorspace_2.1-1 scattermore_1.2 tensor_1.5
## [67] spatstat.data_3.1-6 R.methodsS3_1.8.2 generics_0.1.3
## [70] data.table_1.17.0 httr_1.4.7 htmlwidgets_1.6.4
## [73] S4Arrays_1.6.0 uwot_0.2.3 pkgconfig_2.0.3
## [76] gtable_0.3.6 lmtest_0.9-40 XVector_0.46.0
## [79] htmltools_0.5.8.1 dotCall64_1.2 scales_1.3.0
## [82] png_0.1-8 spatstat.univar_3.1-2 rstudioapi_0.17.1
## [85] tzdb_0.5.0 reshape2_1.4.4 nlme_3.1-167
## [88] cachem_1.1.0 zoo_1.8-13 KernSmooth_2.23-26
## [91] parallel_4.4.3 miniUI_0.1.1.1 pillar_1.10.1
## [94] grid_4.4.3 vctrs_0.6.5 RANN_2.6.2
## [97] promises_1.3.2 xtable_1.8-4 cluster_2.1.8
## [100] evaluate_1.0.3 cli_3.6.4 locfit_1.5-9.12
## [103] compiler_4.4.3 rlang_1.1.5 crayon_1.5.3
## [106] future.apply_1.11.3 plyr_1.8.9 stringi_1.8.4
## [109] viridisLite_0.4.2 deldir_2.0-4 BiocParallel_1.40.0
## [112] munsell_0.5.1 lazyeval_0.2.2 spatstat.geom_3.3-5
## [115] Matrix_1.7-2 hms_1.1.3 future_1.34.0
## [118] shiny_1.10.0 ROCR_1.0-11 igraph_2.1.4
## [121] bslib_0.9.0