library("dplyr")
library("Seurat")
library("knitr")
library("ggplot2")
library("BiocManager")
library("here")
#BiocManager::install("EnhancedVolcano")
library("EnhancedVolcano") #volcano plot
#install.packages('DESeq2') #for DEG
library("DESeq2")
library("tidyverse") #tidy up data
library("styler") # code styling


if (!require("kableExtra")) {install.packages("kableExtra"); require("kableExtra")} # for table formatting
if (!require("RColorBrewer")) {install.packages("RColorBrewer"); require("RColorBrewer")} # for color brewer
if (!require("sctransform")) {install.packages("sctransform"); require("sctransform")} # for data normalization
if (!require("glmGamPoi")) {BiocManager::install('glmGamPoi'); require("glmGamPoi")} # for data normalization, sctransform
if (!require("cowplot")) {install.packages("cowplot"); require("cowplot")} # for figure layout
if (!require("patchwork")) {install.packages("patchwork"); require("patchwork")} # for figure patching
if (!require("openxlsx")) {install.packages("openxlsx"); require("openxlsx")} # to save .xlsx files

# install.packages("styler")



set.seed(12345)
# here()

1 Welcome

Welcome to the Single-Cell Omics Research and Education Club!

If this is your first time at the club, I want to extend an extra-special welcome to you!

I’m Jonathan Nelson, an Assistant Professor at the University of Southern California. I’m a wet scientist turned wet+dry scientist. I’ve been working with single-cell RNAseq data for the past 5 years and I’m excited to share what I’ve learned with you.

1.1 SCORE Values

1.1.1 Learning

We believe that bioinformatics is a constantly evolving field, and that ongoing learning and professional development are essential to staying up-to-date. We encourage members to share their knowledge and experiences with each other, and to seek out opportunities for continued learning.

1.1.2 Accessibility

We believe that access to bioinformatics support should be available to everyone. We strive to create a welcoming and inclusive environment where all members can feel comfortable asking for help and contributing to the group.

1.1.3 Collaboration

We believe that working together is key to achieving success in bioinformatics. We value the diversity of perspectives and backgrounds that each member brings, and we encourage open communication and the sharing of ideas.

1.1.4 Integrity

We believe in conducting ourselves with honesty and professionalism in all our interactions. We hold ourselves to high ethical standards and respect the privacy and confidentiality of all members.

1.1.5 Empathy

We believe in approaching each other with empathy and kindness. We understand that bioinformatics can be a challenging and sometimes frustrating field, and we strive to support each other through these difficulties.

1.2 Context and Expectations

I know a lot of this has been going on in the background for everyone, and I wanted to bring it to the forefront. My expectation is that we will have about 6 of these meetings together, and then we can re-evaluate whether we want to continue as a group.

Email me if you would like me to add anyone:

Today’s code (this html file) will be posted to the SCORE website (https://usckrc.github.io/website/score.html)

2 The Agenda!

2.1 Music and Memes

2.2 Coding Crumbs: R styling

2.3 Recreating a Figure: Gantt Chart

2.4 Main Theme: Best Practices for Reporting Data

3 Music and Memes

3.1 This Month’s Coding Music

LAUREL: Life Worth Living

https://open.spotify.com/track/1mCvM05OlYWQd77RDxCTLD?si=7d8beb22c2274997

3.2 The Memes!

4 Coding Crumbs: R styling

4.1 And I LOVE clean code

4.2 Confession: I write ugly code

4.3 There is Help

4.4 Clean legible code is important for reproducibility

4.4.1 Hadley Wickham’s Style Guide

http://adv-r.had.co.nz/Style.html

4.4.2 Option 1: Native R

4.4.3 Code > Reformat Code (Ctrl+Shift+A)

4.4.3.1 Before

4.4.3.2 After

4.4.4 Option 2: Styler Package

https://github.com/r-lib/styler?tab=readme-ov-file

install.packages("styler")

https://www.youtube.com/watch?app=desktop&v=yUA3NpJLH6I&t=220s&t=156
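
As a rough sketch of what styler does, the snippet below passes a deliberately messy piece of code to `styler::style_text()`, which returns the restyled code without touching any file. The messy input is invented for illustration; `style_file()` and the RStudio Addins menu apply the same rules to whole scripts.

```r
library(styler)

# Deliberately messy code, invented for this example (one element per line)
messy_code <- c(
  "my_fun=function(x,y){",
  "if(x>y){return(x)}else{return(y)}}"
)

# style_text() returns the restyled code as a character vector;
# it does not modify any file on disk
styled_code <- style_text(messy_code)
print(styled_code)
```

Printing the result in the console is a quick way to preview the changes before running styler on a real script.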

4.4.4.1 Before

4.4.4.2 After

4.4.5 Option 3: ChatGPT

4.4.5.1 Before

4.4.5.2 After

4.5 Take Home

4.5.1 Styling your code is important

4.5.2 But spend your time thinking about the logic of your code

4.5.3 Styler is a great tool to help with minimal effort

4.5.4 ChatGPT is also a great tool to help

4.5.5 I’m going to try to run my code through ChatGPT before publishing it

5 Recreating a Figure: Gantt Chart

5.1 Final Product!

5.3 Great for Project Management

5.4 ggplot2 Lessons for Me

5.4.1 Layering a graph with two different dataframes

5.4.2 Leveling a graph with data from two different dataframes

5.4.3 Setting a color palette

5.5 Starting Place

I wanted to start with one csv file that was flexible to fill out.
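
To make that starting point concrete, here is a minimal sketch of the wide layout that the code later in this section appears to expect: one row per mentor and stage, with the start and end dates in columns 3 and 4. The two rows and the name "Mentor A" are placeholders, not the contents of JWNMentors.csv.

```r
library(tidyr)

# Placeholder rows in the assumed layout: Mentor, Stage, Start, End
mentors_wide <- data.frame(
  Mentor = c("Mentor A", "University of Washington"),
  Stage  = c("Undergraduate", "University of Washington"),
  Start  = c("09/01/2001", "09/01/2001"),
  End    = c("06/15/2005", "06/15/2005")
)

# One row per mentor, stage, and state (Start or End)
mentors_long <- pivot_longer(
  mentors_wide,
  cols = 3:4, names_to = "state", values_to = "date"
)

# Dates are typed as month/day/year in the csv
mentors_long$date <- as.Date(mentors_long$date, format = "%m/%d/%Y")
```

Keeping one row per mentor and stage means a new mentor can be added by typing a single line into the csv.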

5.6 Load Data, Define Institutes vs. Individuals, and Pivot Longer

5.7 Create Institutions Dataframe, Define the Order of Graph + Legend, and Set the Color Palette

5.7.1 Note: Set the same factor on two different dataframes

5.7.2 Note: I asked ChatGPT to help me select the color palette

5.8 ggplot2 code

5.8.1 Note: geom_segment() AND geom_line()

5.8.2 Note: labs(color = "") sets the legend title

5.8.3 Note: scale_y_discrete() sets the order of mentors

5.9 Code to produce the Gantt chart

5.9.1 Yes, I asked ChatGPT to clean up and annotate the code!

library(dplyr)
library(tidyr)
library(ggplot2)
library(here)

# Load data
df <- read.csv(here("Week 3 Reporting Findings", "data", "JWNMentors.csv"))

# Define institution list
institutions <- c(
  "University of Washington",
  "Oregon Health & Science University",
  "University of Southern California"
)

# Categorize mentors
df <- df %>%
  mutate(
    MentorType = ifelse(Mentor %in% institutions, "Institution", "Individual")
  )

# Reshape data
g.gantt <- df %>%
  pivot_longer(cols = 3:4, names_to = "state", values_to = "date") %>%
  mutate(date = as.Date(date, "%m/%d/%Y"))

# Create summarized timeline for institutions
g.gantt_combined <- g.gantt %>%
  filter(MentorType == "Institution") %>%
  group_by(Stage) %>%
  summarise(
    start_date = min(date[state == "Start"]),
    end_date = max(date[state == "End"]),
    Mentor = "Institution"
  ) %>%
  ungroup()

# Define factor levels for Stage
stage_levels <- c(
  "University of Washington",
  "Oregon Health & Science University",
  "University of Southern California",
  "Undergraduate",
  "Graduate School",
  "Postdoc",
  "Junior Faculty"
)

g.gantt$Stage <- factor(g.gantt$Stage, levels = stage_levels)
g.gantt_combined$Stage <- factor(g.gantt_combined$Stage, levels = stage_levels)

# Define y-axis order
desired_order <- rev(c(
  "Institution", "Staffan Bench", "Benjamin Hall", "Nabil Alkayed",
  "Paul Barnes", "Sanjiv Kaul", "David Ellison",
  "Susan Gurley", "Janos Peti-Peterdi"
))

# Define color mapping
desired_colors <- c(
  "University of Washington" = "#32006e",
  "Oregon Health & Science University" = "#575e60",
  "University of Southern California" = "#990000",
  "Undergraduate" = "#C5692E",
  "Graduate School" = "#FEB359",
  "Postdoc" = "#435F90",
  "Junior Faculty" = "#B47E83"
)

# Create plot
ggplot() +
  geom_segment(
    data = g.gantt_combined,
    aes(x = start_date, xend = end_date, y = Mentor, yend = Mentor, color = Stage),
    linewidth = 10 # `size` is deprecated for lines since ggplot2 3.4.0
  ) +
  geom_line(
    data = g.gantt,
    aes(x = date, y = Mentor, color = Stage),
    linewidth = 10 # `size` is deprecated for lines since ggplot2 3.4.0
  ) +
  labs(x = "Year", y = NULL, title = "Mentors", color = "Institution and Career Stage") +
  theme_minimal(base_size = 15) +
  scale_y_discrete(limits = desired_order) +
  theme(
    panel.grid = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 18),
    axis.title = element_text(size = 16),
    legend.title = element_text(size = 16, face = "bold"),
    legend.text = element_text(size = 16),
    axis.text = element_text(size = 14),
    axis.ticks.y = element_blank(),
    axis.line.y = element_blank()
  ) +
  scale_color_manual(values = desired_colors, na.value = "gray")

6 Main Theme: Reporting Data

6.1 Best Practices for Reporting Data for Manuscripts


6.2 Pipeline


6.2.1 Raw Sequencing Files (FASTQ)

  • Raw output from sequencers (e.g., NovaSeq, NextSeq)
  • In sc/snRNA-seq: short reads from cDNA (reverse-transcribed mRNA)

6.2.1.1 Main files



  • FASTQ files are large → compressed with .gz to save space
  • Next: processed by pipelines like Cell Ranger, STARsolo, STAR (bulk RNA-seq), or kallisto | bustools (pseudoalignment), etc.

6.2.2 Preprocessing Files (via Cell Ranger, etc.)

  • FASTQ files → processed with pipelines (e.g., Cell Ranger)
  • Tasks: demultiplex barcodes, deduplicate UMIs, align reads, build gene × cell matrix

6.2.2.1 Output folder structure (Cell Ranger)

Cell Ranger generates several files and folders after processing single-cell RNA-seq data. The key outputs are:
https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/outputs/cr-outputs-gex-overview

Outputs compatible with Seurat, Scanpy, IGV, and downstream tools
Can be used with CellBender, Alevin, or custom workflows


A) Filtered Feature Matrix (typically under outs/filtered_feature_bc_matrix/):


Output file format: Which one should you use, MEX or HDF5 (.h5)?
Both formats contain the same information: a gene × cell expression matrix, with barcodes and features.
The difference is in the file structure and format.
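
To show what "MEX" means in practice, the sketch below writes a tiny made-up matrix as the three MEX pieces (matrix.mtx, barcodes.tsv, features.tsv) and reads it back with the Matrix package. The gene and barcode names are invented, and real Cell Ranger outputs are gzipped (.mtx.gz, .tsv.gz). In a Seurat workflow you would normally call `Read10X()` on the folder or `Read10X_h5()` on the .h5 file rather than assembling the matrix by hand.

```r
library(Matrix)

# Toy gene x cell count matrix (3 genes, 4 cells); all values are made up
counts <- Matrix(
  c(0, 2, 0, 1,
    5, 0, 0, 0,
    0, 0, 3, 4),
  nrow = 3, byrow = TRUE, sparse = TRUE,
  dimnames = list(
    c("GeneA", "GeneB", "GeneC"),
    paste0("CELL", 1:4, "-1")
  )
)

# Write the three MEX pieces into a temporary folder
mex_dir <- file.path(tempdir(), "toy_feature_bc_matrix")
dir.create(mex_dir, showWarnings = FALSE)
writeMM(counts, file.path(mex_dir, "matrix.mtx"))
writeLines(colnames(counts), file.path(mex_dir, "barcodes.tsv"))
features_out <- data.frame(
  id = paste0("ID_", rownames(counts)), # placeholder gene IDs
  name = rownames(counts),
  type = "Gene Expression"
)
write.table(features_out, file.path(mex_dir, "features.tsv"),
  sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE
)

# Read the pieces back and reattach the row and column names
mat <- readMM(file.path(mex_dir, "matrix.mtx"))
barcodes <- readLines(file.path(mex_dir, "barcodes.tsv"))
features <- read.delim(file.path(mex_dir, "features.tsv"), header = FALSE)
mex_counts <- as(mat, "CsparseMatrix")
dimnames(mex_counts) <- list(features$V2, barcodes)
```

The .h5 format stores the same three pieces in a single file, which is easier to move around; MEX keeps them as separate text files that are easy to inspect.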



B) raw_feature_bc_matrix/
- Same structure as filtered_feature_bc_matrix, but includes all barcodes, even empty droplets, damaged cells, and low-quality barcodes

Why is it useful?
- Re-analyze using your own custom filtering
- Identify ambient RNA contamination
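
As a toy illustration of both uses, the sketch below simulates a raw matrix in which most barcodes are empty droplets, keeps barcodes above a UMI cutoff, and uses the discarded barcodes to estimate an ambient RNA profile. The simulated counts and the cutoff of 100 UMIs are assumptions made for this example; this simple threshold is not the cell-calling method Cell Ranger uses, and dedicated tools such as CellBender model ambient RNA far more carefully.

```r
library(Matrix)

set.seed(1)
n_genes <- 50

# Simulated raw matrix: 20 real cells (high counts) + 180 empty droplets (low counts)
cell_counts <- matrix(rpois(n_genes * 20, lambda = 20), nrow = n_genes)
empty_counts <- matrix(rpois(n_genes * 180, lambda = 0.1), nrow = n_genes)
raw <- Matrix(cbind(cell_counts, empty_counts), sparse = TRUE)
colnames(raw) <- paste0("BC", seq_len(ncol(raw)))

# Custom filtering: keep barcodes with at least `min_umis` total counts
min_umis <- 100 # hypothetical cutoff; choose it from your own data
umi_per_barcode <- colSums(raw)
keep <- umi_per_barcode >= min_umis
filtered <- raw[, keep]

# Ambient profile: fraction of counts per gene among the discarded barcodes
ambient_counts <- rowSums(raw[, !keep])
ambient_profile <- ambient_counts / sum(ambient_counts)
```

Neither step is possible with the filtered matrix alone, because the low-count barcodes have already been removed from it.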


C) QC and Summary Reports:
- web_summary.html: Interactive summary with plots (genes per cell, mitochondrial %, etc.), viewable in any browser. Shows:
    • Total number of estimated cells
    • Reads per cell
    • Median genes per cell
    • % mitochondrial gene expression
    • Sequencing saturation
    • Clustering preview (if applicable)
  Useful for checking whether the run worked, how deep the sequencing was, and the general quality of the data (needed for publishing)
- metrics_summary.csv: Tabular summary of sequencing and mapping statistics
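
Because metrics_summary.csv stores its numbers as formatted text (thousands separators and percent signs), it needs a little cleaning before it can go into a QC table. The sketch below writes a made-up one-row file and parses it; the four column names follow the usual Cell Ranger headers, but the values are invented and your file will contain many more columns.

```r
# Write a made-up metrics file so the example is self-contained
metrics_path <- file.path(tempdir(), "metrics_summary.csv")
writeLines(c(
  '"Estimated Number of Cells","Mean Reads per Cell","Median Genes per Cell","Sequencing Saturation"',
  '"8,123","45,210","2,104","62.3%"'
), metrics_path)

# Keep the original column names and read every value as text
metrics <- read.csv(metrics_path, check.names = FALSE, colClasses = "character")

# Strip commas and percent signs; express percentages as fractions
to_number <- function(x) {
  is_pct <- grepl("%$", x)
  value <- as.numeric(gsub("[,%]", "", x))
  ifelse(is_pct, value / 100, value)
}

metrics_num <- vapply(metrics, to_number, numeric(1))
print(metrics_num)
```

Running this over every sample's file and binding the results gives a per-sample QC table suitable for a supplementary table.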





D) BAM (Binary Alignment/Map) Files: possorted_genome_bam.bam and .bam.bai
- BAM files = aligned reads in binary format
- possorted_genome_bam.bam: aligned + unassigned reads (incl. Feature Barcode libs)
- unassigned_alignments.bam / sample_alignments.bam: unassigned vs. assigned reads
- Use: troubleshoot, reconvert to FASTQ, explore in IGV (includes barcodes)


E) cloupe.cloupe:
- .cloupe: binary file for Loupe Cell Browser (by 10X Genomics)
- Allows: cluster exploration, gene expression per cluster, manual annotation, marker analysis
- Optional, but useful for biologists and presentations

6.2.3 Downstream Analysis (Seurat, etc.)

Processed Data Formats:
- .rds (Seurat): Includes count matrix, metadata, UMAP, clusters
- .h5ad (Scanpy): Same as above but in Python format
- metadata.csv: Per-cell information (cluster, % mito, etc.)
- degs.csv: Differentially expressed genes between clusters or conditions
- umap.png, violin_plot.png: Visual outputs from analysis
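
A minimal sketch of exporting these files with base R is shown below. The metadata and DEG tables are tiny invented stand-ins, and the list saved as .rds is a placeholder for a real Seurat object, which `saveRDS()` and `readRDS()` handle the same way.

```r
out_dir <- file.path(tempdir(), "processed")
dir.create(out_dir, showWarnings = FALSE)

# Invented per-cell metadata and DEG table (placeholders for real results)
cell_metadata <- data.frame(
  barcode = c("CELL1-1", "CELL2-1", "CELL3-1"),
  cluster = c("PT", "DCT", "PT"),
  percent_mito = c(1.2, 0.8, 3.5)
)
degs <- data.frame(
  gene = c("GeneA", "GeneB"),
  avg_log2FC = c(1.8, -2.1),
  p_val_adj = c(0.001, 0.03)
)

# Plain-text tables that anyone can open without R
write.csv(cell_metadata, file.path(out_dir, "metadata.csv"), row.names = FALSE)
write.csv(degs, file.path(out_dir, "degs.csv"), row.names = FALSE)

# .rds keeps the full R object; this list stands in for a Seurat object
analysis <- list(metadata = cell_metadata, degs = degs)
saveRDS(analysis, file.path(out_dir, "analysis.rds"))
restored <- readRDS(file.path(out_dir, "analysis.rds"))
```

Sharing both the .rds and the csv tables covers R users who want the whole object and everyone else who only needs the tables.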

6.2.4 Optional(?) Files & Documentation for Project Sharing

These files help others understand, run, and reproduce your project (on GitHub or similar platforms).

6.2.4.1 Projects should include additional files to support:

  • Data Sharing: Make your datasets accessible (e.g., via GEO or public repositories).
  • Reproducibility: Include scripts and environment files to allow others to rerun your analysis exactly.
  • Collaboration: Clear documentation helps others understand and contribute to your project.


6.3 Data: GEO (Gene Expression Omnibus) – Public data sharing platform for genomics

6.3.1 What is GEO?

GEO is a public repository managed by the NCBI (National Center for Biotechnology Information).
It’s used to store and share high-throughput gene expression data, including:
- Microarray
- Bulk RNA-seq
- Single-cell/snRNA-seq
- Multi-modal (CITE-seq, spatial transcriptomics)

  • Journals often require data in public archives like GEO
  • GEO ensures compliance, reproducibility, sharing, and reuse

6.3.2 What’s included in a typical GEO submission?




6.3.3 Submitting a Dataset to GEO

Steps:
1. Create the metadata sheet
2. Organize files and get MD5 checksums:
    • FASTQ, matrix, .rds
3. Upload files via NCBI account
4. Add GEO link and reviewer token to Methods section in your manuscript
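
The MD5 checksums in step 2 can be computed from R with `tools::md5sum()`. The sketch below creates two small placeholder files (the file names are invented) and collects their checksums in a table that can be pasted into the metadata sheet.

```r
upload_dir <- file.path(tempdir(), "geo_upload")
dir.create(upload_dir, showWarnings = FALSE)

# Placeholder files standing in for the real FASTQ, matrix, and .rds files
files <- file.path(upload_dir, c("sample1_matrix.mtx", "sample1_barcodes.tsv"))
writeBin(charToRaw("hello\n"), files[1])
writeBin(charToRaw("world\n"), files[2])

# md5sum() returns one 32-character checksum per file
checksums <- data.frame(
  file_name = basename(files),
  md5 = unname(tools::md5sum(files)),
  stringsAsFactors = FALSE
)

# Save the table next to the files for easy copy and paste
write.csv(checksums, file.path(upload_dir, "md5_checksums.csv"), row.names = FALSE)
print(checksums)
```

GEO uses these checksums to confirm that each file arrived intact, so compute them on the exact files you upload.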

Timeline:
- 🕒 1–3 days: File preparation
- 1–2 days: File upload (slow: ~2–4 MB/s)
- 1–2 days: GEO QC & processing

Tip: GEO staff is very helpful—especially if you make a mistake!



6.3.3.2 Set a release date



6.3.4 Where to find GEO in papers?

https://www.ncbi.nlm.nih.gov/geo/


6.3.5 GEO Structure: The 3 layers




6.3.5.1 .tar files

A .tar file is an archive format that bundles multiple files and folders into a single file, without compressing them.
Think of it like a ZIP file, but without compression by default.
Very common in Linux/Unix systems and in bioinformatics for packaging datasets.
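
Base R can create and unpack .tar files with `tar()` and `untar()` from the utils package. The sketch below bundles a small made-up sample folder into an archive, lists its contents, and extracts it again; the archive name GSE_demo_RAW.tar and the file contents are placeholders. It uses R's internal tar implementation (`tar = "internal"`) so that it does not depend on a system tar program.

```r
# Work inside a temporary folder so the archive holds short relative paths
work_dir <- tempfile("geo_demo_")
dir.create(work_dir)
old_wd <- setwd(work_dir)

# Made-up sample folder with two small files
dir.create("sample1")
writeLines(c("AAAC-1", "AAAG-1"), file.path("sample1", "barcodes.tsv"))
writeLines(c("GeneA", "GeneB"), file.path("sample1", "features.tsv"))

# Bundle the folder into one uncompressed archive
tar("GSE_demo_RAW.tar", files = "sample1", compression = "none", tar = "internal")

# List the contents without extracting, then extract into a new folder
contents <- untar("GSE_demo_RAW.tar", list = TRUE, tar = "internal")
untar("GSE_demo_RAW.tar", exdir = "unpacked", tar = "internal")

setwd(old_wd)
print(contents)
```

Listing the contents first is a quick way to check what a downloaded archive holds before unpacking several gigabytes.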

6.3.5.2 .rds files




6.4 Sharing Code

6.4.1 GitHub – Code, documentation, and collaboration platform


6.4.2 What is GitHub?

GitHub is a version-controlled code hosting platform.
It’s perfect for managing your analysis scripts, notebooks, and documentation.
It’s built on Git, which allows you to track every change you make to a file.

Tips:
- Use GitHub Releases to tag versions (e.g., “v1.0 – preprint version”)
- Add a LICENSE file to clarify how others can use your code
- Add badges (optional) to show R/Python versions, citation info, etc.




6.4.3 “Typical” GitHub Repo Structure for Single-Cell Analysis

6.5 Shiny App

6.5.1 What is a Shiny App?

A Shiny App is an interactive web application built using the R programming language. It allows you to turn your R scripts, data visualizations, and statistical models into a web interface, without needing to write HTML, JavaScript, or CSS.

6.5.2 Why is it useful in single-cell/snRNA-seq analysis?

A Shiny App lets scientists, clinicians, and collaborators (even without coding skills):
- Explore clusters and cell types
- Visualize marker gene expression
- Interact with UMAP or t-SNE plots
- Compare experimental conditions
- Download plots or data tables

It transforms static results into an interactive data exploration tool.

6.5.3 How does it work?

  • Perform your analysis in R (e.g., using Seurat)
  • Build the interface and logic using the shiny package
  • The app runs as a web page—no coding required for end users
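
The three steps above can be sketched as a very small app. The UMAP coordinates and cluster labels below are random placeholders rather than Seurat output, and this is a bare-bones illustration, not how the apps linked below were built.

```r
library(shiny)

# Step 1 stand-in: made-up "UMAP" coordinates with three clusters
set.seed(1)
umap_df <- data.frame(
  UMAP_1 = rnorm(60),
  UMAP_2 = rnorm(60),
  cluster = rep(c("A", "B", "C"), each = 20)
)

# Step 2a: the interface (a dropdown menu and a plot)
ui <- fluidPage(
  titlePanel("Toy UMAP explorer"),
  selectInput("cluster", "Highlight cluster:", choices = unique(umap_df$cluster)),
  plotOutput("umap")
)

# Step 2b: the logic (redraws the plot whenever the dropdown changes)
server <- function(input, output, session) {
  output$umap <- renderPlot({
    highlighted <- umap_df$cluster == input$cluster
    plot(
      umap_df$UMAP_1, umap_df$UMAP_2,
      col = ifelse(highlighted, "firebrick", "grey70"),
      pch = 19, xlab = "UMAP_1", ylab = "UMAP_2"
    )
  })
}

# Step 3: bundle the interface and logic into an app object
app <- shinyApp(ui = ui, server = server)
# runApp(app) # uncomment in an interactive session to open the app in a browser
```

End users only see the dropdown and the plot; all of the R code stays behind the scenes.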

https://ellisonlab.shinyapps.io/dct_shinycell/
https://ellisonlab.shinyapps.io/tal_shinycell/






6.6 Thank you for your attention!!

7 Closing Remarks

I hope that you found this session helpful.

7.1 Would anyone else like to share their experience with reporting omics data?

7.2 Questions?

7.3 Community Questions

Do you have a coding problem that you’d like some support on?
Do you have a topic you’d like covered at a future meeting?

Email me:

7.4 Upcoming Schedule

8 Session Info

sessionInfo()
## R version 4.4.3 (2025-02-28 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] openxlsx_4.2.8              patchwork_1.3.0            
##  [3] cowplot_1.1.3               glmGamPoi_1.18.0           
##  [5] sctransform_0.4.1           RColorBrewer_1.1-3         
##  [7] kableExtra_1.4.0            styler_1.10.3              
##  [9] lubridate_1.9.4             forcats_1.0.0              
## [11] stringr_1.5.1               purrr_1.0.4                
## [13] readr_2.1.5                 tidyr_1.3.1                
## [15] tibble_3.2.1                tidyverse_2.0.0            
## [17] DESeq2_1.46.0               SummarizedExperiment_1.36.0
## [19] Biobase_2.66.0              MatrixGenerics_1.18.1      
## [21] matrixStats_1.5.0           GenomicRanges_1.58.0       
## [23] GenomeInfoDb_1.42.3         IRanges_2.40.1             
## [25] S4Vectors_0.44.0            BiocGenerics_0.52.0        
## [27] EnhancedVolcano_1.24.0      ggrepel_0.9.6              
## [29] here_1.0.1                  BiocManager_1.30.25        
## [31] ggplot2_3.5.1               knitr_1.50                 
## [33] SeuratObject_5.0.2          Seurat_4.4.0               
## [35] dplyr_1.1.4                
## 
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.22        splines_4.4.3           later_1.4.1            
##   [4] R.oo_1.27.0             polyclip_1.10-7         lifecycle_1.0.4        
##   [7] rprojroot_2.0.4         globals_0.16.3          lattice_0.22-6         
##  [10] MASS_7.3-64             magrittr_2.0.3          plotly_4.10.4          
##  [13] sass_0.4.9              rmarkdown_2.29          jquerylib_0.1.4        
##  [16] yaml_2.3.10             httpuv_1.6.15           zip_2.3.2              
##  [19] spam_2.11-1             sp_2.2-0                spatstat.sparse_3.1-0  
##  [22] reticulate_1.41.0.1     pbapply_1.7-2           abind_1.4-8            
##  [25] zlibbioc_1.52.0         Rtsne_0.17              R.cache_0.16.0         
##  [28] R.utils_2.13.0          GenomeInfoDbData_1.2.13 irlba_2.3.5.1          
##  [31] listenv_0.9.1           spatstat.utils_3.1-3    goftest_1.2-3          
##  [34] spatstat.random_3.3-2   fitdistrplus_1.2-2      parallelly_1.42.0      
##  [37] svglite_2.1.3           leiden_0.4.3.1          codetools_0.2-20       
##  [40] DelayedArray_0.32.0     xml2_1.3.8              tidyselect_1.2.1       
##  [43] UCSC.utils_1.2.0        farver_2.1.2            spatstat.explore_3.3-4 
##  [46] jsonlite_1.9.1          progressr_0.15.1        ggridges_0.5.6         
##  [49] survival_3.8-3          systemfonts_1.2.1       tools_4.4.3            
##  [52] ica_1.0-3               Rcpp_1.0.14             glue_1.8.0             
##  [55] gridExtra_2.3           SparseArray_1.6.2       xfun_0.51              
##  [58] withr_3.0.2             fastmap_1.2.0           digest_0.6.37          
##  [61] timechange_0.3.0        R6_2.6.1                mime_0.12              
##  [64] colorspace_2.1-1        scattermore_1.2         tensor_1.5             
##  [67] spatstat.data_3.1-6     R.methodsS3_1.8.2       generics_0.1.3         
##  [70] data.table_1.17.0       httr_1.4.7              htmlwidgets_1.6.4      
##  [73] S4Arrays_1.6.0          uwot_0.2.3              pkgconfig_2.0.3        
##  [76] gtable_0.3.6            lmtest_0.9-40           XVector_0.46.0         
##  [79] htmltools_0.5.8.1       dotCall64_1.2           scales_1.3.0           
##  [82] png_0.1-8               spatstat.univar_3.1-2   rstudioapi_0.17.1      
##  [85] tzdb_0.5.0              reshape2_1.4.4          nlme_3.1-167           
##  [88] cachem_1.1.0            zoo_1.8-13              KernSmooth_2.23-26     
##  [91] parallel_4.4.3          miniUI_0.1.1.1          pillar_1.10.1          
##  [94] grid_4.4.3              vctrs_0.6.5             RANN_2.6.2             
##  [97] promises_1.3.2          xtable_1.8-4            cluster_2.1.8          
## [100] evaluate_1.0.3          cli_3.6.4               locfit_1.5-9.12        
## [103] compiler_4.4.3          rlang_1.1.5             crayon_1.5.3           
## [106] future.apply_1.11.3     plyr_1.8.9              stringi_1.8.4          
## [109] viridisLite_0.4.2       deldir_2.0-4            BiocParallel_1.40.0    
## [112] munsell_0.5.1           lazyeval_0.2.2          spatstat.geom_3.3-5    
## [115] Matrix_1.7-2            hms_1.1.3               future_1.34.0          
## [118] shiny_1.10.0            ROCR_1.0-11             igraph_2.1.4           
## [121] bslib_0.9.0