Genomics

The study of the structure, function, evolution, mapping, and editing of genomes.

Sequence of Expressions

Let $S$ be a DNA sequence of total length $L = |S|$. Let $N_X(S)$ be the count of nucleotide base $X \in \{A, T, C, G\}$ within $S$. The relative frequency $f_X$ of base $X$ is defined as the proportion $$f_X = \frac{N_X(S)}{L}$$ such that $\sum_{X \in \{A, T, C, G\}} f_X = 1$.
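
The frequency definition above can be sketched in a few lines of Python. The function name `base_frequencies` is illustrative, and the sketch assumes the sequence contains only the four bases A, T, C, G (ambiguity codes such as N are rejected rather than guessed at).

```python
from collections import Counter


def base_frequencies(seq: str) -> dict[str, float]:
    """Return f_X = N_X(S) / L for each base X in {A, T, C, G}."""
    seq = seq.upper()
    length = len(seq)
    if length == 0:
        raise ValueError("sequence must be non-empty")
    counts = Counter(seq)
    # Assumption: only A, T, C, G are allowed, so the frequencies sum to 1.
    unexpected = set(counts) - set("ATCG")
    if unexpected:
        raise ValueError(f"unexpected symbols: {sorted(unexpected)}")
    return {base: counts[base] / length for base in "ATCG"}


freqs = base_frequencies("ATGCGCTA")
```

Because every symbol is required to be one of the four bases, the returned values always sum to 1 up to floating-point rounding.
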
Let $S_{ref}$ be the reference sequence and $S_{sample}$ be the sample sequence, both aligned to a common coordinate system. Consider a locus $i$ at position $p$. A Single Nucleotide Polymorphism (SNP) exists at $(p, i)$ if the nucleotide base $B_{sample}(p)$ differs from the reference base $B_{ref}(p)$: $$SNP(p, i) \iff B_{sample}(p) \neq B_{ref}(p)$$ The total number of such differences over the entire genome is the Hamming distance $$d_H(S_{ref}, S_{sample}) = \sum_{p=1}^{L} \mathbb{I}(B_{ref}(p) \neq B_{sample}(p))$$ where $\mathbb{I}(\cdot)$ is the indicator function.
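
As a minimal sketch of the SNP test and the Hamming distance, the following Python assumes the two sequences are already aligned and of equal length. The function names are illustrative, and positions are reported 1-based to match the sum over $p = 1, \dots, L$.

```python
def snp_positions(ref: str, sample: str) -> list[int]:
    """Return the 1-based positions p where B_sample(p) != B_ref(p)."""
    if len(ref) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        p
        for p, (r, s) in enumerate(zip(ref, sample), start=1)
        if r != s
    ]


def hamming_distance(ref: str, sample: str) -> int:
    """Count mismatching positions, i.e. the sum of the indicator function."""
    return len(snp_positions(ref, sample))


positions = snp_positions("ACGTAC", "ACCTAG")
distance = hamming_distance("ACGTAC", "ACCTAG")
```
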
Let $C_{i, j}$ be the copy number of a specific DNA sequence segment $i$ in sample $j$. The Copy Number Variation (CNV) is assessed by comparing the observed copy number $C_{i, j}$ against a baseline or reference copy number $C_{i, ref}$. The deviation is typically quantified using the log-ratio method: $$\log_2(R_{i, j}) = \log_2\left(\frac{C_{i, j}}{C_{i, ref}}\right)$$ where $R_{i, j}$ is the ratio of copy numbers. A CNV is indicated if $\log_2(R_{i, j})$ deviates significantly from zero, suggesting amplification ($\log_2(R_{i, j}) > 0$) or deletion ($\log_2(R_{i, j}) < 0$).
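
The log-ratio test can be illustrated with a short Python sketch. The cutoff `threshold=0.3` is a hypothetical value chosen only for this example; the text does not say what counts as a significant deviation, and real pipelines calibrate it against noise in the data.

```python
import math


def log2_ratio(c_obs: float, c_ref: float) -> float:
    """Return log2(C_ij / C_i,ref)."""
    # A copy number of zero would give -infinity, so it is rejected here.
    if c_obs <= 0 or c_ref <= 0:
        raise ValueError("copy numbers must be positive")
    return math.log2(c_obs / c_ref)


def classify_cnv(c_obs: float, c_ref: float, threshold: float = 0.3) -> str:
    """Label a segment using a hypothetical symmetric cutoff on the log-ratio."""
    ratio = log2_ratio(c_obs, c_ref)
    if ratio > threshold:
        return "amplification"
    if ratio < -threshold:
        return "deletion"
    return "neutral"


label = classify_cnv(4, 2)
```

With a diploid reference of 2 copies, 4 observed copies give a log-ratio of 1 and a single observed copy gives a log-ratio of -1.
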
Let $R = \{r_1, r_2, \dots, r_N\}$ be the set of short reads, where each $r_i$ is a sequence of length $L_i$. Let $G$ be the unknown genome sequence. The goal is to reconstruct $G$ by solving the assembly problem, often modeled via a de Bruijn graph $\mathcal{G} = (V, E)$, where $V$ is the set of k-mers and $E$ represents overlaps between them. The estimated genome size is $|G|$. The ratio of observed reads to the reconstructed genome size is $$\mathcal{A} = \frac{N}{L_{G}}$$ where $N$ is the total number of reads and $L_{G}$ is the length of the reconstructed genome. Note that $\mathcal{A}$ measures read density (reads per base) rather than per-base accuracy; the mean depth of coverage additionally accounts for read length and equals $\sum_{i=1}^{N} L_i / L_G$.
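
A small Python sketch makes the ratio $N / L_G$ concrete. The second function, `depth_of_coverage`, is an addition not present in the text: it is the standard total-bases-over-genome-length quantity, included here on the assumption that it is a useful companion because it also uses the read lengths $L_i$.

```python
def read_density(num_reads: int, genome_length: int) -> float:
    """Return N / L_G, the number of reads per base of the assembled genome."""
    if genome_length <= 0:
        raise ValueError("genome length must be positive")
    return num_reads / genome_length


def depth_of_coverage(read_lengths: list[int], genome_length: int) -> float:
    """Return sum(L_i) / L_G (assumed companion metric, not from the text)."""
    if genome_length <= 0:
        raise ValueError("genome length must be positive")
    return sum(read_lengths) / genome_length


density = read_density(1000, 5000)
coverage = depth_of_coverage([100] * 1000, 5000)
```
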
Let $\mathbf{C}$ be the raw count matrix, where $C_{g, s}$ is the raw read count for gene $g$ in sample $s$. To normalize for sequencing depth and gene length, the Transcripts Per Million (TPM) value for gene $g$ in sample $s$, denoted $T_{g, s}$, is calculated as: $$T_{g, s} = \frac{C_{g, s} / L_g}{\sum_{k} (C_{k, s} / L_k)} \times 10^6$$ where $L_g$ is the length of gene $g$ in base pairs, and the denominator $\sum_{k} (C_{k, s} / L_k)$ represents the total length-normalized count for sample $s$.
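
The TPM formula for a single sample can be sketched as follows. The function name `tpm` and the list-based inputs are illustrative choices; a real pipeline would typically operate on the whole count matrix at once.

```python
def tpm(counts: list[float], lengths: list[float]) -> list[float]:
    """Return T_g = (C_g / L_g) / sum_k(C_k / L_k) * 1e6 for one sample."""
    if len(counts) != len(lengths):
        raise ValueError("counts and lengths must have the same size")
    # Length-normalized counts C_g / L_g.
    rates = [c / length for c, length in zip(counts, lengths)]
    total = sum(rates)
    if total == 0:
        raise ValueError("sample has no reads")
    return [rate / total * 1e6 for rate in rates]


values = tpm([10, 20, 30], [1000, 2000, 3000])
```

In this example all three genes have the same count-to-length ratio, so each receives one third of a million. By construction the TPM values of a sample always sum to $10^6$.
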
Let $R = \{r_1, r_2, \dots, r_m\}$ be the set of short read fragments, where each $r_i$ is a string of length $k$. Construct the De Bruijn graph $G = (V, E)$ where the nodes $V$ represent all unique $(k-1)$-mers (substrings of length $k-1$) found in $R$, and an edge $(u, v) \in E$ exists if the last $k-2$ characters of $u$ equal the first $k-2$ characters of $v$, so that each edge corresponds to one $k$-mer read. The genome sequence $S$ is sought as an Eulerian path or cycle $\mathcal{P} = (v_1, e_1, v_2, e_2, \dots, v_L)$ in $G$, that is, a walk that traverses every edge corresponding to an observed read exactly once, which maximizes the coverage of the input reads $R$ and minimizes the number of unassigned reads.
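
The construction above can be sketched in Python as a small De Bruijn graph builder followed by Hierholzer's algorithm for the Eulerian path. This is a minimal illustration under strong assumptions: the reads are error-free $k$-mers, an Eulerian path exists, and the function names are hypothetical. With repeated $(k-1)$-mers several Eulerian paths can exist, so the reconstruction is not unique in general.

```python
from collections import defaultdict


def de_bruijn(kmers: list[str]) -> dict[str, list[str]]:
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes it connects to."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])  # one edge per k-mer read
    return dict(graph)


def eulerian_path(graph: dict[str, list[str]]) -> list[str]:
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start where out-degree exceeds in-degree by one, else at any node.
    start = next(
        (u for u, vs in graph.items() if len(vs) - in_deg[u] == 1),
        next(iter(graph)),
    )
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: emit node
    return path[::-1]


def spell(path: list[str]) -> str:
    """Concatenate the first node with the last character of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["GCG", "ATG", "TAC", "TGC", "GTA", "CGT"]
assembled = spell(eulerian_path(de_bruijn(reads)))
```

The example reads are the 3-mers of `ATGCGTAC` in shuffled order. Because no 2-mer repeats in that sequence, the graph is a simple chain and the path is unique.
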
Let $S_A$ and $S_B$ be two genomic sequences. An alignment $\mathcal{A}$ is a pair of sequences $(S'_A, S'_B)$ of equal length $L'$, derived from $S_A$ and $S_B$ by introducing gaps. The objective is to find the optimal alignment $\mathcal{A}^*$ that maximizes the total score $Score(\mathcal{A})$: $$Score(\mathcal{A}) = \sum_{i=1}^{L'} \left(W(S'_A[i], S'_B[i]) - G(S'_A[i], S'_B[i])\right)$$ where $W(\cdot, \cdot)$ is the substitution matrix score for a pair of aligned characters (e.g., BLOSUM for proteins, or a match/mismatch matrix for nucleotides), and $G(\cdot, \cdot)$ is the gap penalty function, defined here as $G(a, b) = \max(g_{open}, g_{extend}) \cdot \delta_{gap}$, where $\delta_{gap}$ is 1 if either $a$ or $b$ is a gap character and 0 otherwise. This maximization is solved using dynamic programming: the Needleman-Wunsch algorithm for global alignment or the Smith-Waterman algorithm for local alignment.
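
To illustrate the dynamic programming step, here is a minimal Needleman-Wunsch sketch in Python. It simplifies the scoring described above: $W$ is reduced to a match/mismatch score and the gap penalty is a single constant per gap column, so `match`, `mismatch`, and `gap` are assumed parameters rather than values from the text.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = 2) -> tuple[int, str, str]:
    """Global alignment with a constant per-column gap penalty (assumption)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap * i
    for j in range(1, m + 1):
        score[0][j] = -gap * j

    def w(x: str, y: str) -> int:
        return match if x == y else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + w(a[i - 1], b[j - 1]),  # align pair
                score[i - 1][j] - gap,                        # gap in b
                score[i][j - 1] - gap,                        # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and score[i][j] == score[i - 1][j - 1] + w(a[i - 1], b[j - 1])):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] - gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, aligned_a, aligned_b = needleman_wunsch("ACGT", "AGT")
```

Smith-Waterman differs mainly in clamping each cell at zero and starting the traceback from the highest-scoring cell, which yields a local rather than global alignment.
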
Consider a set of $M$ genetic variants (SNPs) $\mathbf{G} = \{G_1, \dots, G_M\}$ and a quantitative trait $\mathbf{Y} = (Y_1, \dots, Y_N)$ measured across $N$ individuals. The association test for a single variant $G_j$ is formulated as a linear regression model: $$Y_i = \beta_0 + \beta_j G_{ij} + \boldsymbol{\beta}_{other} \boldsymbol{X}_i + \boldsymbol{\tau}_i + \boldsymbol{\theta}_i \boldsymbol{\beta}_{cov} + \boldsymbol{\tau}_i \boldsymbol{\theta}_i + \varepsilon_i$$ where $G_{ij}$ is the genotype of individual $i$ at variant $j$, $\beta_j$ is the effect size, $\boldsymbol{X}_i$ are covariates, $\boldsymbol{\tau}_i$ and $\boldsymbol{\theta}_i$ represent population structure (e.g., principal components), and $\varepsilon_i$ is the residual error. The test statistic is the Wald statistic $Z_j = \hat{\beta}_j / \mathrm{SE}(\hat{\beta}_j)$, and significance is determined by the two-sided p-value $p_j = P(|Z_j| \ge |Z_{obs}|)$, where the distribution of $Z_j$ is taken under the null hypothesis $H_0: \beta_j = 0$.
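
The single-variant test can be sketched with ordinary least squares in plain Python. This is a deliberately reduced illustration: covariates and population-structure terms are omitted, genotypes are coded as 0, 1, or 2 allele copies (an assumption), and the p-value uses a normal approximation, which is optimistic for small $N$ where a t-distribution would be more appropriate.

```python
import math


def wald_test(genotypes: list[float], trait: list[float]) -> tuple[float, float, float]:
    """Fit Y = b0 + b*G by least squares; return (beta, z, two-sided p)."""
    n = len(genotypes)
    if n != len(trait) or n < 3:
        raise ValueError("need matching inputs with at least 3 individuals")
    mean_g = sum(genotypes) / n
    mean_y = sum(trait) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, trait))
    beta = sxy / sxx                      # estimated effect size
    beta0 = mean_y - beta * mean_g        # intercept
    rss = sum((y - beta0 - beta * g) ** 2 for g, y in zip(genotypes, trait))
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of beta
    z = beta / se                         # Wald statistic
    # Two-sided tail probability of a standard normal (approximation).
    p = math.erfc(abs(z) / math.sqrt(2))
    return beta, z, p


beta, z, p = wald_test([0, 0, 1, 1, 2, 2], [1.0, 1.2, 2.1, 1.9, 3.0, 3.2])
```

In a genome-wide scan this test is repeated for each of the $M$ variants, so the resulting p-values need a multiple-testing correction before interpretation.
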
Let $T$ be the telomeric repeat sequence, defined by a repeating unit $U$ of length $L_U$, such that $T = U^n U'$, where $n$ is the number of repeats and $U'$ is a partial repeat. The structure is stabilized by specific binding proteins, including those interacting with the $\alpha$-element. The stability and length maintenance are governed by the telomerase activity, modeled by the reaction rate equation: $$\frac{d[T]}{dt} = k_{add} [T] [Telomerase] - k_{loss} [T]$$ where $k_{add}$ is the rate of addition of the repeat unit $U$ (dependent on $\alpha$ binding affinity) and $k_{loss}$ is the rate of degradation. The critical element $\alpha$ influences the binding free energy $\Delta G_{bind}$ such that $k_{add} \propto e^{-\Delta G_{bind}/RT}$.
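
If the telomerase concentration is treated as constant (an assumption not stated in the text), the rate equation is linear in $[T]$ and has the closed-form solution $[T](t) = [T]_0 \, e^{(k_{add}[Telomerase] - k_{loss})\,t}$. The Python sketch below evaluates that solution; the function names and the proportionality constant `prefactor` are hypothetical.

```python
import math


def telomere_level(t0: float, k_add: float, k_loss: float,
                   telomerase: float, time: float) -> float:
    """Closed-form [T](t), assuming a constant telomerase concentration."""
    net_rate = k_add * telomerase - k_loss
    return t0 * math.exp(net_rate * time)


def k_add_from_binding(prefactor: float, delta_g: float,
                       temperature: float, gas_constant: float = 8.314) -> float:
    """k_add = prefactor * exp(-dG / (R T)); prefactor is a hypothetical constant.

    delta_g is in J/mol and temperature in K to match R = 8.314 J/(mol K).
    """
    return prefactor * math.exp(-delta_g / (gas_constant * temperature))


steady = telomere_level(10.0, k_add=0.5, k_loss=0.5, telomerase=1.0, time=3.0)
```

The sign of $k_{add}[Telomerase] - k_{loss}$ decides the behavior: positive gives growth, negative gives shortening, and zero gives the steady state used in the example.
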
Let $S_1, S_2, \dots, S_N$ be the aligned sequences of $N$ species, each of length $L$. Define the pairwise distance $d(S_i, S_j)$ using a substitution model $\mathcal{M}$ (e.g., Jukes-Cantor or Kimura 2-parameter) based on the observed differences $D_{ij}$ at each site $l$: $$d(S_i, S_j) = \frac{1}{L} \sum_{l=1}^{L} \left(1 - \frac{1}{2} \sum_{c \in \{A, T, C, G\}} p_{c}(l) \right)$$ where $p_{c}(l)$ is the probability of character $c$ at site $l$ under model $\mathcal{M}$. The phylogenetic tree $\mathcal{T}$ can then be inferred either from the distance matrix directly (e.g., Neighbor-Joining) or by Maximum Likelihood, which maximizes the likelihood function; assuming sites evolve independently, this factorizes over alignment columns: $$\mathcal{L}(\mathcal{T} \mid S_1, \dots, S_N) = \prod_{l=1}^{L} \mathcal{L}(S_1[l], \dots, S_N[l] \mid \mathcal{T})$$
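
As a concrete instance of a model-based pairwise distance, the sketch below uses the standard Jukes-Cantor correction $d = -\tfrac{3}{4} \ln\left(1 - \tfrac{4}{3} p\right)$, where $p$ is the proportion of differing sites. This is a swapped-in textbook formula, not the per-site expression given above, and the function names are illustrative.

```python
import math


def p_distance(seq_i: str, seq_j: str) -> float:
    """Proportion of aligned sites at which the two sequences differ."""
    if len(seq_i) != len(seq_j) or not seq_i:
        raise ValueError("sequences must be aligned and non-empty")
    diffs = sum(1 for a, b in zip(seq_i, seq_j) if a != b)
    return diffs / len(seq_i)


def jukes_cantor_distance(seq_i: str, seq_j: str) -> float:
    """Jukes-Cantor corrected distance in expected substitutions per site."""
    p = p_distance(seq_i, seq_j)
    # The correction is undefined once p reaches the saturation value 3/4.
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor model")
    return -0.75 * math.log(1 - 4 * p / 3)


d = jukes_cantor_distance("ACGTACGT", "ACGTACTA")
```

The corrected distance exceeds the raw proportion of differences because it accounts for multiple substitutions at the same site. A matrix of such distances is the input that Neighbor-Joining expects.
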