SQUARE45

Definition

Base Pair Frequency

Let

S

be a DNA sequence of total length

L = |S|

. Let

N_X(S)

be the count of nucleotide base

X \in \{A, T, C, G\}

within

S

. The relative frequency

f_X

of base

X

is defined as the proportion:

f_X = \frac{N_X(S)}{L}

such that

\sum_{X \in \{A, T, C, G\}} f_X = 1

.
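The definition above can be sketched in a few lines of Python. The function name `base_frequencies` is hypothetical, and the sketch assumes the sequence contains only the four bases; ambiguity codes such as N would make the frequencies sum to less than 1.

```python
from collections import Counter


def base_frequencies(seq: str) -> dict[str, float]:
    """Return f_X = N_X(S) / L for each base X in {A, T, C, G}."""
    seq = seq.upper()
    length = len(seq)
    if length == 0:
        raise ValueError("sequence must be non-empty")
    counts = Counter(seq)  # N_X(S) for every character in S
    return {base: counts[base] / length for base in "ATCG"}


freqs = base_frequencies("ATGCGCGA")
print(freqs)
print(sum(freqs.values()))
```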

Definition

Single Nucleotide Polymorphism (SNP)

Let

S_{ref}

be the reference sequence and

S_{sample}

be the sample sequence, both aligned to a common coordinate system. Consider a locus

i

at position

p

. A Single Nucleotide Polymorphism (SNP) exists at

(p, i)

if the nucleotide base

B_{sample}(p)

differs from the reference base

B_{ref}(p)

:

SNP(p, i) \iff B_{sample}(p) \neq B_{ref}(p)

This difference is quantified by the Hamming distance

d_H(S_{ref}, S_{sample})

over the entire genome, where

d_H(S_{ref}, S_{sample}) = \sum_{p=1}^{L} \mathbb{I}(B_{ref}(p) \neq B_{sample}(p))

, and

\mathbb{I}(\cdot)

is the indicator function.
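As an illustration, the following sketch lists the positions where two aligned sequences differ and counts them to obtain the Hamming distance. The function names are hypothetical, positions are 1-based to match the sum from p = 1 to L, and the sequences are assumed to be gap-free and of equal length.

```python
def snp_positions(ref: str, sample: str) -> list[int]:
    """Return the 1-based positions p where B_sample(p) != B_ref(p)."""
    if len(ref) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [p for p, (r, s) in enumerate(zip(ref, sample), start=1) if r != s]


def hamming_distance(ref: str, sample: str) -> int:
    """Sum of the indicator function over all positions."""
    return len(snp_positions(ref, sample))


print(snp_positions("ACGTAC", "ACCTAA"))
print(hamming_distance("ACGTAC", "ACCTAA"))
```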

Definition

Copy Number Variation (CNV)

Let

C_{i, j}

be the copy number of a specific DNA sequence segment

i

in sample

j

. The Copy Number Variation (CNV) is assessed by comparing the observed copy number

C_{i, j}

against a baseline or reference copy number

C_{i, ref}

. The deviation is typically quantified using the log-ratio method:

\log_2(R_{i, j}) = \log_2\left(\frac{C_{i, j}}{C_{i, ref}}\right)

where

R_{i, j}

is the ratio of copy numbers. A CNV is indicated if

\log_2(R_{i, j})

deviates significantly from zero, suggesting amplification (

\log_2(R_{i, j}) > 0

) or deletion (

\log_2(R_{i, j}) < 0

).
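A minimal sketch of the log-ratio computation follows. The function names and the cutoff `threshold=0.3` are assumptions made for illustration; the text only says the ratio must deviate "significantly" from zero and does not fix a threshold.

```python
import math


def cnv_log2_ratio(copy_number: float, reference_copy_number: float) -> float:
    """Return log2(C_ij / C_i,ref)."""
    if copy_number <= 0 or reference_copy_number <= 0:
        raise ValueError("copy numbers must be positive for the log-ratio")
    return math.log2(copy_number / reference_copy_number)


def classify_cnv(log2_ratio: float, threshold: float = 0.3) -> str:
    """Label a segment; the threshold is a hypothetical cutoff."""
    if log2_ratio > threshold:
        return "amplification"
    if log2_ratio < -threshold:
        return "deletion"
    return "neutral"


for observed in (4, 2, 1):
    ratio = cnv_log2_ratio(observed, 2)
    print(observed, ratio, classify_cnv(ratio))
```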

Theorem

Genome Sequencing

Let

R = \{r_1, r_2, ..., r_N\}

be the set of short reads, where each

r_i

is a sequence of length

L_i

. Let

G

be the unknown reference genome sequence. The goal is to reconstruct

G

by solving the assembly problem, often modeled via a de Bruijn graph

\mathcal{G} = (V, E)

, where

V

are k-mers and

E

represents overlaps. The estimated genome size is

|G|

. The average sequencing depth (coverage)

\mathcal{A}

is defined by the ratio of the total number of sequenced bases to the genome size:

\mathcal{A} = \frac{\sum_{i=1}^{N} L_i}{L_{G}}

where

N

is the total number of reads,

L_i

is the length of read

r_i

, and

L_{G}

is the length of the reconstructed genome.

Theorem

Gene Expression Quantification

Let

\mathbf{C}

be the raw count matrix, where

C_{g, s}

is the raw read count for gene

g

in sample

s

. To normalize for sequencing depth and gene length, the Transcripts Per Million (TPM) value for gene

g

in sample

s

, denoted

T_{g, s}

, is calculated as:

T_{g, s} = \frac{C_{g, s} / L_g}{\sum_{k} (C_{k, s} / L_k)} \times 10^6

where

L_g

is the length of gene

g

in base pairs, and the denominator

\sum_{k} (C_{k, s} / L_k)

represents the total normalized count for sample

s

.
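The TPM formula can be sketched for a single sample as below. The function name `tpm` and the dictionary-based layout (gene name to count, gene name to length) are assumptions for illustration, not part of the definition.

```python
def tpm(counts: dict[str, float], lengths: dict[str, float]) -> dict[str, float]:
    """Compute T_g = (C_g / L_g) / sum_k (C_k / L_k) * 1e6 for one sample."""
    # Length-normalized rate C_g / L_g for every gene.
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


values = tpm(
    {"geneA": 100, "geneB": 300, "geneC": 0},
    {"geneA": 1000, "geneB": 3000, "geneC": 500},
)
print(values)
```

In this example geneB has three times the reads of geneA but is also three times as long, so both receive the same TPM, and the values sum to one million by construction.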

Theorem

Genome Assembly

Let

R = \{r_1, r_2, \dots, r_m\}

be the set of short read fragments, where each

r_i

is a string of length

k

. Construct the De Bruijn graph

G = (V, E)

where the nodes

V

represent all unique

(k-1)

-mers (k-1 length substrings) found in

R

, and an edge

(u, v) \in E

exists if the string

u

overlaps with

v

by

k-2

characters. The genome sequence

S

is sought as an Eulerian path or cycle

\mathcal{P} = (v_1, e_1, v_2, e_2, \dots, v_L)

in

G

that traverses every edge exactly once. Its length

L

is therefore fixed by the number of edges, and the path accounts for all input reads

R

. When no such path exists, one instead seeks a path that minimizes the number of unassigned reads.
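The construction can be sketched as follows, assuming error-free k-mers that admit an Eulerian path. Each k-mer contributes one edge from its (k-1)-prefix to its (k-1)-suffix, and the path is found with Hierholzer's algorithm, a standard choice that the text does not itself prescribe. All function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(kmers: list[str]) -> dict[str, list[str]]:
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes it connects to."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph: dict[str, list[str]]) -> list[str]:
    """Hierholzer's algorithm; assumes an Eulerian path or cycle exists."""
    if not graph:
        return []
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start where out-degree exceeds in-degree by one, else anywhere (cycle).
    start = next(iter(graph))
    for node, targets in graph.items():
        if len(targets) - in_deg[node] == 1:
            start = node
            break
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: emit node
    return path[::-1]


def assemble(kmers: list[str]) -> str:
    """Spell the sequence along the Eulerian path."""
    path = eulerian_path(de_bruijn_graph(kmers))
    if not path:
        return ""
    return path[0] + "".join(node[-1] for node in path[1:])


print(assemble(["GTA", "ATG", "CGT", "TGC", "GCG"]))
```

When the graph contains repeated (k-1)-mers, several Eulerian paths may exist, and the sketch returns only one of them.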

Theorem

Comparative Genomics

Let

S_A

and

S_B

be two genomic sequences. An alignment

\mathcal{A}

is a pair of sequences

(S'_A, S'_B)

of equal length

L'

, derived from

S_A

and

S_B

by introducing gaps. The objective is to find the optimal alignment

\mathcal{A}^*

that maximizes the total score

Score(\mathcal{A})

:

Score(\mathcal{A}) = \sum_{i=1}^{L'} (W(S'_A[i], S'_B[i]) - G(S'_A[i], S'_B[i]))

where

W(\cdot, \cdot)

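is the substitution matrix score (e.g., BLOSUM for proteins, or a match/mismatch matrix for nucleotides) for aligned characters, and

G(\cdot, \cdot)

is the gap penalty function. With

\delta_{gap}(a, b)

equal to 1 if either character is a gap and 0 otherwise, one common affine convention charges the opening position of a gap and each subsequent position differently:

G(a, b) = \delta_{gap}(a, b) \cdot \begin{cases} g_{open} & \text{if the gap opens at this position} \\ g_{extend} & \text{if it extends an existing gap} \end{cases}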

. This maximization is solved using dynamic programming (e.g., Needleman-Wunsch or Smith-Waterman algorithms).
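A minimal Needleman-Wunsch sketch is given below. It makes simplifying assumptions that are not in the text: a match/mismatch score stands in for the substitution matrix, and the gap penalty is linear (a constant `gap` per gap position) rather than split into open and extend costs. The function name and default parameter values are hypothetical.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = 2) -> tuple[int, str, str]:
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    def sub(x: str, y: str) -> int:
        return match if x == y else mismatch

    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap * i  # prefix of a aligned to gaps only
    for j in range(1, m + 1):
        score[0][j] = -gap * j  # prefix of b aligned to gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # align pair
                score[i - 1][j] - gap,                          # gap in b
                score[i][j - 1] - gap,                          # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] - gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

Smith-Waterman differs mainly in clamping each cell at zero and tracing back from the highest-scoring cell, which yields a local rather than global alignment.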

Theorem

Genome-Wide Association Study (GWAS)

Consider a set of

M

genetic variants (SNPs)

\mathbf{G} = \{G_1, \dots, G_M\}

and a quantitative trait

\mathbf{Y} = (Y_1, \dots, Y_N)

measured across

N

individuals. The association test for a single variant

G_j

is formulated as a linear regression model:

Y_i = \beta_0 + \beta_j G_{ij} + \boldsymbol{\beta}_{other}^{\top} \boldsymbol{X}_i + \boldsymbol{\beta}_{cov}^{\top} \boldsymbol{\theta}_i + \tau_i + \varepsilon_i

where

\beta_j

is the effect size,

\boldsymbol{X}_i

are covariates,

\tau_i

and

\boldsymbol{\theta}_i

represent population structure (e.g., principal components), and

\varepsilon_i

is the residual error. The test statistic is the Wald statistic, the estimate of the effect size divided by its standard error, and the significance is determined by the p-value

p_j = P(|Z_j| \ge |Z_{obs}|)

, where

Z_j

is the standardized estimate of

\beta_j

under the null hypothesis

H_0: \beta_j = 0

.
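The single-variant test can be sketched in a stripped-down form using only the standard library. This sketch deliberately omits the covariate and population-structure terms of the model, and it approximates the null distribution of the standardized estimate by a standard normal; both are simplifying assumptions. The function name `wald_test` and the toy data are hypothetical.

```python
import math


def wald_test(genotypes: list[float], trait: list[float]) -> tuple[float, float, float]:
    """Fit Y = b0 + b*G by least squares; return (beta, z, two-sided p)."""
    n = len(genotypes)
    if n != len(trait) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_g = sum(genotypes) / n
    mean_y = sum(trait) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("variant is monomorphic in this sample")
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, trait))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_g
    rss = sum((y - intercept - beta * g) ** 2 for g, y in zip(genotypes, trait))
    se = math.sqrt(rss / (n - 2) / sxx)  # standard error of beta
    if se == 0:
        raise ValueError("perfect fit: standard error is zero")
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # P(|Z| >= |z|) for Z ~ N(0, 1)
    return beta, z, p


# Toy data: genotype coded as allele count 0, 1 or 2.
beta, z, p = wald_test([0, 0, 1, 1, 2, 2], [1.0, 1.2, 1.9, 2.1, 3.1, 2.9])
print(beta, z, p)
```

In a genome-wide scan this test is repeated for each of the M variants, so the resulting p-values need a multiple-testing correction before being called significant.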

Theorem

Telomeric Repeat

Let

T

be the telomeric repeat sequence, defined by a repeating unit

U

of length

L_U

, such that

T = U^n U'

, where

n

is the number of repeats and

U'

is a partial repeat. The structure is stabilized by specific binding proteins, including those interacting with the

\alpha

-element. The stability and length maintenance are governed by the telomerase activity, modeled by the reaction rate equation:

\frac{d[T]}{dt} = k_{add} [T] [Telomerase] - k_{loss} [T]

where

k_{add}

is the rate of addition of the repeat unit

U

(dependent on

\alpha

binding affinity) and

k_{loss}

is the rate of degradation. The critical element

\alpha

influences the binding free energy

\Delta G_{bind}

such that

k_{add} \propto e^{-\Delta G_{bind}/RT}

.
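To make the rate equation concrete, the sketch below assumes the telomerase concentration is constant in time, which the text does not state. Under that assumption the equation is linear in [T] and has the closed-form solution [T](t) = [T](0) exp((k_add [Telomerase] - k_loss) t), so telomere length is maintained exactly when k_add [Telomerase] = k_loss. The code compares this closed form with a simple forward-Euler integration; all names and parameter values are hypothetical.

```python
import math


def telomere_closed_form(t0: float, k_add: float, telomerase: float,
                         k_loss: float, t: float) -> float:
    """Exact solution for constant telomerase concentration."""
    return t0 * math.exp((k_add * telomerase - k_loss) * t)


def telomere_euler(t0: float, k_add: float, telomerase: float,
                   k_loss: float, t_end: float, steps: int = 10000) -> float:
    """Forward-Euler integration of d[T]/dt = k_add*[T]*[E] - k_loss*[T]."""
    dt = t_end / steps
    level = t0
    for _ in range(steps):
        level += dt * (k_add * level * telomerase - k_loss * level)
    return level


# Net growth rate here is 0.5 * 2.0 - 0.8 = 0.2 per unit time.
print(telomere_closed_form(1.0, 0.5, 2.0, 0.8, 1.0))
print(telomere_euler(1.0, 0.5, 2.0, 0.8, 1.0))
```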

Principle

Phylogenetic Analysis

Let

S_1, S_2, \dots, S_N

be the sequences of

N

species, each of length

L

. Define the pairwise distance

d(S_i, S_j)

using a substitution model

\mathcal{M}

(e.g., Jukes-Cantor or Kimura 2-parameter) based on the observed differences

D_{ij}

at each site

l

:

d(S_i, S_j) = -\frac{3}{4} \ln\left(1 - \frac{4}{3} \cdot \frac{1}{L} \sum_{l=1}^{L} D_{ij}(l) \right)

where

D_{ij}(l)

equals 1 if the two sequences differ at site

l

and 0 otherwise. This is the form the distance takes when the Jukes-Cantor model is chosen as

\mathcal{M}

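. The phylogenetic tree

\mathcal{T}

is then inferred either from the pairwise distances (e.g., Neighbor-Joining) or by Maximum Likelihood, that is, by maximizing the likelihood function, which under the assumption of independent sites factorizes over alignment columns:

\mathcal{L}(\mathcal{T} | S_1, \dots, S_N) = \prod_{l=1}^{L} P(S_1[l], \dots, S_N[l] \mid \mathcal{T}, \mathcal{M})

.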
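For the Jukes-Cantor model named above, the standard corrected distance is d = -(3/4) ln(1 - (4/3) p), where p is the observed proportion of differing sites. The sketch below computes it for two aligned, gap-free sequences of equal length; the function names are hypothetical.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Observed proportion of sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


def jukes_cantor_distance(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor corrected distance (expected substitutions per site)."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        raise ValueError("sequences are saturated; distance is undefined")
    return -0.75 * math.log(1 - 4 * p / 3)


print(p_distance("ACGTACGTAC", "ACGTACGTAA"))
print(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA"))
```

The corrected distance is always at least as large as the raw proportion p, because it accounts for multiple substitutions at the same site. A matrix of such distances is the usual input to Neighbor-Joining.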

Sequence of Expressions