SQUARE45

Definition

Schema-less Data Model

Let

\mathcal{D}

be the set of all data records. A record

d \in \mathcal{D}

is defined as a partial map

d: K \rightharpoonup V

, where

K

is the set of keys and

V

is the set of values. The schema-less nature implies that for any two records

d_1, d_2 \in \mathcal{D}

, the domain of keys

K_1 = \text{dom}(d_1)

and

K_2 = \text{dom}(d_2)

are not constrained to be equal, i.e.,

K_1 \neq K_2

is permitted, provided that

d_1

and

d_2

are both valid instances of the data type

\mathcal{T}

.
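As a minimal sketch of this definition, the snippet below models records as Python dictionaries and compares their key domains. The record contents and the helper name dom are illustrative assumptions, not part of the definition.

```python
from typing import Any, Dict

Record = Dict[str, Any]  # a record d: a finite map from keys to values

# Two hypothetical records of the same collection with different key sets.
d1: Record = {"id": 1, "name": "Ada", "email": "ada@example.com"}
d2: Record = {"id": 2, "name": "Grace", "tags": ["navy", "cobol"]}


def dom(d: Record) -> frozenset:
    """Return dom(d), the set of keys on which the record is defined."""
    return frozenset(d.keys())


K1, K2 = dom(d1), dom(d2)
shared_keys = K1 & K2   # keys defined by both records
only_in_d1 = K1 - K2    # keys that d2 does not define
```

Both records are accepted even though their domains differ. A reader of the collection therefore has to treat every key outside the shared set as optional.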

Definition

JSON Data Format

Define the set of valid JSON structures

\mathcal{J}

recursively over the following types: \begin{itemize} \item

\text{Primitive}

: The set of atomic values

\mathbb{V} = \{\text{null}, \text{boolean}, \text{number}, \text{string}\}

. \item

\text{Array}

: A finite ordered sequence of values,

\text{Array} = \langle v_1, v_2, \dots, v_k \rangle

, where

v_i \in \mathcal{J}

. \item

\text{Object}

: An unordered mapping from keys (strings) to values,

\text{Object} = \{k_1: v_1, k_2: v_2, \dots, k_m: v_m\}

, where

k_i \in \text{string}

and

v_i \in \mathcal{J}

. \end{itemize} The set

\mathcal{J}

is the smallest set satisfying

\mathcal{J} = \mathbb{V} \cup \text{Array} \cup \text{Object}

. This structure allows for the representation of arbitrarily nested, tree-shaped data.
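The recursive definition translates almost directly into a membership test. The following sketch assumes the usual mapping of JSON types onto Python types (None, bool, int, float, str, list, dict); the function name is_json is hypothetical.

```python
import math
from typing import Any


def is_json(value: Any) -> bool:
    """Return True if `value` belongs to the recursively defined set J."""
    # Primitive case: null, boolean, number (integer), string.
    if value is None or isinstance(value, (bool, int, str)):
        return True
    # Primitive case: number (float). JSON has no NaN or Infinity literal.
    if isinstance(value, float):
        return math.isfinite(value)
    # Array case: a finite ordered sequence whose elements are all in J.
    if isinstance(value, list):
        return all(is_json(v) for v in value)
    # Object case: string keys mapped to values that are all in J.
    if isinstance(value, dict):
        return all(isinstance(k, str) and is_json(v) for k, v in value.items())
    # Anything else (sets, tuples, custom objects) is outside J.
    return False
```

For example, a dictionary with an integer key or a value containing a Python set is rejected, because neither can be produced by the three construction rules.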

Axiom

BASE Properties

Let

\mathcal{S}_t

be the system state at time

t

, and

\mathcal{S}_{t+1}

be the state after an update operation

\omega

. The system adheres to eventual consistency if, for any two updates

\omega_1

and

\omega_2

applied at times

t_1

and

t_2

respectively, and provided that no further updates are issued, the state

\mathcal{S}_t

converges to a final state

\mathcal{S}_{\infty}

such that:

\lim_{t \to \infty} \text{Distance}(\mathcal{S}_t, \mathcal{S}_{\infty}) = 0

where

\text{Distance}(\cdot, \cdot)

is a metric defined over the state space. Furthermore, the system must maintain availability

A

such that for any non-failing node

n_i

, the operation

\text{Read}(n_i)

or

\text{Write}(n_i)

returns a result within finite time, regardless of the consistency level achieved at time

t

.

Theorem

Document-Oriented Storage

Let

\mathcal{D}

be the set of documents. A document

d \in \mathcal{D}

is modeled as a recursive structure,

d \equiv \text{Map}(K, V)

, where

K

is the set of keys and

V

is the value type. A value

v \in V

can be an atomic element (e.g., an element of

\mathbb{R}, \mathbb{S}

) or a nested document,

v \in \mathcal{D}

. Formally,

d

can be represented as a set of pairs

(k_i, v_i)

, where

v_i

is either atomic or itself adheres to the structure

v_i \equiv \text{Map}(K', V')

. This allows for arbitrary nesting and heterogeneous data types within a single record.
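To make the nesting concrete, the sketch below represents a document as a Python dictionary and provides two hypothetical helpers: get_path follows a sequence of keys into nested documents, and depth measures how deeply a document is nested. The sample order document is invented for illustration.

```python
from typing import Any, Sequence


def get_path(doc: dict, path: Sequence[str], default: Any = None) -> Any:
    """Follow `path` key by key through nested documents."""
    current: Any = doc
    for key in path:
        # Stop as soon as we hit an atomic value or a missing key.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def depth(value: Any) -> int:
    """Nesting depth: 0 for atomic values, 1 + max child depth for documents."""
    if not isinstance(value, dict):
        return 0
    return 1 + max((depth(v) for v in value.values()), default=0)


# A hypothetical document mixing atomic values and nested documents.
order = {
    "id": 42,
    "customer": {"name": "Ada", "address": {"city": "London"}},
    "total": 99.5,
}
```

Here get_path(order, ["customer", "address", "city"]) walks through two nested documents before reaching an atomic value, and a missing key yields the default instead of raising an error.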

Theorem

Eventual Consistency

Consider a distributed system with

N

replicas,

\mathcal{R} = \{r_1, \dots, r_N\}

, and a state variable

S

. Let

S_t(r_i)

be the state of replica

r_i

at time

t

. The system is eventually consistent if, for any write operation

W

applied at time

t_0

, the state

S_{t}(r_i)

converges to the final state

S_{\infty}

for all

r_i \in \mathcal{R}

as

t \to \infty

. Formally, there exists a time

T

such that for all

t \ge T

and all

r_i, r_j \in \mathcal{R}

,

S_t(r_i) = S_t(r_j) = S_{\infty}

. This convergence presupposes that no further writes are issued and that any network partition eventually heals.
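The convergence condition can be illustrated with a toy simulation. The sketch below assumes a last-writer-wins merge on (timestamp, value) pairs and a ring-shaped anti-entropy exchange; both are illustrative choices, not something the statement above prescribes, and all function names are hypothetical.

```python
from typing import List, Tuple

Versioned = Tuple[int, str]  # (logical timestamp, value)


def merge(a: Versioned, b: Versioned) -> Versioned:
    """Last-writer-wins: keep the version with the larger timestamp."""
    return max(a, b)


def anti_entropy_round(replicas: List[Versioned]) -> List[Versioned]:
    """Each replica r_i merges its state with that of its ring neighbor."""
    n = len(replicas)
    return [merge(replicas[i], replicas[(i + 1) % n]) for i in range(n)]


def rounds_to_converge(replicas: List[Versioned]) -> Tuple[int, List[Versioned]]:
    """Run exchange rounds until all replicas hold the same state."""
    rounds = 0
    while len(set(replicas)) > 1:
        replicas = anti_entropy_round(replicas)
        rounds += 1
    return rounds, replicas


# One write reaches replica r_3 only; no further writes are issued.
initial = [(0, "old"), (0, "old"), (1, "new"), (0, "old")]
rounds, final = rounds_to_converge(initial)
```

With four replicas on a ring, the newest version needs three rounds to reach every replica, after which all states are equal. If writes kept arriving, or if a replica stayed unreachable, the loop would have no such bound.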

Theorem

Horizontal Scalability

Consider a distributed database system

\mathcal{D}

composed of

N

nodes, where

\mathcal{N} = \{n_1, n_2, \dots, n_N\}

. Let

R

be the total data storage requirement and

T

be the required throughput. If the system is scaled horizontally, the capacity

C(N)

is modeled as a function of the number of nodes

N

and the load balancing efficiency

\eta \in [0, 1]

. The system achieves linear scalability if the total capacity

C(N)

satisfies:

C(N) \ge \sum_{i=1}^{N} c_i \cdot \eta \cdot L_{\text{max}}

where

c_i

is the capacity of node

n_i

, and

L_{\text{max}}

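is the maximum sustainable load per node, taken here as a fraction of that node's capacity so that the product remains a capacity. For perfect scalability, where the efficiency and load factors both approach 1,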

C(N) \approx N \cdot c_{\text{avg}}

.
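The bound can be evaluated numerically. The sketch below computes the right-hand side of the inequality exactly as stated, under the assumption that the maximum sustainable load is a dimensionless fraction of a node's capacity (defaulting to 1). The function names and the sample capacities are hypothetical.

```python
from typing import Sequence


def capacity_lower_bound(
    node_capacities: Sequence[float], eta: float, l_max: float = 1.0
) -> float:
    """Sum of c_i * eta * L_max over all nodes (the stated lower bound)."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    return sum(c * eta * l_max for c in node_capacities)


def scaling_efficiency(
    node_capacities: Sequence[float], eta: float, l_max: float = 1.0
) -> float:
    """Ratio of the bound to the ideal total N * c_avg = sum of c_i."""
    ideal = sum(node_capacities)
    return capacity_lower_bound(node_capacities, eta, l_max) / ideal


# Four and eight identical nodes with 90% load-balancing efficiency.
bound_4 = capacity_lower_bound([100.0] * 4, eta=0.9)
bound_8 = capacity_lower_bound([100.0] * 8, eta=0.9)
```

Doubling the node count doubles the bound as long as the efficiency stays fixed. In practice the efficiency tends to drop as nodes are added, which is where real systems fall short of linear scaling.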

Theorem

Distributed Data Management

Let

\mathcal{D}

be a distributed data store managing a dataset

\mathbf{D}

across

N

nodes. Durability is obtained by replicating

\mathbf{D}

using a replication factor

R

. To ensure fault tolerance against

f

failures (where

f < R

), a write operation

\text{Write}(\mathbf{d})

must be acknowledged by a write quorum

W

of nodes, and a read operation

\text{Read}(\mathbf{d})

must query a quorum

R'

of nodes. For strong consistency, the quorums must satisfy the condition:

W + R' > R

This ensures that every read quorum shares at least one node with every write quorum, so at least one queried replica holds the latest committed version of

\mathbf{d}

, which the reader can identify by its version number or timestamp.
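The quorum condition is easy to check mechanically. The sketch below pairs the arithmetic test with a brute-force enumeration confirming that it coincides with the overlap of every write quorum with every read quorum. The parameter names (w for the write quorum size, r_read for the read quorum size, r_factor for the replication factor) are assumptions chosen to avoid clashing with the symbols above.

```python
from itertools import combinations


def satisfies_quorum_condition(w: int, r_read: int, r_factor: int) -> bool:
    """Arithmetic form of the condition: W + R' > R."""
    return w + r_read > r_factor


def quorums_always_intersect(w: int, r_read: int, r_factor: int) -> bool:
    """Enumerate all write and read quorums and test each pair for overlap."""
    replicas = range(r_factor)
    return all(
        set(write_q) & set(read_q)
        for write_q in combinations(replicas, w)
        for read_q in combinations(replicas, r_read)
    )
```

With a replication factor of 3, choosing write and read quorums of size 2 satisfies the condition, while size 1 for both does not: a read may then contact a replica that the latest write never reached.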

Law

CAP Theorem

Consider a distributed system

\mathcal{S}

in which network partitions can occur. Let

C

be the property of strong consistency (all nodes see the same data at the same time),

A

be the property of availability (every request receives a non-error response), and

P

be the property of partition tolerance. The CAP Theorem states that if

P

holds, then

\mathcal{S}

cannot simultaneously guarantee both

C

and

A

. Formally,

P \implies \neg (C \land A)

. Consequently, while a partition lasts, the system must choose to prioritize either

C

(sacrificing

A

) or

A

(sacrificing

C

).

Principle

Key-Value Store Paradigm

Define the Key-Value Store

\mathcal{S}

as a partial function

\mathcal{S}: K \rightharpoonup V

, where

K

is the key space (domain) and

V

is the value space (codomain). The retrieval operation

\text{GET}(k)

is defined as

\mathcal{S}(k)

, and the update operation

\text{PUT}(k, v)

is defined as

\mathcal{S} \leftarrow (\mathcal{S} \setminus \{(k, v') \mid v' \in V\}) \cup \{(k, v)\}

. The efficiency relies on the ability to compute

\text{GET}(k)

in expected

O(1)

time complexity, independent of

|K|

.
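A minimal in-memory sketch of this paradigm follows, assuming a hash table (Python's dict) as the backing structure. The class and method names are hypothetical. Note that put replaces any existing pair for the key, so the store remains a function from keys to values.

```python
from typing import Any, Dict, Hashable, Optional


class KeyValueStore:
    """In-memory key-value store backed by a hash table."""

    def __init__(self) -> None:
        self._data: Dict[Hashable, Any] = {}

    def put(self, key: Hashable, value: Any) -> None:
        """PUT(k, v): bind k to v, replacing any previous binding."""
        self._data[key] = value

    def get(self, key: Hashable, default: Optional[Any] = None) -> Any:
        """GET(k): hash lookup, expected constant time in the store size."""
        return self._data.get(key, default)

    def __len__(self) -> int:
        return len(self._data)


store = KeyValueStore()
store.put("user:1", {"name": "Ada"})
store.put("user:1", {"name": "Ada Lovelace"})  # overwrites, does not duplicate
```

After the two writes the store holds a single entry for the key, and a lookup of an unknown key returns the default rather than raising an error.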

Principle

Polyglot Persistence

Let

\mathcal{M} = \{\mathcal{M}_1, \mathcal{M}_2, \dots, \mathcal{M}_k\}

be a finite set of data models (e.g., Document, Graph, Key-Value). Let the required access patterns form the set

\mathcal{P} = \{\mathbf{p}_1, \mathbf{p}_2, \dots, \mathbf{p}_m\}

, where each

\mathbf{p}_i

specifies a query type and associated data relationships. Define the optimal persistence strategy

\mathcal{S}: \mathcal{P} \to \mathcal{M}

such that for every

\mathbf{p}_i \in \mathcal{P}

, the selected model

\mathcal{M}_{\text{opt}} = \mathcal{S}(\mathbf{p}_i)

minimizes the cost function

C(\mathbf{p}_i, \mathcal{M}_{\text{opt}})

, where

C

quantifies the cost of executing

\mathbf{p}_i

within

\mathcal{M}_{\text{opt}}

relative to other models

\mathcal{M}_j \in \mathcal{M}

. The goal is to find

\mathcal{S}

such that

\sum_{i=1}^{m} C(\mathbf{p}_i, \mathcal{S}(\mathbf{p}_i))

is minimized.
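Because each access pattern contributes an independent term to the sum, the total is minimized by picking the cheapest model for each pattern separately. The sketch below does exactly that. The pattern names, model names, and cost figures are invented for illustration, and the cost function is assumed to be given as a lookup table.

```python
from typing import Dict, List, Tuple

Cost = Dict[Tuple[str, str], float]  # (pattern, model) -> cost


def optimal_strategy(patterns: List[str], models: List[str], cost: Cost) -> Dict[str, str]:
    """Map every pattern to the model with the lowest cost for that pattern."""
    return {p: min(models, key=lambda m: cost[(p, m)]) for p in patterns}


def total_cost(strategy: Dict[str, str], cost: Cost) -> float:
    """Sum of C(p_i, S(p_i)) over all patterns."""
    return sum(cost[(p, m)] for p, m in strategy.items())


MODELS = ["document", "graph", "key-value"]
PATTERNS = ["lookup_by_id", "friends_of_friends", "fetch_order_with_items"]

# Hypothetical relative costs; lower is better.
COST: Cost = {
    ("lookup_by_id", "document"): 2.0,
    ("lookup_by_id", "graph"): 5.0,
    ("lookup_by_id", "key-value"): 1.0,
    ("friends_of_friends", "document"): 8.0,
    ("friends_of_friends", "graph"): 1.5,
    ("friends_of_friends", "key-value"): 9.0,
    ("fetch_order_with_items", "document"): 1.0,
    ("fetch_order_with_items", "graph"): 4.0,
    ("fetch_order_with_items", "key-value"): 6.0,
}

strategy = optimal_strategy(PATTERNS, MODELS, COST)
```

This per-pattern choice ignores the operational cost of running several stores side by side. A fuller model would add a fixed charge per distinct model used, and the problem would then no longer decompose pattern by pattern.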

NoSQL

Sequence of Expressions

Schema-less Data Model

JSON Data Format

BASE Properties

Document-Oriented Storage

Eventual Consistency

Horizontal Scalability

Distributed Data Management

CAP Theorem

Key-Value Store Paradigm

Polyglot Persistence