Beta Phase: Square45 is currently in beta testing. Expect some features or content to be incomplete or missing.
45

NoSQL

Field: Databases

A broad class of database management systems that differ from the classic model of the relational database management system.

Sequence of Expressions

Let D\mathcal{D} be the set of all data records. A record dDd \in \mathcal{D} is defined as a map d:KVd: K \to V, where KK is the set of keys and VV is the set of values. The schema-less nature implies that for any two records d1,d2Dd_1, d_2 \in \mathcal{D}, the domain of keys K1=dom(d1)K_1 = \text{dom}(d_1) and K2=dom(d2)K_2 = \text{dom}(d_2) are not constrained to be equal, i.e., K1K2K_1 \neq K_2 is permitted, provided that d1d_1 and d2d_2 are both valid instances of the data type T\mathcal{T}.
Define the set of valid JSON structures J\mathcal{J} recursively over the following types: \begin{itemize} \item Primitive\text{Primitive}: The set of atomic values V={null,boolean,number,string}\mathbb{V} = \{\text{null}, \text{boolean}, \text{number}, \text{string}\}. \item Array\text{Array}: A finite ordered sequence of values, Array=v1,v2,,vk\text{Array} = \langle v_1, v_2, \dots, v_k \rangle, where viJv_i \in \mathcal{J}. \item Object\text{Object}: An unordered mapping from keys (strings) to values, Object={k1:v1,k2:v2,,km:vm}\text{Object} = \{k_1: v_1, k_2: v_2, \dots, k_m: v_m\}, where kistringk_i \in \text{string} and viJv_i \in \mathcal{J}. \end{itemize} The set J\mathcal{J} is the smallest set satisfying J=VArrayObject\mathcal{J} = \mathbb{V} \cup \text{Array} \cup \text{Object}. This structure allows for the representation of complex, nested data graphs.
Let St\mathcal{S}_t be the system state at time tt, and St+1\mathcal{S}_{t+1} be the state after an update operation ω\omega. The system adheres to eventual consistency if, for any two updates ω1\omega_1 and ω2\omega_2 applied at times t1t_1 and t2t_2 respectively, the state St\mathcal{S}_t converges to a final state S\mathcal{S}_{\infty} such that: limtDistance(St,S)=0\lim_{t \to \infty} \text{Distance}(\mathcal{S}_t, \mathcal{S}_{\infty}) = 0 where Distance(,)\text{Distance}(\cdot, \cdot) is a metric defined over the state space. Furthermore, the system must maintain availability AA such that for any non-failing node nin_i, the operation Read(ni)\text{Read}(n_i) or Write(ni)\text{Write}(n_i) returns a result within finite time, regardless of the consistency level achieved at time tt.
Let D\mathcal{D} be the set of documents. A document dDd \in \mathcal{D} is modeled as a recursive structure, dMap(K,V)d \equiv \text{Map}(K, V), where KK is the set of keys and VV is the value type. The value type VV can be an atomic element (e.g., R,S\mathbb{R}, \mathbb{S}) or a nested document, VDV \in \mathcal{D}. Formally, dd can be represented as a set of pairs (ki,vi)(k_i, v_i), where viv_i itself adheres to the structure viMap(K,V)v_i \equiv \text{Map}(K', V'). This allows for arbitrary nesting and heterogeneous data types within a single record.
Consider a distributed system with NN replicas, R={r1,,rN}\mathcal{R} = \{r_1, \dots, r_N\}, and a state variable SS. Let St(ri)S_t(r_i) be the state of replica rir_i at time tt. The system is eventually consistent if, for any write operation WW applied at time t0t_0, the state St(ri)S_{t}(r_i) converges to the final state SS_{\infty} for all riRr_i \in \mathcal{R} as tt \to \infty. Formally, there exists a time TT such that for all tTt \ge T and all ri,rjRr_i, r_j \in \mathcal{R}, St(ri)=St(rj)=SS_t(r_i) = S_t(r_j) = S_{\infty}. This convergence is guaranteed despite potential network partitions.
Consider a distributed database system D\mathcal{D} composed of NN nodes, where N={n1,n2,,nN}N = \{n_1, n_2, \dots, n_N\}. Let RR be the total data storage requirement and TT be the required throughput. If the system is scaled horizontally, the capacity C(D)C(\mathcal{D}) is modeled as a function of the number of nodes NN and the load balancing efficiency η[0,1]\eta \in [0, 1]. The system achieves linear scalability if the total capacity C(N)C(N) satisfies: C(N)i=1NciηLmaxC(N) \ge \sum_{i=1}^{N} c_i \cdot \eta \cdot L_{\text{max}} where cic_i is the capacity of node nin_i, and LmaxL_{\text{max}} is the maximum sustainable load per node. For perfect scalability, C(N)NcavgC(N) \approx N \cdot c_{\text{avg}}.
Let D\mathcal{D} be a distributed data store managing a dataset D\mathbf{D} across NN nodes. Data consistency is maintained by replicating D\mathbf{D} using a replication factor RR. To ensure fault tolerance against ff failures (where f<Rf < R), a write operation Write(d)\text{Write}(\mathbf{d}) must achieve consensus by successfully committing the update to a quorum WW of nodes, and a read operation Read(d)\text{Read}(\mathbf{d}) must query a quorum RR' of nodes. For strong consistency, the quorums must satisfy the condition: W+R>RW + R' > R This ensures that the intersection of nodes queried for writing and reading is non-empty, guaranteeing that the latest committed version of d\mathbf{d} is always retrieved.
Consider a distributed system S\mathcal{S} subject to a network partition PP. Let CC be the property of strong consistency (all nodes see the same data at the same time), AA be the property of availability (every request receives a non-error response), and PP be the property of partition tolerance. The CAP Theorem states that if PP holds, then S\mathcal{S} cannot simultaneously guarantee both CC and AA. Mathematically, if PP is true, then the following logical implication must hold: ¬(CA)\neg (C \land A). Consequently, the system must choose to prioritize either CC (sacrificing AA) or AA (sacrificing CC).
Define the Key-Value Store S\mathcal{S} as a function S:KV\mathcal{S}: K \to V, where KK is the key space (domain) and VV is the value space (codomain). The retrieval operation GET(k)\text{GET}(k) is defined as S(k)\mathcal{S}(k), and the update operation PUT(k,v)\text{PUT}(k, v) is defined as SS{(k,v)}\mathcal{S} \leftarrow \mathcal{S} \cup \{(k, v)\}. The efficiency relies on the ability to compute GET(k)\text{GET}(k) in O(1)O(1) time complexity, independent of K|K|.
Let M={M1,M2,,Mk}\mathcal{M} = \{\mathcal{M}_1, \mathcal{M}_2, \dots, \mathcal{M}_k\} be a finite set of data models (e.g., Document, Graph, Key-Value). Given a set of required access patterns P={p1,p2,,pm}\mathcal{P} = \{\mathbf{p}_1, \mathbf{p}_2, \dots, \mathbf{p}_m\}, where each pi\mathbf{p}_i specifies a query type and associated data relationships. Define the optimal persistence strategy S:PM\mathcal{S}: \mathcal{P} \to \mathcal{M} such that for every piP\mathbf{p}_i \in \mathcal{P}, the selected model Mopt=S(pi)\mathcal{M}_{\text{opt}} = \mathcal{S}(\mathbf{p}_i) minimizes the cost function C(pi,Mopt)C(\mathbf{p}_i, \mathcal{M}_{\text{opt}}), where CC quantifies the efficiency of executing pi\mathbf{p}_i within Mopt\mathcal{M}_{\text{opt}} relative to other models MjM\mathcal{M}_j \in \mathcal{M}. The goal is to find S\mathcal{S} such that i=1mC(pi,S(pi))\sum_{i=1}^{m} C(\mathbf{p}_i, \mathcal{S}(\mathbf{p}_i)) is minimized.